Analytical Considerations
The limitations of using the jSFS to summarize genomic data should be recognized. As described above, when we summarized our data into a single jSFS, we downsampled the data so that every SNP in the jSFS was scored in every individual included. Doing so required us to forfeit a considerable amount of data (Table 2). We assessed the sensitivity of our inference to this choice by constructing datasets of three different sizes, with 100 jSFS each, for each species, Thuja plicata and Tsuga heterophylla (Table 2), which resulted in 100 model predictions per species per dataset (Fig. 5). The largest discrepancy in the entire inference is within the Tsuga heterophylla predictions, where the jSFS that included more individuals and fewer SNPs and the jSFS with the most SNPs and fewest individuals support different demographic histories (Fig. 5). While the two supported models are generally consistent with our overall inference of pre-Pleistocene divergence followed by secondary contact, they differ in the presence of a population bottleneck during the Pleistocene and subsequent population expansion after the last glacial retreat. The difference in model support for western hemlock across downsampling regimes could arise because the dataset with more SNPs allows the bottleneck and expansion parameters to be estimated more effectively and therefore shows strong support for that model, whereas the dataset with fewer SNPs may simply contain too little information to estimate those parameters. For the purposes of our study, more data are not necessary, but they may be needed for future, more precise inference of demographic parameters.
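The trade-off between the number of individuals and the number of SNPs retained can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names (downsample_site, build_jsfs), the toy genotypes, and the treatment of each sample as a single allele copy are all assumptions made for illustration. A site is kept only if each population has at least the target number of called alleles, so raising the target lowers the number of SNPs that enter the jSFS.

```python
import random

def downsample_site(alleles, k, rng):
    """Draw k called alleles without replacement and return the derived-allele
    count, or None when fewer than k alleles were called at this site."""
    called = [a for a in alleles if a is not None]
    if len(called) < k:
        return None
    return sum(rng.sample(called, k))

def build_jsfs(sites, k1, k2, seed=0):
    """Build a (k1+1) x (k2+1) joint SFS from sites given as
    (pop1_alleles, pop2_alleles); alleles are 0 (ancestral), 1 (derived),
    or None (missing). Returns the jSFS and the number of sites kept."""
    rng = random.Random(seed)
    jsfs = [[0] * (k2 + 1) for _ in range(k1 + 1)]
    kept = 0
    for pop1, pop2 in sites:
        c1 = downsample_site(pop1, k1, rng)
        c2 = downsample_site(pop2, k2, rng)
        if c1 is None or c2 is None:
            continue  # site dropped: too few called alleles in a population
        jsfs[c1][c2] += 1  # sites monomorphic after downsampling fall in the corner cells
        kept += 1
    return jsfs, kept

# Hypothetical toy data: three SNPs, four allele copies per population.
sites = [
    ([0, 1, 1, None], [0, 0, None, None]),
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([None, None, 1, 0], [1, 1, 1, None]),
]

for k in (2, 3, 4):
    _, kept = build_jsfs(sites, k, k)
    print(f"target of {k} alleles per population: {kept} SNPs kept")
```

In this toy example, a stricter downsampling target never increases the number of SNPs kept, which mirrors the loss of data reported in Table 2.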
Our model-selection procedure supports, for both western redcedar and western hemlock, a pre-Pleistocene divergence event followed by secondary gene flow between the populations. The approach used here for model selection using Random Forest and the jSFS (Smith et al. 2017, Smith & Carstens 2020) had yet to be tested using plant species or demographic models of this complexity. This is a likelihood-free approach that is based on simulating allelic data while accounting for coalescent stochasticity and demographic processes. Model selection, both in general and when implemented with machine learning as employed here, is only as accurate as the data are distinct in model space. This means that we should be able to assess whether the empirical data are insufficient to distinguish among these models, which is indicated by the classifier's error rates. Indeed, our simulations indicated that the genomic signatures of the class of four recent dispersal models (Models H-K, Fig. 2) are not differentiable from each other (Fig. S4). This is most likely because the recent divergence time between the coastal and inland populations provides low resolution of the distinct migration patterns under which those models are simulated. However, all hypotheses positing post-Pleistocene dispersal, regardless of migration pattern, are well differentiated from those positing a pre-Pleistocene dispersal, or specifically the persistence of disjunct coastal and inland populations through the Pleistocene (Fig. S4). Additionally, when we pool the recent dispersal models into a general recent dispersal model, the error rates in all of our classifiers are extremely low, indicating high confidence in our classifier and high information content in the data with respect to distinguishing among the final eight demographic scenarios we propose (Fig. 4).
Again, the power of the machine learning classifier depends on how distinct the data are in model space. Because models can range from very simple to very complex, model complexity also influences the power of the classifier.
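The effect of pooling poorly differentiated models on classifier error, as described above, can be sketched from a confusion matrix of simulated (true) versus predicted model labels. The counts below are hypothetical and are not results from this study; the label "A" stands in for a single pre-Pleistocene model, and the helper names (class_errors, pool_classes) are assumptions introduced for illustration only.

```python
def class_errors(confusion, labels):
    """Per-class misclassification rate; rows are true models, columns are predictions."""
    errors = {}
    for i, label in enumerate(labels):
        total = sum(confusion[i])
        errors[label] = 1 - confusion[i][i] / total if total else float("nan")
    return errors

def pool_classes(confusion, labels, group, pooled_name):
    """Merge the rows and columns of every label in `group` into one pooled class."""
    new_labels = [l for l in labels if l not in group] + [pooled_name]
    index = {l: new_labels.index(pooled_name if l in group else l) for l in labels}
    pooled = [[0] * len(new_labels) for _ in new_labels]
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            pooled[index[true_label]][index[pred_label]] += confusion[i][j]
    return pooled, new_labels

# Hypothetical counts (100 simulations per model), NOT the study's results.
labels = ["A", "H", "I", "J", "K"]
confusion = [
    [98, 1, 1, 0, 0],     # A: rarely confused with the recent dispersal models
    [1, 30, 25, 22, 22],  # H-K: mostly confused with one another
    [0, 24, 28, 25, 23],
    [1, 26, 22, 27, 24],
    [0, 23, 26, 24, 27],
]

before = class_errors(confusion, labels)
pooled, pooled_labels = pool_classes(confusion, labels, {"H", "I", "J", "K"}, "recent")
after = class_errors(pooled, pooled_labels)
print("error before pooling:", before)
print("error after pooling: ", after)
```

Because the confusion among Models H-K falls within the pooled class, it no longer counts as error, and only misclassification between the pooled class and the remaining models contributes to the error rate.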
The approach employed here provides flexibility in the design of demographic models and the simulation of data, as well as computational efficiency. As is true for all inferences based on model selection, it remains possible that some as yet unexamined model may be a better description of the true evolutionary history of these taxa, perhaps one that includes more than two populations and therefore more complex divergence and migration scenarios. Nevertheless, the approach to inference that we have adopted here (i.e., developing models that are derived from extrinsic information such as pollen records and climate data, collecting genomic data, ranking models, assessing their identifiability, and making inferences) is an extremely powerful framework for phylogeographic research. In contrast to approaches that base inference on methods designed for data exploration, our approach uses existing data to formulate hypotheses that can then be supported (or not) as new data are collected and analyzed.