Analytical Considerations
The limitations to the use of the jSFS to summarize genomic data should
be recognized. As described above, when we summarized our data into a
single jSFS, we downsampled the data so that every retained SNP was
present in every individual included in the jSFS. Doing so required us
to forfeit a considerable amount of data (Table 2). We performed a sensitivity
analysis on the use of the jSFS by constructing datasets of three
different sizes, each comprising 100 jSFS, for each species, Thuja
plicata and Tsuga heterophylla (Table 2), which resulted in 100 model
predictions per species, per dataset (Fig. 5). The largest discrepancy
in the entire inference is within the Tsuga heterophylla predictions,
where the jSFS that included more individuals and fewer SNPs support a
different demographic history than the jSFS with the most SNPs and
fewest individuals (Fig. 5). While the two models
supported are generally consistent with our overall inference of
pre-Pleistocene divergence followed by secondary contact, they differ in
the presence of a population bottleneck during the Pleistocene and
subsequent population expansion after the last glacial retreat. The
difference in the model support for western hemlock across downsampling
regimes could be due to the dataset with more SNPs being able to
estimate the bottleneck and expansion parameters more effectively, and
therefore showing strong support for that model. Conversely, the
information in the dataset with fewer SNPs may have just been
insufficient to estimate those parameters. For the purposes of our study,
more data are not necessary, but more data may be required for more
precise demographic parameter inference in future work.
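
The trade-off between individuals and SNPs described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `downsample_sfs`, the toy genotype matrix, and the single-population, haploid 0/1 coding are all assumptions made for illustration. A SNP is kept only if at least `n_keep` samples have a call, and `n_keep` calls are then drawn at random to build a one-dimensional SFS.

```python
import random


def downsample_sfs(calls, n_keep, seed=0):
    """Build a 1D SFS after downsampling to n_keep called samples per SNP.

    calls  : list of SNPs; each SNP is a list of haploid calls
             (0 = ancestral, 1 = derived, None = missing). Hypothetical format.
    n_keep : number of non-missing calls required (and sampled) per SNP.
    Returns (sfs, kept), where sfs[k] counts SNPs with k derived alleles
    among the sampled calls and kept is the number of SNPs retained.
    """
    rng = random.Random(seed)
    sfs = [0] * (n_keep + 1)
    kept = 0
    for snp in calls:
        present = [c for c in snp if c is not None]
        if len(present) < n_keep:
            continue  # too much missing data: the SNP is forfeited
        sampled = rng.sample(present, n_keep)
        sfs[sum(sampled)] += 1  # bins 0 and n_keep hold SNPs that look monomorphic
        kept += 1
    return sfs, kept


# Toy data (invented): 4 SNPs scored in 4 haploid samples.
toy = [
    [0, 1, 1, None],
    [1, None, None, 0],
    [0, 0, 1, 1],
    [1, 1, None, 0],
]

for n in (2, 3, 4):
    sfs, kept = downsample_sfs(toy, n)
    print(f"n_keep={n}: {kept} SNPs retained")
```

In this toy matrix, requiring more samples per SNP lowers the number of SNPs retained, which is the same tension that motivates comparing several downsampling regimes.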
Our model-selection procedure supports, for both western redcedar and
western hemlock, a pre-Pleistocene divergence event, followed by
secondary gene flow between the populations. The approach used here for
model selection using Random Forest and the jSFS (Smith et al. 2017,
Smith & Carstens 2020) had yet to be tested using plant species or
demographic models of this complexity. This is a likelihood-free
approach that is based on simulating allelic data while accounting for
coalescent stochasticity and demographic processes. Model selection,
both in general and when implemented with machine learning as employed
here, is only as accurate as the data are distinct in model space. This
means that we can assess whether the data are sufficient to distinguish
among these models by examining the classifier’s error rates. Indeed,
our simulations indicated that the genomic signatures of the class of
four recent dispersal models (Models H-K, Fig. 2) are not
differentiable from each other (Fig. S4). This is
most likely due to the recent divergence time between the coast and
inland populations resulting in low resolution of the distinct migration
patterns those models are simulated under. However, all hypotheses
positing post-Pleistocene dispersal, regardless of migration pattern,
are well-differentiated from those positing a pre-Pleistocene dispersal,
or specifically the persistence of disjunct coastal and inland
populations through the Pleistocene (Fig. S4). Additionally, when we pool
the recent dispersal models into a single general recent dispersal
model, the error rates of all of our classifiers are extremely low,
indicating high confidence in our classifier and high information
content in the data with respect to distinguishing among the final eight
demographic scenarios we propose (Fig. 4). Again, the power of the
machine learning classifier depends on how distinct the data are in
model space, which is itself influenced by the complexity of the models
being compared.
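
The effect of pooling indistinguishable models on classifier error can be sketched with a small confusion-matrix calculation. This is an illustration under stated assumptions, not the analysis reported here: the counts are invented, the class labels (`H`, `I`, `old`, `recent`) are placeholders, and the helper names `class_error` and `pool_classes` are hypothetical.

```python
def class_error(conf):
    """Return (per-class error rates, overall error) from a confusion matrix.

    conf[true][pred] holds the number of simulated datasets generated under
    model `true` that the classifier assigned to model `pred`.
    """
    per_class = {}
    total = correct = 0
    for true, row in conf.items():
        n = sum(row.values())
        hit = row.get(true, 0)
        per_class[true] = 1 - hit / n
        total += n
        correct += hit
    return per_class, 1 - correct / total


def pool_classes(conf, mapping):
    """Merge classes according to mapping (old label -> pooled label)."""
    pooled = {}
    for true, row in conf.items():
        t = mapping.get(true, true)
        for pred, count in row.items():
            p = mapping.get(pred, pred)
            pooled.setdefault(t, {})
            pooled[t][p] = pooled[t].get(p, 0) + count
    return pooled


# Invented counts: two recent-dispersal models that are confused with each
# other, and one older-divergence model that is rarely confused with either.
toy_conf = {
    "H":   {"H": 50, "I": 45, "old": 5},
    "I":   {"H": 48, "I": 47, "old": 5},
    "old": {"H": 2,  "I": 3,  "old": 95},
}

_, err_before = class_error(toy_conf)
_, err_after = class_error(pool_classes(toy_conf, {"H": "recent", "I": "recent"}))
print(f"overall error before pooling: {err_before:.2f}")
print(f"overall error after pooling:  {err_after:.2f}")
```

Misclassifications between the pooled models no longer count as errors after merging, so the remaining error reflects only confusion between the recent and older scenarios.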
The approach employed here provides flexibility in the design of
demographic models and the simulation of data, as well as computational
efficiency. As is true for all inferences based on model selection, it
remains possible that some as yet unexamined model may be a better
description of the true evolutionary history of these taxa, particularly
one that includes more than two populations and therefore more complex
divergence and migration scenarios. Nevertheless, the approach to inference that we
have adopted here (i.e., developing models that are derived from
extrinsic information such as pollen records and climate data,
collecting genomic data, ranking models, assessing their
identifiability, and making inferences) is an extremely powerful
framework for phylogeographic research. In contrast to approaches that
base inference on methods designed for data exploration, our approach
to inference utilizes existing data to formulate hypotheses that can
then be supported (or not) as new data are collected and analyzed.