Inferring geographical barriers
To identify the biogeographical boundaries that exhibit the largest
genetic discontinuities between population pairs,
we
used the FST matrix as the distance matrix to
calculate the Monmonier maximum difference by BARRIER 2.2 (Manni,
Guérard, & Heyer, 2004). To ensure robustness, we randomly selected 30
genes from the 93 genes to accumulate one FSTmatrix. We repeated this process 100 times and obtained 100FST matrices. The robustness of each barrier was
assessed by bootstrapping over the 100 matrices of genetic
differentiation.
Mantel tests of FST against geographic distance
were performed to test the isolation by distance model. We used the
spheric distance of pairwise sampling sites to approximate geographic
distance. The test was conducted with 1000 permutations and at two
levels: all populations, and populations of a genetic group.
Demographic
history simulation
The populations were distinctly structured, basing on the abovementioned
analyses. The populations in the southern South China Sea (Lineage 1,
including Chaiya and Kuching) are closely related to those in
Australasia (Lineage 2, including U-Daintree, Darwin, and Sorong)
instead of those geographically surrounding them (Lineage 3, including
Sanya, Wenchang, Yalong, La-un, Ngao, and Bali). To examine how the
populations in the southern South China Sea originated, we built 12
models and used the approximate Bayesian computation (ABC) method to
choose the optimal model. The 12 models were first different in
population topology: (1) Lineage 2 diverged with Lineage 3 first and
then Lineage 1 diverged from Lineage 3 (Models 1-4); (2) Lineage 2
diverged with Lineage 3 first and then Lineage 1 diverged from Lineage 2
(Models 5-8); and (3) Lineage 2 diverged with Lineage 3 first and then
Lineage 1 derived from an admixture of Lineage 2 and Lineage 3 (Models
9-12). Within each topology, the four models differed in population
reduction, population bottleneck, population reduction with gene flow,
and population bottleneck with gene
flow.
Simulated sequences under these models were produced by ms software
(Hudson, 2002). To reduce the complexity of the parameters, we derived
population size parameters from a single N0.
N0 was randomly picked from the prior distribution and
assigned to Lineage 1 as the baseline. The population size of the other
two lineages
(Nx)
is equal to θx/θ0, where
θx and θ0 are the observed θ of the
current and baseline
lineages,
respectively. We performed 1×105 simulations using the
ms program for each model (Hudson, 2002). Eighty loci of 1000 base pairs
were simulated in each run with the mutation rate set at
4.06x10-8 per generation per bp. The mutation rate was
estimated from phylogenomic comparisons to closely related species (Z.
He et al., 2019). The sample size of each group was set equal to the
corresponding cluster pooling in multiple populations. Uniform prior
distributions were set for the demographic parameters, with
corresponding parameters in different models having an identical prior
range (Table S1 in supplementary file).
For each simulation, we calculated 18 summary statistics, including
Watterson’s estimator (θ), nucleotide polymorphism (π), segregating
sites (S), and Tajima’s D within each group andDXY and FST for each pair
of groups. To perform model selection, we calculated Euclidean distances
by comparing the summary statistics of the simulated and observed
sequences. We set the tolerance as 0.05 to retain the simulated data,
and then the approximate Bayesian computation schema (Beaumont, Zhang,
& Balding, 2002) was used to estimate the Bayesian posterior
probabilities of each mode using the “abc” package in R (Csilléry,
François, & Blum, 2012). The “postpr” function and “neuralnet”
option were used. With the optimal model selected, the “neuralnet”
method in the “abc” R package was again used to estimate the posterior
distribution of demographic parameters.