Inferring geographical barriers
To identify the biogeographical boundaries that exhibit the largest genetic discontinuities between population pairs, we used the FST matrix as the distance matrix to calculate the Monmonier maximum difference by BARRIER 2.2 (Manni, Guérard, & Heyer, 2004). To ensure robustness, we randomly selected 30 genes from the 93 genes to accumulate one FSTmatrix. We repeated this process 100 times and obtained 100FST matrices. The robustness of each barrier was assessed by bootstrapping over the 100 matrices of genetic differentiation.
Mantel tests of FST against geographic distance were performed to test the isolation by distance model. We used the spheric distance of pairwise sampling sites to approximate geographic distance. The test was conducted with 1000 permutations and at two levels: all populations, and populations of a genetic group.
Demographic history simulation
The populations were distinctly structured, basing on the abovementioned analyses. The populations in the southern South China Sea (Lineage 1, including Chaiya and Kuching) are closely related to those in Australasia (Lineage 2, including U-Daintree, Darwin, and Sorong) instead of those geographically surrounding them (Lineage 3, including Sanya, Wenchang, Yalong, La-un, Ngao, and Bali). To examine how the populations in the southern South China Sea originated, we built 12 models and used the approximate Bayesian computation (ABC) method to choose the optimal model. The 12 models were first different in population topology: (1) Lineage 2 diverged with Lineage 3 first and then Lineage 1 diverged from Lineage 3 (Models 1-4); (2) Lineage 2 diverged with Lineage 3 first and then Lineage 1 diverged from Lineage 2 (Models 5-8); and (3) Lineage 2 diverged with Lineage 3 first and then Lineage 1 derived from an admixture of Lineage 2 and Lineage 3 (Models 9-12). Within each topology, the four models differed in population reduction, population bottleneck, population reduction with gene flow, and population bottleneck with gene flow.
Simulated sequences under these models were produced by ms software (Hudson, 2002). To reduce the complexity of the parameters, we derived population size parameters from a single N0. N0 was randomly picked from the prior distribution and assigned to Lineage 1 as the baseline. The population size of the other two lineages (Nx) is equal to θx0, where θx and θ0 are the observed θ of the current and baseline lineages, respectively. We performed 1×105 simulations using the ms program for each model (Hudson, 2002). Eighty loci of 1000 base pairs were simulated in each run with the mutation rate set at 4.06x10-8 per generation per bp. The mutation rate was estimated from phylogenomic comparisons to closely related species (Z. He et al., 2019). The sample size of each group was set equal to the corresponding cluster pooling in multiple populations. Uniform prior distributions were set for the demographic parameters, with corresponding parameters in different models having an identical prior range (Table S1 in supplementary file).
For each simulation, we calculated 18 summary statistics, including Watterson’s estimator (θ), nucleotide polymorphism (π), segregating sites (S), and Tajima’s D within each group andDXY and FST for each pair of groups. To perform model selection, we calculated Euclidean distances by comparing the summary statistics of the simulated and observed sequences. We set the tolerance as 0.05 to retain the simulated data, and then the approximate Bayesian computation schema (Beaumont, Zhang, & Balding, 2002) was used to estimate the Bayesian posterior probabilities of each mode using the “abc” package in R (Csilléry, François, & Blum, 2012). The “postpr” function and “neuralnet” option were used. With the optimal model selected, the “neuralnet” method in the “abc” R package was again used to estimate the posterior distribution of demographic parameters.