Sequence analysis and genetic diversity estimation
The sequences of 18 populations obtained from Sanger sequencing were
aligned and edited in SeqMan 7.1.0. For each population, the nucleotide
diversity (π) and DNA polymorphism (Watterson’s θ) were calculated using
DnaSP 5.10 (Librado & Rozas, 2009).
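For readers unfamiliar with these two statistics, the following is a minimal sketch of how per-site π and Watterson's θ can be computed from an alignment. It is not DnaSP's implementation. The function names are hypothetical, and the sketch assumes equal-length aligned sequences with no gaps or missing data.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Per-site pi: mean number of pairwise differences divided by alignment length.

    Assumes equal-length aligned sequences without gaps (an illustrative simplification).
    """
    n = len(seqs)
    length = len(seqs[0])
    # Count mismatching positions for every pair of sequences.
    total_diffs = sum(
        sum(1 for x, y in zip(a, b) if x != y) for a, b in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) / 2
    return total_diffs / n_pairs / length


def watterson_theta(seqs):
    """Per-site Watterson's theta: S / a1 / L, where a1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    length = len(seqs[0])
    # A site is segregating if more than one state is observed in its column.
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    a1 = sum(1 / i for i in range(1, n))
    return segregating / a1 / length


if __name__ == "__main__":
    toy_alignment = ["AAAA", "AAAT", "AATT", "AAAA"]  # toy data, not from the study
    print(nucleotide_diversity(toy_alignment))
    print(watterson_theta(toy_alignment))
```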
We obtained the reference sequences of the 93 genes by sequencing one A. corniculatum individual using the Sanger method (see the
supplementary file of He et al., 2019). The reference sequences ranged
in length from 203 to 2422 bp. The short reads produced by Illumina
sequencing were mapped to the reference sequences using MAQ 0.7.1 (H. Li,
Ruan, & Durbin, 2008) with the following parameters: the mutation rate
between the reference and read was 0.002, the threshold for the sum of
mismatch base qualities was 200, and the minimum mapping quality of
the reads was 30. To exclude false-positive mismatches, we calculated the
mismatch rate at each position along the read and the mismatch rate for
each base quality score. We trimmed the first and last 10 bases of each read
and filtered out bases with a quality score below 20. Single
nucleotide polymorphisms (SNPs) were also identified using MAQ 0.7.1 (H.
Li et al., 2008). To avoid introducing bias from sequencing errors, we
discarded, in each population, sites with insufficient coverage (<100
reads) and sites with a minor allele frequency below 1/(2N), where N is
the number of individuals (Z. He et al., 2013). The
allele frequencies for each SNP site in a population were obtained by
counting the depth of each allele.
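The site filter and the depth-based allele frequencies described above can be sketched as follows. This is an illustration only, not the pipeline used in the study. The function name and input layout are hypothetical. The sketch also assumes that the minor allele is the second most abundant allele at a site, and that a site with a single observed allele is not retained as a SNP.

```python
def filter_snp(allele_depths, n_individuals, min_coverage=100):
    """Return allele frequencies for one site in one population, or None if discarded.

    allele_depths: dict mapping base -> read depth (hypothetical input layout).
    n_individuals: N, the number of individuals in the population.
    """
    coverage = sum(allele_depths.values())
    # Discard sites covered by fewer than min_coverage reads.
    if coverage < min_coverage:
        return None

    # Allele frequency = depth of the allele / total depth at the site.
    freqs = {base: depth / coverage for base, depth in allele_depths.items() if depth > 0}

    # Assumption: the minor allele is the second most frequent allele;
    # with only one allele observed, its frequency is taken as 0.
    ranked = sorted(freqs.values(), reverse=True)
    minor_freq = ranked[1] if len(ranked) > 1 else 0.0

    # Discard sites whose minor allele frequency is below 1/(2N).
    if minor_freq < 1 / (2 * n_individuals):
        return None
    return freqs


if __name__ == "__main__":
    print(filter_snp({"A": 90, "G": 30}, n_individuals=50))   # retained
    print(filter_snp({"A": 50, "G": 30}, n_individuals=50))   # coverage too low
    print(filter_snp({"A": 999, "G": 1}, n_individuals=20))   # minor allele too rare
```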
For the Illumina data, we estimated the nucleotide polymorphism
(Watterson’s θ) of each gene using the method of Z. He et al. (2013).
The nucleotide diversity (π) of each gene was estimated
according to Nei’s formula (Nei, 1987) with an in-house script. To
estimate absolute genetic divergence between populations, we computed
pairwise DXY following the formula of Nei and Li (1979).
Pairwise DXY values were
summed over all SNPs, and the sum was normalized by effective sequence
length. For each pair of populations, the effective sequence length was
defined by sites without missing data in either population. We also
estimated Wright’s F statistics (FST) (Wright,
1950) with these data.
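As an illustration of the DXY calculation, the sketch below sums the per-site probability that two alleles, one drawn from each population, differ, and then divides by the effective sequence length. It is not the in-house script. The function name and the input layout (site position mapped to allele frequencies) are hypothetical, and the effective length is assumed to have been counted beforehand as the number of sites without missing data in either population.

```python
def pairwise_dxy(freqs_x, freqs_y, effective_length):
    """Absolute divergence between two populations from per-site allele frequencies.

    freqs_x, freqs_y: dict mapping site -> {allele: frequency} (hypothetical layout).
    effective_length: number of sites without missing data in either population.
    """
    # Only sites present in both populations contribute to the sum.
    shared_sites = freqs_x.keys() & freqs_y.keys()
    total = 0.0
    for site in shared_sites:
        fx, fy = freqs_x[site], freqs_y[site]
        # Probability that the two sampled alleles are identical.
        p_same = sum(freq * fy.get(allele, 0.0) for allele, freq in fx.items())
        # Per-site DXY is the probability that they differ.
        total += 1.0 - p_same
    return total / effective_length


if __name__ == "__main__":
    # Toy frequencies, not data from the study.
    pop_x = {10: {"A": 1.0}, 25: {"A": 0.5, "G": 0.5}}
    pop_y = {10: {"G": 1.0}, 25: {"A": 0.5, "G": 0.5}}
    print(pairwise_dxy(pop_x, pop_y, effective_length=1000))
```

Sites that are monomorphic and identical in both populations contribute zero to the sum, so only SNP sites need to be listed in the input, provided they are still counted in the effective length.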