Joint Site Frequency Spectra
In the remainder of our analyses, we study our data using Joint Site Frequency Spectra (jSFS) because it summarizes much of the genomic data into one statistic that can be used for inference (Gutenkunst et al. 2009, Xu & Hickerson 2015). The jSFS is essentially the overlap in the distribution in frequency of alleles between two populations and the pattern of overlap can tell us a lot about demographic processes in the past, both analytically (Gutenkunst et al. 2009, Excoffieret al. 2013) and visually (Fig. 2). More specifically, a single SFS represents the distribution of the number of sites that are present at each of the N allele frequencies in the population, whereN is equal to the number of total chromosomes present in the population. For a diploid organism, this is twice the number of individuals. A joint SFS is then the combination of two SFS, one for each population, as a matrix that is (Npop1 + 1) by (Npop2 + 1) cells. Each row is one of the allele frequencies in the first population, beginning with 0 and then ranging from \(\frac{1}{N_{pop1}}\) to Npop1 and each column is the allele frequencies in the second population, again beginning with 0 and ranging from \(\frac{1}{N_{pop2}}\) toNpop2 . Each cell then indicates the number of sites at that corresponding allele frequency in both populations. If the entire jSFS is standardized by the total number of sites, each cell indicates the proportion of sites at the corresponding population allele frequencies. The first row and column correspond to the sites that are at given frequencies in one population while not present at all in the other population, referred to hereafter as the “0” rows and columns. Again, these indicate the variants present in one population and not the other, thus the cell at row “0” and column “0” indicates the sites that do not vary in either population. With SNP data and for demographic model selection, this cell is not typically considered because it is only relevant for scaling the proportion of invariant sites for parameter estimates. Thus, when estimating demographic parameters from these models, the monomorphic cell along with linked SNPs is needed to inform the composite likelihood of the models (Excoffier et al.2013).
There is a trade-off between the number of chromosomes that can be included from each population and the number of unlinked SNPs included in the jSFS because the jSFS cannot accommodate missing data. The missing data are generally due to random missing data and allelic dropout from reduced representation sequencing (Andrews et al.2016), where loci are not represented across all or even a majority of individuals in the population. Thus, the more samples per population included, the fewer SNPs there are to sample from to construct the jSFS. For this reason, we down-sampled the number of SNPs and alleles (chromosomes in the population) to construct three different jSFS datasets for each species using custom python scripts (github.com/isaacovercast/easySFS). We enforced a different number of alleles to be included per population, which resulted in a different number of unlinked SNPs being sampled in each dataset (Table 1). These datasets thus represent a spectrum of genomic information ranging from more individuals in the population but fewer SNPs to fewer individuals represented from the populations with many more SNPs included. We used unlinked SNPs for model selection (see below) to satisfy the assumption that SNPs are independent. We subsampled 100 different observed jSFS for each of the sample sizes for each of the species (600 observed jSFS in total) and masked monomorphic sites in all jSFS. For parameter estimation using the jSFS, we use the full SNP dataset, meaning we included linked SNPs in the construction of the jSFS. We also considered the monomorphic cell in the jSFS when estimating parameters because this cell provides information important to scale the invariant sites in the genome. To calculate the monomorphic cell, we measured the ratio of monomorphic sites and polymorphic sites in our entire datasets for each species and then used those ratios, multiplied by the total number of biallelic SNPs used in the empirical jSFS.