Joint Site Frequency Spectra
In the remainder of our analyses, we study our data using Joint Site
Frequency Spectra (jSFS) because it summarizes much of the genomic data
into one statistic that can be used for inference (Gutenkunst et
al. 2009, Xu & Hickerson 2015). The jSFS is essentially the overlap in
the distribution in frequency of alleles between two populations and the
pattern of overlap can tell us a lot about demographic processes in the
past, both analytically (Gutenkunst et al. 2009, Excoffieret al. 2013) and visually (Fig. 2). More specifically, a single
SFS represents the distribution of the number of sites that are present
at each of the N allele frequencies in the population, whereN is equal to the number of total chromosomes present in the
population. For a diploid organism, this is twice the number of
individuals. A joint SFS is then the combination of two SFS, one for
each population, as a matrix that is (Npop1 + 1)
by (Npop2 + 1) cells. Each row is one of the
allele frequencies in the first population, beginning with 0 and then
ranging from \(\frac{1}{N_{pop1}}\) to Npop1 and
each column is the allele frequencies in the second population, again
beginning with 0 and ranging from \(\frac{1}{N_{pop2}}\) toNpop2 . Each cell then indicates the number of
sites at that corresponding allele frequency in both populations. If the
entire jSFS is standardized by the total number of sites, each cell
indicates the proportion of sites at the corresponding population allele
frequencies. The first row and column correspond to the sites that are
at given frequencies in one population while not present at all in the
other population, referred to hereafter as the “0” rows and columns.
Again, these indicate the variants present in one population and not the
other, thus the cell at row “0” and column “0” indicates the sites
that do not vary in either population. With SNP data and for demographic
model selection, this cell is not typically considered because it is
only relevant for scaling the proportion of invariant sites for
parameter estimates. Thus, when estimating demographic parameters from
these models, the monomorphic cell along with linked SNPs is needed to
inform the composite likelihood of the models (Excoffier et al.2013).
There is a trade-off between the number of chromosomes that can be
included from each population and the number of unlinked SNPs included
in the jSFS because the jSFS cannot accommodate missing data. The
missing data are generally due to random missing data and allelic
dropout from reduced representation sequencing (Andrews et al.2016), where loci are not represented across all or even a majority of
individuals in the population. Thus, the more samples per population
included, the fewer SNPs there are to sample from to construct the jSFS.
For this reason, we down-sampled the number of SNPs and alleles
(chromosomes in the population) to construct three different jSFS
datasets for each species using custom python scripts
(github.com/isaacovercast/easySFS). We enforced a different number of
alleles to be included per population, which resulted in a different
number of unlinked SNPs being sampled in each dataset (Table 1). These
datasets thus represent a spectrum of genomic information ranging from
more individuals in the population but fewer SNPs to fewer individuals
represented from the populations with many more SNPs included. We used
unlinked SNPs for model selection (see below) to satisfy the assumption
that SNPs are independent. We subsampled 100 different observed jSFS for
each of the sample sizes for each of the species (600 observed jSFS in
total) and masked monomorphic sites in all jSFS. For parameter
estimation using the jSFS, we use the full SNP dataset, meaning we
included linked SNPs in the construction of the jSFS. We also considered
the monomorphic cell in the jSFS when estimating parameters because this
cell provides information important to scale the invariant sites in the
genome. To calculate the monomorphic cell, we measured the ratio of
monomorphic sites and polymorphic sites in our entire datasets for each
species and then used those ratios, multiplied by the total number of
biallelic SNPs used in the empirical jSFS.