Bioinformatic suites for low coverage whole genome sequencing data
PoolParty2 is a bioinformatic pipeline that was designed for the utilization of pooled or individually barcoded low coverage whole genome sequencing (lcWGS) data with a high-quality reference genome to perform genome-wide genetic association analysis with binary phenotypic traits or environmental variables. Since its original presentation, it has seen numerous updates, though most of these will be invisible to the user since their effect is to provide greater efficiency or stability. For example, the scripts that implement the Local Score analysis of Fariello et al. (2017) have been improved to make them more robust to variations in SNP and allele frequency count and distribution and provide more accurate reports with respect to outlier region margins. The major workhorse of the pipeline, the PPalign module, has received the most updates since the pipeline’s first presentation. Notably, this module is now able to call variants in parallel (across multiple computational threads) by dividing chromosomes or scaffolds into groups based on a user-specified number of threads, which substantially reduces processing time for large datasets. Another notable update includes the user’s ability to limit the number of individuals to normalize simultaneously, which reduces RAM memory requirements in memory-limited or shared computing environments, with the proviso that this also causes the duration of processing time for normalization to increase linearly.
These improvements to the PPalign module of PoolParty2 are important since, while there are myriad analyses available for the discovery of large effect loci, few bioinformatic suites are designed to convey lcWGS data from sequencing output, via quality assurance analyses, through to analytical results. angsd, another useful bioinformatic suite designed for barcoded but not pooled lcWGS data, requires read alignments as input, and provides additional functionalities that PoolParty2 does not, making these bioinformatic suites strong complements. In particular, angsd can estimate individual genotype likelihoods or posterior probabilities. For analyses where individual phenotype data are important, such as random forest analyses (e.g. Hess, Zendt, Matala, & Narum, 2016), genotype likelihoods would be necessary. Moreover, as we demonstrate here, these genotype likelihoods allow the estimation of linkage disequilibrium in specified windows, and which can be used to identify outlier regions under putative selection, though, as our empirical results portray, most effectively in combination with analyses that consider divergence among focal sets of samples. The analyses available through the PPanalyze module of PoolParty2 have most often been applied for the discovery of one or a few genomic regions with large, independent or simple-interaction effects on traits (e.g. direct epistasis), though, as results from our empirical data also portray, these analyses, with appropriate corroboration, may identify many loci with smaller effects on polygenic traits. Loci contributing to polygenic traits with continuous variation and low penetrance, however, will be challenging to discover with analyses that rely mainly on allele frequency, and their discovery will likely depend on identifying combinatorial association of non-contiguous genotypes, for which genotype likelihoods may be better suited. It should be noted, however, that genotype likelihoods may only be sufficiently informative for some analyses if coverage is high enough for each individual, and the decision of how to multiplex individuals or how deeply to sequence, made early in the data preparation phase, must be done with respect to the types of analyses that will be needed (Lou et al., 2021; Paril et al., 2022).
Using simulated data, we demonstrated that angsd and PoolParty2 both provide remarkably precise estimates of allele frequency considering the degree of individual coverage constraint and variation with which we challenged them, with PoolParty2 exhibiting a moderate edge in identifying true sites with fewer artifacts and accurately estimating their allele frequencies. Both bioinformatic suites also produce very low rates of false positive outliers, at least when data are examined on a site-aggregate (windowed or serial) basis. A number of the per-site divergence analyses are roughly equivalent between bioinformatic suites, including estimates of FST, sliding window FST, and basic tests of allele frequency proportions, though we note that to estimate site frequency spectra in order to calculate FST with angsd, each population must be analyzed separately, and if the user desires to utilize the empirical major/minor alleles rather than reference/alternate or ancestral/derived, a prior run with all data must be made. Beyond these, however, the Local Score analyses (Fariello et al., 2017) and estimate of linkage outliers from ngsLD (Fox et al., 2019) provide powerful and complementary means to identify true outlier regions. Moreover, these analyses can be run in parallel by supplying filtered BAM files and a list of variant sites to angsd to generate genotype likelihoods for linkage estimation with ngsLD while simultaneously running PPanalyze and Local Score analyses. We provide the code to demonstrate how we executed this for our simulated and empirical data (Supplemental File 1).
However, results from the analysis of simulated data also indicate that, while both angsd and PoolParty2 are effective at identifying segregating variants, estimating allele frequencies, and identifying outlier loci using lcWGS data, the particular evolutionary context and genomic environment of putatively adaptive loci may constrain the efficacy of any bioinformatic suite at a given depth of coverage. For example, we observed that while both analytical suites provided results that correctly portrayed the presence and location of outlier loci in each of the simulated datasets, peaks in the higher background divergence scenarios were more difficult to discern visually and an analysis designed to objectively discriminate outlier loci from background rates of variation, Local Score, identified fewer significant regions. While outlier identification using linkage was more effective in high background divergence scenarios, and coordination of this approach with the other analyses increases flexibility of the combined toolkit, this analysis was less effective with lower coverage regardless of evolutionary context. Although we did not investigate it directly here because of the nature of the simulated data, one strategy to increase power in challenging scenarios is through the use of paired sample replicates, i.e., samples of the same phenotype or across the same environmental axes from multiple populations (e.g. Lotterhos & Whitlock, 2015). Analyses that emphasize repeated differences across these replicates, such as the Cochran–Mantel–Haenszel test available in PoolParty2, increase the power to identify regions associated with the focal contrasts while minimizing the influence of background variation (Cochran, 1954; Mantel & Haenszel, 1959). Our recommendation based on these observations is for researchers to consider existing information about the demographic and genetic context of the traits and populations under study, and to carefully arrange experimental design to maximize power for any given scenario (Lou et al., 2021).
One underappreciated phenomenon in lcWGS are technical artifacts commonly known as batch effects (Leek et al., 2010; Lou & Therkildsen, 2022). These occur when idiosyncrasies of the library and sequencing process introduce artifacts for samples processed together that are later misinterpreted as true biological differences between groups under analysis, and may have a number of contributing causes in lcWGS data (Lou & Therkildsen, 2022). While batch effects are certainly not exclusive to lcWGS, barcoded and especially pooled lcWGS data that are prepared group-wise may be subject to them. Importantly, both PoolParty2 and angsd have a number of utilities included that help eliminate some batch effects, including read trimming, mapping and SNP quality filtering, minor allele frequency thresholds, and read and individual observation minima. Moreover, with any bioinformatic suite it is wise that users conduct tests to determine if results are sensitive to various trimming and filtering parameters. The alignment filtering utilities provided with angsd appear to be more stringent, even with the same parameters, than what is currently applied in PoolParty2, as evidenced by diminished correlation in allele frequencies for sites with higher depth as estimated by PoolParty2 but not for angsd. Additional variant filters are accessible in angsd, including those that examine strand balance, and though not built-in, these same filters can be applied to the variants discovered by PoolParty2 through VCF filtering tools such as bcftools or vcflib (Danecek et al., 2021; Garrison, Kronenberg, Dawson, Pedersen, & Prins, 2021), as described in our tutorials. Aside from these filters, one important means of controlling batch effects is to distribute barcoded samples from different analytical groups (e.g., populations) among sequencing runs, and ideally library preparation groups, since this reduces the chance that technical artifacts will be identified as group-wise differences (O’Leary, Puritz, Willis, Hollenbeck, & Portnoy, 2018). In addition, another strategy to minimize the influence of batch effects and other false positives is through the use of paired sample replicates as described above: analyses that utilize paired replicates reduce the chance that artifacts owing to any technical pair will be identified as consistent, significant differences (Cochran, 1954; Mantel & Haenszel, 1959).
Discovery of large effect loci in non-model organisms
We successfully apply our integrated pipeline to a common data arrangment of low coverage whole genome sequencing (lcWGS) data from steelhead trout (Oncorhynchus mykiss ) to identify loci strongly divergent among populations with distinct histories and putatively of strong effect on traits under selection in these groups. This species exhibits an impressive array of life history diversity even among salmonids, including notable and important variation in migration phenology (run timing), age at maturity (age at first reproduction), propensity for residency or anadromy, precocial sexual maturation, and iteroparity (repeat spawning) (Busby et al., 1999; Carlson & Seamons, 2008; Quinn, Seamons, Vollestad, & Duffy, 2011). Many of these traits are highly variable both among and within phylogeographic lineages and local populations (Busby et al., 1999), are part of the portfolio of life history diversity that assist populations in persisting through natural and anthropogenic environmental challenges (Quinn, McGinnity, & Reed, 2016), and are among the outward physical traits used by managers to estimate the contribution of distinct stocks to mixed-stock fisheries (Hess et al., 2021).
The diversity of life history ensembles among regional and local stocks of steelhead reflects a multifaceted history of selection, and upon these effects anthropogenic forces have applied new challenges. One of the most direct human impacts on steelhead are use of hatcheries to mitigate for the loss of fish from dams, overfishing, habitat degradation, and other human actions. While hatcheries have begun to shift towards inclusion of natural origin (NOR) returns in broodstock recruitment to avoid ‘domestication’ selection, these are still the minority, and indeed some operations have taken the opposite approach, actively choosing hatchery returns with characteristics desirable for hatchery managers or fishers, with many consequences to genomic variation. While the typical application of our pipeline for discovering loci under selection relies upon groups with known phenotypic differences, discrete phenotypes with heritable bases may be overall uncommon in non-model organisms, at least relative to opportunities for examining populations arrayed across replicate environmental axes (e.g. Lotterhos & Whitlock, 2015). To reflect this, we chose to examine genomic divergence in a hatchery stock with a notable history of hatchery selection without a priori designation other than group membership, as would be the case for most landscape-level genome scans. However, in this case the Skamania Hatchery stock is well known for its phenotypic ensemble, its history of hatchery origin recruitment and artificial selection, and the extent to which this stock has been outplanted or strayed into numerous Columbia Basin tributaries. Not surprisingly, we observed strong divergence in two regions known to be associated with early migration timing and age at maturity (ocean duration) on chromosomes 28 and 25, respectively. However, we also saw significant divergences in regions that contained several dozen additional genes, including a prominent region on chromosome 20 containing two genes important in metabolism and cellular signaling. This region on chromosome 20 is also known to harbor a structural inversion in this species (Pearse et al., 2019), which may have been involved in artificial selection of the hatchery stock. Importantly, these regions were emphasized as outliers both by the analyses we applied that considered only global linkage patterns as well as the analyses that considered contrast-specific divergence while controlling for linkage. However, the linkage-only analysis identified additional regions that did not appear to exhibit strong divergence across our contrast of interest (hatchery vs. natural origin), demonstrating the importance of corroborating outliers identified from any single analysis. While we decline to hypothesize how these additional genes may be involved in domestication, these results demonstrate the utility of applying our pipeline to landscape-level samples and the efficacy and complementarity of the PoolParty and angsd pipelines for analyzing lcWGS data.
Acknowledgements
We thank colleagues at the Columbia River Inter-Tribal Fish Commission (CRITFC) in Portland, OR and Hagerman, ID for facilitating the current dataset and the associate editor and two anonymous reviewers for comments on a previous version of this manuscript. We also greatly appreciate the assistance by Nicolas R. Lou and Nina Overgaard Therkildsen for providing and explaining the simulated sequence data. Samples were collected by CRITFC staff or graciously provided by the Oregon and Washington Departments of Fisheries and Wildlife. Funding for this project was contributed by Bonneville Power Administration (Grant no. 2008-907-00), by the NSF Idaho EPSCoR Program, and by the National Science Foundation (award no. OIA-1757324).
Data availability
Data referenced in this manuscript are available from NBCI Short Read Archive (PRJNA854899). The pipeline used for bioinformatic processing of these data, PoolParty, is available from Github (https://github.com/stuartwillis/poolparty).
References Cited
Aguirre-Ramirez, E., Velasco-Cuervo, S., & Toro-Perea, N. (2021). Genomic traces of the fruit fly Anastrepha obliqua associated with its polyphagous nature. Insects , 12 (12). doi: 10.3390/insects12121116
Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G., & Hohenlohe, P. A. (2016). Harnessing the power of RADseq for ecol & evol genomics.pdf. Nature Reviews Genetics , 17 .
Ayerst, J. D. (1976). The Role of Hatcheries in Rebuilding Stcelhead Runs of the Columbia River System. American Fisheries Society Special Publication: Proceedings of a Symposium Held in Vancouver, Washington, March 5-6, 1976 , 84–88.
Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A., … Johnson, E. A. (2008). Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE ,3 (e3376).
Bonhomme, M., Chevalet, C., Servin, B., Boitard, S., Abdallah, J., Blott, S., & SanCristobal, M. (2010). Detecting selection in population trees: The Lewontin and Krakauer test extended. Genetics ,186 (1). doi: 10.1534/genetics.110.117275
Busby, P., Wainwright, T., & Bryant, G. (1999). Status Review of Steelhead from Washington, Idaho, Oregon, and California.Sustainable Fisheries Management . doi: 10.1201/9781439822678.ch11
Carlson, S. M., & Seamons, T. R. (2008). SYNTHESIS: A review of quantitative genetic components of fitness in salmonids: implications for adaptation to future change. Evolutionary Applications ,1 (2), 222–238. doi: 10.1111/j.1752-4571.2008.00025.x
Cochran, W. G. (1954). Some Methods for Strengthening the Common χ 2 Tests. Biometrics , 10 (4). doi: 10.2307/3001616
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., … Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience , 10 (2). doi: 10.1093/gigascience/giab008
Fariello, M. I., Boitard, S., Mercier, S., Robelin, D., Faraut, T., Arnould, C., … SanCristobal, M. (2017). Accounting for linkage disequilibrium in genome scans for selection without individual genotypes: The local score approach. Molecular Ecology ,26 (14). doi: 10.1111/mec.14141
Fox, E. A., Wright, A. E., Fumagalli, M., & Vieira, F. G. (2019). NgsLD: Evaluating linkage disequilibrium using genotype likelihoods.Bioinformatics , 35 (19). doi: 10.1093/bioinformatics/btz200
Futschik, A., & Schlötterer, C. (2010). The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics , 186 (1). doi: 10.1534/genetics.110.114397
Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S., & Prins, P. (2021). Vcflib and tools for processing the VCF variant call format. BioRxiv .
Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics , 195 (1). doi: 10.1534/genetics.113.152462
Hess, J. E., Collins, E. E., Harmon, S. A., Horn, R. L., Koch, I. J., Stephenson, J., … Narum, S. R. (2021). GENETIC ASSESSMENT OF COLUMBIA RIVER STOCKS: 1/1/2020 - 12/31/2020 Annual Report, 2008-907-00 .
Hess, J. E., Zendt, J. S., Matala, A. R., & Narum, S. R. (2016). Genetic basis of adult migration timing in anadromous steelhead discovered through multivariate association testing. Proceedings of the Royal Society B: Biological Sciences , 283 (1830). doi: 10.1098/rspb.2015.3064
Hoban, S., Kelley, J. L., Lotterhos, K. E., Antolin, M. F., Bradburd, G., Lowry, D. B., … Whitlock, M. C. (2016). Finding the genomic basis of local adaptation: Pitfalls, practical solutions, and future directions. American Naturalist . doi: 10.1086/688018
Horn, R. L., Kamphaus, C., Murdoch, K., & Narum, S. R. (2020). Detecting genomic variation underlying phenotypic characteristics of reintroduced Coho salmon (Oncorhynchus kisutch). Conservation Genetics , 21 (6). doi: 10.1007/s10592-020-01307-0
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics ,15 (1). doi: 10.1186/s12859-014-0356-4
Lamichhaney, S., Barrio, A. M., Rafati, N., Sundström, G., Rubin, C. J., Gilbert, E. R., … Andersson, L. (2012). Population-scale sequencing reveals genetic differentiation due to local adaptation in Atlantic herring. Proceedings of the National Academy of Sciences of the United States of America , 109 (47). doi: 10.1073/pnas.1216128109
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., … Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data.Nature Reviews Genetics , Vol. 11. doi: 10.1038/nrg2825
Lotterhos, K. E., & Whitlock, M. C. (2015). The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Molecular Ecology . doi: 10.1111/mec.13100
Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology , 30 (23). doi: 10.1111/mec.16077
Lou, R. N., & Therkildsen, N. O. (2022). Batch effects in population genomic studies with low-coverage whole genome sequencing data: Causes, detection and mitigation. Molecular Ecology Resources ,22 (5). doi: 10.1111/1755-0998.13559
Lowry, D. B., Hoban, S., Kelley, J. L., Lotterhos, K. E., Reed, L. K., Antolin, M. F., & Storfer, A. (2017). Breaking RAD: an evaluation of the utility of restriction site-associated DNA sequencing for genome scans of adaptation. Molecular Ecology Resources , 17 (2). doi: 10.1111/1755-0998.12635
Lyu, G., Feng, C., Zhu, S., Ren, S., Dang, W., Irwin, D. M., … Zhang, S. (2021). Whole Genome Sequencing Reveals Signatures for Artificial Selection for Different Sizes in Japanese Primitive Dog Breeds. Frontiers in Genetics , 12 . doi: 10.3389/fgene.2021.671686
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute , 22 (4). doi: 10.1093/jnci/22.4.719
Micheletti, S. J., Hess, J. E., Zendt, J. S., & Narum, S. R. (2018). Selection at a genomic region of major effect is responsible for evolution of complex life histories in anadromous steelhead 06 Biological Sciences 0604 Genetics. BMC Evolutionary Biology ,18 (1). doi: 10.1186/s12862-018-1255-5
Micheletti, S. J., & Narum, S. R. (2018). Utility of pooled sequencing for association mapping in nonmodel organisms. Molecular Ecology Resources . doi: 10.1111/1755-0998.12784
O’Leary, S. J., Puritz, J. B., Willis, S. C., Hollenbeck, C. M., & Portnoy, D. S. (2018). These aren’t the loci you’re looking for: Principles of effective SNP filtering for molecular ecologists.Molecular Ecology , 0 (ja). doi: 10.1111/mec.14792
Paril, J. F., Balding, D. J., & Fournier-Level, A. (2022). Optimizing sampling design and sequencing strategy for the genomic analysis of quantitative traits in natural populations. Molecular Ecology Resources , 22 (1). doi: 10.1111/1755-0998.13458
Pearse, D. E., Barson, N. J., Nome, T., Gao, G., Campbell, M. A., Abadía-Cardoso, A., … Lien, S. (2019). Sex-dependent dominance maintains migration supergene in rainbow trout. Nature Ecology and Evolution , 3 (12). doi: 10.1038/s41559-019-1044-6
Puritz, J. B., Matz, M. V, Toonen, R. J., Weber, J. N., Bolnick, D. I., & Bird, C. E. (2014). Demystifying the RAD fad. Molecular Ecology , 23 (24), 5937–5942.
Quinn, T. P., McGinnity, P., & Reed, T. E. (2016). The paradox of “premature migration” by adult anadromous salmonid fishes: Patterns and hypotheses. Canadian Journal of Fisheries and Aquatic Sciences , Vol. 73, pp. 1015–1030. doi: 10.1139/cjfas-2015-0345
Quinn, T. P., Seamons, T. R., Vollestad, L. A., & Duffy, E. (2011). Effects of growth and reproductive history on the egg size-fecundity trade-off in steelhead. Transactions of the American Fisheries Society . doi: 10.1080/00028487.2010.550244
Ren, S., Lyu, G., Irwin, D. M., Liu, X., Feng, C., Luo, R., … Wang, Z. (2021). Pooled Sequencing Analysis of Geese (Anser cygnoides) Reveals Genomic Variations Associated With Feather Color.Frontiers in Genetics , 12 . doi: 10.3389/fgene.2021.650013
Schlötterer, C., Tobler, R., Kofler, R., & Nolte, V. (2014). Sequencing pools of individuals-mining genome-wide polymorphism data without big funding. Nature Reviews Genetics , 15 (11). doi: 10.1038/nrg3803
Therkildsen, N. O., & Palumbi, S. R. (2017). Practical low-coverage genomewide sequencing of hundreds of individually barcoded samples for population and evolutionary genomics in nonmodel species.Molecular Ecology Resources , 17 (2). doi: 10.1111/1755-0998.12593
Tiffin, P., & Ross-Ibarra, J. (2014). Advances and limits of using population genetics to understand local adaptation. Trends in Ecology and Evolution , Vol. 29. doi: 10.1016/j.tree.2014.10.004
Willis, S. C., Hess, J. E., Fryer, J. K., Whiteaker, J. M., Brun, C., Gerstenberger, R., & Narum, S. R. (2020). Steelhead (Oncorhynchus mykiss) lineages and sexes show variable patterns of association of adult migration timing and age-at-maturity traits with two genomic regions. Evolutionary Applications , 13 (10), 2836–2856. doi: 10.1111/eva.13088
Willis, S. C., Hollenbeck, C. M., Puritz, J. B., & Portnoy, D. S. (2022). Genetic recruitment patterns are patchy and spatiotemporally unpredictable in a deep-water snapper (Lutjanus vivanus) sampled in fished and protected areas of western Puerto Rico. Conservation Genetics , 23 (3). doi: 10.1007/s10592-021-01426-2
Zhu, Y., Bergland, A. O., González, J., & Petrov, D. A. (2012). Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS ONE , 7 (7). doi: 10.1371/journal.pone.0041901
Tables
Table 1. Results of SNP calling and filtering by PoolParty2 and angsd with different modules and filtering thresholds in two different demographic scenarios (high and low background FST) and coverage levels (0.33x and 1x). The true number of simulated sites for the high background FST scenarios with minor allele frequency >=0.001 was 158,746 SNPs and for low background FST was 789,423 SNPs.
Table 2. Outlier regions identified from simulated data using windowed mean linkage and regions identified with Local Score analysis of significance from Fisher’s Exact tests. For each outlier region identified, the width in Kbp is reported. For windowed linkage analysis, the mean and maximum windowed mean of linkage is listed, while for Local Score analyses across quantiles of significance as smoothing parameter values, ξ, the number of replicates out of three in which a region was significant are listed.
Table 3. Outlier regions identified in select chromosomes from empirical steelhead trout data using windowed mean linkage and regions identified with Local Score analysis of significance from Fisher’s Exact tests. For each outlier region identified, the beginning and end of the regions (in Mbp), width (in Kbp), and, for regions identified with windowed linkage analysis, the mean and maximum windowed mean of linkage, are reported. Local score results are reported across quantiles of significance as smoothing parameter values, ξ.
Supplemental Table 1. Outlier regions identified in across all chromosomes from empirical steelhead trout data using Local Score analysis of significance from Fisher’s Exact tests. For each outlier region identified, the beginning and end, width, local score of the peak, and number of replicates out of 3 in which the region was significant, are reported. Regions are reported across quantiles of significance as smoothing parameter values, ξ.
Figures
Figure 1. Graphical representation of the normalization process of allele frequency estimates for variance in read depth possible with the incorporation on barcoded individual samples.
Figure 2. Comparing estimated allele frequencies with true allele frequencies across depth cutoffs for data simulated with high background FST and low coverage (0.33x). Left axis: This red and blue lines show correlation for PoolParty2 estimated frequencies for each population for normalized (solid) and non-normalized (dashed) data. Orange and green lines show correlation with angsd estimated frequencies. Black line shows correlation in depths estimated by angsd and PoolParty2. Right axis: thick purple line shows number of sites passing depth thresholds. Left and right panels shows depths reported by PoolParty2 and angsd, respectively.
Figure 3. Manhattan plots of divergence estimated for variant sites identified by PoolParty2 in simulated data in scenarios with high (A,B) or low (C,D) background FST and 0.33x (A,C) or 1x coverage (B,D). For each dataset, the divergence (FST) estimated in 100Kbp sliding windows (left column), Local Score estimated with ξ reflecting the 80th quantile of Fisher’s Exact Test significance values (middle column), and mean linkage in 100Kbp windows (right column) are shown. Red dots indicate series of 10+ windows of mean linkage above 2x the interquartile range. Blue dots in each panel reflect the positions (but not values) of loci simulated under selection.
Figure 4. Manhattan plots of divergence and linkage among hatchery and natural origin population samples of steelhead. A) Individual site FST corrected for sample size (red line indicates FST of 0.5) B) Mean FSTin sliding windows of 100Kb across the genome. C) -Log10 significance (p ) values from individual site Fisher’s Exact tests D-G) Local score analysis of significance values from Fisher’s Exact tests for the 80th, 90th, 95th, and 99th quantile of significance values as ξ (red line indicates FDR of 0.05); H-J) Mean linkage in 100Kbp windows for chromosomes 20, 25, and 28 (red dots indicate series of 20+ windows of mean linkage above 2x the interquartile range).
Supplemental Figure 1. True FST for loci simulated under demographic scenarios producing A) high background FST(low NE, low migration) or B) low background FST (high NE, high migration). Black dots indicated the FST of positions simulated under selection.
Supplemental Figure 2. Comparing estimated allele frequencies with true allele frequencies across depth cutoffs for data simulated with high background FST and low coverage (0.33x). Left axis: This red and blue lines show correlation for PoolParty2 estimated frequencies for each population for normalized (solid) and non-normalized (dashed) data. Orange and green lines show correlation with angsd estimated frequencies. Black line shows correlation in depths estimated by angsd and PoolParty2. Right axis: thick blue line shows number of sites passing depth thresholds. Left and right panels shows depths reported by PoolParty2 and angsd, respectively. A) high background FST, 0.33x coverage; B) high background FST, 1x coverage; C) low background FST, 0.33x coverage; D) low background FST, 1x coverage
Supplemental Figure 3. Manhattan plot of results from analysis of 0.33x coverage data simulated with high background FST. A) FST estimated by PoolParty; B) 100Kbp sliding windows of FST estimated by PoolParty2; C) -Log10 significance (p) values from Fisher’s Exact test of allele frequencies between populations calculated by PoolParty2; D) FST estimated by angsd; E) 100Kbp sliding windows of FST estimated by angsd; F) Likelihood ratio of frequency differences between population calculated by angsd; G) Mean linkeage in 100Kbp windows estimated for sites identified by PoolParty2, genotype likelihood estimated by angsd, and linkage calculated with ngsLD, with series of 10+ windows with mean linkage above 2x the interquartile range in red; H-L) Local score values for ξ values reflecting the 70th, 80th, 90th, 95th, and 99th quantile of significance values from Fisher’s Exact test. Blue dots in each panel reflect the positions (but not values) of loci simulated under selection.
Supplemental Figure 4. Manhattan plot of results from analysis of 1x coverage data simulated with high background FST. A) FST estimated by PoolParty2; B) 100Kbp sliding windows of FST estimated by PoolParty2; C) -Log10 significance (p) values from Fisher’s Exact test of allele frequencies between populations calculated by PoolParty2; D) FST estimated by angsd; E) 100Kbp sliding windows of FST estimated by angsd; F) Likelihood ratio of frequency differences between population calculated by angsd; G) Mean linkeage in 100Kbp windows estimated for sites identified by PoolParty2, genotype likelihood estimated by angsd, and linkage calculated with ngsLD, with series of 10+ windows with mean linkage above 2x the interquartile range in red; H-L) Local score values for ξ values reflecting the 70th, 80th, 90th, 95th, and 99th quantile of significance values from Fisher’s Exact test. Blue dots in each panel reflect the positions (but not values) of loci simulated under selection.
Supplemental Figure 5. Manhattan plot of results from analysis of 0.33x coverage data simulated with low background FST. A) FST estimated by PoolParty2; B) 100Kbp sliding windows of FST estimated by PoolParty2; C) -Log10 significance (p) values from Fisher’s Exact test of allele frequencies between populations calculated by PoolParty2; D) FST estimated by angsd; E) 100Kbp sliding windows of FST estimated by angsd; F) Likelihood ratio of frequency differences between population calculated by angsd; G) Mean linkeage in 100Kbp windows estimated for sites identified by PoolParty2, genotype likelihood estimated by angsd, and linkage calculated with ngsLD, with series of 10+ windows with mean linkage above 2x the interquartile range in red; H-L) Local score values for ξ values reflecting the 70th, 80th, 90th, 95th, and 99th quantile of significance values from Fisher’s Exact test. Blue dots in each panel reflect the positions (but not values) of loci simulated under selection.
Supplemental Figure 6. Manhattan plot of results from analysis of 1x coverage data simulated with low background FST. A) FST estimated by PoolParty; B) 100Kbp sliding windows of FST estimated by PoolParty2; C) -Log10 significance (p) values from Fisher’s Exact test of allele frequencies between populations calculated by PoolParty2; D) FST estimated by angsd; E) 100Kbp sliding windows of FST estimated by angsd; F) Likelihood ratio of frequency differences between population calculated by angsd; G) Mean linkeage in 100Kbp windows estimated for sites identified by PoolParty2, genotype likelihood estimated by angsd, and linkage calculated with ngsLD, with series of 10+ windows with mean linkage above 2x the interquartile range in red; H-L) Local score values for ξ values reflecting the 70th, 80th, 90th, 95th, and 99th quantile of significance values from Fisher’s Exact test. Blue dots in each panel reflect the positions (but not values) of loci simulated under selection.
Supplemental Figure 7. Quality control analyses for steelhead from Skamania hatchery or two natural origin populations, Willametter River (Eagle Creek) or Lewis River. A) Proportion of the genome covered at various minimum coverage limits. B) Mean coverage across sites for each population sample (summed across individuals) C) Principal components analysis of allele frequencies for variants with a maximum allele frequency difference ≤ 0.9. D) Mean proportion across population samples of each chromosome. E) Number of SNPs per 10 Kb across the genome (SNP density).