Comment on Bohmann et al. Strategies for sample labelling and library preparation in DNA metabarcoding studies
Hambäck, P.A.1*, J. Sargac2 and M. Grudzinska-Sterno1
1Department of Ecology, Environment and Plant Sciences, Stockholm University; 2Aquatic Sciences and Assessment, Swedish University of Agricultural Sciences
Corresponding author: Department of Ecology, Environment and Plant Sciences, Stokholm University, 106 91 Stockholm, Sweden. E-mail: peter.hamback@su.se
Abstract
DNA metabarcoding necessitates labelling amplicons in order to connect sequencing reads with samples, but labelling protocols may cause errors where indexes are incorrectly assembled during PCR due to tag-jumping. A recent paper by Bohmann et al (2021) reviews the main labelling methods and point out that library building using PCR’s on tagged amplicons may be particularly problematic. Due to unforeseen problems in two sequencing projects, we had to use a second PCR on tagged amplicons to salvage two large data sets. This test showed that the problems with tag-jumping errors were acceptable and could be accounted for during analysis, if handled properly when designing the indexing strategy.
Introduction
DNA metabarcoding has rapidly become a mainstream tool in ecological research, both for species inventories using environmental DNA and for identification of gut contents when quantifying diets (Liu, Clarke, Baker, Jordan, & Burridge, 2020), not only because of the reduced costs for sequencing but also because of the development of protocols that increase the quality of output data (reviewed in Alberdi et al., 2019; Alberdi, Aizpurua, Gilbert, & Bohmann, 2018). Because metabarcoding comes with contamination risks during both data collection and laboratory procedures, several papers describe methods to reduce contamination during field collections and preparation of samples before sequencing (Alberdi et al., 2018; King et al., 2012; King, Read, Traugott, & Symondson, 2008). One recent issue under discussion concerns labelling of amplicons to enable the connection of samples and sequence data (Schnell, Bohmann, & Gilbert, 2015). To enable the identification of metabarcoding data, the workflow involves methods of labelling amplicons through the addition of nucleotide tags on the 5’-end of metabarcoding primers and/or as indexes during library preparation. There are three main strategies for labelling in metabarcoding studies, of which one (the so-called ‘tagged PCR approach’) can result in tag-jumps, i.e., the appearance of sequences carrying new combinations of the used 5’ nucleotide tags (e.g. Esling et al 2015, Schnell et al 2015). A recent review by Bohmann et al. (2021) nicely describe the risks connected to different indexing methods, each with their different pros and cons.
As reviewed by Bohmann et al. (2021), risks of getting erroneous indexed tag combinations will only occur as a result of library preparation with T4 DNA Polymerase blunt-ending or when libraries are prepared by ligating indexes on pools of tagged amplicons in a second PCR, suggesting that these methods should be avoided (Carøe & Bohmann, 2020). While we completely agree with this general advice, problems during execution may necessitate workflow changes involving a second PCR to salvage data. Moreover, including a risk analysis before setting up the workflow may allow for a smooth transition between methods if needed. In our group, we use metabarcoding to identify prey in spider guts using the tagged PCR approach with metabarcoding primers having 5’ nucleotide tags (Binladen et al., 2007) (hereafter tags). We thereafter pool amplicons and build libraries using a PCR-free protocol that includes a ligation of Illumina adaptors with dual indexes (for complete laboratory protocols see Hambäck et al., 2021). As Bohmann et al. (2021) points out, this approach cannot cause tag jumping if a blunt-ending step is excluded from the library preparation protocol. For this reason, we omit the end-repair step and perform phosphorylation and adenylation of DNA fragments in a separate reaction.
A problem when extracting DNA from guts of small organisms is that prey DNA is in low amounts and highly fragmented. In two recent projects, even after PCR amplification, our samples were found by the sequencing lab to contain too little DNA for the MiSeq to run properly. At this stage, we either had to abandon the projects, wasting resources in collecting and preparing the samples, or use additional PCR steps on tagged amplicon pools to boost DNA amounts. Following discussions with the staff at our National Genomics Institute (NGI), our sequencing facility, we decided to run additional PCRs with only 6 cycles. Libraries were prepared using SMARTer ThruPLEX DNA-seq library preparation kit excluding fragmentation of DNA (Takara Bio), and to measure tag jumping rates we only used 75% of available tag combinations while leaving about 25% empty. Moreover, to force tag jumping errors to occur only between spider individuals within sites, the new libraries were reconstructed based on sampling units. Because downstream analyses in this study focused on between site differences and therefore data were pooled over spider individuals, tag jumping between spiders within sites did not distort the result. This unplanned library adjustment necessitated some spider individuals to be discarded to avoid duplicity of tags between individuals within sites. In retrospect, this problem could have been avoided when setting tags to samples.
In post-processing, we used standard settings to clean, filter, de-multiplex and tabulate sequence data using Obitools (Boyer et al., 2016) on the Galaxy platform (Jalili et al., 2020) similar to our previous studies. Thus, pair-end reads were joined using ‘Illuminapairedend’, trimmed and annotated using ‘NGSfilter’ before filtering on length using ‘obigrep’ (310-330 bp) and identifying unique sequences with ‘obiuniq’. After tabulating the data, we identified OTU’s using ‘pick_otus’ and connected representative sequences based on ‘pick_rep_set” to taxon identities using Barcode of Life Database (BoLD) (Ratnasingham & Hebert, 2007). Here we only report sequence number distributions separated between correct and false tag combinations, leaving results on actual spider diets to other publications.
In project 1, involving linyphiid spiders, the yield of useful prey sequences was about 8.7 million sequences after cleaning and demultiplexing. Among these sequences, only 0.03% were connected to false tag combinations. Because about 25% of tag combinations were empty, the tag jumping rate can be estimated to be about 0.12%. In project 2, involving lycosid spiders, the yield was lower (0.45 million sequences), with a higher percentage (1.9%) of sequences connected to false tag combinations, corresponding to an approximate tag jumping rate of 7.6%. Notable is that the tag jumping rate decreased to about 5.8% when excluding sequences with no match in BoLD. When examining the frequency distribution of sequence number for true and false tag combinations (Fig. 1), it was also apparent that there was almost no overlap in the distributions for project 1 but a larger overlap in project 2.
For the data from project 2, we decided to further compare proportions of false combinations between the 32 libraries. It was evident that these proportions showed large variation between libraries, from 0.004% up to 30%. The reason for this variability is not evident, but data suggest that problems with high frequencies of false combinations mainly occurred at a low total yield per library, in our case below 5000 sequences (Fig. 2), which could explain why no problems appeared in project 1. The reason for the low yield in project 2 is unclear, but it served us well to illustrate the yield dependent error rates. It is apparent that for this protocol, tag jumping errors are unproblematic as long as yield is sufficiently high. For the final diet analyses, we will use estimated tag jumping rates to set dynamic thresholds for data exclusions at the level of sampling sites and species (see Cirtwill & Hambäck, 2021).
To summarize, similar to previous studies (Carøe & Bohmann, 2020; Schnell et al., 2015), we find that tag jumping errors are potential problems in metabarcoding studies when libraries are built using a PCR on pools of tagged amplicons and such an approach should be avoided when possible. However, as in our case, even when using a PCR-free library preparation protocol it is sometimes necessary to enrich libraries to obtain sufficient concentrations for sequencing. In an ideal world, we could have collected new samples but costs are often prohibitive. We instead used a strategy where error rates could be estimated and where effects from errors could be avoided. When doing this, we find that estimated error rates due to tag jumping are small when DNA yields are not very low, suggesting that risks of enriching tagged amplicon libraries through additional PCR cycles can be acceptable. This information is good news when aiming to describe gut contents of small invertebrate predators, where sometimes DNA amounts can be very low. However, it is advisable to consider risks prior to designing tagging protocols to enable future methodological switches.
Acknowledgements
Y. Marincevic-Zuniga and R. Kudva was very helpful in trouble-shooting, whereas K. Bohmann and R.K. Johnson provided helpful comments on a previous version of this manuscript. Sequencing was performed by the SNP&SEQ Technology Platform in Uppsala. This facility is part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported but the Swedish Research Council and the Knut and Alice Wallenberg Foundation.
References
Alberdi, A., Aizpurua, O., Bohmann, K., Gopalakrishnan, S., Lynggaard, C., Nielsen, M., & Gilbert, M. T. P. (2019). Promises and pitfalls of using high-throughput sequencing for diet analysis. Molecular Ecology Resources, 19 (2), 327-348. doi:10.1111/1755-0998.12960
Alberdi, A., Aizpurua, O., Gilbert, M. T. P., & Bohmann, K. (2018). Scrutinizing key steps for reliable metabarcoding of environmental samples. Methods in Ecology and Evolution, 9 (1), 134-147. doi:10.1111/2041-210x.12849
Binladen, J., Gilbert, M. T. P., Bollback, J. P., Panitz, F., Bendixen, C., Nielsen, R., & Willerslev, E. (2007). The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS One, 2 , e197. doi:10.1371/journal.pone.0000197
Bohmann, K., Elbrecht, V., Carøe, C., Bista, I., Leese, F., Bunce, M., . . . Creer, S. (2021). Strategies for sample labelling and library preparation in DNA metabarcoding studies. Molecular Ecology Resources . doi:10.1111/1755-0998.13512
Boyer, F., Mercier, C., Bonin, A., Le Bras, Y., Taberlet, P., & Coissac, E. (2016). OBITOOLS: a UNIX-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16 , 176-182. doi:10.1111/1755-0998.12428
Carøe, C., & Bohmann, K. (2020). Tagsteady: A metabarcoding library preparation protocol to avoid false assignment of sequences to samples.Molecular Ecology Resources, 20 (6), 1620-1631. doi:10.1111/1755-0998.13227
Cirtwill, A. R., & Hambäck, P. (2021). Building food networks from molecular data: Bayesian or fixed-number thresholds for including links.Basic and Applied Ecology, 50 , 67-76. doi:10.1016/j.baae.2020.11.007
Hambäck, P. A., Cirtwill, A. R., García, D., Grudzinska-Sterno, M., Miñarro, M., Tasin, M., . . . Samnegård, U. (2021). More intraguild prey than pest species in arachnid diets may compromise biological control in apple orchards. Basic and Applied Ecology, 57 , 1-13. doi:10.1016/j.baae.2021.09.006
Jalili, V., Afgan, E., Gu, Q., Clements, D., Blankenberg, D., Goecks, J., . . . Nekrutenko, A. (2020). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update.Nucleic Acids Res, 48 (W1), W395-W402. doi:10.1093/nar/gkaa434
King, R. A., Davey, J. S., Bell, J. R., Read, D. S., Bohan, D. A., & Symondson, W. O. C. (2012). Suction sampling as a significant source of error in molecular analysis of predator diets. Bulletin of Entomological Research, 102 (3), 261-266. doi:10.1017/S0007485311000575
King, R. A., Read, D. S., Traugott, M., & Symondson, W. O. C. (2008). Molecular analysis of predation: a review of best practice for DNA-based approaches. Molecular Ecology, 17 (4), 947-963. doi:10.1111/j.1365-294X.2007.03613.x
Liu, M. X., Clarke, L. J., Baker, S. C., Jordan, G. J., & Burridge, C. P. (2020). A practical guide to DNA metabarcoding for entomological ecologists. Ecological Entomology, 45 , 373-385. doi:10.1111/een.12831
Ratnasingham, S., & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System (www.barcodinglife.org).Molecular Ecology Notes, 7 (3), 355-364. doi:10.1111/j.1471-8286.2007.01678.x
Schnell, I. B., Bohmann, K., & Gilbert, M. T. P. (2015). Tag jumps illuminated - reducing sequence-to-sample misidentifications in metabarcoding studies. Molecular Ecology Resources, 15 (6), 1289-1303. doi:10.1111/1755-0998.12402
Figure legends
Fig. 1. Frequency distribution of sequence number among spider individuals with correct (filled bars) and false (open bars) tags in two separate data sets (A, B). Notice that false tags have a similar distribution for the two data sets but that correct tags are much lower in (B).
Fig. 2. Relationship between the proportion of sequences with false tag combinations and total DNA yield per library (N=32).
Figure 1