Variants, filtering, error polishing, and phasing
PCR-targeted long-read sequencing of the 77 successfully phased DNMs
produced an average allele coverage of 35,430X, with a mean background
noise of 4% at iSNP positions (see Supplementary Table 7 and methods
section ‘Bioinformatics’ for background noise calculation). This
coverage, which is higher than that of alternative targeting methods, WES
or WGS, is expected to help with error reduction and mosaicism detection
(Wright et al., 2019). The DNMs, which were initially identified with short-read
WES and validated with Sanger sequencing, were used to anchor the
preliminary phasing of long reads. This anchoring approach groups
long reads by the base observed at the DNM position. After variant
calling was performed (methods section ‘Bioinformatics’), homozygous
variants were removed and heterozygous variants were checked and
filtered based on agreement with the DNM-grouped reads. All remaining
variants from the long-read sequencing approach were error-polished and
filtered using WES and parental ONT data (Figure 2 and Supplementary
Tables 4 and 5). Following this, iSNPs were identified from the
remaining variants. The iSNP with the greatest confidence (coverage,
supporting data, DNM allele agreement) was selected for phasing. When
phasing the reads based on the DNM and selected iSNP, additional alleles
were allowed for the DNM in case of a postzygotic event, but these were
screened for biological plausibility, i.e. reads carrying the DNM wild
type (wt) would have to carry the same iSNP allele as reads carrying the
DNM alt. Importantly, this anchored approach filtered out 10% more
presumably falsely called variants than standard filtering based on quality
criteria, sequencing coverage and consensus (see Supplementary Table 4).
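
The anchoring and agreement filter described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the read representation (a dictionary mapping position to base), the function names, the toy reads and the 0.8 agreement threshold are all assumptions made for illustration. The last function encodes one reading of the plausibility screen, namely that a postzygotic event adds wt reads on the haplotype that carries the DNM alt.

```python
from collections import Counter, defaultdict

def group_reads_by_dnm(reads, dnm_pos):
    """Group read IDs by the base each read shows at the DNM position."""
    groups = defaultdict(list)
    for read_id, bases in reads.items():
        base = bases.get(dnm_pos)
        if base is not None:  # skip reads that do not cover the DNM
            groups[base].append(read_id)
    return dict(groups)

def variant_agreement(reads, groups, dnm_ref, dnm_alt, pos):
    """Fraction of DNM-grouped reads consistent with `pos` being heterozygous.

    A heterozygous site should show different majority bases in the
    DNM ref and DNM alt groups; if the majority is the same, return 0.0.
    """
    counts = {}
    for allele in (dnm_ref, dnm_alt):
        counts[allele] = Counter(
            reads[r][pos] for r in groups.get(allele, []) if pos in reads[r]
        )
    if not counts[dnm_ref] or not counts[dnm_alt]:
        return 0.0
    ref_major, ref_n = counts[dnm_ref].most_common(1)[0]
    alt_major, alt_n = counts[dnm_alt].most_common(1)[0]
    if ref_major == alt_major:
        return 0.0
    total = sum(counts[dnm_ref].values()) + sum(counts[dnm_alt].values())
    return (ref_n + alt_n) / total

def haplotype_counts(reads, dnm_pos, isnp_pos):
    """Count (DNM base, iSNP base) pairs over reads covering both sites."""
    return Counter(
        (b[dnm_pos], b[isnp_pos])
        for b in reads.values()
        if dnm_pos in b and isnp_pos in b
    )

def expected_mosaic_haplotype(hap_counts, dnm_ref, dnm_alt):
    """Return the (DNM wt, iSNP allele) pair a postzygotic event would add."""
    alt_isnp = Counter()
    for (dnm, isnp), n in hap_counts.items():
        if dnm == dnm_alt:
            alt_isnp[isnp] += n
    if not alt_isnp:
        return None
    return (dnm_ref, alt_isnp.most_common(1)[0][0])

# Toy data (hypothetical): DNM at position 100 (ref C, alt T),
# candidate heterozygous sites at positions 250 and 400.
reads = {
    "r1": {100: "T", 250: "G", 400: "A"},
    "r2": {100: "T", 250: "G", 400: "A"},
    "r3": {100: "T", 250: "G", 400: "C"},
    "r4": {100: "C", 250: "A", 400: "A"},
    "r5": {100: "C", 250: "A", 400: "A"},
    "r6": {100: "C", 250: "G", 400: "C"},
}
MIN_AGREEMENT = 0.8  # illustrative threshold, not taken from the study

groups = group_reads_by_dnm(reads, dnm_pos=100)
kept = [
    pos for pos in (250, 400)
    if variant_agreement(reads, groups, "C", "T", pos) >= MIN_AGREEMENT
]
hap_counts = haplotype_counts(reads, dnm_pos=100, isnp_pos=250)
mosaic_hap = expected_mosaic_haplotype(hap_counts, "C", "T")
```

In this toy example, position 250 separates the two DNM groups and is kept, whereas position 400 shows the same majority base in both groups and is dropped. The single wt read that carries the alt-linked iSNP allele is the pattern a postzygotic event would produce, although one read is equally consistent with sequencing error.
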
Small indels made up on average 25% of the false positives, and on
average 94% of all indels detected in the long-read sequencing data
were likely false positives (see Supplementary Table 5, and for an
example of a false indel see Supplementary Figure 7). Because of this
high error rate for indel calling, we removed all indels that lacked
supporting data, as noted in the illustration of our approach
(Figure 2).
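
The indel rule above amounts to a simple filter, sketched below in Python. The tuple layout `(position, ref, alt)`, the function names and the example calls are hypothetical; `supported` stands in for whatever set of variants is confirmed by other data, such as WES or parental ONT calls.

```python
def is_indel(ref, alt):
    """Treat any variant whose ref and alt alleles differ in length as an indel."""
    return len(ref) != len(alt)

def drop_unsupported_indels(variants, supported):
    """Keep all SNVs; keep an indel only if it appears in the supported set."""
    return [
        v for v in variants
        if not is_indel(v[1], v[2]) or v in supported
    ]

# Hypothetical long-read calls: one SNV and two small indels.
calls = [(120, "A", "G"), (310, "CT", "C"), (455, "G", "GA")]
# Hypothetical set of variants confirmed by independent data.
supported = {(455, "G", "GA")}

filtered = drop_unsupported_indels(calls, supported)
```

Here the deletion at position 310 is removed because nothing supports it, while the SNV and the supported insertion are retained.
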