Genome annotation
We annotated repetitive elements, noncoding RNAs (ncRNAs) and
protein-coding genes (PCGs) in both genomes. A de novo repeat
library was constructed using RepeatModeler v2.0.1 (Flynn et al., 2020),
with the additional LTR discovery pipeline enabled
(‘-LTRStruct’). We then combined the de novo library with the
Dfam 3.1 (Hubley et al., 2016) and RepBase-20181026 databases (Bao et
al., 2015) to construct a custom repeat library, which was employed to
mask repeats in the genome using RepeatMasker v4.0.9 (Smit et al.,
2013–2015). We identified ncRNAs using Infernal v1.1.3 (Nawrocki & Eddy,
2013) and tRNAscan-SE v2.0.7 (Chan & Lowe, 2019); low-confidence tRNAs
were filtered with the tRNAscan-SE built-in script
‘EukHighConfidenceFilter’.
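
The library-combination step can be illustrated with a minimal sketch. It assumes the de novo and database libraries are available as FASTA text and that records sharing the same identifier should be kept only once; the function names and the example records are hypothetical and are not part of RepeatModeler or RepeatMasker.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(libraries: Iterable[str]) -> str:
    """Concatenate FASTA libraries, keeping the first record per identifier.

    Assumption: the identifier is the first whitespace-delimited token
    of the header line.
    """
    seen = set()
    records = []
    for text in libraries:
        for header, seq in parse_fasta(text):
            name = header.split()[0]
            if name in seen:
                continue
            seen.add(name)
            records.append(f">{header}\n{seq}\n")
    return "".join(records)


# Made-up example records, for illustration only.
de_novo = ">rnd-1_family-1#LTR/Gypsy\nACGT\nTTAA\n"
database = ">DF0000001#DNA/hAT\nGGCC\n>rnd-1_family-1#LTR/Gypsy\nAAAA\n"
custom_library = merge_libraries([de_novo, database])
```

The merged text could then be written to a file and supplied to RepeatMasker as a custom library.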
We predicted PCGs using the MAKER v3.01.03 pipeline (Holt & Yandell,
2011), which included the EVidenceModeler (EVM, Haas et al., 2008)
module. Ab initio, transcriptome and protein homology-based
evidence were employed to support the predicted gene models. Ab
initio gene models were generated using BRAKER v2.1.5 (Brůna et al.,
2021), which combines two ab initio predictors, Augustus v3.3.4
(Stanke et al., 2004) and GeneMark-ES/ET/EP 4.68_lic (Brůna et al.,
2020), and can simultaneously incorporate transcriptome and protein
evidence to improve prediction accuracy. The input transcriptome
alignments were produced using HISAT2 v2.2.0 (Kim et al., 2019), and
arthropod proteins were mined from the OrthoDB10 v1 database
(Kriventseva et al., 2019) as a reference. Transcriptome evidence
(transcripts) was assembled using the genome-guided assembler StringTie
v2.1.4 (Kovaka et al., 2019). Protein sequences of Drosophila
melanogaster, Tribolium castaneum, Bombyx mori, Apis mellifera and
Daphnia magna were downloaded from NCBI
and fed to MAKER as protein homology evidence. EVM was enabled, with
the weights for ab initio predictions, transcripts and
proteins set to 1, 2 and 8, respectively. We assigned gene functions
using Diamond v2.0.8 (Buchfink et al., 2021) to search the UniProtKB
database in the more sensitive mode with an e-value threshold of 1e-5
(‘--more-sensitive -e 1e-5’). We further identified protein domains
and assigned Gene Ontology (GO) and pathway (KEGG, Reactome) annotations
using eggNOG-mapper v2.0.1 (Huerta-Cepas et al., 2017) and InterProScan
5.47-82.0 (Finn et al., 2017). Five databases were included in the
InterProScan analyses: Pfam (El-Gebali et al., 2019), SMART (Letunic &
Bork, 2018), Superfamily (Wilson et al., 2009), Gene3D (Lewis et al.,
2018), and CDD (Marchler-Bauer et al., 2017).
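
As an illustration of the UniProtKB search step, the sketch below parses 12-column BLAST-style tabular output (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and keeps one hit per query. Choosing the highest-scoring hit per gene is an assumption made here for illustration; the text specifies only the search mode and the e-value threshold. The function name and the example rows are hypothetical.

```python
from typing import Dict, Tuple


def best_hits(tabular: str, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float, float]]:
    """Return {query: (subject, evalue, bitscore)} for the top hit per query.

    Hits with an e-value above max_evalue are discarded; among the rest,
    the hit with the highest bit score is kept (an illustrative choice).
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up example rows, for illustration only.
rows = "\n".join([
    "gene1\tsp|P1|A_DROME\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200.0",
    "gene1\tsp|P2|B_DROME\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t300.0",
    "gene2\tsp|P3|C_DROME\t30.0\t50\t35\t0\t1\t50\t1\t50\t1e-3\t40.0",
])
assignments = best_hits(rows)
```

In this example, gene2 receives no assignment because its only hit exceeds the e-value threshold.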