Genome annotation
We annotated repetitive elements, noncoding RNAs (ncRNAs) and protein-coding genes (PCGs) in both genomes. A de novo repeat library was constructed using RepeatModeler v2.0.1 (Flynn et al., 2020) with its additional LTR discovery pipeline enabled (‘-LTRStruct’). We then combined the de novo library with the Dfam 3.1 (Hubley et al., 2016) and RepBase-20181026 (Bao et al., 2015) databases to construct a custom repeat library, which was used to mask repeats in each genome with RepeatMasker v4.0.9 (Smit et al., 2013–2015). We identified ncRNAs using Infernal v1.1.3 (Nawrocki & Eddy, 2013) and tRNAscan-SE v2.0.7 (Chan & Lowe, 2019); low-confidence tRNAs were filtered out with the tRNAscan-SE built-in script ‘EukHighConfidenceFilter’.
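As an illustration of the library-combination step, the following minimal Python sketch merges several FASTA repeat libraries into one custom library. It is not the procedure used in this study: the function names (`parse_fasta`, `merge_libraries`) are hypothetical, and dropping records with identical sequences is an assumption made here for tidiness; the text above states only that the libraries were combined.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(libraries: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Concatenate repeat libraries in the order given.

    Assumption (not from the paper): a record whose sequence is identical
    (case-insensitively) to one already kept is skipped.
    """
    seen = set()
    merged: List[Tuple[str, str]] = []
    for _source, lines in libraries.items():
        for header, seq in parse_fasta(lines):
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            merged.append((header, seq))
    return merged


def write_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format records as FASTA text with fixed-width sequence lines."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

In practice each library would be read from an open file handle, and the merged FASTA would be supplied to RepeatMasker as the custom library.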
We predicted PCGs using the MAKER v3.01.03 pipeline (Holt & Yandell, 2011), which included the EVidenceModeler (EVM; Haas et al., 2008) module. Ab initio, transcriptome and protein homology-based evidence were used to support the predicted gene models. Ab initio gene models were generated using BRAKER v2.1.5 (Brůna et al., 2021), which combines two ab initio predictors, Augustus v3.3.4 (Stanke et al., 2004) and GeneMark-ES/ET/EP 4.68_lic (Brůna et al., 2020), and can simultaneously incorporate transcriptome and protein evidence to improve prediction accuracy. As input to BRAKER, transcriptome alignments were produced using HISAT2 v2.2.0 (Kim et al., 2019), and arthropod proteins were extracted from the OrthoDB10 v1 database (Kriventseva et al., 2019) as a reference. Transcriptome evidence (transcripts) was assembled using the genome-guided assembler StringTie v2.1.4 (Kovaka et al., 2019). Protein sequences of Drosophila melanogaster, Tribolium castaneum, Bombyx mori, Apis mellifera and Daphnia magna were downloaded from NCBI and supplied to MAKER as protein homology evidence. EVM was also activated, with the weights of ab initio predictions, transcripts and proteins set to 1, 2 and 8, respectively. We assigned gene functions by searching the UniProtKB database with Diamond v2.0.8 (Buchfink et al., 2021) in the more sensitive mode with an e-value threshold of 1e-5 (‘--more-sensitive -e 1e-5’). We further identified protein domains and assigned Gene Ontology (GO) and pathway (KEGG, Reactome) annotations using eggNOG-mapper v2.0.1 (Huerta-Cepas et al., 2017) and InterProScan 5.47-82.0 (Finn et al., 2017). Five databases were included in the InterProScan analyses: Pfam (El-Gebali et al., 2019), SMART (Letunic & Bork, 2018), Superfamily (Wilson et al., 2009), Gene3D (Lewis et al., 2018) and CDD (Marchler-Bauer et al., 2017).
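To make the functional-assignment step concrete, the sketch below shows one way Diamond hits could be reduced to a single annotation per gene. It assumes Diamond's default 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Keeping only the highest-bitscore hit per query is an assumption made for illustration, as is the name `best_hits`; the text above specifies only the sensitivity mode and the e-value threshold.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest-bitscore hit per query at or below max_evalue.

    Assumes 12-column tab-separated rows, with the e-value in column 11
    and the bit score in column 12. Best-hit selection is an illustrative
    choice, not a procedure stated in the paper.
    """
    best: Dict[str, Hit] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

Because the threshold is re-applied during parsing, the function can also be run with a stricter cutoff on an existing result file without repeating the search.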