Annotation of xenobiotic detoxification-related gene families
In contrast to insects, xenobiotic detoxification-related gene families are greatly expanded in Collembola, possibly reflecting adaptation to complex soil environments (Faddeeva-Vakhrusheva et al., 2017; Manni et al., 2020). Copy numbers of these families may differ between FCDK and FCSH, because parthenogenetic strains of F. candida show a wider distribution and better adaptability than sexual strains. We annotated the genes of five detoxification-related families, namely the cytochrome P450 (CYP), ATP-binding cassette transporter (ABC), carboxyl/cholinesterase (CCE), UDP-glycosyltransferase (UGT), and glutathione-S-transferase (GST) families, using the BITACORA v1.3 pipeline (Vizueta et al., 2020), and then checked the gene models manually. BITACORA first ran BLASTP searches against the proteins annotated by the automatic MAKER pipeline and TBLASTN searches against the genome assembly, and then confirmed via HMMER searches that the gene models contained the protein domains of each family (Altschul, 1997; Eddy, 2011). Reference protein sequences of D. melanogaster, B. mori and F. candida for the ABC, CCE, GST and UGT families were obtained from the NCBI RefSeq database, whereas CYP sequences were taken from Dermauw et al. (2020). HMM profiles of each family were downloaded from the Pfam database: ABC (PF00005), CCE (PF00135), GST (PF14497, PF02798), CYP (PF00067), and UGT (PF00201). An E-value cut-off of 1e-5 was applied to the BLAST and HMM searches. A close-proximity algorithm was used to predict novel genes from the TBLASTN alignments, with a maximum intron length of 15,000 bp. The resulting CYP sequences were manually examined for conserved structural features, which are characterized by a four-helix bundle (D, E, I and L), helices J and K, two sets of β sheets and a coil ‘meander’. The functions of the predicted proteins were checked via online BLASTP searches against the nonredundant protein database (nr).
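The idea behind the E-value filter and the close-proximity step can be illustrated with a minimal Python sketch. This is not BITACORA's implementation: the `Hit` record, the `group_hits` function, the use of strand as a grouping key, and the inclusive `<=` comparisons are all assumptions made for illustration. Only the two thresholds (1e-5 and 15,000 bp) come from the text above.

```python
from dataclasses import dataclass
from typing import List

EVALUE_CUTOFF = 1e-5   # E-value cut-off stated in the text
MAX_INTRON = 15_000    # maximum intron length (bp) stated in the text


@dataclass
class Hit:
    """Hypothetical record for one TBLASTN hit on the genome."""
    scaffold: str
    start: int    # 1-based start on the scaffold
    end: int      # 1-based end on the scaffold (start <= end)
    evalue: float
    strand: str   # "+" or "-"


def group_hits(hits: List[Hit],
               max_intron: int = MAX_INTRON,
               evalue_cutoff: float = EVALUE_CUTOFF) -> List[List[Hit]]:
    """Group significant hits into candidate gene regions.

    Hits on the same scaffold and strand are placed in one group when the
    gap to the previous group is at most `max_intron` bp (assumed rule).
    """
    # Discard hits that fail the E-value cut-off.
    kept = [h for h in hits if h.evalue <= evalue_cutoff]
    # Sort so that neighbouring hits are adjacent in the list.
    kept.sort(key=lambda h: (h.scaffold, h.strand, h.start))

    groups: List[List[Hit]] = []
    group_end = 0  # rightmost coordinate covered by the current group
    for hit in kept:
        same_locus = (
            bool(groups)
            and groups[-1][-1].scaffold == hit.scaffold
            and groups[-1][-1].strand == hit.strand
            and hit.start - group_end <= max_intron
        )
        if same_locus:
            groups[-1].append(hit)
            group_end = max(group_end, hit.end)
        else:
            groups.append([hit])
            group_end = hit.end
    return groups
```

Each returned group would correspond to the putative exons of one candidate gene; a real pipeline would still need to build and validate a gene model from them, which this sketch does not attempt.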
Phylogenetic trees were constructed to assist the classification of each family and to detect possible sequence errors. For each of the five gene families, the amino acid sequences were aligned using MAFFT with the L-INS-i method and trimmed using trimAl v1.4.1 (Capella-Gutiérrez et al., 2009) with the ‘gappyout’ strategy. Phylogenetic trees were then inferred using IQ-TREE, with automatic model selection and 1,000 ultrafast bootstrap replicates. Tree figures were refined using the online tool EvolView v3 (Subramanian et al., 2019).
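The alignment, trimming and tree-inference steps described above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the commands actually used: the file names (`CYP.faa` and so on), the helper names `build_commands` and `run_all`, and the choice of an IQ-TREE 2 executable called `iqtree2` are hypothetical, since the text does not give the IQ-TREE version or any command lines. The flags shown are the standard ones for L-INS-i in MAFFT (`--localpair --maxiterate 1000`), the gappyout mode in trimAl (`-gappyout`), and ModelFinder plus ultrafast bootstrap in IQ-TREE 2 (`-m MFP -B 1000`).

```python
import subprocess
from typing import List, Optional, Tuple

# One step = (argument vector, file that receives stdout or None).
Step = Tuple[List[str], Optional[str]]


def build_commands(family: str) -> List[Step]:
    """Return the three pipeline steps for one gene family.

    Input is assumed to be a protein FASTA file named '<family>.faa'
    (hypothetical naming scheme).
    """
    fasta = f"{family}.faa"
    aligned = f"{family}.aln.faa"
    trimmed = f"{family}.trim.faa"
    return [
        # MAFFT L-INS-i; MAFFT writes the alignment to stdout.
        (["mafft", "--localpair", "--maxiterate", "1000", fasta], aligned),
        # trimAl with the gappyout strategy.
        (["trimal", "-in", aligned, "-out", trimmed, "-gappyout"], None),
        # IQ-TREE 2 (assumed): automatic model selection, 1,000 UFBoot replicates.
        (["iqtree2", "-s", trimmed, "-m", "MFP", "-B", "1000"], None),
    ]


def run_all(family: str) -> None:
    """Execute the steps in order, stopping at the first failure."""
    for argv, stdout_path in build_commands(family):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    # Print the planned commands for the five families without running them.
    for name in ["CYP", "ABC", "CCE", "UGT", "GST"]:
        for argv, stdout_path in build_commands(name):
            suffix = f" > {stdout_path}" if stdout_path else ""
            print(" ".join(argv) + suffix)
```

With IQ-TREE 1.x the ultrafast bootstrap option would be `-bb 1000` rather than `-B 1000`, so the last step would need adjusting to the installed version.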