Orthology identification and phylogenetic inference
We inferred PCG sequence orthology across eleven arthropod species: one
crustacean (D. magna), one dipluran (Catajapyx aquilonaris), four insects
(Zootermopsis nevadensis, T. castaneum, A. mellifera, D. melanogaster), and
five collembolans (Sinella curviseta, Orchesella cincta, Holacanthella
duospinosa, FCDK, FCSH). Protein sequences of C. aquilonaris and
H. duospinosa were downloaded from i5K,
those of S. curviseta were obtained from FigShare
(https://doi.org/10.6084/m9.figshare.7286231.v2), and other data were
procured from NCBI. After removing redundant isoforms, orthogroups (gene
families) were inferred using OrthoFinder v2.5.2 (Emms & Kelly, 2019),
with DIAMOND in ultra-sensitive mode used for sequence searches
(‘-S diamond_ultra_sens’).
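
The text does not say how redundant isoforms were removed. One common approach is to keep the longest protein per gene before running OrthoFinder; the sketch below illustrates that idea only and is not necessarily the procedure used here. The function name, the `protein_to_gene` mapping (for example, parsed from an annotation file), and the toy identifiers are all hypothetical.

```python
def keep_longest_isoform(proteins, protein_to_gene):
    """Return {gene_id: (protein_id, sequence)}, keeping the longest protein per gene.

    proteins: dict mapping protein_id -> amino-acid sequence
    protein_to_gene: dict mapping protein_id -> gene_id (hypothetical mapping)
    """
    longest = {}
    for protein_id, seq in proteins.items():
        gene_id = protein_to_gene[protein_id]
        current = longest.get(gene_id)
        # Replace only when strictly longer, so ties keep the first isoform seen.
        if current is None or len(seq) > len(current[1]):
            longest[gene_id] = (protein_id, seq)
    return longest


# Toy input: gene g1 has two isoforms, gene g2 has one.
proteins = {"p1.1": "MKT", "p1.2": "MKTAYL", "p2.1": "MSS"}
protein_to_gene = {"p1.1": "g1", "p1.2": "g1", "p2.1": "g2"}
filtered = keep_longest_isoform(proteins, protein_to_gene)
for gene_id, (protein_id, seq) in sorted(filtered.items()):
    print(gene_id, protein_id, len(seq))
```

The filtered proteins would then be written to one FASTA file per species and passed to OrthoFinder as its input directory.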
Single-copy orthologues identified by OrthoFinder were used to infer
phylogeny and divergence times. We aligned the protein sequences of each
orthologue using MAFFT v7.394 (Katoh & Standley, 2013) with the
high-accuracy L-INS-i method, trimmed unreliably aligned sites using
BMGE v1.12 (Criscuolo & Gribaldo, 2010) with stringent parameters
(‘-m BLOSUM90 -h 0.4’), and concatenated individual alignments into a matrix.
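
The concatenation step can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name, the gap-padding of missing taxa, and the toy orthogroup names are assumptions. It returns the concatenated matrix together with the column range of each gene, which is the information a partitioned analysis needs.

```python
def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: dict mapping gene_name -> {taxon: aligned_sequence}
    taxa: list of taxon names; a taxon missing from a gene is padded with gaps
    Returns (supermatrix, partitions), where partitions is a list of
    (gene_name, start, end) with 1-based inclusive column coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for gene in sorted(alignments):
        gene_alignment = alignments[gene]
        lengths = {len(seq) for seq in gene_alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(gene_alignment.get(taxon, "-" * length))
        partitions.append((gene, position + 1, position + length))
        position += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy trimmed alignments for two hypothetical orthogroups.
alignments = {
    "OG1": {"A": "MK-T", "B": "MKAT"},
    "OG2": {"A": "LLV", "B": "L-V"},
}
matrix, parts = concatenate_alignments(alignments, ["A", "B"])
for gene, start, end in parts:
    print(f"{gene} = {start}-{end}")
```
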
We then estimated substitution models and partitioning schemes and
reconstructed the phylogeny using IQ-TREE v2.0.7 (Minh et al., 2020).
Genes that violated SRH (stationary, reversible and homogeneous)
assumptions were excluded (‘--symtest-remove-bad --symtest-pval
0.10’). To reduce the computational burden, the candidate model set was
restricted to LG (‘--mset LG’), and only the top 10% of partitioning
schemes were considered (‘--rclusterf 10’). Ultrafast bootstrap and
SH-like approximate likelihood ratio tests were calculated to assess node
support (‘-B 1000 --alrt 1000’). We estimated divergence times using
MCMCTree within the PAML v4.9j package (Yang, 2007), applying the JC69
substitution model, the independent-rates clock model, and approximate
likelihood calculation with ML estimation of branch lengths. Runs were
repeated at least twice to check convergence; each ran for 60,000
generations, with the first 10,000 discarded as burn-in. Five fossils
from the PBDB
(https://www.paleobiodb.org/navigator/)
were applied for node calibration: one Branchiopoda (<541
Mya), one Hexapoda (<485.4 Mya), the most recent common
ancestor (MRCA) of Diplura and Insecta (>407.6 Mya), one
Holometabola (315.2‒382.7 Mya), and the MRCA of Coleoptera and Diptera
(>295.5 Mya).
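
For illustration, the five calibrations can be written in the '>min<max' shorthand that MCMCTree accepts as node labels in its tree file. The sketch below assumes a time unit of 100 Myr, which is common practice but is not stated in the text; the helper name `mcmctree_bound` is hypothetical.

```python
def mcmctree_bound(min_mya=None, max_mya=None, unit=100.0):
    """Format a node calibration in MCMCTree's '>min<max' shorthand.

    Ages are given in Mya and divided by `unit`. The 100 Myr time unit is an
    assumption; the text does not report the unit that was used.
    """
    if min_mya is None and max_mya is None:
        raise ValueError("at least one bound is required")
    label = ""
    if min_mya is not None:
        label += f">{min_mya / unit:g}"   # minimum (lower) age bound
    if max_mya is not None:
        label += f"<{max_mya / unit:g}"   # maximum (upper) age bound
    return f"'{label}'"


# Calibrations as listed in the text (ages in Mya).
calibrations = {
    "Branchiopoda": mcmctree_bound(max_mya=541),
    "Hexapoda": mcmctree_bound(max_mya=485.4),
    "MRCA Diplura + Insecta": mcmctree_bound(min_mya=407.6),
    "Holometabola": mcmctree_bound(min_mya=315.2, max_mya=382.7),
    "MRCA Coleoptera + Diptera": mcmctree_bound(min_mya=295.5),
}
for node, label in calibrations.items():
    print(f"{node}: {label}")
```

Each label would be attached to the corresponding internal node of the Newick tree supplied to MCMCTree.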