Genome assembly
PacBio long reads were assembled using Flye v2.7.1 (Kolmogorov et al.,
2019), with a minimum overlap between reads of 1,000 bp and two rounds of
self-polishing (‘-m 1000 -i 2’). Primary contigs were then polished with
Illumina short reads in two iterations of NextPolish v1.3.1 (Hu et al.,
2020). Prior to polishing, short reads underwent quality control
using the ‘bbduk.sh’ script in the BBTools package v38.82 (Bushnell, 2014):
quality trimming (> Q20), length filtering (> 15 bp), polymer trimming
(> 10 bp), and correction of overlapping paired reads. Redundant
haplotypic duplications were removed
using Purge_Dups v1.0.1 (Guan et al., 2020) with the default settings.
All sequence alignments required by the polishing and purging steps above
were performed using Minimap2 v2.17 (Li, 2018). For Hi-C
scaffolding, read alignment to the assembly, duplicate removal, and Hi-C
contact extraction were performed using Juicer v1.6.2 (Durand et al.,
2016), with BWA v0.7.17 (Li & Durbin, 2009) as the aligner. We then
used the 3D-DNA v180922 pipeline (Dudchenko et al., 2017) to anchor
contigs into pseudochromosomes. Possible assembly errors, such as
misjoins, translocations, and inversions, were manually corrected using
the Assembly Tools module within Juicebox v1.11.08 (Durand et al.,
2016). Potential contaminants were detected using MMseqs2 v11
(Steinegger & Söding, 2017) to perform BLASTN-like searches against the
NCBI nucleotide (nt) and UniVec databases. Genome quality was further
evaluated based on genome completeness and the mapping rate of raw
reads. Genome completeness was assessed using BUSCO v3.0.2 (Waterhouse
et al., 2018) against the arthropod gene set (arthropoda_odb10, n =
1,013). Raw PacBio and Illumina reads were aligned to the assembly using
Minimap2, and the mapping rate was calculated with SAMtools v1.9 (Danecek
et al., 2021).
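
The text does not state how the mapping rate was defined or how SAMtools was invoked. As a minimal sketch, assuming the rate is the fraction of primary alignment records that are not flagged as unmapped, it can be computed from SAM-format records using the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `mapping_rate` and the toy records are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
from typing import Iterable

# FLAG bits defined by the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_rate(sam_lines: Iterable[str]) -> float:
    """Return the fraction of primary alignment records that are mapped.

    Assumption: secondary and supplementary records are skipped so that
    each read (or each mate of a pair) is counted exactly once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # the second SAM column is the bitwise FLAG
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    if total == 0:
        raise ValueError("no primary alignment records found")
    return mapped / total


# Hypothetical toy records: two mapped reads, one unmapped read, and one
# supplementary record that is ignored.
toy_records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "r3\t16\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\t*",
    "r1\t2048\tchr2\t900\t30\t5M5H\t*\t0\t0\tACGTA\t*",
]
print(f"{mapping_rate(toy_records):.3f}")  # prints 0.667
```

Counting only primary records keeps long reads with split alignments from inflating the denominator; a different convention (for example, counting every record) would give a slightly different rate.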