Materials and Methods
Sequence filtering and clustering
The pre-release sequences of polymer entities, as well as the InChI codes of
non-polymer ligands, were downloaded every Saturday from the PDB
(wwPDB consortium et al. 2018) (http://www.wwpdb.org/files/).
Structures containing sequences with unknown residues, starting with
caps, or whose type (protein, DNA or RNA) could not be assigned
unambiguously were discarded. Within a pre-release week, amino acid
sequences of 30 amino acid residues or longer (“protein”) were
clustered with CD-HIT (Li and
Godzik 2006) applying a 99% sequence identity threshold. Amino acid
sequences of fewer than 30 amino acid residues (“peptides”), as well as
DNA and RNA sequences, were clustered based on exact identity (100%).
One representative sequence per cluster was selected as target for
structure prediction.
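The split into “protein” and “peptides” and the exact-identity clustering described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, choosing the alphabetically first identifier as cluster representative is an assumption, and the 99% CD-HIT clustering of the longer sequences is not reproduced here.

```python
from collections import defaultdict

MIN_PROTEIN_LENGTH = 30  # residues; threshold stated in the text


def split_by_length(sequences):
    """Split {id: sequence} into ("protein", "peptides") dictionaries."""
    proteins = {}
    peptides = {}
    for seq_id, seq in sequences.items():
        if len(seq) >= MIN_PROTEIN_LENGTH:
            proteins[seq_id] = seq
        else:
            peptides[seq_id] = seq
    return proteins, peptides


def cluster_exact(sequences):
    """Group identifiers whose sequences are 100% identical.

    Returns {representative_id: [member_ids]}. Picking the alphabetically
    first identifier as representative is an assumption of this sketch.
    """
    by_sequence = defaultdict(list)
    for seq_id, seq in sequences.items():
        by_sequence[seq].append(seq_id)
    clusters = {}
    for members in by_sequence.values():
        members = sorted(members)
        clusters[members[0]] = members
    return clusters
```

Grouping by the sequence string itself makes the 100% identity criterion explicit and needs no alignment step.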
Template searches
Target protein sequences were submitted to two template searches. First,
a BLAST+ v. 2.2.31 (Camacho
et al. 2009) search against a database of current PDB entries at the
time of pre-release was performed. A threshold of 85% sequence identity
and at least 70% coverage was used to identify target sequences with
very high similarity to a protein with known structure. Next, sequence
profiles were built using 1 iteration of HHblits v. 3.2.0
(Steinegger et al. 2019)
against Uniclust30 (2018_08)
(Mirdita et al. 2017). The
profiles were used to search a database of PDB entries available on
2021-03-19, with an HHblits probability threshold of 70% and a coverage
threshold of 70% in order to identify target sequences with more remote
similarity to a protein with a known structure. Since this was done as a
retrospective analysis, hits that were released after the date of the
pre-release of the target were filtered out. For peptide sequences of
fewer than 30 amino acid residues and sequences of nucleic acid residues,
a lookup was performed against a database of current PDB entries at the
time of pre-release with a 100% identity threshold.
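The thresholds and the retrospective date filter described above can be illustrated with a small sketch. The `Hit` record and all function names are hypothetical, and treating the thresholds as inclusive (and keeping templates released on the pre-release date itself) is an assumption. For the BLAST search the database was already limited to entries current at pre-release, so the date check is only expected to matter for the HHblits hits.

```python
from datetime import date
from typing import NamedTuple


class Hit(NamedTuple):
    """One template hit; all field names are hypothetical."""
    template_id: str
    score: float        # % identity (BLAST) or % probability (HHblits)
    coverage: float     # % of the target sequence covered by the hit
    release_date: date  # release date of the template entry


def filter_hits(hits, min_score, min_coverage, prerelease_date):
    """Keep hits that pass both thresholds and were not released
    after the pre-release date of the target."""
    return [
        h for h in hits
        if h.score >= min_score
        and h.coverage >= min_coverage
        and h.release_date <= prerelease_date
    ]


def blast_templates(hits, prerelease_date):
    # 85% sequence identity, 70% coverage (values from the text)
    return filter_hits(hits, 85.0, 70.0, prerelease_date)


def hhblits_templates(hits, prerelease_date):
    # 70% HHblits probability, 70% coverage (values from the text)
    return filter_hits(hits, 70.0, 70.0, prerelease_date)
```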
Templates found by BLAST, HHblits and lookup on single chains were
aggregated into complexes. A structure was considered to be a template
if all the chains of the target structure could be uniquely mapped to
the chains of the template structure, and the template structure did not
contain any extra polymer chains.
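One way to read this criterion is as a one-to-one assignment problem between target chains and template chains. The sketch below checks it with a simple augmenting-path bipartite matching. It is an illustration under assumptions, not the method's actual implementation: the function name and the `hits` structure (target chain mapped to the set of template chains it matched) are hypothetical.

```python
def is_complex_template(target_chains, template_chains, hits):
    """Return True if every target chain can be assigned its own template
    chain and the template has no polymer chain left over.

    target_chains:   list of target chain identifiers
    template_chains: set of polymer chain identifiers in the template
    hits:            {target_chain: set of template chains it matched}
    """
    # An extra (or missing) polymer chain rules the template out directly.
    if len(target_chains) != len(template_chains):
        return False

    assigned = {}  # template chain -> target chain currently using it

    def try_assign(target, seen):
        for chain in hits.get(target, ()):
            if chain in seen or chain not in template_chains:
                continue
            seen.add(chain)
            # Use a free chain, or move its current owner to another chain.
            if chain not in assigned or try_assign(assigned[chain], seen):
                assigned[chain] = target
                return True
        return False

    return all(try_assign(t, set()) for t in target_chains)
```

A matching step is needed because a greedy assignment can fail for homo-oligomers or similar chains, where several target chains match the same template chains.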
Scores
Single-chain predictions were evaluated against the reference structure
with the lDDT score (Mariani
et al. 2013) using OpenStructure v. 2.1.0
(Biasini et al. 2013), the
global CAD atom-atom (AA) score v. 1646_63d6b800098c
(K. Olechnovič, Kulberkytė,
and Venclovas 2013), and the GDT_TS score using LGA v. 05/2009
(Zemla 2003). When the
target structure contained more than one copy of the sequence or more
than one biological assembly, or when the prediction was homo-oligomeric,
the scores were calculated for all possible combinations of target
assembly, target chain, and model chain, and only the most favorable
score was kept.
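The selection of the most favorable score can be sketched as an exhaustive loop over the combinations. The names below are hypothetical, and `score_fn` stands in for any of the single-chain scores. The sketch assumes that a higher value is more favorable, which holds for lDDT, CAD and GDT_TS.

```python
def best_score(target_assemblies, model_chains, score_fn):
    """Return the highest score over all combinations of target assembly,
    target chain and model chain, or None if there is nothing to compare.

    target_assemblies: list of assemblies, each a list of target chains
    model_chains:      list of model chains
    score_fn:          callable(target_chain, model_chain) -> float
    """
    scores = [
        score_fn(target_chain, model_chain)
        for assembly in target_assemblies
        for target_chain in assembly
        for model_chain in model_chains
    ]
    return max(scores) if scores else None


def toy_score(target_chain, model_chain):
    """Placeholder score: fraction of identical positions in two strings."""
    matches = sum(a == b for a, b in zip(target_chain, model_chain))
    return matches / max(len(target_chain), 1)
```

Here `toy_score` only serves to make the example runnable. In practice the callable would wrap an external scoring tool.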
Homo- and hetero-oligomeric predictions were evaluated with the
oligo-lDDT and QS-score
(Bertoni et al. 2017) using
OpenStructure v. 2.1.0
(Biasini et al. 2013), as
well as the MM-align-based TM-score v. 20190426
(Mukherjee and Zhang 2009).
The oligomeric lDDT score (oligo-lDDT) is an extension of the lDDT score
for protein complexes and has also been used in CASP since CASP13
(Guzenko et al. 2019;
Kryshtafovych et al. 2019). It relies on the QS-score to identify the
mapping of chains and residues between the model and target structure.
Once the mapping is identified, the all-atom lDDT score can be applied
to the protein complex in the same way as it is applied to single
chains, with the advantage that it now also considers inter-chain
contacts. Extra atoms in the model for mapped chains have no effect on
lDDT scores, while extra atoms in the target structure reduce the score.
For the oligomeric lDDT score, we penalize extra chains in both
reference and model by including them as non-conserved contacts.
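The asymmetry between extra atoms in the model and extra atoms in the target can be illustrated with a strongly simplified lDDT calculation. This is a sketch under assumptions and not the OpenStructure implementation: it uses the commonly cited default tolerances (0.5, 1, 2 and 4 Å) and a 15 Å inclusion radius, assumes atoms are already mapped by identical keys, omits stereochemistry checks and the handling of symmetric side chains, and does not implement the oligo-lDDT penalty for extra model chains.

```python
import math
from itertools import combinations

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance-difference tolerances (Angstrom)
INCLUSION_RADIUS = 15.0            # longer reference distances are ignored


def _distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def simple_lddt(reference, model):
    """Simplified global lDDT.

    reference, model: {(chain, residue_number, atom_name): (x, y, z)}
    Only distances defined in the reference are evaluated, so atoms present
    only in the model are ignored, while reference atoms missing from the
    model count as non-conserved contacts.
    """
    conserved = 0
    total = 0
    for key_a, key_b in combinations(sorted(reference), 2):
        if key_a[:2] == key_b[:2]:
            continue  # skip atom pairs within the same residue
        d_ref = _distance(reference[key_a], reference[key_b])
        if d_ref > INCLUSION_RADIUS:
            continue
        total += len(THRESHOLDS)
        if key_a in model and key_b in model:
            d_model = _distance(model[key_a], model[key_b])
            diff = abs(d_ref - d_model)
            conserved += sum(diff < t for t in THRESHOLDS)
    return conserved / total if total else 0.0
```

Because the loop runs over reference atom pairs only, inter-chain contacts are included automatically once the chains of a complex share one coordinate dictionary.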
Ligand analysis
Functional domain annotation was extracted from CATH
(Sillitoe et al. 2021)
version 4.3.0. We used the “Structure external” links from DrugBank
(Wishart et al. 2006)
version 5.1.8 to identify drug-containing targets. The analysis was
performed with Python 3.6.6, OpenStructure v. 2.1.0
(Biasini et al. 2013), and
pandas v. 1.1.5 (McKinney
2010).
Structure visualization
Figures were generated with the Mol* Viewer
(Sehnal et al. 2021).