Material and Methods

Sequence filtering and clustering

The pre-release sequences of polymer entities and the InChI codes of non-polymer ligands were downloaded every Saturday from the PDB (wwPDB consortium et al. 2018) (http://www.wwpdb.org/files/). Structures containing sequences with unknown residues, sequences starting with caps, or sequences whose type (protein, DNA or RNA) could not be assigned unambiguously were discarded. Within each pre-release week, amino acid sequences of 30 residues or longer (“proteins”) were clustered with CD-HIT (Li and Godzik 2006) at a 99% sequence identity threshold. Amino acid sequences shorter than 30 residues (“peptides”), as well as DNA and RNA sequences, were clustered based on exact identity (100%). One representative sequence per cluster was selected as the target for structure prediction.
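The length split and the exact-identity clustering described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names, the dictionary layout (entity ID mapped to sequence) and the rule for picking the representative are assumptions made for illustration, and the 99% clustering of longer sequences is left to CD-HIT.

```python
from collections import defaultdict

MIN_PROTEIN_LENGTH = 30  # residues; the threshold stated in the text


def split_by_length(sequences):
    """Split amino acid sequences into proteins (>= 30) and peptides (< 30)."""
    proteins = {k: s for k, s in sequences.items() if len(s) >= MIN_PROTEIN_LENGTH}
    peptides = {k: s for k, s in sequences.items() if len(s) < MIN_PROTEIN_LENGTH}
    return proteins, peptides


def cluster_exact(sequences):
    """Group entity IDs that share an identical sequence (100% identity)."""
    clusters = defaultdict(list)
    for entity_id, seq in sequences.items():
        clusters[seq].append(entity_id)
    # Assumption: the first ID in sorted order serves as the representative.
    return {sorted(ids)[0]: sorted(ids) for ids in clusters.values()}
```

With hypothetical entity IDs, two identical 9-residue peptides end up in one cluster, while a 40-residue sequence is routed to the protein set for clustering with CD-HIT.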

Template searches

Target protein sequences were submitted to two template searches. First, a BLAST+ v. 2.2.31 (Camacho et al. 2009) search was performed against a database of the PDB entries current at the time of pre-release. A threshold of 85% sequence identity and at least 70% coverage was used to identify target sequences with very high similarity to a protein with a known structure. Next, sequence profiles were built using one iteration of HHblits v. 3.2.0 (Steinegger et al. 2019) against Uniclust30 (2018_08) (Mirdita et al. 2017). The profiles were used to search a database of PDB entries available on 2021-03-19, with an HHblits probability threshold of 70% and a coverage threshold of 70%, in order to identify target sequences with more remote similarity to a protein with a known structure. Since this was done as a retrospective analysis, hits released after the pre-release date of the target were filtered out. For peptide sequences shorter than 30 amino acid residues and for nucleic acid sequences, a lookup was performed against a database of the PDB entries current at the time of pre-release with a 100% identity threshold.
Templates found by BLAST, HHblits and lookup on single chains were aggregated into complexes. A structure was considered to be a template if all the chains of the target structure could be uniquely mapped to the chains of the template structure, and the template structure did not contain any extra polymer chains.
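As an illustration of how the thresholds and the retrospective date filter combine, the following Python sketch filters a list of single-chain hits. The `Hit` record and its field names are hypothetical and do not correspond to BLAST+ or HHblits output formats. Treating the thresholds as inclusive is also an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    """Hypothetical record for one template hit; field names are illustrative."""
    template_id: str
    method: str          # "blast" or "hhblits"
    identity: float      # percent sequence identity (used for BLAST hits)
    probability: float   # HHblits probability in percent (used for HHblits hits)
    coverage: float      # percent of the target sequence covered
    release_date: date   # release date of the template entry


def passes_thresholds(hit):
    """Apply the 70% coverage cutoff plus the method-specific cutoff."""
    if hit.coverage < 70.0:
        return False
    if hit.method == "blast":
        return hit.identity >= 85.0      # assumed inclusive
    if hit.method == "hhblits":
        return hit.probability >= 70.0   # assumed inclusive
    return False


def usable_templates(hits, prerelease_date):
    """Keep hits that pass the cutoffs and were released by the pre-release date."""
    return [h for h in hits
            if passes_thresholds(h) and h.release_date <= prerelease_date]
```

The date comparison only matters for searches against a database snapshot taken later than the pre-release, as with the HHblits search here. It is applied to all hits in this sketch for simplicity.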
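One way to read the complex-level criterion is as a one-to-one assignment between target chains and template chains that leaves no template chain unused. The text does not specify how the mapping was computed, so the sketch below is only an interpretation: it uses a standard augmenting-path bipartite matching, and the function name and input layout (a dictionary from each target chain to the set of template chains it hit) are assumptions.

```python
def is_complex_template(candidates, template_chains):
    """Return True if target chains map one-to-one onto all template chains.

    candidates: dict mapping each target chain to the set of template chains
                matched by the single-chain searches (hypothetical layout).
    template_chains: set of all polymer chains in the template structure.
    """
    # An extra (or missing) polymer chain in the template rules it out.
    if len(candidates) != len(template_chains):
        return False

    assigned = {}  # template chain -> target chain currently mapped to it

    def try_assign(target, seen):
        # Augmenting-path step: take a free chain or displace a previous owner.
        for tpl in sorted(candidates[target]):
            if tpl in seen or tpl not in template_chains:
                continue
            seen.add(tpl)
            if tpl not in assigned or try_assign(assigned[tpl], seen):
                assigned[tpl] = target
                return True
        return False

    return all(try_assign(target, set()) for target in candidates)
```

For example, if target chain A matches template chains X and Y while target chain B matches only X, the template with chains X and Y is accepted (B maps to X, A to Y), whereas a template with an additional chain Z is rejected.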

Scores

Single-chain predictions were evaluated against the reference structure with the lDDT score (Mariani et al. 2013) using OpenStructure v. 2.1.0 (Biasini et al. 2013), the global CAD atom-atom (AA) score v. 1646_63d6b800098c (Olechnovič, Kulberkytė, and Venclovas 2013), and the GDT_TS score using LGA v. 05/2009 (Zemla 2003). When the target structure contained more than one copy of the sequence or more than one biological assembly, or when the prediction was homo-oligomeric, the scores were calculated for all possible combinations of target assembly, target chain and model chain, and only the most favorable score was kept.
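The "most favorable score" selection amounts to a maximum over all target/model chain pairings, since higher values are better for lDDT, CAD and GDT_TS. The sketch below is a simplified illustration: `score_fn` is a hypothetical placeholder for whichever scoring program is used, and the chains are passed as opaque objects.

```python
from itertools import product


def best_score(score_fn, target_chains, model_chains):
    """Score every (target chain, model chain) pair and keep the highest value.

    score_fn is a hypothetical callable score_fn(target, model) -> float.
    Target chains from several biological assemblies can be pooled into
    target_chains before calling this function.
    """
    return max(score_fn(t, m) for t, m in product(target_chains, model_chains))
```

In a toy setting where the "score" is the fraction of identical positions between two strings, `best_score` returns the value for the best-matching pair.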
Homo- and hetero-oligomeric predictions were evaluated with the oligo-lDDT and QS-score (Bertoni et al. 2017) using OpenStructure v. 2.1.0 (Biasini et al. 2013), as well as the MM-align-based TM-score v. 20190426 (Mukherjee and Zhang 2009). The oligomeric lDDT score (oligo-lDDT) is an extension of the lDDT score to protein complexes and has been used in CASP since CASP13 (Guzenko et al. 2019; Kryshtafovych et al. 2019). It relies on the QS-score to identify the mapping of chains and residues between the model and the target structure. Once the mapping is identified, the all-atom lDDT score can be applied to the protein complex in the same way as it is applied to single chains, with the advantage that it now also considers inter-chain contacts. Extra atoms in the model for mapped chains have no effect on lDDT scores, while extra atoms in the target structure reduce the score. For the oligomeric lDDT score, we penalize extra chains in both reference and model by including their contacts as non-conserved contacts.
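The effect of the extra-chain penalty can be illustrated with a deliberately simplified calculation. The real lDDT averages the fraction of conserved distances over several tolerance thresholds, and the OpenStructure implementation is not reproduced here. The sketch only assumes that contacts from unmapped chains are added to the denominator as non-conserved, and all names and counts are hypothetical.

```python
def conserved_fraction(conserved, mapped_reference_contacts,
                       extra_reference_contacts=0, extra_model_contacts=0):
    """Toy single-threshold analogue of the oligomeric lDDT penalty.

    Contacts belonging to unmapped (extra) chains in the reference or the
    model are counted in the total but can never be conserved.
    """
    total = (mapped_reference_contacts
             + extra_reference_contacts
             + extra_model_contacts)
    if total == 0:
        raise ValueError("no contacts to evaluate")
    return conserved / total
```

Under these assumptions, a model that conserves 80 of 100 mapped contacts scores 0.8, and the same model drops to 0.5 if it carries an extra chain contributing 60 contacts.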

Ligand analysis

Functional domain annotation was extracted from CATH (Sillitoe et al. 2021) version 4.3.0. We used the “Structure external” links from DrugBank (Wishart et al. 2006) version 5.1.8 to identify drug-containing targets. The analysis was performed with Python 3.6.6, OpenStructure v. 2.1.0 (Biasini et al. 2013), and pandas v. 1.1.5 (McKinney 2010).

Structure visualization

Figures were generated with the Mol* Viewer (Sehnal et al. 2021).