Figure 1: Target 2020-12-19_00000231 (PDB ID 7K93) is a hetero-2-2mer protein complex of a Dengue virus non-structural protein (NS1) (green) in complex with a mouse neutralizing single chain Fab variable region (orange) (Biering et al. 2021). While templates can be easily identified with HHblits for both entities, there is no overlap between the template lists, meaning the two proteins have never been observed in a homologous complex. Specifically, no homologs of this Dengue virus protein have been observed in complex with an antibody. This therefore constitutes an interesting target for the modeling of heteromeric protein complexes.
In the data we collected over the 52 pre-release weeks of 2020, the PDB released 3158 interesting protein structures, i.e. structures for which no closely related homolog could be found with BLAST. Among those, 1017 were monomers, 1011 were homo-oligomeric complexes (which cannot be distinguished from monomers using the sequence-only pre-release data) and 1130 were hetero-oligomers.
To retrospectively analyse the complexity of the hetero-oligomeric target set, we repeated the template search with HHblits to identify more remotely related homologous complexes. For 565 of these 1130 targets, HHblits identified a homologous hetero-oligomeric complex in which all entities of the target could be uniquely mapped to the template, and reciprocally. In 240 hetero-oligomeric complexes, templates for individual entities could be identified with HHblits, but not in the same complex (or the template contained extra entities); and 113 complexes could similarly be identified with BLAST. These 353 “novel complex” targets are of particular interest, as an accurate prediction would have to successfully predict the assembly mode of the complex and accurately model the (unknown) interfaces, therefore going beyond the classical reach of homology modeling. Finally, for the remaining 212 complexes, no template could be identified by HHblits for at least one of the target entities.
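The three-way split described above can be sketched as a small classification function. This is only an illustration under stated assumptions, not the CAMEO implementation: the data layout (`hits`, `template_entities`), the function name and the category labels are hypothetical, and the sketch treats a single template search rather than the combined HHblits and BLAST results.

```python
from itertools import permutations


def classify_target(hits, template_entities):
    """Classify a hetero-oligomeric target from its per-entity template hits.

    hits: target entity id -> set of (template id, template entity id) pairs
          (hypothetical layout; one entry per target entity, possibly empty).
    template_entities: template id -> set of all entity ids in that template.
    """
    entities = sorted(hits)
    # At least one target entity has no template at all.
    if any(not hits[e] for e in entities):
        return "no_template"
    # Templates that were hit by every entity of the target.
    shared = set.intersection(*({tpl for tpl, _ in hits[e]} for e in entities))
    for tpl in sorted(shared):
        members = sorted(template_entities.get(tpl, set()))
        if len(members) != len(entities):
            continue  # template has extra (or missing) entities
        # Look for a one-to-one mapping between target and template entities.
        for perm in permutations(members):
            if all((tpl, m) in hits[e] for e, m in zip(entities, perm)):
                return "homologous_complex"
    # Every entity has a template, but never together in a matching complex.
    return "novel_complex"
```

With two entities whose template lists do not overlap, as for the target in Figure 1, the function returns `"novel_complex"`. The brute-force search over permutations is acceptable here because the number of distinct entities per complex is small; a bipartite matching algorithm would be the natural replacement for larger assemblies.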
HHblits was able to identify homologs for the vast majority (1734) of the 2028 monomeric (1017) or homo-oligomeric (1011) interesting protein structures contained in the CAMEO target set. We note, however, that 43 of the targets could only be mapped to templates in complex with a different partner. The interfaces are likely to differ from those of the templates, and we therefore consider these targets interesting modeling targets for CAMEO. Finally, HHblits was unable to identify a template for 294 of these targets.
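As a cross-check, the counts reported in the preceding paragraphs can be tallied in a few lines. The numbers are taken from the text; the dictionary keys are illustrative labels only.

```python
# Counts as reported in the text for the 52 pre-release weeks of 2020.
targets = {"monomer": 1017, "homo_oligomer": 1011, "hetero_oligomer": 1130}

# Breakdown of the hetero-oligomeric targets.
hetero = {
    "homologous_complex_hhblits": 565,
    "novel_complex_hhblits": 240,
    "novel_complex_blast": 113,
    "entity_without_template": 212,
}

# Breakdown of the monomeric and homo-oligomeric targets.
mono_homo = {"template_found": 1734, "no_template": 294}

total_targets = sum(targets.values())
novel_complexes = hetero["novel_complex_hhblits"] + hetero["novel_complex_blast"]
mono_homo_total = targets["monomer"] + targets["homo_oligomer"]

print(total_targets, sum(hetero.values()), novel_complexes, mono_homo_total)
```

Each breakdown sums to its parent category: the hetero-oligomer classes add up to 1130, and the two monomer and homo-oligomer classes add up to 2028.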
To evaluate the predictions, we use the same scores as for the homo-oligomers: oligo-lDDT, QS-score and TM-score. In addition, other single-chain scores can be generalized to evaluate heteromers in the same fashion as the oligo-lDDT score generalizes the lDDT score to oligomers. Finally, we are also looking at the applicability of the scores used by the CAPRI community for automated evaluation.
It should be noted that the selection of interesting protein target structures is performed regardless of ligand contents, but non-polymer ligands are submitted nonetheless to participating servers that support it. 76% of the structures released by the PDB in 2020, and 65% of the interesting protein structures selected in this category, contain at least one ligand. In addition, we are considering specifically selecting interesting ligand modeling targets, which we describe in the following section.

Non-polymer Ligands

Small chemical compounds which are not part of a polymer chain are provided as InChI codes and PDB chemical components in the pre-release of the PDB. They are included in the target definition together with the polymer entities for participating servers that support predicting small chemical compounds in complex with proteins. Consequently, in addition to predicting the correct protein structure, predictors are challenged to include the ligands in their models at the correct binding site in an accurate conformation.
However, predicting the exact pose of a ligand within a theoretical model remains a challenge that is out of reach for most current protein structure prediction servers. To specifically facilitate the development of such methods, ligand predictions should be evaluated separately from the prediction of protein complexes. We are therefore proposing a specialized CAMEO category, in which easy protein modeling targets (the opposite of the definition in the previous section) are selected if they contain novel ligands that have not been seen in a template.
We analyzed the feasibility of this approach on the current data in the PDB. In 2020, we observed 4870 protein targets that would be trivial to solve with comparative modeling but included a combination of non-polymer ligands never seen before in a template for those structures. Furthermore, 4486 of them were monomeric or homo-oligomeric, which would enable many current protein structure prediction servers to participate without having to implement new modeling approaches for protein complexes.
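The selection criterion described above can be sketched as a simple set comparison. This is a minimal illustration, not the CAMEO selection code: the function name and arguments are hypothetical, and it assumes that a ligand combination counts as "seen" when a single template contains all of the target's ligands, which is one possible reading of the criterion.

```python
def is_ligand_target(target_ligands, template_ligand_sets, easy_protein_target):
    """Return True if an easy protein target carries a novel ligand combination.

    target_ligands: chemical component ids of the target's non-polymer ligands.
    template_ligand_sets: one collection of component ids per template.
    easy_protein_target: True if a close template exists for the protein.
    Assumption: a combination is "seen" if one template contains all of it.
    """
    combo = frozenset(target_ligands)
    # Only easy protein targets that actually contain ligands qualify.
    if not easy_protein_target or not combo:
        return False
    # Novel if no single template contains the full ligand combination.
    return not any(combo <= frozenset(t) for t in template_ligand_sets)
```

For example, a target with ligands `{"HEM", "ZN"}` whose templates contain `{"HEM"}` and `{"ZN", "NAG"}` would be selected, because no single template holds both ligands. Requiring an exact match of the ligand set instead of containment would be a stricter alternative.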
Interestingly, 3491 of these 4870 structures contained a known drug from DrugBank (Wishart et al. 2006). Figure 2 shows a typical example of such a target, the SARS-CoV-2 main protease in complex with Boceprevir, an FDA-approved drug for the treatment of hepatitis C virus infection (Fu et al. 2020). Drug repurposing studies are common in the PDB, and the CAMEO target set is therefore representative of current areas of active research and can help developers assess the performance of their methods on relevant datasets. For instance, 149 DrugBank drug-containing ligand modeling targets were identified by CATH as containing the 3CL-PRO main protease domain 3 (CATH ID 1.10.1840.10), and an additional 70 targets had ligands not known to DrugBank.