Figure 1: Target 2020-12-19_00000231 (PDB ID 7K93) is a hetero-2-2mer
complex of a Dengue virus non-structural protein (NS1) (green) bound to
a mouse neutralizing single-chain Fab variable region (orange)
(Biering et al. 2021). While templates can be easily identified with
HHblits for both
entities, there is no overlap between the template lists, meaning the
two proteins have never been observed in a homologous complex.
Specifically, no homologs of this Dengue virus protein have been
observed in complex with an antibody. This therefore constitutes an
interesting target for the modeling of heteromeric protein complexes.
Looking at the data we collected in the 52 pre-release weeks of 2020,
the PDB released 3158 interesting protein structures for which no
closely related homolog could be found with BLAST. Among those, 1017
were monomers, 1011 were homo-oligomeric complexes (which cannot be
distinguished from monomers from the sequence-only pre-release data) and
1130 were hetero-oligomers.
To retrospectively analyse the complexity of the
hetero-oligomeric target set, we repeated the template search with
HHblits to identify more remotely related homologous complexes. For 565
of these 1130 targets, HHblits identified a homologous hetero-oligomeric
complex in which all entities of the target could be uniquely
mapped to the template, and reciprocally. For 240 hetero-oligomeric
complexes, templates for individual entities could be identified with
HHblits, but not in the same complex (or the template contained extra
entities); a further 113 such complexes were identified similarly with BLAST.
These 353 “novel complex” targets are of particular interest, as an
accurate prediction would have to determine the assembly mode
of the complex and accurately model the (unknown) interfaces, thereby
going beyond the classical reach of homology modeling. Finally, for the
remaining 212 complexes, no template could be identified by HHblits for
at least one of the target entities.
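The three-way split of the hetero-oligomeric targets described above can be sketched as a small classification routine. This is an illustrative sketch only, not the CAMEO implementation: the function name, the input dictionaries and the category labels are hypothetical, and the test for a homologous complex is deliberately simplified (a template hit by every target entity that contains exactly as many entities as the target), standing in for the unique, reciprocal entity mapping.

```python
from typing import Dict, Set


def classify_hetero_target(
    entity_hits: Dict[str, Set[str]],
    template_entity_counts: Dict[str, int],
) -> str:
    """Classify one hetero-oligomeric target by its template coverage.

    entity_hits: target entity ID -> set of template IDs found for it
        (assumed to come from a prior template search).
    template_entity_counts: template ID -> number of distinct entities
        in that template (assumed to be known).
    """
    if not entity_hits:
        raise ValueError("target must contain at least one entity")

    # At least one entity without any template hit.
    if any(not hits for hits in entity_hits.values()):
        return "no_template"

    # Templates that were hit by every entity of the target.
    shared = set.intersection(*entity_hits.values())
    n_entities = len(entity_hits)

    # Simplified reciprocity check: the template has no extra entities.
    for template_id in shared:
        if template_entity_counts.get(template_id) == n_entities:
            return "homologous_complex"

    # Every entity has a template, but never together in a matching complex.
    return "novel_complex"


# Hypothetical template IDs, for illustration only.
counts = {"tmplA": 2, "tmplB": 3, "tmplC": 1}
print(classify_hetero_target({"1": {"tmplA"}, "2": {"tmplA", "tmplC"}}, counts))
print(classify_hetero_target({"1": {"tmplB"}, "2": {"tmplC"}}, counts))
print(classify_hetero_target({"1": {"tmplA"}, "2": set()}, counts))
```

A real implementation would also need to verify that each target entity maps to a different template entity, which this sketch does not attempt.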
HHblits was able to identify homologs in the vast majority (1734) of the
2028 monomeric (1017) or homo-oligomeric (1011) interesting protein
structures contained in the CAMEO target set. We note, however, that 43
of the targets could only be mapped to templates in complex with a
different partner. The interfaces are likely to differ from those in the
templates, and we therefore consider these targets interesting
modeling targets for CAMEO. Finally, HHblits was unable to identify a
template for 294 of these targets.
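The counts reported above partition the 3158 interesting targets, which can be checked with a few lines of arithmetic. The dictionary labels and the helper name below are shorthand introduced for this sketch; only the numbers are taken from the text.

```python
from typing import Dict

# Counts as reported in the text; the labels are informal shorthand.
by_oligomeric_state = {"monomer": 1017, "homo-oligomer": 1011, "hetero-oligomer": 1130}
hetero_breakdown = {
    "homologous complex (HHblits)": 565,
    "novel complex (HHblits templates)": 240,
    "novel complex (BLAST templates)": 113,
    "no template for at least one entity": 212,
}
mono_homo_breakdown = {"homolog found (HHblits)": 1734, "no template": 294}


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each count as a percentage of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Each breakdown must add up to the total it subdivides.
assert sum(by_oligomeric_state.values()) == 3158
assert sum(hetero_breakdown.values()) == by_oligomeric_state["hetero-oligomer"]
assert sum(mono_homo_breakdown.values()) == (
    by_oligomeric_state["monomer"] + by_oligomeric_state["homo-oligomer"]
)

for label, pct in percentages(hetero_breakdown).items():
    print(f"{label}: {pct}%")
```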
To evaluate the predictions, we use the same scores as
for the homo-oligomers: oligo-lDDT, QS-score and TM-score. In addition,
other single-chain scores can be generalized to evaluate heteromers in
the same way that the oligo-lDDT score generalizes the lDDT
score to oligomers. Finally, we are also looking at the applicability of
the scores used by the CAPRI community for automated evaluation.
It should be noted that the selection of interesting protein target
structures is performed regardless of ligand content, but non-polymer
ligands are nonetheless submitted to participating servers that support
them. Of the structures released by the PDB in 2020, 76% contain at
least one ligand, as do 65% of the interesting protein structures
selected in this category. In addition, we are considering specifically selecting
interesting ligand modeling targets, which we describe in the following
section.
Non-polymer Ligands
Small chemical compounds which are not part of a polymer chain are
provided as InChI codes and PDB chemical components in the pre-release
of the PDB. They are included in the target definition together with the
polymer entities for participating servers that support predicting small
chemical compounds in complex with proteins. Consequently, in addition
to predicting the correct protein structure, predictors are challenged
to include the ligands in their models at the correct binding site in an
accurate conformation.
However, predicting the exact pose of a ligand within a theoretical model
remains a challenge that is out of reach for most current protein
structure prediction servers. To specifically facilitate the development
of such methods, ligand pose predictions should be evaluated separately
from the prediction of protein complexes. We are therefore proposing a
specialized CAMEO category, in which easy protein modeling targets (the
opposite of the interesting targets defined in the previous section) are
selected if they contain novel ligands that have not been seen in a template.
We analyzed the feasibility of this approach on the current data in the
PDB. In 2020, we observed 4870 protein targets that would be trivial to
solve with comparative modeling but included a combination of
non-polymer ligands never seen before in a template for those
structures. Furthermore, 4486 of them were monomers or homo-oligomers,
which would enable many current protein structure
prediction servers to participate without having to implement new
modeling approaches for protein complexes.
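The selection rule for this category can be sketched as follows. This is a minimal illustration under stated assumptions, not the CAMEO selection code: the function names and arguments are hypothetical, ligands are represented by arbitrary string identifiers (for example chemical component IDs or InChI codes), and a combination is treated as novel when no single template contains all of the target's ligands.

```python
from typing import Iterable


def has_novel_ligand_combination(
    target_ligands: Iterable[str],
    template_ligand_sets: Iterable[Iterable[str]],
) -> bool:
    """True if no single template contains every ligand of the target.

    Assumption: novelty is judged on the full combination of ligands,
    compared against each template separately.
    """
    target = frozenset(target_ligands)
    if not target:
        return False  # a target without ligands is never a ligand target
    return not any(target <= frozenset(t) for t in template_ligand_sets)


def is_ligand_modeling_target(
    is_easy_protein_target: bool,
    target_ligands: Iterable[str],
    template_ligand_sets: Iterable[Iterable[str]],
) -> bool:
    """Easy protein target whose ligand combination is new to its templates."""
    return is_easy_protein_target and has_novel_ligand_combination(
        target_ligands, template_ligand_sets
    )


# Hypothetical ligand identifiers, for illustration only.
templates = [{"LIG_A"}, {"LIG_A", "LIG_B"}]
print(is_ligand_modeling_target(True, {"LIG_A", "LIG_C"}, templates))
print(is_ligand_modeling_target(True, {"LIG_A", "LIG_B"}, templates))
print(is_ligand_modeling_target(False, {"LIG_C"}, templates))
```

Other definitions of novelty are possible, for instance requiring that at least one individual ligand is absent from every template; the subset test above is only one reading of "a combination never seen before".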
Interestingly, 3491 of these 4870 structures contained a known drug from
DrugBank (Wishart et al. 2006). Figure 2 shows a typical example of such
a target, the SARS-CoV-2 main protease in complex with Boceprevir, an
FDA-approved drug for the treatment of the hepatitis C virus
(Fu et al. 2020). Drug
repurposing studies are common in the PDB, and the CAMEO target set is
therefore representative of current areas of active research and can
help developers to assess the performance of their methods on relevant
datasets. For instance, 149 ligand modeling targets containing a
DrugBank drug were identified by CATH as containing the 3CL-PRO main
protease domain 3 (CATH ID 1.10.1840.10), and an additional 70 targets
had ligands not known to DrugBank.