Figure 3: Target 2020-05-30_00000276 (PDB ID 6LQF) is an ARID-PHD
protein cassette in complex with a peptide, DNA and zinc ions
(Tan et al. 2020). The
protein has only remote similarity (< 30% sequence identity)
to known structures, and none of them are in complex with DNA or the
H3K4me3 peptide, making it an extremely challenging target. We are not
aware of any method that is currently able to model this type of
complex with acceptable accuracy. Note that the peptide
contains a non-canonical residue (N-trimethyllysine, derived from
lysine).
In 2020, following the criteria outlined in the previous sections, we
observed 983 structures containing more than one type of polymer
entity. All of them were proteins in complex with peptides (421), DNA
(279), RNA (199), DNA and RNA (52) or both peptides and nucleic acids
(32).
We believe that, with appropriate extensions, some of the scores selected
for the individual target types, such as the oligo-lDDT and the CAD-score,
will be applicable to evaluate all these targets in a consistent manner.
Non-canonical amino acids and bases
Macromolecular structures frequently contain amino acid or nucleotide
residues that are not among the 20 standard amino acids or,
respectively, the 8 standard nucleotides. Traditionally, for modeling
purposes, the target sequences are canonicalized, that is, modified
residues are represented by their “parent” or closest canonical
residue. However, this may result in suboptimal models that do not
accurately represent the region containing the modification.
Post-translational modifications such as phosphorylations can result in
significant conformational changes of the protein structure, which
would be impossible to model correctly without knowledge of the
modification.
As this information is available at the time of pre-release, CAMEO can
provide sequences containing non-canonical residues on an opt-in basis
(Figure 3). In this case, sequences will contain the PDB component
identifier (typically 3 letters) enclosed in round brackets, in place of
the parent amino acids. Models correctly representing those residues are
expected to obtain higher scores for the all-atom measures such as the
lDDT or the CAD-score.
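As a rough illustration of the bracketed notation described above, the following Python sketch splits such a sequence into residues and recovers the canonicalized form. The example sequence, the function names `tokenize` and `canonicalize`, and the small parent table are assumptions made for illustration only. They are not taken from the CAMEO implementation, and a real pipeline would look up parent residues in the PDB Chemical Component Dictionary.

```python
import re

# Illustrative subset only (assumption): PDB component ID -> one-letter parent.
# M3L = N-trimethyllysine, SEP = phosphoserine,
# TPO = phosphothreonine, PTR = phosphotyrosine.
PARENT = {"M3L": "K", "SEP": "S", "TPO": "T", "PTR": "Y"}

# Either a component ID in round brackets, or a single one-letter residue.
TOKEN = re.compile(r"\(([A-Za-z0-9]+)\)|([A-Za-z])")


def tokenize(seq):
    """Split a sequence into (code, is_modified) pairs.

    Bracketed component IDs are kept whole and flagged as modified.
    """
    tokens = []
    pos = 0
    for match in TOKEN.finditer(seq):
        if match.start() != pos:
            # finditer silently skips unmatched text, so detect gaps here.
            raise ValueError(f"unexpected character at index {pos}: {seq[pos]!r}")
        if match.group(1) is not None:
            tokens.append((match.group(1).upper(), True))
        else:
            tokens.append((match.group(2).upper(), False))
        pos = match.end()
    if pos != len(seq):
        raise ValueError(f"unexpected character at index {pos}: {seq[pos]!r}")
    return tokens


def canonicalize(tokens, parent=PARENT, unknown="X"):
    """Replace each modified residue by its parent one-letter code."""
    return "".join(
        parent.get(code, unknown) if modified else code
        for code, modified in tokens
    )


# Hypothetical peptide with a trimethylated lysine at position 4.
tokens = tokenize("ART(M3L)QTARKS")
print(tokens[3])             # ('M3L', True)
print(canonicalize(tokens))  # ARTKQTARKS
```

Keeping each residue as a (code, flag) pair rather than a bare string avoids ambiguity if a bracketed identifier happens to be a single character, and it preserves the one-token-per-residue indexing that residue-level scores rely on.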
In 2020, 444 of the 4323 protein, DNA, RNA and mixed structures and
complexes we observed contained non-canonical residues. We observed
these non-canonical residues in proteins (286), peptides (112), DNA (35)
and RNA (27). Sixteen of them were observed in mixed complexes.
Current implementation status of CAMEO
At the time of writing, the CAMEO “Structures & Complexes”
functionality is available as a beta version at https://beta.cameo3d.org/ and
is open for registrations. It has been providing targets containing
proteins, DNA and RNA to registered servers on a weekly basis since
October 2020. Participants can currently choose to receive the
non-polymer ligands contained in these targets as InChI codes or PDB
component IDs, as well as non-canonicalized sequences including modified
residues. Predictions can be returned in PDB or mmCIF format, and are
assessed with a fully automated pipeline including the oligo-lDDT and
QS-scores. A weekly download of models, reference structures and
assessment results is made available for offline analysis.
Our next steps will be to refine the target selection process,
especially with respect to selecting relevant ligand targets as
described in the previous sections. We are exploring ways to increase
the diversity of the target selection, while ensuring that as many
participants as possible receive a common subset of targets in order to
make comparisons between servers possible for some aspects of the
evaluation. We aim to improve the scoring by providing more diverse
scores as described in the previous sections. Most groups developing
novel methods have implemented their own scoring workflows locally. We
therefore currently consider the raw data downloads of the prediction
results a crucial service to the community developing specialized
prediction methods, as they allow independent blind prediction data to
be included in publications describing new methods.