Introduction

The 2020 CASP14 experiment saw an unprecedented improvement in the performance of 3D protein structure prediction. One method (AlphaFold2) was able to generate highly accurate predictions even for the most challenging de novo targets. Beyond the CASP community, this breakthrough has implications for the entire field of structural biology: accurately predicting the structure of a single protein chain has never been closer to being considered a solved problem. Far from being the end of structure prediction, however, this might instead be the beginning of a new era in the 3D modeling of biomolecular structures. Areas that have so far been limited by the inability to produce sufficiently accurate de novo protein models in the first place, such as the prediction of protein-ligand interactions, large macromolecular complexes and assemblies, or variant effects, might now be within reach of the next generation of structure prediction methods. Independent blind assessment of these techniques will be required more than ever to support the development of reliable and reproducible methods. To help the community tackle these challenges, we are introducing an extension of CAMEO (available at beta.cameo3d.org) with the aim of shifting the focus from the prediction of individual protein chains to the prediction of macromolecular complexes as determined experimentally by X-ray crystallography or, increasingly, cryo-EM techniques and deposited in the PDB (wwPDB consortium et al. 2018).
In this new CAMEO category, participating methods receive as prediction targets the sequences of all unique polymer chains, as well as the InChI codes of the non-polymer entities composing the complex. The challenges of the modeling task are to: 1) predict the stoichiometry of the complex; 2) predict the 3D structure of all the components (proteins, peptides, DNA, RNA and ligands), including their orientation and interfaces; and 3) provide per-residue confidence estimates of the model. This CAMEO category is based on an opt-in model: participants only receive the target type(s) their method is able to model. This means that a method that only predicts single protein chains can still participate and will receive the targets composed of only one protein sequence, which can be either monomers or homo-oligomers, while another method by the same group might be designed to predict, for example, complexes of proteins with drug-like small molecules.
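As an illustration only, the opt-in dispatch described above could be sketched in Python as follows. All names here (`Kind`, `Target`, `Method`, `supported_kinds`, `max_unique_polymers`) and the example sequence are hypothetical; they are assumptions made for this sketch and do not reflect the actual CAMEO implementation or submission format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class Kind(Enum):
    """Component kinds named in the text (hypothetical enumeration)."""
    PROTEIN = "protein"
    PEPTIDE = "peptide"
    DNA = "dna"
    RNA = "rna"
    LIGAND = "ligand"


@dataclass(frozen=True)
class Target:
    # Unique polymer chains as (kind, sequence) pairs. Stoichiometry is
    # deliberately absent: predicting it is part of the modeling task.
    polymers: Tuple[Tuple[Kind, str], ...]
    # InChI strings of the non-polymer entities in the complex.
    ligand_inchis: Tuple[str, ...] = ()

    def kinds(self) -> FrozenSet[Kind]:
        found = {kind for kind, _ in self.polymers}
        if self.ligand_inchis:
            found.add(Kind.LIGAND)
        return frozenset(found)


@dataclass(frozen=True)
class Method:
    name: str
    supported_kinds: FrozenSet[Kind]
    # Assumed knob: upper bound on unique polymer sequences (None = no limit).
    max_unique_polymers: Optional[int] = None

    def accepts(self, target: Target) -> bool:
        # A method only receives targets whose components it can all model.
        if not target.kinds() <= self.supported_kinds:
            return False
        if (self.max_unique_polymers is not None
                and len(target.polymers) > self.max_unique_polymers):
            return False
        return True


# Made-up example data: a short invented sequence and the InChI of ethanol.
protein_only = Target(polymers=((Kind.PROTEIN, "MKTAYIAKQR"),))
protein_ligand = Target(
    polymers=((Kind.PROTEIN, "MKTAYIAKQR"),),
    ligand_inchis=("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",),
)

single_chain = Method("single-chain", frozenset({Kind.PROTEIN}), 1)
ligand_aware = Method("protein-ligand", frozenset({Kind.PROTEIN, Kind.LIGAND}))

for method in (single_chain, ligand_aware):
    received = [t for t in (protein_only, protein_ligand) if method.accepts(t)]
    print(method.name, "receives", len(received), "target(s)")
```

In this sketch, a target with a single unique protein sequence covers both monomers and homo-oligomers, because the number of copies is left for the method to predict.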
In this manuscript, we describe the different types of prediction targets that CAMEO enables in the new category, and estimate the expected number of validation targets for each type based on PDB statistics observed in 2020. One major challenge will be scoring the new types of predictions against the actual experimental structures. Wherever appropriate, we comment on the scores that we foresee applying to the various prediction types. We welcome feedback from the community regarding complementary scoring approaches.