Figure 2: Protein-ligand complexes (PLC) released per year (brown and orange) and those passing the relaxed quality criteria (green and blue), divided according to sequence identity to PLC seen in previous years. The left two bars of each year (brown and green) are PLC with ligand combinations that differ from previous PLC; the right two bars (orange and blue) are PLC containing the same set of ligands as a matching PLC at that sequence identity.
This approach can be used in CAMEO to select the set of PLC to send out for prediction without sacrificing too many PLC, while ensuring that predictors do not waste resources on previously seen PLC or those with very similar templates. However, it has some shortcomings, mainly due to the limited information available to CAMEO when selecting targets: only the unique protein sequences and the ligand chemical identities.
First, highly redundant regions or pockets in a PLC might be classified as novel due to the presence of other, genuinely novel pockets elsewhere in the complex. Second, small-molecule binding poses, even for the same or very similar chemical compounds, can vary significantly within the same protein due to different protein conformations or a small number of mutations in crucial binding regions. This cannot be accounted for in the CAMEO pre-filtering step, but it is useful information for evaluation and essential for creating a representative dataset. We therefore recommend utilizing structure and binding-pocket clustering on the protein side and 3D ligand conformation clustering on the small-molecule side. The same considerations apply to the oligomeric state of each entity and the stoichiometry of each ligand in a PLC, information that is not available from the PDB pre-release. These factors are particularly important when the same ligand is present in different protein pockets, or when a ligand is involved in protein oligomerization. This information must therefore be incorporated both for assessment and when creating a representative benchmark dataset, and will be explored in future efforts.
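The 3D ligand conformation clustering mentioned above can be illustrated with a minimal sketch. The greedy leader clustering below is one simple choice (not a prescription of any particular tool); it assumes each pose is an (N, 3) coordinate array with identical atom ordering, and that poses are already in a common protein frame, so a plain coordinate RMSD measures pose differences.

```python
import numpy as np

def ligand_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Plain coordinate RMSD between two poses of the same ligand.

    Assumes identical atom ordering and no superposition, so that pose
    differences within a fixed protein frame are measured directly.
    """
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def greedy_cluster(poses, threshold=2.0):
    """Greedy leader clustering of ligand poses.

    Each pose joins the first cluster whose representative is within
    `threshold` Angstrom RMSD, otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `poses`.
    """
    clusters = []  # list of (representative pose, member-index list)
    for i, coords in enumerate(poses):
        for rep, members in clusters:
            if ligand_rmsd(rep, coords) < threshold:
                members.append(i)
                break
        else:
            clusters.append((coords, [i]))
    return [members for _, members in clusters]
```

The 2 Å threshold here is a hypothetical default, chosen only because it matches the success criterion used later in the text; a production pipeline would also need an atom-mapping step for symmetric ligands.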

2.3 Can we automatically score predicted protein-ligand complexes?

We developed an automated benchmarking workflow consisting of two components: (1) preprocessing, input preparation, set-up, and execution of five PLC prediction tools (AutoDock Vina30,31, SMINA32, GNINA33, DiffDock34, and TankBind35) with different input parameters, and (2) assessment of the PLC prediction results using different scoring metrics. The workflow is implemented in Nextflow36 to enable efficient parallelization and distributed execution, making it well suited for large datasets and computationally intensive tasks. Each process is encapsulated in a module, with dependencies managed using Conda37 or Singularity38. The resources for each step in the pipeline are defined individually, ensuring that only the required resources are reserved and that failed processes are automatically restarted with increased resources. Upon completion, all predicted binding poses are collected and a summary of scores is created, along with a report on resource usage across the evaluated tools.
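The restart-with-increased-resources policy is handled natively by Nextflow; as a language-agnostic illustration of the idea, a minimal Python sketch follows. Here `step` is a hypothetical stand-in for launching one docking job with a given memory budget, and the linear escalation schedule is an assumption, not the workflow's actual configuration.

```python
def run_with_retries(step, base_mem_gb=4, max_attempts=3):
    """Sketch of a retry policy that reruns a failed step with more memory.

    `step` is a callable taking a memory budget in GB and returning True on
    success (hypothetical stand-in for submitting a docking job). The memory
    budget grows linearly with the attempt number: 4, 8, 12 GB by default.
    """
    for attempt in range(1, max_attempts + 1):
        mem_gb = base_mem_gb * attempt
        if step(mem_gb):
            return mem_gb  # the budget that sufficed
    raise RuntimeError(f"step failed after {max_attempts} attempts")
```

This mirrors the pattern of per-process resource definitions: only the budget actually needed is ever reserved, and transient out-of-memory failures resolve themselves without manual intervention.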
We ran this workflow on the PDBBind time-split test set of 363 protein-ligand pockets. As the two most recent deep learning tools in our set, TankBind and DiffDock, are trained on the remaining proteins in PDBBind, this is currently the fairest set to use for their evaluation. However, it is important to emphasize that the aim of this experiment is to demonstrate the feasibility of an automated benchmarking workflow, not to comprehensively evaluate the tools, given the issues in this test set discussed in the previous sections.
As these tools already take a protein structure as input, and we are interested in extending this to settings where the structure itself may be computationally modeled or in a different conformation, we also evaluated PLC prediction results on 256 AlphaFold39 structures of monomeric proteins from the same test set. 77% (197) of the AlphaFold models are within 2 Å RMSD of the crystal structure.
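The model-to-crystal RMSD underlying the 2 Å criterion requires an optimal superposition first. A self-contained sketch using the standard Kabsch algorithm is shown below; it assumes the two structures are already reduced to matched coordinate arrays (e.g. Cα atoms of aligned residues), which in practice a structure-comparison tool would provide.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between matched (N, 3) coordinate sets after optimal
    superposition (Kabsch algorithm): center both sets, find the rotation
    minimizing the residual via SVD, then compute the RMSD."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against improper rotation
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))
```

Applying this over all 256 model-crystal pairs and counting values below 2 Å reproduces the kind of statistic quoted above (77%), though the paper's exact atom selection and alignment procedure are not specified here.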
To demonstrate the workflow in different input settings, we used P2Rank40 to detect pockets in each protein in the test set and report results in two scenarios: Blind docking, the worst-case scenario for docking tools, where no indication is provided of the possible location of the ligand; and Best pocket docking, the best-case scenario, where the correct binding pocket is known and used to define the docking search space. P2Rank predicted the center of the correct binding pocket for 89.2% (324) of the receptors within an 8 Å distance of the true binding-site center, defined as the mean coordinate of the ligand in the pocket. For the AlphaFold-modeled receptors, the percentage was 81.1% (206), where the ground-truth pocket is defined by structural superposition of the model with the reference structure. For the evaluation of Best pocket docking, the P2Rank pocket with the smallest distance to the true binding-site center was considered the best pocket.
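The pocket-selection criterion described above reduces to a small computation: take the ligand centroid as the true binding-site center, measure its distance to each predicted pocket center, pick the closest, and flag a hit if that distance is within 8 Å. A minimal sketch (function name and return shape are our own, not P2Rank's API):

```python
import numpy as np

def best_pocket(pocket_centers, ligand_coords, cutoff=8.0):
    """Select the predicted pocket closest to the true binding-site center,
    defined as the mean coordinate of the bound ligand.

    Returns (index of best pocket, its distance in Angstrom,
    whether it counts as correct at the `cutoff` distance)."""
    site = np.asarray(ligand_coords, dtype=float).mean(axis=0)
    dists = np.linalg.norm(np.asarray(pocket_centers, dtype=float) - site, axis=1)
    i = int(np.argmin(dists))
    return i, float(dists[i]), bool(dists[i] <= cutoff)
```

For modeled receptors, the same routine applies after the ligand coordinates have been transferred into the model's frame by structural superposition, as described in the text.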
The reporting workflow utilizes BiSyRMSD (referred to as RMSD) and lDDT-PLI to evaluate the predicted ligand structures generated by the different docking methods. Both are novel scoring metrics developed for the CASP15 CASP-PLI experiment9 that consider both the predicted protein structure and the predicted ligand conformation. In addition, lDDT-PLI focuses on the interactions between protein and ligand atoms. Table 1 and Table 2 display the outcomes for PLC prediction using the 363 receptors from the PDBBind test set and the 256 AlphaFold-modeled receptors, respectively. The full results are available as Supplementary Tables 1 and 2 for the experimentally solved and AlphaFold-modeled receptors, respectively. The highest-ranked pose (top-1) and the best-scored pose out of the top-5 ranked poses (where the ranking is an output of each tool) are assessed for blind docking, where the entire protein is used to define the search box. Furthermore, for all tools except DiffDock, where this option is not available, the same assessment is carried out for the best-case scenario, using the best pocket to define the search box. Figure 3 depicts the distributions of these scores for the top-1 and best-of-top-5 poses for experimental and modeled receptors in both docking scenarios.
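To convey the flavor of an interaction-centered score, the sketch below implements a simplified lDDT-PLI-style measure; it is an illustration of the underlying idea (preservation of reference protein-ligand distances within tolerance thresholds), not the official CASP15/OpenStructure implementation, and the 6 Å inclusion radius and one-to-one atom mapping are our assumptions.

```python
import numpy as np

def lddt_pli_sketch(ref_prot, ref_lig, mod_prot, mod_lig,
                    radius=6.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified lDDT-PLI-style score over protein-ligand atom pairs.

    Contacts are all protein-ligand pairs within `radius` Angstrom in the
    reference. For each tolerance threshold, the fraction of contact
    distances reproduced in the model within that tolerance is computed;
    the score is the mean over thresholds. Inputs are (N, 3) arrays with a
    one-to-one atom mapping between reference and model.
    """
    dref = np.linalg.norm(ref_prot[:, None, :] - ref_lig[None, :, :], axis=2)
    dmod = np.linalg.norm(mod_prot[:, None, :] - mod_lig[None, :, :], axis=2)
    mask = dref < radius                 # contacts defined on the reference
    if not mask.any():
        return float("nan")
    diff = np.abs(dref - dmod)[mask]
    return float(np.mean([(diff < t).mean() for t in thresholds]))
```

A perfectly reproduced interface scores 1.0, while a ligand displaced far from its reference pose scores near 0, which is the behavior that makes the metric complementary to a whole-complex RMSD.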
Table 1: Prediction of small-molecule binding to crystallized protein structures from the PDBBind test set containing 363 PLC. For some PLC the pipeline did not complete successfully. Shown are the number of PLC (n), the success rate (SR), defined as the percentage of predictions with RMSD < 2 Å, the median RMSD, the mean lDDT-PLI, and the standard deviation of lDDT-PLI. DiffDock does not use a pocket definition. TankBind gives only one prediction per search box.
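The summary statistics reported in the table reduce to a few lines per tool. A minimal sketch (field names are ours, chosen to mirror the caption):

```python
import numpy as np

def summarize(rmsds, lddt_plis):
    """Per-tool summary in the style of Table 1: success rate (percentage of
    predictions with RMSD < 2 Angstrom), median RMSD, and the mean and
    standard deviation of lDDT-PLI."""
    rmsds = np.asarray(rmsds, dtype=float)
    lddt = np.asarray(lddt_plis, dtype=float)
    return {
        "n": len(rmsds),
        "SR": float((rmsds < 2.0).mean() * 100),  # percentage
        "median_RMSD": float(np.median(rmsds)),
        "mean_lDDT-PLI": float(lddt.mean()),
        "std_lDDT-PLI": float(lddt.std()),
    }
```

Because some PLC do not complete the pipeline, `n` can differ between tools; statistics are computed only over the successfully processed complexes, as noted in the caption.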