2 Results

2.1 Is the ground truth good enough?

To assess the distribution of high-quality PLC crystal structures in the Protein Data Bank (PDB)22, we extracted protein, ligand, and binding pocket information from the PDB validation reports of 114,973 PLC entries, defining the binding pocket as the region within a 6Å radius of the ligand. These entries were solved by X-ray crystallography, have Electron-Density Server (EDS) validation information available in the PDB23–25, and contain at least one protein chain (polymer entity) and at least one non-polymer entity (small molecule ligand or ion). Ligands present in the BioLIP artifact list, which contains 463 frequent crystallization artifacts such as solvents and buffers, were excluded26. This list may also filter out a few biologically relevant ligands; however, this is rare, and we considered the trade-off acceptable for this study.
We analyzed 236,538 small molecule pockets across 75,065 PLC PDB entries and 32,273 unique small molecule ligands, and 798,651 ion pockets across 84,215 PLC and 138 unique ions. In total, this corresponds to over a million pockets (1,035,189).
The authors of the Iridium dataset defined a highly stringent set of criteria for the quality of crystal structures, with emphasis on suitability for pose prediction, virtual screening and binding affinity estimation18. These include criteria on the protein (resolution ≤ 3.5Å, R < 0.4, Rfree < 0.45, absolute difference between R and Rfree ≤ 0.05) as well as ligand and pocket criteria (full density with RSR ≤ 0.1 and RSCC ≥ 0.9, full atom occupancies and no alternative configurations for ligand atoms and protein atoms within 6Å of the ligand).
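As a minimal sketch of how the thresholds listed above combine into a single pass/fail decision, the following Python snippet encodes them as a predicate. The `PocketValidation` record and its field names are hypothetical and introduced only for illustration; they do not correspond to a particular validation report schema, and "full density" is represented here only through the RSR and RSCC thresholds.

```python
from dataclasses import dataclass


@dataclass
class PocketValidation:
    """Hypothetical record of validation values for one binding pocket."""
    resolution: float            # resolution in Å
    r: float                     # R factor
    r_free: float                # Rfree
    ligand_rsr: float            # real-space R-value of the ligand
    ligand_rscc: float           # real-space correlation coefficient of the ligand
    full_occupancy: bool         # ligand and protein atoms within 6 Å fully occupied
    has_alt_configurations: bool # any alternative configurations in ligand or pocket


def passes_iridium(p: PocketValidation) -> bool:
    """Apply the protein-level and ligand/pocket-level thresholds quoted in the text."""
    protein_ok = (
        p.resolution <= 3.5
        and p.r < 0.4
        and p.r_free < 0.45
        and abs(p.r - p.r_free) <= 0.05
    )
    pocket_ok = (
        p.ligand_rsr <= 0.1
        and p.ligand_rscc >= 0.9
        and p.full_occupancy
        and not p.has_alt_configurations
    )
    return protein_ok and pocket_ok


# Illustrative values only, not taken from any PDB entry.
good = PocketValidation(1.8, 0.18, 0.21, 0.08, 0.95, True, False)
weak_density = PocketValidation(1.8, 0.18, 0.21, 0.08, 0.85, True, False)
print(passes_iridium(good), passes_iridium(weak_density))
```

Because every condition must hold at once, a pocket in an otherwise excellent structure is rejected as soon as a single ligand-level value falls short.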
We applied the Iridium criteria to the binding pockets within our set of PLC. Only 0.3% (721) of small molecule pockets, across 504 PLC and 0.98% (315) of unique small molecule ligands, passed the criteria, as did 0.66% (5,248) of ion pockets, across 3,379 PLC and 35.51% (49) of unique ion ligands. In total, 0.58% of all pockets are acceptable according to the Iridium criteria, across 3.21% (3,686) of PLC and 1.12% (364) of unique ligands.
Thus, these criteria are too stringent for both of the applications we explore. For continuous evaluation methods such as CAMEO, which runs on a weekly basis, the majority of PLC, if not all, would be discarded. Similarly, restricting to such a small fraction of the PDB is incompatible with creating a diverse and representative dataset of PLC for comprehensive benchmarking. We suggest alternative “relaxed” criteria that require ligand RSCC > 0.8 and >90% of protein residues within 6Å of the ligand to have RSCC > 0.8, with the remaining criteria the same as Iridium. The threshold of 0.8 for RSCC is in accordance with the widely accepted rule of thumb that 0.8 < RSCC < 0.95 is generally acceptable, RSCC > 0.95 indicates a very good fit, and RSCC < 0.8 indicates that the experimental data may not support the ligand placement27. Such relaxed criteria could be used as a post-filter step in the CAMEO setting. In the benchmarking setting, the stringent Iridium criteria could be used to create the starting set, with more PLC added based on their novelty and the relaxed criteria.
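The two density-related parts of the relaxed criteria can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this study: the function name and its inputs (one RSCC value for the ligand and one per protein residue within 6Å) are hypothetical, the handling of pockets without residue values is our own choice, and the remaining Iridium checks would still need to be applied separately.

```python
from typing import Sequence

RSCC_THRESHOLD = 0.8  # relaxed RSCC threshold from the text


def passes_relaxed_density(ligand_rscc: float, residue_rsccs: Sequence[float]) -> bool:
    """Check ligand RSCC > 0.8 and that >90% of pocket residues have RSCC > 0.8."""
    if ligand_rscc <= RSCC_THRESHOLD:
        return False
    if not residue_rsccs:
        # Assumption: a pocket without residue-level RSCC values cannot be assessed.
        return False
    n_good = sum(1 for value in residue_rsccs if value > RSCC_THRESHOLD)
    # "More than 90%" is strict; integer arithmetic avoids float rounding at the boundary.
    return 10 * n_good > 9 * len(residue_rsccs)


# Illustrative values only: 19 of 20 residues (95%) pass, then 9 of 10 (exactly 90%).
print(passes_relaxed_density(0.85, [0.9] * 19 + [0.5]))
print(passes_relaxed_density(0.85, [0.9] * 9 + [0.5]))
```

The second call fails because exactly 90% of residues pass, whereas the criterion as worded asks for more than 90%.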
Figure 1 shows the distribution of validation data values across all binding pockets as well as the selected relaxed thresholds for four criteria: resolution (Figure 1A), absolute difference between R and Rfree (Figure 1B), RSCC (Figure 1C) and percentage of protein residues within 6Å of the ligand with RSCC > 0.8 (Figure 1D). The most stringent criterion is by far the absolute difference between R and Rfree, which removes almost 15% of the pockets.
We applied these relaxed criteria to the dataset of binding pockets. We found that 44.96% (106,357) of small molecule pockets, across 36,959 PLC and 51.34% (16,568) of unique small molecule ligands, passed the relaxed criteria. Similarly, 48.73% (389,217) of ion pockets, across 55,594 PLC and 89.86% (124) of unique ion ligands, passed. Thus, the criteria retain 47.87% (495,574) of all pockets, spread across 62.38% (71,720) of PLC and 51.50% (16,692) of unique ligands.
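The retained fractions quoted above follow directly from the counts reported in this section, as the short calculation below illustrates. The variable names are ours, and the denominator for unique ligands assumes that the 32,273 unique small molecule ligands and the 138 unique ions do not overlap.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Totals reported for the full dataset.
total_small_molecule_pockets = 236_538
total_ion_pockets = 798_651
total_pockets = total_small_molecule_pockets + total_ion_pockets
total_plc = 114_973
total_unique_ligands = 32_273 + 138  # assumption: ligand and ion sets are disjoint

# Counts passing the relaxed criteria.
passing_small_molecule_pockets = 106_357
passing_ion_pockets = 389_217
passing_pockets = passing_small_molecule_pockets + passing_ion_pockets

print("small molecule pockets:", percent(passing_small_molecule_pockets, total_small_molecule_pockets))
print("ion pockets:", percent(passing_ion_pockets, total_ion_pockets))
print("all pockets:", percent(passing_pockets, total_pockets))
print("PLC:", percent(71_720, total_plc))
print("unique ligands:", percent(16_692, total_unique_ligands))
```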