2 Results
2.1 Is the ground truth good enough?
To assess the distribution of high-quality crystal structures of PLC in
the Protein Data Bank (PDB)22, we
extracted protein, ligand, and binding pocket (defined as a 6Å radius
around the ligand) information from PDB validation reports from 114,973
PLC entries in the PDB solved by X-ray crystallography for which
Electron-Density Server (EDS) validation information is made available
in the PDB23–25,
and which contain at least one protein chain (polymer entity) and at
least one non-polymer entity (small molecule ligand or ion). Ligands
present in the BioLIP artifact list were excluded26. This
list contains 463 frequent crystallization artifacts such as solvents
and buffers. It may also filter out a few biologically relevant ligands;
however, this is rare, and we considered the trade-off acceptable for
this study.
We analyzed 236,538 small molecule pockets across 75,065 PLC PDB entries
and 32,273 unique small-molecule ligands, and 798,651 ion pockets across
84,215 PLC and 138 unique ions. In total, this corresponds to over a
million pockets.
The authors of the Iridium dataset defined a highly stringent set of
criteria regarding the quality of crystal structures, with emphasis on
the suitability for pose prediction, virtual screening and binding
affinity estimation18. These
include criteria on the protein (resolution ≤ 3.5Å, R < 0.4,
Rfree < 0.45, absolute difference between R
and Rfree ≤ 0.05) as well as ligand and pocket criteria
(full density with RSR ≤ 0.1 and RSCC ≥ 0.9, full atom occupancies and
no alternative configurations for ligand atoms and protein atoms within
6Å of the ligand).
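
As an illustration, the Iridium thresholds listed above can be written as
a single predicate over per-pocket validation values. This is a minimal
sketch, not the implementation used in this study: the `PocketValidation`
record and its field names are hypothetical, and we assume that the
occupancy and alternative-configuration checks have already been reduced
to two booleans covering the ligand atoms and the protein atoms within 6Å
of the ligand.

```python
from dataclasses import dataclass


@dataclass
class PocketValidation:
    """Hypothetical per-pocket record; field names are illustrative only."""
    resolution: float      # in Å
    r: float               # R value of the entry
    r_free: float          # Rfree value of the entry
    ligand_rsr: float      # real-space R value of the ligand
    ligand_rscc: float     # real-space correlation coefficient of the ligand
    full_occupancy: bool   # ligand and pocket atoms all have full occupancy
    has_altlocs: bool      # any alternative configuration in ligand or pocket


def passes_iridium(p: PocketValidation) -> bool:
    """Apply the thresholds exactly as quoted in the text."""
    protein_ok = (
        p.resolution <= 3.5
        and p.r < 0.4
        and p.r_free < 0.45
        and abs(p.r - p.r_free) <= 0.05
    )
    ligand_ok = (
        p.ligand_rsr <= 0.1
        and p.ligand_rscc >= 0.9
        and p.full_occupancy
        and not p.has_altlocs
    )
    return protein_ok and ligand_ok


# Example values are made up for illustration.
good = PocketValidation(1.8, 0.18, 0.21, 0.08, 0.95, True, False)
weak_density = PocketValidation(1.8, 0.18, 0.21, 0.08, 0.85, True, False)
```

With these made-up values, `good` satisfies every threshold, whereas
`weak_density` fails only the ligand RSCC ≥ 0.9 requirement.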
We applied the Iridium criteria to the binding pockets within our set of
PLC. Only 0.3% (721) of small molecule pockets across 504 PLC and
0.98% (315) of unique small molecule ligands, and 0.66% (5,248) of ion
pockets across 3,379 PLC and 35.51% (49) of unique ion ligands passed
the criteria. In total, 0.58% of all pockets are acceptable according
to the Iridium criteria, across 3.21% (3,686) of PLC and 1.12% (364)
of unique ligands.
Thus, these criteria are too stringent for both of the applications we
explore. For continuous evaluation methods such as CAMEO, which runs on
a weekly basis, most, if not all, PLC would be discarded.
Similarly, restricting to such a small fraction of the PDB is
incompatible with creating a diverse and representative dataset of PLC
for comprehensive benchmarking. We suggest alternative “relaxed”
criteria that require a ligand RSCC > 0.8 and that >90% of protein
residues within 6Å of the ligand have RSCC > 0.8, with the remaining
criteria the same as Iridium. The threshold of 0.8 for RSCC is in
accordance with the widely accepted rule of thumb that values of 0.8 <
RSCC < 0.95 are generally acceptable, RSCC > 0.95 indicates a very good
fit, and RSCC < 0.8 indicates that the experimental data may not accord
with the ligand placement27. Such a set of relaxed criteria could be
used as a post-filter step in the CAMEO setting. In the benchmarking
setting, the stringent Iridium criteria could be used to create the
starting set, with more PLC being added based on their novelty and the
relaxed criteria.
Figure 1 shows the distribution of validation data values across all
binding pockets as well as the selected relaxed thresholds for four
criteria: resolution (Figure 1A), absolute difference between R and
Rfree (Figure 1B), RSCC (Figure 1C) and percentage of
protein residues within 6Å of the ligand with RSCC > 0.8
(Figure 1D). The most stringent criterion is by far the absolute
difference between R and Rfree, which removes almost
15% of the pockets.
We applied these relaxed criteria to the dataset of binding pockets. We
found that 44.96% (106,357) of small molecule pockets across 36,959 PLC
and 51.34% (16,568) of unique small molecule ligands passed the relaxed
criteria. Similarly, 48.73% (389,217) of ion pockets across 55,594 PLC
and 89.86% (124) of unique ion ligands passed. Thus, the relaxed criteria
retain 47.87% (495,574) of all pockets, spread across 62.38% (71,720)
of PLC and 51.50% (16,692) of unique ligands.
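
The pooled percentage follows directly from the pocket counts reported
above, as the short check below shows. The counts are taken from the
text; the variable and function names are hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Pocket counts as reported in the text.
small_total, ion_total = 236_538, 798_651
small_pass, ion_pass = 106_357, 389_217

all_total = small_total + ion_total   # 1,035,189 pockets in total
all_pass = small_pass + ion_pass      # 495,574 pockets pass the relaxed criteria

small_pct = percent(small_pass, small_total)  # small molecule pockets
ion_pct = percent(ion_pass, ion_total)        # ion pockets
all_pct = percent(all_pass, all_total)        # all pockets pooled
```

Since ion pockets outnumber small molecule pockets by more than three to
one, the pooled rate lies closer to the ion rate than to the small
molecule rate.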