2.4 - Quantification of Experimental Error and Model Validation
To develop a model for epistasis, it is important to quantify how much of the observed epistasis could be attributed to error, or noise, in the experimental data. Quantification of overall error is based on the error in three values (ΔΔG1,2, ΔΔG1, ΔΔG2), each of which was determined using a broad range of techniques and conditions from diverse studies (e.g., more than 60 studies for binding). A survey of six studies that contained some of the largest observed epistasis for binding showed the experimental standard error for ΔΔG to be in the range 0.05–0.3 kcal/mol [55–57].
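Because epistasis is computed from three measured quantities, the per-measurement errors combine. A minimal sketch of this propagation, assuming the usual definition ϵ = ΔΔG1,2 − (ΔΔG1 + ΔΔG2) and independent errors in the three terms (both assumptions, not stated explicitly in this passage):

```python
import math

def epistasis_error(se_ddg12, se_ddg1, se_ddg2):
    """Propagated standard error of eps = ddG_12 - (ddG_1 + ddG_2),
    assuming the three ddG errors are independent (quadrature sum)."""
    return math.sqrt(se_ddg12**2 + se_ddg1**2 + se_ddg2**2)

# Per-ddG errors at the ends of the surveyed 0.05-0.3 kcal/mol range:
print(round(epistasis_error(0.05, 0.05, 0.05), 3))  # -> 0.087 kcal/mol
print(round(epistasis_error(0.3, 0.3, 0.3), 3))     # -> 0.52 kcal/mol
```

Note that per-ΔΔG errors at the top of the surveyed range already propagate to roughly 0.5 kcal/mol in ϵ, consistent with the errors reported for coupling energies below.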
However, some studies do explicitly include the error for epistasis (frequently termed the coupling energy). For example, in the case of barnase-barstar, Schreiber et al. report errors in ϵ of 0.2–0.39 kcal/mol across 33 mutation pairs [58], and Goldman et al. report an error of 0.3 kcal/mol across 13 pairs for an idiotype-anti-idiotype protein-protein complex [59].
There are outliers, such as the study from Pielak et al. on six mutational pairs in the Iso-1-cytochrome c peroxidase complex [60], which found an unusually large error range of 0.4–1.0 kcal/mol, with an average error of 0.75 kcal/mol. In summary, the reported errors for our curated binding and folding datasets fall in the range of 0.2–1.0 kcal/mol, with a mean around 0.4 kcal/mol. For the remainder of this study, we will use a slightly more conservative estimated error of 0.5 kcal/mol to quantify the amount of observed epistasis.
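In practice, the 0.5 kcal/mol estimate serves as a significance cutoff: only pairs whose measured ϵ exceeds it in magnitude are counted as showing epistasis beyond experimental noise. A minimal sketch, using hypothetical illustrative values:

```python
# Hypothetical measured epistasis values (kcal/mol) for mutation pairs.
observed_eps = [0.1, -0.8, 0.45, 1.3, -0.2, 0.6]

ERROR_ESTIMATE = 0.5  # conservative experimental error adopted in the text

# Keep only pairs whose epistasis magnitude exceeds the error estimate.
significant = [e for e in observed_eps if abs(e) > ERROR_ESTIMATE]
print(len(significant), "of", len(observed_eps), "pairs exceed the error estimate")
# -> 3 of 6 pairs exceed the error estimate
```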
Since our binding and folding data come from many different protein systems, collected with a diversity of methodologies across laboratories, there is an inherent imbalance in the quantity and quality of data for each system. To test the robustness of our model to this bias, we
applied a modified “leave-one-out” procedure. We randomly removed 10%
of the protein systems and their data, creating a subset from the
remaining 90% of systems. The model selection procedure was performed
on this subset to generate a new model. This process of removing 10% of
the systems and running model selection was repeated 100 times. The
resulting 100 subset models were analyzed and compared to determine
which terms appeared, their frequency of appearance, and average
performance or ranking when present in a model.
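The resampling procedure above can be sketched as follows. The `model_selection` argument is a stand-in for the actual selection procedure (not specified in this passage); here it is any function that takes a dict of systems and returns the terms of the chosen model, and the tally records how often each term survives the repeated 10% removals:

```python
import random
from collections import Counter

def subset_robustness(systems, model_selection, n_iter=100, frac_remove=0.10, seed=0):
    """Repeatedly drop ~10% of protein systems, rerun model selection on the
    remaining ~90%, and tally how often each model term is selected.

    systems: dict mapping system name -> its data.
    model_selection: stand-in callable returning the selected model's terms.
    """
    rng = random.Random(seed)
    names = list(systems)
    n_remove = max(1, round(frac_remove * len(names)))
    term_counts = Counter()
    for _ in range(n_iter):
        removed = set(rng.sample(names, n_remove))
        subset = {k: v for k, v in systems.items() if k not in removed}
        for term in model_selection(subset):
            term_counts[term] += 1
    return term_counts

# Toy stand-in: "model selection" returns the union of terms in the subset.
toy_systems = {f"sys{i}": ["additive"] + (["coupling"] if i < 3 else [])
               for i in range(10)}
counts = subset_robustness(
    toy_systems, lambda s: sorted({t for terms in s.values() for t in terms}))
print(counts["additive"])  # -> 100 (selected in every subset)
```

A term's count out of 100 is its frequency of appearance; average performance or ranking would be accumulated the same way, alongside the counts.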