3. Results
To build a statistical model for epistasis in proteins we used data for
binding curated from SKEMPI v2.0 (572 mutation pairs), and for folding
curated from ProTherm 4 (204 mutation pairs). We first considered the
extent to which epistasis was present in our data set. To determine
this, we defined an epistasis cutoff; values where |ϵ|
is larger than the cutoff are considered epistatic, and other values are
not. Ideally, the cutoff would be chosen based on the experimental error
or uncertainty, however, given that our data come from a broad spectrum
of methods and sources, this is not possible to determine for the
dataset as a whole (see supplemental Figure S1 for the dataset divisions
with various cutoffs).
Figure 1 shows the free energy change of the double mutant as a function
of the sum of individual free energies for both binding and folding
datasets with a cutoff of 0.5 kcal/mol. Both figures 1 and 2 show that
epistasis is present in binding and folding. In both datasets there is a
marked trend for large sums of constituent single mutations (sum in EQ
1) to correspond to a double mutant with free energy falling below the
1:1 line (i.e., more stabilizing that predicted by additivity). The
opposite is true for constituent mutations with smaller sums.
After ascertaining the extent to which epistasis is present in our data,
we investigated how well the separation between mutation sites could
explain the epistatic effect. Figure 2 shows the relationship between
separation distance and the observed epistasis for binding (top) and
folding (bottom). Both show the general expected trend of less epistasis
as separation increases. Both also show a larger number of data points
for distances with the largest ϵ values, or spread in ϵ (around 6-10 Å).
Figure 3 shows our analysis to determine whether the apparent decrease
in epistasis with increasing separation distance (Figure 2) is due to an
actual relationship or a consequence of the larger number of data points
at small distances. Figure 3A shows null model (σ(ϵ) is not a function
of r) and alternative model (σ(ϵ) exponentially decreases as a function
of r) for the likelihood ratio analysis. Figure 3B shows the simulated
distribution of the likelihood ratio, \(\Lambda\),from the analysis with
1000 samples. The experimentally observed likelihood ratio is well
outside the distribution of null ratios given by the label “EXP” and
has a value of -5.50 compared with the tail of the simulation
distribution minimum of -4.59. In simple terms, this results in a
p-value of p < 1/1000 (p<0.001) in strong support of
the alternative model.
Table 1 shows a summary of the binding (1A) and folding (1B) statistical
models for epistasis in protein systems. Both models have similar
predictive power in the range of 25-30%. The final selected binding
model contains all features that we considered except for hydrophobicity
(seven features, 28 terms including interactions) and depends on
SASAabs and secondary structure in addition to binding
specific features like the complex type. The folding model is simpler
(five features, 12 terms with interactions), and depends on
hydrophobicity and SASArel. Features are listed in order
according to their relative contribution to the explanatory power of the
full model. That is, the highest-ranked feature is the one whose removal
leads to the greatest reduction in R2. For the binding
epistasis model, the largest contributor was the complex type, with a
change in R2 of 0.128 upon removal followed by charge
with a change in R2 of 0.078 upon its removal. The
remaining terms each contribute ~5% or less to the
predictive power of the binding model. For the folding epistasis model,
the largest contributor was hydrophobicity with a change in
R2 of 0.151 upon removal, followed by both size and
charge with similar contributions (change in R2 of
0.0765 and 0.0695 with their removal respectively). The remaining terms
each contribute ~4.5% or less to the predictive power
of the folding model.
Table 2 shows the results of 100 trials of our ”leave-10%-out”
robustness test where 10% of the available systems were randomly
removed. These results show that both of our full models are highly
robust – with the binding model being slightly more robust than the
folding model. All terms present in the full models are present in the
leave-10%-out analysis, most occurring in all trials. Additionally, the
mean ranks of most terms are identical to the full-data binding model
with more variance in the folding model. Graphical representation of our
robustness tests shown in supplemental Figure S2.
Figure 4 further illustrates the results of our statistical model for
epistasis in binding. For charge, the subcategory for interactions
involving an attractive pairing (0A) contains the most strongly
epistatic mutations. While mutations in this subcategory cover a broad
range of values, many tend towards positive epistasis; the largest value
belongs to this subcategory. Neutral or constant charge states (00) show
a near normal distribution centered on zero with some low levels of
epistasis. Changes involving a repulsive interaction (0R) contain the
least number of data and have a narrow distribution, with fewer large
values for ϵ in either direction. For the complex type category, the
antibody-antigen subcategory shows the most epistasis, including the
most positive. TCR/pMHC also contains a large amount of positive
epistasis. Cytokine-cytokine is the only subcategory with a negative
mean suggesting that mutations in this subcategory tend to have negative
epistasis. Generic protein-protein complexes show similar behavior to
the neutral charge category; centered on zero, broad spread, but low
numbers of epistatic data points. Other categorical features are shown
in supplemental Figure S3.