4. Discussion
Before building linear models we first determined the extent to which
our datasets contain meaningful epistasis. That is, considering there is
uncertainty in the data, where should we draw the line between epistatic
and non-epistatic values of ϵ? We estimated (see Methods) that the error
for both datasets fall between 0.2 - 1.0 kcal/mol with an average around
0.5 - 0.6 kcal/mol. From this, we estimated a cutoff of 0.5 kcal/mol,
i.e., |ϵ| > 0.5 kcal/mol are considered
epistatic. There are further limitations of our dataset; the data is not
from randomized studies. Instead, the experiments were generally
conducted in a targeted fashion with a priori knowledge of function.
This may explain why we find more positive epistasis (more stabilizing
that additivity predicts) than negative epistasis (more destabilizing
that additivity predicts) as shown in Figure 1. Alternatively, it is
possible that more positive epistasis is present in the data because
negative epistasis could lead to protein misfolding or non-binding
events in the experiments. The former reasoning is an artifact of how
the data was generated, and the latter is related to biophysical
features of the proteins; both carry different implications for the
dataset and warrant future work.
Separation distance is the most intuitive feature expected to contribute
to epistasis, because residues that are near each other are more likely
to interact than those far apart. Simple comparisons show a decreasing
spread of epistasis with increasing distance (Figure 2). The folding
data show this most strongly with a sharp peak around the shortest
separation distances of approximately 6 Å, dropping to near zero at
larger distances. The binding data show a possible peak around 10 Å,
however, the trend is not as clear. Additionally, with binding there is
a paucity of data from 25 Å to 40 Å with only one data point around 40
Å. Our tests using likelihood ratio methods (Figure 3) confirm that
separation does play a role in epistasis for both binding and folding.
Our alternative model (width of possible ϵ values depends on separation)
was a better explanation than the null model (no relationship between
separation and ϵ), with a p-value of p < 0.001 in the case of
binding, and p < 0.002 in the case of folding. The importance
of separation is also illustrated in our models (Table 1) where both
folding and binding models have negative coefficients for separation
distance. In the folding model, a 10 Å increase in separation between
residues results in a decrease in epistasis of 0.416. In the binding
model, the effect of separation alone is an order of magnitude less than
the folding model and has less significance in the model (p=0.8074).
Instead, the effect of separation in the binding model is most strongly
characterized by the interaction with charge. With charge alone, changes
involving attractive pairings show an increase in epistatic effect
whereas changes involving a repulsive pairing show a decrease. The
interaction between charge and separation contributes an opposing
effect: as separation between residues increases, changes involving
attractive and repulsive pairings cause a decrease and increase in
epistatic effect respectively. Intuitively, as separation between
charged residues, regardless of categorization, increases the net effect
of charge on epistasis tends towards zero (Δϵcharge +
Δϵcharge:separation ~ 0).
In addition to separation distance, amino acid size is present in both
models. Size is another feature one might intuitively expect to
contribute to epistasis: large absolute changes in size imply that voids
are created when residues change from larger to smaller, or that smaller
to larger residues create steric clashes. In both models the coefficient
is positive (increases in ϵ occur with change in size) with more of an
effect in the case of folding (on the order of 10-2 vs
10-3 in the case of binding). Size interaction terms
differ between binding and folding. In the case of binding, when there
are changes in size that occur on different protein chains, there is a
reduction in epistasis. Otherwise, for all complex types, changes in
size lead to an increase in epistasis, most strongly with
Antibody-antigen complexes. For folding, sizenetinteracts with hydrophobicity and SASArel leading to
decreases in epistasis. This will be discussed further with the features
specific to the folding model.
In the case of both binding and folding, there are unique features that
contribute significantly to the observed epistasis. In the case of
binding, these elements only apply to binding interactions such as the
type of complex (defined by function) and whether both mutations occur
on the same side of the binding interaction. Complex type is the most
significant contributor to the observed epistasis (∆R2= 0.17) with most complexes showing less epistatic effect compared to
the reference category of generic protein-protein complexes. There is an
exception with Cytokine-cytokine complexes that shows a small increase
in epistasis with a coefficient of +0.4677. The interaction side is a
smaller contributor compared to complex type (∆R2 =
0.0465), with a slight increase in epistatic effect when mutations occur
on opposite sides of the binding interaction. This is consistent with
intuition; if both mutations are near the binding interface and on
opposite sides, they are more likely to directly interact, or propagate
effects at the interface. Additional features that contribute to
epistasis in binding are secondary structure and
SASAabs. Secondary structure has a minor contribution,
with a slight increase in epistatic effect when residues belong to
different secondary structure types. This is counterbalanced by an
interaction with separation distance, where residues that occur in
different secondary structures, and are also far apart, lead to a
decrease in epistatic effect. This could be due to direct interactions
between sites; if they are close together but belong to different
secondary structures, they can change these structures either directly
or indirectly. This is less likely to happen if they are further apart.
SASAabs is the penultimate feature in the model ranking
with a very small coefficient (-0.006). This implies that changes in the
total exposed surface area due to the two mutations lead to small
reductions in the epistatic effect.
Unique to the folding model, hydrophobicity is present, and is the
strongest contributor to epistasis with a ∆R2 of
0.1506. Changes in the net hydrophobicity lead to an increase in the
observed epistasis. This is consistent with other studies that have
shown that hydrophobicity contributes to predicting folding stabilities
with double
mutations61.
Most of the other terms present in the folding model interact with
hydrophobicity leading to a stronger effect on epistasis, and a
reduction when paired with changes in size, and changes in charge
involving attractive interactions.
Since our statistical models for both binding and folding explain
approximately 25-30% of the observed epistasis, an important question
is: what explains the other 70-75%? We believe the answer lies in
dynamical properties that are beyond the scope of what we investigated
here. Protein complexes are not static objects, thus static features
like those considered in this study are only likely to capture some of
the true physical effect they can have on these systems. While a tool
like molecular dynamics could potentially help address this question,
given the number of mutations and systems considered here, the
computational cost would be unreasonably large and will be left as a
topic for future study.
Given the size of our datasets, and the imbalanced nature of the data in
terms of protein systems, we performed a “leave-10%-out” validation
procedure to test the robustness of our models and determine whether
there are system-specific effects (see Table 2). We found that our
binding model was very robust; all terms appearing in the full model
were also present in the validation trials effectively 100% of the time
(the least significant term, secondary structure, was missing from three
trials). The mean rank was also consistent between the validation trials
and the full model ranking for the three most significant terms, the 4th
and 5th are switched but close enough to be within a margin of error,
the 6th and 7th were also consistently ranked. The folding model was
slightly less robust. The effect of hydrophobicity was very robust being
ranked first in the full model and appearing in 99 of the 100 validation
trials with a mean rank of one. The remaining folding model terms appear
between 96% to 100% of the time, however their mean rankings are
generally inconsistent with their full model rank, indicating that while
they are important to explaining epistasis we cannot be as certain of
their relative contribution.
A limitation in the current study, that is also a limitation for all
similar studies, is the lack of comprehensive, diverse, and unbiased
datasets. Given the challenges associated with measuring binding or
folding free energies for a large number of mutants, these datasets are
built with narrow focus and small sample sizes. Such databases tend to
be biased toward systems of particular interest. Additionally, they will
not contain mutations that result in a nonviable protein or system. This
does not make the data any less relevant since in nature proteins must
be viable, and thus we should expect similar results (e.g., the
preponderance of positive epistasis observed in this study). If we want
to understand the nature of epistasis at the level of protein stability,
we need to study it across more protein systems in a more systematic
fashion. To build a truly predictive model of epistasis, dynamic
properties would need to be considered and a larger, more representative
sample of data would need to be accessible.