3. Results
To build a statistical model for epistasis in proteins we used data for binding curated from SKEMPI v2.0 (572 mutation pairs), and for folding curated from ProTherm 4 (204 mutation pairs). We first considered the extent to which epistasis was present in our data set. To determine this, we defined an epistasis cutoff; values where |ϵ| is larger than the cutoff are considered epistatic, and other values are not. Ideally, the cutoff would be chosen based on the experimental error or uncertainty, however, given that our data come from a broad spectrum of methods and sources, this is not possible to determine for the dataset as a whole (see supplemental Figure S1 for the dataset divisions with various cutoffs).
Figure 1 shows the free energy change of the double mutant as a function of the sum of individual free energies for both binding and folding datasets with a cutoff of 0.5 kcal/mol. Both figures 1 and 2 show that epistasis is present in binding and folding. In both datasets there is a marked trend for large sums of constituent single mutations (sum in EQ 1) to correspond to a double mutant with free energy falling below the 1:1 line (i.e., more stabilizing that predicted by additivity). The opposite is true for constituent mutations with smaller sums.
After ascertaining the extent to which epistasis is present in our data, we investigated how well the separation between mutation sites could explain the epistatic effect. Figure 2 shows the relationship between separation distance and the observed epistasis for binding (top) and folding (bottom). Both show the general expected trend of less epistasis as separation increases. Both also show a larger number of data points for distances with the largest ϵ values, or spread in ϵ (around 6-10 Å).
Figure 3 shows our analysis to determine whether the apparent decrease in epistasis with increasing separation distance (Figure 2) is due to an actual relationship or a consequence of the larger number of data points at small distances. Figure 3A shows null model (σ(ϵ) is not a function of r) and alternative model (σ(ϵ) exponentially decreases as a function of r) for the likelihood ratio analysis. Figure 3B shows the simulated distribution of the likelihood ratio, \(\Lambda\),from the analysis with 1000 samples. The experimentally observed likelihood ratio is well outside the distribution of null ratios given by the label “EXP” and has a value of -5.50 compared with the tail of the simulation distribution minimum of -4.59. In simple terms, this results in a p-value of p < 1/1000 (p<0.001) in strong support of the alternative model.
Table 1 shows a summary of the binding (1A) and folding (1B) statistical models for epistasis in protein systems. Both models have similar predictive power in the range of 25-30%. The final selected binding model contains all features that we considered except for hydrophobicity (seven features, 28 terms including interactions) and depends on SASAabs and secondary structure in addition to binding specific features like the complex type. The folding model is simpler (five features, 12 terms with interactions), and depends on hydrophobicity and SASArel. Features are listed in order according to their relative contribution to the explanatory power of the full model. That is, the highest-ranked feature is the one whose removal leads to the greatest reduction in R2. For the binding epistasis model, the largest contributor was the complex type, with a change in R2 of 0.128 upon removal followed by charge with a change in R2 of 0.078 upon its removal. The remaining terms each contribute ~5% or less to the predictive power of the binding model. For the folding epistasis model, the largest contributor was hydrophobicity with a change in R2 of 0.151 upon removal, followed by both size and charge with similar contributions (change in R2 of 0.0765 and 0.0695 with their removal respectively). The remaining terms each contribute ~4.5% or less to the predictive power of the folding model.
Table 2 shows the results of 100 trials of our ”leave-10%-out” robustness test where 10% of the available systems were randomly removed. These results show that both of our full models are highly robust – with the binding model being slightly more robust than the folding model. All terms present in the full models are present in the leave-10%-out analysis, most occurring in all trials. Additionally, the mean ranks of most terms are identical to the full-data binding model with more variance in the folding model. Graphical representation of our robustness tests shown in supplemental Figure S2.
Figure 4 further illustrates the results of our statistical model for epistasis in binding. For charge, the subcategory for interactions involving an attractive pairing (0A) contains the most strongly epistatic mutations. While mutations in this subcategory cover a broad range of values, many tend towards positive epistasis; the largest value belongs to this subcategory. Neutral or constant charge states (00) show a near normal distribution centered on zero with some low levels of epistasis. Changes involving a repulsive interaction (0R) contain the least number of data and have a narrow distribution, with fewer large values for ϵ in either direction. For the complex type category, the antibody-antigen subcategory shows the most epistasis, including the most positive. TCR/pMHC also contains a large amount of positive epistasis. Cytokine-cytokine is the only subcategory with a negative mean suggesting that mutations in this subcategory tend to have negative epistasis. Generic protein-protein complexes show similar behavior to the neutral charge category; centered on zero, broad spread, but low numbers of epistatic data points. Other categorical features are shown in supplemental Figure S3.