(EQ 3)
Secondary structure information was included by considering whether a
given mutational site was located in an alpha helix (“H”), beta sheet
(“S”), or loop (“L”). Secondary structure content was determined
using a PyMOL script [46].
As with other categorical features, overparameterization may be a
concern, though here the explicit encoding has only nine possible cases
(three structure types at each of two sites). We tested the possible
abstractions, ranging from
explicit consideration of the structures at each site (e.g.,
HL,LL,LS,…) to the simplest case of a boolean value denoting whether
both sites belong to the same type of structure (“0”) or different
structures (“1”).
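The two extremes of this abstraction range can be illustrated with a minimal Python sketch. The function names `explicit_pair` and `differs` are hypothetical and chosen only for illustration; the sketch assumes each site has already been assigned one of the codes "H", "S", or "L", and it does not reproduce the PyMOL-based assignment itself.

```python
from itertools import product

# Structure codes as defined in the text: helix, sheet, loop.
STRUCTURE_CODES = ("H", "S", "L")


def explicit_pair(site1: str, site2: str) -> str:
    """Most detailed encoding: the ordered pair label, e.g. 'HL'."""
    for code in (site1, site2):
        if code not in STRUCTURE_CODES:
            raise ValueError(f"unknown structure code: {code!r}")
    return site1 + site2


def differs(site1: str, site2: str) -> int:
    """Simplest encoding: 0 if both sites share a structure type, else 1."""
    explicit_pair(site1, site2)  # reuse the validation above
    return int(site1 != site2)


# Enumerate every explicit category: 3 codes x 3 codes = 9 ordered pairs.
ALL_PAIRS = [a + b for a, b in product(STRUCTURE_CODES, repeat=2)]

if __name__ == "__main__":
    print(ALL_PAIRS)
    print(explicit_pair("H", "L"), differs("H", "L"))
```

The explicit encoding would enter a regression as a nine-level categorical feature, whereas the boolean encoding adds a single parameter, which is the trade-off against overparameterization described above.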
We also considered the effect of solvent-accessible surface area (SASA),
a metric describing how exposed or buried a residue is. To calculate
the SASA, we first prepared the PDB files using pdbfixer from the
OpenMM software suite [47]
to add missing residues, replace non-standard residues with their
standard equivalents, and add missing hydrogens. The repaired structures
were then processed with
FoldX [48] to generate mutations using the BuildModel command. DSSP
v3.0.0 [49] was then used to calculate the absolute SASA
($\mathrm{SASA}_{\text{abs}}$) for each residue of interest. Both
absolute and relative SASA were considered; relative SASA
($\mathrm{SASA}_{\text{rel}}$) was calculated using the empirical
maximum accessible surface area ($\mathrm{ASA}_{\text{max}}$) reported
by Tien et al. [50] via the formula:
$\mathrm{SASA}_{\text{rel}} = \mathrm{SASA}_{\text{abs}} / \mathrm{ASA}_{\text{max}}$. (EQ 4)
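As a minimal sketch of EQ 4, the normalization can be written as a small Python function. The name `relative_sasa` and the table `PLACEHOLDER_ASA_MAX` are hypothetical. The numbers in that table are placeholders for illustration only and are not the published values of Tien et al.; in practice the caller would supply the empirical maximum for each residue type from that reference.

```python
from typing import Mapping


def relative_sasa(
    residue_name: str,
    sasa_abs: float,
    asa_max: Mapping[str, float],
) -> float:
    """Return SASA_rel = SASA_abs / ASA_max for one residue (EQ 4).

    `asa_max` maps three-letter residue names to the maximum accessible
    surface area, in the same units as `sasa_abs`.
    """
    key = residue_name.upper()
    if key not in asa_max:
        raise KeyError(f"no ASA_max entry for residue {residue_name!r}")
    max_area = asa_max[key]
    if max_area <= 0:
        raise ValueError("ASA_max must be positive")
    return sasa_abs / max_area


# Placeholder values for illustration only (NOT taken from Tien et al.).
PLACEHOLDER_ASA_MAX = {"ALA": 100.0, "GLY": 80.0}

if __name__ == "__main__":
    # An alanine with 25 units of exposed area under the placeholder table.
    print(relative_sasa("ALA", 25.0, PLACEHOLDER_ASA_MAX))  # 0.25
```

A relative value near 0 indicates a buried residue and a value near 1 a fully exposed one; values slightly above 1 can occur when an observed area exceeds the tabulated maximum.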
Since SASA changes affect both wildtype and mutant residues, we used a
modified version of EQ 2, replacing $\mathrm{size}_{\text{net}}$ with SASA.
We also included classification information. For binding, we included
the type of protein-protein complex, divided into five categories based
on the information provided in the SKEMPI v2.0 database:
antibody-antigen (AB/AG), T cell receptor-peptide bound major
histocompatibility complex (TCR/pMHC), cytokine-cytokine receptor
(Cyto/Cyto), GTPase-other, and non-specific protein-protein interaction
(Pr/PI), which served as the reference category for the statistical
models. We also included a boolean value indicating whether the
mutational sites occur on the same (“0”) or different (“1”) protein
chains, since sites on the same chain may affect binding differently
from sites on opposing chains. For folding, we included the system
size, given by the total number of residues obtained from the PDB.
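One common way to encode such classification features for a regression model is dummy (treatment) coding, in which the reference category is represented by all zeros. The following Python sketch assumes that scheme; the source states only that Pr/PI was the reference category, not how the coding was implemented. The function names `encode_complex_type` and `different_chain_flag` are hypothetical.

```python
# Category labels as listed in the text; Pr/PI is the reference category.
COMPLEX_TYPES = ("AB/AG", "TCR/pMHC", "Cyto/Cyto", "GTPase-other", "Pr/PI")
REFERENCE_TYPE = "Pr/PI"


def encode_complex_type(label: str) -> dict:
    """Dummy-code a complex type: one 0/1 indicator per non-reference level.

    The reference category maps to all zeros, so each fitted coefficient
    is interpreted relative to Pr/PI.
    """
    if label not in COMPLEX_TYPES:
        raise ValueError(f"unknown complex type: {label!r}")
    return {
        level: int(label == level)
        for level in COMPLEX_TYPES
        if level != REFERENCE_TYPE
    }


def different_chain_flag(chain1: str, chain2: str) -> int:
    """Return 0 if both mutational sites share a chain ID, else 1."""
    return int(chain1 != chain2)


if __name__ == "__main__":
    print(encode_complex_type("AB/AG"))
    print(encode_complex_type("Pr/PI"))
    print(different_chain_flag("A", "B"))
```

Under this scheme the five complex types contribute four indicator columns, and the chain flag contributes one further column.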