(EQ 3)
Secondary structure information was included by considering whether a given mutational site was located in an alpha helix (“H”), beta sheet (“S”), or loop (“L”). Secondary structure content was determined using a PyMOL script [46]. As with other categorical features, overparameterization may be a concern, though in this case the explicit consideration has only nine possible cases (three structure types at each of the two sites). We tested the possible abstractions, ranging from explicit consideration of the structures at each site (e.g., HL, LL, LS, …) to the simplest case of a boolean value denoting whether both sites belong to the same type of structure (“0”) or different structures (“1”).
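The two ends of this abstraction range can be illustrated with a minimal Python sketch. The function and constant names below are hypothetical and are not taken from the PyMOL script; the sketch only assumes that each site has already been assigned one of the codes H, S, or L.

```python
from itertools import product

# Secondary structure codes used in the text: helix, sheet, loop.
STRUCTURE_CODES = ("H", "S", "L")

# All ordered two-site labels (HH, HS, HL, ...), i.e. the explicit encoding.
ALL_PAIRS = [a + b for a, b in product(STRUCTURE_CODES, repeat=2)]


def explicit_pair(site_a, site_b):
    """Return the ordered two-letter label for a pair of sites, e.g. 'HL'."""
    for code in (site_a, site_b):
        if code not in STRUCTURE_CODES:
            raise ValueError(f"unknown secondary structure code: {code!r}")
    return site_a + site_b


def same_structure_flag(site_a, site_b):
    """Return 0 if both sites share a structure type, 1 otherwise."""
    label = explicit_pair(site_a, site_b)
    return 0 if label[0] == label[1] else 1
```

The explicit encoding yields a nine-level categorical feature, whereas the boolean flag collapses it to a single binary feature; intermediate abstractions would fall between these two.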
We also considered the effect of solvent accessible surface area (SASA), a metric describing whether a residue is exposed or buried. To calculate the SASA, we first prepared the PDB files using pdbfixer from the OpenMM software suite [47] to add missing residues, replace non-standard residues with their standard equivalents, and add missing hydrogens. The repaired structures were then processed with FoldX [48] to generate mutations using the BuildModel command. DSSP v3.0.0 [49] was then used to calculate the absolute SASA (SASAabs) for each residue of interest. Both absolute and relative SASA were considered; relative SASA (SASArel) was calculated using the empirical maximum accessible surface area (ASAmax) reported by Tien et al. [50] via the formula:
SASArel = SASAabs / ASAmax. (EQ 4)
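As a minimal sketch of EQ 4, the normalization can be written as a lookup followed by a division. The function name and the table passed to it are hypothetical; in particular, the numbers in `PLACEHOLDER_ASA_MAX` are illustrative placeholders and are not the empirical values of Tien et al., which would need to be substituted in practice.

```python
def relative_sasa(sasa_abs, residue, asa_max_table):
    """Compute SASArel = SASAabs / ASAmax for one residue.

    sasa_abs      -- absolute SASA of the residue (e.g. as reported by DSSP)
    residue       -- residue name used as the key into asa_max_table
    asa_max_table -- mapping from residue name to its ASAmax value
    """
    if residue not in asa_max_table:
        raise KeyError(f"no ASAmax value for residue {residue!r}")
    asa_max = asa_max_table[residue]
    if asa_max <= 0:
        raise ValueError("ASAmax must be positive")
    return sasa_abs / asa_max


# Placeholder numbers for illustration only (NOT the published ASAmax values).
PLACEHOLDER_ASA_MAX = {"ALA": 100.0, "TRP": 250.0}

example_rel = relative_sasa(50.0, "ALA", PLACEHOLDER_ASA_MAX)
```

Passing the table as an argument keeps the choice of ASAmax scale explicit, so the same function could be used with a different normalization if desired.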
Since SASA changes affect both wildtype and mutant residues, we used a modified version of EQ 2, replacing sizenet with SASA.
We also included classification information. For binding, we included the type of protein-protein complex, divided into five categories based on the information provided in the SKEMPI v2.0 database: antibody-antigen (AB/AG), T cell receptor-peptide bound major histocompatibility complex (TCR/pMHC), cytokine-cytokine receptor (Cyto/Cyto), GTPase-other, and non-specific protein-protein interaction (Pr/PI), which functioned as the reference category for the statistical models. We also included a boolean value indicating whether the mutational sites occur on the same (“0”) or different (“1”) protein chains, as sites on the same chain may affect binding differently than sites on opposing chains. For folding, we included the system size, given by the total number of residues obtained from the PDB.
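One way to encode these binding features is sketched below, assuming a standard dummy (treatment) coding in which the reference category Pr/PI maps to all zeros. The function names are hypothetical, and the actual statistical software may construct the indicator columns differently.

```python
# Complex-type labels as given in the text; Pr/PI is the reference category.
COMPLEX_TYPES = ("AB/AG", "TCR/pMHC", "Cyto/Cyto", "GTPase-other", "Pr/PI")
REFERENCE_TYPE = "Pr/PI"


def encode_complex_type(complex_type):
    """Return 0/1 indicators for every non-reference complex type."""
    if complex_type not in COMPLEX_TYPES:
        raise ValueError(f"unknown complex type: {complex_type!r}")
    return {
        name: int(name == complex_type)
        for name in COMPLEX_TYPES
        if name != REFERENCE_TYPE
    }


def different_chain_flag(chain_a, chain_b):
    """Return 0 if both sites are on the same chain, 1 otherwise."""
    return 0 if chain_a == chain_b else 1
```

With this coding, each fitted coefficient for a complex type would be interpreted relative to the non-specific protein-protein interaction class.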