Age estimation model
Comparison of models
We constructed three types of age estimation models (single regression, elastic net regression, and SVR) based on the DNA methylation levels of CpGs adjacent to SLC12A5 (SLC12A5-1, -2, -3, and -4), POU4F2 (POU4F2-1, -2, -3, and -4), VGF (VGF-1, -2, and -3), and SCGN (SCGN-1 and -2), and compared their performance (Table 2; Supplementary Figures S_R2_1, 2, 3, and 4).
Among the single regression models, the one using the methylation level of SLC12A5-4 showed the best performance; the mean absolute error (MAE) after LOOCV was 1.6 (Figure 4a). The formula for age estimation is as follows:
“Estimated Age” = (−1.962e-11 + 0.9808 × “methylation level of SLC12A5-4”) × 10.113 (standard deviation of training data) + 12.550 (mean of training data)
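The single regression formula above can be written as a short function. This is only a sketch: the constants are copied from the formula, while the function and variable names are hypothetical, and the input is assumed to be on the same scale as the methylation values the model was fitted on (the text does not state how they were scaled).

```python
# Constants copied from the single regression formula in the text.
INTERCEPT = -1.962e-11
SLOPE = 0.9808
AGE_SD = 10.113    # standard deviation of training data
AGE_MEAN = 12.550  # mean of training data


def estimate_age_single(slc12a5_4):
    """Return the estimated age for one SLC12A5-4 methylation value.

    Assumption: the value is on the scale used when the model was fitted.
    """
    # Linear predictor on the scaled-age axis.
    scaled_age = INTERCEPT + SLOPE * slc12a5_4
    # Undo the scaling of age with the training SD and mean.
    return scaled_age * AGE_SD + AGE_MEAN
```

With an input of 0 the function returns the training mean (12.550, apart from the negligible intercept), because the linear predictor is then essentially zero.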
The elastic net regression model selected as the best performing model included the methylation levels of SLC12A5-4, POU4F2-2, and VGF-2; the MAE after LOOCV was 1.5 (Figure 4b). The formula for age estimation is as follows:
“Estimated Age” = (−1.717e-11 + 0.6728 × “methylation level of SLC12A5-4” + 0.1652 × “methylation level of POU4F2-2” + 0.1535 × “methylation level of VGF-2”) × 10.113 + 12.550
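The elastic net formula has the same structure with three predictors, and can be sketched as below. The coefficients are taken from the formula; the function name and the dictionary layout are hypothetical, and the inputs are again assumed to be on the scale used for model fitting.

```python
# Coefficients copied from the elastic net formula in the text.
ELASTIC_NET_INTERCEPT = -1.717e-11
ELASTIC_NET_COEFS = {
    "SLC12A5-4": 0.6728,
    "POU4F2-2": 0.1652,
    "VGF-2": 0.1535,
}
AGE_SD = 10.113    # standard deviation of training data
AGE_MEAN = 12.550  # mean of training data


def estimate_age_elastic_net(methylation):
    """Return the estimated age from a dict mapping CpG name to methylation level.

    Assumption: each value is on the scale used when the model was fitted.
    """
    # Weighted sum of the three selected CpGs plus the intercept.
    scaled_age = ELASTIC_NET_INTERCEPT + sum(
        coef * methylation[cpg] for cpg, coef in ELASTIC_NET_COEFS.items()
    )
    # Undo the scaling of age with the training SD and mean.
    return scaled_age * AGE_SD + AGE_MEAN
```

A call such as `estimate_age_elastic_net({"SLC12A5-4": 0.5, "POU4F2-2": 0.2, "VGF-2": -0.1})` uses illustrative values only; a missing CpG key raises a `KeyError` rather than being silently treated as zero.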
The best-performing SVR model used the methylation levels of SLC12A5-1, -2, -3, and -4; the MAE after LOOCV was 1.3 (Figure 4c). The R script used to estimate age is available in the Supplementary File. Details of the parameters used in the elastic net regression and SVR models are shown in Supplementary Table S_R1.
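For readers unfamiliar with how an MAE after LOOCV is obtained, the following sketch illustrates the procedure: each sample is held out in turn, the model is refitted on the remaining samples, and the absolute errors of the held-out predictions are averaged. It is not the authors' R script. An ordinary least-squares line with one predictor stands in for the SVR model to keep the example dependency-free, and all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_mae(xs, ys):
    """Mean absolute error of leave-one-out predictions.

    xs and ys are lists of equal length (at least three samples).
    """
    errors = []
    for i in range(len(xs)):
        # Hold out sample i and refit on all remaining samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        # Predict the held-out sample and record the absolute error.
        predicted = intercept + slope * xs[i]
        errors.append(abs(predicted - ys[i]))
    return sum(errors) / len(errors)
```

Because every prediction is made for a sample the model did not see during fitting, the resulting MAE is less optimistic than an error computed on the training data itself.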
Influences of age, sex, and growth environment on the model
We used linear regression analysis to identify the factors that affect Δage and |Δage| in the best model (i.e., the SVR model with four CpGs adjacent to SLC12A5). When Δage was used as the dependent variable, the best regression model included age, growth environment, and the interaction between age and growth environment as explanatory variables (adjusted R² = 0.1869) (Table 3 and Figure 5). Among these variables, the interaction between age and growth environment was statistically significant (Figure 5b). When |Δage| was used as the dependent variable, the best regression model again included age, growth environment, and the interaction between age and growth environment as explanatory variables (adjusted R² = 0.186) (Table 4 and Figure 6). Among these variables, growth environment was statistically significant (Figure 6d). The explanatory variables that were statistically significant for the other models (i.e., the single regression model and the elastic net regression model) are shown in Supplementary Tables S_R2, 3, 4, and 5.
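As a reminder of how the adjusted coefficient of determination reported above relates to the ordinary R², the standard definition can be sketched as follows. The function and argument names are hypothetical, and the values in the usage note are illustrative only; they are not taken from this study.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared for a linear model with an intercept.

    r2           -- ordinary coefficient of determination
    n_samples    -- number of observations
    n_predictors -- number of explanatory terms, excluding the intercept
    """
    # Penalize R-squared for the number of explanatory terms in the model.
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
```

For a model with three explanatory terms (for example two main effects and their interaction), `adjusted_r2(0.5, 11, 3)` gives about 0.286, showing how the penalty lowers the value when the sample is small relative to the number of terms.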