Zhuonan Wang

and 5 more

Estimates of soil organic carbon (SOC) stocks are essential for many environmental applications. However, significant inconsistencies exist in SOC stock estimates for the U.S. across current SOC maps. We propose an upscaling framework that combines unsupervised multivariate geographic clustering (MGC) and supervised random forest regression, improving SOC maps by capturing heterogeneous relationships with SOC drivers. We first used MGC to divide the U.S. into 20 SOC regions based on the similarity of covariates (soil biogeochemical, bioclimatic, biological, and physiographic variables). Subsequently, a separate random forest model was trained for each SOC region using environmental covariates and SOC observations. Our estimated SOC stocks for the U.S. (52.6 ± 3.2 Pg for 0-30 cm and 108.3 ± 8.2 Pg for 0-100 cm depths) were within the range estimated by existing products such as the Harmonized World Soil Database (HWSD; 46.7 Pg for 0-30 cm and 90.7 Pg for 0-100 cm depths) and SoilGrids 2.0 (45.7 Pg for 0-30 cm and 133.0 Pg for 0-100 cm depths). However, independent validation with soil profile data from the National Ecological Observatory Network showed that our approach (R² = 0.51) outperformed the estimates obtained from HWSD (R² = 0.23) and SoilGrids 2.0 (R² = 0.39) for the topsoil (0-30 cm). Uncertainty analysis (e.g., low representativeness and high coefficients of variation) identified regions requiring more measurements, such as Alaska and the deserts of the U.S. Southwest. Our approach effectively captures the heterogeneous relationships between widely available predictors and SOC across regions, offering reliable gridded SOC estimates for benchmarking Earth system models.
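
The two-step idea in this abstract (cluster the covariate space into regions, then fit one random forest per region) can be illustrated with a minimal sketch. This is not the authors' code: k-means from scikit-learn stands in for MGC, the data are synthetic, 4 regions are used instead of 20, and every variable name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 600 sites, 4 covariates, and a SOC-like
# target whose dependence on the covariates differs between two regimes.
X = rng.normal(size=(600, 4))
regime = X[:, 0] > 0
y = np.where(regime, 3.0 * X[:, 1], -2.0 * X[:, 2])
y = y + rng.normal(scale=0.1, size=600)

# Step 1: unsupervised clustering of standardized covariates into regions.
# k-means is an assumed stand-in for MGC; 4 regions instead of the paper's 20.
scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
region = kmeans.fit(scaler.transform(X)).labels_

# Step 2: train a separate random forest on the sites of each region.
models = {}
for r in np.unique(region):
    mask = region == r
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    models[int(r)] = forest.fit(X[mask], y[mask])


def predict(X_new):
    """Assign each row to a region, then predict with that region's forest."""
    r_new = kmeans.predict(scaler.transform(X_new))
    out = np.empty(len(X_new))
    for r, model in models.items():
        mask = r_new == r
        if mask.any():
            out[mask] = model.predict(X_new[mask])
    return out


y_hat = predict(X)
```

Fitting per-region models lets each forest learn a different covariate-to-target relationship, which is the heterogeneity the abstract refers to. In practice the predictions would be evaluated on held-out profiles rather than on the training sites used here.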

Richard Mills

and 3 more

As high-resolution geospatiotemporal data sets from observatory networks, remote sensing platforms, and computational Earth systems increase in abundance, fidelity, and richness, machine learning approaches that can fully utilize increasingly powerful parallel computing resources are becoming essential for analysis and exploration of such data sets. We explore one such approach, applying a state-of-the-art distributed memory parallel implementation of Support Vector Machine (SVM) classification to large remote-sensing data sets. We have used MODIS 8-day surface reflectance (MOD09A1) and land surface temperature (MOD11A2) for classifying wildfires over Alaska and California. Monitoring Trends in Burn Severity (MTBS) burn perimeter data were used to set boundaries of burned and unburned areas for our two-class problem. MTBS covers the years 1984 to 2019 and records only fires of 1000 acres or greater in the western United States. We seek a parallel computing solution (using the PermonSVM solver, described below) that accurately classifies wildfires and finds smaller unrecorded wildfires. An initial assessment of wildfire classification over interior Alaska shows that PermonSVM achieves an accuracy of 96% and produces over 5000 false positives (i.e., fires unrecorded in MTBS). Next steps include mapping larger regions over Alaska and California and understanding the tradeoffs between scalability and accuracy. The parallel tool we employ is PermonSVM, which is built on top of the widely used open source toolkit PETSc, the Portable, Extensible Toolkit for Scientific Computation. Recent developments in PETSc have focused on supporting cutting-edge GPU-based high-performance computing (HPC) architectures, and these can be easily leveraged in PermonSVM by using appropriate GPU-enabled matrix and vector types in PETSc. We achieve significant GPU speedup for the SVM calculations on the Summit supercomputer at Oak Ridge National Laboratory (currently one of the best available “at scale” proxies for upcoming exascale-class supercomputers) and are actively working to further improve computational efficiency on Summit as well as on prototype exascale node architectures.
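
To make the two-class setup and its reported metrics concrete, the following is a minimal single-machine sketch. It does not use PermonSVM or PETSc: scikit-learn's `LinearSVC` stands in for the solver, and the two features per pixel, the class means, and the sample sizes are all hypothetical synthetic values rather than MODIS or MTBS data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): two features per pixel, with
# label 0 = unburned and label 1 = burned according to the reference map.
n = 500
unburned = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n, 2))
burned = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(n, 2))
X = np.vstack([unburned, burned])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Shuffle, then hold out 200 pixels for evaluation.
idx = rng.permutation(2 * n)
train, test = idx[:800], idx[800:]

# Linear SVM as an assumed stand-in for the PermonSVM solver.
clf = LinearSVC(C=1.0, max_iter=10000, random_state=0)
clf.fit(X[train], y[train])
pred = clf.predict(X[test])

# Rows are reference labels, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y[test], pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

# fp counts pixels predicted burned but labeled unburned in the reference.
print(f"accuracy={accuracy:.3f} false_positives={fp}")
```

In this framing a false positive is a pixel the classifier marks as burned while the reference perimeters mark it as unburned, which is why such pixels are candidates for fires missing from the reference data.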