Figure 1. A map of the FluxNet sites used in the analysis, coded by the IGBP vegetation type.
2.2 SUMMA standalone simulations
We used the Structure for Unifying Multiple Modeling Alternatives
(SUMMA; Clark et al., 2015) to simulate the hydrologic cycle, including
the resulting turbulent heat fluxes. SUMMA is a hydrologic modeling
framework that allows users to select among different model
configurations and process parameterizations. The clean separation
between the numerical solver and the flux parameterizations gave us
confidence that DL parameterizations embedded in SUMMA would not
unintentionally affect other model components. The core numerical solver
in SUMMA enforces closure of the mass and energy balances and is used in
all of our simulations.
SUMMA provides multiple flux parameterizations and process
representations for many hydrologic processes. Because we were primarily
interested in turbulent heat fluxes, we configured the remaining
processes to be suitable for general-purpose hydrologic modeling,
including runoff and snowpack simulation. To simulate transpiration we
used the Ball-Berry approach for stomatal conductance (Ball et al.,
1987), an exponentially decaying root density profile, and soil moisture
controls that mimic the Noah land surface model (Niu et al., 2011).
Similarly, the radiative transfer parameterizations, which are the
primary controls on the sensible heat flux, were also set up to mimic
the Noah land surface model. The functional forms of the turbulent heat
fluxes in SUMMA are similar to those in many other land surface and
hydrologic models, given by the bulk transfer equations (in resistance
form) as in Bonan (2015).
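In the common resistance form (our notation and sign convention, which may differ in detail from SUMMA's implementation), these bulk transfer equations can be written as

\[
H = \rho c_p \, \frac{T_s - T_a}{r_{ah}}, \qquad
\lambda E = \frac{\rho c_p}{\gamma} \, \frac{e_s(T_s) - e_a}{r_{aw} + r_s},
\]

where \(\rho\) is the air density, \(c_p\) the specific heat of air at constant pressure, \(\gamma\) the psychrometric constant, \(T_s\) and \(T_a\) the surface and air temperatures, \(e_s(T_s)\) and \(e_a\) the saturation vapor pressure at the surface and the vapor pressure of the air, \(r_{ah}\) and \(r_{aw}\) the aerodynamic resistances to heat and vapor transfer, and \(r_s\) the bulk surface (stomatal) resistance.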
At each of the sites described in section 2.1 we independently
calibrated a standalone SUMMA model using the dynamically dimensioned
search algorithm (Tolson & Shoemaker, 2007) as implemented in the
OSTRICH optimization package (Matott, 2017), with the mean squared error
as the optimization criterion. A summary of the calibration variables
and their test ranges is shown in table S1 of the supporting information. The
first year of available data was used for calibration. Because of the
limited length of the data record at some sites, the calibration period
was not excluded from subsequent analysis. The 10 parameters we chose to
calibrate largely control water movement through the vegetation and soil
domains. In the soil domain these include the residual and saturated
moisture contents, field capacity, and controls on anisotropy of flows.
In the vegetation domain these include controls on photosynthesis,
rooting depth, wilting and transpiration water contents, the amount of
precipitation that passes through the canopy as throughfall, and a
generic scaling factor for the amount of vegetation.
The calibrations were run to a maximum of 500 trial iterations, which
provided good convergence across sites (see the supporting information
for convergence plots). The objective function was the mean squared
error of the half-hourly latent and sensible heat fluxes, and we saved
the best parameter set for each site to use as our comparison to the DL
parameterizations. To provide good estimates of the
initial soil moisture and temperature states we spun up the standalone
SUMMA simulations for 10 years both before and after calibration (for a
total of 20 spinup years). We will refer to the standalone calibrated
SUMMA simulations as SA (StandAlone) for the remainder of the paper. To
summarize, we independently calibrated a set of parameters for each
site; the resulting best parameter set served as an in-sample benchmark
for comparison with our DL parameterizations. A brief
description of the computational cost and runtimes associated with
calibrating SA is provided in the supporting information.
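The calibration objective can be illustrated with a short sketch. How the latent and sensible heat errors are combined (here, simply summed) is our assumption, and the array names are hypothetical.

```python
# A minimal sketch of the calibration objective: the mean squared error of the
# half-hourly latent (le) and sensible (h) heat fluxes. Summing the two MSEs
# is an assumption about how the errors were combined; array names are
# hypothetical.
import numpy as np

def objective(le_sim, le_obs, h_sim, h_obs):
    """Return the combined MSE of simulated vs. observed turbulent fluxes."""
    mse_le = np.mean((le_sim - le_obs) ** 2)
    mse_h = np.mean((h_sim - h_obs) ** 2)
    return mse_le + mse_h
```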
2.3 DL parameterization and simulations
To build DL parameterizations of turbulent heat fluxes we constructed
our neural networks using the Keras python package (Chollet, 2015). The
neural networks take a variety of inputs, such as meteorological forcing
data, and output the bulk latent and sensible heat fluxes, as shown in
panel b) of figure 2.
Our neural networks were constructed using only dense layers, where
every node in one layer is connected to all nodes in the preceding and
following layers. We used the deep-dense architecture because it is the
only network architecture that could easily be coupled to SUMMA, given
the capabilities of the coupling tools. We will discuss the details of
how we coupled the neural networks to SUMMA later in this section. We
tested networks ranging from one layer with 12 nodes up to 10 layers
with 64 nodes. After manual trial and error we settled on 6 layers, each
with 48 nodes. Smaller architectures were less able to capture the
extremes of the turbulent heat fluxes, and larger networks showed
diminishing additional improvement. A simple schematic
of the neural network architecture is shown in figure S2 of the
supporting information.
We used hyperbolic tangent (tanh) activations in all of the nodes of the
network. Stochastic gradient descent (SGD) with an exponential learning
rate decay curve was used as the optimizer to train the weights and
biases of the neural networks. As our loss function we used the mean
squared error of the 30-minute turbulent heat flux estimates, the same
criterion as the objective function used in our calibration of SA.
Dropout was applied after the first layer and before the final layer
with a retention rate of 0.9 to regularize the network. Dropout works by
randomly pruning some fraction (one minus the retention rate) of the
nodes in a given layer during training. This reduces the likelihood of
overfitting the network, as there is some stochasticity in the model
architecture during training.
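To make the architecture concrete, a minimal sketch in tf.keras is given below. The number of input features and the learning-rate schedule values are illustrative assumptions rather than the settings used in this study.

```python
# A minimal sketch of the dense network described above, written with
# tf.keras. The number of input features and the learning-rate schedule
# values are illustrative assumptions, not settings from this study.
from tensorflow.keras import layers, models, optimizers

n_features = 8  # assumed number of meteorological forcing inputs

# Exponential learning-rate decay for the SGD optimizer (assumed values)
lr_schedule = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2, decay_steps=1000, decay_rate=0.96)

model = models.Sequential()
# First hidden layer (48 nodes, tanh), followed by dropout
# with a retention rate of 0.9 (i.e., a dropout rate of 0.1)
model.add(layers.Dense(48, activation="tanh", input_shape=(n_features,)))
model.add(layers.Dropout(0.1))
# Remaining hidden layers, for 6 layers of 48 nodes in total
for _ in range(5):
    model.add(layers.Dense(48, activation="tanh"))
# Dropout before the final layer, then two outputs:
# the bulk latent and sensible heat fluxes
model.add(layers.Dropout(0.1))
model.add(layers.Dense(2))

model.compile(optimizer=optimizers.SGD(learning_rate=lr_schedule),
              loss="mse")
```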
When training the networks we performed a 5-fold cross validation
across sites. We used 48 sites to train each network and then applied it
out of sample to each of the remaining 12 sites. The data from the 48
sites used to train each network were randomly shuffled and split into
80% training and 20% validation data. The validation data were used to
define an early stopping criterion, where training was stopped if the
validation loss had not decreased for 10 training epochs. This procedure
keeps the model from overfitting on the training data. The maximum
number of training epochs was set to 500, with a batch size of 768 data
points (16 days of half-hourly data). All data were shuffled before
training to remove any temporal bias that the model could learn, which
also reduces overfitting.
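Continuing the sketch above, one cross-validation fold could be trained roughly as follows; `x_train` and `y_train` are hypothetical, pre-shuffled arrays assembled from the 48 training sites.

```python
# A minimal sketch of training one cross-validation fold, assuming `model`
# from the previous sketch and hypothetical arrays x_train/y_train that hold
# the inputs and the two target heat fluxes, already randomly shuffled.
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not decreased for 10 epochs
early_stop = EarlyStopping(monitor="val_loss", patience=10)

history = model.fit(
    x_train, y_train,
    validation_split=0.2,  # 80% training / 20% validation split
    shuffle=True,          # reshuffle training batches each epoch
    epochs=500,            # maximum number of training epochs
    batch_size=768,        # 768 half-hourly points (16 days) per batch
    callbacks=[early_stop])
```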