Figure 1. A map of the FluxNet sites used in the analysis, coded by the IGBP vegetation type.
2.2 SUMMA standalone simulations
We used the Structure for Unifying Multiple Modeling Alternatives (SUMMA; Clark et al., 2015) to simulate the hydrologic cycle, including the resulting turbulent heat fluxes. SUMMA is a hydrologic modeling framework that allows users to select between different model configurations and process parameterizations. The clean separation between the numerical solver and the flux parameterizations gave us confidence that DL parameterizations embedded into SUMMA would not affect other model components in unintended ways. The core numerical solver in SUMMA enforces closure of the mass and energy balances and is used in all of our simulations.
SUMMA provides multiple flux parameterizations and process representations for many hydrologic processes. Because we were primarily interested in turbulent heat fluxes, we used a configuration for the other processes that is suitable for general purpose hydrologic modeling, including runoff and snowpack simulations. To simulate transpiration we used the Ball-Berry approach for stomatal conductance (Ball et al., 1987), an exponentially decaying root density profile, and soil moisture controls that mimic the Noah land surface model (Niu et al., 2011). Similarly, the radiative transfer parameterizations, which are the primary controls on the sensible heat fluxes, were also set up to mimic the Noah land surface model. The functional forms of the turbulent heat fluxes in SUMMA are similar to those in many other land surface and hydrologic models, given by the bulk transfer equations (in resistance terms) as in Bonan (2015).
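For reference, these bulk transfer equations can be written in resistance form (following the general presentation in Bonan, 2015; the exact resistance formulation in SUMMA may differ in its details) as

$$H = \rho_a c_p \frac{T_s - T_a}{r_{ah}}, \qquad \lambda E = \frac{\rho_a c_p}{\gamma} \frac{e^*(T_s) - e_a}{r_{aw}},$$

where $H$ and $\lambda E$ are the sensible and latent heat fluxes, $\rho_a$ is the air density, $c_p$ is the specific heat of air at constant pressure, $\gamma$ is the psychrometric constant, $T_s$ and $T_a$ are the surface and air temperatures, $e^*(T_s)$ and $e_a$ are the saturation vapor pressure at the surface temperature and the vapor pressure of the air, and $r_{ah}$ and $r_{aw}$ are the aerodynamic resistances to heat and water vapor transfer.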
At each of the sites described in section 2.1 we independently calibrated a standalone SUMMA model using the dynamically dimensioned search algorithm (Tolson & Shoemaker, 2007) as implemented in the OSTRICH optimization package (Matott, 2017), with the mean squared error as the optimization criterion. A summary of the calibration variables and their test ranges is shown in table S1 of the supporting information. The first year of available data was used for calibration. Because of the limited length of the data record at some sites, the calibration period was not excluded from subsequent analysis. The 10 parameters we chose to calibrate largely control water movement through the vegetation and soil domains. In the soil domain these include the residual and saturated moisture contents, field capacity, and controls on the anisotropy of flows. In the vegetation domain these include controls on photosynthesis, rooting depth, wilting and transpiration water contents, the amount of precipitation throughfall through the canopy, and a generic scaling factor for the amount of vegetation. A schematic sketch of the search algorithm is given below.
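The sketch below illustrates the greedy, shrinking-neighborhood logic of the dynamically dimensioned search (following Tolson & Shoemaker, 2007); the actual calibration used OSTRICH's implementation, and the function and variable names here are illustrative only.

```python
import numpy as np

def dds(objective, x0, lower, upper, max_iter=500, r=0.2, seed=0):
    """Schematic dynamically dimensioned search (Tolson & Shoemaker, 2007).

    objective: callable returning the scalar to minimize (e.g. combined MSE).
    x0, lower, upper: initial parameter vector and its bounds (NumPy arrays).
    """
    rng = np.random.default_rng(seed)
    x_best = np.asarray(x0, dtype=float)
    f_best = objective(x_best)
    n = x_best.size
    for i in range(1, max_iter + 1):
        # Probability of perturbing each parameter decreases with iteration,
        # so the search dimension shrinks as the budget is used up.
        p = 1.0 - np.log(i) / np.log(max_iter)
        perturb = rng.random(n) < p
        if not perturb.any():
            perturb[rng.integers(n)] = True
        x_new = x_best.copy()
        sigma = r * (upper - lower)
        x_new[perturb] += sigma[perturb] * rng.standard_normal(perturb.sum())
        # Reflect values that stray outside the parameter bounds.
        x_new = np.where(x_new < lower, 2 * lower - x_new, x_new)
        x_new = np.where(x_new > upper, 2 * upper - x_new, x_new)
        x_new = np.clip(x_new, lower, upper)
        # Greedy acceptance: keep the candidate only if it improves the best.
        f_new = objective(x_new)
        if f_new < f_best:
            x_best, f_best = x_new, f_new
    return x_best, f_best
```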
The calibrations were run to a maximum of 500 trial iterations, which provided good convergence across sites (see the supporting information for convergence plots). We used the mean squared error of the half-hourly latent and sensible heat fluxes as the objective function and saved the best parameter set for each site to use as our comparison to the DL parameterizations. To provide good estimates of the initial soil moisture and temperature states we spun up the standalone SUMMA simulations for 10 years both before and after calibration (for a total of 20 spinup years). We will refer to the standalone calibrated SUMMA simulations as SA (StandAlone) for the remainder of the paper. To summarize, we calibrated each site independently and used the resulting best parameter set as an in-sample benchmark for comparison with our DL parameterizations. A brief description of the computational cost and runtimes associated with calibrating SA is provided in the supporting information.
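As a minimal illustration of this objective function (variable names are hypothetical, and the way the latent and sensible heat errors are combined into a single scalar, here a simple sum, is an assumption):

```python
import numpy as np

def calibration_objective(sim_latent, obs_latent, sim_sensible, obs_sensible):
    """Mean squared error over half-hourly latent and sensible heat fluxes.

    Inputs are 1-D arrays aligned in time over the calibration year.
    """
    mse_latent = np.mean((sim_latent - obs_latent) ** 2)
    mse_sensible = np.mean((sim_sensible - obs_sensible) ** 2)
    # The optimizer minimizes a single scalar, so the two errors are combined
    # here; the exact weighting of the two fluxes is an assumption.
    return mse_latent + mse_sensible
```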
2.3 DL parameterization and simulations
To build DL parameterizations of turbulent heat fluxes we constructed our neural networks using the Keras python package (Chollet, 2015). The neural networks take in a variety of inputs, such as the meteorologic forcing data, and output the bulk latent and sensible heat fluxes, as shown in panel b) of figure 2.
Our neural networks were constructed using only dense layers, in which every node in one layer is connected to all nodes in the preceding and following layers. We used the deep-dense architecture because it is the only network architecture that could easily be coupled to SUMMA, given the capabilities of the coupling tools. We will discuss the details of how we coupled the neural networks to SUMMA later in this section. We tested networks ranging from a single layer with 12 nodes up to 10 layers with 64 nodes. After manual trial and error we settled on 6 layers, each with 48 nodes. Smaller architectures were less able to capture the extremes of the turbulent heat fluxes, and larger networks showed diminishing improvement. A simple schematic of the neural network architecture is shown in figure S2 of the supporting information.
We used hyperbolic tangent (tanh) activations in all of the nodes of the network. Stochastic gradient descent (SGD) with an exponential learning rate decay curve was used as the optimizer to train the weights and biases of the neural networks. We used the mean squared error of the 30-minute turbulent heat flux estimates as our loss function, the same criterion used as the objective function in the calibration of SA. To regularize the network, dropout was applied after the first layer and before the final layer with a retention rate of 0.9. Dropout randomly prunes some fraction (one minus the retention rate) of the nodes in a given layer during training, which introduces stochasticity into the model architecture and reduces the likelihood of overfitting.
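A minimal sketch of this architecture in Keras is shown below; the number of input features, the exponential decay constants, and the interpretation of "before the final layer" as immediately before the output layer are assumptions rather than values reported here.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 8  # hypothetical number of input forcing variables

def build_model(n_features=N_FEATURES):
    """Six dense layers of 48 tanh nodes, dropout (retention rate 0.9, i.e.
    dropout rate 0.1) after the first layer and before the output layer, and
    two outputs: the latent and sensible heat fluxes."""
    model = keras.Sequential()
    model.add(layers.Dense(48, activation="tanh", input_shape=(n_features,)))
    model.add(layers.Dropout(0.1))      # dropout after the first layer
    for _ in range(5):
        model.add(layers.Dense(48, activation="tanh"))
    model.add(layers.Dropout(0.1))      # dropout before the final (output) layer
    model.add(layers.Dense(2))          # latent and sensible heat fluxes

    # SGD with an exponential learning rate decay; the initial rate and decay
    # constants here are illustrative assumptions.
    schedule = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.96)
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=schedule),
                  loss="mse")
    return model
```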
When training the networks we performed a 5-fold cross validation. We used 48 sites to train each network and then applied it out of sample to each of the remaining 12 sites. The data from the 48 training sites were randomly shuffled and split into 80% training and 20% validation data. The validation data were used to define an early stopping criterion, where training was stopped if the validation loss had not decreased for 10 training epochs. This procedure keeps the model from overfitting on the training data. The maximum number of training epochs was set to 500, with a batch size of 768 data points (equivalent to 16 days of half-hourly data). All data were shuffled before training to remove any temporal ordering the model could learn, which also reduces overfitting.
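A minimal sketch of the training procedure for a single cross-validation fold, assuming the `build_model` sketch above and in-memory feature/target arrays (array names and the split utility are illustrative):

```python
from tensorflow import keras
from sklearn.model_selection import train_test_split

# x_fold, y_fold: stacked half-hourly samples from the 48 training sites for
# this fold (hypothetical array names); y_fold has two columns, the latent
# and sensible heat fluxes.
x_train, x_val, y_train, y_val = train_test_split(
    x_fold, y_fold, test_size=0.2, shuffle=True)

# Stop training if the validation loss has not decreased for 10 epochs.
early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)

model = build_model()
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=500,        # maximum number of training epochs
    batch_size=768,    # half-hourly samples per batch
    shuffle=True,      # shuffle each epoch to remove temporal ordering
    callbacks=[early_stopping],
)
```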