Figure 3. Empirical CDFs of performance measures for simulations across all sites. a) shows the NSE for latent heat, b) the NSE for sensible heat, c) the KGE for latent heat, and d) the KGE for sensible heat.
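Both measures can be computed from paired observed and simulated flux series. The following sketch gives the standard definitions (the Nash-Sutcliffe efficiency, and the Kling-Gupta efficiency of Gupta et al., 2009); the function names are illustrative and not taken from our implementation:

    import numpy as np

    def nse(obs, sim):
        # Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
        # to the variance of the observations; range (-inf, 1].
        obs, sim = np.asarray(obs, float), np.asarray(sim, float)
        return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

    def kge(obs, sim):
        # Kling-Gupta efficiency: combines the correlation (r), the ratio of
        # standard deviations (alpha), and the ratio of means (beta);
        # range (-inf, 1].
        obs, sim = np.asarray(obs, float), np.asarray(sim, float)
        r = np.corrcoef(obs, sim)[0, 1]
        alpha = sim.std() / obs.std()
        beta = sim.mean() / obs.mean()
        return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)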
Even though the curves of the performance measures look quite similar between NN1W and NN2W, the performance differences from SA were not always perfectly correlated. Figure 4 shows the change in performance from SA for each site, ranked by SA performance. The maximum possible improvement is also shown as a reference, to account for the fact that the range of both NSE and KGE is (-∞,1]: there is more room for improvement at poorly performing sites than at well-performing sites. For both performance measures and both fluxes, the general pattern of improvement follows the maximum-improvement curve, with some added noise.
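The quantities shown in Figure 4 follow directly from the per-site scores. Because both NSE and KGE are bounded above by 1, the maximum possible improvement at a site with SA score s is 1 − s. A minimal sketch, assuming arrays of per-site scores (the array names and values below are hypothetical placeholders, not our results):

    import numpy as np

    # Hypothetical per-site scores (NSE or KGE); placeholder values only.
    sa = np.array([0.82, 0.61, 0.30, -0.45])   # SA configuration
    nn = np.array([0.84, 0.70, 0.55, 0.10])    # NN-based configuration

    order = np.argsort(sa)[::-1]           # rank sites from best to worst SA score
    improvement = (nn - sa)[order]         # change in performance from SA
    max_improvement = (1.0 - sa)[order]    # upper bound, since NSE, KGE <= 1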
While on average the NN-based configurations performed better than the
SA simulations, they performed worse at some locations. NN-based
simulations generally had a higher NSE, but the KGE scores were more
mixed for sensible heat, with SA outperforming the NN-based
configurations at a number of sites. The NN-based configurations
performed much worse at AT-Neu, DK-Eng, and CH-Cha (the outliers in the
lowest 25th percentile of Figure 4d), where they failed to simulate
large, upward, nighttime sensible heat fluxes. SA also performed poorly
for these nighttime fluxes, but to a lesser extent. For latent heat,
while some sites showed higher NSE and KGE values for SA results than
for the NN-based simulations, more sites showed poor performance across
all configurations when evaluated by NSE. Decreases in performance
relative to SA mostly occurred where the NN-based configurations
consistently overestimated latent heat during winter, which most likely
stems from our assumption that all latent heat is treated as
transpiration. In both cases where SA outperformed the NN-based configurations, we believe that their performance could be improved with more training data or more sophisticated ML methods, since the number of outliers was small and the average performance improvement was large.