Comparison of Combination Methods to Create Calibrated Ensemble Forecasts for Seasonal Influenza in the U.S.
Abstract
The characteristics of influenza seasons vary substantially from year to year, posing challenges for public health preparation and response. Influenza forecasting is used to inform seasonal outbreak response, which can in turn potentially reduce the societal impact of an epidemic. The United States Centers for Disease Control and Prevention, in collaboration with external researchers, has run an annual prospective influenza forecasting exercise, known as the FluSight challenge. A subset of participating teams has worked together to produce a collaborative multi-model ensemble, the FluSight Network ensemble. Uniting theoretical results from the forecasting literature with domain-specific forecasts from influenza outbreaks, we applied parametric forecast combination methods that simultaneously optimize individual model weights and calibrate the ensemble via a beta transformation. We used the beta-transformed linear pool and the finite beta mixture model to produce ensemble forecasts retrospectively for the 2016/2017 to 2018/2019 influenza seasons in the U.S. We compared their performance to methods currently used in the FluSight challenge, namely the equally weighted linear pool and the linear pool. Based on average log scores, ensemble forecasts produced from methods with a beta transformation outperformed those from the equally weighted linear pool and the linear pool for all week-ahead targets in the test seasons. We observed improvements in overall accuracy despite the modest under-prediction of the beta-transformed linear pool and beta mixture methods across all targets and seasons. Combination techniques that explicitly adjust for known calibration issues in linear pooling should be considered to improve ensemble probabilistic scores in outbreak settings.
1 Introduction
Seasonal influenza outbreaks pose public health challenges and cause a large morbidity and mortality burden worldwide. The United States Centers for Disease Control and Prevention (CDC) estimates there were 35.5 million cases of influenza, 490,600 influenza-related hospitalizations, and 34,200 deaths from influenza during the 2018–2019 influenza season in the US (Centers for Disease Control and Prevention, 2021a). Influenza forecasting has become integral to public health decision making (Lutz et al., 2019). A forecasting model uses data to make projections of the future trajectory of an infectious disease target, such as cases, hospitalizations and deaths, and can provide uncertainty measures of its predictions. Thus, forecasting models are a powerful tool for public health officials to improve seasonal outbreak preparedness and response, which can in turn potentially reduce the burden of seasonal influenza. The CDC’s establishment of the Center for Forecasting and Outbreak Analytics in August of 2021 (Centers for Disease Control and Prevention, 2021b) highlights a critical need to advance the use of infectious disease forecasting and modeling.
To provide public health officials real-time, prospective information about the future trajectory of seasonal influenza, the CDC, in collaboration with external researchers, started an annual prospective influenza forecasting exercise in the U.S., known as the FluSight challenge, in 2013. This exercise has been conducted with the goal of improving forecast accuracy and the integration of forecasts with real-time public health decision making. Multiple academic and non-academic groups submit weekly forecasts to the FluSight challenge. A submission typically contains probabilistic and point forecasts for seven targets in each of the 10 Health and Human Services (HHS) regions in the U.S. as well as at the national level. In this manuscript, we focus on the probabilistic forecasts, in which a predictive distribution is specified for the outcome of interest. All forecast targets are based on the weighted percentage of outpatient visits for influenza-like illness (wILI) collected through the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet), weighted by state populations.
Constructing a single ensemble forecast that combines the forecasts from multiple individual models has advantages. An ensemble forecast unifies signals from many models into a single forecast, making it easier for stakeholders to understand. In addition, ensemble forecasts have been shown to consistently achieve a high degree of accuracy and often outperform individual forecasts of infectious disease targets (Yamana, Kandula and Shaman, 2016; Yamana, Kandula and Shaman, 2017; DeFelice et al., 2017; Viboud et al., 2018; Ray et al., 2020; Chowell et al., 2020; Cramer et al., 2021; Oidtman et al., 2021; Ray et al., 2022). A subset of teams participating in the FluSight challenge has produced a collaborative multi-model ensemble, the FluSight Network ensemble, using stacked generalization; in particular, the FluSight Network ensemble is calculated as a linear combination of the individual forecasts.
Despite the success of linear combination methods such as the one used to produce the FluSight Network ensemble, their forecasts lack calibration (Hora, 2004; Gneiting and Ranjan, 2013). Gneiting and Ranjan (2013) proved that linear aggregation increases the dispersion of the combined predictive distribution and may therefore result in overdispersed ensemble forecasts even when the individual forecasts are well-calibrated. More generally, a simple linear combination of individual forecasts may produce miscalibrated ensemble forecasts unless their calibration is explicitly adjusted for.
Previous work has presented parametric and nonparametric approaches to combining and calibrating ensemble forecasts. The beta-transformed linear pool is a combination formula that calibrates the combined predictive distribution by overlaying the linear pool with a beta transformation (Ranjan and Gneiting, 2010; Gneiting and Ranjan, 2013). Bassetti, Casarin and Ravazzolo (2018) proposed an extension to the beta-transformed linear pool, using a Bayesian nonparametric approach to estimate infinite beta mixture models. This method achieves a theoretically stronger result of probabilistic calibration compared to the beta-transformed linear pool by extending the flexibility of the combination function. Kuleshov and Deshpande (2021) introduced calibrated risk minimization as a principle that maximizes sharpness subject to calibration by adding calibration loss as a constraint in the loss function. Rumack, Tibshirani and Rosenfeld (2021) presented a post-processing method called the recalibration ensemble that combines and calibrates forecasts in separate steps and applied this method to recalibrating epidemic forecasts.
In practice, there is merit in selecting parsimonious models and combination methods with computationally efficient estimation (Claeskens et al., 2016; Baran and Lerch, 2018; Stanescu and Pandey, 2016). The optimal degree of flexibility and computational complexity of combination methods often varies across applications. Baran and Lerch (2018) compare the performance of multiple forecast combination methods and assess the degree of flexibility that combination methods need to yield the best practical results for post-processing applications in forecasting wind speed and precipitation. In influenza probabilistic forecasting, Ray and Reich (2018) study a range of weighting schemes with different levels of complexity for generating ensemble forecasts via a feature-weighted ensemble approach that combines aspects of linear pooling or stacking and gradient boosting. In both of these studies, the methods with an intermediate level of flexibility yielded better predictive performance in their respective applications.
This work aims to add to the growing field of infectious disease probabilistic forecasting by investigating, in the U.S. seasonal influenza setting, the accuracy and probabilistic calibration of ensemble forecasts produced from combination methods that combine and calibrate simultaneously, without any knowledge of the underlying model structure of the individual models or the ability to reproduce their forecasts. Using 27 individual models from the FluSight Network, we apply the linear pool, the beta-transformed linear pool, and the finite beta mixture approach to combine predictive distributions. We also examine whether more parsimonious variations of the aforementioned methods with fixed, equal individual model weights are sufficient to produce accurate and well-calibrated ensemble forecasts. We modify the estimation approaches of the methods with a beta transformation to accommodate the binned probability distribution representation used in the FluSight challenge (McGowan et al., 2019).
Section 2 reviews the CDC influenza data, forecast targets, and forecast combination methods. Section 3 describes the application of the combination methods to seasonal influenza forecasting and presents results. Section 4 discusses the results in the context of related work, real-time forecasting operations, and data-driven public health decision making.
2 Methods
2.1 Influenza Data
The U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) publishes the weekly percentage of outpatient doctor’s office visits due to influenza-like illness weighted by state populations (wILI). ILINet is a syndromic surveillance system that includes more than 3,000 providers (Centers for Disease Control and Prevention, 2020). The CDC Influenza Division reports weekly estimates of wILI for the United States and for the 10 Health and Human Services (HHS) regions (Figure 1).

2.2 Forecast targets
Forecasts submitted to the CDC FluSight challenge typically consist of three seasonal targets and four short-term targets. In this study, we produce ensemble forecasts of short-term 1-4 week ahead wILI for all locations from the 2016/2017 to 2018/2019 influenza seasons. We do not include forecasts of seasonal targets, such as the peak week, peak incidence, and seasonal onset, because probabilistic forecasts of these targets lose relevance once the corresponding events have been observed in a given season.
2.3 Forecast combination methods
Let $f_1, \ldots, f_M$ and $F_1, \ldots, F_M$ be predictive probability density functions (PDFs) and cumulative distribution functions (CDFs), respectively, for a real-valued forecast target, $Y$, from $M$ individual models. The combination methods described in this section include the linear pool as a baseline method and the beta-transformed linear pool and finite beta mixture combination as the methods that combine and calibrate forecasts.
2.3.1 Linear pool (LP and EW-LP)
The linear pool is a mixture model with a predictive density
$$f(y) = \sum_{m=1}^{M} w_m f_m(y), \qquad (1)$$

where $w_m$ is a nonnegative weight for the $m$th individual model and $\sum_{m=1}^{M} w_m = 1$. The equally weighted linear pool (EW-LP) is a special case of the LP with the weights fixed to $w_m = 1/M$.
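As a concrete illustration of equation (1), here is a minimal Python sketch (ours, not the authors' code; the array shapes are assumptions) that combines binned forecasts, the representation used throughout this paper, under the LP and EW-LP:

```python
import numpy as np

def linear_pool(bin_probs, weights=None):
    """Combine M binned forecasts into a single LP forecast.

    bin_probs: (M, J) array, where row m holds model m's probabilities
    over J bins. weights: length-M nonnegative weights summing to one;
    equal weights (the EW-LP) are used if omitted.
    """
    bin_probs = np.asarray(bin_probs, dtype=float)
    M = bin_probs.shape[0]
    w = np.full(M, 1.0 / M) if weights is None else np.asarray(weights)
    return w @ bin_probs  # weighted average of bin probabilities

# Toy example: three models, four bins
p = np.array([[0.1, 0.4, 0.4, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.0, 0.3, 0.5, 0.2]])
print(linear_pool(p))                   # EW-LP
print(linear_pool(p, [0.5, 0.3, 0.2]))  # weighted LP
```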
2.3.2 Beta-transformed linear pool (BLP and EW-BLP)
Gneiting and Ranjan (2013) demonstrate that the LP produces forecasts that lack calibration even when the component forecasts are well-calibrated, and propose a flexible alternative approach, the beta-transformed linear pool (BLP), which has a predictive CDF defined by
$$F(y) = B_{\alpha,\beta}\left(\sum_{m=1}^{M} w_m F_m(y)\right), \qquad (2)$$

where $B_{\alpha,\beta}$ denotes the CDF of the beta distribution with parameters $\alpha, \beta > 0$, $w_m$ is a nonnegative weight for the $m$th individual model, and $\sum_{m=1}^{M} w_m = 1$. To find the predictive PDF of the BLP we can differentiate the above CDF, finding
$$f(y) = \left(\sum_{m=1}^{M} w_m f_m(y)\right) b_{\alpha,\beta}\left(\sum_{m=1}^{M} w_m F_m(y)\right), \qquad (3)$$

where $b_{\alpha,\beta}$ is the PDF of the beta distribution. The LP is a special case of the BLP with $\alpha = \beta = 1$. The equally weighted component variation of this method is the equally weighted beta-transformed linear pool (EW-BLP), which is a special case of the BLP with fixed weights $w_m = 1/M$. Figure 2 demonstrates how the BLP's beta transformation operates on the LP's predictive CDF.
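For binned forecasts, equation (2) can be applied by evaluating the LP's CDF at the bin edges and pushing it through the beta CDF; differencing then recovers bin probabilities. A minimal sketch under the same assumptions as the LP sketch above:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def blp_bin_probs(bin_probs, weights, a, b):
    """Beta-transformed linear pool (equation (2)) for binned forecasts."""
    lp = np.asarray(weights) @ np.asarray(bin_probs, dtype=float)
    cdf_edges = np.concatenate(([0.0], np.cumsum(lp)))  # LP CDF at bin edges
    return np.diff(beta_dist.cdf(cdf_edges, a, b))      # beta transform, then difference
```

With a = b = 1 the beta CDF is the identity and the LP bin probabilities are recovered, mirroring the special-case relationship noted above.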
2.3.3 Finite beta mixture combination (BMC$_K$ and EW-BMC$_K$)
Bassetti, Casarin and Ravazzolo (2018) use a Bayesian approach to extend the BLP to finite and infinite beta mixtures for combining and calibrating predictive distributions. Baran and Lerch (2018) note the high computational cost of estimating this approach. Due to the computational burden of the Bayesian approach, we choose to employ a frequentist approach to estimate a finite beta mixture model
$$F(y) = \sum_{k=1}^{K} \omega_k B_{\alpha_k,\beta_k}\left(\sum_{m=1}^{M} w_{k,m} F_m(y)\right), \qquad (4)$$

where $K$ is the number of beta components, $\omega_k$ is a beta mixture weight for the $k$th beta component with $\sum_{k=1}^{K} \omega_k = 1$, $B_{\alpha_k,\beta_k}$ denotes the CDF of the beta distribution with parameters $\alpha_k, \beta_k > 0$, and $w_{k,1}, \ldots, w_{k,M}$ comprise the individual model weights specific to each beta component, with $w_{k,m} \geq 0$ and $\sum_{m=1}^{M} w_{k,m} = 1$. Differentiating the CDF, the predictive density of the BMC$_K$ is

$$f(y) = \sum_{k=1}^{K} \omega_k \left(\sum_{m=1}^{M} w_{k,m} f_m(y)\right) b_{\alpha_k,\beta_k}\left(\sum_{m=1}^{M} w_{k,m} F_m(y)\right). \qquad (5)$$
The equally weighted variation of the finite beta mixture combination approach (EW-BMC$_K$) is a special case of the BMC$_K$ with $w_{k,m} = 1/M$. With $K = 1$, the BMC$_K$ and the EW-BMC$_K$ become the BLP and the EW-BLP, respectively.
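Since each component of equation (4) is itself a BLP, a sketch for binned forecasts only needs to mix the components (reusing the hypothetical blp_bin_probs helper from the previous sketch):

```python
import numpy as np

def bmc_bin_probs(bin_probs, omega, W, a, b):
    """Finite beta mixture combination (equation (4)) for binned forecasts.

    omega: K mixture weights summing to one; W: (K, M) model weights per
    beta component; a, b: length-K beta parameters.
    """
    return sum(omega[k] * blp_bin_probs(bin_probs, W[k], a[k], b[k])
               for k in range(len(omega)))
```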

2.4 Modification of the BLP and BMC$_K$ methods for combining discrete distributions
The predictive density functions of the BLP and the BMC$_K$ are given by equations (3) and (5), respectively. However, the forecasts of 1-4 week ahead wILI, which is a continuous measure of disease incidence, are represented using a binned probability format in submissions to the FluSight challenge. Here we describe a modification to the BLP and BMC$_K$ models to handle this discretized representation of the target variable.
Let $F$ denote a predictive CDF of a forecasting model, $Y$ be the outcome variable, and $B_1, \ldots, B_J$ be a collection of disjoint bins covering the set of possible outcomes for $Y$, with $B_j = [l_j, u_j)$ for $j = 1, \ldots, J$ (the FluSight Challenge uses a slightly different definition of the bins (McGowan et al., 2019); this detail does not have a practical impact on the setup because the influenza-like-illness measure is continuous). An individual forecast of $Y$ consists of an assignment of probabilities to each of the bins:
$$p_j = P(Y \in B_j) = F(u_j) - F(l_j), \qquad j = 1, \ldots, J, \qquad (6)$$

$$\text{with } \sum_{j=1}^{J} p_j = 1. \qquad (7)$$
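As an illustration of equations (6) and (7), the following sketch (ours; the normal predictive distribution and 0.1-wide bins are toy assumptions) discretizes a continuous predictive CDF into bin probabilities:

```python
import numpy as np
from scipy.stats import norm

def discretize_cdf(cdf, edges):
    """Bin probabilities p_j = F(u_j) - F(l_j) over bins [l_j, u_j),
    renormalized so they sum to one over the covered range."""
    F = cdf(np.asarray(edges, dtype=float))
    return np.diff(F) / (F[-1] - F[0])

# Toy example: a normal predictive distribution over 0.1-wide wILI bins
edges = np.arange(0.0, 13.1, 0.1)
p = discretize_cdf(lambda x: norm.cdf(x, loc=2.5, scale=0.8), edges)
print(len(p), round(p.sum(), 6))  # 130 bins, probabilities sum to 1
```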
In order to estimate the parameters of the BMC$_K$, we modify the log-likelihood function, which is the log of equation (5), for a single observation that falls in bin $B_j$ to be

$$\log \tilde{p}_j = \log\left[\tilde{F}(u_j) - \tilde{F}(l_j)\right] = \log\left[\sum_{k=1}^{K} \omega_k \left\{B_{\alpha_k,\beta_k}\left(\sum_{m=1}^{M} w_{k,m} F_m(u_j)\right) - B_{\alpha_k,\beta_k}\left(\sum_{m=1}^{M} w_{k,m} F_m(l_j)\right)\right\}\right],$$

where $\tilde{p}_j$ is the probability assigned to bin $B_j$ by the BMC$_K$'s discretized predictive distribution, $\tilde{F}$ is the continuous predictive CDF of the BMC$_K$, $F_1, \ldots, F_M$ are the predictive CDFs of the individual models, and $p_{m,j}$ is the probability assigned to bin $B_j$ by individual model $m$'s discretized predictive distribution, so that $F_m(u_j) = \sum_{i=1}^{j} p_{m,i}$ can be computed directly from the submitted binned probabilities.
Since the BLP is a special case of the BMC$_K$ with $K = 1$, the modified log-likelihood function of the BLP is the same as above with a single beta component term in the outer summation.
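A minimal sketch of how this modified log-likelihood could be maximized for the BLP (the $K = 1$ case), assuming an unconstrained parameterization with log-transformed beta parameters and softmax weights; this is an illustration, not the estimation code used in the study:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from scipy.stats import beta as beta_dist

def blp_avg_loglik(params, bin_probs, obs_bins):
    """Average modified log-likelihood of the BLP over N forecasts.

    params: [log alpha, log beta, v_1, ..., v_M], with model weights
    recovered as softmax(v); bin_probs: (N, M, J) array of individual
    binned forecasts; obs_bins: length-N indices of the observed bins.
    """
    a, b = np.exp(params[0]), np.exp(params[1])
    w = softmax(params[2:])
    ll = 0.0
    for probs, j in zip(bin_probs, obs_bins):
        edges = np.concatenate(([0.0], np.cumsum(w @ probs)))  # LP CDF at bin edges
        tr = beta_dist.cdf(edges, a, b)                        # beta transformation
        ll += np.log(max(tr[j + 1] - tr[j], 1e-300))           # observed-bin probability
    return ll / len(obs_bins)

def fit_blp(bin_probs, obs_bins):
    """Maximize the average log score, starting from the EW-LP."""
    M = bin_probs.shape[1]
    res = minimize(lambda x: -blp_avg_loglik(x, bin_probs, obs_bins),
                   np.zeros(2 + M), method="Nelder-Mead")
    return np.exp(res.x[0]), np.exp(res.x[1]), softmax(res.x[2:])
```

With 27 component models a derivative-free optimizer such as Nelder-Mead is slow; a gradient-based optimizer or an EM-style algorithm over the same objective would be preferable in practice.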
2.5 Data and code accessibility
3 Application in seasonal influenza forecasting in the U.S.
We apply the combination methods introduced in Section 2 to prospective forecasts from 27 individual forecasting models (Table S1, Supplementary Material (Wattanachit et al., 2022)) available in the FluSight Network repository (FluSight Network, 2020) to generate weekly ensemble forecasts of 1-4 week ahead wILI for the United States and the 10 Health and Human Services (HHS) regions from the 2016/2017 to 2018/2019 influenza seasons.
3.1 Forecast evaluation
We follow the FluSight Challenge guidelines (Centers for Disease Control and Prevention, 2019a) by using the logarithmic score, or log score, which is defined as the logarithm of the predictive density or mass function evaluated at the observed data point. The log score is a proper scoring rule that assesses the sharpness and calibration of probabilistic forecasts simultaneously (Gneiting and Raftery, 2007). In the FluSight challenge, where forecasts are represented in a binned probability format, the log score is defined as

$$\log S = \log p_j,$$

where $p_j$ is the probability assigned to bin $B_j$, $y^*$ is the observed value of the forecast target, and $l_j$ and $u_j$ are the pre-specified lower and upper bounds of bin $B_j$ such that $l_j \leq y^* < u_j$.
We generate forecasts from each combination method for all combinations of week, region, target, and season, and calculate their log scores. Following the CDC scoring convention (McGowan et al., 2019), we truncate log scores to be no lower than $-10$. The benefit of this approach is that it enables us to average log scores for a method even when that method receives a log score of $-\infty$ (assigning zero probability to an observed value) for some forecasts. However, this modified log score is formally no longer a proper score. Log scores are averaged across all forecast regions and weeks for each target and test season to obtain summary measures of accuracy for each method and compare their performance.
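A sketch of the truncated log score for a single binned forecast (our helper; the $-10$ floor follows the CDC convention cited above):

```python
import numpy as np

def truncated_log_score(bin_probs, obs_bin, floor=-10.0):
    """Log score of a binned forecast at the observed bin, truncated so
    that a forecast assigning zero probability still scores -10."""
    p = bin_probs[obs_bin]
    return max(np.log(p), floor) if p > 0 else floor
```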
The calibration of a probabilistic forecast addresses the statistical consistency between the predictive distributions of forecasts and the observations. The concept of calibration allows us to assess whether a model produces reliable forecasts, i.e., whether an event to which the model assigns a particular predicted probability really occurs with that frequency in the long run. A forecast is probabilistically calibrated if its probability integral transform (PIT) values are uniformly distributed on the unit interval (Gneiting, Balabdaoui and Raftery, 2007). The probability integral transform is the probability obtained from evaluating the predictive CDF of a model at the observed value, $F(y^*)$. (Since the target variables are discretized in this application, we adapt Definition 2.5 in Gneiting and Raftery (2007) by sampling a uniformly distributed PIT value between $F(l_j)$ and $F(u_j)$, where $B_j$ is the bin containing the observed value.)
To assess the probabilistic calibration of the ensemble forecasts (i.e., the uniformity of the PIT values), we use a graphical tool called the probability plot, which plots the empirical CDF of the PIT values. Specifically, we compute the PIT values of all observations in the test seasons and plot their empirical CDFs by target and season. The empirical CDF curve should follow the 45-degree line bisecting the plot if the forecasts are probabilistically calibrated. Where deviations from uniformity are observed, the shape of the empirical CDF curve of the PIT values suggests the causes behind the lack of probabilistic calibration (Laio and Tamea, 2007). For example, PIT values concentrating near 0 and 1 indicate that the observed values fall in the tails of the predictive distributions more frequently than expected, i.e., the probability plot shows a slope steeper than 1 near PIT values of 0 and 1, so that the predictive distributions were too narrow. To quantitatively measure the deviation of a PIT CDF curve from a standard uniform CDF, we compute the Cramer distance (Gneiting and Raftery, 2007; Rizzo and Székely, 2016), $\int_0^1 \left(F_{\mathrm{PIT}}(x) - G(x)\right)^2 dx$, where $F_{\mathrm{PIT}}$ is a PIT CDF curve and $G$ is the standard uniform CDF. The Cramer distance can be viewed as a summary measure of calibration; however, it lacks the diagnostic property of the probability plot.
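The randomized PIT for binned forecasts and a grid approximation of the Cramer distance can be sketched as follows (our helpers, not the study's evaluation code):

```python
import numpy as np

def pit_value(bin_probs, obs_bin, rng):
    """Randomized PIT for a binned forecast: a uniform draw between the
    predictive CDF at the lower and upper edges of the observed bin."""
    cdf = np.concatenate(([0.0], np.cumsum(bin_probs)))
    return rng.uniform(cdf[obs_bin], cdf[obs_bin + 1])

def cramer_distance(pit_values, grid_size=2000):
    """Approximate the integral of (F_PIT(x) - x)^2 over [0, 1], where
    F_PIT is the empirical CDF of the PIT values."""
    x = np.linspace(0.0, 1.0, grid_size)
    ecdf = np.searchsorted(np.sort(pit_values), x, side="right") / len(pit_values)
    return np.mean((ecdf - x) ** 2)  # grid average equals the integral on [0, 1]

rng = np.random.default_rng(1)
pits = np.array([pit_value(np.array([0.1, 0.4, 0.4, 0.1]), 1, rng)
                 for _ in range(200)])
print(cramer_distance(pits))
```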
3.2 Parameter estimation
In our application, the training data set consists of individual forecasts and influenza data from the 2010/2011 to 2018/2019 season. The test data set includes the 2016/2017 to 2018/2019 influenza seasons. When generating forecasts for a test set season, the training data set consists of all the influenza seasons preceding that season.
The parameters of each combination method are estimated simultaneously by maximizing the average log score, which is positively oriented (i.e., higher scores are better), via maximum likelihood estimation over a training data set. The parameters are chosen to be target-specific to allow for variations among targets. We update the parameter estimates for each test season, so there are 12 sets of parameters to be estimated for four targets and three test seasons. We modify the estimation approach for the BLP, EW-BLP, BMC$_K$, and EW-BMC$_K$ as outlined in the Methods section in order to apply the combination methods to individual forecasts in a binned probability representation.
3.2.1 Choice of $K$ for finite beta mixture combination approaches
We use a leave-one-season-out cross-validation approach to select the number of beta components, $K$, in the BMC$_K$ and EW-BMC$_K$ for each target-test season pair. Specifically, we train the BMC$_K$ and EW-BMC$_K$ using $K = 2$ through 5 on each subset of the training data with one season left out and use those ensemble fits to generate forecasts for the left-out influenza season. Log scores for all combination methods are calculated for all unique forecasts, then averaged across all weeks, regions, and validation seasons to obtain a single mean validation log score for each target and method. In order to take model complexity into account, we calculate mean validation log scores across all locations for each validation season in the training seasons, compute a standard error for each target-test season pair, and select the smallest $K$ for the BMC$_K$ and EW-BMC$_K$ with mean validation log scores within one standard error of the best log score in a particular target-test season pair.
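The one-standard-error selection rule can be sketched as follows (our helper; the toy log scores mirror the 2-week-ahead 2018/2019 row of Table 1, and the standard errors are made up):

```python
import numpy as np

def select_K_one_se(K_values, mean_scores, se_scores):
    """Smallest K whose mean validation log score is within one standard
    error of the best (highest) mean validation log score."""
    best = int(np.argmax(mean_scores))
    threshold = mean_scores[best] - se_scores[best]
    for K, score in zip(K_values, mean_scores):
        if score >= threshold:
            return K

print(select_K_one_se([2, 3, 4, 5],
                      np.array([-2.80, -2.79, -2.81, -2.80]),
                      np.array([0.02, 0.02, 0.02, 0.02])))  # -> 2
```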
Based on the mean validation log scores in Table 1, $K = 2$ is selected for the BMC$_K$ and EW-BMC$_K$ for all four targets and three seasons. The variation with $K = 2$ has the best mean validation log scores in every instance other than the 2-week ahead target in the 2018/2019 season. Overall, using a higher number of beta components in the finite beta mixture approaches does not substantially improve mean out-of-sample log scores in our application. Thus, the finite beta mixture methods with the most parsimonious number of parameters are selected.
Table 1: Mean validation log scores by week-ahead target and number of beta components $K$ for the BMC$_K$ and EW-BMC$_K$.

| Week ahead | BMC$_2$ | BMC$_3$ | BMC$_4$ | BMC$_5$ | EW-BMC$_2$ | EW-BMC$_3$ | EW-BMC$_4$ | EW-BMC$_5$ |
|---|---|---|---|---|---|---|---|---|
| **2016/2017** | | | | | | | | |
| 1 | -2.49 | -2.50 | -2.50 | -2.50 | -2.50 | -2.50 | -2.50 | -2.50 |
| 2 | -2.74 | -2.75 | -2.75 | -2.76 | -2.76 | -2.76 | -2.76 | -2.76 |
| 3 | -2.95 | -2.95 | -2.95 | -2.97 | -2.92 | -2.92 | -2.92 | -2.92 |
| 4 | -3.08 | -3.09 | -3.10 | -3.11 | -3.03 | -3.03 | -3.04 | -3.04 |
| **2017/2018** | | | | | | | | |
| 1 | -2.49 | -2.50 | -2.50 | -2.51 | -2.51 | -2.51 | -2.51 | -2.52 |
| 2 | -2.75 | -2.75 | -2.76 | -2.77 | -2.78 | -2.78 | -2.78 | -2.78 |
| 3 | -2.96 | -2.96 | -2.97 | -2.98 | -2.94 | -2.95 | -2.95 | -2.95 |
| 4 | -3.09 | -3.09 | -3.10 | -3.10 | -3.06 | -3.06 | -3.06 | -3.06 |
| **2018/2019** | | | | | | | | |
| 1 | -2.51 | -2.52 | -2.52 | -2.53 | -2.55 | -2.55 | -2.55 | -2.55 |
| 2 | -2.80 | -2.79 | -2.81 | -2.80 | -2.85 | -2.84 | -2.85 | -2.85 |
| 3 | -2.99 | -3.00 | -3.00 | -3.01 | -3.03 | -3.03 | -3.03 | -3.03 |
| 4 | -3.13 | -3.14 | -3.13 | -3.14 | -3.15 | -3.15 | -3.15 | -3.15 |
3.3 Results
3.3.1 Overall Summary
Based on mean out-of-sample log scores across all targets and seasons (Figure 3 panel (b)), the BMC$_2$ is the most accurate method, followed by the BLP and LP. Across all three test seasons, the BMC$_2$ outperformed the other five methods for the 3 and 4 week ahead horizons, and performed as well as the BLP for the 2 week ahead horizon (Figure 3 panel (a)). The BLP is the best performing method for the 1 week ahead horizon. The BMC$_2$ is also the best performing method for the 2017/2018 and 2018/2019 seasons based on mean out-of-sample log scores across all four horizons, while the BLP is the best performing method for the 2016/2017 season. These results indicate that the BLP and BMC$_2$ can consistently improve the accuracy of ensemble forecasts compared to the other commonly used methods included in this study despite season-to-season and target variations.
The 1 week ahead forecasts from the BLP and BMC$_2$ methods are more probabilistically calibrated than those from the other methods based on the probability plots and their Cramer distances from the standard uniform CDF. However, the forecasts produced from all beta-transformed methods became less calibrated as forecast horizons increased.
3.3.2 Comparison of combination methods’ accuracy
Across all targets and seasons, the BMC$_2$ has the best mean out-of-sample log score, though it only marginally outperformed the BLP and LP (Figure 3 panel (b)). Across all three test seasons, the BLP has the best mean out-of-sample log score for the 1 week ahead horizon (Figure 3 panel (a)). The BMC$_2$, which is the most flexible method in this study, has the best mean out-of-sample log scores for the 3 and 4 week ahead horizons. It also performed as well as the BLP, which has the best mean out-of-sample log score for the 2 week ahead horizon. Across all four target horizons, the BLP is the most accurate method for the 2016/2017 season, while the BMC$_2$ is the most accurate for the 2017/2018 and 2018/2019 seasons.
The individual observation-level log scores of the ensemble forecasts from the beta-transformed combination methods exhibit higher variation compared to those from the EW-LP and the LP for all targets and test seasons (Figure 4). This aligns with the theoretical result that the LP tends to produce overdispersed, wider forecasts, resulting in less extreme log scores. The performance of the BMC$_2$ in the test seasons was less consistent than its superior performance across all targets and seasons in the training periods, which indicates some degree of over-fitting; however, it was always among the top two methods in terms of out-of-sample log score across all targets and seasons (Figure 3). We also see notably higher variation in log scores across all methods for all targets in the 2017/2018 season, which was one of the most severe and longest flu seasons in recent years (Centers for Disease Control and Prevention, 2019b). Section 2 in the Supplementary Material (Wattanachit et al., 2022) provides detailed results of mean out-of-sample log scores aggregated and disaggregated at different levels.


3.3.3 Comparison of combination methods’ calibration
The empirical CDF curves of PIT values from probabilistically calibrated forecasts should follow the CDF of a standard uniform distribution, that is, a diagonal line between 0 and 1. To quantify the deviation from the CDF of a standard uniform distribution, we computed the Cramer distances between the CDF curves of PIT values and the CDF of a standard uniform distribution. Overall, the BLP and BMC$_2$ methods are more probabilistically calibrated than the other methods for the 1 week ahead horizon, as their empirical CDF curves deviate less from the reference line (Figure 5) and their Cramer distances are the lowest (Figure 6). However, the forecasts produced from all beta-transformed methods became less calibrated as forecast horizons increased. Across all horizons in the test period except the 1 week ahead horizon, the empirical CDF curves of the PIT values from the equally weighted beta-transformed methods were the most miscalibrated among all beta-transformed methods, as indicated by their Cramer distances being the highest.
Based on the probability plots in Figure 5, all combination methods produced forecasts that lack probabilistic calibration in the test period. Recall that according to theory, the LP will produce ensemble forecasts with too-wide predictive distributions when individual models are well-calibrated (Gneiting and Ranjan, 2013). The results (Figure 5) in our application during the training period are consistent with this theory. Specifically, the forecasts from the LP and EW-LP tended to be too wide, i.e., more observed values concentrated near the center of the predictive distributions than expected for a well-calibrated model, as indicated by the slopes of the PIT CDF curves being higher than 1 for intermediate PIT values and lower than 1 near 0 and 1 (Figure 5).
The beta-transformed combination methods were successful at correcting the overdispersion of the LP and EW-LP. However, their forecasts are miscalibrated in the test period due to under-prediction (Figure 5 panel (b)). The empirical CDF curves of the PIT values from the BLP, EW-BLP, BMC$_2$, and EW-BMC$_2$ were below the reference line across all PIT values from the test period, indicating consistent under-prediction, i.e., the predictive distributions tended to be concentrated below the observed values.
Despite the under-prediction across horizons observed in both training and test periods, the beta-transformed combination methods' probabilistic calibration was notably better in the training period, especially for the 3 and 4 week ahead horizons (Figure 5 panel (a)). In addition, the calibration of the forecasts produced from the EW-BLP and EW-BMC$_2$ was similar to that of the forecasts produced from the BLP and BMC$_2$ in the training period, but worse in the test period, as their Cramer distances are notably higher than those of the BLP and BMC$_2$ in the test period. Similar to the calibration observed in the test period, the LP and EW-LP also produced wide forecasts in the training period, albeit with more miscalibration in the upper tails across all horizons.
The probability plots by target-season pairs (Figure S6, Supplemental Material (Wattanachit et al., 2022)) show calibration results similar to those in Figure 5: the empirical CDF curves of the PIT values of forecasts produced from the LP methods in the test seasons also appear miscalibrated in the lower tail. The calibration of all methods by target-season pairs is discussed in more detail in Section 3 of the Supplemental Material (Wattanachit et al., 2022).
3.3.4 Comparison of accuracy and calibration between combination methods with equally weighted and with optimally weighted individual models
The equally weighted variations of the combination methods (EW-LP, EW-BLP, and EW-BMC$_2$), though more parsimonious, had sub-optimal forecast accuracy compared to their counterparts that estimated weights for the individual models in this application. The EW-LP was the worst overall method across all targets and seasons, and all equally weighted variations had worse mean out-of-sample log scores compared to their more complex counterparts (Figure 3). Additionally, the equally weighted ensembles generally had poorer calibration than the corresponding weighted variations in both the training period and the test period (Figures 5, 6).


4 Discussion
As demonstrated in the forecasting literature, in many settings ensemble forecasts have consistently superior performance and give decision makers the ability to unify the strengths and diversity of individual models into one forecast. These advantages are of great importance in practice for infectious disease forecasting (DeFelice et al., 2017; Reich et al., 2019; Chowell et al., 2020; McAndrew and Reich, 2020; Ray et al., 2020, 2022). This work aims to offer insight into the forecast accuracy and calibration of parametric combination methods in which calibration and individual model weight estimation happen simultaneously, in the application of seasonal influenza forecasting in the U.S.
We applied the linear pool, the beta-transformed linear pool, and the finite beta mixture combination method to available forecasts in the FluSight challenge to produce retrospective ensemble forecasts for seasonal influenza in the U.S. for three test seasons and compared their performance. Our results showed that two of the combination methods included in this study, the BLP and BMC$_2$, offered consistently superior forecast accuracy relative to the LP and EW-LP. Either the BLP or the BMC$_2$ or both delivered better mean log scores than the other methods across the test seasons for all targets, and across all targets for all seasons. Despite using different methods to create ensemble forecasts, our findings agree with those of Rumack, Tibshirani and Rosenfeld (2021): combination methods that take forecast calibration into account improve the accuracy of seasonal influenza ensemble forecasts in the U.S.
The BMC$_2$ uses twice as many parameters as the BLP, but only marginally outperformed it in two out of four targets and five out of twelve target-season pairs. Considering the large number of individual models in the FluSight challenge, the BLP may be more easily applicable in practice than the BMC$_2$, as it has half as many individual model weights and beta parameters to estimate. Although the LP under-performed relative to the BLP and BMC$_2$ for most targets and seasons, the differences in mean log scores were typically small. More parsimonious combination methods with fixed, equally weighted individual models, namely the EW-LP, EW-BLP, and EW-BMC$_2$, appear not to be flexible enough to deliver superior performance compared to the other methods in this study. While this is the case for this application, combination methods using equal weights with or without the beta transformation could be useful in other applications where it might be difficult to estimate individual model weights, such as when available models change over time or training data are limited (Ray et al., 2022).
The results on the probabilistic calibration of the ensemble forecasts, measured by the uniformity of the PIT values, are less straightforward. The results for the LP and EW-LP forecasts indicate that they had too-wide predictive distributions across all targets. Despite the beta-transformed combination methods' success at correcting the overdispersion of the ensemble forecasts from the LP and EW-LP, their forecasts exhibited a pattern of systematic under-prediction. This under-prediction is relatively more pronounced at longer forecast horizons, especially in the 2017/2018 and 2018/2019 influenza seasons. In the 2017/2018 season, which was a large influenza season in the U.S. (Centers for Disease Control and Prevention, 2019b), the ensemble forecasts from all combination methods under-predicted to some extent. Note that an overdispersed forecaster, such as the LP and EW-LP, has the advantage of being more likely to capture an extreme season, though it may not be optimal overall based on proper scores. These more conservative methods may be desirable in applications in which stakeholders want to avoid missing a large season at the expense of having too-wide forecasts.
Due to the flexibility of the beta-transformed combination methods and season-to-season variations of influenza, the lack of calibration could stem from overfitting to the training season data, as the CDFs of PIT values in training seasons showed better calibration overall. In this study, the parameters of the beta-transformed combination methods were estimated by maximum likelihood, an objective that is not equivalent to a measure of probabilistic calibration. If probabilistic calibration is of critical importance for the application at hand, other approaches, such as post-hoc calibration techniques that directly target a calibrated forecast distribution, may be appropriate.
Exploring different approaches to selecting training periods could be useful. For instance, Baran and Lerch (2018) use rolling training periods in the application of a similar set of combination methods to wind speed and precipitation forecasting, and Rumack, Tibshirani and Rosenfeld (2021) construct a training period that takes into account the seasonality of epidemic forecasting. In addition, combination methods that require the joint estimation of all parameters, including the parameters of the individual models, may be considered in a setting where the underlying model structure of the individual models is known. An example of such a method is the mixture EMOS model proposed by Baran and Lerch (2016). Since the FluSight challenge provides forecasts only, in this work we selected combination methods that do not require reproducing the forecasts of individual models.
For combining forecasts in outbreak settings, the beta-transformed linear pool (BLP) is a promising alternative to standard linear pooling (LP) methods. Compared with the LP, the BLP has only two additional parameters, $\alpha$ and $\beta$, and the simple modification of the log-likelihood function makes the BLP applicable to combining forecasts in a binned probability format. The BMC$_K$ may add value in instances where the BLP is not flexible enough, though we only see marginal improvement in the mean out-of-sample log scores in our application.
As infectious disease forecasting has come to the forefront of the public health effort to formulate well-informed policies in response to outbreaks, it is critical to gain insight into model combination approaches in order to combine individual models' strengths and produce accurate ensemble forecasts. This study demonstrates an effort to improve our understanding of how forecast combination methods compare in the setting of an infectious disease with a well-established surveillance data pipeline like seasonal influenza.

Funding: This work has been supported by the National Institutes of General Medical Sciences (R35GM119582) and the US CDC (1U01IP001122). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS, the National Institutes of Health, or CDC.

Supplement: Individual Forecasting Model Information and Additional Results. This supplement contains sections 1–3, including individual forecasting model information and additional results of the application.
References
- Baran and Lerch (2016) {barticle}[author] \bauthor\bsnmBaran, \bfnmSándor\binitsS. and \bauthor\bsnmLerch, \bfnmSebastian\binitsS. (\byear2016). \btitleMixture EMOS model for calibrating ensemble forecasts of wind speed. \bjournalEnvironmetrics \bvolume27 \bpages116–130. \bdoi10.1002/env.2380 \endbibitem
- Baran and Lerch (2018) {barticle}[author] \bauthor\bsnmBaran, \bfnmSándor\binitsS. and \bauthor\bsnmLerch, \bfnmSebastian\binitsS. (\byear2018). \btitleCombining predictive distributions for the statistical post-processing of ensemble forecasts. \bjournalInternational Journal of Forecasting \bvolume34 \bpages477–496. \bdoi10.1016/j.ijforecast.2018.01.005 \endbibitem
- Bassetti, Casarin and Ravazzolo (2018) {barticle}[author] \bauthor\bsnmBassetti, \bfnmFederico\binitsF., \bauthor\bsnmCasarin, \bfnmRoberto\binitsR. and \bauthor\bsnmRavazzolo, \bfnmFrancesco\binitsF. (\byear2018). \btitleBayesian Nonparametric Calibration and Combination of Predictive Distributions. \bjournalJournal of the American Statistical Association \bvolume113 \bpages675–685. \bdoi10.1080/01621459.2016.1273117 \endbibitem
- Chowell et al. (2020) {barticle}[author] \bauthor\bsnmChowell, \bfnmGerardo\binitsG., \bauthor\bsnmLuo, \bfnmR.\binitsR., \bauthor\bsnmSun, \bfnmKaiyuan\binitsK., \bauthor\bsnmRoosa, \bfnmK.\binitsK., \bauthor\bsnmTariq, \bfnmA.\binitsA. and \bauthor\bsnmViboud, \bfnmCécile\binitsC. (\byear2020). \btitleReal-time forecasting of epidemic trajectories using computational dynamic ensembles. \bjournalEpidemics \bvolume30 \bpages100379. \bdoi10.1016/j.epidem.2019.100379 \endbibitem
- Claeskens et al. (2016) {barticle}[author] \bauthor\bsnmClaeskens, \bfnmGerda\binitsG., \bauthor\bsnmMagnus, \bfnmJan R.\binitsJ. R., \bauthor\bsnmVasnev, \bfnmAndrey L.\binitsA. L. and \bauthor\bsnmWang, \bfnmWendun\binitsW. (\byear2016). \btitleThe forecast combination puzzle: A simple theoretical explanation. \bjournalInternational Journal of Forecasting \bvolume32 \bpages754–762. \bdoi10.1016/j.ijforecast.2015.12.005 \endbibitem
- Cramer et al. (2021) {barticle}[author] \bauthor\bsnmCramer, \bfnmEstee Y.\binitsE. Y., \bauthor\bsnmRay, \bfnmEvan L.\binitsE. L., \bauthor\bsnmLopez, \bfnmVelma K.\binitsV. K., \bauthor\bsnmBracher, \bfnmJohannes\binitsJ., \bauthor\bsnmBrennen, \bfnmAndrea\binitsA., \bauthor\bsnmRivadeneira, \bfnmAlvaro J. Castro\binitsA. J. C., \bauthor\bsnmGerding, \bfnmAaron\binitsA., \bauthor\bsnmGneiting, \bfnmTilmann\binitsT., \bauthor\bsnmHouse, \bfnmKatie H.\binitsK. H., \bauthor\bsnmHuang, \bfnmYuxin\binitsY., \bauthor\bsnmJayawardena, \bfnmDasuni\binitsD., \bauthor\bsnmKanji, \bfnmAbdul H.\binitsA. H., \bauthor\bsnmKhandelwal, \bfnmAyush\binitsA., \bauthor\bsnmLe, \bfnmKhoa\binitsK., \bauthor\bsnmMühlemann, \bfnmAnja\binitsA., \bauthor\bsnmNiemi, \bfnmJarad\binitsJ., \bauthor\bsnmShah, \bfnmApurv\binitsA., \bauthor\bsnmStark, \bfnmAriane\binitsA., \bauthor\bsnmWang, \bfnmYijin\binitsY., \bauthor\bsnmWattanachit, \bfnmNutcha\binitsN., \bauthor\bsnmZorn, \bfnmMartha W.\binitsM. W., \bauthor\bsnmGu, \bfnmYouyang\binitsY., \bauthor\bsnmJain, \bfnmSansiddh\binitsS., \bauthor\bsnmBannur, \bfnmNayana\binitsN., \bauthor\bsnmDeva, \bfnmAyush\binitsA., \bauthor\bsnmKulkarni, \bfnmMihir\binitsM., \bauthor\bsnmMerugu, \bfnmSrujana\binitsS., \bauthor\bsnmRaval, \bfnmAlpan\binitsA., \bauthor\bsnmShingi, \bfnmSiddhant\binitsS., \bauthor\bsnmTiwari, \bfnmAvtansh\binitsA., \bauthor\bsnmWhite, \bfnmJerome\binitsJ., \bauthor\bsnmWoody, \bfnmSpencer\binitsS., \bauthor\bsnmDahan, \bfnmMaytal\binitsM., \bauthor\bsnmFox, \bfnmSpencer\binitsS., \bauthor\bsnmGaither, \bfnmKelly\binitsK., \bauthor\bsnmLachmann, \bfnmMichael\binitsM., \bauthor\bsnmMeyers, \bfnmLauren Ancel\binitsL. A., \bauthor\bsnmScott, \bfnmJames G.\binitsJ. G., \bauthor\bsnmTec, \bfnmMauricio\binitsM., \bauthor\bsnmSrivastava, \bfnmAjitesh\binitsA., \bauthor\bsnmGeorge, \bfnmGlover E.\binitsG. E., \bauthor\bsnmCegan, \bfnmJeffrey C.\binitsJ. C., \bauthor\bsnmDettwiller, \bfnmIan D.\binitsI. D., \bauthor\bsnmEngland, \bfnmWilliam P.\binitsW. P., \bauthor\bsnmFarthing, \bfnmMatthew W.\binitsM. W., \bauthor\bsnmHunter, \bfnmRobert H.\binitsR. H., \bauthor\bsnmLafferty, \bfnmBrandon\binitsB., \bauthor\bsnmLinkov, \bfnmIgor\binitsI., \bauthor\bsnmMayo, \bfnmMichael L.\binitsM. L., \bauthor\bsnmParno, \bfnmMatthew D.\binitsM. D., \bauthor\bsnmRowland, \bfnmMichael A.\binitsM. A., \bauthor\bsnmTrump, \bfnmBenjamin D.\binitsB. D., \bauthor\bsnmCorsetti, \bfnmSabrina M.\binitsS. M., \bauthor\bsnmBaer, \bfnmThomas M.\binitsT. M., \bauthor\bsnmEisenberg, \bfnmMarisa C.\binitsM. C., \bauthor\bsnmFalb, \bfnmKarl\binitsK., \bauthor\bsnmHuang, \bfnmYitao\binitsY., \bauthor\bsnmMartin, \bfnmEmily T.\binitsE. T., \bauthor\bsnmMcCauley, \bfnmElla\binitsE., \bauthor\bsnmMyers, \bfnmRobert L.\binitsR. L., \bauthor\bsnmSchwarz, \bfnmTom\binitsT., \bauthor\bsnmSheldon, \bfnmDaniel\binitsD., \bauthor\bsnmGibson, \bfnmGraham Casey\binitsG. 
C., \bauthor\bsnmYu, \bfnmRose\binitsR., \bauthor\bsnmGao, \bfnmLiyao\binitsL., \bauthor\bsnmMa, \bfnmYian\binitsY., \bauthor\bsnmWu, \bfnmDongxia\binitsD., \bauthor\bsnmYan, \bfnmXifeng\binitsX., \bauthor\bsnmJin, \bfnmXiaoyong\binitsX., \bauthor\bsnmWang, \bfnmYu-Xiang\binitsY.-X., \bauthor\bsnmChen, \bfnmYangQuan\binitsY., \bauthor\bsnmGuo, \bfnmLihong\binitsL., \bauthor\bsnmZhao, \bfnmYanting\binitsY., \bauthor\bsnmGu, \bfnmQuanquan\binitsQ., \bauthor\bsnmChen, \bfnmJinghui\binitsJ., \bauthor\bsnmWang, \bfnmLingxiao\binitsL., \bauthor\bsnmXu, \bfnmPan\binitsP., \bauthor\bsnmZhang, \bfnmWeitong\binitsW., \bauthor\bsnmZou, \bfnmDifan\binitsD., \bauthor\bsnmBiegel, \bfnmHannah\binitsH., \bauthor\bsnmLega, \bfnmJoceline\binitsJ., \bauthor\bsnmSnyder, \bfnmTimothy L.\binitsT. L., \bauthor\bsnmWilson, \bfnmDavison D.\binitsD. D., \bauthor\bsnmMcConnell, \bfnmSteve\binitsS., \bauthor\bsnmWalraven, \bfnmRobert\binitsR., \bauthor\bsnmShi, \bfnmYunfeng\binitsY., \bauthor\bsnmBan, \bfnmXuegang\binitsX., \bauthor\bsnmHong, \bfnmQi-Jun\binitsQ.-J., \bauthor\bsnmKong, \bfnmStanley\binitsS., \bauthor\bsnmTurtle, \bfnmJames A.\binitsJ. A., \bauthor\bsnmBen-Nun, \bfnmMichal\binitsM., \bauthor\bsnmRiley, \bfnmPete\binitsP., \bauthor\bsnmRiley, \bfnmSteven\binitsS., \bauthor\bsnmKoyluoglu, \bfnmUgur\binitsU., \bauthor\bsnmDesRoches, \bfnmDavid\binitsD., \bauthor\bsnmHamory, \bfnmBruce\binitsB., \bauthor\bsnmKyriakides, \bfnmChristina\binitsC., \bauthor\bsnmLeis, \bfnmHelen\binitsH., \bauthor\bsnmMilliken, \bfnmJohn\binitsJ., \bauthor\bsnmMoloney, \bfnmMichael\binitsM., \bauthor\bsnmMorgan, \bfnmJames\binitsJ., \bauthor\bsnmOzcan, \bfnmGokce\binitsG., \bauthor\bsnmSchrader, \bfnmChris\binitsC., \bauthor\bsnmShakhnovich, \bfnmElizabeth\binitsE., \bauthor\bsnmSiegel, \bfnmDaniel\binitsD., \bauthor\bsnmSpatz, \bfnmRyan\binitsR., \bauthor\bsnmStiefeling, \bfnmChris\binitsC., \bauthor\bsnmWilkinson, \bfnmBarrie\binitsB., \bauthor\bsnmWong, \bfnmAlexander\binitsA., \bauthor\bsnmGao, \bfnmZhifeng\binitsZ., \bauthor\bsnmBian, \bfnmJiang\binitsJ., \bauthor\bsnmCao, \bfnmWei\binitsW., \bauthor\bsnmFerres, \bfnmJuan Lavista\binitsJ. L., \bauthor\bsnmLi, \bfnmChaozhuo\binitsC., \bauthor\bsnmLiu, \bfnmTie-Yan\binitsT.-Y., \bauthor\bsnmXie, \bfnmXing\binitsX., \bauthor\bsnmZhang, \bfnmShun\binitsS., \bauthor\bsnmZheng, \bfnmShun\binitsS., \bauthor\bsnmVespignani, \bfnmAlessandro\binitsA., \bauthor\bsnmChinazzi, \bfnmMatteo\binitsM., \bauthor\bsnmDavis, \bfnmJessica T.\binitsJ. T., \bauthor\bsnmMu, \bfnmKunpeng\binitsK., \bauthor\bsnmPiontti, \bfnmAna Pastore y\binitsA. P. y., \bauthor\bsnmXiong, \bfnmXinyue\binitsX., \bauthor\bsnmZheng, \bfnmAndrew\binitsA., \bauthor\bsnmBaek, \bfnmJackie\binitsJ., \bauthor\bsnmFarias, \bfnmVivek\binitsV., \bauthor\bsnmGeorgescu, \bfnmAndreea\binitsA., \bauthor\bsnmLevi, \bfnmRetsef\binitsR., \bauthor\bsnmSinha, \bfnmDeeksha\binitsD., \bauthor\bsnmWilde, \bfnmJoshua\binitsJ., \bauthor\bsnmPenna, \bfnmNicolas D.\binitsN. D., \bauthor\bsnmCeli, \bfnmLeo A.\binitsL. A., \bauthor\bsnmSundar, \bfnmSaketh\binitsS., \bauthor\bsnmCavany, \bfnmSean\binitsS., \bauthor\bsnmEspaña, \bfnmGuido\binitsG., \bauthor\bsnmMoore, \bfnmSean\binitsS., \bauthor\bsnmOidtman, \bfnmRachel\binitsR., \bauthor\bsnmPerkins, \bfnmAlex\binitsA., \bauthor\bsnmOsthus, \bfnmDave\binitsD., \bauthor\bsnmCastro, \bfnmLauren\binitsL., \bauthor\bsnmFairchild, \bfnmGeoffrey\binitsG., \bauthor\bsnmMichaud, \bfnmIsaac\binitsI., \bauthor\bsnmKarlen, \bfnmDean\binitsD., \bauthor\bsnmLee, \bfnmElizabeth C.\binitsE. 
C., \bauthor\bsnmDent, \bfnmJuan\binitsJ., \bauthor\bsnmGrantz, \bfnmKyra H.\binitsK. H., \bauthor\bsnmKaminsky, \bfnmJoshua\binitsJ., \bauthor\bsnmKaminsky, \bfnmKathryn\binitsK., \bauthor\bsnmKeegan, \bfnmLindsay T.\binitsL. T., \bauthor\bsnmLauer, \bfnmStephen A.\binitsS. A., \bauthor\bsnmLemaitre, \bfnmJoseph C.\binitsJ. C., \bauthor\bsnmLessler, \bfnmJustin\binitsJ., \bauthor\bsnmMeredith, \bfnmHannah R.\binitsH. R., \bauthor\bsnmPerez-Saez, \bfnmJavier\binitsJ., \bauthor\bsnmShah, \bfnmSam\binitsS., \bauthor\bsnmSmith, \bfnmClaire P.\binitsC. P., \bauthor\bsnmTruelove, \bfnmShaun A.\binitsS. A., \bauthor\bsnmWills, \bfnmJosh\binitsJ., \bauthor\bsnmKinsey, \bfnmMatt\binitsM., \bauthor\bsnmObrecht, \bfnmR. F.\binitsR. F., \bauthor\bsnmTallaksen, \bfnmKatharine\binitsK., \bauthor\bsnmBurant, \bfnmJohn C.\binitsJ. C., \bauthor\bsnmWang, \bfnmLily\binitsL., \bauthor\bsnmGao, \bfnmLei\binitsL., \bauthor\bsnmGu, \bfnmZhiling\binitsZ., \bauthor\bsnmKim, \bfnmMyungjin\binitsM., \bauthor\bsnmLi, \bfnmXinyi\binitsX., \bauthor\bsnmWang, \bfnmGuannan\binitsG., \bauthor\bsnmWang, \bfnmYueying\binitsY., \bauthor\bsnmYu, \bfnmShan\binitsS., \bauthor\bsnmReiner, \bfnmRobert C.\binitsR. C., \bauthor\bsnmBarber, \bfnmRyan\binitsR., \bauthor\bsnmGaikedu, \bfnmEmmanuela\binitsE., \bauthor\bsnmHay, \bfnmSimon\binitsS., \bauthor\bsnmLim, \bfnmSteve\binitsS., \bauthor\bsnmMurray, \bfnmChris\binitsC., \bauthor\bsnmPigott, \bfnmDavid\binitsD., \bauthor\bsnmPrakash, \bfnmB. Aditya\binitsB. A., \bauthor\bsnmAdhikari, \bfnmBijaya\binitsB., \bauthor\bsnmCui, \bfnmJiaming\binitsJ., \bauthor\bsnmRodríguez, \bfnmAlexander\binitsA., \bauthor\bsnmTabassum, \bfnmAnika\binitsA., \bauthor\bsnmXie, \bfnmJiajia\binitsJ., \bauthor\bsnmKeskinocak, \bfnmPinar\binitsP., \bauthor\bsnmAsplund, \bfnmJohn\binitsJ., \bauthor\bsnmBaxter, \bfnmArden\binitsA., \bauthor\bsnmOruc, \bfnmBuse Eylul\binitsB. E., \bauthor\bsnmSerban, \bfnmNicoleta\binitsN., \bauthor\bsnmArik, \bfnmSercan O.\binitsS. O., \bauthor\bsnmDusenberry, \bfnmMike\binitsM., \bauthor\bsnmEpshteyn, \bfnmArkady\binitsA., \bauthor\bsnmKanal, \bfnmElli\binitsE., \bauthor\bsnmLe, \bfnmLong T.\binitsL. T., \bauthor\bsnmLi, \bfnmChun-Liang\binitsC.-L., \bauthor\bsnmPfister, \bfnmTomas\binitsT., \bauthor\bsnmSava, \bfnmDario\binitsD., \bauthor\bsnmSinha, \bfnmRajarishi\binitsR., \bauthor\bsnmTsai, \bfnmThomas\binitsT., \bauthor\bsnmYoder, \bfnmNate\binitsN., \bauthor\bsnmYoon, \bfnmJinsung\binitsJ., \bauthor\bsnmZhang, \bfnmLeyou\binitsL., \bauthor\bsnmAbbott, \bfnmSam\binitsS., \bauthor\bsnmBosse, \bfnmNikos I.\binitsN. I., \bauthor\bsnmFunk, \bfnmSebastian\binitsS., \bauthor\bsnmHellewel, \bfnmJoel\binitsJ., \bauthor\bsnmMeakin, \bfnmSophie R.\binitsS. R., \bauthor\bsnmMunday, \bfnmJames D.\binitsJ. D., \bauthor\bsnmSherratt, \bfnmKatherine\binitsK., \bauthor\bsnmZhou, \bfnmMingyuan\binitsM., \bauthor\bsnmKalantari, \bfnmRahi\binitsR., \bauthor\bsnmYamana, \bfnmTeresa K.\binitsT. K., \bauthor\bsnmPei, \bfnmSen\binitsS., \bauthor\bsnmShaman, \bfnmJeffrey\binitsJ., \bauthor\bsnmAyer, \bfnmTurgay\binitsT., \bauthor\bsnmAdee, \bfnmMadeline\binitsM., \bauthor\bsnmChhatwal, \bfnmJagpreet\binitsJ., \bauthor\bsnmDalgic, \bfnmOzden O.\binitsO. O., \bauthor\bsnmLadd, \bfnmMary A.\binitsM. A., \bauthor\bsnmLinas, \bfnmBenjamin P.\binitsB. P., \bauthor\bsnmMueller, \bfnmPeter\binitsP., \bauthor\bsnmXiao, \bfnmJade\binitsJ., \bauthor\bsnmLi, \bfnmMichael L.\binitsM. L., \bauthor\bsnmBertsimas, \bfnmDimitris\binitsD., \bauthor\bsnmLami, \bfnmOmar Skali\binitsO. 
S., \bauthor\bsnmSoni, \bfnmSaksham\binitsS., \bauthor\bsnmBouardi, \bfnmHamza Tazi\binitsH. T., \bauthor\bsnmWang, \bfnmYuanjia\binitsY., \bauthor\bsnmWang, \bfnmQinxia\binitsQ., \bauthor\bsnmXie, \bfnmShanghong\binitsS., \bauthor\bsnmZeng, \bfnmDonglin\binitsD., \bauthor\bsnmGreen, \bfnmAlden\binitsA., \bauthor\bsnmBien, \bfnmJacob\binitsJ., \bauthor\bsnmHu, \bfnmAddison J.\binitsA. J., \bauthor\bsnmJahja, \bfnmMaria\binitsM., \bauthor\bsnmNarasimhan, \bfnmBalasubramanian\binitsB., \bauthor\bsnmRajanala, \bfnmSamyak\binitsS., \bauthor\bsnmRumack, \bfnmAaron\binitsA., \bauthor\bsnmSimon, \bfnmNoah\binitsN., \bauthor\bsnmTibshirani, \bfnmRyan\binitsR., \bauthor\bsnmTibshirani, \bfnmRob\binitsR., \bauthor\bsnmVentura, \bfnmValerie\binitsV., \bauthor\bsnmWasserman, \bfnmLarry\binitsL., \bauthor\bsnmO’Dea, \bfnmEamon B.\binitsE. B., \bauthor\bsnmDrake, \bfnmJohn M.\binitsJ. M., \bauthor\bsnmPagano, \bfnmRobert\binitsR., \bauthor\bsnmWalker, \bfnmJo W.\binitsJ. W., \bauthor\bsnmSlayton, \bfnmRachel B.\binitsR. B., \bauthor\bsnmJohansson, \bfnmMichael\binitsM., \bauthor\bsnmBiggerstaff, \bfnmMatthew\binitsM. and \bauthor\bsnmReich, \bfnmNicholas G.\binitsN. G. (\byear2021). \btitleEvaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US. \bpages2021.02.03.21250974. \bdoi10.1101/2021.02.03.21250974 \endbibitem
- DeFelice et al. (2017) {barticle}[author] \bauthor\bsnmDeFelice, \bfnmNicholas B.\binitsN. B., \bauthor\bsnmLittle, \bfnmEliza\binitsE., \bauthor\bsnmCampbell, \bfnmScott R.\binitsS. R. and \bauthor\bsnmShaman, \bfnmJeffrey\binitsJ. (\byear2017). \btitleEnsemble forecast of human West Nile virus cases and mosquito infection rates. \bjournalNature Communications \bvolume8 \bpages14592. \bnotePublisher: Nature Publishing Group. \bdoi10.1038/ncomms14592 \endbibitem
- Centers for Disease Control and Prevention (2019a) {bmisc}[author] \bauthor\bsnmCenters for Disease Control and Prevention (\byear2019a). \btitleEpidemic Prediction Initiative. \bhowpublishedURL: https://predict.cdc.gov/post/5d827e75fba2091084d47b96. \endbibitem
- Centers for Disease Control and Prevention (2019b) {bmisc}[author] \bauthor\bsnmCenters for Disease Control and Prevention (\byear2019b). \btitleWhat You Should Know for the 2017-2018 Influenza Season. \bhowpublishedURL: https://www.cdc.gov/flu/about/season/flu-season-2017-2018.htm. \endbibitem
- Centers for Disease Control and Prevention (2020) {bmisc}[author] \bauthor\bsnmCenters for Disease Control and Prevention (\byear2020). \btitleU.S. Influenza Surveillance System: Purpose and Methods | CDC. \bhowpublishedURL: https://www.cdc.gov/flu/weekly/overview.htm. \endbibitem
- Centers for Disease Control and Prevention (2021a) {bmisc}[author] \bauthor\bsnmCenters for Disease Control and Prevention (\byear2021a). \btitleEstimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States — 2018–2019 influenza season | CDC. \bhowpublishedURL: https://www.cdc.gov/flu/about/burden/2018-2019.html. \endbibitem
- Centers for Disease Control and Prevention (2021b) {bmisc}[author] \bauthor\bsnmCenters for Disease Control and Prevention (\byear2021b). \btitleCDC Stands Up New Disease Forecasting Center. \bhowpublishedURL: https://www.cdc.gov/media/releases/2021/p0818-disease-forecasting-center.html. \endbibitem
- Gneiting, Balabdaoui and Raftery (2007) {barticle}[author] \bauthor\bsnmGneiting, \bfnmTilmann\binitsT., \bauthor\bsnmBalabdaoui, \bfnmFadoua\binitsF. and \bauthor\bsnmRaftery, \bfnmAdrian E.\binitsA. E. (\byear2007). \btitleProbabilistic forecasts, calibration and sharpness. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology) \bvolume69 \bpages243–268. \bdoi10.1111/j.1467-9868.2007.00587.x \endbibitem
- Gneiting and Raftery (2007) {barticle}[author] \bauthor\bsnmGneiting, \bfnmTilmann\binitsT. and \bauthor\bsnmRaftery, \bfnmAdrian E\binitsA. E. (\byear2007). \btitleStrictly Proper Scoring Rules, Prediction, and Estimation. \bjournalJournal of the American Statistical Association \bvolume102 \bpages359–378. \bdoi10.1198/016214506000001437 \endbibitem
- Gneiting and Ranjan (2013) {barticle}[author] \bauthor\bsnmGneiting, \bfnmTilmann\binitsT. and \bauthor\bsnmRanjan, \bfnmRoopesh\binitsR. (\byear2013). \btitleCombining predictive distributions. \bjournalElectronic Journal of Statistics \bvolume7 \bpages1747–1782. \bdoi10.1214/13-EJS823 \endbibitem
- Hora (2004) {barticle}[author] \bauthor\bsnmHora, \bfnmStephen C.\binitsS. C. (\byear2004). \btitleProbability Judgments for Continuous Quantities: Linear Combinations and Calibration. \bjournalManagement Science \bvolume50 \bpages597–604. \bnotePublisher: INFORMS. \endbibitem
- Kuleshov and Deshpande (2021) {barticle}[author] \bauthor\bsnmKuleshov, \bfnmVolodymyr\binitsV. and \bauthor\bsnmDeshpande, \bfnmShachi\binitsS. (\byear2021). \btitleCalibrated and Sharp Uncertainties in Deep Learning via Simple Density Estimation. \bjournalarXiv:2112.07184 [cs]. \bnotearXiv: 2112.07184. \endbibitem
- Laio and Tamea (2007) {barticle}[author] \bauthor\bsnmLaio, \bfnmF.\binitsF. and \bauthor\bsnmTamea, \bfnmS.\binitsS. (\byear2007). \btitleVerification tools for probabilistic forecasts of continuous hydrological variables. \bjournalHydrology and Earth System Sciences \bvolume11 \bpages1267–1277. \bnotePublisher: Copernicus GmbH. \bdoi10.5194/hess-11-1267-2007 \endbibitem
- Lutz et al. (2019) {barticle}[author] \bauthor\bsnmLutz, \bfnmChelsea S.\binitsC. S., \bauthor\bsnmHuynh, \bfnmMimi P.\binitsM. P., \bauthor\bsnmSchroeder, \bfnmMonica\binitsM., \bauthor\bsnmAnyatonwu, \bfnmSophia\binitsS., \bauthor\bsnmDahlgren, \bfnmF. Scott\binitsF. S., \bauthor\bsnmDanyluk, \bfnmGregory\binitsG., \bauthor\bsnmFernandez, \bfnmDanielle\binitsD., \bauthor\bsnmGreene, \bfnmSharon K.\binitsS. K., \bauthor\bsnmKipshidze, \bfnmNodar\binitsN., \bauthor\bsnmLiu, \bfnmLeann\binitsL., \bauthor\bsnmMgbere, \bfnmOsaro\binitsO., \bauthor\bsnmMcHugh, \bfnmLisa A.\binitsL. A., \bauthor\bsnmMyers, \bfnmJennifer F.\binitsJ. F., \bauthor\bsnmSiniscalchi, \bfnmAlan\binitsA., \bauthor\bsnmSullivan, \bfnmAmy D.\binitsA. D., \bauthor\bsnmWest, \bfnmNicole\binitsN., \bauthor\bsnmJohansson, \bfnmMichael A.\binitsM. A. and \bauthor\bsnmBiggerstaff, \bfnmMatthew\binitsM. (\byear2019). \btitleApplying infectious disease forecasting to public health: a path forward using influenza forecasting examples. \bjournalBMC Public Health \bvolume19 \bpages1659. \bdoi10.1186/s12889-019-7966-8 \endbibitem
- McAndrew and Reich (2020) {barticle}[author] \bauthor\bsnmMcAndrew, \bfnmThomas\binitsT. and \bauthor\bsnmReich, \bfnmNicholas G.\binitsN. G. (\byear2020). \btitleAdaptively stacking ensembles for influenza forecasting with incomplete data. \bjournalarXiv:1908.01675 [cs, stat]. \bnotearXiv: 1908.01675. \endbibitem
- McGowan et al. (2019) {barticle}[author] \bauthor\bsnmMcGowan, \bfnmCraig J.\binitsC. J., \bauthor\bsnmBiggerstaff, \bfnmMatthew\binitsM., \bauthor\bsnmJohansson, \bfnmMichael\binitsM., \bauthor\bsnmApfeldorf, \bfnmKaryn M.\binitsK. M., \bauthor\bsnmBen-Nun, \bfnmMichal\binitsM., \bauthor\bsnmBrooks, \bfnmLogan\binitsL., \bauthor\bsnmConvertino, \bfnmMatteo\binitsM., \bauthor\bsnmErraguntla, \bfnmMadhav\binitsM., \bauthor\bsnmFarrow, \bfnmDavid C.\binitsD. C., \bauthor\bsnmFreeze, \bfnmJohn\binitsJ., \bauthor\bsnmGhosh, \bfnmSaurav\binitsS., \bauthor\bsnmHyun, \bfnmSangwon\binitsS., \bauthor\bsnmKandula, \bfnmSasikiran\binitsS., \bauthor\bsnmLega, \bfnmJoceline\binitsJ., \bauthor\bsnmLiu, \bfnmYang\binitsY., \bauthor\bsnmMichaud, \bfnmNicholas\binitsN., \bauthor\bsnmMorita, \bfnmHaruka\binitsH., \bauthor\bsnmNiemi, \bfnmJarad\binitsJ., \bauthor\bsnmRamakrishnan, \bfnmNaren\binitsN., \bauthor\bsnmRay, \bfnmEvan L.\binitsE. L., \bauthor\bsnmReich, \bfnmNicholas G.\binitsN. G., \bauthor\bsnmRiley, \bfnmPete\binitsP., \bauthor\bsnmShaman, \bfnmJeffrey\binitsJ., \bauthor\bsnmTibshirani, \bfnmRyan\binitsR., \bauthor\bsnmVespignani, \bfnmAlessandro\binitsA., \bauthor\bsnmZhang, \bfnmQian\binitsQ. and \bauthor\bsnmReed, \bfnmCarrie\binitsC. (\byear2019). \btitleCollaborative efforts to forecast seasonal influenza in the United States, 2015–2016. \bjournalScientific Reports \bvolume9 \bpages683. \bnotePublisher: Nature Publishing Group. \bdoi10.1038/s41598-018-36361-9 \endbibitem
- FluSight Network (2020). cdc-flusight-ensemble: Guidelines and forecasts for a collaborative U.S. influenza forecasting project. GitHub repository. URL: https://github.com/FluSightNetwork/cdc-flusight-ensemble.
- Oidtman, R. J., Omodei, E., Kraemer, M. U. G., Castañeda-Orjuela, C. A., Cruz-Rivera, E., Misnaza-Castrillón, S., Cifuentes, M. P., Rincon, L. E., Cañon, V., Alarcon, P. d., España, G., Huber, J. H., Hill, S. C., Barker, C. M., Johansson, M. A., Manore, C. A., Reiner, Jr., Rodriguez-Barraquer, I., Siraj, A. S., Frias-Martinez, E., García-Herranz, M. and Perkins, T. A. (2021). Trade-offs between individual and ensemble forecasts of an emerging infectious disease. Nature Communications 12, 5379. doi:10.1038/s41467-021-25695-0.
- Ranjan, R. and Gneiting, T. (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 71–91. doi:10.1111/j.1467-9868.2009.00726.x.
- Ray, E. L. and Reich, N. G. (2018). Prediction of infectious disease epidemics via weighted density ensembles. PLOS Computational Biology 14, e1005910. doi:10.1371/journal.pcbi.1005910.
- Ray, E. L., Wattanachit, N., Niemi, J., Kanji, A. H., House, K., Cramer, E. Y., Bracher, J., Zheng, A., Yamana, T. K., Xiong, X., Woody, S., Wang, Y., Wang, L., Walraven, R. L., Tomar, V., Sherratt, K., Sheldon, D., Reiner, R. C., Prakash, B. A., Osthus, D., Li, M. L., Lee, E. C., Koyluoglu, U., Keskinocak, P., Gu, Y., Gu, Q., George, G. E., España, G., Corsetti, S., Chhatwal, J., Cavany, S., Biegel, H., Ben-Nun, M., Walker, J., Slayton, R., Lopez, V., Biggerstaff, M., Johansson, M. A. and Reich, N. G., on behalf of the COVID-19 Forecast Hub Consortium (2020). Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. Preprint, 2020.08.19.20177493. doi:10.1101/2020.08.19.20177493.
- Ray, E. L., Brooks, L. C., Bien, J., Biggerstaff, M., Bosse, N. I., Bracher, J., Cramer, E. Y., Funk, S., Gerding, A., Johansson, M. A., Rumack, A., Wang, Y., Zorn, M., Tibshirani, R. J. and Reich, N. G. (2022). Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States. arXiv:2201.12387 [stat].
- Reich, N. G., McGowan, C. J., Yamana, T. K., Tushar, A., Ray, E. L., Osthus, D., Kandula, S., Brooks, L. C., Crawford-Crudell, W., Gibson, G. C., Moore, E., Silva, R., Biggerstaff, M., Johansson, M. A., Rosenfeld, R. and Shaman, J. (2019). Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLOS Computational Biology 15, e1007486. doi:10.1371/journal.pcbi.1007486.
- Rizzo, M. L. and Székely, G. J. (2016). Energy distance. WIREs Computational Statistics 8, 27–38. doi:10.1002/wics.1375.
- Rumack, A., Tibshirani, R. J. and Rosenfeld, R. (2021). Recalibrating probabilistic forecasts of epidemics. arXiv:2112.06305 [cs].
- Stanescu, A. and Pandey, G. (2016). Learning parsimonious ensembles for unbalanced computational genomics problems. In Biocomputing 2017, 288–299. World Scientific. doi:10.1142/9789813207813_0028.
- Viboud, C., Sun, K., Gaffey, R., Ajelli, M., Fumanelli, L., Merler, S., Zhang, Q., Chowell, G., Simonsen, L. and Vespignani, A. (2018). The RAPIDD ebola forecasting challenge: Synthesis and lessons learnt. Epidemics 22, 13–21. doi:10.1016/j.epidem.2017.08.002.
- Wattanachit, N. (2021). Forecast Calibration. URL: https://github.com/NutchaW/forecast_calibration.
- Wattanachit, N., Ray, E. L., McAndrew, T. and Reich, N. G. (2022). Supplement to "Comparison of Combination Methods to Create Calibrated Ensemble Forecasts for Seasonal Influenza in the U.S.".
- Yamana, T. K., Kandula, S. and Shaman, J. (2016). Superensemble forecasts of dengue outbreaks. Journal of The Royal Society Interface 13, 20160410. doi:10.1098/rsif.2016.0410.
- Yamana, T. K., Kandula, S. and Shaman, J. (2017). Individual versus superensemble forecasts of seasonal influenza outbreaks in the United States. PLOS Computational Biology 13, e1005801. doi:10.1371/journal.pcbi.1005801.