
From Many to One: Consensus Inference in a MIP

Noel Cressie, Michael Bertolacci, and Andrew Zammit-Mangion
School of Mathematics and Applied Statistics, University of Wollongong, Australia
(arXiv Version: 8 Jul 21)
Abstract

A Model Intercomparison Project (MIP) consists of teams who each estimate the same underlying quantity (e.g., temperature projections to the year 2070), and the spread of the estimates indicates their uncertainty. It recognizes that a community of scientists will not agree completely but that there is value in looking for a consensus and information in the range of disagreement. A simple average of the teams’ outputs gives a consensus estimate, but it does not recognize that some outputs are more variable than others. Statistical analysis of variance (ANOVA) models offer a way to obtain a weighted consensus estimate of outputs with a variance that is the smallest possible and hence the tightest possible ‘one-sigma’ and ‘two-sigma’ intervals. Modulo dependence between MIP outputs, the ANOVA approach weights a team’s output inversely proportional to its variation. When external verification data are available for evaluating the fidelity of each MIP output, ANOVA weights can also provide a prior distribution for Bayesian Model Averaging to yield a consensus estimate. We use a MIP of carbon dioxide flux inversions to illustrate the ANOVA-based weighting and subsequent consensus inferences.

Keywords: analysis of variance (ANOVA); model intercomparison project (MIP); multi-model ensemble; statistically optimal weights; SUPE-ANOVA framework; uncertainty quantification.

1 Introduction

Increasing our knowledge of the behavior of geophysical processes in the past, at the present, and in the future requires a scientific understanding of the processes, of the parameters that govern them, and of the observing systems measuring them. That understanding can be quantified through a geophysical model and the mathematical and statistical relationships embedded in it. The presence of uncertainty in the model formulation is generally recognized, although where and how it enters is usually a topic of more or less disagreement (e.g., Knutti et al., 2010). Reconciliation and consensus estimation in these settings are topics of central interest in many geophysical studies of global import, including the Coupled Model Intercomparison Projects (CMIP; Meehl et al., 2000) and the Ice Sheet Mass Balance Inter-comparison Exercises (IMBIE; Shepherd et al., 2018).

This article considers the Model Intercomparison Project (MIP) approach, commonly used to understand the similarities and differences in competing geophysical models. Here a number of teams produce outputs that estimate the underlying geophysical process in question, but the outputs generally differ due to modeling choices or assumptions. The collection of outputs, which we simply call the ‘MIP’s outputs,’ is sometimes also called a ‘multi-model ensemble.’ When analyzing and summarizing a MIP’s outputs, it is common to use the ensemble average (‘one vote per model’; see Flato et al., 2014, Section 9.2.2.3), and either the range from the ensemble’s lowermost to its uppermost or the ensemble standard deviation, to put bounds on how the process did, does, or will behave. However, it is recognized that this practice of treating all outputs equally is frequently inappropriate (Tebaldi and Knutti, 2007), as some outputs reproduce observational data more faithfully than others and should therefore receive more attention (Knutti, 2010). At the same time, similarities between the methodologies of certain teams suggest that these teams’ outputs should not be treated as completely independent, and therefore each one should receive less attention (Tebaldi and Knutti, 2007; Abramowitz et al., 2019). The most common paradigm (and the one used in this article) for converting these ideas into a method for combining outputs is to weight the outputs according to given criteria. Under this paradigm, the ensemble is typically summarized through a weighted mean rather than the unweighted mean (i.e., the average), and the uncertainty quantification also takes the weights into account (e.g., Tebaldi et al., 2005).

Figure 1: Global land (top) and global ocean (bottom) estimated CO2 fluxes from the nine teams who participated in the OCO-2 Model Intercomparison Project (MIP), reported in Crowell et al. (2019). The lines show the fluxes of each ensemble member, and the crosses show the unweighted ensemble monthly means from January 2015 through December 2016. The fluxes are for the LN observation type (see Section 5).

A key question is how to construct the ensemble weights. The methods used to do so depend on the details of the application. It is natural to use any available observational data to help with the weighting and, broadly speaking, applications fall on a spectrum depending on how much verification data are available. At one end of the spectrum, when the underlying geophysical process is contemporaneous, such as in the prediction of meteorological fields, weighting methods can typically take advantage of large amounts of verification data (e.g., Krishnamurti et al., 2000). At the other end of the spectrum, when the geophysical process is extremely difficult to measure directly, little-to-no direct verification data may be available; this situation is the focus of this article.

In the middle of the spectrum, limited verification data are available. The archetypal case of this is the projection of future climate, where verification data are available for a number of years up to the beginning of the projections. For example, see Figure 11.9 on page 981 of the Intergovernmental Panel on Climate Change Fifth Assessment Report (Kirtman et al., 2013), which includes 42 projections from the CMIP5 ensemble members. For 1986–2005, the projections are based on historical observed greenhouse gas concentrations. Afterwards, they are based on a pre-specified path of future emissions, called a representative concentration pathway (RCP, in this case RCP 4.5). Verification data are available for the years 1986–2005, and they are used to evaluate the performance of the models. In this and similar settings, there are many reasons why historical performance may not guarantee future accuracy (Reifen and Toumi, 2009) but, nonetheless, one should not discount this historical information entirely (Knutti, 2010).

For MIPs like CMIP5, where limited verification data are available, Giorgi and Mearns (2002) present a framework called the Reliability Ensemble Average (REA), in which heuristic weights are derived that reward ensemble members that are simultaneously (i) in consensus and (ii) in agreement with verification data. Tebaldi et al. (2005) formalize REA in a statistical framework for analyzing model projections of future global temperatures across 22 regions. In their approach, the regions are treated independently, and the individual outputs are modeled as random quantities that center on the true climate state. The variances differ between the outputs, implying different weights, and the estimated weights balance the REA criteria (i) and (ii). There are several extensions of the Tebaldi et al. (2005) method: Smith et al. (2009) treat the regions jointly and describe a cross-validation technique to validate the model fit, while Tebaldi and Sansó (2009) allow for joint modeling of both temperature and precipitation.

A key limitation of the REA framework and its associated methods is that they do not account for dependence between outputs (Abramowitz et al., 2019). More recent developments have attempted to address this by modifying the weights, typically by down-weighting sub-groups of outputs that are considered to be dependent (Bishop and Abramowitz, 2013). However, there is disagreement on what it means for certain MIP outputs to be dependent, and approaches to understanding this include the distance between outputs (e.g., Sanderson et al., 2017), the historical ‘genealogy’ of climate models (e.g., Knutti et al., 2013, 2017), component-wise similarity between models (e.g., Boé, 2018), dependence between errors (e.g., Bishop and Abramowitz, 2013; Haughton et al., 2015), and probabilistic/statistical dependence (e.g., Chandler, 2013; Sunyer et al., 2014).

We now consider the case that is our main focus, where little to no direct verification data are available. An example of this, described in detail in Section 5, is the estimation of surface sources and sinks (fluxes) of CO2 in a MIP; see Figure 1, which shows the nine teams’ estimates of global land and global ocean non-fossil-fuel fluxes, as well as the unweighted ensemble average. Fluxes of CO2 occur over large temporal and geographical scales and are difficult to measure directly. One way their distribution is inferred indirectly is through their consequences on atmospheric concentrations of CO2, for which observations are available (Ciais et al., 2010). However, there are complications in using such indirect data taken anywhere in the atmosphere for verification, because issues such as overfitting and model misspecification mean that an accurate reproduction of the observed concentrations may not imply accurate fluxes (e.g., Houweling et al., 2010). Therefore, in this case and in cases like it, one is not applying REA’s criterion (ii), of agreement with flux data. However, according to REA criterion (i), one can infer the consensus between the MIP’s outputs (while taking into consideration model dependence, if needed). To apply the first criterion, we present a framework that enables statistically unbiased prediction and estimation (SUPE) based on an analysis of variance (ANOVA) statistical model, originally used for analyzing agricultural field experiments (Fisher, 1935; Snedecor, 1937). The framework promotes models that are in agreement and discounts those that are in disagreement, but it does so selectively in regions and seasons thought to have their own geophysical characteristics. For each of these regions/seasons, it provides a consensus summary and accompanying uncertainty quantification. It can also cater for model interdependencies when generating the summary, provided these have been, or can be, identified from a priori or expert knowledge.

Section 2 presents the ANOVA model in general, by recognizing that different factor combinations (e.g., spatial regions, seasons, geophysical-model configurations) will lead to agreement in differing degrees. We model the teams’ outputs for each factor combination as having different variances and, in Section 3, obtain statistically optimal weights based on these variances that allow optimal SUPE and its uncertainty quantification. In Section 4, we show how to estimate these weights using likelihood methods derived from the statistical distributions in the ANOVA model. Section 5 gives an illustration of our statistical-weighting methodology on the MIP published by Crowell et al. (2019), where nine teams produced outputs of global carbon-dioxide fluxes that were compared and summarized. A discussion of our framework, which we call SUPE-ANOVA, and conclusions, are given in Section 6.

2 Analysis of variance (ANOVA) for a MIP analysis

The area of statistics known as Experimental Design was born out of a need to control variations in agricultural field experiments, in order to compare and then select crop treatments based on replicate observations. The statistical ANOVA model was devised to analyze these experiments, and much has been written about it, starting with Fisher (1935), Snedecor (1937), Snedecor and Cochran in its eight editions (1957, …, 2014), and many other texts. It has been adopted in a number of disciplines driven by experimentation, including industrial manufacturing (e.g., Wu and Hamada, 2000), and computer experiments where the so-called treatments are complex algorithms and code (e.g., Santner et al., 2010). We propose using the ANOVA framework for the analysis of outputs from important geophysical experiments performed within a MIP. It enables the output from many to be weighted in a statistically efficient manner to obtain one consensus estimate and a quantification of its uncertainty.

Let ‘f’ represent a combination of factors (e.g., season, region) that the underlying geophysical process is expected to depend on, and let ‘j’ index the output of the various teams participating in the MIP. By stratifying on each factor combination, we can write the output for the fth factor combination (f = 1, …, F) and the jth team (j = 1, …, J) as

\left\{Y_{f,i}^{(j)}:i=1,\ldots,I(f)\right\},\quad\text{for }f=1,\ldots,F,\enskip j=1,\ldots,J, (1)

where I(f) represents the number of replicates for factor combination f. Each combination (f, i) corresponds to a particular region/time-period in the MIP; note that I(f) may differ across factor combinations, reflecting the ability to capture complex MIP designs. In the analysis of the CO2-flux MIP given in Section 5, I(f) is a constant (equal to six) for all regions and seasons.

The notation recognizes that each output is split into F blocks (factor combinations) and, within the fth block, there are (I(f) · J) outputs that are observing the same part of the geophysical process, corresponding to the I(f) replicates and the J teams participating in the MIP. For example, suppose the MIP’s goal is to compare 20 climate models’ projections of North American mean temperature between the years 2021 and 2070, inclusive, at yearly intervals. Say the factor f corresponds to decade, so f ∈ {1, 2, …, 5}. Further, i ∈ {1, 2, …, 10} corresponds to the I(f) = 10 years within a decade, and j ∈ {1, 2, …, 20} corresponds to the 20 climate models. For climate model 17 (say) and decade 4 (say, which corresponds to the years 2051 to 2060, inclusive), the output at yearly intervals within the decade is {Y_{4,i}^{(17)} : i = 1, …, 10}.

The statistical ANOVA model proposed for (1) is, for f = 1, …, F, i = 1, …, I(f), and j = 1, …, J,

Y_{f,i}^{(j)}=Y_{f,i}+\eta_{f,i}^{(j)}, (2)

where {η_{f,i}^{(1)}, …, η_{f,i}^{(J)}} each have mean zero, var(η_{f,i}^{(j)}) = (σ_f^{(j)})^2, and cov(η_{f,i}^{(j)}, η_{f,i}^{(j′)}) = ρ_f^{(j,j′)} σ_f^{(j)} σ_f^{(j′)}. Here, Y_{f,i} is the MIP’s consensus of the geophysical process that each of the MIP outputs is attempting to estimate, and η_{f,i}^{(j)} is the deviation from the consensus of the jth output, assumed to have team-specific variances and covariances and to be independent of Y_{f,i}. Marginally, η_{f,i}^{(j)} ~ Dist(0, (σ_f^{(j)})^2), where Dist(μ, σ^2) denotes a generic distribution with mean μ and variance σ^2. Dependence between teams’ outputs is captured with correlations {ρ_f^{(j,j′)} : j, j′ = 1, …, J}. The statistical ANOVA model has variances that may differ from team to team. A mathematical-statistical result called the Gauss–Markov theorem (e.g., Casella and Berger, 2006) implies that the weights referred to earlier are inversely proportional to these variances, and there is a more general version of the theorem that includes the dependencies between outputs; see Appendix B.

For f = 1, …, F and i = 1, …, I(f), we write the consensus as

Y_{f,i}=\mu_{f}+\alpha_{f,i}, (3)

where {α_{f,1}, …, α_{f,I(f)}} each have mean zero. Here, μ_f is the unknown mean of the I(f) replicates, which we refer to as the climatological mean, and {α_{f,1}, …, α_{f,I(f)}} are random effects that capture variability in the replicates, and whose distribution we denote as Dist(0, τ_f^2). In the illustrative example of North American temperatures, f indexes decade, μ_f corresponds to the consensus temperature in the fth decade, while α_{f,i} is the deviation from μ_f for the ith year within the decade. The variability of temperatures can differ between decades, reflected by τ_f^2 for f = 1, …, 5. A temporal trend within the fth decade could also be included in the model for {α_{f,i} : i = 1, …, I(f)}. The team-specific variances {(σ_f^{(j)})^2 : j = 1, …, J} and correlations {ρ_f^{(j,j′)} : j, j′ = 1, …, J} defined through (2) may also differ between decades. In the North American temperature example, a team’s output with higher weight in one decade may not have higher weight in another.
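As a concrete illustration, the two-level model in (2) and (3) for a single factor combination can be simulated in a few lines. The sizes and parameter values below are purely illustrative (they are not taken from the paper), and Dist is taken to be Gaussian, as assumed later in Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters for one factor combination f
I_f, J = 10, 5                                 # replicates and teams
mu_f, tau_f = 2.0, 0.5                         # climatological mean and replicate s.d.
sigma_f = np.array([0.3, 0.3, 0.6, 0.9, 1.2])  # team-specific s.d.'s sigma_f^{(j)}

# Consensus process, Equation (3): Y_{f,i} = mu_f + alpha_{f,i}
alpha = rng.normal(0.0, tau_f, size=I_f)
Y_consensus = mu_f + alpha

# Team outputs, Equation (2): Y_{f,i}^{(j)} = Y_{f,i} + eta_{f,i}^{(j)},
# with eta independent across teams (zero correlations rho_f^{(j,j')})
eta = rng.normal(0.0, sigma_f, size=(I_f, J))
Y = Y_consensus[:, None] + eta                 # shape (I_f, J)
```

Each column of `Y` is one team's output for the I(f) replicates; the shared draw `alpha` induces the covariance τ_f^2 between teams derived in Section 3.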

The statistical ANOVA model in (2) and (3) allows statistically unbiased prediction and estimation (SUPE) on the consensus for the climatological means, {μ_f : f = 1, …, F}, and on the consensus for the underlying geophysical process, {Y_{f,i} : i = 1, …, I(f); f = 1, …, F}. Notice the heterogeneity of variation in (2) and (3), where the variances (and correlations) depend on both f and j. In the next section, we explain how knowledge of these variance (and correlation) parameters leads to statistically optimal weighting of the J MIP outputs (1) for each f when making inference on {μ_f} and {Y_{f,i}}.

3 Weighting MIP outputs for optimal inference

In the simple example from Section 2, of combining 20 climate models, not all models are expected to show the same variability. Different assumptions are made, say, about the model’s base resolution at which energy and water exchange takes place, how nonlinear dynamics are approximated, and so forth. This possibility is taken into account through (2), where the output from each team receives a different variance. A consequence of this is that, for optimal estimation and prediction of the consensus, different weights should be assigned to the different teams. This section discusses how to construct the optimal weights.

To illustrate optimal weighting in a simpler setting, without factor combinations, consider each climate model’s 2070 projected North American mean temperature (Tmp), Y_Tmp^{(1)}, …, Y_Tmp^{(20)}. In this case, there is only one replication, so we omit the subscript i on all quantities. To start with, assume a naïve statistical model wherein the 20 climate models all have the same variability:

Y_{\mathrm{Tmp}}^{(j)}=\mu_{\mathrm{Tmp}}+\eta_{\mathrm{Tmp}}^{(j)},\enskip j=1,\ldots,20, (4)

where, independently, η_Tmp^{(j)} ~ Dist(0, σ_Tmp^2). For simplicity of exposition, the correlations in this example are assumed to be zero. From (4), Y_Tmp^{(j)} ~ Dist(μ_Tmp, σ_Tmp^2) for j = 1, …, 20. Then the best linear unbiased estimator (BLUE) of μ_Tmp is the unweighted mean, and its variance is straightforward to calculate. That is,

\bar{\mu}_{\mathrm{Tmp}}\equiv\frac{1}{20}\sum_{j=1}^{20}Y_{\mathrm{Tmp}}^{(j)},\enskip\text{and }\mathrm{var}(\bar{\mu}_{\mathrm{Tmp}})=\sigma^{2}_{\mathrm{Tmp}}/20. (5)

To justify using the word ‘best’ in the acronym ‘BLUE,’ one needs to build a statistical model. Assuming the model (4), the estimate (5) is the most precise linear estimator, in that var(μ̄_Tmp) ≤ var(μ̃_Tmp) for any other linear unbiased estimator, μ̃_Tmp, of μ_Tmp. However, (4) does not allow for different variabilities in the different MIP outputs.

Now, for the sake of argument, suppose that the first 15 of the 20 climate models have an underlying resolution of 0.5° × 0.5°, while the remaining five models have a resolution of 1° × 1°. Consequently, the assumption of equal variances in the statistical distributions in (4) will need modification. Assume the variances (σ_Tmp^{(lo)})^2 and (σ_Tmp^{(hi)})^2 are known, and

Y_{\mathrm{Tmp}}^{(j)}\sim\begin{cases}\mathrm{Dist}\left(\mu_{\mathrm{Tmp}},\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{2}\right);&j=1,\ldots,15,\\ \mathrm{Dist}\left(\mu_{\mathrm{Tmp}},\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{2}\right);&j=16,\ldots,20.\end{cases} (6)

In Section 4, we show how these variances can be estimated. The parameter μ_Tmp is still the target of inference but, under the statistical model (6), the Gauss–Markov theorem (e.g., Casella and Berger, 2006) can be used to establish that the estimate μ̄_Tmp in (5) is no longer the BLUE. We define the weights for estimating the projected mean temperature in the year 2070, μ_Tmp, as follows:

w_{\mathrm{Tmp}}^{(j)}=\begin{cases}\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}\left\{15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1};&j=1,\ldots,15,\\ \left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\left\{15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1};&j=16,\ldots,20.\end{cases} (7)

Under the statistical model (6), the Gauss–Markov theorem can be used to establish that the BLUE (the most precise linear unbiased estimator) and its variance are, respectively,

\mu_{\mathrm{Tmp}}^{*}=\sum_{j=1}^{20}w_{\mathrm{Tmp}}^{(j)}Y_{\mathrm{Tmp}}^{(j)},\enskip\text{and }\mathrm{var}(\mu_{\mathrm{Tmp}}^{*})=\left\{15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1}. (8)

That is, the statistically optimal weights for estimation of μ_Tmp are given by (7), and the best estimator of μ_Tmp is obtained by weighting each datum inversely proportional to its variance.
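The inverse-variance weighting in (7) and (8) is easy to verify numerically. The sketch below uses hypothetical values for the two resolution-dependent standard deviations (0.4 and 0.8 are illustrative, not from the paper); the weights sum to one, and the lower-variance models receive larger weights:

```python
import numpy as np

# Illustrative variances for the hypothetical 20-model example:
# 15 models at (sigma_lo)^2, 5 models at (sigma_hi)^2
sigma2 = np.array([0.4**2] * 15 + [0.8**2] * 5)

# Equation (7): weights proportional to the precisions 1/sigma_j^2
precision = 1.0 / sigma2
w = precision / precision.sum()

# Equation (8): BLUE of mu_Tmp and its variance
Y = np.full(20, 10.0)             # placeholder outputs; any numbers work here
mu_star = w @ Y
var_mu_star = 1.0 / precision.sum()
```

With identical outputs of 10.0, the BLUE returns 10.0 regardless of the weights, which is a quick sanity check of unbiasedness.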

A more general statistical model, analogous to what we consider in (3), involves the random effect, α_Tmp, as follows:

Y_{\mathrm{Tmp}}=\mu_{\mathrm{Tmp}}+\alpha_{\mathrm{Tmp}}, (9)

where α_Tmp ~ Dist(0, τ_Tmp^2), and τ_Tmp^2 is assumed known. When α_Tmp = 0, we recover the model (6); however, the random effect α_Tmp plays an important role in recognizing the stochasticity in the projected temperature. Hence, for j = 1, …, 20,

Y_{\mathrm{Tmp}}^{(j)}=Y_{\mathrm{Tmp}}+\eta_{\mathrm{Tmp}}^{(j)}=\mu_{\mathrm{Tmp}}+\alpha_{\mathrm{Tmp}}+\eta_{\mathrm{Tmp}}^{(j)}. (10)

Notice that there are between-model correlation terms induced by the random effect; for j ≠ j′,

\mathrm{cov}\left(Y_{\mathrm{Tmp}}^{(j)},Y_{\mathrm{Tmp}}^{(j^{\prime})}\right)=\mathrm{cov}\left(\alpha_{\mathrm{Tmp}}+\eta_{\mathrm{Tmp}}^{(j)},\alpha_{\mathrm{Tmp}}+\eta_{\mathrm{Tmp}}^{(j^{\prime})}\right)=\tau^{2}_{\mathrm{Tmp}},

where recall that in this simple example η_Tmp^{(j)} and η_Tmp^{(j′)} are uncorrelated. Prediction of Y_Tmp, which is the appropriate inference in the presence of the random effect, is made through the predictive distribution of Y_Tmp given all the data from all the climate-model outputs; the optimal predictor of Y_Tmp is the mean of this predictive distribution. We use the best linear unbiased predictor (BLUP) to make inference on Y_Tmp. In the simple example of North American temperatures, the BLUP is

Y_{\mathrm{Tmp}}^{*}=\lambda_{\mathrm{Tmp}}^{(0)}\mu_{\mathrm{Tmp}}+\sum_{j=1}^{20}\lambda_{\mathrm{Tmp}}^{(j)}Y_{\mathrm{Tmp}}^{(j)}, (11)

where

\lambda_{\mathrm{Tmp}}^{(j)}=\begin{cases}\tau_{\mathrm{Tmp}}^{-2}\left\{\tau_{\mathrm{Tmp}}^{-2}+15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1};&j=0,\\ \left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}\left\{\tau_{\mathrm{Tmp}}^{-2}+15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1};&j=1,\ldots,15,\\ \left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\left\{\tau_{\mathrm{Tmp}}^{-2}+15\left(\sigma_{\mathrm{Tmp}}^{(\text{lo})}\right)^{-2}+5\left(\sigma_{\mathrm{Tmp}}^{(\text{hi})}\right)^{-2}\right\}^{-1};&j=16,\ldots,20.\end{cases} (12)

Notice that, for both optimal prediction (BLUP) and optimal estimation (BLUE), outputs with larger variability around the consensus receive lower weights. Additionally, for the BLUP given by (11), the weight on μ_Tmp depends on how variable the underlying process is; that is, it depends on τ_Tmp^2. Thus, the more variable the geophysical process is (captured through the parameter τ_Tmp^2), the less weight its mean μ_Tmp receives in the optimal predictor, Y_Tmp^*, of Y_Tmp.

To justify the word ‘best’ in the acronym ‘BLUP,’ an uncertainty measure called the mean-squared prediction error (MSPE) is used. Let Ỹ_Tmp be any linear predictor of Y_Tmp; then the BLUP, Y_Tmp^*, minimizes the MSPE since

E(Y_{\mathrm{Tmp}}^{*}-Y_{\mathrm{Tmp}})^{2}\leq E(\tilde{Y}_{\mathrm{Tmp}}-Y_{\mathrm{Tmp}})^{2}.

A general formula for the MSPE of the BLUP is given later in this section.

Returning to the generic MIP, the steps to carry out optimally weighted linear inference follow the same line of development as in the temperature example: (1) specify the first-moment and second-moment parameters in the ANOVA model; (2) derive the variances and covariances of all random quantities in terms of the parameters; (3) weight the outputs (essentially) inversely proportional to their variances to obtain a BLUE/BLUP (i.e., to obtain a best SUPE); (4) quantify uncertainty (variance of the BLUE and MSPE of the BLUP). These four steps require an initialization that might be called a zeroth step, namely the estimation of the parameters (σ_f^{(1)})^2, …, (σ_f^{(J)})^2 and τ_f^2; see Section 4. The correlation parameters, {ρ_f^{(j,j′)} : j, j′ = 1, …, J}, could be known a priori or estimated using a likelihood-based methodology along with the variance parameters; see Appendix B. All together, these steps make up an inference framework for MIPs that we call SUPE-ANOVA.

We now give the optimal estimation and prediction formulas for the generic statistical ANOVA model under the assumption that the correlations between the models, {ρ_f^{(j,j′)}}, are zero; the general case of non-zero correlations involves matrix algebra and is given in Appendix B. From (2) and (3), it is straightforward to show that E(Y_{f,i}^{(j)}) = μ_f, var(Y_{f,i}^{(j)}) = τ_f^2 + (σ_f^{(j)})^2, and cov(Y_{f,i}^{(j)}, Y_{f,i}^{(j′)}) = τ_f^2. From a multivariate version of the Gauss–Markov theorem (e.g., Rencher and Christensen, 2012), the BLUE of μ_f, and its variance, are given by

\mu_{f}^{*}\equiv\sum_{j=1}^{J}\sum_{i=1}^{I(f)}w_{f,i}^{(j)}Y_{f,i}^{(j)}, (13)
S^{2}_{f}\equiv\mathrm{var}(\mu_{f}^{*})=\frac{1}{I(f)}\left(\frac{1}{\sum_{j=1}^{J}\left(\sigma_{f}^{(j)}\right)^{-2}}+\tau_{f}^{2}\right), (14)

for f = 1, …, F, where the optimal weights are

w_{f,i}^{(j)}=\left(\sigma_{f}^{(j)}\right)^{-2}\left\{I(f)\sum_{j^{\prime}=1}^{J}\left(\sigma_{f}^{(j^{\prime})}\right)^{-2}\right\}^{-1}. (15)

One-sigma and two-sigma confidence intervals for the unknown consensus mean μ_f are given by μ_f^* ± S_f and μ_f^* ± 2S_f, respectively.
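Under the zero-correlation assumption, (13)–(15) reduce to a few lines of array arithmetic. The sketch below uses illustrative parameter values (not estimates from the paper) with I(f) = 6 replicates and J = 9 teams, echoing the sizes in the Section 5 application; the weights over all (i, j) sum to one:

```python
import numpy as np

# Illustrative parameters for one factor combination f
I_f, J = 6, 9
tau2_f = 0.25
sigma2_f = np.linspace(0.1, 0.9, J)          # team variances (sigma_f^{(j)})^2
rng = np.random.default_rng(2)
Y = rng.normal(1.0, 0.5, size=(I_f, J))      # placeholder outputs Y_{f,i}^{(j)}

prec = 1.0 / sigma2_f                        # team precisions
w = prec / (I_f * prec.sum())                # Equation (15); identical for all i

mu_star = (Y * w).sum()                      # BLUE of mu_f, Equation (13)
S2_f = (1.0 / prec.sum() + tau2_f) / I_f     # var(mu_f^*), Equation (14)
ci = (mu_star - 2 * S2_f**0.5,
      mu_star + 2 * S2_f**0.5)               # two-sigma confidence interval
```

Note that the weight w_{f,i}^{(j)} does not depend on i, so each team's I(f) replicates contribute equally to the BLUE.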

From (3), we also want to make inference on Y_{f,i}. Recall that Y_{f,i} = μ_f + α_{f,i}, where {α_{f,i}} are random quantities that capture the departures from the longer-term climatology given by μ_f, and they are assumed to have distribution Dist(0, τ_f^2). Since the goal of the MIP is to infer a consensus for the underlying process {Y_{f,i}}, BLUPs of these quantities are sought, along with their MSPEs. Straightforward use of the general Gauss–Markov theorem (e.g., Rencher and Christensen, 2012) results in the BLUP Y_{f,i}^* and its MSPE, as follows:

Y_{f,i}^{*}=\lambda_{f,i}^{(0)}\mu_{f}+\sum_{j=1}^{J}\lambda_{f,i}^{(j)}Y_{f,i}^{(j)}, (16)
\mathrm{MSPE}(Y_{f,i}^{*})\equiv E(Y_{f,i}^{*}-Y_{f,i})^{2}=\left(\tau_{f}^{-2}+\sum_{j^{\prime}=1}^{J}\left(\sigma_{f}^{(j^{\prime})}\right)^{-2}\right)^{-1}, (17)

where (for the moment) all parameters, including μ_f, are assumed known, and

\lambda_{f,i}^{(j)}=\begin{cases}\tau_{f}^{-2}\left\{\tau_{f}^{-2}+\sum_{j^{\prime}=1}^{J}\left(\sigma_{f}^{(j^{\prime})}\right)^{-2}\right\}^{-1};&j=0,\\ \left(\sigma_{f}^{(j)}\right)^{-2}\left\{\tau_{f}^{-2}+\sum_{j^{\prime}=1}^{J}\left(\sigma_{f}^{(j^{\prime})}\right)^{-2}\right\}^{-1};&j=1,\ldots,J.\end{cases} (18)

One-sigma and two-sigma prediction intervals for Y_{f,i} are given by Y_{f,i}^* ± √MSPE(Y_{f,i}^*) and Y_{f,i}^* ± 2√MSPE(Y_{f,i}^*), respectively.
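Equations (16)–(18) can also be checked numerically. In the sketch below, all parameters are treated as known, as in the text, and the values are illustrative. Two properties fall out immediately: the weights λ_{f,i}^{(0)}, λ_{f,i}^{(1)}, …, λ_{f,i}^{(J)} sum to one, and the MSPE in (17) is necessarily smaller than the prior variance τ_f^2, since conditioning on the outputs can only reduce uncertainty:

```python
import numpy as np

# Illustrative known parameters for one (f, i) cell
J = 9
mu_f, tau2_f = 0.0, 0.25
sigma2_f = np.linspace(0.1, 0.9, J)      # team variances (sigma_f^{(j)})^2
Y_fi = np.zeros(J)                       # placeholder outputs Y_{f,i}^{(j)}

denom = 1.0 / tau2_f + (1.0 / sigma2_f).sum()
lam0 = (1.0 / tau2_f) / denom            # Equation (18), j = 0
lam = (1.0 / sigma2_f) / denom           # Equation (18), j = 1, ..., J

Y_star = lam0 * mu_f + lam @ Y_fi        # BLUP, Equation (16)
mspe = 1.0 / denom                       # MSPE, Equation (17)
```

With μ_f = 0 and all outputs equal to zero, the BLUP is exactly zero, which checks that the weights form a proper convex combination.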

In the next section, we give likelihood-based estimates of the model parameters, which we notate as μ̂_f, {(σ̂_f^{(j)})^2 : j = 1, …, J}, and τ̂_f^2. Then the empirical BLUE/BLUP (written as EBLUE/EBLUP) equations are just those in (13)–(18) with the model parameters, μ_f, {(σ_f^{(j)})^2 : j = 1, …, J}, and τ_f^2, replaced by their estimated quantities. This yields the EBLUE for μ_f, which we denote by μ̂_f, and an estimated variance denoted by Ŝ_f^2. Further, it yields an EBLUP for Y_{f,i}, which we denote by Ŷ_{f,i}, and an estimated mean-squared prediction error, denoted by \widehat{\mathrm{MSPE}}(\hat{Y}_{f,i}). These formulas complete the specification of SUPE-ANOVA.

The likelihood-based estimates given in the next section require specification of statistical distributions and, to do this, we assume \mathrm{Dist}(\cdot,\cdot) is \mathrm{Gau}(\cdot,\cdot) for all distributions. When we assume Gaussianity, it should be noted that (16) is best among all possible predictors, not just linear ones (e.g., Cressie and Wikle, 2011, Sec. 4.1.2).

4 The initialization step: Estimation of parameters

Estimation of the parameters of the ANOVA model given by (2) and (3) is performed under the assumption that all unknown distributions are Gaussian, which yields a likelihood function of the model's parameters. The correlation parameters \{\rho_{f}^{(j,j^{\prime})}\} can be modeled and estimated, but they are assumed to be zero in the application in Section 5; here we omit them from the estimation equations, but see Appendix B where they are included. For the remaining parameters, it is known that maximum likelihood estimates of variance parameters are biased and, to avoid this, we use restricted maximum likelihood (REML) estimation (Patterson and Thompson, 1971; Harville, 1977). REML estimation maximizes the restricted likelihood, denoted by L^{(r)}(\bm{\tau}^{2}, \bm{\sigma}^{2} \mid \mathbf{Y}), where \mathbf{Y} \equiv (Y_{1,1}^{(1)}, Y_{1,1}^{(2)}, \ldots, Y_{F,I(F)}^{(J)})^{\prime} is the (J \cdot \sum_{f=1}^{F} I(f))-dimensional vector of all outputs from the MIP, \bm{\sigma}^{2} \equiv ((\sigma_{1}^{(1)})^{2}, (\sigma_{1}^{(2)})^{2}, \ldots, (\sigma_{F}^{(J)})^{2})^{\prime} is an (F \cdot J)-dimensional vector, and \bm{\tau}^{2} \equiv (\tau_{1}^{2}, \ldots, \tau_{F}^{2})^{\prime} is an F-dimensional vector. The definition of the restricted likelihood and a derivation of its analytical expression are given in Appendix B.
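Appendix B derives the restricted likelihood in general. For intuition, the scalar-arithmetic sketch below (a Python illustration under stated assumptions, not the paper's implementation) evaluates the restricted log-likelihood for a single factor f with uncorrelated team errors, the special case used in Section 5. It exploits the Sherman–Morrison identity for the per-replicate covariance matrix C = diag(\sigma^{2}) + \tau^{2}\mathbf{1}\mathbf{1}^{\prime}; additive constants are dropped, and the data layout y[i][j] is our own choice:

```python
import math

def restricted_loglik(tau2, sigma2, y):
    """Restricted (REML) log-likelihood for one factor of the ANOVA model,
    assuming uncorrelated team errors: y[i][j] = mu + alpha_i + eta_i^(j).
    y : I replicates, each a list of J team outputs. Additive constants omitted."""
    I = len(y)
    inv = [1.0 / s2 for s2 in sigma2]
    s = sum(inv)
    denom = 1.0 + tau2 * s
    # Sherman-Morrison for C = diag(sigma2) + tau2 * 1 1':
    logdetC = sum(math.log(s2) for s2 in sigma2) + math.log(denom)
    u = s / denom                       # 1' C^{-1} 1
    cinv1 = [v / denom for v in inv]    # C^{-1} 1
    # GLS estimate of mu (profiled out of the restricted likelihood)
    mu_hat = sum(sum(c * yij for c, yij in zip(cinv1, yi)) for yi in y) / (I * u)
    quad = 0.0
    for yi in y:
        r = [yij - mu_hat for yij in yi]
        d = sum(v * ri * ri for v, ri in zip(inv, r))
        t = sum(v * ri for v, ri in zip(inv, r))
        quad += d - tau2 * t * t / denom      # r' C^{-1} r
    return -0.5 * (I * logdetC + math.log(I * u) + quad)
```

Maximizing this expression over (\tau^{2}, \sigma^{2}) by any standard optimizer gives the REML estimates for that factor; the penalized version of Section 4 simply adds the log-penalty to this objective.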

Small sample sizes can lead to unstable variance estimates, which we regularize in a manner similar to Twomey–Tikhonov regularization. This is achieved through the penalized restricted likelihood:

L^{(p)}(\bm{\tau}^{2}, \bm{\sigma}^{2} \mid \mathbf{Y}; \mathbf{b}) \equiv L^{(r)}(\bm{\tau}^{2}, \bm{\sigma}^{2} \mid \mathbf{Y})\, p(\bm{\sigma}^{2}; \mathbf{b}), (19)

for

p(\bm{\sigma}^{2}; \mathbf{b}) \equiv \prod_{f=1}^{F}\prod_{j=1}^{J} g_{a}\left(\left(\sigma_{f}^{(j)}\right)^{2}; b_{f}\right),

where \mathbf{b} \equiv (b_{1}, \ldots, b_{F})^{\prime} and g_{a}(x; b) = x^{-a-1}\exp(-b/x) for fixed a > 0. The function g_{a}(x; b) is proportional to the density of an inverse gamma distribution with shape a and scale b, which is a common choice when making Bayesian inference on variance parameters. The scales \mathbf{b} in the penalty p(\bm{\sigma}^{2}; \mathbf{b}) are considered unknown and are estimated, whereas the shape a is fixed and pre-specified to guard against one team's output being over-weighted in the BLUEs/BLUPs. That is, for a given factor f \in \{1, \ldots, F\}, the penalty puts a soft limit on the ratio between the largest and the smallest of the team-specific variances, \{(\sigma_{f}^{(1)})^{2}, \ldots, (\sigma_{f}^{(J)})^{2}\}, to a degree specified by a, while the overall scale of the variances is left unknown and estimated through b_{f}.

Calibration of the shape a can be achieved by interpreting the penalty as a Bayesian prior on the variances. In that case, we specify that the ratio of the prior 97.5th percentile of the variances to the prior 2.5th percentile of the variances is equal to four, which yields the fixed value a = 8.48. This choice means that it is unlikely that the largest variance is more than four times larger than the smallest variance.
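The value a = 8.48 can be reproduced numerically. The sketch below (Python, standard library only; the function names are ours) solves for the inverse-gamma shape whose prior 97.5th-to-2.5th percentile ratio equals four, using the fact that this ratio is free of the scale b, since inverse-gamma quantiles are b divided by reversed Gamma(a, 1) quantiles:

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), via its series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = a
    while term > total * 1e-12:
        k += 1.0
        term *= x / k
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_ppf(p, a):
    """Quantile of a Gamma(shape=a, rate=1) distribution, by bisection on P(a, x)."""
    lo, hi = 0.0, a + 10.0 * math.sqrt(a) + 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def prior_percentile_ratio(a):
    """97.5th over 2.5th percentile of an inverse-gamma(a, b) prior (free of b)."""
    return gamma_ppf(0.975, a) / gamma_ppf(0.025, a)

def calibrate_shape(target=4.0):
    """Find the shape a whose prior percentile ratio equals `target`."""
    lo, hi = 1.0, 100.0   # the ratio decreases monotonically in a
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if prior_percentile_ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a_hat = calibrate_shape()   # close to the value a = 8.48 used in the text
```

In practice, a library quantile function (e.g., an inverse-gamma implementation in R or SciPy) would replace the hand-rolled incomplete-gamma inversion; it is written out here only so the sketch is self-contained.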

The expression L^{(p)}(\bm{\tau}^{2}, \bm{\sigma}^{2} \mid \mathbf{Y}; \mathbf{b}) in (19) is maximized with respect to the model parameters \bm{\tau}^{2} and \bm{\sigma}^{2} and with respect to the penalization scale \mathbf{b}, to yield the penalized REML estimates \{(\hat{\sigma}_{f}^{(j)})^{2} : f = 1,\ldots,F,\; j = 1,\ldots,J\} and \{\hat{\tau}_{f}^{2} : f = 1,\ldots,F\}. Appendix B gives the formulas for the general case where correlations are present. For f = 1,\ldots,F, these estimates are substituted into (13), the BLUE for \mu_{f}, to give the EBLUE of \mu_{f}, denoted by \hat{\mu}_{f}, with empirical variance \hat{S}_{f}^{2}. For f = 1,\ldots,F and i = 1,\ldots,I(f), the EBLUP of Y_{f,i}, denoted by \hat{Y}_{f,i}, with empirical MSPE denoted by \widehat{\mathrm{MSPE}}(\hat{Y}_{f,i}), is obtained similarly.

The statistical methodology given in Sections 2, 3, and 4 is applied in the next section to a MIP comprising nine flux inversion–based estimates of global carbon dioxide fluxes.

5 The OCO-2 MIP: Consensus CO2 fluxes

The rise in the atmospheric concentration of the greenhouse gas carbon dioxide (CO2) is the main determinant of global warming (IPCC, 2013), and estimation of the sources and sinks (called fluxes) of CO2 is a high priority for the global community (Friedlingstein et al., 2014). Fluxes of CO2 arise from both anthropogenic and natural processes, and the latter are hard to observe directly, as they occur over large spatial and temporal scales. One way to estimate the natural fluxes, called flux inversion, is through observation of their influence on atmospheric concentrations of CO2. As natural fluxes are so difficult to measure otherwise, flux inversions for CO2 play an important role in quantifying Earth's carbon budget.

Flux inversion is an ill-posed inverse problem, in which an atmospheric transport model is used to simulate the transport of the trace gas from its sources to the times and places of its observation. The transport model is combined with a procedure, usually based on a Bayesian state-space model, to work backwards from observations to fluxes. The data used by inversions are typically a combination of surface, aircraft, and ship observations, and data collected by spaceborne instruments. In this section, we shall use the output from a MIP based on CO2 data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite (Crowell et al., 2019) to obtain consensus flux estimates and predictions.

Flux inversion is subject to many forms of model uncertainty, which complicate interpretation of the estimated fluxes. These uncertainties stem from questions about the fidelity of simulated atmospheric transport, about how best to parameterize the unknown fluxes, and about the choice of prior for the flux field, among others. To address these in a rigorous manner, the OCO-2 project established a series of MIPs where an ensemble of flux outputs was assembled from submissions by different teams, who performed global flux inversions under a common protocol. The MIPs involve the use of OCO-2 retrievals of the vertical column–average CO2 concentration, as well as flask and in situ observations of CO2 concentrations.

The first of the MIPs concluded in 2019, and its results were reported by Crowell et al. (2019). The satellite data in this MIP were based on version 7r of the OCO-2 data, so henceforth we refer to it as MIPv7. Nine teams participated in MIPv7. The results were released as a set of aggregated monthly non-fossil-fuel fluxes spanning 27 disjoint, large regions derived from TransCom3 regions, shown in Figure A1 (where the unlabeled areas are assumed to have zero flux) and listed in Table A1. The outputs from the MIP covered each month over the two years from January 2015 to December 2016, inclusive. Each team was tasked with producing estimated fluxes from three observation types: from an inversion based on flask and in situ measurements only (IS), from an inversion based on OCO-2 retrievals made in land glint (LG) mode, and from an inversion based on OCO-2 retrievals made in land nadir (LN) mode.

To apply the SUPE-ANOVA framework to MIPv7, the fluxes from each mode were considered separately. Then, the factor of interest is f = (s, r), where s = 1,\ldots,4 is a season (starting with December-January-February [DJF]), and r = 1,\ldots,27 is a spatial region. The replicates i = 1,\ldots,6 are the six months within a season for the two-year study period, so I(f) = 6 for all factors. From (2) and (3), the flux output of the jth team is modeled by,

Y_{s,r,i}^{(j)} = Y_{s,r,i} + \eta_{s,r,i}^{(j)}, \quad \eta_{s,r,i}^{(j)} \sim \mathrm{Gau}\left(0, \left(\sigma_{s,r}^{(j)}\right)^{2}\right), (20)

for j = 1,\ldots,9, where \{\eta_{s,r,i}^{(j)} : j = 1,\ldots,9\} are assumed uncorrelated for simplicity; if desired, it is relatively easy to incorporate correlations using the equations in Appendix B. The consensus flux is modeled by

Y_{s,r,i} = \mu_{s,r} + \alpha_{s,r,i}, \quad \alpha_{s,r,i} \sim \mathrm{Gau}(0, \tau_{s,r}^{2}). (21)

The term Y_{s,r,i} represents the MIP consensus for the underlying geophysical flux for month i in season s and region r; \mu_{s,r} is the more-slowly-varying consensus climatological flux for the season and region; \tau_{s,r}^{2} is the variance of the deviations of the monthly fluxes from the climatological flux; and (\sigma_{s,r}^{(j)})^{2} is the variance of the jth team's flux output within each region and season.

The six replicates within each factor (s, r) are too few for reliable estimation of the variance parameters, but this can be addressed straightforwardly by grouping regions with similar variability within a season. Hence, for the four seasons s = 1,\ldots,4, the variance parameters \tau_{s,r}^{2} and \{(\sigma_{s,r}^{(j)})^{2} : j = 1,\ldots,J\}, and the penalization scales b_{s,r}, were constrained to have identical values within grouped regions. We present a clustering approach to construct the groups in Appendix C; the chosen groups are shown in Figure A2.

The parameters of the model in (20) and (21) were estimated according to the initialization step given in Section 4, and the SUPE-ANOVA framework was used to obtain the consensus flux predictions \{\hat{Y}_{s,r,i} : s = 1,\ldots,4;\; r = 1,\ldots,27;\; i = 1,\ldots,6\}. To check whether the model fits the fluxes well, we define the standardized errors, \hat{\eta}_{s,r,i}^{(j)} = (Y_{s,r,i}^{(j)} - \hat{Y}_{s,r,i}) / \hat{\sigma}_{s,r}^{(j)}, for s = 1,\ldots,4, r = 1,\ldots,27, i = 1,\ldots,6, and j = 1,\ldots,9. Under the Gaussian assumptions used to fit the model, the standardized errors \{\hat{\eta}_{s,r,i}^{(j)}\} should be approximately i.i.d. \mathrm{Gau}(0,1). We diagnosed this visually using Q-Q plots, by plotting \{\hat{\eta}_{s,r,i}^{(j)}\} for each season and group of regions against the theoretical quantiles of the \mathrm{Gau}(0,1) distribution; under the assumptions, the points should lie approximately on a 45° line. Figure A3 shows these Q-Q plots, which do not indicate any obvious violations of the model assumptions.

Figure 2: SUPE-ANOVA consensus (black lines) and unweighted-average (black crosses) non-fossil-fuel fluxes for the region T05a (Temperate Northern Africa). (a) The nine individual OCO-2 MIPv7 fluxes are shown with colored lines. (b) The individual fluxes are omitted, and a shaded area is added to show the 95% prediction interval around the SUPE-ANOVA consensus fluxes.

As an example of predicting consensus fluxes at the regional level, Figure 2(a) shows the SUPE-ANOVA predictions as a time series for one region, T05a (Temperate Northern Africa), as well as the time series of fluxes from each of the nine MIP participants; the unweighted averages, \bar{Y}_{s,r,i} = \frac{1}{9}\sum_{j=1}^{9} Y_{s,r,i}^{(j)}, are also shown. Separate panels give the time series for each observation type (IS, LN, and LG). Figure 2(b) repeats Figure 2(a) but with the participating teams' fluxes omitted to allow the vertical axis to be re-scaled. It also shows 95% prediction intervals for the predicted consensus fluxes, where the upper and lower values are \hat{Y}_{s,r,i} \pm 1.96 \times \sqrt{\widehat{\mathrm{MSPE}}(\hat{Y}_{s,r,i})}. For the IS fluxes in region T05a, the predicted consensus hardly differs from the unweighted average. By contrast, for the LN and LG fluxes, the predicted consensus sometimes differs substantially from the unweighted average.

The SUPE-ANOVA predictions of the consensus fluxes and the unweighted-average fluxes at the regional level were also aggregated spatially and temporally to yield the annual global land and annual global ocean fluxes. These are shown in Figure 3, which also shows the land–ocean aggregates for the nine teams in the MIP. Fluxes are given for all three observation types, IS, LN, and LG, and for both study years, 2015 and 2016. The SUPE-ANOVA consensus fluxes imply a larger land sink and a smaller ocean sink than do the unweighted averages. Also, for the IS observation type, there are more extreme values from the TM5-4DVAR team than from the other teams, and these have been downweighted in the consensus estimate, relative to the weights they received in the unweighted average.

Figure 3: SUPE-ANOVA consensus (\ast) and unweighted-average (+) non-fossil-fuel OCO-2 MIPv7 annual global ocean fluxes versus annual global land fluxes for each observation type (IS, LN, and LG) and year (2015 and 2016). Colored points show the underlying MIP outputs from the nine teams.
Figure 4: Estimated weights, \{\hat{w}_{s,r}^{(j)} : s = 1,\ldots,4;\; r = 1,\ldots,27;\; j = 1,\ldots,9\}, obtained from SUPE-ANOVA, for each observation type (IS, LN, and LG), season (s), region (r, shown according to its group; see Figure A2), and team (j) in the OCO-2 MIPv7.

The different teams' weights used in the SUPE-ANOVA framework vary between seasons and regions, although they are identical within the cluster groups. For a given (s, r), define the climatological weights, \hat{w}_{s,r}^{(j)} = (\hat{\sigma}_{s,r}^{(j)})^{-2} / \left[\sum_{j^{\prime}=1}^{J} (\hat{\sigma}_{s,r}^{(j^{\prime})})^{-2}\right], for j = 1,\ldots,9. From (13), these weights correspond to the relative weighting of each participant in estimating the consensus mean climatological flux, \mu_{s,r}. Figure 4 shows the climatological weights for the three observation types (IS, LN, LG), for each season and grouping of regions within each season. From inspection of the weights in Figure 4, the UT team and the UoE team generally have higher weights because their estimated variances are consistently smaller.
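The climatological weights are immediate once the team-specific variances are estimated. A minimal sketch (Python for illustration; the variance values below are hypothetical) of the inverse-variance normalization:

```python
def climatological_weights(sigma2_hat):
    """Normalized inverse-variance weights w_hat^(j) used in the EBLUE of mu_{s,r}."""
    inv = [1.0 / s2 for s2 in sigma2_hat]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical estimated variances for J = 4 teams; the team with the smallest
# variance (here, the first) receives the largest weight.
w = climatological_weights([0.2, 0.4, 0.4, 0.8])
```

By construction the weights sum to one, and a team whose variance is four times larger than another's receives one quarter of that team's weight.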

6 Discussion and conclusions

A reader of this article might ask whether, in the North American temperature example given in Section 3, the simple unweighted average of the MIP outputs provides a reasonable estimate of \mu_{\mathrm{Tmp}}. Under the SUPE-ANOVA framework, that estimate \bar{\mu}_{\mathrm{Tmp}} given by Equation (5) is unbiased. However, it is not a statistically efficient estimator under our consensus model. This can have a profound effect on the design of the MIP. Using the optimal weights, namely, w_{\mathrm{Tmp}}^{(j)} inversely proportional to (\sigma_{\mathrm{Tmp}}^{(j)})^{2} \equiv \mathrm{var}(Y_{\mathrm{Tmp}}^{(j)}) for j = 1,\ldots,J, means that the precision of \mu_{\mathrm{Tmp}}^{*} = \sum_{j=1}^{J} w_{\mathrm{Tmp}}^{(j)} Y_{\mathrm{Tmp}}^{(j)} can be matched to the precision of \bar{\mu}_{\mathrm{Tmp}} but with J (much) less than 20. Since \mu_{\mathrm{Tmp}}^{*} is also unbiased, this optimally weighted mean has a distinct inferential advantage over the unweighted mean, \bar{\mu}_{\mathrm{Tmp}}, when both are based on the same J = 20 outputs \{Y_{\mathrm{Tmp}}^{(1)}, \ldots, Y_{\mathrm{Tmp}}^{(20)}\}. Furthermore, the simple variance, \sigma_{\mathrm{Tmp}}^{2}/20, in Equation (5) is an incorrect expression for \mathrm{var}(\bar{\mu}_{\mathrm{Tmp}}) when it is no longer true that (\sigma_{\mathrm{Tmp}}^{(1)})^{2} = \ldots = (\sigma_{\mathrm{Tmp}}^{(20)})^{2} = \sigma_{\mathrm{Tmp}}^{2}. The correct expression is

\mathrm{var}(\bar{\mu}_{\mathrm{Tmp}}) = \frac{1}{20^{2}}\sum_{j=1}^{20}\left(\sigma_{\mathrm{Tmp}}^{(j)}\right)^{2},

which will give valid consensus inferences and correct coverage for the one-sigma and two-sigma intervals presented in Section 3, albeit wider than the intervals based on \mathrm{var}(\mu_{\mathrm{Tmp}}^{*}).
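To make the efficiency comparison concrete, the sketch below (Python, with hypothetical variances) contrasts the two variances: the unweighted mean of J independent outputs has variance (1/J^{2})\sum_{j} \sigma_{j}^{2}, whereas the optimally weighted mean has variance 1/\sum_{j} \sigma_{j}^{-2}, which is never larger:

```python
def var_unweighted(sigma2):
    """Variance of the simple average of independent outputs with variances sigma2."""
    J = len(sigma2)
    return sum(sigma2) / J ** 2

def var_optimal(sigma2):
    """Variance of the inverse-variance-weighted mean: 1 / sum_j sigma_j^{-2}."""
    return 1.0 / sum(1.0 / s2 for s2 in sigma2)

# Hypothetical: 20 outputs, one much noisier than the rest. The noisy output
# inflates the unweighted mean's variance but barely affects the optimal one.
sigma2 = [1.0] * 19 + [25.0]
```

When all 20 variances are equal, the two expressions coincide, recovering \sigma^{2}/20; heterogeneity is exactly what separates them.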

A number of articles reviewed in Section 1 have taken a Bayesian viewpoint (Tebaldi et al., 2005; Manning et al., 2009; Smith et al., 2009; Tebaldi and Sansó, 2009; Chandler, 2013). A Bayesian spatial ANOVA was used in Sain et al. (2011), Kang et al. (2012), and Kang and Cressie (2013) to analyze projected temperature in a MIP called the North American Regional Climate Change Assessment Program (NARCCAP; see Mearns et al., 2012). This approach adds another layer of complexity to consensus inference in a MIP, where prior distributions on mean and variance parameters need to be specified, sensitivity to the specification assessed, and posterior-distribution sampling algorithms developed. In contrast, our SUPE-ANOVA framework does not depend on priors, is fast to compute, and results in easy-to-interpret weights on the J MIP outputs within the ANOVA factors (e.g., season, region).

We return to the general SUPE-ANOVA framework developed in Sections 2 and 3 to show how the optimal weights could be used in conjunction with verification data. Let \{Z_{k} : k = 1,\ldots,K\} denote the verification data, with associated measurement-error variances \{\mathrm{var}(Z_{k}) : k = 1,\ldots,K\}. Each datum Z_{k} can be indexed according to the factor f and replicate i \in \{1,\ldots,I(f)\} to which it belongs. Hence, rewrite Z_{k} as Z_{f(k),i(k)}, which will be compared to the J outputs, \{Y_{f(k),i(k)}^{(j)} : j = 1,\ldots,J\}. Then a simple and statistically justifiable way to assign weights to each team's output is to define

\nu^{(j)} \propto \left(\prod_{f=1}^{F}\prod_{i=1}^{I(f)} w_{f,i}^{(j)}\right)\left\{\prod_{k=1}^{K}\exp\left[-\frac{\left(Z_{f(k),i(k)} - Y_{f(k),i(k)}^{(j)}\right)^{2}}{2\,\mathrm{var}(Z_{f(k),i(k)})}\right]\right\}, (22)

where \sum_{j=1}^{J}\nu^{(j)} = 1. Equation (22) shows how the SUPE-ANOVA weights, \{w_{f,i}^{(j)}\}, can be used to define a prior in a Bayesian Model Averaging (BMA) scheme, where \{\nu^{(j)} : j = 1,\ldots,J\} can be interpreted as the posterior weights assigned to the J models, given the verification data \{Z_{k} : k = 1,\ldots,K\}. That is, the J models are weighted based on verification data without making the common 'truth-plus-error' assumption (Bishop and Abramowitz, 2013), or even the more general assumption of co-exchangeability between the MIP outputs and the data (Rougier et al., 2013). Now, since the BMA weights have lost their dependence on the F factors, they will not yield optimal estimates of \mu_{f}, or optimal predictions of \{Y_{f,i}\}; instead, they can be used to suggest which model is the most reliable overall, in the sense of reliability ensemble averaging (REA; Giorgi and Mearns, 2002).
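Equation (22) is straightforward to evaluate in practice, preferably on the log scale to avoid underflow in the products. A minimal sketch (Python; the data layout and toy values are hypothetical):

```python
import math

def bma_weights(w, y, z, var_z):
    """Posterior model weights nu^(j) of (22), computed on the log scale.

    w[j]     : list of SUPE-ANOVA weights w_{f,i}^(j) over all (f, i) pairs
    y[j]     : outputs Y_{f(k),i(k)}^(j) matched to the verification data
    z, var_z : verification data Z_k and their measurement-error variances
    """
    log_nu = []
    for wj, yj in zip(w, y):
        lp = sum(math.log(wi) for wi in wj)              # log of product of weights
        lp += sum(-(zk - yk) ** 2 / (2.0 * vk)
                  for zk, yk, vk in zip(z, yj, var_z))   # Gaussian log-fit term
        log_nu.append(lp)
    m = max(log_nu)                                      # log-sum-exp normalization
    nu = [math.exp(l - m) for l in log_nu]
    total = sum(nu)
    return [v / total for v in nu]

# Hypothetical example: J = 2 models with equal prior weights; model 0 tracks
# the verification data closely, so it should dominate the posterior weights.
nu = bma_weights(w=[[0.5, 0.5], [0.5, 0.5]],
                 y=[[1.0, 2.0], [3.0, 0.0]],
                 z=[1.1, 1.9], var_z=[0.25, 0.25])
```

The log-sum-exp step matters because the product of K Gaussian factors can easily underflow double precision when K is large.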

In summary, basic statistical principles have been followed to make optimal consensus inference in a MIP. Once the variance/covariance parameters in the SUPE-ANOVA framework have been estimated in an initialization step, the weights are immediate; code in the R programming language (R Core Team, 2020) is available at https://github.com/mbertolacci/supe-anova-paper for the OCO-2 MIP analyzed in Section 5.

Acknowledgements

This project was largely supported by the Australian Research Council (ARC) Discovery Project (DP) DP190100180 and NASA ROSES grant NNH20ZDA001N-OCOST. Noel Cressie’s research was also supported by ARC DP150104576, and Andrew Zammit-Mangion’s research was also supported by an ARC Discovery Early Career Research Award (DECRA) DE180100203. The authors are grateful to the OCO-2 Flux Group for their feedback and suggestions.

Data Availability Statement

The data for the OCO-2 MIP analyzed in Section 5 are described by Crowell et al. (2019) and may be found on the OCO-2 MIPv7 website hosted by the NOAA Global Monitoring Laboratory (https://gml.noaa.gov/ccgg/OCO2/).

References

  • Abramowitz et al., (2019) Abramowitz, G., et al. (2019). ESD Reviews: Model dependence in multi-model climate ensembles: Weighting, sub-selection and out-of-sample testing. Earth System Dynamics, 10(1), 91–105, https://doi.org/10.5194/esd-10-91-2019.
  • Bishop and Abramowitz, (2013) Bishop, C. H. & Abramowitz, G. (2013). Climate model dependence and the replicate Earth paradigm. Climate Dynamics, 41(3), 885–900, https://doi.org/10.1007/s00382-012-1610-y.
  • Boé, (2018) Boé, J. (2018). Interdependency in Multimodel Climate Projections: Component Replication and Result Similarity. Geophysical Research Letters, 45(6), 2771–2779, https://doi.org/10.1002/2017GL076829.
  • Casella and Berger, (2006) Casella, G. & Berger, R. L. (2006). Statistical Inference, second edition. Pacific Grove, CA: Duxbury Press.
  • Chandler, (2013) Chandler, R. E. (2013). Exploiting strength, discounting weakness: Combining information from multiple climate simulators. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1991), 20120388, https://doi.org/10.1098/rsta.2012.0388.
  • Ciais et al., (2010) Ciais, P., Rayner, P., Chevallier, F., Bousquet, P., Logan, M., Peylin, P., & Ramonet, M. (2010). Atmospheric inversions for estimating CO2 fluxes: Methods and perspectives. Climatic Change, 103(1), 69–92, https://doi.org/10.1007/s10584-010-9909-3.
  • Cressie and Wikle, (2011) Cressie, N. & Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Hoboken, NJ: Wiley.
  • Crowell et al., (2019) Crowell, S., et al. (2019). The 2015–2016 carbon cycle as seen from OCO-2 and the global in situ network. Atmospheric Chemistry and Physics, 19, 9797–9831, https://doi.org/10.5194/acp-19-9797-2019.
  • Fisher, (1935) Fisher, R. A. (1935). The Design of Experiments. Edinburgh, UK: Oliver and Boyd.
  • Flato et al., (2014) Flato, G., et al. (2014). Evaluation of climate models. In T. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, & P. M. Midgley (Eds.), Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (pp. 741–866). Cambridge University Press, Cambridge UK and New York, NY.
  • Friedlingstein et al., (2014) Friedlingstein, P., Meinshausen, M., Arora, V. K., Jones, C. D., Anav, A., Liddicoat, S. K., & Knutti, R. (2014). Uncertainties in CMIP5 climate projections due to carbon cycle feedbacks. Journal of Climate, 27, 511–526, https://doi.org/10.1175/JCLI-D-12-00579.1.
  • Giorgi and Mearns, (2002) Giorgi, F. & Mearns, L. O. (2002). Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the “reliability ensemble averaging” (REA) method. Journal of Climate, 15(10), 1141–1158, https://doi.org/10.1175/1520-0442(2002)015<1141:COAURA>2.0.CO;2.
  • Harville, (1977) Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320–338, https://doi.org/10.1080/01621459.1977.10480998.
  • Haughton et al., (2015) Haughton, N., Abramowitz, G., Pitman, A., & Phipps, S. J. (2015). Weighting climate model ensembles for mean and variance estimates. Climate Dynamics, 45(11), 3169–3181, https://doi.org/10.1007/s00382-015-2531-3.
  • Houweling et al., (2010) Houweling, S., et al. (2010). The importance of transport model uncertainties for the estimation of CO2 sources and sinks using satellite measurements. Atmospheric Chemistry and Physics, 10(20), 9981–9992, https://doi.org/10.5194/acp-10-9981-2010.
  • IPCC, (2013) IPCC (2013). Climate Change 2013: The Physical Science Basis. Working Group I Contribution to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge, United Kingdom and New York, NY, USA: Cambridge University Press.
  • Kang and Cressie, (2013) Kang, E. L. & Cressie, N. (2013). Bayesian hierarchical ANOVA of regional climate-change projections from NARCCAP Phase II. International Journal of Applied Earth Observation and Geoinformation, 22, 3–15, https://doi.org/10.1016/j.jag.2011.12.007.
  • Kang et al., (2012) Kang, E. L., Cressie, N., & Sain, S. R. (2012). Combining outputs from the North American Regional Climate Change Assessment Program by using a Bayesian hierarchical model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(2), 291–313, https://doi.org/10.1111/j.1467-9876.2011.01010.x.
  • Kirtman et al., (2013) Kirtman, B., et al. (2013). Near-term climate change: Projections and predictability. In T. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, & P. Midgley (Eds.), Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (pp. 953–1028). Cambridge, United Kingdom and New York, NY, USA: Cambridge University Press.
  • Knutti, (2010) Knutti, R. (2010). The end of model democracy? Climatic Change, 102, 395–404, https://doi.org/10.1007/s10584-010-9800-2.
  • Knutti et al., (2010) Knutti, R., Furrer, R., Tebaldi, C., Cermak, J., & Meehl, G. A. (2010). Challenges in combining projections from multiple climate models. Journal of Climate, 23(10), 2739–2758, https://doi.org/10.1175/2009JCLI3361.1.
  • Knutti et al., (2013) Knutti, R., Masson, D., & Gettelman, A. (2013). Climate model genealogy: Generation CMIP5 and how we got there. Geophysical Research Letters, 40(6), 1194–1199, https://doi.org/10.1002/grl.50256.
  • Knutti et al., (2017) Knutti, R., Sedláček, J., Sanderson, B. M., Lorenz, R., Fischer, E. M., & Eyring, V. (2017). A climate model projection weighting scheme accounting for performance and interdependence. Geophysical Research Letters, 44(4), 1909–1918, https://doi.org/10.1002/2016GL072012.
  • Krishnamurti et al., (2000) Krishnamurti, T. N., et al. (2000). Multimodel ensemble forecasts for weather and seasonal climate. Journal of Climate, 13(23), 4196–4216, https://doi.org/10.1175/1520-0442(2000)013<4196:MEFFWA>2.0.CO;2.
  • Manning et al., (2009) Manning, L. J., Hall, J. W., Fowler, H. J., Kilsby, C. G., & Tebaldi, C. (2009). Using probabilistic climate change information from a multimodel ensemble for water resources assessment. Water Resources Research, 45(11), https://doi.org/10.1029/2007WR006674.
  • Mearns et al., (2012) Mearns, L. O., et al. (2012). The North American Regional Climate Change Assessment Program: Overview of phase I results. Bulletin of the American Meteorological Society, 93(9), 1337–1362, https://doi.org/10.1175/BAMS-D-11-00223.1.
  • Meehl et al., (2000) Meehl, G. A., Boer, G. J., Covey, C., Latif, M., & Stouffer, R. J. (2000). The Coupled Model Intercomparison Project (CMIP). Bulletin of the American Meteorological Society, 81(2), 313–318, https://doi.org/10.1175/1520-0477(2000)080<0305:MROTEA>2.3.CO;2.
  • Patterson and Thompson, (1971) Patterson, H. D. & Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554, https://doi.org/10.1093/biomet/58.3.545.
  • R Core Team, (2020) R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  • Reifen and Toumi, (2009) Reifen, C. & Toumi, R. (2009). Climate projections: Past performance no guarantee of future skill? Geophysical Research Letters, 36(13), https://doi.org/10.1029/2009GL038082.
  • Rencher and Christensen, (2012) Rencher, A. C. & Christensen, W. F. (2012). Methods of Multivariate Analysis, third edition. Hoboken, NJ: Wiley.
  • Rougier et al., (2013) Rougier, J., Goldstein, M., & House, L. (2013). Second-order exchangeability analysis for multimodel ensembles. Journal of the American Statistical Association, 108(503), 852–863, https://doi.org/10.1080/01621459.2013.802963.
  • Sain et al., (2011) Sain, S. R., Nychka, D., & Mearns, L. (2011). Functional ANOVA and regional climate experiments: A statistical analysis of dynamic downscaling. Environmetrics, 22(6), 700–711, https://doi.org/10.1002/env.1068.
  • Sanderson et al., (2017) Sanderson, B. M., Wehner, M., & Knutti, R. (2017). Skill and independence weighting for multi-model assessments. Geoscientific Model Development, 10(6), 2379–2395, https://doi.org/10.5194/gmd-10-2379-2017.
  • Santner et al., (2010) Santner, T. J., Williams, B. J., & Notz, W. I. (2010). The Design and Analysis of Computer Experiments. New York, NY: Springer.
  • Shepherd et al., (2018) Shepherd, A., et al. (2018). Mass balance of the Antarctic Ice Sheet from 1992 to 2017. Nature, 558(7709), 219–222, https://doi.org/10.1038/s41586-018-0179-y.
  • Smith et al., (2009) Smith, R. L., Tebaldi, C., Nychka, D., & Mearns, L. O. (2009). Bayesian modeling of uncertainty in ensembles of climate models. Journal of the American Statistical Association, 104(485), 97–116, https://doi.org/10.1198/jasa.2009.0007.
  • Snedecor, (1937) Snedecor, G. W. (1937). Statistical Methods Applied to Experiments in Agriculture and Biology, first edition. Ames, IA: Collegiate Press.
  • Snedecor and Cochran, (2014) Snedecor, G. W. & Cochran, W. G. (2014). Statistical Methods, eighth edition. Chichester, UK: Wiley.
  • Sunyer et al., (2014) Sunyer, M. A., Madsen, H., Rosbjerg, D., & Arnbjerg-Nielsen, K. (2014). A Bayesian approach for uncertainty quantification of extreme precipitation projections including climate model interdependency and nonstationary bias. Journal of Climate, 27(18), 7113–7132, https://doi.org/10.1175/JCLI-D-13-00589.1.
  • Tebaldi and Knutti, (2007) Tebaldi, C. & Knutti, R. (2007). The use of the multi-model ensemble in probabilistic climate projections. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1857), 2053–2075, https://doi.org/10.1098/rsta.2007.2076.
  • Tebaldi and Sansó, (2009) Tebaldi, C. & Sansó, B. (2009). Joint projections of temperature and precipitation change from multiple climate models: A hierarchical Bayesian approach. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 83–106, https://doi.org/10.1111/j.1467-985X.2008.00545.x.
  • Tebaldi et al., (2005) Tebaldi, C., Smith, R. L., Nychka, D., & Mearns, L. O. (2005). Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. Journal of Climate, 18(10), 1524–1540, https://doi.org/10.1175/JCLI3363.1.
  • Wu and Hamada, (2000) Wu, C. F. J. & Hamada, M. (2000). Experiments: Planning, Analysis, and Parameter Design Optimization. New York, NY: Wiley.

Appendices

Appendix A Additional Figures and Tables

Figure A1: The spatial extents of the 27 regions (derived from the TransCom3 regions) over which the OCO-2 MIPv7 fluxes are estimated. The names of the regions are given in Table A1.
Code  Name
T01   North American Boreal
T02   North American Temperate
T03a  Northern Tropical South America
T03b  Southern Tropical South America
T04   South American Temperate
T05a  Temperate Northern Africa
T05b  Northern Tropical Africa
T06a  Southern Tropical Africa
T06b  Temperate Southern Africa
T07   Eurasia Boreal
T08   Eurasia Temperate
T09a  Northern Tropical Asia
T09b  Southern Tropical Asia
T10a  Tropical Australia
T10b  Temperate Australia
T11   Europe
T12   North Pacific Temperate
T13   West Pacific Tropical
T14   East Pacific Tropical
T15   South Pacific Temperate
T16   Northern Ocean
T17   North Atlantic Temperate
T18   Atlantic Tropical
T19   South Atlantic Temperate
T20   Southern Ocean
T21   Indian Tropical
T22   South Indian Temperate
Table A1: The 27 regions over which the OCO-2 MIPv7 fluxes are estimated (Crowell et al., 2019). Each region has a code and a name. The locations of the regions are shown in Figure A1.
Figure A2: The groups chosen using the clustering procedure for the OCO-2 MIPv7 for each of the four seasons, DJF, MAM, JJA, and SON. Within each season, the group numbers are ordered from the clustered regions exhibiting the most variability (Group 1) to those exhibiting the least variability (Group 4, or Group 3 for MAM, which contains only three clusters).
Figure A3: Q-Q plots showing the standardized residuals, $\{\hat{\eta}_{s,r,i}^{(j)}: s=1,\ldots,4;\ r=1,\ldots,27;\ i=1,\ldots,6;\ j=1,\ldots,9\}$, versus the theoretical quantiles of the $\mathrm{Gau}(0,1)$ distribution for the SUPE-ANOVA model applied to the OCO-2 MIPv7.

Appendix B The general SUPE-ANOVA equations and the restricted likelihood

To derive the general estimation/prediction equations and the restricted likelihood of the ANOVA model given by (2) and (3), we first rewrite the model in matrix-vector form as

\mathbf{Y} = \bm{\mathrm{X}}\bm{\mu} + \bm{\mathrm{Z}}\bm{\alpha} + \bm{\eta}, \qquad (23)

where $\bm{\eta}\equiv(\eta_{1,1}^{(1)},\eta_{1,1}^{(2)},\ldots,\eta_{F,I(F)}^{(J)})^{\prime}$ is a $(J\cdot\sum_{f=1}^{F}I(f))$-dimensional vector; $\bm{\mu}\equiv(\mu_{1},\ldots,\mu_{F})^{\prime}$ is an $F$-dimensional vector; $\bm{\alpha}\equiv(\alpha_{1,1},\alpha_{1,2},\ldots,\alpha_{F,I(F)})^{\prime}$ is a $(\sum_{f=1}^{F}I(f))$-dimensional vector; and $\bm{\mathrm{X}}$ and $\bm{\mathrm{Z}}$ are matrices of ones and zeros that map the entries of $\bm{\mu}$ and $\bm{\alpha}$, respectively, to the appropriate entries of $\mathbf{Y}$. Let $\bm{\mathrm{\Sigma}}_{\eta}\equiv\mathrm{var}(\bm{\eta})$ and $\bm{\mathrm{\Sigma}}_{\alpha}\equiv\mathrm{var}(\bm{\alpha})$. The matrix $\bm{\mathrm{\Sigma}}_{\alpha}$ is diagonal with the entries of $\bm{\tau}^{2}$ on its diagonal, while $\bm{\mathrm{\Sigma}}_{\eta}$ has the entries of $\bm{\sigma}^{2}$ on its diagonal and $\{\rho_{f}^{(j,j^{\prime})}\sigma_{f}^{(j)}\sigma_{f}^{(j^{\prime})}: j\neq j^{\prime}\}$ in its off-diagonal entries. For both matrices, the entries are repeated as needed for each replicate $i=1,\ldots,I(f)$. Under (23), the marginal mean and variance of $\mathbf{Y}$ are, therefore, $E(\mathbf{Y})=\bm{\mathrm{X}}\bm{\mu}$ and $\bm{\mathrm{\Sigma}}_{Y}\equiv\mathrm{var}(\mathbf{Y})=\bm{\mathrm{Z}}\bm{\mathrm{\Sigma}}_{\alpha}\bm{\mathrm{Z}}^{\prime}+\bm{\mathrm{\Sigma}}_{\eta}$, respectively.

The multivariate Gauss–Markov theorem (e.g., Rencher and Christensen, 2012) yields the BLUE under (23) as

\bm{\mu}^{*} = (\bm{\mathrm{X}}^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}\bm{\mathrm{X}})^{-1}\bm{\mathrm{X}}^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}\mathbf{Y}. \qquad (24)

The variance of $\bm{\mu}^{*}$ is given by $\bm{\mathrm{\Sigma}}_{\mu}\equiv(\bm{\mathrm{X}}^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}\bm{\mathrm{X}})^{-1}$. The BLUP for $\bm{\alpha}$ is then

\bm{\alpha}^{*} = \bm{\mathrm{\Sigma}}_{\alpha}\bm{\mathrm{Z}}^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}(\mathbf{Y}-\bm{\mathrm{X}}\bm{\mu}^{*}), \qquad (25)

with prediction variance $\bm{\mathrm{\Sigma}}_{\alpha}^{*}\equiv\bm{\mathrm{\Sigma}}_{\alpha}-\bm{\mathrm{\Sigma}}_{\alpha}\bm{\mathrm{Z}}^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}\bm{\mathrm{Z}}\bm{\mathrm{\Sigma}}_{\alpha}$. The BLUP and prediction variance for $\mathbf{Y}$ follow as $\bm{\mathrm{X}}\bm{\mu}^{*}+\bm{\mathrm{Z}}\bm{\alpha}^{*}$ and $\bm{\mathrm{Z}}\bm{\mathrm{\Sigma}}_{\alpha}^{*}\bm{\mathrm{Z}}^{\prime}$, respectively. When $\rho_{f}^{(j,j^{\prime})}=0$ for $f=1,\ldots,F$ and $j,j^{\prime}=1,\ldots,J$, the vector expressions for the BLUE and the BLUP given here can be shown to be equivalent to the element-wise expressions (13) and (16), respectively.
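The matrix-vector computations in (24) and (25) are straightforward to carry out numerically. The following Python/NumPy sketch illustrates them on a purely hypothetical toy configuration (F = 2 factor levels, I = 3 replicates per level, J = 2 ensemble members, uncorrelated errors, and variance components taken as known); it is an illustration of the formulas above, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (for illustration only).
F, I, J = 2, 3, 2
n = F * I * J

# X maps each observation to its factor level; Z maps it to its replicate.
X = np.kron(np.eye(F), np.ones((I * J, 1)))   # n x F matrix of ones/zeros
Z = np.kron(np.eye(F * I), np.ones((J, 1)))   # n x (F*I) matrix of ones/zeros

# Variance components, assumed known here (in practice, estimated by REML).
tau2 = np.array([1.0, 0.5])                   # var(alpha) for each factor level
sigma2 = 0.25                                 # var(eta); rho = 0 (uncorrelated case)
Sigma_alpha = np.diag(np.repeat(tau2, I))
Sigma_eta = sigma2 * np.eye(n)
Sigma_Y = Z @ Sigma_alpha @ Z.T + Sigma_eta   # marginal variance of Y

# Simulate Y from the model Y = X mu + Z alpha + eta.
mu = np.array([1.0, -1.0])
alpha = rng.normal(0.0, np.sqrt(np.repeat(tau2, I)))
eta = rng.normal(0.0, np.sqrt(sigma2), n)
Y = X @ mu + Z @ alpha + eta

# BLUE of mu, equation (24), and its variance Sigma_mu.
S_inv = np.linalg.inv(Sigma_Y)
Sigma_mu = np.linalg.inv(X.T @ S_inv @ X)
mu_star = Sigma_mu @ X.T @ S_inv @ Y

# BLUP of alpha, equation (25).
alpha_star = Sigma_alpha @ Z.T @ S_inv @ (Y - X @ mu_star)
```

In this balanced, uncorrelated setting, the GLS estimate reduces to the per-level sample mean, which provides a useful numerical check on the code.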

To derive the restricted likelihood, we first assume that all unknown distributions are Gaussian. Under this assumption, $\mathbf{Y}\sim\mathrm{Gau}(\bm{\mathrm{X}}\bm{\mu},\bm{\mathrm{\Sigma}}_{Y})$. Viewed as a function of its parameters, $\bm{\mu}$, $\bm{\tau}^{2}$, $\bm{\sigma}^{2}$, and $\bm{\rho}\equiv(\rho_{1}^{(1,2)},\rho_{1}^{(1,3)},\ldots,\rho_{F}^{(J-1,J)})^{\prime}$, where $\bm{\tau}^{2}$ and $\bm{\sigma}^{2}$ are defined in Section 4, the density of this distribution is the likelihood $L(\bm{\mu},\bm{\tau}^{2},\bm{\sigma}^{2},\bm{\rho}\mid\mathbf{Y})$. The restricted likelihood is equal to the integral of the likelihood with respect to $\bm{\mu}$ (Harville, 1977); that is,

L^{(r)}(\bm{\tau}^{2},\bm{\sigma}^{2},\bm{\rho}\mid\mathbf{Y}) \equiv \int L(\bm{\mu},\bm{\tau}^{2},\bm{\sigma}^{2},\bm{\rho}\mid\mathbf{Y})\,\mathrm{d}\bm{\mu}
\propto |\bm{\mathrm{\Sigma}}_{\mu}|^{1/2}\,|\bm{\mathrm{\Sigma}}_{Y}|^{-1/2}\exp\left\{-\frac{1}{2}(\mathbf{Y}-\bm{\mathrm{X}}\bm{\mu}^{*})^{\prime}\bm{\mathrm{\Sigma}}_{Y}^{-1}(\mathbf{Y}-\bm{\mathrm{X}}\bm{\mu}^{*})\right\},

where the constant of proportionality does not depend on the unknown parameters. When $\bm{\rho}$ is assumed to be known, as in Sections 4 and 5, we write the restricted likelihood as $L^{(r)}(\bm{\tau}^{2},\bm{\sigma}^{2}\mid\mathbf{Y})$.
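To make the REML step concrete, the following sketch evaluates the restricted log-likelihood and maximizes it numerically over the variance components. It uses two purely illustrative simplifications not taken from the paper: $\bm{\rho}=\bm{0}$ and a single $\tau^{2}$ shared across factor levels, with the optimization done on the log scale to enforce positivity. The toy data and dimensions are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_loglik(log_tau2, log_sigma2, Y, X, Z):
    """Restricted log-likelihood, up to an additive constant, in the
    simplified case rho = 0 with one shared tau^2 (Harville, 1977)."""
    n = len(Y)
    Sigma_Y = np.exp(log_tau2) * (Z @ Z.T) + np.exp(log_sigma2) * np.eye(n)
    S_inv = np.linalg.inv(Sigma_Y)
    Sigma_mu = np.linalg.inv(X.T @ S_inv @ X)
    mu_star = Sigma_mu @ X.T @ S_inv @ Y          # BLUE, equation (24)
    resid = Y - X @ mu_star
    _, logdet_mu = np.linalg.slogdet(Sigma_mu)
    _, logdet_Y = np.linalg.slogdet(Sigma_Y)
    return 0.5 * logdet_mu - 0.5 * logdet_Y - 0.5 * (resid @ S_inv @ resid)

# Hypothetical toy data simulated from the model (23).
rng = np.random.default_rng(1)
F, I, J = 2, 3, 3
n = F * I * J
X = np.kron(np.eye(F), np.ones((I * J, 1)))
Z = np.kron(np.eye(F * I), np.ones((J, 1)))
alpha = rng.normal(0.0, 1.0, F * I)                # true tau^2 = 1
Y = X @ np.array([1.0, -1.0]) + Z @ alpha + rng.normal(0.0, 0.5, n)

# Maximize the restricted log-likelihood over (log tau^2, log sigma^2).
res = minimize(lambda p: -restricted_loglik(p[0], p[1], Y, X, Z),
               x0=np.zeros(2), method="Nelder-Mead")
tau2_hat, sigma2_hat = np.exp(res.x)
```

Working with $(\log\tau^{2},\log\sigma^{2})$ keeps the optimizer unconstrained while guaranteeing positive variance estimates.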

Appendix C Clustering procedure to group regions of similar variability

For the OCO-2 MIPv7 example in Section 5, we established the groupings of regions with similar variability as follows. Let $\tilde{S}_{s,r}$ be the empirical median of the set $\{|Y_{s,r,i}^{(j)}-\tilde{Y}_{s,r,i}|: i=1,\ldots,6;\ j=1,\ldots,9\}$, where $\tilde{Y}_{s,r,i}$ is the empirical median of the set $\{Y_{s,r,i}^{(j)}: j=1,\ldots,9\}$. The value $\tilde{S}_{s,r}$ is a robust empirical estimate of the ensemble variability in season $s$ and region $r$. We then divided the 27 regions into groups that are assumed to have identical variances by applying $k$-means clustering to the set $\{\log_{10}(\tilde{S}_{s,r}): r=1,\ldots,27\}$; the clustering is therefore based on the order of magnitude of the ensemble variability. The number of clusters, $k$, was chosen as the smallest $k$ such that the between-cluster sum of squares is at least 90% of the total sum of squares. The fluxes derived from the LN observation type were used to do the clustering, and the same clusters were then used on the fluxes derived from the LG and the IS observation types. The resulting groupings of regions are shown in Figure A2; the group numbers are ordered from the clustered regions exhibiting the most variability (Group 1 has the highest average $\log_{10}(\tilde{S}_{s,r})$) to those exhibiting the least variability. Four clusters were chosen for DJF, JJA, and SON, and three clusters for MAM.
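The clustering procedure for one season can be sketched as follows. This is an illustration under stated assumptions: the fluxes are synthetic stand-ins for the MIPv7 output, and SciPy's `kmeans2` is used in place of whatever $k$-means implementation was actually employed.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)

# Hypothetical stand-in for one season's fluxes: Y[r, i, j] for
# r = 27 regions, i = 6 replicates, j = 9 models, with region-specific
# spreads ranging over about two orders of magnitude.
Y = rng.normal(0.0, 10 ** rng.uniform(-1, 1, (27, 1, 1)), (27, 6, 9))

# Robust spread per region: median of |Y - median over models|,
# pooled over replicates and models (the tilde-S statistic).
Y_med = np.median(Y, axis=2, keepdims=True)
S = np.median(np.abs(Y - Y_med), axis=(1, 2))
x = np.log10(S)                    # cluster on the order of magnitude

# Smallest k whose between-cluster sum of squares is at least 90%
# of the total sum of squares.
total_ss = np.sum((x - x.mean()) ** 2)
for k in range(1, 27):
    centroids, labels = kmeans2(x.reshape(-1, 1), k, seed=0, minit="++")
    within_ss = sum(np.sum((x[labels == g] - x[labels == g].mean()) ** 2)
                    for g in range(k) if np.any(labels == g))
    if (total_ss - within_ss) / total_ss >= 0.90:
        break
```

After the loop, `labels` assigns each of the 27 regions to one of `k` variability groups; relabeling the groups in decreasing order of their average `x` reproduces the ordering convention used in Figure A2.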