
Decisions, decisions, decisions in an uncertain environment

Noel Cressie (ncressie@uow.edu.au) NIASRA, School of Mathematics and Applied Statistics, University of Wollongong, Australia

Abstract

Decision-makers abhor uncertainty, and it is certainly true that the less there is of it the better. However, recognizing that uncertainty is part of the equation, particularly for deciding on environmental policy, is a prerequisite for making wise decisions. Even making no decision is a decision that has consequences, and using the presence of uncertainty as the reason for failing to act is a poor excuse. Statistical science is the science of uncertainty, and it should play a critical role in the decision-making process. This opinion piece focuses on the summit of the knowledge pyramid that starts from data and rises in steps from data to information, from information to knowledge, and finally from knowledge to decisions. Enormous advances have been made in the last 100 years ascending the pyramid, with deviations that have followed different routes. There has generally been a healthy supply of uncertainty quantification along the way but, in a rush to the top, where the decisions are made, uncertainty is often left behind. In my opinion, statistical science needs to be much more pro-active in evolving classical decision theory into a relevant and practical area of decision applications. This article follows several threads, building on the decision-theoretic foundations of loss functions and Bayesian uncertainty.

1 Introduction

Making a decision under almost any circumstance comes with uncertainty. For example, a country’s investment in carbon capture and storage (CCS) comes with only a degree of certainty that the technology can be applied at scale. Australia’s current commitment to “net zero emissions by 2050” assumes that CCS, along with other technologies not even conceived of yet, will contribute substantially to reducing Australia’s greenhouse gas emissions. In the Australian Government’s 2021 plan, there is a prediction that “… future technology breakthroughs will reduce emissions by … 15 per cent by 2050” (Commonwealth of Australia, 2021). The remaining 85% appears to be less uncertain in the Australian Government’s technology-driven plan: 40% comes from driving down technology costs and accelerating their deployment at scale (Technology Investment Roadmap, 2021); 20% comes from emissions reductions already made since 2005; 15% comes from global technology trends; and 10% comes from high-integrity offsets. A decision was made by the Australian Government in 2021 to rule out taxes and legislative mechanisms as part of achieving net zero carbon emissions by 2050, in line with the ruling Liberal-National Party (LNP) coalition’s conservative agenda. (In the federal election of May 2022, the LNP were defeated by the Australian Labor Party, who promised to accelerate the passage to net zero carbon emissions – new decisions by the new Australian Government are forthcoming.)

Policy-makers make decisions based on data, modeling (depending on assumptions), predictions (from the modeling), and expert opinions (sometimes colored by the politics of the day). Where is there not uncertainty in this process, yet how often is it made transparent? Even the emissions reductions of 20% claimed between 2005 and 2021 in the Australian Government’s plan are based on industry and state-government self-reporting; deeper in the plan, a figure to one decimal place, 20.8%, is given, indicating high certainty but based on figures whose uncertainty has almost certainly (!) not been quantified.

Statistical science is the science of uncertainty, which includes its quantification, its estimation, and its consequences for inference in scientific models and the decisions that follow. Statistical scientists have almost always focused on how uncertainty and variability affect inference on key parameters and (summaries of) scientific processes; however, much less of their literature is devoted to the effect of uncertainty on decisions. Sometimes decision-makers see uncertainty as a reason for lack of action, but that too is a decision, and it has consequences. Sometimes the uncertainty of inference on an important environmental parameter is provided but ignored by the decision-maker, and only the estimate remains, to be seen as precise as the number of digits quoted.

Science (environmental, medical, social, statistical, etc.) in its service to society should be better integrated into the peak of the knowledge pyramid, where decisions are made: At the base of the pyramid are data; at the next level up is information obtained by exploring the data for structure; the information is then converted into knowledge at the next level by modeling the uncertainty/variability and inferring etiology; and finally, at the peak, decisions are made. Uncertainty quantification should be involved at all levels, and the consequences of ignoring it in environmental problems can be serious. For example, deciding to evacuate neighbourhoods and townships faced with possible flooding is based on meteorological and hydrological forecasts. Uncertainty of both forecasts should jointly inform evacuation decisions by government agencies; communities cannot be kept safe “on average.”

In the rest of this article, I shall discuss decision-making in the presence of uncertainty, opining that its future is in addressing the complex decisions that arise in uncertain environments. Section 2 reviews several approaches that have been taken in this area. Section 3 makes the case that decision theory and its applications have much to offer decision-makers, with its quantification of uncertainty through a posterior probability distribution and its quantification of the consequences of possible actions through a loss function. Several classes of loss functions are presented as alternatives to squared-error loss, including asymmetric loss functions that are key for making environmental decisions. Section 4 gives an example of how one such loss function can be calibrated. Finally, Section 5 gives concluding remarks and discusses new research directions.

2 A selection of criteria to help decision-makers

Before committing effort and resources to making a decision, ask first “What is the question that the decision will answer?” In the example given in Section 1, the question is, “How will Australia achieve net zero carbon emissions by 2050?” It took about five years from the time of the COP21 Paris Agreement for the Australian Government to seriously address that question. The best questions focus the decision-maker to provide the “best” answers/decisions. Providing several “good” but sub-optimal answers is not all that bad, but lack of transparency through a “good-better-best” criterion is.

In that spirit, I am arguing for clarity of assumptions, a statement of the criteria and a common numéraire against which all competing decisions are evaluated, an error budget that sets out sources of uncertainty, a statistical model that quantifies the uncertainties, and inferences that incorporate uncertainty quantification. Not just any model will do and, guided by the error budget, decision-making in the presence of uncertainty is in my opinion best served with a well fitted Bayesian hierarchical model (BHM). Complexity in underlying processes can often be captured by a computer model (e.g., Santner et al., 2010) or what has been referred to as a “digital twin.” When building a BHM, the digital twin or some approximation to it could be used as the first moment of the hidden process.

Complexity is often the case in the environmental sciences, where there are many questions asked by many stakeholders. Taking a BHM approach means that all the answers can be found somewhere in a (high-dimensional) posterior distribution. This article addresses, inter alia, how one might organize the knowledge contained in the posterior distribution in order to make an optimal decision.

Section 3 presents a decision-theoretic approach to inference, with an emphasis on prediction of unknown random quantities, rather than on estimation of unknown parameters. A relatively new approach to making the difficult decisions one sees in environmental applications is VOI (Value of Information), which aims to understand what is gained by taking additional types and amounts of data (e.g., Eidsvik et al., 2015). Suppose “value” $V$ is measured by a particular forecast’s accuracy (economic, meteorological, epidemiological, etc.), but there is a suggestion that an extra data set might be incorporated to improve the accuracy. Then,

VOI\equiv E\!\left(V\mid\text{existing and extra data}\right)-E\!\left(V\mid\text{existing data}\right).

Of course, a threshold will need to be specified to decide whether the additional data adds enough value. In some instances, the extra data set is costly and requires a cost-of-sampling function to offset the data’s added value. As discussed in Section 5, incorporating the cost of data is an important part of sampling design; not all sensors are cheap, and “found” data can be of variable quality and value.

When the measurement error of the data is assumed to be zero, the information is deemed “perfect” and VOPI (Value of Perfect Information) is an aspirational VOI value: Then a proposed data set’s VOI can be compared to the corresponding VOPI (e.g., Lark et al., 2022). Use of VOI to make decisions is a relatively new approach, and practical issues are still being investigated.
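To make the VOI calculation concrete, the following sketch (a toy conjugate Gaussian set-up, not from this article) measures “value” as the negative posterior variance of $Y$ and computes the VOI of one extra observation; a cost of sampling could then be subtracted before deciding whether to collect it.

```python
import numpy as np

# A minimal sketch of the VOI idea under a toy conjugate Gaussian model
# (all quantities here are illustrative assumptions, not from the paper).
# Y ~ N(0, tau2); each observation z = Y + noise, noise ~ N(0, sigma2).
tau2, sigma2 = 4.0, 1.0

def expected_value(n):
    """Expected 'value' of a forecast based on n observations,
    measured as negative posterior variance of Y (higher is better)."""
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    return -post_var

# VOI of one extra observation: E(V | existing and extra data) - E(V | existing data)
voi = expected_value(2) - expected_value(1)
print(f"VOI of the extra observation: {voi:.3f}")

# A sampling cost per observation could be subtracted before deciding
# whether the extra data set is worth collecting.
```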

Consider the basic problem of deciding between competing models, $M_{1},...,M_{m}$. One classical approach is to use Bayes factors (e.g., Berger, 1985). Define $\pi_{k}\equiv\mathrm{Pr}(M_{k})$, the prior probability that $M_{k}$ is the correct model, and assume that one of the $k=1,...,m$ models is correct. Informed by data $z$, the posterior probability that $M_{k}$ is the correct model is

p(k\mid z)\equiv\mathrm{Pr}(M_{k}\mid z)\propto\mathrm{Pr}(z\mid M_{k})\,\pi_{k};\quad k=1,...,m.

Then $M_{k}$ could be compared to any of the other models, here $M_{j}$, by considering the Bayes factor,

\frac{\mathrm{Pr}(z\mid M_{k})}{\mathrm{Pr}(z\mid M_{j})}=\frac{p(k\mid z)}{p(j\mid z)}\times\frac{\pi_{j}}{\pi_{k}},

and one could choose the optimal model, $\delta_{BAF}^{*}(z)\equiv\mathrm{argmax}\left\{p(k\mid z)/\pi_{k}:k=1,...,m\right\}$.

However, choosing the wrong model has differential consequences, depending on what is the correct model and which model was chosen. Another classical criterion is the expected posterior loss. Suppose those consequences are quantified by a decision table, which is an $m\times m$ matrix of losses $(L_{jk})$ where, for $j,k=1,...,m$, $L_{jk}$ is the loss incurred by choosing model $M_{j}$ when $M_{k}$ is the correct model, and $L_{jj}=0$. A decision to choose model $M_{j}$ has expected posterior loss,

E_{PL}(j\mid z)\equiv\sum_{k=1}^{m}L_{jk}\,p(k\mid z);\quad j=1,...,m. (1)

Then, minimizing the $E_{PL}$ criterion (1) over $j\in\{1,...,m\}$ results in an optimal decision, which is to choose the model, $\delta_{EPL}^{*}(z)\equiv\mathrm{argmin}\left\{E_{PL}(j\mid z):j=1,...,m\right\}$.

These two decisions, $\delta_{BAF}^{*}$ and $\delta_{EPL}^{*}$, are only indirectly related, as can be seen if it is assumed that $L_{jk}$ is the indicator function, $I(j\neq k)$, sometimes known as the 0-1 loss function. Then $E_{PL}(j\mid z)=1-p(j\mid z)$, for $j=1,...,m$, and hence $\delta_{EPL}^{*}(z)=\mathrm{argmax}\left\{p(j\mid z):j=1,...,m\right\}$. Therefore, under a uniform prior (i.e., $\pi_{j}=1/m$ for $j=1,...,m$), $\delta_{BAF}^{*}\equiv\delta_{EPL}^{*}$; otherwise, the two optimal decisions are different. This most basic of decision problems, to decide between models $M_{1},...,M_{m}$, was chosen to illustrate that different criteria may lead to different optimal decisions. Since the method of Bayes factors optimizes the (frequentist) likelihood, my preference is for a method directly based on the (Bayesian) posterior distribution, such as the expected posterior loss given by (1). This is reinforced by the conclusions of Kass and Raftery (1995), that Bayes factors should be avoided when the prior is not uniform.
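The following sketch illustrates, with purely hypothetical prior probabilities, likelihood values, and decision table, how $\delta_{BAF}^{*}$ and $\delta_{EPL}^{*}$ can disagree on the same data.

```python
import numpy as np

# A minimal sketch (toy numbers, not from the paper) contrasting the
# Bayes-factor decision with the expected-posterior-loss decision (1).
prior = np.array([0.50, 0.45, 0.05])        # pi_k = Pr(M_k), assumed
lik   = np.array([0.02, 0.03, 0.10])        # Pr(z | M_k), assumed

post = prior * lik
post /= post.sum()                          # p(k | z)

# Bayes-factor choice: maximize p(k|z)/pi_k, i.e., the likelihood Pr(z|M_k)
delta_BAF = np.argmax(post / prior)

# Expected posterior loss with a hypothetical decision table (L_jj = 0),
# which heavily penalizes choosing M_3 when it is not the correct model
L = np.array([[0.0,  1.0,  3.0],
              [1.0,  0.0,  3.0],
              [10.0, 10.0, 0.0]])
EPL = L @ post                              # EPL(j|z) = sum_k L_jk p(k|z)
delta_EPL = np.argmin(EPL)

print("posterior:", post.round(3))
print(f"Bayes-factor choice: M{delta_BAF + 1}   EPL choice: M{delta_EPL + 1}")
```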

Decisions come with consequences (losses), and it is important to capture loss along with the uncertainty. The essential property of a loss function is that it be bounded from below; hence, without loss of generality, it can be assumed to be non-negative. If one is to use expected posterior loss as a criterion, then it must also be assumed that its expectation with respect to the posterior probability measure exists (an assumption that involves the behavior of both loss and uncertainty). Uncertainties expressed through a BHM are traditionally made up of parametric conditional probabilities. A different emphasis is given in the book by Smith (2010), who features the Bayes networks that support a BHM but devotes less to the quantification of its conditional probabilities.

We shall see in the next section that the expected posterior loss is a way to account for the high-impact, low-probability scenarios seen in the presence of extreme environmental threats. In my opinion, while statistical science has devoted much of its effort to the construction and properties of distribution functions for inference, it has spent far less effort on developing loss functions for decisions. Both are needed!

3 From decision theory to decision applications

Uncertainty quantification based on conditional-probability models offers a framework for optimal decision-making. In particular, hierarchical statistical modeling effectively separates out the data from the latent process and the parameters. In this article, I focus principally on decisions about (functions of) random process(es). Decision theory provides a way to determine an optimal predictor $\delta^{*}(z)$ of the random quantity $Y$ based on data $z$ in the presence of uncertainty in both $Y$ and $z$.

Very simplistically, consider a set of actions $\{a^{(1)},a^{(2)},...\}$ and a loss function $L$, where $L$ quantifies the “balance sheet” of costs and benefits that depend on both the true quantity $Y$ and an action $a$ that could be taken by the decision-maker; hence, write it as $L(a,Y)$. This allows action $a^{(1)}$ to be compared to action $a^{(2)}$ for each possible realization $y$ of $Y$. The action with the lower loss is preferred, but a uniform ordering for all $y$ may not be achievable. In what follows, a solution based on decision theory is presented. For the case where a multivariable decision is needed, Section 3.2 proposes an eigenanalysis that reduces the problem to making independent decisions in orthogonal eigenspaces. Sections 3.3 and 3.4 develop new loss functions of action $a$ and random variable $Y$ that give the decision-maker plenty of alternatives to the ubiquitous squared-error loss function, $L_{SEL}(a,Y)\equiv(a-Y)^{2}$.

3.1 Prediction of a random quantity is a decision

Consider the BHM defined by a sequence of conditional-probability models, as follows. First, the data model is the conditional distribution, $f(z\mid y)$, of (potentially multivariate) data $z$ given the random process $Y=y$. Second, the process model is the distribution $\pi(y)$, which quantifies the uncertainty in the latent process $Y$. When $Y$ is partially known through physical laws, the data model and the process model together define a physical-statistical model (e.g., Kuhnert, 2014). Now the distributions $f(z\mid y)$ and $\pi(y)$ are implicitly conditional on parameters that are notated here as $\theta$. Then third, the parameter model is the prior probability distribution of the unknown parameters $\theta$.

Some realisations are probabilistically more likely than others given the data $z$, and here the presence of the posterior distribution of $Y$ is critical. (Note that the posterior $p(y\mid z)\propto f_{12}(y,z)$, where $f_{12}$ is the joint distribution, or uncertainty, of the unknowns $Y$ and $z$.) The classical approach taken in decision theory is to multiply the loss times the posterior density evaluated at a given $y$ and then to integrate over $y$ (e.g., Berger, 1985). That is, compute the expected posterior loss,

E_{PL}(a\mid z)\equiv\int L(a,y)\,p(y\mid z)\,\mathrm{d}y=E(L(a,Y)\mid z), (2)

which is a function of $z$ with no dependence on $y$. Then (2) is a posterior-weighted average of $L(a,y)$ over $y$, and the predictor $a^{(1)}$ is preferred to the predictor $a^{(2)}$ if $E_{PL}(a^{(1)}\mid z)\leq E_{PL}(a^{(2)}\mid z)$. Hence, under criterion (2), a uniform ordering of $L(a^{(1)},y)$ versus $L(a^{(2)},y)$ for all $y$ is not required.

Note that once $L$ is specified, the set of all possible losses, $\{L(a,y)\}$, does not change. However, the posterior distribution $p(y\mid z)$ changes according to the data $z$ observed and the modeling assumptions made about the uncertainty of $Y$ and $z$. It is used in (2) to weight the possible realisations of $L(a,y)$ for a fixed $a$. Then $a$ is varied until a decision $\delta^{*}(z)$ is found that minimizes (2). That is, $\delta^{*}(z)$ is the best decision about $Y$ based on the optimization criterion $E_{PL}$ given by (2). Notice that $Y$ is quite general and only needs to be defined on a measurable space. For example, the case of multivariate spatio-temporal prediction is discussed in Section 3.2, and the case of prediction in regression is presented in Kowal (2022).

Now suppose a different question is asked that involves making inference not on $Y$ but on a given functional $g(Y)$. A different loss function might be used but, for simplicity, suppose here that the same one is used. Then the criterion (2) is replaced with

\int L(a_{g},g(y))\,p(y\mid z)\,\mathrm{d}y, (3)

where $a_{g}$ is a generic action to take about $g(Y)$. Minimizing (3) with respect to $a_{g}$ yields an optimal decision, $\delta^{*}_{g}(z)$, about $g(Y)$. It is tempting but not optimal to predict $g(Y)$ with $g(\delta^{*}(z))$, since that predictor does not minimize (3) unless $g(y)$ is linear in $y$. For example, if $g(Y)=Y^{2}$ or $\exp\{Y\}$ or $I(Y>0)$, then (3) needs to be minimized separately for each individual $g(\cdot)$.
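As an illustration, the following sketch (toy posterior samples, squared-error loss assumed) shows how the plug-in predictor $g(\delta^{*}(z))$ differs from the optimal predictor of $g(Y)$ obtained by minimizing (3); the threshold in the indicator example is chosen arbitrarily.

```python
import numpy as np

# A minimal sketch, assuming squared-error loss and a toy posterior,
# showing that g(delta*) differs from the optimal predictor of g(Y).
rng = np.random.default_rng(0)
y_post = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)  # samples from p(y|z), assumed

delta_star = y_post.mean()                 # optimal predictor of Y under SEL: E(Y|z)

for g, name in [(np.square, "Y^2"), (np.exp, "exp(Y)"), (lambda y: (y > 1.0), "I(Y>1)")]:
    naive   = g(delta_star)                # plug-in g(delta*) -- not optimal
    optimal = g(y_post).mean()             # minimizes (3) under SEL: E(g(Y)|z)
    print(f"g = {name:7s}  plug-in: {float(naive):7.3f}   optimal: {optimal:7.3f}")
```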

An obvious special case is where the random quantity $Y$ is a random variable; Sections 3.3, 3.4, and 4 consider this in detail. In Section 3.2 to follow, $Y$ is a multivariate spatio-temporal process, a situation that often arises when making decisions about the environment.

3.2 Prediction of many variables

Applications of decision theory need to go beyond choosing between a finite number of models (Section 2) and prediction of a random variable (Sections 3.3, 3.4, and 4). All things are more-or-less related in some grand way, but science investigates which variables are most related and how. Questions in environmental science are often answered by first asking “Where?” and “When?”. Consequently, prediction of multivariate spatio-temporal processes in the presence of uncertainty is an important problem that requires not only careful statistical modeling and efficient computations but also loss functions that emphasize the subspaces where critical action is needed.

Let $Y\equiv\{\mathbf{Y}(\mathbf{s};t):\mathbf{s}\in D_{s},\,t\in D_{t}\}$ denote a vector-valued spatio-temporal process defined on the spatio-temporal domain $D_{s}\times D_{t}$, where $\mathbf{Y}(\mathbf{s};t)\equiv(Y_{i}(\mathbf{s};t):i=1,...,N)^{\prime}$, and $N$ is a positive integer. It is important to recognise that environmental processes may be multivariate and that the units of the various components may be very different. Let $\mathbf{R}$ denote the $N\times N$ correlation matrix of $\mathbf{Y}(\mathbf{s};t)$, where it is assumed that this matrix is positive-definite and invariant over $D_{s}\times D_{t}$ (i.e., it does not depend on $(\mathbf{s};t)$). However, non-stationary spatio-temporal dependence in $Y$ is allowed, since the invariance assumption only applies to the correlations and not the means and variances. In practice, $\mathbf{R}$ could be estimated from noisy observations on $Y$.

Consider the spectral decomposition, $\mathbf{R}=\mathbf{P}\bm{\Lambda}\mathbf{P}^{\prime}=\sum_{i=1}^{N}\lambda_{i}\mathbf{P}_{i}\mathbf{P}_{i}^{\prime}$, where $\bm{\Lambda}\equiv\mathrm{diag}(\lambda_{1},...,\lambda_{N})$ is an $N\times N$ diagonal matrix of decreasing eigenvalues, and the $N\times N$ matrix $\mathbf{P}\equiv(\mathbf{P}_{1},...,\mathbf{P}_{N})$ is made up of the corresponding eigenvectors (satisfying $\mathbf{P}^{\prime}\mathbf{P}=\mathbf{I}$). To build a loss function, one could project $\mathbf{Y}(\mathbf{s};t)$ and a generic predictor $\mathbf{a}(\mathbf{s};t)$ onto each of the eigenspaces. That is, for $i=1,...,N$, define $x_{i}(\mathbf{s};t)\equiv\mathbf{P}_{i}^{\prime}\mathbf{Y}(\mathbf{s};t)$ and $b_{i}(\mathbf{s};t)\equiv\mathbf{P}_{i}^{\prime}\mathbf{a}(\mathbf{s};t)$. Then the loss function $L_{i}(b_{i}(\mathbf{s};t),x_{i}(\mathbf{s};t))$ acts on the $i$-th orthogonal eigenspace, and the expected posterior loss of $L_{i}$ can be minimized with respect to $b_{i}(\mathbf{s};t)$ to yield $\gamma_{i}^{*}(\mathbf{s};t)$.

It is easy to see that the predictor of $\mathbf{Y}(\mathbf{s};t)$,

\bm{\delta}^{*}(\mathbf{s};t)\equiv\mathbf{P}\,(\gamma_{1}^{*}(\mathbf{s};t),...,\gamma_{N}^{*}(\mathbf{s};t))^{\prime}, (4)

minimizes, with respect to $\mathbf{a}(\mathbf{s};t)$, the expected posterior loss given by

E_{PL}(\mathbf{a}(\mathbf{s};t)\mid z)\equiv E\!\left(\sum_{i=1}^{N}L_{i}\big(\mathbf{P}_{i}^{\prime}\mathbf{a}(\mathbf{s};t),\mathbf{P}_{i}^{\prime}\mathbf{Y}(\mathbf{s};t)\big)\mid z\right). (5)

Obvious examples of $L_{i}$ are the squared-error loss function, $L_{i}(b_{i},x_{i})=w_{i}(b_{i}-x_{i})^{2}$, and the 0-1 loss function, $L_{i}(b_{i},x_{i})\equiv w_{i}I(b_{i}-x_{i}\neq 0)$, where $w_{i}>0$. Since $\mathbf{P}_{1},...,\mathbf{P}_{N}$ are ordered according to the eigenvalues $\lambda_{1}\geq\lambda_{2}\geq...\geq\lambda_{N}>0$, the coefficients $\{w_{i}\}$ in these examples could be given by a monotonic increasing function of $\{\lambda_{i}\}$. To my knowledge, the spatio-temporal predictor (4), obtained by minimizing (5), is new and has not been investigated.
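A small numerical sketch of the predictor (4)-(5) follows; the trivariate posterior samples, the estimate of $\mathbf{R}$, and the per-eigenspace quantile losses used here are illustrative assumptions, not prescriptions.

```python
import numpy as np

# A minimal sketch of the eigenspace predictor (4)-(5), using posterior samples
# of a trivariate Y(s;t) (toy numbers; the per-eigenspace losses assumed here
# are quantile losses, so each gamma_i* is a posterior quantile of P_i'Y).
rng = np.random.default_rng(1)
N, S = 3, 50_000
A = rng.normal(size=(N, N))
y_post = rng.normal(size=(S, N)) @ A.T          # samples from p(y|z), assumed

R = np.corrcoef(y_post, rowvar=False)           # correlation matrix of Y
lam, P = np.linalg.eigh(R)                      # eigenvectors P_i (columns)
order = np.argsort(lam)[::-1]                   # decreasing eigenvalues
lam, P = lam[order], P[:, order]

x = y_post @ P                                  # projections x_i = P_i'Y
q = [0.9, 0.7, 0.5]                             # per-eigenspace quantile levels, assumed
gamma_star = np.array([np.quantile(x[:, i], q[i]) for i in range(N)])

delta_star = P @ gamma_star                     # predictor (4)
print("eigenvalues:", lam.round(2))
print("delta*(s;t):", delta_star.round(3), " posterior mean:", y_post.mean(axis=0).round(3))
```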

3.3 Building loss functions

Classical presentations of decision theory (e.g., Berger, 1985) consider the loss function as given and then use the expected loss under the joint distribution of $Y$ and $z$ (expected joint loss) as the optimization criterion. It is straightforward to show that the consequent expected posterior loss (i.e., conditional on $z$) yields the same optimal predictor when minimized. However, the expected posterior loss as a function of action $a$ may order actions differently when $z=z^{(1)}$ than when $z=z^{(2)}$. This is the nature of the uncertainty, and it can be accounted for coherently through the joint distribution of $Y$ and $z$, $f_{12}(y,z)=p(y\mid z)f_{2}(z)$, where $f_{2}(z)$ is the marginal distribution of $z$, and recall that $p(y\mid z)$ is the posterior distribution of $Y$.

The mathematical properties of a loss function are remarkably mild: For $a$ belonging to the space of actions $\mathcal{A}$ and $y$ belonging to the space of predictands, $\mathcal{Y}$, the loss function $L(a,y)$ has domain given by $(a,y)\in\mathcal{A}\times\mathcal{Y}$ and range given by $\mathbb{R}^{+}\cup\{0\}$, the non-negative real line. Furthermore, $\mathcal{Y}\subset\mathcal{A}$ and, when $a=y$, $L(y,y)=0$. Finally, the expectation in $E_{PL}$ has to be well defined in order for it to be used as an optimization criterion.

Often, the space of possible actions $\mathcal{A}$ is a metric space with metric $d(\cdot,\cdot)$, and $L$ is given by $L_{MTC}(a,y)=m(d(a,y))$, where $m(\cdot)$ is a monotonic increasing function on the non-negative real line. For example, when $Y$ is a random variable, squared-error loss is given by $L_{MTC}^{(2)}(a,y)=(|a-y|)^{2}$, where $d(a,y)=|a-y|$ and $m(x)=x^{2}$. This generalizes to the loss function,

L_{MTC}^{(\rho)}(a,y)=(|a-y|)^{\rho};\quad\rho\in(0,\infty). (6)

Then absolute-deviation loss is $L_{MTC}^{(1)}$, and the 0-1 loss function is obtained in the limit as $\rho\rightarrow 0$ from above: $L_{MTC}^{(0)}(a,y)\equiv I(a-y\neq 0)$.

It is straightforward to show that by substituting $L_{MTC}^{(2)}$, $L_{MTC}^{(1)}$, and $L_{MTC}^{(0)}$ into (2) and minimizing it with respect to $a\in\mathbb{R}$, one obtains the optimal predictor of $Y$ as, respectively, the posterior mean, the posterior median, and the posterior mode. These three predictors actually coincide for relatively small families of symmetric unimodal posterior densities (e.g., scale mixtures of Gaussian densities). For unimodal but skewed densities, there is a well known “mode, median, and mean” inequality (e.g., Groeneveld and Meeden, 1977), which illustrates that the choice of loss function can move the optimal predictor of $Y$ substantially in one direction or another.
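The following sketch, using a skewed toy posterior, approximates $E_{PL}$ on a grid of actions for $\rho=2$, $1$, and a small $\rho$, and recovers (approximately) the posterior mean, median, and mode.

```python
import numpy as np

# A minimal sketch, using a skewed toy posterior, showing that minimizing the
# expected posterior loss (2) under |a-y|^rho recovers the posterior mean
# (rho = 2), median (rho = 1), and, approximately, the mode (rho -> 0).
rng = np.random.default_rng(2)
y_post = rng.gamma(shape=2.0, scale=1.0, size=200_000)   # samples from p(y|z), assumed
grid = np.linspace(0.0, 8.0, 2001)                       # candidate actions a

def epl(rho):
    # Monte Carlo approximation of E(|a - Y|^rho | z) over the action grid
    return np.mean(np.abs(grid[None, :] - y_post[:5000, None]) ** rho, axis=0)

for rho in (2.0, 1.0, 0.1):
    a_star = grid[np.argmin(epl(rho))]
    print(f"rho = {rho}: argmin EPL = {a_star:.2f}")

print("posterior mean  :", y_post.mean().round(2))
print("posterior median:", np.median(y_post).round(2))
print("posterior mode  : 1.00  (Gamma(2,1) mode = (shape-1)*scale)")
```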

In the rest of Section 3, $Y$ is a random variable to be predicted, for which classical, less well known, and new loss functions are discussed and compared. Asymmetry in a loss function, where $L(a,y)$ is different for $a<y$ than for $a>y$, captures the reality that the loss of an underprediction can be greater than the loss of an overprediction (or vice versa), such as for prediction of a river’s crest-height at a time of heavy rainfall in its catchment. An example of an asymmetric loss function is:

L_{QTL}^{(q)}(a,Y)=(a-Y)\big(I(a-Y>0)-q\big);\quad q\in(0,1), (7)

where $I(\cdot)$ is the indicator function. Upon minimizing the $E_{PL}$ given by $E(L_{QTL}^{(q)}(a,Y)\mid z)$ with respect to $a$, it is easily seen that the optimal predictor is the $q$-th posterior quantile. Further, $L_{QTL}^{(0.5)}(a,Y)=L_{MTC}^{(1)}(a,Y)$, and $q=0.5$ results in the only symmetric loss function in the class (7) generated by $q\in(0,1)$. Choosing $0.5<q<1$ yields larger loss when $a<Y$ (i.e., underprediction) than when $a>Y$, which is appropriate when deciding how best to react to a flooding threat. For example, suppose a decision has to be made whether to evacuate low-lying neighbourhoods (in the short term) or how high to build a town’s levee (in the long term). This amounts to predicting the indicator function, $I(Y>\kappa)$, for different thresholds $\kappa>0$. In the spatial context, such indicator functions define exceedance sets (Cressie and Suesse, 2020).

Another asymmetric family is defined by the LINEX loss function (Varian, 1975; Zellner, 1986; Cressie, 1993, p. 108),

L_{LNX}^{(\psi)}(a,Y)\equiv\exp\{\psi(a-Y)\}-\psi(a-Y)-1;\quad\psi\in(-\infty,\infty). (8)

Choosing $\psi<0$ gives larger loss when $a<Y$ (i.e., underprediction). Upon substituting $L^{(\psi)}_{LNX}(a,Y)$ into (2) and minimizing it with respect to $a\in\mathbb{R}$, one obtains the optimal predictor of $Y$:

\delta_{LNX}^{*}(z)=(-1/\psi)\log\big(E(\exp\{-\psi Y\}\mid z)\big);\quad\psi\in(-\infty,\infty),

which, by Jensen’s inequality, is larger than $E(Y\mid z)$ for $\psi<0$. That is, when $\psi<0$ in the LINEX loss function (8), the optimal predictor is larger than the classical posterior-mean predictor. While the two asymmetric classes, $\{L_{QTL}^{(q)}:q\in(0,1)\}$ given by (7) and $\{L_{LNX}^{(\psi)}:\psi\in(-\infty,\infty)\}$ given by (8), are quite different in terms of their differentiability at $a=Y$ and their shape as a function of $(a-Y)$, qualitatively they each have the ability to assign asymmetric losses that yield intuitively reasonable predictors.
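A short sketch (toy Gaussian posterior, illustrative values of $q$ and $\psi$) computes the optimal predictors under (7) and (8) from posterior samples; both exceed the posterior mean, as the asymmetry intends.

```python
import numpy as np

# A minimal sketch comparing the optimal predictors under the quantile loss (7)
# and the LINEX loss (8), computed from toy posterior samples (assumed here).
rng = np.random.default_rng(3)
y_post = rng.normal(loc=6.0, scale=2.0, size=200_000)    # samples from p(y|z), e.g., crest-height

q, psi = 0.9, -0.5                                        # illustrative asymmetry parameters
delta_qtl = np.quantile(y_post, q)                        # q-th posterior quantile
delta_lnx = (-1.0 / psi) * np.log(np.mean(np.exp(-psi * y_post)))

print("posterior mean  :", y_post.mean().round(2))
print("quantile (q=0.9):", delta_qtl.round(2))
print("LINEX (psi=-0.5):", round(delta_lnx, 2))           # both exceed the posterior mean
```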

There are other quite straightforward ways to define loss functions. Let $f(u;\omega)$, for parameters $\omega\in\Omega$, denote a parametric bounded density function of $u$ such that $f(u;\omega)\leq f(0;\omega)$, for $u\in\mathbb{R}$. Then the so-called potential function can also be used to define a loss function,

L_{PTL}^{(\omega)}(a,Y)\equiv-\log(f(a-Y;\omega))+\log(f(0;\omega));\quad\omega\in\Omega, (9)

which, depending on $f(\cdot;\omega)$, may or may not be symmetric as a function of $a$ around $Y$. Note that $L_{PTL}^{(\omega)}\geq 0$ and $L_{PTL}^{(\omega)}(Y,Y)=0$. A symmetric class is defined by the generalized Gaussian density, $f(u;\omega)=(1/2)(\omega/\Gamma(\omega^{-1}))\exp\{-|u|^{\omega}\}$, for $\omega\in(0,\infty)$, and so (9) becomes

L_{PTL}^{(\omega)}(a,Y)\equiv|a-Y|^{\omega};\quad\omega\in(0,\infty). (10)

Clearly, the same family of loss functions, $\{L_{MTC}^{(\rho)}:\rho\in(0,\infty)\}$ given by (6) and $\{L_{PTL}^{(\omega)}:\omega\in(0,\infty)\}$ given by (10), can be obtained by going down different paths of construction.

A straightforward way to generate further loss functions from a given one, $L$ say, is to take positive powers, $(L)^{p}$ for $p\in(0,\infty)$, or to exponentiate, $\exp\{L\}-1$, and there are many other ways given the mild requirements for a function of $(a,Y)$ to be a loss function. For example, the (weighted) sum of two loss functions is a loss function (such as (5) in Section 3.2), as is the product of two loss functions. Thus, taking all weighted sums of products of some well known loss functions generates a very broad class.

While loss functions appear to be plentiful, once one is chosen, it may be problematic to minimize the criterion $E_{PL}$ given by (2) for what could be quite an unusual loss function. For example, the loss function $L(a,Y)\equiv L^{(2)}_{MTC}(a,Y)+L_{MTC}^{(0)}(a,Y)$ is zero at $a=Y$ (as it should be) but discontinuous there. It assigns substantial loss for $a$ infinitesimally close to $Y$ and even more substantial additive squared-error loss for $a$ far away from $Y$. Clearly, loss functions can be pieced together to avoid discontinuities and even defined to be 0 in a measurable region of $\mathcal{A}$ that includes $a=Y$; again, minimizing $E_{PL}$ may prove problematic.

Having rich classes to draw from is essential, but calibrating loss functions to similar past environmental threats or to simulations of $Y$ and $z$ based on a digital twin has, to my knowledge, not been formalized. Following historic rain events in February 2022 in central Eastern Australia, towns along major rivers were inundated. Many houses in neighbourhoods were flooded up to two storeys, and thousands of residents were displaced. On 2 March 2022, Andrew Hall, CEO of the Insurance Council of Australia (http://insurancecouncil.com.au), was interviewed on an Australian Broadcasting Corporation news station. He stated that, of all the money spent on these types of disasters, only 3% is typically spent on prevention/planning; the remaining 97% is spent on clean-up. Calibrating this to the quantile loss functions, $\{L_{QTL}^{(q)}:q\in(0,1)\}$, one might predict a river’s crest-height $Y$ with the 0.97 posterior quantile of $p(y\mid z)$, to help decision-makers decide whether or not to evacuate low-lying neighborhoods. In Section 4, I show how this type of calibration can be used in the LINEX class of asymmetric loss functions given by (8).

3.4 Loss functions not based on displacement

A generic loss function $L(a,Y)$ does not have to be a function of the displacement, $a-Y$, but often that is the default choice. All the examples given in Section 3.3 (i.e., $L_{MTC}$, $L_{QTL}$, $L_{LNX}$, $L_{PTL}$) are functions of displacement. One easy way to generalise any loss function is to define a weighted version, where the weight is a function of $Y$. If $L(a,Y)$ is a function only of displacement, its weighted version, $L_{w}(a,Y)$, will lose that property:

L_{w}(a,Y)\equiv w(Y)\,L(a,Y), (11)

where $w(\cdot)>0$. Then substituting $L_{w}$ into (2) yields $E_{PL}$ given by $\int w(y)L(a,y)p(y\mid z)\,\mathrm{d}y\propto\int L(a,y)p_{w}(y\mid z)\,\mathrm{d}y$, where $p_{w}(y\mid z)\propto p(y\mid z)w(y)\propto f(z\mid y)\pi(y)w(y)$. Analytically, the effect of using the weighted version (11) is to modify the distribution $\pi(\cdot)$ and replace it with $\pi_{w}(\cdot)\equiv\pi(\cdot)w(\cdot)$. Then the weighted posterior is

p_{w}(y\mid z)=\frac{f(z\mid y)\pi_{w}(y)}{\int f(z\mid x)\pi_{w}(x)\,\mathrm{d}x}=\frac{w(y)p(y\mid z)}{\int w(x)p(x\mid z)\,\mathrm{d}x}.

If (2) can be minimized based on $p(y\mid z)$, then an analogous result follows immediately based on $p_{w}(y\mid z)$. For example, under squared-error loss, where $L_{SEL}(a,Y)=(a-Y)^{2}$, the optimal predictor of $Y$ is $\delta_{SEL}^{*}(z)=E(Y\mid z)$. Hence, if $L_{SEL,w}(a,Y)\equiv w(Y)(a-Y)^{2}$, then the optimal predictor of $Y$ is $\delta^{*}_{SEL,w}(z)\equiv E_{w}(Y\mid z)$, where the expectation $E_{w}$ is taken with respect to $p_{w}(\cdot\mid z)$. In a more complicated multivariate problem of finding hotspots among the latent spatial process, $Y\equiv\{Y(\mathbf{s}_{1}),...,Y(\mathbf{s}_{n})\}$, of $n$ small-area disease rates at spatial locations $\{\mathbf{s}_{1},...,\mathbf{s}_{n}\}$, Wright et al. (2003) used a weighted ranks squared error loss (WRSEL) function, where the weights were functions of the ranks of $\{Y(\mathbf{s}_{i})\}$, and the data were the observed disease counts $z\equiv\{z(\mathbf{s}_{1}),...,z(\mathbf{s}_{n})\}$.
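With posterior samples of $Y$ in hand, the weighted-loss predictor is a self-normalized weighted average; the sketch below uses a hypothetical weight function $w(\cdot)$ that emphasizes large values of $Y$.

```python
import numpy as np

# A minimal sketch of the weighted-loss predictor under (11): with posterior
# samples of Y, the weighted posterior mean E_w(Y|z) is a self-normalized
# weighted average, with weights w(Y) (the weight function below is assumed).
rng = np.random.default_rng(4)
y_post = rng.gamma(shape=2.0, scale=1.0, size=100_000)    # samples from p(y|z), assumed

def w(y):
    # Hypothetical weight emphasizing large values of Y (e.g., potential hotspots)
    return 1.0 + y ** 2

weights = w(y_post)
delta_sel   = y_post.mean()                               # E(Y|z), unweighted SEL
delta_sel_w = np.sum(weights * y_post) / np.sum(weights)  # E_w(Y|z)

print("unweighted posterior mean:", round(delta_sel, 3))
print("weighted posterior mean  :", round(delta_sel_w, 3))
```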

In some settings, $Y$ is positive, such as when it represents a random rate. Then loss may be better represented as a function of the ratio, $a/Y$, rather than of the displacement, $a-Y$. An obvious example derives from the absolute percentage error, $L_{APE}(a,Y)\equiv|1-a/Y|$. Goodness-of-fit tests are often expressed in terms of divergence measures that are functions of ratios. For example, the power-divergence between two non-negative sequences $\{b_{h}:h=1,...,H\}$ and $\{c_{h}:h=1,...,H\}$ is defined by Read and Cressie (1988) as $P\equiv\sum_{h=1}^{H}c_{h}\phi_{\lambda}(b_{h}/c_{h})$, where $\lambda\in(-\infty,\infty)$ and $\phi_{\lambda}(r)\equiv(\lambda(\lambda+1))^{-1}\{(r^{\lambda+1}-r)+\lambda(1-r)\}$, for $r\in(0,\infty)$. Then a class of loss functions for $Y>0$ can be defined as:

L_{PWD}^{(\lambda)}(a,Y)\equiv Y\phi_{\lambda}(a/Y);\quad-\infty<\lambda<\infty, (12)

where the members for $\lambda=0,-1$ are given by their respective limits. A related family has been used by Cressie et al. (2022) to generalize kriging (Matheron, 1963) to spatial prediction of a non-negative spatial process. Note that the loss (12), as a function of $a$ around $Y$, is asymmetric; for example, when $\lambda<0$, underprediction incurs more loss than overprediction.

Other classes that are loss functions of the ratio $(a/Y)$ can be found by selecting appropriate probability density functions from the exponential family of distributions. Consider the probability density,

h^{\alpha}(x)\equiv\exp\{-\alpha x\}\,l(x)/k(\alpha);\quad x\geq 0, (13)

where $\alpha>0$ is a parameter, the normalizing constant $k(\alpha)$ is a function of $\alpha$ that guarantees the density integrates to 1, and $l(x)$ can be thought of as a base measure that controls the shape of the density. For example, consider the gamma distribution, which is in the exponential family; for a given $\nu>1$, it has base measure given by $l^{(\nu)}(x)=x^{\nu-1}$ and normalizing constant given by $k^{(\nu)}(\alpha)=\Gamma(\nu)/\alpha^{\nu}$. That is, define the family of gamma probability densities, $h_{GAM}^{(\alpha,\nu)}(x)\equiv\exp\{-\alpha x\}\,l^{(\nu)}(x)/k^{(\nu)}(\alpha)$, for $x>0$, $\alpha>0$, and $\nu>1$.

The mode of the gamma density is at $x^{(0)}=(\nu-1)/\alpha$ and hence, in a similar manner to the definition (9), the following loss function can be defined:

L_{GAM}^{(\alpha,\nu)}(a,Y)=\log\left\{h_{GAM}^{(\alpha,\nu)}\left(\frac{\nu-1}{\alpha}\right)\right\}-\log\left\{h_{GAM}^{(\alpha,\nu)}\left(\frac{\nu-1}{\alpha}\,(a/Y)\right)\right\};\quad\alpha>0,\ \nu>1,

which is a function of $a/Y$. Note that $L_{GAM}^{(\alpha,\nu)}\geq 0$ and $L_{GAM}^{(\alpha,\nu)}(Y,Y)=0$. Since

\log\left\{h_{GAM}^{(\alpha,\nu)}\left(\frac{\nu-1}{\alpha}\,(a/Y)\right)\right\}=-(\nu-1)\left(a/Y\right)+(\nu-1)\log\left(\frac{\nu-1}{\alpha}\right)+(\nu-1)\log\left(a/Y\right)-\log k^{(\nu)}(\alpha),

the derivative of $L_{GAM}^{(\alpha,\nu)}(a,Y)$ with respect to $a$ is therefore $(\nu-1)\left(Y^{-1}-a^{-1}\right)$. Hence, when $L_{GAM}^{(\alpha,\nu)}(a,Y)$ is substituted into (2) and the resulting $E_{PL}$ is minimized with respect to $a$, the optimal predictor is, for all $(\alpha,\nu)\in(0,\infty)\times(1,\infty)$,

\delta^{*}_{GAM}(z)\equiv\left\{E(Y^{-1}\mid z)\right\}^{-1}. (14)

It is straightforward to show that (14) is also the optimal predictor based on $L_{PWD}^{(1)}(a,Y)$ given by (12). Further, the weighted loss function, $Y\,L_{GAM}^{(\alpha,\nu)}(a,Y)$, results in the optimal predictor, $E(Y\mid z)$, which is also the optimal predictor for $L_{SEL}(a,Y)=L_{MTC}^{(2)}(a,Y)$ given by (6) and for $L_{PWD}^{(-1)}(a,Y)$ given by (12). Clearly, very different loss functions can yield identical optimal predictors and, hence, what discriminates between decisions in the presence of uncertainty is not only the predictor but also the quantification of its uncertainty via the assumed loss function (e.g., the minimized $E_{PL}$). There is also a notion of inefficiency caused by using a “working” loss function instead of the “true” loss function. Inefficiency of a decision can be defined, for example, via the working/true expected posterior losses. Inefficiencies also arise when a “working” probability model for $Y$ and $z$ is used instead of the “true” probability model; see Cressie (2021).
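The following sketch verifies (14) numerically: it minimizes a Monte Carlo approximation of $E_{PL}$ under $L_{GAM}^{(\alpha,\nu)}$, which simplifies algebraically to $(\nu-1)(r-1-\log r)$ with $r=a/Y$, and compares the minimizer with $\{E(Y^{-1}\mid z)\}^{-1}$; the posterior samples and $(\alpha,\nu)$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# A minimal sketch verifying numerically that minimizing the expected posterior
# loss under the ratio-based gamma loss yields (14), the inverse of E(Y^{-1}|z).
rng = np.random.default_rng(5)
y_post = rng.gamma(shape=3.0, scale=2.0, size=100_000)    # samples from p(y|z), Y > 0

alpha, nu = 1.0, 2.5                                      # illustrative parameter values

def loss_gam(a, y):
    # L_GAM(a, y) = (nu - 1)(r - 1 - log r), r = a/y: zero at a = y, ratio-based
    r = a / y
    return (nu - 1.0) * (r - 1.0 - np.log(r))

def epl(a):
    return np.mean(loss_gam(a, y_post))

closed_form = 1.0 / np.mean(1.0 / y_post)                 # {E(Y^{-1}|z)}^{-1}, eq. (14)
numerical   = minimize_scalar(epl, bounds=(0.1, 50.0), method="bounded").x

print("closed form (14):", round(closed_form, 3))
print("numerical argmin:", round(numerical, 3))
```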

4 Calibrating within a class of loss functions

In Section 3, it was shown that loss functions abound, including ones that capture circumstances where losses are not symmetric functions of displacement (unlike, e.g., squared-error loss). When making decisions in the face of extreme events, asymmetric loss can be used to represent the often smaller cost of mitigation versus a major cost of recovery.

In this section, the class of LINEX loss functions $\{L_{LNX}^{(\psi)}(a,Y):\psi\in(-\infty,\infty)\}$, given by (8), quantifies this with a scalar tuning parameter $\psi$, where $\psi<0$ recognizes that larger losses will be incurred from underprediction of $Y$ (e.g., $Y$ is a river’s crest-height after an extreme amount of rainfall has occurred in the river’s catchment). For $|\psi|$ small, a Taylor-series expansion results in (approximate) squared-error loss: $L_{LNX}^{(\psi)}\approx(\psi^{2}/2)(a-Y)^{2}\propto(a-Y)^{2}$. However, as $\psi\rightarrow-\infty$, the loss when $a<Y$ (underprediction) grows exponentially compared to the loss when $a>Y$ (overprediction). Overprediction of the crest-height of a major river might lead to mass-evacuation orders that later could be seen as an unnecessary nuisance, but the alternative could be a very costly loss of material possessions/animals/crops/life. The asymmetric LINEX loss function can be tuned to make more conservative flood-related decisions by making $\psi$ more negative, resulting in a biased but conservative prediction of the crest-height. In this section, a way to calibrate $\psi$ is proposed.

Decisions are made in an uncertain environment, where here the uncertainty is expressed probabilistically, specifically through the joint distribution of $Y$ and $z$. We saw in Section 3.3 that the optimal predictor of $Y$ obtained from minimizing (2) with the LINEX loss function given by (8) is, for $\psi\in(-\infty,\infty)$,

\delta_{LNX}^{*}(z) = (-1/\psi)\log\!\left(E(\exp\{-\psi Y\}\mid z)\right)
\simeq (-1/\psi)\log\!\left(\exp\{-\psi\mu_{Y\mid z}\}+(\psi^{2}/2)\,\sigma^{2}_{Y\mid z}\exp\{-\psi\mu_{Y\mid z}\}\right)
= \mu_{Y\mid z}+(-1/\psi)\log\big(1+(\psi^{2}/2)\,\sigma^{2}_{Y\mid z}\big), (15)

where the approximation relies on small $\sigma^{2}_{Y\mid z}$ and is obtained using the delta method in terms of the conditional moments, $\mu_{Y\mid z}\equiv E(Y\mid z)$ and $\sigma^{2}_{Y\mid z}\equiv\mathrm{var}(Y\mid z)$. A further approximation up to $O(\sigma^{2}_{Y\mid z})$ is provided by the Taylor-series expansion, $\log(1+x)=x+O(x^{2})$; then from (15),

\delta_{LNX}^{*}(z)\simeq\mu_{Y\mid z}-(1/\psi)\log\big(1+(\psi^{2}/2)\,\sigma^{2}_{Y\mid z}\big)\simeq\mu_{Y\mid z}+(-\psi\,\sigma_{Y\mid z}/2)\,\sigma_{Y\mid z}.

For $\psi<0$, the approximation shows that $\delta_{LNX}^{*}(z)$ conservatively overpredicts in relation to the posterior mean, $\mu_{Y\mid z}=E(Y\mid z)$. (It was shown in Section 3.3 that overprediction occurs for $\psi<0$, no matter how large $\sigma^{2}_{Y\mid z}$ is.) It is now shown that the approximation allows $\psi$ to be chosen according to a back-of-the-envelope calculation: Based on the 0.97 Gaussian quantile, 1.88, which is motivated by the discussion at the end of Section 3.3, we solve for $\psi$ in $(-\psi\,\sigma_{Y\mid z})/2=1.88$, giving $\psi=-3.76/\sigma_{Y\mid z}$. Note that the calibration could be improved if more were known about $p(y\mid z)$, ideally its moment generating function. Unlike the quantile loss function, the LINEX loss function has derivatives of all orders, including $\partial L_{LNX}/\partial a$. In general, it can be advantageous in higher-dimensional problems to use loss functions that allow calculation of derivatives to arrive at an optimal decision in terms of $E_{PL}$ given by (2).
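A back-of-the-envelope sketch follows; the posterior moments are assumed, and a Gaussian posterior is used so that the exact LINEX predictor has a closed form and can be compared with the calibrated approximation and the 0.97 posterior quantile.

```python
# A minimal sketch of the back-of-the-envelope calibration of psi in the LINEX
# loss: psi = -3.76 / sigma_{Y|z}, chosen so that the approximate optimal
# predictor sits near the 0.97 posterior quantile. A Gaussian posterior for Y
# (toy moments below) is assumed, for which E(exp{-psi*Y}|z) is known exactly.
mu, sigma = 10.0, 1.5              # posterior mean and sd of Y (e.g., crest-height), assumed

psi = -3.76 / sigma                # calibrated tuning parameter (psi < 0)

delta_cal   = mu + (-psi * sigma / 2.0) * sigma   # approximation used for the calibration
delta_exact = mu - psi * sigma**2 / 2.0           # exact LINEX predictor for a Gaussian posterior
q097        = mu + 1.88 * sigma                   # 0.97 Gaussian posterior quantile

# For a Gaussian posterior, the calibration approximation coincides with the
# exact predictor, and both sit at the 0.97 posterior quantile.
print(f"posterior mean: {mu:.2f}   0.97 quantile: {q097:.2f}")
print(f"LINEX predictor: calibration approx {delta_cal:.2f},  exact {delta_exact:.2f}")
```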

5 Conclusions and future research directions

The previous sections have followed several threads of decision theory and demonstrated the importance of quantifying a decision $a$ in terms of a loss $L(a,Y)$, where $Y$ is the unobserved random quantity of interest. Data $z$ that are dependent on $Y$ are available, and the uncertainty in $Y$ and $z$ is expressed probabilistically through the joint distribution $f_{12}(y,z)$. The two essential components for decision theory and its applications are $L(a,y)$ and $f_{12}(y,z)$. Concentrating on $E_{PL}$ requires working with the posterior distribution, which is calculated via Bayes’ Theorem, $p(y\mid z)=f_{12}(y,z)/f_{2}(z)=f(z\mid y)\pi(y)/f_{2}(z)$, where recall that $f(z\mid y)$ is the “data model” or the “likelihood”, $\pi(y)$ is the “process model,” and $f_{2}(z)$ is the marginal distribution of $z$ (also called “the normalizing constant”).

There are a number of key conclusions to draw:

  • An optimal decision depends fundamentally on the criterion being optimized.

  • A decision comes with consequences (here expressed as losses), which depend on both the hidden process $Y$ and on the decision $\delta(z)$ about $Y$, where $z$ represents the data. Loss functions are plentiful.

  • The hidden process $Y$ is uncertain, and that uncertainty can be expressed according to a probability model $\pi(\cdot)$ for $Y$.

  • After obtaining data $z$, the process model $\pi(\cdot)$ is updated to the posterior distribution $p(\cdot\mid z)$ according to Bayes’ Theorem, and that posterior should be used in the decision-making process, along with a loss function $L$.

  • The same optimal decision may come from quite different optimality criteria (defined by quite different loss functions).

  • The optimized criterion should not be forgotten and needs to be evaluated along with the optimal decision.

  • The optimality criterion is usually the expected posterior loss, $E_{PL}$ (but it does not have to be).


Classical decision theory has concentrated on two types of “expected loss,” namely Expected Joint Loss ($E_{JL}$), sometimes referred to as Bayes risk or unconditional risk:

E_{JL}(L,\delta,f_{12}(\cdot,\cdot))\equiv\int L(\delta(z),y)\,f_{12}(y,z)\,\mathrm{d}y\,\mathrm{d}z;

and Expected Posterior Loss ($E_{PL}$), sometimes referred to as conditional risk:

E_{PL}(L,\delta(z),p(\cdot\mid z))\equiv\int L(\delta(z),y)\,p(y\mid z)\,\mathrm{d}y.

In my opinion, the word “risk” should not be used to describe $E_{JL}$ and $E_{PL}$; “risk” should be used in a way that many decision-makers and the general public use it, namely as a probability. What’s in a name? Everything! Traditionalists may not like it, but if one adopts, say, the abbreviations “$E_{PL}$” to replace “conditional risk” and “$E_{JL}$” to replace “unconditional risk,” it would avoid the profound ambiguity that currently exists where, in common parlance, “risk” often means “probability”, not “expected loss.”

The decision-theoretic approach optimizes $E_{JL}$ with respect to decision $\delta$ or, equivalently, it optimizes $E_{PL}$ with respect to $\delta(z)$. The two ingredients of $E_{PL}$ are combined first as a product, $L(\delta(z),y)\times p(y\mid z)$, which is a way of weighting the consequence of a decision about an event (no matter how catastrophic) with the risk/probability of that event (no matter how small), given the data $z$. Recall that these products are integrated over all possible events to form $E_{PL}$, which is then minimized with respect to $\delta(z)\in\mathcal{A}$, for each $z$, resulting in the optimal predictor $\delta^{*}(\cdot)$.

A possible future direction is to find other ways of combining $\{L(\delta(z),y):y\in\mathcal{Y}\}$ and $\{p(y\mid z):y\in\mathcal{Y}\}$. One way that I do not advocate is minimizing the maximum loss, resulting in the so-called minimax decision, $\delta_{MML}(z)\equiv\mathrm{argmin}_{\delta(z)}\{\max_{y\in\mathcal{Y}}L(\delta(z),y)\}$, since it is blind to the uncertainty of the environment $Y$ by assuming all values of $y\in\mathcal{Y}$ are equally likely. However, minimizing the maximum posterior loss, $\delta_{MMP}(z)\equiv\mathrm{argmin}_{\delta(z)}\{\max_{y\in\mathcal{Y}}L(\delta(z),y)\times p(y\mid z)\}$, does combine a value of $y$ with its posterior probability but, instead of averaging, it takes the maximum.

Just as the minimax criterion considers only $L$ and ignores $p$, the reverse sometimes happens, where decisions are made based only on the posterior distribution $p$. For example, a “yes/no” decision might be based on a threshold $\kappa$ where “yes” corresponds to $\mathrm{Pr}(Y>\kappa\mid z)\geq 0.5$. Some reverse engineering from Section 3.3 can be used to demonstrate that the implied loss function for the yes/no decision is an unweighted 0-1 loss function applied to the probability model, $\pi_{\kappa}\equiv\mathrm{Pr}(Y>\kappa)$ and $1-\pi_{\kappa}=\mathrm{Pr}(Y\leq\kappa)$. So under 0-1 loss, one decides “yes” if $p_{\kappa}(z)\equiv\mathrm{Pr}(Y>\kappa\mid z)\geq 1-p_{\kappa}(z)$, which is equivalent to $\mathrm{Pr}(Y>\kappa\mid z)\geq 0.5$. However, “yes” might be more costly than “no,” which can be captured with a loss function given by a decision table (described in Section 2).

In Sections 3.3 and 3.4, it was shown that there are many more ways to capture asymmetric loss than simply using posterior quantiles. For example, consider the weighted 0-1 and quantile loss functions:

L_{MTC,w}^{(0)}(a,Y)\equiv w(Y)\,I(a-Y\neq 0)
L_{QTL,w}^{(q)}(a,Y)\equiv w(Y)\,(a-Y)\big(I(a-Y>0)-q\big),

which are particularly useful in the spatial context. Now the risk maps $\{\delta_{QTL}^{(q)}(z;\mathbf{s}):\mathbf{s}\in D\subset\mathbb{R}^{d}\}$, for various posterior quantiles over a spatial domain of interest $D$ (contained in $d$-dimensional Euclidean space $\mathbb{R}^{d}$, say), allow identification of at-risk regions in $D$. However, they do not use information about catastrophic losses that would be incurred should a threshold $\kappa$ be exceeded. That could happen by letting the weight $w(Y)$ depend on $\kappa$.

There are other ways to combine loss and posterior probabilities. For example, recall the tail posterior probability, $\mathrm{Pr}(Y>\kappa\mid z)\equiv p_{\kappa}(z)$ for $\kappa\in(-\infty,\infty)$, which is sometimes called the tail risk. Then, for a given $\delta(z)$, consider the curve obtained by plotting $L(\delta(z),\kappa)$ on the vertical axis (from 0 to $\infty$) versus $p_{\kappa}(z)$ on the horizontal axis (from 0 to 1), where the curve is a locus traced out by varying the threshold $\kappa$ over $\mathcal{Y}$. Each $\delta(z)$ generates a curve, and the lower bound of these curves over $\delta(z)\in\mathcal{A}$ can be plotted and represents the smallest possible loss, conditional on $z$. This lower envelope gives a benchmark against which predictors, such as $\delta_{EPL}(z)$, $\delta_{MML}(z)$, $\delta_{MMP}(z)$, and any others could be judged. The plot would also have visual appeal to decision-makers, who could assess the costs of different actions as the tail risk varies.
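A sketch of how such a curve could be traced for a single candidate decision is given below; the posterior samples, the candidate $\delta(z)$, and the use of squared-error loss are all assumptions for illustration.

```python
import numpy as np

# A minimal sketch of the loss-versus-tail-risk curve described above: for a
# given decision delta(z), pair L(delta(z), kappa) with the tail probability
# p_kappa(z) = Pr(Y > kappa | z), tracing kappa over a grid. Toy posterior
# samples and squared-error loss are assumed here.
rng = np.random.default_rng(7)
y_post = rng.gamma(shape=4.0, scale=1.5, size=100_000)       # samples from p(y|z)

delta = np.quantile(y_post, 0.9)                              # a candidate decision delta(z)
kappas = np.linspace(y_post.min(), y_post.max(), 200)         # thresholds kappa

tail_risk = np.array([(y_post > k).mean() for k in kappas])   # p_kappa(z)
loss_curve = (delta - kappas) ** 2                            # L(delta(z), kappa), assumed SEL

# Pairs (tail_risk, loss_curve) trace one curve; repeating this for other
# candidate decisions and taking the pointwise minimum gives the lower envelope.
for p, l in list(zip(tail_risk, loss_curve))[::40]:
    print(f"tail risk {p:5.3f}   loss {l:8.2f}")
```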

Another approach is to define the probability model of uncertainty directly in terms of loss functions; see Bissiri et al. (2016). In this framework, which is a form of generalized Bayesian inference, the likelihood (i.e., the data model) is replaced with a loss function. While the framework was originally laid out for performing inference on an unknown parameter of interest, it has since been expanded to prediction of an unknown random variable or process of interest (e.g., Loaiza-Maya et al., 2021). This “generalized Bayes” approach of Bissiri et al. (2016) is quite different from the classical decision-theoretic approach, since it combines the notions of loss and posterior distribution.

The decision-maker may convene an expert panel which is asked to reach a consensus about $Y$. This can be dangerous if a single $Y=y$ is declared: A consensus can be a decision that everyone agrees to collectively, but few believe in individually! The panel’s distribution of beliefs about $Y$ could be considered to be an elicitation of the distribution $\pi(\cdot)$. However, even this could be unsatisfactory since, without data $z$, optimizing $E_{PL}$ results in a decision, $\delta_{EPL}^{*}\equiv\mathrm{argmin}_{\delta}\int L(\delta,y)\pi(y)\,\mathrm{d}y$, which is a predictor of $Y$ that is not informed by data. Of course, the expert panel might have commissioned the collection of data, but sometimes it does not happen because sampling is costly or (related) data are readily available.

The cost of data $z$ should be accounted for. Suppose $z$ is made up of $n$ individual observations; to emphasize this, it is written here as $z_{n}$; consequently, the posterior distribution is written as $p(y\mid z_{n})$. Let $c(n)$ denote a cost function that quantifies the cost of sampling. Now, the expert panel is faced with recommending a decision about the environment $Y$, but it first has to design a sample $z_{n}$ at a cost of $c(n)$ dollars. How large should $n$ be?

To answer this question, define a new loss function based on the original $L(a,Y)$:

L_{CST}(n,a,Y)\equiv\tau L(a,Y)+c(n),

where $\tau>0$ is an “exchange rate” that converts units of loss into dollars. The data have not been collected yet so, in terms of $E_{JL}$, the optimal $n$ is,

n^{*}=\mathop{\mathrm{argmin}}_{n\geq 0}\left\{\tau\int L(\delta,y)\,f_{12}(y,z_{n})\,\mathrm{d}y\,\mathrm{d}z_{n}+c(n)\right\}.

For example, Cressie and Morgan (1989) solve this sample-size optimization problem for deciding between two simple hypotheses; and in a spatial design for regularly spaced ice-core sampling on a transect across Antarctica, Cressie (1998) determined $n$ by optimizing the constant spacing between successive samples. The initial loss $L$ was not in units of dollars but was expressed in terms of the kriging variance in a tranche of Antarctica that went from coast to coast through the South Pole, and $c(n)$ was linear in $n$ after an initial set-up cost of $c(0)=c_{0}>0$.
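The following sketch illustrates the trade-off for a toy conjugate Gaussian model in which the expected-loss term is the posterior variance of $Y$ and the cost function is linear with a set-up cost; all numbers are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the sample-size optimization for n*: under a toy conjugate
# Gaussian model with squared-error loss, the expected joint loss is the
# posterior variance of Y given n observations, and c(n) is a linear cost with
# set-up cost c0 (all numbers below are illustrative assumptions).
tau, c0, c1 = 100.0, 5.0, 1.0        # exchange rate and cost-function parameters
tau2, sigma2 = 4.0, 1.0              # prior variance of Y and measurement-error variance

def total_cost(n):
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)      # expected loss under squared-error loss
    return tau * post_var + (c0 + c1 * n if n > 0 else 0.0)

n_grid = np.arange(0, 51)
n_star = n_grid[np.argmin([total_cost(n) for n in n_grid])]
print("optimal sample size n* =", n_star)
```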

So-called citizen science is one way to avoid the cost of sampling, but the uncertainty in the data, conditional on $Y=y$, may be difficult to define or may be prohibitively large. This difficulty could be mitigated if the citizen-scientists have been given a protocol and adhere to it (e.g., the breeding bird survey; see Robbins et al., 1986).

Decision theory informs decision applications, and many applications involve multiple criteria. A possible future direction is to develop a methodology that tackles situations where there are multiple stakeholders, each with their own loss function(s). In the context of making management decisions to mitigate pollutants entering Australia’s Great Barrier Reef lagoon from the Burdekin River catchment in North Queensland, Kuhnert et al. (2018) have put uncertainty at the centre of their approach. A variety of loss functions give a variety of optimal predictions along with uncertainty expressed through expected posterior losses. The authors combine these with a range of static and dynamic visualisations that the decision-maker and stakeholders can use to explore a shortlist of decisions and their consequences. The aim is that this will lead to a stakeholder-consensus loss function, although more likely it will quantify where the differences lie. However, even this limited outcome is beneficial since it crystallizes the consequences of any proposed decision. One way to construct a consensus loss is to take a weighted sum of everyone’s loss function. Then the decision-maker’s task would be to choose the weights after consulting with the stakeholders. A variety of loss functions may also be accompanied by a variety of posterior distributions, since statistical models of uncertainty are not ordained but involve a certain amount of judgement. Settling for a good decision rather than the best decision could be approached through the notion of bounded rationality (Simon, 1957). Alternatively, each stakeholder could be considered to be a group member, and the problem could be approached through making a Bayesian group decision (e.g., Keeney and Nan, 2011).

Steinitz (2012) has proposed a non-probabilistic way to make decisions where different stakeholders have different loss functions. He called it Geodesign, in which the environmental study asks six questions: How should the study area be described? How does the study area operate? Is the current study area working well? How might the study area be altered? What differences might the changes cause? How should the study area be changed? If uncertainty is incorporated into this series of questions, then it is clear that, currently, decision theory is not addressing them all. A possible future direction is to expand decision theory so that it does.

Now return to the basic problem discussed in Section 2, of choosing between $m$ models $M_{1},...,M_{m}$ with prior distribution $\{\pi_{1},...,\pi_{m}\}$. Based on data $z$, the posterior distribution is $p(k\mid z)=\mathrm{Pr}(M_{k}\mid z)\propto\mathrm{Pr}(z\mid M_{k})\pi_{k}$, for $k=1,...,m$. Section 2 proposes minimizing $E_{PL}$ given by (1), but it does not address the prediction problem. In what follows, an approach to optimal prediction, called Bayesian Model Averaging (BMA), is discussed (e.g., Draper, 1995). A possible future direction is to fully integrate BMA into a decision-theoretic setting, and a start is made in what follows. Let $\hat{Y}_{k}$ be the optimal predictor of $Y_{k}$ under model $M_{k}$ where, for the moment, squared-error loss is assumed; hence, $\hat{Y}_{k}=E_{k}(Y_{k}\mid z)$, where $E_{k}$ denotes expectation under model $M_{k}$. The BMA predictor is now derived as an optimal decision that minimizes $E_{PL}$ for the squared-error loss function given by

$$L_{SEL}(a,Y_{K})=(a-Y_{K})^{2},\qquad(16)$$

where $K$ is the unknown random index of the correct model, and $Y_{K}$ is the unknown predictand under the correct model. Then the criterion $E_{PL}$ given by (2) is:

$$E\big((a-Y_{K})^{2}\mid z\big)=\sum_{k=1}^{m}E_{k}\big((a-Y_{k})^{2}\mid z\big)\,p(k\mid z).$$

Now differentiate the right-hand side with respect to $a$ and set the result equal to 0. This yields
$\sum_{k=1}^{m}p(k\mid z)\,E_{k}(a-Y_{k}\mid z)=0$ and, hence, the optimal predictor of $Y_{K}$ is

$$\delta_{BMA}^{*}(z)\equiv\sum_{k=1}^{m}p(k\mid z)\,E_{k}(Y_{k}\mid z),\qquad(17)$$

which is the BMA predictor. Squared-error loss may not be appropriate for all models. To allow for a diversity of loss functions, $L_{1}(a,Y_{1}),\dots,L_{m}(a,Y_{m})$, define the loss for prediction of $Y_{K}$ to be $L_{K}(a,Y_{K})$, where $L_{K}$ is a random loss function. This generalizes (16), and hence $E_{PL}$ is simply $\sum_{k=1}^{m}E_{k}\big(L_{k}(a,Y_{k})\mid z\big)\,p(k\mid z)$. Consequently, the BMA predictor (17) generalizes to a posterior-weighted combination of possibly quite different individual predictors.
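
For readers who prefer arithmetic to algebra, the following minimal sketch (in Python, with illustrative numbers only) computes the BMA predictor (17) as the posterior-model-probability weighted combination of the per-model predictors $E_{k}(Y_{k}\mid z)$.

```python
# A minimal sketch of the BMA predictor in (17): a posterior-model-probability
# weighted combination of the per-model optimal predictors E_k(Y_k | z).
import numpy as np

def bma_predictor(post_model_probs, per_model_predictors):
    """delta*_BMA(z) = sum_k p(k | z) * E_k(Y_k | z)."""
    p = np.asarray(post_model_probs, dtype=float)
    yhat = np.asarray(per_model_predictors, dtype=float)
    assert np.isclose(p.sum(), 1.0), "posterior model probabilities must sum to 1"
    return float(p @ yhat)

# Three candidate models with posterior probabilities p(k | z) (illustrative)
p_k_given_z = [0.6, 0.3, 0.1]
# Each model's optimal predictor under squared-error loss, E_k(Y_k | z) (illustrative)
yhat_k = [2.4, 3.1, 1.7]
print(bma_predictor(p_k_given_z, yhat_k))   # 0.6*2.4 + 0.3*3.1 + 0.1*1.7 = 2.54
```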

What should one do if none of the models is correct? The models are capturing the “known unknowns,” but the correct model may be one of the “unknown unknowns.” One way to fill this gap is to include an extra, maximum-entropy model (e.g., Cressie, 2021). Indeed, Le and Zidek (2006, Ch. 11) proposed using only a maximum-entropy model for design and analysis because, in the design stage, the totality of questions that will be asked of the data is never known.

Other possible future directions include: optimal prediction of spatial processes $\{Y(\mathbf{s}):\mathbf{s}\in D\}$, including joint prediction of $Y$ at several spatial locations and block prediction of $Y(B)\equiv|B|^{-1}\int_{B}Y(\mathbf{s})\,\mathrm{d}\mathbf{s}$ (a small numerical sketch is given below); optimizing $E_{PL}$ or any other decision-theoretic criterion using a combination of simulation, emulation, and numerical methods; allowing artificial intelligence (AI) to make decisions in ways that may be difficult to formalize in terms of a loss function; and making decisions based on visualization and human intelligence (HI), particularly in a spatial setting using GIS (Geographical Information Systems) mapping capabilities.
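
To illustrate the block-prediction direction, here is a minimal sketch that assumes posterior draws of $Y(\cdot)$ at a fine grid of locations in a block $B$ are available from some fitted spatial model; the block average is approximated by the grid average, and its posterior mean is the optimal predictor under squared-error loss. The array name y_samples and the simulated draws are hypothetical placeholders for output from such a model.

```python
# A minimal sketch of block prediction: approximate Y(B) = |B|^{-1} * integral of
# Y(s) over B by the average of Y over a fine grid of locations in B, then
# predict it by the posterior mean of that average.
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_grid = 2000, 100
# y_samples[i, j] = i-th posterior draw of Y at the j-th grid location in B
# (stand-in for draws from a fitted spatial model)
y_samples = rng.normal(loc=5.0, scale=1.0, size=(n_draws, n_grid))

block_draws = y_samples.mean(axis=1)   # posterior draws of the block average Y(B)
y_B_hat = block_draws.mean()           # optimal predictor under squared-error loss
y_B_sd = block_draws.std(ddof=1)       # posterior uncertainty about Y(B)
print(y_B_hat, y_B_sd)
```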

I conclude this article by opining that, while diagnostics for fitting an appropriate statistical model of the uncertainties are commonly used, very little research has been done on diagnosing and validating the choice of a loss function. At the very least, post hoc evaluation of the decisions made should be carried out as a matter of course. In that vein, a possible future direction is to look at the consequences of using an inappropriate loss function whose $E_{PL}$ and $E_{JL}$ are not fit for purpose. That means the decision $\delta^{*}(z)$ that minimizes them may be non-optimal in terms of the “true” $E_{PL}$ that should have been used. While such optimized predictions may not be best, they may nonetheless be “quite good,” which can be determined by carefully designed sensitivity experiments evaluated in terms of the true $E_{PL}$ or the true $E_{JL}$.
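
One simple form such a sensitivity experiment could take is sketched below (in Python, with illustrative losses and posterior draws): optimize the decision under a “working” loss, evaluate its expected posterior loss under the “true” loss, and compare it with that of the truly optimal decision; the difference quantifies the cost of having used the wrong loss function.

```python
# A minimal sketch of a loss-function sensitivity experiment: the regret of the
# decision optimized under a working loss, measured under the true loss.
import numpy as np

def expected_posterior_loss(a, draws, loss):
    """Monte Carlo estimate of E[loss(a, Y) | z] from posterior draws of Y."""
    return float(np.mean(loss(a, draws)))

def optimize(actions, draws, loss):
    """Return the candidate action minimizing the estimated expected posterior loss."""
    epl = [expected_posterior_loss(a, draws, loss) for a in actions]
    return actions[int(np.argmin(epl))]

working_loss = lambda a, y: (a - y) ** 2                        # squared error (assumed)
true_loss = lambda a, y: np.where(a < y, 4.0 * (y - a), a - y)  # asymmetric ("true")

draws = np.random.default_rng(2).lognormal(mean=0.0, sigma=0.5, size=5000)
actions = np.linspace(0.0, 5.0, 501)

a_working = optimize(actions, draws, working_loss)   # optimal under the wrong loss
a_true = optimize(actions, draws, true_loss)         # optimal under the true loss
regret = (expected_posterior_loss(a_working, draws, true_loss)
          - expected_posterior_loss(a_true, draws, true_loss))
print(a_working, a_true, regret)   # regret >= 0 quantifies the cost of the wrong loss
```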

Acknowledgments

My thanks go to the editors of this special issue for the opportunity to be opinionated! I am also grateful to Alan Pearse for his dedicated help in preparation of this opinion piece. The research was supported by ARC Discovery Projects, DP190100180 and DP150104576.

References

  • Berger (1985) Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York, NY: Springer.
  • Bissiri et al. (2016) Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 78, 1103–1130.
  • Commonwealth of Australia (2021) Commonwealth of Australia (2021). Australia’s whole-of-economy long-term emissions reduction plan. Australian Government Department of Industry, Science, Energy and Resources. Retrieved from: https://www.industry.gov.au/data-and-publications/australias-long-term-emissions-reduction-plan.
  • Cressie (1993) Cressie, N. (1993). Statistics for Spatial Data (rev. ed.). New York, NY: Wiley.
  • Cressie (1998) Cressie, N. (1998). Transect-spacing design of ice cores on the Antarctic continent. Canadian Journal of Statistics 26, 405–418.
  • Cressie (2021) Cressie, N. (2021). A few statistical principles for data science. Australian & New Zealand Journal of Statistics 63, 182–200.
  • Cressie and Morgan (1989) Cressie, N. and P. B. Morgan (1989). Design considerations for Neyman-Pearson and Wald hypothesis testing. Metrika 36, 317–325.
  • Cressie et al. (2022) Cressie, N., A. R. Pearse, and D. Gunawan (2022). Optimal spatial prediction for non-negative spatial processes using a phi-divergence loss function. In N. Balakrishnan, M. A. Gil, N. Martin, D. Morales, and M. Pardo (Eds.), Trends in Mathematical, Information and Data Sciences, pp. 181–197. New York, NY: Springer.
  • Cressie and Suesse (2020) Cressie, N. and T. Suesse (2020). Great expectations and even greater exceedances from spatially referenced data. Spatial Statistics 37, 100420.
  • Draper (1995) Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 57, 45–97.
  • Eidsvik et al. (2015) Eidsvik, J., T. Mukerji, and D. Bhattacharjya (2015). Value of Information in the Earth Sciences: Integrating Spatial Modeling and Decision Analysis. Cambridge, UK: Cambridge University Press.
  • Groeneveld and Meeden (1977) Groeneveld, R. A. and G. Meeden (1977). The mode, median, and mean inequality. The American Statistician 31, 120–121.
  • Kass and Raftery (1995) Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.
  • Keeney and Nan (2011) Keeney, R. and R. Nan (2011). A theorem for Bayesian group decisions. Journal of Risk and Uncertainty 43, 1–17.
  • Kowal (2022) Kowal, D. R. (2022). Fast, optimal, and targeted predictions using parameterized decision analysis. Journal of the American Statistical Association 117, in press.
  • Kuhnert (2014) Kuhnert, P. M. (2014). Editorial: Physical-statistical modelling. Environmetrics 25, 201–202.
  • Kuhnert et al. (2018) Kuhnert, P. M., D. E. Pagendam, R. Bartley, D. W. Gladish, S. E. Lewis, and Z. T. Bainbridge (2018). Making management decisions in the face of uncertainty: a case study using the Burdekin catchment in the Great Barrier Reef. Marine and Freshwater Research 69, 1187–1200.
  • Lark et al. (2022) Lark, R. M., C. Chagumaira, and A. E. Milne (2022). Decisions, uncertainty and spatial information. Spatial Statistics 50, 100619.
  • Le and Zidek (2006) Le, N. and J. Zidek (2006). Statistical Analysis of Environmental Space-Time Processes. New York, NY: Springer.
  • Loaiza-Maya et al. (2021) Loaiza-Maya, R., G. M. Martin, and D. T. Frazier (2021). Focused Bayesian prediction. Journal of Applied Econometrics 36, 517–543.
  • Matheron (1963) Matheron, G. (1963). Traité de Géostatistique Appliquée, Tome II: le Krigeage. Mémoires du Bureau de Recherches Géologiques et Minières. Bureau de Recherches Géologiques et Minières, No. 24. Paris, France.
  • Read and Cressie (1988) Read, T. and N. Cressie (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. New York, NY: Springer.
  • Robbins et al. (1986) Robbins, C., D. Bystrak, and P. Geissler (1986). The breeding bird survey: Its first fifteen years, 1965-1979. Fish and Wildlife Service Resource Publication 157. Washington, DC: US Department of the Interior.
  • Santner et al. (2010) Santner, T. J., R. J. Williams, and W. I. Notz (2010). The Design and Analysis of Computer Experiments. New York, NY: Springer.
  • Simon (1957) Simon, H. (1957). Models of Man. New York, NY: Wiley.
  • Smith (2010) Smith, J. Q. (2010). Bayesian Decision Analysis: Principles and Practice. Cambridge, UK: Cambridge University Press.
  • Steinitz (2012) Steinitz, C. (2012). A Framework for Geodesign. Redlands, CA: Esri Press.
  • Technology Investment Roadmap (2021) Technology Investment Roadmap (2021). Low Emissions Technology Statement 2021. Australian Government Department of Industry, Science, Energy and Resources. Retrieved from: https://www.industry.gov.au/data-and-publications/technology-investment-roadmap-low-emissions-technology-statement-2021.
  • Varian (1975) Varian, H. R. (1975). A Bayesian approach to real estate assessment. In S. E. Fienberg and A. Zellner (Eds.), Studies in Bayesian Econometrics and Statistics in Honor of Leonard J. Savage, pp.  195–208. Amsterdam, NL: North-Holland.
  • Wright et al. (2003) Wright, D. L., H. S. Stern, and N. Cressie (2003). Loss functions for estimation of extrema with an application to disease mapping. Canadian Journal of Statistics 31, 251–266.
  • Zellner (1986) Zellner, A. (1986). Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association 81, 446–451.