
Evaluation of point forecasts for extreme events using consistent scoring functions

Robert J. Taggart
Bureau of Meteorology
robert.taggart@bom.gov.au
Abstract

We present a method for comparing point forecasts in a region of interest, such as the tails or centre of a variable’s range. This method cannot be hedged, in contrast to conditionally selecting events to evaluate and then using a scoring function that would have been consistent (or proper) prior to event selection. Our method also gives decompositions of scoring functions that are consistent for the mean or a particular quantile or expectile. Each member of each decomposition is itself a consistent scoring function that emphasises performance over a selected region of the variable’s range. The score of each member of the decomposition has a natural interpretation rooted in optimal decision theory. It is the weighted average of economic regret over user decision thresholds, where the weight emphasises those decision thresholds in the corresponding region of interest.

Keywords: Consistent scoring function; Decision theory; Forecast ranking; Forecast verification; Point forecast; Proper scoring rule; Rare and extreme events.

1 Introduction

Extreme events occur in many systems, from atmospheric to economic, and present significant challenges to society. Hence the accurate prediction of extreme events is of vital importance. In many such situations, competing forecasts are produced by a variety of forecast systems and it is natural to want to compare the performance of such forecasts with emphasis on the extremes.

In this context, it is critical that the methodology for requesting, evaluating and ranking competing forecasting systems is decision-theoretically coherent. Since the future is not precisely known, ideally forecasts ought to be probabilistic in nature, taking the form of a predictive probability distribution and assessed using a proper scoring rule (Gneiting and Katzfuss, 2014). Nonetheless, in many contexts and for a variety of reasons, point forecasts (i.e. single-valued forecasts taking values from the prediction space) are issued and used. For decision-theoretically coherent evaluation of point forecasts, either the scoring function (such as the squared error or absolute error scoring function) that will be used to assess predictive performance should be advertised in advance, or a specific functional (such as the mean or median) of the forecaster’s predictive distribution should be requested and evaluation then conducted using a scoring function that is consistent for that functional (Gneiting, 2011a). The use of proper scoring rules and consistent scoring functions encourages forecasters to quote honest, carefully considered forecasts.

To compare competing forecasts for the extremes, one may be tempted to use what would otherwise be a proper scoring rule or consistent scoring function, but restrict evaluation to a subset of events for which extremes were either observed, or forecast or perhaps both. However, such methodologies promote hedging strategies (as illustrated in Section 2) and can result in misguided inferences. This gives rise to the forecaster’s dilemma, whereby “if forecast evaluation proceeds conditionally on a catastrophic event having been observed [then] always predicting a calamity becomes a worthwhile strategy” (Lerch et al., 2017).

Nonetheless, for predictive distributions Gneiting and Ranjan (2011) showed that a suitable alternative exists in the threshold-weighted continuous ranked probability score, which is a weighted version of the commonly used continuous ranked probability score (CRPS). The weight is selected to emphasise the region of interest (such as the tails or some other region of a variable’s range) and induces a proper scoring rule. This technique extends in a natural way to point forecasts targeting the median functional, since the CRPS is a generalisation of the absolute error scoring function. The UK Met Office has recently applied this method to report the performance of temperature forecasts, with emphasis on climatological extremes (Sharpe et al., 2020).

Despite this progress, Lerch et al. (2017) offer this summary of the general situation for evaluating point forecasts at the extremes: “there is no obvious way to abate the forecaster’s dilemma by adapting existing forecast evaluation methods appropriately, such that particular emphasis can be put on extreme outcomes.” In this paper we remedy this situation. We construct consistent scoring functions that can be used to evaluate point forecasts with emphasis on performance in the region of interest, for important classes of functionals, including expectations and quantiles. Moreover, the relevant specific case of these constructions agrees with the extension of the threshold-weighted CRPS to the median functional.

The main idea of this paper can be illustrated using the squared error scoring function $S(x,y)=(x-y)^{2}$, which is consistent for point forecasts of the expectation functional. Suppose that the outcome space $\mathbb{R}$ is partitioned as $\mathbb{R}=I_{1}\cup I_{2}$, where $I_{1}=(-\infty,a)$ and $I_{2}=[a,\infty)$ for some $a$ in $\mathbb{R}$. Corollary 3.2 gives the decomposition $S=S_{1}+S_{2}$, where each scoring function $S_{i}$ is consistent for expectations whilst also emphasising predictive performance on the interval $I_{i}$. In particular, if $x,y\in I_{1}$ then $S_{2}(x,y)=0$. Given a point forecast $x$ and corresponding observation $y$, the explicit formula for $S_{2}$ is

$$S_{2}(x,y)=(y-a)^{2}\mathbbm{1}\{y\geq a\}-(x-a)^{2}\mathbbm{1}\{x\geq a\}-2(y-x)(x-a)\mathbbm{1}\{x\geq a\}.\qquad(1.1)$$

Here $\mathbbm{1}$ denotes the indicator function, so that $\mathbbm{1}\{x\geq a\}$ equals 1 if $x\geq a$ and 0 otherwise.
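For readers who want to experiment, Equation (1.1) transcribes directly into code. The following sketch (Python with NumPy; the function names are our own, not from the paper) implements $S_{2}$ and recovers $S_{1}$ as $S-S_{2}$.

    import numpy as np

    def s2_squared_error(x, y, a):
        """Component S2 of Equation (1.1): squared error emphasising [a, inf)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return ((y - a) ** 2 * (y >= a)
                - (x - a) ** 2 * (x >= a)
                - 2 * (y - x) * (x - a) * (x >= a))

    def s1_squared_error(x, y, a):
        """Complementary component S1 = S - S2, emphasising (-inf, a)."""
        return ((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2
                - s2_squared_error(x, y, a))

    # Sanity checks: S1 + S2 recovers squared error, and S2 vanishes when
    # both the forecast and the observation lie below a.
    assert np.isclose(s1_squared_error(12.0, 8.0, 10.0)
                      + s2_squared_error(12.0, 8.0, 10.0), 16.0)
    assert s2_squared_error(3.0, 7.0, 10.0) == 0.0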

The performance of competing point forecasts for expectations can then be compared by computing the mean scores $\bar{S}$, $\bar{S}_{1}$ and $\bar{S}_{2}$ for each forecast system, over the same set of events. To illustrate, consider two forecast systems A and B, whose error characteristics are depicted by the scatter plot of Figure 1. System B is homoscedastic (i.e., has even scatter about the diagonal line throughout the variable’s range) whereas System A is heteroscedastic (with relatively small errors over lower parts of the variable’s range and relatively large errors over higher parts). For this set of events, the mean squared error $\bar{S}$ for each system is very similar and there is no statistical significance in their difference. However, with $a=10$, the mean score $\bar{S}_{2}$ of System A is significantly higher than that of B (i.e., B is superior when emphasis is placed on the region $[10,\infty)$) whilst the mean score $\bar{S}_{1}$ of System A is significantly lower than that of B (i.e., A is superior when emphasis is placed on the region $(-\infty,10)$). Full details for this case study are given in Section 3.5.

Figure 1: Scatter plot of forecasts against observations for a random sample of events from the Synthetic data example of Section 3.5.

This example is illustrative. Decompositions of the outcome space can consist of more than two intervals, and the boundary between intervals can also be ‘blurred’ by selecting suitable weight functions. Each decomposition of the outcome space then induces a decomposition

$$S=\sum_{i=1}^{n}S_{i}\qquad(1.2)$$

of a given consistent scoring function $S$, whose summands $S_{i}$ are also consistent for the functional of concern, with each $S_{i}$ emphasising forecast performance in the relevant region of the outcome space. Such decompositions are presented for the consistent scoring functions of quantiles, expectations, expectiles and Huber means. The approach is unified, in that the same decomposition of the outcome space induces decompositions of the CRPS and of a variety of scoring functions for point forecasts. Details are given in Section 3 with illustrative examples.

Mathematically, the main result of this paper is a corollary of the mixture representation theorems for the consistent scoring functions of quantiles, expectiles (Ehm et al., 2016) and Huber functionals (Taggart, 2021). This furnishes each member $S_{i}$ of the decomposition (1.2) with an interpretation rooted in optimal decision theory. Certain classes of optimal decision rules elicit a user action if and only if a point forecast $x$ exceeds a particular decision threshold $\theta$. The score $S_{i}(x,y)$ is a measure of economic regret, relative to decisions based on a perfect forecast, of using the point forecast $x$ when the observation $y$ realises, averaged over all decision thresholds $\theta$ belonging to the corresponding interval $I_{i}$ of the partition. Precise details are given in Section 4 and illustrated with the aid of Murphy diagrams.

2 Hedging strategies for naïve assessments of forecasts of extreme events

We illustrate how a seemingly natural approach for comparing predictive performance at the extremes creates opportunities for hedging strategies.

A meteorological agency maintains two forecast systems A and B, each of which produces predictive distributions for hourly rainfall $Y$ at a particular location. System A is fully automated and hence cheaper to support than B. System B has knowledge of the System A forecast prior to issuing its own forecast. The agency requests the mean value of their predictive distributions with a lead time of 1 day. These point forecasts are assessed using the squared error scoring function, which is consistent for forecasts of the mean (i.e., the forecast strategy that minimises one’s expected score is to issue the mean value of one’s predictive distribution). For a two year period, the bulk of observed and forecast values fall in the interval $[0\,\mathrm{mm},10\,\mathrm{mm}]$ and there is no statistically significant difference between the performance of the two systems when scored using the squared error function.

However, the maintainers of System B claim that B performs better for the extremes and that bulk verification statistics fail to highlight this. The agency decides to use forecasts from A for the majority of cases, but will test whether B is significantly better than A at forecasting the heavy rainfall events. The agency considers four options for selecting hourly events to assess, after which the squared error scoring function will be used to compare predictive performance on those events. If System B does not perform significantly better than A over the next 12 months then it will be decommissioned.

For a given forecast case, let $F_{\mathrm{A}}$ and $F_{\mathrm{B}}$ denote the predictive distributions produced by each system, and $x_{\mathrm{A}}$ and $x_{\mathrm{B}}$ the respective point forecasts issued. Suppose that the random variable $Y$ has a distribution specified by $F_{\mathrm{B}}$. For each option, the maintainers of System B have a strategy to optimise their expected score; that is, to choose $x_{\mathrm{B}}$ such that $\mathbb{E}[(x_{\mathrm{B}}-Y)^{2}]$ is minimised.

  1. Option 1:

    Only assess events where the observation is at least 20 mm.
    Strategy: If $\mathbb{P}(Y\geq 20)>0$ then $F_{\mathrm{B}}|\{Y\geq 20\}$, which denotes the predictive distribution of B conditioned on the event $\{Y\geq 20\}$, exists. In this case, forecast $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}}|\{Y\geq 20\})$ and otherwise forecast $x_{\mathrm{B}}=20$.

  2. Option 2:

    Only assess events where either $x_{\mathrm{A}}$ or $x_{\mathrm{B}}$ is at least 20 mm.
    Strategy: If $\max(x_{\mathrm{A}},\mathrm{Mean}(F_{\mathrm{B}}))\geq 20$ then $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. Otherwise, if $\mathbb{E}[(20-Y)^{2}]<\mathbb{E}[(x_{\mathrm{A}}-Y)^{2}]$ then $x_{\mathrm{B}}=20$, else $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. This ensures that the event is assessed whenever System B expects that a forecast of 20 will receive a more favourable score than a forecast of $x_{\mathrm{A}}$.

  3. Option 3:

    Only assess events where either $x_{\mathrm{A}}$, $x_{\mathrm{B}}$ or the observation is at least 20 mm.
    Strategy: If $\max(x_{\mathrm{A}},\mathrm{Mean}(F_{\mathrm{B}}))\geq 20$ then $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. Else if $\mathbb{E}[(20-Y)^{2}]<\mathbb{E}[(x_{\mathrm{A}}-Y)^{2}]$ then $x_{\mathrm{B}}=20$. Otherwise, the only other way the event will be assessed is if the observation is at least 20 mm, so forecast $x_{\mathrm{B}}=19.9$.

  4. Option 4:

    Only assess events where $x_{\mathrm{A}}\geq 20$.
    Strategy: In this case there is nothing that System B can do to influence which events will be assessed. Therefore $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$.

Of these, only Option 4 does not expose the agency to a hedging strategy from System B. However, under Option 4 the developers of A may employ the strategy of forecasting $x_{\mathrm{A}}=\min(19.9,\mathrm{Mean}(F_{\mathrm{A}}))$, so that no events are ever selected and no further comparison of systems is made.
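The incentive created by Option 1 is easy to verify numerically. Below is a minimal simulation (our own construction, not from the paper), assuming for illustration that $F_{\mathrm{B}}$ is an exponential distribution with mean 4 mm: when only events with observations of at least 20 mm are scored, the hedged forecast $\mathrm{Mean}(F_{\mathrm{B}}|\{Y\geq 20\})$ decisively beats the honest forecast $\mathrm{Mean}(F_{\mathrm{B}})$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed predictive distribution F_B: exponential with mean 4 mm.
    mean_b = 4.0
    y = rng.exponential(mean_b, size=1_000_000)

    honest = mean_b              # Mean(F_B)
    hedged = 20.0 + mean_b       # Mean(F_B | Y >= 20), by memorylessness

    sel = y >= 20.0              # Option 1: only these events are assessed
    print("honest MSE:", np.mean((honest - y[sel]) ** 2))  # approx 416
    print("hedged MSE:", np.mean((hedged - y[sel]) ** 2))  # approx 16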

There is an Option 5: use a scoring function that is consistent for the mean functional and that emphasises performance at the extremes. We turn attention to this now.

3 Decompositions of scoring functions

We work in a setting where point forecasts $x$ and observations $y$ belong to some interval $I$ of the real line $\mathbb{R}$ (possibly with $I=\mathbb{R}$). In Section 3.1 we introduce partitions of unity, which are used to ‘subdivide’ the outcome space $I$ into subregions of interest. We then illustrate in Section 3.2 how such subdivisions induce decompositions of the CRPS, where each member of the decomposition emphasises performance on the corresponding subregion of $I$ whilst retaining propriety. To obtain analogous decompositions for consistent scoring functions, we recall in Section 3.3 that the consistent scoring functions for quantiles and expectations (among others) have general mathematical forms. The aim is to find which specific instances of the general form emphasise performance on the subregions specified by the partition of unity. This is answered in Section 3.4. Section 3.5 contains examples of such decompositions, and opens with a worked example showing how to find the formula for the squared error decomposition of Equation (1.1).

3.1 Partitions of unity

Recall that the support of a function $\chi:I\to\mathbb{R}$, denoted $\mathrm{supp}(\chi)$, is defined by

$$\mathrm{supp}(\chi)=\{t\in I:\chi(t)\neq 0\}.$$

In this paper, we say that $\{\chi_{j}\}_{j=1}^{n}$ is a partition of unity on $I$ if it is a finite set of measurable functions $\chi_{j}:I\to\mathbb{R}$ such that $0\leq\chi_{j}(t)\leq 1$ and

$$\sum_{j=1}^{n}\chi_{j}(t)=1$$

whenever $t\in I$. (For readers unfamiliar with the concept of measurability, any piecewise continuous function is measurable.) We will call each $\chi_{j}$ a weight function. We note that these differ from typical definitions in that we do not require $\chi_{j}$ to be continuous or have bounded support.

Figure 2: Two partitions $\{\chi_{j}\}_{j=1}^{n}$ of unity on the interval $[0,6)$, one using rectangular weight functions and the other trapezoidal weight functions.

Figure 2 illustrates two different partitions of unity for the interval $I=[0,6)$. The rectangular partition of unity consists of rectangular weight functions, each of the form

$$\chi_{j}(t)=\begin{cases}1,&a\leq t<b\\ 0,&\text{otherwise}\end{cases}\qquad(3.1)$$

for suitable constants satisfying $a<b$, both depending on $j$. The trapezoidal partition of unity consists of trapezoidal weight functions, each typically having the form

$$\chi_{j}(t)=\begin{cases}(t-a)/(b-a),&a\leq t<b\\ 1,&b\leq t<c\\ (d-t)/(d-c),&c\leq t<d\\ 0,&\text{otherwise}\end{cases}\qquad(3.2)$$

for suitable constants satisfying $a<b<c<d$, all depending on $j$, with appropriate modification for the end cases.
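As a concrete illustration, the weight functions of Equations (3.1) and (3.2) take only a few lines of code. This Python/NumPy sketch (with names of our own choosing) also checks that three trapezoidal weights resembling those of Figure 2 sum to one.

    import numpy as np

    def rect_weight(t, a, b):
        """Rectangular weight of Equation (3.1): 1 on [a, b), 0 elsewhere."""
        t = np.asarray(t, dtype=float)
        return ((a <= t) & (t < b)).astype(float)

    def trap_weight(t, a, b, c, d):
        """Trapezoidal weight of Equation (3.2): ramps up on [a, b), equals
        1 on [b, c), ramps down on [c, d), and is 0 elsewhere."""
        t = np.asarray(t, dtype=float)
        return np.select(
            [(a <= t) & (t < b), (b <= t) & (t < c), (c <= t) & (t < d)],
            [(t - a) / (b - a), np.ones_like(t), (d - t) / (d - c)],
            default=0.0,
        )

    # A partition of unity on [0, 6) built from three trapezoidal weights.
    t = np.linspace(0.0, 5.999, 1000)
    chi = (trap_weight(t, -1, 0, 1, 3) + trap_weight(t, 1, 3, 4, 6)
           + trap_weight(t, 4, 6, 7, 8))
    assert np.allclose(chi, 1.0)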

More generally, if $\{\psi_{j}\}_{j=1}^{n}$ is a set of nonnegative measurable functions with the property that

$$\sum_{j=1}^{n}\psi_{j}(t)>0$$

whenever $t\in I$, then a partition of unity $\{\chi_{j}\}_{j=1}^{n}$ can be constructed by defining

$$\chi_{j}(t)=\psi_{j}(t)\Big(\sum_{k=1}^{n}\psi_{k}(t)\Big)^{-1}.$$

3.2 Decomposition of the CRPS

Each partition $\{\chi_{j}\}_{j=1}^{n}$ of unity induces a corresponding decomposition of the CRPS. Given a predictive distribution $F$, expressed as a cumulative distribution function, and a corresponding observation $y$, the CRPS is defined by

$$\mathrm{CRPS}(F,y)=\int_{I}(F(z)-\mathbbm{1}\{y\leq z\})^{2}\,\mathrm{d}z$$

and for each $\chi_{j}$ the threshold-weighted CRPS by

$$\mathrm{CRPS}_{j}(F,y)=\int_{I}(F(z)-\mathbbm{1}\{y\leq z\})^{2}\,\chi_{j}(z)\,\mathrm{d}z.$$

Both are proper scoring rules (Gneiting and Ranjan, 2011). Thus the CRPS has a decomposition

$$\mathrm{CRPS}=\sum_{j=1}^{n}\mathrm{CRPS}_{j},$$

where each component $\mathrm{CRPS}_{j}$ is proper and emphasises performance in the region determined by the weight $\chi_{j}$. The Sydney rainfall forecasts example of Section 3.5 illustrates the application of such a decomposition. We now establish analogous decompositions for a wide range of scoring functions.
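A numerical sketch of this decomposition follows (Python with NumPy and SciPy; the Gaussian predictive distribution, the rectangular weights and the finite grid that truncates the integral are all our own assumptions for illustration).

    import numpy as np
    from scipy.stats import norm

    def tw_crps(cdf, y, weight, grid):
        """Threshold-weighted CRPS, approximated by the trapezoidal rule on
        a finite grid that must cover the support of the integrand."""
        integrand = (cdf(grid) - (y <= grid)) ** 2 * weight(grid)
        return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))

    # N(4, 2^2) predictive distribution, observation y = 12, and the
    # rectangular weights chi_1 = 1{z < 10}, chi_2 = 1{z >= 10}.
    grid = np.linspace(-40.0, 60.0, 20001)
    F = norm(loc=4.0, scale=2.0).cdf
    crps1 = tw_crps(F, 12.0, lambda z: (z < 10.0).astype(float), grid)
    crps2 = tw_crps(F, 12.0, lambda z: (z >= 10.0).astype(float), grid)
    print(crps1 + crps2)   # equals the unweighted CRPS of F against y = 12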

3.3 Consistent scoring functions

For decision-theoretically coherent point forecasting, forecasters need a directive in the form of a statistical functional (Gneiting, 2011a) or a scoring function which should be minimised (Patton, 2020). A statistical functional $\mathrm{T}$ is a (potentially set-valued) mapping from a class of probability distributions $\mathcal{F}$ to the real line $\mathbb{R}$. Examples include the mean (or expectation) functional, quantiles and expectiles (Newey and Powell, 1987), the latter recently attracting interest in risk management (Bellini and Di Bernardino, 2017). A consistent scoring function is a special case of a proper scoring rule in the context of point forecasts, and rewards forecasters who make careful honest forecasts.

Definition 3.1.

(Gneiting, 2011a) A scoring function $S:I\times I\to[0,\infty)$ is consistent for the functional $\mathrm{T}$ relative to a class $\mathcal{F}$ of probability distributions if

$$\mathbb{E}S(t,Y)\leq\mathbb{E}S(x,Y)\qquad\text{whenever }Y\sim F,\qquad(3.3)$$

for all probability distributions $F\in\mathcal{F}$, all $t\in\mathrm{T}(F)$ and all $x\in I$. The function $S$ is strictly consistent if $S$ is consistent and if equality in Equation (3.3) implies that $x\in\mathrm{T}(F)$.

The consistent scoring functions for many commonly used statistical functionals have general forms.

Given $g:I\to\mathbb{R}$ and $\alpha\in(0,1)$, define the ‘quantile scoring function’ $\mathrm{QSF}(g,\alpha):I\times I\to\mathbb{R}$ by

$$\mathrm{QSF}(g,\alpha)(x,y)=(\mathbbm{1}\{y<x\}-\alpha)(g(x)-g(y))\qquad\forall x,y\in I.\qquad(3.4)$$

The name QSF is justified because, subject to slight regularity conditions, a scoring function $S$ is consistent for the $\alpha$-quantile functional if and only if $S=\mathrm{QSF}(g,\alpha)$ where $g$ is nondecreasing (Gneiting, 2011b; Gneiting, 2011a; Thomson, 1979). The absolute error scoring function $S(x,y)=|x-y|$ for the median functional arises from Equation (3.4) when $g(t)=2t$ and $\alpha=1/2$. The commonly used $\alpha$-quantile scoring function

$$S(x,y)=(\mathbbm{1}\{y<x\}-\alpha)(x-y)\qquad(3.5)$$

arises when $g(t)=t$.
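In code, Equation (3.4) is a one-liner. The following Python sketch (our own naming) returns the scoring function for given $g$ and $\alpha$, and recovers Equation (3.5) and the absolute error scoring function as special cases.

    import numpy as np

    def qsf(g, alpha):
        """Quantile scoring function QSF(g, alpha) of Equation (3.4)."""
        def score(x, y):
            return (np.asarray(y < x, dtype=float) - alpha) * (g(x) - g(y))
        return score

    standard_quantile_score = qsf(lambda t: t, 0.25)   # Equation (3.5)
    absolute_error = qsf(lambda t: 2.0 * t, 0.5)       # |x - y|

    assert np.isclose(absolute_error(3.0, 7.0), 4.0)
    assert np.isclose(standard_quantile_score(3.0, 7.0), 0.25 * 4.0)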

Given a convex function $\phi:I\to\mathbb{R}$ with subderivative $\phi^{\prime}$ and $\alpha\in(0,1)$, define the function $\mathrm{ESF}(\phi,\alpha):I\times I\to\mathbb{R}$ by

$$\mathrm{ESF}(\phi,\alpha)(x,y)=|\mathbbm{1}\{y<x\}-\alpha|\big(\phi(y)-\phi(x)-\phi^{\prime}(x)(y-x)\big)\qquad\forall x,y\in I.\qquad(3.6)$$

(The subderivative is a generalisation of the derivative for convex functions and coincides with the derivative when the convex function is differentiable.) Subject to weak regularity conditions, a scoring function $S$ is consistent for the $\alpha$-expectile functional if and only if $S=\mathrm{ESF}(\phi,\alpha)$ where $\phi$ is convex (Gneiting, 2011a; Savage, 1971). The expectation (or mean) functional corresponds to the special case $\alpha=1/2$, with the squared error scoring function $S(x,y)=(x-y)^{2}$ for expectations arising from Equation (3.6) when $\phi(t)=2t^{2}$ and $\alpha=1/2$. A special case of the squared error scoring function is the Brier score, where $I=[0,1]$ and observations typically take values in $\{0,1\}$.

Given $\phi:I\to\mathbb{R}$ with subderivative $\phi^{\prime}$ and $\nu>0$, define the function $\mathrm{HSF}(\phi,\nu):I\times I\to\mathbb{R}$ by

$$\mathrm{HSF}(\phi,\nu)(x,y)=\tfrac{1}{2}\big(\phi(y)-\phi(\kappa_{\nu}(x-y)+y)+\kappa_{\nu}(x-y)\phi^{\prime}(x)\big)\qquad\forall x,y\in I,\qquad(3.7)$$

where $\kappa_{\nu}$ is the ‘capping’ function defined by $\kappa_{\nu}(x)=\max(-\nu,\min(x,\nu))$. Subject to slight regularity conditions, a scoring function $S$ is consistent for the Huber mean functional (with tuning parameter $\nu$) if and only if $S=\mathrm{HSF}(\phi,\nu)$ where $\phi$ is convex (Taggart, 2021). The Huber loss scoring function

$$S(x,y)=\begin{cases}\tfrac{1}{2}(x-y)^{2},&|x-y|\leq\nu\\ \nu|x-y|-\tfrac{1}{2}\nu^{2},&|x-y|>\nu\end{cases}$$

arises from $\mathrm{HSF}(\phi,\nu)$ when $\phi(t)=t^{2}$, and is used by the Bureau of Meteorology to score forecasts of various parameters. The Huber mean is an intermediary between the median and the mean functionals, and is a robust measure of the centre of a distribution (Huber, 1964; Taggart, 2021).
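Both general forms translate directly into code. In this Python sketch (the derivative of $\phi$ is supplied by hand, and the names are ours), squared error and the Huber loss drop out of Equations (3.6) and (3.7) as special cases.

    import numpy as np

    def esf(phi, dphi, alpha):
        """Expectile scoring function ESF(phi, alpha) of Equation (3.6);
        dphi is the (sub)derivative of the convex function phi."""
        def score(x, y):
            return abs(float(y < x) - alpha) * (phi(y) - phi(x)
                                                - dphi(x) * (y - x))
        return score

    def hsf(phi, dphi, nu):
        """Huber mean scoring function HSF(phi, nu) of Equation (3.7)."""
        def score(x, y):
            kappa = max(-nu, min(x - y, nu))   # capping function kappa_nu
            return 0.5 * (phi(y) - phi(kappa + y) + kappa * dphi(x))
        return score

    squared_error = esf(lambda t: 2.0 * t * t, lambda t: 4.0 * t, 0.5)
    huber_loss = hsf(lambda t: t * t, lambda t: 2.0 * t, nu=1.0)

    assert np.isclose(squared_error(3.0, 7.0), 16.0)
    assert np.isclose(huber_loss(3.0, 7.0), 4.0 - 0.5)  # nu|x-y| - nu^2/2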

3.4 Decomposition of consistent scoring functions

We now state the main result of this paper, namely that the consistent scoring functions for the quantile, expectile and Huber mean functionals can be written as a sum of consistent scoring functions with respect to the chosen partition of unity. It is presented as a corollary since it follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2021).

Corollary 3.2.

Suppose that $\{\chi_{j}\}_{j=1}^{n}$ is a partition of unity on $I$. For each $j$ in $\{1,\ldots,n\}$, fix any points $u_{j}$ and $v_{j}$ in $I$.

  (a) If $g:I\to\mathbb{R}$ is a nondecreasing differentiable function and $\alpha\in(0,1)$ then

$$\mathrm{QSF}(g,\alpha)=\sum_{j=1}^{n}\mathrm{QSF}(g_{j},\alpha)$$

where $g_{j}$ is nondecreasing and defined by

$$g_{j}(u)=\int_{u_{j}}^{u}\chi_{j}(\theta)g^{\prime}(\theta)\,\mathrm{d}\theta.\qquad(3.8)$$

Moreover, if $I_{0}\subset I$ is an interval and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$ then $\mathrm{QSF}(g_{j},\alpha)=0$ on $I_{0}\times I_{0}$.

  (b) If $\phi:I\to\mathbb{R}$ is a convex twice-differentiable function, $\alpha\in(0,1)$ and $\nu>0$ then

$$\mathrm{ESF}(\phi,\alpha)=\sum_{j=1}^{n}\mathrm{ESF}(\phi_{j},\alpha)\qquad(3.9)$$

and

$$\mathrm{HSF}(\phi,\nu)=\sum_{j=1}^{n}\mathrm{HSF}(\phi_{j},\nu),$$

where $\phi_{j}$ is convex and defined by

$$\phi_{j}(u)=\int_{u_{j}}^{u}\int_{v_{j}}^{v}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v.\qquad(3.10)$$

Moreover, if $I_{0}\subset I$ is an interval and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$ then $\mathrm{ESF}(\phi_{j},\alpha)=\mathrm{HSF}(\phi_{j},\nu)=0$ on $I_{0}\times I_{0}$.

See Appendix A for the proof. Appendix B states closed-form expressions for $g_{j}$ and $\phi_{j}$ for commonly used scoring functions when the partition of unity is rectangular or trapezoidal. We note that the natural analogue of Corollary 3.2 for the consistent scoring functions of Huber functionals (which are generalised Huber means) also holds.

One can show that $\mathrm{QSF}(g_{j},\alpha)$, $\mathrm{ESF}(\phi_{j},\alpha)$ and $\mathrm{HSF}(\phi_{j},\nu)$ are independent of the choice of points $u_{j}$ and $v_{j}$ in $I$. In practice, the choice of $u_{j}$ and $v_{j}$ may be determined by computational convenience, such as selecting (if it exists) the minimum of $\mathrm{supp}(\chi_{j})$.
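For weight functions without convenient antiderivatives, $g_{j}$ and $\phi_{j}$ can be computed by numerical quadrature of Equations (3.8) and (3.10). A sketch (slow but serviceable, using SciPy; the smooth arctan weight discussed below serves as the example) follows.

    import numpy as np
    from scipy.integrate import quad

    def make_g_j(chi, dg, u_j):
        """g_j of Equation (3.8): the integral of chi * g' from u_j to u."""
        return lambda u: quad(lambda t: chi(t) * dg(t), u_j, u)[0]

    def make_phi_j(chi, d2phi, u_j, v_j):
        """phi_j of Equation (3.10): the iterated integral of chi * phi''."""
        inner = lambda v: quad(lambda t: chi(t) * d2phi(t), v_j, v)[0]
        return lambda u: quad(inner, u_j, u)[0]

    # Example: chi_2(t) = 1/2 + arctan(t - a)/pi with a = 0, phi(t) = 2t^2.
    chi2 = lambda t: 0.5 + np.arctan(t) / np.pi
    phi2 = make_phi_j(chi2, lambda t: 4.0, 0.0, 0.0)
    print(phi2(2.0))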

Finally, we discuss strict consistency. Suppose that $\mathrm{QSF}(g,\alpha)$, $\mathrm{ESF}(\phi,\alpha)$ and $\mathrm{HSF}(\phi,\nu)$ are strictly consistent for the quantile, expectile or Huber mean functionals respectively for some class $\mathcal{F}$ of probability distributions. This occurs when $g$ is strictly increasing, or when $\phi$ is strictly convex, perhaps subject to mild regularity conditions on $\mathcal{F}$ (Gneiting, 2011a; Taggart, 2021). If $\chi_{j}$ is strictly positive on $I$ then $\mathrm{QSF}(g_{j},\alpha)$, $\mathrm{ESF}(\phi_{j},\alpha)$ and $\mathrm{HSF}(\phi_{j},\nu)$ are also strictly consistent for their respective functionals. An example of such a partition $\{\chi_{1},\chi_{2}\}$ of unity on $\mathbb{R}$ is given by

$$\chi_{2}(t)=\tfrac{1}{2}+\tfrac{1}{\pi}\arctan(t-a)$$

and $\chi_{1}(t)=1-\chi_{2}(t)$. Here, $\chi_{2}$ induces scoring functions $S_{2}$ that emphasise performance on $(a,\infty)$ but do not completely ignore performance on any subinterval of $\mathbb{R}$.

3.5 Examples

We begin by demonstrating how to find explicit formulae for a particular decomposition.

Figure 3: Decomposition $S=S_{1}+S_{2}$ of the squared error scoring function, using the rectangular and trapezoidal partitions. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$ for different forecast–observation pairs. In each case, $S_{1}$ emphasises performance on the interval $(0,3)$ while $S_{2}$ emphasises performance on $(3,6)$.
Example 3.3 (Decomposition of the squared error scoring function).

Let $S$ denote the scoring function $S(x,y)=(x-y)^{2}$. Note that $S=\mathrm{ESF}(\phi,0.5)$, where $\phi(t)=2t^{2}$, via Equation (3.6), so that Corollary 3.2 applies. We use a simple rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity, where

$$\chi_{2}(t)=\begin{cases}0,&t<a\\ 1,&t\geq a\end{cases}$$

and $\chi_{1}(t)=1-\chi_{2}(t)$. Corollary 3.2 gives the corresponding decomposition $S=S_{1}+S_{2}$, where $S_{i}=\mathrm{ESF}(\phi_{i},0.5)$. To find the explicit formula for $S_{2}$, we compute the function $\phi_{2}$ using Equation (3.10). Integrating twice with the choice $u_{2}=v_{2}=a$ gives

$$\phi_{2}(u)=\begin{cases}0,&u<a\\ 2(u-a)^{2},&u\geq a\end{cases}\;=\;2(u-a)^{2}\,\mathbbm{1}\{u\geq a\}.$$

Thus $S_{2}=\mathrm{ESF}(\phi_{2},0.5)$, which yields the explicit formula of Equation (1.1) via Equation (3.6). The explicit formula for $S_{1}$ follows easily from the identity $S_{1}=S-S_{2}$.

Figure 3 illustrates the decomposition $S=S_{1}+S_{2}$ with respect to the rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity with $a=3$, and also with respect to a trapezoidal partition. For each forecast $x$ and observation $y$, the solid line represents the score $S(x,y)$, the dotted line the score $S_{1}(x,y)$ corresponding to the weight function $\chi_{1}$ with support on the left of the interval, and the dashed line the score $S_{2}(x,y)$ corresponding to the weight function $\chi_{2}$ with support on the right of the interval. Note that when the forecast $x$ and observation $y$ both lie outside the support of $\chi_{j}$ then $S_{j}(x,y)=0$.

Figure 4: Decomposition $S=S_{1}+S_{2}$ of the commonly used consistent scoring function for $0.25$-quantile forecasts, using the rectangular and trapezoidal partitions. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$ for different forecast–observation pairs. In each case, $S_{1}$ emphasises performance on the interval $(0,4)$ while $S_{2}$ emphasises performance on $(4,8)$.
Example 3.4 (Decomposition of a quantile scoring function).

Let $S$ denote the scoring function

$$S(x,y)=\begin{cases}0.25(y-x),&y\geq x\\ 0.75(x-y),&y<x,\end{cases}$$

so that $S=\mathrm{QSF}(g,\alpha)$ where $g(t)=t$ and $\alpha=0.25$. The decomposition $S=S_{1}+S_{2}$ is illustrated in Figure 4 with respect to two different partitions $\{\chi_{1},\chi_{2}\}$ of unity. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$.

Example 3.5 (Synthetic data).

Suppose that the climatological distribution of an unknown quantity $Y$ is normal with $Y\sim\mathcal{N}(4,15^{2})$. Two forecast systems A and B issue point forecasts for $Y$ targeting the mean functional. Their respective forecast errors are independently normally distributed with error characteristics $e_{\mathrm{A}}\sim\mathcal{N}(0,(\arctan(y-10)+2)^{2})$ (where $y$ is the observation) and $e_{\mathrm{B}}\sim\mathcal{N}(0,2^{2})$. Hence the standard deviation of $e_{\mathrm{A}}$ is lower than the standard deviation of $e_{\mathrm{B}}$ when $y$ is less than 10, while the reverse is true when $y$ is greater than 10. This is evident in the varying degree of scatter about the diagonal line in the forecast–observation plot of Figure 1.

We sample 10000 independent observations and corresponding forecasts, and compare both forecast systems using the squared error scoring function $S$ along with the two components of its decomposition $S=S_{1}+S_{2}$. The decomposition is induced from a rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity on $\mathbb{R}$ with $\mathrm{supp}(\chi_{1})=(-\infty,10)$ and $\mathrm{supp}(\chi_{2})=[10,\infty)$. The mean scores for each system are shown in Table 1, along with a 95% confidence interval of the difference in mean scores.

An analysis of this sample concludes that neither System A nor System B is significantly better than its competitor as measured by $S$, since 0 belongs to the confidence interval of the difference in their mean scores. However, one can infer with high confidence that A performs better when scored by $S_{1}$, since 0 is well outside the corresponding confidence interval. Similarly, one may conclude with high confidence that B performs better when scored by $S_{2}$.

Based on the results in the table, one would use the forecast from A if both forecasts are less than 10, and the forecast from B if both forecasts are greater than 10. If the forecasts lie on opposite sides of 10, one option is to take the average of the two forecasts. We revisit this example in Section 4 from the perspective of users and optimal decision rules.
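The experiment is straightforward to reproduce approximately; a sketch follows (Python/NumPy; the random seed is arbitrary, so the mean scores will match Table 1 only up to sampling variation).

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    y = rng.normal(4.0, 15.0, size=n)                     # Y ~ N(4, 15^2)
    x_a = y + rng.normal(0.0, np.arctan(y - 10.0) + 2.0)  # heteroscedastic
    x_b = y + rng.normal(0.0, 2.0, size=n)                # homoscedastic

    a = 10.0
    def s2(x, y):   # Equation (1.1)
        return ((y - a) ** 2 * (y >= a) - (x - a) ** 2 * (x >= a)
                - 2 * (y - x) * (x - a) * (x >= a))

    # Mean scores S-bar, S1-bar and S2-bar for each system.
    for name, x in (("A", x_a), ("B", x_b)):
        s = (x - y) ** 2
        print(name, s.mean(), (s - s2(x, y)).mean(), s2(x, y).mean())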

Table 1: Comparative evaluation of forecast systems A and B from the Synthetic data example, rounded to 2 decimal places. The difference is the mean score of A minus the mean score of B.
Mean score System A System B 95% confidence interval of difference
$\bar{S}$ 4.14 4.02 (-0.12, 0.36)
$\bar{S}_{1}$ 0.61 2.65 (-2.16, -1.92)
$\bar{S}_{2}$ 3.53 1.36 (1.97, 2.36)
Example 3.6 (Sydney rainfall forecasts).

Two Bureau of Meteorology forecast systems BoM and OCF issue predictive distributions for daily rainfall at Sydney Observatory Hill. Climatologically, daily rainfall exceeds 35.8 mm on average 12 times a year, and 42.2 mm on average 6 times a year. We partition the outcome space $[0,\infty)$ using a trapezoidal partition $\{\chi_{1},\chi_{2}\}$ of unity, where $\chi_{2}(t)=0$ if $0\leq t<35.8$, $\chi_{2}(t)=(t-35.8)/(42.2-35.8)$ if $35.8\leq t<42.2$ and $\chi_{2}(t)=1$ if $t\geq 42.2$. Naturally, $\chi_{1}(t)=1-\chi_{2}(t)$.

Figure 5: Proper score decompositions $S=S_{1}+S_{2}$ by lead day for different types of Sydney daily rainfall forecasts, where $S$ is a standard quantile scoring function (top), squared error scoring function (bottom left) or CRPS (bottom right). In each case the score $S_{2}$ emphasises performance for heavier rainfall.

Over the entire outcome space, the quality of each system’s forecasts is assessed using the CRPS (for predictive distributions), the squared error scoring function (for expectation point forecasts), and the standard quantile scoring function of Equation (3.5) (for quantile point forecasts). Moreover, each of these scores $S$ is decomposed as $S=S_{1}+S_{2}$ using the common partition $\{\chi_{1},\chi_{2}\}$. Their mean scores by forecast lead day, for the period July 2018 to June 2020, are shown in Figure 5. This example illustrates that the same partition of unity can be used to home in on particular regions of interest across a variety of forecast types, using an assessment method that cannot be hedged. When performance is assessed with emphasis on heavier rainfall via $S_{2}$, BoM is better than OCF at lead days 1 and 2 for expectation and 0.9-quantile forecasts, marginally better with its predictive distributions, but worse than OCF at lead day 1 for 0.25-quantile forecasts.

4 Decision theoretic interpretation of scoring function decompositions

Mixture representations and elementary scoring functions facilitate a decision-theoretic interpretation of the scoring function decompositions of Corollary 3.2.

To avoid unimportant technical details, in this section assume that $I=\mathbb{R}$, $g$ is a nondecreasing differentiable function on $\mathbb{R}$ and $\phi$ is a convex twice-differentiable function on $\mathbb{R}$. Suppose that $S$ is any one of the scoring functions $\mathrm{QSF}(g,\alpha)$, $\mathrm{ESF}(\phi,\alpha)$ or $\mathrm{HSF}(\phi,\nu)$. Thus $S$ is consistent for some functional $\mathrm{T}$ (either a specific quantile, expectile or Huber mean). Then $S$ admits a mixture representation

$$S(x,y)=\int_{\mathbb{R}}S_{\theta}^{\mathrm{T}}(x,y)\,\mathrm{d}M(\theta),\qquad(4.1)$$

where each $S_{\theta}^{\mathrm{T}}$ is an elementary scoring function for $\mathrm{T}$ and the mixing measure $\mathrm{d}M(\theta)$ is nonnegative (Ehm et al., 2016; Taggart, 2021). That is, $S$ is a weighted average of elementary scoring functions. Explicit formulae for these elementary scoring functions and mixing measures are given in Tables 2 and 3.

The elementary scoring functions $S_{\theta}^{\mathrm{T}}$ and their corresponding functionals $\mathrm{T}$ arise naturally in the context of optimal decision rules. In a certain class of such rules, a predefined action is taken if and only if the point forecast $x$ for some unknown quantity $Y$ exceeds a certain decision threshold $\theta$. The usefulness of the forecast for the problem at hand can be assessed via a loss function $S_{\theta}$, where $S_{\theta}(x,y)$ encodes the economic regret, relative to a perfect forecast, of applying the decision rule with forecast $x$ when the observation $y$ is realised. Typically $S_{\theta}(x,y)>0$ whenever $\theta$ lies between $x$ and $y$ (e.g., the forecast exceeded the decision threshold but the realisation did not, resulting in regret), and $S_{\theta}(x,y)=0$ otherwise (i.e., a perfect forecast would have resulted in the same decision, resulting in no regret).

Table 2: Elementary scoring functions $S_{\theta}^{\mathrm{T}}(x,y)$ for different functionals $\mathrm{T}$.
Functional $\mathrm{T}$ — Elementary score $S_{\theta}^{\mathrm{T}}(x,y)$
$\alpha$-quantile — $\begin{cases}1-\alpha,&y\leq\theta<x\\ \alpha,&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$
$\alpha$-expectile — $\begin{cases}(1-\alpha)|y-\theta|,&y\leq\theta<x\\ \alpha|y-\theta|,&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$
$\nu$-Huber mean — $\begin{cases}\tfrac{1}{2}\min(\theta-y,\nu),&y\leq\theta<x\\ \tfrac{1}{2}\min(y-\theta,\nu),&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$

For the decision rule to be optimal over many forecast cases, the point forecast $x$ should be one that minimises the expected score $\mathbb{E}S_{\theta}(x,Y)$, where $Y\sim F$ for the predictive distribution $F$. Binary betting decisions (Ehm et al., 2016) and simple cost–loss decision models (where the user must weigh up the cost of taking protective action as insurance against an event which may or may not occur (Richardson, 2000)) give rise to some $\alpha$-quantile of $F$ being the optimal point forecast. The $\alpha$-expectile functional gives the optimal point forecast for investment problems with cost basis $\theta$, revenue $y$ and simple taxation structures (Ehm et al., 2016). The Huber mean generates optimal point forecasts when profits and losses are capped in such investment problems (Taggart, 2021). Though the language here is financial, such investment decisions can be made using weather forecasts (Taggart, 2021, Example 5.4). In each case, the particular score $S_{\theta}(x,y)$ is, up to a multiplicative constant, the elementary score $S_{\theta}^{\mathrm{T}}(x,y)$ in Equation (4.1) for the appropriate functional $\mathrm{T}$.

Thus the representation (4.1) expresses $S(x,y)$ as a weighted average of elementary scores, each of which encodes the economic regret associated with a decision made using the forecast $x$ with decision threshold $\theta$. The weight on each $\theta$ is determined by the mixing measure $\mathrm{d}M(\theta)$, as detailed in Table 3. Relative to $\mathrm{QSF}(g,\alpha)$, the scoring function $\mathrm{QSF}(g_{j},\alpha)$ weighs the economic regret for decision threshold $\theta$ by a factor of $\chi_{j}(\theta)$, thus privileging certain decision thresholds over others.

Table 3: Weights in the mixing measure $\mathrm{d}M(\theta)$ for different scoring functions $S$.
Scoring function $S$ — $\mathrm{d}M(\theta)$
$\mathrm{QSF}(g,\alpha)$ — $g^{\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{QSF}(g_{j},\alpha)$ — $\chi_{j}(\theta)g^{\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{ESF}(\phi,\alpha)$, $\mathrm{HSF}(\phi,\nu)$ — $\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{ESF}(\phi_{j},\alpha)$, $\mathrm{HSF}(\phi_{j},\nu)$ — $\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta$

We illustrate these ideas using the point forecasts generated by Systems A and B from the Synthetic data example (Section 3.5). These point forecasts target the mean functional, and so induce optimal decision rules for investment problems with cost basis $\theta$. The decision rule is to invest if and only if the forecast future value $x$ of the investment exceeds the fixed up-front cost $\theta$ of the investment. As previously noted, neither forecast system performs significantly better than the other when scored by squared error. However, for certain decision thresholds $\theta$, the mean elementary score $\bar{S}_{\theta}$ (which measures the average economic regret) of one system is clearly superior to the other. This is illustrated by the Murphy diagram of Figure 6, which is a plot of $\bar{S}_{\theta}$ against $\theta$ (Ehm et al., 2016). Since a lower score is better, users with a decision threshold exceeding about 8 should use forecasts from B, while those with a decision threshold less than 8 should use forecasts from A.
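The data behind a Murphy diagram are just mean elementary expectile scores with $\alpha=1/2$ (Table 2), computed on a grid of decision thresholds. A sketch follows (Python/NumPy, regenerating the synthetic forecasts of Section 3.5 so the block is self-contained).

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    y = rng.normal(4.0, 15.0, size=n)
    x_a = y + rng.normal(0.0, np.arctan(y - 10.0) + 2.0)
    x_b = y + rng.normal(0.0, 2.0, size=n)

    def mean_elementary_score(x, y, theta):
        """Mean elementary alpha-expectile score (alpha = 1/2, Table 2) at
        decision threshold theta, averaged over all forecast cases."""
        between = ((y <= theta) & (theta < x)) | ((x <= theta) & (theta < y))
        return np.mean(0.5 * np.abs(y - theta) * between)

    thetas = np.linspace(-20.0, 40.0, 121)
    murphy_a = np.array([mean_elementary_score(x_a, y, t) for t in thetas])
    murphy_b = np.array([mean_elementary_score(x_b, y, t) for t in thetas])
    # Plotting murphy_a and murphy_b against thetas reproduces the curves of
    # Figure 6; their crossing point sits near theta = 8.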

Figure 6: Murphy diagrams for the two forecast systems A (lighter) and B (darker) of the Synthetic data example. The graphs on each panel are the same but the shaded regions differ. Each shaded area is proportional to the mean score of the scoring function $\mathrm{ESF}_{j}(t\mapsto 2t^{2},1/2)$, which is the summand in the mean squared error decomposition corresponding to the weight function $\chi$. Here $\chi$ is either the rectangular weight function $\chi_{R}$ (top) or the trapezoidal weight function $\chi_{T}$ (bottom).

Let $S=S_{1}+S_{2}$ denote the same decomposition of the squared error scoring function used for the Synthetic data example (Section 3.5), induced from a rectangular partition of unity. To aid simplicity of interpretation, assume for the user group under consideration that there is one constant $k$ such that each elementary score equals $k$ multiplied by the economic regret. We observe the following.

For $S$, $\phi(t)=2t^{2}$ and so $\mathrm{d}M(\theta)=4\,\mathrm{d}\theta$ by Table 3. So by Equation (4.1), the mean squared error $\bar{S}$ for each forecast system is proportional to the total area under the respective Murphy graph. Hence the mean squared error $\bar{S}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds.

For $S_{2}$, $\mathrm{d}M(\theta)=4\chi_{2}(\theta)\,\mathrm{d}\theta$, where $\chi_{2}$ is the weight function $\chi_{R}$ on the top panel of Figure 6. From Equation (4.1), the mean score $\bar{S}_{2}$ for each forecast system is proportional to the area under the respective Murphy curve for decision thresholds $\theta$ in $[10,\infty)$. This is illustrated on the top panel of Figure 6 with shaded regions. System B has the smaller area and mean score and is hence superior. So $\bar{S}_{2}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds $\theta$ in $[10,\infty)$. Similarly, $\bar{S}_{1}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds $\theta$ less than 10.

The bottom panel of Figure 6 illustrates the effect of using a trapezoidal weight function $\chi_{T}$. The corresponding scoring function applies zero weight to economic regret experienced by users with decision thresholds less than 0, and increasing weight to thresholds greater than zero until a full weight of 1 is applied for decision thresholds greater than or equal to 20. The areas of the shaded regions are proportional to the mean score for Systems A and B.

5 Conclusion

We have shown that the scoring functions that are consistent for quantiles, expectations, expectiles or Huber means can be decomposed as a sum of consistent scoring functions, each of which emphasises predictive performance in a region of the variable’s range according to the selected partition of unity. In particular, this allows the comparison of predictive performance for the extreme region of a variable’s range in a way that cannot be hedged and is not susceptible to misguided inferences.

The decomposition is consonant with the analogous decomposition of the CRPS using the threshold-weighting method, and hence by extension with known decompositions for the absolute error scoring function.

Such decompositions are a consequence of the mixture representation for each of the aforementioned classes of consistent scoring functions and their associated functionals. This has two implications. First, each score in the decomposition can be interpreted as the weighted average of economic regret associated with optimal decision rules, where the average is taken over all user decision thresholds and the weight applied is given by the corresponding weight function in the partition of unity. Second, such decompositions will also exist for the consistent scoring functions of other functionals that have analogous mixture representations.

Acknowledgements

The author thanks Jonas Brehmer, Deryn Griffiths, Michael Foley and Tom Pagano for their constructive comments, which helped improve the quality of this paper.

Appendix A Proof of Corollary 3.2

As mentioned in Section 3.4, Corollary 3.2 essentially follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2021) by decomposing the mixing measure. However, we provide a direct proof to avoid some technical hypotheses of those theorems.

Proof.

We give the proof for the case $\mathrm{ESF}$. The proofs for the cases $\mathrm{QSF}$ and $\mathrm{HSF}$ are similar.

To begin, note that if $\phi(s)=\psi(s)+cs+d$ whenever $s\in I$, for some real constants $c$ and $d$, then $\mathrm{ESF}(\phi,\alpha)=\mathrm{ESF}(\psi,\alpha)$. So without loss of generality we may assume that $u_{j}=u_{0}$ and $v_{j}=v_{0}$ in the definition of $\phi_{j}$ whenever $1\leq j\leq n$, for some constants $u_{0}$ and $v_{0}$ in $I$.

Suppose that $\phi$ is a convex twice-differentiable function and $\alpha\in(0,1)$. Given the partition of unity $\{\chi_{j}\}_{j=1}^{n}$, define $\phi_{j}$ by Equation (3.10). Then

$$\begin{aligned}
|\mathbbm{1}\{y<x\}-\alpha|^{-1}\sum_{j=1}^{n}\mathrm{ESF}(\phi_{j},\alpha)(x,y)
&=\sum_{j=1}^{n}\big(\phi_{j}(y)-\phi_{j}(x)-\phi_{j}^{\prime}(x)(y-x)\big)\\
&=\sum_{j=1}^{n}\left(\int^{y}_{x}\int^{v}_{v_{0}}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v-(y-x)\int^{x}_{v_{0}}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\right)\\
&=\int^{y}_{x}\int^{v}_{v_{0}}\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v-(y-x)\int^{x}_{v_{0}}\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\\
&=\int^{y}_{x}\big(\phi^{\prime}(v)-\phi^{\prime}(v_{0})\big)\,\mathrm{d}v-(y-x)\big(\phi^{\prime}(x)-\phi^{\prime}(v_{0})\big)\\
&=(\phi(y)-\phi^{\prime}(v_{0})y)-(\phi(x)-\phi^{\prime}(v_{0})x)-(y-x)(\phi^{\prime}(x)-\phi^{\prime}(v_{0}))\\
&=|\mathbbm{1}\{y<x\}-\alpha|^{-1}\mathrm{ESF}(\phi,\alpha)(x,y),
\end{aligned}$$

which establishes Equation (3.9). The convexity of $\phi_{j}$ follows from the fact that $\phi_{j}^{\prime\prime}(u)=\chi_{j}(u)\phi^{\prime\prime}(u)\geq 0$, since $\phi$ is convex. Finally, suppose that $x,y\in I_{0}\subset I$ and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$. Now

$$\begin{aligned}
|\mathbbm{1}\{y<x\}-\alpha|^{-1}\mathrm{ESF}(\phi_{j},\alpha)(x,y)
&=\int_{x}^{y}(\phi_{j}^{\prime}(w)-\phi_{j}^{\prime}(x))\,\mathrm{d}w\\
&=\int_{x}^{y}\int_{x}^{w}\phi_{j}^{\prime\prime}(z)\,\mathrm{d}z\,\mathrm{d}w\\
&=\int_{x}^{y}\int_{x}^{w}\chi_{j}(z)\phi^{\prime\prime}(z)\,\mathrm{d}z\,\mathrm{d}w,
\end{aligned}$$

and since $w$ lies between $x$ and $y$, it follows that $w$ also lies in $I_{0}$, and hence so does the inner variable of integration $z$. But $\chi_{j}=0$ on $I_{0}$ and hence the inner integral vanishes. This shows that $\mathrm{ESF}(\phi_{j},\alpha)=0$ on $I_{0}\times I_{0}$. ∎

Appendix B Formulae for commonly used scoring functions

Table 4 presents closed-form expressions for $g_{j}$ of Equation (3.8) in the case when $g(t)=t$ and for $\phi_{j}$ of Equation (3.10) in the case when $\phi(t)=2t^{2}$, each induced by rectangular weight functions of the form (3.1) or trapezoidal weight functions of the form (3.2). These expressions facilitate the computation of decompositions for the absolute error, standard quantile, squared error and Huber loss scoring functions. Note also that for each weight function we have $\phi_{j}^{\prime}(t)=4g_{j}(t)$.

Table 4: Closed-form expressions for $g_{j}$ and $\phi_{j}$ for different weight functions when $g(t)=t$ and $\phi(t)=2t^{2}$.

Rectangular weight function, given by Equation (3.1):

$$g_{j}(t)=\begin{cases}0,&t<a\\ t-a,&a\leq t<b\\ b-a,&t\geq b\end{cases}$$

$$\phi_{j}(t)=\begin{cases}0,&t<a\\ 2(t-a)^{2},&a\leq t<b\\ 4(b-a)t+2(a^{2}-b^{2}),&t\geq b\end{cases}$$

Trapezoidal weight function, given by Equation (3.2):

$$g_{j}(t)=\begin{cases}0,&t<a\\ (t-a)^{2}/(2(b-a)),&a\leq t<b\\ t-(b+a)/2,&b\leq t<c\\ -(d-t)^{2}/(2(d-c))+(d+c-a-b)/2,&c\leq t<d\\ (d+c-a-b)/2,&t\geq d\end{cases}$$

$$\phi_{j}(t)=\begin{cases}0,&t<a\\ \frac{2(t-a)^{3}}{3(b-a)},&a\leq t<b\\ 2t^{2}-2(a+b)t+\frac{2}{3}(b-a)^{2}+2ab,&b\leq t<c\\ \frac{2(d-t)^{3}}{3(d-c)}+2(d+c-a-b)t+\frac{2}{3}\big((b-a)^{2}+3ab-(d-c)^{2}-3cd\big),&c\leq t<d\\ 2(d+c-a-b)t+\frac{2}{3}\big((b-a)^{2}+3ab-(d-c)^{2}-3cd\big),&t\geq d\end{cases}$$
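For completeness, here is the rectangular row of Table 4 in code (a Python/NumPy sketch with the same constants $a<b$); the identity $\phi_{j}^{\prime}=4g_{j}$ gives a quick numerical check.

    import numpy as np

    def g_rect(t, a, b):
        """g_j from Table 4 for g(t) = t and a rectangular weight on [a, b)."""
        t = np.asarray(t, dtype=float)
        return np.clip(t, a, b) - a

    def phi_rect(t, a, b):
        """phi_j from Table 4 for phi(t) = 2t^2 and a rectangular weight."""
        t = np.asarray(t, dtype=float)
        return np.where(t < a, 0.0,
                        np.where(t < b, 2.0 * (t - a) ** 2,
                                 4.0 * (b - a) * t + 2.0 * (a ** 2 - b ** 2)))

    # Check phi_j' = 4 g_j by central differences away from the breakpoints.
    t = np.array([-1.0, 0.5, 2.0, 5.0])
    h = 1e-6
    num_deriv = (phi_rect(t + h, 0.0, 3.0) - phi_rect(t - h, 0.0, 3.0)) / (2 * h)
    assert np.allclose(num_deriv, 4.0 * g_rect(t, 0.0, 3.0), atol=1e-4)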

References

  • Bellini, F. and Di Bernardino, E. (2017). Risk management with expectiles. The European Journal of Finance, 23(6):487–506.
  • Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings. J. R. Statist. Soc. B, 78:505–562.
  • Gneiting, T. (2011a). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
  • Gneiting, T. (2011b). Quantiles as optimal point forecasts. International Journal of Forecasting, 27(2):197–207.
  • Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1:125–151.
  • Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3):411–422.
  • Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.
  • Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster’s dilemma: extreme events and forecast evaluation. Statistical Science, 32(1):106–127.
  • Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55:819–847.
  • Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, 38(4):796–809.
  • Richardson, D. S. (2000). Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126(563):649–667.
  • Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801.
  • Sharpe, M., Bysouth, C., and Gill, P. (2020). New operational measure to assess extreme events using site-specific climatology. Presented at the 2020 International Verification Methods Workshop Online. Retrieved from https://jwgfvr.univie.ac.at/presentations-and-notes/ on 5 January 2021.
  • Taggart, R. (2021). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, to appear.
  • Thomson, W. (1979). Eliciting production possibilities from a well-informed manager. Journal of Economic Theory, 20:360–380.