
Evaluation of point forecasts for extreme events using consistent scoring functions

Robert J. Taggart
Bureau of Meteorology
robert.taggart@bom.gov.au
Abstract

We present a method for comparing point forecasts in a region of interest, such as the tails or centre of a variable’s range. This method cannot be hedged, in contrast to conditionally selecting events to evaluate and then using a scoring function that would have been consistent (or proper) prior to event selection. Our method also gives decompositions of scoring functions that are consistent for the mean or a particular quantile or expectile. Each member of each decomposition is itself a consistent scoring function that emphasises performance over a selected region of the variable’s range. The score of each member of the decomposition has a natural interpretation rooted in optimal decision theory. It is the weighted average of economic regret over user decision thresholds, where the weight emphasises those decision thresholds in the corresponding region of interest.

Keywords: Consistent scoring function; Decision theory; Forecast ranking; Forecast verification; Point forecast; Proper scoring rule; Rare and extreme events.

1 Introduction

Extreme events occur in many systems, from atmospheric to economic, and present significant challenges to society. Hence the accurate prediction of extreme events is of vital importance. In many such situations, competing forecasts are produced by a variety of forecast systems and it is natural to want to compare the performance of such forecasts with emphasis on the extremes.

In this context, it is critical that the methodology for requesting, evaluating and ranking competing forecasting systems is decision-theoretically coherent. Since the future is not precisely known, ideally forecasts ought to be probabilistic in nature, taking the form of a predictive probability distribution and assessed using a proper scoring rule (Gneiting and Katzfuss, 2014). Nonetheless, in many contexts and for a variety of reasons, point forecasts (i.e. single-valued forecasts taking values from the prediction space) are issued and used. For decision-theoretically coherent evaluation of point forecasts, either the scoring function (such as the squared error or absolute error scoring function) that will be used to assess predictive performance should be advertised in advance, or a specific functional (such as the mean or median) of the forecaster’s predictive distribution should be requested and evaluation then conducted using a scoring function that is consistent for that functional (Gneiting, 2011a). The use of proper scoring rules and consistent scoring functions encourages forecasters to quote honest, carefully considered forecasts.

To compare competing forecasts for the extremes, one may be tempted to use what would otherwise be a proper scoring rule or consistent scoring function, but restrict evaluation to a subset of events for which extremes were either observed, or forecast or perhaps both. However, such methodologies promote hedging strategies (as illustrated in Section 2) and can result in misguided inferences. This gives rise to the forecaster’s dilemma, whereby “if forecast evaluation proceeds conditionally on a catastrophic event having been observed [then] always predicting a calamity becomes a worthwhile strategy” (Lerch et al., 2017).

Nonetheless, for predictive distributions Gneiting and Ranjan (2011) showed that a suitable alternative exists in the threshold-weighted continuous ranked probability score, which is a weighted version of the commonly used continuous ranked probability score (CRPS). The weight is selected to emphasise the region of interest (such as the tails or some other region of a variable’s range) and induces a proper scoring rule. This technique extends in a natural way to point forecasts targeting the median functional, since the CRPS is a generalisation of the absolute error scoring function. The UK Met Office has recently applied this method to report the performance of temperature forecasts, with emphasis on climatological extremes (Sharpe et al., 2020).

Despite this progress, Lerch et al. (2017) offer this summary of the general situation for evaluating point forecasts at the extremes: “there is no obvious way to abate the forecaster’s dilemma by adapting existing forecast evaluation methods appropriately, such that particular emphasis can be put on extreme outcomes.” In this paper we remedy this situation. We construct consistent scoring functions that can be used to evaluate point forecasts with emphasis on performance in the region of interest, for important classes of functionals, including expectations and quantiles. Moreover, the relevant specific case of these constructions agrees with the extension of the threshold-weighted CRPS to the median functional.

The main idea of this paper can be illustrated using the squared error scoring function $S(x,y)=(x-y)^{2}$, which is consistent for point forecasts of the expectation functional. Suppose that the outcome space $\mathbb{R}$ is partitioned as $\mathbb{R}=I_{1}\cup I_{2}$, where $I_{1}=(-\infty,a)$ and $I_{2}=[a,\infty)$ for some $a$ in $\mathbb{R}$. Corollary 3.2 gives the decomposition $S=S_{1}+S_{2}$, where each scoring function $S_{i}$ is consistent for expectations whilst also emphasising predictive performance on the interval $I_{i}$. In particular, if $x,y\in I_{1}$ then $S_{2}(x,y)=0$. Given a point forecast $x$ and corresponding observation $y$, the explicit formula for $S_{2}$ is

$$S_{2}(x,y)=(y-a)^{2}\mathbbm{1}\{y\geq a\}-(x-a)^{2}\mathbbm{1}\{x\geq a\}-2(y-x)(x-a)\mathbbm{1}\{x\geq a\}.\qquad(1.1)$$

Here $\mathbbm{1}$ denotes the indicator function, so that $\mathbbm{1}\{x\geq a\}$ equals 1 if $x\geq a$ and 0 otherwise.
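For readers who want to experiment, Equation (1.1) transcribes directly into code. The following sketch (Python with NumPy; the function names are our own, not from the paper) implements $S_{2}$ and recovers $S_{1}$ as $S-S_{2}$.

    import numpy as np

    def s2_squared_error(x, y, a):
        """Component S2 of Equation (1.1): squared error emphasising [a, inf)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return ((y - a) ** 2 * (y >= a)
                - (x - a) ** 2 * (x >= a)
                - 2 * (y - x) * (x - a) * (x >= a))

    def s1_squared_error(x, y, a):
        """Complementary component S1 = S - S2, emphasising (-inf, a)."""
        return ((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2
                - s2_squared_error(x, y, a))

    # Sanity checks: S1 + S2 recovers squared error, and S2 vanishes when
    # both the forecast and the observation lie below a.
    assert np.isclose(s1_squared_error(12.0, 8.0, 10.0)
                      + s2_squared_error(12.0, 8.0, 10.0), 16.0)
    assert s2_squared_error(3.0, 7.0, 10.0) == 0.0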

The performance of competing point forecasts for expectations can then be compared by computing the mean scores $\bar{S}$, $\bar{S}_{1}$ and $\bar{S}_{2}$ for each forecast system, over the same set of events. To illustrate, consider two forecast systems A and B, whose error characteristics are depicted by the scatter plot of Figure 1. System B is homoscedastic (i.e., has even scatter about the diagonal line throughout the variable’s range) whereas System A is heteroscedastic (with relatively small errors over lower parts of the variable’s range and relatively large errors over higher parts). For this set of events, the mean squared error $\bar{S}$ for each system is very similar and there is no statistical significance in their difference. However, with $a=10$, the mean score $\bar{S}_{2}$ of System A is significantly higher than that of B (i.e., B is superior when emphasis is placed on the region $[10,\infty)$) whilst the mean score $\bar{S}_{1}$ of System A is significantly lower than that of B (i.e., A is superior when emphasis is placed on the region $(-\infty,10)$). Full details for this case study are given in Section 3.5.

Figure 1: Scatter plot of forecasts against observations for a random sample of events from the Synthetic data example of Section 3.5.

This example is illustrative. Decompositions of the outcome space can consist of more than two intervals, and the boundary between intervals can also be ‘blurred’ by selecting suitable weight functions. Each decomposition of the outcome space then induces a decomposition

$$S=\sum_{i=1}^{n}S_{i}\qquad(1.2)$$

of a given consistent scoring function $S$, whose summands $S_{i}$ are also consistent for the functional of concern, with each $S_{i}$ emphasising forecast performance in the relevant region of the outcome space. Such decompositions are presented for the consistent scoring functions of quantiles, expectations, expectiles and Huber means. The approach is unified, in that the same decomposition of the outcome space induces decompositions of the CRPS and of a variety of scoring functions for point forecasts. Details are given in Section 3 with illustrative examples.

Mathematically, the main result of this paper is a corollary of the mixture representation theorems for the consistent scoring functions of quantiles, expectiles (Ehm et al., 2016) and Huber functionals (Taggart, 2021). This furnishes each member $S_{i}$ of the decomposition (1.2) with an interpretation rooted in optimal decision theory. Certain classes of optimal decision rules elicit a user action if and only if a point forecast $x$ exceeds a particular decision threshold $\theta$. The score $S_{i}(x,y)$ is a measure of economic regret, relative to decisions based on a perfect forecast, of using the point forecast $x$ when the observation $y$ realises, averaged over all decision thresholds $\theta$ belonging to the corresponding interval $I_{i}$ of the partition. Precise details are given in Section 4 and illustrated with the aid of Murphy diagrams.

2 Hedging strategies for naïve assessments of forecasts of extreme events

We illustrate how a seemingly natural approach for comparing predictive performance at the extremes creates opportunities for hedging strategies.

A meteorological agency maintains two forecast systems A and B, each of which produces predictive distributions for hourly rainfall $Y$ at a particular location. System A is fully automated and hence cheaper to support than B. System B has knowledge of the System A forecast prior to issuing its own forecast. The agency requests the mean value of their predictive distributions with a lead time of 1 day. These point forecasts are assessed using the squared error scoring function, which is consistent for forecasts of the mean (i.e., the forecast strategy that minimises one’s expected score is to issue the mean value of one’s predictive distribution). For a two year period, the bulk of observed and forecast values fall in the interval $[0\,\mathrm{mm},10\,\mathrm{mm}]$ and there is no statistically significant difference between the performance of the two systems when scored using the squared error function.

However, the maintainers of System B claim that B performs better for the extremes and that bulk verification statistics fail to highlight this. The agency decides to use forecasts from A for the majority of cases, but will test whether B is significantly better than A at forecasting the heavy rainfall events. The agency considers four options for selecting hourly events to assess, after which the squared error scoring function will be used to compare predictive performance on those events. If System B does not perform significantly better than A over the next 12 months then it will be decommissioned.

For a given forecast case, let $F_{\mathrm{A}}$ and $F_{\mathrm{B}}$ denote the predictive distributions produced by each system, and $x_{\mathrm{A}}$ and $x_{\mathrm{B}}$ the respective point forecasts issued. Suppose that the random variable $Y$ has a distribution specified by $F_{\mathrm{B}}$. For each option, the maintainers of System B have a strategy to optimise their expected score; that is, to choose $x_{\mathrm{B}}$ such that $\mathbb{E}[(x_{\mathrm{B}}-Y)^{2}]$ is minimised.

  1. Option 1:

    Only assess events where the observation is at least 20 mm.
    Strategy: If $\mathbb{P}(Y\geq 20)>0$ then $F_{\mathrm{B}}|\{Y\geq 20\}$, which denotes the predictive distribution of B conditioned on the event $\{Y\geq 20\}$, exists. In this case, forecast $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}}|\{Y\geq 20\})$ and otherwise forecast $x_{\mathrm{B}}=20$.

  2. Option 2:

    Only assess events where either $x_{\mathrm{A}}$ or $x_{\mathrm{B}}$ is at least 20 mm.
    Strategy: If $\max(x_{\mathrm{A}},\mathrm{Mean}(F_{\mathrm{B}}))\geq 20$ then $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. Otherwise, if $\mathbb{E}[(20-Y)^{2}]<\mathbb{E}[(x_{\mathrm{A}}-Y)^{2}]$ then $x_{\mathrm{B}}=20$, else $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. This ensures that the event is assessed whenever System B expects that a forecast of 20 will receive a more favourable score than a forecast of $x_{\mathrm{A}}$.

  3. Option 3:

    Only assess events where either $x_{\mathrm{A}}$, $x_{\mathrm{B}}$ or the observation is at least 20 mm.
    Strategy: If $\max(x_{\mathrm{A}},\mathrm{Mean}(F_{\mathrm{B}}))\geq 20$ then $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$. Else if $\mathbb{E}[(20-Y)^{2}]<\mathbb{E}[(x_{\mathrm{A}}-Y)^{2}]$ then $x_{\mathrm{B}}=20$. Otherwise, the only other way the event will be assessed is if the observation is at least 20 mm, so forecast $x_{\mathrm{B}}=19.9$.

  4. Option 4:

    Only assess events where $x_{\mathrm{A}}\geq 20$.
    Strategy: In this case there is nothing that System B can do to influence which events will be assessed. Therefore $x_{\mathrm{B}}=\mathrm{Mean}(F_{\mathrm{B}})$.

Of these, only Option 4 does not expose the agency to a hedging strategy from System B. However, under Option 4 the developers of A may employ the strategy of forecasting $x_{\mathrm{A}}=\min(19.9,\mathrm{Mean}(F_{\mathrm{A}}))$, so that no events are ever selected and no further comparison of systems is made.
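The incentive created by Option 1 is easy to verify numerically. Below is a minimal simulation (our own construction, not from the paper), assuming for illustration that $F_{\mathrm{B}}$ is an exponential distribution with mean 4 mm: when only events with observations of at least 20 mm are scored, the hedged forecast $\mathrm{Mean}(F_{\mathrm{B}}|\{Y\geq 20\})$ decisively beats the honest forecast $\mathrm{Mean}(F_{\mathrm{B}})$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed predictive distribution F_B: exponential with mean 4 mm.
    mean_b = 4.0
    y = rng.exponential(mean_b, size=1_000_000)

    honest = mean_b              # Mean(F_B)
    hedged = 20.0 + mean_b       # Mean(F_B | Y >= 20), by memorylessness

    sel = y >= 20.0              # Option 1: only these events are assessed
    print("honest MSE:", np.mean((honest - y[sel]) ** 2))  # approx 416
    print("hedged MSE:", np.mean((hedged - y[sel]) ** 2))  # approx 16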

There is an Option 5: use a scoring function that is consistent for the mean functional and that emphasises performance at the extremes. We turn attention to this now.

3 Decompositions of scoring functions

We work in a setting where point forecasts $x$ and observations $y$ belong to some interval $I$ of the real line $\mathbb{R}$ (possibly with $I=\mathbb{R}$). In Section 3.1 we introduce partitions of unity, which are used to ‘subdivide’ the outcome space $I$ into subregions of interest. We then illustrate in Section 3.2 how such subdivisions induce decompositions of the CRPS, where each member of the decomposition emphasises performance on the corresponding subregion of $I$ whilst retaining propriety. To obtain analogous decompositions for consistent scoring functions, we recall in Section 3.3 that the consistent scoring functions for quantiles and expectations (among others) have general mathematical forms. The aim is to find which specific instances of the general form emphasise performance on the subregions specified by the partition of unity. This is answered in Section 3.4. Section 3.5 contains examples of such decompositions, and opens with a worked example showing how to find the formula for the squared error decomposition of Equation (1.1).

3.1 Partitions of unity

Recall that the support of a function $\chi:I\to\mathbb{R}$, denoted $\mathrm{supp}(\chi)$, is defined by

$$\mathrm{supp}(\chi)=\{t\in I:\chi(t)\neq 0\}.$$

In this paper, we say that $\{\chi_{j}\}_{j=1}^{n}$ is a partition of unity on $I$ if it is a finite set of measurable functions $\chi_{j}:I\to\mathbb{R}$ such that $0\leq\chi_{j}(t)\leq 1$ and

$$\sum_{j=1}^{n}\chi_{j}(t)=1$$

whenever $t\in I$. (For readers unfamiliar with the concept of measurability, any piecewise continuous function is measurable.) We will call each $\chi_{j}$ a weight function. We note that these differ from typical definitions in that we do not require $\chi_{j}$ to be continuous or have bounded support.

Figure 2: Two partitions $\{\chi_{j}\}_{j=1}^{n}$ of unity on the interval $[0,6)$, one using rectangular weight functions and the other trapezoidal weight functions.

Figure 2 illustrates two different partitions of unity for the interval $I=[0,6)$. The rectangular partition of unity consists of rectangular weight functions, each of the form

$$\chi_{j}(t)=\begin{cases}1,&a\leq t<b\\ 0,&\text{otherwise}\end{cases}\qquad(3.1)$$

for suitable constants satisfying $a<b$, both depending on $j$. The trapezoidal partition of unity consists of trapezoidal weight functions, each typically having the form

$$\chi_{j}(t)=\begin{cases}(t-a)/(b-a),&a\leq t<b\\ 1,&b\leq t<c\\ (d-t)/(d-c),&c\leq t<d\\ 0,&\text{otherwise}\end{cases}\qquad(3.2)$$

for suitable constants satisfying $a<b<c<d$, all depending on $j$, with appropriate modification for the end cases.
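As a concrete illustration, the weight functions of Equations (3.1) and (3.2) take only a few lines of code. This Python/NumPy sketch (with names of our own choosing) also checks that three trapezoidal weights resembling those of Figure 2 sum to one.

    import numpy as np

    def rect_weight(t, a, b):
        """Rectangular weight of Equation (3.1): 1 on [a, b), 0 elsewhere."""
        t = np.asarray(t, dtype=float)
        return ((a <= t) & (t < b)).astype(float)

    def trap_weight(t, a, b, c, d):
        """Trapezoidal weight of Equation (3.2): ramps up on [a, b), equals
        1 on [b, c), ramps down on [c, d), and is 0 elsewhere."""
        t = np.asarray(t, dtype=float)
        return np.select(
            [(a <= t) & (t < b), (b <= t) & (t < c), (c <= t) & (t < d)],
            [(t - a) / (b - a), np.ones_like(t), (d - t) / (d - c)],
            default=0.0,
        )

    # A partition of unity on [0, 6) built from three trapezoidal weights.
    t = np.linspace(0.0, 5.999, 1000)
    chi = (trap_weight(t, -1, 0, 1, 3) + trap_weight(t, 1, 3, 4, 6)
           + trap_weight(t, 4, 6, 7, 8))
    assert np.allclose(chi, 1.0)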

More generally, if $\{\psi_{j}\}_{j=1}^{n}$ is a set of nonnegative measurable functions with the property that

$$\sum_{j=1}^{n}\psi_{j}(t)>0$$

whenever $t\in I$, then a partition of unity $\{\chi_{j}\}_{j=1}^{n}$ can be constructed by defining

$$\chi_{j}(t)=\psi_{j}(t)\Big(\sum_{k=1}^{n}\psi_{k}(t)\Big)^{-1}.$$

3.2 Decomposition of the CRPS

Each partition $\{\chi_{j}\}_{j=1}^{n}$ of unity induces a corresponding decomposition of the CRPS. Given a predictive distribution $F$, expressed as a cumulative distribution function, and a corresponding observation $y$, the CRPS is defined by

$$\mathrm{CRPS}(F,y)=\int_{I}(F(z)-\mathbbm{1}\{y\leq z\})^{2}\,\mathrm{d}z$$

and for each $\chi_{j}$ the threshold-weighted CRPS by

$$\mathrm{CRPS}_{j}(F,y)=\int_{I}(F(z)-\mathbbm{1}\{y\leq z\})^{2}\,\chi_{j}(z)\,\mathrm{d}z.$$

Both are proper scoring rules (Gneiting and Ranjan, 2011). Thus the CRPS has a decomposition

$$\mathrm{CRPS}=\sum_{j=1}^{n}\mathrm{CRPS}_{j},$$

where each component $\mathrm{CRPS}_{j}$ is proper and emphasises performance in the region determined by the weight $\chi_{j}$. The Sydney rainfall forecasts example of Section 3.5 illustrates the application of such a decomposition. We now establish analogous decompositions for a wide range of scoring functions.
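A numerical sketch of this decomposition follows (Python with NumPy and SciPy; the Gaussian predictive distribution, the rectangular weights and the finite grid that truncates the integral are all our own assumptions for illustration).

    import numpy as np
    from scipy.stats import norm

    def tw_crps(cdf, y, weight, grid):
        """Threshold-weighted CRPS, approximated by the trapezoidal rule on
        a finite grid that must cover the support of the integrand."""
        integrand = (cdf(grid) - (y <= grid)) ** 2 * weight(grid)
        return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))

    # N(4, 2^2) predictive distribution, observation y = 12, and the
    # rectangular weights chi_1 = 1{z < 10}, chi_2 = 1{z >= 10}.
    grid = np.linspace(-40.0, 60.0, 20001)
    F = norm(loc=4.0, scale=2.0).cdf
    crps1 = tw_crps(F, 12.0, lambda z: (z < 10.0).astype(float), grid)
    crps2 = tw_crps(F, 12.0, lambda z: (z >= 10.0).astype(float), grid)
    print(crps1 + crps2)   # equals the unweighted CRPS of F against y = 12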

3.3 Consistent scoring functions

For decision-theoretically coherent point forecasting, forecasters need a directive in the form of a statistical functional (Gneiting, 2011a) or a scoring function which should be minimised (Patton, 2020). A statistical functional $\mathrm{T}$ is a (potentially set-valued) mapping from a class of probability distributions $\mathcal{F}$ to the real line $\mathbb{R}$. Examples include the mean (or expectation) functional, quantiles and expectiles (Newey and Powell, 1987), the latter recently attracting interest in risk management (Bellini and Di Bernardino, 2017). A consistent scoring function is a special case of a proper scoring rule in the context of point forecasts, and rewards forecasters who make careful honest forecasts.

Definition 3.1.

(Gneiting, 2011a) A scoring function $S:I\times I\to[0,\infty)$ is consistent for the functional $\mathrm{T}$ relative to a class $\mathcal{F}$ of probability distributions if

$$\mathbb{E}S(t,Y)\leq\mathbb{E}S(x,Y)\qquad\text{whenever }Y\sim F,\qquad(3.3)$$

for all probability distributions $F\in\mathcal{F}$, all $t\in\mathrm{T}(F)$ and all $x\in I$. The function $S$ is strictly consistent if $S$ is consistent and if equality in Equation (3.3) implies that $x\in\mathrm{T}(F)$.

The consistent scoring functions for many commonly used statistical functionals have general forms.

Given $g:I\to\mathbb{R}$ and $\alpha\in(0,1)$, define the ‘quantile scoring function’ $\mathrm{QSF}(g,\alpha):I\times I\to\mathbb{R}$ by

$$\mathrm{QSF}(g,\alpha)(x,y)=(\mathbbm{1}\{y<x\}-\alpha)(g(x)-g(y))\qquad\forall x,y\in I.\qquad(3.4)$$

The name QSF is justified because, subject to slight regularity conditions, a scoring function $S$ is consistent for the $\alpha$-quantile functional if and only if $S=\mathrm{QSF}(g,\alpha)$ where $g$ is nondecreasing (Gneiting, 2011b; Gneiting, 2011a; Thomson, 1979). The absolute error scoring function $S(x,y)=|x-y|$ for the median functional arises from Equation (3.4) when $g(t)=2t$ and $\alpha=1/2$. The commonly used $\alpha$-quantile scoring function

$$S(x,y)=(\mathbbm{1}\{y<x\}-\alpha)(x-y)\qquad(3.5)$$

arises when $g(t)=t$.
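In code, Equation (3.4) is a one-liner. The following Python sketch (our own naming) returns the scoring function for given $g$ and $\alpha$, and recovers Equation (3.5) and the absolute error scoring function as special cases.

    import numpy as np

    def qsf(g, alpha):
        """Quantile scoring function QSF(g, alpha) of Equation (3.4)."""
        def score(x, y):
            return (np.asarray(y < x, dtype=float) - alpha) * (g(x) - g(y))
        return score

    standard_quantile_score = qsf(lambda t: t, 0.25)   # Equation (3.5)
    absolute_error = qsf(lambda t: 2.0 * t, 0.5)       # |x - y|

    assert np.isclose(absolute_error(3.0, 7.0), 4.0)
    assert np.isclose(standard_quantile_score(3.0, 7.0), 0.25 * 4.0)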

Given a convex function $\phi:I\to\mathbb{R}$ with subderivative $\phi^{\prime}$ and $\alpha\in(0,1)$, define the function $\mathrm{ESF}(\phi,\alpha):I\times I\to\mathbb{R}$ by

$$\mathrm{ESF}(\phi,\alpha)(x,y)=|\mathbbm{1}\{y<x\}-\alpha|\big(\phi(y)-\phi(x)-\phi^{\prime}(x)(y-x)\big)\qquad\forall x,y\in I.\qquad(3.6)$$

(The subderivative is a generalisation of the derivative for convex functions and coincides with the derivative when the convex function is differentiable.) Subject to weak regularity conditions, a scoring function $S$ is consistent for the $\alpha$-expectile functional if and only if $S=\mathrm{ESF}(\phi,\alpha)$ where $\phi$ is convex (Gneiting, 2011a; Savage, 1971). The expectation (or mean) functional corresponds to the special case $\alpha=1/2$, with the squared error scoring function $S(x,y)=(x-y)^{2}$ for expectations arising from Equation (3.6) when $\phi(t)=2t^{2}$ and $\alpha=1/2$. A special case of the squared error scoring function is the Brier score, where $I=[0,1]$ and observations typically take values in $\{0,1\}$.

Given $\phi:I\to\mathbb{R}$ with subderivative $\phi^{\prime}$ and $\nu>0$, define the function $\mathrm{HSF}(\phi,\nu):I\times I\to\mathbb{R}$ by

$$\mathrm{HSF}(\phi,\nu)(x,y)=\tfrac{1}{2}\big(\phi(y)-\phi(\kappa_{\nu}(x-y)+y)+\kappa_{\nu}(x-y)\phi^{\prime}(x)\big)\qquad\forall x,y\in I,\qquad(3.7)$$

where $\kappa_{\nu}$ is the ‘capping’ function defined by $\kappa_{\nu}(x)=\max(-\nu,\min(x,\nu))$. Subject to slight regularity conditions, a scoring function $S$ is consistent for the Huber mean functional (with tuning parameter $\nu$) if and only if $S=\mathrm{HSF}(\phi,\nu)$ where $\phi$ is convex (Taggart, 2021). The Huber loss scoring function

$$S(x,y)=\begin{cases}\tfrac{1}{2}(x-y)^{2},&|x-y|\leq\nu\\ \nu|x-y|-\tfrac{1}{2}\nu^{2},&|x-y|>\nu\end{cases}$$

arises from $\mathrm{HSF}(\phi,\nu)$ when $\phi(t)=t^{2}$, and is used by the Bureau of Meteorology to score forecasts of various parameters. The Huber mean is an intermediary between the median and the mean functionals, and is a robust measure of the centre of a distribution (Huber, 1964; Taggart, 2021).
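Both general forms translate directly into code. In this Python sketch (the derivative of $\phi$ is supplied by hand, and the names are ours), squared error and the Huber loss drop out of Equations (3.6) and (3.7) as special cases.

    import numpy as np

    def esf(phi, dphi, alpha):
        """Expectile scoring function ESF(phi, alpha) of Equation (3.6);
        dphi is the (sub)derivative of the convex function phi."""
        def score(x, y):
            return abs(float(y < x) - alpha) * (phi(y) - phi(x)
                                                - dphi(x) * (y - x))
        return score

    def hsf(phi, dphi, nu):
        """Huber mean scoring function HSF(phi, nu) of Equation (3.7)."""
        def score(x, y):
            kappa = max(-nu, min(x - y, nu))   # capping function kappa_nu
            return 0.5 * (phi(y) - phi(kappa + y) + kappa * dphi(x))
        return score

    squared_error = esf(lambda t: 2.0 * t * t, lambda t: 4.0 * t, 0.5)
    huber_loss = hsf(lambda t: t * t, lambda t: 2.0 * t, nu=1.0)

    assert np.isclose(squared_error(3.0, 7.0), 16.0)
    assert np.isclose(huber_loss(3.0, 7.0), 4.0 - 0.5)  # nu|x-y| - nu^2/2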

3.4 Decomposition of consistent scoring functions

We now state the main result of this paper, namely that the consistent scoring functions for the quantile, expectile and Huber mean functionals can be written as a sum of consistent scoring functions with respect to the chosen partition of unity. It is presented as a corollary since it follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2021).

Corollary 3.2.

Suppose that $\{\chi_{j}\}_{j=1}^{n}$ is a partition of unity on $I$. For each $j$ in $\{1,\ldots,n\}$, fix any points $u_{j}$ and $v_{j}$ in $I$.

  (a) If $g:I\to\mathbb{R}$ is a nondecreasing differentiable function and $\alpha\in(0,1)$ then

$$\mathrm{QSF}(g,\alpha)=\sum_{j=1}^{n}\mathrm{QSF}(g_{j},\alpha)$$

where $g_{j}$ is nondecreasing and defined by

$$g_{j}(u)=\int_{u_{j}}^{u}\chi_{j}(\theta)g^{\prime}(\theta)\,\mathrm{d}\theta.\qquad(3.8)$$

Moreover, if $I_{0}\subset I$ is an interval and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$ then $\mathrm{QSF}(g_{j},\alpha)=0$ on $I_{0}\times I_{0}$.

  (b) If $\phi:I\to\mathbb{R}$ is a convex twice-differentiable function, $\alpha\in(0,1)$ and $\nu>0$ then

$$\mathrm{ESF}(\phi,\alpha)=\sum_{j=1}^{n}\mathrm{ESF}(\phi_{j},\alpha)\qquad(3.9)$$

and

$$\mathrm{HSF}(\phi,\nu)=\sum_{j=1}^{n}\mathrm{HSF}(\phi_{j},\nu),$$

where $\phi_{j}$ is convex and defined by

$$\phi_{j}(u)=\int_{u_{j}}^{u}\int_{v_{j}}^{v}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v.\qquad(3.10)$$

Moreover, if $I_{0}\subset I$ is an interval and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$ then $\mathrm{ESF}(\phi_{j},\alpha)=\mathrm{HSF}(\phi_{j},\nu)=0$ on $I_{0}\times I_{0}$.

See Appendix A for the proof. Appendix B states closed-form expressions for $g_{j}$ and $\phi_{j}$ for commonly used scoring functions when the partition of unity is rectangular or trapezoidal. We note that the natural analogue of Corollary 3.2 for the consistent scoring functions of Huber functionals (which are generalised Huber means) also holds.

One can show that $\mathrm{QSF}(g_{j},\alpha)$, $\mathrm{ESF}(\phi_{j},\alpha)$ and $\mathrm{HSF}(\phi_{j},\nu)$ are independent of the choice of points $u_{j}$ and $v_{j}$ in $I$. In practice, the choice of $u_{j}$ and $v_{j}$ may be determined by computational convenience, such as selecting (if it exists) the minimum of $\mathrm{supp}(\chi_{j})$.
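For weight functions without convenient antiderivatives, $g_{j}$ and $\phi_{j}$ can be computed by numerical quadrature of Equations (3.8) and (3.10). A sketch (slow but serviceable, using SciPy; the smooth arctan weight discussed below serves as the example) follows.

    import numpy as np
    from scipy.integrate import quad

    def make_g_j(chi, dg, u_j):
        """g_j of Equation (3.8): the integral of chi * g' from u_j to u."""
        return lambda u: quad(lambda t: chi(t) * dg(t), u_j, u)[0]

    def make_phi_j(chi, d2phi, u_j, v_j):
        """phi_j of Equation (3.10): the iterated integral of chi * phi''."""
        inner = lambda v: quad(lambda t: chi(t) * d2phi(t), v_j, v)[0]
        return lambda u: quad(inner, u_j, u)[0]

    # Example: chi_2(t) = 1/2 + arctan(t - a)/pi with a = 0, phi(t) = 2t^2.
    chi2 = lambda t: 0.5 + np.arctan(t) / np.pi
    phi2 = make_phi_j(chi2, lambda t: 4.0, 0.0, 0.0)
    print(phi2(2.0))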

Finally, we discuss strict consistency. Suppose that $\mathrm{QSF}(g,\alpha)$, $\mathrm{ESF}(\phi,\alpha)$ and $\mathrm{HSF}(\phi,\nu)$ are strictly consistent for the quantile, expectile or Huber mean functionals respectively for some class $\mathcal{F}$ of probability distributions. This occurs when $g$ is strictly increasing, or when $\phi$ is strictly convex, perhaps subject to mild regularity conditions on $\mathcal{F}$ (Gneiting, 2011a; Taggart, 2021). If $\chi_{j}$ is strictly positive on $I$ then $\mathrm{QSF}(g_{j},\alpha)$, $\mathrm{ESF}(\phi_{j},\alpha)$ and $\mathrm{HSF}(\phi_{j},\nu)$ are also strictly consistent for their respective functionals. An example of such a partition $\{\chi_{1},\chi_{2}\}$ of unity on $\mathbb{R}$ is given by

$$\chi_{2}(t)=\tfrac{1}{2}+\tfrac{1}{\pi}\arctan(t-a)$$

and $\chi_{1}(t)=1-\chi_{2}(t)$. Here, $\chi_{2}$ induces scoring functions $S_{2}$ that emphasise performance on $(a,\infty)$ but do not completely ignore performance on any subinterval of $\mathbb{R}$.

3.5 Examples

We begin by demonstrating how to find explicit formulae for a particular decomposition.

Figure 3: Decomposition $S=S_{1}+S_{2}$ of the squared error scoring function, using the rectangular and trapezoidal partitions. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$ for different forecast–observation pairs. In each case, $S_{1}$ emphasises performance on the interval $(0,3)$ while $S_{2}$ emphasises performance on $(3,6)$.
Example 3.3 (Decomposition of the squared error scoring function).

Let $S$ denote the scoring function $S(x,y)=(x-y)^{2}$. Note that $S=\mathrm{ESF}(\phi,0.5)$, where $\phi(t)=2t^{2}$, via Equation (3.6), so that Corollary 3.2 applies. We use a simple rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity, where

$$\chi_{2}(t)=\begin{cases}0,&t<a\\ 1,&t\geq a\end{cases}$$

and $\chi_{1}(t)=1-\chi_{2}(t)$. Corollary 3.2 gives the corresponding decomposition $S=S_{1}+S_{2}$, where $S_{i}=\mathrm{ESF}(\phi_{i},0.5)$. To find the explicit formula for $S_{2}$, we compute the function $\phi_{2}$ using Equation (3.10). Integrating twice with the choice $u_{2}=v_{2}=a$ gives

$$\phi_{2}(u)=\begin{cases}0,&u<a\\ 2(u-a)^{2},&u\geq a\end{cases}\;=\;2(u-a)^{2}\,\mathbbm{1}\{u\geq a\}.$$

Thus $S_{2}=\mathrm{ESF}(\phi_{2},0.5)$, which yields the explicit formula of Equation (1.1) via Equation (3.6). The explicit formula for $S_{1}$ follows easily from the identity $S_{1}=S-S_{2}$.

Figure 3 illustrates the decomposition $S=S_{1}+S_{2}$ with respect to the rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity with $a=3$, and also with respect to a trapezoidal partition. For each forecast $x$ and observation $y$, the solid line represents the score $S(x,y)$, the dotted line the score $S_{1}(x,y)$ corresponding to the weight function $\chi_{1}$ with support on the left of the interval, and the dashed line the score $S_{2}(x,y)$ corresponding to the weight function $\chi_{2}$ with support on the right of the interval. Note that when the forecast $x$ and observation $y$ both lie outside the support of $\chi_{j}$ then $S_{j}(x,y)=0$.

Figure 4: Decomposition $S=S_{1}+S_{2}$ of the commonly used consistent scoring function for $0.25$-quantile forecasts, using the rectangular and trapezoidal partitions. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$ for different forecast–observation pairs. In each case, $S_{1}$ emphasises performance on the interval $(0,4)$ while $S_{2}$ emphasises performance on $(4,8)$.
Example 3.4 (Decomposition of a quantile scoring function).

Let $S$ denote the scoring function

$$S(x,y)=\begin{cases}0.25(y-x),&y\geq x\\ 0.75(x-y),&y<x,\end{cases}$$

so that $S=\mathrm{QSF}(g,\alpha)$ where $g(t)=t$ and $\alpha=0.25$. The decomposition $S=S_{1}+S_{2}$ is illustrated in Figure 4 with respect to two different partitions $\{\chi_{1},\chi_{2}\}$ of unity. The solid line represents $S$, the dotted line $S_{1}$ and the dashed line $S_{2}$.

Example 3.5 (Synthetic data).

Suppose that the climatological distribution of an unknown quantity $Y$ is normal with $Y\sim\mathcal{N}(4,15^{2})$. Two forecast systems A and B issue point forecasts for $Y$ targeting the mean functional. Their respective forecast errors are independently normally distributed with error characteristics $e_{\mathrm{A}}\sim\mathcal{N}(0,(\arctan(y-10)+2)^{2})$ (where $y$ is the observation) and $e_{\mathrm{B}}\sim\mathcal{N}(0,2^{2})$. Hence the standard deviation of $e_{\mathrm{A}}$ is lower than the standard deviation of $e_{\mathrm{B}}$ when $y$ is less than 10, while the reverse is true when $y$ is greater than 10. This is evident in the varying degree of scatter about the diagonal line in the forecast–observation plot of Figure 1.

We sample 10000 independent observations and corresponding forecasts, and compare both forecast systems using the squared error scoring function $S$ along with the two components of its decomposition $S=S_{1}+S_{2}$. The decomposition is induced from a rectangular partition $\{\chi_{1},\chi_{2}\}$ of unity on $\mathbb{R}$ with $\mathrm{supp}(\chi_{1})=(-\infty,10)$ and $\mathrm{supp}(\chi_{2})=[10,\infty)$. The mean scores for each system are shown in Table 1, along with a 95% confidence interval of the difference in mean scores.

An analysis of this sample concludes that neither System A nor System B is significantly better than its competitor as measured by $S$, since 0 belongs to the confidence interval of the difference in their mean scores. However, one can infer with high confidence that A performs better when scored by $S_{1}$, since 0 is well outside the corresponding confidence interval. Similarly, one may conclude with high confidence that B performs better when scored by $S_{2}$.

Based on the results in the table, one would use the forecast from A if both forecasts are less than 10, and the forecast from B if both forecasts are greater than 10. If the forecasts lie on opposite sides of 10, one option is to take the average of the two forecasts. We revisit this example in Section 4 from the perspective of users and optimal decision rules.
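The experiment is straightforward to reproduce approximately; a sketch follows (Python/NumPy; the random seed is arbitrary, so the mean scores will match Table 1 only up to sampling variation).

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    y = rng.normal(4.0, 15.0, size=n)                     # Y ~ N(4, 15^2)
    x_a = y + rng.normal(0.0, np.arctan(y - 10.0) + 2.0)  # heteroscedastic
    x_b = y + rng.normal(0.0, 2.0, size=n)                # homoscedastic

    a = 10.0
    def s2(x, y):   # Equation (1.1)
        return ((y - a) ** 2 * (y >= a) - (x - a) ** 2 * (x >= a)
                - 2 * (y - x) * (x - a) * (x >= a))

    # Mean scores S-bar, S1-bar and S2-bar for each system.
    for name, x in (("A", x_a), ("B", x_b)):
        s = (x - y) ** 2
        print(name, s.mean(), (s - s2(x, y)).mean(), s2(x, y).mean())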

Table 1: Comparative evaluation of forecast systems A and B from the Synthetic data example, rounded to 2 decimal places. The difference is the mean score of A minus the mean score of B.
Mean score System A System B 95% confidence interval of difference
$\bar{S}$ 4.14 4.02 (-0.12, 0.36)
$\bar{S}_{1}$ 0.61 2.65 (-2.16, -1.92)
$\bar{S}_{2}$ 3.53 1.36 (1.97, 2.36)
Example 3.6 (Sydney rainfall forecasts).

Two Bureau of Meteorology forecast systems BoM and OCF issue predictive distributions for daily rainfall at Sydney Observatory Hill. Climatologically, daily rainfall exceeds 35.8 mm on average 12 times a year, and 42.2 mm on average 6 times a year. We partition the outcome space $[0,\infty)$ using a trapezoidal partition $\{\chi_{1},\chi_{2}\}$ of unity, where $\chi_{2}(t)=0$ if $0\leq t<35.8$, $\chi_{2}(t)=(t-35.8)/(42.2-35.8)$ if $35.8\leq t<42.2$ and $\chi_{2}(t)=1$ if $t\geq 42.2$. Naturally, $\chi_{1}(t)=1-\chi_{2}(t)$.

Figure 5: Proper score decompositions $S=S_{1}+S_{2}$ by lead day for different types of Sydney daily rainfall forecasts, where $S$ is a standard quantile scoring function (top), squared error scoring function (bottom left) or CRPS (bottom right). In each case the score $S_{2}$ emphasises performance for heavier rainfall.

Over the entire outcome space, the quality of each system’s forecasts is assessed using the CRPS (for predictive distributions), the squared error scoring function (for expectation point forecasts), and the standard quantile scoring function of Equation (3.5) (for quantile point forecasts). Moreover, each of these scores $S$ is decomposed as $S=S_{1}+S_{2}$ using the common partition $\{\chi_{1},\chi_{2}\}$. Their mean scores by forecast lead day, for the period July 2018 to June 2020, are shown in Figure 5. This example illustrates that the same partition of unity can be used to home in on particular regions of interest across a variety of forecast types, using an assessment method that cannot be hedged. When performance is assessed with emphasis on heavier rainfall via $S_{2}$, BoM is better than OCF at lead days 1 and 2 for expectation and 0.9-quantile forecasts, marginally better with its predictive distributions, but worse than OCF at lead day 1 for 0.25-quantile forecasts.

4 Decision theoretic interpretation of scoring function decompositions

Mixture representations and elementary scoring functions facilitate a decision-theoretic interpretation of the scoring function decompositions of Corollary 3.2.

To avoid unimportant technical details, in this section assume that $I=\mathbb{R}$, $g$ is a nondecreasing differentiable function on $\mathbb{R}$ and $\phi$ is a convex twice-differentiable function on $\mathbb{R}$. Suppose that $S$ is any one of the scoring functions $\mathrm{QSF}(g,\alpha)$, $\mathrm{ESF}(\phi,\alpha)$ or $\mathrm{HSF}(\phi,\nu)$. Thus $S$ is consistent for some functional $\mathrm{T}$ (either a specific quantile, expectile or Huber mean). Then $S$ admits a mixture representation

$$S(x,y)=\int_{\mathbb{R}}S_{\theta}^{\mathrm{T}}(x,y)\,\mathrm{d}M(\theta),\qquad(4.1)$$

where each $S_{\theta}^{\mathrm{T}}$ is an elementary scoring function for $\mathrm{T}$ and the mixing measure $\mathrm{d}M(\theta)$ is nonnegative (Ehm et al., 2016; Taggart, 2021). That is, $S$ is a weighted average of elementary scoring functions. Explicit formulae for these elementary scoring functions and mixing measures are given in Tables 2 and 3.

The elementary scoring functions $S_{\theta}^{\mathrm{T}}$ and their corresponding functionals $\mathrm{T}$ arise naturally in the context of optimal decision rules. In a certain class of such rules, a predefined action is taken if and only if the point forecast $x$ for some unknown quantity $Y$ exceeds a certain decision threshold $\theta$. The usefulness of the forecast for the problem at hand can be assessed via a loss function $S_{\theta}$, where $S_{\theta}(x,y)$ encodes the economic regret, relative to a perfect forecast, of applying the decision rule with forecast $x$ when the observation $y$ is realised. Typically $S_{\theta}(x,y)>0$ whenever $\theta$ lies between $x$ and $y$ (e.g., the forecast exceeded the decision threshold but the realisation did not, resulting in regret), and $S_{\theta}(x,y)=0$ otherwise (i.e., a perfect forecast would have resulted in the same decision, resulting in no regret).

Table 2: Elementary scoring functions $S_{\theta}^{\mathrm{T}}(x,y)$ for different functionals $\mathrm{T}$.
Functional $\mathrm{T}$ — Elementary score $S_{\theta}^{\mathrm{T}}(x,y)$
$\alpha$-quantile — $\begin{cases}1-\alpha,&y\leq\theta<x\\ \alpha,&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$
$\alpha$-expectile — $\begin{cases}(1-\alpha)|y-\theta|,&y\leq\theta<x\\ \alpha|y-\theta|,&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$
$\nu$-Huber mean — $\begin{cases}\tfrac{1}{2}\min(\theta-y,\nu),&y\leq\theta<x\\ \tfrac{1}{2}\min(y-\theta,\nu),&x\leq\theta<y\\ 0,&\text{otherwise}\end{cases}$

For the decision rule to be optimal over many forecast cases, the point forecast $x$ should be one that minimises the expected score $\mathbb{E}S_{\theta}(x,Y)$, where $Y\sim F$ for the predictive distribution $F$. Binary betting decisions (Ehm et al., 2016) and simple cost–loss decision models (where the user must weigh up the cost of taking protective action as insurance against an event which may or may not occur (Richardson, 2000)) give rise to some $\alpha$-quantile of $F$ being the optimal point forecast. The $\alpha$-expectile functional gives the optimal point forecast for investment problems with cost basis $\theta$, revenue $y$ and simple taxation structures (Ehm et al., 2016). The Huber mean generates optimal point forecasts when profits and losses are capped in such investment problems (Taggart, 2021). Though the language here is financial, such investment decisions can be made using weather forecasts (Taggart, 2021, Example 5.4). In each case, the particular score $S_{\theta}(x,y)$ is, up to a multiplicative constant, the elementary score $S_{\theta}^{\mathrm{T}}(x,y)$ in Equation (4.1) for the appropriate functional $\mathrm{T}$.

Thus the representation (4.1) expresses $S(x,y)$ as a weighted average of elementary scores, each of which encodes the economic regret associated with a decision made using the forecast $x$ with decision threshold $\theta$. The weight on each $\theta$ is determined by the mixing measure $\mathrm{d}M(\theta)$, as detailed in Table 3. Relative to $\mathrm{QSF}(g,\alpha)$, the scoring function $\mathrm{QSF}(g_{j},\alpha)$ weighs the economic regret for decision threshold $\theta$ by a factor of $\chi_{j}(\theta)$, thus privileging certain decision thresholds over others.

Table 3: Weights in the mixing measure $\mathrm{d}M(\theta)$ for different scoring functions $S$.
Scoring function $S$ — $\mathrm{d}M(\theta)$
$\mathrm{QSF}(g,\alpha)$ — $g^{\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{QSF}(g_{j},\alpha)$ — $\chi_{j}(\theta)g^{\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{ESF}(\phi,\alpha)$, $\mathrm{HSF}(\phi,\nu)$ — $\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta$
$\mathrm{ESF}(\phi_{j},\alpha)$, $\mathrm{HSF}(\phi_{j},\nu)$ — $\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta$

We illustrate these ideas using the point forecasts generated by Systems A and B from the Synthetic data example (Section 3.5). These point forecasts target the mean functional, and so induce optimal decision rules for investment problems with cost basis $\theta$. The decision rule is to invest if and only if the forecast future value $x$ of the investment exceeds the fixed up-front cost $\theta$ of the investment. As previously noted, neither forecast system performs significantly better than the other when scored by squared error. However, for certain decision thresholds $\theta$, the mean elementary score $\bar{S}_{\theta}$ (which measures the average economic regret) of one system is clearly superior to the other. This is illustrated by the Murphy diagram of Figure 6, which is a plot of $\bar{S}_{\theta}$ against $\theta$ (Ehm et al., 2016). Since a lower score is better, users with a decision threshold exceeding about 8 should use forecasts from B, while those with a decision threshold less than 8 should use forecasts from A.
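The data behind a Murphy diagram are just mean elementary expectile scores with $\alpha=1/2$ (Table 2), computed on a grid of decision thresholds. A sketch follows (Python/NumPy, regenerating the synthetic forecasts of Section 3.5 so the block is self-contained).

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    y = rng.normal(4.0, 15.0, size=n)
    x_a = y + rng.normal(0.0, np.arctan(y - 10.0) + 2.0)
    x_b = y + rng.normal(0.0, 2.0, size=n)

    def mean_elementary_score(x, y, theta):
        """Mean elementary alpha-expectile score (alpha = 1/2, Table 2) at
        decision threshold theta, averaged over all forecast cases."""
        between = ((y <= theta) & (theta < x)) | ((x <= theta) & (theta < y))
        return np.mean(0.5 * np.abs(y - theta) * between)

    thetas = np.linspace(-20.0, 40.0, 121)
    murphy_a = np.array([mean_elementary_score(x_a, y, t) for t in thetas])
    murphy_b = np.array([mean_elementary_score(x_b, y, t) for t in thetas])
    # Plotting murphy_a and murphy_b against thetas reproduces the curves of
    # Figure 6; their crossing point sits near theta = 8.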

Figure 6: Murphy diagrams for the two forecast systems A (lighter) and B (darker) of the Synthetic data example. The graphs on each panel are the same but the shaded regions differ. Each shaded area is proportional to the mean score of the scoring function $\mathrm{ESF}_{j}(t\mapsto 2t^{2},1/2)$, which is the summand in the mean squared error decomposition corresponding to the weight function $\chi$. Here $\chi$ is either the rectangular weight function $\chi_{R}$ (top) or the trapezoidal weight function $\chi_{T}$ (bottom).

Let $S=S_{1}+S_{2}$ denote the same decomposition of the squared error scoring function used for the Synthetic data example (Section 3.5), induced from a rectangular partition of unity. To aid simplicity of interpretation, assume for the user group under consideration that there is one constant $k$ such that each elementary score equals $k$ multiplied by the economic regret. We observe the following.

For $S$, $\phi(t)=2t^{2}$ and so $\mathrm{d}M(\theta)=4\,\mathrm{d}\theta$ by Table 3. So by Equation (4.1), the mean squared error $\bar{S}$ for each forecast system is proportional to the total area under the respective Murphy graph. Hence the mean squared error $\bar{S}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds.

For $S_{2}$, $\mathrm{d}M(\theta)=4\chi_{2}(\theta)\,\mathrm{d}\theta$, where $\chi_{2}$ is the weight function $\chi_{R}$ on the top panel of Figure 6. From Equation (4.1), the mean score $\bar{S}_{2}$ for each forecast system is proportional to the area under the respective Murphy curve for decision thresholds $\theta$ in $[10,\infty)$. This is illustrated on the top panel of Figure 6 with shaded regions. System B has the smaller area and mean score and is hence superior. So $\bar{S}_{2}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds $\theta$ in $[10,\infty)$. Similarly, $\bar{S}_{1}$ can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds $\theta$ less than 10.

The bottom panel of Figure 6 illustrates the effect of using a trapezoidal weight function $\chi_{T}$. The corresponding scoring function applies zero weight to economic regret experienced by users with decision thresholds less than 0, and increasing weight to thresholds greater than zero until a full weight of 1 is applied for decision thresholds greater than or equal to 20. The areas of the shaded regions are proportional to the mean score for Systems A and B.

5 Conclusion

We have shown that the scoring functions that are consistent for quantiles, expectations, expectiles or Huber means can be decomposed as a sum of consistent scoring functions, each of which emphasises predictive performance in a region of the variable’s range according to the selected partition of unity. In particular, this allows the comparison of predictive performance for the extreme region of a variable’s range in a way that cannot be hedged and is not susceptible to misguided inferences.

The decomposition is consonant with the analogous decomposition of the CRPS using the threshold-weighting method, and hence by extension with known decompositions for the absolute error scoring function.

Such decompositions are a consequence of the mixture representation for each of the aforementioned classes of consistent scoring functions and their associated functionals. This has two implications. First, each score in the decomposition can be interpreted as the weighted average of economic regret associated with optimal decision rules, where the average is taken over all user decision thresholds and the weight applied is given by the corresponding weight function in the partition of unity. Second, such decompositions will also exist for the consistent scoring functions of other functionals that have analogous mixture representations.

Acknowledgements

The author thanks Jonas Brehmer, Deryn Griffiths, Michael Foley and Tom Pagano for their constructive comments, which helped improve the quality of this paper.

Appendix A Proof of Corollary 3.2

As mentioned in Section 3.4, Corollary 3.2 essentially follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2021) by decomposing the mixing measure. However, we provide a direct proof to avoid some technical hypotheses of those theorems.

Proof.

We give the proof for the case $\mathrm{ESF}$. The proofs for the cases $\mathrm{QSF}$ and $\mathrm{HSF}$ are similar.

To begin, note that if $\phi(s)=\psi(s)+cs+d$ whenever $s\in I$, for some real constants $c$ and $d$, then $\mathrm{ESF}(\phi,\alpha)=\mathrm{ESF}(\psi,\alpha)$. So without loss of generality we may assume that $u_{j}=u_{0}$ and $v_{j}=v_{0}$ in the definition of $\phi_{j}$ whenever $1\leq j\leq n$, for some constants $u_{0}$ and $v_{0}$ in $I$.

Suppose that $\phi$ is a convex twice-differentiable function and $\alpha\in(0,1)$. Given the partition of unity $\{\chi_{j}\}_{j=1}^{n}$, define $\phi_{j}$ by Equation (3.10). Then

$$\begin{aligned}
|\mathbbm{1}\{y<x\}-\alpha|^{-1}\sum_{j=1}^{n}\mathrm{ESF}(\phi_{j},\alpha)(x,y)
&=\sum_{j=1}^{n}\big(\phi_{j}(y)-\phi_{j}(x)-\phi_{j}^{\prime}(x)(y-x)\big)\\
&=\sum_{j=1}^{n}\left(\int^{y}_{x}\int^{v}_{v_{0}}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v-(y-x)\int^{x}_{v_{0}}\chi_{j}(\theta)\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\right)\\
&=\int^{y}_{x}\int^{v}_{v_{0}}\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\,\mathrm{d}v-(y-x)\int^{x}_{v_{0}}\phi^{\prime\prime}(\theta)\,\mathrm{d}\theta\\
&=\int^{y}_{x}\big(\phi^{\prime}(v)-\phi^{\prime}(v_{0})\big)\,\mathrm{d}v-(y-x)\big(\phi^{\prime}(x)-\phi^{\prime}(v_{0})\big)\\
&=(\phi(y)-\phi^{\prime}(v_{0})y)-(\phi(x)-\phi^{\prime}(v_{0})x)-(y-x)(\phi^{\prime}(x)-\phi^{\prime}(v_{0}))\\
&=|\mathbbm{1}\{y<x\}-\alpha|^{-1}\mathrm{ESF}(\phi,\alpha)(x,y),
\end{aligned}$$

which establishes Equation (3.9). The convexity of $\phi_{j}$ follows from the fact that $\phi_{j}^{\prime\prime}(u)=\chi_{j}(u)\phi^{\prime\prime}(u)\geq 0$, since $\phi$ is convex. Finally, suppose that $x,y\in I_{0}\subset I$ and $\mathrm{supp}(\chi_{j})\cap I_{0}=\emptyset$. Now

$$\begin{aligned}
|\mathbbm{1}\{y<x\}-\alpha|^{-1}\mathrm{ESF}(\phi_{j},\alpha)(x,y)
&=\int_{x}^{y}(\phi_{j}^{\prime}(w)-\phi_{j}^{\prime}(x))\,\mathrm{d}w\\
&=\int_{x}^{y}\int_{x}^{w}\phi_{j}^{\prime\prime}(z)\,\mathrm{d}z\,\mathrm{d}w\\
&=\int_{x}^{y}\int_{x}^{w}\chi_{j}(z)\phi^{\prime\prime}(z)\,\mathrm{d}z\,\mathrm{d}w,
\end{aligned}$$

and since $w$ lies between $x$ and $y$, it follows that $w$ also lies in $I_{0}$, and hence so does the inner variable of integration $z$. But $\chi_{j}=0$ on $I_{0}$ and hence the inner integral vanishes. This shows that $\mathrm{ESF}(\phi_{j},\alpha)=0$ on $I_{0}\times I_{0}$. ∎

Appendix B Formulae for commonly used scoring functions

Table 4 presents closed-form expressions for $g_{j}$ of Equation (3.8) in the case when $g(t)=t$ and for $\phi_{j}$ of Equation (3.10) in the case when $\phi(t)=2t^{2}$, each induced by rectangular weight functions of the form (3.1) or trapezoidal weight functions of the form (3.2). These expressions facilitate the computation of decompositions for the absolute error, standard quantile, squared error and Huber loss scoring functions. Note also that for each weight function we have $\phi_{j}^{\prime}(t)=4g_{j}(t)$.

Table 4: Closed-form expressions for $g_{j}$ and $\phi_{j}$ for different weight functions when $g(t)=t$ and $\phi(t)=2t^{2}$.

Rectangular weight function, given by Equation (3.1):

$$g_{j}(t)=\begin{cases}0,&t<a\\ t-a,&a\leq t<b\\ b-a,&t\geq b\end{cases}$$

$$\phi_{j}(t)=\begin{cases}0,&t<a\\ 2(t-a)^{2},&a\leq t<b\\ 4(b-a)t+2(a^{2}-b^{2}),&t\geq b\end{cases}$$

Trapezoidal weight function, given by Equation (3.2):

$$g_{j}(t)=\begin{cases}0,&t<a\\ (t-a)^{2}/(2(b-a)),&a\leq t<b\\ t-(b+a)/2,&b\leq t<c\\ -(d-t)^{2}/(2(d-c))+(d+c-a-b)/2,&c\leq t<d\\ (d+c-a-b)/2,&t\geq d\end{cases}$$

$$\phi_{j}(t)=\begin{cases}0,&t<a\\ \frac{2(t-a)^{3}}{3(b-a)},&a\leq t<b\\ 2t^{2}-2(a+b)t+\frac{2}{3}(b-a)^{2}+2ab,&b\leq t<c\\ \frac{2(d-t)^{3}}{3(d-c)}+2(d+c-a-b)t+\frac{2}{3}\big((b-a)^{2}+3ab-(d-c)^{2}-3cd\big),&c\leq t<d\\ 2(d+c-a-b)t+\frac{2}{3}\big((b-a)^{2}+3ab-(d-c)^{2}-3cd\big),&t\geq d\end{cases}$$
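For completeness, here is the rectangular row of Table 4 in code (a Python/NumPy sketch with the same constants $a<b$); the identity $\phi_{j}^{\prime}=4g_{j}$ gives a quick numerical check.

    import numpy as np

    def g_rect(t, a, b):
        """g_j from Table 4 for g(t) = t and a rectangular weight on [a, b)."""
        t = np.asarray(t, dtype=float)
        return np.clip(t, a, b) - a

    def phi_rect(t, a, b):
        """phi_j from Table 4 for phi(t) = 2t^2 and a rectangular weight."""
        t = np.asarray(t, dtype=float)
        return np.where(t < a, 0.0,
                        np.where(t < b, 2.0 * (t - a) ** 2,
                                 4.0 * (b - a) * t + 2.0 * (a ** 2 - b ** 2)))

    # Check phi_j' = 4 g_j by central differences away from the breakpoints.
    t = np.array([-1.0, 0.5, 2.0, 5.0])
    h = 1e-6
    num_deriv = (phi_rect(t + h, 0.0, 3.0) - phi_rect(t - h, 0.0, 3.0)) / (2 * h)
    assert np.allclose(num_deriv, 4.0 * g_rect(t, 0.0, 3.0), atol=1e-4)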

References

  • Bellini, F. and Di Bernardino, E. (2017). Risk management with expectiles. The European Journal of Finance, 23(6):487–506.
  • Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings. J. R. Statist. Soc. B, 78:505–562.
  • Gneiting, T. (2011a). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
  • Gneiting, T. (2011b). Quantiles as optimal point forecasts. International Journal of Forecasting, 27(2):197–207.
  • Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1:125–151.
  • Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3):411–422.
  • Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.
  • Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster’s dilemma: extreme events and forecast evaluation. Statistical Science, 32(1):106–127.
  • Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55:819–847.
  • Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, 38(4):796–809.
  • Richardson, D. S. (2000). Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126(563):649–667.
  • Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801.
  • Sharpe, M., Bysouth, C., and Gill, P. (2020). New operational measure to assess extreme events using site-specific climatology. Presented at the 2020 International Verification Methods Workshop Online. Retrieved from https://jwgfvr.univie.ac.at/presentations-and-notes/ on 5 January 2021.
  • Taggart, R. (2021). Point forecasting and forecast evaluation with generalized Huber loss. Electronic Journal of Statistics, to appear.
  • Thomson, W. (1979). Eliciting production possibilities from a well-informed manager. Journal of Economic Theory, 20:360–380.