Flexible risk design using bi-directional dispersion

Matthew J. Holland
Osaka University

Abstract

Many novel notions of “risk” (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.

1 Introduction

What does it mean for a learner to successfully generalize? Broadly speaking, this is an ambiguous property of learning systems that can be defined, measured, and construed in countless ways. In the context of machine learning, however, the notion of “success” in off-sample generalization is almost without exception formalized as minimizing the expected value of a random loss $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)$ , where $h$ is a candidate parameter, model, or decision rule, and $\operatorname{\mathsf{L}}(h)$ is a random variable on a probability space $(\Omega,\mathcal{F},\mu)$ [41, 53]. The idea of quantifying the risk of an unexpected outcome (here, a random loss) using the expected value dates back to the Bernoullis and Gabriel Cramer in the early 18th century [5, 25]. In a more modern context, the emphasis on average performance is the “general setting of the learning problem” of Vapnik, [58], and plays a central role in the decision-theoretic learning model of Haussler, [27]. Use of the expected loss to quantify off-sample generalization has been essential to the development of both the statistical and computational theories of learning [16, 34].

While the expected loss still remains pervasive, important new lines of work on risk-sensitive learning have begun exploring novel feedback mechanisms for learning algorithms, in some cases derived directly from new risk functions that replace the expected loss. Learning algorithms designed using conditional value-at-risk (CVaR) [13, 31] and tilted (or “entropic”) risk [19, 37, 38] are well-known examples of location properties which emphasize loss tails in one direction more than the mean itself does. This is often used to increase sensitivity to “worst-case” events [33, 55], but in special cases where losses are bounded below, sensitivity to tails on the downside can be used to realize an insensitivity to tails on the upside [36]. This strong asymmetry is not specific to the preceding two risk function classes, but rather is inherent in much broader classes such as optimized certainty equivalent (OCE) risk [7, 8, 36] and distributionally robust optimization (DRO) risk [6, 17, 18, 24]. Unsurprisingly, naive empirical estimators of these risks are particularly fragile under outliers coming from the “sensitive direction,” as is evidenced by the plethora of attempts in the literature to design robust modifications [30, 45, 60]. In general, however, loss distributions can display long tails in either direction over the learning process (see Figure 3), particularly when losses are unbounded below (e.g., negative rewards [54], unhinged loss [57]), and loss functions whose empirical mean has no minimum appear frequently (e.g., separable logistic regression [1, 50]). Since the tail behavior of stochastic losses and gradients is well-known to play a critical role in the stability and robustness of learning systems [12, 61], the inability to control tail sensitivity in both directions represents a genuine limitation to machine learning methodology.

A natural alternative class of risk functions that gives us control over tail sensitivity in both directions is that of the “M-location” of the loss distribution, namely any value in

\displaystyle\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{\mathbf{E}}_{\mu}\rho\left(\operatorname{\mathsf{L}}(h)-\theta\right)\subset\mathbb{R},

(1)

where $\rho:\mathbb{R}\to\mathbb{R}_{+}$ is assumed to be such that this set of minimizers is non-empty. Here various special choices of $\rho$ let us recover well-known locations, such as the mean (with $\rho(\cdot)=(\cdot)^{2}$ ), median ( $\rho(\cdot)=\lvert{\cdot}\rvert$ ), arbitrary quantiles (via “pinball” function [56]), and even further beyond to “expectiles” (using curved variants of the pinball function [21]). The obvious limitation here is that while computing (1) using empirical estimates is easy, minimization as a function of $h$ is in general a difficult bi-level programming problem. As an alternative approach, in this paper we study the potential benefits and tradeoffs that arise in using performance criteria of the form

\displaystyle\min_{\theta\in\mathbb{R}}\left[\eta\theta+\operatorname{\mathbf{E}}_{\mu}\rho\left(\operatorname{\mathsf{L}}(h)-\theta\right)\right]

(2)

where $\eta\in\mathbb{R}$ . By making a sacrifice of fidelity to the M-location (1), we see that the criterion in (2) suggests a congenial objective function (joint in $(h,\theta)$ ). Intuitively, one minimizes the sum of generalized “location” and “dispersion” properties, and the nature of this dispersion impacts the fidelity of the location term to the original M-location induced by $\rho$ . These two locations align perfectly in the special case where we set $\rho(\cdot)=(\cdot)^{2}/2$ and $\eta=1$ , since (2) is equivalent to the mean-variance objective $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)+\operatorname{var}_{\mu}\operatorname{\mathsf{L}}(h)/2$ , but more generally, it is clear that allowing for more diverse choices of $\rho$ gives us new freedom in terms of tail control with respect to both location and dispersion. We consider a concrete yet flexible class of risk functions that generalizes beyond (2), allows for easy implementation, and is analytically tractable from the standpoint of providing formal learning guarantees. Our main contributions are as follows:

•

A new class of “threshold risks” (T-risks; defined in §3) that provide a tractable alternative to M-locations, and a bi-directional complement to OCE/DRO risks (reviewed in §2.2).
•

A stochastic learning algorithm for T-risks that enjoys high-probability guarantees of convergence to a stationary point under heavy-tailed losses/gradients, without manual clipping (details in §4, Theorem 3).
•

Strong empirical evidence of the flexibility and utility inherent in T-risk learners. In particular: robustness to unbalanced noisy class labels without regularization (Figure 4), sharp control over sensitivity to outliers in regression with convex base losses (Figure 5), and smooth interpolation between mean and mean-variance minimizers on clean, normalized benchmark classification datasets (Figures 6, 7 and 14–18).

The overall flow of the paper is as follows. Background information on notation and related literature is given in §2, and we introduce the new risk class of interest in §3. Formal aspects of the learning problem using these risks are treated in §4, and empirical findings are explained and discussed in §5. All formal proofs and supplementary results are organized in §A–§D of the appendix, and code for reproducing all the results in this paper is provided in an online repository.¹¹1https://github.com/feedbackward/bdd

2 Background

2.1 Notation

Random quantities

To start, let us clarify the nature of the random losses we consider. With the context of the underlying probability space $(\Omega,\mathcal{F},\mu)$ , we write $\operatorname{\mathsf{L}}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathsf{L}}(h;\cdot):\Omega\to\mathbb{R}$ to refer to a random variable (i.e., a $\mathcal{F}$ -measurable function) on $\Omega$ , though we only use the form $\operatorname{\mathsf{L}}(h)$ in the body of this paper. When we talk about “sampling” losses or a “random draw” of the losses, this amounts to computing a realization $\operatorname{\mathsf{L}}(h;\omega)\in\mathbb{R}$ . We use standard notation for taking expectation, e.g., $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\int_{\Omega}\operatorname{\mathsf{L}}(h;\omega)\,\mu(\mathop{}\!\mathrm{d}\omega)$ . These conventions extend to random quantities based on the losses (e.g., the gradient ${\operatorname{\mathsf{L}}}^{\prime}(h)$ considered in §4.1). Similarly, we will use $\operatorname{\mathbf{P}}$ as a general-purpose probability function, representing both $\mu$ and product measures; when the source of randomness is not immediate from the context, it will be stated explicitly. We will use $\operatorname{R}(\cdot)$ as a generic symbol for risk functions (often modified with subscripts), with the understanding that $\operatorname{R}(\cdot)$ maps random losses $\operatorname{\mathsf{L}}(h)$ to real values. We will overload this notation, writing $\operatorname{R}(\operatorname{\mathsf{L}})$ when the role of $h$ is unimportant, and writing $\operatorname{R}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{R}(\operatorname{\mathsf{L}}(h))$ when we want to emphasize the dependence on $h$ . This convention will be applied to other quantities as well, such as writing $\operatorname{D}_{\rho}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}(h))\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\rho(\operatorname{\mathsf{L}}(h)-\theta)$ for the expected dispersion induced by $\rho$ , first defined in (22).

Norms

We will use $\lVert{\cdot}\rVert$ as a general-purpose notation for all norms that appear in this paper. That is, we do not use different notation to distinguish different norm spaces. The reason for this is that we will never consider two distinct norms on the same set; each norm is associated with a distinct set, and thus as long as it is clear which set a particular element belongs to, there should be no confusion. The only exception to this rule is the special case of $\mathbb{R}$ , in which we write $\lvert{\cdot}\rvert$ for the absolute value, as is traditional. For a function $f:\mathbb{R}\to\mathbb{R}$ in one variable, we use ${f}^{\prime}$ to denote the usual derivative. More general notions (e.g., Gateaux or Fréchet differentials) only make an appearance in §4, and the generality they afford us is not crucial to the main narrative, so the details can be easily skipped over if the reader is unfamiliar with such concepts. All the other undefined notation we use is essentially standard, and can be found in most introductory analysis textbooks.

Miscellaneous

For a function $f:\mathbb{R}\to\mathbb{R}$ in one variable, we use ${f}^{\prime}$ to denote the usual derivative. More general notions (e.g., Gateaux or Fréchet differentials) only make an appearance in §4, and the generality they afford us is not crucial to the main narrative, so the details can be easily skipped over if the reader is unfamiliar with such concepts. All the other undefined notation we use is essentially standard, and can be found in most introductory analysis textbooks. Particularly in formal proofs, we will frequently make use of the notation $\rho_{\sigma}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma)$ , where $\sigma>0$ is a scaling parameter.

2.2 Review of key risk functions

OCE-type risks

As a computationally convenient way to interpolate between the mean and the extreme values of $\operatorname{\mathsf{L}}(h)$ , the tilted risk [37, 38] is a natural choice, defined for $\gamma\neq 0$ as

\displaystyle\operatorname{R}_{\textup{tilt}}(h;\gamma)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{1}{\gamma}\log\left(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma\operatorname{\mathsf{L}}(h)}\right).

(3)

This is simply a re-scaling of the cumulant distribution function of $\operatorname{\mathsf{L}}(h)$ , viewed as a function of $h$ , where taking $\gamma\to\infty$ and $\gamma\to-\infty$ lets us approach the supremum and infimum of $\operatorname{\mathsf{L}}(h)$ , respectively. Another important class of risk functions is based upon the conditional value-at-risk (CVaR) [46], defined for $0\leq\beta<1$ as

\displaystyle\operatorname{R}_{\textup{CVaR}}(h;\beta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\left[\operatorname{\mathsf{L}}(h)\,|\,\operatorname{\mathsf{L}}(h)\geq\operatorname{Q}_{\beta}(h)\right].

(4)

This is the expected loss at $h$ , conditioned on the event that the loss exceeds the $\beta$ -quantile of $\operatorname{\mathsf{L}}(h)$ , denoted here by $\operatorname{Q}_{\beta}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\inf\left\{x\in\mathbb{R}:\operatorname{\mathbf{P}}\{\operatorname{\mathsf{L}}(h)\leq x\}\geq\beta\right\}$ . Both of these risk functions can be re-written in a form similar to that of (2), namely

\displaystyle h\mapsto\inf_{\theta\in\mathbb{R}}\left[\theta+\operatorname{\mathbf{E}}_{\mu}\phi(\operatorname{\mathsf{L}}(h)-\theta)\right]

(5)

where $\phi(x)=(\mathrm{e}^{\gamma x}-1)/\gamma$ yields $\operatorname{R}_{\textup{tilt}}$ (basic calculus), and $\phi(x)=\max\{0,x\}/(1-\beta)$ yields $\operatorname{R}_{\textup{CVaR}}$ (see Rockafellar and Uryasev, [46, 47]). When $\phi:\mathbb{R}\to\mathbb{R}$ is restricted to be a non-decreasing, closed, convex function which satisfies both $\phi(0)=0$ and $1\in\partial\phi(0)$ , the mapping given in (5) is called an optimized certainty equivalent (OCE) risk [7, 8, 36]. The class of OCE risks strictly generalizes the expected value (noting $\phi(x)=x$ is valid), and includes $\operatorname{R}_{\textup{tilt}}(\cdot;\gamma)$ when $\gamma>0$ , as well as $\operatorname{R}_{\textup{CVaR}}$ . The mean-variance is sometimes stated to be an OCE risk [36, Table 1], but this fails to hold when losses are unbounded above and below.

DRO-type risks

Another important class of risk functions is that of robustly regularized risks which are designed to ensure that risk minimizers are robust to a certain degree of divergence in the underlying data model. Making this concrete, it is typical to assume the random losses are the outputs of a loss function $\ell$ depending on the candidate $h$ and some random data $\mathsf{Z}$ , i.e., $\operatorname{\mathsf{L}}(h)=\ell(h;\mathsf{Z})$ , with $\mathsf{Z}\sim\mu$ as our reference model. To measure divergence from this reference model, it is convenient to use the Cressie-Read family of divergence functions [17, 60], namely for any $c>1$ and assuming $\nu\ll\mu$ (absolutely continuity) holds, functions of the form

\displaystyle\operatorname{Div}_{c}(\nu;\mu)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}f_{c}\left(\frac{\mathop{}\!\mathrm{d}\nu}{\mathop{}\!\mathrm{d}\mu}\right),\text{ where }f_{c}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{x^{c}-cx+c-1}{c(c-1)}

(6)

and $\mathop{}\!\mathrm{d}\nu/\mathop{}\!\mathrm{d}\mu$ is the Radon-Nikodym density of $\nu$ with respect to $\mu$ .²²2For background on absolute continuity and density functions, see Ash and Doléans-Dade, [3, §2.2]. The resulting robustly regularized risk, called the DRO risk, is defined as

\displaystyle\operatorname{R}_{\textup{DRO}}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\sup\left\{\operatorname{\mathbf{E}}\operatorname{\mathsf{L}}(h):\operatorname{\mathsf{L}}\in\mathcal{L}\right\}

(7)

where the constrained set of random losses $\mathcal{L}$ , determined by $c>1$ and $a>0$ , is defined as

\displaystyle\mathcal{L}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left\{\operatorname{\mathsf{L}}(\cdot)=\ell(\cdot;\mathsf{Z}):\mathsf{Z}\sim\nu\text{ and }\operatorname{Div}_{c}(\nu;\mu)\leq a\right\}.

(8)

For this particular family of divergences, the risk can be characterized as the optimal value of a simple optimization problem [17], namely we have that

\displaystyle\operatorname{R}_{\textup{DRO}}(h)=\inf_{\theta\in\mathbb{R}}\left[\theta+\left(1+c(c-1)a\right)^{1/c}\left(\operatorname{\mathbf{E}}_{\mu}\left(\operatorname{\mathsf{L}}(h)-\theta\right)_{+}^{c_{\ast}}\right)^{1/c_{\ast}}\right]

(9)

where $c_{\ast}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=c/(c-1)$ . While strictly speaking this is not an OCE risk, note that if we set $\phi(x)=(1+c(c-1)a)^{c_{\ast}/c}(x)_{+}^{c_{\ast}}$ , then the DRO risk can be written as

\displaystyle\operatorname{R}_{\textup{DRO}}(h)=\inf_{\theta\in\mathbb{R}}\left[\theta+[\operatorname{\mathbf{E}}_{\mu}\phi(\operatorname{\mathsf{L}}(h)-\theta)]^{1/c_{\ast}}\right],

(10)

giving us an expression of this risk as the sum of a threshold and an asymmetric dispersion. When we set $c=2$ , this yields the well-known special case of $\chi^{2}$ -DRO risk [26, 60]. In addition to the one-directional nature of the dispersion term in these risks, all of these risks are at least as sensitive to loss tails (on the upside) as the classical expected loss $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)$ is; this holds for CVaR (with $\beta>0$ ), tilted risk (with $\gamma>0$ ), and even robust variants of DRO risk [60].

Key differences

While it is clear that the form of the preceding risk classes given in (5) and (10) based on various choices of $\phi(\cdot)$ is the same as our $\rho(\cdot)$ -based risk of interest in (2), they are fundamentally different in that none of the choices of $\phi(\cdot)$ induce a meaningful M-location; since all these $\phi(\cdot)$ are monotonic on $\mathbb{R}$ , both minimization and maximization of $\theta\mapsto\operatorname{\mathbf{E}}_{\mu}\phi(\operatorname{\mathsf{L}}(h)-\theta)$ is trivially accomplished by taking $\lvert{\theta}\rvert\to\infty$ . In stark contrast, $\rho(\cdot)$ is assumed to be such that the solution set in (1) is a subset of the real line. We will introduce a concrete and flexible class from which $\rho$ will be taken in §3, and in Figure 1 give a side-by-side comparison with the $\phi$ functions discussed in the preceding paragraphs.

2.3 Closely related work

This work falls into the broad context of machine learning driven by novel risk functions [28]. Of all the papers cited above, the works of Lee et al., [36] on OCE risks, and Li et al., 2021b [38], Li et al., 2021a [37] on tilted risks are of a similar nature to our paper, with the obvious difference being that the class of risks is fundamentally different, as described in the preceding paragraphs. Indeed, many of our empirical tests involve direct comparison with the risk classes studied in these works (e.g., Figures 2, 4, and 12), and so they provide critical context for our work. Previous work by Holland, [30] studies a rudimentary special case of what we call “minimal T-risk” here; the focus in that work was on obtaining learning guarantees (in expectation) when the risk is potentially non-convex and non-smooth in $h$ , but with convex $\rho$ , and no comparison was made with OCE/DRO risk classes. We build upon these results here, considering a broad class of dispersions $\rho$ which are differentiable but need not be convex (see (15)); we how such risk classes can readily admit high-probability learning guarantees for stochastic gradient-based algorithms (Theorem 3), provide bounds on the average loss incurred by empirical risk minimizers using our risk (Proposition 7), and make detailed empirical comparisons with each of the key existing risk classes.

3 Threshold risk

To ground ourselves conceptually, let us refer to $\operatorname{\mathsf{L}}(h)$ as the base loss incurred by $h$ . The exact nature of $h$ is left completely abstract for the moment, as all that matters is the probability distribution of this base loss. By selecting an arbitrary threshold $\theta\in\mathbb{R}$ , we define a broad class of properties as

\displaystyle\operatorname{R}_{\rho}(h;\theta,\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\eta\theta+\operatorname{\mathbf{E}}_{\mu}\rho\left(\operatorname{\mathsf{L}}(h)-\theta\right).

(11)

Here $\eta\in\mathbb{R}$ is a weighting parameter allowed to be negative, and as a bare minimum, $\rho$ is assumed to be such that the resulting M-location(s) are well-defined in the sense that the inclusion in (1) holds. We call $\rho\left(\operatorname{\mathsf{L}}(h)-\theta\right)$ the (random) dispersion of the base loss, taken with respect to threshold $\theta$ , and we refer to $\operatorname{R}_{\rho}(h;\theta,\eta)$ in (11) as the threshold risk (or simply T-risk) under $\rho$ .

3.1 Minimal T-risk and M-location

Arguably the most intuitive special case of T-risk is the minimal T-risk, where minimization is with respect to the threshold $\theta\in\mathbb{R}$ . Let us denote this risk and the optimal threshold set as

\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\inf_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta),\qquad\theta_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta).

(12)

Clearly, if $\rho$ is bounded above or grows too slowly, we will have $\underline{\operatorname{R}}_{\rho}(h;\eta)=-\infty$ and no real-valued minimizers, i.e., $\theta_{\rho}(h;\eta)=\emptyset$ . Letting $\operatorname{M}_{\rho}(h)$ denote the M-locations in (1), for $\eta\neq 0$ we have

\displaystyle\theta_{\rho}(h;\eta)\neq\emptyset\implies\operatorname{M}_{\rho}(h)\neq\emptyset,

(13)

although the converse does not hold in general.³³3For example, consider choices of $\rho$ that are “re-descending” [32]. When $\eta=0$ , these two solution sets align, i.e., we have $\theta_{\rho}(h;0)=\operatorname{M}_{\rho}(h)$ . More generally, depending on the sign of $\eta$ , the optimal thresholds can be either larger or smaller than the corresponding M-locations. More precisely, for any $\theta^{\prime}\in\operatorname{M}_{\rho}(h)$ , as long as $\theta_{\rho}(h;\eta)$ is non-empty, there exists $\theta\in\theta_{\rho}(h;\eta)$ such that $\theta\operatorname{sign}(\eta)\leq\theta^{\prime}\operatorname{sign}(\eta)$ .

Special case minimized by quantiles

The form given in (11) is very general, but it can be understood as a straightforward generalization of the convex objective used to characterize quantiles. More precisely, taking $\beta\in(0,1)$ and denoting the $\beta$ -quantile of the base loss using $\operatorname{Q}_{\beta}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\inf\left\{x\in\mathbb{R}:\operatorname{\mathbf{P}}\{\operatorname{\mathsf{L}}(h)\leq x\}\geq\beta\right\}$ , it is well-known that in the special case of $\rho(\cdot)=\lvert{\cdot}\rvert$ , we have

\displaystyle\operatorname{Q}_{\beta}(h)\in\theta_{\rho}(h;\theta,1-2\beta)

(14)

for any choice of $0<\beta<1$ , as long as $\operatorname{\mathbf{E}}_{\mu}\lvert{\operatorname{\mathsf{L}}(h)}\rvert$ is finite [35]. The T-risk in (11) simply allows for a more flexible choice of $\rho$ , and thus generalizes the dispersion term in this objective function.

3.2 T-risk with scaled Barron dispersion

Refer to caption — Figure 1: Left-most plot: the graph of $x\mapsto\rho(x/\sigma;\alpha)$ from §3.2 for varying choices of $\alpha$ , with $\sigma$ fixed to $\sigma=0.2$ for visual ease. Middle two plots: graphs of $\phi(x)$ in (5) for CVaR and tilted risk, respectively over different choices of $\beta$ and $\gamma$ . Right-most plot: graph of $\phi(x)$ in (10) for the $\chi^{2}$ -DRO risk, where $a\geq 0$ is re-parametrized using $0\leq\widetilde{a}<1$ using the relation $a=((1-\widetilde{a})^{-1}-1)^{2}/2$ .

In order to capture a range of sensitivities to loss tails in both directions, we would like to select $\rho$ from a class of functions that gives us sufficient control over scale, boundedness, and growth rates. As a concrete choice, we propose to set $\rho$ in (11) as $\rho(x)=\rho(x/\sigma;\alpha)$ , where $\sigma>0$ is a scaling parameter, and $\rho(\cdot;\alpha)$ with shape $\alpha\in[-\infty,2]$ is a family of functions that ranges from bounded and logarithmic growth on the lower end to quadratic growth on the upper end, defined as:

\displaystyle\rho(x;\alpha)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\begin{cases}x^{2}/2,&\text{if }\alpha=2\\ \log\left(1+x^{2}/2\right),&\text{if }\alpha=0\\ 1-\exp\left(-x^{2}/2\right),&\text{if }\alpha=-\infty\\ \frac{\lvert{\alpha-2}\rvert}{\alpha}\left(\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\alpha/2}-1\right),&\text{otherwise}.\end{cases}

(15)

At a high level, $\rho(\cdot;\alpha)$ is approximately quadratic near zero for any choice of shape $\alpha$ , but its growth as one deviates far from zero depends greatly on $\alpha$ . We refer to (15) as the Barron class of functions for computing dispersion.⁴⁴4The reason for this naming is that Barron, [4] recently studied this class in the context of designing loss functions for computer vision applications. We remark that this differs considerably from our usage in computing the dispersion of random losses, where the loss function underlying the base loss is left completely arbitrary. Recalling the risks reviewed in §2.2, since $\rho(\cdot;\alpha)$ is flat at zero and symmetric about zero, the Barron class clearly takes us beyond the functions $\phi(\cdot)$ allowed by OCE risks (5) and used in typical DRO risk definitions (10); see Figure 1 for a visual comparison.

As mentioned in §2.1, we will often use the generic shorthand notation $\rho_{\sigma}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma)$ , and drop the dependence on $\alpha$ when clear from context. The shape parameter $\alpha$ gives us direct control over the conditions needed for a finite T-risk $\operatorname{R}_{\rho}(h;\theta,\eta)$ , as the following lemma shows.

Lemma 1 (Finiteness and shape).

Let $\rho$ be from the Barron class (15). Then in order to ensure $\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)<\infty$ holds for all $\theta\in\mathbb{R}$ , each of the following conditions (depending on the value of $\alpha$ ) is sufficient. For $0<\alpha\leq 2$ , let $\operatorname{\mathbf{E}}_{\mu}\lvert\operatorname{\mathsf{L}}(h)\rvert^{\alpha}<\infty$ . For $\alpha=0$ , let $\operatorname{\mathbf{E}}_{\mu}\lvert\operatorname{\mathsf{L}}(h)\rvert^{c}<\infty$ for some $c>0$ . For $-\infty\leq\alpha<0$ , let $\operatorname{\mathsf{L}}(h)$ be $\mathcal{F}$ -measurable. Furthermore, for the cases where $\alpha\neq 0$ , the above conditions are also necessary.

Assuming $\mu$ -integrability as in Lemma 1, the Barron class furnishes a non-empty set of M-locations $\operatorname{M}_{\rho}(h)$ for any choice of $\alpha$ , and when restricted to $\alpha\geq 1$ with appropriate settings of $\eta$ and $\sigma$ , the optimal threshold set $\theta_{\rho}(h;\eta)$ contains a single unique solution (see Lemma 10). For any valid choice of $\alpha$ , the function $\rho(\cdot;\alpha)$ is twice continuously differentiable on $\mathbb{R}$ (see §D.3 for exact expressions). All the limits in $\alpha$ behave as we would expect: $\rho(x;\alpha)\to\rho(x;c)$ as $\alpha\to c$ for $c\in\{-\infty,0,2\}$ (see §B.2 for details). For $\alpha\geq 0$ , the dispersion function is unbounded, with growth ranging from logarithmic to quadratic depending on the choice of $\alpha$ . For $\alpha<0$ , the dispersion function is bounded. The mapping $x\mapsto\rho_{\sigma}(x;\alpha)$ is convex on $\mathbb{R}$ for $\alpha\geq 1$ , and for $\alpha<1$ it is only convex between $\pm\sigma\sqrt{\lvert{\alpha-2}\rvert/(1-\alpha)}$ , and concave elsewhere (see Lemma 8). The class $\operatorname{R}_{\rho}(h;\theta,\eta)$ of T-risks (11) under the scaled Barron dispersion $\rho_{\sigma}(x;\alpha)$ is the central focus of this paper.

3.3 Sensitivity to outliers and tail direction

Before we consider the learning problem, which typically involves the evaluation of many different loss distributions over the course of training, here we consider a fixed distribution, and numerically compare the T-risk (11) and M-location (1) induced by $\rho$ from the Barron class in §3.2, along with the key OCE and DRO risks discussed in §2.2.

Experiment setup

We generate random values to simulate loss distributions, and evaluate how the values returned by each risk function change as we modify their respective parameters. Letting $\operatorname{\mathsf{L}}$ denote the random base loss we are simulating, we specify a parametric distribution for $\operatorname{\mathsf{L}}$ , from which we take an independent sample $\{\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{m}\}$ . In all cases, we center the true distribution such that $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}=0$ . We use this common sample to compare the values returned by each of the aforementioned risks, as well as the optimal choice of threshold parameter $\theta$ . To ensure that key trends are consistent across samples, we take a large sample size of $m=10^{4}$ . For T-risk, we adjust $\alpha$ and $\eta$ , and for M-location, just $\alpha$ . In both cases, we leave $\sigma=0.5$ fixed. For CVaR, we modify the quantile level $0<\beta<1$ . For tilted risk, we modify the parameter $\gamma\in\mathbb{R}$ . For $\chi^{2}$ -DRO risk, we modify $0<\widetilde{a}<1$ , having re-parameterized $a$ in (8) by $a\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=((1-\widetilde{a})^{-1}-1)^{2}/2$ , as is common practice [60].

Representative results

An illustrative example is given in Figure 2, where we look at how each risk class behaves under a centered asymmetric distribution, before and after flipping it (i.e., under $\operatorname{\mathsf{L}}$ and − $\operatorname{\mathsf{L}}$ ). Starting from the two left-most plots, we show $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)$ (dashed curves) and $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ (solid curves) from (12) as a function of $\alpha$ , coloring the area between these graphs in gray. The first plot corresponds to $\eta=1$ , the second to $\eta=-1$ . Similarly for the M-location we plot $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})$ (solid) and $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})+\operatorname{\mathbf{E}}_{\mu}\rho(\operatorname{\mathsf{L}}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}))$ (dashed). Analogous values are plotted for each of the other classes; note that for the tilted risk (3) with $\gamma>0$ , the optimal threshold and the risk value are in fact the same value (see §B.1). The right-most plot is a histogram of the random sample $\{\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{m}\}$ , here from a centered log-Normal distribution. All plots share a common vertical axis, and horizontal rules are drawn at the median (red, solid) and at the mean (gray, dotted; always zero due to centering). The critical point to emphasize here is how all the OCE and DRO risks here are highly asymmetric in terms of their tail sensitivity, in stark contrast with both the M-location and the T-risk. Turning tail sensitivity high enough in each of these classes (e.g., $\alpha>1.5$ , $\beta>0.5$ , $\gamma>1.0$ , $\widetilde{a}>0.5$ ), note how flipping the distribution tails from the upside (top row) to the downside (bottom row) leads to a dramatic decrease in all risks but the T-risk and M-location. Finally, note how the T-risk thresholds $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ close in on the M-location $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})$ from above/below depending on whether $\eta$ is negative/positive. Results for numerous distributions are available in our online repository (cf. §1).

4 Learning algorithm analysis

We now proceed to consider the learning problem, in which the goal is ultimately to select a candidate $h$ such that the distribution of $\operatorname{\mathsf{L}}(h)$ is “optimal” in the sense of achieving the smallest possible value of T-risk $\operatorname{R}_{\rho}(h;\theta,\eta)$ in (11), with $\rho$ taken from the Barron class (15), and $h$ taken from some set $\mathcal{H}$ . An obvious take-away of the integrability conditions (Lemma 1) is that even when the base loss is heavy-tailed in the sense of having infinite higher-order moments, we can always adjust the dispersion function $\rho(\cdot;\alpha)$ in such a way that transforming the base loss to obtain new feedback

\displaystyle\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\eta\theta+\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta;\alpha)

(16)

gives us an unbiased estimator of the finite T-risk, i.e., $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)=\operatorname{R}_{\rho}(h;\theta,\eta)\in\mathbb{R}$ . Intuitively, one expects that a similar property can be leveraged to control heavy-tailed stochastic gradients used in an iterative learning algorithm. We explore this point in detail in §4.1. We will then complement this analysis by considering in §4.2 the basic properties of T-risk at the minimal threshold $\underline{\operatorname{R}}_{\rho}(h;\eta)$ given in (12), viewed from the perspectives of axiomatic risk design and empirical risk minimization.

4.1 T-risk and stochastic gradients

For the time being let us fix an arbitrary threshold $\theta\in\mathbb{R}$ , and assuming the gradient ${\operatorname{\mathsf{L}}}^{\prime}(h)$ is $\mu$ -almost surely finite, denote the partial derivative of the transformed losses (16) with respect to $h$ by

\displaystyle\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}={\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h).

(17)

Writing $\partial_{h}\operatorname{R}_{\rho}$ for the gradient of $h\mapsto\operatorname{R}_{\rho}(h;\theta,\eta)$ , an analogue of Lemma 1 for gradients holds.

Lemma 2 (Unbiased gradients).

Let $\mathcal{U}$ be an open subset of any metric space such that $\mathcal{H}\subseteq\mathcal{U}$ . Let the base loss map $h\mapsto\operatorname{\mathsf{L}}(h)$ be Fréchet differentiable on $\mathcal{U}$ ( $\mu$ -almost surely), with gradient denoted by ${\operatorname{\mathsf{L}}}^{\prime}(h)$ for each $h\in\mathcal{U}$ . Fixing any choice of $-\infty\leq\alpha\leq 2$ , we have that

\displaystyle\operatorname{\mathbf{E}}_{\mu}\left[\sup_{h\in\mathcal{H}}\lVert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)){\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right]<\infty\implies\operatorname{\mathbf{E}}_{\mu}\left[\partial_{h}\operatorname{\mathsf{L}}_{\rho}\right]=\partial_{h}\operatorname{R}_{\rho}

(18)

with the implied equality valid on all of $\mathcal{H}\times\mathbb{R}^{2}$ .

Consider a setting in which the gradients are heavy-tailed, i.e., where $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert^{p}=\infty$ for $p>2$ . If the ultimate goal of learning is minimization of $h\mapsto\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)$ , then in order to obtain high-probability guarantees of finding a nearly-stationary point with rates matching the in-expectation case, one cannot naively use the raw gradients ${\operatorname{\mathsf{L}}}^{\prime}(h)$ , but must rather carry out a delicate truncation which accounts for the bias incurred [14, 22, 23, 42]. On the other hand, if the ultimate objective is $h\mapsto\operatorname{R}_{\rho}(h;\theta,\eta)$ , then using (17) there is zero bias by design (Lemma 2), and when we take the shape parameter of our dispersion function such that $\alpha\leq 0$ , we have

\displaystyle\operatorname{\mathbf{P}}\left\{\left\lVert{\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)}\right\rVert\leq\Gamma\right\}=1

(19)

for an appropriate choice of $0<\Gamma<\infty$ under standard loss functions such as quadratic and logistic losses, even when the random losses and gradients are heavy-tailed (see Corollary 4).

To see how this plays out for the analysis of learning algorithms, let us consider plugging the raw stochastic gradients $\partial_{h}\operatorname{\mathsf{L}}_{\rho}$ into a simple update procedure. Given an independent sequence of random losses $(\operatorname{\mathsf{L}}_{1},\operatorname{\mathsf{L}}_{2},\ldots)$ , let us denote by $(\operatorname{\mathsf{L}}_{\rho,1},\operatorname{\mathsf{L}}_{\rho,2},\ldots)$ the transformed losses computed via (16), and for a sequence $(h_{1},h_{2},\ldots)$ let $G_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\partial_{h}\operatorname{\mathsf{L}}_{\rho,t}(h_{t};\theta,\eta)$ denote the resulting stochastic gradients for any integer $t\geq 1$ . Fixing $\theta\in\mathbb{R}$ and letting $h_{1}\in\mathcal{H}$ denote an arbitrary initial value, we consider a particular sequence generated using the following update rule:

\displaystyle h_{t+1}=h_{t}-a_{t}\widetilde{M}_{t},

(20)

where $a_{t}$ is a non-negative step-size we control, and the update direction satisfies

\displaystyle\lVert{\widetilde{M}_{t}}\rVert=1,\langle\widetilde{M}_{t},M_{t}\rangle=\lVert{M_{t}}\rVert,\text{ where }M_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=bM_{t-1}+(1-b)G_{t}

(21)

with $0<b<1$ also being a controllable parameter. This is an unconstrained, normalized stochastic gradient descent routine using momentum; it modifies the procedure of Cutkosky and Mehta, [14] in that we do not truncate $G_{t}$ . Note that if $\mathcal{H}$ is a Banach space and $\mathcal{H}^{\ast}$ its dual, in general we have that $G_{t}$ and $M_{t}$ are elements of $\mathcal{H}^{\ast}$ . When $\mathcal{H}$ is reflexive (e.g., any Hilbert space), it is always possible to construct $\widetilde{M}_{t}$ from $M_{t}$ .⁵⁵5Given any $h^{\ast}\in\mathcal{H}^{\ast}$ , we can always find $h\in\mathcal{H}$ such that $\langle h^{\ast},h\rangle=\lVert{h}\rVert\lVert{h^{\ast}}\rVert$ [39, §5.6]. The following theorem shows how the gradient norms incurred by this algorithm can be bounded with high probability.

Theorem 3 (Stationary points with high probability).

Let $\mathcal{H}$ be a reflexive Banach space, with a Fréchet differentiable norm satisfying $\lVert{h_{1}+h_{2}}\rVert^{2}\leq\lVert{h_{1}}\rVert^{2}+\langle\nabla\lVert{h_{1}}\rVert^{2},h_{2}\rangle+\lVert{h_{2}}\rVert^{2}$ for any $h_{1},h_{2}\in\mathcal{H}$ . In addition, assume the losses are such that (19) holds on $\mathcal{H}$ , $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert\leq\lambda_{1}\lVert{h_{1}-h_{2}}\rVert$ for any $h_{1},h_{2}\in\mathcal{H}$ , and $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}<\infty$ , where $\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\sup_{h\in\mathcal{H}}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert$ . Run the learning algorithm in (20)–(21) for $T$ iterations, with $b=1-1/\sqrt{T}$ and $a_{t}=(1/T)^{3/4}$ for all steps $t=1,2,\ldots,T$ , assuming each $h_{t}\in\mathcal{H}$ . Taking any $0<\delta<1$ , it then follows that

\displaystyle\frac{1}{T}\sum_{t=1}^{T}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)}\rVert\leq\frac{c_{1}}{T^{1/4}}+\frac{c_{2}}{\sqrt{T}}+\frac{\lambda}{2T^{3/4}}

with probability no less than $1-\delta$ , using coefficients defined as

	$\displaystyle c_{1}$	$\displaystyle\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{T+1};\theta,\eta)+16\Gamma\sqrt{\log(3T\delta^{-1})}+2\lambda$
	$\displaystyle c_{2}$	$\displaystyle\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=20\Gamma\log(3T\delta^{-1})+2\Gamma$
	$\displaystyle\lambda$	$\displaystyle\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\lambda_{3}+\lambda_{4}}{\sigma}+\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}},\quad\lambda_{2}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lVert{{\rho}^{\prime\prime}}\rVert_{\infty},\quad\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{2\lambda_{2}}{\sigma}\right)\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2},\quad\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}$

for any choice of $\sigma>0$ .

These high-probability $\mathcal{O}(T^{-1/4})$ rates match standard guarantees in the stochastic optimization literature under non-convex objectives in expectation [15, 20]. The main take-away of Theorem 3 is that even if the random losses and gradients are unbounded and heavy-tailed, as long as the dispersion function $\rho$ is chosen to modulate extreme values (such that (19) holds), we can obtain confidence intervals for the T-risk gradient norms incurred by stochastic gradient updates. Distribution control is implied by the risk design, and thus there is no additional need for truncation or bias control. The following corollary illustrates how the bounded gradient condition in Theorem 3 is satisfied under very weak assumptions on the data.

Corollary 4.

Assume the random losses $\operatorname{\mathsf{L}}(h)$ are driven by random data $(\mathsf{X},\mathsf{Y})$ , where $\mathsf{X}$ takes values in a Banach space $\mathcal{X}$ , and $\mathcal{H}\subset\mathcal{X}^{\ast}$ has finite diameter. Consider the following losses:

E1.

Quadratic loss: $\operatorname{\mathsf{L}}(h)=(h(\mathsf{X})-\mathsf{Y})^{2}/2$ , with $h(\mathsf{X})=\langle h,\mathsf{X}\rangle$ and $\mathsf{Y}=\langle h^{\ast},\mathsf{X}\rangle+\varepsilon$ , where $\varepsilon$ is zero-mean, has finite variance, and is independent of $\mathsf{X}$ .
E2.

Logistic loss: $\operatorname{\mathsf{L}}(h)=\log(\sum_{j=1}^{k}\exp(\langle h_{j},\mathsf{X}\rangle))-\sum_{j=1}^{k}\widetilde{\mathsf{Y}}_{j}\langle h_{j},\mathsf{X}\rangle$ , where we have $k\geq 2$ classes, $h=(h_{1},\ldots,h_{k})$ with each $h_{j}\in\mathcal{H}$ , and $(\widetilde{\mathsf{Y}}_{1},\ldots,\widetilde{\mathsf{Y}}_{k})$ is a one-hot representation of the class label $\mathsf{Y}$ assigned to $\mathsf{X}$ .

If we set $\rho(\cdot)=\rho(\cdot;\alpha)$ with $\alpha\leq 0$ , then under the examples E1.–E2., as long as $\operatorname{\mathbf{E}}_{\mu}\lVert{\mathsf{X}}\rVert^{2}<\infty$ , we have that the bounds assumed by Theorem 3, including (19), are satisfied on $\mathcal{H}$ .

In practice, the threshold $\theta$ will not typically be fixed arbitrarily, but rather selected in a data-dependent fashion, potentially optimized alongside $h$ ; the impact of such algorithmic choices will be evaluated in our empirical tests in §5. In the following sub-section, we consider some key properties of the special case in which threshold $\theta$ is always taken to yield the smallest overall T-risk value.

Limitations

An obvious limitation of our analysis is that we have assumed that each $h_{t}\in\mathcal{H}$ ; this matches the setup of [14], but the condition that $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}<\infty$ will sometimes require $\mathcal{H}$ to have a finite diameter. Modifying the procedure to allow for projection of the iterates to $\mathcal{H}$ is a point of technical interest, but is out of this paper’s scope. Another limitation is that our current approach using smoothness properties (Lemma 12) necessitates an assumption of gradients with finite second-order moments. In the special case where $\alpha=1$ , arguments based on smoothness can be replaced with arguments based on weak convexity [15, 30]; but this fails for more general $\alpha$ since $\rho(\cdot;\alpha)$ is not convex for $\alpha<1$ , and not Lipschitz for $\alpha>1$ . Another potential option is to split the sample, leverage stronger guarantees available in expectation for gradient descent run on the subsets, and robustly choose the best candidate based on a validation set of data [29].

4.2 T-risk with minimizing thresholds

Recall the minimal T-risk $\underline{\operatorname{R}}_{\rho}(h;\eta)$ and optimal thresholds $\theta_{\rho}(h;\eta)$ defined earlier in (12), here restricted to scaled $\rho(\cdot)=\rho_{\sigma}(\cdot;\alpha)$ from the Barron class (15). This is arguably the most natural subset of T-risks, with a functional form aligned with OCE/DRO risks discussed in §2.2. For readability, we denote the dispersion of $\operatorname{\mathsf{L}}(h)$ measured about $\theta\in\mathbb{R}$ using $\rho$ as

\displaystyle\operatorname{D}_{\rho}(h;\theta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\rho\left(\operatorname{\mathsf{L}}(h)-\theta\right).

(22)

Here we will overload our notation, writing $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)$ , $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ , and $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta)$ when we want to leave $h$ abstract and focus on random losses in a set $\operatorname{\mathsf{L}}\in\mathcal{L}$ . For this risk to be finite, we require $\alpha\geq 1$ , and in addition need $\lvert{\eta}\rvert<1/\sigma$ for the special case of $\alpha=1$ ; otherwise, the dispersion term grows too slowly and we have $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)=-\infty$ (Lemma 10). When finite, the optimal threshold is unique, and overloading our notation once more we use $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ to denote it. The following lemma summarizes some basic properties of the minimal T-risk.

Lemma 5.

Let $\mathcal{L}$ be such that for each $\operatorname{\mathsf{L}}\in\mathcal{L}$ we have $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)<\infty$ , and let $\alpha$ , $\sigma$ , and $\eta$ be such that $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)>-\infty$ . Under these assumptions, the dispersion part of the minimal T-risk is translation invariant, i.e., for any $c\in\mathbb{R}$ , we have $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}+c;\theta_{\rho}(\operatorname{\mathsf{L}}+c;\eta))=\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))$ . This dispersion term is always non-negative, and if $\eta\neq 0$ , then we have $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))>0$ for any $\operatorname{\mathsf{L}}\in\mathcal{L}$ , even if $\operatorname{\mathsf{L}}$ is constant. If $\mathcal{L}$ is convex, then so is the map $\operatorname{\mathsf{L}}\mapsto\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)$ . The optimal threshold is translation equivariant in that we have $\theta_{\rho}(\operatorname{\mathsf{L}}+c;\eta)=\theta_{\rho}(\operatorname{\mathsf{L}};\eta)+c$ for any choice of $c\in\mathbb{R}$ , and it is monotonic in that whenever $\operatorname{\mathsf{L}}_{1},\operatorname{\mathsf{L}}_{2}\in\mathcal{L}$ and $\operatorname{\mathsf{L}}_{1}\leq\operatorname{\mathsf{L}}_{2}$ almost surely, we have $\theta_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)\leq\theta_{\rho}(\operatorname{\mathsf{L}}_{2};\eta)$ .

Let us briefly discuss the properties described in Lemma 5 with a bit more context. One of the best-known classes of risk functions is that of coherent risks [2], typically characterized by properties of convexity, monotonicity, translation equivariance, and positive homogeneity [52]. Our general notion of “dispersion” is often referred to as “deviation” in the risk literature, and the properties of translation invariance, sub-linearity (implying convexity), non-negativity, and definiteness (i.e., zero only for constants) allow one to establish links between deviations and coherent risks [49]. In general, the risk $\operatorname{\mathsf{L}}\mapsto\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)$ takes us outside of this traditional class, while still maintaining lucid connections as summarized in the preceding lemma.

Remark 6 (Risk quadrangle).

It should also be noted that our T-risks are not what would typically be called “risks” in the context of the “expectation quadrangle” in the framework developed by Rockafellar and Uryasev, [48]. With their Example 7 as a clear reference for comparison, our function $\rho(\cdot)$ corresponds to their “error integrand” $e(\cdot)$ , and the risk derived from their quadrangle would be

\displaystyle\operatorname{\mathsf{L}}\mapsto\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}+\operatorname{\mathbf{E}}_{\mu}\rho\left(\operatorname{\mathsf{L}}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})\right)=\inf_{\theta\in\mathbb{R}}\left[\theta+\operatorname{\mathbf{E}}_{\mu}v(\operatorname{\mathsf{L}}-\theta)\right]

where $v(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x)+x$ and $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})$ denotes the M-location (1) under $\rho$ . Under appropriate choice of $\rho$ , this is an OCE-type risk, and evidently our optimal thresholds $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ with $\eta\neq 0$ do not make an appearance in risks of this form, highlighting the distinct nature of $\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta)$ .

Next, we briefly consider the question of how algorithms designed to minimize $\underline{\operatorname{R}}_{\rho}(h;\eta)$ perform in terms of the classical risk, namely the expected loss $\operatorname{R}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)$ . This is a big topic, but as an initial look, we consider how the expected loss incurred by minimizers of the empirical T-risk can be controlled given sufficiently good concentration of the M-estimators induced by $\rho$ , and the empirical mean. Let us denote any empirical T-risk minimizer by

\displaystyle\widehat{h}_{\rho}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\left[\inf_{\theta\in\mathbb{R}}\left(\eta\theta+\frac{1}{n}\sum_{i=1}^{n}\rho_{\sigma}\left(\operatorname{\mathsf{L}}_{i}(h)-\theta\right)\right)\right]

(23)

where for simplicity $(\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{n})$ is an iid sample of random losses.

Proposition 7.

Take any T-risk parameters $\alpha\geq 1$ , $\eta>0$ , and $\sigma>0$ , denoting the resulting empirical (minimal) T-risk minimizer $\widehat{h}_{\rho}$ in (23). Let $\widehat{\operatorname{R}}(h)$ , $\widehat{\operatorname{M}}_{\rho}(h)$ , $\widehat{\theta}_{\rho}(h;\eta)$ , and $\widehat{\operatorname{D}}_{\rho}(h;\theta)$ denote the empirical analogues of $\operatorname{R}(h)$ , $\operatorname{M}_{\rho}(h)$ , $\theta_{\rho}(h;\eta)$ , and $\operatorname{D}_{\rho}(h;\theta)$ for each $h\in\mathcal{H}$ . Then, with $\lVert{\cdot}\rVert_{\mathcal{H}}$ as in Theorem 3, we have

\displaystyle\operatorname{R}(\widehat{h}_{\rho})\leq\operatorname{R}(h^{\ast})+\lVert{\widehat{\operatorname{M}}_{\rho}-\widehat{\theta}_{\rho}}\rVert_{\mathcal{H}}+2\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}+\frac{1}{\eta}\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+4\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}

where we can freely choose $h^{\ast}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\operatorname{R}(h)$ to be optimal in the expected loss.

We have left the upper bound in Proposition 7 rather abstract to emphasize the key factors that can be used to control expected loss bounds for empirical T-risk minimizers and keep the overall narrative clear. Note that in the special case of $\alpha=2$ one has $\widehat{\operatorname{M}}_{\rho}(h)-\widehat{\theta}_{\rho}(h;\eta)=\eta$ , and more generally taking $\eta\to 0$ sends this difference to zero. There exists tension due to the $1/\eta$ coefficient on the dispersion term. All other terms, including the dispersion itself, are free of $\eta$ . Note also that the remaining term $\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}$ is the (uniform) difference between the M-estimator induced by $\rho$ and the classical risk; this can be modulated directly using the scale parameter $\sigma$ , and sharp bounds for a broad classes of M-estimators have been established recently [40]. Our Proposition 7 is analogous to Theorem 7 of Lee et al., [36], with the key difference being that we study (minimal) T-risks instead of OCE risks, without assuming that losses are bounded (below or above).

5 Learning applications

To complement the “static” empirical analysis in §3.3 and the formal insights for learning algorithms in §4, here we empirically investigate how risk function design and data properties impact the behaviour of stochastic gradient-based learners.

5.1 Classification with noisy and unbalanced labels

Experiment setup

As an initial example using simulated data, we design a binary classification problem by generating 500 labeled data points as shown in Figure 3 (left-most plot). The majority class comprises 95% of the sample, and we flip 5% of the labels uniformly at random. Aside from the flipped labels, the data is linearly separable. We consider two candidate classifiers to initialize an iterative learning algorithm: one candidate that does well on the majority class (dotted purple), and one candidate that does well on the minority class (dashed pink). Due to class imbalance, the loss distributions incurred by these two candidates are highly asymmetric and differ in their long tail direction (Figure 3, remaining plots). We run empirical risk minimization for each risk class described in §2.2 (joint in $(h,\theta)$ ), implemented by 15,000 iterations of full-batch gradient descent, using a step size of $0.01$ , and three different choices of base loss function (logistic, hinge, unhinged). We run this procedure on the same data for a wide range of risk parameters $\alpha$ (T-risk), $\beta$ (CVaR), $\gamma$ (tilted), and $a$ ( $\chi^{2}$ -DRO), and choose a representative setting for each risk class as the one achieving the best final training classification error.

Representative results

The full-batch (training) classification error and norm trajectories for these representatives initialized at each of the two candidates just mentioned (“mostly correct” and “mostly incorrect”) and trained using the unhinged loss as a base loss are shown in Figure 4. Regardless of the direction of the initial loss distribution tails, we see that using the Barron-type T-risk has the flexibility to achieve both a stable and superior long-term error rate, while at the same time penalizing exceedingly overconfident correct examples (via bi-directionality), and thereby keeping the norm of the linear classifier small. An almost identical trend is also observed under the (binary) logistic loss, whereas classification error rates under the hinge loss tend to be less stable (see Figures 10–11 in §B.6).

5.2 Control of outlier sensitivity

Experiment setup

For a crystal-clear example of real data that induces losses with heavy tails, we consider the famous Belgian phone call dataset [51, 59] with input features normalized to the unit interval, and raw outputs used as-is. We conduct two regression tasks: the first uses the original data just stated, and the second has us modify such that it is an outlier on both the horizontal and vertical axes (we just multiply the original data point by 5); such points are said to have “high leverage” [32, Ch. 7]. For these two tasks, once again we run empirical risk minimization under each of the risk classes of interest, implemented using 15,000 iterations of full-batch gradient descent, with fixed step size 0.005. To illustrate the flexibility of the T-risk, here instead of minimizing (11) jointly in $(h,\theta)$ , we fix $\theta$ at the start of the learning process along with $\sigma$ and $\eta$ , iteratively optimizing $h$ only (i.e., $\theta$ is not updated at any point). For simplicity, we set $\theta$ and $\sigma$ both to be the median of the losses incurred at initialization. All other risk classes are precisely as in the previous experiment.

Representative results

The final regression lines obtained under each risk setting by running the learning procedure just described are plotted along with the data in Figure 5. Colors correspond to the individual risk function choices within each risk family, as denoted by color bars under each plot. The gray regression line denotes the common initial value used by all methods. Algorithm outputs using CVaR and $\chi^{2}$ -DRO are always at least as sensitive to outliers as the ordinary least-squares (OLS) solution. While the tilted risk does let us interpolate between lower and upper quantiles, this transition is not smooth; even trying 20 values within the small window of $\gamma\in[-0.025,0.025]$ , the algorithm outputs essentially jump between two extremes. This difference is particularly lucid in the bottom plots of Figure 5, where we have a high-leverage point. In contrast, with all other parameters of T-risk fixed, note that just tweaking $\alpha$ gives us a remarkable degree of flexibility to control the final output; while the base loss is fixed to the squared error, the regression lines range from those that ignore outliers to those that are very sensitive to outliers. In the high-leverage case, it is well-known that this cannot be achieved by simply changing to a different convex base loss (e.g., MAE instead of OLS), giving a concise illustration of the flexibility inherent in the T-risk class. Algorithm behavior can be sensitive to initial value; see §B.5 and Figure 13 in the appendix for an example where the naive median-based procedure described here causes gradient-based algorithms to stall out.

5.3 Distribution control under SGD on benchmark datasets

Experiment setup

Finally we consider tests using datasets and models which are orders of magnitude larger than the previous two experimental setups. We use several well-known benchmark datasets for multi-class classification; details are given in §B.5. All features are normalized to the unit interval (categorical variables are one-hot), and scores for each class are computed by a linear combination of features. As a base loss, we use the multi-class logistic regression loss. Here we investigate how a stochastic gradient descent implementation (with averaging) of the empirical T-risk minimization (joint in $(h,\theta)$ ) behaves as we control the shape parameter $\alpha$ of the Barron class, with $\sigma=0.99$ and $\eta=1$ fixed throughout. We record the base loss (logistic loss) and zero-one loss (misclassification error) at the start and end of each epoch. We also record the full base loss distribution after the last epoch concludes. We run 5 independent trials, in which the data is shuffled and initial parameters are determined randomly. In each trial, for all datasets and methods, we use a mini-batch size of 32, and we run 30 epochs. As reference algorithms, we consider “vanilla” ERM, using the traditional risk $\operatorname{\mathbf{E}}_{\mu}\operatorname{\mathsf{L}}(h)$ , and mean-variance implemented as a special case of T-risk (with $\rho(\cdot)=(\cdot)^{2}/2$ and $\eta=1$ ). 80% of the data is used for training, 10% for validation, and 10% for testing. For each risk class, we try five different step sizes. Validation data is used to evaluate different step sizes and choose the best one for each risk setting. All the results we present here are based on loss values computed on the test set: solid lines represent averages taken over trials, and shaded areas denote $\pm$ standard deviation over trials.

Representative results

In Figures 6–7, we give results based on the following two datasets: “extended MNIST” (47 classes, balanced) [11], and “cover type” (7 classes, imbalanced) [9]. We plot the average and standard deviation of the base and zero-one losses as a function of epoch number, plus give a histogram of test (base) losses for a single trial, compared with the loss distribution incurred by a random initialization of the same model (left-most plot; gray is test, black is training). Colors are analogous to previous experiments, here evenly spaced over the allowable range ( $1\leq\alpha\leq 2$ ). It is clear how modifying the Barron dispersion function shape across this range lets us flexibly interpolate between the test (base) loss distribution achieved by a traditional ERM solution and a mean-variance solution, in terms of both the mean and standard deviation. This monotonicity (as a function of $\alpha$ ) is salient in the base loss, but this does not always appear for the zero-one loss. Since these datasets have normalized features with negligible label noise, egregious outliers are rare, and thus the trends observed for the mean and standard deviation here also hold for outlier-resistant location-dispersion pairs such as the median and the median-absolute-deviations about the median. Similar results for several other datasets are provided in §B.6, and we remark that the key trends hold across all datasets tested. We have seen in our previous experiments how under heavy-tailed losses/gradients the T-risk solution can differ greatly from that of the vanilla ERM solution, so it is interesting to observe how under large, normalized, clean classification datasets, the T-risk allows us to very smoothly control a tradeoff between average test loss and variance on the test set.

Acknowledgments

This work was supported by JST ACT-X Grant Number JPMJAX200O and JST PRESTO Grant Number JPMJPR21C6.

References

Albert and Anderson, [1984] Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10.
Artzner et al., [1999] Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9(3):203–228.
Ash and Doléans-Dade, [2000] Ash, R. B. and Doléans-Dade, C. A. (2000). Probability and Measure Theory. Academic Press, 2nd edition.
Barron, [2019] Barron, J. T. (2019). A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4331–4339.
Bassett, [1987] Bassett, Jr, G. W. (1987). The St. Petersburg paradox and bounded utility. History of Political Economy, 19(4):517–523.
Ben-Tal et al., [2013] Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
Ben-Tal and Teboulle, [1986] Ben-Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Management Science, 32(11):1445–1466.
Ben-Tal and Teboulle, [2007] Ben-Tal, A. and Teboulle, M. (2007). An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449–476.
Blackard and Dean, [1999] Blackard, J. A. and Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151.
Boyd and Vandenberghe, [2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Cohen et al., [2017] Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373v2.
Şimşekli et al., [2019] Şimşekli, U., Sagun, L., and Gürbüzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 5827–5837.
Curi et al., [2020] Curi, S., Levy, K. Y., Jegelka, S., and Krause, A. (2020). Adaptive sampling for stochastic risk-averse learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 1036–1047.
Cutkosky and Mehta, [2021] Cutkosky, A. and Mehta, H. (2021). High-probability bounds for non-convex stochastic optimization with heavy tails. arXiv preprint arXiv:2106.14343v2.
Davis and Drusvyatskiy, [2019] Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239.
Devroye et al., [1996] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.
Duchi and Namkoong, [2018] Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750v6.
Duchi and Namkoong, [2019] Duchi, J. and Namkoong, H. (2019). Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1–55.
Föllmer and Knispel, [2011] Föllmer, H. and Knispel, T. (2011). Entropic risk measures: Coherence vs. convexity, model ambiguity and robust large deviations. Stochastics and Dynamics, 11(02n03):333–351.
Ghadimi et al., [2016] Ghadimi, S., Lan, G., and Zhang, H. (2016). Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305.
Gneiting, [2011] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
Gorbunov et al., [2020] Gorbunov, E., Danilova, M., and Gasnikov, A. (2020). Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Gorbunov et al., [2021] Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., and Gasnikov, A. (2021). Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise. arXiv preprint arXiv:2106.05958.
Gotoh et al., [2018] Gotoh, J.-y., Kim, M. J., and Lim, A. E. (2018). Robust empirical optimization is almost the same as mean–variance optimization. Operations Research Letters, 46(4):448–452.
Hacking, [2006] Hacking, I. (2006). The emergence of probability: a philosophical study of early ideas about probability, induction and statistical inference. Cambridge University Press, 2nd edition.
Hashimoto et al., [2018] Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. (2018). Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1929–1938.
Haussler, [1992] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150.
[28] Holland, M. J. (2021a). Designing off-sample performance metrics. arXiv preprint arXiv:2110.04996v1.
[29] Holland, M. J. (2021b). Robustness and scalability under heavy tails, without strong convexity. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research.
Holland, [2022] Holland, M. J. (2022). Learning with risks based on M-location. Machine Learning.
Holland and Haress, [2021] Holland, M. J. and Haress, E. M. (2021). Learning with risk-averse feedback under potentially heavy tails. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research.
Huber and Ronchetti, [2009] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. John Wiley & Sons, 2nd edition.
Kashima, [2007] Kashima, H. (2007). Risk-sensitive learning via minimization of empirical conditional value-at-risk. IEICE Transactions on Information and Systems, 90(12):2043–2052.
Kearns and Vazirani, [1994] Kearns, M. J. and Vazirani, U. V. (1994). An Introduction to Computational Learning Theory. MIT Press.
Koltchinskii, [1997] Koltchinskii, V. I. (1997). $M$ -estimation, convexity and quantiles. The Annals of Statistics, pages 435–477.
Lee et al., [2020] Lee, J., Park, S., and Shin, J. (2020). Learning bounds for risk-sensitive learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 13867–13879.
[37] Li, T., Beirami, A., Sanjabi, M., and Smith, V. (2021a). On tilted losses in machine learning: Theory and applications. arXiv preprint arXiv:2109.06141v1.
[38] Li, T., Beirami, A., Sanjabi, M., and Smith, V. (2021b). Tilted empirical risk minimization. In The 9th International Conference on Learning Representations (ICLR).
Luenberger, [1969] Luenberger, D. G. (1969). Optimization by Vector Space Methods. John Wiley & Sons.
Minsker, [2019] Minsker, S. (2019). Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523v4.
Mohri et al., [2012] Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.
Nazin et al., [2019] Nazin, A. V., Nemirovsky, A. S., Tsybakov, A. B., and Juditsky, A. B. (2019). Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607–1627.
Nesterov, [2004] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
Penot, [2012] Penot, J.-P. (2012). Calculus Without Derivatives, volume 266 of Graduate Texts in Mathematics. Springer.
Prashanth et al., [2020] Prashanth, L. A., Jagannathan, K., and Kolla, R. K. (2020). Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 5577–5586.
Rockafellar and Uryasev, [2000] Rockafellar, R. T. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42.
Rockafellar and Uryasev, [2002] Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471.
Rockafellar and Uryasev, [2013] Rockafellar, R. T. and Uryasev, S. (2013). The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2):33–53.
Rockafellar et al., [2006] Rockafellar, R. T., Uryasev, S., and Zabarankin, M. (2006). Generalized deviations in risk analysis. Finance and Stochastics, 10(1):51–74.
Rousseeuw and Christmann, [2003] Rousseeuw, P. J. and Christmann, A. (2003). Robustness against separation and outliers in logistic regression. Computational Statistics & Data Analysis, 43(3):315–332.
Rousseeuw and Leroy, [1987] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons.
Ruszczyński and Shapiro, [2006] Ruszczyński, A. and Shapiro, A. (2006). Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452.
Shalev-Shwartz and Ben-David, [2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Sutton and Barto, [2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
Takeda and Sugiyama, [2008] Takeda, A. and Sugiyama, M. (2008). $\nu$ -support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1056–1063.
Takeuchi et al., [2006] Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264.
Van Rooyen et al., [2016] Van Rooyen, B., Menon, A., and Williamson, R. C. (2016). Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 10–18.
Vapnik, [1999] Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2nd edition.
Venables and Ripley, [2002] Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, 4th edition.
Zhai et al., [2021] Zhai, R., Dan, C., Kolter, J. Z., and Ravikumar, P. (2021). DORO: Distributional and outlier robust optimization. In 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 12345–12355.
Zhang et al., [2020] Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. (2020). Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, volume 33, pages 15383–15393.

Appendix A Appendix summary

For ease of reference, we start the appendix with a concise tables of contents.

•
§B: Supplementary information
- –
  
  §B.1: Details for tilted risk
- –
  
  §B.2: Details for Barron class limits
- –
  
  §B.3: Additional lemmas for Barron class and T-risk
- –
  
  §B.4: Smoothness of the T-risk
- –
  
  §B.5: Experimental details
- –
  
  §B.6: Additional figures
•
§C: Detailed proofs
- –
  
  §C.1: Proofs of results in the main text
- –
  
  §C.2: Smoothness computations (proof of Lemma 12)
•
§D: Additional technical facts
- –
  
  §D.1: Lipschitz properties
- –
  
  §D.2: Convexity
- –
  
  §D.3: Derivatives for the Barron class
- –
  
  §D.4: Elementary inequalities
- –
  
  §D.5: Expected dispersion is coercive

Appendix B Supplementary information

Here we provide additional details that did not fit into the main body of the paper.

B.1 Details for tilted risk

Let $\mathsf{X}\sim\mu$ be an arbitrary random variable. Assuming the distribution is such that we can differentiate through the integral, we have

\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\left[\theta+\frac{1}{\gamma}\left(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta)}-1\right)\right]=0\iff\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta)}=1.

(24)

Let $\theta^{\ast}$ be any value that satisfies the first-order optimality condition in (24). It follows that

\displaystyle\theta^{\ast}+\frac{1}{\gamma}\left(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta^{\ast})}-1\right)=\theta^{\ast}.

It is easy to confirm that setting $\theta^{\ast}=(1/\gamma)\log(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma\mathsf{X}})$ gives us a valid solution. For more background, see the recent works of Li et al., 2021b [38], Li et al., 2021a [37] and the references therein. Note also that this log-exponential criterion appears (with $\gamma=1$ ) in Rockafellar and Uryasev, [48, Ex. 8].

B.2 Details for Barron class limits

For the limit as $\alpha\to 0$ , use the fact that for any $a>0$ , we have

\displaystyle\lim\limits_{x\to 0_{+}}\frac{(a^{x}-1)}{x}=\log(a).

(25)

This equality is sometimes known as Halley’s formula. For the limit as $\alpha\to-\infty$ , first note that for any $\alpha<0$ we can write $\lvert{2-\alpha}\rvert=2+\lvert{\alpha}\rvert$ , and thus $\lvert{\alpha}\rvert/2=(\lvert{2-\alpha}\rvert/2)-1$ . With this in mind, for any $\alpha<0$ we can observe

	$\displaystyle\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\alpha/2}=\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{-\lvert{\alpha}\rvert/2}$	$\displaystyle=\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{1-(\lvert{2-\alpha}\rvert/2)}$
		$\displaystyle=\frac{\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)}{\sqrt{\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\lvert{\alpha-2}\rvert}}}$
		$\displaystyle\to\frac{1}{\sqrt{\exp(x^{2})}}=\exp(-x^{2}/2),$

where the limit is taken as $\alpha\to-\infty$ , and follows from the classical limit characterization of the exponential function. For the limit as $\alpha\to 2_{-}$ , first note that

\displaystyle\lvert{\alpha-2}\rvert\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\alpha/2}=\left(\lvert{\alpha-2}\rvert^{2/\alpha}+\lvert{\alpha-2}\rvert^{(2/\alpha)-1}x^{2}\right)^{\alpha/2}

and that as long as $\alpha<2$ , we can write

\displaystyle\lvert{\alpha-2}\rvert^{(2/\alpha)-1}=(2-\alpha)^{(2-\alpha)/\alpha}=\left(u^{u}\right)^{1/(2-u)}

where we have introduced $u\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=2-\alpha$ . Taking $\alpha\to 2_{-}$ amounts to $u\to 0_{+}$ , and thus using the fact that $u^{u}\to 1$ as $u\to 0_{+}$ , the desired result follows from straightforward analysis.

B.3 Additional lemmas for Barron class and T-risk

Lemma 8 (Dispersion function convexity and smoothness).

Consider $\rho_{\sigma}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma;\alpha)$ with $\rho(\cdot;\alpha)$ from the Barron class (15), with parameter $-\infty\leq\alpha\leq 2$ . The following properties hold for any choice of $\sigma>0$ .

•

Case of $\alpha=2$ :
$\rho_{\sigma}$ is convex and $1/\sigma^{2}$ -smooth on $\mathbb{R}$ .
•

Case of $\alpha=0$ :
$\rho_{\sigma}$ is convex on $[-\sqrt{2}\sigma,\sqrt{2}\sigma]$ , and is $1/(\sqrt{2}\sigma)$ -Lipschitz and $1/\sigma^{2}$ -smooth on $\mathbb{R}$ .
•

Case of $\alpha=-\infty$ :
$\rho_{\sigma}$ is convex on $[-\sigma,\sigma]$ , and is $(1/\sigma)\exp(-1/2)$ -Lipschitz and $1/\sigma^{2}$ -smooth on $\mathbb{R}$ .
•

Otherwise:
$\rho_{\sigma}$ is $1/\sigma^{2}$ -smooth on $\mathbb{R}$ . When $\alpha\geq 1$ , $\rho_{\sigma}$ is convex on $\mathbb{R}$ . When $\alpha=1$ , $\rho_{\sigma}$ is also $1/\sigma$ -Lipschitz on $\mathbb{R}$ . Else, when $\alpha<1$ , we have that $\rho_{\sigma}$ is convex between $\pm\sigma\sqrt{\lvert{\alpha-2}\rvert/(1-\alpha)}$ , and is $(1/\sigma)(\sqrt{(1-\alpha)/\lvert{\alpha-2}\rvert})^{1-\alpha}$ -Lipschitz on $\mathbb{R}$ .

Furthermore, all these coefficients are tight (see also Figure 9).

The dispersion $\operatorname{D}_{\rho}(h;\theta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\rho(\operatorname{\mathsf{L}}(h)-\theta)$ plays a prominent role in the risk definitions considered in this paper, and one is naturally interested in the properties of the map $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)$ . The following lemma shows that using $\rho=\rho_{\sigma}$ from the Barron class, we can differentiate under the integral without needing any additional conditions beyond those required for finiteness.

Lemma 9.

Let $\rho_{\sigma}$ be as in Lemma 8. Assume that the random loss $\operatorname{\mathsf{L}}(h)$ is $\mathcal{F}$ -measurable in general, and that $\operatorname{\mathbf{E}}_{\mu}\lvert\operatorname{\mathsf{L}}(h)\rvert<\infty$ holds whenever $1<\alpha\leq 2$ . It follows that the first two derivatives are $\mu$ -integrable, namely that

\displaystyle\lvert\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)\rvert<\infty,\qquad\lvert\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)\rvert<\infty

(26)

for any $\theta\in\mathbb{R}$ . Furthermore, the function $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)$ is twice-differentiable on $\mathbb{R}$ , and satisfies the Leibniz integration property for both derivatives, that is

\displaystyle\left.\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta)\right|_{\theta=u}=-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-u),\qquad\left.\frac{\mathop{}\!\mathrm{d}^{2}}{\mathop{}\!\mathrm{d}\theta^{2}}\operatorname{D}_{\rho}(h;\theta)\right|_{\theta=u}=\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-u)

(27)

for any $u\in\mathbb{R}$ .⁶⁶6Let us emphasize that ${\rho}^{\prime}_{\sigma}$ and ${\rho}^{\prime\prime}_{\sigma}$ denote the first and second derivatives of $x\mapsto\rho(x/\sigma;\alpha)$ , which differ from ${\rho}^{\prime}(x/\sigma;\alpha)$ and ${\rho}^{\prime\prime}(x/\sigma;\alpha)$ by a $\sigma$ -dependent factor; see §D.3 for details.

With first-order information about the expected dispersion in hand, one can readily obtain conditions under which the special case of T-risk $\underline{\operatorname{R}}_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\inf_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta)$ is finite and determined by a meaningful “optimal threshold.”

Lemma 10.

Following the setup of Lemma 1, let $\operatorname{D}_{\rho}(h;\theta)<\infty$ for all $\theta\in\mathbb{R}$ . If $\alpha>1$ , then for any choice of $\sigma>0$ and $\eta\in\mathbb{R}$ , there exists a finite optimal threshold $\theta_{\rho}(h;\eta)\in\mathbb{R}$ such that

\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)=\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta)).

(28)

In the case of $\alpha=1$ , we have that (28) holds if and only if $\lvert{\eta}\rvert<1/\sigma$ . In both cases, the optimal threshold is unique. In the case of $\alpha<1$ , then there is no minimum and thus $\underline{\operatorname{R}}_{\rho}(h;\eta)=-\infty$ .

Remark 11.

We have overloaded our notation $\theta_{\rho}(h;\eta)$ in Lemma 10, recalling that we have used the same notation to denote the set of optimal thresholds for the T-risk in (11). This saves us from having to introduce additional symbols, and should not lead to any confusion since we only do this overloading when the solution set contains a single unique solution.

B.4 Smoothness of the T-risk

When the objective $\operatorname{R}_{\rho}(h;\theta,\eta)$ in (11) is sufficiently smooth, we can apply well-established analytical techniques to control the gradient norms of stochastic gradient-based learning algorithms. Assuming we have unbiased first-order stochastic feedback as in (17)–(18), we will always have to deal with terms of the form $\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h)\right]$ . Defining $f(h,\theta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}={\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h)$ for readability, and considering the function difference at two arbitrary points $(h_{1},\theta_{1})$ and $(h_{2},\theta_{2})$ , first note that

	$\displaystyle f$	$\displaystyle(h_{1},\theta_{1})-f(h_{2},\theta_{2})$
		$\displaystyle=\underbrace{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})\left[{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}_{A}+\underbrace{\left[{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})\right]{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}_{B}.$		(29)

In the case of $\rho(\cdot)=\rho(\cdot;\alpha)$ from the Barron class (15), when $-\infty\leq\alpha\leq 1$ , we have that ${\rho}^{\prime}$ is both bounded ( $\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty$ ) and Lipschitz continuous on $\mathbb{R}$ (see Lemma 8). This means that all we need in order to control $\operatorname{\mathbf{E}}_{\mu}A+\operatorname{\mathbf{E}}_{\mu}B$ is for $\operatorname{\mathsf{L}}$ to be smooth (for control of $A$ ) and for ${\operatorname{\mathsf{L}}}^{\prime}(\cdot)$ to have a norm bounded over $\mathcal{H}$ (for control of $B$ ); see §C.2 for more details. Things are more difficult in the case of $1<\alpha\leq 2$ , since the dispersion function $\rho$ is not (globally) Lipschitz, meaning that $\lVert{{\rho}^{\prime}}\rVert_{\infty}=\infty$ . Even if $\operatorname{\mathsf{L}}$ is smooth, when the threshold parameter is left unconstrained, it will always be possible for $\lVert{\operatorname{\mathbf{E}}_{\mu}A}\rVert\to\infty$ as $\lvert{\theta_{1}}\rvert\to\infty$ , impeding smoothness guarantees for $\operatorname{R}_{\rho}$ in this setting.

Let us proceed by distilling the preceding discussion into a set of concrete conditions that are sufficient to make $(h,\theta)\mapsto\operatorname{R}_{\rho}(h;\theta,\eta)$ amenable to standard analysis techniques for stochastic gradient-based algorithms. For readability, we write $\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\sup\{\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert:h\in\mathcal{H}\}$ .

A1.

Moment bound for loss gradient. For any choice of $h_{1},h_{2}\in\mathcal{H}$ , $0<c<1$ , and $k\in\{1,2\}$ , the loss $\operatorname{\mathsf{L}}$ is differentiable at $ch_{1}+(1-c)h_{2}$ , and satisfies

\displaystyle\operatorname{\mathbf{E}}_{\mu}\left(\sup_{0<c<1}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(ch_{1}+(1-c)h_{2})}\rVert\right)^{k}\leq\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{k}<\infty.

(30)

A2.

Loss is smooth in expectation. There exists $0<\lambda_{1}<\infty$ such that for any choice of $h_{1},h_{2}\in\mathcal{H}$ , we have $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert\leq\lambda_{1}\lVert{h_{1}-h_{2}}\rVert$ .
A3.

Dispersion is Lipschitz and smooth. The function $\rho$ is such that $\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty$ , and there exists $0<\lambda_{2}<\infty$ such that $\lvert{{\rho}^{\prime}(x_{1})-{\rho}^{\prime}(x_{2})}\rvert\leq\lambda_{2}\lvert{x_{1}-x_{2}}\rvert$ for all $x_{1},x_{2}\in\mathbb{R}$ .

If $\mathcal{H}$ is a convex set, then the first inequality in A1. holds trivially. Note that under A2., the right-hand side of (30) will be finite for $k=1$ whenever $\mathcal{H}$ has bounded diameter and $\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert<\infty$ for some $h\in\mathcal{H}$ . As for A3., all the requirements are clearly met by the Barron class with $-\infty\leq\alpha\leq 1$ . These conditions imply a Lipschitz property for the gradients, as summarized in the following lemma.

Lemma 12.

Let the conditions A1., A2., and A3. hold. Then, the T-risk map $(h,\theta)\mapsto\operatorname{R}_{\rho}(h;\theta,\eta)$ defined in (11) is smooth on $\mathcal{H}\times\mathbb{R}$ in the sense that

\displaystyle\lVert{{\operatorname{R}}^{\prime}_{\rho}(h_{1};\theta_{1},\eta)-{\operatorname{R}}^{\prime}_{\rho}(h_{2};\theta_{2},\eta)}\rVert

\displaystyle\leq\left(\frac{\lambda_{5}}{\sigma}+\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)

for any choice of $h_{1},h_{2}\in\mathcal{H}$ and $\theta_{1},\theta_{2}\in\mathbb{R}$ . Here the factor $\lambda_{5}$ is defined $\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}$ , where

\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right],\qquad\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}.

Remark 13 (Norm on product spaces).

In Lemma 12 we have to deal with norms on product spaces, and in all cases we just use the traditional choice of summing the norms of the constituent elements, i.e., on $\mathcal{H}\times\mathbb{R}$ , we have $\lVert{(h,\theta)}\rVert\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lVert{h}\rVert+\lvert{\theta}\rvert$ . Similarly, we have that the gradient ${\operatorname{R}}^{\prime}_{\rho}(h;\theta,\eta)=(\partial_{h}\operatorname{R}_{\rho}(h;\theta,\eta),\partial_{\theta}\operatorname{R}_{\rho}(h;\theta,\eta))$ , a pair of linear functionals. As such, the norm of ${\operatorname{R}}^{\prime}_{\rho}(h;\theta,\eta)$ defined as the sum of the norms of these two constituent functionals.

Proving Lemma 12 is straightforward but somewhat tedious. Detailed computations as well as a direct proof are organized in §C.2 for easy reference.

B.5 Experimental details

Here we provide some additional details for the empirical analysis carried out in §3.3 and §5. Detailed hyperparameter settings and seeds for exact re-creation of all the results in this paper are available at the GitHub repository cited in §1, and thus here we will not explicitly write all the step sizes, shape settings, etc., but rather focus on concise exposition of points on which we expect readers to desire clarification.

Static risk analysis

For our experiments in §3.3, we gave just one plot using a log-Normal distribution, but analogous tests were run for a wide variety of parametric distributions. In total, we have run tests for Bernoulli, Beta, $\chi^{2}$ , Exponential, Gamma, log-Normal, Normal, Pareto, Uniform, Wald, and Weibull distributions. The settings of each distribution to be sampled from has not been tweaked at all; we set the parameters rather arbitrarily before running any tests. As for the fixed value of $\sigma=0.5$ in the T-risk across all tests, we tested several values of $\sigma$ ranging from $0.1$ to $10$ , and based on the rough position of $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ in the Normal case, we determined $0.5$ as a reasonable representative value; indeed, this settings performs quite well across a very wide range of distributions. Regarding the optimization involved in solving for the optimal threshold $\theta$ (for T-risk, OCE risks, and DRO risk), we use minimize_scalar from SciPy, with bounded solver type, and valid brackets set manually.

Noisy linear classification

In the tests described in §5.1, we only give error/norm trajectories for “representative” settings of each risk class (Figure 4). For T-risk we consider different choices of $1\leq\alpha\leq 2$ , for CVaR we consider $0.025\leq\beta\leq 0.75$ , for tilted risk we consider $\gamma$ between $\pm 0.05$ , and for $\chi^{2}$ -DRO we consider $0.025\leq\widetilde{a}\leq 0.35$ . For each of these ranges, we evaluate five evenly-spaced candidates (via np.linspace), and representative settings were selected as those which achieved the best classification error (average zero-one loss) after the final iteration. In the event of ties, the smaller hyperparameter value was always selected (via np.argmin).

Regression under outliers

For the tests introduced in §5.2, we have given results for learning algorithms started at a point which is quite accurate for the majority of the data points, but incurs extremely large errors on the outlying minority. This choice of initial value naturally has a strong impact on the behavior of learning algorithms under each risk. For some perspective, in Figure 13 we provide results using a different initial value (again, in gray), which complement Figure 5. Since the naive strategy fixing $\theta$ at the initial median sets the scale extremely large and close to the loss incurred at most points, even a large number of gradient-based updates result in minimal change. The basic reason for this is quite straightforward. Since the T-risk gradient is modulated by $\rho_{\sigma}^{\prime}(\operatorname{\mathsf{L}}(h)-\theta)$ , and all points are such that either $\operatorname{\mathsf{L}}_{i}(h)\ll\theta$ (the minority) or $\operatorname{\mathsf{L}}_{i}(h)\approx\theta$ (the majority), both cases shrink the norm of the update direction vector $\alpha\leq 1$ . When implementing T-risk in the more traditional way (jointly in $(h,\theta)$ ) and choosing $\alpha\in[1,2]$ , we see results that are very similar to CVaR. Finally, we remark that here we have set the hyperparameter ranges with upper bounds low enough that the learning procedure described here does not run into overflow errors.

Classification under larger benchmark datasets

In §5.3 we make use of several well-known benchmark datasets, and in our figures we identify them respectively by the following keywords: adult,⁷⁷7https://archive.ics.uci.edu/ml/datasets/Adult australian,⁸⁸8https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval) cifar10,⁹⁹9https://www.cs.toronto.edu/~kriz/cifar.html covtype,¹⁰¹⁰10https://archive.ics.uci.edu/ml/datasets/covertype emnist_balanced,¹¹¹¹11https://www.nist.gov/itl/products-and-services/emnist-dataset fashion_mnist,¹²¹²12https://github.com/zalandoresearch/fashion-mnist and protein.¹³¹³13https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data For further background on all of these datasets, please access the URLs provided in the footnotes. As mentioned in the main text, for a $k$ -class problem with $d$ features, the predictor is characterized by $k$ weighting vectors $(w_{1},\ldots,w_{k})$ , each of which is $w_{j}\in\mathbb{R}^{d}$ and computes scores as ${w}^{\top}_{j}x$ for $x\in\mathbb{R}^{d}$ . These weight vectors are penalized using the usual multi-class logistic loss, namely the negative log-likelihood of the $k$ -Categorical distribution that arises after passing these scores through the (logistic) softmax function. Regarding step sizes, as mentioned in the main text, we consider choices of factor $c\in\{0.1,0.5,1.0,1.5,2.0\}$ , and set step size to $c/\sqrt{kd}$ , where $d$ and $k$ are as just stated. In Figures 14–18 of §B.6, we provide additional results for the datasets not covered in Figures 6–7 from the main text.

B.6 Additional figures

Here we include a number of figures that complement those provided in the main text. A brief summary is given below.

•

Figures 10–11 are additional results from the experiments described in §5.1, here using logistic and hinge losses instead of the unhinged loss.
•

Figures 12–13 are related to the regression under outliers task described in §5.2. The first figure shows how different regression lines incur very different loss distributions under different convex base loss functions. The second figure illustrates how a different initial value impacts the learned regression lines under each risk class.
•

Figures 14–18 are all completely analogous to Figure 6 given in the main text of §5.3, but for additional benchmark datasets.

Appendix C Detailed proofs

C.1 Proofs of results in the main text

Proof of Lemma 8.

In this proof, without further mention, we will make regular use of the following two helper results: Lemma 14 (bounded gradient implies Lipschitz continuity) and Lemma 15 (positive definite Hessian implies convexity). For reference, the first and second derivatives of $\rho_{\sigma}$ are given in §D.3. We take up each $\alpha$ setting one at a time.

First, the case of $\alpha=2$ . For this case, clearly ${\rho}^{\prime}_{\sigma}$ is unbounded, and thus $\rho_{\sigma}$ is not (globally) Lipschitz on $\mathbb{R}$ . On the other hand, since ${\rho}^{\prime\prime}_{\sigma}(x)=1/\sigma^{2}$ , we have that ${\rho}^{\prime}_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=(1/\sigma^{2})$ .

Next, the case of $\alpha=0$ . For any fixed $\sigma>0$ , in both the limits $x\to 0$ and $\lvert x\rvert\to\infty$ , we have ${\rho}^{\prime}_{\sigma}(x)\to 0$ . Maximum and minimum values are achieved when ${\rho}^{\prime\prime}_{\sigma}(x)=0$ , and this occurs if and only if $x^{2}=2\sigma^{2}$ . It follows from direct computation that ${\rho}^{\prime}_{\sigma}(\pm\sqrt{2}\sigma)=\pm 1/(\sqrt{2}\sigma)$ , and thus $\rho_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=1/(\sqrt{2}\sigma)$ . Next, recalling that ${\rho}^{\prime\prime}_{\sigma}$ takes the form

\displaystyle{\rho}^{\prime\prime}_{\sigma}(x)=\frac{2}{x^{2}+2\sigma^{2}}\left(1-\frac{2x^{2}}{x^{2}+2\sigma^{2}}\right),

we see that this is a product of two factors, one taking values in $(0,1/\sigma^{2})$ , and one taking values in $(-1,1)$ . The absolute value of both of these factors is maximized when $x=0$ , and so $\lvert{\rho}^{\prime\prime}_{\sigma}(x)\rvert\leq{\rho}^{\prime\prime}_{\sigma}(0)=1/\sigma^{2}$ , meaning that ${\rho}^{\prime}_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=1/\sigma^{2}$ . Finally, regarding convexity, we have that ${\rho}^{\prime\prime}_{\sigma}(x)\geq 0$ if and only if $\lvert x\rvert\leq\sqrt{2}\sigma$ .

Next, the case of $\alpha=-\infty$ . For any fixed $\sigma>0$ , we have ${\rho}^{\prime}_{\sigma}(x)\to 0$ in both the limits $x\to 0$ and $\lvert x\rvert\to\infty$ . Furthermore, it is immediate that ${\rho}^{\prime\prime}_{\sigma}(x)=0$ at the points $x=\pm\sigma$ . Evaluating ${\rho}^{\prime}_{\sigma}$ at these stationary points we have ${\rho}^{\prime}_{\sigma}(\pm\sigma)=\pm(1/\sigma)\exp(-1/2)$ , and thus $\rho_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=(1/\sigma)\exp(-1/2)$ . Regarding bounds on ${\rho}^{\prime\prime}_{\sigma}$ , first note that ${\rho}^{\prime\prime}_{\sigma}(x)\to 0$ as $\lvert x\rvert\to\infty$ , and ${\rho}^{\prime\prime}_{\sigma}(0)=1/\sigma^{2}$ . Then to identify stationary points, note that

\displaystyle\rho_{\sigma}^{\prime\prime\prime}(x)=\frac{1}{\sigma^{2}}\exp\left(-\frac{1}{2}\left(\frac{x}{\sigma}\right)^{2}\right)\left[\frac{x}{\sigma^{2}}\left(\frac{x^{2}}{\sigma^{2}}-1\right)-\frac{2x}{\sigma^{2}}\right]

and thus $\rho_{\sigma}^{\prime\prime\prime}(x)=0$ if and only if $(x/\sigma)^{2}-1=2$ , i.e., the stationary points are $x=\pm\sqrt{3}\sigma$ , both of which yield the same value, namely ${\rho}^{\prime\prime}_{\sigma}(\pm\sqrt{3}\sigma)=-(2/\sigma^{2})\exp(-3/2)$ . Since $2\exp(-3/2)\approx 0.45<1$ , we conclude that ${\rho}^{\prime}_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=1/\sigma^{2}$ . Finally, since ${\rho}^{\prime\prime}_{\sigma}(x)\geq 0$ if and only if $\lvert x\rvert\leq\sigma$ , this specifies the region on which $\rho_{\sigma}$ is convex.

Finally, all that remains is the general case of $-\infty<\alpha<2$ where $\alpha\neq 0$ . Note that in order for ${\rho}^{\prime\prime}_{\sigma}(x)=0$ to hold, we require

\displaystyle\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}=\frac{\lvert{\alpha-2}\rvert}{1-(\alpha/2)},

which via some basic algebra is equivalent to

\displaystyle\left(\frac{x}{\sigma}\right)^{2}=\frac{\lvert{\alpha-2}\rvert}{1-\alpha}.

Clearly, this is only possible when $\alpha<1$ , so we consider this sub-case first. This implies stationary points $\pm x^{\ast}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\pm\sigma\sqrt{\lvert{\alpha-2}\rvert/(1-\alpha)}$ , for which we have

\displaystyle{\rho}^{\prime}_{\sigma}(\pm x^{\ast})=\pm\frac{1}{\sigma}\sqrt{\frac{\lvert{\alpha-2}\rvert}{1-\alpha}}\left(\frac{2-\alpha}{1-\alpha}\right)^{(\alpha/2)-1}=\pm\frac{1}{\sigma}\left(\sqrt{\frac{\lvert{\alpha-2}\rvert}{1-\alpha}}\right)^{\alpha-1}=\pm\frac{1}{\sigma}\left(\sqrt{\frac{1-\alpha}{\lvert{\alpha-2}\rvert}}\right)^{1-\alpha}.

Since ${\rho}^{\prime}_{\sigma}(x)\to 0$ in both the limits $x\to 0$ and $\lvert x\rvert\to\infty$ , we have obtained a maximum value for ${\rho}^{\prime}_{\sigma}$ at $x^{\ast}$ , thus implying for the case of $\alpha<1$ that $\rho_{\sigma}$ is $\lambda$ -Lipschitz, with a coefficient of $\lambda=(1/\sigma)(\sqrt{(1-\alpha)/\lvert{\alpha-2}\rvert})^{1-\alpha}$ . For the case of $\alpha=1$ , direct inspection shows

\displaystyle\lvert{\rho}^{\prime}_{\sigma}(x)\rvert=\frac{\lvert{x/\sigma^{2}}\rvert}{\sqrt{1+(x/\sigma)^{2}}}=\frac{1}{\sigma^{2}\sqrt{(1/x^{2})+(1/\sigma^{2})}},

a value which is maximized in the limit $\lvert x\rvert\to\infty$ . As such, for $\alpha=1$ , we have that $\rho_{\sigma}$ is $\lambda$ -Lipschitz with $\lambda=1/\sigma$ . For the case of $1<\alpha<2$ , ${\rho}^{\prime}_{\sigma}$ is unbounded. To see this, note that for $x>0$ we have

\displaystyle{\rho}^{\prime}_{\sigma}(x)=\frac{1}{\sigma^{2}}\frac{\left(1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert\right)}{\left((1/x)+x/(\sigma^{2}\lvert{\alpha-2}\rvert)\right)}\geq\left(\frac{1}{\sigma^{2+\alpha}\sqrt{\lvert{\alpha-2}\rvert^{\alpha}}}\right)\frac{x^{\alpha}}{\left((1/x)+x/(\sigma^{2}\lvert{\alpha-2}\rvert)\right)},

and since $\alpha>1$ , sending $x\to\infty$ clearly implies ${\rho}^{\prime}_{\sigma}(x)\to\infty$ , and this means that $\rho_{\sigma}$ cannot be Lipschitz on $\mathbb{R}$ when $\alpha>1$ . As for bounds on ${\rho}^{\prime\prime}_{\sigma}$ , recall that

\displaystyle{\rho}^{\prime\prime}_{\sigma}(x)=\frac{1}{\sigma^{2}}\underbrace{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1}}_{A(x)}\left(1-\underbrace{\frac{1-(\alpha/2)}{\lvert{\alpha-2}\rvert}\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}}_{B(x)}\right)

where we have introduced the labels $A(x)$ and $B(x)$ just as convenient notation. Fixing any $\sigma>0$ , first note that since $\alpha<2$ , we have $(\alpha/2)-1<0$ and thus $0\leq A(x)\leq 1$ . Next, direct inspection shows $0\leq B(x)\leq 2(1-(\alpha/2))$ . These two facts immediately imply an upper bound ${\rho}^{\prime\prime}_{\sigma}(x)\leq 1/\sigma^{2}$ and a lower bound ${\rho}^{\prime\prime}_{\sigma}(x)\geq-(1-\alpha)/\sigma^{2}$ , both of which hold for any $\alpha<2$ . Furthermore, for the case of $1\leq\alpha<2$ , we thus have $0\leq{\rho}^{\prime\prime}_{\sigma}(x)\leq 1/\sigma^{2}$ . When $\alpha<1$ however, ${\rho}^{\prime\prime}_{\sigma}$ can be negative. To get matching lower bounds requires $A(x)(1-B(x))\geq-1$ , or $A(x)(B(x)-1)\leq 1$ . To study conditions under which this holds, first note that $B(x)$ can be re-written as

\displaystyle B(x)=\left(\frac{2-\alpha}{\lvert{\alpha-2}\rvert}\right)\frac{(x/\sigma)^{2}}{1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}}=\frac{(x/\sigma)^{2}}{1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}},

and thus we have

\displaystyle A(x)(B(x)-1)=\frac{(x/\sigma)^{2}}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}}-\frac{1}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{1-(\alpha/2)}}.

(31)

To get a more convenient upper bound on this, observe that $(1+x)^{1-(\alpha/2)}\leq(1+x)^{2-(\alpha/2)}$ for any $x\geq 0$ and $-\infty\leq\alpha\leq 2$ . It follows immediately that

\displaystyle A(x)(B(x)-1)\leq\frac{(x/\sigma)^{2}-1}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}}.

(32)

To get the right-hand side of (32) to be no greater than $1$ is equivalent to

\displaystyle(x/\sigma)^{2}-1\leq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}.

(33)

For the case of $0\leq\alpha<1$ , note that $1\leq\lvert{\alpha-2}\rvert=2-\alpha<2-(\alpha/2)$ , and using the helper inequality (69), we have

\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}\geq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{\lvert{\alpha-2}\rvert}\geq 1+(x/\sigma)^{2}>(x/\sigma)^{2}-1,

which implies (33) for $0\leq\alpha<1$ . All that remains is the case of $-\infty<\alpha<0$ , which requires a bit more care. Returning to the exact form of $A(x)(B(x)-1)$ given in (31), note that the inequality

\displaystyle(x/\sigma)^{2}-\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)\leq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}

(34)

is equivalent to the desired property, i.e., $A(x)(B(x)-1)\leq 1\iff\text{(\ref{eqn:basic_barron_3})}$ . Using Bernoulli’s inequality (71), we can bound the right-hand side of (34) as

\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}\geq 1+\left(\frac{2-(\alpha/2)}{\lvert{\alpha-2}\rvert}\right)(x/\sigma)^{2}.

Subtracting the left-hand side of (34) from the right-hand side of the preceding inequality, we obtain

	$\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}-\left[(x/\sigma)^{2}-1-\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right]$	$\displaystyle\geq 2+\left(\frac{2-(\alpha/2)}{\lvert{\alpha-2}\rvert}-1+\frac{1}{\lvert{\alpha-2}\rvert}\right)(x/\sigma)^{2}$
		$\displaystyle=2+\left(\frac{1-(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\right)(x/\sigma)^{2},$		(35)

where the second step uses the fact that for $\alpha<0$ , we can write $\lvert{\alpha-2}\rvert=2+\lvert{\alpha}\rvert$ and $2-(\alpha/2)=2+(\lvert{\alpha}\rvert/2)$ . Note that the right-hand side of (35) is non-negative for all $x\in\mathbb{R}$ whenever $-2\leq\alpha<0$ , which via (34) tells us that $A(x)(B(x)-1)\leq 1$ indeed holds in this case as well. For the case of $-\infty<\alpha<-2$ , note that showing (34) holds is equivalent to showing $f_{\alpha}(x)\geq 0$ for all $x\geq 0$ , where for convenience we define the polynomial

\displaystyle f_{\alpha}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=1+\left(\frac{1}{2+\lvert{\alpha}\rvert}-1\right)x+\left(1+\frac{x}{2+\lvert{\alpha}\rvert}\right)^{2+(\lvert{\alpha}\rvert/2)}.

The first derivative is

\displaystyle{f}^{\prime}_{\alpha}(x)=\frac{2+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\left(1+\frac{x}{2+\lvert{\alpha}\rvert}\right)^{1+(\lvert{\alpha}\rvert/2)}+\frac{1}{2+\lvert{\alpha}\rvert}-1,

and with this form in hand, solving for $f_{\alpha}(x)=0$ , it is straightforward to confirm that $x_{\alpha}^{\ast}$ given below is a stationary point:

\displaystyle x_{\alpha}^{\ast}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(2+\lvert{\alpha}\rvert\right)\left[\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-1\right].

Furthermore, it is clear that ${f}^{\prime\prime}_{\alpha}\geq 0$ , implying that $f_{\alpha}$ is convex, and that $x_{\alpha}^{\ast}$ is a minimum. As such, the minimum value taken by $f_{\alpha}$ on $\mathbb{R}_{+}$ is

	$\displaystyle f_{\alpha}(x_{\alpha}^{\ast})$	$\displaystyle=\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-\left(2+\lvert{\alpha}\rvert\right)\left[\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-1\right]+\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{2+(\lvert{\alpha}\rvert/2)}{1+(\lvert{\alpha}\rvert/2)}}$
		$\displaystyle=\left(2+\lvert{\alpha}\rvert\right)+\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{2+(\lvert{\alpha}\rvert/2)}{1+(\lvert{\alpha}\rvert/2)}}-\left(1+\lvert{\alpha}\rvert\right)\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}$
		$\displaystyle=\left(2+\lvert{\alpha}\rvert\right)-\left(1+\lvert{\alpha}\rvert\right)\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]$
		$\displaystyle=1+\left(1+\lvert{\alpha}\rvert\right)\left[1-\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]\right].$

We require $f_{\alpha}(x_{\alpha}^{\ast})\geq 0$ for all $-\infty<\alpha<-2$ . From the preceding equalities, note that a simple sufficient condition for $f_{\alpha}(x_{\alpha}^{\ast})\geq 1$ is

\displaystyle\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]\leq 1

or equivalently

\displaystyle\left(1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right)^{1+(\lvert{\alpha}\rvert/2)}\leq\left(\frac{2+(\lvert{\alpha}\rvert/2)}{1+\lvert{\alpha}\rvert}\right).

(36)

Applying the helper inequality (70) to the left-hand side of (36), we have

	$\displaystyle\left(1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right)^{1+(\lvert{\alpha}\rvert/2)}\leq 1+\frac{\left(\frac{-(1+(\lvert{\alpha}\rvert/2))}{2+(\lvert{\alpha}\rvert/2)}\right)}{\left(1-\frac{-\lvert{\alpha}\rvert/2}{2+(\lvert{\alpha}\rvert/2)}\right)}=1-\left(\frac{1+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\right)$	$\displaystyle=\frac{1+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}$
		$\displaystyle\leq\frac{2+(\lvert{\alpha}\rvert/2)}{1+\lvert{\alpha}\rvert}.$

This is precisely the desired inequality (36), implying $f_{\alpha}(x_{\alpha}^{\ast})\geq 1>0$ for all $-\infty<\alpha<-2$ , and in fact all real $\alpha<0$ . To summarize, we have $A(x)(B(x)-1)\leq 1$ for all $x\in\mathbb{R}$ , and thus the desired $1/\sigma^{2}$ -smoothness result follows, concluding the proof. ∎

Proof of Lemma 1.

Let $\mathsf{X}$ denote any $\mathcal{F}$ -measurable random variable. The continuity of $\rho$ implies that the integral $\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}(\mathsf{X}-\theta)$ exists for any $\sigma>0$ ; we just need to prove it is finite.¹⁴¹⁴14This uses the fact that any composition of (Borel) measurable functions is itself measurable [3, Lem. 1.5.7]. Since we are taking $\rho$ from the Barron class (15), we consider each $\alpha$ setting separately. Starting with $\alpha=2$ , note that

\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;2)=\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}

and thus $\operatorname{\mathbf{E}}_{\mu}\mathsf{X}^{2}<\infty$ is sufficient and necessary. For $\alpha=0$ , first note that we have

\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;0)=\log\left(1+\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right).

Let $f_{1}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\log(1+x)$ and $f_{2}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=x^{c}/c$ , where $0<c<1$ . Note that $f_{1}(0)=f_{2}(0)=0$ , and furthermore that for any $x>0$ ,

\displaystyle{f}^{\prime}_{1}(x)=\frac{1}{1+x}<\left(\frac{1}{1+x}\right)^{1-c}<\left(\frac{1}{x}\right)^{1-c}={f}^{\prime}_{2}(x).

We may thus conclude that $f_{1}(x)\leq f_{2}(x)$ for all $x\geq 0$ , and thus for any $0<c<1$ we have

\displaystyle\log\left(1+\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right)\leq\frac{1}{c}\left(\frac{\mathsf{X}-\theta}{\sqrt{2}\sigma}\right)^{2c}.

It follows that to ensure $\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}(\mathsf{X}-\theta;0)<\infty$ , it is sufficient if we assume $\operatorname{\mathbf{E}}_{\mu}\lvert{\mathsf{X}}\rvert^{c}<\infty$ for some $c>0$ . Proceeding to the case of $\alpha=-\infty$ , we have

\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;-\infty)=1-\exp\left(-\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right).

Any composition of measurable functions is measurable, and since the right-hand side is bounded above by $1$ and below by $0$ , we have that $\rho_{\sigma}(\mathsf{X}-\theta;-\infty)$ is $\mu$ -integrable without requiring any extra assumptions on $\mathsf{X}$ besides measurability. All that remains for the Barron class is the case of non-zero $-\infty<\alpha<2$ , where we have

\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;\alpha)=\frac{\lvert{\alpha-2}\rvert}{\alpha}\left(\left(1+\frac{1}{\lvert{\alpha-2}\rvert}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right)^{\alpha/2}-1\right).

Let us break this into two cases: $-\infty<\alpha<0$ and $0<\alpha<2$ . Starting with the former case, this is easy since

\displaystyle\left(1+x^{2}\right)^{\alpha/2}=\frac{1}{\left(\sqrt{1+x^{2}}\right)^{-\alpha}}

which is bounded above by $1$ and below by $0$ for any $\alpha<0$ and $x\in\mathbb{R}$ , which means the random variable $\rho_{\sigma}(\mathsf{X}-\theta;\alpha)$ is $\mu$ -integrable without any extra assumptions on $\mathsf{X}$ . As for the latter case of $0<\alpha<2$ , first note that the monotonicity of $(\cdot)^{\alpha/2}$ on $\mathbb{R}_{+}$ implies

\displaystyle(1+x^{2})^{\alpha/2}\geq\lvert{x}\rvert^{\alpha}

which means $\operatorname{\mathbf{E}}_{\mu}\lvert{\mathsf{X}}\rvert^{\alpha}<\infty$ is necessary. That this condition is also sufficient is immediate from the form of $\rho_{\sigma}(\mathsf{X}-\theta;\alpha)$ just given. This concludes the proof; the desired result stated in the lemma follows from setting $\mathsf{X}=\operatorname{\mathsf{L}}(h)$ and the observation that the choice of $\theta\in\mathbb{R}$ in the preceding discussion was arbitrary. ∎

Proof of Lemma 9.

Referring to the derivatives (67)–(68) in §D.3, we know that ${\rho}^{\prime}_{\sigma}$ is measurable, and by the proof of Lemma 8, we know that $\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty$ for all $\alpha\leq 1$ . Thus, as long as $\operatorname{\mathsf{L}}(h)$ is $\mathcal{F}$ -measurable, we have that ${\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)$ is $\mu$ -integrable. For the case of $1<\alpha\leq 2$ , note that $\lvert{{\rho}^{\prime}_{\sigma}(x)}\rvert\leq\lvert{x}\rvert/\sigma^{2}$ holds, meaning that $\operatorname{\mathbf{E}}_{\mu}\lvert{\operatorname{\mathsf{L}}(h)}\rvert<\infty$ implies integrability. Similarly for the second derivatives, from the proof of Lemma 8, we see that $\lVert{{\rho}^{\prime\prime}}\rVert_{\infty}<\infty$ for all $-\infty\leq\alpha\leq 1$ , implying the $\mu$ -integrability of ${\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)$ .

The Leibniz integration property follows using a straightforward dominated convergence argument, which we give here for completeness. Letting $(a_{k})$ be any non-zero real sequence such that $a_{k}\to 0$ , we can write

	$\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta)$	$\displaystyle=\lim\limits_{k\to\infty}\frac{\operatorname{D}_{\rho}(h;\theta+a_{k})-\operatorname{D}_{\rho}(h;\theta)}{a_{k}}$
		$\displaystyle=\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}\left[\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+a_{k}))-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}}\right].$

For notational convenience, let us denote the key sequence of functions by

\displaystyle f_{k}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+a_{k}))-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}}

and note that $f_{k}\to f\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)$ pointwise as $k\to\infty$ . We can then say the following: for all $k$ , we have that

\displaystyle\lvert{f_{k}}\rvert\leq\sup_{0<c<1}\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+ca_{k}))}\rvert\leq g\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta^{\prime})}\rvert

for an appropriate choice of $\theta^{\prime}\in\mathbb{R}$ . The first inequality follows from the helper Lemma 14. We can always find an appropriate $\theta^{\prime}$ because the sequence $(a_{k})$ is bounded and ${\rho}^{\prime}$ is eventually monotone, regardless of the choice of $\alpha$ . With the fact $\lvert{f_{k}}\rvert\leq g$ in hand, recall that we have already proved that $\operatorname{\mathbf{E}}_{\mu}g<\infty$ under the assumptions we have made, and thus $\operatorname{\mathbf{E}}_{\mu}f_{k}\to\operatorname{\mathbf{E}}_{\mu}f$ by dominated convergence.¹⁵¹⁵15See for example Ash and Doléans-Dade, [3, Thm. 1.6.9]. As such, we have

\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta)=\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\operatorname{\mathbf{E}}_{\mu}f=-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta),

which is the desired Leibniz property for the first derivative. A completely analogous argument holds for the second derivative, yielding the desired result. ∎

Proof of Lemma 10.

From Lemma 9, we know that the map $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)$ is differentiable and thus continuous. Using continuity, taking any $a<b$ and constructing a closed interval $[a,b]$ , the Weierstrass extreme value theorem tells us that $\operatorname{D}_{\rho}(h;\theta)$ achieves its maximum and minimum on $[a,b]$ . Furthermore, note that $\rho$ taken from the Barron class (15) satisfies all the requirements of our helper Lemma 16, and thus implies $\operatorname{D}_{\rho}(h;\theta)\to\sup\{\rho(x):x\in\mathbb{R}\}$ as $\lvert{\theta}\rvert\to\infty$ . We can thus always take the interval $[a,b]$ wide enough that

\displaystyle\theta\notin[a,b]\implies\operatorname{D}_{\rho}(h;\theta)\geq\max_{a\leq x\leq b}\operatorname{D}_{\rho}(h;x)\geq\min_{a\leq x\leq b}\operatorname{D}_{\rho}(h;x).

This proves the existence of a minimizer of $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)$ on $\mathbb{R}$ .

Next, considering the T-risk $\operatorname{R}_{\rho}(h;\theta,\eta)$ and minimization with respect to $\theta$ , since we are doing unconstrained optimization, any solution $\theta_{\rho}(h;\eta)$ must satisfy $\eta+\mathop{}\!\mathrm{d}_{\theta}\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta))=0$ , where $\mathop{}\!\mathrm{d}_{\theta}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\mathop{}\!\mathrm{d}/\mathop{}\!\mathrm{d}\theta$ . Using Lemma 9 again, this can be equivalently re-written as

\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}(h)-\theta_{\rho}(h;\eta)\right)=\eta.

(37)

When $\alpha>1$ , the derivative of the dispersion function has unbounded range, i.e., ${\rho}^{\prime}_{\sigma}(\mathbb{R})=\mathbb{R}$ . As such, an argument identical to that used in the proof of Lemma 16 implies that for any $\eta\in\mathbb{R}$ , we can always find a $\theta_{\eta}(h)\in\mathbb{R}$ such that (37) holds, recalling that continuity follows via Lemma 9. Combining this with convexity gives us a valid solution. The special case of $\alpha=1$ requires additional conditions, since from Lemma 8, we know that in this case $\lvert{{\rho}^{\prime}_{\sigma}}\rvert\leq 1/\sigma$ , and thus by an analogous argument, whenever $\lvert{\theta_{\rho}(h;\eta)}\rvert<1/\sigma$ we can find a finite solution.

To prove uniqueness under $1\leq\alpha\leq 2$ , direct inspection of the second derivative in (68) shows us that ${\rho}^{\prime\prime}_{\sigma}(x)>0$ on $\mathbb{R}$ whenever we have

\displaystyle\frac{2(x/\sigma)^{2}}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)}<\frac{\lvert{\alpha-2}\rvert}{1-(\alpha/2)}.

Re-arranging the above inequality yields an equivalent condition of $(x/\sigma)^{2}(1-\alpha)<\lvert{\alpha-2}\rvert$ , a condition which holds on $\mathbb{R}$ if and only if $1\leq\alpha\leq 2$ . Since ${\rho}^{\prime\prime}_{\sigma}$ is positive on $\mathbb{R}$ , this implies that $\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)>0$ for all $\theta\in\mathbb{R}$ . Using Lemma 9, we have that $\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)$ is equal to the second derivative of $\operatorname{D}_{\rho}(h;\theta)$ with respect to $\theta$ , which implies that $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)$ and $\theta\mapsto\operatorname{D}_{\rho}(h;\theta)+\eta\theta$ are strictly convex on $\mathbb{R}$ , and thus their minimum must be unique.¹⁶¹⁶16See for example Boyd and Vandenberghe, [10, Sec. 3.1.4]. ∎

Proof of Lemma 5.

For random loss $\operatorname{\mathsf{L}}$ , using Lemma 9, first-order optimality conditions require

\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})\right)=0,\qquad\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}-\theta_{\rho}(\operatorname{\mathsf{L}};\eta)\right)=\eta.

(38)

If these conditions hold, then from direct inspection, the same conditions will clearly hold if we replace $\operatorname{\mathsf{L}}$ by $\operatorname{\mathsf{L}}+c$ , $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})$ by $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})+c$ , and $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ by $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)+c$ . This implies both translation-invariance of the dispersions and the translation-equivariance of the optimal thresholds. Non-negativity follows trivially from the fact that $\rho(\cdot)\geq 0$ . Noting that $\rho(x)>0$ for all $x\neq 0$ , we have that $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta)=0$ if and only if $\operatorname{\mathsf{L}}=\theta$ almost surely.¹⁷¹⁷17This fact follows from basic Lebesgue integration theory [3, Thm. 1.6.6]. Since $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}))\leq\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))$ by the optimality of $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})$ , it follows that for any non-constant $\operatorname{\mathsf{L}}$ , we must have $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))>0$ . Furthermore, from the optimality condition (38) for $\theta_{\rho}(\operatorname{\mathsf{L}};\eta)$ , even when $\operatorname{\mathsf{L}}$ is constant, we must have $\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))>0$ whenever $\eta\neq 0$ , since ${\rho}^{\prime}_{\sigma}(x)=0$ if and only if $x=0$ .

In the special case where $1\leq\alpha\leq 2$ , we have that ${\rho}^{\prime\prime}_{\sigma}$ is positive on $\mathbb{R}$ (see §D.3 and Fig. 8). This implies that $\rho_{\sigma}$ is strictly convex, and ${\rho}^{\prime}_{\sigma}$ is monotonically increasing. Let $\operatorname{\mathsf{L}}_{1}\leq\operatorname{\mathsf{L}}_{2}$ almost surely, but say $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})>\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})$ . Using the optimality condition (38), uniqueness of the solution via Lemma 10, and the aforementioned monotonicity of ${\rho}^{\prime}_{\sigma}$ , we have

\displaystyle 0=\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{1}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})\right)<\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{1}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})\right)\leq\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{2}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})\right)=0.

This is a contradiction, and thus we must have $\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})\leq\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})$ . An identical argument using the exact same properties proves that $\theta_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)\leq\theta_{\rho}(\operatorname{\mathsf{L}}_{2};\eta)$ also holds. Finally, to prove convexity, take any $\operatorname{\mathsf{L}}_{1},\operatorname{\mathsf{L}}_{2}\in\mathcal{L}$ , $\theta_{1},\theta_{2}\in\mathbb{R}$ , and $a\in(0,1)$ , and note that

	$\displaystyle\underline{\operatorname{R}}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};\eta)$	$\displaystyle\leq\operatorname{D}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};a\theta_{1}+(1-a)\theta_{2})+\eta\left(a\theta_{1}+(1-a)\theta_{2}\right)$
		$\displaystyle=\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}\left(a(\operatorname{\mathsf{L}}_{1}-\theta_{1})+(1-a)(\operatorname{\mathsf{L}}_{2}-\theta_{2})\right)+\eta\left(a\theta_{1}+(1-a)\theta_{2}\right)$
		$\displaystyle\leq a\left(\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}_{1};\theta_{1})+\eta\theta_{1}\right)+(1-\alpha)\left(\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}_{2};\theta_{2})+\eta\theta_{2}\right).$

The first inequality uses optimality of the threshold in the definition of $\underline{\operatorname{R}}_{\rho}$ , whereas the second inequality uses the convexity of $\rho_{\sigma}$ . Since the choice of $\theta_{1}$ and $\theta_{2}$ here were arbitrary, we can set $\theta_{1}=\theta_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)$ and $\theta_{2}=\theta_{\rho}(\operatorname{\mathsf{L}}_{2};\eta)$ to obtain the desired inequality

\displaystyle\underline{\operatorname{R}}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};\eta)\leq a\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)+(1-a)\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}}_{2};\eta)

giving us convexity of the threshold risk. As a direct corollary, setting $\eta=0$ yields the convexity result for $\operatorname{\mathsf{L}}\mapsto\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}))$ . ∎

Proof of Lemma 2.

The crux of this result is an analogue to Lemma 9 regarding the differentials of $\operatorname{D}_{\rho}(h;\theta)$ , this time taken with respect to $h$ , rather than $\theta$ . Fixing arbitrary $g,h\in\mathcal{H}$ , let us start by considering the following sequence of random variables:

\displaystyle f_{k}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h+a_{k}g)-\theta)-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}}

(39)

where $(a_{k})$ is any sequence of real values such that $a_{k}\to 0_{+}$ as $k\to\infty$ . Before getting into the details, let us unpack the differentiability assumption made on the base loss. Before random sampling, the map $h\mapsto\operatorname{\mathsf{L}}(h)$ is of course a map from $\mathcal{H}$ to the set of measurable functions $\{\operatorname{\mathsf{L}}(h):h\in\mathcal{H}\}$ , but after sampling, there is no randomness and it is simply a map from $\mathcal{H}$ to $\mathbb{R}$ . Having sampled the random loss, the property we desire is that for each $h\in\mathcal{H}$ , there exists a continuous linear functional ${\operatorname{\mathsf{L}}}^{\prime}(h):\mathcal{U}\to\mathbb{R}$ such that

\displaystyle\lim\limits_{\lVert{g}\rVert\to 0}\frac{\lvert{\operatorname{\mathsf{L}}(h+g)-\operatorname{\mathsf{L}}(h)-{\operatorname{\mathsf{L}}}^{\prime}(h)(g)}\rvert}{\lVert{g}\rVert}=0.

(40)

The differentiability condition in the lemma statement is simply that

\displaystyle\mu\{\,\text{equality (\ref{eqn:unbiased_1a}) holds}\,\}=1.

(41)

On this “good” event, since the map $x\mapsto\rho_{\sigma}(x)$ is differentiable by definition, we have that the composition $h\mapsto\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)$ is also differentiable for any choice of $\theta\in\mathbb{R}$ , and a general chain rule can be applied to compute the differentials.¹⁸¹⁸18See Penot, [44, Thm. 2.47] for this key fact, where “ $X$ ” is $\mathcal{U}$ here, and both “ $Y$ ” and “ $Z$ ” are $\mathbb{R}$ here. In particular, we have a pointwise limit of

\displaystyle f\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lim\limits_{k\to\infty}f_{k}={\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h)(g)

(42)

which also uses the fact that the Fréchet and Gateaux differentials are equal here.¹⁹¹⁹19Luenberger, [39, §7.2, Prop. 2] Technically, it just remains to obtain conditions which imply $\operatorname{\mathbf{E}}_{\mu}f_{k}\to\operatorname{\mathbf{E}}_{\mu}f$ . In pursuit of a $\mu$ -integrable upper bound on the sequence $(f_{k})$ , note that for large enough $k$ , we have

$\displaystyle\lvert{f_{k}}\rvert$	$\displaystyle\leq\frac{1}{a_{k}}\lVert{a_{k}g}\rVert\sup_{0<a<a_{k}}\lVert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h+ag)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h+ag)}\rVert$
	$\displaystyle\leq\lVert{g}\rVert\sup\left\{\lVert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{0})-\theta){\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert:h_{0}\in\mathcal{H}\right\}$
	$\displaystyle\leq\frac{\lVert{g}\rVert}{\sigma^{2}}\sup\left\{\lvert{\operatorname{\mathsf{L}}(h_{0})-\theta}\rvert\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert:h_{0}\in\mathcal{H}\right\}.$	(43)

The key to the first of the preceding inequalities is a generalized mean value theorem.²⁰²⁰20Considering the proof of Lemma 14 due to Luenberger, [39, §7.3, Prop. 2], just generalize the one-dimensional part of the argument from the original interval $[0,1]$ to the interval $[0,a_{k}]$ here. Both the first and second inequalities also use the fact that $h+a_{k}g\in\mathcal{H}$ eventually. The final inequality uses the fact that ${\rho}^{\prime}_{\sigma}(x)={\rho}^{\prime}(x/\sigma)/\sigma\leq\lvert{x}\rvert/\sigma^{2}$ for any choice of $-\infty\leq\alpha\leq 2$ . This inequality suggests a natural condition of

\displaystyle\operatorname{\mathbf{E}}_{\mu}\left[\sup_{h_{0}\in\mathcal{H}}\lVert{\operatorname{\mathsf{L}}(h_{0}){\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert\right]<\infty

(44)

under which we can apply a standard dominated convergence argument.²¹²¹21See for example Ash and Doléans-Dade, [3, Thm. 1.6.9]. If (43) holds for say all $k\geq k_{0}$ , then we can just bound $\lvert{f_{k}}\rvert$ by the greater of $\max_{j\leq k_{0}}\lvert{f_{j}}\rvert$ (clearly $\mu$ -integrable) and the right-hand side of (43). In particular, the key implication is that

\displaystyle\text{(\ref{eqn:unbiased_2})}\implies\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\operatorname{\mathbf{E}}_{\mu}f.

(45)

Since we have

\displaystyle\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\lim\limits_{a\to 0_{+}}\operatorname{\mathbf{E}}_{\mu}\left[\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h+ag)-\theta)-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a}\right]={\operatorname{D}}^{\prime}_{\rho}(h;\theta)(g),

where ${\operatorname{D}}^{\prime}_{\rho}(h;\theta):\mathcal{U}\to\mathbb{R}$ denotes the gradient of $h\mapsto\operatorname{D}_{\rho}(h;\theta)$ , we see that by applying the preceding argument (culminating in (45)) to the modified losses (16), we readily obtain the desired result. ∎

Proof of Theorem 3.

To begin, let us consider the smoothness of the objective $h\mapsto\operatorname{R}_{\rho}(h;\theta,\eta)$ under the present assumptions. From Lemma 12 and the basic properties of the Barron class of dispersion functions (Lemma 8), it follows that this function is $\lambda$ -smooth with coefficient

\displaystyle\lambda\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\lambda_{5}}{\sigma}+\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}},

(46)

where $\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}$ , and

\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right],\qquad\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}.

From here, we can leverage the main argument of Cutkosky and Mehta, [14, Thm. 2], utilizing the smoothness property given by (46) above, and the $\Gamma$ -bound of (19). For completeness and transparency we include the key details here. First, note that if we define $\epsilon_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=G_{t}-\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)$ , $\widehat{\epsilon}_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=M_{t}-\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)$ , and $S(h_{t},h_{t+1})\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)$ , our definitions imply that for each $t\geq 1$ , we have

\displaystyle M_{t+1}=\partial_{h}\operatorname{R}_{\rho}(h;\theta,\eta)+b\left(\widehat{\epsilon}_{t}+S(h_{t},h_{t+1})\right)+(1-b)\epsilon_{t+1}

(47)

Using the form (47), it follows immediately that

\displaystyle\widehat{\epsilon}_{t+1}=b\left(\widehat{\epsilon}_{t}+S(h_{t},h_{t+1})\right)+(1-b)\epsilon_{t+1}

(48)

again for each $t\geq 1$ . By setting $M_{0}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=0$ and $h_{0}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=h_{1}$ , we trivially have

\displaystyle\widehat{\epsilon}_{0}=-\partial_{h}\operatorname{R}_{\rho}(h_{0};\theta,\eta)=-\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta,\eta)

and one can then easily check that (48) holds for all $t\geq 0$ . Expanding the recursion of (48), we have

\displaystyle\widehat{\epsilon}_{t+1}=(1-b)\sum_{k=0}^{t}b^{k}\epsilon_{t-k+1}+\sum_{k=1}^{t+1}b^{k}S(h_{t-k+1},h_{t-k+2})+b^{t+1}\widehat{\epsilon}_{0}.

(49)

We take the summands of (49) one at a time. Since the stochastic gradients are $\Gamma$ -bounded, we have that $b^{k}\lVert{\epsilon_{t-k+1}}\rVert\leq 2\Gamma$ for all $0\leq k\leq t$ . Furthermore, we have $\operatorname{\mathbf{E}}_{\mu}\epsilon_{t-k+1}=0$ (Lemma 2) and $\operatorname{\mathbf{E}}_{\mu}(b^{k}\lVert{\epsilon_{t-k+1}}\rVert)^{2}\leq(2b^{k}\Gamma)^{2}$ for each $k$ . These bounds can be passed to standard concentration inequalities for martingales on Banach spaces [14, Lem. 14], which using the smoothness property of $\mathcal{H}$ that we assumed tell us that with probability no less than $1-\delta$ , we have

\displaystyle\left\lVert{\sum_{k=0}^{t}b^{k}\epsilon_{t-k+1}}\right\rVert\leq 10\Gamma\max\left\{1,\log(3\delta^{-1})\right\}+8\Gamma\sqrt{\max\left\{1,\log(3\delta^{-1})\right\}\sum_{k=0}^{t}b^{2k}}.

(50)

Moving on to the second term of (49), note that using $\lambda$ -smoothness of the risk function with coefficient $\lambda$ given by (46), along with the definition of the update procedure (20)–(21), we have

\displaystyle\lVert{S(h_{t},h_{t+1})}\rVert\leq\lambda\lVert{h_{t}-h_{t+1}}\rVert=\lambda a_{t}\lVert{\widetilde{M}_{t}}\rVert=\lambda a_{t}.

This implies that using a constant step size $a_{t}=a$ , we can control the sum as

\displaystyle\left\lVert{\sum_{k=1}^{t+1}b^{k}S(h_{t-k+1},h_{t-k+2})}\right\rVert\leq\lambda\sum_{k=1}^{t+1}a_{t-k+1}b^{k}\leq\frac{a\lambda}{1-b}.

(51)

Finally, the third term of (49) is easily controlled as $\lVert{\widehat{\epsilon}_{0}}\rVert=\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta,\eta)}\rVert\leq\Gamma$ . Taking this bound along with (50) and (51), we see that $\lVert{\widehat{\epsilon}_{t+1}}\rVert$ can be bounded above by

\displaystyle(1-b)\left(10\Gamma\max\left\{1,\log(3\delta^{-1})\right\}+8\Gamma\sqrt{\max\left\{1,\log(3\delta^{-1})\right\}\sum_{k=0}^{t}b^{2k}}\right)+\frac{a\lambda}{1-b}+b^{t+1}\Gamma

(52)

on the high-probability event mentioned earlier. To make use of the bound (52), note that using the $\lambda$ -smoothness of the losses, the update procedure used here can be shown [14, Lem. 1] to satisfy

\displaystyle\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)}\rVert\leq\frac{\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)}{a}+2\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert+\left(\frac{\lambda a}{2}\right)t.

(53)

Using a union bound, we have that the bound of (52) holds for all $\widehat{\epsilon}_{1},\ldots,\widehat{\epsilon}_{t}$ with probability no less than $1-t\delta$ . To get some clean bounds out of (52) and (53), first we loosen

\displaystyle(1-b)\sqrt{\sum_{k=0}^{t}b^{2k}}\leq\frac{1-b}{\sqrt{1-b^{2}}}\leq\frac{1-b}{\sqrt{1-b}}=\sqrt{1-b}

and note that $0<\delta<1$ and $t\geq 1\geq\mathrm{e}/3$ implies $\log(3t\delta^{-1})\geq 1$ . This means that with probability no less than $1-\delta$ , we have

\displaystyle 2\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert\leq(1-b)20\Gamma t\log(3t\delta^{-1})+16\Gamma t\sqrt{(1-b)\log(3t\delta^{-1})}+\frac{2a\lambda t}{1-b}+\frac{2\Gamma}{1-b}.

Dividing both sides by $t$ and plugging in the prescribed settings of $a$ and $b$ , we have

\displaystyle\frac{2}{t}\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert\leq\frac{20\Gamma\log(3t\delta^{-1})}{\sqrt{t}}+\frac{16\Gamma\sqrt{\log(3t\delta^{-1})}}{t^{1/4}}+\frac{2\lambda}{t^{1/4}}+\frac{2\Gamma}{\sqrt{t}}.

(54)

Taking this bound and our $a$ setting and applying it to (53), we obtain

\displaystyle\frac{1}{t}\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)}\rVert\leq\frac{\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)}{t^{1/4}}+(\text{RHS of (\ref{eqn:sgd_convergence_9})})+\frac{\lambda}{2t^{3/4}}.

(55)

To clean this all up, we have

	$\displaystyle\frac{1}{t}\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)}\rVert$	$\displaystyle\leq\frac{1}{t^{1/4}}\left(\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)+16\Gamma\sqrt{\log(3t\delta^{-1})}+2\lambda\right)$
		$\displaystyle\qquad+\frac{1}{\sqrt{t}}\left(20\Gamma\log(3t\delta^{-1})+2\Gamma\right)+\frac{\lambda}{2t^{3/4}}.$

For readability, the proof uses a slightly looser choice of $\lambda_{3}$ , and instead of $t$ iterations, it is stated for $T$ iterations. ∎

Proof of Corollary 4.

Let $\alpha=0$ , and note from §D.3 that we have

\displaystyle\rho_{\sigma}(x)=\frac{2x}{x^{2}+2\sigma^{2}}.

It thus follows from (17) that

\displaystyle\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}{\operatorname{\mathsf{L}}}^{\prime}(h).

In the case of the quadratic loss with a linear model as assumed here, this becomes

	$\displaystyle\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)$	$\displaystyle=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}\left(h(\mathsf{X})-\mathsf{Y}\right)\mathsf{X}$
		$\displaystyle=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)\sqrt{2\operatorname{\mathsf{L}}(h)}}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}\operatorname{sign}(h(\mathsf{X})-\mathsf{Y})\mathsf{X}.$

Regarding growth in $\mathsf{X}$ , note that since $\operatorname{\mathsf{L}}(h)=\mathcal{O}(\lVert{\mathsf{X}}\rVert^{2})$ , both the numerator and denominator are $\mathcal{O}(\lVert{\mathsf{X}}\rVert^{4})$ . As for growth in $\operatorname{\mathsf{L}}(h)$ which accounts for the random noise $\varepsilon$ as well, the numerator is $\mathcal{O}(\lvert{\operatorname{\mathsf{L}}}\rvert^{3/2})$ whereas the denominator is $\mathcal{O}(\lvert{\operatorname{\mathsf{L}}}\rvert^{2})$ , and thus accounting for the randomness of $\mathsf{X}$ and $\varepsilon$ which can both be potentially unbounded and heavy-tailed, we see that the norm of $\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)$ must be bounded as the norm of the inputs, noise, and/or loss grow large. Trivially, since $\sigma>0$ , the limits where $\operatorname{\mathsf{L}}(h)\to\theta$ also result in $\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)$ with a norm that is almost surely bounded.

Proceeding to the case of the logistic loss, an analogous argument yields the desired result. First note that under the linear model assumed here, the gradient with respect to any $h_{j}$ takes the form

\displaystyle\partial_{h_{j}}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}(p_{j}(h)-\widetilde{\mathsf{Y}}_{j})\mathsf{X}

where $p_{j}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\exp(h_{j}(\mathsf{X}))/\sum_{i=1}^{k}\exp(h_{i}(\mathsf{X}))$ , i.e., the softmax transformation of the score assigned by $h_{j}$ . By definition, the coefficients $(p_{j}(h)-\widetilde{\mathsf{Y}}_{j})$ are bounded. Furthermore, using our linear model assumption, we have that $\operatorname{\mathsf{L}}(h)=\mathcal{O}(\lVert{\mathsf{X}}\rVert)$ , and as such the numerator and denominator are both $\mathcal{O}(\lVert{\mathsf{X}}\rVert^{2})$ , implying the desired boundedness.

Taking the previous two paragraphs together, we have that the bound (19) is satisfied for a finite $\Gamma$ under $\alpha=0$ , even when the data is unbounded and potentially heavy-tailed. For $\alpha<0$ , the derivative of $\rho$ shrinks even faster, so the same result follows a fortiori from the $\alpha=0$ case. Finally, the $\lambda_{1}$ -smoothness assumption in Theorem 3 follows from direct inspection of the forms of ${\operatorname{\mathsf{L}}}^{\prime}(h)$ given here for each loss, using our assumption of $\lVert{\mathsf{X}}\rVert^{2}$ having bounded second moments. ∎

Proof of Proposition 7.

To begin, recall from §3 the notation $\operatorname{M}_{\rho}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{D}_{\rho}(h;\theta)$ for the M-locations and $\theta_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta)$ for the T-risk optimal thresholds. By Lemma 10, both $\operatorname{M}_{\rho}(h)$ and $\theta_{\rho}(h;\eta)$ have unique solutions, and thus we overload this notation to represent the unique solution. By definition, we have

	$\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)=\operatorname{R}_{\rho}(h;\theta_{\rho}(h;\eta))$	$\displaystyle\leq\eta\operatorname{M}_{\rho}(h)+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))$
		$\displaystyle=\eta\operatorname{R}(h)+\eta(\operatorname{M}_{\rho}(h)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h)).$

Similarly, we can obtain a lower bound using the optimality of $\operatorname{M}_{\rho}(h)$ and $\theta_{\rho}(h;\eta)$ as

	$\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)$	$\displaystyle=\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta))$
		$\displaystyle\geq\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))$
		$\displaystyle=\eta\operatorname{R}(h)+\eta(\theta_{\rho}(h;\eta)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h)).$

Taking these two bounds together, we have

\displaystyle\eta(\theta_{\rho}(h;\eta)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))\leq\underline{\operatorname{R}}_{\rho}(h;\eta)-\eta\operatorname{R}(h)\leq\eta(\operatorname{M}_{\rho}(h)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))

(56)

for any choice of $\eta\in\mathbb{R}$ . The bounds in (56) are stated for ideal risk quantities for the true distribution under $\mu$ , but an identical argument holds if we replace $\mu$ by the empirical measure induced by an iid sample $\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{n}$ . Writing this out explicitly, let $\widehat{\operatorname{D}}_{\rho}$ , $\underline{\widehat{\operatorname{R}}}_{\rho}$ , and $\widehat{\operatorname{R}}$ denote the empirical analogues of $\operatorname{D}_{\rho}$ , $\underline{\operatorname{R}}_{\rho}$ , and $\operatorname{R}$ , and similarly let $\widehat{\operatorname{M}}_{\rho}$ and $\widehat{\theta}_{\rho}$ be the empirical analogues of $\operatorname{M}_{\rho}$ and $\theta_{\rho}$ . From the argument leading to (56), it follows that

\displaystyle\eta(\widehat{\theta}_{\rho}(h;\eta)-\widehat{\operatorname{R}}(h))+\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h))\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)-\eta\widehat{\operatorname{R}}(h)\leq\eta(\widehat{\operatorname{M}}_{\rho}(h)-\widehat{\operatorname{R}}(h))+\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h)).

(57)

Next, using the lower bound in (57), for any $\eta\geq 0$ we have that

$\displaystyle\eta\operatorname{R}(h)$	$\displaystyle=\eta\left(\widehat{\operatorname{R}}(h)+(\operatorname{R}(h)-\widehat{\operatorname{R}}(h))\right)$
	$\displaystyle\leq\eta\left(\widehat{\operatorname{R}}(h)+\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}\right)$
	$\displaystyle\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)-\eta(\widehat{\theta}_{\rho}(h;\eta)-\widehat{\operatorname{R}}(h))-\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}$
	$\displaystyle\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)+\eta(\widehat{\operatorname{R}}(h)-\widehat{\theta}_{\rho}(h;\eta))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}.$	(58)

Letting $\widehat{h}$ be a minimizer of $\underline{\widehat{\operatorname{R}}}_{\rho}$ , using the upper bound in (57) and any choice of $h^{\ast}$ , we have

	$\displaystyle\underline{\widehat{\operatorname{R}}}_{\rho}(\widehat{h};\eta)\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h^{\ast};\eta)$	$\displaystyle\leq\eta\widehat{\operatorname{R}}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))$
		$\displaystyle\leq\eta\operatorname{R}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}.$		(59)

Combining (58) and (59), we have that $\eta\operatorname{R}(\widehat{h})$ is bounded above by

\displaystyle\eta\operatorname{R}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\theta_{\rho}(\widehat{h};\eta))+\eta(\widehat{\operatorname{R}}(\widehat{h})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+2\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}.

(60)

Some elementary manipulations let us bound the key differences in (60) as

\displaystyle\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\theta_{\rho}(\widehat{h};\eta)+\widehat{\operatorname{R}}(\widehat{h})-\widehat{\operatorname{R}}(h^{\ast})\leq 2\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}+2\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}+\lVert{\widehat{\operatorname{M}}_{\rho}-\theta_{\rho}}\rVert_{\mathcal{H}},

and thus dividing by $\eta>0$ , we end up with a final bound taking the form

\displaystyle\operatorname{R}(\widehat{h})\leq\operatorname{R}(h^{\ast})+\lVert{\widehat{\operatorname{M}}_{\rho}-\theta_{\rho}}\rVert_{\mathcal{H}}+2\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}+\frac{1}{\eta}\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+4\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}.

The desired result is just the special case where we set $h^{\ast}$ is the expected loss minimizer. ∎

C.2 Smoothness computations (proof of Lemma 12)

Here we provide detailed computations for the smoothness coefficients used in Lemma 12. We assume here that the assumptions A1., A2., and A3. are satisfied. Starting with the difference of expected gradients, using Jensen’s inequality and the smoothness assumption A2., we have

	$\displaystyle\lVert{\operatorname{\mathbf{E}}_{\mu}\left[{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}\rVert$	$\displaystyle\leq\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert$
		$\displaystyle\leq\lambda_{1}\lVert{h_{1}-h_{2}}\rVert.$		(61)

As discussed in §B.4, differences of gradients modulated by ${\rho}^{\prime}$ are slightly more complicated. In particular, recalling the equality (29), the norm of the difference

\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1}){\operatorname{\mathsf{L}}}^{\prime}(h_{1})-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2}){\operatorname{\mathsf{L}}}^{\prime}(h_{2})

(62)

can be bounded above by the sum of

\displaystyle\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})}\rvert

(63)

and

\displaystyle\operatorname{\mathbf{E}}_{\mu}\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})}\rvert\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert.

(64)

We take up (63) and (64) one at a time. Starting with (63), from A3. we know that the dispersion derivative ${\rho}^{\prime}$ is $\lambda_{2}$ -Lipschitz, and thus we have

(63)	$\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\left(\lvert{\operatorname{\mathsf{L}}(h_{1})-\operatorname{\mathsf{L}}(h_{2})}\rvert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)$
	$\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\left(\lVert{h_{1}-h_{2}}\rVert\sup_{0<c<1}\lVert{{\operatorname{\mathsf{L}}}^{\prime}((1-c)h_{1}-ch_{2})}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)$
	$\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\left(\lVert{h_{1}-h_{2}}\rVert\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\lvert{\theta_{1}-\theta_{2}}\rvert\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right)$
	$\displaystyle\leq\lambda_{3}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right).$	(65)

Here, the second inequality uses the helper Lemma 14 and our assumption of differentiability, while the third inequality uses our assumption on the expected squared norm of the gradient. We have set the Lipschitz coefficient $\lambda_{3}$ in (65) to be

\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right].

This gives us a bound on (63). Moving on to (64), if ${\rho}^{\prime}$ is bounded on $\mathbb{R}$ , then we have

	(64)	$\displaystyle\leq\lVert{{\rho}^{\prime}}\rVert_{\infty}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert$
		$\displaystyle\leq\lambda_{4}\lVert{h_{1}-h_{2}}\rVert$		(66)

with $\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}$ , recalling the bound (61). To summarize, we can use (65) and (66) to control (62) as follows:

	$\displaystyle(\text{\ref{eqn:smoothness_limited_2}})$	$\displaystyle\leq(\text{\ref{eqn:smoothness_limited_3}})+(\text{\ref{eqn:smoothness_limited_4}})$
		$\displaystyle\leq\lambda_{3}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)+\lambda_{4}\lVert{h_{1}-h_{2}}\rVert$
		$\displaystyle\leq\lambda_{5}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)$

where $\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}$ . With these preparatory details organized, it is straightforward to obtain a Lipschitz property on the gradient of $\operatorname{R}_{\rho}(\cdot)$ , as summarized in Lemma 12, and detailed in the proof below.

Proof of Lemma 12.

Using our upper bounds on (61) and (62), we have

	$\displaystyle\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rVert$
	$\displaystyle\leq\left(\frac{1}{\sigma}\right)\left\lVert{\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{1})-\theta_{1}}{\sigma}\right){\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{2})-\theta_{2}}{\sigma}\right){\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}\right\rVert$
	$\displaystyle\leq\left(\frac{\lambda_{5}}{\sigma}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right).$

Next, let us look at the partial derivative taken with respect to the threshold parameter $\theta$ . To bound the absolute value of these differences, using the generalized mean value theorem (Lemma 14), we have

	$\displaystyle\lvert{\partial_{\theta}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{\theta}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rvert$	$\displaystyle\leq\left(\frac{1}{\sigma}\right)\left\lvert\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{1})-\theta_{1}}{\sigma}\right)-{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{2})-\theta_{2}}{\sigma}\right)\right]\right\rvert$
		$\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma^{2}}\right)\left(\operatorname{\mathbf{E}}_{\mu}\lvert{\operatorname{\mathsf{L}}(h_{1})-\operatorname{\mathsf{L}}(h_{2})}\rvert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)$
		$\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right).$

Taking the preceding upper bounds together, the gradient difference for $\operatorname{R}_{\rho}$ can be bounded as

	$\displaystyle\lVert{{\operatorname{R}}^{\prime}_{\rho}(h_{1};\theta_{1},\eta)-{\operatorname{R}}^{\prime}_{\rho}(h_{2};\theta_{2},\eta)}\rVert$
	$\displaystyle=\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rVert+\lvert{\partial_{\theta}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{\theta}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rvert$
	$\displaystyle\leq\left(\left(\frac{\lambda_{5}}{\sigma}\right)+\frac{\eta\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right),$

noting that the initial equality follows from the fact that we are using the sum of norms for our product space norm here (see also Remark 13). These bounds on the gradient differences are precisely the desired result. ∎

Appendix D Additional technical facts

D.1 Lipschitz properties

Here we give a fundamental property of differentiable functions that generalizes the mean value theorem.

Lemma 14.

Let $\mathcal{U}$ and $\mathcal{V}$ be normed linear spaces, and let $f:\mathcal{U}\to\mathcal{V}$ be Fréchet differentiable on an open set $S\subseteq\mathcal{U}$ . Taking any $u\in S$ , we have

\displaystyle\lVert{f(u+u^{\prime})-f(u)}\rVert\leq\lVert{u^{\prime}}\rVert\sup_{0<c<1}\lVert{{f}^{\prime}(u+cu^{\prime})}\rVert

for any $u^{\prime}\in\mathcal{U}$ such that $u+cu^{\prime}\in S$ for all $0\leq c\leq 1$ .

Proof.

See Luenberger, [39, §7.3, Prop. 2]. ∎

Note that Lemma 14 has the following important corollary: bounded gradients imply Lipschitz continuity. In particular, if $\lVert{{f}^{\prime}(u)}\rVert\leq\lambda<\infty$ for all $u\in S$ , then it follows immediately that $f$ is $\lambda$ -Lipschitz on $S$ .

A closely related result goes in the other direction. Let $f:\mathcal{U}\to\mkern 1.5mu\overline{\mkern-1.5mu\mathbb{R}\mkern-1.5mu}\mkern 1.5mu$ be convex and $\lambda$ -Lipschitz. If $f$ is sub-differentiable at a point $u\in\mathcal{U}$ , then we have

\displaystyle\lvert{\langle\partial f(u),u^{\prime}-u\rangle}\rvert\leq\lvert{f(u^{\prime})-f(u)}\rvert\leq\lambda\lVert{u^{\prime}-u}\rVert.

As such, for convex, sub-differentiable functions, $\lambda$ -Lipschitz continuity implies that all sub-gradients are bounded as $\lVert{\partial f(x)}\rVert\leq\lambda$ .

D.2 Convexity

Lemma 15.

Let function $f:\mathbb{R}^{d}\to\mathbb{R}$ be twice continuously differentiable. Then $f$ is convex if and only if its Hessian is positive semi-definite, namely when

\displaystyle\langle{f}^{\prime\prime}(v)u,u\rangle\geq 0

for all $u,v\in\mathbb{R}^{d}$ .

Proof.

See Nesterov, [43, Thm. 2.1.4]. ∎

D.3 Derivatives for the Barron class

Let $\rho(\cdot;\alpha)$ be defined according to (15). Here we compute derivatives of the map $x\mapsto\rho_{\sigma}(x;\alpha)$ , using the shorthand notation $\rho_{\sigma}(x;\alpha)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma;\alpha)$ . We denote the first derivative of $\rho_{\sigma}(\cdot;\alpha)$ evaluated at $x\in\mathbb{R}$ by ${\rho}^{\prime}_{\sigma}(x;\alpha)$ , which is computed as

\displaystyle{\rho}^{\prime}_{\sigma}(x;\alpha)=\begin{cases}x/\sigma^{2},&\text{if }\alpha=2\\ 2x/(x^{2}+2\sigma^{2}),&\text{if }\alpha=0\\ (x/\sigma^{2})\exp\left(-(x/\sigma)^{2}/2\right),&\text{if }\alpha=-\infty\\ \frac{x}{\sigma^{2}}\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1},&\text{otherwise}.\end{cases}

(67)

In the same way, letting ${\rho}^{\prime\prime}_{\sigma}(x;\alpha)$ denote the second derivative of $\rho_{\sigma}(\cdot;\alpha)$ evaluated at $x\in\mathbb{R}$ , this is computed as

\displaystyle{\rho}^{\prime\prime}_{\sigma}(x;\alpha)=\begin{cases}1/\sigma^{2},&\text{if }\alpha=2\\ \frac{2}{x^{2}+2\sigma^{2}}\left(1-\frac{2x^{2}}{x^{2}+2\sigma^{2}}\right),&\text{if }\alpha=0\\ (1/\sigma^{2})\exp\left(-(x/\sigma)^{2}/2\right)\left(1-\left(\frac{x}{\sigma}\right)^{2}\right),&\text{if }\alpha=-\infty\\ \frac{1}{\sigma^{2}}\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1}\left(1-\frac{1-(\alpha/2)}{\lvert{\alpha-2}\rvert}\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}\right),&\text{otherwise}.\end{cases}

(68)

We emphasize that ${\rho}^{\prime}_{\sigma}(x;\alpha)$ and ${\rho}^{\prime\prime}_{\sigma}(x;\alpha)$ are not equal to ${\rho}^{\prime}(x/\sigma;\alpha)$ and ${\rho}^{\prime\prime}(x/\sigma;\alpha)$ , but by a simple application of the chain rule are easily seen to satisfy the relations

\displaystyle{\rho}^{\prime}_{\sigma}(x;\alpha)=\frac{1}{\sigma}{\rho}^{\prime}(x/\sigma;\alpha),\qquad{\rho}^{\prime\prime}_{\sigma}(x;\alpha)=\frac{1}{\sigma^{2}}{\rho}^{\prime\prime}(x/\sigma;\alpha)

for any $x\in\mathbb{R}$ , $\sigma>0$ , and $\alpha\in[-\infty,2]$ .

D.4 Elementary inequalities

The following elementary inequalities will be of use.

\displaystyle\left(1+\frac{x}{p}\right)^{p}\geq\left(1+\frac{x}{q}\right)^{q},\qquad\forall\,x\geq 0,\quad p>q>0

(69)

\displaystyle(1+x)^{c}\leq 1+\frac{cx}{1-(c-1)x},\qquad-1\leq x<\frac{1}{c-1},\quad c>1

(70)

The inequality below is sometimes referred to as Bernoulli’s inequality.

\displaystyle\left(1+x\right)^{a}\geq 1+ax,\qquad\forall\,x>-1,\quad a\geq 1.

(71)

D.5 Expected dispersion is coercive

Lemma 16 (Expected dispersion is coercive).

Let $f:\mathbb{R}\to\mathbb{R}_{+}$ be any non-negative function which is even (i.e., $f(x)=f(-x)$ for all $x\in\mathbb{R}$ ) and non-decreasing on $\mathbb{R}_{+}$ . Let $\mathsf{X}$ be any random variable such that $\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)<\infty$ for all $\theta\in\mathbb{R}$ . Then, we have

\displaystyle\lim\limits_{\lvert\theta\rvert\to\infty}\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)=\lim\limits_{x\to\infty}f(x)

and note that this includes the divergent case where $f(x)\to\infty$ as $\lvert x\rvert\to\infty$ .

Proof of Lemma 16.

By our assumptions, we have $f(x)\geq 0$ and $f(-x)=f(x)$ for all $u\in\mathbb{R}$ , and $f(x_{1})\leq f(x_{2})$ whenever $0\leq x_{1}\leq x_{2}$ . With these facts in place, note that for any choice of $a\geq 0$ and $\theta$ such that $\lvert\theta\rvert\geq a$ , we have

$\displaystyle\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)$	$\displaystyle=\operatorname{\mathbf{E}}_{\mu}f(\lvert\theta-\mathsf{X}\rvert)$
	$\displaystyle\geq\operatorname{\mathbf{E}}_{\mu}f\left(\lvert\lvert\theta\rvert-\lvert\mathsf{X}\rvert\rvert\right)$
	$\displaystyle\geq\operatorname{\mathbf{E}}_{\mu}\operatorname{I}_{\{\lvert\mathsf{X}\rvert\leq a\}}f\left(\lvert\lvert\theta\rvert-\lvert\mathsf{X}\rvert\rvert\right)$
	$\displaystyle\geq f\left(\lvert\theta\rvert-a\right)\operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}.$	(72)

For readability, let us write $f(+\infty)$ for the limit of $f(x)$ as $\lvert x\rvert\to\infty$ . Trivially, we know that $\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)\leq f(+\infty)$ . Using the preceding inequality (72), we have a lower bound of $\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)\geq f(+\infty)\operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}$ that holds for any $a\geq 0$ . When $f(+\infty)=\infty$ , the desired result is immediate. When $f(+\infty)<\infty$ , simply note that $\{\lvert\mathsf{X}\rvert\leq a\}\uparrow\Omega$ as $a\uparrow\infty$ , and thus using the continuity of probability measures, we have $\operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}\to 1$ as $a\to\infty$ .²²²²22All countably additive set functions on $\sigma$ -fields satisfy such continuity properties [3, Thm. 1.2.7]. Thus, the lower bound (72) can be taken arbitrarily close to $f(+\infty)$ , implying the desired result. ∎