
Flexible risk design using bi-directional dispersion

Matthew J. Holland
Osaka University
Abstract

Many novel notions of “risk” (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.

1 Introduction

What does it mean for a learner to successfully generalize? Broadly speaking, this is an ambiguous property of learning systems that can be defined, measured, and construed in countless ways. In the context of machine learning, however, the notion of "success" in off-sample generalization is almost without exception formalized as minimizing the expected value of a random loss, $\mathbf{E}_{\mu}\mathsf{L}(h)$, where $h$ is a candidate parameter, model, or decision rule, and $\mathsf{L}(h)$ is a random variable on a probability space $(\Omega,\mathcal{F},\mu)$ [41, 53]. The idea of quantifying the risk of an unexpected outcome (here, a random loss) using the expected value dates back to the Bernoullis and Gabriel Cramer in the early 18th century [5, 25]. In a more modern context, the emphasis on average performance is the "general setting of the learning problem" of Vapnik [58], and plays a central role in the decision-theoretic learning model of Haussler [27]. Use of the expected loss to quantify off-sample generalization has been essential to the development of both the statistical and computational theories of learning [16, 34].

While the expected loss remains pervasive, important new lines of work on risk-sensitive learning have begun exploring novel feedback mechanisms for learning algorithms, in some cases derived directly from new risk functions that replace the expected loss. Learning algorithms designed using conditional value-at-risk (CVaR) [13, 31] and tilted (or "entropic") risk [19, 37, 38] are well-known examples of location properties which emphasize loss tails in one direction more than the mean itself does. This is often used to increase sensitivity to "worst-case" events [33, 55], but in special cases where losses are bounded below, sensitivity to tails on the downside can be used to realize an insensitivity to tails on the upside [36]. This strong asymmetry is not specific to the preceding two risk function classes, but rather is inherent in much broader classes such as optimized certainty equivalent (OCE) risks [7, 8, 36] and distributionally robust optimization (DRO) risks [6, 17, 18, 24]. Unsurprisingly, naive empirical estimators of these risks are particularly fragile under outliers coming from the "sensitive direction," as is evidenced by the plethora of attempts in the literature to design robust modifications [30, 45, 60]. In general, however, loss distributions can display long tails in either direction over the learning process (see Figure 3), particularly when losses are unbounded below (e.g., negative rewards [54], the unhinged loss [57]), and loss functions whose empirical mean has no minimum appear frequently (e.g., separable logistic regression [1, 50]). Since the tail behavior of stochastic losses and gradients is well known to play a critical role in the stability and robustness of learning systems [12, 61], the inability to control tail sensitivity in both directions represents a genuine limitation of current machine learning methodology.

A natural alternative class of risk functions that gives us control over tail sensitivity in both directions is that of the “M-location” of the loss distribution, namely any value in

\operatorname*{arg\,min}_{\theta\in\mathbb{R}} \mathbf{E}_{\mu}\,\rho\left(\mathsf{L}(h)-\theta\right) \subset \mathbb{R} \qquad (1)

where $\rho:\mathbb{R}\to\mathbb{R}_{+}$ is assumed to be such that this set of minimizers is non-empty. Various special choices of $\rho$ let us recover well-known locations, such as the mean (with $\rho(\cdot)=(\cdot)^{2}$), the median ($\rho(\cdot)=\lvert\cdot\rvert$), arbitrary quantiles (via the "pinball" function [56]), and even "expectiles" (using curved variants of the pinball function [21]). The obvious limitation here is that while computing (1) using empirical estimates is easy, minimization as a function of $h$ is in general a difficult bi-level programming problem. As an alternative approach, in this paper we study the potential benefits and tradeoffs that arise in using performance criteria of the form

\min_{\theta\in\mathbb{R}}\left[\eta\theta+\mathbf{E}_{\mu}\,\rho\left(\mathsf{L}(h)-\theta\right)\right] \qquad (2)

where $\eta\in\mathbb{R}$. By sacrificing some fidelity to the M-location (1), the criterion in (2) suggests a congenial objective function (joint in $(h,\theta)$). Intuitively, one minimizes the sum of generalized "location" and "dispersion" properties, and the nature of this dispersion impacts the fidelity of the location term to the original M-location induced by $\rho$. These two criteria align perfectly in the special case where we set $\rho(\cdot)=(\cdot)^{2}/2$ and $\eta=1$, since (2) is then equivalent (as a function of $h$) to the mean-variance objective $\mathbf{E}_{\mu}\mathsf{L}(h)+\operatorname{var}_{\mu}\mathsf{L}(h)/2$; more generally, allowing for more diverse choices of $\rho$ gives us new freedom in terms of tail control with respect to both location and dispersion. We consider a concrete yet flexible class of risk functions that generalizes beyond (2), allows for easy implementation, and is analytically tractable from the standpoint of providing formal learning guarantees. Our main contributions are as follows:

  • A new class of “threshold risks” (T-risks; defined in §3) that provide a tractable alternative to M-locations, and a bi-directional complement to OCE/DRO risks (reviewed in §2.2).

  • A stochastic learning algorithm for T-risks that enjoys high-probability guarantees of convergence to a stationary point under heavy-tailed losses/gradients, without manual clipping (details in §4, Theorem 3).

  • Strong empirical evidence of the flexibility and utility inherent in T-risk learners. In particular: robustness to unbalanced noisy class labels without regularization (Figure 4), sharp control over sensitivity to outliers in regression with convex base losses (Figure 5), and smooth interpolation between mean and mean-variance minimizers on clean, normalized benchmark classification datasets (Figures 6, 7 and 14–18).
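Before moving on, the two objects at the heart of the paper, the M-location (1) and the criterion (2), can be illustrated in a few lines of NumPy. The sketch below is purely illustrative (it is not taken from the paper's repository): it recovers the mean, median, and a quantile as empirical M-locations by grid search, and checks the mean-variance special case of (2). With $\rho(x)=x^{2}/2$ and $\eta=1$, minimizing $\theta \mapsto \theta + \operatorname{mean}((L-\theta)^{2})/2$ in closed form gives $\theta^{\ast} = \bar{L} - 1$ and a minimal value of $\bar{L} + \operatorname{var}(L)/2 - 1/2$, i.e., the mean-variance objective shifted by a constant that does not depend on $h$.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.lognormal(0.0, 1.0, size=50_000)  # a skewed stand-in for a loss sample

def m_location(losses, rho, grid):
    """Empirical version of (1): grid minimizer of theta -> mean(rho(L - theta))."""
    vals = np.array([np.mean(rho(losses - t)) for t in grid])
    return grid[np.argmin(vals)]

grid = np.linspace(0.0, 6.0, 3001)
beta = 0.7
pinball = lambda u: np.where(u >= 0.0, beta * u, (beta - 1.0) * u)

loc_mean = m_location(L, np.square, grid)  # rho = x^2 gives the mean
loc_med = m_location(L, np.abs, grid)      # rho = |x| gives the median
loc_q = m_location(L, pinball, grid)       # pinball gives the beta-quantile

# Criterion (2) with rho(x) = x^2/2 and eta = 1: mean-variance up to a constant.
tmin = min(t + np.mean((L - t) ** 2) / 2.0 for t in np.linspace(-3.0, 6.0, 4501))
```

The grid search stands in for the bi-level structure mentioned above: for each fixed candidate, the inner problem over $\theta$ is one-dimensional and easy; the difficulty arises only when $h$ must be optimized through it.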

The overall flow of the paper is as follows. Background information on notation and related literature is given in §2, and we introduce the new risk class of interest in §3. Formal aspects of the learning problem using these risks are treated in §4, and empirical findings are explained and discussed in §5. All formal proofs and supplementary results are organized in §A–§D of the appendix, and code for reproducing all the results in this paper is provided in an online repository: https://github.com/feedbackward/bdd

2 Background

2.1 Notation

Random quantities

To start, let us clarify the nature of the random losses we consider. Given the underlying probability space $(\Omega,\mathcal{F},\mu)$, we write $\mathsf{L}(h) := \mathsf{L}(h;\cdot):\Omega\to\mathbb{R}$ to refer to a random variable (i.e., an $\mathcal{F}$-measurable function) on $\Omega$, though we only use the form $\mathsf{L}(h)$ in the body of this paper. When we talk about "sampling" losses or a "random draw" of the losses, this amounts to computing a realization $\mathsf{L}(h;\omega)\in\mathbb{R}$. We use standard notation for taking expectations, e.g., $\mathbf{E}_{\mu}\mathsf{L}(h) := \int_{\Omega}\mathsf{L}(h;\omega)\,\mu(\mathrm{d}\omega)$. These conventions extend to random quantities based on the losses (e.g., the gradient $\mathsf{L}'(h)$ considered in §4.1). Similarly, we will use $\mathbf{P}$ as a general-purpose probability function, representing both $\mu$ and product measures; when the source of randomness is not immediate from the context, it will be stated explicitly. We will use $\operatorname{R}(\cdot)$ as a generic symbol for risk functions (often modified with subscripts), with the understanding that $\operatorname{R}(\cdot)$ maps random losses $\mathsf{L}(h)$ to real values. We will overload this notation, writing $\operatorname{R}(\mathsf{L})$ when the role of $h$ is unimportant, and writing $\operatorname{R}(h) := \operatorname{R}(\mathsf{L}(h))$ when we want to emphasize the dependence on $h$. This convention will be applied to other quantities as well, such as writing $\operatorname{D}_{\rho}(h) := \operatorname{D}_{\rho}(\mathsf{L}(h)) := \mathbf{E}_{\mu}\rho(\mathsf{L}(h)-\theta)$ for the expected dispersion induced by $\rho$, first defined in (22).

Norms

We will use $\lVert\cdot\rVert$ as a general-purpose notation for all norms that appear in this paper. That is, we do not use different notation to distinguish different norm spaces. The reason for this is that we will never consider two distinct norms on the same set; each norm is associated with a distinct set, and thus as long as it is clear which set a particular element belongs to, there should be no confusion. The only exception to this rule is the special case of $\mathbb{R}$, in which we write $\lvert\cdot\rvert$ for the absolute value, as is traditional.

Miscellaneous

For a function $f:\mathbb{R}\to\mathbb{R}$ in one variable, we use $f'$ to denote the usual derivative. More general notions (e.g., Gateaux or Fréchet differentials) only make an appearance in §4, and the generality they afford us is not crucial to the main narrative, so the details can be easily skipped over if the reader is unfamiliar with such concepts. All other undefined notation we use is essentially standard, and can be found in most introductory analysis textbooks. Particularly in formal proofs, we will frequently make use of the shorthand $\rho_{\sigma}(x) := \rho(x/\sigma)$, where $\sigma>0$ is a scaling parameter.

2.2 Review of key risk functions

OCE-type risks

As a computationally convenient way to interpolate between the mean and the extreme values of $\mathsf{L}(h)$, the tilted risk [37, 38] is a natural choice, defined for $\gamma\neq 0$ as

\operatorname{R}_{\textup{tilt}}(h;\gamma) := \frac{1}{\gamma}\log\left(\mathbf{E}_{\mu}\,\mathrm{e}^{\gamma\mathsf{L}(h)}\right) \qquad (3)

This is simply a re-scaling of the cumulant generating function of $\mathsf{L}(h)$, viewed as a function of $h$; taking $\gamma\to\infty$ and $\gamma\to-\infty$ lets us approach the supremum and infimum of $\mathsf{L}(h)$, respectively. Another important class of risk functions is based upon the conditional value-at-risk (CVaR) [46], defined for $0\leq\beta<1$ as

\operatorname{R}_{\textup{CVaR}}(h;\beta) := \mathbf{E}_{\mu}\left[\mathsf{L}(h)\,\middle|\,\mathsf{L}(h)\geq\operatorname{Q}_{\beta}(h)\right] \qquad (4)

This is the expected loss at $h$, conditioned on the event that the loss exceeds the $\beta$-quantile of $\mathsf{L}(h)$, denoted here by $\operatorname{Q}_{\beta}(h) := \inf\{x\in\mathbb{R}: \mathbf{P}\{\mathsf{L}(h)\leq x\}\geq\beta\}$. Both of these risk functions can be re-written in a form similar to that of (2), namely

h \mapsto \inf_{\theta\in\mathbb{R}}\left[\theta+\mathbf{E}_{\mu}\,\phi\left(\mathsf{L}(h)-\theta\right)\right] \qquad (5)

where $\phi(x)=(\mathrm{e}^{\gamma x}-1)/\gamma$ yields $\operatorname{R}_{\textup{tilt}}$ (basic calculus), and $\phi(x)=\max\{0,x\}/(1-\beta)$ yields $\operatorname{R}_{\textup{CVaR}}$ (see Rockafellar and Uryasev [46, 47]). When $\phi:\mathbb{R}\to\mathbb{R}$ is restricted to be a non-decreasing, closed, convex function satisfying both $\phi(0)=0$ and $1\in\partial\phi(0)$, the mapping given in (5) is called an optimized certainty equivalent (OCE) risk [7, 8, 36]. The class of OCE risks strictly generalizes the expected value (noting $\phi(x)=x$ is valid), and includes $\operatorname{R}_{\textup{tilt}}(\cdot;\gamma)$ for $\gamma>0$, as well as $\operatorname{R}_{\textup{CVaR}}$. The mean-variance criterion is sometimes stated to be an OCE risk [36, Table 1], but this fails to hold when losses are unbounded above and below.
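The two $\phi$ choices above are easy to check numerically. The following sketch (our own illustration, using grid search over $\theta$ on a common sample) verifies that the OCE form (5) reproduces the direct conditional-expectation form of CVaR in (4), and that with the exponential $\phi$ the infimum in (5) coincides with the closed-form tilted risk (3); for the tilted case the optimal $\theta$ is itself the risk value.

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal(50_000)

def oce_dual(phi, losses, grid):
    """Evaluate the OCE form (5): inf over theta of theta + mean(phi(L - theta))."""
    return min(t + np.mean(phi(losses - t)) for t in grid)

grid = np.linspace(-1.0, 4.0, 1251)

# CVaR at level beta via phi(x) = max(0, x)/(1 - beta).
beta = 0.9
cvar_dual = oce_dual(lambda x: np.maximum(0.0, x) / (1.0 - beta), L, grid)
q = np.quantile(L, beta)
cvar_direct = L[L >= q].mean()  # conditional-expectation form (4)

# Tilted risk via phi(x) = (exp(gamma * x) - 1)/gamma.
gamma = 1.0
tilt_dual = oce_dual(lambda x: (np.exp(gamma * x) - 1.0) / gamma, L, grid)
tilt_closed = np.log(np.mean(np.exp(gamma * L))) / gamma  # definition (3)
```

Both dual values agree with their direct counterparts up to grid and sampling resolution, and both sit above the sample mean, reflecting the upside tail emphasis discussed in the text.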

DRO-type risks

Another important class consists of robustly regularized risks, designed to ensure that risk minimizers are robust to a certain degree of divergence from the underlying data model. Making this concrete, it is typical to assume the random losses are outputs of a loss function $\ell$ depending on the candidate $h$ and some random data $\mathsf{Z}$, i.e., $\mathsf{L}(h)=\ell(h;\mathsf{Z})$, with $\mathsf{Z}\sim\mu$ as our reference model. To measure divergence from this reference model, it is convenient to use the Cressie-Read family of divergence functions [17, 60]: for any $c>1$, assuming $\nu\ll\mu$ (absolute continuity) holds, these are functions of the form

\operatorname{Div}_{c}(\nu;\mu) := \mathbf{E}_{\mu} f_{c}\!\left(\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right), \quad \text{where } f_{c}(x) := \frac{x^{c}-cx+c-1}{c(c-1)} \qquad (6)

and $\mathrm{d}\nu/\mathrm{d}\mu$ is the Radon-Nikodym density of $\nu$ with respect to $\mu$. (For background on absolute continuity and density functions, see Ash and Doléans-Dade [3, §2.2].) The resulting robustly regularized risk, called the DRO risk, is defined as

\operatorname{R}_{\textup{DRO}}(h) := \sup\left\{\mathbf{E}\,\mathsf{L}(h) : \mathsf{L}\in\mathcal{L}\right\} \qquad (7)

where the constrained set of random losses $\mathcal{L}$, determined by $c>1$ and $a>0$, is defined as

\mathcal{L} := \left\{\mathsf{L}(\cdot)=\ell(\cdot;\mathsf{Z}) : \mathsf{Z}\sim\nu \text{ and } \operatorname{Div}_{c}(\nu;\mu)\leq a\right\} \qquad (8)

For this particular family of divergences, the risk can be characterized as the optimal value of a simple optimization problem [17], namely we have that

\operatorname{R}_{\textup{DRO}}(h) = \inf_{\theta\in\mathbb{R}}\left[\theta+\left(1+c(c-1)a\right)^{1/c}\left(\mathbf{E}_{\mu}\left(\mathsf{L}(h)-\theta\right)_{+}^{c_{\ast}}\right)^{1/c_{\ast}}\right] \qquad (9)

where $c_{\ast} := c/(c-1)$. While strictly speaking this is not an OCE risk, note that if we set $\phi(x)=(1+c(c-1)a)^{c_{\ast}/c}(x)_{+}^{c_{\ast}}$, then the DRO risk can be written as

\operatorname{R}_{\textup{DRO}}(h) = \inf_{\theta\in\mathbb{R}}\left[\theta+\left[\mathbf{E}_{\mu}\,\phi\left(\mathsf{L}(h)-\theta\right)\right]^{1/c_{\ast}}\right] \qquad (10)

giving us an expression of this risk as the sum of a threshold and an asymmetric dispersion. Setting $c=2$ yields the well-known special case of the $\chi^{2}$-DRO risk [26, 60]. In addition to the one-directional nature of the dispersion term, all of these risks are at least as sensitive to loss tails (on the upside) as the classical expected loss $\mathbf{E}_{\mu}\mathsf{L}(h)$; this holds for CVaR (with $\beta>0$), tilted risk (with $\gamma>0$), and even robust variants of the DRO risk [60].
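The dual form (9) is itself directly computable from a sample. The sketch below (our own illustration, with $c=2$ and grid search over $\theta$; the divergence radii $a=0.1$ and $a=0.5$ are arbitrary choices) evaluates the $\chi^{2}$-DRO risk and checks two qualitative properties consistent with the discussion above: the risk exceeds the sample mean, and it grows as the divergence budget $a$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
L = rng.standard_normal(50_000)

def dro_risk(losses, a, c=2.0):
    """Cressie-Read DRO dual (9), evaluated by grid search over theta."""
    cs = c / (c - 1.0)                             # c_* = c/(c-1)
    coef = (1.0 + c * (c - 1.0) * a) ** (1.0 / c)  # (1 + c(c-1)a)^(1/c)
    grid = np.linspace(losses.min(), losses.max(), 2001)
    return min(
        t + coef * np.mean(np.maximum(0.0, losses - t) ** cs) ** (1.0 / cs)
        for t in grid
    )

r_small = dro_risk(L, a=0.1)  # chi^2-DRO (c = 2) with a small divergence budget
r_large = dro_risk(L, a=0.5)  # larger budget -> more pessimistic risk
```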

Key differences

While the preceding risk classes in the forms (5) and (10), based on various choices of $\phi(\cdot)$, clearly share the form of our $\rho(\cdot)$-based risk of interest in (2), they are fundamentally different in that none of these choices of $\phi(\cdot)$ induce a meaningful M-location; since all these $\phi(\cdot)$ are monotonic on $\mathbb{R}$, both minimization and maximization of $\theta\mapsto\mathbf{E}_{\mu}\phi(\mathsf{L}(h)-\theta)$ are trivially accomplished by taking $\lvert\theta\rvert\to\infty$. In stark contrast, $\rho(\cdot)$ is assumed to be such that the solution set in (1) is a non-empty subset of the real line. We will introduce a concrete and flexible class from which $\rho$ will be taken in §3, and in Figure 1 give a side-by-side comparison with the $\phi$ functions discussed in the preceding paragraphs.

2.3 Closely related work

This work falls into the broad context of machine learning driven by novel risk functions [28]. Of all the papers cited above, the works of Lee et al. [36] on OCE risks and Li et al. [37, 38] on tilted risks are closest in nature to our paper, the key difference being that our risk class is fundamentally different, as described in the preceding paragraphs. Indeed, many of our empirical tests involve direct comparison with the risk classes studied in these works (e.g., Figures 2, 4, and 12), and so they provide critical context for our work. Previous work by Holland [30] studies a rudimentary special case of what we call "minimal T-risk" here; the focus in that work was on obtaining learning guarantees (in expectation) when the risk is potentially non-convex and non-smooth in $h$, but with convex $\rho$, and no comparison was made with the OCE/DRO risk classes. We build upon these results here, considering a broad class of dispersions $\rho$ which are differentiable but need not be convex (see (15)); we show how such risk classes readily admit high-probability learning guarantees for stochastic gradient-based algorithms (Theorem 3), provide bounds on the average loss incurred by empirical risk minimizers using our risk (Proposition 7), and make detailed empirical comparisons with each of the key existing risk classes.

3 Threshold risk

To ground ourselves conceptually, let us refer to $\mathsf{L}(h)$ as the base loss incurred by $h$. The exact nature of $h$ is left completely abstract for the moment, as all that matters is the probability distribution of this base loss. By selecting an arbitrary threshold $\theta\in\mathbb{R}$, we define a broad class of properties as

\operatorname{R}_{\rho}(h;\theta,\eta) := \eta\theta+\mathbf{E}_{\mu}\,\rho\left(\mathsf{L}(h)-\theta\right) \qquad (11)

Here $\eta\in\mathbb{R}$ is a weighting parameter allowed to be negative, and as a bare minimum, $\rho$ is assumed to be such that the resulting M-location(s) are well-defined in the sense that the inclusion in (1) holds. We call $\rho(\mathsf{L}(h)-\theta)$ the (random) dispersion of the base loss, taken with respect to the threshold $\theta$, and we refer to $\operatorname{R}_{\rho}(h;\theta,\eta)$ in (11) as the threshold risk (or simply T-risk) under $\rho$.

3.1 Minimal T-risk and M-location

Arguably the most intuitive special case of T-risk is the minimal T-risk, in which we minimize with respect to the threshold $\theta\in\mathbb{R}$. Let us denote this risk and the optimal threshold set by

\underline{\operatorname{R}}_{\rho}(h;\eta) := \inf_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta), \qquad \theta_{\rho}(h;\eta) := \operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta) \qquad (12)

Clearly, if $\rho$ is bounded above or grows too slowly, we will have $\underline{\operatorname{R}}_{\rho}(h;\eta)=-\infty$ and no real-valued minimizers, i.e., $\theta_{\rho}(h;\eta)=\emptyset$. Letting $\operatorname{M}_{\rho}(h)$ denote the set of M-locations in (1), for $\eta\neq 0$ we have

\theta_{\rho}(h;\eta)\neq\emptyset \implies \operatorname{M}_{\rho}(h)\neq\emptyset \qquad (13)

although the converse does not hold in general (for example, consider choices of $\rho$ that are "re-descending" [32]). When $\eta=0$, these two solution sets align, i.e., we have $\theta_{\rho}(h;0)=\operatorname{M}_{\rho}(h)$. More generally, depending on the sign of $\eta$, the optimal thresholds can be either larger or smaller than the corresponding M-locations. More precisely, for any $\theta'\in\operatorname{M}_{\rho}(h)$, as long as $\theta_{\rho}(h;\eta)$ is non-empty, there exists $\theta\in\theta_{\rho}(h;\eta)$ such that $\theta\operatorname{sign}(\eta)\leq\theta'\operatorname{sign}(\eta)$.
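The sign relation just stated is easy to see in the quadratic case, where closed forms exist: with $\rho(x)=x^{2}/2$, the first-order condition for (12) gives $\theta^{\ast} = \mathbf{E}_{\mu}\mathsf{L} - \eta$, while the M-location is the mean itself. The sketch below (our own illustration, by grid search on an asymmetric sample) confirms that the optimal threshold sits below the M-location for $\eta>0$ and above it for $\eta<0$, with $\eta=0$ recovering the M-location exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
L = rng.exponential(1.0, size=50_000)  # asymmetric base losses

def opt_threshold(losses, eta, grid):
    """Grid minimizer in (12) with rho(x) = x^2/2 (closed form: mean - eta)."""
    vals = np.array([eta * t + np.mean((losses - t) ** 2) / 2.0 for t in grid])
    return grid[np.argmin(vals)]

grid = np.linspace(-3.0, 5.0, 4001)
m_loc = opt_threshold(L, eta=0.0, grid=grid)    # eta = 0 recovers the M-location (the mean)
th_pos = opt_threshold(L, eta=1.0, grid=grid)   # eta > 0: threshold sits below the M-location
th_neg = opt_threshold(L, eta=-1.0, grid=grid)  # eta < 0: threshold sits above the M-location
```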

Special case minimized by quantiles

The form given in (11) is very general, but it can be understood as a straightforward generalization of the convex objective used to characterize quantiles. More precisely, taking $\beta\in(0,1)$ and denoting the $\beta$-quantile of the base loss by $\operatorname{Q}_{\beta}(h) := \inf\{x\in\mathbb{R}: \mathbf{P}\{\mathsf{L}(h)\leq x\}\geq\beta\}$, it is well known that in the special case of $\rho(\cdot)=\lvert\cdot\rvert$, we have

\operatorname{Q}_{\beta}(h) \in \theta_{\rho}(h;1-2\beta) \qquad (14)

for any choice of $0<\beta<1$, as long as $\mathbf{E}_{\mu}\lvert\mathsf{L}(h)\rvert$ is finite [35]. The T-risk in (11) simply allows for a more flexible choice of $\rho$, and thus generalizes the dispersion term in this objective function.
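To see why the quantile is optimal here, note that differentiating $\theta \mapsto (1-2\beta)\theta + \mathbf{E}_{\mu}\lvert\mathsf{L}-\theta\rvert$ gives $(1-2\beta) + (2F(\theta)-1)$, where $F$ is the distribution function of the loss; this vanishes exactly when $F(\theta)=\beta$. A quick numerical check (our own illustration, via grid search on a sample):

```python
import numpy as np

rng = np.random.default_rng(4)
L = rng.standard_normal(50_000)
beta = 0.8
eta = 1.0 - 2.0 * beta  # the eta in (14) that makes the beta-quantile optimal

# T-risk objective (11) with rho = |.| and the above eta, minimized over theta.
grid = np.linspace(-2.0, 3.0, 2501)
vals = np.array([eta * t + np.mean(np.abs(L - t)) for t in grid])
theta_star = grid[np.argmin(vals)]
```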

3.2 T-risk with scaled Barron dispersion

Figure 1: Left-most plot: the graph of $x\mapsto\rho(x/\sigma;\alpha)$ from §3.2 for varying choices of $\alpha$, with $\sigma$ fixed to $\sigma=0.2$ for visual ease. Middle two plots: graphs of $\phi(x)$ in (5) for CVaR and tilted risk, respectively over different choices of $\beta$ and $\gamma$. Right-most plot: graph of $\phi(x)$ in (10) for the $\chi^{2}$-DRO risk, where $a\geq 0$ is re-parametrized using $0\leq\widetilde{a}<1$ via the relation $a=((1-\widetilde{a})^{-1}-1)^{2}/2$.

In order to capture a range of sensitivities to loss tails in both directions, we would like to select $\rho$ from a class of functions that gives us sufficient control over scale, boundedness, and growth rates. As a concrete choice, we propose to set $\rho$ in (11) as $\rho(x)=\rho(x/\sigma;\alpha)$, where $\sigma>0$ is a scaling parameter, and $\rho(\cdot;\alpha)$ with shape $\alpha\in[-\infty,2]$ is a family of functions ranging from bounded and logarithmic growth on the lower end to quadratic growth on the upper end, defined as:

\rho(x;\alpha) := \begin{cases} x^{2}/2, & \text{if } \alpha=2 \\ \log\left(1+x^{2}/2\right), & \text{if } \alpha=0 \\ 1-\exp\left(-x^{2}/2\right), & \text{if } \alpha=-\infty \\ \dfrac{\lvert\alpha-2\rvert}{\alpha}\left(\left(1+\dfrac{x^{2}}{\lvert\alpha-2\rvert}\right)^{\alpha/2}-1\right), & \text{otherwise} \end{cases} \qquad (15)

At a high level, $\rho(\cdot;\alpha)$ is approximately quadratic near zero for any choice of shape $\alpha$, but its growth as one deviates far from zero depends greatly on $\alpha$. We refer to (15) as the Barron class of functions for computing dispersion. (The naming reflects the fact that Barron [4] recently studied this class in the context of designing loss functions for computer vision applications; this differs considerably from our usage in computing the dispersion of random losses, where the loss function underlying the base loss is left completely arbitrary.) Recalling the risks reviewed in §2.2, since $\rho(\cdot;\alpha)$ is flat at zero and symmetric about zero, the Barron class clearly takes us beyond the functions $\phi(\cdot)$ allowed by OCE risks (5) and used in typical DRO risk definitions (10); see Figure 1 for a visual comparison.
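The case analysis in (15) transcribes directly into code. The sketch below is our own illustration (not the paper's repository implementation); the numerical checks confirm the limiting behavior claimed in the text, namely that the generic branch approaches the $\alpha=2$, $\alpha=0$, and $\alpha=-\infty$ branches, and that the dispersion is bounded for $\alpha<0$ (with supremum $\lvert\alpha-2\rvert/\lvert\alpha\rvert$, e.g., 2 when $\alpha=-2$).

```python
import numpy as np

def barron(x, alpha):
    """The Barron dispersion rho(x; alpha) of (15)."""
    x = np.asarray(x, dtype=float)
    if alpha == 2:
        return 0.5 * x**2
    if alpha == 0:
        return np.log1p(0.5 * x**2)
    if alpha == -np.inf:
        return 1.0 - np.exp(-0.5 * x**2)
    c = abs(alpha - 2.0)  # |alpha - 2| in the generic branch
    return (c / alpha) * ((1.0 + x**2 / c) ** (alpha / 2.0) - 1.0)

xs = np.linspace(-3.0, 3.0, 101)
```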

As mentioned in §2.1, we will often use the generic shorthand $\rho_{\sigma}(x) := \rho(x/\sigma)$, dropping the dependence on $\alpha$ when clear from context. The shape parameter $\alpha$ gives us direct control over the conditions needed for a finite T-risk $\operatorname{R}_{\rho}(h;\theta,\eta)$, as the following lemma shows.

Lemma 1 (Finiteness and shape).

Let $\rho$ be from the Barron class (15). Then in order to ensure that $\mathbf{E}_{\mu}\rho_{\sigma}(\mathsf{L}(h)-\theta)<\infty$ holds for all $\theta\in\mathbb{R}$, each of the following conditions (depending on the value of $\alpha$) is sufficient. For $0<\alpha\leq 2$, let $\mathbf{E}_{\mu}\lvert\mathsf{L}(h)\rvert^{\alpha}<\infty$. For $\alpha=0$, let $\mathbf{E}_{\mu}\lvert\mathsf{L}(h)\rvert^{c}<\infty$ for some $c>0$. For $-\infty\leq\alpha<0$, it suffices that $\mathsf{L}(h)$ be $\mathcal{F}$-measurable. Furthermore, in the cases where $\alpha\neq 0$, the above conditions are also necessary.

Assuming $\mu$-integrability as in Lemma 1, the Barron class furnishes a non-empty set of M-locations $\operatorname{M}_{\rho}(h)$ for any choice of $\alpha$, and when restricted to $\alpha\geq 1$ with appropriate settings of $\eta$ and $\sigma$, the optimal threshold set $\theta_{\rho}(h;\eta)$ contains a unique solution (see Lemma 10). For any valid choice of $\alpha$, the function $\rho(\cdot;\alpha)$ is twice continuously differentiable on $\mathbb{R}$ (see §D.3 for exact expressions). All the limits in $\alpha$ behave as we would expect: $\rho(x;\alpha)\to\rho(x;c)$ as $\alpha\to c$ for $c\in\{-\infty,0,2\}$ (see §B.2 for details). For $\alpha\geq 0$, the dispersion function is unbounded, with growth ranging from logarithmic to quadratic depending on the choice of $\alpha$. For $\alpha<0$, the dispersion function is bounded. The mapping $x\mapsto\rho_{\sigma}(x;\alpha)$ is convex on $\mathbb{R}$ for $\alpha\geq 1$; for $\alpha<1$ it is convex only between $\pm\sigma\sqrt{\lvert\alpha-2\rvert/(1-\alpha)}$, and concave elsewhere (see Lemma 8). The class $\operatorname{R}_{\rho}(h;\theta,\eta)$ of T-risks (11) under the scaled Barron dispersion $\rho_{\sigma}(x;\alpha)$ is the central focus of this paper.
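The convexity boundary $\pm\sigma\sqrt{\lvert\alpha-2\rvert/(1-\alpha)}$ for $\alpha<1$ can be confirmed numerically without any calculus, by estimating the second derivative with central finite differences just inside and just outside the claimed boundary (our own sketch, with $\sigma=1$ and a few representative shapes $\alpha<1$):

```python
import numpy as np

def barron(x, alpha):
    """Barron dispersion (15); only the alpha = 0 and generic branches are needed here."""
    if alpha == 0:
        return np.log1p(0.5 * x**2)
    c = abs(alpha - 2.0)
    return (c / alpha) * ((1.0 + x**2 / c) ** (alpha / 2.0) - 1.0)

def second_diff(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

curvature = {}
for alpha in (0.0, 0.5, -2.0):  # shapes with alpha < 1
    b = np.sqrt(abs(alpha - 2.0) / (1.0 - alpha))  # claimed convexity boundary (sigma = 1)
    curvature[alpha] = (
        second_diff(lambda u: barron(u, alpha), 0.9 * b),  # just inside: convex region
        second_diff(lambda u: barron(u, alpha), 1.1 * b),  # just outside: concave region
    )
```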

3.3 Sensitivity to outliers and tail direction

Before we consider the learning problem, which typically involves the evaluation of many different loss distributions over the course of training, here we consider a fixed distribution, and numerically compare the T-risk (11) and M-location (1) induced by $\rho$ from the Barron class in §3.2, along with the key OCE and DRO risks discussed in §2.2.

Experiment setup

We generate random values to simulate loss distributions, and evaluate how the values returned by each risk function change as we modify their respective parameters. Letting $\mathsf{L}$ denote the random base loss being simulated, we specify a parametric distribution for $\mathsf{L}$, from which we take an independent sample $\{\mathsf{L}_{1},\ldots,\mathsf{L}_{m}\}$. In all cases, we center the true distribution such that $\mathbf{E}_{\mu}\mathsf{L}=0$. We use this common sample to compare the values returned by each of the aforementioned risks, as well as the optimal choice of the threshold parameter $\theta$. To ensure that key trends are consistent across samples, we take a large sample size of $m=10^{4}$. For the T-risk, we adjust $\alpha$ and $\eta$; for the M-location, just $\alpha$. In both cases, we leave $\sigma=0.5$ fixed. For CVaR, we modify the quantile level $0<\beta<1$. For tilted risk, we modify the parameter $\gamma\in\mathbb{R}$. For the $\chi^{2}$-DRO risk, we modify $0<\widetilde{a}<1$, having re-parameterized $a$ in (8) by $a := ((1-\widetilde{a})^{-1}-1)^{2}/2$, as is common practice [60].

Representative results

An illustrative example is given in Figure 2, where we look at how each risk class behaves under a centered asymmetric distribution, before and after flipping it (i.e., under 𝖫\operatorname{\mathsf{L}} and −𝖫\operatorname{\mathsf{L}}). Starting from the two left-most plots, we show R¯ρ(𝖫;η)\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}};\eta) (dashed curves) and θρ(𝖫;η)\theta_{\rho}(\operatorname{\mathsf{L}};\eta) (solid curves) from (12) as a function of α\alpha, coloring the area between these graphs in gray. The first plot corresponds to η=1\eta=1, the second to η=1\eta=-1. Similarly for the M-location we plot Mρ(𝖫)\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}) (solid) and Mρ(𝖫)+𝐄μρ(𝖫Mρ(𝖫))\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})+\operatorname{\mathbf{E}}_{\mu}\rho(\operatorname{\mathsf{L}}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})) (dashed). Analogous values are plotted for each of the other classes; note that for the tilted risk (3) with γ>0\gamma>0, the optimal threshold and the risk value are in fact the same value (see §B.1). The right-most plot is a histogram of the random sample {𝖫1,,𝖫m}\{\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{m}\}, here from a centered log-Normal distribution. All plots share a common vertical axis, and horizontal rules are drawn at the median (red, solid) and at the mean (gray, dotted; always zero due to centering). The critical point to emphasize here is how all the OCE and DRO risks here are highly asymmetric in terms of their tail sensitivity, in stark contrast with both the M-location and the T-risk. Turning tail sensitivity high enough in each of these classes (e.g., α>1.5\alpha>1.5, β>0.5\beta>0.5, γ>1.0\gamma>1.0, a~>0.5\widetilde{a}>0.5), note how flipping the distribution tails from the upside (top row) to the downside (bottom row) leads to a dramatic decrease in all risks but the T-risk and M-location. 
Finally, note how the T-risk thresholds $\theta_{\rho}(\mathsf{L};\eta)$ approach the M-location $\mathrm{M}_{\rho}(\mathsf{L})$ from above or below depending on whether $\eta$ is negative or positive. Results for numerous distributions are available in our online repository (cf. §1).

Figure 2: Evaluation of risk class behavior after flipping an asymmetric distribution.

4 Learning algorithm analysis

We now proceed to the learning problem, in which the goal is ultimately to select a candidate $h$ such that the distribution of $\mathsf{L}(h)$ is "optimal" in the sense of achieving the smallest possible value of the T-risk $\mathrm{R}_{\rho}(h;\theta,\eta)$ in (11), with $\rho$ taken from the Barron class (15) and $h$ taken from some set $\mathcal{H}$. An obvious take-away of the integrability conditions (Lemma 1) is that even when the base loss is heavy-tailed in the sense of having infinite higher-order moments, we can always adjust the dispersion function $\rho(\cdot;\alpha)$ in such a way that transforming the base loss to obtain the new feedback

$\mathsf{L}_{\rho}(h;\theta,\eta) := \eta\theta + \rho_{\sigma}(\mathsf{L}(h)-\theta;\alpha)$  (16)

gives us an unbiased estimator of the finite T-risk, i.e., $\mathbf{E}_{\mu}\mathsf{L}_{\rho}(h;\theta,\eta)=\mathrm{R}_{\rho}(h;\theta,\eta)\in\mathbb{R}$. Intuitively, one expects that a similar property can be leveraged to control heavy-tailed stochastic gradients used in an iterative learning algorithm. We explore this point in detail in §4.1. We then complement this analysis in §4.2 by considering the basic properties of the T-risk at the minimal threshold $\underline{\mathrm{R}}_{\rho}(h;\eta)$ given in (12), viewed from the perspectives of axiomatic risk design and empirical risk minimization.
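The transformation (16) is a few lines of code. The sketch below uses the general robust loss of Barron (2019) as the dispersion function, assuming this is the form of the paper's Barron class (15) with the scaling $\rho_\sigma(x;\alpha)=\rho(x/\sigma;\alpha)$; both are assumptions for illustration.

```python
import numpy as np

def barron_rho(x, alpha, sigma=1.0):
    # Barron (2019) robust loss, used as the dispersion function rho_sigma(.; alpha);
    # the special cases alpha = 2, 0, -inf avoid division by zero in the general form.
    u = np.asarray(x, dtype=float) / sigma
    if alpha == 2.0:
        return 0.5 * u**2
    if alpha == 0.0:
        return np.log(0.5 * u**2 + 1.0)
    if alpha == -np.inf:
        return 1.0 - np.exp(-0.5 * u**2)
    c = abs(alpha - 2.0)
    return (c / alpha) * ((u**2 / c + 1.0) ** (alpha / 2.0) - 1.0)

def transformed_loss(L, theta, eta, alpha, sigma=1.0):
    # Feedback (16): eta * theta + rho_sigma(L - theta; alpha). Averaging this
    # over a sample gives an unbiased estimate of the T-risk R_rho(h; theta, eta).
    return eta * theta + barron_rho(L - theta, alpha, sigma)

rng = np.random.default_rng(1)
L = rng.standard_t(df=3, size=10**5)  # heavy-tailed base losses (infinite higher moments)
est = transformed_loss(L, theta=0.5, eta=1.0, alpha=0.0).mean()
print(est)  # finite sample-mean estimate of the T-risk
```

With $\alpha \leq 0$ the dispersion grows at most logarithmically, so the sample mean above is finite and stable even though the raw losses have infinite higher-order moments.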

4.1 T-risk and stochastic gradients

For the time being, fix an arbitrary threshold $\theta\in\mathbb{R}$, and assuming the gradient $\mathsf{L}'(h)$ is $\mu$-almost surely finite, denote the partial derivative of the transformed losses (16) with respect to $h$ by

$\partial_h\mathsf{L}_{\rho}(h;\theta,\eta) := \rho'_{\sigma}(\mathsf{L}(h)-\theta)\,\mathsf{L}'(h).$  (17)

Writing $\partial_h\mathrm{R}_{\rho}$ for the gradient of $h\mapsto\mathrm{R}_{\rho}(h;\theta,\eta)$, an analogue of Lemma 1 holds for gradients.

Lemma 2 (Unbiased gradients).

Let $\mathcal{U}$ be an open subset of any metric space such that $\mathcal{H}\subseteq\mathcal{U}$. Let the base loss map $h\mapsto\mathsf{L}(h)$ be Fréchet differentiable on $\mathcal{U}$ ($\mu$-almost surely), with gradient denoted by $\mathsf{L}'(h)$ for each $h\in\mathcal{U}$. Fixing any choice of $-\infty\leq\alpha\leq 2$, we have that

$\mathbf{E}_{\mu}\left[\sup_{h\in\mathcal{H}}\lVert\rho'_{\sigma}(\mathsf{L}(h))\,\mathsf{L}'(h)\rVert\right]<\infty \implies \mathbf{E}_{\mu}\left[\partial_h\mathsf{L}_{\rho}\right]=\partial_h\mathrm{R}_{\rho}$  (18)

with the implied equality valid on all of $\mathcal{H}\times\mathbb{R}^2$.

Consider a setting in which the gradients are heavy-tailed, i.e., where $\mathbf{E}_{\mu}\lVert\mathsf{L}'(h)\rVert^p=\infty$ for $p>2$. If the ultimate goal of learning is minimization of $h\mapsto\mathbf{E}_{\mu}\mathsf{L}(h)$, then in order to obtain high-probability guarantees of finding a nearly-stationary point at rates matching the in-expectation case, one cannot naively use the raw gradients $\mathsf{L}'(h)$, but must instead carry out a delicate truncation that accounts for the bias incurred [14, 22, 23, 42]. On the other hand, if the ultimate objective is $h\mapsto\mathrm{R}_{\rho}(h;\theta,\eta)$, then using (17) there is zero bias by design (Lemma 2), and when we take the shape parameter of our dispersion function such that $\alpha\leq 0$, we have

$\mathbf{P}\left\{\lVert\partial_h\mathsf{L}_{\rho}(h;\theta,\eta)\rVert\leq\Gamma\right\}=1$  (19)

for an appropriate choice of $0<\Gamma<\infty$ under standard loss functions such as the quadratic and logistic losses, even when the random losses and gradients are heavy-tailed (see Corollary 4).

To see how this plays out in the analysis of learning algorithms, consider plugging the raw stochastic gradients $\partial_h\mathsf{L}_{\rho}$ into a simple update procedure. Given an independent sequence of random losses $(\mathsf{L}_1,\mathsf{L}_2,\ldots)$, denote by $(\mathsf{L}_{\rho,1},\mathsf{L}_{\rho,2},\ldots)$ the transformed losses computed via (16), and for a sequence $(h_1,h_2,\ldots)$ let $G_t := \partial_h\mathsf{L}_{\rho,t}(h_t;\theta,\eta)$ denote the resulting stochastic gradients for each integer $t\geq 1$. Fixing $\theta\in\mathbb{R}$ and letting $h_1\in\mathcal{H}$ denote an arbitrary initial value, we consider the sequence generated by the following update rule:

$h_{t+1} = h_t - a_t\widetilde{M}_t,$  (20)

where $a_t$ is a non-negative step size we control, and the update direction satisfies

$\lVert\widetilde{M}_t\rVert=1,\quad \langle\widetilde{M}_t,M_t\rangle=\lVert M_t\rVert,\quad\text{where } M_t := bM_{t-1}+(1-b)G_t$  (21)

with $0<b<1$ also a controllable parameter. This is an unconstrained, normalized stochastic gradient descent routine with momentum; it modifies the procedure of Cutkosky and Mehta [14] in that we do not truncate $G_t$. Note that if $\mathcal{H}$ is a Banach space and $\mathcal{H}^{\ast}$ its dual, then in general $G_t$ and $M_t$ are elements of $\mathcal{H}^{\ast}$. When $\mathcal{H}$ is reflexive (e.g., any Hilbert space), it is always possible to construct $\widetilde{M}_t$ from $M_t$; given any $h^{\ast}\in\mathcal{H}^{\ast}$, we can always find $h\in\mathcal{H}$ such that $\langle h^{\ast},h\rangle=\lVert h\rVert\lVert h^{\ast}\rVert$ [39, §5.6]. The following theorem shows how the gradient norms incurred by this algorithm can be bounded with high probability.
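In a Euclidean parameter space, where $\widetilde{M}_t = M_t/\lVert M_t\rVert$, the updates (20)–(21) amount to a few lines. The sketch below runs them on a synthetic linear regression problem with heavy-tailed noise (our own illustrative data model, not from the paper), using the transformed-loss gradient (17) with a Barron dispersion at $\alpha=0$, whose derivative $\rho'(u)=2u/(u^2+2)$ is bounded, so no gradient truncation is applied; we again assume $\rho_\sigma(x)=\rho(x/\sigma)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 5
X = rng.normal(size=(n, d))
h_star = rng.normal(size=d)
y = X @ h_star + rng.standard_t(df=2.1, size=n)  # heavy-tailed noise

def grad_transformed(h, theta, sigma=1.0):
    # Stochastic gradient (17): rho'_sigma(L(h) - theta) * L'(h) for the
    # quadratic loss L(h) = (<x, h> - y)^2 / 2, with alpha = 0 so that the
    # dispersion derivative rho'(u) = 2u / (u^2 + 2) is bounded.
    i = rng.integers(n, size=32)                 # mini-batch of indices
    resid = X[i] @ h - y[i]
    u = (0.5 * resid**2 - theta) / sigma
    w = (2.0 * u / (u**2 + 2.0)) / sigma         # bounded rho'_sigma weights
    return (w[:, None] * resid[:, None] * X[i]).mean(axis=0)

T = 10**4
b = 1.0 - 1.0 / np.sqrt(T)                       # momentum, as in Theorem 3
a = (1.0 / T) ** 0.75                            # step size, as in Theorem 3
h, M = np.zeros(d), np.zeros(d)
for t in range(T):
    G = grad_transformed(h, theta=1.0)
    M = b * M + (1.0 - b) * G                    # momentum average (21)
    if np.linalg.norm(M) > 0:
        h = h - a * M / np.linalg.norm(M)        # normalized update (20)
```

The normalization makes each step's length exactly $a_t$, so the iterates cannot be derailed by a single extreme mini-batch even without truncation.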

Theorem 3 (Stationary points with high probability).

Let $\mathcal{H}$ be a reflexive Banach space, with a Fréchet differentiable norm satisfying $\lVert h_1+h_2\rVert^2 \leq \lVert h_1\rVert^2 + \langle\nabla\lVert h_1\rVert^2, h_2\rangle + \lVert h_2\rVert^2$ for all $h_1,h_2\in\mathcal{H}$. In addition, assume the losses are such that (19) holds on $\mathcal{H}$, that $\mathbf{E}_{\mu}\lVert\mathsf{L}'(h_1)-\mathsf{L}'(h_2)\rVert \leq \lambda_1\lVert h_1-h_2\rVert$ for all $h_1,h_2\in\mathcal{H}$, and that $\mathbf{E}_{\mu}\lVert\mathsf{L}'\rVert_{\mathcal{H}}^2<\infty$, where $\lVert\mathsf{L}'\rVert_{\mathcal{H}} := \sup_{h\in\mathcal{H}}\lVert\mathsf{L}'(h)\rVert$. Run the learning algorithm (20)–(21) for $T$ iterations, with $b=1-1/\sqrt{T}$ and $a_t=(1/T)^{3/4}$ for all steps $t=1,2,\ldots,T$, assuming each $h_t\in\mathcal{H}$. Taking any $0<\delta<1$, it then follows that

$\frac{1}{T}\sum_{t=1}^{T}\lVert\partial_h\mathrm{R}_{\rho}(h_t;\theta,\eta)\rVert \leq \frac{c_1}{T^{1/4}} + \frac{c_2}{\sqrt{T}} + \frac{\lambda}{2T^{3/4}}$

with probability no less than $1-\delta$, using coefficients defined as

$c_1 := \mathrm{R}_{\rho}(h_1;\theta,\eta) - \mathrm{R}_{\rho}(h_{T+1};\theta,\eta) + 16\Gamma\sqrt{\log(3T\delta^{-1})} + 2\lambda$
$c_2 := 20\Gamma\log(3T\delta^{-1}) + 2\Gamma$
$\lambda := \frac{\lambda_3+\lambda_4}{\sigma} + \frac{\lambda_2}{\sigma^2}\mathbf{E}_{\mu}\lVert\mathsf{L}'\rVert_{\mathcal{H}}, \quad \lambda_2 := \lVert\rho''\rVert_{\infty}, \quad \lambda_3 := \left(\frac{2\lambda_2}{\sigma}\right)\mathbf{E}_{\mu}\lVert\mathsf{L}'\rVert_{\mathcal{H}}^2, \quad \lambda_4 := \lambda_1\lVert\rho'\rVert_{\infty}$

for any choice of $\sigma>0$.

These high-probability $\mathcal{O}(T^{-1/4})$ rates match the standard in-expectation guarantees from the stochastic optimization literature for non-convex objectives [15, 20]. The main take-away of Theorem 3 is that even if the random losses and gradients are unbounded and heavy-tailed, as long as the dispersion function $\rho$ is chosen to modulate extreme values (so that (19) holds), we can obtain confidence intervals for the T-risk gradient norms incurred by stochastic gradient updates. Distribution control is implied by the risk design, so there is no additional need for truncation or bias control. The following corollary illustrates how the bounded-gradient condition in Theorem 3 is satisfied under very weak assumptions on the data.

Corollary 4.

Assume the random losses $\mathsf{L}(h)$ are driven by random data $(\mathsf{X},\mathsf{Y})$, where $\mathsf{X}$ takes values in a Banach space $\mathcal{X}$, and $\mathcal{H}\subset\mathcal{X}^{\ast}$ has finite diameter. Consider the following losses:

  • E1. Quadratic loss: $\mathsf{L}(h)=(h(\mathsf{X})-\mathsf{Y})^2/2$, with $h(\mathsf{X})=\langle h,\mathsf{X}\rangle$ and $\mathsf{Y}=\langle h^{\ast},\mathsf{X}\rangle+\varepsilon$, where $\varepsilon$ is zero-mean with finite variance, independent of $\mathsf{X}$.

  • E2. Logistic loss: $\mathsf{L}(h)=\log(\sum_{j=1}^{k}\exp(\langle h_j,\mathsf{X}\rangle))-\sum_{j=1}^{k}\widetilde{\mathsf{Y}}_j\langle h_j,\mathsf{X}\rangle$, where we have $k\geq 2$ classes, $h=(h_1,\ldots,h_k)$ with each $h_j\in\mathcal{H}$, and $(\widetilde{\mathsf{Y}}_1,\ldots,\widetilde{\mathsf{Y}}_k)$ is a one-hot representation of the class label $\mathsf{Y}$ assigned to $\mathsf{X}$.

If we set $\rho(\cdot)=\rho(\cdot;\alpha)$ with $\alpha\leq 0$, then under examples E1.–E2., as long as $\mathbf{E}_{\mu}\lVert\mathsf{X}\rVert^2<\infty$, the bounds assumed by Theorem 3, including (19), are satisfied on $\mathcal{H}$.

In practice, the threshold $\theta$ will typically not be fixed arbitrarily, but rather selected in a data-dependent fashion, potentially optimized alongside $h$; the impact of such algorithmic choices is evaluated in our empirical tests in §5. In the following sub-section, we consider some key properties of the special case in which the threshold $\theta$ is always taken to yield the smallest overall T-risk value.

Limitations

An obvious limitation of our analysis is the assumption that each $h_t\in\mathcal{H}$; this matches the setup of [14], but the condition $\mathbf{E}_{\mu}\lVert\mathsf{L}'\rVert_{\mathcal{H}}^2<\infty$ will sometimes require $\mathcal{H}$ to have finite diameter. Modifying the procedure to allow projection of the iterates onto $\mathcal{H}$ is a point of technical interest, but is beyond this paper's scope. Another limitation is that our current approach using smoothness properties (Lemma 12) necessitates gradients with finite second-order moments. In the special case $\alpha=1$, arguments based on smoothness can be replaced with arguments based on weak convexity [15, 30]; but this fails for more general $\alpha$, since $\rho(\cdot;\alpha)$ is not convex for $\alpha<1$ and not Lipschitz for $\alpha>1$. Another potential option is to split the sample, leverage the stronger in-expectation guarantees available for gradient descent run on the subsets, and robustly choose the best candidate based on a validation set [29].

4.2 T-risk with minimizing thresholds

Recall the minimal T-risk $\underline{\mathrm{R}}_{\rho}(h;\eta)$ and the optimal thresholds $\theta_{\rho}(h;\eta)$ defined earlier in (12), here restricted to the scaled $\rho(\cdot)=\rho_{\sigma}(\cdot;\alpha)$ from the Barron class (15). This is arguably the most natural subset of T-risks, with a functional form aligned with the OCE/DRO risks discussed in §2.2. For readability, we denote the dispersion of $\mathsf{L}(h)$ measured about $\theta\in\mathbb{R}$ using $\rho$ as

$\mathrm{D}_{\rho}(h;\theta) := \mathbf{E}_{\mu}\rho\left(\mathsf{L}(h)-\theta\right).$  (22)

Here we overload our notation, writing $\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)$, $\theta_{\rho}(\mathsf{L};\eta)$, and $\mathrm{D}_{\rho}(\mathsf{L};\theta)$ when we want to leave $h$ abstract and focus on random losses in a set $\mathsf{L}\in\mathcal{L}$. For this risk to be finite, we require $\alpha\geq 1$, and in addition $\lvert\eta\rvert<1/\sigma$ in the special case $\alpha=1$; otherwise, the dispersion term grows too slowly and we have $\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)=-\infty$ (Lemma 10). When finite, the optimal threshold is unique, and overloading our notation once more, we use $\theta_{\rho}(\mathsf{L};\eta)$ to denote it. The following lemma summarizes some basic properties of the minimal T-risk.

Lemma 5.

Let $\mathcal{L}$ be such that for each $\mathsf{L}\in\mathcal{L}$ we have $\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)<\infty$, and let $\alpha$, $\sigma$, and $\eta$ be such that $\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)>-\infty$. Under these assumptions, the dispersion part of the minimal T-risk is translation invariant: for any $c\in\mathbb{R}$, we have $\mathrm{D}_{\rho}(\mathsf{L}+c;\theta_{\rho}(\mathsf{L}+c;\eta))=\mathrm{D}_{\rho}(\mathsf{L};\theta_{\rho}(\mathsf{L};\eta))$. This dispersion term is always non-negative, and if $\eta\neq 0$, then $\mathrm{D}_{\rho}(\mathsf{L};\theta_{\rho}(\mathsf{L};\eta))>0$ for any $\mathsf{L}\in\mathcal{L}$, even if $\mathsf{L}$ is constant. If $\mathcal{L}$ is convex, then so is the map $\mathsf{L}\mapsto\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)$. The optimal threshold is translation equivariant, in that $\theta_{\rho}(\mathsf{L}+c;\eta)=\theta_{\rho}(\mathsf{L};\eta)+c$ for any choice of $c\in\mathbb{R}$, and monotonic, in that whenever $\mathsf{L}_1,\mathsf{L}_2\in\mathcal{L}$ and $\mathsf{L}_1\leq\mathsf{L}_2$ almost surely, we have $\theta_{\rho}(\mathsf{L}_1;\eta)\leq\theta_{\rho}(\mathsf{L}_2;\eta)$.

Let us briefly discuss the properties described in Lemma 5 with a bit more context. One of the best-known classes of risk functions is that of coherent risks [2], typically characterized by convexity, monotonicity, translation equivariance, and positive homogeneity [52]. Our general notion of "dispersion" is often referred to as "deviation" in the risk literature, and the properties of translation invariance, sub-linearity (implying convexity), non-negativity, and definiteness (i.e., zero only for constants) allow one to establish links between deviations and coherent risks [49]. In general, the risk $\mathsf{L}\mapsto\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)$ takes us outside this traditional class, while still maintaining lucid connections as summarized in the preceding lemma.
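The translation properties in Lemma 5 are easy to check numerically. A sketch under stated assumptions (Barron shape $\alpha=1.5$, scaling $\rho_\sigma(x)=\rho(x/\sigma)$, and a brute-force grid search standing in for the exact minimization over $\theta$):

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.lognormal(size=2000)
alpha, sigma, eta = 1.5, 0.5, 0.5
c0 = abs(alpha - 2.0)

def rho(u):
    # Barron-class dispersion at alpha = 1.5 (assumed form of (15)).
    return (c0 / alpha) * ((u**2 / c0 + 1.0) ** (alpha / 2.0) - 1.0)

def opt_threshold(losses):
    # Empirical theta_rho(L; eta): grid minimization of
    # theta -> eta * theta + mean rho((L - theta) / sigma).
    grid = np.linspace(losses.min() - 5.0, losses.max() + 5.0, 4001)
    vals = eta * grid + np.array([np.mean(rho((losses - t) / sigma)) for t in grid])
    return grid[int(np.argmin(vals))]

t1 = opt_threshold(L)
t2 = opt_threshold(L + 3.0)
print(t1, t2 - t1)  # shifting the losses by 3 shifts the threshold by roughly 3
```

The dispersion evaluated at the respective optimal thresholds is unchanged by the shift, exactly as the translation invariance in Lemma 5 predicts.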

Remark 6 (Risk quadrangle).

It should also be noted that our T-risks are not what would typically be called "risks" in the context of the "expectation quadrangle" framework developed by Rockafellar and Uryasev [48]. Taking their Example 7 as a reference for comparison, our function $\rho(\cdot)$ corresponds to their "error integrand" $e(\cdot)$, and the risk derived from their quadrangle would be

$\mathsf{L} \mapsto \mathbf{E}_{\mu}\mathsf{L} + \mathbf{E}_{\mu}\rho\left(\mathsf{L}-\mathrm{M}_{\rho}(\mathsf{L})\right) = \inf_{\theta\in\mathbb{R}}\left[\theta + \mathbf{E}_{\mu}v(\mathsf{L}-\theta)\right]$

where $v(x) := \rho(x)+x$ and $\mathrm{M}_{\rho}(\mathsf{L})$ denotes the M-location (1) under $\rho$. Under an appropriate choice of $\rho$, this is an OCE-type risk, and evidently our optimal thresholds $\theta_{\rho}(\mathsf{L};\eta)$ with $\eta\neq 0$ do not appear in risks of this form, highlighting the distinct nature of $\underline{\mathrm{R}}_{\rho}(\mathsf{L};\eta)$.

Next, we briefly consider how algorithms designed to minimize $\underline{\mathrm{R}}_{\rho}(h;\eta)$ perform in terms of the classical risk, namely the expected loss $\mathrm{R}(h) := \mathbf{E}_{\mu}\mathsf{L}(h)$. This is a big topic, but as an initial look, we consider how the expected loss incurred by minimizers of the empirical T-risk can be controlled given sufficiently good concentration of the M-estimators induced by $\rho$ and of the empirical mean. Denote any empirical T-risk minimizer by

$\widehat{h}_{\rho} \in \operatorname*{arg\,min}_{h\in\mathcal{H}}\left[\inf_{\theta\in\mathbb{R}}\left(\eta\theta + \frac{1}{n}\sum_{i=1}^{n}\rho_{\sigma}\left(\mathsf{L}_i(h)-\theta\right)\right)\right]$  (23)

where for simplicity $(\mathsf{L}_1,\ldots,\mathsf{L}_n)$ is an iid sample of random losses.
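A sketch of the joint minimization in $(h,\theta)$ behind (23), by plain gradient descent on both variables. The quadratic base loss, the synthetic data, and the specific parameter values are illustrative assumptions, as is the scaling $\rho_\sigma(x)=\rho(x/\sigma)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

alpha, sigma, eta = 1.5, 0.5, 1.0
c = abs(alpha - 2.0)

def rho(u):            # Barron dispersion at alpha = 1.5
    return (c / alpha) * ((u**2 / c + 1.0) ** (alpha / 2.0) - 1.0)

def rho_prime(u):      # its derivative
    return u * (u**2 / c + 1.0) ** (alpha / 2.0 - 1.0)

h, theta, lr = np.zeros(d), 0.0, 0.01
for _ in range(5000):
    resid = X @ h - y
    L = 0.5 * resid**2                                     # quadratic base losses
    w = rho_prime((L - theta) / sigma) / sigma             # rho'_sigma weights
    g_h = (w[:, None] * resid[:, None] * X).mean(axis=0)   # d/dh of the objective
    g_theta = eta - w.mean()                               # d/dtheta: eta - mean(rho'_sigma)
    h -= lr * g_h
    theta -= lr * g_theta

obj = eta * theta + np.mean(rho((0.5 * (X @ h - y) ** 2 - theta) / sigma))
print(theta, obj)
```

Note that the $\theta$-gradient vanishes exactly when the average of the $\rho'_\sigma$ weights equals $\eta$, which is the first-order condition characterizing the optimal threshold.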

Proposition 7.

Take any T-risk parameters $\alpha\geq 1$, $\eta>0$, and $\sigma>0$, and denote the resulting empirical (minimal) T-risk minimizer by $\widehat{h}_{\rho}$ as in (23). Let $\widehat{\mathrm{R}}(h)$, $\widehat{\mathrm{M}}_{\rho}(h)$, $\widehat{\theta}_{\rho}(h;\eta)$, and $\widehat{\mathrm{D}}_{\rho}(h;\theta)$ denote the empirical analogues of $\mathrm{R}(h)$, $\mathrm{M}_{\rho}(h)$, $\theta_{\rho}(h;\eta)$, and $\mathrm{D}_{\rho}(h;\theta)$ for each $h\in\mathcal{H}$. Then, with $\lVert\cdot\rVert_{\mathcal{H}}$ as in Theorem 3, we have

$\mathrm{R}(\widehat{h}_{\rho}) \leq \mathrm{R}(h^{\ast}) + \lVert\widehat{\mathrm{M}}_{\rho}-\widehat{\theta}_{\rho}\rVert_{\mathcal{H}} + 2\lVert\widehat{\mathrm{M}}_{\rho}-\mathrm{R}\rVert_{\mathcal{H}} + \frac{1}{\eta}\widehat{\mathrm{D}}_{\rho}(h^{\ast};\widehat{\mathrm{M}}_{\rho}(h^{\ast})) + 4\lVert\mathrm{R}-\widehat{\mathrm{R}}\rVert_{\mathcal{H}}$

where we can freely choose $h^{\ast}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\mathrm{R}(h)$ to be optimal in the expected loss.

We have left the upper bound in Proposition 7 rather abstract in order to emphasize the key factors controlling expected loss bounds for empirical T-risk minimizers, and to keep the overall narrative clear. Note that in the special case $\alpha=2$ one has $\widehat{\mathrm{M}}_{\rho}(h)-\widehat{\theta}_{\rho}(h;\eta)=\eta$, and more generally, taking $\eta\to 0$ sends this difference to zero. There is tension here due to the $1/\eta$ coefficient on the dispersion term; all other terms, including the dispersion itself, are free of $\eta$. Note also that the remaining term $\lVert\widehat{\mathrm{M}}_{\rho}-\mathrm{R}\rVert_{\mathcal{H}}$ is the (uniform) difference between the M-estimator induced by $\rho$ and the classical risk; this can be modulated directly using the scale parameter $\sigma$, and sharp bounds for a broad class of M-estimators have been established recently [40]. Our Proposition 7 is analogous to Theorem 7 of Lee et al. [36], with the key difference that we study (minimal) T-risks instead of OCE risks, without assuming the losses to be bounded (below or above).

5 Learning applications

To complement the "static" empirical analysis in §3.3 and the formal insights for learning algorithms in §4, here we empirically investigate how risk function design and data properties impact the behavior of stochastic gradient-based learners.

5.1 Classification with noisy and unbalanced labels

Experiment setup

As an initial example using simulated data, we design a binary classification problem by generating 500 labeled data points as shown in Figure 3 (left-most plot). The majority class comprises 95% of the sample, and we flip 5% of the labels uniformly at random. Aside from the flipped labels, the data is linearly separable. We consider two candidate classifiers used to initialize an iterative learning algorithm: one that does well on the majority class (dotted purple), and one that does well on the minority class (dashed pink). Due to the class imbalance, the loss distributions incurred by these two candidates are highly asymmetric and differ in their long-tail direction (Figure 3, remaining plots). We run empirical risk minimization for each risk class described in §2.2 (joint in $(h,\theta)$), implemented by 15,000 iterations of full-batch gradient descent with a step size of $0.01$, under three different choices of base loss function (logistic, hinge, unhinged). We run this procedure on the same data for a wide range of risk parameters $\alpha$ (T-risk), $\beta$ (CVaR), $\gamma$ (tilted), and $a$ ($\chi^2$-DRO), and choose a representative setting for each risk class as the one achieving the best final training classification error.

Representative results

The full-batch (training) classification error and norm trajectories for these representatives, initialized at each of the two candidates just mentioned ("mostly correct" and "mostly incorrect") and trained with the unhinged base loss, are shown in Figure 4. Regardless of the direction of the initial loss distribution tails, we see that the Barron-type T-risk has the flexibility to achieve a stable and superior long-term error rate, while at the same time penalizing exceedingly overconfident correct examples (via bi-directionality), thereby keeping the norm of the linear classifier small. An almost identical trend is observed under the (binary) logistic loss, whereas classification error rates under the hinge loss tend to be less stable (see Figures 10–11 in §B.6).

Figure 3: Left-most plot: simulated binary classification data, with unbalanced classes and randomly flipped labels (indicated by red diamonds). Remaining plots: loss distributions for three types of loss functions, evaluated at two different candidates, corresponding to the dashed pink and dotted purple lines in the left-most plot.
Figure 4: Classification error and norm trajectories (over iteration number, $\log_{10}$ scale) for gradient descent implementations of each risk class, using the unhinged base loss and the data given in Figure 3.

5.2 Control of outlier sensitivity

Experiment setup

For a crystal-clear example of real data that induces losses with heavy tails, we consider the famous Belgian phone call dataset [51, 59], with input features normalized to the unit interval and raw outputs used as-is. We conduct two regression tasks: the first uses the original data just stated, and the second modifies a single data point so that it is an outlier on both the horizontal and vertical axes (we simply multiply the original data point by 5); such points are said to have "high leverage" [32, Ch. 7]. For these two tasks, we once again run empirical risk minimization under each of the risk classes of interest, implemented using 15,000 iterations of full-batch gradient descent with a fixed step size of 0.005. To illustrate the flexibility of the T-risk, here instead of minimizing (11) jointly in $(h,\theta)$, we fix $\theta$ at the start of the learning process along with $\sigma$ and $\eta$, iteratively optimizing $h$ only (i.e., $\theta$ is not updated at any point). For simplicity, we set both $\theta$ and $\sigma$ to the median of the losses incurred at initialization. All other risk classes are treated precisely as in the previous experiment.
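The fixed-threshold variant used here can be set up as below: both $\theta$ and $\sigma$ are frozen at the median initialization loss, and only $h$ is updated. The data is a small synthetic stand-in for the phone dataset (not the real data), the base loss is the squared error, and the Barron shape is fixed at the robust end $\alpha=0$; all of these choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(size=50)                       # inputs normalized to [0, 1]
y = 2.0 * x + 0.1 * rng.normal(size=50)
y[:3] += 15.0                                  # a few gross, high-leverage-style outliers

A = np.stack([x, np.ones_like(x)], axis=1)     # design matrix: slope and intercept
h = np.zeros(2)

# Fix theta and sigma to the median loss at initialization; neither is updated.
L0 = 0.5 * (A @ h - y) ** 2
theta = sigma = np.median(L0)

for _ in range(15000):
    resid = A @ h - y
    u = (0.5 * resid**2 - theta) / sigma
    w = (2.0 * u / (u**2 + 2.0)) / sigma       # rho'_sigma at alpha = 0 (bounded)
    h -= 0.005 * (w[:, None] * resid[:, None] * A).mean(axis=0)

print(h)  # fitted slope and intercept under the fixed-threshold T-risk
```

Because the $\rho'$ weight decays for losses far above $\theta$, the outlying points contribute almost nothing to the gradient, which is the mechanism behind the outlier-robust end of the $\alpha$ range in Figure 5.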

Representative results

The final regression lines obtained under each risk setting by the learning procedure just described are plotted along with the data in Figure 5. Colors correspond to individual risk function choices within each risk family, as indicated by the color bars under each plot. The gray regression line denotes the common initial value used by all methods. Algorithm outputs using CVaR and $\chi^2$-DRO are always at least as sensitive to outliers as the ordinary least-squares (OLS) solution. While the tilted risk does let us interpolate between lower and upper quantiles, this transition is not smooth; even trying 20 values within the small window $\gamma\in[-0.025,0.025]$, the algorithm outputs essentially jump between two extremes. This difference is particularly lucid in the bottom plots of Figure 5, where we have a high-leverage point. In contrast, with all other parameters of the T-risk fixed, just tweaking $\alpha$ gives a remarkable degree of flexibility to control the final output; while the base loss is fixed to the squared error, the regression lines range from those that ignore outliers to those that are very sensitive to them. In the high-leverage case, it is well known that this cannot be achieved by simply changing to a different convex base loss (e.g., MAE instead of OLS), giving a concise illustration of the flexibility inherent in the T-risk class. Algorithm behavior can be sensitive to the initial value; see §B.5 and Figure 13 in the appendix for an example where the naive median-based procedure described here causes gradient-based algorithms to stall.

Figure 5: Using T-risks with thresholds set to median initialization error, we can achieve a smooth transition between outlier-robust and outlier-sensitive solutions, even under high-leverage points. Top: original “phones” dataset. Bottom: modified data including a single high-leverage point.

5.3 Distribution control under SGD on benchmark datasets

Experiment setup

Finally, we consider tests using datasets and models that are orders of magnitude larger than in the previous two experimental setups. We use several well-known benchmark datasets for multi-class classification; details are given in §B.5. All features are normalized to the unit interval (categorical variables are one-hot encoded), and scores for each class are computed by a linear combination of features. As a base loss, we use the multi-class logistic regression loss. Here we investigate how a stochastic gradient descent implementation (with averaging) of empirical T-risk minimization (joint in $(h,\theta)$) behaves as we control the shape parameter $\alpha$ of the Barron class, with $\sigma=0.99$ and $\eta=1$ fixed throughout. We record the base loss (logistic loss) and the zero-one loss (misclassification error) at the start and end of each epoch, and record the full base loss distribution after the last epoch concludes. We run 5 independent trials, in which the data is shuffled and initial parameters are determined randomly. In each trial, for all datasets and methods, we use a mini-batch size of 32 and run 30 epochs. As reference algorithms, we consider "vanilla" ERM, using the traditional risk $\mathbf{E}_{\mu}\mathsf{L}(h)$, and mean-variance, implemented as a special case of T-risk (with $\rho(\cdot)=(\cdot)^2/2$ and $\eta=1$). We use 80% of the data for training, 10% for validation, and 10% for testing. For each risk class, we try five different step sizes, using the validation data to choose the best one for each risk setting. All results presented here are based on loss values computed on the test set: solid lines represent averages over trials, and shaded areas denote $\pm$ one standard deviation over trials.

Representative results

In Figures 6–7, we give results for the following two datasets: "extended MNIST" (47 classes, balanced) [11] and "cover type" (7 classes, imbalanced) [9]. We plot the average and standard deviation of the base and zero-one losses as a function of epoch number, plus a histogram of test (base) losses for a single trial, compared with the loss distribution incurred by a random initialization of the same model (left-most plot; gray is test, black is training). Colors are analogous to previous experiments, here evenly spaced over the allowable range ($1\leq\alpha\leq 2$). It is clear how modifying the shape of the Barron dispersion function across this range lets us flexibly interpolate between the test (base) loss distribution achieved by a traditional ERM solution and that of a mean-variance solution, in terms of both the mean and the standard deviation. This monotonicity (as a function of $\alpha$) is salient in the base loss, but does not always appear for the zero-one loss. Since these datasets have normalized features with negligible label noise, egregious outliers are rare, and thus the trends observed for the mean and standard deviation also hold for outlier-resistant location-dispersion pairs such as the median and the median absolute deviation about the median. Similar results for several other datasets are provided in §B.6, and we remark that the key trends hold across all datasets tested. Our previous experiments showed how, under heavy-tailed losses and gradients, the T-risk solution can differ greatly from the vanilla ERM solution, so it is interesting to observe that on large, normalized, clean classification datasets, the T-risk allows us to very smoothly control a tradeoff between the average test loss and the variance on the test set.

Figure 6: Using T-risk to interpolate between test loss distributions (dataset: emnist_balanced).
Figure 7: Analogous to Figure 6 (dataset: covtype).

Acknowledgments

This work was supported by JST ACT-X Grant Number JPMJAX200O and JST PRESTO Grant Number JPMJPR21C6.

References

  • Albert and Anderson, [1984] Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10.
  • Artzner et al., [1999] Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9(3):203–228.
  • Ash and Doléans-Dade, [2000] Ash, R. B. and Doléans-Dade, C. A. (2000). Probability and Measure Theory. Academic Press, 2nd edition.
  • Barron, [2019] Barron, J. T. (2019). A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4331–4339.
  • Bassett, [1987] Bassett, Jr, G. W. (1987). The St. Petersburg paradox and bounded utility. History of Political Economy, 19(4):517–523.
  • Ben-Tal et al., [2013] Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
  • Ben-Tal and Teboulle, [1986] Ben-Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Management Science, 32(11):1445–1466.
  • Ben-Tal and Teboulle, [2007] Ben-Tal, A. and Teboulle, M. (2007). An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449–476.
  • Blackard and Dean, [1999] Blackard, J. A. and Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151.
  • Boyd and Vandenberghe, [2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  • Cohen et al., [2017] Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373v2.
  • Şimşekli et al., [2019] Şimşekli, U., Sagun, L., and Gürbüzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 5827–5837.
  • Curi et al., [2020] Curi, S., Levy, K. Y., Jegelka, S., and Krause, A. (2020). Adaptive sampling for stochastic risk-averse learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 1036–1047.
  • Cutkosky and Mehta, [2021] Cutkosky, A. and Mehta, H. (2021). High-probability bounds for non-convex stochastic optimization with heavy tails. arXiv preprint arXiv:2106.14343v2.
  • Davis and Drusvyatskiy, [2019] Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239.
  • Devroye et al., [1996] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.
  • Duchi and Namkoong, [2018] Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750v6.
  • Duchi and Namkoong, [2019] Duchi, J. and Namkoong, H. (2019). Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1–55.
  • Föllmer and Knispel, [2011] Föllmer, H. and Knispel, T. (2011). Entropic risk measures: Coherence vs. convexity, model ambiguity and robust large deviations. Stochastics and Dynamics, 11(02n03):333–351.
  • Ghadimi et al., [2016] Ghadimi, S., Lan, G., and Zhang, H. (2016). Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305.
  • Gneiting, [2011] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
  • Gorbunov et al., [2020] Gorbunov, E., Danilova, M., and Gasnikov, A. (2020). Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
  • Gorbunov et al., [2021] Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., and Gasnikov, A. (2021). Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise. arXiv preprint arXiv:2106.05958.
  • Gotoh et al., [2018] Gotoh, J.-y., Kim, M. J., and Lim, A. E. (2018). Robust empirical optimization is almost the same as mean–variance optimization. Operations Research Letters, 46(4):448–452.
  • Hacking, [2006] Hacking, I. (2006). The emergence of probability: a philosophical study of early ideas about probability, induction and statistical inference. Cambridge University Press, 2nd edition.
  • Hashimoto et al., [2018] Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. (2018). Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1929–1938.
  • Haussler, [1992] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150.
  • [28] Holland, M. J. (2021a). Designing off-sample performance metrics. arXiv preprint arXiv:2110.04996v1.
  • [29] Holland, M. J. (2021b). Robustness and scalability under heavy tails, without strong convexity. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research.
  • Holland, [2022] Holland, M. J. (2022). Learning with risks based on M-location. Machine Learning.
  • Holland and Haress, [2021] Holland, M. J. and Haress, E. M. (2021). Learning with risk-averse feedback under potentially heavy tails. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research.
  • Huber and Ronchetti, [2009] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. John Wiley & Sons, 2nd edition.
  • Kashima, [2007] Kashima, H. (2007). Risk-sensitive learning via minimization of empirical conditional value-at-risk. IEICE Transactions on Information and Systems, 90(12):2043–2052.
  • Kearns and Vazirani, [1994] Kearns, M. J. and Vazirani, U. V. (1994). An Introduction to Computational Learning Theory. MIT Press.
  • Koltchinskii, [1997] Koltchinskii, V. I. (1997). MM-estimation, convexity and quantiles. The Annals of Statistics, pages 435–477.
  • Lee et al., [2020] Lee, J., Park, S., and Shin, J. (2020). Learning bounds for risk-sensitive learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 13867–13879.
  • [37] Li, T., Beirami, A., Sanjabi, M., and Smith, V. (2021a). On tilted losses in machine learning: Theory and applications. arXiv preprint arXiv:2109.06141v1.
  • [38] Li, T., Beirami, A., Sanjabi, M., and Smith, V. (2021b). Tilted empirical risk minimization. In The 9th International Conference on Learning Representations (ICLR).
  • Luenberger, [1969] Luenberger, D. G. (1969). Optimization by Vector Space Methods. John Wiley & Sons.
  • Minsker, [2019] Minsker, S. (2019). Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523v4.
  • Mohri et al., [2012] Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.
  • Nazin et al., [2019] Nazin, A. V., Nemirovsky, A. S., Tsybakov, A. B., and Juditsky, A. B. (2019). Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607–1627.
  • Nesterov, [2004] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
  • Penot, [2012] Penot, J.-P. (2012). Calculus Without Derivatives, volume 266 of Graduate Texts in Mathematics. Springer.
  • Prashanth et al., [2020] Prashanth, L. A., Jagannathan, K., and Kolla, R. K. (2020). Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 5577–5586.
  • Rockafellar and Uryasev, [2000] Rockafellar, R. T. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42.
  • Rockafellar and Uryasev, [2002] Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471.
  • Rockafellar and Uryasev, [2013] Rockafellar, R. T. and Uryasev, S. (2013). The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2):33–53.
  • Rockafellar et al., [2006] Rockafellar, R. T., Uryasev, S., and Zabarankin, M. (2006). Generalized deviations in risk analysis. Finance and Stochastics, 10(1):51–74.
  • Rousseeuw and Christmann, [2003] Rousseeuw, P. J. and Christmann, A. (2003). Robustness against separation and outliers in logistic regression. Computational Statistics & Data Analysis, 43(3):315–332.
  • Rousseeuw and Leroy, [1987] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons.
  • Ruszczyński and Shapiro, [2006] Ruszczyński, A. and Shapiro, A. (2006). Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452.
  • Shalev-Shwartz and Ben-David, [2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Sutton and Barto, [2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
  • Takeda and Sugiyama, [2008] Takeda, A. and Sugiyama, M. (2008). ν\nu-support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1056–1063.
  • Takeuchi et al., [2006] Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264.
  • Van Rooyen et al., [2016] Van Rooyen, B., Menon, A., and Williamson, R. C. (2016). Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 10–18.
  • Vapnik, [1999] Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2nd edition.
  • Venables and Ripley, [2002] Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, 4th edition.
  • Zhai et al., [2021] Zhai, R., Dan, C., Kolter, J. Z., and Ravikumar, P. (2021). DORO: Distributional and outlier robust optimization. In 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 12345–12355.
  • Zhang et al., [2020] Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. (2020). Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, volume 33, pages 15383–15393.

Appendix A Appendix summary

For ease of reference, we start the appendix with a concise table of contents.

  • §B: Supplementary information

    • §B.1: Details for tilted risk

    • §B.2: Details for Barron class limits

    • §B.3: Additional lemmas for Barron class and T-risk

    • §B.4: Smoothness of the T-risk

    • §B.5: Experimental details

    • §B.6: Additional figures

  • §C: Detailed proofs

    • §C.1: Proofs of results in the main text

    • §C.2: Smoothness computations (proof of Lemma 12)

  • §D: Additional technical facts

    • §D.1: Lipschitz properties

    • §D.2: Convexity

    • §D.3: Derivatives for the Barron class

    • §D.4: Elementary inequalities

    • §D.5: Expected dispersion is coercive

Appendix B Supplementary information

Here we provide additional details that did not fit into the main body of the paper.

B.1 Details for tilted risk

Let 𝖷μ\mathsf{X}\sim\mu be an arbitrary random variable. Assuming the distribution is such that we can differentiate through the integral, we have

ddθ[θ+1γ(𝐄μeγ(𝖷θ)1)]=0𝐄μeγ(𝖷θ)=1.\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\left[\theta+\frac{1}{\gamma}\left(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta)}-1\right)\right]=0\iff\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta)}=1. (24)

Let θ\theta^{\ast} be any value that satisfies the first-order optimality condition in (24). It follows that

θ+1γ(𝐄μeγ(𝖷θ)1)=θ.\displaystyle\theta^{\ast}+\frac{1}{\gamma}\left(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma(\mathsf{X}-\theta^{\ast})}-1\right)=\theta^{\ast}.

It is easy to confirm that setting θ=(1/γ)log(𝐄μeγ𝖷)\theta^{\ast}=(1/\gamma)\log(\operatorname{\mathbf{E}}_{\mu}\mathrm{e}^{\gamma\mathsf{X}}) gives us a valid solution. For more background, see the recent works of Li et al., 2021b [38], Li et al., 2021a [37] and the references therein. Note also that this log-exponential criterion appears (with γ=1\gamma=1) in Rockafellar and Uryasev, [48, Ex. 8].
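As a quick numerical sanity check (a sketch of ours, not part of the paper's experiments), the closed-form threshold above can be compared against a direct scalar minimization of the empirical version of the objective in (24); the sample distribution and the value of γ below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)  # illustrative distribution
gamma = 0.5

def oce_objective(theta):
    # theta + (1/gamma) * (E[exp(gamma * (X - theta))] - 1), empirical version
    return theta + (np.exp(gamma * (x - theta)).mean() - 1.0) / gamma

theta_closed = np.log(np.exp(gamma * x).mean()) / gamma   # (1/gamma) log E e^{gamma X}
res = minimize_scalar(oce_objective, bounds=(-10.0, 10.0), method="bounded")
```

On the same sample, the numerical minimizer `res.x` agrees with `theta_closed`, and the attained minimum `res.fun` equals the tilted risk itself, as shown above.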

B.2 Details for Barron class limits

For the limit as α0\alpha\to 0, use the fact that for any a>0a>0, we have

limx0+(ax1)x=log(a).\displaystyle\lim\limits_{x\to 0_{+}}\frac{(a^{x}-1)}{x}=\log(a). (25)

This equality is sometimes known as Halley’s formula. For the limit as α\alpha\to-\infty, first note that for any α<0\alpha<0 we can write |2α|=2+|α|\lvert{2-\alpha}\rvert=2+\lvert{\alpha}\rvert, and thus |α|/2=(|2α|/2)1\lvert{\alpha}\rvert/2=(\lvert{2-\alpha}\rvert/2)-1. With this in mind, for any α<0\alpha<0 we can observe

(1+x2|α2|)α/2=(1+x2|α2|)|α|/2\displaystyle\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\alpha/2}=\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{-\lvert{\alpha}\rvert/2} =(1+x2|α2|)1(|2α|/2)\displaystyle=\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{1-(\lvert{2-\alpha}\rvert/2)}
=(1+x2|α2|)(1+x2|α2|)|α2|\displaystyle=\frac{\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)}{\sqrt{\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\lvert{\alpha-2}\rvert}}}
1exp(x2)=exp(x2/2),\displaystyle\to\frac{1}{\sqrt{\exp(x^{2})}}=\exp(-x^{2}/2),

where the limit is taken as α\alpha\to-\infty, and follows from the classical limit characterization of the exponential function. For the limit as α2\alpha\to 2_{-}, first note that

|α2|(1+x2|α2|)α/2=(|α2|2/α+|α2|(2/α)1x2)α/2\displaystyle\lvert{\alpha-2}\rvert\left(1+\frac{x^{2}}{\lvert{\alpha-2}\rvert}\right)^{\alpha/2}=\left(\lvert{\alpha-2}\rvert^{2/\alpha}+\lvert{\alpha-2}\rvert^{(2/\alpha)-1}x^{2}\right)^{\alpha/2}

and that as long as α<2\alpha<2, we can write

|α2|(2/α)1=(2α)(2α)/α=(uu)1/(2u)\displaystyle\lvert{\alpha-2}\rvert^{(2/\alpha)-1}=(2-\alpha)^{(2-\alpha)/\alpha}=\left(u^{u}\right)^{1/(2-u)}

where we have introduced u . . =2αu\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=2-\alpha. Taking α2\alpha\to 2_{-} amounts to u0+u\to 0_{+}, and thus using the fact that uu1u^{u}\to 1 as u0+u\to 0_{+}, the desired result follows from straightforward analysis.
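The three limits just derived are easy to verify numerically. The sketch below assumes the generic Barron-class member takes the form ρ(x;α) = (|α−2|/α)[(1 + x²/|α−2|)^{α/2} − 1], consistent with the manipulations above; the extreme shape values stand in for the actual limits, and the limiting expressions log(1 + x²/2), 1 − exp(−x²/2), and x²/2 correspond to α → 0, α → −∞, and α → 2 respectively.

```python
import numpy as np

def barron_rho(x, alpha):
    """Generic Barron-class member (alpha not in {0, 2, -inf}); the form
    assumed here is (|alpha-2|/alpha) * ((1 + x^2/|alpha-2|)^(alpha/2) - 1)."""
    return (abs(alpha - 2) / alpha) * ((1 + x**2 / abs(alpha - 2)) ** (alpha / 2) - 1)

x = np.linspace(-3.0, 3.0, 101)
near_zero = barron_rho(x, 1e-6)        # alpha -> 0 limit: log(1 + x^2/2)
very_negative = barron_rho(x, -1e6)    # alpha -> -inf limit: 1 - exp(-x^2/2)
near_two = barron_rho(x, 2 - 1e-6)     # alpha -> 2 limit: x^2/2
```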

B.3 Additional lemmas for Barron class and T-risk

Lemma 8 (Dispersion function convexity and smoothness).

Consider ρσ(x) . . =ρ(x/σ;α)\rho_{\sigma}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma;\alpha) with ρ(;α)\rho(\cdot;\alpha) from the Barron class (15), with parameter α2-\infty\leq\alpha\leq 2. The following properties hold for any choice of σ>0\sigma>0.

  • Case of α=2\alpha=2:
    ρσ\rho_{\sigma} is convex and 1/σ21/\sigma^{2}-smooth on \mathbb{R}.

  • Case of α=0\alpha=0:
    ρσ\rho_{\sigma} is convex on [2σ,2σ][-\sqrt{2}\sigma,\sqrt{2}\sigma], and is 1/(2σ)1/(\sqrt{2}\sigma)-Lipschitz and 1/σ21/\sigma^{2}-smooth on \mathbb{R}.

  • Case of α=\alpha=-\infty:
    ρσ\rho_{\sigma} is convex on [σ,σ][-\sigma,\sigma], and is (1/σ)exp(1/2)(1/\sigma)\exp(-1/2)-Lipschitz and 1/σ21/\sigma^{2}-smooth on \mathbb{R}.

  • Otherwise:
    ρσ\rho_{\sigma} is 1/σ21/\sigma^{2}-smooth on \mathbb{R}. When α1\alpha\geq 1, ρσ\rho_{\sigma} is convex on \mathbb{R}. When α=1\alpha=1, ρσ\rho_{\sigma} is also 1/σ1/\sigma-Lipschitz on \mathbb{R}. Else, when α<1\alpha<1, we have that ρσ\rho_{\sigma} is convex between ±σ|α2|/(1α)\pm\sigma\sqrt{\lvert{\alpha-2}\rvert/(1-\alpha)}, and is (1/σ)((1α)/|α2|)1α(1/\sigma)(\sqrt{(1-\alpha)/\lvert{\alpha-2}\rvert})^{1-\alpha}-Lipschitz on \mathbb{R}.

Furthermore, all these coefficients are tight (see also Figure 9).

Refer to caption
Refer to caption
Figure 8: Here we consider the function xρ(x/σ;α)x\mapsto\rho(x/\sigma;\alpha), with ρ\rho from the Barron class (15), for a variety of shape parameter values α\alpha, with scale parameter σ=0.5\sigma=0.5 fixed. We plot graphs for this function (left), its first derivative (center), and its second derivative (right), each computed over ±5σ\pm 5\sigma.
Refer to caption
Refer to caption
Figure 9: Here we show zoomed-in versions of the two right-most plots in Figure 8, now with dashed horizontal lines denoting the Lipschitz (left) and smoothness (right) coefficients stated in Lemma 8 (both positive and negative values). The Lipschitz coefficients depend on α\alpha and σ\sigma, and the colors in the left plot reflect the α\alpha value. The smoothness coefficient 1/σ21/\sigma^{2} is independent of α\alpha, and is drawn in light gray. Numerically, we can see that the coefficients are tight, though tighter lower bounds for the second derivatives are possible.
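The tightness of these coefficients (cf. Figure 9) can also be spot-checked with a short numerical sketch. Here we use the closed forms ρσ(x) = log(1 + x²/(2σ²)) for the α = 0 member and ρσ(x) = 1 − exp(−x²/(2σ²)) for the α = −∞ member, with their first derivatives written out by hand; the grid and the value of σ are arbitrary choices of ours.

```python
import numpy as np

sigma = 0.5
x = np.linspace(-5 * sigma, 5 * sigma, 200_001)

# alpha = 0 member: rho_sigma(x) = log(1 + x^2 / (2 sigma^2))
d0 = (x / sigma**2) / (1 + x**2 / (2 * sigma**2))        # first derivative
# alpha = -inf member: rho_sigma(x) = 1 - exp(-x^2 / (2 sigma^2))
dinf = (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))   # first derivative

lip0 = np.abs(d0).max()      # should match 1 / (sqrt(2) * sigma)
lipinf = np.abs(dinf).max()  # should match (1 / sigma) * exp(-1/2)
```

Finite differences of `d0` and `dinf` similarly stay below the common smoothness coefficient 1/σ², attaining it near the origin.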

The dispersion Dρ(h;θ) . . =𝐄μρ(𝖫(h)θ)\operatorname{D}_{\rho}(h;\theta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname{\mathbf{E}}_{\mu}\rho(\operatorname{\mathsf{L}}(h)-\theta) plays a prominent role in the risk definitions considered in this paper, and one is naturally interested in the properties of the map θDρ(h;θ)\theta\mapsto\operatorname{D}_{\rho}(h;\theta). The following lemma shows that using ρ=ρσ\rho=\rho_{\sigma} from the Barron class, we can differentiate under the integral without needing any additional conditions beyond those required for finiteness.

Lemma 9.

Let ρσ\rho_{\sigma} be as in Lemma 8. Assume that the random loss 𝖫(h)\operatorname{\mathsf{L}}(h) is \mathcal{F}-measurable in general, and that 𝐄μ|𝖫(h)|<\operatorname{\mathbf{E}}_{\mu}\lvert\operatorname{\mathsf{L}}(h)\rvert<\infty holds whenever 1<α21<\alpha\leq 2. It follows that the first two derivatives are μ\mu-integrable, namely that

|𝐄μρσ(𝖫(h)θ)|<,|𝐄μρσ′′(𝖫(h)θ)|<\displaystyle\lvert\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)\rvert<\infty,\qquad\lvert\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)\rvert<\infty (26)

for any θ\theta\in\mathbb{R}. Furthermore, the function θDρ(h;θ)\theta\mapsto\operatorname{D}_{\rho}(h;\theta) is twice-differentiable on \mathbb{R}, and satisfies the Leibniz integration property for both derivatives, that is

ddθDρ(h;θ)|θ=u=𝐄μρσ(𝖫(h)u),d2dθ2Dρ(h;θ)|θ=u=𝐄μρσ′′(𝖫(h)u)\displaystyle\left.\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta)\right|_{\theta=u}=-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-u),\qquad\left.\frac{\mathop{}\!\mathrm{d}^{2}}{\mathop{}\!\mathrm{d}\theta^{2}}\operatorname{D}_{\rho}(h;\theta)\right|_{\theta=u}=\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-u) (27)

for any uu\in\mathbb{R}. Let us emphasize that ρσ{\rho}^{\prime}_{\sigma} and ρσ′′{\rho}^{\prime\prime}_{\sigma} denote the first and second derivatives of xρ(x/σ;α)x\mapsto\rho(x/\sigma;\alpha), which differ from ρ(x/σ;α){\rho}^{\prime}(x/\sigma;\alpha) and ρ′′(x/σ;α){\rho}^{\prime\prime}(x/\sigma;\alpha) by a σ\sigma-dependent factor; see §D.3 for details.

With first-order information about the expected dispersion in hand, one can readily obtain conditions under which the special case of T-risk R¯ρ(h;η) . . =infθRρ(h;θ,η)\underline{\operatorname{R}}_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\inf_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta) is finite and determined by a meaningful “optimal threshold.”

Lemma 10.

Following the setup of Lemma 1, let Dρ(h;θ)<\operatorname{D}_{\rho}(h;\theta)<\infty for all θ\theta\in\mathbb{R}. If α>1\alpha>1, then for any choice of σ>0\sigma>0 and η\eta\in\mathbb{R}, there exists a finite optimal threshold θρ(h;η)\theta_{\rho}(h;\eta)\in\mathbb{R} such that

R¯ρ(h;η)=ηθρ(h;η)+Dρ(h;θρ(h;η)).\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)=\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta)). (28)

In the case of α=1\alpha=1, we have that (28) holds if and only if |η|<1/σ\lvert{\eta}\rvert<1/\sigma. In both cases, the optimal threshold is unique. In the case of α<1\alpha<1, there is no minimum, and thus R¯ρ(h;η)=\underline{\operatorname{R}}_{\rho}(h;\eta)=-\infty.

Remark 11.

We have overloaded our notation θρ(h;η)\theta_{\rho}(h;\eta) in Lemma 10, recalling that we have used the same notation to denote the set of optimal thresholds for the T-risk in (11). This saves us from having to introduce additional symbols, and should not lead to any confusion, since we only overload the notation when the solution set contains a single element.

B.4 Smoothness of the T-risk

When the objective Rρ(h;θ,η)\operatorname{R}_{\rho}(h;\theta,\eta) in (11) is sufficiently smooth, we can apply well-established analytical techniques to control the gradient norms of stochastic gradient-based learning algorithms. Assuming we have unbiased first-order stochastic feedback as in (17)–(18), we will always have to deal with terms of the form 𝐄μ[ρσ(𝖫(h)θ)𝖫(h)]\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h)\right]. Defining f(h,θ) . . =ρσ(𝖫(h)θ)𝖫(h)f(h,\theta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}={\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h) for readability, and considering the function difference at two arbitrary points (h1,θ1)(h_{1},\theta_{1}) and (h2,θ2)(h_{2},\theta_{2}), first note that

f\displaystyle f (h1,θ1)f(h2,θ2)\displaystyle(h_{1},\theta_{1})-f(h_{2},\theta_{2})
=ρσ(𝖫(h1)θ1)[𝖫(h1)𝖫(h2)]A+[ρσ(𝖫(h1)θ1)ρσ(𝖫(h2)θ2)]𝖫(h2)B.\displaystyle=\underbrace{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})\left[{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}_{A}+\underbrace{\left[{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})\right]{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}_{B}. (29)

In the case of ρ()=ρ(;α)\rho(\cdot)=\rho(\cdot;\alpha) from the Barron class (15), when α1-\infty\leq\alpha\leq 1, we have that ρ{\rho}^{\prime} is both bounded (ρ<\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty) and Lipschitz continuous on \mathbb{R} (see Lemma 8). This means that all we need in order to control 𝐄μA+𝐄μB\operatorname{\mathbf{E}}_{\mu}A+\operatorname{\mathbf{E}}_{\mu}B is for 𝖫\operatorname{\mathsf{L}} to be smooth (for control of AA) and for 𝖫(){\operatorname{\mathsf{L}}}^{\prime}(\cdot) to have a norm bounded over \mathcal{H} (for control of BB); see §C.2 for more details. Things are more difficult in the case of 1<α21<\alpha\leq 2, since the dispersion function ρ\rho is not (globally) Lipschitz, meaning that ρ=\lVert{{\rho}^{\prime}}\rVert_{\infty}=\infty. Even if 𝖫\operatorname{\mathsf{L}} is smooth, when the threshold parameter is left unconstrained, it will always be possible for 𝐄μA\lVert{\operatorname{\mathbf{E}}_{\mu}A}\rVert\to\infty as |θ1|\lvert{\theta_{1}}\rvert\to\infty, impeding smoothness guarantees for Rρ\operatorname{R}_{\rho} in this setting.

Let us proceed by distilling the preceding discussion into a set of concrete conditions that are sufficient to make (h,θ)Rρ(h;θ,η)(h,\theta)\mapsto\operatorname{R}_{\rho}(h;\theta,\eta) amenable to standard analysis techniques for stochastic gradient-based algorithms. For readability, we write 𝖫 . . =sup{𝖫(h):h}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\sup\{\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert:h\in\mathcal{H}\}.

  • A1.

    Moment bound for loss gradient. For any choice of h1,h2h_{1},h_{2}\in\mathcal{H}, 0<c<10<c<1, and k{1,2}k\in\{1,2\}, the loss 𝖫\operatorname{\mathsf{L}} is differentiable at ch1+(1c)h2ch_{1}+(1-c)h_{2}, and satisfies

    𝐄μ(sup0<c<1𝖫(ch1+(1c)h2))k𝐄μ𝖫k<.\displaystyle\operatorname{\mathbf{E}}_{\mu}\left(\sup_{0<c<1}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(ch_{1}+(1-c)h_{2})}\rVert\right)^{k}\leq\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{k}<\infty. (30)
  • A2.

    Loss is smooth in expectation. There exists 0<λ1<0<\lambda_{1}<\infty such that for any choice of h1,h2h_{1},h_{2}\in\mathcal{H}, we have 𝐄μ𝖫(h1)𝖫(h2)λ1h1h2\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert\leq\lambda_{1}\lVert{h_{1}-h_{2}}\rVert.

  • A3.

    Dispersion is Lipschitz and smooth. The function ρ\rho is such that ρ<\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty, and there exists 0<λ2<0<\lambda_{2}<\infty such that |ρ(x1)ρ(x2)|λ2|x1x2|\lvert{{\rho}^{\prime}(x_{1})-{\rho}^{\prime}(x_{2})}\rvert\leq\lambda_{2}\lvert{x_{1}-x_{2}}\rvert for all x1,x2x_{1},x_{2}\in\mathbb{R}.

If \mathcal{H} is a convex set, then the first inequality in A1. holds trivially. Note that under A2., the right-hand side of (30) will be finite for k=1k=1 whenever \mathcal{H} has bounded diameter and 𝐄μ𝖫(h)<\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert<\infty for some hh\in\mathcal{H}. As for A3., all the requirements are clearly met by the Barron class with α1-\infty\leq\alpha\leq 1. These conditions imply a Lipschitz property for the gradients, as summarized in the following lemma.

Lemma 12.

Let the conditions A1., A2., and A3. hold. Then, the T-risk map (h,θ)Rρ(h;θ,η)(h,\theta)\mapsto\operatorname{R}_{\rho}(h;\theta,\eta) defined in (11) is smooth on ×\mathcal{H}\times\mathbb{R} in the sense that

Rρ(h1;θ1,η)Rρ(h2;θ2,η)\displaystyle\lVert{{\operatorname{R}}^{\prime}_{\rho}(h_{1};\theta_{1},\eta)-{\operatorname{R}}^{\prime}_{\rho}(h_{2};\theta_{2},\eta)}\rVert (λ5σ+λ2σ2𝐄μ𝖫)(h1h2+|θ1θ2|)\displaystyle\leq\left(\frac{\lambda_{5}}{\sigma}+\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)

for any choice of h1,h2h_{1},h_{2}\in\mathcal{H} and θ1,θ2\theta_{1},\theta_{2}\in\mathbb{R}. Here the factor λ5\lambda_{5} is defined λ5 . . =λ3+λ4\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}, where

λ3 . . =(λ2σ)[𝐄μ𝖫2+suph𝐄μ𝖫(h)],λ4 . . =λ1ρ.\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right],\qquad\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}.
Remark 13 (Norm on product spaces).

In Lemma 12 we have to deal with norms on product spaces, and in all cases we just use the traditional choice of summing the norms of the constituent elements, i.e., on ×\mathcal{H}\times\mathbb{R}, we have (h,θ) . . =h+|θ|\lVert{(h,\theta)}\rVert\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lVert{h}\rVert+\lvert{\theta}\rvert. Similarly, the gradient Rρ(h;θ,η)=(hRρ(h;θ,η),θRρ(h;θ,η)){\operatorname{R}}^{\prime}_{\rho}(h;\theta,\eta)=(\partial_{h}\operatorname{R}_{\rho}(h;\theta,\eta),\partial_{\theta}\operatorname{R}_{\rho}(h;\theta,\eta)) is a pair of linear functionals, and the norm of Rρ(h;θ,η){\operatorname{R}}^{\prime}_{\rho}(h;\theta,\eta) is defined as the sum of the norms of these two constituent functionals.

Proving Lemma 12 is straightforward but somewhat tedious. Detailed computations as well as a direct proof are organized in §C.2 for easy reference.

B.5 Experimental details

Here we provide some additional details for the empirical analysis carried out in §3.3 and §5. Detailed hyperparameter settings and seeds for exact re-creation of all the results in this paper are available at the GitHub repository cited in §1, and thus here we will not explicitly write all the step sizes, shape settings, etc., but rather focus on concise exposition of points on which we expect readers to desire clarification.

Static risk analysis

For our experiments in §3.3, we gave just one plot using a log-Normal distribution, but analogous tests were run for a wide variety of parametric distributions. In total, we have run tests for Bernoulli, Beta, χ2\chi^{2}, Exponential, Gamma, log-Normal, Normal, Pareto, Uniform, Wald, and Weibull distributions. The settings of each distribution being sampled from have not been tweaked at all; we set the parameters rather arbitrarily before running any tests. As for the fixed value of σ=0.5\sigma=0.5 in the T-risk across all tests, we tested several values of σ\sigma ranging from 0.10.1 to 1010, and based on the rough position of θρ(𝖫;η)\theta_{\rho}(\operatorname{\mathsf{L}};\eta) in the Normal case, we determined 0.50.5 to be a reasonable representative value; indeed, this setting performs quite well across a very wide range of distributions. Regarding the optimization involved in solving for the optimal threshold θ\theta (for T-risk, OCE risks, and DRO risk), we use minimize_scalar from SciPy, with the bounded solver type and valid brackets set manually.
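To make the last point concrete, a minimal sketch of the T-risk threshold solve might look as follows. The generic Barron form, the loss sample, and the hyperparameter values here are illustrative, and the bracketing rule (loss range padded by a multiple of σ) is our own simple stand-in for the manually set brackets described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def barron_rho(u, alpha):
    """Generic Barron-class member (alpha not in {0, 2, -inf})."""
    return (abs(alpha - 2) / alpha) * ((1 + u**2 / abs(alpha - 2)) ** (alpha / 2) - 1)

def optimal_threshold(losses, alpha=1.5, sigma=0.5, eta=1.0):
    """Solve min_theta [eta * theta + mean(rho((l - theta)/sigma; alpha))]
    with SciPy's bounded scalar solver, bracketing theta by the loss range."""
    def objective(theta):
        return eta * theta + barron_rho((losses - theta) / sigma, alpha).mean()
    lo, hi = losses.min() - 10 * sigma, losses.max() + 10 * sigma
    return minimize_scalar(objective, bounds=(lo, hi), method="bounded").x
```

Since the objective is convex in θ for α ≥ 1 (Lemma 8), the bounded solver returns the unique minimizer whenever the bracket contains it.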

Noisy linear classification

In the tests described in §5.1, we only give error/norm trajectories for “representative” settings of each risk class (Figure 4). For T-risk we consider different choices of 1α21\leq\alpha\leq 2, for CVaR we consider 0.025β0.750.025\leq\beta\leq 0.75, for tilted risk we consider γ\gamma between ±0.05\pm 0.05, and for χ2\chi^{2}-DRO we consider 0.025a~0.350.025\leq\widetilde{a}\leq 0.35. For each of these ranges, we evaluate five evenly-spaced candidates (via np.linspace), and representative settings were selected as those which achieved the best classification error (average zero-one loss) after the final iteration. In the event of ties, the smaller hyperparameter value was always selected (via np.argmin).
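As a small sketch of this selection rule (the error values below are made up purely for illustration), an ascending candidate grid combined with `np.argmin` automatically resolves ties toward the smaller hyperparameter value:

```python
import numpy as np

candidates = np.linspace(1.0, 2.0, 5)   # e.g., the alpha grid used for T-risk
# Hypothetical final classification errors, one per candidate:
final_errors = np.array([0.31, 0.24, 0.24, 0.29, 0.33])
# np.argmin returns the index of the FIRST minimum, so on an ascending
# grid any tie is broken toward the smaller hyperparameter value.
best = candidates[np.argmin(final_errors)]
```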

Regression under outliers

For the tests introduced in §5.2, we have given results for learning algorithms started at a point which is quite accurate for the majority of the data points, but incurs extremely large errors on the outlying minority. This choice of initial value naturally has a strong impact on the behavior of learning algorithms under each risk. For some perspective, in Figure 13 we provide results using a different initial value (again, in gray), which complement Figure 5. Since the naive strategy of fixing θ\theta at the initial median sets the scale extremely large and close to the loss incurred at most points, even a large number of gradient-based updates results in minimal change. The basic reason for this is quite straightforward: since the T-risk gradient is modulated by ρσ(𝖫(h)θ)\rho_{\sigma}^{\prime}(\operatorname{\mathsf{L}}(h)-\theta), and all points are such that either 𝖫i(h)θ\operatorname{\mathsf{L}}_{i}(h)\ll\theta (the minority) or 𝖫i(h)θ\operatorname{\mathsf{L}}_{i}(h)\approx\theta (the majority), both cases shrink the norm of the update direction vector when α1\alpha\leq 1. When implementing T-risk in the more traditional way (jointly in (h,θ)(h,\theta)) and choosing α[1,2]\alpha\in[1,2], we see results that are very similar to CVaR. Finally, we remark that here we have set the hyperparameter ranges with upper bounds low enough that the learning procedure described here does not run into overflow errors.

Classification under larger benchmark datasets

In §5.3 we make use of several well-known benchmark datasets, and in our figures we identify them respectively by the following keywords: adult (https://archive.ics.uci.edu/ml/datasets/Adult), australian (https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)), cifar10 (https://www.cs.toronto.edu/~kriz/cifar.html), covtype (https://archive.ics.uci.edu/ml/datasets/covertype), emnist_balanced (https://www.nist.gov/itl/products-and-services/emnist-dataset), fashion_mnist (https://github.com/zalandoresearch/fashion-mnist), and protein (https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data). For further background on all of these datasets, please access the URLs just given. As mentioned in the main text, for a kk-class problem with dd features, the predictor is characterized by kk weighting vectors (w1,,wk)(w_{1},\ldots,w_{k}), where each wjdw_{j}\in\mathbb{R}^{d} computes scores as wjx{w}^{\top}_{j}x for xdx\in\mathbb{R}^{d}. These weight vectors are trained using the usual multi-class logistic loss, namely the negative log-likelihood of the kk-Categorical distribution that arises after passing these scores through the (logistic) softmax function. Regarding step sizes, as mentioned in the main text, we consider choices of factor c{0.1,0.5,1.0,1.5,2.0}c\in\{0.1,0.5,1.0,1.5,2.0\}, and set the step size to c/kdc/\sqrt{kd}, where dd and kk are as just stated. In Figures 14–18 of §B.6, we provide additional results for the datasets not covered in Figures 6–7 from the main text.
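As a minimal sketch of the predictor and loss just described (the helper name multiclass_logistic_loss is ours, with small k and d chosen purely for illustration):

```python
import numpy as np

def multiclass_logistic_loss(W, x, y):
    """Negative log-likelihood of the k-Categorical distribution obtained
    by passing the k linear scores w_j^T x through the softmax.
    W has shape (k, d); y is the index of the correct class."""
    scores = W @ x
    scores = scores - scores.max()  # numerically stable softmax
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[y]

k, d = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(k, d))
x = rng.normal(size=d)
loss = multiclass_logistic_loss(W, x, y=0)

# Step sizes considered in the text: c / sqrt(k * d) for each factor c.
step_sizes = [c / np.sqrt(k * d) for c in (0.1, 0.5, 1.0, 1.5, 2.0)]
print(loss, step_sizes)
```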

B.6 Additional figures

Here we include a number of figures that complement those provided in the main text. A brief summary is given below.

  • Figures 10–11 are additional results from the experiments described in §5.1, here using logistic and hinge losses instead of the unhinged loss.

  • Figures 12–13 are related to the regression under outliers task described in §5.2. The first figure shows how different regression lines incur very different loss distributions under different convex base loss functions. The second figure illustrates how a different initial value impacts the learned regression lines under each risk class.

  • Figures 14–18 are all completely analogous to Figure 6 given in the main text of §5.3, but for additional benchmark datasets.

Figure 10: Analogous to Figure 4, this time using logistic loss.
Figure 11: Analogous to Figure 4, this time using the hinge loss.
Figure 12: Here the left-most plot is the Belgian phone call dataset for use in a one-dimensional regression task, with two linear regression candidates. The remaining three plots show histograms of the loss distributions incurred by each of these candidates using three loss functions commonly used in regression.
Figure 13: Same procedure as in Figure 5, but this time started from a different initial value.
Figure 14: Using T-risk to interpolate between test loss distributions (dataset: adult).
Figure 15: Using T-risk to interpolate between test loss distributions (dataset: australian).
Figure 16: Using T-risk to interpolate between test loss distributions (dataset: cifar10).
Figure 17: Using T-risk to interpolate between test loss distributions (dataset: fashion_mnist).
Figure 18: Using T-risk to interpolate between test loss distributions (dataset: protein).

Appendix C Detailed proofs

C.1 Proofs of results in the main text

Proof of Lemma 8.

In this proof, without further mention, we will make regular use of the following two helper results: Lemma 14 (bounded gradient implies Lipschitz continuity) and Lemma 15 (positive definite Hessian implies convexity). For reference, the first and second derivatives of ρσ\rho_{\sigma} are given in §D.3. We take up each α\alpha setting one at a time.

First, the case of α=2\alpha=2. For this case, clearly ρσ{\rho}^{\prime}_{\sigma} is unbounded, and thus ρσ\rho_{\sigma} is not (globally) Lipschitz on \mathbb{R}. On the other hand, since ρσ′′(x)=1/σ2{\rho}^{\prime\prime}_{\sigma}(x)=1/\sigma^{2}, we have that ρσ{\rho}^{\prime}_{\sigma} is λ\lambda-Lipschitz with λ=(1/σ2)\lambda=(1/\sigma^{2}).

Next, the case of α=0\alpha=0. For any fixed σ>0\sigma>0, in both the limits x0x\to 0 and |x|\lvert x\rvert\to\infty, we have ρσ(x)0{\rho}^{\prime}_{\sigma}(x)\to 0. Maximum and minimum values are achieved when ρσ′′(x)=0{\rho}^{\prime\prime}_{\sigma}(x)=0, and this occurs if and only if x2=2σ2x^{2}=2\sigma^{2}. It follows from direct computation that ρσ(±2σ)=±1/(2σ){\rho}^{\prime}_{\sigma}(\pm\sqrt{2}\sigma)=\pm 1/(\sqrt{2}\sigma), and thus ρσ\rho_{\sigma} is λ\lambda-Lipschitz with λ=1/(2σ)\lambda=1/(\sqrt{2}\sigma). Next, recalling that ρσ′′{\rho}^{\prime\prime}_{\sigma} takes the form

ρσ′′(x)=2x2+2σ2(12x2x2+2σ2),\displaystyle{\rho}^{\prime\prime}_{\sigma}(x)=\frac{2}{x^{2}+2\sigma^{2}}\left(1-\frac{2x^{2}}{x^{2}+2\sigma^{2}}\right),

we see that this is a product of two factors, one taking values in (0,1/σ2)(0,1/\sigma^{2}), and one taking values in (1,1)(-1,1). The absolute value of both of these factors is maximized when x=0x=0, and so |ρσ′′(x)|ρσ′′(0)=1/σ2\lvert{\rho}^{\prime\prime}_{\sigma}(x)\rvert\leq{\rho}^{\prime\prime}_{\sigma}(0)=1/\sigma^{2}, meaning that ρσ{\rho}^{\prime}_{\sigma} is λ\lambda-Lipschitz with λ=1/σ2\lambda=1/\sigma^{2}. Finally, regarding convexity, we have that ρσ′′(x)0{\rho}^{\prime\prime}_{\sigma}(x)\geq 0 if and only if |x|2σ\lvert x\rvert\leq\sqrt{2}\sigma.
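As an informal numerical sanity check of these constants (not part of the proof), one can evaluate the closed-form derivative for the α=0 case on a fine grid; the choice σ=1.7 below is arbitrary:

```python
import numpy as np

# For alpha = 0, rho_sigma(x) = log(1 + (x/sigma)^2 / 2), whose first
# derivative is 2x / (x^2 + 2 sigma^2).  Its extrema sit at
# x = +/- sqrt(2) sigma, with values +/- 1/(sqrt(2) sigma).
sigma = 1.7
x = np.linspace(-50.0, 50.0, 2_000_001)
dprime = 2 * x / (x**2 + 2 * sigma**2)
max_abs = np.abs(dprime).max()
print(max_abs, 1 / (np.sqrt(2) * sigma))
```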

Next, the case of α=\alpha=-\infty. For any fixed σ>0\sigma>0, we have ρσ(x)0{\rho}^{\prime}_{\sigma}(x)\to 0 in both the limits x0x\to 0 and |x|\lvert x\rvert\to\infty. Furthermore, it is immediate that ρσ′′(x)=0{\rho}^{\prime\prime}_{\sigma}(x)=0 at the points x=±σx=\pm\sigma. Evaluating ρσ{\rho}^{\prime}_{\sigma} at these stationary points we have ρσ(±σ)=±(1/σ)exp(1/2){\rho}^{\prime}_{\sigma}(\pm\sigma)=\pm(1/\sigma)\exp(-1/2), and thus ρσ\rho_{\sigma} is λ\lambda-Lipschitz with λ=(1/σ)exp(1/2)\lambda=(1/\sigma)\exp(-1/2). Regarding bounds on ρσ′′{\rho}^{\prime\prime}_{\sigma}, first note that ρσ′′(x)0{\rho}^{\prime\prime}_{\sigma}(x)\to 0 as |x|\lvert x\rvert\to\infty, and ρσ′′(0)=1/σ2{\rho}^{\prime\prime}_{\sigma}(0)=1/\sigma^{2}. Then to identify stationary points, note that

ρσ′′′(x)=1σ2exp(12(xσ)2)[xσ2(x2σ21)2xσ2]\displaystyle\rho_{\sigma}^{\prime\prime\prime}(x)=\frac{1}{\sigma^{2}}\exp\left(-\frac{1}{2}\left(\frac{x}{\sigma}\right)^{2}\right)\left[\frac{x}{\sigma^{2}}\left(\frac{x^{2}}{\sigma^{2}}-1\right)-\frac{2x}{\sigma^{2}}\right]

and thus ρσ′′′(x)=0\rho_{\sigma}^{\prime\prime\prime}(x)=0 if and only if (x/σ)21=2(x/\sigma)^{2}-1=2, i.e., the stationary points are x=±3σx=\pm\sqrt{3}\sigma, both of which yield the same value, namely ρσ′′(±3σ)=(2/σ2)exp(3/2){\rho}^{\prime\prime}_{\sigma}(\pm\sqrt{3}\sigma)=-(2/\sigma^{2})\exp(-3/2). Since 2exp(3/2)0.45<12\exp(-3/2)\approx 0.45<1, we conclude that ρσ{\rho}^{\prime}_{\sigma} is λ\lambda-Lipschitz with λ=1/σ2\lambda=1/\sigma^{2}. Finally, since ρσ′′(x)0{\rho}^{\prime\prime}_{\sigma}(x)\geq 0 if and only if |x|σ\lvert x\rvert\leq\sigma, this specifies the region on which ρσ\rho_{\sigma} is convex.
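Again as an informal numerical check (not part of the proof), the claimed extreme values of the second derivative in the α=-∞ case can be confirmed on a grid; we take σ=1 for simplicity:

```python
import numpy as np

# For alpha = -infty, rho''_sigma(x) = (1/sigma^2) exp(-(x/sigma)^2/2)
# * (1 - (x/sigma)^2).  It should peak at 1/sigma^2 (at x = 0) and dip
# to -(2/sigma^2) exp(-3/2) at x = +/- sqrt(3) sigma.
sigma = 1.0
x = np.linspace(-10.0, 10.0, 2_000_001)
ddprime = np.exp(-0.5 * (x / sigma) ** 2) * (1 - (x / sigma) ** 2) / sigma**2
print(ddprime.max(), ddprime.min(), -2 * np.exp(-1.5))
```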

Finally, all that remains is the general case of <α<2-\infty<\alpha<2 where α0\alpha\neq 0. Note that in order for ρσ′′(x)=0{\rho}^{\prime\prime}_{\sigma}(x)=0 to hold, we require

2(x/σ)21+(x/σ)2/|α2|=|α2|1(α/2),\displaystyle\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}=\frac{\lvert{\alpha-2}\rvert}{1-(\alpha/2)},

which via some basic algebra is equivalent to

(xσ)2=|α2|1α.\displaystyle\left(\frac{x}{\sigma}\right)^{2}=\frac{\lvert{\alpha-2}\rvert}{1-\alpha}.

Clearly, this is only possible when α<1\alpha<1, so we consider this sub-case first. This implies stationary points ±x . . =±σ|α2|/(1α)\pm x^{\ast}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\pm\sigma\sqrt{\lvert{\alpha-2}\rvert/(1-\alpha)}, for which we have

ρσ(±x)=±1σ|α2|1α(2α1α)(α/2)1=±1σ(|α2|1α)α1=±1σ(1α|α2|)1α.\displaystyle{\rho}^{\prime}_{\sigma}(\pm x^{\ast})=\pm\frac{1}{\sigma}\sqrt{\frac{\lvert{\alpha-2}\rvert}{1-\alpha}}\left(\frac{2-\alpha}{1-\alpha}\right)^{(\alpha/2)-1}=\pm\frac{1}{\sigma}\left(\sqrt{\frac{\lvert{\alpha-2}\rvert}{1-\alpha}}\right)^{\alpha-1}=\pm\frac{1}{\sigma}\left(\sqrt{\frac{1-\alpha}{\lvert{\alpha-2}\rvert}}\right)^{1-\alpha}.

Since ρσ(x)0{\rho}^{\prime}_{\sigma}(x)\to 0 in both the limits x0x\to 0 and |x|\lvert x\rvert\to\infty, we have obtained a maximum value for ρσ{\rho}^{\prime}_{\sigma} at xx^{\ast}, thus implying for the case of α<1\alpha<1 that ρσ\rho_{\sigma} is λ\lambda-Lipschitz, with a coefficient of λ=(1/σ)((1α)/|α2|)1α\lambda=(1/\sigma)(\sqrt{(1-\alpha)/\lvert{\alpha-2}\rvert})^{1-\alpha}. For the case of α=1\alpha=1, direct inspection shows

|ρσ(x)|=|x/σ2|1+(x/σ)2=1σ2(1/x2)+(1/σ2),\displaystyle\lvert{\rho}^{\prime}_{\sigma}(x)\rvert=\frac{\lvert{x/\sigma^{2}}\rvert}{\sqrt{1+(x/\sigma)^{2}}}=\frac{1}{\sigma^{2}\sqrt{(1/x^{2})+(1/\sigma^{2})}},

a value which is maximized in the limit |x|\lvert x\rvert\to\infty. As such, for α=1\alpha=1, we have that ρσ\rho_{\sigma} is λ\lambda-Lipschitz with λ=1/σ\lambda=1/\sigma. For the case of 1<α<21<\alpha<2, ρσ{\rho}^{\prime}_{\sigma} is unbounded. To see this, note that for x>0x>0 we have

ρσ(x)=1σ2(1+(x/σ)2/|α2|)α/2((1/x)+x/(σ2|α2|))(1σ2+α|α2|α)xα((1/x)+x/(σ2|α2|)),\displaystyle{\rho}^{\prime}_{\sigma}(x)=\frac{1}{\sigma^{2}}\frac{\left(1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert\right)^{\alpha/2}}{\left((1/x)+x/(\sigma^{2}\lvert{\alpha-2}\rvert)\right)}\geq\left(\frac{1}{\sigma^{2+\alpha}\sqrt{\lvert{\alpha-2}\rvert^{\alpha}}}\right)\frac{x^{\alpha}}{\left((1/x)+x/(\sigma^{2}\lvert{\alpha-2}\rvert)\right)},

and since α>1\alpha>1, sending xx\to\infty clearly implies ρσ(x){\rho}^{\prime}_{\sigma}(x)\to\infty, and this means that ρσ\rho_{\sigma} cannot be Lipschitz on \mathbb{R} when α>1\alpha>1. As for bounds on ρσ′′{\rho}^{\prime\prime}_{\sigma}, recall that

ρσ′′(x)=1σ2(1+(x/σ)2|α2|)(α/2)1A(x)(11(α/2)|α2|2(x/σ)21+(x/σ)2/|α2|B(x))\displaystyle{\rho}^{\prime\prime}_{\sigma}(x)=\frac{1}{\sigma^{2}}\underbrace{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1}}_{A(x)}\left(1-\underbrace{\frac{1-(\alpha/2)}{\lvert{\alpha-2}\rvert}\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}}_{B(x)}\right)

where we have introduced the labels A(x)A(x) and B(x)B(x) just as convenient notation. Fixing any σ>0\sigma>0, first note that since α<2\alpha<2, we have (α/2)1<0(\alpha/2)-1<0 and thus 0A(x)10\leq A(x)\leq 1. Next, direct inspection shows 0B(x)2(1(α/2))0\leq B(x)\leq 2(1-(\alpha/2)). These two facts immediately imply an upper bound ρσ′′(x)1/σ2{\rho}^{\prime\prime}_{\sigma}(x)\leq 1/\sigma^{2} and a lower bound ρσ′′(x)(1α)/σ2{\rho}^{\prime\prime}_{\sigma}(x)\geq-(1-\alpha)/\sigma^{2}, both of which hold for any α<2\alpha<2. Furthermore, for the case of 1α<21\leq\alpha<2, we thus have 0ρσ′′(x)1/σ20\leq{\rho}^{\prime\prime}_{\sigma}(x)\leq 1/\sigma^{2}. When α<1\alpha<1 however, ρσ′′{\rho}^{\prime\prime}_{\sigma} can be negative. To get matching lower bounds requires A(x)(1B(x))1A(x)(1-B(x))\geq-1, or A(x)(B(x)1)1A(x)(B(x)-1)\leq 1. To study conditions under which this holds, first note that B(x)B(x) can be re-written as

B(x)=(2α|α2|)(x/σ)21+(x/σ)2|α2|=(x/σ)21+(x/σ)2|α2|,\displaystyle B(x)=\left(\frac{2-\alpha}{\lvert{\alpha-2}\rvert}\right)\frac{(x/\sigma)^{2}}{1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}}=\frac{(x/\sigma)^{2}}{1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}},

and thus we have

A(x)(B(x)1)=(x/σ)2(1+(x/σ)2|α2|)2(α/2)1(1+(x/σ)2|α2|)1(α/2).\displaystyle A(x)(B(x)-1)=\frac{(x/\sigma)^{2}}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}}-\frac{1}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{1-(\alpha/2)}}. (31)

To get a more convenient upper bound on this, observe that (1+x)1(α/2)(1+x)2(α/2)(1+x)^{1-(\alpha/2)}\leq(1+x)^{2-(\alpha/2)} for any x0x\geq 0 and α2-\infty\leq\alpha\leq 2. It follows immediately that

A(x)(B(x)1)(x/σ)21(1+(x/σ)2|α2|)2(α/2).\displaystyle A(x)(B(x)-1)\leq\frac{(x/\sigma)^{2}-1}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}}. (32)

To get the right-hand side of (32) to be no greater than 11 is equivalent to

(x/σ)21(1+(x/σ)2|α2|)2(α/2).\displaystyle(x/\sigma)^{2}-1\leq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}. (33)

For the case of 0α<10\leq\alpha<1, note that 1|α2|=2α<2(α/2)1\leq\lvert{\alpha-2}\rvert=2-\alpha<2-(\alpha/2), and using the helper inequality (69), we have

(1+(x/σ)2|α2|)2(α/2)(1+(x/σ)2|α2|)|α2|1+(x/σ)2>(x/σ)21,\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}\geq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{\lvert{\alpha-2}\rvert}\geq 1+(x/\sigma)^{2}>(x/\sigma)^{2}-1,

which implies (33) for 0α<10\leq\alpha<1. All that remains is the case of <α<0-\infty<\alpha<0, which requires a bit more care. Returning to the exact form of A(x)(B(x)1)A(x)(B(x)-1) given in (31), note that the inequality

(x/σ)2(1+(x/σ)2|α2|)(1+(x/σ)2|α2|)2(α/2)\displaystyle(x/\sigma)^{2}-\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)\leq\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)} (34)

is equivalent to the desired property, i.e., A(x)(B(x)1)1(34)A(x)(B(x)-1)\leq 1\iff\text{(\ref{eqn:basic_barron_3})}. Using Bernoulli’s inequality (71), we can bound the right-hand side of (34) as

(1+(x/σ)2|α2|)2(α/2)1+(2(α/2)|α2|)(x/σ)2.\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}\geq 1+\left(\frac{2-(\alpha/2)}{\lvert{\alpha-2}\rvert}\right)(x/\sigma)^{2}.

Subtracting the left-hand side of (34) from the right-hand side of the preceding inequality, we obtain

(1+(x/σ)2|α2|)2(α/2)[(x/σ)21(x/σ)2|α2|]\displaystyle\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{2-(\alpha/2)}-\left[(x/\sigma)^{2}-1-\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right] 2+(2(α/2)|α2|1+1|α2|)(x/σ)2\displaystyle\geq 2+\left(\frac{2-(\alpha/2)}{\lvert{\alpha-2}\rvert}-1+\frac{1}{\lvert{\alpha-2}\rvert}\right)(x/\sigma)^{2}
=2+(1(|α|/2)2+|α|)(x/σ)2,\displaystyle=2+\left(\frac{1-(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\right)(x/\sigma)^{2}, (35)

where the second step uses the fact that for α<0\alpha<0, we can write |α2|=2+|α|\lvert{\alpha-2}\rvert=2+\lvert{\alpha}\rvert and 2(α/2)=2+(|α|/2)2-(\alpha/2)=2+(\lvert{\alpha}\rvert/2). Note that the right-hand side of (35) is non-negative for all xx\in\mathbb{R} whenever 2α<0-2\leq\alpha<0, which via (34) tells us that A(x)(B(x)1)1A(x)(B(x)-1)\leq 1 indeed holds in this case as well. For the case of <α<2-\infty<\alpha<-2, note that showing (34) holds is equivalent to showing fα(x)0f_{\alpha}(x)\geq 0 for all x0x\geq 0, where for convenience we define the function

fα(x) . . =1+(12+|α|1)x+(1+x2+|α|)2+(|α|/2).\displaystyle f_{\alpha}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=1+\left(\frac{1}{2+\lvert{\alpha}\rvert}-1\right)x+\left(1+\frac{x}{2+\lvert{\alpha}\rvert}\right)^{2+(\lvert{\alpha}\rvert/2)}.

The first derivative is

fα(x)=2+(|α|/2)2+|α|(1+x2+|α|)1+(|α|/2)+12+|α|1,\displaystyle{f}^{\prime}_{\alpha}(x)=\frac{2+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\left(1+\frac{x}{2+\lvert{\alpha}\rvert}\right)^{1+(\lvert{\alpha}\rvert/2)}+\frac{1}{2+\lvert{\alpha}\rvert}-1,

and with this form in hand, solving for fα(x)=0{f}^{\prime}_{\alpha}(x)=0, it is straightforward to confirm that xαx_{\alpha}^{\ast} given below is a stationary point:

xα . . =(2+|α|)[(1+|α|2+(|α|/2))11+(|α|/2)1].\displaystyle x_{\alpha}^{\ast}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(2+\lvert{\alpha}\rvert\right)\left[\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-1\right].

Furthermore, it is clear that fα′′0{f}^{\prime\prime}_{\alpha}\geq 0, implying that fαf_{\alpha} is convex, and that xαx_{\alpha}^{\ast} is a minimum. As such, the minimum value taken by fαf_{\alpha} on +\mathbb{R}_{+} is

fα(xα)\displaystyle f_{\alpha}(x_{\alpha}^{\ast}) =(1+|α|2+(|α|/2))11+(|α|/2)(2+|α|)[(1+|α|2+(|α|/2))11+(|α|/2)1]+(1+|α|2+(|α|/2))2+(|α|/2)1+(|α|/2)\displaystyle=\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-\left(2+\lvert{\alpha}\rvert\right)\left[\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}-1\right]+\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{2+(\lvert{\alpha}\rvert/2)}{1+(\lvert{\alpha}\rvert/2)}}
=(2+|α|)+(1+|α|2+(|α|/2))2+(|α|/2)1+(|α|/2)(1+|α|)(1+|α|2+(|α|/2))11+(|α|/2)\displaystyle=\left(2+\lvert{\alpha}\rvert\right)+\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{2+(\lvert{\alpha}\rvert/2)}{1+(\lvert{\alpha}\rvert/2)}}-\left(1+\lvert{\alpha}\rvert\right)\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}
=(2+|α|)(1+|α|)(1+|α|2+(|α|/2))11+(|α|/2)[112+(|α|/2)]\displaystyle=\left(2+\lvert{\alpha}\rvert\right)-\left(1+\lvert{\alpha}\rvert\right)\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]
=1+(1+|α|)[1(1+|α|2+(|α|/2))11+(|α|/2)[112+(|α|/2)]].\displaystyle=1+\left(1+\lvert{\alpha}\rvert\right)\left[1-\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]\right].

We require fα(xα)0f_{\alpha}(x_{\alpha}^{\ast})\geq 0 for all <α<2-\infty<\alpha<-2. From the preceding equalities, note that a simple sufficient condition for fα(xα)1f_{\alpha}(x_{\alpha}^{\ast})\geq 1 is

(1+|α|2+(|α|/2))11+(|α|/2)[112+(|α|/2)]1\displaystyle\left(\frac{1+\lvert{\alpha}\rvert}{2+(\lvert{\alpha}\rvert/2)}\right)^{\frac{1}{1+(\lvert{\alpha}\rvert/2)}}\left[1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right]\leq 1

or equivalently

(112+(|α|/2))1+(|α|/2)(2+(|α|/2)1+|α|).\displaystyle\left(1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right)^{1+(\lvert{\alpha}\rvert/2)}\leq\left(\frac{2+(\lvert{\alpha}\rvert/2)}{1+\lvert{\alpha}\rvert}\right). (36)

Applying the helper inequality (70) to the left-hand side of (36), we have

(112+(|α|/2))1+(|α|/2)1+((1+(|α|/2))2+(|α|/2))(1|α|/22+(|α|/2))=1(1+(|α|/2)2+|α|)\displaystyle\left(1-\frac{1}{2+(\lvert{\alpha}\rvert/2)}\right)^{1+(\lvert{\alpha}\rvert/2)}\leq 1+\frac{\left(\frac{-(1+(\lvert{\alpha}\rvert/2))}{2+(\lvert{\alpha}\rvert/2)}\right)}{\left(1-\frac{-\lvert{\alpha}\rvert/2}{2+(\lvert{\alpha}\rvert/2)}\right)}=1-\left(\frac{1+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}\right) =1+(|α|/2)2+|α|\displaystyle=\frac{1+(\lvert{\alpha}\rvert/2)}{2+\lvert{\alpha}\rvert}
2+(|α|/2)1+|α|.\displaystyle\leq\frac{2+(\lvert{\alpha}\rvert/2)}{1+\lvert{\alpha}\rvert}.

This is precisely the desired inequality (36), implying fα(xα)1>0f_{\alpha}(x_{\alpha}^{\ast})\geq 1>0 for all <α<2-\infty<\alpha<-2, and in fact all real α<0\alpha<0. To summarize, we have A(x)(B(x)1)1A(x)(B(x)-1)\leq 1 for all xx\in\mathbb{R}, and thus the desired 1/σ21/\sigma^{2}-smoothness result follows, concluding the proof. ∎
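As an informal numerical check of the Lipschitz coefficient derived above for α<1 (not a substitute for the proof; the values α=0.5 and σ=1.3 are arbitrary choices):

```python
import numpy as np

# rho'_sigma for finite non-zero alpha < 2, evaluated on a fine grid.
alpha, sigma = 0.5, 1.3
x = np.linspace(-100.0, 100.0, 2_000_001)
u = (x / sigma) ** 2 / abs(alpha - 2)
dprime = (x / sigma**2) * (1 + u) ** (alpha / 2 - 1)

# Closed-form coefficient: (1/sigma) * sqrt((1-alpha)/|alpha-2|)^(1-alpha).
lam = (1 / sigma) * np.sqrt((1 - alpha) / abs(alpha - 2)) ** (1 - alpha)
print(np.abs(dprime).max(), lam)
```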

Proof of Lemma 1.

Let 𝖷\mathsf{X} denote any \mathcal{F}-measurable random variable. The continuity of ρ\rho implies that the integral 𝐄μρσ(𝖷θ)\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}(\mathsf{X}-\theta) exists for any σ>0\sigma>0; we just need to prove it is finite. (This uses the fact that any composition of (Borel) measurable functions is itself measurable [3, Lem. 1.5.7].) Since we are taking ρ\rho from the Barron class (15), we consider each α\alpha setting separately. Starting with α=2\alpha=2, note that

ρσ(𝖷θ;2)=12(𝖷θσ)2\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;2)=\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}

and thus 𝐄μ𝖷2<\operatorname{\mathbf{E}}_{\mu}\mathsf{X}^{2}<\infty is sufficient and necessary. For α=0\alpha=0, first note that we have

ρσ(𝖷θ;0)=log(1+12(𝖷θσ)2).\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;0)=\log\left(1+\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right).

Let f1(x) . . =log(1+x)f_{1}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\log(1+x) and f2(x) . . =xc/cf_{2}(x)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=x^{c}/c, where 0<c<10<c<1. Note that f1(0)=f2(0)=0f_{1}(0)=f_{2}(0)=0, and furthermore that for any x>0x>0,

f1(x)=11+x<(11+x)1c<(1x)1c=f2(x).\displaystyle{f}^{\prime}_{1}(x)=\frac{1}{1+x}<\left(\frac{1}{1+x}\right)^{1-c}<\left(\frac{1}{x}\right)^{1-c}={f}^{\prime}_{2}(x).

We may thus conclude that f1(x)f2(x)f_{1}(x)\leq f_{2}(x) for all x0x\geq 0, and thus for any 0<c<10<c<1 we have

log(1+12(𝖷θσ)2)1c(𝖷θ2σ)2c.\displaystyle\log\left(1+\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right)\leq\frac{1}{c}\left(\frac{\mathsf{X}-\theta}{\sqrt{2}\sigma}\right)^{2c}.

It follows that to ensure 𝐄μρσ(𝖷θ;0)<\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}(\mathsf{X}-\theta;0)<\infty, it is sufficient if we assume 𝐄μ|𝖷|c<\operatorname{\mathbf{E}}_{\mu}\lvert{\mathsf{X}}\rvert^{c}<\infty for some c>0c>0. Proceeding to the case of α=\alpha=-\infty, we have

ρσ(𝖷θ;)=1exp(12(𝖷θσ)2).\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;-\infty)=1-\exp\left(-\frac{1}{2}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right).

Any composition of measurable functions is measurable, and since the right-hand side is bounded above by 11 and below by 0, we have that ρσ(𝖷θ;)\rho_{\sigma}(\mathsf{X}-\theta;-\infty) is μ\mu-integrable without requiring any extra assumptions on 𝖷\mathsf{X} besides measurability. All that remains for the Barron class is the case of non-zero <α<2-\infty<\alpha<2, where we have

ρσ(𝖷θ;α)=|α2|α((1+1|α2|(𝖷θσ)2)α/21).\displaystyle\rho_{\sigma}(\mathsf{X}-\theta;\alpha)=\frac{\lvert{\alpha-2}\rvert}{\alpha}\left(\left(1+\frac{1}{\lvert{\alpha-2}\rvert}\left(\frac{\mathsf{X}-\theta}{\sigma}\right)^{2}\right)^{\alpha/2}-1\right).

Let us break this into two cases: <α<0-\infty<\alpha<0 and 0<α<20<\alpha<2. Starting with the former case, this is easy since

(1+x2)α/2=1(1+x2)α\displaystyle\left(1+x^{2}\right)^{\alpha/2}=\frac{1}{\left(\sqrt{1+x^{2}}\right)^{-\alpha}}

which is bounded above by 11 and below by 0 for any α<0\alpha<0 and xx\in\mathbb{R}, which means the random variable ρσ(𝖷θ;α)\rho_{\sigma}(\mathsf{X}-\theta;\alpha) is μ\mu-integrable without any extra assumptions on 𝖷\mathsf{X}. As for the latter case of 0<α<20<\alpha<2, first note that the monotonicity of ()α/2(\cdot)^{\alpha/2} on +\mathbb{R}_{+} implies

(1+x2)α/2|x|α\displaystyle(1+x^{2})^{\alpha/2}\geq\lvert{x}\rvert^{\alpha}

which means 𝐄μ|𝖷|α<\operatorname{\mathbf{E}}_{\mu}\lvert{\mathsf{X}}\rvert^{\alpha}<\infty is necessary. That this condition is also sufficient is immediate from the form of ρσ(𝖷θ;α)\rho_{\sigma}(\mathsf{X}-\theta;\alpha) just given. This concludes the proof; the desired result stated in the lemma follows from setting 𝖷=𝖫(h)\mathsf{X}=\operatorname{\mathsf{L}}(h) and the observation that the choice of θ\theta\in\mathbb{R} in the preceding discussion was arbitrary. ∎
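The helper bound log(1+x) ≤ x^c/c used in the α=0 case above can be probed numerically as well (an informal check, not part of the proof):

```python
import numpy as np

# Check log(1 + x) <= x^c / c on a wide logarithmic grid, for
# several choices of 0 < c < 1.
x = np.geomspace(1e-9, 1e9, 1_000_001)
ok = all(np.all(np.log1p(x) <= x**c / c) for c in (0.1, 0.5, 0.9))
print(ok)
```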

Proof of Lemma 9.

Referring to the derivatives (67)–(68) in §D.3, we know that ρσ{\rho}^{\prime}_{\sigma} is measurable, and by the proof of Lemma 8, we know that ρ<\lVert{{\rho}^{\prime}}\rVert_{\infty}<\infty for all α1\alpha\leq 1. Thus, as long as 𝖫(h)\operatorname{\mathsf{L}}(h) is \mathcal{F}-measurable, we have that ρσ(𝖫(h)θ){\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta) is μ\mu-integrable. For the case of 1<α21<\alpha\leq 2, note that |ρσ(x)||x|/σ2\lvert{{\rho}^{\prime}_{\sigma}(x)}\rvert\leq\lvert{x}\rvert/\sigma^{2} holds, meaning that 𝐄μ|𝖫(h)|<\operatorname{\mathbf{E}}_{\mu}\lvert{\operatorname{\mathsf{L}}(h)}\rvert<\infty implies integrability. Similarly for the second derivatives, from the proof of Lemma 8, we see that ρ′′<\lVert{{\rho}^{\prime\prime}}\rVert_{\infty}<\infty for all α1-\infty\leq\alpha\leq 1, implying the μ\mu-integrability of ρσ′′(𝖫(h)θ){\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta).

The Leibniz integration property follows using a straightforward dominated convergence argument, which we give here for completeness. Letting (ak)(a_{k}) be any non-zero real sequence such that ak0a_{k}\to 0, we can write

ddθDρ(h;θ)\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta) =limkDρ(h;θ+ak)Dρ(h;θ)ak\displaystyle=\lim\limits_{k\to\infty}\frac{\operatorname{D}_{\rho}(h;\theta+a_{k})-\operatorname{D}_{\rho}(h;\theta)}{a_{k}}
=limk𝐄μ[ρσ(𝖫(h)(θ+ak))ρσ(𝖫(h)θ)ak].\displaystyle=\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}\left[\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+a_{k}))-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}}\right].

For notational convenience, let us denote the key sequence of functions by

fk . . =ρσ(𝖫(h)(θ+ak))ρσ(𝖫(h)θ)ak\displaystyle f_{k}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+a_{k}))-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}}

and note that fkf . . =ρσ(𝖫(h)θ)f_{k}\to f\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta) pointwise as kk\to\infty. We can then say the following: for all kk, we have that

|fk|sup0<c<1|ρσ(𝖫(h)(θ+cak))|g . . =|ρσ(𝖫(h)θ)|\displaystyle\lvert{f_{k}}\rvert\leq\sup_{0<c<1}\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-(\theta+ca_{k}))}\rvert\leq g\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta^{\prime})}\rvert

for an appropriate choice of θ\theta^{\prime}\in\mathbb{R}. The first inequality follows from the helper Lemma 14. We can always find an appropriate θ\theta^{\prime} because the sequence (ak)(a_{k}) is bounded and ρ{\rho}^{\prime} is eventually monotone, regardless of the choice of α\alpha. With the fact |fk|g\lvert{f_{k}}\rvert\leq g in hand, recall that we have already proved that 𝐄μg<\operatorname{\mathbf{E}}_{\mu}g<\infty under the assumptions we have made, and thus 𝐄μfk𝐄μf\operatorname{\mathbf{E}}_{\mu}f_{k}\to\operatorname{\mathbf{E}}_{\mu}f by dominated convergence (see for example Ash and Doléans-Dade [3, Thm. 1.6.9]). As such, we have

ddθDρ(h;θ)=limk𝐄μfk=𝐄μf=𝐄μρσ(𝖫(h)θ),\displaystyle\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}\theta}\operatorname{D}_{\rho}(h;\theta)=\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\operatorname{\mathbf{E}}_{\mu}f=-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta),

which is the desired Leibniz property for the first derivative. A completely analogous argument holds for the second derivative, yielding the desired result. ∎

Proof of Lemma 10.

From Lemma 9, we know that the map θDρ(h;θ)\theta\mapsto\operatorname{D}_{\rho}(h;\theta) is differentiable and thus continuous. Using continuity, taking any a<ba<b and constructing a closed interval [a,b][a,b], the Weierstrass extreme value theorem tells us that Dρ(h;θ)\operatorname{D}_{\rho}(h;\theta) achieves its maximum and minimum on [a,b][a,b]. Furthermore, note that ρ\rho taken from the Barron class (15) satisfies all the requirements of our helper Lemma 16, and thus implies Dρ(h;θ)sup{ρ(x):x}\operatorname{D}_{\rho}(h;\theta)\to\sup\{\rho(x):x\in\mathbb{R}\} as |θ|\lvert{\theta}\rvert\to\infty. We can thus always take the interval [a,b][a,b] wide enough that

θ[a,b]Dρ(h;θ)maxaxbDρ(h;x)minaxbDρ(h;x).\displaystyle\theta\notin[a,b]\implies\operatorname{D}_{\rho}(h;\theta)\geq\max_{a\leq x\leq b}\operatorname{D}_{\rho}(h;x)\geq\min_{a\leq x\leq b}\operatorname{D}_{\rho}(h;x).

This proves the existence of a minimizer of θDρ(h;θ)\theta\mapsto\operatorname{D}_{\rho}(h;\theta) on \mathbb{R}.

Next, considering the T-risk Rρ(h;θ,η)\operatorname{R}_{\rho}(h;\theta,\eta) and minimization with respect to θ\theta, since we are doing unconstrained optimization, any solution θρ(h;η)\theta_{\rho}(h;\eta) must satisfy η+dθDρ(h;θρ(h;η))=0\eta+\mathop{}\!\mathrm{d}_{\theta}\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta))=0, where dθ . . =d/dθ\mathop{}\!\mathrm{d}_{\theta}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\mathop{}\!\mathrm{d}/\mathop{}\!\mathrm{d}\theta. Using Lemma 9 again, this can be equivalently re-written as

𝐄μρσ(𝖫(h)θρ(h;η))=η.\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}(h)-\theta_{\rho}(h;\eta)\right)=\eta. (37)

When α>1\alpha>1, the derivative of the dispersion function has unbounded range, i.e., ρσ()={\rho}^{\prime}_{\sigma}(\mathbb{R})=\mathbb{R}. As such, an argument identical to that used in the proof of Lemma 16 implies that for any η\eta\in\mathbb{R}, we can always find a θρ(h;η)\theta_{\rho}(h;\eta)\in\mathbb{R} such that (37) holds, recalling that continuity follows via Lemma 9. Combining this with convexity gives us a valid solution. The special case of α=1\alpha=1 requires additional conditions, since from Lemma 8, we know that in this case |ρσ|1/σ\lvert{{\rho}^{\prime}_{\sigma}}\rvert\leq 1/\sigma, and thus by an analogous argument, a finite solution can be found whenever |η|<1/σ\lvert{\eta}\rvert<1/\sigma.

To prove uniqueness under 1α21\leq\alpha\leq 2, direct inspection of the second derivative in (68) shows us that ρσ′′(x)>0{\rho}^{\prime\prime}_{\sigma}(x)>0 on \mathbb{R} whenever we have

2(x/σ)2(1+(x/σ)2|α2|)<|α2|1(α/2).\displaystyle\frac{2(x/\sigma)^{2}}{\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)}<\frac{\lvert{\alpha-2}\rvert}{1-(\alpha/2)}.

Re-arranging the above inequality yields an equivalent condition of (x/σ)2(1α)<|α2|(x/\sigma)^{2}(1-\alpha)<\lvert{\alpha-2}\rvert, a condition which holds on \mathbb{R} if and only if 1α21\leq\alpha\leq 2. Since ρσ′′{\rho}^{\prime\prime}_{\sigma} is positive on \mathbb{R}, this implies that 𝐄μρσ′′(𝖫(h)θ)>0\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)>0 for all θ\theta\in\mathbb{R}. Using Lemma 9, we have that 𝐄μρσ′′(𝖫(h)θ)\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta) is equal to the second derivative of Dρ(h;θ)\operatorname{D}_{\rho}(h;\theta) with respect to θ\theta, which implies that θDρ(h;θ)\theta\mapsto\operatorname{D}_{\rho}(h;\theta) and θDρ(h;θ)+ηθ\theta\mapsto\operatorname{D}_{\rho}(h;\theta)+\eta\theta are strictly convex on \mathbb{R}, and thus their minimum must be unique (see for example Boyd and Vandenberghe [10, Sec. 3.1.4]). ∎
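The strict positivity used in the uniqueness argument can be spot-checked numerically (informal; α=1.5 and σ=1 are arbitrary choices):

```python
import numpy as np

# rho''_sigma via the A(x), B(x) factorization; for alpha = 1.5 the
# condition (x/sigma)^2 (1 - alpha) < |alpha - 2| holds everywhere, so
# the second derivative should be strictly positive on the whole grid.
alpha, sigma = 1.5, 1.0
x = np.linspace(-1e3, 1e3, 2_000_001)
u = (x / sigma) ** 2 / abs(alpha - 2)
A = (1 + u) ** (alpha / 2 - 1)
B = (1 - alpha / 2) / abs(alpha - 2) * 2 * (x / sigma) ** 2 / (1 + u)
ddprime = A * (1 - B) / sigma**2
print(ddprime.min() > 0)
```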

Proof of Lemma 5.

For random loss 𝖫\operatorname{\mathsf{L}}, using Lemma 9, first-order optimality conditions require

𝐄μρσ(𝖫Mρ(𝖫))=0,𝐄μρσ(𝖫θρ(𝖫;η))=η.\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})\right)=0,\qquad\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}-\theta_{\rho}(\operatorname{\mathsf{L}};\eta)\right)=\eta. (38)

If these conditions hold, then from direct inspection, the same conditions will clearly hold if we replace 𝖫\operatorname{\mathsf{L}} by 𝖫+c\operatorname{\mathsf{L}}+c, Mρ(𝖫)\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}) by Mρ(𝖫)+c\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})+c, and θρ(𝖫;η)\theta_{\rho}(\operatorname{\mathsf{L}};\eta) by θρ(𝖫;η)+c\theta_{\rho}(\operatorname{\mathsf{L}};\eta)+c. This implies both translation-invariance of the dispersions and the translation-equivariance of the optimal thresholds. Non-negativity follows trivially from the fact that ρ()0\rho(\cdot)\geq 0. Noting that ρ(x)>0\rho(x)>0 for all x0x\neq 0, we have that Dρ(𝖫;θ)=0\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta)=0 if and only if 𝖫=θ\operatorname{\mathsf{L}}=\theta almost surely (this fact follows from basic Lebesgue integration theory [3, Thm. 1.6.6]). Since Dρ(𝖫;Mρ(𝖫))Dρ(𝖫;θρ(𝖫;η))\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}))\leq\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta)) by the optimality of Mρ(𝖫)\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}), it follows that for any non-constant 𝖫\operatorname{\mathsf{L}}, we must have Dρ(𝖫;θρ(𝖫;η))>0\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))>0. Furthermore, from the optimality condition (38) for θρ(𝖫;η)\theta_{\rho}(\operatorname{\mathsf{L}};\eta), even when 𝖫\operatorname{\mathsf{L}} is constant, we must have Dρ(𝖫;θρ(𝖫;η))>0\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\theta_{\rho}(\operatorname{\mathsf{L}};\eta))>0 whenever η0\eta\neq 0, since ρσ(x)=0{\rho}^{\prime}_{\sigma}(x)=0 if and only if x=0x=0.
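The translation properties can also be illustrated empirically. Below is a minimal sketch for the α=2 case, where minimizing the empirical dispersion over θ amounts to taking the sample mean, so shifting the losses by c shifts the minimizing threshold by c (up to the grid resolution):

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.exponential(size=500)          # simulated sample of losses
theta_grid = np.linspace(-5.0, 10.0, 1_501)

def argmin_theta(losses):
    """Grid-minimize the empirical dispersion mean((L_i - theta)^2)."""
    disp = ((losses[None, :] - theta_grid[:, None]) ** 2).mean(axis=1)
    return theta_grid[np.argmin(disp)]

c = 2.5
t0, t1 = argmin_theta(L), argmin_theta(L + c)
print(t0, t1)  # t1 - t0 should be ~c, up to the grid spacing
```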

In the special case where 1α21\leq\alpha\leq 2, we have that ρσ′′{\rho}^{\prime\prime}_{\sigma} is positive on \mathbb{R} (see §D.3 and Fig. 8). This implies that ρσ\rho_{\sigma} is strictly convex, and ρσ{\rho}^{\prime}_{\sigma} is strictly increasing. Let 𝖫1𝖫2\operatorname{\mathsf{L}}_{1}\leq\operatorname{\mathsf{L}}_{2} almost surely, and suppose toward a contradiction that Mρ(𝖫1)>Mρ(𝖫2)\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})>\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2}). Using the optimality condition (38), the uniqueness of the solution via Lemma 10, and the aforementioned strict monotonicity of ρσ{\rho}^{\prime}_{\sigma}, we have

0=𝐄μρσ(𝖫1Mρ(𝖫1))<𝐄μρσ(𝖫1Mρ(𝖫2))𝐄μρσ(𝖫2Mρ(𝖫2))=0.\displaystyle 0=\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{1}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})\right)<\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{1}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})\right)\leq\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}\left(\operatorname{\mathsf{L}}_{2}-\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2})\right)=0.

This is a contradiction, and thus we must have Mρ(𝖫1)Mρ(𝖫2)\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{1})\leq\operatorname{M}_{\rho}(\operatorname{\mathsf{L}}_{2}). An identical argument using the exact same properties proves that θρ(𝖫1;η)θρ(𝖫2;η)\theta_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)\leq\theta_{\rho}(\operatorname{\mathsf{L}}_{2};\eta) also holds. Finally, to prove convexity, take any 𝖫1,𝖫2\operatorname{\mathsf{L}}_{1},\operatorname{\mathsf{L}}_{2}\in\mathcal{L}, θ1,θ2\theta_{1},\theta_{2}\in\mathbb{R}, and a(0,1)a\in(0,1), and note that

R¯ρ(a𝖫1+(1a)𝖫2;η)\displaystyle\underline{\operatorname{R}}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};\eta) Dρ(a𝖫1+(1a)𝖫2;aθ1+(1a)θ2)+η(aθ1+(1a)θ2)\displaystyle\leq\operatorname{D}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};a\theta_{1}+(1-a)\theta_{2})+\eta\left(a\theta_{1}+(1-a)\theta_{2}\right)
=𝐄μρσ(a(𝖫1θ1)+(1a)(𝖫2θ2))+η(aθ1+(1a)θ2)\displaystyle=\operatorname{\mathbf{E}}_{\mu}\rho_{\sigma}\left(a(\operatorname{\mathsf{L}}_{1}-\theta_{1})+(1-a)(\operatorname{\mathsf{L}}_{2}-\theta_{2})\right)+\eta\left(a\theta_{1}+(1-a)\theta_{2}\right)
a(Dρ(𝖫1;θ1)+ηθ1)+(1a)(Dρ(𝖫2;θ2)+ηθ2).\displaystyle\leq a\left(\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}_{1};\theta_{1})+\eta\theta_{1}\right)+(1-a)\left(\operatorname{D}_{\rho}(\operatorname{\mathsf{L}}_{2};\theta_{2})+\eta\theta_{2}\right).

The first inequality uses optimality of the threshold in the definition of R¯ρ\underline{\operatorname{R}}_{\rho}, whereas the second inequality uses the convexity of ρσ\rho_{\sigma}. Since the choices of θ1\theta_{1} and θ2\theta_{2} here were arbitrary, we can set θ1=θρ(𝖫1;η)\theta_{1}=\theta_{\rho}(\operatorname{\mathsf{L}}_{1};\eta) and θ2=θρ(𝖫2;η)\theta_{2}=\theta_{\rho}(\operatorname{\mathsf{L}}_{2};\eta) to obtain the desired inequality

R¯ρ(a𝖫1+(1a)𝖫2;η)aR¯ρ(𝖫1;η)+(1a)R¯ρ(𝖫2;η)\displaystyle\underline{\operatorname{R}}_{\rho}(a\operatorname{\mathsf{L}}_{1}+(1-a)\operatorname{\mathsf{L}}_{2};\eta)\leq a\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}}_{1};\eta)+(1-a)\underline{\operatorname{R}}_{\rho}(\operatorname{\mathsf{L}}_{2};\eta)

giving us convexity of the threshold risk. As a direct corollary, setting η=0\eta=0 yields the convexity result for 𝖫Dρ(𝖫;Mρ(𝖫))\operatorname{\mathsf{L}}\mapsto\operatorname{D}_{\rho}(\operatorname{\mathsf{L}};\operatorname{M}_{\rho}(\operatorname{\mathsf{L}})). ∎
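The convexity inequality just established can be spot-checked numerically. The sketch below again assumes the convex pseudo-Huber member (α=1) of the Barron class and uses empirical means in place of the expectation; all names are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
L1 = rng.exponential(1.0, 4000)          # two "loss" samples on a common sample space
L2 = rng.standard_normal(4000) ** 2
sigma, eta, a = 0.5, 0.2, 0.35

def rho_sigma(x):
    # pseudo-Huber (Barron alpha = 1), strictly convex
    return np.sqrt((x / sigma) ** 2 + 1.0) - 1.0

def threshold_risk(L):
    # empirical threshold risk: min over theta of eta*theta + mean rho_sigma(L - theta)
    f = lambda th: eta * th + rho_sigma(L - th).mean()
    lo, hi = -50.0, 50.0
    for _ in range(200):  # ternary search on a strictly convex objective
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        lo, hi = (lo, m2) if f(m1) < f(m2) else (m1, hi)
    return f(0.5 * (lo + hi))

lhs = threshold_risk(a * L1 + (1 - a) * L2)
rhs = a * threshold_risk(L1) + (1 - a) * threshold_risk(L2)
assert lhs <= rhs + 1e-9  # convexity of the threshold risk
```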

Proof of Lemma 2.

The crux of this result is an analogue to Lemma 9 regarding the differentials of Dρ(h;θ)\operatorname{D}_{\rho}(h;\theta), this time taken with respect to hh, rather than θ\theta. Fixing arbitrary g,hg,h\in\mathcal{H}, let us start by considering the following sequence of random variables:

fk . . =ρσ(𝖫(h+akg)θ)ρσ(𝖫(h)θ)ak\displaystyle f_{k}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h+a_{k}g)-\theta)-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a_{k}} (39)

where (ak)(a_{k}) is any sequence of real values such that ak0+a_{k}\to 0_{+} as kk\to\infty. Before getting into the details, let us unpack the differentiability assumption made on the base loss. Before random sampling, the map h𝖫(h)h\mapsto\operatorname{\mathsf{L}}(h) is of course a map from \mathcal{H} to the set of measurable functions {𝖫(h):h}\{\operatorname{\mathsf{L}}(h):h\in\mathcal{H}\}, but after sampling, there is no randomness and it is simply a map from \mathcal{H} to \mathbb{R}. Having sampled the random loss, the property we desire is that for each hh\in\mathcal{H}, there exists a continuous linear functional 𝖫(h):𝒰{\operatorname{\mathsf{L}}}^{\prime}(h):\mathcal{U}\to\mathbb{R} such that

limg0|𝖫(h+g)𝖫(h)𝖫(h)(g)|g=0.\displaystyle\lim\limits_{\lVert{g}\rVert\to 0}\frac{\lvert{\operatorname{\mathsf{L}}(h+g)-\operatorname{\mathsf{L}}(h)-{\operatorname{\mathsf{L}}}^{\prime}(h)(g)}\rvert}{\lVert{g}\rVert}=0. (40)

The differentiability condition in the lemma statement is simply that

μ{equality (40) holds}=1.\displaystyle\mu\{\,\text{equality (\ref{eqn:unbiased_1a}) holds}\,\}=1. (41)

On this “good” event, since the map xρσ(x)x\mapsto\rho_{\sigma}(x) is differentiable by definition, we have that the composition hρσ(𝖫(h)θ)h\mapsto\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta) is also differentiable for any choice of θ\theta\in\mathbb{R}, and a general chain rule can be applied to compute the differentials.181818See Penot, [44, Thm. 2.47] for this key fact, where “XX” is 𝒰\mathcal{U} here, and both “YY” and “ZZ” are \mathbb{R} here. In particular, we have a pointwise limit of

f . . =limkfk=ρσ(𝖫(h)θ)𝖫(h)(g)\displaystyle f\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lim\limits_{k\to\infty}f_{k}={\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h)(g) (42)

which also uses the fact that the Fréchet and Gateaux differentials are equal here.191919Luenberger, [39, §7.2, Prop. 2] Technically, it just remains to obtain conditions which imply 𝐄μfk𝐄μf\operatorname{\mathbf{E}}_{\mu}f_{k}\to\operatorname{\mathbf{E}}_{\mu}f. In pursuit of a μ\mu-integrable upper bound on the sequence (fk)(f_{k}), note that for large enough kk, we have

|fk|\displaystyle\lvert{f_{k}}\rvert 1akakgsup0<a<akρσ(𝖫(h+ag)θ)𝖫(h+ag)\displaystyle\leq\frac{1}{a_{k}}\lVert{a_{k}g}\rVert\sup_{0<a<a_{k}}\lVert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h+ag)-\theta){\operatorname{\mathsf{L}}}^{\prime}(h+ag)}\rVert
gsup{ρσ(𝖫(h0)θ)𝖫(h0):h0}\displaystyle\leq\lVert{g}\rVert\sup\left\{\lVert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{0})-\theta){\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert:h_{0}\in\mathcal{H}\right\}
gσ2sup{|𝖫(h0)θ|𝖫(h0):h0}.\displaystyle\leq\frac{\lVert{g}\rVert}{\sigma^{2}}\sup\left\{\lvert{\operatorname{\mathsf{L}}(h_{0})-\theta}\rvert\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert:h_{0}\in\mathcal{H}\right\}. (43)

The key to the first of the preceding inequalities is a generalized mean value theorem.202020Considering the proof of Lemma 14 due to Luenberger, [39, §7.3, Prop. 2], just generalize the one-dimensional part of the argument from the original interval [0,1][0,1] to the interval [0,ak][0,a_{k}] here. Both the first and second inequalities also use the fact that h+akgh+a_{k}g\in\mathcal{H} eventually. The final inequality uses the fact that |ρσ(x)|=|ρ(x/σ)|/σ|x|/σ2\lvert{{\rho}^{\prime}_{\sigma}(x)}\rvert=\lvert{{\rho}^{\prime}(x/\sigma)}\rvert/\sigma\leq\lvert{x}\rvert/\sigma^{2} for any choice of α2-\infty\leq\alpha\leq 2. This inequality suggests a natural condition of

𝐄μ[suph0𝖫(h0)𝖫(h0)]<\displaystyle\operatorname{\mathbf{E}}_{\mu}\left[\sup_{h_{0}\in\mathcal{H}}\lVert{\operatorname{\mathsf{L}}(h_{0}){\operatorname{\mathsf{L}}}^{\prime}(h_{0})}\rVert\right]<\infty (44)

under which we can apply a standard dominated convergence argument.212121See for example Ash and Doléans-Dade, [3, Thm. 1.6.9]. If (43) holds for say all kk0k\geq k_{0}, then we can just bound |fk|\lvert{f_{k}}\rvert by the greater of maxjk0|fj|\max_{j\leq k_{0}}\lvert{f_{j}}\rvert (clearly μ\mu-integrable) and the right-hand side of (43). In particular, the key implication is that

(44)limk𝐄μfk=𝐄μf.\displaystyle\text{(\ref{eqn:unbiased_2})}\implies\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\operatorname{\mathbf{E}}_{\mu}f. (45)
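The elementary bound |ρ′σ(x)| ≤ |x|/σ² invoked above is easy to confirm numerically for several values of α ≤ 2. The derivative formulas below follow the standard Barron parameterization, which we assume matches the paper's §D.3; this is an illustrative check, not the paper's code:

```python
import numpy as np

def barron_dprime(u, alpha):
    # derivative rho'(u) of the Barron dispersion function, for alpha <= 2
    if alpha == 2.0:
        return u
    if alpha == 0.0:
        return 2.0 * u / (u ** 2 + 2.0)
    if alpha == -np.inf:
        return u * np.exp(-0.5 * u ** 2)  # Welsch/Leclerc limit
    return u * (u ** 2 / abs(alpha - 2.0) + 1.0) ** (alpha / 2.0 - 1.0)

sigma = 0.7
x = np.linspace(-50.0, 50.0, 100001)
for alpha in (-np.inf, -2.0, 0.0, 1.0, 1.5, 2.0):
    lhs = np.abs(barron_dprime(x / sigma, alpha)) / sigma  # |rho'_sigma(x)|
    assert np.all(lhs <= np.abs(x) / sigma ** 2 + 1e-12)
```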

Since we have

limk𝐄μfk=lima0+𝐄μ[ρσ(𝖫(h+ag)θ)ρσ(𝖫(h)θ)a]=Dρ(h;θ)(g),\displaystyle\lim\limits_{k\to\infty}\operatorname{\mathbf{E}}_{\mu}f_{k}=\lim\limits_{a\to 0_{+}}\operatorname{\mathbf{E}}_{\mu}\left[\frac{\rho_{\sigma}(\operatorname{\mathsf{L}}(h+ag)-\theta)-\rho_{\sigma}(\operatorname{\mathsf{L}}(h)-\theta)}{a}\right]={\operatorname{D}}^{\prime}_{\rho}(h;\theta)(g),

where Dρ(h;θ):𝒰{\operatorname{D}}^{\prime}_{\rho}(h;\theta):\mathcal{U}\to\mathbb{R} denotes the gradient of hDρ(h;θ)h\mapsto\operatorname{D}_{\rho}(h;\theta), we see that by applying the preceding argument (culminating in (45)) to the modified losses (16), we readily obtain the desired result. ∎

Proof of Theorem 3.

To begin, let us consider the smoothness of the objective hRρ(h;θ,η)h\mapsto\operatorname{R}_{\rho}(h;\theta,\eta) under the present assumptions. From Lemma 12 and the basic properties of the Barron class of dispersion functions (Lemma 8), it follows that this function is λ\lambda-smooth with coefficient

λ . . =λ5σ+λ2σ2𝐄μ𝖫,\displaystyle\lambda\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\frac{\lambda_{5}}{\sigma}+\frac{\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}, (46)

where λ5 . . =λ3+λ4\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}, and

λ3 . . =(λ2σ)[𝐄μ𝖫2+suph𝐄μ𝖫(h)],λ4 . . =λ1ρ.\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right],\qquad\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}.

From here, we can leverage the main argument of Cutkosky and Mehta, [14, Thm. 2], utilizing the smoothness property given by (46) above, and the Γ\Gamma-bound of (19). For completeness and transparency we include the key details here. First, note that if we define ϵt . . =GthRρ(ht;θ,η)\epsilon_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=G_{t}-\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta), ϵ^t . . =MthRρ(ht;θ,η)\widehat{\epsilon}_{t}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=M_{t}-\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta), and S(ht,ht+1) . . =hRρ(ht;θ,η)hRρ(ht+1;θ,η)S(h_{t},h_{t+1})\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\partial_{h}\operatorname{R}_{\rho}(h_{t};\theta,\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{t+1};\theta,\eta), our definitions imply that for each t1t\geq 1, we have

Mt+1=hRρ(ht+1;θ,η)+b(ϵ^t+S(ht,ht+1))+(1b)ϵt+1\displaystyle M_{t+1}=\partial_{h}\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)+b\left(\widehat{\epsilon}_{t}+S(h_{t},h_{t+1})\right)+(1-b)\epsilon_{t+1} (47)

Using the form (47), it follows immediately that

ϵ^t+1=b(ϵ^t+S(ht,ht+1))+(1b)ϵt+1\displaystyle\widehat{\epsilon}_{t+1}=b\left(\widehat{\epsilon}_{t}+S(h_{t},h_{t+1})\right)+(1-b)\epsilon_{t+1} (48)

again for each t1t\geq 1. By setting M0 . . =0M_{0}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=0 and h0 . . =h1h_{0}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=h_{1}, we trivially have

ϵ^0=hRρ(h0;θ,η)=hRρ(h1;θ,η)\displaystyle\widehat{\epsilon}_{0}=-\partial_{h}\operatorname{R}_{\rho}(h_{0};\theta,\eta)=-\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta,\eta)

and one can then easily check that (48) holds for all t0t\geq 0. Expanding the recursion of (48), we have

ϵ^t+1=(1b)k=0tbkϵtk+1+k=1t+1bkS(htk+1,htk+2)+bt+1ϵ^0.\displaystyle\widehat{\epsilon}_{t+1}=(1-b)\sum_{k=0}^{t}b^{k}\epsilon_{t-k+1}+\sum_{k=1}^{t+1}b^{k}S(h_{t-k+1},h_{t-k+2})+b^{t+1}\widehat{\epsilon}_{0}. (49)
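The closed-form expansion (49) can be confirmed by unrolling the recursion (48) directly on random stand-ins for the error and drift terms (a sanity-check sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
b, T, dim = 0.9, 30, 4
eps = rng.standard_normal((T + 2, dim))  # stand-ins for the noise terms epsilon_t
S = rng.standard_normal((T + 2, dim))    # stand-ins for the drift terms S(h_t, h_{t+1})
eps_hat0 = rng.standard_normal(dim)      # stand-in for the initial error eps-hat_0

# unroll the recursion: eps-hat_{t+1} = b*(eps-hat_t + S_t) + (1 - b)*eps_{t+1}
eps_hat = eps_hat0.copy()
for t in range(T + 1):
    eps_hat = b * (eps_hat + S[t]) + (1 - b) * eps[t + 1]

# closed-form expansion (49), evaluated at t = T
t = T
closed = (
    (1 - b) * sum(b ** k * eps[t - k + 1] for k in range(t + 1))
    + sum(b ** k * S[t - k + 1] for k in range(1, t + 2))
    + b ** (t + 1) * eps_hat0
)
assert np.allclose(eps_hat, closed)
```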

We take the summands of (49) one at a time. Since the stochastic gradients are Γ\Gamma-bounded, we have that bkϵtk+12Γb^{k}\lVert{\epsilon_{t-k+1}}\rVert\leq 2\Gamma for all 0kt0\leq k\leq t. Furthermore, we have 𝐄μϵtk+1=0\operatorname{\mathbf{E}}_{\mu}\epsilon_{t-k+1}=0 (Lemma 2) and 𝐄μ(bkϵtk+1)2(2bkΓ)2\operatorname{\mathbf{E}}_{\mu}(b^{k}\lVert{\epsilon_{t-k+1}}\rVert)^{2}\leq(2b^{k}\Gamma)^{2} for each kk. These bounds can be passed to standard concentration inequalities for martingales on Banach spaces [14, Lem. 14], which, using the smoothness property of \mathcal{H} that we assumed, tell us that with probability no less than 1δ1-\delta, we have

k=0tbkϵtk+110Γmax{1,log(3δ1)}+8Γmax{1,log(3δ1)}k=0tb2k.\displaystyle\left\lVert{\sum_{k=0}^{t}b^{k}\epsilon_{t-k+1}}\right\rVert\leq 10\Gamma\max\left\{1,\log(3\delta^{-1})\right\}+8\Gamma\sqrt{\max\left\{1,\log(3\delta^{-1})\right\}\sum_{k=0}^{t}b^{2k}}. (50)

Moving on to the second term of (49), note that using λ\lambda-smoothness of the risk function with coefficient λ\lambda given by (46), along with the definition of the update procedure (20)–(21), we have

S(ht,ht+1)λhtht+1=λatM~t=λat.\displaystyle\lVert{S(h_{t},h_{t+1})}\rVert\leq\lambda\lVert{h_{t}-h_{t+1}}\rVert=\lambda a_{t}\lVert{\widetilde{M}_{t}}\rVert=\lambda a_{t}.

This implies that using a constant step size at=aa_{t}=a, we can control the sum as

k=1t+1bkS(htk+1,htk+2)λk=1t+1atk+1bkaλ1b.\displaystyle\left\lVert{\sum_{k=1}^{t+1}b^{k}S(h_{t-k+1},h_{t-k+2})}\right\rVert\leq\lambda\sum_{k=1}^{t+1}a_{t-k+1}b^{k}\leq\frac{a\lambda}{1-b}. (51)

Finally, the third term of (49) is easily controlled as ϵ^0=hRρ(h1;θ,η)Γ\lVert{\widehat{\epsilon}_{0}}\rVert=\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta,\eta)}\rVert\leq\Gamma. Taking this bound along with (50) and (51), we see that ϵ^t+1\lVert{\widehat{\epsilon}_{t+1}}\rVert can be bounded above by

(1b)(10Γmax{1,log(3δ1)}+8Γmax{1,log(3δ1)}k=0tb2k)+aλ1b+bt+1Γ\displaystyle(1-b)\left(10\Gamma\max\left\{1,\log(3\delta^{-1})\right\}+8\Gamma\sqrt{\max\left\{1,\log(3\delta^{-1})\right\}\sum_{k=0}^{t}b^{2k}}\right)+\frac{a\lambda}{1-b}+b^{t+1}\Gamma (52)

on the high-probability event mentioned earlier. To make use of the bound (52), note that using the λ\lambda-smoothness of the losses, the update procedure used here can be shown [14, Lem. 1] to satisfy

k=1thRρ(hk;θ,η)Rρ(h1;θ,η)Rρ(ht+1;θ,η)a+2k=1tϵ^k+(λa2)t.\displaystyle\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{k};\theta,\eta)}\rVert\leq\frac{\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)}{a}+2\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert+\left(\frac{\lambda a}{2}\right)t. (53)

Using a union bound, we have that the bound of (52) holds for all ϵ^1,,ϵ^t\widehat{\epsilon}_{1},\ldots,\widehat{\epsilon}_{t} with probability no less than 1tδ1-t\delta. To get some clean bounds out of (52) and (53), first we loosen

(1b)k=0tb2k1b1b21b1b=1b\displaystyle(1-b)\sqrt{\sum_{k=0}^{t}b^{2k}}\leq\frac{1-b}{\sqrt{1-b^{2}}}\leq\frac{1-b}{\sqrt{1-b}}=\sqrt{1-b}

and note that 0<δ<10<\delta<1 and t1e/3t\geq 1\geq\mathrm{e}/3 implies log(3tδ1)1\log(3t\delta^{-1})\geq 1. This means that with probability no less than 1δ1-\delta, we have

2k=1tϵ^k(1b)20Γtlog(3tδ1)+16Γt(1b)log(3tδ1)+2aλt1b+2Γ1b.\displaystyle 2\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert\leq(1-b)20\Gamma t\log(3t\delta^{-1})+16\Gamma t\sqrt{(1-b)\log(3t\delta^{-1})}+\frac{2a\lambda t}{1-b}+\frac{2\Gamma}{1-b}.
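The geometric-sum loosening used in this step, (1−b)√(Σ b²ᵏ) ≤ √(1−b), is easy to confirm numerically (an illustrative check only):

```python
import numpy as np

# verify (1 - b) * sqrt(sum_{k=0}^t b^{2k}) <= sqrt(1 - b) over a grid of b and t
for b in (0.1, 0.5, 0.9, 0.99):
    for t in (1, 10, 1000):
        partial = (1.0 - b) * np.sqrt(sum(b ** (2 * k) for k in range(t + 1)))
        assert partial <= np.sqrt(1.0 - b) + 1e-12
```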

Dividing both sides by tt and plugging in the prescribed settings of aa and bb, we have

2tk=1tϵ^k20Γlog(3tδ1)t+16Γlog(3tδ1)t1/4+2λt1/4+2Γt.\displaystyle\frac{2}{t}\sum_{k=1}^{t}\lVert{\widehat{\epsilon}_{k}}\rVert\leq\frac{20\Gamma\log(3t\delta^{-1})}{\sqrt{t}}+\frac{16\Gamma\sqrt{\log(3t\delta^{-1})}}{t^{1/4}}+\frac{2\lambda}{t^{1/4}}+\frac{2\Gamma}{\sqrt{t}}. (54)

Taking this bound and our aa setting and applying it to (53), we obtain

1tk=1thRρ(hk;θ,η)Rρ(h1;θ,η)Rρ(ht+1;θ,η)t1/4+(RHS of (54))+λ2t3/4.\displaystyle\frac{1}{t}\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{k};\theta,\eta)}\rVert\leq\frac{\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)}{t^{1/4}}+(\text{RHS of (\ref{eqn:sgd_convergence_9})})+\frac{\lambda}{2t^{3/4}}. (55)

To clean this all up, we have

1tk=1thRρ(hk;θ,η)\displaystyle\frac{1}{t}\sum_{k=1}^{t}\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{k};\theta,\eta)}\rVert 1t1/4(Rρ(h1;θ,η)Rρ(ht+1;θ,η)+16Γlog(3tδ1)+2λ)\displaystyle\leq\frac{1}{t^{1/4}}\left(\operatorname{R}_{\rho}(h_{1};\theta,\eta)-\operatorname{R}_{\rho}(h_{t+1};\theta,\eta)+16\Gamma\sqrt{\log(3t\delta^{-1})}+2\lambda\right)
+1t(20Γlog(3tδ1)+2Γ)+λ2t3/4.\displaystyle\qquad+\frac{1}{\sqrt{t}}\left(20\Gamma\log(3t\delta^{-1})+2\Gamma\right)+\frac{\lambda}{2t^{3/4}}.

For readability, the proof uses a slightly looser choice of λ3\lambda_{3}, and the result is stated for TT iterations rather than tt. ∎

Proof of Corollary 4.

Let α=0\alpha=0, and note from §D.3 that we have

ρσ(x)=2xx2+2σ2.\displaystyle{\rho}^{\prime}_{\sigma}(x)=\frac{2x}{x^{2}+2\sigma^{2}}.

It thus follows from (17) that

h𝖫ρ(h;θ,η)=2(𝖫(h)θ)(𝖫(h)θ)2+2σ2𝖫(h).\displaystyle\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}{\operatorname{\mathsf{L}}}^{\prime}(h).

In the case of the quadratic loss with a linear model as assumed here, this becomes

h𝖫ρ(h;θ,η)\displaystyle\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta) =2(𝖫(h)θ)(𝖫(h)θ)2+2σ2(h(𝖷)𝖸)𝖷\displaystyle=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}\left(h(\mathsf{X})-\mathsf{Y}\right)\mathsf{X}
=2(𝖫(h)θ)2𝖫(h)(𝖫(h)θ)2+2σ2sign(h(𝖷)𝖸)𝖷.\displaystyle=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)\sqrt{2\operatorname{\mathsf{L}}(h)}}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}\operatorname{sign}(h(\mathsf{X})-\mathsf{Y})\mathsf{X}.

Regarding growth in 𝖷\mathsf{X}, note that since 𝖫(h)=𝒪(𝖷2)\operatorname{\mathsf{L}}(h)=\mathcal{O}(\lVert{\mathsf{X}}\rVert^{2}), both the numerator and the denominator are 𝒪(𝖷4)\mathcal{O}(\lVert{\mathsf{X}}\rVert^{4}). As for growth in 𝖫(h)\operatorname{\mathsf{L}}(h), which accounts for the random noise ε\varepsilon as well, the numerator is 𝒪(|𝖫|3/2)\mathcal{O}(\lvert{\operatorname{\mathsf{L}}}\rvert^{3/2}) whereas the denominator is 𝒪(|𝖫|2)\mathcal{O}(\lvert{\operatorname{\mathsf{L}}}\rvert^{2}); thus, even though 𝖷\mathsf{X} and ε\varepsilon can both be unbounded and heavy-tailed, the norm of h𝖫ρ(h;θ,η)\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta) remains bounded as the norms of the inputs, noise, and/or loss grow large. Trivially, since σ>0\sigma>0, the norm of h𝖫ρ(h;θ,η)\partial_{h}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta) also remains almost surely bounded in the limit 𝖫(h)θ\operatorname{\mathsf{L}}(h)\to\theta.
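This boundedness is easy to observe empirically. The sketch below (illustrative names; hypothetical heavy-tailed data distributions, not those of the paper's experiments) draws unbounded inputs and noise for the linear squared-loss model and checks that the α=0 modulated gradient norm stays finite and below the unmodulated gradient norm:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200000, 5
h = rng.standard_normal(d)   # candidate linear model (hypothetical)
w = rng.standard_normal(d)   # "true" linear model (hypothetical)
theta, sigma = 0.0, 1.0

# heavy-tailed inputs (symmetrized Pareto) and noise (Student-t, df = 2.1)
X = rng.pareto(1.5, size=(n, d)) * rng.choice([-1.0, 1.0], size=(n, d))
Y = X @ w + rng.standard_t(2.1, size=n)

resid = X @ h - Y
L = 0.5 * resid ** 2  # squared loss
coef = 2.0 * (L - theta) / ((L - theta) ** 2 + 2.0 * sigma ** 2)  # alpha = 0 modulation
grad_norm = np.abs(coef) * np.abs(resid) * np.linalg.norm(X, axis=1)
vanilla = np.abs(resid) * np.linalg.norm(X, axis=1)  # unmodulated gradient norm

assert np.isfinite(grad_norm).all()
assert grad_norm.max() < vanilla.max()  # the modulation tames the heavy tails
```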

Proceeding to the case of the logistic loss, an analogous argument yields the desired result. First note that under the linear model assumed here, the gradient with respect to any hjh_{j} takes the form

hj𝖫ρ(h;θ,η)=2(𝖫(h)θ)(𝖫(h)θ)2+2σ2(pj(h)𝖸~j)𝖷\displaystyle\partial_{h_{j}}\operatorname{\mathsf{L}}_{\rho}(h;\theta,\eta)=\frac{2(\operatorname{\mathsf{L}}(h)-\theta)}{(\operatorname{\mathsf{L}}(h)-\theta)^{2}+2\sigma^{2}}(p_{j}(h)-\widetilde{\mathsf{Y}}_{j})\mathsf{X}

where pj(h) . . =exp(hj(𝖷))/i=1kexp(hi(𝖷))p_{j}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\exp(h_{j}(\mathsf{X}))/\sum_{i=1}^{k}\exp(h_{i}(\mathsf{X})), i.e., the softmax transformation of the score assigned by hjh_{j}. By definition, the coefficients (pj(h)𝖸~j)(p_{j}(h)-\widetilde{\mathsf{Y}}_{j}) are bounded. Furthermore, using our linear model assumption, we have that 𝖫(h)=𝒪(𝖷)\operatorname{\mathsf{L}}(h)=\mathcal{O}(\lVert{\mathsf{X}}\rVert), and as such the numerator and denominator are both 𝒪(𝖷2)\mathcal{O}(\lVert{\mathsf{X}}\rVert^{2}), implying the desired boundedness.
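A short numerical sketch of the same point for the softmax model (hypothetical data and parameter names; not the paper's code): the coefficients p_j(h)−Ỹ_j are bounded by one regardless of the input scale, and the α=0 modulation factor keeps the resulting gradient bound finite:

```python
import numpy as np

rng = np.random.default_rng(4)
k, d, n = 3, 4, 100000
W = rng.standard_normal((k, d))  # hypothetical linear score functions h_j
X = rng.pareto(2.0, size=(n, d)) * rng.choice([-1.0, 1.0], size=(n, d))
y = rng.integers(0, k, size=n)
theta, sigma = 0.0, 1.0

scores = X @ W.T
scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
P = np.exp(scores)
P /= P.sum(axis=1, keepdims=True)
Y = np.eye(k)[y]  # one-hot labels

# softmax coefficients are bounded by one, whatever the input scale
assert np.all(np.abs(P - Y) <= 1.0)

# the alpha = 0 modulation factor keeps the per-class gradient bound finite
L = -np.log(P[np.arange(n), y] + 1e-300)  # cross-entropy (logistic) loss
coef = 2.0 * (L - theta) / ((L - theta) ** 2 + 2.0 * sigma ** 2)
grad_bound = np.abs(coef) * np.abs(P - Y).max(axis=1) * np.linalg.norm(X, axis=1)
assert np.isfinite(grad_bound).all()
```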

Taking the previous two paragraphs together, we have that the bound (19) is satisfied for a finite Γ\Gamma under α=0\alpha=0, even when the data is unbounded and potentially heavy-tailed. For α<0\alpha<0, the derivative of ρ\rho shrinks even faster, so the same result follows a fortiori from the α=0\alpha=0 case. Finally, the λ1\lambda_{1}-smoothness assumption in Theorem 3 follows from direct inspection of the forms of 𝖫(h){\operatorname{\mathsf{L}}}^{\prime}(h) given here for each loss, using our assumption of 𝖷2\lVert{\mathsf{X}}\rVert^{2} having bounded second moments. ∎

Proof of Proposition 7.

To begin, recall from §3 the notation Mρ(h) . . =argminθDρ(h;θ)\operatorname{M}_{\rho}(h)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{D}_{\rho}(h;\theta) for the M-locations and θρ(h;η) . . =argminθRρ(h;θ,η)\theta_{\rho}(h;\eta)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\operatorname*{arg\,min}_{\theta\in\mathbb{R}}\operatorname{R}_{\rho}(h;\theta,\eta) for the T-risk optimal thresholds. By Lemma 10, both Mρ(h)\operatorname{M}_{\rho}(h) and θρ(h;η)\theta_{\rho}(h;\eta) have unique solutions, and thus we overload this notation to represent the unique solution. By definition, we have

R¯ρ(h;η)=Rρ(h;θρ(h;η),η)\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta)=\operatorname{R}_{\rho}(h;\theta_{\rho}(h;\eta),\eta) ηMρ(h)+Dρ(h;Mρ(h))\displaystyle\leq\eta\operatorname{M}_{\rho}(h)+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))
=ηR(h)+η(Mρ(h)R(h))+Dρ(h;Mρ(h)).\displaystyle=\eta\operatorname{R}(h)+\eta(\operatorname{M}_{\rho}(h)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h)).

Similarly, we can obtain a lower bound using the optimality of Mρ(h)\operatorname{M}_{\rho}(h) and θρ(h;η)\theta_{\rho}(h;\eta) as

R¯ρ(h;η)\displaystyle\underline{\operatorname{R}}_{\rho}(h;\eta) =ηθρ(h;η)+Dρ(h;θρ(h;η))\displaystyle=\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\theta_{\rho}(h;\eta))
ηθρ(h;η)+Dρ(h;Mρ(h))\displaystyle\geq\eta\theta_{\rho}(h;\eta)+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))
=ηR(h)+η(θρ(h;η)R(h))+Dρ(h;Mρ(h)).\displaystyle=\eta\operatorname{R}(h)+\eta(\theta_{\rho}(h;\eta)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h)).

Taking these two bounds together, we have

η(θρ(h;η)R(h))+Dρ(h;Mρ(h))R¯ρ(h;η)ηR(h)η(Mρ(h)R(h))+Dρ(h;Mρ(h))\displaystyle\eta(\theta_{\rho}(h;\eta)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h))\leq\underline{\operatorname{R}}_{\rho}(h;\eta)-\eta\operatorname{R}(h)\leq\eta(\operatorname{M}_{\rho}(h)-\operatorname{R}(h))+\operatorname{D}_{\rho}(h;\operatorname{M}_{\rho}(h)) (56)

for any choice of η\eta\in\mathbb{R}. The bounds in (56) are stated for ideal risk quantities for the true distribution under μ\mu, but an identical argument holds if we replace μ\mu by the empirical measure induced by an iid sample 𝖫1,,𝖫n\operatorname{\mathsf{L}}_{1},\ldots,\operatorname{\mathsf{L}}_{n}. Writing this out explicitly, let D^ρ\widehat{\operatorname{D}}_{\rho}, R¯^ρ\underline{\widehat{\operatorname{R}}}_{\rho}, and R^\widehat{\operatorname{R}} denote the empirical analogues of Dρ\operatorname{D}_{\rho}, R¯ρ\underline{\operatorname{R}}_{\rho}, and R\operatorname{R}, and similarly let M^ρ\widehat{\operatorname{M}}_{\rho} and θ^ρ\widehat{\theta}_{\rho} be the empirical analogues of Mρ\operatorname{M}_{\rho} and θρ\theta_{\rho}. From the argument leading to (56), it follows that

η(θ^ρ(h;η)R^(h))+D^ρ(h;M^ρ(h))R¯^ρ(h;η)ηR^(h)η(M^ρ(h)R^(h))+D^ρ(h;M^ρ(h)).\displaystyle\eta(\widehat{\theta}_{\rho}(h;\eta)-\widehat{\operatorname{R}}(h))+\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h))\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)-\eta\widehat{\operatorname{R}}(h)\leq\eta(\widehat{\operatorname{M}}_{\rho}(h)-\widehat{\operatorname{R}}(h))+\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h)). (57)

Next, using the lower bound in (57), for any η0\eta\geq 0 we have that

ηR(h)\displaystyle\eta\operatorname{R}(h) =η(R^(h)+(R(h)R^(h)))\displaystyle=\eta\left(\widehat{\operatorname{R}}(h)+(\operatorname{R}(h)-\widehat{\operatorname{R}}(h))\right)
η(R^(h)+RR^)\displaystyle\leq\eta\left(\widehat{\operatorname{R}}(h)+\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}\right)
R¯^ρ(h;η)η(θ^ρ(h;η)R^(h))D^ρ(h;M^ρ(h))+ηRR^\displaystyle\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)-\eta(\widehat{\theta}_{\rho}(h;\eta)-\widehat{\operatorname{R}}(h))-\widehat{\operatorname{D}}_{\rho}(h;\widehat{\operatorname{M}}_{\rho}(h))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}
R¯^ρ(h;η)+η(R^(h)θ^ρ(h;η))+ηRR^.\displaystyle\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h;\eta)+\eta(\widehat{\operatorname{R}}(h)-\widehat{\theta}_{\rho}(h;\eta))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}. (58)

Letting h^\widehat{h} be a minimizer of R¯^ρ\underline{\widehat{\operatorname{R}}}_{\rho}, using the upper bound in (57) and any choice of hh^{\ast}, we have

R¯^ρ(h^;η)R¯^ρ(h;η)\displaystyle\underline{\widehat{\operatorname{R}}}_{\rho}(\widehat{h};\eta)\leq\underline{\widehat{\operatorname{R}}}_{\rho}(h^{\ast};\eta) ηR^(h)+η(M^ρ(h)R^(h))+D^ρ(h;M^ρ(h))\displaystyle\leq\eta\widehat{\operatorname{R}}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))
ηR(h)+η(M^ρ(h)R^(h))+D^ρ(h;M^ρ(h))+ηRR^.\displaystyle\leq\eta\operatorname{R}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}. (59)

Combining (58) and (59), we have that ηR(h^)\eta\operatorname{R}(\widehat{h}) is bounded above by

ηR(h)+η(M^ρ(h)θρ(h^;η))+η(R^(h^)R^(h))+D^ρ(h;M^ρ(h))+2ηRR^.\displaystyle\eta\operatorname{R}(h^{\ast})+\eta(\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\theta_{\rho}(\widehat{h};\eta))+\eta(\widehat{\operatorname{R}}(\widehat{h})-\widehat{\operatorname{R}}(h^{\ast}))+\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+2\eta\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}. (60)

Some elementary manipulations let us bound the key differences in (60) as

M^ρ(h)θρ(h^;η)+R^(h^)R^(h)2M^ρR+2RR^+M^ρθρ,\displaystyle\widehat{\operatorname{M}}_{\rho}(h^{\ast})-\theta_{\rho}(\widehat{h};\eta)+\widehat{\operatorname{R}}(\widehat{h})-\widehat{\operatorname{R}}(h^{\ast})\leq 2\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}+2\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}+\lVert{\widehat{\operatorname{M}}_{\rho}-\theta_{\rho}}\rVert_{\mathcal{H}},

and thus dividing by η>0\eta>0, we end up with a final bound taking the form

R(h^)R(h)+M^ρθρ+2M^ρR+1ηD^ρ(h;M^ρ(h))+4RR^.\displaystyle\operatorname{R}(\widehat{h})\leq\operatorname{R}(h^{\ast})+\lVert{\widehat{\operatorname{M}}_{\rho}-\theta_{\rho}}\rVert_{\mathcal{H}}+2\lVert{\widehat{\operatorname{M}}_{\rho}-\operatorname{R}}\rVert_{\mathcal{H}}+\frac{1}{\eta}\widehat{\operatorname{D}}_{\rho}(h^{\ast};\widehat{\operatorname{M}}_{\rho}(h^{\ast}))+4\lVert{\operatorname{R}-\widehat{\operatorname{R}}}\rVert_{\mathcal{H}}.

The desired result is just the special case where hh^{\ast} is set to the expected loss minimizer. ∎
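The sandwich bound (56) at the heart of this proof can be checked empirically by computing all quantities under an empirical measure. The sketch below assumes the pseudo-Huber member (α=1) of the Barron class and a heavy-tailed lognormal loss sample (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.lognormal(0.0, 1.0, 100000)  # heavy-tailed loss sample
sigma, eta = 1.0, 0.4

def rho_sigma(x):
    # pseudo-Huber (Barron alpha = 1)
    return np.sqrt((x / sigma) ** 2 + 1.0) - 1.0

def argmin_scalar(f, lo=-50.0, hi=50.0, iters=200):
    # ternary search on a strictly convex scalar objective
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        lo, hi = (lo, m2) if f(m1) < f(m2) else (m1, hi)
    return 0.5 * (lo + hi)

R = L.mean()                                                           # mean risk
M = argmin_scalar(lambda th: rho_sigma(L - th).mean())                 # M-location
theta = argmin_scalar(lambda th: eta * th + rho_sigma(L - th).mean())  # optimal threshold
D_M = rho_sigma(L - M).mean()                                          # minimal dispersion
R_thr = eta * theta + rho_sigma(L - theta).mean()                      # threshold risk

assert eta * (theta - R) + D_M <= R_thr - eta * R + 1e-9  # lower half of (56)
assert R_thr - eta * R <= eta * (M - R) + D_M + 1e-9      # upper half of (56)
```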

C.2 Smoothness computations (proof of Lemma 12)

Here we provide detailed computations for the smoothness coefficients used in Lemma 12. We assume here that the assumptions A1., A2., and A3. are satisfied. Starting with the difference of expected gradients, using Jensen’s inequality and the smoothness assumption A2., we have

𝐄μ[𝖫(h1)𝖫(h2)]\displaystyle\lVert{\operatorname{\mathbf{E}}_{\mu}\left[{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}\rVert 𝐄μ𝖫(h1)𝖫(h2)\displaystyle\leq\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert
λ1h1h2.\displaystyle\leq\lambda_{1}\lVert{h_{1}-h_{2}}\rVert. (61)

As discussed in §B.4, differences of gradients modulated by ρ{\rho}^{\prime} are slightly more complicated. In particular, recalling the equality (29), the norm of the difference

𝐄μρσ(𝖫(h1)θ1)𝖫(h1)𝐄μρσ(𝖫(h2)θ2)𝖫(h2)\displaystyle\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1}){\operatorname{\mathsf{L}}}^{\prime}(h_{1})-\operatorname{\mathbf{E}}_{\mu}{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2}){\operatorname{\mathsf{L}}}^{\prime}(h_{2}) (62)

can be bounded above by the sum of

𝐄μ𝖫(h1)|ρσ(𝖫(h1)θ1)ρσ(𝖫(h2)θ2)|\displaystyle\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{1})-\theta_{1})-{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})}\rvert (63)

and

𝐄μ|ρσ(𝖫(h2)θ2)|𝖫(h1)𝖫(h2).\displaystyle\operatorname{\mathbf{E}}_{\mu}\lvert{{\rho}^{\prime}_{\sigma}(\operatorname{\mathsf{L}}(h_{2})-\theta_{2})}\rvert\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert. (64)

We take up (63) and (64) one at a time. Starting with (63), from A3. we know that the dispersion derivative ρ{\rho}^{\prime} is λ2\lambda_{2}-Lipschitz, and thus we have

(63) (λ2σ)𝐄μ𝖫(h1)(|𝖫(h1)𝖫(h2)|+|θ1θ2|)\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\left(\lvert{\operatorname{\mathsf{L}}(h_{1})-\operatorname{\mathsf{L}}(h_{2})}\rvert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)
(λ2σ)𝐄μ𝖫(h1)(h1h2sup0<c<1𝖫((1c)h1+ch2)+|θ1θ2|)\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})}\rVert\left(\lVert{h_{1}-h_{2}}\rVert\sup_{0<c<1}\lVert{{\operatorname{\mathsf{L}}}^{\prime}((1-c)h_{1}+ch_{2})}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)
(λ2σ)(h1h2𝐄μ𝖫2+|θ1θ2|suph𝐄μ𝖫(h))\displaystyle\leq\left(\frac{\lambda_{2}}{\sigma}\right)\left(\lVert{h_{1}-h_{2}}\rVert\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\lvert{\theta_{1}-\theta_{2}}\rvert\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right)
λ3(h1h2+|θ1θ2|).\displaystyle\leq\lambda_{3}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right). (65)

Here, the second inequality uses the helper Lemma 14 and our assumption of differentiability, while the third inequality uses our assumption on the expected squared norm of the gradient. We have set the Lipschitz coefficient λ3\lambda_{3} in (65) to be

λ3 . . =(λ2σ)[𝐄μ𝖫2+suph𝐄μ𝖫(h)].\displaystyle\lambda_{3}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\left(\frac{\lambda_{2}}{\sigma}\right)\left[\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}^{2}+\sup_{h\in\mathcal{H}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h)}\rVert\right].

This gives us a bound on (63). Moving on to (64), if ρ{\rho}^{\prime} is bounded on \mathbb{R}, then we have

(64) ρ𝐄μ𝖫(h1)𝖫(h2)\displaystyle\leq\lVert{{\rho}^{\prime}}\rVert_{\infty}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\operatorname{\mathsf{L}}}^{\prime}(h_{2})}\rVert
λ4h1h2\displaystyle\leq\lambda_{4}\lVert{h_{1}-h_{2}}\rVert (66)

with λ4 . . =λ1ρ\lambda_{4}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{1}\lVert{{\rho}^{\prime}}\rVert_{\infty}, recalling the bound (61). To summarize, we can use (65) and (66) to control (62) as follows:

(62)\displaystyle(\text{\ref{eqn:smoothness_limited_2}}) (63)+(64)\displaystyle\leq(\text{\ref{eqn:smoothness_limited_3}})+(\text{\ref{eqn:smoothness_limited_4}})
λ3(h1h2+|θ1θ2|)+λ4h1h2\displaystyle\leq\lambda_{3}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)+\lambda_{4}\lVert{h_{1}-h_{2}}\rVert
λ5(h1h2+|θ1θ2|)\displaystyle\leq\lambda_{5}\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)

where λ5 . . =λ3+λ4\lambda_{5}\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\lambda_{3}+\lambda_{4}. With these preparatory details organized, it is straightforward to obtain a Lipschitz property on the gradient of Rρ()\operatorname{R}_{\rho}(\cdot), as summarized in Lemma 12, and detailed in the proof below.

Proof of Lemma 12.

Using our upper bounds on (61) and (62), we have

\displaystyle\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rVert
\displaystyle\leq\left(\frac{\eta}{\sigma}\right)\left\lVert{\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{1})-\theta_{1}}{\sigma}\right){\operatorname{\mathsf{L}}}^{\prime}(h_{1})-{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{2})-\theta_{2}}{\sigma}\right){\operatorname{\mathsf{L}}}^{\prime}(h_{2})\right]}\right\rVert
\displaystyle\leq\left(\frac{\eta\lambda_{5}}{\sigma}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right).

Next, let us look at the partial derivative taken with respect to the threshold parameter \theta. To bound the absolute difference of these partial derivatives, we use the \lambda_{2}-Lipschitz property of {\rho}^{\prime} together with the generalized mean value theorem (Lemma 14), yielding

\displaystyle\lvert{\partial_{\theta}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{\theta}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rvert \displaystyle\leq\left(\frac{\eta}{\sigma}\right)\left\lvert\operatorname{\mathbf{E}}_{\mu}\left[{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{1})-\theta_{1}}{\sigma}\right)-{\rho}^{\prime}\left(\frac{\operatorname{\mathsf{L}}(h_{2})-\theta_{2}}{\sigma}\right)\right]\right\rvert
\displaystyle\leq\left(\frac{\eta\lambda_{2}}{\sigma^{2}}\right)\left(\operatorname{\mathbf{E}}_{\mu}\lvert{\operatorname{\mathsf{L}}(h_{1})-\operatorname{\mathsf{L}}(h_{2})}\rvert+\lvert{\theta_{1}-\theta_{2}}\rvert\right)
\displaystyle\leq\left(\frac{\eta\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right).

Taking the preceding upper bounds together, the gradient difference for \operatorname{R}_{\rho} can be bounded as

\displaystyle\lVert{{\operatorname{R}}^{\prime}_{\rho}(h_{1};\theta_{1},\eta)-{\operatorname{R}}^{\prime}_{\rho}(h_{2};\theta_{2},\eta)}\rVert
\displaystyle=\lVert{\partial_{h}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{h}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rVert+\lvert{\partial_{\theta}\operatorname{R}_{\rho}(h_{1};\theta_{1},\eta)-\partial_{\theta}\operatorname{R}_{\rho}(h_{2};\theta_{2},\eta)}\rvert
\displaystyle\leq\left(\frac{\eta\lambda_{5}}{\sigma}+\frac{\eta\lambda_{2}}{\sigma^{2}}\operatorname{\mathbf{E}}_{\mu}\lVert{{\operatorname{\mathsf{L}}}^{\prime}}\rVert_{\mathcal{H}}\right)\left(\lVert{h_{1}-h_{2}}\rVert+\lvert{\theta_{1}-\theta_{2}}\rvert\right),

noting that the initial equality follows from the fact that we are using the sum of norms for our product space norm here (see also Remark 13). These bounds on the gradient differences are precisely the desired result. ∎

Appendix D Additional technical facts

D.1 Lipschitz properties

Here we give a fundamental property of differentiable functions that generalizes the mean value theorem.

Lemma 14.

Let 𝒰\mathcal{U} and 𝒱\mathcal{V} be normed linear spaces, and let f:𝒰𝒱f:\mathcal{U}\to\mathcal{V} be Fréchet differentiable on an open set S𝒰S\subseteq\mathcal{U}. Taking any uSu\in S, we have

f(u+u)f(u)usup0<c<1f(u+cu)\displaystyle\lVert{f(u+u^{\prime})-f(u)}\rVert\leq\lVert{u^{\prime}}\rVert\sup_{0<c<1}\lVert{{f}^{\prime}(u+cu^{\prime})}\rVert

for any u𝒰u^{\prime}\in\mathcal{U} such that u+cuSu+cu^{\prime}\in S for all 0c10\leq c\leq 1.

Proof.

See Luenberger, [39, §7.3, Prop. 2]. ∎

Note that Lemma 14 has the following important corollary: bounded gradients imply Lipschitz continuity. In particular, if f(u)λ<\lVert{{f}^{\prime}(u)}\rVert\leq\lambda<\infty for all uSu\in S, then it follows immediately that ff is λ\lambda-Lipschitz on SS.

A closely related result goes in the other direction. Let f:\mathcal{U}\to\mkern 1.5mu\overline{\mkern-1.5mu\mathbb{R}\mkern-1.5mu}\mkern 1.5mu be convex and \lambda-Lipschitz. If f is sub-differentiable at a point u\in\mathcal{U}, then any sub-gradient g\in\partial f(u) satisfies

\displaystyle\langle g,u^{\prime}-u\rangle\leq f(u^{\prime})-f(u)\leq\lambda\lVert{u^{\prime}-u}\rVert

for all u^{\prime}\in\mathcal{U}. Taking u^{\prime}=u+g (assuming g\neq 0) yields \lVert{g}\rVert^{2}\leq\lambda\lVert{g}\rVert. As such, for convex, sub-differentiable functions, \lambda-Lipschitz continuity implies that all sub-gradients are bounded as \lVert{\partial f(u)}\rVert\leq\lambda.
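As an informal numerical illustration of these two directions (not part of the formal development), consider the one-dimensional convex function f(u)=\sqrt{1+u^{2}}, whose derivative f^{\prime}(u)=u/\sqrt{1+u^{2}} satisfies \sup_{u}\lvert f^{\prime}(u)\rvert=1. The Python sketch below, with all function names our own, checks both implications on random pairs of points:

```python
import math
import random

# f(u) = sqrt(1 + u^2) is convex and differentiable, with
# f'(u) = u / sqrt(1 + u^2), so sup_u |f'(u)| = 1.
def f(u):
    return math.sqrt(1.0 + u * u)

def df(u):
    return u / math.sqrt(1.0 + u * u)

random.seed(1)
for _ in range(1000):
    u = random.uniform(-50.0, 50.0)
    v = random.uniform(-50.0, 50.0)
    # Bounded derivative implies 1-Lipschitz continuity (Lemma 14 corollary).
    assert abs(f(u) - f(v)) <= abs(u - v) + 1e-9
    # Conversely, Lipschitz continuity bounds the (sub-)gradients.
    assert abs(df(u)) <= 1.0
```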

D.2 Convexity

Lemma 15.

Let function f:df:\mathbb{R}^{d}\to\mathbb{R} be twice continuously differentiable. Then ff is convex if and only if its Hessian is positive semi-definite, namely when

f′′(v)u,u0\displaystyle\langle{f}^{\prime\prime}(v)u,u\rangle\geq 0

for all u,vdu,v\in\mathbb{R}^{d}.

Proof.

See Nesterov, [43, Thm. 2.1.4]. ∎

D.3 Derivatives for the Barron class

Let ρ(;α)\rho(\cdot;\alpha) be defined according to (15). Here we compute derivatives of the map xρσ(x;α)x\mapsto\rho_{\sigma}(x;\alpha), using the shorthand notation ρσ(x;α) . . =ρ(x/σ;α)\rho_{\sigma}(x;\alpha)\mathrel{\vbox{\hbox{\scriptsize.}\hbox{\scriptsize.}}}=\rho(x/\sigma;\alpha). We denote the first derivative of ρσ(;α)\rho_{\sigma}(\cdot;\alpha) evaluated at xx\in\mathbb{R} by ρσ(x;α){\rho}^{\prime}_{\sigma}(x;\alpha), which is computed as

ρσ(x;α)={x/σ2,if α=22x/(x2+2σ2),if α=0(x/σ2)exp((x/σ)2/2),if α=xσ2(1+(x/σ)2|α2|)(α/2)1,otherwise.\displaystyle{\rho}^{\prime}_{\sigma}(x;\alpha)=\begin{cases}x/\sigma^{2},&\text{if }\alpha=2\\ 2x/(x^{2}+2\sigma^{2}),&\text{if }\alpha=0\\ (x/\sigma^{2})\exp\left(-(x/\sigma)^{2}/2\right),&\text{if }\alpha=-\infty\\ \frac{x}{\sigma^{2}}\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1},&\text{otherwise}.\end{cases} (67)

In the same way, letting ρσ′′(x;α){\rho}^{\prime\prime}_{\sigma}(x;\alpha) denote the second derivative of ρσ(;α)\rho_{\sigma}(\cdot;\alpha) evaluated at xx\in\mathbb{R}, this is computed as

ρσ′′(x;α)={1/σ2,if α=22x2+2σ2(12x2x2+2σ2),if α=0(1/σ2)exp((x/σ)2/2)(1(xσ)2),if α=1σ2(1+(x/σ)2|α2|)(α/2)1(11(α/2)|α2|2(x/σ)21+(x/σ)2/|α2|),otherwise.\displaystyle{\rho}^{\prime\prime}_{\sigma}(x;\alpha)=\begin{cases}1/\sigma^{2},&\text{if }\alpha=2\\ \frac{2}{x^{2}+2\sigma^{2}}\left(1-\frac{2x^{2}}{x^{2}+2\sigma^{2}}\right),&\text{if }\alpha=0\\ (1/\sigma^{2})\exp\left(-(x/\sigma)^{2}/2\right)\left(1-\left(\frac{x}{\sigma}\right)^{2}\right),&\text{if }\alpha=-\infty\\ \frac{1}{\sigma^{2}}\left(1+\frac{(x/\sigma)^{2}}{\lvert{\alpha-2}\rvert}\right)^{(\alpha/2)-1}\left(1-\frac{1-(\alpha/2)}{\lvert{\alpha-2}\rvert}\frac{2(x/\sigma)^{2}}{1+(x/\sigma)^{2}/\lvert{\alpha-2}\rvert}\right),&\text{otherwise}.\end{cases} (68)

We emphasize that ρσ(x;α){\rho}^{\prime}_{\sigma}(x;\alpha) and ρσ′′(x;α){\rho}^{\prime\prime}_{\sigma}(x;\alpha) are not equal to ρ(x/σ;α){\rho}^{\prime}(x/\sigma;\alpha) and ρ′′(x/σ;α){\rho}^{\prime\prime}(x/\sigma;\alpha), but by a simple application of the chain rule are easily seen to satisfy the relations

ρσ(x;α)=1σρ(x/σ;α),ρσ′′(x;α)=1σ2ρ′′(x/σ;α)\displaystyle{\rho}^{\prime}_{\sigma}(x;\alpha)=\frac{1}{\sigma}{\rho}^{\prime}(x/\sigma;\alpha),\qquad{\rho}^{\prime\prime}_{\sigma}(x;\alpha)=\frac{1}{\sigma^{2}}{\rho}^{\prime\prime}(x/\sigma;\alpha)

for any xx\in\mathbb{R}, σ>0\sigma>0, and α[,2]\alpha\in[-\infty,2].
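As a sanity check on (67) and the chain-rule relations above, one can compare the closed-form first derivative against central finite differences of \rho_{\sigma}(\cdot;\alpha) itself. The sketch below assumes a Barron-type form of \rho(\cdot;\alpha) that is consistent with the derivative formulas (our reconstruction of (15)); all function names are our own:

```python
import math

def rho(x, alpha):
    # Reconstructed Barron-type function (cf. (15)),
    # with special cases alpha in {2, 0, -inf}.
    if alpha == 2:
        return 0.5 * x * x
    if alpha == 0:
        return math.log1p(0.5 * x * x)
    if alpha == -math.inf:
        return 1.0 - math.exp(-0.5 * x * x)
    c = abs(alpha - 2.0)
    return (c / alpha) * ((1.0 + x * x / c) ** (alpha / 2.0) - 1.0)

def rho_sigma(x, alpha, sigma):
    # Shorthand rho_sigma(x; alpha) = rho(x / sigma; alpha).
    return rho(x / sigma, alpha)

def d_rho_sigma(x, alpha, sigma):
    # First derivative, following (67).
    if alpha == 2:
        return x / sigma**2
    if alpha == 0:
        return 2.0 * x / (x * x + 2.0 * sigma**2)
    if alpha == -math.inf:
        return (x / sigma**2) * math.exp(-0.5 * (x / sigma) ** 2)
    c = abs(alpha - 2.0)
    return (x / sigma**2) * (1.0 + (x / sigma) ** 2 / c) ** (alpha / 2.0 - 1.0)

# Central finite differences agree with (67) across all branches.
eps, sigma = 1e-6, 1.5
for alpha in [2.0, 0.0, -math.inf, 1.0, -2.0]:
    for x in [-2.5, -0.7, 0.0, 1.3]:
        fd = (rho_sigma(x + eps, alpha, sigma)
              - rho_sigma(x - eps, alpha, sigma)) / (2.0 * eps)
        assert abs(fd - d_rho_sigma(x, alpha, sigma)) < 1e-5
```

The same finite-difference strategy applied to d_rho_sigma verifies (68) as well.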

D.4 Elementary inequalities

The following elementary inequalities will be of use.

(1+xp)p(1+xq)q,x0,p>q>0\displaystyle\left(1+\frac{x}{p}\right)^{p}\geq\left(1+\frac{x}{q}\right)^{q},\qquad\forall\,x\geq 0,\quad p>q>0 (69)
(1+x)c1+cx1(c1)x,1x<1c1,c>1\displaystyle(1+x)^{c}\leq 1+\frac{cx}{1-(c-1)x},\qquad-1\leq x<\frac{1}{c-1},\quad c>1 (70)

The inequality below is sometimes referred to as Bernoulli’s inequality.

(1+x)a1+ax,x>1,a1.\displaystyle\left(1+x\right)^{a}\geq 1+ax,\qquad\forall\,x>-1,\quad a\geq 1. (71)
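These inequalities are standard; as an informal numerical spot check (with the parameter grids below chosen arbitrarily by us), one can compare both sides directly:

```python
# (69): (1 + x/p)^p is non-decreasing in p, for x >= 0 and p > q > 0.
for x in [0.0, 0.5, 3.0]:
    for p, q in [(2.0, 1.0), (5.0, 0.5), (1.0, 0.1)]:
        assert (1 + x / p) ** p >= (1 + x / q) ** q - 1e-12

# (70): (1+x)^c <= 1 + cx / (1 - (c-1)x), for -1 <= x < 1/(c-1) and c > 1.
for c in [1.5, 2.0, 3.0]:
    for x in [-0.9, -0.2, 0.0, 0.2]:
        if -1.0 <= x < 1.0 / (c - 1.0):
            assert (1 + x) ** c <= 1 + c * x / (1 - (c - 1) * x) + 1e-12

# (71), Bernoulli: (1+x)^a >= 1 + ax, for x > -1 and a >= 1.
for a in [1.0, 2.5, 7.0]:
    for x in [-0.99, -0.5, 0.0, 1.0, 10.0]:
        assert (1 + x) ** a >= 1 + a * x - 1e-12
```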

D.5 Expected dispersion is coercive

Lemma 16 (Expected dispersion is coercive).

Let f:+f:\mathbb{R}\to\mathbb{R}_{+} be any non-negative function which is even (i.e., f(x)=f(x)f(x)=f(-x) for all xx\in\mathbb{R}) and non-decreasing on +\mathbb{R}_{+}. Let 𝖷\mathsf{X} be any random variable such that 𝐄μf(𝖷θ)<\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)<\infty for all θ\theta\in\mathbb{R}. Then, we have

lim|θ|𝐄μf(𝖷θ)=limxf(x)\displaystyle\lim\limits_{\lvert\theta\rvert\to\infty}\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)=\lim\limits_{x\to\infty}f(x)

and note that this includes the divergent case where f(x)f(x)\to\infty as |x|\lvert x\rvert\to\infty.

Proof of Lemma 16.

By our assumptions, we have f(x)\geq 0 and f(-x)=f(x) for all x\in\mathbb{R}, and f(x_{1})\leq f(x_{2}) whenever 0\leq x_{1}\leq x_{2}. With these facts in place, note that for any choice of a\geq 0 and \theta such that \lvert\theta\rvert\geq a, we have

𝐄μf(𝖷θ)\displaystyle\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta) =𝐄μf(|θ𝖷|)\displaystyle=\operatorname{\mathbf{E}}_{\mu}f(\lvert\theta-\mathsf{X}\rvert)
𝐄μf(||θ||𝖷||)\displaystyle\geq\operatorname{\mathbf{E}}_{\mu}f\left(\lvert\lvert\theta\rvert-\lvert\mathsf{X}\rvert\rvert\right)
𝐄μI{|𝖷|a}f(||θ||𝖷||)\displaystyle\geq\operatorname{\mathbf{E}}_{\mu}\operatorname{I}_{\{\lvert\mathsf{X}\rvert\leq a\}}f\left(\lvert\lvert\theta\rvert-\lvert\mathsf{X}\rvert\rvert\right)
f(|θ|a)𝐏{|𝖷|a}.\displaystyle\geq f\left(\lvert\theta\rvert-a\right)\operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}. (72)

For readability, let us write f(+\infty) for the limit of f(x) as x\to\infty. Trivially, since f is even and non-decreasing on \mathbb{R}_{+}, we know that \operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)\leq f(+\infty). On the other hand, taking \lvert\theta\rvert\to\infty in the preceding inequality (72), we obtain the lower bound \liminf_{\lvert\theta\rvert\to\infty}\operatorname{\mathbf{E}}_{\mu}f(\mathsf{X}-\theta)\geq f(+\infty)\operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}, valid for any a\geq 0. When f(+\infty)=\infty, the desired result is immediate. When f(+\infty)<\infty, simply note that \{\lvert\mathsf{X}\rvert\leq a\}\uparrow\Omega as a\uparrow\infty, and thus using the continuity of probability measures (all countably additive set functions on \sigma-fields satisfy such continuity properties [3, Thm. 1.2.7]), we have \operatorname{\mathbf{P}}\{\lvert\mathsf{X}\rvert\leq a\}\to 1 as a\to\infty. Thus, the lower bound can be taken arbitrarily close to f(+\infty), implying the desired result. ∎
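To make Lemma 16 concrete, here is a small Monte Carlo illustration (our own construction, not from the paper) using the bounded dispersion function f(x)=1-\exp(-x^{2}/2), so that f(+\infty)=1, and a standard normal \mathsf{X}:

```python
import math
import random

random.seed(0)

def f(x):
    # Non-negative, even, non-decreasing on [0, inf); f(+inf) = 1.
    return 1.0 - math.exp(-0.5 * x * x)

# Fixed sample standing in for the distribution of X.
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def expected_dispersion(theta):
    # Monte Carlo estimate of E f(X - theta).
    return sum(f(x - theta) for x in xs) / len(xs)

# As |theta| grows, E f(X - theta) approaches f(+inf) = 1 from below.
vals = [expected_dispersion(t) for t in (0.0, 2.0, 5.0, 10.0)]
assert all(v1 < v2 for v1, v2 in zip(vals, vals[1:]))
assert vals[-1] > 0.99
```

For this choice of f and \mathsf{X}, the expectation is available in closed form as 1-e^{-\theta^{2}/4}/\sqrt{2}, so the estimates can also be checked exactly.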