
Adversarial Surrogate Risk Bounds
for Binary Classification

Natalie Frank natalief@uw.edu
Department of Applied Mathematics
University of Washington
Abstract

A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training is one of the most popular techniques for training robust classifiers, and it involves minimizing an adversarial surrogate risk. Recent work characterized when a minimizing sequence of an adversarial surrogate risk is also a minimizing sequence of the adversarial classification risk for binary classification, a property known as adversarial consistency. However, these results do not address the rate at which the adversarial classification risk converges to its optimal value along such a sequence of functions that minimize the adversarial surrogate. This paper provides surrogate risk bounds that quantify this convergence rate. Additionally, we derive distribution-dependent surrogate risk bounds in the standard (non-adversarial) learning setting, which may be of independent interest.

1 Introduction

A central concern regarding sophisticated machine learning models is their susceptibility to adversarial attacks. Prior work (Biggio et al., 2013; Szegedy et al., 2013) demonstrated that imperceptible perturbations can derail the performance of neural nets. As such models are used in security-critical applications such as facial recognition (Xu et al., 2022) and medical imaging (Paschali et al., 2018), training robust models remains a central concern in machine learning.

In the standard classification setting, the classification risk is the proportion of incorrectly classified data. Rather than minimizing this quantity directly, which is a combinatorial optimization problem, typical machine learning algorithms perform gradient descent on a well-behaved alternative surrogate risk. If a sequence of functions that minimizes this surrogate risk also minimizes the classification risk, then the surrogate risk is referred to as consistent for that specific data distribution. In addition to consistency, one would hope that minimizing the surrogate risk would be an efficient method for minimizing the classification risk. The rate of this convergence can be bounded by surrogate risk bounds, which are functions that provide a bound on the excess classification risk in terms of the excess surrogate risk.

In the standard binary classification setting, consistency and surrogate risk bounds are well-studied topics (Bartlett et al., 2006; Lin, 2004; Steinwart, 2007; Zhang, 2004). On the other hand, fewer results are known about the adversarial setting. The adversarial classification risk incurs a penalty when a point can be perturbed into the opposite class. Similarly, adversarial surrogate risks involve computing the worst-case value (i.e. supremum) of a loss function over an $\epsilon$-ball. Frank & Niles-Weed (2024a) characterized which risks are consistent for all data distributions, and the corresponding losses are referred to as adversarially consistent. Unfortunately, no convex loss function can be adversarially consistent for all data distributions (Meunier et al., 2022). On the other hand, Frank (2025) showed that such situations are rather atypical: when the data distribution is absolutely continuous, a surrogate risk is adversarially consistent so long as the adversarial Bayes classifier satisfies a certain notion of uniqueness called uniqueness up to degeneracy. While these results characterize consistency, none describe convergence rates.

Our Contributions:

  • We prove a linear surrogate risk bound for adversarially consistent losses (Theorem 11).

  • If the ‘distribution of optimal attacks’ satisfies a bounded noise condition, we prove a linear surrogate risk bound, under mild conditions on the loss function (Theorems 11 and 12).

  • We prove a distribution dependent surrogate risk bound that applies whenever a loss is adversarially consistent for a data distribution (Theorem 13).

Notably, this last bullet applies to convex loss functions. Due to the consistency results in prior work (Frank, 2025; Frank & Niles-Weed, 2024a; Meunier et al., 2022), one cannot hope for distribution-independent surrogate bounds for non-adversarially consistent losses. To the best of the authors’ knowledge, this paper is the first to prove surrogate risk bounds for the risks most commonly used in adversarial training; see Section 6 for a comparison with prior work. Understanding the optimality of the bounds presented in this paper remains an open problem.

2 Background and Preliminaries

2.1 Surrogate Risks

This paper studies binary classification on $\mathbb{R}^{d}$ with labels $-1$ and $+1$. The measures ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ describe the probabilities of finding data with labels $-1$, $+1$, respectively, in a subset of $\mathbb{R}^{d}$. The classification risk of a set $A$ is the misclassification rate when points in $A$ are classified as $+1$ and points in $A^{C}$ are classified as $-1$:

$$R(A)=\int{\mathbf{1}}_{A^{C}}\,d{\mathbb{P}}_{1}+\int{\mathbf{1}}_{A}\,d{\mathbb{P}}_{0}$$

The minimal classification risk over all Borel sets is denoted $R_{*}$. As the derivative of an indicator function is zero wherever it is defined, the empirical version of this risk cannot be optimized with first-order descent methods. Consequently, common machine learning algorithms minimize a different quantity called a surrogate risk. The surrogate risk of a function $f$ is defined as

$$R_{\phi}(f)=\int\phi(f)\,d{\mathbb{P}}_{1}+\int\phi(-f)\,d{\mathbb{P}}_{0}.$$

In practice, the loss function $\phi$ is selected so that it has a well-behaved derivative. In this paper, we assume:

Assumption 1.

The loss $\phi$ is continuous, non-increasing, and $\lim_{\alpha\to\infty}\phi(\alpha)=0$.

The minimal surrogate risk over all Borel measurable functions is denoted $R_{\phi,*}$. After optimizing the surrogate risk, a classifier is obtained by thresholding the resulting $f$ at zero. Consequently, we define the classification error of a function by $R(f)=R(\{f>0\})$ or equivalently,

$$R(f)=\int{\mathbf{1}}_{f\leq 0}\,d{\mathbb{P}}_{1}+\int{\mathbf{1}}_{f>0}\,d{\mathbb{P}}_{0}.$$

It remains to verify that minimizing the surrogate risk $R_{\phi}$ will also minimize the classification risk $R$.

Definition 1.

The loss function $\phi$ is consistent for the distribution ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ if every minimizing sequence of $R_{\phi}$ is also a minimizing sequence of $R$ when the data is distributed according to ${\mathbb{P}}_{0}$ and ${\mathbb{P}}_{1}$. The loss function $\phi$ is consistent if it is consistent for all data distributions.

Prior work establishes conditions under which many common loss functions are consistent.

Theorem 1.

A convex loss $\phi$ is consistent iff it is differentiable at zero and $\phi^{\prime}(0)<0$.

See (Bartlett et al., 2006, Theorem 2). Furthermore, (Frank & Niles-Weed, 2024a, Proposition 3) establishes a condition that applies to non-convex losses:

Theorem 2.

If $\inf_{\alpha}\frac{1}{2}(\phi(\alpha)+\phi(-\alpha))<\phi(0)$, then the loss $\phi$ is consistent.

The $\rho$-margin loss $\phi_{\rho}(\alpha)=\min(1,\max(1-\alpha/\rho,0))$ and the shifted sigmoid loss $\phi_{\tau}(\alpha)=1/(1+\exp(\alpha-\tau))$, $\tau>0$, both satisfy this criterion. However, a convex loss $\phi$ cannot satisfy this condition:

$$\frac{1}{2}\left(\phi(\alpha)+\phi(-\alpha)\right)\geq\phi\left(\frac{1}{2}\alpha+\frac{1}{2}(-\alpha)\right)=\phi(0).$$ (1)
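As a quick sanity check of the criterion in Theorem 2, here is a minimal Python sketch (not part of the paper; the parameter values $\rho=\tau=1$ are arbitrary choices) that approximates $\inf_{\alpha}\frac{1}{2}(\phi(\alpha)+\phi(-\alpha))$ on a grid for the $\rho$-margin, shifted sigmoid, and hinge losses:

    import numpy as np

    # Numerical check of the consistency criterion from Theorem 2:
    # inf_alpha (phi(alpha) + phi(-alpha)) / 2  <  phi(0).
    rho, tau = 1.0, 1.0
    losses = {
        "rho-margin":      lambda a: np.minimum(1.0, np.maximum(1.0 - a / rho, 0.0)),
        "shifted sigmoid": lambda a: 1.0 / (1.0 + np.exp(a - tau)),
        "hinge (convex)":  lambda a: np.maximum(1.0 - a, 0.0),
    }
    alphas = np.linspace(-20.0, 20.0, 200001)
    for name, phi in losses.items():
        sym = 0.5 * (phi(alphas) + phi(-alphas))
        print(f"{name:16s} grid inf ~ {sym.min():.3f}   phi(0) = {phi(np.array(0.0)):.3f}")
    # The rho-margin and shifted sigmoid losses satisfy the strict inequality
    # (inf ~ 0.5 < 1 and ~0.5 < 0.731), while the hinge loss attains equality,
    # in agreement with Equation 1.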

In addition to consistency, understanding convergence rates is a key concern. Specifically, prior work (Bartlett et al., 2006; Zhang, 2004) establishes surrogate risk bounds of the form $\Psi(R(f)-R_{*})\leq R_{\phi}(f)-R_{\phi,*}$ for some function $\Psi$. This inequality bounds the convergence rate of $R(f)-R_{*}$ in terms of the convergence of $R_{\phi}(f)-R_{\phi,*}$.

The values $R_{*}$, $R_{\phi,*}$ can be expressed in terms of the data distribution by re-writing these quantities in terms of the total probability measure ${\mathbb{P}}={\mathbb{P}}_{0}+{\mathbb{P}}_{1}$ and the conditional probability of the label $+1$, given by $\eta(x)=d{\mathbb{P}}_{1}/d{\mathbb{P}}$. An equivalent formulation of the classification risk is

$$R(f)=\int C(\eta({\mathbf{x}}),f({\mathbf{x}}))\,d{\mathbb{P}}({\mathbf{x}})$$ (2)

with

$$C(\eta,\alpha)=\eta{\mathbf{1}}_{\alpha\leq 0}+(1-\eta){\mathbf{1}}_{\alpha>0},$$ (3)

and the minimal classification risk is found by minimizing the integrand of Equation 2 at each ${\mathbf{x}}$. Define

$$C^{*}(\eta)=\inf_{\alpha}C(\eta,\alpha)=\min(\eta,1-\eta),$$ (4)

then the minimal classification risk is

$$R_{*}=\int C^{*}(\eta({\mathbf{x}}))\,d{\mathbb{P}}({\mathbf{x}}).$$

Analogously, the surrogate risk in terms of $\eta$ and ${\mathbb{P}}$ is

$$R_{\phi}(f)=\int C_{\phi}(\eta({\mathbf{x}}),f({\mathbf{x}}))\,d{\mathbb{P}}$$ (5)

and the minimal surrogate risk is

$$R_{\phi,*}=\int C_{\phi}^{*}(\eta({\mathbf{x}}))\,d{\mathbb{P}}({\mathbf{x}})$$

with the conditional risk $C_{\phi}(\eta,\alpha)$ and minimal conditional risk $C_{\phi}^{*}(\eta)$ defined by

$$C_{\phi}(\eta,\alpha)=\eta\phi(\alpha)+(1-\eta)\phi(-\alpha),\quad C_{\phi}^{*}(\eta)=\inf_{\alpha}C_{\phi}(\eta,\alpha).$$ (6)

Notice that minimizers of $R_{\phi}$ may need to be $\overline{\mathbb{R}}$-valued: consider the exponential loss $\phi(\alpha)=e^{-\alpha}$ and a distribution with $\eta({\mathbf{x}})\equiv 1$. Then the only minimizer of $R_{\phi}$ would be $+\infty$.
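To make the definitions in Equation 6 concrete, here is a minimal numerical sketch (not from the paper) for the exponential loss, where $C_{\phi}^{*}(\eta)=2\sqrt{\eta(1-\eta)}$ is available in closed form:

    import numpy as np

    # Sketch: the conditional risk C_phi(eta, alpha) and its pointwise infimum
    # C_phi^*(eta) from Equation 6, for the exponential loss phi(a) = exp(-a).
    phi = lambda a: np.exp(-a)

    def C_phi(eta, alpha):
        return eta * phi(alpha) + (1.0 - eta) * phi(-alpha)

    alphas = np.linspace(-30.0, 30.0, 600001)
    for eta in [0.1, 0.3, 0.5, 0.9]:
        numeric = C_phi(eta, alphas).min()           # approximate inf over alpha
        closed = 2.0 * np.sqrt(eta * (1.0 - eta))    # known closed form for this loss
        print(f"eta={eta:.1f}  grid inf={numeric:.4f}  2*sqrt(eta(1-eta))={closed:.4f}")
    # For eta = 1 the infimum is only approached as alpha -> infinity, which is
    # why the text allows extended-real-valued minimizers.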

The consistency of $\phi$ can be fully characterized by the properties of the function $C_{\phi}^{*}(\eta)$.

Theorem 3.

A loss $\phi$ is consistent iff $C_{\phi}^{*}(\eta)<\phi(0)$ for all $\eta\neq 1/2$.

Surprisingly, this criterion has not appeared in prior work. See Appendix A for a proof.

In terms of the function $C_{\phi}^{*}$, Theorem 2 states that any loss $\phi$ with $C_{\phi}^{*}(1/2)<\phi(0)$ is consistent.

The function $C_{\phi}^{*}$ is a key component of surrogate risk bounds from prior work. Specifically, Bartlett et al. (2006) show:

Theorem 4.

Let $\phi$ be any loss satisfying Assumption 1 with $C_{\phi}^{*}(1/2)=\phi(0)$ and define

$$\Psi(\theta)=\phi(0)-C_{\phi}^{*}\left(\frac{1+\theta}{2}\right).$$

Then

$$\Psi(C(\eta,f)-C^{*}(\eta))\leq C_{\phi}(\eta,f)-C_{\phi}^{*}(\eta)$$ (7)

and consequently

$$\Psi(R(f)-R_{*})\leq R_{\phi}(f)-R_{\phi,*}.$$ (8)

Equation 8 is a consequence of Equation 7 and Jensen’s inequality. Furthermore, a result of (Bartlett et al., 2006) implies a linear bound when $C_{\phi}^{*}(1/2)<\phi(0)$:

$$R(f)-R_{*}\leq\frac{1}{\phi(0)-C_{\phi}^{*}(1/2)}(R_{\phi}(f)-R_{\phi,*})$$ (9)

Furthermore, a distribution with zero classification error $R_{*}$ has the surrogate risk bound

$$R(f)-R_{*}\leq\frac{1}{\phi(0)}(R_{\phi}(f)-R_{\phi,*})$$ (10)

so long as $\phi(0)>0$. Such distributions are referred to as realizable. A proof of this result that transfers directly to the adversarial scenario is provided in Section B.1.

A distribution is said to satisfy Massart’s noise condition (Massart & Nédélec, 2006) if there is an $\alpha\in(0,1/2]$ such that $|\eta-1/2|\geq\alpha$ holds ${\mathbb{P}}$-a.e. Under this condition, Massart & Nédélec (2006) establish improved sample complexity guarantees. Furthermore, such distributions exhibit a linear surrogate loss bound as well. These linear bounds, the realizable bounds from Equation 10, and the linear bounds from Equation 9 are summarized in a single statement below.

Proposition 1.

Let $\eta$, ${\mathbb{P}}$ be a distribution that satisfies $|\eta-1/2|\geq\alpha$ ${\mathbb{P}}$-a.e. with a constant $\alpha\in[0,1/2]$, and let $\phi$ be a loss with $\phi(0)>C_{\phi}^{*}(1/2-\alpha)$. Then for all $|\eta-1/2|\geq\alpha$,

$$C(\eta,f)-C^{*}(\eta)\leq\frac{1}{\phi(0)-C_{\phi}^{*}(\frac{1}{2}-\alpha)}(C_{\phi}(\eta,f)-C_{\phi}^{*}(\eta))$$ (11)

and consequently

$$R(f)-R_{*}\leq\frac{1}{\phi(0)-C_{\phi}^{*}(\frac{1}{2}-\alpha)}(R_{\phi}(f)-R_{\phi,*})$$ (12)

When $\alpha\neq 0$, this surrogate risk bound proves a linear convergence rate under Massart’s noise condition. If $\alpha=0$ and $C_{\phi}^{*}(1/2)<\phi(0)$, then the bound in Equation 12 reduces to Equation 9, while if $\alpha=1/2$ then this bound reduces to Equation 10. See Section B.2 for a proof of this result. One of the main results of this paper is that Equation 12 generalizes to adversarial risks.

Note that the surrogate risk bound of Theorem 4 can be linear even for convex loss functions. For the hinge loss $\phi(\alpha)=\max(1-\alpha,0)$, the function $\Psi$ computes to $\Psi(\theta)=|\theta|$. Prior work (Frongillo & Waggoner, 2021, Theorem 1) observed a linear surrogate bound for piecewise linear losses: if $\phi$ is piecewise linear, then $C_{\phi}^{*}(\eta)$ is piecewise linear and Jensen’s inequality implies a linear surrogate bound so long as $\phi$ is consistent (due to Theorem 3). On the other hand, (Frongillo & Waggoner, 2021, Theorem 2) show that convex losses which are locally strictly convex and Lipschitz achieve at best a square-root surrogate risk rate.
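For a concrete instance of Theorem 4, the following sketch (my own illustration, using the exponential loss, for which $C_{\phi}^{*}(\eta)=2\sqrt{\eta(1-\eta)}$ and hence $\Psi(\theta)=1-\sqrt{1-\theta^{2}}$) spot-checks the pointwise inequality of Equation 7 on random samples of $(\eta,\alpha)$:

    import numpy as np

    # Sketch (assumption: exponential loss phi(a) = exp(-a), with
    # C_phi^*(eta) = 2*sqrt(eta*(1-eta)) and Psi(theta) = 1 - sqrt(1 - theta^2)).
    phi = lambda a: np.exp(-a)
    C_phi = lambda eta, a: eta * phi(a) + (1 - eta) * phi(-a)
    C_phi_star = lambda eta: 2 * np.sqrt(eta * (1 - eta))
    C = lambda eta, a: eta * (a <= 0) + (1 - eta) * (a > 0)
    C_star = lambda eta: np.minimum(eta, 1 - eta)
    Psi = lambda theta: 1 - np.sqrt(1 - theta**2)

    rng = np.random.default_rng(0)
    etas = rng.uniform(0.0, 1.0, 10000)
    alphas = rng.uniform(-5.0, 5.0, 10000)
    lhs = Psi(C(etas, alphas) - C_star(etas))          # left side of Equation 7
    rhs = C_phi(etas, alphas) - C_phi_star(etas)       # right side of Equation 7
    print("Equation 7 holds on all samples:", bool(np.all(lhs <= rhs + 1e-12)))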

Mahdavi et al. (2014) emphasize the importance of a linear convergence rate in a surrogate risk bound. Their paper studies the sample complexity of estimating a classifier with a surrogate risk. They note that typically convex surrogate losses exhibiting favorable sample complexity do not satisfy favorable surrogate risk bounds, due to the results of (Frongillo & Waggoner, 2021). Consequently, Proposition 1 hints that proving favorable sample complexity guarantees for learning with convex surrogate risks could require distributional assumptions, such as Massart’s noise condition.

2.2 Adversarial Risks

This paper extends the surrogate risk bounds of Equations 8, 10 and 12 to adversarial risks. The adversarial classification risk incurs a penalty of 1 whenever a point ${\mathbf{x}}$ can be perturbed into the opposite class. This penalty can be expressed in terms of suprema of indicator functions: the adversarial classification risk incurs a penalty of 1 whenever $\sup_{\|{\mathbf{x}}^{\prime}-{\mathbf{x}}\|\leq\epsilon}{\mathbf{1}}_{A}({\mathbf{x}}^{\prime})=1$ or $\sup_{\|{\mathbf{x}}^{\prime}-{\mathbf{x}}\|\leq\epsilon}{\mathbf{1}}_{A^{C}}({\mathbf{x}}^{\prime})=1$. Define

$$S_{\epsilon}(g)({\mathbf{x}})=\sup_{\|{\mathbf{x}}-{\mathbf{x}}^{\prime}\|\leq\epsilon}g({\mathbf{x}}^{\prime}).$$

The adversarial classification risk is then

$$R^{\epsilon}(A)=\int S_{\epsilon}({\mathbf{1}}_{A^{C}})\,d{\mathbb{P}}_{1}+\int S_{\epsilon}({\mathbf{1}}_{A})\,d{\mathbb{P}}_{0}$$

and the adversarial surrogate risk is

$$R_{\phi}^{\epsilon}(f)=\int S_{\epsilon}(\phi(f))\,d{\mathbb{P}}_{1}+\int S_{\epsilon}(\phi(-f))\,d{\mathbb{P}}_{0}.$$

(In order to define the risks $R_{\phi}^{\epsilon}$ and $R^{\epsilon}$, one must argue that $S_{\epsilon}(g)$ is measurable. Theorem 1 of (Frank & Niles-Weed, 2024b) proves that whenever $g$ is Borel, $S_{\epsilon}(g)$ is always measurable with respect to the completion of any Borel measure.)

A minimizer of the adversarial classification risk is called an adversarial Bayes classifier. After optimizing the surrogate risk, a classifier is obtained by thresholding the resulting function $f$ at zero. Consequently, the adversarial classification error of a function $f$ is defined as $R^{\epsilon}(f)=R^{\epsilon}(\{f>0\})$ or equivalently,

$$R^{\epsilon}(f)=\int S_{\epsilon}({\mathbf{1}}_{f\leq 0})\,d{\mathbb{P}}_{1}+\int S_{\epsilon}({\mathbf{1}}_{f>0})\,d{\mathbb{P}}_{0}.$$ (13)
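The supremum operation $S_{\epsilon}$ is easy to visualize numerically. The sketch below (not part of the paper; the one-dimensional grid, $\epsilon=0.1$, and the classifier $f(x)=x$ are illustrative choices) evaluates $S_{\epsilon}$ of the two indicator functions appearing in Equation 13 by brute force:

    import numpy as np

    # Sketch: S_eps(g)(x) = sup_{|x'-x| <= eps} g(x') on a one-dimensional grid.
    eps = 0.1
    xs = np.linspace(-2.0, 2.0, 4001)              # evaluation grid, spacing 1e-3

    def S_eps(g_vals):
        # For each grid point, take the max of g over the closed eps-ball around it.
        return np.array([g_vals[np.abs(xs - x) <= eps].max() for x in xs])

    f = lambda x: x                                # classify by the sign of x
    g_neg = (f(xs) <= 0).astype(float)             # 1_{f <= 0}
    g_pos = (f(xs) > 0).astype(float)              # 1_{f > 0}
    print("S_eps(1_{f<=0}) = 1 for all x <= eps/2 :", bool(S_eps(g_neg)[xs <= eps / 2].all()))
    print("S_eps(1_{f>0})  = 1 for all x >= -eps/2:", bool(S_eps(g_pos)[xs >= -eps / 2].all()))
    # Points within eps of the decision boundary {f = 0} are counted as errors
    # for both classes, which is exactly how the adversarial risk of Equation 13
    # penalizes perturbations.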

Just as in the standard case, one would hope that minimizing the adversarial surrogate risk would minimize the adversarial classification risk.

Definition 2.

We say a loss $\phi$ is adversarially consistent for the distribution ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ if any minimizing sequence of $R_{\phi}^{\epsilon}$ is also a minimizing sequence of $R^{\epsilon}$. We say that $\phi$ is adversarially consistent if it is adversarially consistent for every possible ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$.

Theorem 2 of (Frank & Niles-Weed, 2024a) characterizes the adversarially consistent losses:

Theorem 5.

The loss $\phi$ is adversarially consistent iff $C_{\phi}^{*}(1/2)<\phi(0)$.

Theorem 2 implies that every adversarially consistent loss is also consistent. Unfortunately, Equation 1 shows that no convex loss is adversarially consistent. However, the data distribution for which adversarial consistency fails presented in (Meunier et al., 2022) is fairly atypical: let ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ be the uniform distributions on $\overline{B_{\epsilon}({\mathbf{0}})}$. Then one can show that the function sequence

$$f_{n}=\begin{cases}\frac{1}{n}&{\mathbf{x}}\neq 0\\ -\frac{1}{n}&{\mathbf{x}}=0\end{cases}$$ (14)

minimizes $R_{\phi}^{\epsilon}$ but not $R^{\epsilon}$ whenever $C_{\phi}^{*}(1/2)=\phi(0)$ (see Proposition 2 of (Frank & Niles-Weed, 2024a)). A more refined analysis relates adversarial consistency for losses with $C_{\phi}^{*}(1/2)=\phi(0)$ to a notion of uniqueness of the adversarial Bayes classifier.
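A small numerical rendering of this counterexample may help. The sketch below is my own (it uses the hinge loss, for which $C_{\phi}^{*}(1/2)=\phi(0)$, works in one dimension, and normalizes ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ to each carry mass $1/2$ uniformly on $[-\epsilon,\epsilon]$); under these assumptions $f_{n}$ drives the adversarial surrogate risk to its infimum of 1 while its adversarial classification risk stays suboptimal:

    import numpy as np

    # Sketch of the counterexample around Equation 14 with the hinge loss.
    eps, n = 0.3, 100
    phi = lambda a: np.maximum(1.0 - a, 0.0)          # hinge loss
    grid = np.linspace(-2 * eps, 2 * eps, 2401)       # grid for the sup; contains 0
    supp = np.linspace(-eps, eps, 1201)               # support of P_0 and P_1
    w = 0.5 / len(supp)                               # uniform weights, mass 1/2 each

    f_n = np.where(np.isclose(grid, 0.0), -1.0 / n, 1.0 / n)

    def S_eps(vals):
        return np.array([vals[np.abs(grid - x) <= eps].max() for x in supp])

    R_phi_eps = (S_eps(phi(f_n)) * w).sum() + (S_eps(phi(-f_n)) * w).sum()
    R_eps_f   = (S_eps((f_n <= 0).astype(float)) * w).sum() \
              + (S_eps((f_n >  0).astype(float)) * w).sum()
    print(f"adversarial hinge risk of f_n : {R_phi_eps:.3f}  (infimum is 1 here)")
    print(f"adversarial 0-1 risk of f_n   : {R_eps_f:.3f}  (the set A = R achieves 0.5)")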

Definition 3.

Two adversarial Bayes classifiers $A_{1}$, $A_{2}$ are equivalent up to degeneracy if any set $A$ with $A_{1}\cap A_{2}\subset A\subset A_{1}\cup A_{2}$ is also an adversarial Bayes classifier. The adversarial Bayes classifier is unique up to degeneracy if any two adversarial Bayes classifiers are equivalent up to degeneracy.

Theorem 3.3 of (Frank, 2024) proves that whenever ${\mathbb{P}}$ is absolutely continuous with respect to Lebesgue measure, equivalence up to degeneracy is in fact an equivalence relation. Next, Theorem 4 of (Frank, 2025) relates this condition to the consistency of $\phi$.

Theorem 6.

Let $\phi$ be a loss with $C_{\phi}^{*}(1/2)=\phi(0)$ and assume that ${\mathbb{P}}$ is absolutely continuous with respect to Lebesgue measure. Then $\phi$ is adversarially consistent for the data distribution given by ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ iff the adversarial Bayes classifier is unique up to degeneracy.

Figure 1: Adversarial Bayes classifiers for the example considered in Equation 14. The adversarial Bayes classifiers in (a) and (b) are equivalent up to degeneracy and the adversarial Bayes classifiers in (c) and (d) are equivalent up to degeneracy, but the adversarial Bayes classifiers in (a) and (c) are not equivalent up to degeneracy.

The extension of Theorem 4 to the adversarial setting must reflect the consistency results of Theorems 5 and 6.

2.3 Minimax Theorems

A central tool in analyzing the adversarial consistency of surrogate risks is a family of minimax theorems. These results allow one to express adversarial risks in a ‘pointwise’ manner analogous to Equation 5. We will then combine this ‘pointwise’ expression with the proof of Theorem 4 to produce surrogate bounds for adversarial risks.

These minimax theorems utilize the $\infty$-Wasserstein ($W_{\infty}$) metric from optimal transport. Let ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$ be two finite positive measures with the same total mass. Informally, the measure ${\mathbb{Q}}^{\prime}$ is within $\epsilon$ of ${\mathbb{Q}}$ in the $W_{\infty}$ metric if one can achieve the measure ${\mathbb{Q}}^{\prime}$ by moving points of ${\mathbb{Q}}$ by at most $\epsilon$.

The $W_{\infty}$ metric is formally defined in terms of the set of couplings between ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$. A Borel measure $\gamma$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ is a coupling between ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$ if its first marginal is ${\mathbb{Q}}$ and its second marginal is ${\mathbb{Q}}^{\prime}$, or in other words, $\gamma(A\times\mathbb{R}^{d})={\mathbb{Q}}(A)$ and $\gamma(\mathbb{R}^{d}\times A)={\mathbb{Q}}^{\prime}(A)$ for all Borel sets $A$. Let $\Pi({\mathbb{Q}},{\mathbb{Q}}^{\prime})$ be the set of all couplings between the measures ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$. The $W_{\infty}$ distance between ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$ is then

$$W_{\infty}({\mathbb{Q}},{\mathbb{Q}}^{\prime})=\inf_{\gamma\in\Pi({\mathbb{Q}},{\mathbb{Q}}^{\prime})}\operatorname*{ess\,sup}_{({\mathbf{x}},{\mathbf{y}})\sim\gamma}\|{\mathbf{x}}-{\mathbf{y}}\|.$$ (15)

Theorem 2.6 of (Jylhä, 2014) proves that the infimum in Equation 15 is always attained. The $\epsilon$-ball around ${\mathbb{Q}}$ in the $W_{\infty}$ metric is

$${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}})=\{{\mathbb{Q}}^{\prime}:W_{\infty}({\mathbb{Q}}^{\prime},{\mathbb{Q}})\leq\epsilon\}.$$
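For intuition, the following sketch (not from the paper) works with empirical measures on the real line. It assumes the standard one-dimensional fact that, for two sets of $n$ equally weighted points, matching the sorted samples is an optimal coupling, so that $W_{\infty}$ is the largest gap between sorted samples; it then checks that perturbing every point by at most $\epsilon$ keeps the measure inside ${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}})$:

    import numpy as np

    # Sketch: W_infinity between two equally weighted empirical measures on R,
    # computed via the sorted (monotone) matching.
    rng = np.random.default_rng(1)
    eps = 0.25
    q = rng.normal(0.0, 1.0, size=500)
    q_shifted = q + rng.uniform(-eps, eps, size=500)   # an adversarial perturbation

    def w_inf_1d(a, b):
        # maximal displacement under the sorted matching
        return np.abs(np.sort(a) - np.sort(b)).max()

    print("W_inf(Q, perturbed Q) =", round(w_inf_1d(q, q_shifted), 3), "<= eps =", eps)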

The minimax theorem below will relate the adversarial risks $R_{\phi}^{\epsilon}$, $R^{\epsilon}$ to dual problems in which an adversary seeks to maximize some dual quantity over Wasserstein-$\infty$ balls. Specifically, one can show:

Lemma 1.

Let $g$ be a Borel function. Let $\gamma$ be a coupling between the measures ${\mathbb{Q}}$ and ${\mathbb{Q}}^{\prime}$ supported on $\Delta_{\epsilon}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):\|{\mathbf{x}}-{\mathbf{x}}^{\prime}\|\leq\epsilon\}$. Then $S_{\epsilon}(g)({\mathbf{x}})\geq g({\mathbf{x}}^{\prime})$ $\gamma$-a.e. and consequently

$$\int S_{\epsilon}(g)\,d{\mathbb{Q}}\geq\sup_{{\mathbb{Q}}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}})}\int g\,d{\mathbb{Q}}^{\prime}.$$

See Appendix C for a proof. Thus, applying Lemma 1, the quantity $\inf_{A}R^{\epsilon}(A)$ can be lower bounded by an infimum followed by a supremum. Is it possible to swap this infimum and supremum? Pydi & Jog (2021) answer this question in the affirmative. Let $C^{*}(\eta)$ be as defined in Equation 4 and let

$$\bar{R}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})=\inf_{A\text{ Borel}}\int{\mathbf{1}}_{A^{C}}\,d{\mathbb{P}}_{1}^{\prime}+\int{\mathbf{1}}_{A}\,d{\mathbb{P}}_{0}^{\prime}=\int C^{*}\left(\frac{d{\mathbb{P}}_{1}^{\prime}}{d\left({\mathbb{P}}_{1}^{\prime}+{\mathbb{P}}_{0}^{\prime}\right)}\right)d\left({\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime}\right).$$ (16)
Theorem 7.

Let $\bar{R}$ be as defined in Equation 16. Then

$$\inf_{A\text{ Borel}}R^{\epsilon}(A)=\sup_{\begin{subarray}{c}{\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\\ {\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\end{subarray}}\bar{R}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}).$$

Furthermore, equality is attained at some Borel measurable $A$, ${\mathbb{P}}_{0}^{*}$, and ${\mathbb{P}}_{1}^{*}$ with ${\mathbb{P}}_{0}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})$ and ${\mathbb{P}}_{1}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$.

See (Frank & Niles-Weed, 2024a, Theorem 1) for a proof of this statement. The maximizers ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ can be interpreted as optimal adversarial attacks (see the discussion following (Frank & Niles-Weed, 2024b, Theorem 7)). Frank (2024, Theorem 3.4) provides a criterion for uniqueness up to degeneracy in terms of dual maximizers.

Theorem 8.

The following are equivalent:

  A) The adversarial Bayes classifier is unique up to degeneracy.

  B) There are maximizers ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ of $\bar{R}$ for which ${\mathbb{P}}^{*}(\eta^{*}=1/2)=0$, where ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$ and $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$.

In other words, the adversarial Bayes classifier is unique up to degeneracy iff the region where both classes are equally probable has measure zero under some optimal adversarial attack. Theorems 6 and 8 relate adversarial consistency and the dual problem, suggesting that these optimal adversarial attacks ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ may appear in adversarial surrogate bounds.

Frank & Niles-Weed (2024b) prove a minimax principle analogous to Theorem 7 for the adversarial surrogate risk. Let $C_{\phi}^{*}(\eta)$ be as defined in Equation 6 and let

$$\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})=\inf_{f\text{ Borel}}\int\phi(f)\,d{\mathbb{P}}_{1}^{\prime}+\int\phi(-f)\,d{\mathbb{P}}_{0}^{\prime}=\int C_{\phi}^{*}\left(\frac{d{\mathbb{P}}_{1}^{\prime}}{d\left({\mathbb{P}}_{1}^{\prime}+{\mathbb{P}}_{0}^{\prime}\right)}\right)d\left({\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime}\right).$$ (17)
Theorem 9.

Let $\bar{R}_{\phi}$ be defined as in Equation 17. Then

$$\inf_{\begin{subarray}{c}f\text{ Borel,}\\ \overline{\mathbb{R}}\text{-valued}\end{subarray}}R_{\phi}^{\epsilon}(f)=\sup_{\begin{subarray}{c}{\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\\ {\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}).$$

Furthermore, equality is attained at some Borel measurable $f^{*}$, ${\mathbb{P}}_{0}^{*}$, and ${\mathbb{P}}_{1}^{*}$ with ${\mathbb{P}}_{0}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})$ and ${\mathbb{P}}_{1}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$.

Just as in the non-adversarial scenario, $R_{\phi}^{\epsilon}$ may not assume its infimum at an $\mathbb{R}$-valued function. However, (Frank & Niles-Weed, 2024a, Lemma 8) show that

$$\inf_{f\text{ $\overline{\mathbb{R}}$-valued}}R_{\phi}^{\epsilon}(f)=\inf_{f\text{ $\mathbb{R}$-valued}}R_{\phi}^{\epsilon}(f).$$

Lastly, one can show that maximizers of $\bar{R}_{\phi}$ are always maximizers of $\bar{R}$ as well. In other words, optimal attacks on minimizers of the adversarial surrogate $R_{\phi}^{\epsilon}$ are always optimal attacks on minimizers of the adversarial classification risk $R^{\epsilon}$ as well.

Theorem 10.

Consider maximizing the dual objectives $\bar{R}_{\phi}$ and $\bar{R}$ over ${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$.

  1) Any maximizer $({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$ of $\bar{R}_{\phi}$ over ${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$ must maximize $\bar{R}$ as well.

  2) If the adversarial Bayes classifier is unique up to degeneracy, then there are maximizers $({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$ of $\bar{R}_{\phi}$ with ${\mathbb{P}}^{*}(\eta^{*}=1/2)=0$, where ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$ and $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$.

See Appendix D for a proof of Item 1); Item 2) is shown in Theorems 5 and 7 of (Frank, 2025).

3 Main Results

Prior work has characterized when a loss $\phi$ is adversarially consistent with respect to a distribution ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$: Theorem 5 proves that a distribution-independent surrogate risk bound is only possible when $C_{\phi}^{*}(1/2)<\phi(0)$, while Theorem 6 suggests that a surrogate bound must depend on the marginal distribution of $\eta^{*}$ under ${\mathbb{P}}^{*}$, and furthermore, such a bound is only possible when ${\mathbb{P}}^{*}(\eta^{*}=1/2)=0$.

Compare these statements to Proposition 1: Theorems 5 and 6 imply that $\phi$ is adversarially consistent for ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ if $C_{\phi}^{*}(1/2)<\phi(0)$ or if there exist some maximizers of $\bar{R}$ that satisfy Massart’s noise condition. Alternatively, due to Theorem 10, one can equivalently assume that there are maximizers of $\bar{R}_{\phi}$ satisfying Massart’s noise condition. Our first bound extends Proposition 1 to the adversarial scenario, with the data distribution ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ replaced with the distribution of optimal adversarial attacks.

Theorem 11.

Let ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ be a distribution for which there are maximizers ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ of the dual problem $\bar{R}_{\phi}$ that satisfy $|\eta^{*}-1/2|\geq\alpha$ ${\mathbb{P}}^{*}$-a.e. for some constant $\alpha\in[0,1/2]$ with $C_{\phi}^{*}(1/2-\alpha)<\phi(0)$, where ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$, $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$. Then

$$R^{\epsilon}(f)-R_{*}^{\epsilon}\leq\frac{3+\sqrt{5}}{2}\cdot\frac{1}{\phi(0)-C_{\phi}^{*}(1/2-\alpha)}(R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon}).$$

When $C_{\phi}^{*}(1/2)<\phi(0)$, one can select $\alpha=0$ in Theorem 11 to produce a distribution-independent bound. The constant $(3+\sqrt{5})/2$ may be sub-optimal; in fact Theorem 4 of Frank (2025) proves that $R^{\epsilon}(f)-R^{\epsilon}_{*}\leq\frac{1}{2(\phi_{\rho}(0)-C_{\phi_{\rho}}^{*}(1/2))}(R_{\phi_{\rho}}^{\epsilon}(f)-R^{\epsilon}_{\phi_{\rho},*})$, where $\phi_{\rho}(\alpha)=\min(1,\max(0,1-\alpha/\rho))$ is the $\rho$-margin loss. Furthermore, the bound in Equation 10 extends directly to the adversarial setting.

Theorem 12.

Let $\phi$ be any loss with $\phi(0)>0$ satisfying Assumption 1. Then if $R_{*}^{\epsilon}=0$,

$$R^{\epsilon}(f)-R^{\epsilon}_{*}\leq\frac{1}{\phi(0)}\left(R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon}\right)$$

A distribution will have zero adversarial risk whenever the supports of ${\mathbb{P}}_{0}$ and ${\mathbb{P}}_{1}$ are separated by at least $2\epsilon$; see Example 1 and Figure 2(a) for an example. Zero adversarial classification risk corresponds to $\alpha=1/2$ in Massart’s noise condition.

In contrast, Theorem 11 states that if some distribution of optimal adversarial attacks satisfies Massart’s noise condition, then the excess adversarial surrogate risk is at worst a linear upper bound on the excess adversarial classification risk. However, if $C_{\phi}^{*}(1/2)=\phi(0)$, this constant approaches infinity as $\alpha\to 0$, reflecting the fact that adversarial consistency fails when the adversarial Bayes classifier is not unique up to degeneracy. When $\alpha\neq 1/2$, understanding what assumptions on ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ guarantee Massart’s noise condition for ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ is an open question. Example 4.6 of (Frank, 2024) demonstrates a distribution that satisfies Massart’s noise condition and yet the adversarial Bayes classifier is not unique up to degeneracy. Thus Massart’s noise condition for ${\mathbb{P}}_{0},{\mathbb{P}}_{1}$ does not guarantee Massart’s noise condition for ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$. See Example 2 and Figure 2(b) for an example where Theorem 11 applies with $\alpha>0$.

Finally, averaging bounds of the form in Theorem 11 over all values of $\eta^{*}$ produces a distribution-dependent surrogate bound, valid whenever the adversarial Bayes classifier is unique up to degeneracy. For a given function $f$, let the concave envelope of $f$ be the smallest concave function larger than $f$:

$$\operatorname{conc}(f)=\inf\{g:g\geq f\text{ on }\operatorname{dom}(f),\ g\text{ concave and upper semi-continuous}\}$$ (18)
Theorem 13.

Assume that $C_{\phi}^{*}(1/2)=\phi(0)$ and that the adversarial Bayes classifier is unique up to degeneracy. Let ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ be maximizers of $\bar{R}_{\phi}$ for which ${\mathbb{P}}^{*}(\eta^{*}=1/2)=0$, with ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$ and $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$. Let $H(z)=\operatorname{conc}({\mathbb{P}}^{*}(|\eta^{*}-1/2|\leq z))$. Let $\Psi$ be the function defined by Theorem 4 and let $\tilde{\Lambda}(z)=\Psi^{-1}(\min(\frac{z}{6},\phi(0)))$. Then

$$R^{\epsilon}(f)-R^{\epsilon}_{*}\leq\tilde{\Phi}(R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon})$$

with

$$\tilde{\Phi}=6\left(\operatorname{id}+\min\left(1,\sqrt{-2eH\ln 2H}\right)\right)\circ\tilde{\Lambda}$$

See Example 3 and Figure 2(c) for an example of calculating a distribution-dependent surrogate risk bound.

One can prove that the function $H$ is always continuous and satisfies $H(0)=0$, proving that this bound is non-vacuous (see Lemma 2 below). Further notice that $H\ln H$ approaches zero as $H\to 0$.

The map $\tilde{\Phi}$ combines two components: $\tilde{\Lambda}$, a modified version of $\Psi^{-1}$, and $H$, a modification of the cdf of $|\eta^{*}-1/2|$. The function $\tilde{\Lambda}$ is a scaled version of $\Psi^{-1}$, where $\Psi$ is the surrogate risk bound in the non-adversarial case of Theorem 4. The domain of $\Psi^{-1}$ is $[0,\phi(0)]$, and thus the role of the $\min$ in the definition of $\tilde{\Lambda}$ is to truncate the argument so that it fits into this domain. The factor of $1/6$ in this function appears to be an artifact of our proof; see Section 5 for further discussion. In contrast, the map $H$ translates the distribution of $\eta^{*}$ into a surrogate risk transformation. Compare with Theorem 6, which states that consistency fails if ${\mathbb{P}}^{*}(\eta^{*}=1/2)>0$; accordingly, the function $H$ becomes a poorer bound when more mass of $\eta^{*}$ is near $1/2$.

Examples

Figure 2: Distributions from Examples 1, 2 and 3 along with attacks that maximize the dual $\bar{R}_{\phi}$.

Below are three examples to which each of our three main theorems apply. These examples are all one-dimensional distributions, and we denote the pdfs of ${\mathbb{P}}_{0}$ and ${\mathbb{P}}_{1}$ by $p_{0}$ and $p_{1}$.

To start, a distribution for which the supports of ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ are more than $2\epsilon$ apart must have zero risk. Furthermore, if ${\mathbb{P}}$ is absolutely continuous with respect to Lebesgue measure and the supports of ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ are exactly $2\epsilon$ apart, then the adversarial classification risk will be zero (see for instance (Awasthi et al., 2023a, Lemma 4) or (Pydi & Jog, 2021, Lemma 4.3)).

Example 1 (When $R_{*}^{\epsilon}=0$).

Let $p_{0}$ and $p_{1}$ be defined by

$$p_{0}(x)=\begin{cases}1&\text{if }x\in[-1-\delta,-\delta]\\ 0&\text{otherwise}\end{cases}\qquad p_{1}(x)=\begin{cases}1&\text{if }x\in[\delta,1+\delta]\\ 0&\text{otherwise}\end{cases}$$

for some $\delta>0$. See Figure 2(a) for a depiction of $p_{0}$ and $p_{1}$. This distribution satisfies $R^{\epsilon}_{*}=0$ for all $\epsilon\leq\delta$ and thus the surrogate bound of Theorem 12 applies.

Examples 2 and 3 require computing maximizers of the dual $\bar{R}_{\phi}$. See Sections J.1 and J.2 for these calculations. The following example illustrates a distribution for which Massart’s noise condition can be verified for a distribution of optimal attacks.

Example 2 (Massart’s noise condition).

Let $\delta>0$ and let $p$ be the uniform density on $[-1-\delta,-\delta]\cup[\delta,1+\delta]$. Define $\eta$ by

$$\eta(x)=\begin{cases}\frac{1}{4}&\text{if }x\in[-1-\delta,-\delta]\\ \frac{3}{4}&\text{if }x\in[\delta,1+\delta]\end{cases}$$ (19)

see Figure 2(b) for a depiction of $p_{0}$ and $p_{1}$. For this distribution and $\epsilon\leq\delta$, the minimal surrogate and adversarial surrogate risks are always equal ($R_{\phi,*}=R^{\epsilon}_{\phi,*}$). This fact together with Theorem 9 implies that optimal attacks on this distribution are ${\mathbb{P}}_{1}^{*}={\mathbb{P}}_{1}$ and ${\mathbb{P}}_{0}^{*}={\mathbb{P}}_{0}$; see Section J.1 for details. Consequently, the distribution of optimal attacks ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ satisfies Massart’s noise condition with $\alpha=1/4$ and as a result the bounds of Theorem 11 apply.
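For instance, under the (illustrative, my own) choice of the hinge or exponential loss for this distribution, the constant in Theorem 11 can be evaluated directly; both losses satisfy the required condition $C_{\phi}^{*}(1/2-\alpha)<\phi(0)$ at $\alpha=1/4$:

    import numpy as np

    # Sketch: the constant of Theorem 11 for Example 2, where the optimal
    # attacks satisfy Massart's condition with alpha = 1/4 (eta* is 1/4 or 3/4).
    alpha = 0.25
    C_star = {
        "hinge":       lambda eta: 1.0 - abs(2 * eta - 1.0),          # C_phi^* for hinge
        "exponential": lambda eta: 2.0 * np.sqrt(eta * (1.0 - eta)),  # C_phi^* for exp loss
    }
    phi0 = {"hinge": 1.0, "exponential": 1.0}                         # phi(0) for each loss
    for name, cstar in C_star.items():
        K = (3 + np.sqrt(5)) / 2 / (phi0[name] - cstar(0.5 - alpha))
        print(f"{name:12s} Theorem 11 constant K_phi = {K:.2f}")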

Finally, the next example presents a case in which Massart’s noise condition fails for the distribution of optimal adversarial attacks, yet the adversarial Bayes classifier remains unique up to degeneracy. Theorem 13 continues to yield a valid surrogate bound.

Example 3 (Gaussian example).

Consider an even mixture of two Gaussians with equal variances and differing means, with $\mu_{1}>\mu_{0}$:

$$p_{0}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_{0})^{2}}{2\sigma^{2}}},\quad p_{1}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma^{2}}}$$

We further assume $\mu_{1}-\mu_{0}\leq\sqrt{2}\sigma$; see Figure 2(c) for a depiction. We will show that when $\mu_{1}-\mu_{0}>2\epsilon$, the optimal attacks ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ are Gaussians centered at $\mu_{0}+\epsilon$ and $\mu_{1}-\epsilon$; explicitly, the pdfs of ${\mathbb{P}}_{0}^{*}$ and ${\mathbb{P}}_{1}^{*}$ are given by

$$p_{0}^{*}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-(\mu_{0}+\epsilon))^{2}}{2\sigma^{2}}},\quad p_{1}^{*}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-(\mu_{1}-\epsilon))^{2}}{2\sigma^{2}}},$$ (20)

see Section J.2 for details. We verify that ${\mathbb{P}}_{0}^{*}$ and ${\mathbb{P}}_{1}^{*}$ are in fact optimal by finding a function $f^{*}$ for which $R_{\phi}^{\epsilon}(f^{*})=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$; the strong duality result in Theorem 9 then implies that ${\mathbb{P}}_{0}^{*}$ and ${\mathbb{P}}_{1}^{*}$ must maximize the dual $\bar{R}_{\phi}$, see Section J.2 for details.

However, when $\mu_{1}-\mu_{0}\leq\sqrt{2}\sigma$, the function $h(z)={\mathbb{P}}^{*}(|\eta^{*}-1/2|\leq z)$ is concave in $z$ for all $\epsilon<(\mu_{1}-\mu_{0})/2$ and consequently $h=H$; see Section J.3 for details. Unfortunately, $h$ is a rather unwieldy function. By comparing to the linear approximation at zero, one can show the following upper bound on $H$:

$$H(z)\leq\min\left(\frac{16\sigma^{2}}{\mu_{1}-\mu_{0}-2\epsilon}z,1\right).$$ (21)

Again, see Section J.3 for details.
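The following sketch spot-checks Equation 21 numerically. The parameter values $\mu_{0}=0$, $\mu_{1}=1$, $\sigma=1$, $\epsilon=0.2$ are my own choices (they satisfy $\mu_{1}-\mu_{0}\leq\sqrt{2}\sigma$ and $\epsilon<(\mu_{1}-\mu_{0})/2$), and the sketch uses the fact, implied by Equation 20, that $\eta^{*}$ is a logistic function of $x$:

    import numpy as np
    from scipy.stats import norm

    # Sketch of Example 3: h(z) = P*(|eta* - 1/2| <= z) written with Gaussian
    # CDFs, compared against the upper bound of Equation 21.
    mu0, mu1, sigma, eps = 0.0, 1.0, 1.0, 0.2
    m0, m1 = mu0 + eps, mu1 - eps                  # centres of the attacked Gaussians
    a, mid = (m1 - m0) / sigma**2, (m0 + m1) / 2   # eta*(x) = sigmoid(a * (x - mid))

    def h(z):
        # half-width of {|eta* - 1/2| <= z} around mid, then P* of that interval
        t = np.log((0.5 + z) / (0.5 - z)) / a
        mass = lambda m: norm.cdf(mid + t, m, sigma) - norm.cdf(mid - t, m, sigma)
        return 0.5 * mass(m0) + 0.5 * mass(m1)

    zs = np.linspace(1e-4, 0.499, 400)
    bound = np.minimum(16 * sigma**2 / (mu1 - mu0 - 2 * eps) * zs, 1.0)
    print("Equation 21 holds on the grid:", bool(np.all(h(zs) <= bound + 1e-12)))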

When $\epsilon\geq(\mu_{1}-\mu_{0})/2$, (Frank, 2024, Example 4.1) shows that the adversarial Bayes classifier is not unique up to degeneracy. Notably, the bound in the preceding example deteriorates as $\epsilon\to(\mu_{1}-\mu_{0})/2$, and then fails entirely when $\epsilon=(\mu_{1}-\mu_{0})/2$.

4 Linear Surrogate Bounds— Proof of Theorems 12 and 11

The proof of Theorem 12 simply involves bounding the indicator functions $S_{\epsilon}({\mathbf{1}}_{f>0})$, $S_{\epsilon}({\mathbf{1}}_{f\leq 0})$ in terms of the functions $S_{\epsilon}(\phi\circ f)$ and $S_{\epsilon}(\phi\circ-f)$. This strategy is entirely analogous to the argument for the (non-adversarial) surrogate bound Equation 10 in Section B.1. A similar argument is also an essential intermediate step in the proof of Theorem 11.

Proof of Theorem 12.

If $R_{*}^{\epsilon}=0$, then the duality result Theorem 7 implies that for any measures ${\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})$, ${\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$ we have ${\mathbb{P}}^{\prime}(\eta^{\prime}=0\text{ or }1)=1$, where ${\mathbb{P}}^{\prime}={\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime}$ and $\eta^{\prime}=d{\mathbb{P}}_{1}^{\prime}/d{\mathbb{P}}^{\prime}$. Consequently, $\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})=0$ for any $({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$, and hence $R_{\phi,*}^{\epsilon}=0$. Thus it remains to show that $R^{\epsilon}(f)\leq\frac{1}{\phi(0)}R_{\phi}^{\epsilon}(f)$ for all functions $f$. We will prove the bound

$$S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})\leq\frac{1}{\phi(0)}S_{\epsilon}(\phi(f))({\mathbf{x}}).$$ (22)

The inequality Equation 22 trivially holds when $S_{\epsilon}({\mathbf{1}}_{f\leq 0})=0$. Alternatively, the relation $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})=1$ implies $f({\mathbf{x}}^{\prime})\leq 0$ for some ${\mathbf{x}}^{\prime}\in\overline{B_{\epsilon}({\mathbf{x}})}$ and consequently $S_{\epsilon}(\phi(f))({\mathbf{x}})\geq\phi(0)$. Thus whenever $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})=1$,

$$S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})=\frac{\phi(0)}{\phi(0)}\leq\frac{1}{\phi(0)}S_{\epsilon}(\phi(f))({\mathbf{x}}).$$ (23)

An analogous argument implies that whenever $S_{\epsilon}({\mathbf{1}}_{f>0})({\mathbf{x}})=1$,

$$S_{\epsilon}({\mathbf{1}}_{f>0})({\mathbf{x}})=S_{\epsilon}({\mathbf{1}}_{-f<0})({\mathbf{x}})\leq\frac{1}{\phi(0)}S_{\epsilon}(\phi(-f))({\mathbf{x}}).$$

As a result:

$$R^{\epsilon}(f)=\int S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})\,d{\mathbb{P}}_{1}+\int S_{\epsilon}({\mathbf{1}}_{f>0})({\mathbf{x}})\,d{\mathbb{P}}_{0}\leq\frac{1}{\phi(0)}\left(\int S_{\epsilon}(\phi(f))({\mathbf{x}})\,d{\mathbb{P}}_{1}+\int S_{\epsilon}(\phi(-f))({\mathbf{x}})\,d{\mathbb{P}}_{0}\right)=\frac{1}{\phi(0)}R_{\phi}^{\epsilon}(f)$$

In contrast, when the optimal surrogate risk $R^{\epsilon}_{\phi,*}$ is non-zero, the bound in Theorem 11 necessitates a more sophisticated argument. Below, we decompose both the adversarial classification risk and the adversarial surrogate risk as the sum of four positive terms.

Let ${\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}$ be any maximizers of $\bar{R}_{\phi}$, which also maximize $\bar{R}$ by Theorem 10. Set ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$, $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$. Let $\gamma_{0}^{*}$, $\gamma_{1}^{*}$ be the couplings between ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{0}^{*}$ and ${\mathbb{P}}_{1}$, ${\mathbb{P}}_{1}^{*}$ respectively that achieve the infimum in Equation 15. Then due to the strong duality in Theorem 7, one can decompose the excess classification risk as

$$R^{\epsilon}(f)-R_{*}^{\epsilon}=R^{\epsilon}(f)-\bar{R}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})=I_{1}(f)+I_{0}(f)$$ (24)

with

$$I_{1}(f)=\left(\int S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})\,d\gamma_{1}^{*}\right)+\left(\int C(\eta^{*},f)-C^{*}(\eta^{*})\,d{\mathbb{P}}_{1}^{*}\right)$$
$$I_{0}(f)=\left(\int S_{\epsilon}({\mathbf{1}}_{f>0})({\mathbf{x}})-{\mathbf{1}}_{f>0}({\mathbf{x}}^{\prime})\,d\gamma_{0}^{*}\right)+\left(\int C(\eta^{*},f)-C^{*}(\eta^{*})\,d{\mathbb{P}}_{0}^{*}\right)$$

Lemma 1 implies that $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})$ must be non-negative, while the definition of $C^{*}$ implies that $C(\eta^{*},f)-C^{*}(\eta^{*})\geq 0$.

Similarly, one can express the excess surrogate risk as

$$R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon}=I_{1}^{\phi}(f)+I_{0}^{\phi}(f)$$ (25)

with

$$I_{1}^{\phi}(f)=\left(\int S_{\epsilon}(\phi(f))({\mathbf{x}})-\phi(f)({\mathbf{x}}^{\prime})\,d\gamma_{1}^{*}\right)+\left(\int C_{\phi}(\eta^{*},f)-C_{\phi}^{*}(\eta^{*})\,d{\mathbb{P}}_{1}^{*}\right)$$
$$I_{0}^{\phi}(f)=\left(\int S_{\epsilon}(\phi(-f))({\mathbf{x}})-\phi(-f)({\mathbf{x}}^{\prime})\,d\gamma_{0}^{*}\right)+\left(\int C_{\phi}(\eta^{*},f)-C_{\phi}^{*}(\eta^{*})\,d{\mathbb{P}}_{0}^{*}\right)$$

Define $K_{\phi}=\frac{3+\sqrt{5}}{2}\cdot\frac{1}{\phi(0)-C_{\phi}^{*}(1/2-\alpha)}$. We will argue that:

$$I_{0}(f)\leq K_{\phi}I_{0}^{\phi}(f)$$ (26)
$$I_{1}(f)\leq K_{\phi}I_{1}^{\phi}(f)$$ (27)

Below, we discuss the proof of Equation 27 and an analogous argument will imply Equation 26.

The proof proceeds by splitting the domain $\mathbb{R}^{d}\times\mathbb{R}^{d}$ into three different regions $D_{1}$, $E_{1}$, $F_{1}$ and proving the inequality in each case with a slightly different argument. These three cases will also appear in the proof of Theorem 13. Define the sets $D_{1}$, $E_{1}$, $F_{1}$ by

$$D_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=0\}$$ (28)
$$E_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1,\ f({\mathbf{x}}^{\prime})\geq\beta\}$$ (29)
$$F_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1,\ f({\mathbf{x}}^{\prime})<\beta\}$$ (30)

where $\beta>0$ is some constant, to be specified later (see Equations 32 and 33). On the set $D_{1}$ the adversarial error matches the non-adversarial error with respect to the distribution ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$, and thus the bound in Equation 12 implies a linear surrogate bound. On $E_{1}$, the same argument as Equation 23 together with Equation 12 proves a linear surrogate bound for adversarial risks. In short: this argument uses the first term in $I_{1}^{\phi}(f)$ to bound the first term in $I_{1}(f)$ and the second term of $I_{1}^{\phi}(f)$ to bound the second term of $I_{1}(f)$.

In contrast, the counterexample discussed in Equation 14 demonstrates that when $f$ is near $0$, the quantity $S_{\epsilon}(\phi(f))({\mathbf{x}})-\phi(f)({\mathbf{x}}^{\prime})$ can be small even though $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})$ can be large. Consequently, a different strategy is required to establish a linear surrogate bound on the set $F_{1}$. The key observation is that under the assumptions of Proposition 1, the function $f$ must be bounded away from zero whenever it misclassifies a point. As a result, the excess conditional risk $C_{\phi}(\eta,f)-C_{\phi}^{*}(\eta)$ is bounded below by a positive constant and thus can be used to bound the terms comprising $I_{1}^{\phi}(f)$. The constant $\beta$ is then specifically chosen to balance the contribution of the risks over the sets $E_{1}$ and $F_{1}$.

Proof of Theorem 11.

We will prove Equation 27; the argument for Equation 26 is analogous. Due to Equations 24 and 25, these inequalities prove the desired result. First, notice that Equation 11 implies that

$$C(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\leq\frac{1}{\phi(0)-C_{\phi}^{*}(1/2-\alpha)}\left(C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)\quad{\mathbb{P}}^{*}\text{-a.e.}$$ (31)

Choose the constant $\beta$ to satisfy

$$\phi(\beta)=tC_{\phi}^{*}(1/2-\alpha)+(1-t)\phi(0)$$ (32)

with $t=(3-\sqrt{5})/2$. The parameter $t$ is selected to balance the contributions of the errors on $E_{1}$ and $F_{1}$; specifically, it should satisfy

$$\frac{1}{t}=1+\frac{1}{1-t}=\frac{3+\sqrt{5}}{2}=K_{\phi}(\phi(0)-C_{\phi}^{*}(1/2-\alpha))$$ (33)
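For completeness, here is the short computation, implicit in the text, showing why this value of $t$ is the one that balances the two cases:

$$\frac{1}{t}=1+\frac{1}{1-t}=\frac{2-t}{1-t}\iff 1-t=t(2-t)\iff t^{2}-3t+1=0\iff t=\frac{3-\sqrt{5}}{2}\in(0,1),\qquad\text{so that}\qquad\frac{1}{t}=\frac{2}{3-\sqrt{5}}=\frac{3+\sqrt{5}}{2}.$$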

Next, we prove the relation Equation 27 on each of the sets $D_{1}$, $E_{1}$, $F_{1}$ separately.

  1. On the set $D_{1}$:

     Lemma 1 implies that $S_{\epsilon}(\phi(f))({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\geq 0$. This fact together with Equation 31 implies Equation 27.

  2. On the set $E_{1}$:

     If $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1$ but $f({\mathbf{x}}^{\prime})\geq\beta$, then $S_{\epsilon}(\phi\circ f)({\mathbf{x}})\geq\phi(0)$ while $\phi(f({\mathbf{x}}^{\prime}))\leq\phi(\beta)$, and thus $S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\geq\phi(0)-\phi(\beta)=t(\phi(0)-C_{\phi}^{*}(1/2-\alpha))=1/K_{\phi}$ by Equation 33. Thus

$$S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})=1=\frac{K_{\phi}}{K_{\phi}}\leq K_{\phi}(S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime})))$$ (34)

     This relation together with Equation 31 implies Equation 27.

  3. On the set $F_{1}$:

     First, $S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1$ implies that $f({\mathbf{x}}^{\prime})>0$. If additionally $f({\mathbf{x}}^{\prime})<\beta$, then both $f({\mathbf{x}}^{\prime})<\beta$ and $-f({\mathbf{x}}^{\prime})<\beta$, and consequently $C_{\phi}(\eta^{*},f({\mathbf{x}}^{\prime}))\geq\phi(\beta)$. Furthermore, as $C_{\phi}^{*}$ is increasing on $[0,1/2]$ and decreasing on $[1/2,1]$ (see Lemma 5 in Section B.2), $\sup_{|\eta-1/2|\geq\alpha}C_{\phi}^{*}(\eta)=C_{\phi}^{*}(1/2-\alpha)$. Thus due to the choice of $\beta$ in Equation 32:

$$C_{\phi}(\eta^{*},f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*})\geq\phi(\beta)-C_{\phi}^{*}(1/2-\alpha)=(1-t)(\phi(0)-C_{\phi}^{*}(1/2-\alpha)).$$

     The same argument as Equation 34 then implies

$$S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})\leq\frac{1}{(1-t)(\phi(0)-C_{\phi}^{*}(1/2-\alpha))}(C_{\phi}(\eta^{*},f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}))$$

     This relation together with Equations 31 and 33 and Lemma 1 implies Equation 27. ∎

5 Proof of Theorem 13

Before proving Theorem 13, we will show that this bound is non-vacuous when the adversarial Bayes classifier is unique up to degeneracy. The function $h(z)={\mathbb{P}}^{*}(|\eta^{*}-1/2|\leq z)$ is a cdf, and is thus right-continuous in $z$. Furthermore, if the adversarial Bayes classifier is unique up to degeneracy, then $h(0)=0$. The following lemma implies that if $H=\operatorname{conc}(h)$ then $H$ is continuous at $0$ and $H(0)=0$. See Appendix E for a proof.

Lemma 2.

Let $h:[0,1/2]\to\mathbb{R}$ be a non-decreasing function with $h(0)=0$ and $h(1/2)=1$ that is right-continuous at $0$. Then $\operatorname{conc}(h)$ is non-decreasing, satisfies $\operatorname{conc}(h)(0)=0$, and is continuous on $[0,1/2]$.
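On a grid, the concave envelope of Equation 18 can be computed as the upper convex hull of the graph, i.e., the pointwise maximum over all chords between grid points. The sketch below (not from the paper; the toy cdf $h$ is an arbitrary example) illustrates this and confirms the conclusions of Lemma 2 numerically:

    import numpy as np

    # Sketch: discrete concave envelope (least concave majorant) of a toy cdf h
    # on [0, 1/2] with h(0) = 0, mimicking H(z) = conc(P*(|eta* - 1/2| <= z)).
    z = np.linspace(0.0, 0.5, 201)
    h = np.clip(4 * z**2 + 0.5 * (z > 0.3), 0.0, 1.0)   # non-decreasing, h(0) = 0

    def concave_envelope(x, y):
        env = y.copy()
        for j in range(len(x)):
            for k in range(j + 1, len(x)):
                # value of the chord from (x[j], y[j]) to (x[k], y[k]) on [j, k]
                chord = y[j] + (y[k] - y[j]) * (x[j:k + 1] - x[j]) / (x[k] - x[j])
                env[j:k + 1] = np.maximum(env[j:k + 1], chord)
        return env

    H = concave_envelope(z, h)
    print("H >= h everywhere:", bool(np.all(H >= h - 1e-12)))
    print("H(0) =", H[0], " (Lemma 2: the envelope still vanishes at 0)")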

The first step in proving Theorem 13 is showing an analog of Theorem 11 with $\alpha=0$ for which the linear function is replaced by an $\eta$-dependent concave function.

Proposition 2.

Let $\Phi$ be a concave non-decreasing function for which $C(\eta,\alpha)-C^{*}(\eta)\leq\Phi(C_{\phi}(\eta,\alpha)-C_{\phi}^{*}(\eta))$ for any $\eta\in[0,1]$ and $\alpha\in\overline{\mathbb{R}}$. Let ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ be any two maximizers of $\bar{R}_{\phi}$ for which ${\mathbb{P}}^{*}(\eta^{*}=1/2)=0$ for ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$ and $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$. Let $G:[0,\infty)\to\mathbb{R}$ be any non-decreasing concave function for which the quantity

$$K=\int\frac{1}{G((\phi(0)-C_{\phi}^{*}(\eta^{*}))/2)}\,d{\mathbb{P}}^{*}$$

is finite. Then $R^{\epsilon}(f)-R_{*}^{\epsilon}\leq\tilde{\Phi}(R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon})$, where

$$\tilde{\Phi}(z)=6\sqrt{KG\left(\frac{1}{6}z\right)}+2\Phi\left(\frac{1}{2}z\right)$$ (35)

The function $\Psi^{-1}$ in Theorem 4 and the surrogate bounds of Zhang (2004) provide examples of candidate functions for $\Phi$. As before, this result is proved by decomposing the risks $R_{\phi}^{\epsilon}$, $R^{\epsilon}$ as the sum of four terms as in Equations 24 and 25 and then bounding these quantities over the sets $D_{1}$, $E_{1}$, and $F_{1}$ defined in Equations 28, 29, and 30 separately. The key observation is that when $f$ is bounded away from $\operatorname*{argmin}C_{\phi}(\eta,\cdot)$, the excess conditional risk $C_{\phi}(\eta,f)-C_{\phi}^{*}(\eta)$ must be strictly positive. This quantity again serves to bound both components of $I_{1}^{\phi}(f)$, even if it is not uniformly bounded away from zero. As before, $\beta$ is selected to balance the contributions of the risk on the sets $E_{1}$ and $F_{1}$. This time, the value $\beta$ is a function of $\eta^{*}({\mathbf{x}}^{\prime})$, where $\beta:[0,1]\to\overline{\mathbb{R}}$ is a monotonic function for which

$$\phi(\beta(\eta))=tC_{\phi}^{*}(\eta)+(1-t)\phi(0)=\frac{1}{2}\left(\phi(0)+C_{\phi}^{*}(\eta)\right)$$ (36)

with $t=1/2$. In Appendix F, we show that such a function $\beta$ exists. An argument like the proof of Theorem 11, combined with applications of the Cauchy–Schwarz and Jensen inequalities, then proves Proposition 2; see Appendix G for details. Again, the function $\beta$ is chosen to balance the contributions of the upper bounds on the risk on $E_{1}$ and $F_{1}$.

The factor of $1/6$ in Equation 35 arises as an artifact of the proof technique. Specifically, the constant $6=2\cdot 3$ reflects two distinct components of the argument: the factor of $3$ results from averaging over the three sets $D_{1}$, $E_{1}$, $F_{1}$ (see Equation 68 in Appendix G), and the factor of $2$ arises from combining the bounds associated with the two integrals $I_{1}(f)$ and $I_{0}(f)$ (see Equations 66 and 68 in Appendix G).

We now turn to the problem of identifying functions $G$ for which the constant $K$ in the preceding proposition is guaranteed to be finite. Observe that $\phi(0)-C_{\phi}^{*}(\eta^{*})\geq\phi(0)-C_{\phi}^{*}(1/2)$, and so if $\phi(0)>C_{\phi}^{*}(1/2)$, the identity function is a possible choice for $G$. This option results in

$$\tilde{\Phi}(z)=\frac{2}{\phi(0)-C_{\phi}^{*}(1/2)}z+2\Phi\left(\frac{1}{2}z\right),$$

which may improve the convergence rate relative to the bound in Theorem 11. The results developed here extend the classical analysis of Bartlett et al. (2006) to the adversarial setting. Moreover, Proposition 2 points to a pathway for generalizing the framework of Zhang (2004) to robust classification.

Alternatively, we consider constructing a function $G$ for which the constant $K$ in Proposition 2 is always finite when the adversarial Bayes classifier is unique up to degeneracy, at the cost of making $G$ distribution-dependent. Observe that if $h$ is the cdf of $|\eta-1/2|$ and $h$ is continuous, then $\int 1/h^{r}\,dh$ is always finite for $r\in(0,1)$. This calculation suggests $G=h\circ\Psi^{-1}$, with $\Psi$ defined in Theorem 4. To ensure the concavity of $G$, we instead select $G=H\circ\Psi^{-1}$ with $H=\operatorname{conc}(h)$.

Lemma 3.

Assume $C_{\phi}^{*}(1/2)=\phi(0)$. Let ${\mathbb{P}}_{1}$, ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}^{*}$, ${\mathbb{P}}_{0}^{*}$, $\phi$, $H$, and $\Psi$ be as in Theorem 13. Let $\Lambda(z)=\Psi^{-1}(\min(z,\phi(0)))$. Then for any $r\in(0,1)$,

$$R^{\epsilon}(f)-R^{\epsilon}_{*}\leq\tilde{\Phi}(R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon})$$ (37)

with

$$\tilde{\Phi}(z)=6\sqrt{\frac{2^{r}}{1-r}H\left(\Lambda\left(\frac{1}{6}z\right)\right)^{r}}+2\Lambda\left(\frac{z}{2}\right).$$
Proof.

For convenience, let Λ¯(z)=12Λ(2z)\bar{\Lambda}(z)=\frac{1}{2}\Lambda(2z). Let G=(HΛ¯)rG=(H\circ\bar{\Lambda})^{r}, where h(z)=(|η1/2|z)h(z)={\mathbb{P}}^{*}(|\eta^{*}-1/2|\leq z). Then GG is concave because it is the composition of a concave function with an increasing concave function. We will verify that

K=1G((ϕ(0)Cϕ(η))/2)𝑑2r1rK=\int\frac{1}{G((\phi(0)-C_{\phi}^{*}(\eta^{*}))/2)}d{\mathbb{P}}^{*}\leq\frac{2^{r}}{1-r}

First,

1G((ϕ(0)Cϕ(η))/2)𝑑=1H(|η1/2|)r𝑑=[0,12]1H(s)r𝑑s=(0,12]1H(s)r𝑑s\displaystyle\int\frac{1}{G((\phi(0)-C_{\phi}^{*}(\eta^{*}))/2)}d{\mathbb{P}}^{*}=\int\frac{1}{H(|\eta^{*}-1/2|)^{r}}d{\mathbb{P}}^{*}=\int_{[0,\frac{1}{2}]}\frac{1}{H(s)^{r}}d{\mathbb{P}}^{*}\sharp s=\int_{(0,\frac{1}{2}]}\frac{1}{H(s)^{r}}d{\mathbb{P}}^{*}\sharp s

with s=|η1/2|s=|\eta^{*}-1/2|. The assumption (η=1/2)=0{\mathbb{P}}^{*}(\eta^{*}=1/2)=0 allows us to drop 0 from the domain of integration. Because the function HH is continuous on (0,1](0,1] by Lemma 2, this last expression can actually be evaluated as a Riemann-Stieltjes integral with respect to the function h(s)=(|η1/2|s)h(s)={\mathbb{P}}^{*}(|\eta^{*}-1/2|\leq s):

1H(s)r𝑑s=1H(s)r𝑑h\displaystyle\int\frac{1}{H(s)^{r}}d{\mathbb{P}}^{*}\sharp s=\int\frac{1}{H(s)^{r}}dh (38)

This result is standard when {\mathbb{P}}^{*} is Lebesgue measure (see for instance Theorem 5.46 of (Wheeden & Zygmund, 1977)). We prove the equality in Equation 38 for non-increasing continuous functions in Proposition 4 in Section H.1.

Finally, the integral in Equation 38 can be bounded as

1H(s)r𝑑h2r1r\displaystyle\int\frac{1}{H(s)^{r}}dh\leq\frac{2^{r}}{1-r} (39)

If hh were differentiable, then the chain rule would imply

1H(s)r𝑑h1h(s)r𝑑h=01/21h(s)rh(s)𝑑s=11rh(s)1r|01/2=11r.\int\frac{1}{H(s)^{r}}dh\leq\int\frac{1}{h(s)^{r}}dh=\int_{0}^{1/2}\frac{1}{h(s)^{r}}h^{\prime}(s)ds=\frac{1}{1-r}h(s)^{1-r}\bigg{|}_{0}^{1/2}=\frac{1}{1-r}.

This calculation is more delicate when hh is not differentiable; we formally prove the inequality in Equation 39 in Section H.2.
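The heuristic calculation above is the change of variables u=h(s)u=h(s), under which the integral becomes 01urdu=1/(1r)\int_{0}^{1}u^{-r}du=1/(1-r). A quick Monte Carlo sanity check of this identity for a continuous cdf hh (so that u=h(s)u=h(s) is uniformly distributed) is given below; the value of rr is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 0.3
u = rng.uniform(0.0, 1.0, size=1_000_000)   # u = h(s) is uniform when h is a continuous cdf
print(np.mean(u**(-r)), 1.0 / (1.0 - r))    # both are approximately 1 / (1 - r) ~ 1.43
```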

The calculation above proves the inequality in Equation 37 with Φ~\tilde{\Phi} taken to be

62r1rH(12Λ(26z))r+Λ(z)6\sqrt{\frac{2^{r}}{1-r}H\left(\frac{1}{2}\Lambda\left(\frac{2}{6}z\right)\right)^{r}}+\Lambda(z)

The concavity of Λ\Lambda together with the fact that Λ(0)=0\Lambda(0)=0 then proves the result. ∎

Minimizing this bound over rr then produces Theorem 13; see Appendix I for details.
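The optimization over rr does not generally admit a closed form; a simple grid search suffices in practice. In the sketch below, H and Lam are placeholder callables standing in for HH and Λ\Lambda (they are assumptions of this example, not quantities from the paper), and the expression implemented is the bound displayed in Lemma 3.

```python
import numpy as np

def phi_tilde(z, r, H, Lam):
    # The bound of Lemma 3 for a fixed r in (0, 1).
    return 6.0 * np.sqrt(2.0**r / (1.0 - r) * H(Lam(z / 6.0))**r) + 2.0 * Lam(z / 2.0)

def best_r(z, H, Lam, grid=np.linspace(0.01, 0.99, 99)):
    # Grid search for the r giving the smallest bound at a given excess surrogate risk z.
    vals = [phi_tilde(z, r, H, Lam) for r in grid]
    i = int(np.argmin(vals))
    return grid[i], vals[i]

# Placeholder choices purely for illustration:
H = lambda t: np.minimum(1.0, 2.0 * t)         # a concave, cdf-like majorant
Lam = lambda z: np.sqrt(np.minimum(z, 1.0))    # a concave, increasing Lambda with Lam(0) = 0
print(best_r(0.05, H, Lam))
```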

6 Related Works

The results most similar to this paper are those of Li & Telgarsky (2023) and Mao et al. (2023a). Li & Telgarsky (2023) prove a surrogate bound for convex losses when one can minimize over the thresholding value in Equation 13 rather than just 0, while Mao et al. (2023a) prove an adversarial surrogate bound for a modified ρ\rho-margin loss.

Many papers study the statistical consistency of surrogate risks in the standard and adversarial contexts. Bartlett et al. (2006); Zhang (2004) prove surrogate risk bounds that apply to the class of all measurable functions. Lin (2004); Steinwart (2007) prove further results on consistency in the standard setting, and Frongillo & Waggoner (2021) study the optimality of such surrogate risk bounds. (Bao, 2023) relies on the modulus of convexity of CϕC_{\phi}^{*} to construct surrogate risk bounds. Philip M. Long (2013); Mingyuan Zhang (2020); Awasthi et al. (2022); Mao et al. (2023a; b); Awasthi et al. (2023b) further study consistency restricted to a specific family of functions, a concept called \mathcal{H}-consistency. Prior work by Mahdavi et al. (2014) also uses these surrogate risk bounds in conjunction with surrogate generalization bounds to study the generalization of the classification error.

In the adversarial setting, (Meunier et al., 2022; Frank & Niles-Weed, 2024a) identify which losses are adversarially consistent for all data distributions, while (Frank, 2025) shows that under reasonable distributional assumptions, a consistent loss is adversarially consistent for a specific distribution iff the adversarial Bayes classifier is unique up to degeneracy. (Awasthi et al., 2021) study adversarial consistency for a well-motivated class of linear functions, while some prior work also studies the approximation error caused by learning from a restricted function class \mathcal{H}. Liu et al. (2024) study the approximation error of the surrogate risk. Complementing this result, Awasthi et al. (2023b); Mao et al. (2023a) study \mathcal{H}-consistency in the adversarial setting for specific surrogate risks. Standard and adversarial surrogate risk bounds are a central tool in the derivation of the \mathcal{H}-consistency bounds in this line of research. Whether the adversarial surrogate bounds presented in this paper could result in improved adversarial \mathcal{H}-consistency bounds remains an open problem.

Our proofs rely on prior results that study adversarial risks and adversarial Bayes classifiers. Notably, (Bungert et al., 2021; Pydi & Jog, 2021; 2020; Bhagoji et al., 2019; Awasthi et al., 2023a) establish the existence of the adversarial Bayes classifier while (Frank & Niles-Weed, 2024b; Pydi & Jog, 2020; 2021; Bhagoji et al., 2019; Frank, 2025) prove various minimax theorems for the adversarial surrogate and classification risks. Subsequently, (Pydi & Jog, 2020) uses a minimax theorem to study the adversarial Bayes classifier, and (Frank, 2024) uses minimax results to study the notion of uniqueness up to degeneracy.

7 Conclusion

We prove surrogate risk bounds for adversarial risks whenever ϕ\phi is adversarially consistent for the distribution 0{\mathbb{P}}_{0}, 1{\mathbb{P}}_{1}. When ϕ\phi is adversarially consistent or the distribution of optimal adversarial attacks satisfies Massart’s noise condition, we prove a linear surrogate risk bound. For the general case, we prove a concave, distribution-dependent bound. Understanding the optimality of these bounds remains an open problem, as does understanding how these bounds interact with the sample complexity of estimating the surrogate quantity. These questions were partly addressed by (Frongillo & Waggoner, 2021) and (Mahdavi et al., 2014) in the standard setting, but remain unstudied in the adversarial scenario.

Acknowledgments

Natalie Frank was supported in part by the Research Training Group in Modeling and Simulation funded by the National Science Foundation via grant RTG/DMS – 1646339, NSF grant DMS-2210583, and NSF TRIPODS II - DMS 2023166.

References

  • Ambrosio et al. (2000) Luigi Ambrosio, Nicola Fusco, and Diego Pallara. Functions of Bounded Variation and Free Discontinuity Problems. Oxford Mathematical Monographs. Oxford University Press, 2000.
  • Apostol (1974) Tom M. Apostol. Mathematical analysis, 1974.
  • Awasthi et al. (2021) Pranjal Awasthi, Natalie S. Frank, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Calibration and consistency of adversarial surrogate losses. NeurIPS, 2021.
  • Awasthi et al. (2022) Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds for surrogate loss minimizers. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022.
  • Awasthi et al. (2023a) Pranjal Awasthi, Natalie S. Frank, and Mehryar Mohri. On the existence of the adversarial bayes classifier (extended version). arxiv, 2023a.
  • Awasthi et al. (2023b) Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2023b.
  • Bao (2023) Han Bao. Proper losses, moduli of convexity, and surrogate regret bounds. In Proceedings of Thirty Sixth Conference on Learning Theory, Proceedings of Machine Learning Research. PMLR, 2023.
  • Bartlett et al. (2006) Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.
  • Bhagoji et al. (2019) Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport, 2019.
  • Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp.  387–402. Springer, 2013.
  • Buja et al. (2005) Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, 2005.
  • Bungert et al. (2021) Leon Bungert, Nicolás García Trillos, and Ryan Murray. The geometry of adversarial training in binary classification. arxiv, 2021.
  • Frank (2024) Natalie S. Frank. A notion of uniqueness for the adversarial bayes classifier, 2024.
  • Frank (2025) Natalie S. Frank. Adversarial consistency and the uniqueness of the adversarial bayes classifier. European Journal of Applied Mathematics, 2025.
  • Frank & Niles-Weed (2024a) Natalie S. Frank and Jonathan Niles-Weed. The adversarial consistency of surrogate risks for binary classification. NeurIPS, 2024a.
  • Frank & Niles-Weed (2024b) Natalie S. Frank and Jonathan Niles-Weed. Existence and minimax theorems for adversarial surrogate risks in binary classification. Journal of Machine Learning Research, 2024b.
  • Frongillo & Waggoner (2021) Rafael Frongillo and Bo Waggoner. Surrogate regret bounds for polyhedral losses. In Advances in Neural Information Processing Systems, 2021.
  • Hiriart-Urruty & Lemaréchal (2001) Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis, 2001.
  • Jylhä (2014) Heikki Jylhä. The LL^{\infty} optimal transport: Infinite cyclical monotonicity and the existence of optimal transport maps. Calculus of Variations and Partial Differential Equations, 2014.
  • Li & Telgarsky (2023) Justin D. Li and Matus Telgarsky. On achieving optimal adversarial test error, 2023.
  • Lin (2004) Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
  • Liu et al. (2024) Changyu Liu, Yuling Jiao, Junhui Wang, and Jian Huang. Nonasymptotic bounds for adversarial excess risk under misspecified models. SIAM Journal on Mathematics of Data Science, 6(4), 2024. URL https://doi.org/10.1137/23M1598210.
  • Mahdavi et al. (2014) Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Binary excess risk for smooth convex surrogates, 2014.
  • Mao et al. (2023a) Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications, 2023a.
  • Mao et al. (2023b) Anqi Mao, Mehryar Mohri, and Yutao Zhong. Structured prediction with stronger consistency guarantees. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2023b.
  • Massart & Nédélec (2006) Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34, 2006.
  • Meunier et al. (2022) Laurent Meunier, Raphaël Ettedgui, Rafael Pinot, Yann Chevaleyre, and Jamal Atif. Towards consistency in adversarial classification. arXiv, 2022.
  • Mingyuan Zhang (2020) Mingyuan Zhang and Shivani Agarwal. Consistency vs. h-consistency: The interplay between surrogate loss functions and the scoring function class. NeurIPS, 2020.
  • Paschali et al. (2018) Magdalini Paschali, Sailesh Conjeti, Fernando Navarro, and Nassir Navab. Generalizability vs. robustness: Adversarial examples for medical imaging. Springer, 2018.
  • Philip M. Long (2013) Philip M. Long and Rocco A. Servedio. Consistency versus realizable h-consistency for multiclass classification. ICML, 2013.
  • Pydi & Jog (2020) Muni Sreenivas Pydi and Varun Jog. Adversarial risk via optimal transport and optimal couplings. ICML, 2020.
  • Pydi & Jog (2021) Muni Sreenivas Pydi and Varun Jog. The many faces of adversarial risk. Neural Information Processing Systems, 2021.
  • Raftery & Gneiting (2007) Adrian Raftery and Tilmann Gneiting. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 2007.
  • Reid & Williamson (2009) Mark D. Reid and Robert C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009. Association for Computing Machinery.
  • Rudin (1976) Walter Rudin. Principles of Mathematical Analysis. Mathematics Series. McGraw-Hill International Editions, third edition, 1976.
  • Savage (1971) Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336), 1971.
  • Schervish (1989) Mark J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 1989.
  • Stein & Shakarchi (2005) Elias Stein and Rami Shakarchi. Real analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2005.
  • Steinwart (2007) Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 2007.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Wheeden & Zygmund (1977) Richard L. Wheeden and Antoni Zygmund. Measure and Integral. Pure and Applied Mathematics. Marcel Dekker Inc., 1977.
  • Xu et al. (2022) Ying Xu, Kiran Raja, Raghavendra Ramachandra, and Christoph Busch. Adversarial Attacks on Face Recognition Systems, pp.  139–161. Springer International Publishing, Cham, 2022.
  • Zhang (2004) Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.


Appendix A Proof of Theorem 3

Lemma 4.

Assume ϕ\phi is consistent. Then Cϕ(η)=ϕ(0)C_{\phi}^{*}(\eta)=\phi(0) implies that η=1/2\eta=1/2.

This result appeared as Lemma 7 of (Frank, 2025).

Proof.

If ϕ\phi is consistent and 0 minimizes Cϕ(η,α)C_{\phi}(\eta,\alpha), then 0 must also minimize C(η,α)=η𝟏α0+(1η)𝟏α>0C(\eta,\alpha)=\eta{\mathbf{1}}_{\alpha\leq 0}+(1-\eta){\mathbf{1}}_{\alpha>0} and consequently η1/2\eta\leq 1/2. However Cϕ(η,α)=Cϕ(1η,α)C_{\phi}(\eta,\alpha)=C_{\phi}(1-\eta,-\alpha) so that 0 must minimize C(1η,α)C(1-\eta,-\alpha) as well. Consequently, 1η1/21-\eta\leq 1/2 and thus η\eta must actually equal 1/21/2. ∎

Proof of Theorem 3.

Forward direction: Assume that ϕ\phi is consistent. Note that Cϕ(η)Cϕ(η,0)=ϕ(0)C_{\phi}^{*}(\eta)\leq C_{\phi}(\eta,0)=\phi(0) for any η\eta. Thus Lemma 4 implies that Cϕ(η)<ϕ(0)C_{\phi}^{*}(\eta)<\phi(0) for η1/2\eta\neq 1/2.

Backward direction: Assume that Cϕ(η)<ϕ(0)C_{\phi}^{*}(\eta)<\phi(0) for all η1/2\eta\neq 1/2. Notice that if η=1/2\eta=1/2, C(1/2,α)C(1/2,\alpha) is constant in α\alpha so any sequence αn\alpha_{n} minimizes C(1/2,)C(1/2,\cdot). We will show if η>1/2\eta>1/2 and αn\alpha_{n} is a minimizing sequence of Cϕ(η,)C_{\phi}(\eta,\cdot), then αn>0\alpha_{n}>0 for sufficiently large nn, and thus must also minimize C(η,)C(\eta,\cdot). An analogous argument will imply that if η<1/2\eta<1/2, any minimizing sequence of Cϕ(η,)C_{\phi}(\eta,\cdot) must also minimize C(η,)C(\eta,\cdot) as well.

Assume η>1/2\eta>1/2 and let αn\alpha_{n} be any minimizing sequence of Cϕ(η,)C_{\phi}(\eta,\cdot). Let α\alpha^{*} be a limit point of the sequence αn\alpha_{n} in the extended real number line ¯\overline{\mathbb{R}}. Then α\alpha^{*} is a minimizer of Cϕ(η,α)C_{\phi}(\eta,\alpha). Because ϕ\phi is non-increasing, at least one of ϕ(α)\phi(\alpha^{*}) and ϕ(α)\phi(-\alpha^{*}) must be larger than or equal to ϕ(0)\phi(0), and the statement Cϕ(η)=Cϕ(η,α)<ϕ(0)C_{\phi}^{*}(\eta)=C_{\phi}(\eta,\alpha^{*})<\phi(0) then implies that the other must be strictly less than ϕ(0)\phi(0). Because η>1/2\eta>1/2 and α\alpha^{*} is a minimizer of Cϕ(η,)C_{\phi}(\eta,\cdot), one can conclude that ϕ(α)<ϕ(0)\phi(\alpha^{*})<\phi(0) and consequently α>0\alpha^{*}>0.

Therefore, every limit point of the sequence {αn}\{\alpha_{n}\} is strictly positive. Consequently, one can conclude that αn>0\alpha_{n}>0 for sufficiently large nn.

Appendix B (Non-Adversarial) Surrogate Bounds

B.1 The Realizable Case— Proof of Equation 10

Proof of Equation 10.

If R=0R_{*}=0, then (η=0 or 1)=1{\mathbb{P}}(\eta=0\text{ or }1)=1 and consequently Cϕ(η)=0C_{\phi}^{*}(\eta)=0 {\mathbb{P}}-a.e. As a result, Rϕ,=0R_{\phi,*}=0 and thus it remains to show that R(f)1ϕ(0)Rϕ(f)R(f)\leq\frac{1}{\phi(0)}R_{\phi}(f) for all functions ff. Next, we will prove the bound

𝟏f(𝐱)01ϕ(0)ϕ(f(𝐱)).{\mathbf{1}}_{f({\mathbf{x}})\leq 0}\leq\frac{1}{\phi(0)}\phi(f({\mathbf{x}})). (40)

The inequality Equation 40 trivially holds when 𝟏f(x)0=0{\mathbf{1}}_{f(x)\leq 0}=0. Alternatively, the relation 𝟏f(𝐱)0=1{\mathbf{1}}_{f({\mathbf{x}})\leq 0}=1 implies f(𝐱)0f({\mathbf{x}})\leq 0 and consequently ϕ(f(𝐱))ϕ(0)\phi(f({\mathbf{x}}))\geq\phi(0). Thus whenever 𝟏f(𝐱)0=1{\mathbf{1}}_{f({\mathbf{x}})\leq 0}=1,

𝟏f(𝐱)0=ϕ(0)ϕ(0)1ϕ(0)ϕ(f(𝐱)).{\mathbf{1}}_{f({\mathbf{x}})\leq 0}=\frac{\phi(0)}{\phi(0)}\leq\frac{1}{\phi(0)}\phi(f({\mathbf{x}})).

An analogous argument implies

𝟏f(𝐱)>0=𝟏f(𝐱)<01ϕ(0)ϕ(f(𝐱)).{\mathbf{1}}_{f({\mathbf{x}})>0}={\mathbf{1}}_{-f({\mathbf{x}})<0}\leq\frac{1}{\phi(0)}\phi(-f({\mathbf{x}})).

As a result:

R(f)\displaystyle R(f) =𝟏f(𝐱)0𝑑1+𝟏f(𝐱)>0𝑑01ϕ(0)(ϕ(f(𝐱))𝑑1+ϕ(f(𝐱))𝑑0)\displaystyle=\int{\mathbf{1}}_{f({\mathbf{x}})\leq 0}d{\mathbb{P}}_{1}+\int{\mathbf{1}}_{f({\mathbf{x}})>0}d{\mathbb{P}}_{0}\leq\frac{1}{\phi(0)}\left(\int\phi(f({\mathbf{x}}))d{\mathbb{P}}_{1}+\int\phi(-f({\mathbf{x}}))d{\mathbb{P}}_{0}\right)
=1ϕ(0)Rϕ(f)\displaystyle=\frac{1}{\phi(0)}R_{\phi}(f)
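The pointwise bound in Equation 40 is straightforward to check numerically; the sketch below does so for the hinge loss ϕ(α)=max(0,1α)\phi(\alpha)=\max(0,1-\alpha), which is used here purely as an illustrative choice of loss.

```python
import numpy as np

phi = lambda a: np.maximum(0.0, 1.0 - a)   # hinge loss; phi(0) = 1
f_vals = np.linspace(-5.0, 5.0, 1001)      # candidate values of f(x)
indicator = (f_vals <= 0).astype(float)
assert np.all(indicator <= phi(f_vals) / phi(0.0) + 1e-12)   # Equation 40 holds pointwise
```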

B.2 Linear Surrogate Risk Bounds—Proof of Proposition 1

In this appendix, we will find it useful to study the function

Cϕ(η)=infz(2η1)0Cϕ(η,z)C_{\phi}^{-}(\eta)=\inf_{z(2\eta-1)\leq 0}C_{\phi}(\eta,z)

introduced by (Bartlett et al., 2006). This function maps η\eta to the smallest value of the conditional ϕ\phi-risk assuming an incorrect classification. The symmetry Cϕ(η,α)=Cϕ(1η,α)C_{\phi}(\eta,\alpha)=C_{\phi}(1-\eta,-\alpha) implies Cϕ(η)=Cϕ(1η)C_{\phi}^{-}(\eta)=C_{\phi}^{-}(1-\eta). Further, the function CϕC_{\phi}^{-} is concave on each of the intervals [0,1/2][0,1/2] and [1/2,1][1/2,1], as it is an infimum of linear functions on each of these regions. The next result examines the monotonicity properties of CϕC_{\phi}^{*} and CϕC_{\phi}^{-}.

Lemma 5.

The function CϕC_{\phi}^{*} is non-decreasing on [0,1/2][0,1/2] and non-increasing on [1/2,1][1/2,1]. In contrast, CϕC_{\phi}^{-} is non-increasing on [0,1/2][0,1/2] and non-decreasing on [1/2,1][1/2,1]

Proof.

The symmetry Cϕ(η)=Cϕ(1η)C_{\phi}^{*}(\eta)=C_{\phi}^{*}(1-\eta) and Cϕ(η)=Cϕ(1η)C_{\phi}^{-}(\eta)=C_{\phi}^{-}(1-\eta) implies that it suffices to check monotonicity on [0,1/2][0,1/2]. Observe that

Cϕ(η,α)Cϕ(η,α)=η(ϕ(α)ϕ(α))+(1η)(ϕ(α)ϕ(α))=(2η1)(ϕ(α)ϕ(α)).C_{\phi}(\eta,\alpha)-C_{\phi}(\eta,-\alpha)=\eta(\phi(\alpha)-\phi(-\alpha))+(1-\eta)(\phi(-\alpha)-\phi(\alpha))=(2\eta-1)(\phi(\alpha)-\phi(-\alpha)).

If η1/2\eta\leq 1/2, then this quantity is non-negative when α0\alpha\geq 0. Therefore, when computing CϕC_{\phi}^{*} over [0,1/2][0,1/2], it suffices to minimize Cϕ(η,α)C_{\phi}(\eta,\alpha) over α0\alpha\leq 0. In other words, for η1/2\eta\leq 1/2,

Cϕ(η)=infαCϕ(η,α)=infα0Cϕ(η,α)C_{\phi}^{*}(\eta)=\inf_{\alpha}C_{\phi}(\eta,\alpha)=\inf_{\alpha\leq 0}C_{\phi}(\eta,\alpha)

For any fixed α0\alpha\leq 0, the quantity Cϕ(η,α)C_{\phi}(\eta,\alpha) is non-decreasing in η\eta and thus Cϕ(η1)Cϕ(η2)C_{\phi}^{*}(\eta_{1})\leq C_{\phi}^{*}(\eta_{2}) when η1η21/2\eta_{1}\leq\eta_{2}\leq 1/2.

In contrast, for any α0\alpha\geq 0, the quantity Cϕ(η,α)C_{\phi}(\eta,\alpha) is non-increasing in η\eta and thus Cϕ(η1)Cϕ(η2)C_{\phi}^{-}(\eta_{1})\geq C_{\phi}^{-}(\eta_{2}) when η1η21/2\eta_{1}\leq\eta_{2}\leq 1/2.

Next we’ll prove a useful lower bound on CϕC_{\phi}^{-}.

Lemma 6.

For all η[0,1]\eta\in[0,1],

Cϕ(η)(12η)ϕ(0)+2ηCϕ(η)C_{\phi}^{-}(\eta)\geq(1-2\eta)\phi(0)+2\eta C_{\phi}^{*}(\eta) (41)
Proof.

First, observe that η\eta is the convex combination η=2η1/2+(12η)0\eta=2\eta\cdot 1/2+(1-2\eta)\cdot 0. By the concavity of CϕC_{\phi}^{-} on [0,1/2][0,1/2],

Cϕ(η)\displaystyle C_{\phi}^{-}(\eta) =Cϕ(2η12+(12η)0)(12η)Cϕ(0)+2ηCϕ(12)\displaystyle=C_{\phi}^{-}\left(2\eta\cdot\frac{1}{2}+(1-2\eta)\cdot 0\right)\geq(1-2\eta)C_{\phi}^{-}(0)+2\eta C_{\phi}^{-}\left(\tfrac{1}{2}\right)

However, Cϕ(0)=ϕ(0)C_{\phi}^{-}(0)=\phi(0) while Cϕ(1/2)=Cϕ(1/2)C_{\phi}^{-}(1/2)=C_{\phi}^{*}(1/2). Further, Lemma 5 implies that Cϕ(1/2)Cϕ(η)C_{\phi}^{*}(1/2)\geq C_{\phi}^{*}(\eta), yielding the desired inequality for η1/2\eta\leq 1/2. When η>1/2\eta>1/2, one has 12η<01-2\eta<0 and ϕ(0)Cϕ(η)\phi(0)\geq C_{\phi}^{*}(\eta), so the right-hand side of Equation 41 is at most Cϕ(η)Cϕ(η)C_{\phi}^{*}(\eta)\leq C_{\phi}^{-}(\eta) and the inequality holds in this case as well.
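A numerical check of Equation 41 for a specific loss may be a helpful sanity check. The sketch below uses the exponential loss, for which Cϕ(η)=2η(1η)C_{\phi}^{*}(\eta)=2\sqrt{\eta(1-\eta)} is a standard closed form, and approximates CϕC_{\phi}^{-} by a constrained grid minimization; the loss and the grids are assumptions made only for this illustration.

```python
import numpy as np

phi = lambda a: np.exp(-a)                     # exponential loss
alphas = np.linspace(-20.0, 20.0, 40001)

def C_minus(eta):
    # C_phi^-(eta): minimize the conditional risk over the incorrectly classified side.
    feasible = alphas * (2.0 * eta - 1.0) <= 0.0
    vals = eta * phi(alphas) + (1.0 - eta) * phi(-alphas)
    return np.min(vals[feasible])

for eta in np.linspace(0.01, 0.99, 25):
    C_star = 2.0 * np.sqrt(eta * (1.0 - eta))                   # C_phi^*(eta) for this loss
    lower = (1.0 - 2.0 * eta) * phi(0.0) + 2.0 * eta * C_star   # right-hand side of Equation 41
    assert C_minus(eta) >= lower - 1e-9
```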

Proof of Proposition 1.

If C(η,f)C(η)=0C(\eta,f)-C^{*}(\eta)=0 then Equation 11 holds trivially. Otherwise, C(η,f)C(η)=|2η1|C(\eta,f)-C^{*}(\eta)=|2\eta-1|, and in this case

C(η,f)C(η)\displaystyle C(\eta,f)-C^{*}(\eta) =|2η1|=|2η1|ϕ(0)Cϕ(η)ϕ(0)Cϕ(η)\displaystyle=|2\eta-1|=|2\eta-1|\cdot\frac{\phi(0)-C_{\phi}^{*}(\eta)}{\phi(0)-C_{\phi}^{*}(\eta)} (42)
1ϕ(0)Cϕ(η)((|2η1|ϕ(0)+(1|2η1|)Cϕ(η))Cϕ(η))\displaystyle\leq\frac{1}{\phi(0)-C_{\phi}^{*}(\eta)}\left(\left(|2\eta-1|\phi(0)+(1-|2\eta-1|)C_{\phi}^{*}(\eta)\right)-C_{\phi}^{*}(\eta)\right)

At the same time, because |η1/2|α|\eta-1/2|\geq\alpha {\mathbb{P}}-a.e., Lemma 5 implies that Cϕ(η)Cϕ(1/2α)C_{\phi}^{*}(\eta)\leq C_{\phi}^{*}(1/2-\alpha) {\mathbb{P}}-a.e. Furthermore, applying Equation 41 at 1η1-\eta and using the symmetries Cϕ(η)=Cϕ(1η)C_{\phi}^{*}(\eta)=C_{\phi}^{*}(1-\eta) and Cϕ(η)=Cϕ(1η)C_{\phi}^{-}(\eta)=C_{\phi}^{-}(1-\eta) shows that

|2η1|ϕ(0)+(1|2η1|)Cϕ(η)Cϕ(η).|2\eta-1|\phi(0)+(1-|2\eta-1|)C_{\phi}^{*}(\eta)\leq C_{\phi}^{-}(\eta).

Therefore, Equation 42 is bounded above by

1ϕ(0)Cϕ(12α)(Cϕ(η)Cϕ(η))1ϕ(0)Cϕ(12α)(Cϕ(η,f)Cϕ(η)).\displaystyle\leq\frac{1}{\phi(0)-C_{\phi}^{*}\left(\frac{1}{2}-\alpha\right)}\left(C_{\phi}^{-}(\eta)-C_{\phi}^{*}(\eta)\right)\leq\frac{1}{\phi(0)-C_{\phi}^{*}\left(\frac{1}{2}-\alpha\right)}\left(C_{\phi}(\eta,f)-C_{\phi}^{*}(\eta)\right). (43)

The last inequality follows from the assumption C(η,f)C(η)=|2η1|C(\eta,f)-C^{*}(\eta)=|2\eta-1|, as it implies (2η1)f0(2\eta-1)f\leq 0, and thus Cϕ(η,f)Cϕ(η)C_{\phi}(\eta,f)\geq C_{\phi}^{-}(\eta). Consequently, Equation 43 implies Equation 11.

Integrating Equation 11 with respect to {\mathbb{P}} then produces the surrogate bound Equation 12.
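The sketch below checks the pointwise inequality in Equation 11 numerically for the ρ\rho-margin loss under the condition |η1/2|α|\eta-1/2|\geq\alpha, with CϕC_{\phi}^{*} approximated by a grid minimization. The loss, the value of α\alpha, and the grids are assumptions made only for this illustration.

```python
import numpy as np

rho, alpha0 = 1.0, 0.2
phi = lambda a: np.minimum(1.0, np.maximum(0.0, 1.0 - a / rho))      # rho-margin loss
alphas = np.linspace(-5.0, 5.0, 10001)
C_phi_star = lambda eta: np.min(eta * phi(alphas) + (1.0 - eta) * phi(-alphas))

denom = phi(0.0) - C_phi_star(0.5 - alpha0)                          # phi(0) - C_phi^*(1/2 - alpha)
etas = np.concatenate([np.linspace(0.0, 0.5 - alpha0, 20),
                       np.linspace(0.5 + alpha0, 1.0, 20)])          # etas with |eta - 1/2| >= alpha
for eta in etas:
    c_star, c_phi_star = min(eta, 1.0 - eta), C_phi_star(eta)
    for f in np.linspace(-3.0, 3.0, 121):
        excess_cls = eta * (f <= 0) + (1.0 - eta) * (f > 0) - c_star
        excess_phi = eta * phi(f) + (1.0 - eta) * phi(-f) - c_phi_star
        assert excess_cls <= excess_phi / denom + 1e-9               # Equation 11
```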

Appendix C Proof of Lemma 1

Proof of Lemma 1.

If 𝐱Bϵ(𝐱)¯{\mathbf{x}}^{\prime}\in\overline{B_{\epsilon}({\mathbf{x}})} then Sϵ(g)(𝐱)g(𝐱)S_{\epsilon}(g)({\mathbf{x}})\geq g({\mathbf{x}}^{\prime}). Thus if γ\gamma is supported on Δϵ\Delta_{\epsilon}, then Sϵ(g)(𝐱)g(𝐱)S_{\epsilon}(g)({\mathbf{x}})\geq g({\mathbf{x}}^{\prime}) γ\gamma-a.e. Integrating this inequality in γ\gamma produces

Sϵ(g)𝑑g𝑑.\int S_{\epsilon}(g)d{\mathbb{Q}}\geq\int gd{\mathbb{Q}}^{\prime}.

Taking the supremum over all ϵ(){\mathbb{Q}}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}}) then proves the result. ∎

Appendix D Proof of Theorem 10 Item 1)

In this appendix, we relax the regularity assumption of the loss functions: specifically we require loss functions to be lower semi-continuous rather than continuous.

Assumption 2.

The loss ϕ\phi is lower semi-continuous, non-increasing, and limαϕ(α)=0\lim_{\alpha\to\infty}\phi(\alpha)=0.

Frank & Niles-Weed (2024b) establish their result under this weaker condition, so Theorem 9 continues to hold for lower semi-continuous losses. Moreover, Theorem 7 of (Frank & Niles-Weed, 2024b) finds two conditions that characterize minimizers and maximizers of RϕϵR_{\phi}^{\epsilon} and R¯ϕ\bar{R}_{\phi}.

Theorem 14 (Complementary Slackness).

The function ff^{*} minimizes RϕϵR_{\phi}^{\epsilon} and the measures (0,1)({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}) maximize R¯ϕ\bar{R}_{\phi} over ϵ(0)×ϵ(1){{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}) iff the following two conditions hold:

  1. 1)
    Sϵ(ϕ(f))𝑑1=ϕ(f)𝑑1andSϵ(ϕ(f))𝑑0=ϕ(f)𝑑0\int S_{\epsilon}(\phi(f^{*}))d{\mathbb{P}}_{1}=\int\phi(f^{*})d{\mathbb{P}}_{1}^{*}\quad\text{and}\quad\int S_{\epsilon}(\phi(-f^{*}))d{\mathbb{P}}_{0}=\int\phi(-f^{*})d{\mathbb{P}}_{0}^{*}
  2. 2)
    Cϕ(η,f)=Cϕ(η)-a.e.C_{\phi}(\eta^{*},f^{*})=C_{\phi}^{*}(\eta^{*})\quad{\mathbb{P}}^{*}\text{-a.e.}

To prove Item 1) of Theorem 10, one uses the properties above to show that Rϵ(f)R¯(0,1)R^{\epsilon}(f^{*})\leq\bar{R}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}) for some function ff^{*} and any pair of measures 0{\mathbb{P}}_{0}^{*}, 1{\mathbb{P}}_{1}^{*} that maximize R¯ϕ\bar{R}_{\phi}. When ϕ\phi is strictly decreasing at 0, the identity {𝟏ϕ(f(𝐱))ϕ(0)=𝟏f(𝐱)0}\{{\mathbf{1}}_{\phi(f({\mathbf{x}}))\geq\phi(0)}={\mathbf{1}}_{f({\mathbf{x}})\leq 0}\} provides a means to relate R¯ϕ\bar{R}_{\phi} and R¯\bar{R}. In Sections D.1 and D.2, we show that for any loss ϕ\phi, there is another loss ψ\psi with Cϕ=CψC_{\phi}^{*}=C_{\psi}^{*} that satisfies this property.

Lemma 7.

Let ϕ\phi be a consistent loss function that satisfies Assumption 1. Then there is a non-increasing, lower semi-continuous loss function ψ\psi with limαψ(α)=0\lim_{\alpha\to\infty}\psi(\alpha)=0 for which Cϕ(η)=Cψ(η)C_{\phi}^{*}(\eta)=C_{\psi}^{*}(\eta) and

ψ(α)>ψ(0)for all α<0.\psi(\alpha)>\psi(0)\quad\quad\text{for all }\alpha<0. (44)

The function ψ\psi may fail to be continuous.

Proof of Item 1) of Theorem 10.

Let ψ\psi be the loss function constructed in Lemma 7 for which Cψ=CϕC_{\psi}^{*}=C_{\phi}^{*} and let γ1\gamma_{1}^{*} be the coupling between 1{\mathbb{P}}_{1} and 1{\mathbb{P}}_{1}^{*} that achieves the minimum WW_{\infty} distance. We aim to show that

Sϵ(𝟏f0)𝑑1=𝟏f0𝑑1\int S_{\epsilon}({\mathbf{1}}_{f^{*}\leq 0})d{\mathbb{P}}_{1}=\int{\mathbf{1}}_{f^{*}\leq 0}d{\mathbb{P}}_{1}^{*} (45)

and similarly,

Sϵ(𝟏f>0)𝑑0=𝟏f>0𝑑0.\int S_{\epsilon}({\mathbf{1}}_{f^{*}>0})d{\mathbb{P}}_{0}=\int{\mathbf{1}}_{f^{*}>0}d{\mathbb{P}}_{0}^{*}. (46)

Combining Equations 45 and 46 yields Rϵ(f)=R¯(0,1)R^{\epsilon}(f^{*})=\bar{R}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}), and thus 0{\mathbb{P}}_{0}^{*}, 1{\mathbb{P}}_{1}^{*} are optimal by Theorem 7.

We now prove Equation 45; the argument for Equation 46 is analogous, and we briefly outline the necessary modifications at the end of the proof.

First observe that the property Equation 44 implies that 𝟏f(𝐲)0=𝟏ψ(f(𝐲))ψ(0){\mathbf{1}}_{f({\mathbf{y}})\leq 0}={\mathbf{1}}_{\psi(f({\mathbf{y}}))\geq\psi(0)} pointwise and thus Sϵ(𝟏f0)(𝐱)=Sϵ(𝟏ψ(f)ψ(0))(𝐱)S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})=S_{\epsilon}({\mathbf{1}}_{\psi(f)\geq\psi(0)})({\mathbf{x}}). We will apply the complementary slackness conditions of Theorem 14 to argue that in fact Sϵ(𝟏f0)(𝐱)=𝟏f(𝐱)0S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})={\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0} γ1\gamma_{1}^{*}-a.e.

Item 1) of Theorem 14 implies that

Sϵ(ψf)(𝐱)=ψf(𝐱)γ1-a.e.S_{\epsilon}(\psi\circ f^{*})({\mathbf{x}})=\psi\circ f^{*}({\mathbf{x}}^{\prime})\quad\gamma_{1}^{*}\text{-a.e.} (47)

and thus ψf\psi\circ f^{*} assumes its maximum over closed ϵ\epsilon-balls 1{\mathbb{P}}_{1}-a.e. Because ψf\psi\circ f^{*} assumes this maximum 1{\mathbb{P}}_{1}-a.e., one can further conclude that Sϵ(𝟏ψfψ(0))=𝟏Sϵ(ψf)ψ(0)S_{\epsilon}({\mathbf{1}}_{\psi\circ f^{*}\geq\psi(0)})={\mathbf{1}}_{S_{\epsilon}(\psi\circ f^{*})\geq\psi(0)} 1{\mathbb{P}}_{1}-a.e. and therefore

Sϵ(𝟏ψfψ(0))(𝐱)=𝟏Sϵ(ψf)(𝐱)ψ(0)=𝟏ψ(f(𝐱))ψ(0)γ1-a.e.S_{\epsilon}({\mathbf{1}}_{\psi\circ f^{*}\geq\psi(0)})({\mathbf{x}})={\mathbf{1}}_{S_{\epsilon}(\psi\circ f^{*})({\mathbf{x}})\geq\psi(0)}={\mathbf{1}}_{\psi(f^{*}({\mathbf{x}}^{\prime}))\geq\psi(0)}\quad\gamma_{1}^{*}\text{-a.e.} (48)

Next, the property Equation 44 implies that 𝟏f0=𝟏ψfψ(0){\mathbf{1}}_{f^{*}\leq 0}={\mathbf{1}}_{\psi\circ f^{*}\geq\psi(0)}. Consequently, Equation 48 implies that

Sϵ(𝟏f0)=𝟏ψ(f(𝐱))ψ(0)γ1-a.e.S_{\epsilon}({\mathbf{1}}_{f^{*}\leq 0})={\mathbf{1}}_{\psi(f^{*}({\mathbf{x}}^{\prime}))\geq\psi(0)}\quad\gamma_{1}^{*}\text{-a.e.}

Now, Equation 44 again implies that 𝟏ψ(f(𝐱))ψ(0)=𝟏f(𝐱)0{\mathbf{1}}_{\psi(f^{*}({\mathbf{x}}^{\prime}))\geq\psi(0)}={\mathbf{1}}_{f^{*}({\mathbf{x}}^{\prime})\leq 0}. Combining this identity with the previous display proves that Sϵ(𝟏f0)=𝟏f(𝐱)0S_{\epsilon}({\mathbf{1}}_{f^{*}\leq 0})={\mathbf{1}}_{f^{*}({\mathbf{x}}^{\prime})\leq 0} γ1\gamma_{1}^{*}-a.e., and integrating this statement results in Equation 45.

Similarly, if γ0\gamma_{0}^{*} is the coupling between 0{\mathbb{P}}_{0} and 0{\mathbb{P}}_{0}^{*} that achieves the minimum WW_{\infty} distance, then

Sϵ(ψf)(𝐱)=ψf(𝐱)γ0-a.e.S_{\epsilon}(\psi\circ-f^{*})({\mathbf{x}})=\psi\circ-f^{*}({\mathbf{x}}^{\prime})\quad\gamma_{0}^{*}\text{-a.e.}

and thus ψf\psi\circ-f^{*} assumes its maximum over closed ϵ\epsilon-balls 0{\mathbb{P}}_{0}-a.e. and analogous reasoning to Equation 48 implies that

Sϵ(𝟏ψf(𝐱)ψ(0))=𝟏Sϵ(ψf(𝐱))ψ(0)=𝟏ψ(f(𝐱))ψ(0)γ0-a.e.S_{\epsilon}({\mathbf{1}}_{\psi\circ-f^{*}({\mathbf{x}})\geq\psi(0)})={\mathbf{1}}_{S_{\epsilon}(\psi\circ-f^{*}({\mathbf{x}}))\geq\psi(0)}={\mathbf{1}}_{\psi(-f^{*}({\mathbf{x}}^{\prime}))\geq\psi(0)}\quad\gamma_{0}^{*}\text{-a.e.} (49)

Finally, Equation 44 implies that 𝟏f>0=𝟏ψ(f)<ψ(0){\mathbf{1}}_{f^{*}>0}={\mathbf{1}}_{\psi(-f^{*})<\psi(0)}. Consequently, the same argument as for Equation 45 implies Equation 46.

D.1 Proof of Lemma 7

Proper losses are loss functions that can be used to learn a probability value rather than a binary classification decision. These losses are typically studied as functions on [0,1][0,1] rather than \mathbb{R}. Prior work (Bao, 2023; Buja et al., 2005; Reid & Williamson, 2009) considers losses of the form (c,η^)\ell(c,\hat{\eta}) where η^[0,1]\hat{\eta}\in[0,1] is the estimated probability value of the binary class c{0,1}c\in\{0,1\} . One can define the minimal conditional risk just as before via

C(η)=infη^[0,1]η(1,η^)+(1η)(0,η^)C_{\ell}^{*}(\eta)=\inf_{\hat{\eta}\in[0,1]}\eta\ell(1,\hat{\eta})+(1-\eta)\ell(0,\hat{\eta})

Savage (1971) first studied how to reconstruct the loss \ell given a concave function HH. We adapt their argument to construct a proper loss \ell on [0,1][0,1], and then compose this loss with a link to obtain a loss on \mathbb{R} with the desired properties. This reconstruction involves technical properties of concave functions.

Lemma 8.

For a continuous concave function H:[0,1]H:[0,1]\to\mathbb{R}, the left-hand (DHD^{-}H) and right-hand (D+HD^{+}H) derivatives always exist. The right-hand derivative D+HD^{+}H is non-increasing and right-continuous.

Furthermore, if HH has a strict maximum at 1/21/2, then the function H+:[0,1]H(η)H_{+}^{\prime}:[0,1]\to\partial H(\eta) defined by

H+(η)={D+H(η)η1/20η=1/2H_{+}^{\prime}(\eta)=\begin{cases}D^{+}H(\eta)&\eta\neq 1/2\\ 0&\eta=1/2\end{cases} (50)

is non-increasing and right-continuous.

See Section D.2 for a proof.

Proposition 3.

Let H:[0,1][0,K]H:[0,1]\twoheadrightarrow[0,K] be a concave function for which the only global maximum is at 1/21/2. Define \ell by

H(1,η^)=H(η^)+(1η^)H+(η^),H(0,η^)=H(η^)η^H+(η^)\ell_{H}(1,\hat{\eta})=H(\hat{\eta})+(1-\hat{\eta})H_{+}^{\prime}(\hat{\eta}),\quad\ell_{H}(0,\hat{\eta})=H(\hat{\eta})-\hat{\eta}H_{+}^{\prime}(\hat{\eta})

Then CH=HC_{\ell_{H}}^{*}=H.

Notice that Lemma 8 implies that H\ell_{H} is lower semi-continuous. Furthermore, when H(η)=H(1η)H(\eta)=H(1-\eta), the definitions of H(1,η^)\ell_{H}(1,\hat{\eta}) and H(0,η^)\ell_{H}(0,\hat{\eta}) imply that

H(1,1η^)=H(0,η^).\ell_{H}(1,1-\hat{\eta})=\ell_{H}(0,\hat{\eta}). (51)

Indeed, Equation 51 suggests that H(1,η^)\ell_{H}(1,\hat{\eta}) corresponds to ϕ(α)\phi(\alpha) while H(0,η^)\ell_{H}(0,\hat{\eta}) corresponds to ϕ(α)\phi(-\alpha). Fixing the value of H+H_{+}^{\prime} at 1/21/2 to zero ensures that H(1,)\ell_{H}(1,\cdot), H(0,)\ell_{H}(0,\cdot) coincide at η^=1/2\hat{\eta}=1/2, which corresponds to α=0\alpha=0 for the losses ϕ(α)\phi(\alpha), ϕ(α)\phi(-\alpha). This correspondence is formalized in the proof of Lemma 7.

Proof of Proposition 3.

Calculating CHC_{\ell_{H}}^{*} for the loss H\ell_{H} defined above results in

CH(η)=infη^[0,1]H(η^)+(ηη^)H+(η^)C_{\ell_{H}}^{*}(\eta)=\inf_{\hat{\eta}\in[0,1]}H(\hat{\eta})+(\eta-\hat{\eta})H_{+}^{\prime}(\hat{\eta})

The choice η^=η\hat{\eta}=\eta results in CH(η)H(η)C_{\ell_{H}}^{*}(\eta)\leq H(\eta).

However, the concavity of HH implies that

H(η^)+H+(η^)(ηη^)H(η)H(\hat{\eta})+H_{+}^{\prime}(\hat{\eta})(\eta-\hat{\eta})\geq H(\eta)

and as a result:

CH(η)H(η).C_{\ell_{H}}^{*}(\eta)\geq H(\eta).

Therefore, CH(η)=H(η)C_{\ell_{H}}^{*}(\eta)=H(\eta).

Next, an integral representation of H\ell_{H} proves that H(1,η^)\ell_{H}(1,\hat{\eta}) is non-increasing. Let β(c)=H+(c)\beta(c)=-H_{+}^{\prime}(c); then

H(1,η^)=η^1(1c)𝑑β(c),H(0,η^)=0η^c𝑑β(c),\ell_{H}(1,\hat{\eta})=\int_{\hat{\eta}}^{1}(1-c)d\beta(c),\quad\ell_{H}(0,\hat{\eta})=\int_{0}^{\hat{\eta}}cd\beta(c), (52)

where the integrals in Equation 52 are Riemann-Stieltjes integrals. Lemma 8 implies that β\beta is right-continuous and non-decreasing, and thus these integrals are in fact well defined.

Such a representation was first given in (Schervish, 1989, Theorem 4.2) for left-continuous losses and Lebesgue integrals; see also (Raftery & Gneiting, 2007, Theorem 3) for a discussion of this result. (Reid & Williamson, 2009, Theorem 2) offer an alternative proof of Equation 52 in terms of generalized derivatives.

Writing these integrals as Riemann-Stieltjes integrals provides a more streamlined proof— integration by parts for Riemann-Stieltjes integrals (see for instance (Apostol, 1974, Theorem 7.6, Chapter 7)) implies that the first integral in Equation 52 evaluates to H(1,η^)\ell_{H}(1,\hat{\eta}) and the second evaluates to H(0,η^)\ell_{H}(0,\hat{\eta}).

Corollary 1.

The function H(0,η^)\ell_{H}(0,\hat{\eta}) is upper semi-continuous while the function H(1,η^)\ell_{H}(1,\hat{\eta}) is lower semi-continuous. Furthermore, H(1,η^)\ell_{H}(1,\hat{\eta}) is non-increasing and H(0,η^)\ell_{H}(0,\hat{\eta}) is non-decreasing.

Proof.

The representation Equation 52 implies that H(1,η^)\ell_{H}(1,\hat{\eta}) is non-increasing and H(0,η^)\ell_{H}(0,\hat{\eta}) is non-decreasing in η^\hat{\eta}.

Lemma 8 implies that D+H(η^)D^{+}H(\hat{\eta}) is both non-increasing and right-continuous. For non-increasing functions, right-continuity and lower semi-continuity are equivalent. Consequently, H(0,η^)\ell_{H}(0,\hat{\eta}) is upper semi-continuous while the function H(1,η^)\ell_{H}(1,\hat{\eta}) is lower semi-continuous.

Finally, we will use the H\ell_{H} defined in the previous result to construct the ψ\psi of Lemma 7.

Proof of Lemma 7.

Define a function σ:¯[0,1]\sigma:\overline{\mathbb{R}}\to[0,1] via σ(α)=11+eα\sigma(\alpha)=\frac{1}{1+e^{-\alpha}} for α\alpha\in\mathbb{R} and extend via continuity to ±\pm\infty. Notice that this function satisfies

σ(α)=1σ(α).\sigma(-\alpha)=1-\sigma(\alpha). (53)

To simplify notation, we let H(η)=Cϕ(η)H(\eta)=C_{\phi}^{*}(\eta) and let H\ell_{H} be the loss function defined by Proposition 3. Define the loss ψ\psi by

ψ(α)=H(1,σ(α)).\psi(\alpha)=\ell_{H}(1,\sigma(\alpha)).

This composition is non-increasing and lower semi-continuous.

Next, we verify that the function CψC_{\psi}^{*} coincides with HH:

Cψ(η)\displaystyle C_{\psi}^{*}(\eta) =infαηH(1,σ(α))+(1η)H(1,σ(α))\displaystyle=\inf_{\alpha\in\mathbb{R}}\eta\ell_{H}(1,\sigma(\alpha))+(1-\eta)\ell_{H}(1,\sigma(-\alpha))
=infαηH(1,σ(α))+(1η)H(1,1σ(α))\displaystyle=\inf_{\alpha\in\mathbb{R}}\eta\ell_{H}(1,\sigma(\alpha))+(1-\eta)\ell_{H}(1,1-\sigma(\alpha)) (Equation 53)
=infη^[0,1]ηH(1,η^)+(1η)H(1,1η^)\displaystyle=\inf_{\hat{\eta}\in[0,1]}\eta\ell_{H}(1,\hat{\eta})+(1-\eta)\ell_{H}(1,1-\hat{\eta})
=infη^[0,1]ηH(1,η^)+(1η)H(0,η^)\displaystyle=\inf_{\hat{\eta}\in[0,1]}\eta\ell_{H}(1,\hat{\eta})+(1-\eta)\ell_{H}(0,\hat{\eta}) (Equation 51)
=H(η)\displaystyle=H(\eta) (Proposition 3)

Finally, we’ll argue that ψ(α)>ψ(0)\psi(\alpha)>\psi(0) for all α<0\alpha<0. To start, note that ψ(0)=H(1/2)\psi(0)=H(1/2). The assumption that ϕ\phi is consistent together with Theorem 3 implies that D+H(η)D^{+}H(\eta) must be strictly positive on (0,1/2)(0,1/2) and strictly negative on (1/2,1](1/2,1]. Next, the concavity of HH implies that

ψ(0)=H(1/2)H(η)+D+H(η)(1/2η).\psi(0)=H(1/2)\leq H(\eta)+D^{+}H(\eta)\cdot(1/2-\eta).

If furthermore η<1/2\eta<1/2, then both D+H(η)D^{+}H(\eta) and (1/2η)(1/2-\eta) are strictly positive and consequently one can conclude that

ψ(0)<H(η)+H+(η)(1η)=H(1,η).\psi(0)<H(\eta)+H_{+}^{\prime}(\eta)\cdot(1-\eta)=\ell_{H}(1,\eta). (54)

However, α<0\alpha<0 implies σ(α)<1/2\sigma(\alpha)<1/2 and consequently Equation 54 implies that when α<0\alpha<0,

ψ(0)<H(1,σ(α))=ψ(α).\psi(0)<\ell_{H}(1,\sigma(\alpha))=\psi(\alpha).

It remains to compute limαψ(α)\lim_{\alpha\to\infty}\psi(\alpha):

limαψ(α)=H(1,1)=H(1)=0\lim_{\alpha\to\infty}\psi(\alpha)=\ell_{H}(1,1)=H(1)=0
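As a worked example of this construction (an illustration only; the choice of loss is an assumption of this sketch), take H(η)=2η(1η)H(\eta)=2\sqrt{\eta(1-\eta)}, the minimal conditional risk of the exponential loss. A short computation gives H(1,η^)=(1η^)/η^\ell_{H}(1,\hat{\eta})=\sqrt{(1-\hat{\eta})/\hat{\eta}}, so composing with the sigmoid link yields ψ(α)=eα/2\psi(\alpha)=e^{-\alpha/2}; the sketch below checks numerically that CψC_{\psi}^{*} recovers HH and that this closed form holds.

```python
import numpy as np

H  = lambda e: 2.0 * np.sqrt(e * (1.0 - e))               # C_phi^* of the exponential loss
dH = lambda e: (1.0 - 2.0 * e) / np.sqrt(e * (1.0 - e))   # its derivative on (0, 1)

ell1 = lambda e: H(e) + (1.0 - e) * dH(e)                 # ell_H(1, e) from Proposition 3
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))                # the link used in Lemma 7
psi = lambda a: ell1(sigma(a))                            # psi(alpha) = ell_H(1, sigma(alpha))

alphas = np.linspace(-8.0, 8.0, 4001)
for eta in [0.1, 0.3, 0.5, 0.8]:
    C_psi = np.min(eta * psi(alphas) + (1.0 - eta) * psi(-alphas))
    print(eta, C_psi, H(eta))                             # the last two columns agree
print(np.max(np.abs(psi(alphas) - np.exp(-alphas / 2.0))))  # psi reduces to exp(-alpha/2) here
```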

D.2 Proof of Lemma 8

Proof of Lemma 8.

We’ll begin by proving existence and subsequently we’ll show that this function is non-increasing and right-continuous.

Problem 23 of Chapter 4, (Rudin, 1976) implies that for a concave function HH, if 0<s<t<u<10<s<t<u<1, then

H(t)H(s)tsH(u)H(s)usH(u)H(t)ut.\frac{H(t)-H(s)}{t-s}\geq\frac{H(u)-H(s)}{u-s}\geq\frac{H(u)-H(t)}{u-t}. (55)

The continuity of HH then implies that this inequality holds for 0s<t<u10\leq s<t<u\leq 1. The first inequality implies that the right-hand derivative D+H(s)D^{+}H(s) exists for s[0,1)s\in[0,1) while the second inequality implies that the left-hand derivative DH(s)D^{-}H(s) exists for s(0,1]s\in(0,1].

Next, we’ll prove that D+HD^{+}H is non-increasing. First, Equation 55 implies that if 0s<t<u<v10\leq s<t<u<v\leq 1 then

H(t)H(s)tsH(v)H(u)vu\frac{H(t)-H(s)}{t-s}\geq\frac{H(v)-H(u)}{v-u} (56)

and consequently taking the limits tst\downarrow s and vuv\downarrow u proves that the function D+HD^{+}H is non-increasing.

Next, we prove that D+HD^{+}H is right-continuous. Fix a point ss and define K=limxsD+H(x)K=\lim_{x\downarrow s}D^{+}H(x). We will argue that D+H(s)KD^{+}H(s)\leq K, and consequently, limxsD+H(x)=D+H(s)\lim_{x\downarrow s}D^{+}H(x)=D^{+}H(s). Equation 56 implies that for any x,u,vx,u,v satisfying s<x<u<vs<x<u<v,

KD+H(x)H(v)H(u)vuK\geq D^{+}H(x)\geq\frac{H(v)-H(u)}{v-u}

and thus

K(vu)H(v)H(u)K(v-u)\geq H(v)-H(u)

for any s<u<vs<u<v. Thus by the continuity of HH, this inequality must extend to u=su=s,

K(vs)H(v)H(s)K(v-s)\geq H(v)-H(s)

and taking the limit vsv\downarrow s then proves

KD+H(s)K\geq D^{+}H(s)

and thus the monotonicity of D+HD^{+}H implies D+H(s)=KD^{+}H(s)=K.

Finally, if HH has a strict maximum at 1/21/2, then the super-differential H(η)\partial H(\eta) can include 0 only at η=1/2\eta=1/2. Thus, as D+HD^{+}H is non-increasing, D+H(η)>0D^{+}H(\eta)>0 when η<1/2\eta<1/2 and D+H(η)<0D^{+}H(\eta)<0 when η>1/2\eta>1/2. Thus the function H+H_{+}^{\prime} defined in Equation 50 is non-increasing and right-continuous.

Appendix E Proof of Lemma 2

We define the concave conjugate of a function hh as

h(y)=infxdom(h)yxh(x)h_{*}(y)=\inf_{x\in\operatorname{dom}(h)}yx-h(x)

Recall that conc(h)\operatorname{conc}(h) as defined in Equation 18 is the biconjugate hh_{**}. Consequently, conc(h)\operatorname{conc}(h) can be expressed as

conc(h)(x)=inf{(x): linear, and h on dom(h)}\operatorname{conc}(h)(x)=\inf\{\ell(x):\ell\text{ linear, and }\ell\geq h\text{ on }\operatorname{dom}(h)\} (57)

One can prove Lemma 2 by studying properties of concave conjugates.

Lemma 9.

Let h:[a,b]h:[a,b]\to\mathbb{R} be a non-decreasing function. Then conc(h)\operatorname{conc}(h) is non-decreasing as well.

Proof.

We will argue that if hh is non-decreasing, then it suffices to consider the infimum in Equation 57 over non-decreasing linear functions. Observe that if \ell is a decreasing linear function with (x)h(x)\ell(x)\geq h(x) then, because hh is non-decreasing, the constant function (b)\ell(b) satisfies (b)h(b)h(x)\ell(b)\geq h(b)\geq h(x) and (b)(x)\ell(b)\leq\ell(x). Therefore,

conc(h)(x)=inf{(x): linear, non-decreasing, and h}\operatorname{conc}(h)(x)=\inf\{\ell(x):\ell\text{ linear, non-decreasing, and }\ell\geq h\}

As an infimum of non-decreasing functions is non-decreasing, conc(h)\operatorname{conc}(h) is non-decreasing. ∎

Lemma 10.

Let h:[0,b]h:[0,b]\to\mathbb{R} be a non-decreasing function that is right-continuous at zero with h(0)=0h(0)=0. Then supyh(y)=0\sup_{y}h_{*}(y)=0. Furthermore, there is a sequence yny_{n} with yny_{n}\to\infty and limnh(yn)=0\lim_{n\to\infty}h_{*}(y_{n})=0.

Proof.

First, notice that

h(y)=infx[0,b]yxh(x)y0h(0)=0h_{*}(y)=\inf_{x\in[0,b]}yx-h(x)\leq y\cdot 0-h(0)=0 (58)

for any yy\in\mathbb{R}. It remains to show a sequence yny_{n} for which limnh(yn)=0\lim_{n\to\infty}h_{*}(y_{n})=0.

We will argue that any sequence yny_{n} with

yn>nh(b)supx[1/n,b]h(x)xy_{n}>nh(b)\geq\sup_{x\in[1/n,b]}\frac{h(x)}{x} (59)

satisfies this property.

If x[1/n,b]x\in[1/n,b] and yny_{n} satisfies Equation 59 then

xynh(x)=x(ynh(x)x)>0xy_{n}-h(x)=x\left(y_{n}-\frac{h(x)}{x}\right)>0

and thus Equation 58 implies that

h(yn)=infx[0,1/n)xynh(x)h_{*}(y_{n})=\inf_{x\in[0,1/n)}xy_{n}-h(x)

The monotonicity of hh then implies that

h(yn)h(1/n)h_{*}(y_{n})\geq-h(1/n)

and

limnh(yn)0\lim_{n\to\infty}h_{*}(y_{n})\geq 0

because hh is right-continuous at zero. This relation together with Equation 58 implies the result.

Proof of Lemma 2.

Lemma 9 implies that conc(h)\operatorname{conc}(h) is non-decreasing. Standard results in convex analysis imply that conc(h)\operatorname{conc}(h) is continuous on (0,1/2)(0,1/2) (Hiriart-Urruty & Lemaréchal, 2001, Lemma 3.1.1) and upper semi-continuous on [0,1/2][0,1/2] (Hiriart-Urruty & Lemaréchal, 2001, Theorem 1.3.5). Thus monotonicity implies that for all x[0,1/2]x\in[0,1/2], conc(h)(x)conc(h)(1/2)\operatorname{conc}(h)(x)\leq\operatorname{conc}(h)(1/2) and thus limx1/2conc(h)(x)conc(h)(1/2)\lim_{x\to 1/2}\operatorname{conc}(h)(x)\leq\operatorname{conc}(h)(1/2). We will show the opposite inequality, implying that conch\operatorname{conc}h is continuous at 1/21/2.

First, as the constant function h(1/2)h(1/2) is an upper bound on hh, one can conclude that conc(h)(1/2)=h(1/2)=1\operatorname{conc}(h)(1/2)=h(1/2)=1. Next, recall that conc(h)\operatorname{conc}(h) can be expressed as an infimum of linear functions as in Equation 57. If h\ell\geq h, then (0)0\ell(0)\geq 0 and (1/2)1\ell(1/2)\geq 1. Therefore,

(12δ)=((12δ)12+2δ0)=(12δ)(12)+2δ(0)12δ.\ell(\tfrac{1}{2}-\delta)=\ell((1-2\delta)\cdot\tfrac{1}{2}+2\delta\cdot 0)=(1-2\delta)\ell(\tfrac{1}{2})+2\delta\ell(0)\geq 1-2\delta.

Therefore, the representation Equation 57 implies that conc(h)(1/2δ)12δ\operatorname{conc}(h)(1/2-\delta)\geq 1-2\delta. Taking δ0\delta\to 0 proves that limx1/2conc(h)(x)1\lim_{x\to 1/2}\operatorname{conc}(h)(x)\geq 1. Thus, conc(h)\operatorname{conc}(h) is continuous at 1/21/2, if viewed as a function on [0,1/2][0,1/2].

Next, Lemma 10 implies that h(0)=0h_{**}(0)=0:

h(0)=infyh(y)=supyh(y)=0.h_{**}(0)=\inf_{y\in\mathbb{R}}-h_{*}(y)=-\sup_{y\in\mathbb{R}}h_{*}(y)=0.

Finally, it remains to show that hh_{**} is continuous at 0. The monotonicity of hh_{**} implies that limy0+h(y)=infy(0,1/2]h(y)\lim_{y\to 0^{+}}h_{**}(y)=\inf_{y\in(0,1/2]}h_{**}(y) and consequently

limy0+h(y)=infy(0,1/2]infxyxh(x)=infxinfy(0,1/2]yxh(x)=infxh(x)+{0if x0x2if x<0\displaystyle\lim_{y\to 0^{+}}h_{**}(y)=\inf_{y\in(0,1/2]}\inf_{x\in\mathbb{R}}yx-h_{*}(x)=\inf_{x\in\mathbb{R}}\inf_{y\in(0,1/2]}yx-h_{*}(x)=\inf_{x\in\mathbb{R}}-h_{*}(x)+\begin{cases}0&\text{if }x\geq 0\\ \frac{x}{2}&\text{if }x<0\end{cases}
=min(infx0h(x),infx<0x2h(x))\displaystyle=\min\left(\inf_{x\geq 0}-h_{*}(x),\inf_{x<0}\frac{x}{2}-h_{*}(x)\right) (60)

However, Lemma 10 implies that

infx0h(x)=infxh(x)=0\inf_{x\geq 0}-h_{*}(x)=\inf_{x\in\mathbb{R}}-h_{*}(x)=0 (61)

Notice that if x0x\leq 0,

h(x)=infz[0,1/2]xzh(z)=x2h(12)=x21h_{*}(x)=\inf_{z\in[0,1/2]}xz-h(z)=\frac{x}{2}-h\left(\frac{1}{2}\right)=\frac{x}{2}-1 (62)

Consequently, Equation 62 implies that the second term in the minimum in Equation 60 equals 11, so Equation 61 implies that Equation 60 evaluates to 0.

Appendix F Defining the Function β\beta

Lemma 11.

Let β:[0,1]¯\beta:[0,1]\to\overline{\mathbb{R}} be a function defined by

β(η)=inf{α:ϕ(α)=12(ϕ(0)+Cϕ(η))}\beta(\eta)=\inf\{\alpha:\phi(\alpha)=\frac{1}{2}(\phi(0)+C_{\phi}^{*}(\eta))\}

Then ϕ(β(η))=12(ϕ(0)+Cϕ(η))\phi(\beta(\eta))=\frac{1}{2}(\phi(0)+C_{\phi}^{*}(\eta)), and β\beta is monotonic on [0,1/2][0,1/2] and [1/2,1][1/2,1].

Proof.

The statement ϕ(β(η))=12(ϕ(0)+Cϕ(η))\phi(\beta(\eta))=\frac{1}{2}(\phi(0)+C_{\phi}^{*}(\eta)) is a consequence of the continuity of ϕ\phi. Recall that ϕ\phi is non-increasing, and CϕC_{\phi}^{*} is non-decreasing on [0,1/2][0,1/2]. Thus, β(η)\beta(\eta) must be non-increasing on [0,1/2][0,1/2]. Analogously, β\beta must be non-decreasing on [1/2,1][1/2,1].

Appendix G Proof of Proposition 2

A modified version of Jensen’s inequality will be used at several points in the proof of Proposition 2.

Lemma 12.

Let GG be a concave function with G(0)=0G(0)=0 and let ν\nu be a measure with ν(d)1\nu(\mathbb{R}^{d})\leq 1. Then

G(f)𝑑νG(f𝑑ν)\int G(f)d\nu\leq G\left(\int fd\nu\right)
Proof.

The inequality trivially holds if ν(d)=0\nu(\mathbb{R}^{d})=0, so we assume ν(d)>0\nu(\mathbb{R}^{d})>0. Jensen’s inequality implies that

G(f)𝑑ν=ν(d)(1ν(d)G(f)𝑑ν)ν(d)G(1ν(d)f𝑑ν).\int G(f)d\nu=\nu(\mathbb{R}^{d})\left(\frac{1}{\nu(\mathbb{R}^{d})}\int G(f)d\nu\right)\leq\nu(\mathbb{R}^{d})G\left(\frac{1}{\nu(\mathbb{R}^{d})}\int fd\nu\right).

As G(0)=0G(0)=0, concavity implies that

ν(d)G(1ν(d)fdν)+(1ν(d))G(0)G(fdν)\nu(\mathbb{R}^{d})G\left(\frac{1}{\nu(\mathbb{R}^{d})}\int fd\nu\right)+(1-\nu(\mathbb{R}^{d}))G(0)\leq G\left(\int fd\nu\right)
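A quick numerical check of Lemma 12 with a sub-probability weight vector is given below; the choices of GG, of the integrand, and of the total mass are arbitrary and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
G = np.sqrt                              # concave with G(0) = 0
f = rng.uniform(0.0, 4.0, size=1000)     # an arbitrary non-negative integrand
w = rng.uniform(0.0, 1.0, size=1000)
w *= 0.8 / w.sum()                       # weights of a measure with total mass 0.8 <= 1
assert np.sum(w * G(f)) <= G(np.sum(w * f)) + 1e-12
```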

Proof of Proposition 2.

Define I1(f)I_{1}(f), I0(f)I_{0}(f), I1ϕ(f)I_{1}^{\phi}(f), and I0ϕ(f)I_{0}^{\phi}(f) as in Section 4. We will prove

I1(f)12Φ~(2I1ϕ(f))I_{1}(f)\leq\frac{1}{2}\tilde{\Phi}\big{(}2I_{1}^{\phi}(f)\big{)} (63)

and an analogous argument will imply

I0(f)12Φ~(2I0ϕ(f)).I_{0}(f)\leq\frac{1}{2}\tilde{\Phi}\big{(}2I_{0}^{\phi}(f)\big{)}. (64)

The concavity of Φ~\tilde{\Phi} then implies that

Rϵ(f)Rϵ=I1(f)+I0(f)12Φ~(2I1ϕ(f))+12Φ~(2I0ϕ(f))Φ~(122I1ϕ(f)+122I0ϕ(f))=Φ~(Rϕϵ(f)Rϕ,ϵ).R^{\epsilon}(f)-R_{*}^{\epsilon}=I_{1}(f)+I_{0}(f)\leq\frac{1}{2}\tilde{\Phi}\big{(}2I_{1}^{\phi}(f)\big{)}+\frac{1}{2}\tilde{\Phi}\big{(}2I_{0}^{\phi}(f)\big{)}\leq\tilde{\Phi}\Big{(}\frac{1}{2}2I_{1}^{\phi}(f)+\frac{1}{2}2I_{0}^{\phi}(f)\Big{)}=\tilde{\Phi}\big{(}R_{\phi}^{\epsilon}(f)-R_{\phi,*}^{\epsilon}\big{)}.

We will prove Equation 63; the argument for Equation 64 is analogous. Next, let γ1\gamma_{1}^{*} be the coupling between 1{\mathbb{P}}_{1} and 1{\mathbb{P}}_{1}^{*} supported on Δϵ\Delta_{\epsilon}. The assumption on Φ\Phi implies that

C(η(𝐱),f(𝐱))C(η(𝐱))Φ(Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱)))C(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\leq\Phi\big{(}C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\big{)} (65)

and consequently,

C(η(𝐱),f(𝐱))C(η(𝐱))dγ1Φ(Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱))dγ1)Φ(I1ϕ(f)).\int C(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C^{*}(\eta^{*}({\mathbf{x}}^{\prime}))d\gamma_{1}^{*}\leq\Phi\left(\int C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))d\gamma_{1}^{*}\right)\leq\Phi(I_{1}^{\phi}(f)). (66)

To bound the term Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}, we consider three different cases for (𝐱,𝐱)({\mathbf{x}},{\mathbf{x}}^{\prime}). Define the sets D1D_{1}, E1E_{1}, F1F_{1} by

D1={(𝐱,𝐱):Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=0}D_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=0\}
E1={(𝐱,𝐱):Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=1,f(𝐱)β(η(𝐱))}E_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1,f({\mathbf{x}}^{\prime})\geq\beta(\eta^{*}({\mathbf{x}}^{\prime}))\}
F1={(𝐱,𝐱):Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=1,f(𝐱)<β(η(𝐱))}F_{1}=\{({\mathbf{x}},{\mathbf{x}}^{\prime}):S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1,f({\mathbf{x}}^{\prime})<\beta(\eta^{*}({\mathbf{x}}^{\prime}))\}

with the function β\beta as in Equation 36. The composition βη\beta\circ\eta^{*} is measurable because β\beta is piecewise monotonic; see Lemma 11 in Appendix F. We will show that if T1T_{1} is any of the three sets D1D_{1}, E1E_{1}, F1F_{1}, then

T1Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0dγ1\displaystyle\int_{T_{1}}S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}d\gamma_{1}^{*} (67)
(1G((ϕ(0)Cϕ(η(𝐱)))/2)dγ1)12G(T1((Sϵ(ϕf)(𝐱)ϕ(f(𝐱)))+(Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱)))dγ1)12\displaystyle\leq\left(\int\frac{1}{G\Big{(}\big{(}\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\big{)}/2\Big{)}}d\gamma_{1}^{*}\right)^{\frac{1}{2}}G\left(\int_{T_{1}}\big{(}(S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\big{)}+\left(C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)d\gamma_{1}^{*}\right)^{\frac{1}{2}}

Because GG is concave and non-decreasing, the composition G\sqrt{G} is as well. Summing the inequality in Equation 67 over T1{D1,E1,F1}T_{1}\in\{D_{1},E_{1},F_{1}\} then results in

Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0dγ1\displaystyle\int S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}d\gamma_{1}^{*} 3(1G((ϕ(0)Cϕ(η(𝐱)))/2)𝑑)12G(13I1ϕ(f))12\displaystyle\leq 3\left(\int\frac{1}{G\Big{(}\big{(}\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\big{)}/2\Big{)}}d{\mathbb{P}}^{*}\right)^{\frac{1}{2}}G\left(\frac{1}{3}I_{1}^{\phi}(f)\right)^{\frac{1}{2}} (68)

Summing Equation 66 and Equation 68 results in Equation 63.

It remains to show the inequality Equation 67 for the three sets D1D_{1}, E1E_{1}, F1F_{1}.

  1. A)

    On the set D1D_{1}:

    If Sϵ(𝟏f0)(𝐱)=𝟏f(𝐱)0S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})={\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}, then D1Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0dγ1=0\int_{D_{1}}S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}d\gamma_{1}^{*}=0 while the right-hand side of Equation 67 is non-negative by Lemma 1, which implies Equation 67 for T1=D1T_{1}=D_{1}.

  2. B)

    On the set E1E_{1}:

    If Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=1S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1 but f(𝐱)β(η(𝐱))f({\mathbf{x}}^{\prime})\geq\beta(\eta^{*}({\mathbf{x}}^{\prime})), then Sϵ(ϕf)(𝐱)ϕ(0)S_{\epsilon}(\phi\circ f)({\mathbf{x}})\geq\phi(0) while ϕ(f(𝐱))ϕ(β(η(𝐱)))\phi(f({\mathbf{x}}^{\prime}))\leq\phi(\beta(\eta^{*}({\mathbf{x}}^{\prime}))) and thus Sϵ(ϕf)(𝐱)ϕ(f(𝐱))ϕ(0)ϕ(β(η(𝐱)))=(ϕ(0)Cϕ(η(𝐱)))/2S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\geq\phi(0)-\phi(\beta(\eta^{*}({\mathbf{x}}^{\prime})))=(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime})))/2. Lemma 1 then implies that Sϵ(ϕf)(𝐱)ϕ(f(𝐱))0S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\geq 0 γ1\gamma_{1}^{*}-a.e. and thus

    Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=1=G((ϕ(0)Cϕ(η(𝐱)))/2)G((ϕ(0)Cϕ(η(𝐱)))/2)G(Sϵ(ϕf)(𝐱)ϕ(f(𝐱)))G((ϕ(0)Cϕ(η(𝐱)))/2)γ1-a.e.S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1=\frac{\sqrt{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}}{\sqrt{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}}\leq\frac{\sqrt{G\big{(}S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\big{)}}}{\sqrt{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}}\quad\gamma_{1}^{*}\text{-a.e.} (69)

    Now the Cauchy-Schwarz inequality and the variant of Jensen’s inequality in Lemma 12 imply

    E1Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0dγ1\displaystyle\int_{E_{1}}S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}d\gamma_{1}^{*} (70)
    (E11G((ϕ(0)Cϕ(η(𝐱)))/2)𝑑γ1)12(E1G(Sϵ(ϕf)(𝐱)ϕ(f(𝐱)))𝑑γ1)12\displaystyle\leq\left(\int_{E_{1}}\frac{1}{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}d\gamma_{1}^{*}\right)^{\frac{1}{2}}\left(\int_{E_{1}}G\big{(}S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))\big{)}d\gamma_{1}^{*}\right)^{\frac{1}{2}}
    (1G((ϕ(0)Cϕ(η(𝐱)))/2)𝑑γ1)12G(E1Sϵ(ϕf)(𝐱)ϕ(f(𝐱))dγ1)12\displaystyle\leq\left(\int\frac{1}{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}d\gamma_{1}^{*}\right)^{\frac{1}{2}}G\left(\int_{E_{1}}S_{\epsilon}(\phi\circ f)({\mathbf{x}})-\phi(f({\mathbf{x}}^{\prime}))d\gamma_{1}^{*}\right)^{\frac{1}{2}}
  3. C)

    On the set F1F_{1}:

    First, Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0=1S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}=1 implies that f(𝐱)>0f({\mathbf{x}}^{\prime})>0. If furthermore f(𝐱)<β(η(𝐱))f({\mathbf{x}}^{\prime})<\beta(\eta^{*}({\mathbf{x}}^{\prime})), then both f(𝐱)<β(η(𝐱))f({\mathbf{x}}^{\prime})<\beta(\eta^{*}({\mathbf{x}}^{\prime})) and f(𝐱)<β(η(𝐱))-f({\mathbf{x}}^{\prime})<\beta(\eta^{*}({\mathbf{x}}^{\prime})) and consequently Cϕ(η(𝐱),f(𝐱))ϕ(β(η(𝐱)))C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))\geq\phi(\beta(\eta^{*}({\mathbf{x}}^{\prime}))), and so due to the definition of β\beta in Equation 36:

    Cϕ(η,f(𝐱))Cϕ(η)12(ϕ(0)Cϕ(η)).C_{\phi}(\eta^{*},f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*})\geq\frac{1}{2}(\phi(0)-C_{\phi}^{*}(\eta^{*})).

    The same argument as Equation 69 then implies

    Sϵ(𝟏f0)(𝐱)𝟏f0(𝐱)G(Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱)))G((ϕ(0)Cϕ(η(𝐱)))/2)γ1-a.e.S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f\leq 0}({\mathbf{x}}^{\prime})\leq\frac{\sqrt{G\big{(}C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\big{)}}}{\sqrt{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}}\quad\gamma_{1}^{*}\text{-a.e.}

    Now the Cauchy-Schwarz inequality and Lemma 12 imply

    F1Sϵ(𝟏f0)(𝐱)𝟏f(𝐱)0dγ1\displaystyle\int_{F_{1}}S_{\epsilon}({\mathbf{1}}_{f\leq 0})({\mathbf{x}})-{\mathbf{1}}_{f({\mathbf{x}}^{\prime})\leq 0}d\gamma_{1}^{*}\leq
    (F11G((ϕ(0)Cϕ(η(𝐱)))/2)𝑑γ1)12(F1G(Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱)))𝑑γ1)12\displaystyle\left(\int_{F_{1}}\frac{1}{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}d\gamma_{1}^{*}\right)^{\frac{1}{2}}\left(\int_{F_{1}}G\big{(}C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\big{)}d\gamma_{1}^{*}\right)^{\frac{1}{2}}\leq
    (1G((ϕ(0)Cϕ(η(𝐱)))/2)𝑑1)12G(F1Cϕ(η(𝐱),f(𝐱))Cϕ(η(𝐱))dγ1)12\displaystyle\left(\int\frac{1}{G\left(\left(\phi(0)-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))\right)/2\right)}d{\mathbb{P}}_{1}^{*}\right)^{\frac{1}{2}}G\left(\int_{F_{1}}C_{\phi}(\eta^{*}({\mathbf{x}}^{\prime}),f({\mathbf{x}}^{\prime}))-C_{\phi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))d\gamma_{1}^{*}\right)^{\frac{1}{2}}

Appendix H Technical Integral Lemmas

H.1 The Lebesgue and Riemann–Stieltjes integral of an increasing function

The goal of this section is to prove Equation 38, namely:

Proposition 4.

Let ff be a non-increasing, non-negative, continuous function on an interval [a,b][a,b] and let {\mathbb{Q}} be a finite positive measure. Let zz be a random variable distributed according to {\mathbb{Q}} and define h(α)=(zα)h(\alpha)={\mathbb{Q}}(z\leq\alpha). Then

(a,b]f(z)𝑑(z)=abf(α)𝑑h(α)\int_{(a,b]}f(z)d{\mathbb{Q}}(z)=\int_{a}^{b}f(\alpha)dh(\alpha)

where the integral on the left is defined as the Lebesgue integral in terms of the measure {\mathbb{Q}} while the integral on the right is defined as a Riemann–Stieltjes integral.

Proof.

Recall that the Lebesgue integral f𝑑\int fd{\mathbb{Q}} is defined as

f𝑑=sup{g𝑑:gf,g simple function, }\int fd{\mathbb{Q}}=\sup\left\{\int gd{\mathbb{Q}}:g\leq f,g\text{ simple function, }\right\}

while the Riemann-Stieltjes integral is defined as the value of the limits

f𝑑h=limΔαi0i=0I1f(αi)(h(αi+1)h(αi))=limΔαi0i=0I1f(αi+1)(h(αi+1)h(αi)),\int fdh=\lim_{\Delta\alpha_{i}\to 0}\sum_{i=0}^{I-1}f(\alpha_{i})(h(\alpha_{i+1})-h(\alpha_{i}))=\lim_{\Delta\alpha_{i}\to 0}\sum_{i=0}^{I-1}f(\alpha_{i+1})(h(\alpha_{i+1})-h(\alpha_{i})), (71)

where these limits are evaluated as the size of the partition Δαi=αi+1αi\Delta\alpha_{i}=\alpha_{i+1}-\alpha_{i} approaches 0. A standard analysis result states that these limits exist and are equal whenever ff is continuous (see for instance Theorem 2.24 of (Wheeden & Zygmund, 1977)). Let δ>0\delta>0 be arbitrary and choose a partition {αi}i=0I\{\alpha_{i}\}_{i=0}^{I} for which each of the sums in Equation 71 is within δ\delta of f𝑑h\int fdh, and f(αi)f(αi+1)<δf(\alpha_{i})-f(\alpha_{i+1})<\delta for all ii. Such a partition exists because every continuous function on a compact set is uniformly continuous.

Next, consider two simple functions g1g_{1}, g2g_{2} defined according to

g1(z)=i=0I1f(αi+1)χz(αi,αi+1],g2(z)=i=0I1f(αi)χz(αi,αi+1].g_{1}(z)=\sum_{i=0}^{I-1}f(\alpha_{i+1})\chi_{z\in(\alpha_{i},\alpha_{i+1}]},\quad g_{2}(z)=\sum_{i=0}^{I-1}f(\alpha_{i})\chi_{z\in(\alpha_{i},\alpha_{i+1}]}.

By construction, g_{1}(x)\leq f(x) for all x\in(a,b]. Moreover, since f(\alpha_{i})-f(\alpha_{i+1})<\delta, it follows that f(x)\leq g_{2}(x)+\delta when x\in(a,b]. Now applying the definition of the integral of a simple function, we obtain:

f𝑑hδi=0I1f(αi+1)(h(αi+1)h(αi))=(a,b]g1𝑑(a,b]f𝑑(a,b]g2+δd\displaystyle\int fdh-\delta\leq\sum_{i=0}^{I-1}f(\alpha_{i+1})\big{(}h(\alpha_{i+1})-h(\alpha_{i})\big{)}=\int_{(a,b]}g_{1}d{\mathbb{Q}}\leq\int_{(a,b]}fd{\mathbb{Q}}\leq\int_{(a,b]}g_{2}+\delta d{\mathbb{Q}}
=i=0I1f(αi)(h(αi+1)h(αi))+δ((a,b])f𝑑h+δ+δ((a,b])\displaystyle=\sum_{i=0}^{I-1}f(\alpha_{i})\big{(}h(\alpha_{i+1})-h(\alpha_{i})\big{)}+\delta{\mathbb{Q}}((a,b])\leq\int fdh+\delta+\delta{\mathbb{Q}}((a,b])

As δ\delta is arbitrary, it follows that f𝑑h=f𝑑\int fdh=\int fd{\mathbb{Q}}. ∎

Notice that because H(0)=0, the integral on the right-hand side of Equation 38 is technically an improper integral. Thus, to show Equation 38, one can conclude that

z(δ,1/2]1H(z)𝑑(z)=δ1/21H(α)𝑑h(α)\int_{z\in(\delta,1/2]}\frac{1}{H(z)}d{\mathbb{Q}}(z)=\int_{\delta}^{1/2}\frac{1}{H(\alpha)}dh(\alpha)

from Proposition 4 and then take the limit δ0\delta\to 0.
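
As a purely illustrative sanity check of Proposition 4 (not part of the proof), the following Python snippet compares the two integrals for a discrete measure {\mathbb{Q}} and a non-increasing continuous f; the particular f, the atoms, and the weights are arbitrary choices of ours.

import numpy as np

# Illustrative check of Proposition 4 with a discrete measure Q on (a, b].
a, b = 0.0, 1.0
f = lambda x: np.exp(-3.0 * x)            # non-increasing, non-negative, continuous on [a, b]

rng = np.random.default_rng(0)
atoms = rng.uniform(a, b, size=50)        # support points of Q inside (a, b]
weights = rng.uniform(0.1, 1.0, size=50)  # positive masses, so Q is a finite positive measure

# Lebesgue integral of f with respect to Q: a weighted sum over the atoms.
lebesgue = np.sum(f(atoms) * weights)

# Riemann-Stieltjes integral of f against h(alpha) = Q(z <= alpha),
# approximated with right endpoints on a fine partition of [a, b].
h = lambda alpha: np.sum(weights[atoms <= alpha])
grid = np.linspace(a, b, 20001)
riemann_stieltjes = np.sum(f(grid[1:]) * np.diff([h(t) for t in grid]))

print(lebesgue, riemann_stieltjes)        # the two values agree up to the partition width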

H.2 Proof of the last equality in Equation 39

Even though H=\operatorname{conc}(h) is always continuous by Lemma 2, h may be discontinuous, and thus \int h^{-r}dh may not exist as a Riemann–Stieltjes integral. Consequently, the argument below must avoid this quantity.

The proof of Equation 39 relies on summation by parts and on a well-behaved decomposition of hh that is a consequence of the Lebesgue decomposition of a function of bounded variation. This result states that hh can be decomposed into a continuous portion and a right-continuous “jump” portion.

See Corollary 3.33 of (Ambrosio et al., 2000) or (Stein & Shakarchi, 2005, Exercise 24, Chapter 3) for statements that imply this result.

Proposition 5.

Let h:[0,1/2]\to[0,1] be a non-decreasing and right-continuous function with h(0)=0 and h(1/2)=1. Then one can decompose h as

h=whC+(1w)hJh=wh_{C}+(1-w)h_{J}

where 0\leq w\leq 1, h_{C}, h_{J} are non-decreasing functions mapping [0,1/2] into [0,1] with h_{C}(0)=h_{J}(0)=0, h_{C}(1/2)=h_{J}(1/2)=1, h_{C} is continuous, and h_{J} is a right-continuous step-function.

The goal of this appendix is to prove the following inequality:

Lemma 13.

Let h:[0,1/2][0,1]h:[0,1/2]\to[0,1] be an increasing and right-continuous function with h(0)=0h(0)=0 and h(1/2)=1h(1/2)=1. Let HH be any continuous function with HhH\geq h and let r[0,1)r\in[0,1). Then one can bound the Riemann–Stieltjes integral 1/H(z)r𝑑h\int 1/H(z)^{r}dh by

1H(z)r𝑑h2r1r\int\frac{1}{H(z)^{r}}dh\leq\frac{2^{r}}{1-r}
Proof.

Let h=whC+(1w)hJh=wh_{C}+(1-w)h_{J} be the decomposition in Proposition 5. Thus

1H(z)r𝑑h=w1H(z)r𝑑hC+(1w)1H(z)r𝑑hJ\int\frac{1}{H(z)^{r}}dh=w\int\frac{1}{H(z)^{r}}dh_{C}+(1-w)\int\frac{1}{H(z)^{r}}dh_{J} (72)

We will bound each of the two integrals above separately. Notice that the portions of this decomposition bound H below: wh_{C}\leq H and (1-w)h_{J}\leq H. First,

w1H(z)r𝑑hC=wsupphC1H(z)r𝑑hCwsupphC1(whC(z))r𝑑hC=w1rsupphChCr𝑑hCw\int\frac{1}{H(z)^{r}}dh_{C}=w\int_{\operatorname{supp}h_{C}}\frac{1}{H(z)^{r}}dh_{C}\leq w\int_{\operatorname{supp}h_{C}}\frac{1}{(wh_{C}(z))^{r}}dh_{C}=w^{1-r}\int_{\operatorname{supp}h_{C}}h_{C}^{-r}dh_{C}

Now by comparing the limiting sums that define the Riemann–Stieltjes integral \int h_{C}^{-r}dh_{C} with the limit of the Riemann sums for the integral \int y^{-r}dy, one can conclude that \int h_{C}^{-r}dh_{C} can be evaluated as a Riemann integral in the variable h_{C}. This argument relies on the continuity of the function h_{C}. Thus \int_{\operatorname{supp}h_{C}}h_{C}^{-r}dh_{C}\leq 1/(1-r). Consequently:

w1H(z)r𝑑hCw1r1rw\int\frac{1}{H(z)^{r}}dh_{C}\leq\frac{w^{1-r}}{1-r} (73)

Next, because h_{J} is right-continuous and h_{J}(0)=0, one can write

hJ(z)=k=0K1ak𝟏[zk,zk+1)(z)+aK𝟏zKh_{J}(z)=\sum_{k=0}^{K-1}a_{k}{\mathbf{1}}_{[z_{k},z_{k+1})}(z)+a_{K}{\mathbf{1}}_{z_{K}} (74)

with K\in\mathbb{N}\cup\{\infty\}. Furthermore, because h_{J}(0)=0, h_{J}(1/2)=1, and h_{J} is non-decreasing, one can conclude that z_{0}=0, z_{K}=1/2, a_{0}=0, a_{K}=1, and a_{k}\leq a_{k+1}. Lastly, one can require that a_{k}<a_{k+1} in this representation, since otherwise one could express h_{J} as in Equation 74 but with a smaller value of K. Thus the second integral in Equation 72 evaluates to

(1w)1H(z)r𝑑hJ=(1w)k=0K11H(zk+1)r(ak+1ak).\displaystyle(1-w)\int\frac{1}{H(z)^{r}}dh_{J}=(1-w)\sum_{k=0}^{K-1}\frac{1}{H(z_{k+1})^{r}}(a_{k+1}-a_{k}).

Recalling that H(z)(1w)hJH(z)\geq(1-w)h_{J}, this quantity is bounded above by

(1w)k=0K11H(zk+1)r(ak+1ak)(1w)k=0K11(1w)rhJ(zk+1)r(ak+1ak)=(1w)1rk=0K11ak+1r(ak+1ak)(1-w)\sum_{k=0}^{K-1}\frac{1}{H(z_{k+1})^{r}}(a_{k+1}-a_{k})\leq(1-w)\sum_{k=0}^{K-1}\frac{1}{(1-w)^{r}h_{J}(z_{k+1})^{r}}(a_{k+1}-a_{k})=(1-w)^{1-r}\sum_{k=0}^{K-1}\frac{1}{a_{k+1}^{r}}(a_{k+1}-a_{k}) (75)

Because the function y\mapsto y^{-r} is decreasing in y, each term satisfies a_{k+1}^{-r}(a_{k+1}-a_{k})\leq\int_{a_{k}}^{a_{k+1}}y^{-r}dy, and consequently Equation 75 is bounded above by

(1-w)^{1-r}\sum_{k=0}^{K-1}\int_{a_{k}}^{a_{k+1}}y^{-r}dy=(1-w)^{1-r}\int_{0}^{1}y^{-r}dy=\frac{(1-w)^{1-r}}{1-r}

Combining this bound with Equation 73 shows that Equation 72 is bounded above by

11r(w1r+(1w)1r).\frac{1}{1-r}(w^{1-r}+(1-w)^{1-r}).

Maximizing this quantity over w\in[0,1] (the maximum equals 2\cdot(1/2)^{1-r}=2^{r} and is attained at w=1/2) proves the result. ∎
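
The bound of Lemma 13 can be sanity-checked numerically. The sketch below (our illustration, not an object from the paper) takes an h on [0,1/2] with a continuous piece and a single jump, a hand-picked continuous majorant H\geq h, and an exponent r, and approximates the Riemann–Stieltjes integral \int H(z)^{-r}dh on a fine grid; the computed value stays below 2^{r}/(1-r).

import numpy as np

# h = w*h_C + (1-w)*h_J on [0, 1/2]: continuous piece h_C(z) = 2z and a jump
# function h_J with a single jump from 0 to 1 at z = 1/4, mixed with w = 1/2.
w = 0.5
h = lambda z: w * 2.0 * z + (1.0 - w) * (z >= 0.25)

# A hand-picked continuous majorant H >= h on [0, 1/2].
H = lambda z: np.minimum(1.0, 2.0 * z + 0.5)

r = 0.7
grid = np.linspace(0.0, 0.5, 200001)
hv = h(grid)
assert np.all(H(grid) >= hv)              # H really does dominate h

# Riemann-Stieltjes sum for int H(z)^{-r} dh, using right endpoints so that
# the jump of h at 1/4 is weighted by H(1/4).
integral = np.sum(np.diff(hv) / H(grid[1:]) ** r)

print(integral, 2.0 ** r / (1.0 - r))     # the integral stays below 2^r / (1 - r)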

Appendix I Optimizing the Bound of Lemma 3 over rr

Proof of Theorem 13.

Let

f(r)=11rarf(r)=\frac{1}{1-r}a^{r}

Then

f^{\prime}(r)=\frac{a^{r}}{(1-r)^{2}}+\frac{a^{r}\ln a}{1-r}

Solving f^{\prime}(r^{*})=0 produces r^{*}=1+\frac{1}{\ln a}, and

f\left(1+\frac{1}{\ln a}\right)=-\ln a\cdot a^{1+\frac{1}{\ln a}}=-ea\ln a\quad\text{(using }a^{1/\ln a}=e\text{)}

One can verify that this point is a minimum via the second derivative test:

f(r)=(11r+lna)f(r)f^{\prime}(r)=\left(\frac{1}{1-r}+\ln a\right)f(r)

and thus

f′′(r)=(11r+lna)f(r)+1(1r)2f(r).f^{\prime\prime}(r)=\left(\frac{1}{1-r}+\ln a\right)f^{\prime}(r)+\frac{1}{(1-r)^{2}}f(r).

Consequently, f′′(r)=ln(a)2f(1+1lna)>0f^{\prime\prime}(r^{*})=\ln(a)^{2}f(1+\frac{1}{\ln a})>0.

However, the point rr^{*} is in the interval [0,1][0,1] only when a[0,e1]a\in[0,e^{-1}]. When a>e1a>e^{-1}, ff is minimized over [0,1][0,1] at r=0r=0. Because rr^{*} is a minimizer when a[0,e1]a\in[0,e^{-1}], one can bound f(0)f(r)f(0)\geq f(r^{*}) over this set and thus

f(r)min(1,ealna)f(r)\leq\min\left(1,-ea\ln a\right)

Next, let \Lambda be defined as in Lemma 3. Then the concavity of \Lambda and the fact that \Lambda(0)=0 imply that 6\Lambda(z/6)\geq 2\Lambda(z/2): indeed, z/6=\frac{1}{3}\cdot\frac{z}{2}, and concavity together with \Lambda(0)=0 gives \Lambda(\lambda x)\geq\lambda\Lambda(x) for \lambda\in[0,1].
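
The stationary-point calculation above is easy to confirm numerically. The snippet below is only a sanity check: it compares r^{*}=1+1/\ln a with a grid-search minimizer of f(r)=a^{r}/(1-r) over [0,1) and checks that the minimal value equals -ea\ln a; the value a=0.05\in(0,e^{-1}) is an arbitrary choice of ours.

import numpy as np

a = 0.05                                   # any value in (0, 1/e)
f = lambda r: a ** r / (1.0 - r)

r_star = 1.0 + 1.0 / np.log(a)             # stationary point from f'(r*) = 0
rs = np.linspace(0.0, 0.999, 200001)
r_grid = rs[np.argmin(f(rs))]              # grid-search minimizer over [0, 1)

print(r_star, r_grid)                      # agree up to the grid resolution
print(f(r_star), -np.e * a * np.log(a))    # the minimal value equals -e * a * ln(a)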

Appendix J Further details from Examples 2 and 3

In Sections J.1 and J.2, we use an operation analogous to SϵS_{\epsilon}. Let IϵI_{\epsilon} be an operation on functions that computes the infimum over an ϵ\epsilon-ball. Formally, we define:

Iϵ(g)(𝐱)=inf𝐱𝐱ϵg(𝐱).I_{\epsilon}(g)({\mathbf{x}})=\inf_{\|{\mathbf{x}}^{\prime}-{\mathbf{x}}\|\leq\epsilon}g({\mathbf{x}}^{\prime}). (76)

Next, we define a mapping \alpha_{\phi} from \eta\in[0,1] to minimizers of C_{\phi}(\eta,\cdot) by

αϕ(η)=inf{α:α is a minimizer of Cϕ(η,)}.\alpha_{\phi}(\eta)=\inf\{\alpha:\alpha\text{ is a minimizer of }C_{\phi}(\eta,\cdot)\}. (77)

Lemma 25 of (Frank & Niles-Weed, 2024b) shows that the function αϕ\alpha_{\phi} defined in Equation 77 maps η\eta to the smallest minimizer of Cϕ(η,)C_{\phi}(\eta,\cdot) and is non-decreasing. This property will be instrumental in constructing minimizers for RϕϵR_{\phi}^{\epsilon}.
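
As a concrete illustration of the map \alpha_{\phi} in Equation 77, take the exponential loss \phi(\alpha)=e^{-\alpha} (our choice, purely for illustration). Then C_{\phi}(\eta,\alpha)=\eta e^{-\alpha}+(1-\eta)e^{\alpha} has the unique minimizer \alpha_{\phi}(\eta)=\frac{1}{2}\ln(\eta/(1-\eta)), which is indeed non-decreasing in \eta. The short Python sketch below recovers this minimizer numerically.

import numpy as np

phi = lambda alpha: np.exp(-alpha)                       # exponential loss (illustrative choice)
C = lambda eta, alpha: eta * phi(alpha) + (1 - eta) * phi(-alpha)

alphas = np.linspace(-8.0, 8.0, 400001)                  # search grid for the minimizer

def alpha_phi(eta):
    """Numerically compute the smallest minimizer of C_phi(eta, .)."""
    return alphas[np.argmin(C(eta, alphas))]

etas = np.linspace(0.05, 0.95, 10)
numeric = np.array([alpha_phi(e) for e in etas])
closed_form = 0.5 * np.log(etas / (1 - etas))

print(np.max(np.abs(numeric - closed_form)))             # tiny; alpha_phi is non-decreasing in eta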

J.1 Calculating the optimal 0{\mathbb{P}}_{0}^{*}, 1{\mathbb{P}}_{1}^{*} for Example 2

First, notice that a minimizer of RϕR_{\phi} is given by f(x)=αϕ(η(x))f(x)=\alpha_{\phi}(\eta(x)) with η(x)\eta(x) as defined in Equation 19.

Below, we construct a minimizer f^{*} for R_{\phi}^{\epsilon}. We'll verify that this function is a minimizer by showing that R_{\phi}^{\epsilon}(f^{*})=R_{\phi}(f). As the minimal possible adversarial surrogate risk is bounded below by R_{\phi,*}, one can conclude that f^{*} minimizes R_{\phi}^{\epsilon}. Consequently, \bar{R}_{\phi}({\mathbb{P}}_{0},{\mathbb{P}}_{1})=R_{\phi}^{\epsilon}(f^{*}), and thus the strong duality result in Theorem 9 implies that {\mathbb{P}}_{0}, {\mathbb{P}}_{1} must maximize the dual problem.

Define a function η~:[δϵ1,1+δ+ϵ][0,1]\tilde{\eta}:[-\delta-\epsilon-1,1+\delta+\epsilon]\to[0,1] by

η~(x)={14if x[1δϵ,0)12if x=034if x(0,1+δ+ϵ]\tilde{\eta}(x)=\begin{cases}\frac{1}{4}&\text{if }x\in[-1-\delta-\epsilon,0)\\ \frac{1}{2}&\text{if }x=0\\ \frac{3}{4}&\text{if }x\in(0,1+\delta+\epsilon]\end{cases}

and a function ff^{*} by f(x)=αϕ(η~(x))f^{*}(x)=\alpha_{\phi}(\tilde{\eta}(x)). Both η~\tilde{\eta} and αϕ\alpha_{\phi} are non-decreasing, and so ff^{*} must be non-decreasing as well. Consequently, Sϵ(ϕ(f))(x)=ϕ(Iϵ(f)(x))=ϕ(f(xϵ))S_{\epsilon}(\phi(f^{*}))(x)=\phi(I_{\epsilon}(f^{*})(x))=\phi(f^{*}(x-\epsilon)) and similarly, Sϵ(ϕ(f))(x)=ϕ(Sϵ(f)(x))=ϕ(f(x+ϵ))S_{\epsilon}(\phi(-f^{*}))(x)=\phi(-S_{\epsilon}(f^{*})(x))=\phi(-f^{*}(x+\epsilon)). (Recall the IϵI_{\epsilon} operation was defined in Equation 76.)

Therefore,

Rϕϵ(f)\displaystyle R_{\phi}^{\epsilon}(f^{*}) =Sϵ(ϕ(f))(x)p1(x)𝑑x+Sϵ(ϕ(f))(x)p0(x)𝑑x=ϕ(f(xϵ))p1(x)𝑑x+ϕ(f(x+ϵ))p0(x)𝑑x\displaystyle=\int S_{\epsilon}(\phi(f^{*}))(x)p_{1}(x)dx+\int S_{\epsilon}(\phi(-f^{*}))(x)p_{0}(x)dx=\int\phi(f^{*}(x-\epsilon))p_{1}(x)dx+\int\phi(-f^{*}(x+\epsilon))p_{0}(x)dx
=ϕ(f(x))p1(x+ϵ)𝑑x+ϕ(f(x))p0(xϵ)𝑑x\displaystyle=\int\phi(f^{*}(x))p_{1}(x+\epsilon)dx+\int\phi(-f^{*}(x))p_{0}(x-\epsilon)dx

However, the single point 0 has Lebesgue measure zero, so it contributes nothing to the integral even if it is contained in \operatorname{supp}p_{1}(\cdot+\epsilon):

ϕ(f(x))p1(x+ϵ)𝑑x=1δϵδϵ18ϕ(αϕ(14))𝑑x+δϵ1+δϵ18ϕ(αϕ(34))𝑑x=ϕ(f(x))p1(x)𝑑x\int\phi(f^{*}(x))p_{1}(x+\epsilon)dx=\int_{-1-\delta-\epsilon}^{-\delta-\epsilon}\frac{1}{8}\phi\Big{(}\alpha_{\phi}\Big{(}\frac{1}{4}\Big{)}\Big{)}dx+\int_{\delta-\epsilon}^{1+\delta-\epsilon}\frac{1}{8}\phi\Big{(}\alpha_{\phi}\Big{(}\frac{3}{4}\Big{)}\Big{)}dx=\int\phi(f(x))p_{1}(x)dx

Analogously, one can show that

ϕ(f(x))p0(xϵ)𝑑x=ϕ(f(x))p0(x)𝑑x\int\phi(-f^{*}(x))p_{0}(x-\epsilon)dx=\int\phi(-f(x))p_{0}(x)dx

and consequently Rϕϵ(f)=Rϕ(f)R_{\phi}^{\epsilon}(f^{*})=R_{\phi}(f).
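
The key step above is the identity S_{\epsilon}(\phi(f^{*}))(x)=\phi(f^{*}(x-\epsilon)), which only uses that \phi is non-increasing and f^{*} is non-decreasing. The following sketch checks this identity by brute force on a grid; the loss \phi(\alpha)=e^{-\alpha} and the monotone function f=\tanh are illustrative choices of ours, not objects from Example 2.

import numpy as np

eps = 0.3
phi = lambda t: np.exp(-t)                 # any non-increasing loss (illustrative choice)
f = np.tanh                                # any non-decreasing function (illustrative choice)

xs = np.linspace(-2.0, 2.0, 401)
fine = np.linspace(-2.0 - eps, 2.0 + eps, 40001)

# Brute-force S_eps(phi(f))(x): supremum of phi(f(.)) over the eps-ball around x.
lhs = np.array([np.max(phi(f(fine[np.abs(fine - x) <= eps]))) for x in xs])
rhs = phi(f(xs - eps))                     # claimed closed form for non-decreasing f

print(np.max(np.abs(lhs - rhs)))           # small: the supremum is attained at x - eps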

J.2 Calculating the optimal 0{\mathbb{P}}_{0}^{*} and 1{\mathbb{P}}_{1}^{*} for Example 3

We will show that the densities in Equation 20 are dual optimal by finding a function ff^{*} for which Rϕϵ(f)=R¯ϕ(0,1)R_{\phi}^{\epsilon}(f^{*})=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}). Theorem 9 will then imply that 0{\mathbb{P}}_{0}^{*}, 1{\mathbb{P}}_{1}^{*} must maximize the dual.

Let \eta^{*}(x) be defined by

η(x)=p1(x)p1(x)+p0(x),\eta^{*}(x)=\frac{p_{1}^{*}(x)}{p_{1}^{*}(x)+p_{0}^{*}(x)},

with p0(x)p_{0}^{*}(x) and p1(x)p_{1}^{*}(x) as in Equation 20. For a given loss ϕ\phi we will prove that the optimal function ff^{*} is given by

f(x)=αϕ(η(x)).f^{*}(x)=\alpha_{\phi}(\eta^{*}(x)).

The function \eta^{*} evaluates to

η(x)=11+eμ1μ02ϵσ2(μ1+μ02x)\eta^{*}(x)=\frac{1}{1+e^{\frac{\mu_{1}-\mu_{0}-2\epsilon}{\sigma^{2}}(\frac{\mu_{1}+\mu_{0}}{2}-x)}}
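
Assuming, as noted below, that p_{0}^{*}(x)=p_{0}(x-\epsilon) and p_{1}^{*}(x)=p_{1}(x+\epsilon) are the \epsilon-shifted Gaussian components with means \mu_{0}, \mu_{1} and standard deviation \sigma, this closed form can be checked numerically; the parameter values below are arbitrary.

import numpy as np

mu0, mu1, sigma, eps = 0.0, 2.0, 1.0, 0.4   # arbitrary parameters with mu1 - mu0 > 2 * eps

gauss = lambda x, m: 0.5 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - m) ** 2 / (2 * sigma ** 2))
p0_star = lambda x: gauss(x - eps, mu0)     # assumed form p0*(x) = p0(x - eps)
p1_star = lambda x: gauss(x + eps, mu1)     # assumed form p1*(x) = p1(x + eps)

xs = np.linspace(-5.0, 7.0, 1001)
eta_from_ratio = p1_star(xs) / (p1_star(xs) + p0_star(xs))
eta_closed = 1.0 / (1.0 + np.exp((mu1 - mu0 - 2 * eps) / sigma ** 2 * ((mu1 + mu0) / 2 - xs)))

print(np.max(np.abs(eta_from_ratio - eta_closed)))   # agreement to machine precision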

If μ1μ0>2ϵ\mu_{1}-\mu_{0}>2\epsilon, the function η(x)\eta^{*}(x) is increasing in xx and consequently ff^{*} is non-decreasing. Therefore, Sϵ(ϕ(f))(x)=ϕ(Iϵ(f)(x))=ϕ(f(xϵ))S_{\epsilon}(\phi(f^{*}))(x)=\phi(I_{\epsilon}(f^{*})(x))=\phi(f^{*}(x-\epsilon)) (recall IϵI_{\epsilon} was defined in Equation 76). Similarly, one can argue that Sϵ(ϕ(f))(x)=ϕ(f(x+ϵ))S_{\epsilon}(\phi(-f^{*}))(x)=\phi(-f^{*}(x+\epsilon)). Therefore,

Rϕϵ(f)\displaystyle R_{\phi}^{\epsilon}(f^{*}) =Sϵ(ϕ(f))(x)p1(x)𝑑x+Sϵ(ϕ(f))(x)p0(x)𝑑x=ϕ(f(xϵ))p1(x)𝑑x+ϕ(f(x+ϵ))p0(x)𝑑x\displaystyle=\int S_{\epsilon}(\phi(f^{*}))(x)p_{1}(x)dx+\int S_{\epsilon}(\phi(-f^{*}))(x)p_{0}(x)dx=\int\phi(f^{*}(x-\epsilon))p_{1}(x)dx+\int\phi(-f^{*}(x+\epsilon))p_{0}(x)dx
=ϕ(f(x))p1(x+ϵ)𝑑x+ϕ(f(x))p0(xϵ)𝑑x\displaystyle=\int\phi(f^{*}(x))p_{1}(x+\epsilon)dx+\int\phi(-f^{*}(x))p_{0}(x-\epsilon)dx

Next, notice that p1(x+ϵ)=p1(x)p_{1}(x+\epsilon)=p_{1}^{*}(x) and p0(xϵ)=p0(x)p_{0}(x-\epsilon)=p_{0}^{*}(x). Define =0+1{\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}. Then

Rϕϵ(f)\displaystyle R_{\phi}^{\epsilon}(f^{*}) =ηϕ(αϕ(η))+(1η)ϕ(αϕ(η))d=Cϕ(η)𝑑=R¯ϕ(0,1)\displaystyle=\int\eta^{*}\phi(\alpha_{\phi}(\eta^{*}))+(1-\eta^{*})\phi(-\alpha_{\phi}(\eta^{*}))d{\mathbb{P}}^{*}=\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*}=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})

Consequently, the strong duality result in Theorem 9 implies that {\mathbb{P}}_{0}^{*}, {\mathbb{P}}_{1}^{*} must maximize the dual \bar{R}_{\phi}.
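
For a concrete loss, one can also confirm R_{\phi}^{\epsilon}(f^{*})=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}) numerically. The sketch below uses the exponential loss \phi(\alpha)=e^{-\alpha} (our illustrative choice, for which \alpha_{\phi}(\eta)=\frac{1}{2}\ln(\eta/(1-\eta)) and C_{\phi}^{*}(\eta)=2\sqrt{\eta(1-\eta)}) together with the same assumed shifted Gaussian densities, and evaluates both sides by quadrature on a truncated grid.

import numpy as np

mu0, mu1, sigma, eps = 0.0, 2.0, 1.0, 0.4   # requires mu1 - mu0 > 2 * eps

gauss = lambda x, m: 0.5 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - m) ** 2 / (2 * sigma ** 2))
p0 = lambda x: gauss(x, mu0)
p1 = lambda x: gauss(x, mu1)
p0_star = lambda x: p0(x - eps)              # assumed epsilon-shifted dual densities
p1_star = lambda x: p1(x + eps)

phi = lambda a: np.exp(-a)                   # exponential loss (illustrative choice)
eta_star = lambda x: p1_star(x) / (p1_star(x) + p0_star(x))
f_star = lambda x: 0.5 * np.log(eta_star(x) / (1 - eta_star(x)))   # alpha_phi(eta*) for this loss

xs = np.linspace(-12.0, 14.0, 200001)
dx = xs[1] - xs[0]

# Primal side: f* is non-decreasing, so S_eps(phi(f*))(x) = phi(f*(x - eps)), etc.
primal = np.sum((phi(f_star(xs - eps)) * p1(xs) + phi(-f_star(xs + eps)) * p0(xs)) * dx)

# Dual side: integral of C*_phi(eta*) = 2*sqrt(eta*(1 - eta*)) against P* = P0* + P1*.
dual = np.sum(2.0 * np.sqrt(eta_star(xs) * (1.0 - eta_star(xs))) * (p0_star(xs) + p1_star(xs)) * dx)

print(primal, dual)                          # the two values agree up to quadrature error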

J.3 Showing Equation 21

Lemma 14.

Consider an equal Gaussian mixture with variance \sigma^{2} and means \mu_{0}<\mu_{1}, with pdfs given by

p0(x)=1212πσe(xμ0)22σ2,p1(x)=1212πσe(xμ1)22σ2p_{0}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_{0})^{2}}{2\sigma^{2}}},\quad p_{1}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma^{2}}}

Let η(x)=p1(x)/(p0(x)+p1(x))\eta(x)=p_{1}(x)/(p_{0}(x)+p_{1}(x)). Then |η(x)1/2|z|\eta(x)-1/2|\leq z is equivalent to x[μ0+μ12Δ(z),μ0+μ12+Δ(z)]x\in[\frac{\mu_{0}+\mu_{1}}{2}-\Delta(z),\frac{\mu_{0}+\mu_{1}}{2}+\Delta(z)], where Δ(z)\Delta(z) is defined by

Δ(z)=σ2μ1μ0ln(12+z12z).\Delta(z)=\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\ln\left(\frac{\frac{1}{2}+z}{\frac{1}{2}-z}\right). (78)
Proof.

The function η\eta can be rewritten as η(x)=1/(1+p0/p1)\eta(x)=1/(1+p_{0}/p_{1}) while

p0(x)p1(x)=exp((xμ0)22σ2+(xμ1)22σ2)=exp(μ1μ0σ2(μ1+μ02x))\frac{p_{0}(x)}{p_{1}(x)}=\exp\left(-\frac{(x-\mu_{0})^{2}}{2\sigma^{2}}+\frac{(x-\mu_{1})^{2}}{2\sigma^{2}}\right)=\exp\left(\frac{\mu_{1}-\mu_{0}}{\sigma^{2}}\left(\frac{\mu_{1}+\mu_{0}}{2}-x\right)\right)

Consequently, |η(x)1/2|z|\eta(x)-1/2|\leq z is equivalent to

12z1exp(μ1μ0σ2(μ1+μ02x))+112+z\frac{1}{2}-z\leq\frac{1}{\exp\left(\frac{\mu_{1}-\mu_{0}}{\sigma^{2}}(\frac{\mu_{1}+\mu_{0}}{2}-x)\right)+1}\leq\frac{1}{2}+z

which is equivalent to

μ1+μ02σ2μ1μ0ln(112z1)xμ1+μ02σ2μ1μ0ln(1z+121)\frac{\mu_{1}+\mu_{0}}{2}-\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\ln\left(\frac{1}{\frac{1}{2}-z}-1\right)\leq x\leq\frac{\mu_{1}+\mu_{0}}{2}-\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\ln\left(\frac{1}{z+\frac{1}{2}}-1\right)

Finally, notice that

Δ(z)=σ2μ1μ0ln(112z1)\Delta(z)=\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\ln\left(\frac{1}{\frac{1}{2}-z}-1\right) (79)

while

Δ(z)=σ2μ1μ0ln(1z+121)-\Delta(z)=\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\ln\left(\frac{1}{z+\frac{1}{2}}-1\right)
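
Lemma 14 is easy to spot-check numerically: at x=\frac{\mu_{0}+\mu_{1}}{2}\pm\Delta(z) the posterior \eta should equal \frac{1}{2}\pm z. The snippet below does exactly this; the parameter values are arbitrary test choices of ours.

import numpy as np

mu0, mu1, sigma = -1.0, 3.0, 1.5             # arbitrary test parameters

gauss = lambda x, m: 0.5 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - m) ** 2 / (2 * sigma ** 2))
eta = lambda x: gauss(x, mu1) / (gauss(x, mu0) + gauss(x, mu1))
Delta = lambda z: sigma ** 2 / (mu1 - mu0) * np.log((0.5 + z) / (0.5 - z))   # Equation 78

zs = np.linspace(0.01, 0.49, 25)
m = (mu0 + mu1) / 2.0
print(np.max(np.abs(eta(m + Delta(zs)) - (0.5 + zs))))   # essentially zero
print(np.max(np.abs(eta(m - Delta(zs)) - (0.5 - zs))))   # essentially zero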

Lemma 15.

Let p_{0},p_{1}, and \eta be as in Lemma 14 and let h(z)={\mathbb{P}}(|\eta-1/2|\leq z). If \mu_{1}-\mu_{0}\leq 4\sqrt{2}\sigma, then h is concave.

Proof.

To start, we calculate the derivatives of \Delta(z) and of p_{0} and p_{1}.

The first derivative of Δ\Delta is

Δ(z)=σ2μ1μ0114z2.\Delta^{\prime}(z)=\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\cdot\frac{1}{\frac{1}{4}-z^{2}}.

and the second derivative of Δ(z)\Delta(z) is

Δ′′(z)=σ2μ1μ02z(14z2)2\Delta^{\prime\prime}(z)=\frac{\sigma^{2}}{\mu_{1}-\mu_{0}}\cdot\frac{2z}{(\frac{1}{4}-z^{2})^{2}} (80)

Next, one can calculate the derivative of p0p_{0} as

p0(x)=1212πσ(xμ0)σ2e(xμ0)22σ2=(xμ0)σ2p0(x)p_{0}^{\prime}(x)=\frac{1}{2}\cdot\frac{1}{\sqrt{2\pi}\sigma}\cdot\frac{-(x-\mu_{0})}{\sigma^{2}}e^{-\frac{(x-\mu_{0})^{2}}{2\sigma^{2}}}=-\frac{(x-\mu_{0})}{\sigma^{2}}p_{0}(x) (81)

and similarly

p1(x)=(xμ1)σ2p1(x)p_{1}^{\prime}(x)=-\frac{(x-\mu_{1})}{\sigma^{2}}p_{1}(x) (82)

Let p(x)=p_{0}(x)+p_{1}(x). Lemma 14 implies that the function h is given by h(z)=\int_{\frac{\mu_{0}+\mu_{1}}{2}-\Delta(z)}^{\frac{\mu_{0}+\mu_{1}}{2}+\Delta(z)}p(x)dx. The first derivative of h is then

h(z)=(p(μ1+μ02+Δ(z))+p(μ1+μ02Δ(z)))Δ(z).h^{\prime}(z)=\left(p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\right)\Delta^{\prime}(z).

Differentiating hh twice results in

h′′(z)\displaystyle h^{\prime\prime}(z) =(p(μ1+μ02+Δ(z))+p(μ1+μ02Δ(z)))Δ′′(z)\displaystyle=\left(p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\right)\Delta^{\prime\prime}(z)
+(p(μ1+μ02+Δ(z))p(μ1+μ02Δ(z)))(Δ(z))2\displaystyle+\left(p^{\prime}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}-p^{\prime}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\right)(\Delta^{\prime}(z))^{2}
=(p(μ1+μ02+Δ(z))+p(μ1+μ02Δ(z)))(Δ′′(z)Δ(z)Δ(z)2σ2)\displaystyle=\left(p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\right)\left(\Delta^{\prime\prime}(z)-\frac{\Delta(z)\Delta^{\prime}(z)^{2}}{\sigma^{2}}\right) (83)
+(p1(μ1+μ02+Δ(z))+p1(μ1+μ02Δ(z)))μ1μ02σ2(Δ(z))2\displaystyle+\bigg{(}p_{1}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p_{1}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\bigg{)}\frac{\mu_{1}-\mu_{0}}{2\sigma^{2}}(\Delta^{\prime}(z))^{2} (84)
(p0(μ1+μ02+Δ(z))+p0(μ1+μ02Δ(z)))μ1μ02σ2(Δ(z))2.\displaystyle-\bigg{(}p_{0}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p_{0}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\bigg{)}\frac{\mu_{1}-\mu_{0}}{2\sigma^{2}}(\Delta^{\prime}(z))^{2}. (85)

where the final equality is a consequence of Equations 81 and 82. Next, we’ll argue that the sum of the terms in Equations 84 and 85 is zero:

(p1(μ1+μ02+Δ(z))+p1(μ1+μ02Δ(z)))(p0(μ1+μ02+Δ(z))+p0(μ1+μ02Δ(z)))\displaystyle\bigg{(}p_{1}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p_{1}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\bigg{)}-\bigg{(}p_{0}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}+\Delta(z)\Big{)}+p_{0}\Big{(}\frac{\mu_{1}+\mu_{0}}{2}-\Delta(z)\Big{)}\bigg{)}
=122πσ((e(μ0μ12+Δ(z))22σ2+e(μ0μ12Δ(z))22σ2)(e(μ1μ02+Δ(z))22σ2+e(μ1μ02Δ(z))22σ2))\displaystyle=\frac{1}{2\sqrt{2\pi}\sigma}\left(\left(e^{-\frac{\left(\frac{\mu_{0}-\mu_{1}}{2}+\Delta(z)\right)^{2}}{2\sigma^{2}}}+e^{-\frac{\left(\frac{\mu_{0}-\mu_{1}}{2}-\Delta(z)\right)^{2}}{2\sigma^{2}}}\right)-\left(e^{-\frac{\left(\frac{\mu_{1}-\mu_{0}}{2}+\Delta(z)\right)^{2}}{2\sigma^{2}}}+e^{-\frac{\left(\frac{\mu_{1}-\mu_{0}}{2}-\Delta(z)\right)^{2}}{2\sigma^{2}}}\right)\right)
=0\displaystyle=0

Next, we'll show that under the assumption \mu_{1}-\mu_{0}\leq\sqrt{2}\sigma, the term in Equation 83 is non-positive. Define k=\sigma^{2}/(\mu_{1}-\mu_{0}). Then

Δ′′(z)Δ(z)Δ(z)2σ2=2k(14z2)2(zk22σ2ln(112z1))\Delta^{\prime\prime}(z)-\Delta(z)\frac{\Delta^{\prime}(z)^{2}}{\sigma^{2}}=\frac{2k}{(\frac{1}{4}-z^{2})^{2}}\left(z-\frac{k^{2}}{2\sigma^{2}}\ln\left(\frac{1}{\frac{1}{2}-z}-1\right)\right) (86)

The fact that \Delta^{\prime\prime}(z)>0 for all z\in(0,1/2) implies that \ln(1/(1/2-z)-1) is convex, and this function has derivative 4 at zero. Consequently, \ln(1/(1/2-z)-1)\geq 4z and Equation 86 implies

\Delta^{\prime\prime}(z)-\Delta(z)\frac{\Delta^{\prime}(z)^{2}}{\sigma^{2}}\leq\frac{2k}{(\frac{1}{4}-z^{2})^{2}}\left(z-\frac{k^{2}}{2\sigma^{2}}\cdot 4z\right)=\frac{2kz}{(\frac{1}{4}-z^{2})^{2}}\left(1-\frac{2k^{2}}{\sigma^{2}}\right)

The condition \mu_{1}-\mu_{0}\leq\sqrt{2}\sigma is equivalent to 1-2k^{2}/\sigma^{2}\leq 0, so the term in Equation 83 is non-positive and consequently h^{\prime\prime}(z)\leq 0, which proves that h is concave. ∎

This lemma, together with the fact that h(0)=0, implies that h(z)\leq h^{\prime}(0)z. Noting also that h(z)\leq 1 for all z produces the bound

h(z)min(16σ2μ1μ0z,1)h(z)\leq\min\left(\frac{16\sigma^{2}}{\mu_{1}-\mu_{0}}z,1\right)

Applying this bound to the Gaussians with densities p_{0}^{*} and p_{1}^{*} results in Equation 21.