Adversarial Surrogate Risk Bounds
for Binary Classification
Abstract
A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training, one of the most popular techniques for training robust classifiers, involves minimizing an adversarial surrogate risk. Recent work characterized when a minimizing sequence of an adversarial surrogate risk is also a minimizing sequence of the adversarial classification risk for binary classification, a property known as adversarial consistency. However, these results do not address the rate at which the adversarial classification risk converges to its optimal value along such a minimizing sequence. This paper provides surrogate risk bounds that quantify that convergence rate. Additionally, we derive distribution-dependent surrogate risk bounds in the standard (non-adversarial) learning setting, which may be of independent interest.
1 Introduction
A central concern regarding sophisticated machine learning models is their susceptibility to adversarial attacks. Prior work (Biggio et al., 2013; Szegedy et al., 2013) demonstrated that imperceptible perturbations can derail the performance of neural nets. As such models are used in security-critical applications such as facial recognition (Xu et al., 2022) and medical imaging (Paschali et al., 2018), training robust models remains a central concern in machine learning.
In the standard classification setting, the classification risk is the proportion of incorrectly classified data. Rather than minimizing this quantity directly, which is a combinatorial optimization problem, typical machine learning algorithms perform gradient descent on a well-behaved alternative called a surrogate risk. If every sequence of functions that minimizes this surrogate risk also minimizes the classification risk, then the surrogate risk is referred to as consistent for that data distribution. In addition to consistency, one would hope that minimizing the surrogate risk is an efficient method for minimizing the classification risk. The convergence rate can be bounded by surrogate risk bounds, which are functions that bound the excess classification risk in terms of the excess surrogate risk.
In the standard binary classification setting, consistency and surrogate risk bounds are well-studied topics (Bartlett et al., 2006; Lin, 2004; Steinwart, 2007; Zhang, 2004). On the other hand, fewer results are known in the adversarial setting. The adversarial classification risk incurs a penalty when a point can be perturbed into the opposite class. Similarly, adversarial surrogate risks involve computing the worst-case value (i.e. supremum) of a loss function over an ε-ball. Frank & Niles-Weed (2024a) characterized which risks are consistent for all data distributions, and the corresponding losses are referred to as adversarially consistent. Unfortunately, no convex loss function can be adversarially consistent for all data distributions (Meunier et al., 2022). On the other hand, Frank (2025) showed that such situations are rather atypical: when the data distribution is absolutely continuous, a surrogate risk is adversarially consistent so long as the adversarial Bayes classifier satisfies a certain notion of uniqueness called uniqueness up to degeneracy. While these results characterize consistency, none describe convergence rates.
Our Contributions:
• We prove a linear surrogate risk bound for adversarially consistent losses (Theorem 11).
• If the ‘distribution of optimal attacks’ satisfies a bounded noise condition, we prove a linear surrogate risk bound under mild conditions on the loss function (Theorems 11 and 12).
• We prove a distribution-dependent surrogate risk bound that applies whenever a loss is adversarially consistent for a data distribution (Theorem 13).
Notably, this last bullet applies to convex loss functions. Due to the consistency results in prior work (Frank, 2025; Frank & Niles-Weed, 2024a; Meunier et al., 2022), one cannot hope for distribution-independent surrogate bounds for non-adversarially consistent losses. To the best of the authors’ knowledge, this paper is the first to prove surrogate risk bounds for the risks most commonly used in adversarial training; see Section 6 for a comparison with prior work. Understanding the optimality of the bounds presented in this paper remains an open problem.
2 Background and Preliminaries
2.1 Surrogate Risks
This paper studies binary classification with labels +1 and −1. Two measures describe the probabilities of finding data with each of the two labels in a given subset of the feature space. The classification risk of a set is the misclassification rate when points in that set are classified as +1 and points in its complement are classified as −1:
The minimal classification risk over all Borel sets is denoted . As the derivative of an indicator function is zero wherever it is defined, the empirical version of this risk cannot be optimized with first order descent methods. Consequently, common machine learning algorithms minimize a different quantity called a surrogate risk. The surrogate risk of a function is defined as
In practice, the loss function is selected so that it has a well-behaved derivative. In this paper, we assume:
Assumption 1.
The loss is continuous, non-increasing, and tends to zero at infinity.
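To make the setting concrete, the sketch below evaluates a few standard margin losses that satisfy Assumption 1 (continuous, non-increasing, vanishing at infinity) on a toy one-dimensional sample, comparing their empirical surrogate risks with the 0-1 error. The losses, score function, and data are illustrative choices, not the ones analyzed in the paper.

```python
import numpy as np

# A few standard margin losses satisfying Assumption 1 (continuous,
# non-increasing, vanishing at +infinity).  Illustrative choices only.
def hinge(a):       return np.maximum(0.0, 1.0 - a)
def logistic(a):    return np.log1p(np.exp(-a))
def exponential(a): return np.exp(-a)
def rho_margin(a, rho=1.0):  # ramp / rho-margin loss
    return np.clip(1.0 - a / rho, 0.0, 1.0)

# Empirical surrogate risks of a fixed score function on a toy sample,
# compared with the 0-1 error obtained by thresholding the score at zero.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.where(x + 0.3 * rng.normal(size=200) > 0, 1, -1)  # noisy labels
margins = y * (2.0 * x)                                  # y * f(x) for f(x) = 2x
print("0-1 risk        :", np.mean(margins <= 0))
for name, loss in [("hinge", hinge), ("logistic", logistic),
                   ("exponential", exponential), ("rho-margin", rho_margin)]:
    print(f"{name:12s} risk:", np.mean(loss(margins)))
```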
The minimal surrogate risk over all Borel measurable functions is denoted . After optimizing the surrogate risk, a classifier is obtained by thresholding the resulting function at zero. Consequently, we define the classification error of a function by , or equivalently,
It remains to verify that minimizing the surrogate risk will also minimize the classification risk .
Definition 1.
The loss function is consistent for a given data distribution if every minimizing sequence of the surrogate risk is also a minimizing sequence of the classification risk for that distribution. The loss function is consistent if it is consistent for all data distributions.
Prior work establishes conditions under which many common loss functions are consistent.
Theorem 1.
A convex loss is consistent iff it is differentiable at zero and .
See (Bartlett et al., 2006, Theorem 2). Furthermore, (Frank & Niles-Weed, 2024a, Proposition 3) establishes a condition that applies to non-convex losses:
Theorem 2.
If , then the loss is consistent.
The -margin loss and the shifted sigmoid loss both satisfy this criterion. However, a convex loss cannot satisfy this condition:
(1)
In addition to consistency, understanding convergence rates is a key concern. Specifically, prior work (Bartlett et al., 2006; Zhang, 2004) establishes surrogate risk bounds of the form for some function . This inequality bounds the convergence rate of in terms of the convergence of .
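One common way to write such a bound, following Bartlett et al. (2006), is sketched below, where R denotes the classification risk, R_φ the surrogate risk, starred quantities the corresponding minimal values, and ψ a non-decreasing function with ψ(0) = 0; the notation here is an illustrative stand-in for the paper's own.

```latex
\psi\bigl(R(f) - R^*\bigr) \;\le\; R_\phi(f) - R_\phi^*,
\qquad \psi \text{ non-decreasing},\ \psi(0) = 0.
```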
The values , can be expressed in terms of the data distribution by re-writing these quantities in terms of the total probability measure and the conditional probability of the label +1, given by . An equivalent formulation of the classification risk is
(2) |
with
(3) |
and the minimal classification risk is found by minimizing the integrand of Equation 2 at each . Define
(4) |
then the minimal classification risk is
Analogously, the surrogate risk in terms of and is
(5) |
and the minimal surrogate risk is
with the conditional risk and minimal conditional risk defined by
(6) |
Notice that minimizers of the surrogate risk may need to be extended real-valued: consider the exponential loss and a distribution with a point where the conditional probability of one class equals 1. Then the conditional risk at that point is minimized only as the function value tends to infinity.
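As a numerical sketch of the conditional risk and its infimum over real-valued predictions, the snippet below evaluates both for the logistic loss (an illustrative choice, not the paper's generic loss) and compares the infimum with the known closed form for that loss, the binary entropy in natural logarithms.

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda a: np.log1p(np.exp(-a))          # logistic loss (illustrative)

def conditional_risk(eta, alpha):
    # eta*phi(alpha) + (1 - eta)*phi(-alpha)
    return eta * phi(alpha) + (1 - eta) * phi(-alpha)

def minimal_conditional_risk(eta):
    # numerical infimum over a bounded range of predictions
    res = minimize_scalar(lambda a: conditional_risk(eta, a),
                          bounds=(-50.0, 50.0), method="bounded")
    return res.fun

for eta in [0.5, 0.7, 0.9, 0.99]:
    closed_form = -eta * np.log(eta) - (1 - eta) * np.log(1 - eta)
    print(f"eta={eta:4.2f}  numeric={minimal_conditional_risk(eta):.6f}  "
          f"closed form={closed_form:.6f}")
```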
The consistency of can be fully characterized by the properties of the function .
Theorem 3.
A loss is consistent iff for all .
Surprisingly, this criterion has not appeared in prior work. See Appendix A for a proof.
In terms of the function , Theorem 2 states that any loss with is consistent.
The function is a key component of surrogate risk bounds from prior work. Specifically, Bartlett et al. (2006) show:
Equation 8 is a consequence of Equation 7 and Jensen’s inequality. Furthermore, a result of (Bartlett et al., 2006) implies a linear bound when :
(9) |
Furthermore, a distribution with zero classification error has the surrogate risk bound
(10) |
so long as . Such distributions are referred to as realizable. A proof of this result that transfers directly to the adversarial scenario is provided in Section B.1.
A distribution is said to satisfy Massart’s noise condition (Massart & Nédélec, 2006) if there is an such that holds -a.e. Under this condition, Massart & Nédélec (2006) establish improved sample complexity guarantees. Furthermore, such distributions exhibit a linear surrogate loss bound as well. These linear bounds, the realizable bounds from Equation 10, and the linear bounds from Equation 9 are summarized in a single statement below.
Proposition 1.
Let , be a distribution that satisfies -a.e. with a constant , and let be a loss with . Then for all ,
(11) |
and consequently
(12) |
When , this surrogate risk bound proves a linear convergence rate under Massart’s noise condition. If and , then the bound in Equation 12 reduces to Equation 9 while if then this bound reduces to Equation 10. See Section B.2 for a proof of this result. One of the main results of this paper is that Equation 12 generalizes to adversarial risks.
Note that the surrogate risk bound of Theorem 4 can be linear even for convex loss functions. For the hinge loss , the function computes to . Prior work (Frongillo & Waggoner, 2021, Theorem 1) observed a linear surrogate bound for piecewise linear losses: if is piecewise linear, then is piecewise linear and Jensen’s inequality implies a linear surrogate bound so long as is consistent (due to Theorem 3). On the other hand, (Frongillo & Waggoner, 2021, Theorem 2) show that convex losses which are locally strictly convex and Lipschitz achieve at best a square root surrogate risk rate.
Mahdavi et al. (2014) emphasize the importance of a linear convergence rate in a surrogate risk bound. Their paper studies the sample complexity of estimating a classifier with a surrogate risk. They note that typically convex surrogate losses exhibiting favorable sample complexity do not satisfy favorable surrogate risk bounds, due to the results of (Frongillo & Waggoner, 2021). Consequently, Proposition 1 hints that proving favorable sample complexity guarantees for learning with convex surrogate risks could require distributional assumptions, such as Massart’s noise condition.
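As an empirical illustration of the linear behaviour in Proposition 1, the sketch below builds a small discrete distribution whose conditional class probabilities stay away from 1/2 (a Massart-type condition), draws many random score functions, and reports the largest observed ratio of excess classification risk to excess hinge risk. The distribution, loss, and sampling scheme are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete distribution: 5 atoms with marginal weights p and conditional
# class probabilities eta bounded away from 1/2 (|2*eta - 1| >= 0.5).
p   = np.array([0.2, 0.3, 0.1, 0.25, 0.15])
eta = np.array([0.9, 0.1, 0.8, 0.25, 0.95])

phi  = lambda a: np.maximum(0.0, 1.0 - a)       # hinge loss (illustrative)
grid = np.linspace(-3, 3, 601)                  # grid for the pointwise infimum

C_star     = np.array([np.min(e * phi(grid) + (1 - e) * phi(-grid)) for e in eta])
R_star     = np.sum(p * np.minimum(eta, 1 - eta))   # minimal 0-1 risk
R_phi_star = np.sum(p * C_star)                     # minimal surrogate risk

ratios = []
for _ in range(2000):
    alpha = rng.uniform(-2, 2, size=len(p))     # random score at each atom
    R     = np.sum(p * np.where(alpha > 0, 1 - eta, eta))
    R_phi = np.sum(p * (eta * phi(alpha) + (1 - eta) * phi(-alpha)))
    if R - R_star > 1e-12:
        ratios.append((R - R_star) / (R_phi - R_phi_star))

print("largest observed (excess 0-1) / (excess surrogate):", max(ratios))
```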
2.2 Adversarial Risks
This paper extends the surrogate risk bounds of Equations 8, 10 and 12 to adversarial risks. The adversarial classification risk incurs a penalty of 1 whenever a point can be perturbed into the opposite class. This penalty can be expressed in terms of suprema of indicator functions: the adversarial classification risk incurs a penalty of 1 whenever or . Define
The adversarial classification risk is then
and the adversarial surrogate risk is¹
¹In order to define the risks and , one must argue that the supremum is measurable. Theorem 1 of (Frank & Niles-Weed, 2024b) proves that whenever the underlying function is Borel, the supremum is always measurable with respect to the completion of any Borel measure.
A minimizer of the adversarial classification risk is called an adversarial Bayes classifier. After optimizing the surrogate risk, a classifier is obtained by thresholding the resulting function at zero. Consequently, the adversarial classification error of a function is defined as , or equivalently,
(13)
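The one-dimensional sketch below approximates the adversarial surrogate loss of a fixed score function by a grid search over the perturbation interval, and flags a point as an adversarial error when some perturbation pushes its score across the zero threshold. The loss, score function, and budget are illustrative assumptions; the grid search merely stands in for the inner supremum.

```python
import numpy as np

phi = lambda a: np.log1p(np.exp(-a))   # logistic loss (illustrative)
f   = lambda x: np.tanh(3.0 * x)       # a fixed score function (illustrative)
eps = 0.25                             # adversarial budget

def adversarial_loss(x, y, n_grid=201):
    deltas = np.linspace(-eps, eps, n_grid)
    return np.max(phi(y * f(x + deltas)))          # sup of the loss over the ball

def adversarial_error(x, y, n_grid=201):
    deltas = np.linspace(-eps, eps, n_grid)
    return float(np.any(y * f(x + deltas) <= 0))   # can the sign test be broken?

rng = np.random.default_rng(0)
xs = rng.normal(size=5)
ys = np.where(xs > 0, 1, -1)
for x, y in zip(xs, ys):
    print(f"x={x:+.3f}  clean loss={phi(y * f(x)):.3f}  "
          f"adv loss={adversarial_loss(x, y):.3f}  adv error={adversarial_error(x, y):.0f}")
```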
Just as in the standard case, one would hope that minimizing the adversarial surrogate risk would minimize the adversarial classification risk.
Definition 2.
We say a loss is adversarially consistent for the distribution , if any minimizing sequence of is also a minimizing sequence of . We say that is adversarially consistent if it is adversarially consistent for every possible , .
Theorem 2 of (Frank & Niles-Weed, 2024a) characterizes the adversarially consistent losses:
Theorem 5.
The loss is adversarially consistent iff .
Theorem 2 implies that every adversarially consistent loss is also consistent. Unfortunately, Equation 1 shows that no convex loss is adversarially consistent. However, the data distribution presented in (Meunier et al., 2022) for which adversarial consistency fails is fairly atypical: Let , be the uniform distributions on . Then one can show that the function sequence
(14) |
minimizes but not whenever (See Proposition 2 of (Frank & Niles-Weed, 2024a)). A more refined analysis relates adversarial consistency for losses with to a notion of uniqueness of the adversarial Bayes classifier for losses satisfying .
Definition 3.
Two adversarial Bayes classifiers , are equivalent up to degeneracy if any set with is also an adversarial Bayes classifier. The adversarial Bayes classifier is unique up to degeneracy if any two adversarial Bayes classifiers are equivalent up to degeneracy.
Theorem 3.3 of (Frank, 2024) proves that whenever is absolutely continuous with respect to Lebesgue measure, then equivalence up to degeneracy is in fact an equivalence relation. Next, Theorem 4 of (Frank, 2025) relates this condition to the consistency of .
Theorem 6.
Let be a loss with and assume that is absolutely continuous with respect to Lebesgue measure. Then is adversarially consistent for the data distribution given by , iff the adversarial Bayes classifier is unique up to degeneracy.
The extension of Theorem 4 to the adversarial setting must reflect the consistency results of Theorems 5 and 6.
2.3 Minimax Theorems
Minimax theorems are a central tool in analyzing the adversarial consistency of surrogate risks. These results allow one to express adversarial risks in a ‘pointwise’ manner analogous to Equation 5. We will then combine this ‘pointwise’ expression with the proof of Theorem 4 to produce surrogate bounds for adversarial risks.
These minimax theorems utilize the ∞-Wasserstein metric from optimal transport. Let and be two finite positive measures with the same total mass. Informally, one measure is within ε of the other in this metric if it can be obtained by moving points of the other by at most ε.
The metric is formally defined in terms of the set of couplings between and . A Borel measure on is a coupling between and if its first marginal is and its second marginal is , or in other words, and for all Borel sets . Let be the set of all couplings between the measures and . The distance between and is then
(15)
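To build intuition, note that for two empirical measures on the real line with the same number of equal-mass atoms, the monotone (sorted) matching is an optimal coupling, so the distance reduces to the largest gap between sorted samples. The sketch below uses this one-dimensional shortcut to check that perturbing every atom by at most ε keeps the perturbed measure within the ε-ball.

```python
import numpy as np

def w_inf_1d(xs, ys):
    """W-infinity between two equal-size empirical measures on the real line:
    the sorted (monotone) matching is optimal, so take the largest sorted gap."""
    return np.max(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(0)
mu  = rng.normal(size=1000)                       # atoms of the first measure
eps = 0.3
nu  = mu + rng.uniform(-eps, eps, size=mu.shape)  # move each atom by at most eps

print("W_inf(mu, nu) =", w_inf_1d(mu, nu))        # never exceeds eps
```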
Theorem 2.6 of (Jylhä, 2014) proves that the infimum in Equation 15 is always attained. The -ball around in the metric is
The minimax theorem below will relate the adversarial risks , to dual problems in which an adversary seeks to maximize some dual quantity over Wasserstein- balls. Specifically, one can show:
Lemma 1.
Let be a Borel function. Let be a coupling between the measures and supported on . Then -a.e. and consequently
See Appendix C for a proof. Thus, applying Lemma 1, the quantity can be lower bounded by an infimum followed by a supremum. Is it possible to swap this infimum and supremum? (Pydi & Jog, 2021) answers this question in the affirmative. Let be as defined in Equation 4 and let
(16) |
Theorem 7.
Let be as defined in Equation 16. Then
Furthermore, equality is attained at some Borel measurable , , and with and .
See (Frank & Niles-Weed, 2024a, Theorem 1) for a proof of this statement. The maximizers , can be interpreted as optimal adversarial attacks (see discussion following (Frank & Niles-Weed, 2024b, Theorem 7)). Frank (2024, Theorem 3.4) provides a criterion for uniqueness up to degeneracy in terms of dual maximizers.
Theorem 8.
The following are equivalent:
A) The adversarial Bayes classifier is unique up to degeneracy
B) There are maximizers , of for which , where and
In other words, the adversarial Bayes classifier is unique up to degeneracy iff the region where both classes are equally probable has measure zero under some optimal adversarial attack. Theorems 6 and 8 relate adversarial consistency and the dual problem, suggesting that these optimal adversarial attacks , may appear in adversarial surrogate bounds.
Frank & Niles-Weed (2024b) prove a minimax principle analogous to Theorem 7 for the adversarial surrogate risk. Let be as defined in Equation 6 and let
(17) |
Theorem 9.
Let be defined as in Equation 17. Then
Furthermore, equality is attained at some Borel measurable , , and with and .
Just as in the non-adversarial scenario, may not assume its infimum at a real-valued function. However, (Frank & Niles-Weed, 2024a, Lemma 8) show that
Lastly, one can show that maximizers of are always maximizers of as well. In other words, optimal attacks on minimizers of the adversarial surrogate risk are also optimal attacks on minimizers of the adversarial classification risk.
Theorem 10.
Consider maximizing the dual objectives and over .
1) Any maximizer of over must maximize as well.
2) If the adversarial Bayes classifier is unique up to degeneracy, then there are maximizers of where , with and .
See Appendix D for a proof of Item 1); Item 2) is shown in Theorems 5 and 7 of (Frank, 2025).
3 Main Results
Prior work has characterized when a loss is adversarially consistent with respect to a distribution , : Theorem 5 proves that a distribution-independent surrogate risk bound is only possible when , while Theorem 6 suggests that a surrogate bound must depend on the marginal distribution of under , and furthermore, such a bound is only possible when .
Compare these statements to Proposition 1: Theorems 5 and 6 imply that is adversarially consistent for , if or if there exist some maximizers of that satisfy Massart’s noise condition. Alternatively, due to Theorem 10, one can equivalently assume that there are maximizers of satisfying Massart’s noise condition. Our first bound extends Proposition 1 to the adversarial scenario, with the data distribution , replaced with the distribution of optimal adversarial attacks.
Theorem 11.
Let , be a distribution for which there are maximizers , of the dual problem that satisfy -a.e. for some constant with , where , . Then
When , one can select in Theorem 11 to produce a distribution-independent bound. The constant may be sub-optimal; in fact Theorem 4 of Frank (2025) proves that where is the -margin loss. Furthermore, the bound in Equation 10 extends directly to the adversarial setting.
Theorem 12.
Let be any loss with satisfying Assumption 1. Then if ,
A distribution will have zero adversarial classification risk whenever the supports of and are separated by at least ; see Examples 1 and 2(a) for an example. Zero adversarial classification risk corresponds to in Massart’s noise condition.
In contrast, Theorem 11 states that if some distribution of optimal adversarial attacks satisfies Massart’s noise condition, then the excess adversarial surrogate risk is at worst a linear upper bound on the excess adversarial classification risk. However, if , this constant approaches infinity as , reflecting the fact that adversarial consistency fails when the adversarial Bayes classifier is not unique up to degeneracy. When , understanding what assumptions on , guarantee Massart’s noise condition for , is an open question. Example 4.6 of (Frank, 2024) demonstrates a distribution that satisfies Massart’s noise condition and yet the adversarial Bayes classifier is not unique up to degeneracy. Thus Massart’s noise condition for does not guarantee Massart’s noise condition for , . See Examples 2 and 2(b) for an example where Theorem 11 applies with .
Finally, averaging bounds of the form in Theorem 11 over all values of produces a distribution-dependent surrogate bound, valid whenever the adversarial Bayes classifier is unique up to degeneracy. For a given function , let the concave envelope of be the smallest concave function larger than :
(18)
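The sketch below computes a discrete approximation of this concave envelope for a function sampled on a grid, by taking the upper convex hull of the sampled points and interpolating between hull vertices; the sampled function is an arbitrary illustrative choice.

```python
import numpy as np

def concave_envelope(x, y):
    """Discrete concave envelope of points (x_i, y_i) with x strictly increasing:
    the upper convex hull, linearly interpolated back onto the grid."""
    hull = []                                     # indices of upper-hull vertices
    for i in range(len(x)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # drop i2 if it lies below the segment joining i1 and i
            if (y[i2] - y[i1]) * (x[i] - x[i2]) < (y[i] - y[i2]) * (x[i2] - x[i1]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

x = np.linspace(0.0, 1.0, 201)
y = np.minimum(np.sqrt(x), 0.3 + 0.5 * np.abs(np.sin(8 * x)))  # non-concave sample
env = concave_envelope(x, y)
print("envelope dominates the function:", bool(np.all(env >= y - 1e-12)))
```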
Theorem 13.
Assume that and that the adversarial Bayes classifier is unique up to degeneracy. Let , be maximizers of for which , with and . Let . Let be the function defined by Theorem 4 and let . Then
with
See Examples 3 and 2(c) for an example of calculating a distribution-dependent surrogate risk bound.
One can prove that the function is always continuous and satisfies , proving that this bound is non-vacuous (see Lemma 2 below). Further notice that approaches zero as .
The map combines two components: , a modified version of , and , a modification of the cdf of . The function is a scaled version of , where is the surrogate risk bound in the non-adversarial case of Theorem 4. The domain of is , and thus the role of the in the definition of is to truncate the argument so that it fits into this domain. The factor of in this function appears to be an artifact of our proof, see Section 5 for further discussion. In contrast, the map translates the distribution of into a surrogate risk transformation. Compare with Theorem 6, which states that consistency fails if ; accordingly, the function becomes a poorer bound when more mass of is near .
Examples
[Figure 2: depictions of the distributions in Examples 1-3; panels (a), (b), and (c) correspond to Examples 1, 2, and 3 respectively.]
Below are three examples for which each of our three main theorems apply. These examples are all one-dimensional distributions, and we denote the pdfs of , and by and .
To start, a distribution for which the supports of , are more than apart must have zero risk. Furthermore, if is absolutely continuous with respect to Lebesgue measure and the supports of , are exactly apart, then the adversarial classification risk will be zero (see for instance (Awasthi et al., 2023a, Lemma 4) or (Pydi & Jog, 2021, Lemma 4.3)).
Example 1 (When ).
Let and be defined by
for some . See Figure 2(a) for a depiction of and . This distribution satisfies for all and thus the surrogate bound of Theorem 12 applies.
Examples 2 and 3 require computing maximizers to the dual . See Sections J.2 and J.1 for these calculations. The following example illustrates a distribution for which Massart’s noise condition can be verified for a distribution of optimal attacks.
Example 2 (Massart’s noise condition).
Let and let be the uniform density on . Define by
(19) |
see Figure 2(b) for a depiction of and . For this distribution and , the minimal surrogate and adversarial surrogate risks are always equal (). This fact together with Theorem 9 implies that optimal attacks on this distribution are and ; see Section J.1 for details. Consequently, the distribution of optimal attacks , satisfies Massart’s noise condition with , and as a result the bounds of Theorem 11 apply.
Finally, the next example presents a case in which Massart’s noise condition fails for the distribution of optimal adversarial attacks, yet the adversarial Bayes classifier remains unique up to degeneracy. Theorem 13 continues to yield a valid surrogate bound.
Example 3 (Gaussian example).
Consider an equal Gaussian mixture with equal variances and differing means, with :
We further assume ; see Figure 2(c) for a depiction. We will show that when , the optimal attacks , are Gaussians centered at and ; explicitly, the pdfs of and are given by
(20) |
see Section J.2 for details. We verify that and are in fact optimal by finding a function for which , the strong duality result in Theorem 9 will then imply that and must maximize the dual , see Section J.2 for details.
However, when , the function is concave in for all and consequently ; see Section J.3 for details. Unfortunately, is a rather unwieldy function. By comparing to the linear approximation at zero, one can show the following upper bound on :
(21) |
Again, see Section J.3 for details.
When , (Frank, 2024, Example 4.1) shows that the adversarial Bayes classifier is not unique up to degeneracy. Notably, the bound in the preceding example deteriorates as , and then fails entirely when .
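The snippet below illustrates numerically why Massart's noise condition fails for the attacked distribution of Example 3 even though the two classes are equally likely only on a set of measure zero. It assumes, purely as a sketch of the form described above, that the optimal attack shifts each class mean by ε toward the other class while keeping the variance fixed; the exact densities are derived in Section J.2.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, eps = 1.0, 1.0, 0.25
m = mu - eps                          # assumed means of the attacked classes: -m and +m

xs = np.array([0.5, 0.1, 0.01, 0.001])
p_plus   = norm.pdf(xs,  m, sigma)    # density of the attacked +1 class
p_minus  = norm.pdf(xs, -m, sigma)    # density of the attacked -1 class
eta_star = p_plus / (p_plus + p_minus)

# |2*eta_star - 1| -> 0 as x -> 0: no Massart lower bound holds, yet
# eta_star = 1/2 only at x = 0, a set of measure zero.
print(np.abs(2 * eta_star - 1))
```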
4 Linear Surrogate Bounds— Proof of Theorems 12 and 11
The proof of Theorem 12 simply involves bounding the indicator functions , in terms of the functions and . This strategy is entirely analogous to the argument for the (non-adversarial) surrogate bound Equation 10 in Section B.1. A similar argument is also an essential intermediate step in the proof of Theorem 11.
Proof of Theorem 12.
If , then the duality result of Theorem 7 implies that for any measures , , where and . Consequently, for any , and hence . Thus it remains to show that for all functions . We will prove the bound
(22) |
The inequality Equation 22 trivially holds when . Alternatively, the relation implies for some and consequently . Thus whenever ,
(23) |
An analogous argument implies that whenever ,
As a result:
∎
In contrast, when the optimal surrogate risk is non-zero, the bound in Theorem 11 necessitates a more sophisticated argument. Below, we decompose both the adversarial classification risk and the adversarial surrogate risk as the sum of four positive terms.
Let be any maximizers of that also maximize by Theorem 10. Set , . Let , be the couplings between , and , respectively that achieve the infimum in Equation 15. Then due to the strong duality in Theorem 7, one can decompose the excess classification risk as
(24) |
with
Lemma 1 implies that must be positive, while the definition of implies that .
Similarly, one can express the excess surrogate risk as
(25) |
with
Define . We will argue that:
(26) |
(27) |
Below, we discuss the proof of Equation 27 and an analogous argument will imply Equation 26.
The proof proceeds by splitting the domain into three different regions , and proving the inequality in each case with a slightly different argument. These three cases will also appear in the proof of Theorem 13. Define the sets , , by
(28)
(29)
(30)
where is some constant, to be specified later (see Equations 32 and 33). On the set the adversarial error matches the non-adversarial error with respect to the distribution , , and thus the bound in Equation 12 implies a linear surrogate bound. On , the same argument as Equation 23 together with Equation 12 proves a linear surrogate bound for adversarial risks. In short: this argument uses the first term in to bound the first term in and the second term of to bound the second term of .
In contrast, the counterexample discussed in Equation 14 demonstrates that when is near 0, the quantity can be small even though can be large. Consequently, a different strategy is required to establish a linear surrogate bound on the set . The key observation is that under the assumptions of Proposition 1, the function must be bounded away from zero whenever it misclassifies a point. As a result, the excess conditional risk is bounded below by a positive constant and thus can be used to bound terms comprising . The constant is then specifically chosen to balance the contribution of the risks over the sets and .
Proof of Theorem 11.
We will prove Equation 27; the argument for Equation 26 is analogous. Due to Equations 24 and 25, these inequalities prove the desired result. First, notice that Equation 12 implies that
(31) |
Choose the constant to satisfy
(32) |
with . The parameter is selected to balance the contributions of the errors on and ; specifically, it should satisfy
(33) |
Next, we prove the relation Equation 27 on each of the sets , , separately.
1. On the set :
Lemma 1 implies that . This fact together with Equation 31 implies Equation 27.
2. On the set :
If but , then while and thus by Equation 33. Thus
(34)
This relation together with Equation 31 implies Equation 27.
3. On the set :
First, implies that . If additionally , then both and and consequently . Furthermore, as is increasing on and decreasing on (see Lemma 5 in Section B.2), . Thus due to the choice of in Equation 32:
The same argument as Equation 34 then implies
This relation together with Equations 31, 33 and 1 imply Equation 27.
∎
5 Proof of Theorem 13
Before proving Theorem 13, we will show that this bound is non-vacuous when the adversarial Bayes classifier is unique up to degeneracy. The function is a cdf, and is thus right-continuous in . Furthermore, if the adversarial Bayes classifier is unique up to degeneracy, then . The following lemma implies that if then is continuous at 0 and . See Appendix E for a proof.
Lemma 2.
Let be a non-decreasing function with and that is right-continuous at . Then is non-decreasing, , and continuous on .
The first step in proving Theorem 13 is showing an analog of Theorem 11 with for which the linear function is replaced by an -dependent concave function.
Proposition 2.
Let be a concave non-decreasing function for which for any and . Let , be any two maximizers of for which for and . Let be any non-decreasing concave function for which the quantity
is finite. Then , where
(35) |
The function in Theorem 4 and the surrogate bounds of Zhang (2004) provide examples of candidate functions for . As before, this result is proved by dividing the risks , into the sum of four terms as in Equation 24 and Equation 25 and then bounding these quantities over the sets , , and defined in Equation 28, Equation 29, and Equation 30 separately. The key observation is that when is bounded away from , the excess conditional risk must be strictly positive. This quantity again serves to bound both components of , even if it is not uniformly bounded away from zero. As before, the constant is selected to balance the contributions of the risk on the sets and . This time, the value is a function of , where is a monotonic function for which
(36) |
with . In Appendix F, we show that there exists such a function . An argument like the proof of Theorem 11, combined with applications of the Cauchy-Schwarz and Jensen inequalities, then proves Proposition 2; see Appendix G for details. Again, the function is chosen to balance the contributions of the upper bounds on the risk on and .
The factor of in Equation 35 arises as an artifact of the proof technique. Specifically, the constant reflects two distinct components of the argument: the factor of 3 results from averaging over the three sets , , (see Equation 68 in Appendix G), while the factor of 2 arises from combining the bounds associated with the two integrals and (see Equations 66 and 68 in Appendix G).
We now turn to the problem of identifying functions for which the constant in the preceding proposition is guaranteed to be finite. Observe that and so if , the identity function is a possible choice for . This option results in
which may improve the convergence rate relative to the bound in Theorem 11. The results developed here extend the classical analysis of Bartlett et al. (2006) to the adversarial setting. Moreover, Proposition 2 points to a pathway for generalizing the framework of Zhang (2004) to robust classification.
Alternatively, we consider constructing a function for which the constant in Proposition 2 is always finite when the adversarial Bayes classifier is unique up to degeneracy, but distribution-dependent. Observe that if is the cdf of and is continuous, then is always finite. This calculation suggests , with defined in Theorem 4. To ensure the concavity of , we instead select with .
Proof.
For convenience, let . Let , where . Then is concave because it is the composition of a concave function and an increasing concave function. We will verify that
First,
with . The assumption allows us to drop 0 from the domain of integration. Because the function is continuous on by Lemma 2, this last expression can actually be evaluated as a Riemann-Stieltjes integral with respect to the function :
(38) |
This result is standard when is Lebesgue measure (see for instance Theorem 5.46 of (Wheeden & Zygmund, 1977)). We prove equality in Equation 38 for strictly decreasing functions in Proposition 4 in Section H.1.
Finally, the integral in Equation 38 can be bounded as
(39) |
If were differentiable, then the chain rule would imply
This calculation is more delicate for non-differentiable ; we formally prove the inequality in Equation 39 in Section H.2.
This calculation proves the inequality Equation 37 with as
The concavity of together with the fact that then proves the result. ∎
Minimizing this bound over then produces Theorem 13; see Appendix I for details.
6 Related Works
The results most similar to this paper are those of Li & Telgarsky (2023) and Mao et al. (2023a). Li & Telgarsky (2023) prove a surrogate bound for convex losses when one can minimize over the thresholding value in Equation 13 rather than fixing it at 0. Mao et al. (2023a) prove an adversarial surrogate bound for a modified -margin loss.
Many papers study the statistical consistency of surrogate risks in the standard and adversarial contexts. Bartlett et al. (2006) and Zhang (2004) prove surrogate risk bounds that apply to the class of all measurable functions. Lin (2004) and Steinwart (2007) prove further results on consistency in the standard setting, and Frongillo & Waggoner (2021) study the optimality of such surrogate risk bounds. (Bao, 2023) relies on the modulus of convexity of to construct surrogate risk bounds. Philip M. Long (2013); Mingyuan Zhang (2020); Awasthi et al. (2022); Mao et al. (2023a; b); Awasthi et al. (2023b) further study consistency restricted to a specific family of functions, a concept called -consistency. Prior work (Mahdavi et al., 2014) also uses these surrogate risk bounds in conjunction with surrogate generalization bounds to study the generalization of the classification error.
In the adversarial setting, (Meunier et al., 2022; Frank & Niles-Weed, 2024a) identify which losses are adversarially consistent for all data distributions, while (Frank, 2025) shows that under reasonable distributional assumptions, a consistent loss is adversarially consistent for a specific distribution iff the adversarial Bayes classifier is unique up to degeneracy. (Awasthi et al., 2021) study adversarial consistency for a well-motivated class of linear functions, while some prior work also studies the approximation error caused by learning from a restricted function class. Liu et al. (2024) study the approximation error of the surrogate risk. Complementing this result, Awasthi et al. (2023b); Mao et al. (2023a) study -consistency in the adversarial setting for specific surrogate risks. Standard and adversarial surrogate risk bounds are a central tool in the derivation of the -consistency bounds in this line of research. Whether the adversarial surrogate bounds presented in this paper could result in improved adversarial -consistency bounds remains an open problem.
Our proofs rely on prior results that study adversarial risks and adversarial Bayes classifiers. Notably, (Bungert et al., 2021; Pydi & Jog, 2021; 2020; Bhagoji et al., 2019; Awasthi et al., 2023a) establish the existence of the adversarial Bayes classifier while (Frank & Niles-Weed, 2024b; Pydi & Jog, 2020; 2021; Bhagoji et al., 2019; Frank, 2025) prove various minimax theorems for the adversarial surrogate and classification risks. Subsequently, (Pydi & Jog, 2020) uses a minimax theorem to study the adversarial Bayes classifier, and (Frank, 2024) uses minimax results to study the notion of uniqueness up to degeneracy.
7 Conclusion
In conclusion, we prove surrogate risk bounds for adversarial risks whenever is adversarially consistent for the distribution , . When is adversarially consistent or the distribution of optimal adversarial attacks satisfies Massart’s noise condition, we prove a linear surrogate risk bound. For the general case, we prove a concave distribution-dependent bound. Understanding the optimality of these bounds remains an open problem, as does understanding how these bounds interact with the sample complexity of estimating the surrogate quantity. These questions were partly addressed by (Frongillo & Waggoner, 2021) and (Mahdavi et al., 2014) in the standard setting, but remain unstudied in the adversarial scenario.
Acknowledgments
Natalie Frank was supported in part by the Research Training Group in Modeling and Simulation funded by the National Science Foundation via grant RTG/DMS – 1646339, NSF grant DMS-2210583, and NSF TRIPODS II - DMS 2023166.
References
- Ambrosio et al. (2000) Luigi Ambrosio, Nicola Fusco, and Diego Pallara. Functions of Bounded Variation and Free Discontinuity Problems. Oxford Mathematical Monographs. Oxford University Press, 2000.
- Apostol (1974) Tom M. Apostol. Mathematical analysis, 1974.
- Awasthi et al. (2021) Pranjal Awasthi, Natalie S. Frank, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Calibration and consistency of adversarial surrogate losses. NeurIPS, 2021.
- Awasthi et al. (2022) Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds for surrogate loss minimizers. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022.
- Awasthi et al. (2023a) Pranjal Awasthi, Natalie S. Frank, and Mehryar Mohri. On the existence of the adversarial bayes classifier (extended version). arxiv, 2023a.
- Awasthi et al. (2023b) Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2023b.
- Bao (2023) Han Bao. Proper losses, moduli of convexity, and surrogate regret bounds. In Proceedings of Thirty Sixth Conference on Learning Theory, Proceedings of Machine Learning Research. PMLR, 2023.
- Bartlett et al. (2006) Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.
- Bhagoji et al. (2019) Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport, 2019.
- Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Springer, 2013.
- Buja et al. (2005) Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennslvania, 2005.
- Bungert et al. (2021) Leon Bungert, Nicolás García Trillos, and Ryan Murray. The geometry of adversarial training in binary classification. arxiv, 2021.
- Frank (2024) Natalie S. Frank. A notion of uniqueness for the adversarial bayes classifier, 2024.
- Frank (2025) Natalie S. Frank. Adversarial consistency and the uniqueness of the adversarial bayes classifier. European Journal of Applied Mathematics, 2025.
- Frank & Niles-Weed (2024a) Natalie S. Frank and Jonathan Niles-Weed. The adversarial consistency of surrogate risks for binary classification. NeurIps, 2024a.
- Frank & Niles-Weed (2024b) Natalie S. Frank and Jonathan Niles-Weed. Existence and minimax theorems for adversarial surrogate risks in binary classification. Journal of Machine Learning Research, 2024b.
- Frongillo & Waggoner (2021) Rafael Frongillo and Bo Waggoner. Surrogate regret bounds for polyhedral losses. In Advances in Neural Information Processing Systems, 2021.
- Hiriart-Urruty & Lemaréchal (2001) Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis, 2001.
- Jylhä (2014) Heikki Jylhä. The optimal transport: Infinite cyclical monotonicity and the existence of optimal transport maps. Calculus of Variations and Partial Differential Equations, 2014.
- Li & Telgarsky (2023) Justin D. Li and Matus Telgarsky. On achieving optimal adversarial test error, 2023.
- Lin (2004) Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
- Liu et al. (2024) Changyu Liu, Yuling Jiao, Junhui Wang, and Jian Huang. Nonasymptotic bounds for adversarial excess risk under misspecified models. SIAM Journal on Mathematics of Data Science, 6(4), 2024. URL https://doi.org/10.1137/23M1598210.
- Mahdavi et al. (2014) Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Binary excess risk for smooth convex surrogates, 2014.
- Mao et al. (2023a) Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications, 2023a.
- Mao et al. (2023b) Anqi Mao, Mehryar Mohri, and Yutao Zhong. Structured prediction with stronger consistency guarantees. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2023b.
- Massart & Nédélec (2006) Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34, 2006.
- Meunier et al. (2022) Laurent Meunier, Raphaël Ettedgui, Rafael Pinot, Yann Chevaleyre, and Jamal Atif. Towards consistency in adversarial classification. arXiv, 2022.
- Mingyuan Zhang (2020) Mingyuan Zhang and Shivani Agarwal. Consistency vs. h-consistency: The interplay between surrogate loss functions and the scoring function class. NeurIPS, 2020.
- Paschali et al. (2018) Magdalini Paschali, Sailesh Conjeti, Fernando Navarro, and Nassir Navab. Generalizability vs. robustness: Adversarial examples for medical imaging. Springer, 2018.
- Philip M. Long (2013) Philip M. Long and Rocco A. Servedio. Consistency versus realizable h-consistency for multiclass classification. ICML, 2013.
- Pydi & Jog (2020) Muni Sreenivas Pydi and Varun Jog. Adversarial risk via optimal transport and optimal couplings. ICML, 2020.
- Pydi & Jog (2021) Muni Sreenivas Pydi and Varun Jog. The many faces of adversarial risk. Neural Information Processing Systems, 2021.
- Raftery & Gneiting (2007) Adrian Raftery and Tilmann Gneiting. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 2007.
- Reid & Williamson (2009) Mark D. Reid and Robert C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009. Association for Computing Machinery.
- Rudin (1976) Walter Rudin. Principles of Mathematical Analysis. Mathematics Series. McGraw-Hill International Editions, third edition, 1976.
- Savage (1971) Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336), 1971.
- Schervish (1989) Mark J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 1989.
- Stein & Shakarchi (2005) Elias Stein and Rami Shakarchi. Real analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2005.
- Steinwart (2007) Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 2007.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Wheeden & Zygmund (1977) Richard L. Wheeden and Antoni Zygmund. Measure and Integral. Pure and Applied Mathematics. Marcel Dekker Inc., 1977.
- Xu et al. (2022) Ying Xu, Kiran Raja, Raghavendra Ramachandra, and Christoph Busch. Adversarial Attacks on Face Recognition Systems, pp. 139–161. Springer International Publishing, Cham, 2022.
- Zhang (2004) Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.
Appendix A Proof of Theorem 3
Lemma 4.
Assume is consistent. Then implies that .
This result appeared as Lemma 7 of (Frank, 2025).
Proof.
If is consistent and minimizes , then must also minimize and consequently . However so that must minimize as well. Consequently, and thus must actually equal . ∎
Proof of Theorem 3.
Forward direction: Assume that is consistent. Note that for any . Thus Lemma 4 implies that for .
Backward direction: Assume that for all . Notice that if , is constant in so any sequence minimizes . We will show if and is a minimizing sequence of , then for sufficiently large , and thus must also minimize . An analogous argument will imply that if , any minimizing sequence of must also minimize as well.
Assume and let be any minimizing sequence of . Let be a limit point of the sequence in the extended real number line . Then is a minimizer of . The statement implies that at least one of and must be larger than or equal to , and so the other must be strictly less than . Because and is a minimizer of , one can conclude that and consequently .
Therefore, every limit point of the sequence is strictly positive. Consequently, one can conclude that for sufficiently large .
∎
Appendix B (Non-Adversarial) Surrogate Bounds
B.1 The Realizable Case— Proof of Equation 10
Proof of Equation 10.
If , then and consequently -a.e. As a result, and thus it remains to show that for all functions . Next, we will prove the bound
(40) |
The inequality Equation 40 trivially holds when . Alternatively, the relation implies and consequently . Thus whenever ,
An analogous argument implies
As a result:
∎
B.2 Linear Surrogate Risk Bounds—Proof of Proposition 1
In this appendix, we will find it useful to study the function
introduced by (Bartlett et al., 2006). This function maps to the smallest value of the conditional -risk assuming an incorrect classification. The symmetry implies . Further, the function is concave on each of the intervals and , as it is an infimum of linear functions on each of these regions. The next result examines the monotonicity properties of and .
Lemma 5.
The function is non-decreasing on and non-increasing on . In contrast, is non-increasing on and non-decreasing on
Proof.
The symmetry and implies that it suffices to check monotonicity on . Observe that
If , then this quantity is non-negative when . Therefore, when computing over , it suffices to minimize over . In other words, for ,
For any fixed , the quantity is non-increasing in and thus when .
In contrast, for any , the quantity is non-decreasing in and thus when .
∎
Next we’ll prove a useful lower bound on .
Lemma 6.
For all ,
(41) |
Proof.
First, observe that is the convex combination . By the concavity of on ,
However, while . Further, Lemma 5 implies that , yielding the desired inequality.
∎
Proof of Proposition 1.
If then Equation 11 holds trivially. Otherwise, . If , then
(42)
At the same time, because -a.e. Lemma 5 implies that -a.e. Furthermore, applying Equation 41 at and observing shows that
Therefore, Equation 42 is bounded above by
(43) |
The last equality follows from the assumption , as it implies , and thus . Consequently, Equation 43 implies Equation 11.
Integrating Equation 11 with respect to then produces the surrogate bound Equation 12.
∎
Appendix C Proof of Lemma 1
Proof of Lemma 1.
If then . Thus if is supported on , then -a.e. Integrating this inequality in produces
Taking the supremum over all then proves the result. ∎
Appendix D Proof of Theorem 10 Item 1)
In this appendix, we relax the regularity assumption of the loss functions: specifically we require loss functions to be lower semi-continuous rather than continuous.
Assumption 2.
The loss is lower semi-continuous, non-increasing, and tends to zero at infinity.
Frank & Niles-Weed (2024b) establish their result under this weaker condition, so Theorem 9 continues to hold for lower semi-continuous losses. Moreover, Theorem 7 of (Frank & Niles-Weed, 2024b) finds two conditions that characterize minimizers and maximizers of and .
Theorem 14 (Complementary Slackness).
The function minimizes and the measures maximize over iff the following two conditions hold:
1)
2)
To prove Item 1) of Theorem 10, one uses the properties above to show that for some function and any pair of measures , that maximize . When is strictly decreasing at 0, the identity provides a means to relate and . In Sections D.1 and D.2, we show that for any loss , there is another loss with that satisfies this property.
Lemma 7.
Let be a consistent loss function that satisfies Assumption 1. Then there is a non-increasing, lower semi-continuous loss function with for which and
(44) |
The function may fail to be continuous.
Proof of Item 1) of Theorem 10 .
Let be the loss function constructed in Lemma 7 for which and let be the coupling between and that achieves the minimum distance. We aim to show that
(45) |
and similarly,
(46) |
Combining Equations 45 and 46 yields , and thus , are optimal by Theorem 7.
We now prove Equation 45; the argument for Equation 46 is analogous, and we briefly outline the necessary modifications at the end of the proof.
First observe that the property Equation 44 implies that pointwise and thus . We will apply the complementary slackness conditions of Theorem 14 to argue that in fact -a.e.
Item 1) of Theorem 14 implies that
(47) |
and thus assumes its maximum over closed -balls -a.e. Now because assumes its maximum over closed -balls -a.e., one can further conclude that -a.e. and therefore
(48) |
Next, the property Equation 44 implies that . Consequently, Equation 48 implies that
Now, Equation 44 again implies that . Combining with Equation 48 proves that and integrating this statement results in Equation 45.
Similarly, if is the coupling between and that achieves the minimum distance, then
and thus assumes its maximum over closed -balls -a.e. and analogous reasoning to Equation 48 implies that
(49) |
Finally, Equation 44 implies that . Consequently, the same argument as for Equation 45 implies Equation 46.
∎
D.1 Proof of Lemma 7
Proper losses are loss functions that can be used to learn a probability value rather than a binary classification decision. These losses are typically studied as functions on rather than . Prior work (Bao, 2023; Buja et al., 2005; Reid & Williamson, 2009) considers losses of the form where is the estimated probability value of the binary class . One can define the minimal conditional risk just as before via
Savage (1971) first studied how to reconstruct the loss given a concave function . We adapt their argument to construct a proper loss on , and then compose this loss with a link to obtain a loss on with the desired properties. This reconstruction involves technical properties of concave functions.
Lemma 8.
For a continuous concave function , the left-hand () and right-hand () derivatives always exist. The right-hand derivative is non-increasing and right-continuous.
Furthermore, if has a strict maximum at , then the function defined by
(50) |
is non-increasing and right-continuous.
See Section D.2 for a proof.
Proposition 3.
Let be a concave function for which the only global maximum is at . Define by
Then .
Notice that Lemma 8 implies that is lower semi-continuous. Furthermore, when , the definitions of and imply that
(51) |
Indeed, Equation 51 suggests that corresponds to while corresponds to . Fixing the value of at to zero ensures that , coincide, which corresponds to for the losses , . This correspondence is formalized in the proof of Lemma 7.
Proof of Proposition 3.
Calculating for the loss defined above results in
The choice results in .
However, the concavity of implies that
and as a result:
Therefore, .
∎
Next, an integral representation of proves that is non-decreasing. Let , then
(52) |
where the integrals in Equation 52 are Riemann-Stieltjes integrals. Lemma 8 implies that is right-continuous and non-decreasing, and thus these integrals are in fact well defined.
Such a representation was first given in (Schervish, 1989, Theorem 4.2) for left-continuous losses and Lebesgue integrals; see also (Raftery & Gneiting, 2007, Theorem 3) for a discussion of this result. (Reid & Williamson, 2009, Theorem 2) offer an alternative proof of Equation 52 in terms of generalized derivatives.
Writing these integrals as Riemann-Stieltjes integrals provides a more streamlined proof— integration by parts for Riemann-Stieltjes integrals (see for instance (Apostol, 1974, Theorem 7.6, Chapter 7)) implies that the first integral in Equation 52 evaluates to and the second evaluates to .
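For reference, the integration-by-parts identity being invoked is the standard one for Riemann-Stieltjes integrals: whenever one of the two integrals below exists, so does the other, and

```latex
\int_a^b f \, dg \;+\; \int_a^b g \, df \;=\; f(b)\,g(b) - f(a)\,g(a).
```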
Corollary 1.
The function is upper semi-continuous while the function is lower semi-continuous. Furthermore, is non-increasing and is non-decreasing.
Proof.
The representation Equation 52 implies that is non-increasing and is non-decreasing in .
Lemma 8 implies that is both non-increasing and right-continuous. For non-increasing functions, right-continuity and lower semi-continuity are equivalent. Consequently, is upper semi-continuous while the function is lower semi-continuous.
∎
Finally, we will use the defined in the previous result to construct the of Lemma 7.
Proof of Lemma 7.
Define a function via for and extend via continuity to . Notice that this function satisfies
(53) |
To simplify notation, we let and let be the loss function defined by Proposition 3. Define the loss by
This composition is non-increasing and lower semi-continuous.
Next, the function computes to :
(Equation 53)
(Equation 51)
(Proposition 3)
Finally, we’ll argue that for all . To start, note that . The assumption that is consistent together with Theorem 3 implies that must be strictly positive on and strictly negative on . Next, the concavity of implies that
If furthermore , then both and are strictly positive and consequently one can conclude that
(54) |
However, implies and consequently Equation 54 implies that when ,
It remains to compute :
∎
D.2 Proof of Lemma 8
Proof of Lemma 8.
We’ll begin by proving existence and subsequently we’ll show that this function is non-increasing and right-continuous.
Problem 23 of Chapter 4, (Rudin, 1976) implies that for a concave function , if , then
(55) |
The continuity of then implies that this inequality holds for . The first inequality implies that the right-hand limit exists for while the second inequality implies that the left-hand limit exists for .
Next, we’ll prove that is non-increasing. First, Equation 55 implies that if then
(56) |
and consequently taking the limits and proves that the function is non-increasing.
Next, we prove that is right-continuous. Fix a point and define . We will argue that , and consequently, . Equation 56 implies that for any satisfying ,
and thus
for any . Thus by the continuity of , this inequality must extend to ,
and taking the limit then proves
and thus the monotonicity of implies .
Finally, if has a strict maximum at , then the super-differential can include 0 only at . Thus, as is non-increasing, when and when . Thus the function defined in Equation 50 is non-increasing and right-continuous.
∎
Appendix E Proof of Lemma 2
We define the concave conjugate of a function as
Recall that as defined in Equation 18 is the biconjugate . Consequently, can be expressed as
(57) |
One can prove Lemma 2 by studying properties of concave conjugates.
Lemma 9.
Let be a non-decreasing function. Then is non-decreasing as well.
Proof.
We will argue that if is non-decreasing, then it suffices to consider the infimum in Equation 57 over non-decreasing linear functions. Observe that if is a decreasing linear function with then the constant function satisfies and . Therefore,
∎
Lemma 10.
Let be a non-decreasing function that is right-continuous at zero with . Then . Furthermore, there is a sequence with and .
Proof.
First, notice that
(58) |
for any . It remains to show a sequence for which .
We will argue that any sequence with
(59) |
satisfies this property.
If and satisfies Equation 59 then
and thus Equation 58 implies that
The monotonicity of then implies that
and
because is right-continuous at zero. This relation together with Equation 58 implies the result.
∎
Proof of Lemma 2.
Lemma 9 implies that is non-decreasing. Standard results in convex analysis imply that is continuous on (Hiriart-Urruty & Lemaréchal, 2001, Lemma 3.1.1) and upper semi-continuous on (Hiriart-Urruty & Lemaréchal, 2001, Theorem 1.3.5). Thus monotonicity implies that for all , and thus . We will show the opposite inequality, implying that is continuous at .
First, as the constant function is an upper bound on , one can conclude that . Next, recall that can be expressed as an infimum of linear functions as in Equation 57. If , then and . Therefore,
Therefore, the representation Equation 57 implies that . Taking proves that . Thus, is continuous at , if viewed as a function on .
Next, Lemma 10 implies that :
Finally, it remains to show that is continuous at 0. The monotonicity of implies that and consequently
(60) |
However, Lemma 10 implies that
(61) |
Notice that if ,
(62) |
Consequently, Equation 61 and Equation 62 imply that Equation 60 evaluates to 0.
∎
Appendix F Defining the Function
Lemma 11.
Let be a function defined by
Then , and is monotonic on and .
Proof.
The statement is a consequence of the continuity of . Recall that is non-increasing, and is non-decreasing on . Thus, must be non-increasing as well on . Analogously, must be non-decreasing on .
∎
Appendix G Proof of Proposition 2
A modified version of Jensen’s inequality will be used at several points in the proof of Proposition 2.
Lemma 12.
Let be a concave function with and let be a measure with . Then
Proof.
The inequality trivially holds if , so we assume . Jensen’s inequality implies that
As , concavity implies that
∎
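Written out with generic symbols, the inequality of Lemma 12 takes the following form; we assume, as the statement suggests, that $g$ is concave with $g(0) \ge 0$, that the measure $\nu$ has total mass $\nu(S) \le 1$, and that $f$ is $\nu$-integrable (the paper's exact notation is not reproduced here). When $\nu(S) > 0$,
\[
  \int_S g(f)\,d\nu
  \;\le\; \nu(S)\, g\!\Bigl(\tfrac{1}{\nu(S)}\int_S f\,d\nu\Bigr)
  \;\le\; g\!\Bigl(\int_S f\,d\nu\Bigr),
\]
where the first step is Jensen's inequality for the probability measure $\nu/\nu(S)$ and the second uses concavity together with $g(0) \ge 0$: for $\lambda = \nu(S) \in (0,1]$, one has $g(\lambda y) = g(\lambda y + (1-\lambda)\cdot 0) \ge \lambda g(y) + (1-\lambda)\, g(0) \ge \lambda g(y)$.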
Proof of Proposition 2.
Define , , , and as in Section 4. We will prove
(63) |
and an analogous argument will imply
(64) |
The concavity of then implies that
We will prove Equation 63; the argument for Equation 64 is analogous. Next, let  be the coupling between  and  supported on . The assumption on  implies that
(65) |
and consequently,
(66) |
To bound the term , we consider three different cases for . Define the sets , , by
with the function  as in Equation 36. The composition  is measurable because  is piecewise monotonic; see Lemma 11 in Appendix F. We will show that if  is any of the three sets , , , then
(67)
Because  is concave and non-decreasing, the composition  is as well. Thus, summing the inequality Equation 67 over  results in
(68) |
Summing Equation 66 and Equation 68 results in Equation 63.
It remains to show the inequality Equation 67 for the three sets , , .
-
A)
On the set :
If , then while the left-hand side of Equation 67 is non-negative by Lemma 1, which implies Equation 67 for .
-
B)
On the set :
-
C)
On the set :
First, implies that . If furthermore , then both and and consequently , and so due to the definition of in Equation 36:
The same argument as Equation 69 then implies
Now the Cauchy–Schwarz inequality and Lemma 12 imply
∎
Appendix H Technical Integral Lemmas
H.1 The Lebesgue and Riemann–Stieltjes integral of an increasing function
The goal of this section is to prove Equation 38, namely:
Proposition 4.
Let  be a non-increasing, non-negative, continuous function on an interval  and let  be a finite positive measure. Let  be a random variable distributed according to  and define . Then
where the integral on the left is defined as the Lebesgue integral in terms of the measure while the integral on the right is defined as a Riemann–Stieltjes integral.
Proof.
Recall that the Lebesgue integral is defined as
while the Riemann-Stieltjes integral is defined as the value of the limits
(71) |
where these limits are evaluated as the size of the partition approaches 0. A standard analysis result states that these limits exist and are equal whenever is continuous (see for instance Theorem 2.24 of (Wheeden & Zygmund, 1977)). Let be arbitrary and choose a partition for which each of the sums in Equation 71 is within of , and for all . Such a partition exists because every continuous function on a compact set is uniformly continuous.
Next, consider two simple functions , defined according to
By construction,  for all . Moreover, since , it follows that  when . Now applying the definition of the integral of a simple function, we obtain:
As is arbitrary, it follows that . ∎
Notice that because , the integral on the right-hand side of Equation 38 is technically an improper integral. Thus, to show Equation 38, one can conclude that
from Proposition 4 and then take the limit .
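The identity in Proposition 4 can also be sanity-checked numerically. The short Python sketch below does so for a toy choice of $h$ and $\mu$ (a mixture of an atom and a uniform distribution); both the integrand and the measure are our own illustrative choices and do not appear in the paper.

import numpy as np

# Toy check of Proposition 4: h is non-increasing, non-negative, and continuous;
# mu is the finite positive measure 0.5 * (point mass at 1) + 0.5 * Uniform(0, 2).
h = lambda z: np.exp(-z)

def cdf(t):
    """F(t) = mu((-infinity, t]) for the toy measure above."""
    atom = 0.5 * (t >= 1.0)
    uniform_part = 0.5 * np.clip(t, 0.0, 2.0) / 2.0
    return atom + uniform_part

# Lebesgue integral of h with respect to mu, computed in closed form:
# the atom contributes 0.5 * h(1) and the uniform part 0.25 * (1 - exp(-2)).
lebesgue = 0.5 * h(1.0) + 0.25 * (1.0 - np.exp(-2.0))

# Riemann-Stieltjes sum of h dF over a fine partition of [0, 3].
t = np.linspace(0.0, 3.0, 200_001)
stieltjes = np.sum(h(t[1:]) * np.diff(cdf(t)))

print(f"Lebesgue integral     : {lebesgue:.6f}")
print(f"Riemann-Stieltjes sum : {stieltjes:.6f}")
# The two values agree up to the discretization error of the partition.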
H.2 Proof of the last equality in Equation 39
Even though is always continuous by Lemma 2, may be discontinuous and thus may not exist as a Riemann-Stieltjes integral. Thus one must avoid this quantity in proofs.
The proof of Equation 39 relies on summation by parts and on a well-behaved decomposition of that is a consequence of the Lebesgue decomposition of a function of bounded variation. This result states that can be decomposed into a continuous portion and a right-continuous “jump” portion.
See Corollary 3.33 of (Ambrosio et al., 2000) for a statement that implies this result, or (Stein & Shakarchi, 2005, Exercise 24, Chapter 3).
Proposition 5.
Let  be a non-decreasing and right-continuous function with  and . Then one can decompose  as
where , , are non-decreasing functions mapping into with , , is continuous, and is a right-continuous step-function.
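As a concrete illustration of such a decomposition (the example is ours, not one used in the paper), take on $[0,1]$ the non-decreasing, right-continuous function $G(x) = x/2 + \tfrac12\,\mathbf 1_{[1/2,1]}(x)$; it splits as
\[
  G \;=\; G_c + G_j,
  \qquad
  G_c(x) \;=\; \tfrac{x}{2} \quad\text{(continuous part)},
  \qquad
  G_j(x) \;=\; \tfrac12\,\mathbf 1_{[1/2,\,1]}(x) \quad\text{(right-continuous jump part)},
\]
with both parts non-decreasing, $G_c(0) = G_j(0) = 0$, and $G_c(1) + G_j(1) = 1$.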
The goal of this section is to prove the following inequality:
Lemma 13.
Let be an increasing and right-continuous function with and . Let be any continuous function with and let . Then one can bound the Riemann–Stieltjes integral by
Proof.
Let be the decomposition in Proposition 5. Thus
(72) |
We will bound each of the two integrals above separately. Notice that the portions of this decomposition bound below: and . First,
Now by comparing the limiting sums that define the Riemann–Stieltjes integral  with the limit of the Riemann sums for the integral , one can conclude that  can be evaluated as a Riemann integral in the variable . This argument relies on the continuity of the function . Thus . Consequently:
(73) |
Next, because is right-continuous and , one can write
(74) |
with . Furthermore, because , and is non-decreasing, one can conclude that , , , , and . Lastly, one can require that in this representation, since otherwise one could express as in Equation 74 but with a smaller value of . Thus the second integral in Equation 72 evaluates to
Recalling that , this quantity is bounded above by
(75) |
Because the function is decreasing in , one can bound and consequently Equation 75 is bounded above by
Combining this bound with Equation 73 shows that Equation 72 is bounded above by
Maximizing this quantity over proves the result. ∎
Appendix I Optimizing the Bound of Lemma 3 over
Proof of Theorem 13.
Let
Then
solving produces , and
One can verify that this point is a minimum via the second derivative test:
and thus
Consequently, .
However, the point is in the interval only when . When , is minimized over at . Because is a minimizer when , one can bound over this set and thus
Next, let  be defined as in Lemma 3. Then the concavity of  and the fact that  imply that .
∎
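Since the expression being minimized is not reproduced in this appendix, the following SymPy sketch only illustrates the verification pattern used above (stationary point, second-derivative test, and clamping to the admissible interval) on a placeholder objective of the form $aK + b/K$; every symbol in the block is hypothetical rather than the paper's.

import sympy as sp

# Placeholder objective: the actual bound from Lemma 3 is not reproduced here,
# so h(K) = a*K + b/K stands in only to illustrate the verification pattern.
K, a, b = sp.symbols("K a b", positive=True)
h = a * K + b / K

stationary = sp.solve(sp.diff(h, K), K)                  # [sqrt(b/a)]
second = sp.simplify(sp.diff(h, K, 2).subs(K, stationary[0]))

print("stationary point :", stationary[0])
print("second derivative:", second)                      # positive => minimum
print("optimal value    :", sp.simplify(h.subs(K, stationary[0])))  # 2*sqrt(a*b)

# If the stationary point lies outside the admissible interval, the minimum
# over that interval is attained at the nearest endpoint, because h is
# monotone on either side of the stationary point.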
Appendix J Further details from Examples 3 and 2
In Sections J.1 and J.2, we use an operation analogous to . Let be an operation on functions that computes the infimum over an -ball. Formally, we define:
(76) |
Next, we define a mapping from  to minimizers of  by
(77) |
Lemma 25 of (Frank & Niles-Weed, 2024b) shows that the function defined in Equation 77 maps to the smallest minimizer of and is non-decreasing. This property will be instrumental in constructing minimizers for .
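A small numerical sketch of the $\varepsilon$-ball infimum in Equation 76, specialized to functions of one real variable on a uniform grid; the particular function $g$, the grid, and the value of $\varepsilon$ below are illustrative choices of ours rather than objects from the paper.

import numpy as np

# One-dimensional sketch of the epsilon-ball infimum in Equation 76:
# (I_eps g)(x) = inf over |x' - x| <= eps of g(x'), approximated on a grid
# by a sliding-window minimum.
def ball_infimum(values, grid, eps):
    step = grid[1] - grid[0]
    radius = int(round(eps / step))              # grid points inside the ball
    out = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out[i] = values[lo:hi].min()
    return out

grid = np.linspace(-3.0, 3.0, 601)
g = np.maximum(grid, 0.0)                        # a non-decreasing toy function
g_eps = ball_infimum(g, grid, eps=0.5)

# For a non-decreasing g, the infimum over [x - eps, x + eps] sits at the left
# endpoint, so I_eps g should coincide with x -> g(x - eps) up to grid error.
print(np.max(np.abs(g_eps - np.maximum(grid - 0.5, 0.0))))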
J.1 Calculating the optimal , for Example 2
First, notice that a minimizer of is given by with as defined in Equation 19.
Below, we construct a minimizer for . We’ll verify that this function is a minimizer by showing that . As the minimal possible adversarial risk is bounded below by , one can conclude that minimizes . Consequently, and thus the strong duality result in Theorem 9 would imply that , must maximize the dual problem.
Define a function by
and a function by . Both and are non-decreasing, and so must be non-decreasing as well. Consequently, and similarly, . (Recall the operation was defined in Equation 76.)
Therefore,
However, because the point 0 has Lebesgue measure zero if it is contained in :
Analogously, one can show that
and consequently .
J.2 Calculating the optimal and for Example 3
We will show that the densities in Equation 20 are dual optimal by finding a function for which . Theorem 9 will then imply that , must maximize the dual.
and let  be defined by
with and as in Equation 20. For a given loss we will prove that the optimal function is given by
The function computes to
If , the function  is increasing in  and consequently  is non-decreasing. Therefore,  (recall the operation  was defined in Equation 76). Similarly, one can argue that . Consequently,
Next, notice that and . Define . Then
Consequently, the strong duality result in Theorem 9 implies that must maximize the dual .
J.3 Showing Equation 21
Lemma 14.
Consider an equal Gaussian mixture with variance  and means , with pdfs given by
Let . Then is equivalent to , where is defined by
(78) |
Proof.
The function can be rewritten as while
Consequently, is equivalent to
which is equivalent to
Finally, notice that
(79) |
while
∎
Lemma 15.
Let ,  and  be as in Lemma 14 and let . If , then  is concave.
Proof.
To start, we calculate the second derivative of and the first derivative of .
The first derivative of is
and the second derivative of is
(80) |
Next, one can calculate the derivative of as
(81) |
and similarly
(82) |
Let . Lemma 14 implies that the function is given by . The first derivative of is then
Differentiating twice results in
(83)
(84)
(85)
where the final equality is a consequence of Equations 81 and 82. Next, we’ll argue that the sum of the terms in Equations 84 and 85 is zero:
Next, we’ll show that under the assumption , the term in Equation 83 is always negative. Define . Then
(86) |
The fact that for all implies that is convex, and this function has derivative at zero. Consequently, and Equation 86 implies
The condition is equivalent to .
∎
This lemma implies that . Noting also that for all produces the bound
Applying this bound to the Gaussians with densities  and  results in Equation 21.