
On Collective Robustness of Bootstrap Aggregation Against Data Poisoning

Aeiau Zzzz    Bauiu C. Yyyy    Cieua Vvvvv    Iaesut Saoeu    Fiuea Rrrr    Tateu H. Yasehe    Aaoeu Iasoh    Buiui Eueu    Aeuia Zzzz    Bieea C. Yyyy    Teoau Xxxx    Eee Pppp
Abstract

Bootstrap aggregating (bagging) is a popular ensemble learning protocol, valued for its effectiveness, simplicity and robustness. Prior works have derived a deterministic robustness certificate against data poisoning for a specific variant of bagging. However, two questions remain open: 1) how to generalize the deterministic robustness certificate to the general form of bagging; 2) how to compute its collective robustness. Collective robustness refers to the minimum number of simultaneously unchanged predictions when predicting a testset, which has proven to be a more informative and practical robustness measure against data poisoning. In this paper, we propose pseudo bagging, which enjoys deterministic robustness against data poisoning, to approximate an arbitrary form of bagging. Moreover, we propose the first certification that certifies the tight collective robustness of pseudo bagging. Specifically, the collective robustness is computed by solving a binary integer linear programming (BILP) problem, whose time complexity is exponential in the testset size. To reduce the time complexity, we devise a decomposition strategy that computes a lower bound instead. Empirical experiments show notable advantages in terms of practical applicability, collective robustness and certified accuracy.

Robustness certification, ensemble learning, certified robustness

1 Introduction

Bagging, as a well-known ensemble learning method, is commonly used for improving accuracy or reducing overfitting. (levine2021deep) proves a deterministic robustness certificate for a specific variant of bagging, partition-based bagging, against general data poisoning attacks (the attacker is allowed to arbitrarily insert/delete/modify a bounded number of training samples). This bagging-based defense significantly improves robustness compared to prior works (Wang2020OnCR; Rosenfeld2019CertifiedRT; ma2019data).

However, in terms of defense construction, the practicality of current bagging-based defenses (levine2021deep; jia2021intrinsic) is limited by their hard constraints. Specifically, (levine2021deep) requires the trainsets of the sub-classifiers to be disjoint, and (jia2021intrinsic) needs to train hundreds of sub-classifiers to estimate an acceptable robustness lower bound. Therefore, there exists a gap between bagging-based defenses and the general form of bagging, since we cannot simultaneously customize the number of sub-classifiers and the sub-trainset size in an arbitrary manner. It is thus meaningful to develop a robustness certification that guarantees robustness for general bagging, rather than for specific variants.

Moreover, in terms of robustness certification, existing certifications against data poisoning mainly focus on sample-wise robustness (whether a single prediction is changeable under the attack), and pay little attention to a more informative robustness measure, collective robustness (the minimum number of unchanged predictions under the attack). In practice, collective robustness is superior to sample-wise robustness for two reasons. First, sample-wise robustness is only a special case of collective robustness where the testset size is one. Second, poisoning attacks have two properties: 1) an attacker cannot craft a different poisoning attack for each testing sample; 2) a poisoning attack globally influences all the predictions. Collective robustness, which jointly considers these two properties, is therefore more practical than sample-wise robustness, and it is preferable to compute collective robustness for bagging-based defenses.

However, computing collective robustness is challenging, as we need to collectively consider the strongest attack against multiple predictions. To our knowledge, there are only two collective certifications, both for specific defenses. The first work (schuchardt2021collective) computes the collective robustness of graph neural networks against adversarial examples by exploiting the locality property, but locality is specific to GNNs and cannot be generalized to all classifiers. The other work (jia2022rnn) computes the collective robustness against data poisoning for a particular machine-learning classifier, rNN (radius Nearest Neighbors), but the certification leverages the unique nature of rNN and cannot be applied to bagging. Therefore, computing collective robustness for bagging-based defenses is non-trivial.

In this paper, we propose pseudo bagging to offer bagging certified robustness, and a collective certification to compute the tight collective robustness of pseudo-bagging defenses. Our main idea is to formulate the problem of computing the collective robustness as a binary integer linear programming (BILP) problem, whose objective is to maximize the number of successfully changed predictions under the pre-specified poison budget. Notably, the robustness guaranteed by our method is tight, as we only have access to the predictions. Moreover, our method yields a much tighter accuracy lower bound (certified accuracy) than prior methods. The main contributions are three-fold:

1) On defense construction, we propose pseudo bagging, which approximates an arbitrary form of bagging while providing a deterministic robustness certificate against data poisoning.

2) On robustness certification, we show that computing collective robustness is an NP-hard problem. We thus propose a problem decomposition strategy that computes a lower bound instead of the exact robustness, reducing the time complexity to \mathcal{O}(N). The lower bound certified by our method is theoretically greater than or equal to that of prior methods.

3) The empirical results on MNIST and CIFAR-10 show that our method significantly improves collective robustness. The source code will be made publicly available.

2 Related Works

We discuss two lines of related work: 1) ensemble-learning-based certified defenses; 2) robustness certifications against data poisoning.

Certified defenses against data poisoning.

There is a line of certified defenses (steinhardt2017certified; Wang2020OnCR) against data poisoning, such as random flipping (Rosenfeld2019CertifiedRT), randomized smoothing (weber2020rab), differential privacy (ma2019data) and ensemble-based defenses (levine2021deep; jia2021intrinsic). Among them, only (ma2019data; jia2022rnn; jia2021intrinsic; levine2021deep) are designed to defend against general data poisoning attacks. (ma2019data) is limited to training algorithms with differential privacy, and (jia2022rnn) can only be applied to rNN. Compared to (jia2021intrinsic), the robustness guaranteed by (levine2021deep) is often higher under a similar computation cost.

Robustness certification against general data poisoning.

Among existing robustness certifications against general data poisoning (ma2019data; jia2021intrinsic; jia2022rnn; levine2021deep), (ma2019data; jia2021intrinsic) are probabilistic, meaning that they have a chance of outputting a wrong robustness certificate. The robustness certification of (jia2022rnn) is deterministic, but only applies to rNN, not to general NN classifiers. Currently, all the certifications for general NN classifiers focus on the sample-wise certificate, which often yields a loose robustness certificate when certifying the robustness of multiple predictions. Therefore, in this work, we aim to develop a deterministic and collective certification that certifies the exact robustness of multiple predictions.

3 Preliminaries

Strongest threat model.

The attacker is able to arbitrarily insert/delete/modify at most r training samples, where r denotes the poison budget. Since we impose no requirement on the sub-classifiers, we assume the attacker can fully control any sub-classifier trained on the poisoned data.

Bootstrap aggregating (bagging) (peter2002bag) is defined as: 1) construct G sub-trainsets by randomly selecting k samples from the trainset (of size N) G times; 2) train G sub-classifiers on the G sub-trainsets independently; 3) predict the majority class among the G sub-classifiers' predictions.
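For concreteness, the following minimal Python sketch mirrors the three steps above; train_classifier is a placeholder for any training routine, and sampling with replacement is assumed (the classical bagging choice).

import random
from collections import Counter

def bag_train(trainset, G, k, train_classifier, seed=0):
    # Steps 1-2: draw G sub-trainsets of size k (with replacement) and train
    # one sub-classifier on each.
    rng = random.Random(seed)
    sub_classifiers = []
    for _ in range(G):
        sub_trainset = [rng.choice(trainset) for _ in range(k)]
        sub_classifiers.append(train_classifier(sub_trainset))
    return sub_classifiers

def bag_predict(sub_classifiers, x):
    # Step 3: predict the majority class among the G sub-classifiers' votes.
    votes = Counter(f(x) for f in sub_classifiers)
    return votes.most_common(1)[0][0]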

3.1 Certified robustness of bagging-based ensemble learning (levine2021deep).

Main idea.

Bagging-based ensemble learning obtains certified robustness by: 1) limiting the influence scope of each poisoned sample, since each sub-classifier is trained on only a subset of the trainset; 2) making predictions via the majority voting mechanism.

General sample-wise certification of BEL.

We compute the sample-wise robustness by bounding the portion of sub-classifiers that remain unaffected under any attack within the poison budget, and then checking whether the majority vote can be overturned in the worst case.

Example 1 (Partition aggregation)
Example 2 (Bootstrap aggregation)

DPA exploits the majority voting mechanism to achieve certified robustness. Specifically, given a trainset \mathcal{D}, we first split \mathcal{D} into G (G is pre-specified) disjoint partitions \mathcal{D}=\bigcup_{g=1}^{G}\mathcal{D}^{\rm{sub}}_{g}, based on a deterministic partition rule (e.g., a hash function). Then we learn the sub-classifiers f_{g}(\cdot):\,g=1,\dots,G on those G partitions independently, in a deterministic manner. The ensemble of these G sub-classifiers, g(\cdot), predicts the class with the majority of votes. Let V_{x}(y):y\in\mathcal{Y} (\mathcal{Y} is the output space) refer to the number of sub-classifier votes for class y after the attack when predicting x:

V_{x}(y)\coloneqq\sum_{g=1}^{G}\mathbb{I}\{f_{g}(x)=y\} (1)

For clarity, \overline{V}_{x}(y) denotes the number of votes for class y before the attack. The ensemble prediction g(x) is:

g(x)=\begin{cases}\argmax\limits_{y\in\mathcal{Y}}V_{x}(y),&\text{only one majority class}\\ \argmin\limits_{y}\argmax\limits_{y\in\mathcal{Y}}V_{x}(y),&\text{multiple majority classes}\end{cases} (2)

where \argmin\limits_{y} means that the class with the smallest index has a higher priority if there exist multiple majority classes.

Remark (Reproducibility). We emphasize that DPA requires both the partition results and the training process to be reproducible. We readily realize reproducibility by specifying the random seed for all random operations.
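As an illustration, the sketch below shows one way to realize a deterministic partition rule and the tie-breaking prediction of Eq. (2); the index-based partition rule is only an assumed example of a deterministic rule, not necessarily the one used in our implementation.

def dpa_partition(trainset, G):
    # Split the trainset into G disjoint partitions with a deterministic rule;
    # here sample index idx is assigned to partition idx % G.
    partitions = [[] for _ in range(G)]
    for idx, sample in enumerate(trainset):
        partitions[idx % G].append(sample)
    return partitions

def ensemble_predict(sub_preds, num_classes):
    # Eq. (2): majority vote over sub-classifier predictions; ties are broken
    # in favor of the smallest class index.
    votes = [0] * num_classes
    for p in sub_preds:
        votes[p] += 1
    best = max(votes)
    return min(y for y in range(num_classes) if votes[y] == best)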

Sample-wise certification of DPA.

(levine2021deep) proves that the ensemble prediction g(x) is certifiably robust against poison budget r:

r=\left\lfloor\frac{V_{x}(y_{A})-V_{x}(y_{B})-\mathbb{I}\{y_{B}<y_{A}\}}{2}\right\rfloor (3)

where y_{A} (= g(x)) and y_{B} denote the top-2 majority classes.

Remark (Proof sketch of DPA). The attacker (with poison budget r) can control at most r sub-classifiers. Consider the worst case where the attacker moves r votes from y_{A} to y_{B}; y_{A} still has a higher priority than y_{B}.
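A direct implementation of Eq. (3) from the clean vote counts might look as follows; this is a sketch in which votes[y] is assumed to hold \overline{V}_{x}(y).

def samplewise_radius(votes):
    # Eq. (3): largest poison budget r for which a single ensemble prediction
    # is certifiably unchanged; ties favor the smaller class index, as in Eq. (2).
    order = sorted(range(len(votes)), key=lambda y: (-votes[y], y))
    y_a, y_b = order[0], order[1]  # top-2 classes
    return (votes[y_a] - votes[y_b] - (1 if y_b < y_a else 0)) // 2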

Naive collective certification.

A naive collective certification is to simply fuse multiple sample-wise certificates, i.e., to count the number of sample-wise robust predictions. However, the naive collective certification only gives a lower bound of the true collective robustness.
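In code, this naive bound simply counts the predictions whose sample-wise radius (computed as in the previous sketch) already covers the budget r.

def naive_collective_bound(all_votes, r):
    # Naive collective lower bound: number of test predictions whose
    # sample-wise certificate (Eq. 3) alone tolerates a poison budget of r.
    return sum(1 for votes in all_votes if samplewise_radius(votes) >= r)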

Figure 1: An example illustrating the gap between the sample-wise certificate and the collective certificate. Here g(x) is the ensemble of f_{1}(x), f_{2}(x), f_{3}(x), and the testing samples are x_{1}, x_{2}, x_{3}; correct and wrong predictions (Cat/Dog) are distinguished by color in the figure. Consider an attacker (poison budget 1) who can control an arbitrary sub-classifier. Sample-wise certificate: we consider g(x_{1}), g(x_{2}), g(x_{3}) independently. To change g(x_{1})/g(x_{2})/g(x_{3}), the attacker can flip f_{2}(x_{1})/f_{3}(x_{2})/f_{1}(x_{3}) respectively. Therefore, none of the three predictions is robust and the sample-wise robustness is 0. Collective certificate: we consider g(x_{1}), g(x_{2}), g(x_{3}) collectively. If the attacker poisons f_{1}/f_{2}/f_{3}, the prediction g(x_{1})/g(x_{2})/g(x_{3}) is unchangeable respectively. Thus the collective robustness is 1.

Certification gap.

We illustrate the gap between the sample-wise certificate and the collective certificate with a toy example, as shown in Fig. 1.

4 Methodology

Unlike the sample-wise threat model in Section 3, here we model an attacker (with full knowledge) that aims to maximize the number of successfully changed predictions within a pre-specified poison budget. In fact, based on the nature of DPA, computing the optimal poisoning attack can be simplified as: compute the maximum number of changed ensemble predictions when we can arbitrarily select r sub-classifiers to fully control.

4.1 Compute Collective Certification of DPA

Given a collection of testing samples \{x_{i}\}_{i=1}^{N}, we aim to compute the maximum number of successfully changed predictions under the poison budget r. Let \{y_{i}\}_{i=1}^{N} denote the ensemble predictions before the attack. We formulate the following binary integer linear programming (BILP) problem, referred to as (P1):

\max_{A_{1},\dots,A_{G}}\;\sum_{i=1}^{N}\mathbb{I}\left\{V_{x_{i}}(y_{i})<\max_{y\neq y_{i}}\left[V_{x_{i}}(y)+\frac{1}{2}\mathbb{I}\{y<y_{i}\}\right]\right\} (4)
s.t.\quad\mathbf{A}=[A_{1},A_{2},\dots,A_{G}]\in\{0,1\}^{G} (5)
\sum_{g=1}^{G}A_{g}-\sum_{p=1}^{G}\sum_{q=1}^{G}A_{p}A_{q}\mathbb{I}\{f_{p}\cap f_{q}\}\leq r (6)
V_{x_{i}}(y_{i})=\overline{V}_{x_{i}}(y_{i})-\sum_{g=1}^{G}A_{g}\mathbb{I}\{f_{g}(x_{i})=y_{i}\}\quad\forall i (7)
V_{x_{i}}(y)=\overline{V}_{x_{i}}(y)+\sum_{g=1}^{G}A_{g}\mathbb{I}\{f_{g}(x_{i})\neq y\}\quad\forall i,\,y\neq y_{i} (8)

We now explain each equation. Eq. (5): A_{1},A_{2},\dots,A_{G} are G binary variables that encode the poisoning attack; A_{g}\in\{0,1\} denotes whether the attacker poisons the g-th sub-trainset to control f_{g}. Eq. (6): the poison budget is bounded by r. Eq. (7): V_{x_{i}}(y_{i}) denotes the minimum number of votes for the (original) ensemble prediction y_{i} after the attack [A_{1},A_{2},\dots,A_{G}], which equals the number of votes for y_{i} before the attack (\overline{V}_{x_{i}}(y_{i})) minus the number of attacked sub-classifiers that originally predict y_{i} (\sum_{g=1}^{G}A_{g}\mathbb{I}\{f_{g}(x_{i})=y_{i}\}). Eq. (8): V_{x_{i}}(y) denotes the maximum number of votes for a class y\neq y_{i} after the attack. Based on our threat model, the attacker can arbitrarily modify the predictions of the poisoned sub-classifiers, so the number of votes for y\neq y_{i} equals the number of original votes for y plus the number of poisoned sub-classifiers whose original predictions are not y. Eq. (4): we aim to maximize the number of changed ensemble predictions; an ensemble prediction is changed only if there exists a class y\neq y_{i} that attains a higher priority than y_{i}, where the term \frac{1}{2}\mathbb{I}\{y<y_{i}\} accounts for the tie-breaking rule in Eq. (2).
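To make the formulation concrete, the sketch below linearizes Problem (P1) for the DPA case, i.e., disjoint partitions, so the quadratic term in Eq. (6) vanishes and the budget constraint reduces to \sum_{g}A_{g}\leq r. Auxiliary binaries z_{i,y} encode "class y outranks y_i" via a big-M constraint; PuLP with the bundled CBC solver is used here only as a stand-in for the Gurobi solver used in our implementation.

import pulp

def max_changed_predictions(sub_preds, ens_preds, num_classes, r):
    # Linearized sketch of Problem (P1) for disjoint partitions (DPA).
    # sub_preds[g][i]: prediction of sub-classifier g on test sample i.
    # ens_preds[i]:    clean ensemble prediction y_i.
    # Returns the maximum number of ensemble predictions an attacker
    # controlling at most r sub-classifiers can change.
    G, N = len(sub_preds), len(ens_preds)
    M = G + 1  # big-M constant: vote counts lie in [0, G]

    prob = pulp.LpProblem("collective_certification", pulp.LpMaximize)
    A = [pulp.LpVariable(f"A_{g}", cat="Binary") for g in range(G)]  # attacked sub-classifiers (Eq. 5)
    c = [pulp.LpVariable(f"c_{i}", cat="Binary") for i in range(N)]  # 1 iff prediction i is changed
    z = {(i, y): pulp.LpVariable(f"z_{i}_{y}", cat="Binary")
         for i in range(N) for y in range(num_classes) if y != ens_preds[i]}

    prob += pulp.lpSum(c)       # objective: Eq. (4)
    prob += pulp.lpSum(A) <= r  # budget: Eq. (6) for disjoint partitions

    for i in range(N):
        yi = ens_preds[i]
        V_yi = (sum(1 for g in range(G) if sub_preds[g][i] == yi)
                - pulp.lpSum(A[g] for g in range(G) if sub_preds[g][i] == yi))  # Eq. (7)
        outranked = []
        for y in range(num_classes):
            if y == yi:
                continue
            V_y = (sum(1 for g in range(G) if sub_preds[g][i] == y)
                   + pulp.lpSum(A[g] for g in range(G) if sub_preds[g][i] != y))  # Eq. (8)
            tie = 1 if y < yi else 0
            # z[i, y] = 1 forces class y to outrank y_i (integer form of Eq. (4)).
            prob += V_y - V_yi + tie >= 1 - M * (1 - z[i, y])
            outranked.append(z[i, y])
        prob += c[i] <= pulp.lpSum(outranked)  # changed only if some class outranks y_i

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return int(round(pulp.value(prob.objective)))

The returned value corresponds to \overline{N}_{\rm{ATK}}, so the collective robustness is N minus this value (see Observation 1 below).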

Observations about Problem (P1)

  1. The collective robustness (the minimum number of unchanged ensemble predictions) is equal to the total number of predictions minus the optimal value of Problem (P1).

  2. The collective robustness is tight.

  3. Problem (P1) is an NP-hard problem, because

  4. We can simplify the problem by ignoring x_{i} that satisfies V_{x_{i}}(y_{i})-\max_{y\neq y_{i}}V_{x_{i}}(y)>r and \max_{y\neq y_{i}}V_{x_{i}}(y)=.

  5. We can simplify Problem (P1) by partitioning it into multiple sub-problems (see the sketch after this list).

  6. The collective robustness is closely related to prediction diversity.
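As a concrete and deliberately simple instance of Observation 5, the sketch below splits the testset into fixed-size groups, grants the attacker the full budget r within every group, and sums the per-group optima of (P1). This is an assumed decomposition for illustration, not necessarily the exact strategy of our implementation, but the result is always a valid lower bound on the collective robustness: the true optimal attack, restricted to any group, is feasible for that group's sub-problem, so the per-group optima over-count the damage.

def decomposed_lower_bound(all_sub_preds, all_ens_preds, num_classes, r, group_size=10):
    # Solve (P1) on small groups of test samples (fixed-size sub-problems, hence
    # linear cost in the testset size) and combine the per-group optima into a
    # collective-robustness lower bound. Reuses max_changed_predictions above.
    N = len(all_ens_preds)
    n_atk_upper = 0
    for start in range(0, N, group_size):
        idx = list(range(start, min(start + group_size, N)))
        sub_preds = [[preds[i] for i in idx] for preds in all_sub_preds]
        ens_preds = [all_ens_preds[i] for i in idx]
        n_atk_upper += max_changed_predictions(sub_preds, ens_preds, num_classes, r)
    return max(0, N - n_atk_upper)  # lower bound on the collective robustness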

Implementation.

Alg. 1 shows our practical algorithm for computing the collective certification for a collection of testing samples.

Algorithm 1 Certify
  Input: The testset \mathcal{D}_{test}=\{(x_{i},y_{i})\}_{i=1}^{N}, the sub-classifiers f_{1}(\cdot),\dots,f_{G}(\cdot), the poison budget r.
  Output: The minimum certified accuracy \underline{\rm{CA}}(r) under the poison budget r.
  for i=1 to N do
     Compute the predictions f_{g}(x_{i}):\,g=1,\dots,G;
  end for
  Compute the maximum number of successfully attacked ensemble predictions \overline{N}_{\rm{ATK}} by solving (P1); # a binary integer linear programming problem
  Compute the certified accuracy \underline{\rm{CA}}(r)\leftarrow\frac{N-\overline{N}_{\rm{ATK}}}{N};
  return the certified accuracy \underline{\rm{CA}}(r).
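Putting Algorithm 1 together with the earlier sketches (ensemble_predict and max_changed_predictions are the hypothetical helpers defined above), the certification loop might look as follows.

def certify(testset, sub_classifiers, num_classes, r):
    # Sketch of Algorithm 1: gather sub-classifier predictions, solve (P1),
    # and return the certified-accuracy lower bound CA(r) = (N - N_ATK) / N.
    N, G = len(testset), len(sub_classifiers)
    sub_preds = [[f(x) for x, _ in testset] for f in sub_classifiers]  # f_g(x_i)
    ens_preds = [ensemble_predict([sub_preds[g][i] for g in range(G)], num_classes)
                 for i in range(N)]                                    # Eq. (2)
    n_atk = max_changed_predictions(sub_preds, ens_preds, num_classes, r)  # solve (P1)
    return (N - n_atk) / N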

5 Comparisons to Prior Works

We compare our method to: 1) partition aggregation defense (levine2021deep); 2) prior certified defenses with collective certifications (steinhardt2017certified; schuchardt2021collective; jia2022rnn).

Comparison to (levine2021deep). Unlike the sample-wise certification of (levine2021deep), our collective certification computes the exact certified accuracy against the general data poisoning attack, given that we only have access to the predictions and the poison budget.

Comparison to (steinhardt2017certified). The collective certification of (steinhardt2017certified) is substantially different from ours: it derives an approximate upper bound of the test loss under data poisoning attacks, and therefore cannot guarantee the certified accuracy.

Comparison to (schuchardt2021collective). (schuchardt2021collective) proves a collective certification against adversarial examples for GNNs, which relies on the locality property of the base classifiers (e.g., GCN). However, this technique is invalid for many other tasks (e.g., image classification), where the classifiers lack the locality property.

Comparison to (jia2022rnn). This related work derives a collective certification for the machine learning algorithm rNN (radius Nearest Neighbors). Since naive rNN performs poorly on tasks such as CIFAR-10, the authors propose to enhance rNN with a pre-trained encoder that extracts features on top of which rNN operates. Therefore, the performance of rNN (jia2022rnn) highly depends on the encoder. However, a well-trained encoder is often unavailable in practice, and training the encoder is itself vulnerable to data poisoning attacks. This vulnerability is ignored by (jia2022rnn), which makes comparisons to other certified defenses unfair.

Table 1: Experimental setups.

Dataset    Trainset    G (partitions)    δ (step)
MNIST      60,000      250/500/1000      G/10
CIFAR-10   50,000      50/100/200        G/10

6 Experiments

6.1 Experimental Setups

Datasets and models.

We mainly evaluate the robustness certification on MNIST (lecun2010mnist) and CIFAR-10 (krizhevsky2009learning), following the prior works (levine2021deep; jia2021intrinsic; jia2022rnn). Following (levine2021deep), we use the NiN (Network-In-Network) model architecture (min2014nin) for MNIST, and NiN with full data augmentation for CIFAR-10. The trainset sizes of the two datasets are 60,000 and 50,000 respectively, and the testset size of both is 10,000. All the experiments are conducted on CPU (16 Intel(R) Xeon(R) Gold 5222 CPU @ 3.80GHz) and GPU (one NVIDIA RTX 2080 Ti).

Peer methods.

We compare our method (the collective certification of partition aggregation) to the sample-wise certification of partition aggregation (levine2021deep) and the sample-wise certification of bagging (jia2021intrinsic). For the latter, we take the confidence level 1-α=0.999, the same as the original implementation, and set its number of base classifiers equal to the number of our partitions for computational fairness. We do not compare to the collective certification of rNN (jia2022rnn) because its pre-trained encoder is trained on additional datasets without considering poisoning attacks, which makes the comparison unfair.

Evaluation metrics.

Following the prior works (levine2021deep; jia2021intrinsic; jia2022rnn), we evaluate the performance of the robustness certification on the given testset using three evaluation metrics: CA(r) (certified accuracy under the poison budget r), ACR (average certified robustness) and MCR (median certified robustness). Specifically, CA(r), ACR and MCR are given by:

\rm{CA}(r)\coloneqq\frac{N-\overline{N}_{\rm{ATK}}(r)}{N} (9)
\rm{ACR}\coloneqq\sum_{r=1}^{\infty}\rm{CA}(r) (10)
\rm{MCR}\coloneqq\max_{\rm{CA}(r)\geq 50\%}r (11)

where N is the testset size. We emphasize that our definitions of CA(r), ACR and MCR are compatible with the original definitions in the sample-wise certifications (cohen2019certified; levine2021deep). Specifically, 1) both definitions of CA(r) give an accuracy lower bound, but our CA(r) gives the exact lower bound; 2) both definitions of ACR are the sum of the certified accuracy over the poison budget, reflecting the average poison budget that a testing sample can tolerate; 3) both definitions of MCR are the largest poison budget at which the certified accuracy is at least 50%. For computational reasons, we approximate ACR by only using \rm{CA}(r):\;r=0,\delta,2\delta,\dots,\frac{G}{2}. We report \underline{\rm{ACR}} (lower bound), \overline{\rm{ACR}} (upper bound) and \rm{AACR} (approximate ACR), which are computed as follows (with s=\frac{G}{2\delta}):

\underline{\rm{ACR}}\coloneqq\sum_{r=\delta,2\delta,\dots,s\delta}\delta\cdot\rm{CA}(r) (12)
\overline{\rm{ACR}}\coloneqq\sum_{r=0,\delta,2\delta,\dots,(s-1)\delta}\delta\cdot\rm{CA}(r) (13)
\rm{AACR}\coloneqq\frac{1}{2}\left(\underline{\rm{ACR}}+\overline{\rm{ACR}}\right) (14)
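Given the CA values at budgets 0, δ, 2δ, ..., sδ, Eqs. (12)-(14) reduce to two partial sums; a small sketch:

def approx_acr(ca, delta):
    # Eqs. (12)-(14): ca[j] is assumed to hold CA(j * delta) for j = 0, ..., s.
    # Since CA(r) is non-increasing in r, the two partial sums bound ACR from
    # below and above, and AACR is their average.
    acr_lower = delta * sum(ca[1:])    # Eq. (12): CA(delta), ..., CA(s * delta)
    acr_upper = delta * sum(ca[:-1])   # Eq. (13): CA(0), ..., CA((s - 1) * delta)
    return acr_lower, acr_upper, 0.5 * (acr_lower + acr_upper)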

Implementation details.

In practice, we solve the optimization problem (P1) with Gurobi 9.0 (gurobi), which has a useful feature: Gurobi can return a lower/upper bound of the objective value within a given time limit.
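One might use this feature roughly as follows. This is only a sketch that assumes the model for (P1) has already been built as a gurobipy maximization model; TimeLimit, ObjBound and ObjVal are standard Gurobi parameter/attribute names, but the surrounding code is illustrative rather than our exact implementation.

import gurobipy as gp

def solve_with_time_limit(model: gp.Model, n_test: int, seconds: float = 600.0):
    # Solve a maximization formulation of (P1) under a time limit and return a
    # safe certified-accuracy lower bound from the solver's objective bound.
    model.Params.TimeLimit = seconds
    model.optimize()
    n_atk_upper = model.ObjBound            # upper bound on the number of attacked predictions
    return (n_test - n_atk_upper) / n_test  # lower bound on CA(r), per Eq. (9)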

Table 2: MNIST: comparison of our method CPA (collective certification of partition aggregation) to SPA (sample-wise certification of partition aggregation) and SBag (sample-wise certification of bagging). Better results are in bold.

Partitions  Method  MCR  ACR  CA(13)  CA(25)  CA(38)  CA(50)  CA(63)  CA(75)  CA(88)  CA(100)  CA(113)  CA(125)
250         CPA
250         SPA
250         SBag

Partitions  Method  MCR  ACR  CA(25)  CA(50)  CA(75)  CA(100)  CA(125)  CA(150)  CA(175)  CA(200)  CA(225)  CA(250)
500         CPA
500         SPA
500         SBag

Table 3: CIFAR-10: comparison of CPA to SPA and SBag. We embolden the better results among the comparisons.

Partitions  Method  MCR  ACR  CA(3)  CA(5)  CA(8)  CA(10)  CA(13)  CA(15)  CA(18)  CA(20)  CA(23)  CA(25)
50          CPA
50          SPA
50          SBag

Partitions  Method  MCR  ACR  CA(5)  CA(10)  CA(15)  CA(20)  CA(25)  CA(30)  CA(35)  CA(40)  CA(45)  CA(50)
100         CPA
100         SPA
100         SBag

Partitions  Method  MCR  ACR  CA(10)  CA(20)  CA(30)  CA(40)  CA(50)  CA(60)  CA(70)  CA(80)  CA(90)  CA(100)
200         CPA
200         SPA
200         SBag

Evaluation results on MNIST.

Evaluation results on CIFAR-10.

6.2 Runtime Analysis

6.3 Ablation Studies

Impact of partition size.

Impact of testset size.

Impact of ensemble diversity.

7 Conclusion

We have presented a collective certification approach against data poisoning attacks. To the best of our knowledge, this is the first work to certify an NN classifier's robustness against data poisoning in a collective manner. Consequently, the collective certification reports a much more precise evaluation of the overall certified robustness on the given testset. Our collective certification can be easily implemented. Empirical results suggest that the collective certification yields a much stronger overall robustness certification.

Appendix A Proofs