Attribute-Efficient Learning of Halfspaces with Malicious Noise: Near-Optimal Label Complexity and Noise Tolerance
Abstract
This paper is concerned with computationally efficient learning of homogeneous sparse halfspaces in R^d in the presence of noise. Although recent works have established attribute-efficient learning algorithms under various types of label noise (e.g. bounded noise), it remains an open question when and how s-sparse halfspaces can be efficiently learned under the challenging malicious noise model, where an adversary may corrupt both the unlabeled examples and the labels. We answer this question in the affirmative by designing a computationally efficient active learning algorithm with near-optimal label complexity of Õ(s · polylog(d, 1/ε)) and noise tolerance η = Ω(ε), where ε ∈ (0, 1) is the target error rate and Õ(·) hides polylogarithmic factors, under the assumption that the distribution over (uncorrupted) unlabeled examples is isotropic log-concave. Our algorithm can be straightforwardly tailored to the passive learning setting, and we show that its sample complexity is Õ((s²/ε) · polylog(d, 1/ε)), which also enjoys attribute efficiency. Our main techniques include attribute-efficient paradigms for soft outlier removal and for empirical risk minimization, and a new analysis of uniform concentration for unbounded instances – all of which crucially take the sparsity structure of the underlying halfspace into account.
Keywords: halfspaces, malicious noise, passive and active learning, attribute efficiency
1 Introduction
This paper investigates the fundamental problem of learning halfspaces under noise [Val84, Val85]. In the absence of noise, this problem is well understood [Ros58, BEHW89]. However, the premise changes immediately when the unlabeled examples (which we will also refer to as instances) or the labels are corrupted by noise. Over the last decades, various types of label noise have been extensively studied, and a plethora of polynomial-time algorithms have been developed that are resilient to random classification noise [BFKV96], bounded noise [Slo88, Slo92, MN06], and adversarial noise [KSS92, KKMS05]. Significant progress towards optimal noise tolerance has also been witnessed in the past few years [Dan15, ABHU15, YZ17, DGT19, DKTZ20]. In this regard, a surge of recent research interest has concentrated on further improving performance guarantees by incorporating the structure of the underlying halfspace into algorithmic design. Of central interest is a property termed attribute efficiency, which proves to be useful when the data lie in a high-dimensional space [Lit87], or even in an infinite-dimensional space but with a bounded number of effective attributes [Blu90]. In the statistics and signal processing communities, it is often referred to as sparsity, dating back to the celebrated Lasso estimator [Tib96, CDS98, CT05, Don06]. Recently, learning of sparse halfspaces in an attribute-efficient manner was highlighted as an open problem in [Fel14], and in a series of recent works [PV13b, ABHZ16, Zha18, ZSA20], this property was carefully explored for label-noise-tolerant learning of halfspaces with improved or even near-optimal sample complexity, label complexity, or generalization error, where the key insight is that such a structural constraint effectively controls the complexity of the hypothesis class [Zha02, KST08].
Compared to the rich set of positive results on attribute-efficient learning of sparse halfspaces under label noise, less is known when both instances and labels are corrupted. Specifically, under the η-malicious noise model [Val85, KL88], there is an unknown target hypothesis w* and an unknown instance distribution D selected from a certain family by an adversary. Each time the adversary is called, with probability 1 − η it returns an instance x drawn from D together with the label sign(⟨w*, x⟩); with probability η, it instead is allowed to return an arbitrary pair (x, y) that may depend on the state of the learning algorithm and the history of its outputs. Since this is a much more challenging noise model, only recently has an algorithm with near-optimal noise tolerance been established in [ABL17], although without attribute efficiency. It is worth noting that the problem of learning sparse halfspaces is also closely related to one-bit compressed sensing [BB08], where one is allowed to utilize any distribution over measurements for recovering the target hypothesis. However, even with such a strong condition, existing theory therein can only handle label noise [PV13a, ABHZ16, BFN+17]. This naturally raises two fundamental questions: 1) can we design attribute-efficient learning algorithms that are capable of tolerating malicious noise; and 2) can we still obtain near-optimal performance guarantees on the degree of noise tolerance and on the sample complexity?
In this paper, we answer both questions in the affirmative under the mild distributional assumption that D is chosen from the family of isotropic log-concave distributions [LV07, Vem10], which covers prominent distributions such as normal, exponential, and logistic distributions. Moreover, we take label complexity into consideration [CAL94], and we show that our bound is near-optimal in that respect. We build our algorithm upon the margin-based active learning framework [BBZ07], which queries the label of an instance only when it has a small “margin” with respect to the currently learned hypothesis.
From a high level, this work can be thought of as extending the best known result of [ABL17] to the high-dimensional regime. However, even in the low-dimensional setting where the sparsity s is comparable to the dimension d, our label complexity bound is better than theirs in terms of the dependence on d: they have a quadratic dependence whereas we have a linear dependence (up to logarithmic factors). Moreover, as we will describe in Section 3, obtaining such an algorithmic extension is nontrivial both computationally and statistically. This work can also be viewed as an extension of [Zha18] to the malicious noise model. In fact, our construction of the empirical risk minimization step is inspired by that work. However, they considered only label noise, which makes their algorithm and analysis inapplicable to our setting: it turns out that when facing malicious noise, a sophisticated design of an outlier removal paradigm is crucial for optimal noise tolerance [KLS09].
Also in line with this work are learning with nasty noise [DKS18] and robust sparse functional estimation [BDLS17]. Both works considered more general settings in the following sense: [DKS18] showed that by properly adapting techniques from robust mean estimation, some more general concepts, e.g. low-degree polynomial threshold functions and intersections of halfspaces, can be efficiently learned with polynomial sample complexity; [BDLS17] showed that under proper sparsity assumptions, an attribute-efficient sample complexity bound can be achieved for many sparse estimation problems, such as generalized linear models with Lipschitz mapping functions and covariance estimation. However, we remark that neither of them obtained label efficiency. In addition, when adapted to our setting, Theorem 1.5 of [DKS18] only handles a noise rate of order ε^c for some constant c greater than one, while, as shown in Section 4, we obtain the near-optimal noise tolerance Ω(ε). [BDLS17] achieved near-optimal noise tolerance, but their analysis is restricted to the Gaussian marginal distribution and Lipschitz mapping functions. In addition to such fundamental differences, the main techniques we develop are distinct from theirs, as described in more detail in Section 3.3.3.
1.1 Main results
We informally present our main results below; readers are referred to Theorem 4 in Section 4 for a precise statement.
Theorem 1 (Informal).
Consider the malicious noise model with noise rate η. If the unlabeled data distribution is isotropic log-concave and the underlying halfspace is s-sparse, then there is an algorithm that, for any given target error rate ε ∈ (0, 1), PAC learns the underlying halfspace to error ε in polynomial time provided that η ≤ c·ε for a sufficiently small absolute constant c. In addition, the label complexity is Õ(s · polylog(d, 1/ε)) and the sample complexity is Õ((s²/ε) · polylog(d, 1/ε)).
First of all, note that the noise tolerance is near-optimal, as [KL88] showed that a noise rate greater than ε/(1+ε) cannot be tolerated by any algorithm, regardless of its computational power. The following fact establishes the near-optimality of our label complexity.
Lemma 2.
Active learning of s-sparse halfspaces under isotropic log-concave distributions in the realizable case has an information-theoretic label complexity lower bound of Ω(s log(d/s)).
1.2 Related works
[KL88] presented a general analysis of efficiently learning halfspaces, showing that even without any distributional assumptions it is possible to tolerate malicious noise at a nontrivial (dimension-dependent) rate, but that a noise rate greater than ε/(1+ε) cannot be tolerated. The noise model was further studied by [Sch92, Bsh98, CDF+99], and [KKMS05] obtained an improved noise tolerance when D is the uniform distribution. [KLS09] improved this result to Ω(ε²/log(d/ε)) for the uniform distribution, and showed a noise tolerance of Ω(ε³/log²(d/ε)) for isotropic log-concave distributions. A near-optimal result of Ω(ε) was established in [ABL17] for both uniform and isotropic log-concave distributions.
Achieving attribute efficiency has been a long-standing goal in machine learning and statistics [Blu90, BHL95], and has found a variety of applications with strong theoretical backend. A partial list includes online classification [Lit87], learning decision lists [Ser99, KS04, LS06], compressed sensing [Don06, CW08, TW10, SL18], one-bit compressed sensing [BB08, PV16], and variable selection [FL01, FF08, SL17a, SL17b].
Label-efficient learning has also been broadly studied since gathering high quality labels is often expensive. The prominent approaches include disagreement-based active learning [Han11, Han14], margin-based active learning [BBZ07, BL13, YZ17], selective sampling [CCG11, DGS12], and adaptive one-bit compressed sensing [ZYJ14, BFN+17]. There are also a number of interesting works that appeal to extra information to mitigate the labeling cost, such as comparison [XZS+17, KLMZ17] and search [BH12, BHLZ16].
Recent works such as [DKK+16, LRV16] studied mean estimation under a strong noise model where, in addition to returning dirty instances, the adversary also has the power of eliminating a few clean instances, similar to the nasty noise model in learning halfspaces [BEK02]. The main technique of robust mean estimation is a novel outlier removal paradigm, which uses the spectral norm of the covariance matrix to detect dirty instances. This is similar in spirit to the idea of [KLS09, ABL17] and the current work. However, there is no direct connection between mean estimation and halfspace learning, since the former is an unsupervised problem while the latter is supervised (although any connection would be very interesting). Very recently, such techniques were extensively investigated in a variety of problems such as clustering and linear regression; we refer the reader to the comprehensive survey of [DK19] for more information.
Roadmap. The rest of the paper is organized as follows. Section 2 introduces the problem setup and notation. Section 3 presents the main algorithm, and Section 4 establishes its performance guarantees. Section 5 concludes with open questions. Omitted details and proofs are deferred to the appendices.
2 Preliminaries
We study the problem of learning sparse halfspaces in R^d under the malicious noise model with noise rate η [Val85, KL88], where an oracle (i.e. the adversary) first selects a distribution D from a family of distributions and a concept w* from a concept class C; D and w* remain fixed during the learning process. Each time the adversary is called, with probability 1 − η a random pair (x, y) is returned to the learner with x drawn from D and y = sign(⟨w*, x⟩), referred to as a clean sample; with probability η, the adversary may return an arbitrary pair (x, y), referred to as a dirty sample. The adversary is assumed to have unrestricted computational power to search for dirty samples, which may depend on, e.g., the state of the learning algorithm and the history of its outputs. Formally, we make the following distributional assumptions.
Assumption 1.
Let the family of candidate distributions be the family of isotropic log-concave distributions. The underlying distribution D from which clean instances are drawn is chosen from this family by the adversary and is fixed during the learning process. The learner is given knowledge of the family, but not of D itself.
Assumption 2.
With probability 1 − η, the adversary returns a pair (x, y) where x is drawn from D and y = sign(⟨w*, x⟩); with probability η, it may return an arbitrary pair (x, y).
Since we are interested in obtaining a label-efficient algorithm, we consider a natural extension of this passive learning model. In particular, [ABL17] proposed the following protocol: when a labeled instance (x, y) is generated, the learner only has access to an instance-generation oracle which returns the instance x, and must make a separate call to a label-revealing oracle to obtain y. We refer to the total number of calls to the instance oracle as the sample complexity of the learning algorithm, and to the total number of calls to the label oracle as its label complexity.
We presume that the concept class C consists of homogeneous halfspaces that have unit ℓ2-norm and are s-sparse, i.e. the number of non-zero elements of any w ∈ C is at most s, where s ≤ d. The learning algorithm is given this concept class, that is, the set of homogeneous s-sparse halfspaces. For a hypothesis w, we define its error rate as the probability, over an instance x drawn from D, that sign(⟨w, x⟩) disagrees with sign(⟨w*, x⟩). The goal of the learner is to find, in polynomial time and with a small number of calls to the instance and label oracles, a hypothesis w such that, with probability at least 1 − δ, its error rate is at most ε, for any given failure confidence δ and target error rate ε.
For a reference vector u and a positive scalar b, we call the region {x : |⟨u, x⟩| ≤ b} a band, and we denote by D_{u,b} the distribution obtained by conditioning D on x falling in this band. Given a hypothesis w, a labeled instance (x, y), and a parameter τ > 0, we define the τ-hinge loss ℓ_τ(w; x, y) := max{0, 1 − y⟨w, x⟩/τ}. For a labeled set S, we define ℓ_τ(w; S) as the average of ℓ_τ(w; x, y) over the pairs (x, y) in S.
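To make these definitions concrete, here is a minimal Python sketch; the closed form max{0, 1 − y⟨w, x⟩/τ} for the τ-hinge loss is the standard one used in [ABL17, Zha18], and the function names are ours.

import numpy as np

def hinge_loss(w, x, y, tau):
    # tau-hinge loss of hypothesis w on the labeled instance (x, y)
    return max(0.0, 1.0 - y * float(np.dot(w, x)) / tau)

def avg_hinge_loss(w, S, tau):
    # empirical tau-hinge loss over a labeled set S = [(x, y), ...]
    return sum(hinge_loss(w, x, y, tau) for x, y in S) / len(S)

def in_band(x, u, b):
    # membership in the band {x : |<u, x>| <= b} around the reference vector u
    return abs(float(np.dot(u, x))) <= b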
For p ≥ 1, we denote by B_p(v, r) the ℓ_p-ball centered at the point v with radius r, i.e. B_p(v, r) = {u : ‖u − v‖_p ≤ r}; we will be particularly interested in the cases p ∈ {1, 2, ∞}. For a vector v, the hard thresholding operation H_s(v) keeps its s largest (in absolute value) elements and sets the remaining ones to zero. For two vectors u and v, we write θ(u, v) to denote the angle between them, and ⟨u, v⟩ to denote their inner product. For a matrix M, we denote its trace norm (also known as the nuclear norm), i.e. the sum of its singular values, by ‖M‖_*, and its entrywise ℓ1-norm, i.e. the sum of the absolute values of its entries, by ‖M‖_1. If M is a symmetric matrix, we write M ⪰ 0 to denote that it is positive semidefinite.
Throughout this paper, subscripted variants of the lowercase letter c, e.g. c₁ and c₂, are reserved for specific absolute constants, and a few further letters are reserved for specific constants as well. We remark that the values of all the constants involved in the paper do not depend on the particular underlying distribution D chosen by the adversary, but only on the knowledge that D is isotropic log-concave. We collect the definitions of these constants in Appendix A.
3 Main Algorithm
We first present an overview of our learning algorithm, followed by the specification of all the hyper-parameters used therein. We then describe in detail the attribute-efficient outlier removal scheme, which is the core technique of the paper.
3.1 Overview
Our main algorithm, namely Algorithm 1, is based on the celebrated margin-based active learning framework [BBZ07]. The key observation is that a good classifier can be learned by concentrating on fitting only the most informative labeled instances, as measured by their closeness to the current decision boundary (the closer, the more informative). In our algorithm, the sampling region is the entire instance space at the first phase, and is the band around the current iterate at every later phase. Once we obtain the working set of instances, we perform a pruning step that removes all instances with large ℓ∞-norm. This is motivated by our analysis, which shows that, with high probability, all clean instances in the band have small ℓ∞-norm provided that Assumption 1 is satisfied. Since the oracle may output dirty instances, we design an attribute-efficient soft outlier removal procedure, which aims to find proper weights for all instances in the working set such that the clean instances (i.e. those drawn from D) carry overwhelming weight compared to the dirty ones. Equipped with the learned weights, it is possible to minimize the reweighted hinge loss to obtain a refined halfspace. However, this would lead to a suboptimal label complexity, since we would have to query the label of every instance in the working set. Our remedy is to randomly sample a few points from the working set according to their importance, which is crucial for obtaining near-optimal label complexity.
When minimizing the hinge loss, we carefully construct the constraint set with three properties. First, it has an ℓ2-norm constraint. As a useful fact about isotropic log-concave distributions, the ℓ2-distance to the underlying halfspace is of the same order as the error rate. Thus, if we were able to ensure that the target halfspace stays in the constraint set, we would be able to show that the error rate of the new iterate is as small as the radius of the ℓ2-ball. Second, the constraint set has an ℓ1-norm constraint, which is well known for its power to promote sparse solutions and to guarantee attribute-efficient sample complexity [Tib96, CDS98, CT05, PV13b]. Lastly, the ℓ2 and ℓ1 radii of the constraint set shrink by a constant factor in each phase; hence, when Algorithm 1 terminates, the radius of the ℓ2-ball will be as small as O(ε). Notably, [Zha18] also utilizes such a constraint set for active learning of sparse halfspaces, but only under label noise.
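For concreteness, the constrained minimization we have in mind can be written as follows; this is a hedged sketch in the spirit of [Zha18], where the symbols v_k, w_{k−1}, r_k, ρ_k, τ_k, S_k and the centering of the ℓ2-ball at the previous iterate reflect our reading of the construction described above rather than a verbatim restatement of Algorithm 1:

\[
v_k \in \operatorname*{argmin}_{v \in K_k} \; \frac{1}{|S_k|} \sum_{(x, y) \in S_k} \ell_{\tau_k}(v; x, y),
\qquad
K_k = \bigl\{ v \in \mathbb{R}^d : \|v - w_{k-1}\|_2 \le r_k, \; \|v\|_1 \le \rho_k \bigr\},
\]

where S_k is the labeled set obtained in the random sampling step and the radii r_k and ρ_k shrink by a constant factor from phase to phase.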
The last step in Algorithm 1 is to perform hard thresholding on the solution, followed by ℓ2-normalization. Roughly speaking, these two steps produce an iterate consistent with the structure of the target (i.e. the iterate is guaranteed to belong to the concept class C), and, more importantly, they are useful in showing that the target halfspace lies in the constraint set at every phase.
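Putting the steps of this overview together, a schematic sketch of the main loop might look as follows. All helper names, their signatures, and the schedule interface are assumptions made purely for illustration; in particular, soft_outlier_removal and minimize_weighted_hinge are stand-ins for Algorithm 2 and the constrained hinge-loss minimization, not implementations of them.

import numpy as np

def hard_threshold(v, s):
    # keep the s largest-magnitude coordinates of v and zero out the rest
    u = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    u[keep] = v[keep]
    return u

def learn_sparse_halfspace(sample_instance, query_label, s, num_phases, schedule,
                           soft_outlier_removal, minimize_weighted_hinge, seed=0):
    # schedule(k) is assumed to return the phase-k parameters: band width b,
    # ell_infty pruning radius R, unlabeled set size n, and label budget m.
    rng = np.random.default_rng(seed)
    w = None                                            # phase 1: sample from the whole space
    for k in range(1, num_phases + 1):
        b, R, n, m = schedule(k)
        T = []
        while len(T) < n:                               # localized sampling within the band
            x = sample_instance()
            if w is None or abs(float(np.dot(w, x))) <= b:
                T.append(x)
        T = [x for x in T if np.max(np.abs(x)) <= R]    # pruning: drop instances with large ell_infty-norm
        q = soft_outlier_removal(T, w, k)               # weights under which clean instances dominate
        idx = rng.choice(len(T), size=m, p=q)           # importance sampling; only these labels are queried
        S = [(T[i], query_label(T[i])) for i in idx]
        v = minimize_weighted_hinge(S, w, k)            # constrained hinge-loss minimization
        v = hard_threshold(v, s)                        # enforce s-sparsity
        w = v / np.linalg.norm(v)                       # renormalize to the unit sphere
    return w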
3.2 Hyper-parameter setting
We elaborate on the hyper-parameter setting used in Algorithm 1 and in our analysis. The setting is driven by a handful of absolute constants that are specified in Appendix A. In particular, there exists an absolute constant satisfying the required inequality, since the relevant continuous function vanishes in the limit and all quantities involved are absolute constants. Given this constant, we set, for each phase, the band width, the ℓ2- and ℓ1-radii of the constraint set, and the hinge-loss scale; as described in Section 3.1, the band width and the radii shrink by a constant factor from one phase to the next.
We also fix two further constants used throughout the analysis; in particular, all of the per-phase hinge-loss scales are bounded from below in terms of them. Our theoretical guarantee holds for any noise rate η ≤ c·ε for a sufficiently small absolute constant c.
We set the total number of phases of Algorithm 1 to be logarithmic in 1/ε. Consider any phase k. We write n_k for the size of the unlabeled instance set acquired in that phase; we will show that, by making sufficiently many calls to the instance oracle, Algorithm 1 is guaranteed to obtain such a set in each phase with high probability. We write m_k for the size of the labeled instance set, which is also the number of calls to the label oracle in phase k. Note that the total number of calls to the instance oracle, summed over all phases, is the sample complexity of Algorithm 1, and the sum of the m_k's is its label complexity.
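As an illustration of the geometric schedule (the exact constants and logarithmic factors are those specified in this section and Appendix A and are not reproduced here; the halving, the sqrt(s) relation between the ℓ1 and ℓ2 radii, and the proportionality of the hinge scale to the band width are assumptions made for illustration only):

import math

def phase_parameters(k, s):
    # illustrative phase-k schedule: band width and ell_2 radius shrink by a
    # constant factor per phase; the ell_1 radius tracks sqrt(s) times the ell_2
    # radius (natural for s-sparse targets); the hinge scale tracks the band width
    b_k = 2.0 ** (-k)             # band width (assumption: halving per phase)
    r_k = 2.0 ** (-k)             # ell_2 radius of the constraint set
    rho_k = math.sqrt(s) * r_k    # ell_1 radius (assumption)
    tau_k = b_k / 8.0             # hinge-loss scale (assumption: proportional to b_k)
    return b_k, r_k, rho_k, tau_k

num_phases = math.ceil(math.log2(1.0 / 0.01))   # e.g. O(log(1/eps)) phases for eps = 0.01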
3.3 Attribute and computationally efficient soft outlier removal
Our soft outlier removal procedure is inspired by [ABL17]. We first briefly describe their main idea, then introduce a natural extension of their approach to the high-dimensional regime and show why it fails, and lastly present our novel outlier removal scheme.
To ease the discussion, we decompose the working set into the set of clean instances (those drawn from D) and the set of dirty instances (those supplied by the adversary). Ideally, we would like to find a weighting function q that equals 1 on every clean instance and 0 on every dirty instance. If a certain fraction of the instances in the working set are dirty, then the total weight should be allowed to be as large as the number of clean instances in order to include such an ideal function. On the other hand, we must restrict the weight assigned to dirty instances; namely, we need to characterize under what conditions the dirty instances can be distinguished from the clean ones. The key observation made in [KLS09] and [ABL17] is that if the dirty instances are to noticeably deteriorate the hinge loss (which is the purpose of the adversary), they must induce a variance (following [ABL17], we slightly abuse the word “variance” without subtracting the squared mean) that is orders of magnitude larger than that of the clean instances in the direction of some particular halfspace. Thus, it suffices to find a proper weight for each instance such that the reweighted variance is as small as that of the clean instances for all feasible halfspaces. It then remains to resolve two questions: 1) how many instances do we need to draw in order to guarantee the existence of such a weighting function q; and 2) how do we find a feasible function in polynomial time?
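The quantity being controlled is simply a weighted second moment along a direction; a short helper (using the “variance” convention above, i.e. without subtracting the squared mean) makes this explicit:

import numpy as np

def reweighted_variance(q, X, w):
    # sum_i q_i * <w, x_i>^2, where q is the weight vector and the rows of X are the instances
    return float(np.dot(q, (X @ w) ** 2))

# The soft outlier removal step looks for weights q under which this quantity is
# comparable to the clean instances' variance for every feasible direction w.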
If label complexity were our only objective, we could have used the soft outlier removal procedure of [ABL17] directly, i.e. without localizing the concept space in the ℓ1-norm, which in conjunction with the ℓ1-norm-constrained hinge loss minimization of [Zha18] would already yield an attribute-efficient label complexity, but with a sample complexity scaling linearly with the dimension d. Since we would also like to optimize the learner's sample complexity by utilizing the sparsity assumption, we need an attribute-efficient outlier removal procedure.
3.3.1 A natural approach and why it fails
It is well known that incorporating an ℓ1-norm constraint often leads to a sample complexity sublinear in the dimension [Zha02, KST08]. Thus, a natural approach to attribute-efficient outlier removal is to restrict the variance constraint to halfspaces lying in an ℓ1-ball of some carefully chosen radius. With this new localized concept space, it is possible to show that a sample size polynomial in s and polylogarithmic in d suffices to guarantee the existence of a weighting function q such that the reweighted variance is small over the entire localized concept space. However, on the computational side, for a given q, we would have to check the reweighted variance for all halfspaces in that space, which amounts to finding a global optimum of the following program:
maximize over w:   Σ_x q(x) ⟨w, x⟩²   subject to   ‖w‖₂ ≤ r,  ‖w‖₁ ≤ ρ,        (3.1)
where the sum ranges over the (pruned) working set and r and ρ denote the ℓ2- and ℓ1-radii of the localized concept space.
The above program is closely related to the problem of sparse principal component analysis (PCA) [ZHT06], and, unfortunately, finding a global optimum is known to be NP-hard [Ste05, TP14].
The weighting function q returned by Algorithm 2 is required to satisfy the following three constraints (stated here in words; these are the properties that the analysis in Section 4.2 relies on):
1. q(x) ∈ [0, 1] for every instance x in the (pruned) working set;
2. the total weight Σ_x q(x) is at least a 1 − ξ fraction of the size of the working set, where ξ upper bounds the fraction of dirty instances;
3. the reweighted variance proxy Σ_x q(x) x⊤Mx is at most the prescribed per-phase bound for every positive semidefinite matrix M in the localized set – an intersection of a trace-norm ball and an entrywise ℓ1-norm ball – introduced in Section 3.3.2 below.
3.3.2 Convex relaxation of sparse principal component analysis
Our goal is to find a weighting function q such that the objective value of (3.1) is at most the prescribed variance bound. To circumvent the computational intractability caused by the non-convexity of the objective, we consider an alternative formulation based on semidefinite programming (SDP), similar to the approach of [dGJL07]. First, write M = ww⊤, so that ⟨w, x⟩² = x⊤Mx. Due to our localized sampling and pruning scheme, every instance in the working set lies in the band and has bounded ℓ∞-norm. Thus, we only need to examine the maximum value of the reweighted variance over this bounded region. Now the technique of [dGJL07] comes in: the rank-one symmetric matrix ww⊤ is replaced by a new variable M that is only required to be positive semidefinite, and the ℓ2- and ℓ1-norm constraints on w are relaxed to trace-norm and entrywise ℓ1-norm constraints on M, respectively, as follows:
maximize over M:   Σ_x q(x) x⊤Mx   subject to   M ⪰ 0,  tr(M) ≤ r²,  ‖M‖_1 ≤ ρ²,        (3.2)
where the sum again ranges over the (pruned) working set and ‖M‖_1 denotes the entrywise ℓ1-norm.
The program (3.2) has two salient features: first, it is a semidefinite program that can be optimized efficiently [BV04]; second, if its optimal objective value is upper bounded by the prescribed variance bound, we immediately obtain that the reweighted variance is well controlled over the original (non-convex) constraint set. This is the theme of the following lemma.
Lemma 3.
Recall that Algorithm 1 sets the per-phase sample size n_k large enough that the required condition on the empirical second moment holds (see Appendix D.2); therefore, the above concentration bound holds with high probability. As a result, it is not hard to verify that the function q that assigns weight 1 to every clean instance and weight 0 to every dirty instance satisfies all three constraints in Algorithm 2. In other words, Lemma 3 establishes the existence of a feasible function for Algorithm 2. Furthermore, observe that the optimization problem of finding a feasible q in Algorithm 2 is a semi-infinite linear program. For a given candidate q, we can construct an efficient separation oracle as follows: it first checks whether q violates the first two constraints; if not, it checks the last constraint by invoking a polynomial-time SDP solver to find the maximum objective value of (3.2). It is well known that, equipped with such a separation oracle, Algorithm 2 returns a desired function in polynomial time via the ellipsoid method [GLS12, Chapter 3].
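As an illustration of the SDP check performed inside the separation oracle, the following sketch solves a relaxation in the spirit of (3.2) with cvxpy; the library choice, the function name, and the way the radii enter through trace_bound and l1_bound are our assumptions, with the paper's exact values coming from Section 3.2.

import cvxpy as cp
import numpy as np

def max_variance_proxy(X, q, trace_bound, l1_bound):
    # X: n-by-d matrix of (pruned) instances; q: nonnegative weights of length n.
    # The rank-one matrix w w^T is relaxed to a PSD variable M with a trace bound
    # (relaxing ||w||_2) and an entrywise ell_1 bound (relaxing ||w||_1), as in [dGJL07].
    d = X.shape[1]
    M = cp.Variable((d, d), PSD=True)
    S2 = X.T @ (q[:, None] * X)                            # sum_i q_i x_i x_i^T
    objective = cp.Maximize(cp.sum(cp.multiply(S2, M)))    # = sum_i q_i x_i^T M x_i since S2 is symmetric
    constraints = [cp.trace(M) <= trace_bound, cp.sum(cp.abs(M)) <= l1_bound]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return problem.value     # compared against the prescribed variance bound (Constraint 3)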
3.3.3 Comparison to prior works
We remark that our setting of n_k results in a per-phase sample complexity that is quadratic in s and polylogarithmic in d (see the formal statement in Lemma 6), which implies a total sample complexity of Õ((s²/ε) · polylog(d, 1/ε)). When s is substantially smaller than √d, this considerably improves upon the sample complexity, linear in the dimension d, that would follow from naively applying the soft outlier removal procedure of [ABL17].
We remark on three crucial technical differences from [DKS18] and [BDLS17]. First, we progressively restrict the variance used to identify dirty instances: the variance upper bound starts at a constant at the beginning of Algorithm 1 and decreases geometrically with the phase (see our settings of the band width and the hinge-loss scale), while in [DKS18, BDLS17] and many of their follow-up works it is typically fixed throughout. Second, we control the variance locally, i.e. we only require a small variance over a localized instance space (the band) and a localized concept space (the intersection of ℓ2- and ℓ1-balls). Third, the small variance is used to robustly estimate the hinge loss in our work, while in [DKS18] it was utilized to approximate the Chow parameters. All of these problem-specific designs of the outlier removal step are vital for obtaining the first near-optimal guarantees on attribute efficiency and label efficiency for learning sparse halfspaces.
4 Performance Guarantee
In the following, we always presume that the underlying halfspace is parameterized by w*, which is s-sparse and has unit ℓ2-norm. This condition may not be explicitly stated in our analysis.
Our main theorem is as follows. We note that there are two sources of randomness in Algorithm 1: the random draws of instances from the instance oracle, and the random sampling step (i.e. Step 8); the probability is taken over all the randomness in the algorithm.
Theorem 4.
Algorithm 1 can be straightforwardly modified to work in the passive learning setting, where the learner has direct access to a labeled-instance oracle. The modified algorithm calls this oracle to obtain an instance together with its label whenever Algorithm 1 calls the (unlabeled) instance oracle. In particular, for the passive learning algorithm, the working set is always a labeled instance set, and there is no need to query labels in the random sampling step.
We have the following simple corollary, which is an immediate consequence of Theorem 4.
Corollary 5.
We need an ensemble of new results to prove Theorem 4. Specifically, we propose new techniques to control the sample and computational complexity of soft outlier removal, and a new analysis of the label complexity that makes full use of the localization in the instance and concept spaces. We elaborate on these in the following, and sketch the proof of Theorem 4 at the end of this section.
4.1 Localized sampling in the instance space
Localized sampling, also known as margin-based active learning, is a useful technique proposed in [BBZ07]. Interestingly, under isotropic log-concave distributions, [BL13] showed that if the band width is chosen large enough, the region outside the band can be safely “ignored”, in the sense that if the current iterate is close enough to w*, it is guaranteed to incur a small error rate there. Motivated by this elegant finding, theoretical analyses in the literature are often dedicated to bounding the error rate within the band, and it is now well understood that a constant error rate within the band suffices to ensure significant progress in each phase [ABHU15, ABL17, Zha18]. We follow this line of reasoning, and our technical contribution is to show how to obtain such a constant error rate with near-optimal label complexity and noise tolerance.
Our analysis relies on the condition that the working set contains sufficiently many instances. Specifically, in order to collect enough instances to form the working set, we need to call the instance oracle a sufficient number of times, since our sampling is localized within the band. The following lemma characterizes the sample complexity at phase k.
4.2 Attribute and computationally efficient soft outlier removal
We summarize the performance guarantee of Algorithm 2 in the following proposition.
Proposition 7.
Again, we emphasize that the key difference between our algorithm and that of [ABL17] lies in Constraint 3 of Algorithm 2: we require that the “variance proxy” of the reweighted instances be small for all positive semidefinite matrices lying in an intersection of a trace-norm ball and an entrywise ℓ1-norm ball. On the statistical side, this favorable constraint set, in conjunction with Adamczak's bound from the empirical processes literature [Ada08], yields sufficient uniform concentration of the variance proxy from a sample whose size is quadratic in s and only polylogarithmic in d. This significantly improves upon the dimension-dependent sample complexity established in [ABL17]. The detailed proof can be found in Appendix D.3.
Remark 1.
While in some standard settings a proper ℓ1-norm constraint alone suffices to guarantee the desired sample complexity bound in the high-dimensional regime [Wai09, KST08], we note that in order to establish near-optimal noise tolerance, the ℓ2-norm constraint on the concept space (hence the trace-norm constraint on M) is vital as well. Though eliminating it eases the search for a feasible function q, doing so leads to a suboptimal noise tolerance. Informally speaking, the per-phase error rate, which is expected to be a constant, is inherently proportional to the variance times the noise rate within the band. Without the trace-norm constraint, the variance would be a factor of s larger than before (since we would then have to use the entrywise ℓ1-radius as a proxy for the constraint set's radius measured in trace norm). This implies that the variance bound must be set a factor of s larger than before, which in turn means that the tolerable noise rate becomes a factor of s smaller. We refer the reader to Proposition 31 and Lemma 36 for details.
Remark 2.
The quantity n_k has a quadratic dependence on the sparsity parameter s. Such quadratic dependence cannot be avoided in some related sparse PCA problems [BR13], but it is not clear whether it is optimal in our case. We leave this investigation to future work.
Next, we describe the statistical property of the distribution obtained by normalizing the weights returned by Algorithm 2. Observe that the noise rate within the band may exceed η by a factor inversely proportional to the band width, since the probability mass of the band is proportional to its width – an important property of isotropic log-concave distributions. Also, it is possible to show that the variance of clean instances along directions in the localized concept space is suitably small (see Lemma 16). Therefore, Algorithm 2 is essentially searching for a weighting under which the clean instances have overwhelming weight over the dirty instances, and under which the variance of the weighted instances is comparable to that of the clean instances. Recall that the clean subset of the working set consists of the instances drawn from D; let the corresponding unrevealed labeled set be the one in which each of these instances is correctly annotated by w*. The following proposition, which is similar to Lemma 4.7 of [ABL17] but refined, states that the reweighted hinge loss is a good proxy for the hinge loss evaluated exclusively on the clean labeled instances.
Proposition 8.
Note that though this proposition is phrased in terms of the hinge loss on labeled pairs, it is only used in the analysis; our algorithm does not require knowledge of these labels – indeed, it does not even need to identify the set of clean instances. As a result, the size of this set does not count towards our label complexity. Proposition 7 together with Proposition 8 implies that, with high probability, Algorithm 2 produces the desired probability distribution in polynomial time, which justifies its computational and statistical efficiency.
In addition, consider the expected hinge loss with respect to the band-conditional distribution. The following result links it to the empirical hinge loss on the clean instances.
4.3 Attribute and label-efficient empirical risk minimization
In light of Proposition 8, one may want to find the next iterate by minimizing the reweighted hinge loss directly. This, however, requires collecting labels for all instances in the working set, which leads to a suboptimal label complexity proportional to n_k. As a remedy, we perform a random sampling process, which draws instances from the working set according to the distribution induced by the learned weights and then queries only their labels, resulting in the labeled instance set used for empirical risk minimization. By standard uniform convergence arguments, the empirical hinge loss on this labeled set is expected to be close to the reweighted hinge loss provided that the number of sampled instances is large enough, as shown in the following proposition.
Proposition 10.
We remark that, when establishing the performance guarantee, the ℓ1-norm constraint on the hypothesis space, together with an ℓ∞-norm upper bound on the localized instance space, leads to a Rademacher complexity bound with a linear dependence on the sparsity s (up to logarithmic factors). Technically speaking, our analysis is more involved than that of [ABL17]: applying their analysis to the setting of learning sparse halfspaces, together with the fact that the VC dimension of the class of s-sparse halfspaces is O(s log(d/s)), would give a label complexity quadratic in s.
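A minimal sketch of the random sampling step described above (the names are ours; q is the probability vector obtained by normalizing the weights returned by Algorithm 2):

import numpy as np

def sample_labeled_set(T, q, m, query_label, seed=0):
    # draw m instances from T according to q (with replacement), then query only
    # the labels of the drawn instances -- this is what is charged to the label complexity
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(T), size=m, replace=True, p=q)
    return [(T[i], query_label(T[i])) for i in idx]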
4.4 Uniform concentration for unbounded data
Our analysis involves building uniform concentration bounds. The primary issue with applying standard concentration results, e.g. Theorem 1 of [KST08], is that under an isotropic log-concave distribution the instances are not contained in any pre-specified bounded ball with probability 1. [ABL17, Zha18] construct a conditional distribution under which the data are bounded from above, and then measure the difference between this conditional distribution and the original one. We circumvent such technical complications by using Adamczak's bound [Ada08] from the empirical process literature, which provides a generic way to derive concentration inequalities for well-behaved distributions with unbounded support. See Appendix C for a concrete treatment.
4.5 Proof sketch of Theorem 4
Proof.
We first show that the error rate of the hinge-loss minimizer on the band-conditional distribution is a constant, and that the same holds for the iterate produced at the end of the phase, since hard thresholding and ℓ2-normalization can only change the error rate by a constant factor. Observe that, in light of Proposition 8, Proposition 9, and Proposition 10, the empirical objective uniformly approximates the expected hinge loss on clean data over the entire constraint set. Therefore, if w* lies in the constraint set, then by the optimality of the minimizer its expected hinge loss within the band is bounded by a constant, where the last step uses Lemma 3.7 of [ABL17]. Since the hinge loss always upper bounds the 0/1 error, the constant error rate within the band follows. Next, we use the analysis framework of margin-based active learning to show that such a constant error rate ensures that the angle between the new iterate and w* shrinks by a constant factor per phase, which in turn implies the corresponding bound on their ℓ2-distance. It remains to show that w* also satisfies the ℓ1-constraint of the next phase; this follows from the s-sparsity of w* and the definition of the ℓ1-radius. Hence, we conclude that w* lies in the constraint set at every phase. Finally, observe that the ℓ2-radius at the last phase is as small as O(ε), which, by a basic property of isotropic log-concave distributions, implies that the error rate of the final output on D is at most ε.
The sample and label complexity bounds follow from our settings of n_k and m_k, summed over all phases, together with the fact that the probability mass of the phase-k band is proportional to its width. See Appendix D.5 for the full proof. ∎
5 Conclusion and Open Questions
We have presented a computationally efficient algorithm for learning sparse halfspaces under the challenging malicious noise model. Our algorithm leverages the well-established margin-based active learning framework, with particular attention to attribute efficiency, label complexity, and noise tolerance. We have shown that our guarantees on label complexity and noise tolerance are near-optimal, and that the sample complexity of a passive-learning variant of our algorithm is attribute-efficient, thanks to the set of new techniques proposed in this paper.
We raise three open questions for further investigation. First, as discussed in Section 4.2, the sample complexity needed for concentration of the variance proxy has a quadratic dependence on s. It would be interesting to study whether this is a fundamental limit of learning under isotropic log-concave distributions, or whether it can be improved by a more sophisticated localization scheme in the instance and concept spaces. Second, while isotropic log-concave distributions bear favorable properties that fit perfectly into the margin-based framework, it would be interesting to examine whether the established results can be extended to heavy-tailed distributions. Such distributions may lead to a large error rate within the band that cannot be controlled at a constant level, and new techniques must be developed. Finally, it would be interesting to design computationally more efficient algorithms, e.g. stochastic gradient descent-type algorithms similar to [DKM05], with comparable statistical guarantees.
References
- [ABHU15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Proceedings of the 28th Annual Conference on Learning Theory, pages 167–190, 2015.
- [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Annual Conference on Learning Theory, pages 152–192, 2016.
- [ABL17] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50:1–50:27, 2017.
- [Ada08] Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13(34):1000–1034, 2008.
- [BB08] Petros Boufounos and Richard G. Baraniuk. 1-bit compressive sensing. In Proceedings of the 42nd Annual Conference on Information Sciences and Systems, pages 16–21, 2008.
- [BBZ07] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35–50, 2007.
- [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, pages 169–212, 2017.
- [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
- [BEK02] Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.
- [BFKV96] Avrim Blum, Alan M. Frieze, Ravi Kannan, and Santosh S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 330–338, 1996.
- [BFN+17] Richard G. Baraniuk, Simon Foucart, Deanna Needell, Yaniv Plan, and Mary Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
- [BH12] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
- [BHL95] Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50(1):32–40, 1995.
- [BHLZ16] Alina Beygelzimer, Daniel J. Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3342–3350, 2016.
- [BL13] Maria-Florina Balcan and Philip M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of The 26th Annual Conference on Learning Theory, pages 288–316, 2013.
- [Blu90] Avrim Blum. Learning boolean functions in an infinite attribute space. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 64–72, 1990.
- [BM02] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
- [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory, pages 1046–1066, 2013.
- [Bsh98] Nader H. Bshouty. A new composition theorem for learning algorithms. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, pages 583–589, 1998.
- [BV04] Stephen P. Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- [CAL94] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
- [CCG11] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 83(1):71–102, 2011.
- [CDF+99] Nicolò Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir, and Hans Ulrich Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684–719, 1999.
- [CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
- [CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
- [CW08] Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
- [Dan15] Amit Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Annual Conference on Learning Theory, volume 40, pages 484–502, 2015.
- [dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
- [DGS12] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.
- [DGT19] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 4751–4762, 2019.
- [DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
- [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Zheng Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. CoRR, abs/1604.06443, 2016.
- [DKM05] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual Conference on Learning Theory, pages 249–263, 2005.
- [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, pages 1061–1073, 2018.
- [DKTZ20] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, volume 125, pages 1486–1513, 2020.
- [Don06] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- [Dud14] Richard M. Dudley. Uniform central limit theorems, volume 142. Cambridge University Press, 2014.
- [Fel14] Vitaly Feldman. Open problem: The statistical query complexity of learning sparse halfspaces. In Proceedings of The 27th Annual Conference on Learning Theory, volume 35, pages 1283–1289, 2014.
- [FF08] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605–2637, 2008.
- [FL01] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
- [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
- [Han11] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
- [Han14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.
- [KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11–20, 2005.
- [KL88] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 267–280, 1988.
- [KLMZ17] Daniel M. Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In Chris Umans, editor, Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.
- [KLS09] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
- [KMT93] Sanjeev R. Kulkarni, Sanjoy K. Mitter, and John N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
- [KS04] Adam R. Klivans and Rocco A. Servedio. Toward attribute efficient learning of decision lists and parities. In Proceedings of the 17th Annual Conference on Learning Theory, pages 224–238, 2004.
- [KSS92] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In David Haussler, editor, Proceedings of the 5th Annual Conference on Computational Learning Theory, pages 341–352, 1992.
- [KST08] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, pages 793–800, 2008.
- [Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm (extended abstract). In Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pages 68–77, 1987.
- [Lon95] Philip M. Long. On the sample complexity of PAC learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
- [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 665–674, 2016.
- [LS06] Philip M. Long and Rocco A. Servedio. Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 921–928, 2006.
- [LV07] László Lovász and Santosh S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007.
- [MN06] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326–2366, 2006.
- [PV13a] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
- [PV13b] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
- [PV16] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
- [Ros58] Frank Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386–408, 1958.
- [RWY11] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
- [Sch92] Robert E. Schapire. Design and analysis of efficient learning algorithms. MIT Press, Cambridge, MA, USA, 1992.
- [Ser99] Rocco A. Servedio. Computational sample complexity and attribute-efficient learning. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 701–710, 1999.
- [SL17a] Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pages 3115–3124, 2017.
- [SL17b] Jie Shen and Ping Li. Partial hard thresholding: Towards a principled analysis of support recovery. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 3127–3137, 2017.
- [SL18] Jie Shen and Ping Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1–42, 2018.
- [Slo88] Robert H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 91–96, 1988.
- [Slo92] Robert H. Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, page 450, 1992.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- [Ste05] Daureen Steinberg. Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel University of Technology, 2005.
- [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [TP14] Andreas M. Tillmann and Marc E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2014.
- [TW10] Joel A. Tropp and Stephen J. Wright. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948–958, 2010.
- [Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 560–566, 1985.
- [vdGL13] Sara van de Geer and Johannes Lederer. The Bernstein-Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157:225–250, 2013.
- [VDVW96] Aad W Van Der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.
- [Vem10] Santosh S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):32:1–32:14, 2010.
- [Wai09] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
- [XZS+17] Yichong Xu, Hongyang Zhang, Aarti Singh, Artur Dubrawski, and Kyle Miller. Noise-tolerant interactive learning using pairwise comparisons. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 2431–2440, 2017.
- [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 1056–1066, 2017.
- [Zha02] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
- [Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Proceedings of the 31st Annual Conference On Learning Theory, pages 1856–1880, 2018.
- [ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
- [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. CoRR, abs/2002.04840, 2020.
- [ZYJ14] Lijun Zhang, Jinfeng Yi, and Rong Jin. Efficient algorithms for robust one-bit compressive sensing. In Proceedings of the 31st International Conference on Machine Learning, pages 820–828, 2014.
Appendix A Detailed Choices of Reserved Constants and Additional Notations
Constants.
The absolute constants appearing in Lemma 12 and Lemma 13 are specified in those lemmas, and two further constants were already fixed in Section 3.2. The constants used in Lemma 14, Lemma 17, and Lemma 18 are defined there, respectively. One absolute constant acts as an upper bound on all of the former, in accordance with our choice in Section 3.2, and another is defined in Lemma 16. Other absolute constants are not crucial to our analysis or algorithmic design, and we therefore do not track their definitions. Subscripted variants of c, e.g. c₁ and c₂, are also absolute constants, but their values may change from appearance to appearance. We remark that the values of all these constants do not depend on the particular distribution D chosen by the adversary, but only on the knowledge that D is isotropic log-concave.
Pruning.
Consider Algorithm 1. In each phase, we sample a working set and remove all instances that have large ℓ∞-norm (Step 6), which is equivalent to intersecting the working set with an ℓ∞-ball of an appropriate radius. This step is motivated by Lemma 18, which states that, with high probability, all clean instances in the working set lie in this ℓ∞-ball. Specifically, partition the working set into its clean part (the instances drawn from D) and its dirty part (the instances supplied by the adversary). Lemma 18 implies that, with high probability, every clean instance survives the pruning step, so only dirty instances may be removed; the pruned set decomposes accordingly into clean and dirty parts. We finally associate with each clean set its unrevealed labeled counterpart, in which every instance is annotated by w*. The notation is summarized below.
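The pruning step itself is a one-liner; the radius R below stands for the ℓ∞-threshold set by the algorithm, whose exact value is part of the hyper-parameter setting and is not reproduced here.

import numpy as np

def prune(working_set, R):
    # keep only instances whose ell_infty-norm is at most R; by Lemma 18, with high
    # probability no clean instance in the band is discarded by this step
    return [x for x in working_set if np.max(np.abs(x)) <= R]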
- the instance set obtained by calling the instance oracle, conditioned on the band;
- its clean subset, i.e. the instances that the adversary draws from the distribution D;
- its dirty subset, i.e. the remaining (adversarially supplied) instances;
- the subsets of each of the above that survive the pruning step, i.e. that lie in the ℓ∞-ball;
- the unrevealed labeled counterparts of the clean sets, in which every instance is annotated by w*.
Regularity condition on .
We frequently work with the conditional distribution obtained by conditioning D on the event that the instance lies in the band. We introduce the following regularity condition to ease our terminology.
Definition 11.
A band-conditional distribution, determined by a reference vector and a band width, is said to satisfy the regularity condition if one of the following holds: 1) the reference vector has unit ℓ2-norm and the band width lies in the admissible range fixed by our parameter setting; 2) the reference vector is the zero vector (in which case the conditioning is vacuous).
In particular, at each phase of Algorithm 1, the reference vector is set to the iterate from the previous phase and the band width to the current phase's value. At the first phase, the reference vector is the zero vector, so the conditional distribution boils down to D itself. For all later phases, the reference vector is a unit vector and the band width lies in the admissible range in view of our construction. Therefore, the conditional distributions used in all phases satisfy the regularity condition.
Appendix B Useful Properties of Isotropic Log-Concave Distributions
We record some useful properties of isotropic log-concave distributions.
Lemma 12.
There are absolute constants such that the following holds for all isotropic log-concave distributions D with density function f. We have:
1. Orthogonal projections of D onto subspaces of R^d are isotropic log-concave;
2. If d = 1, then the probability mass that D assigns to any interval is at most proportional to the interval's length;
3. If d = 1, then the density f is lower bounded by an absolute constant on an interval of constant length around the origin;
4. For any two unit vectors, the probability that the corresponding homogeneous halfspaces disagree on an instance drawn from D is at most a constant multiple of the angle between the vectors;
5. A one-dimensional isotropic log-concave random variable has an exponentially decaying tail.
The following lemma is implied by the proof of Theorem 21 of [BL13], which shows that if we choose a proper band width, the error incurred outside the band will be small. This observation is crucial for controlling the error over the distribution D, and has been broadly used in the literature [ABL17, Zha18].
Lemma 13 (Theorem 21 of [BL13]).
There are absolute constants such that the following holds for all isotropic log-concave distributions . Let and be two unit vectors in and assume that . Then for any , we have
Lemma 14 (Lemma 20 of [ABHZ16]).
There is an absolute constant such that the following holds for all isotropic log-concave distributions . Draw i.i.d. instances from to form a set . Then
Lemma 15.
There is an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition:
Proof.
When is a unit vector, Lemma 3.4 of [ABL17] shows that there exists a constant such that
When is a zero vector, reduces to and the constraint reads as . Thus we have
The proof is complete by choosing . ∎
Lemma 16.
There is an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition:
where .
Proof.
Since is a positive semidefinite matrix with trace norm at most , it has eigendecomposition , where are the eigenvalues such that , and ’s are orthonormal vectors in . Thus,
Since is drawn from , we have . Moreover, applying Lemma 15 with the setting of implies that
Therefore,
The proof is complete by choosing . ∎
Lemma 17.
Let . Then for all isotropic log-concave distributions and all satisfying the regularity condition,
1. the probability mass that D assigns to the band is at least a constant multiple of the band width;
2. for any event, its probability under the band-conditional distribution is at most its probability under D divided by a constant multiple of the band width.
Proof.
We first consider the case that is a unit vector.
For the lower bound, Part 3 of Lemma 12 shows that the density function of the random variable is lower bounded by when . Thus
where in the last inequality we use the condition .
For any event , we always have
Now we consider the case that is the zero vector and . Then in view of the choice . Thus Part 2 still follows. The proof is complete. ∎
Lemma 18.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. instances drawn from . Then
Appendix C Orlicz Norm and Concentration Results using Adamczak’s Bound
The following notion of Orlicz norm [vdGL13, Dud14] is useful for handling random variables whose tails decay as exp(−t^α) for general values of α, beyond α = 2 (subgaussian) and α = 1 (subexponential).
Definition 19 (Orlicz norm).
For any α ≥ 1, let ψ_α denote the corresponding Orlicz function. Furthermore, for a random variable X and α ≥ 1, define ‖X‖_{ψ_α}, the Orlicz norm of X with respect to ψ_α, as follows.
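In standard form (cf. [vdGL13, Dud14]), which we take to be the definition intended here:

\[
\psi_\alpha(t) = \exp(t^\alpha) - 1 \quad (t \ge 0), \qquad
\|X\|_{\psi_\alpha} = \inf\Bigl\{ C > 0 : \mathbb{E}\bigl[\psi_\alpha\bigl(|X|/C\bigr)\bigr] \le 1 \Bigr\},
\]

so that α = 2 recovers the subgaussian norm and α = 1 the subexponential norm.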
We collect some basic facts about Orlicz norms in the following lemma; they can be found in Section 1.3 of [VDVW96].
Lemma 20.
Let , , be real-valued random variables. Consider the Orlicz norm with respect to . We have the following:
1. is a norm. For any , ; .
2. where .
3. For any , .
4. If for any , then .
5. If , then for all , .
The following auxiliary results, tailored to the localized sampling scheme in Algorithm 1, will also be useful in our analysis.
Lemma 21.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of instances drawn from . Then
Consequently,
Proof.
Let be an isotropic log-concave random variable in . Part 5 of Lemma 12 shows that for all ,
Fix and fix . Denote by the -th coordinate of . Part 1 of Lemma 12 suggests that is isotropic log-concave. Thus, by Part 2 of Lemma 17,
Taking the union bound over and , we have for all
Now Part 4 of Lemma 20 immediately implies that
for some constant . The second inequality of the lemma follows immediately by combining the above with Part 2 of Lemma 20. ∎
C.1 Adamczak’s bound
In this section, we establish the key concentration results that will be used to analyze the performance of soft outlier removal and random sampling in Algorithm 1. Since the underlying distribution is isotropic log-concave, the unlabeled instances are unbounded. This prevents us from using standard concentration bounds, e.g. [KST08]. We therefore appeal to the following generalization of Talagrand’s inequality, due to [Ada08].
Lemma 22 (Adamczak’s bound).
For any , there exists a constant , such that the following holds. Given any function class , and a function such that for any , , we have with probability at least over the draw of a set of i.i.d. instances from ,
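For orientation, since only the setting of the lemma is reproduced here, one commonly cited form of Adamczak's inequality reads as follows, up to absolute constants depending on $\alpha$ and with $F$ denoting the envelope function (the variance term and the constants in the version used in this paper may differ): with probability at least $1-\delta$,
\[
\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f \Big|
\;\lesssim\;
\mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|
+ \sigma \sqrt{\frac{\log(1/\delta)}{n}}
+ \frac{(\log(1/\delta))^{1/\alpha}}{n} \Big\| \max_{i \le n} F(X_i) \Big\|_{\psi_\alpha},
\]
where $\sigma^2 \ge \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X))$ and the $\varepsilon_i$ are i.i.d. Rademacher signs. The third term is what accommodates unbounded (e.g. log-concave) instances.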
We first establish the following result, which upper bounds the expected Rademacher complexity of linear classes in terms of the Orlicz norm of the random instances.
Lemma 23.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from . Denote . Let a sequence of random variables be drawn from a distribution supported on a bounded interval for some . Let , where the ’s are i.i.d. Rademacher random variables independent of and . We have:
Proof.
Let so that any can be expressed as for some . First, conditioned on and , we have that
Thus,
(C.1)
where the second inequality follows from Lemma 21.
On the other hand, using the fact that for any random variable , , we have
where in the equality we use the observation that when , and in the last inequality we use the condition that is drawn from . Combining the above with (C.1) we obtain the desired result. ∎
C.2 Uniform concentration of hinge loss
Proposition 24.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from which satisfies the regularity condition. Let for any . Denote and let . Then with probability ,
In particular, suppose , and . Then we have: for any , a sample size suffices to guarantee that with probability , .
Proof.
We will use Lemma 22 with function class and the Orlicz norm with respect to . We define . It can be seen that for every ,
That is, for every in , .
Step 1. We upper bound . Since is a norm, we have
(C.2)
where we applied Lemma 21 in the last inequality.
Step 3. Finally, we upper bound . Let where each is an i.i.d. draw from the Rademacher distribution. We have
(C.4)
In the above, the first inequality used standard symmetrization arguments; see, for example, Lemma 26.2 of [SSBD14]. In the second inequality, we used the contraction property of Rademacher complexity and the fact that can be seen as a -Lipschitz function applied on input . In the last inequality, we applied Lemma 23 with the fact that .
C.3 Uniform concentration of relaxed sparse PCA
Proposition 25.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from . Denote . Then with probability ,
In particular, suppose and . Then we have: for any , a sample size
suffices to guarantee that with probability , .
Proof.
Recall that . For any matrix , we denote by the -th entry of the matrix . For any vector , we denote by the -th coordinate of .
We will use Lemma 22 with function class and the Orlicz norm with respect to . Consider the function parameterized by . First, we wish to find a function that upper bounds . It is easy to see that
(C.5)
Thus it suffices to choose .
Step 2. Next we upper bound , where we remark that taking the supremum over is equivalent to taking it over . Since , we have
In view of Part 2 of Lemma 20, we have
(C.7)
where the last inequality follows from Lemma 21. Hence,
(C.8)
for some absolute constant .
Step 3. Finally, we upper bound . Let where the ’s are independent draws from the Rademacher distribution. By standard symmetrization arguments (see e.g. Lemma 26.2 of [SSBD14]), we have
(C.9)
We first condition on and consider the expectation over . For a matrix , we use to denote the vector obtained by concatenating all of the columns of ; likewise for . It is crucial to observe that with this notation, for any , we have . It follows that
where the second inequality is from Lemma 39, and the equality is from the observation that . Therefore,
where the second inequality follows from Part 2 of Lemma 20, and the last inequality follows from Lemma 21. In summary,
(C.10)
for some constant .
Appendix D Performance Guarantee of Algorithm 1
In this section, we leverage all the tools from previous sections to establish the performance guarantee of Algorithm 1. Our main theorem, Theorem 4, follows from the analysis of each step of the algorithm, as we describe below.
D.1 Analysis of sample complexity
Recall that we refer to the number of calls to as the sample complexity of Algorithm 1. In order to obtain instances residing in the band , we have to make sufficiently many calls to .
Lemma 26 (Restatement of Lemma 6).
Proof.
We want to ensure that by drawing instances from , with probability at least , out of them fall into the band . We apply the second inequality of Lemma 38 by letting and , and obtain
where the probability is taken over the event that we make a number of calls to . Thus, when , we are guaranteed that at least samples from fall into the band with probability . The lemma follows by observing . ∎
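To illustrate the counting argument in generic notation, suppose each call to the example oracle lands in the band with probability at least $p$, and we want $n$ in-band instances with probability $1-\delta$ (the symbols $p$, $n$, $N$, $\delta$ are placeholders, not the paper's exact parameters). Drawing
\[
N = \Big\lceil \frac{2n + 8\log(1/\delta)}{p} \Big\rceil
\]
instances suffices: the number of in-band instances has mean $\mu = Np \ge 2n + 8\log(1/\delta)$, so the multiplicative Chernoff bound $\Pr[X \le (1-\gamma)\mu] \le e^{-\gamma^2\mu/2}$ (which we take to be the second inequality of Lemma 38) with $\gamma = 1/2$ gives
\[
\Pr[\text{fewer than } n \text{ in-band instances}] \le \Pr[X \le \mu/2] \le e^{-\mu/8} \le \delta .
\]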
D.2 Analysis of pruning and the structure of
With the instance set on hand, we estimate the empirical noise rate after applying pruning (Step 6) in Algorithm 1. Recall that , i.e. the number of unlabeled instances before pruning.
Lemma 27.
Proof.
For an instance , we use to denote that is drawn from , and use to denote that is adversarially generated.
Lemma 28.
Suppose that Assumptions 1 and 2 are satisfied. Further assume . For any , if , then with probability over the draw of , the following results hold simultaneously:
1. and hence , i.e. all clean instances in are intact after pruning;
2. , i.e. the empirical noise rate after pruning is upper bounded by ;
3. .
In particular, with the hyper-parameter setting in Section 3.2, .
Proof.
Let us write events , . We bound the probability of the two events over the draw of .
Recall that Lemma 18 implies that with probability , all instances in are in the -ball for , which implies .
We next calculate the noise rate within the band by Lemma 27:
where the equality applies our setting of , the second inequality uses the condition and the setting , and the last inequality is guaranteed by our choice of . Now we apply the first inequality of Lemma 38 by specifying , therein, which gives
where the probability is taken over the draw of . This implies provided that .
By a union bound, we have . We show that on the event , the second and third parts of the lemma follow. To see this, we note that it trivially holds that since only dirty instances have a chance of being removed. This proves the second part. Also, it is easy to see that , which is exactly the third part. ∎
D.3 Analysis of Algorithm 2
Lemma 29 (Restatement of Lemma 3).
Proof.
The first part follows immediately by combining Proposition 25 and Lemma 16 and recognizing our setting of and .
To see the second part, for any , we can upper bound as follows:
where . Hence it is easy to see that lies in . This indicates that for any , there exists an such that
(D.1)
Thus,
where the last inequality follows from the fact . ∎
Proposition 30 (Formal statement of Proposition 7).
Proof.
Our choice of satisfies the condition since is lower bounded by a constant (see Section 3.2 for our parameter setting). Thus by Lemma 28, with probability , . We henceforth condition on this event.
On the other hand, Lemma 3 and Proposition 25 together imply that with probability , for all , we have
(D.2)
provided that
(D.3)
Note that (D.3) is satisfied in view of the aforementioned event along with the setting of and . By a union bound, the events (D.2) and hold simultaneously with probability at least .
Now we show that these two events together imply the existence of a feasible function for Algorithm 2. Consider a particular function with for all and for all . We immediately have
In addition, for all ,
(D.4)
where the first inequality follows from the fact and the second inequality follows from (D.2). Namely, such a function satisfies all the constraints in Algorithm 2. Finally, combining (D.1) and (D.4) gives Part 3.
It remains to show that for a given candidate function , a separation oracle for Algorithm 2 can be constructed in polynomial time. First, it is straightforward to check whether the first two constraints and are violated. If not, it remains to check whether there exists an such that . To this end, we solve the following program:
This is a semidefinite program that can be solved in polynomial time [BV04]. If the maximum objective value is greater than , we conclude that is not feasible; otherwise we have found a desired function. ∎
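As an illustration only, the separation step could be implemented along the following lines in Python with cvxpy; the feasible set below (positive semidefinite $M$ with trace at most one and entrywise $\ell_1$ norm at most a sparsity budget $k$) is our reading of the relaxed sparse PCA constraint of Appendix C.3, and all function and variable names are hypothetical rather than the paper's.

# Hypothetical sketch of the quadratic-constraint check; not the exact program of Algorithm 2.
import numpy as np
import cvxpy as cp

def most_violated_constraint(X, q, k, threshold):
    """X: (n, d) array of instances, q: (n,) candidate weights, k: sparsity budget."""
    n, d = X.shape
    S = (X * q[:, None]).T @ X                        # weighted second moment: sum_i q_i x_i x_i^T
    M = cp.Variable((d, d), PSD=True)                 # relaxed sparse PCA variable
    constraints = [cp.trace(M) <= 1, cp.sum(cp.abs(M)) <= k]
    value = cp.Problem(cp.Maximize(cp.trace(S @ M)), constraints).solve()
    if value > threshold:
        return M.value    # maximizer certifies a violated quadratic constraint
    return None           # q satisfies all quadratic constraints

If the optimal value exceeds the threshold, the maximizing $M$ supplies the separating hyperplane required by the ellipsoid method; otherwise the candidate weights are feasible.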
The analysis of the following proposition closely follows [ABL17] with a refined treatment. Let where is the unrevealed label of that the adversary has committed to.
Proposition 31 (Formal statement of Proposition 8).
Proof.
The choice of guarantees that Lemma 28 and Proposition 30 hold simultaneously with probability . We thus have for all
(D.5)
(D.6)
(D.7)
In the above expression, (D.5) and (D.6) follow from Part 3 and Part 2 of Lemma 29 respectively, and (D.7) follows from Lemma 28. It follows from Eq. (D.7) and that
(D.8)
In the following, we condition on the event that all these inequalities are satisfied.
Step 1. First we upper bound by .
(D.9)
where follows from the simple fact that
uses the fact that the hinge loss is always upper bounded by and that , follows from Part 2 of Proposition 30, applies the Cauchy-Schwarz inequality, and uses Eq. (D.6).
In view of Eq. (D.8), we have . Continuing Eq. (D.9), we obtain
(D.10)
where in the last inequality we use . On the other hand, we have the following result which will be proved later on.
Claim D.1.
Step 2. We move on to prove the second inequality of the proposition, i.e. using to upper bound . Let us denote by the probability mass on dirty instances. Then
(D.11)
where the first inequality follows from and Part 2 of Proposition 30, the second inequality follows from (D.7), and the last inequality is by our choice .
Note that by Part 2 of Proposition 30 and the choice , we have . Hence
(D.12)
where the last inequality holds because of (D.5). Thus,
With the result on hand, we bound as follows:
which proves the second inequality of the proposition.
Putting everything together. We would like to show . Indeed, this is guaranteed by our setting of in Section 3.2, which ensures that simultaneously fulfills the following three constraints:
This completes the proof. ∎
The following result is a simple application of Proposition 24. It shows that the loss evaluated on clean instances concentrates around the expected loss.
Proposition 32 (Restatement of Proposition 9).
D.4 Analysis of random sampling
Proposition 33 (Restatement of Proposition 10).
Proof.
Since we applied pruning to remove all instances with large -norm, this proposition can be proved by a standard concentration argument for uniform convergence of linear classes under distributions with bounded support. We include the proof for completeness.
Note that the randomness is taken over the i.i.d. draw of samples from according to the distribution over . Thus, for any , . Moreover, let . Any instance drawn from satisfies with probability . It is also easy to verify that
By Theorem 8 of [BM02] along with standard symmetrization arguments, we have that with probability at least ,
(D.13)
where denotes the Rademacher complexity of function class on the labeled set , and . In order to calculate , we observe that each function is a composition of and function class . Since is -Lipschitz, by the contraction property of Rademacher complexity, we have
(D.14)
Let where the ’s are i.i.d. draws from the Rademacher distribution, and let . We compute as follows:
where the first equality is by the definition of Rademacher complexity, the second equality simply decomposes as a sum of and , the third equality is by the fact that every has zero mean, and the inequality applies Lemma 39. We combine the above result with (D.13) and (D.14), and obtain that with probability ,
(D.15)
Recall that we remove all instances with large -norm in the pruning step of Algorithm 1. In particular, we have
Plugging this upper bound into (D.15) and using our hyper-parameter setting gives
for some constant . Hence,
suffices to ensure with probability . ∎
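For intuition about the quantity controlled above, the empirical Rademacher complexity of an $\ell_1$-constrained linear class can be estimated by Monte Carlo via the duality $\sup_{\|w\|_1 \le W_1} \langle v, w \rangle = W_1 \|v\|_\infty$. The sketch below is purely illustrative: the sample size, dimension, and $\ell_1$ radius are hypothetical, and Gaussian instances stand in for the pruned (bounded) sample.

import numpy as np

rng = np.random.default_rng(1)
n, d, W1 = 500, 1000, 5.0            # hypothetical sample size, dimension, and l1 radius
X = rng.standard_normal((n, d))      # stand-in for the pruned instances

estimates = []
for _ in range(200):
    sigma = rng.choice([-1.0, 1.0], size=n)                    # Rademacher signs
    # sup_{||w||_1 <= W1} (1/n) * sigma^T X w = (W1 / n) * ||X^T sigma||_inf
    estimates.append(W1 / n * np.max(np.abs(X.T @ sigma)))
print("estimated empirical Rademacher complexity:", np.mean(estimates))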
D.5 Analysis of Per-Phase Progress
Let .
Lemma 34 (Lemma 3.7 of [ABL17]).
Lemma 35.
For any , if , then with probability , .
Proof.
Observe that with the setting of , with probability over all the randomness in phase , Lemma 26, Proposition 31, Proposition 32, and Proposition 33 hold simultaneously. Now we condition on the event that all of these properties are satisfied, which implies that for all ,
(D.16)
We have
In the above, the first inequality follows from the fact that the hinge loss upper bounds the 0/1 loss; the remaining steps respectively apply (C.1), the definition of (see Algorithm 1), and our assumption that is feasible. The proof is complete in view of Lemma 34. ∎
Lemma 36.
For any , if , then with probability , .
Proof.
For , by Lemma 35 and the fact that we actually sample from , we have
Hence Part 4 of Lemma 12 indicates that
(D.17)
Now we consider . Denote , and . We will show that the error of on both and is small, hence is a good approximation to .
First, we consider the error on , which is given by
(D.18)
where the inequality is due to Lemma 35 and Lemma 17. Note that the inequality holds with probability .
Next we derive the error on . Note that Lemma 10 of [Zha18] states that for any unit vector and any general vector , . Hence,
Recall that we set in our algorithm and choose where , which allows us to apply Lemma 13 and obtain
This, combined with (D.18), gives
Recall that we set and denote by the coefficient of in the above expression. By Part 4 of Lemma 12,
(D.19)
Lemma 37.
For any , if , then .
Proof.
We first show that . Let . By algebra . Now we have
By the sparsity of and , and our choice , we always have
The proof is complete. ∎
D.6 Proof of Theorem 4
Proof.
We will prove the theorem with the following claim.
Claim D.2.
For any , with probability at least , is in .
Based on the claim, we immediately have that with probability at least , is in . By our construction of , we have
This, together with Part 4 of Lemma 12 and the fact that (see Lemma 10 of [Zha18]), implies
Finally, we derive the sample complexity and label complexity. Recall that was involved in Proposition 30, i.e. the quantity , where we required
It is also involved in Proposition 33, where we need
and since is a labeled subset of . As has a cubic dependence on , our final choice of is given by
(D.20)
This in turn gives
(D.21)
Therefore, by Lemma 26 we obtain an upper bound of the sample size at phase as follows:
where the last inequality follows from for all and our choice of . Consequently, the total sample complexity
Likewise, we can show that the total label complexity
It remains to prove Claim D.2 by induction. First, for , . Therefore, with probability . Now suppose that Claim D.2 holds for some , that is, there is an event that happens with probability , and on this event . By Lemma 36 we know that there is an event that happens with probability , on which . This further implies that in view of Lemma 37. Therefore, on the event , which happens with probability , we have . ∎
Appendix E Miscellaneous Lemmas
Lemma 38 (Chernoff bound).
Let be independent random variables that take values in . Let . For each , suppose that . Then for any
When , for any
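For reference, one standard multiplicative form of the bound for independent random variables taking values in $[0,1]$ with $\mathbb{E}[X] = \mu$ is
\[
\Pr\big[X \ge (1+\gamma)\mu\big] \le \exp\Big(-\frac{\gamma^2 \mu}{2+\gamma}\Big) \;\; \text{for all } \gamma > 0,
\qquad
\Pr\big[X \le (1-\gamma)\mu\big] \le \exp\Big(-\frac{\gamma^2 \mu}{2}\Big) \;\; \text{for all } \gamma \in (0,1);
\]
the constants in the version invoked in this paper may differ.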
Lemma 39 (Theorem 1 of [KST08]).
Let where the ’s are independent draws from the Rademacher distribution and let be given instances in . Then
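For reference, one standard bound of this type (we have not verified the exact constant used in [KST08]) follows from Massart's finite-class lemma applied to the $2d$ linear functionals $\pm e_j$:
\[
\mathbb{E}_{\sigma}\Big\| \sum_{i=1}^n \sigma_i x_i \Big\|_\infty
\;\le\; \sqrt{2\log(2d)} \, \max_{j \le d}\Big(\sum_{i=1}^n x_{ij}^2\Big)^{1/2}
\;\le\; \sqrt{2 n \log(2d)}\, \max_{i \le n} \|x_i\|_\infty .
\]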