
Attribute-Efficient Learning of Halfspaces with Malicious Noise: Near-Optimal Label Complexity and Noise Tolerance

Jie Shen
Stevens Institute of Technology
jie.shen@stevens.edu
   Chicheng Zhang
University of Arizona
chichengz@cs.arizona.edu
Abstract

This paper is concerned with computationally efficient learning of homogeneous sparse halfspaces in $\mathbb{R}^{d}$ under noise. Though recent works have established attribute-efficient learning algorithms under various types of label noise (e.g. bounded noise), it remains an open question when and how $s$-sparse halfspaces can be efficiently learned under the challenging malicious noise model, where an adversary may corrupt both the unlabeled examples and the labels. We answer this question in the affirmative by designing a computationally efficient active learning algorithm with near-optimal label complexity of $\tilde{O}(s\log^{4}\frac{d}{\epsilon})$ (we use the notation $\tilde{O}(f):=O(f\log f)$) and noise tolerance $\eta=\Omega(\epsilon)$, where $\epsilon\in(0,1)$ is the target error rate, under the assumption that the distribution over (uncorrupted) unlabeled examples is isotropic log-concave. Our algorithm can be straightforwardly tailored to the passive learning setting, and we show that its sample complexity is $\tilde{O}(\frac{1}{\epsilon}s^{2}\log^{5}d)$, which also enjoys attribute efficiency. Our main techniques include attribute-efficient paradigms for soft outlier removal and for empirical risk minimization, and a new analysis of uniform concentration for unbounded instances – all of them crucially take the sparsity structure of the underlying halfspace into account.

Keywords: halfspaces, malicious noise, passive and active learning, attribute efficiency

1 Introduction

This paper investigates the fundamental problem of learning halfspaces under noise [Val84, Val85]. In the absence of noise, this problem is well understood [Ros58, BEHW89]. However, the premise changes immediately when the unlabeled examples (we will also refer to unlabeled examples as instances in this paper) or the labels are corrupted by noise. In the last decades, various types of label noise have been extensively studied, and a plethora of polynomial-time algorithms have been developed that are resilient to random classification noise [BFKV96], bounded noise [Slo88, Slo92, MN06], and adversarial noise [KSS92, KKMS05]. Significant progress towards optimal noise tolerance has also been witnessed in the past few years [Dan15, ABHU15, YZ17, DGT19, DKTZ20]. In this regard, a surge of recent research interest is concentrated on further improving the performance guarantees by incorporating the structure of the underlying halfspace into algorithmic design. Of central interest is a property termed attribute efficiency, which proves to be useful when the data lie in a high-dimensional space [Lit87], or even in an infinite-dimensional space but with a bounded number of effective attributes [Blu90]. In the statistics and signal processing community, it is often referred to as sparsity, dating back to the celebrated Lasso estimator [Tib96, CDS98, CT05, Don06]. Recently, learning of sparse halfspaces in an attribute-efficient manner was highlighted as an open problem in [Fel14], and in a series of recent works [PV13b, ABHZ16, Zha18, ZSA20], this property was carefully explored for label-noise-tolerant learning of halfspaces with improved or even near-optimal sample complexity, label complexity, or generalization error, where the key insight is that such a structural constraint effectively controls the complexity of the hypothesis class [Zha02, KST08].

Compared to the rich set of positive results on attribute-efficient learning of sparse halfspaces under label noise, less is known when both instances and labels are corrupted. Specifically, under the $\eta$-malicious noise model [Val85, KL88], there is an unknown hypothesis $w^{*}$ and an unknown instance distribution $D$ selected from a certain family by an adversary. Each time, with probability $1-\eta$, the adversary returns an instance $x$ drawn from $D$ and the label $y=\operatorname{sign}(w^{*}\cdot x)$; with probability $\eta$, it is instead allowed to return an arbitrary pair $(x,y)\in\mathbb{R}^{d}\times\{-1,1\}$ that may depend on the state of the learning algorithm and the history of its outputs. Since this is a much more challenging noise model, only recently has an algorithm with near-optimal noise tolerance been established in [ABL17], although without attribute efficiency. It is worth noting that the problem of learning sparse halfspaces is also closely related to one-bit compressed sensing [BB08], where one is allowed to utilize any distribution $D$ over measurements for recovering the target hypothesis. However, even under such a strong condition, existing theory therein can only handle label noise [PV13a, ABHZ16, BFN+17]. This naturally raises two fundamental questions: 1) can we design attribute-efficient learning algorithms that are capable of tolerating malicious noise; and 2) can we still obtain near-optimal performance guarantees on the degree of noise tolerance and on the sample complexity?

In this paper, we answer the two questions in the affirmative under a mild distributional assumption that $D$ is chosen from the family of isotropic log-concave distributions [LV07, Vem10], which covers prominent distributions such as normal distributions, exponential distributions, and logistic distributions. Moreover, we take label complexity into consideration [CAL94], and we show that our bound is near-optimal in that respect. We build our algorithm upon the margin-based active learning framework [BBZ07], which queries the label of an instance when it has a small “margin” with respect to the currently learned hypothesis.

From a high level, this work can be thought of as extending the best known result of [ABL17] to the high-dimensional regime. However, even in the low-dimensional setting where $s=d$, our label complexity bound is better than theirs in terms of the dependence on the dimension $d$: they have a quadratic dependence whereas we have a linear dependence (up to logarithmic factors). Moreover, as we will describe in Section 3, obtaining such an algorithmic extension is nontrivial both computationally and statistically. This work can also be viewed as an extension of [Zha18] to the malicious noise model. In fact, our construction of the empirical risk minimization step is inspired by that work. However, they considered only label noise, which makes their algorithm and analysis not applicable to our setting: it turns out that when facing malicious noise, a sophisticated design of the outlier removal paradigm is crucial for optimal noise tolerance [KLS09].

Also in line with this work are learning with nasty noise [DKS18] and robust sparse functional estimation [BDLS17]. Both works considered more general settings in the following sense: [DKS18] showed that by properly adapting techniques from robust mean estimation, some more general concepts, e.g. low-degree polynomial threshold functions and intersections of halfspaces, can be efficiently learned with $\operatorname{poly}(d,1/\epsilon)$ sample complexity; [BDLS17] showed that under proper sparsity assumptions, a sample complexity bound of $\operatorname{poly}(s,\log d,1/\epsilon)$ can be achieved for many sparse estimation problems, such as generalized linear models with Lipschitz mapping functions and covariance estimation. However, we remark that neither of them obtained label efficiency. In addition, when adapted to our setting, Theorem 1.5 of [DKS18] only handles noise rate $\eta\leq O(\epsilon^{c})$ for some constant $c$ greater than one, whereas, as will be shown in Section 4, we obtain the near-optimal noise tolerance $\eta\leq O(\epsilon)$. [BDLS17] achieved near-optimal noise tolerance, but their analysis is restricted to the Gaussian marginal distribution and Lipschitz mapping functions. In addition to such fundamental differences, the main techniques we develop are distinct from theirs, which will be described in more detail in Section 3.3.3.

1.1 Main results

We informally present our main results below; readers are referred to Theorem 4 in Section 4 for a precise statement.

Theorem 1 (Informal).

Consider the malicious noise model with noise rate $\eta$. If the unlabeled data distribution is isotropic log-concave and the underlying halfspace $w^{*}$ is $s$-sparse, then there is an algorithm that, for any given target error rate $\epsilon\in(0,1)$, PAC learns the underlying halfspace in polynomial time provided that $\eta\leq O(\epsilon)$. In addition, the label complexity is $\tilde{O}\big(s\log^{4}\frac{d}{\epsilon}\big)$ and the sample complexity is $\tilde{O}\big(\frac{1}{\epsilon}s^{2}\log^{5}d\big)$.

First of all, note that the noise tolerance is near-optimal as [KL88] showed that a noise rate greater than $\frac{\epsilon}{1+\epsilon}$ cannot be tolerated by any algorithm regardless of the computational power. The following fact establishes the optimality of our label complexity.

Lemma 2.

Active learning of $s$-sparse halfspaces under isotropic log-concave distributions in the realizable case has an information-theoretic label complexity lower bound of $\Omega\big(s(\log\frac{1}{\epsilon}+\log\frac{d}{s})\big)$.

To see this lemma, observe that there exist $\epsilon$-packings of $s$-sparse halfspaces with sizes $(\frac{1}{\epsilon})^{\Omega(s)}$ [Lon95] and $(\frac{d}{s})^{\Omega(s)}$ [RWY11]; applying Theorem 1 of [KMT93] gives the lower bound.

1.2 Related works

[KL88] presented a general analysis of efficiently learning halfspaces, showing that even without any distributional assumptions, it is possible to tolerate malicious noise at a rate of $\Omega(\epsilon/d)$, but a noise rate greater than $\frac{\epsilon}{1+\epsilon}$ cannot be tolerated. The noise model was further studied by [Sch92, Bsh98, CDF+99], and [KKMS05] obtained a noise tolerance of $\Omega(\epsilon/d^{1/4})$ when $D$ is the uniform distribution. [KLS09] improved this result to $\Omega(\epsilon^{2}/\log(d/\epsilon))$ for the uniform distribution, and showed a noise tolerance of $\Omega(\epsilon^{3}/\log^{2}(d/\epsilon))$ for isotropic log-concave distributions. A near-optimal result of $\Omega(\epsilon)$ was established in [ABL17] for both uniform and isotropic log-concave distributions.

Achieving attribute efficiency has been a long-standing goal in machine learning and statistics [Blu90, BHL95], and has found a variety of applications with strong theoretical grounding. A partial list includes online classification [Lit87], learning decision lists [Ser99, KS04, LS06], compressed sensing [Don06, CW08, TW10, SL18], one-bit compressed sensing [BB08, PV16], and variable selection [FL01, FF08, SL17a, SL17b].

Label-efficient learning has also been broadly studied since gathering high quality labels is often expensive. The prominent approaches include disagreement-based active learning [Han11, Han14], margin-based active learning [BBZ07, BL13, YZ17], selective sampling [CCG11, DGS12], and adaptive one-bit compressed sensing [ZYJ14, BFN+17]. There are also a number of interesting works that appeal to extra information to mitigate the labeling cost, such as comparison [XZS+17, KLMZ17] and search [BH12, BHLZ16].

Recent works such as [DKK+16, LRV16] studied mean estimation under a strong noise model where, in addition to returning dirty instances, the adversary also has the power to eliminate a few clean instances, similar to the nasty noise model in learning halfspaces [BEK02]. The main technique of robust mean estimation is a novel outlier removal paradigm, which uses the spectral norm of the covariance matrix to detect dirty instances. This is similar in spirit to the idea of [KLS09, ABL17] and the current work. However, there is no direct connection between mean estimation and halfspace learning, since the former is an unsupervised problem while the latter is supervised (although any connection would be very interesting). Very recently, this technique was extensively investigated in a variety of problems such as clustering and linear regression; we refer the reader to a comprehensive survey by [DK19] for more information.

Roadmap.

We collect useful notations and formally define the problem in Section 2. In Section 3, we describe our algorithms, followed by a theoretical analysis in Section 4. We conclude this paper in Section 5, and defer all proof details to the appendix.

2 Preliminaries

We study the problem of learning sparse halfspaces in $\mathbb{R}^{d}$ under the malicious noise model with noise rate $\eta\in[0,1/2)$ [Val85, KL88], where an oracle $\mathrm{EX}_{\eta}(D,w^{*})$ (i.e. the adversary) first selects a member $D$ from a family of distributions $\mathcal{D}$ and a concept $w^{*}$ from a concept class $\mathcal{C}$; during the learning process, $D$ and $w^{*}$ are fixed. Each time the adversary is called, with probability $1-\eta$, a random pair $(x,y)$ is returned to the learner with $x\sim D$ and $y=\operatorname{sign}(w^{*}\cdot x)$, referred to as a clean sample; with probability $\eta$, the adversary can return an arbitrary pair $(x,y)\in\mathbb{R}^{d}\times\{-1,1\}$, referred to as a dirty sample. The adversary is assumed to have unrestricted computational power to search for dirty samples that may depend on, e.g., the state of the learning algorithm and the history of its outputs. Formally, we make the following distributional assumptions.

Assumption 1.

Let $\mathcal{D}$ be the family of isotropic log-concave distributions. The underlying distribution $D$ from which clean instances are drawn is chosen from $\mathcal{D}$ by the adversary, and is fixed during the learning process. The learner is given the knowledge of $\mathcal{D}$ but not of $D$.

Assumption 2.

With probability $1-\eta$, the adversary returns a pair $(x,y)$ where $x\sim D$ and $y=\operatorname{sign}(w^{*}\cdot x)$; with probability $\eta$, it may return an arbitrary pair $(x,y)\in\mathbb{R}^{d}\times\{-1,1\}$.

Since we are interested in obtaining a label-efficient algorithm, we will consider a natural extension of such passive learning model. In particular, [ABL17] proposed to consider the following: when a labeled instance $(x,y)$ is generated, the learner only has access to an instance-generation oracle $\mathrm{EX}_{\eta}^{x}(D,w^{*})$ which returns $x$, and must make a separate call to a label-revealing oracle $\mathrm{EX}_{\eta}^{y}(D,w^{*})$ to obtain $y$. We refer to the total number of calls to $\mathrm{EX}_{\eta}^{x}(D,w^{*})$ as the sample complexity of the learning algorithm, and to that of $\mathrm{EX}_{\eta}^{y}(D,w^{*})$ as the label complexity.

We will presume that the concept class $\mathcal{C}$ consists of homogeneous halfspaces that have unit $\ell_{2}$-norm and are $s$-sparse, i.e. the number of non-zero elements of any $w\in\mathcal{C}$ is at most $s$, where $s\in\{1,2,\dots,d\}$. The learning algorithm is given this concept class, that is, the set of homogeneous $s$-sparse halfspaces. For a hypothesis $w\in\mathcal{C}$, we define its error rate on a distribution $D$ as $\operatorname{err}_{D}(w)=\Pr_{x\sim D}\big(\operatorname{sign}(w\cdot x)\neq\operatorname{sign}(w^{*}\cdot x)\big)$. The goal of the learner is to find a hypothesis $w$ in polynomial time such that with probability $1-\delta$, $\operatorname{err}_{D}(w)\leq\epsilon$ for any given failure confidence $\delta\in(0,1)$ and any error rate $\epsilon\in(0,1)$, with a few calls to $\mathrm{EX}_{\eta}^{x}(D,w^{*})$ and $\mathrm{EX}_{\eta}^{y}(D,w^{*})$.

For a reference vector $u\in\mathbb{R}^{d}$ and a positive scalar $b$, we call the region $X_{u,b}:=\{x\in\mathbb{R}^{d}:\ |u\cdot x|\leq b\}$ the band, and we denote by $D_{u,b}$ the distribution obtained by conditioning $D$ on the event $x\in X_{u,b}$. Given a hypothesis $w$ in $\mathbb{R}^{d}$, a labeled instance $(x,y)$, and a parameter $\tau>0$, we define the $\tau$-hinge loss $\ell_{\tau}(w;x,y)=\max\big\{0,1-\frac{1}{\tau}y(w\cdot x)\big\}$. For a labeled set $S=\{(x_{i},y_{i})\}_{i=1}^{n}$, we define $\ell_{\tau}(w;S)=\frac{1}{n}\sum_{i=1}^{n}\ell_{\tau}(w;x_{i},y_{i})$.
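For concreteness, the $\tau$-hinge loss above can be computed with a few lines of numpy; the following is an illustrative sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def hinge_loss(w, X, y, tau):
    """Average tau-hinge loss ell_tau(w; S) on a labeled set S = (X, y).

    X: (n, d) array of instances, y: (n,) array of labels in {-1, +1},
    w: (d,) hypothesis, tau: margin parameter > 0.
    """
    margins = y * (X @ w)                     # y_i (w . x_i) for each i
    return np.mean(np.maximum(0.0, 1.0 - margins / tau))
```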

For $p\geq 1$, we denote by $B_{p}(u,r)$ the $\ell_{p}$-ball centered at the point $u$ with radius $r>0$, i.e. $B_{p}(u,r)=\{w\in\mathbb{R}^{d}:\ \|w-u\|_{p}\leq r\}$. We will be particularly interested in the cases $p=1,2,\infty$. For a vector $u\in\mathbb{R}^{d}$, the hard thresholding operation $\mathcal{H}_{s}(u)$ keeps its $s$ largest (in absolute value) elements and sets the remaining ones to zero. Let $u,v\in\mathbb{R}^{d}$ be two vectors; we write $\theta(u,v)$ to denote the angle between them, and write $u\cdot v$ to denote their inner product. For a matrix $H$, we denote by $\|H\|_{*}$ its trace norm (also known as the nuclear norm), i.e. the sum of its singular values. We will also use $\|H\|_{1}$ to denote the entrywise $\ell_{1}$-norm of $H$, i.e. the sum of the absolute values of its entries. If $H$ is a symmetric matrix, we use $H\succeq 0$ to denote that it is positive semidefinite.
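The hard thresholding operation $\mathcal{H}_{s}$ likewise admits a direct sketch (ties among equal-magnitude entries are broken arbitrarily, an assumption of this illustration):

```python
import numpy as np

def hard_threshold(u, s):
    """H_s(u): keep the s largest-magnitude entries of u, zero out the rest."""
    v = np.zeros_like(u)
    if s <= 0:
        return v
    idx = np.argsort(np.abs(u))[-s:]          # indices of the s largest |u_i|
    v[idx] = u[idx]
    return v
```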

Throughout this paper, the subscript variants of the lowercase letter $c$, e.g. $c_{1}$ and $c_{2}$, are reserved for specific absolute constants that are uniquely determined by the distribution family $\mathcal{D}$. We also reserve $C_{1}$ and $C_{2}$ for specific constants. We remark that the values of all the constants involved in the paper do not depend on the underlying distribution $D$, but rather on the knowledge of $\mathcal{D}$ given to the learner. We collect all the definitions of these constants in Appendix A.

3 Main Algorithm

We first present an overview of our learning algorithm, followed by specifying all the hyper-parameters used therein. Then we describe in detail the attribute-efficient outlier removal scheme, which is the core technique in the paper.

3.1 Overview

Our main algorithm, namely Algorithm 1, is based on the celebrated margin-based active learning framework [BBZ07]. The key observation is that a good classifier can be learned by concentrating on fitting only the most informative labeled instances, as measured by their closeness to the current decision boundary (i.e. the closer, the more informative). In our algorithm, the sampling region is set to $\mathbb{R}^{d}$ at phase $k=1$, and to the band $X_{w_{k-1},b_{k}}=\{x\in\mathbb{R}^{d}:\ |w_{k-1}\cdot x|\leq b_{k}\}$ at phases $k\geq 2$. Once we obtain the instance set $\bar{T}$, we perform a pruning step that removes all instances having large $\ell_{\infty}$-norm. This is motivated by our analysis showing that, with high probability, all clean instances in $\bar{T}$ have small $\ell_{\infty}$-norm provided that Assumption 1 is satisfied. Since the oracle $\mathrm{EX}_{\eta}^{x}(D,w^{*})$ may output dirty instances, we design an attribute-efficient soft outlier removal procedure, which aims to find proper weights for all instances in $T$ such that the clean instances (i.e. those from $D_{w_{k-1},b_{k}}$) have overwhelming weight compared to the dirty instances. Equipped with the learned weights, it is possible to minimize the reweighted hinge loss to obtain a refined halfspace. However, this would lead to a suboptimal label complexity, since we would have to query the label of every instance in $T$. Our remedy is to randomly sample a few points from $T$ according to their importance, which is crucial for obtaining near-optimal label complexity.

When minimizing the hinge loss, we carefully construct the constraint set $W_{k}$ with three properties. First, it has an $\ell_{2}$-norm constraint. As a useful fact of isotropic log-concave distributions, the $\ell_{2}$-distance to the underlying halfspace $w^{*}$ is of the same order as the error rate. Thus, if we were able to ensure that the target halfspace $w^{*}$ stays in $W_{k}$, we would be able to show that the error rate of $w_{k}$ is as small as $O(r_{k})$, the radius of the $\ell_{2}$-ball. Second, $W_{k}$ has an $\ell_{1}$-norm constraint, which is well known for its power to promote sparse solutions and to guarantee attribute-efficient sample complexity [Tib96, CDS98, CT05, PV13b]. Lastly, the $\ell_{2}$ and $\ell_{1}$ radii of $W_{k}$ shrink by a constant factor in each phase; hence, when Algorithm 1 terminates, the radius of the $\ell_{2}$-ball will be as small as $O(\epsilon)$. Notably, [Zha18] also utilizes such constraints for active learning of sparse halfspaces, but only under the setting of label noise.
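As a small illustration, membership in the constraint set $W_{k}=B_{2}(w_{k-1},r_{k})\cap B_{1}(w_{k-1},\rho_{k})$ amounts to two norm checks; a minimal sketch (the function name is ours):

```python
import numpy as np

def in_W_k(w, w_prev, r_k, rho_k):
    """Membership test for W_k = B_2(w_{k-1}, r_k) intersected with B_1(w_{k-1}, rho_k)."""
    diff = w - w_prev
    return np.linalg.norm(diff, 2) <= r_k and np.linalg.norm(diff, 1) <= rho_k
```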

The last step in Algorithm 1 is to perform hard thresholding $\mathcal{H}_{s}$ on the solution $v_{k}$, followed by $\ell_{2}$-normalization. Roughly speaking, these two steps produce an iterate $w_{k}$ consistent with the structure of $w^{*}$ (i.e. $w_{k}$ is guaranteed to belong to the concept class $\mathcal{C}$), and, more importantly, they are useful for showing that $w^{*}$ lies in $W_{k}$ in all phases.

Algorithm 1 Attribute and Label-Efficient Algorithm Tolerating Malicious Noise

Input: Error rate $\epsilon$, failure probability $\delta$, sparsity parameter $s$, an instance generation oracle $\mathrm{EX}_{\eta}^{x}(D,w^{*})$, a label revealing oracle $\mathrm{EX}_{\eta}^{y}(D,w^{*})$.
Output: A halfspace $w_{k_{0}}$ such that $\operatorname{err}_{D}(w_{k_{0}})\leq\epsilon$ with probability $1-\delta$.

1:  $k_{0}\leftarrow\big\lceil\log\big(\frac{\pi}{16c_{1}\epsilon}\big)\big\rceil$.
2:  Initialize $w_{0}$ as the zero vector in $\mathbb{R}^{d}$.
3:  for phases $k=1,2,\dots,k_{0}$ do
4:    Clear the working set $\bar{T}$.
5:    If $k=1$, independently draw $n_{k}$ instances from $\mathrm{EX}^{x}_{\eta}(D,w^{*})$ and put them into $\bar{T}$; otherwise, draw $n_{k}$ instances from $\mathrm{EX}^{x}_{\eta}(D,w^{*})$ conditioned on $|w_{k-1}\cdot x|\leq b_{k}$ and put them into $\bar{T}$.
6:    Pruning: Remove all instances $x$ in $\bar{T}$ with $\|x\|_{\infty}>c_{9}\log\frac{48n_{k}d}{b_{k}\delta_{k}}$ to form a set $T$.
7:    Soft outlier removal: Apply Algorithm 2 to $T$ with $u\leftarrow w_{k-1}$, $b\leftarrow b_{k}$, $r\leftarrow r_{k}$, $\rho\leftarrow\rho_{k}$, $\xi\leftarrow\xi_{k}$, $C\leftarrow 2C_{2}$, and let $q=\{q(x)\}_{x\in T}$ be the returned function. Normalize $q$ to form a probability distribution $p$ over $T$.
8:    Random sampling: $S_{k}\leftarrow$ independently draw $m_{k}$ instances (with replacement) from $T$ according to $p$ and query $\mathrm{EX}^{y}_{\eta}(D,w^{*})$ for their labels.
9:    Let $W_{k}=B_{2}(w_{k-1},r_{k})\cap B_{1}(w_{k-1},\rho_{k})$. Find $v_{k}\in W_{k}$ such that $\ell_{\tau_{k}}(v_{k};S_{k})\leq\min_{w\in W_{k}}\ell_{\tau_{k}}(w;S_{k})+\kappa$.
10:   $w_{k}\leftarrow\mathcal{H}_{s}(v_{k})/\|\mathcal{H}_{s}(v_{k})\|_{2}$.
11: end for
12: return $w_{k_{0}}$.

3.2 Hyper-parameter setting

We elaborate on our hyper-parameter setting that is used in Algorithm 1 and our analysis. Let $g(t)=c_{2}\big(2t\exp(-t)+\frac{c_{3}\pi}{4}\exp\big(-\frac{c_{4}t}{4\pi}\big)+16\exp(-t)\big)$, where the constants are specified in Appendix A. Observe that there exists an absolute constant $\bar{c}\geq 8\pi/c_{4}$ satisfying $g(\bar{c})\leq 2^{-8}\pi$, since the continuous function $g(t)\to 0$ as $t\to+\infty$ and all the quantities involved in $g(t)$ are absolute constants. Given such a constant $\bar{c}$, we set $b_{k}=\bar{c}\cdot 2^{-k-3}$, $\tau_{k}=c_{0}\kappa\cdot\min\{b_{k},1/9\}$, $\delta_{k}=\frac{\delta}{(k+1)(k+2)}$,

$r_{k}=\begin{cases}1,&k=1\\ 2^{-k-3},&k\geq 2\end{cases}\quad\text{and}\quad\rho_{k}=\begin{cases}\sqrt{s},&k=1\\ \sqrt{2s}\cdot 2^{-k-3},&k\geq 2.\end{cases}$

We set the constant $\kappa=\exp(-\bar{c})$, and choose $\xi_{k}=\min\big\{\frac{1}{2},\frac{\kappa^{2}}{16}\big(1+4\sqrt{C_{2}}z_{k}/\tau_{k}\big)^{-2}\big\}$. Observe that all the $\xi_{k}$'s are lower bounded by the constant $c_{6}:=\min\big\{\frac{1}{2},\frac{\kappa^{2}}{16}\big(1+\frac{4}{c_{0}\kappa\bar{c}}\sqrt{C_{2}\bar{c}^{2}+C_{2}}\big)^{-2}\big\}$. Our theoretical guarantee holds for any noise rate $\eta\leq c_{5}\epsilon$, where the constant $c_{5}:=\frac{c_{8}}{2\pi}\bar{c}c_{1}c_{6}$.

We set the total number of phases to $k_{0}=\big\lceil\log\big(\frac{\pi}{16c_{1}\epsilon}\big)\big\rceil$ in Algorithm 1. Consider any phase $k\geq 1$. We use $n_{k}=\tilde{O}\big(s^{2}\log^{4}\frac{d}{b_{k}}\cdot\big(\log d+\log^{3}\frac{1}{\delta_{k}}\big)\big)$ as the size of the unlabeled instance set $\bar{T}$. We will show that by making $N_{k}=O(n_{k}/b_{k})$ calls to $\mathrm{EX}_{\eta}^{x}(D,w^{*})$, Algorithm 1 is guaranteed to obtain such a $\bar{T}$ in each phase with high probability. We set $m_{k}=\tilde{O}\big(s\log^{2}\frac{d}{b_{k}\delta_{k}}\cdot\log\frac{d}{\delta_{k}}\big)$ as the size of the labeled instance set $S_{k}$, which is also the number of calls to $\mathrm{EX}_{\eta}^{y}(D,w^{*})$. Note that $N:=\sum_{k=1}^{k_{0}}N_{k}$ is the sample complexity of Algorithm 1, and $m:=\sum_{k=1}^{k_{0}}m_{k}$ is its label complexity.
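The schedule above is purely arithmetic; the following sketch computes the per-phase parameters, with the absolute constants $\bar{c}$, $c_{0}$, $c_{1}$ left as inputs (any concrete values are the caller's assumption), and with the logarithm in $k_{0}$ read as base 2 to match the halving of $b_{k}$ and $r_{k}$ (our reading):

```python
import numpy as np

def phase_parameters(k, s, delta, c_bar, c0):
    """Per-phase hyper-parameters of Section 3.2.  c_bar and c0 are the paper's
    absolute constants, passed in here; concrete values are the caller's assumption."""
    kappa = np.exp(-c_bar)
    b_k = c_bar * 2.0 ** (-k - 3)
    tau_k = c0 * kappa * min(b_k, 1.0 / 9.0)
    delta_k = delta / ((k + 1) * (k + 2))
    r_k = 1.0 if k == 1 else 2.0 ** (-k - 3)
    rho_k = np.sqrt(s) if k == 1 else np.sqrt(2.0 * s) * 2.0 ** (-k - 3)
    return dict(b=b_k, tau=tau_k, delta=delta_k, r=r_k, rho=rho_k, kappa=kappa)

def num_phases(eps, c1):
    # k_0 = ceil(log(pi / (16 c_1 eps))); the log is taken base 2 here,
    # consistent with the halving schedule (an assumption of this sketch).
    return int(np.ceil(np.log2(np.pi / (16.0 * c1 * eps))))
```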

3.3 Attribute and computationally efficient soft outlier removal

Our soft outlier removal procedure is inspired by [ABL17]. We first briefly describe their main idea. Then we introduce a natural extension of their approach to the high-dimensional regime and show why it fails. Lastly, we present our novel outlier removal scheme.

To ease our discussion, we decompose $T=T_{\mathrm{C}}\cup T_{\mathrm{D}}$, where $T_{\mathrm{C}}$ is the set of clean instances in $T$ and $T_{\mathrm{D}}$ consists of all dirty instances. Ideally, we would like to find a function $q: T\rightarrow[0,1]$ such that $q(x)=1$ for all $x\in T_{\mathrm{C}}$ and $q(x)=0$ otherwise. Suppose that $\xi$ is the fraction of dirty instances in $T$. Then one would expect the total weight $\sum_{x\in T}q(x)$ to be as large as $(1-\xi)|T|$ in order to include such an ideal function. On the other hand, we must restrict the weights of dirty instances; namely, we need to characterize under what conditions $T_{\mathrm{C}}$ can be distinguished from $T_{\mathrm{D}}$. The key observation made in [KLS09] and [ABL17] is that if the dirty instances are to deteriorate the hinge loss (which is the purpose of the adversary), they must lead to a variance (following [ABL17], we slightly abuse the word “variance” without subtracting the squared mean of $w\cdot x$) of $w\cdot x$ orders of magnitude larger than $\Omega(b^{2}+r^{2})$ in the direction of a particular halfspace. Thus, it suffices to find a proper weight for each instance such that the reweighted variance $\frac{1}{|T|}\sum_{x\in T}q(x)(w\cdot x)^{2}$ is as small as $O(b^{2}+r^{2})$ for all feasible halfspaces $w\in W$. It then remains to resolve two questions: 1) how many instances do we need to draw in order to guarantee the existence of such a function $q$; and 2) how to find a feasible function $q$ in polynomial time.
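For a fixed direction $w$, the quantity being controlled here is simply a reweighted second moment; a minimal numpy sketch (names are ours):

```python
import numpy as np

def reweighted_variance(q, X, w):
    """(1/|T|) * sum_{x in T} q(x) (w . x)^2 for instances X (rows of X) and
    weights q in [0, 1]; soft outlier removal needs this to be O(b^2 + r^2)
    for every feasible halfspace w."""
    proj = X @ w
    return float(np.sum(q * proj ** 2) / X.shape[0])
```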

If label complexity were our only objective, we could have used the soft outlier removal procedure of [ABL17] directly, i.e. setting $W=B_{2}(u,r)$, which in conjunction with the $\ell_{1}$-norm constrained hinge loss minimization of [Zha18] would result in an $\tilde{O}\big(\frac{d^{2}}{\epsilon}\big)$ sample complexity and a $\operatorname{poly}(s,\log d,\log(1/\epsilon))$ label complexity. However, since we would also like to optimize the learner's sample complexity by utilizing the sparsity assumption, we need an attribute-efficient outlier removal procedure.

3.3.1 A natural approach and why it fails

It is well known that incorporating an $\ell_{1}$-norm constraint often leads to a sample complexity sublinear in the dimension [Zha02, KST08]. Thus, a natural approach to attribute-efficient outlier removal is to set $W=B_{2}(u,r)\cap B_{1}(u,\rho)$ for some carefully chosen radius $\rho>0$. With this new localized concept space, it is possible to show that a sample size of $\operatorname{poly}(s,\log d)$ suffices to guarantee the existence of a function $q$ such that the reweighted variance is small over all $w\in W$. However, on the computational side, for a given $q$, we would have to check the reweighted variance for all $w\in W$, which amounts to finding a global optimum of the following program:

$\max_{w\in\mathbb{R}^{d}}\ \frac{1}{|T|}\sum_{x\in T}q(x)(w\cdot x)^{2},\quad \mathrm{s.t.}\ \|w-u\|_{2}\leq r,\ \|w-u\|_{1}\leq\rho. \qquad (3.1)$

The above program is closely related to the problem of sparse principal component analysis (PCA) [ZHT06], and unfortunately it is known that finding a global optimum is NP-hard [Ste05, TP14].

Algorithm 2 Attribute-Efficient Localized Soft Outlier Removal

Input: Reference vector $u$, band width $b$, radius $r$ for the $\ell_{2}$-ball, radius $\rho$ for the $\ell_{1}$-ball, empirical noise rate $\xi$, absolute constant $C$, a set of unlabeled instances $T$ such that $|u\cdot x|\leq b$ for all $x\in T$.
Output: A function $q: T\rightarrow[0,1]$.

1:  Define the convex set of matrices $\mathcal{M}=\big\{H\in\mathbb{R}^{d\times d}:\ H\succeq 0,\ \|H\|_{*}\leq r^{2},\ \|H\|_{1}\leq\rho^{2}\big\}$.
2:  Find a function $q: T\rightarrow[0,1]$ satisfying the following constraints:
    1. for all $x\in T$, $0\leq q(x)\leq 1$;
    2. $\sum_{x\in T}q(x)\geq(1-\xi)|T|$;
    3. $\sup_{H\in\mathcal{M}}\frac{1}{|T|}\sum_{x\in T}q(x)x^{\top}Hx\leq C(b^{2}+r^{2})$.
3:  return $q$.

3.3.2 Convex relaxation of sparse principal component analysis

Our goal is to find a function $q$ such that the objective value in (3.1) is less than $O(b^{2}+r^{2})$ for all $w\in W$. To circumvent the computational intractability caused by the non-convexity of the objective function, we consider an alternative formulation using semidefinite programming (SDP), similar to the approach of [dGJL07]. First, let $v=w-u$. It is not hard to see that $(w\cdot x)^{2}\leq 2(u\cdot x)^{2}+2(v\cdot x)^{2}$. Due to our localized sampling scheme, we have $(u\cdot x)^{2}\leq b^{2}$ with probability $1$. Thus, we only need to examine the maximum value of $\frac{1}{|T|}\sum_{x\in T}q(x)(v\cdot x)^{2}$ over $v\in B_{2}(0,r)\cap B_{1}(0,\rho)$. Now the technique of [dGJL07] comes in: the rank-one symmetric matrix $vv^{\top}$ is replaced by a new variable $H\in\mathbb{R}^{d\times d}$ which is positive semidefinite, and the vector $\ell_{2}$- and $\ell_{1}$-norm constraints are relaxed to the matrix trace- and $\ell_{1}$-norm constraints, respectively, as follows:

$\max_{H\in\mathbb{R}^{d\times d}}\ \frac{1}{|T|}\sum_{x\in T}q(x)x^{\top}Hx,\quad \mathrm{s.t.}\ H\succeq 0,\ \|H\|_{*}\leq r^{2},\ \|H\|_{1}\leq\rho^{2}. \qquad (3.2)$

The program (3.2) has two salient features: first, it is a semidefinite program that can be optimized efficiently [BV04]; second, if its objective value is upper bounded by $O(b^{2}+r^{2})$, we immediately obtain that the reweighted variance is well controlled. This is the theme of the following lemma.

Lemma 3.

Suppose that Assumptions 1 and 2 are satisfied, and that $\eta\leq c_{5}\epsilon$. There exists a constant $C_{2}>2$ such that the following holds. For any phase $k$ of Algorithm 1 with $1\leq k\leq k_{0}$, write $\mathcal{M}_{k}=\{H\in\mathbb{R}^{d\times d}:\ H\succeq 0,\ \|H\|_{*}\leq r_{k}^{2},\ \|H\|_{1}\leq\rho_{k}^{2}\}$. Then with probability $1-\frac{\delta_{k}}{24}$ over the draw of $T_{\mathrm{C}}$, we have

$\sup_{H\in\mathcal{M}_{k}}\frac{1}{|T_{\mathrm{C}}|}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq 2C_{2}(b_{k}^{2}+r_{k}^{2}),$

provided that $|T_{\mathrm{C}}|\geq\tilde{O}\big(s^{2}\log^{4}\frac{d}{b_{k}}\cdot\big(\log d+\log^{2}\frac{1}{\delta_{k}}\big)\big)$.

Recall that Algorithm 1 sets $n_{k}=\tilde{O}\big(s^{2}\log^{4}\frac{d}{b_{k}}\cdot\big(\log d+\log^{3}\frac{1}{\delta_{k}}\big)\big)$, which suffices to guarantee that the condition on $|T_{\mathrm{C}}|$ holds (see Appendix D.2); therefore, the above concentration bound holds with high probability. As a result, it is not hard to verify that the function $q: T\rightarrow[0,1]$, where $q(x)=1$ for all $x\in T_{\mathrm{C}}$ and $q(x)=0$ for all $x\in T_{\mathrm{D}}$, satisfies all three constraints in Algorithm 2. In other words, Lemma 3 establishes the existence of a feasible function $q$ for Algorithm 2. Furthermore, observe that the optimization problem of finding a feasible $q$ in Algorithm 2 is a semi-infinite linear program. For a given candidate $q$, we can construct an efficient separation oracle as follows: it checks whether $q$ violates the first two constraints; if not, it checks the last constraint by invoking a polynomial-time SDP solver to find the maximum objective value of (3.2). It is well known that, equipped with such a separation oracle, Algorithm 2 will return a desired function $q$ in polynomial time via the ellipsoid method [GLS12, Chapter 3].
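To make the separation oracle concrete, the following sketch checks the three constraints of Algorithm 2 for a candidate $q$, evaluating the third by solving the relaxation (3.2); the use of cvxpy with the SCS solver is our illustrative choice of a polynomial-time SDP solver, not something prescribed by the paper:

```python
import numpy as np
import cvxpy as cp

def separation_check(q, X, xi, b, r, rho, C):
    """Check whether a candidate weighting q (array of length |T|) satisfies the
    three constraints of Algorithm 2 for instances X (rows of X).
    Returns (feasible, sdp_value)."""
    n, d = X.shape
    if np.any(q < 0) or np.any(q > 1):           # constraint 1
        return False, None
    if q.sum() < (1 - xi) * n:                   # constraint 2
        return False, None
    # Constraint 3: sup_{H in M} (1/|T|) sum_x q(x) x^T H x <= C (b^2 + r^2).
    # The objective equals trace(M H) with M the reweighted second-moment matrix.
    M = (X * q[:, None]).T @ X / n
    H = cp.Variable((d, d), PSD=True)
    problem = cp.Problem(
        cp.Maximize(cp.sum(cp.multiply(M, H))),  # = trace(M H), M symmetric
        [cp.trace(H) <= r ** 2,                  # trace = nuclear norm for PSD H
         cp.sum(cp.abs(H)) <= rho ** 2],         # entrywise l1-norm constraint
    )
    value = problem.solve(solver=cp.SCS)
    return value <= C * (b ** 2 + r ** 2), value
```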

3.3.3 Comparison to prior works

We remark that the setting of $n_{k}$ results in a sample complexity of $\tilde{O}\big(\frac{s^{2}}{b_{k}}\big)$ for phase $k$ (see a formal statement in Lemma 6), which implies a total sample complexity of $\tilde{O}\big(\frac{s^{2}}{\epsilon}\big)$. When $s\ll d$, this substantially improves upon the sample complexity of $\tilde{O}\big(\frac{d^{2}}{\epsilon}\big)$ when naively applying the soft outlier removal procedure in [ABL17].

We highlight three crucial technical differences from [DKS18] and [BDLS17]. First, we progressively restrict the variance to identify dirty instances, i.e. the variance upper bound is set to $O(1)$ at the beginning of Algorithm 1 and progressively decreases to $O(\epsilon^{2})$ (see our setting of $b_{k}$ and $r_{k}$), while in [DKS18, BDLS17] and many of their follow-up works it is typically fixed to $O(\epsilon)$. Second, we control the variance locally, i.e. we only require a small variance over a localized instance space $D_{w_{k-1},b_{k}}$ and a localized concept space $\mathcal{M}_{k}$. Third, the small variance is used to robustly estimate the hinge loss in our work, while in [DKS18] it was utilized to approximate the Chow parameters. All of this problem-specific design of the outlier removal step is vital for us to obtain the first near-optimal guarantee on attribute efficiency and label efficiency for learning sparse halfspaces.

4 Performance Guarantee

In the following, we always presume that the underlying halfspace is parameterized by $w^{*}$, which is $s$-sparse and has unit $\ell_{2}$-norm. This condition may not be explicitly stated in our analysis.

Our main theorem is as follows. We note that there are two sources of randomness in Algorithm 1: the random draw of instances from $\mathrm{EX}_{\eta}^{x}(D,w^{*})$, and the random sampling step (i.e. Step 8); the probability is taken over all the randomness in the algorithm.

Theorem 4.

Suppose that Assumptions 1 and 2 are satisfied. There exists an absolute constant $c_{5}$ such that for any $\epsilon\in(0,1)$ and $\delta\in(0,1)$, if $\eta\leq c_{5}\epsilon$, then with probability at least $1-\delta$, $\operatorname{err}_{D}(w_{k_{0}})\leq\epsilon$, where $w_{k_{0}}$ is the output of Algorithm 1. Furthermore, Algorithm 1 has a sample complexity of $\tilde{O}\big(\frac{1}{\epsilon}s^{2}\log^{4}d\cdot\big(\log d+\log^{3}\frac{1}{\delta}\big)\big)$, a label complexity of $\tilde{O}\big(s\log^{2}\frac{d}{\epsilon\delta}\cdot\log\frac{d}{\delta}\cdot\log\frac{1}{\epsilon}\big)$, and running time $\operatorname{poly}(d,1/\epsilon,1/\delta)$.

Algorithm 1 can be straightforwardly modified to work in the passive learning setting, where the learner has direct access to the labeled instance oracle $\mathrm{EX}_{\eta}(D,w^{*})$. The modified algorithm works as follows: it calls $\mathrm{EX}_{\eta}(D,w^{*})$ to obtain an instance together with its label whenever Algorithm 1 calls $\mathrm{EX}_{\eta}^{x}(D,w^{*})$. In particular, for the passive learning algorithm, the working set $\bar{T}$ is always a labeled instance set, and there is no need to query $\mathrm{EX}_{\eta}^{y}(D,w^{*})$ in the random sampling step.

We have the following simple corollary, which is an immediate consequence of Theorem 4.

Corollary 5.

Suppose that Assumptions 1 and 2 are satisfied. There exists a polynomial-time algorithm (that has access only to $\mathrm{EX}_{\eta}(D,w^{*})$) and an absolute constant $c_{5}$ such that for any $\epsilon\in(0,1)$ and $\delta\in(0,1)$, if $\eta\leq c_{5}\epsilon$, then with probability at least $1-\delta$, the algorithm outputs a hypothesis with error at most $\epsilon$, using $\tilde{O}\big(\frac{1}{\epsilon}s^{2}\log^{4}d\cdot\big(\log d+\log^{3}\frac{1}{\delta}\big)\big)$ labeled instances.

We need an ensemble of new results to prove Theorem 4. Specifically, we propose new techniques to control the sample and computational complexity of soft outlier removal, and a new analysis of label complexity by making full use of the localization in the instance and concept spaces. We elaborate on them in the following, and sketch the proof of Theorem 4 at the end of this section.

4.1 Localized sampling in the instance space

Localized sampling, also known as margin-based active learning, is a useful technique proposed in [BBZ07]. Interestingly, under isotropic log-concave distributions, [BL13] showed that if the band width $b$ is large enough, the region outside the band, i.e. $\{x\in\mathbb{R}^{d}:\ |w\cdot x|>b\}$, can be safely “ignored”, in the sense that, if $w$ is close enough to $w^{*}$, it is guaranteed to incur a small error rate therein. Motivated by this elegant finding, theoretical analyses in the literature are often dedicated to bounding the error rate within the band, and it is now well understood that a constant error rate within the band suffices to ensure significant progress in each phase [ABHU15, ABL17, Zha18]. We follow this line of reasoning, and our technical contribution is to show how to obtain such a constant error rate with near-optimal label complexity and noise tolerance.

Our analysis relies on the condition that $\bar{T}$ contains sufficiently many instances. Specifically, in order to collect $n_{k}$ instances to form the working set $\bar{T}$, we need to call $\mathrm{EX}_{\eta}^{x}(D,w^{*})$ a sufficient number of times, since our sampling is localized within the band $X_{k}:=\{x:\ |w_{k-1}\cdot x|\leq b_{k}\}$. The following lemma characterizes the sample complexity at phase $k$.

Lemma 6.

Suppose that Assumptions 1 and 2 are satisfied, and further assume $\eta<\frac{1}{2}$. With probability $1-\frac{\delta_{k}}{4}$, we obtain $n_{k}$ instances that fall into the band $X_{k}=\{x:\ |w_{k-1}\cdot x|\leq b_{k}\}$ by making $N_{k}=O\big(\frac{1}{b_{k}}\big(n_{k}+\log\frac{1}{\delta_{k}}\big)\big)$ calls to the instance generation oracle $\mathrm{EX}_{\eta}^{x}(D,w^{*})$.
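The sampling in Lemma 6 is plain rejection sampling into the band; below is a toy sketch in which the clean oracle is simulated by a standard Gaussian (one member of the isotropic log-concave family) and the adversary is omitted, so the oracle, dimension, and band width shown are purely illustrative:

```python
import numpy as np

def draw_in_band(oracle, w, b, n):
    """Collect n instances with |w . x| <= b by repeated calls to the instance
    oracle; `oracle()` stands in for one call to EX^x_eta(D, w*)."""
    T, calls = [], 0
    while len(T) < n:
        x = oracle()
        calls += 1
        if abs(np.dot(w, x)) <= b:
            T.append(x)
    return np.array(T), calls

rng = np.random.default_rng(0)
d = 50
w_prev = np.eye(d)[0]                        # current unit-norm iterate
T, calls = draw_in_band(lambda: rng.standard_normal(d), w_prev, b=0.1, n=200)
# For isotropic log-concave D the band has mass Theta(b), so calls is roughly n / b.
```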

4.2 Attribute and computationally efficient soft outlier removal

We summarize the performance guarantee of Algorithm 2 in the following proposition.

Proposition 7.

Consider phase $k$ of Algorithm 1 for any $1\leq k\leq k_{0}$. Suppose that Assumptions 1 and 2 are satisfied, and that $\eta\leq c_{5}\epsilon$. With the setting of $n_{k}$, with probability $1-\frac{\delta_{k}}{8}$ over the draw of $\bar{T}$, Algorithm 2 outputs a function $q: T\rightarrow[0,1]$ in polynomial time with the following properties: (1) $\frac{1}{|T|}\sum_{x\in T}q(x)\geq 1-\xi_{k}$; (2) for all $w\in W_{k}$, $\frac{1}{|T|}\sum_{x\in T}q(x)(w\cdot x)^{2}\leq 5C_{2}(b_{k}^{2}+r_{k}^{2})$.

Again, we emphasize that the key difference between our algorithm and that of [ABL17] lies in Constraint 3 of Algorithm 2: we require that the “variance proxy” $\sum_{x\in T}q(x)x^{\top}Hx$ of the reweighted instances be small for all positive semidefinite $H$ lying in the intersection of a trace-norm ball and an $\ell_{1}$-norm ball. On the statistical side, this favorable constraint set for $H$, in conjunction with Adamczak's bound from the empirical processes literature [Ada08], yields sufficient uniform concentration of the variance proxy $x^{\top}Hx$ with a sample complexity of $\operatorname{poly}(s,\log d)$. This significantly improves upon the $\operatorname{poly}(d)$ sample complexity established in [ABL17]. The detailed proof can be found in Appendix D.3.

Remark 1.

While in some standard settings a proper $\ell_{1}$-norm constraint suffices to guarantee a desired sample complexity bound in the high-dimensional regime [Wai09, KST08], we note that in order to establish near-optimal noise tolerance, the $\ell_{2}$-norm constraint on $w$ (hence the trace-norm constraint on $H$) is vital as well. Though eliminating it eases the search for a feasible function $q$, doing so leads to a suboptimal noise tolerance of $\eta\leq\Omega(\epsilon/s)$. Informally speaking, the per-phase error rate, expected to be a constant, is inherently proportional to the variance $(w\cdot x)^{2}$ times $\xi_{k}$, the noise rate within the band. Without the trace-norm constraint, the variance would be $s$ times larger than before (since we would have to use $\rho_{k}^{2}=O(sr_{k}^{2})$ as a proxy for the constraint set's radius, measured in trace norm). This implies that we would need to set $\xi_{k}$ a factor of $1/s$ smaller than before, which in turn means that the noise tolerance $\eta$ becomes a factor of $1/s$ smaller, since $\eta/\epsilon\approx\xi_{k}$. We refer the reader to Proposition 31 and Lemma 36 for details.

Remark 2.

The quantity $n_{k}$ has a quadratic dependence on the sparsity parameter $s$. This cannot be improved in some sparse-PCA-related problems [BR13], but it is not clear whether such dependence is optimal in our case. We leave this investigation to future work.

Next, we describe the statistical property of the distribution $p$ (obtained by normalizing the $q$ returned by Algorithm 2). Observe that the noise rate within the band is at most $\eta/b_{k}\leq O(\eta/\epsilon)\leq\xi_{k}$, since the probability mass of the band is $\Theta(b_{k})$ – an important property of isotropic log-concave distributions. Also, it is possible to show that the variance of clean instances along directions $H\in\mathcal{M}_{k}$ is $O(b_{k}^{2}+r_{k}^{2})$ (see Lemma 16). Therefore, Algorithm 2 is essentially searching for a weighting such that clean instances have overwhelming weight over dirty instances, and such that the variance of the weighted instances is similar to that of the clean instances. Recall that $T_{\mathrm{C}}\subset T$ is the set of clean instances in $T$. Let $\tilde{T}_{\mathrm{C}}=\{(x,y_{x})\}_{x\in T_{\mathrm{C}}}$ be the unrevealed labeled set where each instance is correctly annotated by $w^{*}$. The following proposition, which is similar to Lemma 4.7 of [ABL17] but with refinements, states that the reweighted hinge loss $\ell_{\tau_{k}}(w;p):=\sum_{x\in T}p(x)\ell_{\tau_{k}}(w;x,y_{x})$ is a good proxy for the hinge loss evaluated exclusively on the clean labeled instances $\tilde{T}_{\mathrm{C}}$.

Proposition 8.

Suppose that Assumptions 1 and 2 are satisfied, and $\eta\leq c_{5}\epsilon$. For any phase $k$ of Algorithm 1, with probability $1-\frac{\delta_{k}}{4}$ over the draw of $\bar{T}$, we have $\sup_{w\in W_{k}}\big|\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})-\ell_{\tau_{k}}(w;p)\big|\leq\kappa$.

Note that though this proposition is phrased in terms of the hinge loss on pairs $(x,y_{x})$, it is only used in the analysis, and our algorithm does not require knowledge of the labels $y_{x}$ – the algorithm does not even need to exactly identify the set of clean instances $T_{\mathrm{C}}$. As a result, the size of $T_{\mathrm{C}}$ does not count towards our label complexity. Proposition 7 together with Proposition 8 implies that with high probability, Algorithm 2 produces a desired probability distribution in polynomial time, which justifies its computational and statistical efficiency.

In addition, let $L_{\tau_{k}}(w):=\mathbb{E}_{x\sim D_{w_{k-1},b_{k}}}\big[\ell_{\tau_{k}}\big(w;x,\operatorname{sign}(w^{*}\cdot x)\big)\big]$ be the expected loss on $D_{w_{k-1},b_{k}}$. The following result links $L_{\tau_{k}}(w)$ to the empirical hinge loss on clean instances.

Proposition 9.

Under Assumptions 1 and 2, and assuming $\eta\leq c_{5}\epsilon$, for any phase $k$ of Algorithm 1, with probability $1-\frac{\delta_{k}}{4}$ over the draw of $\bar{T}$, we have $\sup_{w\in W_{k}}\big|L_{\tau_{k}}(w)-\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})\big|\leq\kappa$.

4.3 Attribute and label-efficient empirical risk minimization

In light of Proposition 8, one may wish to find an iterate by minimizing the reweighted hinge loss $\ell_{\tau_{k}}(w;p)$. This would require collecting labels for all instances in $T$, which leads to a suboptimal label complexity of $O\big(s^{2}\cdot\operatorname{polylog}(d,1/\epsilon)\big)$. As a remedy, we perform a random sampling process, which draws $m_{k}$ instances from $T$ according to the distribution $p$ and then queries their labels, resulting in the labeled instance set $S_{k}$. By standard uniform convergence arguments, it is expected that $\ell_{\tau_{k}}(w;S_{k})\approx\ell_{\tau_{k}}(w;p)$ provided that $m_{k}$ is large enough, as is shown in the following proposition.

Proposition 10.

Suppose that Assumptions 1 and 2 are satisfied. For any phase $k$ of Algorithm 1, with probability $1-\frac{\delta_{k}}{4}$, we have $\sup_{w\in W_{k}}|\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})|\leq\kappa$.
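The random sampling step that Proposition 10 analyzes (Step 8 of Algorithm 1) is simply importance sampling with replacement from $T$ according to $p$; a minimal sketch, with the sizes below chosen only for illustration:

```python
import numpy as np

def sample_for_labels(X, p, m, rng):
    """Draw m instances from T (rows of X) with replacement according to the
    distribution p produced by soft outlier removal; only these m points have
    their labels queried from the label-revealing oracle."""
    idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
    return X[idx]

# Example: with |T| = 1000 and a near-uniform p, query only m_k = 50 labels.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 20))
p = np.full(1000, 1.0 / 1000)
S_k = sample_for_labels(X, p, m=50, rng=rng)
```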

We remark that when establishing the performance guarantee, the $\ell_{1}$-norm constraint on the hypothesis space, together with an $\ell_{\infty}$-norm upper bound on the localized instance space, leads to a Rademacher complexity that has a linear dependence on the sparsity (up to a logarithmic factor). Technically speaking, our analysis is more involved than that of [ABL17]: applying their analysis to the setting of learning sparse halfspaces, along with the fact that the VC dimension of the class of $s$-sparse halfspaces is $O(s\log(d/s))$, would give a label complexity quadratic in $s$.

4.4 Uniform concentration for unbounded data

Our analysis involves building uniform concentration bounds. The primary issue with applying standard concentration results, e.g. Theorem 1 of [KST08], is that the instances are not contained in a pre-specified $\ell_{\infty}$-ball with probability $1$ under an isotropic log-concave distribution. [ABL17, Zha18] construct a conditional distribution, on which the data are all bounded from above, and then measure the difference between this conditional distribution and the original one. We circumvent such technical complications by using Adamczak's bound [Ada08] from the empirical process literature, which provides a generic way to derive concentration inequalities for well-behaved distributions with unbounded support. See Appendix C for a concrete treatment.

4.5 Proof sketch of Theorem 4

Proof.

We first show that the error rate of $v_{k}$ on $D_{w_{k-1},b_{k}}$ is a constant; that of $w_{k}$ follows, since hard thresholding and $\ell_{2}$-norm projection can only change the error rate by a constant factor. Observe that in light of Propositions 8, 9, and 10, we have $|\ell_{\tau_{k}}(w;S_{k})-L_{\tau_{k}}(w)|\leq 3\kappa$ for all $w\in W_{k}$. Therefore, if $w^{*}\in W_{k}$, by the optimality of $v_{k}$, we have $L_{\tau_{k}}(v_{k})\leq\ell_{\tau_{k}}(v_{k};S_{k})+3\kappa\leq\ell_{\tau_{k}}(w^{*};S_{k})+4\kappa\leq L_{\tau_{k}}(w^{*})+7\kappa\leq 8\kappa$, where the last inequality is by Lemma 3.7 of [ABL17]. Since $L_{\tau_{k}}(v_{k})$ always serves as an upper bound on $\operatorname{err}_{D_{w_{k-1},b_{k}}}(v_{k})$, the constant error rate on $D_{w_{k-1},b_{k}}$ follows. Next, we use the analysis framework of margin-based active learning to show that such a constant error rate ensures that the angle between $w_{k}$ and $w^{*}$ is as small as $O(2^{-k})$, which in turn implies $w^{*}\in W_{k+1}$. It remains to show $w^{*}\in W_{1}$; this can be easily seen from the definition of $W_{1}$: $W_{1}=B_{2}(0,1)\cap B_{1}(0,\sqrt{s})$. Hence, we conclude $w^{*}\in W_{k}$ for all $1\leq k\leq k_{0}$. Observe that the radius of the $\ell_{2}$-ball of $W_{k_{0}}$ is as small as $\epsilon$, which, by a basic property of isotropic log-concave distributions, implies that the error rate of $w_{k_{0}}$ on $D$ is less than $\epsilon$.

The sample and label complexity bounds follow from our setting of $N_{k}$ and $m_{k}$, and the fact that $b_{k}\in[\epsilon,\bar{c}/16]$ for all $k\leq k_{0}$. See Appendix D.5 for the full proof. ∎

5 Conclusion and Open Questions

We have presented a computationally efficient algorithm for learning sparse halfspaces under the challenging malicious noise model. Our algorithm leverages the well-established margin-based active learning framework, with particular attention to attribute efficiency, label complexity, and noise tolerance. We have shown that our theoretical guarantees for label complexity and noise tolerance are near-optimal, and that the sample complexity of a passive learning variant of our algorithm is attribute-efficient, thanks to the set of new techniques proposed in this paper.

We raise three open questions for further investigation. First, as discussed in Section 4.2, the sample complexity for concentration of $x^{\top}Hx$ has a quadratic dependence on $s$. It would be interesting to study whether this is a fundamental limit of learning under isotropic log-concave distributions, or whether it can be improved by a more sophisticated localization scheme in the instance and concept spaces. Second, while isotropic log-concave distributions bear favorable properties that fit perfectly into the margin-based framework, it would be interesting to examine whether the established results can be extended to heavy-tailed distributions. This may lead to a large error rate within the band that cannot be controlled at a constant level, and new techniques must be developed. Finally, it would be interesting to design computationally more efficient algorithms, e.g. stochastic gradient descent-type algorithms similar to [DKM05], with comparable statistical guarantees.

References

  • [ABHU15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Proceedings of the 28th Annual Conference on Learning Theory, pages 167–190, 2015.
  • [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Annual Conference on Learning Theory, pages 152–192, 2016.
  • [ABL17] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50:1–50:27, 2017.
  • [Ada08] Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13(34):1000–1034, 2008.
  • [BB08] Petros Boufounos and Richard G. Baraniuk. 1-bit compressive sensing. In Proceedings of the 42nd Annual Conference on Information Sciences and Systems, pages 16–21, 2008.
  • [BBZ07] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35–50, 2007.
  • [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, pages 169–212, 2017.
  • [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  • [BEK02] Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.
  • [BFKV96] Avrim Blum, Alan M. Frieze, Ravi Kannan, and Santosh S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 330–338, 1996.
  • [BFN+17] Richard G. Baraniuk, Simon Foucart, Deanna Needell, Yaniv Plan, and Mary Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
  • [BH12] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
  • [BHL95] Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50(1):32–40, 1995.
  • [BHLZ16] Alina Beygelzimer, Daniel J. Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3342–3350, 2016.
  • [BL13] Maria-Florina Balcan and Philip M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of The 26th Annual Conference on Learning Theory, pages 288–316, 2013.
  • [Blu90] Avrim Blum. Learning boolean functions in an infinite attribute space. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 64–72, 1990.
  • [BM02] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory, pages 1046–1066, 2013.
  • [Bsh98] Nader H. Bshouty. A new composition theorem for learning algorithms. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, pages 583–589, 1998.
  • [BV04] Stephen P. Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • [CAL94] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • [CCG11] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 83(1):71–102, 2011.
  • [CDF+99] Nicolò Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir, and Hans Ulrich Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684–719, 1999.
  • [CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
  • [CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • [CW08] Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
  • [Dan15] Amit Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Annual Conference on Learning Theory, volume 40, pages 484–502, 2015.
  • [dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
  • [DGS12] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.
  • [DGT19] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 4751–4762, 2019.
  • [DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
  • [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Zheng Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. CoRR, abs/1604.06443, 2016.
  • [DKM05] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual Conference on Learning Theory, pages 249–263, 2005.
  • [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, pages 1061–1073, 2018.
  • [DKTZ20] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, volume 125, pages 1486–1513, 2020.
  • [Don06] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • [Dud14] Richard M. Dudley. Uniform central limit theorems, volume 142. Cambridge University Press, 2014.
  • [Fel14] Vitaly Feldman. Open problem: The statistical query complexity of learning sparse halfspaces. In Proceedings of The 27th Annual Conference on Learning Theory, volume 35, pages 1283–1289, 2014.
  • [FF08] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605–2637, 2008.
  • [FL01] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
  • [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • [Han11] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
  • [Han14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.
  • [KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11–20, 2005.
  • [KL88] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 267–280, 1988.
  • [KLMZ17] Daniel M. Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In Chris Umans, editor, Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.
  • [KLS09] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
  • [KMT93] Sanjeev R. Kulkarni, Sanjoy K. Mitter, and John N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
  • [KS04] Adam R. Klivans and Rocco A. Servedio. Toward attribute efficient learning of decision lists and parities. In Proceedings of the 17th Annual Conference on Learning Theory, pages 224–238, 2004.
  • [KSS92] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In David Haussler, editor, Proceedings of the 5th Annual Conference on Computational Learning Theory, pages 341–352, 1992.
  • [KST08] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, pages 793–800, 2008.
  • [Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm (extended abstract). In Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pages 68–77, 1987.
  • [Lon95] Philip M. Long. On the sample complexity of PAC learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
  • [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 665–674, 2016.
  • [LS06] Philip M. Long and Rocco A. Servedio. Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 921–928, 2006.
  • [LV07] László Lovász and Santosh S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007.
  • [MN06] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326–2366, 2006.
  • [PV13a] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
  • [PV13b] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
  • [PV16] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
  • [Ros58] Frank Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
  • [RWY11] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over q\ell_{q}-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • [Sch92] Robert E. Schapire. Design and analysis of efficient learning algorithms. MIT Press, Cambridge, MA, USA, 1992.
  • [Ser99] Rocco A. Servedio. Computational sample complexity and attribute-efficient learning. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 701–710, 1999.
  • [SL17a] Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pages 3115–3124, 2017.
  • [SL17b] Jie Shen and Ping Li. Partial hard thresholding: Towards a principled analysis of support recovery. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 3127–3137, 2017.
  • [SL18] Jie Shen and Ping Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1–42, 2018.
  • [Slo88] Robert H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 91–96, 1988.
  • [Slo92] Robert H. Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, page 450, 1992.
  • [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  • [Ste05] Daureen Steinberg. Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel University of Technology, 2005.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [TP14] Andreas M. Tillmann and Marc E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2014.
  • [TW10] Joel A. Tropp and Stephen J. Wright. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948–958, 2010.
  • [Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 560–566, 1985.
  • [vdGL13] Sara van de Geer and Johannes Lederer. The Bernstein-Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157:225–250, 2013.
  • [VDVW96] Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
  • [Vem10] Santosh S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):32:1–32:14, 2010.
  • [Wai09] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [XZS+17] Yichong Xu, Hongyang Zhang, Aarti Singh, Artur Dubrawski, and Kyle Miller. Noise-tolerant interactive learning using pairwise comparisons. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 2431–2440, 2017.
  • [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 1056–1066, 2017.
  • [Zha02] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
  • [Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Proceedings of the 31st Annual Conference On Learning Theory, pages 1856–1880, 2018.
  • [ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
  • [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. CoRR, abs/2002.04840, 2020.
  • [ZYJ14] Lijun Zhang, Jinfeng Yi, and Rong Jin. Efficient algorithms for robust one-bit compressive sensing. In Proceedings of the 31st International Conference on Machine Learning, pages 820–828, 2014.

Appendix A Detailed Choices of Reserved Constants and Additional Notations

Constants.

The absolute constants c_{0}, c_{1} and c_{2} are specified in Lemma 12, and c_{3} and c_{4} are specified in Lemma 13; c_{5} and c_{6} are specified in Section 3.2. The definitions of c_{7}, c_{8}, and c_{9} can be found in Lemma 14, Lemma 17, and Lemma 18, respectively. The absolute constant C_{1} acts as an upper bound on all the b_{k}'s, and by our choice in Section 3.2, C_{1}=\bar{c}/16. The absolute constant C_{2} is defined in Lemma 16. Other absolute constants, such as C_{3} and C_{4}, are not crucial to our analysis or algorithmic design, so we do not track their definitions. The subscripted variants of K, e.g. K_{1} and K_{2}, are also absolute constants, but their values may change from appearance to appearance. We remark that the values of all these constants do not depend on the underlying distribution D chosen by the adversary, but only on the knowledge of the family \mathcal{D}.

Pruning.

Consider Algorithm 1. For each phase k, we sample a working set \bar{T} and remove all instances with large \ell_{\infty}-norm to obtain T (Step 6), which is equivalent to intersecting \bar{T} with the \ell_{\infty}-ball B_{\infty}(0,\nu_{k}):=\{x:\|x\|_{\infty}\leq\nu_{k}\}, where \nu_{k}=c_{9}\log\frac{48\left\lvert\bar{T}\right\rvert d}{b_{k}\delta_{k}}. This is motivated by Lemma 18, which states that with high probability, all clean instances in \bar{T} lie in B_{\infty}(0,\nu_{k}). Specifically, denote by \bar{T}_{\mathrm{C}} (respectively \bar{T}_{\mathrm{D}}) the set of clean (respectively dirty) instances in \bar{T}. Lemma 18 implies that with probability 1-\frac{\delta_{k}}{48}, \bar{T}_{\mathrm{C}}\subset B_{\infty}(0,\nu_{k}). Therefore, with high probability, all instances in \bar{T}_{\mathrm{C}} are kept in this step and only instances in \bar{T}_{\mathrm{D}} may be removed. Denote T_{\mathrm{C}}=\bar{T}_{\mathrm{C}}\cap B_{\infty}(0,\nu_{k}) and T_{\mathrm{D}}=\bar{T}_{\mathrm{D}}\cap B_{\infty}(0,\nu_{k}); we therefore have the decomposition T=T_{\mathrm{C}}\cup T_{\mathrm{D}}. We finally denote by \hat{T}_{\mathrm{C}} the unrevealed labeled set that corresponds to \bar{T}_{\mathrm{C}}. A minimal sketch of this pruning step is given after Table 1.

Table 1: Summary of useful notations associated with the working set \bar{T} at each phase k.
\bar{T}: instance set obtained by calling \mathrm{EX}_{\eta}^{x}(D,w^{*}), conditioned on \left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}
\bar{T}_{\mathrm{C}}: set of instances in \bar{T} that \mathrm{EX}_{\eta}^{x}(D,w^{*}) draws from the distribution D
\bar{T}_{\mathrm{D}}: set of dirty instances in \bar{T}, i.e. \bar{T}\backslash\bar{T}_{\mathrm{C}}
T: set of instances in \bar{T} that lie in B_{\infty}(0,\nu_{k})
T_{\mathrm{C}}: set of instances in \bar{T}_{\mathrm{C}} that lie in B_{\infty}(0,\nu_{k})
T_{\mathrm{D}}: set of instances in \bar{T}_{\mathrm{D}} that lie in B_{\infty}(0,\nu_{k})
\hat{T}_{\mathrm{C}}: unrevealed labeled set of \bar{T}_{\mathrm{C}}
\tilde{T}_{\mathrm{C}}: unrevealed labeled set of T_{\mathrm{C}}
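To make the pruning step concrete, below is a minimal NumPy sketch of Step 6 under the notation above; the array T_bar and the arguments c9, b_k, delta_k are placeholders for the working set and the reserved constants of Algorithm 1 (whose numerical values are not fixed here), so this is an illustration rather than the paper's implementation.

    import numpy as np

    def prune(T_bar, b_k, delta_k, c9):
        """Keep only the instances of the working set that lie in the
        l_infinity-ball B_inf(0, nu_k); cf. Step 6 of Algorithm 1."""
        n, d = T_bar.shape
        nu_k = c9 * np.log(48 * n * d / (b_k * delta_k))   # threshold from Lemma 18
        keep = np.max(np.abs(T_bar), axis=1) <= nu_k       # ||x||_inf <= nu_k
        return T_bar[keep]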
Regularity condition on Du,bD_{u,b}.

We will frequently work with the conditional distribution Du,bD_{u,b} obtained by conditioning DD on the event that xx is in the band {xd:|ux|b}\{x\in\mathbb{R}^{d}\mathrel{\mathop{\mathchar 58\relax}}\left\lvert u\cdot x\right\rvert\leq b\}. We give the following regularity condition to ease our terminology.

Definition 11.

A conditional distribution Du,bD_{u,b} is said to satisfy the regularity condition if one of the following holds: 1) the vector udu\in\mathbb{R}^{d} has unit 2\ell_{2}-norm and 0<bC10<b\leq C_{1}; 2) the vector uu is the zero vector and b=C1b=C_{1}.

In particular, at each phase k of Algorithm 1, u is set to w_{k-1} and b is set to b_{k}. For k=1, u=w_{0} is the zero vector and b=b_{1}=C_{1}, which satisfies the regularity condition; it is worth mentioning that at phase 1 the conditional distribution D_{u,b} boils down to D. For all k\geq 2, u is a unit vector and b\in(0,C_{1}] in view of our construction of b_{k}. Therefore, for all k\geq 1, D_{w_{k-1},b_{k}} satisfies the regularity condition.
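As a quick reference, the following hypothetical helper encodes Definition 11; C1 denotes the reserved constant of Appendix A and its value is not fixed here, so the function is only a restatement of the definition in code.

    import numpy as np

    def satisfies_regularity(u, b, C1, tol=1e-9):
        """Check whether the conditional distribution D_{u,b} satisfies
        the regularity condition of Definition 11."""
        norm_u = np.linalg.norm(u, 2)
        unit_case = abs(norm_u - 1.0) <= tol and 0.0 < b <= C1   # case 1: ||u||_2 = 1
        zero_case = norm_u <= tol and abs(b - C1) <= tol         # case 2: u = 0 and b = C1
        return unit_case or zero_case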

Appendix B Useful Properties of Isotropic Log-Concave Distributions

We record some useful properties of isotropic log-concave distributions.

Lemma 12.

There are absolute constants c0,c1,c2>0c_{0},c_{1},c_{2}>0, such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D}. Let fDf_{D} be the density function. We have

  1. 1.

    Orthogonal projections of DD onto subspaces of d\mathbb{R}^{d} are isotropic log-concave;

  2. 2.

    If d=1d=1, then PrxD(axb)|ba|\operatorname{Pr}_{x\sim D}(a\leq x\leq b)\leq\left\lvert b-a\right\rvert;

  3. 3.

    If d=1d=1, then fD(x)c0f_{D}(x)\geq c_{0} for all x[1/9,1/9]x\in[-1/9,1/9];

  4. 4.

    For any two vectors u,vdu,v\in\mathbb{R}^{d},

    c1PrxD(sign(ux)sign(vx))θ(u,v)c2PrxD(sign(ux)sign(vx));c_{1}\cdot\operatorname{Pr}_{x\sim D}\mathinner{\left(\operatorname{sign}\mathinner{\left(u\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(v\cdot x\right)}\right)}\leq\theta(u,v)\leq c_{2}\cdot\operatorname{Pr}_{x\sim D}\mathinner{\left(\operatorname{sign}\mathinner{\left(u\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(v\cdot x\right)}\right)};
  5. 5.

    PrxD(x2td)exp(t+1)\operatorname{Pr}_{x\sim D}\big{(}\left\lVert x\right\rVert_{2}\geq t\sqrt{d}\big{)}\leq\exp(-t+1).

We remark that Parts 1, 2, 3, and 5 are due to [LV07], and Part 4 is from [Vem10, BL13].
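For intuition, one can check Part 5 numerically on a standard Gaussian in \mathbb{R}^{d}, which is one particular isotropic log-concave distribution; the sketch below is a sanity check with arbitrarily chosen d, n and t, not part of the analysis.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 20, 200_000
    X = rng.standard_normal((n, d))            # isotropic log-concave (Gaussian) samples
    norms = np.linalg.norm(X, axis=1)

    for t in (1.5, 2.0, 3.0):
        empirical = np.mean(norms >= t * np.sqrt(d))
        bound = np.exp(-t + 1)                 # Part 5 of Lemma 12
        print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")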

The following lemma is implied by the proof of Theorem 21 of [BL13], which shows that if we choose a proper band width b>0b>0, the error outside the band will be small. This observation is crucial for controlling the error over the distribution DD, and has been broadly recognized in the literature [ABL17, Zha18].

Lemma 13 (Theorem 21 of [BL13]).

There are absolute constants c3,c4>0c_{3},c_{4}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D}. Let uu and vv be two unit vectors in d\mathbb{R}^{d} and assume that θ(u,v)=α<π/2\theta(u,v)=\alpha<\pi/2. Then for any b4c4αb\geq\frac{4}{c_{4}}\alpha, we have

PrxD(sign(ux)sign(vx)and|vx|b)c3αexp(c4b2α).\operatorname{Pr}_{x\sim D}(\operatorname{sign}\mathinner{\left(u\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(v\cdot x\right)}\ \text{and}\ \left\lvert v\cdot x\right\rvert\geq b)\leq c_{3}\alpha\exp\left(-\frac{c_{4}b}{2\alpha}\right).
Lemma 14 (Lemma 20 of [ABHZ16]).

There is an absolute constant c7>0c_{7}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D}. Draw nn i.i.d. instances from DD to form a set SS. Then

PrSDn(maxxSxc7log|S|dδ)δ.\operatorname{Pr}_{S\sim D^{n}}\left(\max_{x\in S}\left\lVert x\right\rVert_{\infty}\geq c_{7}\log\frac{\left\lvert S\right\rvert d}{\delta}\right)\leq\delta.
Lemma 15.

There is an absolute constant C¯21\bar{C}_{2}\geq 1 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition:

supwB2(u,r)𝔼xDu,b[(wx)2]C¯2(b2+r2).\sup_{w\in B_{2}(u,r)}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}(w\cdot x)^{2}\bigr{]}}\leq\bar{C}_{2}(b^{2}+r^{2}).
Proof.

When uu is a unit vector, Lemma 3.4 of [ABL17] shows that there exists a constant K1K_{1} such that

supwB2(u,r)𝔼xDu,b[(wx)2]K1(b2+r2).\sup_{w\in B_{2}(u,r)}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}(w\cdot x)^{2}\bigr{]}}\leq K_{1}(b^{2}+r^{2}).

When uu is a zero vector, Du,bD_{u,b} reduces to DD and the constraint wB2(u,r)w\in B_{2}(u,r) reads as w2r\left\lVert w\right\rVert_{2}\leq r. Thus we have

𝔼xDu,b[(wx)2]=w22r2<b2+r2.\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}(w\cdot x)^{2}\bigr{]}}=\left\lVert w\right\rVert_{2}^{2}\leq r^{2}<b^{2}+r^{2}.

The proof is complete by choosing C¯2=K1+1\bar{C}_{2}=K_{1}+1. ∎

Lemma 16.

There is an absolute constant C22C_{2}\geq 2 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition:

supH𝔼xDu,b[xHx]C2(b2+r2),\sup_{H\in\mathcal{M}}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}x^{\top}Hx\bigr{]}}\leq C_{2}(b^{2}+r^{2}),

where :={Hd×d:H0,Hr2,H1ρ2}\mathcal{M}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left\{H\in\mathbb{R}^{d\times d}\mathrel{\mathop{\mathchar 58\relax}}\ H\succeq 0,\ \left\lVert H\right\rVert_{*}\leq r^{2},\ \left\lVert H\right\rVert_{1}\leq\rho^{2}\right\}}.

Proof.

Since HH\in\mathcal{M} is a positive semidefinite matrix with trace norm at most r2r^{2}, it has eigendecomposition H=i=1dλiviviH=\sum_{i=1}^{d}\lambda_{i}v_{i}v_{i}^{\top}, where λi0\lambda_{i}\geq 0 are the eigenvalues such that i=1dλir2\sum_{i=1}^{d}\lambda_{i}\leq r^{2}, and viv_{i}’s are orthonormal vectors in d\mathbb{R}^{d}. Thus,

xHx=1r2i=1dλi(rvix)22r2i=1dλi[((rvi+u)x)2+(ux)2].{x^{\top}Hx}=\frac{1}{r^{2}}\sum_{i=1}^{d}{\lambda_{i}}{(rv_{i}\cdot x)^{2}}\leq\frac{2}{r^{2}}\cdot\sum_{i=1}^{d}{\lambda_{i}}\mathinner{\left[\mathinner{\left((rv_{i}+u)\cdot x\right)}^{2}+(u\cdot x)^{2}\right]}.

Since xx is drawn from Du,bD_{u,b}, we have (ux)2b2(u\cdot x)^{2}\leq b^{2}. Moreover, applying Lemma 15 with the setting of w=rv+uw=rv+u implies that

supvB2(0,1)𝔼xDu,b[((rv+u)x)2]C¯2(b2+r2).\sup_{v\in B_{2}(0,1)}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\left[\mathinner{\left((rv+u)\cdot x\right)}^{2}\right]}\leq\bar{C}_{2}(b^{2}+r^{2}).

Therefore,

supH𝔼xDu,b[xHx]2r2i=1dλi(C¯2(b2+r2)+b2)2(C¯2+1)(b2+r2).\sup_{H\in\mathcal{M}}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}x^{\top}Hx\bigr{]}}\leq\frac{2}{r^{2}}\cdot\sum_{i=1}^{d}{\lambda_{i}}\mathinner{\left(\bar{C}_{2}(b^{2}+r^{2})+b^{2}\right)}\leq 2(\bar{C}_{2}+1)(b^{2}+r^{2}).

The proof is complete by choosing C2=2(C¯2+1)C_{2}=2(\bar{C}_{2}+1). ∎

Lemma 17.

Let c8=min{2c0,2c09C1,1C1}c_{8}=\min\big{\{}2c_{0},\frac{2c_{0}}{9C_{1}},\frac{1}{C_{1}}\big{\}}. Then for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} satisfying the regularity condition,

  1. 1.

    PrxD(|ux|b)c8b\operatorname{Pr}_{x\sim D}\left(\left\lvert u\cdot x\right\rvert\leq b\right)\geq c_{8}\cdot b;

  2. 2.

    PrxDu,b(E)1c8bPrxD(E)\operatorname{Pr}_{x\sim D_{u,b}}(E)\leq\frac{1}{c_{8}b}\operatorname{Pr}_{x\sim D}(E) for any event EE.

Proof.

We first consider the case that uu is a unit vector.

For the lower bound, Part 3 of Lemma 12 shows that the density function of the random variable uxu\cdot x is lower bounded by c0c_{0} when |ux|1/9\left\lvert u\cdot x\right\rvert\leq 1/9. Thus

PrxD(|ux|b)PrxD(|ux|min{b,1/9})2c0min{b,1/9}2c0min{1,19C1}b\displaystyle\operatorname{Pr}_{x\sim D}\left(\left\lvert u\cdot x\right\rvert\leq b\right)\geq\operatorname{Pr}_{x\sim D}\left(\left\lvert u\cdot x\right\rvert\leq\min\{b,{1}/{9}\}\right)\geq 2c_{0}\min\{b,{1}/{9}\}\geq 2c_{0}\min\bigg{\{}1,\frac{1}{9C_{1}}\bigg{\}}\cdot b

where in the last inequality we use the condition bC1b\leq C_{1}.

For any event EE, we always have

PrxDu,b(E)PrxD(E)PrxD(|ux|b)1c8bPrxD(E).\operatorname{Pr}_{x\sim D_{u,b}}(E)\leq\frac{\operatorname{Pr}_{x\sim D}(E)}{\operatorname{Pr}_{x\sim D}({\left\lvert u\cdot x\right\rvert\leq b})}\leq\frac{1}{c_{8}b}\operatorname{Pr}_{x\sim D}(E).

Now we consider the case that uu is the zero vector and b=C1b=C_{1}. Then PrxD(|ux|b)=1c8b\operatorname{Pr}_{x\sim D}\left(\left\lvert u\cdot x\right\rvert\leq b\right)=1\geq c_{8}\cdot b in view of the choice c8c_{8}. Thus Part 2 still follows. The proof is complete. ∎

Lemma 18.

There exists an absolute constant c9>0c_{9}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition. Let SS be a set of i.i.d. instances drawn from Du,bD_{u,b}. Then

PrSDu,bn(maxxSxc9log|S|dbδ)δ.\operatorname{Pr}_{S\sim D_{u,b}^{n}}\mathinner{\left(\max_{x\in S}\left\lVert x\right\rVert_{\infty}\geq c_{9}\log\frac{\left\lvert S\right\rvert d}{b\delta}\right)}\leq\delta.
Proof.

Using Lemma 14 we have

PrSDn(maxxSxc7log|S|dδ)δ.\operatorname{Pr}_{S\sim D^{n}}\mathinner{\left(\max_{x\in S}\left\lVert x\right\rVert_{\infty}\geq c_{7}\log\frac{\left\lvert S\right\rvert d}{\delta}\right)}\leq\delta.

Thus, using Part 2 of Lemma 17 gives

PrSDu,bn(maxxSxc7log|S|dδ)δc8b.\operatorname{Pr}_{S\sim D_{u,b}^{n}}\mathinner{\left(\max_{x\in S}\left\lVert x\right\rVert_{\infty}\geq c_{7}\log\frac{\left\lvert S\right\rvert d}{\delta}\right)}\leq\frac{\delta}{c_{8}b}.

The lemma follows by setting \delta^{\prime}=\frac{\delta}{c_{8}b} as the failure probability in the last display and choosing c_{9} sufficiently large relative to c_{7} and c_{8}, so that c_{7}\log\frac{\left\lvert S\right\rvert d}{\delta}=c_{7}\log\frac{\left\lvert S\right\rvert d}{c_{8}b\delta^{\prime}}\leq c_{9}\log\frac{\left\lvert S\right\rvert d}{b\delta^{\prime}}. ∎

Appendix C Orlicz Norm and Concentration Results using Adamczak’s Bound

The following notion of Orlicz norm [vdGL13, Dud14] is useful in handling random variables that have tails of the form exp(tα)\exp(-t^{\alpha}) for general α\alpha’s beyond α=2\alpha=2 (subgaussian) and α=1\alpha=1 (subexponential).

Definition 19 (Orlicz norm).

For any zz\in\mathbb{R}, let ψα:zexp(zα)1\psi_{\alpha}\mathrel{\mathop{\mathchar 58\relax}}z\mapsto\exp(z^{\alpha})-1. Furthermore, for a random variable ZZ\in\mathbb{R} and α>0\alpha>0, define Zψα\left\lVert Z\right\rVert_{\psi_{\alpha}}, the Orlicz norm of ZZ with respect to ψα\psi_{\alpha}, as:

Zψα=inf{t>0:𝔼Z[ψα(|Z|/t)]1}.\left\lVert Z\right\rVert_{\psi_{\alpha}}=\inf\Big{\{}t>0\mathrel{\mathop{\mathchar 58\relax}}\operatorname{\mathbb{E}}_{Z}\mathinner{\bigl{[}\psi_{\alpha}\mathinner{\left({\left\lvert Z\right\rvert}/{t}\right)}\bigr{]}}\leq 1\Big{\}}.
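For intuition, the infimum in Definition 19 can be approximated on a finite sample by replacing the expectation with an empirical average and bisecting over t; the sketch below does exactly that (here for the standard exponential distribution, whose \psi_{1} norm equals 2), and is a numerical illustration rather than a tool used in the proofs.

    import numpy as np

    def empirical_orlicz_norm(z, alpha=1.0, iters=60):
        """Approximate ||Z||_{psi_alpha} = inf{t > 0 : E[psi_alpha(|Z|/t)] <= 1}
        with the expectation replaced by an empirical mean over the samples z."""
        z = np.abs(np.asarray(z, dtype=float))

        def feasible(t):
            vals = np.minimum((z / t) ** alpha, 700.0)   # cap to avoid overflow in exp
            return np.mean(np.expm1(vals)) <= 1.0        # psi_alpha(a) = exp(a^alpha) - 1

        lo, hi = 1e-8, max(z.max(), 1e-8)
        while not feasible(hi):                          # grow hi until it is feasible
            hi *= 2.0
        for _ in range(iters):                           # bisect toward the infimum
            mid = 0.5 * (lo + hi)
            lo, hi = (lo, mid) if feasible(mid) else (mid, hi)
        return hi

    rng = np.random.default_rng(0)
    print(empirical_orlicz_norm(rng.exponential(size=100_000)))   # roughly 2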

We collect some basic facts about Orlicz norms in the following lemma; they can be found in Section 1.3 of [VDVW96].

Lemma 20.

Let ZZ, Z1Z_{1}, Z2Z_{2} be real-valued random variables. Consider the Orlicz norm with respect to ψα\psi_{\alpha}. We have the following:

  1. 1.

    ψα\left\lVert\cdot\right\rVert_{\psi_{\alpha}} is a norm. For any aa\in\mathbb{R}, aZψα=|a|Zψα\left\lVert aZ\right\rVert_{\psi_{\alpha}}=\left\lvert a\right\rvert\cdot\left\lVert Z\right\rVert_{\psi_{\alpha}}; Z1+Z2ψαZ1ψα+Z2ψα\left\lVert Z_{1}+Z_{2}\right\rVert_{\psi_{\alpha}}\leq\left\lVert Z_{1}\right\rVert_{\psi_{\alpha}}+\left\lVert Z_{2}\right\rVert_{\psi_{\alpha}}.

  2. 2.

    ZpZψpp!Zψ1\left\lVert Z\right\rVert_{p}\leq\left\lVert Z\right\rVert_{\psi_{p}}\leq p!\left\lVert Z\right\rVert_{\psi_{1}} where Zp:=(𝔼[|Z|p])1/p\left\lVert Z\right\rVert_{p}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left(\operatorname{\mathbb{E}}\mathinner{\left[\left\lvert Z\right\rvert^{p}\right]}\right)}^{1/p}.

  3. 3.

    For any p,α>0p,\alpha>0, Zψpα=Zαψp/α\left\lVert Z\right\rVert_{\psi_{p}}^{\alpha}=\left\lVert Z^{\alpha}\right\rVert_{\psi_{p/\alpha}}.

  4. 4.

    If Pr(|Z|t)K1exp(K2tα)\operatorname{Pr}\left(\left\lvert Z\right\rvert\geq t\right)\leq K_{1}\exp\mathinner{\left(-K_{2}t^{\alpha}\right)} for any t0t\geq 0, then Zψα(2(lnK1+1)K2)1/α\left\lVert Z\right\rVert_{\psi_{\alpha}}\leq\left(\frac{2(\ln K_{1}+1)}{K_{2}}\right)^{1/\alpha}.

  5. 5.

    If ZψαK\left\lVert Z\right\rVert_{\psi_{\alpha}}\leq K, then for all t0t\geq 0, Pr(|Z|t)2exp((tK)α)\operatorname{Pr}\left(\left\lvert Z\right\rvert\geq t\right)\leq 2\exp\left(-(\frac{t}{K})^{\alpha}\right).

The following auxiliary results, tailored to the localized sampling scheme in Algorithm 1, will also be useful in our analysis.

Lemma 21.

There exists an absolute constant C3>0C_{3}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition. Let S={x1,,xn}S=\{x_{1},\dots,x_{n}\} be a set of nn instances drawn from Du,bD_{u,b}. Then

maxxSxψ1C3logndb.\left\lVert\max_{x\in S}\left\lVert x\right\rVert_{\infty}\right\rVert_{\psi_{1}}\leq C_{3}\log\frac{nd}{b}.

Consequently,

𝔼SDu,bn[maxxSx]C3logndb.\operatorname{\mathbb{E}}_{S\sim D_{u,b}^{n}}\mathinner{\Bigl{[}\max_{x\in S}\left\lVert x\right\rVert_{\infty}\Bigr{]}}\leq C_{3}\log\frac{nd}{b}.
Proof.

Let Z be an isotropic log-concave random variable in \mathbb{R}. Part 5 of Lemma 12 (applied with d=1) shows that for all t>0,

Pr(|Z|>t)exp(t+1).\operatorname{Pr}(\left\lvert Z\right\rvert>t)\leq\exp(-t+1).

Fix i\in\{1,\dots,n\} and j\in\{1,\dots,d\}, and denote by x_{i}^{(j)} the j-th coordinate of x_{i}. By Part 1 of Lemma 12, the j-th coordinate of an instance drawn from D is isotropic log-concave. Thus, by Part 2 of Lemma 17,

PrxDu,b(|xi(j)|>t)1c8bPrxD(|xi(j)|>t)1c8bexp(t+1).\operatorname{Pr}_{x\sim D_{u,b}}\mathinner{\Bigl{(}\ \mathinner{\!\bigl{\lvert}x_{i}^{(j)}\bigr{\rvert}}>t\Bigr{)}}\leq\frac{1}{c_{8}b}\operatorname{Pr}_{x\sim D}\mathinner{\left(\ \mathinner{\!\bigl{\lvert}x_{i}^{(j)}\bigr{\rvert}}>t\right)}\leq\frac{1}{c_{8}b}\exp(-t+1).

Taking the union bound over i{1,,n}i\in\{1,\dots,n\} and j{1,,d}j\in\{1,\dots,d\}, we have for all t>0t>0

PrxDu,b(maxxSx>t)ndc8bexp(t+1).\operatorname{Pr}_{x\sim D_{u,b}}\mathinner{\left(\max_{x\in S}\left\lVert x\right\rVert_{\infty}>t\right)}\leq\frac{nd}{c_{8}b}\exp(-t+1).

Now Part 4 of Lemma 20 immediately implies that

maxxSxψ1C3logndb\left\lVert\max_{x\in S}\left\lVert x\right\rVert_{\infty}\right\rVert_{\psi_{1}}\leq C_{3}\log\frac{nd}{b}

for some constant C3>0C_{3}>0. The second inequality of the lemma is an immediate result by combining the above and Part 2 of Lemma 20. ∎

C.1 Adamczak’s bound

In this section, we establish the key concentration results that will be used to analyze the performance of soft outlier removal and random sampling in Algorithm 1. Since we consider isotropic log-concave distributions, the unlabeled instances x are unbounded, which prevents us from using standard concentration bounds, e.g. those of [KST08]. We therefore appeal to the following generalization of Talagrand's inequality, due to [Ada08].

Lemma 22 (Adamczak’s bound).

For any α(0,1]\alpha\in(0,1], there exists a constant Λα>0\Lambda_{\alpha}>0, such that the following holds. Given any function class \mathcal{F}, and a function FF such that for any ff\in\mathcal{F}, |f(x)|F(x)\left\lvert f(x)\right\rvert\leq F(x), we have with probability at least 1δ1-\delta over the draw of a set S={x1,,xn}S=\{x_{1},\dots,x_{n}\} of i.i.d. instances from DD,

supf|1ni=1nf(xi)𝔼xD[f(x)]|Λα(𝔼SDn[supf|1ni=1nf(xi)𝔼xD[f(x)]|]\displaystyle\sup_{f\in\mathcal{F}}\mathinner{\!\biggl{\lvert}\frac{1}{n}\sum_{i=1}^{n}{f(x_{i})}-\operatorname{\mathbb{E}}_{x\sim D}\mathinner{\left[f(x)\right]}\biggr{\rvert}}\leq\Lambda_{\alpha}\left(\operatorname{\mathbb{E}}_{S\sim D^{n}}\mathinner{\biggl{[}\sup_{f\in\mathcal{F}}\mathinner{\!\biggl{\lvert}\frac{1}{n}\sum_{i=1}^{n}{f(x_{i})}-\operatorname{\mathbb{E}}_{x\sim D}\mathinner{\left[f(x)\right]}\biggr{\rvert}}\biggr{]}}\right.
+supf𝔼xD[(f(x))2]ln1δn+(ln1δ)1/αnmax1inF(xi)ψα).\displaystyle\left.+\sqrt{\frac{\sup_{f\in\mathcal{F}}\operatorname{\mathbb{E}}_{x\sim D}\mathinner{\left[(f(x))^{2}\right]}\ln\frac{1}{\delta}}{n}}+\frac{(\ln\frac{1}{\delta})^{1/\alpha}}{n}\left\lVert\max_{1\leq i\leq n}F(x_{i})\right\rVert_{\psi_{\alpha}}\right).

We first establish the following result that upper bounds the expected value of Rademacher complexity of linear classes by the Orlicz norm of the random instances.

Lemma 23.

There exists an absolute constant C5>0C_{5}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition. Let S={x1,,xn}S=\{x_{1},\dots,x_{n}\} be a set of nn i.i.d. unlabeled instances drawn from Du,bD_{u,b}. Denote W=B2(u,r)B1(u,ρ)W=B_{2}(u,r)\cap B_{1}(u,\rho). Let a sequence of random variables Z={z1,,zn}Z=\{z_{1},\dots,z_{n}\} be drawn from a distribution supported on a bounded interval [λ,λ][-\lambda,\lambda] for some λ>0\lambda>0. Let σ={σ1,,σn}\sigma=\{\sigma_{1},\dots,\sigma_{n}\}, where the σi\sigma_{i}’s are i.i.d. Rademacher random variables independent of SS and ZZ. We have:

𝔼S,Z,σ[supwW|i=1nσizi(wxi)|]λbn+C5ρλnlogdlogndb.\operatorname{\mathbb{E}}_{S,Z,\sigma}\mathinner{\biggl{[}\sup_{w\in W}\mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}z_{i}(w\cdot x_{i})\biggr{\rvert}}\biggr{]}}\leq\lambda b\sqrt{n}+C_{5}\rho\lambda\sqrt{n\log d}\cdot\log\frac{nd}{b}.
Proof.

Let V=B2(0,r)B1(0,ρ)V=B_{2}(0,r)\cap B_{1}(0,\rho) so that any wWw\in W can be expressed as w=u+vw=u+v for some vVv\in V. First, conditioned on SS and ZZ, we have that

𝔼σ[supvV|i=1nσizi(vxi)|]ρ2nlog(2d)max1inzixiρλ2nlog(2d)max1inxi.\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}\sup_{v\in V}\mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}z_{i}(v\cdot x_{i})\biggr{\rvert}}\biggr{]}}\leq\rho\sqrt{2n\log(2d)}\cdot\max_{1\leq i\leq n}\left\lVert z_{i}x_{i}\right\rVert_{\infty}\leq\rho\lambda\sqrt{2n\log(2d)}\cdot\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}.

Thus,

𝔼S,Z,σ[supvV|i=1nσizi(vxi)|]\displaystyle\operatorname{\mathbb{E}}_{S,Z,\sigma}\mathinner{\biggl{[}\sup_{v\in V}\mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}z_{i}(v\cdot x_{i})\biggr{\rvert}}\biggr{]}} ρλ2nlog(2d)𝔼S[max1inxi]\displaystyle\leq\rho\lambda\sqrt{2n\log(2d)}\cdot\operatorname{\mathbb{E}}_{S}\mathinner{\left[\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}\right]}
C5ρλnlogdlogndb,\displaystyle\leq C_{5}\rho\lambda\sqrt{n\log d}\cdot\log\frac{nd}{b}, (C.1)

where the second inequality follows from Lemma 21.

On the other hand, using the fact that \operatorname{\mathbb{E}}[A]\leq\left(\operatorname{\mathbb{E}}[A^{2}]\right)^{1/2} for any random variable A, we have

𝔼S,Z,σ[|i=1nσizi(uxi)|]\displaystyle\operatorname{\mathbb{E}}_{S,Z,\sigma}\mathinner{\biggl{[}\ \mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}z_{i}(u\cdot x_{i})\biggr{\rvert}}\biggr{]}} 𝔼S,Z,σ[(i=1nσizi(uxi))2]\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}_{S,Z,\sigma}\mathinner{\Biggl{[}\mathinner{\biggl{(}\sum_{i=1}^{n}\sigma_{i}z_{i}(u\cdot x_{i})\biggr{)}}^{2}\Biggr{]}}}
=𝔼S,Z[i=1nzi2(uxi)2]nb2λ2,\displaystyle=\sqrt{\operatorname{\mathbb{E}}_{S,Z}\mathinner{\Biggl{[}\sum_{i=1}^{n}z_{i}^{2}(u\cdot x_{i})^{2}\Biggr{]}}}\leq\sqrt{nb^{2}\lambda^{2}},

where in the equality we use the observation that 𝔼S,Z,σ[σiσjzizj(uxi)(uxj)]=0\operatorname{\mathbb{E}}_{S,Z,\sigma}\mathinner{\left[\sigma_{i}\sigma_{j}z_{i}z_{j}(u\cdot x_{i})(u\cdot x_{j})\right]}=0 when iji\neq j, and in the last inequality we used the condition that xix_{i} is drawn from Du,bD_{u,b}. Combining the above with (C.1) we obtain the desired result. ∎
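To illustrate the first display of the proof: dropping the \ell_{2} constraint, the supremum over the \ell_{1}-ball has the closed form \sup_{\|v\|_{1}\leq\rho}|\sum_{i}\sigma_{i}z_{i}(v\cdot x_{i})|=\rho\,\|\sum_{i}\sigma_{i}z_{i}x_{i}\|_{\infty}, so the Rademacher average can be estimated by Monte Carlo and compared with the bound \rho\lambda\sqrt{2n\log(2d)}\max_{i}\|x_{i}\|_{\infty}. The sketch below does this for Gaussian instances and uniformly distributed z_{i}; all parameter values are arbitrary and the instances are not drawn from D_{u,b}, so it is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, rho, lam = 200, 50, 1.0, 1.0

    X = rng.standard_normal((n, d))                  # stand-in instances x_1, ..., x_n
    Z = rng.uniform(-lam, lam, size=n)               # bounded multipliers z_i in [-lam, lam]

    trials, sups = 2000, []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
        g = (sigma * Z) @ X                          # sum_i sigma_i z_i x_i
        sups.append(rho * np.max(np.abs(g)))         # sup over the l1-ball via duality
    estimate = np.mean(sups)

    bound = rho * lam * np.sqrt(2 * n * np.log(2 * d)) * np.max(np.abs(X))
    print(f"Monte Carlo estimate {estimate:.2f} <= bound {bound:.2f}")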

C.2 Uniform concentration of hinge loss

Proposition 24.

There exists an absolute constant C_{6}>0 such that the following holds for all isotropic log-concave distributions D\in\mathcal{D} and all D_{u,b} that satisfy the regularity condition. Let S=\{x_{1},\dots,x_{n}\} be a set of n i.i.d. unlabeled instances drawn from D_{u,b}. Let y_{x}=\operatorname{sign}\left(w^{*}\cdot x\right) for any x\sim D_{u,b}. Denote W=B_{2}(u,r)\cap B_{1}(u,\rho) and let G(w)=\frac{1}{n}\sum_{i=1}^{n}\ell_{\tau}(w;x_{i},y_{x_{i}})-\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\big[\ell_{\tau}\left(w;x,y_{x}\right)\big]. Then with probability 1-\delta,

supwW|G(w)|C6(b+ρlogdlogndbτn+b+rτnlog1δ+b+ρlogndbτnlog1δ).\displaystyle\sup_{w\in W}\left\lvert G(w)\right\rvert\leq C_{6}\mathinner{\left(\frac{b+\rho\sqrt{\log d}\log\frac{nd}{b}}{\tau\sqrt{n}}+\frac{b+r}{\tau\sqrt{n}}\sqrt{\log\frac{1}{\delta}}+\frac{b+\rho\log\frac{nd}{b}}{\tau n}\log\frac{1}{\delta}\right)}.

In particular, suppose b=O(r)b=O(r), ρ=O(sr)\rho=O(\sqrt{s}r) and τ=Ω(r)\tau=\Omega(r). Then we have: for any t>0t>0, a sample size n=O~(1t2slog2dblogdδ)n=\tilde{O}\Big{(}\frac{1}{t^{2}}s\log^{2}\frac{d}{b}\cdot\log\frac{d}{\delta}\Big{)} suffices to guarantee that with probability 1δ1-\delta, supwW|G(w)|t\sup_{w\in W}\left\lvert G(w)\right\rvert\leq t.

Proof.

We will use Lemma 22 with function class ={(x,y)τ(w;x,y):wW}\mathcal{F}=\mathinner{\left\{(x,y)\mapsto\ell_{\tau}(w;x,y)\mathrel{\mathop{\mathchar 58\relax}}w\in W\right\}} and the Orlicz norm with respect to ψ1\psi_{1}. We define F(x,y)=1+bτ+ρτxF(x,y)=1+\frac{b}{\tau}+\frac{\rho}{\tau}\|x\|_{\infty}. It can be seen that for every wWw\in W,

\left\lvert\ell_{\tau}(w;x,y)\right\rvert\leq 1+\frac{\left\lvert w\cdot x\right\rvert}{\tau}\leq 1+\frac{\left\lvert u\cdot x\right\rvert}{\tau}+\frac{\left\lvert(w-u)\cdot x\right\rvert}{\tau}\leq 1+\frac{b}{\tau}+\frac{\rho}{\tau}\|x\|_{\infty}=F(x,y).

That is, for every ff in \mathcal{F}, |f(x,y)|F(x,y)\left\lvert f(x,y)\right\rvert\leq F(x,y).

Step 1. We upper bound max1inF(xi,yxi)ψ1\left\lVert\max_{1\leq i\leq n}F(x_{i},y_{x_{i}})\right\rVert_{\psi_{1}}. Since ψ1\left\lVert\cdot\right\rVert_{\psi_{1}} is a norm, we have

max1inF(xi,yxi)ψ1\displaystyle\left\lVert\max_{1\leq i\leq n}F(x_{i},y_{x_{i}})\right\rVert_{\psi_{1}} 1+bτψ1+ρτmax1inxiψ1\displaystyle\leq\left\lVert 1+\frac{b}{\tau}\right\rVert_{\psi_{1}}+\left\lVert\frac{\rho}{\tau}\cdot\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}\right\rVert_{\psi_{1}}
=1+bτ+ρτmax1inxiψ1\displaystyle=1+\frac{b}{\tau}+\frac{\rho}{\tau}\cdot\left\lVert\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}\right\rVert_{\psi_{1}}
1+bτ+C3ρτlogndb,\displaystyle\leq 1+\frac{b}{\tau}+\frac{C_{3}\rho}{\tau}\log\frac{nd}{b}, (C.2)

where we applied Lemma 21 in the last inequality.

Step 2. Next, we upper bound supwW𝔼xDu,b[(τ(w;x,yx))2]\sup_{w\in W}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\left[(\ell_{\tau}(w;x,y_{x}))^{2}\right]}. For all ww in WW, we have

supwW𝔼xDu,b[(τ(w;x,yx))2]2supwW𝔼xDu,b[1+(wx)2τ2]2+2C¯2r2+b2τ2\sup_{w\in W}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\left[(\ell_{\tau}(w;x,y_{x}))^{2}\right]}\leq 2\cdot\sup_{w\in W}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\biggl{[}1+\frac{(w\cdot x)^{2}}{\tau^{2}}\biggr{]}}\leq 2+2\bar{C}_{2}\cdot\frac{r^{2}+b^{2}}{\tau^{2}} (C.3)

where the last inequality uses Lemma 15.

Step 3. Finally, we upper bound 𝔼SDu,bn[supwW|G(w)|]\operatorname{\mathbb{E}}_{S\sim D_{u,b}^{n}}\mathinner{\left[\sup_{w\in W}\left\lvert G(w)\right\rvert\right]}. Let σ={σ1,,σn}\sigma=\{\sigma_{1},\dots,\sigma_{n}\} where each σi\sigma_{i} is an i.i.d. draw from the Rademacher distribution. We have

𝔼S[supwW|G(w)|]\displaystyle\operatorname{\mathbb{E}}_{S}\mathinner{\biggl{[}\sup_{w\in W}\left\lvert G(w)\right\rvert\biggr{]}} 2n𝔼S,σ[supwW|i=1nσiτ(w;xi,yxi)|]\displaystyle\leq\frac{2}{n}\operatorname{\mathbb{E}}_{S,\sigma}\mathinner{\biggl{[}\sup_{w\in W}\mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}\ell_{\tau}\mathinner{\left(w;x_{i},y_{x_{i}}\right)}\biggr{\rvert}}\biggr{]}}
2τn𝔼S,σ[supwW|i=1nσiyxi(wxi)|]\displaystyle\leq\frac{2}{\tau n}\operatorname{\mathbb{E}}_{S,\sigma}\mathinner{\biggl{[}\sup_{w\in W}\mathinner{\!\biggl{\lvert}\sum_{i=1}^{n}\sigma_{i}y_{x_{i}}(w\cdot x_{i})\biggr{\rvert}}\biggr{]}}
2bτn+2C5ρτlogdnlogndb.\displaystyle\leq\frac{2b}{\tau\sqrt{n}}+\frac{2C_{5}\rho}{\tau}\cdot\sqrt{\frac{\log d}{n}}\cdot\log\frac{nd}{b}. (C.4)

In the above, the first inequality used standard symmetrization arguments; see, for example, Lemma 26.2 of [SSBD14]. In the second inequality, we used the contraction property of Rademacher complexity and the fact that τ(w;x,y)\ell_{\tau}(w;x,y) can be seen as a 1τ\frac{1}{\tau}-Lipschitz function ϕ(a)=max{0,1aτ}\phi(a)=\max\big{\{}0,1-\frac{a}{\tau}\big{\}} applied on input a=ywxa=yw\cdot x. In the last inequality, we applied Lemma 23 with the fact that |yxi|1\left\lvert y_{x_{i}}\right\rvert\leq 1.

Putting everything together. The first inequality of the proposition follows from combining (C.2), (C.3), and (C.4) and applying Lemma 22 with \mathcal{F} and \psi_{1}. Under our choice of (b,r,\rho,\tau), a direct calculation yields the stated bound on n. ∎

C.3 Uniform concentration of relaxed sparse PCA

Proposition 25.

There exists an absolute constant C7>0C_{7}>0 such that the following holds for all isotropic log-concave distributions D𝒟D\in\mathcal{D} and all Du,bD_{u,b} that satisfy the regularity condition. Let S={x1,,xn}S=\{x_{1},\dots,x_{n}\} be a set of nn i.i.d. unlabeled instances drawn from Du,bD_{u,b}. Denote G(H)=1ni=1nxiHxi𝔼xDu,b[xHx]G(H)=\frac{1}{n}\sum_{i=1}^{n}{x_{i}^{\top}Hx_{i}}-\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\bigl{[}x^{\top}Hx\bigr{]}}. Then with probability 1δ1-\delta,

supH|G(H)|C7ρ2log2ndb(logdn+log(1/δ)n+log21δn).\displaystyle\sup_{H\in\mathcal{M}}\left\lvert G(H)\right\rvert\leq C_{7}\rho^{2}\log^{2}\frac{nd}{b}\mathinner{\biggl{(}\sqrt{\frac{\log d}{n}}+\sqrt{\frac{\log({1}/{\delta})}{n}}+\frac{\log^{2}\frac{1}{\delta}}{n}\biggr{)}}.

In particular, suppose ρ=O(sr)\rho=O(\sqrt{s}r) and r=O(b)r=O(b). Then we have: for any t>0t>0, a sample size

n=O~(1t2s2b4log4db(logd+log21δ))n=\tilde{O}\mathinner{\left(\frac{1}{t^{2}}s^{2}b^{4}\log^{4}\frac{d}{b}\cdot\mathinner{\left(\log d+\log^{2}\frac{1}{\delta}\right)}\right)}

suffices to guarantee that with probability 1δ1-\delta, supH|G(H)|t\sup_{H\in\mathcal{M}}\left\lvert G(H)\right\rvert\leq t.

Proof.

Recall that \mathcal{M}=\left\{H\in\mathbb{R}^{d\times d}: H\succeq 0,\ \left\lVert H\right\rVert_{*}\leq r^{2},\ \left\lVert H\right\rVert_{1}\leq\rho^{2}\right\}. For any matrix H, we denote by H_{ij} the (i,j)-th entry of H. For any vector x, we denote by x^{(i)} the i-th coordinate of x.

We will use Lemma 22 with function class ={xxHx:H}\mathcal{F}=\mathinner{\left\{x\mapsto x^{\top}Hx\mathrel{\mathop{\mathchar 58\relax}}H\in\mathcal{M}\right\}} and the Orlicz norm with respect to ψ0.5\psi_{0.5}. Consider the function f(x):=xHxf(x)\mathrel{\mathop{\mathchar 58\relax}}=x^{\top}Hx parameterized by HH\in\mathcal{M}. First, we wish to find a function F(x)F(x) that upper bounds |f(x)|\left\lvert f(x)\right\rvert. It is easy to see that

|xHx|=|i,jHijx(i)x(j)|x2i,j|Hij|ρ2x2.\mathinner{\!\Bigl{\lvert}x^{\top}Hx\Bigr{\rvert}}=\mathinner{\!\Bigl{\lvert}\sum_{i,j}H_{ij}x^{(i)}x^{(j)}\Bigr{\rvert}}\leq\left\lVert x\right\rVert_{\infty}^{2}\sum_{i,j}\left\lvert H_{ij}\right\rvert\leq\rho^{2}\left\lVert x\right\rVert_{\infty}^{2}. (C.5)

Thus it suffices to choose F(x)=ρ2x2F(x)=\rho^{2}\left\lVert x\right\rVert_{\infty}^{2}.

Step 1. We first bound \left\lVert\sqrt{\max_{1\leq i\leq n}F(x_{i})}\right\rVert_{\psi_{1}}=\left\lVert\rho\cdot\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}\right\rVert_{\psi_{1}}\leq C_{3}\rho\log\frac{nd}{b} by Lemma 21. By Part 3 of Lemma 20, \left\lVert\max_{1\leq i\leq n}F(x_{i})\right\rVert_{\psi_{0.5}} equals \left\lVert\sqrt{\max_{1\leq i\leq n}F(x_{i})}\right\rVert_{\psi_{1}}^{2}. Thus

\left\lVert\max_{1\leq i\leq n}F(x_{i})\right\rVert_{\psi_{0.5}}\leq\left(C_{3}\rho\log\frac{nd}{b}\right)^{2}. (C.6)

Step 2. Next we upper bound \sup_{f\in\mathcal{F}}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\left[(f(x))^{2}\right], where we remark that taking the supremum over f\in\mathcal{F} is equivalent to taking it over H\in\mathcal{M}. Since \left\lvert f(x)\right\rvert\leq F(x), we have

(f(x))2(F(x))2ρ4x4.(f(x))^{2}\leq(F(x))^{2}\leq\rho^{4}\left\lVert x\right\rVert_{\infty}^{4}.

In view of Part 2 of Lemma 20, we have

(𝔼xDu,b[x4])1/424xψ124C3logdb,\mathinner{\left(\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\left[\left\lVert x\right\rVert_{\infty}^{4}\right]}\right)}^{1/4}\leq 24\left\lVert\left\lVert x\right\rVert_{\infty}\right\rVert_{\psi_{1}}\leq 24C_{3}\log\frac{d}{b}, (C.7)

where the last inequality follows from Lemma 21. Hence,

supf𝔼xDu,b[(f(x))2]K1ρ4log4db\sup_{f\in\mathcal{F}}\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\mathinner{\left[(f(x))^{2}\right]}\leq K_{1}\rho^{4}\log^{4}\frac{d}{b} (C.8)

for some absolute constant K1>0K_{1}>0.

Step 3. Finally, we upper bound \operatorname{\mathbb{E}}_{S\sim D_{u,b}^{n}}\left[\sup_{f\in\mathcal{F}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}f(x_{i})-\operatorname{\mathbb{E}}_{x\sim D_{u,b}}\left[f(x)\right]\right\rvert\right]. Let \sigma=\{\sigma_{1},\dots,\sigma_{n}\} where the \sigma_{i}'s are independent draws from the Rademacher distribution. By standard symmetrization arguments (see e.g. Lemma 26.2 of [SSBD14]), we have

\operatorname{\mathbb{E}}_{S}\Bigl[\sup_{H\in\mathcal{M}}\left\lvert G(H)\right\rvert\Bigr]\leq\frac{2}{n}\operatorname{\mathbb{E}}_{S,\sigma}\Bigl[\sup_{f\in\mathcal{F}}\Bigl\lvert\sum_{i=1}^{n}\sigma_{i}f(x_{i})\Bigr\rvert\Bigr]=\frac{2}{n}\operatorname{\mathbb{E}}_{S,\sigma}\Bigl[\sup_{H\in\mathcal{M}}\Bigl\lvert\sum_{i=1}^{n}\sigma_{i}x_{i}^{\top}Hx_{i}\Bigr\rvert\Bigr]. (C.9)

We first condition on SS and consider the expectation over σ\sigma. For a matrix HH, we use vec(H)\operatorname*{vec}(H) to denote the vector obtained by concatenating all of the columns of HH; likewise for xixix_{i}x_{i}^{\top}. It is crucial to observe that with this notation, for any HH\in\mathcal{M}, we have vec(H)1=H1ρ2\left\lVert\operatorname*{vec}(H)\right\rVert_{1}=\left\lVert H\right\rVert_{1}\leq\rho^{2}. It follows that

\displaystyle\operatorname{\mathbb{E}}_{\sigma}\Biggl[\sup_{H\in\mathcal{M}}\Biggl\lvert\sum_{i=1}^{n}\sigma_{i}x_{i}^{\top}Hx_{i}\Biggr\rvert\Biggr] \displaystyle\leq\operatorname{\mathbb{E}}_{\sigma}\Biggl[\sup_{H:\left\lVert\operatorname*{vec}(H)\right\rVert_{1}\leq\rho^{2}}\Biggl\lvert\sum_{i=1}^{n}\sigma_{i}\left\langle\operatorname*{vec}(H),\operatorname*{vec}(x_{i}x_{i}^{\top})\right\rangle\Biggr\rvert\Biggr]
\displaystyle\leq\rho^{2}\sqrt{n\ln(2d^{2})}\cdot\max_{1\leq i\leq n}\left\lVert\operatorname*{vec}(x_{i}x_{i}^{\top})\right\rVert_{\infty}
\displaystyle=\rho^{2}\sqrt{n\ln(2d^{2})}\cdot\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}^{2}.

where the second inequality is from Lemma 39, and the equality is from the observation that vec(xixi)=xi2\|\operatorname*{vec}(x_{i}x_{i}^{\top})\|_{\infty}=\|x_{i}\|_{\infty}^{2}. Therefore,

\displaystyle\operatorname{\mathbb{E}}_{S,\sigma}\Biggl[\sup_{H\in\mathcal{M}}\Biggl\lvert\sum_{i=1}^{n}\sigma_{i}x_{i}^{\top}Hx_{i}\Biggr\rvert\Biggr] \displaystyle\leq\rho^{2}\sqrt{n\ln(2d^{2})}\cdot\operatorname{\mathbb{E}}_{S}\left[\max_{1\leq i\leq n}\|x_{i}\|_{\infty}^{2}\right]
\displaystyle\leq\rho^{2}\sqrt{2n\ln(2d)}\cdot 2\left\lVert\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}\right\rVert_{\psi_{1}}^{2}
\displaystyle\leq 2C_{3}^{2}\rho^{2}\sqrt{2n\ln(2d)}\cdot\log^{2}\frac{nd}{b},

where the second inequality follows from Part 2 of Lemma 20, and the last inequality follows from Lemma 21. In summary,

\operatorname{\mathbb{E}}_{S,\sigma}\Biggl[\sup_{H\in\mathcal{M}}\Biggl\lvert\sum_{i=1}^{n}\sigma_{i}x_{i}^{\top}Hx_{i}\Biggr\rvert\Biggr]\leq K_{2}\sqrt{n\ln d}\cdot\rho^{2}\log^{2}\frac{nd}{b} (C.10)

for some constant K2>0K_{2}>0.

Combining (C.9) and (C.10), we have

\operatorname{\mathbb{E}}_{S}\Bigl[\sup_{H\in\mathcal{M}}\left\lvert G(H)\right\rvert\Bigr]\leq\frac{K_{3}\sqrt{\log d}}{\sqrt{n}}\cdot{\rho^{2}}\log^{2}\frac{nd}{b}. (C.11)

Putting everything together. Combining (C.6), (C.8), and (C.11) and applying Lemma 22 gives the first inequality of the proposition. Under our setting of (b,r,\rho), a direct calculation yields the stated bound on n. The proof is complete. ∎
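For concreteness, the quantity \sup_{H\in\mathcal{M}}\frac{1}{n}\sum_{i}x_{i}^{\top}Hx_{i} analyzed above is a semidefinite program and can be evaluated with an off-the-shelf convex solver. The sketch below (using cvxpy on toy Gaussian data, with arbitrary values of r and \rho) only illustrates the constraint set \mathcal{M}; it is not the soft outlier removal program (3.2) of Algorithm 2.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n, d, r, rho = 200, 10, 0.5, 1.5

    X = rng.standard_normal((n, d))                    # toy instances
    Sigma_hat = X.T @ X / n                            # empirical second-moment matrix

    H = cp.Variable((d, d), PSD=True)                  # H >= 0
    constraints = [cp.trace(H) <= r**2,                # trace norm = trace for PSD H
                   cp.sum(cp.abs(H)) <= rho**2]        # entrywise l1 norm ||H||_1
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ H)), constraints)
    problem.solve()
    print("sup_H (1/n) sum_i x_i^T H x_i is approximately", problem.value)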

Appendix D Performance Guarantee of Algorithm 1

In this section, we leverage all the tools from previous sections to establish the performance guarantee of Algorithm 1. Our main theorem, Theorem 4, follows from the analysis of each step of the algorithm, as we describe below.

D.1 Analysis of sample complexity

Recall that we refer to the number of calls to \mathrm{EX}_{\eta}^{x}(D,w^{*}) as the sample complexity of Algorithm 1. In order to obtain n_{k} instances residing in the band X_{k}:=\{x:\left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}\}, we have to call \mathrm{EX}_{\eta}^{x}(D,w^{*}) sufficiently many times.

Lemma 26 (Restatement of Lemma 6).

Consider phase k of Algorithm 1 for any k\geq 1. Suppose that Assumptions 1 and 2 are satisfied, and further assume \eta<\frac{1}{2}. By making N_{k}=O\Bigl(\frac{1}{b_{k}}\bigl(n_{k}+\log\frac{1}{\delta_{k}}\bigr)\Bigr) calls to the instance generation oracle \mathrm{EX}_{\eta}^{x}(D,w^{*}), we obtain n_{k} instances that fall into X_{k} with probability 1-\frac{\delta_{k}}{4}.

Proof.

By Lemma 17

PrxD(xXk)c8bk.\operatorname{Pr}_{x\sim D}(x\in X_{k})\geq c_{8}b_{k}.

This implies that

PrxEXηx(D,w)(xXkandx is clean)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}(x\in X_{k}\ \text{and}\ x\text{ is clean})
=\displaystyle= PrxEXηx(D,w)(xXkx is clean)PrxEXηx(D,w)(x is clean)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}(x\in X_{k}\mid x\text{ is clean})\cdot\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}(x\text{ is clean})
\displaystyle\geq c8bk(1η).\displaystyle\ c_{8}b_{k}(1-\eta).

We want to ensure that by drawing NkN_{k} instances from EXηx(D,w)\mathrm{EX}_{\eta}^{x}(D,w^{*}), with probability at least 1δk41-\frac{\delta_{k}}{4}, nkn_{k} out of them fall into the band XkX_{k}. We apply the second inequality of Lemma 38 by letting Zi=𝟏{xiXkandxi is clean}Z_{i}=\boldsymbol{1}_{\mathinner{\left\{x_{i}\in X_{k}\ \text{and}\ x_{i}\text{ is clean}\right\}}} and α=1/2\alpha=1/2, and obtain

Pr(|TC¯|c8bk(1η)2Nk)exp(c8bk(1η)Nk8),\operatorname{Pr}\mathinner{\left(\left\lvert\bar{T_{\mathrm{C}}}\right\rvert\leq\frac{c_{8}b_{k}(1-\eta)}{2}N_{k}\right)}\leq\exp\mathinner{\left(-\frac{c_{8}b_{k}(1-\eta)N_{k}}{8}\right)},

where the probability is taken over the event that we make a number of NkN_{k} calls to EXηx(D,w)\mathrm{EX}_{\eta}^{x}(D,w^{*}). Thus, when Nk8c8bk(1η)(nk+ln4δk)N_{k}\geq\frac{8}{c_{8}b_{k}(1-\eta)}\mathinner{\left(n_{k}+\ln\frac{4}{\delta_{k}}\right)}, we are guaranteed that at least nkn_{k} samples from EXηx(D,w)\mathrm{EX}_{\eta}^{x}(D,w^{*}) fall into the band XkX_{k} with probability 1δk41-\frac{\delta_{k}}{4}. The lemma follows by observing η<12\eta<\frac{1}{2}. ∎
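The sampling step analyzed above is simply rejection sampling from the oracle until n_{k} instances land in the band X_{k}. A hypothetical sketch is given below; draw_example() stands in for a single call to \mathrm{EX}_{\eta}^{x}(D,w^{*}) and is not specified by the paper, so this is an illustration of the scheme rather than the algorithm's implementation.

    import numpy as np

    def sample_in_band(draw_example, w_prev, b_k, n_k, N_max):
        """Collect n_k instances with |w_{k-1} . x| <= b_k, making at most N_max
        oracle calls; cf. Lemma 26 for the choice of N_max."""
        band, calls = [], 0
        while len(band) < n_k and calls < N_max:
            x = draw_example()                       # one call to the example oracle
            calls += 1
            if abs(np.dot(w_prev, x)) <= b_k:        # x falls into the band X_k
                band.append(x)
        return np.array(band), calls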

D.2 Analysis of pruning and the structure of T¯\bar{T}

With the instance set T¯\bar{T} on hand, we estimate the empirical noise rate after applying pruning (Step 6) in Algorithm 1. Recall that nk=|T¯|n_{k}=\left\lvert\bar{T}\right\rvert, i.e. the number of unlabeled instances before pruning.

Lemma 27.

Suppose that Assumption 1 and Assumption 2 are satisfied. Further assume η<12\eta<\frac{1}{2}. If Du,bD_{u,b} satisfies the regularity condition, we have

PrxEXηx(D,w)(xis dirtyxXu,b)2ηc8b\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\ \text{is\ dirty}\mid x\in X_{u,b}\right)}\leq\frac{2\eta}{c_{8}b}

where c8c_{8} was defined in Lemma 17 and Xu,b:={xd:|ux|b}X_{u,b}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left\{x\in\mathbb{R}^{d}\mathrel{\mathop{\mathchar 58\relax}}\left\lvert u\cdot x\right\rvert\leq b\right\}}.

Proof.

For an instance xx, we use tagx=1\mathrm{tag}_{x}=1 to denote that xx is drawn from DD, and use tagx=1\mathrm{tag}_{x}=-1 to denote that xx is adversarially generated.

We first calculate the probability that an instance returned by EXηx(D,w)\mathrm{EX}_{\eta}^{x}(D,w^{*}) falls into the band Xu,bX_{u,b} as follows:

PrxEXηx(D,w)(xXu,b)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\right)}
=\displaystyle= PrxEXηx(D,w)(xXu,bandtagx=1)+PrxEXηx(D,w)(xXu,bandtagx=1)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\ \text{and}\ \mathrm{tag}_{x}=1\right)}+\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\ \text{and}\ \mathrm{tag}_{x}=-1\right)}
\displaystyle\geq PrxEXηx(D,w)(xXu,bandtagx=1)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\ \text{and}\ \mathrm{tag}_{x}=1\right)}
=\displaystyle= PrxEXηx(D,w)(xXu,btagx=1)PrxEXηx(D,w)(tagx=1)\displaystyle\ \operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\mid\mathrm{tag}_{x}=1\right)}\cdot\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(\mathrm{tag}_{x}=1\right)}
=\displaystyle= PrxD(xXu,b)PrxEXηx(D,w)(tagx=1)\displaystyle\ \operatorname{Pr}_{x\sim D}\mathinner{\left(x\in X_{u,b}\right)}\cdot\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(\mathrm{tag}_{x}=1\right)}
ζ\displaystyle\stackrel{{\scriptstyle\zeta}}{{\geq}} c8b(1η)\displaystyle\ c_{8}b\cdot(1-\eta)
\displaystyle\geq 12c8b,\displaystyle\ \frac{1}{2}c_{8}b,

where in the inequality ζ\zeta we applied Part 1 of Lemma 17. It is thus easy to see that

PrxEXηx(D,w)(tagx=1xXu,b)PrxEXηx(D,w)(tagx=1)PrxEXηx(D,w)(xXu,b)2ηc8b,\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(\textrm{tag}_{x}=-1\mid x\in X_{u,b}\right)}\leq\frac{\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(\textrm{tag}_{x}=-1\right)}}{\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}\mathinner{\left(x\in X_{u,b}\right)}}\leq\frac{2\eta}{c_{8}b},

which is the desired result. ∎

Lemma 28.

Suppose that Assumptions 1 and 2 are satisfied. Further assume ηc5ϵ\eta\leq c_{5}\epsilon. For any 1kk01\leq k\leq k_{0}, if nk6ξkln48δkn_{k}\geq\frac{6}{\xi_{k}}\ln\frac{48}{\delta_{k}}, then with probability 1δk241-\frac{\delta_{k}}{24} over the draw of T¯\bar{T}, the following results hold simultaneously:

  1. 1.

    TC=T¯CT_{\mathrm{C}}=\bar{T}_{\mathrm{C}} and hence T~C=T^C\tilde{T}_{\mathrm{C}}=\hat{T}_{\mathrm{C}}, i.e. all clean instances in T¯\bar{T} are intact after pruning;

  2. 2.

    |TD||T|ξk\frac{\left\lvert T_{\mathrm{D}}\right\rvert}{\left\lvert T\right\rvert}\leq\xi_{k}, i.e. the empirical noise rate after pruning is upper bounded by ξk\xi_{k};

  3. 3.

    |TC|(1ξk)nk\left\lvert T_{\mathrm{C}}\right\rvert\geq(1-\xi_{k})n_{k}.

In particular, with the hyper-parameter setting in Section 3.2, |TC|12nk\left\lvert T_{\mathrm{C}}\right\rvert\geq\frac{1}{2}n_{k}.

Proof.

Let us write events E1:={TC=T¯C}E_{1}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left\{T_{\mathrm{C}}=\bar{T}_{\mathrm{C}}\right\}}, E2:={|T¯D|ξknk}E_{2}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left\{\left\lvert\bar{T}_{\mathrm{D}}\right\rvert\leq\xi_{k}n_{k}\right\}}. We bound the probability of the two events over the draw of T¯\bar{T}.

Recall that Lemma 18 implies that with probability 1δk481-\frac{\delta_{k}}{48}, all instances in T¯C\bar{T}_{\mathrm{C}} are in the \ell_{\infty}-ball B(0,νk)B_{\infty}(0,\nu_{k}) for νk=c9log48|T¯|dbkδk\nu_{k}=c_{9}\log\frac{48\left\lvert\bar{T}\right\rvert d}{b_{k}\delta_{k}}. Since the pruning step removes only instances lying outside this ball, on this event no clean instance in T¯\bar{T} is removed, which implies Pr(E1)1δk48\operatorname{Pr}(E_{1})\geq 1-\frac{\delta_{k}}{48}.

We next upper bound the noise rate within the band Xk:={x:|wk1x|bk}X_{k}\mathrel{\mathop{\mathchar 58\relax}}=\{x\mathrel{\mathop{\mathchar 58\relax}}\left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}\} using Lemma 27:

PrxEXηx(D,w)(xis dirtyxXk)2ηc8bk=2ηc8c¯2k3πc8c¯c1ηϵπc5c8c¯c1ξk2,\operatorname{Pr}_{x\sim\mathrm{EX}_{\eta}^{x}(D,w^{*})}(x\ \text{is\ dirty}\mid x\in X_{k})\leq\frac{2\eta}{c_{8}b_{k}}=\frac{2\eta}{c_{8}\bar{c}\cdot 2^{-k-3}}\leq\frac{\pi}{c_{8}\bar{c}c_{1}}\cdot\frac{\eta}{\epsilon}\leq\frac{\pi c_{5}}{c_{8}\bar{c}c_{1}}\leq\frac{\xi_{k}}{2},

where the equality uses our setting of bkb_{k}, the second inequality uses the condition kk0k\leq k_{0} and the setting k0=log(π16c1ϵ)k_{0}=\log\big{(}\frac{\pi}{16c_{1}\epsilon}\big{)}, and the last inequality is guaranteed by our choice of c5c_{5}. Now we apply the first inequality of Lemma 38 by specifying Zi=𝟏{xiis dirty}Z_{i}=\boldsymbol{1}_{\mathinner{\left\{x_{i}\ \text{is\ dirty}\right\}}} and α=1\alpha=1 therein, with \frac{\xi_{k}}{2} serving as the upper bound on the per-instance dirty probability, which gives

Pr(|T¯D|ξknk)exp(ξknk6),\operatorname{Pr}\mathinner{\left(\left\lvert\bar{T}_{\mathrm{D}}\right\rvert\geq\xi_{k}n_{k}\right)}\leq\exp\mathinner{\left(-\frac{\xi_{k}n_{k}}{6}\right)},

where the probability is taken over the draw of T¯\bar{T}. This implies Pr(E2)1δk48\operatorname{Pr}(E_{2})\geq 1-\frac{\delta_{k}}{48} provided that nk6ξkln48δkn_{k}\geq\frac{6}{\xi_{k}}\ln\frac{48}{\delta_{k}}.

By the union bound, we have Pr(E1E2)1δk24\operatorname{Pr}(E_{1}\cap E_{2})\geq 1-\frac{\delta_{k}}{24}. We show that on the event E1E2E_{1}\cap E_{2}, the second and third parts of the lemma follow. To see this, note that |TD||T||T¯D|nk\frac{\left\lvert T_{\mathrm{D}}\right\rvert}{\left\lvert T\right\rvert}\leq\frac{\left\lvert\bar{T}_{\mathrm{D}}\right\rvert}{n_{k}}: only dirty instances can be removed by pruning, so the removal decreases the numerator and the denominator by the same amount and the ratio does not increase. Combined with E2E_{2}, this proves the second part. Moreover, on E1E2E_{1}\cap E_{2} we have |TC|=|T¯C|=|T¯||T¯D|(1ξk)|T¯|\left\lvert T_{\mathrm{C}}\right\rvert=\left\lvert\bar{T}_{\mathrm{C}}\right\rvert=\left\lvert\bar{T}\right\rvert-\left\lvert\bar{T}_{\mathrm{D}}\right\rvert\geq(1-\xi_{k})\left\lvert\bar{T}\right\rvert, which is exactly the third part. Finally, the last claim of the lemma follows from the third part since the hyper-parameter setting in Section 3.2 ensures \xi_{k}\leq\frac{1}{2}. ∎
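Operationally, the pruning step analyzed in Lemma 28 is a simple \ell_{\infty}-norm filter. The sketch below is illustrative only (the variable names and the Gaussian/heavy-tailed toy data are assumptions of this example, not the paper's implementation); it removes instances of large \ell_{\infty}-norm and reports the empirical noise rate among the survivors.

import numpy as np

def prune(X, nu_k):
    """Keep only the rows of X whose l_infinity-norm is at most nu_k."""
    return np.max(np.abs(X), axis=1) <= nu_k

def empirical_noise_rate(keep, is_dirty):
    """Fraction of dirty instances among the surviving set T."""
    survivors = keep.sum()
    return np.logical_and(keep, is_dirty).sum() / max(survivors, 1)

# Toy data: clean instances are well bounded, a few dirty ones have huge norm.
rng = np.random.default_rng(1)
X_clean = rng.normal(size=(1000, 20))           # stand-in for draws from D
X_dirty = 50.0 * rng.normal(size=(30, 20))      # adversarial, large l_infinity-norm
X = np.vstack([X_clean, X_dirty])
is_dirty = np.array([False] * 1000 + [True] * 30)

keep = prune(X, nu_k=6.0)
print(is_dirty[~keep].all())                    # only dirty instances were removed (event E_1)
print(empirical_noise_rate(keep, is_dirty))     # empirical noise rate after pruning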

D.3 Analysis of Algorithm 2

Lemma 29 (Restatement of Lemma 3).

Suppose that Assumptions 1 and 2 are satisfied, and that ηc5ϵ\eta\leq c_{5}\epsilon. There exists a constant C2>2C_{2}>2 such that the following holds. Consider phase kk of Algorithm 1 for any 1kk01\leq k\leq k_{0}. Denote by k\mathcal{M}_{k} the constraint set of (3.2). If |TC|=O~(s2log4dbk(logd+log21δk))\left\lvert T_{\mathrm{C}}\right\rvert=\tilde{O}\mathinner{\Bigl{(}s^{2}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\bigl{(}\log d+\log^{2}\frac{1}{\delta_{k}}\bigr{)}}\Bigr{)}}, then with probability 1δk241-\frac{\delta_{k}}{24} over the draw of TCT_{\mathrm{C}}, we have

  1. supHk1|TC|xTCxHx2C2(bk2+rk2)\sup_{H\in\mathcal{M}_{k}}\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq 2C_{2}(b_{k}^{2}+r_{k}^{2});

  2. supwWk1|TC|xTC(wx)25C2(bk2+rk2)\sup_{w\in W_{k}}\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}(w\cdot x)^{2}\leq 5C_{2}\mathinner{\left(b_{k}^{2}+r_{k}^{2}\right)}.

Proof.

The first part follows immediately by combining Proposition 25 and Lemma 16, and recalling our setting of bkb_{k} and rkr_{k}.

To see the second part, for any wWkw\in W_{k}, we can upper bound (wx)2(w\cdot x)^{2} as follows:

(wx)22(wk1x)2+2(vx)22bk2+2x(vv)x,(w\cdot x)^{2}\leq 2(w_{k-1}\cdot x)^{2}+2(v\cdot x)^{2}\leq 2b_{k}^{2}+2x^{\top}(vv^{\top})x,

where the second inequality uses the fact that every xTx\in T satisfies |wk1x|bk\left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}, and v=wwk1B2(0,rk)B1(0,ρk)v=w-w_{k-1}\in B_{2}(0,r_{k})\cap B_{1}(0,\rho_{k}). Hence vvvv^{\top} lies in k\mathcal{M}_{k}. This indicates that for any wWkw\in W_{k}, there exists an HkH\in\mathcal{M}_{k} such that

(wx)22[bk2+xHx].(w\cdot x)^{2}\leq 2\big{[}b_{k}^{2}+x^{\top}Hx\big{]}. (D.1)

Thus,

supwWk1|TC|xTC(wx)22bk2+2supHk1|TC|xTCxHx5C2(bk2+rk2),\sup_{w\in W_{k}}\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}(w\cdot x)^{2}\leq 2b_{k}^{2}+2\sup_{H\in\mathcal{M}_{k}}\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq 5C_{2}(b_{k}^{2}+r_{k}^{2}),

where the last inequality follows from the fact C22C_{2}\geq 2. ∎

Proposition 30 (Formal statement of Proposition 7).

Consider phase kk of Algorithm 1 for any 1kk01\leq k\leq k_{0}. Suppose that Assumptions 1 and 2 are satisfied, and that ηc5ϵ\eta\leq c_{5}\epsilon. With probability 1δk81-\frac{\delta_{k}}{8} (over the draw of T¯\bar{T}), Algorithm 2 will output a function q:T[0,1]q\mathrel{\mathop{\mathchar 58\relax}}T\rightarrow[0,1] with the following properties:

  1. for all xT,q(x)[0,1]x\in T,\ q(x)\in[0,1];

  2. 1|T|xTq(x)1ξk\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)\geq 1-\xi_{k};

  3. for all wWkw\in W_{k}, 1|T|xTq(x)(wx)25C2(bk2+rk2)\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)(w\cdot x)^{2}\leq 5C_{2}\mathinner{\left(b_{k}^{2}+r_{k}^{2}\right)}.

Furthermore, such function qq can be found in polynomial time.

Proof.

Our choice of nkn_{k} satisfies the condition nk6ξkln48δkn_{k}\geq\frac{6}{\xi_{k}}\ln\frac{48}{\delta_{k}} since ξk\xi_{k} is lower bounded by a constant (see Section 3.2 for our parameter setting). Thus, by Lemma 28, with probability 1δk241-\frac{\delta_{k}}{24}, |TC|(1ξk)nk\left\lvert T_{\mathrm{C}}\right\rvert\geq(1-\xi_{k})n_{k}. We henceforth condition on this event.

On the other hand, Lemma 3 and Proposition 25 together imply that with probability 1δk241-\frac{\delta_{k}}{24}, for all HkH\in\mathcal{M}_{k}, we have

1|TC|xTCxHx2C2(bk2+rk2)\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq 2C_{2}(b_{k}^{2}+r_{k}^{2}) (D.2)

provided that

|TC|=O~(s2log4dbk(logd+log21δk)).\left\lvert T_{\mathrm{C}}\right\rvert=\tilde{O}\mathinner{\Bigl{(}s^{2}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\Bigl{(}\log d+\log^{2}\frac{1}{\delta_{k}}\Bigr{)}}\Bigr{)}}. (D.3)

Note that (D.3) is satisfied in view of the aforementioned event |TC|(1ξk)nk\left\lvert T_{\mathrm{C}}\right\rvert\geq(1-\xi_{k})n_{k} along with the setting of nkn_{k} and ξk\xi_{k}. By union bound, the events (D.2) and |TC|(1ξk)|T|\left\lvert T_{\mathrm{C}}\right\rvert\geq(1-\xi_{k})\left\lvert T\right\rvert hold simultaneously with probability at least 1δk81-\frac{\delta_{k}}{8}.

Now we show that these two events together imply the existence of a function q(x)q(x) that is feasible for Algorithm 2. Consider the particular function q(x)q(x) with q(x)=0q(x)=0 for all xTDx\in T_{\mathrm{D}} and q(x)=1q(x)=1 for all xTCx\in T_{\mathrm{C}}. We immediately have

1|T|xTq(x)=|TC||T|1ξk.\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)=\frac{\left\lvert T_{\mathrm{C}}\right\rvert}{\left\lvert T\right\rvert}\geq 1-\xi_{k}.

In addition, for all HkH\in\mathcal{M}_{k},

1|T|xTq(x)xHx=1|T|xTCxHx1|TC|xTCxHx2C2(bk2+rk2),\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)x^{\top}Hx=\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}x^{\top}Hx\leq 2C_{2}(b_{k}^{2}+r_{k}^{2}), (D.4)

where the first inequality follows from the fact |T||TC|\left\lvert T\right\rvert\geq\left\lvert T_{\mathrm{C}}\right\rvert and the second inequality follows from (D.2). Namely, such a function q(x)q(x) satisfies all the constraints in Algorithm 2. Finally, since the output qq of Algorithm 2 satisfies the same constraint as verified in (D.4), combining it with (D.1) and using C22C_{2}\geq 2 gives Part 3.

It remains to show that for a given candidate function qq, a separation oracle for Algorithm 2 can be constructed in polynomial time. First, it is straightforward to check whether the first two constraints q(x)[0,1]q(x)\in[0,1] and xTq(x)(1ξ)|T|\sum_{x\in T}q(x)\geq(1-\xi)\left\lvert T\right\rvert are violated. If not, we just need to further check if there exists an HkH\in\mathcal{M}_{k} such that 1|T|xTq(x)xHx>2C2(bk2+rk2)\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)x^{\top}Hx>2C_{2}(b_{k}^{2}+r_{k}^{2}). To this end, we appeal to solving the following program:

maxHk1|T|xTq(x)xHx.\max_{H\in\mathcal{M}_{k}}\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)x^{\top}Hx.

This is a semidefinite program that can be solved in polynomial time [BV04]. If the maximum objective value is greater than 2C2(bk2+rk2)2C_{2}(b_{k}^{2}+r_{k}^{2}), then we conclude that qq is not feasible; otherwise we would have found a desired function. ∎
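For concreteness, a minimal sketch of the feasibility check just described is given below. It is an illustration rather than the paper's implementation: we assume, for this example only, that k\mathcal{M}_{k} is the relaxation \{H\succeq 0,\ \operatorname{tr}(H)\leq r_{k}^{2},\ \sum_{i,j}\lvert H_{ij}\rvert\leq\rho_{k}^{2}\} (which contains vvvv^{\top} for every vB2(0,rk)B1(0,ρk)v\in B_{2}(0,r_{k})\cap B_{1}(0,\rho_{k})); the actual constraint set is the one defined in (3.2). The sketch uses the cvxpy modeling package, and the argument threshold plays the role of 2C2(bk2+rk2)2C_{2}(b_{k}^{2}+r_{k}^{2}).

import numpy as np
import cvxpy as cp

def separation_oracle(X, q, xi, r_k, rho_k, threshold):
    """Check the constraints of Algorithm 2 for a candidate weighting q.

    X is the |T| x d matrix of instances in T.  Returns a string describing a
    violated constraint, or "feasible" if none is found.
    """
    n, d = X.shape
    if np.any(q < 0) or np.any(q > 1):
        return "violated: some q(x) lies outside [0, 1]"
    if q.sum() < (1 - xi) * n:
        return "violated: sum_x q(x) < (1 - xi)|T|"
    # Third family of constraints: for all H in M_k,
    # (1/|T|) sum_x q(x) x^T H x = trace(M H) <= threshold,
    # where M = (1/|T|) sum_x q(x) x x^T.  Maximize trace(M H) over H in M_k.
    M = (X.T * q) @ X / n
    H = cp.Variable((d, d), PSD=True)
    prob = cp.Problem(cp.Maximize(cp.trace(M @ H)),
                      [cp.trace(H) <= r_k ** 2,
                       cp.sum(cp.abs(H)) <= rho_k ** 2])
    prob.solve()
    if prob.value > threshold:
        return "violated: some H in M_k gives a large weighted quadratic form"
    return "feasible"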

The analysis of the following proposition closely follows [ABL17] with a refined treatment. Let τk(w;p):=xTp(x)τk(w;x,yx)\ell_{\tau_{k}}(w;p)\mathrel{\mathop{\mathchar 58\relax}}=\sum_{x\in T}p(x)\ell_{\tau_{k}}(w;x,y_{x}) where yxy_{x} is the unrevealed label of xx that the adversary has committed to.

Proposition 31 (Formal statement of Proposition 8).

Consider phase kk of Algorithm 1. Suppose that Assumptions 1 and 2 are satisfied. Assume that ηc5ϵ\eta\leq c_{5}\epsilon. Set NkN_{k} and ξk\xi_{k} as in Section 3.2. Denote zk:=bk2+rk2=c¯2+12k3z_{k}\mathrel{\mathop{\mathchar 58\relax}}=\sqrt{b_{k}^{2}+r_{k}^{2}}=\sqrt{\bar{c}^{2}+1}\cdot 2^{-k-3}. With probability 1δk41-\frac{\delta_{k}}{4} over the draw of T¯\bar{T}, for all wWkw\in W_{k},

τk(w;T~C)\displaystyle\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}}) τk(w;p)+2ξk(1+10C2zkτk)+10C2ξkzkτk,\displaystyle\leq\ell_{\tau_{k}}(w;p)+2\xi_{k}\left(1+\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}\right)+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}},
τk(w;p)\displaystyle\ell_{\tau_{k}}(w;p) τk(w;T~C)+2ξk+20C2ξkzkτk.\displaystyle\leq\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})+2\xi_{k}+\sqrt{20C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}.

In particular, with our hyper-parameter setting,

|τk(w;T~C)τk(w;p)|κ.\left\lvert\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})-\ell_{\tau_{k}}(w;p)\right\rvert\leq\kappa.
Proof.

The choice of nkn_{k} guarantees that Lemma 28 and Proposition 30 hold simultaneously with probability 1δk41-\frac{\delta_{k}}{4}. We thus have for all wWkw\in W_{k}

1|T|xTq(x)(wx)2\displaystyle\frac{1}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)(w\cdot x)^{2} 5C2zk2,\displaystyle\leq 5C_{2}z_{k}^{2}, (D.5)
1|TC|xTC(wx)2\displaystyle\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T_{\mathrm{C}}}(w\cdot x)^{2} 5C2zk2,\displaystyle\leq 5C_{2}z_{k}^{2}, (D.6)
|TD||T|\displaystyle\frac{\left\lvert T_{\mathrm{D}}\right\rvert}{\left\lvert T\right\rvert} ξk.\displaystyle\leq\xi_{k}. (D.7)

In the above expression, (D.5) follows from Part 3 of Proposition 30, (D.6) follows from Part 2 of Lemma 29, and (D.7) follows from Lemma 28. It follows from Eq. (D.7) and ξk1/2\xi_{k}\leq 1/2 that

|T||TC|=|T||T||TD|=11|TD|/|T|11ξk2.\frac{\left\lvert T\right\rvert}{\left\lvert T_{\mathrm{C}}\right\rvert}=\frac{\left\lvert T\right\rvert}{\left\lvert T\right\rvert-\left\lvert T_{\mathrm{D}}\right\rvert}=\frac{1}{1-\left\lvert T_{\mathrm{D}}\right\rvert/\left\lvert T\right\rvert}\leq\frac{1}{1-\xi_{k}}\leq 2. (D.8)

In the following, we condition on the event that all these inequalities are satisfied.

Step 1. First we upper bound τk(w;T~C)\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}}) by τk(w;p)\ell_{\tau_{k}}(w;p).

|TC|τk(w;T~C)\displaystyle\left\lvert T_{\mathrm{C}}\right\rvert\cdot\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}}) =xTC(w;x,yx)\displaystyle=\sum_{x\in T_{\mathrm{C}}}\ell(w;x,y_{x})
=xT[q(x)(w;x,yx)+(𝟏{xTC}q(x))(w;x,yx)]\displaystyle=\sum_{x\in T}\mathinner{\left[q(x)\ell(w;x,y_{x})+\big{(}\boldsymbol{1}_{\mathinner{\left\{x\in T_{\mathrm{C}}\right\}}}-q(x)\big{)}\ell(w;x,y_{x})\right]}
ζ1xTq(x)(w;x,yx)+xTC(1q(x))(w;x,yx)\displaystyle\stackrel{{\scriptstyle\zeta_{1}}}{{\leq}}\sum_{x\in T}q(x)\ell(w;x,y_{x})+\sum_{x\in T_{\mathrm{C}}}(1-q(x))\ell(w;x,y_{x})
ζ2xTq(x)(w;x,yx)+xTC(1q(x))(1+|wx|τk)\displaystyle\stackrel{{\scriptstyle\zeta_{2}}}{{\leq}}\sum_{x\in T}q(x)\ell(w;x,y_{x})+\sum_{x\in T_{\mathrm{C}}}(1-q(x))\mathinner{\left(1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}}\right)}
ζ3xTq(x)(w;x,yx)+ξk|T|+1τkxTC(1q(x))|wx|\displaystyle\stackrel{{\scriptstyle\zeta_{3}}}{{\leq}}\sum_{x\in T}q(x)\ell(w;x,y_{x})+\xi_{k}\left\lvert T\right\rvert+\frac{1}{\tau_{k}}\sum_{x\in T_{\mathrm{C}}}(1-q(x))\left\lvert w\cdot x\right\rvert
ζ4xTq(x)(w;x,yx)+ξk|T|+1τkxTC(1q(x))2xTC(wx)2\displaystyle\stackrel{{\scriptstyle\zeta_{4}}}{{\leq}}\sum_{x\in T}q(x)\ell(w;x,y_{x})+\xi_{k}\left\lvert T\right\rvert+\frac{1}{\tau_{k}}\sqrt{\sum_{x\in T_{\mathrm{C}}}(1-q(x))^{2}}\cdot\sqrt{\sum_{x\in T_{\mathrm{C}}}(w\cdot x)^{2}}
ζ5xTq(x)(w;x,yx)+ξk|T|+1τkξk|T|5C2|TC|zk,\displaystyle\stackrel{{\scriptstyle\zeta_{5}}}{{\leq}}\sum_{x\in T}q(x)\ell(w;x,y_{x})+\xi_{k}\left\lvert T\right\rvert+\frac{1}{\tau_{k}}\sqrt{\xi_{k}\left\lvert T\right\rvert}\cdot\sqrt{5C_{2}\left\lvert T_{\mathrm{C}}\right\rvert}\cdot{z_{k}}, (D.9)

where ζ1\zeta_{1} follows from the simple fact that

xT(𝟏{xTC}q(x))(w;x,yx)\displaystyle\sum_{x\in T}\mathinner{\bigl{(}\boldsymbol{1}_{\mathinner{\left\{x\in T_{\mathrm{C}}\right\}}}-q(x)\bigr{)}}\ell(w;x,y_{x}) =xTC(1q(x))(w;x,yx)+xTD(q(x))(w;x,yx)\displaystyle=\sum_{x\in T_{\mathrm{C}}}(1-q(x))\ell(w;x,y_{x})+\sum_{x\in T_{\mathrm{D}}}(-q(x))\ell(w;x,y_{x})
xTC(1q(x))(w;x,yx),\displaystyle\leq\sum_{x\in T_{\mathrm{C}}}(1-q(x))\ell(w;x,y_{x}),

ζ2\zeta_{2} uses the fact that the hinge loss is always upper bounded by 1+|wx|τk1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}} and that 1q(x)01-q(x)\geq 0, ζ3\zeta_{3} follows from Part 2 of Proposition 30, ζ4\zeta_{4} applies the Cauchy-Schwarz inequality, and ζ5\zeta_{5} uses the fact that (1-q(x))^{2}\leq 1-q(x) together with Part 2 of Proposition 30 and Eq. (D.6).

In view of Eq. (D.8), we have |T||TC|2\frac{\left\lvert T\right\rvert}{\left\lvert T_{\mathrm{C}}\right\rvert}\leq 2. Continuing Eq. (D.9), we obtain

τk(w;T~C)\displaystyle\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}}) 1|TC|xTq(x)(w;x,yx)+2ξk+10C2ξkzkτk\displaystyle\leq\frac{1}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T}q(x)\ell(w;x,y_{x})+2\xi_{k}+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}
=xTq(x)|TC|xTp(x)(w;x,yx)+2ξk+10C2ξkzkτk\displaystyle=\frac{\sum_{x\in T}q(x)}{\left\lvert T_{\mathrm{C}}\right\rvert}\sum_{x\in T}p(x)\ell(w;x,y_{x})+2\xi_{k}+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}
=τk(w;p)+(xTq(x)|TC|1)xTp(x)(w;x,yx)+2ξk+10C2ξkzkτk\displaystyle=\ell_{\tau_{k}}(w;p)+\mathinner{\left(\frac{\sum_{x\in T}q(x)}{\left\lvert T_{\mathrm{C}}\right\rvert}-1\right)}\sum_{x\in T}p(x)\ell(w;x,y_{x})+2\xi_{k}+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}
τk(w;p)+(|T||TC|1)xTp(x)(w;x,yx)+2ξk+10C2ξkzkτk\displaystyle\leq\ell_{\tau_{k}}(w;p)+\mathinner{\left(\frac{\left\lvert T\right\rvert}{\left\lvert T_{\mathrm{C}}\right\rvert}-1\right)}\sum_{x\in T}p(x)\ell(w;x,y_{x})+2\xi_{k}+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}
τk(w;p)+2ξkxTp(x)(w;x,yx)+2ξk+10C2ξkzkτk,\displaystyle\leq\ell_{\tau_{k}}(w;p)+2\xi_{k}\sum_{x\in T}p(x)\ell(w;x,y_{x})+2\xi_{k}+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}, (D.10)

where in the last inequality we use |T|/|TC|1=|TD|/|T|1|TD|/|T|2|TD|/|T|\left\lvert T\right\rvert/\left\lvert T_{\mathrm{C}}\right\rvert-1=\frac{\left\lvert T_{\mathrm{D}}\right\rvert/\left\lvert T\right\rvert}{1-\left\lvert T_{\mathrm{D}}\right\rvert/\left\lvert T\right\rvert}\leq 2\left\lvert T_{\mathrm{D}}\right\rvert/\left\lvert T\right\rvert. On the other hand, we have the following result which will be proved later on.

Claim D.1.

xTp(x)(w;x,yx)1+10C2zkτk.\sum_{x\in T}p(x)\ell(w;x,y_{x})\leq 1+\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}.

Therefore, continuing Eq. (D.10) we have

τk(w;T~C)τk(w;p)+2ξk(1+10C2zkτk)+10C2ξkzkτk.\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})\leq\ell_{\tau_{k}}(w;p)+2\xi_{k}\left(1+\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}\right)+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}.

This proves the first inequality of the proposition.

Step 2. We move on to prove the second inequality of the proposition, i.e. we upper bound τk(w;p)\ell_{\tau_{k}}(w;p) in terms of τk(w;T~C)\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}}). Let us denote by pD=xTDp(x)p_{\mathrm{D}}=\sum_{x\in T_{\mathrm{D}}}p(x) the probability mass on dirty instances. Then

pD=xTDq(x)xTq(x)|TD|(1ξk)|T|ξk1ξk2ξk,p_{\mathrm{D}}=\frac{\sum_{x\in T_{\mathrm{D}}}q(x)}{\sum_{x\in T}q(x)}\leq\frac{\left\lvert T_{\mathrm{D}}\right\rvert}{(1-\xi_{k})\left\lvert T\right\rvert}\leq\frac{\xi_{k}}{1-\xi_{k}}\leq 2\xi_{k}, (D.11)

where the first inequality follows from q(x)1q(x)\leq 1 and Part 2 of Proposition 30, the second inequality follows from (D.7), and the last inequality is by our choice ξk1/2\xi_{k}\leq 1/2.

Note that by Part 2 of Proposition 30 and the choice ξk1/2\xi_{k}\leq 1/2, we have xTq(x)(1ξk)|T||T|/2\sum_{x\in T}q(x)\geq(1-\xi_{k})\left\lvert T\right\rvert\geq\left\lvert T\right\rvert/2. Hence

xTp(x)(wx)2=1xTq(x)xTq(x)(wx)22|T|xTq(x)(wx)210C2zk2\sum_{x\in T}p(x)(w\cdot x)^{2}=\frac{1}{\sum_{x\in T}q(x)}\sum_{x\in T}q(x)(w\cdot x)^{2}\leq\frac{2}{\left\lvert T\right\rvert}\sum_{x\in T}q(x)(w\cdot x)^{2}\leq 10C_{2}z_{k}^{2} (D.12)

where the last inequality holds because of (D.5). Thus,

xTDp(x)(w;x,yx)\displaystyle\sum_{x\in T_{\mathrm{D}}}p(x)\ell(w;x,y_{x}) xTDp(x)(1+|wx|τk)\displaystyle\leq\sum_{x\in T_{\mathrm{D}}}p(x)\mathinner{\left(1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}}\right)}
=pD+1τkxTDp(x)|wx|\displaystyle=p_{\mathrm{D}}+\frac{1}{\tau_{k}}\sum_{x\in T_{\mathrm{D}}}p(x)\left\lvert w\cdot x\right\rvert
=pD+1τkxT(𝟏{xTD}p(x))(p(x)|wx|)\displaystyle=p_{\mathrm{D}}+\frac{1}{\tau_{k}}\sum_{x\in T}\mathinner{\left(\boldsymbol{1}_{\mathinner{\left\{x\in T_{\mathrm{D}}\right\}}}\sqrt{p(x)}\right)}\cdot\mathinner{\left(\sqrt{p(x)}\left\lvert w\cdot x\right\rvert\right)}
pD+1τkxT𝟏{xTD}p(x)xTp(x)(wx)2\displaystyle\leq p_{\mathrm{D}}+\frac{1}{\tau_{k}}\sqrt{\sum_{x\in T}\boldsymbol{1}_{\mathinner{\left\{x\in T_{\mathrm{D}}\right\}}}{p(x)}}\cdot\sqrt{\sum_{x\in T}p(x)(w\cdot x)^{2}}
(D.12)pD+pD10C2zkτk.\displaystyle\stackrel{{\scriptstyle\eqref{eq:tmp:sum p(x)(wx)^2}}}{{\leq}}p_{\mathrm{D}}+\sqrt{p_{\mathrm{D}}}\cdot\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}.

With this bound at hand, we bound τk(w;p)\ell_{\tau_{k}}(w;p) as follows:

τk(w;p)\displaystyle\ell_{\tau_{k}}(w;p) =xTCp(x)(w;x,yx)+xTDp(x)(w;x,yx)\displaystyle=\sum_{x\in T_{\mathrm{C}}}p(x)\ell(w;x,y_{x})+\sum_{x\in T_{\mathrm{D}}}p(x)\ell(w;x,y_{x})
xTC(w;x,yx)+xTDp(x)(w;x,yx)\displaystyle\leq\sum_{x\in T_{\mathrm{C}}}\ell(w;x,y_{x})+\sum_{x\in T_{\mathrm{D}}}p(x)\ell(w;x,y_{x})
=τk(w;T~C)+xTDp(x)(w;x,yx)\displaystyle=\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})+\sum_{x\in T_{\mathrm{D}}}p(x)\ell(w;x,y_{x})
τk(w;T~C)+pD+pD10C2zkτk\displaystyle\leq\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})+p_{\mathrm{D}}+\sqrt{p_{\mathrm{D}}}\cdot\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}
(D.11)τk(w;T~C)+2ξk+20C2ξkzkτk,\displaystyle\stackrel{{\scriptstyle\eqref{eq:tmp:p_D}}}{{\leq}}\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})+2\xi_{k}+\sqrt{20C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}},

which proves the second inequality of the proposition.

Putting everything together. We would like to show |τk(w;p)τk(w;T~C)|κ\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})\right\rvert\leq\kappa. Indeed, this is guaranteed by our setting of ξk\xi_{k} in Section 3.2, which ensures that ξk\xi_{k} simultaneously fulfills the following three constraints:

2ξk(1+10C2zkτk)+10C2ξkzkτkκ,\displaystyle 2\xi_{k}\left(1+\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}}\right)+\sqrt{10C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}\leq\kappa,
2ξk+20C2ξkzkτkκ,andξk12.\displaystyle 2\xi_{k}+\sqrt{20C_{2}\xi_{k}}\cdot\frac{z_{k}}{\tau_{k}}\leq\kappa,\quad\text{and}\quad\xi_{k}\leq\frac{1}{2}.

This completes the proof. ∎

Proof of Claim D.1.

Since (w;x,yx)1+|wx|τk\ell(w;x,y_{x})\leq 1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}}, it follows that

xTp(x)(w;x,yx)\displaystyle\sum_{x\in T}p(x)\ell(w;x,y_{x}) xTp(x)(1+|wx|τk)\displaystyle\leq\sum_{x\in T}p(x)\mathinner{\left(1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}}\right)}
=1+1τkxTp(x)|wx|\displaystyle=1+\frac{1}{\tau_{k}}\sum_{x\in T}p(x)\left\lvert w\cdot x\right\rvert
1+1τkxTp(x)(wx)2\displaystyle\leq 1+\frac{1}{\tau_{k}}\sqrt{\sum_{x\in T}p(x)(w\cdot x)^{2}}
(D.12)1+10C2zkτk,\displaystyle\stackrel{{\scriptstyle\eqref{eq:tmp:sum p(x)(wx)^2}}}{{\leq}}1+\sqrt{10C_{2}}\cdot\frac{z_{k}}{\tau_{k}},

which completes the proof of Claim D.1. ∎
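Proposition 31 compares two averages of the same truncated hinge loss: the qq-reweighted average over TT and the uniform average over the clean part TCT_{\mathrm{C}}. A minimal sketch of both quantities is given below, assuming the loss \ell_{\tau}(w;x,y)=\max\{0,1-y(w\cdot x)/\tau\} (consistent with the bound \ell_{\tau}\leq 1+\lvert w\cdot x\rvert/\tau used above) and hypothetical inputs (X, y, clean, q).

import numpy as np

def hinge_loss(w, X, y, tau):
    """Truncated-hinge values ell_tau(w; x, y) = max(0, 1 - y (w.x) / tau)."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

def loss_weighted(w, X, y, q, tau):
    """ell_tau(w; p) with p(x) = q(x) / sum_x q(x)."""
    p = q / q.sum()
    return np.sum(p * hinge_loss(w, X, y, tau))

def loss_clean_average(w, X, y, clean, tau):
    """ell_tau(w; T~_C): uniform average of the loss over the clean instances."""
    return hinge_loss(w, X[clean], y[clean], tau).mean()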

The following result is a simple application of Proposition 24. It shows that the loss evaluated on clean instances concentrates around the expected loss.

Proposition 32 (Restatement of Proposition 9).

Consider phase kk of Algorithm 1. Suppose that Assumptions 1 and 2 are satisfied, and assume ηc5ϵ\eta\leq c_{5}\epsilon. Then with probability 1δk41-\frac{\delta_{k}}{4} over the draw of T¯\bar{T}, for all wWkw\in W_{k} we have

|Lτk(w)τk(w;T~C)|κ.\left\lvert L_{\tau_{k}}(w)-\ell_{\tau_{k}}(w;\tilde{T}_{\mathrm{C}})\right\rvert\leq\kappa.

where Lτk(w):=𝔼xDwk1,bk[τk(w;x,sign(wx))]L_{\tau_{k}}(w)\mathrel{\mathop{\mathchar 58\relax}}=\operatorname{\mathbb{E}}_{x\sim D_{w_{k-1},b_{k}}}\mathinner{\left[\ell_{\tau_{k}}(w;x,\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)})\right]}.

Proof.

The choice of nkn_{k}, i.e. the size of T¯\bar{T}, ensures that with probability 1δk81-\frac{\delta_{k}}{8}, |TC|\left\lvert T_{\mathrm{C}}\right\rvert is at least ζlogζ\zeta\log\zeta where ζ=Kslog2dbklogdδk\zeta=K\cdot s\log^{2}\frac{d}{b_{k}}\cdot\log\frac{d}{\delta_{k}} for some constant K>0K>0, in view of Lemma 28. This observation, combined with Proposition 24 and a union bound, immediately gives the desired result. ∎

D.4 Analysis of Random Sampling

Proposition 33 (Restatement of Proposition 10).

Consider phase kk of Algorithm 1. Suppose that Assumptions 1 and 2 are satisfied, and assume ηc5ϵ\eta\leq c_{5}\epsilon. Set nkn_{k} and mkm_{k} as in Section 3.2. Then with probability 1δk41-\frac{\delta_{k}}{4} over the draw of SkS_{k}, for all wWkw\in W_{k} we have

|τk(w;p)τk(w;Sk)|κ.\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq\kappa.
Proof.

Since we applied pruning to remove all instances with large \ell_{\infty}-norm, this proposition can be proved by a standard concentration argument for uniform convergence of linear classes under distributions with \ell_{\infty}-bounded support. We include the proof for completeness.

Note that the randomness is taken over the i.i.d. draw of mkm_{k} samples from TT according to the distribution pp over TT. Thus, for any (x,y)Sk(x,y)\in S_{k}, 𝔼[τk(w;x,y)]=τk(w;p)\operatorname{\mathbb{E}}[\ell_{\tau_{k}}(w;x,y)]=\ell_{\tau_{k}}(w;p). Moreover, let Rk=maxxTxR_{k}=\max_{x\in T}\left\lVert x\right\rVert_{\infty}. Any instance xx drawn from TT satisfies xRk\left\lVert x\right\rVert_{\infty}\leq R_{k} with probability 11. It is also easy to verify that

\ell_{\tau_{k}}(w;x,y)\leq 1+\frac{\left\lvert w\cdot x\right\rvert}{\tau_{k}}\leq 1+\frac{\left\lvert(w-w_{k-1})\cdot x\right\rvert}{\tau_{k}}+\frac{\left\lvert w_{k-1}\cdot x\right\rvert}{\tau_{k}}\leq 1+\frac{\rho_{k}R_{k}}{\tau_{k}}+\frac{b_{k}}{\tau_{k}},

where the last inequality uses Hölder's inequality, \left\lvert(w-w_{k-1})\cdot x\right\rvert\leq\left\lVert w-w_{k-1}\right\rVert_{1}\left\lVert x\right\rVert_{\infty}\leq\rho_{k}R_{k}, together with the fact that every xTx\in T satisfies |wk1x|bk\left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}.

By Theorem 8 of [BM02] along with standard symmetrization arguments, we have that with probability at least 1δk41-\frac{\delta_{k}}{4},

|τk(w;p)τk(w;Sk)|(1+ρkRkτk+bkτk)ln(4/δk)2mk+(;Sk)\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq\mathinner{\left(1+\frac{\rho_{k}R_{k}}{\tau_{k}}+\frac{b_{k}}{\tau_{k}}\right)}\sqrt{\frac{\ln(4/\delta_{k})}{2m_{k}}}+\mathcal{R}(\mathcal{F};S_{k}) (D.13)

where (;Sk)\mathcal{R}(\mathcal{F};S_{k}) denotes the Rademacher complexity of function class \mathcal{F} on the labeled set SkS_{k}, and :={τk(w;x,y):wWk}\mathcal{F}\mathrel{\mathop{\mathchar 58\relax}}=\mathinner{\left\{\ell_{\tau_{k}}(w;x,y)\mathrel{\mathop{\mathchar 58\relax}}w\in W_{k}\right\}}. In order to calculate (;Sk)\mathcal{R}(\mathcal{F};S_{k}), we observe that each function τk(w;x,y)\ell_{\tau_{k}}(w;x,y) is a composition of ϕ(a)=max{0,11τkya}\phi(a)=\max\mathinner{\left\{0,1-\frac{1}{\tau_{k}}ya\right\}} and function class 𝒢:={xwx:wWk}\mathcal{G}\mathrel{\mathop{\mathchar 58\relax}}=\{x\mapsto w\cdot x\mathrel{\mathop{\mathchar 58\relax}}w\in W_{k}\}. Since ϕ(a)\phi(a) is 1τk\frac{1}{\tau_{k}}-Lipschitz, by contraction property of Rademacher complexity, we have

(;Sk)1τk(𝒢;Sk).\mathcal{R}(\mathcal{F};S_{k})\leq\frac{1}{\tau_{k}}\mathcal{R}(\mathcal{G};S_{k}). (D.14)

Let σ={σ1,,σmk}\sigma=\{\sigma_{1},\dots,\sigma_{m_{k}}\} where the σi\sigma_{i}’s are i.i.d. draws from the Rademacher distribution, and let Vk=B2(0,rk)B1(0,ρk)V_{k}=B_{2}(0,r_{k})\cap B_{1}(0,\rho_{k}). We bound (𝒢;Sk)\mathcal{R}(\mathcal{G};S_{k}) as follows:

(𝒢;Sk)\displaystyle\mathcal{R}(\mathcal{G};S_{k}) =1mk𝔼σ[supwWkw(i=1mkσixi)]\displaystyle=\frac{1}{m_{k}}\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}\sup_{w\in W_{k}}w\cdot\mathinner{\biggl{(}\sum_{i=1}^{m_{k}}\sigma_{i}x_{i}\biggr{)}}\biggr{]}}
=1mk𝔼σ[wk1(i=1mkσixi)]+1mk𝔼σ[supwWk(wwk1)(i=1mkσixi)]\displaystyle=\frac{1}{m_{k}}\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}w_{k-1}\cdot\mathinner{\biggl{(}\sum_{i=1}^{m_{k}}\sigma_{i}x_{i}\biggr{)}}\biggr{]}}+\frac{1}{m_{k}}\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}\sup_{w\in W_{k}}(w-w_{k-1})\cdot\mathinner{\biggl{(}\sum_{i=1}^{m_{k}}\sigma_{i}x_{i}\biggr{)}}\biggr{]}}
=1mk𝔼σ[supvVkv(i=1mkσixi)]\displaystyle=\frac{1}{m_{k}}\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}\sup_{v\in V_{k}}v\cdot\mathinner{\biggl{(}\sum_{i=1}^{m_{k}}\sigma_{i}x_{i}\biggr{)}}\biggr{]}}
ρkRk2log(2d)mk,\displaystyle\leq\rho_{k}R_{k}\sqrt{\frac{2\log(2d)}{m_{k}}},

where the first equality is by the definition of Rademacher complexity, the second equality simply decomposes ww as the sum of wk1w_{k-1} and wwk1w-w_{k-1}, the third equality is by the fact that every σi\sigma_{i} has zero mean, and the inequality applies Lemma 39 together with the inclusion VkB1(0,ρk)V_{k}\subseteq B_{1}(0,\rho_{k}). We combine the above result with (D.13) and (D.14), and obtain that with probability 1δk41-\frac{\delta_{k}}{4},

|τk(w;p)τk(w;Sk)|(1+ρkRkτk+bkτk)ln(4/δk)mk+ρkRkτk2log(2d)mk.\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq\mathinner{\left(1+\frac{\rho_{k}R_{k}}{\tau_{k}}+\frac{b_{k}}{\tau_{k}}\right)}\sqrt{\frac{\ln(4/\delta_{k})}{m_{k}}}+\frac{\rho_{k}R_{k}}{\tau_{k}}\sqrt{\frac{2\log(2d)}{m_{k}}}. (D.15)

Recall that we remove all instances with large \ell_{\infty}-norm in the pruning step of Algorithm 1. In particular, we have

Rkc9log48nkdbkδk.R_{k}\leq c_{9}\log\frac{48n_{k}d}{b_{k}\delta_{k}}.

Plugging this upper bound into (D.15) and using our hyper-parameter setting gives

|τk(w;p)τk(w;Sk)|K1slognkdbkδk(log(1/δk)mk+logdmk)\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq K_{1}\cdot\sqrt{s}\log\frac{n_{k}d}{b_{k}\delta_{k}}\mathinner{\left(\sqrt{\frac{\log(1/\delta_{k})}{m_{k}}}+\sqrt{\frac{\log d}{m_{k}}}\right)}

for some constant K1>0K_{1}>0. Hence,

mk=O(slog2nkdbkδklogdδk)=O~(slog2dbkδklogdδk)m_{k}={O}\mathinner{\left(s\log^{2}\frac{n_{k}d}{b_{k}\delta_{k}}\cdot\log\frac{d}{\delta_{k}}\right)}=\tilde{O}\mathinner{\left(s\log^{2}\frac{d}{b_{k}\delta_{k}}\cdot\log\frac{d}{\delta_{k}}\right)}

suffices to ensure |τk(w;p)τk(w;Sk)|κ\left\lvert\ell_{\tau_{k}}(w;p)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq\kappa with probability 1δk41-\frac{\delta_{k}}{4}. ∎

D.5 Analysis of Per-Phase Progress

Let Lτk(w)=𝔼xDwk1,bk[τk(w;x,sign(wx))]L_{\tau_{k}}(w)=\operatorname{\mathbb{E}}_{x\sim D_{w_{k-1},b_{k}}}\mathinner{\left[\ell_{\tau_{k}}(w;x,\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)})\right]}.

Lemma 34 (Lemma 3.7 of [ABL17]).

Suppose Assumption 1 is satisfied. Then

Lτk(w)τkc0min{bk,1/9}.L_{\tau_{k}}(w^{*})\leq\frac{\tau_{k}}{c_{0}\min\{b_{k},1/9\}}.

In particular, by our choice of τk\tau_{k},

Lτk(w)κ.L_{\tau_{k}}(w^{*})\leq\kappa.
Lemma 35.

For any 1kk01\leq k\leq k_{0}, if wWkw^{*}\in W_{k}, then with probability 1δk1-\delta_{k}, errDwk1,bk(vk)8κ\operatorname{err}_{D_{w_{k-1},b_{k}}}(v_{k})\leq 8\kappa.

Proof.

Observe that with the setting of NkN_{k}, with probability 1δk1-\delta_{k} over all the randomness in phase kk, Lemma 26, Proposition 31, Proposition 32, and Proposition 33 hold simultaneously. Now we condition on the event that all of these properties are satisfied, which implies that for all wWkw\in W_{k},

|Lτk(w)τk(w;Sk)|3κ.\left\lvert L_{\tau_{k}}(w)-\ell_{\tau_{k}}(w;S_{k})\right\rvert\leq 3\kappa. (D.16)

We have

errDwk1,bk(vk)Lτk(vk)ζ1τk(vk;Sk)+3κζ2minwWkτk(w;Sk)+4κ\displaystyle\operatorname{err}_{D_{w_{k-1},b_{k}}}(v_{k})\leq L_{\tau_{k}}(v_{k})\stackrel{{\scriptstyle\zeta_{1}}}{{\leq}}\ell_{\tau_{k}}(v_{k};S_{k})+3\kappa\stackrel{{\scriptstyle\zeta_{2}}}{{\leq}}\min_{w\in W_{k}}\ell_{\tau_{k}}(w;S_{k})+4\kappa ζ3τk(w;Sk)+4κ\displaystyle\stackrel{{\scriptstyle\zeta_{3}}}{{\leq}}\ell_{\tau_{k}}(w^{*};S_{k})+4\kappa
Lτk(w)+7κ.\displaystyle\leq L_{\tau_{k}}(w^{*})+7\kappa.

In the above, the first inequality follows from the fact that the hinge loss upper bounds the 0/1 loss, ζ1\zeta_{1} and the last inequality apply (D.16) (the latter to ww^{*}, which lies in WkW_{k} by assumption), ζ2\zeta_{2} is by the definition of vkv_{k} (see Algorithm 1), and ζ3\zeta_{3} is by our assumption that ww^{*} is feasible. The proof is complete in view of Lemma 34. ∎

Lemma 36.

For any 1kk01\leq k\leq k_{0}, if wWkw^{*}\in W_{k}, then with probability 1δk1-\delta_{k}, θ(vk,w)2k8π\theta(v_{k},w^{*})\leq 2^{-k-8}\pi.

Proof.

For k=1k=1, by Lemma 35 and the fact that in the first phase we sample directly from DD, we have

PrxD(sign(v1x)sign(wx))8κ.\operatorname{Pr}_{x\sim D}\left(\operatorname{sign}\mathinner{\left(v_{1}\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)}\right)\leq 8\kappa.

Hence Part 4 of Lemma 12 indicates that

θ(v1,w)8c2κ=16c2κ21.\theta(v_{1},w^{*})\leq 8c_{2}\kappa=16c_{2}\kappa\cdot 2^{-1}. (D.17)

Now we consider 2kk02\leq k\leq k_{0}. Denote Xk={x:|wk1x|bk}X_{k}=\{x\mathrel{\mathop{\mathchar 58\relax}}\left\lvert w_{k-1}\cdot x\right\rvert\leq b_{k}\}, and X¯k={x:|wk1x|>bk}\bar{X}_{k}=\{x\mathrel{\mathop{\mathchar 58\relax}}\left\lvert w_{k-1}\cdot x\right\rvert>b_{k}\}. We will show that the error of vkv_{k} on both XkX_{k} and X¯k\bar{X}_{k} is small, hence vkv_{k} is a good approximation to ww^{*}.

First, we consider the error on XkX_{k}, which is given by

PrxD(sign(vkx)sign(wx),xXk)\displaystyle\ \operatorname{Pr}_{x\sim D}\left(\operatorname{sign}\mathinner{\left(v_{k}\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)},x\in X_{k}\right)
=\displaystyle= PrxD(sign(vkx)sign(wx)xXk)PrxD(xXk)\displaystyle\ \operatorname{Pr}_{x\sim D}\left(\operatorname{sign}\mathinner{\left(v_{k}\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)}\mid x\in X_{k}\right)\cdot\operatorname{Pr}_{x\sim D}(x\in X_{k})
=\displaystyle= errDwk1,bk(vk)PrxD(xXk)\displaystyle\ \operatorname{err}_{D_{w_{k-1},b_{k}}}(v_{k})\cdot\operatorname{Pr}_{x\sim D}(x\in X_{k})
\displaystyle\leq 8κ2bk\displaystyle\ 8\kappa\cdot 2b_{k}
=\displaystyle= 16κbk,\displaystyle\ 16\kappa b_{k}, (D.18)

where the inequality is due to Lemma 35 and Lemma 17. Note that the inequality holds with probability 1δk1-\delta_{k}.

Next we derive the error on X¯k\bar{X}_{k}. Note that Lemma 10 of [Zha18] states that for any unit vector uu and any vector vv, θ(v,u)πvu2\theta(v,u)\leq\pi\left\lVert v-u\right\rVert_{2}. Hence,

θ(vk,w)πvkw2π(vkwk12+wwk12)2πrk.\displaystyle\theta(v_{k},w^{*})\leq\pi\left\lVert v_{k}-w^{*}\right\rVert_{2}\leq\pi(\left\lVert v_{k}-w_{k-1}\right\rVert_{2}+\left\lVert w^{*}-w_{k-1}\right\rVert_{2})\leq 2\pi r_{k}.

Recall that we set rk=2k3<1/4r_{k}=2^{-k-3}<1/4 in our algorithm and choose bk=c¯rkb_{k}=\bar{c}\cdot r_{k} where c¯8π/c4\bar{c}\geq 8\pi/c_{4}, which allows us to apply Lemma 13 and obtain

PrxD(sign(vkx)sign(wx),xXk)\displaystyle\operatorname{Pr}_{x\sim D}\left(\operatorname{sign}\mathinner{\left(v_{k}\cdot x\right)}\neq\operatorname{sign}\mathinner{\left(w^{*}\cdot x\right)},x\notin X_{k}\right) c32πrkexp(c4c¯rk22πrk)\displaystyle\leq c_{3}\cdot 2\pi r_{k}\cdot\exp\mathinner{\left(-\frac{c_{4}\bar{c}\cdot r_{k}}{2\cdot 2\pi r_{k}}\right)}
=2kc3π4exp(c4c¯4π).\displaystyle=2^{-k}\cdot\frac{c_{3}\pi}{4}\exp\mathinner{\left(-\frac{c_{4}\bar{c}}{4\pi}\right)}.

Combining this with (D.18) gives

errD(vk)16κc¯rk+2kc3π4exp(c4c¯4π)=(2κc¯+c3π4exp(c4c¯4π))2k.\operatorname{err}_{D}(v_{k})\leq 16\kappa\cdot\bar{c}\cdot r_{k}+2^{-k}\cdot\frac{c_{3}\pi}{4}\exp\mathinner{\left(-\frac{c_{4}\bar{c}}{4\pi}\right)}=\mathinner{\left(2\kappa\bar{c}+\frac{c_{3}\pi}{4}\exp\mathinner{\left(-\frac{c_{4}\bar{c}}{4\pi}\right)}\right)}\cdot 2^{-k}.

Recall that we set κ=exp(c¯)\kappa=\exp(-\bar{c}) and denote by f(c¯)f(\bar{c}) the coefficient of 2k2^{-k} in the above expression. By Part 4 of Lemma 12

θ(vk,w)c2errD(vk)c2f(c¯)2k.\theta(v_{k},w^{*})\leq c_{2}\operatorname{err}_{D}(v_{k})\leq c_{2}f(\bar{c})\cdot 2^{-k}. (D.19)

Now let g(c¯)=c2f(c¯)+16c2exp(c¯)g(\bar{c})=c_{2}f(\bar{c})+16c_{2}\exp(-\bar{c}). By our choice of c¯\bar{c}, g(c¯)28πg(\bar{c})\leq 2^{-8}\pi. This ensures that for both (D.17) and (D.19), θ(vk,w)2k8π\theta(v_{k},w^{*})\leq 2^{-k-8}\pi for any k1k\geq 1. ∎

Lemma 37.

For any 1kk01\leq k\leq k_{0}, if θ(vk,w)2k8π\theta(v_{k},w^{*})\leq 2^{-k-8}\pi, then wWk+1w^{*}\in W_{k+1}.

Proof.

We first show that wkw2rk+1\left\lVert w_{k}-w^{*}\right\rVert_{2}\leq r_{k+1}. Let v^k=vk/vk2\hat{v}_{k}=v_{k}/\left\lVert v_{k}\right\rVert_{2}. By algebra v^kw2=2sinθ(vk,w)2θ(vk,w)2k8π2k6\left\lVert\hat{v}_{k}-w^{*}\right\rVert_{2}=2\sin\frac{\theta(v_{k},w^{*})}{2}\leq\theta(v_{k},w^{*})\leq 2^{-k-8}\pi\leq 2^{-k-6}. Now we have

wkw2\displaystyle\left\lVert w_{k}-w^{*}\right\rVert_{2} =s(vk)/s(vk)2w2\displaystyle=\left\lVert\mathcal{H}_{s}(v_{k})/\left\lVert\mathcal{H}_{s}(v_{k})\right\rVert_{2}-w^{*}\right\rVert_{2}
=s(v^k)/s(v^k)2w2\displaystyle=\left\lVert\mathcal{H}_{s}(\hat{v}_{k})/\left\lVert\mathcal{H}_{s}(\hat{v}_{k})\right\rVert_{2}-w^{*}\right\rVert_{2}
2s(v^k)w2\displaystyle\leq 2\left\lVert\mathcal{H}_{s}(\hat{v}_{k})-w^{*}\right\rVert_{2}
4v^kw2\displaystyle\leq 4\left\lVert\hat{v}_{k}-w^{*}\right\rVert_{2}
2k4\displaystyle\leq 2^{-k-4}
=rk+1.\displaystyle=r_{k+1}.

By the sparsity of wkw_{k} and ww^{*}, and our choice ρk+1=2srk+1\rho_{k+1}=\sqrt{2s}r_{k+1}, we always have

wkw12swkw22srk+1=ρk+1.\left\lVert w_{k}-w^{*}\right\rVert_{1}\leq\sqrt{2s}\left\lVert w_{k}-w^{*}\right\rVert_{2}\leq\sqrt{2s}r_{k+1}=\rho_{k+1}.

The proof is complete. ∎
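The update analyzed in Lemma 37 is a hard-thresholding step followed by a renormalization. A small sketch with illustrative variable names is given below; s\mathcal{H}_{s} keeps the ss largest-magnitude coordinates, as used in the proof above.

import numpy as np

def hard_threshold(v, s):
    """H_s(v): zero out all but the s largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def next_iterate(v_k, s):
    """w_k = H_s(v_k) / ||H_s(v_k)||_2, the update analyzed in Lemma 37."""
    h = hard_threshold(v_k, s)
    return h / np.linalg.norm(h)

Since wkw_{k} and ww^{*} are both ss-sparse, their difference is 2s2s-sparse, which is why the \ell_{1}-radius ρk+1=2srk+1\rho_{k+1}=\sqrt{2s}r_{k+1} suffices in the last display of the proof.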

D.6 Proof of Theorem 4

Proof.

We will prove the theorem with the following claim.

Claim D.2.

For any 1kk01\leq k\leq k_{0}, with probability at least 1i=1kδi1-\sum_{i=1}^{k}\delta_{i}, ww^{*} is in Wk+1W_{k+1}.

Based on the claim, we immediately have that with probability at least 1k=1k0δk1δ1-\sum_{k=1}^{k_{0}}\delta_{k}\geq 1-\delta, ww^{*} is in Wk0+1W_{k_{0}+1}. By our construction of Wk0+1W_{k_{0}+1}, we have

wwk022k04.\left\lVert w^{*}-w_{k_{0}}\right\rVert_{2}\leq 2^{-k_{0}-4}.

This, together with Part 4 of Lemma 12 and the fact that θ(w,wk0)πwwk02\theta(w^{*},w_{k_{0}})\leq\pi\left\lVert w^{*}-w_{k_{0}}\right\rVert_{2} (see Lemma 10 of [Zha18]), implies

errD(wk0)πc12k04=ϵ.\operatorname{err}_{D}(w_{k_{0}})\leq\frac{\pi}{c_{1}}\cdot 2^{-k_{0}-4}=\epsilon.

Finally, we derive the sample complexity and the label complexity. Recall that nkn_{k}, i.e. the size of T¯\bar{T}, was involved in Proposition 30, where we required

nk=O~(s2log4dbk(logd+log21δk)+log1δk)=O~(s2log4dbk(logd+log21δk)).n_{k}=\tilde{O}\mathinner{\biggl{(}s^{2}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\Bigl{(}\log d+\log^{2}\frac{1}{\delta_{k}}\Bigr{)}}+\log\frac{1}{\delta_{k}}\biggr{)}}=\tilde{O}\mathinner{\biggl{(}s^{2}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\Bigl{(}\log d+\log^{2}\frac{1}{\delta_{k}}\Bigr{)}}\biggr{)}}.

It is also involved in Proposition 33, where we need

mk=O(slog2nkdbkδklogdδk)m_{k}={O}\mathinner{\left(s\log^{2}\frac{n_{k}d}{b_{k}\delta_{k}}\cdot\log\frac{d}{\delta_{k}}\right)}

and nkmkn_{k}\geq m_{k} since SkS_{k} is a labeled subset of TT. As mkm_{k} has a cubic dependence on log1δk\log\frac{1}{\delta_{k}}, our final choice of nkn_{k} is given by

nk=O~(s2log4dbk(logd+log31δk)).n_{k}=\tilde{O}\mathinner{\biggl{(}s^{2}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\Bigl{(}\log d+\log^{3}\frac{1}{\delta_{k}}\Bigr{)}}\biggr{)}}. (D.20)

This in turn gives

mk=O~(slog2dbkδklogdδk).m_{k}=\tilde{O}\mathinner{\left(s\log^{2}\frac{d}{b_{k}\delta_{k}}\cdot\log\frac{d}{\delta_{k}}\right)}. (D.21)

Therefore, by Lemma 26 we obtain an upper bound of the sample size NkN_{k} at phase kk as follows:

Nk=O~(s2bklog4dbk(logd+log31δk))O~(s2ϵlog4d(logd+log31δ)),N_{k}=\tilde{O}\mathinner{\biggl{(}\frac{s^{2}}{b_{k}}\log^{4}\frac{d}{b_{k}}\cdot\mathinner{\Bigl{(}\log d+\log^{3}\frac{1}{\delta_{k}}\Bigr{)}}\biggr{)}}\leq\tilde{O}\mathinner{\biggl{(}\frac{s^{2}}{\epsilon}\log^{4}d\mathinner{\Bigl{(}\log d+\log^{3}\frac{1}{\delta}\Bigr{)}}\biggr{)}},

where the last inequality follows from bk=Ω(ϵ)b_{k}=\Omega(\epsilon) for all kk0k\leq k_{0} and our choice of δk\delta_{k}. Consequently, the total sample complexity

N=k=1k0Nkk0O~(s2ϵlog4d(logd+log31δ))=O~(s2ϵlog4d(logd+log31δ)).N=\sum_{k=1}^{k_{0}}N_{k}\leq k_{0}\cdot\tilde{O}\mathinner{\biggl{(}\frac{s^{2}}{\epsilon}\log^{4}d\mathinner{\Bigl{(}\log d+\log^{3}\frac{1}{\delta}\Bigr{)}}\biggr{)}}=\tilde{O}\mathinner{\biggl{(}\frac{s^{2}}{\epsilon}\log^{4}d\mathinner{\Bigl{(}\log d+\log^{3}\frac{1}{\delta}\Bigr{)}}\biggr{)}}.

Likewise, we can show that the total label complexity

m=k=1k0mkk0O~(slog2dϵδlogdδ)=O~(slog2dϵδlogdδlog1ϵ).m=\sum_{k=1}^{k_{0}}m_{k}\leq k_{0}\cdot\tilde{O}\mathinner{\Bigl{(}s\log^{2}\frac{d}{\epsilon\delta}\cdot\log\frac{d}{\delta}\Bigr{)}}=\tilde{O}\mathinner{\Bigl{(}s\log^{2}\frac{d}{\epsilon\delta}\cdot\log\frac{d}{\delta}\cdot\log\frac{1}{\epsilon}\Bigr{)}}.

It remains to prove Claim D.2 by induction on kk. For the base case k=1k=1, note that W1=B2(0,1)B1(0,s)W_{1}=B_{2}(0,1)\cap B_{1}(0,\sqrt{s}), so wW1w^{*}\in W_{1} with probability 11; by Lemma 36 there is an event F1F_{1} of probability 1δ11-\delta_{1} on which \theta(v_{1},w^{*})\leq 2^{-9}\pi, and Lemma 37 then gives wW2w^{*}\in W_{2}, which establishes the claim for k=1k=1. Now let k2k\geq 2 and suppose that Claim D.2 holds for k1k-1, that is, there is an event Ek1E_{k-1} that happens with probability at least 1i=1k1δi1-\sum_{i=1}^{k-1}\delta_{i}, on which wWkw^{*}\in W_{k}. By Lemma 36, there is an event FkF_{k} with \operatorname{Pr}(F_{k}\mid E_{k-1})\geq 1-\delta_{k}, on which θ(vk,w)2k8π\theta(v_{k},w^{*})\leq 2^{-k-8}\pi; in view of Lemma 37 this further implies wWk+1w^{*}\in W_{k+1}. Therefore, on the event Ek1FkE_{k-1}\cap F_{k} we have wWk+1w^{*}\in W_{k+1}, and \operatorname{Pr}(E_{k-1}\cap F_{k})=\operatorname{Pr}(E_{k-1})\cdot\operatorname{Pr}(F_{k}\mid E_{k-1})\geq(1-\sum_{i=1}^{k-1}\delta_{i})(1-\delta_{k})\geq 1-\sum_{i=1}^{k}\delta_{i}. ∎
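To summarize the bookkeeping in the proof above, the following schematic sketch (ignoring absolute constants and the logarithmic factors hidden in the O~\tilde{O} notation, and using illustrative variable names) shows how the per-phase radii shrink geometrically and how the per-phase unlabeled and label budgets add up over k0=O(log1ϵ)k_{0}=O(\log\frac{1}{\epsilon}) phases; N_k is taken to be roughly n_k / b_k, as suggested by Lemma 26.

import math

def phase_schedule(eps, s, d, c_bar=1.0):
    """Schematic per-phase budgets; constants and O~ factors are omitted."""
    k0 = max(1, math.ceil(math.log2(1.0 / eps)))        # k_0 = O(log(1/eps))
    total_unlabeled, total_labels = 0.0, 0.0
    for k in range(1, k0 + 1):
        r_k = 2.0 ** (-k - 3)
        b_k = c_bar * r_k                               # band width shrinks with r_k
        n_k = s ** 2 * math.log(d / b_k) ** 5           # ~ (D.20), schematically
        m_k = s * math.log(d / b_k) ** 3                # ~ (D.21), schematically
        total_unlabeled += n_k / b_k                    # N_k ~ n_k / b_k (Lemma 26)
        total_labels += m_k
    return k0, total_unlabeled, total_labels

print(phase_schedule(eps=0.01, s=10, d=10 ** 4))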

Appendix E Miscellaneous Lemmas

Lemma 38 (Chernoff bound).

Let Z1,Z2,,ZnZ_{1},Z_{2},\dots,Z_{n} be nn independent random variables that take value in {0,1}\{0,1\}. Let Z=i=1nZiZ=\sum_{i=1}^{n}Z_{i}. For each ZiZ_{i}, suppose that Pr(Zi=1)η\operatorname{Pr}(Z_{i}=1)\leq\eta. Then for any α[0,1]\alpha\in[0,1]

Pr(Z(1+α)ηn)eα2ηn3.\operatorname{Pr}\left(Z\geq(1+\alpha)\eta n\right)\leq e^{-\frac{\alpha^{2}\eta n}{3}}.

When Pr(Zi=1)η\operatorname{Pr}(Z_{i}=1)\geq\eta, for any α[0,1]\alpha\in[0,1]

Pr(Z(1α)ηn)eα2ηn2.\operatorname{Pr}\left(Z\leq(1-\alpha)\eta n\right)\leq e^{-\frac{\alpha^{2}\eta n}{2}}.
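A quick Monte Carlo sanity check of the first inequality of Lemma 38, in the regime used in the proof of Lemma 28 (α=1\alpha=1); the parameter values below are arbitrary choices for this illustration.

import numpy as np

rng = np.random.default_rng(2)
n, eta, alpha, trials = 500, 0.05, 1.0, 20000
Z = rng.random((trials, n)) < eta                     # Bernoulli(eta) indicators Z_i
tail = np.mean(Z.sum(axis=1) >= (1 + alpha) * eta * n)
bound = np.exp(-alpha ** 2 * eta * n / 3)
print(tail, bound)                                    # the empirical tail should not exceed the bound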
Lemma 39 (Theorem 1 of [KST08]).

Let σ=(σ1,,σn)\sigma=(\sigma_{1},\dots,\sigma_{n}) where σi\sigma_{i}’s are independent draws from the Rademacher distribution and let x1,,xnx_{1},\dots,x_{n} be given instances in d\mathbb{R}^{d}. Then

𝔼σ[supwB1(0,ρ)i=1nσiwxi]ρ2nlog(2d)max1inxi.\operatorname{\mathbb{E}}_{\sigma}\mathinner{\biggl{[}\sup_{w\in B_{1}(0,\rho)}\sum_{i=1}^{n}\sigma_{i}w\cdot x_{i}\biggr{]}}\leq\rho\sqrt{2n\log(2d)}\max_{1\leq i\leq n}\left\lVert x_{i}\right\rVert_{\infty}.
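Since the supremum over the \ell_{1}-ball is attained at a signed coordinate vector, \sup_{w\in B_{1}(0,\rho)}\sum_{i=1}^{n}\sigma_{i}w\cdot x_{i}=\rho\,\lVert\sum_{i=1}^{n}\sigma_{i}x_{i}\rVert_{\infty}, so the expectation in Lemma 39 can be estimated numerically and compared with the stated bound. The following sketch is purely illustrative; the Gaussian data and all parameter values are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
n, d, rho, trials = 200, 1000, 1.0, 2000
X = rng.normal(size=(n, d))                           # arbitrary instances x_1, ..., x_n

sups = np.empty(trials)
for t in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)           # Rademacher signs
    sups[t] = rho * np.max(np.abs(sigma @ X))         # dual-norm form of the supremum
estimate = sups.mean()
bound = rho * np.sqrt(2 * n * np.log(2 * d)) * np.max(np.abs(X))
print(estimate, bound)                                # the estimate should not exceed the bound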