Attribute-Efficient Learning of Halfspaces with Malicious Noise: Near-Optimal Label Complexity and Noise Tolerance
Abstract
This paper is concerned with computationally efficient learning of homogeneous sparse halfspaces in R^d in the presence of noise. Although recent works have established attribute-efficient learning algorithms under various types of label noise (e.g. bounded noise), it remains an open question when and how s-sparse halfspaces can be efficiently learned under the challenging malicious noise model, where an adversary may corrupt both the unlabeled examples and the labels. We answer this question in the affirmative by designing a computationally efficient active learning algorithm with near-optimal label complexity of Õ(s · polylog(d, 1/ε)) and noise tolerance η = Ω(ε), where ε ∈ (0, 1) is the target error rate and Õ(·) hides polylogarithmic factors, under the assumption that the distribution over (uncorrupted) unlabeled examples is isotropic log-concave. Our algorithm can be straightforwardly tailored to the passive learning setting, and we show that its sample complexity is Õ((s²/ε) · polylog(d, 1/ε)), which also enjoys attribute efficiency. Our main techniques include attribute-efficient paradigms for soft outlier removal and for empirical risk minimization, and a new analysis of uniform concentration for unbounded instances – all of which crucially take the sparsity structure of the underlying halfspace into account.
Keywords: halfspaces, malicious noise, passive and active learning, attribute efficiency
1 Introduction
This paper investigates the fundamental problem of learning halfspaces under noise [Val84, Val85]. In the absence of noise, this problem is well understood [Ros58, BEHW89]. However, the premise changes immediately when the unlabeled examples (which we will also refer to as instances) or the labels are corrupted by noise. Over the last decades, various types of label noise have been extensively studied, and a plethora of polynomial-time algorithms have been developed that are resilient to random classification noise [BFKV96], bounded noise [Slo88, Slo92, MN06], and adversarial noise [KSS92, KKMS05]. Significant progress towards optimal noise tolerance has also been witnessed in the past few years [Dan15, ABHU15, YZ17, DGT19, DKTZ20]. In this regard, a surge of recent research interest has concentrated on further improving performance guarantees by incorporating the structure of the underlying halfspace into algorithmic design. Of central interest is a property termed attribute efficiency, which proves to be useful when the data lie in a high-dimensional space [Lit87], or even in an infinite-dimensional space but with a bounded number of effective attributes [Blu90]. In the statistics and signal processing communities, it is often referred to as sparsity, dating back to the celebrated Lasso estimator [Tib96, CDS98, CT05, Don06]. Recently, learning of sparse halfspaces in an attribute-efficient manner was highlighted as an open problem in [Fel14], and in a series of recent works [PV13b, ABHZ16, Zha18, ZSA20], this property was carefully explored for label-noise-tolerant learning of halfspaces with improved or even near-optimal sample complexity, label complexity, or generalization error, where the key insight is that such a structural constraint effectively controls the complexity of the hypothesis class [Zha02, KST08].
Compared to the rich set of positive results on attribute-efficient learning of sparse halfspaces under label noise, less is known when both instances and labels are corrupted. Specifically, under the η-malicious noise model [Val85, KL88], there is an unknown target hypothesis w* and an unknown instance distribution D selected from a certain family by an adversary. Each time the adversary is called, with probability 1 − η it returns an instance x drawn from D together with the label sign(⟨w*, x⟩); with probability η, it instead is allowed to return an arbitrary pair (x, y) that may depend on the state of the learning algorithm and the history of its outputs. Since this is a much more challenging noise model, only recently has an algorithm with near-optimal noise tolerance been established in [ABL17], although without attribute efficiency. It is worth noting that the problem of learning sparse halfspaces is also closely related to one-bit compressed sensing [BB08], where one is allowed to utilize any distribution over measurements for recovering the target hypothesis. However, even with such a strong condition, existing theory therein can only handle label noise [PV13a, ABHZ16, BFN+17]. This naturally raises two fundamental questions: 1) can we design attribute-efficient learning algorithms that are capable of tolerating malicious noise; and 2) can we still obtain near-optimal performance guarantees on the degree of noise tolerance and on the sample complexity?
In this paper, we answer both questions in the affirmative under the mild distributional assumption that D is chosen from the family of isotropic log-concave distributions [LV07, Vem10], which covers prominent distributions such as normal, exponential, and logistic distributions. Moreover, we take label complexity into consideration [CAL94], and we show that our bound is near-optimal in that respect. We build our algorithm upon the margin-based active learning framework [BBZ07], which queries the label of an instance only when it has a small “margin” with respect to the currently learned hypothesis.
From a high level, this work can be thought of as extending the best known result of [ABL17] to the high-dimensional regime. However, even in the low-dimensional setting where the sparsity s is comparable to the dimension d, our label complexity bound is better than theirs in terms of the dependence on d: they have a quadratic dependence whereas we have a linear dependence (up to logarithmic factors). Moreover, as we will describe in Section 3, obtaining such an algorithmic extension is nontrivial both computationally and statistically. This work can also be viewed as an extension of [Zha18] to the malicious noise model. In fact, our construction of the empirical risk minimization step is inspired by that work. However, they considered only label noise, which makes their algorithm and analysis inapplicable to our setting: it turns out that when facing malicious noise, a sophisticated design of an outlier removal paradigm is crucial for optimal noise tolerance [KLS09].
Also in line with this work are learning with nasty noise [DKS18] and robust sparse functional estimation [BDLS17]. Both works considered more general settings in the following sense: [DKS18] showed that by properly adapting techniques from robust mean estimation, some more general concepts, e.g. low-degree polynomial threshold functions and intersections of halfspaces, can be efficiently learned with polynomial sample complexity; [BDLS17] showed that under proper sparsity assumptions, an attribute-efficient sample complexity bound can be achieved for many sparse estimation problems, such as generalized linear models with Lipschitz mapping functions and covariance estimation. However, we remark that neither of them obtained label efficiency. In addition, when adapted to our setting, Theorem 1.5 of [DKS18] only handles a noise rate of order ε^c for some constant c greater than one, while, as shown in Section 4, we obtain the near-optimal noise tolerance Ω(ε). [BDLS17] achieved near-optimal noise tolerance, but their analysis is restricted to the Gaussian marginal distribution and Lipschitz mapping functions. In addition to such fundamental differences, the main techniques we develop are distinct from theirs, as described in more detail in Section 3.3.3.
1.1 Main results
We informally present our main results below; readers are referred to Theorem 4 in Section 4 for a precise statement.
Theorem 1 (Informal).
Consider the malicious noise model with noise rate η. If the unlabeled data distribution is isotropic log-concave and the underlying halfspace is s-sparse, then there is an algorithm that, for any given target error rate ε ∈ (0, 1), PAC learns the underlying halfspace to error ε in polynomial time provided that η ≤ c·ε for a sufficiently small absolute constant c. In addition, the label complexity is Õ(s · polylog(d, 1/ε)) and the sample complexity is Õ((s²/ε) · polylog(d, 1/ε)).
First of all, note that the noise tolerance is near-optimal, as [KL88] showed that a noise rate greater than ε/(1+ε) cannot be tolerated by any algorithm, regardless of its computational power. The following fact establishes the near-optimality of our label complexity.
Lemma 2.
Active learning of s-sparse halfspaces under isotropic log-concave distributions in the realizable case has an information-theoretic label complexity lower bound of Ω(s log(d/s)).
1.2 Related works
[KL88] presented a general analysis of efficiently learning halfspaces, showing that even without any distributional assumptions it is possible to tolerate malicious noise at a nontrivial (dimension-dependent) rate, but that a noise rate greater than ε/(1+ε) cannot be tolerated. The noise model was further studied by [Sch92, Bsh98, CDF+99], and [KKMS05] obtained an improved noise tolerance when D is the uniform distribution. [KLS09] improved this result to Ω(ε²/log(d/ε)) for the uniform distribution, and showed a noise tolerance of Ω(ε³/log²(d/ε)) for isotropic log-concave distributions. A near-optimal result of Ω(ε) was established in [ABL17] for both uniform and isotropic log-concave distributions.
Achieving attribute efficiency has been a long-standing goal in machine learning and statistics [Blu90, BHL95], and has found a variety of applications with strong theoretical backend. A partial list includes online classification [Lit87], learning decision lists [Ser99, KS04, LS06], compressed sensing [Don06, CW08, TW10, SL18], one-bit compressed sensing [BB08, PV16], and variable selection [FL01, FF08, SL17a, SL17b].
Label-efficient learning has also been broadly studied since gathering high quality labels is often expensive. The prominent approaches include disagreement-based active learning [Han11, Han14], margin-based active learning [BBZ07, BL13, YZ17], selective sampling [CCG11, DGS12], and adaptive one-bit compressed sensing [ZYJ14, BFN+17]. There are also a number of interesting works that appeal to extra information to mitigate the labeling cost, such as comparison [XZS+17, KLMZ17] and search [BH12, BHLZ16].
Recent works such as [DKK+16, LRV16] studied mean estimation under a strong noise model where, in addition to returning dirty instances, the adversary also has the power of eliminating a few clean instances, similar to the nasty noise model in learning halfspaces [BEK02]. The main technique of robust mean estimation is a novel outlier removal paradigm, which uses the spectral norm of the covariance matrix to detect dirty instances. This is similar in spirit to the idea of [KLS09, ABL17] and the current work. However, there is no direct connection between mean estimation and halfspace learning, since the former is an unsupervised problem while the latter is supervised (although any connection would be very interesting). Very recently, such techniques were extensively investigated in a variety of problems such as clustering and linear regression; we refer the reader to the comprehensive survey of [DK19] for more information.
Roadmap. The rest of the paper is organized as follows. Section 2 introduces the problem setup and notation. Section 3 presents the main algorithm, and Section 4 establishes its performance guarantees. Section 5 concludes with open questions. Omitted details and proofs are deferred to the appendices.
2 Preliminaries
We study the problem of learning sparse halfspaces in R^d under the malicious noise model with noise rate η [Val85, KL88], where an oracle (i.e. the adversary) first selects a distribution D from a family of distributions and a concept w* from a concept class C; D and w* remain fixed during the learning process. Each time the adversary is called, with probability 1 − η a random pair (x, y) is returned to the learner with x drawn from D and y = sign(⟨w*, x⟩), referred to as a clean sample; with probability η, the adversary may return an arbitrary pair (x, y), referred to as a dirty sample. The adversary is assumed to have unrestricted computational power to search for dirty samples, which may depend on, e.g., the state of the learning algorithm and the history of its outputs. Formally, we make the following distributional assumptions.
Assumption 1.
Let the family of candidate distributions be the family of isotropic log-concave distributions. The underlying distribution D from which clean instances are drawn is chosen from this family by the adversary and is fixed during the learning process. The learner is given knowledge of the family, but not of D itself.
Assumption 2.
With probability 1 − η, the adversary returns a pair (x, y) where x is drawn from D and y = sign(⟨w*, x⟩); with probability η, it may return an arbitrary pair (x, y).
Since we are interested in obtaining a label-efficient algorithm, we consider a natural extension of this passive learning model. In particular, [ABL17] proposed the following protocol: when a labeled instance (x, y) is generated, the learner only has access to an instance-generation oracle which returns the instance x, and must make a separate call to a label-revealing oracle to obtain y. We refer to the total number of calls to the instance oracle as the sample complexity of the learning algorithm, and to the total number of calls to the label oracle as its label complexity.
We presume that the concept class C consists of homogeneous halfspaces that have unit ℓ2-norm and are s-sparse, i.e. the number of non-zero elements of any w ∈ C is at most s, where s ≤ d. The learning algorithm is given this concept class, that is, the set of homogeneous s-sparse halfspaces. For a hypothesis w, we define its error rate as the probability, over an instance x drawn from D, that sign(⟨w, x⟩) disagrees with sign(⟨w*, x⟩). The goal of the learner is to find, in polynomial time and with a small number of calls to the instance and label oracles, a hypothesis w such that, with probability at least 1 − δ, its error rate is at most ε, for any given failure confidence δ and target error rate ε.
For a reference vector u and a positive scalar b, we call the region {x : |⟨u, x⟩| ≤ b} a band, and we denote by D_{u,b} the distribution obtained by conditioning D on x falling in this band. Given a hypothesis w, a labeled instance (x, y), and a parameter τ > 0, we define the τ-hinge loss ℓ_τ(w; x, y) := max{0, 1 − y⟨w, x⟩/τ}. For a labeled set S, we define ℓ_τ(w; S) as the average of ℓ_τ(w; x, y) over the pairs (x, y) in S.
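To make these definitions concrete, here is a minimal Python sketch; the closed form max{0, 1 − y⟨w, x⟩/τ} for the τ-hinge loss is the standard one used in [ABL17, Zha18], and the function names are ours.

import numpy as np

def hinge_loss(w, x, y, tau):
    # tau-hinge loss of hypothesis w on the labeled instance (x, y)
    return max(0.0, 1.0 - y * float(np.dot(w, x)) / tau)

def avg_hinge_loss(w, S, tau):
    # empirical tau-hinge loss over a labeled set S = [(x, y), ...]
    return sum(hinge_loss(w, x, y, tau) for x, y in S) / len(S)

def in_band(x, u, b):
    # membership in the band {x : |<u, x>| <= b} around the reference vector u
    return abs(float(np.dot(u, x))) <= b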
For p ≥ 1, we denote by B_p(v, r) the ℓ_p-ball centered at the point v with radius r, i.e. B_p(v, r) = {u : ‖u − v‖_p ≤ r}; we will be particularly interested in the cases p ∈ {1, 2, ∞}. For a vector v, the hard thresholding operation H_s(v) keeps its s largest (in absolute value) elements and sets the remaining ones to zero. For two vectors u and v, we write θ(u, v) to denote the angle between them, and ⟨u, v⟩ to denote their inner product. For a matrix M, we denote its trace norm (also known as the nuclear norm), i.e. the sum of its singular values, by ‖M‖_*, and its entrywise ℓ1-norm, i.e. the sum of the absolute values of its entries, by ‖M‖_1. If M is a symmetric matrix, we write M ⪰ 0 to denote that it is positive semidefinite.
Throughout this paper, subscripted variants of the lowercase letter c, e.g. c₁ and c₂, are reserved for specific absolute constants, and a few further letters are reserved for specific constants as well. We remark that the values of all the constants involved in the paper do not depend on the particular underlying distribution D chosen by the adversary, but only on the knowledge that D is isotropic log-concave. We collect the definitions of these constants in Appendix A.
3 Main Algorithm
We first present an overview of our learning algorithm, followed by the specification of all the hyper-parameters used therein. We then describe in detail the attribute-efficient outlier removal scheme, which is the core technique of the paper.
3.1 Overview
Our main algorithm, namely Algorithm 1, is based on the celebrated margin-based active learning framework [BBZ07]. The key observation is that a good classifier can be learned by concentrating on fitting only the most informative labeled instances, as measured by their closeness to the current decision boundary (the closer, the more informative). In our algorithm, the sampling region is the entire instance space at the first phase, and is the band around the current iterate at every later phase. Once we obtain the working set of instances, we perform a pruning step that removes all instances with large ℓ∞-norm. This is motivated by our analysis, which shows that, with high probability, all clean instances in the band have small ℓ∞-norm provided that Assumption 1 is satisfied. Since the oracle may output dirty instances, we design an attribute-efficient soft outlier removal procedure, which aims to find proper weights for all instances in the working set such that the clean instances (i.e. those drawn from D) carry overwhelming weight compared to the dirty ones. Equipped with the learned weights, it is possible to minimize the reweighted hinge loss to obtain a refined halfspace. However, this would lead to a suboptimal label complexity, since we would have to query the label of every instance in the working set. Our remedy is to randomly sample a few points from the working set according to their importance, which is crucial for obtaining near-optimal label complexity.
When minimizing the hinge loss, we carefully construct the constraint set with three properties. First, it has an ℓ2-norm constraint. As a useful fact about isotropic log-concave distributions, the ℓ2-distance to the underlying halfspace is of the same order as the error rate. Thus, if we were able to ensure that the target halfspace stays in the constraint set, we would be able to show that the error rate of the new iterate is as small as the radius of the ℓ2-ball. Second, the constraint set has an ℓ1-norm constraint, which is well known for its power to promote sparse solutions and to guarantee attribute-efficient sample complexity [Tib96, CDS98, CT05, PV13b]. Lastly, the ℓ2 and ℓ1 radii of the constraint set shrink by a constant factor in each phase; hence, when Algorithm 1 terminates, the radius of the ℓ2-ball will be as small as O(ε). Notably, [Zha18] also utilizes such a constraint set for active learning of sparse halfspaces, but only under label noise.
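For concreteness, the constrained minimization we have in mind can be written as follows; this is a hedged sketch in the spirit of [Zha18], where the symbols v_k, w_{k−1}, r_k, ρ_k, τ_k, S_k and the centering of the ℓ2-ball at the previous iterate reflect our reading of the construction described above rather than a verbatim restatement of Algorithm 1:

\[
v_k \in \operatorname*{argmin}_{v \in K_k} \; \frac{1}{|S_k|} \sum_{(x, y) \in S_k} \ell_{\tau_k}(v; x, y),
\qquad
K_k = \bigl\{ v \in \mathbb{R}^d : \|v - w_{k-1}\|_2 \le r_k, \; \|v\|_1 \le \rho_k \bigr\},
\]

where S_k is the labeled set obtained in the random sampling step and the radii r_k and ρ_k shrink by a constant factor from phase to phase.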
The last step in Algorithm 1 is to perform hard thresholding on the solution, followed by ℓ2-normalization. Roughly speaking, these two steps produce an iterate consistent with the structure of the target (i.e. the iterate is guaranteed to belong to the concept class C), and, more importantly, they are useful in showing that the target halfspace lies in the constraint set at every phase.
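Putting the steps of this overview together, a schematic sketch of the main loop might look as follows. All helper names, their signatures, and the schedule interface are assumptions made purely for illustration; in particular, soft_outlier_removal and minimize_weighted_hinge are stand-ins for Algorithm 2 and the constrained hinge-loss minimization, not implementations of them.

import numpy as np

def hard_threshold(v, s):
    # keep the s largest-magnitude coordinates of v and zero out the rest
    u = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    u[keep] = v[keep]
    return u

def learn_sparse_halfspace(sample_instance, query_label, s, num_phases, schedule,
                           soft_outlier_removal, minimize_weighted_hinge, seed=0):
    # schedule(k) is assumed to return the phase-k parameters: band width b,
    # ell_infty pruning radius R, unlabeled set size n, and label budget m.
    rng = np.random.default_rng(seed)
    w = None                                            # phase 1: sample from the whole space
    for k in range(1, num_phases + 1):
        b, R, n, m = schedule(k)
        T = []
        while len(T) < n:                               # localized sampling within the band
            x = sample_instance()
            if w is None or abs(float(np.dot(w, x))) <= b:
                T.append(x)
        T = [x for x in T if np.max(np.abs(x)) <= R]    # pruning: drop instances with large ell_infty-norm
        q = soft_outlier_removal(T, w, k)               # weights under which clean instances dominate
        idx = rng.choice(len(T), size=m, p=q)           # importance sampling; only these labels are queried
        S = [(T[i], query_label(T[i])) for i in idx]
        v = minimize_weighted_hinge(S, w, k)            # constrained hinge-loss minimization
        v = hard_threshold(v, s)                        # enforce s-sparsity
        w = v / np.linalg.norm(v)                       # renormalize to the unit sphere
    return w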
3.2 Hyper-parameter setting
We elaborate on the hyper-parameter setting used in Algorithm 1 and in our analysis. The setting is driven by a handful of absolute constants that are specified in Appendix A. In particular, there exists an absolute constant satisfying the required inequality, since the relevant continuous function vanishes in the limit and all quantities involved are absolute constants. Given this constant, we set, for each phase, the band width, the ℓ2- and ℓ1-radii of the constraint set, and the hinge-loss scale; as described in Section 3.1, the band width and the radii shrink by a constant factor from one phase to the next.
We also fix two further constants used throughout the analysis; in particular, all of the per-phase hinge-loss scales are bounded from below in terms of them. Our theoretical guarantee holds for any noise rate η ≤ c·ε for a sufficiently small absolute constant c.
We set the total number of phases of Algorithm 1 to be logarithmic in 1/ε. Consider any phase k. We write n_k for the size of the unlabeled instance set acquired in that phase; we will show that, by making sufficiently many calls to the instance oracle, Algorithm 1 is guaranteed to obtain such a set in each phase with high probability. We write m_k for the size of the labeled instance set, which is also the number of calls to the label oracle in phase k. Note that the total number of calls to the instance oracle, summed over all phases, is the sample complexity of Algorithm 1, and the sum of the m_k's is its label complexity.
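As an illustration of the geometric schedule (the exact constants and logarithmic factors are those specified in this section and Appendix A and are not reproduced here; the halving, the sqrt(s) relation between the ℓ1 and ℓ2 radii, and the proportionality of the hinge scale to the band width are assumptions made for illustration only):

import math

def phase_parameters(k, s):
    # illustrative phase-k schedule: band width and ell_2 radius shrink by a
    # constant factor per phase; the ell_1 radius tracks sqrt(s) times the ell_2
    # radius (natural for s-sparse targets); the hinge scale tracks the band width
    b_k = 2.0 ** (-k)             # band width (assumption: halving per phase)
    r_k = 2.0 ** (-k)             # ell_2 radius of the constraint set
    rho_k = math.sqrt(s) * r_k    # ell_1 radius (assumption)
    tau_k = b_k / 8.0             # hinge-loss scale (assumption: proportional to b_k)
    return b_k, r_k, rho_k, tau_k

num_phases = math.ceil(math.log2(1.0 / 0.01))   # e.g. O(log(1/eps)) phases for eps = 0.01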
3.3 Attribute and computationally efficient soft outlier removal
Our soft outlier removal procedure is inspired by [ABL17]. We first briefly describe their main idea, then introduce a natural extension of their approach to the high-dimensional regime and show why it fails, and lastly present our novel outlier removal scheme.
To ease the discussion, we decompose the working set into the set of clean instances (those drawn from D) and the set of dirty instances (those supplied by the adversary). Ideally, we would like to find a weighting function q that equals 1 on every clean instance and 0 on every dirty instance. If a certain fraction of the instances in the working set are dirty, then the total weight should be allowed to be as large as the number of clean instances in order to include such an ideal function. On the other hand, we must restrict the weight assigned to dirty instances; namely, we need to characterize under what conditions the dirty instances can be distinguished from the clean ones. The key observation made in [KLS09] and [ABL17] is that if the dirty instances are to noticeably deteriorate the hinge loss (which is the purpose of the adversary), they must induce a variance (following [ABL17], we slightly abuse the word “variance” without subtracting the squared mean) that is orders of magnitude larger than that of the clean instances in the direction of some particular halfspace. Thus, it suffices to find a proper weight for each instance such that the reweighted variance is as small as that of the clean instances for all feasible halfspaces. It then remains to resolve two questions: 1) how many instances do we need to draw in order to guarantee the existence of such a weighting function q; and 2) how do we find a feasible function in polynomial time?
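The quantity being controlled is simply a weighted second moment along a direction; a short helper (using the “variance” convention above, i.e. without subtracting the squared mean) makes this explicit:

import numpy as np

def reweighted_variance(q, X, w):
    # sum_i q_i * <w, x_i>^2, where q is the weight vector and the rows of X are the instances
    return float(np.dot(q, (X @ w) ** 2))

# The soft outlier removal step looks for weights q under which this quantity is
# comparable to the clean instances' variance for every feasible direction w.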
If label complexity were our only objective, we could have used the soft outlier removal procedure of [ABL17] directly, i.e. without localizing the concept space in the ℓ1-norm, which in conjunction with the ℓ1-norm-constrained hinge loss minimization of [Zha18] would already yield an attribute-efficient label complexity, but with a sample complexity scaling linearly with the dimension d. Since we would also like to optimize the learner's sample complexity by utilizing the sparsity assumption, we need an attribute-efficient outlier removal procedure.
3.3.1 A natural approach and why it fails
It is well known that incorporating an ℓ1-norm constraint often leads to a sample complexity sublinear in the dimension [Zha02, KST08]. Thus, a natural approach to attribute-efficient outlier removal is to restrict the variance constraint to halfspaces lying in an ℓ1-ball of some carefully chosen radius. With this new localized concept space, it is possible to show that a sample size polynomial in s and polylogarithmic in d suffices to guarantee the existence of a weighting function q such that the reweighted variance is small over the entire localized concept space. However, on the computational side, for a given q, we would have to check the reweighted variance for all halfspaces in that space, which amounts to finding a global optimum of the following program:
maximize over w:   Σ_x q(x) ⟨w, x⟩²   subject to   ‖w‖₂ ≤ r,  ‖w‖₁ ≤ ρ,        (3.1)
where the sum ranges over the (pruned) working set and r and ρ denote the ℓ2- and ℓ1-radii of the localized concept space.
The above program is closely related to the problem of sparse principal component analysis (PCA) [ZHT06], and, unfortunately, finding a global optimum is known to be NP-hard [Ste05, TP14].
The weighting function q returned by Algorithm 2 is required to satisfy the following three constraints (stated here in words; these are the properties that the analysis in Section 4.2 relies on):
1. q(x) ∈ [0, 1] for every instance x in the (pruned) working set;
2. the total weight Σ_x q(x) is at least a 1 − ξ fraction of the size of the working set, where ξ upper bounds the fraction of dirty instances;
3. the reweighted variance proxy Σ_x q(x) x⊤Mx is at most the prescribed per-phase bound for every positive semidefinite matrix M in the localized set – an intersection of a trace-norm ball and an entrywise ℓ1-norm ball – introduced in Section 3.3.2 below.
3.3.2 Convex relaxation of sparse principal component analysis
Our goal is to find a weighting function q such that the objective value of (3.1) is at most the prescribed variance bound. To circumvent the computational intractability caused by the non-convexity of the objective, we consider an alternative formulation based on semidefinite programming (SDP), similar to the approach of [dGJL07]. First, write M = ww⊤, so that ⟨w, x⟩² = x⊤Mx. Due to our localized sampling and pruning scheme, every instance in the working set lies in the band and has bounded ℓ∞-norm. Thus, we only need to examine the maximum value of the reweighted variance over this bounded region. Now the technique of [dGJL07] comes in: the rank-one symmetric matrix ww⊤ is replaced by a new variable M that is only required to be positive semidefinite, and the ℓ2- and ℓ1-norm constraints on w are relaxed to trace-norm and entrywise ℓ1-norm constraints on M, respectively, as follows:
maximize over M:   Σ_x q(x) x⊤Mx   subject to   M ⪰ 0,  tr(M) ≤ r²,  ‖M‖_1 ≤ ρ²,        (3.2)
where the sum again ranges over the (pruned) working set and ‖M‖_1 denotes the entrywise ℓ1-norm.
The program (3.2) has two salient features: first, it is a semidefinite program that can be optimized efficiently [BV04]; second, if its optimal objective value is upper bounded by the prescribed variance bound, we immediately obtain that the reweighted variance is well controlled over the original (non-convex) constraint set. This is the theme of the following lemma.
Lemma 3.
Recall that Algorithm 1 sets the per-phase sample size n_k large enough that the required condition on the empirical second moment holds (see Appendix D.2); therefore, the above concentration bound holds with high probability. As a result, it is not hard to verify that the function q that assigns weight 1 to every clean instance and weight 0 to every dirty instance satisfies all three constraints in Algorithm 2. In other words, Lemma 3 establishes the existence of a feasible function for Algorithm 2. Furthermore, observe that the optimization problem of finding a feasible q in Algorithm 2 is a semi-infinite linear program. For a given candidate q, we can construct an efficient separation oracle as follows: it first checks whether q violates the first two constraints; if not, it checks the last constraint by invoking a polynomial-time SDP solver to find the maximum objective value of (3.2). It is well known that, equipped with such a separation oracle, Algorithm 2 returns a desired function in polynomial time via the ellipsoid method [GLS12, Chapter 3].
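As an illustration of the SDP check performed inside the separation oracle, the following sketch solves a relaxation in the spirit of (3.2) with cvxpy; the library choice, the function name, and the way the radii enter through trace_bound and l1_bound are our assumptions, with the paper's exact values coming from Section 3.2.

import cvxpy as cp
import numpy as np

def max_variance_proxy(X, q, trace_bound, l1_bound):
    # X: n-by-d matrix of (pruned) instances; q: nonnegative weights of length n.
    # The rank-one matrix w w^T is relaxed to a PSD variable M with a trace bound
    # (relaxing ||w||_2) and an entrywise ell_1 bound (relaxing ||w||_1), as in [dGJL07].
    d = X.shape[1]
    M = cp.Variable((d, d), PSD=True)
    S2 = X.T @ (q[:, None] * X)                            # sum_i q_i x_i x_i^T
    objective = cp.Maximize(cp.sum(cp.multiply(S2, M)))    # = sum_i q_i x_i^T M x_i since S2 is symmetric
    constraints = [cp.trace(M) <= trace_bound, cp.sum(cp.abs(M)) <= l1_bound]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return problem.value     # compared against the prescribed variance bound (Constraint 3)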
3.3.3 Comparison to prior works
We remark that our setting of n_k results in a per-phase sample complexity that is quadratic in s and polylogarithmic in d (see the formal statement in Lemma 6), which implies a total sample complexity of Õ((s²/ε) · polylog(d, 1/ε)). When s is substantially smaller than √d, this considerably improves upon the sample complexity, linear in the dimension d, that would follow from naively applying the soft outlier removal procedure of [ABL17].
We remark on three crucial technical differences from [DKS18] and [BDLS17]. First, we progressively restrict the variance used to identify dirty instances: the variance upper bound starts at a constant at the beginning of Algorithm 1 and decreases geometrically with the phase (see our settings of the band width and the hinge-loss scale), while in [DKS18, BDLS17] and many of their follow-up works it is typically fixed throughout. Second, we control the variance locally, i.e. we only require a small variance over a localized instance space (the band) and a localized concept space (the intersection of ℓ2- and ℓ1-balls). Third, the small variance is used to robustly estimate the hinge loss in our work, while in [DKS18] it was utilized to approximate the Chow parameters. All of these problem-specific designs of the outlier removal step are vital for obtaining the first near-optimal guarantees on attribute efficiency and label efficiency for learning sparse halfspaces.
4 Performance Guarantee
In the following, we always presume that the underlying halfspace is parameterized by w*, which is s-sparse and has unit ℓ2-norm. This condition may not be explicitly stated in our analysis.
Our main theorem is as follows. We note that there are two sources of randomness in Algorithm 1: the random draws of instances from the instance oracle, and the random sampling step (i.e. Step 8); the probability is taken over all the randomness in the algorithm.
Theorem 4.
Algorithm 1 can be straightforwardly modified to work in the passive learning setting, where the learner has direct access to a labeled-instance oracle. The modified algorithm calls this oracle to obtain an instance together with its label whenever Algorithm 1 calls the (unlabeled) instance oracle. In particular, for the passive learning algorithm, the working set is always a labeled instance set, and there is no need to query labels in the random sampling step.
We have the following simple corollary, which is an immediate consequence of Theorem 4.
Corollary 5.
We need an ensemble of new results to prove Theorem 4. Specifically, we propose new techniques to control the sample and computational complexity of soft outlier removal, and a new analysis of the label complexity that makes full use of the localization in the instance and concept spaces. We elaborate on these in the following, and sketch the proof of Theorem 4 at the end of this section.
4.1 Localized sampling in the instance space
Localized sampling, also known as margin-based active learning, is a useful technique proposed in [BBZ07]. Interestingly, under isotropic log-concave distributions, [BL13] showed that if the band width is chosen large enough, the region outside the band can be safely “ignored”, in the sense that if the current iterate is close enough to w*, it is guaranteed to incur a small error rate there. Motivated by this elegant finding, theoretical analyses in the literature are often dedicated to bounding the error rate within the band, and it is now well understood that a constant error rate within the band suffices to ensure significant progress in each phase [ABHU15, ABL17, Zha18]. We follow this line of reasoning, and our technical contribution is to show how to obtain such a constant error rate with near-optimal label complexity and noise tolerance.
Our analysis relies on the condition that the working set contains sufficiently many instances. Specifically, in order to collect enough instances to form the working set, we need to call the instance oracle a sufficient number of times, since our sampling is localized within the band. The following lemma characterizes the sample complexity at phase k.
4.2 Attribute and computationally efficient soft outlier removal
We summarize the performance guarantee of Algorithm 2 in the following proposition.
Proposition 7.
Again, we emphasize that the key difference between our algorithm and that of [ABL17] lies in Constraint 3 of Algorithm 2: we require that the “variance proxy” of the reweighted instances be small for all positive semidefinite matrices lying in an intersection of a trace-norm ball and an entrywise ℓ1-norm ball. On the statistical side, this favorable constraint set, in conjunction with Adamczak's bound from the empirical processes literature [Ada08], yields sufficient uniform concentration of the variance proxy from a sample whose size is quadratic in s and only polylogarithmic in d. This significantly improves upon the dimension-dependent sample complexity established in [ABL17]. The detailed proof can be found in Appendix D.3.
Remark 1.
While in some standard settings a proper ℓ1-norm constraint alone suffices to guarantee the desired sample complexity bound in the high-dimensional regime [Wai09, KST08], we note that in order to establish near-optimal noise tolerance, the ℓ2-norm constraint on the concept space (hence the trace-norm constraint on M) is vital as well. Though eliminating it eases the search for a feasible function q, doing so leads to a suboptimal noise tolerance. Informally speaking, the per-phase error rate, which is expected to be a constant, is inherently proportional to the variance times the noise rate within the band. Without the trace-norm constraint, the variance would be a factor of s larger than before (since we would then have to use the entrywise ℓ1-radius as a proxy for the constraint set's radius measured in trace norm). This implies that the variance bound must be set a factor of s larger than before, which in turn means that the tolerable noise rate becomes a factor of s smaller. We refer the reader to Proposition 31 and Lemma 36 for details.
Remark 2.
The quantity n_k has a quadratic dependence on the sparsity parameter s. Such quadratic dependence cannot be avoided in some related sparse PCA problems [BR13], but it is not clear whether it is optimal in our case. We leave this investigation to future work.
Next, we describe the statistical property of the distribution obtained by normalizing the weights returned by Algorithm 2. Observe that the noise rate within the band may exceed η by a factor inversely proportional to the band width, since the probability mass of the band is proportional to its width – an important property of isotropic log-concave distributions. Also, it is possible to show that the variance of clean instances along directions in the localized concept space is suitably small (see Lemma 16). Therefore, Algorithm 2 is essentially searching for a weighting under which the clean instances have overwhelming weight over the dirty instances, and under which the variance of the weighted instances is comparable to that of the clean instances. Recall that the clean subset of the working set consists of the instances drawn from D; let the corresponding unrevealed labeled set be the one in which each of these instances is correctly annotated by w*. The following proposition, which is similar to Lemma 4.7 of [ABL17] but refined, states that the reweighted hinge loss is a good proxy for the hinge loss evaluated exclusively on the clean labeled instances.
Proposition 8.
Note that though this proposition is phrased in terms of the hinge loss on labeled pairs, it is only used in the analysis; our algorithm does not require knowledge of these labels – indeed, it does not even need to identify the set of clean instances. As a result, the size of this set does not count towards our label complexity. Proposition 7 together with Proposition 8 implies that, with high probability, Algorithm 2 produces the desired probability distribution in polynomial time, which justifies its computational and statistical efficiency.
In addition, consider the expected hinge loss with respect to the band-conditional distribution. The following result links it to the empirical hinge loss on the clean instances.
4.3 Attribute and label-efficient empirical risk minimization
In light of Proposition 8, one may want to find the next iterate by minimizing the reweighted hinge loss directly. This, however, requires collecting labels for all instances in the working set, which leads to a suboptimal label complexity proportional to n_k. As a remedy, we perform a random sampling process, which draws instances from the working set according to the distribution induced by the learned weights and then queries only their labels, resulting in the labeled instance set used for empirical risk minimization. By standard uniform convergence arguments, the empirical hinge loss on this labeled set is expected to be close to the reweighted hinge loss provided that the number of sampled instances is large enough, as shown in the following proposition.
Proposition 10.
We remark that, when establishing the performance guarantee, the ℓ1-norm constraint on the hypothesis space, together with an ℓ∞-norm upper bound on the localized instance space, leads to a Rademacher complexity bound with a linear dependence on the sparsity s (up to logarithmic factors). Technically speaking, our analysis is more involved than that of [ABL17]: applying their analysis to the setting of learning sparse halfspaces, together with the fact that the VC dimension of the class of s-sparse halfspaces is O(s log(d/s)), would give a label complexity quadratic in s.
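A minimal sketch of the random sampling step described above (the names are ours; q is the probability vector obtained by normalizing the weights returned by Algorithm 2):

import numpy as np

def sample_labeled_set(T, q, m, query_label, seed=0):
    # draw m instances from T according to q (with replacement), then query only
    # the labels of the drawn instances -- this is what is charged to the label complexity
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(T), size=m, replace=True, p=q)
    return [(T[i], query_label(T[i])) for i in idx]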
4.4 Uniform concentration for unbounded data
Our analysis involves building uniform concentration bounds. The primary issue with applying standard concentration results, e.g. Theorem 1 of [KST08], is that under an isotropic log-concave distribution the instances are not contained in any pre-specified bounded ball with probability 1. [ABL17, Zha18] construct a conditional distribution under which the data are bounded from above, and then measure the difference between this conditional distribution and the original one. We circumvent such technical complications by using Adamczak's bound [Ada08] from the empirical process literature, which provides a generic way to derive concentration inequalities for well-behaved distributions with unbounded support. See Appendix C for a concrete treatment.
4.5 Proof sketch of Theorem 4
Proof.
We first show that the error rate of the hinge-loss minimizer on the band-conditional distribution is a constant, and that the same holds for the iterate produced at the end of the phase, since hard thresholding and ℓ2-normalization can only change the error rate by a constant factor. Observe that, in light of Proposition 8, Proposition 9, and Proposition 10, the empirical objective uniformly approximates the expected hinge loss on clean data over the entire constraint set. Therefore, if w* lies in the constraint set, then by the optimality of the minimizer its expected hinge loss within the band is bounded by a constant, where the last step uses Lemma 3.7 of [ABL17]. Since the hinge loss always upper bounds the 0/1 error, the constant error rate within the band follows. Next, we use the analysis framework of margin-based active learning to show that such a constant error rate ensures that the angle between the new iterate and w* shrinks by a constant factor per phase, which in turn implies the corresponding bound on their ℓ2-distance. It remains to show that w* also satisfies the ℓ1-constraint of the next phase; this follows from the s-sparsity of w* and the definition of the ℓ1-radius. Hence, we conclude that w* lies in the constraint set at every phase. Finally, observe that the ℓ2-radius at the last phase is as small as O(ε), which, by a basic property of isotropic log-concave distributions, implies that the error rate of the final output on D is at most ε.
The sample and label complexity bounds follow from our settings of n_k and m_k, summed over all phases, together with the fact that the probability mass of the phase-k band is proportional to its width. See Appendix D.5 for the full proof. ∎
5 Conclusion and Open Questions
We have presented a computationally efficient algorithm for learning sparse halfspaces under the challenging malicious noise model. Our algorithm leverages the well-established margin-based active learning framework, with particular attention to attribute efficiency, label complexity, and noise tolerance. We have shown that our guarantees on label complexity and noise tolerance are near-optimal, and that the sample complexity of a passive-learning variant of our algorithm is attribute-efficient, thanks to the set of new techniques proposed in this paper.
We raise three open questions for further investigation. First, as discussed in Section 4.2, the sample complexity needed for concentration of the variance proxy has a quadratic dependence on s. It would be interesting to study whether this is a fundamental limit of learning under isotropic log-concave distributions, or whether it can be improved by a more sophisticated localization scheme in the instance and concept spaces. Second, while isotropic log-concave distributions bear favorable properties that fit perfectly into the margin-based framework, it would be interesting to examine whether the established results can be extended to heavy-tailed distributions. Such distributions may lead to a large error rate within the band that cannot be controlled at a constant level, and new techniques must be developed. Finally, it would be interesting to design computationally more efficient algorithms, e.g. stochastic gradient descent-type algorithms similar to [DKM05], with comparable statistical guarantees.
References
- [ABHU15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Proceedings of the 28th Annual Conference on Learning Theory, pages 167–190, 2015.
- [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Annual Conference on Learning Theory, pages 152–192, 2016.
- [ABL17] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50:1–50:27, 2017.
- [Ada08] Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13(34):1000–1034, 2008.
- [BB08] Petros Boufounos and Richard G. Baraniuk. 1-bit compressive sensing. In Proceedings of the 42nd Annual Conference on Information Sciences and Systems, pages 16–21, 2008.
- [BBZ07] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35–50, 2007.
- [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Annual Conference on Learning Theory, pages 169–212, 2017.
- [BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
- [BEK02] Nader H. Bshouty, Nadav Eiron, and Eyal Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.
- [BFKV96] Avrim Blum, Alan M. Frieze, Ravi Kannan, and Santosh S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 330–338, 1996.
- [BFN+17] Richard G. Baraniuk, Simon Foucart, Deanna Needell, Yaniv Plan, and Mary Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
- [BH12] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
- [BHL95] Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50(1):32–40, 1995.
- [BHLZ16] Alina Beygelzimer, Daniel J. Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3342–3350, 2016.
- [BL13] Maria-Florina Balcan and Philip M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of The 26th Annual Conference on Learning Theory, pages 288–316, 2013.
- [Blu90] Avrim Blum. Learning boolean functions in an infinite attribute space. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 64–72, 1990.
- [BM02] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
- [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory, pages 1046–1066, 2013.
- [Bsh98] Nader H. Bshouty. A new composition theorem for learning algorithms. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, pages 583–589, 1998.
- [BV04] Stephen P. Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- [CAL94] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
- [CCG11] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 83(1):71–102, 2011.
- [CDF+99] Nicolò Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir, and Hans Ulrich Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684–719, 1999.
- [CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
- [CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
- [CW08] Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
- [Dan15] Amit Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Annual Conference on Learning Theory, volume 40, pages 484–502, 2015.
- [dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
- [DGS12] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.
- [DGT19] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 4751–4762, 2019.
- [DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
- [DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Zheng Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. CoRR, abs/1604.06443, 2016.
- [DKM05] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual Conference on Learning Theory, pages 249–263, 2005.
- [DKS18] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, pages 1061–1073, 2018.
- [DKTZ20] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, volume 125, pages 1486–1513, 2020.
- [Don06] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- [Dud14] Richard M. Dudley. Uniform central limit theorems, volume 142. Cambridge University Press, 2014.
- [Fel14] Vitaly Feldman. Open problem: The statistical query complexity of learning sparse halfspaces. In Proceedings of The 27th Annual Conference on Learning Theory, volume 35, pages 1283–1289, 2014.
- [FF08] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605–2637, 2008.
- [FL01] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
- [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
- [Han11] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
- [Han14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.
- [KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11–20, 2005.
- [KL88] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 267–280, 1988.
- [KLMZ17] Daniel M. Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In Chris Umans, editor, Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.
- [KLS09] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
- [KMT93] Sanjeev R. Kulkarni, Sanjoy K. Mitter, and John N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
- [KS04] Adam R. Klivans and Rocco A. Servedio. Toward attribute efficient learning of decision lists and parities. In Proceedings of the 17th Annual Conference on Learning Theory, pages 224–238, 2004.
- [KSS92] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In David Haussler, editor, Proceedings of the 5th Annual Conference on Computational Learning Theory, pages 341–352, 1992.
- [KST08] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, pages 793–800, 2008.
- [Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm (extended abstract). In Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pages 68–77, 1987.
- [Lon95] Philip M. Long. On the sample complexity of PAC learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
- [LRV16] Kevin A. Lai, Anup B. Rao, and Santosh S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 665–674, 2016.
- [LS06] Philip M. Long and Rocco A. Servedio. Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 921–928, 2006.
- [LV07] László Lovász and Santosh S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007.
- [MN06] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326–2366, 2006.
- [PV13a] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
- [PV13b] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
- [PV16] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
- [Ros58] Frank Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386–408, 1958.
- [RWY11] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
- [Sch92] Robert E. Schapire. Design and analysis of efficient learning algorithms. MIT Press, Cambridge, MA, USA, 1992.
- [Ser99] Rocco A. Servedio. Computational sample complexity and attribute-efficient learning. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 701–710, 1999.
- [SL17a] Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pages 3115–3124, 2017.
- [SL17b] Jie Shen and Ping Li. Partial hard thresholding: Towards a principled analysis of support recovery. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 3127–3137, 2017.
- [SL18] Jie Shen and Ping Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1–42, 2018.
- [Slo88] Robert H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 91–96, 1988.
- [Slo92] Robert H. Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, page 450, 1992.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- [Ste05] Daureen Steinberg. Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel University of Technology, 2005.
- [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [TP14] Andreas M. Tillmann and Marc E. Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2014.
- [TW10] Joel A. Tropp and Stephen J. Wright. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948–958, 2010.
- [Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- [Val85] Leslie G. Valiant. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 560–566, 1985.
- [vdGL13] Sara van de Geer and Johannes Lederer. The Bernstein-Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157:225–250, 2013.
- [VDVW96] Aad W Van Der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.
- [Vem10] Santosh S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM, 57(6):32:1–32:14, 2010.
- [Wai09] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
- [XZS+17] Yichong Xu, Hongyang Zhang, Aarti Singh, Artur Dubrawski, and Kyle Miller. Noise-tolerant interactive learning using pairwise comparisons. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 2431–2440, 2017.
- [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 1056–1066, 2017.
- [Zha02] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
- [Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Proceedings of the 31st Annual Conference On Learning Theory, pages 1856–1880, 2018.
- [ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
- [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. CoRR, abs/2002.04840, 2020.
- [ZYJ14] Lijun Zhang, Jinfeng Yi, and Rong Jin. Efficient algorithms for robust one-bit compressive sensing. In Proceedings of the 31st International Conference on Machine Learning, pages 820–828, 2014.
Appendix A Detailed Choices of Reserved Constants and Additional Notations
Constants.
The absolute constants appearing in Lemma 12 and Lemma 13 are specified in those lemmas, and two further constants were already fixed in Section 3.2. The constants used in Lemma 14, Lemma 17, and Lemma 18 are defined there, respectively. One absolute constant acts as an upper bound on all of the former, in accordance with our choice in Section 3.2, and another is defined in Lemma 16. Other absolute constants are not crucial to our analysis or algorithmic design, and we therefore do not track their definitions. Subscripted variants of c, e.g. c₁ and c₂, are also absolute constants, but their values may change from appearance to appearance. We remark that the values of all these constants do not depend on the particular distribution D chosen by the adversary, but only on the knowledge that D is isotropic log-concave.
Pruning.
Consider Algorithm 1. In each phase, we sample a working set and remove all instances that have large ℓ∞-norm (Step 6), which is equivalent to intersecting the working set with an ℓ∞-ball of an appropriate radius. This step is motivated by Lemma 18, which states that, with high probability, all clean instances in the working set lie in this ℓ∞-ball. Specifically, partition the working set into its clean part (the instances drawn from D) and its dirty part (the instances supplied by the adversary). Lemma 18 implies that, with high probability, every clean instance survives the pruning step, so only dirty instances may be removed; the pruned set decomposes accordingly into clean and dirty parts. We finally associate with each clean set its unrevealed labeled counterpart, in which every instance is annotated by w*. The notation is summarized below.
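The pruning step itself is a one-liner; the radius R below stands for the ℓ∞-threshold set by the algorithm, whose exact value is part of the hyper-parameter setting and is not reproduced here.

import numpy as np

def prune(working_set, R):
    # keep only instances whose ell_infty-norm is at most R; by Lemma 18, with high
    # probability no clean instance in the band is discarded by this step
    return [x for x in working_set if np.max(np.abs(x)) <= R]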
- the instance set obtained by calling the instance oracle, conditioned on the band;
- its clean subset, i.e. the instances that the adversary draws from the distribution D;
- its dirty subset, i.e. the remaining (adversarially supplied) instances;
- the subsets of each of the above that survive the pruning step, i.e. that lie in the ℓ∞-ball;
- the unrevealed labeled counterparts of the clean sets, in which every instance is annotated by w*.
Regularity condition on .
We frequently work with the conditional distribution obtained by conditioning D on the event that the instance lies in the band. We introduce the following regularity condition to ease our terminology.
Definition 11.
A band-conditional distribution, determined by a reference vector and a band width, is said to satisfy the regularity condition if one of the following holds: 1) the reference vector has unit ℓ2-norm and the band width lies in the admissible range fixed by our parameter setting; 2) the reference vector is the zero vector (in which case the conditioning is vacuous).
In particular, at each phase of Algorithm 1, the reference vector is set to the iterate from the previous phase and the band width to the current phase's value. At the first phase, the reference vector is the zero vector, so the conditional distribution boils down to D itself. For all later phases, the reference vector is a unit vector and the band width lies in the admissible range in view of our construction. Therefore, the conditional distributions used in all phases satisfy the regularity condition.
Appendix B Useful Properties of Isotropic Log-Concave Distributions
We record some useful properties of isotropic log-concave distributions.
Lemma 12.
There are absolute constants such that the following holds for all isotropic log-concave distributions D with density function f. We have:
1. Orthogonal projections of D onto subspaces of R^d are isotropic log-concave;
2. If d = 1, then the probability mass that D assigns to any interval is at most proportional to the interval's length;
3. If d = 1, then the density f is lower bounded by an absolute constant on an interval of constant length around the origin;
4. For any two unit vectors, the probability that the corresponding homogeneous halfspaces disagree on an instance drawn from D is at most a constant multiple of the angle between the vectors;
5. A one-dimensional isotropic log-concave random variable has an exponentially decaying tail.
The following lemma is implied by the proof of Theorem 21 of [BL13], which shows that if we choose a proper band width, the error incurred outside the band will be small. This observation is crucial for controlling the error over the distribution D, and has been broadly used in the literature [ABL17, Zha18].
Lemma 13 (Theorem 21 of [BL13]).
There are absolute constants such that the following holds for all isotropic log-concave distributions . Let and be two unit vectors in and assume that . Then for any , we have
Lemma 14 (Lemma 20 of [ABHZ16]).
There is an absolute constant such that the following holds for all isotropic log-concave distributions . Draw i.i.d. instances from to form a set . Then
Lemma 15.
There is an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition:
Proof.
When is a unit vector, Lemma 3.4 of [ABL17] shows that there exists a constant such that
When is a zero vector, reduces to and the constraint reads as . Thus we have
The proof is complete by choosing . ∎
Lemma 16.
There is an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition:
where .
Proof.
Since is a positive semidefinite matrix with trace norm at most , it has eigendecomposition , where are the eigenvalues such that , and ’s are orthonormal vectors in . Thus,
Since is drawn from , we have . Moreover, applying Lemma 15 with the setting of implies that
Therefore,
The proof is complete by choosing . ∎
Lemma 17.
Let . Then for all isotropic log-concave distributions and all satisfying the regularity condition,
1. the probability mass that D assigns to the band is at least a constant multiple of the band width;
2. for any event, its probability under the band-conditional distribution is at most its probability under D divided by a constant multiple of the band width.
Proof.
We first consider the case that is a unit vector.
For the lower bound, Part 3 of Lemma 12 shows that the density function of the random variable is lower bounded by when . Thus
where in the last inequality we use the condition .
For any event , we always have
Now we consider the case that is the zero vector and . Then in view of the choice . Thus Part 2 still follows. The proof is complete. ∎
Lemma 18.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. instances drawn from . Then
Appendix C Orlicz Norm and Concentration Results using Adamczak’s Bound
The following notion of Orlicz norm [vdGL13, Dud14] is useful for handling random variables whose tails decay as exp(−t^α) for general values of α, beyond α = 2 (subgaussian) and α = 1 (subexponential).
Definition 19 (Orlicz norm).
For any α ≥ 1, let ψ_α denote the corresponding Orlicz function. Furthermore, for a random variable X and α ≥ 1, define ‖X‖_{ψ_α}, the Orlicz norm of X with respect to ψ_α, as follows.
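In standard form (cf. [vdGL13, Dud14]), which we take to be the definition intended here:

\[
\psi_\alpha(t) = \exp(t^\alpha) - 1 \quad (t \ge 0), \qquad
\|X\|_{\psi_\alpha} = \inf\Bigl\{ C > 0 : \mathbb{E}\bigl[\psi_\alpha\bigl(|X|/C\bigr)\bigr] \le 1 \Bigr\},
\]

so that α = 2 recovers the subgaussian norm and α = 1 the subexponential norm.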
We collect some basic facts about Orlicz norms in the following lemma; they can be found in Section 1.3 of [VDVW96].
Lemma 20.
Let , , be real-valued random variables. Consider the Orlicz norm with respect to . We have the following:
1. is a norm. For any , ; .
2. where .
3. For any , .
4. If for any , then .
5. If , then for all , .
The following auxiliary results, tailored to the localized sampling scheme in Algorithm 1, will also be useful in our analysis.
Lemma 21.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of instances drawn from . Then
Consequently,
Proof.
Let be an isotropic log-concave random variable in . Part 5 of Lemma 12 shows that for all ,
Fix and fix . Denote by the -th coordinate of . Part 1 of Lemma 12 suggests that is isotropic log-concave. Thus, by Part 2 of Lemma 17,
Taking the union bound over and , we have for all
Now Part 4 of Lemma 20 immediately implies that
for some constant . The second inequality of the lemma follows immediately by combining the above with Part 2 of Lemma 20. ∎
C.1 Adamczak’s bound
In this section, we establish the key concentration results that will be used to analyze the performance of soft outlier removal and random sampling in Algorithm 1. Since the underlying distribution is isotropic log-concave, the unlabeled instances are unbounded. This prevents us from using standard concentration bounds, e.g. [KST08]. We therefore appeal to the following generalization of Talagrand’s inequality, due to [Ada08].
Lemma 22 (Adamczak’s bound).
For any , there exists a constant , such that the following holds. Given any function class , and a function such that for any , , we have with probability at least over the draw of a set of i.i.d. instances from ,
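For orientation, since only the setting of the lemma is reproduced here, one commonly cited form of Adamczak's inequality reads as follows, up to absolute constants depending on $\alpha$ and with $F$ denoting the envelope function (the variance term and the constants in the version used in this paper may differ): with probability at least $1-\delta$,
\[
\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f \Big|
\;\lesssim\;
\mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|
+ \sigma \sqrt{\frac{\log(1/\delta)}{n}}
+ \frac{(\log(1/\delta))^{1/\alpha}}{n} \Big\| \max_{i \le n} F(X_i) \Big\|_{\psi_\alpha},
\]
where $\sigma^2 \ge \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X))$ and the $\varepsilon_i$ are i.i.d. Rademacher signs. The third term is what accommodates unbounded (e.g. log-concave) instances.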
We first establish the following result, which upper bounds the expected Rademacher complexity of linear classes in terms of the Orlicz norm of the random instances.
Lemma 23.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from . Denote . Let a sequence of random variables be drawn from a distribution supported on a bounded interval for some . Let , where the ’s are i.i.d. Rademacher random variables independent of and . We have:
Proof.
Let so that any can be expressed as for some . First, conditioned on and , we have that
Thus,
(C.1)
where the second inequality follows from Lemma 21.
On the other hand, using the fact that for any random variable , , we have
where in the equality we use the observation that when , and in the last inequality we use the condition that is drawn from . Combining the above with (C.1) we obtain the desired result. ∎
C.2 Uniform concentration of hinge loss
Proposition 24.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from which satisfies the regularity condition. Let for any . Denote and let . Then with probability ,
In particular, suppose , and . Then we have: for any , a sample size suffices to guarantee that with probability , .
Proof.
We will use Lemma 22 with function class and the Orlicz norm with respect to . We define . It can be seen that for every ,
That is, for every in , .
Step 1. We upper bound . Since is a norm, we have
(C.2)
where we applied Lemma 21 in the last inequality.
Step 3. Finally, we upper bound . Let where each is an i.i.d. draw from the Rademacher distribution. We have
(C.4)
In the above, the first inequality used standard symmetrization arguments; see, for example, Lemma 26.2 of [SSBD14]. In the second inequality, we used the contraction property of Rademacher complexity and the fact that can be seen as a -Lipschitz function applied on input . In the last inequality, we applied Lemma 23 with the fact that .
C.3 Uniform concentration of relaxed sparse PCA
Proposition 25.
There exists an absolute constant such that the following holds for all isotropic log-concave distributions and all that satisfy the regularity condition. Let be a set of i.i.d. unlabeled instances drawn from . Denote . Then with probability ,
In particular, suppose and . Then we have: for any , a sample size
suffices to guarantee that with probability , .
Proof.
Recall that . For any matrix , we denote by the -th entry of the matrix . For any vector , we denote by the -th coordinate of .
We will use Lemma 22 with function class and the Orlicz norm with respect to . Consider the function parameterized by . First, we wish to find a function that upper bounds . It is easy to see that
(C.5)
Thus it suffices to choose .
Step 2. Next we upper bound , where we remark that taking the supremum over is equivalent to taking it over . Since , we have
In view of Part 2 of Lemma 20, we have
(C.7)
where the last inequality follows from Lemma 21. Hence,
(C.8)
for some absolute constant .
Step 3. Finally, we upper bound . Let where the ’s are independent draws from the Rademacher distribution. By standard symmetrization arguments (see e.g. Lemma 26.2 of [SSBD14]), we have
(C.9)
We first condition on and consider the expectation over . For a matrix , we use to denote the vector obtained by concatenating all of the columns of ; likewise for . It is crucial to observe that with this notation, for any , we have . It follows that
where the second inequality is from Lemma 39, and the equality is from the observation that . Therefore,
where the second inequality follows from Part 2 of Lemma 20, and the last inequality follows from Lemma 21. In summary,
(C.10)
for some constant .
Appendix D Performance Guarantee of Algorithm 1
In this section, we leverage all the tools from previous sections to establish the performance guarantee of Algorithm 1. Our main theorem, Theorem 4, follows from the analysis of each step of the algorithm, as we describe below.
D.1 Analysis of sample complexity
Recall that we refer to the number of calls to as the sample complexity of Algorithm 1. In order to obtain instances residing in the band , we have to make sufficiently many calls to .
Lemma 26 (Restatement of Lemma 6).
Proof.
We want to ensure that by drawing instances from , with probability at least , out of them fall into the band . We apply the second inequality of Lemma 38 by letting and , and obtain
where the probability is taken over the event that we make a number of calls to . Thus, when , we are guaranteed that at least samples from fall into the band with probability . The lemma follows by observing . ∎
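To illustrate the counting argument in generic notation, suppose each call to the example oracle lands in the band with probability at least $p$, and we want $n$ in-band instances with probability $1-\delta$ (the symbols $p$, $n$, $N$, $\delta$ are placeholders, not the paper's exact parameters). Drawing
\[
N = \Big\lceil \frac{2n + 8\log(1/\delta)}{p} \Big\rceil
\]
instances suffices: the number of in-band instances has mean $\mu = Np \ge 2n + 8\log(1/\delta)$, so the multiplicative Chernoff bound $\Pr[X \le (1-\gamma)\mu] \le e^{-\gamma^2\mu/2}$ (which we take to be the second inequality of Lemma 38) with $\gamma = 1/2$ gives
\[
\Pr[\text{fewer than } n \text{ in-band instances}] \le \Pr[X \le \mu/2] \le e^{-\mu/8} \le \delta .
\]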
D.2 Analysis of pruning and the structure of
With the instance set on hand, we estimate the empirical noise rate after applying pruning (Step 6) in Algorithm 1. Recall that , i.e. the number of unlabeled instances before pruning.
Lemma 27.
Proof.
For an instance , we use to denote that is drawn from , and use to denote that is adversarially generated.
Lemma 28.
Suppose that Assumptions 1 and 2 are satisfied. Further assume . For any , if , then with probability over the draw of , the following results hold simultaneously:
1. and hence , i.e. all clean instances in are intact after pruning;
2. , i.e. the empirical noise rate after pruning is upper bounded by ;
3. .
In particular, with the hyper-parameter setting in Section 3.2, .
Proof.
Let us write events , . We bound the probability of the two events over the draw of .
Recall that Lemma 18 implies that with probability , all instances in are in the -ball for , which implies .
We next calculate the noise rate within the band by Lemma 27:
where the equality applies our setting of , the second inequality uses the condition and the setting , and the last inequality is guaranteed by our choice of . Now we apply the first inequality of Lemma 38 by specifying , therein, which gives
where the probability is taken over the draw of . This implies provided that .
By a union bound, we have . We show that on the event , the second and third parts of the lemma follow. To see this, we note that it trivially holds that since only dirty instances have a chance of being removed. This proves the second part. Also, it is easy to see that , which is exactly the third part. ∎
D.3 Analysis of Algorithm 2
Lemma 29 (Restatement of Lemma 3).
Proof.
The first part follows immediately by combining Proposition 25 and Lemma 16 and recognizing our setting of and .
To see the second part, for any , we can upper bound as follows:
where . Hence it is easy to see that lies in . This indicates that for any , there exists an such that
(D.1)
Thus,
where the last inequality follows from the fact . ∎
Proposition 30 (Formal statement of Proposition 7).
Proof.
Our choice of satisfies the condition since is lower bounded by a constant (see Section 3.2 for our parameter setting). Thus by Lemma 28, with probability , . We henceforth condition on this event.
On the other hand, Lemma 3 and Proposition 25 together imply that with probability , for all , we have
(D.2)
provided that
(D.3)
Note that (D.3) is satisfied in view of the aforementioned event along with the setting of and . By a union bound, the events (D.2) and hold simultaneously with probability at least .
Now we show that these two events together imply the existence of a feasible function for Algorithm 2. Consider a particular function with for all and for all . We immediately have
In addition, for all ,
(D.4)
where the first inequality follows from the fact and the second inequality follows from (D.2). Namely, such a function satisfies all the constraints in Algorithm 2. Finally, combining (D.1) and (D.4) gives Part 3.
It remains to show that for a given candidate function , a separation oracle for Algorithm 2 can be constructed in polynomial time. First, it is straightforward to check whether the first two constraints and are violated. If not, it remains to check whether there exists an such that . To this end, we solve the following program:
This is a semidefinite program that can be solved in polynomial time [BV04]. If the maximum objective value is greater than , we conclude that is not feasible; otherwise we have found a desired function. ∎
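As an illustration only, the separation step could be implemented along the following lines in Python with cvxpy; the feasible set below (positive semidefinite $M$ with trace at most one and entrywise $\ell_1$ norm at most a sparsity budget $k$) is our reading of the relaxed sparse PCA constraint of Appendix C.3, and all function and variable names are hypothetical rather than the paper's.

# Hypothetical sketch of the quadratic-constraint check; not the exact program of Algorithm 2.
import numpy as np
import cvxpy as cp

def most_violated_constraint(X, q, k, threshold):
    """X: (n, d) array of instances, q: (n,) candidate weights, k: sparsity budget."""
    n, d = X.shape
    S = (X * q[:, None]).T @ X                        # weighted second moment: sum_i q_i x_i x_i^T
    M = cp.Variable((d, d), PSD=True)                 # relaxed sparse PCA variable
    constraints = [cp.trace(M) <= 1, cp.sum(cp.abs(M)) <= k]
    value = cp.Problem(cp.Maximize(cp.trace(S @ M)), constraints).solve()
    if value > threshold:
        return M.value    # maximizer certifies a violated quadratic constraint
    return None           # q satisfies all quadratic constraints

If the optimal value exceeds the threshold, the maximizing $M$ supplies the separating hyperplane required by the ellipsoid method; otherwise the candidate weights are feasible.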
The analysis of the following proposition closely follows [ABL17] with a refined treatment. Let where is the unrevealed label of that the adversary has committed to.
Proposition 31 (Formal statement of Proposition 8).
Proof.
The choice of guarantees that Lemma 28 and Proposition 30 hold simultaneously with probability . We thus have for all
(D.5)
(D.6)
(D.7)
In the above expression, (D.5) and (D.6) follow from Part 3 and Part 2 of Lemma 29 respectively, and (D.7) follows from Lemma 28. It follows from Eq. (D.7) and that
(D.8)
In the following, we condition on the event that all these inequalities are satisfied.
Step 1. First we upper bound by .
(D.9)
where follows from the simple fact that
uses the fact that the hinge loss is always upper bounded by and that , follows from Part 2 of Proposition 30, applies the Cauchy-Schwarz inequality, and uses Eq. (D.6).
In view of Eq. (D.8), we have . Continuing Eq. (D.9), we obtain
(D.10)
where in the last inequality we use . On the other hand, we have the following result which will be proved later on.
Claim D.1.
Step 2. We move on to prove the second inequality of the proposition, i.e. using to upper bound . Let us denote by the probability mass on dirty instances. Then
(D.11)
where the first inequality follows from and Part 2 of Proposition 30, the second inequality follows from (D.7), and the last inequality is by our choice .
Note that by Part 2 of Proposition 30 and the choice , we have . Hence
(D.12)
where the last inequality holds because of (D.5). Thus,
With the result on hand, we bound as follows:
which proves the second inequality of the proposition.
Putting everything together. We would like to show . Indeed, this is guaranteed by our setting of in Section 3.2, which ensures that simultaneously fulfills the following three constraints:
This completes the proof. ∎
The following result is a simple application of Proposition 24. It shows that the loss evaluated on clean instances concentrates around the expected loss.
Proposition 32 (Restatement of Proposition 9).
D.4 Analysis of random sampling
Proposition 33 (Restatement of Proposition 10).
Proof.
Since we applied pruning to remove all instances with large -norm, this proposition can be proved by a standard concentration argument for uniform convergence of linear classes under distributions with bounded support. We include the proof for completeness.
Note that the randomness is taken over the i.i.d. draw of samples from according to the distribution over . Thus, for any , . Moreover, let . Any instance drawn from satisfies with probability . It is also easy to verify that
By Theorem 8 of [BM02] along with standard symmetrization arguments, we have that with probability at least ,
(D.13)
where denotes the Rademacher complexity of function class on the labeled set , and . In order to calculate , we observe that each function is a composition of and function class . Since is -Lipschitz, by the contraction property of Rademacher complexity, we have
(D.14)
Let where the ’s are i.i.d. draws from the Rademacher distribution, and let . We compute as follows:
where the first equality is by the definition of Rademacher complexity, the second equality simply decomposes as a sum of and , the third equality is by the fact that every has zero mean, and the inequality applies Lemma 39. We combine the above result with (D.13) and (D.14), and obtain that with probability ,
(D.15)
Recall that we remove all instances with large -norm in the pruning step of Algorithm 1. In particular, we have
Plugging this upper bound into (D.15) and using our hyper-parameter setting gives
for some constant . Hence,
suffices to ensure with probability . ∎
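For intuition about the quantity controlled above, the empirical Rademacher complexity of an $\ell_1$-constrained linear class can be estimated by Monte Carlo via the duality $\sup_{\|w\|_1 \le W_1} \langle v, w \rangle = W_1 \|v\|_\infty$. The sketch below is purely illustrative: the sample size, dimension, and $\ell_1$ radius are hypothetical, and Gaussian instances stand in for the pruned (bounded) sample.

import numpy as np

rng = np.random.default_rng(1)
n, d, W1 = 500, 1000, 5.0            # hypothetical sample size, dimension, and l1 radius
X = rng.standard_normal((n, d))      # stand-in for the pruned instances

estimates = []
for _ in range(200):
    sigma = rng.choice([-1.0, 1.0], size=n)                    # Rademacher signs
    # sup_{||w||_1 <= W1} (1/n) * sigma^T X w = (W1 / n) * ||X^T sigma||_inf
    estimates.append(W1 / n * np.max(np.abs(X.T @ sigma)))
print("estimated empirical Rademacher complexity:", np.mean(estimates))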
D.5 Analysis of Per-Phase Progress
Let .
Lemma 34 (Lemma 3.7 of [ABL17]).
Lemma 35.
For any , if , then with probability , .
Proof.
Observe that with the setting of , with probability over all the randomness in phase , Lemma 26, Proposition 31, Proposition 32, and Proposition 33 hold simultaneously. Now we condition on the event that all of these properties are satisfied, which implies that for all ,
(D.16)
We have
In the above, the first inequality follows from the fact that the hinge loss upper bounds the 0/1 loss; the remaining steps respectively apply (C.1), the definition of (see Algorithm 1), and our assumption that is feasible. The proof is complete in view of Lemma 34. ∎
Lemma 36.
For any , if , then with probability , .
Proof.
For , by Lemma 35 and the fact that we actually sample from , we have
Hence Part 4 of Lemma 12 indicates that
(D.17)
Now we consider . Denote , and . We will show that the error of on both and is small, hence is a good approximation to .
First, we consider the error on , which is given by
(D.18)
where the inequality is due to Lemma 35 and Lemma 17. Note that the inequality holds with probability .
Next we derive the error on . Note that Lemma 10 of [Zha18] states that for any unit vector and any general vector , . Hence,
Recall that we set in our algorithm and choose where , which allows us to apply Lemma 13 and obtain
This, combined with (D.18), gives
Recall that we set and denote by the coefficient of in the above expression. By Part 4 of Lemma 12,
(D.19)
Lemma 37.
For any , if , then .
Proof.
We first show that . Let . By algebra . Now we have
By the sparsity of and , and our choice , we always have
The proof is complete. ∎
D.6 Proof of Theorem 4
Proof.
We will prove the theorem with the following claim.
Claim D.2.
For any , with probability at least , is in .
Based on the claim, we immediately have that with probability at least , is in . By our construction of , we have
This, together with Part 4 of Lemma 12 and the fact that (see Lemma 10 of [Zha18]), implies
Finally, we derive the sample complexity and label complexity. Recall that was involved in Proposition 30, i.e. the quantity , where we required
It is also involved in Proposition 33, where we need
and since is a labeled subset of . As has a cubic dependence on , our final choice of is given by
(D.20)
This in turn gives
(D.21)
Therefore, by Lemma 26 we obtain an upper bound of the sample size at phase as follows:
where the last inequality follows from for all and our choice of . Consequently, the total sample complexity
Likewise, we can show that the total label complexity
It remains to prove Claim D.2 by induction. First, for , . Therefore, with probability . Now suppose that Claim D.2 holds for some , that is, there is an event that happens with probability , and on this event . By Lemma 36 we know that there is an event that happens with probability , on which . This further implies that in view of Lemma 37. Therefore, on the event , which happens with probability , we have . ∎
Appendix E Miscellaneous Lemmas
Lemma 38 (Chernoff bound).
Let be independent random variables that take values in . Let . For each , suppose that . Then for any
When , for any
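For reference, one standard multiplicative form of the bound for independent random variables taking values in $[0,1]$ with $\mathbb{E}[X] = \mu$ is
\[
\Pr\big[X \ge (1+\gamma)\mu\big] \le \exp\Big(-\frac{\gamma^2 \mu}{2+\gamma}\Big) \;\; \text{for all } \gamma > 0,
\qquad
\Pr\big[X \le (1-\gamma)\mu\big] \le \exp\Big(-\frac{\gamma^2 \mu}{2}\Big) \;\; \text{for all } \gamma \in (0,1);
\]
the constants in the version invoked in this paper may differ.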
Lemma 39 (Theorem 1 of [KST08]).
Let where the ’s are independent draws from the Rademacher distribution and let be given instances in . Then
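For reference, one standard bound of this type (we have not verified the exact constant used in [KST08]) follows from Massart's finite-class lemma applied to the $2d$ linear functionals $\pm e_j$:
\[
\mathbb{E}_{\sigma}\Big\| \sum_{i=1}^n \sigma_i x_i \Big\|_\infty
\;\le\; \sqrt{2\log(2d)} \, \max_{j \le d}\Big(\sum_{i=1}^n x_{ij}^2\Big)^{1/2}
\;\le\; \sqrt{2 n \log(2d)}\, \max_{i \le n} \|x_i\|_\infty .
\]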