The Johnson-Lindenstrauss Lemma for Clustering and Subspace Approximation: From Coresets to Dimension Reduction
Abstract
We study the effect of Johnson-Lindenstrauss transforms in various projective clustering problems, generalizing recent results which only applied to center-based clustering [MMR19]. We ask the general question: for a Euclidean optimization problem and an accuracy parameter , what is the smallest target dimension such that a Johnson-Lindenstrauss transform preserves the cost of the optimal solution up to a -factor? We give a new technique which uses coreset constructions to analyze the effect of the Johnson-Lindenstrauss transform. Our technique, in addition to applying to center-based clustering, improves on (or is the first to address) other Euclidean optimization problems, including:
• For -subspace approximation: we show that suffices, whereas the prior best bound, of , only applied to the case [CEM+15].
• For -flat approximation: we show suffices, completely removing the dependence on from the prior bound of [KR15].
• For -line approximation: we show suffices, and ours is the first to give any dimension reduction result.
1 Introduction
The Johnson-Lindenstrauss lemma [JL84] concerns dimensionality reduction for high-dimensional Euclidean spaces. It states that, for any set $X$ of $n$ points in $\mathbb{R}^d$ and any $\varepsilon \in (0,1)$, there exists a map $f \colon \mathbb{R}^d \to \mathbb{R}^k$, with $k = O(\log n / \varepsilon^2)$, such that, for any $x, y \in X$,
(1)  $(1-\varepsilon)\|x-y\|_2 \le \|f(x)-f(y)\|_2 \le (1+\varepsilon)\|x-y\|_2.$
From a computational perspective, the lemma has been extremely influential in designing algorithms for high-dimensional geometric problems, partly because proofs show that a random linear map, oblivious to the data, suffices. Proofs specify a distribution supported on linear maps which is independent of (for example, given by a matrix of i.i.d entries [IM98, DG03]), and show that a draw satisfies (1) with probability at least .
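As a concrete illustration (not taken from the paper), the following minimal numpy sketch applies such an oblivious Gaussian map; the $1/\sqrt{k}$ scaling is the standard normalization that preserves squared norms in expectation, and all names are illustrative.

```python
import numpy as np

def jl_map(X, k, seed=0):
    """Project the rows of X (n points in R^d) to R^k using a data-oblivious
    random matrix with i.i.d. N(0, 1/k) entries."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Pi = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))
    return X @ Pi.T

# For k on the order of log(n)/eps^2, each pairwise distance is preserved up to
# a (1 +/- eps) factor with high probability.
X = np.random.default_rng(1).normal(size=(100, 5000))
Y = jl_map(X, k=500)
print(np.linalg.norm(Y[0] - Y[1]) / np.linalg.norm(X[0] - X[1]))  # close to 1
```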
In this paper, we study the Johnson-Lindenstrauss transforms for projective clustering problems, generalizing a recent line-of-work which gave (surprising) dimension reduction results for center-based clustering [BZD10, CEM+15, BBCA+19, MMR19]. The goal is to reduce the dimensionality of the input of a (more general) projective clustering problem (from to with ) without affecting the cost of the optimal solution significantly. We map -dimensional points to -dimensional points (via a random linear map ) such that the optimal cost in -dimensions is within a -factor of the optimal cost in the original -dimensional space. We study this for a variety of problems, each of which is specified by a set of candidate solutions and a cost function. By varying the family of candidate solutions and the cost functions considered, one obtains center-based clustering problems (like -means and -median), as well as subspace approximation problems (like principal components analysis), and beyond (like clustering with subspaces). The key question we address here is:
Main Question: For a projective clustering problem, how small can be as a function of (the dataset size) and (the accuracy parameter), such that the cost of the optimization is preserved up to -factor with probability at least over ?
Our results fit into a line of prior work on the power of Johnson-Lindenstrauss maps beyond the original discovery of [JL84]. These have been investigated before for various problems and in various contexts, including nearest neighbor search [IM98, HIM12, AIR18], numerical linear algebra [Sar06, Mah11, Woo14], prioritized and terminal embeddings [EFN17, MMMR18, NN19, CN21], and clustering and facility location problems [BZD10, CEM+15, KR15, BBCA+19, MMR19, NSIZ21, ISZ21].
1.1 Our Contribution
In the -projective clustering problem with cost ([HPV02, AP03, EV05, DRVW06, VX12b, KR15, FSS20, TWZ+22]), the goal is to cluster a dataset , where each cluster is approximated by an affine -dimensional subspace. Namely, we define an objective function for -projective clustering problems with cost on a dataset , which aims to minimize
where the “candidate solutions” consist of all -tuples of -dimensional affine subspaces. The cost function maps each candidate solution to a cost in given by the -norm of the vector of distances from each dataset point to its nearest point on one of the subspaces. Intuitively, each point of the dataset “pays” for the Euclidean distance to the nearest subspace in , and the total cost is the -norm of the payments (one for each point).
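To make the objective concrete, here is a minimal sketch of evaluating one candidate solution; representing each affine subspace by an orthonormal basis B of its directions together with an offset c is an assumption of the sketch, and the names are illustrative.

```python
import numpy as np

def dist_to_affine_subspace(x, B, c):
    """Distance from x to the affine subspace {c + B^T a}, where the rows of B
    form an orthonormal basis of the j-dimensional direction space."""
    r = x - c
    return np.linalg.norm(r - B.T @ (B @ r))

def projective_clustering_cost(P, solution, z):
    """z-norm of the vector of distances from each point in P to its nearest
    subspace; `solution` is a list of k pairs (B, c)."""
    dists = [min(dist_to_affine_subspace(x, B, c) for B, c in solution) for x in P]
    return np.linalg.norm(dists, ord=z)
```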
There has been significant prior work which showed surprising results for the special case of -clustering (like -means and -median, which corresponds to -projective clustering with -cost) as well as for low-rank approximation (which corresponds to -projective clustering for non-affine subspaces with -cost) [BZD10, CEM+15, BBCA+19, MMR19]. It is important to note that the techniques in prior works are specifically tailored to the Euclidean optimization problem at hand. For example, the results of [MMR19], which apply for -clustering with -norm, rely on using center points as the approximation and do not generalize to affine subspaces beyond dimension . The other result of [CEM+15] for low-rank approximation uses the specific algebraic properties of the -norm (which characterize the optimal low-rank approximation). These prior works carry a strong conceptual message: for -clustering and low-rank approximation, even though many pairwise distances among dataset points become highly distorted (since one projects to dimensions), the cost of the optimization (which aggregates distances) need not be significantly distorted.
Our Results. We show that -clustering with -norm and low-rank approximation are not isolated incidents, but rather part of a more general phenomenon. Our main conceptual contribution is the following: we use algorithms for constructing coresets (via the “sensitivity sampling” framework of [FL11]) to obtain bounds on dimension reduction. The specific bounds that we obtain for the various problems then depend on the sizes of the coresets that the algorithms can produce. We instantiate our framework to obtain new upper bounds for the following problems:
• -Subspace Approximation. This problem is a (restricted) -projective clustering problem with -cost. We seek to minimize, over a -dimensional subspace of , the -norm of the -dimensional vector where the coordinate encodes the distance between and the closest point in . (This is a restricted version of projective clustering because the subspaces are not affine and are required to pass through the origin.)
• -Flat Approximation. This problem is exactly the -projective clustering problem with cost. It is similar to -subspace approximation, except we optimize over all affine subspaces.
• -Line Approximation. This problem corresponds to -projective clustering with -cost. The optimization minimizes, over an arbitrary set of of -dimensional affine subspaces (i.e., lines in ), the -norm of the -dimensional vector where the coordinate encodes the distance between and the closest point on any line in .
Concretely, our quantitative results are summarized by the following theorem.
Theorem 1 (Main Result—Informal).
Let be any dataset, and denote a matrix of i.i.d entries from . Let and be candidate solutions for a projective clustering problem in and , respectively. For any , we have
(2)
whenever:
• -subspace and -flat approximation: and are all -dimensional subspaces of and , respectively; the cost measures the -norm of distances from points to the subspace; and, . Similarly, the same bound on holds for and varying over all affine -dimensional subspaces.
• -line approximation: and are all -tuples of lines in and , respectively; the cost measures the -norm of distances between points and the nearest line; and, .
In all cases, the bound that we obtain is directly related to the size of the best coresets from the sensitivity sampling framework, and all of our proofs follow the same format. (The reason we did not generalize the results to -projective clustering with -cost with is that these problems do not admit small coresets [HP04]. Researchers have studied “integer -projective clustering”, where one restricts the input points to have bounded integer coordinates, for which small coresets do exist [EV05]. However, using this approach for dimension reduction would incur additional additive errors, so we have chosen not to pursue this route.) Our proofs are not entirely black-box applications of coresets (since we use the specific instantiations of the sensitivity sampling framework), but we believe that any improvement on the size of the best coresets will likely lead to quantitative improvements on the dimension reduction bounds. However, improving our current bounds by either better coresets or via a different argument altogether (for example, improving on the cubic dependence on ) seems to require significantly new ideas.
Target Dimension for Johnson-Lindenstrauss Transforms
Problem | New Result | Prior Best
---|---|---
-subspace | | (when ) [CEM+15]
-flat | | [KR15]
-line | | None
A Subtlety in “For-All” versus “Optimal” Guarantees. So far, the main question (and the results we obtain) focus on applying the Johnson-Lindenstrauss transform and preserving the optimal cost, i.e., that the minimizing solution in the original and the dimension reduced space have approximately the same cost. A stronger guarantee which one may hope for, a so-called “for-all” guarantee, asks that after applying the Johnson-Lindenstrauss transform, every solution has its cost approximately preserved before and after dimension reduction. We do not achieve “for all” guarantees, like those appearing in [MMR19]. However, we emphasize that various subtleties arise in what is meant by “a solution,” as the prior work on dimension reduction and coresets refer to different notions (even though they agree at the optimum).
Consider, for example, the -medoid problem, a constrained version of the -means problem. The -medoid cost of a dataset is the minimum over centers chosen from the dataset , of the sum of squares of distances from each dataset point to . The subtlety is the following: one can apply a Johnson-Lindenstrauss transform to dimensions and preserve the -means cost, and one may hope that a “for-all” guarantee would also preserve the -medoid cost. Somewhat surprisingly, we show that it does not.
Theorem 2 (Johnson-Lindenstrauss for Medoid—Informal (see Theorem 12)).
For any , applying a Johnson-Lindenstrauss transform to dimension decreases the cost of the -medoid problem by a factor approaching .
Not all “for-all” guarantees are the same. A “for-all” guarantee comes with (an implicit) representation of a candidate “solution,” and different choices of representations yield different guarantees. The above theorem does not contradict the “for-all” guarantee of [MMR19] because there, a candidate solution for -clustering refers to a partition of into parts and not a set of centers. Often in the coreset literature, a “solution” refers to a set of centers and not arbitrary partitions. For , there are many possible centers but only one partition, and it is important (as per Theorem 2) that a potential “for all” guarantee considers partitions.
For -subspace and -flat approximation, in a natural representation of the solutions, the same issue arises. Consider the -column subset selection problem, a constrained version of the -subspace approximation problem, where subspaces must be in the span of the dataset points. The -column subset selection cost of a dataset is the minimum over -dimensional subspaces spanned by a dataset point of , of the sum of squares of distances from each dataset point to the projection onto the subspace. Similarly to Theorem 2, a Johnson-Lindenstrauss transform does not preserve the cost of -column subset selection.
Theorem 3 (Johnson-Lindenstrauss for Column Subset Selection—Informal (see Theorem 13)).
For any , applying a Johnson-Lindenstrauss transform to dimension decreases the cost of the -column subset selection problem by a factor approaching .
The above theorem does not contradict the “for-all” guarantee of [CEM+15] for similar reasons (which, in addition, crucially rely on having , and which we elaborate on in Appendix A). For -line approximation, however, there is an interesting open problem: is it true that, after applying a Johnson-Lindenstrauss transform to , for all partitions of into parts, the cost of optimally approximating each part with a line is preserved?
1.2 Related Work
Dimension Reduction.
Our paper continues a line of work initiated by Boutsidis, Zouzias, and Drineas [BZD10], who first studied the effect of a Johnson-Lindenstrauss transform for -means clustering, and showed that sufficed for a -approximation. The bound was improved to -approximation with by Cohen, Elder, Musco, Musco, Persu [CEM+15], who also showed that gave a -approximation. Becchetti, Bury, Cohen-Addad, Grandoni, Schwiegelshohn [BBCA+19] showed that sufficed for preserving the costs of all -mean clusterings. Makarychev, Makarychev, and Razenshteyn [MMR19] improved and generalized the above bounds for all -clustering. They showed that preserved costs to -factor.
Coresets.
Coresets are a well-studied technique for reducing the size of a dataset, while approximately preserving a particular desired property. Since its formalization in Agarwal, Har-Peled, and Varadarajan [AHPV05], coresets have played a foundational role in computational geometry, and found widespread application in clustering, numerical linear algebra, and machine learning (see the recent survey [Fel20]). Indeed, even for clustering problems in Euclidean spaces, there is a long line of work (which is still on-going) [BHPI02, HPM04, AHPV05, Che09, LS10, FL11, VX12b, VX12a, FSS13, BFL16, SW18, HV20, BBH+20, CASS21, CALSS22] exploring the best coreset constructions.
Most relevant to our work is the “sensitivity sampling” framework of Feldman and Langberg [FL11], which gives algorithms for constructing coresets for the projective clustering problems we study. In light of the results of [FL11], as well as the classical formulation of the Johnson-Lindenstrauss lemma [JL84], it may seem natural to apply coreset algorithms and dimensionality reduction concurrently. However, this is not without a few technical challenges. As we will see in the next subsection, it is not necessarily the case that coreset algorithms and random projections “commute.” Put succinctly, the random projection of a coreset of may not be a coreset of the random projection . Indeed, proving such a statement constitutes the bulk of the technical work.
1.3 Organization
The following section (Section 2) overviews the high-level plan, since all our results follow the same technique. To highlight the technique, the first technical section considers the case of -clustering (Section 4), where arguing via coresets shows how to obtain . The remaining sections cover the technical material for -subspace approximation (in Section 5), -flat approximation (in Section 6), and finally -line approximation (in Section 7).
2 Overview of Techniques
In this section, we give a high-level overview of the techniques employed. As it will turn out, all results in this paper follow from one general technique, which we instantiate for the various problem instances.
We give an abstract instantiation of the approach. We will be concerned with geometric optimization problems of the following sort:
• For each , we specify a class of candidate solutions given by a set . For example, in center-based clustering, may be given by a tuple of points in , corresponding to the cluster centers. In subspace approximation, the set may denote the set of all -dimensional subspaces of .
• There will be a cost function which takes a point and a potential solution and outputs the cost of on . Continuing with the example of center-based clustering, may denote the distance from a dataset point to its nearest point in . In subspace approximation, may denote the distance from a dataset point to the orthogonal projection of that point onto the subspace . For a parameter , we will denote the cost of using for a dataset by
For simplicity in the notation, we will drop the subscripts from the functions and when they are clear from context.
We let denote a distribution over linear maps which will satisfy some “Johnson-Lindenstrauss” guarantees (we will specify in the preliminaries the properties we will need). For concreteness, we will think of given by matrix multiplication by a matrix of i.i.d draws of . We ask, for a particular bound on the dataset size , a geometric optimization problem (specified by , and ), and a parameter , what is the smallest such that with probability at least over the draw of ,
(3)
The right-most inequality in (3) claims that the cost after applying does not increase significantly, i.e., . This direction is easy to prove for the following reason. For a dataset , we consider the solution minimizing . We sample and we find a candidate solution which exhibits an upper bound on . For example, in center-based clustering, is a set of centers in , and we may consider as the centers from after applying . The fact that with high probability over will follow straightforwardly from properties of . Importantly, the optimal solution does not depend on . In fact, while we expect to distort some distances substantially, we can pick so that having too many distortions on these points is unlikely.
However, the same reasoning does not apply to the left-most inequality in (3). This is because the solution which minimizes depends on . Indeed, we would expect to take advantage of distortions in in order to decrease the cost of the optimal solution. We proceed by the following high level plan. We identify a sequence of important events defined over the draw of which occur with probability at least . The special property is that if satisfies these events, we can identify, from minimizing , a candidate solution which exhibits an upper bound .
We now specify how exactly we define, for an optimal (depending on ), a candidate solution whose cost is not much higher than . For that, we will use the notion of coresets. Before the formal definition, we note there is a natural extension of for weighted datasets. In particular, if is a set of points and is a set of weights for , then we will use as -th power of the sum over all of .
Definition 2.1 ((Weak) Coresets; see also [Fel20]). (The word “weak” is used to distinguish these from “strong” coresets, which are weighted subsets of points that approximately preserve the cost of all candidate solutions.)
For , let denote a class of candidate solutions and specify the cost of a point to a solution. For a dataset and a parameter , a (weak) -coreset for is a weighted set of points and which satisfy
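As a small illustration of the weighted cost appearing in this definition (a sketch with hypothetical names; the per-point cost function is problem-specific), the weighted cost of a candidate solution on a coreset can be evaluated as:

```python
def weighted_cost(S, w, solution, point_cost, z):
    """Weighted z-cost of `solution` on the weighted set (S, w): the z-th power of
    the cost is the weighted sum of the z-th powers of the per-point costs."""
    return sum(wi * point_cost(x, solution) ** z for x, wi in zip(S, w)) ** (1.0 / z)

# A weak eps-coreset guarantees that a solution which is optimal for (S, w) has
# cost on the full dataset within a (1 + eps) factor of the true optimum.
```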
It will be crucial for us that these problems admit small coresets. More specifically, for the problems considered in this paper, there exists (known) algorithms which can produce small-size coresets from a dataset. In what follows, ALG is a randomized algorithm which receives as input a dataset and outputs a weighted subset of points which is a weak -coreset for with high probability. Computationally, the benefit of using coresets is that the sets tend to be much smaller than , so that one may compute on and obtain an approximately optimal solution for . For us, the benefit will come in defining the important events. At a high level, since is small, the important events defined with respect to will only worry about distortions within the subset (or subspace spanned by) .
In particular, it is natural to consider the following approach:
1. We begin with the original dataset , and we consider the solution which minimizes . The goal is to show that cannot be much larger than , where minimizes .
2. Instead of considering the entire dataset , we execute and consider the weak -coreset that we obtain. If we can identify a candidate solution whose cost , we would be done. Indeed, , and the fact that is a weak -coreset implies .
3. Moving to a coreset allows one to relate and by considering the performance of on . The benefit is that the important events, defined over the draw of , set as a function of , instead of . A useful example to consider is requiring be -Lipschitz on the entire subspace spanned by , which requires . For the problems considered here, a nearly optimal for will be in the subspace spanned by , so we may identify whose cost on is not much higher than the by evaluating , since lies inside . (While the above results in bounds for which are already meaningful, we will exploit other geometric aspects of the problems considered to get bounds on which are logarithmic in the coreset size. For center-based clustering, [MMR19] showed that one may apply Kirszbraun’s theorem. For subspace approximation, we use the geometric lemmas of [SV12].)
4. The remaining step is showing . In particular, one would like to claim that is a weak -coreset for and use the right-most inequality in Definition 2.1. However, it is not clear this is so. The problem is that the algorithm ALG depends on the -dimensional representation of , and may not be a valid output for . As we show, this does work for (some) coreset algorithms built on the sensitivity sampling framework (see [FL11, BFL16]). (We will not prove that, with high probability over and , is a weak -coreset for . Rather, all we need is that the right-most inequality in Definition 2.1 holds for and , which is what we show.)
2.0.1 Sensitivity Sampling for Step 4
In the remainder of this section, we briefly overview the sensitivity sampling framework, and the components required to make Step 4 go through. At a high level, coreset algorithms in the sensitivity sampling framework proceed in the following way. Given a dataset , the algorithm computes a sensitivity sampling distribution supported on . The requirement is that, for each potential solution , sampling from gives a low-variance estimator for . In particular, we let be the probability of sampling . Then, for any distribution and any ,
(4)
Equation 4 implies that, for any , if is i.i.d samples from and , the expectation of is . In addition, the algorithm designs so that, for a parameter ,
(5)
If we set , (5) and Chebyshev’s inequality imply for each with a high constant probability, and the remaining work is in increasing by a large enough factor to “union bound” over all . There is a canonical way of ensuring and satisfy (5): we first define , known as a “sensitivity function”, which sets for each ,
(6)
which is known as the “total sensitivity.” Then, the distribution is given by letting .
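A sketch of the sampling step itself is below; the sensitivity values are problem-specific and are supplied as input here, and the names are illustrative.

```python
import numpy as np

def sensitivity_sample(P, sigma, m, seed=0):
    """Draw m points of P (an (n, d) array) i.i.d. with probability proportional
    to the sensitivities sigma, and attach the weights w = 1/(m*q) that make the
    weighted cost an unbiased estimator of the full cost."""
    rng = np.random.default_rng(seed)
    sigma = np.asarray(sigma, dtype=float)
    q = sigma / sigma.sum()                 # the sensitivity sampling distribution
    idx = rng.choice(len(P), size=m, p=q)
    return P[idx], 1.0 / (m * q[idx])
```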
We now show how to incorporate the map , to argue Step 4. Recall that we let denote i.i.d draws from and the weights be . We want to argue that, with high constant probability over the draw of and , we have
(7)
First, note that the analogous version of (4) for continues to hold. In particular, for any map in the support of and minimizing ,
(8)
Hence, it remains to define satisfying (6) which also satisfies one additional requirement. With high probability over , we should have
(9)
The above translates to saying that, for most , the variance of , when , is small. Once that is established, we may apply Chebyshev’s inequality and conclude (7) with high constant probability. (Since Steps 1–4 only argued about the optimal , there is no need to “union bound” over all in our arguments.)
2.0.2 The Circularity and How to Break It
One final technical hurdle arises. While one may define the sensitivity function to be exactly and automatically satisfy (6), it becomes challenging to argue that (9) holds. In the end, the complexity we seek to optimize is the total sensitivity , so there is flexibility in defining while showing (9) holds. In fact, sensitivity functions which are computationally simple tend to be known, since an algorithm using coresets must quickly compute for every .
The sensitivity functions used in the literature (for instance, in [FL11, VX12b]) are defined with respect to an approximately optimal (or bi-criteria approximation) for . Furthermore, the arguments used to show these functions satisfy (6), which we will also employ for (9), crucially utilize the approximation guarantee on . The apparent circularity appears in approximation algorithms and also shows up in the analysis here:
• As in the prior work [FL11, VX12b], the sensitivity function is defined with respect to an approximately optimal (or bi-criteria approximate) solution for the original dataset, and its approximation guarantee is what certifies (6).
• We use the proof of the “easy” direction to identify a solution with (recall this was used to establish the right-most inequality in (3)). From the analytical perspective, it is useful to think of as the function one would get from defining a sensitivity function as in the previous step, with instead of . If we could show was approximately optimal for , we could use [VX12b, VX12a] again to argue (9). The circularity is the following. Showing is approximately optimal means showing an upper bound on in terms of . Since we picked to be at most , this is exactly what we sought to prove.
If “approximately optimal” above required be a -approximation to the optimal , we would have a complete circularity and be unable to proceed. However, similarly to the case of approximation algorithms, it suffices to have a poor approximation. Suppose we showed was a -approximation; then, increasing by a factor depending on (which would affect the resulting dimensionality ) would account for this increase and drive the variance back down to . Moreover, showing that is a -approximation with probability at least over , given Steps 1–4, is straightforward. Instead of showing the stronger bound that , we show that . The latter (loose) bound is a consequence of applying Markov’s inequality to (8).
In summary, we overcome the circularity by going through Steps 1–4 twice. The first time, we show a weaker -approximation. Specifically, we show that preserves the cost of up to a factor of . In this first pass, we do not upper bound the variance in (9); we simply apply Markov’s inequality to (8) in order to prove a (loose) bound for Step 4. Once we have established the -factor approximation, we are guaranteed that is a -approximation to . This means that, actually, the sensitivity sampling distribution we had considered (when viewed as a sensitivity sampling distribution for ) gives estimators with bounded variance, as in (9). In particular, going through Steps 1–4 once again implies that was actually the desired -approximation.
3 Preliminaries
We specify the properties we use from the distribution . We will refer to these as “Johnson-Lindenstrauss” distributions. Throughout the proof, we will often refer to as given by a matrix of i.i.d draws from . The goal of specifying the useful properties is to use other “Johnson-Lindenstrauss”-like distributions. The first property we need is that is a linear map, and that any satisfies
We use the standard property of , that preserves distances with high probability, i.e., for any ,
More generally, we use the conclusion of the following lemma. We give a proof when is given by a matrix of i.i.d entries of .
Lemma 3.1.
Let denote a Johnson-Lindenstrauss distribution over maps given by a matrix multiplication on the left by a matrix of i.i.d draws of . If , then for any ,
Proof.
We note that by the 2-stability of the Gaussian distribution, we have is equivalently distributed as , where . Therefore, we have that for any which we will optimize shortly to be a small constant times ,
Furthermore,
We will upper bound the probability that exceeds by the Chernoff-Hoeffding method. In particular, recall that is distributed as a -random variable with degrees of freedom, rescaled by , so that the moment generating function of has the following closed-form expression whenever :
In particular, for any , we may upper bound
by setting whenever . Setting to be a small constant multiple of gives the desired guarantees. ∎
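For reference, the Chernoff-Hoeffding computation used above is the standard one for a $\chi^2$ random variable $Z$ with $k$ degrees of freedom (constants are not optimized): for $\varepsilon \in (0,1]$,
\[
\Pr[Z \ge (1+\varepsilon)k] \;\le\; \inf_{0<t<1/2} (1-2t)^{-k/2} e^{-t(1+\varepsilon)k} \;=\; \big((1+\varepsilon)e^{-\varepsilon}\big)^{k/2} \;\le\; e^{-k\varepsilon^2/8},
\]
where the infimum is attained at $t = \varepsilon/(2(1+\varepsilon))$ and the last step uses $\varepsilon - \ln(1+\varepsilon) \ge \varepsilon^2/4$ on $(0,1]$; an analogous bound holds for the lower tail.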
Definition 3.2 (Subspace Embeddings).
Let and denote a subspace of . For , a map is an -subspace embedding of if, for any ,
Lemma 3.3.
Let and be a subspace of dimension at most . For , let denote a Johnson-Lindenstrauss distribution over maps given by a matrix multiplication on the left by a matrix of i.i.d draws of . If , then is an -subspace embedding of with probability at least .
4 Center-based -clustering
In the -clustering problem, for any set of points and point , we write
and for a subset ,
We extend the above notation to weighted subsets, where for a subset with (non-negative) weights , we write . The main result of this section is the following theorem.
Theorem 4 (Johnson-Lindenstrauss for Center-Based Clustering).
Let be any set of points, and let denote the optimal -clustering of . For any , suppose we let be the distribution over Johnson-Lindenstrauss maps where
Then, with probability at least over the draw of ,
There are two directions to showing dimension reduction: (1) the optimal clustering in the reduced space is not too expensive, and (2) the optimal clustering in the reduced space is not too cheap. We note that (1) is simple because we can exhibit a clustering in the reduced space whose cost is not too high; however, (2) is much trickier, since we need to rule out a too-good-to-be-true clustering in the reduced space.
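For concreteness, the objective treated in this section can be evaluated by the following minimal sketch (illustrative names; some statements in the literature work with the z-th power of this quantity, which only differs by raising to the power z).

```python
import numpy as np

def kz_clustering_cost(P, centers, z):
    """z-norm of the vector of distances from each point in P (an (n, d) array)
    to its nearest center (centers is a (k, d) array)."""
    dists = np.min(np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return np.linalg.norm(dists, ord=z)
```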
4.1 Easy Direction: Optimum Cost Does Not Increase
Lemma 4.1.
Let be any set of points and let of size be the centers of the optimal -clustering of . We let be the distribution over Johnson-Lindenstrauss maps. If , then with probability at least over the draw of ,
and hence,
Proof.
For , let denote the closest point from to . We compute the expected positive deviation from assigning to , and note that the can only be lower. Hence, if we can show
(10)
then by Markov’s inequality, we will obtain , and obtain the desired bound when raising to power . This last part follows from Lemma 3.1. ∎
4.2 Hard Direction: Optimum Cost Does Not Decrease
We now show that after applying a Johnson-Lindenstrauss map, the cost of the optimal clustering in the dimension-reduced space is not too cheap. This section will be significantly more difficult, and will draw on the following preliminaries.
4.2.1 Preliminaries
Definition 4.2 (Weak and Strong Coresets).
Let be a set of points. A (weak) -coreset of for -clustering is a subset of points with weights such that,
The coreset is a strong -coreset if, for all , we have
Notice that Definition 4.2 gives an approach to finding an approximately optimal -clustering. Given , we find a (weak) -coreset and find the optimal clustering with respect to the coreset . Then, we can deduce that a clustering which is optimal for the coreset is also approximately optimal for the original point set. A common and useful approach to building coresets is the “sensitivity sampling” framework.
Definition 4.3 (Sensitivities).
Let , and consider any set of points , as well as , . A sensitivity function for -clustering in is a function satisfying that, for all ,
The total sensitivity of the sensitivity function is given by
For a sensitivity function, we let denote the sensitivity sampling distribution, supported on , which samples with probability proportional to .
The following lemma gives a particularly simple sensitivity sampling distribution, which will be useful for analyzing our dimension reduction procedure. The proof below will follow from two applications of the triangle inequality which we reproduce from Claim 5.6 in [HV20].
Lemma 4.4.
Let and consider a set of points . Let of size be optimal -clustering of , and let denote the function which sends to its closest point in , and let be the set of points where . Then, the function given by
is a sensitivity function for -clustering in , satisfying
Proof.
Consider any set of points, and let be the function which sends each to its closest point in . Then, we have
(11)
(12)
(13)
where we used the triangle inequality and Hölder’s inequality in (11), added additional non-negative terms in (12), and applied the triangle inequality and Hölder’s inequality once more in (13). Hence, using the fact that is the optimal clustering, we have
The bound on follows from summing over all , noting the fact that . ∎
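For intuition, the following sketch computes sensitivity upper bounds of the standard two-term shape used here: a term proportional to the point's share of the optimal cost plus a term inversely proportional to the size of its optimal cluster. The constant prefactor (which depends on z) is omitted since it only rescales the total sensitivity, so this is an illustration rather than the precise function of Lemma 4.4.

```python
import numpy as np

def kz_sensitivity_upper_bounds(P, centers, z):
    """Two-term sensitivity upper bounds for (k, z)-clustering with respect to a
    fixed (approximately optimal) set of centers; prefactor omitted."""
    diffs = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    nearest = np.argmin(diffs, axis=1)            # index of the closest center
    dz = np.min(diffs, axis=1) ** z               # z-th power of each distance
    cluster_sizes = np.bincount(nearest, minlength=centers.shape[0])
    return dz / dz.sum() + 1.0 / cluster_sizes[nearest]
```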
The main idea behind the sensitivity sampling framework for building coresets is to sample from a sensitivity sampling distribution enough times in order to build a coreset. For this work, it will be sufficient to consider the following theorem of [HV20], which shows that draws from an appropriate sensitivity sampling distribution suffice to build strong -coresets for -clustering in .
Theorem 5 (-Strong Coresets from Sensitivity Sampling [HV20]).
For any subset and , let of size be the optimal -clustering of , and let denote the sensitivity sampling distribution of Lemma 4.4.
• Let denote a random (multi-)set and given by, for iterations, sampling i.i.d and letting .
• Then, with probability over the draw of , it is an -strong coreset for .
Theorem 6 (Kirszbraun theorem).
Let and be an -Lipschitz map (with respect to Euclidean norms on and ). There exists a map which is -Lipschitz extending , i.e., for all .
4.3 The Important Events
We now define the important events which will allow us to prove that the optimum -clustering after dimension reduction does not decrease substantially. We first define the events, and then we prove that if the events are all satisfied, then we obtain our desired lower bound.
Definition 4.5 (The Events).
Let and of size be centers for an optimal -clustering of , and is the sensitivity sampling distribution of with respect to as in Lemma 4.4. We will consider the following experiment,
1. We generate a sample by sampling from for i.i.d iterations and set .
2. Furthermore, we sample which is a Johnson-Lindenstrauss map .
3. We let denote the image of on .
The events are the following:
• : The weighted (multi-)set is a weak -coreset for -clustering in .
• : The map , given by restricting is -bi-Lipschitz.
• : We let of size be the optimal centers for -clustering in . The weighted (multi-)set satisfies
Lemma 4.6.
Let , and suppose and satisfy events and , then,
Proof.
Let of size denote the centers which give the optimal -clustering of in . Then, by ,
Now, we use Kirszbraun’s Theorem, and extend in a -Lipschitz manner. Hence,
Finally, using the fact that is an -weak coreset, we may conclude that
∎
We now turn to showing that an appropriate setting of parameters implies that the events occur often. For the first event, Theorem 5 from [HV20] implies event occurs with probability . We state the usual guarantees of the Johnson-Lindenstrauss transform, which is what we need for event to hold.
Lemma 4.7.
Let be any set of points, and denote the Johnson-Lindenstrauss map, with
Then, with probability over the draw of , is -bi-Lipschitz, and hence, occurs with probability .
4.3.1 A Bad Approximation Guarantee
Lemma 4.8 (Warm-Up Lemma).
Fix any and let denote the optimal centers for -clustering of . Then, with probability at least over the draw of as per Definition 4.5,
with probability at least . In other words, holds with probability at least .
Proof.
For any , let denote the point in closest to . Then, we note
(14)
so that in expectation over , we have
By Markov’s inequality, we obtain our desired bound. ∎
Corollary 4.9.
Let be any set of points, and of size be centers for optimally -clustering . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
(15)
4.3.2 Improving the Approximation
In what follows, we will improve upon the approximation of Corollary 4.9 significantly, to show that with large probability, holds. We let and denote of size to be the optimal -clustering of . As before, we let map each to its closest center in , and be the sensitivities of with respect to as in Lemma 4.4, and be the sensitivity sampling distribution.
We define one more event, , with respect to the randomness of . First, we let denote the random variable given by
(16)
Notice that when consists of i.i.d , then is distributed as -random variable with degrees of freedom. We say event holds whenever
(17)
Claim 4.10.
With probability at least over the draw of , event holds.
Proof.
The proof will simply follow from computing the expectation of the left-hand side of (17), and applying Markov’s inequality. In particular, for every , we can apply Lemma 3.1 to conclude that
The remaining part follows from the bound on . ∎
Lemma 4.11.
Proof.
The proof follows the same schema as Lemma 4.8. However, we give a bound on the variance of the estimator in order to improve upon the use of Markov’s inequality. Specifically, we compute the variance of a rescaling of (14).
(18)
By the same argument as in the proof of Lemma 4.4 (applying the triangle inequality twice),
(19)
Recall that we have the lower bound (2) and the upper bound (1). Hence, along with the definition of in (16) (we remove the bold-face as is fixed), we upper bound the left-hand side of (19) by
In particular, we may plug this in to (18) and use the definition of . Specifically, one obtains that the variance in (18) is upper bounded by
where in the final inequality, we used the fact that holds. Hence, letting be a large enough polynomial in implies the variance is smaller than , so we can apply Chebyshev’s inequality. ∎
Corollary 4.12.
Let be any set of points, and of size be centers for optimally -clustering . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
Proof.
We sample and as per Definition 4.5. Note that by Theorem 5 and the setting of , event holds with probability at least over the draw of . By Lemma 4.1, Corollary 4.9, and Claim 4.10, and the setting of and , the conditions (1), (2) and (3) of Lemma 4.11 hold with probability at least , with being set to a large enough constant, and hence event holds with probability at least . Finally, event holds with probability by Lemma 4.7, and taking a union bound and Lemma 4.6 gives the desired bound. ∎
5 Subspace Approximation
In the -subspace approximation problem, we consider a subspace of dimension less than , which we may encode by a collection of at most orthonormal vectors . We let denote the map which sends each vector to its closest point in , and note that
For any subset and any -dimensional subspace , we let
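A minimal sketch of this cost, representing the subspace by an orthonormal basis of its j directions (names are illustrative):

```python
import numpy as np

def subspace_approx_cost(X, V, z):
    """(j, z)-subspace approximation cost of the linear subspace spanned by the
    orthonormal rows of V: the z-norm of the distances from the rows of X to
    their orthogonal projections onto span(V)."""
    residuals = X - X @ V.T @ V          # x - proj_V(x) for every row x of X
    return np.linalg.norm(np.linalg.norm(residuals, axis=1), ord=z)
```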
In this section, we will show that we may compute the optimum -subspace approximation after applying a Johnson-Lindenstrauss transform.
Theorem 7 (Johnson-Lindenstrauss for -Subspace Approximation).
Let be any set of points, and let denote the optimal -subspace approximation of . For any , suppose we let be the distribution over Johnson-Lindenstrauss maps where
Then, with probability at least over the draw of ,
The proof of the above theorem proceeds in two steps, and models the argument in the previous section. First, we show that the cost of the optimum does not increase substantially (the right-most inequality in the theorem). This is done in the next subsection. The second step is showing that the optimum does not decrease substantially (the left-most inequality in the theorem). The second step is done in the subsequent subsection.
5.1 Easy Direction: Optimum Cost Does Not Increase
The first direction, which shows that the optimal -subspace approximation cost does not increase, follows similarly to Lemma 4.1.
Lemma 5.1.
Let be any set of points and let be the optimal -subspace approximation of . For any , we let be the distribution over Johnson-Lindenstrauss maps. If , then with probability at least over the draw of ,
and hence,
Proof.
Note that is a linear map, so if we let denote orthonormal unit vectors spanning , then are vectors spanning the subspace . Furthermore, we may consider the -dimensional subspace
Notice that for any , by linearity of ,
which means that we may always upper bound
(20)
It hence remains to upper bound the left-hand side of (20). We now use the fact that is drawn from a Johnson-Lindenstrauss distribution. Specifically, the lemma follows from applying Markov’s inequality once we show
which follows from Lemma 3.1. ∎
5.2 Hard Direction: Optimum Cost Does Not Decrease
5.2.1 Preliminaries
In the -subspace approximation problem, there will be a difference between complexities of known strong coresets and weak coresets. Our argument will only use weak coresets, which is important for us, as strong coresets have a dependence on (which we are trying to avoid).
Definition 5.2 (Weak Coresets for -subspace approximation).
Let be a set of points. A weak -coreset of for -subspace approximation is a weighted subset of points with weights such that,
Similarly to the case of -clustering, algorithms for building weak coresets proceed by sampling according to the sensitivity framework. We proceed by defining sensitivity functions in the context of subspace approximation, and then state a lemma which gives a sensitivity function that we will use.
Definition 5.3 (Sensitivities).
Let , and consider any set of points , as well as and . A sensitivity function for -subspace approximation in is a function satisfying that, for all ,
The total sensitivity of the sensitivity function is given by
For a sensitivity function, we let denote the sensitivity sampling distribution, supported on , which samples with probability proportional to .
We now state a specific sensitivity function that we will use. The proof will closely follow a method for bounding the total sensitivity of [VX12b]. The resulting weak -coreset will have a worse dependence than the best-known coresets for this problem; however, the specific form of the sensitivity function will be especially useful for us. Specifically, the non-optimality of the sensitivity function will not significantly affect the final bound on dimension reduction.
Lemma 5.4 (Theorem 18 of [VX12b]).
Let , and consider any set of points , as well as with , and . Suppose is the optimal -subspace approximation of in . Then, the function given by
is a sensitivity function for -subspace approximation of in , satisfying
Proof.
Consider any subspace of dimension at most . Then, for any
(21)
Notice that by the triangle inequality and Hölder’s inequality, and that since is the optimal -subspace approximation. Hence, dividing the left- and right-hand side of (21) we have
It remains to show that, for any set of points (in particular, the set ), and any ,
In particular, note that for any subspace of dimension at most , there exists a -dimensional subspace containing given by all vectors orthogonal to . In particular, since is contained in , and by the definition of . The bound on the total sensitivity then follows from Lemma 16 in [VX12b], where we use the fact that lies in a -dimensional subspace. ∎
We will use the following geometric theorem of [SV12] in our proof. The theorem says that an approximately optimal -subspace approximation lies in the span of a small set of points. We state the lemma for weighted point sets, even though [SV12] state it for unweighted points. We note that adding weights can be simulated by replicating points.
Lemma 5.5 (Theorem 3.1 [SV12]).
Let , and consider a weighted set of points with weights , as well as and . There exists a subset of and a -dimensional subspace within the span of satisfying
Lemma 5.4 gives us an appropriate sensitivity function, and Lemma 5.5 limits the search of the subspace to just a small set of points. Similarly to the case of -clustering, we can use this to construct weak -coresets for -subspace approximation. The following theorem is Theorem 5.10 from [HV20]. We state the theorem with the sensitivity function of Lemma 5.4.
Theorem 8 (Theorem 5.10 from [HV20]).
For any subset and , let denote the sensitivity sampling distribution from the sensitivity function of Lemma 5.4.
• Let denote the random (multi-)set and given by, for iterations, sampling i.i.d and letting .
• Then, with probability over the draw of , it is an -weak coreset for -subspace approximation of .
5.3 The Important Events
Similarly to the previous section, we define the important events, over the randomness in such that, if these are satisfied, then the optimum of -subspace approximation after dimension reduction does not decrease substantially. We first define the events, and then we prove that if the events are all satisfied, then we obtain our desired approximation.
Definition 5.6 (The Events).
Let , and the sensitivity sampling distribution of from Lemma 5.4. We consider the following experiment,
1. We generate a sample by sampling from for i.i.d iterations and set .
2. Furthermore, we sample , which is a Johnson-Lindenstrauss map .
3. We let denote the image of on .
The events are the following:
• : The weighted (multi-)set is a weak -coreset for -subspace approximation of in .
• : The map satisfies the following condition. For any choice of points of , is an -subspace embedding of the subspace spanned by these points.
• : Let denote the -dimensional subspace for optimal -subspace approximation of in . Then,
Lemma 5.7.
Let , and suppose and satisfy events , , and . Then,
Proof.
Consider a fixed and satisfying the three events of Definition 5.6. Let be the -dimensional subspace which minimizes . Then, by event , we have . Now, we apply Lemma 5.5 to , and we obtain a subset of size for which there exists a -dimensional subspace within the span of which satisfies
Note that is a -dimensional subspace lying in the span of . For any , we will use the fact that is satisfied to say that is an -subspace embedding of the subspace spanned by . This will enable us to find a subspace of dimension whose cost of approximating is at most .
Specifically, we write , as orthogonal unit vectors which span . Because lies in the span of , there are vectors in the span of which satisfy
Hence, the subspace given by the span of all vectors in is a -dimensional subspace lying in the span of . For , we may write the coefficients , and we may express projection as
which is a linear combination of . By event , is an -subspace embedding of the subspace spanned by , so
Combining the inequalities, we have
and finally, since is a -weak coreset, we obtain the desired inequality. ∎
We note that event will be satisfied with sufficiently high probability from Theorem 8. Furthermore, event is satisfied with sufficiently high probability from the following simple lemma. All that will remain is showing that event is satisfied.
Lemma 5.8.
Let be any set of points and , and let denote the Johnson-Lindenstrauss map, with
Then, with probability over the draw of , is an -subspace embedding for all subspaces spanned by vectors of .
Proof.
There are at most subspaces spanned by vectors of . If is a subspace embedding for all of them, we obtain our desired conclusion. We use Lemma 3.3 with set to a sufficiently small constant factor of , and a union bound. ∎
5.3.1 A Bad Approximation Guarantee
Lemma 5.9 (Warm-Up Lemma).
Fix any and let denote the -dimensional subspace for optimal -subspace approximation of in . Then, with probability at least over the draw of as per Definition 5.6,
In other words, event holds with probability at least .
Proof.
Similarly to the proof of Lemma 4.8, we compute the expectation of the left-hand side of the inequality and use Markov’s inequality.
(22)
∎
Corollary 5.10.
Let be any set of points. For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
5.3.2 Improving the Approximation
We now improve on the approximation of Corollary 5.10 in a fashion similar to that of Subsection 4.3.2. We will show that with large probability, holds. We let and be the subspace of dimension for optimal -subspace approximation of in . As before, we let be the sensitivities of with respect to (as in Lemma 5.4), and be the sensitivity sampling distribution.
We define one more event, with respect to the randomness in . Let denote the random variable given by
(23)
We say event holds whenever
(24)
which holds with probability at least over the draw of , similarly to the proof of Claim 4.10 and Lemma 5.4.
Lemma 5.11.
Proof.
Again, the proof is similar to that of Lemma 4.11, where we bound the variance of the estimator to apply Chebyshev’s inequality. In particular, we have
(25)
Similarly to the proof of Lemma 4.11, we will upper bound
as a function of and (given by (23) without boldface as is fixed). Toward this bound, we simplify the notation by letting and . Then, since is a linear map, for any
(26)
In particular, if we let be the matrix given by having the rows be points , then writing , we have is the matrix whose rows are ; in particular, one may compare the left- and right-hand sides of (26) by letting . Thus, we have
(27)
We may now apply the triangle inequality, as well as (1) and (2), and we have
(28)
Finally, we note that, similarly to the proof of Lemma 5.4,
(29)
Continuing to upper-bound the left-hand side of (27) by plugging in (26), (29) and (28),
In particular, the bound on the variance in (25) is at most
so letting gives the desired bound on the variance. ∎
Corollary 5.12.
Let be any set of points, and let be the subspace for optimal -subspace approximation of . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
Proof.
We sample and as per Definition 5.6. Again, Theorem 8 guarantees that occurs with probability at least over the draw of . By Lemma 5.1, Corollary 5.10 and (24), the conditions (1), (2), and (3) are satisfied with probability at least , so we may apply Lemma 5.11 and conclude that holds with probability at least . Finally, event holds with probability at least by Lemma 5.8. Taking a union bound and applying Lemma 5.7 gives the desired bound. ∎
6 -Flat Approximation
In the -flat approximation problem, we consider a subspace of dimension less than , which we may encode by a collection of at most orthonormal vectors , as well as a translation vector . The -flat specified by and is given by the affine subspace
We let denote the map which sends each to its closest point on , and we note that
For any , we let
In this section, we show that we may find the optimal -flat approximation after applying a Johnson-Lindenstrauss map. The proof will be almost exactly the same as for the -subspace approximation problem. Indeed, it only remains to incorporate a translation vector.
Theorem 9 (Johnson-Lindenstrauss for -Flat Approximation).
Let be any set of points, and let denote the optimal -flat approximation of . For any , suppose we let be a distribution over Johnson-Lindenstrauss maps where
Then, with probability at least over the draw of ,
6.1 Easy Direction: Optimum Cost Does Not Increase
Lemma 6.1.
Let be any set of points and let be the optimal -flat approximation of . We let be the distribution over Johnson-Lindenstrauss maps. If , then with probability at least over the draw of ,
and hence,
6.2 Hard Direction: Optimum Cost Does Not Decrease
6.2.1 Preliminaries
The proof in this section follows similarly to that of -subspace approximation.
Definition 6.2 (Weak Coresets for -flat approximation).
Let be a set of points. A weak -coreset of for -flat approximation is a weighted subset of points with weights such that,
Definition 6.3 (Sensitivities).
Let , and consider any set of points , as well as and . A sensitivity function for -flat approximation in is a function satisfying that, for all ,
The total sensitivity of the sensitivity function is given by
For a sensitivity function, we let denote the sensitivity sampling distribution, supported on , which samples with probability proportional to .
The sensitivity function we use here generalizes that of the previous section. In particular, the proof will follow similarly, and we will defer to the arguments in the previous section.
Lemma 6.4.
Let , and consider any set of points , as well as with and . Suppose is the optimal -flat approximation of in . Then, the function given by
is a sensitivity function for -flat approximation of in , satisfying
Proof.
Consider any -flat , given by a subspace of dimension at most , and a translation . As in the proof of Lemma 5.4,
We now have that for any , and any ,
Finally, for each , we may consider appending an additional coordinate and consider where the -th entry is . Then, by linearity
and the bound on the total sensitivity follows from Lemma 5.4. ∎
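A small sketch of the lifting step used in this proof: appending a constant coordinate turns an affine j-flat in the original space into a (j+1)-dimensional linear subspace through the origin one dimension up. The exact normalization used in the proof is not reproduced here, and the names are illustrative.

```python
import numpy as np

def lift_points(X):
    """Append a constant last coordinate: each row x in R^d becomes (x, 1) in R^{d+1}."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def lift_flat(B, c):
    """Lift the affine flat {c + B^T a} (orthonormal rows B, offset c) to the linear
    subspace of R^{d+1} spanned by the vectors (b_i, 0) and (c, 1); returns an
    orthonormal basis of that (j+1)-dimensional subspace."""
    j = B.shape[0]
    raw = np.vstack([np.hstack([B, np.zeros((j, 1))]), np.append(c, 1.0)[None, :]])
    Q, _ = np.linalg.qr(raw.T)           # orthonormalize the spanning vectors
    return Q.T
```

One can check directly that the lifted point (x, 1) is at distance at most dist(x, F) from the lifted subspace, since the lift of the projection of x onto the flat F lies in that subspace.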
In the -subspace approximation section, we used a lemma (Lemma 5.5) to narrow down the approximately optimal subspaces to those spanned by at most points. Here, we use a similar lemma in order to find an approximately optimal translation vector which lies in the span of a small subset of points.
Lemma 6.5 (Lemma 3.3 [SV07]).
Let , and consider a weighted set of points with weights , as well as and . Suppose is the optimal -flat approximation of , encoded by a -dimensional subspace and translation vector . There exists a subset of size and a point such that the -flat
satisfies
Theorem 10 (-Weak Coresets for -Flats via Sensitivity Sampling).
For any subset and , let denote the sensitivity sampling distribution.
• Let denote the random (multi-)set and given by, for iterations, sampling i.i.d and letting .
• Then, with probability over the draw of , it is an -weak coreset for -flat approximation of .
6.3 The Important Events
The important events we consider mirror those of the subspace approximation problem. The only event which would change is , where we require to be an -subspace embedding for all subsets of points from . This will allow us to incorporate the translation from Lemma 6.5.
Definition 6.6 (The Events).
Let , and the sensitivity sampling distribution of from Lemma 6.4. We consider the following experiment,
1. We generate a sample by sampling from for i.i.d iterations and set .
2. Furthermore, we sample , which is a Johnson-Lindenstrauss map .
3. We let denote the image of on .
The events are the following:
• : The weighted (multi-)set is a weak -coreset for -flat approximation of in .
• : The map satisfies the following condition. For any choice of points of , is an -subspace embedding of the subspace spanned by these points.
• : Let denote the optimal -flat approximation of in . Then,
Lemma 6.7.
Let , and suppose and satisfy events , , and . Then,
Proof.
Consider a fixed and satisfying the three events of Definition 6.6. Let be the -flat which minimizes . Suppose that is specified by a -dimensional subspace and a translation . Then, by event , we have . Now, we apply Lemma 6.5 to , and we obtain a subset of size for which there exists a translation vector within the span of such that the -flat given by and satisfies
Furthermore, applying Lemma 5.5 to the weighted vectors (here, we are using the short-hand ), there exists a subset of size and a -dimensional subspace within the span of such that the -flat specified by and satisfies
Recall that (i) is a -dimensional subspace lying in the span of , (ii) is within , and (iii) for any , is an -subspace embedding of the span of . Similarly to Lemma 5.7, we may find a -flat such that for every ,
and hence
Finally, since is a -weak coreset, we obtain the desired inequality. ∎
As in the previous section, events and hold with sufficiently high probability. All that remains is showing that holds with sufficiently high probability. We proceed in a similar fashion, where we first show a loose approximation guarantee, and later improve on it.
Lemma 6.8.
Fix any and let denote the -flat for optimal -flat approximation of in . Then with probability at least over the draw of as per Definition 6.6,
In other words, event holds with probability at least .
Corollary 6.9.
Let be any set of points. For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
6.3.1 Improving the approximation
The improvement of the approximation follows from upper bounding the variance, as in the -clustering problem and the -subspace approximation problem. In particular, we show that holds. Fix and let be the optimal -flat approximation of in . The sensitivity function specified in Lemma 6.4 specifies the sensitivity sampling distribution .
We let denote the following event with respect to the randomness in . For each , we let denote the random variable
and, as in (23) and (24), event (which occurs with probability at least ) holds whenever
Lemma 6.10.
Proof.
We similarly bound the variance of
It is not hard to show, as in the proof of Lemma 5.11, writing and , that
and similarly to before, we have
This implies that the variance is at most
by setting to be large enough when holds, and we apply Chebyshev’s inequality. ∎
Corollary 6.11.
Let be any set of points, and let be the optimal -flat approximation of . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
7 -Line Approximation
In the -line approximation problem, we consider a collection of lines in . A line is encoded by a vector and a unit vector , where we will write
For a single line encoded by and , we write as the orthogonal projection of a point onto , i.e., the closest vector which lies on the line, where
For any set of lines, , and a point , we write
and for any dataset and set of lines , we consider the map which sends a point to its nearest line in . Then, we write
as the cost of representing the points in according to the lines in . In this section, we show that we may find the optimal -line approximation after applying a Johnson-Lindenstrauss map. Specifically, we prove:
Theorem 11 (Johnson-Lindenstrauss for -Line Approximation).
Let  be any set of  points, and let  denote a set of  lines in  for optimal -line approximation of . For any , suppose we let  be a distribution over Johnson-Lindenstrauss maps  where
Then, with probability at least over the draw of ,
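Before turning to the proof, the following is a minimal NumPy sketch of the cost function defined at the start of this section: each point is charged the distance to its nearest line, raised to a power z. The function names are ours, and z is left as a parameter since the exponent is not fixed in the excerpt above.

```python
import numpy as np

def project_onto_line(x, a, v):
    """Orthogonal projection of x onto the line {a + t*v : t in R}; v is a unit vector."""
    return a + np.dot(x - a, v) * v

def k_line_cost(X, lines, z=2):
    """Sum over points x of min over lines of dist(x, line)**z.
    `lines` is a list of (a, v) pairs with unit-norm direction v; the power z
    is a placeholder for the exponent used in the text."""
    total = 0.0
    for x in X:
        dists = [np.linalg.norm(x - project_onto_line(x, a, v)) for (a, v) in lines]
        total += min(dists) ** z
    return total
```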
7.1 Easy Direction: Optimum Cost Does Not Increase
Lemma 7.1.
Let be any set of points and let be a set of lines in for optimal -line approximation of , and for each , let be the line assigned to . We let be the distribution over Johnson-Lindenstrauss maps. If , then with probability at least over the draw of ,
and hence,
By now, there is a straight-forward way to prove the above lemma. For each set of  lines in , there is an analogous definition of  lines in . Hence, as in previous sections, we use the lemma on Gaussian Johnson-Lindenstrauss maps (lem:gaussians) to upper bound the cost .
7.2 Hard Direction: Optimum Cost Does Not Decrease
7.2.1 Preliminaries
At a high level, we proceed with the same argument as in previous sections: we consider a sensitivity function for -line approximation of  in , and use it to build a weak coreset, as well as argue that sensitivity sampling is a low-variance estimator of the optimal -line approximation in the projected space. The proof in this section will be significantly more complicated than in the previous sections. Defining the appropriate sensitivity functions, which will give a low-variance estimator in the projected space, is considerably more difficult than the expressions of Lemmas 4.4, 5.4, and 6.4. For this reason, we will proceed by assuming access to a sensitivity function which we will define later in the section.
Definition 7.2 (Weak Coresets for -Line Approximation).
Let be a set of points. A weak -coreset of for -line approximation is a weighted subset of points with weights such that
Definition 7.3 (Sensitivities).
Let , and consider any set of points , as well as and . A sensitivity function for -line approximation in is a function which satisfies that, for all ,
The total sensitivity of the sensitivity function is given by
For a sensitivity function, we let denote the sensitivity sampling distribution, supported on , which samples with probability proportional to .
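To illustrate how the sensitivity sampling distribution is used, here is a short sketch that draws i.i.d. samples proportionally to the sensitivities. The reweighting by 1/(m·p(x)) is the standard convention for unbiased importance-sampling estimates; it is our assumption here, since the exact weights are not visible in the excerpt, and the function name is ours.

```python
import numpy as np

def sensitivity_sample(X, sensitivities, m, rng=np.random.default_rng(0)):
    """Draw m i.i.d. samples from the sensitivity sampling distribution.
    X is an (n, D) array of points. Weights follow the standard convention
    w = 1 / (m * p(x)); the paper's exact normalization is not reproduced here."""
    s = np.asarray(sensitivities, dtype=float)
    p = s / s.sum()                        # sampling probabilities
    idx = rng.choice(len(X), size=m, p=p)  # i.i.d. draws
    weights = 1.0 / (m * p[idx])           # unbiased reweighting
    return X[idx], weights
```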
Similarly to before, we first give a lemma which narrows down the space of the optimal line approximations for a set of points. The following lemma is a re-formulation of Lemma 5.5 and Lemma 6.5, catered to the case of -line approximation.
Lemma 7.4 (Theorem 3.1 and Lemma 3.3 of [SV12]).
Let and be any set of points with weights , , and . There exists a subset of size and a line in within the span of such that
Lemma 7.5 (Weak Coresets for -Line Approximation [FL11, VX12b]).
For any subset  and , let  denote a sensitivity function for -line approximation of  with total sensitivity  and let  denote its sensitivity sampling distribution.
-
•
Let denote the random (multi-)set and given by, for
iterations, sampling i.i.d and letting .
-
•
Then, with probability over the draw of , it is an -weak coreset for -line approximation of .
We note that [FL11] and [VX12b] only give a strong coreset for -line approximation of . For example, Theorem 13 in [VX12b], which gives the above bound, follows from the fact that the “function dimension” (see Definition 3 of [VX12b]) for -line approximation is . However, Lemma 7.4 implies that for any set of points, a line which approximates the points lies within the span of  points. This means that, for -weak coresets, it suffices to only consider lines spanned by , giving us a “function dimension” of .
7.2.2 The Important Events
Definition 7.6 (The Events).
Let , and be a sensitivity function for -line approximation of in , with total sensitivity and sensitivity sampling distribution . We consider the following experiment,
-
1.
We generate a sample by sampling from for i.i.d iterations and set .
-
2.
Furthermore, we sample , which is a Johnson-Lindenstrauss map .
-
3.
We let denote the image of on .
The events are the following:
-
•
: The weighted (multi-)set is a weak -coreset for -line approximation of in .
-
•
: For any subset of points from , the map is an -subspace embedding for the subspace spanned by that subset.
-
•
: Let denote lines in for optimal -line approximation of in . Then,
Lemma 7.7.
Let , and suppose and satisfy events and . Then,
Proof.
Let and be sampled according to Definition 7.6, and suppose events and all hold. Let denote the set of lines for optimal -line approximation of in . Then, by event , we have . Consider the partition of into induced by the lines in closest to .
For each , we apply Lemma 7.4 to  with weights . In particular, there exist subsets  and lines  in  such that each line  lies in the span of , and
Event  implies that for each  and each ,  is an -subspace embedding for the subspace spanned by . It is not hard to see that there exist lines  in  such that for all ,
and therefore,
Lastly, is a -weak coreset for , which means that
Combining all inequalities gives the desired lemma. ∎
By now, we note that it is straight-forward to prove the following corollary, which gives a dimension reduction bound which depends on the total sensitivity of a sensitivity function.
Corollary 7.8.
Let be any set of points, and for and , let be a sensitivity function for -line approximation of in . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
7.3 A Sensitivity Function for -Line Approximation
We now describe a sensitivity function for -line approximation of points in . Similarly to the previous section, we consider a set of points , and we design a sensitivity function for -line approximation of in . The sensitivity function should satisfy two requirements. The first is that we have a good bound on the total sensitivity, , where the target dimension will have logarithmic dependence on (for example, like in Corollary 7.8).
The second is that will hold with sufficiently high probability over the draw of . In other words, we will proceed similarly to Lemmas 4.11, 5.11, and 6.10 and show that, for the optimal -line approximation of in , sampling according to the sensitivity sampling distribution gives a low-variance estimate for the cost of .
7.3.1 From Coresets for -line approximation to Sensitivity Functions
Unfortunately, we do not know of a “clean” description of a sensitivity function for -line approximation, as was the case in previous definitions. Certainly, one may define a sensitivity function to be , but then arguing that holds with high probability becomes more complicated. The sensitivity function which we present follows the connection between sensitivity and -coresets [VX12a].
Definition 7.9 (-coresets for -line approximation).
Let be any subset of points, and . A subset is a -coreset for -line approximation if the following holds:
-
•
Let be any collection of lines in , and such that for all ,
-
•
Then, for all ,
Note that -line approximation is the minimum enclosing cylinder problem: we are given a set of points , and want to find a set of  cylinders of smallest radius such that . Thus, by Definition 7.9, a set  is a -coreset for -line approximation if, given any  cylinders which contain , increasing their radii by a factor of  yields cylinders which contain . The reason these coresets are relevant for defining a sensitivity function is the simple lemma below (Lemma 7.10), whose main idea is from [VX12a].
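Before stating the lemma, the sketch below is a sanity check of the defining property as we read it, tested against one fixed family of cylinders; Definition 7.9 quantifies over all families of lines and radii, so this is only a checker for a single witness. The names and the distance computation are ours.

```python
import numpy as np

def dist_to_line(x, a, v):
    """Distance from point x to the line {a + s*v : s in R}; v is a unit vector."""
    r = x - a
    return np.linalg.norm(r - np.dot(r, v) * v)

def respects_t_coreset_property(X, S, lines, radii, t):
    """Check the t-coreset property for ONE fixed family of cylinders:
    if every point of S lies in some cylinder (line, radius), then every
    point of X must lie in the cylinders after inflating each radius by t."""
    def covered(x, blowup):
        return any(dist_to_line(x, a, v) <= blowup * r for (a, v), r in zip(lines, radii))
    if not all(covered(s, 1.0) for s in S):
        return True   # the premise fails, so the property holds vacuously for this family
    return all(covered(x, t) for x in X)
```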
Lemma 7.10 (Sensitivities from -coresets for -line approximation (see Lemma 3.1 in [VX12a])).
Let be any set of points and , . Let be the lines in for optimal -line approximation of , and let where . For , let the function be defined as follows:
-
•
Let denote a partition of where each is a -coreset for -line approximation of .
-
•
For each , where we let
Then, is a sensitivity function for -line approximation, and the total sensitivity
Proof.
Suppose and . Consider any set of lines in . The goal is to show that
(30) |
We will first use Hölder’s inequality and the triangle inequality, as well as the fact that  in order to write the following:
(31) |
The justification for (31) is the following: for every ,  is a -coreset for -line approximation of a set of points which contains . Therefore, if , there must exist a point  with . Suppose otherwise: every  satisfies . Then, the cylinders of radius  contain , so increasing the radii by a factor of  yields cylinders which contain . However, this means , which is a contradiction.
Hence, we always have that satisfies
Continuing to upper bound (31), we now use the fact that  because  is the optimal -line approximation. Therefore,
so dividing by and noticing that implies is a sensitivity function.
The bound on total sensitivity then follows from
since . ∎
7.3.2 A simple coreset for -line approximation of one-dimensional instances
Suppose first that a dataset  lies on a line in , and let  be a collection of cylinders. Then, the intersection of the cylinders with the line results in a union of intervals on the line. If we increase the radius of each cylinder by a factor of , the length of each interval is scaled by at least a factor of  (while keeping the center of the interval fixed). We first show that, for any  which lie on a line, there exists a small subset  such that: if  is any collection of intervals which covers , then increasing the length of each interval by a factor of  (while keeping the center of the interval fixed) covers .
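The reduction to intervals can be checked numerically. The sketch below, with names of our choosing, computes the parameter interval that a cylinder (an axis line plus a radius) cuts out of the data line, and verifies that inflating the radius by a factor t yields an interval containing the t-fold dilation of the original interval about its unchanged center, consistent with the "scaled by at least a factor of t" claim above.

```python
import numpy as np

def cylinder_interval(a, u, b, w, r):
    """Parameters s with dist(a + s*u, line {b + t*w}) <= r, for unit vectors u, w.
    Returns (lo, hi), or None if the cylinder misses the data line."""
    c = a - b
    p0 = c - np.dot(c, w) * w        # component of c orthogonal to the cylinder axis
    q0 = u - np.dot(u, w) * w        # component of u orthogonal to the cylinder axis
    # dist^2(s) = ||q0||^2 s^2 + 2<p0,q0> s + ||p0||^2, a quadratic in s
    A, B, C = np.dot(q0, q0), 2 * np.dot(p0, q0), np.dot(p0, p0) - r * r
    disc = B * B - 4 * A * C
    if A == 0 or disc < 0:
        return None
    return (-B - np.sqrt(disc)) / (2 * A), (-B + np.sqrt(disc)) / (2 * A)

# Inflating the radius by t contains the interval dilated by t about its center.
rng = np.random.default_rng(1)
a, b = rng.normal(size=3), rng.normal(size=3)
u, w = [x / np.linalg.norm(x) for x in rng.normal(size=(2, 3))]
r, t = 2.0, 3.0
I = cylinder_interval(a, u, b, w, r)
if I is not None:
    J = cylinder_interval(a, u, b, w, t * r)
    mid, half = (I[0] + I[1]) / 2, (I[1] - I[0]) / 2
    assert J[0] <= mid - t * half + 1e-9 and mid + t * half <= J[1] + 1e-9
```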
Lemma 7.11.
There exists a large enough constant such that the following is true. Let be a set of points lying on a line in , and . There exists a subset which is a -coreset for -line approximation of size at most .
Proof.
The construction is recursive. Let be the line containing , and after choosing an arbitrary direction on , let be the points in sorted order according to the chosen direction.
The set is initially empty, and we include . Suppose that (the construction is symmetric, with and switched otherwise). We divide into two sets, the subsets and . Then, we perform three recursive calls: (i) we let be a -coreset for -line approximation of , (ii) we let be a -coreset for -line approximation of , and (iii) we let be a -coreset for -line approximation of . We add , and to .
The proof of correctness argues as follows. Let be an arbitrary collection of cylinders which covers . The goal is to show that increasing the radius of by a factor of 3 covers . Let be the intervals given by . We let the indices be such that is the first interval which contains , and the last interval which contains . We note that covers . We must show that if we increase the length of each interval by a factor of , we cover . We consider three cases:
-
•
Suppose there exists an index such that and both lie in the interval . Recall , and all points are contained within and . Hence, when we increase the length of by a factor of while keeping center fixed, and lie in the same interval, and thus cover .
-
•
Suppose  and  lie in different intervals, but there exists  such that  and  lie in the interval . Then, since all points of  lie between  and ,  covers . Since  covers  and  is a -coreset for -line approximation of , increasing the length of each interval by a factor of  covers , and therefore all of .
-
•
Suppose ,  and  all lie in different intervals. Then, since  and  are not on the same interval, the intervals  cover . Similarly,  and  are not on the same interval, so the intervals  cover . Since  is a -coreset for -line approximation of , increasing the length of each interval by a factor of  covers all of . In addition,  is a -coreset for -line approximation of , so increasing the length of each interval by a factor of  covers .
This concludes the correctness of the coreset, and it remains to upper bound the size. Let be an upper bound on the coreset size of -line approximation of a subset of size . We have , since any single interval which covers and covers everything in between them. By our recursive construction, we have
By a simple induction, one can show is at most when , for large enough constant and large enough . ∎
7.3.3 The coreset for points on lines and the effect of dimension reduction
Lemma 7.12.
There exists a large enough constant such that the following is true. Let be a set of points lying on lines in . There exists a subset which satisfies the following two requirements:
-
1.
is a -coreset for -line approximation of size at most .
-
2.
If is a linear map, then is a -coreset for -line approximation of .
Proof.
Let be the partition of into points lying on the lines of , respectively. We may write each line by two vectors , and have
Let be the -coreset for -line approximation of specified by Lemma 7.11. We let be the union of all . Item 1 follows from Lemma 7.11, since we are taking the union of coresets.
We now argue Item 2. Since is a linear map, and every point in lies on the line , there exists a map where each satisfies
In other words, lies within a line in . We note that the relative order of points in remains the same, since for any two points ,
We note that the construction of Lemma 7.11 only considers the order of points in , as well as the ratio of distances. Therefore, executing the construction of Lemma 7.11 on the points returns the set . ∎
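The two facts used in the proof, namely that a linear map sends points on a line to points on a line and that it preserves the ordering and the ratios of distances along that line, can be verified numerically. The sketch below is ours and uses a Gaussian map for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 50, 5, 8

# Points on a line {a + s*u} in R^D, at increasing parameters s_1 < ... < s_n.
a, u = rng.normal(size=D), rng.normal(size=D)
u /= np.linalg.norm(u)
s = np.sort(rng.uniform(-1, 1, size=n))
X = a + np.outer(s, u)

# A (scaled) Gaussian map Pi : R^D -> R^d, applied to every point.
Pi = rng.normal(size=(d, D)) / np.sqrt(d)
Y = X @ Pi.T

# Pi(a + s*u) = Pi a + s * Pi u, so the images lie on a line with the same
# parameters s; distances along the line all scale by the common factor
# ||Pi u||, so the ordering and the ratios of distances are unchanged.
scale = np.linalg.norm(Pi @ u)
orig = np.linalg.norm(np.diff(X, axis=0), axis=1)
proj = np.linalg.norm(np.diff(Y, axis=0), axis=1)
assert np.allclose(proj, scale * orig)
```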
Corollary 7.13.
Let be a set of points lying on lines in .
-
•
Let denote a partition of where each is a -coreset for -line approximation of from Lemma 7.12 on the set .
-
•
Let be any linear map.
For any set of lines in , if , we have
7.4 Improving the approximation
We now instantiate the sensitivity function of Lemma 7.12, and use Corollary 7.8 and Lemma 7.7 in order to improve on the approximation. Similarly to before, we show that event occurs with sufficiently high probability over the draw of and by giving an upper bound on the variance as in Lemma 4.11, Lemma 5.11, and Lemma 6.10.
Fix , and let be the optimal -line approximation of in . For , we let be given by , and we denote the set . The sensitivity function is specified by Lemma 7.10. Recall that we first let denote a partition , where is the -coreset for -line approximation of from Lemma 7.12. For with , we have
We let denote the following event with respect to the randomness in . For each , we let denote the random variable
and as in previous sections, event , which occurs with probability at least , whenever
Lemma 7.14.
Proof.
We bound the variance,
(32) |
We note that, as before, we will apply Hölder’s inequality and the triangle inequality, followed by Corollary 7.13. Specifically, suppose with , then,
We note that, combining the two preceding bounds, we have . So the above simplifies to
We now continue upper bounding (32), where the variance becomes less than
since event  holds. Since , we obtain our desired bound on the variance by letting  be a large enough polynomial in , , and . ∎
Corollary 7.15.
Let be any set of points, and let be the optimal set of lines for -line approximation of . For any , let be the Johnson-Lindenstrauss map with
Then, with probability at least over the draw of ,
Appendix A On preserving “all solutions” and comparisons to prior work
This section serves two purposes:
-
1.
To help compare the guarantees of this work to those of prior works on -clustering [MMR19] and -subspace approximation [CEM+15], expanding on the discussion in the introduction. In short, for -clustering, the results of [MMR19] are qualitatively stronger than the results obtained here. In -subspace approximation, the “for all” guarantees of [CEM+15] are for the qualitatively different problem of low-rank approximation. While the costs of low-rank approximation and -subspace approximation happen to agree at the optimum, the notion of a candidate solution is different.
-
2.
To show that, for the two related problems of “medoid” and “column subset selection,” one cannot apply the Johnson-Lindenstrauss transform to dimension  while preserving the cost. The medoid problem is a center-based clustering problem, and the column subset selection problem is a subspace approximation problem. The instances we will construct for these problems are very symmetric, so that uniform sampling gives small coresets. These give concrete examples ruling out a theorem which directly relates the size of coresets to the effect of the Johnson-Lindenstrauss transform.
Center-Based Clustering
Consider the following (slight) modification of center-based clustering, known as the “medoid” problem.
Definition A.1 (-medoid problem).
Let be any set of points. The -medoid problem asks to optimize
Notice the difference between -medoid and -mean: in -medoid the center is restricted to be from within the set of points , whereas in -mean the center is arbitrary. Perhaps surprisingly, this modification has a dramatic effect on dimension reduction.
Theorem 12.
For large enough , there exists a set of points (in particular, given by the -basis vector ) such that, with high probability over the draw of where ,
Theorem 12 gives a very strong lower bound for dimension reduction for -medoid, showing that decreasing the dimension to any  does not preserve (even the optimal) solutions to better than a factor of . This is in stark contrast to the results on center-based clustering, where the -mean problem can preserve the solutions up to -approximation without any dependence on  or . The proof itself is also very straight-forward: each  is an independent Gaussian vector in , and if , with high probability, there exists an index  where . In a similar vein, with high probability . We take a union bound and set the center  for the index  where . By the Pythagorean theorem, the cost of this -medoid solution is at most . On the other hand, every -medoid solution in  has cost .
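The gap in Theorem 12 is easy to observe numerically. The sketch below instantiates the hard instance with the standard basis vectors and a very low target dimension; we use squared Euclidean distances for concreteness, since the exponent is not visible in the excerpt, and the Gaussian matrix stands in for the Johnson-Lindenstrauss distribution. The original 1-medoid cost is exactly 2(n-1), while the projected cost drops toward n once some projected basis vector has tiny norm.

```python
import numpy as np

def one_medoid_cost(P):
    """min over centers c in P of sum_x ||x - c||^2 (squared-distance medoid cost)."""
    norms = (P ** 2).sum(axis=1)
    sq = norms[:, None] + norms[None, :] - 2 * P @ P.T   # pairwise squared distances
    return float(sq.sum(axis=0).min())

rng = np.random.default_rng(0)
n, d = 2000, 2                        # very low target dimension
X = np.eye(n)                         # hard instance: the standard basis vectors in R^n
Pi = rng.normal(size=(d, n)) / np.sqrt(d)
Y = X @ Pi.T                          # each row is an independent Gaussian vector in R^d

print("original 1-medoid cost :", one_medoid_cost(X))   # equals 2(n - 1)
print("projected 1-medoid cost:", one_medoid_cost(Y))   # noticeably smaller for tiny d
```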
We emphasize that Theorem 12 does not contradict [MMR19, BBCA+19], even though it rules out that “all candidate centers” are preserved. The reason is that the notion of “candidate solution” is different. Informally, [MMR19] shows that for any dataset of vectors and any , , applying the Johnson-Lindenstrauss map with satisfies the following guarantee: for all partitions of into sets, , the following is true:
The “for all” quantification over clusterings  is different (as seen from the -medoid example) from the “for all” over centers .
Subspace Approximation
The same subtlety appears in subspace approximation. Here, we can consider the -column subset selection problem, which, at a high level, is the medoid version of subspace approximation. We want to approximate a set of points by their projections onto the subspace spanned by one of those points.
Definition A.2 (-column subset selection).
Let be any set of points. The -column subset selection problem asks to optimize
Again, notice the difference between -column subset selection and -subspace approximation: the subspace is restricted to be in the span of a point from . Given Theorem 12, it is not surprising that Johnson-Lindenstrauss cannot reduce the dimension of -column subset selection to without incurring high distortions.
Theorem 13.
For large enough , there exists a set of points such that, with high probability over the draw of where ,
The proof is slightly more involved. The instance sets , and sets  where . For any subspace spanned by any of the points , via a straight-forward calculation, the distance between  and  is  when , and therefore, the cost of -column subset selection in  is . We apply dimension reduction to  and we think of  as the independent Gaussian vectors given by . As in the -medoid case, there exists an index  for which , and notice that when this occurs,  is essentially  (because ). Letting  be the subspace spanned by , we get that the distance between  and the projection is at most . This latter fact is because the subspace spanned by  is essentially spanned by . Therefore, the cost of the -column subset selection of  is at most .
As above, Theorem 13 does not contradict [CEM+15], even though it means that the claim that “all candidate subspaces” are preserved needs to be interpreted carefully. The notion of “candidate solutions” is different. In the matrix notation that [CEM+15] uses, the points in  are stacked into rows of an  matrix (which we denote ). A Johnson-Lindenstrauss map is represented by a  matrix, and applying the map to every point in  corresponds to the operation  (which is now an  matrix). [CEM+15] shows that if  is sampled with , the following occurs with high probability. For all rank- projection matrices , we have
Note that when we multiply the matrix on the left-hand side by , we are projecting the columns of to a -dimensional subspace of . This is different from approximating all points in with a -dimensional subspace in , which would correspond to finding a rank- projection matrix and considering . In the matrix notation of [CEM+15], the dimension reduction result for -subspace approximation says that
(33) |
At the optimal  and the optimal , the costs coincide (a property which holds only for ). Thus, [CEM+15] implies (33), but it does not say that the costs of all subspaces of  are preserved (as there is a type mismatch between the rank- projections on the left- and right-hand sides of (33)).
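Equation (33), restricted to the squared-distance case, can be checked numerically: the optimal rank-k Frobenius error of the projected data should be within roughly a (1 ± ε) factor of that of the original data when the target dimension scales like k/ε². The sketch below is ours; the data matrix and the choice of d are only illustrative.

```python
import numpy as np

def best_rank_k_error(A, k):
    """Optimal k-subspace approximation cost of the rows of A under squared
    Euclidean distances: the sum of squared singular values beyond the top k."""
    s = np.linalg.svd(A, compute_uv=False)
    return float((s[k:] ** 2).sum())

rng = np.random.default_rng(0)
n, D, k, eps = 500, 200, 3, 0.25
# A low-rank signal plus noise; rows are the data points in R^D.
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, D)) + 0.1 * rng.normal(size=(n, D))

d = int(np.ceil(k / eps ** 2))                 # illustrative target dimension
Pi = rng.normal(size=(D, d)) / np.sqrt(d)      # Gaussian JL map, applied to each row

orig = best_rank_k_error(A, k)
proj = best_rank_k_error(A @ Pi, k)
print(f"ratio of optimal costs (projected / original): {proj / orig:.3f}")
```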
References
- [AHPV05] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 2005.
- [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians (ICM ’2018), 2018.
- [AP03] Pankaj K. Agarwal and Cecilia M. Procopiuc. Approximation algorithms for projective clustering. Journal of Algorithms, 46(2):115–139, 2003.
- [BBCA+19] Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for -means: Beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
- [BBH+20] Daniel Baker, Vladimir Braverman, Lingxiao Huang, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in graphs of bounded treewidth. In Proceedings of the 37th International Conference on Machine Learning (ICML ’2020), 2020.
- [BFL16] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
- [BHPI02] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on the Theory of Computing (STOC ’2002), 2002.
- [BZD10] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for -means clustering. In Proceedings of Advances in Neural Information Processing Systems 23 (NeurIPS ’2010), 2010.
- [CALSS22] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for -median and -means coresets. In Proceedings of the 54th ACM Symposium on the Theory of Computing (STOC ’2022), 2022.
- [CASS21] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd ACM Symposium on the Theory of Computing (STOC ’2021), 2021.
- [CEM+15] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Mădălina Persu. Dimensionality reduction for -means clustering and low rank approximation. In Proceedings of the 47th ACM Symposium on the Theory of Computing (STOC ’2015), 2015.
- [Che09] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
- [CN21] Yeshwanth Cherapanamjeri and Jelani Nelson. Terminal embeddings in sublinear time. In Proceedings of the 62nd Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2021), 2021.
- [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
- [DRVW06] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(1):225–247, 2006.
- [EFN17] Michael Elkin, Arnold Filtser, and Ofer Neiman. Terminal embeddings. Theoretical Computer Science, 697:1–36, 2017.
- [EV05] Michael Edwards and Kasturi Varadarajan. No coreset, no cry: II. In Proceedings of the 25th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2005), 2005.
- [Fel20] Dan Feldman. Introduction to core-sets: an updated survey. 2020.
- [FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on the Theory of Computing (STOC ’2011), 2011.
- [FSS13] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th ACM-SIAM Symposium on Discrete Algorithms (SODA ’2013), 2013.
- [FSS20] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
- [HIM12] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.
- [HP04] Sariel Har-Peled. No coreset, no cry. In Proceedings of the 24th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2004), 2004.
- [HPM04] Sariel Har-Peled and Soham Mazumdar. On coresets for -means and -median clustering. In Proceedings of the 36th ACM Symposium on the Theory of Computing (STOC ’2004), 2004.
- [HPV02] Sariel Har-Peled and Kasturi Varadarajan. Projective clustering in high-dimensions using core-sets. In Proceedings of the 18th ACM Symposium on Computational Geometry (SoCG ’2002), 2002.
- [HV20] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in Euclidean spaces: importance sampling is nearly optimal. In Proceedings of the 52nd ACM Symposium on the Theory of Computing (STOC ’2020), 2020.
- [IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on the Theory of Computing (STOC ’1998), pages 604–613, 1998.
- [ISZ21] Zachary Izzo, Sandeep Silwal, and Samson Zhou. Dimensionality reduction for Wasserstein barycenter. In Proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS ’2021), 2021.
- [JL84] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. 1984.
- [KR15] Michael Kerber and Sharath Raghvendra. Approximation and streaming algorithms for projective clustering via random projections. In Proceedings of the 27th Canadian Conference on Computational Geometry (CCCG ’2015), 2015.
- [LS10] Michael Langberg and Leonard J. Schulman. Universal epsilon-approximators for integrals. In Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms (SODA ’2010), 2010.
- [Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
- [MMMR18] Sepideh Mahabadi, Konstantin Makarychev, Yuri Makarychev, and Ilya Razenshteyn. Nonlinear dimension reduction via outer bi-Lipschitz embeddings. In Proceedings of the 50th ACM Symposium on the Theory of Computing (STOC ’2018), 2018.
- [MMR19] Konstantin Makarychev, Yuri Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for -means and -medians clustering. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
- [NN19] Shyam Narayanan and Jelani Nelson. Optimal terminal dimensionality reduction in Euclidean space. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
- [NSIZ21] Shyam Narayanan, Sandeep Silwal, Piotr Indyk, and Or Zamir. Randomized dimensionality reduction for facility location and single-linkage clustering. In Proceedings of the 38th International Conference on Machine Learning (ICML ’2021), 2021.
- [Sar06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2006), 2006.
- [SV07] Nariankadu D. Shyamalkumar and Kasturi Varadarajan. Efficient subspace approximation algorithms. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA ’2007), 2007.
- [SV12] Nariankadu D. Shyamalkumar and Kasturi Varadarajan. Efficient subspace approximation algorithms. Discrete and Computational Geometry, 47(1):44–63, 2012.
- [SW18] Christian Sohler and David Woodruff. Strong coresets for -median and subspace approximation. In Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2018), 2018.
- [TWZ+22] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. arXiv preprint arXiv:2203.04370, 2022.
- [VX12a] Kasturi Varadarajan and Xin Xiao. A near-linear algorithm for projective clustering of integer points. In Proceedings of the 33rd ACM-SIAM Symposium on Discrete Algorithms (SODA ’2012), 2012.
- [VX12b] Kasturi Varadarajan and Xin Xiao. On the sensitivity of shape fitting problems. In Proceedings of the 32nd International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2012), 2012.
- [Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.