
The Johnson-Lindenstrauss Lemma for Clustering and Subspace Approximation: From Coresets to Dimension Reduction

Moses Charikar
Stanford University
email: moses@cs.stanford.edu. Moses Charikar is supported by a Simons Investigator Award.
   Erik Waingarten
University of Pennsylvania
email: ewaingar@seas.upenn.edu. Part of this work was done while Erik Waingarten was a postdoc at Stanford University, supported by an NSF postdoctoral fellowship and by Moses Charikar’s Simons Investigator Award.
Abstract

We study the effect of Johnson-Lindenstrauss transforms in various projective clustering problems, generalizing recent results which only applied to center-based clustering [MMR19]. We ask the general question: for a Euclidean optimization problem and an accuracy parameter ε(0,1)\varepsilon\in(0,1), what is the smallest target dimension tt\in\mathbbm{N} such that a Johnson-Lindenstrauss transform 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} preserves the cost of the optimal solution up to a (1+ε)(1+\varepsilon)-factor? We give a new technique which uses coreset constructions to analyze the effect of the Johnson-Lindenstrauss transform. Our technique, in addition to applying to center-based clustering, improves on (or is the first to address) other Euclidean optimization problems, including:

  • For (k,z)(k,z)-subspace approximation: we show that t=O~(zk2/ε3)t=\tilde{O}(zk^{2}/\varepsilon^{3}) suffices, whereas the prior best bound, of O(k/ε2)O(k/\varepsilon^{2}), only applied to the case z=2z=2 [CEM+15].

  • For (k,z)(k,z)-flat approximation: we show t=O~(zk2/ε3)t=\tilde{O}(zk^{2}/\varepsilon^{3}) suffices, completely removing the dependence on nn from the prior bound O~(zk2logn/ε3)\tilde{O}(zk^{2}\log n/\varepsilon^{3}) of [KR15].

  • For (k,z)(k,z)-line approximation: we show t=O((kloglogn+z+log(1/ε))/ε3)t=O((k\log\log n+z+\log(1/\varepsilon))/\varepsilon^{3}) suffices; this is the first dimension reduction result for this problem.

1 Introduction

The Johnson-Lindenstrauss lemma [JL84] concerns dimensionality reduction for high-dimensional Euclidean spaces. It states that, for any set of nn points x1,,xnx_{1},\dots,x_{n} in d\mathbbm{R}^{d} and any ε(0,1)\varepsilon\in(0,1), there exists a map Π:dt\Pi\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t}, with t=O(logn/ε2)t=O(\log n/\varepsilon^{2}) such that, for any i,j[n]i,j\in[n],

11+εxixj2Π(xi)Π(xj)2(1+ε)xixj2.\displaystyle\dfrac{1}{1+\varepsilon}\cdot\|x_{i}-x_{j}\|_{2}\leq\|\Pi(x_{i})-\Pi(x_{j})\|_{2}\leq(1+\varepsilon)\cdot\|x_{i}-x_{j}\|_{2}. (1)

From a computational perspective, the lemma has been extremely influential in designing algorithms for high-dimensional geometric problems, partly because proofs show that a random linear map, oblivious to the data, suffices. Proofs specify a distribution 𝒥d,t\mathcal{J}_{d,t} supported on linear maps dt\mathbbm{R}^{d}\to\mathbbm{R}^{t} which is independent of x1,,xnx_{1},\dots,x_{n} (for example, given by a t×dt\times d matrix of i.i.d 𝒩(0,1/t)\mathcal{N}(0,1/t) entries [IM98, DG03]), and show that a draw 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} satisfies (1) with probability at least 0.90.9.
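
For concreteness, the following is a minimal Python sketch (ours, not from the paper; numpy only, with illustrative constants) of the Gaussian construction just described: a t×d matrix of i.i.d. 𝒩(0,1/t) entries applied to a dataset, together with an empirical check of the pairwise-distance guarantee (1).

import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 2000, 0.25
t = int(np.ceil(8 * np.log(n) / eps**2))               # t = O(log n / eps^2); the constant 8 is an illustrative guess

X = rng.normal(size=(n, d))                            # an arbitrary dataset x_1, ..., x_n in R^d
Pi = rng.normal(scale=1.0 / np.sqrt(t), size=(t, d))   # i.i.d. N(0, 1/t) entries
Y = X @ Pi.T                                           # the projected points Pi(x_i) in R^t

def pairwise_dists(Z):
    # all pairwise Euclidean distances via the Gram-matrix identity
    sq = (Z * Z).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.sqrt(D2)

iu = np.triu_indices(n, k=1)
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print(ratios.min(), ratios.max())   # the extreme ratios should lie in (or very near) [1/(1+eps), 1+eps], as in (1)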

In this paper, we study the Johnson-Lindenstrauss transforms for projective clustering problems, generalizing a recent line-of-work which gave (surprising) dimension reduction results for center-based clustering [BZD10, CEM+15, BBCA+19, MMR19]. The goal is to reduce the dimensionality of the input of a (more general) projective clustering problem (from dd to tt with tdt\ll d) without affecting the cost of the optimal solution significantly. We map dd-dimensional points to tt-dimensional points (via a random linear map 𝚷\mathbf{\Pi}) such that the optimal cost in tt-dimensions is within a (1+ε)(1+\varepsilon)-factor of the optimal cost in the original dd-dimensional space. We study this for a variety of problems, each of which is specified by a set of candidate solutions 𝒞d\mathcal{C}_{d} and a cost function. By varying the family of candidate solutions 𝒞d\mathcal{C}_{d} and the cost functions considered, one obtains center-based clustering problems (like kk-means and kk-median), as well as subspace approximation problems (like principal components analysis), and beyond (like clustering with subspaces). The key question we address here is:

Main Question: For a projective clustering problem, how small can tt be as a function of nn (the dataset size) and ε\varepsilon (the accuracy parameter), such that the cost of the optimization is preserved up to (1±ε)(1\pm\varepsilon)-factor with probability at least 0.90.9 over 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}?

Our results fit into a line of prior work on the power of Johnson-Lindenstrauss maps beyond the original discovery of [JL84]. These have been investigated before for various problems and in various contexts, including nearest neighbor search [IM98, HIM12, AIR18], numerical linear algebra [Sar06, Mah11, Woo14], prioritized and terminal embeddings [EFN17, MMMR18, NN19, CN21], and clustering and facility location problems [BZD10, CEM+15, KR15, BBCA+19, MMR19, NSIZ21, ISZ21].

1.1 Our Contribution

In the (k,j)(k,j)-projective clustering problem with z\ell_{z} cost ([HPV02, AP03, EV05, DRVW06, VX12b, KR15, FSS20, TWZ+22]), the goal is to cluster a dataset X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, where each cluster is approximated by an affine jj-dimensional subspace. Namely, we define an objective function for (k,j)(k,j)-projective clustering problems with z\ell_{z} cost on a dataset XX, which aims to minimize

minc𝒞dcostz(X,c),\min_{c\in\mathcal{C}_{d}}\mathrm{cost}_{z}(X,c),

where the “candidate solutions” 𝒞d\mathcal{C}_{d} consist of all kk-tuples of jj-dimensional affine subspaces. The cost function costz(X,)\mathrm{cost}_{z}(X,\cdot) maps each candidate solution c𝒞dc\in\mathcal{C}_{d} to a cost in 0\mathbbm{R}_{\geq 0} given by the z\ell_{z}-norm of the vector of distances between each dataset point xXx\in X to its nearest point on one of the kk subspaces. Intuitively, each point of the dataset xiXx_{i}\in X “pays” for the Euclidean distance to the nearest subspace in c𝒞dc\in\mathcal{C}_{d}, and the total cost is the z\ell_{z}-norm of the nn payments (one for each point).
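
To make the objective concrete, the following is a minimal Python sketch (ours, not from the paper) of the (k,j)-projective clustering cost with ℓz aggregation described above, where a candidate solution is represented as k pairs consisting of a shift point and an orthonormal basis of the j-dimensional direction space.

import numpy as np

def dist_to_affine_subspace(x, p, B):
    # Euclidean distance from x to the affine subspace {p + B a : a in R^j},
    # where B is a d x j matrix with orthonormal columns
    r = x - p
    return np.linalg.norm(r - B @ (B.T @ r))

def projective_cost_z(X, solution, z):
    # l_z norm of the "payments": each point pays its distance to the nearest subspace
    payments = np.array([min(dist_to_affine_subspace(x, p, B) for (p, B) in solution)
                         for x in X])
    return (payments ** z).sum() ** (1.0 / z)

# tiny usage example: k = 2 affine lines (j = 1) in R^3
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
def random_affine(j=1, d=3):
    B, _ = np.linalg.qr(rng.normal(size=(d, j)))
    return rng.normal(size=d), B
print(projective_cost_z(X, [random_affine(), random_affine()], z=2))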

There has been significant prior work which showed surprising results for the special case of (k,z)(k,z)-clustering (like kk-means and kk-median, which corresponds to (k,0)(k,0)-projective clustering with z\ell_{z}-cost) as well as for low-rank approximation (which corresponds to (1,k)(1,k)-projective clustering for non-affine subspaces with 2\ell_{2}-cost) [BZD10, CEM+15, BBCA+19, MMR19]. It is important to note that the techniques in prior works are specifically tailored to the Euclidean optimization problem at hand. For example, the results of [MMR19], which apply for (k,0)(k,0)-clustering with z\ell_{z}-norm, rely on using center points as the approximation and do not generalize to affine subspaces beyond dimension 0. The other result of [CEM+15] for low-rank approximation uses the specific algebraic properties of the 2\ell_{2}-norm (which characterize the optimal low-rank approximation). These prior works carry a strong conceptual message: for (k,z)(k,z)-clustering and low-rank approximation, even though many pairwise distances among dataset points become highly distorted (since one projects to tlognt\ll\log n dimensions), the cost of the optimization (which aggregates distances) need not be significantly distorted.

Our Results. We show that (k,0)(k,0)-clustering with z\ell_{z}-norm and low-rank approximation are not isolated examples, but rather part of a more general phenomenon. Our main conceptual contribution is the following: we use algorithms for constructing coresets (via the “sensitivity sampling” framework of [FL11]) to obtain bounds on dimension reduction. Then, the specific bounds that we obtain for the various problems depend on the sizes of the coresets that the algorithms can produce. We instantiate our framework to obtain new upper bounds for the following problems:

  • (k,z)(k,z)-Subspace Approximation. This problem is a (restricted) (1,k)(1,k)-projective clustering problem with z\ell_{z}-cost. We seek to minimize over a kk-dimensional subspace SS of d\mathbbm{R}^{d} the z\ell_{z}-norm of the nn-dimensional vector where the coordinate i[n]i\in[n] encodes the distance between xix_{i} and the closest point in SS. (This is a restricted version of projective clustering because the subspaces are not affine and are required to pass through the origin.)

  • (k,z)(k,z)-Flat Approximation. This problem is exactly the (1,k)(1,k)-projective clustering problem with z\ell_{z} cost. It is similar to (k,z)(k,z)-subspace approximation, except that we optimize over all affine kk-dimensional subspaces.

  • (k,z)(k,z)-Line Approximation. This problem corresponds to (k,1)(k,1)-projective clustering with z\ell_{z}-cost. The optimization minimizes, over an arbitrary set LL of kk 11-dimensional affine subspaces l1,,lkl_{1},\dots,l_{k} (i.e., kk lines in d\mathbbm{R}^{d}), the z\ell_{z}-norm of the nn-dimensional vector where the coordinate i[n]i\in[n] encodes the distance between xix_{i} and the closest point on any line in LL.

Concretely, our quantitative results are summarized by the following theorem.

Theorem 1 (Main Result—Informal).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any dataset, and let 𝒥d,t\mathcal{J}_{d,t} denote the distribution over t×dt\times d matrices of i.i.d. entries from 𝒩(0,1/t)\mathcal{N}(0,1/t). Let 𝒞d\mathcal{C}_{d} and 𝒞t\mathcal{C}_{t} be candidate solutions for a projective clustering problem in d\mathbbm{R}^{d} and t\mathbbm{R}^{t}, respectively. For any ε(0,1)\varepsilon\in(0,1), we have

𝐏𝐫𝚷𝒥d,t[|minc𝒞tcost(𝚷(X),c)minc𝒞dcost(X,c)1|ε]0.9\displaystyle\mathop{{\bf Pr}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\left|\dfrac{\min_{c^{\prime}\in\mathcal{C}_{t}}\mathrm{cost}(\mathbf{\Pi}(X),c^{\prime})}{\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c)}-1\right|\leq\varepsilon\right]\geq 0.9 (2)

whenever:

  • (k,z)(k,z)-subspace and (k,z)(k,z)-flat approximation: 𝒞d\mathcal{C}_{d} and 𝒞t\mathcal{C}_{t} are all kk-dimensional subspaces of d\mathbbm{R}^{d} and t\mathbbm{R}^{t}, respectively; the cost measures the z\ell_{z}-norm of distances between points to the subspace; and, t=O~(zk2/ε3)t=\tilde{O}(zk^{2}/\varepsilon^{3}). Similarly, the same bound on tt holds for 𝒞d\mathcal{C}_{d} and 𝒞t\mathcal{C}_{t} varying over all affine kk-dimensional subspaces.

  • (k,z)(k,z)-line approximation: 𝒞d\mathcal{C}_{d} and 𝒞t\mathcal{C}_{t} are all kk-tuples of lines in d\mathbbm{R}^{d} and t\mathbbm{R}^{t}, respectively; the cost measures the z\ell_{z}-norm of distances between points and the nearest line; and, t=O((kloglogn+z+log(1/ε))/ε3)t=O((k\log\log n+z+\log(1/\varepsilon))/\varepsilon^{3}).

In all cases, the bound that we obtain is directly related to the size of the best coresets from the sensitivity sampling framework, and all of our proofs follow the same format. (The reason we did not generalize the results to (k,j)(k,j)-projective clustering with z\ell_{z}-cost for j>1j>1 is that these problems do not admit small coresets [HP04]. Researchers have studied “integer (k,j)(k,j)-projective clustering”, where one restricts the input points to have bounded integer coordinates, for which small coresets do exist [EV05]. However, using this approach for dimension reduction would incur additional additive errors, so we have chosen not to pursue this route.) Our proofs are not entirely black-box applications of coresets (since we use the specific instantiations of the sensitivity sampling framework), but we believe that any improvement on the size of the best coresets will likely lead to quantitative improvements on the dimension reduction bounds. However, improving our current bounds by either better coresets or via a different argument altogether (for example, improving on the cubic dependence on ε\varepsilon) seems to require significantly new ideas.

Target Dimension for Johnson-Lindenstrauss Transforms
Problem | New Result | Prior Best
(k,z)(k,z)-subspace | O~(zk2/ε3)\tilde{O}(zk^{2}/\varepsilon^{3}) | O(k/ε2)O(k/\varepsilon^{2}) (when z=2z=2) [CEM+15]
(k,z)(k,z)-flat | O~(zk2/ε3)\tilde{O}(zk^{2}/\varepsilon^{3}) | O~(zk2logn/ε3)\tilde{O}(zk^{2}\log n/\varepsilon^{3}) [KR15]
(k,z)(k,z)-line | O((kloglogn+z+log(1/ε))/ε3)O((k\log\log n+z+\log(1/\varepsilon))/\varepsilon^{3}) | None
Figure 1: Comparison to Prior Bounds

A Subtlety in “For-All” versus “Optimal” Guarantees. So far, the main question (and the results we obtain) focus on applying the Johnson-Lindenstrauss transform and preserving the optimal cost, i.e., that the minimizing solution in the original and the dimension reduced space have approximately the same cost. A stronger guarantee which one may hope for, a so-called “for-all” guarantee, asks that after applying the Johnson-Lindenstrauss transform, every solution has its cost approximately preserved before and after dimension reduction. We do not achieve “for all” guarantees, like those appearing in [MMR19]. However, we emphasize that various subtleties arise in what is meant by “a solution,” as the prior work on dimension reduction and coresets refer to different notions (even though they agree at the optimum).

Consider, for example, the 11-medoid problem, a constrained version of the 11-means problem. The 11-medoid cost of a dataset XX is the minimum over centers cc chosen from the dataset XX, of the sum of squares of distances from each dataset point xx to cc. The subtlety is the following: one can apply a Johnson-Lindenstrauss transform to t=O(log(1/ε)/ε2)t=O(\log(1/\varepsilon)/\varepsilon^{2}) dimensions and preserve the 11-means cost, and one may hope that a “for-all” guarantee would also preserve the 11-medoid cost. Somewhat surprisingly, we show that it does not.
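
For concreteness, a minimal Python sketch (ours) of the two costs just described: the 1-means cost optimizes over arbitrary centers (its optimum is attained at the centroid), while the 1-medoid cost restricts the center to be one of the dataset points.

import numpy as np

def one_means_cost(X):
    mu = X.mean(axis=0)                  # the optimal unconstrained center is the centroid
    return ((X - mu) ** 2).sum()

def one_medoid_cost(X):
    # minimum, over centers c chosen from X, of the sum of squared distances to c
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return sq.sum(axis=0).min()

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 20))
print(one_means_cost(X), one_medoid_cost(X))   # the medoid cost is always at least the means cost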

Theorem 2 (Johnson-Lindenstrauss for Medoid—Informal (see Theorem 12)).

For any t=o(logn)t=o(\log n), applying a Johnson-Lindenstrauss transform to dimension tt decreases the cost of the 11-medoid problem by a factor approaching 22.

Not all “for-all” guarantees are the same. A “for-all” guarantee comes with (an implicit) representation of a candidate “solution,” and different choices of representations yield different guarantees. The above theorem does not contradict the “for-all” guarantee of [MMR19] because there, a candidate solution for (k,z)(k,z)-clustering refers to a partition of XX into kk parts and not a set of centers. Often in the coreset literature, a “solution” refers to a set of centers and not arbitrary partitions. For k=1k=1, there are many possible centers but only one partition, and it is important (as per Theorem 2) that a potential “for all” guarantee considers partitions.

For (k,z)(k,z)-subspace and -flat approximation, in a natural representation of the solutions, the same issue arises. Consider the 11-column subset selection problem, a constrained version of the (1,2)(1,2)-subspace approximation problem, where subspaces must be in the span of the dataset points. The 11-column subset selection cost of a dataset is the minimum over 11-dimensional subspaces spanned by a dataset point of XX, of the sum of squares of distances from each dataset point xx to the projection onto the subspace. Similarly to Theorem 2, a Johnson-Lindenstrauss transform does not preserve the cost of 11-column subset selection.
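
Similarly, a minimal Python sketch (ours) of the 1-column subset selection cost just defined, using the identity that the squared distance from a point x to span(u), for a unit vector u, equals the squared norm of x minus the squared inner product of x and u.

import numpy as np

def one_css_cost(X):
    total_sq = (X ** 2).sum()
    best = np.inf
    for v in X:                          # the subspace must be spanned by a dataset point
        nv = np.linalg.norm(v)
        if nv == 0:
            continue
        u = v / nv
        best = min(best, total_sq - ((X @ u) ** 2).sum())
    return best

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 15))
print(one_css_cost(X))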

Theorem 3 (Johnson-Lindenstrauss for Column Subset Selection—Informal (see Theorem 13)).

For any t=o(logn)t=o(\log n), applying a Johnson-Lindenstrauss transform to dimension tt decreases the cost of the 11-column subset selection problem by a factor approaching 3/23/2.

The above theorem does not contradict the “for-all” guarantee of [CEM+15] for similar reasons (which, in addition, crucially rely on having z=2z=2, and which we elaborate on in Appendix A). For (k,z)(k,z)-line approximation, however, there is an interesting open problem: is it true that after applying a Johnson-Lindenstrauss transform to t=poly(kloglogn/ε)t=\mathrm{poly}(k\log\log n/\varepsilon), for all partitions of XX into kk parts, the cost of optimally approximating each part with a line is preserved?

1.2 Related Work

Dimension Reduction.

Our paper continues a line of work initiated by Boutsidis, Zouzias, and Drineas [BZD10], who first studied the effect of a Johnson-Lindenstrauss transform for kk-means clustering, and showed that t=O(k/ε2)t=O(k/\varepsilon^{2}) sufficed for a (2+ε)(2+\varepsilon)-approximation. The bound was improved to a (1+ε)(1+\varepsilon)-approximation with t=O(k/ε2)t=O(k/\varepsilon^{2}) by Cohen, Elder, Musco, Musco, and Persu [CEM+15], who also showed that t=O(logk/ε2)t=O(\log k/\varepsilon^{2}) gave a (9+ε)(9+\varepsilon)-approximation. Becchetti, Bury, Cohen-Addad, Grandoni, and Schwiegelshohn [BBCA+19] showed that t=O((logk+loglogn)log(1/ε)/ε6)t=O((\log k+\log\log n)\log(1/\varepsilon)/\varepsilon^{6}) sufficed for preserving the costs of all kk-means clusterings. Makarychev, Makarychev, and Razenshteyn [MMR19] improved and generalized the above bounds for all (k,z)(k,z)-clustering. They showed that t=O((logk+zlog(1/ε)+z2)/ε2)t=O((\log k+z\log(1/\varepsilon)+z^{2})/\varepsilon^{2}) preserved costs up to a (1±ε)(1\pm\varepsilon)-factor.

For subspace approximation problems, [CEM+15] showed that t=O(k/ε2)t=O(k/\varepsilon^{2}) preserves the cost of (k,2)(k,2)-subspace approximation to (1+ε)(1+\varepsilon)-factor. In addition, [KR15] showed that O(zk2logn/ε3)O(zk^{2}\log n/\varepsilon^{3}) preserved the cost of the (k,z)(k,z)-flat approximation to (1+ε)(1+\varepsilon)-factor.

Coresets.

Coresets are a well-studied technique for reducing the size of a dataset while approximately preserving a particular desired property. Since their formalization by Agarwal, Har-Peled, and Varadarajan [AHPV05], coresets have played a foundational role in computational geometry and have found widespread application in clustering, numerical linear algebra, and machine learning (see the recent survey [Fel20]). Indeed, even for clustering problems in Euclidean spaces, there is a long line of work (which is still ongoing) [BHPI02, HPM04, AHPV05, Che09, LS10, FL11, VX12b, VX12a, FSS13, BFL16, SW18, HV20, BBH+20, CASS21, CALSS22] exploring the best coreset constructions.

Most relevant to our work is the “sensitivity sampling” framework of Feldman and Langberg [FL11], which gives algorithms for constructing coresets for the projective clustering problems we study. In light of the results of [FL11], as well as the classical formulation of the Johnson-Lindenstrauss lemma [JL84], it may seem natural to apply coreset algorithms and dimensionality reduction concurrently. However, this is not without a few technical challenges. As we will see in the next subsection, it is not necessarily the case that coreset algorithms and random projections “commute.” Put succinctly, the random projection 𝚷\mathbf{\Pi} of a coreset of XX may not be a coreset of the random projection 𝚷(X)\mathbf{\Pi}(X). Indeed, proving such a statement constitutes the bulk of the technical work.

1.3 Organization

The following section (Section 2) overviews the high-level plan, since all our results follow the same technique. To highlight the technique, the first technical section considers the case of (k,z)(k,z)-clustering (Section 4), where the technique of arguing via coresets yields t=O((log(k)+zlog(1/ε))/ε2)t=O((\log(k)+z\log(1/\varepsilon))/\varepsilon^{2}). The remaining sections cover the technical material for (k,z)(k,z)-subspace approximation (in Section 5), (k,z)(k,z)-flat approximation (in Section 6), and finally (k,z)(k,z)-line approximation (in Section 7).

2 Overview of Techniques

In this section, we give a high-level overview of the techniques employed. As it will turn out, all results in this paper follow from one general technique, which we instantiate for the various problem instances.

We give an abstract instantiation of the approach. We will be concerned with geometric optimization problems of the following sort:

  • For each dd\in\mathbbm{N}, we specify a class of candidate solutions given by a set 𝒞d\mathcal{C}_{d}. For example, in center-based clustering, 𝒞d\mathcal{C}_{d} may be given by a tuple of kk points in d\mathbbm{R}^{d}, corresponding to kk centers for a center-based clustering. In subspace approximation, the set 𝒞d\mathcal{C}_{d} may denote the set of all kk-dimensional subspaces of d\mathbbm{R}^{d}.

  • There will be a cost function fd:d×𝒞d0f_{d}\colon\mathbbm{R}^{d}\times\mathcal{C}_{d}\to\mathbbm{R}_{\geq 0} which takes a point xdx\in\mathbbm{R}^{d} and a potential solution c𝒞dc\in\mathcal{C}_{d} and outputs the cost of xx on cc. Continuing the example of center-based clustering, fdf_{d} may denote the distance from a dataset point xdx\in\mathbbm{R}^{d} to its nearest point in cc. In subspace approximation, fdf_{d} may denote the distance from a dataset point to the orthogonal projection of that point onto the subspace cc. For a parameter zz\in\mathbbm{N}, we will denote the cost of using cc for a dataset XdX\subset\mathbbm{R}^{d} by

    costd,z(X,c)=(xXfd(x,c)z)1/z.\mathrm{cost}_{d,z}(X,c)=\left(\sum_{x\in X}f_{d}(x,c)^{z}\right)^{1/z}.

    For simplicity in the notation, we will drop the subscripts from the functions ff and cost\mathrm{cost} when they are clear from context.

We let 𝒥d,t\mathcal{J}_{d,t} denote a distribution over linear maps 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} which will satisfy some “Johnson-Lindenstrauss” guarantees (we will specify in the preliminaries the properties we will need). For concreteness, we will think of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} given by matrix multiplication by a t×dt\times d matrix of i.i.d draws of 𝒩(0,1/t)\mathcal{N}(0,1/t). We ask, for a particular bound on the dataset size nn\in\mathbbm{N}, a geometric optimization problem (specified by {𝒞d}d\{\mathcal{C}_{d}\}_{d\in\mathbbm{N}}, fdf_{d} and zz), and a parameter ε(0,1)\varepsilon\in(0,1), what is the smallest tt\in\mathbbm{N} such that with probability at least 0.90.9 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11+εminc𝒞dcost(X,c)minc𝒞tcost(𝚷(X),c)(1+ε)minc𝒞dcost(X,c).\displaystyle\frac{1}{1+\varepsilon}\cdot\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c)\leq\min_{c\in\mathcal{C}_{t}}\mathrm{cost}(\mathbf{\Pi}(X),c)\leq(1+\varepsilon)\cdot\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c). (3)

The right-most inequality in (3) claims that the cost after applying 𝚷\mathbf{\Pi} does not increase significantly, i.e., minc𝒞tcost(𝚷(X),c)(1+ε)minc𝒞dcost(X,c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}(\mathbf{\Pi}(X),c)\leq(1+\varepsilon)\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c). This direction is easy to prove for the following reason. For a dataset XdX\subset\mathbbm{R}^{d}, we consider the solution c𝒞dc^{*}\in\mathcal{C}_{d} minimizing cost(X,c)\mathrm{cost}(X,c^{*}). We sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and we find a candidate solution c𝒞tc^{**}\in\mathcal{C}_{t} which exhibits an upper bound on minc𝒞tcost(𝚷(X),c)cost(𝚷(X),c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}(\mathbf{\Pi}(X),c)\leq\mathrm{cost}(\mathbf{\Pi}(X),c^{**}). For example, in center-based clustering, c𝒞dc^{*}\in\mathcal{C}_{d} is a set of kk centers in d\mathbbm{R}^{d}, and we may consider c𝒞tc^{**}\in\mathcal{C}_{t} as the kk centers from cc^{*} after applying 𝚷\mathbf{\Pi}. The fact that cost(𝚷(X),c)(1+ε)cost(X,c)\mathrm{cost}(\mathbf{\Pi}(X),c^{**})\leq(1+\varepsilon)\mathrm{cost}(X,c^{*}) with high probability over 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} will follow straightforwardly from properties of 𝒥d,t\mathcal{J}_{d,t}. Importantly, the optimal solution cc^{*} does not depend on 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. In fact, while we expect 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} to distort some distances substantially, we can pick c𝒞tc^{**}\in\mathcal{C}_{t} so that it is unlikely that too many of the relevant distances are distorted.

However, the same reasoning does not apply to the left-most inequality in (3). This is because the solution c𝒞tc^{**}\in\mathcal{C}_{t} which minimizes minc𝒞tcost(𝚷(X),c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}(\mathbf{\Pi}(X),c) depends on 𝚷\mathbf{\Pi}. Indeed, we would expect c𝒞tc^{**}\in\mathcal{C}_{t} to take advantage of distortions in 𝚷\mathbf{\Pi} in order to decrease the cost of the optimal solution. We proceed by the following high level plan. We identify a sequence of important events defined over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} which occur with probability at least 0.90.9. The special property is that if 𝚷\mathbf{\Pi} satisfies these events, we can identify, from c𝒞tc^{**}\in\mathcal{C}_{t} minimizing cost(𝚷(X),c)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}), a candidate solution c𝒞dc^{*}\in\mathcal{C}_{d} which exhibits an upper bound cost(X,c)(1+ε)cost(𝚷(X),c)\mathrm{cost}(X,c^{*})\leq(1+\varepsilon)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}).

We now specify how exactly we define, for an optimal c𝒞tc^{**}\in\mathcal{C}_{t} (depending on 𝚷\mathbf{\Pi}), a candidate solution c𝒞dc^{*}\in\mathcal{C}_{d} whose cost is not much higher than cost(𝚷(X),c)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}). For that, we will use the notion of coresets. Before the formal definition, we note there is a natural extension of cost\mathrm{cost} for weighted datasets. In particular, if SdS\subset\mathbbm{R}^{d} is a set of points and w:S0w\colon S\to\mathbbm{R}_{\geq 0} is a set of weights for SS, then we will use cost((S,w),c)\mathrm{cost}((S,w),c) as the 1/z1/z-th power of the sum over all xSx\in S of w(x)fd(x,c)zw(x)\cdot f_{d}(x,c)^{z}.

Definition 2.1 ((Weak) Coresets, see also [Fel20]). (The word “weak” is used to distinguish these from “strong” coresets, which are weighted subsets of points that approximately preserve the cost of every candidate solution.)

For dd\in\mathbbm{N}, let 𝒞d\mathcal{C}_{d} denote a class of candidate solutions and f:d×𝒞d0f\colon\mathbbm{R}^{d}\times\mathcal{C}_{d}\to\mathbbm{R}_{\geq 0} specify the cost of a point to a solution. For a dataset XdX\subset\mathbbm{R}^{d} and a parameter ε(0,1)\varepsilon\in(0,1), a (weak) ε\varepsilon-coreset for XX is a weighted set of points SdS\subset\mathbbm{R}^{d} and w:S0w\colon S\to\mathbbm{R}_{\geq 0} which satisfy

11+εminc𝒞dcost(X,c)minc𝒞dcost((S,w),c)(1+ε)minc𝒞dcost(X,c).\dfrac{1}{1+\varepsilon}\cdot\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c)\leq\min_{c\in\mathcal{C}_{d}}\mathrm{cost}((S,w),c)\leq(1+\varepsilon)\cdot\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c).

It will be crucial for us that these problems admit small coresets. More specifically, for the problems considered in this paper, there exist (known) algorithms which can produce small-size coresets from a dataset. In what follows, ALG is a randomized algorithm which receives as input a dataset XdX\subset\mathbbm{R}^{d} and outputs a weighted subset of points (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) which is a weak ε\varepsilon-coreset for XX with high probability. Computationally, the benefit of using coresets is that the sets 𝐒\mathbf{S} tend to be much smaller than XX, so that one may compute on (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) and obtain an approximately optimal solution for XX. For us, the benefit will come in defining the important events. At a high level, since 𝐒\mathbf{S} is small, the important events defined with respect to 𝚷\mathbf{\Pi} only need to control distortions within the subset 𝐒\mathbf{S} (or the subspace spanned by 𝐒\mathbf{S}).

In particular, it is natural to consider the following approach:

  1. 1.

    We begin with the original dataset XdX\subset\mathbbm{R}^{d}, and we consider the solution c𝒞dc\in\mathcal{C}_{d} which minimizes cost(X,c)\mathrm{cost}(X,c). The goal is to show that cost(X,c)\mathrm{cost}(X,c) cannot be much larger than cost(𝚷(X),c)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}), where c𝒞tc^{**}\in\mathcal{C}_{t} minimizes cost(𝚷(X),c)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}).

  2. 2.

    Instead of considering the entire dataset XX, we execute ALG(X)\texttt{ALG}(X) and consider the weak ε\varepsilon-coreset (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) that we obtain. If we can identify a candidate solution c𝒞dc^{*}\in\mathcal{C}_{d} whose cost cost((𝐒,𝒘),c)(1+ε)cost(𝚷(X),c)\mathrm{cost}((\mathbf{S},\boldsymbol{w}),c^{*})\leq(1+\varepsilon)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}), we would be done. Indeed, minc𝒞dcost((𝐒,𝒘),c)cost((𝐒,𝒘),c)\min_{c\in\mathcal{C}_{d}}\mathrm{cost}((\mathbf{S},\boldsymbol{w}),c)\leq\mathrm{cost}((\mathbf{S},\boldsymbol{w}),c^{*}), and the fact (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is a weak ε\varepsilon-coreset implies minc𝒞dcost(X,c)(1+ε)minc𝒞dcost((𝐒,𝒘),c)\min_{c\in\mathcal{C}_{d}}\mathrm{cost}(X,c)\leq(1+\varepsilon)\min_{c\in\mathcal{C}_{d}}\mathrm{cost}((\mathbf{S},\boldsymbol{w}),c).

  3. 3.

    Moving to a coreset (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) allows one to relate cost((𝐒,𝒘),c)\mathrm{cost}((\mathbf{S},\boldsymbol{w}),c^{*}) and cost((𝚷(𝐒),𝒘),c)\mathrm{cost}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**}) by considering the performance of 𝚷\mathbf{\Pi} on 𝐒\mathbf{S}. The benefit is that the important events, defined over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, set tt as a function of |𝐒||\mathbf{S}|, instead of |X||X|. A useful example to consider is requiring 𝚷1\mathbf{\Pi}^{-1} be (1+ε)(1+\varepsilon)-Lipschitz on the entire subspace spanned by 𝐒\mathbf{S}, which requires t=Θ(|𝐒|/ε2)t=\Theta(|\mathbf{S}|/\varepsilon^{2}). For the problems considered here, a nearly optimal c𝒞tc^{**}\in\mathcal{C}_{t} for (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) will be in the subspace spanned by 𝐒\mathbf{S}, so we may identify c𝒞dc^{*}\in\mathcal{C}_{d} whose cost on (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is not much higher than the cost((𝚷(𝐒),𝒘),c)\mathrm{cost}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**}) by evaluating 𝚷1(c)\mathbf{\Pi}^{-1}(c^{**}) since cc^{**} lies inside span(𝐒)\mathrm{span}(\mathbf{S}). (While the above results in bounds for tt which are already meaningful, we will exploit other geometric aspects of the problems considered to get bounds on tt which are logarithmic in the coreset size. For center-based clustering, [MMR19] showed that one may apply Kirszbraun’s theorem. For subspace approximation, we use the geometric lemmas of [SV12].)

  4. 4.

    The remaining step is showing cost((𝚷(𝐒),𝒘),c)(1+ε)cost(𝚷(X),c)\mathrm{cost}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**})\leq(1+\varepsilon)\mathrm{cost}(\mathbf{\Pi}(X),c^{**}). In particular, one would like to claim (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}) is a weak ε\varepsilon-coreset for 𝚷(X)\mathbf{\Pi}(X) and use the right-most inequality in Definition 2.1. However, it is not clear this is so. The problem is that the algorithm ALG depends on the dd-dimensional representation of XdX\subset\mathbbm{R}^{d}, and (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}) may not be a valid output for ALG(𝚷(X))\texttt{ALG}(\mathbf{\Pi}(X)). As we show, this does work for (some) coreset algorithms built on the sensitivity sampling framework (see [FL11, BFL16]). (We will not prove that, with high probability over 𝚷\mathbf{\Pi} and ALG(X)\texttt{ALG}(X), (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}) is a weak ε\varepsilon-coreset for 𝚷(X)\mathbf{\Pi}(X). Rather, all we need is that the right-most inequality in Definition 2.1 holds for (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}) and 𝚷(X)\mathbf{\Pi}(X), which is what we show.)

2.0.1 Sensitivity Sampling for Step 4

In the remainder of this section, we briefly overview the sensitivity sampling framework, and the components required to make Step 4 go through. At a high level, coreset algorithms in the sensitivity sampling framework proceed in the following way. Given a dataset XdX\subset\mathbbm{R}^{d}, the algorithm computes a sensitivity sampling distribution σ~\tilde{\sigma} supported on XX. The requirement is that, for each potential solution c𝒞dc\in\mathcal{C}_{d}, sampling from σ~\tilde{\sigma} gives a low-variance estimator for costd,z(X,c)z\mathrm{cost}_{d,z}(X,c)^{z}. In particular, we let σ~(x)\tilde{\sigma}(x) be the probability of sampling xXx\in X. Then, for any distribution σ~\tilde{\sigma} and any c𝒞dc\in\mathcal{C}_{d},

𝐄𝒙σ~[1σ~(𝒙)fd(𝒙,c)zcostd,z(X,c)z]=1.\displaystyle\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{f_{d}(\boldsymbol{x},c)^{z}}{\mathrm{cost}_{d,z}(X,c)^{z}}\right]=1. (4)

Equation 4 implies that, for any mm\in\mathbbm{N}, if 𝐒\mathbf{S} is mm i.i.d samples from σ~\tilde{\sigma} and 𝒘(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)), the expectation of costd,z((𝐒,𝒘),c)z\mathrm{cost}_{d,z}((\mathbf{S},\boldsymbol{w}),c)^{z} is costd,z(X,c)z\mathrm{cost}_{d,z}(X,c)^{z}. In addition, the algorithm designs σ~\tilde{\sigma} so that, for a parameter T>0T>0,

supc𝒞d𝐄𝒙σ~[(1σ~(𝒙)fd(𝒙,c)zcostd,z(X,c)z)2]T.\displaystyle\sup_{c\in\mathcal{C}_{d}}\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\left(\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{f_{d}(\boldsymbol{x},c)^{z}}{\mathrm{cost}_{d,z}(X,c)^{z}}\right)^{2}\right]\leq T. (5)

If we set mT/ε2m\geq T/\varepsilon^{2}, (5) and Chebyshev’s inequality imply costd,z((𝐒,𝒘),c)z1±εcostd,z(X,c)z\mathrm{cost}_{d,z}((\mathbf{S},\boldsymbol{w}),c)^{z}\approx_{1\pm\varepsilon}\mathrm{cost}_{d,z}(X,c)^{z} for each c𝒞dc\in\mathcal{C}_{d} with high constant probability, and the remaining work is in increasing mm by a large enough factor to “union bound” over all c𝒞dc\in\mathcal{C}_{d}. There is a canonical way of ensuring σ~\tilde{\sigma} and TT satisfy (5): we first define σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0}, known as a “sensitivity function”, which sets, for each xXx\in X,

σ(x)supc𝒞dfd(x,c)zcostd,z(X,c)z,andT=xXσ(x),\displaystyle\sigma(x)\geq\sup_{c\in\mathcal{C}_{d}}\dfrac{f_{d}(x,c)^{z}}{\mathrm{cost}_{d,z}(X,c)^{z}},\qquad\text{and}\qquad T=\sum_{x\in X}\sigma(x), (6)

which is known as the “total sensitivity.” Then, the distribution is given by letting σ~(x)=σ(x)/T\tilde{\sigma}(x)=\sigma(x)/T.
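
The following is a minimal Python sketch (ours) of this sampling scheme: draw m points from the normalized sensitivities, attach weights 1/(m times the sampling probability), and use the weighted cost as an estimator of the zz-th power of the cost for a fixed solution c. Note that any full-support distribution makes the estimator unbiased as in (4); the sensitivity-based choice is what additionally controls the variance as in (5). The uniform choice of sigma below is for illustration only.

import numpy as np

def sensitivity_sample(X, sigma, m, rng):
    # returns indices S and weights w of an m-point sample from sigma_tilde = sigma / T
    T = sigma.sum()                        # total sensitivity
    p = sigma / T                          # the sampling distribution sigma_tilde
    S = rng.choice(len(X), size=m, p=p)    # m i.i.d. draws
    w = 1.0 / (m * p[S])                   # importance weights
    return S, w

def weighted_cost_z(X, w, f_c, z):
    # ( sum_x w(x) * f(x, c)^z )^(1/z) for a fixed solution c, with f_c(X) = (f(x, c))_x
    return (w * f_c(X) ** z).sum() ** (1.0 / z)

# usage example with a single fixed center (k = 1, z = 2): f(x, c) = ||x - c||_2
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
c = np.zeros(5)
f_c = lambda Z: np.linalg.norm(Z - c, axis=1)
sigma = np.ones(len(X))                    # illustrative only; see the text for the sensitivity-based choice
S, w = sensitivity_sample(X, sigma, m=200, rng=rng)
print(weighted_cost_z(X, np.ones(len(X)), f_c, 2), weighted_cost_z(X[S], w, f_c, 2))   # the two values should be close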

We now show how to incorporate the map 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, to argue Step 4. Recall that we let 𝐒\mathbf{S} denote mm i.i.d draws from σ~\tilde{\sigma} and the weights be 𝒘(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)). We want to argue that, with high constant probability over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) and 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, we have

costt,z((𝚷(𝐒),𝒘),c)(1+ε)costt,z(𝚷(X),c).\displaystyle\mathrm{cost}_{t,z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**})\leq(1+\varepsilon)\cdot\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{**}). (7)

First, note that the analogous version of (4) for costt,z(𝚷(X),c)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c) continues to hold. In particular, for any map 𝚷\mathbf{\Pi} in the support of 𝒥d,t\mathcal{J}_{d,t} and c𝒞tc^{**}\in\mathcal{C}_{t} minimizing costt,z(𝚷(X),c)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c),

𝐄𝒙σ~[1σ~(𝒙)ft(𝚷(𝒙),c)zcostt,z(𝚷(X),c)z]=1.\displaystyle\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{f_{t}(\mathbf{\Pi}(\boldsymbol{x}),c^{**})^{z}}{\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{**})^{z}}\right]=1. (8)

Hence, it remains to define σ~\tilde{\sigma} satisfying (6) which also satisfies one additional requirement. With high probability over 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, we should have

𝐄𝒙σ~[(1σ~(𝒙)ft(𝚷(𝒙),c)zcostt,z(𝚷(X),c))2]T.\displaystyle\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\left(\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{f_{t}(\mathbf{\Pi}(\boldsymbol{x}),c^{**})^{z}}{\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{**})}\right)^{2}\right]\lesssim T. (9)

The above translates to saying that, for most 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, the variance of cost((𝚷(𝐒),𝒘),c)\mathrm{cost}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**}), when m=O(T/ε2)m=O(T/\varepsilon^{2}), is small. Once that is established, we may apply Chebyshev’s inequality and conclude (7) with high constant probability. (Since Steps 1–4 only argued about the optimal c𝒞tc^{**}\in\mathcal{C}_{t}, there is no need to “union bound” over all c𝒞tc\in\mathcal{C}_{t} in our arguments.)

2.0.2 The Circularity and How to Break It

One final technical hurdle arises. While one may define the sensitivity function σ(x)\sigma(x) to be exactly supc𝒞dfd(x,c)z/costd,z(X,c)z\sup_{c\in\mathcal{C}_{d}}f_{d}(x,c)^{z}/\mathrm{cost}_{d,z}(X,c)^{z} and automatically satisfy (6), it becomes challenging to argue that (9) holds. In the end, the complexity we seek to optimize is the total sensitivity TT, so there is flexibility in defining σ\sigma while showing (9) holds. In fact, sensitivity functions σ\sigma which are computationally simple tend to be known, since an algorithm using coresets must quickly compute σ(x)\sigma(x) for every xXx\in X.

The sensitivity functions σ\sigma used in the literature (for instance, in [FL11, VX12b]) are defined with respect to an approximately optimal c𝒞dc\in\mathcal{C}_{d} (or bi-criteria approximation) for costd,z(X,c)\mathrm{cost}_{d,z}(X,c). Furthermore, the arguments used to show these functions satisfy (6), which we will also employ for (9), crucially utilize the approximation guarantee on c𝒞dc\in\mathcal{C}_{d}. The apparent circularity appears in approximation algorithms and also shows up in the analysis here:

  • For XdX\subset\mathbbm{R}^{d}, we identify the optimal c𝒞dc\in\mathcal{C}_{d} minimizing costd,z(X,c)\mathrm{cost}_{d,z}(X,c), and use cc to define σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0}. The fact that c𝒞dc\in\mathcal{C}_{d} is optimal (and therefore approximately optimal) allows us to use known arguments (in particular, those in [VX12b, VX12a]) to establish (6) and give an upper bound on TT.

  • We use the proof of the “easy” direction to identify a solution c𝒞tc^{\prime}\in\mathcal{C}_{t} with costt,z(𝚷(X),c)(1+ε)costd,z(X,c)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{\prime})\leq(1+\varepsilon)\mathrm{cost}_{d,z}(X,c) (recall this was used to establish the right-most inequality in (3)). From the analytical perspective, it is useful to think of σ:𝚷(X)0\sigma^{\prime}\colon\mathbf{\Pi}(X)\to\mathbbm{R}_{\geq 0} as the function one would get from defining a sensitivity function like in the previous step with cc^{\prime} instead of cc. If we could show c𝒞tc^{\prime}\in\mathcal{C}_{t} was approximately optimal for 𝚷(X)\mathbf{\Pi}(X), we could use [VX12b, VX12a] again to argue (9). The circularity is the following. Showing c𝒞tc^{\prime}\in\mathcal{C}_{t} is approximately optimal means showing an upper bound on minc𝒞tcostt,z(𝚷(X),c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c) in terms of costt,z(𝚷(X),c)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{\prime}). Since, we picked costt,z(𝚷(X),c)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{\prime}) to be at most (1+ε)minc𝒞dcostd,z(X,c)(1+\varepsilon)\min_{c\in\mathcal{C}_{d}}\mathrm{cost}_{d,z}(X,c), this is exactly what we sought to prove.

If “approximately optimal” above required that c𝒞tc^{\prime}\in\mathcal{C}_{t} be a (1+ε)(1+\varepsilon)-approximation to the optimum for 𝚷(X)\mathbf{\Pi}(X), we would have a complete circularity and be unable to proceed. However, similarly to the case of approximation algorithms, it suffices to have a poor approximation. If we showed that c𝒞tc^{\prime}\in\mathcal{C}_{t} was a CC-approximation, then increasing mm by a factor depending on CC (which would affect the resulting dimensionality tt) would account for this increase and drive the variance back down to ε2\varepsilon^{2}. Moreover, showing that cc^{\prime} is an O(1)O(1)-approximation with probability at least 0.990.99 over 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, given Steps 1–4, is straightforward. Instead of showing the stronger bound that costt,z((𝚷(𝐒),𝒘),c)(1+ε)costt,z(𝚷(X),c)\mathrm{cost}_{t,z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**})\leq(1+\varepsilon)\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{**}), we show that costt,z((𝚷(𝐒),𝒘),c)O(1)costt,z(𝚷(X),c)\mathrm{cost}_{t,z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),c^{**})\leq O(1)\cdot\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c^{**}). The latter (loose) bound is a consequence of applying Markov’s inequality to (8).

In summary, we overcome the circularity by going through Steps 1–4 twice. The first time, we show a weaker O(1)O(1)-approximation. Specifically, we show that 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} preserves the cost of minc𝒞tcostt,z(𝚷(X),c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c) up to factor O(1)O(1). The first time around, we do not upper bound the variance in (9); we simply apply Markov’s inequality to (8) in order to prove a (loose) bound on Step 4. Once we have established the O(1)O(1)-factor approximation, we are guaranteed that c𝒞tc^{\prime}\in\mathcal{C}_{t} is an O(1)O(1)-approximation to minc𝒞tcostt,z(𝚷(X),c)\min_{c\in\mathcal{C}_{t}}\mathrm{cost}_{t,z}(\mathbf{\Pi}(X),c). This means that the sensitivity sampling distribution σ~\tilde{\sigma} we had considered (when viewed as a sensitivity sampling distribution for 𝚷(X)\mathbf{\Pi}(X)) in fact gives estimators with bounded variance, as in (9). In particular, going through Steps 1–4 once again implies that c𝒞tc^{\prime}\in\mathcal{C}_{t} was indeed the desired (1±ε)(1\pm\varepsilon)-approximation.

3 Preliminaries

We specify the properties we use from the distribution 𝒥d,t\mathcal{J}_{d,t}. We will refer to these as “Johnson-Lindenstrauss” distributions. Throughout the proof, we will often refer to 𝒥d,t\mathcal{J}_{d,t} as given by a t×dt\times d matrix of i.i.d. draws from 𝒩(0,1/t)\mathcal{N}(0,1/t). The point of specifying these properties explicitly is to allow other “Johnson-Lindenstrauss”-like distributions to be used. The first property we need is that 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} is a linear map, and that any x,ydx,y\in\mathbbm{R}^{d} satisfies

𝐄𝚷𝒥d,t[𝚷(x)𝚷(y)22xy22]=1.\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(y)\|_{2}^{2}}{\|x-y\|_{2}^{2}}\right]=1.

We use the standard property of 𝒥d,t\mathcal{J}_{d,t}, that 𝚷\mathbf{\Pi} preserves distances with high probability, i.e., for any x,ydx,y\in\mathbbm{R}^{d},

𝐏𝐫𝚷𝒥d,t[|𝚷(x)𝚷(y)22xy221|ε]eΩ(ε2t).\mathop{{\bf Pr}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\left|\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(y)\|_{2}^{2}}{\|x-y\|_{2}^{2}}-1\right|\geq\varepsilon\right]\leq e^{-\Omega(\varepsilon^{2}t)}.

More generally, we use the conclusion of the following lemma. We give a proof when 𝒥d,t\mathcal{J}_{d,t} is given by a t×dt\times d matrix of i.i.d. entries of 𝒩(0,1/t)\mathcal{N}(0,1/t).

Lemma 3.1.

Let 𝒥d,t\mathcal{J}_{d,t} denote a Johnson-Lindenstrauss distribution over maps dt\mathbbm{R}^{d}\to\mathbbm{R}^{t} given by a matrix multiplication on the left by a t×dt\times d matrix of i.i.d draws of 𝒩(0,1/t)\mathcal{N}(0,1/t). If tz/ε2t\gtrsim z/\varepsilon^{2}, then for any x,ydx,y\in\mathbbm{R}^{d},

𝐄𝚷𝒥d,t[(𝚷(x)𝚷(y)2zxy2z1)+](1+ε)z1100.\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\left(\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(y)\|_{2}^{z}}{\|x-y\|_{2}^{z}}-1\right)^{+}\right]\leq\dfrac{(1+\varepsilon)^{z}-1}{100}.
Proof.

We note that by the 2-stability of the Gaussian distribution, we have 𝚷(x)𝚷(y)22\|\mathbf{\Pi}(x)-\mathbf{\Pi}(y)\|_{2}^{2} is equivalently distributed as 𝒈22xy22\|\boldsymbol{g}\|_{2}^{2}\cdot\|x-y\|_{2}^{2}, where 𝒈𝒩(0,It/t)\boldsymbol{g}\sim\mathcal{N}(0,I_{t}/t). Therefore, we have that for any λ>0\lambda>0 which we will optimize shortly to be a small constant times ε\varepsilon,

𝐄𝚷𝒥(d,t)[(𝚷(x)𝚷(y)2zxy2z1)+]\displaystyle\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}(d,t)}\left[\left(\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(y)\|_{2}^{z}}{\|x-y\|_{2}^{z}}-1\right)^{+}\right] (1+λ)z1+𝐄𝒈𝒩(0,It/t)[(𝒈2z(1+λ)z)+].\displaystyle\leq(1+\lambda)^{z}-1+\mathop{{\bf E}\/}_{\boldsymbol{g}\sim\mathcal{N}(0,I_{t}/t)}\left[(\|\boldsymbol{g}\|_{2}^{z}-(1+\lambda)^{z})^{+}\right].

Furthermore,

𝐄𝒈𝒩(0,It/t)[(𝒈2z(1+λ)z)+]\displaystyle\mathop{{\bf E}\/}_{\boldsymbol{g}\sim\mathcal{N}(0,I_{t}/t)}\left[\left(\|\boldsymbol{g}\|_{2}^{z}-(1+\lambda)^{z}\right)^{+}\right] =u:(1+λ)z𝐏𝐫[𝒈2zu]du=z2v:(1+λ)2𝐏𝐫[𝒈22v]vz/21dv.\displaystyle=\int_{u:(1+\lambda)^{z}}^{\infty}\mathop{{\bf Pr}\/}\left[\|\boldsymbol{g}\|_{2}^{z}\geq u\right]du=\frac{z}{2}\int_{v:(1+\lambda)^{2}}^{\infty}\mathop{{\bf Pr}\/}\left[\|\boldsymbol{g}\|_{2}^{2}\geq v\right]\cdot v^{z/2-1}dv.

We will upper bound the probability that 𝒈22\|\boldsymbol{g}\|_{2}^{2} exceeds vv by the Chernoff-Hoeffding method. In particular, recall that 𝒈22\|\boldsymbol{g}\|_{2}^{2}, when 𝒈𝒩(0,It/t)\boldsymbol{g}\sim\mathcal{N}(0,I_{t}/t), is distributed as a χ2\chi^{2}-random variable with tt degrees of freedom, rescaled by 1/t1/t, such that the moment generating function of 𝒈22\|\boldsymbol{g}\|_{2}^{2} has the following closed form whenever α<t/2\alpha<t/2:

log(𝐄𝒈𝒩(0,It/t)[exp(α𝒈22)])=t2log(12αt)α+2α2t\log\left(\mathop{{\bf E}\/}_{\boldsymbol{g}\sim\mathcal{N}(0,I_{t}/t)}\left[\exp\left(\alpha\|\boldsymbol{g}\|_{2}^{2}\right)\right]\right)=-\frac{t}{2}\log\left(1-\frac{2\alpha}{t}\right)\leq\alpha+\frac{2\alpha^{2}}{t}

In particular, for any α<t/2\alpha<t/2, we may upper bound

z2v:(1+λ)2𝐏𝐫[𝒈22v]vz/21\displaystyle\frac{z}{2}\int_{v:(1+\lambda)^{2}}^{\infty}\mathop{{\bf Pr}\/}\left[\|\boldsymbol{g}\|_{2}^{2}\geq v\right]v^{z/2-1} z2exp(α+2α2t)v:(1+λ)2exp((αz21)v)𝑑v\displaystyle\leq\frac{z}{2}\cdot\exp\left(\alpha+\frac{2\alpha^{2}}{t}\right)\int_{v:(1+\lambda)^{2}}^{\infty}\exp\left(-\left(\alpha-\frac{z}{2}-1\right)v\right)dv
z2αz2exp(2α2tαλ+(1+λ)(z/2+1))<λ100,\displaystyle\leq\dfrac{z}{2\alpha-z-2}\cdot\exp\left(\frac{2\alpha^{2}}{t}-\alpha\lambda+(1+\lambda)(z/2+1)\right)<\frac{\lambda}{100},

by setting α=tλ/10\alpha=t\lambda/10 whenever tz/λ2t\gtrsim z/\lambda^{2}. Setting λ\lambda to be a small constant times ε\varepsilon gives the desired guarantees. ∎
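
As a quick numerical sanity check (ours, not part of the proof), the quantity bounded in Lemma 3.1 can be simulated directly via the 2-stability fact used in the first line of the proof: the squared distortion ratio is distributed as a χ² random variable with t degrees of freedom divided by t. The constant 400 below is an illustrative stand-in for the unspecified constant in the requirement that t be at least a constant times z/ε².

import numpy as np

rng = np.random.default_rng(5)
z, eps = 3, 0.25
t = int(400 * z / eps**2)                             # t ~ z / eps^2, with an illustrative constant

g2 = rng.chisquare(df=t, size=200_000) / t            # samples of ||g||_2^2 with g ~ N(0, I_t / t)
pos_part = np.maximum(g2 ** (z / 2.0) - 1.0, 0.0)     # samples of (ratio^z - 1)^+
print(pos_part.mean(), ((1 + eps) ** z - 1) / 100)    # the empirical mean should fall below the bound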

Definition 3.2 (Subspace Embeddings).

Let dd\in\mathbbm{N} and AdA\subset\mathbbm{R}^{d} denote a subspace of d\mathbbm{R}^{d}. For ε>0\varepsilon>0, a map f:dtf\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} is an ε\varepsilon-subspace embedding of AA if, for any xAx\in A,

11+εf(x)2x2(1+ε)f(x)2.\frac{1}{1+\varepsilon}\cdot\|f(x)\|_{2}\leq\|x\|_{2}\leq(1+\varepsilon)\cdot\|f(x)\|_{2}.
Lemma 3.3.

Let dd\in\mathbbm{N} and AdA\subset\mathbbm{R}^{d} be a subspace of dimension at most kk. For ε,δ(0,1/2)\varepsilon,\delta\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} denote a Johnson-Lindenstrauss distribution over maps dt\mathbbm{R}^{d}\to\mathbbm{R}^{t} given by a matrix multiplication on the left by a t×dt\times d matrix of i.i.d draws of 𝒩(0,1/t)\mathcal{N}(0,1/t). If t(k+log(1/δ))/ε2t\sim(k+\log(1/\delta))/\varepsilon^{2}, then 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} is an ε\varepsilon-subspace embedding of AA with probability at least 1δ1-\delta.
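
A minimal Python sketch (ours) of how Lemma 3.3 can be checked numerically: for an orthonormal basis B of the subspace A, every point of A can be written as Ba with the same norm as a, so the extreme singular values of the matrix 𝚷B give the worst-case distortion of 𝚷 over A. The constant 4 below is an illustrative stand-in for the constant implicit in the lemma.

import numpy as np

rng = np.random.default_rng(6)
d, k, eps, delta = 5000, 10, 0.25, 0.01
t = int(4 * (k + np.log(1 / delta)) / eps**2)          # t ~ (k + log(1/delta)) / eps^2

B, _ = np.linalg.qr(rng.normal(size=(d, k)))           # orthonormal basis of a k-dimensional subspace A
Pi = rng.normal(scale=1.0 / np.sqrt(t), size=(t, d))   # Gaussian Johnson-Lindenstrauss map
s = np.linalg.svd(Pi @ B, compute_uv=False)            # singular values of Pi restricted to A
print(s.min(), s.max())                                # both should lie in [1/(1+eps), 1+eps]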

4 Center-based (k,z)(k,z)-clustering

In the (k,z)(k,z)-clustering problems, for any set CdC\subset\mathbbm{R}^{d} of kk points, and point xdx\in\mathbbm{R}^{d}, we write

costzz(x,C)=mincCxc2z,\mathrm{cost}_{z}^{z}(x,C)=\min_{c\in C}\|x-c\|_{2}^{z},

and for a subset XdX\subset\mathbbm{R}^{d},

costzz(X,C)=xXcostzz(x,C)=xXmincCxc2z.\mathrm{cost}_{z}^{z}(X,C)=\sum_{x\in X}\mathrm{cost}_{z}^{z}(x,C)=\sum_{x\in X}\min_{c\in C}\|x-c\|_{2}^{z}.

We extend the above notation to weighted subsets, where for a subset SdS\subset\mathbbm{R}^{d} with (non-negative) weights w:S0w\colon S\to\mathbbm{R}_{\geq 0}, we write costzz((S,w),C)=xSw(x)mincCxc2z\mathrm{cost}_{z}^{z}((S,w),C)=\sum_{x\in S}w(x)\min_{c\in C}\|x-c\|_{2}^{z}. The main result of this section is the following theorem.
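
Before stating it, here is a minimal Python sketch (ours) of the cost just defined, in both its unweighted and weighted forms.

import numpy as np

def kz_cost(X, C, z, w=None):
    # cost_z((X, w), C): the l_z aggregation of each point's distance to its nearest center
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1).min(axis=1)
    if w is None:
        w = np.ones(len(X))
    return (w * dists ** z).sum() ** (1.0 / z)

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
C = rng.normal(size=(5, 4))                  # k = 5 candidate centers
print(kz_cost(X, C, z=2))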

Theorem 4 (Johnson-Lindenstrauss for Center-Based Clustering).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let CdC\subset\mathbbm{R}^{d} denote the optimal (k,z)(k,z)-clustering of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), suppose we let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps where

tlogk+zlog(1/ε)ε2,t\gtrsim\frac{\log k+z\log(1/\varepsilon)}{\varepsilon^{2}},

Then, with probability at least 0.90.9 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11+εcostz(X,C)minCt|C|=kcostz(𝚷(X),C)(1+ε)costz(X,C).\frac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}(X,C)\leq\min_{\begin{subarray}{c}C^{\prime}\subset\mathbbm{R}^{t}\\ |C^{\prime}|=k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),C^{\prime})\leq(1+\varepsilon)\mathrm{cost}_{z}(X,C).
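
As an empirical illustration (ours, not part of the paper), the sketch below compares the (k,2)-clustering (i.e., k-means) cost before and after a Gaussian Johnson-Lindenstrauss projection, using scikit-learn's KMeans as a heuristic stand-in for the optimal clusterings on both sides; the constant in the choice of t is omitted, so the comparison is only indicative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
n, d, k, eps = 500, 1000, 10, 0.25
t = int(np.ceil((np.log(k) + 2 * np.log(1 / eps)) / eps**2))   # t ~ (log k + z log(1/eps)) / eps^2 with z = 2

means = 5 * rng.normal(size=(k, d))                            # planted, well-separated cluster centers
X = means[rng.integers(k, size=n)] + rng.normal(size=(n, d))
Pi = rng.normal(scale=1.0 / np.sqrt(t), size=(t, d))           # Gaussian Johnson-Lindenstrauss map

cost_high = np.sqrt(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
cost_low = np.sqrt(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X @ Pi.T).inertia_)
print(cost_high, cost_low, cost_low / cost_high)               # the ratio should be close to 1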

There are two directions to showing dimension reduction: (1) the optimal clustering in the reduced space is not too expensive, and (2) the optimal clustering in the reduced space is not too cheap. We note that (1) is simple because we can exhibit a clustering in the reduced space whose cost is not too high; however, (2) is much trickier, since we need to rule out a too-good-to-be-true clustering in the reduced space.

4.1 Easy Direction: Optimum Cost Does Not Increase

Lemma 4.1.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points and let CdC\subset\mathbbm{R}^{d} of size kk be the centers of the optimal (k,z)(k,z)-clustering of XX. We let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps. If tz/ε2t\gtrsim z/\varepsilon^{2}, then with probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

(xX𝚷(x)𝚷(c(x))2z)1/z(1+ε)costz(X,C),\left(\sum_{x\in X}\|\mathbf{\Pi}(x)-\mathbf{\Pi}(c(x))\|_{2}^{z}\right)^{1/z}\leq(1+\varepsilon)\mathrm{cost}_{z}(X,C),

and hence,

minCt|C|=kcostz(𝚷(X),C)(1+ε)minCd|C|=kcostz(X,C).\min_{\begin{subarray}{c}C^{\prime}\subset\mathbbm{R}^{t}\\ |C^{\prime}|=k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),C^{\prime})\leq(1+\varepsilon)\min_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|=k\end{subarray}}\mathrm{cost}_{z}(X,C).
Proof.

For xXx\in X, let c(x)Cc(x)\in C denote the closest point from CC to xx. We compute the expected positive deviation from assigning 𝚷(x)\mathbf{\Pi}(x) to 𝚷(c(x))\mathbf{\Pi}(c(x)), and note that the costz(𝚷(X),𝚷(C))\mathrm{cost}_{z}(\mathbf{\Pi}(X),\mathbf{\Pi}(C)) can only be lower. Hence, if we can show

1costzz(X,C)𝐄𝚷𝒥d,t[xX(𝚷(x)𝚷(c(x))2zxc(x)2z)+](1+ε)z1100,\displaystyle\dfrac{1}{\mathrm{cost}_{z}^{z}(X,C)}\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\sum_{x\in X}\left(\left\|\mathbf{\Pi}(x)-\mathbf{\Pi}(c(x))\right\|_{2}^{z}-\|x-c(x)\|_{2}^{z}\right)^{+}\right]\lesssim\dfrac{(1+\varepsilon)^{z}-1}{100}, (10)

then by Markov’s inequality, we will obtain costzz(𝚷(X),𝚷(C))(1+ε)zcostzz(X,C)\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{\Pi}(C))\leq(1+\varepsilon)^{z}\mathrm{cost}_{z}^{z}(X,C), and obtain the desired bound after raising to the power 1/z1/z. The bound (10) itself follows from applying Lemma 3.1 to each pair (x,c(x))(x,c(x)) and using linearity of expectation. ∎

4.2 Hard Direction: Optimum Cost Does Not Decrease

We now show that after applying a Johnson-Lindenstrauss map, the cost of the optimal clustering in the dimension-reduced space is not too cheap. This section will be significantly more difficult, and will draw on the following preliminaries.

4.2.1 Preliminaries

Definition 4.2 (Weak and Strong Coresets).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be a set of points. A (weak) ε\varepsilon-coreset of XX for (k,z)(k,z)-clustering is a subset SdS\subset\mathbbm{R}^{d} of points with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0} such that,

11+εminCd|C|kcostz(X,C)minCd|C|kcostz((S,w),C)(1+ε)minCd|C|kcostz(X,C).\dfrac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|\leq k\end{subarray}}\mathrm{cost}_{z}(X,C)\leq\min_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|\leq k\end{subarray}}\mathrm{cost}_{z}((S,w),C)\leq(1+\varepsilon)\cdot\min_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|\leq k\end{subarray}}\mathrm{cost}_{z}(X,C).

The coreset (S,w)(S,w) is a strong ε\varepsilon-coreset if, for all C={c1,,ck}dC=\{c_{1},\dots,c_{k}\}\subset\mathbbm{R}^{d}, we have

11+εcostz(X,C)costz((S,w),C)(1+ε)costz(X,C).\dfrac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}(X,C)\leq\mathrm{cost}_{z}((S,w),C)\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,C).

Notice that Definition 4.2 gives an approach to finding an approximately optimal (k,z)(k,z)-clustering. Given XdX\subset\mathbbm{R}^{d}, we find a (weak) ε\varepsilon-coreset (S,w)(S,w) and find the optimal clustering CdC\subset\mathbbm{R}^{d} with respect to the coreset (S,w)(S,w). Then, we can deduce that a clustering which is optimal for the coreset is also approximately optimal for the original point set. A common and useful framework for building coresets is by utilizing the “sensitivity” sampling framework.

Definition 4.3 (Sensitivities).

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N}, z1z\geq 1. A sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} for (k,z)(k,z)-clustering XX in d\mathbbm{R}^{d} is a function satisfying, that for all xXx\in X,

supCd|C|kcostzz(x,C)costzz(X,C)σ(x).\displaystyle\sup_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|\leq k\end{subarray}}\dfrac{\mathrm{cost}_{z}^{z}(x,C)}{\mathrm{cost}_{z}^{z}(X,C)}\leq\sigma(x).

The total sensitivity of the sensitivity function σ\sigma is given by

𝔖σ=xXσ(x).\mathfrak{S}_{\sigma}=\sum_{x\in X}\sigma(x).

For a sensitivity function, we let σ~\tilde{\sigma} denote the sensitivity sampling distribution, supported on XX, which samples xXx\in X with probability proportional to σ(x)\sigma(x).

The following lemma gives a particularly simple sensitivity sampling distribution, which will be useful for analyzing our dimension reduction procedure. The proof below will follow from two applications of the triangle inequality which we reproduce from Claim 5.6 in [HV20].

Lemma 4.4.

Let n,dn,d\in\mathbbm{N} and consider a set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}. Let CdC\subset\mathbbm{R}^{d} of size kk be optimal (k,z)(k,z)-clustering of XX, and let c:XCc\colon X\to C denote the function which sends xXx\in X to its closest point in CC, and let XcXX_{c}\subset X be the set of points where c(x)=cc(x)=c. Then, the function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} given by

σ(x)=2z1xc(x)2zcostzz(X,C)+22z1|Xc(x)|\sigma(x)=2^{z-1}\cdot\dfrac{\|x-c(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,C)}+\dfrac{2^{2z-1}}{|X_{c(x)}|}

is a sensitivity function for (k,z)(k,z)-clustering XX in d\mathbbm{R}^{d}, satisfying

𝔖σ=2z1+22z1k\mathfrak{S}_{\sigma}=2^{z-1}+2^{2z-1}\cdot k
Proof.

Consider any set CdC^{\prime}\subset\mathbbm{R}^{d} of kk points, and let c:XCc^{\prime}\colon X\to C^{\prime} be the function which sends each xXx\in X to its closest point in CC^{\prime}. Then, we have

xc(x)2z\displaystyle\|x-c^{\prime}(x)\|_{2}^{z} (xc(x)2+c(x)c(x)2)z2z1xc(x)2z+2z1c(x)c(x)2z\displaystyle\leq\left(\|x-c(x)\|_{2}+\|c(x)-c^{\prime}(x)\|_{2}\right)^{z}\leq 2^{z-1}\|x-c(x)\|_{2}^{z}+2^{z-1}\|c(x)-c^{\prime}(x)\|_{2}^{z} (11)
2z1xc(x)2z+2z1|Xc(x)|xXc(x)c(x)2z\displaystyle\leq 2^{z-1}\|x-c(x)\|_{2}^{z}+\dfrac{2^{z-1}}{|X_{c(x)}|}\sum_{x\in X}\|c(x)-c^{\prime}(x)\|_{2}^{z} (12)
2z1xc(x)2z+22(z1)|Xc(x)|(costzz(X,C)+costzz(X,C)),\displaystyle\leq 2^{z-1}\|x-c(x)\|_{2}^{z}+\frac{2^{2(z-1)}}{|X_{c(x)}|}\left(\mathrm{cost}_{z}^{z}(X,C)+\mathrm{cost}_{z}^{z}(X,C^{\prime})\right), (13)

where we used the triangle inequality and Hölder inequality in (11), and added additional non-negative terms in (12), and we finally apply the triangle inequality and Hölder’s inequality once more in (13). Hence, using the fact that CC is the optimal clustering, we have

costzz(x,C)costzz(X,C)\displaystyle\dfrac{\mathrm{cost}_{z}^{z}(x,C^{\prime})}{\mathrm{cost}_{z}^{z}(X,C^{\prime})} 2z1xc(x)2zcost(X,C)+22(z1)|Xc(x)|(costzz(X,C)costzz(X,C)+1)2z1xc(x)2zcostzz(X,C)+22z1|Xc(x)|.\displaystyle\leq 2^{z-1}\cdot\dfrac{\|x-c(x)\|_{2}^{z}}{\mathrm{cost}(X,C^{\prime})}+\dfrac{2^{2(z-1)}}{|X_{c(x)}|}\left(\dfrac{\mathrm{cost}_{z}^{z}(X,C)}{\mathrm{cost}_{z}^{z}(X,C^{\prime})}+1\right)\leq 2^{z-1}\cdot\dfrac{\|x-c(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,C)}+\dfrac{2^{2z-1}}{|X_{c(x)}|}.

The bound on 𝔖σ\mathfrak{S}_{\sigma} follows from summing over all xXx\in X, noting the fact that xX1/|Xc(x)|=k\sum_{x\in X}1/|X_{c(x)}|=k. ∎

The main idea behind the sensitivity sampling framework for building coresets is to sample from a sensitivity sampling distribution enough times in order to build a coreset. For this work, it will be sufficient to consider the following theorem of [HV20], which shows that poly(k,1/εz)\mathrm{poly}(k,1/\varepsilon^{z}) draws from an appropriate sensitivity sampling distribution suffices to build strong ε\varepsilon-coresets for (k,z)(k,z)-clustering in d\mathbbm{R}^{d}.

Theorem 5 (ε\varepsilon-Strong Coresets from Sensitivity Sampling [HV20]).

For any subset X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and ε(0,1/2)\varepsilon\in(0,1/2), let CdC\subset\mathbbm{R}^{d} of size kk be the centers of an optimal (k,z)(k,z)-clustering of XX, and let σ~\tilde{\sigma} denote the sensitivity sampling distribution of Lemma 4.4.

  • Let (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) denote a random (multi-)set 𝐒X\mathbf{S}\subset X and 𝒘:𝐒0\boldsymbol{w}\colon\mathbf{S}\to\mathbbm{R}_{\geq 0} given by, for m=poly(k,1/εz)m=\mathrm{poly}(k,1/\varepsilon^{z}) iterations, sampling 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} i.i.d and letting 𝒘(𝒙)=1/(mσ~(x))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(x)).

  • Then, with probability 1o(1)1-o(1) over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}), it is an ε\varepsilon-strong coreset for XX.
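To make the sampling procedure concrete, the following Python sketch (with hypothetical helper names, and assuming that the centers C of an optimal, or near-optimal, (k,z)-clustering are already available) computes the sensitivity function of Lemma 4.4 and draws the weighted sample (S,w) of Theorem 5.

import numpy as np

def lemma_4_4_sensitivities(X, C, z):
    # sigma(x) = 2^{z-1} ||x - c(x)||^z / cost_z^z(X, C) + 2^{2z-1} / |X_{c(x)}|
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)                 # index of the closest center c(x)
    closest = dists[np.arange(len(X)), assign]        # ||x - c(x)||_2
    cost = np.sum(closest ** z)                       # cost_z^z(X, C)
    sizes = np.bincount(assign, minlength=len(C))     # cluster sizes |X_c|
    return 2 ** (z - 1) * closest ** z / cost + 2 ** (2 * z - 1) / sizes[assign]

def sensitivity_sample(X, sigma, m, rng):
    # m i.i.d. draws with probability proportional to sigma, each weighted by 1/(m * tilde_sigma(x)).
    p = sigma / sigma.sum()
    idx = rng.choice(len(X), size=m, p=p)
    return X[idx], 1.0 / (m * p[idx])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
C = X[rng.choice(len(X), size=5, replace=False)]      # stand-in for (near-)optimal centers
S, w = sensitivity_sample(X, lemma_4_4_sensitivities(X, C, z=2), m=200, rng=rng)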

Theorem 6 (Kirszbraun theorem).

Let Yd1Y\subset\mathbbm{R}^{d_{1}} and ϕ:Yd2\phi\colon Y\to\mathbbm{R}^{d_{2}} be an LL-Lipschitz map (with respect to Euclidean norms on d1\mathbbm{R}^{d_{1}} and d2\mathbbm{R}^{d_{2}}). There exists a map ϕ~:d1d2\tilde{\phi}\colon\mathbbm{R}^{d_{1}}\to\mathbbm{R}^{d_{2}} which is LL-Lipschitz and extends ϕ\phi, i.e., ϕ(x)=ϕ~(x)\phi(x)=\tilde{\phi}(x) for all xYx\in Y.

4.3 The Important Events

We now define the important events which will allow us to prove that the optimum (k,z)(k,z)-clustering after dimension reduction does not decrease substantially. We first define the events, and then we prove that if the events are all satisfied, then we obtain our desired lower bound.

Definition 4.5 (The Events).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and CdC\subset\mathbbm{R}^{d} of size kk be centers for an optimal (k,z)(k,z)-clustering of XX, and let σ~\tilde{\sigma} be the sensitivity sampling distribution of XX with respect to CC as in Lemma 4.4. We will consider the following experiment,

  1. We generate a sample (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) by sampling from σ~\tilde{\sigma} for m=poly(k,1/εz)m=\mathrm{poly}(k,1/\varepsilon^{z}) i.i.d iterations 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} and set 𝒘(𝒙)=1/(mσ~(𝒙))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(\boldsymbol{x})).

  2. Furthermore, we sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} which is a Johnson-Lindenstrauss map dt\mathbbm{R}^{d}\to\mathbbm{R}^{t}.

  3. We let 𝐒=𝚷(𝐒)t\mathbf{S}^{\prime}=\mathbf{\Pi}(\mathbf{S})\subset\mathbbm{R}^{t} denote the image of 𝚷\mathbf{\Pi} on 𝐒\mathbf{S}.

The events are the following:

  • 𝐄1\mathbf{E}_{1} : The weighted (multi-)set (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is a weak ε\varepsilon-coreset for (k,z)(k,z)-clustering XX in d\mathbbm{R}^{d}.

  • 𝐄2\mathbf{E}_{2} : The map 𝚷:𝐒𝐒\mathbf{\Pi}\colon\mathbf{S}\to\mathbf{S}^{\prime}, given by restricting 𝚷\mathbf{\Pi}, is (1+ε)(1+\varepsilon)-bi-Lipschitz.

  • 𝐄3(β)\mathbf{E}_{3}(\beta) : We let 𝐂t\mathbf{C}^{\prime}\subset\mathbbm{R}^{t} of size kk be the optimal centers for (k,z)(k,z)-clustering 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. The weighted (multi-)set (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}) satisfies

    costzz((𝚷(𝐒),𝒘),𝐂)βcostzz(𝚷(X),𝐂).\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{C}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{C}^{\prime}).
Lemma 4.6.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and suppose (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) and 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfy events 𝐄1,𝐄2\mathbf{E}_{1},\mathbf{E}_{2} and 𝐄3(β)\mathbf{E}_{3}(\beta), then,

minCt|C|=kcostz(𝚷(X),C)1β1/z(1+ε)1+1/zminCd|C|=kcostz(X,C).\displaystyle\min_{\begin{subarray}{c}C^{\prime}\subset\mathbbm{R}^{t}\\ |C^{\prime}|=k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),C^{\prime})\geq\dfrac{1}{\beta^{1/z}(1+\varepsilon)^{1+1/z}}\cdot\min_{\begin{subarray}{c}C\subset\mathbbm{R}^{d}\\ |C|=k\end{subarray}}\mathrm{cost}_{z}(X,C).
Proof.

Let 𝐂t\mathbf{C}^{\prime}\subset\mathbbm{R}^{t} of size kk denote the centers which give the optimal (k,z)(k,z)-clustering of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. Then, by 𝐄3\mathbf{E}_{3},

costzz(𝚷(X),𝐂)(1/β)costzz((𝚷(𝐒),𝒘),𝐂).\displaystyle\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{C}^{\prime})\geq(1/\beta)\cdot\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{C}^{\prime}).

Now, by event \mathbf{E}_{2}, the inverse map \mathbf{\Pi}^{-1}\colon\mathbf{\Pi}(\mathbf{S})\to\mathbf{S} is (1+\varepsilon)-Lipschitz, so by Kirszbraun's Theorem (Theorem 6) we may extend it to a (1+\varepsilon)-Lipschitz map \mathbf{\Pi}^{-1}\colon\mathbbm{R}^{t}\to\mathbbm{R}^{d}. Hence,

costz((𝚷(𝐒),𝒘),𝐂)11+εcostz((𝐒,𝒘),𝚷1(𝐂))11+εminC′′d|C′′|=kcostz((𝐒,𝒘),C′′).\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{C}^{\prime})\geq\dfrac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}((\mathbf{S},\boldsymbol{w}),\mathbf{\Pi}^{-1}(\mathbf{C}^{\prime}))\geq\dfrac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}C^{\prime\prime}\subset\mathbbm{R}^{d}\\ |C^{\prime\prime}|=k\end{subarray}}\mathrm{cost}_{z}((\mathbf{S},\boldsymbol{w}),C^{\prime\prime}).

Finally, using the fact that (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is an ε\varepsilon-weak coreset, we may conclude that

\min_{\begin{subarray}{c}C^{\prime\prime}\subset\mathbbm{R}^{d}\\ |C^{\prime\prime}|=k\end{subarray}}\mathrm{cost}_{z}^{z}((\mathbf{S},\boldsymbol{w}),C^{\prime\prime})\geq\dfrac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}C^{\prime\prime}\subset\mathbbm{R}^{d}\\ |C^{\prime\prime}|=k\end{subarray}}\mathrm{cost}_{z}^{z}(X,C^{\prime\prime}).

Chaining the three displayed inequalities and taking zz-th roots where appropriate gives the claimed lower bound. ∎

We now turn to showing that an appropriate setting of parameters implies that the events occur often. For the first event, Theorem 5 from [HV20] implies event 𝐄1\mathbf{E}_{1} occurs with probability 1o(1)1-o(1). We state the usual guarantees of the Johnson-Lindenstrauss transform, which is what we need for event 𝐄2\mathbf{E}_{2} to hold.

Lemma 4.7.

Let SdS\subset\mathbbm{R}^{d} be any set of mm points, and 𝒥d,t\mathcal{J}_{d,t} denote the Johnson-Lindenstrauss map, with

tlogmε2.t\gtrsim\dfrac{\log m}{\varepsilon^{2}}.

Then, with probability 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, 𝚷:S𝚷(S)\mathbf{\Pi}\colon S\to\mathbf{\Pi}(S) is (1±ε)(1\pm\varepsilon)-bi-Lipschitz, and hence, 𝐄2\mathbf{E}_{2} occurs with probability 0.990.99.
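For concreteness, here is a minimal sketch of the guarantee in Lemma 4.7 for the Gaussian instantiation of 𝒥_{d,t}: a random t×d matrix of i.i.d. N(0,1/t) entries, together with an empirical check of the bi-Lipschitz distortion on a finite point set. The constant in the choice of t is suppressed in the lemma and chosen generously here; this is only an illustrative sketch.

import numpy as np

def gaussian_jl(d, t, rng):
    # A draw from J_{d,t}: a t x d matrix of i.i.d. N(0, 1/t) entries.
    return rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, d))

def bi_lipschitz_distortion(S, Pi):
    # Smallest eps such that Pi restricted to S is (1 + eps)-bi-Lipschitz.
    PS = S @ Pi.T
    worst = 1.0
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            ratio = np.linalg.norm(PS[i] - PS[j]) / np.linalg.norm(S[i] - S[j])
            worst = max(worst, ratio, 1.0 / ratio)
    return worst - 1.0

rng = np.random.default_rng(1)
m, d, eps = 200, 1000, 0.25
S = rng.normal(size=(m, d))
t = int(np.ceil(8 * np.log(m) / eps ** 2))     # t ~ log(m) / eps^2, generous constant
print(t, bi_lipschitz_distortion(S, gaussian_jl(d, t, rng)))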

4.3.1 A Bad Approximation Guarantee

Lemma 4.8 (Warm-Up Lemma).

Fix any Π𝒥d,t\Pi\in\mathcal{J}_{d,t} and let CtC^{\prime}\subset\mathbbm{R}^{t} denote the optimal centers for (k,z)(k,z)-clustering of Π(X)\Pi(X). Then, with probability at least 0.990.99 over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) as per Definition 4.5,

\displaystyle\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\min_{c\in C^{\prime}}\|\Pi(x)-c\|_{2}^{z}\leq 100\cdot\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime}).

In other words, 𝐄3(100)\mathbf{E}_{3}(100) holds with probability at least 0.990.99.

Proof.

For any xXx\in X, let c(x)Cc^{\prime}(x)\in C^{\prime} denote the point in CC^{\prime} closest to Π(x)\Pi(x). Then, we note

x𝐒𝒘(x)Π(x)c(x)2z=𝐄𝒙𝐒[1σ~(𝒙)Π(𝒙)c(𝒙)2z],\displaystyle\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\Pi(x)-c^{\prime}(x)\|_{2}^{z}=\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\|\Pi(\boldsymbol{x})-c^{\prime}(\boldsymbol{x})\|_{2}^{z}\right], (14)

so that in expectation over 𝐒\mathbf{S}, we have

𝐄𝐒[𝐄𝒙𝐒[1σ~(𝒙)Π(𝒙)c(𝒙)2z]]=𝐄𝒙σ~[1σ~(𝒙)Π(𝒙)c(𝒙)2z]=costzz(Π(X),C).\displaystyle\mathop{{\bf E}\/}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\|\Pi(\boldsymbol{x})-c^{\prime}(\boldsymbol{x})\|_{2}^{z}\right]\right]=\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\|\Pi(\boldsymbol{x})-c^{\prime}(\boldsymbol{x})\|_{2}^{z}\right]=\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime}).

By Markov’s inequality, we obtain our desired bound. ∎

Corollary 4.9.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and CdC\subset\mathbbm{R}^{d} of size kk be the centers of an optimal (k,z)(k,z)-clustering of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzlog(1/ε)+logkε2,t\gtrsim\dfrac{z\log(1/\varepsilon)+\log k}{\varepsilon^{2}},

Then, with probability at least 0.970.97 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11001/z(1+ε)1+1/zcostz(X,C)minCt|C|=kcostz(𝚷(X),C).\displaystyle\dfrac{1}{100^{1/z}(1+\varepsilon)^{1+1/z}}\cdot\mathrm{cost}_{z}\left(X,C\right)\leq\min_{\begin{subarray}{c}C^{\prime}\subset\mathbbm{R}^{t}\\ |C^{\prime}|=k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),C^{\prime}). (15)
Proof.

We sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) as per Definition 4.5. By Lemma 4.7, Lemma 4.8, Theorem 5, and a union bound, we have events 𝐄1\mathbf{E}_{1}, 𝐄2\mathbf{E}_{2}, and 𝐄3(100)\mathbf{E}_{3}(100) hold with probability at least 0.970.97. Hence, we obtain the desired result from applying Lemma 4.6. ∎

4.3.2 Improving the Approximation

In what follows, we will improve upon the approximation of Corollary 4.9 significantly, to show that with large probability, 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds. We let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and denote CdC\subset\mathbbm{R}^{d} of size kk to be the optimal (k,z)(k,z)-clustering of XX. As before, we let c:XCc\colon X\to C map each xXx\in X to its closest center in CC, and σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} be the sensitivities of XX with respect to CC as in Lemma 4.4, and σ~\tilde{\sigma} be the sensitivity sampling distribution.

We define one more event, 𝐄4\mathbf{E}_{4}, with respect to the randomness of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. First, we let 𝐃x0\mathbf{D}_{x}\in\mathbbm{R}_{\geq 0} denote the random variable given by

𝐃x=def𝚷(x)𝚷(c(x))2xc(x)2.\displaystyle\mathbf{D}_{x}\stackrel{{\scriptstyle\rm def}}{{=}}\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(c(x))\|_{2}}{\|x-c(x)\|_{2}}. (16)

Notice that when 𝚷\mathbf{\Pi} consists of i.i.d 𝒩(0,1/t)\mathcal{N}(0,1/t) entries, then t𝐃x2t\mathbf{D}_{x}^{2} is distributed as a χ2\chi^{2} random variable with tt degrees of freedom. We say event 𝐄4\mathbf{E}_{4} holds whenever

\displaystyle\sum_{x\in X}\mathbf{D}_{x}^{2z}\cdot\sigma(x)\leq 100\cdot 2^{z}\cdot\mathfrak{S}_{\sigma}. (17)
Claim 4.10.

With probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, event 𝐄4\mathbf{E}_{4} holds.

Proof.

The proof will simply follow from computing the expectation of the left-hand side of (17), and applying Markov's inequality. In particular, for every xXx\in X, we can apply the moment bound for the Gaussian Johnson-Lindenstrauss distribution (see Section 3) to conclude that

𝐄𝚷𝒥d,t[𝐃x2z]2z.\displaystyle\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\mathbf{D}_{x}^{2z}\right]\leq 2^{z}.

Summing over x\in X, the expectation of the left-hand side of (17) is at most 2^{z}\cdot\mathfrak{S}_{\sigma}, and the claim follows from Markov's inequality. ∎
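A quick Monte Carlo check of the moment bound used above: for a Gaussian Johnson-Lindenstrauss map, t·D_x² has the χ²_t law (by rotational invariance we may sample that law directly), and its empirical z-th moment E[D_x^{2z}] stays below 2^z once t is a sufficiently large multiple of z. The specific multiples of z below are illustrative choices, not the constants of the lemma.

import numpy as np

rng = np.random.default_rng(3)
N = 200_000
for z in [1, 2, 3]:
    for t in [4 * z, 16 * z]:
        D2 = rng.chisquare(t, size=N) / t            # law of D_x^2 for a Gaussian JL map
        print(z, t, np.mean(D2 ** z), 2.0 ** z)      # empirical E[D_x^{2z}] vs the bound 2^z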

Lemma 4.11.

Let Π𝒥d,t\Pi\in\mathcal{J}_{d,t} be a Johnson-Lindenstrauss map where, for α>1\alpha>1, the following events hold:

  1. Guarantee from Lemma 4.1: xXΠ(x)Π(c(x))2zαcostzz(X,C)\sum_{x\in X}\|\Pi(x)-\Pi(c(x))\|_{2}^{z}\leq\alpha\cdot\mathrm{cost}_{z}^{z}(X,C).

  2. Guarantee from Corollary 4.9: letting CtC^{\prime}\subset\mathbbm{R}^{t} be the optimal (k,z)(k,z)-clustering of Π(X)\Pi(X), then costzz(Π(X),C)(1/α)costzz(X,C)\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})\geq(1/\alpha)\cdot\mathrm{cost}_{z}^{z}(X,C).

  3. Event 𝐄4\mathbf{E}_{4} holds.

Then, if we let (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) denote m=poly(k,2z,1/ε,α)m=\mathrm{poly}(k,2^{z},1/\varepsilon,\alpha) i.i.d draws from σ~\tilde{\sigma} and 𝐰(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)), with probability at least 0.990.99,

costzz((Π(𝐒),𝒘),C)(1+ε)costzz(Π(X),C).\mathrm{cost}_{z}^{z}((\Pi(\mathbf{S}),\boldsymbol{w}),C^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime}).
Proof.

The proof follows the same schema as Lemma 4.8. However, we give a bound on the variance of the estimator in order to improve upon the use of Markov’s inequality. Specifically, we compute the variance of a rescaling of (14).

𝐕𝐚𝐫𝐒[𝐄𝒙𝐒[1σ~(𝒙)Π(𝒙)c(𝒙)2zcostzz(Π(X),C)]]1m𝐄𝒙σ~[(1σ~(𝒙)Π(𝒙)c(𝒙)2zcostzz(Π(X),C))2]\displaystyle\mathop{\operatorname{{\bf Var}}}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{\|\Pi(\boldsymbol{x})-c^{\prime}(\boldsymbol{x})\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})}\right]\right]\leq\frac{1}{m}\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\left(\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{\|\Pi(\boldsymbol{x})-c^{\prime}(\boldsymbol{x})\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})}\right)^{2}\right]
=𝔖σmxX(1σ(x)Π(x)c(x)22zcostz2z(Π(X),C)).\displaystyle\qquad\qquad\qquad\qquad=\frac{\mathfrak{S}_{\sigma}}{m}\sum_{x\in X}\left(\frac{1}{\sigma(x)}\cdot\dfrac{\|\Pi(x)-c^{\prime}(x)\|_{2}^{2z}}{\mathrm{cost}_{z}^{2z}(\Pi(X),C^{\prime})}\right). (18)

By the same argument as in the proof of Lemma 4.4 (applying the triangle inequality twice),

Π(x)c(x)2zcostzz(Π(X),C)\displaystyle\dfrac{\|\Pi(x)-c^{\prime}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})} 2z1Π(x)Π(c(x))2zcostzz(Π(X),C)+22(z1)|Xc(x)|(costzz(Π(X),Π(C))costzz(Π(X),C)+1).\displaystyle\leq 2^{z-1}\cdot\dfrac{\|\Pi(x)-\Pi(c(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})}+\dfrac{2^{2(z-1)}}{|X_{c(x)}|}\left(\dfrac{\mathrm{cost}_{z}^{z}(\Pi(X),\Pi(C))}{\mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime})}+1\right). (19)

Recall that condition 2 gives a lower bound on \mathrm{cost}_{z}^{z}(\Pi(X),C^{\prime}) and condition 1 gives an upper bound on \sum_{x\in X}\|\Pi(x)-\Pi(c(x))\|_{2}^{z}. Hence, along with the definition of DxD_{x} in (16) (we remove the bold-face as Π\Pi is fixed), we upper bound the right-hand side of (19) by

α2z1Dxzxc(x)2zcostzz(X,C)+22(z1)|Xc(x)|(α2+1)α22100z(Dxz+1)σ(x).\displaystyle\alpha\cdot 2^{z-1}\cdot\dfrac{D_{x}^{z}\cdot\|x-c(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,C)}+\dfrac{2^{2(z-1)}}{|X_{c(x)}|}\left(\alpha^{2}+1\right)\leq\alpha^{2}\cdot 2^{100z}(D_{x}^{z}+1)\cdot\sigma(x).

In particular, we may plug this in to (18) and use the definition of σ(x)\sigma(x). Specifically, one obtains that the variance in (18) is upper bounded by

𝔖σα42200zmxX(Dxz+1)2σ(x)\displaystyle\frac{\mathfrak{S}_{\sigma}\cdot\alpha^{4}\cdot 2^{200z}}{m}\sum_{x\in X}(D_{x}^{z}+1)^{2}\cdot\sigma(x) 4𝔖σ2α42200zm+4𝔖σα42200zmxXDx2zσ(x)\displaystyle\leq\frac{4\mathfrak{S}_{\sigma}^{2}\cdot\alpha^{4}\cdot 2^{200z}}{m}+\frac{4\mathfrak{S}_{\sigma}\cdot\alpha^{4}\cdot 2^{200z}}{m}\sum_{x\in X}D_{x}^{2z}\cdot\sigma(x)
1000𝔖σ2α42300zm,\displaystyle\leq\frac{1000\mathfrak{S}_{\sigma}^{2}\cdot\alpha^{4}\cdot 2^{300z}}{m},

where in the final inequality, we used the fact that 𝐄4\mathbf{E}_{4} holds. Hence, letting mm be a large enough polynomial in kk, 2z2^{z}, 1/ε1/\varepsilon, and α\alpha makes the variance o(ε2)o(\varepsilon^{2}), so we can apply Chebyshev's inequality. ∎

Corollary 4.12.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and CdC\subset\mathbbm{R}^{d} of size kk be the centers of an optimal (k,z)(k,z)-clustering of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzlog(1/ε)+logkε2.t\gtrsim\frac{z\log(1/\varepsilon)+\log k}{\varepsilon^{2}}.

Then, with probability at least 0.920.92 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

1(1+ε)1+2/zcostz(X,C)minCt|C|kcostz(𝚷(X),C).\dfrac{1}{(1+\varepsilon)^{1+2/z}}\cdot\mathrm{cost}_{z}(X,C)\leq\min_{\begin{subarray}{c}C^{\prime}\subset\mathbbm{R}^{t}\\ |C^{\prime}|\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),C^{\prime}).
Proof.

We sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) as per Definition 4.5. Note that by Theorem 5 and the setting of mm, event 𝐄1\mathbf{E}_{1} holds with probability at least 0.990.99 over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}). By Lemma 4.1, Corollary 4.9, and Claim 4.10, and the setting of mm and tt, the conditions (1), (2) and (3) of Lemma 4.11 hold with probability at least 0.950.95, with α\alpha being set to a large enough constant, and hence event 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds with probability at least 0.940.94. Finally, event 𝐄2\mathbf{E}_{2} holds with probability 0.990.99 by Lemma 4.7, and taking a union bound and Lemma 4.6 gives the desired bound. ∎
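As an illustration of Corollary 4.12 for z=2 (k-means), the following sketch projects a clustered data set with a Gaussian Johnson-Lindenstrauss map and compares the k-means costs before and after projection. The corollary concerns the optimal clustering cost; Lloyd's algorithm (scikit-learn's KMeans) is used here only as a heuristic proxy for the optimum, and the constant in the choice of t is suppressed.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n, d, k, eps = 2000, 500, 10, 0.2

centers = 5.0 * rng.normal(size=(k, d))                 # planted, well-separated centers
X = centers[rng.integers(k, size=n)] + rng.normal(size=(n, d))

t = int(np.ceil((2 * np.log(1 / eps) + np.log(k)) / eps ** 2))  # t ~ (z log(1/eps) + log k) / eps^2
Pi = rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, d))

cost_high = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
cost_low = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X @ Pi.T).inertia_
print(t, cost_low / cost_high)                          # ratio should be 1 +/- O(eps)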

5 Subspace Approximation

In the (k,z)(k,z)-subspace approximation problem, we consider a subspace RdR\subset\mathbbm{R}^{d} of dimension at most kk, which we may encode by a collection of at most kk orthonormal vectors r1,,rkRr_{1},\dots,r_{k}\in R. We let ρR:dd\rho_{R}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{d} denote the map which sends each vector xdx\in\mathbbm{R}^{d} to its closest point in RR, and note that

\displaystyle\rho_{R}(x)=\mathop{\mathrm{argmin}}_{y\in R}\|x-y\|_{2}^{2}=\sum_{i=1}^{k}\langle x,r_{i}\rangle\cdot r_{i}\in\mathbbm{R}^{d}.

For any subset XdX\subset\mathbbm{R}^{d} and any kk-dimensional subspace RR, we let

costzz(X,R)=xXxρR(x)2z.\mathrm{cost}_{z}^{z}(X,R)=\sum_{x\in X}\left\|x-\rho_{R}(x)\right\|_{2}^{z}.
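The following numpy sketch spells out ρ_R and cost_z^z(X,R) for a subspace given by an orthonormal basis; for z=2 the optimal subspace is spanned by the top-k right singular vectors of the data matrix (the classical PCA fact), which the sketch uses only as a sanity check.

import numpy as np

def subspace_cost(X, R, z):
    # R: (k, d) matrix with orthonormal rows spanning the subspace.
    # rho_R(x) = sum_i <x, r_i> r_i  and  cost_z^z(X, R) = sum_x ||x - rho_R(x)||^z.
    proj = (X @ R.T) @ R
    return np.sum(np.linalg.norm(X - proj, axis=1) ** z)

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 40))
k = 5
R_opt = np.linalg.svd(X, full_matrices=False)[2][:k]     # top-k right singular vectors (optimal for z = 2)
R_rand = np.linalg.qr(rng.normal(size=(40, k)))[0].T     # a random k-dimensional subspace
print(subspace_cost(X, R_opt, z=2), subspace_cost(X, R_rand, z=2))   # the optimal cost is the smaller one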

In this section, we will show that we may compute the optimum kk-subspace approximation after applying a Johnson-Lindenstrauss transform.

Theorem 7 (Johnson-Lindenstrauss for kk-Subspace Approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let RR denote the optimal (k,z)(k,z)-subspace approximation of XX. For any ε(0,1)\varepsilon\in(0,1), suppose we let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps where

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.90.9 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11+εcostz(X,R)minRtdimRkcostz(𝚷(X),R)(1+ε)costz(X,R).\displaystyle\dfrac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}(X,R)\leq\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,R).

The proof of the above theorem proceeds in two steps and mirrors the argument of the previous section. First, we show that the cost of the optimum does not increase substantially (the right-most inequality in the theorem); this is done in the next subsection. Second, we show that the optimum does not decrease substantially (the left-most inequality in the theorem); this is done in the subsequent subsection.

5.1 Easy Direction: Optimum Cost Does Not Increase

The first direction, which shows that the cost of the optimal (k,z)(k,z)-subspace approximation does not increase, follows similarly to Lemma 4.1.

Lemma 5.1.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points and let RdR\subset\mathbbm{R}^{d} be the optimal (k,z)(k,z)-subspace approximation of XX. For any ε(0,1)\varepsilon\in(0,1), we let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps. If tz/ε2t\gtrsim z/\varepsilon^{2}, then with probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

\sum_{x\in X}\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{R}(x))\|_{2}^{z}\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(X,R),

and hence,

minRtdimRkcostz(𝚷(X),R)(1+ε)minRddimRkcostz(X,R).\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime})\leq(1+\varepsilon)\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}(X,R).
Proof.

Note that 𝚷\mathbf{\Pi} is a linear map, so if we let r1,,rkRr_{1},\dots,r_{k}\in R denote kk orthonormal unit vectors spanning RR, then 𝚷(r1),,𝚷(rk)t\mathbf{\Pi}(r_{1}),\dots,\mathbf{\Pi}(r_{k})\in\mathbbm{R}^{t} are kk vectors spanning the subspace 𝚷(R)\mathbf{\Pi}(R). Furthermore, we may consider the kk-dimensional subspace

𝚷(R)=def{i=1kαi𝚷(ri)t:α1,,αk}.\mathbf{\Pi}(R)\stackrel{{\scriptstyle\rm def}}{{=}}\left\{\sum_{i=1}^{k}\alpha_{i}\cdot\mathbf{\Pi}(r_{i})\in\mathbbm{R}^{t}:\alpha_{1},\dots,\alpha_{k}\in\mathbbm{R}\right\}.

Notice that for any xdx\in\mathbbm{R}^{d}, by linearity of 𝚷\mathbf{\Pi},

\displaystyle\mathbf{\Pi}(\rho_{R}(x))=\sum_{i=1}^{k}\langle x,r_{i}\rangle\cdot\mathbf{\Pi}(r_{i})\in\mathbf{\Pi}(R),

which means that we may always upper bound

minRtdimRkcostz(𝚷(X),R)\displaystyle\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime}) (xX𝚷(x)𝚷(ρR(x))2z)1/z.\displaystyle\leq\left(\sum_{x\in X}\left\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{R}(x))\right\|_{2}^{z}\right)^{1/z}. (20)

It hence remains to upper bound the right-hand side of (20). We now use the fact that 𝚷\mathbf{\Pi} is drawn from a Johnson-Lindenstrauss distribution. Specifically, the lemma follows from applying Markov’s inequality once we show

\dfrac{1}{\mathrm{cost}_{z}^{z}(X,R)}\mathop{{\bf E}\/}_{\mathbf{\Pi}\sim\mathcal{J}_{d,t}}\left[\sum_{x\in X}\left(\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{R}(x))\|_{2}^{z}-\|x-\rho_{R}(x)\|_{2}^{z}\right)^{+}\right]\leq\dfrac{(1+\varepsilon)^{z}-1}{100},

which follows from Lemma 3.1. ∎

5.2 Hard Direction: Optimum Cost Does Not Decrease

5.2.1 Preliminaries

In the (k,z)(k,z)-subspace approximation problem, there will be a difference between complexities of known strong coresets and weak coresets. Our argument will only use weak coresets, which is important for us, as strong coresets have a dependence on dd (which we are trying to avoid).

Definition 5.2 (Weak Coresets for (k,z)(k,z)-subspace approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be a set of points. A weak ε\varepsilon-coreset of XX for (k,z)(k,z)-subspace approximation is a weighted subset SdS\subset\mathbbm{R}^{d} of points with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0} such that,

11+εminRddimRkcostz(X,R)minRddimRkcostz((S,w),R)(1+ε)minRddimRkcostz(X,R).\frac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}(X,R)\leq\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}((S,w),R)\leq(1+\varepsilon)\cdot\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}(X,R).

Similarly to the case of (k,z)(k,z)-clustering, algorithms for building weak coresets proceed by sampling according to the sensitivity framework. We proceed by defining sensitivity functions in the context of subspace approximation, and then state a lemma which gives a sensitivity function that we will use.

Definition 5.3 (Sensitivities).

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N} and z1z\geq 1. A sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} for (k,z)(k,z)-subspace approximation in d\mathbbm{R}^{d} is a function satisfying that, for all xXx\in X,

supRddimRkxρR(x)2zcostzz(X,R)σ(x).\displaystyle\sup_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\dfrac{\|x-\rho_{R}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}\leq\sigma(x).

The total sensitivity of the sensitivity function σ\sigma is given by

𝔖σ=xXσ(x).\mathfrak{S}_{\sigma}=\sum_{x\in X}\sigma(x).

For a sensitivity function, we let σ~\tilde{\sigma} denote the sensitivity sampling distribution, supported on XX, which samples xXx\in X with probability proportional to σ(x)\sigma(x).

We now state a specific sensitivity function that we will use. The proof closely follows a method of [VX12b] for bounding the total sensitivity. The resulting weak ε\varepsilon-coreset will have a worse dependence than the best-known coresets for this problem; however, the specific form of the sensitivity function will be especially useful for us. In particular, the non-optimality of the sensitivity function will not significantly affect the final bound on dimension reduction.

Lemma 5.4 (Theorem 18 of [VX12b]).

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N} with k<dk<d, and z1z\geq 1. Suppose RdR\subset\mathbbm{R}^{d} is the optimal (k,z)(k,z)-subspace approximation of XX in d\mathbbm{R}^{d}. Then, the function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} given by

σ(x)=2z1xρR(x)2zcostzz(X,R)+22z1supud|ρR(x),u|zxX|ρR(x),u|z\displaystyle\sigma(x)=2^{z-1}\cdot\dfrac{\|x-\rho_{R}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}+2^{2z-1}\cdot\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle\rho_{R}(x),u\rangle|^{z}}{\sum_{x^{\prime}\in X}|\langle\rho_{R}(x^{\prime}),u\rangle|^{z}}

is a sensitivity function for (k,z)(k,z)-subspace approximation of XX in d\mathbbm{R}^{d}, satisfying

𝔖σ2z1+22z1(k+1)1+z.\mathfrak{S}_{\sigma}\leq 2^{z-1}+2^{2z-1}(k+1)^{1+z}.
Proof.

Consider any subspace RdR^{\prime}\subset\mathbbm{R}^{d} of dimension at most kk. Then, for any xXx\in X

xρR(x)2zxρR(ρR(x))2z2z1(xρR(x)2z+ρR(x)ρR(ρR(x))2z)\displaystyle\|x-\rho_{R^{\prime}}(x)\|_{2}^{z}\leq\|x-\rho_{R^{\prime}}(\rho_{R}(x))\|_{2}^{z}\leq 2^{z-1}\left(\|x-\rho_{R}(x)\|_{2}^{z}+\|\rho_{R}(x)-\rho_{R^{\prime}}(\rho_{R}(x))\|_{2}^{z}\right)
2z1(xρR(x)2zcostzz(X,R)costzz(X,R)+ρR(x)ρR(ρR(x))2zcostzz(ρR(X),R)costzz(ρR(X),R)).\displaystyle\qquad\leq 2^{z-1}\left(\dfrac{\|x-\rho_{R}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}\cdot\mathrm{cost}_{z}^{z}(X,R)+\dfrac{\|\rho_{R}(x)-\rho_{R^{\prime}}(\rho_{R}(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\rho_{R}(X),R^{\prime})}\cdot\mathrm{cost}_{z}^{z}(\rho_{R}(X),R^{\prime})\right). (21)

Notice that costzz(ρR(X),R)2z1(costzz(X,R)+costzz(X,R))\mathrm{cost}_{z}^{z}(\rho_{R}(X),R^{\prime})\leq 2^{z-1}(\mathrm{cost}_{z}^{z}(X,R)+\mathrm{cost}_{z}^{z}(X,R^{\prime})) by the triangle inequality and Hölder’s inequality, and that costzz(X,R)costzz(X,R)\mathrm{cost}_{z}^{z}(X,R)\leq\mathrm{cost}_{z}^{z}(X,R^{\prime}) since RR is the optimal (k,z)(k,z)-subspace approximation. Hence, dividing both sides of (21) by \mathrm{cost}_{z}^{z}(X,R^{\prime}), we have

xρR(x)2zcostzz(X,R)2z1xρR(x)2zcostzz(X,R)+22z1ρR(x)ρR(ρR(x))2zcostzz(ρR(X),R).\displaystyle\dfrac{\|x-\rho_{R^{\prime}}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R^{\prime})}\leq 2^{z-1}\cdot\dfrac{\|x-\rho_{R}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}+2^{2z-1}\cdot\dfrac{\|\rho_{R}(x)-\rho_{R^{\prime}}(\rho_{R}(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\rho_{R}(X),R^{\prime})}.

It remains to show that, for any set of points YdY\subset\mathbbm{R}^{d} (in particular, the set {ρR(x):xX}\{\rho_{R}(x):x\in X\}), and any yYy\in Y,

supRddimRkyρR(y)2zcostzz(Y,R)supHddimH=d1yρH(y)2zcostzz(Y,H)=supud|y,u|zyY|y,u|z.\displaystyle\sup_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{d}\\ \dim R^{\prime}\leq k\end{subarray}}\dfrac{\|y-\rho_{R^{\prime}}(y)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(Y,R^{\prime})}\leq\sup_{\begin{subarray}{c}H\subset\mathbbm{R}^{d}\\ \dim H=d-1\end{subarray}}\dfrac{\|y-\rho_{H}(y)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(Y,H)}=\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle y,u\rangle|^{z}}{\sum_{y^{\prime}\in Y}|\langle y^{\prime},u\rangle|^{z}}.

In particular, note that for any subspace RdR^{\prime}\subset\mathbbm{R}^{d} of dimension at most kk, there exists a (d1)(d-1)-dimensional subspace HdH\subset\mathbbm{R}^{d} containing RR^{\prime} given by all vectors orthogonal to yρR(y)y-\rho_{R^{\prime}}(y). In particular, costzz(Y,H)costzz(Y,R)\mathrm{cost}_{z}^{z}(Y,H)\leq\mathrm{cost}_{z}^{z}(Y,R^{\prime}) since RR^{\prime} is contained in HH, and yρH(y)2=yρR(y)2\|y-\rho_{H}(y)\|_{2}=\|y-\rho_{R^{\prime}}(y)\|_{2} by the definition of HH. The bound on the total sensitivity then follows from Lemma 16 in [VX12b], where we use the fact that {ρR(x):xX}\{\rho_{R}(x):x\in X\} lies in a kk-dimensional subspace. ∎
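For z=2 the supremum appearing in Lemma 5.4 has a closed form: it is the statistical leverage score of ρ_R(x) with respect to the matrix whose rows are the projected points. That specialization (a standard fact, assumed here rather than proved) makes the sensitivity function easy to compute, as in the sketch below.

import numpy as np

def lemma_5_4_sensitivities_z2(X, R):
    # z = 2 specialization: sup_u <y_x, u>^2 / sum_{x'} <y_{x'}, u>^2 equals the
    # leverage score of y_x = rho_R(x) with respect to the matrix of projections.
    proj = (X @ R.T) @ R                                # rows y_x = rho_R(x)
    resid2 = np.sum((X - proj) ** 2, axis=1)            # ||x - rho_R(x)||_2^2
    lev = np.einsum('ij,ji->i', proj @ np.linalg.pinv(proj.T @ proj), proj.T)
    return 2 * resid2 / resid2.sum() + 8 * lev          # 2^{z-1} = 2 and 2^{2z-1} = 8 for z = 2

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 30))
R = np.linalg.svd(X, full_matrices=False)[2][:4]        # optimal 4-dimensional subspace for z = 2
sigma = lemma_5_4_sensitivities_z2(X, R)
print(sigma.sum())    # total sensitivity: 2 + 8 * rank here, well below the 2 + 8 (k+1)^3 of Lemma 5.4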

We will use the following geometric theorem of [SV12] in our proof. The theorem says that an approximately optimal (k,z)(k,z)-subspace approximation lies in the span of a small set of points. We state the lemma for weighted point sets, even though [SV12] state it for unweighted points. We note that adding weights can be simulated by replicating points.

Lemma 5.5 (Theorem 3.1 [SV12]).

Let d,kd,k\in\mathbbm{N}, and consider a weighted set of points SdS\subset\mathbbm{R}^{d} with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0}, as well as ε(0,1)\varepsilon\in(0,1) and z1z\geq 1. There exists a subset QSQ\subset S of size O(k2log(k/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon) and a kk-dimensional subspace RdR^{\prime}\subset\mathbbm{R}^{d} within the span of QQ satisfying

costz((S,w),R)(1+ε)minRddimRkcostz((S,w),R).\mathrm{cost}_{z}((S,w),R^{\prime})\leq(1+\varepsilon)\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}((S,w),R).

Lemma 5.4 gives us an appropriate sensitivity function, and Lemma 5.5 limits the search of the subspace to just a small set of points. Similarly to the case of (k,z)(k,z)-clustering, we can use this to construct weak ε\varepsilon-coresets for (k,z)(k,z)-subspace approximation. The following theorem is Theorem 5.10 from [HV20]. We state the theorem with the sensitivity function of Lemma 5.4.

Theorem 8 (Theorem 5.10 from [HV20]).

For any subset X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and ε(0,1/2)\varepsilon\in(0,1/2), let σ~\tilde{\sigma} denote the sensitivity sampling distribution from the sensitivity function of Lemma 5.4.

  • Let (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) denote the random (multi-)set 𝐒X\mathbf{S}\subset X and 𝒘:𝐒0\boldsymbol{w}\colon\mathbf{S}\to\mathbbm{R}_{\geq 0} given by, for

    m=poly((k+1)z,1/ε)m=\mathrm{poly}((k+1)^{z},1/\varepsilon)

    iterations, sampling 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} i.i.d and letting 𝒘(𝒙)=1/(mσ~(x))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(x)).

  • Then, with probability 1o(1)1-o(1) over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}), it is an ε\varepsilon-weak coreset for (k,z)(k,z)-subspace approximation of XX.

5.3 The Important Events

Similarly to the previous section, we define the important events, over the randomness in 𝚷\mathbf{\Pi} such that, if these are satisfied, then the optimum of (k,z)(k,z)-subspace approximation after dimension reduction does not decrease substantially. We first define the events, and then we prove that if the events are all satisfied, then we obtain our desired approximation.

Definition 5.6 (The Events).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and σ~\tilde{\sigma} the sensitivity sampling distribution of XX from Lemma 5.4. We consider the following experiment,

  1. We generate a sample (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) by sampling from σ~\tilde{\sigma} for m=poly(kz,1/ε)m=\mathrm{poly}(k^{z},1/\varepsilon) i.i.d iterations 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} and set 𝒘(𝒙)=1/(mσ~(𝒙))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(\boldsymbol{x})).

  2. Furthermore, we sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, which is a Johnson-Lindenstrauss map dt\mathbbm{R}^{d}\to\mathbbm{R}^{t}.

  3. We let 𝐒=𝚷(𝐒)t\mathbf{S}^{\prime}=\mathbf{\Pi}(\mathbf{S})\subset\mathbbm{R}^{t} denote the image of 𝚷\mathbf{\Pi} on 𝐒\mathbf{S}.

The events are the following:

  • 𝐄1\mathbf{E}_{1} : The weighted (multi-)set (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is a weak ε\varepsilon-coreset for (k,z)(k,z)-subspace approximation of XX in d\mathbbm{R}^{d}.

  • 𝐄2\mathbf{E}_{2} : The map 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfies the following condition. For any choice of O(k2log(k/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon) points of 𝐒\mathbf{S}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding of the subspace spanned by these points.

  • 𝐄3(β)\mathbf{E}_{3}(\beta) : Let 𝐑t\mathbf{R}^{\prime}\subset\mathbbm{R}^{t} denote the kk-dimensional subspace for optimal (k,z)(k,z)-subspace approximation of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. Then,

    costzz((𝚷(𝐒),𝒘),𝐑)βcostzz(𝚷(X),𝐑).\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{R}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{R}^{\prime}).
Lemma 5.7.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and suppose (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) and 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfy events 𝐄1\mathbf{E}_{1}, 𝐄2\mathbf{E}_{2}, and 𝐄3(β)\mathbf{E}_{3}(\beta). Then,

minRtdimRkcostz(𝚷(X),R)1β1/z(1+ε)3minRddimRkcostz(X,R).\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime})\geq\dfrac{1}{\beta^{1/z}(1+\varepsilon)^{3}}\cdot\min_{\begin{subarray}{c}R\subset\mathbbm{R}^{d}\\ \dim R\leq k\end{subarray}}\mathrm{cost}_{z}(X,R).
Proof.

Consider a fixed 𝚷\mathbf{\Pi} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) satisfying the three events of Definition 5.6. Let 𝐑t\mathbf{R}^{\prime}\subset\mathbbm{R}^{t} be the kk-dimensional subspace which minimizes costzz(𝚷(X),𝐑)\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{R}^{\prime}). Then, by event 𝐄3(β)\mathbf{E}_{3}(\beta), we have costzz((𝚷(𝐒),𝒘),𝐑)βcostzz(𝚷(X),𝐑)\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{R}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{R}^{\prime}). Now, we apply Lemma 5.5 to (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}), and we obtain a subset Q𝐒Q\subset\mathbf{S} of size O(k2log(k/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon) for which there exists a kk-dimensional subspace R′′tR^{\prime\prime}\subset\mathbbm{R}^{t} within the span of 𝚷(Q)\mathbf{\Pi}(Q) which satisfies

(x𝐒𝒘(x)𝚷(x)ρR′′(𝚷(x))2z)1/z=costz((𝚷(𝐒),𝒘),R′′)(1+ε)costz((𝚷(𝐒),𝒘),𝐑).\left(\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\mathbf{\Pi}(x)-\rho_{R^{\prime\prime}}(\mathbf{\Pi}(x))\|_{2}^{z}\right)^{1/z}=\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),R^{\prime\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{R}^{\prime}).

Note that R′′R^{\prime\prime} is a kk-dimensional subspace lying in the span of 𝚷(Q)\mathbf{\Pi}(Q). For any x𝐒x\in\mathbf{S}, we will use the fact that 𝐄2\mathbf{E}_{2} is satisfied to say that 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding of the subspace spanned by Q{x}Q\cup\{x\}. This will enable us to find a subspace UdU\subset\mathbbm{R}^{d} of dimension kk whose cost of approximating (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is at most (1+ε)costz((𝚷(𝐒),𝒘),R′′)(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),R^{\prime\prime}).

Specifically, we let v1,,vktv_{1},\dots,v_{k}\in\mathbbm{R}^{t} be orthonormal vectors which span R′′R^{\prime\prime}. Because R′′R^{\prime\prime} lies in the span of 𝚷(Q)\mathbf{\Pi}(Q), there are vectors u1,,ukdu_{1},\dots,u_{k}\in\mathbbm{R}^{d} in the span of QQ which satisfy

v=𝚷(u)tforu=yQc,yyd.v_{\ell}=\mathbf{\Pi}(u_{\ell})\in\mathbbm{R}^{t}\qquad\text{for}\qquad u_{\ell}=\sum_{y\in Q}c_{\ell,y}y\in\mathbbm{R}^{d}.

Hence, the subspace UU given by the span of u1,,uku_{1},\dots,u_{k} is a subspace of dimension at most kk lying in the span of QQ. For x𝐒x\in\mathbf{S}, we may write the coefficients γ(x)=𝚷(x),v\gamma_{\ell}(x)=\langle\mathbf{\Pi}(x),v_{\ell}\rangle, and we may express the projection ρR′′(𝚷(x))t\rho_{R^{\prime\prime}}(\mathbf{\Pi}(x))\in\mathbbm{R}^{t} as

ρR′′(𝚷(x))==1kγ(x)v=𝚷(yQ(=1kγ(x)c,y)y)=𝚷(=1kγ(x)u).\rho_{R^{\prime\prime}}(\mathbf{\Pi}(x))=\sum_{\ell=1}^{k}\gamma_{\ell}(x)\cdot v_{\ell}=\mathbf{\Pi}\left(\sum_{y\in Q}\left(\sum_{\ell=1}^{k}\gamma_{\ell}(x)c_{\ell,y}\right)\cdot y\right)=\mathbf{\Pi}\left(\sum_{\ell=1}^{k}\gamma_{\ell}(x)u_{\ell}\right).

In particular, ρR′′(𝚷(x))\rho_{R^{\prime\prime}}(\mathbf{\Pi}(x)) is the image under 𝚷\mathbf{\Pi} of a vector in the span of QQ. By event 𝐄2\mathbf{E}_{2}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding of the subspace spanned by Q{x}Q\cup\{x\}, so

xρU(x)2x=1kγ(x)u2(1+ε)𝚷(x)ρR′′(𝚷(x))2.\|x-\rho_{U}(x)\|_{2}\leq\left\|x-\sum_{\ell=1}^{k}\gamma_{\ell}(x)u_{\ell}\right\|_{2}\leq(1+\varepsilon)\|\mathbf{\Pi}(x)-\rho_{R^{\prime\prime}}(\mathbf{\Pi}(x))\|_{2}.

Combining the inequalities, we have

costz((𝐒,𝒘),U)(1+ε)costz((𝚷(𝐒),𝒘),R′′),\mathrm{cost}_{z}\left((\mathbf{S},\boldsymbol{w}),U\right)\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}\left((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),R^{\prime\prime}\right),

and finally, since (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is an ε\varepsilon-weak coreset, we obtain the desired inequality. ∎

We note that event 𝐄1\mathbf{E}_{1} will be satisfied with sufficiently high probability from Theorem 8. Furthermore, event 𝐄2\mathbf{E}_{2} is satisfied with sufficiently high probability from the following simple lemma. All that will remain is showing that event 𝐄3(β)\mathbf{E}_{3}(\beta) is satisfied.

Lemma 5.8.

Let SdS\subset\mathbbm{R}^{d} be any set of mm points and \ell\in\mathbbm{N}, and let 𝒥d,t\mathcal{J}_{d,t} denote the Johnson-Lindenstrauss map, with

tlogmε2.t\gtrsim\dfrac{\ell\log m}{\varepsilon^{2}}.

Then, with probability 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding for all subspaces spanned by \ell vectors of SS.

Proof.

There are at most (m)\binom{m}{\ell} subspaces spanned by \ell vectors of SS. If 𝚷\mathbf{\Pi} is a subspace embedding for all of them, we obtain our desired conclusion. We apply Lemma 3.3 with δ\delta set to a sufficiently small constant times 1/m1/m^{\ell} and take a union bound over all (m)\binom{m}{\ell} subspaces. ∎
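To unpack the notion used in event 𝐄_2 and Lemma 5.8: Π is an ε-subspace embedding of a subspace V if ‖Πv‖₂=(1±ε)‖v‖₂ for every v∈V, equivalently if every singular value of Π restricted to an orthonormal basis of V lies in [1−ε,1+ε]. The sketch below checks this for a single random subspace; the choice of t is a crude stand-in for the ℓ log m/ε² bound of the lemma, with constants suppressed.

import numpy as np

def subspace_embedding_distortion(Pi, Q):
    # Q: (ell, d) points whose span V we care about.  Pi is an eps-subspace embedding
    # of V iff all singular values of Pi * B lie in [1 - eps, 1 + eps], where B is an
    # orthonormal basis of V.
    B = np.linalg.qr(Q.T)[0]                      # (d, ell) orthonormal basis of span(Q)
    s = np.linalg.svd(Pi @ B, compute_uv=False)
    return max(s.max() - 1.0, 1.0 - s.min())

rng = np.random.default_rng(7)
d, ell, eps = 2000, 10, 0.5
Q = rng.normal(size=(ell, d))
t = int(np.ceil(8 * ell / eps ** 2))              # crude stand-in for t ~ ell * log(m) / eps^2
Pi = rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, d))
print(t, subspace_embedding_distortion(Pi, Q))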

5.3.1 A Bad Approximation Guarantee

Lemma 5.9 (Warm-Up Lemma).

Fix any Π𝒥d,t\Pi\in\mathcal{J}_{d,t} and let RtR^{\prime}\subset\mathbbm{R}^{t} denote the kk-dimensional subspace for optimal (k,z)(k,z)-subspace approximation of Π(X)\Pi(X) in t\mathbbm{R}^{t}. Then, with probability at least 0.990.99 over the draw of (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) as per Definition 5.6,

x𝐒𝒘(x)Π(x)ρR(Π(x))2z100costzz(Π(X),R).\displaystyle\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}\leq 100\cdot\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime}).

In other words, event 𝐄3(100)\mathbf{E}_{3}(100) holds with probability at least 0.990.99.

Proof.

Similarly to the proof of Lemma 4.8, we compute the expectation of the left-hand side of the inequality and use Markov’s inequality.

𝐄𝐒[𝐄𝒙𝐒[1σ~(𝒙)Π(x)ρR(Π(x))2z]]\displaystyle\mathop{{\bf E}\/}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}\right]\right] =𝐄𝒙σ~[1σ~(𝒙)Π(x)ρR(Π(x))2z]=costzz(Π(X),R).\displaystyle=\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\tilde{\sigma}}\left[\frac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}\right]=\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime}). (22)

Corollary 5.10.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.970.97 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11001/z(1+ε)3costz(X,R)minRtdimRkcostz(𝚷(X),R).\displaystyle\dfrac{1}{100^{1/z}(1+\varepsilon)^{3}}\cdot\mathrm{cost}_{z}(X,R)\leq\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime}).
Proof.

We sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) as per Definition 5.6. By Theorem 8 and Lemma 5.9, and a union bound, events 𝐄1\mathbf{E}_{1} and 𝐄3(100)\mathbf{E}_{3}(100) hold with probability at least 0.980.98. Event 𝐄2\mathbf{E}_{2} occurs with probability at least 0.990.99 by applying Lemma 5.8 with m=poly(kz,1/ε)m=\mathrm{poly}(k^{z},1/\varepsilon) and =O(k2log(k/ε)/ε)\ell=O(k^{2}\log(k/\varepsilon)/\varepsilon). Hence, we apply Lemma 5.7. ∎

5.3.2 Improving the Approximation

We now improve on the approximation of Corollary 5.10 in a fashion similar to that of Subsection 4.3.2. We will show that with large probability, 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds. We let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and RdR\subset\mathbbm{R}^{d} be the subspace of dimension kk for optimal (k,z)(k,z)-subspace approximation of XX in d\mathbbm{R}^{d}. As before, we let σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} be the sensitivities of XX with respect to RR (as in Lemma 5.4), and σ~\tilde{\sigma} be the sensitivity sampling distribution.

We define one more event, 𝐄4\mathbf{E}_{4}, with respect to the randomness in 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. Let 𝐃x0\mathbf{D}_{x}\in\mathbbm{R}_{\geq 0} denote the random variable given by

𝐃x=def𝚷(x)𝚷(ρR(x))2xρR(x)2.\displaystyle\mathbf{D}_{x}\stackrel{{\scriptstyle\rm def}}{{=}}\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{R}(x))\|_{2}}{\|x-\rho_{R}(x)\|_{2}}. (23)

We say event 𝐄4\mathbf{E}_{4} holds whenever

xX𝐃x2zσ(x)1002z𝔖σ,\displaystyle\sum_{x\in X}\mathbf{D}_{x}^{2z}\cdot\sigma(x)\leq 100\cdot 2^{z}\cdot\mathfrak{S}_{\sigma}, (24)

which holds with probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, similarly to the proof of Claim 4.10 and Lemma 5.4.

Lemma 5.11.

Let Π𝒥d,t\Pi\in\mathcal{J}_{d,t} be a Johnson-Lindenstrauss map where, for α>1\alpha>1, the following events hold:

  1. Guarantee from Lemma 5.1: xXΠ(x)Π(ρR(x))2zαcostzz(X,R)\sum_{x\in X}\|\Pi(x)-\Pi(\rho_{R}(x))\|_{2}^{z}\leq\alpha\cdot\mathrm{cost}_{z}^{z}(X,R).

  2. Guarantee from Corollary 5.10: letting RtR^{\prime}\subset\mathbbm{R}^{t} be the optimal (k,z)(k,z)-subspace approximation of Π(X)\Pi(X), then costzz(X,R)αcostzz(Π(X),R)\mathrm{cost}_{z}^{z}(X,R)\leq\alpha\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime}).

  3. Event 𝐄4\mathbf{E}_{4} holds.

Then, if we let (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) denote m=poly(kz,1/ε,α)m=\mathrm{poly}(k^{z},1/\varepsilon,\alpha) i.i.d draws from σ~\tilde{\sigma} and 𝐰(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)), with probability at least 0.990.99,

costzz((Π(𝐒),𝒘),R)(1+ε)costzz(Π(X),R).\mathrm{cost}_{z}^{z}((\Pi(\mathbf{S}),\boldsymbol{w}),R^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime}).
Proof.

Again, the proof is similar to that of Lemma 4.11, where we bound the variance of the estimator to apply Chebyshev’s inequality. In particular, we have

𝐕𝐚𝐫𝐒[𝐄𝒙𝐒[1σ~(𝒙)Π(𝒙)ρR(Π(𝒙))2zcostzz(Π(X),R)]]𝔖σmxX(1σ(x)Π(x)ρR(Π(x))22zcostz2z(Π(X),R)).\displaystyle\mathop{\operatorname{{\bf Var}}}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{\|\Pi(\boldsymbol{x})-\rho_{R^{\prime}}(\Pi(\boldsymbol{x}))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime})}\right]\right]\leq\frac{\mathfrak{S}_{\sigma}}{m}\sum_{x\in X}\left(\frac{1}{\sigma(x)}\cdot\dfrac{\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{2z}}{\mathrm{cost}_{z}^{2z}(\Pi(X),R^{\prime})}\right). (25)

Similarly to the proof of Lemma 4.11, we will upper bound

\displaystyle\dfrac{\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime})}

as a function of σ(x)\sigma(x) and DxD_{x} (given by (23) without boldface as Π\Pi is fixed). Toward this bound, we simplify the notation by letting yx=ρR(x)dy_{x}=\rho_{R}(x)\in\mathbbm{R}^{d} and Y={yx:xX}Y=\{y_{x}:x\in X\}. Then, since Π:dt\Pi\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} is a linear map, for any xXx\in X

supvt|Π(yx),v|zxX|Π(yx),v|zsupud|yx,u|zxX|yx,u|z.\displaystyle\sup_{v\in\mathbbm{R}^{t}}\dfrac{|\langle\Pi(y_{x}),v\rangle|^{z}}{\sum_{x^{\prime}\in X}|\langle\Pi(y_{x^{\prime}}),v\rangle|^{z}}\leq\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle y_{x},u\rangle|^{z}}{\sum_{x^{\prime}\in X}|\langle y_{x^{\prime}},u\rangle|^{z}}. (26)

In particular, if we let Mn×dM\in\mathbbm{R}^{n\times d} be the matrix given by having the rows be points yxYy_{x}\in Y, then writing Πd×t\Pi\in\mathbbm{R}^{d\times t}, we have MΠn×tM\Pi\in\mathbbm{R}^{n\times t} is the matrix whose rows are Π(yx)\Pi(y_{x}); in particular, one may compare the left- and right-hand sides of (26) by letting u=Πvdu=\Pi v\in\mathbbm{R}^{d}. Thus, we have

Π(x)ρR(Π(x))2zΠ(x)ρR(Π(yx))2z2z1(Π(x)Π(yx)2z+Π(yx)ρR(Π(yx))2z)\displaystyle\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}\leq\|\Pi(x)-\rho_{R^{\prime}}(\Pi(y_{x}))\|_{2}^{z}\leq 2^{z-1}\left(\|\Pi(x)-\Pi(y_{x})\|_{2}^{z}+\|\Pi(y_{x})-\rho_{R^{\prime}}(\Pi(y_{x}))\|_{2}^{z}\right)
2z1(Dxzxyx2zcostzz(X,R)costzz(X,R)+Π(yx)ρR(Π(yx))2zcostzz(Π(Y),R)costzz(Π(Y),R)).\displaystyle\qquad\leq 2^{z-1}\left(D_{x}^{z}\cdot\dfrac{\|x-y_{x}\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}\cdot\mathrm{cost}_{z}^{z}(X,R)+\dfrac{\|\Pi(y_{x})-\rho_{R^{\prime}}(\Pi(y_{x}))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(Y),R^{\prime})}\cdot\mathrm{cost}_{z}^{z}(\Pi(Y),R^{\prime})\right). (27)

We may now apply the triangle inequality, as well as conditions 1 and 2, and we have

costzz(Π(Y),R)\displaystyle\mathrm{cost}_{z}^{z}(\Pi(Y),R^{\prime}) 2z1(costzz(Π(X),R)+xXΠ(x)Π(yx)2z)\displaystyle\leq 2^{z-1}\left(\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime})+\sum_{x\in X}\|\Pi(x)-\Pi(y_{x})\|_{2}^{z}\right)
2z1(costzz(Π(X),R)+αcostzz(X,R))2z1(1+α2)costzz(Π(X),R).\displaystyle\leq 2^{z-1}\left(\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime})+\alpha\cdot\mathrm{cost}_{z}^{z}(X,R)\right)\leq 2^{z-1}(1+\alpha^{2})\cdot\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime}). (28)

Finally, we note that, similarly to the proof of Lemma 5.4,

Π(yx)ρR(Π(yx))2zcostzz(Π(Y),R)supvt|Π(yx),v|zxX|Π(yx),v|z.\displaystyle\frac{\|\Pi(y_{x})-\rho_{R^{\prime}}(\Pi(y_{x}))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(Y),R^{\prime})}\leq\sup_{v\in\mathbbm{R}^{t}}\dfrac{|\langle\Pi(y_{x}),v\rangle|^{z}}{\sum_{x^{\prime}\in X}|\langle\Pi(y_{x^{\prime}}),v\rangle|^{z}}. (29)

Continuing to upper-bound the left-hand side of (27) by plugging in (26), (29) and (28),

\displaystyle\dfrac{\|\Pi(x)-\rho_{R^{\prime}}(\Pi(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),R^{\prime})}\leq 2^{z-1}\left(D_{x}^{z}\alpha\cdot\dfrac{\|x-y_{x}\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,R)}+2^{z-1}(\alpha^{2}+1)\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle y_{x},u\rangle|^{z}}{\sum_{x^{\prime}\in X}|\langle y_{x^{\prime}},u\rangle|^{z}}\right)
\displaystyle\leq(D_{x}^{z}+1)\cdot(\alpha^{2}+1)\cdot\sigma(x).

In particular, using event 𝐄4\mathbf{E}_{4} in the final step, the bound on the variance in (25) is at most

\dfrac{\mathfrak{S}_{\sigma}}{m}\cdot(\alpha^{2}+1)^{2}\sum_{x\in X}(D_{x}^{z}+1)^{2}\sigma(x)\leq\dfrac{2(\alpha^{2}+1)^{2}}{m}\left(\mathfrak{S}_{\sigma}\sum_{x\in X}D_{x}^{2z}\sigma(x)+\mathfrak{S}_{\sigma}^{2}\right)\leq\dfrac{\mathfrak{S}_{\sigma}^{2}}{m}\cdot 2^{z+9}\cdot(\alpha^{2}+1)^{2},

so letting m=poly(kz,1/ε,α)m=\mathrm{poly}(k^{z},1/\varepsilon,\alpha) gives the desired bound on the variance. ∎

Corollary 5.12.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let RdR\subset\mathbbm{R}^{d} be the subspace for optimal (k,z)(k,z)-subspace approximation of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.920.92 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

1(1+ε)3+1/zcostz(X,R)minRtdimRkcostz(𝚷(X),R).\dfrac{1}{(1+\varepsilon)^{3+1/z}}\cdot\mathrm{cost}_{z}(X,R)\leq\min_{\begin{subarray}{c}R^{\prime}\subset\mathbbm{R}^{t}\\ \dim R^{\prime}\leq k\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),R^{\prime}).
Proof.

We sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) as per Definition 5.6. Again, Theorem 8 guarantees that 𝐄1\mathbf{E}_{1} occurs with probability at least 0.990.99 over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}). By Lemma 5.1, Corollary 5.10 and (24), conditions 1, 2, and 3 of Lemma 5.11 are satisfied with probability at least 0.950.95, so we may apply Lemma 5.11 and conclude that 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds with probability at least 0.940.94. Finally, event 𝐄2\mathbf{E}_{2} holds with probability at least 0.990.99 by Lemma 5.8. Taking a union bound and applying Lemma 5.7 gives the desired bound. ∎

6 kk-Flat Approximation

In the (k,z)(k,z)-flat approximation problem, we consider a subspace RdR\subset\mathbbm{R}^{d} of dimension at most kk, which we may encode by a collection of at most kk orthonormal vectors r1,,rkRr_{1},\dots,r_{k}\in R, as well as a translation vector τd\tau\in\mathbbm{R}^{d}. The kk-flat specified by RR and τ\tau is given by the affine subspace

F={x+τd:xR}.F=\left\{x+\tau\in\mathbbm{R}^{d}:x\in R\right\}.

We let ρF:dd\rho_{F}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{d} denote the map which sends each xdx\in\mathbbm{R}^{d} to its closest point on FF, and we note that

\displaystyle\rho_{F}(x)=\mathop{\mathrm{argmin}}_{y\in F}\|x-y\|_{2}^{2}=\tau+\sum_{i=1}^{k}\langle x-\tau,r_{i}\rangle r_{i}.

For any XdX\subset\mathbbm{R}^{d}, we let

costzz(X,F)=defxXxρF(x)2z.\mathrm{cost}_{z}^{z}(X,F)\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{x\in X}\left\|x-\rho_{F}(x)\right\|_{2}^{z}.
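A short numpy sketch of ρ_F and cost_z^z(X,F) for a k-flat given by an orthonormal basis R and translation τ; for z=2, the optimal flat passes through the mean and is spanned by the top principal directions (the classical PCA fact, used here only as a sanity check).

import numpy as np

def flat_cost(X, R, tau, z):
    # R: (k, d) orthonormal rows; tau: translation.  rho_F(x) = tau + sum_i <x - tau, r_i> r_i.
    proj = tau + ((X - tau) @ R.T) @ R
    return np.sum(np.linalg.norm(X - proj, axis=1) ** z)

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 20)) + 3.0              # data not centered at the origin
k = 3
tau = X.mean(axis=0)
R = np.linalg.svd(X - tau, full_matrices=False)[2][:k]   # top principal directions
print(flat_cost(X, R, tau, z=2), flat_cost(X, R, np.zeros(20), z=2))  # translating to the mean helps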

In this section, we show that we may find the optimal kk-flat approximation after applying a Johnson-Lindenstrauss map. The proof will be almost exactly the same as the (k,z)(k,z)-subspace approximation problem. Indeed, it only remains to incorporate a translation vector.

Theorem 9 (Johnson-Lindenstrauss for kk-Flat Approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let FdF\subset\mathbbm{R}^{d} denote the optimal (k,z)(k,z)-flat approximation of XX. For any ε(0,1)\varepsilon\in(0,1), suppose we let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps where

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.90.9 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11+εcostz(X,F)minF k-flatin tcostz(𝚷(X),F)(1+ε)costz(X,F).\dfrac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}(X,F)\leq\min_{\begin{subarray}{c}F^{\prime}\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),F^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,F).

6.1 Easy Direction: Optimum Cost Does Not Increase

Lemma 6.1.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points and let FdF\subset\mathbbm{R}^{d} be the optimal (k,z)(k,z)-flat approximation of XX. We let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps. If tz/ε2t\gtrsim z/\varepsilon^{2}, then with probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

xX𝚷(x)𝚷(ρF(x))2z(1+ε)costzz(X,F),\sum_{x\in X}\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{F}(x))\|_{2}^{z}\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(X,F),

and hence,

minF k-flatin tcostz(𝚷(X),F)(1+ε)costz(X,F).\min_{\begin{subarray}{c}F^{\prime}\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),F^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,F).

The proof follows in a similar fashion to Lemma 4.1 and Lemma 5.1. In particular, there is a natural definition of a kk-flat 𝚷(F)t\mathbf{\Pi}(F)\subset\mathbbm{R}^{t}, and the proof proceeds by upper bounding the expected dilation of 𝚷(x)𝚷(ρF(x))2z\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{F}(x))\|_{2}^{z}.

6.2 Hard Direction: Optimum Cost Does Not Decrease

6.2.1 Preliminaries

The proof in this section follows similarly to that of (k,z)(k,z)-subspace approximation.

Definition 6.2 (Weak Coresets for (k,z)(k,z)-flat approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be a set of points. A weak ε\varepsilon-coreset of XX for (k,z)(k,z)-flat approximation is a weighted subset SdS\subset\mathbbm{R}^{d} of points with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0} such that,

11+εminF k-flatin dcostz(X,F)minF k-flatin dcostz((S,w),F)(1+ε)minF k-flatin dcostz(X,F)\dfrac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}F\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,F)\leq\min_{\begin{subarray}{c}F\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}((S,w),F)\leq(1+\varepsilon)\cdot\min_{\begin{subarray}{c}F\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,F)
Definition 6.3 (Sensitivities).

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N} and z1z\geq 1. A sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} for (k,z)(k,z)-flat approximation in d\mathbbm{R}^{d} is a function satisfying that, for all xXx\in X,

supFdk-flatxρF(x)2zcostzz(X,F)σ(x).\displaystyle\sup_{\begin{subarray}{c}F\subset\mathbbm{R}^{d}\\ \text{$k$-flat}\end{subarray}}\dfrac{\|x-\rho_{F}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,F)}\leq\sigma(x).

The total sensitivity of the sensitivity function σ\sigma is given by

𝔖σ=xXσ(x).\mathfrak{S}_{\sigma}=\sum_{x\in X}\sigma(x).

For a sensitivity function, we let σ~\tilde{\sigma} denote the sensitivity sampling distribution, supported on XX, which samples xXx\in X with probability proportional to σ(x)\sigma(x).

The sensitivity function we use here generalizes that of the previous section. In particular, the proof will follow similarly, and we will defer to the arguments in the previous section.

Lemma 6.4.

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N} with k<dk<d and z1z\geq 1. Suppose FdF\subset\mathbbm{R}^{d} is the optimal (k,z)(k,z)-flat approximation of XX in d\mathbbm{R}^{d}. Then, the function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} given by

σ(x)=2z1xρF(x)2zcostzz(X,F)+22z1supudϕ|ρF(x),uϕ|zxX|ρF(x),uϕ|z\sigma(x)=2^{z-1}\cdot\dfrac{\|x-\rho_{F}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,F)}+2^{2z-1}\cdot\sup_{\begin{subarray}{c}u\in\mathbbm{R}^{d}\\ \phi\in\mathbbm{R}\end{subarray}}\dfrac{|\langle\rho_{F}(x),u\rangle-\phi|^{z}}{\sum_{x^{\prime}\in X}|\langle\rho_{F}(x^{\prime}),u\rangle-\phi|^{z}}

is a sensitivity function for (k,z)(k,z)-flat approximation of XX in d\mathbbm{R}^{d}, satisfying

𝔖σ2z1+22z1(k+2)1+z.\mathfrak{S}_{\sigma}\leq 2^{z-1}+2^{2z-1}(k+2)^{1+z}.
Proof.

Consider any kk-flat FdF^{\prime}\subset\mathbbm{R}^{d}, given by a subspace RdR\subset\mathbbm{R}^{d} of dimension at most kk, and a translation τd\tau\in\mathbbm{R}^{d}. As in the proof of Lemma 5.4,

xρF(x)2zcostzz(X,F)2z1xρF(x)2zcostzz(X,F)+22z1ρF(x)ρF(ρF(x))2zcostzz(ρF(X),F).\displaystyle\dfrac{\|x-\rho_{F^{\prime}}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,F^{\prime})}\leq 2^{z-1}\cdot\dfrac{\|x-\rho_{F}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,F)}+2^{2z-1}\cdot\dfrac{\|\rho_{F}(x)-\rho_{F^{\prime}}(\rho_{F}(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\rho_{F}(X),F^{\prime})}.

We now have that for any YdY\subset\mathbbm{R}^{d}, and any yYy\in Y,

supFdk-flatyρF(y)2zcostzz(Y,F)supτdsupud|yτ,u|zyY|yτ,u|z.\displaystyle\sup_{\begin{subarray}{c}F^{\prime}\subset\mathbbm{R}^{d}\\ \text{$k$-flat}\end{subarray}}\dfrac{\|y-\rho_{F^{\prime}}(y)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(Y,F^{\prime})}\leq\sup_{\tau\in\mathbbm{R}^{d}}\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle y-\tau,u\rangle|^{z}}{\sum_{y^{\prime}\in Y}|\langle y^{\prime}-\tau,u\rangle|^{z}}.

Finally, for each yYdy\in Y\subset\mathbbm{R}^{d}, we may append an additional coordinate and consider yd+1y^{*}\in\mathbbm{R}^{d+1}, whose first dd coordinates agree with yy and whose (d+1)(d+1)-th entry is 11. Then, by linearity

\sup_{\tau\in\mathbbm{R}^{d}}\sup_{u\in\mathbbm{R}^{d}}\dfrac{|\langle y-\tau,u\rangle|^{z}}{\sum_{y^{\prime}\in Y}|\langle y^{\prime}-\tau,u\rangle|^{z}}=\sup_{v\in\mathbbm{R}^{d+1}}\dfrac{|\langle y^{*},v\rangle|^{z}}{\sum_{y^{\prime}\in Y}|\langle y^{\prime*},v\rangle|^{z}},

and the bound on the total sensitivity follows from Lemma 5.4. ∎
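The lifting used in the proof (appending a constant coordinate so that affine functionals become linear ones) is easy to see in code. For z=2 (an assumption of this sketch, as in the earlier subspace sketch) the resulting supremum is again a leverage score, now of the lifted points, whose total is at most the dimension of their span, at most k+1 ≤ k+2, in line with the bound of Lemma 6.4.

import numpy as np

rng = np.random.default_rng(9)
k, d = 3, 10
R = np.linalg.qr(rng.normal(size=(d, k)))[0].T          # (k, d) orthonormal rows
tau = rng.normal(size=d)
Y = tau + rng.normal(size=(150, k)) @ R                 # stands for {rho_F(x) : x in X}, points on a k-flat

Ystar = np.hstack([Y, np.ones((len(Y), 1))])            # the lifting y -> y* = (y, 1) in R^{d+1}
lev = np.einsum('ij,ji->i', Ystar @ np.linalg.pinv(Ystar.T @ Ystar), Ystar.T)
print(lev.sum())    # equals rank(Ystar) <= k + 1 <= k + 2, the dimension driving Lemma 6.4's bound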

In the (k,z)(k,z)-subspace approximation section, we used a lemma (Lemma 5.5) to narrow down the approximately optimal subspaces to those spanned by at most O(k2log(k/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon) points. Here, we use a similar lemma in order to find an approximately optimal translation vector τd\tau\in\mathbbm{R}^{d}, which is spanned by a small subset of points.

Lemma 6.5 (Lemma 3.3 [SV07]).

Let d,kd,k\in\mathbbm{N}, and consider a weighted set of points SdS\subset\mathbbm{R}^{d} with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0}, as well as ε(0,1)\varepsilon\in(0,1) and z1z\geq 1. Suppose FdF\subset\mathbbm{R}^{d} is the optimal (k,z)(k,z)-flat approximation of (S,w)(S,w), encoded by a kk-dimensional subspace RdR\subset\mathbbm{R}^{d} and translation vector τd\tau\in\mathbbm{R}^{d}. There exists a subset QSQ\subset S of size O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon) and a point τconv(Q)\tau^{\prime}\in\mathrm{conv}(Q) such that the kk-flat

F={τ+yd:yR}F^{\prime}=\left\{\tau^{\prime}+y\in\mathbbm{R}^{d}:y\in R\right\}

satisfies

costz((S,w),F)(1+ε)costz((S,w),F).\mathrm{cost}_{z}((S,w),F^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((S,w),F).
Theorem 10 (ε\varepsilon-Weak Coresets for kk-Flats via Sensitivity Sampling).

For any subset X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and ε(0,1/2)\varepsilon\in(0,1/2), let σ~\tilde{\sigma} denote the sensitivity sampling distribution given by the sensitivity function of Lemma 6.4.

  • Let (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) denote the random (multi-)set 𝐒X\mathbf{S}\subset X and 𝒘:𝐒0\boldsymbol{w}\colon\mathbf{S}\to\mathbbm{R}_{\geq 0} given by, for

    m=poly((k+2)z,1/ε)m=\mathrm{poly}((k+2)^{z},1/\varepsilon)

    iterations, sampling 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} i.i.d and letting 𝒘(𝒙)=1/(mσ~(x))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(x)).

  • Then, with probability 1o(1)1-o(1) over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}), it is an ε\varepsilon-weak coreset for (k,z)(k,z)-flat approximation of XX.

6.3 The Important Events

The important events we consider mirror those of the subspace approximation problem. The only event which changes is 𝐄2\mathbf{E}_{2}, where we now require 𝚷\mathbf{\Pi} to be an ε\varepsilon-subspace embedding for all subsets of O(k2log(k/ε)/ε)+O(log(1/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon)+O(\log(1/\varepsilon)/\varepsilon) points from 𝐒\mathbf{S}. This will allow us to incorporate the translation τ\tau^{\prime} from Lemma 6.5.

Definition 6.6 (The Events).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and σ~\tilde{\sigma} the sensitivity sampling distribution of XX of Lemma 6.4. We consider the following experiment,

  1. 1.

    We generate a sample (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) by sampling from σ~\tilde{\sigma} for m=poly(kz,1/ε)m=\mathrm{poly}(k^{z},1/\varepsilon) i.i.d iterations 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} and set 𝒘(𝒙)=1/(mσ~(𝒙))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(\boldsymbol{x})).

  2. 2.

    Furthermore, we sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, which is a Johnson-Lindenstrauss map dt\mathbbm{R}^{d}\to\mathbbm{R}^{t}.

  3. 3.

    We let 𝐒=𝚷(𝐒)t\mathbf{S}^{\prime}=\mathbf{\Pi}(\mathbf{S})\subset\mathbbm{R}^{t} denote the image of 𝚷\mathbf{\Pi} on 𝐒\mathbf{S}.

The events are the following:

  • 𝐄1\mathbf{E}_{1} : The weighted (multi-)set (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is a weak ε\varepsilon-coreset for (k,z)(k,z)-flat approximation of XX in d\mathbbm{R}^{d}.

  • 𝐄2\mathbf{E}_{2} : The map 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfies the following condition. For any choice of O(k2log(k/ε)/ε)+O(log(1/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon)+O(\log(1/\varepsilon)/\varepsilon) points of 𝐒\mathbf{S}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding of the subspace spanned by these points.

  • 𝐄3(β)\mathbf{E}_{3}(\beta) : Let 𝐅t\mathbf{F}^{\prime}\subset\mathbbm{R}^{t} denote the optimal (k,z)(k,z)-flat approximation of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. Then,

    costzz((𝚷(𝐒),𝒘),𝐅)βcostzz(𝚷(X),𝐅).\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{F}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{F}^{\prime}).
Lemma 6.7.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and suppose (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) and 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfy events 𝐄1\mathbf{E}_{1}, 𝐄2\mathbf{E}_{2}, and 𝐄3(β)\mathbf{E}_{3}(\beta). Then,

minF k-flatin tcostz(𝚷(X),F)1β1/z(1+ε)4minF k-flatin dcostz(X,F).\min_{\begin{subarray}{c}F^{\prime}\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),F^{\prime})\geq\dfrac{1}{\beta^{1/z}(1+\varepsilon)^{4}}\cdot\min_{\begin{subarray}{c}F\text{ $k$-flat}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,F).
Proof.

Consider a fixed 𝚷\mathbf{\Pi} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) satisfying the three events of Definition 6.6. Let FtF^{\prime}\subset\mathbbm{R}^{t} be the kk-flat which minimizes costzz(𝚷(X),F)\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),F^{\prime}). Suppose that FF^{\prime} is specified by a kk-dimensional subspace RR^{\prime} and a translation τ\tau^{\prime}. Then, by event 𝐄3(β)\mathbf{E}_{3}(\beta), we have costzz((𝚷(𝐒),𝒘),F)βcostzz(𝚷(X),F)\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),F^{\prime}). Now, we apply Lemma 6.5 to (𝚷(𝐒),𝒘)(\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}), and we obtain a subset Q𝐒Q\subset\mathbf{S} of size O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon) for which there exists a translation vector τ′′t\tau^{\prime\prime}\in\mathbbm{R}^{t} within conv(𝚷(Q))\mathrm{conv}(\mathbf{\Pi}(Q)) such that the kk-flat F′′F^{\prime\prime} given by τ′′\tau^{\prime\prime} and RR^{\prime} satisfies

(x𝐒𝒘(x)𝚷(x)ρF′′(𝚷(x))2z)1/z=costz((𝚷(𝐒),𝒘),F′′)(1+ε)costz((𝚷(𝐒),𝒘),F).\left(\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\mathbf{\Pi}(x)-\rho_{F^{\prime\prime}}(\mathbf{\Pi}(x))\|_{2}^{z}\right)^{1/z}=\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime}).

Furthermore, applying Lemma 5.5 to the weighted vectors (𝚷(𝐒)τ′′,𝒘)(\mathbf{\Pi}(\mathbf{S})-\tau^{\prime\prime},\boldsymbol{w}) (here, we are using the short-hand 𝚷(𝐒)τ′′=def{𝚷(x)τ′′t:x𝐒}\mathbf{\Pi}(\mathbf{S})-\tau^{\prime\prime}\stackrel{{\scriptstyle\rm def}}{{=}}\left\{\mathbf{\Pi}(x)-\tau^{\prime\prime}\in\mathbbm{R}^{t}:x\in\mathbf{S}\right\}), there exists a subset Q𝚷(𝐒)τ′′Q^{\prime}\subset\mathbf{\Pi}(\mathbf{S})-\tau^{\prime\prime} of size O(k2log(k/ε)/ε)O(k^{2}\log(k/\varepsilon)/\varepsilon) and a kk-dimensional subspace R′′tR^{\prime\prime}\subset\mathbbm{R}^{t} within the span of QQ^{\prime} such that the kk-flat F′′′F^{\prime\prime\prime} specified by R′′R^{\prime\prime} and τ′′\tau^{\prime\prime} satisfies

(x𝐒𝒘(x)𝚷(x)ρF′′′(𝚷(x))2z)1/z=costz((𝚷(𝐒),𝒘),F′′′)(1+ε)2costz((𝚷(𝐒),𝒘),F).\left(\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\mathbf{\Pi}(x)-\rho_{F^{\prime\prime\prime}}(\mathbf{\Pi}(x))\|_{2}^{z}\right)^{1/z}=\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime\prime\prime})\leq(1+\varepsilon)^{2}\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime}).

Recall that (i) R′′R^{\prime\prime} is a kk-dimensional subspace lying in the span of 𝚷(Q)\mathbf{\Pi}(Q^{\prime}), (ii) τ′′t\tau^{\prime\prime}\in\mathbbm{R}^{t} is within conv(𝚷(Q))\mathrm{conv}(\mathbf{\Pi}(Q)), and (iii) for any x𝐒x\in\mathbf{S}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding of the span of QQ{x}Q\cup Q^{\prime}\cup\{x\}. Similarly to Lemma 5.7, we may find a kk-flat UU such that for every x𝐒x\in\mathbf{S},

xρU(x)2(1+ε)𝚷(x)ρF′′′(𝚷(x))2,\|x-\rho_{U}(x)\|_{2}\leq(1+\varepsilon)\|\mathbf{\Pi}(x)-\rho_{F^{\prime\prime\prime}}(\mathbf{\Pi}(x))\|_{2},

and hence

costz((𝐒,𝒘),U)(1+ε)costz((𝚷(𝐒),𝒘),F′′′).\mathrm{cost}_{z}((\mathbf{S},\boldsymbol{w}),U)\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),F^{\prime\prime\prime}).

Finally, since (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is an ε\varepsilon-weak coreset, we obtain the desired inequality. ∎

As in the previous section, events 𝐄1\mathbf{E}_{1} and 𝐄2\mathbf{E}_{2} hold with sufficiently high probability. All that remains is showing that 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds with sufficiently high probability. We proceed in a similar fashion, where we first show a loose approximation guarantee, and later improve on it.

Lemma 6.8.

Fix any Π𝒥d,t\Pi\in\mathcal{J}_{d,t} and let FtF^{\prime}\subset\mathbbm{R}^{t} denote the kk-flat for optimal (k,z)(k,z)-flat approximation of Π(X)\Pi(X) in t\mathbbm{R}^{t}. Then with probability at least 0.990.99 over the draw of (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) as per Definition 6.6,

x𝐒𝒘(x)Π(x)ρF(Π(x))2z100costzz(Π(X),F).\sum_{x\in\mathbf{S}}\boldsymbol{w}(x)\cdot\|\Pi(x)-\rho_{F^{\prime}}(\Pi(x))\|_{2}^{z}\leq 100\cdot\mathrm{cost}_{z}^{z}(\Pi(X),F^{\prime}).

In other words, event 𝐄3(100)\mathbf{E}_{3}(100) holds with probability at least 0.990.99.

Corollary 6.9.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.970.97 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11001/z(1+ε)4minFdk-flatcostz(X,F)minFtk-flatcostz(𝚷(X),F).\dfrac{1}{100^{1/z}(1+\varepsilon)^{4}}\cdot\min_{\begin{subarray}{c}F\subset\mathbbm{R}^{d}\\ \text{$k$-flat}\end{subarray}}\mathrm{cost}_{z}(X,F)\leq\min_{\begin{subarray}{c}F^{\prime}\subset\mathbbm{R}^{t}\\ \text{$k$-flat}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),F^{\prime}).

6.3.1 Improving the approximation

The improvement of the approximation follows from upper bounding the variance, as in the (k,z)(k,z)-clustering and (k,z)(k,z)-subspace approximation problems. In particular, we show that 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds. Fix X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and let FdF\subset\mathbbm{R}^{d} be the optimal (k,z)(k,z)-flat approximation of XX in d\mathbbm{R}^{d}. The sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} specified in Lemma 6.4 specifies the sensitivity sampling distribution σ~\tilde{\sigma}.

We let 𝐄4\mathbf{E}_{4} denote the following event with respect to the randomness in 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. For each xXx\in X, we let 𝐃x0\mathbf{D}_{x}\in\mathbbm{R}_{\geq 0} denote the random variable

𝐃x=def𝚷(x)𝚷(ρF(x))2xρF(x)2,\mathbf{D}_{x}\stackrel{{\scriptstyle\rm def}}{{=}}\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{F}(x))\|_{2}}{\|x-\rho_{F}(x)\|_{2}},

and, as in (23) and (24), we let 𝐄4\mathbf{E}_{4} denote the event, which occurs with probability at least 0.990.99, that

xX𝐃x2zσ(x)1002z𝔖σ.\sum_{x\in X}\mathbf{D}_{x}^{2z}\cdot\sigma(x)\leq 100\cdot 2^{z}\cdot\mathfrak{S}_{\sigma}.
Lemma 6.10.

Let Π𝒥d,t\Pi\in\mathcal{J}_{d,t} be a Johnson-Lindenstrauss map where, for α>1\alpha>1, the following events hold:

  1. 1.

    Guarantee from Lemma 6.1: xXΠ(x)Π(ρF(x))2zαcostzz(X,F)\sum_{x\in X}\|\Pi(x)-\Pi(\rho_{F}(x))\|_{2}^{z}\leq\alpha\cdot\mathrm{cost}_{z}^{z}(X,F).

  2. 2.

    Guarantee from Corollary 6.9: letting FtF^{\prime}\subset\mathbbm{R}^{t} be the optimal (k,z)(k,z)-flat approximation of Π(X)\Pi(X), then costzz(X,F)αcostzz(Π(X),F)\mathrm{cost}_{z}^{z}(X,F)\leq\alpha\mathrm{cost}_{z}^{z}(\Pi(X),F^{\prime}).

  3. 3.

    Event 𝐄4\mathbf{E}_{4} holds.

Then, if we let (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) denote m=poly(kz,1/ε,α)m=\mathrm{poly}(k^{z},1/\varepsilon,\alpha) i.i.d draws from σ~\tilde{\sigma} and 𝐰(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)), with probability at least 0.990.99,

costzz((Π(𝐒),𝒘),F)(1+ε)costzz(Π(X),F).\mathrm{cost}_{z}^{z}((\Pi(\mathbf{S}),\boldsymbol{w}),F^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(\Pi(X),F^{\prime}).
Proof.

We similarly bound the variance of

\displaystyle\operatorname{{\bf Var}}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\dfrac{\|\Pi(\boldsymbol{x})-\rho_{F^{\prime}}\left(\Pi(\boldsymbol{x})\right)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),F^{\prime})}\right]\right]\leq\dfrac{\mathfrak{S}_{\sigma}}{m}\sum_{x\in X}\left(\dfrac{1}{\sigma(x)}\cdot\dfrac{\|\Pi(x)-\rho_{F^{\prime}}(\Pi(x))\|_{2}^{2z}}{\mathrm{cost}_{z}^{2z}(\Pi(X),F^{\prime})}\right).

It is not hard to show, as in the proof of Lemma 5.11, that writing yx=ρF(x)dy_{x}=\rho_{F}(x)\in\mathbbm{R}^{d} and Y={yx:xX}Y=\{y_{x}:x\in X\}, that

Π(x)ρF(Π(x))2zcostzz(Π(X),F)2z1α𝐃xzxρF(x)2zcostzz(X,F)+22z2(1+α2)supvtμ|Π(yx),vμ|zxX|Π(yx),vμ|z,\displaystyle\dfrac{\|\Pi(x)-\rho_{F^{\prime}}(\Pi(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),F^{\prime})}\leq 2^{z-1}\alpha\mathbf{D}_{x}^{z}\cdot\dfrac{\|x-\rho_{F}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,F)}+2^{2z-2}(1+\alpha^{2})\sup_{\begin{subarray}{c}v\in\mathbbm{R}^{t}\\ \mu\in\mathbbm{R}\end{subarray}}\dfrac{|\langle\Pi(y_{x}),v\rangle-\mu|^{z}}{\sum_{x^{\prime}\in X}|\langle\Pi(y_{x^{\prime}}),v\rangle-\mu|^{z}},

and similarly to before, we have

\displaystyle\sup_{\begin{subarray}{c}v\in\mathbbm{R}^{t}\\ \mu\in\mathbbm{R}\end{subarray}}\dfrac{|\langle\Pi(y_{x}),v\rangle-\mu|^{z}}{\sum_{x^{\prime}\in X}|\langle\Pi(y_{x^{\prime}}),v\rangle-\mu|^{z}}\leq\sup_{\begin{subarray}{c}u\in\mathbbm{R}^{d}\\ \phi\in\mathbbm{R}\end{subarray}}\dfrac{|\langle y_{x},u\rangle-\phi|^{z}}{\sum_{x^{\prime}\in X}|\langle y_{x^{\prime}},u\rangle-\phi|^{z}}.

This implies that the variance is at most

24z2𝔖σα4mxX𝐃x2zσ(x)ε22^{4z-2}\cdot\dfrac{\mathfrak{S}_{\sigma}\alpha^{4}}{m}\cdot\sum_{x\in X}\mathbf{D}_{x}^{2z}\cdot\sigma(x)\leq\varepsilon^{2}

by setting m=poly((k+2)z,1/ε,α)m=\mathrm{poly}((k+2)^{z},1/\varepsilon,\alpha) large enough, using that 𝐄4\mathbf{E}_{4} holds; the lemma then follows from Chebyshev’s inequality. ∎

Corollary 6.11.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let FdF\subset\mathbbm{R}^{d} be the optimal (k,z)(k,z)-flat approximation of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tzk2polylog(k/ε)ε3.t\gtrsim\dfrac{z\cdot k^{2}\cdot\mathrm{polylog}(k/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.920.92 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

1(1+ε)4+1/zcostz(X,F)minFtk-flatcostz(𝚷(X),F).\dfrac{1}{(1+\varepsilon)^{4+1/z}}\cdot\mathrm{cost}_{z}(X,F)\leq\min_{\begin{subarray}{c}F^{\prime}\subset\mathbbm{R}^{t}\\ \text{$k$-flat}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),F^{\prime}).

7 kk-Line Approximation

In the (k,z)(k,z)-line approximation problem, we consider a collection of kk lines in d\mathbbm{R}^{d}. A line is encoded by a vector vdv\in\mathbbm{R}^{d} and a unit vector uSd1u\in S^{d-1}, where we will write

(v,u)={v+tud:t}.\ell(v,u)=\left\{v+t\cdot u\in\mathbbm{R}^{d}:t\in\mathbbm{R}\right\}.

For a single line \ell encoded by vv and uu, we write ρ:dd\rho_{\ell}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{d} for the orthogonal projection of a point onto \ell, i.e., the closest vector which lies on the line, where

ρ(x)=argminyxy22=v+xv,uu.\rho_{\ell}(x)=\mathop{\mathrm{argmin}}_{y\in\ell}\|x-y\|_{2}^{2}=v+\langle x-v,u\rangle u.
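For concreteness, the following is a minimal Python sketch (ours, purely illustrative and not code from the paper) of this projection; it assumes points are NumPy arrays and that the direction is a unit vector.

```python
import numpy as np

def project_onto_line(x, v, u):
    """Orthogonal projection of x onto the line {v + t*u : t in R},
    assuming u is a unit vector: rho_ell(x) = v + <x - v, u> * u."""
    return v + np.dot(x - v, u) * u
```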

For any set of kk lines, L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\}, and a point xXx\in X, we write

costzz(x,L)=minLxρ(x)2z,\mathrm{cost}_{z}^{z}(x,L)=\min_{\ell\in L}\|x-\rho_{\ell}(x)\|_{2}^{z},

and for any dataset XdX\subset\mathbbm{R}^{d} and set of lines LL, we consider the map :XL\ell\colon X\to L which sends a point xx to its nearest line in LL. Then, we write

costzz(X,L)=xXxρ(x)(x)2z,\mathrm{cost}_{z}^{z}(X,L)=\sum_{x\in X}\|x-\rho_{\ell(x)}(x)\|_{2}^{z},

as the cost of representing the points in XX according to the kk lines in LL. In this section, we show that the cost of the optimal (k,z)(k,z)-line approximation is preserved after applying a Johnson-Lindenstrauss map. Specifically, we prove:

Theorem 11 (Johnson-Lindenstrauss for (k,z)(k,z)-Line Approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} denote a set of lines in d\mathbbm{R}^{d} for the optimal (k,z)(k,z)-line approximation of XX. For any ε(0,1)\varepsilon\in(0,1), suppose we let 𝒥d,t\mathcal{J}_{d,t} be a distribution over Johnson-Lindenstrauss maps where

tkloglogn+z+log(1/ε)ε3.t\gtrsim\dfrac{k\log\log n+z+\log(1/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.90.9 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11+εcostz(X,L)minL k linesin tcostz(𝚷(X),L)(1+ε)costz(X,L).\dfrac{1}{1+\varepsilon}\cdot\mathrm{cost}_{z}(X,L)\leq\min_{\begin{subarray}{c}L^{\prime}\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),L^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,L).

7.1 Easy Direction: Optimum Cost Does Not Increase

Lemma 7.1.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points and let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} be a set of kk lines in d\mathbbm{R}^{d} for optimal (k,z)(k,z)-line approximation of XX, and for each xXx\in X, let (x)L\ell(x)\in L be the line assigned to xx. We let 𝒥d,t\mathcal{J}_{d,t} be the distribution over Johnson-Lindenstrauss maps. If tz/ε2t\gtrsim z/\varepsilon^{2}, then with probability at least 0.990.99 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

(xX𝚷(x)𝚷(ρ(x)(x))2z)1/z(1+ε)costz(X,L),\left(\sum_{x\in X}\|\mathbf{\Pi}(x)-\mathbf{\Pi}(\rho_{\ell(x)}(x))\|_{2}^{z}\right)^{1/z}\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,L),

and hence,

minL k linesin tcostz(𝚷(X),L)(1+ε)costz(X,L).\min_{\begin{subarray}{c}L^{\prime}\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),L^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}(X,L).

By now, the proof of the above lemma is straightforward. For each set of kk lines L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} in d\mathbbm{R}^{d}, there is an analogous definition of kk lines 𝚷(L)\mathbf{\Pi}(L) in t\mathbbm{R}^{t}. Hence, we use the same lemma on moments of Gaussian vectors as in previous sections to upper bound the cost costz(𝚷(X),𝚷(L))\mathrm{cost}_{z}(\mathbf{\Pi}(X),\mathbf{\Pi}(L)).

7.2 Hard Direction: Optimum Cost Does Not Decrease

7.2.1 Preliminaries

At a high level, we proceed with the same argument as in previous sections: we consider a sensitivity function for (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}, and use it to build a weak coreset, as well as argue that sensitivity sampling is a low-variance estimator of the optimal (k,z)(k,z)-line approximation in the projected space. The proof in this section will be significantly more complicated than in the previous sections. Defining the appropriate sensitivity function, which will give a low-variance estimator in the projected space, is considerably more difficult than in the expressions of Lemmas 4.4, 5.4, and 6.4. For this reason, we will proceed by assuming access to a sensitivity function which we define later in the section.

Definition 7.2 (Weak Coresets for (k,z)(k,z)-Line Approximation).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be a set of points. A weak ε\varepsilon-coreset of XX for (k,z)(k,z)-line approximation is a weighted subset SdS\subset\mathbbm{R}^{d} of points with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0} such that

11+εminL k linesin dcostz(X,L)minL k linesin dcostz((S,w),L)(1+ε)minL k linesin dcostz(X,L)\displaystyle\frac{1}{1+\varepsilon}\cdot\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,L)\leq\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}((S,w),L)\leq(1+\varepsilon)\cdot\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,L)
Definition 7.3 (Sensitivities).

Let n,dn,d\in\mathbbm{N}, and consider any set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, as well as kk\in\mathbbm{N} and z1z\geq 1. A sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} for (k,z)(k,z)-line approximation in d\mathbbm{R}^{d} is a function which satisfies that, for all xXx\in X,

supLk linesin dxρ(x)(x)2zcostzz(X,L)σ(x).\displaystyle\sup_{\begin{subarray}{c}L\text{: $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\dfrac{\|x-\rho_{\ell(x)}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,L)}\leq\sigma(x).

The total sensitivity of the sensitivity function σ\sigma is given by

𝔖σ=xXσ(x).\mathfrak{S}_{\sigma}=\sum_{x\in X}\sigma(x).

For a sensitivity function, we let σ~\tilde{\sigma} denote the sensitivity sampling distribution, supported on XX, which samples xXx\in X with probability proportional to σ(x)\sigma(x).
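To illustrate how the sensitivity sampling distribution is used throughout, the following is a small Python sketch (ours, with an assumed input format) of drawing m i.i.d samples from the distribution and weighting each draw by 1/(m\tilde{\sigma}(x)).

```python
import random

def sensitivity_sample(X, sigma, m):
    """Draw m i.i.d. samples from sigma~ (proportional to sigma) and weight
    each draw by 1/(m * sigma~(x)); X is a list of points and sigma the list
    of their sensitivities."""
    total = sum(sigma)                          # total sensitivity
    probs = [s / total for s in sigma]          # sigma~(x)
    idx = random.choices(range(len(X)), weights=probs, k=m)
    S = [X[j] for j in idx]                     # the sampled (multi-)set S
    w = [1.0 / (m * probs[j]) for j in idx]     # w(x) = 1 / (m * sigma~(x))
    return S, w
```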

Similarly to before, we first give a lemma which narrows down the space of near-optimal line approximations for a set of points. The following lemma is a re-formulation of Lemma 5.5 and Lemma 6.5 catered to the case of (k,z)(k,z)-line approximation.

Lemma 7.4 (Theorem 3.1 and Lemma 3.3 of [SV12]).

Let dd\in\mathbbm{N} and SdS\subset\mathbbm{R}^{d} be any set of points with weights w:S0w\colon S\to\mathbbm{R}_{\geq 0}, ε(0,1/2)\varepsilon\in(0,1/2), and z1z\geq 1. There exists a subset QSQ\subset S of size O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon) and a line \ell in d\mathbbm{R}^{d} within the span of QQ such that

costz((S,w),{})(1+ε)min linein dcostz((S,w),{}).\mathrm{cost}_{z}((S,w),\{\ell\})\leq(1+\varepsilon)\min_{\begin{subarray}{c}\ell^{\prime}\text{ line}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}((S,w),\{\ell^{\prime}\}).
Lemma 7.5 (Weak Coresets for kk-Line Approximation [FL11, VX12b]).

For any subset X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} and ε(0,1/2)\varepsilon\in(0,1/2), let σ\sigma denote a sensitivity function for (k,z)(k,z)-line approximation of XX with total sensitivity 𝔖σ\mathfrak{S}_{\sigma} and let σ~\tilde{\sigma} its sensitivity sampling distribution.

  • Let (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) denote the random (multi-)set 𝐒X\mathbf{S}\subset X and 𝒘:𝐒0\boldsymbol{w}\colon\mathbf{S}\to\mathbbm{R}_{\geq 0} given by, for

    m=poly(𝔖σ,k,1/ε),m=\mathrm{poly}(\mathfrak{S}_{\sigma},k,1/\varepsilon),

    iterations, sampling 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} i.i.d and letting 𝒘(𝒙)=1/(mσ~(𝒙))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(\boldsymbol{x})).

  • Then, with probability 1o(1)1-o(1) over the draw of (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}), it is an ε\varepsilon-weak coreset for (k,z)(k,z)-line approximation of XX.

We note that [FL11] and [VX12b] only give a strong coreset for (k,z)(k,z)-line approximation of size poly(𝔖σ,k,d,1/ε)\mathrm{poly}(\mathfrak{S}_{\sigma},k,d,1/\varepsilon). For example, Theorem 13 in [VX12b], which gives the above bound, follows from the fact that the “function dimension” (see Definition 3 of [VX12b]) for (k,z)(k,z)-line approximation is O(kd)O(kd). However, Lemma 7.4 implies that, for any set of points, a near-optimal approximating line lies within the span of O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon) of the points. This means that, for ε\varepsilon-weak coresets, it suffices to only consider kk lines spanned by O(klog(1/ε)/ε)O(k\log(1/\varepsilon)/\varepsilon) points, giving us a “function dimension” of O(klog(1/ε)/ε)O(k\log(1/\varepsilon)/\varepsilon).

7.2.2 The Important Events

Definition 7.6 (The Events).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and σ\sigma be a sensitivity function for (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}, with total sensitivity 𝔖σ\mathfrak{S}_{\sigma} and sensitivity sampling distribution σ~\tilde{\sigma}. We consider the following experiment,

  1. 1.

    We generate a sample (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) by sampling from σ~\tilde{\sigma} for m=poly(𝔖σ,k,1/ε)m=\mathrm{poly}(\mathfrak{S}_{\sigma},k,1/\varepsilon) i.i.d iterations 𝒙σ~\boldsymbol{x}\sim\tilde{\sigma} and set 𝒘(𝒙)=1/(mσ~(𝒙))\boldsymbol{w}(\boldsymbol{x})=1/(m\tilde{\sigma}(\boldsymbol{x})).

  2. 2.

    Furthermore, we sample 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}, which is a Johnson-Lindenstrauss map dt\mathbbm{R}^{d}\to\mathbbm{R}^{t}.

  3. 3.

    We let 𝐒=𝚷(𝐒)t\mathbf{S}^{\prime}=\mathbf{\Pi}(\mathbf{S})\subset\mathbbm{R}^{t} denote the image of 𝚷\mathbf{\Pi} on 𝐒\mathbf{S}.

The events are the following:

  • 𝐄1\mathbf{E}_{1} : The weighted (multi-)set (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is a weak ε\varepsilon-coreset for (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}.

  • 𝐄2\mathbf{E}_{2} : For any subset of O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon) points from 𝐒\mathbf{S}, the map 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} is an ε\varepsilon-subspace embedding for the subspace spanned by that subset.

  • 𝐄3(β)\mathbf{E}_{3}(\beta) : Let 𝐋={1,,k}\mathbf{L}^{\prime}=\{\boldsymbol{\ell}_{1}^{\prime},\dots,\boldsymbol{\ell}_{k}^{\prime}\} denote kk lines in t\mathbbm{R}^{t} for optimal (k,z)(k,z)-line approximation of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. Then,

    costzz((𝚷(𝐒),𝒘),𝐋)βcostzz(𝚷(X),𝐋).\displaystyle\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{L}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{L}^{\prime}).
Lemma 7.7.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and suppose (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) and 𝚷:dt\mathbf{\Pi}\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} satisfy events 𝐄1,𝐄2\mathbf{E}_{1},\mathbf{E}_{2} and 𝐄3(β)\mathbf{E}_{3}(\beta). Then,

minL k linesin tcostz(𝚷(X),L)1β1/z(1+ε)3minL k linesin dcostz(X,L).\displaystyle\min_{\begin{subarray}{c}L^{\prime}\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),L^{\prime})\geq\dfrac{1}{\beta^{1/z}(1+\varepsilon)^{3}}\cdot\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,L).
Proof.

Let 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) be sampled according to Definition 7.6, and suppose events 𝐄1,𝐄2\mathbf{E}_{1},\mathbf{E}_{2} and 𝐄3\mathbf{E}_{3} all hold. Let 𝐋={1,,k}\mathbf{L}^{\prime}=\{\boldsymbol{\ell}_{1}^{\prime},\dots,\boldsymbol{\ell}_{k}^{\prime}\} denote the set of kk lines for optimal (k,z)(k,z)-line approximation of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}. Then, by event 𝐄3\mathbf{E}_{3}, we have costzz((𝚷(𝐒),𝒘),𝐋)βcostzz(𝚷(X),𝐋)\mathrm{cost}_{z}^{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{L}^{\prime})\leq\beta\cdot\mathrm{cost}_{z}^{z}(\mathbf{\Pi}(X),\mathbf{L}^{\prime}). Consider the partition of 𝐒\mathbf{S} into 𝐒1,,𝐒k\mathbf{S}_{1},\dots,\mathbf{S}_{k} induced by the lines in 𝐋\mathbf{L}^{\prime} closest to 𝚷(𝐒)\mathbf{\Pi}(\mathbf{S}).

For each i[k]i\in[k], we apply Lemma 7.4 to 𝚷(𝐒i)\mathbf{\Pi}(\mathbf{S}_{i}) with weights 𝒘:𝐒i0\boldsymbol{w}\colon\mathbf{S}_{i}\to\mathbbm{R}_{\geq 0}. In particular, there exist subsets 𝐐1𝐒1,,𝐐k𝐒k\mathbf{Q}_{1}\subset\mathbf{S}_{1},\dots,\mathbf{Q}_{k}\subset\mathbf{S}_{k}, each of size O(log(1/ε)/ε)O(\log(1/\varepsilon)/\varepsilon), and kk lines 𝐋′′={1′′,,k′′}\mathbf{L}^{\prime\prime}=\{\boldsymbol{\ell}_{1}^{\prime\prime},\dots,\boldsymbol{\ell}_{k}^{\prime\prime}\} in t\mathbbm{R}^{t} such that each line i′′\boldsymbol{\ell}_{i}^{\prime\prime} lies in the span of 𝚷(𝐐i)\mathbf{\Pi}(\mathbf{Q}_{i}), and

costz((𝚷(𝐒),𝒘),𝐋′′)(1+ε)costz((𝚷(𝐒),𝒘),𝐋).\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{L}^{\prime\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{L}^{\prime}).

Event 𝐄2\mathbf{E}_{2} implies that for each i[k]i\in[k] and each x𝐒x\in\mathbf{S}, 𝚷\mathbf{\Pi} is an ε\varepsilon-subspace embedding for the subspace spanned by 𝐐i{x}\mathbf{Q}_{i}\cup\{x\}. It is not hard to see that there exist kk lines 𝐇={𝒉1,,𝒉k}\mathbf{H}=\{\boldsymbol{h}_{1},\dots,\boldsymbol{h}_{k}\} in d\mathbbm{R}^{d} such that for all i[k]i\in[k] and all x𝐒x\in\mathbf{S},

xρ𝒉i(x)2(1+ε)𝚷(x)ρi′′(𝚷(x))2,\|x-\rho_{\boldsymbol{h}_{i}}(x)\|_{2}\leq(1+\varepsilon)\cdot\|\mathbf{\Pi}(x)-\rho_{\boldsymbol{\ell}_{i}^{\prime\prime}}(\mathbf{\Pi}(x))\|_{2},

and therefore,

costz((𝐒,𝒘),𝐇)(1+ε)costz((𝚷(𝐒),𝒘),𝐋′′).\displaystyle\mathrm{cost}_{z}((\mathbf{S},\boldsymbol{w}),\mathbf{H})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{\Pi}(\mathbf{S}),\boldsymbol{w}),\mathbf{L}^{\prime\prime}).

Lastly, (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) is an ε\varepsilon-weak coreset for XX, which means that

minL k linesin dcostz(X,L)(1+ε)costz((𝐒,𝒘),𝐇).\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,L)\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}((\mathbf{S},\boldsymbol{w}),\mathbf{H}).

Combining all inequalities gives the desired lemma. ∎

By now, it is straightforward to prove the following corollary, which gives a dimension reduction bound depending on the total sensitivity of a sensitivity function.

Corollary 7.8.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and for kk\in\mathbbm{N} and z1z\geq 1, let σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} be a sensitivity function for (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

t\gtrsim\dfrac{\log\left(\mathfrak{S}_{\sigma}\cdot k/\varepsilon\right)}{\varepsilon^{3}}.

Then, with probability at least 0.970.97 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

11001/z(1+ε)3minL k linesin dcostz(X,L)minL k linesin tcostz(𝚷(X),L)\displaystyle\dfrac{1}{100^{1/z}(1+\varepsilon)^{3}}\min_{\begin{subarray}{c}L\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{d}$}\end{subarray}}\mathrm{cost}_{z}(X,L)\leq\min_{\begin{subarray}{c}L^{\prime}\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),L^{\prime})

7.3 A Sensitivity Function for (k,z)(k,z)-Line Approximation

We now describe a sensitivity function for (k,z)(k,z)-line approximation of points in d\mathbbm{R}^{d}. Similarly to the previous section, we consider a set of points X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and we design a sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} for (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}. The sensitivity function should satisfy two requirements. The first is that we have a good bound on the total sensitivity, 𝔖σ\mathfrak{S}_{\sigma}, where the target dimension tt will have logarithmic dependence on 𝔖σ\mathfrak{S}_{\sigma} (for example, like in Corollary 7.8).

The second is that 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) will hold with sufficiently high probability over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. In other words, we will proceed similarly to Lemmas 4.11, 5.11, and 6.10 and show that, for the optimal (k,z)(k,z)-line approximation 𝐋\mathbf{L}^{\prime} of 𝚷(X)\mathbf{\Pi}(X) in t\mathbbm{R}^{t}, sampling according to the sensitivity sampling distribution gives a low-variance estimate for the cost of 𝐋\mathbf{L}^{\prime}.

7.3.1 From Coresets for (k,)(k,\infty)-line approximation to Sensitivity Functions

Unfortunately, we do not know of a “clean” description of a sensitivity function for (k,z)(k,z)-line approximation, as was the case in previous definitions. Certainly, one may define a sensitivity function to be σ(x)=supLcostzz(x,L)/costzz(X,L)\sigma(x)=\sup_{L}\mathrm{cost}_{z}^{z}(x,L)/\mathrm{cost}_{z}^{z}(X,L), but then arguing that 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) holds with high probability becomes more complicated. The sensitivity function which we present follows the connection between sensitivity and \ell_{\infty}-coresets [VX12a].

Definition 7.9 (cc-coresets for (k,)(k,\infty)-line approximation).

Let Y={y1,,yn}dY=\{y_{1},\dots,y_{n}\}\subset\mathbbm{R}^{d} be any subset of points, and c1c\geq 1. A subset AYA\subset Y is a cc-coreset for (k,)(k,\infty)-line approximation if the following holds:

  • Let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} be any collection of kk lines in d\mathbbm{R}^{d}, and r0r\in\mathbbm{R}_{\geq 0} such that for all yAy\in A,

    minLyρ(y)2r.\min_{\ell\in L}\|y-\rho_{\ell}(y)\|_{2}\leq r.
  • Then, for all yYy\in Y,

    minLyρ(y)2cr.\min_{\ell\in L}\|y-\rho_{\ell}(y)\|_{2}\leq cr.

Note that (k,)(k,\infty)-line approximation is the problem of minimum enclosing cylinders: we are given a set of points YY, and want to find a set of kk cylinders C1,,CkdC_{1},\dots,C_{k}\subset\mathbbm{R}^{d} of smallest radius such that Yi=1kCiY\subset\bigcup_{i=1}^{k}C_{i}. Thus, by Definition 7.9, a set AYA\subset Y is a cc-coreset for (k,)(k,\infty)-line approximation if, given any kk cylinders which contain AA, increasing their radii by a factor of cc yields cylinders which contain YY. The reason such coresets are relevant for defining a sensitivity function is the following simple lemma, whose main idea is from [VX12a].

Lemma 7.10 (Sensitivities from cc-coresets for (k,)(k,\infty)-line approximation (see Lemma 3.1 in [VX12a])).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points and kk\in\mathbbm{N}, z1z\geq 1. Let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} be the kk lines in d\mathbbm{R}^{d} for optimal (k,z)(k,z)-line approximation of XX, and let Y={yxd:xX}Y=\{y_{x}\in\mathbbm{R}^{d}:x\in X\} where yx=ρ(x)(x)y_{x}=\rho_{\ell(x)}(x). For c1c\geq 1, let the function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} be defined as follows:

  • Let A1,A2,,AsA_{1},A_{2},\dots,A_{s} denote a partition of YY where each AiA_{i} is a cc-coreset for (k,)(k,\infty)-line approximation of Y(i=1i1Ai)Y\setminus\left(\bigcup_{i^{\prime}=1}^{i-1}A_{i^{\prime}}\right).

  • For each xXx\in X, where yxAiy_{x}\in A_{i} we let

    σ(x)=def2z1xyx2zcostzz(X,L)+22z1ci.\sigma(x)\stackrel{{\scriptstyle\rm def}}{{=}}2^{z-1}\cdot\dfrac{\|x-y_{x}\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,L)}+2^{2z-1}\cdot\dfrac{c}{i}.

Then, σ\sigma is a sensitivity function for (k,z)(k,z)-line approximation, and the total sensitivity

𝔖σ=O(22zclognmaxi[s]|Ai|)\mathfrak{S}_{\sigma}=O\left(2^{2z}\cdot c\cdot\log n\cdot\max_{i\in[s]}|A_{i}|\right)
Proof.

Suppose xXx\in X and yxAiy_{x}\in A_{i}. Consider any set L={1,,k}L^{\prime}=\{\ell_{1}^{\prime},\dots,\ell_{k}^{\prime}\} of kk lines in d\mathbbm{R}^{d}. The goal is to show that

minLxρ(x)2zcostzz(X,L)σ(x).\displaystyle\dfrac{\min_{\ell^{\prime}\in L^{\prime}}\|x-\rho_{\ell^{\prime}}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,L^{\prime})}\leq\sigma(x). (30)

We will first use Hölder's inequality and the triangle inequality, as well as the fact that yxAiy_{x}\in A_{i}, in order to write the following:

minLxρ(x)2z\displaystyle\min_{\ell^{\prime}\in L^{\prime}}\|x-\rho_{\ell^{\prime}}(x)\|_{2}^{z} 2z1xyx2z+2z1minLyxρ(yx)2z\displaystyle\leq 2^{z-1}\|x-y_{x}\|_{2}^{z}+2^{z-1}\min_{\ell^{\prime}\in L^{\prime}}\|y_{x}-\rho_{\ell^{\prime}}(y_{x})\|_{2}^{z}
2z1xyx2z+2z1(costzz(Y,L)ci).\displaystyle\leq 2^{z-1}\|x-y_{x}\|_{2}^{z}+2^{z-1}\left(\mathrm{cost}_{z}^{z}(Y,L^{\prime})\cdot\dfrac{c}{i}\right). (31)

The justification for (31) is the following: for every jij\leq i, AjA_{j} is a cc-coreset for (k,)(k,\infty)-line approximation of a set of points which contains yxy_{x}. Therefore, if minLyxρ(yx)2=r\min_{\ell^{\prime}\in L^{\prime}}\|y_{x}-\rho_{\ell^{\prime}}(y_{x})\|_{2}=r, there must exist a point uAju\in A_{j} with minLuρ(u)2r/c\min_{\ell^{\prime}\in L^{\prime}}\|u-\rho_{\ell^{\prime}}(u)\|_{2}\geq r/c. Suppose otherwise: every uAju\in A_{j} satisfies minLuρ(u)2<r/c\min_{\ell^{\prime}\in L^{\prime}}\|u-\rho_{\ell^{\prime}}(u)\|_{2}<r/c. Then, there is some radius r′<r/c such that the kk cylinders around the lines of LL^{\prime} of radius r′ contain AjA_{j}, so increasing their radii by a factor of cc yields cylinders of radius c⋅r′<r which contain yxy_{x}. However, this means minLyxρ(yx)2<r\min_{\ell^{\prime}\in L^{\prime}}\|y_{x}-\rho_{\ell^{\prime}}(y_{x})\|_{2}<r, which is a contradiction.

Hence, we always have that yxAiy_{x}\in A_{i} satisfies

minLyxρ(yx)2zcostzz(Y,L)ci.\min_{\ell^{\prime}\in L^{\prime}}\|y_{x}-\rho_{\ell^{\prime}}(y_{x})\|_{2}^{z}\leq\mathrm{cost}_{z}^{z}(Y,L^{\prime})\cdot\dfrac{c}{i}.

Continuing on upper-bounding (31), we now use the fact costzz(Y,L)2z1costzz(X,L)+2z1costzz(X,L)2zcostzz(X,L)\mathrm{cost}_{z}^{z}(Y,L^{\prime})\leq 2^{z-1}\cdot\mathrm{cost}_{z}^{z}(X,L)+2^{z-1}\cdot\mathrm{cost}_{z}^{z}(X,L^{\prime})\leq 2^{z}\mathrm{cost}_{z}^{z}(X,L^{\prime}) because costzz(X,L)\mathrm{cost}_{z}^{z}(X,L) is the optimal (k,z)(k,z)-line approximation. Therefore,

minLxρ(x)2z2z1xyx2z+22z1cicostzz(X,L),\displaystyle\min_{\ell^{\prime}\in L^{\prime}}\|x-\rho_{\ell^{\prime}}(x)\|_{2}^{z}\leq 2^{z-1}\cdot\|x-y_{x}\|_{2}^{z}+2^{2z-1}\cdot\dfrac{c}{i}\cdot\mathrm{cost}_{z}^{z}(X,L^{\prime}),

so dividing by costzz(X,L)\mathrm{cost}_{z}^{z}(X,L^{\prime}) and noticing that costzz(X,L)costzz(X,L)\mathrm{cost}_{z}^{z}(X,L)\leq\mathrm{cost}_{z}^{z}(X,L^{\prime}) implies σ\sigma is a sensitivity function.

The bound on total sensitivity then follows from

\displaystyle\mathfrak{S}_{\sigma}=\sum_{x\in X}\sigma(x)=\sum_{i=1}^{s}\sum_{x\in X:\,y_{x}\in A_{i}}\sigma(x)=2^{z-1}+2^{2z-1}\sum_{i=1}^{s}|A_{i}|\cdot\dfrac{c}{i}=O\left(2^{2z}\cdot c\cdot\max_{i\in[s]}|A_{i}|\cdot\log n\right),

since \sum_{i=1}^{s}1/i=O(\log s) and sns\leq n. ∎
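The construction in Lemma 7.10 is algorithmic: repeatedly peel off cc-coresets and record, for each point, the level at which its projection is removed. The following Python sketch (ours; the subroutine kinf_coreset_indices is a hypothetical stand-in for a cc-coreset construction, e.g., the one of Section 7.3.2 below) records this procedure.

```python
def sensitivities_from_peeling(dists_z, kinf_coreset_indices, c, z):
    """Sketch of the sensitivity function of Lemma 7.10.

    dists_z[j]           : ||x_j - y_{x_j}||_2^z under the optimal lines L.
    kinf_coreset_indices : maps a list of indices to a non-empty subset which
                           is a c-coreset for (k, infty)-line approximation of
                           the projected points {y_{x_j} : j in the list}.
    Returns the list of sensitivities sigma(x_j)."""
    opt_cost = sum(dists_z)                         # cost_z^z(X, L)
    remaining = list(range(len(dists_z)))
    level = {}                                      # j -> i such that y_{x_j} in A_i
    i = 0
    while remaining:
        i += 1
        A_i = set(kinf_coreset_indices(remaining))  # peel off the next c-coreset
        for j in A_i:
            level[j] = i
        remaining = [j for j in remaining if j not in A_i]
    return [2 ** (z - 1) * d / opt_cost + 2 ** (2 * z - 1) * c / level[j]
            for j, d in enumerate(dists_z)]
```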

7.3.2 A simple coreset for (k,)(k,\infty)-line approximation of one-dimensional instances

Suppose first that a dataset Y={y1,,yn}Y=\{y_{1},\dots,y_{n}\} lies on a line in d\mathbbm{R}^{d}, and let C1,,CkC_{1},\dots,C_{k} be a collection of kk cylinders. Then, the intersection of the cylinders with the line results in a union of kk intervals on the line. If we increase the radius of each cylinder C1,,CkC_{1},\dots,C_{k} by a factor of cc, the lengths of the intervals are scaled by a factor of cc (while keeping the center of each interval fixed). We first show that, for any Y={y1,,yn}Y=\{y_{1},\dots,y_{n}\} which lies on a line, there exists a small subset QYQ\subset Y such that: if I1,,IkI_{1},\dots,I_{k} is any collection of kk intervals which covers QQ, then increasing the length of each interval by a factor of 33 (while keeping the center of the interval fixed) covers YY.

Lemma 7.11.

There exists a large enough constant c10c_{1}\in\mathbbm{R}_{\geq 0} such that the following is true. Let Y={y1,,yn}Y=\{y_{1},\dots,y_{n}\} be a set of points lying on a line in d\mathbbm{R}^{d}, and kk\in\mathbbm{N}. There exists a subset QYQ\subset Y which is a 33-coreset for (k,)(k,\infty)-line approximation of size at most (c1logn)k(c_{1}\log n)^{k}.

Proof.

The construction is recursive. Let \ell be the line containing YY, and after choosing an arbitrary direction on \ell, let y1,,yny_{1},\dots,y_{n} be the points in sorted order according to the chosen direction.

The set QQ is initially empty, and we include Q{y1,yn/2,yn}Q\leftarrow\{y_{1},y_{\lceil n/2\rceil},y_{n}\}. Suppose that y1yn/22yn/2yn2\|y_{1}-y_{\lceil n/2\rceil}\|_{2}\geq\|y_{\lceil n/2\rceil}-y_{n}\|_{2} (the construction is symmetric, with y1y_{1} and yny_{n} switched otherwise). We divide YY into two sets, the subsets YL={y1,,yn/2}Y_{L}=\{y_{1},\dots,y_{\lfloor n/2\rfloor}\} and YR={yn/2,,yn}Y_{R}=\{y_{\lceil n/2\rceil},\dots,y_{n}\}. Then, we perform three recursive calls: (i) we let Q1Q_{1} be a 33-coreset for (k,)(k,\infty)-line approximation of YLY_{L}, (ii) we let Q2Q_{2} be a 33-coreset for (k1,)(k-1,\infty)-line approximation of YLY_{L}, and (iii) we let Q3Q_{3} be a 33-coreset for (k1,)(k-1,\infty)-line approximation of YRY_{R}. We add Q1,Q2Q_{1},Q_{2}, and Q3Q_{3} to QQ.

The proof of correctness argues as follows. Let C1,,CkC_{1},\dots,C_{k} be an arbitrary collection of kk cylinders which covers QQ. The goal is to show that increasing the radius of C1,,CkC_{1},\dots,C_{k} by a factor of 3 covers YY. Let I1,,IkI_{1},\dots,I_{k} be the intervals given by Ii=CiI_{i}=\ell\cap C_{i}. We let the indices u,v[k]u,v\in[k] be such that IuI_{u} is the first interval which contains y1y_{1}, and IvI_{v} the last interval which contains yny_{n}. We note that I1IkI_{1}\cup\dots\cup I_{k} covers QQ. We must show that if we increase the length of each interval by a factor of 33, we cover YY. We consider three cases:

  • Suppose there exists an index i[k]i^{*}\in[k] such that y1y_{1} and yn/2y_{\lceil n/2\rceil} both lie in the interval IiI_{i^{*}}. Recall y1yn/22yn/2yn2\|y_{1}-y_{\lceil n/2\rceil}\|_{2}\geq\|y_{\lceil n/2\rceil}-y_{n}\|_{2}, and all points are contained within y1y_{1} and yny_{n}. Hence, when we increase the length of IiI_{i^{*}} by a factor of 33 while keeping its center fixed, the resulting interval contains both y1y_{1} and yny_{n}, and thus covers YY.

  • Suppose y1y_{1} and yn/2y_{\lceil n/2\rceil} lie in different intervals, but there exists ii^{*} such that yn/2y_{\lceil n/2\rceil} and yny_{n} lie in the interval IiI_{i^{*}}. Then, since all points of YRY_{R} lie between yn/2y_{\lceil n/2\rceil} and yny_{n}, IiI_{i^{*}} covers YRY_{R}. Since I1,,IkI_{1},\dots,I_{k} cover Q1Q_{1} and Q1Q_{1} is a 33-coreset for (k,)(k,\infty)-line approximation of YLY_{L}, increasing the length of each interval by a factor of 33 covers YLY_{L}, and therefore all of YY.

  • Suppose y1y_{1}, yn/2y_{\lceil n/2\rceil} and yny_{n} all lie in different intervals. Then, since y1y_{1} and yn/2y_{\lceil n/2\rceil} are not on the same interval, the k1k-1 intervals i[k]{u}Ii\bigcup_{i\in[k]\setminus\{u\}}I_{i} cover Q2Q_{2}. Similarly, yn/2y_{\lceil n/2\rceil} and yny_{n} are not on the same interval, so the k1k-1 intervals i[k]{v}Ii\bigcup_{i\in[k]\setminus\{v\}}I_{i} cover Q3Q_{3}. Since Q2Q_{2} is a 33-coreset for (k1,)(k-1,\infty)-line approximation of YLY_{L}, increasing the length of each interval by a factor of 33 covers all of YLY_{L}. In addition, Q3Q_{3} is a 33-coreset for (k1,)(k-1,\infty)-line approximation of YRY_{R}, so increasing the length of each interval by a factor of 33 covers YRY_{R}.

This concludes the correctness of the coreset, and it remains to upper bound the size. Let f(k,n)f(k,n)\in\mathbbm{N} be an upper bound on the coreset size of (k,)(k,\infty)-line approximation of a subset of size nn. We have f(1,n)=2f(1,n)=2, since any single interval which covers y1y_{1} and yny_{n} covers everything in between them. By our recursive construction, we have

f(k,n)3+f(k,n/2)+2f(k1,n/2).f(k,n)\leq 3+f(k,n/2)+2\cdot f(k-1,n/2).

By a simple induction, one can show f(k,n)f(k,n) is at most (c1logn)k(c_{1}\log n)^{k} when k2k\geq 2, for large enough constant c1c_{1} and large enough nn. ∎
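The recursive construction in the proof above is simple to state as code. Below is a Python sketch (ours, assuming the input is the sorted list of one-dimensional coordinates of the points along the line) of the 3-coreset of Lemma 7.11; the three recursive calls mirror the recursion f(k,n) ≤ 3 + f(k,n/2) + 2f(k−1,n/2).

```python
def line_coreset(ys, k):
    """3-coreset for (k, infty)-line approximation of points on a line;
    `ys` is a sorted list of 1-D coordinates, and a set of kept
    coordinates is returned."""
    n = len(ys)
    if n <= 3:
        return set(ys)                  # tiny sets are trivially coresets
    if k == 1:
        return {ys[0], ys[-1]}          # f(1, n) = 2: the endpoints suffice
    mid = (n + 1) // 2 - 1              # 0-based index of y_{ceil(n/2)}
    Q = {ys[0], ys[mid], ys[-1]}
    left, right = ys[: n // 2], ys[mid:]
    # the construction is symmetric: let the half spanning the longer gap play Y_L
    if abs(ys[mid] - ys[0]) < abs(ys[-1] - ys[mid]):
        left, right = right, left
    Q |= line_coreset(left, k)          # Q_1: coreset of Y_L for k intervals
    Q |= line_coreset(left, k - 1)      # Q_2: coreset of Y_L for k-1 intervals
    Q |= line_coreset(right, k - 1)     # Q_3: coreset of Y_R for k-1 intervals
    return Q
```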

7.3.3 The coreset for points on kk lines and the effect of dimension reduction

Lemma 7.12.

There exists a large enough constant c10c_{1}\in\mathbbm{R}_{\geq 0} such that the following is true. Let Y={y1,,yn}Y=\{y_{1},\dots,y_{n}\} be a set of points lying on kk lines in d\mathbbm{R}^{d}. There exists a subset QYQ\subset Y which satisfies the following two requirements:

  1. 1.

    QQ is a 33-coreset for (k,)(k,\infty)-line approximation of YY of size at most k(c1logn)kk(c_{1}\log n)^{k}.

  2. 2.

    If Π:dt\Pi\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} is a linear map, then Π(Q)\Pi(Q) is a 33-coreset for (k,)(k,\infty)-line approximation of Π(Y)\Pi(Y).

Proof.

Let Y1,,YkY_{1},\dots,Y_{k} be the partition of YY into points lying on the lines 1,,k\ell_{1},\dots,\ell_{k} of d\mathbbm{R}^{d}, respectively. We may write each line i\ell_{i} by two vectors ui,vidu_{i},v_{i}\in\mathbbm{R}^{d}, and have

i={ui+tvi:t}.\ell_{i}=\left\{u_{i}+t\cdot v_{i}:t\in\mathbbm{R}\right\}.

Let QiQ_{i} be the 33-coreset for (k,)(k,\infty)-line approximation of YiY_{i} specified by Lemma 7.11. We let QQ be the union of all QiQ_{i}. Item 1 follows from Lemma 7.11, since we are taking the union of kk coresets.

We now argue Item 2. Since Π\Pi is a linear map, and every point in YiY_{i} lies on the line i\ell_{i}, there exists a map t:Yit\colon Y_{i}\to\mathbbm{R} where each yYiy\in Y_{i} satisfies

y=ui+t(y)vidand thus,Π(y)=Π(ui)+t(y)Π(vi)t.y=u_{i}+t(y)\cdot v_{i}\in\mathbbm{R}^{d}\qquad\text{and thus,}\qquad\Pi(y)=\Pi(u_{i})+t(y)\cdot\Pi(v_{i})\in\mathbbm{R}^{t}.

In other words, Π(Yi)\Pi(Y_{i}) lies within a line in t\mathbbm{R}^{t}. We note that the relative order of points in Π(Yi)\Pi(Y_{i}) remains the same, since for any two points y,yYiy,y^{\prime}\in Y_{i},

Π(y)Π(y)2=|t(y)t(y)|Π(vi)2,yy2=|t(y)t(y)|vi2.\|\Pi(y)-\Pi(y^{\prime})\|_{2}=|t(y)-t(y^{\prime})|\cdot\|\Pi(v_{i})\|_{2},\qquad\|y-y^{\prime}\|_{2}=|t(y)-t(y^{\prime})|\cdot\|v_{i}\|_{2}.

We note that the construction of Lemma 7.11 only considers the order of points in YiY_{i}, as well as the ratio of distances. Therefore, executing the construction of Lemma 7.11 on the points Π(Yi)\Pi(Y_{i}) returns the set Π(Qi)\Pi(Q_{i}). ∎
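Continuing the sketch given after Lemma 7.11, the construction of Lemma 7.12 can be phrased as follows (ours; it assumes the points come pre-partitioned by line, each line given as a pair (u, v), and it reuses the hypothetical line_coreset helper from above).

```python
import numpy as np

def k_lines_coreset(Ys, lines, k):
    """Union of the per-line 3-coresets in Lemma 7.12.
    Ys[i]    : array of points lying on the i-th line.
    lines[i] : pair (u, v) describing the line {u + t*v : t in R}."""
    Q = []
    for Y_i, (u, v) in zip(Ys, lines):
        # 1-D parameter t(y) of each point; only the order of the points and
        # their distance ratios along the line matter for Lemma 7.11
        ts = sorted(float(np.dot(y - u, v) / np.dot(v, v)) for y in Y_i)
        kept = line_coreset(ts, k)               # coordinates kept on this line
        Q.extend(u + t * v for t in sorted(kept))
    return Q
```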

Corollary 7.13.

Let Y={y1,,yn}dY=\{y_{1},\dots,y_{n}\}\subset\mathbbm{R}^{d} be a set of points lying on kk lines in d\mathbbm{R}^{d}.

  • Let A1,,AsA_{1},\dots,A_{s} denote a partition of YY where each AiA_{i} is the 33-coreset from Lemma 7.12 for (k,)(k,\infty)-line approximation of the set Yi=1i1AiY\setminus\bigcup_{i^{\prime}=1}^{i-1}A_{i^{\prime}}.

  • Let Π:dt\Pi\colon\mathbbm{R}^{d}\to\mathbbm{R}^{t} be any linear map.

For any set of kk lines L={1,,k}L^{\prime}=\{\ell_{1}^{\prime},\dots,\ell_{k}^{\prime}\} in t\mathbbm{R}^{t}, if yAiy\in A_{i}, we have

\|\Pi(y)-\rho_{L^{\prime}}(\Pi(y))\|_{2}^{z}\leq\mathrm{cost}_{z}^{z}(\Pi(Y),L^{\prime})\cdot\dfrac{3}{i}.
Proof.

The proof follows from applying the same observation of Lemma 7.10 to Π(Aj)\Pi(A_{j}), which is a 33-coreset for (k,)(k,\infty)-line approximation by Lemma 7.12. Namely, for every jij\leq i, the set Π(Aj)\Pi(A_{j}) is a 33-coreset for (k,)(k,\infty)-line approximation of a set containing Π(y)\Pi(y). Thus, if Π(y)ρL(Π(y))2=r\|\Pi(y)-\rho_{L^{\prime}}(\Pi(y))\|_{2}=r, there must be at least ii points yYy^{\prime}\in Y where Π(y)ρL(Π(y))2r/3\|\Pi(y^{\prime})-\rho_{L^{\prime}}(\Pi(y^{\prime}))\|_{2}\geq r/3. ∎

7.4 Improving the approximation

We now instantiate the sensitivity function of Lemma 7.10 using the coresets of Lemma 7.12, and use Corollary 7.8 and Lemma 7.7 in order to improve on the approximation. Similarly to before, we show that event 𝐄3(1+ε)\mathbf{E}_{3}(1+\varepsilon) occurs with sufficiently high probability over the draw of 𝚷\mathbf{\Pi} and (𝐒,𝒘)(\mathbf{S},\boldsymbol{w}) by giving an upper bound on the variance as in Lemma 4.11, Lemma 5.11, and Lemma 6.10.

Fix X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d}, and let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} be the optimal (k,z)(k,z)-line approximation of XX in d\mathbbm{R}^{d}. For xXx\in X, we let yxdy_{x}\in\mathbbm{R}^{d} be given by yx=ρ(x)(x)y_{x}=\rho_{\ell(x)}(x), and we denote the set Y={yx:xX}Y=\{y_{x}:x\in X\}. The sensitivity function σ:X0\sigma\colon X\to\mathbbm{R}_{\geq 0} is specified by Lemma 7.10. Recall that we first let A1,,AsA_{1},\dots,A_{s} denote a partition of YY, where AiA_{i} is the 33-coreset from Lemma 7.12 for (k,)(k,\infty)-line approximation of Y\setminus\bigcup_{i^{\prime}=1}^{i-1}A_{i^{\prime}}. For xXx\in X with yxAiy_{x}\in A_{i}, we have

σ(x)=2z1xyx2zcostzz(X,L)+22z13i.\sigma(x)=2^{z-1}\cdot\dfrac{\|x-y_{x}\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,L)}+2^{2z-1}\cdot\frac{3}{i}.

We let 𝐄4\mathbf{E}_{4} denote the following event with respect to the randomness in 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t}. For each xXx\in X, we let 𝐃x0\mathbf{D}_{x}\in\mathbbm{R}_{\geq 0} denote the random variable

𝐃x=𝚷(x)𝚷(ρ(x)(x))2xρ(x)(x)2,\mathbf{D}_{x}=\dfrac{\|\mathbf{\Pi}(x)-\mathbf{\Pi}\left(\rho_{\ell(x)}(x)\right)\|_{2}}{\|x-\rho_{\ell(x)}(x)\|_{2}},

and, as in previous sections, we let 𝐄4\mathbf{E}_{4} denote the event, which occurs with probability at least 0.990.99, that

\sum_{x\in X}\mathbf{D}_{x}^{2z}\cdot\sigma(x)\leq 100\cdot 2^{z}\cdot\mathfrak{S}_{\sigma}.
Lemma 7.14.

Let Π𝒥d,t\Pi\in\mathcal{J}_{d,t} be a Johnson-Lindenstrauss map where, for α>1\alpha>1, the following events hold:

  • Guarantee from Lemma 7.1: xXΠ(x)Π(ρ(x)(x))2zαcostzz(X,L)\sum_{x\in X}\|\Pi(x)-\Pi(\rho_{\ell(x)}(x))\|_{2}^{z}\leq\alpha\cdot\mathrm{cost}_{z}^{z}(X,L).

    Guarantee from Corollary 7.8: letting L={1,,k}L^{\prime}=\{\ell_{1}^{\prime},\dots,\ell_{k}^{\prime}\} be the optimal (k,z)(k,z)-line approximation of Π(X)\Pi(X), then costzz(X,L)αcostzz(Π(X),L)\mathrm{cost}_{z}^{z}(X,L)\leq\alpha\cdot\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime}).

  • Event 𝐄4\mathbf{E}_{4} holds.

Then, if we let (𝐒,𝐰)(\mathbf{S},\boldsymbol{w}) denote m=poly((logn)k,1/ε,α)m=\mathrm{poly}((\log n)^{k},1/\varepsilon,\alpha), i.i.d draws from σ~\tilde{\sigma} and 𝐰(x)=1/(mσ~(x))\boldsymbol{w}(x)=1/(m\tilde{\sigma}(x)), with probability at least 0.990.99,

costzz((Π(𝐒),𝒘),L)(1+ε)costzz(Π(X),L).\mathrm{cost}_{z}^{z}((\Pi(\mathbf{S}),\boldsymbol{w}),L^{\prime})\leq(1+\varepsilon)\cdot\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime}).
Proof.

We bound the variance,

𝐕𝐚𝐫𝐒[𝐄𝒙𝐒[1σ~(𝒙)Π(𝒙)ρL(Π(𝒙))2zcostzz(Π(X),L)]]𝔖σmxX(1σ(x)Π(x)ρL(Π(x))22zcostz2z(Π(X),L)).\displaystyle\operatorname{{\bf Var}}_{\mathbf{S}}\left[\mathop{{\bf E}\/}_{\boldsymbol{x}\sim\mathbf{S}}\left[\dfrac{1}{\tilde{\sigma}(\boldsymbol{x})}\cdot\dfrac{\|\Pi(\boldsymbol{x})-\rho_{L^{\prime}}(\Pi(\boldsymbol{x}))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime})}\right]\right]\leq\dfrac{\mathfrak{S}_{\sigma}}{m}\sum_{x\in X}\left(\dfrac{1}{\sigma(x)}\cdot\dfrac{\|\Pi(x)-\rho_{L^{\prime}}(\Pi(x))\|_{2}^{2z}}{\mathrm{cost}_{z}^{2z}(\Pi(X),L^{\prime})}\right). (32)

We note that, as before, we will apply Hölder’s inequality and the triangle inequality, followed by Corollary 7.13. Specifically, suppose xXx\in X with yxAiy_{x}\in A_{i}, then,

Π(x)ρL(Π(x))2z\displaystyle\|\Pi(x)-\rho_{L^{\prime}}(\Pi(x))\|_{2}^{z} 2z1𝐃xzxρL(x)2z+2z1Π(yx)ρL(Π(yx))2z\displaystyle\leq 2^{z-1}\cdot\mathbf{D}_{x}^{z}\cdot\|x-\rho_{L}(x)\|_{2}^{z}+2^{z-1}\cdot\|\Pi(y_{x})-\rho_{L^{\prime}}(\Pi(y_{x}))\|_{2}^{z}
2z1𝐃xzxρL(x)2z+2z1costzz(Π(Y),L)3i\displaystyle\leq 2^{z-1}\cdot\mathbf{D}_{x}^{z}\cdot\|x-\rho_{L}(x)\|_{2}^{z}+2^{z-1}\cdot\mathrm{cost}_{z}^{z}(\Pi(Y),L^{\prime})\cdot\frac{3}{i}
2z1𝐃xzxρL(x)2z+22z23i(costzz(Π(X),L)+costzz(Π(X),Π(Y))).\displaystyle\leq 2^{z-1}\cdot\mathbf{D}_{x}^{z}\cdot\|x-\rho_{L}(x)\|_{2}^{z}+\dfrac{2^{2z-2}\cdot 3}{i}\left(\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime})+\mathrm{cost}_{z}^{z}(\Pi(X),\Pi(Y))\right).

We note that, from the first and second guarantees in the statement of the lemma, we have costzz(Π(X),Π(Y))αcostzz(X,L)α2costzz(Π(X),L)\mathrm{cost}_{z}^{z}(\Pi(X),\Pi(Y))\leq\alpha\mathrm{cost}_{z}^{z}(X,L)\leq\alpha^{2}\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime}). So the above simplifies to

Π(x)ρL(Π(x))2zcostzz(Π(X),L)\displaystyle\dfrac{\|\Pi(x)-\rho_{L^{\prime}}(\Pi(x))\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(\Pi(X),L^{\prime})} 2z1𝐃xzxρL(x)2zcostzz(X,L)+22z23(1+α2)i\displaystyle\leq 2^{z-1}\cdot\mathbf{D}_{x}^{z}\cdot\dfrac{\|x-\rho_{L}(x)\|_{2}^{z}}{\mathrm{cost}_{z}^{z}(X,L)}+2^{2z-2}\cdot\dfrac{3(1+\alpha^{2})}{i}
(𝐃xz+1+α2)σ(x).\displaystyle\leq(\mathbf{D}_{x}^{z}+1+\alpha^{2})\cdot\sigma(x).

We now continue upper bounding (32), where the variance becomes less than

𝔖σmxX(𝐃xz+1+α2)2σ(x)𝔖σ2α4m,\displaystyle\dfrac{\mathfrak{S}_{\sigma}}{m}\sum_{x\in X}\left(\mathbf{D}_{x}^{z}+1+\alpha^{2}\right)^{2}\cdot\sigma(x)\lesssim\dfrac{\mathfrak{S}_{\sigma}^{2}\cdot\alpha^{4}}{m},

since event 𝐄4\mathbf{E}_{4} holds. Since 𝔖σpoly(22z,(logn)k)\mathfrak{S}_{\sigma}\leq\mathrm{poly}(2^{2z},(\log n)^{k}), we obtain our desired bound on the variance by letting mm be a large enough polynomial of (logn)k(\log n)^{k}, α\alpha, and 1/ε1/\varepsilon. ∎

Corollary 7.15.

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points, and let L={1,,k}L=\{\ell_{1},\dots,\ell_{k}\} be the optimal set of kk lines for (k,z)(k,z)-line approximation of XX. For any ε(0,1/2)\varepsilon\in(0,1/2), let 𝒥d,t\mathcal{J}_{d,t} be the Johnson-Lindenstrauss map with

tkloglogn+z+log(1/ε)ε3.t\gtrsim\dfrac{k\log\log n+z+\log(1/\varepsilon)}{\varepsilon^{3}}.

Then, with probability at least 0.920.92 over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t},

minL k linesin tcostz(𝚷(X),L)1(1+ε)3+1/zcostz(X,L).\min_{\begin{subarray}{c}L^{\prime}\text{ $k$ lines}\\ \text{in $\mathbbm{R}^{t}$}\end{subarray}}\mathrm{cost}_{z}(\mathbf{\Pi}(X),L^{\prime})\geq\dfrac{1}{(1+\varepsilon)^{3+1/z}}\cdot\mathrm{cost}_{z}(X,L).

Appendix A On preserving “all solutions” and comparisons to prior work

This section serves two purposes:

  1. 1.

    To help compare the guarantees of this work to that of prior works on (k,z)(k,z)-clustering of [MMR19] and (k,2)(k,2)-subspace approximation [CEM+15], expanding on the discussion in the introduction. In short, for (k,z)(k,z)-clustering, the results of [MMR19] are qualitatively stronger than the results obtained here. In (k,2)(k,2)-subspace approximation, the “for all” guarantees of [CEM+15] are for the qualitatively different problem of low-rank approximation. While the costs of low-rank approximation and (k,2)(k,2)-subspace approximation happen to agree at the optimum, the notion of a candidate solution is different.

  2. 2.

    To show that, for two related problems of “medoid” and “column subset selection,” one cannot apply the Johnson-Lindenstrauss transform to dimension o(logn)o(\log n) while preserving the cost. The medoid problem is a center-based clustering problem, and the column subset selection problem is a subspace approximation problem. The instances we will construct for these problems are very symmetric, such that uniform sampling will give small coresets. These give concrete examples ruling out a theorem which directly relates the size of coresets to the effect of the Johnson-Lindenstrauss transform.

Center-Based Clustering

Consider the following (slight) modification to the center-based clustering problems known as the “medoid” problem.

Definition A.1 (11-medoid problem).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points. The 11-medoid problem asks to optimize

mincXxXxc22.\min_{\begin{subarray}{c}c\in X\end{subarray}}\sum_{x\in X}\|x-c\|_{2}^{2}.

Notice the difference between 11-medoid and 11-mean: in 11-medoid the center is restricted to be from within the set of points XX, whereas in 11-mean the center is arbitrary. Perhaps surprisingly, this modification has a dramatic effect on dimension reduction.

Theorem 12.

For large enough n,dn,d\in\mathbbm{N}, there exists a set of points XdX\subset\mathbbm{R}^{d} (in particular, given by the nn standard basis vectors {e1,,en}n\{e_{1},\dots,e_{n}\}\subset\mathbbm{R}^{n}) such that, with high probability over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} where t=o(logn)t=o(\log n),

mincXxXxc22minc𝚷(X)xX𝚷(x)c222o(1).\displaystyle\frac{\displaystyle\mathop{\min}_{c\in X}\sum_{x\in X}\|x-c\|_{2}^{2}}{\displaystyle\mathop{\min}_{c^{\prime}\in\mathbf{\Pi}(X)}\sum_{x\in X}\|\mathbf{\Pi}(x)-c^{\prime}\|_{2}^{2}}\geq 2-o(1).

Theorem 12 gives a very strong lower bound for dimension reduction for the 11-medoid problem, showing that decreasing the dimension to any o(logn)o(\log n) does not preserve (even the optimal) solution to within a factor better than 22. This is in stark contrast to the results on center-based clustering, where for the 11-mean problem the cost of solutions can be preserved up to a (1±ε)(1\pm\varepsilon)-factor with target dimension independent of nn and dd. The proof itself is also very straightforward: each 𝚷(ei)\mathbf{\Pi}(e_{i}) is an independent Gaussian vector in t\mathbbm{R}^{t}, and if t=o(logn)t=o(\log n), with high probability, there exists an index i[n]i\in[n] where 𝚷(ei)22=o(1)\|\mathbf{\Pi}(e_{i})\|_{2}^{2}=o(1). In a similar vein, with high probability i=1n𝚷(ei)22(1+o(1))n\sum_{i=1}^{n}\|\mathbf{\Pi}(e_{i})\|_{2}^{2}\leq(1+o(1))n. We take a union bound and set the center c=𝚷(ei)c^{\prime}=\mathbf{\Pi}(e_{i}) for the index ii where 𝚷(ei)22=o(1)\|\mathbf{\Pi}(e_{i})\|_{2}^{2}=o(1). By the Pythagorean theorem, the cost of this 11-medoid solution is at most (1+o(1))n(1+o(1))n. On the other hand, every 11-medoid solution in XX has cost 2(n1)2(n-1).
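The construction is easy to examine numerically; the following is a small NumPy sketch (ours, with illustrative parameter values, and the ratio only approaches 2 asymptotically as t = o(log n)).

```python
import numpy as np

def medoid_cost_ratio(n=2000, t=5, seed=0):
    """Project e_1, ..., e_n in R^n down to R^t with a Gaussian JL map and
    compare the optimal 1-medoid costs before and after the projection."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, n))   # columns are Pi(e_i)
    original = 2 * (n - 1)              # every medoid in X costs exactly 2(n-1)
    sq = (P ** 2).sum(axis=0)           # ||Pi(e_i)||_2^2
    s = P.sum(axis=1)                   # sum_i Pi(e_i)
    # cost of the projected medoid c' = Pi(e_j), for every j at once:
    # sum_i ||Pi(e_i) - Pi(e_j)||^2 = sum_i ||Pi(e_i)||^2 + n*||Pi(e_j)||^2 - 2*<Pi(e_j), s>
    costs = sq.sum() + n * sq - 2.0 * (P.T @ s)
    return original / float(costs.min())
```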

We emphasize that Theorem 12 does not contradict [MMR19, BBCA+19], even though it rules out that “all candidate centers” are preserved. The reason is that the notion of “candidate solution” is different. Informally, [MMR19] shows that for any dataset XdX\subset\mathbbm{R}^{d} of nn vectors and any kk\in\mathbbm{N}, ε>0\varepsilon>0, applying the Johnson-Lindenstrauss map 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} with t=O(log(k/ε)/ε2)t=O(\log(k/\varepsilon)/\varepsilon^{2}) satisfies the following guarantee: for all partitions of XX into kk sets, (P1,P2,,Pk)(P_{1},P_{2},\dots,P_{k}), the following is true:

=1kminctxP𝚷(x)c221±ε=1kmincdxPxc22.\displaystyle\sum_{\ell=1}^{k}\min_{c_{\ell}^{\prime}\in\mathbbm{R}^{t}}\sum_{x\in P_{\ell}}\|\mathbf{\Pi}(x)-c_{\ell}^{\prime}\|_{2}^{2}\approx_{1\pm\varepsilon}\sum_{\ell=1}^{k}\min_{c_{\ell}\in\mathbbm{R}^{d}}\sum_{x\in P_{\ell}}\|x-c_{\ell}\|_{2}^{2}.

The “for all” quantifies over clusterings (P1,,Pk)(P_{1},\dots,P_{k}) is different (as seen from the 11-medoid example) from the “for all” over centers c1,,ckc_{1},\dots,c_{k}.

Subspace Approximation

The same subtlety appears in subspace approximation. Here, we can consider the 11-column subset selection problem, which at a high level, is the medoid version of subspace approximation. We want to approximate a set of points by their projections onto the subspace spanned by one of those points.

Definition A.2 (11-column subset selection).

Let X={x1,,xn}dX=\{x_{1},\dots,x_{n}\}\subset\mathbbm{R}^{d} be any set of points. The 11-column subset selection problem asks to optimize

minS=span({xi})xiXxXxρS(x)22\displaystyle\min_{\begin{subarray}{c}S=\mathrm{span}(\{x_{i}\})\\ x_{i}\in X\end{subarray}}\sum_{x\in X}\|x-\rho_{S}(x)\|_{2}^{2}

Again, notice the difference between 11-column subset selection and (1,2)(1,2)-subspace approximation: the subspace SS is restricted to be the span of a point from XX. Given Theorem 12, it is not surprising that the Johnson-Lindenstrauss transform cannot reduce the dimension of 11-column subset selection to o(logn)o(\log n) without distorting the optimal cost by a constant factor.

Theorem 13.

For large enough n,dn,d\in\mathbbm{N}, there exists a set of points XdX\subset\mathbbm{R}^{d} such that, with high probability over the draw of 𝚷𝒥d,t\mathbf{\Pi}\sim\mathcal{J}_{d,t} where t=o(logn)t=o(\log n),

minS=span(x)xXxXxρS(x)22minS=span(𝚷(x))xXxX𝚷(x)ρS(𝚷(x))223/2o(1).\displaystyle\frac{\displaystyle\mathop{\min}_{\begin{subarray}{c}S=\mathrm{span}(x)\\ x\in X\end{subarray}}\sum_{x\in X}\|x-\rho_{S}(x)\|_{2}^{2}}{\displaystyle\mathop{\min}_{\begin{subarray}{c}S^{\prime}=\mathrm{span}(\mathbf{\Pi}(x))\\ x\in X\end{subarray}}\sum_{x\in X}\|\mathbf{\Pi}(x)-\rho_{S^{\prime}}(\mathbf{\Pi}(x))\|_{2}^{2}}\geq 3/2-o(1).

The proof is slightly more involved. The instance sets d=n+1d=n+1 and X={x1,,xn}X=\{x_{1},\dots,x_{n}\} where xi=(en+1+ei)/2x_{i}=(e_{n+1}+e_{i})/\sqrt{2}. For the subspace SS spanned by any one point xix_{i}, a straightforward calculation (the points are unit vectors with pairwise inner product 1/21/2) shows that the distance between xjx_{j} and ρS(xj)\rho_{S}(x_{j}) is 3/4\sqrt{3/4} whenever jij\neq i, and therefore the cost of 11-column subset selection in XX is 3/4(n1)3/4\cdot(n-1). We apply dimension reduction to t=o(logn)t=o(\log n) and think of 𝒈1,,𝒈n+1t\boldsymbol{g}_{1},\dots,\boldsymbol{g}_{n+1}\in\mathbbm{R}^{t} as the independent Gaussian vectors given by 𝚷(e1),,𝚷(en+1)\mathbf{\Pi}(e_{1}),\dots,\mathbf{\Pi}(e_{n+1}). As in the 11-medoid case, with high probability there exists an index i[n]i\in[n] for which 𝒈i22=o(1)\|\boldsymbol{g}_{i}\|_{2}^{2}=o(1), and when this occurs, 𝚷(xi)\mathbf{\Pi}(x_{i}) is essentially 𝒈n+1/2\boldsymbol{g}_{n+1}/\sqrt{2} (because 𝚷(xi)𝒈n+1/22=o(1)\|\mathbf{\Pi}(x_{i})-\boldsymbol{g}_{n+1}/\sqrt{2}\|_{2}=o(1)). Letting SS be the subspace spanned by 𝚷(xi)\mathbf{\Pi}(x_{i}), the squared distance 𝚷(xj)ρS(𝚷(xj))22\|\mathbf{\Pi}(x_{j})-\rho_{S}(\mathbf{\Pi}(x_{j}))\|_{2}^{2} is at most 𝒈j22/2+o(1)\|\boldsymbol{g}_{j}\|_{2}^{2}/2+o(1), because SS is essentially spanned by 𝒈n+1\boldsymbol{g}_{n+1}, which accounts for the 𝒈n+1/2\boldsymbol{g}_{n+1}/\sqrt{2} component of each 𝚷(xj)\mathbf{\Pi}(x_{j}). Summing over jj and using that, with high probability, i=1n𝒈i22(1+o(1))n\sum_{i=1}^{n}\|\boldsymbol{g}_{i}\|_{2}^{2}\leq(1+o(1))n, the cost of 11-column subset selection on 𝚷(X)\mathbf{\Pi}(X) is at most n/2(1+o(1))n/2\cdot(1+o(1)).
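The construction is easy to simulate; the sketch below (our own code, with illustrative choices of nn and tt) evaluates the 11-column subset selection objective on the instance before and after a Gaussian projection.

```python
# Numpy sketch of the Theorem 13 instance: x_i = (e_{n+1} + e_i) / sqrt(2),
# projected by a Gaussian JL map to t dimensions (illustrative parameters).
import numpy as np

def one_css_cost(X):
    # 1-column subset selection: min over i of sum_j ||x_j - rho_{span(x_i)}(x_j)||^2
    total = (X ** 2).sum()
    return min(total - ((X @ (xi / np.linalg.norm(xi))) ** 2).sum() for xi in X)

rng = np.random.default_rng(3)
n, t = 2000, 5                                    # t stands in for t = o(log n)
d = n + 1

E = np.eye(d)
X = (E[-1] + E[:n]) / np.sqrt(2)                  # row i is x_i = (e_{n+1} + e_i)/sqrt(2)

original = one_css_cost(X)                        # equals 3/4 * (n - 1) exactly
Pi = rng.normal(scale=1 / np.sqrt(t), size=(d, t))
projected = one_css_cost(X @ Pi)                  # roughly n/2 when t is very small

print(original / projected)                       # trends toward 3/2 as n grows
```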

As above, Theorem 13 does not contradict [CEM+15], even though it means that the sense in which “all candidate subspaces” are preserved needs to be interpreted carefully. Again, the notion of “candidate solution” is different. In the matrix notation that [CEM+15] uses, the points in XX are stacked into the rows of an n×dn\times d matrix (which we also denote XX). A Johnson-Lindenstrauss map 𝚷\mathbf{\Pi} is represented by a d×td\times t matrix, and applying the map to every point in XX corresponds to the operation X𝚷X\mathbf{\Pi} (which is now an n×tn\times t matrix). [CEM+15] shows that if 𝚷\mathbf{\Pi} is sampled with t=O(k/ε2)t=O(k/\varepsilon^{2}), then the following occurs with high probability: for all rank-kk projection matrices Pn×nP\in\mathbbm{R}^{n\times n}, we have

XPXF21±εX𝚷PX𝚷F2.\|X-PX\|_{F}^{2}\approx_{1\pm\varepsilon}\|X\mathbf{\Pi}-PX\mathbf{\Pi}\|_{F}^{2}.

Note that when we multiply the matrix XX on the left by PP, we are projecting the dd columns of XX (which are vectors in n\mathbbm{R}^{n}) onto a kk-dimensional subspace of n\mathbbm{R}^{n}. This is different from approximating all points in XX with a kk-dimensional subspace of d\mathbbm{R}^{d}, which would correspond to finding a rank-kk projection matrix Sd×dS\in\mathbbm{R}^{d\times d} and considering XXSF2\|X-XS\|_{F}^{2}. In the matrix notation of [CEM+15], the dimension reduction result for (k,2)(k,2)-subspace approximation says that

minSd×d rank-kprojectionXXSF21±εminSt×t rank-kprojectionX𝚷X𝚷SF2.\displaystyle\min_{\begin{subarray}{c}S\in\mathbbm{R}^{d\times d}\\ \text{ rank-$k$}\\ \text{projection}\end{subarray}}\|X-XS\|_{F}^{2}\approx_{1\pm\varepsilon}\min_{\begin{subarray}{c}S^{\prime}\in\mathbbm{R}^{t\times t}\\ \text{ rank-$k$}\\ \text{projection}\end{subarray}}\|X\mathbf{\Pi}-X\mathbf{\Pi}S^{\prime}\|_{F}^{2}. (33)

At the optimal Sd×dS\in\mathbbm{R}^{d\times d} and the optimal Pn×nP\in\mathbbm{R}^{n\times n}, the two costs coincide (a property which holds only for z=2z=2). Thus, [CEM+15] implies (33), but it does not say that the cost of every candidate subspace of d\mathbbm{R}^{d} is preserved (note the type mismatch between the rank-kk projections on the left- and right-hand sides of (33)).
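For z=2z=2, both sides of (33) can be computed from singular values via the Eckart-Young theorem, which the following numpy sketch (our own code with illustrative dimensions, not anything from [CEM+15]) uses to compare the two optimal costs.

```python
# Sketch of (33) for z = 2: the optimal rank-k subspace-approximation cost of a
# point set equals the sum of its squared singular values beyond the top k
# (Eckart-Young). Illustrative dimensions; not code from [CEM+15].
import numpy as np

rng = np.random.default_rng(4)
n, d, k, t = 300, 800, 4, 200        # t stands in for t = O(k / eps^2)

# Low-rank signal plus noise, so the optimal subspace is non-trivial.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
Pi = rng.normal(scale=1 / np.sqrt(t), size=(d, t))

def best_rank_k_cost(A, k):
    # min over rank-k projections S of ||A - A S||_F^2 = sum_{i > k} sigma_i(A)^2
    s = np.linalg.svd(A, compute_uv=False)
    return (s[k:] ** 2).sum()

# The two optimal costs should agree up to a (1 +/- eps) factor, even though
# the optimal subspaces live in R^d and R^t, respectively.
print(best_rank_k_cost(X, k), best_rank_k_cost(X @ Pi, k))
```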

References

  • [AHPV05] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 2005.
  • [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians (ICM ’2018), 2018.
  • [AP03] Pankaj K. Agarwal and Cecilia M. Procopiuc. Approximation algorithms for projective clustering. Journal of Algorithms, 46(2):115–139, 2003.
  • [BBCA+19] Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for kk-means: Beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
  • [BBH+20] Daniel Baker, Vladimir Braverman, Lingxiao Huang, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in graphs of bounded treewidth. In Proceedings of the 37th International Conference on Machine Learning (ICML ’2020), 2020.
  • [BFL16] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
  • [BHPI02] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on the Theory of Computing (STOC ’2002), 2002.
  • [BZD10] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for kk-means clustering. In Proceedings of Advances in Neural Information Processing Systems 23 (NeurIPS ’2010), 2010.
  • [CALSS22] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for kk-median and kk-means coresets. In Proceedings of the 54th ACM Symposium on the Theory of Computing (STOC ’2022), 2022.
  • [CASS21] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd ACM Symposium on the Theory of Computing (STOC ’2021), 2021.
  • [CEM+15] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Mădălina Persu. Dimensionality reduction for kk-means clustering and low rank approximation. In Proceedings of the 47th ACM Symposium on the Theory of Computing (STOC ’2015), 2015.
  • [Che09] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
  • [CN21] Yeshwanth Cherapanamjeri and Jelani Nelson. Terminal embeddings in sublinear time. In Proceedings of the 62nd Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2021), 2021.
  • [DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
  • [DRVW06] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(1):225–247, 2006.
  • [EFN17] Michael Elkin, Arnold Filtser, and Ofer Neiman. Terminal embeddings. Theoretical Computer Science, 697:1–36, 2017.
  • [EV05] Michael Edwards and Kasturi Varadarajan. No coreset, no cry: II. In Proceedings of the 25th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2005), 2005.
  • [Fel20] Dan Feldman. Introduction to core-sets: an updated survey. 2020.
  • [FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on the Theory of Computing (STOC ’2011), 2011.
  • [FSS13] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th ACM-SIAM Symposium on Discrete Algorithms (SODA ’2013), 2013.
  • [FSS20] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
  • [HIM12] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.
  • [HP04] Sariel Har-Peled. No coreset, no cry. In Proceedings of the 24th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2004), 2004.
  • [HPM04] Sariel Har-Peled and Soham Mazumdar. On coresets for kk-means and kk-median clustering. In Proceedings of the 36th ACM Symposium on the Theory of Computing (STOC ’2004), 2004.
  • [HPV02] Sariel Har-Peled and Kasturi Varadarajan. Projective clustering in high-dimensions using core-sets. In Proceedings of the 18th ACM Symposium on Computational Geometry (SoCG ’2002), 2002.
  • [HV20] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in Euclidean spaces: importance sampling is nearly optimal. In Proceedings of the 52nd ACM Symposium on the Theory of Computing (STOC ’2020), 2020.
  • [IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on the Theory of Computing (STOC ’1998), pages 604–613, 1998.
  • [ISZ21] Zachary Izzo, Sandeep Silwal, and Samson Zhou. Dimensionality reduction for Wasserstein barycenter. In Proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS ’2021), 2021.
  • [JL84] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. 1984.
  • [KR15] Michael Kerber and Sharath Raghvendra. Approximation and streaming algorithms for projective clustering via random projections. In Proceedings of the 27th Canadian Conference on Computational Geometry (CCCG ’2015), 2015.
  • [LS10] Michael Langberg and Leonard J. Schulman. Universal epsilon-approximators for integrals. In Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms (SODA ’2010), 2010.
  • [Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
  • [MMMR18] Sepideh Mahabadi, Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Nonlinear dimension reduction via outer bi-Lipschitz extensions. In Proceedings of the 50th ACM Symposium on the Theory of Computing (STOC ’2018), 2018.
  • [MMR19] Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for kk-means and kk-medians clustering. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
  • [NN19] Shyam Narayanan and Jelani Nelson. Optimal terminal dimensionality reduction in Euclidean space. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC ’2019), 2019.
  • [NSIZ21] Shyam Narayanan, Sandeep Silwal, Piotr Indyk, and Or Zamir. Randomized dimensionality reduction for facility location and single-linkage clustering. In Proceedings of the 38th International Conference on Machine Learning (ICML ’2021), 2021.
  • [Sar06] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2006), 2006.
  • [SV07] Nariankadu D. Shyamalkumar and Kasturi Varadarajan. Efficient subspace approximation algorithms. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA ’2007), 2007.
  • [SV12] Nariankadu D. Shyamalkumar and Kasturi Varadarajan. Efficient subspace approximation algorithms. Discrete and Computational Geometry, 47(1):44–63, 2012.
  • [SW18] Christian Sohler and David Woodruff. Strong coresets for kk-median and subspace approximation. In Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’2018), 2018.
  • [TWZ+22] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. arXiv preprint arXiv:2203.04370, 2022.
  • [VX12a] Kasturi Varadarajan and Xin Xiao. A near-linear algorithm for projective clustering of integer points. In Proceedings of the 33rd ACM-SIAM Symposium on Discrete Algorithms (SODA ’2012), 2012.
  • [VX12b] Kasturi Varadarajan and Xin Xiao. On the sensitivity of shape fitting problems. In Proceedings of the 32nd International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS ’2012), 2012.
  • [Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.