
Streaming Euclidean Max-Cut: Dimension vs Data Reduction

Xiaoyu Chen
Peking University
Email: yuchen21@stu.pku.edu.cn
   Shaofeng H.-C. Jiang
Peking University
Research partially supported by a national key R&D program of China No. 2021YFA1000900, a startup fund from Peking University, and the Advanced Institute of Information Technology, Peking University. Email: shaofeng.jiang@pku.edu.cn
   Robert Krauthgamer
Weizmann Institute of Science
Work partially supported by ONR Award N00014-18-1-2364, by a Weizmann-UK Making Connections Grant, by a Minerva Foundation grant, and the Weizmann Data Science Research Center. Email: robert.krauthgamer@weizmann.ac.il
Abstract

Max-Cut is a fundamental problem that has been studied extensively in various settings. We design an algorithm for Euclidean Max-Cut, where the input is a set of points in $\mathbb{R}^{d}$, in the model of dynamic geometric streams, where the input $X\subseteq[\Delta]^{d}$ is presented as a sequence of point insertions and deletions. Previously, Frahling and Sohler [STOC 2005] designed a $(1+\varepsilon)$-approximation algorithm for the low-dimensional regime, i.e., it uses space $\exp(d)$.

To tackle this problem in the high-dimensional regime, which is of growing interest, one must improve the dependence on the dimension $d$, ideally to space complexity $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$. Lammersen, Sidiropoulos, and Sohler [WADS 2009] proved that Euclidean Max-Cut admits dimension reduction with target dimension $d^{\prime}=\operatorname{poly}(\varepsilon^{-1})$. Combining this with the aforementioned algorithm that uses space $\exp(d^{\prime})$, they obtain an algorithm whose overall space complexity is indeed polynomial in $d$, but unfortunately exponential in $\varepsilon^{-1}$.

We devise an alternative approach of data reduction, based on importance sampling, and achieve space bound $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$, which is exponentially better (in $\varepsilon$) than the dimension-reduction approach. To implement this scheme in the streaming model, we employ a randomly-shifted quadtree to construct a tree embedding. While this is a well-known method, a key feature of our algorithm is that the embedding’s distortion $O(d\log\Delta)$ affects only the space complexity, and the approximation ratio remains $1+\varepsilon$.

1 Introduction

Max-Cut is a fundamental problem in multiple domains, from constraint satisfaction (CSP) and linear equations to clustering. It was studied extensively in many computational models and for many types of inputs, and many (nearly) tight bounds were obtained, oftentimes leading the way to even more general problems. For instance, in the offline setting, Max-Cut admits a polynomial-time $0.878$-approximation for general graphs [GW95], and this approximation factor is tight under the Unique Games Conjecture [KKMO07]. In contrast, if the input is a dense unweighted graph, or a metric space (viewed as a weighted graph), then a PTAS exists [dlVK00, dlVK01]. In the graph-streaming setting, $(1+\varepsilon)$-approximation can be obtained using $\tilde{O}(n)$ space [AG09], and this space bound is tight [KK19].

However, the streaming complexity of Max-Cut is only partially resolved in the geometric setting, i.e., for Euclidean points. A known algorithm, due to Frahling and Sohler [FS05], achieves $(1+\varepsilon)$-approximation but uses space $\exp(d)$, which is prohibitive when the dimension is high. Combining this algorithm with a dimension-reduction result, which is based on the Johnson-Lindenstrauss Lemma but specialized to Max-Cut and has target dimension $\operatorname{poly}(\varepsilon^{-1})$ [LSS09, Lam11], one can achieve polynomial dependence on $d$, but at the expense of introducing to the space complexity an undesirable $\exp(\operatorname{poly}(\varepsilon^{-1}))$-factor. It was left open to obtain in the high-dimension regime space complexity that is truly efficient, i.e., $\operatorname{poly}(\varepsilon^{-1}d)$.

We answer this question by providing the first streaming algorithms that achieve $(1+\varepsilon)$-approximation for Max-Cut using space $\operatorname{poly}(d\varepsilon^{-1})$. We consider the setting of dynamic geometric streams, introduced by Indyk [Ind04], where the input is a dataset $X\subseteq[\Delta]^{d}$ that is presented as a stream of point insertions and deletions. The goal of the algorithm is to approximate (multiplicatively) the so-called Max-Cut value, defined as

\textsc{Max-Cut}(X):=\max_{S\subseteq X}\sum_{x\in S,y\in X\setminus S}\|x-y\|_{2}

(see Section 2 for general metric spaces). We say that $\textsc{Max-Cut}(X)$ is $\alpha$-approximated, for $\alpha\geq 1$, by a value $\eta\geq 0$ if $\textsc{Max-Cut}(X)/\alpha\leq\eta\leq\textsc{Max-Cut}(X)$. (We will actually aim at $\eta\in(1\pm\varepsilon)\cdot\textsc{Max-Cut}(X)$, for $0<\varepsilon<1/2$, which can be scaled to achieve $(1+O(\varepsilon))$-approximation.) We assume throughout that $X$ contains distinct points (and is not a multiset), hence $n:=|X|\leq\Delta^{d}$. In the high-dimension regime, algorithms can use at most $\operatorname{poly}(d\log\Delta)$ bits of space, which is polynomial in the number of bits required to represent a point in $[\Delta]^{d}$, and also allows counting to $n\leq\Delta^{d}$. In the low-dimension regime, algorithms may have bigger space complexity, e.g., exponential in $d$.
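
To make the objective concrete, the following is a minimal offline sketch (in Python, not part of the streaming algorithm) that evaluates $\textsc{Max-Cut}(X)$ by brute force over all cuts; it is feasible only for very small $X$ and serves merely to fix the definition and the notion of $(1+\varepsilon)$-approximation.

```python
def lp_dist(x, y, p=2.0):
    """l_p distance between two points given as tuples of coordinates."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def max_cut(X, p=2.0):
    """Brute-force Max-Cut value of a small point set X (a list of distinct tuples)."""
    n, best = len(X), 0.0
    # Enumerate all cuts; fixing point 0 on side S avoids counting each cut twice.
    for mask in range(2 ** (n - 1)):
        in_S = [True] + [bool((mask >> j) & 1) for j in range(n - 1)]
        val = sum(lp_dist(X[a], X[b], p)
                  for a in range(n) for b in range(n) if in_S[a] and not in_S[b])
        best = max(best, val)
    return best

# An estimate eta is a (1+eps)-approximation if Max-Cut(X)/(1+eps) <= eta <= Max-Cut(X).
print(max_cut([(0, 0), (4, 0), (0, 3), (5, 5)]))
```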

A central challenge in the area of geometric streaming algorithms is to achieve good accuracy (approximation) in the high-dimension regime. Indeed, algorithms for many basic problems (like diameter, minimum spanning tree, facility location, and Max-Cut) achieve good approximation, say for concreteness $O(1)$ or even $1+\varepsilon$, using space that is exponential in $d$ [AHV04, Zar11, FS05, FIS08, LS08, CLMS13, CJKV22]. In contrast, algorithms that use space polynomial in $d$ are fewer and they typically achieve far worse approximation ratio [Ind04, CJLW22, CJK+22, WY22], and obtaining $O(1)$-approximation remains open. In particular, Indyk [Ind04] tackled the high-dimension regime using a technique of randomized tree embedding, which is rather general and economical in space, but unfortunately distorts distances by a factor of $O(d\log\Delta)$ that goes directly into the approximation ratio. Attempts to improve the approximation ratio had only limited success so far; for example, the algorithm of [AS15] (for diameter) works only in insertion-only streams, and the algorithms of [CJLW22, CJK+22] (for MST and for facility location) fall short of the desired $O(1)$-approximation in one pass.

1.1 Our Results

We bypass the limitation of dimension reduction via a data-reduction approach, and design a streaming algorithm that $(1+\varepsilon)$-approximates Max-Cut using $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ space, thereby closing the gap of high dimension (for Max-Cut). Our approach works not only under the Euclidean norm, but also when distances are calculated using the $\ell_{p}$ norm, for $p\geq 1$.

Data Reduction via Importance Sampling

We present an algorithm that is based on the data-reduction approach, namely, it uses the dataset $X$ to construct a small instance $X^{\prime}$ that has a similar Max-Cut value, then solves Max-Cut on it optimally and reports this value $\textsc{Max-Cut}(X^{\prime})$. Following a common paradigm, $X^{\prime}$ is actually a re-weighted subset of $X$, that is picked by non-uniform sampling from $X$, known as importance sampling.

Theorem 1.1 (Streaming Max-Cut in $\ell_{p}$ Norm).

There is a randomized streaming algorithm that, given $0<\varepsilon<1/2$, $p\geq 1$, integers $\Delta,d\geq 1$, and an input dataset $X\subseteq[\Delta]^{d}$ presented as a dynamic stream, uses space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ and reports an estimate $\eta>0$ that with probability at least $2/3$ is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$ in $\ell_{p}$.

This data-reduction approach was previously used for several clustering problems. For $k$-median and related problems, such an instance $X^{\prime}$ is often called a coreset, and there are many constructions, see e.g. [HM04, FS05, FL11, BFL+17, HSYZ18, FSS20]. Earlier work [Sch00] has designed importance-sampling based algorithms for a problem closely related to Max-Cut, but did not provide a guarantee that $X$ and $X^{\prime}$ have a similar Max-Cut value (see Section 1.3 for a more detailed discussion). Recently, importance sampling was used to design streaming algorithms for facility location in high dimension [CJK+22], although their sample $X^{\prime}$ is not an instance of facility location. Overall, this prior work is not useful for us, and we have to devise our own sampling distribution, prove that it preserves the Max-Cut value, and design a streaming algorithm that samples from this distribution.

The approximation ratio $1+\varepsilon$ in Theorem 1.1 is essentially the best one can hope for using small space, because finding the Max-Cut value exactly, even in one dimension, requires $\Omega(\Delta)$ space, as shown in A.3. Compared with the dimension-reduction approach (based on [FS05]), our Theorem 1.1 has the advantage that it works for all $\ell_{p}$ norms ($p\geq 1$) and not only $\ell_{2}$. The result of [FS05] is stronger in another aspect, of providing a “cut oracle”, i.e., an implicit representation of the approximately optimal cut that can answer, for each data point $x\in X$ given as a query, which side of the cut it belongs to. This feature extends also to high dimension, as it is easy to combine with the dimension reduction. For completeness, we give a (somewhat simplified) proof of the dimension reduction in Appendix B, followed by a formal statement of this “cut oracle” in Corollary B.3. It remains open to design a streaming algorithm that computes such an implicit representation using space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$.

1.2 Technical Overview

In order to estimate Max-Cut using importance sampling, we must first identify a sampling distribution for which $\textsc{Max-Cut}(X^{\prime})$ indeed approximates $\textsc{Max-Cut}(X)$, and then we have to design a streaming algorithm that samples from this distribution.

Sampling Probability

One indication that geometric Max-Cut admits data reduction by sampling comes from the setting of dense unweighted graphs, where it is known that Max-Cut can be $(1+\varepsilon)$-approximated using a uniform sample of $O(\varepsilon^{-4})$ vertices, namely, by taking the Max-Cut value in the induced subgraph and scaling appropriately [AdlVKK03, RV07] (improving over [GGR96]). However, sampling points uniformly clearly cannot work in the metric case – if a point set $X$ has many co-located points and a few distant points that are also far from each other, then uniform sampling from $X$ is unlikely to pick the distant points, which have a crucial contribution to the Max-Cut value. It is therefore natural to employ importance sampling, i.e., sample each point with probability proportional to its contribution. The contribution of a point $x$ to the Max-Cut value is difficult to gauge, but we can use instead a simple proxy – its total distance to all other points $q(x):=\sum_{y\in X}\operatorname{dist}(x,y)$, which is just its contribution to twice the total edge weight $\sum_{x,y\in X}\operatorname{dist}(x,y)$. For any fixed cut in $X$, this sampling works well and the estimate will (likely) have additive error $\varepsilon\sum_{x,y\in X}\operatorname{dist}(x,y)=\Theta(\varepsilon)\cdot\textsc{Max-Cut}(X)$. While the analysis is straightforward for a fixed cut, say a maximum one, proving that the sampling preserves the Max-Cut value is much more challenging, as one has to argue about all possible cuts.

We show in Theorem 3.1 that $O(\varepsilon^{-4})$ independent samples generated with probability proportional to $q(x)$ preserve the Max-Cut value. This holds even if the sampling probabilities are dampened by a factor $\lambda\geq 1$, at the cost of increasing the number of samples by a $\operatorname{poly}(\lambda)$ factor. To be more precise, it suffices to sample from any probability distribution $\{p_{x}:x\in X\}$, where $p_{x}\geq\frac{1}{\lambda}\frac{q(x)}{\sum_{y\in X}q(y)}$ for all $x\in X$. A small technicality is that we require the sampling procedure to report a random sample $x^{*}$ together with its corresponding $p_{x^{*}}$, in order to re-weight the sample $x^{*}$ by factor $1/p_{x^{*}}$, but this extra information can often be obtained using the same methods.
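
As a small illustration (an offline sketch, not the streaming procedure of Section 4), the following draws one sample from a distribution of this kind: it computes $q(x)$ for every point, forms probabilities proportional to $q(x)$, and returns the sampled point together with its probability, which the estimator later uses for re-weighting.

```python
import random

def lp_dist(x, y, p=2.0):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def importance_sample(X, p=2.0, rng=random.Random(0)):
    """Draw one point of X with probability proportional to q(x) = sum_y dist(x, y),
    i.e., with dampening factor lambda = 1; return (point, its sampling probability)."""
    q = [sum(lp_dist(x, y, p) for y in X) for x in X]
    Q = sum(q)
    probs = [qx / Q for qx in q]  # any p_x >= q(x) / (lambda * Q) would also be admissible
    i = rng.choices(range(len(X)), weights=probs, k=1)[0]
    return X[i], probs[i]
```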

This sampling distribution, i.e., probabilities proportional to $\{q(x):x\in X\}$, can be traced to two prior works. Schulman [Sch00] used essentially the same probabilities, but his analysis works only for one fixed cut, and extends to multiple cuts by a union bound. (We gloss over slight technical differences, e.g., he deals with squared Euclidean distances, and his sampling and re-weighting processes are slightly different.) The exact same probabilities were also used by [dlVK01] as weights to convert a metric instance to a dense unweighted graph, and thereby obtain a PTAS (without sampling or any data reduction).

In fact, our proof combines several observations from [dlVK01] with the analysis of a uniform sample of $O(\varepsilon^{-4})$ vertices in unweighted graphs [AdlVKK03, RV07]. In a nutshell, we relate sampling from $X$ proportionally to $\{q(x):x\in X\}$ with sampling uniformly from a certain dense unweighted graph, whose cut values correspond to those in $X$, and we thereby derive a bound on the Max-Cut value of the sample $X^{\prime}$.

Streaming Implementation

Designing a procedure to sample proportionally to $\{q(x):x\in X\}$, when $X$ is presented as a stream, is much more challenging and is our main technical contribution (Lemma 4.2). The main difficulty is that standard tools for sampling from a stream, such as $\ell_{p}$-samplers [MW10, JST11, JW21], are based on the frequency vector and oblivious to the geometric structure of $X$. Indeed, the literature lacks geometric samplers, which can be very useful when designing geometric streaming algorithms. Our goal of sampling proportionally to $q(x)$, which is the total distance to all other points in $X$, seems like a fundamental geometric primitive, and therefore our sampler (Lemma 4.2) is a significant addition to the geometric-streaming toolbox, and may be of independent interest.

High-Level Picture

At a high level, our sampling procedure is based on a randomly-shifted quadtree $T$ that is defined on the entire input domain $[\Delta]^{d}$ and captures its geometry. This randomly-shifted quadtree $T$ is data oblivious, and thus can be picked (constructed implicitly) even before the stream starts (as an initialization step), using space $O(\operatorname{poly}(d\log\Delta))$. This technique was introduced by Indyk [Ind04], who noted that the quadtree essentially defines a tree embedding with expected distortion of distances $O(d\log\Delta)$. Unfortunately, the distortion is fundamental to this approach, and directly affects the accuracy (namely, goes into the approximation ratio) of streaming algorithms that use this technique [Ind04, CJLW22]. (A randomly-shifted quadtree was used also in Arora’s approximation algorithm for TSP [Aro98], but as the basis for dynamic programming rather than as a tree embedding, and similarly in streaming applications of this technique [CJKV22].) While our approach still suffers this distortion, we can avoid its effect on the accuracy; instead, it affects the importance-sampling distribution, namely, the dampening factor $\lambda$ grows by factor $O(d\log\Delta)$, which can be compensated by drawing more samples. This increases the space complexity moderately, which we can afford, and overall leads to $(1+\varepsilon)$-approximation using space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$.

Comparison to Other Sampling Approaches

A different importance sampling method was recently designed in [CJK+22] for facility location in high dimension. Their sampling procedure relies on a geometric hashing (space partitioning), and completely avoids a quadtree. (A key parameter in that hashing, called consistency, turns out to be $\operatorname{poly}(d)$, and unfortunately affects also the final approximation ratio for facility location, and not only the importance-sampling parameter $\lambda$, which affects only the space.) Our sampling technique may be more easily applicable to other problems, as it uses a more standard and general tool of a randomly-shifted quadtree. Another importance-sampling method was recently designed in [MRWZ20] in the context of matrix streams. Geometrically, their importance-sampling can be viewed as picking a point (row in the matrix) proportionally to its length, but after the points are projected, at query time, onto a subspace. This sampling procedure is very effective for linear-algebraic problems, like column-subset selection and subspace approximation, which can be viewed also as geometric problems.

Previously, quadtrees were employed in geometric sampling results, but mostly a fixed quadtree and not a randomly-shifted one, and to perform uniform sampling over its non-empty squares at each level, as opposed to our importance sampling [FS05, FIS08, CJKV22]. Furthermore, in [FIS08], uniform sampling was augmented to report also all the points nearby the sampled points, and this was used to estimate the number of connected components in a metric threshold graph.

Sampling on Quadtree: Bypassing the Distortion

Finally, we discuss how to perform the importance sampling, based on a randomly-shifted quadtree $T$, but without the $O(d\log\Delta)$ distortion going into the approximation ratio. As explained above, our importance sampling (Theorem 3.1) requires that every point $x\in X$ is sampled with some probability $p(x)\geq\frac{1}{\lambda}\frac{q(x)}{Q}$ for $Q:=\sum_{y\in X}q(y)$, and the dampening factor $\lambda$ can be up to $O(\operatorname{poly}(d\log\Delta))$. We will exploit the fact that there is no upper bound on the probability $p(x)$. As usual, a randomly-shifted quadtree $T$ is obtained by recursively partitioning the grid $[2\Delta]^{d}$, each time into $2^{d}$ equal-size grids, which we call squares, and building a tree whose nodes are all these squares, and every square is connected to its parent by an edge whose weight is equal to that square’s Euclidean diameter. Furthermore, this entire partitioning is shifted by a uniformly random $v_{\textrm{shift}}\in[\Delta]^{d}$. (See Section 4.1 for details.) The key is that every grid point $x\in[\Delta]^{d}$ is represented by a tree leaf, and thus $T$ defines distances on $[\Delta]^{d}$, also called a tree embedding. Define $q_{T}$ and $Q_{T}$ analogously to $q$ and $Q$, but using the tree distance, that is, $q_{T}(x)$ is the total tree distance from $x$ to all other points in $X$, and $Q_{T}:=\sum_{y\in X}q_{T}(y)$.

The tree-embedding guarantees, for all $x\in[\Delta]^{d}$, that $q_{T}(x)$ overestimates $q(x)$ (with probability $1$), and that $\operatorname*{\mathbb{E}}_{T}[q_{T}(x)]\leq O(d\log\Delta)\cdot q(x)$. It thus suffices to have one event, $Q_{T}\leq O(d\log\Delta)\cdot Q$, which happens with constant probability by Markov’s inequality, and then we would have $\frac{q_{T}(x)}{Q_{T}}\geq\frac{1}{O(d\log\Delta)}\cdot\frac{q(x)}{Q}$ simultaneously for all $x\in X$. As mentioned earlier, the algorithm picks (constructs implicitly) $T$ even before the stream starts, and the analysis assumes the event mentioned above happens. In fact, the rest of the sampling procedure works correctly for arbitrary $T$, i.e., regardless of that event. We stress, however, that our final algorithm actually needs to generate multiple samples that are picked independently from the same distribution, and this is easily achieved by parallel executions that share the same random tree $T$ but use independent coins in all later steps. We thus view this $T$ as a preprocessing step that is run only once, particularly when we refer to samples as independent.

Two-level Sampling Using the Tree Structure

The remaining challenge is to sample (in a streaming fashion) with probability proportional to $q_{T}$ for a fixed quadtree $T$. To this end, we make heavy use of the structure of the quadtree. In particular, the quadtree $T$ has $O(d\log\Delta)$ levels, and every data point $x\in X$ forms a distinct leaf in $T$. Clearly, each internal node in $T$ represents a subset of $X$ (all its descendants in $T$, that is, the points of $X$ inside its square). At each level $i$ (the root node has level $1$), we identify a level-$i$ heavy node $h_{i}$ that contains the maximum number of points from $X$ (breaking ties arbitrarily). We further identify a critical level $k$, such that $h_{1},\ldots,h_{k}$ (viewed as subsets of $X$) all contain more than $0.5|X|$ points, but $h_{k+1}$ does not. This clearly ensures that $h_{1},\ldots,h_{k}$ forms a path. Let $X(h_{i})\subseteq X$ denote the subset of $X$ represented by node $h_{i}$. The tree structure and the path property of $(h_{1},\ldots,h_{k})$ guarantee that, for each $i<k$, all points $x\in X(h_{i})\setminus X(h_{i+1})$ have roughly the same $q_{T}(x)$ values. For the boundary case $i=k$, a similar claim holds for $x\in X(h_{k})$. Hence, a natural algorithm is to employ two-level sampling: first draw a level $i^{*}\leq k$, then draw a uniform sample from $X(h_{i^{*}})\setminus X(h_{i^{*}+1})$. The distribution of sampling $i^{*}$ also requires a careful design, but we omit this from the current overview, and focus instead on how to draw a uniform sample from $X(h_{i})\setminus X(h_{i+1})$.

Unfortunately, sampling from $X(h_{i})\setminus X(h_{i+1})$ still requires $\Omega(\Delta)$ space in the streaming model, even for $d=1$. (We prove this in A.2, by a reduction from INDEX.) In fact, it is even hard to determine whether $X(h_{i})\setminus X(h_{i+1})=\emptyset$; the difficulty is that $h_{i}$ is not known in advance, otherwise it would be easy. We therefore relax the two-level sampling, and replace sampling from $X(h_{i})\setminus X(h_{i+1})$ with sampling from its superset $X\setminus X(h_{i+1})$. This certainly biases the sampling probability, in fact some probabilities might change by an unbounded factor (Remark 4.12); nevertheless, we prove that the increase in the dampening parameter $\lambda$ is bounded by an $O(\log\Delta)$ factor (Lemma 4.13), which we can still afford. This part crucially uses the tree structure and the path property of $(h_{1},\ldots,h_{k})$.

Sampling from Light Parts

The final remaining step is to sample from $X\setminus X(h_{i})$ for a given $i\leq k$ (Lemma 4.14). We can assume here that $X(h_{i})$ contains more than $0.5|X|$ points, and thus $X\setminus X(h_{i})$ is indeed the “light” part, containing few (and possibly no) points. To let the light points “stand out” in a sampling, we hash the tree nodes (instead of the points) randomly into two buckets. Since $|X(h_{i})|>0.5|X|$, the heavy node $h_{i}$ always lies in the bucket that contains more points, and we therefore sample only from the light bucket, as sketched below. (To implement this in streaming, we actually generate two samples, one from each bucket, and in parallel estimate the two buckets’ sizes to know which of the two samples to use.) Typically, the light bucket contains at least half of the points from $X\setminus X(h_{i})$, which is enough. Overall, this yields a sampling procedure that uses space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$.
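
The bucketing idea can be illustrated by the following offline sketch; the streaming version replaces the explicit loops by $\ell_{0}$-samplers and $\ell_{0}$-estimators (Lemmas 2.1 and 2.2). The helper `cell_id(x, i)`, which names the level-$i$ square containing $x$, is a hypothetical stand-in and not notation from the paper.

```python
import random

def sample_light_part(X, cell_id, i, rng=random.Random(0)):
    """Sample a point of X outside the heavy level-i node, by hashing the level-i
    tree nodes (squares) into two random buckets and sampling uniformly from the
    bucket holding fewer points; since |X(h_i)| > 0.5|X|, the heavy square always
    falls in the larger bucket, so the smaller bucket avoids X(h_i) entirely."""
    bucket_bit = {}
    def bit(cid):
        if cid not in bucket_bit:
            bucket_bit[cid] = rng.randint(0, 1)   # random hash of a tree node
        return bucket_bit[cid]
    buckets = {0: [], 1: []}
    for x in X:
        buckets[bit(cell_id(x, i))].append(x)
    light = 0 if len(buckets[0]) <= len(buckets[1]) else 1
    return rng.choice(buckets[light]) if buckets[light] else None
```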

1.3 Related Work

Geometric streaming in the low-dimension regime was studied much more extensively than in the high-dimensional case that we investigate here. In [FS05], apart from the $O(\varepsilon^{-d}\operatorname{poly}\log\Delta)$-space $(1+\varepsilon)$-approximation for Max-Cut that we already mentioned, similar results were obtained also for $k$-median, maximum spanning tree, maximum matching and similar maximization problems. For minimum spanning tree, a $(1+\varepsilon)$-approximation $O((\varepsilon^{-1}\log\Delta)^{d})$-space algorithm was devised in [FIS08], alongside several useful techniques for geometric sampling in low dimension. In [ABIW09], an $O(\varepsilon^{-1})$-approximation $\tilde{O}(\Delta^{\varepsilon})$-space streaming algorithm was obtained for computing earth-mover distance in dimension $d=2$. For facility location in dimension $d=2$, a $(1+\varepsilon)$-approximation $\operatorname{poly}(\varepsilon^{-1}\log\Delta)$-space algorithm was designed in [CLMS13]. Recently, Steiner forest (a generalization of Steiner tree, which asks to find a minimum-weight graph that connects $k$ groups of points) was studied in [CJK+22] for dimension $d=2$, and they obtain $O(1)$-approximation using space $\operatorname{poly}(k\log\Delta)$.

Our sampling distribution may seem reminiscent of an importance-sampling procedure devised by Schulman [Sch00]; however, his results and techniques are not useful in our context. First, the problem formulation differs, as it asks to approximate the complement objective of the sum of distances inside each cluster (which is only stronger than our Max-Cut objective), but the objective function sums squared Euclidean (rather than Euclidean) distances. Second, his algorithm is for the offline setting and is not directly applicable in streaming. Third, his analysis does not provide a guarantee on $\textsc{Max-Cut}(X^{\prime})$, but rather only on certain cuts of $X^{\prime}$ (not all of them), and his approximation guarantee includes a non-standard twist.

2 Preliminaries

Consider a metric space $(V,\operatorname{dist})$. The Euclidean case is $V=\mathbb{R}^{d}$ and $\operatorname{dist}=\ell_{2}$, and the $\ell_{p}$ case is $V=\mathbb{R}^{d}$ and $\operatorname{dist}=\ell_{p}$. For $X\subseteq V$, the cut function $\operatorname{cut}_{X}:2^{X}\to\mathbb{R}$ is defined as $\operatorname{cut}_{X}(S):=\sum_{x\in S,y\in X\setminus S}\operatorname{dist}(x,y)$. The Max-Cut value of a dataset $X\subseteq V$ is defined as

\textsc{Max-Cut}(X):=\max_{S\subseteq X}{\operatorname{cut}_{X}(S)}.

We shall use the following standard tools as building blocks in our algorithms. Recall that a turnstile (or dynamic) stream can contain insertions and deletions of items from some domain $[N]$, and it naturally defines a frequency vector $x\in\mathbb{R}^{N}$, where every possible item has a coordinate that counts its net frequency, i.e., insertions minus deletions. In general, this model allows frequencies to be negative. However, in our setting, where $X\subset[\Delta]^{d}$ is presented as a dynamic geometric stream, the frequency vector has $\Delta^{d}$ coordinates, all in the range $\{0,1\}$, and is just the incidence vector of $X$.
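
For concreteness, here is a toy sketch of this view (assuming, for illustration only, that points of $[\Delta]^{d}$ are encoded as indices of a vector of length $\Delta^{d}$): processing a dynamic geometric stream amounts to maintaining net frequencies, whose support is exactly $X$.

```python
def point_to_index(x, Delta):
    """Encode a point x in [Delta]^d (coordinates 1..Delta) as an index in [Delta^d]."""
    idx = 0
    for c in x:
        idx = idx * Delta + (c - 1)
    return idx

def stream_support(stream, Delta):
    """Maintain the (sparse) frequency vector of a dynamic geometric stream.
    `stream` is a sequence of (point, +1) insertions and (point, -1) deletions;
    for a valid input all final frequencies are in {0, 1}, so the support is X."""
    freq = {}
    for x, sign in stream:
        i = point_to_index(x, Delta)
        freq[i] = freq.get(i, 0) + sign
        if freq[i] == 0:
            del freq[i]
    return set(freq)  # indices of the points currently in X
```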

Lemma 2.1 ($\ell_{0}$-Norm Estimator [KNW10]).

There exists a streaming algorithm that, given $0<\varepsilon,\delta<1$, integers $N,M\geq 1$, and a frequency vector $x\in[-M,M]^{N}$ presented as a turnstile stream, where we denote its support by $X:=\{i\in[N]:x_{i}\neq 0\}$, uses space $\operatorname{poly}(\varepsilon^{-1}\log(\delta^{-1}MN))$ to return $r^{*}\geq 0$, such that $\Pr[r^{*}\in(1\pm\varepsilon)|X|]\geq 1-\delta$.

Lemma 2.2 ($\ell_{0}$-Sampler [JST11]).

There exists a streaming algorithm that, given $0<\delta<1$, integers $N,M\geq 1$, and a frequency vector $x\in[-M,M]^{N}$ presented as a turnstile stream, where we assume that its support $X:=\{i\in[N]:x_{i}\neq 0\}$ is non-empty, uses space $\operatorname{poly}\log(\delta^{-1}MN)$ to return a sample $i^{*}\in X\cup\{\perp\}$, such that with probability at least $1-\delta$,

\forall i\in X,\qquad\Pr[i^{*}=i]=\frac{1}{|X|}.

3 Approximating Max-Cut by Importance Sampling

In this section, we consider a general metric space $(V,\operatorname{dist})$ (which includes Euclidean spaces by setting $V=\mathbb{R}^{d}$ and $\operatorname{dist}=\ell_{2}$), and show that a small importance-sample $S$ on a dataset $X\subseteq V$ may be used to estimate $\textsc{Max-Cut}(X)$ by simply computing $\textsc{Max-Cut}(S)$.

Point-weighted Set [dlVK01]

Since we apply importance sampling (as opposed to using a uniform probability for every point), the sampled points need to be re-weighted so that the estimation is unbiased. Hence, we consider the notion of point-weighted sets. This notion was first considered in [dlVK01] to reduce the metric Max-Cut to Max-Cut in (dense) weighted graphs. Specifically, a point-weighted set $S\subseteq V$ is a subset of $V$ that is associated with a point-weight function $w_{S}:S\to\mathbb{R}_{+}$. For a point-weighted set $S$, the distance $\operatorname{dist}_{S}(x,y)$ between $x,y\in S$ is also re-weighted such that $\operatorname{dist}_{S}(x,y):=\frac{\operatorname{dist}(x,y)}{w_{S}(x)w_{S}(y)}$. Under this weighting, when an edge $\{x,y\}$ appears in a cut, its contribution is still accounted as $w_{S}(x)\cdot w_{S}(y)\cdot\operatorname{dist}_{S}(x,y)=\operatorname{dist}(x,y)$.

We prove (in Theorem 3.1) that for every dataset $X\subseteq V$, if one constructs a point-weighted subset $S\subseteq X$ by drawing i.i.d. samples from a distribution on $X$, where each $x\in X$ is sampled proportionally to $q(x)=\sum_{y\in X}\operatorname{dist}(x,y)$, which is the sum of distances to other points in $X$, up to an error factor of $\lambda$, then $\textsc{Max-Cut}(S)$ is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$ with high probability.

Theorem 3.1.

Given $\varepsilon,\delta>0$, $\lambda\geq 1$, a metric space $(V,\operatorname{dist})$ and a dataset $X\subseteq V$, let $\mathcal{D}$ be a distribution $(p_{x}:x\in X)$ on $X$ such that $\forall x\in X$, $p_{x}\geq\frac{1}{\lambda}\cdot\frac{q(x)}{Q}$, where $q(x)=\sum_{y\in X}\operatorname{dist}(x,y)$ and $Q=\sum_{x\in X}q(x)$. Let $S$ be a point-weighted set that is obtained by an i.i.d. sample of $m\geq 2$ points from $\mathcal{D}$, weighted by $w_{S}(x):=\hat{p}_{x}$ such that $p_{x}\leq\hat{p}_{x}\leq(1+\varepsilon)\cdot p_{x}$. If $m\geq O(\varepsilon^{-4}\lambda^{8})$, then with probability at least $0.9$, the value $\frac{\textsc{Max-Cut}(S)}{m^{2}}$ is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$.
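
The following is a minimal offline sketch of the estimator in Theorem 3.1, taking $\hat{p}_{x}=p_{x}$, $\lambda=1$, and a brute-force Max-Cut of the sample; it only spells out the re-weighting by $w_{S}(x)=\hat{p}_{x}$ and the final scaling by $m^{2}$, and is not the streaming algorithm of Section 4.

```python
import random

def lp_dist(x, y, p=2.0):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def estimate_max_cut(X, m, p=2.0, rng=random.Random(0)):
    """Estimate Max-Cut(X) as Max-Cut(S) / m^2 for an i.i.d. importance sample S."""
    q = [sum(lp_dist(x, y, p) for y in X) for x in X]
    Q = sum(q)
    probs = [qx / Q for qx in q]                     # p_x = q(x)/Q, i.e., lambda = 1
    idx = rng.choices(range(len(X)), weights=probs, k=m)
    S = [(X[j], probs[j]) for j in idx]              # point-weighted sample, w_S(x) = p_x
    best = 0.0
    # Brute-force Max-Cut of S under dist_S(x, y) = dist(x, y) / (w_S(x) * w_S(y)).
    for mask in range(2 ** (m - 1)):
        in_A = [True] + [bool((mask >> j) & 1) for j in range(m - 1)]
        val = sum(lp_dist(S[a][0], S[b][0], p) / (S[a][1] * S[b][1])
                  for a in range(m) for b in range(m) if in_A[a] and not in_A[b])
        best = max(best, val)
    return best / (m * m)
```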

The $O(\varepsilon^{-4})$ dependence on $\varepsilon$ in Theorem 3.1 matches a similar $O(\varepsilon^{-4})$ sampling complexity bound for unweighted graphs in [AdlVKK03, RV07]. (A slightly weaker bound of $O(\varepsilon^{-4}\operatorname{poly}\log(\varepsilon^{-1}))$ was obtained in [AdlVKK03], and [RV07] gave an improved technical lemma which can be directly plugged into [AdlVKK03] to obtain the $O(\varepsilon^{-4})$ bound.) To the best of our knowledge, this $O(\varepsilon^{-4})$ is the state of the art even for the case of unweighted graphs. Although our proof of Theorem 3.1 is obtained mostly by using the bound in [AdlVKK03, RV07] as a black box, the generalization to metric spaces, as well as the allowance of $\lambda$, which is the error in the sampling probability, is new.

3.1 Proof of Theorem 3.1

As mentioned, the plan is to apply the sampling bound proved in [AdlVKK03, RV07], which we restate in Lemma 3.2. In fact, the original statement in [AdlVKK03, RV07] was made only for unweighted graphs, i.e., edge weights in $\{0,1\}$. However, we observe that their proof uses only the fact that the edge weights lie in $[0,1]$. Hence, in our statement of Lemma 3.2 we make this stronger claim of $[0,1]$ edge weights.

Here, for a graph $G(V,E)$ with weight function $\mathrm{len}_{G}:E\to\mathbb{R}$ ($\mathrm{len}_{G}(\cdot)=1$ for unweighted graphs), we define the cut function for $S\subseteq X\subseteq V$ (as well as Max-Cut) similarly, as $\operatorname{cut}_{X}(S)=\sum_{x\in S,y\in X\setminus S:\{x,y\}\in E}\mathrm{len}_{G}(x,y)$.

Lemma 3.2 ([AdlVKK03, RV07]).

Consider a weighted graph $G(V,E)$ with weights in $[0,1]$. Let $D\subseteq V$ be a sample of $O(\varepsilon^{-4})$ points drawn independently and uniformly at random from $V$ (possibly with repetitions). Then with probability at least $0.9$,

\left|\frac{1}{|D|^{2}}\textsc{Max-Cut}(D)-\frac{1}{|V|^{2}}\textsc{Max-Cut}(V)\right|\leq\varepsilon.

Hence, our plan is to define an auxiliary graph $G^{\prime}$ whose edge weights are in $[0,1]$, such that our importance sampling may be interpreted as a uniform sampling from vertices in $G^{\prime}$. Eventually, our sampling bound would follow from Lemma 3.2.

Defining Auxiliary Graph

Since we focus on approximate solutions, we can assume that the $p_{x}$’s ($x\in X$) are of finite precision. Then, let $N$ be a sufficiently large number such that for all $x\in X$, $Np_{x}$ is an integer. We define an auxiliary graph $G^{\prime}(X^{\prime},E^{\prime}:=X^{\prime}\times X^{\prime})$, such that $X^{\prime}$ is formed by copying each point $x\in X$ exactly $Np_{x}$ times, and the edge weight is $\mathrm{len}_{G^{\prime}}(x,y):=\frac{1}{4\lambda^{2}Q}\cdot\frac{\operatorname{dist}(x,y)}{\hat{p}_{x}\hat{p}_{y}}$. Clearly, if we let $x^{*}$ be a uniform sample from $X^{\prime}$, then for every $x\in X$, $\Pr[x^{*}=x]=p_{x}$. Hence, this uniform sample $x^{*}$ is distributed identically to an importance sample from the distribution $\mathcal{D}$ on $X$. Furthermore, for $x,y\in S$, it holds that

\operatorname{dist}_{S}(x,y)=\frac{\operatorname{dist}(x,y)}{\hat{p}_{x}\hat{p}_{y}}=4\lambda^{2}Q\cdot\mathrm{len}_{G^{\prime}}(x,y).

Hence, we conclude the following fact.

Fact 3.3.

Let $S^{\prime}\subseteq X^{\prime}$ be $m$ uniform samples from $X^{\prime}$. Then, the values $4\lambda^{2}Q\cdot\textsc{Max-Cut}(S^{\prime})$ and $\textsc{Max-Cut}(S)$ are identically distributed.

Therefore, it suffices to show that the set $S^{\prime}$ from Fact 3.3 satisfies, with constant probability, that $4\lambda^{2}Q\cdot\frac{\textsc{Max-Cut}(S^{\prime})}{|S^{\prime}|^{2}}$ is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$. Our plan is to apply Lemma 3.2, but we first need to show, in Lemma 3.4, that the edge weights of $G^{\prime}$ are in $[0,1]$, and in Lemma 3.6, that $\textsc{Max-Cut}(X^{\prime})$ is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$ up to a scaling.

Lemma 3.4.

For all $x,y\in X^{\prime}$, $\mathrm{len}_{G^{\prime}}(x,y)\leq 1$.

Proof.

We need the following fact from [dlVK01] (which was proved in [dlVK01, Lemma 7]).

Lemma 3.5 ([dlVK01]).

For all $x,y\in X$, it holds that

\frac{\operatorname{dist}(x,y)}{q(x)q(y)}\leq\frac{4}{Q}.

Applying Lemma 3.5,

\mathrm{len}_{G^{\prime}}(x,y)=\frac{1}{4\lambda^{2}Q}\cdot\frac{\operatorname{dist}(x,y)}{\hat{p}_{x}\hat{p}_{y}}\leq\frac{1}{4\lambda^{2}Q}\cdot\frac{\operatorname{dist}(x,y)}{p_{x}p_{y}}\leq\frac{1}{4\lambda^{2}Q}\cdot\frac{\lambda^{2}Q^{2}\operatorname{dist}(x,y)}{q(x)q(y)}\leq\frac{1}{4\lambda^{2}Q}\cdot\lambda^{2}Q^{2}\cdot\frac{4}{Q}=1,

where the second inequality uses $p_{x}\geq\frac{q(x)}{\lambda Q}$ and $p_{y}\geq\frac{q(y)}{\lambda Q}$. ∎

Lemma 3.6.

\frac{4\lambda^{2}Q}{N^{2}}\textsc{Max-Cut}(X^{\prime})\in(1\pm\varepsilon)\cdot\textsc{Max-Cut}(X).

Proof.

Let $\widetilde{X}$ be a point-weighted set formed by re-weighting the points of $X$ with $w_{\widetilde{X}}(x)=Np_{x}$. The following lemma from [dlVK01] shows that $\textsc{Max-Cut}(X)=\textsc{Max-Cut}(\widetilde{X})$.

Lemma 3.7 ([dlVK01, Lemma 5 and Lemma 6]).

Let $(U,\operatorname{dist}_{U})$ be a metric space, and let $W\subseteq U$ be a dataset. Suppose that for every $x\in W$, $\mu_{x}>0$ is an integer weight. Then the point-weighted set $W^{\prime}$, obtained by re-weighting each point $x\in W$ by $\mu_{x}$, satisfies

\textsc{Max-Cut}(W^{\prime})=\textsc{Max-Cut}(W).

Now, we observe that $\widetilde{X}$ can be naturally interpreted as a weighted complete graph $\widetilde{G}$, where we copy each $x\in X$ exactly $w_{\widetilde{X}}(x)$ times to form the vertex set, and the edge length is defined as $\operatorname{dist}_{\widetilde{X}}(x,y)$. Notice that the vertex set of $\widetilde{G}$ is exactly $X^{\prime}$, and that the edge length

\operatorname{dist}_{\widetilde{X}}(x,y)=\frac{\operatorname{dist}(x,y)}{N^{2}p_{x}p_{y}}\in(1\pm\varepsilon)\cdot\frac{4\lambda^{2}Q}{N^{2}}\cdot\mathrm{len}_{G^{\prime}}(x,y).

Therefore, we conclude that $\textsc{Max-Cut}(X)=\textsc{Max-Cut}(\widetilde{X})\in(1\pm\varepsilon)\cdot\frac{4\lambda^{2}Q}{N^{2}}\cdot\textsc{Max-Cut}(X^{\prime})$. This finishes the proof. ∎

Now, we are ready to apply Lemma 3.2. Let $S^{\prime}\subseteq X^{\prime}$ with $|S^{\prime}|=O(\varepsilon^{-4})$ be the sample obtained by applying Lemma 3.2 with $G=G^{\prime}$ (recalling that the promise of $[0,1]$ edge weights is proved in Lemma 3.4). Then

\frac{1}{|S^{\prime}|^{2}}\textsc{Max-Cut}(S^{\prime})\in\frac{\textsc{Max-Cut}(X^{\prime})}{|X^{\prime}|^{2}}\pm\varepsilon.

Applying Lemma 3.6, and observing that $|X^{\prime}|=N$, the above is equivalent to

\frac{4\lambda^{2}Q}{|S^{\prime}|^{2}}\textsc{Max-Cut}(S^{\prime})\in(1\pm\varepsilon)\cdot\textsc{Max-Cut}(X)\pm\varepsilon\cdot 4\lambda^{2}Q\in(1\pm O(\lambda^{2}\varepsilon))\cdot\textsc{Max-Cut}(X),

where the last inequality follows from $\textsc{Max-Cut}(X)\geq\Omega(Q)$ (indeed, a uniformly random cut has expected value $Q/4$). We finish the proof by rescaling $\varepsilon$. ∎

4 Streaming Implementations

Theorem 4.1 (Streaming Euclidean Max-Cut).

There is a randomized streaming algorithm that, given $0<\varepsilon<1/2$, $p\geq 1$, integers $\Delta,d\geq 1$, and an input dataset $X\subseteq[\Delta]^{d}$ presented as a dynamic stream, uses space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ and reports an estimate $\eta>0$ that with high probability (at least $2/3$) is a $(1+\varepsilon)$-approximation to $\textsc{Max-Cut}(X)$ in $\ell_{p}$.

Our algorithm employs importance sampling as formulated in Theorem 3.1, and thus needs (enough) samples from a distribution $\mathcal{D}$ that corresponds to the input $X\subseteq[\Delta]^{d}$ with small parameter $\lambda>0$. Given these samples, the algorithm can estimate $\textsc{Max-Cut}(X)$ by a brute-force search on the (point-weighted) samples, which can be done using small space. Note that Theorem 3.1 works for a general metric space, hence it also applies to the $\ell_{p}$ case as we require. We thus focus henceforth on performing importance sampling from a dataset $X$ that is presented as a dynamic stream, as formalized next in Lemma 4.2.

Lemma 4.2 (Importance-Sampling Algorithm).

There is a randomized streaming algorithm $\mathcal{A}$ that, given $0<\varepsilon<1/2$, $p\geq 1$, integers $\Delta,d\geq 1$, and an input dataset $X\subseteq[\Delta]^{d}$ presented as a dynamic stream, uses space $\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ and reports $z^{*}\in X\cup\{\perp\}$ together with $p^{*}\in[0,1]$. The algorithm has a random initialization with success probability at least $0.99$ (it is convenient to separate the random coins of the algorithm into two groups, even though they can all be tossed before the stream starts; we refer to the coin tosses of the first group as an initialization step, and condition on their “success” when analyzing the second group of coins; the algorithm cannot tell whether its initialization was successful, and thus this event appears only in the analysis, in Lemma 4.4), and conditioned on a successful initialization, its random output satisfies: (1) with probability at least $1-1/\operatorname{poly}(\Delta^{d})$,

\forall x\in X,\qquad\Pr[z^{*}=x]\geq\frac{1}{\lambda}\frac{q(x)}{Q},

for $q(x):=\sum_{y\in X}\operatorname{dist}(x,y)$, $Q:=\sum_{x\in X}q(x)$, $\operatorname{dist}=\ell_{p}$, and $\lambda:=\operatorname{poly}(d\log\Delta)$; and (2) whenever $z^{*}\neq\perp$,

z^{*}=x\in X\quad\Longrightarrow\quad p^{*}\in(1\pm\varepsilon)\cdot\Pr[z^{*}=x].

The somewhat intricate statement of Lemma 4.2 is very useful to generate many samples with a large success probability. The obvious approach to generate $t$ samples is to run $t$ executions of this algorithm (all in parallel on the same stream) using independent coins, but then the success probability is only $0.99^{t}$. Consider instead running $t$ parallel executions, using the same initialization coins but otherwise independent coins, which requires total space $t\cdot\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$. Then with probability at least $0.99$ the initialization succeeds, in which case the $t$ executions produce $t$ independent samples, each of the form $(z^{*},p^{*})$ and satisfying the two guarantees in the lemma.

4.1 The Importance-Sampling Algorithm (Proof of Lemma 4.2)

Our plan is to implement the importance sampling on a tree metric generated by a randomized embedding of the input dataset. The notion of randomized tree embedding was first proposed in [Bar96] for arbitrary metric spaces, and the specific embedding that we employ was given by [Ind04] for $\ell_{p}$ metrics presented as a stream of points. We describe this tree embedding below. We stress that our algorithm can be easily implemented in low space because it does not need to compute the entire embedding explicitly; for instance, the algorithm’s initialization picks random coins, which determine the embedding but do not require any further computation.

Initialization Step: Randomized Tree Embedding [Bar96, CCG+98, Ind04]

Assume without loss of generality that $\Delta\geq 1$ is an integral power of $2$, and let $L:=1+d\log\Delta$. Let $\{\mathcal{G}_{i}\}_{i=0}^{L}$ be a recursive partitioning of the grid $[2\Delta]^{d}$ into squares (strictly speaking, these squares are actually hypercubes, sometimes called cells or grids, but we call them squares for intuition), as follows. Start with $\mathcal{G}_{0}$ being a trivial partitioning that has one part corresponding to the entire grid $[2\Delta]^{d}$, and for each $i\geq 0$, subdivide every square in $\mathcal{G}_{i}$ into $2^{d}$ squares of half the side-length, to obtain a partition $\mathcal{G}_{i+1}$ of the entire grid $[2\Delta]^{d}$. Thus, every $\mathcal{G}_{i}$ is a partition into squares of side-length $2^{i}$. The recursive partitioning $\{\mathcal{G}_{i}\}_{i}$ naturally defines a rooted tree $T$, whose nodes are the squares inside all the $\mathcal{G}_{i}$’s, that is often called a quadtree decomposition (even though every tree node has $2^{d}$ children rather than $4$). Finally, make the quadtree $T$ random by shifting the entire recursive partitioning by a vector $-v_{\textrm{shift}}$, where $v_{\textrm{shift}}$ is chosen uniformly at random from $[\Delta]^{d}$. (This is equivalent to shifting the dataset $[\Delta]^{d}$ by $v_{\textrm{shift}}$, which explains why we defined the recursive partitioning over an extended grid $[2\Delta]^{d}$.) Every node in $T$ has a level (or equivalently, depth), where the root is at level $1$, and the level of every other node is one bigger than that of its parent node. The scale of a tree node is the side-length of the corresponding square. Observe that leaves of $T$ have scale $2^{0}$ and thus correspond to squares that contain a single grid point; moreover, points $x\in[\Delta]^{d}$ correspond to distinct leaves in $T$. Define the weight of an edge in $T$ between a node $u$ at scale $2^{i}$ and its parent as $d^{\frac{1}{p}}\cdot 2^{i}$ (i.e., the diameter of $u$’s square). Define a tree embedding of $[\Delta]^{d}$ by mapping every point $x\in[\Delta]^{d}$ to its corresponding leaf in $T$, and let the tree distance between two points $x,y\in[\Delta]^{d}$, denoted $\operatorname{dist}_{T}(x,y)$, be the distance in $T$ between their corresponding leaves. The following lemma bounds the distortion of this randomized tree embedding. We remark that a better distortion of $O(d^{\max\{\frac{1}{p},1-\frac{1}{p}\}}\log\Delta)$ may be obtained via a different technique that is less suitable for streaming [CCG+98].

Lemma 4.3 ([Ind04, Fact 1]).

Let $T$ be a randomized tree as above. Then for all $x,y\in[\Delta]^{d}$,

\operatorname{dist}_{T}(x,y)\geq\operatorname{dist}(x,y);
\operatorname*{\mathbb{E}}[\operatorname{dist}_{T}(x,y)]\leq O\left(d\log\Delta\right)\cdot\operatorname{dist}(x,y).

Streaming Implementation of Randomized Tree Embedding

We emphasize that our definition of the quadtree $T$ is non-standard, as it contains the entire grid $[\Delta]^{d}$ as leaves (the standard approach is to recursively partition only squares that contain at least one point from the dataset $X$). The advantage of our approach is that the tree is defined obliviously of the dataset $X$ (e.g., of updates to $X$). In particular, the leaf-to-root path from a point $x\in[\Delta]^{d}$ is well-defined regardless of $X$ and can be computed on-the-fly (without constructing the entire tree $T$) using time and space $\operatorname{poly}(d\log\Delta)$, providing sufficient information for evaluating the tree distance.

Our streaming algorithm samples such a tree $T$ as an initialization step, i.e., before the stream starts, which requires small space because it can be done implicitly by picking $\operatorname{poly}(d\log\Delta)$ random bits that describe the random shift vector $v_{\textrm{shift}}$. Next, we show in Lemma 4.4 that this initialization step succeeds with probability $0.99$, and on success, every distance $\operatorname{dist}(x,y)$ for $x,y\in X$ is well-approximated by its corresponding $\operatorname{dist}_{T}(x,y)$. In this case, the sampling of points $x$ with probability proportional to $q(x)$ can be replaced by sampling with probabilities that are derived from the tree metric. More specifically, the probability of sampling each $x\in X$ deviates from the desired probability $\frac{q(x)}{Q}$ by at most a factor of $\operatorname{poly}(d\log\Delta)$. We remark that the event of success does depend on the input $X$, but the algorithm does not need to know whether the initialization succeeded.

Lemma 4.4.

For $x\in X$, let $q_{T}(x):=\sum_{y\in X}\operatorname{dist}_{T}(x,y)$ and let $Q_{T}:=\sum_{x\in X}q_{T}(x)$. Then

\Pr_{T}\left[\forall x\in X,\ \frac{q_{T}(x)}{Q_{T}}\geq\frac{1}{O\left(d\log\Delta\right)}\frac{q(x)}{Q}\right]\geq 0.99.
Proof.

Fix some $x\in X$. By Lemma 4.3,

q_{T}(x)=\sum_{y\in X}\operatorname{dist}_{T}(x,y)\geq\sum_{y\in X}\operatorname{dist}(x,y)=q(x) \qquad (1)

and

\operatorname*{\mathbb{E}}\left[\sum_{y\in X}q_{T}(y)\right]=\operatorname*{\mathbb{E}}\left[\sum_{y,y^{\prime}\in X}\operatorname{dist}_{T}(y,y^{\prime})\right]\leq O\left(d\log\Delta\right)\sum_{y,y^{\prime}\in X}\operatorname{dist}(y,y^{\prime}).

By Markov’s inequality, with high constant probability,

\sum_{y\in X}q_{T}(y)\leq O\left(d\log\Delta\right)\sum_{y,y^{\prime}\in X}\operatorname{dist}(y,y^{\prime}). \qquad (2)

We finish the proof by combining (2) and (1). ∎

Sampling w.r.t. Tree Distance

In the remainder of the proof, we assume that the random tree $T$ was already picked, and condition on its success as formulated in Lemma 4.4. This lemma shows that it actually suffices to sample each $x$ with probability proportional to $q_{T}(x)$. Next, we provide in Fact 4.5 a different formula for $q_{T}(x)$ that is based on $x$’s ancestors in the tree $T$, namely, on counting how many data points (i.e., from $X$) are contained in the squares that correspond to these ancestors. To this end, we need to set up some basic notation regarding $X$ and $T$.

The Input $X$ in the Tree $T$

Let $n:=|X|$ be the number of input points at the end of the stream. For a tree node $v\in T$, let $X(v)\subseteq X$ be the set of points from $X$ that are contained in the square corresponding to $v$. For $x\in X$ and $i\geq 1$, let $\operatorname{anc}_{i}(x)$ be the level-$i$ ancestor of $x$ in $T$ (recalling that $x$ corresponds to a leaf). By definition, $\operatorname{anc}_{L+1}(x):=x$. For $0\leq i\leq L$, let $\beta_{i}:=d^{\frac{1}{p}}\cdot 2^{L+1-i}$, which is the edge-length between a level-$i$ node $u$ and its parent (since the scale of a level-$i$ node is $2^{L+1-i}$). Due to the tree structure, we have the following representation of $q_{T}(x)$.

Fact 4.5.

For every $x\in X$, we have $q_{T}(x)=2\sum_{i=0}^{L}\beta_{i}\cdot(n-|X(\operatorname{anc}_{i}(x))|)$.
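
A small sketch of this formula: the ancestor counts $|X(\operatorname{anc}_{i}(x))|$ are obtained by bucketing the shifted points into level-$i$ squares, and the sum is then a direct transcription of Fact 4.5 (the level and $\beta_{i}$ conventions are taken verbatim from the text; $0$-based coordinates are a simplifying assumption of the sketch).

```python
def ancestor_count(X, x, shift, side):
    """|X(v)| for the square v of side-length `side` that contains the shifted point x."""
    cell = tuple((a + s) // side for a, s in zip(x, shift))
    return sum(1 for y in X
               if tuple((b + s) // side for b, s in zip(y, shift)) == cell)

def q_T_of(X, x, shift, L, d, p=2.0):
    """q_T(x) via Fact 4.5: 2 * sum_i beta_i * (n - |X(anc_i(x))|),
    with beta_i = d^(1/p) * 2^(L+1-i) and a level-i square of side 2^(L+1-i)."""
    n = len(X)
    total = 0.0
    for i in range(0, L + 1):
        # Terms with very small i vanish, since those squares contain all of X.
        side = 2 ** (L + 1 - i)
        beta_i = (d ** (1.0 / p)) * side
        total += beta_i * (n - ancestor_count(X, x, shift, side))
    return 2.0 * total
```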

For each level $i$, let $h_{i}$ be a level-$i$ node whose corresponding square contains the most points from $X$, breaking ties arbitrarily. Next, we wish to identify a critical level $k$; ideally, this is the last level going down from the root, i.e., largest $i$, such that $|X(h_{i})|\geq 0.6n$ (the constant $0.6$ is somewhat arbitrary). However, it is difficult to find this $k$ exactly in a streaming algorithm, and thus we use instead a level $\tilde{k}$ that satisfies a relaxed guarantee that only requires estimates on different $|X(h_{i})|$, as follows. Let us fix henceforth two constants $0.5<\sigma^{-}\leq\sigma^{+}\leq 1$.

Definition 4.6 (Critical Level).

Level $1\leq\tilde{k}<L+1$ is called $(\sigma^{-},\sigma^{+})$-critical, if $|X(h_{\tilde{k}})|\geq\sigma^{-}n$ and $|X(h_{\tilde{k}+1})|\leq\sigma^{+}n$.

Suppose henceforth that $\tilde{k}$ is a $(\sigma^{-},\sigma^{+})$-critical level. (Such a critical level clearly exists, although its value need not be unique.) Since $|X(h_{i})|\geq|X(h_{i+1})|$ for every $i<\tilde{k}$ (because $h_{i}$ contains the most points from $X$ at level $i$), we know that $|X(h_{i})|\geq\sigma^{-}n$ for every $i\leq\tilde{k}$ (not only for $i=\tilde{k}$), and $|X(h_{i})|\leq\sigma^{+}n$ for every $i>\tilde{k}$.

Fact 4.7.

Each $h_{i}$ is the parent of $h_{i+1}$ for $1\leq i\leq\tilde{k}-1$, hence $(h_{1},\ldots,h_{\tilde{k}})$ forms a path from the root of $T$.

Next, we further “simplify” the representation of $q_{T}(x)$, by introducing an approximate version of it that requires even less information about $x$. Specifically, we introduce in Definition 4.8 a sequence of $O(L)$ values that are independent of $x$, namely, one value $\tilde{q}_{i}$ for each level $i\leq\tilde{k}$, and then we show in Lemma 4.9 that for every $x\in X$, we can approximate $q_{T}(x)$ by one of these $O(L)$ values, namely, by $\tilde{q}_{i}$ for a suitable level $i=\ell(x)$.

Definition 4.8 (Estimator for $q_{T}$).

For $1\leq i\leq\tilde{k}$, define

\tilde{q}_{i}:=n\beta_{i}+\sum_{j\leq i}\beta_{j}\cdot(n-|X(h_{j})|).
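
A direct transcription of this definition (a sketch; here `heavy_count[j]` stands for $|X(h_{j})|$ and `beta[j]` for $\beta_{j}$, both indexed by the level $j$ and assumed to be precomputed, e.g., by bucketing the shifted points as above):

```python
def q_tilde(i, n, beta, heavy_count):
    """q_tilde_i = n * beta_i + sum_{j <= i} beta_j * (n - |X(h_j)|)  (Definition 4.8)."""
    return n * beta[i] + sum(beta[j] * (n - heavy_count[j]) for j in range(1, i + 1))
```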

Relating $q_{T}$ and $\tilde{q}$

For $x\in X$, let $\ell(x)$ be the maximum level $1\leq j\leq\tilde{k}$ such that $\operatorname{anc}_{j}(x)=h_{j}$. This is well-defined, because $j=1$ always satisfies that $\operatorname{anc}_{j}(x)=h_{j}$. The next lemma shows that $q_{T}(x)$ can be approximated by $\tilde{q}_{i}$ for $i=\ell(x)$.

Lemma 4.9.

Let $\tilde{k}$ be a $(\sigma^{-},\sigma^{+})$-critical level. Then

\forall x\in X,\qquad\tilde{q}_{\ell(x)}=\Theta(1)\cdot q_{T}(x).
Proof.
\frac{1}{2}q_{T}(x)=\sum_{i=0}^{L}\beta_{i}\cdot(n-|X(\operatorname{anc}_{i}(x))|)
=\sum_{i\leq\ell(x)}\beta_{i}\cdot(n-|X(h_{i})|)+\sum_{i>\ell(x)}\beta_{i}\cdot(n-|X(\operatorname{anc}_{i}(x))|) \qquad (3)
\in\sum_{i\leq\ell(x)}\beta_{i}\cdot(n-|X(h_{i})|)+[\min\{\sigma^{-},1-\sigma^{+}\},1]\cdot n\sum_{i>\ell(x)}\beta_{i} \qquad (4)
\in\sum_{i\leq\ell(x)}\beta_{i}\cdot(n-|X(h_{i})|)+[\min\{\sigma^{-},1-\sigma^{+}\},1]\cdot n\beta_{\ell(x)}
\in[\min\{\sigma^{-},1-\sigma^{+}\},1]\cdot\tilde{q}_{\ell(x)}.

In the above, (3) follows from the fact that $\operatorname{anc}_{i}(x)=h_{i}$ for $i\leq\ell(x)$ (by the definition of $\ell(x)$ and the property that $(h_{1},\ldots,h_{\tilde{k}})$ forms a path, from Fact 4.7). (4) follows from the definition of a $(\sigma^{-},\sigma^{+})$-critical level and the definition of $\ell$. ∎

The next fact shows that the sequence $\tilde{q}_{1},\ldots,\tilde{q}_{\tilde{k}}$ is non-increasing.

Fact 4.10.

$\tilde{q}_{1}=\beta_{1}n$, and for every $2\leq i\leq\tilde{k}$, we have $\tilde{q}_{i}\leq\tilde{q}_{i-1}$.

Proof.

The fact for $i=1$ is immediate (recall that $X(h_{1})=X$, as $h_{1}$ is the root). Now consider $i\geq 2$. Using that $\beta_{i-1}=2\beta_{i}$, we have

\tilde{q}_{i-1}-\tilde{q}_{i}=n(\beta_{i-1}-\beta_{i})-\beta_{i}\cdot(n-|X(h_{i})|)=\beta_{i}\cdot|X(h_{i})|\geq 0,

which verifies the fact. ∎

Alternative Sampling Procedure

Recall that level $\tilde{k}$ is assumed to be $(\sigma^{-},\sigma^{+})$-critical for fixed constants $0.5<\sigma^{-}\leq\sigma^{+}\leq 1$. We plan to sample $x\in X$ with probability proportional to $\tilde{q}_{\ell(x)}$, and by Lemma 4.9 this only loses an $O(1)$ factor in the bound $\lambda$ needed for importance sampling (as in Lemma 4.2). For $1\leq i\leq\tilde{k}$, define $X_{i}:=\{x\in X\mid\ell(x)=i\}$. Notice that $\{X_{i}\}_{i=1}^{\tilde{k}}$ forms a partition of $X$, and

X_{i}=\begin{cases}X(h_{i})\setminus X(h_{i+1})&\text{if $1\leq i\leq\tilde{k}-1$;}\\ X(h_{\tilde{k}})&\text{if $i=\tilde{k}$.}\end{cases} \qquad (5)

By definition, points in the same $X_{i}$ have the same $\tilde{q}_{\ell(x)}$, and thus also the same sampling probability. A natural approach to sampling a point from $X$ with the desired probabilities is to first pick a random $i\in[\tilde{k}]$ (non-uniformly) and then sample uniformly a point from that $X_{i}$. But unfortunately, it is impossible to sample uniformly from $X_{i}$ in streaming (this is justified in A.2), and thus we shall sample instead from an “extended” set $X^{\mathrm{ext}}_{i}\supseteq X_{i}$, defined as follows.

X^{\mathrm{ext}}_{i}:=\begin{cases}X\setminus X(h_{i+1})&\text{if $1\leq i\leq\tilde{k}-1$;}\\ X(h_{\tilde{k}})&\text{if $i=\tilde{k}$.}\end{cases} \qquad (6)

The path structure of $\{h_{i}\}_{i}$ (Fact 4.7) implies the following.

Fact 4.11.

For every $1\leq i<\tilde{k}$, we have $X^{\mathrm{ext}}_{i}=X_{1}\cup\ldots\cup X_{i}$.

We describe in Algorithm 1 a procedure for sampling $x\in X$ with probability proportional to $\tilde{q}_{\ell(x)}$, based on the above approach of picking a random $i\in[\tilde{k}]$ (from a suitable distribution) and then sampling uniformly a point from that $X^{\mathrm{ext}}_{i}$. We then prove in Lemma 4.13 that this procedure samples from $X$ with probabilities proportional to $\tilde{q}_{\ell(x)}$, up to an $O(L)$ factor.

Remark 4.12.

Sampling from the extended sets ($X^{\mathrm{ext}}_{i}$ instead of $X_{i}$) can significantly bias the sampling probabilities, because the “contribution” of a point $x\in X$ can increase by an unbounded factor. On the one hand, this can increase the sampling probability of that $x$, which is not a problem at all. On the other hand, it might increase the total contribution of all points (and thus decrease some individual sampling probabilities), but our analysis shows that this effect is bounded by an $O(L)$ factor. The intuition here is that $q(x)$ represents the sum of distances from $x$ to all other points $y\in X$, and we can rearrange the total $\sum_{x}q(x)$ by the “other” point $y\in X$, and the crux now is that the contribution of each $y\in X$ increases by at most an $O(L)$ factor.

Algorithm 1 Alternative sampling procedure (offline)
1:draw a random ii^{*} where each 1ik~1\leq i\leq\tilde{k} is picked with probability ri:=|Xiext|q~ij=1k~|Xjext|q~jr_{i}:=\frac{|X^{\mathrm{ext}}_{i}|\tilde{q}_{i}}{\sum_{j=1}^{\tilde{k}}|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}}
2:draw x\in X^{\mathrm{ext}}_{i^{*}} uniformly at random
3:return z=xz^{*}=x as the sample, together with p=i=(x)k~ri|Xiext|p^{*}=\sum_{i=\ell(x)}^{\tilde{k}}\frac{r_{i}}{|X^{\mathrm{ext}}_{i}|} as its sampling probability
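For concreteness, the following Python sketch mirrors Algorithm 1 in an offline setting; it assumes the sets X^{\mathrm{ext}}_{i}, the values \tilde{q}_{i}, and the level function \ell(\cdot) are given explicitly (in the streaming implementation described below, these are only available approximately).

```python
import random

def alternative_sampling(X_ext, q_tilde, level_of):
    """Offline sketch of Algorithm 1 (levels are 0-indexed here).

    X_ext:    list of the extended sets X^ext_1, ..., X^ext_k, each a list of points
    q_tilde:  list of the values q~_1, ..., q~_k (non-increasing, by Fact 4.10)
    level_of: function returning the (0-indexed) level ell(x) of a point x
    Returns the sample z* together with its reported sampling probability p*.
    """
    k = len(X_ext)
    # Line 1: pick a level i* with probability r_i proportional to |X^ext_i| * q~_i.
    weights = [len(X_ext[i]) * q_tilde[i] for i in range(k)]
    total = sum(weights)
    r = [w / total for w in weights]
    i_star = random.choices(range(k), weights=r)[0]
    # Line 2: pick a uniformly random point from the chosen extended set.
    z_star = random.choice(X_ext[i_star])
    # Line 3: report p*, summing over all levels i >= ell(z*), as in the algorithm.
    ell = level_of(z_star)
    p_star = sum(r[i] / len(X_ext[i]) for i in range(ell, k))
    return z_star, p_star
```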
Lemma 4.13.

Algorithm 1 samples every x\in X with probability \Pr[z^{*}=x]=\sum_{i=\ell(x)}^{\tilde{k}}\frac{r_{i}}{|X^{\mathrm{ext}}_{i}|}, exactly as line 3 reports in p^{*}, and furthermore this probability is bounded from below by \Pr[z^{*}=x]\geq\Omega\left(\frac{1}{L}\right)\cdot\frac{\tilde{q}_{\ell(x)}}{\sum_{y\in X}\tilde{q}_{\ell(y)}}.

Proof.

Observe that xX(x)x\in X_{\ell(x)}, and by 4.11, this point xx can only be sampled for i(x)i\geq\ell(x). Therefore, Pr[z=x]=i=(x)k~ri|Xiext|=p\Pr[z^{*}=x]=\sum_{i=\ell(x)}^{\tilde{k}}\frac{r_{i}}{|X^{\mathrm{ext}}_{i}|}=p^{*}. We bound this probability by

Pr[z=x]=i(x)ri|Xiext|=i(x)q~ij=1k~|Xjext|q~j=i(x)q~ij=1k~|Xjext|q~jq~(x)j=1k~|Xjext|q~j.\Pr[z^{*}=x]=\sum_{i\geq\ell(x)}\frac{r_{i}}{|X^{\mathrm{ext}}_{i}|}=\sum_{i\geq\ell(x)}\frac{\tilde{q}_{i}}{\sum_{j=1}^{\tilde{k}}{|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}}}=\frac{\sum_{i\geq\ell(x)}\tilde{q}_{i}}{\sum_{j=1}^{\tilde{k}}{|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}}}\geq\frac{\tilde{q}_{\ell(x)}}{\sum_{j=1}^{\tilde{k}}{|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}}}. (7)

Next, to bound the denominator j=1k~|Xjext|q~j{\sum_{j=1}^{\tilde{k}}{|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}}}, observe that |Xjext|=i=1j|Xi||X^{\mathrm{ext}}_{j}|=\sum_{i=1}^{j}|X_{i}| for all j<k~j<\tilde{k} (by 4.11), and therefore

j=1k~|Xjext|q~j\displaystyle\sum_{j=1}^{\tilde{k}}{|X^{\mathrm{ext}}_{j}|\tilde{q}_{j}} =|Xk~|q~k~+j=1k~1i=1j|Xi|q~j=|Xk~|q~k~+i=1k~1j=ik~1|Xi|q~j|Xk~|q~k~+k~i=1k~1|Xi|q~i\displaystyle=|X_{\tilde{k}}|\tilde{q}_{\tilde{k}}+\sum_{j=1}^{\tilde{k}-1}\sum_{i=1}^{j}|X_{i}|\tilde{q}_{j}=|X_{\tilde{k}}|\tilde{q}_{\tilde{k}}+\sum_{i=1}^{\tilde{k}-1}\sum_{j=i}^{\tilde{k}-1}|X_{i}|\tilde{q}_{j}\leq|X_{\tilde{k}}|\tilde{q}_{\tilde{k}}+\tilde{k}\cdot\sum_{i=1}^{\tilde{k}-1}|X_{i}|\tilde{q}_{i}
(L+1)i=1k~|Xi|q~i=(L+1)xXq~(x),\displaystyle\leq(L+1)\sum_{i=1}^{\tilde{k}}{|X_{i}|\tilde{q}_{i}}=(L+1)\sum_{x\in X}{\tilde{q}_{\ell(x)}},

where the first inequality is by the monotonicity of q~i\tilde{q}_{i}’s (4.10). Combining this with (7), the lemma follows. ∎

Implementing Algorithm 1 in Streaming

To implement Algorithm 1 in streaming, we first need a streaming algorithm that finds a critical level k~\tilde{k} using space O(poly(dlogΔ))O(\operatorname{poly}(d\log\Delta)). We discuss this next.

Finding k~\tilde{k}

For each level i, we draw a set S_{i} of \operatorname{poly}(d\log\Delta) uniform samples from X. We then count the number of samples that lie in each tree node (square) at level i, and let m_{i} be the maximum count. We let \tilde{k} be the largest level i such that \frac{m_{i}}{|S_{i}|}\geq 0.6. By a standard application of the Chernoff bound, with probability at least 1-1/\operatorname{poly}(\Delta^{d}), this level \tilde{k} is (0.55,0.65)-critical. Moreover, this process can be implemented in streaming using space \operatorname{poly}(d\log\Delta), by maintaining, for each level i, only |S_{i}|=\operatorname{poly}(d\log\Delta) independent \ell_{0}-samplers (Lemma 2.2) on the domain [\Delta]^{d}. A similar approach can be used to (1+\varepsilon)-approximate the size of X(h_{i}) for every i\leq\tilde{k}, and also to sample uniformly from these sets, using space O(\operatorname{poly}(\varepsilon^{-1}d\log\Delta)) and with failure probability 1/\operatorname{poly}(\Delta^{d}) (by using Lemmas 2.1 and 2.2).
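A minimal offline Python sketch of this estimation is given below; the helper `level_node_of(x, i)`, which returns the level-i tree node containing x, is hypothetical, and in the streaming implementation the uniform samples would be produced by \ell_{0}-samplers instead.

```python
import random
from collections import Counter

def find_critical_level(X, num_levels, level_node_of, num_samples):
    """Sketch: return the largest level i whose heaviest node covers at least a
    0.6 fraction of a uniform sample of X; w.h.p. such a level is (0.55, 0.65)-critical."""
    k_tilde = None
    for i in range(1, num_levels + 1):
        S_i = [random.choice(X) for _ in range(num_samples)]   # |S_i| = poly(d log Delta)
        counts = Counter(level_node_of(x, i) for x in S_i)     # samples per level-i node
        m_i = max(counts.values())
        if m_i / len(S_i) >= 0.6:
            k_tilde = i                                        # keep the largest such level
    return k_tilde
```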

Estimating and Sampling from XX(hi)X\setminus X(h_{i})

We also need to estimate \tilde{q}_{i} and |X^{\mathrm{ext}}_{i}|, and to sample uniformly at random from X^{\mathrm{ext}}_{i}, for every i\leq\tilde{k}. The case i=\tilde{k} was already discussed, because X^{\mathrm{ext}}_{\tilde{k}}=X(h_{\tilde{k}}). It remains to consider i<\tilde{k}, in which case we need to (1\pm\varepsilon)-approximate the size of X\setminus X(h_{i}), and also to sample uniformly at random from that set, and we can assume that |X(h_{i})|>0.5n. We provide such a streaming algorithm in Lemma 4.14 below, which we prove in Section 4.2. This lemma is stated in a more general form that may be of independent interest, where the input is a frequency vector x\in\mathbb{R}^{N} (i.e., a stream of insertions and deletions of items from domain [N]) and access to a function \mathcal{P}:[N]\to[N^{\prime}], for N^{\prime}\leq N, that can be viewed as a partition of the domain into N^{\prime} parts. In our intended application, the domain [N] will be the grid [\Delta]^{d}, and the partition \mathcal{P} will be its partition into squares of a given level i; observe that it is easy to implement \mathcal{P} as a function that maps each grid point to its level-i square. Roughly speaking, the streaming algorithm in Lemma 4.14 samples uniformly from the support set \operatorname{supp}(x)=\{i\in[N]:x_{i}\neq 0\}, but excluding indices that lie in the part of \mathcal{P} that is heaviest, i.e., has the most nonzero indices, assuming that part is sufficiently heavy. In our intended application, this method samples uniformly from the input X\subset[\Delta]^{d} but excluding points that lie in the heaviest square (the level-i square with the largest number of input points), i.e., uniformly from X\setminus X(h_{i}).

Lemma 4.14 (Sampling from Light Parts).

There exists a streaming algorithm that, given 0<\varepsilon,\delta,\sigma<0.5, integers N,N^{\prime},M\geq 1, a mapping \mathcal{P}:[N]\to[N^{\prime}], and a frequency vector x\in[-M,M]^{N} presented as a stream of additive updates, uses space O(\operatorname{poly}(\varepsilon^{-1}\sigma^{-1}\log(\delta^{-1}MN))), and reports a sample i^{*}\in[N]\cup\{\perp\} and a value r^{*}\geq 0. Let X:=\{i\in[N]\mid x_{i}\neq 0\} be the support of x, and let j_{\max}:=\arg\max_{j\in[N^{\prime}]}|\mathcal{P}^{-1}(j)\cap X| index the heaviest part of \mathcal{P} with respect to X. If X_{\mathrm{heavy}}:=\mathcal{P}^{-1}(j_{\max})\cap X satisfies |X_{\mathrm{heavy}}|\geq(0.5+\sigma)|X|, then with probability at least 1-\delta,

  • r(1±ε)|Xlight|r^{*}\in(1\pm\varepsilon)\cdot|X_{\mathrm{light}}| where Xlight:=XXheavyX_{\mathrm{light}}:=X\setminus X_{\mathrm{heavy}}, and

  • unless X_{\mathrm{light}} is empty, i^{*}\in X_{\mathrm{light}}, and moreover for every i\in X_{\mathrm{light}} it holds that \Pr[i^{*}=i]=\frac{1}{|X_{\mathrm{light}}|}.

In our application, we will apply Lemma 4.14 in parallel for every level i, with N=\Delta^{d}, i.e., the items being inserted and deleted are points in [\Delta]^{d}, and a mapping \mathcal{P} defined by the level-i squares (tree nodes), i.e., for x\in[\Delta]^{d} we define \mathcal{P}(x) as the level-i node that contains x. We set the failure probability to \delta=1/\operatorname{poly}(\Delta^{d}) and fix \sigma=0.05. This way, conditioned on the success of Lemma 4.14, we can estimate \tilde{q}_{i} and |X^{\mathrm{ext}}_{i}| within a (1\pm\varepsilon) factor, and sample uniformly from X^{\mathrm{ext}}_{i}.
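For illustration, a possible implementation of the mapping \mathcal{P} at level i is sketched below; it assumes (hypothetically) that level-i cells have side length 2^{\log\Delta-i}, and it ignores the random shift of the quadtree, which the actual algorithm applies to the coordinates first.

```python
def level_cell(point, i, log_delta):
    """Map a grid point in [Delta]^d to an identifier of its level-i quadtree cell.
    Assumes level-i cells have side length 2^(log_delta - i); the random shift used
    by the actual algorithm is omitted here for simplicity."""
    side = 1 << (log_delta - i)
    return tuple(coord // side for coord in point)

# Example: two points of [16]^2 in the same level-1 cell, but in different level-2 cells.
p, q = (3, 5), (6, 2)
assert level_cell(p, 1, 4) == level_cell(q, 1, 4) == (0, 0)
assert level_cell(p, 2, 4) != level_cell(q, 2, 4)
```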

Concluding Lemma 4.2

In conclusion, our streaming algorithm starts by sampling a randomly-shifted quadtree T, which implicitly defines a tree embedding. Conditioned on the success of this step, specifically of Lemma 4.4 (which holds with probability 0.99), we run the streaming implementation of Algorithm 1 outlined above, and return the resulting z^{*} and p^{*}. The error bounds on z^{*} and p^{*}, and the bound \lambda=O(\operatorname{poly}(d\log\Delta)), follow from Lemma 4.4 and Lemma 4.13, together with the additional error and failure probability introduced by the streaming implementation, which were bounded in the previous paragraphs. This finishes the proof of Lemma 4.2.

4.2 Sampling from The Light Parts (Proof of Lemma 4.14)

An Offline Algorithm

Notice that \{\mathcal{P}^{-1}(j)\}_{j\in[N^{\prime}]} defines a partition of [N]. In our proof, we interpret \mathcal{P}=\{P_{i}\}_{i} as such a partition. Let P_{\max}:=\mathcal{P}^{-1}(j_{\max}) be the part of \mathcal{P} that contains the most elements of X, so X_{\mathrm{heavy}}=P_{\max}\cap X. We start with an offline algorithm, summarized in Algorithm 2.

Algorithm 2 Sampling and estimating from the light part (offline)
1:let u2u\leftarrow 2, sΘ(log(Nδ1))s\leftarrow\Theta(\log(N\delta^{-1}))
2:let {h1,,hs}\mathcal{H}\leftarrow\{h_{1},\ldots,h_{s}\} be a collection of independent random hash functions, where each hh\in\mathcal{H} (h:𝒫[u]h:\mathcal{P}\to[u]) satisfies PP\forall P\neq P^{\prime}, Pr[h(P)=h(P)]1/u\Pr[h(P)=h(P^{\prime})]\leq 1/u
3:for t[s]t\in[s] do
4:     for j[u]j\in[u], let Bj(P𝒫:ht(P)=jP)XB_{j}\leftarrow\left(\bigcup_{P\in\mathcal{P}:h_{t}(P)=j}P\right)\cap X
5:     let jargmaxj|Bj|j^{*}\leftarrow\arg\max_{j}|B_{j}|
6:     let DtXP𝒫:ht(P)=jPD_{t}\leftarrow X\setminus\bigcup_{P\in\mathcal{P}:h_{t}(P)=j^{*}}{P}
7:end for
8:compute Dallt[s]DtD_{\mathrm{all}}\leftarrow\bigcup_{t\in[s]}D_{t}
9:return a uniform sample iDalli^{*}\in D_{\mathrm{all}}, and report r:=|Dall|r^{*}:=|D_{\mathrm{all}}| as the estimate for |Xlight||X_{\mathrm{light}}|

In the algorithm, we consider a set of s=Θ(log(Nδ1))s=\Theta(\log(N\delta^{-1})) random hash functions h1,,hsh_{1},\ldots,h_{s} that randomly map each part in 𝒫\mathcal{P} to one of u=2u=2 buckets (as in line 2).

Then, consider some h_{t} for t\in[s]. Let B_{j} (for j\in[u]) be the set of elements from all parts that are mapped by h_{t} to bucket j (line 4). We find j^{*}, the bucket that contains the most elements of X (line 5). Since we assume |X_{\mathrm{heavy}}|\geq(0.5+\sigma)|X|>0.5|X|, the bucket h_{t}(P_{\max}) contains more than 0.5|X| elements from X (recalling that P_{\max}=\mathcal{P}^{-1}(j_{\max}) is the part that contains the most elements of X), and this implies that h_{t}(P_{\max}) must be the bucket that contains the most elements from X. Hence,

j=ht(Pmax).j^{*}=h_{t}(P_{\max}). (8)

Next, we drop the elements that lie in the bucket jj^{*}, and take the remaining elements, as DtD_{t} (in line 6). While DtD_{t} certainly does not contain any element from XheavyX_{\mathrm{heavy}} (by (8) and the definition of DtD_{t}’s), DtD_{t} is only a subset of XlightX_{\mathrm{light}}. Hence, we take the union of all DtD_{t}’s (over t[s]t\in[s]), denoted as DallD_{\mathrm{all}} (in line 8), which equals XlightX_{\mathrm{light}} with high probability.
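As a concrete illustration, here is a minimal offline Python sketch of Algorithm 2; truly random hash values are drawn lazily in place of the limited-independence hash families used by the streaming version, and `part_of` is a hypothetical stand-in for the mapping \mathcal{P}.

```python
import math
import random

def sample_light_part(X, part_of, delta):
    """Offline sketch of Algorithm 2: estimate |X_light| and sample a uniform element of it.

    X:        the support of the frequency vector (a list of distinct items)
    part_of:  maps an item to (an identifier of) its part of the partition P
    delta:    target failure probability
    """
    u = 2
    s = math.ceil(math.log2(max(len(X), 2) / delta)) + 1   # s = Theta(log(N / delta))
    D_all = set()
    for _ in range(s):
        h = {}                                   # a fresh random hash function P -> [u]
        def h_t(p):
            if p not in h:
                h[p] = random.randrange(u)
            return h[p]
        # Line 4: bucket the items of X according to the hash of their part.
        buckets = {j: [i for i in X if h_t(part_of(i)) == j] for j in range(u)}
        # Line 5: the heaviest bucket; it must contain X_heavy when |X_heavy| > 0.5|X|.
        j_star = max(buckets, key=lambda j: len(buckets[j]))
        # Line 6: keep everything outside the heaviest bucket.
        D_all.update(i for i in X if h_t(part_of(i)) != j_star)
    i_star = random.choice(list(D_all)) if D_all else None
    return i_star, len(D_all)                    # the sample and the estimate of |X_light|
```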

Analysis of DallD_{\mathrm{all}}

For every iXlighti\in X_{\mathrm{light}}, every t[s]t\in[s],

Pr[iDt]=Pr[ht(Pi)=ht(Pmax)]1u=12,\Pr[i\notin D_{t}]=\Pr[h_{t}(P_{i})=h_{t}(P_{\max})]\leq\frac{1}{u}=\frac{1}{2},

where Pi𝒫P_{i}\in\mathcal{P} is the part that ii belongs to. Therefore, by the independence of hth_{t}’s, we know for every iXlighti\in X_{\mathrm{light}},

Pr[iDall]=Pr[t[s],iDt]12s=δpoly(N).\Pr[i\notin D_{\mathrm{all}}]=\Pr[\forall t\in[s],i\notin D_{t}]\leq\frac{1}{2^{s}}=\frac{\delta}{\operatorname{poly}(N)}.

Taking a union bound over iXlighti\in X_{\mathrm{light}}, we have

Pr[iXlight,iDall]δpoly(N)|Xlight|δ.\Pr[\exists i\in X_{\mathrm{light}},i\notin D_{\mathrm{all}}]\leq\frac{\delta}{\operatorname{poly}(N)}|X_{\mathrm{light}}|\leq\delta.

Hence, we conclude that

Pr[Dall=Xlight]1δ.\Pr[D_{\mathrm{all}}=X_{\mathrm{light}}]\geq 1-\delta.

Conditioning on the event that Dall=XlightD_{\mathrm{all}}=X_{\mathrm{light}}, we conclude that r=|Dall|=|Xlight|r^{*}=|D_{\mathrm{all}}|=|X_{\mathrm{light}}|, and that iXlight\forall i\in X_{\mathrm{light}}, Pr[i=i]=1|Dall|=1|Xlight|\Pr[i^{*}=i]=\frac{1}{|D_{\mathrm{all}}|}=\frac{1}{|X_{\mathrm{light}}|}.

Streaming Algorithm

It remains to give a streaming implementation of Algorithm 2. Before the stream starts, we initialize several streaming data structures. We start with building the hash functions \mathcal{H}, which can be implemented using space \operatorname{poly}(\log N) by using hash families of limited independence. Next, we maintain, for every t\in[s] and every bucket j\in[u], an \ell_{0}-sampler \mathcal{L}^{(t)}_{j} (Lemma 2.2) with failure probability O(\frac{\delta}{us}), as well as an \ell_{0}-norm estimator \mathcal{K}^{(t)}_{j} (Lemma 2.1) with failure probability O(\frac{\delta}{us}) and error guarantee \varepsilon\sigma\leq\min\{\varepsilon,\sigma\}, both on domain [N]. By a union bound over all O(us) data structures, with probability 1-\delta they all succeed simultaneously, and we condition on their success in the following argument. Since we need to combine the linear sketches \mathcal{L}^{(t)}_{j}'s in later steps, we use the same random seeds among all the \ell_{0}-samplers \{\mathcal{L}^{(t)}_{j}\}_{t\in[s],j\in[u]}, so that they can be “combined” by simply adding them up; we do the same for the \mathcal{K}^{(t)}_{j}'s. Another detail is that, strictly speaking, we need O(1) independent “copies” of every \mathcal{K} and \mathcal{L}, since we query each of them O(1) times. As this only enlarges the space by an O(1) factor, we omit this detail for the sake of presentation.
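The reason these sketches can be combined is their linearity. The toy example below is not an actual \ell_{0}-sampler or \ell_{0}-estimator (those are the structures of Lemmas 2.1 and 2.2); it only illustrates, via a simple random-sign sketch, why sketches that share the same random seed can be merged by adding them up.

```python
import random

class SignSketch:
    """A toy linear sketch over domain [N]: it stores sum_i sign(i) * x_i, where the
    random signs are determined by the seed. Like the l0 structures used above, it is
    linear, so sketches built with the SAME seed combine by simple addition."""
    def __init__(self, N, seed):
        rng = random.Random(seed)
        self.sign = [rng.choice((-1, 1)) for _ in range(N)]
        self.value = 0

    def update(self, i, delta):            # frequency of item i changes by delta
        self.value += self.sign[i] * delta

    def plus(self, other):                 # valid only when both sketches share a seed
        assert self.sign == other.sign, "sketches must share the same random seed"
        combined = SignSketch.__new__(SignSketch)   # reuse the shared signs
        combined.sign = self.sign
        combined.value = self.value + other.value
        return combined

# Sketch two buckets separately (with a shared seed), then merge them ...
N, seed = 100, 7
bucket_a, bucket_b = SignSketch(N, seed), SignSketch(N, seed)
bucket_a.update(3, +1)
bucket_b.update(42, +2)
merged = bucket_a.plus(bucket_b)
# ... the result equals one sketch that received all the updates directly.
whole = SignSketch(N, seed)
whole.update(3, +1)
whole.update(42, +2)
assert merged.value == whole.value
```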

Whenever an update for element i\in[N] is received, we update \mathcal{L}^{(t)}_{j_{i}} and \mathcal{K}^{(t)}_{j_{i}} for every t\in[s], where j_{i}:=h_{t}(P_{i}) and P_{i}\in\mathcal{P} is the unique part that contains i.

When the stream terminates, we proceed to generate the sample i^{*}\in X_{\mathrm{light}} and the estimate r^{*} for |X_{\mathrm{light}}|. For every t\in[s] and j\in[u], query \mathcal{K}^{(t)}_{j} to obtain an estimate of |B_{j}| (line 4) within a (1\pm\varepsilon\sigma) factor. Use these estimates to find j^{*} (line 5). Note that this j^{*} is the same as the one computed from the exact |B_{j}| values. To see this, the key observation is that the bucket containing P_{\max} has at least |X_{\mathrm{heavy}}|\geq(0.5+\sigma)|X| elements of X, while the other bucket has at most (0.5-\sigma)|X| elements. Hence, to precisely find j^{*}, it suffices to distinguish a bucket B with |B|\geq(0.5+\sigma)|X| from a bucket B^{\prime} with |B^{\prime}|\leq(0.5-\sigma)|X|. Even with a (1\pm\varepsilon\sigma) multiplicative error (which is the error of our \mathcal{K}'s), the two estimates remain separated, since (1-\varepsilon\sigma)(0.5+\sigma)>(1+\varepsilon\sigma)(0.5-\sigma) whenever \varepsilon,\sigma<0.5.

Next, compute \mathcal{L}^{(t)}:=\sum_{j\in[u]\setminus\{j^{*}\}}\mathcal{L}^{(t)}_{j} as the \ell_{0}-sampler that corresponds to D_{t} (line 6), and obtain \mathcal{K}^{(t)}:=\sum_{j\in[u]\setminus\{j^{*}\}}\mathcal{K}^{(t)}_{j} similarly. We can do this since we use the same random seeds among the \mathcal{L}^{(t)}_{j}'s (and the same holds for the \mathcal{K}'s). We further compute \mathcal{L}:=\sum_{t\in[s]}\mathcal{L}^{(t)}, whose support corresponds to D_{\mathrm{all}}, and define \mathcal{K}:=\sum_{t\in[s]}\mathcal{K}^{(t)} similarly. The final return values i^{*} and r^{*} are given by querying \mathcal{L} and \mathcal{K}. Note that, conditioned on the success of the \ell_{0}-sampler \mathcal{L}, the probability that i^{*}=i for each i\in D_{\mathrm{all}} is exactly \frac{1}{|D_{\mathrm{all}}|} (Lemma 2.2), whereas r^{*} may deviate from |D_{\mathrm{all}}| by a multiplicative (1\pm\varepsilon) factor.

In conclusion, the analysis of Algorithm 2 still goes through when using the estimated values as in the above procedure, except that one needs to rescale \varepsilon and \delta by a constant factor, to compensate for the error and failure probability introduced by the streaming data structures. This finishes the proof of Lemma 4.14.

Acknowledgments

We thank Christian Sohler for pointing us to the dimension reduction result in [LSS09, Lam11]. We also thank an anonymous reviewer for pointing out how to simplify our proof of this result (Theorem B.1).

References

  • [ABIW09] Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. Efficient sketches for earth-mover distance, with applications. In FOCS, pages 324–330. IEEE Computer Society, 2009.
  • [AdlVKK03] Noga Alon, Wenceslas Fernandez de la Vega, Ravi Kannan, and Marek Karpinski. Random sampling and approximation of MAX-CSPs. J. Comput. Syst. Sci., 67(2):212–243, 2003.
  • [AG09] Kook Jin Ahn and Sudipto Guha. Graph sparsification in the semi-streaming model. In ICALP (2), volume 5556 of Lecture Notes in Computer Science, pages 328–338. Springer, 2009.
  • [AHV04] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004. doi:10.1145/1008731.1008736.
  • [Aro98] Sanjeev Arora. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. Journal of the ACM, 45(5):753–782, 1998. doi:10.1145/290179.290180.
  • [AS15] Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. Algorithmica, 72(1):83–98, 2015. doi:10.1007/s00453-013-9846-4.
  • [Bar96] Yair Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In FOCS, pages 184–193. IEEE Computer Society, 1996.
  • [BFL+17] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576–585. PMLR, 2017.
  • [CCG+98] Moses Charikar, Chandra Chekuri, Ashish Goel, Sudipto Guha, and Serge A. Plotkin. Approximating a finite metric by a small number of tree metrics. In FOCS, pages 379–388. IEEE Computer Society, 1998.
  • [CJK+22] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via new geometric hashing. In FOCS. IEEE, 2022.
  • [CJKV22] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Pavel Veselý. Streaming algorithms for geometric Steiner forest. In ICALP, volume 229 of LIPIcs, pages 47:1–47:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.ICALP.2022.47.
  • [CJLW22] Xi Chen, Rajesh Jayaram, Amit Levi, and Erik Waingarten. New streaming algorithms for high dimensional EMD and MST. In STOC, pages 222–233. ACM, 2022.
  • [CLMS13] Artur Czumaj, Christiane Lammersen, Morteza Monemizadeh, and Christian Sohler. (1+ε1+\varepsilon)-approximation for facility location in data streams. In SODA, pages 1710–1728. SIAM, 2013.
  • [dlVK00] Wenceslas Fernandez de la Vega and Marek Karpinski. Polynomial time approximation of dense weighted instances of MAX-CUT. Random Struct. Algorithms, 16(4):314–332, 2000.
  • [dlVK01] Wenceslas Fernandez de la Vega and Claire Kenyon. A randomized approximation scheme for metric MAX-CUT. J. Comput. Syst. Sci., 63(4):531–541, dec 2001. doi:10.1006/jcss.2001.1772.
  • [FIS08] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. Int. J. Comput. Geom. Appl., 18(1/2):3–28, 2008.
  • [FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011.
  • [FS05] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In STOC, pages 209–217. ACM, 2005.
  • [FSS20] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for kk-means, PCA, and projective clustering. SIAM J. Comput., 49(3):601–657, 2020.
  • [GGR96] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. In FOCS, pages 339–348. IEEE Computer Society, 1996.
  • [GW95] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145, 1995.
  • [HM04] Sariel Har-Peled and Soham Mazumdar. On coresets for kk-means and kk-median clustering. In STOC, pages 291–300. ACM, 2004.
  • [HSYZ18] Wei Hu, Zhao Song, Lin F. Yang, and Peilin Zhong. Nearly optimal dynamic kk-means clustering for high-dimensional data. CoRR, abs/1802.00459, 2018.
  • [Ind04] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In STOC, pages 373–380. ACM, 2004.
  • [JKS08] T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory Comput., 4(1):129–135, 2008.
  • [JL84] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), pages 189–206. Amer. Math. Soc., 1984.
  • [JST11] Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for Lp{L}_{p} samplers, finding duplicates in streams, and related problems. In PODS, pages 49–58. ACM, 2011.
  • [JW21] Rajesh Jayaram and David Woodruff. Perfect LpL_{p} sampling in a data stream. SIAM J. Comput., 50(2):382–439, 2021. doi:10.1137/18M1229912.
  • [KK19] Michael Kapralov and Dmitry Krachun. An optimal space lower bound for approximating MAX-CUT. In STOC, pages 277–288. ACM, 2019.
  • [KKMO07] Subhash Khot, Guy Kindler, Elchanan Mossel, and Ryan O’Donnell. Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? SIAM J. Comput., 37(1):319–357, 2007.
  • [KN97] Eyal Kushilevitz and Noam Nisan. Communication Complexity. Cambridge University Press, 1997.
  • [KNR99] Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. Comput. Complex., 8(1):21–49, 1999.
  • [KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, pages 41–52. ACM, 2010.
  • [Lam11] Christiane Lammersen. Approximation Techniques for Facility Location and Their Applications in Metric Embeddings. PhD thesis, Technische Universität Dortmund, 2011.
  • [LS08] Christiane Lammersen and Christian Sohler. Facility location in dynamic geometric data streams. In ESA, volume 5193 of Lecture Notes in Computer Science, pages 660–671. Springer, 2008.
  • [LSS09] Christiane Lammersen, Anastasios Sidiropoulos, and Christian Sohler. Streaming embeddings with slack. In WADS, volume 5664 of Lecture Notes in Computer Science, pages 483–494. Springer, 2009.
  • [MMR19] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of Johnson-Lindenstrauss transform for kk-means and kk-medians clustering. In STOC, pages 1027–1038. ACM, 2019.
  • [MRWZ20] Sepideh Mahabadi, Ilya P. Razenshteyn, David P. Woodruff, and Samson Zhou. Non-adaptive adaptive sampling on turnstile streams. In Proceedings of the 52nd Annual ACM Symposium on Theory of Computing, STOC 2020, pages 1251–1264. ACM, 2020. doi:10.1145/3357713.3384331.
  • [MW10] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error L_{p}-sampling with applications. In Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pages 1143–1160. SIAM, 2010. doi:10.1137/1.9781611973075.92.
  • [RV07] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4):21, 2007.
  • [Sch00] L. J. Schulman. Clustering for edge-cost minimization. In 32nd Annual ACM Symposium on Theory of Computing, pages 547–555. ACM, 2000. doi:10.1145/335305.335373.
  • [WY22] David P. Woodruff and Taisuke Yasuda. High-dimensional geometric streaming in polynomial space. In FOCS. IEEE, 2022. To Appear.
  • [Zar11] Hamid Zarrabi-Zadeh. An almost space-optimal streaming algorithm for coresets in fixed dimensions. Algorithmica, 60(1):46–59, 2011. doi:10.1007/s00453-010-9392-2.

Appendix A Lower Bounds Based on INDEX

Definition A.1 (INDEX Problem).

Alice is given a string x\in\{0,1\}^{n}, and Bob is given an index i\in[n]. Alice can send Bob exactly one message M, and Bob needs to use his input i and this message M to compute x_{i}\in\{0,1\}.

It is well known that the INDEX problem requires \Omega(n) communication to succeed with constant probability, i.e., |M|=\Omega(n) bits (see e.g., [KN97, KNR99, JKS08]).

Claim A.2.

For every integer \Delta\geq 1, given access to a quadtree T on [\Delta], any streaming algorithm that tests with constant success probability whether X(h_{i^{*}})\setminus X(h_{i^{*}+1})=\emptyset, for every X\subseteq[\Delta] and i^{*} presented as an insertion-only point stream, must use space \Omega(\sqrt{\Delta}). Here h_{i} is defined as in Section 4, i.e., the level-i node of T that contains the most points of X, and it is promised that both |X(h_{i^{*}})| and |X(h_{i^{*}+1})| are larger than 0.5|X|.

Proof.

Pick the level i^{*} in T such that the squares (i.e., intervals) have side-length m:=10\sqrt{\Delta}. We reduce from INDEX with n:=\Delta/m=\frac{1}{10}\sqrt{\Delta}, corresponding to the level-i^{*} nodes. (Note that i^{*} is public knowledge shared by Alice and Bob.) Suppose Alice receives x\in\{0,1\}^{n}. For j\in[n], if x_{j}=1, she inserts a point at coordinate (j-1)\cdot m+1, which is the first coordinate in the j-th interval at level i^{*}, and does nothing if x_{j}=0. Alice feeds this input to the algorithm, and sends the internal state of the algorithm to Bob.

When Bob receives i\in[n], he inserts one point at each coordinate in the right sub-interval (the one with larger coordinates) of the i-th level-i^{*} interval, namely ((i-1)\cdot m+\frac{m}{2},i\cdot m-1]. Clearly, h_{i^{*}} is this i-th interval at level i^{*}, h_{i^{*}+1} is the right sub-interval of h_{i^{*}}, and both X(h_{i^{*}}) and X(h_{i^{*}+1}) contain more than 0.5|X| points. Observe that X(h_{i^{*}})\setminus X(h_{i^{*}+1})=\emptyset if and only if x_{i}=0. This finishes the proof. ∎
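The point sets used in this reduction can be sanity-checked directly; the short Python snippet below builds Alice's and Bob's insertions for a hypothetical tiny instance and verifies the equivalence used above.

```python
def alice_points(x, m):
    """Alice: one point at the first coordinate of the j-th level-i* interval iff x_j = 1."""
    return [(j - 1) * m + 1 for j, bit in enumerate(x, start=1) if bit == 1]

def bob_points(i, m):
    """Bob: one point at every coordinate of the right half of the i-th interval."""
    return list(range((i - 1) * m + m // 2 + 1, i * m))

m = 8                       # side length of a level-i* interval (toy value)
x = [1, 0, 1]               # Alice's INDEX input, n = 3
i = 2                       # Bob's index
stream = alice_points(x, m) + bob_points(i, m)
interval_i = set(range((i - 1) * m + 1, i * m + 1))            # h_{i*}
right_half = set(range((i - 1) * m + m // 2 + 1, i * m + 1))   # h_{i*+1}
difference = [p for p in stream if p in interval_i and p not in right_half]
assert (len(difference) == 0) == (x[i - 1] == 0)
```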

Claim A.3.

For every integer Δ1\Delta\geq 1, any algorithm that with constant success probability computes Max-Cut(X)\textsc{Max-Cut}(X) exactly for every X[Δ]X\subseteq[\Delta] presented as an insertion-only point stream must use space Ω(poly(Δ))\Omega(\operatorname{poly}(\Delta)).

Proof.

In our proof, we assume an algorithm 𝒜\mathcal{A} returns η0\eta\geq 0, such that for every X[Δ]X\subseteq[\Delta],

Pr[η=Max-Cut(X)]11/Δc,\Pr[\eta=\textsc{Max-Cut}(X)]\geq 1-1/\Delta^{c},

for some sufficiently large c\geq 1. We show that such \mathcal{A} must use space \Omega(\operatorname{poly}(\Delta)). Note that the assumption about the 1/\Delta^{c} failure probability is without loss of generality, since a constant success probability can be amplified to 1-1/\Delta^{c} by running O(\log\Delta) independent copies in parallel and returning the majority answer, which increases the space by only an O(\log\Delta) factor.

We reduce from INDEX with n:=\Delta^{0.1}. Let m:=\frac{\Delta}{n}=\Delta^{0.9}=n^{9}. Suppose Alice receives x\in\{0,1\}^{n}. For j\in[n], if x_{j}=1, she inserts a point at coordinate (j-1)m, and if x_{j}=0 she does nothing. (Here we may insert a point at coordinate 0, which is not in [\Delta]; this may be resolved by, e.g., enlarging \Delta by 1 and shifting the coordinates, but we omit this detail for the sake of presentation.) Alice feeds this input to \mathcal{A}, and sends the internal state of \mathcal{A} to Bob.

Now, suppose Bob receives i\in[n]. Bob resumes algorithm \mathcal{A}, and uses \mathcal{A} to do the following: for every j\in[n], insert into the stream the set P_{j}:=\{(j-1)m+\frac{m}{2}+t\mid 1\leq t\leq n^{3}\} and query the Max-Cut value, restoring \mathcal{A} to its initially received state after every iteration of j. This may be done by saving the received state, inserting the points P_{j} for one j, querying, and then rolling back to the saved state. Since \mathcal{A} succeeds with probability 1-1/\Delta^{c}, by a union bound, with constant probability all the queries are answered correctly simultaneously. We condition on this constant-probability event.

Let X:=\{i\in[n]\mid x_{i}\neq 0\} be the support of Alice's input x, and let X^{\prime}:=\{(i-1)m\mid i\in X\} be the set of points inserted by Alice. Let \widehat{X}_{j}:=X^{\prime}\cup P_{j}. Notice that we now have the exact value of \textsc{Max-Cut}(\widehat{X}_{j}) for every j\in[n]. It remains to show that this information suffices to deduce the value of x_{i}, which the INDEX problem asks for. To this end, we need the following lemma, which gives the (rough) value of \textsc{Max-Cut}(\widehat{X}_{j}); it is essentially \sum_{i\in X}|i-j-0.5| (up to scaling and a negligible additive error).

Lemma A.4.

For every j[n]j\in[n], Max-Cut(X^j)n12iX|ij0.5|±O(n4)\textsc{Max-Cut}(\widehat{X}_{j})\in n^{12}\sum_{i\in X}|i-j-0.5|\pm O(n^{4}).

Proof.

Fix some jj. Note that PjX=P_{j}\cap X^{\prime}=\emptyset. Observe that the cut value of the cut (X,Pj)(X^{\prime},P_{j}) is

xX,yPjdist(x,y)\displaystyle\sum_{x\in X^{\prime},y\in P_{j}}\operatorname{dist}(x,y) =iXt[n3]|(i1)m(j1)mm2t|\displaystyle=\sum_{i\in X}\sum_{t\in[n^{3}]}\left|(i-1)m-(j-1)m-\frac{m}{2}-t\right|
=iXt[n3]|(ij0.5)n9t|\displaystyle=\sum_{i\in X}\sum_{t\in[n^{3}]}|(i-j-0.5)n^{9}-t|
iXt[n3]|ij0.5|n9±t\displaystyle\in\sum_{i\in X}\sum_{t\in[n^{3}]}|i-j-0.5|\cdot n^{9}\pm t
iX|ij0.5|n12±O(n4).\displaystyle\in\sum_{i\in X}|i-j-0.5|\cdot n^{12}\pm O(n^{4}).

It suffices to show the cut (Pj,X)(P_{j},X^{\prime}) achieves Max-Cut(X^j)\textsc{Max-Cut}(\widehat{X}_{j}). Define the notation cut(W,Z):=xW,yZdist(x,y)\operatorname{cut}(W,Z):=\sum_{x\in W,y\in Z}{\operatorname{dist}(x,y)}. Consider a cut (S,T)(S,T) of X^j\widehat{X}_{j} such that SPjS\neq P_{j} and TXT\neq X^{\prime}, and we show cut(S,T)<cut(X,Pj)\operatorname{cut}(S,T)<\operatorname{cut}(X^{\prime},P_{j}). A useful fact which follows from the definition is that dist(X,Pj)m2\operatorname{dist}(X^{\prime},P_{j})\geq\frac{m}{2}.

Easy Case

First, we consider the (easy) case such that SPjS\subseteq P_{j} or TPjT\subseteq P_{j}. In this case, there is some WPj\emptyset\neq W\subset P_{j} such that cut(S,T)=cut(PjW,XW)\operatorname{cut}(S,T)=\operatorname{cut}(P_{j}\setminus W,X^{\prime}\cup W). Then,

\displaystyle\operatorname{cut}(S,T) \displaystyle=\operatorname{cut}(P_{j}\setminus W,X^{\prime}\cup W)=\operatorname{cut}(P_{j}\setminus W,X^{\prime})+\operatorname{cut}(P_{j}\setminus W,W)
\displaystyle<\operatorname{cut}(P_{j}\setminus W,X^{\prime})+\frac{|P_{j}|^{2}}{4}\cdot n^{3}
\displaystyle=\operatorname{cut}(P_{j},X^{\prime})-\operatorname{cut}(W,X^{\prime})+\frac{n^{9}}{4}
\displaystyle\leq\operatorname{cut}(P_{j},X^{\prime})-\frac{m}{2}+\frac{n^{9}}{4}
\displaystyle<\operatorname{cut}(P_{j},X^{\prime}).

General Case

Now, we handle the remaining case where neither SS nor TT is a subset of PjP_{j}. Then |SX|>0|S\cap X^{\prime}|>0 and |TX|>0|T\cap X^{\prime}|>0. Combining these facts and assumptions, we have

cut(Pj,X)cut(S,T)\displaystyle\operatorname{cut}(P_{j},X^{\prime})-\operatorname{cut}(S,T)
=\displaystyle= cut(Pj,X)cut(SPj,TX)cut(SX,TPj)cut(SX,TX)cut(SPj,TPj)\displaystyle\operatorname{cut}(P_{j},X^{\prime})-\operatorname{cut}(S\cap P_{j},T\cap X^{\prime})-\operatorname{cut}(S\cap X^{\prime},T\cap P_{j})-\operatorname{cut}(S\cap X^{\prime},T\cap X^{\prime})-\operatorname{cut}(S\cap P_{j},T\cap P_{j})
>\displaystyle> cut(SPj,SX)+cut(TPj,TX)|X|24Δ|Pj|2n3\displaystyle\operatorname{cut}(S\cap P_{j},S\cap X^{\prime})+\operatorname{cut}(T\cap P_{j},T\cap X^{\prime})-\frac{|X^{\prime}|^{2}}{4}\cdot\Delta-|P_{j}|^{2}\cdot n^{3}
\displaystyle\geq m2(|SX||SPj|+|TX||TPj|)n124n9\displaystyle\ \frac{m}{2}\cdot(|S\cap X^{\prime}|\cdot|S\cap P_{j}|+|T\cap X^{\prime}|\cdot|T\cap P_{j}|)-\frac{n^{12}}{4}-n^{9}
\displaystyle\geq m2(|SPj|+|TPj|)n124n9\displaystyle\ \frac{m}{2}\cdot(|S\cap P_{j}|+|T\cap P_{j}|)-\frac{n^{12}}{4}-n^{9}
=\displaystyle= n124n9>0.\displaystyle\ \frac{n^{12}}{4}-n^{9}>0.

This finishes the proof of Lemma A.4. ∎

Observe that the function f:[n]\to{\mathbb{R}}_{+} defined by f(j):=2\sum_{i\in X}|i-j-0.5| is integer-valued, so 2n^{12}\sum_{i\in X}|i-j-0.5| must be a multiple of n^{12}. Therefore, by Lemma A.4, rounding the value of 2\cdot\textsc{Max-Cut}(\widehat{X}_{j}) to the nearest multiple of n^{12} (so that the additive error O(n^{4}) in Lemma A.4 is ignored) recovers the exact value of 2n^{12}\sum_{i\in X}|i-j-0.5|. In other words, we know the value of f(j) for all j\in[n]. Finally, because of the piece-wise linear structure of f, one can use all the f(j) values to recover X entirely. This finishes the proof of A.3. ∎

Appendix B Dimension Reduction for Max-Cut

For completeness, we prove below a dimension reduction for Max-Cut, similarly to the one given in [LSS09], with proof details appearing in [Lam11].

Theorem B.1.

Let XdX\subset\mathbb{R}^{d} be a finite set, 0<ε,δ<10<\varepsilon,\delta<1 and π:dd\pi:\mathbb{R}^{d}\to\mathbb{R}^{d^{\prime}} be a JL Transform as in Definition B.2 with target dimension d=O(ε2log(ε1δ1))d^{\prime}=O\left(\varepsilon^{-2}{\log(\varepsilon^{-1}\delta^{-1})}\right). Then with probability 1δ1-\delta,

Max-Cut(π(X))(1±ε)Max-Cut(X).\textsc{Max-Cut}(\pi(X))\in(1\pm\varepsilon)\cdot\textsc{Max-Cut}(X). (9)

Our bound is not directly comparable with that of [LSS09, Lam11]: we improve the dependence on \delta from O(\delta^{-2}) to O(\log(\delta^{-1})), but we introduce an additional \log(\varepsilon^{-1}) factor. Our argument is simpler, achieved by making use of a certain formulation of the Johnson-Lindenstrauss (JL) transform [JL84] that differs from the one used in [LSS09, Lam11].

The JL transform formulation that we use (see Definition B.2) is similar to the one previously used in [MMR19] to obtain dimension reduction results for clustering problems. Its second property (the expectation bound) may not hold for all constructions of the JL transform (and was not used in [LSS09, Lam11]), but it can be realized by a random matrix with independent sub-Gaussian entries.

Definition B.2 (JL Transform [JL84, MMR19]).

For every integer d,d1d,d^{\prime}\geq 1, there exists a randomized mapping π:dd\pi:\mathbb{R}^{d}\to\mathbb{R}^{d^{\prime}} such that for all xydx\neq y\in\mathbb{R}^{d},

Prπ\displaystyle\Pr_{\pi} [dist(π(x),π(y))(1±ε)dist(x,y)]eCdε2\displaystyle\Big{[}\operatorname{dist}(\pi(x),\pi(y))\not\in(1\pm\varepsilon)\cdot\operatorname{dist}(x,y)\Big{]}\leq e^{-Cd^{\prime}\varepsilon^{2}} (10)
𝔼π\displaystyle\operatorname*{\mathbb{E}}_{\pi} [max{|dist(π(x),π(y))dist(x,y)1|ε,0}]eCdε2,\displaystyle\left[\max\left\{\left|\frac{\operatorname{dist}(\pi(x),\pi(y))}{\operatorname{dist}(x,y)}-1\right|-\varepsilon,0\right\}\right]\leq e^{-Cd^{\prime}\varepsilon^{2}}, (11)

for some universal constant C>0C>0.

Our version is similar but not identical to that in [MMR19], because we introduce an absolute value (to bound the error from both sides); the proof is essentially the same.
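For concreteness, one standard realization of such a map is a d'\times d matrix with i.i.d. Gaussian entries scaled by 1/\sqrt{d'}, sketched below (the constants in (10) and (11) are not tracked here).

```python
import numpy as np

def jl_transform(d, d_prime, rng=None):
    """Return a random linear map pi: R^d -> R^{d'} given by a scaled Gaussian matrix,
    one standard construction behind Definition B.2 (constants not tracked)."""
    rng = rng if rng is not None else np.random.default_rng()
    G = rng.normal(size=(d_prime, d)) / np.sqrt(d_prime)
    return lambda x: G @ x

# Example: the distance between a pair of points is roughly preserved.
rng = np.random.default_rng(0)
pi = jl_transform(d=1000, d_prime=200, rng=rng)
x, y = rng.normal(size=1000), rng.normal(size=1000)
ratio = np.linalg.norm(pi(x) - pi(y)) / np.linalg.norm(x - y)
print(f"distortion ratio: {ratio:.3f}")   # typically close to 1
```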

Proof of Theorem B.1.

Observe that it suffices to show that with probability 1-\delta,

\sum_{x,y\in X}|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|\leq O(\varepsilon)\sum_{x,y\in X}\operatorname{dist}(x,y). (12)

Indeed, replacing X with \pi(X) changes the value of any fixed cut by at most the left-hand side of (12), and \textsc{Max-Cut}(X)\geq\frac{1}{4}\sum_{x,y\in X}\operatorname{dist}(x,y) because a uniformly random bipartition of X cuts each pair with probability 1/2; hence (12) implies (9) after rescaling \varepsilon.

Let P:=\left\{(x,y)\in X\times X:|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|>\varepsilon\operatorname{dist}(x,y)\right\} be the set of “bad” point pairs whose distances are distorted by more than an \varepsilon relative error. By the definition of P, we have (with probability 1)

(x,y)X×XP|dist(π(x),π(y))dist(x,y)|εx,yXdist(x,y),\displaystyle\sum_{(x,y)\in X\times X\setminus P}|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|\leq\varepsilon\cdot\sum_{x,y\in X}\operatorname{dist}(x,y),

and hence it remains to show that the following holds with probability 1-\delta:

(x,y)P|dist(π(x),π(y))dist(x,y)|εx,yXdist(x,y).\sum_{(x,y)\in P}|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|\leq\varepsilon\sum_{x,y\in X}\operatorname{dist}(x,y). (13)

By the guarantee of Definition B.2, we have

\displaystyle\operatorname*{\mathbb{E}}\Big[\sum_{(x,y)\in P}\big(|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|-\varepsilon\operatorname{dist}(x,y)\big)\Big]
\displaystyle=\ \operatorname*{\mathbb{E}}\Big[\sum_{(x,y)\in P}\max\{|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|-\varepsilon\operatorname{dist}(x,y),0\}\Big]
\displaystyle\leq 𝔼[x,yXmax{|dist(π(x),π(y))dist(x,y)|εdist(x,y),0}]\displaystyle\operatorname*{\mathbb{E}}[\sum_{x,y\in X}\max\{|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|-\varepsilon\operatorname{dist}(x,y),0\}]
\displaystyle\leq eCdε2x,yXdist(x,y)εδx,yXdist(x,y).\displaystyle e^{-Cd^{\prime}\varepsilon^{2}}\sum_{x,y\in X}\operatorname{dist}(x,y)\leq\varepsilon\delta\sum_{x,y\in X}\operatorname{dist}(x,y).

Therefore, by Markov's inequality, with probability 1-\delta we have \sum_{(x,y)\in P}\big(|\operatorname{dist}(\pi(x),\pi(y))-\operatorname{dist}(x,y)|-\varepsilon\operatorname{dist}(x,y)\big)\leq\varepsilon\sum_{x,y\in X}\operatorname{dist}(x,y), and thus (13) holds with 2\varepsilon in place of \varepsilon, which still suffices for (12). This finishes the proof. ∎

Theorem B.1 implies the corollary below, about a streaming algorithm that reports an encoding of a near-optimal cut (and not just its value). The most natural way to report a cut of X is to somehow represent a 2-partition of X, but this is not possible in small space, because such a representation essentially contains X itself, which requires \Omega(n) bits to store. Instead, we let the algorithm report a function f:\mathbb{R}^{d}\to\{0,1\} (using some encoding), and then f implicitly defines the cut (X\cap f^{-1}(0),X\cap f^{-1}(1)). In other words, the algorithm essentially reports an “oracle” that does not know X, but can determine, for each input point x\in X, its side in the cut. This formulation was suggested by [FS05], and in fact we rely on their solution and combine it with our dimension reduction.

Corollary B.3 (Cut Oracle).

There is a randomized streaming algorithm that, given 0<\varepsilon<1/2, integers \Delta,d\geq 1, and an input dataset X\subseteq[\Delta]^{d} presented as a dynamic stream, uses space \exp(\operatorname{poly}(\varepsilon^{-1}))\cdot\operatorname{poly}(d\log\Delta) and reports (an encoding of) a mapping f:\mathbb{R}^{d}\to\{0,1\}, such that with constant probability (at least 2/3), \operatorname{cut}_{X}(X\cap f^{-1}(0))\geq(1-\varepsilon)\cdot\textsc{Max-Cut}(X).

Proof.

As noted in [FS05], there exists an algorithm \mathcal{A} that finds such an f with the same guarantee and failure probability, except that the space usage is \varepsilon^{-O(d)}\cdot\operatorname{poly}(\log\Delta). Hence, we can use this \mathcal{A} as a black box together with Theorem B.1 to conclude the corollary.

Specifically, let \pi:\mathbb{R}^{d}\to\mathbb{R}^{d^{\prime}} with d^{\prime}=O(\varepsilon^{-2}\log(\varepsilon^{-1})) be a mapping that satisfies Theorem B.1 (with constant \delta). Then, for every update of a point x\in[\Delta]^{d} in the stream, we map it to \pi(x) and feed it to \mathcal{A}. When the stream terminates, we use \mathcal{A} to compute f^{\prime}:\mathbb{R}^{d^{\prime}}\to\{0,1\}. The final f:\mathbb{R}^{d}\to\{0,1\} is then defined by f(x):=f^{\prime}(\pi(x)). This finishes the proof. ∎
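A minimal sketch of this wrapper is shown below; the low-dimensional streaming algorithm \mathcal{A} of [FS05] is treated as an abstract black box with hypothetical `update` and `cut_oracle` methods.

```python
import numpy as np

class DimensionReducedCutOracle:
    """Sketch of the proof of Corollary B.3: map every streamed point through a JL
    transform pi and forward it to a black-box low-dimensional algorithm A; the
    reported oracle f is the pullback of A's oracle f' along pi, i.e., f = f' o pi."""
    def __init__(self, d, d_prime, low_dim_algo, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        self.G = rng.normal(size=(d_prime, d)) / np.sqrt(d_prime)   # the map pi
        self.A = low_dim_algo                                       # abstract black box

    def update(self, point, sign):        # sign = +1 for insertion, -1 for deletion
        self.A.update(self.G @ np.asarray(point), sign)

    def cut_oracle(self):
        f_prime = self.A.cut_oracle()     # f': R^{d'} -> {0, 1}
        return lambda x: f_prime(self.G @ np.asarray(x))            # f(x) := f'(pi(x))
```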