
Approximating $(k,\ell)$-Median Clustering for Polygonal Curves

Maike Buchin Faculty of Mathematics, Ruhr-University Bochum, Germany, maike.buchin@rub.de    Anne Driemel Hausdorff Center for Mathematics, University of Bonn, Germany, driemel@cs.uni-bonn.de    Dennis Rohde Faculty of Mathematics, Ruhr-University Bochum, Germany, dennis.rohde-t1b@rub.de
Abstract

In 2015, Driemel, Krivošija and Sohler introduced the $(k,\ell)$-median problem for clustering polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to find $k$ median curves of at most $\ell$ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in $\mathbb{R}^d$ and achieves approximation factor $(1+\varepsilon)$ with respect to the clustering costs. The algorithm has worst-case running-time linear in the number of curves, polynomial in the maximum number of vertices per curve (i.e., their complexity), and exponential in $d$, $\ell$, $\varepsilon$ and the failure probability $\delta$. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost as an optimal median curve of complexity $\ell$, but of complexity at most $2\ell-2$, and whose vertices can be computed efficiently. We combine this lemma with the superset-sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.

1 Introduction

Since the development of $k$-means – the pioneer of modern computational clustering – the last 65 years have brought a diversity of specialized [31, 7, 20, 6, 13, 19, 32] as well as generalized clustering algorithms [24, 2, 5]. However, in most cases clustering of point sets was studied. Many clustering problems indeed reduce to clustering of point sets, but for sequential data like time-series and trajectories – which arise in the natural sciences, medicine, sports, finance, ecology, audio/speech analysis, handwriting and many more – this is not the case. Hence, we need specialized clustering methods for these purposes, cf. [1, 12, 18, 29, 30].

A promising branch of this active research deals with $(k,\ell)$-center and $(k,\ell)$-median clustering – adaptions of the well-known Euclidean $k$-center and $k$-median clustering. In $(k,\ell)$-center clustering, respectively $(k,\ell)$-median clustering, we are given a set of $n$ polygonal curves in $\mathbb{R}^d$, each of complexity (i.e., number of vertices) at most $m$, and want to compute $k$ centers that minimize the objective function – just as in Euclidean $k$-clustering. In addition, the centers are restricted to have complexity at most $\ell$ each to prevent over-fitting – a problem specific to sequential data. A great benefit of regarding the sequential data as polygonal curves is that we introduce an implicit linear interpolation. This does not require any additional storage space, since we only need to store the vertices of the curves, which are the sequences at hand. We compare the polygonal curves by their Fréchet distance, which is a continuous distance measure that takes the entire course of the curves into account, not only the pairwise distances among their vertices. Therefore, irregularly sampled sequences are automatically handled by the interpolation, which is desirable in many cases. Moreover, Buchin et al. [10] showed, using heuristics, that the $(k,\ell)$-clustering objectives yield promising results on trajectory data.

This branch of research formed only recently, about twenty years after Alt and Godau developed an algorithm to compute the Fréchet distance between polygonal curves [4]. Several papers have since studied this type of clustering [15, 9, 10, 11, 26]. However, all of these clustering algorithms, except the approximation-schemes for polygonal curves in $\mathbb{R}$ [15] and the heuristics in [10], choose a $k$-subset of the input as centers. (This is often called discrete clustering.) This $k$-subset is later simplified, or all input curves are simplified before choosing a $k$-subset. Either way, with these techniques one cannot achieve an approximation factor of less than $2$: there need not be an input curve whose distance to its median is less than the average distance of a curve to its median.

Driemel et al. [15], who were the first to study clustering of polygonal curves under the Fréchet distance in this setting, already overcame this problem in one dimension by defining and analyzing $\delta$-signatures, which are succinct representations of classes of curves that allow synthetic center-curves to be constructed. However, it seems that $\delta$-signatures are only applicable in $\mathbb{R}$. Here, we extend their work and obtain the first randomized bicriteria approximation algorithm for $(k,\ell)$-median clustering of polygonal curves in $\mathbb{R}^d$.

1.1 Related Work

Driemel et al. [15] introduced the $(k,\ell)$-center and $(k,\ell)$-median objectives and developed the first approximation-schemes for these objectives, for curves in $\mathbb{R}$. Furthermore, they proved that $(k,\ell)$-center as well as $(k,\ell)$-median clustering is NP-hard, where $k$ is a part of the input and $\ell$ is fixed. Also, they showed that the doubling dimension of the metric space of polygonal curves under the Fréchet distance is unbounded, even when the complexity of the curves is bounded.

Following this work, Buchin et al. [9] developed a constant-factor approximation algorithm for $(k,\ell)$-center clustering in $\mathbb{R}^d$. Furthermore, they provide improved results on the hardness of approximating $(k,\ell)$-center clustering under the Fréchet distance: the $(k,\ell)$-center problem is NP-hard to approximate within a factor of $(1.5-\varepsilon)$ for curves in $\mathbb{R}$ and within a factor of $(2.25-\varepsilon)$ for curves in $\mathbb{R}^d$, where $d\geq 2$, in both cases even if $k=1$. Furthermore, for the $(k,\ell)$-median variant, Buchin et al. [11] proved NP-hardness using a similar reduction. Again, the hardness holds even if $k=1$. Also, they provided $(1+\varepsilon)$-approximation algorithms for $(k,\ell)$-center, as well as $(k,\ell)$-median clustering, under the discrete Fréchet distance. Nath and Taylor [28] give improved algorithms for $(1+\varepsilon)$-approximation of $(k,\ell)$-median clustering under the discrete Fréchet and Hausdorff distances. Recently, Meintrup et al. [26] introduced a practical $(1+\varepsilon)$-approximation algorithm for discrete $k$-median clustering under the Fréchet distance, when the input adheres to a certain natural assumption, namely the presence of a certain number of outliers.

Our algorithms build upon the clustering algorithm of Kumar et al. [25], which was later extended by Ackermann et al. [2]. This algorithm is a recursive approximation scheme that employs two phases in each call. In the so-called candidate phase it computes candidates by taking a sample $S$ from the input set $T$ and running an algorithm on each subset of $S$ of a certain size. Which algorithm to use depends on the metric at hand. The idea behind this is simple: if $T$ contains a cluster $T'$ that takes a constant fraction of its size, then a constant fraction of $S$ is from $T'$ with high probability. By brute-force enumeration of all subsets of $S$, we find this subset $S'\subseteq T'$, and if $S$ is taken uniformly and independently at random from $T$, then $S'$ is a uniform and independent sample from $T'$. Ackermann et al. proved, for various metric and non-metric distance measures, that $S'$ can be used for computing candidates that contain a $(1+\varepsilon)$-approximate median for $T'$ with high probability. The algorithm recursively calls itself for each candidate to eventually evaluate these together with the candidates for the remaining clusters.
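The superset-sampling idea behind the candidate phase can be illustrated by a small, hedged sketch (function name and parameter values are ours, not from [25]): draw a uniform sample $S$ of $T$ and enumerate subsets of a fixed size; whenever enough sample points hit the cluster $T'$, at least one enumerated subset lies entirely inside $T'$, and any such subset is itself a uniform, independent sample of $T'$.

```python
import itertools
import random

def subsets_inside_cluster(T, cluster, sample_size, subset_size):
    """Superset sampling: draw a uniform sample S of T (with
    replacement) and return all subsets of S of size subset_size
    that lie entirely inside the given cluster."""
    S = [random.choice(T) for _ in range(sample_size)]
    return [sub for sub in itertools.combinations(S, subset_size)
            if all(x in cluster for x in sub)]

random.seed(1)
T = list(range(100))
cluster = set(range(50))  # a cluster taking half of the input
subs = subsets_inside_cluster(T, cluster, sample_size=40, subset_size=4)
# w.h.p. at least one 4-subset of the sample lies inside the cluster
```

In the actual algorithm the enumeration runs over all subsets of $S$ and the candidate-generation step is executed for each of them; the snippet only demonstrates why a pure sample of $T'$ is found with high probability.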

The second phase of the algorithm is the so-called pruning phase, where it partitions its input according to the candidates at hand into two sets of equal size: one with the smaller distances to the candidates and one with the larger distances to the candidates. It then recursively calls itself with the second set as input. The idea behind this is that small clusters now become large enough for candidates to be found for them. Furthermore, the partitioning incurs a provably small error. Finally, it returns the set of $k$ candidates that together evaluate best.

1.2 Our Contributions

Figure 1: From left to right: symbolic depiction of the operation principles of Algorithms 1, 2 and 4. Among all approximate $\ell$-simplifications (depicted in blue) of the input curves (depicted in black), Algorithm 1 returns the one that evaluates best (the solid curve) with respect to a sample of the input. Algorithm 2 does not return a single curve, but a set of candidates. These include the curve returned by Algorithm 1 plus all curves with $\ell$ vertices from the cubic grids, covering balls of certain radius centered at the vertices of an input curve that is close to a median, w.h.p. Algorithm 4 is similar to Algorithm 2, but covers not only the vertices of a single curve but of multiple curves. We depict the best approximate median that can be generated from the grids in solid green.

We present several algorithms for approximating $(1,\ell)$-median clustering of polygonal curves under the Fréchet distance; see Fig. 1 for an illustration of their operation principles. While the first one, Algorithm 1, yields only a coarse approximation (factor $34$), it is suitable as a plugin for the following two algorithms, Algorithms 2 and 4, due to its asymptotically fast running-time. These algorithms yield a better approximation (factor $3+\varepsilon$, respectively $1+\varepsilon$). Additionally, Algorithms 2 and 4 are not only able to yield an approximation for the input set $T$, but also for a cluster $T'\subseteq T$ that takes a constant fraction of $T$. We would like to use these as plugins for the $(1+\varepsilon)$-approximation algorithm for $k$-median clustering by Ackermann et al. [2], but that would require our algorithms to comply with the sampling properties. For an input set $T$, the weak sampling property expresses that a constant-size set of candidates can be computed, containing a $(1+\varepsilon)$-approximate median for $T$ with high probability, by taking a constant-size uniform and independent sample of $T$. Further, the running-time for computing the candidates may depend only on the size of the sample, the size of the candidate set and the failure probability parameter. The strong sampling property is defined similarly, but instead of a candidate set, an approximate median can be computed directly, and the running-time may only depend on the size of the sample. In our algorithms, the running-time for computing the candidate set depends on $m$, which is a parameter of the input. Additionally, our first algorithm for computing candidates, which contain a $(3+\varepsilon)$-approximate $(1,\ell)$-median with high probability, does not achieve the required approximation-factor of $(1+\varepsilon)$.
However, looking into the analysis of Ackermann et al., any algorithm for computing candidates with some guaranteed approximation-factor can be used in the recursive approximation-scheme. Therefore, we decided to generalize the $k$-median clustering algorithm of Ackermann et al. [2].

Nath and Taylor [28] use a similar approach, but they developed yet another way to compute candidates: they define and analyze $g$-coverability, which is a generalization of the notion of doubling dimension; indeed, for the discrete Fréchet distance their proof builds upon the doubling dimension of points in $\mathbb{R}^d$. However, the doubling dimension of polygonal curves under the continuous Fréchet distance is unbounded, even when the complexities of the curves are bounded, and it is an open question whether $g$-coverability holds for the continuous Fréchet distance.

We circumvent this by taking a different approach using the idea of shortcutting. It is well-known that shortcutting a polygonal curve (that is, replacing a subcurve by the line segment connecting its endpoints) does not increase its Fréchet distance to a line segment. This idea has been used before for a variety of Fréchet-distance related problems [3, 15, 14, 8]. Specifically, we introduce two new shortcutting lemmata, which guarantee the existence of good approximate medians of complexity at most $2\ell-2$ whose vertices can be computed efficiently. The first, which we call simple shortcutting, enables us to return candidates that contain, w.h.p., a $(3+\varepsilon)$-approximate median for a cluster inside the input that takes a constant fraction of the input. The second, which we call advanced shortcutting, analogously enables us to return candidates that contain, w.h.p., a $(1+\varepsilon)$-approximate median for such a cluster. All in all, we obtain as main result, following from Corollary 7.5:
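The shortcutting operation itself is elementary; the following sketch (names ours) shows it on the vertex list of a polygonal curve:

```python
def shortcut(vertices, i, j):
    """Replace the subcurve between vertices i and j (0-indexed,
    i < j) by the single line segment from vertex i to vertex j.
    Against a line segment, this operation never increases the
    Fréchet distance of the curve."""
    return vertices[:i + 1] + vertices[j:]

zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, -1.0), (3.0, 0.0)]
shortcut(zigzag, 0, 3)  # → [(0.0, 0.0), (3.0, 0.0)]
```

The lemmata below are existence statements about which shortcuts to take; the operation shown here is only the basic building block.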

Theorem 1.1.

Given a set $T$ of $n$ polygonal curves in $\mathbb{R}^d$, of complexity at most $m$ each, parameter values $\varepsilon\in(0,0.158]$ and $\delta\in(0,1)$, and constants $k,\ell\in\mathbb{N}$, there exists an algorithm which computes a set $C$ of $k$ polygonal curves, each of complexity at most $2\ell-2$, such that with probability at least $(1-\delta)$ it holds that

\operatorname{cost}(T,C)=\sum_{\tau\in T}\min_{c\in C}d_{F}(c,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}\min_{c\in C^{\ast}}d_{F}(c,\tau)=(1+\varepsilon)\operatorname{cost}(T,C^{\ast}),

where $C^{\ast}$ is an optimal $(k,\ell)$-median solution for $T$ under the Fréchet distance $d_F(\cdot,\cdot)$.

The algorithm has worst-case running-time linear in $n$, polynomial in $m$, and exponential in $\delta$, $\varepsilon$, $d$ and $\ell$.

1.3 Organization

The paper is organized as follows. First, we present a simple and fast $34$-approximation algorithm for $(1,\ell)$-median clustering. Then, we present the $(3+\varepsilon)$-approximation algorithm for $(1,\ell)$-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon simple shortcutting and the $34$-approximation algorithm. Next, we present a more practical modification of the $(3+\varepsilon)$-approximation algorithm, which achieves a $(5+\varepsilon)$-approximation for $(1,\ell)$-median clustering. Following this, we present the similar but more involved $(1+\varepsilon)$-approximation algorithm for $(1,\ell)$-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon advanced shortcutting and the $34$-approximation algorithm. Finally, we present the generalized recursive $k$-median approximation-scheme, which leads to our main result.

2 Preliminaries

Here we introduce all necessary definitions. In the following, $d\in\mathbb{N}$ is an arbitrary constant. By $\lVert\cdot\rVert$ we denote the Euclidean norm, and for $p\in\mathbb{R}^d$ and $r\in\mathbb{R}_{\geq 0}$ we denote by $B(p,r)=\{q\in\mathbb{R}^d\mid\lVert p-q\rVert\leq r\}$ the closed ball of radius $r$ with center $p$. By $S_n$ we denote the symmetric group of degree $n$. We give a standard definition of grids:

Definition 2.1 (grid).

Given a number $r\in\mathbb{R}_{>0}$, for $p=(p_1,\dots,p_d)\in\mathbb{R}^d$ we define by $G(p,r)=(\lfloor p_1/r\rfloor\cdot r,\dots,\lfloor p_d/r\rfloor\cdot r)$ the $r$-grid-point of $p$. Let $X\subseteq\mathbb{R}^d$ be a subset of $\mathbb{R}^d$. The grid of cell width $r$ that covers $X$ is the set $\mathbb{G}(X,r)=\{G(p,r)\mid p\in X\}$.

Such a grid partitions the set $X$ into cubic regions, and for each $r\in\mathbb{R}_{>0}$ and $p\in X$ we have $\lVert p-G(p,r)\rVert\leq\sqrt{d}\,r$. We give a standard definition of polygonal curves:
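A minimal sketch of this definition (function names ours): each coordinate is floored to a multiple of the cell width $r$, so the grid point of $p$ is the lower corner of the cell containing $p$, which lies within distance $\sqrt{d}\cdot r$ of $p$.

```python
import math

def grid_point(p, r):
    """The r-grid-point G(p, r): floor each coordinate of p to the
    next multiple of r below it (the lower corner of p's cell)."""
    return tuple(math.floor(x / r) * r for x in p)

def grid_cover(points, r):
    """The grid of cell width r covering a finite point set X."""
    return {grid_point(p, r) for p in points}

p = (0.7, 1.3)
g = grid_point(p, 0.5)  # → (0.5, 1.0)
# the floored corner lies within sqrt(d) * r of p
assert math.dist(p, g) <= math.sqrt(2) * 0.5
```

In the algorithms, such grids are laid over balls around curve vertices, so that every relevant point has a nearby grid point that can serve as a candidate vertex.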

Definition 2.2 (polygonal curve).

A (parameterized) curve is a continuous mapping $\tau\colon[0,1]\rightarrow\mathbb{R}^d$. A curve $\tau$ is polygonal iff there exist $v_1,\dots,v_m\in\mathbb{R}^d$, no three consecutive on a line, called $\tau$'s vertices, and $t_1,\dots,t_m\in[0,1]$ with $t_1<\dots<t_m$, $t_1=0$ and $t_m=1$, called $\tau$'s instants, such that $\tau$ connects every two contiguous vertices $v_i=\tau(t_i)$, $v_{i+1}=\tau(t_{i+1})$ by a line segment.

We call the line segments $\overline{v_1v_2},\dots,\overline{v_{m-1}v_m}$ the edges of $\tau$, and $m$ the complexity of $\tau$, denoted by $\lvert\tau\rvert$. Sometimes we will argue about a sub-curve $\tau$ of a given curve $\sigma$. We will then refer to $\tau$ by restricting the domain of $\sigma$, denoted by $\sigma|_X$, where $X\subseteq[0,1]$.
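As an illustration of the definition, a hedged sketch (representation and names ours): a polygonal curve stored as vertices and instants, evaluated at any $t\in[0,1]$ by linear interpolation between the two enclosing vertices.

```python
import bisect

def make_curve(vertices, instants):
    """A polygonal curve tau: [0,1] -> R^d with v_i = tau(t_i),
    connecting contiguous vertices by line segments."""
    assert instants[0] == 0.0 and instants[-1] == 1.0

    def tau(t):
        # locate the edge whose parameter interval contains t
        i = min(bisect.bisect_right(instants, t) - 1, len(vertices) - 2)
        lam = (t - instants[i]) / (instants[i + 1] - instants[i])
        return tuple((1 - lam) * a + lam * b
                     for a, b in zip(vertices[i], vertices[i + 1]))
    return tau

tau = make_curve([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], [0.0, 0.5, 1.0])
tau(0.25)  # → (0.5, 0.0), the midpoint of the first edge
```

Only the vertices need to be stored; the interpolation is implicit, which is exactly the storage advantage mentioned in the introduction.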

Definition 2.3 (Fréchet distance).

Let $\mathcal{H}$ denote the set of all continuous bijections $h\colon[0,1]\rightarrow[0,1]$ with $h(0)=0$ and $h(1)=1$, which we call reparameterizations. The Fréchet distance between curves $\sigma$ and $\tau$ is defined as

d_{F}(\sigma,\tau)\ =\ \inf_{h\in\mathcal{H}}\ \max_{t\in[0,1]}\ \lVert\sigma(t)-\tau(h(t))\rVert.

Sometimes, given two curves $\sigma,\tau$, we will refer to an $h\in\mathcal{H}$ as a matching between $\sigma$ and $\tau$.
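The continuous Fréchet distance of two polygonal curves can be computed with the algorithm of Alt and Godau [4]. As a self-contained illustration, the following sketch computes the closely related discrete Fréchet distance on the vertex sequences (the classic dynamic program of Eiter and Mannila, not the algorithm used in this paper); it couples only vertices and therefore upper-bounds the continuous distance.

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between vertex sequences P and Q:
    the cheapest simultaneous traversal of both sequences, where in
    each step either sequence (or both) advances by one vertex and
    the cost is the largest point-to-point distance encountered."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

P = [(0.0, 0.0), (2.0, 0.0)]
Q = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
discrete_frechet(P, Q)  # → sqrt(2): the middle vertex of Q must be
                        # coupled to an endpoint of P
```

For these two curves the continuous Fréchet distance is only $1$ (the peak of $Q$ is matched to the midpoint of $P$'s single edge), which illustrates why the continuous measure handles irregular sampling more gracefully.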

Note that a matching $h\in\mathcal{H}$ with $\max_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert=d_F(\sigma,\tau)$ need not exist. This is due to the fact that in some cases a matching realizing the Fréchet distance would need to match multiple points $p_1,\dots,p_n$ on $\tau$ to a single point $q$ on $\sigma$, which is not possible since matchings are bijections. However, the points $p_1,\dots,p_n$ can be matched arbitrarily close to $q$, realizing $d_F(\sigma,\tau)$ in the limit, which we formalize in the following lemma:

Lemma 2.4.

Let $\sigma,\tau\colon[0,1]\rightarrow\mathbb{R}^d$ be curves and let $r=d_F(\sigma,\tau)$. There exists a sequence $(h_i)_{i=1}^{\infty}$ in $\mathcal{H}$ such that $\lim\limits_{i\to\infty}\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h_i(t))\rVert=r$.

Proof.

Define $\rho\colon\mathcal{H}\rightarrow\mathbb{R}_{\geq 0},\ h\mapsto\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert$, with image $R=\{\rho(h)\mid h\in\mathcal{H}\}$. Per definition, we have $d_F(\sigma,\tau)=\inf R=r$.

For any non-empty subset $X$ of $\mathbb{R}$ that is bounded from below and for every $\varepsilon>0$, there exists an $x\in X$ with $\inf X\leq x<\inf X+\varepsilon$, by definition of the infimum. Since $R\subseteq\mathbb{R}$ and $\inf R$ exists, for every $\varepsilon>0$ there exists an $r'\in R$ with $\inf R\leq r'<\inf R+\varepsilon$.

Now, let $a_i=1/i$, a sequence converging to zero. For every $i\in\mathbb{N}$ there exists an $r_i\in R$ with $r\leq r_i<r+a_i$, thus $\lim\limits_{i\to\infty}r_i=r$.

Let $\rho^{-1}(r')=\{h\in\mathcal{H}\mid\rho(h)=r'\}$ be the preimage of $r'$ under $\rho$. Since every $r'\in R$ lies in the image of $\rho$, we have $\lvert\rho^{-1}(r')\rvert\geq 1$ for each $r'\in R$. Now, for $i\in\mathbb{N}$, let $h_i$ be an arbitrary element of $\rho^{-1}(r_i)$. By definition, it holds that

\lim\limits_{i\to\infty}\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h_{i}(t))\rVert=\lim_{i\to\infty}\rho(h_{i})=\lim_{i\to\infty}r_{i}=r=\inf R,

which proves the claim. ∎

Now we introduce the classes of curves we are interested in.

Definition 2.5 (polygonal curve classes).

For $d\in\mathbb{N}$, we define by $\mathbb{X}^d$ the set of equivalence classes of polygonal curves (where two curves are equivalent iff they can be made identical by a reparameterization) in ambient space $\mathbb{R}^d$. For $m\in\mathbb{N}$ we define by $\mathbb{X}^d_m$ the subclass of polygonal curves of complexity at most $m$.

Simplification is a fundamental problem related to curves, which appears as a sub-problem in our algorithms.

Definition 2.6 (minimum-error \ell-simplification).

For a polygonal curve $\tau\in\mathbb{X}^d$ we denote by $\operatorname{simpl}(\alpha,\tau)$ an $\alpha$-approximate minimum-error $\ell$-simplification of $\tau$, i.e., a curve $\sigma\in\mathbb{X}^d_\ell$ with $d_F(\tau,\sigma)\leq\alpha\cdot d_F(\tau,\sigma')$ for all $\sigma'\in\mathbb{X}^d_\ell$.

Now we define the $(k,\ell)$-median clustering problem for polygonal curves.

Definition 2.7 ((k,)(k,\ell)-median clustering).

The $(k,\ell)$-median clustering problem is defined as follows, where $k,\ell\in\mathbb{N}$ are fixed (constant) parameters of the problem: given a finite and non-empty set $T\subset\mathbb{X}^d_m$ of polygonal curves, compute a set of $k$ curves $C^{\ast}\subset\mathbb{X}^d_\ell$, such that $\operatorname{cost}(T,C^{\ast})=\sum\limits_{\tau\in T}\min\limits_{c^{\ast}\in C^{\ast}}d_F(\tau,c^{\ast})$ is minimal.

We call $\operatorname{cost}(\cdot,\cdot)$ the objective function, and we often write $\operatorname{cost}(T,c)$ as shorthand for $\operatorname{cost}(T,\{c\})$. The following theorem of Indyk [23] is useful for evaluating the cost of a curve at hand.

Theorem 2.8.

[23, Theorem 31] Let $\varepsilon\in(0,1]$ and let $T\subset\mathbb{X}^d$ be a set of polygonal curves. Further, let $W$ be a non-empty sample, drawn uniformly and independently at random from $T$, with replacement. For $\tau,\sigma\in T$ with $\operatorname{cost}(T,\tau)>(1+\varepsilon)\operatorname{cost}(T,\sigma)$ it holds that $\Pr[\operatorname{cost}(W,\tau)\leq\operatorname{cost}(W,\sigma)]<\exp\left(-{\varepsilon^2\lvert W\rvert}/{64}\right)$.
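The objective function and the sample-based evaluation behind Theorem 2.8 can be sketched in a toy setting (ours): one-dimensional points stand in for curves and the absolute difference stands in for the Fréchet distance.

```python
import random

def cost(T, C, dist):
    """cost(T, C): every input object pays its distance to the
    nearest center; the costs are summed (Definition 2.7)."""
    return sum(min(dist(tau, c) for c in C) for tau in T)

# Toy instance: 1-d "curves", distance = absolute difference.
dist = lambda a, b: abs(a - b)
T = [i / 100 for i in range(100)]   # points in [0, 1]
sigma, tau = 0.5, 5.0               # a good and a bad center

# Instead of evaluating against all of T, compare both candidates
# on a small uniform sample W, in the spirit of Theorem 2.8.
random.seed(0)
W = [random.choice(T) for _ in range(64)]
cost(W, [sigma], dist) < cost(W, [tau], dist)  # → True
```

In general such a sample comparison can fail, but by Theorem 2.8 only with probability less than $\exp(-\varepsilon^2\lvert W\rvert/64)$; in this toy instance every element of $T$ is strictly closer to `sigma`, so the comparison is correct for any sample.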

The following concentration bound applies in particular to independent Bernoulli trials, which are the special case of Poisson trials in which every trial has the same probability of success. Kumar et al. [25] use this to bound the probability that a subset $S'$ of an independent and uniform sample $S$ from a set $T$ is entirely contained in a subset $T'$ of $T$. They call this superset-sampling.

Lemma 2.9 (Chernoff bound for independent Poisson trials).

[27, Theorem 4.5] Let $X_1,\dots,X_n$ be independent Poisson trials. For $\delta\in(0,1)$ it holds that

\Pr\left[\sum_{i=1}^{n}X_{i}\leq(1-\delta)\operatorname{E}\left[\sum_{i=1}^{n}X_{i}\right]\right]\leq\exp\left(-\frac{\delta^{2}}{2}\operatorname{E}\left[\sum_{i=1}^{n}X_{i}\right]\right).
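As a quick numeric sanity check of Lemma 2.9 (the instance is ours), one can compare the bound with the exact lower tail of a binomial distribution, i.e., the case of identical Bernoulli trials:

```python
import math
from math import comb

def chernoff_bound(n, p, delta):
    """The right-hand side of Lemma 2.9 for n Bernoulli(p) trials,
    whose expected sum is mu = n * p."""
    return math.exp(-(delta ** 2) / 2 * n * p)

def exact_lower_tail(n, p, delta):
    """Pr[X <= (1 - delta) * n * p] for X ~ Binomial(n, p), exactly."""
    cutoff = (1 - delta) * n * p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1) if i <= cutoff)

n, p, delta = 100, 0.5, 0.5
exact_lower_tail(n, p, delta) <= chernoff_bound(n, p, delta)  # → True
```

For $n=100$, $p=1/2$ and $\delta=1/2$ the bound evaluates to $\exp(-6.25)\approx 0.0019$, while the exact tail is several orders of magnitude smaller; the bound is loose but suffices for the sampling arguments.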

3 Simple and Fast $34$-Approximation for $(1,\ell)$-Median

Here, we present Algorithm 1, a $34$-approximation algorithm for $(1,\ell)$-median clustering, which is based on the following facts. We can obtain a $3$-approximate solution to the $(1,\ell)$-median for a given set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves in terms of objective value by uniformly and independently sampling a sufficient number of curves from $T$: w.h.p., we obtain one of the input curves that are within distance $2\operatorname{cost}(T,c^{\ast})/n$ of an optimal $(1,\ell)$-median $c^{\ast}$ for $T$. There are at least $n/2$ such curves by an averaging argument, and each of them has cost at most $3\operatorname{cost}(T,c^{\ast})$ by the triangle inequality. The sample has size depending only on a parameter determining the failure probability, and we can improve the running-time even further by using Theorem 2.8 and evaluating the cost of each curve in the sample of candidates against another sample of similar size instead of against the complete input. However, we then have to accept an approximation factor of $5$ (setting $\varepsilon=1$ in Theorem 2.8). That is indeed acceptable, since we only obtain an approximate solution in terms of objective value and completely ignore the bound on the number of vertices of the center curve; this is a disadvantage of the approach and means that the lower bound of $\operatorname{cost}(T,c^{\ast})$ need not hold (if $\ell<m$). To fix this, we simplify the candidate curve that evaluated best against the second sample, using an efficient approximation algorithm for the minimum-error $\ell$-simplification, which degrades the approximation factor to $6+7\alpha$, where $\alpha$ is the approximation factor of the minimum-error $\ell$-simplification.

However, Algorithm 1 is very fast in terms of the input size; indeed, it has worst-case running-time independent of $n$ and sub-quartic in $m$. The purpose of Algorithm 1 is to provide an approximate median for a given set of polygonal curves: the bicriteria approximation algorithms (Algorithms 2 and 4), which we present afterwards and which are capable of generating center curves with up to $2\ell-2$ vertices, need an approximate median (and its approximation factor) to bound the optimal objective value. Furthermore, there is a case where Algorithms 2 and 4 may fail to provide a good approximation, but it can be proven that the result of Algorithm 1 is then a very good approximation, which can be used instead.

Algorithm 1 $(1,\ell)$-Median by Simplification
1:procedure $(1,\ell)$-Median-$34$-Approximation($T=\{\tau_1,\dots,\tau_n\}$, $\delta$)
2:     $S\leftarrow$ sample $\lceil 2(\ln(2)-\ln(\delta))\rceil$ curves from $T$ uniformly and independently with replacement
3:     $\gamma\leftarrow\lceil-64(\ln(\delta)-\ln(\lceil 4(\ln(2)-\ln(\delta))\rceil))\rceil$
4:     $W\leftarrow$ sample $\gamma$ curves from $T$ uniformly and independently with replacement
5:     $t\leftarrow$ arbitrary elem. from $\operatorname*{arg\,min}\limits_{s\in S}\operatorname{cost}(W,s)$
6:     return $\operatorname{simpl}(\alpha,t)$ $\triangleright$ E.g. combining [4, 22]
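Algorithm 1 translates almost line by line into code. The following is a hedged sketch (the `cost` and `simplify` callables are assumptions standing in for the Fréchet-distance cost and for the $\alpha$-approximate minimum-error $\ell$-simplification of [4, 22]; the inner ceiling of $\gamma$ is read as $\lceil 4(\ln(2)-\ln(\delta))\rceil$, as in the proof of Theorem 3.1):

```python
import math
import random

def median_by_simplification(T, delta, cost, simplify):
    """Sketch of Algorithm 1: pick the curve from a small sample S
    that evaluates best against a second sample W, then simplify it."""
    size_S = math.ceil(2 * (math.log(2) - math.log(delta)))
    S = [random.choice(T) for _ in range(size_S)]
    gamma = math.ceil(-64 * (math.log(delta) -
            math.log(math.ceil(4 * (math.log(2) - math.log(delta))))))
    W = [random.choice(T) for _ in range(gamma)]
    t = min(S, key=lambda s: cost(W, s))   # arg min over the sample
    return simplify(t)

# Toy run: numbers as "curves", absolute difference as distance,
# identity as simplification.
random.seed(0)
T = [0.0, 0.1, 0.2, 0.3, 5.0]
c = median_by_simplification(
    T, 0.1,
    cost=lambda W, s: sum(abs(w - s) for w in W),
    simplify=lambda t: t)
```

Note that both sample sizes depend only on $\delta$, not on $n$, which is the source of the running-time independence of $n$ discussed above.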

Next, we prove the quality of approximation of Algorithm 1.

Theorem 3.1.

Given a parameter $\delta\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, Algorithm 1 returns, with probability at least $1-\delta$, a polygonal curve $c\in\mathbb{X}^d_\ell$ such that $\operatorname{cost}(T,c^{\ast})\leq\operatorname{cost}(T,c)\leq(6+7\alpha)\cdot\operatorname{cost}(T,c^{\ast})$, where $c^{\ast}$ is an optimal $(1,\ell)$-median for $T$ and $\alpha$ is the approximation-factor of the utilized minimum-error $\ell$-simplification approximation algorithm.

Proof.

First, we know that $d_F(\tau,\operatorname{simpl}(\alpha,\tau))\leq\alpha\cdot d_F(\tau,c^{\ast})$ for each $\tau\in T$.

Now, there are at least $\frac{n}{2}$ curves in $T$ that are within distance at most $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ of $c^{\ast}$; otherwise the cost of the remaining curves alone would exceed $\operatorname{cost}(T,c^{\ast})$, a contradiction. Hence each $s\in S$ has probability at least $\frac{1}{2}$ to be within distance $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ of $c^{\ast}$.

Since the elements of $S$ are sampled independently, we conclude that the probability that every $s\in S$ has distance to $c^{\ast}$ greater than $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ is at most $(1-\frac{1}{2})^{\lvert S\rvert}\leq\exp\left(-\frac{2(\ln(2)-\ln(\delta))}{2}\right)=\frac{\delta}{2}$.

Now, assume there is an $s\in S$ with $d_F(s,c^{\ast})\leq\frac{2\operatorname{cost}(T,c^{\ast})}{n}$. We do not want any $t\in S\setminus\{s\}$ with $\operatorname{cost}(T,t)>2\operatorname{cost}(T,s)$ to have $\operatorname{cost}(W,t)\leq\operatorname{cost}(W,s)$. Using Theorem 2.8, we conclude that this happens with probability at most

\exp\left(-\frac{-64(\ln(\delta)-\ln(\lceil 4(\ln(2)-\ln(\delta))\rceil))}{64}\right)\leq\frac{\delta}{\lceil 4(\ln(2)-\ln(\delta))\rceil}\leq\frac{\delta}{2\lvert S\rvert},

for each $t\in S\setminus\{s\}$.

Using a union bound over all bad events, we conclude that with probability at least $1-\delta$, Algorithm 1 samples a curve $s\in S$ with $d_F(s,c^{\ast})\leq 2\operatorname{cost}(T,c^{\ast})/n$ and returns the simplification $c=\operatorname{simpl}(\alpha,t)$ of a curve $t\in S$ with $\operatorname{cost}(T,t)\leq 2\operatorname{cost}(T,s)$. The triangle-inequality yields

\sum_{\tau\in T}(d_{F}(t,c^{\ast})-d_{F}(\tau,c^{\ast}))\leq\sum_{\tau\in T}d_{F}(t,\tau)\leq 2\sum_{\tau\in T}d_{F}(s,\tau)\leq 2\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s)),

which is equivalent to

n\cdot d_{F}(t,c^{\ast})\leq 2\operatorname{cost}(T,c^{\ast})+\operatorname{cost}(T,c^{\ast})+2n\frac{2\operatorname{cost}(T,c^{\ast})}{n}\ \Leftrightarrow\ d_{F}(t,c^{\ast})\leq\frac{7\operatorname{cost}(T,c^{\ast})}{n}.

Hence, we have

\operatorname{cost}(T,c)=\sum_{\tau\in T}d_{F}(\tau,\operatorname{simpl}(\alpha,t))\leq\sum_{\tau\in T}(d_{F}(\tau,t)+d_{F}(t,\operatorname{simpl}(\alpha,t)))
\leq 2\operatorname{cost}(T,s)+\sum_{\tau\in T}\alpha\cdot d_{F}(t,c^{\ast})\leq 2\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s))+7\alpha\cdot\operatorname{cost}(T,c^{\ast})
\leq 2\operatorname{cost}(T,c^{\ast})+4\operatorname{cost}(T,c^{\ast})+7\alpha\cdot\operatorname{cost}(T,c^{\ast})=(6+7\alpha)\operatorname{cost}(T,c^{\ast}).

The lower bound $\operatorname{cost}(T,c^{\ast})\leq\operatorname{cost}(T,c)$ follows from the fact that the returned curve has at most $\ell$ vertices and that $c^{\ast}$ has minimum cost among all curves with at most $\ell$ vertices. ∎

The following lemma enables us to obtain a concrete approximation-factor and worst-case running-time of Algorithm 1.

Lemma 3.2 (Buchin et al. [9, Lemma 7.1]).

Given a curve $\sigma\in\mathbb{X}^d_m$, a $4$-approximate minimum-error $\ell$-simplification can be computed in $O(m^3\log m)$ time.

The simplification algorithm used for obtaining this statement is a combination of the algorithm by Imai and Iri [22] and the algorithm by Alt and Godau [4]. Combining Theorem 3.1 and Lemma 3.2, we obtain the following corollary.

Corollary 3.3.

Given a parameter $\delta\in(0,1)$ and a set $T\subset\mathbb{X}^d_m$ of polygonal curves, Algorithm 1 returns with probability at least $1-\delta$ a polygonal curve $c\in\mathbb{X}^d_\ell$ such that $\operatorname{cost}(T,c^\ast)\leq\operatorname{cost}(T,c)\leq 34\cdot\operatorname{cost}(T,c^\ast)$, where $c^\ast$ is an optimal $(1,\ell)$-median for $T$, in time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$, when the algorithms by Imai and Iri [22] and Alt and Godau [4] are combined for $\ell$-simplification.

Proof.

We use Lemma 3.2 together with Theorem 3.1, which yields an approximation factor of $34$. Now, drawing the first sample takes time $O(-\ln\delta)$. Drawing the second sample also takes time $O(-\ln\delta)$ and evaluating the samples against each other takes time $O(m^2\log(m)(-\ln^2\delta))$. Simplifying one of the curves that evaluates best takes time $O(m^3\log m)$. We conclude that Algorithm 1 has running-time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$. ∎
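Algorithm 1 itself is not reproduced in this section, but the sample-and-evaluate pattern behind it (draw a candidate sample, draw a witness sample, keep the candidate of minimum cost over the witnesses) can be sketched generically. The following Python sketch is only an illustration under our own naming; the generic metric `dist` stands in for the Fréchet distance, which for curves would be computed, e.g., with the Alt–Godau algorithm.

```python
import random

def best_candidate(data, dist, sample_size, witness_size, rng=random):
    """Sample-and-evaluate: draw candidates and witnesses uniformly with
    replacement and return the candidate of minimum cost over the witnesses."""
    candidates = [rng.choice(data) for _ in range(sample_size)]
    witnesses = [rng.choice(data) for _ in range(witness_size)]
    # cost of a candidate = sum of distances to the witness sample
    return min(candidates, key=lambda c: sum(dist(c, w) for w in witnesses))

# Toy usage: points on the real line with the absolute difference as metric.
data = [0.0, 1.0, 1.1, 0.9, 10.0]
c = best_candidate(data, lambda a, b: abs(a - b),
                   sample_size=5, witness_size=5, rng=random.Random(0))
```

The returned element is then simplified in Algorithm 1; here it is simply one of the input elements.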

4 $(3+\varepsilon)$-Approximation for $(1,\ell)$-Median by Simple Shortcutting

Here, we present Algorithm 2, which returns a set of candidates that, with high probability, contains a $(3+\varepsilon)$-approximate $(1,\ell)$-median of complexity at most $2\ell-2$ for any cluster contained in the input that takes up a constant fraction of the input. Algorithm 2 can be used as a plugin in our generalized version (Algorithm 5, Section 7) of the algorithm by Ackermann et al. [2].

In contrast to Nath and Taylor [28], we cannot use the property that the vertices of a median must be found in the balls of radius $d_F(\tau,c^\ast)$ centered at the vertices of an input curve $\tau\in T$, where $c^\ast$ is an optimal $(1,\ell)$-median for the given input $T$. This is an immediate consequence of using the continuous Fréchet distance.

We circumvent this by proving the following shortcutting lemmata. We start with the simplest, which states that we can indeed search the aforementioned balls if we accept a resulting curve of complexity at most $2\ell-2$. See Fig. 2 for a visualization.

Figure 2: Visualization of a simple shortcut. The black curve is an input curve that is close to an optimal median, which is depicted in red. By inserting the blue shortcut we can find a curve that has the same distance to the black curve as the median, but with all vertices contained in the balls centered at the black curve's vertices.
Lemma 4.1 (shortcutting using a single polygonal curve).

Let $\sigma,\tau\in\mathbb{X}^d$ be polygonal curves. Let $v^\tau_1,\dots,v^\tau_{|\tau|}$ be the vertices of $\tau$ and let $r=d_F(\sigma,\tau)$. There exists a polygonal curve $\sigma'\in\mathbb{X}^d$ with every vertex contained in at least one of $B(v^\tau_1,r),\dots,B(v^\tau_{|\tau|},r)$, with $d_F(\sigma',\tau)\leq d_F(\sigma,\tau)$ and $|\sigma'|\leq 2|\sigma|-2$.

Proof.

Let $v^\sigma_1,\dots,v^\sigma_{|\sigma|}$ be the vertices of $\sigma$. Further, let $t^\sigma_1,\dots,t^\sigma_{|\sigma|}$ and $t^\tau_1,\dots,t^\tau_{|\tau|}$ be the instants of $\sigma$ and $\tau$, respectively. Also, for $h\in\mathcal{H}$ (recall that $\mathcal{H}$ is the set of all continuous bijections $h\colon[0,1]\to[0,1]$ with $h(0)=0$ and $h(1)=1$), let $r_h=\max_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert$ be the distance realized by $h$. We know from Lemma 2.4 that there exists a sequence $(h_x)_{x=1}^\infty$ in $\mathcal{H}$ such that $\lim_{x\to\infty}r_{h_x}=d_F(\sigma,\tau)=r$.

Now, fix an arbitrary $h\in\mathcal{H}$ and assume there is a vertex $v^\sigma_i$ of $\sigma$, with instant $t^\sigma_i$, that is not contained in any of $B(v^\tau_1,r_h),\dots,B(v^\tau_{|\tau|},r_h)$. Let $j$ be the maximum of $\{1,\dots,|\tau|-1\}$ such that $t^\tau_j\leq h(t^\sigma_i)\leq t^\tau_{j+1}$, so $v^\sigma_i$ is matched to $\overline{\tau(t^\tau_j)\tau(t^\tau_{j+1})}$ by $h$. We modify $\sigma$ in such a way that $v^\sigma_i$ is replaced by two new vertices that are elements of $B(v^\tau_j,r_h)$ and $B(v^\tau_{j+1},r_h)$, respectively.

Namely, let $t^-$ be the maximum of $[0,t^\sigma_i)$ such that $\sigma(t^-)\in B(v^\tau_j,r_h)$ and let $t^+$ be the minimum of $(t^\sigma_i,1]$ such that $\sigma(t^+)\in B(v^\tau_{j+1},r_h)$. These are the instants when $\sigma$ leaves $B(v^\tau_j,r_h)$ before visiting $v^\sigma_i$ and when $\sigma$ enters $B(v^\tau_{j+1},r_h)$ after visiting $v^\sigma_i$, respectively. Let $\sigma'_h$ be the piecewise-defined curve that coincides with $\sigma$ on $[0,t^-]$ and $[t^+,1]$, but on $(t^-,t^+)$ connects $\sigma(t^-)$ and $\sigma(t^+)$ with the line segment $s(t)=\left(1-\frac{t-t^-}{t^+-t^-}\right)\sigma(t^-)+\frac{t-t^-}{t^+-t^-}\sigma(t^+)$.

We know that $\lVert\sigma(t^-)-\tau(h(t^-))\rVert\leq r_h$ and $\lVert\sigma(t^+)-\tau(h(t^+))\rVert\leq r_h$. Note that $t^\tau_j\leq h(t^-)$ and $h(t^+)\leq t^\tau_{j+1}$, since $\sigma(t^-)$ and $\sigma(t^+)$ are the points closest to $v^\sigma_i$ on $\sigma$ that have distance $r_h$ to $v^\tau_j$ and $v^\tau_{j+1}$, respectively, by definition. Therefore, $\tau$ has no vertices between the instants $h(t^-)$ and $h(t^+)$. Now, $h$ can be used to match $\sigma'_h|_{[0,t^-)}$ to $\tau|_{[0,h(t^-))}$ and $\sigma'_h|_{(t^+,1]}$ to $\tau|_{(h(t^+),1]}$ with distance at most $r_h$. Since $\sigma'_h|_{[t^-,t^+]}$ and $\tau|_{[h(t^-),h(t^+)]}$ are just line segments, they can be matched to each other with distance at most $\max\{\lVert\sigma'_h(t^-)-\tau(h(t^-))\rVert,\lVert\sigma'_h(t^+)-\tau(h(t^+))\rVert\}\leq r_h$. We conclude that $d_F(\sigma'_h,\tau)\leq r_h$.

Because this modification works for every $h\in\mathcal{H}$, we have $d_F(\sigma'_h,\tau)\leq r_h$ for every $h\in\mathcal{H}$. Thus, $\lim_{x\to\infty}d_F(\sigma'_{h_x},\tau)\leq d_F(\sigma,\tau)=r$.

Now, to prove the claim, for every $h\in\mathcal{H}$ we apply this modification to $v^\sigma_i$ and successively to every other vertex $v^{\sigma'_h}_i$ of the resulting curve $\sigma'_h$ that is not contained in one of the balls, until every vertex of $\sigma'_h$ is contained in a ball. Note that the modification is repeated at most $|\sigma|-2$ times for every $h\in\mathcal{H}$, since the start and end vertex of $\sigma$ must be contained in $B(v^\tau_1,r_h)$ and $B(v^\tau_{|\tau|},r_h)$, respectively. Since only the $|\sigma|-2$ inner vertices can fail to lie in a ball, and each such vertex is replaced by two new vertices, the number of vertices of every $\sigma'_h$ can be bounded by $2(|\sigma|-2)+2$. Thus, $|\sigma'_h|\leq 2|\sigma|-2$. ∎
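For polygonal curves, the instants $t^-$ and $t^+$ can be computed segment by segment: each is a parameter at which a line segment of $\sigma$ crosses the boundary of a ball $B(v,r_h)$, which reduces to solving a quadratic equation. The following sketch is our own helper, not part of the paper's algorithms:

```python
import math

def segment_ball_crossings(p, q, center, r):
    """Parameters t in [0, 1] at which the segment p + t*(q - p) crosses the
    boundary of B(center, r); solves ||p + t*(q - p) - center|| = r."""
    d = [qi - pi for pi, qi in zip(p, q)]       # segment direction
    f = [pi - ci for pi, ci in zip(p, center)]  # p relative to the center
    a = sum(di * di for di in d)
    b = 2 * sum(di * fi for di, fi in zip(d, f))
    c = sum(fi * fi for fi in f) - r * r
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        return []                               # degenerate segment or no crossing
    s = math.sqrt(disc)
    return sorted(t for t in ((-b - s) / (2 * a), (-b + s) / (2 * a))
                  if 0 <= t <= 1)

# The segment from (0,0) to (4,0) crosses the unit ball around (2,0)
# entering at t = 0.25 and leaving at t = 0.75.
```

Scanning the segments of $\sigma$ backward from $t^\sigma_i$ for the last crossing of $B(v^\tau_j,r_h)$, and forward for the first crossing of $B(v^\tau_{j+1},r_h)$, yields $t^-$ and $t^+$.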

We now present Algorithm 2, which works similarly to Algorithm 1 but uses shortcutting instead of simplification. As a consequence, we can achieve an approximation factor of $3+\varepsilon$ instead of a factor of $2+\varepsilon+\alpha$, where $\alpha\geq 1$ is the approximation factor of the simplification algorithm used in Algorithm 1. To achieve an approximation factor of $3+\varepsilon$ using simplification, one would need to compute the optimal minimum-error $\ell$-simplifications of the input curves, and to the best of our knowledge, there is no such algorithm for the continuous Fréchet distance.

In contrast to Algorithm 1, Algorithm 2 utilizes the superset-sampling technique by Kumar et al. [25], i.e., the concentration bound in Lemma 2.9, to obtain an approximate $(1,\ell)$-median for a cluster $T'$ contained in the input $T$ that takes up a constant fraction of $T$. Therefore, it has running-time exponential in the size of the sample $S$. A further difference is that we need an upper and a lower bound on the cost of an optimal $(1,\ell)$-median for $T'$ to properly set up the grids we use for shortcutting. The lower bound can be obtained by a simple estimate using Markov's inequality. For the upper bound we utilize a case distinction, which guarantees that if we fail to obtain an upper bound on the optimal cost, then the result of Algorithm 1 is a good approximation (factor $2+\varepsilon$) and can be used instead of a best curve obtained by shortcutting.

Algorithm 2 has several parameters: $\beta$ determines the size (as a fraction of the input) of the smallest cluster inside the input for which an approximate median can be computed, $\delta$ determines the failure probability of the algorithm, and $\varepsilon$ determines the approximation factor.

Algorithm 2 $(1,\ell)$-Median for Subset by Simple Shortcutting
1: procedure $(1,\ell)$-Median-$(3+\varepsilon)$-Candidates($T=\{\tau_1,\dots,\tau_n\},\beta,\delta,\varepsilon$)
2:     $\varepsilon'\leftarrow\varepsilon/3$, $C\leftarrow\emptyset$
3:     $S\leftarrow$ sample $\lceil-8\beta(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil$ curves from $T$ uniformly and independently with replacement
4:     for $S'\subseteq S$ with $|S'|=\frac{|S|}{2\beta}$ do
5:         $c\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($S',\delta/4$) ▷ Algorithm 1
6:         $\Delta\leftarrow\operatorname{cost}(S',c)$, $\Delta_l\leftarrow\frac{\delta n}{2|S|}\frac{\Delta}{34}$, $\Delta_u\leftarrow\frac{1}{\varepsilon'}\Delta$, $C\leftarrow C\cup\{c\}$
7:         for $s\in S'$ do
8:             $P\leftarrow\emptyset$
9:             for $i\in\{1,\dots,|s|\}$ do
10:                 $P\leftarrow P\cup\mathbb{G}\left(B\left(v^s_i,(1+\varepsilon')\Delta_u\right),\frac{2\varepsilon'}{n\sqrt{d}}\Delta_l\right)$ ▷ $v^s_i$: $i$th vertex of $s$
11:             $C\leftarrow C\cup{}$ set of all polygonal curves with $2\ell-2$ vertices from $P$
12:     return $C$
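The grid cover $\mathbb{G}(B(v,r),w)$ in line 10 can be realized by enumerating the points of a uniform lattice of cell width $w$ that hit the ball; every point of the ball is then within distance $\frac{\sqrt{d}}{2}w$ of an enumerated point. The following sketch is one possible realization under our assumptions (the precise definition of $\mathbb{G}$ is given in the paper's preliminaries, which are not reproduced here):

```python
import itertools
import math

def grid_cover(center, r, w):
    """Points of the lattice w * Z^d, restricted to the bounding box of
    B(center, r), that lie in the closed ball B(center, r)."""
    ranges = [range(math.floor((c - r) / w), math.ceil((c + r) / w) + 1)
              for c in center]
    return [tuple(i * w for i in idx)
            for idx in itertools.product(*ranges)
            if math.dist(tuple(i * w for i in idx), center) <= r]

# For the unit ball around the origin in the plane and cell width 0.5,
# exactly the lattice points (0.5*i, 0.5*j) with i^2 + j^2 <= 4 are kept.
```

The number of enumerated points is $O((r/w)^d)$, which is where the grid-size term in the running-time analysis below comes from.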

We prove the quality of approximation of Algorithm 2.

Theorem 4.2.

Given three parameters $\beta\in[1,\infty)$, $\delta,\varepsilon\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, with probability at least $1-\delta$ the set of candidates that Algorithm 2 returns contains a $(3+\varepsilon)$-approximate $(1,\ell)$-median with up to $2\ell-2$ vertices for any $T'\subseteq T$ with $|T'|\geq\frac{1}{\beta}|T|$.

Proof.

We assume that $|T'|\geq\frac{1}{\beta}|T|$. Let $n'$ be the number of sampled curves in $S$ that are elements of $T'$. Clearly, $\operatorname{E}[n']\geq\sum_{i=1}^{|S|}\frac{1}{\beta}=\frac{|S|}{\beta}$. Also, $n'$ is a sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

\[
\Pr\left[n'<\frac{|S|}{2\beta}\right]\leq\Pr\left[n'<\frac{1}{2}\operatorname{E}[n']\right]\leq\exp\left(-\frac{1}{4}\frac{|S|}{2\beta}\right)\leq\exp\left(\frac{\ln(\delta)-\ln(4)}{\varepsilon}\right)=\left(\frac{\delta}{4}\right)^{\frac{1}{\varepsilon}}\leq\frac{\delta}{4}.
\]

In other words, with probability at most $\delta/4$ no subset $S'\subseteq S$ of cardinality at least $\frac{|S|}{2\beta}$ is a subset of $T'$. We condition the rest of the proof on the contrary event, denoted by $\mathcal{E}_{T'}$, namely that there is a subset $S'\subseteq S$ with $S'\subseteq T'$ and $|S'|\geq\frac{|S|}{2\beta}$. Note that $S'$ is then a uniform and independent sample of $T'$.

Now, let $c^\ast\in\operatorname*{arg\,min}_{c\in\mathbb{X}^d_\ell}\operatorname{cost}(T',c)$ be an optimal $(1,\ell)$-median for $T'$. The expected distance between $s\in S'$ and $c^\ast$ is

\[
\operatorname{E}\left[d_F(s,c^\ast)\mid\mathcal{E}_{T'}\right]=\sum_{\tau\in T'}d_F(c^\ast,\tau)\cdot\frac{1}{|T'|}=\frac{\operatorname{cost}(T',c^\ast)}{|T'|}.
\]

By linearity we have $\operatorname{E}[\operatorname{cost}(S',c^\ast)\mid\mathcal{E}_{T'}]=\frac{|S'|}{|T'|}\operatorname{cost}(T',c^\ast)$. Markov's inequality yields:

\[
\Pr\left[\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)>\operatorname{cost}(T',c^\ast)\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{\delta}{4}.
\]

We conclude that with probability at most $\delta/4$ we have $\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)>\operatorname{cost}(T',c^\ast)$.

Using Markov's inequality again, for every $s\in S'$ we have

\[
\Pr\left[d_F(s,c^\ast)>(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{1}{1+\varepsilon},
\]

therefore by independence

\[
\Pr\left[\bigwedge_{s\in S'}\left(d_F(s,c^\ast)>(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\right)\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{1}{(1+\varepsilon)^{|S'|}}\leq\exp\left(-\frac{\varepsilon}{2}\frac{|S|}{2\beta}\right).
\]

Hence, with probability at most $\exp\left(-\frac{\varepsilon\left\lceil-\frac{8\beta(\ln(\delta)-\ln(4))}{\varepsilon}\right\rceil}{4\beta}\right)\leq\delta^2/16\leq\delta/4$ there is no $s\in S'$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$. Also, with probability at most $\delta/4$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $c\in\mathbb{X}^d_\ell$ for $S'$, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least $1-\delta$ all of the following events occur simultaneously:

  • There is a subset $S'\subseteq S$ of cardinality at least $|S|/(2\beta)$ that is a uniform and independent sample of $T'$,

  • there is a curve $s\in S'$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$,

  • Algorithm 1 computes a polygonal curve $c\in\mathbb{X}^d_\ell$ with $\operatorname{cost}(S',c^\ast_{S'})\leq\operatorname{cost}(S',c)\leq 34\operatorname{cost}(S',c^\ast_{S'})$, where $c^\ast_{S'}\in\mathbb{X}^d_\ell$ is an optimal $(1,\ell)$-median for $S'$,

  • and it holds that $\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)\leq\operatorname{cost}(T',c^\ast)$.

Since $c^\ast_{S'}$ is an optimal $(1,\ell)$-median for $S'$, we get the following from the last two items:

\[
\operatorname{cost}(T',c^\ast)\geq\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)\geq\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast_{S'})\geq\frac{\delta|T'|}{4|S'|}\frac{\operatorname{cost}(S',c)}{34}.
\]

We now distinguish between two cases:

Case 1: $d_F(c,c^\ast)\geq(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$

The triangle inequality yields

\begin{align*}
d_F(c,s) &\geq d_F(c,c^\ast)-d_F(c^\ast,s)\geq d_F(c,c^\ast)-(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\\
&\geq(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}-(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}=\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}.
\end{align*}

As a consequence, since $s\in S'$, we have $\operatorname{cost}(S',c)\geq d_F(c,s)\geq\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$, hence $\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\leq\frac{1}{\varepsilon}\operatorname{cost}(S',c)$.

Now, let $v^s_1,\dots,v^s_{|s|}$ be the vertices of $s$. By Lemma 4.1 there exists a polygonal curve $c'$ with up to $2\ell-2$ vertices, every vertex contained in one of $B(v^s_1,d_F(c^\ast,s)),\dots,B(v^s_{|s|},d_F(c^\ast,s))$, and $d_F(s,c')\leq d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\leq(1+\varepsilon)\frac{\operatorname{cost}(S',c)}{\varepsilon}$.

The set of candidates that Algorithm 2 returns contains a curve $c''$ with up to $2\ell-2$ vertices from the union of the grid covers, with distance at most $\frac{\varepsilon\frac{2\delta n}{4|S'|}\operatorname{cost}(S',c)}{n}\leq\frac{\varepsilon\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c)}{|T'|}\leq\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$ between every corresponding pair of vertices of $c'$ and $c''$. We conclude that $d_F(c',c'')\leq\frac{\varepsilon\operatorname{cost}(T',c^\ast)}{|T'|}$.

We can now bound the cost of c′′c^{\prime\prime} as follows:

\begin{align*}
\operatorname{cost}(T',c'') &= \sum_{\tau\in T'}d_F(\tau,c'')\leq\sum_{\tau\in T'}\left(d_F(\tau,c')+\frac{\varepsilon\operatorname{cost}(T',c^\ast)}{|T'|}\right)\\
&\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,c'))+\varepsilon\operatorname{cost}(T',c^\ast)\\
&\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,s)+d_F(s,c'))+\varepsilon\operatorname{cost}(T',c^\ast)\leq(3+3\varepsilon)\operatorname{cost}(T',c^\ast).
\end{align*}

Case 2: $d_F(c,c^\ast)<(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$

The cost of cc can easily be bounded:

\[
\operatorname{cost}(T',c)\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,c))<\operatorname{cost}(T',c^\ast)+(1+2\varepsilon)\operatorname{cost}(T',c^\ast)=(2+2\varepsilon)\operatorname{cost}(T',c^\ast).
\]

The claim follows by rescaling $\varepsilon$ by $\frac{1}{3}$. ∎

Next we analyse the worst-case running-time of Algorithm 2 and the number of candidates it returns.

Theorem 4.3.

Algorithm 2 has running-time, and returns a number of candidates, bounded by $2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}+\log(m)\right)}$.

Proof.

The sample $S$ has size $O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right)$ and sampling it takes time $O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right)$. Let $n_S=|S|$. The for-loop runs

\[
\binom{n_S}{\frac{n_S}{2\beta}}\in 2^{O\left(\frac{n_S}{2\beta}\log n_S\right)}\subset 2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}\right)}
\]

times. In each iteration, we run Algorithm 1, taking time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$ (cf. Corollary 3.3), we compute the cost of the returned curve with respect to $S'$, taking time $O\left(\frac{-\ln(\delta)}{\varepsilon}\cdot m\log(m)\right)$, and per curve in $S'$ we build up to $m$ grids of size

\[
\left(\frac{\frac{(1+\varepsilon)\Delta}{\varepsilon}}{\frac{2\varepsilon 2\delta n\Delta}{n\sqrt{d}4|S|}}\right)^d=\left(\frac{\sqrt{d}|S|(1+\varepsilon)}{\varepsilon^2\delta}\right)^d\in O\left(\frac{\beta^d(-\ln\delta)^d}{\varepsilon^{3d}\delta^d}\right)
\]

each. For each curve $s\in S'$, Algorithm 2 then enumerates all combinations of $2\ell-2$ points from these up to $m$ grids, resulting in

\[
O\left(\frac{m^{2\ell-2}\beta^{2\ell d-2d}(-\ln\delta)^{2\ell d-2d}}{\varepsilon^{6\ell d-6d}\delta^{2\ell d-2d}}\right)
\]

candidates per $s\in S'$, per iteration of the for-loop. Thus, Algorithm 2 computes $O(\operatorname{poly}(m,\beta,\delta^{-1},\varepsilon^{-1}))$ candidates per iteration of the for-loop, and the enumeration also takes time $O(\operatorname{poly}(m,\beta,\delta^{-1},\varepsilon^{-1}))$ per iteration of the for-loop.

All in all, we have running-time and number of candidates $2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}+\log(m)\right)}$. ∎

5 More Practical Approximation for $(1,\ell)$-Median by Simple Shortcutting

The following algorithm is a modification of Algorithm 2. It is more practical since it needs to cover only up to $m$ (small) balls using grids. Unfortunately, it is not compatible with the superset-sampling technique and can therefore not be used as a plugin in Algorithm 5.

Algorithm 3 $(1,\ell)$-Median by Simple Shortcutting
1: procedure $(1,\ell)$-Median-$(5+\varepsilon)$($T=\{\tau_1,\dots,\tau_n\},\delta,\varepsilon$)
2:     $\widehat{c}\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($T,\delta/2$) ▷ Algorithm 1
3:     $\Delta\leftarrow\frac{\operatorname{cost}(T,\widehat{c})}{34}$, $\varepsilon'\leftarrow\varepsilon/9$, $P\leftarrow\emptyset$
4:     $S\leftarrow$ sample $\lceil-2(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil$ curves from $T$ uniformly and independently with replacement
5:     $W\leftarrow$ sample $\lceil-64(\varepsilon')^{-2}(\ln(\delta)-\ln(\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil))\rceil$ curves from $T$ uniformly and independently with replacement
6:     $c\leftarrow\operatorname*{arg\,min}_{s\in S}\operatorname{cost}(W,s)$
7:     for $i\in\{1,\dots,|c|\}$ do
8:         $P\leftarrow P\cup\mathbb{G}\left(B\left(v^c_i,\frac{(3+4\varepsilon')}{n}34\Delta\right),\frac{2\varepsilon'\Delta}{n\sqrt{d}}\right)$ ▷ $v^c_i$ is the $i$th vertex of $c$
9:     $C\leftarrow$ set of all polygonal curves with $2\ell-2$ vertices from $P$
10:     return $\operatorname*{arg\,min}_{c'\in C}\operatorname{cost}(T,c')$
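Line 9 of Algorithm 3 (and likewise line 11 of Algorithm 2) enumerates vertex sequences over the grid-point set $P$ as candidate curves, which is where the $|P|^{2\ell-2}$ factor in the running-time comes from. A schematic sketch of this enumeration (our own illustration; vertex sequences stand in for polygonal curves):

```python
import itertools

def enumerate_candidates(P, max_vertices):
    """All vertex sequences of length 2..max_vertices over the point set P,
    each representing a polygonal curve through those vertices."""
    for k in range(2, max_vertices + 1):
        # every ordered choice of k points from P, with repetition
        for vertices in itertools.product(P, repeat=k):
            yield vertices

# With 3 grid points and curves of 2*l - 2 = 2 vertices (l = 2),
# there are 3^2 = 9 candidate curves.
```

Each candidate would then be evaluated against $T$ with Fréchet-distance computations, and the cheapest one returned.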

We prove the quality of approximation of Algorithm 3.

Theorem 5.1.

Given two parameters $\delta,\varepsilon\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, with probability at least $1-\delta$ Algorithm 3 returns a $(5+\varepsilon)$-approximate $(1,\ell)$-median for $T$ with up to $2\ell-2$ vertices.

Proof.

Let $c^\ast\in\operatorname*{arg\,min}_{c\in\mathbb{X}^d_\ell}\operatorname{cost}(T,c)$ be an optimal $(1,\ell)$-median for $T$.

The expected distance between $s\in S$ and $c^\ast$ is

\[
\operatorname{E}[d_F(s,c^\ast)]=\sum_{i=1}^n d_F(c^\ast,\tau_i)\cdot\frac{1}{n}=\frac{\operatorname{cost}(T,c^\ast)}{n}.
\]

Now, using Markov's inequality, for every $s\in S$ we have

\[
\Pr[d_F(s,c^\ast)>(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n]\leq\frac{\operatorname{cost}(T,c^\ast)n^{-1}}{(1+\varepsilon)\operatorname{cost}(T,c^\ast)n^{-1}}=\frac{1}{1+\varepsilon},
\]

therefore by independence

\[
\Pr\left[\bigwedge_{s\in S}(d_F(s,c^\ast)>(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n)\right]\leq\frac{1}{(1+\varepsilon)^{|S|}}\leq\exp\left(-\frac{\varepsilon|S|}{2}\right).
\]

Hence, with probability at most $\exp\left(-\frac{\varepsilon\left\lceil-\frac{2(\ln(\delta)-\ln(4))}{\varepsilon}\right\rceil}{2}\right)\leq\delta/4$ there is no $s\in S$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T,c^\ast)}{n}$. Now, assume there is an $s\in S$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n$. We do not want any $t\in S\setminus\{s\}$ with $d_F(t,c^\ast)>(1+\varepsilon)d_F(s,c^\ast)$ to have $\operatorname{cost}(W,t)\leq\operatorname{cost}(W,s)$. Using Theorem 2.8, we conclude that this happens with probability at most

\[
\exp\left(-\frac{\varepsilon^2\lceil-64\varepsilon^{-2}(\ln(\delta)-\ln(\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil))\rceil}{64}\right)\leq\frac{\delta}{\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil}\leq\frac{\delta}{4|S|},
\]

for each $t\in S\setminus\{s\}$. Also, with probability at most $\delta/2$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $\widehat{c}\in\mathbb{X}^d_\ell$ for $T$, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least $1-\delta$, Algorithm 3 samples a curve $t\in S$ with $\operatorname{cost}(T,t)\leq(1+\varepsilon)\operatorname{cost}(T,s)$ and Algorithm 1 computes a $34$-approximate $(1,\ell)$-median $\widehat{c}\in\mathbb{X}^d_\ell$ for $T$, i.e., $\operatorname{cost}(T,c^\ast)\leq 34\Delta=\operatorname{cost}(T,\widehat{c})\leq 34\operatorname{cost}(T,c^\ast)$. Let $v^t_1,\dots,v^t_{|t|}$ be the vertices of $t$. By Lemma 4.1 there exists a polygonal curve $c'$ with up to $2\ell-2$ vertices, every vertex contained in one of $B(v^t_1,d_F(c^\ast,t)),\dots,B(v^t_{|t|},d_F(c^\ast,t))$, and $d_F(t,c')\leq d_F(t,c^\ast)$. The triangle inequality yields

\[
\sum_{\tau\in T}(d_F(t,c^\ast)-d_F(\tau,c^\ast))\leq\sum_{\tau\in T}d_F(t,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}d_F(s,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}(d_F(\tau,c^\ast)+d_F(c^\ast,s)),
\]

which, since $\varepsilon^2\leq\varepsilon$, implies

\[
n\cdot d_F(t,c^\ast)\leq(2+\varepsilon)\operatorname{cost}(T,c^\ast)+(1+\varepsilon)n(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n\quad\text{and hence}\quad d_F(t,c^\ast)\leq(3+4\varepsilon)\operatorname{cost}(T,c^\ast)/n.
\]

Hence, we have $d_F(t,c')\leq d_F(t,c^\ast)\leq(3+4\varepsilon)\operatorname{cost}(T,c^\ast)/n\leq(3+4\varepsilon)34\Delta/n$.

In the last step, Algorithm 3 returns a curve from the set $C$ of all curves with up to $2\ell-2$ vertices from $P$, the union of the grid covers, that evaluates best. By construction of the grids there is a curve $c''\in C$ with distance at most $\frac{\varepsilon\Delta}{n}\leq\varepsilon\frac{\operatorname{cost}(T,c^\ast)}{n}$ between every corresponding pair of vertices of $c'$ and $c''$. We conclude that $d_F(c',c'')\leq\frac{\varepsilon\Delta}{n}\leq\varepsilon\frac{\operatorname{cost}(T,c^\ast)}{n}$.

We can now bound the cost of $c^{\prime\prime}$ as follows:

\begin{align*}
\operatorname{cost}(T,c^{\prime\prime}) &= \sum_{\tau\in T}d_{F}(\tau,c^{\prime\prime})\leq\sum_{\tau\in T}\left(d_{F}(\tau,c^{\prime})+\frac{\varepsilon\Delta}{n}\right)\leq\sum_{\tau\in T}(d_{F}(\tau,t)+d_{F}(t,c^{\prime}))+\varepsilon\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\operatorname{cost}(T,s)+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s))+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\operatorname{cost}(T,c^{\ast})+(1+\varepsilon)^{2}\operatorname{cost}(T,c^{\ast})+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(5+9\varepsilon)\operatorname{cost}(T,c^{\ast}).
\end{align*}

The claim follows by rescaling $\varepsilon$ by $\frac{1}{9}$. ∎

We analyse the worst-case running-time of Algorithm 3.

Theorem 5.2.

Algorithm 3 has running-time $O\left(\frac{nm^{2\ell-1}\log(m)}{\varepsilon^{(2\ell-2)d}}+\frac{m^{3}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$.

Proof.

Algorithm 1 has running-time $O\left(m^{2}\log(m)(-\ln(\delta))^{2}+m^{3}\log(m)\right)$. The sample $S$ has size $O\left(\frac{-\ln(\delta)}{\varepsilon}\right)$ and the sample $W$ has size $O\left(\frac{-\ln(\delta)}{\varepsilon^{2}}\right)$. Evaluating each curve of $S$ against $W$ takes time $O\left(\frac{m^{2}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$, using the algorithm of Alt and Godau [4] to compute the distances.

Now, $c$ has up to $m$ vertices and every grid consists of $\left(\frac{(3+\varepsilon)\Delta/n}{2\varepsilon^{\prime}\Delta/(nc\sqrt{d})}\right)^{d}=\left(\frac{(3+\varepsilon)c\sqrt{d}}{2\varepsilon^{\prime}}\right)^{d}\in O\left(\frac{1}{\varepsilon^{d}}\right)$ points. Therefore, we have $O\left(\frac{m}{\varepsilon^{d}}\right)$ points in $P$ and Algorithm 3 enumerates all combinations of $2\ell-2$ points from $P$, taking time $O\left(\frac{m^{2\ell-2}}{\varepsilon^{(2\ell-2)d}}\right)$. Afterwards, these candidates are evaluated, which takes time $O(nm\log(m))$ per candidate, using the algorithm of Alt and Godau [4] to compute the distances. All in all, we then have running-time $O\left(\frac{nm^{2\ell-1}\log(m)}{\varepsilon^{(2\ell-2)d}}+\frac{m^{3}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$. ∎
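The distance computations above rely on the algorithm of Alt and Godau for the continuous Fréchet distance. As a simpler, self-contained stand-in for experimentation (it operates only on the vertex sequences and in general differs from the continuous distance, so it is not the algorithm used in the analysis), one can compute the discrete Fréchet distance with the standard Eiter–Mannila dynamic program in $O(m^{2})$ time:

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between polygonal curves given as
    vertex lists P and Q (standard Eiter-Mannila dynamic program)."""
    def d(p, q):
        return math.dist(p, q)

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j): optimal coupling cost for prefixes P[..i], Q[..j].
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)

# Two parallel polylines at vertical offset 1: the distance is 1.
P = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Q = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
assert discrete_frechet(P, Q) == 1.0
```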

6 $(1+\varepsilon)$-Approximation for $(1,\ell)$-Median by Advanced Shortcutting

We now present Algorithm 4, which w.h.p. returns a candidate set containing a $(1+\varepsilon)$-approximate $(1,\ell)$-median of complexity at most $2\ell-2$ for every cluster that comprises at least a constant fraction of the input. Before presenting the algorithm, we state our second shortcutting lemma. Here, we do not introduce shortcuts with respect to a single curve, but with respect to several curves: by introducing shortcuts with respect to $\varepsilon\lvert T\rvert$ well-chosen curves from the given set $T\subset\mathbb{X}^{d}_{m}$ of polygonal curves, for a given $\varepsilon\in(0,1)$, we preserve the distances to at least $(1-\varepsilon)\lvert T\rvert$ curves from $T$. In this context, well-chosen means that there exist a certain number of subsets of $T$, from each of which we have to pick a curve for shortcutting. This enables the high approximation quality of Algorithm 4, and we formalize it in the following lemma.

Figure 3: Visualization of an advanced shortcut. The black curves are input curves and the red curve is an optimal median. By inserting the blue shortcut we can find a curve that has distance no larger than the median to all but one of the black curves, and with all vertices contained in the balls centered at the black curves' vertices.
Lemma 6.1 (shortcutting using a set of polygonal curves).

Let $\sigma\in\mathbb{X}^{d}$ be a polygonal curve with $\lvert\sigma\rvert>2$ vertices and $T=\{\tau_{1},\dots,\tau_{n}\}\subset\mathbb{X}^{d}$ be a set of polygonal curves. For $i\in\{1,\dots,n\}$, let $r_{i}=d_{F}(\tau_{i},\sigma)$ and for $j\in\{1,\dots,\lvert\tau_{i}\rvert\}$, let $v^{\tau_{i}}_{j}$ be the $j$th vertex of $\tau_{i}$.

For any $\varepsilon\in(0,1)$ there are $2\lvert\sigma\rvert-4$ subsets $T_{1},\dots,T_{2\lvert\sigma\rvert-4}\subseteq T$ of $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ curves each (not necessarily disjoint) such that for every subset $T^{\prime}\subseteq T$ containing at least one curve out of each $T_{k}\in\{T_{1},\dots,T_{2\lvert\sigma\rvert-4}\}$, a polygonal curve $\sigma^{\prime}\in\mathbb{X}^{d}$ exists with every vertex contained in

\[\bigcup_{\tau_{i}\in T^{\prime}}\bigcup_{j\in\{1,\dots,\lvert\tau_{i}\rvert\}}B(v^{\tau_{i}}_{j},r_{i}),\]

$d_{F}(\tau,\sigma^{\prime})\leq d_{F}(\tau,\sigma)$ for each $\tau\in T\setminus(T_{1}\cup\dots\cup T_{2\lvert\sigma\rvert-4})$, and $\lvert\sigma^{\prime}\rvert\leq 2\lvert\sigma\rvert-2$.

The idea is the following; see Fig. 3 for a visualization. One can argue that every vertex $v$ of $\sigma$ not contained in any of the balls centered at the vertices of the curves in $T$ (with radii according to their distances to $\sigma$) can be shortcut by connecting the last point $p^{-}$ before $v$ (in terms of the parameter of $\sigma$) contained in one ball and the first point $p^{+}$ after $v$ contained in one ball. This does not increase the Fréchet distances between $\sigma$ and the $\tau\in T$, because only matchings among line segments are affected by this modification. Furthermore, most distances are preserved when we do not actually use the last and first ball before and after $v$, but one of the $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ balls before and one of the $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ balls after $v$, which is the key to the following proof.

Proof of Lemma 6.1.

For the sake of simplicity, we assume that $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ is integral. Let $\ell=\lvert\sigma\rvert$. For $i\in\{1,\dots,n\}$, let $v^{\tau_{i}}_{1},\dots,v^{\tau_{i}}_{\lvert\tau_{i}\rvert}$ be the vertices of $\tau_{i}$ with instants $t^{\tau_{i}}_{1},\dots,t^{\tau_{i}}_{\lvert\tau_{i}\rvert}$ and let $v^{\sigma}_{1},\dots,v^{\sigma}_{\ell}$ be the vertices of $\sigma$ with instants $t^{\sigma}_{1},\dots,t^{\sigma}_{\ell}$. Also, for $h\in\mathcal{H}$ (recall that $\mathcal{H}$ is the set of all continuous bijections $h\colon[0,1]\rightarrow[0,1]$ with $h(0)=0$ and $h(1)=1$) and $i\in\{1,\dots,n\}$, let $r_{i,h}=\max_{t\in[0,1]}\lVert\sigma(t)-\tau_{i}(h(t))\rVert$ be the distance realized by $h$ with respect to $\tau_{i}$. We know from Lemma 2.4 that for each $i\in\{1,\dots,n\}$ there exists a sequence $(h_{i,x})_{x=1}^{\infty}$ in $\mathcal{H}$ such that $\lim_{x\to\infty}r_{i,h_{i,x}}=d_{F}(\sigma,\tau_{i})=r_{i}$.

In the following, given arbitrary $h_{1},\dots,h_{n}\in\mathcal{H}$, we describe how to modify $\sigma$ such that its vertices lie in the balls around the vertices of the $\tau\in T$, of radii determined by $h_{1},\dots,h_{n}$. Later we will argue that this modification can, in particular, be applied using $h_{1,x},\dots,h_{n,x}$ for each $x\in\mathbb{N}$.

Now, fix arbitrary $h_{1},\dots,h_{n}\in\mathcal{H}$ and, for an arbitrary $k\in\{2,\dots,\lvert\sigma\rvert-1\}$, fix the vertex $v^{\sigma}_{k}$ of $\sigma$ with instant $t^{\sigma}_{k}$. For $i\in\{1,\dots,n\}$, let $s_{i}$ be the maximum of $\{1,\dots,\lvert\tau_{i}\rvert-1\}$ such that $t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{\sigma}_{k})\leq t^{\tau_{i}}_{s_{i}+1}$. Namely, $v^{\sigma}_{k}$ is matched by $h_{1},\dots,h_{n}$ to a point on the line segments $\overline{v^{\tau_{1}}_{s_{1}}v^{\tau_{1}}_{s_{1}+1}},\dots,\overline{v^{\tau_{n}}_{s_{n}}v^{\tau_{n}}_{s_{n}+1}}$, respectively.

For $i\in\{1,\dots,n\}$, let $t^{-}_{i}$ be the maximum of $[0,t^{\sigma}_{k}]$ such that $\sigma(t^{-}_{i})\in B(v^{\tau_{i}}_{s_{i}},r_{i,h_{i}})$ and let $t^{+}_{i}$ be the minimum of $[t^{\sigma}_{k},1]$ such that $\sigma(t^{+}_{i})\in B(v^{\tau_{i}}_{s_{i}+1},r_{i,h_{i}})$. These are the instants when $\sigma$ last visits $B(v^{\tau_{i}}_{s_{i}},r_{i,h_{i}})$ before or when visiting $v^{\sigma}_{k}$, and when $\sigma$ first visits $B(v^{\tau_{i}}_{s_{i}+1},r_{i,h_{i}})$ when or after visiting $v^{\sigma}_{k}$, respectively. Furthermore, there is a permutation $\alpha\in S_{n}$ of the index set $\{1,\dots,n\}$ such that

\[t^{-}_{\alpha^{-1}(1)}\leq\dots\leq t^{-}_{\alpha^{-1}(n)}.\tag{I}\]

Also, there is a permutation $\beta\in S_{n}$ of the index set $\{1,\dots,n\}$ such that

\[t^{+}_{\beta^{-1}(1)}\leq\dots\leq t^{+}_{\beta^{-1}(n)}.\tag{II}\]

Additionally, for each $i\in\{1,\dots,n\}$ we have

\[t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{-}_{i})\tag{III}\]

and

\[h_{i}(t^{+}_{i})\leq t^{\tau_{i}}_{s_{i}+1},\tag{IV}\]

because $\sigma(t^{-}_{i})$ and $\sigma(t^{+}_{i})$ are, by definition, the closest points to $v^{\sigma}_{k}$ on $\sigma$ that have distance at most $r_{i,h_{i}}$ to $v^{\tau_{i}}_{s_{i}}$ and $v^{\tau_{i}}_{s_{i}+1}$, respectively. We will now use Eqs. I, II, III and IV to prove that an advanced shortcut only affects matchings among line segments, and hence we can easily bound the resulting distances for at least $(1-\varepsilon)n$ of the curves.

Let

\[I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})=\{\tau_{\alpha^{-1}((1-\frac{\varepsilon}{2\ell})n+1)},\dots,\tau_{\alpha^{-1}(n)}\},\quad O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})=\{\tau_{\beta^{-1}(1)},\dots,\tau_{\beta^{-1}(\frac{\varepsilon n}{2\ell})}\}.\]

$I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ is the set of the last $\frac{\varepsilon n}{2\ell}$ curves whose balls are visited by $\sigma$ before or when $\sigma$ visits $v^{\sigma}_{k}$. Similarly, $O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ is the set of the first $\frac{\varepsilon n}{2\ell}$ curves whose balls are visited by $\sigma$ when or immediately after $\sigma$ has visited $v^{\sigma}_{k}$. We now modify $\sigma$ such that $v^{\sigma}_{k}$ is replaced by two new vertices, one an element of at least one $B(v^{\tau_{i}}_{j},r_{i,h_{i}})$ for a $\tau_{i}\in I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ and $j\in\{1,\dots,\lvert\tau_{i}\rvert\}$, the other likewise for a $\tau_{i}\in O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$.

Let $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ be the piecewise-defined curve that agrees with $\sigma$ on $\left[0,t^{-}_{\alpha^{-1}(k_{1})}\right]$ and $\left[t^{+}_{\beta^{-1}(k_{2})},1\right]$ for arbitrary $k_{1}\in\{(1-\frac{\varepsilon}{2\ell})n+1,\dots,n\}$ and $k_{2}\in\{1,\dots,\frac{\varepsilon n}{2\ell}\}$, but on $\left(t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right)$ connects $\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)$ and $\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)$ with the line segment

\[\gamma(t)=\left(1-\frac{t-t^{-}_{\alpha^{-1}(k_{1})}}{t^{+}_{\beta^{-1}(k_{2})}-t^{-}_{\alpha^{-1}(k_{1})}}\right)\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)+\frac{t-t^{-}_{\alpha^{-1}(k_{1})}}{t^{+}_{\beta^{-1}(k_{2})}-t^{-}_{\alpha^{-1}(k_{1})}}\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right).\]
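The connecting curve $\gamma$ is simply a linear interpolation between the two shortcut endpoints, reparametrized over the interval $\left(t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right)$. As a small sketch (names are ours, for illustration only):

```python
def shortcut_segment(a, b, t_minus, t_plus):
    """Return gamma: [t_minus, t_plus] -> R^d, the line segment that
    linearly interpolates from point a to point b, as in the formula above."""
    def gamma(t):
        lam = (t - t_minus) / (t_plus - t_minus)
        return tuple((1 - lam) * x + lam * y for x, y in zip(a, b))
    return gamma

gamma = shortcut_segment((0.0, 0.0), (2.0, 4.0), t_minus=0.25, t_plus=0.75)
assert gamma(0.25) == (0.0, 0.0)  # endpoint sigma(t^-)
assert gamma(0.75) == (2.0, 4.0)  # endpoint sigma(t^+)
assert gamma(0.5) == (1.0, 2.0)   # midpoint of the segment
```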

We now argue that for all $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ the Fréchet distance between $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ and $\tau_{i}$ is upper-bounded by $r_{i,h_{i}}$. First, note that by definition $h_{1},\dots,h_{n}$ are strictly increasing functions, since they are continuous bijections that map $0$ to $0$ and $1$ to $1$. As an immediate consequence, we have

\[t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{-}_{i})\leq h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\tag{V}\]

for each $\tau_{i}\in T\setminus I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ and

\[h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\leq h_{i}(t^{+}_{i})\leq t^{\tau_{i}}_{s_{i}+1}\tag{VI}\]

for each $\tau_{i}\in T\setminus O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$, using Eqs. I, II, III and IV. Therefore, each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ has no vertex between the instants $h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)$ and $h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)$. We also know that for each $\tau_{i}\in T$

\[\left\lVert\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)-\tau_{i}\left(h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)\right\rVert\leq r_{i,h_{i}}\tag{VII}\]

and

\[\left\lVert\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)-\tau_{i}\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right)\right\rVert\leq r_{i,h_{i}}.\tag{VIII}\]

Let $D_{s,\sigma}=\left[0,t^{-}_{\alpha^{-1}(k_{1})}\right)$, $D_{m,\sigma}=\left[t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right]$ and $D_{e,\sigma}=\left(t^{+}_{\beta^{-1}(k_{2})},1\right]$. Also, for $i\in\{1,\dots,n\}$, let $D_{s,\tau_{i}}=\left[0,h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)$, $D_{m,\tau_{i}}=\left[h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right),h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right]$ and $D_{e,\tau_{i}}=\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right),1\right]$. Now, for each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ we use $h_{i}$ to match $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{s,\sigma}}$ to $\tau_{i}|_{D_{s,\tau_{i}}}$ and $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{e,\sigma}}$ to $\tau_{i}|_{D_{e,\tau_{i}}}$ with distance at most $r_{i,h_{i}}$. Since $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{m,\sigma}}$ and $\tau_{i}|_{D_{m,\tau_{i}}}$ are just line segments by Eqs. V and VI, they can be matched to each other with distance at most

\[\max\left\{\left\lVert\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)-\tau_{i}\left(h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)\right\rVert,\left\lVert\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)-\tau_{i}\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right)\right\rVert\right\},\]

which is at most $r_{i,h_{i}}$ by Eqs. VII and VIII. We conclude that $d_{F}(\sigma^{\prime}_{h_{1},\dots,h_{n}},\tau_{i})\leq r_{i,h_{i}}$.

Because this modification works for every $h_{1},\dots,h_{n}\in\mathcal{H}$, we conclude that $d_{F}(\sigma^{\prime}_{h_{1},\dots,h_{n}},\tau_{i})\leq r_{i,h_{i}}$ for every $h_{1},\dots,h_{n}\in\mathcal{H}$ and $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$. Thus $\lim_{x\to\infty}d_{F}(\sigma^{\prime}_{h_{1,x},\dots,h_{n,x}},\tau_{i})\leq d_{F}(\sigma,\tau_{i})=r_{i}$ for each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})\cup O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x}))$.

Now, to prove the claim, for each combination $h_{1},\dots,h_{n}\in\mathcal{H}$, we apply this modification to $v^{\sigma}_{k}$ and successively to every other vertex $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{l}$ of the resulting curve $\sigma^{\prime}_{h_{1},\dots,h_{n}}$, except $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{1}$ and $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{\lvert\sigma^{\prime}_{h_{1},\dots,h_{n}}\rvert}$, since these must be elements of $B(v^{\tau_{i}}_{1},r_{i,h_{i}})$ and $B(v^{\tau_{i}}_{\lvert\tau_{i}\rvert},r_{i,h_{i}})$, respectively, for each $i\in\{1,\dots,n\}$, by definition of the Fréchet distance.

Since the modification is repeated at most $\lvert\sigma\rvert-2$ times for each combination $h_{1},\dots,h_{n}\in\mathcal{H}$, we conclude that the number of vertices of each $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ can be bounded by $2\cdot(\lvert\sigma\rvert-2)+2=2\lvert\sigma\rvert-2$.

$T_{1},\dots,T_{2\ell-4}$ are therefore all the $I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ and $O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ for $k\in\{2,\dots,2\lvert\sigma\rvert-3\}$, as $x\to\infty$. Note that every $I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ and $O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ is determined by the visiting order of the balls, and since their radii converge, these sets converge too. ∎

We now present Algorithm 4, which is nearly identical to Algorithm 2 but uses the advanced shortcutting lemma. Furthermore, like Algorithm 2, it can be used as a plugin in the recursive $k$-median approximation scheme (Algorithm 5) that we present in Section 7.

Algorithm 4 $(1,\ell)$-Median for Subset by Advanced Shortcutting
1: procedure $(1,\ell)$-Median-$(1+\varepsilon)$-Candidates($T=\{\tau_{1},\dots,\tau_{n}\},\beta,\delta,\varepsilon$)
2:     $\varepsilon^{\prime}\leftarrow\varepsilon/6$, $C\leftarrow\emptyset$
3:     $S\leftarrow$ sample $\left\lceil-8\beta\ell(\varepsilon^{\prime})^{-1}(\ln(\delta)-\ln(4(2\ell-4)))\right\rceil$ curves from $T$ uniformly and independently with replacement
4:     for $S^{\prime}\subseteq S$ with $\lvert S^{\prime}\rvert=\frac{\lvert S\rvert}{2\beta}$ do
5:         $c\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($S^{\prime},\delta/4$) ▷ Algorithm 1
6:         $\Delta\leftarrow\operatorname{cost}(S^{\prime},c)$, $\Delta_{l}\leftarrow\frac{2\delta n}{4\lvert S\rvert}\frac{\Delta}{34}$, $\Delta_{u}\leftarrow\frac{1}{\varepsilon^{\prime}}\Delta$
7:         $C\leftarrow C\cup\{c\}$, $P\leftarrow\emptyset$
8:         for $s\in S^{\prime}$ do
9:             for $i\in\{1,\dots,\lvert s\rvert\}$ do
10:                 $P\leftarrow P\cup\mathbb{G}\left(B\left(v^{s}_{i},\frac{4\ell}{\varepsilon^{\prime}}\Delta_{u}\right),\frac{2\varepsilon^{\prime}}{n\sqrt{d}}\Delta_{l}\right)$ ▷ $v^{s}_{i}$: $i$th vertex of $s$
11:         $C\leftarrow C\ \cup$ set of all polygonal curves with $2\ell-2$ vertices from $P$
12:     return $C$
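Lines 10 and 11 first collect grid points covering balls around the sampled vertices and then enumerate every sequence of $2\ell-2$ grid points as a candidate curve, so the candidate set has size $\lvert P\rvert^{2\ell-2}$. A minimal sketch of this enumeration step (our own illustration; `grid_cover` uses a simple cube grid as a stand-in for $\mathbb{G}(B,\cdot)$):

```python
import itertools

def grid_cover(center, radius, width):
    """Grid points of spacing `width` covering the axis-parallel cube
    enclosing the ball B(center, radius) -- a stand-in for G(B, w)."""
    axes = []
    for c in center:
        k_lo = int((c - radius) // width)
        k_hi = int((c + radius) // width) + 1
        axes.append([k * width for k in range(k_lo, k_hi + 1)])
    return set(itertools.product(*axes))

def candidates(P, num_vertices):
    """All polygonal curves (vertex sequences) with `num_vertices`
    vertices chosen from the point set P."""
    return itertools.product(sorted(P), repeat=num_vertices)

P = grid_cover(center=(0.0, 0.0), radius=0.5, width=0.5)
ell = 2
cands = list(candidates(P, 2 * ell - 2))
assert len(cands) == len(P) ** (2 * ell - 2)
```

The exponential dependence on $\ell$ and $d$ in Theorem 5.2 is exactly this $\lvert P\rvert^{2\ell-2}$ enumeration with $\lvert P\rvert\in O(m/\varepsilon^{d})$.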

We prove the quality of approximation of Algorithm 4.

Theorem 6.2.

Given three parameters $\beta\in[1,\infty)$, $\delta\in(0,1)$, $\varepsilon\in(0,0.158]$ and a set $T=\{\tau_{1},\dots,\tau_{n}\}\subset\mathbb{X}^{d}_{m}$ of polygonal curves, with probability at least $1-\delta$ the set of candidates that Algorithm 4 returns contains a $(1+\varepsilon)$-approximate $(1,\ell)$-median with up to $2\ell-2$ vertices for any $T^{\prime}\subseteq T$ with $\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert$.

In the following proof we make use of a case distinction developed by Nath and Taylor [28, Proof of Theorem 10], which is a key ingredient in enabling the $(1+\varepsilon)$-approximation, though the domain of $\varepsilon$ has to be restricted to $(0,0.158]$.

Proof of Theorem 6.2.

We assume that $\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert$. Let $n^{\prime}$ be the number of sampled curves in $S$ that are elements of $T^{\prime}$. Clearly, $\operatorname{E}[n^{\prime}]\geq\sum_{i=1}^{\lvert S\rvert}\frac{1}{\beta}=\frac{\lvert S\rvert}{\beta}$. Also, $n^{\prime}$ is the sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

\[\Pr\left[n^{\prime}<\frac{\lvert S\rvert}{2\beta}\right]\leq\Pr\left[n^{\prime}<\frac{\operatorname{E}[n^{\prime}]}{2}\right]\leq\exp\left(-\frac{1}{4}\frac{\lvert S\rvert}{2\beta}\right)\leq\exp\left(\frac{\ell(\ln(\delta)-\ln(4(2\ell-4)))}{\varepsilon}\right)\leq\left(\frac{\delta^{\ell}}{4^{\ell}}\right)^{\frac{1}{\varepsilon}}\leq\frac{\delta}{8}.\]

In other words, with probability at most $\delta/8$ no subset $S^{\prime}\subseteq S$ of cardinality at least $\frac{\lvert S\rvert}{2\beta}$ is a subset of $T^{\prime}$. We condition the rest of the proof on the contrary event, denoted by $\mathcal{E}_{T^{\prime}}$, namely that there is a subset $S^{\prime}\subseteq S$ with $S^{\prime}\subseteq T^{\prime}$ and $\lvert S^{\prime}\rvert\geq\frac{\lvert S\rvert}{2\beta}$. Note that $S^{\prime}$ is then a uniform and independent sample of $T^{\prime}$.
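The sample size in line 3 of Algorithm 4 is chosen exactly so that the Chernoff bound above lands below $\delta/8$. A quick numerical sanity check for concrete parameter values (chosen arbitrarily for illustration; here we plug $\varepsilon$ directly into the formula in place of $\varepsilon^{\prime}$):

```python
import math

def sample_size(beta, ell, eps, delta):
    """|S| from line 3 of Algorithm 4."""
    return math.ceil(-8 * beta * ell / eps
                     * (math.log(delta) - math.log(4 * (2 * ell - 4))))

beta, ell, eps, delta = 2.0, 3, 0.1, 0.05
S = sample_size(beta, ell, eps, delta)
# Chernoff bound from the proof: Pr[n' < |S|/(2 beta)] <= exp(-|S|/(8 beta)).
bound = math.exp(-S / (8 * beta))
assert bound <= delta / 8
```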

Now, let $c^{\ast}\in\mathbb{X}^{d}_{\ell}$ be an optimal $(1,\ell)$-median for $T^{\prime}$. The expected distance between $s\in S^{\prime}$ and $c^{\ast}$ is

\[\operatorname{E}\left[d_{F}(s,c^{\ast})\mid\mathcal{E}_{T^{\prime}}\right]=\sum_{\tau\in T^{\prime}}d_{F}(c^{\ast},\tau)\cdot\frac{1}{\lvert T^{\prime}\rvert}=\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}.\]

By linearity we have $\operatorname{E}\left[\operatorname{cost}(S^{\prime},c^{\ast})\mid\mathcal{E}_{T^{\prime}}\right]=\frac{\lvert S^{\prime}\rvert}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$. Markov's inequality yields:

\[\Pr\left[\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\leq\frac{\delta}{4}.\]

We conclude that with probability at most $\delta/4$ we have $\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})$.

Now, from Lemma 6.1 we know that there are $2\ell-4$ subsets $T^{\prime}_{1},\dots,T^{\prime}_{2\ell-4}\subseteq T^{\prime}$, of cardinality $\frac{\varepsilon\lvert T^{\prime}\rvert}{2\ell}$ each and not necessarily disjoint, such that for every set $W\subseteq T^{\prime}$ that contains at least one curve $\tau\in T^{\prime}_{i}$ for each $i\in\{1,\dots,2\ell-4\}$, there exists a curve $c^{\prime}\in\mathbb{X}^{d}_{2\ell-2}$ which has all of its vertices contained in

\[\bigcup_{\tau\in W}\bigcup_{j\in\{1,\dots,\lvert\tau\rvert\}}B(v^{\tau}_{j},d_{F}(\tau,c^{\ast}))\]

and for at least $(1-\varepsilon)\lvert T^{\prime}\rvert$ curves $\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$ it holds that $d_{F}(\tau,c^{\prime})\leq d_{F}(\tau,c^{\ast})$.

There are at most $\frac{\varepsilon\lvert T^{\prime}\rvert}{4\ell}$ curves with distance to $c^{\ast}$ at least $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$; otherwise the cost of these curves would exceed $\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction. Later we will prove that each ball we cover has radius at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$. Therefore, for each $i\in\{1,\dots,2\ell-4\}$ we may have to ignore up to half of the curves $\tau\in T^{\prime}_{i}$, since we do not cover the balls of radius $d_{F}(\tau,c^{\ast})$ centered at their vertices. For each $i\in\{1,\dots,2\ell-4\}$ and $s\in S^{\prime}$ we now have

\[\Pr\left[s\in T^{\prime}_{i}\wedge d_{F}(s,c^{\ast})\leq\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\geq\frac{\varepsilon}{4\ell}.\]

Therefore, by independence, for each $i\in\{1,\dots,2\ell-4\}$ the probability that no $s\in S^{\prime}$ is an element of $T^{\prime}_{i}$ and has distance to $c^{\ast}$ at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$ is at most $\left(1-\frac{\varepsilon}{4\ell}\right)^{\lvert S^{\prime}\rvert}\leq\exp\left(-\frac{\varepsilon}{4\ell}\cdot\frac{4\ell(\ln(4(2\ell-4))-\ln(\delta))}{\varepsilon}\right)=\exp\left(\ln\left(\frac{\delta}{4(2\ell-4)}\right)\right)=\frac{\delta}{4(2\ell-4)}$. Also, with probability at most $\delta/4$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $c\in\mathbb{X}^{d}_{\ell}$ for $S^{\prime}$, cf. Corollary 3.3.
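The averaging argument above (at most $\frac{\varepsilon\lvert T^{\prime}\rvert}{4\ell}$ curves can be farther than $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$ from $c^{\ast}$) is Markov's inequality applied to the empirical distance distribution; it holds deterministically for any list of nonnegative distances, as this small sketch (our illustration) shows:

```python
def count_far(distances, ell, eps):
    """Number of curves whose distance to c* is at least the
    threshold 4*ell*cost/(eps*n) from the averaging argument."""
    n = len(distances)
    cost = sum(distances)  # cost(T', c*)
    threshold = 4 * ell * cost / (eps * n)
    return sum(1 for d in distances if d >= threshold)

# Markov: the count is at most eps*n/(4*ell) for ANY nonnegative input,
# since count * threshold <= sum(distances) = cost.
distances = [0.1, 0.2, 0.3, 5.0, 0.05, 0.1, 0.2, 9.0, 0.1, 0.4]
ell, eps = 3, 0.5
assert count_far(distances, ell, eps) <= eps * len(distances) / (4 * ell)
```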

Using a union bound over these bad events, we conclude that with probability at least $1-\frac{7}{8}\delta$ all of the following events occur simultaneously:

1. There is a subset $S^{\prime}\subseteq S$ of cardinality at least $\lvert S\rvert/(2\beta)$ that is a uniform and independent sample of $T^{\prime}$,

2. for each $i\in\{1,\dots,2\ell-4\}$, $S^{\prime}$ contains at least one curve from $T^{\prime}_{i}$ with distance to $c^{\ast}$ at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$,

3. Algorithm 1 computes a polygonal curve $c\in\mathbb{X}^{d}_{\ell}$ with $\operatorname{cost}(S^{\prime},c^{\ast}_{S^{\prime}})\leq\operatorname{cost}(S^{\prime},c)\leq 34\operatorname{cost}(S^{\prime},c^{\ast}_{S^{\prime}})$, where $c^{\ast}_{S^{\prime}}\in\mathbb{X}^{d}_{\ell}$ is an optimal $(1,\ell)$-median for $S^{\prime}$,

4. and it holds that $\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})\leq\operatorname{cost}(T^{\prime},c^{\ast})$.

Let $B_{c^{\ast}}=\left\{\tau\in T^{\prime}\mid d_{F}(\tau,c^{\ast})\leq\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon^{2}\lvert T^{\prime}\rvert}\right\}$, $T^{\prime}_{c^{\ast}}=T^{\prime}\cap B_{c^{\ast}}$ and $B_{c}=\left\{\tau\in T^{\prime}\mid d_{F}(\tau,c)\leq\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\right\}$. First, note that $\lvert T^{\prime}\setminus B_{c^{\ast}}\rvert\leq\varepsilon^{2}\lvert T^{\prime}\rvert$, otherwise $\operatorname{cost}(T^{\prime}\setminus B_{c^{\ast}},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction; therefore $\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})\lvert T^{\prime}\rvert$. We now distinguish two cases:

Case 1: $\lvert T^{\prime}_{c^{\ast}}\setminus B_{c}\rvert>2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert$

We have $2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})2\varepsilon\lvert T^{\prime}\rvert\geq\varepsilon\lvert T^{\prime}\rvert$, hence $\Pr\left[d_{F}(s,c)>\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\geq\varepsilon$ for each $s\in S^{\prime}$. Using independence, we conclude that with probability at most

\[(1-\varepsilon)^{\lvert S^{\prime}\rvert}\leq\exp\left(-\varepsilon\frac{4\ell(\ln(4(2\ell-4))-\ln(\delta))}{\varepsilon}\right)\leq\frac{\delta^{4\ell}}{4^{4\ell}}\leq\frac{\delta}{8}\]

no $s\in S^{\prime}$ has distance to $c$ greater than $\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$. Using a union bound again, we conclude that with probability at least $1-\delta$, Items 1, 2, 3 and 4 occur simultaneously and at least one $s\in S^{\prime}$ has distance to $c$ greater than $\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$, hence $\operatorname{cost}(S^{\prime},c)>\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\Leftrightarrow\frac{\operatorname{cost}(S^{\prime},c)}{\varepsilon}>\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$, and thus we indeed cover the balls of radius at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}<\frac{4\ell}{\varepsilon}\frac{\operatorname{cost}(S^{\prime},c)}{\varepsilon}$.

In the last step, Algorithm 4 returns the set $C$ of all curves with up to $2\ell-2$ vertices from the grids, which contains one curve, denoted by $c^{\prime\prime}$, with the same number of vertices as $c^{\prime}$ (recall that this is the curve guaranteed by Lemma 6.1) and distance at most $\frac{\varepsilon}{n}\Delta_{l}\leq\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$ between every corresponding pair of vertices of $c^{\prime}$ and $c^{\prime\prime}$. We conclude that $d_{F}(c^{\prime},c^{\prime\prime})\leq\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$. Also, recall that $d_{F}(\tau,c^{\prime})\leq d_{F}(\tau,c^{\ast})$ for $\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$. Further, $T^{\prime}$ contains at least $\frac{\lvert T^{\prime}\rvert}{2}$ curves with distance at most $\frac{2\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$ to $c^{\ast}$, otherwise the cost of the remaining curves would exceed $\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction; since $\varepsilon<\frac{1}{2}$, by the pigeonhole principle there is at least one curve $\sigma\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$ with $d_{F}(\sigma,c^{\prime})\leq d_{F}(\sigma,c^{\ast})\leq\frac{2\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$. We can now bound the cost of $c^{\prime\prime}$ as follows:

cost(T,c′′)\displaystyle\operatorname{cost}\left\lparen T^{\prime},c^{\prime\prime}\right\rparen =τTdF(τ,c′′)τT(T1T24)(dF(τ,c)+ε|T|cost(T,c))+\displaystyle={}\sum_{\tau\in T^{\prime}}d_{F}(\tau,c^{\prime\prime})\leq\sum_{\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left(d_{F}(\tau,c^{\prime})+\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen\right)\ +
τ(T1T24)(dF(τ,c)+dF(c,σ)+dF(σ,c)+dF(c,c′′))\displaystyle\ \ \ \ \sum_{\tau\in(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},\sigma)+d_{F}(\sigma,c^{\prime})+d_{F}(c^{\prime},c^{\prime\prime})\right)
(1+ε)cost(T,c)+τ(T1T24)((2+2+ε)cost(T,c)|T|)\displaystyle\leq{}(1+\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+\sum_{\tau\in(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left((2+2+\varepsilon)\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}\right)
cost(T,c)+εcost(T,c)+5εcost(T,c)=(1+6ε)cost(T,c).\displaystyle\leq{}\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+\varepsilon\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+5\varepsilon\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen=(1+6\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen.

Case 2: |TcBc|2ε|Tc|\lvert T^{\prime}_{c^{\ast}}\setminus B_{c}\rvert\leq 2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert

Again, we distinguish two cases:

Case 2.1: dF(c,c)4εcost(T,c)|T|d_{F}(c,c^{\ast})\leq 4\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}

We can easily bound the cost of cc:

cost(T,c)τT(dF(τ,c)+dF(c,c))(1+4ε)cost(T,c).\displaystyle\operatorname{cost}\left\lparen T^{\prime},c\right\rparen\leq\sum_{\tau\in T^{\prime}}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},c))\leq(1+4\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen.

Case 2.2: dF(c,c)>4εcost(T,c)|T|d_{F}(c,c^{\ast})>4\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}

Recall that |Tc|(1ε2)|T|\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})\lvert T^{\prime}\rvert. We have

|TBc|\displaystyle\lvert T^{\prime}\setminus B_{c}\rvert |TTc|+2ε|Tc|=|T|(12ε)|Tc||T|(12ε)(1ε2)|T|\displaystyle\leq{}\lvert T^{\prime}\setminus T^{\prime}_{c^{\ast}}\rvert+2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert=\lvert T^{\prime}\rvert-(1-2\varepsilon)\lvert T^{\prime}_{c^{\ast}}\rvert\leq\lvert T^{\prime}\rvert-(1-2\varepsilon)(1-\varepsilon^{2})\lvert T^{\prime}\rvert
=(2ε+ε22ε3)|T|<13|T|.\displaystyle=(2\varepsilon+\varepsilon^{2}-2\varepsilon^{3})\lvert T^{\prime}\rvert<\frac{1}{3}\lvert T^{\prime}\rvert.
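The final strict inequality requires 2ε+ε22ε3<132\varepsilon+\varepsilon^{2}-2\varepsilon^{3}<\frac{1}{3}, which is exactly where the restriction ε0.158\varepsilon\leq 0.158 in Corollary 7.5 originates: that value sits just below the relevant real root. A quick numeric probe (our own illustration, not part of the proof):

```python
# The pruning argument needs (2e + e^2 - 2e^3)|T'| < |T'|/3, i.e. a positive
# slack of 1/3 over the polynomial 2e + e^2 - 2e^3.
def slack(e: float) -> float:
    return 1 / 3 - (2 * e + e ** 2 - 2 * e ** 3)

assert slack(0.158) > 0  # admissible: the inequality holds
assert slack(0.17) < 0   # a slightly larger eps already violates it
```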

Hence, |TBc|(12εε2+2ε3)|T|>23|T|\lvert T^{\prime}\cap B_{c}\rvert\geq(1-2\varepsilon-\varepsilon^{2}+2\varepsilon^{3})\lvert T^{\prime}\rvert>\frac{2}{3}\lvert T^{\prime}\rvert. Assume we assign all curves to cc instead of to cc^{\ast}. For τTBc\tau\in T^{\prime}\cap B_{c} we now have a decrease in cost dF(τ,c)dF(τ,c)d_{F}(\tau,c^{\ast})-d_{F}(\tau,c), which can be bounded as follows:

dF(τ,c)dF(τ,c)\displaystyle d_{F}(\tau,c^{\ast})-d_{F}(\tau,c) dF(τ,c)εcost(T,c)|T|dF(c,c)dF(τ,c)εcost(T,c)|T|\displaystyle\geq{}d_{F}(\tau,c^{\ast})-\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}\geq d_{F}(c,c^{\ast})-d_{F}(\tau,c)-\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}
dF(c,c)2εcost(T,c)|T|>12dF(c,c).\displaystyle\geq d_{F}(c,c^{\ast})-2\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}>\frac{1}{2}d_{F}(c,c^{\ast}).

For τTBc\tau\in T^{\prime}\setminus B_{c} we have an increase in cost dF(τ,c)dF(τ,c)dF(c,c)d_{F}(\tau,c)-d_{F}(\tau,c^{\ast})\leq d_{F}(c,c^{\ast}). Let the overall change in cost be denoted by α\alpha, which can be bounded as follows:

α<|TBc|dF(c,c)|TBc|dF(c,c)2.\displaystyle\alpha<\lvert T^{\prime}\setminus B_{c}\rvert\cdot d_{F}(c,c^{\ast})-\lvert T^{\prime}\cap B_{c}\rvert\cdot\frac{d_{F}(c,c^{\ast})}{2}.

By the fact that |TBc|<12|TBc|\lvert T^{\prime}\setminus B_{c}\rvert<\frac{1}{2}\lvert T^{\prime}\cap B_{c}\rvert for our choice of ε\varepsilon, we conclude that α<0\alpha<0, which is a contradiction because cc^{\ast} is an optimal (1,)(1,\ell)-median for TT^{\prime}. Therefore, case 2.2 cannot occur. Rescaling ε\varepsilon by 16\frac{1}{6} proves the claim. ∎

We analyse the worst-case running-time of Algorithm 4 and the number of candidates it returns.

Theorem 6.3.

Algorithm 4 has running-time, and returns a number of candidates, bounded by 2O((ln(δ))2βε2+log(m))2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}+\log(m)\right)}.

Proof.

The sample SS has size O(ln(δ)βε)O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right) and sampling it takes time O(ln(δ)βε)O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right). Let nS=|S|n_{S}=\lvert S\rvert. The for-loop runs

(nSnS2β)2O(nS2βlognS)2O((ln(δ))2βε2)\binom{n_{S}}{\frac{n_{S}}{2\beta}}\in 2^{O\left(\frac{n_{S}}{2\beta}\log n_{S}\right)}\subset 2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}\right)}

times. In each iteration, we run Algorithm 1, taking time O(m2log(m)(ln2δ)+m3logm)O(m^{2}\log(m)(-\ln^{2}\delta)+m^{3}\log m) (cf. Corollary 3.3), we compute the cost of the returned curve with respect to SS^{\prime}, taking time O(ln(δ)εmlog(m))O\left(\frac{-\ln(\delta)}{\varepsilon}\cdot m\log(m)\right), and per curve in SS^{\prime} we build up to mm grids of size

((1+ε)Δε2ε2δnΔnd4|S|)d=(d|S|(1+ε)ε2δ)dO(βd(lnδ)dε3dδd)\left(\frac{\frac{(1+\varepsilon)\Delta}{\varepsilon}}{\frac{2\varepsilon 2\delta n\Delta}{n\sqrt{d}4\lvert S\rvert}}\right)^{d}=\left(\frac{\sqrt{d}\lvert S\rvert(1+\varepsilon)}{\varepsilon^{2}\delta}\right)^{d}\in O\left(\frac{\beta^{d}(-\ln\delta)^{d}}{\varepsilon^{3d}\delta^{d}}\right)

each. Algorithm 4 then enumerates all combinations of 222\ell-2 points from up to |S|m\lvert S^{\prime}\rvert\cdot m grids, resulting in

O(m22β2d2d+22(lnδ)2d2d+22ε6d6d+22δ2d2d)O\left(\frac{m^{2\ell-2}\beta^{2\ell d-2d+2\ell-2}(-\ln\delta)^{2\ell d-2d+2\ell-2}}{\varepsilon^{6\ell d-6d+2\ell-2}\delta^{2\ell d-2d}}\right)

candidates per iteration of the for-loop. Thus, per iteration of the for-loop, Algorithm 4 computes O(poly(m,β,δ1,ε1))O\left(\operatorname{poly}\left\lparen m,\beta,\delta^{-1},\varepsilon^{-1}\right\rparen\right) candidates and the enumeration also takes time O(poly(m,β,δ1,ε1))O\left(\operatorname{poly}\left\lparen m,\beta,\delta^{-1},\varepsilon^{-1}\right\rparen\right).

All in all, both the running-time and the number of candidates are bounded by 2O((ln(δ))2βε2+log(m))2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}+\log(m)\right)}. ∎

7 (1+ε)(1+\varepsilon)-Approximation for (k,)(k,\ell)-Median

We generalize the algorithm of Ackermann et al. [2] in the following way: instead of drawing a uniform sample and running a problem-specific algorithm on this sample in the candidate phase, we only run a problem-specific “plugin”-algorithm in the candidate phase, thus dropping the framework around the sampling property. We think that the problem-specific algorithms used in [2] do not fulfill the role of a plugin, since parts of the problem-specific operations, e.g. the uniform sampling, remain in the main algorithm. Here, we separate the problem-specific operations from the main algorithm: any algorithm can serve as a plugin if it is able to return candidates for a cluster that takes up a constant fraction of the input, where the fraction is an input-parameter of the algorithm, and if it guarantees some approximation factor with high probability. The calls to the candidate-finder plugin do not even need to be stochastically independent, which allows for adaptive algorithms.
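To make the plugin contract concrete, consider a deliberately naive plugin sketch (the function names and signature below are our own illustration, not the interface of [2]): in a metric space the input set itself always contains a 2-approximate 1-median, by the triangle inequality, so returning all input elements satisfies the contract with approximation factor 2 for every subset, at the price of |T| candidates.

```python
def trivial_candidates(T, beta, delta, eps):
    """Naive deterministic plugin: every input element is a candidate.

    In a metric space the best input element is a 2-approximate 1-median
    (triangle inequality), so the guarantee holds for any subset T' of T,
    even with probability 1; beta, delta and eps are simply ignored here.
    """
    return list(T)

def best_candidate(T, rho, candidates):
    # Evaluate each candidate by its 1-median cost on T and keep the cheapest.
    return min(candidates, key=lambda c: sum(rho(t, c) for t in T))
```

The factor 2 is tight in general metric spaces (consider n points at pairwise distance 2 around an optimal center at distance 1 from each); efficient plugins like Algorithms 2 and 4 trade this simplicity for far fewer candidates.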

Now, let 𝒳=(X,ρ)\mathcal{X}=(X,\rho) be an arbitrary space, where XX is any non-empty (ground-)set and ρ:X×X0\rho\colon X\times X\rightarrow\mathbb{R}_{\geq 0} is a distance function (not necessarily a metric). We introduce a generalized definition of kk-median clustering. Let the medians be restricted to come from a predefined subset YXY\subseteq X.

Definition 7.1 (generalized kk-median).

The generalized kk-median clustering problem is defined as follows, where kk\in\mathbb{N} is a fixed (constant) parameter of the problem: given a finite and non-empty set ZXZ\subseteq X, compute a set CC of kk elements from YY, such that cost(Z,C)=zZmincCρ(z,c)\operatorname{cost}\left\lparen Z,C\right\rparen=\sum\limits_{z\in Z}\min\limits_{c\in C}\rho(z,c) is minimal.
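The objective of Definition 7.1 is straightforward to state in code; a minimal sketch, with the (not necessarily metric) distance ρ passed as a callable:

```python
def kmedian_cost(Z, C, rho):
    """cost(Z, C) = sum over z in Z of the distance from z to its nearest center."""
    return sum(min(rho(z, c) for c in C) for z in Z)

# Example on the real line with rho(x, y) = |x - y| and centers {0, 10}:
# the points 1 and 2 are served by 0, the point 9 by 10, so the cost is 1 + 2 + 1.
example = kmedian_cost([1, 2, 9], [0, 10], lambda x, y: abs(x - y))  # 4
```

For the (k,ℓ)-median instantiation, `rho` would be the Fréchet distance and the elements polygonal curves.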

The following algorithm, Algorithm 5, can approximate every kk-median problem compatible with Definition 7.1, when provided with such a problem-specific plugin-algorithm for computing candidates. In particular, it can approximate the (k,)(k,\ell)-median problem for polygonal curves under the Fréchet distance, when provided with Algorithm 2 or Algorithm 4. Then, we have X=𝕏dX=\mathbb{X}^{d}, Y𝕏d𝕏d=XY\subseteq\mathbb{X}^{d}_{\ell}\subseteq\mathbb{X}^{d}=X and Z𝕏md𝕏d=XZ\subseteq\mathbb{X}^{d}_{m}\subseteq\mathbb{X}^{d}=X. Note that the algorithm computes a bicriteria approximation, that is, the solution is approximated in terms of the cost and the number of vertices of the center curves, i.e., the centers come from 𝕏22d\mathbb{X}^{d}_{2\ell-2}.

Algorithm 5 has several parameters. The first parameter CC is the set of centers found so far and κ\kappa is the number of centers yet to be found. The following parameters concern only the plugin-algorithm used within the algorithm: β\beta determines the size (in terms of a fraction of the input) of the smallest cluster for which an approximate median can be computed, δ\delta determines the probability of failure of the plugin-algorithm and ε\varepsilon determines the approximation factor of the plugin-algorithm.

Algorithm 5 works as follows: if it has already computed some centers (and there are still centers to compute), it prunes: some clusters might be too small for the plugin-algorithm to compute approximate medians for them. Algorithm 5 then calls itself recursively with only half of the input: the elements with larger distances to the centers found so far. This way the small clusters eventually take up a larger fraction of the input and can be found in the candidate phase. In this phase, Algorithm 5 calls its plugin and for each candidate the plugin returns, it calls itself recursively, adding the candidate at hand to the set of centers found so far and decrementing κ\kappa by one. Eventually, all combinations of computed candidates are evaluated against the original input and the centers that together evaluate best are returned.

Algorithm 5 Recursive Approximation-Scheme for kk-Median Clustering
1:procedure kk-Median(T,C,κ,β,δ,εT,C,\kappa,\beta,\delta,\varepsilon)
2:     if κ=0\kappa=0 then
3:         return CC      \triangleright Pruning Phase
4:     if CC\neq\emptyset then
5:         PP\leftarrow set of |T|2\left\lfloor\frac{\lvert T\rvert}{2}\right\rfloor elements τT\tau\in T, such that mincCρ(τ,c)mincCρ(σ,c)\min\limits_{c\in C}\rho(\tau,c)\leq\min\limits_{c\in C}\rho(\sigma,c) for each σTP\sigma\in T\setminus P
6:         CC^{\prime}\leftarrow kk-Median(TP,C,κ,β,δ,εT\setminus P,C,\kappa,\beta,\delta,\varepsilon)
7:     else
8:         CC^{\prime}\leftarrow\emptyset      \triangleright Candidate Phase
9:     K1K\leftarrow 1-Median-Candidates(T,β,δ/k,ε)(T,\beta,\delta/k,\varepsilon)
10:     for cKc\in K do
11:         CckC_{c}\leftarrow k-Median(T,C{c},κ1,β,δ,ε)(T,C\cup\{c\},\kappa-1,\beta,\delta,\varepsilon)      
12:     𝒞{C}cK{Cc}\mathcal{C}\leftarrow\{C^{\prime}\}\cup\bigcup\limits_{c\in K}\{C_{c}\}
13:     return argminC𝒞cost(T,C)\operatorname*{arg\,min}\limits_{C\in\mathcal{C}}\operatorname{cost}\left\lparen T,C\right\rparen
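A direct transcription of the pseudocode above into Python may help to parse it (a sketch under our own assumptions: ties in the pruning step are broken by an arbitrary stable sort, and we added a guard that skips pruning on singleton inputs so the recursion terminates):

```python
def k_median(T, C, kappa, beta, delta, eps, k, rho, find_candidates):
    """Recursive approximation scheme for generalized k-median (sketch of Algorithm 5)."""
    if kappa == 0:
        return C
    if C and len(T) >= 2:  # termination guard added by us for singleton inputs
        # Pruning phase: discard the half of T closest to the centers found so
        # far and recurse on the remaining, farther half.
        T_sorted = sorted(T, key=lambda tau: min(rho(tau, c) for c in C))
        C_prime = k_median(T_sorted[len(T) // 2:], C, kappa,
                           beta, delta, eps, k, rho, find_candidates)
    else:
        C_prime = frozenset()
    # Candidate phase: one recursive call per candidate returned by the plugin.
    solutions = [C_prime]
    for c in find_candidates(T, beta, delta / k, eps):
        solutions.append(k_median(T, C | {c}, kappa - 1,
                                  beta, delta, eps, k, rho, find_candidates))
    # Evaluate every collected center set against the original input T.
    def total_cost(S):
        return sum(min((rho(tau, c) for c in S), default=float("inf")) for tau in T)
    return min(solutions, key=total_cost)
```

With a plugin that simply returns the distinct input elements as candidates, the call `k_median([0, 0, 0, 10, 10, 10], frozenset(), 2, ...)` on the real line recovers the centers {0, 10}.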

The quality of approximation and the worst-case running-time of Algorithm 5 are stated in the following two theorems, which we prove further below. The proofs are adaptations of corresponding proofs in [2]. We provide them for the sake of completeness.

Theorem 7.2.

Let T={τ1,,τn}XT=\{\tau_{1},\dots,\tau_{n}\}\subseteq X, α[1,)\alpha\in[1,\infty) and 11-Median-Candidates be an algorithm that, given three parameters β[1,)\beta\in[1,\infty), δ,ε(0,1)\delta,\varepsilon\in(0,1) and a set TXT\subseteq X, returns with probability at least 1δ1-\delta an (α+ε)(\alpha+\varepsilon)-approximate 11-median for any TTT^{\prime}\subseteq T, if |T|1β|T|\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert.

Algorithm 5 called with parameters (T,,k,β,δ,ε)(T,\emptyset,k,\beta,\delta,\varepsilon), where β(2k,)\beta\in(2k,\infty) and δ,ε(0,1)\delta,\varepsilon\in(0,1), returns with probability at least 1δ1-\delta a set C={c1,,ck}C=\{c_{1},\dots,c_{k}\} with cost(T,C)(1+4k2β2k)(α+ε)cost(T,C)\operatorname{cost}\left\lparen T,C\right\rparen\leq(1+\frac{4k^{2}}{\beta-2k})(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen, where CC^{\ast} is an optimal set of kk medians for TT.

Theorem 7.3.

Let T1(n,β,δ,ε)T_{1}(n,\beta,\delta,\varepsilon) denote the worst-case running-time of 11-Median-Candidates for an arbitrary input-set TT with |T|=n\lvert T\rvert=n and let C(n,β,δ,ε)C(n,\beta,\delta,\varepsilon) denote the maximum number of candidates it returns. Also, let TdT_{d} denote the worst-case running-time needed to compute the distance ρ\rho between an input element and a candidate.

If T1T_{1} and CC are non-decreasing in nn, Algorithm 5 has running-time O(C(n,β,δ,ε)k+2nTd+C(n,β,δ,ε)k+1T1(n,β,δ,ε))O(C(n,\beta,\delta,\varepsilon)^{k+2}\cdot n\cdot T_{d}+C(n,\beta,\delta,\varepsilon)^{k+1}\cdot T_{1}(n,\beta,\delta,\varepsilon)).
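To get a feeling for the recursion behind this bound, one can count the plugin invocations via the recurrence P(n,κ)=1+P(n/2,κ)+CP(n,κ1)P(n,\kappa)=1+P(n/2,\kappa)+C\cdot P(n,\kappa-1) with P(,0)=0P(\cdot,0)=0, induced by the pruning and candidate phases. A small memoized evaluator (our own illustration, assuming nn is a power of two and a plugin that always returns exactly cc candidates):

```python
from functools import lru_cache

def plugin_calls(n: int, kappa: int, c: int) -> int:
    """Number of 1-Median-Candidates invocations made by the recursion tree."""
    @lru_cache(maxsize=None)
    def count(m, k):
        if k == 0 or m == 0:
            return 0
        pruned = count(m // 2, k) if m >= 2 else 0   # pruning-phase recursion
        return 1 + pruned + c * count(m, k - 1)      # candidate-phase branching
    return count(n, kappa)
```

For fixed kk this count is roughly of order ck1(log2n)kc^{k-1}(\log_{2}n)^{k}, i.e. polynomial in the number of candidates and polylogarithmic in nn; combined with the evaluation of all collected solutions against TT, this is consistent with the C(n,β,δ,ε)k+1T1C(n,\beta,\delta,\varepsilon)^{k+1}\cdot T_{1} and C(n,β,δ,ε)k+2nTdC(n,\beta,\delta,\varepsilon)^{k+2}\cdot n\cdot T_{d} terms of the theorem.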

Now we state our main results, which follow from Theorems 4.2 and 4.3, respectively Theorems 6.2 and 6.3, and Theorems 7.2 and 7.3.

Corollary 7.4.

Given two parameters δ,ε(0,1)\delta,\varepsilon\in(0,1) and a set T𝕏mdT\subset\mathbb{X}^{d}_{m} of polygonal curves, Algorithm 5 endowed with Algorithm 2 as 11-Median-Candidates and run with parameters (T,,k,20k2ε+2k,δ,ε/5)(T,\emptyset,k,\frac{20k^{2}}{\varepsilon}+2k,\delta,\varepsilon/5) returns with probability at least 1δ1-\delta a set C𝕏22dC\subset\mathbb{X}^{d}_{2\ell-2} that is a (3+ε)(3+\varepsilon)-approximate solution to the (k,)(k,\ell)-median for TT. Algorithm 5 then has running-time n2O((ln(δ))2ε3+log(m))n\cdot 2^{O\left(\frac{(-\ln(\delta))^{2}}{\varepsilon^{3}}+\log(m)\right)}.

Corollary 7.5.

Given two parameters δ(0,1),ε(0,0.158]\delta\in(0,1),\varepsilon\in(0,0.158] and a set T𝕏mdT\subset\mathbb{X}^{d}_{m} of polygonal curves, Algorithm 5 endowed with Algorithm 4 as 11-Median-Candidates and run with parameters (T,,k,12k2ε+2k,δ,ε/3)(T,\emptyset,k,\frac{12k^{2}}{\varepsilon}+2k,\delta,\varepsilon/3) returns with probability at least 1δ1-\delta a set C𝕏22dC\subset\mathbb{X}^{d}_{2\ell-2} that is a (1+ε)(1+\varepsilon)-approximate solution to the (k,)(k,\ell)-median for TT. Algorithm 5 then has running-time n2O((ln(δ))2ε3+log(m))n\cdot 2^{O\left(\frac{(-\ln(\delta))^{2}}{\varepsilon^{3}}+\log(m)\right)}.

The following proof is an adaptation of [2, Theorem 2.2 - Theorem 2.5].

Proof of Theorem 7.2.

For k=1k=1, the claim trivially holds. We now distinguish two cases. In the first case, the principle of the proof is presented in full detail. In the second case, we only show how to generalize the first case to k>2k>2.

Case 1: k=2k=2

Let C={c1,c2}C^{\ast}=\{c^{\ast}_{1},c^{\ast}_{2}\} be an optimal set of kk medians for TT with clusters T1T^{\ast}_{1} and T2T^{\ast}_{2}, respectively, that form a partition of TT. For the sake of simplicity, assume that nn is a power of 22 and w.l.o.g. assume that |T1|12|T|>1β|T|\lvert T^{\ast}_{1}\rvert\geq\frac{1}{2}\lvert T\rvert>\frac{1}{\beta}\lvert T\rvert. Let C1C_{1} be the set of candidates returned by 11-Median-Candidates in the initial call. With probability at least 1δ/k1-\delta/k, there is a c1C1c_{1}\in C_{1} with cost(T1,c1)(α+ε)cost(T1,c1)\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen. We distinguish two cases:

Case 1.1:

There exists a recursive call with parameters (T,{c1},1,β,δ,ε)(T^{\prime},\{c_{1}\},1,\beta,\delta,\varepsilon) and |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert.

First, we assume that TT^{\prime} is the maximum cardinality input with |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert, occurring in a recursive call of the algorithm. Let C2C_{2} be the set of candidates returned by 11-Median-Candidates in this call. With probability at least 1δ/k1-\delta/k, there is a c2C2c_{2}\in C_{2} with cost(T2T,c2)(α+ε)cost(T2T,c~2)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},\widetilde{c}_{2}\right\rparen, where c~2\widetilde{c}_{2} is an optimal median for T2TT^{\ast}_{2}\cap T^{\prime}.

Let PP be the set of elements of TT removed in the mm\in\mathbb{N}, mlog2(n)m\leq\log_{2}(n), pruning phases between obtaining c1c_{1} and c2c_{2}. Without loss of generality we assume that PP\neq\emptyset. For i{1,,m}i\in\{1,\dots,m\}, let PiP_{i} be the elements removed in the iith (in the order of the recursive calls occurring) pruning phase. Note that the PiP_{i} are pairwise disjoint, we have that P=i=1mPiP=\cup_{i=1}^{m}P_{i} and |Pi|=n2i\lvert P_{i}\rvert=\frac{n}{2^{i}}. Since T=T1(T2T)(T2P)T=T^{\ast}_{1}\uplus(T^{\ast}_{2}\cap T^{\prime})\uplus(T^{\ast}_{2}\cap P), we have

cost(T,{c1,c2})cost(T1,c1)+cost(T2T,c2)+cost(T2P,c1).\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen\leq\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen. (I)

Our aim is now to prove that the number of elements wrongly assigned to c1c_{1}, i.e., T2PT^{\ast}_{2}\cap P, is small and further, that their cost is a fraction of the cost of the elements correctly assigned to c1c_{1}, i.e., T1T^{\ast}_{1}.

We define R0=TR_{0}=T and for i{1,,m}i\in\{1,\dots,m\} we define Ri=Ri1PiR_{i}=R_{i-1}\setminus P_{i}. The RiR_{i} are the elements remaining after the iith pruning phase. Note that by definition |Ri|=n2i=|Pi|\lvert R_{i}\rvert=\frac{n}{2^{i}}=\lvert P_{i}\rvert. Since Rm=TR_{m}=T^{\prime} is the maximum cardinality input, with |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert, we have that |T2Ri|<1β|Ri|\lvert T^{\ast}_{2}\cap R_{i}\rvert<\frac{1}{\beta}\lvert R_{i}\rvert for all i{1,,m1}i\in\{1,\dots,m-1\}. Also, for each i{1,,m}i\in\{1,\dots,m\} we have PiRi1P_{i}\subseteq R_{i-1}, therefore

|T2Pi||T2Ri1|<1β|Ri1|=2βn2i\displaystyle\lvert T^{\ast}_{2}\cap P_{i}\rvert\leq\lvert T^{\ast}_{2}\cap R_{i-1}\rvert<\frac{1}{\beta}\lvert R_{i-1}\rvert=\frac{2}{\beta}\frac{n}{2^{i}} (II)

and as an immediate consequence

|T1Pi|=|Pi||T2Pi|>|Pi|1β|Ri1|=(12β)n2i.\displaystyle\lvert T^{\ast}_{1}\cap P_{i}\rvert=\lvert P_{i}\rvert-\lvert T^{\ast}_{2}\cap P_{i}\rvert>\lvert P_{i}\rvert-\frac{1}{\beta}\lvert R_{i-1}\rvert=\left(1-\frac{2}{\beta}\right)\frac{n}{2^{i}}. (III)

This tells us that mainly the elements of T1T^{\ast}_{1} are removed in the pruning phase and only very few elements of T2T^{\ast}_{2}. By definition, we have for all i{1,,m1}i\in\{1,\dots,m-1\}, σPi\sigma\in P_{i} and τPi+1\tau\in P_{i+1} that ρ(σ,c1)ρ(τ,c1)\rho(\sigma,c_{1})\leq\rho(\tau,c_{1}), hence

1|T2Pi|cost(T2Pi,c1)1|T1Pi+1|cost(T1Pi+1,c1).\frac{1}{\lvert T^{\ast}_{2}\cap P_{i}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen\leq\frac{1}{\lvert T^{\ast}_{1}\cap P_{i+1}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen.

Combining this inequality with Eqs. II and III we obtain for i{1,,m1}i\in\{1,\dots,m-1\}:

β2i2ncost(T2Pi,c1)<2i+1(12/β)ncost(T1Pi+1,c1)\displaystyle\frac{\beta 2^{i}}{2n}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{2^{i+1}}{(1-2/\beta)n}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen
\displaystyle\Leftrightarrow cost(T2Pi,c1)<2i+12n(12/β)nβ2icost(T1Pi+1,c1)=4(β2)cost(T1Pi+1,c1).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{2^{i+1}2n}{(1-2/\beta)n\beta 2^{i}}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen=\frac{4}{(\beta-2)}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen. (IV)

We still need such a bound for i=mi=m. Since |Rm|=|Pm|\lvert R_{m}\rvert=\lvert P_{m}\rvert and also RmRm1R_{m}\subseteq R_{m-1} we can use Eq. II to obtain:

|T1Rm|=|Rm||T2Rm||Rm||T2Rm1|>(12β)n2m.\displaystyle\lvert T^{\ast}_{1}\cap R_{m}\rvert=\lvert R_{m}\rvert-\lvert T^{\ast}_{2}\cap R_{m}\rvert\geq\lvert R_{m}\rvert-\lvert T^{\ast}_{2}\cap R_{m-1}\rvert>\left(1-\frac{2}{\beta}\right)\frac{n}{2^{m}}. (V)

Also, we have for all σPm\sigma\in P_{m} and τRm\tau\in R_{m} that ρ(σ,c1)ρ(τ,c1)\rho(\sigma,c_{1})\leq\rho(\tau,c_{1}) by definition, thus

1|T2Pm|cost(T2Pm,c1)1|T1Rm|cost(T1Rm,c1).\frac{1}{\lvert T^{\ast}_{2}\cap P_{m}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen\leq\frac{1}{\lvert T^{\ast}_{1}\cap R_{m}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen.

We combine this inequality with Eq. II and Eq. V and obtain:

β2m2ncost(T2Pm,c1)<2m(12/β)ncost(T1Rm,c1)\displaystyle\frac{\beta 2^{m}}{2n}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen<\frac{2^{m}}{(1-2/\beta)n}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen
\displaystyle\Leftrightarrow cost(T2Pm,c1)<2(β2)cost(T1Rm,c1).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen<\frac{2}{(\beta-2)}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen. (VI)

We are now ready to bound the cost of the elements of T2T^{\ast}_{2} wrongly assigned to c1c_{1}. Combining Eq. IV and Eq. VI yields:

cost(T2P,c1)\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen =i=1mcost(T2Pi,c1)<4β2i=1m1cost(T1Pi+1,c1)+2β2cost(T1Rm,c1)\displaystyle={}\sum_{i=1}^{m}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{4}{\beta-2}\sum_{i=1}^{m-1}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen+\frac{2}{\beta-2}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen
<4β2cost(T1,c1).\displaystyle<\frac{4}{\beta-2}\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen.

Here, the last inequality holds, because P2,,PmP_{2},\dots,P_{m} and RmR_{m} are pairwise disjoint. Also, we have

cost(T2T,c2)(α+ε)cost(T2T,c2~)(α+ε)cost(T2T,c2)(α+ε)cost(T2,c2).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},\widetilde{c_{2}}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c^{\ast}_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2},c^{\ast}_{2}\right\rparen.

Finally, using Eq. I and a union bound, with probability at least 1δ1-\delta the following holds:

cost(T,{c1,c2})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen <(α+ε)cost(T1,c1)+(α+ε)cost(T2,c2)+4β2(α+ε)cost(T1,c1)\displaystyle<(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen+(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2},c^{\ast}_{2}\right\rparen+\frac{4}{\beta-2}(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen
<(1+4β2)(α+ε)cost(T,C)=(1+4kkβ2k)(α+ε)cost(T,C)\displaystyle<\left(1+\frac{4}{\beta-2}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen=\left(1+\frac{4k}{k\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen
(1+4k2β2k)(α+ε)cost(T,C).\displaystyle\leq{}\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.

Case 1.2: For all recursive calls with parameters (T,{c1},1,β,δ,ε)(T^{\prime},\{c_{1}\},1,\beta,\delta,\varepsilon) it holds that |T2T|<1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert.

After log2(n)\log_{2}(n) pruning phases we end up with a singleton {σ}=T\{\sigma\}=T^{\prime} as input set. Since |T2T|<1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert, it must be that 0=|T2T|<1β|T|=1β<10=\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert=\frac{1}{\beta}<1 and thus σT1\sigma\in T^{\ast}_{1}.

Let C2C_{2} be the set of candidates returned by 11-Median-Candidates in this call. With probability at least 1δ/k1-\delta/k there is a c2C2c_{2}\in C_{2} with cost({σ},c2)(α+ε)cost({σ},c~2)(α+ε)cost({σ},c1)\operatorname{cost}\left\lparen\{\sigma\},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen\{\sigma\},\widetilde{c}_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen\{\sigma\},c^{\ast}_{1}\right\rparen, where c~2\widetilde{c}_{2} is an optimal median for {σ}\{\sigma\}. Since cost(T2P,c1)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen is bounded as in Case 1.1, by a union bound we have with probability at least 1δ1-\delta:

cost(T,{c1,c2})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen cost(T1{σ},c1)+cost(T2P,c1)+cost({σ},c2)\displaystyle\leq{}\operatorname{cost}\left\lparen T^{\ast}_{1}\setminus\{\sigma\},c_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen+\operatorname{cost}\left\lparen\{\sigma\},c_{2}\right\rparen
(α+ε)cost(T1,c1)+cost(T2P,c1)\displaystyle\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen
(1+4β2)(α+ε)cost(T,C)\displaystyle\leq\left(1+\frac{4}{\beta-2}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen
(1+4k2β2k)(α+ε)cost(T,C).\displaystyle\leq\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.

Case 2: k>2k>2

We only prove the generalization of Case 1.1 to k>2k>2; the remainder of the proof is analogous to Case 1. For the sake of brevity, for ii\in\mathbb{N}, we define [i]={1,,i}[i]=\{1,\dots,i\}. Let C={c1,,ck}C^{\ast}=\{c^{\ast}_{1},\dots,c^{\ast}_{k}\} be an optimal set of kk medians for TT with clusters T1,,TkT^{\ast}_{1},\dots,T^{\ast}_{k}, respectively, that form a partition of TT. For the sake of simplicity, assume that nn is a power of 22 and w.l.o.g. assume |T1||Tk|\lvert T^{\ast}_{1}\rvert\geq\dots\geq\lvert T^{\ast}_{k}\rvert. For i[k]i\in[k] and j[k][i]j\in[k]\setminus[i] we define Ti,j=t=ijTtT^{\ast}_{i,j}=\uplus_{t=i}^{j}T^{\ast}_{t}.

Let 𝒯0=T\mathcal{T}_{0}=T and let (𝒯j=𝒯j1𝒫j)j=1m(\mathcal{T}_{j}=\mathcal{T}_{j-1}\setminus\mathcal{P}_{j})_{j=1}^{m} be the sequence of input sets in the recursive calls of the mm\in\mathbb{N}, mlog2(n)m\leq\log_{2}(n), pruning phases, where 𝒫j\mathcal{P}_{j} is the set of elements removed in the jjth (in the order of the recursive calls occurring) pruning phase. Let 𝒯={𝒯0}{𝒯jj[m]}\mathcal{T}=\{\mathcal{T}_{0}\}\cup\{\mathcal{T}_{j}\mid j\in[m]\}. For i[k]i\in[k], let TiT_{i} be the maximum cardinality set in 𝒯\mathcal{T}, with |TiTi|1β|Ti|\lvert T^{\ast}_{i}\cap T_{i}\rvert\geq\frac{1}{\beta}\lvert T_{i}\rvert. Note that by assumption and since β>2k\beta>2k, T1=TT_{1}=T must hold and also TjTiT_{j}\subset T_{i} for j[k][i]j\in[k]\setminus[i].

Using a union bound, with probability at least 1δ1-\delta, for each i[k]i\in[k] the call of 11-Median-Candidates with input TiT_{i} yields a candidate cic_{i} with

cost(TiTi,ci)(α+ε)cost(TiTi,c~i)(α+ε)cost(TiTi,ci)(α+ε)cost(Ti,ci),\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},\widetilde{c}_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c^{\ast}_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen, (I)

where c~i\widetilde{c}_{i} is an optimal 11-median for TiTiT^{\ast}_{i}\cap T_{i}. Let C={c1,,ck}C=\{c_{1},\dots,c_{k}\} be the set of these candidates and for i[k1]i\in[k-1], let Pi=TiTi+1P_{i}=T_{i}\setminus T_{i+1} denote the set of elements of TT removed by the pruning phases between obtaining cic_{i} and ci+1c_{i+1}. Note that the PiP_{i} are pairwise disjoint.

By definition, the sets

T1T1,,TkTk,T2,kP1,,Tk,kPk1T^{\ast}_{1}\cap T_{1},\dots,T^{\ast}_{k}\cap T_{k},T^{\ast}_{2,k}\cap P_{1},\dots,T^{\ast}_{k,k}\cap P_{k-1}

form a partition of TT, therefore

cost(T,{c1,,ck})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},\dots,c_{k}\}\right\rparen i=1kcost(TiTi,ci)+i=1k1cost(Ti+1,kPi,{c1,,ci})\displaystyle\leq{}\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen
(α+ε)i=1kcost(Ti,ci)+i=1k1cost(Ti+1,kPi,{c1,,ci}).\displaystyle\leq{}(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen. (II)

Now, it only remains to bound the cost of the wrongly assigned elements of Ti+1,kT^{\ast}_{i+1,k}. For i[k]i\in[k], let ni=|Ti|n_{i}=\lvert T_{i}\rvert and w.l.o.g. assume that PiP_{i}\neq\emptyset for each i[k1]i\in[k-1]. Each PiP_{i} is the disjoint union j=1miPi,j\uplus_{j=1}^{m_{i}}P_{i,j} of mim_{i}\in\mathbb{N} sets of elements of TT removed in the interim pruning phases and it holds that |Pi,j|=ni2j\lvert P_{i,j}\rvert=\frac{n_{i}}{2^{j}}. We now prove for each i[k1]i\in[k-1] and j[mi]j\in[m_{i}] that PiP_{i} contains a large number of elements from T1,iT^{\ast}_{1,i} and only a few elements from Ti+1,kT^{\ast}_{i+1,k}.

For i[k1]i\in[k-1], we define Ri,0=TiR_{i,0}=T_{i} and for j[mi]j\in[m_{i}] we define Ri,j=Ri,j1Pi,jR_{i,j}=R_{i,j-1}\setminus P_{i,j}. By definition, |Ri,j|=ni2j=|Pi,j|\lvert R_{i,j}\rvert=\frac{n_{i}}{2^{j}}=\lvert P_{i,j}\rvert, Ri,j1Ri,j2R_{i,j_{1}}\supset R_{i,j_{2}} for each j1[mi]j_{1}\in[m_{i}] and j2[mi][j1]j_{2}\in[m_{i}]\setminus[j_{1}], also Ri,mi=Ti+1R_{i,m_{i}}=T_{i+1}. Thus, |TtRi,j|<1β|Ri,j|\lvert T^{\ast}_{t}\cap R_{i,j}\rvert<\frac{1}{\beta}\lvert R_{i,j}\rvert for all i[k1],j[mi]i\in[k-1],j\in[m_{i}] and t[k][i]t\in[k]\setminus[i]. As an immediate consequence we obtain |Ti+1,kRi,j|kβ|Ri,j|\lvert T^{\ast}_{i+1,k}\cap R_{i,j}\rvert\leq\frac{k}{\beta}\lvert R_{i,j}\rvert. Since Pi,jRi,j1P_{i,j}\subseteq R_{i,j-1} for all i[k1]i\in[k-1] and j[mi]j\in[m_{i}], we have

|Ti+1,kPi,j||Ti+1,kRi,j1|kβ|Ri,j1|=2kβni2j,\displaystyle\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert\leq\lvert T^{\ast}_{i+1,k}\cap R_{i,j-1}\rvert\leq\frac{k}{\beta}\lvert R_{i,j-1}\rvert=\frac{2k}{\beta}\frac{n_{i}}{2^{j}}, (III)

which immediately yields

|T1,iPi,j|=|Pi,j||Ti+1,kPi,j|(12kβ)ni2j.\displaystyle\lvert T^{\ast}_{1,i}\cap P_{i,j}\rvert=\lvert P_{i,j}\rvert-\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert\geq\left(1-\frac{2k}{\beta}\right)\frac{n_{i}}{2^{j}}. (IV)

Now, by definition, for all i[k1]i\in[k-1], j[mi]{mi}j\in[m_{i}]\setminus\{m_{i}\}, σPi,j\sigma\in P_{i,j} and τPi,j+1\tau\in P_{i,j+1} we have minc{c1,,ci}ρ(σ,c)minc{c1,,ci}ρ(τ,c)\min\limits_{c\in\{c_{1},\dots,c_{i}\}}\rho(\sigma,c)\leq\min\limits_{c\in\{c_{1},\dots,c_{i}\}}\rho(\tau,c). Thus,

$$\frac{\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert}\leq\frac{\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{1,i}\cap P_{i,j+1}\rvert}.$$

Combining this inequality with Eqs. III and IV yields, for $i\in[k-1]$ and $j\in[m_{i}]\setminus\{m_{i}\}$:

$$\begin{aligned}
&\frac{\beta 2^{j}}{2kn_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\leq\frac{2^{j+1}}{(1-\frac{2k}{\beta})n_{i}}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\\
\Leftrightarrow\ &\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\leq\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\end{aligned}\tag{V}$$

For each $i\in[k-1]$ we still need an upper bound on $\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen$. Since $\lvert R_{i,m_{i}}\rvert=\lvert P_{i,m_{i}}\rvert$ and $R_{i,m_{i}}\subseteq R_{i,m_{i}-1}$, we can use Eq. III to obtain

$$\lvert T^{\ast}_{1,i}\cap R_{i,m_{i}}\rvert=\lvert R_{i,m_{i}}\rvert-\lvert T^{\ast}_{i+1,k}\cap R_{i,m_{i}}\rvert\geq\lvert R_{i,m_{i}}\rvert-\lvert T^{\ast}_{i+1,k}\cap R_{i,m_{i}-1}\rvert>\left(1-\frac{2k}{\beta}\right)\frac{n_{i}}{2^{m_{i}}}.\tag{VI}$$

By definition, for all $i\in[k-1]$, $\sigma\in P_{i,m_{i}}$ and $\tau\in R_{i,m_{i}}$ we also have $\min_{c\in\{c_{1},\dots,c_{i}\}}\rho(\sigma,c)\leq\min_{c\in\{c_{1},\dots,c_{i}\}}\rho(\tau,c)$. Thus,

$$\frac{\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{i+1,k}\cap P_{i,m_{i}}\rvert}\leq\frac{\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{1,i}\cap R_{i,m_{i}}\rvert}.$$

Combining this inequality with Eqs. III and VI yields:

$$\begin{aligned}
&\frac{\beta 2^{m_{i}}}{2kn_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen<\frac{2^{m_{i}}}{(1-\frac{2k}{\beta})n_{i}}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen\\
\Leftrightarrow\ &\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen<\frac{2k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen.\end{aligned}\tag{VII}$$

We can now give the following bound, combining Eqs. V and VII, for each $i\in[k-1]$:

$$\begin{aligned}
\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen&=\sum_{j=1}^{m_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<\sum_{j=1}^{m_{i}-1}\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\\
&\ \ \ +\frac{2k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap T_{i},\{c_{1},\dots,c_{i}\}\right\rparen.\end{aligned}\tag{VIII}$$

Here, the last inequality holds because $P_{i,2},\dots,P_{i,m_{i}}$ and $R_{i,m_{i}}$ are pairwise disjoint subsets of $T_{i}$.
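The coefficients $\frac{4k}{\beta-2k}$ and $\frac{2k}{\beta-2k}$ in Eqs. V and VII arise purely from dividing the cardinality bounds of Eqs. III, IV and VI. The following sketch double-checks this algebra with exact rational arithmetic; the concrete values of $k$, $\beta$, $n_{i}$ and $j$ are illustrative choices (only $\beta>2k$ matters), not values from the paper:

```python
from fractions import Fraction as F

# Illustrative values (not from the paper); only beta > 2k matters.
k, beta, n_i, j = 3, 20, 1024, 4

# Eq. III: |T*_{i+1,k} ∩ P_{i,j}|  <=  (2k/beta) * n_i / 2^j
upper = F(2 * k, beta) * F(n_i, 2 ** j)
# Eq. IV:  |T*_{1,i} ∩ P_{i,j+1}|  >=  (1 - 2k/beta) * n_i / 2^(j+1)
lower = (1 - F(2 * k, beta)) * F(n_i, 2 ** (j + 1))

# Dividing the two cardinality bounds gives the coefficient of Eq. V ...
assert upper / lower == F(4 * k, beta - 2 * k)
# ... and |R_{i,m_i}| > (1 - 2k/beta) * n_i / 2^(m_i) (Eq. VI, one halving
# less than Eq. IV) gives the coefficient of Eq. VII:
assert upper / (2 * lower) == F(2 * k, beta - 2 * k)
```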

Now, we plug this bound into Eq. II. Note that $T^{\ast}_{j}\cap T_{i}\subseteq T^{\ast}_{j}\cap T_{j}$ for each $i\in[k]$ and $j\in[i]$ by definition. We obtain:

$$\begin{aligned}
\operatorname{cost}\left\lparen T,\{c_{1},\dots,c_{k}\}\right\rparen&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap T_{i},\{c_{1},\dots,c_{i}\}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\sum_{t=1}^{i}\operatorname{cost}\left\lparen T^{\ast}_{t}\cap T_{i},c_{t}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\sum_{t=1}^{i}\operatorname{cost}\left\lparen T^{\ast}_{t}\cap T_{t},c_{t}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k^{2}}{\beta-2k}\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen\\
&\leq\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen=\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.\end{aligned}$$

The last inequality follows from Eq. I. ∎
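To see how the pruning parameter $\beta$ controls the resulting guarantee $\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)$, note that choosing $\beta=2k+4k^{2}/\varepsilon^{\prime}$ makes the first factor exactly $1+\varepsilon^{\prime}$. A small sketch with exact rational arithmetic; the values of $k$ and $\varepsilon^{\prime}$ are illustrative:

```python
from fractions import Fraction

def blowup(k, beta):
    # First factor (1 + 4k^2/(beta - 2k)) of the approximation guarantee.
    return 1 + Fraction(4 * k * k) / (beta - 2 * k)

# Illustrative choice: k = 3 clusters, target blow-up eps' = 1/2.
k = 3
eps_prime = Fraction(1, 2)
beta = 2 * k + Fraction(4 * k * k) / eps_prime  # beta = 78
assert blowup(k, beta) == 1 + eps_prime
# The factor is decreasing in beta, so a larger beta only helps:
assert blowup(k, 2 * beta) < blowup(k, beta)
```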

The following analysis of the worst-case running-time of Algorithm 4 is a slight adaptation of [2, Theorem 2.8], which is also provided for the sake of completeness.

Proof of Theorem 7.3.

For the sake of simplicity, we assume that $n$ is a power of $2$.

If $\kappa=0$, Algorithm 5 has running-time $c_{1}\in O(1)$, and if $\kappa\geq n$, it has running-time $c_{2}\cdot n\in O(n)$.

Let $T(n,\kappa,\beta,\delta,\varepsilon)$ denote the worst-case running-time of Algorithm 5 on an input set $T$ with $\lvert T\rvert=n$. If $n>\kappa\geq 1$, Algorithm 5 needs time at most $c_{3}\cdot(n\cdot T_{d}+n)\in O(n\cdot T_{d})$ to obtain $P$, time $T(n/2,\kappa,\beta,\delta,\varepsilon)$ for the recursive call in the pruning phase, time $T_{1}(n,\beta,\delta,\varepsilon)$ to obtain the candidates, time $C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)$ for the recursive calls in the candidate phase (one per candidate), and time $c_{4}\cdot n\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)\in O(n\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon))$ to eventually evaluate the candidate sets. Let $c=\max\{c_{1},c_{2},c_{3},c_{4}\}$. We obtain the following recurrence relation:

$$T(n,\kappa,\beta,\delta,\varepsilon)\leq\begin{cases}c&\text{if }\kappa=0\\cn&\text{if }\kappa\geq n\\C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)+T(n/2,\kappa,\beta,\delta,\varepsilon)&\\\quad+T_{1}(n,\beta,\delta,\varepsilon)+cn\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)&\text{else.}\end{cases}$$

Let $f(n,\beta,\delta,\varepsilon)=\frac{1}{cn}\cdot T_{1}(n,\beta,\delta,\varepsilon)+T_{d}\cdot C(n,\beta,\delta,\varepsilon)$.

We prove by induction on $n$ and $\kappa$ that $T(n,\kappa,\beta,\delta,\varepsilon)\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)$.

For κ=0\kappa=0 we have T(n,κ,β,δ,ε)ccnc40C(n,β,δ,ε)nf(n,β,δ,ε)T(n,\kappa,\beta,\delta,\varepsilon)\leq c\leq cn\leq c\cdot 4^{0}\cdot C(n,\beta,\delta,\varepsilon)\cdot n\cdot f(n,\beta,\delta,\varepsilon).

For $\kappa\geq n$ we have $T(n,\kappa,\beta,\delta,\varepsilon)\leq cn\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)$.

Now, let $n>\kappa\geq 1$ and assume the claim holds for $T(n^{\prime},\kappa^{\prime},\beta,\delta,\varepsilon)$ whenever $\kappa^{\prime}<\kappa$ and $n^{\prime}\leq n$, as well as whenever $\kappa^{\prime}=\kappa$ and $n^{\prime}<n$. We have:

$$\begin{aligned}
T(n,\kappa,\beta,\delta,\varepsilon)&\leq C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)+T(n/2,\kappa,\beta,\delta,\varepsilon)\\
&\ \ \ +T_{1}(n,\beta,\delta,\varepsilon)+cn\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)\\
&\leq C(n,\beta,\delta,\varepsilon)\cdot c\cdot 4^{\kappa-1}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa}\cdot n\cdot f(n,\beta,\delta,\varepsilon)\\
&\ \ \ +c\cdot 4^{\kappa}\cdot C(n/2,\beta,\delta,\varepsilon)^{\kappa+1}\cdot\frac{n}{2}\cdot f(n/2,\beta,\delta,\varepsilon)\\
&\ \ \ +cn\cdot f(n,\beta,\delta,\varepsilon)\\
&\leq\left(\frac{1}{4}+\frac{1}{2}+\frac{1}{4^{\kappa}C(n,\beta,\delta,\varepsilon)^{\kappa+1}}\right)c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)\\
&\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon).\end{aligned}$$

The last inequality holds because $\frac{1}{4^{\kappa}C(n,\beta,\delta,\varepsilon)^{\kappa+1}}\leq\frac{1}{4}$, since $\kappa\geq 1$ and $C(n,\beta,\delta,\varepsilon)\geq 1$, and the claim follows by induction. ∎
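As a sanity check on this induction, one can iterate the recurrence numerically and compare it against the claimed bound. The sketch below instantiates the recurrence with toy constants (a constant candidate-set size $C$, candidate-computation time $T_{1}(n)=n$, $T_{d}=2$ and $c=1$); these values are illustrative assumptions, not ones derived in the paper:

```python
from functools import lru_cache

# Toy instantiation of the recurrence (illustrative constants):
C, T_d, c = 3, 2, 1            # candidate-set size, distance cost, constant c
def T1(n): return n             # candidate-computation time
def f(n): return T1(n) / (c * n) + T_d * C

@lru_cache(maxsize=None)
def T(n, kappa):
    if kappa == 0:
        return c
    if kappa >= n:
        return c * n
    return (C * T(n, kappa - 1) + T(n // 2, kappa)
            + T1(n) + c * n * T_d * C)

# Claimed bound: T(n, kappa) <= c * 4^kappa * C^(kappa+1) * n * f(n)
for n in (2 ** i for i in range(1, 11)):      # n a power of two
    for kappa in range(6):
        assert T(n, kappa) <= c * 4 ** kappa * C ** (kappa + 1) * n * f(n)
```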

8 Conclusion

We have developed bicriteria approximation algorithms for $(k,\ell)$-median clustering of polygonal curves under the Fréchet distance. While it turned out to be relatively easy to obtain a good approximation with centers of up to $2\ell$ vertices in reasonable time, a way to obtain good approximate centers with up to $\ell$ vertices in reasonable time is not in sight. This is due to the continuous Fréchet distance: the vertices of a median need not be anywhere near a vertex of an input curve, resulting in a huge search space. If we cover the whole search space by, say, grids, the worst-case running-time of the resulting algorithms becomes dependent on the arc-lengths of the input curves' edges, which is not acceptable. We note that $g$-coverability of the continuous Fréchet distance would imply the existence of sublinear-size $\varepsilon$-coresets for $(k,\ell)$-center clustering of polygonal curves under the Fréchet distance. It is an interesting open question whether $g$-coverability holds for the continuous Fréchet distance. In contrast to the doubling dimension, which was shown to be infinite even for curves of bounded complexity [15], the VC-dimension of metric balls under the continuous Fréchet distance is bounded in terms of the complexities $\ell$ and $m$ of the curves [16]. Whether this bound can be combined with the framework of Feldman and Langberg [17] to achieve faster approximations for the $(k,\ell)$-median problem under the continuous Fréchet distance is another interesting open problem. The general relationship between the VC-dimension of range spaces derived from metric spaces and their doubling properties is a topic of ongoing research; see for example Huang et al. [21].

References

  • Abraham et al. [2003] C. Abraham, P. A. Cornillon, E. Matzner-Løber, and N. Molinari. Unsupervised curve clustering using b-splines. Scandinavian Journal of Statistics, 30(3):581–595, 2003.
  • Ackermann et al. [2010] Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms, 6(4):59:1–59:26, 2010.
  • Agarwal et al. [2002] Pankaj K. Agarwal, Sariel Har-Peled, Nabil H. Mustafa, and Yusu Wang. Near-linear time approximation algorithms for curve simplification. In Rolf Möhring and Rajeev Raman, editors, Algorithms - ESA, pages 29–41. Springer, 2002.
  • Alt and Godau [1995] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5:75–91, 1995.
  • Banerjee et al. [2005] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
  • Bansal et al. [2004] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
  • Ben-Hur et al. [2001] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125–137, 2001.
  • Buchin et al. [2008] Kevin Buchin, Maike Buchin, and Carola Wenk. Computing the Fréchet distance between simple polygons. Comput. Geom., 41(1-2):2–20, 2008.
  • Buchin et al. [2019a] Kevin Buchin, Anne Driemel, Joachim Gudmundsson, Michael Horton, Irina Kostitsyna, Maarten Löffler, and Martijn Struijs. Approximating (k, l)-center clustering for curves. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2922–2938, 2019a.
  • Buchin et al. [2019b] Kevin Buchin, Anne Driemel, Natasja van de L’Isle, and André Nusser. klcluster: Center-based clustering of trajectories. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 496–499, 2019b.
  • Buchin et al. [2020] Kevin Buchin, Anne Driemel, and Martijn Struijs. On the hardness of computing an average curve. In Susanne Albers, editor, 17th Scandinavian Symposium and Workshops on Algorithm Theory, volume 162 of LIPIcs, pages 19:1–19:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Chiou and Li [2007] Jeng-Min Chiou and Pai-Ling Li. Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):679–699, 2007.
  • Cilibrasi and Vitányi [2005] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.
  • Driemel and Har-Peled [2013] Anne Driemel and Sariel Har-Peled. Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM Journal on Computing, 42(5):1830–1866, 2013.
  • Driemel et al. [2016] Anne Driemel, Amer Krivosija, and Christian Sohler. Clustering time series under the Fréchet distance. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 766–785, 2016.
  • Driemel et al. [2019] Anne Driemel, Jeff M. Phillips, and Ioannis Psarros. The VC dimension of metric balls under Fréchet and Hausdorff distances. In 35th International Symposium on Computational Geometry, pages 28:1–28:16, 2019.
  • Feldman and Langberg [2011] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Lance Fortnow and Salil P. Vadhan, editors, Proceedings of the 43rd ACM Symposium on Theory of Computing, pages 569–578. ACM, 2011.
  • Garcia-Escudero and Gordaliza [2005] Luis Angel Garcia-Escudero and Alfonso Gordaliza. A proposal for robust curve clustering. Journal of Classification, 22(2):185–201, 2005.
  • Guha and Mishra [2016] Sudipto Guha and Nina Mishra. Clustering data streams. In Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi, editors, Data Stream Management - Processing High-Speed Data Streams, Data-Centric Systems and Applications, pages 169–187. Springer, 2016.
  • Har-Peled and Mazumdar [2004] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 291–300, 2004.
  • Huang et al. [2018] Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science, pages 814–825. IEEE Computer Society, 2018.
  • Imai and Iri [1988] Hiroshi Imai and Masao Iri. Polygonal Approximations of a Curve — Formulations and Algorithms. Machine Intelligence and Pattern Recognition, 6:71–86, January 1988.
  • Indyk [2000] Piotr Indyk. High-dimensional Computational Geometry. PhD thesis, Stanford University, CA, USA, 2000.
  • Johnson [1967] Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
  • Kumar et al. [2004] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε\varepsilon)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’04, page 454–462. IEEE Computer Society, 2004.
  • Meintrup et al. [2019] Stefan Meintrup, Alexander Munteanu, and Dennis Rohde. Random projections and sampling algorithms for clustering of high-dimensional polygonal curves. In Advances in Neural Information Processing Systems 32, pages 12807–12817, 2019.
  • Mitzenmacher and Upfal [2017] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, USA, 2nd edition, 2017.
  • Nath and Taylor [2020] Abhinandan Nath and Erin Taylor. k-median clustering under discrete Fréchet and Hausdorff distances. In Sergio Cabello and Danny Z. Chen, editors, 36th International Symposium on Computational Geometry, volume 164 of LIPIcs, pages 58:1–58:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Petitjean and Gançarski [2012] François Petitjean and Pierre Gançarski. Summarizing a set of time series by averaging: From steiner sequence to compact multiple alignment. Theoretical Computer Science, 414(1):76 – 91, 2012.
  • Petitjean et al. [2011] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678 – 693, 2011.
  • Schaeffer [2007] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27 – 64, 2007.
  • Vidal [2011] René Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.