
Approximating $(k,\ell)$-Median Clustering for Polygonal Curves

Maike Buchin Faculty of Mathematics, Ruhr-University Bochum, Germany, maike.buchin@rub.de    Anne Driemel Hausdorff Center for Mathematics, University of Bonn, Germany, driemel@cs.uni-bonn.de    Dennis Rohde Faculty of Mathematics, Ruhr-University Bochum, Germany, dennis.rohde-t1b@rub.de
Abstract

In 2015, Driemel, Krivošija and Sohler introduced the $(k,\ell)$-median problem for clustering polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to find $k$ median curves of at most $\ell$ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in $\mathbb{R}^d$ and achieves approximation factor $(1+\varepsilon)$ with respect to the clustering costs. The algorithm has worst-case running-time linear in the number of curves, polynomial in the maximum number of vertices per curve (i.e., their complexity), and exponential in $d$, $\ell$, $\varepsilon$ and the failure probability $\delta$. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost as an optimal median curve of complexity $\ell$, but of complexity at most $2\ell-2$, and whose vertices can be computed efficiently. We combine this lemma with the superset-sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.

1 Introduction

Since the development of $k$-means – the pioneer of modern computational clustering – the last 65 years have brought a diversity of specialized [31, 7, 20, 6, 13, 19, 32] as well as generalized clustering algorithms [24, 2, 5]. However, in most cases clustering of point sets was studied. Many clustering problems indeed reduce to clustering of point sets, but for sequential data like time-series and trajectories – which arise in the natural sciences, medicine, sports, finance, ecology, audio/speech analysis, handwriting and many more – this is not the case. Hence, we need specialized clustering methods for these purposes, cf. [1, 12, 18, 29, 30].

A promising branch of this active research deals with $(k,\ell)$-center and $(k,\ell)$-median clustering – adaptions of the well-known Euclidean $k$-center and $k$-median clustering. In $(k,\ell)$-center clustering, respectively $(k,\ell)$-median clustering, we are given a set of $n$ polygonal curves in $\mathbb{R}^d$, each of complexity (i.e., number of vertices) at most $m$, and want to compute $k$ centers that minimize the objective function – just as in Euclidean $k$-clustering. In addition, the centers are restricted to have complexity at most $\ell$ each to prevent over-fitting – a problem specific to sequential data. A great benefit of regarding the sequential data as polygonal curves is that we introduce an implicit linear interpolation. This does not require any additional storage space, since we only need to store the vertices of the curves, which are the sequences at hand. We compare the polygonal curves by their Fréchet distance, which is a continuous distance measure that takes the entire course of the curves into account, not only the pairwise distances among their vertices. Therefore, irregularly sampled sequences are automatically handled by the interpolation, which is desirable in many cases. Moreover, Buchin et al. [10] showed, using heuristics, that the $(k,\ell)$-clustering objectives yield promising results on trajectory data.

This branch of research formed only recently, about twenty years after Alt and Godau developed an algorithm to compute the Fréchet distance between polygonal curves [4]. Several papers have since studied this type of clustering [15, 9, 10, 11, 26]. However, all of these clustering algorithms, except the approximation-schemes for polygonal curves in $\mathbb{R}$ [15] and the heuristics in [10], choose a $k$-subset of the input as centers. (This is often called discrete clustering.) This $k$-subset is later simplified, or all input curves are simplified before choosing a $k$-subset. Either way, with these techniques one cannot achieve an approximation factor of less than $2$: there need not be an input curve whose distance to its median is less than the average distance of a curve to its median.

Driemel et al. [15], who were the first to study clustering of polygonal curves under the Fréchet distance in this setting, already overcame this problem in one dimension by defining and analyzing $\delta$-signatures, which are succinct representations of classes of curves that allow synthetic center-curves to be constructed. However, it seems that $\delta$-signatures are only applicable in $\mathbb{R}$. Here, we extend their work and obtain the first randomized bicriteria approximation algorithm for $(k,\ell)$-median clustering of polygonal curves in $\mathbb{R}^d$.

1.1 Related Work

Driemel et al. [15] introduced the $(k,\ell)$-center and $(k,\ell)$-median objectives and developed the first approximation-schemes for these objectives, for curves in $\mathbb{R}$. Furthermore, they proved that $(k,\ell)$-center as well as $(k,\ell)$-median clustering is NP-hard, where $k$ is a part of the input and $\ell$ is fixed. Also, they showed that the doubling dimension of the metric space of polygonal curves under the Fréchet distance is unbounded, even when the complexity of the curves is bounded.

Following this work, Buchin et al. [9] developed a constant-factor approximation algorithm for $(k,\ell)$-center clustering in $\mathbb{R}^d$. Furthermore, they provide improved results on the hardness of approximating $(k,\ell)$-center clustering under the Fréchet distance: the $(k,\ell)$-center problem is NP-hard to approximate within a factor of $(1.5-\varepsilon)$ for curves in $\mathbb{R}$ and within a factor of $(2.25-\varepsilon)$ for curves in $\mathbb{R}^d$, where $d\geq 2$, in both cases even if $k=1$. Furthermore, for the $(k,\ell)$-median variant, Buchin et al. [11] proved NP-hardness using a similar reduction. Again, the hardness holds even if $k=1$. Also, they provided $(1+\varepsilon)$-approximation algorithms for $(k,\ell)$-center, as well as $(k,\ell)$-median clustering, under the discrete Fréchet distance. Nath and Taylor [28] give improved algorithms for $(1+\varepsilon)$-approximation of $(k,\ell)$-median clustering under the discrete Fréchet and Hausdorff distances. Recently, Meintrup et al. [26] introduced a practical $(1+\varepsilon)$-approximation algorithm for discrete $k$-median clustering under the Fréchet distance, when the input adheres to a certain natural assumption, namely the presence of a certain number of outliers.

Our algorithms build upon the clustering algorithm of Kumar et al. [25], which was later extended by Ackermann et al. [2]. This algorithm is a recursive approximation scheme that employs two phases in each call. In the so-called candidate phase it computes candidates by taking a sample $S$ from the input set $T$ and running an algorithm on each subset of $S$ of a certain size. Which algorithm to use depends on the metric at hand. The idea behind this is simple: if $T$ contains a cluster $T'$ that takes a constant fraction of its size, then a constant fraction of $S$ is from $T'$ with high probability. By brute-force enumeration of all subsets of $S$, we find this subset $S'\subseteq T'$, and if $S$ is taken uniformly and independently at random from $T$, then $S'$ is a uniform and independent sample from $T'$. Ackermann et al. proved, for various metric and non-metric distance measures, that $S'$ can be used for computing candidates that contain a $(1+\varepsilon)$-approximate median for $T'$ with high probability. The algorithm recursively calls itself for each candidate to eventually evaluate these together with the candidates for the remaining clusters.
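The superset-sampling idea behind the candidate phase can be illustrated by a small, hedged sketch (function name and parameter values are ours, not from [25]): draw a uniform sample $S$ of $T$ and enumerate subsets of a fixed size; whenever enough sample points hit the cluster $T'$, at least one enumerated subset lies entirely inside $T'$, and any such subset is itself a uniform, independent sample of $T'$.

```python
import itertools
import random

def subsets_inside_cluster(T, cluster, sample_size, subset_size):
    """Superset sampling: draw a uniform sample S of T (with
    replacement) and return all subsets of S of size subset_size
    that lie entirely inside the given cluster."""
    S = [random.choice(T) for _ in range(sample_size)]
    return [sub for sub in itertools.combinations(S, subset_size)
            if all(x in cluster for x in sub)]

random.seed(1)
T = list(range(100))
cluster = set(range(50))  # a cluster taking half of the input
subs = subsets_inside_cluster(T, cluster, sample_size=40, subset_size=4)
# w.h.p. at least one 4-subset of the sample lies inside the cluster
```

In the actual algorithm the enumeration runs over all subsets of $S$ and the candidate-generation step is executed for each of them; the snippet only demonstrates why a pure sample of $T'$ is found with high probability.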

The second phase of the algorithm is the so-called pruning phase, where it partitions its input according to the candidates at hand into two sets of equal size: one with the smaller distances to the candidates and one with the larger distances to the candidates. It then recursively calls itself with the second set as input. The idea behind this is that small clusters now become large enough for candidates to be found for them. Furthermore, the partitioning incurs a provably small error. Finally, it returns the set of $k$ candidates that together evaluate best.

1.2 Our Contributions

Figure 1: From left to right: symbolic depiction of the operation principles of Algorithms 1, 2 and 4. Among all approximate $\ell$-simplifications (depicted in blue) of the input curves (depicted in black), Algorithm 1 returns the one that evaluates best (the solid curve) with respect to a sample of the input. Algorithm 2 does not return a single curve, but a set of candidates. These include the curve returned by Algorithm 1 plus all curves with $\ell$ vertices from the cubic grids, covering balls of certain radius centered at the vertices of an input curve that is close to a median, w.h.p. Algorithm 4 is similar to Algorithm 2, but covers not only the vertices of a single curve but of multiple curves. We depict the best approximate median that can be generated from the grids in solid green.

We present several algorithms for approximating $(1,\ell)$-median clustering of polygonal curves under the Fréchet distance; see Fig. 1 for an illustration of their operation principles. While the first one, Algorithm 1, yields only a coarse approximation (factor $34$), it is suitable as a plugin for the following two algorithms, Algorithms 2 and 4, due to its asymptotically fast running-time. These algorithms yield a better approximation (factor $3+\varepsilon$, respectively $1+\varepsilon$). Additionally, Algorithms 2 and 4 are not only able to yield an approximation for the input set $T$, but also for a cluster $T'\subseteq T$ that takes a constant fraction of $T$. We would like to use these as plugins for the $(1+\varepsilon)$-approximation algorithm for $k$-median clustering by Ackermann et al. [2], but that would require our algorithms to comply with the sampling properties. For an input set $T$, the weak sampling property expresses that a constant-size set of candidates can be computed, containing a $(1+\varepsilon)$-approximate median for $T$ with high probability, by taking a constant-size uniform and independent sample of $T$. Further, the running-time for computing the candidates may depend only on the size of the sample, the size of the candidate set and the failure probability parameter. The strong sampling property is defined similarly, but instead of a candidate set, an approximate median can be computed directly, and the running-time may only depend on the size of the sample. In our algorithms, the running-time for computing the candidate set depends on $m$, which is a parameter of the input. Additionally, our first algorithm for computing candidates, which contain a $(3+\varepsilon)$-approximate $(1,\ell)$-median with high probability, does not achieve the required approximation-factor of $(1+\varepsilon)$.
However, looking into the analysis of Ackermann et al., any algorithm for computing candidates with some guaranteed approximation-factor can be used in the recursive approximation-scheme. Therefore, we decided to generalize the $k$-median clustering algorithm of Ackermann et al. [2].

Nath and Taylor [28] use a similar approach, but they developed yet another way to compute candidates: they define and analyze $g$-coverability, which is a generalization of the notion of doubling dimension; indeed, for the discrete Fréchet distance their proof builds upon the doubling dimension of points in $\mathbb{R}^d$. However, the doubling dimension of polygonal curves under the continuous Fréchet distance is unbounded, even when the complexities of the curves are bounded, and it is an open question whether $g$-coverability holds for the continuous Fréchet distance.

We circumvent this by taking a different approach using the idea of shortcutting. It is well-known that shortcutting a polygonal curve (that is, replacing a subcurve by the line segment connecting its endpoints) does not increase its Fréchet distance to a line segment. This idea has been used before for a variety of Fréchet-distance related problems [3, 15, 14, 8]. Specifically, we introduce two new shortcutting lemmata, which guarantee the existence of good approximate medians of complexity at most $2\ell-2$ whose vertices can be computed efficiently. The first, which we call simple shortcutting, enables us to return candidates that contain, w.h.p., a $(3+\varepsilon)$-approximate median for a cluster inside the input that takes a constant fraction of the input. The second, which we call advanced shortcutting, analogously enables us to return candidates that contain, w.h.p., a $(1+\varepsilon)$-approximate median for such a cluster. All in all, we obtain as main result, following from Corollary 7.5:
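The shortcutting operation itself is elementary; the following sketch (names ours) shows it on the vertex list of a polygonal curve:

```python
def shortcut(vertices, i, j):
    """Replace the subcurve between vertices i and j (0-indexed,
    i < j) by the single line segment from vertex i to vertex j.
    Against a line segment, this operation never increases the
    Fréchet distance of the curve."""
    return vertices[:i + 1] + vertices[j:]

zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, -1.0), (3.0, 0.0)]
shortcut(zigzag, 0, 3)  # → [(0.0, 0.0), (3.0, 0.0)]
```

The lemmata below are existence statements about which shortcuts to take; the operation shown here is only the basic building block.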

Theorem 1.1.

Given a set $T$ of $n$ polygonal curves in $\mathbb{R}^d$, of complexity at most $m$ each, parameter values $\varepsilon\in(0,0.158]$ and $\delta\in(0,1)$, and constants $k,\ell\in\mathbb{N}$, there exists an algorithm which computes a set $C$ of $k$ polygonal curves, each of complexity at most $2\ell-2$, such that with probability at least $(1-\delta)$ it holds that

\operatorname{cost}(T,C)=\sum_{\tau\in T}\min_{c\in C}d_{F}(c,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}\min_{c\in C^{\ast}}d_{F}(c,\tau)=(1+\varepsilon)\operatorname{cost}(T,C^{\ast}),

where $C^{\ast}$ is an optimal $(k,\ell)$-median solution for $T$ under the Fréchet distance $d_F(\cdot,\cdot)$.

The algorithm has worst-case running-time linear in $n$, polynomial in $m$, and exponential in $\delta$, $\varepsilon$, $d$ and $\ell$.

1.3 Organization

The paper is organized as follows. First, we present a simple and fast $34$-approximation algorithm for $(1,\ell)$-median clustering. Then, we present the $(3+\varepsilon)$-approximation algorithm for $(1,\ell)$-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon simple shortcutting and the $34$-approximation algorithm. Next, we present a more practical modification of the $(3+\varepsilon)$-approximation algorithm, which achieves a $(5+\varepsilon)$-approximation for $(1,\ell)$-median clustering. Following this, we present the similar but more involved $(1+\varepsilon)$-approximation algorithm for $(1,\ell)$-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon advanced shortcutting and the $34$-approximation algorithm. Finally, we present the generalized recursive $k$-median approximation-scheme, which leads to our main result.

2 Preliminaries

Here we introduce all necessary definitions. In the following, $d\in\mathbb{N}$ is an arbitrary constant. By $\lVert\cdot\rVert$ we denote the Euclidean norm, and for $p\in\mathbb{R}^d$ and $r\in\mathbb{R}_{\geq 0}$ we denote by $B(p,r)=\{q\in\mathbb{R}^d\mid\lVert p-q\rVert\leq r\}$ the closed ball of radius $r$ with center $p$. By $S_n$ we denote the symmetric group of degree $n$. We give a standard definition of grids:

Definition 2.1 (grid).

Given a number $r\in\mathbb{R}_{>0}$, for $p=(p_1,\dots,p_d)\in\mathbb{R}^d$ we define by $G(p,r)=(\lfloor p_1/r\rfloor\cdot r,\dots,\lfloor p_d/r\rfloor\cdot r)$ the $r$-grid-point of $p$. Let $X\subseteq\mathbb{R}^d$ be a subset of $\mathbb{R}^d$. The grid of cell width $r$ that covers $X$ is the set $\mathbb{G}(X,r)=\{G(p,r)\mid p\in X\}$.

Such a grid partitions the set $X$ into cubic regions, and for each $r\in\mathbb{R}_{>0}$ and $p\in X$ we have $\lVert p-G(p,r)\rVert\leq\sqrt{d}\,r$. We give a standard definition of polygonal curves:
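A minimal sketch of this definition (function names ours): each coordinate is floored to a multiple of the cell width $r$, so the grid point of $p$ is the lower corner of the cell containing $p$, which lies within distance $\sqrt{d}\cdot r$ of $p$.

```python
import math

def grid_point(p, r):
    """The r-grid-point G(p, r): floor each coordinate of p to the
    next multiple of r below it (the lower corner of p's cell)."""
    return tuple(math.floor(x / r) * r for x in p)

def grid_cover(points, r):
    """The grid of cell width r covering a finite point set X."""
    return {grid_point(p, r) for p in points}

p = (0.7, 1.3)
g = grid_point(p, 0.5)  # → (0.5, 1.0)
# the floored corner lies within sqrt(d) * r of p
assert math.dist(p, g) <= math.sqrt(2) * 0.5
```

In the algorithms, such grids are laid over balls around curve vertices, so that every relevant point has a nearby grid point that can serve as a candidate vertex.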

Definition 2.2 (polygonal curve).

A (parameterized) curve is a continuous mapping $\tau\colon[0,1]\rightarrow\mathbb{R}^d$. A curve $\tau$ is polygonal iff there exist $v_1,\dots,v_m\in\mathbb{R}^d$, no three consecutive on a line, called $\tau$'s vertices, and $t_1,\dots,t_m\in[0,1]$ with $t_1<\dots<t_m$, $t_1=0$ and $t_m=1$, called $\tau$'s instants, such that $\tau$ connects every two contiguous vertices $v_i=\tau(t_i)$, $v_{i+1}=\tau(t_{i+1})$ by a line segment.

We call the line segments $\overline{v_1v_2},\dots,\overline{v_{m-1}v_m}$ the edges of $\tau$, and $m$ the complexity of $\tau$, denoted by $\lvert\tau\rvert$. Sometimes we will argue about a sub-curve $\tau$ of a given curve $\sigma$. We will then refer to $\tau$ by restricting the domain of $\sigma$, denoted by $\sigma|_X$, where $X\subseteq[0,1]$.
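As an illustration of the definition, a hedged sketch (representation and names ours): a polygonal curve stored as vertices and instants, evaluated at any $t\in[0,1]$ by linear interpolation between the two enclosing vertices.

```python
import bisect

def make_curve(vertices, instants):
    """A polygonal curve tau: [0,1] -> R^d with v_i = tau(t_i),
    connecting contiguous vertices by line segments."""
    assert instants[0] == 0.0 and instants[-1] == 1.0

    def tau(t):
        # locate the edge whose parameter interval contains t
        i = min(bisect.bisect_right(instants, t) - 1, len(vertices) - 2)
        lam = (t - instants[i]) / (instants[i + 1] - instants[i])
        return tuple((1 - lam) * a + lam * b
                     for a, b in zip(vertices[i], vertices[i + 1]))
    return tau

tau = make_curve([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], [0.0, 0.5, 1.0])
tau(0.25)  # → (0.5, 0.0), the midpoint of the first edge
```

Only the vertices need to be stored; the interpolation is implicit, which is exactly the storage advantage mentioned in the introduction.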

Definition 2.3 (Fréchet distance).

Let $\mathcal{H}$ denote the set of all continuous bijections $h\colon[0,1]\rightarrow[0,1]$ with $h(0)=0$ and $h(1)=1$, which we call reparameterizations. The Fréchet distance between curves $\sigma$ and $\tau$ is defined as

d_{F}(\sigma,\tau)\ =\ \inf_{h\in\mathcal{H}}\ \max_{t\in[0,1]}\ \lVert\sigma(t)-\tau(h(t))\rVert.

Sometimes, given two curves $\sigma,\tau$, we will refer to an $h\in\mathcal{H}$ as a matching between $\sigma$ and $\tau$.
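The continuous Fréchet distance of two polygonal curves can be computed with the algorithm of Alt and Godau [4]. As a self-contained illustration, the following sketch computes the closely related discrete Fréchet distance on the vertex sequences (the classic dynamic program of Eiter and Mannila, not the algorithm used in this paper); it couples only vertices and therefore upper-bounds the continuous distance.

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between vertex sequences P and Q:
    the cheapest simultaneous traversal of both sequences, where in
    each step either sequence (or both) advances by one vertex and
    the cost is the largest point-to-point distance encountered."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

P = [(0.0, 0.0), (2.0, 0.0)]
Q = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
discrete_frechet(P, Q)  # → sqrt(2): the middle vertex of Q must be
                        # coupled to an endpoint of P
```

For these two curves the continuous Fréchet distance is only $1$ (the peak of $Q$ is matched to the midpoint of $P$'s single edge), which illustrates why the continuous measure handles irregular sampling more gracefully.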

Note that a matching $h\in\mathcal{H}$ with $\max_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert=d_F(\sigma,\tau)$ need not exist. This is due to the fact that in some cases a matching realizing the Fréchet distance would need to match multiple points $p_1,\dots,p_n$ on $\tau$ to a single point $q$ on $\sigma$, which is not possible since matchings are bijections. However, the points $p_1,\dots,p_n$ can be matched arbitrarily close to $q$, realizing $d_F(\sigma,\tau)$ in the limit, which we formalize in the following lemma:

Lemma 2.4.

Let $\sigma,\tau\colon[0,1]\rightarrow\mathbb{R}^d$ be curves and let $r=d_F(\sigma,\tau)$. There exists a sequence $(h_i)_{i=1}^{\infty}$ in $\mathcal{H}$ such that $\lim\limits_{i\to\infty}\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h_i(t))\rVert=r$.

Proof.

Define $\rho\colon\mathcal{H}\rightarrow\mathbb{R}_{\geq 0},\ h\mapsto\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert$, with image $R=\{\rho(h)\mid h\in\mathcal{H}\}$. Per definition, we have $d_F(\sigma,\tau)=\inf R=r$.

For any non-empty subset $X$ of $\mathbb{R}$ that is bounded from below and for every $\varepsilon>0$, there exists an $x\in X$ with $\inf X\leq x<\inf X+\varepsilon$, by definition of the infimum. Since $R\subseteq\mathbb{R}$ and $\inf R$ exists, for every $\varepsilon>0$ there exists an $r'\in R$ with $\inf R\leq r'<\inf R+\varepsilon$.

Now, let $a_i=1/i$, a sequence converging to zero. For every $i\in\mathbb{N}$ there exists an $r_i\in R$ with $r\leq r_i<r+a_i$, thus $\lim\limits_{i\to\infty}r_i=r$.

Let $\rho^{-1}(r')=\{h\in\mathcal{H}\mid\rho(h)=r'\}$ be the preimage of $r'$ under $\rho$. Since every $r'\in R$ lies in the image of $\rho$, we have $\lvert\rho^{-1}(r')\rvert\geq 1$ for each $r'\in R$. Now, for $i\in\mathbb{N}$, let $h_i$ be an arbitrary element of $\rho^{-1}(r_i)$. By definition, it holds that

\lim\limits_{i\to\infty}\max\limits_{t\in[0,1]}\lVert\sigma(t)-\tau(h_{i}(t))\rVert=\lim_{i\to\infty}\rho(h_{i})=\lim_{i\to\infty}r_{i}=r=\inf R,

which proves the claim. ∎

Now we introduce the classes of curves we are interested in.

Definition 2.5 (polygonal curve classes).

For $d\in\mathbb{N}$, we define by $\mathbb{X}^d$ the set of equivalence classes of polygonal curves (where two curves are equivalent iff they can be made identical by a reparameterization) in ambient space $\mathbb{R}^d$. For $m\in\mathbb{N}$ we define by $\mathbb{X}^d_m$ the subclass of polygonal curves of complexity at most $m$.

Simplification is a fundamental problem related to curves, which appears as a sub-problem in our algorithms.

Definition 2.6 (minimum-error \ell-simplification).

For a polygonal curve $\tau\in\mathbb{X}^d$ we denote by $\operatorname{simpl}(\alpha,\tau)$ an $\alpha$-approximate minimum-error $\ell$-simplification of $\tau$, i.e., a curve $\sigma\in\mathbb{X}^d_\ell$ with $d_F(\tau,\sigma)\leq\alpha\cdot d_F(\tau,\sigma')$ for all $\sigma'\in\mathbb{X}^d_\ell$.

Now we define the $(k,\ell)$-median clustering problem for polygonal curves.

Definition 2.7 ((k,)(k,\ell)-median clustering).

The $(k,\ell)$-median clustering problem is defined as follows, where $k,\ell\in\mathbb{N}$ are fixed (constant) parameters of the problem: given a finite and non-empty set $T\subset\mathbb{X}^d_m$ of polygonal curves, compute a set of $k$ curves $C^{\ast}\subset\mathbb{X}^d_\ell$, such that $\operatorname{cost}(T,C^{\ast})=\sum\limits_{\tau\in T}\min\limits_{c^{\ast}\in C^{\ast}}d_F(\tau,c^{\ast})$ is minimal.

We call $\operatorname{cost}(\cdot,\cdot)$ the objective function, and we often write $\operatorname{cost}(T,c)$ as shorthand for $\operatorname{cost}(T,\{c\})$. The following theorem of Indyk [23] is useful for evaluating the cost of a curve at hand.

Theorem 2.8.

[23, Theorem 31] Let $\varepsilon\in(0,1]$ and let $T\subset\mathbb{X}^d$ be a set of polygonal curves. Further, let $W$ be a non-empty sample, drawn uniformly and independently at random from $T$, with replacement. For $\tau,\sigma\in T$ with $\operatorname{cost}(T,\tau)>(1+\varepsilon)\operatorname{cost}(T,\sigma)$ it holds that $\Pr[\operatorname{cost}(W,\tau)\leq\operatorname{cost}(W,\sigma)]<\exp\left(-{\varepsilon^2\lvert W\rvert}/{64}\right)$.
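The objective function and the sample-based evaluation behind Theorem 2.8 can be sketched in a toy setting (ours): one-dimensional points stand in for curves and the absolute difference stands in for the Fréchet distance.

```python
import random

def cost(T, C, dist):
    """cost(T, C): every input object pays its distance to the
    nearest center; the costs are summed (Definition 2.7)."""
    return sum(min(dist(tau, c) for c in C) for tau in T)

# Toy instance: 1-d "curves", distance = absolute difference.
dist = lambda a, b: abs(a - b)
T = [i / 100 for i in range(100)]   # points in [0, 1]
sigma, tau = 0.5, 5.0               # a good and a bad center

# Instead of evaluating against all of T, compare both candidates
# on a small uniform sample W, in the spirit of Theorem 2.8.
random.seed(0)
W = [random.choice(T) for _ in range(64)]
cost(W, [sigma], dist) < cost(W, [tau], dist)  # → True
```

In general such a sample comparison can fail, but by Theorem 2.8 only with probability less than $\exp(-\varepsilon^2\lvert W\rvert/64)$; in this toy instance every element of $T$ is strictly closer to `sigma`, so the comparison is correct for any sample.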

The following concentration bound applies in particular to independent Bernoulli trials, which are the special case of Poisson trials in which every trial has the same probability of success. Kumar et al. [25] use this to bound the probability that a subset $S'$ of an independent and uniform sample $S$ from a set $T$ is entirely contained in a subset $T'$ of $T$. They call this superset-sampling.

Lemma 2.9 (Chernoff bound for independent Poisson trials).

[27, Theorem 4.5] Let $X_1,\dots,X_n$ be independent Poisson trials. For $\delta\in(0,1)$ it holds that

\Pr\left[\sum_{i=1}^{n}X_{i}\leq(1-\delta)\operatorname{E}\left[\sum_{i=1}^{n}X_{i}\right]\right]\leq\exp\left(-\frac{\delta^{2}}{2}\operatorname{E}\left[\sum_{i=1}^{n}X_{i}\right]\right).
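As a quick numeric sanity check of Lemma 2.9 (the instance is ours), one can compare the bound with the exact lower tail of a binomial distribution, i.e., the case of identical Bernoulli trials:

```python
import math
from math import comb

def chernoff_bound(n, p, delta):
    """The right-hand side of Lemma 2.9 for n Bernoulli(p) trials,
    whose expected sum is mu = n * p."""
    return math.exp(-(delta ** 2) / 2 * n * p)

def exact_lower_tail(n, p, delta):
    """Pr[X <= (1 - delta) * n * p] for X ~ Binomial(n, p), exactly."""
    cutoff = (1 - delta) * n * p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1) if i <= cutoff)

n, p, delta = 100, 0.5, 0.5
exact_lower_tail(n, p, delta) <= chernoff_bound(n, p, delta)  # → True
```

For $n=100$, $p=1/2$ and $\delta=1/2$ the bound evaluates to $\exp(-6.25)\approx 0.0019$, while the exact tail is several orders of magnitude smaller; the bound is loose but suffices for the sampling arguments.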

3 Simple and Fast $34$-Approximation for $(1,\ell)$-Median

Here, we present Algorithm 1, a $34$-approximation algorithm for $(1,\ell)$-median clustering, which is based on the following facts. We can obtain a $3$-approximate solution to the $(1,\ell)$-median for a given set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves in terms of objective value by uniformly and independently sampling a sufficient number of curves from $T$: w.h.p., we obtain one of the input curves that are within distance $2\operatorname{cost}(T,c^{\ast})/n$ of an optimal $(1,\ell)$-median $c^{\ast}$ for $T$. There are at least $n/2$ such curves by an averaging argument, and each of them has cost at most $3\operatorname{cost}(T,c^{\ast})$ by the triangle inequality. The sample has size depending only on a parameter determining the failure probability, and we can improve the running-time even further by using Theorem 2.8 and evaluating the cost of each curve in the sample of candidates against another sample of similar size instead of against the complete input. However, we then have to accept an approximation factor of $5$ (setting $\varepsilon=1$ in Theorem 2.8). That is indeed acceptable, since we only obtain an approximate solution in terms of objective value and completely ignore the bound on the number of vertices of the center curve; this is a disadvantage of the approach and means that the lower bound of $\operatorname{cost}(T,c^{\ast})$ need not hold (if $\ell<m$). To fix this, we simplify the candidate curve that evaluated best against the second sample, using an efficient approximation algorithm for the minimum-error $\ell$-simplification, which degrades the approximation factor to $6+7\alpha$, where $\alpha$ is the approximation factor of the minimum-error $\ell$-simplification.

However, Algorithm 1 is very fast in terms of the input size; indeed, it has worst-case running-time independent of $n$ and sub-quartic in $m$. The purpose of Algorithm 1 is to provide an approximate median for a given set of polygonal curves: the bicriteria approximation algorithms (Algorithms 2 and 4), which we present afterwards and which are capable of generating center curves with up to $2\ell-2$ vertices, need an approximate median (and its approximation factor) to bound the optimal objective value. Furthermore, there is a case where Algorithms 2 and 4 may fail to provide a good approximation, but it can be proven that the result of Algorithm 1 is then a very good approximation, which can be used instead.

Algorithm 1 $(1,\ell)$-Median by Simplification
1:procedure $(1,\ell)$-Median-$34$-Approximation($T=\{\tau_1,\dots,\tau_n\}$, $\delta$)
2:     $S\leftarrow$ sample $\lceil 2(\ln(2)-\ln(\delta))\rceil$ curves from $T$ uniformly and independently with replacement
3:     $\gamma\leftarrow\lceil-64(\ln(\delta)-\ln(\lceil 4(\ln(2)-\ln(\delta))\rceil))\rceil$
4:     $W\leftarrow$ sample $\gamma$ curves from $T$ uniformly and independently with replacement
5:     $t\leftarrow$ arbitrary elem. from $\operatorname*{arg\,min}\limits_{s\in S}\operatorname{cost}(W,s)$
6:     return $\operatorname{simpl}(\alpha,t)$ $\triangleright$ E.g. combining [4, 22]
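Algorithm 1 translates almost line by line into code. The following is a hedged sketch (the `cost` and `simplify` callables are assumptions standing in for the Fréchet-distance cost and for the $\alpha$-approximate minimum-error $\ell$-simplification of [4, 22]; the inner ceiling of $\gamma$ is read as $\lceil 4(\ln(2)-\ln(\delta))\rceil$, as in the proof of Theorem 3.1):

```python
import math
import random

def median_by_simplification(T, delta, cost, simplify):
    """Sketch of Algorithm 1: pick the curve from a small sample S
    that evaluates best against a second sample W, then simplify it."""
    size_S = math.ceil(2 * (math.log(2) - math.log(delta)))
    S = [random.choice(T) for _ in range(size_S)]
    gamma = math.ceil(-64 * (math.log(delta) -
            math.log(math.ceil(4 * (math.log(2) - math.log(delta))))))
    W = [random.choice(T) for _ in range(gamma)]
    t = min(S, key=lambda s: cost(W, s))   # arg min over the sample
    return simplify(t)

# Toy run: numbers as "curves", absolute difference as distance,
# identity as simplification.
random.seed(0)
T = [0.0, 0.1, 0.2, 0.3, 5.0]
c = median_by_simplification(
    T, 0.1,
    cost=lambda W, s: sum(abs(w - s) for w in W),
    simplify=lambda t: t)
```

Note that both sample sizes depend only on $\delta$, not on $n$, which is the source of the running-time independence of $n$ discussed above.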

Next, we prove the quality of approximation of Algorithm 1.

Theorem 3.1.

Given a parameter $\delta\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, Algorithm 1 returns, with probability at least $1-\delta$, a polygonal curve $c\in\mathbb{X}^d_\ell$ such that $\operatorname{cost}(T,c^{\ast})\leq\operatorname{cost}(T,c)\leq(6+7\alpha)\cdot\operatorname{cost}(T,c^{\ast})$, where $c^{\ast}$ is an optimal $(1,\ell)$-median for $T$ and $\alpha$ is the approximation-factor of the utilized minimum-error $\ell$-simplification approximation algorithm.

Proof.

First, we know that $d_F(\tau,\operatorname{simpl}(\alpha,\tau))\leq\alpha\cdot d_F(\tau,c^{\ast})$ for each $\tau\in T$.

Now, there are at least $\frac{n}{2}$ curves in $T$ that are within distance at most $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ of $c^{\ast}$; otherwise the cost of the remaining curves alone would exceed $\operatorname{cost}(T,c^{\ast})$, a contradiction. Hence each $s\in S$ has probability at least $\frac{1}{2}$ to be within distance $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ of $c^{\ast}$.

Since the elements of $S$ are sampled independently, we conclude that the probability that every $s\in S$ has distance to $c^{\ast}$ greater than $\frac{2\operatorname{cost}(T,c^{\ast})}{n}$ is at most $(1-\frac{1}{2})^{\lvert S\rvert}\leq\exp\left(-\frac{2(\ln(2)-\ln(\delta))}{2}\right)=\frac{\delta}{2}$.

Now, assume there is an $s\in S$ with $d_F(s,c^{\ast})\leq\frac{2\operatorname{cost}(T,c^{\ast})}{n}$. We do not want any $t\in S\setminus\{s\}$ with $\operatorname{cost}(T,t)>2\operatorname{cost}(T,s)$ to have $\operatorname{cost}(W,t)\leq\operatorname{cost}(W,s)$. Using Theorem 2.8, we conclude that this happens with probability at most

\exp\left(-\frac{-64(\ln(\delta)-\ln(\lceil 4(\ln(2)-\ln(\delta))\rceil))}{64}\right)\leq\frac{\delta}{\lceil 4(\ln(2)-\ln(\delta))\rceil}\leq\frac{\delta}{2\lvert S\rvert},

for each $t\in S\setminus\{s\}$.

Using a union bound over all bad events, we conclude that with probability at least $1-\delta$, Algorithm 1 samples a curve $s\in S$ with $d_F(s,c^{\ast})\leq 2\operatorname{cost}(T,c^{\ast})/n$ and returns the simplification $c=\operatorname{simpl}(\alpha,t)$ of a curve $t\in S$ with $\operatorname{cost}(T,t)\leq 2\operatorname{cost}(T,s)$. The triangle-inequality yields

\sum_{\tau\in T}(d_{F}(t,c^{\ast})-d_{F}(\tau,c^{\ast}))\leq\sum_{\tau\in T}d_{F}(t,\tau)\leq 2\sum_{\tau\in T}d_{F}(s,\tau)\leq 2\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s)),

which is equivalent to

n\cdot d_{F}(t,c^{\ast})\leq 2\operatorname{cost}(T,c^{\ast})+\operatorname{cost}(T,c^{\ast})+2n\frac{2\operatorname{cost}(T,c^{\ast})}{n}\ \Leftrightarrow\ d_{F}(t,c^{\ast})\leq\frac{7\operatorname{cost}(T,c^{\ast})}{n}.

Hence, we have

\operatorname{cost}(T,c)=\sum_{\tau\in T}d_{F}(\tau,\operatorname{simpl}(\alpha,t))\leq\sum_{\tau\in T}(d_{F}(\tau,t)+d_{F}(t,\operatorname{simpl}(\alpha,t)))
\leq 2\operatorname{cost}(T,s)+\sum_{\tau\in T}\alpha\cdot d_{F}(t,c^{\ast})\leq 2\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s))+7\alpha\cdot\operatorname{cost}(T,c^{\ast})
\leq 2\operatorname{cost}(T,c^{\ast})+4\operatorname{cost}(T,c^{\ast})+7\alpha\cdot\operatorname{cost}(T,c^{\ast})=(6+7\alpha)\operatorname{cost}(T,c^{\ast}).

The lower bound $\operatorname{cost}(T,c^{\ast})\leq\operatorname{cost}(T,c)$ follows from the fact that the returned curve has at most $\ell$ vertices and that $c^{\ast}$ has minimum cost among all curves with at most $\ell$ vertices. ∎

The following lemma enables us to obtain a concrete approximation-factor and worst-case running-time of Algorithm 1.

Lemma 3.2 (Buchin et al. [9, Lemma 7.1]).

Given a curve $\sigma\in\mathbb{X}^d_m$, a $4$-approximate minimum-error $\ell$-simplification can be computed in $O(m^3\log m)$ time.

The simplification algorithm used for obtaining this statement is a combination of the algorithm by Imai and Iri [22] and the algorithm by Alt and Godau [4]. Combining Theorem 3.1 and Lemma 3.2, we obtain the following corollary.

Corollary 3.3.

Given a parameter $\delta\in(0,1)$ and a set $T\subset\mathbb{X}^d_m$ of polygonal curves, Algorithm 1 returns with probability at least $1-\delta$ a polygonal curve $c\in\mathbb{X}^d_\ell$ such that $\operatorname{cost}(T,c^\ast)\leq\operatorname{cost}(T,c)\leq 34\cdot\operatorname{cost}(T,c^\ast)$, where $c^\ast$ is an optimal $(1,\ell)$-median for $T$, in time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$, when the algorithms by Imai and Iri [22] and Alt and Godau [4] are combined for $\ell$-simplification.

Proof.

We use Lemma 3.2 together with Theorem 3.1, which yields an approximation factor of $34$. Now, drawing the first sample takes time $O(-\ln\delta)$. Drawing the second sample also takes time $O(-\ln\delta)$ and evaluating the samples against each other takes time $O(m^2\log(m)(-\ln^2\delta))$. Simplifying one of the curves that evaluates best takes time $O(m^3\log m)$. We conclude that Algorithm 1 has running-time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$. ∎
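Algorithm 1 itself is not reproduced in this section, but the sample-and-evaluate pattern behind it (draw a candidate sample, draw a witness sample, keep the candidate of minimum cost over the witnesses) can be sketched generically. The following Python sketch is only an illustration under our own naming; the generic metric `dist` stands in for the Fréchet distance, which for curves would be computed, e.g., with the Alt–Godau algorithm.

```python
import random

def best_candidate(data, dist, sample_size, witness_size, rng=random):
    """Sample-and-evaluate: draw candidates and witnesses uniformly with
    replacement and return the candidate of minimum cost over the witnesses."""
    candidates = [rng.choice(data) for _ in range(sample_size)]
    witnesses = [rng.choice(data) for _ in range(witness_size)]
    # cost of a candidate = sum of distances to the witness sample
    return min(candidates, key=lambda c: sum(dist(c, w) for w in witnesses))

# Toy usage: points on the real line with the absolute difference as metric.
data = [0.0, 1.0, 1.1, 0.9, 10.0]
c = best_candidate(data, lambda a, b: abs(a - b),
                   sample_size=5, witness_size=5, rng=random.Random(0))
```

The returned element is then simplified in Algorithm 1; here it is simply one of the input elements.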

4 $(3+\varepsilon)$-Approximation for $(1,\ell)$-Median by Simple Shortcutting

Here, we present Algorithm 2, which returns a set of candidates that, with high probability, contains a $(3+\varepsilon)$-approximate $(1,\ell)$-median of complexity at most $2\ell-2$ for any cluster contained in the input that takes up a constant fraction of the input. Algorithm 2 can be used as a plugin in our generalized version (Algorithm 5, Section 7) of the algorithm by Ackermann et al. [2].

In contrast to Nath and Taylor [28], we cannot use the property that the vertices of a median must be found in the balls of radius $d_F(\tau,c^\ast)$ centered at the vertices of an input curve $\tau\in T$, where $c^\ast$ is an optimal $(1,\ell)$-median for the given input $T$. This is an immediate consequence of using the continuous Fréchet distance.

We circumvent this by proving the following shortcutting lemmata. We start with the simplest, which states that we can indeed search the aforementioned balls if we accept a resulting curve of complexity at most $2\ell-2$. See Fig. 2 for a visualization.

Figure 2: Visualization of a simple shortcut. The black curve is an input curve that is close to an optimal median, which is depicted in red. By inserting the blue shortcut we can find a curve that has the same distance to the black curve as the median, but with all vertices contained in the balls centered at the black curve's vertices.
Lemma 4.1 (shortcutting using a single polygonal curve).

Let $\sigma,\tau\in\mathbb{X}^d$ be polygonal curves. Let $v^\tau_1,\dots,v^\tau_{|\tau|}$ be the vertices of $\tau$ and let $r=d_F(\sigma,\tau)$. There exists a polygonal curve $\sigma'\in\mathbb{X}^d$ with every vertex contained in at least one of $B(v^\tau_1,r),\dots,B(v^\tau_{|\tau|},r)$, with $d_F(\sigma',\tau)\leq d_F(\sigma,\tau)$ and $|\sigma'|\leq 2|\sigma|-2$.

Proof.

Let $v^\sigma_1,\dots,v^\sigma_{|\sigma|}$ be the vertices of $\sigma$. Further, let $t^\sigma_1,\dots,t^\sigma_{|\sigma|}$ and $t^\tau_1,\dots,t^\tau_{|\tau|}$ be the instants of $\sigma$ and $\tau$, respectively. Also, for $h\in\mathcal{H}$ (recall that $\mathcal{H}$ is the set of all continuous bijections $h\colon[0,1]\to[0,1]$ with $h(0)=0$ and $h(1)=1$), let $r_h=\max_{t\in[0,1]}\lVert\sigma(t)-\tau(h(t))\rVert$ be the distance realized by $h$. We know from Lemma 2.4 that there exists a sequence $(h_x)_{x=1}^\infty$ in $\mathcal{H}$ such that $\lim_{x\to\infty}r_{h_x}=d_F(\sigma,\tau)=r$.

Now, fix an arbitrary $h\in\mathcal{H}$ and assume there is a vertex $v^\sigma_i$ of $\sigma$, with instant $t^\sigma_i$, that is not contained in any of $B(v^\tau_1,r_h),\dots,B(v^\tau_{|\tau|},r_h)$. Let $j$ be the maximum of $\{1,\dots,|\tau|-1\}$ such that $t^\tau_j\leq h(t^\sigma_i)\leq t^\tau_{j+1}$, so $v^\sigma_i$ is matched to $\overline{\tau(t^\tau_j)\tau(t^\tau_{j+1})}$ by $h$. We modify $\sigma$ in such a way that $v^\sigma_i$ is replaced by two new vertices that are elements of $B(v^\tau_j,r_h)$ and $B(v^\tau_{j+1},r_h)$, respectively.

Namely, let $t^-$ be the maximum of $[0,t^\sigma_i)$ such that $\sigma(t^-)\in B(v^\tau_j,r_h)$ and let $t^+$ be the minimum of $(t^\sigma_i,1]$ such that $\sigma(t^+)\in B(v^\tau_{j+1},r_h)$. These are the instants when $\sigma$ leaves $B(v^\tau_j,r_h)$ before visiting $v^\sigma_i$ and when $\sigma$ enters $B(v^\tau_{j+1},r_h)$ after visiting $v^\sigma_i$, respectively. Let $\sigma'_h$ be the piecewise-defined curve that coincides with $\sigma$ on $[0,t^-]$ and $[t^+,1]$, but on $(t^-,t^+)$ connects $\sigma(t^-)$ and $\sigma(t^+)$ with the line segment $s(t)=\left(1-\frac{t-t^-}{t^+-t^-}\right)\sigma(t^-)+\frac{t-t^-}{t^+-t^-}\sigma(t^+)$.

We know that $\lVert\sigma(t^-)-\tau(h(t^-))\rVert\leq r_h$ and $\lVert\sigma(t^+)-\tau(h(t^+))\rVert\leq r_h$. Note that $t^\tau_j\leq h(t^-)$ and $h(t^+)\leq t^\tau_{j+1}$, since $\sigma(t^-)$ and $\sigma(t^+)$ are the points closest to $v^\sigma_i$ on $\sigma$ that have distance $r_h$ to $v^\tau_j$ and $v^\tau_{j+1}$, respectively, by definition. Therefore, $\tau$ has no vertices between the instants $h(t^-)$ and $h(t^+)$. Now, $h$ can be used to match $\sigma'_h|_{[0,t^-)}$ to $\tau|_{[0,h(t^-))}$ and $\sigma'_h|_{(t^+,1]}$ to $\tau|_{(h(t^+),1]}$ with distance at most $r_h$. Since $\sigma'_h|_{[t^-,t^+]}$ and $\tau|_{[h(t^-),h(t^+)]}$ are just line segments, they can be matched to each other with distance at most $\max\{\lVert\sigma'_h(t^-)-\tau(h(t^-))\rVert,\lVert\sigma'_h(t^+)-\tau(h(t^+))\rVert\}\leq r_h$. We conclude that $d_F(\sigma'_h,\tau)\leq r_h$.

Because this modification works for every $h\in\mathcal{H}$, we have $d_F(\sigma'_h,\tau)\leq r_h$ for every $h\in\mathcal{H}$. Thus, $\lim_{x\to\infty}d_F(\sigma'_{h_x},\tau)\leq d_F(\sigma,\tau)=r$.

Now, to prove the claim, for every $h\in\mathcal{H}$ we apply this modification to $v^\sigma_i$ and successively to every other vertex $v^{\sigma'_h}_i$ of the resulting curve $\sigma'_h$ that is not contained in one of the balls, until every vertex of $\sigma'_h$ is contained in a ball. Note that the modification is repeated at most $|\sigma|-2$ times for every $h\in\mathcal{H}$, since the start and end vertex of $\sigma$ must be contained in $B(v^\tau_1,r_h)$ and $B(v^\tau_{|\tau|},r_h)$, respectively. Since only the $|\sigma|-2$ inner vertices can fail to lie in a ball, and each such vertex is replaced by two new vertices, the number of vertices of every $\sigma'_h$ can be bounded by $2(|\sigma|-2)+2$. Thus, $|\sigma'_h|\leq 2|\sigma|-2$. ∎
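For polygonal curves, the instants $t^-$ and $t^+$ can be computed segment by segment: each is a parameter at which a line segment of $\sigma$ crosses the boundary of a ball $B(v,r_h)$, which reduces to solving a quadratic equation. The following sketch is our own helper, not part of the paper's algorithms:

```python
import math

def segment_ball_crossings(p, q, center, r):
    """Parameters t in [0, 1] at which the segment p + t*(q - p) crosses the
    boundary of B(center, r); solves ||p + t*(q - p) - center|| = r."""
    d = [qi - pi for pi, qi in zip(p, q)]       # segment direction
    f = [pi - ci for pi, ci in zip(p, center)]  # p relative to the center
    a = sum(di * di for di in d)
    b = 2 * sum(di * fi for di, fi in zip(d, f))
    c = sum(fi * fi for fi in f) - r * r
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        return []                               # degenerate segment or no crossing
    s = math.sqrt(disc)
    return sorted(t for t in ((-b - s) / (2 * a), (-b + s) / (2 * a))
                  if 0 <= t <= 1)

# The segment from (0,0) to (4,0) crosses the unit ball around (2,0)
# entering at t = 0.25 and leaving at t = 0.75.
```

Scanning the segments of $\sigma$ backward from $t^\sigma_i$ for the last crossing of $B(v^\tau_j,r_h)$, and forward for the first crossing of $B(v^\tau_{j+1},r_h)$, yields $t^-$ and $t^+$.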

We now present Algorithm 2, which works similarly to Algorithm 1 but uses shortcutting instead of simplification. As a consequence, we can achieve an approximation factor of $3+\varepsilon$ instead of a factor of $2+\varepsilon+\alpha$, where $\alpha\geq 1$ is the approximation factor of the simplification algorithm used in Algorithm 1. To achieve an approximation factor of $3+\varepsilon$ using simplification, one would need to compute the optimal minimum-error $\ell$-simplifications of the input curves, and to the best of our knowledge, there is no such algorithm for the continuous Fréchet distance.

In contrast to Algorithm 1, Algorithm 2 utilizes the superset-sampling technique by Kumar et al. [25], i.e., the concentration bound in Lemma 2.9, to obtain an approximate $(1,\ell)$-median for a cluster $T'$ contained in the input $T$ that takes up a constant fraction of $T$. Therefore, it has running-time exponential in the size of the sample $S$. A further difference is that we need an upper and a lower bound on the cost of an optimal $(1,\ell)$-median for $T'$ to properly set up the grids we use for shortcutting. The lower bound can be obtained by a simple estimate using Markov's inequality. For the upper bound we utilize a case distinction, which guarantees that if we fail to obtain an upper bound on the optimal cost, then the result of Algorithm 1 is a good approximation (factor $2+\varepsilon$) and can be used instead of a best curve obtained by shortcutting.

Algorithm 2 has several parameters: $\beta$ determines the size (as a fraction of the input) of the smallest cluster inside the input for which an approximate median can be computed, $\delta$ determines the failure probability of the algorithm, and $\varepsilon$ determines the approximation factor.

Algorithm 2 $(1,\ell)$-Median for Subset by Simple Shortcutting
1: procedure $(1,\ell)$-Median-$(3+\varepsilon)$-Candidates($T=\{\tau_1,\dots,\tau_n\},\beta,\delta,\varepsilon$)
2:     $\varepsilon'\leftarrow\varepsilon/3$, $C\leftarrow\emptyset$
3:     $S\leftarrow$ sample $\lceil-8\beta(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil$ curves from $T$ uniformly and independently with replacement
4:     for $S'\subseteq S$ with $|S'|=\frac{|S|}{2\beta}$ do
5:         $c\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($S',\delta/4$) ▷ Algorithm 1
6:         $\Delta\leftarrow\operatorname{cost}(S',c)$, $\Delta_l\leftarrow\frac{\delta n}{2|S|}\frac{\Delta}{34}$, $\Delta_u\leftarrow\frac{1}{\varepsilon'}\Delta$, $C\leftarrow C\cup\{c\}$
7:         for $s\in S'$ do
8:             $P\leftarrow\emptyset$
9:             for $i\in\{1,\dots,|s|\}$ do
10:                 $P\leftarrow P\cup\mathbb{G}\left(B\left(v^s_i,(1+\varepsilon')\Delta_u\right),\frac{2\varepsilon'}{n\sqrt{d}}\Delta_l\right)$ ▷ $v^s_i$: $i$th vertex of $s$
11:             $C\leftarrow C\cup{}$ set of all polygonal curves with $2\ell-2$ vertices from $P$
12:     return $C$
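The grid cover $\mathbb{G}(B(v,r),w)$ in line 10 can be realized by enumerating the points of a uniform lattice of cell width $w$ that hit the ball; every point of the ball is then within distance $\frac{\sqrt{d}}{2}w$ of an enumerated point. The following sketch is one possible realization under our assumptions (the precise definition of $\mathbb{G}$ is given in the paper's preliminaries, which are not reproduced here):

```python
import itertools
import math

def grid_cover(center, r, w):
    """Points of the lattice w * Z^d, restricted to the bounding box of
    B(center, r), that lie in the closed ball B(center, r)."""
    ranges = [range(math.floor((c - r) / w), math.ceil((c + r) / w) + 1)
              for c in center]
    return [tuple(i * w for i in idx)
            for idx in itertools.product(*ranges)
            if math.dist(tuple(i * w for i in idx), center) <= r]

# For the unit ball around the origin in the plane and cell width 0.5,
# exactly the lattice points (0.5*i, 0.5*j) with i^2 + j^2 <= 4 are kept.
```

The number of enumerated points is $O((r/w)^d)$, which is where the grid-size term in the running-time analysis below comes from.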

We prove the quality of approximation of Algorithm 2.

Theorem 4.2.

Given three parameters $\beta\in[1,\infty)$, $\delta,\varepsilon\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, with probability at least $1-\delta$ the set of candidates that Algorithm 2 returns contains a $(3+\varepsilon)$-approximate $(1,\ell)$-median with up to $2\ell-2$ vertices for any $T'\subseteq T$ with $|T'|\geq\frac{1}{\beta}|T|$.

Proof.

We assume that $|T'|\geq\frac{1}{\beta}|T|$. Let $n'$ be the number of sampled curves in $S$ that are elements of $T'$. Clearly, $\operatorname{E}[n']\geq\sum_{i=1}^{|S|}\frac{1}{\beta}=\frac{|S|}{\beta}$. Also, $n'$ is a sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

\[
\Pr\left[n'<\frac{|S|}{2\beta}\right]\leq\Pr\left[n'<\frac{1}{2}\operatorname{E}[n']\right]\leq\exp\left(-\frac{1}{4}\frac{|S|}{2\beta}\right)\leq\exp\left(\frac{\ln(\delta)-\ln(4)}{\varepsilon}\right)=\left(\frac{\delta}{4}\right)^{\frac{1}{\varepsilon}}\leq\frac{\delta}{4}.
\]

In other words, with probability at most $\delta/4$ no subset $S'\subseteq S$ of cardinality at least $\frac{|S|}{2\beta}$ is a subset of $T'$. We condition the rest of the proof on the contrary event, denoted by $\mathcal{E}_{T'}$, namely that there is a subset $S'\subseteq S$ with $S'\subseteq T'$ and $|S'|\geq\frac{|S|}{2\beta}$. Note that $S'$ is then a uniform and independent sample of $T'$.

Now, let $c^\ast\in\operatorname*{arg\,min}_{c\in\mathbb{X}^d_\ell}\operatorname{cost}(T',c)$ be an optimal $(1,\ell)$-median for $T'$. The expected distance between $s\in S'$ and $c^\ast$ is

\[
\operatorname{E}\left[d_F(s,c^\ast)\mid\mathcal{E}_{T'}\right]=\sum_{\tau\in T'}d_F(c^\ast,\tau)\cdot\frac{1}{|T'|}=\frac{\operatorname{cost}(T',c^\ast)}{|T'|}.
\]

By linearity we have $\operatorname{E}[\operatorname{cost}(S',c^\ast)\mid\mathcal{E}_{T'}]=\frac{|S'|}{|T'|}\operatorname{cost}(T',c^\ast)$. Markov's inequality yields:

\[
\Pr\left[\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)>\operatorname{cost}(T',c^\ast)\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{\delta}{4}.
\]

We conclude that with probability at most $\delta/4$ we have $\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)>\operatorname{cost}(T',c^\ast)$.

Using Markov's inequality again, for every $s\in S'$ we have

\[
\Pr\left[d_F(s,c^\ast)>(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{1}{1+\varepsilon},
\]

therefore by independence

\[
\Pr\left[\bigwedge_{s\in S'}\left(d_F(s,c^\ast)>(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\right)\ \Big|\ \mathcal{E}_{T'}\right]\leq\frac{1}{(1+\varepsilon)^{|S'|}}\leq\exp\left(-\frac{\varepsilon}{2}\frac{|S|}{2\beta}\right).
\]

Hence, with probability at most $\exp\left(-\frac{\varepsilon\left\lceil-\frac{8\beta(\ln(\delta)-\ln(4))}{\varepsilon}\right\rceil}{4\beta}\right)\leq\delta^2/16\leq\delta/4$ there is no $s\in S'$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$. Also, with probability at most $\delta/4$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $c\in\mathbb{X}^d_\ell$ for $S'$, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least $1-\delta$ all of the following events occur simultaneously:

  • There is a subset $S'\subseteq S$ of cardinality at least $|S|/(2\beta)$ that is a uniform and independent sample of $T'$,

  • there is a curve $s\in S'$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$,

  • Algorithm 1 computes a polygonal curve $c\in\mathbb{X}^d_\ell$ with $\operatorname{cost}(S',c^\ast_{S'})\leq\operatorname{cost}(S',c)\leq 34\operatorname{cost}(S',c^\ast_{S'})$, where $c^\ast_{S'}\in\mathbb{X}^d_\ell$ is an optimal $(1,\ell)$-median for $S'$,

  • and it holds that $\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)\leq\operatorname{cost}(T',c^\ast)$.

Since $c^\ast_{S'}$ is an optimal $(1,\ell)$-median for $S'$, we get the following from the last two items:

\[
\operatorname{cost}(T',c^\ast)\geq\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast)\geq\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c^\ast_{S'})\geq\frac{\delta|T'|}{4|S'|}\frac{\operatorname{cost}(S',c)}{34}.
\]

We now distinguish between two cases:

Case 1: $d_F(c,c^\ast)\geq(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$

The triangle inequality yields

\begin{align*}
d_F(c,s) &\geq d_F(c,c^\ast)-d_F(c^\ast,s)\geq d_F(c,c^\ast)-(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\\
&\geq(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}-(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}=\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}.
\end{align*}

As a consequence, since $s\in S'$, we have $\operatorname{cost}(S',c)\geq d_F(c,s)\geq\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$, hence $\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\leq\frac{1}{\varepsilon}\operatorname{cost}(S',c)$.

Now, let $v^s_1,\dots,v^s_{|s|}$ be the vertices of $s$. By Lemma 4.1 there exists a polygonal curve $c'$ with up to $2\ell-2$ vertices, every vertex contained in one of $B(v^s_1,d_F(c^\ast,s)),\dots,B(v^s_{|s|},d_F(c^\ast,s))$, and $d_F(s,c')\leq d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}\leq(1+\varepsilon)\frac{\operatorname{cost}(S',c)}{\varepsilon}$.

The set of candidates that Algorithm 2 returns contains a curve $c''$ with up to $2\ell-2$ vertices from the union of the grid covers, with distance at most $\frac{\varepsilon\frac{2\delta n}{4|S'|}\operatorname{cost}(S',c)}{n}\leq\frac{\varepsilon\frac{\delta|T'|}{4|S'|}\operatorname{cost}(S',c)}{|T'|}\leq\varepsilon\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$ between every corresponding pair of vertices of $c'$ and $c''$. We conclude that $d_F(c',c'')\leq\frac{\varepsilon\operatorname{cost}(T',c^\ast)}{|T'|}$.

We can now bound the cost of c′′c^{\prime\prime} as follows:

\begin{align*}
\operatorname{cost}(T',c'') &= \sum_{\tau\in T'}d_F(\tau,c'')\leq\sum_{\tau\in T'}\left(d_F(\tau,c')+\frac{\varepsilon\operatorname{cost}(T',c^\ast)}{|T'|}\right)\\
&\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,c'))+\varepsilon\operatorname{cost}(T',c^\ast)\\
&\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,s)+d_F(s,c'))+\varepsilon\operatorname{cost}(T',c^\ast)\leq(3+3\varepsilon)\operatorname{cost}(T',c^\ast).
\end{align*}

Case 2: $d_F(c,c^\ast)<(1+2\varepsilon)\frac{\operatorname{cost}(T',c^\ast)}{|T'|}$

The cost of cc can easily be bounded:

\[
\operatorname{cost}(T',c)\leq\sum_{\tau\in T'}(d_F(\tau,c^\ast)+d_F(c^\ast,c))<\operatorname{cost}(T',c^\ast)+(1+2\varepsilon)\operatorname{cost}(T',c^\ast)=(2+2\varepsilon)\operatorname{cost}(T',c^\ast).
\]

The claim follows by rescaling $\varepsilon$ by $\frac{1}{3}$. ∎

Next we analyse the worst-case running-time of Algorithm 2 and the number of candidates it returns.

Theorem 4.3.

Algorithm 2 has running-time, and returns a number of candidates, bounded by $2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}+\log(m)\right)}$.

Proof.

The sample $S$ has size $O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right)$ and sampling it takes time $O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right)$. Let $n_S=|S|$. The for-loop runs

\[
\binom{n_S}{\frac{n_S}{2\beta}}\in 2^{O\left(\frac{n_S}{2\beta}\log n_S\right)}\subset 2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}\right)}
\]

times. In each iteration, we run Algorithm 1, taking time $O(m^2\log(m)(-\ln^2\delta)+m^3\log m)$ (cf. Corollary 3.3), we compute the cost of the returned curve with respect to $S'$, taking time $O\left(\frac{-\ln(\delta)}{\varepsilon}\cdot m\log(m)\right)$, and per curve in $S'$ we build up to $m$ grids of size

\[
\left(\frac{\frac{(1+\varepsilon)\Delta}{\varepsilon}}{\frac{2\varepsilon 2\delta n\Delta}{n\sqrt{d}4|S|}}\right)^d=\left(\frac{\sqrt{d}|S|(1+\varepsilon)}{\varepsilon^2\delta}\right)^d\in O\left(\frac{\beta^d(-\ln\delta)^d}{\varepsilon^{3d}\delta^d}\right)
\]

each. For each curve $s\in S'$, Algorithm 2 then enumerates all combinations of $2\ell-2$ points from these up to $m$ grids, resulting in

\[
O\left(\frac{m^{2\ell-2}\beta^{2\ell d-2d}(-\ln\delta)^{2\ell d-2d}}{\varepsilon^{6\ell d-6d}\delta^{2\ell d-2d}}\right)
\]

candidates per $s\in S'$, per iteration of the for-loop. Thus, Algorithm 2 computes $O(\operatorname{poly}(m,\beta,\delta^{-1},\varepsilon^{-1}))$ candidates per iteration of the for-loop, and the enumeration also takes time $O(\operatorname{poly}(m,\beta,\delta^{-1},\varepsilon^{-1}))$ per iteration of the for-loop.

All in all, we have running-time and number of candidates $2^{O\left(\frac{(-\ln(\delta))^2\cdot\beta}{\varepsilon^2}+\log(m)\right)}$. ∎

5 More Practical Approximation for $(1,\ell)$-Median by Simple Shortcutting

The following algorithm is a modification of Algorithm 2. It is more practical since it needs to cover only up to $m$ (small) balls using grids. Unfortunately, it is not compatible with the superset-sampling technique and can therefore not be used as a plugin in Algorithm 5.

Algorithm 3 $(1,\ell)$-Median by Simple Shortcutting
1: procedure $(1,\ell)$-Median-$(5+\varepsilon)$($T=\{\tau_1,\dots,\tau_n\},\delta,\varepsilon$)
2:     $\widehat{c}\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($T,\delta/2$) ▷ Algorithm 1
3:     $\Delta\leftarrow\frac{\operatorname{cost}(T,\widehat{c})}{34}$, $\varepsilon'\leftarrow\varepsilon/9$, $P\leftarrow\emptyset$
4:     $S\leftarrow$ sample $\lceil-2(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil$ curves from $T$ uniformly and independently with replacement
5:     $W\leftarrow$ sample $\lceil-64(\varepsilon')^{-2}(\ln(\delta)-\ln(\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil))\rceil$ curves from $T$ uniformly and independently with replacement
6:     $c\leftarrow\operatorname*{arg\,min}_{s\in S}\operatorname{cost}(W,s)$
7:     for $i\in\{1,\dots,|c|\}$ do
8:         $P\leftarrow P\cup\mathbb{G}\left(B\left(v^c_i,\frac{(3+4\varepsilon')}{n}34\Delta\right),\frac{2\varepsilon'\Delta}{n\sqrt{d}}\right)$ ▷ $v^c_i$ is the $i$th vertex of $c$
9:     $C\leftarrow$ set of all polygonal curves with $2\ell-2$ vertices from $P$
10:     return $\operatorname*{arg\,min}_{c'\in C}\operatorname{cost}(T,c')$
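Line 9 of Algorithm 3 (and likewise line 11 of Algorithm 2) enumerates vertex sequences over the grid-point set $P$ as candidate curves, which is where the $|P|^{2\ell-2}$ factor in the running-time comes from. A schematic sketch of this enumeration (our own illustration; vertex sequences stand in for polygonal curves):

```python
import itertools

def enumerate_candidates(P, max_vertices):
    """All vertex sequences of length 2..max_vertices over the point set P,
    each representing a polygonal curve through those vertices."""
    for k in range(2, max_vertices + 1):
        # every ordered choice of k points from P, with repetition
        for vertices in itertools.product(P, repeat=k):
            yield vertices

# With 3 grid points and curves of 2*l - 2 = 2 vertices (l = 2),
# there are 3^2 = 9 candidate curves.
```

Each candidate would then be evaluated against $T$ with Fréchet-distance computations, and the cheapest one returned.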

We prove the quality of approximation of Algorithm 3.

Theorem 5.1.

Given two parameters $\delta,\varepsilon\in(0,1)$ and a set $T=\{\tau_1,\dots,\tau_n\}\subset\mathbb{X}^d_m$ of polygonal curves, with probability at least $1-\delta$ Algorithm 3 returns a $(5+\varepsilon)$-approximate $(1,\ell)$-median for $T$ with up to $2\ell-2$ vertices.

Proof.

Let $c^\ast\in\operatorname*{arg\,min}_{c\in\mathbb{X}^d_\ell}\operatorname{cost}(T,c)$ be an optimal $(1,\ell)$-median for $T$.

The expected distance between $s\in S$ and $c^\ast$ is

\[
\operatorname{E}[d_F(s,c^\ast)]=\sum_{i=1}^n d_F(c^\ast,\tau_i)\cdot\frac{1}{n}=\frac{\operatorname{cost}(T,c^\ast)}{n}.
\]

Now, using Markov's inequality, for every $s\in S$ we have

\[
\Pr[d_F(s,c^\ast)>(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n]\leq\frac{\operatorname{cost}(T,c^\ast)n^{-1}}{(1+\varepsilon)\operatorname{cost}(T,c^\ast)n^{-1}}=\frac{1}{1+\varepsilon},
\]

therefore by independence

\[
\Pr\left[\bigwedge_{s\in S}(d_F(s,c^\ast)>(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n)\right]\leq\frac{1}{(1+\varepsilon)^{|S|}}\leq\exp\left(-\frac{\varepsilon|S|}{2}\right).
\]

Hence, with probability at most $\exp\left(-\frac{\varepsilon\left\lceil-\frac{2(\ln(\delta)-\ln(4))}{\varepsilon}\right\rceil}{2}\right)\leq\delta/4$ there is no $s\in S$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\frac{\operatorname{cost}(T,c^\ast)}{n}$. Now, assume there is an $s\in S$ with $d_F(s,c^\ast)\leq(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n$. We do not want any $t\in S\setminus\{s\}$ with $d_F(t,c^\ast)>(1+\varepsilon)d_F(s,c^\ast)$ to have $\operatorname{cost}(W,t)\leq\operatorname{cost}(W,s)$. Using Theorem 2.8, we conclude that this happens with probability at most

\[
\exp\left(-\frac{\varepsilon^2\lceil-64\varepsilon^{-2}(\ln(\delta)-\ln(\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil))\rceil}{64}\right)\leq\frac{\delta}{\lceil-8(\varepsilon')^{-1}(\ln(\delta)-\ln(4))\rceil}\leq\frac{\delta}{4|S|},
\]

for each $t\in S\setminus\{s\}$. Also, with probability at most $\delta/2$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $\widehat{c}\in\mathbb{X}^d_\ell$ for $T$, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least $1-\delta$, Algorithm 3 samples a curve $t\in S$ with $\operatorname{cost}(T,t)\leq(1+\varepsilon)\operatorname{cost}(T,s)$ and Algorithm 1 computes a $34$-approximate $(1,\ell)$-median $\widehat{c}\in\mathbb{X}^d_\ell$ for $T$, i.e., $\operatorname{cost}(T,c^\ast)\leq 34\Delta=\operatorname{cost}(T,\widehat{c})\leq 34\operatorname{cost}(T,c^\ast)$. Let $v^t_1,\dots,v^t_{|t|}$ be the vertices of $t$. By Lemma 4.1 there exists a polygonal curve $c'$ with up to $2\ell-2$ vertices, every vertex contained in one of $B(v^t_1,d_F(c^\ast,t)),\dots,B(v^t_{|t|},d_F(c^\ast,t))$, and $d_F(t,c')\leq d_F(t,c^\ast)$. The triangle inequality yields

\[
\sum_{\tau\in T}(d_F(t,c^\ast)-d_F(\tau,c^\ast))\leq\sum_{\tau\in T}d_F(t,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}d_F(s,\tau)\leq(1+\varepsilon)\sum_{\tau\in T}(d_F(\tau,c^\ast)+d_F(c^\ast,s)),
\]

which, since $\varepsilon^2\leq\varepsilon$, implies

\[
n\cdot d_F(t,c^\ast)\leq(2+\varepsilon)\operatorname{cost}(T,c^\ast)+(1+\varepsilon)n(1+\varepsilon)\operatorname{cost}(T,c^\ast)/n\quad\text{and hence}\quad d_F(t,c^\ast)\leq(3+4\varepsilon)\operatorname{cost}(T,c^\ast)/n.
\]

Hence, we have $d_F(t,c')\leq d_F(t,c^\ast)\leq(3+4\varepsilon)\operatorname{cost}(T,c^\ast)/n\leq(3+4\varepsilon)34\Delta/n$.

In the last step, Algorithm 3 returns a curve from the set $C$ of all curves with up to $2\ell-2$ vertices from $P$, the union of the grid covers, that evaluates best. By construction of the grids there is a curve $c''\in C$ with distance at most $\frac{\varepsilon\Delta}{n}\leq\varepsilon\frac{\operatorname{cost}(T,c^\ast)}{n}$ between every corresponding pair of vertices of $c'$ and $c''$. We conclude that $d_F(c',c'')\leq\frac{\varepsilon\Delta}{n}\leq\varepsilon\frac{\operatorname{cost}(T,c^\ast)}{n}$.

We can now bound the cost of $c^{\prime\prime}$ as follows:

\begin{align*}
\operatorname{cost}(T,c^{\prime\prime}) &= \sum_{\tau\in T}d_{F}(\tau,c^{\prime\prime})\leq\sum_{\tau\in T}\left(d_{F}(\tau,c^{\prime})+\frac{\varepsilon\Delta}{n}\right)\leq\sum_{\tau\in T}(d_{F}(\tau,t)+d_{F}(t,c^{\prime}))+\varepsilon\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\operatorname{cost}(T,s)+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\sum_{\tau\in T}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},s))+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(1+\varepsilon)\operatorname{cost}(T,c^{\ast})+(1+\varepsilon)^{2}\operatorname{cost}(T,c^{\ast})+(3+5\varepsilon)\operatorname{cost}(T,c^{\ast})\\
&\leq(5+9\varepsilon)\operatorname{cost}(T,c^{\ast}).
\end{align*}

The claim follows by rescaling $\varepsilon$ by $\frac{1}{9}$. ∎

We analyse the worst-case running-time of Algorithm 3.

Theorem 5.2.

Algorithm 3 has running-time $O\left(\frac{nm^{2\ell-1}\log(m)}{\varepsilon^{(2\ell-2)d}}+\frac{m^{3}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$.

Proof.

Algorithm 1 has running-time $O\left(m^{2}\log(m)(-\ln(\delta))^{2}+m^{3}\log(m)\right)$. The sample $S$ has size $O\left(\frac{-\ln(\delta)}{\varepsilon}\right)$ and the sample $W$ has size $O\left(\frac{-\ln(\delta)}{\varepsilon^{2}}\right)$. Evaluating each curve of $S$ against $W$ takes time $O\left(\frac{m^{2}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$, using the algorithm of Alt and Godau [4] to compute the distances.

Now, $c$ has up to $m$ vertices and every grid consists of $\left(\frac{(3+\varepsilon)\Delta/n}{2\varepsilon^{\prime}\Delta/(nc\sqrt{d})}\right)^{d}=\left(\frac{(3+\varepsilon)c\sqrt{d}}{2\varepsilon^{\prime}}\right)^{d}\in O\left(\frac{1}{\varepsilon^{d}}\right)$ points. Therefore, we have $O\left(\frac{m}{\varepsilon^{d}}\right)$ points in $P$ and Algorithm 3 enumerates all combinations of $2\ell-2$ points from $P$, taking time $O\left(\frac{m^{2\ell-2}}{\varepsilon^{(2\ell-2)d}}\right)$. Afterwards, these candidates are evaluated, which takes time $O(nm\log(m))$ per candidate, using the algorithm of Alt and Godau [4] to compute the distances. All in all, we then have running-time $O\left(\frac{nm^{2\ell-1}\log(m)}{\varepsilon^{(2\ell-2)d}}+\frac{m^{3}\log(m)(-\ln(\delta))^{2}}{\varepsilon^{3}}\right)$. ∎
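The distance computations above rely on the algorithm of Alt and Godau for the continuous Fréchet distance. As a simpler, self-contained stand-in for experimentation (it operates only on the vertex sequences and in general differs from the continuous distance, so it is not the algorithm used in the analysis), one can compute the discrete Fréchet distance with the standard Eiter–Mannila dynamic program in $O(m^{2})$ time:

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between polygonal curves given as
    vertex lists P and Q (standard Eiter-Mannila dynamic program)."""
    def d(p, q):
        return math.dist(p, q)

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j): optimal coupling cost for prefixes P[..i], Q[..j].
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)

# Two parallel polylines at vertical offset 1: the distance is 1.
P = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Q = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
assert discrete_frechet(P, Q) == 1.0
```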

6 $(1+\varepsilon)$-Approximation for $(1,\ell)$-Median by Advanced Shortcutting

We now present Algorithm 4, which w.h.p. returns a candidate set containing a $(1+\varepsilon)$-approximate $(1,\ell)$-median of complexity at most $2\ell-2$ for every cluster that comprises at least a constant fraction of the input. Before presenting the algorithm, we state our second shortcutting lemma. Here, we do not introduce shortcuts with respect to a single curve, but with respect to several curves: by introducing shortcuts with respect to $\varepsilon\lvert T\rvert$ well-chosen curves from the given set $T\subset\mathbb{X}^{d}_{m}$ of polygonal curves, for a given $\varepsilon\in(0,1)$, we preserve the distances to at least $(1-\varepsilon)\lvert T\rvert$ curves from $T$. In this context, well-chosen means that there exist a certain number of subsets of $T$, from each of which we have to pick a curve for shortcutting. This enables the high approximation quality of Algorithm 4, and we formalize it in the following lemma.

Figure 3: Visualization of an advanced shortcut. The black curves are input curves and the red curve is an optimal median. By inserting the blue shortcut we can find a curve that has distance no larger than the median to all but one of the black curves, and with all vertices contained in the balls centered at the black curves' vertices.
Lemma 6.1 (shortcutting using a set of polygonal curves).

Let $\sigma\in\mathbb{X}^{d}$ be a polygonal curve with $\lvert\sigma\rvert>2$ vertices and $T=\{\tau_{1},\dots,\tau_{n}\}\subset\mathbb{X}^{d}$ be a set of polygonal curves. For $i\in\{1,\dots,n\}$, let $r_{i}=d_{F}(\tau_{i},\sigma)$ and for $j\in\{1,\dots,\lvert\tau_{i}\rvert\}$, let $v^{\tau_{i}}_{j}$ be the $j$th vertex of $\tau_{i}$.

For any $\varepsilon\in(0,1)$ there are $2\lvert\sigma\rvert-4$ subsets $T_{1},\dots,T_{2\lvert\sigma\rvert-4}\subseteq T$ of $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ curves each (not necessarily disjoint) such that for every subset $T^{\prime}\subseteq T$ containing at least one curve out of each $T_{k}\in\{T_{1},\dots,T_{2\lvert\sigma\rvert-4}\}$, a polygonal curve $\sigma^{\prime}\in\mathbb{X}^{d}$ exists with every vertex contained in

\[\bigcup_{\tau_{i}\in T^{\prime}}\bigcup_{j\in\{1,\dots,\lvert\tau_{i}\rvert\}}B(v^{\tau_{i}}_{j},r_{i}),\]

$d_{F}(\tau,\sigma^{\prime})\leq d_{F}(\tau,\sigma)$ for each $\tau\in T\setminus(T_{1}\cup\dots\cup T_{2\lvert\sigma\rvert-4})$, and $\lvert\sigma^{\prime}\rvert\leq 2\lvert\sigma\rvert-2$.

The idea is the following; see Fig. 3 for a visualization. One can argue that every vertex $v$ of $\sigma$ not contained in any of the balls centered at the vertices of the curves in $T$ (with radii according to their distances to $\sigma$) can be shortcut by connecting the last point $p^{-}$ before $v$ (in terms of the parameter of $\sigma$) contained in one ball and the first point $p^{+}$ after $v$ contained in one ball. This does not increase the Fréchet distances between $\sigma$ and the $\tau\in T$, because only matchings among line segments are affected by this modification. Furthermore, most distances are preserved when we do not actually use the last and first ball before and after $v$, but one of the $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ balls before and one of the $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ balls after $v$, which is the key to the following proof.

Proof of Lemma 6.1.

For the sake of simplicity, we assume that $\frac{\varepsilon n}{2\lvert\sigma\rvert}$ is integral. Let $\ell=\lvert\sigma\rvert$. For $i\in\{1,\dots,n\}$, let $v^{\tau_{i}}_{1},\dots,v^{\tau_{i}}_{\lvert\tau_{i}\rvert}$ be the vertices of $\tau_{i}$ with instants $t^{\tau_{i}}_{1},\dots,t^{\tau_{i}}_{\lvert\tau_{i}\rvert}$ and let $v^{\sigma}_{1},\dots,v^{\sigma}_{\ell}$ be the vertices of $\sigma$ with instants $t^{\sigma}_{1},\dots,t^{\sigma}_{\ell}$. Also, for $h\in\mathcal{H}$ (recall that $\mathcal{H}$ is the set of all continuous bijections $h\colon[0,1]\rightarrow[0,1]$ with $h(0)=0$ and $h(1)=1$) and $i\in\{1,\dots,n\}$, let $r_{i,h}=\max_{t\in[0,1]}\lVert\sigma(t)-\tau_{i}(h(t))\rVert$ be the distance realized by $h$ with respect to $\tau_{i}$. We know from Lemma 2.4 that for each $i\in\{1,\dots,n\}$ there exists a sequence $(h_{i,x})_{x=1}^{\infty}$ in $\mathcal{H}$ such that $\lim_{x\to\infty}r_{i,h_{i,x}}=d_{F}(\sigma,\tau_{i})=r_{i}$.

In the following, given arbitrary $h_{1},\dots,h_{n}\in\mathcal{H}$, we describe how to modify $\sigma$ such that its vertices lie in the balls around the vertices of the $\tau\in T$, of radii determined by $h_{1},\dots,h_{n}$. Later we will argue that this modification can, in particular, be applied using $h_{1,x},\dots,h_{n,x}$ for each $x\in\mathbb{N}$.

Now, fix arbitrary $h_{1},\dots,h_{n}\in\mathcal{H}$ and, for an arbitrary $k\in\{2,\dots,\lvert\sigma\rvert-1\}$, fix the vertex $v^{\sigma}_{k}$ of $\sigma$ with instant $t^{\sigma}_{k}$. For $i\in\{1,\dots,n\}$, let $s_{i}$ be the maximum of $\{1,\dots,\lvert\tau_{i}\rvert-1\}$ such that $t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{\sigma}_{k})\leq t^{\tau_{i}}_{s_{i}+1}$. Namely, $v^{\sigma}_{k}$ is matched by $h_{1},\dots,h_{n}$ to a point on the line segments $\overline{v^{\tau_{1}}_{s_{1}}v^{\tau_{1}}_{s_{1}+1}},\dots,\overline{v^{\tau_{n}}_{s_{n}}v^{\tau_{n}}_{s_{n}+1}}$, respectively.

For $i\in\{1,\dots,n\}$, let $t^{-}_{i}$ be the maximum of $[0,t^{\sigma}_{k}]$ such that $\sigma(t^{-}_{i})\in B(v^{\tau_{i}}_{s_{i}},r_{i,h_{i}})$ and let $t^{+}_{i}$ be the minimum of $[t^{\sigma}_{k},1]$ such that $\sigma(t^{+}_{i})\in B(v^{\tau_{i}}_{s_{i}+1},r_{i,h_{i}})$. These are the instants when $\sigma$ last visits $B(v^{\tau_{i}}_{s_{i}},r_{i,h_{i}})$ before or when visiting $v^{\sigma}_{k}$, and when $\sigma$ first visits $B(v^{\tau_{i}}_{s_{i}+1},r_{i,h_{i}})$ when or after visiting $v^{\sigma}_{k}$, respectively. Furthermore, there is a permutation $\alpha\in S_{n}$ of the index set $\{1,\dots,n\}$ such that

\[t^{-}_{\alpha^{-1}(1)}\leq\dots\leq t^{-}_{\alpha^{-1}(n)}.\tag{I}\]

Also, there is a permutation $\beta\in S_{n}$ of the index set $\{1,\dots,n\}$ such that

\[t^{+}_{\beta^{-1}(1)}\leq\dots\leq t^{+}_{\beta^{-1}(n)}.\tag{II}\]

Additionally, for each $i\in\{1,\dots,n\}$ we have

\[t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{-}_{i})\tag{III}\]

and

\[h_{i}(t^{+}_{i})\leq t^{\tau_{i}}_{s_{i}+1},\tag{IV}\]

because $\sigma(t^{-}_{i})$ and $\sigma(t^{+}_{i})$ are, by definition, the closest points to $v^{\sigma}_{k}$ on $\sigma$ that have distance at most $r_{i,h_{i}}$ to $v^{\tau_{i}}_{s_{i}}$ and $v^{\tau_{i}}_{s_{i}+1}$, respectively. We will now use Eqs. I, II, III and IV to prove that an advanced shortcut only affects matchings among line segments, and hence we can easily bound the resulting distances for at least $(1-\varepsilon)n$ of the curves.

Let

\[I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})=\{\tau_{\alpha^{-1}((1-\frac{\varepsilon}{2\ell})n+1)},\dots,\tau_{\alpha^{-1}(n)}\},\quad O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})=\{\tau_{\beta^{-1}(1)},\dots,\tau_{\beta^{-1}(\frac{\varepsilon n}{2\ell})}\}.\]

$I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ is the set of the last $\frac{\varepsilon n}{2\ell}$ curves whose balls are visited by $\sigma$ before or when $\sigma$ visits $v^{\sigma}_{k}$. Similarly, $O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ is the set of the first $\frac{\varepsilon n}{2\ell}$ curves whose balls are visited by $\sigma$ when or immediately after $\sigma$ has visited $v^{\sigma}_{k}$. We now modify $\sigma$ such that $v^{\sigma}_{k}$ is replaced by two new vertices, one an element of at least one $B(v^{\tau_{i}}_{j},r_{i,h_{i}})$ for a $\tau_{i}\in I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ and $j\in\{1,\dots,\lvert\tau_{i}\rvert\}$, the other likewise for a $\tau_{i}\in O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$.

Let $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ be the piecewise-defined curve that agrees with $\sigma$ on $\left[0,t^{-}_{\alpha^{-1}(k_{1})}\right]$ and $\left[t^{+}_{\beta^{-1}(k_{2})},1\right]$ for arbitrary $k_{1}\in\{(1-\frac{\varepsilon}{2\ell})n+1,\dots,n\}$ and $k_{2}\in\{1,\dots,\frac{\varepsilon n}{2\ell}\}$, but on $\left(t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right)$ connects $\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)$ and $\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)$ with the line segment

\[\gamma(t)=\left(1-\frac{t-t^{-}_{\alpha^{-1}(k_{1})}}{t^{+}_{\beta^{-1}(k_{2})}-t^{-}_{\alpha^{-1}(k_{1})}}\right)\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)+\frac{t-t^{-}_{\alpha^{-1}(k_{1})}}{t^{+}_{\beta^{-1}(k_{2})}-t^{-}_{\alpha^{-1}(k_{1})}}\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right).\]
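The connecting curve $\gamma$ is simply a linear interpolation between the two shortcut endpoints, reparametrized over the interval $\left(t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right)$. As a small sketch (names are ours, for illustration only):

```python
def shortcut_segment(a, b, t_minus, t_plus):
    """Return gamma: [t_minus, t_plus] -> R^d, the line segment that
    linearly interpolates from point a to point b, as in the formula above."""
    def gamma(t):
        lam = (t - t_minus) / (t_plus - t_minus)
        return tuple((1 - lam) * x + lam * y for x, y in zip(a, b))
    return gamma

gamma = shortcut_segment((0.0, 0.0), (2.0, 4.0), t_minus=0.25, t_plus=0.75)
assert gamma(0.25) == (0.0, 0.0)  # endpoint sigma(t^-)
assert gamma(0.75) == (2.0, 4.0)  # endpoint sigma(t^+)
assert gamma(0.5) == (1.0, 2.0)   # midpoint of the segment
```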

We now argue that for all $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ the Fréchet distance between $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ and $\tau_{i}$ is upper-bounded by $r_{i,h_{i}}$. First, note that by definition $h_{1},\dots,h_{n}$ are strictly increasing functions, since they are continuous bijections that map $0$ to $0$ and $1$ to $1$. As an immediate consequence, we have

\[t^{\tau_{i}}_{s_{i}}\leq h_{i}(t^{-}_{i})\leq h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\tag{V}\]

for each $\tau_{i}\in T\setminus I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$ and

\[h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\leq h_{i}(t^{+}_{i})\leq t^{\tau_{i}}_{s_{i}+1}\tag{VI}\]

for each $\tau_{i}\in T\setminus O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})$, using Eqs. I, II, III and IV. Therefore, each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ has no vertex between the instants $h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)$ and $h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)$. We also know that for each $\tau_{i}\in T$

\[\left\lVert\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)-\tau_{i}\left(h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)\right\rVert\leq r_{i,h_{i}}\tag{VII}\]

and

\[\left\lVert\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)-\tau_{i}\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right)\right\rVert\leq r_{i,h_{i}}.\tag{VIII}\]

Let $D_{s,\sigma}=\left[0,t^{-}_{\alpha^{-1}(k_{1})}\right)$, $D_{m,\sigma}=\left[t^{-}_{\alpha^{-1}(k_{1})},t^{+}_{\beta^{-1}(k_{2})}\right]$ and $D_{e,\sigma}=\left(t^{+}_{\beta^{-1}(k_{2})},1\right]$. Also, for $i\in\{1,\dots,n\}$, let $D_{s,\tau_{i}}=\left[0,h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)$, $D_{m,\tau_{i}}=\left[h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right),h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right]$ and $D_{e,\tau_{i}}=\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right),1\right]$. Now, for each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$ we use $h_{i}$ to match $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{s,\sigma}}$ to $\tau_{i}|_{D_{s,\tau_{i}}}$ and $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{e,\sigma}}$ to $\tau_{i}|_{D_{e,\tau_{i}}}$ with distance at most $r_{i,h_{i}}$. Since $\sigma^{\prime}_{h_{1},\dots,h_{n}}|_{D_{m,\sigma}}$ and $\tau_{i}|_{D_{m,\tau_{i}}}$ are just line segments by Eqs. V and VI, they can be matched to each other with distance at most

\[\max\left\{\left\lVert\sigma\left(t^{-}_{\alpha^{-1}(k_{1})}\right)-\tau_{i}\left(h_{i}\left(t^{-}_{\alpha^{-1}(k_{1})}\right)\right)\right\rVert,\left\lVert\sigma\left(t^{+}_{\beta^{-1}(k_{2})}\right)-\tau_{i}\left(h_{i}\left(t^{+}_{\beta^{-1}(k_{2})}\right)\right)\right\rVert\right\},\]

which is at most $r_{i,h_{i}}$ by Eqs. VII and VIII. We conclude that $d_{F}(\sigma^{\prime}_{h_{1},\dots,h_{n}},\tau_{i})\leq r_{i,h_{i}}$.

Because this modification works for every $h_{1},\dots,h_{n}\in\mathcal{H}$, we conclude that $d_{F}(\sigma^{\prime}_{h_{1},\dots,h_{n}},\tau_{i})\leq r_{i,h_{i}}$ for every $h_{1},\dots,h_{n}\in\mathcal{H}$ and $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1},\dots,h_{n})\cup O_{v^{\sigma}_{k}}(h_{1},\dots,h_{n}))$. Thus $\lim_{x\to\infty}d_{F}(\sigma^{\prime}_{h_{1,x},\dots,h_{n,x}},\tau_{i})\leq d_{F}(\sigma,\tau_{i})=r_{i}$ for each $\tau_{i}\in T\setminus(I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})\cup O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x}))$.

Now, to prove the claim, for each combination $h_{1},\dots,h_{n}\in\mathcal{H}$, we apply this modification to $v^{\sigma}_{k}$ and successively to every other vertex $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{l}$ of the resulting curve $\sigma^{\prime}_{h_{1},\dots,h_{n}}$, except $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{1}$ and $v^{\sigma^{\prime}_{h_{1},\dots,h_{n}}}_{\lvert\sigma^{\prime}_{h_{1},\dots,h_{n}}\rvert}$, since these must be elements of $B(v^{\tau_{i}}_{1},r_{i,h_{i}})$ and $B(v^{\tau_{i}}_{\lvert\tau_{i}\rvert},r_{i,h_{i}})$, respectively, for each $i\in\{1,\dots,n\}$, by definition of the Fréchet distance.

Since the modification is repeated at most $\lvert\sigma\rvert-2$ times for each combination $h_{1},\dots,h_{n}\in\mathcal{H}$, we conclude that the number of vertices of each $\sigma^{\prime}_{h_{1},\dots,h_{n}}$ can be bounded by $2\cdot(\lvert\sigma\rvert-2)+2=2\lvert\sigma\rvert-2$.

$T_{1},\dots,T_{2\ell-4}$ are therefore all the $I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ and $O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ for $k\in\{2,\dots,2\lvert\sigma\rvert-3\}$, as $x\to\infty$. Note that every $I_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ and $O_{v^{\sigma}_{k}}(h_{1,x},\dots,h_{n,x})$ is determined by the visiting order of the balls, and since their radii converge, these sets converge too. ∎

We now present Algorithm 4, which is nearly identical to Algorithm 2 but uses the advanced shortcutting lemma. Furthermore, like Algorithm 2, it can be used as a plugin in the recursive $k$-median approximation scheme (Algorithm 5) that we present in Section 7.

Algorithm 4 $(1,\ell)$-Median for Subset by Advanced Shortcutting
1: procedure $(1,\ell)$-Median-$(1+\varepsilon)$-Candidates($T=\{\tau_{1},\dots,\tau_{n}\},\beta,\delta,\varepsilon$)
2:     $\varepsilon^{\prime}\leftarrow\varepsilon/6$, $C\leftarrow\emptyset$
3:     $S\leftarrow$ sample $\left\lceil-8\beta\ell(\varepsilon^{\prime})^{-1}(\ln(\delta)-\ln(4(2\ell-4)))\right\rceil$ curves from $T$ uniformly and independently with replacement
4:     for $S^{\prime}\subseteq S$ with $\lvert S^{\prime}\rvert=\frac{\lvert S\rvert}{2\beta}$ do
5:         $c\leftarrow$ $(1,\ell)$-Median-$34$-Approximation($S^{\prime},\delta/4$) ▷ Algorithm 1
6:         $\Delta\leftarrow\operatorname{cost}(S^{\prime},c)$, $\Delta_{l}\leftarrow\frac{2\delta n}{4\lvert S\rvert}\frac{\Delta}{34}$, $\Delta_{u}\leftarrow\frac{1}{\varepsilon^{\prime}}\Delta$
7:         $C\leftarrow C\cup\{c\}$, $P\leftarrow\emptyset$
8:         for $s\in S^{\prime}$ do
9:             for $i\in\{1,\dots,\lvert s\rvert\}$ do
10:                 $P\leftarrow P\cup\mathbb{G}\left(B\left(v^{s}_{i},\frac{4\ell}{\varepsilon^{\prime}}\Delta_{u}\right),\frac{2\varepsilon^{\prime}}{n\sqrt{d}}\Delta_{l}\right)$ ▷ $v^{s}_{i}$: $i$th vertex of $s$
11:         $C\leftarrow C\ \cup$ set of all polygonal curves with $2\ell-2$ vertices from $P$
12:     return $C$
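Lines 10 and 11 first collect grid points covering balls around the sampled vertices and then enumerate every sequence of $2\ell-2$ grid points as a candidate curve, so the candidate set has size $\lvert P\rvert^{2\ell-2}$. A minimal sketch of this enumeration step (our own illustration; `grid_cover` uses a simple cube grid as a stand-in for $\mathbb{G}(B,\cdot)$):

```python
import itertools

def grid_cover(center, radius, width):
    """Grid points of spacing `width` covering the axis-parallel cube
    enclosing the ball B(center, radius) -- a stand-in for G(B, w)."""
    axes = []
    for c in center:
        k_lo = int((c - radius) // width)
        k_hi = int((c + radius) // width) + 1
        axes.append([k * width for k in range(k_lo, k_hi + 1)])
    return set(itertools.product(*axes))

def candidates(P, num_vertices):
    """All polygonal curves (vertex sequences) with `num_vertices`
    vertices chosen from the point set P."""
    return itertools.product(sorted(P), repeat=num_vertices)

P = grid_cover(center=(0.0, 0.0), radius=0.5, width=0.5)
ell = 2
cands = list(candidates(P, 2 * ell - 2))
assert len(cands) == len(P) ** (2 * ell - 2)
```

The exponential dependence on $\ell$ and $d$ in Theorem 5.2 is exactly this $\lvert P\rvert^{2\ell-2}$ enumeration with $\lvert P\rvert\in O(m/\varepsilon^{d})$.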

We prove the quality of approximation of Algorithm 4.

Theorem 6.2.

Given three parameters $\beta\in[1,\infty)$, $\delta\in(0,1)$, $\varepsilon\in(0,0.158]$ and a set $T=\{\tau_{1},\dots,\tau_{n}\}\subset\mathbb{X}^{d}_{m}$ of polygonal curves, with probability at least $1-\delta$ the set of candidates that Algorithm 4 returns contains a $(1+\varepsilon)$-approximate $(1,\ell)$-median with up to $2\ell-2$ vertices for any $T^{\prime}\subseteq T$ with $\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert$.

In the following proof we make use of a case distinction developed by Nath and Taylor [28, Proof of Theorem 10], which is a key ingredient in enabling the $(1+\varepsilon)$-approximation, though the domain of $\varepsilon$ has to be restricted to $(0,0.158]$.

Proof of Theorem 6.2.

We assume that $\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert$. Let $n^{\prime}$ be the number of sampled curves in $S$ that are elements of $T^{\prime}$. Clearly, $\operatorname{E}[n^{\prime}]\geq\sum_{i=1}^{\lvert S\rvert}\frac{1}{\beta}=\frac{\lvert S\rvert}{\beta}$. Also, $n^{\prime}$ is the sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

\[\Pr\left[n^{\prime}<\frac{\lvert S\rvert}{2\beta}\right]\leq\Pr\left[n^{\prime}<\frac{\operatorname{E}[n^{\prime}]}{2}\right]\leq\exp\left(-\frac{1}{4}\frac{\lvert S\rvert}{2\beta}\right)\leq\exp\left(\frac{\ell(\ln(\delta)-\ln(4(2\ell-4)))}{\varepsilon}\right)\leq\left(\frac{\delta^{\ell}}{4^{\ell}}\right)^{\frac{1}{\varepsilon}}\leq\frac{\delta}{8}.\]

In other words, with probability at most $\delta/8$ no subset $S^{\prime}\subseteq S$ of cardinality at least $\frac{\lvert S\rvert}{2\beta}$ is a subset of $T^{\prime}$. We condition the rest of the proof on the contrary event, denoted by $\mathcal{E}_{T^{\prime}}$, namely that there is a subset $S^{\prime}\subseteq S$ with $S^{\prime}\subseteq T^{\prime}$ and $\lvert S^{\prime}\rvert\geq\frac{\lvert S\rvert}{2\beta}$. Note that $S^{\prime}$ is then a uniform and independent sample of $T^{\prime}$.
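The sample size in line 3 of Algorithm 4 is chosen exactly so that the Chernoff bound above lands below $\delta/8$. A quick numerical sanity check for concrete parameter values (chosen arbitrarily for illustration; here we plug $\varepsilon$ directly into the formula in place of $\varepsilon^{\prime}$):

```python
import math

def sample_size(beta, ell, eps, delta):
    """|S| from line 3 of Algorithm 4."""
    return math.ceil(-8 * beta * ell / eps
                     * (math.log(delta) - math.log(4 * (2 * ell - 4))))

beta, ell, eps, delta = 2.0, 3, 0.1, 0.05
S = sample_size(beta, ell, eps, delta)
# Chernoff bound from the proof: Pr[n' < |S|/(2 beta)] <= exp(-|S|/(8 beta)).
bound = math.exp(-S / (8 * beta))
assert bound <= delta / 8
```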

Now, let $c^{\ast}\in\mathbb{X}^{d}_{\ell}$ be an optimal $(1,\ell)$-median for $T^{\prime}$. The expected distance between $s\in S^{\prime}$ and $c^{\ast}$ is

\[\operatorname{E}\left[d_{F}(s,c^{\ast})\mid\mathcal{E}_{T^{\prime}}\right]=\sum_{\tau\in T^{\prime}}d_{F}(c^{\ast},\tau)\cdot\frac{1}{\lvert T^{\prime}\rvert}=\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}.\]

By linearity we have $\operatorname{E}\left[\operatorname{cost}(S^{\prime},c^{\ast})\mid\mathcal{E}_{T^{\prime}}\right]=\frac{\lvert S^{\prime}\rvert}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$. Markov's inequality yields:

\[\Pr\left[\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\leq\frac{\delta}{4}.\]

We conclude that with probability at most $\delta/4$ we have $\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})$.

Now, from Lemma 6.1 we know that there are $2\ell-4$ subsets $T^{\prime}_{1},\dots,T^{\prime}_{2\ell-4}\subseteq T^{\prime}$, of cardinality $\frac{\varepsilon\lvert T^{\prime}\rvert}{2\ell}$ each and not necessarily disjoint, such that for every set $W\subseteq T^{\prime}$ that contains at least one curve $\tau\in T^{\prime}_{i}$ for each $i\in\{1,\dots,2\ell-4\}$, there exists a curve $c^{\prime}\in\mathbb{X}^{d}_{2\ell-2}$ which has all of its vertices contained in

\[\bigcup_{\tau\in W}\bigcup_{j\in\{1,\dots,\lvert\tau\rvert\}}B(v^{\tau}_{j},d_{F}(\tau,c^{\ast}))\]

and for at least $(1-\varepsilon)\lvert T^{\prime}\rvert$ curves $\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$ it holds that $d_{F}(\tau,c^{\prime})\leq d_{F}(\tau,c^{\ast})$.

There are at most $\frac{\varepsilon\lvert T^{\prime}\rvert}{4\ell}$ curves with distance to $c^{\ast}$ at least $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$; otherwise the cost of these curves would exceed $\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction. Later we will prove that each ball we cover has radius at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$. Therefore, for each $i\in\{1,\dots,2\ell-4\}$ we may have to ignore up to half of the curves $\tau\in T^{\prime}_{i}$, since we do not cover the balls of radius $d_{F}(\tau,c^{\ast})$ centered at their vertices. For each $i\in\{1,\dots,2\ell-4\}$ and $s\in S^{\prime}$ we now have

\[\Pr\left[s\in T^{\prime}_{i}\wedge d_{F}(s,c^{\ast})\leq\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\geq\frac{\varepsilon}{4\ell}.\]

Therefore, by independence, for each $i\in\{1,\dots,2\ell-4\}$ the probability that no $s\in S^{\prime}$ is an element of $T^{\prime}_{i}$ and has distance to $c^{\ast}$ at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$ is at most $\left(1-\frac{\varepsilon}{4\ell}\right)^{\lvert S^{\prime}\rvert}\leq\exp\left(-\frac{\varepsilon}{4\ell}\cdot\frac{4\ell(\ln(4(2\ell-4))-\ln(\delta))}{\varepsilon}\right)=\exp\left(\ln\left(\frac{\delta}{4(2\ell-4)}\right)\right)=\frac{\delta}{4(2\ell-4)}$. Also, with probability at most $\delta/4$, Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $c\in\mathbb{X}^{d}_{\ell}$ for $S^{\prime}$, cf. Corollary 3.3.
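The averaging argument above (at most $\frac{\varepsilon\lvert T^{\prime}\rvert}{4\ell}$ curves can be farther than $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$ from $c^{\ast}$) is Markov's inequality applied to the empirical distance distribution; it holds deterministically for any list of nonnegative distances, as this small sketch (our illustration) shows:

```python
def count_far(distances, ell, eps):
    """Number of curves whose distance to c* is at least the
    threshold 4*ell*cost/(eps*n) from the averaging argument."""
    n = len(distances)
    cost = sum(distances)  # cost(T', c*)
    threshold = 4 * ell * cost / (eps * n)
    return sum(1 for d in distances if d >= threshold)

# Markov: the count is at most eps*n/(4*ell) for ANY nonnegative input,
# since count * threshold <= sum(distances) = cost.
distances = [0.1, 0.2, 0.3, 5.0, 0.05, 0.1, 0.2, 9.0, 0.1, 0.4]
ell, eps = 3, 0.5
assert count_far(distances, ell, eps) <= eps * len(distances) / (4 * ell)
```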

Using a union bound over these bad events, we conclude that with probability at least $1-\frac{7}{8}\delta$ all of the following events occur simultaneously:

1. There is a subset $S^{\prime}\subseteq S$ of cardinality at least $\lvert S\rvert/(2\beta)$ that is a uniform and independent sample of $T^{\prime}$,

2. for each $i\in\{1,\dots,2\ell-4\}$, $S^{\prime}$ contains at least one curve from $T^{\prime}_{i}$ with distance to $c^{\ast}$ at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}$,

3. Algorithm 1 computes a polygonal curve $c\in\mathbb{X}^{d}_{\ell}$ with $\operatorname{cost}(S^{\prime},c^{\ast}_{S^{\prime}})\leq\operatorname{cost}(S^{\prime},c)\leq 34\operatorname{cost}(S^{\prime},c^{\ast}_{S^{\prime}})$, where $c^{\ast}_{S^{\prime}}\in\mathbb{X}^{d}_{\ell}$ is an optimal $(1,\ell)$-median for $S^{\prime}$,

4. and it holds that $\frac{\delta\lvert T^{\prime}\rvert}{4\lvert S^{\prime}\rvert}\operatorname{cost}(S^{\prime},c^{\ast})\leq\operatorname{cost}(T^{\prime},c^{\ast})$.

Let $B_{c^{\ast}}=\left\{\tau\in T^{\prime}\mid d_{F}(\tau,c^{\ast})\leq\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon^{2}\lvert T^{\prime}\rvert}\right\}$, $T^{\prime}_{c^{\ast}}=T^{\prime}\cap B_{c^{\ast}}$ and $B_{c}=\left\{\tau\in T^{\prime}\mid d_{F}(\tau,c)\leq\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\right\}$. First, note that $\lvert T^{\prime}\setminus B_{c^{\ast}}\rvert\leq\varepsilon^{2}\lvert T^{\prime}\rvert$, otherwise $\operatorname{cost}(T^{\prime}\setminus B_{c^{\ast}},c^{\ast})>\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction; therefore $\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})\lvert T^{\prime}\rvert$. We now distinguish two cases:

Case 1: $\lvert T^{\prime}_{c^{\ast}}\setminus B_{c}\rvert>2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert$

We have $2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})2\varepsilon\lvert T^{\prime}\rvert\geq\varepsilon\lvert T^{\prime}\rvert$, hence $\Pr\left[d_{F}(s,c)>\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\ \Big{|}\ \mathcal{E}_{T^{\prime}}\right]\geq\varepsilon$ for each $s\in S^{\prime}$. Using independence, we conclude that with probability at most

\[(1-\varepsilon)^{\lvert S^{\prime}\rvert}\leq\exp\left(-\varepsilon\frac{4\ell(\ln(4(2\ell-4))-\ln(\delta))}{\varepsilon}\right)\leq\frac{\delta^{4\ell}}{4^{4\ell}}\leq\frac{\delta}{8}\]

no $s\in S^{\prime}$ has distance to $c$ greater than $\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$. Using a union bound again, we conclude that with probability at least $1-\delta$, Items 1, 2, 3 and 4 occur simultaneously and at least one $s\in S^{\prime}$ has distance to $c$ greater than $\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$, hence $\operatorname{cost}(S^{\prime},c)>\varepsilon\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}\Leftrightarrow\frac{\operatorname{cost}(S^{\prime},c)}{\varepsilon}>\frac{\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$, and thus we indeed cover the balls of radius at most $\frac{4\ell\operatorname{cost}(T^{\prime},c^{\ast})}{\varepsilon\lvert T^{\prime}\rvert}<\frac{4\ell}{\varepsilon}\frac{\operatorname{cost}(S^{\prime},c)}{\varepsilon}$.

In the last step, Algorithm 4 returns the set $C$ of all curves with up to $2\ell-2$ vertices from the grids, which contains one curve, denoted by $c^{\prime\prime}$, with the same number of vertices as $c^{\prime}$ (recall that this is the curve guaranteed by Lemma 6.1) and distance at most $\frac{\varepsilon}{n}\Delta_{l}\leq\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$ between every corresponding pair of vertices of $c^{\prime}$ and $c^{\prime\prime}$. We conclude that $d_{F}(c^{\prime},c^{\prime\prime})\leq\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}(T^{\prime},c^{\ast})$. Also, recall that $d_{F}(\tau,c^{\prime})\leq d_{F}(\tau,c^{\ast})$ for $\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$. Further, $T^{\prime}$ contains at least $\frac{\lvert T^{\prime}\rvert}{2}$ curves with distance at most $\frac{2\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$ to $c^{\ast}$, otherwise the cost of the remaining curves would exceed $\operatorname{cost}(T^{\prime},c^{\ast})$, which is a contradiction; since $\varepsilon<\frac{1}{2}$, by the pigeonhole principle there is at least one curve $\sigma\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})$ with $d_{F}(\sigma,c^{\prime})\leq d_{F}(\sigma,c^{\ast})\leq\frac{2\operatorname{cost}(T^{\prime},c^{\ast})}{\lvert T^{\prime}\rvert}$. We can now bound the cost of $c^{\prime\prime}$ as follows:

cost(T,c′′)\displaystyle\operatorname{cost}\left\lparen T^{\prime},c^{\prime\prime}\right\rparen =τTdF(τ,c′′)τT(T1T24)(dF(τ,c)+ε|T|cost(T,c))+\displaystyle={}\sum_{\tau\in T^{\prime}}d_{F}(\tau,c^{\prime\prime})\leq\sum_{\tau\in T^{\prime}\setminus(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left(d_{F}(\tau,c^{\prime})+\frac{\varepsilon}{\lvert T^{\prime}\rvert}\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen\right)\ +
τ(T1T24)(dF(τ,c)+dF(c,σ)+dF(σ,c)+dF(c,c′′))\displaystyle\ \ \ \ \sum_{\tau\in(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},\sigma)+d_{F}(\sigma,c^{\prime})+d_{F}(c^{\prime},c^{\prime\prime})\right)
(1+ε)cost(T,c)+τ(T1T24)((2+2+ε)cost(T,c)|T|)\displaystyle\leq{}(1+\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+\sum_{\tau\in(T^{\prime}_{1}\cup\dots\cup T^{\prime}_{2\ell-4})}\left((2+2+\varepsilon)\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}\right)
cost(T,c)+εcost(T,c)+5εcost(T,c)=(1+6ε)cost(T,c).\displaystyle\leq{}\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+\varepsilon\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen+5\varepsilon\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen=(1+6\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen.

Case 2: |TcBc|2ε|Tc|\lvert T^{\prime}_{c^{\ast}}\setminus B_{c}\rvert\leq 2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert

Again, we distinguish two cases:

Case 2.1: dF(c,c)4εcost(T,c)|T|d_{F}(c,c^{\ast})\leq 4\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}

We can easily bound the cost of cc:

cost(T,c)τT(dF(τ,c)+dF(c,c))(1+4ε)cost(T,c).\displaystyle\operatorname{cost}\left\lparen T^{\prime},c\right\rparen\leq\sum_{\tau\in T^{\prime}}(d_{F}(\tau,c^{\ast})+d_{F}(c^{\ast},c))\leq(1+4\varepsilon)\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen.

Case 2.2: dF(c,c)>4εcost(T,c)|T|d_{F}(c,c^{\ast})>4\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}

Recall that |Tc|(1ε2)|T|\lvert T^{\prime}_{c^{\ast}}\rvert\geq(1-\varepsilon^{2})\lvert T^{\prime}\rvert. We have

|TBc|\displaystyle\lvert T^{\prime}\setminus B_{c}\rvert |TTc|+2ε|Tc|=|T|(12ε)|Tc||T|(12ε)(1ε2)|T|\displaystyle\leq{}\lvert T^{\prime}\setminus T^{\prime}_{c^{\ast}}\rvert+2\varepsilon\lvert T^{\prime}_{c^{\ast}}\rvert=\lvert T^{\prime}\rvert-(1-2\varepsilon)\lvert T^{\prime}_{c^{\ast}}\rvert\leq\lvert T^{\prime}\rvert-(1-2\varepsilon)(1-\varepsilon^{2})\lvert T^{\prime}\rvert
=(2ε+ε22ε3)|T|<13|T|.\displaystyle=(2\varepsilon+\varepsilon^{2}-2\varepsilon^{3})\lvert T^{\prime}\rvert<\frac{1}{3}\lvert T^{\prime}\rvert.
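The final strict inequality requires 2ε+ε22ε3<132\varepsilon+\varepsilon^{2}-2\varepsilon^{3}<\frac{1}{3}, which is exactly where the restriction ε0.158\varepsilon\leq 0.158 in Corollary 7.5 originates: that value sits just below the relevant real root. A quick numeric probe (our own illustration, not part of the proof):

```python
# The pruning argument needs (2e + e^2 - 2e^3)|T'| < |T'|/3, i.e. a positive
# slack of 1/3 over the polynomial 2e + e^2 - 2e^3.
def slack(e: float) -> float:
    return 1 / 3 - (2 * e + e ** 2 - 2 * e ** 3)

assert slack(0.158) > 0  # admissible: the inequality holds
assert slack(0.17) < 0   # a slightly larger eps already violates it
```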

Hence, |TBc|(12εε2+2ε3)|T|>23|T|\lvert T^{\prime}\cap B_{c}\rvert\geq(1-2\varepsilon-\varepsilon^{2}+2\varepsilon^{3})\lvert T^{\prime}\rvert>\frac{2}{3}\lvert T^{\prime}\rvert. Assume we assign all curves to cc instead of to cc^{\ast}. For τTBc\tau\in T^{\prime}\cap B_{c} we now have a decrease in cost dF(τ,c)dF(τ,c)d_{F}(\tau,c^{\ast})-d_{F}(\tau,c), which can be bounded as follows:

dF(τ,c)dF(τ,c)\displaystyle d_{F}(\tau,c^{\ast})-d_{F}(\tau,c) dF(τ,c)εcost(T,c)|T|dF(c,c)dF(τ,c)εcost(T,c)|T|\displaystyle\geq{}d_{F}(\tau,c^{\ast})-\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}\geq d_{F}(c,c^{\ast})-d_{F}(\tau,c)-\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}
dF(c,c)2εcost(T,c)|T|>12dF(c,c).\displaystyle\geq d_{F}(c,c^{\ast})-2\varepsilon\frac{\operatorname{cost}\left\lparen T^{\prime},c^{\ast}\right\rparen}{\lvert T^{\prime}\rvert}>\frac{1}{2}d_{F}(c,c^{\ast}).

For τTBc\tau\in T^{\prime}\setminus B_{c} we have an increase in cost dF(τ,c)dF(τ,c)dF(c,c)d_{F}(\tau,c)-d_{F}(\tau,c^{\ast})\leq d_{F}(c,c^{\ast}). Let the overall change in cost be denoted by α\alpha, which can be bounded as follows:

α<|TBc|dF(c,c)|TBc|dF(c,c)2.\displaystyle\alpha<\lvert T^{\prime}\setminus B_{c}\rvert\cdot d_{F}(c,c^{\ast})-\lvert T^{\prime}\cap B_{c}\rvert\cdot\frac{d_{F}(c,c^{\ast})}{2}.

By the fact that |TBc|<12|TBc|\lvert T^{\prime}\setminus B_{c}\rvert<\frac{1}{2}\lvert T^{\prime}\cap B_{c}\rvert for our choice of ε\varepsilon, we conclude that α<0\alpha<0, which is a contradiction because cc^{\ast} is an optimal (1,)(1,\ell)-median for TT^{\prime}. Therefore, case 2.2 cannot occur. Rescaling ε\varepsilon by 16\frac{1}{6} proves the claim. ∎

We analyse the worst-case running-time of Algorithm 4 and the number of candidates it returns.

Theorem 6.3.

Algorithm 4 has running-time, and returns a number of candidates, bounded by 2O((ln(δ))2βε2+log(m))2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}+\log(m)\right)}.

Proof.

The sample SS has size O(ln(δ)βε)O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right) and sampling it takes time O(ln(δ)βε)O\left(\frac{-\ln(\delta)\cdot\beta}{\varepsilon}\right). Let nS=|S|n_{S}=\lvert S\rvert. The for-loop runs

(nSnS2β)2O(nS2βlognS)2O((ln(δ))2βε2)\binom{n_{S}}{\frac{n_{S}}{2\beta}}\in 2^{O\left(\frac{n_{S}}{2\beta}\log n_{S}\right)}\subset 2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}\right)}

times. In each iteration, we run Algorithm 1, taking time O(m2log(m)(ln2δ)+m3logm)O(m^{2}\log(m)(-\ln^{2}\delta)+m^{3}\log m) (cf. Corollary 3.3), we compute the cost of the returned curve with respect to SS^{\prime}, taking time O(ln(δ)εmlog(m))O\left(\frac{-\ln(\delta)}{\varepsilon}\cdot m\log(m)\right), and per curve in SS^{\prime} we build up to mm grids of size

((1+ε)Δε2ε2δnΔnd4|S|)d=(d|S|(1+ε)ε2δ)dO(βd(lnδ)dε3dδd)\left(\frac{\frac{(1+\varepsilon)\Delta}{\varepsilon}}{\frac{2\varepsilon 2\delta n\Delta}{n\sqrt{d}4\lvert S\rvert}}\right)^{d}=\left(\frac{\sqrt{d}\lvert S\rvert(1+\varepsilon)}{\varepsilon^{2}\delta}\right)^{d}\in O\left(\frac{\beta^{d}(-\ln\delta)^{d}}{\varepsilon^{3d}\delta^{d}}\right)

each. Algorithm 4 then enumerates all combinations of 222\ell-2 points from up to |S|m\lvert S^{\prime}\rvert\cdot m grids, resulting in

O(m22β2d2d+22(lnδ)2d2d+22ε6d6d+22δ2d2d)O\left(\frac{m^{2\ell-2}\beta^{2\ell d-2d+2\ell-2}(-\ln\delta)^{2\ell d-2d+2\ell-2}}{\varepsilon^{6\ell d-6d+2\ell-2}\delta^{2\ell d-2d}}\right)

candidates per iteration of the for-loop. Thus, per iteration of the for-loop, Algorithm 4 computes O(poly(m,β,δ1,ε1))O\left(\operatorname{poly}\left\lparen m,\beta,\delta^{-1},\varepsilon^{-1}\right\rparen\right) candidates and the enumeration also takes time O(poly(m,β,δ1,ε1))O\left(\operatorname{poly}\left\lparen m,\beta,\delta^{-1},\varepsilon^{-1}\right\rparen\right).

All in all, both the running-time and the number of candidates are bounded by 2O((ln(δ))2βε2+log(m))2^{O\left(\frac{(-\ln(\delta))^{2}\cdot\beta}{\varepsilon^{2}}+\log(m)\right)}. ∎

7 (1+ε)(1+\varepsilon)-Approximation for (k,)(k,\ell)-Median

We generalize the algorithm of Ackermann et al. [2] in the following way: instead of drawing a uniform sample and running a problem-specific algorithm on this sample in the candidate phase, we only run a problem-specific “plugin”-algorithm in the candidate phase, thus dropping the framework around the sampling property. We think that the problem-specific algorithms used in [2] do not fulfill the role of a plugin, since parts of the problem-specific operations, e.g. the uniform sampling, remain in the main algorithm. Here, we separate the problem-specific operations from the main algorithm: any algorithm can serve as a plugin if it is able to return candidates for a cluster that takes up a constant fraction of the input, where the fraction is an input-parameter of the algorithm, and if it guarantees some approximation factor with high probability. The calls to the candidate-finder plugin do not even need to be stochastically independent, which allows for adaptive algorithms.
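To make the plugin contract concrete, consider a deliberately naive plugin sketch (the function names and signature below are our own illustration, not the interface of [2]): in a metric space the input set itself always contains a 2-approximate 1-median, by the triangle inequality, so returning all input elements satisfies the contract with approximation factor 2 for every subset, at the price of |T| candidates.

```python
def trivial_candidates(T, beta, delta, eps):
    """Naive deterministic plugin: every input element is a candidate.

    In a metric space the best input element is a 2-approximate 1-median
    (triangle inequality), so the guarantee holds for any subset T' of T,
    even with probability 1; beta, delta and eps are simply ignored here.
    """
    return list(T)

def best_candidate(T, rho, candidates):
    # Evaluate each candidate by its 1-median cost on T and keep the cheapest.
    return min(candidates, key=lambda c: sum(rho(t, c) for t in T))
```

The factor 2 is tight in general metric spaces (consider n points at pairwise distance 2 around an optimal center at distance 1 from each); efficient plugins like Algorithms 2 and 4 trade this simplicity for far fewer candidates.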

Now, let 𝒳=(X,ρ)\mathcal{X}=(X,\rho) be an arbitrary space, where XX is any non-empty (ground-)set and ρ:X×X0\rho\colon X\times X\rightarrow\mathbb{R}_{\geq 0} is a distance function (not necessarily a metric). We introduce a generalized definition of kk-median clustering. Let the medians be restricted to come from a predefined subset YXY\subseteq X.

Definition 7.1 (generalized kk-median).

The generalized kk-median clustering problem is defined as follows, where kk\in\mathbb{N} is a fixed (constant) parameter of the problem: given a finite and non-empty set ZXZ\subseteq X, compute a set CC of kk elements from YY, such that cost(Z,C)=zZmincCρ(z,c)\operatorname{cost}\left\lparen Z,C\right\rparen=\sum\limits_{z\in Z}\min\limits_{c\in C}\rho(z,c) is minimal.
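The objective of Definition 7.1 is straightforward to state in code; a minimal sketch, with the (not necessarily metric) distance ρ passed as a callable:

```python
def kmedian_cost(Z, C, rho):
    """cost(Z, C) = sum over z in Z of the distance from z to its nearest center."""
    return sum(min(rho(z, c) for c in C) for z in Z)

# Example on the real line with rho(x, y) = |x - y| and centers {0, 10}:
# the points 1 and 2 are served by 0, the point 9 by 10, so the cost is 1 + 2 + 1.
example = kmedian_cost([1, 2, 9], [0, 10], lambda x, y: abs(x - y))  # 4
```

For the (k,ℓ)-median instantiation, `rho` would be the Fréchet distance and the elements polygonal curves.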

The following algorithm, Algorithm 5, can approximate every kk-median problem compatible with Definition 7.1, when provided with such a problem-specific plugin-algorithm for computing candidates. In particular, it can approximate the (k,)(k,\ell)-median problem for polygonal curves under the Fréchet distance, when provided with Algorithm 2 or Algorithm 4. Then, we have X=𝕏dX=\mathbb{X}^{d}, Y𝕏d𝕏d=XY\subseteq\mathbb{X}^{d}_{\ell}\subseteq\mathbb{X}^{d}=X and Z𝕏md𝕏d=XZ\subseteq\mathbb{X}^{d}_{m}\subseteq\mathbb{X}^{d}=X. Note that the algorithm computes a bicriteria approximation, that is, the solution is approximated in terms of the cost and the number of vertices of the center curves, i.e., the centers come from 𝕏22d\mathbb{X}^{d}_{2\ell-2}.

Algorithm 5 has several parameters. The first parameter CC is the set of centers found so far and κ\kappa is the number of centers yet to be found. The following parameters concern only the plugin-algorithm used within the algorithm: β\beta determines the size (in terms of a fraction of the input) of the smallest cluster for which an approximate median can be computed, δ\delta determines the probability of failure of the plugin-algorithm and ε\varepsilon determines the approximation factor of the plugin-algorithm.

Algorithm 5 works as follows: if it has already computed some centers (and there are still centers to compute), it prunes: some clusters might be too small for the plugin-algorithm to compute approximate medians for them. Algorithm 5 then calls itself recursively with only half of the input: the elements with larger distances to the centers found so far. This way the small clusters eventually take up a larger fraction of the input and can be found in the candidate phase. In this phase, Algorithm 5 calls its plugin and for each candidate the plugin returns, it calls itself recursively, adding the candidate at hand to the set of centers found so far and decrementing κ\kappa by one. Eventually, all combinations of computed candidates are evaluated against the original input and the centers that together evaluate best are returned.

Algorithm 5 Recursive Approximation-Scheme for kk-Median Clustering
1:procedure kk-Median(T,C,κ,β,δ,εT,C,\kappa,\beta,\delta,\varepsilon)
2:     if κ=0\kappa=0 then
3:         return CC      \triangleright Pruning Phase
4:     if CC\neq\emptyset then
5:         PP\leftarrow set of |T|2\left\lfloor\frac{\lvert T\rvert}{2}\right\rfloor elements τT\tau\in T, such that mincCρ(τ,c)mincCρ(σ,c)\min\limits_{c\in C}\rho(\tau,c)\leq\min\limits_{c\in C}\rho(\sigma,c) for each σTP\sigma\in T\setminus P
6:         CC^{\prime}\leftarrow kk-Median(TP,C,κ,β,δ,εT\setminus P,C,\kappa,\beta,\delta,\varepsilon)
7:     else
8:         CC^{\prime}\leftarrow\emptyset      \triangleright Candidate Phase
9:     K1K\leftarrow 1-Median-Candidates(T,β,δ/k,ε)(T,\beta,\delta/k,\varepsilon)
10:     for cKc\in K do
11:         CckC_{c}\leftarrow k-Median(T,C{c},κ1,β,δ,ε)(T,C\cup\{c\},\kappa-1,\beta,\delta,\varepsilon)      
12:     𝒞{C}cK{Cc}\mathcal{C}\leftarrow\{C^{\prime}\}\cup\bigcup\limits_{c\in K}\{C_{c}\}
13:     return argminC𝒞cost(T,C)\operatorname*{arg\,min}\limits_{C\in\mathcal{C}}\operatorname{cost}\left\lparen T,C\right\rparen
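A direct transcription of the pseudocode above into Python may help to parse it (a sketch under our own assumptions: ties in the pruning step are broken by an arbitrary stable sort, and we added a guard that skips pruning on singleton inputs so the recursion terminates):

```python
def k_median(T, C, kappa, beta, delta, eps, k, rho, find_candidates):
    """Recursive approximation scheme for generalized k-median (sketch of Algorithm 5)."""
    if kappa == 0:
        return C
    if C and len(T) >= 2:  # termination guard added by us for singleton inputs
        # Pruning phase: discard the half of T closest to the centers found so
        # far and recurse on the remaining, farther half.
        T_sorted = sorted(T, key=lambda tau: min(rho(tau, c) for c in C))
        C_prime = k_median(T_sorted[len(T) // 2:], C, kappa,
                           beta, delta, eps, k, rho, find_candidates)
    else:
        C_prime = frozenset()
    # Candidate phase: one recursive call per candidate returned by the plugin.
    solutions = [C_prime]
    for c in find_candidates(T, beta, delta / k, eps):
        solutions.append(k_median(T, C | {c}, kappa - 1,
                                  beta, delta, eps, k, rho, find_candidates))
    # Evaluate every collected center set against the original input T.
    def total_cost(S):
        return sum(min((rho(tau, c) for c in S), default=float("inf")) for tau in T)
    return min(solutions, key=total_cost)
```

With a plugin that simply returns the distinct input elements as candidates, the call `k_median([0, 0, 0, 10, 10, 10], frozenset(), 2, ...)` on the real line recovers the centers {0, 10}.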

The quality of approximation and the worst-case running-time of Algorithm 5 are stated in the following two theorems, which we prove further below. The proofs are adaptations of corresponding proofs in [2]. We provide them for the sake of completeness.

Theorem 7.2.

Let T={τ1,,τn}XT=\{\tau_{1},\dots,\tau_{n}\}\subseteq X, α[1,)\alpha\in[1,\infty) and 11-Median-Candidates be an algorithm that, given three parameters β[1,)\beta\in[1,\infty), δ,ε(0,1)\delta,\varepsilon\in(0,1) and a set TXT\subseteq X, returns with probability at least 1δ1-\delta an (α+ε)(\alpha+\varepsilon)-approximate 11-median for any TTT^{\prime}\subseteq T, if |T|1β|T|\lvert T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T\rvert.

Algorithm 5 called with parameters (T,,k,β,δ,ε)(T,\emptyset,k,\beta,\delta,\varepsilon), where β(2k,)\beta\in(2k,\infty) and δ,ε(0,1)\delta,\varepsilon\in(0,1), returns with probability at least 1δ1-\delta a set C={c1,,ck}C=\{c_{1},\dots,c_{k}\} with cost(T,C)(1+4k2β2k)(α+ε)cost(T,C)\operatorname{cost}\left\lparen T,C\right\rparen\leq(1+\frac{4k^{2}}{\beta-2k})(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen, where CC^{\ast} is an optimal set of kk medians for TT.

Theorem 7.3.

Let T1(n,β,δ,ε)T_{1}(n,\beta,\delta,\varepsilon) denote the worst-case running-time of 11-Median-Candidates for an arbitrary input-set TT with |T|=n\lvert T\rvert=n and let C(n,β,δ,ε)C(n,\beta,\delta,\varepsilon) denote the maximum number of candidates it returns. Also, let TdT_{d} denote the worst-case running-time needed to compute the distance ρ\rho between an input element and a candidate.

If T1T_{1} and CC are non-decreasing in nn, Algorithm 5 has running-time O(C(n,β,δ,ε)k+2nTd+C(n,β,δ,ε)k+1T1(n,β,δ,ε))O(C(n,\beta,\delta,\varepsilon)^{k+2}\cdot n\cdot T_{d}+C(n,\beta,\delta,\varepsilon)^{k+1}\cdot T_{1}(n,\beta,\delta,\varepsilon)).
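To get a feeling for the recursion behind this bound, one can count the plugin invocations via the recurrence P(n,κ)=1+P(n/2,κ)+CP(n,κ1)P(n,\kappa)=1+P(n/2,\kappa)+C\cdot P(n,\kappa-1) with P(,0)=0P(\cdot,0)=0, induced by the pruning and candidate phases. A small memoized evaluator (our own illustration, assuming nn is a power of two and a plugin that always returns exactly cc candidates):

```python
from functools import lru_cache

def plugin_calls(n: int, kappa: int, c: int) -> int:
    """Number of 1-Median-Candidates invocations made by the recursion tree."""
    @lru_cache(maxsize=None)
    def count(m, k):
        if k == 0 or m == 0:
            return 0
        pruned = count(m // 2, k) if m >= 2 else 0   # pruning-phase recursion
        return 1 + pruned + c * count(m, k - 1)      # candidate-phase branching
    return count(n, kappa)
```

For fixed kk this count is roughly of order ck1(log2n)kc^{k-1}(\log_{2}n)^{k}, i.e. polynomial in the number of candidates and polylogarithmic in nn; combined with the evaluation of all collected solutions against TT, this is consistent with the C(n,β,δ,ε)k+1T1C(n,\beta,\delta,\varepsilon)^{k+1}\cdot T_{1} and C(n,β,δ,ε)k+2nTdC(n,\beta,\delta,\varepsilon)^{k+2}\cdot n\cdot T_{d} terms of the theorem.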

Now we state our main results, which follow from Theorems 4.2 and 4.3, respectively Theorems 6.2 and 6.3, and Theorems 7.2 and 7.3.

Corollary 7.4.

Given two parameters δ,ε(0,1)\delta,\varepsilon\in(0,1) and a set T𝕏mdT\subset\mathbb{X}^{d}_{m} of polygonal curves, Algorithm 5 endowed with Algorithm 2 as 11-Median-Candidates and run with parameters (T,,k,20k2ε+2k,δ,ε/5)(T,\emptyset,k,\frac{20k^{2}}{\varepsilon}+2k,\delta,\varepsilon/5) returns with probability at least 1δ1-\delta a set C𝕏22dC\subset\mathbb{X}^{d}_{2\ell-2} that is a (3+ε)(3+\varepsilon)-approximate solution to the (k,)(k,\ell)-median for TT. Algorithm 5 then has running-time n2O((ln(δ))2ε3+log(m))n\cdot 2^{O\left(\frac{(-\ln(\delta))^{2}}{\varepsilon^{3}}+\log(m)\right)}.

Corollary 7.5.

Given two parameters δ(0,1),ε(0,0.158]\delta\in(0,1),\varepsilon\in(0,0.158] and a set T𝕏mdT\subset\mathbb{X}^{d}_{m} of polygonal curves, Algorithm 5 endowed with Algorithm 4 as 11-Median-Candidates and run with parameters (T,,k,12k2ε+2k,δ,ε/3)(T,\emptyset,k,\frac{12k^{2}}{\varepsilon}+2k,\delta,\varepsilon/3) returns with probability at least 1δ1-\delta a set C𝕏22dC\subset\mathbb{X}^{d}_{2\ell-2} that is a (1+ε)(1+\varepsilon)-approximate solution to the (k,)(k,\ell)-median for TT. Algorithm 5 then has running-time n2O((ln(δ))2ε3+log(m))n\cdot 2^{O\left(\frac{(-\ln(\delta))^{2}}{\varepsilon^{3}}+\log(m)\right)}.

The following proof is an adaptation of [2, Theorem 2.2 - Theorem 2.5].

Proof of Theorem 7.2.

For k=1k=1, the claim trivially holds. We now distinguish two cases. In the first case, the principle of the proof is presented in full detail. In the second case, we only show how to generalize the first case to k>2k>2.

Case 1: k=2k=2

Let C={c1,c2}C^{\ast}=\{c^{\ast}_{1},c^{\ast}_{2}\} be an optimal set of kk medians for TT with clusters T1T^{\ast}_{1} and T2T^{\ast}_{2}, respectively, that form a partition of TT. For the sake of simplicity, assume that nn is a power of 22 and w.l.o.g. assume that |T1|12|T|>1β|T|\lvert T^{\ast}_{1}\rvert\geq\frac{1}{2}\lvert T\rvert>\frac{1}{\beta}\lvert T\rvert. Let C1C_{1} be the set of candidates returned by 11-Median-Candidates in the initial call. With probability at least 1δ/k1-\delta/k, there is a c1C1c_{1}\in C_{1} with cost(T1,c1)(α+ε)cost(T1,c1)\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen. We distinguish two cases:

Case 1.1:

There exists a recursive call with parameters (T,{c1},1,β,δ,ε)(T^{\prime},\{c_{1}\},1,\beta,\delta,\varepsilon) and |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert.

First, we assume that TT^{\prime} is the maximum cardinality input with |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert, occurring in a recursive call of the algorithm. Let C2C_{2} be the set of candidates returned by 11-Median-Candidates in this call. With probability at least 1δ/k1-\delta/k, there is a c2C2c_{2}\in C_{2} with cost(T2T,c2)(α+ε)cost(T2T,c~2)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},\widetilde{c}_{2}\right\rparen, where c~2\widetilde{c}_{2} is an optimal median for T2TT^{\ast}_{2}\cap T^{\prime}.

Let PP be the set of elements of TT removed in the mm\in\mathbb{N}, mlog2(n)m\leq\log_{2}(n), pruning phases between obtaining c1c_{1} and c2c_{2}. Without loss of generality we assume that PP\neq\emptyset. For i{1,,m}i\in\{1,\dots,m\}, let PiP_{i} be the elements removed in the iith (in the order of the recursive calls occurring) pruning phase. Note that the PiP_{i} are pairwise disjoint, we have that P=i=1mPiP=\cup_{i=1}^{m}P_{i} and |Pi|=n2i\lvert P_{i}\rvert=\frac{n}{2^{i}}. Since T=T1(T2T)(T2P)T=T^{\ast}_{1}\uplus(T^{\ast}_{2}\cap T^{\prime})\uplus(T^{\ast}_{2}\cap P), we have

cost(T,{c1,c2})cost(T1,c1)+cost(T2T,c2)+cost(T2P,c1).\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen\leq\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen. (I)

Our aim is now to prove that the number of elements wrongly assigned to c1c_{1}, i.e., T2PT^{\ast}_{2}\cap P, is small and further, that their cost is a fraction of the cost of the elements correctly assigned to c1c_{1}, i.e., T1T^{\ast}_{1}.

We define R0=TR_{0}=T and for i{1,,m}i\in\{1,\dots,m\} we define Ri=Ri1PiR_{i}=R_{i-1}\setminus P_{i}. The RiR_{i} are the elements remaining after the iith pruning phase. Note that by definition |Ri|=n2i=|Pi|\lvert R_{i}\rvert=\frac{n}{2^{i}}=\lvert P_{i}\rvert. Since Rm=TR_{m}=T^{\prime} is the maximum cardinality input, with |T2T|1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert\geq\frac{1}{\beta}\lvert T^{\prime}\rvert, we have that |T2Ri|<1β|Ri|\lvert T^{\ast}_{2}\cap R_{i}\rvert<\frac{1}{\beta}\lvert R_{i}\rvert for all i{1,,m1}i\in\{1,\dots,m-1\}. Also, for each i{1,,m}i\in\{1,\dots,m\} we have PiRi1P_{i}\subseteq R_{i-1}, therefore

|T2Pi||T2Ri1|<1β|Ri1|=2βn2i\displaystyle\lvert T^{\ast}_{2}\cap P_{i}\rvert\leq\lvert T^{\ast}_{2}\cap R_{i-1}\rvert<\frac{1}{\beta}\lvert R_{i-1}\rvert=\frac{2}{\beta}\frac{n}{2^{i}} (II)

and as an immediate consequence

|T1Pi|=|Pi||T2Pi|>|Pi|1β|Ri1|=(12β)n2i.\displaystyle\lvert T^{\ast}_{1}\cap P_{i}\rvert=\lvert P_{i}\rvert-\lvert T^{\ast}_{2}\cap P_{i}\rvert>\lvert P_{i}\rvert-\frac{1}{\beta}\lvert R_{i-1}\rvert=\left(1-\frac{2}{\beta}\right)\frac{n}{2^{i}}. (III)

This tells us that mainly the elements of T1T^{\ast}_{1} are removed in the pruning phase and only very few elements of T2T^{\ast}_{2}. By definition, we have for all i{1,,m1}i\in\{1,\dots,m-1\}, σPi\sigma\in P_{i} and τPi+1\tau\in P_{i+1} that ρ(σ,c1)ρ(τ,c1)\rho(\sigma,c_{1})\leq\rho(\tau,c_{1}), hence

1|T2Pi|cost(T2Pi,c1)1|T1Pi+1|cost(T1Pi+1,c1).\frac{1}{\lvert T^{\ast}_{2}\cap P_{i}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen\leq\frac{1}{\lvert T^{\ast}_{1}\cap P_{i+1}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen.

Combining this inequality with Eqs. II and III we obtain for i{1,,m1}i\in\{1,\dots,m-1\}:

β2i2ncost(T2Pi,c1)<2i+1(12/β)ncost(T1Pi+1,c1)\displaystyle\frac{\beta 2^{i}}{2n}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{2^{i+1}}{(1-2/\beta)n}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen
\displaystyle\Leftrightarrow cost(T2Pi,c1)<2i+12n(12/β)nβ2icost(T1Pi+1,c1)=4(β2)cost(T1Pi+1,c1).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{2^{i+1}2n}{(1-2/\beta)n\beta 2^{i}}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen=\frac{4}{(\beta-2)}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen. (IV)

We still need such a bound for i=mi=m. Since |Rm|=|Pm|\lvert R_{m}\rvert=\lvert P_{m}\rvert and also RmRm1R_{m}\subseteq R_{m-1} we can use Eq. II to obtain:

|T1Rm|=|Rm||T2Rm||Rm||T2Rm1|>(12β)n2m.\displaystyle\lvert T^{\ast}_{1}\cap R_{m}\rvert=\lvert R_{m}\rvert-\lvert T^{\ast}_{2}\cap R_{m}\rvert\geq\lvert R_{m}\rvert-\lvert T^{\ast}_{2}\cap R_{m-1}\rvert>\left(1-\frac{2}{\beta}\right)\frac{n}{2^{m}}. (V)

Also, we have for all σPm\sigma\in P_{m} and τRm\tau\in R_{m} that ρ(σ,c1)ρ(τ,c1)\rho(\sigma,c_{1})\leq\rho(\tau,c_{1}) by definition, thus

1|T2Pm|cost(T2Pm,c1)1|T1Rm|cost(T1Rm,c1).\frac{1}{\lvert T^{\ast}_{2}\cap P_{m}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen\leq\frac{1}{\lvert T^{\ast}_{1}\cap R_{m}\rvert}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen.

We combine this inequality with Eq. II and Eq. V and obtain:

β2m2ncost(T2Pm,c1)<2m(12/β)ncost(T1Rm,c1)\displaystyle\frac{\beta 2^{m}}{2n}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen<\frac{2^{m}}{(1-2/\beta)n}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen
\displaystyle\Leftrightarrow cost(T2Pm,c1)<2(β2)cost(T1Rm,c1).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{m},c_{1}\right\rparen<\frac{2}{(\beta-2)}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen. (VI)

We are now ready to bound the cost of the elements of T2T^{\ast}_{2} wrongly assigned to c1c_{1}. Combining Eq. IV and Eq. VI yields:

cost(T2P,c1)\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen =i=1mcost(T2Pi,c1)<4β2i=1m1cost(T1Pi+1,c1)+2β2cost(T1Rm,c1)\displaystyle={}\sum_{i=1}^{m}\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P_{i},c_{1}\right\rparen<\frac{4}{\beta-2}\sum_{i=1}^{m-1}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap P_{i+1},c_{1}\right\rparen+\frac{2}{\beta-2}\operatorname{cost}\left\lparen T^{\ast}_{1}\cap R_{m},c_{1}\right\rparen
<4β2cost(T1,c1).\displaystyle<\frac{4}{\beta-2}\operatorname{cost}\left\lparen T^{\ast}_{1},c_{1}\right\rparen.

Here, the last inequality holds, because P2,,PmP_{2},\dots,P_{m} and RmR_{m} are pairwise disjoint. Also, we have

cost(T2T,c2)(α+ε)cost(T2T,c2~)(α+ε)cost(T2T,c2)(α+ε)cost(T2,c2).\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},\widetilde{c_{2}}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap T^{\prime},c^{\ast}_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2},c^{\ast}_{2}\right\rparen.

Finally, using Eq. I and a union bound, with probability at least 1δ1-\delta the following holds:

cost(T,{c1,c2})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen <(α+ε)cost(T1,c1)+(α+ε)cost(T2,c2)+4β2(α+ε)cost(T1,c1)\displaystyle<(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen+(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{2},c^{\ast}_{2}\right\rparen+\frac{4}{\beta-2}(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen
<(1+4β2)(α+ε)cost(T,C)=(1+4kkβ2k)(α+ε)cost(T,C)\displaystyle<\left(1+\frac{4}{\beta-2}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen=\left(1+\frac{4k}{k\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen
(1+4k2β2k)(α+ε)cost(T,C).\displaystyle\leq{}\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.

Case 1.2: For all recursive calls with parameters (T,{c1},1,β,δ,ε)(T^{\prime},\{c_{1}\},1,\beta,\delta,\varepsilon) it holds that |T2T|<1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert.

After log2(n)\log_{2}(n) pruning phases we end up with a singleton {σ}=T\{\sigma\}=T^{\prime} as input set. Since |T2T|<1β|T|\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert, it must be that 0=|T2T|<1β|T|=1β<10=\lvert T^{\ast}_{2}\cap T^{\prime}\rvert<\frac{1}{\beta}\lvert T^{\prime}\rvert=\frac{1}{\beta}<1 and thus σT1\sigma\in T^{\ast}_{1}.

Let C2C_{2} be the set of candidates returned by 11-Median-Candidates in this call. With probability at least 1δ/k1-\delta/k there is a c2C2c_{2}\in C_{2} with cost({σ},c2)(α+ε)cost({σ},c~2)(α+ε)cost({σ},c1)\operatorname{cost}\left\lparen\{\sigma\},c_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen\{\sigma\},\widetilde{c}_{2}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen\{\sigma\},c^{\ast}_{1}\right\rparen, where c~2\widetilde{c}_{2} is an optimal median for {σ}\{\sigma\}. Since cost(T2P,c1)\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen is bounded as in Case 1.1, by a union bound we have with probability at least 1δ1-\delta:

cost(T,{c1,c2})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},c_{2}\}\right\rparen cost(T1{σ},c1)+cost(T2P,c1)+cost({σ},c2)\displaystyle\leq{}\operatorname{cost}\left\lparen T^{\ast}_{1}\setminus\{\sigma\},c_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen+\operatorname{cost}\left\lparen\{\sigma\},c_{2}\right\rparen
(α+ε)cost(T1,c1)+cost(T2P,c1)\displaystyle\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{1},c^{\ast}_{1}\right\rparen+\operatorname{cost}\left\lparen T^{\ast}_{2}\cap P,c_{1}\right\rparen
(1+4β2)(α+ε)cost(T,C)\displaystyle\leq\left(1+\frac{4}{\beta-2}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen
(1+4k2β2k)(α+ε)cost(T,C).\displaystyle\leq\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.

Case 2: k>2k>2

We only prove the generalization of Case 1.1 to k>2k>2; the remainder of the proof is analogous to Case 1. For the sake of brevity, for ii\in\mathbb{N}, we define [i]={1,,i}[i]=\{1,\dots,i\}. Let C={c1,,ck}C^{\ast}=\{c^{\ast}_{1},\dots,c^{\ast}_{k}\} be an optimal set of kk medians for TT with clusters T1,,TkT^{\ast}_{1},\dots,T^{\ast}_{k}, respectively, that form a partition of TT. For the sake of simplicity, assume that nn is a power of 22 and w.l.o.g. assume |T1||Tk|\lvert T^{\ast}_{1}\rvert\geq\dots\geq\lvert T^{\ast}_{k}\rvert. For i[k]i\in[k] and j[k][i]j\in[k]\setminus[i] we define Ti,j=t=ijTtT^{\ast}_{i,j}=\uplus_{t=i}^{j}T^{\ast}_{t}.

Let 𝒯0=T\mathcal{T}_{0}=T and let (𝒯j=𝒯j1𝒫j)j=1m(\mathcal{T}_{j}=\mathcal{T}_{j-1}\setminus\mathcal{P}_{j})_{j=1}^{m} be the sequence of input sets in the recursive calls of the mm\in\mathbb{N}, mlog2(n)m\leq\log_{2}(n), pruning phases, where 𝒫j\mathcal{P}_{j} is the set of elements removed in the jjth (in the order of the recursive calls occurring) pruning phase. Let 𝒯={𝒯0}{𝒯jj[m]}\mathcal{T}=\{\mathcal{T}_{0}\}\cup\{\mathcal{T}_{j}\mid j\in[m]\}. For i[k]i\in[k], let TiT_{i} be the maximum cardinality set in 𝒯\mathcal{T}, with |TiTi|1β|Ti|\lvert T^{\ast}_{i}\cap T_{i}\rvert\geq\frac{1}{\beta}\lvert T_{i}\rvert. Note that by assumption and since β>2k\beta>2k, T1=TT_{1}=T must hold and also TjTiT_{j}\subset T_{i} for j[k][i]j\in[k]\setminus[i].

Using a union bound, with probability at least 1δ1-\delta, for each i[k]i\in[k] the call of 11-Median-Candidates with input TiT_{i} yields a candidate cic_{i} with

cost(TiTi,ci)(α+ε)cost(TiTi,c~i)(α+ε)cost(TiTi,ci)(α+ε)cost(Ti,ci),\displaystyle\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},\widetilde{c}_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c^{\ast}_{i}\right\rparen\leq(\alpha+\varepsilon)\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen, (I)

where c~i\widetilde{c}_{i} is an optimal 11-median for TiTiT^{\ast}_{i}\cap T_{i}. Let C={c1,,ck}C=\{c_{1},\dots,c_{k}\} be the set of these candidates and for i[k1]i\in[k-1], let Pi=TiTi+1P_{i}=T_{i}\setminus T_{i+1} denote the set of elements of TT removed by the pruning phases between obtaining cic_{i} and ci+1c_{i+1}. Note that the PiP_{i} are pairwise disjoint.

By definition, the sets

T1T1,,TkTk,T2,kP1,,Tk,kPk1T^{\ast}_{1}\cap T_{1},\dots,T^{\ast}_{k}\cap T_{k},T^{\ast}_{2,k}\cap P_{1},\dots,T^{\ast}_{k,k}\cap P_{k-1}

form a partition of TT, therefore

cost(T,{c1,,ck})\displaystyle\operatorname{cost}\left\lparen T,\{c_{1},\dots,c_{k}\}\right\rparen i=1kcost(TiTi,ci)+i=1k1cost(Ti+1,kPi,{c1,,ci})\displaystyle\leq{}\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen
(α+ε)i=1kcost(Ti,ci)+i=1k1cost(Ti+1,kPi,{c1,,ci}).\displaystyle\leq{}(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen. (II)

Now, it only remains to bound the cost of the wrongly assigned elements of Ti+1,kT^{\ast}_{i+1,k}. For i[k]i\in[k], let ni=|Ti|n_{i}=\lvert T_{i}\rvert and w.l.o.g. assume that PiP_{i}\neq\emptyset for each i[k1]i\in[k-1]. Each PiP_{i} is the disjoint union j=1miPi,j\uplus_{j=1}^{m_{i}}P_{i,j} of mim_{i}\in\mathbb{N} sets of elements of TT removed in the interim pruning phases and it holds that |Pi,j|=ni2j\lvert P_{i,j}\rvert=\frac{n_{i}}{2^{j}}. We now prove for each i[k1]i\in[k-1] and j[mi]j\in[m_{i}] that PiP_{i} contains a large number of elements from T1,iT^{\ast}_{1,i} and only a few elements from Ti+1,kT^{\ast}_{i+1,k}.

For i[k1]i\in[k-1], we define Ri,0=TiR_{i,0}=T_{i} and for j[mi]j\in[m_{i}] we define Ri,j=Ri,j1Pi,jR_{i,j}=R_{i,j-1}\setminus P_{i,j}. By definition, |Ri,j|=ni2j=|Pi,j|\lvert R_{i,j}\rvert=\frac{n_{i}}{2^{j}}=\lvert P_{i,j}\rvert, Ri,j1Ri,j2R_{i,j_{1}}\supset R_{i,j_{2}} for each j1[mi]j_{1}\in[m_{i}] and j2[mi][j1]j_{2}\in[m_{i}]\setminus[j_{1}], also Ri,mi=Ti+1R_{i,m_{i}}=T_{i+1}. Thus, |TtRi,j|<1β|Ri,j|\lvert T^{\ast}_{t}\cap R_{i,j}\rvert<\frac{1}{\beta}\lvert R_{i,j}\rvert for all i[k1],j[mi]i\in[k-1],j\in[m_{i}] and t[k][i]t\in[k]\setminus[i]. As an immediate consequence we obtain |Ti+1,kRi,j|kβ|Ri,j|\lvert T^{\ast}_{i+1,k}\cap R_{i,j}\rvert\leq\frac{k}{\beta}\lvert R_{i,j}\rvert. Since Pi,jRi,j1P_{i,j}\subseteq R_{i,j-1} for all i[k1]i\in[k-1] and j[mi]j\in[m_{i}], we have

|Ti+1,kPi,j||Ti+1,kRi,j1|kβ|Ri,j1|=2kβni2j,\displaystyle\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert\leq\lvert T^{\ast}_{i+1,k}\cap R_{i,j-1}\rvert\leq\frac{k}{\beta}\lvert R_{i,j-1}\rvert=\frac{2k}{\beta}\frac{n_{i}}{2^{j}}, (III)

which immediately yields

|T1,iPi,j|=|Pi,j||Ti+1,kPi,j|(12kβ)ni2j.\displaystyle\lvert T^{\ast}_{1,i}\cap P_{i,j}\rvert=\lvert P_{i,j}\rvert-\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert\geq\left(1-\frac{2k}{\beta}\right)\frac{n_{i}}{2^{j}}. (IV)

Now, by definition, for all i[k1]i\in[k-1], j[mi]{mi}j\in[m_{i}]\setminus\{m_{i}\}, σPi,j\sigma\in P_{i,j} and τPi,j+1\tau\in P_{i,j+1} we have minc{c1,,ci}ρ(σ,c)minc{c1,,ci}ρ(τ,c)\min\limits_{c\in\{c_{1},\dots,c_{i}\}}\rho(\sigma,c)\leq\min\limits_{c\in\{c_{1},\dots,c_{i}\}}\rho(\tau,c). Thus,

$$\frac{\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{i+1,k}\cap P_{i,j}\rvert}\leq\frac{\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{1,i}\cap P_{i,j+1}\rvert}.$$

Combining this inequality with Eqs. III and IV yields, for $i\in[k-1]$ and $j\in[m_{i}]\setminus\{m_{i}\}$:

$$\begin{aligned}
&\frac{\beta 2^{j}}{2kn_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\leq\frac{2^{j+1}}{(1-\frac{2k}{\beta})n_{i}}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\\
\Leftrightarrow\ &\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\leq\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\end{aligned}\tag{V}$$

For each $i\in[k-1]$ we still need an upper bound on $\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen$. Since $\lvert R_{i,m_{i}}\rvert=\lvert P_{i,m_{i}}\rvert$ and $R_{i,m_{i}}\subseteq R_{i,m_{i}-1}$, we can use Eq. III to obtain

$$\lvert T^{\ast}_{1,i}\cap R_{i,m_{i}}\rvert=\lvert R_{i,m_{i}}\rvert-\lvert T^{\ast}_{i+1,k}\cap R_{i,m_{i}}\rvert\geq\lvert R_{i,m_{i}}\rvert-\lvert T^{\ast}_{i+1,k}\cap R_{i,m_{i}-1}\rvert>\left(1-\frac{2k}{\beta}\right)\frac{n_{i}}{2^{m_{i}}}.\tag{VI}$$

By definition, for all $i\in[k-1]$, $\sigma\in P_{i,m_{i}}$ and $\tau\in R_{i,m_{i}}$ we also have $\min_{c\in\{c_{1},\dots,c_{i}\}}\rho(\sigma,c)\leq\min_{c\in\{c_{1},\dots,c_{i}\}}\rho(\tau,c)$. Thus,

$$\frac{\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{i+1,k}\cap P_{i,m_{i}}\rvert}\leq\frac{\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen}{\lvert T^{\ast}_{1,i}\cap R_{i,m_{i}}\rvert}.$$

Combining this inequality with Eqs. III and VI yields:

$$\begin{aligned}
&\frac{\beta 2^{m_{i}}}{2kn_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen<\frac{2^{m_{i}}}{(1-\frac{2k}{\beta})n_{i}}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen\\
\Leftrightarrow\ &\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen<\frac{2k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen.\end{aligned}\tag{VII}$$

We can now give the following bound, combining Eqs. V and VII, for each $i\in[k-1]$:

$$\begin{aligned}
\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen&=\sum_{j=1}^{m_{i}}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i,j},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<\sum_{j=1}^{m_{i}-1}\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap P_{i,j+1},\{c_{1},\dots,c_{i}\}\right\rparen\\
&\ \ \ +\frac{2k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap R_{i,m_{i}},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<\frac{4k}{\beta-2k}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap T_{i},\{c_{1},\dots,c_{i}\}\right\rparen.\end{aligned}\tag{VIII}$$

Here, the last inequality holds because $P_{i,2},\dots,P_{i,m_{i}}$ and $R_{i,m_{i}}$ are pairwise disjoint subsets of $T_{i}$.
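The coefficients $\frac{4k}{\beta-2k}$ and $\frac{2k}{\beta-2k}$ in Eqs. V and VII arise purely from dividing the cardinality bounds of Eqs. III, IV and VI. The following sketch double-checks this algebra with exact rational arithmetic; the concrete values of $k$, $\beta$, $n_{i}$ and $j$ are illustrative choices (only $\beta>2k$ matters), not values from the paper:

```python
from fractions import Fraction as F

# Illustrative values (not from the paper); only beta > 2k matters.
k, beta, n_i, j = 3, 20, 1024, 4

# Eq. III: |T*_{i+1,k} ∩ P_{i,j}|  <=  (2k/beta) * n_i / 2^j
upper = F(2 * k, beta) * F(n_i, 2 ** j)
# Eq. IV:  |T*_{1,i} ∩ P_{i,j+1}|  >=  (1 - 2k/beta) * n_i / 2^(j+1)
lower = (1 - F(2 * k, beta)) * F(n_i, 2 ** (j + 1))

# Dividing the two cardinality bounds gives the coefficient of Eq. V ...
assert upper / lower == F(4 * k, beta - 2 * k)
# ... and |R_{i,m_i}| > (1 - 2k/beta) * n_i / 2^(m_i) (Eq. VI, one halving
# less than Eq. IV) gives the coefficient of Eq. VII:
assert upper / (2 * lower) == F(2 * k, beta - 2 * k)
```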

Now, we plug this bound into Eq. II. Note that $T^{\ast}_{j}\cap T_{i}\subseteq T^{\ast}_{j}\cap T_{j}$ for each $i\in[k]$ and $j\in[i]$ by definition. We obtain:

$$\begin{aligned}
\operatorname{cost}\left\lparen T,\{c_{1},\dots,c_{k}\}\right\rparen&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i+1,k}\cap P_{i},\{c_{1},\dots,c_{i}\}\right\rparen\\
&<(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{1,i}\cap T_{i},\{c_{1},\dots,c_{i}\}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\sum_{t=1}^{i}\operatorname{cost}\left\lparen T^{\ast}_{t}\cap T_{i},c_{t}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k}{\beta-2k}\sum_{i=1}^{k-1}\sum_{t=1}^{i}\operatorname{cost}\left\lparen T^{\ast}_{t}\cap T_{t},c_{t}\right\rparen\\
&\leq(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen+\frac{4k^{2}}{\beta-2k}\sum_{i=1}^{k-1}\operatorname{cost}\left\lparen T^{\ast}_{i}\cap T_{i},c_{i}\right\rparen\\
&\leq\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\sum_{i=1}^{k}\operatorname{cost}\left\lparen T^{\ast}_{i},c^{\ast}_{i}\right\rparen=\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)\operatorname{cost}\left\lparen T,C^{\ast}\right\rparen.\end{aligned}$$

The last inequality follows from Eq. I. ∎
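To see how the pruning parameter $\beta$ controls the resulting guarantee $\left(1+\frac{4k^{2}}{\beta-2k}\right)(\alpha+\varepsilon)$, note that choosing $\beta=2k+4k^{2}/\varepsilon^{\prime}$ makes the first factor exactly $1+\varepsilon^{\prime}$. A small sketch with exact rational arithmetic; the values of $k$ and $\varepsilon^{\prime}$ are illustrative:

```python
from fractions import Fraction

def blowup(k, beta):
    # First factor (1 + 4k^2/(beta - 2k)) of the approximation guarantee.
    return 1 + Fraction(4 * k * k) / (beta - 2 * k)

# Illustrative choice: k = 3 clusters, target blow-up eps' = 1/2.
k = 3
eps_prime = Fraction(1, 2)
beta = 2 * k + Fraction(4 * k * k) / eps_prime  # beta = 78
assert blowup(k, beta) == 1 + eps_prime
# The factor is decreasing in beta, so a larger beta only helps:
assert blowup(k, 2 * beta) < blowup(k, beta)
```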

The following analysis of the worst-case running-time of Algorithm 4 is a slight adaptation of [2, Theorem 2.8], which is also provided for the sake of completeness.

Proof of Theorem 7.3.

For the sake of simplicity, we assume that $n$ is a power of $2$.

If $\kappa=0$, Algorithm 5 has running-time $c_{1}\in O(1)$, and if $\kappa\geq n$, it has running-time $c_{2}\cdot n\in O(n)$.

Let $T(n,\kappa,\beta,\delta,\varepsilon)$ denote the worst-case running-time of Algorithm 5 on an input set $T$ with $\lvert T\rvert=n$. If $n>\kappa\geq 1$, Algorithm 5 needs time at most $c_{3}\cdot(n\cdot T_{d}+n)\in O(n\cdot T_{d})$ to obtain $P$, time $T(n/2,\kappa,\beta,\delta,\varepsilon)$ for the recursive call in the pruning phase, time $T_{1}(n,\beta,\delta,\varepsilon)$ to obtain the candidates, time $C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)$ for the recursive calls in the candidate phase (one per candidate), and time $c_{4}\cdot n\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)\in O(n\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon))$ to eventually evaluate the candidate sets. Let $c=\max\{c_{1},c_{2},c_{3},c_{4}\}$. We obtain the following recurrence relation:

$$T(n,\kappa,\beta,\delta,\varepsilon)\leq\begin{cases}c&\text{if }\kappa=0\\cn&\text{if }\kappa\geq n\\C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)+T(n/2,\kappa,\beta,\delta,\varepsilon)&\\\quad+T_{1}(n,\beta,\delta,\varepsilon)+cn\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)&\text{else.}\end{cases}$$

Let $f(n,\beta,\delta,\varepsilon)=\frac{1}{cn}\cdot T_{1}(n,\beta,\delta,\varepsilon)+T_{d}\cdot C(n,\beta,\delta,\varepsilon)$.

We prove by induction on $n$ and $\kappa$ that $T(n,\kappa,\beta,\delta,\varepsilon)\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)$.

For κ=0\kappa=0 we have T(n,κ,β,δ,ε)ccnc40C(n,β,δ,ε)nf(n,β,δ,ε)T(n,\kappa,\beta,\delta,\varepsilon)\leq c\leq cn\leq c\cdot 4^{0}\cdot C(n,\beta,\delta,\varepsilon)\cdot n\cdot f(n,\beta,\delta,\varepsilon).

For $\kappa\geq n$ we have $T(n,\kappa,\beta,\delta,\varepsilon)\leq cn\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)$.

Now, let $n>\kappa\geq 1$ and assume the claim holds for $T(n^{\prime},\kappa^{\prime},\beta,\delta,\varepsilon)$ whenever $\kappa^{\prime}<\kappa$ and $n^{\prime}\leq n$, as well as whenever $\kappa^{\prime}=\kappa$ and $n^{\prime}<n$. We have:

$$\begin{aligned}
T(n,\kappa,\beta,\delta,\varepsilon)&\leq C(n,\beta,\delta,\varepsilon)\cdot T(n,\kappa-1,\beta,\delta,\varepsilon)+T(n/2,\kappa,\beta,\delta,\varepsilon)\\
&\ \ \ +T_{1}(n,\beta,\delta,\varepsilon)+cn\cdot T_{d}\cdot C(n,\beta,\delta,\varepsilon)\\
&\leq C(n,\beta,\delta,\varepsilon)\cdot c\cdot 4^{\kappa-1}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa}\cdot n\cdot f(n,\beta,\delta,\varepsilon)\\
&\ \ \ +c\cdot 4^{\kappa}\cdot C(n/2,\beta,\delta,\varepsilon)^{\kappa+1}\cdot\frac{n}{2}\cdot f(n/2,\beta,\delta,\varepsilon)\\
&\ \ \ +cn\cdot f(n,\beta,\delta,\varepsilon)\\
&\leq\left(\frac{1}{4}+\frac{1}{2}+\frac{1}{4^{\kappa}C(n,\beta,\delta,\varepsilon)^{\kappa+1}}\right)c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon)\\
&\leq c\cdot 4^{\kappa}\cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1}\cdot n\cdot f(n,\beta,\delta,\varepsilon).\end{aligned}$$

The last inequality holds because $\frac{1}{4^{\kappa}C(n,\beta,\delta,\varepsilon)^{\kappa+1}}\leq\frac{1}{4}$, since $\kappa\geq 1$ and $C(n,\beta,\delta,\varepsilon)\geq 1$, and the claim follows by induction. ∎
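As a sanity check on this induction, one can iterate the recurrence numerically and compare it against the claimed bound. The sketch below instantiates the recurrence with toy constants (a constant candidate-set size $C$, candidate-computation time $T_{1}(n)=n$, $T_{d}=2$ and $c=1$); these values are illustrative assumptions, not ones derived in the paper:

```python
from functools import lru_cache

# Toy instantiation of the recurrence (illustrative constants):
C, T_d, c = 3, 2, 1            # candidate-set size, distance cost, constant c
def T1(n): return n             # candidate-computation time
def f(n): return T1(n) / (c * n) + T_d * C

@lru_cache(maxsize=None)
def T(n, kappa):
    if kappa == 0:
        return c
    if kappa >= n:
        return c * n
    return (C * T(n, kappa - 1) + T(n // 2, kappa)
            + T1(n) + c * n * T_d * C)

# Claimed bound: T(n, kappa) <= c * 4^kappa * C^(kappa+1) * n * f(n)
for n in (2 ** i for i in range(1, 11)):      # n a power of two
    for kappa in range(6):
        assert T(n, kappa) <= c * 4 ** kappa * C ** (kappa + 1) * n * f(n)
```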

8 Conclusion

We have developed bicriteria approximation algorithms for $(k,\ell)$-median clustering of polygonal curves under the Fréchet distance. While it turned out to be relatively easy to obtain a good approximation with centers of up to $2\ell$ vertices in reasonable time, a way to obtain good approximate centers with up to $\ell$ vertices in reasonable time is not in sight. This is due to the continuous Fréchet distance: the vertices of a median need not be anywhere near a vertex of an input curve, resulting in a huge search space. If we cover the whole search space by, say, grids, the worst-case running-time of the resulting algorithms becomes dependent on the arc-lengths of the input curves' edges, which is not acceptable. We note that $g$-coverability of the continuous Fréchet distance would imply the existence of sublinear-size $\varepsilon$-coresets for $(k,\ell)$-center clustering of polygonal curves under the Fréchet distance. It is an interesting open question whether $g$-coverability holds for the continuous Fréchet distance. In contrast to the doubling dimension, which was shown to be infinite even for curves of bounded complexity [15], the VC-dimension of metric balls under the continuous Fréchet distance is bounded in terms of the complexities $\ell$ and $m$ of the curves [16]. Whether this bound can be combined with the framework of Feldman and Langberg [17] to achieve faster approximations for the $(k,\ell)$-median problem under the continuous Fréchet distance is another interesting open problem. The general relationship between the VC-dimension of range spaces derived from metric spaces and their doubling properties is a topic of ongoing research; see for example Huang et al. [21].

References

  • Abraham et al. [2003] C. Abraham, P. A. Cornillon, E. Matzner-Løber, and N. Molinari. Unsupervised curve clustering using b-splines. Scandinavian Journal of Statistics, 30(3):581–595, 2003.
  • Ackermann et al. [2010] Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms, 6(4):59:1–59:26, 2010.
  • Agarwal et al. [2002] Pankaj K. Agarwal, Sariel Har-Peled, Nabil H. Mustafa, and Yusu Wang. Near-linear time approximation algorithms for curve simplification. In Rolf Möhring and Rajeev Raman, editors, Algorithms - ESA, pages 29–41. Springer, 2002.
  • Alt and Godau [1995] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5:75–91, 1995.
  • Banerjee et al. [2005] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
  • Bansal et al. [2004] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
  • Ben-Hur et al. [2001] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125–137, 2001.
  • Buchin et al. [2008] Kevin Buchin, Maike Buchin, and Carola Wenk. Computing the Fréchet distance between simple polygons. Comput. Geom., 41(1-2):2–20, 2008.
  • Buchin et al. [2019a] Kevin Buchin, Anne Driemel, Joachim Gudmundsson, Michael Horton, Irina Kostitsyna, Maarten Löffler, and Martijn Struijs. Approximating (k, l)-center clustering for curves. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2922–2938, 2019a.
  • Buchin et al. [2019b] Kevin Buchin, Anne Driemel, Natasja van de L’Isle, and André Nusser. klcluster: Center-based clustering of trajectories. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 496–499, 2019b.
  • Buchin et al. [2020] Kevin Buchin, Anne Driemel, and Martijn Struijs. On the hardness of computing an average curve. In Susanne Albers, editor, 17th Scandinavian Symposium and Workshops on Algorithm Theory, volume 162 of LIPIcs, pages 19:1–19:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Chiou and Li [2007] Jeng-Min Chiou and Pai-Ling Li. Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):679–699, 2007.
  • Cilibrasi and Vitányi [2005] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.
  • Driemel and Har-Peled [2013] Anne Driemel and Sariel Har-Peled. Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM Journal on Computing, 42(5):1830–1866, 2013.
  • Driemel et al. [2016] Anne Driemel, Amer Krivosija, and Christian Sohler. Clustering time series under the Fréchet distance. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 766–785, 2016.
  • Driemel et al. [2019] Anne Driemel, Jeff M. Phillips, and Ioannis Psarros. The VC dimension of metric balls under Fréchet and Hausdorff distances. In 35th International Symposium on Computational Geometry, pages 28:1–28:16, 2019.
  • Feldman and Langberg [2011] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Lance Fortnow and Salil P. Vadhan, editors, Proceedings of the 43rd ACM Symposium on Theory of Computing, pages 569–578. ACM, 2011.
  • Garcia-Escudero and Gordaliza [2005] Luis Angel Garcia-Escudero and Alfonso Gordaliza. A proposal for robust curve clustering. Journal of Classification, 22(2):185–201, 2005.
  • Guha and Mishra [2016] Sudipto Guha and Nina Mishra. Clustering data streams. In Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi, editors, Data Stream Management - Processing High-Speed Data Streams, Data-Centric Systems and Applications, pages 169–187. Springer, 2016.
  • Har-Peled and Mazumdar [2004] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 291–300, 2004.
  • Huang et al. [2018] Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science, pages 814–825. IEEE Computer Society, 2018.
  • Imai and Iri [1988] Hiroshi Imai and Masao Iri. Polygonal Approximations of a Curve — Formulations and Algorithms. Machine Intelligence and Pattern Recognition, 6:71–86, January 1988.
  • Indyk [2000] Piotr Indyk. High-dimensional Computational Geometry. PhD thesis, Stanford University, CA, USA, 2000.
  • Johnson [1967] Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
  • Kumar et al. [2004] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε\varepsilon)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’04, page 454–462. IEEE Computer Society, 2004.
  • Meintrup et al. [2019] Stefan Meintrup, Alexander Munteanu, and Dennis Rohde. Random projections and sampling algorithms for clustering of high-dimensional polygonal curves. In Advances in Neural Information Processing Systems 32, pages 12807–12817, 2019.
  • Mitzenmacher and Upfal [2017] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, USA, 2nd edition, 2017.
  • Nath and Taylor [2020] Abhinandan Nath and Erin Taylor. k-median clustering under discrete Fréchet and Hausdorff distances. In Sergio Cabello and Danny Z. Chen, editors, 36th International Symposium on Computational Geometry, volume 164 of LIPIcs, pages 58:1–58:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • Petitjean and Gançarski [2012] François Petitjean and Pierre Gançarski. Summarizing a set of time series by averaging: From steiner sequence to compact multiple alignment. Theoretical Computer Science, 414(1):76 – 91, 2012.
  • Petitjean et al. [2011] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678 – 693, 2011.
  • Schaeffer [2007] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27 – 64, 2007.
  • Vidal [2011] René Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.