
Approximating the $p$-Mean Curve of Large Data-Sets

Sepideh Aghamolaei (Department of Computer Engineering, Sharif University of Technology, Tehran, Iran; aghamolaei@ce.sharif.edu)
Mohammad Ghodsi (Department of Computer Engineering, Sharif University of Technology, Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran; ghodsi@sharif.edu)
Abstract

A set of piecewise linear functions, called polylines, $P_1,\ldots,P_L$, each with at most $n$ vertices, can be simplified into a polyline $M$ with $k$ vertices, such that the Fréchet distances $\epsilon_1,\ldots,\epsilon_L$ to each of these polylines are minimized under the $L_p$ distance. We call $M$ for $L_p$ with $p\geq 1$ a $p$-mean curve ($p$-MC).

We discuss $p\geq 1$, for which the $L_p$ distance satisfies the triangle inequality; the $p$-mean has not been discussed before for most values of $p$. Computing the $p$-mean polyline is NP-hard for $L=\Omega(1)$ and some values of $p$, so we discuss approximation algorithms.

We give an $O(n^2\log k)$ time exact algorithm for $L=2$ and $p\geq 1$. Also, we reduce the Fréchet distance to the discrete Fréchet distance, which adds a factor $2$ to both $k$ and $\epsilon$. Then we use our exact algorithm to find a $3$-approximation for $L>2$ in $\operatorname{poly}(n,L)$ time. Our method is based on a generalization of the free-space diagram (FSD) for Fréchet distance and composable core-sets for approximate summaries.

1 Introduction

A polygonal curve is a sequence of points, e.g. GPS data such as vehicle tracks on a map, time series, movement patterns, or discretized borders of countries on a map. Trajectories appear in spatial databases and networks, geographic information systems (GIS), and any dataset with temporal labels for coordinates. Simplification is a method of reducing the size of the input trajectory, mostly to achieve reduced noise, optimize the storage space, or as a preprocessing step to improve the running time of later processing algorithms.

For large datasets and the computational models designed for them, such as streaming and divide and conquer, including massively parallel computations (MPC) [9], the MapReduce class (MRC) [28], and composable core-sets [27], there are few algorithms with good theoretical guarantees. Even on medium-sized datasets, existing algorithms take at least quadratic time for some similarity measures, and are therefore too slow to be useful in practice. Our main tools in achieving these goals are methods for partitioning data while keeping the theoretical guarantees, and relaxing the condition that the simplified curve be built from the points of the original curve.

The min-$k$ curve simplification problem finds a subcurve with the same start and end vertices and the minimum number of vertices with distance at most $\epsilon$ from the original curve. However, for a set of curves, the simplification errors are aggregated if the simplifications are computed independently. We focus on the simplification of a set of curves by finding a representative curve that is a good cluster representative for $L_p$-based clusterings and has almost $k$ vertices, assuming that the input curves have Fréchet distances at most $\epsilon$ from each other.

In a simplification algorithm, a shortcut is a segment that replaces a part of the curve starting and ending at vertices of that curve.

The Fréchet distance is the minimum length of the leash between a man walking on one curve from the start to the end, and his dog walking on the other curve from the start to the end, given that neither of them ever goes back. Deciding the Fréchet distance between two curves takes $O(n^2)$ time for curves with $O(n)$ vertices using the free space diagram (FSD) [7]. The Fréchet distance cannot be decided in $O(n^{2-\epsilon})$ time [10] or even approximated by a factor better than $3$ [17], for any $\epsilon>0$, unless SETH fails. For $L$ input curves with $O(n)$ vertices, assuming SETH is true, it is not possible to decide the Fréchet distance of the curves in $O(n^{L-\epsilon})$ time, for all $\epsilon>0$ [12].

1.1 Previous results.

Computing a representative curve is a well-studied problem [13, 36, 25, 19, 35, 6, 14, 15]. For similar (close) monotone trajectories with the same start and end vertices, Buchin et al. [13] presented algorithms for computing the median trajectory and the homotopic median trajectory.

A curve simplification where the points of the simplified curve should be a subset of the vertices of the input curve is called a discrete curve simplification. Discrete curve simplification under Fréchet distance is solvable in $O(n^2)$ time [26], and no algorithm with $O(n^{2-\epsilon})$ running time exists for all $\epsilon>0$, if SETH holds [12].

In the global min-$\epsilon$ simplification, $P'$ is a subsequence of $P$ with at most $k$ vertices that minimizes $d_F(P',P)$. The current best exact min-$\epsilon$ simplification algorithms for global Fréchet distance have cubic complexity [11].

A similar problem is the $k$-segment mean curve [32], where a monotone path is simplified into a possibly discontinuous $k$-piecewise linear function. Also, the problem of min-$k$ simplification with arbitrary points of the plane, where the distance is given and the goal is to minimize the number of vertices, has been discussed in [34].

Approximation algorithms with near-linear time exist for local simplification under Fréchet distance [3, 2], where only the error of each shortcut is taken into account. Global discrete curve simplification using Fréchet distance can be solved in $O(n^3)$ time [11]. If the $\forall\forall\exists$-OV conjecture holds, there is no algorithm for global simplification using Fréchet distance with running time $O(n^{3-\epsilon})$, for any $\epsilon>0$ [11].

The combination of the representative curve and curve simplification problems is the $(k,\ell)$-clustering problem, where a set of curves $\{P_i\}_{i=1}^{L}$ is clustered into $k$ clusters with centers $\{C_i\}_{i=1}^{k}$, such that the cost

\[ \sqrt[p]{\sum_{j=1}^{L}\min_{i}d(C_i,P_j)^{p}} \]

is minimized using curves with complexity $\ell$ as cluster representatives (centers).

Driemel et al. [21] proposed $(1+\epsilon)$-approximation algorithms for $k$-center and $k$-median clustering of curves in 1D and a $2$-approximation for any dimension, assuming the complexity of a center and $k$ are constant. Buchin et al. [14] presented an algorithm for computing the $k$-center of a set of curves under Fréchet distance, such that the complexity of the representative curves (centers) is fixed, and proved that it is NP-hard to find a polynomial-time approximation scheme (PTAS) for this problem. They also presented a $3$-approximation algorithm for this problem in the plane and a $6$-approximation for $d\geq 2$, and proved the lower bound $2.598$ for the discrete Fréchet distance in 2D if $P\neq NP$.

$p$-mean trajectories exist for $p\rightarrow\infty$ using $k$-center [14, 21], and for $p=1$ using $k$-median [21]. The computation of $p$-MC based on the Fréchet reparameterization has already been discussed and implemented for $p=1$ and $p=2$ [16]; however, such a computation can have complexity $O(n^L)$, which is infeasible for large datasets. For $p=1$, the problem is W[1]-hard using $L$ as the parameter [18].

1.2 Our results.

We call the objective function of $p$-MC the $L_p$-norm of the Fréchet distance. Since the root function is monotone for $p\geq 1$, which are the values that appear in the cost of $L_p$-based clustering problems, it is sufficient to minimize the $p$-th power of the Fréchet distance, $d_F(\cdot,\cdot)^p$. Note that while both $L_p$-norms and the Fréchet distance satisfy the triangle inequality, their combination does not. For example, for $p\rightarrow\infty$, the inequality $\sqrt[p]{d(a,c)^p}\leq\sqrt[p]{d(a,b)^p+d(b,c)^p}$ becomes

\[ d(a,c)\leq\max(d(a,b),d(b,c)), \]

which does not always hold.
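As a concrete instance of the failure, the following minimal sketch uses three points on the real line with the absolute difference as the metric $d$. The points and the choice of metric are illustrative assumptions, not taken from the paper. The ordinary triangle inequality holds for them, while the max-form inequality does not.

```python
def d(x: float, y: float) -> float:
    """Absolute difference, a metric on the real line (illustrative choice)."""
    return abs(x - y)


def satisfies_triangle(a: float, b: float, c: float) -> bool:
    """Ordinary triangle inequality: d(a, c) <= d(a, b) + d(b, c)."""
    return d(a, c) <= d(a, b) + d(b, c)


def satisfies_max_form(a: float, b: float, c: float) -> bool:
    """Limit form for p -> infinity: d(a, c) <= max(d(a, b), d(b, c))."""
    return d(a, c) <= max(d(a, b), d(b, c))


# Hypothetical example points: b lies halfway between a and c.
a, b, c = 0.0, 1.0, 2.0
triangle_ok = satisfies_triangle(a, b, c)  # compares 2 with 1 + 1
max_form_ok = satisfies_max_form(a, b, c)  # compares 2 with max(1, 1)
```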

Given the reparameterizations of the input curves that give the optimal $p$-MC, the problem of finding the $p$-MC can be solved by reducing it to the point version of the problem, where a set of points is given and the goal is to find a point that minimizes the $\ell_p$-norm of distances from itself to the rest of the points. However, since the Fréchet distance only cares about the maximum distance between the points, only the points whose matching gives the maximum distance need to be considered. These are the Fréchet events.

Based on this observation, we give the following new results, and define new concepts that explain some of the reasons behind the good performances of existing algorithms and heuristics:

  • We consider the $p$-MC for most values of $p$ and give approximation algorithms for them. Table 1 summarizes the results on $p$-mean curves of $L$ curves.

  • We give a divide and conquer algorithm for computing the representative curve. The parallel implementation of our algorithm has time complexity independent of $L$.

  • We give a new simplification algorithm which can simplify an input curve with $O(\epsilon)$ error and $O(k)$ vertices, where $k$ is the length of the optimal simplification. It is based on a reduction to the discrete case.

| $p$-Mean | Time | $\epsilon$ | $k$ | Reference |
|---|---|---|---|---|
| Continuous $p$-Mean: | | | | |
| $p\rightarrow\infty$ | $\operatorname{poly}(n,k,L)$ | $\geq 2.25$ | $k$ | Lower bound [14] |
| $p=1$ | $O(\operatorname{poly}(n,L))$ | $>1$ | $n$ | Lower bound [15] |
| $p\geq 1$ | $O(Ln^{5L}\log n)$ | $3$ | $k$ | Algorithm 1 |
| $p\geq 1$ | $O(L^2n^2\log n+\operatorname{G}(L,k))$ | $2$ | $2k$ | Algorithm 2 |
| $p\geq 1$ | $O(L\operatorname{T}(n)+\operatorname{G}(L,k))^{\dagger}$ | $2\alpha+1$ | $k$ | Algorithm 4 |
| Discrete $p$-Mean: | | | | |
| $p\rightarrow\infty$ | $O(kLn\log n+n^2)$ | $6$ | $k$ | [14] |
| $p\geq 1$ | $O(L^2n^2\log n+\operatorname{G}(L,k))^{\dagger}$ | $3$ | $k$ | Algorithm 3 |

Table 1: Results on $p$-mean curve in $\mathbb{R}^2$. $\operatorname{T}(n)$ is the complexity of an $\alpha$-approximation simplification algorithm and $\operatorname{G}(L,n)$ is the complexity of a $p$-MC algorithm with $k$ vertices for $L$ curves with $n$ vertices.
The results marked with $\dagger$ are for one recursion; they can be run on larger inputs by recursively calling themselves at the cost of increasing the approximation factor.

2 Preliminaries

A polygonal curve $P$ is a sequence of points $\{P[i]\}_{i=1}^{n}$ and the segments connecting each point $P[i]$ to its next point in the sequence, $P[i+1]$, for $i=1,\ldots,n-1$.

The Fréchet distance of two curves $P,Q:[0,1]\rightarrow\mathbb{R}^2$ is defined as

\[ d_F(P,Q)=\inf_{\alpha,\beta}\max_{t\in[0,1]}d(P(\alpha(t)),Q(\beta(t))), \]

where $\alpha$ and $\beta$ are reparameterizations, i.e., continuous, non-decreasing bijections from $[0,1]$ to $[0,1]$, and $d(\cdot,\cdot)$ is a point metric.
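The definition above is for the continuous Fréchet distance. As a hedged illustration of the discrete variant mentioned in the abstract, where only vertices are matched, the following sketch uses the standard dynamic program over vertex pairs. The function name `discrete_frechet` and the use of the Euclidean metric are assumptions made for this example; this is not the algorithm of the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def discrete_frechet(P: Sequence[Point], Q: Sequence[Point]) -> float:
    """Discrete Fréchet distance between two vertex sequences (Euclidean metric)."""
    n, m = len(P), len(Q)
    if n == 0 or m == 0:
        raise ValueError("both curves need at least one vertex")
    # dp[i][j] = discrete Fréchet distance between P[0..i] and Q[0..j].
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            dist = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                best_prev = 0.0
            elif i == 0:
                best_prev = dp[0][j - 1]      # only Q advances
            elif j == 0:
                best_prev = dp[i - 1][0]      # only P advances
            else:
                # Advance on P, on Q, or on both; never move backwards.
                best_prev = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
            dp[i][j] = max(dist, best_prev)
    return dp[n - 1][m - 1]
```

The table has one entry per vertex pair, so the sketch runs in $O(nm)$ time and space for curves with $n$ and $m$ vertices.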

In the Fréchet distance of a set of $L$ curves, $d$ is the diameter of the mapped points from the $L$ input curves and can be computed in $O(Ln^2\log n)$ time [23]. The Fréchet distance of $L$ curves is the diameter of their minimum enclosing ball or the $1$-center of the curves using Fréchet distance. Using the triangle inequality, the Fréchet distance of the curves is at most twice the distance from the $1$-center to the farthest curve.

The free-space diagram (FSD) [7] between two polygonal curves $P:[0,1]\rightarrow\mathbb{R}^2$, $Q:[0,1]\rightarrow\mathbb{R}^2$ for a constant error $\epsilon>0$ is a 2D region in the joint parameter space of those curves, where each dimension is an arc-length parameterization of one of the curves, and the free space (FS) is the set of all points that are within distance $\epsilon$ of each other: $D_\epsilon(P,Q)=\{(\alpha,\beta)\in[0,1]^2\mid d(P(\alpha),Q(\beta))\leq\epsilon\}$; the rest of the points are non-free. Therefore, each point of the FSD defines a mapping/correspondence between a point on $P$ and a point on $Q$. The Fréchet distance between two curves is at most $\epsilon$ iff there is an $\alpha\beta$-monotone path in the free space diagram from $(0,0)$ to $(1,1)$. In figures, the free space is usually shown in white, and the non-free regions are shown in gray. The orthogonal lines drawn from the vertices of the input curves build a grid (FSD grid), whose cells are called the FSD cells.
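To make the free space on a grid line concrete, the sketch below computes the free interval on one FSD cell boundary: the parameters $t\in[0,1]$ for which the point $a+t(b-a)$ on a segment of one curve is within distance $\epsilon$ of a fixed vertex $q$ of the other curve. The function name `free_interval` and the parameterization of the segment by $t\in[0,1]$ are assumptions of this example.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def free_interval(q: Point, a: Point, b: Point, eps: float) -> Optional[Tuple[float, float]]:
    """Interval of t in [0, 1] with |a + t(b - a) - q| <= eps, or None if empty."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    fx, fy = a[0] - q[0], a[1] - q[1]
    # |a + t(b - a) - q|^2 <= eps^2 is the quadratic A t^2 + B t + C <= 0.
    A = dx * dx + dy * dy
    B = 2.0 * (dx * fx + dy * fy)
    C = fx * fx + fy * fy - eps * eps
    if A == 0.0:
        # Degenerate segment (a == b): either every t is free or none is.
        return (0.0, 1.0) if C <= 0.0 else None
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return None  # the supporting line misses the disc around q
    root = math.sqrt(disc)
    lo = max(0.0, (-B - root) / (2.0 * A))
    hi = min(1.0, (-B + root) / (2.0 * A))
    return (lo, hi) if lo <= hi else None
```

For example, with `q = (0.5, 0.0)` on the segment from `(0, 0)` to `(1, 0)` and `eps = 0.25`, the free interval is `(0.25, 0.75)`.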

A special transformed FSD called the deformed FSD was already defined for a variation of the Fréchet distance called the backward Fréchet distance, assuming the edges of input curves have weights [24].

A curve $P$ is $c$-packed [20] if the total arc length of $P$ inside any ball of radius $r$ is at most $c\cdot r$. The time complexity of computing a $(1+\epsilon)$-approximation of the Fréchet distance between $c$-packed curves is $O(\frac{cn}{\epsilon}+cn\log n)$ [20]. $c$-packed curves also have the property that for a given $\epsilon>0$, the complexity of the RS is within a constant factor of the complexity of the RS for $\alpha\cdot\epsilon$, for any $\alpha>0$. The value $c$ of a $c$-packed curve can be approximated within factor $2+\epsilon$, for any $\epsilon>0$, in $O(\frac{\tau}{a})$ time, where $\tau$ is the length of the curve and $a$ is the distance between the closest points on the curve [4].

$L_p$-based clustering problems are clusterings with cost equal to the $L_p$-norm of the distances between the points and their corresponding centers. For a real number $p\geq 1$, the $L_p$ cost of a set of points $T_i=(x_i,y_i)\in\mathbb{R}^2$, $i=1,\cdots,L$, is defined as

\[ \min_{T'\in\mathbb{R}^2}\sqrt[p]{\sum_{i=1}^{L}|T'-T_i|^{p}}. \]

$T'$ is also the cluster center of $\{T_i\}_{i=1}^{L}$ in an $L_p$-based clustering. There is a $(1+\epsilon)$-approximation algorithm using linear programming for computing the $L_p$ cost for fixed $p$ [30]. For special cases such as $p=1$, $p=2$ and $p\rightarrow\infty$, explicit mathematical formulas for $x$ exist which can be used to compute $x$ in linear time. Constant factor approximations for $L_p$-based clustering also exist [8].
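As a small illustration of the $L_p$ cost, the sketch below evaluates the objective for a candidate center and uses the centroid as the center for $p=2$, where it is the well-known minimizer of the sum of squared Euclidean distances. The function names `lp_cost` and `centroid` are assumptions of this example; the other values of $p$ are not covered by the closed form shown here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def lp_cost(center: Point, points: Sequence[Point], p: float) -> float:
    """p-th root of the sum of p-th powers of Euclidean distances to `center`."""
    return sum(math.dist(center, t) ** p for t in points) ** (1.0 / p)


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean, which minimizes the cost for p = 2."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Hypothetical input: three points in the plane.
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(pts)
cost_at_centroid = lp_cost(c, pts, 2)
```

The centroid is computed in one pass over the points, which matches the linear-time claim for $p=2$.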

Given a curve $Q$ and a set of curves $\{P_i\}_{i=1}^{L}$, the $L_p$-norm of the Fréchet distances is defined as

\[ \sqrt[p]{\sum_{i=1}^{L}d_F(Q,P_i)^{p}}. \]
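The aggregation step in this definition can be sketched as follows, assuming the Fréchet distances $d_F(Q,P_i)$ have already been computed by some other routine. The function name `lp_norm_of_distances` is an assumption of this example; passing `math.inf` as `p` returns the maximum, which corresponds to the $p\rightarrow\infty$ ($1$-center) objective.

```python
import math
from typing import Iterable


def lp_norm_of_distances(distances: Iterable[float], p: float) -> float:
    """L_p-norm of precomputed, non-negative distances, for p >= 1 or p = inf."""
    ds = list(distances)
    if any(x < 0 for x in ds):
        raise ValueError("distances must be non-negative")
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(ds, default=0.0)  # limit of the L_p-norm as p grows
    return sum(x ** p for x in ds) ** (1.0 / p)
```

For the distances `[3.0, 4.0]` this gives `7.0` for `p = 1`, `5.0` for `p = 2`, and `4.0` for `p = math.inf`.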

The $L_p$-norm of the Fréchet distances may not satisfy the triangle inequality for different curves $Q$.

Based on this definition, given the optimal mapping between the points of the curves, it is possible to compute the corresponding curve by finding the center of the mapped points for every pair of points from the curves by solving the $L_p$ cost optimization.

Finding the minimum-link (min-link) path in a polygonal domain asks for an $s$-$t$ path (a path from $s$ to $t$) such that the number of bends is minimized and the path lies inside the polygon and does not go through a set of polygonal holes. This problem can be solved in $O(E_{VG}\alpha^2(n)\log n)$ time, where $E_{VG}$ is the number of edges in the visibility graph of the polygon [31, 33].

3 $p$-MC of two polylines

3.1 A certificate for $k$-simplification in FSD

Certificate for Fréchet distance

Given two curves $P,Q$ and a constant $\epsilon>0$, consider a set of certificates $C_{(i,j,p,q,p',q')}$ that indicate whether there is a $P$-monotone $Q$-monotone mapping of cost at most $\epsilon$ between the points of the $i$-th edge of the first curve $P$ and the $j$-th edge of the second curve $Q$, where $p$ is mapped to $q$ at the beginning of the mapping and $p'$ is mapped to $q'$ at the end of this mapping. Then, the Fréchet distance of $P$ and $Q$ is at most $\epsilon$ if a sequence of certificates $\pi_1,\ldots,\pi_k$ exists where the first points $(p,q)$ of the first certificate $\pi_1$ are the start vertices of the curves, namely $P_1,Q_1$, and the end points of the last certificate $\pi_k$ are the last vertices of the curves, namely $P_{|P|},Q_{|Q|}$. Formally,

\[
\begin{aligned}
F_{(\epsilon,P,Q)}=\{\;&\exists\{\pi_i\}_{i=1}^{k}\in[|P|]\times[|Q|]\times(P\times Q)^2:\\
&q(\pi_i)=p(\pi_{i+1}),\;q'(\pi_i)=p'(\pi_{i+1}),\;\prod_{i=1}^{k}C_{\pi_i}=\text{true}\;\},
\end{aligned}
\]

where $[n]$ denotes the set $\{1,\ldots,n\}$, $f(t)$ denotes the member $f$ of the tuple $t$, and $C_{\pi_i}$ is the indicator variable for the Fréchet distance of two line segments.

Certificate for path of length $k$

A certificate for a path of length $k$ between two points $p$ and $q$, with a restriction on the feasibility of edges, can be defined by the recurrence relation

\[
\begin{aligned}
F(p,q,k)&=\bigvee_{r\in\mathbb{R}^2,\;1\leq i<k}\bigl(F(p,r,i)\wedge F(r,q,k-i)\bigr),\\
F(p,q,1)&=\begin{cases}\text{true}&\text{if there is a feasible edge $(p,q)$}\\ \text{false}&\text{otherwise.}\end{cases}
\end{aligned}
\]

An example of this problem is the shortest path in a polygon with holes, where the feasibility constraint for the validity of a segment is that it does not intersect a hole.
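The recurrence can be evaluated directly once the candidate intermediate points are restricted to a finite set. The sketch below is a minimal illustration under that assumption: feasible edges are given as a hypothetical adjacency dictionary `edges`, and the split is always taken at $i=1$, which is enough to reach every length. It is not the construction used in the paper, where the intermediate point ranges over the plane.

```python
from functools import lru_cache
from typing import Dict, Hashable, Set


def has_path_of_length(edges: Dict[Hashable, Set[Hashable]],
                       p: Hashable, q: Hashable, k: int) -> bool:
    """True iff some path from p to q uses exactly k feasible edges."""
    if k < 1:
        raise ValueError("k must be at least 1")

    @lru_cache(maxsize=None)
    def F(u: Hashable, v: Hashable, length: int) -> bool:
        if length == 1:
            # Base case: a single feasible edge (u, v).
            return v in edges.get(u, set())
        # Split at i = 1: F(u, r, 1) holds exactly for the neighbors r of u.
        return any(F(r, v, length - 1) for r in edges.get(u, set()))

    return F(p, q, k)
```

For instance, with `edges = {"s": {"a", "b"}, "a": {"t"}, "b": {"c"}, "c": {"t"}}`, the call `has_path_of_length(edges, "s", "t", 2)` holds through `a`, while length `1` has no certificate.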

Certificate for $k$-simplification under Fréchet distance

Similarly, a certificate can be defined for the existence of a path of length $k$ and Fréchet distance at most $\epsilon$. Let $F_k$ be the certificate of existence of a path of length $k$. Then, the certificate for a simplification of length $k$ is given by $C_{\pi_i}\wedge F(p_1,p_n,k)$. In Section 3.2, we discuss how to build a diagram and define the certificate for $k$-simplification using the Fréchet distance on it.

Some previous works [11, 34] assume there is one interval on each edge and consider the first point of the interval [11], or the whole interval [34]. As shown in Figure 1, this is not always the case.

Figure 1: Two intervals (bold lines) on the same edge for valid reparameterizations of the curves $P$ and $Q$ for $2$-simplification under Fréchet distance.

3.2 Normalized free-space diagram

Scaling the axes of the free-space diagram by a constant has already been discussed [24]. In Lemma 1 we show a more general transformation works.

Lemma 1.

For a set of transformations $f_i:[0,1]\rightarrow[0,1]$, the free-space diagram with the free-space $\{(t_1,\ldots,t_L)\mid d(P_1(f_1(t_1)),P_2(f_2(t_2)),\ldots,P_L(f_L(t_L)))\leq\epsilon\}$ has the same set of feasible reparameterizations, if each $f_i$ is a one-to-one non-decreasing function.

Proof.

Substituting $t'_i=f_i(t_i)$ gives:

\[ (P_1(f_1(t_1)),P_2(f_2(t_2)),\ldots,P_L(f_L(t_L)))=(P_1(t'_1),P_2(t'_2),\ldots,P_L(t'_L)). \]

The function is one-to-one, so $t_i=f_i^{-1}(t'_i)$. Since $t'_i:[0,1]\rightarrow[0,1]$ is non-decreasing, it is still a reparameterization of $P_i$, and the set of reparameterizations remains the same. ∎

We introduce a scaled FSD called the normalized FSD (NFSD), which changes the representation of the curves in the FSD such that a path in the FSD corresponds to a segment in the original space if the derivatives of any point on the curve with respect to each of the FSD axes (input curves) are the same. We formalize this in Lemma 2, where the length of each segment from the input curves is divided by $\sqrt{m_{P_i}^2+1}$, and $m_{P_i}$ is the slope of the segment. To handle negative slopes as well, we add at most $n$ points on each of the edges of FSD cells that correspond to the intersection of the extensions of the shortcuts through previous vertices with the corresponding segment of that FSD edge in the Euclidean plane. This is formally explained in Lemma 3. In the rest of the paper, when we use FSD, we mean the normalized free-space diagram.
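The scaling described above can be sketched for a single polyline as follows. The function name `normalized_segment_lengths` is an assumption of this example, and vertical segments, whose slope is undefined, are rejected here because the sketch does not model how they would be treated. Dividing the Euclidean length $\sqrt{dx^2+dy^2}$ by $\sqrt{m^2+1}$ with $m=dy/dx$ leaves $|dx|$, so each segment occupies its horizontal extent on the normalized axis.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalized_segment_lengths(polyline: Sequence[Point]) -> List[float]:
    """Length of each segment divided by sqrt(m^2 + 1), m being its slope."""
    scaled = []
    for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0:
            # Assumption of this sketch: vertical segments are not handled.
            raise ValueError("vertical segment: slope is undefined")
        m = dy / dx
        length = math.hypot(dx, dy)
        scaled.append(length / math.sqrt(m * m + 1.0))
    return scaled
```

For the polyline `[(0, 0), (3, 4), (5, 4)]` the Euclidean segment lengths are `5` and `2`, and the normalized lengths are `3` and `2`.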

Lemma 2.

A segment in the normalized FSD corresponds to a segment in the original space (Euclidean plane), if the slopes of the edges of each input curve have the same sign, i.e. all positive slopes or all negative.

Proof.

Assume $P$ and $Q$ are two input curves, and we want to find a condition on a curve $R$ in the parameter space of $P,Q$ that guarantees it will correspond to a polygonal curve in the Euclidean plane.

Choose an arbitrary segment from each of these curves. We want to change the mapping of the points of $P$ and $Q$ to the axes of the FSD to keep the slope of $R$ constant along different segments. Let $\ell_P$ be the arc length of the curve $P$ measured from its start vertex to the current point. So, the domain of $\ell_P$ is $[0,\operatorname{Len}(P)]$, where $\operatorname{Len}(P)=\int_0^1\lVert\tfrac{dP}{dt}(t)\rVert\,dt$ is the length of curve $P$. Similarly, define $\ell_Q$ and $\ell_R$. Figure 2 shows unit vectors in the direction of $\ell_P,\ell_Q,\ell_R$ for a segment of $P$, respectively.

Figure 2: Unit vectors in the Euclidean plane (original space) on the left, and in the FSD (parameter space) on the right.

For each segment of the curves, we define a reparameterization. Let $P'(x)=m_P\cdot x+p_1$, where $x\in[x_1,x_2]$, be a segment of curve $P$ with slope $m_P$ and $y$-intercept $p_1$. The reformulation of $P'$ in terms of $\ell_P$ is

\[ P'(\ell_P)=p_1+\frac{m_P}{\sqrt{m_P^2+1}}\,\ell_P,\quad\text{where }\ell_P\in[0,\operatorname{Len}(P)], \]

which follows from the derivatives of the length variables:

\[ d\ell_P=\sqrt{(dx)^2+(dy)^2},\quad\frac{dP'}{dx}=m_P\;\Leftrightarrow\;d\ell_P=\sqrt{(dx)^2+m_P^2(dx)^2}=\sqrt{1+m_P^2}\,dx\;\Leftrightarrow\;\frac{dP'}{d\ell_P}=\frac{dP'(x)}{dx}\frac{dx}{d\ell_P}=\frac{m_P}{\sqrt{m_P^2+1}}. \]

Similarly, we reparameterize a segment $Q'$ of curve $Q$ in terms of its length variable $\ell_Q$. The axes of the normalized FSD are $P$ and $Q$. So we need to compute the slope of the line segment $R'$ from curve $R$ in terms of $\ell_P$ and $\ell_Q$:

\[ \frac{dR'}{d\ell_P}=\frac{dR'}{dx}\times\frac{dx}{d\ell_P}=\frac{dR'}{dx}\frac{1}{\sqrt{m_P^2+1}}, \]

and the equation for $\ell_Q$ is similar. This means that scaling each segment of the curve $P$ by a factor $\frac{1}{\sqrt{m_P^2+1}}$ preserves the slope of $R'$ in terms of $\ell'_P$ in the Euclidean plane.

Based on Lemma 1, if a transformation converts the ellipse that is the free-space inside a cell into a degenerate ellipse, it does not work anymore. We show how to handle these cases so that they still preserve the properties. If the original free-space is a degenerate case, the transformation only changes the slope of the lines.

After scaling parts of the axes of the FSD, the slope of segment $R$ will not change if the slopes of the segments of the input curves have the same sign. If this is not the case, use another segment from one of the curves to compute the reparameterization, and then map the points accordingly. This is possible if at least one of the edges of one of the curves has a different slope; otherwise all the points on each of the curves are collinear. In that case, the original free-space is a degenerate ellipse, i.e. linear functions, for which the scaling by a constant factor as described in this lemma works.

Without the transformation described in Lemma 2, a polyline in the FSD still represents a polyline in the Euclidean space; however, the number of vertices can be different.

To add the signs of the slopes of the shortcut segments, it is enough to add them to the boundaries of the cells, and instead of computing the minimum-link path, compute the unweighted shortest path with these vertices in addition to the intersections of the free-space with the FSD grid (cell boundaries). The edges have weight $0$ when the slope of the last segment is equal to the slope of the next segment, otherwise the edges have weight $1$. Because only non-decreasing paths from the start to the end vertex correspond to valid reparameterizations, direct the edges in the order of increasing index. By computing the shortest path in the resulting DAG, the minimum complexity polyline in NFSD is computed.
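The last step, a shortest path in a DAG whose edges go from lower to higher index, can be sketched as below. This is a simplification under stated assumptions: the $0/1$ weights are supplied directly on the edges, whereas in the text the weight depends on whether the slope changes between consecutive segments, which would require tracking the incoming slope as part of the state. The function name `dag_shortest_path` and the integer vertex labels are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def dag_shortest_path(n: int, edges: List[Tuple[int, int, int]],
                      source: int, target: int) -> float:
    """Shortest path weight in a DAG on vertices 0..n-1 whose edges satisfy u < v."""
    out: Dict[int, List[Tuple[int, int]]] = {u: [] for u in range(n)}
    for u, v, w in edges:
        if not u < v:
            raise ValueError("edges must go from a lower to a higher index")
        out[u].append((v, w))
    dist = [math.inf] * n
    dist[source] = 0
    # Increasing index order is a topological order, so one pass suffices.
    for u in range(n):
        if dist[u] == math.inf:
            continue  # u is not reachable from the source
        for v, w in out[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist[target]
```

Each edge is relaxed once, so the running time is linear in the size of the graph; an unreachable target yields `math.inf`.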

Let $S_{i,j}(P,Q)$ be the set of intersection points between the shortcuts $\overline{p_1p_t}$ for $t=1,\ldots,n$ and the segment $\overline{q_jq_{j+1}}$, for two polylines $P=(p_1,\ldots,p_n)$ and $Q=(q_1,\ldots,q_n)$. Build the graph $H=(V,E)$ with vertices $V=V_G\cup\bigcup_{i,j}S_{i,j}(P,Q)$, where $V_G$ is the set of vertices in the FSD, i.e. the intersection points of the free-space with the grid lines of the FSD, which are the points that map a vertex of one curve to each point on the other curve. The edges $E$ connect the vertices with a segment with non-negative slope between them that lies completely in the free-space, with an edge of weight $1$ in the graph.

Based on the definition, three cases for the simplification can be solved using $H$, depending on the subset we choose the vertices of the simplification from:

  • Vertices of the input curves: use the vertex for the grid line containing the edge of the cell containing that point. Remember that each grid line is a vertex of an input polyline in all FSDs.

  • Vertices of one input curve (curve $P$): The shortest path must be computed using only the vertices of $V_G$ and the edges between the vertices of $V$ or with a subset of $\bigcup_{i,j}S_{i,j}$ only if the incoming and outgoing edges have the same slope after mapping to the Euclidean plane.

  • Any point of the Euclidean plane: It is enough to map the vertices of $V$ to the Euclidean plane.

Lemma 3 shows the simplification can be computed using the shortest path on $H$, or a subgraph of it.

Lemma 3.

The shortest path in $H$ from the start vertex to the end vertex gives the minimum complexity path in the Euclidean plane.

Proof.

Based on Lemma 2, the edges of $E$ that connect two points where the slope of segments between them does not change, i.e. the part of the curve between them is monotone, map to a single segment in the Euclidean plane.

Consider a reparameterization that gives the optimal simplification. For each vertex that is shortcutted in this simplification, there is a point on the boundary of the cell intersecting that path in NFSD. Since $V=V_G\cup(\bigcup_{i,j}S_{i,j}(P,Q))$ contains all such intersections, its vertices are a subset of $V$. So, for every optimal simplification there is a path in NFSD.

To show every shortest path in NFSD gives an optimal simplification, for each vertex of the shortest path between two edges with different slopes in the Euclidean plane, choose a vertex from $P$ or $Q$ depending on the boundary edge that contained that point.

If we only want the vertices of $P$ to exist in the output solution, the shortest path must be computed using only the vertices of $V_G$ and the edges between the vertices of $V$ or with a subset of $\bigcup_{i,j}S_{i,j}$ only if the incoming and outgoing edges have the same slope after mapping to the Euclidean plane.

Based on the definition of the edges, the slope is preserved in each edge. So, each edge in NFSD is equivalent to a segment in the Euclidean plane. This means the weight of the curve in the NFSD is equal to the complexity of the polyline in the Euclidean plane.

Adding $n$ points on each boundary edge increases the complexity of NFSD to $O(n^3)$. So, the simplification that minimizes the Fréchet distance between the simplified and original curves can be computed in $O(n^3\log n)$ time and in $O(n^2\log n)$ time for monotone curves. For monotone curves, all the slopes have the same sign, so we do not need to add points on the boundaries of the cells.

Note that knowing only the slopes of the lines is not enough and the mapped length of the curve is also needed to define a $k$-simplification. More specifically, there can be $\binom{n}{k}$ partial $k$-simplifications of a single polyline that end at the same edge, which can result in distinct optimal matchings. In NFSD, this is equivalent to having multiple monotone shortest paths between $(0,0)$ and $(1,1)$. In a diagram with holes, the intervals on the edges that represent these partial solutions might not be continuous. In NFSDs/FSDs, the free-space acts as a certificate for valid partial matchings for Fréchet distance; however, the certificates for $k$-simplifications are only covered by NFSDs.

3.3 Exact $p$-mean curve

In Lemma 4, we show the reparameterization that gives the $p$-mean curve of two curves is the one that gives the Fréchet distance between them.

Lemma 4.

The Fréchet reparameterization of curves $P$ and $Q$ minimizes the distances to the $p$-MC of $P$ and $Q$.

Proof.

Let $s\in P$ and $q\in Q$ be a pair of points mapped to each other in the Fréchet mapping between $P$ and $Q$. Let $m$ be the point on $\overrightarrow{sq}$ which lies on the $p$-MC of $P,Q$. The goal is to minimize the cost of $p$-MC for these points:

\[ \min_m|sm|^p+|mq|^p=\min_m|sm|^p+(|sq|-|sm|)^p. \]

Then, we take the derivative of the above cost expression $|sm|^p+(|sq|-|sm|)^p$ with respect to $|sm|$:

\[ p|sm|^{p-1}-p(|sq|-|sm|)^{p-1}=0\Leftrightarrow|sm|^{p-1}=(|sq|-|sm|)^{p-1}\Rightarrow|sm|=|sq|/2. \]

This is a minimum of the function, since for $|sm|>|sq|/2$ the derivative is positive and for smaller values it is negative. Substituting this value in the cost expression gives $2^{1-p}|sq|^p$. This means that the minimum of $|sq|$ also minimizes the cost expression. The maximum of $|sq|$ for all pairs $s,q$ is the maximum distance in the reparameterizations of $P,Q$ that realizes the Fréchet distance.
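The minimization in this proof can be checked numerically. The sketch below evaluates the cost $x^p+(d-x)^p$ on a grid, with $x=|sm|$ and $d=|sq|$, and compares the grid minimizer and its value with $d/2$ and $2^{1-p}d^p$. The function names, the grid resolution, and the sample values of $d$ and $p$ are assumptions of this example. For $p=1$ the cost is constant on the segment, so the midpoint is only one of many minimizers; the check therefore uses $p>1$.

```python
def pair_cost(x: float, d: float, p: float) -> float:
    """Cost |sm|^p + |mq|^p for a point m on segment sq with |sm| = x, |sq| = d."""
    return x ** p + (d - x) ** p


def grid_argmin(d: float, p: float, steps: int = 1000) -> float:
    """Grid point in [0, d] with the smallest pair_cost (brute-force check)."""
    best = min(range(steps + 1), key=lambda i: pair_cost(d * i / steps, d, p))
    return d * best / steps


# Hypothetical sample values: |sq| = 4 and a few exponents p > 1.
d_sq = 4.0
minimizers = {p: grid_argmin(d_sq, p) for p in (1.5, 2.0, 3.0)}
```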

Based on [23], the higher dimensional FSDs (for more than two curves) can be constructed by building the FSDs of pairs of curves, extending them in the direction of the axes corresponding to the rest of the curves, and taking their intersection. Based on Lemma 1, NFSD represents the same set of reparameterizations as FSD. An $(\epsilon_1,\ldots,\epsilon_L)$-NFSD is the $L$-dimensional NFSD in which the NFSD for each pair $P_i,P_j$ of the input curves uses distance $\epsilon_i$ from $P_i$ and distance $\epsilon_j$ from $P_j$ as the distance (to define the free-space).

Lemma 5.

Given a set of non-negative constants $\epsilon_1,\ldots,\epsilon_L$ and a set of curves $P_1,\ldots,P_L$, the shortest path in the $\overrightarrow{\epsilon}$-NFSD of $P_1,\ldots,P_L$ gives the minimum-link path in the Euclidean plane with distance at most $\epsilon_i$ from $P_i$, for each $i=1,\ldots,L$. Assume $\overrightarrow{\epsilon}=(\epsilon_1,\ldots,\epsilon_L)$ and $\lVert\overrightarrow{\epsilon}\rVert_p=\eta$.

Proof.

A point $p$ in the free-space inside each cell of an $\overrightarrow{\epsilon}$-NFSD for $L$ curves (not to be confused with the exponent $p$ of the norm) satisfies:

\[\forall i=1,\ldots,L\;:\;\lVert P_{i}(t_{i})-p\rVert_{2}\leq\epsilon_{i}.\]

So,

\[\lVert\overrightarrow{\epsilon}\rVert_{p}=\eta\Leftrightarrow\sum_{i=1}^{L}\epsilon_{i}^{p}=\eta^{p}\Rightarrow\sum_{i=1}^{L}\lVert P_{i}(t_{i})-p\rVert_{2}^{p}\leq\eta^{p}.\]

These are a set of $L$ ellipses, which are monotone except at their extreme points and the boundaries of the domain of their definition. Candidates for the optimal matchings of each point are the intersections between the grid lines, the extensions of the shortcut lines, and the ellipses.

The free-space inside each cell of NFSD for two curves is a scaled ellipse, as proved in Lemma 2; the higher dimensional NFSD is similarly proved to have an ellipsoid inside each cell as the free-space.

Using Lemma 3, the complexity (the number of vertices) of the shortest path in NFSD is equal to the complexity of the simplification in the original space. Since the scaling constants in each dimension are independent from each other, Lemma 2 generalizes to any dimension, i.e. any number of curves.

Let $M$ be the minimum-link path from $(P_{1}(0),\ldots,P_{L}(0))$ to $(P_{1}(1),\ldots,P_{L}(1))$ in this $\overrightarrow{\epsilon}$-NFSD. We showed that $M$ satisfies the Fréchet distance $\epsilon_{i}$, i.e. for each point $m\in M$, the distances to each of the curves satisfy $d(m,P_{i}(t_{i}))\leq\epsilon_{i}$. Any optimal simplification maps to a path $M^{\prime}$ in the $\overrightarrow{\epsilon}$-NFSD, based on Lemma 1. $M^{\prime}$ is a polyline because the $\overrightarrow{\epsilon}$-NFSD preserves the slope of the lines with respect to the input curves and changing $p$ only affects the shape of the free-space.
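The free-space condition used in this proof can be illustrated with a small membership test. This is a sketch under stated assumptions: curves are lists of 2D points, each curve is parameterized over $[0,1]$ with equal parameter length per edge, and the helper names `point_at`, `in_free_space` and `p_norm` are hypothetical.

```python
import math


def point_at(polyline, t):
    """Point at parameter t in [0, 1], assuming each edge covers an equal share of [0, 1]."""
    n = len(polyline) - 1
    x = min(max(t, 0.0), 1.0) * n
    i = min(int(x), n - 1)          # index of the edge containing the parameter
    f = x - i                       # fractional position on that edge
    (x0, y0), (x1, y1) = polyline[i], polyline[i + 1]
    return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))


def in_free_space(curves, ts, m, eps):
    """True if m is within eps[i] of the point P_i(t_i) for every curve i."""
    return all(math.dist(point_at(P, t), m) <= e
               for P, t, e in zip(curves, ts, eps))


def p_norm(values, p):
    """L_p norm of a vector of non-negative values."""
    return sum(v ** p for v in values) ** (1.0 / p)


curves = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]
ts, m, eps = (0.5, 0.5), (1.0, 1.0), (1.0, 1.0)
dists = [math.dist(point_at(P, t), m) for P, t in zip(curves, ts)]
print(in_free_space(curves, ts, m, eps), p_norm(dists, 2) <= p_norm(eps, 2))
```

Whenever the per-curve test succeeds, the $L_{p}$ norm of the distances is bounded by $\lVert\overrightarrow{\epsilon}\rVert_{p}$, which is the implication used in the proof.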

The changes to the graph $H$ built from an $\overrightarrow{\epsilon}$-NFSD after changing $\overrightarrow{\epsilon}$ form a discrete set of events, i.e. values of $\overrightarrow{\epsilon}$ at which $H$ changes. In Lemma 6, we discuss the events at which the complexity of the shortest path in the $\overrightarrow{\epsilon}$-NFSD changes, i.e. the certificates for $p$-MC.

Lemma 6.

The number of events for the min-$\epsilon$ $p$-MC simplification of a curve $P$ with respect to $L$ curves is $O(n^{2L})$, and each event can be computed in $O(1)$ time.

Proof.

Changes to the graph $H$ built from an $\overrightarrow{\epsilon}$-NFSD when changing $\overrightarrow{\epsilon}$ happen at the intersections of the free-space inside a cell with the cell boundary, or when the monotonicity of the path between the cells changes, which happens at the intersections of the intervals on the boundaries of the cells. In Lemma 5, we showed that the shape of the free-space inside a cell is a transformed unit ball of the $L_{p}$ norm, i.e. $x^{p}+y^{p}=1$. For each edge of one curve and a vertex from the rest of the polylines, the intersection of this transformed ball and the boundary gives $L-1$ intervals. These events are the intersections of the transformed ball with the boundary, and the intersections of the projections of the intervals on their shared edge. Changing $\overrightarrow{\epsilon}$ scales and translates the transformed unit ball of the $L_{p}$ norm that represents the free-space inside each cell. So, its intersection with each cell boundary is still one continuous interval for each (vertex, edge) pair. Each NFSD has $O(n^{L})$ cells, each with $2L$ intersections, so the number of these events is $O(Ln^{L})$.

For different slope signs, instead of a straight line segment, we look for segments $s,s^{\prime}$ that share an endpoint on the edge $e$ between the cell with different slopes and its neighboring cells, such that $s$ and $s^{\prime}$ are the reflections of each other with respect to $e$. As discussed in the definition of $H$ for NFSD, these are the set of shortcuts and their extensions or reflections in case of slope changes. Based on the type of simplification, the size of $H$ is different:

  • Any point on one of the input curves can be used in $p$-MC:
    There are $\lvert P_{i}\rvert\prod_{j=1,j\neq i}^{L}\lvert P_{j}\rvert=O(n^{L})$ shortcuts, each intersecting with each of the $O(n^{L})$ edges of the NFSD grid, resulting in $O(n^{2L})$ event points.

  • The vertices of one of the input curves, $P_{i}$:
    Only the edges of $H$ that change slope at a point on the edges that are on the grid line for a vertex of $P_{i}$ are allowed. Since $O(n^{L-1})$ edges remain, each containing $O(n^{L-1})$ points on them ($S_{i,j}$), the number of events is $O(n^{2L-2})$.

Based on the generalization of NFSD in Lemma 5, we solve the $p$-MC problem in Theorem 1.

Theorem 1.

The $p$-MC of $L$ curves can be computed in $O(Ln^{5L}\log n)$ time.

Proof.

As long as the graph built on the $\overrightarrow{\epsilon}$-NFSD is not changed, changing $\overrightarrow{\epsilon}$ will not change the solution. So, it is enough to check the values $\overrightarrow{\epsilon}$ from Lemma 6. For each of these values, we build an $\overrightarrow{\epsilon}$-NFSD and compute the shortest path in $H$, one of which is the $p$-MC of the curves (Lemma 3). There are $O(n^{2L})$ values $\overrightarrow{\epsilon}$, and computing the shortest path in an $\overrightarrow{\epsilon}$-NFSD takes $O(Ln^{3L}\log n)$ time, resulting in an $O(Ln^{5L}\log n)$ time algorithm.

Algorithm 1 implements Theorem 1. Also, the algorithm can be used to compute a min-$\epsilon$ simplification.

Algorithm 1 Continuous $p$-Mean Curve
Input: Trajectories $P_{1},\ldots,P_{L}$, an integer $p\geq 1$, a constant $\epsilon>0$
Output: A trajectory $M$
$E$ = the values $\overrightarrow{\epsilon}$ for the events of $p$-MC.
$\overrightarrow{\epsilon_{0}}=\overrightarrow{0}$
for $k=0,\ldots,n$ do
     for $\overrightarrow{\epsilon}\in E$ do
         Build an $\overrightarrow{\epsilon}$-NFSD $F$ for $\{P_{i}\}_{i=1}^{L}$.
         $\tau_{\overrightarrow{\epsilon}}=$ the shortest path in $H$ built from $F$.
         $\overrightarrow{\epsilon_{0}}=$ the $\overrightarrow{\epsilon}$ with $\lVert\overrightarrow{\epsilon}\rVert_{p}\leq\epsilon$ for which a path of length $k$ is found.
     end for
end for
Build a curve $M$ by reporting the points of $\tau_{\overrightarrow{\epsilon_{0}}}$ in the Euclidean plane.
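The outer search of Algorithm 1 can be sketched as follows. This is only one possible reading of the pseudocode: the NFSD construction and the shortest-path computation are hidden behind a hypothetical oracle `path_length`, and stopping at the first admissible event vector is an assumption rather than something the pseudocode states.

```python
def p_norm(values, p):
    """L_p norm of a vector of non-negative values."""
    return sum(v ** p for v in values) ** (1.0 / p)


def select_event(events, path_length, p, eps_budget, max_k):
    """Return (k, eps_vec) for the smallest k such that some event vector with
    L_p norm at most eps_budget admits a path with exactly k links, else None.

    path_length(eps_vec) is a hypothetical oracle standing in for
    'build the eps-NFSD and compute the shortest path in H'.
    """
    for k in range(max_k + 1):
        for eps_vec in events:
            if p_norm(eps_vec, p) <= eps_budget and path_length(eps_vec) == k:
                return k, eps_vec
    return None


# Toy oracle for illustration only: larger error vectors allow shorter paths.
toy_events = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
toy_oracle = lambda eps_vec: 5 - int(eps_vec[0])
print(select_event(toy_events, toy_oracle, p=2, eps_budget=3.0, max_k=10))
```

In the toy run, the vector $(3,3)$ is rejected because its $L_{2}$ norm exceeds the budget, so the shortest admissible path comes from $(2,2)$.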

Since the time complexity of the exact $p$-MC algorithm is exponential in the number of curves ($L$), we discuss approximation algorithms for such cases.
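To see how quickly the bound of Theorem 1 grows with $L$, the expression inside the $O(\cdot)$ can be evaluated directly. The sketch below ignores constant factors and uses a base-2 logarithm; both choices are assumptions made for illustration.

```python
import math


def exact_pmc_bound(n, L):
    """Evaluate L * n^(5L) * log2(n): the number of events times the cost per event."""
    events = n ** (2 * L)                             # O(n^{2L}) candidate vectors (Lemma 6)
    per_event = L * n ** (3 * L) * math.log2(n)       # shortest path in one NFSD
    return events * per_event


for L in (2, 3, 4):
    print(L, f"{exact_pmc_bound(50, L):.3e}")
```

Even for curves with 50 vertices, each additional curve multiplies the bound by roughly $n^{5}$.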

3.4 Reducing the Continuous Version to the Discrete Version: A Simplified Version of Algorithm 1

Here, we discuss a simplified version of Algorithm 1 for $p=\infty$ and $L=2$, where instead of computing the path in the parameter space (FSD), we simulate the algorithm in the original space while computing the dynamic program for FSD. So, the complexity does not depend on the ply.

Let $H$ be the polygon built on the intersections of the ellipses (the free spaces) in each cell $C$ and the boundary of the cell. For each pair of vertices $p,q$ in $H$, compute the intersection of $pq$ with the boundary of $H$, including the boundary of the holes. Also, consider the extension of the shortcuts and the intersections between them. Add the points on the original curves corresponding to these points in FSD. Note that constructing the free-space diagram is not necessary to compute these events.

In Algorithm 2, we used unit direction vectors to indicate the slopes of the segments. The output of the algorithm is also curve-restricted, i.e. the vertices of the simplification lie on the edges of the input curve. Lemma 7 formulates the effects of the modifications.
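The unit direction vectors mentioned above can be compared as in the following sketch, which decides whether extending a path by a new segment adds a link. The function names and the tolerance `tol` are assumptions for illustration, not part of Algorithm 2.

```python
import math


def unit_direction(a, b):
    """Unit vector pointing from a to b; raises on a zero-length segment."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("degenerate segment")
    return (dx / length, dy / length)


def same_direction(a, b, c, d, tol=1e-9):
    """True if segments ab and cd have the same unit direction vector."""
    u, v = unit_direction(a, b), unit_direction(c, d)
    return abs(u[0] - v[0]) <= tol and abs(u[1] - v[1]) <= tol


def link_cost(last_a, last_b, next_a, next_b):
    """0 if the next segment continues in the direction of the last one, otherwise 1."""
    return 0 if same_direction(last_a, last_b, next_a, next_b) else 1


print(link_cost((0, 0), (1, 1), (1, 1), (3, 3)))  # continues straight
print(link_cost((0, 0), (1, 0), (1, 0), (1, 1)))  # turns
```

Comparing unit vectors rather than slopes avoids a special case for vertical segments and distinguishes opposite orientations.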

Algorithm 2 Bi-criteria approximation of $p$-MC of two curves
Input: A free-space diagram $D$ of $P$ with error $\epsilon$
Output: A monotone path $T$ from $P[0]$ to $P[n-1]$
Build $G$ from $D$.
$Q=$ Add the events of $G$ to $P$ with error $\epsilon$.
$D^{\prime}=$ the FSD for $Q$ with itself.
$S[i][j]=\infty,\ \rho[i][j]=-1\quad\forall i,j\in[1,|Q|]$
$S[1][1]=0$
for $i=2,\cdots,|Q|$ do
     for $j=1,\cdots,|Q|$ do
         if $F(D^{\prime},Q[i-1,\cdots,i],Q[k,\cdots,j])$ then
              $S[i][j],\rho[i][j]=\min_{k=0,\cdots,j}S[i][k]+\begin{cases}0&\text{if }1_{\overrightarrow{Q[i-1]Q[i]}}=1_{\overrightarrow{Q[\rho[i][j]]Q[j]}}\\ 1&\text{otherwise}\end{cases},\;k$ ▷ If the last edge and the next edge are collinear, do not increase the length, otherwise, increase it by one.
         end if
     end for
end for
if $\rho[|Q|][|Q|]=-1$ then
     return FAILED
end if
$i\leftarrow|Q|,\ j\leftarrow|Q|$
while $i\geq 1\wedge j\geq 1$ do
     Add $S[i]$ to the end of $T$, if $1_{\overrightarrow{T[i-1]T[i]}}\neq 1_{\overrightarrow{T[i]Q[\rho[i][j]]}}$.
     $i\leftarrow i-1,\ j\leftarrow\rho[i][j]$
end while
return $T$ in the reverse order

An intersection between two shortcuts is a point in FSD that does not fall on an edge of the FSD grid, i.e. it does not map to a vertex of one of the input curves; therefore, it cannot be chosen as a vertex of the simplification. Lemma 7 proves there is a path of twice the length that uses only the vertices of one of the curves.

Lemma 7.

Each monotone path of length $k$ in the parameter space can be mapped to a monotone path of length at most $2k$ in $H$, if the events of the intersections between the shortcuts are not used as the vertices of the simplification.

Proof.

Consider a monotone path in the FSD with a point $p_{i}$ on the monotone path inside the free space. Let $C$ be the cell containing $p_{i}$. Compute the intersection of the neighboring edges $p_{i-1}p_{i}$ and $p_{i}p_{i+1}$ with $H$ and call them $u$ and $v$ (See Section 3.4).

[Uncaptioned image]

Since the free space in each cell is an ellipse, and therefore convex, replacing $p_{i}$ with $u$ and then $v$ does not change the monotonicity of the curve, and only increases the length of the curve by $1$. Using the same convexity argument in neighboring cells, $p_{i-1}u$ and $vp_{i+1}$ fall inside the free space. This can happen once for each vertex, since $p_{i}$ will be on an edge of the grid (equivalently a vertex of the curve) after that. By induction on the length of the curve, repeating this for all vertices gives a path of length at most $2k$.

Algorithm 2 adds a vertex for the intersections of disks of radius $\epsilon$ with every edge and shortcut, and then computes the simplification. In Theorem 2, we show there is a $(2k)$-simplification of error at most $2\epsilon$, if a $k$-simplification with error at most $\epsilon$ exists.

Theorem 2.

Algorithm 2 finds a simplification of the input curve with at most $2k$ vertices and error at most $2\epsilon$, where $k$ is the size of the optimal simplification using any subset of points in the plane as vertices.

Proof.

Each vertex of the optimal simplification (which may use any points of the plane as vertices) can be replaced, as described in Lemma 7, by two points on the curve with distance at most $\epsilon$ from the simplification (points $u$ and $v$ from Lemma 7). This is because the free space interval on each edge has distance at most $\epsilon$ from one of the curves. If a curve simplification using points of the plane with Fréchet distance $\epsilon$ (or equivalently, a re-parameterization of the curves with distance $2\epsilon$) exists, each part of the curve that lies inside a disk of radius $\epsilon$ can be covered by the center of that disk (a vertex of the input curve), so it does not increase the Fréchet distance. Since the algorithm restricts the points to be on the curves, any point inside the disk can be chosen as part of the output instead of its center, resulting in estimation error $2\epsilon$, which is the sum of the errors at each endpoint of an edge of the optimal simplification. So, the previous Fréchet mapping (between the optimal simplification and the input) can be used with Fréchet distance $2\epsilon$ for the $(2k)$-simplification, because the points of the $(2k)$-simplification have distance at most $\epsilon$ to the points of the optimal $k$-simplification, and the optimal $k$-simplification has distance at most $\epsilon$ to the input curve.

The complexity of the curve follows from Lemma 7. In the Euclidean plane, this argument says that a point on the part of an edge that lies inside a disk might be replaced by the intersection points of the edge with that disk, which can double the complexity of the computed path.

4 $p$-Mean Curve of a Set of Curves

By substituting the Fréchet events of two curves with the Fréchet events of $L$ curves, the results of the previous section extend to $L$ curves, for $p\rightarrow\infty$, since the maximum of the distances is considered. For other $p$-mean curves, their distances to the $p$-MC can be different, so the previous methods do not apply.

In this section, we discuss two algorithms for $p$-MC of $L$ curves and analyze their approximation factors.

4.1 The Pairwise Algorithm for Discrete $p$-Mean Curve

The $p$-MC of a set of curves, like simplification using the Fréchet distance, is not unique. So, dividing the computation of the $p$-MC with at most $k$ vertices into first computing the Fréchet distance of a set of curves, and then simplifying the resulting curve does not yield the optimal solution.

Algorithm 3 computes an approximate $p$-MC. In this algorithm, all the Fréchet distances between the curves are computed; then, the simplification of the one with the minimum distance to the rest of the curves is reported as an approximate solution.

Algorithm 3 Pairwise Algorithm
Input: A set of trajectories $\{P_{i}[1..n]\}_{i=1}^{L}$
Output: A $p$-mean trajectory $M$
$D[i,j]=d_{F}(P_{i},P_{j}),\ \forall i=1,\ldots,L,\ j=1,\ldots,L$
$M=P_{i^{*}}$, where $i^{*}=\arg\min_{i=1}^{L}\sqrt[p]{\sum_{j=1}^{L}D[i,j]^{p}}$
$M\leftarrow$ min-$\epsilon$ simplification of $M$.
Lemma 8.

Algorithm 3 is a $3$-approximation for discrete $p$-MC.

Proof.

Assume $P_{i}$ is the curve that has the optimal solution $O_{i}$ as its simplification and let $T_{i}$ be an optimal min-$\epsilon$ simplification of $P_{i}$. Since $T_{i}$ is an optimal simplification, $d_{F}(T_{i},P_{i})\leq d_{F}(O_{i},P_{i})$. Since the $p$-MC is also a simplification of $P_{i}$, its distance to $P_{i}$ is at least as much as that of the optimal simplification. Using the triangle inequality of norms, the approximation factor is proved:

\[\begin{aligned}
\left\lVert d_{F}(P_{j},T_{i})\right\rVert_{p}
&\leq\left\lVert d_{F}(P_{j},O_{i})+d_{F}(T_{i},O_{i})\right\rVert_{p}\\
&\leq\left\lVert d_{F}(P_{j},O_{i})\right\rVert_{p}+\left\lVert d_{F}(T_{i},O_{i})\right\rVert_{p}\\
&\leq\left\lVert d_{F}(P_{j},O_{i})\right\rVert_{p}+\left\lVert d_{F}(T_{i},P_{i})\right\rVert_{p}+\left\lVert d_{F}(O_{i},P_{i})\right\rVert_{p}\\
&\leq\left\lVert d_{F}(P_{j},O_{i})\right\rVert_{p}+2\left\lVert d_{F}(O_{i},P_{i})\right\rVert_{p}\leq 3\,OPT.
\end{aligned}\]

Lemma 9.

The time complexity of Algorithm 3 is

\[O(L^{2}n^{2}\log n+\operatorname{G}(L,k))\]

for discrete simplification, using a $p$-MC algorithm of time $\operatorname{G}(L,k)$.

Proof.

Computing the Fréchet distance of two curves takes $O(n^{2}\log n)$ time. Testing each curve $P_{i}$ as the center and computing the $L_{p}$ norm of the Fréchet distances of all curves $\{P_{i}\}_{i=1}^{L}$ requires $\binom{L}{2}$ distance computations, one for each pair of curves. This takes $O(L^{2}n^{2}\log n)$ time. Finding the minimum takes $O(L^{2})$ time. Computing the $p$-MC with $k$ vertices takes $\operatorname{G}(L,k)$ time.

Note that in Algorithm 3, while distances in matrix $D$ satisfy the triangle inequality, their $p$-th power does not. So, approximation algorithms based on triangle inequality cannot be used to prune away large distances in $D$.
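The first two steps of Algorithm 3 (filling $D$ and selecting the center) can be sketched as below. The sketch assumes curves are lists of 2D points and uses the standard dynamic program for the discrete Fréchet distance; the final min-$\epsilon$ simplification step is omitted, and the function names are hypothetical.

```python
import math


def discrete_frechet(P, Q):
    """Discrete Fréchet distance via the standard O(|P||Q|) dynamic program."""
    n, m = len(P), len(Q)
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                dp[i][j] = max(min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]), d)
    return dp[-1][-1]


def pairwise_center(curves, p):
    """Index and cost of the curve minimizing the L_p norm of its distances to all curves."""
    L = len(curves)
    D = [[discrete_frechet(curves[i], curves[j]) for j in range(L)] for i in range(L)]

    def cost(i):
        return sum(D[i][j] ** p for j in range(L)) ** (1.0 / p)

    best = min(range(L), key=cost)
    return best, cost(best)


curves = [[(0, 0), (1, 0)], [(0, 1), (1, 1)], [(0, 3), (1, 3)]]
print(pairwise_center(curves, p=2))
```

In this toy input the middle curve is selected, since its distances to the other two curves (1 and 2) have the smallest $L_{2}$ norm.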

4.2 An Algorithm for $p$-Mean of $L$ Curves

Algorithm 4 simplifies the input curves with error less than their distances to the optimal $p$-MC, then it computes an approximate $p$-MC.

Algorithm 4 $p$-Mean Algorithm
Input: A set of curves $\{P_{i}\}_{i=1}^{L}$, an integer $k$, an $\alpha$-approx. min-$\epsilon$ simplification algorithm
Output: An approximate $p$-mean curve $M^{\prime}$
for $i=1,\ldots,L$ do
     $P^{\prime}_{i}=$ an approximate min-$\epsilon$ simplification of $P_{i}$.
end for
$M^{\prime}=$ a $p$-MC of $P^{\prime}_{1},\ldots,P^{\prime}_{L}$.
Theorem 3.

The approximation factor of Algorithm 4 is $2\alpha+1$, if an $\alpha$-approximation simplification algorithm is used.

Proof.

$M^{\prime}$ denotes the $p$-mean of curves $\{P_{i}\}_{i=1}^{L}$ computed by the algorithm, and $M$ denotes the optimal solution. $P^{\prime}_{i}$ is a simplification with error equal to the minimum error of simplifications of $P_{i}$ with at most $k$ vertices, so it has a distance less than any other curve, including $M$: $d_{F}(P_{i},M)\geq d_{F}(P_{i},P^{\prime}_{i})$. Since $M^{\prime}$ is the $p$-MC with minimum cost for curves $P^{\prime}_{i}$, it has a lower cost than $M$. Using the triangle inequality of the Fréchet distance:

\[\begin{aligned}
d_{F}(P^{\prime}_{i},M^{\prime})
&\leq d_{F}(P^{\prime}_{i},M)\leq d_{F}(P^{\prime}_{i},P_{i})+d_{F}(P_{i},M)\\
&\leq\sqrt[p]{d_{F}(P^{\prime}_{i},P_{i})^{p}+d_{F}(P^{\prime}_{i},M)^{p}}\\
&\leq\sqrt[p]{d_{F}(P^{\prime}_{i},P_{i})^{p}+(d_{F}(P_{i},P^{\prime}_{i})+d_{F}(P_{i},M))^{p}},\\
\frac{d_{F}(P_{i},M^{\prime})}{d_{F}(P_{i},M)}
&\leq\sqrt[p]{\left(\frac{d_{F}(P^{\prime}_{i},P_{i})}{d_{F}(P_{i},M)}\right)^{p}+\left(\frac{d_{F}(P_{i},P^{\prime}_{i})+d_{F}(P_{i},M)}{d_{F}(P_{i},M)}\right)^{p}}\\
&\leq\sqrt[p]{1+2^{p}}.
\end{aligned}\]

Substituting the approximation factors for computing $P^{\prime}_{i}$ from $P_{i}$ gives the approximation factor:

\[\sqrt[p]{\alpha^{p}+(\alpha+1)^{p}}\leq 2\alpha+1.\]
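The final inequality is an instance of the fact that the $L_{p}$ norm of a non-negative vector is at most its $L_{1}$ norm. A quick numeric check, with the function name `factor` chosen here for illustration:

```python
def factor(alpha, p):
    """Left-hand side of the bound: (alpha^p + (alpha + 1)^p)^(1/p)."""
    return (alpha ** p + (alpha + 1) ** p) ** (1.0 / p)


for alpha in (1, 2, 4):
    for p in (1, 2, 3, 10):
        # The last column is the claimed upper bound 2*alpha + 1.
        print(alpha, p, round(factor(alpha, p), 4), 2 * alpha + 1)
```

The bound is tight for $p=1$, and for $\alpha=1$ the left-hand side is the $\sqrt[p]{1+2^{p}}$ that closes the chain of inequalities above.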

Theorem 4.

Algorithm 4 takes $O(L\operatorname{T}(n)+\operatorname{G}(L,k))$ time for continuous $p$-MC, if a $\operatorname{T}(n)$ time simplification algorithm and a $\operatorname{G}(L,n)$ time $p$-MC algorithm on $L$ curves, each with complexity at most $n$, are used.

Proof.

Computing the simplification of a single curve takes $\operatorname{T}(n)$ time. The simplification algorithm is used $L$ times in the first step of the algorithm, so the total time complexity of that step is $L\operatorname{T}(n)$. Computing the $p$-mean of curves $P^{\prime}_{i},i=1,\ldots,L$ takes $\operatorname{G}(L,k)$ time. So, the running time of the algorithm is $O(L\operatorname{T}(n)+\operatorname{G}(L,k))$.
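The two-stage structure behind this running time can be sketched as a composition of two black boxes. Both helpers below are placeholders chosen only to make the sketch runnable: `subsample` is not a min-$\epsilon$ simplification and `vertexwise_mean` is not a $p$-MC algorithm.

```python
def p_mean_pipeline(curves, k, simplify, mean_curve):
    """Algorithm 4 as a composition: simplify each curve, then combine the results."""
    simplified = [simplify(P, k) for P in curves]   # L calls, T(n) time each
    return mean_curve(simplified)                   # one call, G(L, k) time


def subsample(P, k):
    """Placeholder simplifier: keeps k >= 2 evenly spaced vertices (no error guarantee)."""
    if k >= len(P):
        return list(P)
    step = (len(P) - 1) / (k - 1)
    return [P[round(i * step)] for i in range(k)]


def vertexwise_mean(curves):
    """Placeholder combiner: averages vertices index by index (not a p-MC algorithm)."""
    return [tuple(sum(c) / len(curves) for c in zip(*pts)) for pts in zip(*curves)]


a = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
b = [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)]
print(p_mean_pipeline([a, b], 3, subsample, vertexwise_mean))
```

Swapping in an $\alpha$-approximate simplifier and an exact or approximate $p$-MC routine leaves the control flow, and hence the $L\operatorname{T}(n)+\operatorname{G}(L,k)$ accounting, unchanged.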

5 Experiments

In this section, we use two types of trajectory data to evaluate our algorithm. The first one is a set of GPS tracks in different cities, and the second one is a set of pen trajectories recorded while writing characters on a tablet. Our divide and conquer method, when used in combination with our simplification algorithm, produces results with a good approximation factor, faster than existing algorithms that have at least quadratic time complexity due to computing the Fréchet distance on the whole input.

5.1 GPS Trajectory Datasets

We used two tracks from the datasets of [5] in the map construction repository [1], which are GPS coordinates of trajectories in several cities. One of them is track 29 of the Athens Small dataset with 47 points. The other one is track 82 from the Chicago dataset with 363 points. In this experiment, we only consider the first two coordinates of the tracks and use $\epsilon$ as the simplification error and $\delta$ as the rounding error, and for the simplification algorithm we use Algorithm 2.

In Figure 3, the original curve and its simplification with $\epsilon=100,\delta=1$ are given. The number of events used is $128$ and the output size is $14$.

Figure 3: Track 29 of Athens Small dataset (in blue) and its simplification (in orange).

Since track 82 of the Chicago dataset is too large for the algorithm to compute fast, we break it into chunks of $30$ points, which gives $8$ chunks. Also, we first compute a simplification and then a simplification using the points of the plane on each part of the curve, and concatenate the results. The output size is $121$ points. The parameters are $\epsilon=20$ and $\delta=2$.

Figure 4: Track 82 of Chicago dataset (in blue) and its simplification (in orange).

5.2 Character Trajectories Dataset

We ran the algorithm on the first trajectory of the dataset “Character Trajectories Data Set” [38, 39, 37] from UCI Machine Learning Repository [22]. Data is the pen tip trajectories of characters with a single pen-down which was captured using a WACOM tablet with sampling frequency 200Hz, and the data was normalized and smoothed. The dataset contains 2858 character samples.

We only considered the first two dimensions x and y, and ignored the last dimension which was the pen-tip force. Since the points are close to each other, the number of events can be high, so, we partition each trajectory into subsets of smaller size, compute their simplification (using the points of the plane), and then attach them and compute their overall estimation.
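The partition-simplify-concatenate scheme described above can be sketched as follows. The sketch assumes disjoint consecutive chunks and takes the simplifier as a parameter; whether neighboring chunks share an endpoint is not specified in the text, and `keep_endpoints` is only a stand-in for the actual simplification algorithm.

```python
def split_chunks(points, chunk_size):
    """Consecutive, non-overlapping chunks of at most chunk_size points."""
    return [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]


def chunked_simplify(points, chunk_size, simplify):
    """Simplify each chunk independently and concatenate the results in order."""
    out = []
    for chunk in split_chunks(points, chunk_size):
        out.extend(simplify(chunk))
    return out


def keep_endpoints(chunk):
    """Stand-in simplifier: keeps only the first and last point of a chunk."""
    return [chunk[0], chunk[-1]] if len(chunk) > 1 else list(chunk)


track = [(float(i), 0.0) for i in range(178)]
print(len(split_chunks(track, 30)), len(chunked_simplify(track, 30, keep_endpoints)))
```

The concatenated output can then be passed to the simplifier once more to obtain the overall estimate, as done in the experiment.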

The first trajectory has 178 points, and the input was partitioned into chunks of $30$ points. In Figure 5(a), its simplification (using the points of the plane) with error $\epsilon=0.1$ and rounding error $\delta=0.01$ is shown, where the size of the output is $20$. In Figure 5(b), the output for $\epsilon=0.05$ and $\delta=0.001$ is shown, and the output size is $35$. Running the algorithm on the concatenation of the estimations of the parts with the same error is shown in Figure 5(c); it gives an output of size $31$, and the overall error is the sum of the errors, which is $\epsilon=0.1,\delta=0.002$.

(a) The first sample (blue) and its simplification using the points of the plane (orange).
(b) The first sample (blue) and its simplification using the points of the plane (orange).
(c) The first sample (blue) and its simplification using the points of the plane (orange).

6 Discussions and Open Problems

Using the free-space diagram to map $L$ two dimensional points into $L$ curve-length variables removes useful information such as the slope (line inclination). In this paper, we added this information by adding some vertices to the original definition of FSD and used it to give a simplification algorithm. Our algorithm generalizes to the $L_{p}$ norm of the Fréchet distances, not to be confused with using $L_{p}$ norms as the metric space instead of the Euclidean plane.

We also show that the $p$-mean curve satisfies the composable property for core-sets, and results in a constant factor approximation summary.

7 Theoretical Insights to Existing Heuristics

Greedy simplification by moving a disk along the curve.

The heuristic simplification algorithm that sweeps the curve and simplifies the part of the curve that is inside the disk of radius $\epsilon$ is commonly used in practice, but it does not have any theoretical guarantees except for the error $\epsilon$. When discussing this simplification algorithm in the parameter space (FSD) of the curve with itself with $\epsilon$ as input, we see it is in fact the lowermost feasible path. Replacing this path with the shortest path gives an algorithm for min-$k$ simplification with complexity dependent on the size of the FSD for $\epsilon$. For monotone curves, this is an output-sensitive exact algorithm that takes $O(n\log n+nk)$ time.
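One common variant of this sweeping heuristic is sketched below: a vertex is dropped while it lies within distance $\epsilon$ of the last kept vertex, and the endpoints are always kept. The exact variant meant in the text may differ; the function name is an assumption.

```python
import math


def greedy_disk_simplify(points, eps):
    """Keep a vertex only when it leaves the disk of radius eps around the last kept vertex."""
    if not points:
        return []
    kept = [points[0]]
    for q in points[1:-1]:
        if math.dist(kept[-1], q) > eps:
            kept.append(q)
    if len(points) > 1:
        kept.append(points[-1])     # the last vertex is always reported
    return kept


curve = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.0), (1.5, 0.0), (3.0, 0.0)]
print(greedy_disk_simplify(curve, eps=1.0))
```

The sweep is linear in the number of vertices, and it makes no attempt to minimize the number of kept vertices, which is the gap the shortest-path formulation above addresses.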

Local polyline simplification using Fréchet distance

The results of [29] showed the famous Imai-Iri algorithm [26] is not optimal for the simplification under Fréchet distance, when taken over the whole curves (global Fréchet distance).

References

  • [1] Map construction algorithms.
  • [2] M. A. Abam, M. De Berg, P. Hachenberger, and A. Zarei. Streaming algorithms for line simplification. Discrete Comput. Geom., 43(3):497–515, 2010.
  • [3] P. K. Agarwal, S. Har-Peled, N. H. Mustafa, and Y. Wang. Near-linear time approximation algorithms for curve simplification. Algorithmica, 42(3-4):203–219, 2005.
  • [4] S. Aghamolaei, V. Keikha, M. Ghodsi, and A. Mohades. Windowing queries using Minkowski sum and their extension to MapReduce. Journal of Supercomputing, 2020.
  • [5] M. Ahmed, S. Karagiorgou, D. Pfoser, and C. Wenk. A comparison and evaluation of map construction algorithms using vehicle tracking data. GeoInformatica, 19(3):601–632, 2015.
  • [6] H.-K. Ahn, H. Alt, M. Buchin, E. Oh, L. Scharf, and C. Wenk. A middle curve based on discrete Fréchet distance. In Latin American Symp. Theoret. Informatics, pages 14–26. Springer, 2016.
  • [7] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. Int. J. of Comput. Geom. Appl., 5(01n02):75–91, 1995.
  • [8] M. Bateni, A. Bhaskara, S. Lattanzi, and V. Mirrokni. Distributed balanced clustering via mapping coresets. In Adv. in Neural Info. Process. Syst., pages 2591–2599, 2014.
  • [9] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 273–284. ACM, 2013.
  • [10] K. Bringmann. Why walking the dog takes time: Fréchet distance has no strongly subquadratic algorithms unless SETH fails. In Annu. IEEE Sympos. Found. Comput. Sci., pages 661–670. IEEE, 2014.
  • [11] K. Bringmann and B. R. Chaudhury. Polyline simplification has cubic complexity. arXiv preprint arXiv:1810.00621, 2018.
  • [12] K. Buchin, M. Buchin, M. Konzack, W. Mulzer, and A. Schulz. Fine-grained analysis of problems on curves. EuroCG, Lugano, Switzerland, 2016.
  • [13] K. Buchin, M. Buchin, M. van Kreveld, M. Löffler, R. I. Silveira, C. Wenk, and L. Wiratma. Median trajectories. Algorithmica, 66(3):595–614, Jul 2013.
  • [14] K. Buchin, A. Driemel, J. Gudmundsson, M. Horton, I. Kostitsyna, M. Löffler, and M. Struijs. Approximating $(k,\ell)$-center clustering for curves. In Proceedings of the 30th ACM-SIAM Sympos. Discrete Algorithms, pages 2922–2938. SIAM, 2019.
  • [15] K. Buchin, A. Driemel, and M. Struijs. On the hardness of computing an average curve. arXiv preprint arXiv:1902.08053, 2019.
  • [16] K. Buchin, A. Driemel, N. van de L’Isle, and A. Nusser. klcluster: Center-based clustering of trajectories. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 496–499, 2019.
  • [17] K. Buchin, T. Ophelders, and B. Speckmann. Seth says: Weak Fréchet distance is faster, but only if it is continuous and in one dimension. In Proceedings of the 30th Annu. ACM Sympos. Comput. Geom., pages 2887–2901. SIAM, 2019.
  • [18] M. Buchin, N. Funk, and A. Krivošija. On the complexity of the middle curve problem. arXiv preprint arXiv:2001.10298, 2020.
  • [19] E. Chambers, I. Kostitsyna, M. Löffler, and F. Staals. Homotopy measures for representative trajectories. In Inform. Process. Lett., volume 57. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
  • [20] A. Driemel, S. Har-Peled, and C. Wenk. Approximating the Fréchet distance for realistic curves in near linear time. Discrete Comput. Geom., 48(1):94–127, 2012.
  • [21] A. Driemel, A. Krivošija, and C. Sohler. Clustering time series under the Fréchet distance. In Proceedings of the 27th ACM-SIAM Sympos. Discrete Algorithms, pages 766–785. Society for Industrial and Applied Mathematics, 2016.
  • [22] D. Dua and C. Graff. UCI machine learning repository, 2017.
  • [23] A. Dumitrescu and G. Rote. On the Fréchet distance of a set of curves. In Canad. Conf. Computat. Geom., pages 162–165, 2004.
  • [24] A. Gheibi, A. Maheshwari, and J.-R. Sack. Weighted minimum backward Fréchet distance. Theoret. Comput. Sci., 783:9–21, 2019.
  • [25] S. Har-Peled and B. Raichel. The Fréchet distance revisited and extended. ACM Trans. Algorithms, 10(1):3, 2014.
  • [26] H. Imai and M. Iri. Polygonal approximations of a curve—formulations and algorithms. In Machine Intelligence and Pattern Recognition, volume 6, pages 71–86. Elsevier, 1988.
  • [27] P. Indyk, S. Mahabadi, M. Mahdian, and V. S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGAI Sympos. Princ. Database Syst., pages 100–108. ACM, 2014.
  • [28] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In Proceedings of the 21st ACM-SIAM Sympos. Discrete Algorithms, pages 938–948. Society for Industrial and Applied Mathematics, 2010.
  • [29] M. v. Kreveld, M. Löffler, and L. Wiratma. On optimal polyline simplification using the Hausdorff and Fréchet distance. In Proceedings of the 34th Annu. ACM Sympos. Comput. Geom., volume 99, pages 56–1. Leibniz International Proceedings in Informatics (LIPIcs), 2018.
  • [30] J.-H. Lin and J. S. Vitter. Approximation algorithms for geometric median problems. 1992.
  • [31] J. S. Mitchell, G. Rote, and G. Woeginger. Minimum-link paths among obstacles in the plane. Algorithmica, 8(1-6):431–459, 1992.
  • [32] G. Rosman, M. Volkov, D. Feldman, J. W. Fisher III, and D. Rus. Coresets for k-segmentation of streaming data. In Adv. in Neural Info. Process. Syst., pages 559–567. Curran Associates, Inc., 2014.
  • [33] C. D. Toth, J. O’Rourke, and J. E. Goodman. Handbook of discrete and computational geometry. Chapman and Hall/CRC, 2017.
  • [34] M. van de Kerkhof, I. Kostitsyna, M. Löffler, M. Mirzanezhad, and C. Wenk. Global curve simplification. In Proceedings of the 27th Annu. European Sympos. Algorithms. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  • [35] M. van Kreveld, M. Löffler, and F. Staals. Central trajectories. arXiv preprint arXiv:1501.01822, 2015.
  • [36] M. van Kreveld and L. Wiratma. Median trajectories using well-visited regions and shortest paths. In Proceedings of the 19th ACM SIGSPATIAL Internat. Conf. Advances Geogr. Inform. Syst., pages 241–250. ACM, 2011.
  • [37] B. Williams, M. Toussaint, and A. J. Storkey. Modelling motion primitives and their timing in biologically executed movements. Advances in neural information processing systems, 20:1609–1616, 2007.
  • [38] B. H. Williams, M. Toussaint, and A. J. Storkey. Extracting motion primitives from natural handwriting data. In International Conference on Artificial Neural Networks, pages 634–643. Springer, 2006.
  • [39] B. H. Williams, M. Toussaint, and A. J. Storkey. A primitive based generative model to infer timing information in unpartitioned handwriting data. In IJCAI, pages 1119–1124, 2007.