
Joint symbolic aggregate approximation of time series

Xinye Chen (ORCID: 0000-0003-1778-393X), Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. xinye.chen@mff.cuni.cz
Abstract.

The increasing availability of temporal data poses a challenge to time-series and signal-processing domains due to its high numerosity and complexity. Symbolic representation outperforms raw data in a variety of engineering applications due to its storage efficiency, reduced numerosity, and noise reduction. The most recent symbolic aggregate approximation technique, ABBA, demonstrates outstanding performance in preserving the essential shape information of time series and enhancing downstream applications. However, ABBA cannot handle multiple time series with consistent symbols, i.e., the same symbol drawn from distinct time series does not carry identical information. Moreover, working with ABBA digitization involves the tedious task of tuning hyperparameters such as the number of symbols or the tolerance. We therefore present a joint symbolic aggregate approximation that guarantees symbolic consistency, and show how the digitization hyperparameter can be determined ahead of time from the compression tolerance. In addition, we propose a novel computing paradigm that enables parallel computation of the symbolic approximation. Extensive experiments demonstrate its superb performance and outstanding speed regarding symbolic approximation and reconstruction.

KEYWORDS: time series analysis, symbolic aggregate approximation, data compression, parallel computing

1. Introduction

Time series naturally exhibit high numerosity in the real world, and most algorithms are limited by the computational load of dealing with large-scale data. It is therefore highly desirable to compute a representation that reduces the numerosity while preserving the essential characteristics of the time series; a reasonable representation often boosts algorithmic performance and dramatically alleviates the pressure on compute resources. For instance, symbolic approximation of time series has been demonstrated to speed up neural network inference (Elsworth and Güttel, 2020b). However, computing a symbolic representation for large-scale time series is tricky due to its high computational complexity.

The adaptive Brownian bridge-based symbolic aggregation (ABBA) method, as well as its accelerated variant fABBA, is one of the state-of-the-art symbolic approximation techniques regarding reconstruction error in the time series domain. However, it transforms a single time series at a time, which is clumsy for multiple time series, especially at large scale. The method is also inherently sequential, which makes it hard to fully utilize available computing resources. More importantly, the consistency of symbols is not guaranteed. Consistency here means that each distinct symbol carries the same information across all samples of multiple time series; for example, the symbol “a” appearing in one time series should be identical in meaning to the “a” in another. In addition, parameter tuning is intractable without prior knowledge, although this problem is already mitigated in fABBA by its tolerance-dominated digitization.

Our application of interest is symbolizing multivariate or multiple time series in a unified manner. We propose a joint symbolic representation framework that addresses the aforementioned issues and enables parallelism. Extensive experiments demonstrate that the proposed algorithm achieves significant speedup while retaining competitive reconstruction performance, particularly for large-scale time series. The software has been integrated into the PyPI-registered package fABBA (https://github.com/nla-group/fABBA).

Our contribution is summarized as follows:

(1) This paper analyzes the clustering used in the digitization step of ABBA and fABBA, and proposes a sampling-based k-means method to accelerate ABBA while retaining its original accuracy.

(2) A joint symbolic aggregate approximation method is proposed that enables consistent symbolization for multivariate or multiple time series. Building on it, we introduce a novel parallel computing scheme for the symbolic approximation of time series; a multithreading test shows its significant speedup over ABBA and fABBA.

(3) Based on Brownian bridge modeling, a provably error-bounded method is proposed to automatically determine the hyperparameter of fABBA digitization, so that users need less prior knowledge for hyperparameter tuning.

The remainder of this paper is structured as follows. Section 2 discusses related work on symbolic representation and its applications. Section 3 reviews the necessary notions of the ABBA framework. Section 4 expands the existing digitization analysis, presents a sampling-based algorithm that speeds up vector quantization-based digitization, and introduces a hyperparameter selection method based on Brownian bridge modeling. Section 5 formally introduces our framework of joint symbolic approximation. Section 6 presents the empirical results of various competing algorithms and Section 7 concludes the paper.

2. Related work

Symbolic time series representation has important applications in time series analysis, such as clustering (Lin et al., 2007; Ruta et al., 2019; Li et al., 2021), time series classification (Senin and Malinchik, 2013; Li and Lin, 2017; Nguyen and Ifrim, 2023), forecasting (Elsworth and Güttel, 2020b), event prediction (Zhang et al., 2017), anomaly detection (Senin et al., 2015), and motif discovery (Lin et al., 2007; Li and Lin, 2010; Gao and Lin, 2019). In this section, we briefly review some work on symbolic time series representation and its applications. The literature on symbolic representation methods and their analysis is too large to survey in detail; due to limited space we only discuss a few typical works most closely related to our research.

SAX (Lin et al., 2003) is the first symbolic time series representation that reduces the dimensionality of time series and allows indexing with a lower-bounding distance measure. It started a trend of employing symbolic representations in numerous downstream time series tasks with significant success, e.g., pattern search (SAXRegEx (Yu et al., 2023)), clustering (SAX Navigator (Ruta et al., 2019), SPF (Li et al., 2021)), anomaly detection (HOT SAX (Keogh et al., 2005), TARZAN (Keogh et al., 2002)) and time series classification (SAX-VSM (Senin and Malinchik, 2013), BOPF (Li and Lin, 2017), MrSQM (Nguyen and Ifrim, 2023)). SAX has spawned various enhanced variants, e.g., 1d-SAX (Malinowski et al., 2013), ESAX (Lkhagva et al., 2006), pSAX and cSAX (Bountrogiannis et al., 2023), which improve either speed or accuracy, but SAX itself still enjoys wide popularity due to its appealing simplicity and speed.

ABBA (Elsworth and Güttel, 2020a) uses an adaptive polygonal chain approximation followed by mean-based clustering to achieve symbolization of time series. The reconstruction error of the representation can be modeled as a random walk with pinned start and end points, i.e., a Brownian bridge. Its variant fABBA (Chen and Güttel, 2022a) replaces the k-means clustering with an efficient greedy aggregation (GA) method, which speeds up the digitization by orders of magnitude. Both ABBA and fABBA have been empirically shown to preserve the shape of time series, especially its up-and-down behavior, better than SAX. ABBA has proven effective for time series prediction and anomaly detection; e.g., an LSTM combined with ABBA shows robust inference performance (Elsworth and Güttel, 2020b), and TARZAN with SAX replaced by ABBA or fABBA compares favorably with SAX-based TARZAN (Elsworth and Güttel, 2020a; Chen and Güttel, 2022a). However, computing an ABBA symbolic representation for multiple time series is strenuous due to the vast number of features to be extracted, especially when symbolic consistency is required.

3. Preliminary of ABBA

Here we briefly recap the preliminaries of the ABBA method. ABBA is a symbolic time series representation based on an adaptive polygonal chain approximation, followed by a mean-based clustering algorithm. ABBA symbolization mainly contains two steps, namely compression and digitization, which aggregate a time series T=[t_0,t_1,\ldots,t_n]\in\mathbb{R}^{n} into a symbolic approximation

(1) A=[a_1,a_2,\ldots,a_N],

where N\ll n and a_i\in\mathcal{L}.

Table 1 shows the procedure of symbolization (the first three steps) and inverse symbolization (the last three steps); for naming convenience, we follow (Elsworth and Güttel, 2020a) and define \widehat{\texttt{inc}}=\widetilde{\texttt{inc}}. The ABBA method essentially comprises the six steps summarized in Table 1. The difference between T and \widehat{T} is referred to as the reconstruction error; a poor symbolization leads to a high reconstruction error. We mainly review the compression and digitization phases below.

Table 1. Summarized notation of the ABBA procedure
time series T=[t_0,t_1,\ldots,t_n]\in\mathbb{R}^{n}
after compression P=[(\texttt{len}_1,\texttt{inc}_1),\ldots,(\texttt{len}_N,\texttt{inc}_N)]\in\mathbb{R}^{2\times N}
after digitization A=[a_1,\ldots,a_N]\in\mathcal{L}^{N}
inverse-digitization \widetilde{P}=[(\widetilde{\texttt{len}}_1,\widetilde{\texttt{inc}}_1),\ldots,(\widetilde{\texttt{len}}_N,\widetilde{\texttt{inc}}_N)]\in\mathbb{R}^{2\times N}
quantization \widehat{P}=[(\widehat{\texttt{len}}_1,\widehat{\texttt{inc}}_1),\ldots,(\widehat{\texttt{len}}_N,\widehat{\texttt{inc}}_N)]\in\mathbb{R}^{2\times N}
inverse-compression \widehat{T}=[\widehat{t}_1,\widehat{t}_2,\ldots,\widehat{t}_n]\in\mathbb{R}^{n}

3.1. Compression

The ABBA compression step computes an adaptive piecewise linear continuous approximation of T, that is, it produces the time series pieces P=[(\texttt{len}_1,\texttt{inc}_1),\ldots,(\texttt{len}_N,\texttt{inc}_N)]\in\mathbb{R}^{2\times N}, which a subsequent digitization turns into a symbolic sequence A=[a_1,a_2,\ldots,a_N]\in\mathcal{L}^{N}, N\ll n, where each a_j is an element of a finite alphabet set \mathcal{L} with |\mathcal{L}|\ll N; \mathcal{L} can be referred to as the dictionary of the procedure. Given a tolerance \texttt{tol}, the ABBA compression adaptively selects N+1 indices i_0=0<i_1<\cdots<i_N=n so that the time series T is well approximated by a polygonal chain going through the points (i_j,t_{i_j}) for j=0,1,\ldots,N. This results in a partition of T into N pieces p_j=(\texttt{len}_j,\texttt{inc}_j), each determined by T_{i_{j-1}:i_j}=[t_{i_{j-1}},t_{i_{j-1}+1},\ldots,t_{i_j}] and of integer length \texttt{len}_j:=i_j-i_{j-1}\geq 1 in the time direction. Visually, each piece p_j is represented by a straight line connecting the endpoint values t_{i_{j-1}} and t_{i_j}. The partitioning criterion is that the squared Euclidean distance of the values in p_j from this straight line is upper bounded by (\texttt{len}_j-1)\cdot\texttt{tol}^2. More precisely, given an index i_{j-1} and starting with i_0=0, the procedure seeks the largest possible i_j such that i_{j-1}<i_j\leq n and

(2) \sum_{i=i_{j-1}}^{i_j}\Big(t_{i_{j-1}}+(t_{i_j}-t_{i_{j-1}})\cdot\frac{i-i_{j-1}}{i_j-i_{j-1}}-t_i\Big)^2\leq(i_j-i_{j-1}-1)\cdot\texttt{tol}^2.

Each linear piece p_j of the resulting polygonal chain \widetilde{T} is represented by a tuple (\texttt{len}_j,\texttt{inc}_j), where \texttt{inc}_j=t_{i_j}-t_{i_{j-1}} is the increment in value, i.e., the difference between the ending and starting values of T_{i_{j-1}:i_j}. The whole polygonal chain can be recovered exactly from the first value t_0 and the tuple sequence p_1,p_2,\ldots,p_N, i.e.,

(3) (\texttt{len}_1,\texttt{inc}_1),\ldots,(\texttt{len}_N,\texttt{inc}_N)\in\mathbb{R}^2,

where the reconstruction error of this representation has pinned start and end points and can naturally be modeled as a Brownian bridge.
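To make the compression step concrete, the following is a minimal Python sketch of the criterion (2); it is written for this exposition rather than taken from the fABBA package, and the function name compress and the greedy piece-extension loop are our own illustrative choices.

import numpy as np

def compress(ts, tol):
    # Sketch of ABBA compression (Eq. (2)): extend the current piece while the
    # squared deviation of the values from the straight line through the
    # piece's endpoints stays within (len - 1) * tol**2.
    ts = np.asarray(ts, dtype=float)
    pieces = []                      # list of (len, inc) tuples
    start, end = 0, 1                # start is the index i_{j-1} of the current piece
    while end < len(ts):
        length = end - start
        line = ts[start] + (ts[end] - ts[start]) * np.arange(length + 1) / length
        if np.sum((line - ts[start:end + 1]) ** 2) > (length - 1) * tol ** 2:
            # criterion violated: close the piece at the previous index
            pieces.append((end - 1 - start, ts[end - 1] - ts[start]))
            start = end - 1
        end += 1
    pieces.append((len(ts) - 1 - start, ts[-1] - ts[start]))
    return np.array(pieces)          # shape (N, 2): columns len and inc

A length-1 candidate piece always satisfies (2) exactly (the line passes through both endpoints), so the returned lengths are at least 1.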

3.2. Digitization

The next step is referred to as digitization, in which the resulting polygonal chain \widetilde{T} is further transformed into a symbolic representation of the form (1).

Following (Elsworth and Güttel, 2020a), prior to digitizing, the tuple lengths and increments are separately normalized by their standard deviations \sigma_{\texttt{len}} and \sigma_{\texttt{inc}}, respectively. A further scaling parameter \texttt{scl} then assigns a weight to the length of each piece p_i, reflecting the importance given to its length value relative to its increment value. Hence, the clustering is effectively performed on the scaled tuples

(4) p_1=\left(\texttt{scl}\frac{\texttt{len}_1}{\sigma_{\texttt{len}}},\frac{\texttt{inc}_1}{\sigma_{\texttt{inc}}}\right),\ldots,p_N=\left(\texttt{scl}\frac{\texttt{len}_N}{\sigma_{\texttt{len}}},\frac{\texttt{inc}_N}{\sigma_{\texttt{inc}}}\right).

In particular, if \texttt{scl}=0, the clustering is performed only on the increment values of P, while if \texttt{scl}=1, the lengths and increments are clustered with equal importance.

The steps after normalization proceed with a lossy compression technique, namely vector quantization (VQ), which is often achieved by mean-based clustering; for background on vector quantization see (Gray, 1984; Dasgupta and Freund, 2009). Given an input of N vectors P=[p_1,\ldots,p_N]\in\mathbb{R}^{\ell\times N}, VQ seeks a codebook of k vectors C=[c_1,\ldots,c_k]\in\mathbb{R}^{\ell\times k} with k much smaller than N, where each c_i is associated with a unique cluster S_i. A good codebook makes the sum of squared errors SSE close to its optimal value. Supposing k clusters S_1,S_2,\ldots,S_k\subseteq P are computed, VQ aims to minimize

(5) \texttt{SSE}=\sum_{i=1}^{k}\phi(c_i,S_i)=\sum_{i=1}^{k}\sum_{p\in S_i}\mathrm{dist}(p,c_i)^2,

where \phi denotes the energy function, c_i denotes the center of cluster S_i, and \mathrm{dist}(p_i,p_j) usually denotes the Euclidean norm \|p_i-p_j\|_2. In Euclidean space we typically choose the mean center \mu_i=\frac{1}{|S_i|}\sum_{p\in S_i}p as c_i, and then (5) can be written as \texttt{SSE}=\sum_{i=1}^{k}|S_i|\,\texttt{Var}\,S_i. Lloyd's algorithm (Lloyd, 1982a), also known as the k-means algorithm, provides a suboptimal solution to the vector quantization problem of minimizing SSE.
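As a small illustration of the objective (5), the helper below computes the SSE of a given partition with NumPy; the function name and the layout of P (one piece per column, matching the \ell\times N convention above) are our own assumptions.

import numpy as np

def sse(P, labels, centers):
    # Sum of squared errors (Eq. (5)) for the columns of P (d x N) assigned to
    # clusters by `labels`, with one center per cluster in `centers` (d x k).
    total = 0.0
    for i in range(centers.shape[1]):
        cluster = P[:, labels == i]                      # members of cluster i
        total += np.sum((cluster - centers[:, [i]]) ** 2)
    return total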

The ABBA digitization can be performed by any suitable partitional clustering algorithm that finds k clusters from P\in\mathbb{R}^{2\times N} such that the SSE with respect to the codebook C is minimized. The obtained codebook vectors are referred to as symbolic centers. Each symbolic center is associated with a distinct symbol, and each time series piece p_i is assigned to its closest symbolic center c^i

(6) c^i=\operatorname*{arg\,min}_{c\in C}\|p_i-c\|.

The mapping from symbolic centers to symbols is one-to-one, denoted by I_d:C\to A; thus the digitization f_d:P\to A is given by

(7) f_d(p_i)=I_d(c^i)=I_d\big(\operatorname*{arg\,min}_{c\in C}\|p_i-c\|\big).

Each symbol is associated with a unique cluster, and in practice each cluster label (membership) corresponds to a unique byte-size integer value. The symbols used in ABBA can be represented by text characters and are not limited to English alphabet letters, since often more clusters are used than there are letters; any set of ASCII characters, or combinations thereof, can serve as the alphabet.
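A hedged sketch of the nearest-center assignment (6)-(7) follows; the choice of printable ASCII characters as the dictionary is only one possible alphabet, and the function name digitize is ours.

import numpy as np

def digitize(P, centers):
    # Assign each scaled piece (column of P, 2 x N) to its nearest symbolic
    # center (Eq. (6)) and map the cluster label to a printable character.
    alphabet = [chr(33 + i) for i in range(centers.shape[1])]      # '!', '"', '#', ...
    d2 = ((P[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)  # N x k distances
    labels = d2.argmin(axis=1)
    return "".join(alphabet[lab] for lab in labels), labels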

It is also instructive to discuss compression rates. The digitization determines the compression rate, which relates the size of the codebook C (i.e., the number of distinct symbols |\mathcal{L}|) to the length of the time series. We use \tau_c to denote the compression rate, given by

(8) \tau_c:=1-\frac{|\mathcal{L}|}{n}\in(0,1].

3.3. Inverse symbolization

The inverse symbolization refers to the process from A to \widehat{T}; the aim is to reconstruct a time series from (1) such that the reconstructed time series \widehat{T} is as close to T as possible. The inverse symbolization contains three steps.

The first step is referred to as inverse-digitization, written f_d^{-1}, which replaces each symbol in A by its representative element c_i from the codebook C (e.g., the mean or median center of the group S_i), and thus results in a 2-by-N array \widetilde{P}, an approximation of P in which each \widetilde{p}_i\in\widetilde{P} is the symbolic center c\in C closest to p_i\in P. The inverse digitization generally produces non-integer values for the reconstructed lengths len, so (Elsworth and Güttel, 2020a) proposes a rounding method, referred to as quantization, that keeps the cumulated lengths close to the correct integers. The method is as follows: round the first length to an integer, \widehat{\texttt{len}}_1:=\text{round}(\widetilde{\texttt{len}}_1), and record the rounding error e:=\widetilde{\texttt{len}}_1-\widehat{\texttt{len}}_1. The error is then folded into the next rounding, \widehat{\texttt{len}}_2:=\text{round}(\widetilde{\texttt{len}}_2+e), and the new error e^{\prime} is calculated as \widetilde{\texttt{len}}_2+e-\widehat{\texttt{len}}_2. Then e^{\prime} is carried into the next rounding in the same way. After all roundings are computed, we obtain

(9) \widehat{P}=[(\widehat{\texttt{len}}_1,\widehat{\texttt{inc}}_1),\ldots,(\widehat{\texttt{len}}_N,\widehat{\texttt{inc}}_N)]\in\mathbb{R}^{2\times N},

where the increments are unchanged, i.e., \widehat{\texttt{inc}}=\widetilde{\texttt{inc}}. Then the whole polygonal chain can be recovered from the initial time value t_0 and the tuple sequence (9) via inverse-compression.
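The following sketch illustrates the error-carrying rounding (quantization) and the inverse-compression stitching; the function and variable names are our own, and the guard for a zero rounded length is an implementation detail not spelled out in the text.

import numpy as np

def quantize_lengths(lens_tilde):
    # Error-carrying rounding: the residual of each rounding is folded into
    # the next length so the cumulative length stays close to the original.
    lens_hat, e = [], 0.0
    for l in lens_tilde:
        rounded = int(round(l + e))
        e = l + e - rounded
        lens_hat.append(rounded)
    return lens_hat

def inverse_compress(t0, pieces_hat):
    # Stitch the reconstructed pieces (len_hat, inc_hat) back into a time
    # series starting at the value t0.
    ts = [t0]
    for length, inc in pieces_hat:
        if length <= 0:              # skip degenerate pieces caused by rounding
            continue
        start = ts[-1]
        ts.extend(start + inc * np.arange(1, length + 1) / length)
    return np.array(ts)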

A lower reconstruction error means a higher approximation accuracy. The reconstruction error can be measured by the mean squared error (MSE), given by

(10) \text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(t_i-\widehat{t}_i)^2.

4. Clustering-based digitization

In this section we discuss the two clustering approaches commonly used for ABBA digitization, VQ and GA, on which the two variants ABBA and fABBA essentially rely. Pseudocode for VQ and GA is given in Algorithm 2 and Algorithm 4, respectively. As mentioned above, the symbolic centers are the centers of the clusters, which is key to the inverse symbolization. The concept of starting points sp_i (the outset of each group's formation, elaborated later) is introduced in GA (Chen and Güttel, 2022a), but in fABBA the mean centers \mu_i are preferred over the starting points sp_i in the inverse digitization, in order to obtain an accurate inverse symbolization. Let \mu=\frac{1}{|S|}\sum_{p\in S}p be the mean center of a set S; we can then relate the energy function evaluated at a starting point sp to that evaluated at the mean center \mu:

Lemma 4.1.

Given an arbitrary data point p (possibly a starting point) in a group S with mean center \mu, we have:

(11) \phi(p,S)=\phi(\mu,S)+|S|\,\mathrm{dist}(p,\mu)^2.

By (11), \mu is the unique point that minimizes the energy function \phi(\cdot,S) (Arthur and Vassilvitskii, 2007).
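For completeness, the standard expansion behind (11), writing q-p=(q-\mu)+(\mu-p) and using \sum_{q\in S}(q-\mu)=0, reads

\phi(p,S)=\sum_{q\in S}\|q-p\|^2=\sum_{q\in S}\|q-\mu\|^2+2\Big(\sum_{q\in S}(q-\mu)\Big)^{\!\top}(\mu-p)+|S|\,\|\mu-p\|^2=\phi(\mu,S)+|S|\,\mathrm{dist}(p,\mu)^2.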

The default ABBA digitization uses k-means clustering. fABBA (Chen and Güttel, 2022a) uses GA instead of VQ, which achieves a significant speedup at the cost of a minor loss of approximation accuracy. Both ABBA and fABBA are governed by a digitization hyperparameter: k for ABBA and \alpha for fABBA. The value k determines how many distinct symbols (i.e., clusters) are used for the symbolic representation, while \alpha acts as a tolerance for the greedy data aggregation and thereby determines the number of distinct symbols. As discussed in (Chen and Güttel, 2022a), not every clustering method (see, e.g., BIRCH (Zhang et al., 1996), CLIQUE (Agrawal et al., 1998), spectral clustering (Yu and Shi, 2003), DBSCAN (Ester et al., 1996) and HDBSCAN (Campello et al., 2013)) is suitable for this partitioning. Density-based clustering methods in particular often yield too few symbols to fully reflect the time series patterns, since they suffer from the chaining effect, and they are also less likely to achieve a satisfactory SSE, which leads to high reconstruction error.

The visual difference between the two clustering methods is shown in Figure 1. VQ (achieved by k-means++ throughout the paper) forms groups corresponding to a Voronoi diagram, while GA partitions the 10,000 data points into groups that may overlap. Partitions with overlapping clusters (symbols) inherently model the natural semantic overlap of words in the real world, e.g., “landlady” and “queen” both refer to a woman. Therefore, we believe our joint symbolic representation has promising applications combining time series with natural language processing techniques.

(a) Vector quantization. (b) Greedy aggregation.
Figure 1. 2-dimensional data partitioned by vector quantization (k-means clustering) and by aggregation into 26 groups: the aggregation takes 0.025 seconds while k-means takes 0.18 seconds. The dark points mark the starting points and the centers in the two panels, respectively.

Both VQ and GA can perform clustering-based image segmentation, where the segmentation is obtained by clustering the image's pixels (each pixel represented as a 5-dimensional vector of spatial coordinates and RGB color). Figure 2 shows the segmentation of two images from the COCO dataset (Lin et al., 2014) by VQ and GA using the same number of clusters. GA performs the clustering significantly faster than VQ, and it produces well-separated segments that are closer to human perception, whereas VQ tends toward a Voronoi-style segmentation.

Figure 2. Image segmentation with VQ and GA: the two images are segmented using 2,332 and 678 clusters, respectively.

4.1. Vector quantization

The k-means problem aims to find k clusters of data in d-dimensional space so as to minimize (5). However, solving this problem is NP-hard even when k is restricted to 2 (Drineas et al., 2004; Dasgupta and Freund, 2008) or the data lie in the plane (Mahajan et al., 2012). Typically, a sub-optimal solution of the k-means problem is obtained with Lloyd's algorithm (Lloyd, 1982b). The ABBA software (https://github.com/nla-group/ABBA) performs k-means via the scikit-learn library (Pedregosa et al., 2011), which runs Lloyd's algorithm with D^2 seeding several times (controlled by the parameter n_init, whose default in scikit-learn was 10 before version 1.3.2) and keeps the best result. In the setting of this paper, we found that setting n_init to 1 is good enough for the ABBA performance.

As already mentioned, Lloyd's algorithm is a widely used method for the k-means problem. It starts by uniformly sampling k centers from the data (often referred to as seeding); each point is then allocated to the cluster with the closest center, and the mean centers of the clusters are recomputed. The procedure repeats until the iteration converges.

The seeding has a great impact on the final result. Combining Lloyd's algorithm with the "D^2 weighting" seeding introduced by (Arthur and Vassilvitskii, 2007) significantly improves it; the resulting method is called the k-means++ algorithm and is O(\log k)-competitive with the optimal clustering in expectation.

Algorithm 1 D^2 weighting
1:Input: P=[p_i]_{i=1}^{N}\in\mathbb{R}^{d\times N}, k
2:Initialize the first center c_1 uniformly at random from P
3:Select the next center c_i\in P (i\geq 2) with probability \frac{D(p^{\prime})^2}{\sum_{p\in P}D(p)^2}, p^{\prime}\in P
4:Repeat Step 3 until a total of k centers is chosen
5:Return: c_1,c_2,\ldots,c_k
Algorithm 2 k-means++ algorithm
1:Input: P=[p_i]_{i=1}^{N}\in\mathbb{R}^{d\times N}, k
2:Use Algorithm 1 to select k initial centers c_1,\ldots,c_k from P
3:for c_i\in\{c_1,\ldots,c_k\} do
4:  Compute C_i, the set of points in P for which c_i is the closest center
5:  Update c_i by computing the mean center of cluster C_i, i.e., c_i=\frac{1}{|C_i|}\sum_{p\in C_i}p
6:end for
7:Repeat Steps 3\sim 6 until the maximum number of iterations is reached or the clusters converge
8:Assign each p_i to its closest center c^i with a unique cluster label
9:Return: Assigned points p_1,p_2,\ldots,p_N

ABBA digitization using this clustering method has been shown to be extremely slow, even though the reconstruction error meets the needs of most applications. It is highly desirable to design a faster clustering alternative that retains the original reconstruction error as far as possible. For this reason, we propose a sampling-based k-means clustering algorithm that addresses this concern. The idea is to perform k-means++ on a uniform sample containing only a fraction r of the original data. The algorithm is described in Algorithm 3; Section 6 will demonstrate its performance empirically.

Algorithm 3 Sampling-based k-means algorithm
1:Input: P=[p_i]_{i=1}^{N}\in\mathbb{R}^{d\times N}, k, r
2:Uniformly sample \lfloor r\cdot|P|\rceil points from P as P^{\prime}
3:Use Algorithm 1 to select k initial centers c_1,\ldots,c_k from P^{\prime}
4:for c_i\in\{c_1,\ldots,c_k\} do
5:  Compute C_i, the set of points in P^{\prime} for which c_i is the closest center
6:  Update c_i by computing the mean center of cluster C_i, i.e., c_i=\frac{1}{|C_i|}\sum_{p\in C_i}p
7:end for
8:Repeat Steps 4\sim 7 until the maximum number of iterations is reached or the clusters converge
9:Assign each p_i to its closest center c^i with a unique cluster label
10:Return: Assigned points p_1,p_2,\ldots,p_N
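A compact Python sketch of Algorithm 3 is given below; it relies on scikit-learn's KMeans (which the ABBA software also uses) and on the row-wise data layout scikit-learn expects, and the function name sampling_kmeans is our own.

import numpy as np
from sklearn.cluster import KMeans

def sampling_kmeans(P, k, r=0.5, seed=0):
    # Algorithm 3 sketch: run k-means++ (Lloyd iterations) on a uniform sample
    # containing roughly a fraction r of the pieces, then assign all pieces
    # to the learned centers. P holds one piece per row (N x d).
    rng = np.random.default_rng(seed)
    n_sample = max(k, int(round(r * P.shape[0])))
    idx = rng.choice(P.shape[0], size=n_sample, replace=False)
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed)
    km.fit(P[idx])                        # clustering on the sample only
    labels = km.predict(P)                # nearest-center assignment for all pieces
    return labels, km.cluster_centers_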

Figure 3 shows a simulation of k-means++ and sampling-based k-means (sampling r=10\% of the data) on Gaussian blobs with 10 clusters and increasing data sizes. We observe that sampling-based k-means runs in a fraction of the time of k-means++ while giving competitive performance in terms of adjusted mutual information (AMI).

(a) Runtime. (b) AMI.
Figure 3. Performance comparison of k-means++ and sampling-based k-means.

4.2. Greedy aggregation

Greedy aggregation is introduced in (Chen and Güttel, 2022a); it first sorts the data and then greedily aggregates the points into groups. The sorting order naturally avoids unnecessary computations in the aggregation by triggering an early stopping. The codebook is formed by the mean centers of the resulting groups, providing a suboptimal solution to the k-means problem. Although its accuracy is lower than that of Lloyd's algorithm, it achieves a significant speedup, and its SSE is upper bounded by \alpha^2(N-k) for N data points.

Algorithm 4 Greedy aggregation
1:Input: P=[p_i]_{i=1}^{N}\in\mathbb{R}^{d\times N}, \alpha
2:Sort the data points P such that \varrho_1\leq\varrho_2\leq\cdots\leq\varrho_N
3:Label all sorted data points as “unassigned”
4:Select the first unassigned point p_i as the starting point of a new group, and set j=i+1
5:Check whether early stopping can be triggered by the sorting property; if not, compute \mathrm{dist}(p_i,p_j)
6:if \mathrm{dist}(p_i,p_j)\leq\alpha and p_j is unassigned then
7:  Assign p_j to the same group as p_i
8:end if
9:Increase j by 1 and repeat Steps 5\sim 8 until j>N
10:Repeat Steps 4\sim 9 until there are no unassigned points left
11:Return: Assigned points p_1,p_2,\ldots,p_N

Sorting is essential to the success of the aggregation in our context, since it determines the selection of starting points and the forming of groups, and it even allows unnecessary distance computations to be discarded. A poor sorting results in an inefficient aggregation and a poor SSE. For example, (Chen and Güttel, 2022b) proposes a PCA-based sorting which ensures that the pairwise distance between p_i and p_j is bounded by |\varrho_i-\varrho_j|+2\sigma_2^2, where \sigma_2 is the second largest singular value of the data matrix P. A sketch combining the aggregation with this sorting follows.
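The sketch below illustrates the greedy aggregation with the PCA-based sorting just mentioned; using the first principal component score as the sorting key \varrho and the early-stopping test based on it are one concrete instantiation of Algorithm 4, not a verbatim copy of the fABBA implementation.

import numpy as np

def greedy_aggregate(P, alpha):
    # Greedy aggregation (Algorithm 4) with PCA sorting: points are ordered by
    # their first principal component score and greedily grouped around
    # starting points within radius alpha.
    P = np.asarray(P, dtype=float)                        # N x d
    _, _, Vt = np.linalg.svd(P - P.mean(axis=0), full_matrices=False)
    score = (P - P.mean(axis=0)) @ Vt[0]                  # sorting key rho_i
    order = np.argsort(score)
    labels = -np.ones(len(P), dtype=int)
    group = 0
    for pos, i in enumerate(order):
        if labels[i] >= 0:
            continue
        labels[i] = group                                 # i is a starting point
        for j in order[pos + 1:]:
            # early stopping: |score_j - score_i| <= ||p_j - p_i||, so once the
            # score gap exceeds alpha no later point can join this group
            if score[j] - score[i] > alpha:
                break
            if labels[j] < 0 and np.linalg.norm(P[j] - P[i]) <= alpha:
                labels[j] = group
        group += 1
    return labels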

4.3. Parameter elimination

As mentioned above, digitization aims to partition P as described in (3) into k clusters S_1,\ldots,S_k such that (5) is minimized. Tolerance-oriented digitization enables a natural relationship between the compression tolerance tol and the digitization tolerance \alpha. In this section, we discuss a novel way to eliminate the need to choose a parameter for fABBA digitization. Lemma 4.2 shows that the reconstruction error still pins the start and end of the time series.

Lemma 4.2 ((Elsworth and Güttel, 2020a)).

Mean-based clustering naturally leads to \sum_{i=1}^{N}\widehat{\texttt{inc}}_i=\sum_{i=1}^{N}\texttt{inc}_i.

The proof of Lemma 4.2 is as follows:

\sum_{i=1}^{N}\widehat{\texttt{inc}}_i =\sum_{i=1}^{N}\widetilde{\texttt{inc}}_i =\sum_{i=1}^{k}\sum_{j=1}^{|S_i|}\mu_i^{\texttt{inc}} =\sum_{i=1}^{k}\sum_{j=1}^{|S_i|}\frac{1}{|S_i|}\sum_{\texttt{inc}_l\in S_i}\texttt{inc}_l
=\sum_{i=1}^{k}\frac{1}{|S_i|}\sum_{j=1}^{|S_i|}\sum_{\texttt{inc}_l\in S_i}\texttt{inc}_l =\sum_{i=1}^{k}\frac{1}{|S_i|}|S_i|\sum_{\texttt{inc}_l\in S_i}\texttt{inc}_l
=\sum_{i=1}^{k}\sum_{\texttt{inc}_l\in S_i}\texttt{inc}_l =\sum_{i=1}^{N}\texttt{inc}_i.

Also, we know that t_{i_N}=t_0+\sum_{i=1}^{N}\texttt{inc}_i; hence the reconstruction \widetilde{T} starts and ends at the same values as T, and so does \widehat{T}.

We define the maximum within-cluster variances of the lengths and increments of the pieces, denoted \texttt{Var}_{\texttt{len}} and \texttt{Var}_{\texttt{inc}}, as:

(12) \texttt{Var}_{\texttt{len}}=\max_{i=1,\ldots,k}\frac{1}{|S_i|}\sum_{\texttt{len}\in S_i}|\texttt{len}-\mu_i^{\texttt{len}}|^2,
\texttt{Var}_{\texttt{inc}}=\max_{i=1,\ldots,k}\frac{1}{|S_i|}\sum_{\texttt{inc}\in S_i}|\texttt{inc}-\mu_i^{\texttt{inc}}|^2.

Here we suppose the aggregation is performed on the length and increment values (1-dimensional data) of the pieces simultaneously, which we refer to as hierarchical aggregation, and we denote the digitization tolerances for length and increment by \alpha_{\texttt{len}} and \alpha_{\texttt{inc}}, respectively. Obviously, we have

(13) \max(\texttt{Var}_{\texttt{len}},\texttt{Var}_{\texttt{inc}})\leq\max(\alpha_{\texttt{len}},\alpha_{\texttt{inc}})^2.

Assuming \alpha_{\texttt{len}}=\alpha_{\texttt{inc}}=\alpha, this yields

(14) \max(\texttt{Var}_{\texttt{len}},\texttt{Var}_{\texttt{inc}})\leq\alpha^2.

In the following, we demonstrate that the Brownian bridge property established in (Elsworth and Güttel, 2020a) still holds for time series reconstruction under hierarchical aggregation. Although the length of each piece may not be preserved exactly because of rounding errors, we assume the length of the reconstructed time series equals the original length, i.e., \sum_{i=1}^{N}\widehat{\texttt{len}}_i=\sum_{i=1}^{N}\texttt{len}_i (in practice this holds in most cases, though in some special cases it fails because of rounding). To simplify the modeling and facilitate the analysis, we consider only aggregating the increments and assume each cluster of increments has the same mean length, i.e., \mu_i^{\texttt{len}}=n/N. The local deviations of the increment and length values of \widehat{T} on piece P_\ell from the true increment and length of T are given by

(15) \dot{e}_{\ell,\texttt{len}}:=\widehat{\texttt{len}}_\ell-\texttt{len}_\ell,
\dot{e}_{\ell,\texttt{inc}}:=\widehat{\texttt{inc}}_\ell-\texttt{inc}_\ell,

respectively.

The global (accumulated) incremental errors \ddot{e}_{i_j,\texttt{inc}} are, according to (15), given by:

(16) \ddot{e}_{i_j,\texttt{inc}}:=\widehat{t}_{i_j}-t_{i_j}=\sum_{\ell=1}^{j}\dot{e}_{\ell,\texttt{inc}},\quad j=0,\ldots,N.

We must also account for the error arising from the rounding of the lengths. Similarly, the global (accumulated) length errors can, according to (12) and (13), be bounded as:

(17) \ddot{e}_{i_j,\texttt{len}}:=\sum_{\ell=1}^{j}\dot{e}_{\ell,\texttt{len}}\leq\sum_{\ell=1}^{j}\Big(\max_{i=1,\ldots,k}\frac{1}{|S_i|}\sum_{\texttt{len}\in S_i}|\texttt{len}-\mu_i^{\texttt{len}}|\Big)
\leq\sum_{\ell=1}^{j}\sqrt{\max_{i=1,\ldots,k}\frac{1}{|S_i|}\sum_{\texttt{len}\in S_i}|\texttt{len}-\mu_i^{\texttt{len}}|^2}
\leq j\cdot\alpha,\quad j=0,\ldots,N.

The global error of the reconstructed time series, denoted e_{i_j}, arises from the errors of both the reconstructed lengths and increments. The global error is difficult to determine exactly, since the error caused by the length displacement is hard to quantify, so we consider the approximation:

(18) e_{i_j}\approx\ddot{e}_{i_j,\texttt{len}}\cdot\ddot{e}_{i_j,\texttt{inc}}.

According to (14) and Lemma 4.2, \dot{e}_{\ell,\texttt{inc}} is bounded by \alpha, and \mathbb{E}(\ddot{e}_{i_j,\texttt{inc}})=0 since the deviations are measured from the respective cluster centers. Also, \ddot{e}_{i_0,\texttt{inc}}=\ddot{e}_{i_N,\texttt{inc}}=0 as proved above. Therefore, following (Elsworth and Güttel, 2020a), we can model the incremental errors as a random process whose variance is

\texttt{Var}(\ddot{e}_{i_j,\texttt{inc}})=\alpha^2\cdot\frac{j(N-j)}{N},\quad j=0,\ldots,N.

Following (Elsworth and Güttel, 2020a), e_{i_j} is assumed to stay within \eta standard deviations of its zero mean. That is, we consider the realization

(19) \ddot{e}_{i_j,\texttt{inc}}=\eta\cdot\alpha\cdot\sqrt{\frac{j(N-j)}{N}},\quad j=0,\ldots,N.

Then, with (17) and (19), we have,

e_{i_j}\leq\eta\cdot\alpha^2\cdot\sqrt{\frac{j^3(N-j)}{N}},\quad j=0,\ldots,N.

The process e_{i_j} is modeled as a Brownian bridge following (Elsworth and Güttel, 2020a). Since the interpolated quadratic function on the right-hand side is concave, and given the linear stitching procedure used in the reconstruction, piecewise linear interpolation of the incremental errors from the coarse time grid i_0,i_1,\ldots,i_N to the fine time grid i=0,1,\ldots,n yields

(20) e_i\leq\sqrt{\frac{N}{n}}\,e_{i_j}=\frac{\eta}{n}\cdot\alpha^2\cdot\sqrt{N\cdot i^3\cdot(n-i)},\quad i=0,\ldots,n.

Therefore, the squared Euclidean norm of this fine-grid “worst-case” realization is upper bounded by

\sum_{i=0}^{n}e_i^2\leq\frac{N\cdot\eta^2\cdot\alpha^4}{n^2}\cdot\sum_{i=0}^{n}i^3(n-i)
=\frac{N\cdot\eta^2\cdot\alpha^4}{n^2}\cdot\Big(\frac{1}{30}n-\frac{1}{12}n^3+\frac{1}{20}n^5\Big)
=\frac{N\cdot\eta^2\cdot\alpha^4}{n^2}\cdot\frac{3n^5+2n-5n^3}{60}
=\frac{N(3n^4+2-5n^2)\cdot\eta^2\cdot\alpha^4}{60n}.

It was established in (Elsworth and Güttel, 2020a) that \texttt{euclid}(T,\widetilde{T})^2\leq(n-N)\cdot\texttt{tol}^2, and the bound above is the worst-case realization of the Brownian bridge, which gives a probabilistic bound on the error incurred by digitization. By equating \texttt{euclid}(T,\widetilde{T})^2=\sum_{i=0}^{n}e_i^2, we can choose

(21) \alpha=\sqrt[4]{\frac{60n\cdot(n-N)\cdot\texttt{tol}^2}{N\cdot\eta^2\cdot(3n^4+2-5n^2)}}.

For simplicity we set the hyperparameter controlling the tolerance of the lengths to the same value as that of the increments, i.e., \alpha_{\texttt{len}}=\alpha_{\texttt{inc}}=\alpha. The digitization parameter is thus automatically determined by the compression tolerance, resulting in a parameter-free and error-bounded digitization procedure.
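The resulting rule (21) is straightforward to implement; the sketch below (with the hypothetical function name auto_alpha) turns the compression tolerance tol, the series length n, the number of pieces N, and the chosen number of standard deviations eta into the digitization tolerance.

def auto_alpha(n, N, tol, eta=1.0):
    # Digitization tolerance chosen from the compression tolerance via Eq. (21);
    # eta is the assumed number of standard deviations the Brownian-bridge
    # error stays away from its zero mean.
    return (60.0 * n * (n - N) * tol**2
            / (N * eta**2 * (3.0 * n**4 + 2.0 - 5.0 * n**2))) ** 0.25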

We refer to the procedure detailed above as auto digitization. Alternatively, the method introduced in Section 5 of (Elsworth and Güttel, 2020a) could also be used to obtain an error-bounded fABBA digitization and eliminate the need for tuning \alpha; however, it results in a simple linear relationship between tol and \alpha. The difference is shown in Figure 4, where the method of (Elsworth and Güttel, 2020a) appears as a straight line (green), and our method (orange) can be seen as an improvement for choosing \alpha.

Figure 4. Value of \alpha as tol increases.

5. Joint symbolic approximation

Having discussed the two ABBA variants, we now introduce a joint symbolic aggregate approximation that performs fast ABBA symbolization on multiple time series while retaining symbolic consistency. This joint approximation framework is also applicable to large-scale univariate time series and to multivariate time series.

Ideally, the symbolization of multiple time series should use consistent symbols across all series and as few distinct symbols as possible. One intuitive idea is to fit one (or a given number of) time series and use the resulting symbolic information to transform the rest of the data. However, this does not account for the variety of characteristics in each individual time series and might cause serious information loss in some of them. Hence, we require an approach, namely joint symbolic approximation, that symbolizes multiple time series simultaneously.

The essential idea of joint symbolic approximation is partitional compression. Let \mathcal{T} be a dataset of m time series (if m=1, simply partition the single time series into multiple series). In contrast to the original compression, it proceeds by first computing the compression for each series; all outputs are then concatenated and fed to the digitization, which produces a single symbolic sequence. For multiple time series an additional step is required: the final symbolic sequence is divided so that each partition corresponds to the symbolic representation of one original time series. The algorithm is described in Algorithm 5 (a parallel sketch follows the algorithm). Since there are no dependencies between the compression tasks, this allows for efficient parallel computing. The joint symbolic approximation and its parallel computing paradigm are depicted in Figure 5, and the complete algorithm is given in Algorithm 6. For the inverse symbolization, each time series can be reconstructed from its first value and the pieces recovered by inverse digitization.

The joint symbolic aggregate approximation essentially performs the same steps as the original ABBA method; the major difference is that the compression in ABBA is replaced with partitional compression. For this reason, we refer to the joint symbolic aggregate approximation as JABBA for short. The framework spawns two variants: (1) JABBA (VQ), which performs partitional compression and digitization with k-means clustering; and (2) JABBA (GA), which performs partitional compression and digitization with greedy aggregation. Their performance is evaluated in Section 6.

Algorithm 5 Partitional compression
1:Input: Time series \mathcal{T}, tol, m (optional)
2:if \mathcal{T} is multivariate then
3:  \mathcal{T}^{\prime}=\mathcal{T}
4:  m \leftarrow the number of dimensions of \mathcal{T}
5:else
6:  \mathcal{T}^{\prime}=\{T_1,\ldots,T_m\} \leftarrow partition the time series \mathcal{T} into m segments evenly
7:end if
8:for i=1:m do \triangleright can be done in parallel
9:  P_i \leftarrow compress the time series T_i\in\mathcal{T}^{\prime} according to (2)
10:end for
11:P \leftarrow concatenate \{P_i\}
12:Return: P, m
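Because the per-series compressions are independent, Algorithm 5 maps directly onto a fork-and-join pattern. The sketch below uses Python's concurrent.futures together with the compress function sketched in Section 3.1; both the helper name and the use of a process pool (rather than the threading used in our multithreading test) are illustrative assumptions.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def partitional_compress(series_list, tol, workers=4):
    # Fork-and-join sketch of Algorithm 5: compress each series (or each
    # dimension of a multivariate series) independently and in parallel, then
    # concatenate the pieces for a single joint digitization.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(compress, series_list, [tol] * len(series_list)))
    lengths = [len(p) for p in pieces]    # needed later to split the symbols back
    return np.concatenate(pieces, axis=0), lengths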

As mentioned above, the approach can be applied to datasets containing multiple time series, such as the UCR time series archive (Dau et al., 2019). With consistent symbol information available, text mining and natural language processing techniques become promising tools for time series analysis.

Figure 5. Fork and join model: since there is no dependency among the compression tasks, the parallelism is easily realized with a fork and join model.
Algorithm 6 Joint symbolic aggregate approximation
1:Input: \mathcal{T}, m, tol
2:P, m \leftarrow perform Algorithm 5 on \mathcal{T} \triangleright compute partitional compression
3:Compute the ABBA digitization on P
4:Partition the symbolic sequence into m subsequences and assign them to the corresponding dimensions, respectively
5:Return: Symbolic representation

6. Empirical Results

In this section, we focus on experiments measuring the runtime and reconstruction error of the symbolic representations. We conduct extensive experiments on the well-established UEA Archive (Bagnall et al., 2018) and, for the multithreading test, on synthetic Gaussian noise. For simplicity and efficiency, we select competing algorithms with publicly available software (available at https://github.com/nla-group/ABBA and https://github.com/nla-group/fABBA).

6.1. Multivariate time series test

The UEA Archive contains 30 multivariate time series datasets with a variety of dimensions and lengths. The datasets are very large, so it is inefficient for the original ABBA and its variant fABBA to process them one series at a time; we therefore select a subset of UEA datasets for the test, summarized in Table 2. Although ABBA and fABBA cannot symbolize each dimension of a multivariate time series with unified symbols, we still use them for benchmarking without requiring symbolic consistency. In order to unify their compressed time series pieces P as specified in (3), we use partitional compression for all four competing methods and reassign the subset of the output corresponding to each dimension to ABBA and fABBA. We start by performing the partitional compression described in Algorithm 5 with a tol of 0.01; for the two JABBA variants we then perform the digitization all at once, whereas for ABBA and fABBA we perform the digitization on the time series pieces of each dimension of the multivariate time series one at a time, and we record the runtime of the digitization in each case.

It’s known that the more symbols are used the more accurate the reconstruction of the representation is. In order to use roughly the same number of symbols for each method as much as possible, we first perform JABBA (GA) digitization using (21) to confirm an α\alpha value, and a total number of symbols used for the multivariate time series, denoted by kmk_{m}, then we feed the number of symbols kmk_{m} for JABBA (VQ) digitization (see Algorithm 3, we set rr to 0.5, same in the following). Then we feed the same α\alpha value to fABBA digitization and use km/dk_{m}/d for ABBA where dd is the dimension of the multivariate time series. As a consequence, the number of symbols used will be unified for ABBA, JABBA (VQ), and JABBA (GA), but not guaranteed for fABBA since its digitization is tolerance-oriented.

Table 3 shows the average MSE, dynamic time warping (DTW) distance, digitization runtime, and number of symbols used for each dataset; the compression tolerance tol used for each dataset is also given. JABBA (GA) shows a speedup over fABBA by an order of magnitude even though both use the same GA-based digitization. The speedup of JABBA (VQ) over ABBA is also remarkable, while the reconstruction error of JABBA (VQ) is lower than that of ABBA on five out of the eight datasets even though sampling-based k-means is employed. Additionally, an example of the reconstruction from the symbolic representation of a multivariate time series is presented in Figure 6, which shows 4 of the 61 dimensions.

Table 2. Selected multivariate time series datasets in UEA Archive.
Dataset Size Dimension Length
AtrialFibrillation 30 2 640
BasicMotions 80 6 100
CharacterTrajectories 2,858 3 182
Epilepsy 275 3 206
Heartbeat 409 61 405
NATOPS 360 24 51
StandWalkJump 27 4 2,500
UWaveGestureLibrary 440 3 315
Table 3. Results on selected UEA multivariate time series datasets (all values are given to 2 significant digits; the best results for MSE, DTW, and runtime are marked in boldface).
Dataset Metric ABBA fABBA JABBA (VQ) JABBA (GA)
MSE 6.7 69 9.7 63
AtrialFibrillation DTW 880 52,000 1,600 18,000
(tol = 0.01) Runtime 160 15 23 2.6
Symbols 21 550 21 21
BasicMotions MSE 22 17 14 33
DTW 710 920 690 1,800
(tol = 0.01) Runtime 100 14 13 2.7
Symbols 17 460 18 18
CharacterTrajectories MSE 3.5 3.5 6.9 17
DTW 540 350 650 1,600
(tol = 0.01) Runtime 24 3.3 4.8 1.7
Symbols 3.1 47 3.5 3.5
Epilepsy MSE 20 50 15 86
DTW 1,700 20,000 1,400 13,000
(tol = 0.01) Runtime 83 12 31 2.2
Symbols 14 480 14 14
MSE 5.2 0.0017 1.8 0.017
Heartbeat DTW 1,400 0.69 430 6.8
(tol = 0.0001) Runtime 17,000 1,500 3,500 110
Symbols 2,000 23,000 2,000 2,000
NATOPS MSE 38 18 9.1 28
DTW 1,700 270 150 430
(tol = 0.01) Runtime 110 28 100 2.5
Symbols 24 450 23 23
StandWalkJump MSE 2.2 8.7 3.7 5.7
DTW 550 5,100 730 1,200
(tol = 0.005) Runtime 900 60 55 11
Symbols 190 1,900 190 190
UWaveGestureLibrary MSE 3.1 2 3 8
DTW 350 180 340 1,100
(tol = 0.01) Runtime 37 3.8 5.7 1.9
Symbols 5.4 52 5.4 5.4
Figure 6. Reconstruction of symbolic representation for Heartbeat.

6.2. Multithreading simulation

In this experiment, we compare ABBA, fABBA, JABBA (GA), and JABBA (VQ) on synthetic Gaussian noise series in terms of runtime and reconstruction accuracy with varying numbers of time series partitions. The reconstruction accuracy is measured by MSE.

We used Gaussian noise as the benchmark time series. The data generated for the test are of length 100,000 with zero mean and unit standard deviation. We first ran fABBA with tol=0.01 and \alpha=0.05 to compute the number of symbols it used; this simulation used 358 symbols. Second, we ran ABBA by feeding it the same number of symbols, k=358. After that, we ran JABBA (VQ) and JABBA (GA) with varying numbers of partitions, using the same tol and a consistent hyperparameter setting for digitization, i.e., k=358 and \alpha=0.05, respectively. The number of threads scheduled for JABBA is set equal to the partition number. The result shows the compression rate \tau_c computed for ABBA, fABBA, JABBA (GA), and JABBA (VQ) are , respectively.

The experimental results are exhibited in Figure 7 and Figure 8. We mainly compare the methods that share the same digitization technique. There is an obvious negative correlation between reconstruction error and the number of partitions, which can be explained by the increasing number of partition points that are used for reconstruction. Figure 7 also shows that JABBA (VQ), which uses sampling-based k-means, achieves similar MSE to ABBA while being faster by orders of magnitude; a similar result applies to JABBA (GA) and fABBA.

Figure 7. MSE of JABBA with varying number of partitions (the black line marks the result of fABBA).
Figure 8. Runtime of JABBA with varying number of partitions (the black line marks the result of fABBA).

The parallel speedup with m processors is given by

\Phi(m)=\frac{\upsilon(1)}{\upsilon(m)},

where \upsilon(m) denotes the runtime with m processors. Without loss of generality, we only evaluate the parallel speedup of JABBA (GA), as shown in Figure 9. The speedup \Phi(m) scales almost linearly with the number of threads m. Our algorithm is only partially parallel, since the sequential part of the algorithm, namely the digitization, cannot be parallelized. This behavior is naturally explained by Amdahl's law, which gives the theoretical speedup at a fixed workload and the limit on the benefit one can derive from parallelizing a computation (see the display below).
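As a reminder of the bound referred to above, if a fraction f of the runtime is parallelizable (here, the compression) and the remainder (the digitization) is sequential, Amdahl's law states

\Phi(m)\leq\frac{1}{(1-f)+f/m}\;\xrightarrow{\;m\to\infty\;}\;\frac{1}{1-f},

which explains why the observed speedup is ultimately limited by the sequential digitization even as the number of threads grows.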

Figure 9. Speedup.

7. Summary and future work

The existing ABBA methods cannot maintain consistency of symbols across multiple time series and are inherently sequential, so it is unclear how to leverage extra computational power such as multithreading. In this paper, we introduce a joint symbolic approximation method that improves the speed of ABBA symbolization and achieves symbolic consistency in each representation. The framework of joint symbolic approximation enables parallel computing for further speedup. Owing to the symbolic consistency, natural language processing and text mining techniques become applicable to time series. The convergence analysis of our proposed sampling-based k-means method is left as future work.

References

  • Agrawal et al. (1998) Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the ACM SIGMOD International Conference on Management of Data 27 (1998), 94–105.
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. 2007. k-means++: the advantages of careful seeding. In SODA ’07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027–1035.
  • Bagnall et al. (2018) Anthony J. Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn J. Keogh. 2018. The UEA multivariate time series classification archive. CoRR (2018).
  • Bountrogiannis et al. (2023) Konstantinos Bountrogiannis, George Tzagkarakis, and Panagiotis Tsakalides. 2023. Distribution Agnostic Symbolic Representations for Time Series Dimensionality Reduction and Online Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering 35, 6 (2023), 5752–5766.
  • Campello et al. (2013) Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining. Springer, 160–172.
  • Chen and Güttel (2022a) Xinye Chen and Stefan Güttel. 2022a. An Efficient Aggregation Method for the Symbolic Representation of Temporal Data. ACM Transactions on Knowledge Discovery from Data (2022).
  • Chen and Güttel (2022b) Xinye Chen and Stefan Güttel. 2022b. Fast and explainable clustering based on sorting. (2022), 25. arXiv:2202.01456
  • Dasgupta and Freund (2008) Sanjoy Dasgupta and Yoav Freund. 2008. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (STOC ’08). ACM, 537–546.
  • Dasgupta and Freund (2009) Sanjoy Dasgupta and Yoav Freund. 2009. Random Projection Trees for Vector Quantization. IEEE Transactions on Information Theory 55, 7 (2009), 3229–3242.
  • Dau et al. (2019) Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. 2019. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6, 6 (2019), 1293–1305.
  • Drineas et al. (2004) P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. 2004. Clustering large graphs via the singular value decomposition. Machine Learning 56, 1–3 (2004), 9–33.
  • Elsworth and Güttel (2020a) Steven Elsworth and Stefan Güttel. 2020a. ABBA: adaptive Brownian bridge-based symbolic aggregation of time series. Data Mining and Knowledge Discovery 34 (2020), 1175–1200.
  • Elsworth and Güttel (2020b) Steven Elsworth and Stefan Güttel. 2020b. Time series forecasting using LSTM networks: A symbolic approach. (2020), 12. arXiv:2003.05672
  • Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226–231.
  • Gao and Lin (2019) Yifeng Gao and Jessica Lin. 2019. Discovering Subdimensional Motifs of Different Lengths in Large-Scale Multivariate Time Series. In IEEE International Conference on Data Mining. 220–229.
  • Gray (1984) Robert Gray. 1984. Vector quantization. IEEE ASSP Magazine 1, 2 (1984), 4–29.
  • Keogh et al. (2005) E. Keogh, J. Lin, and A. Fu. 2005. HOT SAX: efficiently finding the most unusual time series subsequence. In IEEE International Conference on Data Mining (ICDM’05). 1–8.
  • Keogh et al. (2002) Eamonn Keogh, Stefano Lonardi, and Bill ’Yuan-chi’ Chiu. 2002. Finding Surprising Patterns in a Time Series Database in Linear Time and Space. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02). ACM, 550–556.
  • Li and Lin (2017) Xiaosheng Li and Jessica Lin. 2017. Linear Time Complexity Time Series Classification with Bag-of-Pattern-Features. In IEEE International Conference on Data Mining. 277–286.
  • Li et al. (2021) Xiaosheng Li, Jessica Lin, and Liang Zhao. 2021. Time Series Clustering in Linear Time Complexity. Data Mining and Knowledge Discovery 35, 6 (2021), 2369–2388.
  • Li and Lin (2010) Yuan Li and Jessica Lin. 2010. Approximate Variable-Length Time Series Motif Discovery Using Grammar Inference. In Proceedings of the 10th International Workshop on Multimedia Data Mining (MDMKDD ’10). ACM, 9.
  • Lin et al. (2003) Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 2–11.
  • Lin et al. (2007) Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 2 (2007), 107–144.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. European Conference on Computer Vision (2014), 740–755.
  • Lkhagva et al. (2006) B. Lkhagva, Yu Suzuki, and K. Kawagoe. 2006. New Time Series Data Representation ESAX for Financial Applications. In 22nd International Conference on Data Engineering Workshops (ICDEW’06). x115–x115. https://doi.org/10.1109/ICDEW.2006.99
  • Lloyd (1982a) S. Lloyd. 1982a. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137.
  • Lloyd (1982b) Stuart P. Lloyd. 1982b. Least squares quantization in PCM. Transactions on Information Theory 28 (1982), 129–137.
  • Mahajan et al. (2012) Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. 2012. The planar k-means problem is NP-hard. Theoretical Computer Science 442 (2012), 13–21. Special Issue on the Workshop on Algorithms and Computation (WALCOM 2009).
  • Malinowski et al. (2013) Simon Malinowski, Thomas Guyet, René Quiniou, and Romain Tavenard. 2013. 1d-SAX: A Novel Symbolic Representation for Time Series. In Advances in Intelligent Data Analysis XII.
  • Nguyen and Ifrim (2023) Thach Le Nguyen and Georgiana Ifrim. 2023. Fast Time Series Classification with Random Symbolic Subsequences. In Advanced Analytics and Learning on Temporal Data: 7th ECML PKDD Workshop, AALTD 2022, Grenoble, France, September 19–23, 2022, Revised Selected Papers. Springer, 50–65.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Ruta et al. (2019) N. Ruta, N. Sawada, K. McKeough, M. Behrisch, and J. Beyer. 2019. SAX Navigator: Time Series Exploration through Hierarchical Clustering. In 2019 IEEE Visualization Conference. IEEE, 236–240.
  • Senin et al. (2015) Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedihardjo, Crystal Chen, and Susan Frankenstein. 2015. Time series anomaly discovery with grammar-based compression.. In 18th International Conference on Extending Database Technology. OpenProceedings.org, 481–492.
  • Senin and Malinchik (2013) Pavel Senin and Sergey Malinchik. 2013. SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. In IEEE International Conference on Data Mining. 1175–1180.
  • Yu and Shi (2003) Stella X. Yu and Jianbo Shi. 2003. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Vol. 2. IEEE, 313.
  • Yu et al. (2023) Yuncong Yu, Tim Becker, Le Minh Trinh, and Michael Behrisch. 2023. SAXRegEx: Multivariate time series pattern search with symbolic representation, regular expression, and query expansion. Computers & Graphics 112 (2023), 13–21.
  • Zhang et al. (2017) Shengdong Zhang, Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. 2017. Deep learning on symbolic representations for large-scale heterogeneous time-series event prediction. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. 5970–5974.
  • Zhang et al. (1996) Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 103–114.