k-MLE, k-Bregman, k-VARs:
Theory, Convergence, Computation

Zuogong Yue^1,2, and Victor Solo^2∗ ¹Z. Yue is with the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, Hubei 430074 CHINA²Z. Yue was, and V. Solo is, with School of Electrical Engineering and Telecommunications, University of New South Wales, Kensington, NSW 2052 AUSTRALIA^∗For correspondence, v.solo@unsw.edu.au

Abstract

In this paper we develop a general approach to hard clustering which we call k-MLE and provide a general convergence result for it. Unlike other hard clustering generalizations of k-means, which are based on distance or divergence, k-MLE is based on likelihood and thus has a far greater range of application. We show that ‘k-Bregman’ clustering is a special case of k-MLE and thus provide, for the first time, a complete proof of convergence for k-Bregman clustering. We give a further application: k-VARs for clustering vector autocorrelated/autoregressive time series. It does not admit a Bregman divergence interpretation. We provide simulations and real data application as well as convergence results.

Index Terms:

Time-series clustering, multivariate time series, k-MLE, k-means, k-Bregman.

I Introduction

Clustering is a method for partitioning a collection of data vectors into groups (called clusters). It is a common technique for data mining and exploratory statistical analysis. It has been widely used in many fields, for instance, pattern recognition[1], image processing[2], computer vision[3], bioinformatics[4], and etc. The study of cluster analysis has a long history, going back to anthropology in the 1930s [5].

k-MLE. There are a wide range of clustering methods [1],[6],[7, 8] with new ones still emerging e.g. [9],[10]. There are two types: hard clustering methods and soft clustering methods. In hard clustering the groups are completely separate. In soft clustering the groups may overlap somewhat. So far hard clustering algorithms are all based on similarity measures, typically a distance or pseudo-distance [11]. Soft clustering methods are constructed in three ways: (i) based on likelihood via mixture models [1] fitted with the EM algorithm; (ii) based on fuzzy methods [12]; (iii) based on regularization e.g., with the entropy penalty [13]. In this paper we develop a new and general approach to hard clustering but based on the ‘classification’ likelihood [14] not similarity measures. We prove a general convergence result. And we show that a class of existing similarity/divergence methods [11] form a special case which we rename as k-Bregman. As an application we develop a new method (k-VARs) for clustering autocorrelated time series.

Convergence. For the k-means algorithm with general divergence measures, the definitive work on convergence is the seminal paper of [15]. The much later independent work of [16], while useful, is much less general. We will also discuss the work of [11] which also includes convergence analyses for k-Bregman, which we show are incomplete.

Model Selection. The problem of choosing the number of clusters is a challenging problem which garners ongoing interest. Important methods include BIC [17] and the gap method [18]. Since k-MLE is based on likelihood, it is straightforward to develop a BIC criterion for joint choice of model order and number of clusters. This is illustrated with k-VARs.

Time Series. In many applications the data to be clustered are autocorrelated time series, e.g., stock prices, locations of mobile robots, speech signals, ECG, etc. Statistically efficient clustering of such temporally autocorrelated data demands an integration of time-series structure into the clustering methodology. There is already a significant literature on extending clustering algorithms for time series, e.g., k-means integrated with DTW barycenter averaging (k-DBA) [19], clustering via ARMA mixtures[20], k-Spectral Centroid (k-SC) algorithm [21], k-Shape [22, 23], shapelet based methods [24, 25], deep learning based methods [26], etc. Some methods are only available for univariate time series. However some of these ‘time series’ clustering algorithms ignore the autocorrelation feature and so perform poorly in practice, when it is usually present.

The remainder of the paper is organized as follows. The derivation of k-MLE is presented in section II. In section III a convergence analysis is given. In section IV, k-MLE is applied to clustering autoregressive time series yielding the k-VARs algorithm. The section also covers computation, convergence and model selection. Application to data and simulations are in section V. Section VI contains conclusions. There is one appendix containing some proofs.

II k-MLE: A General Tool for Clustering

II-A Background

Given independently distributed (i.d.) data vectors (of dimension $d$ ) $x_{n},n=1,\dots,N$ , each belonging to one of $K$ clusters, $C_{1},\dots,C_{K}$ , we introduce the label matrix or binary cluster membership array $\tau=[\tau_{n,k}]_{N\times K}$

\tau_{n,k}=\begin{cases}1&\text{if }x_{n}\in\text{cluster }k,\\ 0&\text{otherwise}.\end{cases}

(1)

The clustering problem is to estimate $\tau_{N\times K}$ from the $d\times N$ data matrix $X_{1}^{N}=[x_{1},\cdots,x_{N}]$ . Note that there are only finitely many $\tau$ ’s. We denote the $k^{th}$ column of $\tau$ by $\tau_{k}$ .

Clustering algorithms are often categorised in various ways: partition methods versus hierarchical methods; hard clustering, where each data vector is assigned to only one cluster, versus soft (a.k.a. fuzzy) clustering, where each data vector may belong to more than one cluster; similarity based versus model based. Here we introduce a new division: between deterministic label clustering (DL-C) and stochastic label clustering (SL-C). DL-C treats the membership variables as deterministic, whereas SL-C treats them as random. We note that SL-C always produces soft clustering whereas DL-C can produce either hard clustering or soft clustering.

The classic DL-C method is k-means[1], where clustering is achieved by minimizing the criterion

J=J(\tau,\mu)=\textstyle\sum_{n=1}^{N}\sum_{k=1}^{K}\tau_{n,k}d(x_{n},\mu_{k}).

w.r.t. $\tau,\mu$ . Here $\mu_{k}$ is a cluster centre and $d(x,\mu)$ is a distance or similarity measure, i.e. it has two properties: (i) positivity $d(x,\mu)\geq 0$ ; (ii) uniqueness $d(x,\mu)=0$ iff $x=\mu$ . Sometimes symmetry is useful: (iii) $d(x,\mu)=d(\mu,x)$ . k-means uses Euclidean distance but other distance measures are also used, e.g. k-medoids [1].

The classic SL-C method is based on a mixture model [1] in which the $\tau_{nk}$ are independently distributed with $P(\tau_{nk}=1)=\pi_{k}$ .

Historically almost all hard clustering methods have been based on similarity measures whereas almost all soft clustering methods are based on models. This seems to have fostered the impression that hard clustering methods could not be model based. But we will see that k-MLE is a DL-C method but is model based. This suggests that a deeper division is between DL and SL rather than between similarity based and model based methods. This paper concentrates on DL methods.

II-B k-MLE Formulation

Suppose that $x_{n}$ are i.d. with those in cluster $k$ having density $p(x_{n};\theta_{k})$ where $\theta_{k}\in\Omega$ is $q$ -dimensional. Then the joint density or likelihood of the data in cluster $k$ is $\prod_{n=1}^{N}\left[p(x_{n};\theta_{k})\right]^{\tau_{nk}}$ with corresponding log-likelihood

\textstyle\lambda(\tau_{k},\theta_{k})=\sum_{n=1}^{N}\tau_{n,k}\ell(x_{n},\theta_{k})=\sum_{n\in C_{k}}\ell(x_{n},\theta_{k}),

(2)

where $\ell(x_{n},\theta_{k})=\mathrm{ln}\,p(x_{n};\theta_{k})$ . Denoting $\Theta=[\theta_{1},\dots,\theta_{K}]$ , the joint log-likelihood of all the data is then

\textstyle\mathcal{L}(\tau,\Theta)=\sum_{k=1}^{K}\lambda(\tau_{k},\theta_{k})=\sum_{n=1}^{N}\sum_{k=1}^{K}\tau_{n,k}\ell(x_{n},\theta_{k}).

(3)

We call this the deterministic label clustering likelihood (DL-CL). The corresponding clustering likelihood for a SL approach would be called a stochastic label clustering likelihood (SL-CL).

We now note two important properties of the DL-CL which now follow immediately.

•

Property P1: $\mathcal{L}(\tau,\Theta)$ is linear in $\tau$ .
•

Property P2: $\mathcal{L}(\tau,\Theta)$ is separable in $(\tau,\Theta)$ .

Separability means that e.g. maximizing over $\Theta$ can be done by separately maximizing each cluster-specific likelihood.

The DL-CL depends on binary and analog (i.e. continuous valued) parameters. This leads to a hybrid maximum likelihood estimation (MLE) problem as follows.

Definition 1 (k-MLE problem).

\begin{array}[]{r@{\;}l}&\underset{\Theta\in\Omega^{\times},\tau\in\mathcal{B}}{\mathrm{maximize}}\;\mathcal{L}(\tau,\Theta)\\ &\mathcal{B}=\big{\{}\begin{array}[t]{@{}r@{\;}l}\tau:&\sum_{k=1}^{K}\tau_{n,k}=1,\tau_{n,k}=0\text{ or }1,\\ &1\leq k\leq K,1\leq n\leq N\big{\}},\end{array}\end{array}

(4)

where $\Theta\in\Omega^{\times}$ means $\theta_{k}\in\Omega$ , $1\leq k\leq K$ .

For this optimization to be well defined we need to introduce some assumptions.

Assumption 1.

(a)

$\Omega$ is an open subset of $\mathbb{R}^{q}$ with boundary $\partial\Omega$ . The closure $\Omega\cup\partial\Omega$ is bounded (which thus means the closure is compact).
(b)

$\ell(x_{n},\theta)$ is continuous for $\theta\in\Omega$ .

Note that (a) and (b) ensure that $\ell(x_{n},\theta)$ is bounded. Since $\mathcal{B}$ is a discrete, bounded set, then under Assumption 1, the k-MLE problem has at least one solution.

The k-MLE algorithm solves the hybrid k-MLE problem by cyclic ascent¹¹1Cyclic optimization is such a simple idea that it is repeatedly reinvented and renamed as if it was new. (a.k.a. coordinate ascent[27]).

Definition 2 (k-MLE algorithm).

Denote the iteration index by $m$ . We assume that initial parameter estimates $\theta_{k}^{(0)}$ are available for each cluster.

•

$\tau$ -step: given $\tau^{(m)}$ and $\Theta=\Theta^{(m)}$ get $\tau^{(m+1)}$ . Since $\Theta$ is given, we need only maximize, for each $n$ , $\sum_{k=1}^{K}\tau_{n,k}\ell(x_{n},\theta_{k})$ w.r.t. $\tau_{n}=[\tau_{n,1},\dots,\tau_{n,K}]^{T}$ . Denote by $k_{n}^{*}$ , the index $k$ for which the log-likelihood $\ell(x_{n},\theta_{k})$ is the largest. Then

$\tau_{n,k}=\begin{cases}1&\text{if }k=k_{n}^{*},\\ 0&\text{otherwise},\end{cases}$ (5)

and we set $\tau^{m+1}=[\tau_{n,k}]_{N\times K}$ . If $\mathcal{L}(\tau^{(m+1)},\Theta^{(m)})=\mathcal{L}(\tau^{(m)},\Theta^{(m)})$ , stop.
•

$\Theta$ -step: given $\Theta^{(m)},\tau=\tau^{(m+1)}$ get $\Theta^{(m+1)}$ .

$\theta_{k}^{(m+1)}=\underset{\theta\in\Omega}{\operatorname{arg\,max}}\>\lambda(\tau_{k},\theta),\quad 1\leq k\leq K.$ (6)

This just entails solving an MLE problem for each cluster, using the data in that cluster. If $\mathcal{L}(\tau^{(m+1)},\Theta^{(m+1)})=\mathcal{L}(\tau^{(m+1)},\Theta^{(m)})$ , stop.

Notice there is no requirement that the log-density is a similarity measure. So this provides a considerable generalization of DL algorithms such as k-means.

After the second author had developed this method we discovered it is not new. In [14] the DL-CL likelihood appears under the name ‘classification likelihood’. But it is only presented in the context of a multivariate Gaussian density. When the densities are multivariate spherical Gaussian then k-MLE reduces to k-means. No algorithm details are given in [14].

The DL-CL has also been developed (in a different direction) by [28] under the name ‘model-based k-means’. It is derived heuristically from a SL-CL and an algorithm is given without recognizing that it is cyclic ascent. There is no convergence analysis and no tuning parameter selection. [28] attributes the method to [29],[30]. The first reference only deals with two clusters and has no convergence analysis or tuning parameter selection. The second reference fits a mixture model with an EM algorithm and so is not a DL-C method.

II-C k-Bregman

In a seminal piece of work [11] showed ‘there exists a unique Bregman divergence corresponding to every regular exponential family’. Then, using a notion of Bregman information, they developed a hard clustering algorithm based on Bregman divergence. It is not noted in [11] but their algorithm is a coordinate descent algorithm [27]. Further [11] provided a convergence result, which we will see below leaves major questions unanswered.

We now show that the Bregman hard clustering algorithm for exponential families is a special case of k-MLE; we henceforth refer to it as k-Bregman. It is shown in [11] that if $p(x;\theta)$ is a regular exponential family²²2For expontial families in general, the same result is remarked in [31]. it has a unique decomposition

\ln p(x;\theta)=-d_{\phi}(x;\mu(\theta))+\ln(b_{\phi}(x))

(7)

where $d_{\phi}(x;\mu(\theta))\geq 0$ is the Bregman divergence, $\phi$ is the conjugate function of $\psi$ , $\mu(\theta)=\nabla\psi(\theta)$ is the expectation parameter corresponding to $\theta$ , and the reminder term $b_{\phi}(x)$ does not depend on the parameter $\theta$ and is a normalization factor for the probability density. Thus the log-likelihood for the whole data can be written as

	$\displaystyle\mathcal{L}(\tau,\Theta)$	$\displaystyle=\textstyle-\sum_{n=1}^{N}\sum_{k=1}^{K}\tau_{n,k}d_{\phi}(x_{n};\theta_{k})+\zeta(X_{1}^{n}),$		(8)
	$\displaystyle\zeta(X_{1}^{n})$	$\displaystyle=\textstyle\sum_{n=1}^{N}\sum_{k=1}^{K}\tau_{n,k}\ln(b_{\phi}(x_{n})).$		(9)

However, by the property of $\tau_{n,k}$ , the remainder term $\zeta(X_{1}^{n})$ reduces to $\sum_{n=1}^{N}\ln(b_{\phi}(x_{n}))$ and can be dropped. Therefore for an exponential family k-MLE reduces to the k-Bregman similarity based method.

II-D Applications

The possibilities for k-MLE are extensive. Below we develop a new algorithm for clustering vector time series which is not similarity based; we also prove its convergence. We previously developed a scalar version [32] by taking a limit in an AR mixture model. That limiting argument can be extended to the vector case but yields a different algorithm which will be discussed elsewhere.

More generally, k-MLE can handle any data type e.g. multi-modal data, hybrid data with both continuous valued and discrete valued measurements; spatio-temporal data; temporal data sampled on mixed time-scales; multi-subject data and so on.

III k-MLE Convergence

This section develops a convergence analysis of the k-MLE algorithm. Our approach builds on the seminal work of [15]. Although [15] uses dissimilarity where we use log-likelihood, large parts of their development make no use of the dissimilarity properties. We also note the work of [16] (independent of [15]) which, while providing valuable additional insight on k-means, is much less general in scope and results than [15].

Note that [15] deals with minima whereas we deal with maxima. This is a trivial difference. However partly because of our likelihood framework, we need to rework some of the internal proof developments in [15]. Because there is only partial novelty here, we put most of those proofs in the appendix. However we also need non-trivial new results involving new conditions.

Let us note immediately from the construction of the k-MLE algorithm that

\begin{array}[]{r@{\;}l}\mathcal{L}^{(m+1)}&=\mathcal{L}(\tau^{(m+1)},\Theta^{(m+1)})\geq\mathcal{L}(\tau^{(m+1)},\Theta^{(m)})\\ &\geq\mathcal{L}(\tau^{(m)},\Theta^{(m)})=\mathcal{L}^{(m)}.\end{array}

This ascent property, i.e. non-decrease of the log-likelihood, is necessary but far from sufficient for convergence of cyclic ascent. If the log-likelihood is bounded then the ascent property ensures, via the monotonicity, that the sequence $\mathcal{L}^{(m)}$ has a limit point to which it converges. But this criterion convergence does not imply parameter convergence i.e. that $\tau^{(m)},\Theta^{(m)}$ have limit points or converge to any of them if they exist. Unfortunately both [28] and [11] mistakenly assert that criterion convergence implies parameter convergence. In fact the major/hard part of a proof of convergence is getting from criterion convergence to parameter convergence as we now show.

The parameter convergence analysis is in three parts. In the first part we show convergence in a finite number of steps to what we (c.f. [15]) call a partial maximum. In the second part further study is made to find conditions under which a partial maximum is a local maximum. In the third part this equivalence is established under MLE uniqueness.

III-A Convergence to Partial Maxima

It turns out to be convenient to consider a ‘relaxed’ or purely analog version of the k-MLE problem in which the membership variables are analog lying in $[0,1]$ . This is formalised by introducing an analog constraint set.

Definition 3 (analog constraint set).

\begin{array}[]{r@{\;}l}\mathcal{A}=\big{\{}\tau:&\sum_{k=1}^{K}\tau_{n,k}=1,\;0\leq\tau_{n,k}\leq 1,\\ &1\leq k\leq K,1\leq n\leq N\big{\}}.\end{array}

(10)

Note that $\mathcal{A}$ is bounded and $\mathcal{L}(\tau,\Theta)$ is well defined and bounded for $\tau\in\mathcal{A},\Theta\in\Omega^{\times}$ . We can thus introduce the:

Definition 4 (concentrated³³3This is standard statistical and econometric terminology. log-likelihood).

F(\tau)=\underset{\Theta\in\Omega^{\times}}{\mathrm{max}}\;\mathcal{L}(\tau,\Theta)\quad\text{for }\tau\in\mathcal{A}.

(11)

Lemma 1 ([15]).

$F(\tau)$ is convex.

Proof.

See Appendix A-A. ∎

Lemma 2 (Theorem 2[15]).

The extreme points of $\mathcal{A}$ satisfy the binary constraints $\mathcal{B}$ of the k-MLE problem.

Proof.

Elementary. ∎

We can now introduce the concentrated k-MLE problem:

Definition 5 (k-MLE-c problem).

\underset{\tau\in\mathcal{A}}{\mathrm{max}}\;F(\tau).

(12)

Proposition 3 ([15]).

The k-MLE-c problem and the k-MLE problem have the same solution set.

Proof.

See Appendix A-B. ∎

Definition 6 (Partial Maximum [15]).

A point $(\tau^{*},\Theta^{*})$ is called a partial maximum for the k-MLE problem if it satisfies

\begin{array}[]{r@{\;}l}\mathcal{L}(\tau^{*},\Theta^{*})&\geq\mathcal{L}(\tau,\Theta^{*})\quad\text{for all }\tau\in\mathcal{A},\text{ and }\\ \mathcal{L}(\tau^{*},\Theta^{*})&\geq\mathcal{L}(\tau^{*},\Theta)\quad\text{for all }\Theta\in\Omega^{\times}.\end{array}

Theorem 4.

The iterates of the k-MLE algorithm converge to a partial maximum for the k-MLE problem in a finite number of steps.

Proof.

See Appendix A-C. ∎

While Theorem 4 is encouraging, the problem is that, a partial maximum may not be a local maximum; [15] gives an example. We discuss this further in detail in the next section where we develop conditions for local maxima.

The papers [11] and [28] claim to show convergence to a local optimum, but what they show is that the algorithm stops in a finite number of steps. They do not obtain Theorem 4 because they do not have the notion of partial maximum which is a prerequisite for a local maximum.

III-B Convergence to Local Maxima

We introduce the

Definition 7 (maximizing solution set).

M(\tau)=\big{\{}\Theta_{\tau}\in\Omega^{\times}:\Theta_{\tau}=\underset{\Theta\in\Omega}{\mathrm{arg\,max}}\;\mathcal{L}(\tau,\Theta),\tau\in\mathcal{B}\big{\}}.

(13)

Since there are finitely many $\tau$ ’s, there are finitely many $M(\tau)$ ’s. Next, recalling Property P2, for fixed $\tau$ , $\mathcal{L}(\tau,\Theta)$ is separable. Thus

\begin{array}[]{r@{\;}l}M(\tau)&=\big{\{}\Theta_{\tau}=[\theta_{1,\tau_{1}},\dots,\theta_{K,\tau_{K}}]\big{\}}\\ &=M_{1}(\tau_{1})\times\cdots\times M_{K}(\tau_{K})\\ M_{k}(\tau_{k})&=\big{\{}\theta_{k,\tau_{k}}\in\Omega:\theta_{k,\tau_{k}}=\underset{\theta\in\Omega}{\mathrm{arg\,max}}\;\lambda(\tau_{k},\theta)\big{\}},\end{array}

(14)

where the operator “ $\times$ ” is the Cartesian product. We say that $M(\tau)$ is a singleton set if it consists of one point. This singleton property turns out to be crucial for convergence.

We need to determine the local maxima of $F(\tau)$ and so need to characterize its gradient. The one-sided directional derivative of $F(\tau)$ in direction $d$ (so that $d$ is a unit vector) is defined by

\mathrm{DF}(\tau;d)=\lim_{\alpha\rightarrow 0^{+}}\frac{1}{\alpha}\big{(}F(\tau+\alpha d)-F(\tau)\big{)}.

Then we have the following characterization and a well-known optimality condition.

Lemma 5 (Lemma 6 [15]).

$\mathrm{DF}(\tau^{*};d)$ exists for any $d$ at any point $\tau^{*}$ , and is given by

\mathrm{DF}(\tau^{*};d)=\underset{\Theta\in M(\tau^{*})}{\max}\left.d^{T}\frac{\partial\mathcal{L}}{\partial\tau}\right|_{\tau=\tau^{*}}=\underset{\Theta\in M(\tau^{*})}{\max}\>\mathcal{L}(d,\Theta).

(15)

Proof.

See Appendix A-D. ∎

Lemma 6 ([15, 33]).

$\tau^{*}$ is a local maximum for k-MLE-c problem if and only if $\mathrm{DF}(\tau^{*};d)\leq 0$ for all feasible $d$ .

Proposition 7 ([15]).

Given $(\tau^{*},\Theta^{*})$ where $\tau^{*}\in\mathcal{B}$ and $\Theta^{*}\in M(\tau^{*})$ , $\tau^{*}$ is a local maximum of k-MLE-c problem if and only if

F(\tau^{*})=\mathcal{L}(\tau^{*},\Theta^{*})\geq\underset{\tau\in\mathcal{A}}{\max}\;\mathcal{L}(\tau,\Theta)\quad\text{for all }\Theta\in M(\tau^{*}).

(16)

Proof.

See Appendix A-E. ∎

Now we are ready to present our main result on the local maximum of k-MLE problem.

Theorem 8.

Let $(\tau^{*},\Theta^{*})$ be a partial maximum for k-MLE problem. Suppose $M(\tau^{*})$ is a singleton set, then $\tau^{*}$ is a local maximum of k-MLE problem.

Proof.

See Appendix A-F. ∎

III-C Convergence under MLE Uniqueness

To establish the singleton property of $M(\tau^{*})$ , we need conditions to ensure the uniqueness of MLE. We first introduce the following classic result.

Theorem 9 (Corollary 2.5 [34]).

Suppose that the log-likelihood function $\lambda(\theta)$ obeys the log-likelihood uniqueness conditions:

(i)

${\lambda}(\theta)$ is twice continuously differentiable for $\theta\in\Omega$ , where $\Omega$ is a connected open subset with boundary $\partial\Omega$ .
(ii)

$\lim_{\theta\rightarrow\partial\Omega}{\lambda}(\theta)=c$ where $-\infty\leq c<\infty$ .
(iii)

the Hessian matrix $H=\big{\{}\frac{\partial^{2}{\lambda}}{\partial\theta_{i}\partial\theta_{j}}\big{\}}$ is negative definite at every stationary point of the likelihood, i.e. $\frac{\partial{\lambda}}{\partial\theta}=0$ .

Then

•

There is a unique maximum likelihood estimate $\hat{\theta}\in\Omega$ .
•

The log-likelihood function attains: no other maxima in $\Omega$ ; no minima or other stationary points in $\Omega$ ; its infimum value $c$ on $\partial\Omega$ and nowhere else.

Remark. We state the result for the log-likelihood whereas [34] state it for the likelihood. But the statements are equivalent since the Hessians at a stationary point are equivalent.

Proposition 10.

Suppose that each cluster-specific log-likelihood satisfies the conditions of Theorem 9. Then $M(\tau)$ is a singleton set for all $\tau$ .

Proof.

We denote $\mathrm{dim}(C_{k})=n_{k}$ ; the $x_{n}\in C_{k}$ as $x_{[n]}$ , $1\leq n\leq n_{k}$ ; $X_{1}^{n_{k}}=\big{(}x_{[1]},\dots,x_{[n_{k}]}\big{)}$ . In view of the representation of $M(\tau)$ as a Cartesian product, we have only to show that $M_{k}(\tau_{k})$ is a singleton for each $k$ . This will follow if we show that the MLE of $\theta_{k}$ from the joint density $\pi(X_{1}^{n_{k}};\theta_{k})=\prod_{n\in C_{k}}p(x_{[n]};\theta_{k})=\prod_{n=1}^{n_{k}}p(x_{[n]};\theta_{k})$ is unique. This now follows from Theorem 9. ∎

Remark. In Proposition 10 rather than applying the conditions of Theorem 9 to establish MLE uniqueness, it is sometimes possible to establish MLE uniqueness directly or by other means. We will do this for k-VARs below.

We can now state the main result.

Theorem 11.

Suppose that each cluster-specific likelihood satisfies the uniqueness conditions of Theorem 9. Then, the k-MLE iterates converge to a local maximum point for the k-MLE problem.

Proof.

By Theorem 4 the k-MLE algorithm iterates converge to a partial maximizer for the k-MLE problem. The Theorem 11 uniqueness conditions guarantee via Proposition 10 that each $M_{k}(\tau_{k})$ is singleton. So by Theorem 8 the iterates converge to a local maximum for the k-MLE problem, completing the proof. ∎

Remark. As indicated in the previous remark we can alternatively directly show MLE uniqueness when possible.

III-D Convergence of k-Bregman

We could now apply Proposition 10, Theorem 11 to k-Bregman. However it is simpler instead to replace Theorem 11 with yet another result from [34].

Theorem 12 (Theorem 2.6 [34]).

Suppose the log-likelihood function $\lambda(\theta)$ obeys the log-likelihood uniqueness conditions (i),(ii),(iii).

(i)

$\lambda(\theta)$ is twice continuously differentiable for $\theta\in\Omega$ a connected open subset with boundary $\partial\Omega$ .
(ii)

The gradient $\frac{\partial\lambda}{\partial\theta}$ vanishes for at least one point $\theta\in\Omega$ .
(iii)

the Hessian matrix $H=\big{\{}\frac{\partial^{2}\lambda}{\partial\theta_{i}\partial\theta_{j}}\big{\}}$ is negative definite at every point $\theta\in\Omega$ .

Then

(a)

$\lambda(\theta)$ is concave in $\theta$ .
(b)

There is a unique maximum likelihood estimate $\hat{\theta}\in\Omega$ .
(c)

The log-likelihood has no other maxima, minima or stationary points in $\Omega$ .

So we have to check the Theorem 12 conditions. In view of (7) we need only check Theorem 12 for exponential family densities. The density of an exponential family $\ln p(x;\theta)$ in (7) has the equivalent form [11, Section 4]

\ln p(x;\theta)=\theta^{T}x-\psi(\theta)-\ln p_{0}(x)

where $\psi(\theta)$ is the log partition function

\psi(\theta)=\ln Z(\theta)=\ln\int p_{0}(x)\exp({\theta^{T}x})\mathrm{d}x

and is convex [11], and $p_{0}(\cdot)$ is an arbitrary function endowing a measure (see [11, Section 4.1]). This means that the log-density is concave and so the Hessian is negative-semi definite at each point of $\Omega$ . But we need negative definiteness and to see what that entails we need to explicitly calculate the Hessian.

We find the score function $\frac{\partial\ln p(x;\theta)}{\partial\theta}=x-\frac{\partial\psi}{\partial\theta}$ and

\frac{\partial\psi}{\partial\theta}=\int\frac{xe^{\theta^{T}x}}{Z(\theta)}\mathrm{d}x=\int xp(x;\theta)\mathrm{d}x=\mathbb{E}(X)\triangleq\mu(\theta),

hence $\displaystyle\frac{\partial\ln p}{\partial\theta}=x-\mu(\theta).$ Then we have

\begin{array}[]{r@{\;}l}\displaystyle-\frac{\partial^{2}\ln p}{\partial\theta\partial\theta^{T}}&=\displaystyle\frac{\partial^{2}\psi}{\partial\theta\partial\theta^{T}}=\int x\frac{\partial p}{\partial\theta}\mathrm{d}x=\int x\frac{\partial\ln p}{\partial\theta^{T}}p\>\mathrm{d}x\\[8.0pt] &\displaystyle=\int x(x-\mu)^{T}p\mathrm{d}x=\int(x-\mu)(x-\mu)^{T}p\mathrm{d}x\\[8.0pt] &\displaystyle=\mathrm{Var}(X)=\Sigma(\theta),\end{array}

in which $\mathbb{E}(X)$ and $\mathrm{Var}(X)$ denote the mean and variance of random variable $X$ . Now we can state the Bregman result.

Proposition 13.

Suppose the cluster-specific exponential families each have: at least one point $\theta_{k}\in\Omega$ where the gradient of the log-likelihood vanishes; positive definite variance matrices $\Sigma(\theta_{k})$ . Then, the k-MLE/k-Bregman iterates converge to a local minimum point for the k-MLE/k-Bregman problem.

Proof.

The proof is the same as that of Theorem 11 we just have to check the conditions of Theorem 12. Following the discussion above, we find that the cluster specific Hessian is

-\textstyle\sum_{n=1}^{n_{k}}\Sigma(\theta_{k})=-n_{k}\Sigma(\theta_{k})

which is negative definite as required for condition (iii) in Theorem 12. Condition (i) is implicit for exponential families. Condition (ii) holds by assumption. The result now follows. ∎

IV Clustering Autocorrelated Time Series with k-VARs

We apply k-MLE to clustering of $N$ autocorrelated time series $\mathbb{Y}\triangleq\{Y^{(n)},n=1,2,\cdots,N\}$ where each $Y^{(n)}$ contains $T$ measurements of an $m$ -dimensional time series $Y_{t}^{(n)},t=1,2,\cdots,T$ .

IV-A k-VARs Derivation

We model a time series in the $k^{th}$ cluster by a Gaussian vector autoregression (VAR), of block order $p_{k}$ . The conditional likelihood function of the $k^{th}$ VAR model is

\begin{array}[]{@{}r@{\;}l}L_{n,k}&=p(Y^{(n)}\mid A_{k},\Sigma_{k})\\ &=\displaystyle\prod_{t=p_{k}+1}^{T}(2\pi)^{-\frac{m}{2}}|\Sigma_{k}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\mathbf{e}_{nkt}^{T}\Sigma_{k}^{-1}\mathbf{e}_{nkt}\right)\end{array}

(17)

where for each $k$

	$\displaystyle\mathbf{e}_{nkt}$	$\displaystyle=Y_{t}^{(n)}-A_{k0}-A_{k1}Y_{t-1}^{(n)}--\cdots-A_{kp_{k}}Y_{t-p_{k}}^{(n)}$
		$\displaystyle=Y_{t}^{(n)}-A_{k}X_{kt}^{(n)},$
	$\displaystyle A_{k}$	$\displaystyle=\left[A_{k0},A_{k1},\dots,A_{kp_{k}}\right],$
	$\displaystyle X_{kt}^{(n)}$	$\displaystyle=\left[{1},Y_{t-1}^{(n)T},Y_{t-2}^{(n)T},\dots,Y_{t-p_{k}}^{(n)T}\right]^{T},$

Also $A_{k0}$ is a vector, while for $r\geq 1$ , $A_{k,r}$ are the VAR coefficient matrices and $\Sigma_{k}$ the driving noise variance matrix. The set-up now fits in the k-MLE framework.

The classification log-likelihood is given by

\mathcal{L}(\tau,\Theta)=\textstyle\sum_{n=1}^{N}\sum_{k=1}^{K}\tau_{n,k}\lambda_{n,k}(\theta_{k})

(18)

where $\theta_{k}$ denotes the collection of parameters $A_{k}$ and $\Sigma_{k}$ . Applying the coordinate ascent solver for k-MLE is straightforward, since we have a classic vector or multivariate regression which leads to the following algorithm:

•

label update:

	$\displaystyle\tau_{n,k}$	$\displaystyle=\begin{cases}1&\text{if }k=\operatorname{arg\,min}_{k^{\prime}}D_{n,k^{\prime}},\\ 0&\text{otherwise},\end{cases}$		(19)
	$\displaystyle D_{n,k^{\prime}}$	$\displaystyle=(T-p)\log\|\Sigma_{k^{\prime}}\|+\textstyle\sum_{t=p+1}^{T}\mathbf{e}_{nk^{\prime}t}^{T}\Sigma_{k^{\prime}}^{-1}\mathbf{e}_{nk^{\prime}t},$		(20)

•

parameter update:

$\displaystyle I_{k}$	$\displaystyle=\big{\{}n:\tau_{n,k}=1;\ n=1,\dots,N\big{\}},$	(21)
$\displaystyle A_{k}^{T}$	$\displaystyle=\textstyle\begin{array}[t]{@{}l@{}l}&\Big{(}\sum_{n\in I_{k}}\big{(}\sum_{t=p+1}^{T}X_{kt}^{(n)}X_{kt}^{(n)T}\big{)}\Big{)}^{-1}\\ &\Big{(}\sum_{n\in I_{k}}\big{(}\sum_{t=p+1}^{T}X_{kt}^{(n)}Y_{t}^{(n)T}\big{)}\Big{)},\end{array}$	(24)
$\displaystyle\Sigma_{k}$	$\displaystyle=\Big{(}\textstyle\sum_{n\in I_{k}}\big{(}\sum_{t=p+1}^{T}\mathbf{e}_{nkt}\mathbf{e}_{nkt}^{T}\big{)}\Big{)}/\Big{(}(T-p)\|I_{k}\|\Big{)},$	(25)

where $|I_{k}|$ denotes the number of elements of $I_{k}$ (i.e. the cardinality of set $I_{k}$ ), and $p\triangleq\max\{p_{1},\dots,p_{K}\}$ .

We now consider: initialisation, stopping conditions, and fast matrix computation.

IV-B k-VARs Computation

IV-B1 Initialisation

We take a simple approach. We fit a VAR model to each time series $Y^{(n)}$ , yielding $(A_{n}^{*},\Sigma_{n}^{*})$ ( $n=1,\dots,N$ ). Then we choose at random, one $A_{n}^{*}$ (and $\Sigma_{n}^{*}$ ) to initialise the cluster-specific parameters. We call k-VARs with this initialisation ‘k-VARs(rnd)’. In practice we can further repeat this many times and choose the initialisation that delivers the highest value of the likelihood to improve performance. If these $K$ initial parameters are all from different clusters then we call the result ‘k-VARs (oracle)’.

IV-B2 Stopping Criteria

We considered two stopping criteria: one based on parameters, the other based on log-likelihoods.

If $A_{k}$ and $\Sigma_{k}$ are the values given at the current iterate and $A^{+}_{k}$ and $\Sigma^{+}_{k}$ are the next, the stopping condition is

\max\left\{\|A_{k}^{+}-A_{k}\|_{F},\|\Sigma_{k}^{+}-\Sigma_{k}\|_{F}\right\}<\epsilon

(26)

for all $k=1,\dots,K$ , where $\epsilon$ is the user-specified tolerance value. One may also consider using the relative errors in (26) according to practical demands.

The log-likelihood stopping rule is

|\mathcal{L}(\tau^{+},\theta^{+})-\mathcal{L}(\tau,\theta)|<\epsilon,

(27)

where $\tau^{+}$ and $\theta^{+}$ denote the proposed updates. Compared to the parameter stopping rule, this one is computationally much cheaper since it avoids computing large matrix norms.

IV-B3 Fast Computation

The computation bottleneck of the k-VARs algorithm is (24) and (25). To accelerate computation, we use QR decomposition as now described.

We introduce extended vectors of the general form, ${\bf F}_{n}^{\top}=[F_{p+1},\cdots,F_{T}]$ and corresponding outer products of the form, ${\bf F}_{n}{\bf G}_{n}^{\top}=\mbox{$\sum_{p+1}^{T}$}F_{t}G_{t}^{\top}$ . We thus define ${\bf X}_{nk},{\bf Y}_{n},{\bf E}_{nk}={\bf Y}_{n}-{\bf X}_{nk}A_{k}^{\top}$ and ${\bf X}_{nk}^{\top}{\bf X}_{nk},{\bf X}_{nk}^{\top}{\bf Y}_{n},{\bf E}_{nk}^{\top}{\bf E}_{nk}$ .

We now introduce the QR decompositions

[{\bf X}_{nk}]_{(T-p)\times(1+mp_{k})}=(Q_{nk})_{(T-p)\times(1+mp_{k})}R_{nk}

where $Q_{nk}$ is orthogonal and $R_{nk}$ is upper triangular. We then form ${\bf Y}_{Qnk}=Q_{nk}^{\top}{\bf Y}_{n}$ which yields a new expression for (24) as follows

A_{k}^{T}=\left(\textstyle\sum_{n\in I_{k}}R_{nk}^{T}R_{nk}\right)^{-1}\left(\textstyle\sum_{n\in I_{k}}R_{nk}^{T}\mathbf{Y}_{Q_{nk}}\right).

(28)

The QR decomposition also enables cheap computation of individual time series parameters as

A_{n}^{*T}=R_{nk}^{-1}\mathbf{Y}_{Q_{nk}}.

(29)

We next introduce a second set of QR decompositions

{\bf E}_{nk}=(U_{nk})_{(T-p)\times m}(V_{nk})_{m\times m}

(30)

This yields a compact update of $\Sigma_{k}$ as follows

\Sigma_{k}=\left(\textstyle\sum_{n\in I_{k}}V_{nk}^{T}V_{nk}\right)/\big{(}(T-p)|I_{k}|\big{)},

(31)

The initial covariance for each time series can be similarly computed by

\Sigma_{n}^{*}=\big{(}1/(T-p_{k})\big{)}V_{nk}^{T}V_{nk}.

(32)

Finally we introduce the lower triangular Cholesky decompositions $\Sigma_{k}=L_{k}L_{k}^{T},\,k=1,\dots,K,$ This yields reliable computation of $D_{nk}$ as follows.

\mathbf{e}_{nkt}^{T}\Sigma_{k}^{-1}\mathbf{e}_{nkt}=(L_{k}^{-1}\mathbf{e}_{nkt})^{T}(L_{k}^{-1}\mathbf{e}_{nkt}),

(33)

where $L_{k}^{-1}\mathbf{e}_{nkt}$ is found by rapidly solving the triangular linear system $L_{k}\xi=\mathbf{e}_{nkt}$ for $\xi$ . which is very fast since $L_{k}$ is lower triangular.

IV-B4 Summary of k-VARs and Parallelism

The complete k-VARs algorithm is summarised in Algorithm 1. In addition, it is easy to see that several sections of Algorithm 1 can be parallelised to take advantage of multi-cores, including:

•

(line 3) pre-computing $n$ , QR decompositions and $K$ estimations of $A_{n}^{*}$ .
•

(line 6-11) computing $NK$ , $D_{n,k}$ s.
•

(line 12-16) updating $K$ matrix parameters $A_{k},\Sigma_{k}$ .

Algorithm 1 k-VARs algorithm

1:Input: data

\mathbb{Y}

, number of clusters

K

, model orders

p_{1},\dots,p_{K}

, and tolerance

\epsilon

2:Output: model parameters

A_{k},\Sigma_{k}

, and clustering labels

\tau_{n,k}

(

k=1,\dots,K

;

n=1,\dots,N

3:Pre-computation: perform and cache QR decomposition for each

\mathbf{X}_{nk}

, and compute

\mathbf{Y}_{Q_{nk}}

4:Initialisation: initialise

(A_{k},\Sigma_{k})

by Section IV-B1.

5:while TRUE do

6: for

n\leftarrow 1

N

\triangleright

label update

7: for

k\leftarrow 1

K

8: Compute

D_{n,k}

by (20).

9: end for

10: Determine

\tau_{n,k}

by (19) for

k=1,\dots,K

11: end for

12: for

k\leftarrow 1

K

\triangleright

parameter update

13: Determine

I_{k}

by (21).

14: Update

A_{k}

by (28).

15: Update

\Sigma_{k}

by (25).

16: end for

17: if (26) or (27) satisfied then

\triangleright

stop conditions

18: break

19: end if

20:end while

IV-C k-VARs Convergence

Convergence analysis of k-VARs algorithm is provided by applying the general k-MLE results. As indicated in the remarks following Proposition 9 and Theorem 10 we will directly show the solution to the MLE equations is unique.

Note that the model is not from a (vector) exponential family. This is because the exponent can only be represented as an inner product of parameters with data if $\Sigma_{k}$ is known. So we cannot apply (a vector version of) k-Bregman.

Differentiating a cluster-specific likelihood gives the least squares equations shown in (21),(22). From these it is implicit that uniqueness requires that, for each $k$ : (i) the inverse of the ${\bf X}_{nk}^{T}{\bf X}_{nk}$ matrix in (21) exists; (ii) $\Sigma_{k}$ in (21) has full rank. This will follow if (i),(ii) hold for every cluster of size 1. But we are now in a classic situation of requiring uniqueness of the least squares estimate of a VAR and of its noise variance from a single time series record. For cluster size 1, the ${\bf X}_{nk}^{T}{\bf X}_{nk}$ matrix in (22) is very nearly a block Toeplitz matrix. So long as each time series is generated by (i) a stable VAR with (ii) full rank noise covariance matrix and $T>p+1$ , then the ${\bf X}_{nk}^{T}{\bf X}_{nk}$ matrix will have full rank with probability 1 [35]. Further the estimated noise variance matrix will have full rank with probability 1 [35]. So uniqueness of the MLE is established and convergence of the k-VARs algorithm now follows.

IV-D Model Selection for k-VARs

The clustering algorithms, presented in section IV-A and IV-B, require the number of clusters $K$ and cluster-specific model orders $(p_{1},\dots,p_{K})$ to be specified. This section develops a Bayesian information criterion (BIC) to resolve the model selection problem. We only discuss the special case with equal $p_{k}$ ’s. The general case is computationally challenging and may not offer big improvements on large data sets.

For the case of equal $p_{k}$ ’s, we choose $K$ and $p$ as the joint minimizer of BIC on a $(K,p)$ grid as follows.

\begin{array}[]{@{}r@{\;}l}\mathrm{BIC}(K,p)=&-2\log{L}+\Big{\{}K\big{[}(p+1/2)m^{2}+3m/2\big{]}+N\Big{\}}\cdot\\ &\log\big{[}N(T-p)\big{]},\end{array}

(34)

Since the $N$ time series are assumed to be sampled independently and the $K$ models are independent then the likelihood is given by (18) with all $p_{k}$ in $D_{n,k}(\cdot,\cdot)$ being set to $p$ .

A faster way to find the minimum of BIC is to use cyclic descent. Here we alternate between the following two steps until a convergence criterion is met.
(i) Given $p$ , minimize over $K$ ;
(ii) Given $K$ , minimize over $p$ .
Of course this may get stuck in a local minimum. But that could be managed by trying a number of random starts.

V Simulation and Application to Real Data

This section has two parts. In part I (subsections A,B,C) we we compare k-VARs with other methods on simulated data. We find that k-VARs considerably outperforms state-of-the-art methods because they ignore the autocorrelation feature. Section C studies the robustness of k-VARs with non-Gaussian driving noise and with varying signal to noise ratios. In part II (section D) we apply k-VARs to real data.

V-A Baseline Methods and Performance Indices

k-VARs is implemented in MATLAB. All simulations are executed on 2.4 GHz Intel Core i9 machines with 32GB RAM and macOS Catalina. The ‘state of the art’ comparator methods are as follows.

•

k-DBA: k-Means with DTW barycenter averaging (DBA) [19], the implementation TimeSeriesKMeans with option metric="dtw" from an official Python package [36], tslearn, version 0.5.0.5.
•

k-Shape: the well-known algorithm proposed in [22, 23], the official implementation KShape from the same Python package tslearn.
•

k-GAK: Kernel k-Means [37] with Global Alignment Kernel [38] for time series clustering, the implementation KernelKMeans with option kernel="gak" from the Python package tslearn.
•

k-SC: k-Spectral Centroid proposed in [21], the MATLAB implementation given on the companion website of SNAP software [39]. (Note: this version does not include the Haar-Wavelet based incremental implementation.)

To avoid the drawbacks of Rand Index (RI) and the squared version of Normalised Mutual Information ( $\text{NMI}_{\text{sqrt}}$ ) discussed in [40], we used the improved measures, adjusted Rand Index (ARI) and Normalized Information Distance (NID) to evaluate clustering performance

	$\displaystyle\mathrm{ARI}=\frac{2(N_{00}N_{11}-N_{01}N_{10})}{(N_{00}+N_{01})(N_{01}+N_{11})+(N_{00}+N_{10})(N_{10}+N_{11})},$
	$\displaystyle\mathrm{NMI}_{\text{max}}=\frac{I(U,V)}{\max\{H(U),H(V)\}},\quad\mathrm{NID}=1-\mathrm{NMI}_{\text{max}}$

where $N_{11}$ is the number of pairs that are in the same cluster in both $U$ and $V$ , $N_{00}$ is the number of pairs that are in different clusters in both $U$ and $V$ , $N_{01}$ is the number of pairs that are in the same cluster in $U$ but in different clusters in $V$ , $N_{10}$ is the number of pairs that are in different clusters in $U$ but in the same cluster in $V$ ; and $H(U)$ is the average amount of information in $U$ , $I(U,V)$ is the mutual information, given as

\begin{array}[]{r@{\;}l}H(U)&=-\sum_{i=1}^{K}\frac{a_{i}}{N}\log\frac{a_{i}}{N},\\ I(U,V)&=\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{n_{ij}}{N}\log\frac{n_{ij}/N}{a_{i}b_{j}/N^{2}},\end{array}

in which $n_{ij}$ denotes the number of objects that are common to clusters $U_{i}$ and $V_{j}$ , $a_{i}$ is the number of objects in cluster $U_{i}$ , and $b_{j}$ is that in cluster $V_{j}$ . For both ARI and $\mathrm{NMI}_{\text{max}}$ , values close to 1 indicate high clustering performance.

V-B Comparisons on Simulated Datasets

To evaluate performance in a statistical manner, we randomly generate $40$ independent datasets. Due to scalability limits on some of the competing methods, the problem size had to be restricted. Each time-series dataset is generated by simulating a class of VAR models with the following specifications:

•

time series dimensions: $m=2,4,8$ ;
•

model order: $p=5$ ;
•

time series length: $T=80$ ;
•

number of groups: $K=8$ ;
•

number of time series per cluster: $N_{c}=30$ .

The parameters of each VAR model are generated randomly and roots scaled to ensure stability. The noise covariance matrix is computed from randomly generated Cholesky factors.

The results in Figure 1, show two versions of k-VARs (defined in the caption). k-VARs (oracle), is superior to the state-of-the-art methods. And k-VARs (rnd) is competitive with them. Note also that most of the other methods require a number of tuning parameters to be chosen. We used default values suggested in the various implementations. k-VARs has no free tuning parameters and is competitive because the other methods ignore the autocorrelation feature.

Refer to caption — (a) Adjusted Rand Index (ARI)

Next we illustrate BIC. The following set-up applied.

•

cluster-specific VAR models: $m=4$ , $p=5$ ;
•

time series: $T=200$ , $K=10$ , $N_{c}=20$ .

The grid for $K,p$ is $2\!:\!2\!:\!20\times 2\!:\!1\!:\!8$ . log-BIC is plotted as a heatmap in Figure 2.

We see that BIC attains its minimum at the ground truth $K=10$ . However there is a relatively flat region in the vicinity of the minimum. We found in the simulations that good clustering performance was tolerant to some flatness in BIC near its minimum.

V-C Effect of Non-Gaussian Driving Noise and of SNR

Here we study two things. The ability of k-VARS to deal with non-Gaussian driving noise. And the effect of signal to noise ratio (SNR).

We use t-distributed driving noise with degrees of freedom (dof): $[5,10,50,100,10000]$ . Larger dof is closer to Gaussianity. The setup is otherwise the same as for Figure 1, but with $m=3$ . The results in Figure 3 illustrate the robustness of k-VARs with little loss of performance for low dof.

The next simulation studies the effect of SNR for Gaussian driving noise. We modify a definition of SNR due to [41]. Denote $\Pi$ as the zero-lag auto-covariance of a stationary VAR(p) time series $Y_{t}$ whose white noise variance is $\Sigma$ . Then [41] uses SNR $=10\log\mbox{VNSR},\mbox{VSNR}=\frac{\lambda_{\mathrm{max}}(\Pi)}{\lambda_{\mathrm{max}}(\Sigma)}$ . This has a scaling problem. If we multiply $Y_{t}$ by a diagonal scaling matrix $L\neq cI$ (where $c$ is a constant) then this SNR changes. So instead we use VSNR $=\frac{1}{2}\lambda_{max}(\Sigma^{-1/2}\Pi\Sigma^{-1/2})$ which is unaffected by matrix scaling. The divisor of $2$ is due to a second adjustment as follows. In steady state we can write $Y_{t}=\hat{Y}_{t}+\epsilon_{t}$ where $\hat{Y}_{t}$ is the one step ahead predictor and $\epsilon_{t}$ is the one step ahead prediction error and so has variance $\Sigma$ . Also it is uncorrelated with $\hat{Y}_{t}$ which thus has variance $M=\Pi-\Sigma$ . In the scalar case the when $M=\Sigma$ i.e. ‘signal’ and noise variance are equal this would be taken to yield a SNR of 0 dB. So to reproduce that in the vector case we divide by $2$ . The model setup is the same as that for Figure 3, except we fixed $m=3$ . We have to modify $A_{k}$ to achieve a given SNR. The trick is to scale the roots of the VAR polynomial $A_{k_{1}},\cdots,A_{k_{p}}$ . The SNR values are [0,5,10,15,20] (dB), The results in Figure 4, show k-VARS to be reliable.

V-D Application to Real Dataset WAFER

We now analyse the chip frabrication ‘WAFER’ data set [42] common in time series clustering studies and available at [43]. Each data set in the database contains the measurements recorded by one sensor during the processing of one wafer by one tool.

The results in Table I, show comparisons to k-DBA, k-Shape, k-GAK, k-SC. The proposed method performs best by all measures. One may recall that the performance of k-VARs is subject to initialization and hence is restricted by local optimality. Unlike the Monte Carlo study, we can now optimise the initialisation. Recalling that k-MLE aims to maximise the mixture likelihood, we firstly perform multiple runs with random initialisations, and then apply a likelihood threshold to evaluate initialisations. The results for k-VARs in Table I show a statistical summary of 20 runs with random initialisations selected by thresholding the log-likelihood at $15.57$ , which value is chosen via 10 preliminary trials. The performance of k-VARs is so consistent that its 1st, 2nd, 3rd quartile are the same, hence only one value is given in Table I.

In addition, our methods provide cluster-specific parameters for further analyses, like model validation, Although the manufacturing processes of WAFER is complex and with higher-order dynamics, the results show that satisfactory clustering does not require precise modelling of each time series ( $p=10$ is used here). Again the reason k-VARs performs well here is that the WAFER data shows strong autocorrelation features which the other methods ignore.

TABLE I: Clustering performance for the WAFER data.

	k-DBA	k-Shape	k-GAK	k-SC	k-VARs
RI	0.6280	0.5914	0.5008	0.8050	0.8120
$\text{NMI}_{\mathrm{sqrt}}$	0.0001	0.0001	0.0054	0.0043	0.0463
ARI	0.0050	0.0321	0.0026	-0.0057	0.0165
1-NID	0.0001	0.0032	0.0038	0.0011	0.0074

VI Conclusion

In this paper we have developed a general purpose deterministic label clustering method, k-MLE, that is not based on a distance or divergence measure, but on likelihood. We developed an implementation based on cyclic ascent. We also developed a model selection procedure based on BIC and illustrated it in a special case.

We provided a general theorem on parameter convergence of the cyclic ascent iterates. In so doing we drew attention to major flaws in previous clustering algorithm convergence proofs. They only obtained criterion convergence, mistakenly asserting that it implied parameter convergence whereas proving parameter convergence is much much harder.

We showed that a previous clustering method based on Bregman divergence, which we called k-Bregman, is a special case of k-MLE. Our general convergence result then provided the first valid proof of parameter convergence of k-Bregman.

We illustrated the usefulness of k-MLE by developing a new clustering algorithm, k-VARs, for autocorrelated vector time series. We developed a fast computational procedure and a BIC model selection criterion for simultaneously choosing the number of clusters and VAR order.

We only illustrated BIC for k-VARs, but as indicated in the introduction, it can be constructed for any k-MLE algorithm, since it requires only a likelihood and a parameter count.

We compared k-VARs with state of the art vector time series clustering algorithms and found it greatly outperformed them. This is simply because those algorithms mostly ignore the autocorrelation feature. We also showed robust performance in the presence of heavy-tailed driving noise and also with varying SNR.

Acknowledgments

This work was supported by a Discovery Grant DP180102417 from the Australian Research Council.

Appendix A Proofs for Convergence Analysis

A-A Proof of Lemma 1

Let $0\leq\alpha\leq 1$ then we have

\begin{array}[]{r@{\;}l}&F\big{(}\alpha\tau^{a}+(1-\alpha)\tau^{b}\big{)}\\ &=\underset{\Theta\in\Omega^{\times}}{\max}\sum_{n=1}^{N}\sum_{k=1}^{K}\left[\alpha\tau_{n,k}^{a}+(1-\alpha)\tau_{n,k}^{b}\right]\ell(x_{n},\theta_{k})\\ &=\underset{\Theta\in\Omega^{\times}}{\max}\left[\alpha\mathcal{L}(\tau^{a},\Theta)+(1-\alpha)\mathcal{L}(\tau^{b},\Theta)\right]\\ &\leq\alpha\,\underset{\Theta\in\Omega^{\times}}{\max}\,\mathcal{L}(\tau^{a},\Theta)+(1-\alpha)\,\underset{\Theta\in\Omega^{\times}}{\max}\,\mathcal{L}(\tau^{b},\Theta)\\ &=\alpha F(\tau^{a})+(1-\alpha)F(\tau^{b})\end{array}

as required.

A-B Proof of Proposition 3

Since the concentrated log-likelihood $F(\tau)$ is convex, and since the constraint set $\mathcal{A}$ is convex, the solution of the k-MLE-c problem is an extreme point of $\mathcal{A}$ and so lies in $\mathcal{B}$ . Let $\tau^{*}$ be this solution. Let $\Theta^{*}$ be any $\Theta\in\Omega^{\times}$ that achieves $\mathcal{L}(\tau^{*},\Theta^{*})=F(\tau^{*})$ . Then $(\tau^{*},\Theta^{*})$ also solves the k-MLE problem. The converse also holds completing the proof.

A-C Proof of Theorem 4

Since $\mathcal{B},\Omega^{\times}$ are bounded sets, the iterates $(\tau^{(m)},\Theta^{(m)})$ are a bounded (matrix) sequence and so by the Bolzano-Weierstrass theorem have at least one limit point/matrix. Next we claim that an extreme point of $\mathcal{A}$ is visited at most once before the algorithm stops. Suppose that this is not true, i.e. $\tau^{(a)}=\tau^{(b)}$ for some $a<b$ . In view of the cyclic minimization we must have $\mathcal{L}(\tau^{(b)},\Theta^{(b)})\geq\mathcal{L}(\tau^{(b)},\Theta^{(a)})\geq\mathcal{L}(\tau^{(a)},\Theta^{(a)})$ . If either of the inequalities is strict then we get a contradiction due to the stopping rule and the claim holds. If both inequalities are equalities we similarly have a contradiction since the algorithm must already have stopped. It now follows that, since there are a finite number of extreme points of $\mathcal{A}$ , the algorithm will reach a partial maximum in a finite number of steps.

A-D Proof of Lemma 5

The proof in [15] deals with minima and uses slightly different assumptions. We note that Assumption 1 ensures $\mathcal{L}(\tau,\Theta)$ as well as $\frac{\partial\mathcal{L}}{\partial\tau}$ are continuous in $(\tau,\Theta)$ . Then, by [44] (Theorem 1, p. 420), we get the first equality in (15). By Property P1, we further have

\mathrm{DF}(\tau^{*};d)=\underset{\Theta\in M(\tau^{*})}{\max}\>d^{T}\left[\ell(x_{n};\theta_{k})\right]=\underset{\Theta\in M(\tau^{*})}{\max}\>\mathcal{L}(d,\Theta).

A-E Proof of Proposition 7

Suppose the inequality holds. Let $\hat{\Theta}\in M(\tau^{*})$ . Then we get

\mathcal{L}(\tau^{*},\Theta^{*})=\mathcal{L}(\tau^{*},\hat{\Theta})\geq\underset{\tau\in\mathcal{A}}{\max}\>\mathcal{L}(\tau,\hat{\Theta})

So for any feasible direction $d$ , $\left.d^{T}\frac{\partial\mathcal{L}(\tau,\hat{\Theta})}{\partial\tau}\right|_{\tau=\tau^{*}}\leq 0.$ But this is true for all $\hat{\Theta}\in M(\tau^{*})$ so

\underset{\Theta\in M(\tau^{*})}{\max}\left.d^{T}\frac{\partial\mathcal{L}(\tau,\hat{\Theta})}{\partial\tau}\right|_{\tau=\tau^{*}}\leq 0.

But by Lemma 5 the left hand side is $\mathrm{DF}(\tau^{*};d)$ , so by Lemma 6 $\tau^{*}$ is a local maximum for k-MLE-c.

Now assume $\tau^{*}$ is a local maximum for k-MLE-c. For any feasible direction $d$ , by Lemma 6 $\mathrm{DF}(\tau^{*};d)\leq 0$ . Lemma 5 ensures $\left.d^{T}\textstyle\frac{\partial\mathcal{L}(\tau,{\Theta})}{\partial\tau}\right|_{\tau=\tau^{*}}\leq 0$ for any feasible $d$ and $\Theta\in M(\tau^{*})$ . Now consider a fixed $\hat{\Theta}\in M(\tau^{*})$ . Recall also that $\mathcal{L}(\tau,\hat{\Theta})$ is linear in $\tau$ by Property P1. These imply that $\mathcal{L}(\tau^{*},\hat{\Theta})\geq\underset{\tau\in\mathcal{A}}{\max}\>\{\mathcal{L}(\tau,\hat{\Theta})\}$ . Since $\hat{\Theta}$ is arbitrary, hence (16) holds and this completes the proof.

A-F Proof of Theorem 8

Since $(\tau^{*},\Theta^{*})$ is a partial maximum, then from the definition $\mathcal{L}(\tau^{*},\Theta^{*})\geq\underset{\tau\in\mathcal{A}}{\max}\>\mathcal{L}(\tau,\Theta^{*})$ . But when $M(\tau^{*})$ is a singleton set this is exactly the local maximum condition of Theorem 7, thereby completing the proof.

References

[1] R. Duda, P. E. Hart, and D. G. Stork, Pattern classification. New York: Wiley, 2001.
[2] M. Sonka, V. Hlavac, and R. Boyle, Image processing, analysis, and machine vision. Stamford, CT, USA: Cengage Learning, 2015.
[3] R. Szeliski, Computer vision : algorithms and applications. London New York: Springer, 2011.
[4] N. C. Jones and P. A. Pevzner, An introduction to bioinformatics algorithms. Cambridge, MA: MIT Press, 2004.
[5] H. E. Driver and A. L. Kroeber, Quantitative expression of cultural relationships. University of California Press, 1932, vol. 31, no. 4.
[6] T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical learning: data mining, inference, and prediction, Springer Series in Statistics,” 2009.
[7] B. Everitt, S. Landau, M. Leese, and D. Stahl, Cluster Analysis. Hoboken: Wiley, 2011.
[8] C. Hennig, M. Meila, F. Murtagh, and R. Roberto, Handbook of cluster analysis. Chapman & Hall/CRC, 2016.
[9] K. P. et al., “Convex clustering shrinkage.” in PASCAL: Workshop on Statistics and Optimization of Clustering Workshop, London, UK, 2005.
[10] G. C. et.al., “Convex clustering: an attractive alternative to hierarchical clustering,” PLoS Comput. Biol., vol. 11, p. e1004228, 2015.
[11] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of machine learning research, vol. 6, no. Oct, pp. 1705–1749, 2005.
[12] S. Miyamoto, H. Ichihashi, and K. Honda, Algorithms for fuzzy clustering. Berlin: Springer, 2017.
[13] N. Karayiannis, “Meca: Maximum entropy clustering algorithm,” in Proc. 1994 IEEE 3rd Int Fuzzy Systems Conf. IEEE, 1994, pp. 630–635.
[14] J. D. Banfield and A. E. Raftery, “Model-based Gaussian and non-Gaussian clustering,” Biometrics, pp. 803–821, 1993.
[15] S. Z. Selim and M. A. Ismail, “K-means-type algorithms: A generalized convergence theorem and characterization of local optimality,” IEEE Trans. P.A.M.I., no. 1, pp. 81–87, 1984.
[16] L. Bottou and Y. Bengio, “Convergence properties of the k-means algorithms,” in NIPS, 1995, pp. 585–592.
[17] G. Schwarz, “Estimating the dimension of a model,” The annals of statistics, vol. 6, no. 2, pp. 461–464, 1978.
[18] M. Yan and K. Ye, “Determining the number of clusters using the weighted gap statistic,” Biometrics, vol. 63, pp. 1031–1037, 2007.
[19] F. Petitjean, A. Ketterlin, and P. Gançarski, “A global averaging method for dynamic time warping, with applications to clustering,” Pattern recognition, vol. 44, no. 3, pp. 678–693, 2011.
[20] Y. Xiong and D.-Y. Yeung, “Time series clustering with ARMA mixtures,” Pattern Recognition, vol. 37, no. 8, pp. 1675–1689, 2004.
[21] J. Yang and J. Leskovec, “Patterns of temporal variation in online media,” in Proceedings of the fourth ACM international conference on Web search and data mining, 2011, pp. 177–186.
[22] J. Paparrizos and L. Gravano, “k-shape: Efficient and accurate clustering of time series,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1855–1870.
[23] ——, “Fast and accurate time-series clustering,” ACM Transactions on Database Systems (TODS), vol. 42, no. 2, pp. 1–49, 2017.
[24] L. Ulanova, N. Begum, and E. Keogh, “Scalable clustering of time series with u-shapelets,” in Proceedings of the 2015 SIAM international conference on data mining. SIAM, 2015, pp. 900–908.
[25] V. S. Siyou Fotso, E. Mephu Nguifo, and P. Vaslin, “Frobenius correlation based u-shapelets discovery for time series clustering,” Pattern Recognition, vol. 103, jul 2020.
[26] N. S. Madiraju, S. M. Sadat, D. Fisher, and H. Karimabadi, “Deep temporal clustering: Fully unsupervised learning of time-domain features,” arXiv preprint arXiv:1802.01059, 2018.
[27] D. Luenberger and Y. Ye, Linear and Nonlinear Programming, 3rd ed. Springer, 2008.
[28] S. Zhong and J. Ghosh, “A unified framework for model-based clustering,” Jl. Machine Learning Research, vol. 4, pp. 1001–1037, 2003.
[29] M. Kearns, Y. Mansour, and A. Y. Ng, “An information-theoretic analysis of hard and soft assignment methods for clustering,” in Proc. 13th Conf. Uncertainty in Artificial Intelligence,, 1997, pp. 282–293.
[30] C. Li and G. Biswas, “Applying the hidden markov model methodology for unsupervised learning of temporal data,” Int. Jl. of Knowledge-based Intelligent Eng. Systems, vol. 6, pp. 152–160, 2002.
[31] J. Forster and M. K. Warmuth, “Relative expected instantaneous loss bounds,” Journal of Computer and System Sciences, vol. 64, no. 1, pp. 76–102, 2002.
[32] Z. Yue and V. Solo, “Large-scale time series clustering with k-ars,” in Proc IEEE ICASSP. IEEE, 2020, pp. 6039–6043.
[33] J. R. Magnus and H. Neudecker, Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons, 2019.
[34] T. Mäkeläinen, K. Schmidt, and G. P. H. Styan, “On the Existence and Uniqueness of the Maximum Likelihood Estimate of a Vector-Valued Parameter in Fixed-Size Samples,” Ann. Stat., vol. 9, no. 4, pp. 758–767, jul 1981.
[35] H. Lütkepohl, New Introduction to Multiple Time Series Analysis. Berlin: Springer, 2005.
[36] J. Tavenard, R. et.al. C. Holtz, M. “Tslearn, A Machine Learning Toolkit for Time Series Data,” Journal of Machine Learning Research, vol. 21, no. 118, pp. 1–6, 2020.
[37] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in Proceedings of the tenth ACM SIGKDD Int. Conf. on Knowledge discovery & data mining, 2004, pp. 551–556.
[38] M. Cuturi, “Fast global alignment kernels,” in Proc. 28th Int. Conf. machine learning (ICML-11), 2011, pp. 929–936.
[39] J. Leskovec, “Codes: K-Spectral Centroid - Cluster Time Series by Shape,” 2014. [Online]. Available: http://snap.stanford.edu/data/ksc.html
[40] N. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Jl. MLR, vol. 11, pp. 2837–2854, 2010.
[41] W. Nicholson, I. W. andJ. Bien, and D. Matteson, “High dimensional forecasting by interpretable vector autoregression,” Jl .Mach. Learn. Res., vol. 21, pp. 1–52, 2020.
[42] R. T. Olszewski, “Generalized feature extraction for structural pattern recognition in time-series data,” Tech. Rep., 2001.
[43] M. G. Baydogan, “Multivariate Time Series Classification Datasets,” 2017. [Online]. Available: http://www.mustafabaydogan.com
[44] L. S. Lasdon, Optimization theory for large systems. New York : London: New York : Macmillan N.Y., 1970.

k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation

Abstract

Index Terms:

I Introduction

II k-MLE: A General Tool for Clustering

II-A Background

II-B k-MLE Formulation

Definition 1 (k-MLE problem).

Assumption 1.

Definition 2 (k-MLE algorithm).

II-C k-Bregman

II-D Applications

III k-MLE Convergence

III-A Convergence to Partial Maxima

Definition 3 (analog constraint set).

Definition 4 (concentrated333This is standard statistical and econometric terminology. log-likelihood).

Lemma 1 ([15]).

Proof.

Lemma 2 (Theorem 2[15]).

Proof.

Definition 5 (k-MLE-c problem).

Proposition 3 ([15]).

Proof.

Definition 6 (Partial Maximum [15]).

Theorem 4.

Proof.

III-B Convergence to Local Maxima

Definition 7 (maximizing solution set).

Lemma 5 (Lemma 6 [15]).

Proof.

Lemma 6 ([15, 33]).

Proposition 7 ([15]).

Proof.

Theorem 8.

Proof.

III-C Convergence under MLE Uniqueness

Theorem 9 (Corollary 2.5 [34]).

Proposition 10.

Proof.

Theorem 11.

Proof.

III-D Convergence of k-Bregman

Theorem 12 (Theorem 2.6 [34]).

Proposition 13.

Proof.

IV Clustering Autocorrelated Time Series with k-VARs

IV-A k-VARs Derivation

IV-B k-VARs Computation

IV-B1 Initialisation

IV-B2 Stopping Criteria

IV-B3 Fast Computation

IV-B4 Summary of k-VARs and Parallelism

IV-C k-VARs Convergence

IV-D Model Selection for k-VARs

V Simulation and Application to Real Data

V-A Baseline Methods and Performance Indices

V-B Comparisons on Simulated Datasets

V-C Effect of Non-Gaussian Driving Noise and of SNR

V-D Application to Real Dataset WAFER

VI Conclusion

Acknowledgments

Appendix A Proofs for Convergence Analysis

A-A Proof of Lemma 1

A-B Proof of Proposition 3

A-C Proof of Theorem 4

A-D Proof of Lemma 5

A-E Proof of Proposition 7

A-F Proof of Theorem 8

References

k-MLE, k-Bregman, k-VARs:
Theory, Convergence, Computation

Definition 4 (concentrated³³3This is standard statistical and econometric terminology. log-likelihood).