
On the fitting of mixtures of multivariate skew $t$-distributions via the EM algorithm

Sharon Lee, Geoffrey J. McLachlan
Department of Mathematics, The University of Queensland,
Brisbane, Australia
Abstract

We show how the expectation-maximization (EM) algorithm can be applied exactly for the fitting of mixtures of general multivariate skew $t$ (MST) distributions, eliminating the need for computationally expensive Monte Carlo estimation. Finite mixtures of MST distributions have proven useful in modelling heterogeneous data with asymmetric and heavy-tailed behaviour. Recently, they have been exploited as an effective tool for modelling flow cytometric data. However, without restrictions on the characterization of the component skew $t$-distributions, Monte Carlo methods have been used to fit these models. In this paper, we show how the EM algorithm can be implemented for the iterative computation of the maximum likelihood estimates of the model parameters, without resorting to Monte Carlo methods, for mixtures with unrestricted MST components. The fast calculation of the semi-infinite integrals on the E-step of the EM algorithm is effected by noting that they can be put in the form of moments of the truncated multivariate $t$-distribution, which subsequently can be expressed in terms of the (non-truncated) multivariate $t$-distribution function, for which fast algorithms are available. We demonstrate the usefulness of the proposed methodology with applications to three real data sets.

1 Introduction

Finite mixture distributions have become increasingly popular in the modelling and analysis of data owing to their flexibility. The use of finite mixture distributions to model heterogeneous data has undergone intensive development in the past decades, as witnessed by the numerous applications in various scientific fields such as bioinformatics, cluster analysis, genetics, information processing, medicine, and pattern recognition. Comprehensive surveys of mixture models and their applications can be found, for example, in the monographs by Everitt and Hand (1981), Titterington, Smith, and Makov (1985), Lindsay (1995), McLachlan and Basford (1988), and McLachlan and Peel (2000), among others; see also the papers by Banfield and Raftery (1993) and Fraley and Raftery (1999).

Mixtures of multivariate $t$-distributions, as proposed by McLachlan and Peel (1998, 2000), provide extra flexibility over normal mixtures. The thickness of the tails can be regulated by an additional parameter, the degrees of freedom, enabling the $t$-distribution to accommodate outliers better than the normal distribution. However, in many practical problems, the data often involve observations whose distributions are highly asymmetric as well as having longer tails than the normal, for example, data sets from flow cytometry (Pyne et al., 2009). Azzalini (1985) introduced the so-called skew-normal (SN) distribution for modelling asymmetry in data sets. Following the development of the SN and skew $t$-mixture models by Lin, Lee, and Yen (2007) and Lin, Lee, and Hsieh (2007), respectively, Basso et al. (2010) studied a class of mixture models where the component densities are scale mixtures of skew-normal distributions introduced by Branco and Dey (2001), which include the classical skew-normal and skew $t$-distributions as special cases. Recently, Cabral, Lachos, and Prates (2012) have extended the work of Basso et al. (2010) to the multivariate case.

In a study of automated flow cytometry analysis, Pyne et al. (2009) proposed a finite mixture of multivariate skew $t$-distributions based on a ‘restricted’ variant of the skew $t$-distribution introduced by Sahu, Dey, and Branco (2003). Lin (2010) considered a similar approach, but working with the original (unrestricted) characterization of Sahu et al. (2003). However, with this more general formulation, maximum likelihood (ML) estimation via the EM algorithm (Dempster, Laird, and Rubin, 1977) can no longer be implemented in closed form, due to the intractability of some of the conditional expectations involved on the E-step. To work around this, Lin (2010) proposed a Monte Carlo (MC) version of the E-step. One potential drawback of this approach is that the model-fitting procedure relies on MC estimates, which can be computationally expensive.

In this paper, we show how the EM algorithm can be implemented exactly to calculate the ML estimates of the parameters for the (unrestricted) multivariate skew $t$-mixture model, based on analytically reduced expressions for the conditional expectations that are suitable for numerical evaluation using readily available software. A key factor in being able to compute the integrals quickly by numerical means is the recognition that they can be expressed as moments of a truncated multivariate $t$-distribution, which in turn can be expressed in terms of the distribution function of a (non-truncated) multivariate central $t$-random vector, for which fast programs already exist. We show that the proposed algorithm is highly efficient compared to the version with an MC E-step: it produces highly accurate results, and for MC estimation to achieve comparable accuracy, a large number of draws would be necessary.

The remainder of the paper is organized as follows. In Section 2, for the sake of completeness, we include a brief description of the multivariate skew $t$ (MST) distribution used for defining the multivariate skew $t$-mixture model. We also describe the truncated $t$-distribution in the multivariate case, which is critical for the swift evaluation of the integrals on the E-step occurring in the calculation of some of the conditional expectations. Section 3 presents the development of an EM algorithm for obtaining ML estimates for the MST distribution. In the following section, the finite mixture of MST (FM-MST) distributions is defined. Section 5 presents an implementation of the EM algorithm for the fitting of the FM-MST model. An approximation to the observed information matrix is discussed in Section 6. Finally, we present some applications of the proposed methodology in Section 7.

2 Preliminaries

We begin by defining the multivariate skew $t$-distribution and briefly describing some related properties. Some alternative versions of the distribution are also discussed. Next, we introduce the truncated multivariate $t$-distribution and provide some formulas for computing its moments. These expressions are crucial for the swift evaluation of the conditional expectations on the E-step, to be discussed in the next section.

2.1 The Multivariate Skew $t$-Distribution

Following Sahu et al. (2003), a random vector $\boldsymbol{Y}$ is said to follow a $p$-dimensional (unrestricted) skew $t$-distribution with $p\times 1$ location vector $\boldsymbol{\mu}$, $p\times p$ scale matrix $\boldsymbol{\Sigma}$, $p\times 1$ skewness vector $\boldsymbol{\delta}$, and scalar degrees of freedom $\nu$, if its density is given by

$$f_{p}(\boldsymbol{y};\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\delta},\nu)=2^{p}\,t_{p,\nu}\left(\boldsymbol{y};\boldsymbol{\mu},\boldsymbol{\Omega}\right)T_{p,\nu+p}\left(\boldsymbol{y}^{*};\boldsymbol{0},\boldsymbol{\Lambda}\right), \qquad (1)$$

where

$$\begin{aligned}
\boldsymbol{\Delta} &= \mbox{diag}(\boldsymbol{\delta}),\\
\boldsymbol{\Omega} &= \boldsymbol{\Sigma}+\boldsymbol{\Delta}\boldsymbol{\Delta}^{T},\\
\boldsymbol{y}^{*} &= \boldsymbol{q}\sqrt{\frac{\nu+p}{\nu+d\left(\boldsymbol{y}\right)}},\\
\boldsymbol{q} &= \boldsymbol{\Delta}^{T}\boldsymbol{\Omega}^{-1}(\boldsymbol{y}-\boldsymbol{\mu}),\\
d\left(\boldsymbol{y}\right) &= (\boldsymbol{y}-\boldsymbol{\mu})^{T}\boldsymbol{\Omega}^{-1}(\boldsymbol{y}-\boldsymbol{\mu}),\\
\boldsymbol{\Lambda} &= \boldsymbol{I}_{p}-\boldsymbol{\Delta}^{T}\boldsymbol{\Omega}^{-1}\boldsymbol{\Delta}.
\end{aligned}$$

Here the operator $\mbox{diag}(\boldsymbol{\delta})$ denotes a diagonal matrix with diagonal elements specified by the vector $\boldsymbol{\delta}$. Also, we let $t_{p,\nu}(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma})$ be the $p$-dimensional $t$-density with location vector $\boldsymbol{\mu}$, scale matrix $\boldsymbol{\Sigma}$, and degrees of freedom $\nu$, and $T_{p,\nu}(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma})$ the corresponding (cumulative) distribution function. The notation $\boldsymbol{Y}\sim\mbox{ST}_{p,\nu}(\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\delta})$ will be used. Note that when $\boldsymbol{\delta}=\boldsymbol{0}$, (1) reduces to the symmetric $t$-density $t_{p,\nu}(\boldsymbol{y};\boldsymbol{\mu},\boldsymbol{\Sigma})$. Also, as $\nu\rightarrow\infty$, we obtain the skew-normal distribution.
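Since all the quantities in (1) are built from the multivariate $t$-density and distribution functions, the density can be evaluated directly in R with the mvtnorm package. The following is a minimal sketch under our own naming (dmst and its arguments are not from the authors' code), and it assumes integer degrees of freedom, which mvtnorm's pmvt requires.

```r
library(mvtnorm)

## Minimal sketch of the MST density (1); dmst is our own name, not the
## authors' implementation. pmvt() in mvtnorm accepts only integer df.
dmst <- function(y, mu, Sigma, delta, nu) {
  p      <- length(y)
  Delta  <- diag(delta, nrow = p)                    # Delta = diag(delta)
  Omega  <- Sigma + Delta %*% t(Delta)               # Omega = Sigma + Delta Delta^T
  Oinv   <- solve(Omega)
  q      <- as.vector(t(Delta) %*% Oinv %*% (y - mu))
  d      <- as.numeric(t(y - mu) %*% Oinv %*% (y - mu))
  Lambda <- diag(p) - t(Delta) %*% Oinv %*% Delta    # Lambda = I_p - Delta^T Omega^{-1} Delta
  ystar  <- q * sqrt((nu + p) / (nu + d))
  2^p * exp(dmvt(y, delta = mu, sigma = Omega, df = nu, log = TRUE)) *
    pmvt(upper = ystar, sigma = Lambda, df = nu + p)[1]
}
```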

The MST distribution admits a convenient hierarchical form,

$$\begin{aligned}
\boldsymbol{Y}\mid\boldsymbol{u},w &\sim N_{p}\left(\boldsymbol{\mu}+\boldsymbol{\Delta}\boldsymbol{u},\textstyle\frac{1}{w}\boldsymbol{\Sigma}\right),\\
\boldsymbol{U}\mid w &\sim HN_{p}\left(\boldsymbol{0},\textstyle\frac{1}{w}\boldsymbol{I}_{p}\right),\\
W &\sim \mbox{gamma}\left(\textstyle\frac{\nu}{2},\textstyle\frac{\nu}{2}\right),
\end{aligned} \qquad (2)$$

where $\boldsymbol{I}_{p}$ is the $p\times p$ identity matrix, $N_{p}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ denotes the multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, $HN_{p}(\boldsymbol{0},\boldsymbol{\Sigma})$ represents the $p$-dimensional half-normal distribution with location parameter $\boldsymbol{0}$ and scale matrix $\boldsymbol{\Sigma}$, and $\mbox{gamma}(\alpha,\beta)$ is the gamma distribution with mean $\alpha/\beta$.

We observe from (2) that the MST distribution (1) admits the following stochastic representation. Suppose that, conditional on the value $w$ of the gamma random variable $W$,

$$\left(\begin{array}{c}\boldsymbol{U}_{0}\\ \boldsymbol{U}\end{array}\right)\sim N_{2p}\left(\left(\begin{array}{c}\boldsymbol{\mu}\\ \boldsymbol{0}\end{array}\right),\left(\begin{array}{cc}\boldsymbol{\Sigma}/w&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{I}_{p}/w\end{array}\right)\right), \qquad (3)$$

where $\boldsymbol{I}_{p}$ denotes the $p\times p$ identity matrix, $\boldsymbol{0}$ denotes the zero vector (or matrix) of appropriate dimension, and $\boldsymbol{U}_{0}$ is a $p$-dimensional random vector. Then

$$\boldsymbol{Y}=\boldsymbol{\Delta}\left|\boldsymbol{U}\right|+\boldsymbol{U}_{0} \qquad (4)$$

has the multivariate skew $t$-density (1). In the above, $\left|\boldsymbol{U}\right|$ denotes the vector whose $i$th element is the magnitude of the $i$th element of $\boldsymbol{U}$. It is important to note that, although also known as the multivariate skew $t$-distribution, the versions considered by Azzalini and Dalla Valle (1996), Gupta (2003), and Lachos, Ghosh, and Arellano-Valle (2010), among others, are different from (1). These versions are simpler in that the skew $t$-density is defined in terms involving only the univariate $t$-distribution function instead of the multivariate form of the latter as used in (1). Recently, Pyne et al. (2009) proposed a simplified version of the skew $t$-density (1), obtained by replacing the term $\boldsymbol{\Delta}\left|\boldsymbol{U}\right|$ in (4) with $\boldsymbol{\delta}\left|U\right|$, where $U$ is a univariate central $t$-random variable with $\nu$ degrees of freedom, leading to the reduced skew $t$-density

$$2^{p}\,t_{p,\nu}\left(\boldsymbol{y};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)T_{1,\nu+p}\left(y_{1}^{*};0,1\right), \qquad (5)$$

where $y_{1}^{*}=\left[\left(\nu+p\right)/\left(\nu+d\left(\boldsymbol{y}\right)\right)\right]^{\frac{1}{2}}\boldsymbol{\delta}^{T}\boldsymbol{\Omega}^{-1}(\boldsymbol{y}-\boldsymbol{\mu})$. We shall refer to this characterization of the skew $t$-distribution as the ‘restricted’ multivariate skew $t$ (rMST) distribution. One immediate consequence of this type of ‘simplification’ is that the correlation structure of the original symmetric model is affected by the introduction of skewness, whereas for (1) the correlation structure remains the same, as noted in Arellano-Valle, Bolfarine, and Lachos (2007). Nevertheless, one major advantage of simplified forms like (5) is that the calculations on the E-step can be expressed in closed form. However, the form of skewness is limited in these characterizations. Here, we extend their approach to the more general form of the skew $t$-density proposed by Sahu et al. (2003).
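The stochastic representation (3)-(4) also yields a simple way of simulating MST data, which is useful for checking an implementation. Below is a minimal R sketch under the hierarchical form (2); the name rmst is our own choice, and mvtnorm::rmvnorm handles the correlated normal part.

```r
library(mvtnorm)

## Minimal sketch of simulation from the unrestricted MST distribution via the
## hierarchical form (2) and representation (4); rmst is our own name.
rmst <- function(n, mu, Sigma, delta, nu) {
  p <- length(mu)
  t(replicate(n, {
    w  <- rgamma(1, shape = nu / 2, rate = nu / 2)            # W ~ gamma(nu/2, nu/2)
    u  <- abs(rnorm(p) / sqrt(w))                             # |U|, with U | w ~ N_p(0, I_p/w)
    u0 <- as.vector(rmvnorm(1, mean = mu, sigma = Sigma / w)) # U_0 | w ~ N_p(mu, Sigma/w)
    delta * u + u0                                            # Y = Delta |U| + U_0
  }))
}

## Example: y <- rmst(1000, mu = c(0, 0), Sigma = diag(2), delta = c(3, 1), nu = 4)
```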

2.2 The truncated multivariate $t$-distribution

Let $\boldsymbol{X}$ be a $p$-dimensional random vector having a multivariate $t$-distribution with location vector $\boldsymbol{\mu}$, scale matrix $\boldsymbol{\Sigma}$, and $\nu$ degrees of freedom. Truncating $\boldsymbol{X}$ to the region $\mathbb{A}=\left\{\boldsymbol{x}\leq\boldsymbol{a},\;\boldsymbol{a}\in\mathbb{R}^{p}\right\}$, where $\boldsymbol{x}\leq\boldsymbol{a}$ means that each element $x_{i}=(\boldsymbol{x})_{i}$ is less than or equal to $a_{i}=(\boldsymbol{a})_{i}$ for $i=1,\ldots,p$, results in a right-truncated $t$-distribution whose density is given by

$$f_{\mathbb{A}}(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)=T_{p,\nu}^{-1}\left(\boldsymbol{a};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)t_{p,\nu}\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}\right),\qquad\boldsymbol{x}\in\mathbb{A}. \qquad (6)$$

For a random vector $\boldsymbol{X}$ with density (6), we write $\boldsymbol{X}\sim tt_{p,\nu}\left(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbb{A}\right)$. For our purposes, we will be concerned with the first two moments of $\boldsymbol{X}$, specifically $E(\boldsymbol{X})$ and $E(\boldsymbol{X}\boldsymbol{X}^{T})$. Explicit formulas for the moments of the truncated central $t$-distribution in the univariate case $tt_{1,\nu}\left(0,\sigma^{2};\mathbb{A}\right)$ were provided by O’Hagan (1973), who expressed them in terms of the non-truncated $t$-distribution. The multivariate case was studied in O’Hagan (1976), but again only for the central case. Here we generalize these results to the case of a non-zero location vector.

Before presenting the expressions, it will be convenient to introduce some notation. Let $\boldsymbol{x}$ be a vector, where $x_{i}$ denotes its $i$th element and $\boldsymbol{x}_{ij}$ is the two-dimensional vector with elements $x_{i}$ and $x_{j}$. Also, $\boldsymbol{x}_{-i}$ and $\boldsymbol{x}_{-ij}$ denote the $(p-1)$- and $(p-2)$-dimensional vectors, respectively, obtained by removing the corresponding elements. For a matrix $\boldsymbol{X}$, $x_{ij}$ denotes its $ij$th element, and $\boldsymbol{X}_{ij}$ is the $2\times 2$ matrix consisting of the elements $x_{ii}$, $x_{ij}$, $x_{ji}$, and $x_{jj}$. The matrix $\boldsymbol{X}_{-i}$ is created by removing the $i$th row and column from $\boldsymbol{X}$. Similarly, $\boldsymbol{X}_{-ij}$ is the $(p-2)$-dimensional square matrix resulting from the removal of the $i$th and $j$th rows and columns of $\boldsymbol{X}$. Lastly, $\boldsymbol{X}_{(ij)}$ consists of the $i$th and $j$th columns of $\boldsymbol{X}$ with the elements of $\boldsymbol{X}_{ij}$ removed, yielding a $(p-2)\times 2$ matrix. We now proceed to the expressions for the first two moments of $\boldsymbol{X}$.

With some effort, one can show that the first moment of (6) is

$$E\left(\boldsymbol{X}\right)=\boldsymbol{\mu}-c^{-1}\boldsymbol{\Sigma}\boldsymbol{\xi}=\boldsymbol{\mu}-\boldsymbol{\mu}^{*}, \qquad (7)$$

where $c=T_{p,\nu}\left(\boldsymbol{a}-\boldsymbol{\mu};\boldsymbol{0},\boldsymbol{\Sigma}\right)$, and $\boldsymbol{\xi}$ is a $p\times 1$ vector with elements

$$\xi_{i}=\left(2\pi\sigma_{ii}\right)^{-\frac{1}{2}}\left(\frac{\nu}{\nu+\sigma_{ii}^{-1}(a_{i}-\mu_{i})^{2}}\right)^{\frac{\nu-1}{2}}\frac{\Gamma\left(\frac{\nu-1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\sqrt{\frac{\nu}{2}}\;T_{p-1,\nu-1}\left(\boldsymbol{a}^{*};\boldsymbol{0},\boldsymbol{\Sigma}^{*}\right),$$

for $i=1,\ldots,p$, and where

$$\boldsymbol{a}^{*}=\left(\boldsymbol{a}_{-i}-\boldsymbol{\mu}_{-i}\right)-\left(a_{i}-\mu_{i}\right)\sigma_{ii}^{-1}\boldsymbol{\Sigma}_{(i)}$$
and
$$\boldsymbol{\Sigma}^{*}=\left(\frac{\nu+\sigma_{ii}^{-1}\left(a_{i}-\mu_{i}\right)^{2}}{\nu-1}\right)\left(\boldsymbol{\Sigma}_{-i}-\frac{1}{\sigma_{ii}}\boldsymbol{\Sigma}_{(i)}\boldsymbol{\Sigma}_{(i)}^{T}\right).$$

The second moment is given by

$$\begin{aligned}
E\left(\boldsymbol{X}\boldsymbol{X}^{T}\right) &= \boldsymbol{\mu}\boldsymbol{\mu}^{T}-\boldsymbol{\mu}\boldsymbol{\mu}^{*T}-\boldsymbol{\mu}^{*}\boldsymbol{\mu}^{T}-c^{-1}\boldsymbol{\Sigma}\boldsymbol{H}\boldsymbol{\Sigma}\\
&\quad+c^{-1}\left(\textstyle\frac{\nu}{\nu-2}\right)T_{p,\nu-2}\left(\boldsymbol{a}-\boldsymbol{\mu};\boldsymbol{0},\left(\textstyle\frac{\nu}{\nu-2}\right)\boldsymbol{\Sigma}\right)\boldsymbol{\Sigma},
\end{aligned} \qquad (8)$$

where $\boldsymbol{H}$ is a $p\times p$ matrix with off-diagonal elements

$$h_{ij}=-\frac{1}{2\pi\sqrt{\sigma_{ii}\sigma_{jj}-\sigma_{ij}^{2}}}\left(\frac{\nu}{\nu-2}\right)\left(\frac{\nu}{\nu^{*}}\right)^{\frac{\nu}{2}-1}T_{p-2,\nu-2}\left(\boldsymbol{a}^{**};\boldsymbol{0},\boldsymbol{\Sigma}^{**}\right),\qquad i\neq j,$$

and diagonal elements,

$$h_{ii}=\sigma_{ii}^{-1}(a_{i}-\mu_{i})\,\xi_{i}-\sigma_{ii}^{-1}\sum_{j\neq i}\sigma_{ij}h_{ij},$$
and where
$$\begin{aligned}
\nu^{*} &= \nu+\left(\boldsymbol{a}_{ij}-\boldsymbol{\mu}_{ij}\right)^{T}\boldsymbol{\Sigma}_{ij}^{-1}\left(\boldsymbol{a}_{ij}-\boldsymbol{\mu}_{ij}\right),\\
\boldsymbol{a}^{**} &= \left(\boldsymbol{a}_{-ij}-\boldsymbol{\mu}_{-ij}\right)-\boldsymbol{\Sigma}_{(ij)}\boldsymbol{\Sigma}_{ij}^{-1}\left(\boldsymbol{a}_{ij}-\boldsymbol{\mu}_{ij}\right),\\
\boldsymbol{\Sigma}^{**} &= \frac{\nu^{*}}{\nu-2}\left(\boldsymbol{\Sigma}_{-ij}-\boldsymbol{\Sigma}_{(ij)}\boldsymbol{\Sigma}_{ij}^{-1}\boldsymbol{\Sigma}_{(ij)}^{T}\right).
\end{aligned}$$

It is worth noting that the evaluation of the expressions (7) and (8) relies on algorithms for computing the multivariate central $t$-distribution function, for which highly efficient procedures are readily available in many statistical packages. For example, an implementation of Genz’s algorithm (Genz and Bretz, 2002; Kotz and Nadarajah, 2004) is provided by the mvtnorm package available from the R website.
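To make this concrete, the following R sketch evaluates the first moment (7) directly from the expressions for $c$, $\xi_{i}$, $\boldsymbol{a}^{*}$, and $\boldsymbol{\Sigma}^{*}$ above. It is our own illustrative code, not the authors' implementation, and it assumes integer degrees of freedom since mvtnorm's pmvt requires an integer df argument.

```r
library(mvtnorm)

## Hedged sketch of E(X) for X ~ tt_{p,nu}(mu, Sigma; {x <= a}), following (7):
## E(X) = mu - c^{-1} Sigma xi, with c = T_{p,nu}(a - mu; 0, Sigma).
tmvt.mean <- function(a, mu, Sigma, nu) {
  p  <- length(mu)
  c0 <- pmvt(upper = a - mu, sigma = Sigma, df = nu)[1]
  xi <- numeric(p)
  for (i in 1:p) {
    sii   <- Sigma[i, i]
    ai    <- a[i] - mu[i]
    Sci   <- Sigma[-i, i]                          # Sigma_(i): i-th column, i-th row removed
    astar <- (a[-i] - mu[-i]) - ai * Sci / sii     # a* as defined above
    Sstar <- ((nu + ai^2 / sii) / (nu - 1)) *
             (Sigma[-i, -i, drop = FALSE] - tcrossprod(Sci) / sii)  # Sigma*
    Tt    <- if (p > 1) pmvt(upper = astar, sigma = Sstar, df = nu - 1)[1] else 1
    xi[i] <- (2 * pi * sii)^(-0.5) *
             (nu / (nu + ai^2 / sii))^((nu - 1) / 2) *
             gamma((nu - 1) / 2) / gamma(nu / 2) * sqrt(nu / 2) * Tt
  }
  as.vector(mu - Sigma %*% xi / c0)                # E(X) = mu - c^{-1} Sigma xi
}
```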

3 ML Estimation for the MST Distribution

In this section, we describe an EM algorithm for ML estimation of the MST distribution specified by (1). To apply the EM algorithm, the observed data vector $\boldsymbol{y}=\left(\boldsymbol{y}_{1}^{T},\ldots,\boldsymbol{y}_{n}^{T}\right)^{T}$ is regarded as incomplete, and we introduce two latent variables, denoted by $\boldsymbol{u}$ and $w$, as defined by (2). We let $\boldsymbol{\theta}$ be the parameter vector containing the elements of the location parameter $\boldsymbol{\mu}$, the distinct elements of the scale matrix $\boldsymbol{\Sigma}$, the elements of the skewness parameter $\boldsymbol{\delta}$, and the degrees of freedom $\nu$. It follows that the complete-data log-likelihood function for $\boldsymbol{\theta}$ is given by

$$\begin{aligned}
\log L_{c}(\boldsymbol{\theta};\boldsymbol{y},\boldsymbol{u},w) &= K-\textstyle\frac{1}{2}n\log\left|\boldsymbol{\Sigma}\right|-n\log\Gamma\left(\textstyle\frac{1}{2}\nu\right)+\textstyle\frac{1}{2}n\nu\log\left(\textstyle\frac{1}{2}\nu\right)\\
&\quad-\textstyle\frac{1}{2}w\left(d\left(\boldsymbol{y}\right)+\left(\boldsymbol{u}-\boldsymbol{q}\right)^{T}\boldsymbol{\Lambda}^{-1}\left(\boldsymbol{u}-\boldsymbol{q}\right)\right)\\
&\quad+\left(\textstyle\frac{1}{2}\nu+p-1\right)\log(w),
\end{aligned} \qquad (9)$$

where $K$ does not depend on $\boldsymbol{\theta}$.

The implementation of the EM algorithm requires the E- and M-steps to be alternated repeatedly until convergence, which occurs when the sequence of log-likelihood values $\{L(\boldsymbol{\theta}^{(k)})\}$ is bounded above. Here $\boldsymbol{\theta}^{(k)}$ denotes the value of $\boldsymbol{\theta}$ after the $k$th iteration.

On the $(k+1)$th iteration, the E-step requires the calculation of the conditional expectation of the complete-data log likelihood given the observed data $\boldsymbol{y}$, using the current estimate $\boldsymbol{\theta}^{(k)}$ for $\boldsymbol{\theta}$. That is, we have to calculate the so-called $Q$-function defined by

$$Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(k)})=E_{\boldsymbol{\theta}^{(k)}}\left\{\log L_{c}(\boldsymbol{\theta};\boldsymbol{y},\boldsymbol{u},w)\mid\boldsymbol{y}\right\}, \qquad (10)$$

where $E_{\boldsymbol{\theta}^{(k)}}$ denotes the expectation operator using $\boldsymbol{\theta}^{(k)}$ for $\boldsymbol{\theta}$. This, in effect, requires the calculation of the conditional expectations

$$\begin{aligned}
e_{1,j}^{(k)} &= E_{\boldsymbol{\theta}^{(k)}}\left\{\log(W_{j})\mid\boldsymbol{y}_{j}\right\},\\
e_{2,j}^{(k)} &= E_{\boldsymbol{\theta}^{(k)}}\left\{W_{j}\mid\boldsymbol{y}_{j}\right\},\\
\boldsymbol{e}_{3,j}^{(k)} &= E_{\boldsymbol{\theta}^{(k)}}\left\{W_{j}\boldsymbol{U}_{j}\mid\boldsymbol{y}_{j}\right\},\\
\boldsymbol{e}_{4,j}^{(k)} &= E_{\boldsymbol{\theta}^{(k)}}\left\{W_{j}\boldsymbol{U}_{j}\boldsymbol{U}_{j}^{T}\mid\boldsymbol{y}_{j}\right\}.
\end{aligned}$$

Note that the $Q$-function does not admit a closed-form expression for this problem, because the conditional expectations $e_{1,j}^{(k)}$, $\boldsymbol{e}_{3,j}^{(k)}$, and $\boldsymbol{e}_{4,j}^{(k)}$ cannot be evaluated in closed form.

Concerning the calculation of the expectation $e_{1,j}^{(k)}$, the conditional density of $W_{j}$ given $\boldsymbol{y}_{j}$ is given by

$$f(w_{j}\mid\boldsymbol{y}_{j})=\frac{\Gamma\left(w_{j};\frac{\nu^{(k)}+p}{2},\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{2}\right)\Phi_{p}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{w_{j}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}, \qquad (11)$$

where

$$\begin{aligned}
\boldsymbol{y}_{j}^{*(k)} &= \boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}},\\
\boldsymbol{q}_{j}^{(k)} &= \boldsymbol{\Delta}^{(k)T}\boldsymbol{\Omega}^{(k)-1}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k)}\right),\\
d^{(k)}(\boldsymbol{y}_{j}) &= \left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k)}\right)^{T}\boldsymbol{\Omega}^{(k)-1}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k)}\right),
\end{aligned}$$

and $\boldsymbol{0}$ is the zero vector of appropriate dimension.

The conditional expectation $E_{\boldsymbol{\theta}^{(k)}}\left\{\log(W_{j})\mid\boldsymbol{y}_{j}\right\}$ can be reduced to

$$\begin{aligned}
e_{1,j}^{(k)} &= \left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p+2}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\\
&\quad-\log\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{2}\right)-\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)+\psi\left(\frac{\nu^{(k)}+p}{2}\right)+S,
\end{aligned} \qquad (12)$$

where the last term $S$ is given by

$$\begin{aligned}
S &= \psi\left(\frac{\nu^{(k)}}{2}+p\right)-\psi\left(\frac{\nu^{(k)}+p}{2}\right)+\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\\
&\quad-\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p+2}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\\
&\quad-\frac{\left[\pi\left(\nu^{(k)}+p\right)\right]^{-\frac{p}{2}}\left|\boldsymbol{\Lambda}^{(k)}\right|^{-\frac{1}{2}}}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\frac{\Gamma\left(\frac{\nu^{(k)}}{2}+p\right)}{\Gamma\left(\frac{\nu^{(k)}+p}{2}\right)}S_{1,j}^{(k)},
\end{aligned} \qquad (13)$$

and $S_{1,j}^{(k)}$ is an integral given by

$$\begin{aligned}
S_{1,j}^{(k)} &= \int_{-\infty}^{\left[\boldsymbol{q}_{j}^{(k)}\right]_{1}}\int_{-\infty}^{\left[\boldsymbol{q}_{j}^{(k)}\right]_{2}}\cdots\int_{-\infty}^{\left[\boldsymbol{q}_{j}^{(k)}\right]_{p}}\log\left(1+\frac{\boldsymbol{s}^{T}\boldsymbol{\Lambda}^{(k)-1}\boldsymbol{s}}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\\
&\qquad\times\left[1+\frac{\boldsymbol{s}^{T}\boldsymbol{\Lambda}^{(k)-1}\boldsymbol{s}}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right]^{-\left(\frac{\nu^{(k)}}{2}+p\right)}ds_{1}\,ds_{2}\cdots ds_{p},
\end{aligned} \qquad (14)$$

and $\psi(\cdot)$ denotes the digamma function.

Combining (12) and (13), $e_{1,j}^{(k)}$ can be reduced to

$$\begin{aligned}
e_{1,j}^{(k)} &= \psi\left(\frac{\nu^{(k)}}{2}+p\right)-\log\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{2}\right)\\
&\quad-T_{p,\nu^{(k)}+p}^{-1}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)S_{1,j}^{(k)}.
\end{aligned} \qquad (15)$$

We note that the term $S$ will be very small in practice, since it would be exactly zero if we adopted a one-step-late (OSL) EM algorithm (Green, 1990), in which case there would be no need to calculate the multiple integral $S_{1,j}^{(k)}$ in (13). Hence $e_{1,j}^{(k)}$ can then be reduced to

$$\begin{aligned}
e_{1,j}^{(k)} &= \left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p+2}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\\
&\quad-\log\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{2}\right)-\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)+\psi\left(\frac{\nu^{(k)}+p}{2}\right).
\end{aligned} \qquad (16)$$

It can easily be shown that $e_{2,j}^{(k)}$ can be written in closed form (see, for example, Lin (2010)); it is given by

$$e_{2,j}^{(k)}=\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p+2}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}. \qquad (17)$$
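For illustration, (17) amounts to a ratio of two multivariate $t$-distribution function values. Below is a minimal R sketch, assuming q, d, and Lambda hold the current values of $\boldsymbol{q}_{j}^{(k)}$, $d^{(k)}(\boldsymbol{y}_{j})$, and $\boldsymbol{\Lambda}^{(k)}$ (and, again, integer df for mvtnorm's pmvt):

```r
## Sketch of the closed-form expectation (17); e2 is our own function name.
e2 <- function(q, d, Lambda, nu, p) {
  num <- pmvt(upper = q * sqrt((nu + p + 2) / (nu + d)),
              sigma = Lambda, df = nu + p + 2)[1]
  den <- pmvt(upper = q * sqrt((nu + p) / (nu + d)),
              sigma = Lambda, df = nu + p)[1]
  ((nu + p) / (nu + d)) * num / den
}
```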

To obtain $\boldsymbol{e}_{3,j}^{(k)}$ and $\boldsymbol{e}_{4,j}^{(k)}$, first note that the joint density of $\boldsymbol{y}_{j}$, $\boldsymbol{u}_{j}$, and $w_{j}$ is given by

$$\begin{aligned}
f(\boldsymbol{y}_{j},\boldsymbol{u}_{j},w_{j}) &= \pi^{-p}\,\Gamma\left(\frac{\nu^{(k)}}{2}\right)^{-1}\left(\frac{\nu^{(k)}}{2}\right)^{\frac{\nu^{(k)}}{2}}w_{j}^{\left(\frac{\nu^{(k)}}{2}+p-1\right)}\\
&\quad\times e^{-\frac{w_{j}}{2}\left[\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})+\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{j}^{(k)}\right)^{T}\boldsymbol{\Lambda}^{(k)-1}\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{j}^{(k)}\right)\right]}.
\end{aligned} \qquad (18)$$

Using Bayes’ rule, the conditional density of $\boldsymbol{u}_{j}$ and $w_{j}$ given $\boldsymbol{y}_{j}$ can be written as

$$f(\boldsymbol{u}_{j},w_{j}\mid\boldsymbol{y}_{j})=\frac{w_{j}^{\frac{p}{2}}\,\Gamma\left(w_{j};\frac{\nu^{(k)}+p}{2},\frac{\nu^{(k)}+d^{(k)}\left(\boldsymbol{y}_{j}\right)}{2}\right)e^{-\frac{w_{j}}{2}\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{j}^{(k)}\right)^{T}\boldsymbol{\Lambda}^{(k)-1}\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{j}^{(k)}\right)}}{(2\pi)^{\frac{p}{2}}\left|\boldsymbol{\Lambda}^{(k)}\right|^{\frac{1}{2}}T_{p,\nu^{(k)}+p}\left(\boldsymbol{q}_{j}^{(k)}\sqrt{\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}\left(\boldsymbol{y}_{j}\right)}};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}.$$

From (3), standard conditional expectation calculations yield

$$\boldsymbol{e}_{3,j}^{(k)}=\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}\left(\boldsymbol{y}_{j}\right)}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)};\boldsymbol{0},\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{\nu^{(k)}+p+2}\right)\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\boldsymbol{S}_{2,j}^{(k)}=e_{2,j}^{(k)}\boldsymbol{S}_{2,j}^{(k)}, \qquad (19)$$

where $\boldsymbol{S}_{2,j}^{(k)}$ represents the expected value of a truncated $p$-dimensional $t$-variate $\boldsymbol{X}_{j}$, which is distributed as

$$\boldsymbol{X}_{j}\sim tt_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)},\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{\nu^{(k)}+p+2}\right)\boldsymbol{\Lambda}^{(k)};\mathbb{R}^{+}\right). \qquad (20)$$

That is, the random vector $\boldsymbol{X}_{j}$ is truncated to lie in the positive orthant $\mathbb{R}^{+}$.

Analogously, $\boldsymbol{e}_{4,j}^{(k)}$ can be reduced to

$$\boldsymbol{e}_{4,j}^{(k)}=\left(\frac{\nu^{(k)}+p}{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}\right)\frac{T_{p,\nu^{(k)}+p+2}\left(\boldsymbol{q}_{j}^{(k)};\boldsymbol{0},\left(\frac{\nu^{(k)}+d^{(k)}(\boldsymbol{y}_{j})}{\nu^{(k)}+p+2}\right)\boldsymbol{\Lambda}^{(k)}\right)}{T_{p,\nu^{(k)}+p}\left(\boldsymbol{y}_{j}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}^{(k)}\right)}\boldsymbol{S}_{3,j}^{(k)}=e_{2,j}^{(k)}\boldsymbol{S}_{3,j}^{(k)}, \qquad (21)$$

where $\boldsymbol{S}_{3,j}^{(k)}$ represents the second moment of $\boldsymbol{X}_{j}$. The truncated moments $\boldsymbol{S}_{2,j}^{(k)}$ and $\boldsymbol{S}_{3,j}^{(k)}$ can be swiftly evaluated using the expressions (7) and (8) in Section 2.2.

3.1 M-step

On the $(k+1)$th iteration, the M-step consists of the maximization of the $Q$-function (10) with respect to $\boldsymbol{\theta}$. For easier computation, we employ the ECM extension of the EM algorithm, in which the M-step is replaced by four conditional-maximization (CM) steps, corresponding to the decomposition of $\boldsymbol{\theta}$ into four subvectors, $\boldsymbol{\theta}=(\boldsymbol{\theta}_{1}^{T},\boldsymbol{\theta}_{2}^{T},\boldsymbol{\theta}_{3}^{T},\theta_{4})^{T}$, where $\boldsymbol{\theta}_{1}=\boldsymbol{\mu}$, $\boldsymbol{\theta}_{2}=\boldsymbol{\delta}$, $\boldsymbol{\theta}_{3}$ is the vector containing the distinct elements of $\boldsymbol{\Sigma}$, and $\theta_{4}=\nu$. To compute $\boldsymbol{\mu}^{(k+1)}$, we maximize $Q(\boldsymbol{\mu},\boldsymbol{\theta}_{2}^{(k)},\boldsymbol{\theta}_{3}^{(k)},\theta_{4}^{(k)};\boldsymbol{\theta}^{(k)})$ with respect to $\boldsymbol{\mu}$; to compute $\boldsymbol{\delta}^{(k+1)}$, we first update $\boldsymbol{\mu}$ to $\boldsymbol{\mu}^{(k+1)}$ and then maximize $Q(\boldsymbol{\mu}^{(k+1)},\boldsymbol{\delta},\boldsymbol{\theta}_{3}^{(k)},\theta_{4}^{(k)};\boldsymbol{\theta}^{(k)})$ with respect to $\boldsymbol{\delta}$; and so on.

We let $\mbox{DIAG}(\boldsymbol{A})$ denote the operator that produces a vector by extracting the diagonal elements of $\boldsymbol{A}$. Straightforward algebraic manipulations lead to the following closed-form expressions for $\boldsymbol{\mu}^{(k+1)}$, $\boldsymbol{\delta}^{(k+1)}$, and $\boldsymbol{\Sigma}^{(k+1)}$:

$$\boldsymbol{\mu}^{(k+1)}=\frac{\sum_{j=1}^{n}\left[e_{2,j}^{(k)}\boldsymbol{y}_{j}-\boldsymbol{\Delta}^{(k)}\boldsymbol{e}_{3,j}^{(k)}\right]}{\sum_{j=1}^{n}e_{2,j}^{(k)}}, \qquad (22)$$
$$\boldsymbol{\delta}^{(k+1)}=\left(\boldsymbol{\Sigma}^{(k)-1}\odot\sum_{j=1}^{n}\boldsymbol{e}_{4,j}^{(k)}\right)^{-1}\mbox{DIAG}\left(\boldsymbol{\Sigma}^{(k)-1}\sum_{j=1}^{n}(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k)})\boldsymbol{e}_{3,j}^{(k)T}\right), \qquad (23)$$
and
$$\begin{aligned}
\boldsymbol{\Sigma}^{(k+1)} &= \frac{1}{n}\sum_{j=1}^{n}\left[\boldsymbol{\Delta}^{(k+1)}\boldsymbol{e}_{4,j}^{(k)T}\boldsymbol{\Delta}^{(k+1)T}-\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k+1)}\right)\boldsymbol{e}_{3,j}^{(k)T}\boldsymbol{\Delta}^{(k+1)}\right.\\
&\quad\left.+\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k+1)}\right)\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k+1)}\right)^{T}e_{2,j}^{(k)}-\boldsymbol{\Delta}^{(k+1)}\boldsymbol{e}_{3,j}^{(k)}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}^{(k+1)}\right)^{T}\right],
\end{aligned} \qquad (24)$$

where \odot denotes the Hadamard or element-wise product, and 𝚫(k+1)=diag(𝜹(k+1))\boldsymbol{\Delta}^{(k+1)}=\mbox{diag}\left(\boldsymbol{\delta}^{(k+1)}\right).

An updated estimate $\nu^{(k+1)}$ of the degrees of freedom is obtained by solving the equation

$$\log\left(\frac{\nu^{(k+1)}}{2}\right)-\psi\left(\frac{\nu^{(k+1)}}{2}\right)+1=\frac{1}{n}\sum_{j=1}^{n}\left(e_{2,j}^{(k)}-e_{1,j}^{(k)}\right). \qquad (25)$$
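Equation (25) is a single equation in the one unknown $\nu^{(k+1)}$, so in R it can be solved with a scalar root finder. In the sketch below, rhs denotes the right-hand side of (25); the search interval is an assumption of ours.

```r
## Sketch of the degrees-of-freedom update: solve (25) for nu by root finding.
update.nu <- function(rhs, lower = 0.01, upper = 200) {
  f <- function(nu) log(nu / 2) - digamma(nu / 2) + 1 - rhs
  uniroot(f, interval = c(lower, upper))$root
}
```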

In summary, the ECM algorithm proceeds as follows on the $(k+1)$th iteration:

E-step: Given $\boldsymbol{\theta}=\boldsymbol{\theta}^{(k)}$, compute the four conditional expectations $e_{1,j}^{(k)}$, $e_{2,j}^{(k)}$, $\boldsymbol{e}_{3,j}^{(k)}$, and $\boldsymbol{e}_{4,j}^{(k)}$ using (15), (17), (19), and (21), respectively, for $j=1,\ldots,n$.

M-step: Update $\boldsymbol{\mu}^{(k+1)}$, $\boldsymbol{\delta}^{(k+1)}$, and $\boldsymbol{\Sigma}^{(k+1)}$ using (22), (23), and (24). Calculate $\nu^{(k+1)}$ by solving (25).

4 The Multivariate Skew $t$-Mixture Model

The probability density function (pdf) of a finite mixture of $g$ multivariate skew $t$-components, using the notation above, is given by

$$f\left(\boldsymbol{y};\boldsymbol{\Psi}\right)=\sum_{h=1}^{g}\pi_{h}f_{p}\left(\boldsymbol{y};\boldsymbol{\mu}_{h},\boldsymbol{\Sigma}_{h},\boldsymbol{\delta}_{h},\nu_{h}\right), \qquad (26)$$

where $f_{p}\left(\boldsymbol{y};\boldsymbol{\mu}_{h},\boldsymbol{\Sigma}_{h},\boldsymbol{\delta}_{h},\nu_{h}\right)$ denotes the $h$th MST component of the mixture model, as defined by (1), with location parameter $\boldsymbol{\mu}_{h}$, scale matrix $\boldsymbol{\Sigma}_{h}$, skewness parameter $\boldsymbol{\delta}_{h}$, and degrees of freedom $\nu_{h}$. The mixing proportions $\pi_{h}$ satisfy $\pi_{h}\geq 0$ $(h=1,\ldots,g)$ and $\sum_{h=1}^{g}\pi_{h}=1$. We shall denote the model defined by (26) as the FM-MST (finite mixture of MST distributions) model. Let $\boldsymbol{\Psi}=\left(\pi_{1},\ldots,\pi_{g-1},\boldsymbol{\theta}_{1}^{T},\ldots,\boldsymbol{\theta}_{g}^{T}\right)^{T}$ contain all the unknown parameters of the FM-MST model, where now $\boldsymbol{\theta}_{h}$ consists of the unknown parameters of the $h$th component density function.

To formulate the estimation of the unknown parameters of the FM-MST model as an incomplete-data problem in the EM framework, a set of latent component labels $\boldsymbol{z}_{j}=\left(z_{1j},\ldots,z_{gj}\right)^{T}$ $(j=1,\ldots,n)$ is introduced, where each element $z_{hj}$ is a binary variable defined as

$$z_{hj}=\left\{\begin{array}{ll}1,&\mbox{if}\;\boldsymbol{y}_{j}\;\mbox{belongs to component}\;h,\\ 0,&\mbox{otherwise},\end{array}\right. \qquad (27)$$

and $\sum_{h=1}^{g}z_{hj}=1$ $(j=1,\ldots,n)$. Hence, the random vector $\boldsymbol{Z}_{j}$ corresponding to $\boldsymbol{z}_{j}$ follows a multinomial distribution with one trial and cell probabilities $\pi_{1},\ldots,\pi_{g}$; that is, $\boldsymbol{Z}_{j}\sim\mbox{Mult}_{g}(1;\pi_{1},\ldots,\pi_{g})$. It follows that the FM-MST model can be represented in the hierarchical form

$$\begin{aligned}
\boldsymbol{Y}_{j}\mid\boldsymbol{u}_{j},w_{j},z_{hj}=1 &\sim N_{p}\left(\boldsymbol{\mu}_{h}+\boldsymbol{\Delta}_{h}\boldsymbol{u}_{j},\textstyle\frac{1}{w_{j}}\boldsymbol{\Sigma}_{h}\right),\\
\boldsymbol{U}_{j}\mid w_{j},z_{hj}=1 &\sim HN_{p}\left(\boldsymbol{0},\textstyle\frac{1}{w_{j}}\boldsymbol{I}_{p}\right),\\
W_{j}\mid z_{hj}=1 &\sim \mbox{gamma}\left(\textstyle\frac{\nu_{h}}{2},\textstyle\frac{\nu_{h}}{2}\right),\\
\boldsymbol{Z}_{j} &\sim \mbox{Mult}_{g}\left(1,\boldsymbol{\pi}\right),
\end{aligned} \qquad (28)$$

where $\boldsymbol{\Delta}_{h}=\mbox{diag}\left(\boldsymbol{\delta}_{h}\right)$ and $\boldsymbol{\pi}=\left(\pi_{1},\ldots,\pi_{g}\right)^{T}$.

5 ML Estimation for FM-MST Distributions

From the hierarchical characterization (28) of the FM-MST distributions, the complete-data log-likelihood function is given by

$$\log L_{c}\left(\boldsymbol{\Psi}\right)=\log L_{1c}\left(\boldsymbol{\Psi}\right)+\log L_{2c}\left(\boldsymbol{\Psi}\right)+\log L_{3c}\left(\boldsymbol{\Psi}\right), \qquad (29)$$

where

$$\begin{aligned}
\log L_{1c}\left(\boldsymbol{\Psi}\right) &= \sum_{h=1}^{g}\sum_{j=1}^{n}z_{hj}\log\left(\pi_{h}\right),\\
\log L_{2c}\left(\boldsymbol{\Psi}\right) &= \sum_{h=1}^{g}\sum_{j=1}^{n}z_{hj}\left[\left(\frac{\nu_{h}}{2}\right)\log\left(\frac{\nu_{h}}{2}\right)+\left(\frac{\nu_{h}}{2}+p-1\right)\log\left(w_{j}\right)-\log\Gamma\left(\frac{\nu_{h}}{2}\right)-\left(\frac{w_{j}}{2}\right)\nu_{h}\right],\\
\log L_{3c}\left(\boldsymbol{\Psi}\right) &= \sum_{h=1}^{g}\sum_{j=1}^{n}z_{hj}\left\{-p\log\left(2\pi\right)-\frac{1}{2}\log\left|\boldsymbol{\Sigma}_{h}\right|-\frac{w_{j}}{2}\left[d_{h}\left(\boldsymbol{y}_{j}\right)+\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{hj}\right)^{T}\boldsymbol{\Lambda}_{h}^{-1}\left(\boldsymbol{u}_{j}-\boldsymbol{q}_{hj}\right)\right]\right\},
\end{aligned} \qquad (30)$$

and where

$$\begin{aligned}
d_{h}\left(\boldsymbol{y}_{j}\right) &= \left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}\right)^{T}\boldsymbol{\Omega}_{h}^{-1}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}\right),\\
\boldsymbol{q}_{hj} &= \boldsymbol{\Delta}_{h}^{T}\boldsymbol{\Omega}_{h}^{-1}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}\right),\\
\boldsymbol{\Lambda}_{h} &= \boldsymbol{I}_{p}-\boldsymbol{\Delta}_{h}^{T}\boldsymbol{\Omega}_{h}^{-1}\boldsymbol{\Delta}_{h},\\
\boldsymbol{\Omega}_{h} &= \boldsymbol{\Sigma}_{h}+\boldsymbol{\Delta}_{h}\boldsymbol{\Delta}_{h}^{T}.
\end{aligned}$$

It is clear from (29) that the maximization of the $Q$-function of the complete-data log likelihood (McLachlan and Krishnan, 2008),

$$Q(\boldsymbol{\Psi};\boldsymbol{\Psi}^{(k)})=E_{\boldsymbol{\Psi}^{(k)}}\left\{\log L_{c}\left(\boldsymbol{\Psi}\right)\mid\boldsymbol{y}\right\},$$

only requires the maximization of the component functions $L_{hc}(\boldsymbol{\Psi})$ separately $(h=1,2,3)$. The conditional expectations needed to compute the $Q$-function with respect to (30) are

$$\begin{aligned}
\tau_{hj}^{(k)} &= E_{\boldsymbol{\Psi}^{(k)}}\{Z_{hj}\mid\boldsymbol{y}_{j}\},\\
e_{1,hj}^{(k)} &= E_{\boldsymbol{\Psi}^{(k)}}\{\log(W_{j})\mid\boldsymbol{y}_{j},z_{hj}=1\},\\
e_{2,hj}^{(k)} &= E_{\boldsymbol{\Psi}^{(k)}}\{W_{j}\mid\boldsymbol{y}_{j},z_{hj}=1\},\\
\boldsymbol{e}_{3,hj}^{(k)} &= E_{\boldsymbol{\Psi}^{(k)}}\{W_{j}\boldsymbol{U}_{j}\mid\boldsymbol{y}_{j},z_{hj}=1\},\\
\boldsymbol{e}_{4,hj}^{(k)} &= E_{\boldsymbol{\Psi}^{(k)}}\{W_{j}\boldsymbol{U}_{j}\boldsymbol{U}_{j}^{T}\mid\boldsymbol{y}_{j},z_{hj}=1\}.
\end{aligned} \qquad (31)$$

The posterior probability of membership of the $h$th component by $\boldsymbol{y}_{j}$, using the current estimate $\boldsymbol{\Psi}^{(k)}$ for $\boldsymbol{\Psi}$, is given by Bayes’ theorem as

$$\tau_{hj}^{(k)}=\frac{\pi_{h}^{(k)}f_{p}\left(\boldsymbol{y}_{j};\boldsymbol{\mu}_{h}^{(k)},\boldsymbol{\Sigma}_{h}^{(k)},\boldsymbol{\delta}_{h}^{(k)},\nu_{h}^{(k)}\right)}{\sum_{l=1}^{g}\pi_{l}^{(k)}f_{p}\left(\boldsymbol{y}_{j};\boldsymbol{\mu}_{l}^{(k)},\boldsymbol{\Sigma}_{l}^{(k)},\boldsymbol{\delta}_{l}^{(k)},\nu_{l}^{(k)}\right)}. \qquad (32)$$
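In R, (32) is a one-line computation once the component densities are available. A minimal sketch, reusing the dmst() sketch from Section 2.1 and assuming Psi is a list of per-component parameter lists (a data layout of our own choosing):

```r
## Sketch of the posterior probabilities (32); each element of Psi is assumed
## to be a list with entries pi, mu, Sigma, delta, and nu.
posterior <- function(y, Psi) {
  w <- sapply(Psi, function(th) th$pi * dmst(y, th$mu, th$Sigma, th$delta, th$nu))
  w / sum(w)   # (tau_{1j}, ..., tau_{gj})
}
```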

The other four expectations have expressions analogous to their one-component counterparts given in Section 3. They are given by

$$\begin{aligned}
e_{1,hj}^{(k)} &= \psi\left(\frac{\nu_{h}^{(k)}}{2}+p\right)-\log\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{2}\right)\\
&\quad-T_{p,\nu_{h}^{(k)}+p}^{-1}\left(\boldsymbol{q}_{hj}^{(k)}\sqrt{\textstyle\frac{\nu_{h}^{(k)}+p}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}_{h}^{(k)}\right)S_{1,hj}^{(k)}, \qquad (33)\\
e_{2,hj}^{(k)} &= \left(\frac{\nu_{h}^{(k)}+p}{\nu_{h}^{(k)}+d_{h}^{(k)}\left(\boldsymbol{y}_{j}\right)}\right)\frac{T_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{q}_{hj}^{(k)}\sqrt{\frac{\nu_{h}^{(k)}+p+2}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}};\boldsymbol{0},\boldsymbol{\Lambda}_{h}^{(k)}\right)}{T_{p,\nu_{h}^{(k)}+p}\left(\boldsymbol{y}_{hj}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}_{h}^{(k)}\right)}, \qquad (34)\\
\boldsymbol{e}_{3,hj}^{(k)} &= \left(\frac{\nu_{h}^{(k)}+p}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}\right)T_{p,\nu_{h}^{(k)}+p}^{-1}\left(\boldsymbol{y}_{hj}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}_{h}^{(k)}\right)\boldsymbol{S}_{2,hj}^{(k)}, \qquad (35)\\
\boldsymbol{e}_{4,hj}^{(k)} &= \left(\frac{\nu_{h}^{(k)}+p}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}\right)T_{p,\nu_{h}^{(k)}+p}^{-1}\left(\boldsymbol{y}_{hj}^{*(k)};\boldsymbol{0},\boldsymbol{\Lambda}_{h}^{(k)}\right)\boldsymbol{S}_{3,hj}^{(k)}, \qquad (36)
\end{aligned}$$

where $S_{1,hj}^{(k)}$ is a scalar defined by

$$\begin{aligned}
S_{1,hj}^{(k)} &= \int_{-\infty}^{\left[\boldsymbol{q}_{hj}^{(k)}\right]_{1}}\int_{-\infty}^{\left[\boldsymbol{q}_{hj}^{(k)}\right]_{2}}\cdots\int_{-\infty}^{\left[\boldsymbol{q}_{hj}^{(k)}\right]_{p}}\log\left(1+\frac{\boldsymbol{s}^{T}\boldsymbol{\Lambda}_{h}^{(k)-1}\boldsymbol{s}}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}\right)\\
&\qquad\times\left[1+\frac{\boldsymbol{s}^{T}\boldsymbol{\Lambda}_{h}^{(k)-1}\boldsymbol{s}}{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}\right]^{-\left(\frac{\nu_{h}^{(k)}}{2}+p\right)}ds_{1}\,ds_{2}\cdots ds_{p},
\end{aligned}$$

$\boldsymbol{S}_{2,hj}^{(k)}$ is a $p\times 1$ vector whose $r$th element is

$$\int_{0}^{\infty}\int_{0}^{\infty}\cdots\int_{0}^{\infty}u_{r}\;t_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{u};\boldsymbol{q}_{hj}^{(k)},\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{\nu_{h}^{(k)}+p+2}\right)\boldsymbol{\Lambda}_{h}^{(k)}\right)d\boldsymbol{u}, \qquad (38)$$

and $\boldsymbol{S}_{3,hj}^{(k)}$ is a $p\times p$ matrix whose $(r,s)$th element is

$$\int_{0}^{\infty}\int_{0}^{\infty}\cdots\int_{0}^{\infty}u_{r}\,u_{s}\;t_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{u};\boldsymbol{q}_{hj}^{(k)},\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{\nu_{h}^{(k)}+p+2}\right)\boldsymbol{\Lambda}_{h}^{(k)}\right)d\boldsymbol{u}, \qquad (39)$$

where, for convenience of notation, $d\boldsymbol{u}$ is used to denote $du_{1}\,du_{2}\cdots du_{p}$.

It is important to note that $\boldsymbol{S}_{2,hj}^{(k)}$ and $\boldsymbol{S}_{3,hj}^{(k)}$ are related to the first two moments of a truncated $p$-dimensional $t$-variate $\boldsymbol{X}_{hj}$. More specifically, let

$$\boldsymbol{X}_{hj}\sim tt_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{q}_{hj}^{(k)},\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{\nu_{h}^{(k)}+p+2}\right)\boldsymbol{\Lambda}_{h}^{(k)};\mathbb{R}^{+}\right),$$

the truncated $t$-distribution as defined by (6). Then

$$\begin{aligned}
\boldsymbol{S}_{2,hj}^{(k)} &= T_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{q}_{hj}^{(k)};\boldsymbol{0},\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{\nu_{h}^{(k)}+p+2}\right)\boldsymbol{\Lambda}_{h}^{(k)}\right)E(\boldsymbol{X}_{hj})\\
\mbox{and}\qquad
\boldsymbol{S}_{3,hj}^{(k)} &= T_{p,\nu_{h}^{(k)}+p+2}\left(\boldsymbol{q}_{hj}^{(k)};\boldsymbol{0},\left(\frac{\nu_{h}^{(k)}+d_{h}^{(k)}(\boldsymbol{y}_{j})}{\nu_{h}^{(k)}+p+2}\right)\boldsymbol{\Lambda}_{h}^{(k)}\right)E(\boldsymbol{X}_{hj}\boldsymbol{X}_{hj}^{T}),
\end{aligned}$$

and hence (35) and (36) reduce to $\boldsymbol{e}_{3,hj}^{(k)}=e_{2,hj}^{(k)}E(\boldsymbol{X}_{hj})$ and $\boldsymbol{e}_{4,hj}^{(k)}=e_{2,hj}^{(k)}E(\boldsymbol{X}_{hj}\boldsymbol{X}_{hj}^{T})$, respectively, which can be expressed in terms of the parameters $\boldsymbol{q}_{hj}^{(k)}$, $d_{h}^{(k)}(\boldsymbol{y}_{j})$, $\boldsymbol{\Lambda}_{h}^{(k)}$, and $\nu_{h}^{(k)}$ using the results (7) and (8). It is worth emphasizing that the computation of $\boldsymbol{e}_{3,hj}^{(k)}$ and $\boldsymbol{e}_{4,hj}^{(k)}$ depends on algorithms for evaluating the multivariate $t$-distribution function, for which fast procedures are available.

In summary, the ECM algorithm is implemented as follows on the $(k+1)$th iteration:

E-step: Given $\boldsymbol{\Psi}=\boldsymbol{\Psi}^{(k)}$, compute $\tau_{hj}^{(k)}$ using (32), and $e_{1,hj}^{(k)}$, $e_{2,hj}^{(k)}$, $\boldsymbol{e}_{3,hj}^{(k)}$, and $\boldsymbol{e}_{4,hj}^{(k)}$ as described by (33), (34), (35), and (36), respectively, for $h=1,\ldots,g$ and $j=1,\ldots,n$.

M-step: Update the estimate of $\boldsymbol{\Psi}$ by calculating, for $h=1,\ldots,g$, the following estimates of the parameters in $\boldsymbol{\Psi}$:

$$\begin{aligned}
\pi_{h}^{(k+1)} &= \frac{1}{n}\sum_{j=1}^{n}\tau_{hj}^{(k)},\\
\boldsymbol{\mu}_{h}^{(k+1)} &= \frac{\sum_{j=1}^{n}\tau_{hj}^{(k)}\left[e_{2,hj}^{(k)}\boldsymbol{y}_{j}-\boldsymbol{\Delta}_{h}^{(k)}\boldsymbol{e}_{3,hj}^{(k)}\right]}{\sum_{j=1}^{n}\tau_{hj}^{(k)}e_{2,hj}^{(k)}},\\
\boldsymbol{\delta}_{h}^{(k+1)} &= \left(\boldsymbol{\Sigma}_{h}^{(k)-1}\odot\sum_{j=1}^{n}\tau_{hj}^{(k)}\boldsymbol{e}_{4,hj}^{(k)}\right)^{-1}\mbox{DIAG}\left(\boldsymbol{\Sigma}_{h}^{(k)-1}\sum_{j=1}^{n}\tau_{hj}^{(k)}(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}^{(k)})\boldsymbol{e}_{3,hj}^{(k)T}\right),
\end{aligned}$$
and
$$\begin{aligned}
\boldsymbol{\Sigma}_{h}^{(k+1)} &= \frac{1}{\sum_{j=1}^{n}\tau_{hj}^{(k)}}\sum_{j=1}^{n}\tau_{hj}^{(k)}\left[\boldsymbol{\Delta}_{h}^{(k+1)}\boldsymbol{e}_{4,hj}^{(k)T}\boldsymbol{\Delta}_{h}^{(k+1)T}-\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}^{(k+1)}\right)\boldsymbol{e}_{3,hj}^{(k)T}\boldsymbol{\Delta}_{h}^{(k+1)}\right.\\
&\quad\left.-\boldsymbol{\Delta}_{h}^{(k+1)}\boldsymbol{e}_{3,hj}^{(k)}\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}^{(k+1)}\right)^{T}+\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}^{(k+1)}\right)\left(\boldsymbol{y}_{j}-\boldsymbol{\mu}_{h}^{(k+1)}\right)^{T}e_{2,hj}^{(k)}\right].
\end{aligned}$$

An update $\nu_{h}^{(k+1)}$ of the degrees of freedom is obtained by solving iteratively the equation

$$\log\left(\frac{\nu_{h}^{(k+1)}}{2}\right)-\psi\left(\frac{\nu_{h}^{(k+1)}}{2}\right)=\frac{\sum_{j=1}^{n}\left[\tau_{hj}^{(k)}\left(e_{2,hj}^{(k)}-e_{1,hj}^{(k)}-1\right)\right]}{\sum_{j=1}^{n}\tau_{hj}^{(k)}}.$$

A program for implementing this EM algorithm has been written in R.

6 The Empirical Information Matrix

We consider an approximation to the asymptotic covariance matrix of the ML estimates based on the inverse of the empirical information matrix (Basford et al., 1997), which is given by

$$I_{e}\left(\boldsymbol{\Psi};\boldsymbol{y}\right)=\sum_{j=1}^{n}\boldsymbol{s}\left(\boldsymbol{y}_{j};\hat{\boldsymbol{\Psi}}\right)\boldsymbol{s}^{T}\left(\boldsymbol{y}_{j};\hat{\boldsymbol{\Psi}}\right), \qquad (40)$$

where $\boldsymbol{s}\left(\boldsymbol{y}_{j};\hat{\boldsymbol{\Psi}}\right)=E_{\hat{\boldsymbol{\Psi}}}\left\{\partial\log L_{cj}\left(\boldsymbol{\Psi}\right)/\partial\boldsymbol{\Psi}\mid\boldsymbol{y}_{j}\right\}$ $(j=1,\ldots,n)$ are the individual scores, consisting of

$$\left(s_{j,\pi_{1}},\ldots,s_{j,\pi_{g-1}},s_{j,\boldsymbol{\mu}_{1}},\ldots,s_{j,\boldsymbol{\mu}_{g}},s_{j,\boldsymbol{\delta}_{1}},\ldots,s_{j,\boldsymbol{\delta}_{g}},s_{j,\boldsymbol{\Sigma}_{1}},\ldots,s_{j,\boldsymbol{\Sigma}_{g}},s_{j,\nu_{1}},\ldots,s_{j,\nu_{g}}\right).$$

Here $L_{cj}\left(\boldsymbol{\Psi}\right)$ denotes the complete-data likelihood formed from the single observation $\boldsymbol{y}_{j}$. An estimate of the covariance matrix of $\hat{\boldsymbol{\Psi}}$ is given by the inverse of (40). After some algebraic manipulations, one can show that the elements of $\boldsymbol{s}\left(\boldsymbol{y}_{j};\hat{\boldsymbol{\Psi}}\right)$ are given by the following explicit expressions:

$$\begin{aligned}
s_{j,\pi_{h}} &= \frac{\tau_{hj}}{\pi_{h}}-\frac{\tau_{gj}}{\pi_{g}},\\
s_{j,\boldsymbol{\mu}_{h}} &= \tau_{hj}\hat{\boldsymbol{\Sigma}}_{h}^{-1}\left[e_{2,hj}\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)-\hat{\boldsymbol{\Delta}}_{h}\boldsymbol{e}_{3,hj}\right],\\
s_{j,\boldsymbol{\Sigma}_{h}} &= \textstyle\frac{1}{2}\tau_{hj}\hat{\boldsymbol{\Sigma}}_{h}^{-1}\left[e_{2,hj}\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)^{T}-\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)\boldsymbol{e}_{3,hj}^{T}\hat{\boldsymbol{\Delta}}_{h}\right.\\
&\qquad\left.-\hat{\boldsymbol{\Delta}}_{h}\boldsymbol{e}_{3,hj}\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)^{T}+\hat{\boldsymbol{\Delta}}_{h}\boldsymbol{e}_{4,hj}\hat{\boldsymbol{\Delta}}_{h}\right]\hat{\boldsymbol{\Sigma}}_{h}^{-1}-\textstyle\frac{1}{2}\tau_{hj}\hat{\boldsymbol{\Sigma}}_{h}^{-1},\\
s_{j,\boldsymbol{\delta}_{h}} &= \tau_{hj}\left[\mbox{diag}\left(\hat{\boldsymbol{\Sigma}}_{h}^{-1}\left(\boldsymbol{y}_{j}-\hat{\boldsymbol{\mu}}_{h}\right)\right)\boldsymbol{e}_{3,hj}-\left(\hat{\boldsymbol{\Sigma}}_{h}^{-1}\odot\boldsymbol{e}_{4,hj}\right)\hat{\boldsymbol{\delta}}_{h}\right],\\
s_{j,\nu_{h}} &= \textstyle\frac{1}{2}\tau_{hj}\left[\log\left(\textstyle\frac{1}{2}\hat{\nu}_{h}\right)+1+e_{1,hj}-\psi\left(\textstyle\frac{1}{2}\hat{\nu}_{h}\right)-e_{2,hj}\right].
\end{aligned}$$
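Given these scores, forming (40) and extracting standard errors is straightforward. A minimal sketch, assuming S is the $n\times d$ matrix whose $j$th row is the score $\boldsymbol{s}(\boldsymbol{y}_{j};\hat{\boldsymbol{\Psi}})^{T}$ assembled from the expressions above:

```r
## Sketch of standard errors from the empirical information matrix (40).
empinfo.se <- function(S) {
  Ie <- crossprod(S)      # I_e = sum_j s_j s_j^T
  sqrt(diag(solve(Ie)))   # square roots of the diagonal of I_e^{-1}
}
```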

7 Examples

In this section, we fit the FM-MST model to three real data sets to demonstrate its usefulness in analyzing and clustering multivariate skewed data. In the first example, we focus on the flexibility of the FM-MST model in capturing the asymmetric shape of flow cytometric data. The next example illustrates the clustering capability of the model. In the final example, we demonstrate the computational efficiency of the proposed algorithm.

7.1 Lymphoma Data

We consider a subset of the T-cell phosphorylation data collected by Maier et al. (2007). In the original study, blood samples from 30 subjects were stained with four fluorophore-labelled antibodies against CD4, CD45RA, SLP76(pY128), and ZAP70(pY292), before and after an anti-CD3 stimulation. In this example, we focus on a reduced subset of the data in two variables, CD4 and ZAP70. This bivariate sample (Figure 1) is apparently bimodal and exhibits an asymmetric pattern. Hence we fit a two-component FM-MST model to the data. More specifically, the fitted model can be written as

f_{2}\left(\boldsymbol{y}_{j};\boldsymbol{\Psi}\right)=\pi_{1}f_{2}\left(\boldsymbol{y}_{j};\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1},\boldsymbol{\delta}_{1},\nu_{1}\right)+\left(1-\pi_{1}\right)f_{2}\left(\boldsymbol{y}_{j};\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2},\boldsymbol{\delta}_{2},\nu_{2}\right),

where

\boldsymbol{\mu}_{i}=\left(\mu_{i,1},\mu_{i,2}\right)^{T},\quad\boldsymbol{\Sigma}_{i}=\left(\begin{array}{cc}\sigma_{i,11}&\sigma_{i,12}\\ \sigma_{i,12}&\sigma_{i,22}\end{array}\right),\quad\boldsymbol{\delta}_{i}=\left(\delta_{i,1},\delta_{i,2}\right)^{T}\quad(i=1,2).

For comparison, we include the fit of a two-component mixture of skew t-distributions from the skew-normal independent (SNI) family (Lachos, Ghosh, and Arellano-Valle, 2010), hereafter named the FM-SNI-ST model. The estimated FM-SNI-ST density can be computed with the R package mixsmsn (Cabral, Lachos, and Prates, 2012). Note that the MST distribution is different from the SNI-ST distribution, since its skewing function is not of dimension one. Note also that the SNI-ST distribution is equivalent to the restricted MST distribution (5) after reparametrization. Moreover, under the FM-SNI-ST settings, the correlation structure of \boldsymbol{Y} also depends on the skewness parameter, whereas for the FM-MST distributions the covariance structure is not affected by \boldsymbol{\delta}. The contours of the fitted SNI-ST and MST component densities are depicted in Figure 1(b) and Figure 1(c), respectively. To better visualize the shape of the fitted models, we display the estimated densities of each component instead of the mixture contours. It can be seen that the FM-MST model provides a noticeably better fit. From a clustering point of view, the FM-MST model also shows better performance, as it is able to separate the two clusters correctly and adapts more adequately to the asymmetric shape of each cluster. The superiority of the FM-MST model in dealing with asymmetric and heavy-tailed data is thus evident for this data set.
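For reference, the FM-SNI-ST fit can be obtained along the following lines. This is a sketch only: it assumes mixsmsn's multivariate fitting function smsn.mmix and the argument names shown, which should be checked against the package documentation, and the data frame lymphoma is hypothetical.

library(mixsmsn)
# Hypothetical data frame holding the bivariate lymphoma subset.
y <- as.matrix(lymphoma[, c("CD4", "ZAP70")])
# Two-component multivariate skew-t fit from the SNI family; the argument
# names reflect our reading of the package interface and may differ
# between versions.
fit.sni <- smsn.mmix(y, nu = 3, g = 2, family = "Skew.t",
                     get.init = TRUE, criteria = TRUE)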

Figure 1: Mixture modelling of a reduced subset of the prephosphorylation T-cell population. Bivariate skew t-mixtures were fitted to the data restricted to the two dimensions CD4 and ZAP70. (a) Hue intensity plot of the lymphoma data set; (b) contours of the component densities of the two-component skew t-mixture model FM-SNI-ST fitted using the R package mixsmsn; (c) fitted component contours of the two-component FM-MST model.

7.2 GvHD Data

Our second example concerns a data set collected by Brinkman et al. (2007), in which peripheral blood samples were collected weekly from patients following blood and bone marrow transplant. The original goal was to identify cellular signatures that can predict or assist in the early detection of Graft versus Host Disease (GvHD), a common post-transplantation complication in which the newly transplanted donor cells attack the recipient's tissues. Samples were stained with four fluorescence reagents: CD4 FITC, CD8β PE, CD3 PerCP, and CD8 APC. We fit a 4-variate FM-MST model to a case sample with a population of 13773 cells. The data set is shown in Figure 2, where the cells are displayed in five different colours according to an expert manual clustering into five clusters. In addition, we include the results for the FM-SNI-ST model and the restricted MST mixture model introduced in Section 2.1 (equation (5)), hereafter denoted by FM-RMST.

Figure 2: GvHD data set: expert manual clustering of a population of 13773 cells stained with four fluorescence reagents – CD4 FITC (FL1-H), CD8β PE (FL2-H), CD3 PerCP (FL3-H), and CD8 APC (FL4-H).

We compare the performance of the three models FM-MST, FM-SNI-ST, and FM-RMST in assigning cells to the expert clusters. Manual gating suggests there are five clusters in this case sample. Hence we applied the algorithm for the fitting of each model with g predefined as 5. For a fair comparison, we started the three algorithms from the same initial values. The initial clustering is based on k-means. The degrees of freedom are set to be identical for all components for the first iteration and assigned a relatively large value; a similar strategy was described in Lin (2010). A sketch of this initialization is given below.
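A minimal sketch of this initialization in R, assuming y is the n x p data matrix (the variable names are ours, and zero initial skewness is one simple choice):

g <- 5
set.seed(1)                         # same starting point for all three algorithms
init <- kmeans(y, centers = g, nstart = 10)
pi0    <- as.numeric(table(init$cluster)) / nrow(y)
mu0    <- lapply(1:g, function(h) colMeans(y[init$cluster == h, , drop = FALSE]))
Sigma0 <- lapply(1:g, function(h) cov(y[init$cluster == h, , drop = FALSE]))
delta0 <- lapply(1:g, function(h) rep(0, ncol(y)))   # zero initial skewness
nu0    <- rep(40, g)                # identical, relatively large initial df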

To assess the performance of the three algorithms, we take the manual expert clustering (with dead cells removed) as the 'true' class membership and calculate the misclassification rate against this benchmark. Since the cluster labels returned by each algorithm are arbitrary, the error rate is computed for every permutation of the class labels and the permutation giving the highest agreement (equivalently, the lowest error rate) is reported. A small R sketch of this calculation follows.
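This is a brute-force search over all g! label permutations, adequate for g = 5 (the function names are ours):

# All permutations of a vector (recursive; fine for small g).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  out
}
# Misclassification rate minimized over relabellings;
# 'truth' and 'pred' are integer label vectors in 1..g.
error.rate <- function(truth, pred, g) {
  min(sapply(perms(1:g), function(p) mean(truth != p[pred])))
}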

As anticipated, the best clustering result was given by the FM-MST model, which achieved the lowest misclassification rate (Table 1). The FM-SNI-ST model has a higher number of misallocations, and the FM-RMST model performs disappointingly: its error rate is more than double that of the FM-MST model. It is worth pointing out that both the FM-MST and FM-RMST models have 99 free parameters, while the FM-SNI-ST model has 95 parameters. The inferior performance of the two restricted models (FM-SNI-ST and FM-RMST) provides some evidence of the extra flexibility offered by the more general FM-MST model.

Table 1: Clustering error rates for the multivariate skew t-mixture models fitted to the GvHD data set.

Model        Error rate   Number of free parameters
FM-MST       0.0875       99
FM-SNI-ST    0.1308       95
FM-RMST      0.2070       99

7.3 AIS Data

We now illustrate the computational efficiency of our exact implementation of the E-step of the EM algorithm described in Section 5. We denote this version of the EM algorithm with the exact E-step by EM-exact. In addition, we consider the EM alternative with a Monte Carlo (MC) E-step as given by Lin (2010), denoted by EM-MC. Since both implementations are based on the same characterization of the multivariate skew t-distribution defined by Sahu et al. (2003), it is appropriate to compare their computation times. We assess their time performance on the well-analyzed Australian Institute of Sport (AIS) data, which consist of p=13 measurements made on n=202 athletes. As in Lin (2010), we limit this illustration to a bivariate subset of the data – body mass index (BMI) and percentage of body fat (Bfat). As noted by Lin (2010), these data are apparently bimodal. Hence a two-component mixture model is fitted to the data set.

A summary of the results is listed in Table 2. Also reported there are the values of the log likelihood, the Akaike information criterion (AIC) (Akaike, 1974), and the Bayesian information criterion (BIC) (Schwarz, 1978), defined by

\mbox{AIC}=2m-2L\left(\boldsymbol{\Psi}\right)\quad\mbox{and}\quad\mbox{BIC}=m\log n-2L\left(\boldsymbol{\Psi}\right), (41)

respectively, where L\left(\boldsymbol{\Psi}\right) is the value of the log likelihood at \boldsymbol{\Psi}, m is the number of free parameters, and n is the sample size. Models with smaller AIC and BIC values are preferred when comparing different fitted results; the best value for each criterion is highlighted in bold in Table 2. For this illustration, the EM-MC E-step is undertaken with 50 random draws, as recommended by Lin (2010). Note that the degrees of freedom are not restricted to be the same for the two components. The gender of each individual in this data set is also recorded, enabling us to evaluate the error rate of the binary classification for the two methods.
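For concreteness, the criteria in (41) are one-liners in R; for the two-component bivariate model fitted here, m = 17 (eight parameters per component plus one mixing proportion) and n = 202, which reproduces the EM-exact entries of Table 2:

aic <- function(loglik, m)    2 * m - 2 * loglik
bic <- function(loglik, m, n) m * log(n) - 2 * loglik
aic(-1077.257, m = 17)            # 2188.514
bic(-1077.257, m = 17, n = 202)   # 2244.755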

Not surprisingly, the model selection criteria favour the EM-exact algorithm. Not only did it achieve lower AIC and BIC values, but its computation time is also remarkably lower: it is more than five times faster than the EM-MC alternative.

Table 2: Computation time and clustering error rates for two implementations of the EM algorithm for multivariate skew t-mixture models on the AIS data set. For EM-exact, the E-step is implemented exactly as described in Section 5; for EM-MC, the E-step is implemented by Monte Carlo, as in Lin (2010). Time is measured in seconds.

Model                  EM-exact              EM-MC
Component              1         2           1         2
\pi                    0.44      0.56        0.59      0.41
\mu_{i1}               19.74     21.83       19.89     22.47
\mu_{i2}               15.99     5.89        15.50     7.30
\Sigma_{i,11}          3.03      3.16        2.96      3.23
\Sigma_{i,12}          7.71      0.54        6.17      1.34
\Sigma_{i,22}          2.36      0.11        25.80     2.14
\delta_{i1}            3.34      1.44        2.72      0.71
\delta_{i2}            3.15      3.76        2.22      1.13
\nu                    42.05     3.82        23.98     25.93
L(\boldsymbol{\Psi})   -1077.257             -1088.066
AIC                    2188.514              2207.956
BIC                    2244.755              2264.197
Error rate             0.0792                0.0891
Time (s)               64.63                 349.9

8 Computation Time and Accuracy of the E-step

We now proceed to two experiments evaluating the computational cost and accuracy of the EM-exact and EM-MC algorithms on high-dimensional data. As pointed out previously, the main computational cost for EM-exact lies in evaluating the multivariate t-distribution function. Calculation of the first two moments of a p-variate truncated t-distribution requires the evaluation of two T_{p}(\cdot) functions, p evaluations of T_{p-1}(\cdot), and \frac{1}{2}p(p-1) evaluations of T_{p-2}(\cdot). Hence, the computation time increases substantially with the number of dimensions. With the EM-exact algorithm, however, some accuracy can be traded for speed by relaxing the error tolerance of these evaluations.
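To give a feel for this cost, the sketch below counts the distribution-function calls implied by the formula above and times a single T_{p}(\cdot) evaluation using pmvt from the R package mvtnorm, which implements the algorithm of Genz and Bretz (2002); the probability evaluated and the settings are our own illustrative choices.

library(mvtnorm)
# Distribution-function evaluations per observation for the first two
# truncated-t moments: 2 of T_p, p of T_{p-1}, and p(p-1)/2 of T_{p-2}.
n.evals <- function(p) 2 + p + p * (p - 1) / 2
n.evals(10)   # 57 evaluations at p = 10

# One central T_p(.) evaluation via the Genz-Bretz quasi-Monte Carlo method.
p <- 10
system.time(
  pmvt(lower = rep(-Inf, p), upper = rep(0, p), df = 4, corr = diag(p))
)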

Figure 3: Comparison of the performance of the EM-MC and EM-exact methods on a subset of 100 observations from the Brain Tumor data. Green line: EM-MC with 500 draws; red line: EM-MC with 100 draws; blue line: EM-MC with 50 draws; black line: EM-exact. (a) Typical computation time for the E-step on a sample of 100 observations in various dimensions. (b) Total absolute error of the E-step for one data point.

We sampled 100 observations from a Brain Tumor data set supplied by Geoff Osborne of the Queensland Brain Institute at the University of Queensland. In both experiments we varied the dimension p of the sample. The graph in Figure 3(a) shows the typical CPU time per E-step iteration for various dimensions p of the data; EM-MC(m) denotes the EM-MC algorithm with m random draws using the Gibbs sampling approach described in Lin (2010). It is worth noting that in both experiments EM-exact is evaluated with a default tolerance of at least 10^{-6}. As seen in Figure 3, EM-exact is the fastest of the four versions of the E-step in low dimensions; for example, at p=2, EM-exact is at least 25 times faster than EM-MC(50). Although EM-MC(50) is slightly faster than EM-exact in higher dimensions, EM-exact produces results to a significantly higher accuracy, while EM-MC requires a large number of draws to achieve comparable results. In our simulations, for example, at p=7, 50 draws were insufficient to achieve acceptable estimates; preliminary results suggest that at least 500 draws are required to generate reasonable approximations when p is greater than 6, in which case EM-exact is at least ten times quicker. Furthermore, EM-exact has the additional advantage over the EM-MC alternative that its results are reproducible.

To compare the accuracy of the EM-exact and EM-MC algorithms, we compute the total absolute error against a baseline given by EM-exact with a maximum tolerance of 10^{-18}. For each of the EM-MC(m) algorithms, the average total absolute error over 100 trials is used; for EM-exact, the default tolerance of 10^{-6} is used. The results are shown in Figure 3(b). Not surprisingly, the absolute error of the EM-MC algorithm is significantly higher than that of the EM-exact algorithm; it remains very high even for EM-MC(500). At p=10, for example, EM-exact is at least 30000 times more accurate and takes less than half the time required by EM-MC(500).

It is important to emphasize that as the dimension p of the data increases, EM-MC requires considerably more draws to provide a level of accuracy comparable to that of EM-exact, which becomes computationally intensive. Hence we advocate the use of EM-exact, especially for applications involving high-dimensional data.

9 Concluding Remarks

We have described an exact EM algorithm for computing the maximum likelihood estimates of the parameters of a general multivariate skew t-mixture model. This model has a more general characterization than various alternative versions of the skew t-distribution available in the literature and hence offers greater flexibility in capturing the asymmetric shape of skewed data.

Our proposed method is based on reduced analytical expressions for the conditional expectations on the E-step, which can be formulated in terms of the first and second moments of a multivariate truncated t-distribution. The latter can in turn be expressed in terms of the distribution function of the (non-truncated) multivariate t-distribution, for which fast algorithms capable of producing highly accurate results already exist. The method is demonstrated to have a marked advantage over the EM algorithm with a Monte Carlo E-step: to achieve accuracy comparable to that of the exact numerical approach, the Monte Carlo version would require a large number of draws, which would be computationally expensive.

Acknowledgments

This work was supported by a grant from the Australian Research Council. We would also like to thank Professor Seung-Gu Kim for comments and corrections, and Dr Kui (Sam) Wang for helpful discussions on this topic.

References

  • [1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.
  • [2] R.B. Arellano-Valle, H. Bolfarine, and V.H. Lachos. Bayesian inference for skew-normal linear mixed models. Journal of Applied Statistics, 34(6):663–682, 2007.
  • [3] A. Azzalini. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12:171–178, 1985.
  • [4] A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83(4):715–726, 1996.
  • [5] J. D. Banfield and A. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821, 1993.
  • [6] K. E. Basford, D. R. Greenway, G. J. McLachlan, and D. Peel. Standard errors of fitted means under normal mixture. Computational Statistics, 12:1–17, 1997.
  • [7] R. M. Basso, V. H. Lachos, C. R. B. Cabral, and P. Ghosh. Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics and Data Analysis, 54:2926–2941, 2010.
  • [8] M. D. Branco and D. K. Dey. A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis, 79:99–113, 2001.
  • [9] R. Brinkman, M. Gasparetto, S.-J. Lee, A. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith. High content flow cytometry and temporal data analysis for defining a cellular signature of graft versus host disease. Biology of Blood and Marrow Transplantation, 13:691–700, 2007.
  • [10] C. R. B. Cabral, V. H. Lachos, and M. O. Prates. Multivariate mixture modeling using skew-normal independent distributions. Computational Statistics and Data Analysis, 56:126–142, 2012.
  • [11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
  • [12] B. S. Everitt and D. J. Hand. Finite Mixture Distributions. Chapman and Hall, London, 1981.
  • [13] C. Fraley and A. E. Raftery. How many clusters? Which clustering methods? Answers via model-based cluster analysis. Computer Journal, 41:578–588, 1999.
  • [14] A. Genz and F. Bretz. Methods for the computation of multivariate t-probabilities. Journal of Computational and Graphical Statistics, 11:950–971, 2002.
  • [15] P. J. Green. On use of the EM algorithm for penalized likelihood estimation. Journal of the Royal Statistical Society B, 52:443–452, 1990.
  • [16] A. K. Gupta. Multivariate skew-t distribution. Statistics, 37:359–363, 2003.
  • [17] S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications. Cambridge University Press, Cambridge, 2004.
  • [18] V. H. Lachos, P. Ghosh, and R. B. Arellano-Valle. Likelihood based inference for skew normal independent linear mixed models. Statistica Sinica, 20:303–322, 2010.
  • [19] T. I. Lin. Robust mixture modeling using multivariate skew t distribution. Statistics and Computing, 20:343–356, 2010.
  • [20] T. I. Lin, J. C. Lee, and W. J. Hsieh. Robust mixture modeling using the skew-t distribution. Statistics and Computing, 17:81–92, 2007.
  • [21] T. I. Lin, J. C. Lee, and S. Y. Yen. Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17:909–927, 2007.
  • [22] B. G. Lindsay. Mixture Models: Theory, Geometry, and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 5, Institute of Mathematical Statistics, Hayward, CA, 1995.
  • [23] L. M. Maier, D. E. Anderson, P. L. De Jager, L. S. Wicker, and D. A. Hafler. Allelic variant in CTLA4 alters T cell phosphorylation patterns. Proceedings of the National Academy of Sciences of the United States of America, 104:18607–18612, 2007.
  • [24] G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications. Marcel Dekker, New York, 1988.
  • [25] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, Hoboken, N.J., 2nd edition, 2008.
  • [26] G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley Series in Probability and Statistics, 2000.
  • [27] G. J. McLachlan and D. Peel. Robust cluster analysis via mixtures of multivariate t-distributions. In A. Amin, D. Dori, P. Pudil, and H. Freeman, editors, Lecture Notes in Computer Science, volume 1451, pages 658–666. Springer-Verlag, Berlin, 1998.
  • [28] A. O’Hagan. Bayes estimation of a convex quadratic. Biometrika, 60:565–571, 1973.
  • [29] A. O’Hagan. Moments of the truncated multivariate-t distribution. 1976.
  • [30] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan, P. Tamayo, D. A. Hafler, P. L. De Jager, and J. P. Mesirov. Automated high-dimensional flow cytometric data analysis. PNAS, 106:8519–8524, 2009.
  • [31] S. K. Sahu, D. K. Dey, and M. D. Branco. A new class of multivariate skew distributions with applications to Bayesian regression models. The Canadian Journal of Statistics, 31:129–150, 2003.
  • [32] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
  • [33] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.