
Multivariate Generalized Gaussian Distribution: Convexity and Graphical Models

Teng Zhang, Ami Wiesel and Maria Sabrina Greco Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. Teng Zhang is with the University of Minnesota, Institute for Mathematics and its Applications, USA. Ami Wiesel is with The Hebrew University of Jerusalem, Israel. Maria Sabrina Greco is with the Universita di Pisa, Italy. Emails: zhang620@umn.edu, amiw@cs.huji.ac.il, and m.greco@iet.unipi.it. A. Wiesel was partially supported by the Intel Collaboration Research Institute for Computational Intelligence.
Abstract

We consider covariance estimation in the multivariate generalized Gaussian distribution (MGGD) and elliptically symmetric (ES) distributions. The maximum likelihood optimization associated with this problem is non-convex, yet it has been proved that its global solution can often be computed via simple fixed point iterations. Our first contribution is a new analysis of this likelihood based on geodesic convexity that requires weaker assumptions. Our second contribution is a generalized framework for structured covariance estimation under sparsity constraints. We show that the optimizations can be formulated as convex minimizations as long as the MGGD shape parameter is larger than half and the sparsity pattern is chordal. These include, for example, maximum likelihood estimation of banded inverse covariances in multivariate Laplace distributions, which are associated with time varying autoregressive processes.

Index Terms:
Multivariate generalized Gaussian distribution, geodesic convexity, graphical models, Cholesky decomposition.

I Introduction

Covariance estimation is a fundamental problem in multivariate statistics. Many techniques for hypothesis testing, inference, denoising and prediction rely on accurate estimation of the true covariance. The problem is challenging when the available data is high dimensional and non-Gaussian. Such settings are typical in many applications including speech, radar, wireless communication, finance and more. These challenges have led to a growing interest in both robust and structured covariance estimation. Specifically, in this paper, we consider maximum likelihood estimation (MLE) in the multivariate generalized Gaussian distribution, with and without sparsity constraints on the inverse covariance.

The first part of this paper considers geodesic convexity of the MLE in elliptically symmetric (ES) distributions. Methods for robust covariance estimation date back to the early works of [18, 31]. A popular approach is Tyler’s scatter estimate. It involves a non-convex optimization, yet can be solved via simple fixed point iterations. It has been rigorously analyzed and successfully applied to different problems [31, 26]. Recently, it was shown that the result is in fact the solution to a geodesically convex minimization [6, 10, 32, 33, 36]. Geodesic convexity ensures that any local minimum is also globally optimal and leads to a much simpler analysis [27]. It also allows for numerous extensions, e.g., regularized solutions. In a competing line of works, a different class of robust covariance estimation techniques was proposed based on the MGGD [23, 11]. A well known example of MGGD is the multivariate Laplace distribution [14]. Fixed point iterations for MGGD estimation and their analyses have recently been considered in [24, 9, 25]. The first contribution of this paper is a new analysis which shows that the negative log-likelihood in MGGD is also geodesically convex. This result requires weaker conditions than previous analyses, provides more intuition and paves the road to numerous generalizations.

The second part of this paper addresses structured covariance estimation. Structure exploitation is a main ingredient in modern statistics that allows accurate high dimensional estimation via a small number of samples. A promising approach is based on sparse inverse covariance models. In the multivariate Gaussian case these are known as graphical models and characterize conditional independence [13, 20, 8, 4]. These models have been successfully applied to speech recognition, sensor networks, computer networks and other fields in signal processing [10, 35]. Our goal is to combine such models with non-Gaussian distributions, e.g., MGGD. Recently, a similar problem was addressed using an expectation maximization technique [15]. Another line of work focused on combining Tyler’s scatter estimate with a banded inverse covariance prior [2, 1]. Graphical models for transelliptical distributions were discussed in [21]. In this paper, we combine the MGGD framework with prior sparsity constraints on the inverse covariance. We show that the optimization can be formulated in convex form as long as the MGGD shape parameter is larger than half and the sparsity pattern is chordal. Chordal models, also known as decomposable or triangulated models, include banded structures, multiscale settings and other practical scenarios [34, 16, 12]. Such structures are associated with a perfect ordering of the variables. A typical example is banded models associated with time-varying autoregressive processes [3].

Our results are also applicable to the case of unknown sparsity pattern, i.e., structure learning via sparsity inducing penalties, but require prior knowledge of the perfect order. Recent works on structure learning in directed acyclic graphs provide data driven techniques for learning this order [30, 28].

The outline of the paper is as follows. We begin in Section II with a few mathematical preliminaries that will be useful in our work. Then, we continue to our two main contributions. In Section III we provide a new geodesic analysis of MGGD estimation, and in Section IV we introduce a convex optimization framework for chordal structured MGGD estimation. Simulation results are described in Section V, and concluding remarks are offered in Section VI.

We use the following notation. We denote the set of real, symmetric and positive definite $p\times p$ matrices by $\mathrm{S_{++}}(p)\subset\mathbb{R}^{p\times p}$, and the span operator by $\mathrm{sp}\{\cdot\}$.

II Preliminaries

We begin with a brief review of two mathematical concepts which will be instrumental in the next sections.

II-A Geodesic convexity

Geodesic convexity is an extension of classical convexity which replaces lines with geodesic paths in manifolds. More details on this topic can be found in [27]. Given a Riemannian manifold $\mathcal{M}$ and a set $\mathcal{A}\subset\mathcal{M}$, we say a function $f:\mathcal{A}\rightarrow\mathbb{R}$ is geodesically convex if every geodesic $\gamma_{xy}$ of $\mathcal{M}$ with endpoints $x,y\in\mathcal{A}$ (i.e., $\gamma_{xy}$ is a function from $[0,1]$ to $\mathcal{M}$ with $\gamma_{xy}(0)=x$ and $\gamma_{xy}(1)=y$) lies in $\mathcal{A}$, and

f\left(\gamma_{xy}(t)\right)\leq(1-t)f(x)+tf(y)\quad\text{for any $x,y\in\mathcal{A}$ and $0<t<1$}.\qquad(1)

If the inequality in (1) is replaced by a strict inequality, we call the function $f$ geodesically strictly convex. An equivalent definition follows from [22, Theorem 1.1.4].

Proposition 1

For a continuous function $f$, the definition in (1) is equivalent to the condition

f\left(\gamma_{xy}\left(\tfrac{1}{2}\right)\right)\leq\tfrac{1}{2}f(x)+\tfrac{1}{2}f(y)\quad\text{for any $x,y\in\mathcal{A}$}.\qquad(2)

The importance of geodesic convexity stems from the following properties (see [27, Theorem 2.1] for more details).

Proposition 2

Any local minimizer of a geodesically convex function is also its global minimizer.

Proposition 3

Any strictly geodesically convex function has a unique global minimizer.

In particular, we consider geodesic convexity on the manifold of positive definite matrices denoted by $\mathrm{S_{++}}(p)$. The geodesic connecting $\bm{\Sigma}_{1}\in\mathrm{S_{++}}(p)$ and $\bm{\Sigma}_{2}\in\mathrm{S_{++}}(p)$ is defined as [7, Chapter 6]

\gamma_{\bm{\Sigma}_{1}\bm{\Sigma}_{2}}(t)=\bm{\Sigma}_{1}^{\frac{1}{2}}\left(\bm{\Sigma}_{1}^{-\frac{1}{2}}\bm{\Sigma}_{2}\bm{\Sigma}_{1}^{-\frac{1}{2}}\right)^{t}\bm{\Sigma}_{1}^{\frac{1}{2}}.\qquad(3)
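As a concrete illustration, the following minimal Python sketch (our own, not part of the paper) evaluates the geodesic (3) through an eigendecomposition-based matrix power, and checks the determinant identity used later in (12) at the midpoint $t=1/2$:

import numpy as np

def pd_power(S, t):
    # S**t for a symmetric positive definite matrix S, via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * w ** t) @ V.T

def pd_geodesic(S1, S2, t):
    # point gamma_{S1,S2}(t) on the geodesic (3) between positive definite S1, S2
    S1_half, S1_inv_half = pd_power(S1, 0.5), pd_power(S1, -0.5)
    return S1_half @ pd_power(S1_inv_half @ S2 @ S1_inv_half, t) @ S1_half

S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 3.0]])
S3 = pd_geodesic(S1, S2, 0.5)   # geodesic mean (midpoint)
print(np.isclose(np.linalg.det(S3) ** 2, np.linalg.det(S1) * np.linalg.det(S2)))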

Geodesic convexity of Tyler’s likelihood has been identified in [6, 32, 33, 36]. In Section III, we continue this line of work and show that the MGGD likelihood is also geodesically convex.

II-B Chordal graphs

In statistical graphical modeling, graphs are used to characterize the sparsity pattern of the corresponding inverse covariance matrices. When the pattern belongs to a special class known as chordal graphs, these concentration matrices satisfy an appealing structure which will be exploited in Section IV. More details on chordal graphs and their relation to graphical models can be found in [20, 34, 16, 12].

A graph $G(V,E)$ is chordal if every cycle of length $\geq 4$ has an edge joining two nonconsecutive vertices of the cycle. For a chordal graph, there is a perfect elimination ordering of the vertices, $(v_{1},v_{2},\cdots,v_{n})$, such that for any $1\leq i\leq n$, the neighborhood of $v_{i}$, $\mathrm{Adj}(v_{i})=\{u\in V:(u,v_{i})\in E\}$, satisfies that $\mathrm{Adj}(v_{i})\cap\{v_{i+1},v_{i+2},\cdots,v_{n}\}$ induces a clique, i.e., a set of fully connected nodes.

It is convenient to define the sparsity pattern of a square $n\times n$ matrix $\bm{C}$ via a graph $G(V,E)$ with $n$ vertices. We say that $\bm{C}$ is $G$-sparse if

[\bm{C}]_{ij}=0,\qquad\text{for all}\quad(i,j)\notin E.\qquad(4)

The Cholesky decomposition is a generalization of the square root operation to positive definite matrices. Any $\bm{\Sigma}\in\mathrm{S_{++}}(p)$ has a unique decomposition $\bm{\Sigma}=\bm{C}\bm{C}^{T}$ where the Cholesky factor $\bm{C}$ is a lower-triangular matrix with positive diagonal elements.

The following result characterizes the relation between sparse Cholesky decompositions and chordal graphs.

Proposition 4

Let $G(V,E)$ be a chordal graph with a natural perfect order (if the order is not natural, this result holds by permuting the columns of the matrix), i.e., $v_{i}=i$ for $i=1,\cdots,n$. If $\bm{\Sigma}$ is $G$-sparse, real and positive definite, then there exists a unique $G$-sparse lower triangular $\bm{C}$ with positive diagonal entries such that $\bm{\Sigma}=\bm{C}\bm{C}^{T}$. Conversely, if $\bm{C}$ is $G$-sparse, lower-triangular and has positive diagonal entries, then $\bm{C}\bm{C}^{T}$ is $G$-sparse and positive definite.

The existence proof can be found in [16, Section 2.1], and the uniqueness is a direct consequence of the perfect elimination order.

Example 1

A banded matrix with band width $d$ corresponds to a graph $G(V,E)$ defined as follows: $(i,j)\in E$ when $|i-j|<d$. We now check that this graph is chordal: for any vertex $i$, its neighbors with larger indices are $\{i+1,\cdots,\min(i+d-1,n)\}$, and any two of them differ by less than $d$ and are therefore connected. Hence the natural order $v_{i}=i$ is a perfect elimination order, which corresponds to the definition of a chordal graph.

Therefore Proposition 4 shows that the Cholesky decomposition of a banded matrix $\bm{\Sigma}$ is the product of a banded (with the same band width $d$) lower-triangular matrix with its transpose.

One can also easily show that the product of a banded lower-triangular matrix with its transpose is still a banded (with the same band width), positive definite matrix, which exemplifies the last sentence of Proposition 4.
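The following small Python sketch (ours, purely illustrative) verifies Example 1 and the last statement of Proposition 4 numerically for a banded Toeplitz matrix:

import numpy as np

p, d = 8, 3                                     # dimension and band width
Sigma = np.eye(p)
for k in range(1, d):
    Sigma += 0.2 * (np.eye(p, k=k) + np.eye(p, k=-k))   # banded: |i - j| < d

C = np.linalg.cholesky(Sigma)                   # lower triangular, Sigma = C C^T
band = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) < d
print(np.allclose(C[~band], 0.0))               # True: the factor C is G-sparse
print(np.allclose((C @ C.T)[~band], 0.0))       # True: C C^T is banded again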

For more examples of chordal graphs and their associated perfect orders, we recommend the examples and graphs in [35, Section II.A] and [12, Section 3].

III Geodesic convexity in MGGD

In this section, we consider unconstrained MLE in MGGD. More precisely, we address a more general family of elliptically symmetric (ES) distributions (see [11] for a review and [24] for a recent generalization to the complex case). These problems involve non-convex minimizations, yet it has been shown that their global minima can be efficiently found using simple fixed point iterations. We will show that the negative log-likelihoods are in fact geodesically convex, and that this may be the underlying principle behind their success.

A random variable $\bm{z}\in\mathbb{R}^{p}$ has an ES distribution in the real space if its probability density function (p.d.f.) is

f(\bm{z})=C_{p,g}|\bm{\Sigma}|^{-0.5}g\left(\left(\bm{z}-\mu\right)^{T}\bm{\Sigma}^{-1}\left(\bm{z}-\mu\right)\right),\qquad(5)

where $g:[0,\infty)\rightarrow(0,\infty)$ is such that $\int_{0}^{\infty}t^{p-1}g(t^{2})\,\mathrm{d}t<\infty$, and $C_{p,g}$ is a normalization constant such that the distribution integrates to $1$. In (5), $\bm{\Sigma}\in\mathrm{S_{++}}(p)$ is called a scatter matrix, and $\mu\in\mathbb{R}^{p}$ is the center of the distribution.

MGGD [23] is a widely used special case of ES when

g(x)=\exp\left(-x^{\beta}/2\right)\qquad(6)

where $\beta$ is the shape parameter. In particular, $\beta=0.5$ gives the multivariate analog of the Laplace distribution, and the multivariate Gaussian distribution is obtained for $\beta=1$.

We consider the estimation of $\bm{\Sigma}$ given $n$ independent and identically distributed (i.i.d.) realizations of a zero mean ES random vector, denoted by $\bm{z}_{1},\cdots,\bm{z}_{n}$. The MLE is the parameter that minimizes the negative log-likelihood

L_{0}(\bm{\Sigma})=\sum_{i=1}^{n}\rho\left(\bm{z}_{i}^{T}\bm{\Sigma}^{-1}\bm{z}_{i}\right)+\frac{n}{2}\log\det(\bm{\Sigma}),

where $\rho(x)=-\log g(x)$. The following Theorem 1 characterizes the existence and uniqueness properties of this MLE. Theorem 1(a) characterizes the uniqueness and is the main contribution of this section; Theorem 1(b) characterizes the existence and is borrowed from [31, Theorem 2.1].

Theorem 1

(a) Assume that $\rho(x)$ is continuous in $(0,\infty)$ and nondecreasing, and that $\rho(e^{x})$ is convex; then $L_{0}(\bm{\Sigma})$ is geodesically convex in $\mathrm{S_{++}}(p)$. If additionally $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$, $\rho(x)$ is strictly increasing and $\rho(e^{x})$ is strictly convex, then $L_{0}(\bm{\Sigma})$ is geodesically strictly convex in $\mathrm{S_{++}}(p)$.

(b) If $\rho(x)$ is continuous in $[0,\infty)$, $a_{1}=\sup\{a\,|\,x^{a/2}\exp(-\rho(x))\rightarrow 0\,\text{ as }x\rightarrow\infty\}$ is positive, and, with $\mathcal{X}=\{\bm{z}_{i}\}_{i=1}^{n}$,

\frac{|\mathcal{X}\cap\mathrm{V}|}{n}<1-\frac{p-\dim(\mathrm{V})}{a_{1}}\quad\text{for any proper linear subspace $\mathrm{V}\subset\mathbb{R}^{p}$},\qquad(7)

then there exists a minimizer of $L_{0}(\bm{\Sigma})$.

Before proving this theorem, a few comments are in order. We remark that the condition $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$ is also implied by (7), since otherwise $\mathrm{V}=\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}$ violates the assumption. This condition is necessary, since otherwise $L_{0}(\bm{\Sigma})\rightarrow-\infty$ as $\bm{\Sigma}\rightarrow\bm{P}_{\mathrm{V}}$, where $\bm{P}_{\mathrm{V}}$ is the projector onto the subspace $\mathrm{V}=\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}$, and therefore the minimizer of $L_{0}(\bm{\Sigma})$ does not exist.

Theorem 1 relaxes the conditions for the uniqueness/existence of the minimizer of $L_{0}(\bm{\Sigma})$ presented in [19, Theorem 2.2]:
M1 $\rho^{\prime}(x)$ is non-negative, continuous and nonincreasing.
M2 $x\rho^{\prime}(x)$ is strictly increasing.
M3 Condition (7).
Our assumption that $\rho$ is increasing corresponds to $\rho^{\prime}(x)>0$, which follows from M1 (where $\rho^{\prime}$ is nonnegative) together with M2 (which excludes the possibility that $\rho^{\prime}(x)=0$ for $x>0$); and our assumption that $\rho(e^{x})$ is strictly convex corresponds to M2, where $x\rho^{\prime}(x)$ is strictly increasing. In comparison, our conditions do not require $\rho$ to be differentiable, and we do not need the assumptions in M1 that $\rho^{\prime}(x)$ is continuous or nonincreasing.

A special case of Theorem 1 is Tyler’s M-estimator, in which $\rho(x)=\frac{p}{2}\log(x)$. Following the first part of Theorem 1(a), $L_{0}(\bm{\Sigma})$ is geodesically convex, as previously identified in [6, 32, 33, 36]. The geodesic convexity does not contradict the non-uniqueness of Tyler’s M-estimator, since this convexity is not strict. Another special case is the class of MGGD estimators. The following corollary follows from Theorem 1 and the fact that $a_{1}$ in Theorem 1(b) is $\infty$.

Corollary 1

For all $\beta>0$, the ML estimator for the MGGD exists and is unique if $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$.

Existence and uniqueness of the MLE in MGGD has been previously addressed in [24, Section V.A]. However, that contribution does not identify geodesic convexity and applies only to $0<\beta\leq 1$. Our theorem applies to the MGGD with all $\beta>0$ and provides additional insight based on geodesic convexity. Furthermore, it applies to other ES distributions, including cases where $\rho^{\prime}(x)$ is not continuous, e.g., $\rho(x)=x$ when $x\leq 1$ and $\rho(x)=3x-2$ when $x>1$.

Here we remark that although the ML estimator for elliptically symmetric (ES) distributions is the motivation for our argument, Theorem 1 (and Theorem 2 in the next section) holds even if $\rho(x)$ is not related to an elliptically symmetric (ES) distribution. For example, Tyler’s M-estimator can be written in the form of (5) with $\rho(x)=\frac{p}{2}\log(x)$; however, this $\rho(x)$ corresponds to the angular central Gaussian, which is not a member of the ES class [24, (36)]. Another example is in [24, page 5610, Example 1], where $\rho^{\prime}$ is set to be a Huber function.

We remark that Theorem 1 also applies to the complex elliptically symmetric (CES) distributions defined by

f(\bm{z})=C_{p,g}|\bm{\Sigma}|^{-1}g\left(\left(\bm{z}-\mu\right)^{H}\bm{\Sigma}^{-1}\left(\bm{z}-\mu\right)\right),\qquad(8)

where the ML estimator minimizes

L_{0}(\bm{\Sigma})=\sum_{i=1}^{n}\rho\left(\bm{z}_{i}^{H}\bm{\Sigma}^{-1}\bm{z}_{i}\right)+n\log\det(\bm{\Sigma}).

Then Theorem 1 holds with $a_{1}=\sup\{a\,|\,x^{a}\exp(-\rho(x))\rightarrow 0\,\text{ as }x\rightarrow\infty\}$. This is also a generalization of the uniqueness/existence result for the minimizer of $L_{0}(\bm{\Sigma})$ presented in [24, Section V.A].

Proof:

(a) Applying [33, Proposition 1], if $\bm{\Sigma}_{3}$ is the geodesic mean of $\bm{\Sigma}_{1}$ and $\bm{\Sigma}_{2}$, i.e., the midpoint of the geodesic defined in (3), then

\ln\left(\bm{z}^{T}\bm{\Sigma}_{1}^{-1}\bm{z}\right)+\ln\left(\bm{z}^{T}\bm{\Sigma}_{2}^{-1}\bm{z}\right)\geq 2\ln\left(\bm{z}^{T}\bm{\Sigma}_{3}^{-1}\bm{z}\right).\qquad(9)

Combining this fact with the convex/monotone properties of ρ\rho, we have

\rho\left(\bm{z}_{i}^{T}\bm{\Sigma}_{1}^{-1}\bm{z}_{i}\right)+\rho\left(\bm{z}_{i}^{T}\bm{\Sigma}_{2}^{-1}\bm{z}_{i}\right)\geq 2\rho\left(\exp\left(\left(\ln\left(\bm{z}_{i}^{T}\bm{\Sigma}_{1}^{-1}\bm{z}_{i}\right)+\ln\left(\bm{z}_{i}^{T}\bm{\Sigma}_{2}^{-1}\bm{z}_{i}\right)\right)/2\right)\right)\qquad(10)
\geq 2\rho\left(\exp\left(\ln\left(\bm{z}_{i}^{T}\bm{\Sigma}_{3}^{-1}\bm{z}_{i}\right)\right)\right)=2\rho\left(\bm{z}_{i}^{T}\bm{\Sigma}_{3}^{-1}\bm{z}_{i}\right).\qquad(11)

Since

\log\det\left(\bm{\Sigma}_{1}\right)+\log\det\left(\bm{\Sigma}_{2}\right)=2\log\det\left(\bm{\Sigma}_{3}\right),\qquad(12)

we conclude that

L_{0}\left(\bm{\Sigma}_{1}\right)+L_{0}\left(\bm{\Sigma}_{2}\right)\geq 2L_{0}\left(\bm{\Sigma}_{3}\right).\qquad(13)

By Proposition 1, this establishes the geodesic convexity of $L_{0}(\bm{\Sigma})$.

When $\rho(e^{x})$ is strictly convex, $L_{0}(\bm{\Sigma})$ is geodesically strictly convex, since the equalities in (10) and (11) cannot hold simultaneously for all $1\leq i\leq n$. The proof is as follows. Following the proof of [36, Theorem III.1], when $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$, the equality in (9) (and therefore the equality in (11)) holds for all $1\leq i\leq n$ only when $\bm{\Sigma}_{1}=c\bm{\Sigma}_{2}$ for some $c\neq 1$. However, $\bm{\Sigma}_{1}=c\bm{\Sigma}_{2}$ would violate the equality in (10) due to the strict convexity of $\rho(e^{x})$. ∎
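As a quick numerical illustration of the midpoint inequality (13), the following Python snippet (our own sketch, with $\rho(x)=x^{\beta}/2$ for $\beta=0.75$) evaluates $L_{0}$ at two random covariance matrices and at their geodesic mean:

import numpy as np

def pd_power(S, t):
    w, V = np.linalg.eigh(S)
    return (V * w ** t) @ V.T

def L0(Sigma, Z, beta):
    # sum_i rho(z_i^T Sigma^{-1} z_i) + (n/2) log det Sigma, with rho(x) = x**beta / 2
    quad = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Sigma), Z)
    return 0.5 * np.sum(quad ** beta) + 0.5 * len(Z) * np.linalg.slogdet(Sigma)[1]

rng = np.random.default_rng(1)
p, n, beta = 4, 50, 0.75
Z = rng.standard_normal((n, p))
S1 = np.cov(rng.standard_normal((n, p)).T)
S2 = np.cov(rng.standard_normal((n, p)).T)
S1h, S1ih = pd_power(S1, 0.5), pd_power(S1, -0.5)
S3 = S1h @ pd_power(S1ih @ S2 @ S1ih, 0.5) @ S1h          # geodesic midpoint (3)
print(L0(S1, Z, beta) + L0(S2, Z, beta) >= 2 * L0(S3, Z, beta))   # True, cf. (13)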

In practice, various numerical techniques can be used to find a local minimum of the MGGD negative log-likelihood. Theorem 1 and geodesic convexity then ensure that this local minimum is also a global minimum. A promising approach is the classical iterative reweighting scheme due to [19, 5]:

\bm{\Sigma}_{m+1}=\sum_{i=1}^{n}u(\bm{z}_{i}^{T}\bm{\Sigma}_{m}^{-1}\bm{z}_{i})\bm{z}_{i}\bm{z}_{i}^{T}/n,\qquad(14)

where $u(x)=\rho^{\prime}(x)$. It has been shown in [5, Proposition 1] that when $\mathrm{sp}(\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n})=\mathbb{R}^{p}$ and $\rho^{\prime\prime}(x)$ is continuous and nonnegative, $L_{0}(\bm{\Sigma}_{m})$ decreases monotonically (the spanning condition is assumed in order to ensure that $L_{0}(\bm{\Sigma}_{m+1})$ is well-defined). [5, Proposition 3] shows that any limiting point of the sequence $\bm{\Sigma}_{m}$ is a stationary point. When the assumptions in Theorem 1 hold, this point is the unique minimizer of $L_{0}(\bm{\Sigma})$.
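For illustration, here is a minimal Python sketch (ours, not the authors' code) of the reweighting scheme for the MGGD, where $\rho(x)=x^{\beta}/2$. We use the Kent-Tyler normalization $u(x)=2\rho^{\prime}(x)=\beta x^{\beta-1}$, an assumption on our part, so that $\beta=1$ recovers the sample covariance; running (14) literally with $u(x)=\rho^{\prime}(x)$ yields the same fixed point scaled by $2^{-1/\beta}$:

import numpy as np

def mggd_fixed_point(Z, beta, n_iter=30):
    # iteratively reweighted scatter estimate for n x p data Z (rows are z_i^T)
    n, p = Z.shape
    Sigma = np.eye(p)
    for _ in range(n_iter):
        quad = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Sigma), Z)  # z_i^T Sigma^{-1} z_i
        w = beta * quad ** (beta - 1.0)          # assumed weights u(x) = beta * x**(beta - 1)
        Sigma = (w[:, None] * Z).T @ Z / n       # (1/n) sum_i w_i z_i z_i^T
    return Sigma

# sanity check: for beta = 1 this is the Gaussian MLE (the sample covariance)
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))
print(np.allclose(mggd_fixed_point(Z, beta=1.0), Z.T @ Z / 200))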

IV Convexity in chordal MGGD

In this section, we consider structured MGGD estimation. In particular, we consider MLE of the MGGD scatter matrix subject to sparsity constraints. Thus, we are interested in the solution to

\begin{array}{ll}\min_{\bm{\Sigma}}&\sum_{i=1}^{n}\rho\left(\bm{z}_{i}^{T}\bm{\Sigma}^{-1}\bm{z}_{i}\right)+n\log\det(\bm{\Sigma})\\ \mathrm{s.t.}&\left[\bm{\Sigma}^{-1}\right]_{ij}=0,\quad(i,j)\notin E\end{array}\qquad(17)

where the objective is the negative MGGD log-likelihood as described in Section III, and $E$ is the edge set of a known graph $G(V,E)$ associated with the sparsity of $\bm{\Sigma}^{-1}$. Unfortunately, the above minimization is not convex in $\bm{\Sigma}$ or $\bm{\Sigma}^{-1}$, nor is it geodesically convex: the sparsity constraints are not preserved by the positive definite geodesic in (3). Thus, it is not clear whether its global minimum can be found in an efficient manner.

In what follows, we propose a simple trick to “convexify” the optimization in many interesting cases of the MGGD. In particular, we assume that $G$ is chordal and represent $\bm{\Sigma}^{-1}$ via its Cholesky factor $\bm{C}$. Due to Proposition 4, a unique $G$-sparse Cholesky decomposition exists, and we obtain

\begin{array}{ll}\min_{\bm{C}\in\mathcal{Y}}&\sum_{i=1}^{n}\rho\left(\|\bm{C}^{T}\bm{z}_{i}\|^{2}\right)-2n\sum_{j=1}^{p}\log(\bm{C}_{j,j})\\ \mathrm{s.t.}&\left[\bm{C}\right]_{ij}=0,\quad(i,j)\notin E,\end{array}\qquad(20)

where

\mathcal{Y}=\left\{\bm{C}\in\mathbb{R}^{p\times p}:\,\bm{C}_{i,i}>0,\ \bm{C}_{i,j}=0\text{ for all }i<j\right\}\qquad(23)

and we have used

n\log\det\left(\left(\bm{C}\bm{C}^{T}\right)^{-1}\right)=-2n\sum_{j=1}^{p}\log(\bm{C}_{j,j}).\qquad(24)
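A small numerical sanity check of this reparametrization (illustrative Python, not from the paper): with $\bm{\Sigma}^{-1}=\bm{C}\bm{C}^{T}$ and $\bm{C}$ lower triangular, $\bm{z}^{T}\bm{\Sigma}^{-1}\bm{z}=\|\bm{C}^{T}\bm{z}\|^{2}$ and (24) holds.

import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 20
Z = rng.standard_normal((n, p))
C = np.tril(rng.standard_normal((p, p)))
np.fill_diagonal(C, np.abs(np.diag(C)) + 0.5)           # positive diagonal
Sigma = np.linalg.inv(C @ C.T)

quad_sigma = np.einsum('ij,jk,ik->i', Z, C @ C.T, Z)    # z_i^T Sigma^{-1} z_i
quad_chol = np.sum((Z @ C) ** 2, axis=1)                # ||C^T z_i||^2
print(np.allclose(quad_sigma, quad_chol))               # True
print(np.isclose(n * np.linalg.slogdet(Sigma)[1],
                 -2 * n * np.log(np.diag(C)).sum()))    # True, cf. (24)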

The following theorem characterizes the properties of this optimization.

Theorem 2

(a) Assume that $\rho(x)$ is continuous in $(0,\infty)$ and nondecreasing, and that $\rho\left(x^{2}\right)$ is convex; then the objective of (20) is strictly convex. (b) When $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$, a minimizer of (20) exists.

Proof:

The negative logarithm terms are strictly convex, and it remains to show that

\rho\left(\left\|\left(\frac{\bm{C}_{1}+\bm{C}_{2}}{2}\right)^{T}\bm{z}_{i}\right\|^{2}\right)\leq\rho\left(\left(\frac{\|\bm{C}_{1}^{T}\bm{z}_{i}\|+\|\bm{C}_{2}^{T}\bm{z}_{i}\|}{2}\right)^{2}\right)\leq\frac{1}{2}\rho\left(\|\bm{C}_{1}^{T}\bm{z}_{i}\|^{2}\right)+\frac{1}{2}\rho\left(\|\bm{C}_{2}^{T}\bm{z}_{i}\|^{2}\right)\qquad(25)

where the first inequality uses the triangle inequality together with the monotonicity of $\rho$, and the second uses the convexity of $\rho\left(x^{2}\right)$.

To prove the existence of a minimizer, we need to show that $L_{1}(\bm{C})=\sum_{i=1}^{n}\rho\left(\|\bm{C}^{T}\bm{z}_{i}\|^{2}\right)-2n\sum_{j=1}^{p}\log(\bm{C}_{j,j})$ satisfies $L_{1}(\bm{C})\rightarrow\infty$ as $\|\bm{C}\|_{F}\rightarrow\infty$ or as the smallest eigenvalue of $\bm{C}$ goes to $0$. When $\|\bm{C}\|_{F}\rightarrow\infty$, since $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$ and $\bm{C}$ is full rank, there exists $c_{2}$ such that $\sum_{i=1}^{n}\|\bm{C}^{T}\bm{z}_{i}\|^{2}\geq c_{2}\|\bm{C}\|_{F}^{2}$. Since $\rho(x^{2})$ is convex, there exist constants $c_{1}$ and $C_{1}$ such that when $\frac{1}{n}\sum_{i=1}^{n}\|\bm{C}^{T}\bm{z}_{i}\|^{2}>C_{1}$,

\sum_{i=1}^{n}\rho(\|\bm{C}^{T}\bm{z}_{i}\|^{2})\geq n\rho\left(\frac{1}{n}\sum_{i=1}^{n}\|\bm{C}^{T}\bm{z}_{i}\|^{2}\right)\geq nc_{1}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\|\bm{C}^{T}\bm{z}_{i}\|^{2}}\geq\sqrt{n}c_{1}c_{2}\|\bm{C}\|_{F}.

Therefore, as $\|\bm{C}\|_{F}\rightarrow\infty$, $L_{1}(\bm{C})\rightarrow\infty$.

When the smallest eigenvalue of $\bm{C}$ goes to $0$, we have $\det(\bm{C})\rightarrow 0$ and $\sum_{j=1}^{p}\log(\bm{C}_{j,j})=\log\det(\bm{C})\rightarrow-\infty$. On the other hand, since $\rho(x^{2})$ is convex, $\liminf_{x\rightarrow 0}\rho(x)$ is finite (by convexity it is bounded below by $2\rho(1)-\rho(4)$). Combining this with the fact that $\rho(x)$ is nondecreasing, $L_{1}(\bm{C})\rightarrow\infty$ as the smallest eigenvalue of $\bm{C}$ goes to $0$, and the existence of the minimizer of $L_{1}(\bm{C})$ is proved. ∎

The theorem shows that, under suitable conditions, any local minimizer of (20) is globally optimal. This holds for any ES distribution with ρ\rho satisfying the assumptions. In particular, an important special case is chordal graphical models in MGGD:

Corollary 2

If $\mathrm{sp}\{\bm{z}_{1},\bm{z}_{2},\cdots,\bm{z}_{n}\}=\mathbb{R}^{p}$, then under a chordal graphical model the ML estimator for the MGGD exists and is unique for all $\beta\geq 0.5$.

Unfortunately, the recent result in [2] on $\rho(x)=\frac{p}{2}\log(x)$ does not satisfy our conditions and requires a different analysis. Moreover, there is no detailed proof to support the claim in [2] that these banded Tyler estimators converge to unique fixed points.

When the condition in Theorem 2 holds, the objective function is convex with respect to $\bm{C}$. Therefore, it can be numerically minimized via any general purpose convex optimization solver. For example, in the MGGD case with $\beta\geq 0.5$ the problem can be expressed as

\begin{array}{ll}\min_{\bm{C}\in\mathcal{Y},\,\mathbf{t}}&\sum_{i=1}^{n}|t_{i}|^{2\beta}-2n\sum_{j=1}^{p}\log(\bm{C}_{j,j})\\ \mathrm{s.t.}&t_{i}\geq\|\bm{C}^{T}\bm{z}_{i}\|\\ &\left[\bm{C}\right]_{ij}=0,\quad(i,j)\notin E\end{array}\qquad(29)

where $\mathbf{t}$ are auxiliary variables. This formulation with second order cone constraints can be easily solved using the popular CVX package [17].
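For readers working in Python, an analogous sketch using CVXPY (an assumption on our part; the paper itself uses the MATLAB CVX package) is given below for a banded pattern with band width $d$. The data matrix Z (rows $\bm{z}_{i}^{T}$) and $\beta\geq 0.5$ are assumed given.

import cvxpy as cp
import numpy as np

def banded_mggd_mle(Z, beta, d):
    # chordal (here: banded) MGGD MLE over the Cholesky factor C, as in (29)
    n, p = Z.shape
    C = cp.Variable((p, p))
    t = cp.Variable(n, nonneg=True)
    keep = np.tril(np.abs(np.subtract.outer(np.arange(p), np.arange(p))) < d)
    cons = [C[i, j] == 0 for i in range(p) for j in range(p) if not keep[i, j]]
    cons += [t[i] >= cp.norm(C.T @ Z[i]) for i in range(n)]   # t_i >= ||C^T z_i||
    obj = cp.sum(cp.power(t, 2 * beta)) - 2 * n * cp.sum(cp.log(cp.diag(C)))
    cp.Problem(cp.Minimize(obj), cons).solve()
    C_hat = C.value
    return np.linalg.inv(C_hat @ C_hat.T)        # estimated scatter matrix

Note that the logarithm terms implicitly restrict the diagonal of $\bm{C}$ to be positive, so the constraint $\bm{C}\in\mathcal{Y}$ is satisfied automatically.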

Alternatively, the optimization can be addressed using a majorization-minimization (MM) technique, e.g., [5]. The method begins with an initial estimate $\hat{\bm{\Sigma}}_{0}\in\mathrm{S_{++}}(p)$ and updates it according to the following iterations

\hat{\bm{\Sigma}}_{m+1}=\arg\left\{\begin{array}{ll}\min_{\bm{\Sigma}}&h(\bm{\Sigma},\bm{\Sigma}_{m})\\ \mathrm{s.t.}&\left[\bm{\Sigma}^{-1}\right]_{ij}=0,\quad(i,j)\notin E\end{array}\right.\qquad(32)

where

h(\bm{\Sigma},\bm{\Sigma}_{m})=\sum_{i=1}^{n}u(\bm{z}_{i}^{T}\bm{\Sigma}_{m}^{-1}\bm{z}_{i})\bm{z}_{i}^{T}\bm{\Sigma}^{-1}\bm{z}_{i}+n\log\det(\bm{\Sigma})+C,\qquad(33)

$u(x)=\rho^{\prime}(x)$ and $C$ is chosen such that $h(\bm{\Sigma}_{m},\bm{\Sigma}_{m})=L_{0}(\bm{\Sigma}_{m})$. Since $\rho^{\prime\prime}(x)$ is continuous and nonnegative, we have

L_{0}(\bm{\Sigma}_{m+1})\leq h(\bm{\Sigma}_{m+1},\bm{\Sigma}_{m})\leq h(\bm{\Sigma}_{m},\bm{\Sigma}_{m})=L_{0}(\bm{\Sigma}_{m})\qquad(34)

and $L_{0}(\bm{\Sigma}_{m})$ converges as $m\rightarrow\infty$. Each iteration step in (32)-(33) can be interpreted as a Gaussian graphical model optimization (the weights $u(\cdot)$ are constant with respect to the optimization variable). These minimizations have a simple closed form solution when $G(V,E)$ is chordal (see the appendix). Thus, the proposed technique is very efficient in practice.

Finally, it is worth mentioning that our framework can also be extended to structure learning in MGGD graphical models. Structure learning, also known as covariance selection, considers the estimation of an inverse covariance which is known to be sparse but the sparsity pattern itself is unknown [13, 8, 29]. Adding a sparsity constraint usually destroys the convexity. Instead, the modern approach relies on a convex relaxation based on an L1 norm penalty. In our context, the problem is convex in the Cholesky factor rather than in the inverse covariance itself. This leads to the following convex minimization

\min_{\bm{C}\in\mathcal{Y}}\sum_{i=1}^{n}\rho(\|\bm{C}^{T}\bm{z}_{i}\|^{2})-2n\sum_{j=1}^{p}\log([\bm{C}]_{j,j})+\lambda\|\bm{C}\|_{1}\qquad(35)

where λ\lambda is a regularization parameter, and 𝑪1\|\bm{C}\|_{1} is a matrix version of the L1 norm, namely a sum over the absolute values of the elements in 𝑪\bm{C}. It is important to emphasize that this approach is only applicable to chordal graphical models and it assumes that the perfect order of the variables is known a priori. Recent developments in high dimensional covariance estimation provide data-driven methods for identifying this order and structure [30, 28]. We leave this topic as a possible direction for future research.

Furthermore, our methodology can be easily extended to joint estimation of the covariance and the centering parameter $\mu$. For this purpose, consider the optimization

\begin{array}{ll}\min_{\bm{C}\in\mathcal{Y},\,\bm{x}}&\sum_{i=1}^{n}\rho\left(\|\bm{C}^{T}\bm{z}_{i}-\bm{x}\|^{2}\right)-2n\sum_{j=1}^{p}\log(\bm{C}_{j,j})\\ \mathrm{s.t.}&\left[\bm{C}\right]_{ij}=0,\quad(i,j)\notin E\end{array}\qquad(38)

where $\bm{x}=\bm{C}^{T}\bm{\mu}$. When the conditions in Theorem 2(a) hold, the objective function is jointly convex with respect to $(\bm{C},\bm{x})$, and therefore any local minimizer is also a global minimizer. The resulting problem can be similarly addressed using a majorization-minimization (MM) technique as in (32)-(33).

V Numerical results

In this section, we present results of numerical simulations. The purpose of these simulations is threefold: to validate our theoretical results using synthetic data, to provide insight on a few open theoretical questions using synthetic data, and to demonstrate the usefulness of our theoretical models in a real world setting.

The first experiment considered inverse covariance estimation using synthetic data generated from a known MGGD distribution with $\beta\geq 0.5$ and a chordal structure. The theory in Section IV shows that, in this case, the MLE can be formulated as a convex optimization. To verify this result, we generated $n$ i.i.d. realizations of an MGGD with $p=10$, $\beta=0.5$ (corresponding to a classical multivariate Laplace distribution), and a banded inverse covariance of width $b=4$. In particular, we used a Toeplitz inverse covariance with diagonal elements equal to $1.0$ and off-diagonal elements equal to $0.4$ within the main band. We estimated the unknown covariance using five estimators:

  • G: A Gaussian MLE with no prior knowledge of the structure, corresponding to the classical sample covariance.

  • BG: A Gaussian banded MLE as detailed in the appendix.

  • MGGD: An MGGD MLE with no prior knowledge of the structure via $30$ iterations of (14).

  • BMGGD1: An MGGD banded MLE via $30$ iterations of (32) and the subroutine in the appendix. This estimator is initialized with $\hat{\bm{\Sigma}}=\mathbf{I}$.

  • BMGGD2: An MGGD banded MLE via $30$ iterations of (32) and the subroutine in the appendix. This estimator is initialized with $\hat{\bm{\Sigma}}=\bm{\Sigma}$ (which is clearly impossible in practice).

In all the estimators above we used $30$ iterations, since in our simulations the fixed point algorithms (14) and (32) converge well within $30$ iterations.

Figure 1 shows the normalized mean squared Frobenius error in the covariance, averaged over $10000$ independent simulations, as a function of the number of samples $n$. The performance advantage of the banded MGGD estimators is easy to see. As expected, BMGGD1 and BMGGD2 converge to identical fixed points irrespective of their initial conditions.

The second experiment is very similar to the first, except that $\beta=0.2$. Our theory assumes $\beta\geq 0.5$ and therefore does not hold for this value of the shape parameter. Yet, according to [2], banded Tyler estimators do converge to unique fixed points, and these can be interpreted as the limit of the MGGD when $\beta\rightarrow 0$. Our proposed fixed point iteration in (32) can be implemented with a small $\beta$, so it is interesting to examine its performance, and we repeated the first experiment with $\beta=0.2$. The results are presented in Figure 2. Interestingly, BMGGD1 and BMGGD2 converged to identical fixed points irrespective of their initial conditions. Thus, although we have no proof of this behavior, we conjecture that, in practice, the iteration can also be used for all MGGDs.

The third experiment focused on non-chordal structures. Our theory assumes a chordal sparsity pattern, and it is interesting to see whether this assumption is indeed critical. Thus, we repeated the first experiment associated with the MGGD with $\beta=0.5$, but replaced the banded inverse covariance with a loopy structure. In particular, we constructed a two dimensional grid graph of size $p=3^{2}$, whose non-edges (the entries of $\bm{\Sigma}^{-1}$ constrained to zero) are $(3,1)$, $(5,1)$, $(6,1)$, $(7,1)$, $(8,1)$, $(9,1)$, $(4,2)$, $(6,2)$, $(7,2)$, $(8,2)$, $(9,2)$, $(4,3)$, $(5,3)$, $(7,3)$, $(8,3)$, $(9,3)$, $(6,4)$, $(8,4)$, $(9,4)$, $(7,5)$, $(9,5)$, $(7,6)$, $(8,6)$, and $(9,7)$. All the non-zero off diagonal elements were assigned a constant value small enough to ensure that $\bm{\Sigma}^{-1}$ is well conditioned. In contrast to the chordal case, the MLE in general Gaussian graphical models does not have a closed form solution. Instead, we solved (17) using the CVX optimization package [17], which was used for implementing BG and for the inner solution in each iteration of BMGGD1 and BMGGD2. The results, averaged over $400$ simulations (due to CVX these simulations are highly time consuming), are provided in Figure 3. Here too, the iterations converged to identical solutions and appear to be independent of their initial conditions.

The fourth experiment addressed the practical use of the BMGGD1 estimator in a real world example. Following [29], we considered the SONAR dataset from the UCI machine learning data repository. This dataset has 111 spectra from metal cylinders and 97 spectra from rocks, where each spectrum has 60 frequency band energy measurements. A quadratic discriminant analysis (QDA) test is used to classify the metals and the rocks. It requires the estimation of the covariance in both classes, and previous work demonstrated the advantage of BG over the classical sample covariance and the naive diagonal estimate. We repeated the experiment step by step and added BMGGD1. Specifically, we chose $\beta$ as the parameter within $\{0.5,0.6,\cdots,0.9,1.0\}$ that maximizes the MGGD likelihood, and used the same band which was used by BG (selected via 10 random splits, with $1/3$ of the data used for training and the band chosen to maximize the validation likelihood). For the QDA test we applied the covariance of the MGGD, which is $c(\beta)\bm{\Sigma}$, where $c(\beta)=2^{\frac{1}{\beta}}\frac{\Gamma(\frac{p+2}{2\beta})}{p\Gamma(\frac{p}{2\beta})}$. The test errors over a standard leave-one-out cross validation are provided in Table I. These results remained stable over different randomizations, and demonstrate the advantage of the proposed BMGGD1 framework.
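For reference, the scaling factor $c(\beta)$ above can be computed as in the following short sketch (an assumed helper of our own, using log-gamma for numerical stability); it equals $1$ in the Gaussian case $\beta=1$:

import numpy as np
from scipy.special import gammaln

def mggd_cov_factor(beta, p):
    # c(beta) = 2**(1/beta) * Gamma((p+2)/(2*beta)) / (p * Gamma(p/(2*beta)))
    return 2.0 ** (1.0 / beta) * np.exp(
        gammaln((p + 2) / (2.0 * beta)) - gammaln(p / (2.0 * beta))) / p

print(mggd_cov_factor(1.0, 60))   # 1.0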

TABLE I: Test errors in SONAR dataset.
Sample covariance | Naive Bayes | BG | BMGGD1
24.0% | 32.7% | 15.4% | 13.5%
Figure 1: MGGD with $\beta=0.5$ and a banded inverse covariance.
Figure 2: MGGD with $\beta=0.2$ and a banded inverse covariance.
Figure 3: MGGD with $\beta=0.5$ and a non-chordal graphical model.

VI Discussion

In this paper, we considered covariance estimation in the multivariate generalized Gaussian distribution (MGGD). We proved that the MGGD negative log-likelihood is geodesically convex for all $\beta>0$. In the sparsity constrained case, we proved that a simple change of variables transforms it into a convex function as long as $\beta\geq 0.5$ and the underlying graph is chordal. This means that any local solution of these minimizations is globally optimal and the problems can be solved using standard descent methods. In practice, we observed this behavior also for smaller values of $\beta$. This agrees with a similar result on banded Tyler methods, which can be interpreted as the case $\beta\rightarrow 0$. An interesting direction for future work is a rigorous analysis of this phenomenon when $0<\beta<0.5$. Another direction is relaxing the chordal assumption on the sparsity pattern. Here too, our numerical experience suggests that simple descent methods converge to the global solution and are independent of their initial conditions. Finally, as mentioned above, it is interesting to examine the problem of structure learning in MGGDs via sparsity enforcing priors.

Appendix

In this appendix, we review a simple closed form solution for the MLE in chordal (decomposable) Gaussian graphical models [20, 34, 4]. The problem is

\begin{array}{ll}\min_{\bm{\Sigma}}&\sum_{i=1}^{n}\alpha_{i}\bm{z}_{i}^{T}\bm{\Sigma}^{-1}\bm{z}_{i}+n\log\det{\bm{\Sigma}}\\ \mathrm{s.t.}&\left[\bm{\Sigma}^{-1}\right]_{ij}=0,\quad(i,j)\notin E\end{array}\qquad(42)

where

\alpha_{i}=u(\bm{z}_{i}^{T}\bm{\Sigma}_{m}^{-1}\bm{z}_{i})\qquad(43)

are the iteration weights, which do not depend on $\bm{\Sigma}$. Using the chordal Cholesky decomposition $\bm{\Sigma}^{-1}=\bm{C}\bm{C}^{T}$, the problem is equivalent to

\begin{array}{ll}\min_{\bm{C}}&\sum_{i=1}^{n}\alpha_{i}\|\bm{C}^{T}\bm{z}_{i}\|^{2}-2n\sum_{j=1}^{p}\log([\bm{C}]_{jj})\\ \mathrm{s.t.}&\left[\bm{C}\right]_{ij}=0,\quad(i,j)\notin E\text{ or }i<j\end{array}\qquad(46)

This minimization is completely separable, and each column of $\bm{C}$ can be obtained by a simple linear regression. Let $\{\bm{Z}_{i}\}_{i=1}^{p}\subset\mathbb{R}^{n}$ be the vectors consisting of the $i$-th components of $\sqrt{\alpha_{1}}\bm{z}_{1},\sqrt{\alpha_{2}}\bm{z}_{2},\cdots,\sqrt{\alpha_{n}}\bm{z}_{n}$, let $J_{i}=\{i+1\leq k\leq p:(i,k)\in E\}$, and let $\beta(i,j)$ denote the coefficient of $\bm{Z}_{j}$ in the linear regression of $\bm{Z}_{i}$ with respect to $\{\bm{Z}_{j}\}_{j\in J_{i}}$, with standard error $r_{i}$. Then

[\bm{A}]_{ij}=\begin{cases}1,&\text{when $i=j$}\\ \beta(j,i),&\text{when $i>j$ and $(i,j)\in E$}\\ 0,&\text{otherwise},\end{cases}\qquad(47)

\bm{D}=\operatorname{diag}(r_{1},r_{2},\cdots,r_{p}),\qquad(48)

and finally $\bm{C}=\bm{D}\bm{A}$.
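For completeness, the following Python sketch (our own illustration, assuming the natural perfect order and an edge set E given as a set of index pairs) solves (46) directly, column by column: the off-diagonal entries of each column are obtained from a least-squares fit against the columns indexed by $J_{j}$, and the diagonal entry is set to $\sqrt{n/\mathrm{RSS}_{j}}$, which minimizes the remaining one-dimensional problem.

import numpy as np

def chordal_gaussian_factor(W, E):
    # G-sparse lower-triangular C minimizing sum_i ||C^T w_i||^2 - 2n sum_j log C_jj,
    # where the rows of W are sqrt(alpha_i) * z_i^T; the estimate is inv(C @ C.T)
    n, p = W.shape
    C = np.zeros((p, p))
    for j in range(p):
        J = [k for k in range(j + 1, p) if (j, k) in E or (k, j) in E]
        if J:
            v, *_ = np.linalg.lstsq(W[:, J], -W[:, j], rcond=None)
            resid = W[:, j] + W[:, J] @ v
        else:
            v, resid = np.zeros(0), W[:, j]
        C[j, j] = np.sqrt(n / np.sum(resid ** 2))   # closed-form diagonal entry
        if J:
            C[J, j] = C[j, j] * v                   # scaled regression coefficients
    return C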

References

  • [1] Y. Abramovich and O. Besson. Regularized covariance matrix estimation in complex elliptically symmetric distributions using the expected likelihood approach. ISAE, Tech. Rep, 2012.
  • [2] Y. Abramovich and B. Johnson. Expected likelihood approach for covariance matrix estimation: Complex angular central gaussian case. In Sensor Array and Multichannel Signal Processing Workshop (SAM), 2012 IEEE 7th, pages 317–320. IEEE, 2012.
  • [3] Y. Abramovich, N. Spencer, and M. Turley. Time-varying autoregressive (tvar) models for multiple radar observations. Signal Processing, IEEE Transactions on, 55(4):1298–1311, 2007.
  • [4] H. Andersen, M. Hojbjerre, D. Sorensen, and P. Eriksen. Linear and Graphical Models: For the Multivariate Complex Normal Distribution, volume 101. Springer, 1995.
  • [5] O. Arslan. Convergence behavior of an iterative reweighting algorithm to compute multivariate M-estimates for location and scatter. Journal of Statistical Planning and Inference, 118(1-2):115 – 128, 2004.
  • [6] C. Auderset, C. Mazza, and E. A. Ruh. Angular gaussian and cauchy estimation. Journal of Multivariate Analysis, 93(1):180 – 197, 2005.
  • [7] R. Bhatia. Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007.
  • [8] P. Bickel and E. Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227, 2008.
  • [9] L. Bombrun, F. Pascal, J. Tourneret, and Y. Berthoumieu. Performance of the maximum likelihood estimators for the parameters of multivariate generalized gaussian distributions. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3525–3528. IEEE, 2012.
  • [10] Y. Chen, A. Wiesel, and A. Hero. Robust shrinkage estimation of high-dimensional covariance matrices. Signal Processing, IEEE Transactions on, 59(9):4097 –4107, sept. 2011.
  • [11] M. A. Chmielewski. Elliptically symmetric distributions: A review and bibliography. International Statistical Review / Revue Internationale de Statistique, 49(1):pp. 67–74, 1981.
  • [12] J. Dahl, L. Vandenberghe, and V. Roychowdhury. Covariance selection for nonchordal graphs via chordal embedding. Optimization Methods Software, 23(4):501–520, Aug. 2008.
  • [13] A. Dempster. Covariance selection. Biometrics, pages 157–175, 1972.
  • [14] T. Eltoft, T. Kim, and T. Lee. On the multivariate laplace distribution. Signal Processing Letters, IEEE, 13(5):300–303, 2006.
  • [15] M. Finegold and M. Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 5(2A):1057–1080, 2011.
  • [16] M. Fukuda, M. Kojima, K. Murota, and K. Nakata. Exploiting sparsity in semidefinite programming via matrix completion i: General framework. Siam Journal on Optimization, 11:647–674, 1999.
  • [17] M. Grant, S. Boyd, and Y. Ye. CVX: Matlab software for disciplined convex programming. Online: http://stanford.edu/~boyd/cvx, 2008.
  • [18] P. Huber. Robust statistics. 1981.
  • [19] J. T. Kent and D. E. Tyler. Redescending M-estimates of multivariate location and scatter. The Annals of Statistics, 19(4):pp. 2102–2119, 1991.
  • [20] S. Lauritzen. Graphical models, volume 17. Oxford University Press, USA, 1996.
  • [21] H. Liu, F. Han, and C. Zhang. Transelliptical graphical models. In Advances in Neural Information Processing Systems 25, pages 809–817, 2012.
  • [22] C. Niculescu and L. Persson. Convex functions and their applications: a contemporary approach. Number v. 13 in CMS books in mathematics. Springer, 2006.
  • [23] M. Novey, T. Adali, and A. Roy. A complex generalized gaussian distribution; characterization, generation, and estimation. Signal Processing, IEEE Transactions on, 58(3):1427 –1433, march 2010.
  • [24] E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor. Complex Elliptically Symmetric Distributions: Survey, New Results and Applications. IEEE Transactions on Signal Processing, 60(11):5597–5625, Nov. 2012.
  • [25] F. Pascal, L. Bombrun, J.-Y. Tourneret, and Y. Berthoumieu. Parameter estimation for multivariate generalized gaussian distributions. Signal Processing, IEEE Transactions on (submitted to), 2012. [Online]. http://arxiv.org/abs/1302.6498.
  • [26] F. Pascal, Y. Chitour, J. Ovarlez, P. Forster, and P. Larzabal. Covariance structure maximum-likelihood estimates in compound gaussian noise: Existence and algorithm analysis. Signal Processing, IEEE Transactions on, 56(1):34–48, 2008.
  • [27] T. Rapcsak. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 69:169–183, 1991. 10.1007/BF00940467.
  • [28] B. Rolfs and B. Rajaratnam. Natural order recovery for banded covariance models. In Sensor Array and Multichannel Signal Processing Workshop (SAM), 2012 IEEE 7th, pages 365–368. IEEE, 2012.
  • [29] A. J. Rothman, E. Levina, and J. Zhu. A new approach to cholesky-based covariance regularization in high dimensions. Biometrika, 97(3):539–550, 2010.
  • [30] P. Rütimann and P. Bühlmann. High dimensional sparse covariance estimation via directed acyclic graphs. Electronic Journal of Statistics, 3:1133–1160, 2009.
  • [31] D. E. Tyler. A distribution-free m-estimator of multivariate scatter. The Annals of Statistics, 15(1):pp. 234–251, 1987.
  • [32] A. Wiesel. Geodesic convexity and covariance estimation. Signal Processing, IEEE Transactions on, 60(12):6182 –6189, dec. 2012.
  • [33] A. Wiesel. Unified framework to regularized covariance estimation in scaled gaussian models. Signal Processing, IEEE Transactions on, 60(1):29 –38, jan. 2012.
  • [34] A. Wiesel, Y. Eldar, and A. Hero. Covariance estimation in decomposable gaussian graphical models. Signal Processing, IEEE Transactions on, 58(3):1482 –1492, march 2010.
  • [35] A. Wiesel and A. Hero. Decomposable principal component analysis. Signal Processing, IEEE Transactions on, 57(11):4369–4377, 2009.
  • [36] T. Zhang. Robust subspace recovery by geodesically convex optimization. arXiv preprint arXiv:1206.1386, 2012.