
Sparse Cholesky Covariance Parametrization for Recovering Latent Structure in Ordered Data

Irene Córdoba, Concha Bielza, Pedro Larrañaga and Gherardo Varando
Abstract

The sparse Cholesky parametrization of the inverse covariance matrix is directly related to Gaussian Bayesian networks. Its counterpart, the covariance Cholesky factorization model, has a natural interpretation as a hidden variable model for ordered signal data. Despite this, it has received little attention so far, with few notable exceptions. To fill this gap, in this paper we focus on arbitrary zero patterns in the Cholesky factor of a covariance matrix. We discuss how these models can also be extended, in analogy with Gaussian Bayesian networks, to data where no apparent order is available. For the ordered scenario, we propose a novel estimation method that is based on matrix loss penalization, as opposed to the existing regression-based approaches. The performance of this sparse model for the Cholesky factor, together with our novel estimator, is assessed in a simulation setting, as well as over spatial and temporal real data where a natural ordering arises among the variables. We give guidelines, based on the empirical results, about which of the methods analysed is more appropriate for each setting.

Keywords Covariance matrix, sparse matrices, regression analysis, graphical model, Gaussian distribution

1 Introduction

The multivariate Gaussian distribution is central in both statistics and machine learning because of its wide applicability and well-understood theoretical behaviour. Whenever it is necessary to reduce model dimension, such as in a high-dimensional setting [1] or in graphical models [2, 3], sparsity is imposed on either the covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}} or its inverse 𝛀=𝚺1\boldsymbol{\mathrm{\Omega}}=\boldsymbol{\mathrm{\Sigma}}^{-1}.

The inverse covariance matrix 𝛀\boldsymbol{\mathrm{\Omega}} directly encodes information about partial correlations; therefore, when a zero pattern is present in 𝛀\boldsymbol{\mathrm{\Omega}}, it represents absent edges in the undirected graph of a Gaussian Markov network [4, 5]. Furthermore, letting 𝛀=𝐖𝐖t\boldsymbol{\mathrm{\Omega}}=\boldsymbol{\mathrm{W}}\boldsymbol{\mathrm{W}}^{t} be its Cholesky decomposition, a zero pattern in the lower triangular matrix 𝐖\boldsymbol{\mathrm{W}} yields the acyclic digraph associated with a Gaussian Bayesian network [6, 7] model, up to a permutation of the variables [8]. As a result, much of the academic focus has been on sparsity in either the inverse covariance matrix or its Cholesky decomposition [9, 10, 11, 12, 13].

Conversely, a zero pattern in the covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}} represents missing edges from the undirected graph of a covariance graph model [14, 15, 16, 17]. However, a structured zero pattern on the Cholesky decomposition 𝚺=𝐓𝐓t\boldsymbol{\mathrm{\Sigma}}=\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t} of the covariance matrix has only been addressed by a few works: [18] and [19]. In [18] the authors briefly analyse zeros in 𝐓\boldsymbol{\mathrm{T}}, mainly as a tool for a better understanding of a higher-level graphical model called the covariance chain, which is the main focus of their work. More directly related to our interests is the work of Rothman et al. [19], who explore a regression interpretation of the Cholesky factor 𝐓\boldsymbol{\mathrm{T}} over the error variables. However, they do not address an arbitrary structured zero pattern; instead, they focus on a banding structure for 𝐓\boldsymbol{\mathrm{T}}, inspired by the popularity of this estimator for the covariance matrix. In fact, a significant amount of their paper is devoted to analysing the relationship between the covariance matrix, or its inverse, and banded Cholesky factorizations.

To fill this gap, in this paper we focus on arbitrary zero patterns in the Cholesky factor 𝐓\boldsymbol{\mathrm{T}} of the covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}}. We argue how this naturally models scenarios with ordered variables representing noisy inputs from hidden signal sources, see for example [20] where latent brain regions are assumed to influence neural measurements. We discuss how this model could be extended to unordered variables by defining a new Gaussian graphical model over the Cholesky factorization of the covariance matrix. This new graphical model can be thought of as the analogue of a Gaussian Bayesian network, which is defined via sparse factorizations of the inverse covariance matrix. For ordered variables, we propose a novel learning method by directly penalising a matrix loss, in contrast with existing regression-based approaches in the literature [19]. This new estimator is feasible to compute thanks to a simplification of the loss gradient computation similar to the one proposed in [21]. Finally, we empirically assess the model performance in a broad range of experiments, exploring multiple simulation scenarios as well as real data where a natural spatial or temporal order arises between the variables.

The rest of the paper is organized as follows. In Section 2 we introduce the theoretical preliminaries necessary to follow the exposition. Then we detail the proposed sparse model for the Cholesky factor 𝐓\boldsymbol{\mathrm{T}} of the covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}}, as well as introduce its extension to a Gaussian graphical model, in Section 3. Afterwards, we discuss existing state-of-the-art estimation methods in Section 4, where we also detail our novel penalized matrix loss proposal. We assess the previously explained estimation methods in both simulated and real experiments, whose results are shown in Section 5. Finally, we close the paper in Section 6, discussing the conclusions that can be drawn from the presented work, and also outlining the planned lines of future research. Appendices A.1, A.2 and A.3 are referenced throughout the paper and contain additional material for the interested reader.

2 Theoretical preliminaries

We will denote as 𝒩(𝝁,𝚺)\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\mathrm{\Sigma}}) the multivariate Gaussian distribution with mean vector 𝝁\boldsymbol{\mu} and covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}}. Since we focus on the zero pattern in 𝚺\boldsymbol{\mathrm{\Sigma}}, in the following we will set 𝝁=𝟎\boldsymbol{\mu}=\boldsymbol{0} for notational simplicity, without losing generality. As in the previous section, we will denote the inverse covariance matrix as 𝛀=𝚺1\boldsymbol{\mathrm{\Omega}}=\boldsymbol{\mathrm{\Sigma}}^{-1}.

2.1 The inverse covariance matrix and systems of recursive regressions

As we stated previously, the inverse covariance matrix and its decompositions have been thoroughly researched. If 𝑿\boldsymbol{X} is a pp-variate Gaussian random vector, that is, if 𝑿𝒩(𝟎,𝚺)\boldsymbol{X}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{\Sigma}}), then the upper Cholesky factorization of 𝛀\boldsymbol{\mathrm{\Omega}} can be used to model ordered sequences of regressions [7, 9], as follows.

For J{1,,p}J\subseteq\{1,\ldots,p\}, iJi\not\in J and jJj\in J, let 𝜷i|J=(𝚺iJ𝚺JJ1)t\boldsymbol{\beta}_{i|J}=(\boldsymbol{\mathrm{\Sigma}}_{iJ}\boldsymbol{\mathrm{\Sigma}}^{-1}_{JJ})^{t} denote the vector of regression coefficients of XiX_{i} on 𝑿J\boldsymbol{X}_{J}, with its jj-th entry being βij|J\beta_{ij|J}, the coefficient corresponding to variable XjX_{j}. We may equivalently write the statistical model for 𝑿\boldsymbol{X} as a system of recursive linear regression equations (also called a linear structural equation model [22]), where for each i{1,,p}i\in\{1,\ldots,p\},

Xi=j=1i1βij|1,,i1Xj+i,X_{i}=\sum_{j=1}^{i-1}\beta_{ij|1,\ldots,i-1}X_{j}+\mathcal{E}_{i}, (1)

with i\mathcal{E}_{i} independent Gaussian random variables of zero mean and var(i)=var(Xi|X1,,Xi1)\operatorname{var}(\mathcal{E}_{i})=\operatorname{var}(X_{i}|X_{1},\ldots,X_{i-1}). The above regression system can also be expressed as 𝑿=𝐁𝑿+𝓔\boldsymbol{X}=\boldsymbol{\mathrm{B}}\boldsymbol{X}+\boldsymbol{\mathcal{E}}, where 𝐁\boldsymbol{\mathrm{B}} is a strictly lower triangular matrix with entries bij=βij|1,,i1b_{ij}=\beta_{ij|1,\ldots,i-1}. Rearranging the previous equation, we obtain

𝑿=𝐋𝓔,\boldsymbol{X}=\boldsymbol{\mathrm{L}}\boldsymbol{\mathcal{E}}, (2)

where 𝐋=(𝐈p𝐁)1\boldsymbol{\mathrm{L}}=(\boldsymbol{\mathrm{I}}_{p}-\boldsymbol{\mathrm{B}})^{-1} with 𝐈p\boldsymbol{\mathrm{I}}_{p} the pp-dimensional identity matrix. Now if we take variances and inverses on Equation (2), we arrive at the upper Cholesky decomposition of the inverse covariance matrix,

𝛀=𝚺1=𝐋t𝐃1𝐋1=𝐔𝐃1𝐔t=𝐖𝐖t,\boldsymbol{\mathrm{\Omega}}=\boldsymbol{\mathrm{\Sigma}}^{-1}=\boldsymbol{\mathrm{L}}^{-t}\boldsymbol{\mathrm{D}}^{-1}\boldsymbol{\mathrm{L}}^{-1}=\boldsymbol{\mathrm{U}}\boldsymbol{\mathrm{D}}^{-1}\boldsymbol{\mathrm{U}}^{t}=\boldsymbol{\mathrm{W}}\boldsymbol{\mathrm{W}}^{t}, (3)

where 𝐔=𝐋t=(𝐈p𝐁)t\boldsymbol{\mathrm{U}}=\boldsymbol{\mathrm{L}}^{-t}=(\boldsymbol{\mathrm{I}}_{p}-\boldsymbol{\mathrm{B}})^{t} and 𝐖=𝐔𝐃1\boldsymbol{\mathrm{W}}=\boldsymbol{\mathrm{U}}\sqrt{\boldsymbol{\mathrm{D}}^{-1}} are upper triangular matrices, and 𝐃\boldsymbol{\mathrm{D}} is a diagonal matrix containing the variances of 𝓔\boldsymbol{\mathcal{E}}.

Equations (1) and (3) are intimately related: all parameters of the recursive regression system are encoded in the upper Cholesky factorization. Indeed, we have that 𝐃=var(𝓔)\boldsymbol{\mathrm{D}}=\operatorname{var}(\boldsymbol{\mathcal{E}}) and the (j,i)(j,i) entry of matrix 𝐔\boldsymbol{\mathrm{U}}, ujiu_{ji}, is equal to βij|1,,i1-\beta_{ij|1,\ldots,i-1}, therefore all the regression coefficients can also be recovered, and 𝐔\boldsymbol{\mathrm{U}} is explicitly written as

𝐔=(1β21|1βp1|1,,p101βpp1|1,,p1001).\boldsymbol{\mathrm{U}}=\begin{pmatrix}1&-\beta_{21|1}&\cdots&-\beta_{p1|1,\ldots,p-1}\\ 0&1&\ddots&\vdots\\ \vdots&\ddots&\ddots&-\beta_{pp-1|1,\ldots,p-1}\\ 0&\cdots&0&1\\ \end{pmatrix}. (4)
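As a quick numerical sanity check of Equations (3) and (4), the following base R sketch (illustrative only, not part of the paper's code; all variable names are ours) fits the recursive regressions of Equation (1) on simulated data and verifies that the resulting matrices recover the inverse of the sample covariance matrix.

  # Check Equations (3)-(4) on simulated data (base R).
  set.seed(1)
  p <- 4; N <- 5000
  A <- matrix(rnorm(p * p), p, p)
  Sigma <- crossprod(A) + diag(p)                      # a positive definite covariance
  X <- matrix(rnorm(N * p), N, p) %*% chol(Sigma)      # rows ~ N(0, Sigma)
  S <- crossprod(X) / N                                # sample covariance (mu = 0)

  B <- matrix(0, p, p)                                 # strictly lower triangular coefficients
  D <- diag(p)
  D[1, 1] <- mean(X[, 1]^2)
  for (i in 2:p) {
    fit <- lm(X[, i] ~ X[, 1:(i - 1), drop = FALSE] - 1)  # regress X_i on its predecessors
    B[i, 1:(i - 1)] <- coef(fit)
    D[i, i] <- mean(resid(fit)^2)
  }
  U <- t(diag(p) - B)                                  # Equation (4)
  all.equal(U %*% solve(D) %*% t(U), solve(S))         # Equation (3): Omega = U D^{-1} U^t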

2.2 The Cholesky decomposition of a covariance matrix

From Equation (2) we could have proceeded differently. Instead of taking variances and then inverses, which is best suited for analysing ordered sequences of variable regressions and the associated Gaussian Bayesian networks (GBNs), we could have taken only variances, arriving at the Cholesky decomposition of the covariance matrix [19]

𝚺=𝐋𝐃𝐋t=𝐓𝐓t,\boldsymbol{\mathrm{\Sigma}}=\boldsymbol{\mathrm{L}}\boldsymbol{\mathrm{D}}\boldsymbol{\mathrm{L}}^{t}=\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t}, (5)

where now 𝐋=(𝐈p𝐁)1\boldsymbol{\mathrm{L}}=(\boldsymbol{\mathrm{I}}_{p}-\boldsymbol{\mathrm{B}})^{-1} and 𝐓=𝐋𝐃\boldsymbol{\mathrm{T}}=\boldsymbol{\mathrm{L}}\sqrt{\boldsymbol{\mathrm{D}}} are lower triangular.

Observe that Equation (5) is a direct analogue of Equation (3), although at first sight the entries of 𝐋\boldsymbol{\mathrm{L}} lack an interpretation in terms of regression coefficients over the original variables. However, we will now show that the entries of 𝐋\boldsymbol{\mathrm{L}} are in fact regression coefficients, just over a different conditioning set. Alternative derivations of this result can be found in [23, p. 158] and [18, p. 846]. The one in [23] is computational, based on the sweep matrix operator [24], whereas [18] provides a sketch based on a recursive expression for regression coefficients. We instead use simple identities over partitioned matrices.

Proposition 1.

For i{1,,p}i\in\{1,\ldots,p\} and j<ij<i, the (i,j)(i,j) entry of matrix 𝐋\boldsymbol{\mathrm{L}}, denoted as lijl_{ij}, is equal to βij|1,,j{\beta}_{ij|1,\ldots,j}.

Proof.

For each i{1,,p}i\in\{1,\ldots,p\}, j<ij<i and J={1,,j}J=\{1,\ldots,j\}, the following partitioned identities hold [23, Equation (4.2.18)]

𝚺iJ=𝐋iJ𝐃JJ𝐋JJt,\displaystyle\boldsymbol{\mathrm{\Sigma}}_{iJ}=\boldsymbol{\mathrm{L}}_{iJ}\boldsymbol{\mathrm{D}}_{JJ}\boldsymbol{\mathrm{L}}_{JJ}^{t},
𝚺JJ=𝐋JJ𝐃JJ𝐋JJt.\displaystyle\boldsymbol{\mathrm{\Sigma}}_{JJ}=\boldsymbol{\mathrm{L}}_{JJ}\boldsymbol{\mathrm{D}}_{JJ}\boldsymbol{\mathrm{L}}_{JJ}^{t}.

Therefore,

𝜷i|Jt\displaystyle\boldsymbol{\beta}_{i|J}^{t} =𝚺iJ𝚺JJ1\displaystyle=\boldsymbol{\mathrm{\Sigma}}_{iJ}\boldsymbol{\mathrm{\Sigma}}_{JJ}^{-1}
=𝐋iJ𝐃JJ𝐋JJt(𝐋JJ𝐃JJ𝐋JJt)1\displaystyle=\boldsymbol{\mathrm{L}}_{iJ}\boldsymbol{\mathrm{D}}_{JJ}\boldsymbol{\mathrm{L}}_{JJ}^{t}(\boldsymbol{\mathrm{L}}_{JJ}\boldsymbol{\mathrm{D}}_{JJ}\boldsymbol{\mathrm{L}}_{JJ}^{t})^{-1}
=𝐋iJ𝐋JJ1.\displaystyle=\boldsymbol{\mathrm{L}}_{iJ}\boldsymbol{\mathrm{L}}_{JJ}^{-1}.

Furthermore, observe that, since 𝐋JJ\boldsymbol{\mathrm{L}}_{JJ} is lower triangular with ones along the diagonal, the last column of 𝐋JJ1\boldsymbol{\mathrm{L}}_{JJ}^{-1} is always a vector of zero entries except the last entry, which is 11. This means, in particular, that for each i{1,,p}i\in\{1,\ldots,p\}, j<ij<i and J={1,,j}J=\{1,\ldots,j\}, the jj-th element of row vector 𝐋iJ𝐋JJ1\boldsymbol{\mathrm{L}}_{iJ}\boldsymbol{\mathrm{L}}_{JJ}^{-1} is equal to lijl_{ij}, which in turn is equal to the jj-th entry of 𝜷i|J\boldsymbol{\beta}_{i|J}, βij|J\beta_{ij|J}. ∎

The explicit expression in terms of regression coefficients for 𝐋\boldsymbol{\mathrm{L}} is therefore,

𝐋=(100β21|110βp1|1βpp1|1,,p11).\boldsymbol{\mathrm{L}}=\begin{pmatrix}1&0&\cdots&0\\ {\beta}_{21|1}&1&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ {\beta}_{p1|1}&\cdots&{\beta}_{pp-1|1,\ldots,p-1}&1\\ \end{pmatrix}. (6)

From Equations (4) and (6) useful relationships can be determined; see, for example, [18, p.847] and Appendix A.1.
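The following base R sketch (illustrative, not from the paper's code) checks Proposition 1 numerically for a generic covariance matrix: the unit lower triangular factor of Equation (5) is extracted from the Cholesky factor and compared against the corresponding regression coefficient.

  # Numerical check of Proposition 1 and Equation (6) (base R).
  set.seed(2)
  p <- 5
  A <- matrix(rnorm(p * p), p, p)
  Sigma <- crossprod(A) + diag(p)

  Tm <- t(chol(Sigma))                       # lower Cholesky factor, Sigma = T T^t
  L <- Tm %*% diag(1 / diag(Tm))             # unit lower triangular factor of Equation (5)

  i <- 4; j <- 2; J <- 1:j
  beta_iJ <- solve(Sigma[J, J], Sigma[J, i]) # coefficients of the regression of X_i on X_J
  all.equal(L[i, j], beta_iJ[j])             # l_ij = beta_{ij|1,...,j}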

2.3 The Gaussian Bayesian network model and sparse Cholesky factorizations of 𝛀\boldsymbol{\mathrm{\Omega}}

The upper Cholesky factorization in Equation (3) is typically used as a parametrization for GBNs, see, for example, [13, 8, 25]. In the following we will explain why, since this is closely related to the extension of our proposal to a graphical model.

Let 𝒢=(V,E)\mathcal{G}=(V,E) be an acyclic digraph, where V={1,,p}V=\{1,\ldots,p\} is the vertex set and EV×VE\subseteq V\times V is the edge set. Assume that 1p1\prec\cdots\prec p is a topological order of VV, which means that for all iVi\in V, denoting as pa(i)\operatorname{pa}(i) the parent set in 𝒢\mathcal{G} of node ii, it holds that pa(i){1,,i1}\operatorname{pa}(i)\subseteq\{1,\ldots,i-1\}. The ordered Markov property of Bayesian networks states that [7]

XiXj|𝑿pa(i) for all j<i and jpa(i),X_{i}\rotatebox[origin={c}]{90.0}{$\models$}X_{j}|\boldsymbol{X}_{\operatorname{pa}(i)}\text{ for all }j<i\text{ and }j\not\in\operatorname{pa}(i), (7)

where XiXj|𝑿pa(i)X_{i}\rotatebox[origin={c}]{90.0}{$\models$}X_{j}|\boldsymbol{X}_{\operatorname{pa}(i)} stands for conditional independence [26], that is, XiX_{i} and XjX_{j} are independent given 𝑿pa(i)=𝒙pa(i)\boldsymbol{X}_{\operatorname{pa}(i)}=\boldsymbol{x}_{\operatorname{pa}(i)}, for any value of 𝒙pa(i)\boldsymbol{x}_{\operatorname{pa}(i)}. In the multivariate Gaussian distribution, the above conditional independence is equivalent to the regression coefficient βij|1,,i1\beta_{ij|1,\ldots,i-1} being zero (see for example [27]). Since βij|1,,i1=0\beta_{ij|1,\ldots,i-1}=0 if and only if uji=0u_{ji}=0, then the zero pattern containing absent edges in 𝒢\mathcal{G} can be read off from 𝐔\boldsymbol{\mathrm{U}} in Equation (3).

Now allow the ancestral order to be arbitrary, and denote as 𝒰(𝒢)\mathcal{U}(\mathcal{G}) the set of matrices that have positive diagonal and zeros compatible with a given acyclic digraph 𝒢=(V,E)\mathcal{G}=(V,E); that is, such that if (j,i)E(j,i)\not\in E, jij\neq i, then mji=0m_{ji}=0 for all 𝐌𝒰(𝒢)\boldsymbol{\mathrm{M}}\in\mathcal{U}(\mathcal{G}). The GBN model can thus be expressed as

𝒩(𝟎,𝚺) s.t. 𝚺1=𝐖𝐖t with 𝐖𝒰(G).\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{\Sigma}})\text{ s.t. }{\boldsymbol{\mathrm{\Sigma}}}^{-1}=\boldsymbol{\mathrm{W}}\boldsymbol{\mathrm{W}}^{t}\text{ with }\boldsymbol{\mathrm{W}}\in\mathcal{U}(G). (8)
Remark 1.

Observe that if 𝐗=(X1,,Xp)t𝒩(𝟎,𝚺)\boldsymbol{X}=(X_{1},\ldots,X_{p})^{t}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{\Sigma}}) follows a GBN model with graph 𝒢\mathcal{G}, the parameter matrix 𝐖\boldsymbol{\mathrm{W}} of Equation (8) is not the Cholesky factor of 𝛀=𝚺1\boldsymbol{\mathrm{\Omega}}=\boldsymbol{\mathrm{\Sigma}}^{-1}. This only occurs when the ancestral order of the nodes in 𝒢\mathcal{G} is 1p1\prec\cdots\prec p; that is, when the variables follow a natural order (the direct analogue of our model).

However, if we denote as κ(𝐌)\kappa(\boldsymbol{\mathrm{M}}) the permutation of rows and columns in 𝐌\boldsymbol{\mathrm{M}} following the ancestral order of 𝒢\mathcal{G}, then κ(𝐖)\kappa(\boldsymbol{\mathrm{W}}) is the upper Cholesky factor of κ(𝛀)\kappa(\boldsymbol{\mathrm{\Omega}}).

3 Sparse Cholesky decomposition of the covariance matrix

We will hereby introduce the sparse Cholesky decomposition model 𝚺=𝐓𝐓t\boldsymbol{\mathrm{\Sigma}}=\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t} for the covariance matrix, which consists of allowing an arbitrary zero pattern in 𝐓\boldsymbol{\mathrm{T}}. The entries of 𝐓\boldsymbol{\mathrm{T}} are in correspondence with those in 𝐋\boldsymbol{\mathrm{L}} and 𝐃\boldsymbol{\mathrm{D}} (see Equation (5), t_{ij}=l_{ij}\sqrt{d_{jj}}), and therefore we will sometimes refer to these matrices interchangeably.

3.1 The regression interpretations for the covariance Cholesky factor

In terms of model estimation, in 𝐔\boldsymbol{\mathrm{U}} (Equation (4)) each column corresponds to the parameters of a single recursive regression, whereas each entry of matrix 𝐋\boldsymbol{\mathrm{L}} (Equation (6)) corresponds to a different regression model.

Fortunately, [19] gave an alternative regression interpretation for matrix 𝐋\boldsymbol{\mathrm{L}}: recalling Equation (2), where the original variables 𝑿\boldsymbol{X} are written as a linear function of the error terms 𝓔\boldsymbol{\mathcal{E}}, and unfolding this matrix equation, we obtain

Xi=j=1i1lijj+i.X_{i}=\sum_{j=1}^{i-1}l_{ij}\mathcal{E}_{j}+\mathcal{E}_{i}. (9)

We thus obtain an analogue of Equation (1); but now, instead of regressing on the original variables, the regression is performed over the error terms of the ordered recursive regressions in Equation (1).

Remark 2.

The (i,j)(i,j) entry of matrix 𝐋\boldsymbol{\mathrm{L}} for j<ij<i, lijl_{ij}, has therefore two interpretations as a regression coefficient:

  1. It is the coefficient of the error j\mathcal{E}_{j} in the regression of XiX_{i} on 1,,i1\mathcal{E}_{1},\ldots,\mathcal{E}_{i-1}.

  2. It is the coefficient of the variable XjX_{j} in the regression of XiX_{i} on X1,,XjX_{1},\ldots,X_{j}.

Furthermore, from Equations (1) and (9) we have another dual interpretation for variable i\mathcal{E}_{i}: it is the error of the regression of XiX_{i} onto X1,,Xi1X_{1},\ldots,X_{i-1}, but also of XiX_{i} onto 1,,i1\mathcal{E}_{1},\ldots,\mathcal{E}_{i-1}.

3.2 A hidden variable model interpretation

The sparse Cholesky parametrization of the covariance matrix naturally models a hidden variable structure [28, 29, 30, 31] over ordered Gaussian observables (Equation (2)). If we interpret the error terms 𝓔\boldsymbol{\mathcal{E}} as latent signal sources, the model becomes a sort of restricted GBN. This interpretation is naturally associated with the first regression interpretation outlined in Remark 2.

The constraints for this GBN that represents our model are:

  • All arcs are from hidden variables 𝓔\boldsymbol{\mathcal{E}} to the observed ones 𝑿\boldsymbol{X}.

  • There is always an arc from i\mathcal{E}_{i} to XiX_{i}, for all i{1,,p}i\in\{1,\ldots,p\}.

  • For each i{1,,p}i\in\{1,\ldots,p\}, only variables 1,,i1\mathcal{E}_{1},\ldots,\mathcal{E}_{i-1} can have arcs to XiX_{i}.

Figure 1 represents one such restricted GBN compatible with our model.

Figure 1: Hidden variable model interpretation. We have omitted the index set notation for vertices and directly used variables for clarity.

The sparse Cholesky factor for Figure 1 would have the following lower triangular pattern

(0),\begin{pmatrix}&&\\ *&&\\ 0&*&\end{pmatrix}, (10)

where an asterisk means a non-zero entry. Observe that the zero value in entry (3,1)(3,1) corresponds to the missing edge from 1\mathcal{E}_{1} to X3X_{3} in Figure 1.
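For concreteness, the following base R sketch simulates data from the hidden variable model of Figure 1; the numerical values of the non-zero entries are arbitrary, and only the zero pattern of the factor matters.

  # Simulate from the hidden variable model of Figure 1 (base R).
  set.seed(3)
  N <- 10000
  L <- rbind(c(1.0,  0.0, 0),
             c(0.8,  1.0, 0),
             c(0.0, -0.5, 1))                    # zero in entry (3,1): no arc from eps_1 to X_3
  D <- diag(c(1, 0.5, 2))                        # variances of the latent signals
  E <- matrix(rnorm(3 * N), N, 3) %*% sqrt(D)    # hidden sources, rows ~ N(0, D)
  X <- E %*% t(L)                                # observed signals, Equation (2)
  round(cov(X) - L %*% D %*% t(L), 1)            # close to zero: Equation (5)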

3.3 A graphical model extension for unordered variables

In direct analogy with GBNs and Equation (8), we can define a graphical model which is parametrized by a Cholesky factorization, up to a permutation. Let 𝒢=(V,E)\mathcal{G}=(V,E) be an arbitrary given acyclic digraph, and denote with \prec its ancestral order. Let pr(i)={jV:ji}\operatorname{pr}_{\prec}(i)=\{j\in V:j\prec i\} be the predecessor set of a node iVi\in V.

Analogous to 𝒰(𝒢)\mathcal{U}(\mathcal{G}) in GBNs, denote as (𝒢)\mathcal{L}(\mathcal{G}) the set of matrices which have a zero pattern compatible with 𝒢\mathcal{G}. By this we mean that if (j,i)E(j,i)\not\in E, jij\neq i, then tij=0t_{ij}=0. We may define the Gaussian graphical model (comparable to Equation (8))

𝒩(𝟎,𝚺) s.t. 𝚺=𝐓𝐓t with 𝐓(𝒢).\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{\Sigma}})\text{ s.t. }\boldsymbol{\mathrm{\Sigma}}=\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t}\text{ with }\boldsymbol{\mathrm{T}}\in\mathcal{L}(\mathcal{G}). (11)
Remark 3.

As in the case of GBNs (Remark 1), the parameter matrix 𝐓\boldsymbol{\mathrm{T}} in Equation (11) is lower triangular, and thus coincides with the Cholesky factor of 𝚺\boldsymbol{\mathrm{\Sigma}}, when the variables are already ancestrally ordered. Otherwise, the sparse Cholesky factorization model applies when the ancestral order \prec is known, that is, denoting as κ(𝐌)\kappa(\boldsymbol{\mathrm{M}}) the permutation of rows and columns of 𝐌\boldsymbol{\mathrm{M}} following \prec, then κ(𝐓)\kappa(\boldsymbol{\mathrm{T}}) is the Cholesky factor of κ(𝚺)\kappa(\boldsymbol{\mathrm{\Sigma}}).

This extension has a more natural correspondence with the second regression interpretation of Remark 2, which holds after reordering rows and columns of 𝐓\boldsymbol{\mathrm{T}} to comply with \prec, as follows. First note that Equation (6) holds for an unordered version of 𝐋\boldsymbol{\mathrm{L}}, and thus we have that tijβij|pr(j)t_{ij}\propto\beta_{ij|\operatorname{pr}_{\prec}(j)} for jij\prec i. Therefore, we retrieve a sort of ordered Markov property (comparable to Equation (7))

XiXj|𝑿pr(j) for all ji and jpa(i).X_{i}\rotatebox[origin={c}]{90.0}{$\models$}X_{j}|\boldsymbol{X}_{\operatorname{pr}_{\prec}(j)}\text{ for all }j\prec i\text{ and }j\not\in\operatorname{pa}(i). (12)

A simple example of an arbitrary graph would be that in Figure 2. In this model, factor 𝐓\boldsymbol{\mathrm{T}} would be lower triangular after reordering its rows and columns following the ancestral ordering of the graph, 2132\prec 1\prec 3,

𝐓=(0000),κ(𝐓)=(0000).\boldsymbol{\mathrm{T}}=\begin{pmatrix}*&*&0\\ 0&*&0\\ *&0&*\end{pmatrix},\,\kappa(\boldsymbol{\mathrm{T}})=\begin{pmatrix}*&0&0\\ *&*&0\\ 0&*&*\end{pmatrix}.
Figure 2: Graph with ancestral order 2132\prec 1\prec 3. We have omitted the index set notation for vertices and directly used variables for clarity.
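The following base R sketch illustrates Remark 3 on this example: filling the starred positions above with arbitrary non-zero values and permuting rows and columns by the ancestral order 2≺1≺3 yields a lower triangular matrix.

  # Permuting T by the ancestral order of Figure 2 (base R); values are arbitrary.
  Tm <- rbind(c(1.0, 0.4, 0.0),
              c(0.0, 1.0, 0.0),
              c(0.7, 0.0, 1.0))      # zero pattern of the unordered factor T
  perm <- c(2, 1, 3)                 # ancestral order of the graph in Figure 2
  Tm[perm, perm]                     # kappa(T): lower triangular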

In the example of Figure 1, where variables already exhibit a natural order, the graph that would represent such interactions would be that in Figure 3, whose parameter matrix 𝐓\boldsymbol{\mathrm{T}} is already lower triangular (see also Remark 3).

Figure 3: Graph corresponding to the model in Figure 1. We have omitted the index set notation for vertices and directly used variables for clarity.

A further study of this graphical model extension is beyond the scope of this work, since we focus on naturally ordered variables arising from hidden signal sources, but we have discussed it here for completeness and as a direction for future work.

4 Model estimation

We will first review two existing regression-based estimators for this model, which can be found in [19], and then detail our proposed penalized matrix loss estimation method.

4.1 Existing work: Banding and lasso

Throughout this section denote as 𝒙i\boldsymbol{x}_{i} a sample of size NN corresponding to variable XiX_{i}, where 𝑿=(X1,,Xp)\boldsymbol{X}=(X_{1},\ldots,X_{p}) is assumed to follow the regression model of Equation (9).

The banding estimate for 𝐓\boldsymbol{\mathrm{T}} builds upon the corresponding one for 𝐋\boldsymbol{\mathrm{L}}. The idea is to estimate by standard least squares only the first kk sub-diagonals of 𝐋\boldsymbol{\mathrm{L}} and set the rest to zero. Specifically, if b(k)=max(1,ik)b(k)=\max(1,i-k) denotes the starting index, with respect to the band parameter kk, of the ii-th row vector 𝒍i=(lib(k),,lii1)t\boldsymbol{l}_{i}=(l_{ib(k)},\ldots,l_{ii-1})^{t} in matrix 𝐋\boldsymbol{\mathrm{L}}, then, letting 𝜺^b(k)=𝒙b(k)\hat{\boldsymbol{\varepsilon}}_{b(k)}=\boldsymbol{x}_{b(k)},

𝒍^i=argmin𝒍i𝒙i(𝜺^b(k)𝜺^i1)𝒍i22,\displaystyle\hat{\boldsymbol{l}}_{i}=\arg\min_{\boldsymbol{l}_{i}}\,\lVert\boldsymbol{x}_{i}-(\hat{\boldsymbol{\varepsilon}}_{b(k)}\cdots\hat{\boldsymbol{\varepsilon}}_{i-1})\boldsymbol{l}_{i}\rVert_{2}^{2}, (13)
𝜺^i=𝒙i(𝜺^b(k)𝜺^i1)𝒍^i,\displaystyle\hat{\boldsymbol{\varepsilon}}_{i}=\boldsymbol{x}_{i}-(\hat{\boldsymbol{\varepsilon}}_{b(k)}\cdots\hat{\boldsymbol{\varepsilon}}_{i-1})\hat{\boldsymbol{l}}_{i},
d^ii=1N𝜺^i22,\displaystyle\hat{d}_{ii}=\frac{1}{N}\lVert\hat{\boldsymbol{\varepsilon}}_{i}\rVert_{2}^{2},

where 2\lVert\cdot\rVert_{2} is the Euclidean or l2l_{2}-norm. In order to ensure positive definiteness of all matrices involved in the computations, kk must be smaller than min(N1,p)\min(N-1,p) [19]. Matrix 𝐓^=𝐋^𝐃^\widehat{\boldsymbol{\mathrm{T}}}=\widehat{\boldsymbol{\mathrm{L}}}\sqrt{\widehat{\boldsymbol{\mathrm{D}}}} inherits the band structure from 𝐋^\widehat{\boldsymbol{\mathrm{L}}}. The main drawback of this banding estimator is the restrictive zero pattern that it imposes. Note also that this method requires previous selection of the parameter kk.
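A minimal base R sketch of this banding scheme is given below; the function name band_cholesky is illustrative, and we assume that each residual vector is the one computed sequentially when the corresponding variable was the response (with the first residual equal to 𝒙1\boldsymbol{x}_{1}).

  # Banding estimator of Equation (13) (base R sketch).
  # X is an N x p data matrix with columns in their natural order, k the band width.
  band_cholesky <- function(X, k) {
    N <- nrow(X); p <- ncol(X)
    L <- diag(p); d <- numeric(p)
    eps <- matrix(0, N, p)
    eps[, 1] <- X[, 1]
    d[1] <- mean(X[, 1]^2)
    for (i in 2:p) {
      J <- max(1, i - k):(i - 1)                        # banded set of predecessors
      fit <- lm(X[, i] ~ eps[, J, drop = FALSE] - 1)    # least squares on past residuals
      L[i, J] <- coef(fit)
      eps[, i] <- resid(fit)
      d[i] <- mean(eps[, i]^2)
    }
    L %*% diag(sqrt(d))                                 # T = L sqrt(D), Equation (5)
  }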

An alternative to banding which gives more flexibility over the zero pattern is to use lasso regression over Equation (9),

𝒍^i=argmin𝒍i𝒙i(𝜺^1𝜺^i1)𝒍i22+λ𝒍i1,\displaystyle\hat{\boldsymbol{l}}_{i}=\arg\min_{\boldsymbol{l}_{i}}\,\lVert\boldsymbol{x}_{i}-(\hat{\boldsymbol{\varepsilon}}_{1}\cdots\hat{\boldsymbol{\varepsilon}}_{i-1})\boldsymbol{l}_{i}\rVert_{2}^{2}+\lambda\lVert\boldsymbol{l}_{i}\rVert_{1}, (14)
𝜺^i=𝒙i(𝜺^1𝜺^i1)𝒍^i,\displaystyle\hat{\boldsymbol{\varepsilon}}_{i}=\boldsymbol{x}_{i}-(\hat{\boldsymbol{\varepsilon}}_{1}\cdots\hat{\boldsymbol{\varepsilon}}_{i-1})\hat{\boldsymbol{l}}_{i},
d^ii=1N𝜺^i22,\displaystyle\hat{d}_{ii}=\frac{1}{N}\lVert\hat{\boldsymbol{\varepsilon}}_{i}\rVert_{2}^{2},

where this time 𝜺^1=𝒙1\hat{\boldsymbol{\varepsilon}}_{1}=\boldsymbol{x}_{1}, 𝒍i=(li1,,lii1)t\boldsymbol{l}_{i}=(l_{i1},\ldots,l_{ii-1})^{t}, 1\lVert\cdot\rVert_{1} is the l1l_{1}-norm and λ>0\lambda>0 is the penalisation parameter. Observe that this penalty could be replaced with any other sparsity-inducing penalty over 𝒍i\boldsymbol{l}_{i}; for example, [19] also discussed the nested lasso [32], whose sparsity pattern is preserved in the covariance matrix. We have discarded that penalty because we are primarily interested in an arbitrary zero pattern in 𝐓^\widehat{\boldsymbol{\mathrm{T}}}, and not in the pattern induced in 𝚺^\widehat{\boldsymbol{\mathrm{\Sigma}}}.
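A minimal sketch of the lasso-based scheme, assuming the glmnet package, is shown below; the function name lasso_cholesky is illustrative. Note that glmnet scales the squared loss by 1/(2N), so its penalisation parameter is on a different scale than the λ\lambda of Equation (14), and that the single-predictor case (i = 2) is fitted by ordinary least squares since glmnet requires at least two predictor columns.

  # Lasso estimator of Equation (14), assuming the glmnet package (sketch).
  library(glmnet)
  lasso_cholesky <- function(X, lambda) {
    N <- nrow(X); p <- ncol(X)
    L <- diag(p); d <- numeric(p)
    eps <- matrix(0, N, p)
    eps[, 1] <- X[, 1]; d[1] <- mean(X[, 1]^2)
    for (i in 2:p) {
      Z <- eps[, 1:(i - 1), drop = FALSE]
      if (i == 2) {
        L[i, 1] <- coef(lm(X[, i] ~ Z - 1))              # single predictor: plain least squares
      } else {
        fit <- glmnet(Z, X[, i], lambda = lambda, intercept = FALSE, standardize = FALSE)
        L[i, 1:(i - 1)] <- as.matrix(coef(fit))[-1, 1]   # drop the (zero) intercept
      }
      eps[, i] <- X[, i] - Z %*% L[i, 1:(i - 1)]
      d[i] <- mean(eps[, i]^2)
    }
    L %*% diag(sqrt(d))
  }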

4.2 Penalized gradient-based learning of the covariance sparse Cholesky factor

The above approaches are based on the regression interpretation of the sparse Cholesky factor model, Equation (9). By contrast, we propose to directly estimate all of the parameters, that is, the entries of matrix 𝐓\boldsymbol{\mathrm{T}}, by solving a single optimization problem. This allows, for example, recovering maximum likelihood estimates, and has the potential to be easily extended to the graphical model interpretation (Equation (11)) following an approach similar to [33].

Denote as 𝚺(𝐓)\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}) the parametrization of a covariance matrix 𝚺\boldsymbol{\mathrm{\Sigma}} with its Cholesky factor 𝐓\boldsymbol{\mathrm{T}} (Equation (5)). We propose to learn a sparse model for 𝐓\boldsymbol{\mathrm{T}} by solving the following optimization problem

argmin𝐓ϕ(𝚺(𝐓))+λ𝐓1,\arg\min_{\boldsymbol{\mathrm{T}}}\phi\left(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})\right)+\lambda\lVert\boldsymbol{\mathrm{T}}\rVert_{1}, (15)

where ϕ()\phi(\cdot) is a differentiable loss function over covariance matrices, λ>0\lambda>0 is the penalisation parameter, and 1\lVert\cdot\rVert_{1} is the l1l_{1}-norm for matrices, which induces sparsity on 𝐓\boldsymbol{\mathrm{T}} [34]. Note that, as in the regression case, the l1l_{1} penalty could be replaced with any other sparsity-inducing matrix norm. Solving Equation (15) can be done via proximal gradient algorithms, which have optimal convergence rates among first-order methods [34] and are tailored for a convex ϕ\phi but are also competitive in the non-convex case [21]. In this work we have used two such smooth loss functions: the negative Gaussian log-likelihood and the triangular Frobenius norm.

The negative Gaussian log-likelihood for a sample 𝒙1,,𝒙N\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}, when 𝝁\boldsymbol{\mu} is assumed to be zero, is proportional to [27]

ϕNLL(𝚺)=lndet(𝚺)+tr(𝚺1𝚺^),\phi_{NLL}(\boldsymbol{\mathrm{\Sigma}})=\ln\det(\boldsymbol{\mathrm{\Sigma}})+\operatorname{tr}(\boldsymbol{\mathrm{\Sigma}}^{-1}\hat{\boldsymbol{\mathrm{\Sigma}}}), (16)

where Σ^=1/Nn=1N𝒙n𝒙nt\hat{\Sigma}=1/N\sum_{n=1}^{N}\boldsymbol{x}_{n}\boldsymbol{x}_{n}^{t} is the maximum likelihood estimator for Σ\Sigma. On the other hand, the Frobenius norm loss that we also consider is

ϕFR(𝚺)=𝚺𝚺^F2=i=1pj=1p(σijσ^ij)2.\phi_{FR}(\boldsymbol{\mathrm{\Sigma}})=\lVert\boldsymbol{\mathrm{\Sigma}}-\hat{\boldsymbol{\mathrm{\Sigma}}}\rVert_{F}^{2}=\sum_{i=1}^{p}\sum_{j=1}^{p}(\sigma_{ij}-\hat{\sigma}_{ij})^{2}. (17)

Both ϕNLL\phi_{NLL} and ϕFR\phi_{FR} are smooth, and in general ϕNLL\phi_{NLL} renders the optimization problem of Equation (15) non-convex [35], whereas ϕFR\phi_{FR} is a convex function. In Appendix A.2 we provide the details on gradient computations for ϕNLL\phi_{NLL} and ϕFR\phi_{FR}, as well as the proximal algorithm pseudocode.
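For illustration, the following base R sketch implements a simplified version of the proximal gradient scheme for ϕFR\phi_{FR}: it uses a small fixed step size instead of the backtracking line search of Algorithm 1 in Appendix A.2, and it soft-thresholds only the strictly lower triangular entries so that the diagonal stays positive. Function and variable names are illustrative; this is not the covchol implementation.

  # Simplified proximal gradient iteration for the Frobenius loss (base R sketch).
  soft <- function(M, tau) sign(M) * pmax(abs(M) - tau, 0)   # soft-thresholding operator

  grad_frob_sketch <- function(S, lambda, step = 1e-3, iters = 500) {
    p <- ncol(S)
    Tm <- t(chol(S + 1e-6 * diag(p)))           # lower triangular starting point
    for (it in seq_len(iters)) {
      G <- 4 * (Tm %*% t(Tm) - S) %*% Tm        # gradient of phi_FR w.r.t. T (Appendix A.2)
      G[upper.tri(G)] <- 0                      # only lower triangular entries are free
      Tm <- Tm - step * G                       # gradient step
      low <- lower.tri(Tm)
      Tm[low] <- soft(Tm[low], step * lambda)   # proximal step on the off-diagonal entries
      diag(Tm) <- pmax(diag(Tm), 1e-8)          # crude safeguard for a positive diagonal
    }
    Tm
  }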

5 Experiments

In all of the experiments we compare the four estimation methods outlined in the previous section: banding 𝐓\boldsymbol{\mathrm{T}} (Equation (13)), lasso regressions (Equation (14)), and our two proposed penalized losses ϕNLL\phi_{NLL} (Equation (16)) and ϕFR\phi_{FR} (Equation (17)). These four methods will be denoted in the remainder as band\operatorname{band}, lasso\operatorname{lasso}, grad_lik\operatorname{grad\_lik} and grad_frob\operatorname{grad\_frob}, respectively. All data was standardized, and therefore in our proposed losses the sample correlation matrix was used instead of 𝚺^\widehat{\boldsymbol{\mathrm{\Sigma}}}. The implementation of our loss optimization methods grad_frob\operatorname{grad\_frob} and grad_lik\operatorname{grad\_lik} can be found in the R [36] package covchol (version under development: https://github.com/irenecrsn/covchol). The experiments described throughout this section can be reproduced following the instructions and using the code available at the repository https://github.com/irenecrsn/chol-inv.

5.1 Simulation

We have conducted simulation experiments in two different scenarios. First, because the work of Rothman et al. [19] is the most directly related to our model, we have replicated their simulation setting for completeness. Therein they select three fixed covariance matrices, with either a fixed known banded sparsity pattern or no zeros at all. By contrast, in the second experiment we explore how the methods perform when the sparsity pattern is arbitrary.

In both experiments we have measured two statistics in order to assess both model selection (how precisely the zero pattern is recovered) and estimation (how numerically close the retrieved matrix is to the original one). These metrics are evaluated over 𝚺\boldsymbol{\mathrm{\Sigma}} and 𝚺^\widehat{\boldsymbol{\mathrm{\Sigma}}} in the second experiment instead of 𝐓\boldsymbol{\mathrm{T}} and 𝐓^\widehat{\boldsymbol{\mathrm{T}}}, for better comparability with [19]. Specifically, we use the F1\operatorname{F1} score for evaluating the zero pattern,

F1(𝐓,𝐓^)=2TPR(𝐓,𝐓^)TDR(𝐓,𝐓^)TPR(𝐓,𝐓^)+TDR(𝐓,𝐓^),\operatorname{F1}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})=2\frac{\operatorname{TPR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})\operatorname{TDR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})}{\operatorname{TPR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})+\operatorname{TDR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})}, (18)

where TPR\operatorname{TPR} and TDR\operatorname{TDR} are the true positive and discovery rate, respectively,

TPR(𝐓,𝐓^)=|{tij0 and t^ij0}||{tij0}|,\displaystyle\operatorname{TPR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})=\frac{|\{t_{ij}\neq 0\text{ and }\hat{t}_{ij}\neq 0\}|}{|\{t_{ij}\neq 0\}|},
TDR(𝐓,𝐓^)=|{tij0 and t^ij0}||{t^ij0}|;\displaystyle\operatorname{TDR}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})=\frac{|\{t_{ij}\neq 0\text{ and }\hat{t}_{ij}\neq 0\}|}{|\{\hat{t}_{ij}\neq 0\}|};

and the induced matrix 11-norm,

NORM(𝐓,𝐓^)=𝐓𝐓^1=max1jpi=1p|t^ijtij|.\operatorname{NORM}(\boldsymbol{\mathrm{T}},\widehat{\boldsymbol{\mathrm{T}}})=\lVert\boldsymbol{\mathrm{T}}-\hat{\boldsymbol{\mathrm{T}}}\rVert_{1}=\max_{1\leq j\leq p}\sum_{i=1}^{p}\lvert\hat{t}_{ij}-t_{ij}\rvert.
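Both metrics are straightforward to compute; a base R sketch is given below (function names are illustrative), with the F1\operatorname{F1} score computed over the strictly lower triangular zero patterns and the norm over the full matrices.

  # Evaluation metrics: F1 score of Equation (18) and induced 1-norm (base R sketch).
  f1_score <- function(T_true, T_est) {
    truth <- T_true[lower.tri(T_true)] != 0
    est   <- T_est[lower.tri(T_est)] != 0
    tpr <- sum(truth & est) / sum(truth)    # true positive rate
    tdr <- sum(truth & est) / sum(est)      # true discovery rate
    2 * tpr * tdr / (tpr + tdr)
  }
  norm_1 <- function(T_true, T_est) max(colSums(abs(T_est - T_true)))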

5.1.1 Fixed covariance matrices

The fixed covariance matrices used in the simulations by Rothman et al. [19] are:

  • The autoregressive model of order 11, where the true covariance matrix 𝚺1\boldsymbol{\mathrm{\Sigma}}_{1} has entries σij=ρ|ij|\sigma_{ij}=\rho^{|i-j|}, with ρ=0.7\rho=0.7.

  • The 44-banded correlation matrix 𝚺2\boldsymbol{\mathrm{\Sigma}}_{2} with entries σij=0.4𝕀(|ij|=1)+0.2𝕀(2|ij|3)+0.1𝕀(|ij|=4)\sigma_{ij}=0.4\mathbb{I}(|i-j|=1)+0.2\mathbb{I}(2\leq|i-j|\leq 3)+0.1\mathbb{I}(|i-j|=4) for iji\neq j, 𝕀\mathbb{I} being the set indicator function.

  • The dense correlation matrix 𝚺3\boldsymbol{\mathrm{\Sigma}}_{3} with 0.50.5 in all of its entries except for the diagonal.

Similarly to [19], we use values for the matrix dimension pp ranging from 3030 to 500500, and draw from the respective distribution 𝒩(𝟎,𝚺i)\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{\Sigma}}_{i}), i{1,2,3}i\in\{1,2,3\}, a sample of size 200200, which allows us to cover both the p>Np>N and p<Np<N scenarios. With this experiment we measure how sparsity-inducing methods for learning 𝐓\boldsymbol{\mathrm{T}} behave in scenarios which are not especially suited for them, except for band\operatorname{band} and 𝚺2\boldsymbol{\mathrm{\Sigma}}_{2}.
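For reference, the three matrices (and a sample from the first one) can be constructed in base R as follows; this is an illustrative sketch, not the exact experiment code.

  # The three covariance matrices of this experiment, and a sample of size 200.
  p <- 30
  idx <- abs(outer(1:p, 1:p, "-"))                     # |i - j|
  Sigma1 <- 0.7^idx                                    # AR(1): sigma_ij = rho^|i-j|, rho = 0.7
  Sigma2 <- 0.4 * (idx == 1) + 0.2 * (idx >= 2 & idx <= 3) +
            0.1 * (idx == 4) + diag(p)                 # 4-banded correlation matrix
  Sigma3 <- matrix(0.5, p, p); diag(Sigma3) <- 1       # dense equicorrelation matrix
  X <- matrix(rnorm(200 * p), 200, p) %*% chol(Sigma1) # rows ~ N(0, Sigma1)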

Figure 4 shows the results of the aforementioned simulation scenario. The norms are shown in logarithmic scale for a better comparison between the methods, since there were significant disparities.

Figure 4: Results of the simulation experiment set out in [19]. Metric NORM\operatorname{NORM} is in logarithmic scale.

𝚺1\boldsymbol{\mathrm{\Sigma}}_{1} and 𝚺3\boldsymbol{\mathrm{\Sigma}}_{3} are both dense matrices, with 𝚺1\boldsymbol{\mathrm{\Sigma}}_{1} having entries that decay when moving away from the diagonal. We observe that the nonexistent sparsity pattern is best approximated by grad_lik\operatorname{grad\_lik} and lasso\operatorname{lasso}, but, interestingly, grad_frob\operatorname{grad\_frob} and band\operatorname{band} achieve competitive norm results, sometimes even outperforming the rest. Matrix 𝚺2\boldsymbol{\mathrm{\Sigma}}_{2} is banded; therefore, as expected, band\operatorname{band} achieves both the highest F1\operatorname{F1} score and the lowest norm difference.

5.1.2 Arbitrary sparsity pattern in the Cholesky factor 𝐓\boldsymbol{\mathrm{T}}

In this experiment the sparse Cholesky factor 𝐓\boldsymbol{\mathrm{T}} is simulated using essentially the method of [37, Algorithm 3], with a random acyclic digraph representing the zero pattern, that is, the latent structure (see Figure 1). Observe that in general this does not yield a uniformly sampled Cholesky factor, but it is more flexible than the standard diagonal dominance procedure; see [37] for further discussion on this issue. We generate three Cholesky factors 𝐓i\boldsymbol{\mathrm{T}}_{i} with a proportion of i/pi/p non-zero entries (the density or average edge number of the corresponding acyclic digraph), where i{1,2,3}i\in\{1,2,3\}. Then we draw a sample of size 200200 from 𝒩(𝟎,𝐓i𝐓it)\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{T}}_{i}\boldsymbol{\mathrm{T}}_{i}^{t}) for pp ranging between 3030 and 500500, as in the previous experiment.

Figure 5 depicts the results.

Figure 5: Results of the simulation experiment for an arbitrary sparsity pattern in 𝐓\boldsymbol{\mathrm{T}}. Metric NORM\operatorname{NORM} is in logarithmic scale. Density indicates the average proportion of lower triangular non-zero entries in the simulated 𝐓\boldsymbol{\mathrm{T}} Cholesky factors.

Note that as the density decreases, the F1\operatorname{F1} score and matrix norm results slightly worsen, but in general the behaviour of the methods is maintained. The band\operatorname{band} estimator exhibits a performance similar to that in the previous experiment: although it achieves a small F1\operatorname{F1} score, it has a relatively small matrix norm difference. This behaviour is shared in this case with grad_lik\operatorname{grad\_lik}, which has in general poor performance. However, the worst performing method is lasso\operatorname{lasso}, which is neither able to recover the sparsity pattern nor numerically close to the original Cholesky factor (it has a markedly high value for the norm difference). Conversely, method grad_frob\operatorname{grad\_frob} has the best performance, with a markedly higher F1\operatorname{F1} score than the rest and competitive or best norm difference results.

5.2 Real data

In this section we have selected two data sets from the UCI machine learning repository [38] where a natural order arises among the variables, and which are therefore suitable for our sparse covariance Cholesky factorization model. Both of them are labelled with a class variable; therefore, after estimating the respective Cholesky factors with each method, we assess classification performance.

For classifying a sample 𝒙\boldsymbol{x} we use quadratic discriminant analysis [19], where 𝒙\boldsymbol{x} is classified in the class value cc that maximizes

lnf^(𝒙,c)\displaystyle\ln\hat{f}(\boldsymbol{x},c) =lnf^(𝒙|c)+lnf^(c),\displaystyle=\ln\hat{f}(\boldsymbol{x}|c)+\ln\hat{f}(c),

where f^(c)\hat{f}(c) is the proportion of observations for class cc and we have expressed f^(𝒙|c)\hat{f}(\boldsymbol{x}|c) in terms of 𝐓\boldsymbol{\mathrm{T}} instead of 𝚺\boldsymbol{\mathrm{\Sigma}},

\ln\hat{f}(\boldsymbol{x}|c)\propto-\frac{1}{2}\ln\det(\widehat{\boldsymbol{\mathrm{\Sigma}}}_{c})-\frac{1}{2}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c})^{t}\widehat{\boldsymbol{\mathrm{\Sigma}}}_{c}^{-1}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c}) (19)
=-\ln\det(\widehat{\boldsymbol{\mathrm{T}}}_{c})-\frac{1}{2}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c})^{t}\widehat{\boldsymbol{\mathrm{T}}}_{c}^{-t}\widehat{\boldsymbol{\mathrm{T}}}_{c}^{-1}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c})
=-\sum_{i=1}^{p}\ln\hat{t}_{c,ii}-\frac{1}{2}(\widehat{\boldsymbol{\mathrm{T}}}_{c}^{-1}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c}))^{t}\,\widehat{\boldsymbol{\mathrm{T}}}_{c}^{-1}(\boldsymbol{x}-\hat{\boldsymbol{\mu}}_{c}),

with 𝝁^c\hat{\boldsymbol{\mu}}_{c}, 𝚺^c\widehat{\boldsymbol{\mathrm{\Sigma}}}_{c} and 𝐓^c\widehat{\boldsymbol{\mathrm{T}}}_{c} the respective estimates from training samples belonging to class cc.
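A base R sketch of this discriminant score, using a triangular solve to avoid explicitly inverting 𝐓^c\widehat{\boldsymbol{\mathrm{T}}}_{c}, could look as follows (the function name is illustrative); a sample 𝒙\boldsymbol{x} is then assigned to the class with the largest score.

  # Discriminant score of Equation (19), written in terms of the Cholesky factor.
  qda_score <- function(x, mu_c, T_c, prior_c) {
    z <- forwardsolve(T_c, x - mu_c)          # z = T_c^{-1} (x - mu_c)
    -sum(log(diag(T_c))) - 0.5 * sum(z^2) + log(prior_c)
  }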

Finally, for evaluating classification performance we have used:

  • The F1\operatorname{F1} score, already defined in Equation (18) in terms of TDR\operatorname{TDR} and TPR\operatorname{TPR}, but adapted to classification instead of matrix entries.

  • The true negative rate, TNR\operatorname{TNR}, which is the proportion of observations that have been correctly not classified as class cc; we include it since it is not captured by the F1\operatorname{F1} score.

  • The accuracy, ACC\operatorname{ACC}, which measures the proportion of observations that have been correctly assigned a class. Observe that this last metric, unlike the other two, is not class-dependent, but instead global.

5.2.1 Sonar: Mine vs. Rocks

The first real data set we explore is the Connectionist Bench (Sonar, Mines vs. Rocks) data set, which contains numeric observations from a sonar signal bounced off a metal cylinder (mine) and off rocks. It contains 6060 variables and 208208 observations. Each variable corresponds to the energy within a certain frequency band, integrated over a period of time, with the bands in increasing frequency order. Each observation represents a different beam angle for the same object under detection. For this data set the objective is to classify a sample as rock or mine. This data set was also analysed in [19], but without expressing Equation (19) in terms of 𝐓\boldsymbol{\mathrm{T}} and using only the band\operatorname{band} method for 𝐓\boldsymbol{\mathrm{T}}.

As a first exploratory step, we have applied each of the methods for learning the Cholesky factor 𝐓\boldsymbol{\mathrm{T}} to all the instances labelled as MM (mines), and RR (rocks), which we show as a heatmap in Figure 6.

Figure 6: Heatmaps of the Cholesky factors of rock and mine samples. M: Mines; R: Rocks.

The Cholesky factors for mines retrieved by grad_lik\operatorname{grad\_lik} and lasso\operatorname{lasso} look fairly similar, whereas the one for rocks that lasso\operatorname{lasso} estimates is nearly diagonal. Bands can be clearly observed in the heatmaps for band\operatorname{band}, and all methods impose zero values for variables near to or higher than 5050, which could be motivated by the problem characteristics and hint at high sonar frequencies being nearly noiseless. This latter behaviour is inherited by the covariance matrices, whose heatmaps are shown in Appendix A.3, Figure 9. The entries in the Cholesky factor estimated by grad_frob\operatorname{grad\_frob} are the most extreme: most of them are zero, and those which are not take the highest and lowest values among all the estimates recovered. Despite the outlined differences among the Cholesky factors, the induced covariance matrices exhibit rather similar heatmaps and eigenvalues (Figures 9 and 10 in Appendix A.3).

For the quadratic discriminant analysis we have used leave-one-out cross-validation, since the sample size was sufficiently small to allow it. Table 1 contains the results thus obtained. We see that lasso\operatorname{lasso} is the method that performs worst overall. Conversely, band\operatorname{band} is arguably the best for this problem, except for the TNR\operatorname{TNR} of rock samples, which is highest for grad_frob\operatorname{grad\_frob}. However, observing the rest of the statistics for grad_frob\operatorname{grad\_frob}, it can be deduced that this method over-classifies samples as mines: it has the lowest TNR\operatorname{TNR} for them. On the other hand, grad_lik\operatorname{grad\_lik} performs competitively for this problem, but is in general outperformed or matched by band\operatorname{band}. Since the sonar behaviour hints at a band structure for the covariance (frequency bands being mostly related to those close to them), and therefore for its Cholesky factor, the good performance of band\operatorname{band} could be expected.

band\operatorname{band} grad_frob\operatorname{grad\_frob} grad_lik\operatorname{grad\_lik} lasso\operatorname{lasso}
TNR\operatorname{TNR} (M) 0.78 0.08 0.78 0.62
F1\operatorname{F1} (M) 0.8 0.7 0.55 0.33
TNR\operatorname{TNR} (R) 0.79 0.97 0.45 0.26
F1\operatorname{F1} (R) 0.78 0.15 0.65 0.5
ACC\operatorname{ACC} 0.79 0.56 0.61 0.43
Table 1: Statistics for the sonar problem. M: mines; R: rocks.

5.2.2 Wall-Following Robot Navigation

The other real data set we use is the Wall-Following Robot Navigation one. Here a robot moves in a room following the wall clockwise. It contains 54565456 observations and 2424 variables. Each variable corresponds to the reading of an ultrasound sensor; the sensors are arranged circularly over the robot’s body, and the increasing order reflects the reference angle at which each sensor is located. Since the robot is moving clockwise, the classification task is between four possible class values: Move-Forward, Sharp-Right-Turn, Slight-Left-Turn or Slight-Right-Turn.

As in the previous problem, we obtain the Cholesky factors for each of the movements, depicted in Figure 7. We notice that grad_frob\operatorname{grad\_frob} outputs matrices similar (except for the Slight-Left-Turn movement) to those of the other three methods, which means that the extreme behaviour we observed in the sonar experiment was problem-related. By contrast, the Cholesky factor for Slight-Left-Turn is nearly diagonal. The other matrices are rather similar among the methods, with band\operatorname{band} notably choosing in general a high banding parameter kk (few to no bands). Here we have a structure similar to that of the sonar problem: we observe, for all the movements except Slight-Left-Turn, that most entries close to the diagonal are positive, whereas distant ones are frequently negative. Regarding Slight-Left-Turn, these matrices are the sparsest and have near-zero values on the diagonal. Since the robot is moving clockwise, this movement is related to obstacles; this could hint that the sensor readings are correctly identifying them.

Figure 7: Heatmaps of the Cholesky factor of the wall-following robot navigation samples.

In this problem we have a larger sample size, and therefore we split the data into train and test sets, with half of the samples in each. The classification results are shown in Table 2. We observe that all of the methods perform arguably well; in fact, they achieve nearly identical accuracy. It is noticeable how competitive lasso\operatorname{lasso} and grad_lik\operatorname{grad\_lik} are, given that they performed much worse in the sonar problem. We also notice that arguably the best results are obtained for the Slight-Left-Turn movement, which confirms our previous intuition from the heatmaps about the sensors correctly identifying obstacles. The worst performance over all methods is for the Slight-Right-Turn movement, but the difference is not noteworthy when compared with the rest (except for Slight-Left-Turn).

band\operatorname{band} grad_frob\operatorname{grad\_frob} grad_lik\operatorname{grad\_lik} lasso\operatorname{lasso}
TNR\operatorname{TNR} (MF) 0.87 0.86 0.85 0.84
F1\operatorname{F1} (MF) 0.64 0.64 0.61 0.62
TNR\operatorname{TNR} (SHR) 0.89 0.88 0.87 0.85
F1\operatorname{F1} (SHR) 0.72 0.72 0.73 0.73
TNR\operatorname{TNR} (SLL) 0.96 0.95 0.98 0.98
F1\operatorname{F1} (SLL) 0.67 0.65 0.72 0.7
TNR\operatorname{TNR} (SLR) 0.82 0.83 0.81 0.84
F1\operatorname{F1} (SLR) 0.56 0.57 0.55 0.57
ACC\operatorname{ACC} 0.66 0.66 0.65 0.66
Table 2: Statistics for the robot problem. MF: Move-Forward; SHR: Sharp-Right-Turn; SLL: Slight-Left-Turn; SLR: Slight-Right-Turn.

5.3 Discussion

We can draw several conclusions from both the simulated and real experiments. First, we report in Figure 8 the execution time for each method, where it can be seen that grad_lik\operatorname{grad\_lik} is the slowest one and band\operatorname{band} is the fastest.

Figure 8: Logarithm of the execution time (in seconds) for each of the methods under evaluation. Density indicates the average proportion of lower triangular non-zero entries in the simulated 𝐓\boldsymbol{\mathrm{T}} Cholesky factors.

Whenever there is a clear dependence between variables that are close in the ordering, such as in the sonar example, the band\operatorname{band} method could be preferred, because it is the one that most naturally approximates the structure induced in the Cholesky factor.

Our newly proposed method grad_frob\operatorname{grad\_frob} has proven to be competitive both in execution time and in recovery results: when the interest lies in model selection, that is, how accurately zeros in the Cholesky factor are estimated, it yields the best results. Conversely, grad_lik\operatorname{grad\_lik} has proven to be the most robust: in simulations it achieved reasonable performance even when the true covariance matrix was dense, and it also performed competitively in the sonar example, which, as we discussed, was mostly suited for band\operatorname{band}.

Finally, lasso\operatorname{lasso} has achieved overall poor results, except for the wall-following robot navigation data set. In particular, in simulations it failed to correctly recover the zero pattern in the Cholesky factor and was numerically the farthest away from the true matrix. Despite this, it is the second fastest of the four methods, so when model selection or robustness are not a concern it is a good alternative to band\operatorname{band}, since it provides more flexibility over the zero pattern in the Cholesky factor.

6 Conclusions and future work

In this paper we have proposed a sparse model for the Cholesky factor of the covariance matrix of a multivariate Gaussian distribution. Few other works in the literature have previously addressed this problem, and we expand on them in several ways. We have formalised the extension of an arbitrary zero pattern in the Cholesky factorization to a Gaussian graphical model. We have proposed a novel estimation method that directly optimizes a penalized matrix loss, and we have compared it with the already existing regression-based approaches in different simulation and real experiments. We have finally discussed which estimation method is preferable depending on the problem characteristics, and argued that our estimation proposal is preferable to the existing ones in many scenarios.

As future research, the most direct and interesting derivative work would be to further analyse, both theoretically and empirically, the Gaussian graphical model extension to unordered variables of the sparse covariance Cholesky factorization model. Also in this direction, its relationship with the already established covariance graph [14, 15, 16, 17], an undirected graphical model over the covariance matrix, could yield interesting results. Regarding our novel estimation method, we could explore other alternatives to solve the optimization problems that arise from the two losses that we compute in Appendix A.2. For large data sets, the use of deep learning models to find the hidden Cholesky factor could be also explored.

Acknowledgements

This work has been partially supported by the Spanish Ministerio de Ciencia, Innovación y Universidades through the PID2019-109247GB-I00 project. Irene Córdoba has been supported by the predoctoral grant FPU15/03797 from the Spanish Ministerio de Ciencia, Innovación y Universidades. Gherardo Varando has been supported by a research grant (13358) from VILLUM FONDEN. We thank Adam P. Rothman for helpful discussion and code sharing about his work.

Appendix A Appendices

A.1 A transformation between the Cholesky factorizations of 𝚺\boldsymbol{\mathrm{\Sigma}} and 𝛀\boldsymbol{\mathrm{\Omega}}

The following result shows how to obtain the elements of 𝐋\boldsymbol{\mathrm{L}} from those in 𝐔\boldsymbol{\mathrm{U}}. Recall that, since 𝐃\boldsymbol{\mathrm{D}} is diagonal, the (i,i)(i,i) entry of 𝐃1\boldsymbol{\mathrm{D}}^{-1} is dii1d_{ii}^{-1}.

Proposition.

For each i{1,,p}i\in\{1,\ldots,p\} with j{1,,i1}j\in\{1,\ldots,i-1\} the following identity holds:

βij|1,,j=βij|1,,i1+k=j+1i1βik|1,,i1βkj|1,,j.\beta_{ij|1,\ldots,j}=\beta_{ij|1,\ldots,i-1}+\sum_{k=j+1}^{i-1}\beta_{ik|1,\ldots,i-1}\beta_{kj|1,\ldots,j}.
Proof.

First we recall that 𝐋=𝐔t\boldsymbol{\mathrm{L}}=\boldsymbol{\mathrm{U}}^{-t} and 𝐔𝐔1=Ip\boldsymbol{\mathrm{U}}\boldsymbol{\mathrm{U}}^{-1}={I}_{p}. Thus, for each i{2,,p}i\in\{2,\ldots,p\} and each j<ij<i, if we multiply the ii-th row in 𝐔\boldsymbol{\mathrm{U}} with the jj-th column of 𝐔1\boldsymbol{\mathrm{U}}^{-1}, this must be equal to 0. Specifically, we obtain the following equation,

0=\displaystyle 0= βij|1,,i1\displaystyle-\beta_{ij|1,\ldots,i-1}
βij+1|1,,i1βj+1j|1,,j\displaystyle-\beta_{ij+1|1,\ldots,i-1}\beta_{j+1j|1,\ldots,j}
\displaystyle\vdots
βii1|1,,i1βi1j|1,,j\displaystyle-\beta_{ii-1|1,\ldots,i-1}\beta_{i-1j|1,\ldots,j}
+βij|1,,j.\displaystyle+\beta_{ij|1,\ldots,j}.

Therefore,

0=βij|1,,jk=j+1i1βik|1,,i1βkj|1,,jβij|1,,i1,0=\beta_{ij|1,\ldots,j}-\sum_{k=j+1}^{i-1}\beta_{ik|1,\ldots,i-1}\beta_{kj|1,\ldots,j}-\beta_{ij|1,\ldots,i-1},

which yields the desired result. ∎

A.2 Proximal gradient algorithm and gradient computations for ϕNLL\phi_{NLL} and ϕFR\phi_{FR}

We will show how to obtain a simplified expression for the gradient of ϕ(𝚺(𝐓))\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})) with respect to 𝐓\boldsymbol{\mathrm{T}} as a function of the one with respect to 𝚺\boldsymbol{\mathrm{\Sigma}}. We will denote these gradients as 𝐓ϕ\nabla_{\boldsymbol{\mathrm{T}}}\phi and 𝚺ϕ\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi, respectively.

Proposition.

For any differentiable loss function ϕ(𝚺(𝐓))\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})),

𝐓ϕ=2𝚺ϕ𝐓.\nabla_{\boldsymbol{\mathrm{T}}}\phi=2\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\boldsymbol{\mathrm{T}}. (20)
Proof.

This proof follows some ideas from [21, Proposition 2.1]. By matrix calculus [39], we have the following gradient expression for j<ij<i,

ϕ(𝚺(𝐓))tij=tr(𝚺ϕ𝚺(𝐓)tij),\frac{\partial\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}))}{\partial t_{ij}}=tr\left(\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\frac{\partial\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})}{\partial t_{ij}}\right), (21)

where we have used that 𝚺ϕ\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi is symmetric. Furthermore, note that [39]

\frac{\partial\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})}{\partial t_{ij}}=\frac{\partial\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t}}{\partial t_{ij}}=\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}+\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{E}}^{ji},

where 𝐄ij\boldsymbol{\mathrm{E}}^{ij} (𝐄ji\boldsymbol{\mathrm{E}}^{ji}) has its (i,j)(i,j) ((j,i)(j,i)) entry equal to one and zero elsewhere. Then from Equation (21) we have

\frac{\partial\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}))}{\partial t_{ij}}=\operatorname{tr}\left(\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\,(\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}+\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{E}}^{ji})\right)
=\operatorname{tr}\left(\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\,\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}\right)+\operatorname{tr}\left(\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\,\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{E}}^{ji}\right)
=\operatorname{tr}\left(\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\right)+\operatorname{tr}\left(\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\right)
=2\operatorname{tr}\left(\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{T}}^{t}\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\right),

where in the third equality the first trace has been rearranged cyclically and the second one transposed, using again that \nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi is symmetric. Since \operatorname{tr}(\boldsymbol{\mathrm{E}}^{ij}\boldsymbol{\mathrm{A}})=a_{ji} for any matrix \boldsymbol{\mathrm{A}}, we obtain \partial\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}))/\partial t_{ij}=2(\boldsymbol{\mathrm{T}}^{t}\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi)_{ji}=2(\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi\,\boldsymbol{\mathrm{T}})_{ij}, which is the desired result. ∎

The above proposition implies that once a loss function ϕ(𝚺(𝐓))\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})) is fixed, it is only necessary to compute 𝚺ϕ\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi in order to obtain 𝐓ϕ\nabla_{\boldsymbol{\mathrm{T}}}\phi. We would then use it for the proximal gradient method (Algorithm 1).

Algorithm 1 Proximal gradient algorithm for minimization of the l_{1}-penalized loss
Require: \phi differentiable loss function over positive definite matrices;  \boldsymbol{\mathrm{T}}\in\mathbb{R}^{p\times p} lower triangular starting point;  maximum number of iterations M\in\mathbb{N};  tolerance \varepsilon>0;  penalty \lambda>0;  backtracking factor \alpha\in(0,1)
1: f=\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}})),  g=\lambda||\boldsymbol{\mathrm{T}}||_{1},  k=0
2: repeat
3:   \boldsymbol{\mathrm{D}}=\nabla_{\boldsymbol{\mathrm{T}}}\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}))
4:   s=1
5:   loop
6:     \boldsymbol{\mathrm{T}}^{\prime}=\boldsymbol{\mathrm{T}}-s\boldsymbol{\mathrm{D}}
7:     soft threshold \boldsymbol{\mathrm{T}}^{\prime} at level s\lambda
8:     f^{\prime}=\phi(\boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}^{\prime})),  g^{\prime}=\lambda||\boldsymbol{\mathrm{T}}^{\prime}||_{1}
9:     \nu=\operatorname{tr}\left((\boldsymbol{\mathrm{T}}^{\prime}-\boldsymbol{\mathrm{T}})^{t}\boldsymbol{\mathrm{D}}\right)+\frac{1}{2s}||\boldsymbol{\mathrm{T}}-\boldsymbol{\mathrm{T}}^{\prime}||_{F}^{2}
10:    if f^{\prime}+g^{\prime}\leq f+g and f^{\prime}\leq f+\nu then
11:      break
12:    else
13:      s=\alpha s
14:    end if
15:   end loop
16:   \delta=f+g-f^{\prime}-g^{\prime}
17:   \boldsymbol{\mathrm{T}}=\boldsymbol{\mathrm{T}}^{\prime},  f=f^{\prime},  g=g^{\prime},  k=k+1
18: until k>M or \delta<\varepsilon
Output: \boldsymbol{\mathrm{T}}
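For concreteness, the following is a compact Python/numpy sketch of Algorithm 1. It is not the paper's implementation: the function names (proximal_gradient, soft_threshold) and default parameter values are ours, phi and grad_T_phi are assumed to be callables returning the loss value and its lower triangular gradient, and practical safeguards (e.g. keeping the diagonal of \boldsymbol{\mathrm{T}} away from zero when \phi requires \boldsymbol{\mathrm{\Sigma}}(\boldsymbol{\mathrm{T}}) to be invertible) are omitted.

```python
import numpy as np

def soft_threshold(A, level):
    """Entry-wise soft thresholding, the proximal operator of the l1 norm."""
    return np.sign(A) * np.maximum(np.abs(A) - level, 0.0)

def proximal_gradient(T, phi, grad_T_phi, lam, max_iter=100, tol=1e-6, alpha=0.5):
    """Sketch of Algorithm 1: minimize phi(Sigma(T)) + lam * ||T||_1 over lower
    triangular T, with a backtracking line search (step shrink factor alpha)."""
    f, g = phi(T), lam * np.abs(T).sum()
    for _ in range(max_iter):
        D = grad_T_phi(T)                      # gradient of phi with respect to T
        s = 1.0
        while True:
            T_new = np.tril(soft_threshold(T - s * D, s * lam))
            f_new, g_new = phi(T_new), lam * np.abs(T_new).sum()
            # Quadratic upper bound used in the sufficient-decrease test.
            nu = np.sum((T_new - T) * D) + np.sum((T - T_new) ** 2) / (2.0 * s)
            if f_new + g_new <= f + g and f_new <= f + nu:
                break
            s *= alpha                         # shrink the step size and retry
        delta = f + g - f_new - g_new
        T, f, g = T_new, f_new, g_new
        if delta < tol:
            break
    return T
```

The two acceptance conditions in the inner loop mirror those of Algorithm 1: the first enforces a monotone decrease of the penalized objective, while the second is the standard sufficient-decrease test of backtracking proximal gradient methods.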

We can thus easily obtain the gradients of \phi_{NLL} and \phi_{FR}, the two losses considered in this work. Standard matrix calculus [39] gives \nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi_{NLL}=\boldsymbol{\mathrm{\Sigma}}^{-1}-\boldsymbol{\mathrm{\Sigma}}^{-1}\hat{\boldsymbol{\mathrm{\Sigma}}}\boldsymbol{\mathrm{\Sigma}}^{-1}. Therefore we have

𝐓ϕNLL\displaystyle\nabla_{\boldsymbol{\mathrm{T}}}\phi_{NLL} =2𝚺ϕNLL𝐓\displaystyle=2\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi_{NLL}\boldsymbol{\mathrm{T}}
=2𝚺1(𝐓)(𝐈p𝚺^𝚺1(𝐓))𝐓\displaystyle=2\boldsymbol{\mathrm{\Sigma}}^{-1}(\boldsymbol{\mathrm{T}})(\boldsymbol{\mathrm{I}}_{p}-\hat{\boldsymbol{\mathrm{\Sigma}}}\boldsymbol{\mathrm{\Sigma}}^{-1}(\boldsymbol{\mathrm{T}}))\boldsymbol{\mathrm{T}}
=2𝐓t(𝐈p𝐓1𝚺^𝐓t).\displaystyle=2\boldsymbol{\mathrm{T}}^{-t}(\boldsymbol{\mathrm{I}}_{p}-\boldsymbol{\mathrm{T}}^{-1}\hat{\boldsymbol{\mathrm{\Sigma}}}\boldsymbol{\mathrm{T}}^{-t}).

Similarly, from [39] we have that \nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi_{FR}(\boldsymbol{\mathrm{\Sigma}})=2(\boldsymbol{\mathrm{\Sigma}}-\hat{\boldsymbol{\mathrm{\Sigma}}}), and thus \nabla_{\boldsymbol{\mathrm{T}}}\phi_{FR}=2\nabla_{\boldsymbol{\mathrm{\Sigma}}}\phi_{FR}\boldsymbol{\mathrm{T}}=4(\boldsymbol{\mathrm{T}}\boldsymbol{\mathrm{T}}^{t}-\hat{\boldsymbol{\mathrm{\Sigma}}})\boldsymbol{\mathrm{T}}.
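In code, and continuing the numpy sketch given after Algorithm 1 (again, the function names are ours and not the paper's), these two closed forms could be implemented as follows; the final np.tril keeps only the free entries of the lower triangular factor.

```python
import numpy as np

def grad_T_nll(T, Sigma_hat):
    """Gradient of the Gaussian negative log-likelihood loss with respect to T,
    via the closed form 2 T^{-t} (I_p - T^{-1} Sigma_hat T^{-t}) derived above.
    (The explicit inverse is for clarity; triangular solves would be cheaper.)"""
    p = T.shape[0]
    T_inv = np.linalg.inv(T)
    G = 2.0 * T_inv.T @ (np.eye(p) - T_inv @ Sigma_hat @ T_inv.T)
    return np.tril(G)

def grad_T_fr(T, Sigma_hat):
    """Gradient of the Frobenius loss ||T T^t - Sigma_hat||_F^2 with respect to T,
    that is, 2 * (gradient of the loss in Sigma) * T = 4 (T T^t - Sigma_hat) T."""
    return np.tril(4.0 * (T @ T.T - Sigma_hat) @ T)
```

Either function can then be passed as the grad_T_phi argument of the proximal_gradient sketch above.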

A.3 More figures for real data experiments

Figure 9: Heatmaps of the covariance matrix of rock and mine samples. M: Mines; R: Rocks.
Figure 10: Scree plot of the covariance matrices of rock and mine samples. M: Mines; R: Rocks.
Figure 11: Heatmaps of the covariance matrix of the wall-following robot navigation samples.
Figure 12: Scree plot of the covariance matrices of the wall-following robot navigation samples.

References

  • [1] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
  • [2] M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright, Eds., Handbook of Graphical Models. CRC Press, 2018.
  • [3] T. Li, C. Qian, E. Levina, and J. Zhu, “High-dimensional Gaussian graphical models on network-linked data.” Journal of Machine Learning Research, vol. 21, no. 74, pp. 1–45, 2020.
  • [4] A. P. Dempster, “Covariance selection,” Biometrics, vol. 28, no. 1, pp. 157–175, 1972.
  • [5] M. Yuan and Y. Lin, “Model selection and estimation in the Gaussian graphical model,” Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
  • [6] S. Wright, “The method of path coefficients,” Ann. Math. Stat., vol. 5, no. 3, pp. 161–215, 1934.
  • [7] N. Wermuth, “Linear recursive equations, covariance selection, and path analysis,” J. Am. Stat. Assoc., vol. 75, no. 372, pp. 963–972, 1980.
  • [8] S. van de Geer and P. Bühlmann, “0\ell_{0}-penalized maximum likelihood for sparse directed acyclic graphs,” Ann. Stat., vol. 41, no. 2, pp. 536–567, 2013.
  • [9] M. Pourahmadi, “Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation,” Biometrika, vol. 86, no. 3, pp. 677–690, 1999.
  • [10] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, “First-order methods for sparse covariance selection,” SIAM J. Matrix Anal. Appl., vol. 30, no. 1, pp. 56–66, 2008.
  • [11] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
  • [12] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu, “Sparse permutation invariant covariance estimation,” Electron. J. Stat., vol. 2, pp. 494–515, 2008.
  • [13] I. Córdoba, C. Bielza, and P. Larrañaga, “A review of Gaussian Markov models for conditional independence,” J. Stat. Plan. Inference, vol. 206, pp. 127–144, 2020.
  • [14] G. Kauermann, “On a dualization of graphical Gaussian models,” Scand. J. Stat., vol. 23, no. 1, pp. 105–116, 1996.
  • [15] D. R. Cox and N. Wermuth, “Linear dependencies represented by chain graphs,” Stat. Sci., vol. 8, no. 3, pp. 204–218, 1993.
  • [16] S. Chaudhuri, M. Drton, and T. S. Richardson, “Estimation of a covariance matrix with zeros,” Biometrika, vol. 94, no. 1, pp. 199–216, 2007.
  • [17] J. Bien and R. J. Tibshirani, “Sparse estimation of a covariance matrix,” Biometrika, vol. 98, no. 4, pp. 807–820, 2011.
  • [18] N. Wermuth, D. Cox, and G. M. Marchetti, “Covariance chains,” Bernoulli, vol. 12, no. 5, pp. 841–862, 2006.
  • [19] A. J. Rothman, E. Levina, and J. Zhu, “A new approach to Cholesky-based covariance regularization in high dimensions,” Biometrika, vol. 97, no. 3, pp. 539–550, 2010.
  • [20] A. Das, D. Sexton, C. Lainscsek, S. S. Cash, and T. J. Sejnowski, “Characterizing brain connectivity from human electrocorticography recordings with unobserved inputs during epileptic seizures,” Neural Comput., vol. 31, no. 7, pp. 1271–1326, 2019.
  • [21] G. Varando and N. R. Hansen, “Graphical continuous Lyapunov models,” in Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 2020, accepted. arXiv:2005.10483.
  • [22] M. Drton, “Algebraic problems in structural equation modeling,” in The 50th Anniversary of Gröbner Bases. Mathematical Society of Japan, 2018, pp. 35–86.
  • [23] A. P. Dempster, Elements of Continuous Multivariate Analysis. Addison-Wesley, 1969.
  • [24] A. E. Beaton, “The use of special matrix operators in statistical calculus,” ETS Research Bulletin Series, vol. 1964, no. 2, pp. i–222, 1964.
  • [25] Q. Ye, A. Amini, and Q. Zhou, “Optimizing regularized Cholesky score for order-based learning of Bayesian networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, available online.
  • [26] A. P. Dawid, “Conditional independence in statistical theory,” J. R. Stat. Soc. Ser. B Stat. Methodol., vol. 41, no. 1, pp. 1–31, 1979.
  • [27] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd ed. John Wiley & Sons, 2003.
  • [28] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, “Latent variable graphical model selection via convex optimization,” Ann. Stat., vol. 40, no. 4, pp. 1935–1967, 2012.
  • [29] D. Yatsenko, K. Josić, A. S. Ecker, E. Froudarakis, R. J. Cotton, and A. S. Tolias, “Improved estimation and interpretation of correlations in neural circuits,” PLOS Computational Biology, vol. 11, no. 3, pp. 1–28, 2015.
  • [30] M. Zorzi and R. Sepulchre, “AR identification of latent-variable graphical models,” IEEE Transactions on Automatic Control, vol. 61, no. 9, pp. 2327–2340, 2016.
  • [31] S. Basu, X. Li, and G. Michailidis, “Low rank and structured modeling of high-dimensional vector autoregressions,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1207–1222, 2019.
  • [32] E. Levina, A. Rothman, and J. Zhu, “Sparse estimation of large covariance matrices via a nested Lasso penalty,” Ann. Appl. Stat., vol. 2, no. 1, pp. 245–263, 2008.
  • [33] X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing, “DAGs with NO TEARS: Continuous optimization for structure learning,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc., 2018, pp. 9492–9503.
  • [34] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Found. Trends Mach. Learn., vol. 4, no. 1, pp. 1–106, 2012.
  • [35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
  • [36] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2020. [Online]. Available: https://www.R-project.org/
  • [37] I. Córdoba, G. Varando, C. Bielza, and P. Larrañaga, “On generating random Gaussian graphical models,” International Journal of Approximate Reasoning, vol. 125, pp. 240–250, 2020.
  • [38] D. Dua and C. Graff, “UCI machine learning repository,” 2020. [Online]. Available: http://archive.ics.uci.edu/ml
  • [39] K. Petersen and M. Pedersen, The Matrix Cookbook. Technical University of Denmark, 2008.