
Differential cumulants, hierarchical models and monomial ideals

Daniel Bruynooghe,
Department of Statistics, London School of Economics, London, UK
d.hawellek@lse.ac.uk
Henry P. Wynn
Department of Statistics, London School of Economics, London, UK
h.wynn@lse.ac.uk
Abstract

For a joint probability density function $f_{X}(x)$ of a random vector $X$, the mixed partial derivatives of $\log f_{X}(x)$ can be interpreted as limiting cumulants in an infinitesimally small open neighborhood around $x$. Moreover, setting them to zero everywhere gives independence and conditional independence conditions. The latter conditions can be mapped, using an algebraic differential duality, into monomial ideal conditions. This provides an isomorphism between hierarchical models and monomial ideals. It is thus shown that certain monomial ideals are associated with particular classes of hierarchical models.

Keywords: Differential cumulants, conditional independence, hierarchical models, monomial ideals.

1 Introduction

This paper draws together three areas: a new concept of differential cumulants, hierarchical models and the theory of monomial ideals in algebra. The central idea is that for a strictly positive density $f_{X}(x)$ of a $p$-dimensional random vector $X$, the mixed partial derivatives of the log density $g_{X}(x)=\log f_{X}(x)$ can be used to express independence and conditional independence statements. Thus, for random variables $X_{1},X_{2},X_{3}$ in $\mathbb{R}$, the condition

$$\frac{\partial^{2}}{\partial x_{1}\partial x_{2}}g_{X_{1},X_{2},X_{3}}(x_{1},x_{2},x_{3})=0\ \text{ for all }(x_{1},x_{2},x_{3})\in\mathbb{R}^{3} \qquad (1)$$

is equivalent to the conditional independence statement

$$X_{1}\perp\!\!\!\perp X_{2}\,|\,X_{3}.$$

In the next section we show how such mixed partial derivatives can be interpreted as differential cumulants. Then, in Section 3, we show how collections of differential equations like (1) can be used to express independence and conditional independence models. Section 4 shows that, more generally, these collections can be used to define hierarchical statistical models of exponential form.

Section 5 maps the hierarchical model conditions to monomial ideals, which are increasingly being used within algebraic statistics. This isomorphism maps, for example, the mixed partial derivative condition (1) to the monomial ideal $\langle x_{1}x_{2}\rangle$ within the polynomial ring $k[x_{1},x_{2},x_{3}]$. The equivalence allows ideal properties to be interpreted as hierarchical model properties, opening up an algebraic-statistical interface with some potential.

2 Local and differential cumulants

This section can be considered as a development from a body of work on local correlation. Good examples are the papers of Holland & Wang (1987), Jones (1996) and Bairamov et al. (2003). We draw particularly on Mueller & Yan (2001).

Let $X\in\mathbb{R}^{p}$ be a random vector. We assume $X$ has a $p+1$ times continuously differentiable density $f_{X}$. Once we introduce the concept of differential cumulants, we further require that $f_{X}$ be strictly positive.

For $x,k\in\mathbb{R}^{p}$ we set $x^{k}:=\prod_{i=1}^{p}x_{i}^{k_{i}}$, $x!:=\prod_{i=1}^{p}x_{i}!$ and $m_{k}=\mathbb{E}(X^{k})$. Let $M_{X}:\mathbb{R}^{p}\longrightarrow\mathbb{R}$ and $K_{X}:\mathbb{R}^{p}\longrightarrow\mathbb{R}$ denote the moment and cumulant generating functions of $X$ respectively. For a vector $k\in\mathbb{N}^{p}$ we set

$$D^{k}f(x):=\frac{\partial^{\|k\|_{1}}}{\prod_{i=1}^{p}\partial x_{i}^{k_{i}}}f(x),$$

where $\|k\|_{1}:=\sum_{i=1}^{p}|k_{i}|$ is the Manhattan norm. By convention $D^{0}f(x):=f(x)$.

The cumulant $\kappa_{k}$ can be found by evaluating $D^{k}\log(M_{X}(t))$ at zero. We use the multivariate chain rule (given e.g. in Hardy, 2006) stated in Theorem 1. At the heart of the chain rule is an identification of differential operators with multisets:

Definition 1 (Multiset, multiplicity, size).

A multiset $M$ is a set which may hold multiple copies of its elements. The number of occurrences of an element is its multiplicity. The multiplicity of a multiset is the vector of multiplicities of its elements, denoted by $\nu_{M}$. The total number of elements $|M|$ in $M$ is its size. A multiset which is a set is called degenerate.

Example 1 (Partial derivative and multiset).

The partial derivative $\frac{\partial^{4}}{\partial x_{1}\partial x_{3}^{3}}f(x)$ has associated multiset $\{1,3,3,3\}$ with multiplicity $(1,0,3)$ and size four.

Definition 2 (Partition of a multiset).

Let $I$ be some index set and $(M_{i})_{i\in I}$ be a family of multisets with associated family of multiplicities $(\nu_{M_{i}})_{i\in I}$. A partition $\pi$ of a multiset $M$ is a multiset of multisets $\{(M_{i})_{i\in I}\}$ such that $\nu_{M}=\sum_{i\in I}\nu_{M_{i}}$. Being a multiset itself, a partition can hold multiple copies of one or more multisets.

Example 2 (Partition of a multiset).

The multiset $\{\{x_{1},x_{3}\},\{x_{1},x_{3}\},\{x_{3}\}\}$ is a partition of $\{x_{1},x_{1},x_{3},x_{3},x_{3}\}$, since $(1,0,1)+(1,0,1)+(0,0,1)=(2,0,3)$. In the following, we will use the shorthand $\{x_{1}x_{3}|x_{1}x_{3}|x_{3}\}$.

Associated with a partition $\pi$ of a multiset $M$ is a combinatorial quantity to which we refer as the collapse number $c(\pi)$. It is defined as

$$c(\pi):=\frac{\nu_{M}!}{\prod_{i\in I}\nu_{M_{i}}!\,\nu_{\pi}!}.$$

See Hardy (2006) for a combinatorial interpretation of $c(\pi)$.
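The collapse number is straightforward to compute from multiplicity vectors. The following Python sketch (the function names are ours, not from the paper) evaluates $c(\pi)$ for the four partitions of the multiset $\{1,3,3\}$ that reappear in Example 3 below:

```python
from math import factorial
from collections import Counter

def vec_factorial(v):
    """Multi-index factorial: product of factorials of the entries."""
    out = 1
    for n in v:
        out *= factorial(n)
    return out

def collapse_number(nu_M, blocks):
    """c(pi) = nu_M! / (prod_i nu_{M_i}! * nu_pi!).

    nu_M   -- multiplicity vector of the multiset M
    blocks -- multiplicity vectors of the multisets M_i in pi;
              repeated entries encode repeated multisets
    """
    denom = 1
    for b in blocks:
        denom *= vec_factorial(b)
    # nu_pi! corrects for repeated multisets within the partition pi
    denom *= vec_factorial(Counter(map(tuple, blocks)).values())
    return vec_factorial(nu_M) // denom

# Partitions of the multiset {1,3,3} (p = 3, multiplicity (1,0,2)):
print(collapse_number((1, 0, 2), [(1, 0, 2)]))                        # {133}   -> 1
print(collapse_number((1, 0, 2), [(1, 0, 1), (0, 0, 1)]))             # {13|3}  -> 2
print(collapse_number((1, 0, 2), [(1, 0, 0), (0, 0, 2)]))             # {1|33}  -> 1
print(collapse_number((1, 0, 2), [(1, 0, 0), (0, 0, 1), (0, 0, 1)]))  # {1|3|3} -> 1
```

The values 1, 2, 1 and 1 match the coefficients that appear in the chain-rule expansion of Example 3.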

Theorem 1 (Higher order derivative of chain functions).

$$D^{k}g(h(x))=\sum_{\pi\in\Pi(k)}c(\pi)\,D^{|\pi|}g(h)\prod_{j=1}^{|\pi|}D^{\nu_{M_{j}}}h(x),$$

where $\Pi(k)$ is the set of all partitions of a multiset with multiplicity $k$ and $M_{j}$ is the $j$-th multiset in the partition $\pi$.

Proof.

See Hardy (2006). ∎

Corollary 1 (Cumulants as functions of moments).

Let $\kappa_{k}$ be the $k$-th cumulant. Then

$$\kappa_{k}=\sum_{\pi\in\Pi(k)}c(\pi)(-1)^{|\pi|-1}(|\pi|-1)!\prod_{j=1}^{|\pi|}m_{\nu_{M_{j}}}. \qquad (2)$$
Proof.

Set $g(h)=\log(h)$, $h(t)=M_{X}(t)$ and evaluate at $t=0$. ∎

Example 3 (Partial derivative).

Consider the partial derivative $\frac{\partial^{3}}{\partial x\partial z^{2}}g(h(x,y,z))$. The associated multiset is $\{1,3,3\}$ with partitions $\{133\}$, $\{13|3\}$, $\{1|33\}$, $\{1|3|3\}$. The multivariate chain rule tells us that

$$D^{102}g(h(x,y,z))=Dg\,D^{102}h+2D^{2}g\,D^{101}h\,D^{001}h+D^{2}g\,D^{100}h\,D^{002}h+D^{3}g\,D^{100}h\,(D^{001}h)^{2},$$

where function arguments have been suppressed on the right hand side for better readability. In particular we may conclude that

$$\kappa_{102}=m_{102}-2m_{101}m_{001}-m_{100}m_{002}+2m_{100}m_{001}^{2}.$$
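As a quick sanity check of this identity (a sketch of ours, with arbitrary hypothetical moment values): if the first and third components are independent, the joint moments factorise and the mixed cumulant $\kappa_{102}$ must vanish.

```python
# Hypothetical moments of independent X1 and X3: joint moments factorise.
E_x, E_z, E_z2 = 2.0, 3.0, 10.0     # E[X1], E[X3], E[X3^2] (arbitrary values)

m102 = E_x * E_z2                   # E[X1 X3^2]
m101 = E_x * E_z                    # E[X1 X3]
m100, m001, m002 = E_x, E_z, E_z2

# The identity from the text; the final coefficient 2 is essential
kappa_102 = m102 - 2 * m101 * m001 - m100 * m002 + 2 * m100 * m001 ** 2
print(kappa_102)   # 0.0 under independence
```

With the last coefficient set to 1 instead of 2 the expression would not vanish here, which is how one can detect a miscopied coefficient.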

The expression for cumulants in terms of moments is particularly simple in what we shall call the square-free case, that is for cumulants $\kappa_{k}$ whose index vector $k$ is binary. In that case, the multiset associated with $k$ is degenerate and $c(\pi)=1$. Equation (2) simplifies to

$$\kappa_{k}=\sum_{\pi\in\Pi(k)}(-1)^{|\pi|-1}(|\pi|-1)!\prod_{j=1}^{|\pi|}m_{\nu_{M_{j}}}.$$

In this form it is often stated and derived via the classical Faà di Bruno formula applied to an exponential function followed by a Möbius inversion (see e.g. Barndorff-Nielsen & Cox, 1989).
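A direct implementation of the square-free formula only needs an enumeration of set partitions. The sketch below (our own helper names; it assumes moments are supplied as a function of an index set) recovers $\kappa_{11}=m_{11}-m_{10}m_{01}$ and checks that mixed square-free cumulants vanish for independent coordinates:

```python
from math import factorial

def set_partitions(elems):
    """Generate all partitions of a list of distinct elements."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):          # add 'first' to an existing block
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]              # or open a new block

def square_free_cumulant(moment, indices):
    """kappa_k for binary k with support 'indices'; here c(pi) = 1.

    moment -- function mapping a frozenset of variable indices to the
              corresponding joint moment
    """
    total = 0.0
    for part in set_partitions(list(indices)):
        b = len(part)
        term = (-1) ** (b - 1) * factorial(b - 1)
        for block in part:
            term *= moment(frozenset(block))
        total += term
    return total

# Independent coordinates with hypothetical means 2, 3, 5: joint moments
# factorise, so every mixed square-free cumulant must vanish.
mean = {1: 2.0, 2: 3.0, 3: 5.0}
def indep_moment(S):
    prod = 1.0
    for i in S:
        prod *= mean[i]
    return prod

print(square_free_cumulant(indep_moment, [1, 2]))     # m11 - m10*m01 = 0.0
print(square_free_cumulant(indep_moment, [1, 2, 3]))  # 0.0 as well
```

For $|$indices$|=3$ the generator produces the five partitions of a three-element set (the Bell number $B_3=5$), matching the five terms of the classical third-order cumulant expansion.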

Local analogues of moments and cumulants can be derived by considering their limiting counterparts in the neighborhood of a fixed point $\xi\in\mathbb{R}^{p}$, an idea proposed by Mueller & Yan (2001). This section derives formulae for local moments and cumulants and for local moment generating functions, provided the global counterparts exist.

For a strictly positive edge length $\epsilon\in\mathbb{R}_{+}$, let $A(\xi,\epsilon):=[\xi-\frac{\epsilon}{2},\xi+\frac{\epsilon}{2}]^{p}$ denote the hypercube centered at $\xi$. Let $|A|=\epsilon^{p}$ denote its volume. The density of the random variable $X\in\mathbb{R}^{p}$ conditional on being in $A$ is given by

$$f^{A}_{X}(x)=\frac{f_{X}(x)\,\mathbb{1}_{A}(x)}{pr(X\in A)},$$

where $\mathbb{1}_{A}(x)$ is the indicator function which returns unity if $x$ is in $A$ and zero otherwise. The conditional moments about $\xi$ are denoted by

$$m^{A}_{k}=\mathbb{E}\bigg(\prod_{i=1}^{p}(X_{i}-\xi_{i})^{k_{i}}\,\Big|\,X\in A\bigg).$$

Let $2\mathbb{N}$ and $2\mathbb{N}+1$ denote the sets of positive even and odd integers respectively. For symmetry reasons, even and odd orders of individual components have different effects on local moments, which motivates the following definition:

$$\|\alpha\|^{+}_{1}:=\|\alpha\|_{1}+\sum_{i=1}^{p}\mathbb{1}(\alpha_{i}\in 2\mathbb{N}+1).$$

$\|\cdot\|^{+}_{1}$ increments the total sum of the components of a vector by one additional unit for each odd component (it is not to be interpreted as a norm).

Theorem 2 (Local moments).

Let $X\in\mathbb{R}^{p}$ be an absolutely continuous random vector with density $f_{X}$ which is $p$ times differentiable in $\xi\in\mathbb{R}^{p}$. Let $k\in\mathbb{N}^{p}$ determine the order of the moment. Then, for $|A|$ sufficiently small, $X$ has local moment

$$m^{A}_{k}=r(\epsilon,k)\bigg(\frac{D^{\alpha}f_{X}(\xi)}{f_{X}(\xi)}+O(\epsilon^{2})\bigg), \qquad (3)$$

where $r(\epsilon,k):=\epsilon^{\|k\|_{1}^{+}}\prod_{i:\,k_{i}\in 2\mathbb{N}}\frac{1}{k_{i}+1}\prod_{i:\,k_{i}\in 2\mathbb{N}+1}\frac{1}{k_{i}+2}$ and $\alpha:=\sum_{i:\,k_{i}\in 2\mathbb{N}+1}e_{i}$.

Proof.

Consider

$$m^{A}_{k}=\frac{\int_{A}\prod_{i=1}^{p}(x_{i}-\xi_{i})^{k_{i}}f_{X}(x)\,dx}{\int_{A}f_{X}(x)\,dx}. \qquad (4)$$

Approximate $f_{X}$ through its $p$-th order Taylor expansion, integrate (4) term by term and exploit the point symmetry of odd order terms about the origin. ∎
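The quantities $r(\epsilon,k)$ and $\alpha$ are purely combinatorial functions of $k$, and a small sketch (our naming) makes the parity bookkeeping of Theorem 2 explicit:

```python
def r_and_alpha(eps, k):
    """Scaling factor r(eps, k) and derivative pattern alpha of Theorem 2.

    Positive even orders contribute 1/(k_i + 1), odd orders contribute
    1/(k_i + 2); zero orders contribute nothing.
    """
    # ||k||_1^+ adds one extra unit for each odd component
    norm_plus = sum(k) + sum(1 for ki in k if ki % 2 == 1)
    r = eps ** norm_plus
    for ki in k:
        if ki % 2 == 1:              # odd order
            r /= ki + 2
        elif ki > 0:                 # positive even order
            r /= ki + 1
    alpha = tuple(ki % 2 for ki in k)
    return r, alpha

# k = (1, 2, 0) as in Example 4 below: r = eps^4 / 9, alpha = (1, 0, 0)
r, alpha = r_and_alpha(0.1, (1, 2, 0))
print(alpha)   # (1, 0, 0)
```

For $k=(1,2,0)$ this yields $r(\epsilon,k)=\epsilon^{4}/9$ and $\alpha=(1,0,0)$, agreeing with Example 4.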

Example 4 (Local moment m120m_{120}).

Consider a tri-variate random variable $X$ with local moment $m_{120}^{A}=\mathbb{E}((X_{1}-\xi_{1})(X_{2}-\xi_{2})^{2}\,|\,X\in A)$. Then $r(\epsilon,k)=\frac{\epsilon^{4}}{9}$, $\alpha=(1,0,0)'$ and we obtain

$$m_{120}^{A}=\frac{\epsilon^{4}}{9}\,\frac{1}{f_{X}(\xi)}\frac{\partial f_{X}(x)}{\partial x_{1}}\bigg|_{x=\xi}+O(\epsilon^{6}).$$

A natural way to extend the concept of a local moment is to consider the limiting case when ϵ0\epsilon\rightarrow 0. This leads to our definition of differential moments.

Definition 3 (Differential moment).

The differential moment of an absolutely continuous random vector $X\in\mathbb{R}^{p}$ at $\xi\in\mathbb{R}^{p}$ is defined as

$$m^{\xi}_{k}:=\lim_{\epsilon\rightarrow 0}\frac{m^{A}_{k}}{r(\epsilon,k)}.$$
Corollary 2 (Differential moment).

For a differential moment of order $k\in\mathbb{N}^{p}$ at $\xi\in\mathbb{R}^{p}$ it holds that

$$m^{\xi}_{k}=\frac{D^{\alpha}f_{X}(\xi)}{f_{X}(\xi)}.$$
Proof.

This follows from Theorem 2 upon taking the limit as ϵ0\epsilon\rightarrow 0. ∎

From (3) it is clear that the choice of $\alpha$ in the derivative $D^{\alpha}f_{X}$ depends only on the pattern of odd and even components of the moment. To be precise, $\alpha$ holds a one in each position corresponding to an odd component and a zero in each position corresponding to an even component. Consequently, the differential moment $m^{\xi}_{k}$ depends on $k$ only via the pattern of odd and even values.

This suggests defining an equivalence relation on $\mathbb{N}^{p}$: for $u,k\in\mathbb{N}^{p}$ set $u\sim_{m}k\iff m^{\xi}_{u}=m^{\xi}_{k}$. The relation $\sim_{m}$ partitions $\mathbb{N}^{p}$ into $2^{p}$ equivalence classes of equal differential moments. The graph corresponding to $\sim_{m}$ is depicted in Figure 1 for the bivariate case. The axes give the order of the moment for the two components. Different symbols represent different equivalence classes. For instance, $(2,2)\sim_{m}(4,2)$, since $m^{\xi}_{22}=m^{\xi}_{42}$. Note that $u\sim_{m}k$ if and only if every component of $u-k$ is even.
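The class of an index vector is determined by its componentwise odd/even pattern, which is easy to check (a small sketch of ours):

```python
def parity_pattern(k):
    """Odd/even pattern of an index vector; indexes the 2^p classes."""
    return tuple(ki % 2 for ki in k)

def same_class(u, k):
    """u ~_m k iff every component of u - k is even."""
    return parity_pattern(u) == parity_pattern(k)

print(same_class((2, 2), (4, 2)))   # True: the pair cited for Figure 1
# Componentwise parity matters, not the parity of ||u - k||_1:
print(same_class((1, 0), (0, 1)))   # False, although ||u - k||_1 = 2 is even
```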

Figure 1: Graph of the equivalence classes induced by $\sim_{m}$ (bivariate case). Each equivalence class is depicted with a different symbol.

Similarly to local moments, for any measurable set AA we can define a local moment generating function:

$$M_{X}^{A}(t):=\mathbb{E}(e^{t'X}\,|\,X\in A).$$

Being a conditional expectation, it exists if MXM_{X} exists. We have the following expansion:

$$M^{A}_{X}(t)=\frac{1}{pr(X\in A)}\int_{A}e^{\sum_{i=1}^{p}t_{i}x_{i}}f_{X}(x)\,dx$$
$$=1+\sum_{i=1}^{p}t_{i}\,\frac{\partial f_{X}(x)}{\partial x_{i}}\bigg|_{x=\xi}\bigg(\frac{\epsilon^{2}}{3f_{X}(\xi)}+O(\epsilon^{4})\bigg)+\sum_{i=1}^{p}t_{i}^{2}\,\frac{\partial^{2}f_{X}(x)}{\partial x_{i}^{2}}\bigg|_{x=\xi}\bigg(\frac{\epsilon^{2}}{6f_{X}(\xi)}+O(\epsilon^{4})\bigg)$$
$$+\sum_{i=1}^{p}\sum_{j>i}^{p}t_{i}t_{j}\,\frac{\partial^{2}f_{X}(x)}{\partial x_{i}\partial x_{j}}\bigg|_{x=\xi}\bigg(\frac{\epsilon^{4}}{9f_{X}(\xi)}+O(\epsilon^{6})\bigg)+O(\epsilon^{4}\|t\|^{3}).$$

The local moments can be computed from the local moment generating function via differentiation to appropriate order and evaluation at $t=0$. The natural logarithm of the local moment generating function defines the local cumulant generating function $K^{A}_{X}:\mathbb{R}^{p}\longrightarrow\mathbb{R}$:

$$K^{A}_{X}(t):=\log(M^{A}_{X}(t)).$$
Corollary 3 (Local cumulants).

Under the conditions of Theorem 2 it holds for the local cumulants that

$$\kappa_{k}^{A}=\sum_{\pi\in\Pi(k)}c(\pi)(-1)^{|\pi|-1}(|\pi|-1)!\prod_{j=1}^{|\pi|}r(\epsilon,\nu_{M_{j}})\bigg(\frac{D^{\alpha_{j}}f_{X}(\xi)}{f_{X}(\xi)}+O(\epsilon^{2})\bigg),$$

where $\alpha_{j}$ is a function of the partition $\pi$ and defined as

$$\alpha_{j}:=\sum_{i=1}^{p}e_{i}\,\mathbb{1}\big(\nu_{M_{j}}(i)\in 2\mathbb{N}+1\big),$$

that is, $\alpha_{j}$ is binary and holds ones corresponding to odd elements of $\nu_{M_{j}}$. Furthermore,

$$r(\epsilon,\nu_{M_{j}}):=\epsilon^{\|\nu_{M_{j}}\|_{1}^{+}}\prod_{i:\,\nu_{M_{j}}(i)\in 2\mathbb{N}}\frac{1}{\nu_{M_{j}}(i)+1}\prod_{i:\,\nu_{M_{j}}(i)\in 2\mathbb{N}+1}\frac{1}{\nu_{M_{j}}(i)+2}.$$
Proof.

Combine the chain rule and Theorem 2. ∎

Similarly to differential moments we can define differential cumulants at $\xi$. Two different ways of doing so are natural: first, taking the limit of the local cumulants as $\epsilon\rightarrow 0$ or, second, taking the series of differential moments and requiring that the mapping between moments and cumulants induced by the exp-log relation of the associated generating functions is preserved; see also the discussion in (McCullagh, 1987, page 62).

As demonstrated below, the two quantities just described differ in general and coincide only in the square-free case. In order to retain the intuitive and familiar relation between cumulants and moments, we define differential cumulants in terms of differential moments.

Definition 4 (Differential cumulant).

For an index vector $k\in\mathbb{N}^{p}$, the differential cumulant at $a\in\mathbb{R}^{p}$ is defined as

$$\kappa_{k}^{a}:=\sum_{\pi\in\Pi(k)}c(\pi)(-1)^{|\pi|-1}(|\pi|-1)!\prod_{i=1}^{|\pi|}m^{a}_{\nu_{M_{i}}}.$$

We are now in a position to state the main result of this section, namely that mixed partial derivatives of the log density can be interpreted as differential cumulants.

Lemma 1 (Differential cumulant).

For a differential cumulant at $\xi\in\mathbb{R}^{p}$ of order $k\in\mathbb{N}^{p}$ it holds that

$$\kappa^{\xi}_{k}=D^{\alpha}\log(f_{X}(\xi)),$$

where $\alpha:=\sum_{i:\,k_{i}\in 2\mathbb{N}+1}e_{i}$ projects odd elements of $k$ onto one and even elements of $k$ onto zero.

Proof.

Apply the chain rule to $D^{\alpha}\log(f_{X}(\xi))$. ∎

This is a multivariate generalization of the local dependence function introduced by Holland & Wang (1987). The next theorem relates differential cumulants to the limit of local cumulants.

Theorem 3 (Differential and limiting local cumulant).

A differential cumulant $\kappa^{\xi}_{k}$ equals the limit of the local cumulant $\lim_{\epsilon\rightarrow 0}\frac{1}{r(\epsilon,k)}\kappa^{A}_{k}$ if and only if $k$ is binary, i.e. $\kappa_{k}$ is a square-free cumulant.

Proof.

First, let $k\in\{0,1\}^{p}$ be binary and $\pi=\{(M_{j})_{1\leq j\leq|\pi|}\}$ be a partition of the multiset corresponding to $k$. One can show that $r(\epsilon,k)=\prod_{j=1}^{|\pi|}r(\epsilon,\nu_{M_{j}})$. With that

$$\frac{1}{r(\epsilon,k)}\kappa^{A}_{k}=\sum_{\pi\in\Pi(k)}(-1)^{|\pi|-1}(|\pi|-1)!\prod_{j=1}^{|\pi|}\frac{D^{\nu_{M_{j}}}f_{X}(\xi)}{f_{X}(\xi)}+O(\epsilon^{2}). \qquad (5)$$

Now take limits as $\epsilon\rightarrow 0$ to obtain $\lim_{\epsilon\rightarrow 0}\frac{1}{r(\epsilon,k)}\kappa^{A}_{k}=\kappa_{k}^{\xi}$.

Conversely, suppose $k$ is not binary. Express $\kappa^{A}_{k}$ as a linear combination of local moments. Consider the degenerate partition $\pi$, which holds only one multiset $M$ with multiplicity $\nu_{M}=k$. The quantity associated with $\pi$ converges to $c\,\frac{D^{k}f_{X}(\xi)}{f_{X}(\xi)}$ for some constant $c\in\mathbb{R}$. Since $k$ is not binary, this cannot be a differential moment, as differential moments are proportional to $D^{\alpha}f_{X}(\xi)$ for some binary $\alpha$. Differential cumulants are linear combinations of products of differential moments only. Hence $\kappa^{A}_{k}$ does not converge to a differential cumulant. ∎

Of particular interest to us are differential cumulants which vanish everywhere. We refer to them as zero-cumulants. Writing $g=\log f_{X}$, we shall usually write $D^{\alpha}g=0$ to denote the zero-cumulant associated with $\alpha$, in the understanding that this holds for all $x$.

The next section shows that sets of zero-cumulants are isomorphic to conditional independence statements. As a consequence of Lemma 1, zero-cumulants are invariant under diagonal transformations of the random vector $X$. In particular, they are not affected by the probability integral transformation, and hence every result below also holds for the copula density of $X$.

3 Independence and conditional independence

From now on, we shall assume that $f_{X}$ is strictly positive everywhere. Sets of zero-cumulants are equivalent to conditional and unconditional dependency structures.

Proposition 1 (Independence in the bivariate case).

Let $X\in\mathbb{R}^{2}$. Then $X_{1}\perp\!\!\!\perp X_{2}\iff\kappa_{11}^{x}=0$ for all $x\in\mathbb{R}^{2}$.

Proof.
$$0=\kappa_{11}^{x}=\frac{\partial^{2}}{\partial x_{1}\partial x_{2}}\log(f_{X_{1},X_{2}}(x_{1},x_{2}))\iff f_{X_{1},X_{2}}(x_{1},x_{2})=e^{h_{1}(x_{1})+h_{2}(x_{2})}$$

for some functions $h_{1},h_{2}:\mathbb{R}\rightarrow\mathbb{R}$. ∎
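For a concrete illustration (our sketch, for a standardized bivariate normal with correlation $\rho$): the differential cumulant $\kappa_{11}^{x}$ equals $\rho/(1-\rho^{2})$ at every $x$, so it vanishes everywhere exactly when $\rho=0$, i.e. under independence. A central finite difference of the log density confirms this:

```python
from math import log, pi, sqrt

def log_density(x, y, rho):
    """Log density of a standardized bivariate normal with correlation rho."""
    return (-log(2 * pi * sqrt(1 - rho ** 2))
            - (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2)))

def kappa_11(x, y, rho, h=1e-4):
    """kappa_11^x = d^2/(dx dy) log f via a central finite difference."""
    g = log_density
    return (g(x + h, y + h, rho) - g(x + h, y - h, rho)
            - g(x - h, y + h, rho) + g(x - h, y - h, rho)) / (4 * h * h)
```

Since $\log f$ is quadratic here, the finite difference is exact up to rounding: for $\rho=0.5$ it returns $\rho/(1-\rho^{2})=2/3$ at any point $x$, and $0$ everywhere for $\rho=0$.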

In the multivariate case, we can express conditional independence of any pair given the remaining variables through square-free differential cumulants.

Proposition 2 (Conditional independence of two random variables).

Let $X\in\mathbb{R}^{p}$. Then

$$X_{i}\perp\!\!\!\perp X_{j}\,|\,X_{-ij}\iff\kappa_{k}^{x}=0\ \text{ for all }x\in\mathbb{R}^{p},$$

where

$$X_{-ij}:=(X_{1},...,X_{i-1},X_{i+1},...,X_{j-1},X_{j+1},...,X_{p})$$

and $k=e_{i}+e_{j}$, $(i,j)\in\{1,...,p\}^{2}$, $i\neq j$.

Proof.

By analogy with the bivariate case. ∎

Setting several square-free differential cumulants to zero simultaneously allows us to express conditional independence statements.

Proposition 3 (Multivariate conditional independence).

Given three index sets $I,J,K$ which partition $\{1,...,p\}$, let $S=\{e_{i}+e_{j}:i\in I,j\in J\}$. Then

$$X_{I}\perp\!\!\!\perp X_{J}\,|\,X_{K}\iff\kappa_{k}^{x}=0\ \text{ for all }k\in S\text{ and for all }x\in\mathbb{R}^{p}.$$
Proof.

From Proposition 2 it is clear that this is equivalent to the conditional independence statement

$$X_{I}\perp\!\!\!\perp X_{J}\,|\,X_{K}\iff X_{i}\perp\!\!\!\perp X_{j}\,|\,X_{-ij}\ \text{ for all }(i,j)\in I\times J.$$

Sufficiency ($\Rightarrow$) and necessity ($\Leftarrow$) follow from the semi-graphoid and graphoid axioms referred to as decomposition and intersection respectively. Both hold true for strictly positive conditional densities (see for instance Cozman & Walley, 2005). ∎

Pairwise conditional independence of all pairs is equivalent to independence.

Theorem 4 (Pairwise conditional independence if and only if independence).

The random variables $X_{1},...,X_{n}$ are independent if and only if $\kappa_{e_{i}+e_{j}}=0$ for all $(i,j)\in\{1,...,n\}^{2}$, $i\neq j$.

Proof.

Sufficiency follows from differentiation of the log-density. Necessity can be proved by induction on the number of variables $n$. The statement is true for $n=2$ by Proposition 1. Let the statement be true for $n$ and let the $\binom{n+1}{2}$ differential cumulants $\kappa_{e_{i}+e_{j}}$ vanish, where $e_{i}$ and $e_{j}$ are unit vectors in $\mathbb{R}^{n+1}$. Consider $\kappa_{e_{1}+e_{2}}=0$. Integration with respect to $x_{1}$ and $x_{2}$ yields

$$f_{X_{1},...,X_{n+1}}(x_{1},...,x_{n+1})=e^{h_{1}(x_{-1})+h_{2}(x_{-2})} \qquad (6)$$

for some functions $h_{1}:\mathbb{R}^{n}\longrightarrow\mathbb{R}$ and $h_{2}:\mathbb{R}^{n}\longrightarrow\mathbb{R}$. Now integrate again with respect to $x_{1}$ to obtain

$$f_{X_{-1}}(x_{-1})=e^{h_{1}(x_{-1})}\int_{\mathbb{R}}e^{h_{2}(x_{-2})}\,dx_{1}.$$

The left hand side is an $n$-dimensional marginal density which factorises into $n$ marginals by the induction assumption: $f_{X_{-1}}(x_{-1})=\prod_{i=2}^{n+1}f_{X_{i}}(x_{i})$. This allows us to conclude that $h_{1}(x_{-1})$ can be split into a sum of two functions, $g_{1}:\mathbb{R}^{n-1}\longrightarrow\mathbb{R}$ and $g_{2}:\mathbb{R}\longrightarrow\mathbb{R}$, where the latter is a function of $x_{2}$ only, i.e. $h_{1}(x_{-1})=g_{1}(x_{-12})+g_{2}(x_{2})$. Considering (6) again, we see that the density $f_{X_{1},...,X_{n+1}}$ factorises as

$$f_{X_{1},...,X_{n+1}}(x_{1},...,x_{n+1})=e^{g_{2}(x_{2})+g_{1}(x_{-12})+h_{2}(x_{-2})}.$$

Hence $X_{2}\perp\!\!\!\perp X_{-2}$ and the density of $X_{-2}$ factorises by the induction assumption. ∎

4 Hierarchical models

The analysis of the last section makes clear that setting certain mixed two-way partial derivatives of $g(x)=\log f_{X}(x)$ equal to zero is equivalent to independence or conditional independence statements. We can go further and define a generalized hierarchical model using the same process.

The basic structure of a hierarchical model can be defined via a simplicial complex. Thus let $\mathcal{N}=\{1,\ldots,p\}$ be the vertex set representing the random variables $X_{1},...,X_{p}$. A collection $\mathcal{S}$ of index sets $J\subseteq\mathcal{N}$ is a simplicial complex if it is closed under taking subsets, i.e. if $J\in\mathcal{S}$ and $K\subseteq J$ then $K\in\mathcal{S}$.

Definition 5.

Given a simplicial complex $\mathcal{S}$ over an index set $\mathcal{N}=\{1,\ldots,p\}$ and an absolutely continuous random vector $X$, a hierarchical model for the joint density $f_{X}(x)$ takes the form

$$f_{X}(x)=\exp\bigg\{\sum_{J\in\mathcal{S}}h_{J}(x_{J})\bigg\},$$

where $h_{J}:\mathbb{R}^{J}\longrightarrow\mathbb{R}$ and $x_{J}\in\mathbb{R}^{J}$ is the canonical projection of $x\in\mathbb{R}^{p}$ onto the subspace associated with the index set $J$.

This is equivalent to a quasi-additive model $g(x)=\sum_{J\in\mathcal{S}}h_{J}(x_{J})$, and we also refer to this model for $g(x)$ as being hierarchical. It is clear that we may write the model over the maximal cliques only, namely simplexes which are not contained in a larger simplex. In the terminology of Lauritzen (1996), we require that $f_{X}$ be positive and factorise according to $\mathcal{S}$ for it to be a hierarchical model with respect to $\mathcal{S}$.

Associated with an index set $K\subseteq\mathcal{N}$ is a differential operator $D^{k}$, where $k=\sum_{i\in K}e_{i}\in\{0,1\}^{p}$ holds ones for every member of $K$ and zeros otherwise. In the following, we overload the differential operator by allowing it to be superscripted by a set or by a vector. Thus, for an index set $K$ we set $D^{K}:=D^{k}$ and similarly $\kappa_{K}^{x}:=\kappa_{k}^{x}$. Applied to $g(x)$, $D^{K}$ returns the differential cumulant $\kappa_{K}^{x}$.

Example 5.

Let $K=\{2,4,6\}$. We obtain $k=(0,1,0,1,0,1)$ and

$$D^{K}g(x)=\kappa_{K}^{x}=\kappa_{k}^{x}=\frac{\partial^{3}}{\partial x_{2}\partial x_{4}\partial x_{6}}g(x).$$

We collect the results of the last section into a comprehensive statement. First, we define the complementary complex of a simplicial complex $\mathcal{S}$ on $\mathcal{N}$.

Definition 6.

Given a simplicial complex $\mathcal{S}$ on an index set $\mathcal{N}$, we define the complementary complex as the collection $\bar{\mathcal{S}}$ of every index set $K\subseteq\mathcal{N}$ which is not a member of $\mathcal{S}$.

Note immediately that 𝒮¯\bar{\mathcal{S}} is closed under unions, i.e. K,K in 𝒮¯KK𝒮¯K,K^{\prime}\text{ in }\bar{\mathcal{S}}\Rightarrow K\cup K^{\prime}\in\bar{\mathcal{S}}. It is a main point of this paper that there is a duality between setting collections of mixed differential cumulants equal to zero and a general hierarchical model:
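A small sketch (with an assumed set-of-frozensets representation, not notation from the paper) checks the union-closure of the complementary complex for the running example 𝒮 = {13, 23}:

```python
# Sketch: a simplicial complex as a set of frozensets; its complementary
# complex is every index set that is not a face, and is closed under unions.
from itertools import combinations

def power_set(ns):
    return [frozenset(c) for r in range(len(ns) + 1) for c in combinations(sorted(ns), r)]

def complementary_complex(S, ns):
    return {K for K in power_set(ns) if K not in S}

N = {1, 2, 3}
S = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 3), (2, 3)]}  # faces of {13, 23}
Sbar = complementary_complex(S, N)
assert all(K | Kp in Sbar for K in Sbar for Kp in Sbar)  # closed under unions
print(sorted(sorted(K) for K in Sbar))  # [[1, 2], [1, 2, 3]]
```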

Theorem 5.

Given a simplicial complex 𝒮\mathcal{S} on an index set 𝒩\mathcal{N}, a model gg is hierarchical based on 𝒮\mathcal{S} if and only if all differential cumulants on the complementary complex vanish everywhere, that is

κKx=0, for allx in pand for allK in 𝒮¯.\kappa_{K}^{x}=0,\;\mbox{ for all}\;x\text{ in }\mathbb{R}^{p}\;\mbox{and for all}\;K\text{ in }\bar{\mathcal{S}}.
Proof.

First, let gg be hierarchical with respect to 𝒮\mathcal{S}, that is, gg is a log-density with representation g(x)=J in 𝒮hJ(xJ)g(x)=\sum_{J\text{ in }\mathcal{S}}h_{J}(x_{J}). Then, for K in 𝒮¯K\text{ in }\bar{\mathcal{S}}, the associated differential operator DKD^{K} annihilates every term hJh_{J} in gg: since 𝒮\mathcal{S} is closed under subsets, KJK\not\subseteq J for any J in 𝒮J\text{ in }\mathcal{S}, so some index in KK lies outside JJ and hJ(xJ)h_{J}(x_{J}) is constant in the corresponding variable.

Conversely, suppose κKx=0 for allx in pand for allK in 𝒮¯.\kappa_{K}^{x}=0\mbox{ for all}\;x\text{ in }\mathbb{R}^{p}\;\mbox{and for all}\;K\text{ in }\bar{\mathcal{S}}. Then, by Proposition 2, fXf_{X} is pairwise Markov with respect to 𝒮\mathcal{S} and hence factorises over the maximal cliques of 𝒮\mathcal{S} by the Hammersley-Clifford theorem. The reader is referred to Lauritzen (1996) for a detailed discussion of factorization and Markov properties. ∎

5 The duality with monomial ideals

The growing area of algebraic statistics makes use of computational commutative algebra, particularly for discrete probability models, notably Poisson and multinomial log-linear models. Work connecting the algebraic methods to continuous probability models is sparser, although considerable progress has been made in the Gaussian case. For an overview see Drton et al. (2009). Our link to the algebra is via monomial ideals.

A monomial in x1,,xpx_{1},...,x_{p} is a product of the form xα=j=1pxjαjx^{\alpha}=\prod_{j=1}^{p}x_{j}^{\alpha_{j}}, where α in p\alpha\text{ in }\mathbb{N}^{p}. A monomial ideal II is an ideal of a polynomial ring k[x1,,xp]k[x_{1},...,x_{p}] generated by monomials, so that any mIm\in I can be written as a finite polynomial combination m=kKhkxαkm=\sum_{k\in K}h_{k}x^{\alpha_{k}}, where hkk[x1,,xp]h_{k}\in k[x_{1},...,x_{p}] and αkp\alpha_{k}\in\mathbb{N}^{p} for all kKk\in K. We write I=<xα1,,xαK>I=<x^{\alpha_{1}},...,x^{\alpha_{K}}> to express that II is generated by the family of monomials (xαk)kK(x^{\alpha_{k}})_{k\in K}.

The full set MM of monomials contained in a monomial ideal II has the hierarchical structure:

xαMxα+γM,x^{\alpha}\in M\Rightarrow x^{\alpha+\gamma}\in M, (7)

for any exponent vector γp\gamma\in\mathbb{N}^{p}. A monomial ideal is square-free if its generators (xαk)1kK(x^{\alpha_{k}})_{1\leq k\leq K} are square-free, i.e. αk{0,1}p for all 1kK\alpha_{k}\in\{0,1\}^{p}\mbox{ for all }1\leq k\leq K.

The following discussion shows that there is a complete duality between the structure of square-free monomial ideals and hierarchical models. Associated with a simplicial complex 𝒮\mathcal{S} is its Stanley-Reisner ideal I𝒮I_{\mathcal{S}}. This is the ideal generated by all square-free monomials associated with the complementary complex 𝒮¯\bar{\mathcal{S}}. For a set K𝒮¯K\in\bar{\mathcal{S}} let mK(x):=kKxkm_{K}(x)\mathrel{\mathop{\ordinarycolon}}=\prod_{k\in K}x_{k} denote the associated square-free monomial. Then

I𝒮=<(mK)K𝒮¯>.I_{\mathcal{S}}=<(m_{K})_{K\in\bar{\mathcal{S}}}>.
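The minimal generators of I𝒮I_{\mathcal{S}} are the minimal non-faces of 𝒮\mathcal{S}. A brute-force sketch (assumed representation, using the 4-cycle 𝒮 = {12, 23, 34, 41} as the example):

```python
# Sketch: Stanley-Reisner generators as minimal non-faces of a complex.
from itertools import combinations

def minimal_nonfaces(S, ns):
    subsets = [frozenset(c) for r in range(1, len(ns) + 1)
               for c in combinations(sorted(ns), r)]
    nonfaces = [K for K in subsets if K not in S]
    # keep only the inclusion-minimal non-faces
    return [K for K in nonfaces if not any(Kp < K for Kp in nonfaces)]

# faces of the 4-cycle S = {12, 23, 34, 41}
cliques = [(1, 2), (2, 3), (3, 4), (1, 4)]
S = {frozenset(s) for c in cliques for r in range(len(c) + 1)
     for s in combinations(c, r)}
gens = minimal_nonfaces(S, {1, 2, 3, 4})
print(sorted(sorted(K) for K in gens))  # [[1, 3], [2, 4]]: I_S = <x1x3, x2x4>
```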

The second step, which is a main point of the paper, is to associate the differential operator DKD^{K} with the monomial mK(x)m_{K}(x). We need only confirm that the hierarchical structure implied by (7) is consistent with differential conditions of Theorem 5.

Without loss of generality we include all differential operators which are obtained by continued differentiation. Then, (7) is mapped exactly to

Dαg(x)=0,for allxpDα+γg(x)=0,for allxpD^{\alpha}g(x)=0,\;\mbox{for all}\;x\in\mathbb{R}^{p}\Rightarrow D^{\alpha+\gamma}g(x)=0,\;\mbox{for all}\;x\in\mathbb{R}^{p}

simply by continued differentiation. This bijective mapping from monomial ideals into differential operators is sometimes referred to as a “polarity” and, within differential ideal theory, has its origins in Seidenberg’s differential Nullstellensatz (Seidenberg, 1956). It allows us to map properties of hierarchical models in statistics to monomial ideal properties and vice versa.

One of the main conditions discussed in the theory of hierarchical models in statistics is the decomposability of a joint density function into a product of certain marginal densities. Simple conditional independence is the canonical case. Thus with p=3p=3 the conditional independence X1X2|X3X_{1}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X_{2}|X_{3} is represented by the graph 1321-3-2. In this case the graph has the model simplicial complex 𝒮={13,23}\mathcal{S}=\{13,23\}, where, again, we write 𝒮\mathcal{S} in terms of its maximal cliques. The Stanley-Reisner ideal is I𝒮=<x1x2>I_{\mathcal{S}}=<x_{1}x_{2}>.

There is a factorization:

fX1,X2,X3(x1,x2,x3)=fX1,X3(x1,x3)fX2,X3(x2,x3)fX3(x3).f_{X_{1},X_{2},X_{3}}(x_{1},x_{2},x_{3})=\frac{f_{X_{1},X_{3}}(x_{1},x_{3})f_{X_{2},X_{3}}(x_{2},x_{3})}{f_{X_{3}}(x_{3})}.
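As a numerical check of this factorization (a sketch, not from the paper), the code below uses a hypothetical trivariate Gaussian whose precision matrix has a zero in position (1,2), so the conditional independence holds; all densities are computed with stdlib arithmetic only.

```python
import math

# hypothetical precision (inverse covariance) matrix with K[0][1] = 0,
# encoding the conditional independence of X1 and X2 given X3
K = [[1.0, 0.0, -0.3],
     [0.0, 1.0, -0.3],
     [-0.3, -0.3, 1.0]]

def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

def inv3(A):
    d = det3(A)
    # 3x3 cofactors via cyclic index shifts (signs absorbed by the cycle)
    C = [[A[(i + 1) % 3][(j + 1) % 3] * A[(i + 2) % 3][(j + 2) % 3]
          - A[(i + 1) % 3][(j + 2) % 3] * A[(i + 2) % 3][(j + 1) % 3]
          for j in range(3)] for i in range(3)]
    return [[C[j][i] / d for j in range(3)] for i in range(3)]

def pdf1(x, v):  # univariate normal density, mean zero, variance v
    return math.exp(-x * x / (2 * v)) / math.sqrt(2 * math.pi * v)

def pdf2(x, y, S):  # bivariate normal density, mean zero, covariance S
    d = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    q = (S[1][1] * x * x - 2 * S[0][1] * x * y + S[0][0] * y * y) / d
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(d))

def pdf3(x, P):  # trivariate normal density via precision matrix P
    q = sum(P[i][j] * x[i] * x[j] for i in range(3) for j in range(3))
    return math.sqrt(det3(P)) * math.exp(-q / 2) / (2 * math.pi) ** 1.5

S = inv3(K)  # covariance matrix
x = (0.7, -0.4, 1.2)
S13 = [[S[0][0], S[0][2]], [S[2][0], S[2][2]]]
S23 = [[S[1][1], S[1][2]], [S[2][1], S[2][2]]]
lhs = pdf3(x, K)
rhs = pdf2(x[0], x[2], S13) * pdf2(x[1], x[2], S23) / pdf1(x[2], S[2][2])
assert abs(lhs - rhs) < 1e-12
print("f123 = f13 * f23 / f3 at the test point")
```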

Decomposable graphical models, discussed below, are a generalization of this simple case. There are other cases, however, where one or more factorizations are associated with the same simplicial complex. An example is the 44-cycle: 𝒮={12,23,34,41}\mathcal{S}=\{12,23,34,41\} with Stanley-Reisner ideal I𝒮=<x1x3,x2x4>I_{\mathcal{S}}=<x_{1}x_{3},x_{2}x_{4}>. Although this ideal is rather simple from an algebraic point of view, the 4-cycle is rather complex from a statistical point of view. By considering special ideals we obtain general classes of models in Subsection 5.2.

Another issue is that the structure of 𝒮\mathcal{S} may suggest factorizations even when they are problematical. Perhaps the first such case is the 3-cycle: 𝒮={12,13,23}\mathcal{S}=\{12,13,23\}. The Stanley-Reisner ideal is I𝒮=<x1x2x3>I_{\mathcal{S}}=<x_{1}x_{2}x_{3}>. The maximal clique log-density representation has no three-way interaction:

g(x1,x2,x3)=h12(x1,x2)+h13(x1,x3)+h23(x2,x3).g(x_{1},x_{2},x_{3})=h_{12}(x_{1},x_{2})+h_{13}(x_{1},x_{3})+h_{23}(x_{2},x_{3}).

This might suggest the factorization

fX1,X2,X3(x1,x2,x3)=fX1,X2(x1,x2)fX1,X3(x1,x3)fX2,X3(x2,x3)fX1(x1)fX2(x2)fX3(x3)f_{X_{1},X_{2},X_{3}}(x_{1},x_{2},x_{3})=\frac{f_{X_{1},X_{2}}(x_{1},x_{2})f_{X_{1},X_{3}}(x_{1},x_{3})f_{X_{2},X_{3}}(x_{2},x_{3})}{f_{X_{1}}(x_{1})f_{X_{2}}(x_{2})f_{X_{3}}(x_{3})} (8)

A factorization of this kind is the continuous analogue to a perfect three-dimensional table in the discrete case (Darroch, 1962). However, except when X1,X2,X3X_{1},X_{2},X_{3} are independent we have not been able to provide a standard density for which (8) holds.

5.1 Decomposability and marginality

Our use of the index set notation makes it straightforward to define decomposability.

Definition 7.

Let 𝒩={1,,p}\mathcal{N}=\{1,\ldots,p\} be the vertex set of a graph 𝒢\mathcal{G} and I,JI,J vertex sets such that IJ=𝒩I\cup J=\mathcal{N}. Then 𝒢\mathcal{G} is decomposable if and only if IJI\cap J is complete and the subgraph induced by II is either complete or itself decomposable, and similarly for JJ.

Under this condition the corresponding hierarchical model has a factorization

fV(xV)=JCfJ(xJ)KSfK(xK),f_{V}(x_{V})=\frac{\prod_{J\in C}f_{J}(x_{J})}{\prod_{K\in S}f_{K}(x_{K})},

where CC denotes the set of maximal cliques and SS the set of separators arising in the continued factorization under the definition; the numerator on the right-hand side thus corresponds to cliques and the denominator to separators.

Figure 2: Factorization and marginalisation of a hierarchical model.

It is important to realize that in order to proceed with the factorization, at each stage a marginalisation step is required. Consider the simple case based on the simplicial complex 𝒮={123,234,345}\mathcal{S}=\{123,234,345\}. One choice of factorization at the first stage is (with simplified notation):

f12345=f123f2345f23f_{12345}=\frac{f_{123}f_{2345}}{f_{23}}

and we continue the factorization to give

f12345=f123f234f345f23f34.f_{12345}=\frac{f_{123}f_{234}f_{345}}{f_{23}f_{34}}.

The process of marginalisation is shown in Figure 2. At any stage, we may choose to marginalise with respect to any variable that is a member of just a single clique. In the first step these are X1X_{1} and X5X_{5}; suppose we choose to single out X1X_{1}. Once fXf_{X} has been integrated with respect to x1x_{1}, the marginal model for X2,,X5X_{2},...,X_{5} is obtained. The removal of the clique 123123 leaves X2X_{2} exposed, and we may continue with X2X_{2} or X5X_{5}, and so on.

The Stanley-Reisner ideal I𝒮=<x1x4,x1x5,x2x5>I_{\mathcal{S}}=<x_{1}x_{4},x_{1}x_{5},x_{2}x_{5}> is an ideal in k[x1,x2,x3,x4,x5]k[x_{1},x_{2},x_{3},x_{4},x_{5}]. The factorization of f2345f_{2345} is, however, mapped into the monomial ideal <x2x5><x_{2}x_{5}> which is an ideal in k[x2,x3,x4,x5]k[x_{2},x_{3},x_{4},x_{5}]. A marginalisation has allowed us to drop from five dimensions to four. This is clear from the exponential expression of the model:

f12345=exp{h123(x1,x2,x3)+h234(x2,x3,x4)+h345(x3,x4,x5)}.f_{12345}=\exp\left\{h_{123}(x_{1},x_{2},x_{3})+h_{234}(x_{2},x_{3},x_{4})+h_{345}(x_{3},x_{4},x_{5})\right\}.

Integrating with respect to x1x_{1} we obtain a hierarchical model for the marginal joint distribution of (X2,X3,X4,X5)(X_{2},X_{3},X_{4},X_{5}). This marginalisation is possible because x1x_{1} appears only in the single clique {1,2,3}\{1,2,3\}.

We have exposed an interesting relationship between the statistical and algebraic formulations: in order to reduce the dimensionality and obtain the Stanley-Reisner ideal for a reduced set of variables, we must first perform a marginalisation, which is a non-algebraic operation: at least, not in general a finite-dimensional one. We capture this in the following lemma:

Lemma 2.

Whenever the simplicial complex of a hierarchical model has a subset of vertexes which forms a facet of a unique maximal clique (simplex), the marginal model obtained by deleting this facet (and its connections) is valid. Moreover, the monomial ideal representation is obtained by deleting any generators containing the corresponding variables, and it lives in the ring without these variables.

Proof.

This follows the lines of the example. If JJ is the subset of vertexes and KK, with JKJ\subset K, is the unique maximal clique, then in the exponential expression for the density there will be a unique term exp(hK(xK))\exp(h_{K}(x_{K})) in which xJx_{J} appears. Integrating with respect to xJx_{J} to obtain the marginal distribution for XVJX_{V\setminus J} gives the reduced model. The monomial ideal representation follows accordingly. ∎
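The ideal-level half of Lemma 2 is a one-line operation. A sketch (assumed representation) on the earlier example I𝒮 = <x1x4, x1x5, x2x5>, marginalising out x1:

```python
# Sketch of the ideal-level marginalisation step in Lemma 2: dropping a
# variable lying in a unique maximal clique removes every generator
# containing it, leaving an ideal in the smaller ring.
def marginalise_generators(gens, var):
    return [g for g in gens if var not in g]

gens = [{1, 4}, {1, 5}, {2, 5}]            # I_S = <x1x4, x1x5, x2x5>
reduced = marginalise_generators(gens, 1)  # integrate out x1
print(reduced)  # [{2, 5}]: the ideal <x2x5> in k[x2, x3, x4, x5]
```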

5.2 Artinian closure and polynomial exponential models

The terms hJ(xJ)h_{J}(x_{J}) which appear in the hierarchical models have not been given any special form. In fact it is a main point of this paper that this is not required to give the monomial ideal equivalence. We note, again, that we always use square-free monomial ideals.

Certain classes of hierarchical models can, however, be obtained by imposing further differential conditions. The following lemma shows that the log-density is polynomial if we impose univariate derivative restrictions.

Lemma 3.

If in addition to the differential conditions in Theorem 5 we impose conditions of the form

nixinig(x)=0, for all 1ip and np\frac{\partial^{n_{i}}}{\partial x_{i}^{n_{i}}}g(x)=0,\;\text{ for all }1\leq i\leq p\text{ and }n\in\mathbb{N}^{p} (9)

the hh-functions in the corresponding hierarchical model are polynomials, in which the degree of xix_{i} does not exceed ni1, for all 1ipn_{i}-1,\;\text{ for all }1\leq i\leq p.

Proof.

Repeated integration with respect to xix_{i} shows that gg is indeed a polynomial in xix_{i} of degree less than nin_{i} when the other variables are fixed. Since this holds for all 1ip1\leq i\leq p the result follows. ∎

The simultaneous inclusion of derivative operators with respect to each single indeterminate in (9) constitutes an Artinian closure of the differential version of the Stanley-Reisner ideal I𝒮I_{\mathcal{S}}.

Example 6 (BEC density).

Suppose XX is bivariate and we impose the symmetric Artinian closure conditions

2xi2g(x1,x2)=0, for i=1,2.\frac{\partial^{2}}{\partial x_{i}^{2}}g(x_{1},x_{2})=0,\;\text{ for }i=1,2.

Then integration yields

g(x1,x2)=x1h1(x2)+h2(x2)\displaystyle g(x_{1},x_{2})=x_{1}h_{1}(x_{2})+h_{2}(x_{2})
and
g(x1,x2)=x2h3(x1)+h4(x1).\displaystyle g(x_{1},x_{2})=x_{2}h_{3}(x_{1})+h_{4}(x_{1}).

A comparison of these functional forms identifies h1(x2)=a3x2+a1,h2(x2)=a0+a2x2,h3(x1)=a3x1+a2,h4(x1)=a1x1+a0,h_{1}(x_{2})=a_{3}x_{2}+a_{1},\;h_{2}(x_{2})=a_{0}+a_{2}x_{2},\;h_{3}(x_{1})=a_{3}x_{1}+a_{2},\;h_{4}(x_{1})=a_{1}x_{1}+a_{0}, for some aia_{i}\in\mathbb{R}, 0i30\leq i\leq 3, so that g(x1,x2)g(x_{1},x_{2}) can be written as

g(x1,x2)=a0+a1x1+a2x2+a3x1x2.g(x_{1},x_{2})=a_{0}+a_{1}x_{1}+a_{2}x_{2}+a_{3}x_{1}x_{2}. (10)

It can be shown that X1X_{1} is distributed exponentially conditional on X2=x2X_{2}=x_{2} for all x2>0x_{2}>0, and vice versa. A distribution with this property is called a bivariate exponential conditionals (BEC) distribution. BEC distributions are completely described by gg in the sense that any BEC density is of the form (10) (Arnold & Strauss, 1988). In particular, the independence case is included if we force a3=0a_{3}=0 by imposing the additional restriction

2x1x2g(x1,x2)=0.\frac{\partial^{2}}{\partial x_{1}\partial x_{2}}g(x_{1},x_{2})=0.

This confirms Proposition 1 for this particular example.

The previous example extends readily to higher dimensions. We call a distribution a multivariate exponential conditionals (MEC) distribution if, for every 1jp1\leq j\leq p, XjX_{j} is distributed exponentially conditional on the remaining variables. We capture the extension to the pp-dimensional case in the following lemma:

Lemma 4 (MEC distributions and Artinian closure).

The following statements are equivalent:

  1. 1.

    A distribution belongs to the class of MEC distributions

  2. 2.

gg is multi-linear, i.e. there exist 2p2^{p} coefficients asa_{s}\in\mathbb{R} such that g=sζasxsg=\sum_{s\in\zeta}a_{s}x^{s}, where ζ={0,1}p\zeta=\{0,1\}^{p} denotes the set of pp-dimensional binary vectors

  3. 3.

    2xi2g(x)=0, for all 1ip.\frac{\partial^{2}}{\partial x_{i}^{2}}g(x)=0,\;\text{ for all }1\leq i\leq p.

Proof.

For a proof of (1)(2)(1)\iff(2) see Arnold & Strauss (1988). The proof of (2)(3)(2)\iff(3) follows the lines of the example. ∎
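A numerical sketch of the link between statements 2 and 3, with hypothetical coefficients a_s: a multi-linear g is affine in each coordinate, so its second central difference in every coordinate vanishes exactly.

```python
# Sketch: a multi-linear g (statement 2, hypothetical coefficients a_s) has
# zero second central difference in every coordinate, mirroring statement 3.
import math
from itertools import product

p = 3
coeffs = {s: 0.1 + sum(s) for s in product((0, 1), repeat=p)}  # hypothetical a_s

def g(x):
    # g = sum over s in {0,1}^p of a_s * x^s
    return sum(a * math.prod(xi ** si for xi, si in zip(x, s))
               for s, a in coeffs.items())

x = [0.4, -1.2, 2.5]
for i in range(p):
    up, dn = x[:], x[:]
    up[i] += 1.0
    dn[i] -= 1.0
    # g is affine in x_i, so the second difference vanishes up to rounding
    assert abs(g(up) - 2 * g(x) + g(dn)) < 1e-9
print("second differences vanish in every coordinate")
```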

Another case of considerable importance is the Gaussian distribution. Here

g(x)=K𝒮hK(xK),g(x)=\sum_{K\in\mathcal{S}}h_{K}(x_{K}),

and the maximal cliques are of degree two. The latter condition is partly obtained with an Artinian closure with ni=3,i=1,,pn_{i}=3,\;i=1,\ldots,p. However, more is required. We can guess, from the fact that for a normal distribution all (ordinary) cumulants of degree three and above are zero, that if we impose all degree-three differential cumulants to be zero we obtain polynomial terms of maximum degree two. This is, in fact, the correct set of conditions to make the model terms of degree at most two. In the α\alpha-notation the conditions are

Dαg=0,for allαpwithα1=3,D^{\alpha}g=0,\;\mbox{for all}\;\alpha\in\mathbb{N}^{p}\;\mbox{with}\;\mathinner{\!\left\lVert\alpha\right\rVert}_{1}=3,

which include the Artinian closure conditions. The corresponding ideal is generated by all monomials of degree three. For a non-singular multivariate Gaussian we, of course, require the degree-two part of gg, considered as a quadratic form, to be negative definite.

The hierarchical model is given by additional restrictions which are equivalent to removing certain terms of the form xixj,ijx_{i}x_{j},\;i\neq j. This is the same as setting the corresponding {ij}\{ij\}-th entry of the inverse covariance matrix (influence matrix) equal to zero. The removed monomials xixjx_{i}x_{j} generate the Stanley-Reisner ideal, so that the zero structure of the influence matrix completely determines the ideal.
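Reading the Stanley-Reisner generators off the zero pattern of the influence matrix can be sketched as follows; the matrix K below is a hypothetical 4-variable example encoding the 4-cycle model.

```python
# Sketch: zeros of a hypothetical inverse covariance (influence) matrix give
# the Stanley-Reisner generators directly; this K encodes the 4-cycle.
K = [[ 2.0, -0.5,  0.0, -0.5],
     [-0.5,  2.0, -0.5,  0.0],
     [ 0.0, -0.5,  2.0, -0.5],
     [-0.5,  0.0, -0.5,  2.0]]

gens = [(i + 1, j + 1) for i in range(4) for j in range(i + 1, 4)
        if K[i][j] == 0.0]
print(gens)  # [(1, 3), (2, 4)]: I_S = <x1x3, x2x4>
```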

5.3 Ideal-generated models

The duality between monomial ideals and hierarchical models encourages the investigation of the properties of hierarchical models for different types of ideals. There are some important properties and features of monomial ideals which may be linked to the corresponding hierarchical models and we mention just a few here in an attempt to introduce a larger research programme.

We begin with the sub-class of decomposable models. It is well known from the statistical literature (see Lauritzen, 1996) that the decomposability of the model based on a simplicial complex 𝒮\mathcal{S} is equivalent to the chordal property: every cycle of length four or more has a chord. Remarkably, the latter is equivalent to a property of the Stanley-Reisner ideal I𝒮I_{\mathcal{S}}, namely that the minimal free resolution of I𝒮I_{\mathcal{S}} be linear (see below for a brief explanation). This is a result of Fröberg (1988); see also Dochtermann & Engström (2009). Petrovic & Stokes (2010) adapt a result of Geiger et al. (2006) to show that I𝒮I_{\mathcal{S}}, in this case, is generated in degree 2, that is, all its generators have degree 2.

Theorem 6.

The Stanley-Reisner ideal I𝒮I_{\mathcal{S}} of a decomposable graphical model has a “2-linear” resolution.

The term linear refers to the structure of the minimal free resolution of I𝒮I_{\mathcal{S}}. In this resolution there are monomial maps between the stages of the resolution sequence; linear means that the entries of these maps are linear. As a simple example consider again the simplicial complex 𝒮={123,234,345}\mathcal{S}=\{123,234,345\} with Stanley-Reisner ideal I𝒮=<x1x4,x1x5,x2x5>I_{\mathcal{S}}=<x_{1}x_{4},x_{1}x_{5},x_{2}x_{5}>. The minimal free resolution of I𝒮I_{\mathcal{S}} is given by:

0\longrightarrow R^{2}\xrightarrow{\left[\begin{smallmatrix}-x_{5}&0\\ x_{4}&-x_{2}\\ 0&x_{1}\end{smallmatrix}\right]}R^{3}\xrightarrow{\left[x_{1}x_{4},\;x_{1}x_{5},\;x_{2}x_{5}\right]}I_{\mathcal{S}}\longrightarrow 0,

and one sees that the entries of the matrix map are linear. By contrast, the ideal of the 4-cycle is generated in degree 2, but its resolution is not linear:

0\longrightarrow R^{1}\xrightarrow{\left[\begin{smallmatrix}x_{2}x_{4}\\ -x_{1}x_{3}\end{smallmatrix}\right]}R^{2}\xrightarrow{\left[x_{1}x_{3},\;x_{2}x_{4}\right]}I_{\mathcal{S}}\longrightarrow 0,

giving a non-linear map.
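The defining property that consecutive maps in a resolution compose to zero can be spot-checked numerically; a sketch for the generators and syzygy matrix of the 2-linear example above:

```python
# Sketch: the row vector of generators times the syzygy matrix of the
# resolution vanishes identically; checked here at random numeric points.
import random

def compose(x1, x2, x3, x4, x5):
    gens = [x1 * x4, x1 * x5, x2 * x5]
    M = [[-x5, 0.0], [x4, -x2], [0.0, x1]]  # syzygy matrix
    return [sum(gens[i] * M[i][j] for i in range(3)) for j in range(2)]

random.seed(0)
for _ in range(5):
    vals = [random.uniform(-2.0, 2.0) for _ in range(5)]
    assert all(abs(v) < 1e-9 for v in compose(*vals))
print("generators . syzygy = 0")
```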

A special class of ideals with 2-linear resolutions are the Ferrer ideals. A Ferrer ideal I𝒮I_{\mathcal{S}} is one in which the degree-two generators can be placed in a table with an inverted staircase shape. Such staircases arose historically in the study of integer partitions. As an example take the Stanley-Reisner ideal

I𝒮=<x1x6,x1x7,x1x8,x2x6,x2x7,x3x6,x3x7,x4x6,x5x6>k[x1,,x9].I_{\mathcal{S}}=<x_{1}x_{6},x_{1}x_{7},x_{1}x_{8},x_{2}x_{6},x_{2}x_{7},x_{3}x_{6},x_{3}x_{7},x_{4}x_{6},x_{5}x_{6}>\subseteq k[x_{1},...,x_{9}].

The Ferrer table is:

67891x1x6x1x7x1x82x2x6x2x73x3x6x3x74x4x65x5x6\begin{array}[]{c|cccc}&6&7&8&9\\ \hline\cr 1&x_{1}x_{6}&x_{1}x_{7}&x_{1}x_{8}&\\ 2&x_{2}x_{6}&x_{2}x_{7}&&\\ 3&x_{3}x_{6}&x_{3}x_{7}&&\\ 4&x_{4}x_{6}&&&\\ 5&x_{5}x_{6}&&&\\ \end{array}

Considering the non-empty cells as given by edges this corresponds to a special type of bi-partite graph between nodes {1,2,3,4,5}\{1,2,3,4,5\} and {6,7,8,9}\{6,7,8,9\}. Corso & Nagel (2009) show that, among the class of bi-partite graphs, Ferrer ideals are indeed uniquely characterized as having a 2-linear minimal free resolution.

It is straightforward to show that the corresponding hierarchical model is decomposable by exhibiting the decomposition given by Lemma 2. First take two simplices based on the variables defining, respectively, the rows and columns. In the example these are J1=12345,J2=6789J_{1}=12345,J_{2}=6789. Then join all nodes corresponding to the complement of the Ferrer diagram to give:

67891x1x92x2x8x2x93x3x8x3x94x4x7x4x8x4x95x5x7x5x8x5x9\begin{array}[]{c|cccc}&6&7&8&9\\ \hline\cr 1&&&&x_{1}x_{9}\\ 2&&&x_{2}x_{8}&x_{2}x_{9}\\ 3&&&x_{3}x_{8}&x_{3}x_{9}\\ 4&&x_{4}x_{7}&x_{4}x_{8}&x_{4}x_{9}\\ 5&&x_{5}x_{7}&x_{5}x_{8}&x_{5}x_{9}\\ \end{array}

The maximal cliques are easily seen to be given by a simple rule on this complementary table. For each non-empty row take the variable which defines that row, together with every variable defining a non-empty column in that row and all the variables for the rows below that row. In this example we find, working down the rows, that the maximal cliques are:

123459,234589,34589,45789,5789.123459,234589,34589,45789,5789.

Note that we can apply Lemma 2 to this example by successively stripping off the variables in the order x1,x2,x3,x4,x5x_{1},x_{2},x_{3},x_{4},x_{5}. The separators are 23459,34589,4589,5789,78923459,34589,4589,5789,789. The rule provides a proof of the following.

Theorem 7.

Hierarchical models generated by Ferrer ideals are decomposable.
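The clique rule above can be sketched programmatically; the dict below encodes the non-empty columns of the complementary Ferrer table for the running example.

```python
# Sketch of the maximal-clique rule read off the complementary Ferrer table:
# each row variable, its non-empty columns, and all rows below it.
comp_cols = {1: [9], 2: [8, 9], 3: [8, 9], 4: [7, 8, 9], 5: [7, 8, 9]}

cliques = [sorted({i, *cols, *range(i + 1, 6)}) for i, cols in comp_cols.items()]
print(cliques)
# [[1, 2, 3, 4, 5, 9], [2, 3, 4, 5, 8, 9], [3, 4, 5, 8, 9],
#  [4, 5, 7, 8, 9], [5, 7, 8, 9]]
```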

As another illustration of the duality between monomial ideals and conditional independence structures, we next consider two-terminal networks. In Sáenz de Cabezón & Wynn (2010) the authors apply the theory and construction of minimal free resolutions to the theory of reliability. One sub-class of applications is to networks, in the classical sense of network reliability. Consider a connected graph G=(E,V)G=(E,V) with two identified nodes called input and output, respectively. A cut is a set of edges which, if removed from the graph, disconnects input and output. A path is a connected set of edges from input to output. A minimal cut is a cut for which no proper subset is a cut, and a minimal path is a path for which no proper subset is a path.

As a simple example consider the network depicted in Figure 3 with input =1=1 and output =4=4 and edges:

e1=12,e2=24,e3=23,e4=13,e5=34.e_{1}=1-2,\;e_{2}=2-4,\;e_{3}=2-3,\;e_{4}=1-3,\;e_{5}=3-4.

The minimal cuts are {e1,e4},{e2,e5},{e1,e3,e5},{e2,e3,e4}\{e_{1},e_{4}\},\;\{e_{2},e_{5}\},\;\{e_{1},e_{3},e_{5}\},\;\{e_{2},e_{3},e_{4}\}. If we associate a variable xix_{i} with each edge eie_{i} then the minimal cuts generate an ideal. In this example we write

I𝒮=<x1x4,x2x5,x1x3x5,x2x3x4>.I_{\mathcal{S}}=<x_{1}x_{4},x_{2}x_{5},x_{1}x_{3}x_{5},x_{2}x_{3}x_{4}>.

The maximal cliques of 𝒮\mathcal{S} for the corresponding model simplicial complex 𝒮\mathcal{S} are

{15,24,123,345}\{15,24,123,345\}
Figure 3: A two-terminal network.
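The minimal cuts listed above can be recovered by brute force, testing every edge subset for disconnection of the input from the output:

```python
# Brute-force sketch: minimal cuts of the example network (input 1, output 4).
from itertools import combinations

edges = {1: (1, 2), 2: (2, 4), 3: (2, 3), 4: (1, 3), 5: (3, 4)}

def connected(kept):
    # depth-first search from the input over the kept edges
    adj = {n: set() for n in (1, 2, 3, 4)}
    for e in kept:
        a, b = edges[e]
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {1}, [1]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return 4 in seen

cuts = [set(c) for r in range(1, 6) for c in combinations(edges, r)
        if not connected(set(edges) - set(c))]
minimal = [c for c in cuts if not any(d < c for d in cuts)]
print(sorted(sorted(c) for c in minimal))
# [[1, 3, 5], [1, 4], [2, 3, 4], [2, 5]]
```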

We could, on the other hand, define an ideal I𝒮I_{\mathcal{S}^{*}} from the paths on the network. In this case I𝒮I_{\mathcal{S}^{*}} is generated by the minimal paths, giving:

<x1x2,x4x5,x1x3x5,x2x3x4>,<x_{1}x_{2},x_{4}x_{5},x_{1}x_{3}x_{5},x_{2}x_{3}x_{4}>,

and 𝒮\mathcal{S}^{*} consists of the complements of the cuts and has maximal cliques {15,24,134,235}\{15,24,134,235\}.

There is a duality between cuts and path models for two-terminal networks:

Lemma 5.

The model simplicial complex 𝒮\mathcal{S} based on the cut ideal I𝒮I_{\mathcal{S}} of a two-terminal network is formed from the complements of all paths on the network. Conversely, the model simplicial complex 𝒮\mathcal{S}^{*} based on the path ideal I𝒮I_{\mathcal{S}^{*}} is formed from the complements of all cuts. Moreover: (𝒮)=𝒮(\mathcal{S}^{*})^{*}=\mathcal{S}.

For example, the term 1515 of 𝒮\mathcal{S} is the complement of the path 234234, and the term 1414 in 𝒮\mathcal{S}^{*} is the complement of the (non-minimal) cut 235235.

This duality is a special case of Alexander duality and we omit the proof; see Miller & Sturmfels (2005), Proposition 1.37. The general result says that for a square-free 𝒮\mathcal{S}, if we define 𝒮\mathcal{S}^{*} as the set of complements of all non-faces of 𝒮\mathcal{S}, then (𝒮)=𝒮(\mathcal{S}^{*})^{*}=\mathcal{S}.
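A small sketch of the Alexander dual for the network example, verifying that dualising twice is the identity and recovering the maximal cliques of the dual complex:

```python
# Sketch: Alexander dual as the set of complements of non-faces; dualising
# twice returns the original complex.
from itertools import combinations

N = frozenset(range(1, 6))
subsets = [frozenset(c) for r in range(len(N) + 1)
           for c in combinations(sorted(N), r)]

def close(faces):
    # close a family of faces under taking subsets
    return {s for s in subsets if any(s <= f for f in faces)}

def alexander_dual(S):
    return {N - K for K in subsets if K not in S}

S = close([frozenset(f) for f in [(1, 5), (2, 4), (1, 2, 3), (3, 4, 5)]])
Sdual = alexander_dual(S)
assert alexander_dual(Sdual) == S  # duality is an involution
maximal = [K for K in Sdual if not any(K < Kp for Kp in Sdual)]
print(sorted(sorted(K) for K in maximal))
# [[1, 3, 4], [1, 5], [2, 3, 5], [2, 4]]
```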

It will have been noticed that for this network SS and SS^{*} are self-dual in the sense that the two simplicial complexes have the same structure and only differ in the labelling of the vertexes. Both models have two separate conditional independence properties. Thus for 𝒮\mathcal{S} we have X1X4|(X2,X3,X5)X_{1}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X_{4}|(X_{2},X_{3},X_{5}) and X2X5|(X1,X3,X4)X_{2}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X_{5}|(X_{1},X_{3},X_{4}).

5.4 Geometric constructions

Simplicial complexes are at the heart of algebraic topology and it is natural to look in that field for classes of simplicial complexes whose abstract versions may be used to support hierarchical models. We mention briefly one class here, arising from the fast-growing area of persistent homology; see Edelsbrunner & Harer (2010). This class has already been used by Lunagómez et al. (2009) to construct graphical models using so-called Alpha complexes. We give the construction now. It is based on the cover provided by a union of balls in RdR^{d}, a construction used by Edelsbrunner (1995) in the context of computational geometry and in Naiman & Wynn (1992) and Naiman & Wynn (1997) to study Bonferroni bounds in statistics.

Thus, let z1,,zpz_{1},\ldots,z_{p} be pp points in RdR^{d} and define the solid balls with radius rr centered at the points:

Bi(r)={z:zzir},i=1,,p.B_{i}(r)=\{z\mathrel{\mathop{\ordinarycolon}}||z-z_{i}||\leq r\},\;i=1,\ldots,p.

The nerve of the cover represented by the union of balls is the simplicial complex 𝒮\mathcal{S} derived from the intersections of the balls, and is called the Alpha complex. It consists of exactly those index sets JJ for which iJBi(r)\cap_{i\in J}B_{i}(r)\neq\emptyset.

When the radius rr is small, 𝒮\mathcal{S} consists of unconnected vertexes and the hierarchical model gives independence of X1,,XpX_{1},...,X_{p}. As rr grows there is a value of rr at and beyond which i=1pBi(r)\cap_{i=1}^{p}B_{i}(r)\neq\emptyset and 𝒮\mathcal{S} consists of a single complete clique. In that case we have a full hierarchical model. Between these two cases, and depending on the positions of the ziz_{i} and the value of rr, we obtain a rich class of simplicial complexes and hence hierarchical models. Some of these will be decomposable and we refer to the discussion in Lunagómez et al. (2009).
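A sketch of this behavior with hypothetical points, using the Vietoris-Rips rule (only pairwise ball overlaps are tested) as a stand-in for the exact nerve:

```python
# Sketch (Vietoris-Rips approximation, not the exact Alpha complex): an edge
# appears when two balls of radius r overlap; small r gives a discrete
# complex, large r a single complete clique.
import math
from itertools import combinations

points = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0), 4: (3.0, 3.0)}

def edges(r):
    return [(i, j) for i, j in combinations(points, 2)
            if math.dist(points[i], points[j]) <= 2 * r]

print(edges(0.4))            # []: full independence model
print(edges(0.6))            # [(1, 2), (1, 3)]: X4 independent of the rest
print(len(edges(3.0)) == 6)  # True: one complete clique
```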

It is the study of the topology of the nerve as rr changes, and in particular the behavior of its Betti numbers, which drives the area of persistent homology. An important theoretical and computational result is that this topology is also that of the reduced simplicial complex based on the Delaunay complex associated with the Voronoi diagram of the points. That is to say, for fixed rr it is enough, from a topological (homotopy) viewpoint, to use the sub-complex 𝒮\mathcal{S}^{-} of the Delaunay complex contained in 𝒮\mathcal{S}. The theory derives from classical results of Borsuk (1948) and Leray (1945). One beautiful fact is that the Delaunay dual complex based on the furthest-point Voronoi diagram (Okabe et al., 2000) is obtained by the Alexander duality mentioned in the last subsection.

In this paper we have concentrated on the correspondence between 𝒮\mathcal{S} and its Stanley-Reisner ideal I𝒮I_{\mathcal{S}}. The use of I𝒮I_{\mathcal{S}} is not always explicit in persistent homology but is implicit in the underlying homology theory: see Sáenz de Cabezón (2008) for a thorough investigation, including algorithms. Also, although the topologies of 𝒮\mathcal{S} and its reduced Delaunay version 𝒮\mathcal{S}^{-} are the same, if their actual structures differ they lead to different hierarchical models. One can also use non-Euclidean metrics to define the cover and, indeed, work in different spaces and with other kinds of cover. Notwithstanding these many interesting technical issues, the use of geometric constructions to define interesting classes of hierarchical models promises to be very fruitful.

6 Conclusion

There are many features and properties of monomial ideals which remain to be exploited in statistics via the isomorphism discussed in the last section. We should mention minimal free resolutions, the closely related Hilbert series, Betti numbers (including graded and multi-graded versions) and Alexander duality. It is pleasing that in the general case the development of the last section only requires consideration of square-free ideals, whose theory is a little easier than the full polynomial case. Fast algorithms are available for symbolic operations covering all these areas so that as further links are made they can be implemented. We have not covered statistical analysis in this paper. Further work is in progress to fit and test the zero-cumulant conditions Dαg=0D^{\alpha}g=0 using, for example, kernel methods.

References

  • Arnold & Strauss (1988) Arnold, B. & Strauss, D. (1988). Bivariate distributions with exponential conditionals. Journal of the American Statistical Association 83, 522–527.
  • Bairamov et al. (2003) Bairamov, I., Kotz, S. & Kozubowski, T. (2003). A new measure of linear local dependence. Statistics 37, 243–258.
  • Barndorff-Nielsen & Cox (1989) Barndorff-Nielsen, O. & Cox, D. (1989). Asymptotic techniques for use in statistics. London: Chapman and Hall.
  • Borsuk (1948) Borsuk, K. (1948). On the imbedding of systems of compacta in simplicial complexes. Fund. Math 35, 217–234.
  • Corso & Nagel (2009) Corso, A. & Nagel, U. (2009). Monomial and toric ideals associated to Ferrers graphs. Transactions of The American Mathematical Society 361, 1371–1395.
  • Cozman & Walley (2005) Cozman, F. & Walley, P. (2005). Graphoid properties of epistemic irrelevance and independence. Annals of Mathematics and Artificial Intelligence 45, 173–195.
  • Darroch (1962) Darroch, J. (1962). Interactions in multi-factor contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 24, 251–263.
  • Dochtermann & Engström (2009) Dochtermann, A. & Engström, A. (2009). Algebraic properties of edge ideals via combinatorial topology. the electronic journal of combinatorics 16, R2.
  • Drton et al. (2009) Drton, M., Sturmfels, B. & Sullivant, S. (2009). Lectures on algebraic statistics. Birkhauser.
  • Edelsbrunner (1995) Edelsbrunner, H. (1995). The union of balls and its dual shape. Discrete and Computational Geometry 13, 415–440.
  • Edelsbrunner & Harer (2010) Edelsbrunner, H. & Harer, J. (2010). Computational topology: an introduction. American Mathematical Society.
  • Fröberg (1988) Fröberg, R. (1988). On Stanley-Reisner rings, in “Topics in Algebra”. Banach Center Public 26, 57–69.
  • Geiger et al. (2006) Geiger, D., Meek, C. & Sturmfels, B. (2006). On the toric algebra of graphical models. The Annals of Statistics 34, 1463–1492.
  • Hardy (2006) Hardy, M. (2006). Combinatorics of partial derivatives. The Electronic Journal of Combinatorics 13.
  • Holland & Wang (1987) Holland, P. & Wang, Y. (1987). Dependence function for continuous bivariate densities. Communications in Statistics-Theory and Methods 16, 863–876.
  • Jones (1996) Jones, M. (1996). The local dependence function. Biometrika 83, 899.
  • Lauritzen (1996) Lauritzen, S. (1996). Graphical models. Oxford University Press, USA.
  • Leray (1945) Leray, J. (1945). Sur la forme des espaces topologiques et sur les points fixes des représentations. J. Math. Pures Appl., IX. Sér. 24, 95–167.
  • Lunagómez et al. (2009) Lunagómez, S., Mukherjee, S. & Wolpert, R. (2009). Geometric representations of hypergraphs for prior specification and posterior sampling. Duke University Department of Statistical Science Discussion Paper .
  • McCullagh (1987) McCullagh, P. (1987). Tensor methods in statistics. Chapman and Hall London.
  • Miller & Sturmfels (2005) Miller, E. & Sturmfels, B. (2005). Combinatorial Commutative Algebra. Springer Verlag.
  • Mueller & Yan (2001) Mueller, H. & Yan, X. (2001). On local moments. Journal of Multivariate Analysis 76, 90–109.
  • Naiman & Wynn (1992) Naiman, D. & Wynn, H. (1992). Inclusion-exclusion-Bonferroni identities and inequalities for discrete tube-like problems via Euler characteristics. The Annals of Statistics 20, 43–76.
  • Naiman & Wynn (1997) Naiman, D. & Wynn, H. (1997). Abstract tubes, improved inclusion-exclusion identities and inequalities and importance sampling. The Annals of Statistics 25, 1954–1983.
  • Okabe et al. (2000) Okabe, A., Boots, B., Sugihara, K. & Chiu, S. (2000). Spatial tessellations: Concepts and applications of Voronoi diagrams (POD). New York: John Wiley & Sons.
  • Petrovic & Stokes (2010) Petrovic, S. & Stokes, E. (2010). Markov degrees of hierarchical models determined by Betti numbers of Stanley-Reisner ideals. ArXiv:0910.1610v2.
  • Sáenz de Cabezón (2008) Sáenz de Cabezón, E. (2008). Combinatorial Koszul Homology: Computations and Applications. Ph.D. thesis, Universidad de La Rioja.
  • Sáenz de Cabezón & Wynn (2010) Sáenz de Cabezón, E. & Wynn, H. (2010). Mincut ideals of two-terminal networks. Applicable Algebra in Engineering, Communication and Computing 21, 443–457.
  • Seidenberg (1956) Seidenberg, A. (1956). Some remarks on Hilbert’s Nullstellensatz. Archiv der Mathematik 7, 235–240.