
Testing for Directed Information Graphs

Sina Molavipour, Germán Bassi, and Mikael Skoglund. This work was supported in part by the Knut and Alice Wallenberg Foundation. The authors are with the ACCESS Linnaeus Centre, School of Electrical Engineering, KTH Royal Institute of Technology, Stockholm, Sweden (e-mails: sinmo@kth.se, germanb@kth.se, skoglund@kth.se).
Abstract

In this paper, we study a hypothesis test to determine the underlying directed graph structure of nodes in a network, where the nodes represent random processes and the direction of the links indicates a causal relationship between said processes. Specifically, a $k$-th order Markov structure is considered for the processes, and the chosen metric to determine a connection between nodes is the directed information. The hypothesis test is based on the empirically calculated transition probabilities, which are used to estimate the directed information. For a single edge, it is proven that the detection probability can be chosen arbitrarily close to one, while the false alarm probability remains negligible. When the test is performed on the whole graph, we derive bounds for the false alarm and detection probabilities, which show that the test is asymptotically optimal by properly setting the test threshold and using a large number of samples. Furthermore, we study how the convergence of the measures relies on the existence of links in the true graph.

I Introduction

Causality is a concept that expresses the joint behavior in time of a group of components in a system. In general, it denotes the effect of one component on itself and on the others in the system during a time period. Consider a network of nodes, each producing a signal in time. These processes can behave independently, or there might be an underlying connection, by nature, between them. Inferring this structure is of great interest in many applications. In [1], for instance, neurons are taken as components while the time series of produced spikes is used to derive the underlying structure. Dynamical models are also a well-known tool to understand functionals of expressed neurons [2]. Additionally, in social networks, there is an increasing interest in estimating influences among users [3], while further applications exist in biology [4], economics [5], and many other fields.

Granger [6] defined the notion of causality between two time series by using a linear autoregressive model and comparing the estimation errors for two scenarios: when the history of the second node is accounted for and when it is not. With this definition, however, models that operate non-linearly are poorly estimated. Directed information was first introduced to address the flow of information in a communication set-up, and it was suggested by Massey [7] as a measure of causality since it is not limited to linear models. There exist other methods which may qualify for different applications. Several of these definitions are compared in [1], where directed information is argued to be a robust measure of causality. There are also symmetric measures like correlation or mutual information, but they can only represent a mutual relationship between nodes and not a directed one.

The underlying causal structure of a network of processes can be properly visualized by a directed graph. In particular, in a Directed Information Graph (DIG), introduced simultaneously by Amblard and Michel [8] and Quinn et al. [1], the existence of an edge is determined by the value of the directed information between two nodes considering the history of the rest of the network. There are different approaches to tackle the problem of detecting and estimating such graphs. Directed information can be estimated based on prior assumptions on the processes' structure, such as Markov properties, and empirically calculating probabilities [1, 3]. On the other hand, Jiao et al. [5] propose a universal estimator of directed information which is not restricted to any Markov assumption. Nonetheless, at the core of their technique, they consider a context tree weighting algorithm with different depths, which intuitively resembles a learning algorithm for estimating the order of a Markov structure. Other assumptions used in the study of DIGs, which constrain the structure of the underlying graph, are tree structures [9] or a limit on the nodes' degree [3].

The estimation performance on the detection of edges in a DIG is crucial since it allows us to characterize, for instance, the optimum test for detection, or the minimum number of samples needed to reliably obtain the underlying model, i.e., the sample complexity. In [3], the authors derive a bound on the sample complexity using total variation when the directed information between two nodes is empirically estimated. Following that work, Kontoyiannis et al. [10] investigate the performance of a test for causality between two nodes, and they show that the convergence rate of the empirical directed information can be improved if it is calculated conditioned on the true relationship between the nodes. In other words, the underlying structure of the true model has an effect on the detection performance of the whole graph. Motivated by this result, in this paper, we study a hypothesis test over a complete graph (not just a link between two nodes) when the directed information is empirically estimated, and we provide interesting insights into the problem. Moreover, we show that for every existing edge in the true graph, the estimation converges as $\mathcal{O}(1/\sqrt{n})$, while if there is no edge in the true model, the convergence is of order $\mathcal{O}(1/n)$.

The rest of the paper is organized as follows. In Section II, notations and definitions are introduced. In particular, the directed information is reviewed and the definition of an edge in a DIG is presented. The main results of our work are then shown in Section III. First, the performance of a hypothesis test for a single edge is studied, where we analyze the asymptotic behavior of estimators based on the knowledge about the true edges. Then, we demonstrate how the detection of the whole graph relies on the test for each edge. Finally, in the last section, the paper is concluded.

II Preliminaries

Assume a network with $m$ nodes representing the processes $\{\mathbf{X}_1, \dots, \mathbf{X}_m\}$. The observation of the $l$-th process in the discrete time interval $t_1$ to $t_2$ is described by the random variable $X_{l,t_1}^{t_2}$, which at each time takes values on the discrete alphabet $\mathcal{X}$. With a slight abuse of notation, $Y_i$ and $Y_1^n$ represent the observations of the process $\mathbf{Y}$ at instant $i$ and in the interval $1$ to $n$, respectively.

The metric used to describe the causal relationship of these processes is the directed information, as suggested previously, since it can describe more general structures without further assumptions (such as linearity). Directed information is mainly used in information theory to characterize channels with feedback, and it is defined based on causally conditioned probabilities.

Definition 1.

The probability distribution of $Y_1^n$ causally conditioned on $X_1^n$ is defined as

\[
P_{Y_1^n \,\|\, X_1^n} = \prod_{i=1}^{n} P_{Y_i | X_1^i, Y_1^{i-1}}.
\]

The entropy rate of the process $\mathbf{Y}$ causally conditioned on $\mathbf{X}$ is then defined as:

\[
H(\mathbf{Y} \,\|\, \mathbf{X}) \triangleq \lim_{n\to\infty} \frac{1}{n} H(Y_1^n \,\|\, X_1^n) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i | Y_1^{i-1}, X_1^i).
\]

Consequently, the directed information rate of $\mathbf{X}$ to $\mathbf{Y}$ causally conditioned on $\mathbf{Z}$ is expressed as:

\[
I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq H(\mathbf{Y} \,\|\, \mathbf{Z}) - H(\mathbf{Y} \,\|\, \mathbf{X}, \mathbf{Z}) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} I(Y_i \, ; \, X_1^i | Y_1^{i-1}, Z_1^i). \qquad (1)
\]

Pairwise directed information does not unequivocally determine the one-step causal influence among nodes in a network. Instead, the history of the other remaining nodes should also be considered. As introduced in [1, 8], an edge from $\mathbf{X}_i$ to $\mathbf{X}_j$ exists in a directed information graph if and only if

\[
I(\mathbf{X}_i \to \mathbf{X}_j \,\|\, \mathbf{X}_{[m]\setminus\{i,j\}}) > 0, \qquad (2)
\]

where $[m] \triangleq \{1, 2, \ldots, m\}$. Having observed the output of every process, the edges of the graph can be estimated, which results in a weighted directed graph. However, when only the existence of a directed edge is investigated, the performance of the detection can be improved. This is presented in Section III.

There exist several methods to estimate information-theoretic quantities, most of which intrinsically rely on counting possible events to estimate distributions. One such method is the empirical distribution, which we define as follows.

Definition 2.

Let $x_{[m]}^n = (x_{1,1}^n, x_{2,1}^n, \dots, x_{m,1}^n)$ be a realization of the random variables $X_{[m]}^n = (X_{1,1}^n, X_{2,1}^n, \dots, X_{m,1}^n)$. The joint empirical distribution of $k' \triangleq k+1$ consecutive time instances of all nodes is then defined as:

\[
\hat{P}_{X_{[m]}^{k'}}^{(n)}(\mathbb{a}_{[m]}^{k'}) = \frac{1}{n-k} \sum_{t=1}^{n-k} \prod_{i=1}^{m} \mathds{1}\big[x_{i,t}^{t+k} = \mathbb{a}_{i,1}^{k'}\big], \qquad \forall \mathbb{a}_{i,1}^{k'} \in \mathcal{X}^{k'}. \qquad (3)
\]

The joint distribution of any subset of nodes is then a marginal distribution of (3).

By plugging in the empirical distribution, we can derive estimators for information-theoretic quantities such as the entropy $H$, where we use the notation $\hat{H}$ to distinguish the empirical estimator, i.e.,

\[
\hat{H}(X^{k'}) = -\sum_{\mathbb{a}^{k'} \in \mathcal{X}^{k'}} \hat{P}_{X^{k'}}^{(n)}(\mathbb{a}^{k'}) \log\Big(\hat{P}_{X^{k'}}^{(n)}(\mathbb{a}^{k'})\Big). \qquad (4)
\]
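For concreteness, the following minimal Python sketch (not part of the paper; the function names and the toy data are illustrative assumptions) computes the joint empirical distribution of Definition 2 by counting sliding windows of length $k+1$ and plugs it into the entropy estimator (4).

```python
# Sketch of Definition 2 and the plug-in entropy (4): the joint empirical
# distribution of k' = k+1 consecutive samples of all m nodes, estimated by
# counting sliding windows, and the entropy of that empirical law.
from collections import Counter
import numpy as np

def empirical_distribution(samples, k):
    """samples: (m, n) array over a finite alphabet; returns dict pattern -> probability,
    where a pattern is the tuple of all m nodes' values over k+1 consecutive instants."""
    m, n = samples.shape
    counts = Counter(
        tuple(samples[:, t:t + k + 1].flatten()) for t in range(n - k)
    )
    return {pattern: c / (n - k) for pattern, c in counts.items()}

def plug_in_entropy(p_hat):
    """Entropy (in nats) of an empirical distribution given as dict -> probability."""
    probs = np.array(list(p_hat.values()))
    return -np.sum(probs * np.log(probs))

# Toy usage: m = 3 binary processes, order k = 1.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(3, 1000))
p_hat = empirical_distribution(x, k=1)
print(plug_in_entropy(p_hat))  # close to 6*log(2) for i.i.d. fair bits
```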

A causal influence in the network implies that the past of a group of nodes affects the future of some other group or of themselves. This motivates us to focus on a network of joint Markov processes in this paper, since it characterizes a state-dependent operation for the nodes, although we may impose further assumptions to make the calculations more tractable. For simplicity, we assume a three-node network (i.e., $m=3$) with processes $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ in the rest of the work, since the extension of the results to $m>3$ is straightforward.

Assumption 1.

$\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$ is a jointly stationary Markov process of order $k$.

Let us define the $|\mathcal{X}|^{3k} \times |\mathcal{X}|^{3}$ transition probability matrix $Q$ with elements

\[
Q(X_{k+1}, Y_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k).
\]

Then, the next assumption prevents further complexities in the steps of the proof of our main result.

Assumption 2.

All transition probabilities are positive, i.e., $Q > \mathbf{0}$.

This condition ensures ergodicity of the joint Markov process and results in the joint empirical distribution asymptotically converging to the stationary distribution (the stationary distribution is denoted either as $P_{\bar{X}_1^{k+1}, \bar{Y}_1^{k+1}, \bar{Z}_1^{k+1}}$ or as $\bar{P}_{X_1^{k+1}, Y_1^{k+1}, Z_1^{k+1}}$ in the sequel), i.e.,

\[
\hat{P}_{X_1^{k+1}, Y_1^{k+1}, Z_1^{k+1}}^{(n)} \to P_{\bar{X}_1^{k+1}, \bar{Y}_1^{k+1}, \bar{Z}_1^{k+1}}.
\]

In general, the directed information rate $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ cannot be expressed with the stationary random variables $\bar{X}_1^{k+1}$, $\bar{Y}_1^{k+1}$, and $\bar{Z}_1^{k+1}$ alone, since it depends on the infinite past of the processes and a perfect estimation would thus require an unlimited number of samples. To see this,

\[
I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = I(\bar{Y}_{k+1}; \bar{X}_1^{k+1} | \bar{Y}_1^k, \bar{Z}_1^{k+1}) - I(\bar{Y}_{k+1}; \bar{Y}_{-\infty}^0, \bar{Z}_{-\infty}^0 | \bar{Y}_1^k, \bar{Z}_1^{k+1}) \leq I(\bar{Y}_{k+1}; \bar{X}_1^{k+1} | \bar{Y}_1^k, \bar{Z}_1^{k+1}), \qquad (5)
\]

where we use the Markov property in the first equality, and the inequality holds since the mutual information is non-negative. Thus, with a limited sampling interval, only an upper bound can be derived. The next assumption makes (5) hold with equality.

Assumption 3.

For the processes $\mathbf{Y}$ and $\mathbf{Z}$, the Markov chain

\[
\bar{Y}_{k+1} \,-\, (\bar{Y}_1^k, \bar{Z}_1^{k+1}) \,-\, (\bar{Y}_{-\infty}^0, \bar{Z}_{-\infty}^0)
\]

holds.

Note that the above assumption should hold for every other pair of processes if we are interested in studying the whole graph and not only a single edge.

III Hypothesis test for Directed Information Graphs

Consider a graph $\mathcal{G}$, where the edge from node $i$ to node $j$ is denoted by $v_{ij}$; we say that $v_{ij} = 1$ if node $i$ causally influences node $j$, otherwise, $v_{ij} = 0$. A hypothesis test to identify the graph is performed on the adjacency matrix $V$, whose elements are the $v_{ij}$'s, and the performance of the test is studied through its false alarm and detection probabilities

\[
P_F = P(\hat{V} = V^* | V \neq V^*), \qquad (6)
\]
\[
P_D = P(\hat{V} = V^* | V = V^*), \qquad (7)
\]

where $\hat{V}$ is the estimation of $V$ (properly defined later), and $V^*$ is the hypothesis model to test. In Theorem 1 below, both an upper bound on $P_F$ and a lower bound on $P_D$ are derived.

Theorem 1.

For a directed information graph with adjacency matrix $V$ of size $m \times m$, if Assumptions 1–3 hold, the performance of the test for the hypothesis $V^*$ is bounded as:

\[
P_F \leq 1 - P_G\!\left(\frac{r}{2}, I_{th}\right), \qquad (8)
\]
\[
P_D \geq \max\!\left\{1 - N_0\!\left[1 - P_G\!\left(\frac{r}{2}, I_{th}\right)\right], 0\right\}, \qquad (9)
\]

using the plug-in estimator with $n$ samples as $n \to \infty$. The function $P_G$ is the regularized gamma function, $N_0 = m(m-1) - N_1$ with $N_1$ denoting the number of directed edges in the hypothesis graph, and $r = |\mathcal{X}|^{mk}(|\mathcal{X}|^m - 1)$. Finally, $I_{th}$ is the threshold value used to decide the existence of an edge, and its order is $\mathcal{O}(1)$.

The proof of Theorem 1 consists of two steps. First, the asymptotic behavior of the test for a single edge is derived in Section III-A. Afterwards, the hypothesis test over the whole graph is studied based on the tests for each single edge. As shown later by Remark 1 and Corollary 1, the performance of testing the graph remains as good as that of a test for causality of a single edge.

Remark 1.

Note that by increasing $I_{th}$ while remaining of order $\mathcal{O}(1)$, $P_G\!\left(\frac{r}{2}, I_{th}\right)$ gets arbitrarily close to one, which results in the probability of detection converging to one as the probability of false alarm tends to zero.
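As a quick numerical illustration of Remark 1 (a sketch based on SciPy, which is an assumed tool and not mentioned in the paper), the regularized gamma function $P_G(r/2, I_{th})$ can be evaluated directly and seen to approach one as $I_{th}$ grows; the parameters below match the setting of Fig. 1.

```python
# Sketch: evaluate P_G(r/2, I_th) from Remark 1 with SciPy's regularized
# lower incomplete gamma function; the parameter values are assumptions
# chosen to match the numerical example of Section III-C.
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

r = 2**(5 * 2) * (2**5 - 1)          # r = |X|^{mk} (|X|^m - 1) for |X| = 2, m = 5, k = 2
for I_th in (0.5 * r, r, 2 * r, 4 * r):
    print(I_th, gammainc(r / 2, I_th))  # tends to 1 as I_th grows, so 1 - P_G -> 0
```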

III-A Asymptotic Behavior of a Single Edge Estimation

In general, every possible transition probability matrix $Q$ can be parametrized by $\theta \in \Theta$, where $\Theta \subset \mathbb{R}^r$ (see Table I). The vector $\theta$ is formed by concatenating the elements of $Q$ in a row-wise manner, excluding the last (linearly dependent) column. However, if the transition probability can be factorized due to a Markov property among its variables, the matrix may be addressed with a lower-dimensional parameter.

To see this, let us concentrate on our 3-node network $\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$. If $v_{xy} = 0$, or equivalently $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$, then by Assumption 3, the transition probability can be factorized as follows,

\[
Q_{\phi_{xy}}(X_{k+1}, Y_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k) = Q_{\gamma_{xy}}(X_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k)\, Q_{\gamma'_{xy}}(Y_{k+1} | Y_1^k, Z_1^{k+1}). \qquad (10)
\]

Here the transition matrix is parametrized by $\phi_{xy} \in \Phi_{xy}$, where $\phi_{xy}$ has two components, $\gamma_{xy} \in \Gamma$ and $\gamma'_{xy} \in \Gamma'$, and $\Phi_{xy} = \Gamma \times \Gamma'$. The dimensions of the sets are shown in Table I; note that $r > d + d'$. The vectors $\gamma_{xy}$ and $\gamma'_{xy}$ are also formed by concatenating the elements of their respective matrices, as in the case of $\theta$. More details are found in the proof of Theorem 2 in Appendix A.

TABLE I: Dimensions of Index Sets for $m = 3$.
$\Theta$: dimension $r = |\mathcal{X}|^{3k}(|\mathcal{X}|^{3} - 1)$
$\Gamma$: dimension $d = |\mathcal{X}|^{3k}(|\mathcal{X}|^{2} - 1)$
$\Gamma'$: dimension $d' = |\mathcal{X}|^{2k+1}(|\mathcal{X}| - 1)$

Now consider the Neyman-Pearson criterion to test the hypothesis $\Phi_{xy}$ within $\Theta$.

Definition 3.

The log-likelihood is defined as

\[
L_n^{\theta}(X_1^n, Y_1^n, Z_1^n) = \log\Big(Q_{\theta}(X_{k+1}^n, Y_{k+1}^n, Z_{k+1}^n | X_1^k, Y_1^k, Z_1^k)\Big) = \log\Bigg(\prod_{i=k+1}^{n} Q_{\theta}(X_i, Y_i, Z_i | X_{i-k}^{i-1}, Y_{i-k}^{i-1}, Z_{i-k}^{i-1})\Bigg).
\]

Let $\theta^\star$ and $\phi^\star_{xy}$ be the most likely choices of transition matrix under the general hypothesis and under the null hypothesis $v_{xy} = 0$, respectively, i.e.,

\[
\theta^\star = \operatorname*{arg\,max}_{\Theta} L_n^{\theta}(X_1^n, Y_1^n, Z_1^n), \qquad \phi^\star_{xy} = \operatorname*{arg\,max}_{\Phi_{xy}} L_n^{\phi_{xy}}(X_1^n, Y_1^n, Z_1^n). \qquad (11)
\]

As a result, the test for causality boils down to checking the difference between the likelihoods, i.e., the log-likelihood ratio:

\[
\Lambda_{xy,n} = L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) - L_n^{\phi^\star_{xy}}(X_1^n, Y_1^n, Z_1^n), \qquad (12)
\]

which is the Neyman-Pearson criterion for testing $\Phi_{xy}$ within $\Theta$. Then, in the following theorem, $\Lambda_{xy,n}$ is shown to converge to a $\chi^2$ distribution of finite degree. The proof follows from standard results in [11, Th. 6.1].

Theorem 2.

Consider a network with three nodes $\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$ and arbitrarily choose two nodes $\mathbf{X}$ and $\mathbf{Y}$. Suppose Assumptions 1–3 hold; then

\[
2\Lambda_{xy,n} \stackrel{\mathcal{L}}{\to} \chi^2_{r-d-d'},
\]

if $v_{xy} = 0$, as $n \to \infty$.

Proof:

The conditions of the theorem imply that the true underlying structure for the transition matrix belongs to $\Phi_{xy}$, which is required as in [11, Th. 6.1]. The rest of the proof follows similar steps as in [10]. See Appendix A for further details. ∎

Remark 2.

Note that the asymptotic result of Theorem 2 depends only on the dimensions of the sets and not on the particular pair of nodes involved. Furthermore, the result also holds for a network with more than three nodes by properly defining the dimensions of the sets.

Remark 3.

Knowledge about the absence of edges other than $v_{xy}$ in the network results in $\Lambda_{xy,n}$ converging to a $\chi^2$ distribution of higher degree, since (10) could be further factorized. To see this, assume $v_{xy} = 0$ and consider that some knowledge $S$ about the links (for example, the whole adjacency matrix $V$) is given. Then, let the transition probability be factorized as much as possible, so that it can be parametrized by $\Phi'_{xy}$, which has lower or equal dimension than $\Phi_{xy}$. Take

\[
\Lambda'_{xy,n} = L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) - L_n^{\phi'^\star_{xy}}(X_1^n, Y_1^n, Z_1^n),
\]

where

\[
\phi'^\star_{xy} = \operatorname*{arg\,max}_{\Phi'_{xy}} L_n^{\phi'_{xy}}(X_1^n, Y_1^n, Z_1^n).
\]

Intuitively, by following similar steps as in the proof of Theorem 2, we obtain that $\Lambda'_{xy,n}$ behaves as a $\chi^2_q$ random variable, where $r > q > r - d - d'$. Since the cumulative distribution function of the $\chi^2_q$ is a decreasing function with respect to the degree $q$, then

\[
P_G\!\left(\frac{r}{2}, a\right) \leq P_G\!\left(\frac{q}{2}, a\right) = P(\Lambda'_{xy,n} < a \mid S, v_{xy} = 0) \leq P(\Lambda_{xy,n} < a \mid v_{xy} = 0) = P_G\!\left(\frac{r-d-d'}{2}, a\right), \qquad (13)
\]

for sufficiently large $n$ and any $a > 0$. The lower bound in (13) allows us to ignore the knowledge about other nodes.

Consider now the estimator of the directed information obtained by plugging the empirical distribution (instead of the true distribution) into $I(Y_{k+1}; X_1^{k+1} | Y_1^k, Z_1^{k+1})$, i.e.,

\[
\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq I(\hat{Y}_{k+1}; \hat{X}_1^{k+1} | \hat{Y}_1^k, \hat{Z}_1^{k+1}).
\]

Then, the following lemma states that $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ is proportional to $\Lambda_{xy,n}$ with an $\mathcal{O}(n)$ factor.

Lemma 1.

$\Lambda_{xy,n} = (n-k)\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$, which is the plug-in estimator of the directed information.

Proof:

The proof follows from standard definitions and from noting that the KL-divergence is non-negative and attains its minimum at zero. See Appendix B for the complete proof. ∎
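A minimal sketch of the plug-in estimator in Lemma 1 is given below (an illustrative assumption, not the authors' code); it evaluates $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ as the combination of empirical entropies that results from (33) and (34) in Appendix B.

```python
# Sketch of the plug-in estimator I_hat_n^(k)(X -> Y || Z) written as a
# combination of empirical entropies of windows of length k+1 (and k for the
# conditioning past of Y); cf. Lemma 1 and Appendix B.
from collections import Counter
import numpy as np

def emp_entropy(windows):
    """Empirical entropy (nats) of a list of tuples."""
    counts = np.array(list(Counter(windows).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def plug_in_di(x, y, z, k):
    """Plug-in directed information estimate of X -> Y causally conditioned on Z."""
    n = len(x)
    full, xyz_past_y, yz, past_yz = [], [], [], []
    for t in range(n - k):
        xw, yw, zw = tuple(x[t:t+k+1]), tuple(y[t:t+k+1]), tuple(z[t:t+k+1])
        full.append(xw + yw + zw)             # (X_1^{k+1}, Y_1^{k+1}, Z_1^{k+1})
        xyz_past_y.append(xw + yw[:-1] + zw)  # (X_1^{k+1}, Y_1^{k},   Z_1^{k+1})
        yz.append(yw + zw)                    # (Y_1^{k+1}, Z_1^{k+1})
        past_yz.append(yw[:-1] + zw)          # (Y_1^{k},   Z_1^{k+1})
    return (emp_entropy(xyz_past_y) + emp_entropy(yz)
            - emp_entropy(past_yz) - emp_entropy(full))

# Usage: Lambda_{xy,n} = (n - k) * plug_in_di(x, y, z, k), cf. Lemma 1.
```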

Now, let us define the decision rule for checking the existence of an edge in the graph as follows:

\[
\hat{v}_{ij} \triangleq \begin{cases} 1 & \text{if } (n-k)\, \hat{I}_n^{(k)}(\mathbf{X}_i \to \mathbf{X}_j \,\|\, \mathbf{X}_{[m]\setminus\{i,j\}}) \geq I_{th} \\ 0 & \text{otherwise,} \end{cases}
\]

where $I_{th}$ is of order $\mathcal{O}(1)$. Then, for any knowledge $S$ about the state of the edges in the true graph, as long as $v_{xy} = 0$, we have:

\[
P(\hat{v}_{xy} = 1 \mid S, v_{xy} = 0) = P\big((n-k)\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > I_{th} \,\big|\, S, v_{xy} = 0\big) \leq 1 - P_G\!\left(\frac{r}{2}, I_{th}\right), \qquad (14)
\]

where the inequality is due to Remark 3.
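The decision rule above, together with one possible way of choosing $I_{th}$, can be sketched as follows; selecting $I_{th}$ by inverting the bound (14) for a target false-alarm level $\alpha$ is our own illustrative choice and is not prescribed by the paper.

```python
# Sketch of the edge decision rule, with an assumed (not prescribed) way to pick
# I_th: make the bound (14) equal a target false-alarm level alpha, i.e.,
# 1 - P_G(r/2, I_th) = alpha.
from scipy.special import gammainccinv  # inverse of the upper regularized gamma

def edge_test(di_hat, n, k, I_th):
    """Return v_hat in {0, 1} for the plug-in estimate di_hat of one edge."""
    return 1 if (n - k) * di_hat >= I_th else 0

def threshold_for_alpha(r, alpha):
    """Smallest I_th such that 1 - P_G(r/2, I_th) <= alpha."""
    return gammainccinv(r / 2, alpha)

I_th = threshold_for_alpha(r=2**6 * (2**3 - 1), alpha=1e-3)  # e.g. |X| = 2, m = 3, k = 2
```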

From Theorem 2 and Lemma 1, it is inferred that when $v_{xy} = 0$ in the true adjacency matrix, the empirical estimate of the directed information converges to zero, following a $\chi^2$ distribution, at a rate $\mathcal{O}(1/n)$. The asymptotic behavior of $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ is different if the edge is present, i.e., $v_{xy} = 1$, which is addressed in the following theorem.

Theorem 3.

Consider a network with three nodes $\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$ and arbitrarily choose two nodes $\mathbf{X}$ and $\mathbf{Y}$. Suppose Assumptions 1–3 hold and let $\bar{I}_{xy} \triangleq I(\bar{Y}_{k+1}; \bar{X}_1^{k+1} | \bar{Y}_1^k, \bar{Z}_1^{k+1})$; then,

\[
\sqrt{n-k}\left[\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) - \bar{I}_{xy}\right] \to \mathcal{N}(0, \sigma^2), \qquad (15)
\]

with a finite $\sigma^2$ as $n \to \infty$, if $v_{xy} = 1$.

Proof:

The empirical distribution can be decomposed into two parts, where the first one vanishes at a rate faster than $\mathcal{O}(1/\sqrt{n})$ and the second part converges at a rate $\mathcal{O}(1/\sqrt{n})$. The condition $v_{xy} = 1$ keeps the second part positive, so it determines the asymptotic convergence of $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$. Refer to Appendix C for further details. ∎

Remark 4.

Knowledge about the state of other edges in the true graph model does not affect the asymptotic behavior presented in Theorem 3, given that the condition $v_{xy} = 1$ makes the convergence of the estimator independent of all other nodes. This can be seen by following the steps of the proof, where we only use the fact that if the true edge exists then $\bar{I}_{xy} > 0$ and the factorization (10) does not hold.

We can use Remark 4 to conclude that:

\[
P(\hat{v}_{xy} = 0 \mid S, v_{xy} = 1) = P\big((n-k)\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) < I_{th} \,\big|\, S, v_{xy} = 1\big) = P\big((n-k)\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) < I_{th} \,\big|\, v_{xy} = 1\big) = 1 - Q\!\left(\frac{I_{th} - (n-k)\bar{I}_{xy}}{\sqrt{n-k}\,\sigma}\right), \qquad (16)
\]

for sufficiently large $n$, where $Q(\cdot)$ is the Q-function, and where the last equality is due to Theorem 3. Note that if $v_{xy} = 1$ then $\bar{I}_{xy} > 0$.
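The contrast between the two regimes, $(n-k)\hat{I}_n^{(k)}$ remaining $\mathcal{O}(1)$ when $v_{xy} = 0$ versus growing linearly in $n$ when $v_{xy} = 1$, can be observed in a toy simulation such as the one below; the binary order-1 model and all numerical choices are assumptions made purely for illustration.

```python
# Self-contained Monte Carlo sketch (all modeling choices are assumptions): when
# v_xy = 0, (n-k) * I_hat stays of order O(1); when v_xy = 1, I_hat concentrates
# around a positive constant, so (n-k) * I_hat grows linearly in n.
from collections import Counter
import numpy as np

def emp_entropy(windows):
    c = np.array(list(Counter(windows).values()), dtype=float)
    p = c / c.sum()
    return -np.sum(p * np.log(p))

def plug_in_di(x, y, z, k):
    full, xpyz, yz, pyz = [], [], [], []
    for t in range(len(x) - k):
        xw, yw, zw = tuple(x[t:t+k+1]), tuple(y[t:t+k+1]), tuple(z[t:t+k+1])
        full.append(xw + yw + zw); xpyz.append(xw + yw[:-1] + zw)
        yz.append(yw + zw); pyz.append(yw[:-1] + zw)
    return emp_entropy(xpyz) + emp_entropy(yz) - emp_entropy(pyz) - emp_entropy(full)

def simulate(n, edge, rng, flip=0.2):
    """k = 1 binary toy model: X, Z i.i.d.; Y_{t+1} = X_t xor noise if edge, else i.i.d."""
    x = rng.integers(0, 2, n); z = rng.integers(0, 2, n); y = rng.integers(0, 2, n)
    if edge:
        noise = (rng.random(n - 1) < flip).astype(int)
        y[1:] = x[:-1] ^ noise
    return x, y, z

rng = np.random.default_rng(1)
k = 1
for n in (1000, 4000, 16000):
    di0 = plug_in_di(*simulate(n, edge=False, rng=rng), k)
    di1 = plug_in_di(*simulate(n, edge=True, rng=rng), k)
    print(n, (n - k) * di0, (n - k) * di1)  # first column stays O(1), second grows ~ n
```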

III-B Hypothesis Test over an Entire Graph

The performance of testing a hypothesis $V^*$ for a graph is studied by means of the false alarm and detection probabilities defined in (6) and (7), respectively. The results may be considered as an extension of the hypothesis test over a single edge in the graph.

First, let the false alarm probability be upper-bounded as

\[
P_F = P(\hat{V} = V^* \mid V \neq V^*) \leq \min_{i,j} P(\hat{v}_{ij} = v^*_{ij} \mid V \neq V^*).
\]

If $V \neq V^*$, there exist nodes $\tau$ and $\rho$ such that $v_{\tau\rho} \neq v^*_{\tau\rho}$. Hence,

\[
\begin{aligned}
P_F &\leq \min_{i,j} P(\hat{v}_{ij} = v^*_{ij} \mid V \neq V^*) && (17) \\
&\leq P(\hat{v}_{\tau\rho} = v^*_{\tau\rho} \mid V \neq V^*) && (18) \\
&= P(\hat{v}_{\tau\rho} = v^*_{\tau\rho} \mid V \neq V^*, v_{\tau\rho} \neq v^*_{\tau\rho}) \\
&= \begin{cases} P(\hat{v}_{\tau\rho} = 0 \mid V \neq V^*, v_{\tau\rho} = 1) \\ P(\hat{v}_{\tau\rho} = 1 \mid V \neq V^*, v_{\tau\rho} = 0) \end{cases} \\
&\leq \begin{cases} 1 - Q\!\left(\dfrac{I_{th} - (n-k)\bar{I}_{\tau\rho}}{\sqrt{n-k}\,\sigma}\right) & \text{if } v_{\tau\rho} = 1 \\ 1 - P_G\!\left(\dfrac{r}{2}, I_{th}\right) & \text{if } v_{\tau\rho} = 0 \end{cases} && (19)
\end{aligned}
\]

where the last inequality is due to (14) and (16).

On the other hand, the complement of the detection probability may be upper-bounded using the union bound:

\[
\begin{aligned}
1 - P_D &= P(\hat{V} \neq V^* \mid V = V^*) \leq \sum_{i,j} P(\hat{v}_{ij} \neq v^*_{ij} \mid V = V^*) \\
&= \sum_{\substack{i,j \\ v_{ij}=1}} P(\hat{v}_{ij} \neq v^*_{ij} \mid V = V^*) + \sum_{\substack{i,j \\ v_{ij}=0}} P(\hat{v}_{ij} \neq v^*_{ij} \mid V = V^*) \\
&= \sum_{\substack{i,j \\ v_{ij}=1}} P(\hat{v}_{ij} = 0 \mid V = V^*, v_{ij} = 1) + \sum_{\substack{i,j \\ v_{ij}=0}} P(\hat{v}_{ij} = 1 \mid V = V^*, v_{ij} = 0) \\
&\leq N_1 \left(1 - Q\!\left(\frac{I_{th} - (n-k)\bar{I}}{\sqrt{n-k}\,\sigma}\right)\right) + N_0 \left(1 - P_G\!\left(\frac{r}{2}, I_{th}\right)\right), \qquad (20)
\end{aligned}
\]

where $N_0$ and $N_1$ are the number of off-diagonal 0s and 1s in the true matrix $V$, i.e., $N_0 + N_1 = m(m-1)$, and $\bar{I} \triangleq \min_{i,j \,:\, v_{ij}=1} \bar{I}_{ij}$. The last inequality holds due to (14) and (16).

Since $\bar{I}_{ij} = I(\bar{X}_{j,k+1}; \bar{X}_{i,1}^{k+1} | \bar{X}_{j,1}^k, \bar{X}_{[m]\setminus\{i,j\},1}^{k+1}) > 0$ and it is of order $\mathcal{O}(1)$, then as $n \to \infty$ and noting that

\[
\lim_{a\to\infty} 1 - Q(-a) = 0,
\]

we have that,

\[
P_F \leq 1 - P_G\!\left(\frac{r}{2}, I_{th}\right), \qquad (21)
\]
\[
1 - P_D \leq N_0 \left[1 - P_G\!\left(\frac{r}{2}, I_{th}\right)\right]. \qquad (22)
\]

This concludes the proof of Theorem 1.

Corollary 1.

In the special case where the hypothesis test is performed on a single edge, (17) and (18) become equalities for the false alarm probability, and we have

\[
P'_F \triangleq P(\hat{v}_{xy} = 1 \mid v_{xy} = 0) = 1 - P_G\!\left(\frac{r-d-d'}{2}, I_{th}\right),
\]

and for the detection probability,

\[
P'_D \triangleq P(\hat{v}_{xy} = 1 \mid v_{xy} = 1) = 1,
\]

as $n \to \infty$, which is along the same lines as the argument in [10, Sec. III-C] for $m = 2$.

Figure 1: Lower bound on the detection probability $P_D$ and upper bound on $P_F$, derived by varying the threshold $I_{th}$ of the test, with $k = 2$, $m = 5$, and a binary alphabet. Since $P_F \geq 0$ and $P_D \leq 1$, by increasing $I_{th}$ we get $P_F \to 0$ and $P_D \to 1$.

III-C Numerical Results

The bounds derived in Theorem 1 state that the detection probability can be made arbitrarily close to one while the false alarm probability remains near zero with a proper test threshold. In Fig. 1, these bounds are depicted with respect to different values of $I_{th}$ for a network with $m = 5$ nodes. The joint process is assumed to be a Markov process of order $k = 2$, and the random variables take values on a binary alphabet ($|\mathcal{X}| = 2$).

It can be observed in the figure that, for fixed $I_{th}$, $P_D$ improves as $N_0$ decreases, i.e., when the graph becomes denser. Furthermore, by a proper choice of $I_{th}$, we can reach the optimal performance of the hypothesis test, i.e., $P_D = 1$ and $P_F = 0$.
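Curves of the type shown in Fig. 1 can be approximated with a short script such as the following sketch (the exact plotting range and the values of $N_0$ are assumptions); it evaluates the bounds (8) and (9) with SciPy's regularized gamma function for $k = 2$, $m = 5$, and a binary alphabet.

```python
# Sketch of the bound curves in Theorem 1: upper bound (8) on P_F and lower
# bound (9) on P_D, swept over I_th for k = 2, m = 5, binary alphabet.
import numpy as np
from scipy.special import gammainc
import matplotlib.pyplot as plt

m, k, A = 5, 2, 2                         # nodes, Markov order, alphabet size |X|
r = A**(m * k) * (A**m - 1)
I_th = np.linspace(0.45 * r, 0.55 * r, 400)  # the transition happens near I_th ~ r/2
P_F_ub = 1 - gammainc(r / 2, I_th)           # upper bound on the false alarm probability

plt.plot(I_th, P_F_ub, label="upper bound on $P_F$")
for N0 in (5, 10, 20):                    # assumed numbers of absent edges, N0 <= m(m-1) = 20
    P_D_lb = np.maximum(1 - N0 * (1 - gammainc(r / 2, I_th)), 0)
    plt.plot(I_th, P_D_lb, label=f"lower bound on $P_D$, N0 = {N0}")
plt.xlabel("threshold I_th"); plt.legend(); plt.show()
```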

IV Summary and Remarks

In this paper, we investigated the performance of a hypothesis test for detecting the underlying directed graph of a network of stochastic processes, which represents the causal relationships among nodes, using the empirically calculated directed information as the measure. We showed that the convergence rate of the directed information estimator depends on whether or not the corresponding link exists in the real structure. We further showed that, with a proper adjustment of the threshold test for single edges, the overall hypothesis test is asymptotically optimal.

This work may be expanded by considering a detailed analysis of the sample complexity of the hypothesis test. Moreover, we assumed in this work that the estimator has access to samples from the whole network, while in practice this might not be the case (see, e.g., [12]).

Appendix A Proof of Theorem 2

For any right stochastic matrix $A$ of dimensions $n_a \times m_a$, let the matrix $\tilde{A}$ denote the first $m_a - 1$ linearly independent columns of $A$, as depicted in Fig. 2.

Figure 2: The matrix $\tilde{A}$ is formed by removing the last column of $A$.

Without loss of generality, consider $\mathcal{X}$ to be the set of integers $\{1, 2, \dots, |\mathcal{X}|\}$, which simplifies the indexing of elements in the alphabet. Let $u_{x,1}^k$ denote $(u_{x,1}, u_{x,2}, \dots, u_{x,k}) \in \mathcal{X}^k$, and let $u'_x, u'_y, u'_z \in \mathcal{X}$ excluding $(u'_x, u'_y, u'_z) = (|\mathcal{X}|, |\mathcal{X}|, |\mathcal{X}|)$. Next, define the $(3k+3)$-vector

\[
\vec{u} \triangleq (u_{x,1}^k, u_{y,1}^k, u_{z,1}^k, u'_x, u'_y, u'_z), \qquad (23)
\]

which is associated with an element of $\tilde{Q}$ (the sub-matrix of the transition probability matrix $Q$).

Every $\vec{u}$ can be addressed via the pair $(l_{\vec{u}}, g_{\vec{u}})$, where $l_{\vec{u}} \in [1 : |\mathcal{X}|^{3k}]$ and $g_{\vec{u}} \in [1 : |\mathcal{X}|^3 - 1]$ indicate the row and column of that element, respectively. Also, let $f_{\vec{u}} \triangleq (l_{\vec{u}} - 1)(|\mathcal{X}|^3 - 1) + g_{\vec{u}}$, which denotes the index of that element when vectorizing $\tilde{Q}$. Any possible transition matrix can then be indexed with a vector

\[
\theta = (\theta_1, \theta_2, \dots, \theta_r) = (\theta_{f_{\vec{u}}}) \in \Theta
\]

as $Q_\theta$, where $\Theta$ has dimension $r$ (see Table I) and $\theta$ is constructed by concatenating the rows of $\tilde{Q}_\theta$.

Suppose now that $v_{xy} = 0$ or, equivalently by definition (2), $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$. Then

\[
Q(X_{k+1}, Y_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k) = P(X_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k)\, P(Y_{k+1} | Y_1^k, Z_1^{k+1}). \qquad (24)
\]

Thus, the transition matrix $Q$ is determined by the elements of two other matrices, $T_1$ and $T_2$, given by (24). Define the vectors

\[
\vec{w} \triangleq (i_{x,1}^k, i_{y,1}^k, i_{z,1}^k, i'_x, i'_z), \qquad \vec{w}' \triangleq (i_{y,1}^k, (i_{z,1}^k, i'_z), i'_y),
\]

which are associated with an element in $\tilde{T}_1$ and $\tilde{T}_2$, respectively, such that $(i'_x, i'_z) \neq (|\mathcal{X}|, |\mathcal{X}|)$ in $\vec{w}$ and $i'_y \neq |\mathcal{X}|$ in $\vec{w}'$. Then

\[
f_{\vec{w}} \triangleq (l_{\vec{w}} - 1)(|\mathcal{X}|^2 - 1) + g_{\vec{w}}, \qquad f_{\vec{w}'} \triangleq (l_{\vec{w}'} - 1)(|\mathcal{X}| - 1) + g_{\vec{w}'},
\]

where the pairs of row and column indices for each element in $\tilde{T}_1$ and $\tilde{T}_2$ are $(l_{\vec{w}}, g_{\vec{w}})$ and $(l_{\vec{w}'}, g_{\vec{w}'})$, respectively.

A matrix $Q$ such as the one in (24) can be parametrized by a vector $\phi_{xy} \in \Phi_{xy}$, where $\Phi_{xy} = \Gamma \times \Gamma'$ has dimension $d + d'$ (see Table I). Then

\[
Q_{\phi_{xy}} = Q_{\gamma_{xy}}(X_{k+1}, Z_{k+1} | X_1^k, Y_1^k, Z_1^k)\, Q_{\gamma'_{xy}}(Y_{k+1} | Y_1^k, Z_1^{k+1}),
\]

where

\[
\gamma_{xy} = (\gamma_{f_{\vec{w}}}) \in \Gamma \quad\text{and}\quad \gamma'_{xy} = (\gamma'_{f_{\vec{w}'}}) \in \Gamma'
\]

determine $\phi_{xy}$, are vectors of length $d$ and $d'$, and are constructed by concatenating the rows of $\tilde{Q}_{\gamma_{xy}}$ and $\tilde{Q}_{\gamma'_{xy}}$, respectively. There exists then the mapping $h : \Phi_{xy} \to \Theta$ such that, component-wise:

(h(ϕxy))fu=γfwγfw.\displaystyle(h(\phi_{xy}))_{f_{\vec{u}}}=\gamma_{f_{\vec{w}}}\!\cdot\gamma^{\prime}_{f_{\vec{w}^{\prime}}}. (25)

Consider the matrix $K(\phi_{xy})$ of size $(r+1) \times (d+d')$ such that for every element:

\[
(K(\phi_{xy}))_{f_{\vec{u}}, f} = \begin{cases} \dfrac{\partial Q_{h(\phi_{xy})}(u'_x, u'_y, u'_z | u_{x,1}^k, u_{y,1}^k, u_{z,1}^k)}{\partial \gamma_f} & f \leq d \\[2ex] \dfrac{\partial Q_{h(\phi_{xy})}(u'_x, u'_y, u'_z | u_{x,1}^k, u_{y,1}^k, u_{z,1}^k)}{\partial \gamma'_{f-d}} & f > d. \end{cases} \qquad (26)
\]

In other words, every row of the matrix $K(\phi_{xy})$ consists of the derivatives of an element of $Q_{h(\phi_{xy})}$ with respect to all elements of $\gamma_{xy}$, followed by the derivatives with respect to $\gamma'_{xy}$.

According to [11, Th. 6.1], and by the definition of $\theta^\star$ and $\phi^\star_{xy}$ in (11),

\[
2\left\{L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) - L_n^{\phi^\star_{xy}}(X_1^n, Y_1^n, Z_1^n)\right\} \stackrel{\mathcal{L}}{\to} \chi^2_{r-d-d'},
\]

if $Q_{h(\phi_{xy})}$ has continuous third-order partial derivatives and $K(\phi_{xy})$ is of rank $d + d'$. The first condition holds according to the definition of $h$ in (25). To verify the second condition, we observe that there exist four types of rows in $K(\phi_{xy})$:

• Type 1: Take the rows $\vec{u}_1 = (i_1^k, j_1^k, l_1^k, i', j', l')$ in (26) such that $(i', l') \neq (|\mathcal{X}|, |\mathcal{X}|)$ jointly and $j' \neq |\mathcal{X}|$. This means that in the $f_{\vec{u}_1}$-th row of $K$, the derivatives are taken from
\[
Q_{h(\phi_{xy})}(i', j', l' | i_1^k, j_1^k, l_1^k) = \gamma_{f_{\vec{w}_1}} \cdot \gamma'_{f_{\vec{w}'_1}},
\]
where $\vec{w}_1 = (i_1^k, j_1^k, l_1^k, i', l')$ and $\vec{w}'_1 = (j_1^k, (l_1^k, l'), j')$. So all elements in such rows are zero except in the columns $f_{\vec{w}_1}$ and $d + f_{\vec{w}'_1}$, which take the values $\gamma'_{f_{\vec{w}'_1}}$ and $\gamma_{f_{\vec{w}_1}}$, respectively.

• Type 2: Now consider the rows $\vec{u}_2 = (i_1^k, j_1^k, l_1^k, i', j', l')$ such that $(i', l') = (|\mathcal{X}|, |\mathcal{X}|)$ and $j' \neq |\mathcal{X}|$. This means that in the $f_{\vec{u}_2}$-th row of $K$, the derivatives are taken from
\[
Q_{h(\phi_{xy})}(i', j', l' | i_1^k, j_1^k, l_1^k) = \bigg(1 - \sum_{(a,b) \neq (|\mathcal{X}|, |\mathcal{X}|)} \gamma_{f_{\vec{w}_2(a,b)}}\bigg) \gamma'_{f_{\vec{w}'_2}},
\]
where we define $\vec{w}_2(a,b) = (i_1^k, j_1^k, l_1^k, a, b)$ and $\vec{w}'_2 = (j_1^k, (l_1^k, l'), j')$. So all elements in such rows are zero except in the $(|\mathcal{X}|^2 - 1)$ columns (among the first $d$ columns) from $f_{\vec{w}_2(1,1)}$ to $f_{\vec{w}_2(|\mathcal{X}|, |\mathcal{X}|-1)}$, which are equal to $-\gamma'_{f_{\vec{w}'_2}}$, and the column $d + f_{\vec{w}'_2}$, which is equal to
\[
1 - \sum_{(a,b) \neq (|\mathcal{X}|, |\mathcal{X}|)} \gamma_{f_{\vec{w}_2(a,b)}}.
\]
• Type 3: Consider the rows $\vec{u}_3 = (i_1^k, j_1^k, l_1^k, i', j', l')$ such that $(i', l') \neq (|\mathcal{X}|, |\mathcal{X}|)$ and $j' = |\mathcal{X}|$. Also let $\vec{w}_3 = (i_1^k, j_1^k, l_1^k, i', l')$ and $\vec{w}'_3(a) = (j_1^k, (l_1^k, l'), a)$. Then, all elements of such rows are zero except in the column $f_{\vec{w}_3}$, which takes the value
\[
1 - \sum_{a \neq |\mathcal{X}|} \gamma'_{f_{\vec{w}'_3(a)}},
\]
and the $(|\mathcal{X}| - 1)$ columns (among the last $d'$ columns) from $d + f_{\vec{w}'_3(1)}$ to $d + f_{\vec{w}'_3(|\mathcal{X}|-1)}$, which are equal to $-\gamma_{f_{\vec{w}_3}}$.

• Type 4: Lastly, consider the rows $\vec{u}_4 = (i_1^k, j_1^k, l_1^k, i', j', l')$ such that $(i', l') = (|\mathcal{X}|, |\mathcal{X}|)$ and $j' = |\mathcal{X}|$. Define the vectors $\vec{w}_4(a,b) = (i_1^k, j_1^k, l_1^k, a, b)$ and $\vec{w}'_4(a) = (j_1^k, (l_1^k, l'), a)$. Then, the only non-zero elements belong to the $(|\mathcal{X}|^2 - 1)$ columns from $f_{\vec{w}_4(1,1)}$ to $f_{\vec{w}_4(|\mathcal{X}|, |\mathcal{X}|-1)}$ (among the first $d$ columns), which are equal to
\[
-\bigg(1 - \sum_{a \neq |\mathcal{X}|} \gamma'_{f_{\vec{w}'_4(a)}}\bigg),
\]
and the $(|\mathcal{X}| - 1)$ columns from $d + f_{\vec{w}'_4(1)}$ to $d + f_{\vec{w}'_4(|\mathcal{X}|-1)}$ (among the last $d'$ columns), which are equal to
\[
-\bigg(1 - \sum_{(a,b) \neq (|\mathcal{X}|, |\mathcal{X}|)} \gamma_{f_{\vec{w}_4(a,b)}}\bigg).
\]

We now show that if a linear combination of all columns equals the zero vector, then all coefficients must be zero as well. Let $c_f$ be the $f$-th column of $K(\phi_{xy})$; then if

\[
\sum_{f=1}^{d+d'} \alpha_f c_f = \vec{0}, \qquad (27)
\]

then $\alpha_f = 0$ for all $f$. To see this, consider the Type 1 row with $i_1^k = j_1^k = l_1^k = 1^k$ and $i' = l' = 1$. Since it only has two non-zero elements, we have that

\[
\forall j' \in [1 : |\mathcal{X}| - 1] : \quad \alpha_1 \gamma'_{j'} + \alpha_{j'} \gamma_1 = 0. \qquad (28)
\]

Then, take the Type 3 row with $i_1^k = j_1^k = l_1^k = 1^k$ and $i' = l' = 1$ (so that $j' = |\mathcal{X}|$), where we have that

\[
\alpha_1 \bigg(1 - \sum_{a \neq |\mathcal{X}|} \gamma'_a\bigg) - \sum_{b \neq |\mathcal{X}|} \alpha_b \gamma_1 = 0. \qquad (29)
\]

From (28) and noting that we have assumed $Q > \mathbf{0}$, if $\alpha_1 > 0$ then $\alpha_{j'} < 0$ for all $j' \in [1 : |\mathcal{X}| - 1]$. Hence, the left-hand side of (29) is strictly positive and not zero. An analogous result is found assuming $\alpha_1 < 0$. By contradiction, we conclude that $\alpha_1 = 0$, and from (28),

\[
\forall j' \in [1 : |\mathcal{X}| - 1] : \quad \alpha_{j'} = 0.
\]

By varying $(i', l')$ and for all combinations of $(i_1^k, j_1^k, l_1^k)$, we derive that all $\alpha_f$'s are zero, and as a result, $K(\phi_{xy})$ has $d + d'$ linearly independent columns, which meets the second condition. The proof of Theorem 2 is thus complete.

Appendix B Proof of Lemma 1

The proof follows similar steps as the one in [10, Prop. 9]. Using the definition of the log-likelihood,

\[
\begin{aligned}
L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) &= \max_{\theta\in\Theta} \sum_{i=k+1}^{n} \log\big(Q_\theta(X_i, Y_i, Z_i | X_{i-k}^{i-1}, Y_{i-k}^{i-1}, Z_{i-k}^{i-1})\big) \\
&= \max_{\theta\in\Theta} \sum_{x_1^{k+1} y_1^{k+1} z_1^{k+1}} (n-k)\, \hat{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \log\big(Q_\theta(x_{k+1} y_{k+1} z_{k+1} | x_1^k y_1^k z_1^k)\big) \\
&= -(n-k)\Bigg[\min_{\theta\in\Theta}\Big\{D\big(\hat{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}} \,\big\|\, Q_\theta \otimes \hat{P}_{X_1^k Y_1^k Z_1^k}\big)\Big\} + \sum_{x_1^{k+1} y_1^{k+1} z_1^{k+1}} \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \log\bigg(\frac{\hat{P}(x_1^k y_1^k z_1^k)}{\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})}\bigg)\Bigg], \qquad (30)
\end{aligned}
\]

where

\[
(Q_\theta \otimes \hat{P}_{X_1^k Y_1^k Z_1^k})(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \triangleq \hat{P}_{X_1^k Y_1^k Z_1^k}(x_1^k y_1^k z_1^k)\, Q_\theta(x_{k+1} y_{k+1} z_{k+1} | x_1^k y_1^k z_1^k).
\]

Since the KL-divergence is non-negative and its minimum value of zero is attained by choosing $Q_\theta$ equal to the empirical conditional distribution, then

\[
L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) = (n-k)\big[\hat{H}(X_1^k, Y_1^k, Z_1^k) - \hat{H}(X_1^{k+1}, Y_1^{k+1}, Z_1^{k+1})\big]. \qquad (31)
\]

On the other hand, for the second log-likelihood, we have:

\[
\begin{aligned}
L_n^{\phi^\star}(X_1^n, Y_1^n, Z_1^n) &= \max_{\phi\in\Phi} \sum_{i=k+1}^{n} \log\big(Q_\phi(X_i, Y_i, Z_i | X_{i-k}^{i-1}, Y_{i-k}^{i-1}, Z_{i-k}^{i-1})\big) \\
&= \underbrace{\max_{\phi^{xz}} \sum_{i=k+1}^{n} \log\big(Q_{\phi^{xz}}(X_i, Z_i | X_{i-k}^{i-1}, Y_{i-k}^{i-1}, Z_{i-k}^{i-1})\big)}_{A_1} + \underbrace{\max_{\phi^{y}} \sum_{i=k+1}^{n} \log\big(Q_{\phi^{y}}(Y_i | Y_{i-k}^{i-1}, Z_{i-k}^{i})\big)}_{A_2}.
\end{aligned}
\]

With a similar approach as in (30), we can expand $A_1$ and $A_2$ as shown in (32) below.

\[
\begin{aligned}
A_1 &= -(n-k)\Bigg[\min_{\phi^{xz}}\Big\{D\big(\hat{P}_{X_1^{k+1}, Y_1^k, Z_1^{k+1}} \,\big\|\, Q_{\phi^{xz}} \otimes \hat{P}_{X_1^k, Y_1^k, Z_1^k}\big)\Big\} + \sum_{x_1^{k+1}, y_1^k, z_1^{k+1}} \hat{P}(x_1^{k+1}, y_1^k, z_1^{k+1}) \log\bigg(\frac{\hat{P}(x_1^k, y_1^k, z_1^k)}{\hat{P}(x_1^{k+1}, y_1^k, z_1^{k+1})}\bigg)\Bigg] \\
A_2 &= -(n-k)\Bigg[\min_{\phi^{y}}\Big\{D\big(\hat{P}_{Y_1^{k+1}, Z_1^{k+1}} \,\big\|\, Q_{\phi^{y}} \otimes \hat{P}_{Y_1^k, Z_1^{k+1}}\big)\Big\} + \sum_{y_1^{k+1}, z_1^{k+1}} \hat{P}(y_1^{k+1}, z_1^{k+1}) \log\bigg(\frac{\hat{P}(y_1^k, z_1^{k+1})}{\hat{P}(y_1^{k+1}, z_1^{k+1})}\bigg)\Bigg]. \qquad (32)
\end{aligned}
\]

As a result,

\[
L_n^{\phi^\star}(X_1^n, Y_1^n, Z_1^n) = (n-k)\Big[\hat{H}(X_1^k, Y_1^k, Z_1^k) - \hat{H}(X_1^{k+1}, Y_1^k, Z_1^{k+1}) + \hat{H}(Y_1^k, Z_1^{k+1}) - \hat{H}(Y_1^{k+1}, Z_1^{k+1})\Big]. \qquad (33)
\]

Finally, combining (31) and (33), we obtain

\[
\begin{aligned}
\Lambda_{xy,n} &= L_n^{\theta^\star}(X_1^n, Y_1^n, Z_1^n) - L_n^{\phi^\star_{xy}}(X_1^n, Y_1^n, Z_1^n) \\
&= (n-k)\big[\hat{H}(Y_{k+1} | Y_1^k, Z_1^{k+1}) - \hat{H}(Y_{k+1} | X_1^{k+1}, Y_1^k, Z_1^{k+1})\big] \\
&= (n-k)\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}), \qquad (34)
\end{aligned}
\]

which concludes the proof of Lemma 1.

\[
\begin{aligned}
\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) &= \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}) \log \frac{\hat{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\hat{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \hat{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})} \\
&= \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}) \log\bigg[\frac{\hat{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\hat{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \hat{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}\bigg] + \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}) \log \frac{\bar{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\bar{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})} \\
&= \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}) \log\bigg[\frac{\hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1})\, \hat{P}(\mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\hat{P}(\mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1})\, \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^k \mathbb{z}_1^{k+1})} \cdot \frac{\bar{P}(\mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\bar{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{y}_1^k \mathbb{z}_1^{k+1})}\bigg] + \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \hat{P}(\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}) \log \frac{\bar{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\bar{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})} \\
&= D(\hat{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}} \,\|\, \bar{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}}) + D(\hat{P}_{Y_1^k Z_1^{k+1}} \,\|\, \bar{P}_{Y_1^k Z_1^{k+1}}) - D(\hat{P}_{Y_1^{k+1} Z_1^{k+1}} \,\|\, \bar{P}_{Y_1^{k+1} Z_1^{k+1}}) - D(\hat{P}_{X_1^{k+1} Y_1^k Z_1^{k+1}} \,\|\, \bar{P}_{X_1^{k+1} Y_1^k Z_1^{k+1}}) \\
&\quad + \sum_{\mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}} \bigg(\frac{1}{n-k}\sum_{i=k+1}^{n} \mathds{1}\big[x_{i-k}^i y_{i-k}^i z_{i-k}^i = \mathbb{x}_1^{k+1} \mathbb{y}_1^{k+1} \mathbb{z}_1^{k+1}\big]\bigg) \log \frac{\bar{P}(\mathbb{y}_{k+1} \mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}{\bar{P}(\mathbb{y}_{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})\, \bar{P}(\mathbb{x}_1^{k+1} | \mathbb{y}_1^k \mathbb{z}_1^{k+1})}. \qquad (35)
\end{aligned}
\]

Appendix C Proof of Theorem 3

We begin by expanding the expression $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ using the definition of the empirical distribution in (3), and we obtain (35) above. We then proceed to analyze the asymptotic behavior of the estimator.

The first four terms in (35), i.e., the KL-divergence terms, decay faster than $\mathcal{O}(1/\sqrt{n})$. This is shown later in the proof. On the other hand, since $v_{xy} = 1$, we have $I(\bar{Y}_{k+1}; \bar{X}_1^{k+1} | \bar{Y}_1^k, \bar{Z}_1^{k+1}) > 0$ due to (5) and Assumption 3, and thus the last term in (35) is non-zero and dominates the convergence of the estimator, as we see next. Here, one observes that conditioning on $v_{xy} = 1$ is sufficient to analyze the convergence of $\hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$, and further knowledge about other edges is irrelevant (see Remark 4). We then conclude that

\[
\lim_{n\to\infty} \sqrt{n-k}\, \hat{I}_n^{(k)}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = \lim_{n\to\infty} \frac{1}{\sqrt{n-k}} \sum_{i=k+1}^{n} S_i,
\]

where

\[
\begin{aligned}
S_i &\triangleq \log P_{\bar{Y}_{k+1} \bar{X}_1^{k+1} | \bar{Y}_1^k \bar{Z}_1^{k+1}}(y_i x_{i-k}^i | y_{i-k}^{i-1} z_{i-k}^i) \\
&\quad - \log P_{\bar{Y}_{k+1} | \bar{Y}_1^k \bar{Z}_1^{k+1}}(y_i | y_{i-k}^{i-1} z_{i-k}^i) - \log P_{\bar{X}_1^{k+1} | \bar{Y}_1^k \bar{Z}_1^{k+1}}(x_{i-k}^i | y_{i-k}^{i-1} z_{i-k}^i).
\end{aligned}
\]

We note that $S_i$ is a functional of the chain $\{(X_{i-k}^i, Y_{i-k}^i, Z_{i-k}^i)\}$ and its mean is $\mathds{E}[S] = I(\bar{Y}_{k+1}; \bar{X}_1^{k+1} | \bar{Y}_1^k, \bar{Z}_1^{k+1})$. The chain is ergodic, and we can thus apply the central limit theorem [13, Sec. I.16] to the partial sums to obtain

\[
\sqrt{n-k}\left(\frac{1}{n-k}\sum_{i=k+1}^{n} S_i - \mathds{E}[S]\right) \to \mathcal{N}(0, \sigma^2), \qquad (36)
\]

where $\sigma^2$ is bounded.

Now, to complete the proof, it only remains to show that the KL-divergence terms in (35), multiplied by a $\sqrt{n-k}$ factor, converge to zero as $n \to \infty$. We present the proof for one term; the others follow a similar approach. We first recall the Taylor expansion with the Lagrange form of the remainder,

\[
f(x) = f(a) + f'(a)(x-a) + \frac{f''(x^*)(x-a)^2}{2!},
\]

for some $x^* \in (a, x)$. Then, let us define $\rho \triangleq \frac{\bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})}{\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})}$, so we can expand the first KL-divergence term as:

\[
\begin{aligned}
D(\hat{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}} \,\|\, \bar{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}}) &= -\sum_{x_1^{k+1} y_1^{k+1} z_1^{k+1}} \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \log\rho \\
&= -\sum \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})\Big[(\rho - 1) - \frac{(\rho-1)^2}{2!\,\tau^2}\Big] \\
&= \sum \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})(\rho - 1)^2 \frac{1}{2\tau^2} \qquad (37) \\
&= \sum \big(\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) - \bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})\big)^2\, C, \qquad (38)
\end{aligned}
\]

for some $\tau \in (1, \rho)$, where

\[
C \triangleq \frac{1}{2 \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})\, \tau^2},
\]

and (37) follows due to

\[
\sum \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})(\rho - 1) = \sum \bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) - \hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) = 0.
\]

Since the Markov model is assumed to be ergodic (Assumption 2), $\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \nrightarrow 0$, and therefore $C$ is bounded. Now, for all $i \in [1 : n-k]$, consider the sequence

\[
T_i(x_1^{k+1} y_1^{k+1} z_1^{k+1}) \triangleq \mathds{1}\big[X_i^{k+i} Y_i^{k+i} Z_i^{k+i} = x_1^{k+1} y_1^{k+1} z_1^{k+1}\big],
\]

with mean $\bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})$. According to the law of the iterated logarithm,

\[
\limsup_{n\to\infty} \frac{\sum_{i=1}^{n-k}\big(T_i - \bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})\big)}{\sqrt{(n-k)\log\log(n-k)}} = \sqrt{2} \quad \text{a.s.}
\]

Using the definition of the empirical distribution, this implies

\[
\begin{aligned}
&\limsup_{n\to\infty} \frac{(n-k)\big(\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) - \bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})\big)}{\sqrt{(n-k)\log\log(n-k)}} \\
&= \limsup_{n\to\infty} \frac{\hat{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1}) - \bar{P}(x_1^{k+1} y_1^{k+1} z_1^{k+1})}{\sqrt{\log\log(n-k)}\,/\,\sqrt{n-k}} = \sqrt{2} \quad \text{a.s.} \qquad (39)
\end{aligned}
\]

As a result we can rewrite (38) and conclude that

\[
\limsup_{n\to\infty} \sqrt{n-k}\, D(\hat{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}} \,\|\, \bar{P}_{X_1^{k+1} Y_1^{k+1} Z_1^{k+1}}) = \limsup_{n\to\infty} \frac{\log\log(n-k)}{\sqrt{n-k}} \sum_{x_1^{k+1} y_1^{k+1} z_1^{k+1}} 2C = 0,
\]

given that each term in the finite sum is bounded. Therefore, as $n \to \infty$, the four KL-divergence terms in (35) multiplied by a $\sqrt{n-k}$ factor tend to zero, and the proof of Theorem 3 is thus complete.

References

  • [1] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, “Estimating the Directed Information to Infer Causal Relationships in Ensemble Neural Spike Train Recordings,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 17–44, Feb. 2011.
  • [2] K. Friston, L. Harrison, and W. Penny, “Dynamic Causal Modelling,” NeuroImage, vol. 19, no. 4, pp. 1273–1302, Aug. 2003.
  • [3] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Directed Information Graphs,” IEEE Trans. Inf. Theory, vol. 61, no. 12, pp. 6887–6909, Dec. 2015.
  • [4] W. M. Lord, J. Sun, N. T. Ouellette, and E. M. Bollt, “Inference of Causal Information Flow in Collective Animal Behavior,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2, no. 1, pp. 107–116, Jun. 2016.
  • [5] J. Jiao, H. H. Permuter, L. Zhao, Y. H. Kim, and T. Weissman, “Universal Estimation of Directed Information,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6220–6242, Oct. 2013.
  • [6] C. W. J. Granger, “Investigating Causal Relations by Econometric Models and Cross-spectral Methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
  • [7] J. Massey, “Causality, Feedback and Directed Information,” in Proc. Int. Symp. Inf. Theory Applic. (ISITA), Honolulu, HI, USA, Nov. 1990, pp. 303–305.
  • [8] P.-O. Amblard and O. J. J. Michel, “On Directed Information Theory and Granger Causality Graphs,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 7–16, Feb. 2011.
  • [9] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs,” IEEE Trans. Signal Process., vol. 61, no. 12, pp. 3173–3182, Jun. 2013.
  • [10] I. Kontoyiannis and M. Skoularidou, “Estimating the Directed Information and Testing for Causality,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 6053–6067, Nov. 2016.
  • [11] P. Billingsley, Statistical Inference for Markov Processes.   Chicago, IL, USA: Univ. of Chicago Press, 1961.
  • [12] J. Scarlett and V. Cevher, “Lower Bounds on Active Learning for Graphical Model Selection,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 Apr. 2017, pp. 55–64.
  • [13] K. L. Chung, Markov Chains: With Stationary Transition Probabilities, 2nd ed., ser. A Series of Comprehensive Studies in Mathematics.   New York, NY, USA: Springer-Verlag, 1967, vol. 104.