Testing for Directed Information Graphs
Abstract
In this paper, we study a hypothesis test to determine the underlying directed graph structure of nodes in a network, where the nodes represent random processes and the direction of the links indicates a causal relationship between said processes. Specifically, a $k$-th order Markov structure is considered for them, and the chosen metric to determine a connection between nodes is the directed information. The hypothesis test is based on the empirically calculated transition probabilities which are used to estimate the directed information. For a single edge, it is proven that the detection probability can be chosen arbitrarily close to one, while the false alarm probability remains negligible. When the test is performed on the whole graph, we derive bounds for the false alarm and detection probabilities, which show that the test is asymptotically optimal by properly setting the test threshold and using a large number of samples. Furthermore, we study how the convergence rate of the estimates depends on the existence of links in the true graph.
I Introduction
Causality is a concept that expresses the joint behavior in time of a group of components in a system. In general, it denotes the effect of one component on itself and on others in the system during a time period. Consider a network of nodes, each producing a signal in time. These processes can behave independently, or there might be an underlying natural connection between them. Inferring this structure is of great interest in many applications. In [1], for instance, neurons are taken as components while the time series of produced spikes are used to derive the underlying structure. Dynamical models are also a well-known tool to understand the functional interactions among neurons [2]. Additionally, in social networks, there is an increasing interest in estimating influences among users [3], while further applications exist in biology [4], economics [5], and many other fields.
Granger [6] defined the notion of causality between two time series by using a linear autoregressive model and comparing the estimation errors in two scenarios: when the history of the second node is accounted for and when it is not. Models that operate non-linearly, however, are poorly captured by this definition. Directed information was first introduced to address the flow of information in a communication set-up, and it was suggested by Massey [7] as a measure of causality since it is not limited to linear models. There exist other methods which may qualify for different applications. Several of these definitions are compared in [1], where directed information is argued to be a robust measure of causality. There are also symmetric measures, like correlation or mutual information, but they can only represent a mutual relationship between nodes and not a directed one.
The underlying causal structure of a network of processes can be properly visualized by a directed graph. In particular, in a Directed Information Graph (DIG), introduced simultaneously by Amblard and Michel [8] and Quinn et al. [1], the existence of an edge is determined by the value of the directed information between two nodes considering the history of the rest of the network. There are different approaches to tackle the problem of detecting and estimating this type of graph. Directed information can be estimated based on prior assumptions on the processes' structure, such as Markov properties, by empirically calculating probabilities [1, 3]. On the other hand, Jiao et al. [5] propose a universal estimator of directed information which is not restricted to any Markov assumption. Nonetheless, at the core of their technique, they consider a context tree weighting algorithm with different depths, which intuitively resembles a learning algorithm for estimating the order of a Markov structure. Other assumptions used in the study of DIGs, which constrain the structure of the underlying graph, are tree structures [9] or a limit on the nodes' degree [3].
The estimation performance in the detection of edges of a DIG is crucial since it allows one to characterize, for instance, the optimum test for detection, or the minimum number of samples needed to reliably obtain the underlying model, i.e., the sample complexity. In [3], the authors derive a bound on the sample complexity using total variation when the directed information between two nodes is empirically estimated. Following that work, Kontoyiannis et al. [10] investigate the performance of a test for causality between two nodes, and they show that the convergence rate of the empirical directed information can be improved when it is analyzed conditioned on the true relationship between the nodes. In other words, the underlying structure of the true model has an effect on the detection performance of the whole graph. Motivated by this result, in this paper, we study a hypothesis test over a complete graph (not just a link between two nodes) when the directed information is empirically estimated, and we provide interesting insights into the problem. Moreover, we show that for every existing edge in the true graph, the estimation converges at a rate $1/\sqrt{n}$, while if there is no edge in the true model, the convergence is of order $1/n$.
The rest of the paper is organized as follows. In Section II, notation and definitions are introduced. In particular, the directed information is reviewed and the definition of an edge in a DIG is presented. The main results of our work are then shown in Section III. First, the performance of a hypothesis test for a single edge is studied, where we analyze the asymptotic behavior of the estimators based on the knowledge about the true edges. Then, we demonstrate how the detection of the whole graph relies on the test for each edge. Finally, the paper is concluded in the last section.
II Preliminaries
Assume a network with $m$ nodes representing the processes $\mathbf{X}_1, \ldots, \mathbf{X}_m$. The observation of the $i$-th process in the discrete time interval $1$ to $n$ is described by the random vector $X_i^n = (X_{i,1}, \ldots, X_{i,n})$, which at each time takes values on a discrete alphabet $\mathcal{X}_i$. With a slight abuse of notation, $Y_t$ and $Y_t^{t'}$ represent the observations of the process $\mathbf{Y}$ at instant $t$ and in the interval $t$ to $t'$, respectively.
The metric used to describe the causality relationship of these processes is the directed information, as suggested previously, since it can describe more general structures without further assumptions (such as linearity). Directed information is mainly used in information theory to characterize channels with feedback and it is defined based on causally conditioned probabilities.
Definition 1.
The probability distribution of $y^n$ causally conditioned on $x^n$ is defined as
$$P\big(y^n \,\big\|\, x^n\big) \triangleq \prod_{t=1}^{n} P\big(y_t \,\big|\, y^{t-1}, x^{t}\big).$$
The entropy rate of the process $\mathbf{Y}$ causally conditioned on $\mathbf{X}$ is then defined as:
$$H(\mathbf{Y} \,\|\, \mathbf{X}) \triangleq \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\big[-\log P\big(Y^n \,\big\|\, X^n\big)\big].$$
Consequently, the directed information rate of $\mathbf{X}$ to $\mathbf{Y}$ causally conditioned on $\mathbf{Z}$ is expressed as:
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq H(\mathbf{Y} \,\|\, \mathbf{Z}) - H(\mathbf{Y} \,\|\, \mathbf{X}, \mathbf{Z}). \tag{1}$$
Pairwise directed information does not unequivocally determine the one-step causal influence among nodes in a network; instead, the history of the remaining nodes should also be considered. Similarly to [1, 8], an edge from $\mathbf{X}$ to $\mathbf{Y}$ exists in a directed information graph iff
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0, \tag{2}$$
where $\mathbf{Z}$ denotes the remaining processes of the network. Having observed the output of every process, the edges of the graph can be estimated, which results in a weighted directed graph. However, when only the existence of a directed edge is investigated, the performance of the detection can be improved. This is presented in Section III.
There exist several methods to estimate information-theoretic quantities, most of which intrinsically rely on counting possible events to estimate distributions. One such method is the empirical distribution, which we define as follows.
Definition 2.
Let $w^n$ be a realization of the random variables $W^n$, where $W_t$ collects the samples of all nodes at time $t$ (i.e., $W_t = (X_t, Y_t, Z_t)$ in a three-node network). The joint empirical distribution of $k+1$ consecutive time instances of all nodes is then defined as:
$$\hat{P}\big(w^{k+1}\big) \triangleq \frac{1}{n-k} \sum_{t=k+1}^{n} \mathbb{1}\big\{ w_{t-k}^{t} = w^{k+1} \big\}. \tag{3}$$
The joint distribution of any subset of nodes is then a marginal distribution of (3).
By plugging in the empirical distribution, we can derive estimators for information-theoretic quantities such as the entropy, where we use the notation $\hat{H}$ to distinguish the empirical estimator, i.e.,
$$\hat{H}(W) \triangleq -\sum_{w} \hat{P}(w) \log \hat{P}(w). \tag{4}$$
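For concreteness, the empirical distribution (3) and the plug-in entropy (4) amount to simple counting. Below is a minimal Python sketch (function names are ours); each sample may itself be a tuple collecting the outputs of all nodes at one time instant:

```python
from collections import Counter
from math import log

def empirical_dist(samples, k):
    """Empirical distribution (3) of (k+1)-tuples of consecutive samples;
    each sample may be a tuple collecting all nodes at one time instant."""
    n = len(samples)
    counts = Counter(tuple(samples[t - k:t + 1]) for t in range(k, n))
    return {pattern: c / (n - k) for pattern, c in counts.items()}

def plugin_entropy(dist):
    """Plug-in entropy (4), in nats."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

# Toy usage with a binary sequence and k = 1:
seq = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(plugin_entropy(empirical_dist(seq, k=1)))
```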
A causal influence in the network implies that the past of a group of nodes affects the future of some other group or themselves. This motivates us to focus on a network of jointly Markov processes in this paper, since it characterizes a state-dependent operation for nodes, although we may put further assumptions to make calculations more tractable. For simplicity, we assume a three-node network (i.e., $m = 3$) and the processes to be $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ in the rest of the work, since the extension of the results to $m > 3$ is straightforward.
Assumption 1.
$(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ is a jointly stationary Markov process of order $k$.
Let us define the transition probability matrix $P$ with elements
$$P\big(w_{k+1} \,\big|\, w^{k}\big) = P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big).$$
Then, the next assumption avoids further complications in the proof of our main result.
Assumption 2.
All transition probabilities are positive, i.e., $P\big(w_{k+1} \,\big|\, w^{k}\big) > 0$ for all $w^{k+1}$.
This condition provides ergodicity for the joint Markov process and results in the joint empirical distribution asymptotically converging to the stationary distribution (denoted as $\pi$ in the sequel), i.e., $\hat{P}\big(w^{k+1}\big) \to \pi\big(w^{k+1}\big)$ almost surely as $n \to \infty$.
In general, the directed information rate cannot be expressed with the stationary random variables $X^{k+1}$, $Y^{k+1}$, and $Z^{k+1}$, since the causally conditioned entropies depend on the entire past of the processes, and a perfect estimation would thus require an unlimited number of samples. To see this,
$$\begin{aligned} I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) &= \lim_{n \to \infty} \Big[ H\big(Y_n \,\big|\, Y^{n-1}, Z^{n}\big) \Big] - H\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big) \\ &\le H\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - H\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big), \end{aligned} \tag{5}$$
where we use the Markov property and stationarity in the first equation, and the inequality holds since the mutual information $I\big(Y_n;\, (Y^{n-k-1}, Z^{n-k-1}) \,\big|\, Y_{n-k}^{n-1}, Z_{n-k}^{n}\big)$ is non-negative. Thus, with a limited sampling interval, only an upper bound would be derived. The next assumption makes (5) hold with equality.
Assumption 3.
For processes $\mathbf{Y}$ and $\mathbf{Z}$, the Markov chain
$$Y_{n} \,-\, \big(Y_{n-k}^{n-1},\, Z_{n-k}^{n}\big) \,-\, \big(Y^{n-k-1},\, Z^{n-k-1}\big)$$
holds.
Note that the above assumption should hold for every other pair of processes if we are interested in studying the whole graph and not only a single edge.
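Under Assumptions 1–3, the directed information rate in (5) thus reduces to the conditional mutual information $I\big(X^{k+1}; Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big)$ of finitely many stationary variables, which can be estimated by plugging the empirical distribution (3) into the corresponding entropies. Below is a minimal sketch of this plug-in estimator for $k = 1$; the function name and tuple encodings are ours, and discrete (hashable) alphabets are assumed:

```python
from collections import Counter
from math import log

def plugin_cmi(x, y, z):
    """Plug-in estimate of I(X^{k+1}; Y_{k+1} | Y^k, Z^{k+1}) for k = 1,
    i.e., the finite-window form of the directed information rate."""
    n = len(x)
    joint = Counter()
    for t in range(n - 1):
        a = (x[t], x[t + 1])           # X^{k+1}: past and present of X
        b = y[t + 1]                   # Y_{k+1}
        c = (y[t], z[t], z[t + 1])     # (Y^k, Z^{k+1})
        joint[(a, b, c)] += 1
    p_abc = {key: v / (n - 1) for key, v in joint.items()}
    p_ac, p_bc, p_c = Counter(), Counter(), Counter()
    for (a, b, c), p in p_abc.items():
        p_ac[(a, c)] += p
        p_bc[(b, c)] += p
        p_c[c] += p
    return sum(p * log(p * p_c[c] / (p_ac[(a, c)] * p_bc[(b, c)]))
               for (a, b, c), p in p_abc.items())
```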
III Hypothesis test for Directed Information Graphs
Consider a graph $G$, where the edge from node $i$ to node $j$ is denoted by $A_{ij}$; we say that $A_{ij} = 1$ if node $i$ causally influences node $j$, and $A_{ij} = 0$ otherwise. A hypothesis test to identify the graph is performed on the adjacency matrix $A$, whose elements are the $A_{ij}$s, and the performance of the test is studied through its false alarm and detection probabilities
$$P_F \triangleq \Pr\big(\hat{A} = A \,\big|\, A^* \neq A\big), \tag{6}$$
$$P_D \triangleq \Pr\big(\hat{A} = A \,\big|\, A^* = A\big), \tag{7}$$
where $\hat{A}$ is the estimation of the true adjacency matrix $A^*$ (properly defined later), and $A$ is the hypothesis model to test. In Theorem 1 below, both an upper bound on $P_F$ and a lower bound on $P_D$ are derived.
Theorem 1.
For a directed information graph with adjacency matrix $A^*$ of size $m \times m$, if Assumptions 1–3 hold, the performance of the test for the hypothesis $A$ is bounded as:
$$P_F \le \max\Big\{ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta,\; \beta_n \Big\}, \tag{8}$$
$$P_D \ge 1 - \bar{N}\Big[ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta \Big] - N \beta_n, \tag{9}$$
using the plug-in estimation with $n$ samples, for any $\delta > 0$ and sufficiently large $n$. The function $\gamma(\cdot, \cdot)$ is the regularized gamma function, and $d' \triangleq \dim(\Theta) - \dim(\Phi')$ is the degree of the limiting $\chi^2$ distribution of the single-edge test statistic (see Theorem 2 and Remark 3 in Section III-A), with $N$ denoting the number of directed edges in the hypothesis graph, $\bar{N} \triangleq m(m-1) - N$, and $\beta_n$ the vanishing miss probability of a single existing edge given in (16). Finally, $\epsilon_n$ is the threshold value used to decide the existence of an edge, and its order is $o(1)$ with $n\epsilon_n \to \infty$.
The proof of Theorem 1 consists of two steps. First, the asymptotic behavior of the test for a single edge is derived in Section III-A. Afterwards, the hypothesis test over the whole graph is studied based on the tests for each single edge. It can be seen later on that, by Remark 1 and Corollary 1, the performance of testing the graph remains as good as a test for causality of a single edge.
Remark 1.
Note that by increasing $n\epsilon_n$ while $\epsilon_n$ remains of order $o(1)$, $\gamma\big(\tfrac{d'}{2}, n\epsilon_n\big)$ gets arbitrarily close to one, which results in the probability of detection converging to one as the probability of false alarm tends to zero.
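To illustrate the interplay between the threshold and the bounds, the sketch below evaluates the false-alarm term $1 - \gamma\big(\tfrac{d'}{2}, n\epsilon_n\big)$ with scipy, together with a Gaussian stand-in for the miss term $\beta_n$; the values of $d'$, the edge strength, and the variance are assumed, illustrative choices:

```python
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma
from scipy.stats import norm

d = 24                               # assumed degrees of freedom
I_min, sigma_I = 0.1, 1.0            # assumed edge strength and std. dev.
for n in [10**3, 10**4, 10**5]:
    eps = np.log(n) ** 2 / n         # eps -> 0 while n * eps -> infinity
    fa = 1.0 - gammainc(d / 2.0, n * eps)                 # false-alarm term
    miss = norm.sf(np.sqrt(n) * (I_min - eps) / sigma_I)  # Q-function term
    print(f"n={n:>6}  false-alarm term={fa:.2e}  miss term={miss:.2e}")
```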
III-A Asymptotic Behavior of a Single Edge Estimation
In general, every possible transition probability matrix can be parametrized with a vector $\theta \in \Theta$ (see Table I). The vector $\theta$ is formed by concatenating the elements of $P$ in a row-wise manner, excluding the last (linearly dependent) column. However, if the transition probability can be factorized due to a Markov property among its variables, the matrix may be addressed with a lower-dimensional parameter.
To see this, let us concentrate on our 3-node network $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$. If $A_{XY} = 0$ or, equivalently, $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$, then by Assumption 3 the transition probability can be factorized as follows,
$$P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) = P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big). \tag{10}$$
Here the transition matrix is parametrized by $\phi \in \Phi$, where $\phi$ has two components, $\phi_1 \in \Phi_1$ and $\phi_2 \in \Phi_2$, associated with the two factors in (10). The dimensions of the sets are shown in Table I; note that $\dim(\Phi) = \dim(\Phi_1) + \dim(\Phi_2) < \dim(\Theta)$. The vectors $\phi_1$ and $\phi_2$ are also formed by concatenating the elements of their respective matrices, as in the case of $\theta$. More details are found in the proof of Theorem 2 in Appendix A.
TABLE I: Dimensions of the parameter sets

Set | Dimension
---|---
$\Theta$ | $\lvert\mathcal{X}\rvert^{k}\,\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k}\,\big(\lvert\mathcal{X}\rvert\,\lvert\mathcal{Y}\rvert\,\lvert\mathcal{Z}\rvert - 1\big)$
$\Phi_1$ | $\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k+1}\,\big(\lvert\mathcal{Y}\rvert - 1\big)$
$\Phi_2$ | $\lvert\mathcal{X}\rvert^{k}\,\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k}\,\big(\lvert\mathcal{X}\rvert\,\lvert\mathcal{Z}\rvert - 1\big)$
Now consider the Neyman–Pearson criterion to test the hypothesis $\phi \in \Phi$ within $\theta \in \Theta$.
Definition 3.
The log-likelihood of the observed samples under the transition matrix parametrized by $\theta$ is defined as
$$\ell_n(\theta) \triangleq \sum_{t=k+1}^{n} \log P_\theta\big(x_t, y_t, z_t \,\big|\, x_{t-k}^{t-1}, y_{t-k}^{t-1}, z_{t-k}^{t-1}\big).$$
Let $\hat{\theta}$ and $\hat{\phi}$ be the most likely choices of the transition matrix under the general model $\Theta$ and the null hypothesis $\Phi$, respectively, i.e.,
$$\hat{\theta} \triangleq \arg\max_{\theta \in \Theta} \ell_n(\theta), \qquad \hat{\phi} \triangleq \arg\max_{\phi \in \Phi} \ell_n(\phi). \tag{11}$$
As a result, the test for causality boils down to checking the difference between likelihoods, i.e., the log-likelihood ratio:
$$\hat{L} \triangleq \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}), \tag{12}$$
which is the Neyman–Pearson criterion for testing $\Phi$ within $\Theta$. Then, in the following theorem, $2\hat{L}$ is shown to converge to a $\chi^2$ distribution of finite degree. The proof follows from standard results in [11, Th. 6.1].
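Since the maximizers in (11) are the empirical conditional frequencies, the statistic (12) can be computed directly from transition counts. A minimal sketch for $k = 1$ (helper names are ours; the null hypothesis is factorized as in (10)):

```python
from collections import Counter
from math import log

def glrt_statistic(x, y, z):
    """Log-likelihood ratio (12) for H0: no edge X -> Y, with k = 1.

    The ML transition law under the general model is the empirical
    conditional P(x',y',z' | x,y,z); under H0 it factorizes as in (10):
    P(y' | y,z,z') * P(x',z' | x,y,z)."""
    n = len(x)
    full, state = Counter(), Counter()
    yj, yc = Counter(), Counter()
    xzj, xzc = Counter(), Counter()
    for t in range(n - 1):
        s = (x[t], y[t], z[t])
        full[(s, x[t + 1], y[t + 1], z[t + 1])] += 1
        state[s] += 1
        yj[(y[t], z[t], z[t + 1], y[t + 1])] += 1
        yc[(y[t], z[t], z[t + 1])] += 1
        xzj[(s, x[t + 1], z[t + 1])] += 1
        xzc[s] += 1
    ll_gen = sum(c * log(c / state[key[0]]) for key, c in full.items())
    ll_null = (sum(c * log(c / yc[key[:3]]) for key, c in yj.items())
               + sum(c * log(c / xzc[key[0]]) for key, c in xzj.items()))
    return ll_gen - ll_null  # equals (n - k) times the plug-in DI (Lemma 1)
```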
Theorem 2.
If Assumptions 1–3 hold and the edge is absent in the true graph, i.e., $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$, then, as $n \to \infty$,
$$2\hat{L} \xrightarrow{d} \chi^2_{d}, \qquad d \triangleq \dim(\Theta) - \dim(\Phi).$$
Proof:
See Appendix A. ∎
Remark 2.
Note that the asymptotic result from Theorem 2 depends only on the dimensions of the sets $\Theta$ and $\Phi$, and not on the particular pair of nodes involved. Furthermore, the result also holds for a network with more than three nodes by properly defining the dimensions of the sets.
Remark 3.
Knowledge about the absence of edges other than $(X, Y)$ in the network results in the test statistic converging to a $\chi^2$ distribution of higher degree, since (10) could be further factorized. To see this, assume $A^*_{XY} = 0$ and consider that knowledge about the other links (for example, the whole adjacency matrix $A^*$) was given. Then, let the transition probability be factorized as much as possible, so it can be parametrized by $\phi' \in \Phi'$, which has lower or equal dimension than $\phi \in \Phi$. Take
$$\hat{L}' \triangleq \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}'),$$
where
$$\hat{\phi}' \triangleq \arg\max_{\phi' \in \Phi'} \ell_n(\phi').$$
Intuitively, by following similar steps as in the proof of Theorem 2, we obtain that $2\hat{L}'$ behaves as a $\chi^2_{d'}$ random variable, where $d' \triangleq \dim(\Theta) - \dim(\Phi') \ge d$. Since the cumulative distribution function of the $\chi^2$ distribution is a decreasing function with respect to the degree, and $\hat{L} \le \hat{L}'$, then,
$$\Pr\big(2\hat{L} \le x\big) \ge \Pr\big(2\hat{L}' \le x\big) \ge \gamma\Big(\tfrac{d'}{2}, \tfrac{x}{2}\Big) - \delta, \tag{13}$$
for sufficiently large $n$ and any $\delta > 0$. The lower bound in (13) allows us to ignore the knowledge about other nodes.
Consider now the estimation of the directed information, defined by plugging the empirical distribution (instead of the true distribution) into (5), i.e.,
$$\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq \hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - \hat{H}\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big).$$
Then, the following lemma states that $\hat{L}$ is proportional to $\hat{I}$ with an $n$ factor.
Lemma 1.
$\hat{L} = n\, \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$, where $\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ is the plug-in estimator of the directed information.
Proof:
The proof follows from standard definitions and noting that the KL-divergence is non-negative and minimized by zero. See Appendix B for the complete proof. ∎
Now, let us define the decision rule for checking the existence of an edge in the graph as follows:
$$\hat{A}_{XY} \triangleq \mathbb{1}\big\{ \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \ge \epsilon_n \big\},$$
where $\epsilon_n$ is of order $o(1)$ with $n\epsilon_n \to \infty$. Then, for any knowledge about the states of other edges in the true graph, as long as $A^*_{XY} = 0$ we have:
$$\Pr\big(\hat{A}_{XY} = 1\big) = \Pr\big(2\hat{L} \ge 2 n \epsilon_n\big) \le 1 - \gamma\Big(\tfrac{d'}{2},\, n\epsilon_n\Big) + \delta, \tag{14}$$
where the inequality is due to Remark 3.
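A sketch of the resulting single-edge decision rule, reusing plugin_cmi from the earlier sketch; the default threshold $(\log n)^2 / n$ is one assumed choice satisfying $\epsilon_n \to 0$ and $n\epsilon_n \to \infty$:

```python
import math

def edge_test(x, y, z, eps=None):
    """Single-edge decision rule: declare X -> Y iff the plug-in
    directed-information estimate exceeds the threshold eps."""
    n = len(x)
    if eps is None:
        eps = math.log(n) ** 2 / n   # assumed: eps -> 0, n * eps -> inf
    return plugin_cmi(x, y, z) >= eps
```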
From Theorem 2 and Lemma 1, it is inferred that when $A^*_{XY} = 0$ in the true adjacency matrix $A^*$, the empirical estimation of the directed information converges to zero with a $\chi^2$ distribution at a rate $1/n$. The asymptotic behavior of $\hat{I}$ is different if the edge is present, i.e., $A^*_{XY} = 1$, which is addressed in the following theorem.
Theorem 3.
If Assumptions 1–3 hold and the edge exists in the true graph, i.e., $I \triangleq I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$, then
$$\sqrt{n}\,\big(\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) - I\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_I^2\big) \tag{15}$$
for some finite $\sigma_I > 0$. Consequently, for any threshold $\epsilon_n \to 0$,
$$\beta_n \triangleq \Pr\big(\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) < \epsilon_n\big) \to 0 \quad \text{as } n \to \infty. \tag{16}$$
Proof:
The empirical estimator can be decomposed into two parts, where the first one vanishes at a rate faster than $1/\sqrt{n}$ and the second part converges at a rate $1/\sqrt{n}$. The condition $I > 0$ keeps the second part positive, so it determines the asymptotic convergence of $\hat{I}$. Refer to Appendix C for further details. ∎
Remark 4.
Knowledge about the state of other edges in the true graph model does not affect the asymptotic behavior presented in Theorem 3, given that the condition $I > 0$ makes the convergence of the estimator independent of all other nodes. This can be seen by following the steps of the proof, where we only use the fact that if the true edge exists, then $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$ and the factorization (10) does not hold.
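The two convergence regimes can be observed numerically. The toy two-node Monte Carlo sketch below (the model and parameter values are ours, for illustration only) shows that $n\hat{I}$ stays bounded when no edge exists, in line with the $\chi^2$ limit, while $\hat{I}$ itself stabilizes around a positive value when the edge is present:

```python
import numpy as np
from collections import Counter
from math import log

rng = np.random.default_rng(1)

def simulate(n, edge):
    """Binary first-order toy chain: Y_{t+1} depends on Y_t and, if `edge`
    is set, also on X_t; X evolves on its own."""
    x = np.zeros(n, dtype=int)
    y = np.zeros(n, dtype=int)
    for t in range(n - 1):
        x[t + 1] = rng.random() < (0.7 if x[t] else 0.3)
        p = 0.5 + (0.3 if edge else 0.0) * (2 * x[t] - 1)
        y[t + 1] = rng.random() < (p if y[t] == 0 else 1 - p)
    return x, y

def plugin_cmi_pair(x, y):
    """Plug-in estimate of I(X_t; Y_{t+1} | Y_t)."""
    n = len(x)
    joint = Counter((x[t], y[t + 1], y[t]) for t in range(n - 1))
    pj = {key: v / (n - 1) for key, v in joint.items()}
    pac, pbc, pc = Counter(), Counter(), Counter()
    for (a, b, c), p in pj.items():
        pac[(a, c)] += p
        pbc[(b, c)] += p
        pc[c] += p
    return sum(p * log(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in pj.items())

for n in [1000, 4000, 16000]:
    i0 = plugin_cmi_pair(*simulate(n, edge=False))
    i1 = plugin_cmi_pair(*simulate(n, edge=True))
    # no edge: n * I_hat stays O(1); edge: I_hat itself stabilizes
    print(f"n={n:>6}  no-edge n*I_hat={n * i0:6.2f}  edge I_hat={i1:.4f}")
```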
III-B Hypothesis Test over an Entire Graph
The performance of testing a hypothesis for a graph is studied by means of the false alarm and detection probabilities defined in (6) and (7), respectively. The results may be considered as an extension of the hypothesis test over a single edge in the graph.
First, let the false alarm probability be written as
$$P_F = \Pr\big(\hat{A} = A \,\big|\, A^* \neq A\big).$$
If $A \neq A^*$, there exist nodes $i$ and $j$ such that $A_{ij} \neq A^*_{ij}$. Hence,
$$P_F \le \Pr\big(\hat{A}_{ij} = A_{ij} \,\big|\, A^*_{ij} \neq A_{ij}\big) \tag{17}$$
$$= \begin{cases} \Pr\big(\hat{I}_{ij} \ge \epsilon_n\big), & A^*_{ij} = 0, \\ \Pr\big(\hat{I}_{ij} < \epsilon_n\big), & A^*_{ij} = 1, \end{cases} \tag{18}$$
$$\le \max\Big\{ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta,\; \beta_n \Big\}, \tag{19}$$
where $\hat{I}_{ij}$ denotes the estimated directed information from node $i$ to node $j$, and (19) holds due to (14) and (16).
On the other hand, the complement of the detection probability may be upper-bounded using the union bound:
$$1 - P_D \le \sum_{(i,j):\, A_{ij} = 0} \Pr\big(\hat{I}_{ij} \ge \epsilon_n\big) + \sum_{(i,j):\, A_{ij} = 1} \Pr\big(\hat{I}_{ij} < \epsilon_n\big) \le N_0 \Big[ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta \Big] + N_1 \beta_n, \tag{20}$$
where $N_0$ and $N_1$ are the number of off-diagonal $0$s and $1$s in the true matrix $A^* = A$, i.e., $N_0 = \bar{N}$ and $N_1 = N$. The last inequality holds due to (14) and (16).
Since $\epsilon_n \to 0$ and it is of order $\omega(1/n)$, then, as $n \to \infty$, and noting that
$$\gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) \to 1 \quad \text{and} \quad \beta_n \to 0,$$
we have that
$$P_F \to 0, \tag{21}$$
$$P_D \to 1. \tag{22}$$
This concludes the proof of Theorem 1.
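Putting the pieces together, the graph-level test simply applies the single-edge rule to every ordered pair; a sketch reusing plugin_cmi (the merging of the side processes into one tuple-valued sequence is our simplification):

```python
def estimate_graph(processes, eps):
    """Estimate the adjacency matrix by testing every ordered pair (i, j):
    declare an edge i -> j iff the plug-in DI of process i to process j,
    causally conditioned on the remaining processes, exceeds eps."""
    m = len(processes)
    A_hat = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            rest = [processes[r] for r in range(m) if r not in (i, j)]
            # merge all side processes into one tuple-valued sequence
            z = list(zip(*rest)) if rest else [0] * len(processes[0])
            if plugin_cmi(processes[i], processes[j], z) >= eps:
                A_hat[i][j] = 1
    return A_hat
```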
Corollary 1.
With a threshold such that $\epsilon_n \to 0$ and $n\epsilon_n \to \infty$, the hypothesis test over the whole graph is asymptotically optimal, i.e., $P_F \to 0$ and $P_D \to 1$ as $n \to \infty$.
III-C Numerical Results
The bounds derived in Theorem 1 state that the detection probability can be desirably close to one while the false alarm probability can remain near zero with a proper threshold test. In Fig. 1, these bounds are depicted with respect to different values of $n$ for a network with three nodes. The joint process is assumed to be a Markov process of order $k$, and the random variables take values on a binary alphabet (cardinality $2$).
It can be observed in the figure that, for fixed $n$, the bound on $P_D$ improves as $N$ decreases, i.e., when the graph becomes sparser. Furthermore, by a proper choice of $\epsilon_n$, we can reach the optimal performance of the hypothesis test, i.e., $P_F \to 0$ and $P_D \to 1$.
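A sketch reproducing curves of this kind with scipy and matplotlib; the degrees of freedom, edge strength, and variance below are assumed, illustrative values:

```python
import numpy as np
from scipy.special import gammainc
from scipy.stats import norm
import matplotlib.pyplot as plt

m, d = 3, 24                       # three nodes; d an assumed value
I_min, sigma_I = 0.1, 1.0          # assumed edge strength and std. dev.
ns = np.logspace(2.5, 5, 50)
eps = np.log(ns) ** 2 / ns         # threshold: eps -> 0, n * eps -> inf
fa = 1.0 - gammainc(d / 2.0, ns * eps)                # per-edge false alarm
miss = norm.sf(np.sqrt(ns) * (I_min - eps) / sigma_I) # per-edge miss
for N in [1, 3, 6]:                # edges in the hypothesis graph
    N_bar = m * (m - 1) - N
    plt.plot(ns, 1.0 - N_bar * fa - N * miss, label=f"N = {N}")
plt.xscale("log")
plt.xlabel("number of samples n")
plt.ylabel("lower bound on P_D")
plt.legend()
plt.show()
```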
IV Summary and Remarks
In this paper, we investigated the performance of a hypothesis test for detecting the underlying directed graph of a network of stochastic processes, which represents the causal relationships among nodes, using the empirically calculated directed information as the measure. We showed that the convergence rate of the directed information estimator depends on whether the link exists in the real structure. We further showed that, with a proper adjustment of the threshold test for single edges, the overall hypothesis test is asymptotically optimal.
This work may be expanded by considering a detailed analysis on the sample complexity of the hypothesis test. Moreover, we assumed in this work that the estimator has access to samples from the whole network while in practice this might not be the case (see e.g. [12]).
Appendix A Proof of Theorem 2
For any right stochastic matrix $M$ with $c$ columns, let the matrix $\tilde{M}$ denote the first $c - 1$ (linearly independent) columns of $M$, as depicted in Fig. 2.
Without loss of generality, consider each alphabet to be a set of consecutive integers, which simplifies the indexing of elements in the alphabet. Next define the vector $\theta$ with components
$$\theta_{\nu(r,c)} \triangleq \tilde{P}_{r,c}, \tag{23}$$
each associated with an element of $\tilde{P}$ (the sub-matrix of the transition probability matrix $P$). Every element of $\tilde{P}$ can be addressed via the pair $(r, c)$, where $r$ and $c$ indicate the row and column of that element, respectively. Also, let $\nu(r, c)$ denote the index of that element when vectorizing $\tilde{P}$ row-wise. Any possible transition matrix can then be indexed with a vector $\theta \in \Theta$, where $\Theta$ has dimension $\dim(\Theta)$ (see Table I) and $\theta$ is constructed by concatenation of the rows of $\tilde{P}$.
Suppose now that $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$ or equivalently, by definition (2), $A_{XY} = 0$. Then
$$P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) = P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big). \tag{24}$$
Thus, the transition matrix is determined by the elements of two other matrices, $Q_1$ and $Q_2$, corresponding to the two factors in (24). Define the vectors $\phi_1$ and $\phi_2$, which are associated with the elements of $\tilde{Q}_1$ and $\tilde{Q}_2$ in the same manner as $\theta$ is associated with $\tilde{P}$. The pairs of row and column indices for each element in $\tilde{Q}_1$ and $\tilde{Q}_2$ are denoted by $(r_1, c_1)$ and $(r_2, c_2)$, with vectorized indices $\nu_1(r_1, c_1)$ and $\nu_2(r_2, c_2)$, respectively.
A matrix factorized as in (24) can be parametrized by a vector $\phi \in \Phi$, where $\Phi$ has dimension $\dim(\Phi_1) + \dim(\Phi_2)$ (see Table I). Then
$$\phi = (\phi_1, \phi_2),$$
where $\phi_1$ and $\phi_2$, which determine $Q_1$ and $Q_2$, are vectors of length $\dim(\Phi_1)$ and $\dim(\Phi_2)$, and are constructed by concatenating the rows of $\tilde{Q}_1$ and $\tilde{Q}_2$, respectively. There exists then a mapping $f : \Phi \to \Theta$ such that, component-wise,
$$\theta_{\nu(r,c)} = f_{\nu(r,c)}(\phi), \tag{25}$$
where each component is the product of the corresponding elements of $Q_1$ and $Q_2$ in (24).
Consider the matrix $J$ of size $\dim(\Theta) \times \dim(\Phi)$ such that, for every element,
$$J_{\nu(r,c),\, l} \triangleq \frac{\partial f_{\nu(r,c)}(\phi)}{\partial \phi_{l}}. \tag{26}$$
In other words, every row of the matrix $J$ contains the derivatives of an element of $\theta$ with respect to all elements of $\phi_1$, followed by the derivatives with respect to $\phi_2$.
According to [11, Th. 6.1], and by the definition of $\hat{\theta}$ and $\hat{\phi}$ in (11),
$$2\hat{L} \xrightarrow{d} \chi^2_{d}, \qquad d = \dim(\Theta) - \dim(\Phi),$$
if $f$ has continuous third-order partial derivatives and $J$ is of rank $\dim(\Phi)$. The first condition holds according to the definition of $f$ in (25). To verify the second condition, we can observe that there exist four types of rows in $J$:
- Type 1: Take the rows in (26) such that both $c_1$ and $c_2$ index linearly independent columns of $Q_1$ and $Q_2$. This means that in the $\nu(r,c)$-th row of $J$, the derivatives are taken from
$$\theta_{\nu(r,c)} = [\phi_1]_{\nu_1(r_1,c_1)}\, [\phi_2]_{\nu_2(r_2,c_2)}.$$
So all elements in such rows are zero except in the columns $\nu_1(r_1,c_1)$ and $\dim(\Phi_1) + \nu_2(r_2,c_2)$, which take the values $[\phi_2]_{\nu_2(r_2,c_2)}$ and $[\phi_1]_{\nu_1(r_1,c_1)}$, respectively.
- Type 2: Now consider the rows such that $c_1$ indexes the last (linearly dependent) column of $Q_1$ and $c_2$ a linearly independent column of $Q_2$. This means that in the $\nu(r,c)$-th row of $J$, the derivatives are taken from
$$\theta_{\nu(r,c)} = \Big(1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}\Big)\, [\phi_2]_{\nu_2(r_2,c_2)}.$$
So all elements in such rows are zero except in the columns (among the first $\dim(\Phi_1)$ columns) associated with row $r_1$ of $\tilde{Q}_1$, which are equal to $-[\phi_2]_{\nu_2(r_2,c_2)}$, and the column $\dim(\Phi_1) + \nu_2(r_2,c_2)$, which is equal to $1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}$.
- Type 3: Consider the rows such that $c_1$ indexes a linearly independent column of $Q_1$ and $c_2$ the last column of $Q_2$. Then, all elements of such rows are zero except in the column $\nu_1(r_1,c_1)$, which takes the value
$$1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'},$$
and the columns (among the last $\dim(\Phi_2)$ columns) associated with row $r_2$ of $\tilde{Q}_2$, which are equal to $-[\phi_1]_{\nu_1(r_1,c_1)}$.
- Type 4: Lastly, consider the rows such that $c_1$ and $c_2$ index the last columns of $Q_1$ and $Q_2$, respectively. Then, the only non-zero elements belong to the columns associated with row $r_1$ of $\tilde{Q}_1$ (among the first $\dim(\Phi_1)$ columns), which are equal to
$$-\Big(1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'}\Big),$$
and the columns associated with row $r_2$ of $\tilde{Q}_2$ (among the last $\dim(\Phi_2)$ columns), which are equal to
$$-\Big(1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}\Big).$$
We show now that if a linear combination of all columns of $J$ equals the zero vector, then all coefficients must be zero as well. Let $J_l$ be the $l$-th column of $J$; then, if
$$\sum_{l} a_l\, J_l = \mathbf{0}, \tag{27}$$
then $a_l = 0$ for all $l$. To see this, consider the Type 1 row determined by the indices $(r_1, c_1)$ and $(r_2, c_2)$. Since it only has two non-zero elements, we have that
$$a_{\nu_1(r_1,c_1)}\, [\phi_2]_{\nu_2(r_2,c_2)} + a_{\dim(\Phi_1) + \nu_2(r_2,c_2)}\, [\phi_1]_{\nu_1(r_1,c_1)} = 0. \tag{28}$$
Then, take the Type 3 row with the same $(r_1, c_1)$ and $r_2$, where we have that
$$a_{\nu_1(r_1,c_1)} \Big(1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'}\Big) - \sum_{c'} a_{\dim(\Phi_1) + \nu_2(r_2,c')}\, [\phi_1]_{\nu_1(r_1,c_1)} = 0. \tag{29}$$
From (28), and noting that all transition probabilities are positive by Assumption 2, if $a_{\nu_1(r_1,c_1)} > 0$, then $a_{\dim(\Phi_1)+\nu_2(r_2,c')} < 0$ for all $c'$. Hence, the left-hand side of (29) is strictly positive and not zero. An analogous result is found assuming $a_{\nu_1(r_1,c_1)} < 0$. By contradiction, we conclude that $a_{\nu_1(r_1,c_1)} = 0$, and from (28), $a_{\dim(\Phi_1)+\nu_2(r_2,c_2)} = 0$.
By varying $(r_1, c_1)$ and $(r_2, c_2)$ over all combinations, we derive that all $a_l$s are zero and, as a result, $J$ has linearly independent columns, which meets the second condition. The proof of Theorem 2 is thus complete. ∎
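The rank condition can also be checked numerically for small alphabets. The sketch below (assuming binary alphabets and $k = 1$; the row/column encodings are ours) builds the map $f$ of (25) from the factorization (24) and verifies that a finite-difference Jacobian has full column rank $\dim(\Phi) = 32$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary alphabets, k = 1: the joint state s = (x, y, z) takes 8 values.
# Free parameters (an assumed toy instantiation of Table I):
#   Q1(y' | y, z, z'):    8 rows x (|Y| - 1) = 1 free col  ->  8 parameters
#   Q2(x', z' | x, y, z): 8 rows x (|X||Z| - 1) = 3 free cols -> 24 parameters
def f(phi):
    """Map phi -> theta: the 8 x 7 free part of the transition matrix,
    built from the factorization (24)."""
    q1 = phi[:8]                     # P(y' = 0 | y, z, z')
    q2 = phi[8:].reshape(8, 3)       # P((x', z') = first 3 cells | s)
    theta = np.empty((8, 7))
    for s in range(8):
        x, y, z = (s >> 2) & 1, (s >> 1) & 1, s & 1
        q2_row = np.append(q2[s], 1.0 - q2[s].sum())  # full P(x', z' | s)
        for c in range(7):           # c encodes (x', y', z'); last cell dropped
            xp, yp, zp = (c >> 2) & 1, (c >> 1) & 1, c & 1
            r1 = (y << 2) | (z << 1) | zp             # row of Q1: (y, z, z')
            p_y = q1[r1] if yp == 0 else 1.0 - q1[r1]
            theta[s, c] = p_y * q2_row[(xp << 1) | zp]
    return theta.ravel()

phi0 = rng.uniform(0.1, 0.2, size=32)  # interior point, all probs positive
h = 1e-6
J = np.empty((56, 32))                 # dim(Theta) x dim(Phi)
for j in range(32):
    e = np.zeros(32)
    e[j] = h
    J[:, j] = (f(phi0 + e) - f(phi0 - e)) / (2 * h)

print(np.linalg.matrix_rank(J))        # expected: 32, full column rank
```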
Appendix B Proof of Lemma 1
The proof follows similar steps as the one in [10, Prop. 9]. Using the definition of the log-likelihood,
$$\ell_n(\theta) = n \sum_{w^{k+1}} \hat{P}\big(w^{k+1}\big) \log P_\theta\big(w_{k+1} \,\big|\, w^{k}\big) = -n\Big[\hat{H}\big(W_{k+1} \,\big|\, W^{k}\big) + D\big(\hat{P}\big(\cdot \,\big|\, W^{k}\big) \,\big\|\, P_\theta\big(\cdot \,\big|\, W^{k}\big) \,\big|\, \hat{P}\big)\Big], \tag{30}$$
where $D(\cdot \,\|\, \cdot \,|\, \cdot)$ denotes the conditional KL-divergence and, with a slight abuse of notation, $n$ stands for the number of transitions $n - k$.
Since the KL-divergence is non-negative and minimized by zero, then
$$\ell_n(\hat{\theta}) = -n\, \hat{H}\big(W_{k+1} \,\big|\, W^{k}\big). \tag{31}$$
On the other hand, for the second log-likelihood, we have:
$$\ell_n(\phi) = n \sum_{w^{k+1}} \hat{P}\big(w^{k+1}\big) \log \Big[ P_{\phi_1}\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P_{\phi_2}\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) \Big].$$
With a similar approach as in (30), we can expand $\ell_n(\phi)$ and maximize each factor separately, which yields (32):
$$\ell_n(\hat{\phi}) = -n\Big[\hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) + \hat{H}\big(X_{k+1}, Z_{k+1} \,\big|\, X^{k}, Y^{k}, Z^{k}\big)\Big]. \tag{32}$$
As a result,
$$\hat{L} = \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}) = n\Big[\hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - \hat{H}\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big)\Big] = n\, \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}). \tag{33}$$
$$\begin{aligned} \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) ={}& D\big(\hat{P}_{X^{k+1}, Y^{k+1}, Z^{k+1}} \,\big\|\, P\big) - D\big(\hat{P}_{X^{k+1}, Y^{k}, Z^{k+1}} \,\big\|\, P\big) \\ &- D\big(\hat{P}_{Y^{k+1}, Z^{k+1}} \,\big\|\, P\big) + D\big(\hat{P}_{Y^{k}, Z^{k+1}} \,\big\|\, P\big) \\ &+ \frac{1}{n} \sum_{t=k+1}^{n} \log \frac{P\big(Y_t \,\big|\, X_{t-k}^{t}, Y_{t-k}^{t-1}, Z_{t-k}^{t}\big)}{P\big(Y_t \,\big|\, Y_{t-k}^{t-1}, Z_{t-k}^{t}\big)} \end{aligned} \tag{35}$$
Appendix C Proof of Theorem 3
We begin by expanding the expression $\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ using the definition of the empirical distribution in (3), and we obtain (35), found at the bottom of the next page. We then proceed to analyze the asymptotic behavior of the estimator.
The first four terms in (35), i.e., the KL-divergence terms, decay faster than $1/\sqrt{n}$; this is shown later in the proof. On the other hand, since $I \triangleq I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$, due to (5) and Assumption 3, the last term in (35) is non-zero and dominates the convergence of the estimator, as we see next. Here, one observes that conditioning on $I > 0$ is sufficient to analyze the convergence of $\hat{I}$, and further knowledge about other edges is irrelevant (see Remark 4). We then conclude that
$$\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = \frac{1}{n} \sum_{t=k+1}^{n} g\big(W_{t-k}^{t}\big) + o\big(1/\sqrt{n}\big),$$
where
$$g\big(w^{k+1}\big) \triangleq \log \frac{P\big(y_{k+1} \,\big|\, x^{k+1}, y^{k}, z^{k+1}\big)}{P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)}.$$
We note that $g$ is a functional of the chain and its mean is $\mathbb{E}_\pi[g] = I$. The chain is ergodic and we can thus apply the central limit theorem [13, Sec. I.16] to the partial sums to obtain
$$\sqrt{n}\left(\frac{1}{n}\sum_{t=k+1}^{n} g\big(W_{t-k}^{t}\big) - I\right) \xrightarrow{d} \mathcal{N}\big(0, \sigma_I^2\big), \tag{36}$$
where $\sigma_I$ is bounded.
Now, to complete the proof, it only remains to show that the KL-divergence terms in (35), multiplied by a factor $\sqrt{n}$, converge to zero as $n \to \infty$. We present the proof for one term and the others follow a similar approach. We first recall the Taylor expansion with Lagrange remainder form,
$$\log(1 + u) = u - \frac{u^2}{2(1 + \xi)^2},$$
for some $\xi$ between $0$ and $u$. Then, let us define $u_w \triangleq \big(\hat{P}(w) - P(w)\big)/P(w)$ so we can expand the first KL-divergence term as:
$$D\big(\hat{P} \,\big\|\, P\big) = \sum_{w} \big(\hat{P}(w) - P(w)\big)\, u_w - \sum_{w} \hat{P}(w)\, \frac{u_w^2}{2(1 + \xi_w)^2} \tag{37}$$
$$\le C \sum_{w} \big(\hat{P}(w) - P(w)\big)^2, \tag{38}$$
for some $\xi_w$ between $0$ and $u_w$, where
$$C \triangleq \max_{w} \frac{1}{P(w)},$$
and (37) follows due to $\sum_{w} \big(\hat{P}(w) - P(w)\big) = 0$.
Since the Markov model is assumed to be ergodic (Assumption 2), $P(w) > 0$ for every $w$, and therefore $C$ is bounded. Now consider the sequence
$$V_t \triangleq \mathbb{1}\big\{ W_{t-k}^{t} = w^{k+1} \big\},$$
with mean $\mathbb{E}[V_t] = P\big(w^{k+1}\big)$. According to the law of the iterated logarithm,
$$\limsup_{n \to \infty} \frac{\Big|\sum_{t=k+1}^{n} \big(V_t - P(w^{k+1})\big)\Big|}{\sqrt{2 n \log\log n}} < \infty \qquad \text{a.s.}$$
Using the definition of the empirical distribution, this implies
$$\big|\hat{P}\big(w^{k+1}\big) - P\big(w^{k+1}\big)\big|^2 = O\Big(\frac{\log\log n}{n}\Big) \qquad \text{a.s.} \tag{39}$$
As a result, we can rewrite (38) and conclude that
$$\sqrt{n}\, D\big(\hat{P} \,\big\|\, P\big) = O\Big(\frac{\log\log n}{\sqrt{n}}\Big) \to 0,$$
given that each term in the finite sum is bounded. Therefore, as $n \to \infty$, the four KL-divergence terms in (35), multiplied by a factor $\sqrt{n}$, tend to zero, and the proof of Theorem 3 is thus complete. ∎
References
- [1] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, “Estimating the Directed Information to Infer Causal Relationships in Ensemble Neural Spike Train Recordings,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 17–44, Feb. 2011.
- [2] K. Friston, L. Harrison, and W. Penny, “Dynamic Causal Modelling,” NeuroImage, vol. 19, no. 4, pp. 1273–1302, Aug. 2003.
- [3] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Directed Information Graphs,” IEEE Trans. Inf. Theory, vol. 61, no. 12, pp. 6887–6909, Dec. 2015.
- [4] W. M. Lord, J. Sun, N. T. Ouellette, and E. M. Bollt, “Inference of Causal Information Flow in Collective Animal Behavior,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2, no. 1, pp. 107–116, Jun. 2016.
- [5] J. Jiao, H. H. Permuter, L. Zhao, Y. H. Kim, and T. Weissman, “Universal Estimation of Directed Information,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6220–6242, Oct. 2013.
- [6] C. W. J. Granger, “Investigating Causal Relations by Econometric Models and Cross-spectral Methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
- [7] J. Massey, “Causality, Feedback and Directed Information,” in Proc. Int. Symp. Inf. Theory Applic. (ISITA), Honolulu, HI, USA, Nov. 1990, pp. 303–305.
- [8] P.-O. Amblard and O. J. J. Michel, “On Directed Information Theory and Granger Causality Graphs,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 7–16, Feb. 2011.
- [9] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs,” IEEE Trans. Signal Process., vol. 61, no. 12, pp. 3173–3182, Jun. 2013.
- [10] I. Kontoyiannis and M. Skoularidou, “Estimating the Directed Information and Testing for Causality,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 6053–6067, Nov. 2016.
- [11] P. Billingsley, Statistical Inference for Markov Processes. Chicago, IL, USA: Univ. of Chicago Press, 1961.
- [12] J. Scarlett and V. Cevher, “Lower Bounds on Active Learning for Graphical Model Selection,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 Apr. 2017, pp. 55–64.
- [13] K. L. Chung, Markov Chains: With Stationary Transition Probabilities, 2nd ed., ser. A Series of Comprehensive Studies in Mathematics. New York, NY, USA: Springer-Verlag, 1967, vol. 104.