Testing for Directed Information Graphs
Abstract
In this paper, we study a hypothesis test to determine the underlying directed graph structure of nodes in a network, where the nodes represent random processes and the direction of the links indicates a causal relationship between said processes. Specifically, a $k$-th order Markov structure is considered for them, and the chosen metric to determine a connection between nodes is the directed information. The hypothesis test is based on the empirically calculated transition probabilities which are used to estimate the directed information. For a single edge, it is proven that the detection probability can be chosen arbitrarily close to one, while the false alarm probability remains negligible. When the test is performed on the whole graph, we derive bounds for the false alarm and detection probabilities, which show that the test is asymptotically optimal by properly setting the test threshold and using a large number of samples. Furthermore, we study how the convergence rate of the estimates depends on the existence of links in the true graph.
I Introduction
Causality is a concept that expresses the joint behavior in time of a group of components in a system. In general, it denotes the effect of one component on itself and on others in the system during a time period. Consider a network of nodes, each producing a signal in time. These processes can behave independently, or there might be an underlying natural connection between them. Inferring this structure is of great interest in many applications. In [1], for instance, neurons are taken as components while the time series of produced spikes are used to derive the underlying structure. Dynamical models are also a well-known tool to understand the functional interactions among neurons [2]. Additionally, in social networks, there is an increasing interest in estimating influences among users [3], while further applications exist in biology [4], economics [5], and many other fields.
Granger [6] defined the notion of causality between two time series by using a linear autoregressive model and comparing the estimation errors in two scenarios: when the history of the second node is accounted for and when it is not. Models that operate non-linearly, however, are poorly captured by this definition. Directed information was first introduced to address the flow of information in a communication set-up, and it was suggested by Massey [7] as a measure of causality since it is not limited to linear models. There exist other methods which may qualify for different applications. Several of these definitions are compared in [1], where directed information is argued to be a robust measure of causality. There are also symmetric measures, like correlation or mutual information, but they can only represent a mutual relationship between nodes and not a directed one.
The underlying causal structure of a network of processes can be properly visualized by a directed graph. In particular, in a Directed Information Graph (DIG), introduced simultaneously by Amblard and Michel [8] and Quinn et al. [1], the existence of an edge is determined by the value of the directed information between two nodes considering the history of the rest of the network. There are different approaches to tackle the problem of detecting and estimating this type of graph. Directed information can be estimated based on prior assumptions on the processes' structure, such as Markov properties, by empirically calculating probabilities [1, 3]. On the other hand, Jiao et al. [5] propose a universal estimator of directed information which is not restricted to any Markov assumption. Nonetheless, at the core of their technique, they consider a context tree weighting algorithm with different depths, which intuitively resembles a learning algorithm for estimating the order of a Markov structure. Other assumptions used in the study of DIGs, which constrain the structure of the underlying graph, are tree structures [9] or a limit on the nodes' degree [3].
The estimation performance in the detection of edges of a DIG is crucial since it allows one to characterize, for instance, the optimum test for detection, or the minimum number of samples needed to reliably obtain the underlying model, i.e., the sample complexity. In [3], the authors derive a bound on the sample complexity using total variation when the directed information between two nodes is empirically estimated. Following that work, Kontoyiannis et al. [10] investigate the performance of a test for causality between two nodes, and they show that the convergence rate of the empirical directed information can be improved when it is analyzed conditioned on the true relationship between the nodes. In other words, the underlying structure of the true model has an effect on the detection performance of the whole graph. Motivated by this result, in this paper, we study a hypothesis test over a complete graph (not just a link between two nodes) when the directed information is empirically estimated, and we provide interesting insights into the problem. Moreover, we show that for every existing edge in the true graph, the estimation converges at a rate $1/\sqrt{n}$, while if there is no edge in the true model, the convergence is of order $1/n$.
The rest of the paper is organized as follows. In Section II, notation and definitions are introduced. In particular, the directed information is reviewed and the definition of an edge in a DIG is presented. The main results of our work are then shown in Section III. First, the performance of a hypothesis test for a single edge is studied, where we analyze the asymptotic behavior of the estimators based on the knowledge about the true edges. Then, we demonstrate how the detection of the whole graph relies on the test for each edge. Finally, the paper is concluded in the last section.
II Preliminaries
Assume a network with $m$ nodes representing the processes $\mathbf{X}_1, \ldots, \mathbf{X}_m$. The observation of the $i$-th process in the discrete time interval $1$ to $n$ is described by the random vector $X_i^n = (X_{i,1}, \ldots, X_{i,n})$, which at each time takes values on a discrete alphabet $\mathcal{X}_i$. With a slight abuse of notation, $Y_t$ and $Y_t^{t'}$ represent the observations of the process $\mathbf{Y}$ at instant $t$ and in the interval $t$ to $t'$, respectively.
The metric used to describe the causality relationship of these processes is the directed information, as suggested previously, since it can describe more general structures without further assumptions (such as linearity). Directed information is mainly used in information theory to characterize channels with feedback and it is defined based on causally conditioned probabilities.
Definition 1.
The probability distribution of $y^n$ causally conditioned on $x^n$ is defined as
$$P\big(y^n \,\big\|\, x^n\big) \triangleq \prod_{t=1}^{n} P\big(y_t \,\big|\, y^{t-1}, x^{t}\big).$$
The entropy rate of the process $\mathbf{Y}$ causally conditioned on $\mathbf{X}$ is then defined as:
$$H(\mathbf{Y} \,\|\, \mathbf{X}) \triangleq \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\big[-\log P\big(Y^n \,\big\|\, X^n\big)\big].$$
Consequently, the directed information rate of $\mathbf{X}$ to $\mathbf{Y}$ causally conditioned on $\mathbf{Z}$ is expressed as:
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq H(\mathbf{Y} \,\|\, \mathbf{Z}) - H(\mathbf{Y} \,\|\, \mathbf{X}, \mathbf{Z}). \tag{1}$$
Pairwise directed information does not unequivocally determine the one-step causal influence among nodes in a network; instead, the history of the remaining nodes should also be considered. Similarly to [1, 8], an edge from $\mathbf{X}$ to $\mathbf{Y}$ exists in a directed information graph iff
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0, \tag{2}$$
where $\mathbf{Z}$ denotes the remaining processes of the network. Having observed the output of every process, the edges of the graph can be estimated, which results in a weighted directed graph. However, when only the existence of a directed edge is investigated, the performance of the detection can be improved. This is presented in Section III.
There exist several methods to estimate information-theoretic quantities, most of which intrinsically rely on counting possible events to estimate distributions. One such method is the empirical distribution, which we define as follows.
Definition 2.
Let $w^n$ be a realization of the random variables $W^n$, where $W_t$ collects the samples of all nodes at time $t$ (i.e., $W_t = (X_t, Y_t, Z_t)$ in a three-node network). The joint empirical distribution of $k+1$ consecutive time instances of all nodes is then defined as:
$$\hat{P}\big(w^{k+1}\big) \triangleq \frac{1}{n-k} \sum_{t=k+1}^{n} \mathbb{1}\big\{ w_{t-k}^{t} = w^{k+1} \big\}. \tag{3}$$
The joint distribution of any subset of nodes is then a marginal distribution of (3).
By plugging in the empirical distribution, we can derive estimators for information-theoretic quantities such as the entropy, where we use the notation $\hat{H}$ to distinguish the empirical estimator, i.e.,
$$\hat{H}(W) \triangleq -\sum_{w} \hat{P}(w) \log \hat{P}(w). \tag{4}$$
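For concreteness, the empirical distribution (3) and the plug-in entropy (4) amount to simple counting. Below is a minimal Python sketch (function names are ours); each sample may itself be a tuple collecting the outputs of all nodes at one time instant:

```python
from collections import Counter
from math import log

def empirical_dist(samples, k):
    """Empirical distribution (3) of (k+1)-tuples of consecutive samples;
    each sample may be a tuple collecting all nodes at one time instant."""
    n = len(samples)
    counts = Counter(tuple(samples[t - k:t + 1]) for t in range(k, n))
    return {pattern: c / (n - k) for pattern, c in counts.items()}

def plugin_entropy(dist):
    """Plug-in entropy (4), in nats."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

# Toy usage with a binary sequence and k = 1:
seq = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(plugin_entropy(empirical_dist(seq, k=1)))
```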
A causal influence in the network implies that the past of a group of nodes affects the future of some other group or themselves. This motivates us to focus on a network of jointly Markov processes in this paper, since it characterizes a state-dependent operation for nodes, although we may put further assumptions to make calculations more tractable. For simplicity, we assume a three-node network (i.e., $m = 3$) and the processes to be $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ in the rest of the work, since the extension of the results to $m > 3$ is straightforward.
Assumption 1.
$(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ is a jointly stationary Markov process of order $k$.
Let us define the transition probability matrix $P$ with elements
$$P\big(w_{k+1} \,\big|\, w^{k}\big) = P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big).$$
Then, the next assumption avoids further complications in the proof of our main result.
Assumption 2.
All transition probabilities are positive, i.e., $P\big(w_{k+1} \,\big|\, w^{k}\big) > 0$ for all $w^{k+1}$.
This condition provides ergodicity for the joint Markov process and results in the joint empirical distribution asymptotically converging to the stationary distribution (denoted as $\pi$ in the sequel), i.e., $\hat{P}\big(w^{k+1}\big) \to \pi\big(w^{k+1}\big)$ almost surely as $n \to \infty$.
In general, the directed information rate cannot be expressed with the stationary random variables $X^{k+1}$, $Y^{k+1}$, and $Z^{k+1}$, since the causally conditioned entropies depend on the entire past of the processes, and a perfect estimation would thus require an unlimited number of samples. To see this,
$$\begin{aligned} I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) &= \lim_{n \to \infty} \Big[ H\big(Y_n \,\big|\, Y^{n-1}, Z^{n}\big) \Big] - H\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big) \\ &\le H\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - H\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big), \end{aligned} \tag{5}$$
where we use the Markov property and stationarity in the first equation, and the inequality holds since the mutual information $I\big(Y_n;\, (Y^{n-k-1}, Z^{n-k-1}) \,\big|\, Y_{n-k}^{n-1}, Z_{n-k}^{n}\big)$ is non-negative. Thus, with a limited sampling interval, only an upper bound would be derived. The next assumption makes (5) hold with equality.
Assumption 3.
For processes $\mathbf{Y}$ and $\mathbf{Z}$, the Markov chain
$$Y_{n} \,-\, \big(Y_{n-k}^{n-1},\, Z_{n-k}^{n}\big) \,-\, \big(Y^{n-k-1},\, Z^{n-k-1}\big)$$
holds.
Note that the above assumption should hold for every other pair of processes if we are interested in studying the whole graph and not only a single edge.
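Under Assumptions 1–3, the directed information rate in (5) thus reduces to the conditional mutual information $I\big(X^{k+1}; Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big)$ of finitely many stationary variables, which can be estimated by plugging the empirical distribution (3) into the corresponding entropies. Below is a minimal sketch of this plug-in estimator for $k = 1$; the function name and tuple encodings are ours, and discrete (hashable) alphabets are assumed:

```python
from collections import Counter
from math import log

def plugin_cmi(x, y, z):
    """Plug-in estimate of I(X^{k+1}; Y_{k+1} | Y^k, Z^{k+1}) for k = 1,
    i.e., the finite-window form of the directed information rate."""
    n = len(x)
    joint = Counter()
    for t in range(n - 1):
        a = (x[t], x[t + 1])           # X^{k+1}: past and present of X
        b = y[t + 1]                   # Y_{k+1}
        c = (y[t], z[t], z[t + 1])     # (Y^k, Z^{k+1})
        joint[(a, b, c)] += 1
    p_abc = {key: v / (n - 1) for key, v in joint.items()}
    p_ac, p_bc, p_c = Counter(), Counter(), Counter()
    for (a, b, c), p in p_abc.items():
        p_ac[(a, c)] += p
        p_bc[(b, c)] += p
        p_c[c] += p
    return sum(p * log(p * p_c[c] / (p_ac[(a, c)] * p_bc[(b, c)]))
               for (a, b, c), p in p_abc.items())
```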
III Hypothesis test for Directed Information Graphs
Consider a graph $G$, where the edge from node $i$ to node $j$ is denoted by $A_{ij}$; we say that $A_{ij} = 1$ if node $i$ causally influences node $j$, and $A_{ij} = 0$ otherwise. A hypothesis test to identify the graph is performed on the adjacency matrix $A$, whose elements are the $A_{ij}$s, and the performance of the test is studied through its false alarm and detection probabilities
$$P_F \triangleq \Pr\big(\hat{A} = A \,\big|\, A^* \neq A\big), \tag{6}$$
$$P_D \triangleq \Pr\big(\hat{A} = A \,\big|\, A^* = A\big), \tag{7}$$
where $\hat{A}$ is the estimation of the true adjacency matrix $A^*$ (properly defined later), and $A$ is the hypothesis model to test. In Theorem 1 below, both an upper bound on $P_F$ and a lower bound on $P_D$ are derived.
Theorem 1.
For a directed information graph with adjacency matrix $A^*$ of size $m \times m$, if Assumptions 1–3 hold, the performance of the test for the hypothesis $A$ is bounded as:
$$P_F \le \max\Big\{ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta,\; \beta_n \Big\}, \tag{8}$$
$$P_D \ge 1 - \bar{N}\Big[ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta \Big] - N \beta_n, \tag{9}$$
using the plug-in estimation with $n$ samples, for any $\delta > 0$ and sufficiently large $n$. The function $\gamma(\cdot, \cdot)$ is the regularized gamma function, and $d' \triangleq \dim(\Theta) - \dim(\Phi')$ is the degree of the limiting $\chi^2$ distribution of the single-edge test statistic (see Theorem 2 and Remark 3 in Section III-A), with $N$ denoting the number of directed edges in the hypothesis graph, $\bar{N} \triangleq m(m-1) - N$, and $\beta_n$ the vanishing miss probability of a single existing edge given in (16). Finally, $\epsilon_n$ is the threshold value used to decide the existence of an edge, and its order is $o(1)$ with $n\epsilon_n \to \infty$.
The proof of Theorem 1 consists of two steps. First, the asymptotic behavior of the test for a single edge is derived in Section III-A. Afterwards, the hypothesis test over the whole graph is studied based on the tests for each single edge. It can be seen later on that, by Remark 1 and Corollary 1, the performance of testing the graph remains as good as a test for causality of a single edge.
Remark 1.
Note that by increasing $n\epsilon_n$ while $\epsilon_n$ remains of order $o(1)$, $\gamma\big(\tfrac{d'}{2}, n\epsilon_n\big)$ gets arbitrarily close to one, which results in the probability of detection converging to one as the probability of false alarm tends to zero.
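To illustrate the interplay between the threshold and the bounds, the sketch below evaluates the false-alarm term $1 - \gamma\big(\tfrac{d'}{2}, n\epsilon_n\big)$ with scipy, together with a Gaussian stand-in for the miss term $\beta_n$; the values of $d'$, the edge strength, and the variance are assumed, illustrative choices:

```python
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma
from scipy.stats import norm

d = 24                               # assumed degrees of freedom
I_min, sigma_I = 0.1, 1.0            # assumed edge strength and std. dev.
for n in [10**3, 10**4, 10**5]:
    eps = np.log(n) ** 2 / n         # eps -> 0 while n * eps -> infinity
    fa = 1.0 - gammainc(d / 2.0, n * eps)                 # false-alarm term
    miss = norm.sf(np.sqrt(n) * (I_min - eps) / sigma_I)  # Q-function term
    print(f"n={n:>6}  false-alarm term={fa:.2e}  miss term={miss:.2e}")
```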
III-A Asymptotic Behavior of a Single Edge Estimation
In general, every possible transition probability matrix can be parametrized with a vector $\theta \in \Theta$ (see Table I). The vector $\theta$ is formed by concatenating the elements of $P$ in a row-wise manner, excluding the last (linearly dependent) column. However, if the transition probability can be factorized due to a Markov property among its variables, the matrix may be addressed with a lower-dimensional parameter.
To see this, let us concentrate on our 3-node network $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$. If $A_{XY} = 0$ or, equivalently, $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$, then by Assumption 3 the transition probability can be factorized as follows,
$$P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) = P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big). \tag{10}$$
Here the transition matrix is parametrized by $\phi \in \Phi$, where $\phi$ has two components, $\phi_1 \in \Phi_1$ and $\phi_2 \in \Phi_2$, associated with the two factors in (10). The dimensions of the sets are shown in Table I; note that $\dim(\Phi) = \dim(\Phi_1) + \dim(\Phi_2) < \dim(\Theta)$. The vectors $\phi_1$ and $\phi_2$ are also formed by concatenating the elements of their respective matrices, as in the case of $\theta$. More details are found in the proof of Theorem 2 in Appendix A.
TABLE I: Dimensions of the parameter sets

Set | Dimension
---|---
$\Theta$ | $\lvert\mathcal{X}\rvert^{k}\,\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k}\,\big(\lvert\mathcal{X}\rvert\,\lvert\mathcal{Y}\rvert\,\lvert\mathcal{Z}\rvert - 1\big)$
$\Phi_1$ | $\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k+1}\,\big(\lvert\mathcal{Y}\rvert - 1\big)$
$\Phi_2$ | $\lvert\mathcal{X}\rvert^{k}\,\lvert\mathcal{Y}\rvert^{k}\,\lvert\mathcal{Z}\rvert^{k}\,\big(\lvert\mathcal{X}\rvert\,\lvert\mathcal{Z}\rvert - 1\big)$
Now consider the Neyman–Pearson criterion to test the hypothesis $\phi \in \Phi$ within $\theta \in \Theta$.
Definition 3.
The log-likelihood of the observed samples under the transition matrix parametrized by $\theta$ is defined as
$$\ell_n(\theta) \triangleq \sum_{t=k+1}^{n} \log P_\theta\big(x_t, y_t, z_t \,\big|\, x_{t-k}^{t-1}, y_{t-k}^{t-1}, z_{t-k}^{t-1}\big).$$
Let $\hat{\theta}$ and $\hat{\phi}$ be the most likely choices of the transition matrix under the general model $\Theta$ and the null hypothesis $\Phi$, respectively, i.e.,
$$\hat{\theta} \triangleq \arg\max_{\theta \in \Theta} \ell_n(\theta), \qquad \hat{\phi} \triangleq \arg\max_{\phi \in \Phi} \ell_n(\phi). \tag{11}$$
As a result, the test for causality boils down to checking the difference between likelihoods, i.e., the log-likelihood ratio:
$$\hat{L} \triangleq \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}), \tag{12}$$
which is the Neyman–Pearson criterion for testing $\Phi$ within $\Theta$. Then, in the following theorem, $2\hat{L}$ is shown to converge to a $\chi^2$ distribution of finite degree. The proof follows from standard results in [11, Th. 6.1].
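Since the maximizers in (11) are the empirical conditional frequencies, the statistic (12) can be computed directly from transition counts. A minimal sketch for $k = 1$ (helper names are ours; the null hypothesis is factorized as in (10)):

```python
from collections import Counter
from math import log

def glrt_statistic(x, y, z):
    """Log-likelihood ratio (12) for H0: no edge X -> Y, with k = 1.

    The ML transition law under the general model is the empirical
    conditional P(x',y',z' | x,y,z); under H0 it factorizes as in (10):
    P(y' | y,z,z') * P(x',z' | x,y,z)."""
    n = len(x)
    full, state = Counter(), Counter()
    yj, yc = Counter(), Counter()
    xzj, xzc = Counter(), Counter()
    for t in range(n - 1):
        s = (x[t], y[t], z[t])
        full[(s, x[t + 1], y[t + 1], z[t + 1])] += 1
        state[s] += 1
        yj[(y[t], z[t], z[t + 1], y[t + 1])] += 1
        yc[(y[t], z[t], z[t + 1])] += 1
        xzj[(s, x[t + 1], z[t + 1])] += 1
        xzc[s] += 1
    ll_gen = sum(c * log(c / state[key[0]]) for key, c in full.items())
    ll_null = (sum(c * log(c / yc[key[:3]]) for key, c in yj.items())
               + sum(c * log(c / xzc[key[0]]) for key, c in xzj.items()))
    return ll_gen - ll_null  # equals (n - k) times the plug-in DI (Lemma 1)
```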
Theorem 2.
If Assumptions 1–3 hold and the edge is absent in the true graph, i.e., $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$, then, as $n \to \infty$,
$$2\hat{L} \xrightarrow{d} \chi^2_{d}, \qquad d \triangleq \dim(\Theta) - \dim(\Phi).$$
Proof:
See Appendix A. ∎
Remark 2.
Note that the asymptotic result from Theorem 2 depends only on the dimensions of the sets $\Theta$ and $\Phi$, and not on the particular pair of nodes involved. Furthermore, the result also holds for a network with more than three nodes by properly defining the dimensions of the sets.
Remark 3.
Knowledge about the absence of edges other than $(X, Y)$ in the network results in the test statistic converging to a $\chi^2$ distribution of higher degree, since (10) could be further factorized. To see this, assume $A^*_{XY} = 0$ and consider that knowledge about the other links (for example, the whole adjacency matrix $A^*$) was given. Then, let the transition probability be factorized as much as possible, so it can be parametrized by $\phi' \in \Phi'$, which has lower or equal dimension than $\phi \in \Phi$. Take
$$\hat{L}' \triangleq \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}'),$$
where
$$\hat{\phi}' \triangleq \arg\max_{\phi' \in \Phi'} \ell_n(\phi').$$
Intuitively, by following similar steps as in the proof of Theorem 2, we obtain that $2\hat{L}'$ behaves as a $\chi^2_{d'}$ random variable, where $d' \triangleq \dim(\Theta) - \dim(\Phi') \ge d$. Since the cumulative distribution function of the $\chi^2$ distribution is a decreasing function with respect to the degree, and $\hat{L} \le \hat{L}'$, then,
$$\Pr\big(2\hat{L} \le x\big) \ge \Pr\big(2\hat{L}' \le x\big) \ge \gamma\Big(\tfrac{d'}{2}, \tfrac{x}{2}\Big) - \delta, \tag{13}$$
for sufficiently large $n$ and any $\delta > 0$. The lower bound in (13) allows us to ignore the knowledge about other nodes.
Consider now the estimation of the directed information, defined by plugging the empirical distribution (instead of the true distribution) into (5), i.e.,
$$\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \triangleq \hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - \hat{H}\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big).$$
Then, the following lemma states that $\hat{L}$ is proportional to $\hat{I}$ with an $n$ factor.
Lemma 1.
$\hat{L} = n\, \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$, where $\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ is the plug-in estimator of the directed information.
Proof:
The proof follows from standard definitions and noting that the KL-divergence is non-negative and minimized by zero. See Appendix B for the complete proof. ∎
Now, let us define the decision rule for checking the existence of an edge in the graph as follows:
$$\hat{A}_{XY} \triangleq \mathbb{1}\big\{ \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) \ge \epsilon_n \big\},$$
where $\epsilon_n$ is of order $o(1)$ with $n\epsilon_n \to \infty$. Then, for any knowledge about the states of other edges in the true graph, as long as $A^*_{XY} = 0$ we have:
$$\Pr\big(\hat{A}_{XY} = 1\big) = \Pr\big(2\hat{L} \ge 2 n \epsilon_n\big) \le 1 - \gamma\Big(\tfrac{d'}{2},\, n\epsilon_n\Big) + \delta, \tag{14}$$
where the inequality is due to Remark 3.
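A sketch of the resulting single-edge decision rule, reusing plugin_cmi from the earlier sketch; the default threshold $(\log n)^2 / n$ is one assumed choice satisfying $\epsilon_n \to 0$ and $n\epsilon_n \to \infty$:

```python
import math

def edge_test(x, y, z, eps=None):
    """Single-edge decision rule: declare X -> Y iff the plug-in
    directed-information estimate exceeds the threshold eps."""
    n = len(x)
    if eps is None:
        eps = math.log(n) ** 2 / n   # assumed: eps -> 0, n * eps -> inf
    return plugin_cmi(x, y, z) >= eps
```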
From Theorem 2 and Lemma 1, it is inferred that when $A^*_{XY} = 0$ in the true adjacency matrix $A^*$, the empirical estimation of the directed information converges to zero with a $\chi^2$ distribution at a rate $1/n$. The asymptotic behavior of $\hat{I}$ is different if the edge is present, i.e., $A^*_{XY} = 1$, which is addressed in the following theorem.
Theorem 3.
If Assumptions 1–3 hold and the edge exists in the true graph, i.e., $I \triangleq I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$, then
$$\sqrt{n}\,\big(\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) - I\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_I^2\big) \tag{15}$$
for some finite $\sigma_I > 0$. Consequently, for any threshold $\epsilon_n \to 0$,
$$\beta_n \triangleq \Pr\big(\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) < \epsilon_n\big) \to 0 \quad \text{as } n \to \infty. \tag{16}$$
Proof:
The empirical estimator can be decomposed into two parts, where the first one vanishes at a rate faster than $1/\sqrt{n}$ and the second part converges at a rate $1/\sqrt{n}$. The condition $I > 0$ keeps the second part positive, so it determines the asymptotic convergence of $\hat{I}$. Refer to Appendix C for further details. ∎
Remark 4.
Knowledge about the state of other edges in the true graph model does not affect the asymptotic behavior presented in Theorem 3, given that the condition $I > 0$ makes the convergence of the estimator independent of all other nodes. This can be seen by following the steps of the proof, where we only use the fact that if the true edge exists, then $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$ and the factorization (10) does not hold.
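The two convergence regimes can be observed numerically. The toy two-node Monte Carlo sketch below (the model and parameter values are ours, for illustration only) shows that $n\hat{I}$ stays bounded when no edge exists, in line with the $\chi^2$ limit, while $\hat{I}$ itself stabilizes around a positive value when the edge is present:

```python
import numpy as np
from collections import Counter
from math import log

rng = np.random.default_rng(1)

def simulate(n, edge):
    """Binary first-order toy chain: Y_{t+1} depends on Y_t and, if `edge`
    is set, also on X_t; X evolves on its own."""
    x = np.zeros(n, dtype=int)
    y = np.zeros(n, dtype=int)
    for t in range(n - 1):
        x[t + 1] = rng.random() < (0.7 if x[t] else 0.3)
        p = 0.5 + (0.3 if edge else 0.0) * (2 * x[t] - 1)
        y[t + 1] = rng.random() < (p if y[t] == 0 else 1 - p)
    return x, y

def plugin_cmi_pair(x, y):
    """Plug-in estimate of I(X_t; Y_{t+1} | Y_t)."""
    n = len(x)
    joint = Counter((x[t], y[t + 1], y[t]) for t in range(n - 1))
    pj = {key: v / (n - 1) for key, v in joint.items()}
    pac, pbc, pc = Counter(), Counter(), Counter()
    for (a, b, c), p in pj.items():
        pac[(a, c)] += p
        pbc[(b, c)] += p
        pc[c] += p
    return sum(p * log(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in pj.items())

for n in [1000, 4000, 16000]:
    i0 = plugin_cmi_pair(*simulate(n, edge=False))
    i1 = plugin_cmi_pair(*simulate(n, edge=True))
    # no edge: n * I_hat stays O(1); edge: I_hat itself stabilizes
    print(f"n={n:>6}  no-edge n*I_hat={n * i0:6.2f}  edge I_hat={i1:.4f}")
```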
III-B Hypothesis Test over an Entire Graph
The performance of testing a hypothesis for a graph is studied by means of the false alarm and detection probabilities defined in (6) and (7), respectively. The results may be considered as an extension of the hypothesis test over a single edge in the graph.
First, let the false alarm probability be written as
$$P_F = \Pr\big(\hat{A} = A \,\big|\, A^* \neq A\big).$$
If $A \neq A^*$, there exist nodes $i$ and $j$ such that $A_{ij} \neq A^*_{ij}$. Hence,
$$P_F \le \Pr\big(\hat{A}_{ij} = A_{ij} \,\big|\, A^*_{ij} \neq A_{ij}\big) \tag{17}$$
$$= \begin{cases} \Pr\big(\hat{I}_{ij} \ge \epsilon_n\big), & A^*_{ij} = 0, \\ \Pr\big(\hat{I}_{ij} < \epsilon_n\big), & A^*_{ij} = 1, \end{cases} \tag{18}$$
$$\le \max\Big\{ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta,\; \beta_n \Big\}, \tag{19}$$
where $\hat{I}_{ij}$ denotes the estimated directed information from node $i$ to node $j$, and (19) holds due to (14) and (16).
On the other hand, the complement of the detection probability may be upper-bounded using the union bound:
$$1 - P_D \le \sum_{(i,j):\, A_{ij} = 0} \Pr\big(\hat{I}_{ij} \ge \epsilon_n\big) + \sum_{(i,j):\, A_{ij} = 1} \Pr\big(\hat{I}_{ij} < \epsilon_n\big) \le N_0 \Big[ 1 - \gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) + \delta \Big] + N_1 \beta_n, \tag{20}$$
where $N_0$ and $N_1$ are the number of off-diagonal $0$s and $1$s in the true matrix $A^* = A$, i.e., $N_0 = \bar{N}$ and $N_1 = N$. The last inequality holds due to (14) and (16).
Since $\epsilon_n \to 0$ and it is of order $\omega(1/n)$, then, as $n \to \infty$, and noting that
$$\gamma\big(\tfrac{d'}{2},\, n\epsilon_n\big) \to 1 \quad \text{and} \quad \beta_n \to 0,$$
we have that
$$P_F \to 0, \tag{21}$$
$$P_D \to 1. \tag{22}$$
This concludes the proof of Theorem 1.
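Putting the pieces together, the graph-level test simply applies the single-edge rule to every ordered pair; a sketch reusing plugin_cmi (the merging of the side processes into one tuple-valued sequence is our simplification):

```python
def estimate_graph(processes, eps):
    """Estimate the adjacency matrix by testing every ordered pair (i, j):
    declare an edge i -> j iff the plug-in DI of process i to process j,
    causally conditioned on the remaining processes, exceeds eps."""
    m = len(processes)
    A_hat = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            rest = [processes[r] for r in range(m) if r not in (i, j)]
            # merge all side processes into one tuple-valued sequence
            z = list(zip(*rest)) if rest else [0] * len(processes[0])
            if plugin_cmi(processes[i], processes[j], z) >= eps:
                A_hat[i][j] = 1
    return A_hat
```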
Corollary 1.
With a threshold such that $\epsilon_n \to 0$ and $n\epsilon_n \to \infty$, the hypothesis test over the whole graph is asymptotically optimal, i.e., $P_F \to 0$ and $P_D \to 1$ as $n \to \infty$.
III-C Numerical Results
The bounds derived in Theorem 1 state that the detection probability can be desirably close to one while the false alarm probability can remain near zero with a proper threshold test. In Fig. 1, these bounds are depicted with respect to different values of $n$ for a network with three nodes. The joint process is assumed to be a Markov process of order $k$, and the random variables take values on a binary alphabet (cardinality $2$).
It can be observed in the figure that, for fixed $n$, the bound on $P_D$ improves as $N$ decreases, i.e., when the graph becomes sparser. Furthermore, by a proper choice of $\epsilon_n$, we can reach the optimal performance of the hypothesis test, i.e., $P_F \to 0$ and $P_D \to 1$.
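A sketch reproducing curves of this kind with scipy and matplotlib; the degrees of freedom, edge strength, and variance below are assumed, illustrative values:

```python
import numpy as np
from scipy.special import gammainc
from scipy.stats import norm
import matplotlib.pyplot as plt

m, d = 3, 24                       # three nodes; d an assumed value
I_min, sigma_I = 0.1, 1.0          # assumed edge strength and std. dev.
ns = np.logspace(2.5, 5, 50)
eps = np.log(ns) ** 2 / ns         # threshold: eps -> 0, n * eps -> inf
fa = 1.0 - gammainc(d / 2.0, ns * eps)                # per-edge false alarm
miss = norm.sf(np.sqrt(ns) * (I_min - eps) / sigma_I) # per-edge miss
for N in [1, 3, 6]:                # edges in the hypothesis graph
    N_bar = m * (m - 1) - N
    plt.plot(ns, 1.0 - N_bar * fa - N * miss, label=f"N = {N}")
plt.xscale("log")
plt.xlabel("number of samples n")
plt.ylabel("lower bound on P_D")
plt.legend()
plt.show()
```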
IV Summary and Remarks
In this paper, we investigated the performance of a hypothesis test for detecting the underlying directed graph of a network of stochastic processes, which represents the causal relationships among nodes, using the empirically calculated directed information as the measure. We showed that the convergence rate of the directed information estimator depends on whether the link exists in the real structure. We further showed that, with a proper adjustment of the threshold test for single edges, the overall hypothesis test is asymptotically optimal.
This work may be expanded by considering a detailed analysis on the sample complexity of the hypothesis test. Moreover, we assumed in this work that the estimator has access to samples from the whole network while in practice this might not be the case (see e.g. [12]).
Appendix A Proof of Theorem 2
For any right stochastic matrix $M$ with $c$ columns, let the matrix $\tilde{M}$ denote the first $c - 1$ (linearly independent) columns of $M$, as depicted in Fig. 2.
Without loss of generality, consider each alphabet to be a set of consecutive integers, which simplifies the indexing of elements in the alphabet. Next define the vector $\theta$ with components
$$\theta_{\nu(r,c)} \triangleq \tilde{P}_{r,c}, \tag{23}$$
each associated with an element of $\tilde{P}$ (the sub-matrix of the transition probability matrix $P$). Every element of $\tilde{P}$ can be addressed via the pair $(r, c)$, where $r$ and $c$ indicate the row and column of that element, respectively. Also, let $\nu(r, c)$ denote the index of that element when vectorizing $\tilde{P}$ row-wise. Any possible transition matrix can then be indexed with a vector $\theta \in \Theta$, where $\Theta$ has dimension $\dim(\Theta)$ (see Table I) and $\theta$ is constructed by concatenation of the rows of $\tilde{P}$.
Suppose now that $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = 0$ or equivalently, by definition (2), $A_{XY} = 0$. Then
$$P\big(x_{k+1}, y_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) = P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big). \tag{24}$$
Thus, the transition matrix is determined by the elements of two other matrices, $Q_1$ and $Q_2$, corresponding to the two factors in (24). Define the vectors $\phi_1$ and $\phi_2$, which are associated with the elements of $\tilde{Q}_1$ and $\tilde{Q}_2$ in the same manner as $\theta$ is associated with $\tilde{P}$. The pairs of row and column indices for each element in $\tilde{Q}_1$ and $\tilde{Q}_2$ are denoted by $(r_1, c_1)$ and $(r_2, c_2)$, with vectorized indices $\nu_1(r_1, c_1)$ and $\nu_2(r_2, c_2)$, respectively.
A matrix factorized as in (24) can be parametrized by a vector $\phi \in \Phi$, where $\Phi$ has dimension $\dim(\Phi_1) + \dim(\Phi_2)$ (see Table I). Then
$$\phi = (\phi_1, \phi_2),$$
where $\phi_1$ and $\phi_2$, which determine $Q_1$ and $Q_2$, are vectors of length $\dim(\Phi_1)$ and $\dim(\Phi_2)$, and are constructed by concatenating the rows of $\tilde{Q}_1$ and $\tilde{Q}_2$, respectively. There exists then a mapping $f : \Phi \to \Theta$ such that, component-wise,
$$\theta_{\nu(r,c)} = f_{\nu(r,c)}(\phi), \tag{25}$$
where each component is the product of the corresponding elements of $Q_1$ and $Q_2$ in (24).
Consider the matrix $J$ of size $\dim(\Theta) \times \dim(\Phi)$ such that, for every element,
$$J_{\nu(r,c),\, l} \triangleq \frac{\partial f_{\nu(r,c)}(\phi)}{\partial \phi_{l}}. \tag{26}$$
In other words, every row of the matrix $J$ contains the derivatives of an element of $\theta$ with respect to all elements of $\phi_1$, followed by the derivatives with respect to $\phi_2$.
According to [11, Th. 6.1], and by the definition of $\hat{\theta}$ and $\hat{\phi}$ in (11),
$$2\hat{L} \xrightarrow{d} \chi^2_{d}, \qquad d = \dim(\Theta) - \dim(\Phi),$$
if $f$ has continuous third-order partial derivatives and $J$ is of rank $\dim(\Phi)$. The first condition holds according to the definition of $f$ in (25). To verify the second condition, we can observe that there exist four types of rows in $J$:
- Type 1: Take the rows in (26) such that both $c_1$ and $c_2$ index linearly independent columns of $Q_1$ and $Q_2$. This means that in the $\nu(r,c)$-th row of $J$, the derivatives are taken from
$$\theta_{\nu(r,c)} = [\phi_1]_{\nu_1(r_1,c_1)}\, [\phi_2]_{\nu_2(r_2,c_2)}.$$
So all elements in such rows are zero except in the columns $\nu_1(r_1,c_1)$ and $\dim(\Phi_1) + \nu_2(r_2,c_2)$, which take the values $[\phi_2]_{\nu_2(r_2,c_2)}$ and $[\phi_1]_{\nu_1(r_1,c_1)}$, respectively.
- Type 2: Now consider the rows such that $c_1$ indexes the last (linearly dependent) column of $Q_1$ and $c_2$ a linearly independent column of $Q_2$. This means that in the $\nu(r,c)$-th row of $J$, the derivatives are taken from
$$\theta_{\nu(r,c)} = \Big(1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}\Big)\, [\phi_2]_{\nu_2(r_2,c_2)}.$$
So all elements in such rows are zero except in the columns (among the first $\dim(\Phi_1)$ columns) associated with row $r_1$ of $\tilde{Q}_1$, which are equal to $-[\phi_2]_{\nu_2(r_2,c_2)}$, and the column $\dim(\Phi_1) + \nu_2(r_2,c_2)$, which is equal to $1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}$.
- Type 3: Consider the rows such that $c_1$ indexes a linearly independent column of $Q_1$ and $c_2$ the last column of $Q_2$. Then, all elements of such rows are zero except in the column $\nu_1(r_1,c_1)$, which takes the value
$$1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'},$$
and the columns (among the last $\dim(\Phi_2)$ columns) associated with row $r_2$ of $\tilde{Q}_2$, which are equal to $-[\phi_1]_{\nu_1(r_1,c_1)}$.
- Type 4: Lastly, consider the rows such that $c_1$ and $c_2$ index the last columns of $Q_1$ and $Q_2$, respectively. Then, the only non-zero elements belong to the columns associated with row $r_1$ of $\tilde{Q}_1$ (among the first $\dim(\Phi_1)$ columns), which are equal to
$$-\Big(1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'}\Big),$$
and the columns associated with row $r_2$ of $\tilde{Q}_2$ (among the last $\dim(\Phi_2)$ columns), which are equal to
$$-\Big(1 - \sum_{c'} [\tilde{Q}_1]_{r_1, c'}\Big).$$
We show now that if a linear combination of all columns of $J$ equals the zero vector, then all coefficients must be zero as well. Let $J_l$ be the $l$-th column of $J$; then, if
$$\sum_{l} a_l\, J_l = \mathbf{0}, \tag{27}$$
then $a_l = 0$ for all $l$. To see this, consider the Type 1 row determined by the indices $(r_1, c_1)$ and $(r_2, c_2)$. Since it only has two non-zero elements, we have that
$$a_{\nu_1(r_1,c_1)}\, [\phi_2]_{\nu_2(r_2,c_2)} + a_{\dim(\Phi_1) + \nu_2(r_2,c_2)}\, [\phi_1]_{\nu_1(r_1,c_1)} = 0. \tag{28}$$
Then, take the Type 3 row with the same $(r_1, c_1)$ and $r_2$, where we have that
$$a_{\nu_1(r_1,c_1)} \Big(1 - \sum_{c'} [\tilde{Q}_2]_{r_2, c'}\Big) - \sum_{c'} a_{\dim(\Phi_1) + \nu_2(r_2,c')}\, [\phi_1]_{\nu_1(r_1,c_1)} = 0. \tag{29}$$
From (28), and noting that all transition probabilities are positive by Assumption 2, if $a_{\nu_1(r_1,c_1)} > 0$, then $a_{\dim(\Phi_1)+\nu_2(r_2,c')} < 0$ for all $c'$. Hence, the left-hand side of (29) is strictly positive and not zero. An analogous result is found assuming $a_{\nu_1(r_1,c_1)} < 0$. By contradiction, we conclude that $a_{\nu_1(r_1,c_1)} = 0$, and from (28), $a_{\dim(\Phi_1)+\nu_2(r_2,c_2)} = 0$.
By varying $(r_1, c_1)$ and $(r_2, c_2)$ over all combinations, we derive that all $a_l$s are zero and, as a result, $J$ has linearly independent columns, which meets the second condition. The proof of Theorem 2 is thus complete. ∎
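The rank condition can also be checked numerically for small alphabets. The sketch below (assuming binary alphabets and $k = 1$; the row/column encodings are ours) builds the map $f$ of (25) from the factorization (24) and verifies that a finite-difference Jacobian has full column rank $\dim(\Phi) = 32$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary alphabets, k = 1: the joint state s = (x, y, z) takes 8 values.
# Free parameters (an assumed toy instantiation of Table I):
#   Q1(y' | y, z, z'):    8 rows x (|Y| - 1) = 1 free col  ->  8 parameters
#   Q2(x', z' | x, y, z): 8 rows x (|X||Z| - 1) = 3 free cols -> 24 parameters
def f(phi):
    """Map phi -> theta: the 8 x 7 free part of the transition matrix,
    built from the factorization (24)."""
    q1 = phi[:8]                     # P(y' = 0 | y, z, z')
    q2 = phi[8:].reshape(8, 3)       # P((x', z') = first 3 cells | s)
    theta = np.empty((8, 7))
    for s in range(8):
        x, y, z = (s >> 2) & 1, (s >> 1) & 1, s & 1
        q2_row = np.append(q2[s], 1.0 - q2[s].sum())  # full P(x', z' | s)
        for c in range(7):           # c encodes (x', y', z'); last cell dropped
            xp, yp, zp = (c >> 2) & 1, (c >> 1) & 1, c & 1
            r1 = (y << 2) | (z << 1) | zp             # row of Q1: (y, z, z')
            p_y = q1[r1] if yp == 0 else 1.0 - q1[r1]
            theta[s, c] = p_y * q2_row[(xp << 1) | zp]
    return theta.ravel()

phi0 = rng.uniform(0.1, 0.2, size=32)  # interior point, all probs positive
h = 1e-6
J = np.empty((56, 32))                 # dim(Theta) x dim(Phi)
for j in range(32):
    e = np.zeros(32)
    e[j] = h
    J[:, j] = (f(phi0 + e) - f(phi0 - e)) / (2 * h)

print(np.linalg.matrix_rank(J))        # expected: 32, full column rank
```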
Appendix B Proof of Lemma 1
The proof follows similar steps as the one in [10, Prop. 9]. Using the definition of the log-likelihood,
$$\ell_n(\theta) = n \sum_{w^{k+1}} \hat{P}\big(w^{k+1}\big) \log P_\theta\big(w_{k+1} \,\big|\, w^{k}\big) = -n\Big[\hat{H}\big(W_{k+1} \,\big|\, W^{k}\big) + D\big(\hat{P}\big(\cdot \,\big|\, W^{k}\big) \,\big\|\, P_\theta\big(\cdot \,\big|\, W^{k}\big) \,\big|\, \hat{P}\big)\Big], \tag{30}$$
where $D(\cdot \,\|\, \cdot \,|\, \cdot)$ denotes the conditional KL-divergence and, with a slight abuse of notation, $n$ stands for the number of transitions $n - k$.
Since the KL-divergence is non-negative and minimized by zero, then
$$\ell_n(\hat{\theta}) = -n\, \hat{H}\big(W_{k+1} \,\big|\, W^{k}\big). \tag{31}$$
On the other hand, for the second log-likelihood, we have:
$$\ell_n(\phi) = n \sum_{w^{k+1}} \hat{P}\big(w^{k+1}\big) \log \Big[ P_{\phi_1}\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)\, P_{\phi_2}\big(x_{k+1}, z_{k+1} \,\big|\, x^{k}, y^{k}, z^{k}\big) \Big].$$
With a similar approach as in (30), we can expand $\ell_n(\phi)$ and maximize each factor separately, which yields (32):
$$\ell_n(\hat{\phi}) = -n\Big[\hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) + \hat{H}\big(X_{k+1}, Z_{k+1} \,\big|\, X^{k}, Y^{k}, Z^{k}\big)\Big]. \tag{32}$$
As a result,
$$\hat{L} = \ell_n(\hat{\theta}) - \ell_n(\hat{\phi}) = n\Big[\hat{H}\big(Y_{k+1} \,\big|\, Y^{k}, Z^{k+1}\big) - \hat{H}\big(Y_{k+1} \,\big|\, X^{k+1}, Y^{k}, Z^{k+1}\big)\Big] = n\, \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}). \tag{33}$$
$$\begin{aligned} \hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) ={}& D\big(\hat{P}_{X^{k+1}, Y^{k+1}, Z^{k+1}} \,\big\|\, P\big) - D\big(\hat{P}_{X^{k+1}, Y^{k}, Z^{k+1}} \,\big\|\, P\big) \\ &- D\big(\hat{P}_{Y^{k+1}, Z^{k+1}} \,\big\|\, P\big) + D\big(\hat{P}_{Y^{k}, Z^{k+1}} \,\big\|\, P\big) \\ &+ \frac{1}{n} \sum_{t=k+1}^{n} \log \frac{P\big(Y_t \,\big|\, X_{t-k}^{t}, Y_{t-k}^{t-1}, Z_{t-k}^{t}\big)}{P\big(Y_t \,\big|\, Y_{t-k}^{t-1}, Z_{t-k}^{t}\big)} \end{aligned} \tag{35}$$
Appendix C Proof of Theorem 3
We begin by expanding the expression $\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z})$ using the definition of the empirical distribution in (3), and we obtain (35), found at the bottom of the next page. We then proceed to analyze the asymptotic behavior of the estimator.
The first four terms in (35), i.e., the KL-divergence terms, decay faster than $1/\sqrt{n}$; this is shown later in the proof. On the other hand, since $I \triangleq I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) > 0$, due to (5) and Assumption 3, the last term in (35) is non-zero and dominates the convergence of the estimator, as we see next. Here, one observes that conditioning on $I > 0$ is sufficient to analyze the convergence of $\hat{I}$, and further knowledge about other edges is irrelevant (see Remark 4). We then conclude that
$$\hat{I}(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = \frac{1}{n} \sum_{t=k+1}^{n} g\big(W_{t-k}^{t}\big) + o\big(1/\sqrt{n}\big),$$
where
$$g\big(w^{k+1}\big) \triangleq \log \frac{P\big(y_{k+1} \,\big|\, x^{k+1}, y^{k}, z^{k+1}\big)}{P\big(y_{k+1} \,\big|\, y^{k}, z^{k+1}\big)}.$$
We note that $g$ is a functional of the chain and its mean is $\mathbb{E}_\pi[g] = I$. The chain is ergodic and we can thus apply the central limit theorem [13, Sec. I.16] to the partial sums to obtain
$$\sqrt{n}\left(\frac{1}{n}\sum_{t=k+1}^{n} g\big(W_{t-k}^{t}\big) - I\right) \xrightarrow{d} \mathcal{N}\big(0, \sigma_I^2\big), \tag{36}$$
where $\sigma_I$ is bounded.
Now, to complete the proof, it only remains to show that the KL-divergence terms in (35), multiplied by a factor $\sqrt{n}$, converge to zero as $n \to \infty$. We present the proof for one term and the others follow a similar approach. We first recall the Taylor expansion with Lagrange remainder form,
$$\log(1 + u) = u - \frac{u^2}{2(1 + \xi)^2},$$
for some $\xi$ between $0$ and $u$. Then, let us define $u_w \triangleq \big(\hat{P}(w) - P(w)\big)/P(w)$ so we can expand the first KL-divergence term as:
$$D\big(\hat{P} \,\big\|\, P\big) = \sum_{w} \big(\hat{P}(w) - P(w)\big)\, u_w - \sum_{w} \hat{P}(w)\, \frac{u_w^2}{2(1 + \xi_w)^2} \tag{37}$$
$$\le C \sum_{w} \big(\hat{P}(w) - P(w)\big)^2, \tag{38}$$
for some $\xi_w$ between $0$ and $u_w$, where
$$C \triangleq \max_{w} \frac{1}{P(w)},$$
and (37) follows due to $\sum_{w} \big(\hat{P}(w) - P(w)\big) = 0$.
Since the Markov model is assumed to be ergodic (Assumption 2), $P(w) > 0$ for every $w$, and therefore $C$ is bounded. Now consider the sequence
$$V_t \triangleq \mathbb{1}\big\{ W_{t-k}^{t} = w^{k+1} \big\},$$
with mean $\mathbb{E}[V_t] = P\big(w^{k+1}\big)$. According to the law of the iterated logarithm,
$$\limsup_{n \to \infty} \frac{\Big|\sum_{t=k+1}^{n} \big(V_t - P(w^{k+1})\big)\Big|}{\sqrt{2 n \log\log n}} < \infty \qquad \text{a.s.}$$
Using the definition of the empirical distribution, this implies
$$\big|\hat{P}\big(w^{k+1}\big) - P\big(w^{k+1}\big)\big|^2 = O\Big(\frac{\log\log n}{n}\Big) \qquad \text{a.s.} \tag{39}$$
As a result, we can rewrite (38) and conclude that
$$\sqrt{n}\, D\big(\hat{P} \,\big\|\, P\big) = O\Big(\frac{\log\log n}{\sqrt{n}}\Big) \to 0,$$
given that each term in the finite sum is bounded. Therefore, as $n \to \infty$, the four KL-divergence terms in (35), multiplied by a factor $\sqrt{n}$, tend to zero, and the proof of Theorem 3 is thus complete. ∎
References
- [1] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, “Estimating the Directed Information to Infer Causal Relationships in Ensemble Neural Spike Train Recordings,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 17–44, Feb. 2011.
- [2] K. Friston, L. Harrison, and W. Penny, “Dynamic Causal Modelling,” NeuroImage, vol. 19, no. 4, pp. 1273–1302, Aug. 2003.
- [3] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Directed Information Graphs,” IEEE Trans. Inf. Theory, vol. 61, no. 12, pp. 6887–6909, Dec. 2015.
- [4] W. M. Lord, J. Sun, N. T. Ouellette, and E. M. Bollt, “Inference of Causal Information Flow in Collective Animal Behavior,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2, no. 1, pp. 107–116, Jun. 2016.
- [5] J. Jiao, H. H. Permuter, L. Zhao, Y. H. Kim, and T. Weissman, “Universal Estimation of Directed Information,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6220–6242, Oct. 2013.
- [6] C. W. J. Granger, “Investigating Causal Relations by Econometric Models and Cross-spectral Methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
- [7] J. Massey, “Causality, Feedback and Directed Information,” in Proc. Int. Symp. Inf. Theory Applic. (ISITA), Honolulu, HI, USA, Nov. 1990, pp. 303–305.
- [8] P.-O. Amblard and O. J. J. Michel, “On Directed Information Theory and Granger Causality Graphs,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 7–16, Feb. 2011.
- [9] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs,” IEEE Trans. Signal Process., vol. 61, no. 12, pp. 3173–3182, Jun. 2013.
- [10] I. Kontoyiannis and M. Skoularidou, “Estimating the Directed Information and Testing for Causality,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 6053–6067, Nov. 2016.
- [11] P. Billingsley, Statistical Inference for Markov Processes. Chicago, IL, USA: Univ. of Chicago Press, 1961.
- [12] J. Scarlett and V. Cevher, “Lower Bounds on Active Learning for Graphical Model Selection,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 Apr. 2017, pp. 55–64.
- [13] K. L. Chung, Markov Chains: With Stationary Transition Probabilities, 2nd ed., ser. A Series of Comprehensive Studies in Mathematics. New York, NY, USA: Springer-Verlag, 1967, vol. 104.