
Towards Characterising Bayesian Network Models under Selection

Angelos P. Armen and Robin J. Evans
(Date: July 30, 2025)
Abstract.

Real-life statistical samples are often plagued by selection bias, which complicates drawing conclusions about the general population. When learning causal relationships between the variables is of interest, the sample may be assumed to be from a distribution in a causal Bayesian network (BN) model under selection. Understanding the constraints in the model under selection is the first step towards recovering causal structure in the original model. The conditional-independence (CI) constraints in a BN model under selection have been already characterised; there exist, however, additional, non-CI constraints in such models. In this work, some initial results are provided that simplify the characterisation problem. In addition, an algorithm is designed for identifying compelled ancestors (definite causes) from a completed partially directed acyclic graph (CPDAG). Finally, a non-CI, non-factorisation constraint in a BN model under selection is computed for the first time.

1. Introduction

Real-life statistical samples are often not from the population of interest but from a subpopulation with fixed values for a set of selection variables. The most prominent example is the case–control study. Suppose that $S$ is a boolean variable indicating whether an individual is included (“selected”) in the study. Then a case–control sample with variables $\mathbf{V}$ may be assumed to be a random sample from the conditional distribution over $\mathbf{V}$ given $S=\text{true}$ (Spirtes et al., 2000). We refer to conditioning on fixed values of a set of variables as selection. Selection may create “spurious” dependencies, that is, dependencies that are not present in the general population, as the following example (Pearl, 2009) shows.

Example 1.

Suppose that admission to a certain college requires either high grades or musical talent. Even if grades and musical talent are uncorrelated in the general population, they will be negatively correlated in the college population: students with low grades will most likely be musically gifted, which would explain their admission to the college, and, accordingly, students with no musical talent will probably have high grades.
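The induced dependence is easy to reproduce numerically. The following sketch is only a minimal illustration, assuming a hypothetical admission rule, thresholds, and sample size that do not appear in the example above: grades and musical talent are drawn independently, yet their sample correlation becomes clearly negative once attention is restricted to admitted students.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Grades and musical talent are independent in the general population.
grades = rng.normal(size=n)
talent = rng.normal(size=n)

# Hypothetical admission rule: either high grades or high musical talent.
admitted = (grades > 1.0) | (talent > 1.0)

print("correlation (population):", np.corrcoef(grades, talent)[0, 1])                      # approximately 0
print("correlation (admitted): ", np.corrcoef(grades[admitted], talent[admitted])[0, 1])   # clearly negative
# Conditioning on the common effect ("admission") induces a spurious dependence.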

This phenomenon is known as selection bias, Berkson’s paradox (Berkson, 1946), and the explaining away effect (Kim and Pearl, 1983) (in the example above, one explanation for admission renders the other one less likely).

Applying algorithms that assume their input is a random sample to selection-biased data may lead to incorrect output. Thus, several approaches have been developed to deal with selection bias in various tasks. The motivation behind this work is to improve causal structure learning from selection-biased data.

Elucidating causal relationships is of utmost importance in science. A causal directed acyclic graph (DAG) represents the direct causal relationships between a set of variables (“direct” meaning not through other variables in the set). In the absence of hidden common causes, causal feedback loops, and selection bias, the causal DAG also represents the conditional independences (CIs) between the variables based on the so-called Markov condition (Neapolitan, 2004); the causal Bayesian network (BN) model is the set of distributions that satisfy the CI constraints encoded by the causal DAG. By performing hypothesis tests of CI on a sample from the probability distribution over the variables, constraint-based structure learning algorithms such as PC (Spirtes et al., 1999) can learn the causal BN model, which amounts to learning (features of) the causal DAG. In the presence of selection bias (and/or hidden common causes), however, the probability distribution over the variables may no longer be in the causal BN model and not all constraints on the distribution are CI constraints; in that case, the PC algorithm is not appropriate.

Some previous work focussed on learning causal BN models from selection-biased samples by correcting for the selection bias. Cooper (2000) devised a Bayesian method in which the biased sample is treated as a random sample with missing values for a known or unknown number of unsampled individuals, and the likelihood of the data is computed by summing over all possibilities for the missing values and, if unknown, the number of unsampled individuals. This approach is computationally intractable in all but the smallest examples. In addition, it requires knowledge of the non-random-sampling process. Borboudakis and Tsamardinos (2015) devised a CI test for case–control samples with categorical variables, characterised potentially spurious links when learning the skeleton of the causal DAG using a test for random samples, and proposed the use of their specialised test on these links as a post-processing approach to removing spurious links. The drawbacks of this approach are that it is not applicable to general selection-biased samples, that the joint distribution over the selection variables needs to be known, and that the specialised tests are less powerful than the ones for random samples.

A more general approach to deal with the problem of selection bias is to characterise (supermodels of) BN models under selection and design algorithms for learning those models.

An ordinary Markov model is defined by the CI constraints in a BN model under marginalisation and/or selection (Shpitser et al., 2014). A maximal ancestral graph (MAG) (Richardson and Spirtes, 2002) is a graphical representation of an ordinary Markov model that can be learned using an algorithm such as FCI (Spirtes et al., 1999; Zhang, 2008). A MAG also represents ancestral relationships in the DAG of a BN model; in the case of a causal BN model, these are (indirect) causal relationships.

There are, however, additional, non-CI constraints in a BN model under marginalisation and/or selection. A nested Markov model (Richardson et al., 2017) is defined by the CI constraints and the so-called Verma constraints (Robins, 1986) in a BN model under marginalisation. These constraints may be represented by either a marginal directed acyclic graph (mDAG) (Evans, 2016) or an acyclic directed mixed graph (ADMG) (Richardson et al., 2017) and comprise all the equality constraints in the margin of a BN model over a set of categorical variables (Evans, 2018), although there also exist inequality constraints such as Bell’s inequalities (see, for example, Wolfe et al., 2016, for details). In contrast to MAGs, mDAGs and ADMGs represent direct causal relationships, and are therefore more expressive. In the case of selection, there exist non-CI equality constraints as well. Lauritzen (1999) showed that there exist non-CI factorisation constraints. Evans and Didelez (2015) showed the existence of non-CI, non-factorisation constraints in a certain categorical BN model under conditioning; here we show that those constraints are also constraints in another BN model under selection and explicitly identify the sole constraint in the case of binary variables.

In this work, some initial results are provided that simplify the problem of characterising BN models under selection. For the sake of simplicity, the results are presented for categorical variables, but they can be easily generalised to general state spaces where a joint density with respect to a product measure exists. All proofs can be found in the appendix of this paper.

Having characterised equality constraints in BN models under selection, it may be possible to devise a graphical representation of them. Structure-learning algorithms for selection-biased data can then be developed that make use of the non-CI information in the data and potentially enable the learning of more causal relationships than is currently possible.

2. Background

In this paper, sets are in boldface (e.g., $\mathbf{S}$), $X$ is used as a shortcut for the singleton $\{X\}$ in places where a set is expected, and $\mathbf{A}\dot{\cup}\mathbf{B}$ denotes the union of disjoint sets $\mathbf{A}$ and $\mathbf{B}$. Random variables are denoted by capital letters (e.g., $X$) and their values by the respective lowercase letters (e.g., $x$); fixed values are denoted by a hat (e.g., $\hat{x}$). If $\mathbf{X}$ and $\mathbf{Y}$ are sets of random variables, $\mathbf{x}\cup\mathbf{y}$ denotes the values of $\mathbf{X}\cup\mathbf{Y}$; the same holds for set operations other than the union. Probability distributions are denoted by capital letters (e.g., $P$) and their probability density functions by the respective lowercase letters (e.g., $p$); $p(\mathbf{x})$ is used as a shortcut for $p(\mathbf{X}=\mathbf{x})$. If $P$ is a distribution over $\mathbf{V}$, $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ are distinct subsets of $\mathbf{V}$, and $\mathbf{X}$ and $\mathbf{Y}$ are nonempty, the conditional independence of $\mathbf{X}$ and $\mathbf{Y}$ given $\mathbf{Z}$ in $P$ is denoted by $\mathbf{X}\Perp_{P}\mathbf{Y}\mid\mathbf{Z}$. If $P$ is a distribution over $\mathbf{O}\dot{\cup}\mathbf{H}\dot{\cup}\mathbf{C}\dot{\cup}\mathbf{S}$, then the marginal/conditional distribution of $P$ over $\mathbf{O}$ given $\mathbf{C}$ and $\mathbf{S}=\hat{\mathbf{s}}$ is denoted by $P[_{\mathbf{H}}^{\mathbf{C},\mathbf{S}=\hat{\mathbf{s}}}$. Finally, graphs are in calligraphic type (e.g., $\mathcal{G}$).

A graph $\mathcal{G}$ is an ordered pair $(\mathbf{V},\mathbf{E})$ of a set of nodes $\mathbf{V}$ and a set of edges $\mathbf{E}$ that connect pairs of distinct nodes in $\mathbf{V}$. If there is an edge between nodes $X$ and $Y$ in $\mathcal{G}$, then $X$ and $Y$ are adjacent in $\mathcal{G}$. The union of the sets of nodes adjacent to nodes $\mathbf{X}$ in $\mathcal{G}$ is denoted by $\mathbf{ADJ}_{\mathcal{G}}(\mathbf{X})$. A sequence of $n\geq 2$ nodes $(X_{1},\ldots,X_{n})$ such that, for $2\leq i\leq n$, $X_{i-1}$ and $X_{i}$ are adjacent, is called a path from $X_{1}$ to $X_{n}$. $\mathcal{G}$ is connected if there is a path from every node to every other node in the graph. Let $p=(X_{1},\ldots,X_{n})$ be a path. The nodes $X_{2},\ldots,X_{n-1}$ are called interior nodes on $p$. The path $(X_{i},\ldots,X_{j})$, where $1\leq i<j\leq n$, is the subpath of $p$ from $X_{i}$ to $X_{j}$ and is denoted by $p(X_{i},X_{j})$. If $X_{1}=X_{n}$, $p$ is a cycle; if, in addition, $X_{1},\ldots,X_{n-1}$ are distinct, $p$ is a simple cycle. A path is simple if no subpath is a cycle. A triple is a simple path with three nodes. A triple $(X,Z,Y)$ is shielded if $X$ and $Y$ are adjacent. A simple path is shielded if there is a shielded triple on the path. Let $\mathcal{G}_{1}=(\mathbf{V}_{1},\mathbf{E}_{1})$ and $\mathcal{G}_{2}=(\mathbf{V}_{2},\mathbf{E}_{2})$ be graphs. $\mathcal{G}_{1}$ is a subgraph of $\mathcal{G}_{2}$ (denoted by $\mathcal{G}_{1}\subseteq\mathcal{G}_{2}$) if $\mathbf{V}_{1}\subseteq\mathbf{V}_{2}$ and $\mathbf{E}_{1}\subseteq\mathbf{E}_{2}$. The induced subgraph of $\mathcal{G}$ over $\mathbf{A}\subseteq\mathbf{V}$ (denoted by $\mathcal{G}_{\mathbf{A}}$) is the graph with set of nodes $\mathbf{A}$ and the edges in $\mathcal{G}$ between nodes in $\mathbf{A}$.

A graph is called directed (resp. undirected) when all of its edges are directed (resp. undirected). A tree is a connected undirected graph without (simple) cycles. In a tree, a leaf is a node which is adjacent to a single node. If there is an edge $X\rightarrow Y$ in a directed graph $\mathcal{G}$, then $X$ is a parent of $Y$ and $Y$ a child of $X$ in $\mathcal{G}$; the edge is said to be out of $X$ and into $Y$. The union of the sets of parents (resp. children) of nodes $\mathbf{X}$ in $\mathcal{G}$ is denoted by $\mathbf{PA}_{\mathcal{G}}(\mathbf{X})$ (resp. $\mathbf{CH}_{\mathcal{G}}(\mathbf{X})$). The set $X\cup\mathbf{PA}_{\mathcal{G}}(X)$ is called the family of $X$ in $\mathcal{G}$. A node without parents (resp. children) is called a source (resp. sink). A link is an edge without regard to direction, and the skeleton of a directed graph is the undirected graph whose edges correspond to the links in the directed graph. A triple $(X,Z,Y)$ such that $X\rightarrow Z\leftarrow Y$ is a collider. A path from $X$ to $Y$ is out of (resp. into) $X$ and out of (resp. into) $Y$ if the first edge of the path is out of (resp. into) $X$ and the last edge is out of (resp. into) $Y$. A simple path from $X$ to $Y$ where all edges are directed towards $Y$ is called directed. If there is a directed path from $X$ to $Y$ or $X=Y$, then $X$ is an ancestor of $Y$ and $Y$ a descendant of $X$. The union of the sets of ancestors (resp. descendants) of nodes $\mathbf{X}$ in $\mathcal{G}$ is denoted by $\mathbf{AN}_{\mathcal{G}}(\mathbf{X})$ (resp. $\mathbf{DE}_{\mathcal{G}}(\mathbf{X})$). A simple cycle $(X_{1},\ldots,X_{n})$ is directed if, for $2\leq i\leq n$, the edge between $X_{i-1}$ and $X_{i}$ is directed towards $X_{i}$. A DAG is a directed graph without directed cycles. A conditional DAG is a DAG $(\mathbf{X}\dot{\cup}\mathbf{Y},\mathbf{E})$ such that the nodes in $\mathbf{Y}$ are sources (Evans, 2018). The nodes in $\mathbf{X}$ and $\mathbf{Y}$ are called random nodes and fixed nodes, respectively; fixed nodes are drawn in a rectangle. The result of fixing $\mathbf{A}\subseteq\mathbf{V}$ in a DAG $\mathcal{G}=(\mathbf{V},\mathbf{E})$ (denoted by $\phi_{\mathbf{A}}(\mathcal{G})$) is the conditional DAG with random nodes $\mathbf{V}\setminus\mathbf{A}$, fixed nodes $\mathbf{A}$, the edges in $\mathcal{G}$ between nodes in $\mathbf{V}\setminus\mathbf{A}$, and the edges in $\mathcal{G}$ from nodes in $\mathbf{A}$ to nodes in $\mathbf{V}\setminus\mathbf{A}$. A partially directed acyclic graph (PDAG) is a partially directed graph without directed cycles.

Let $\mathbf{X}$ and $\mathbf{Y}$ be distinct sets of variables. A (conditional) model over $\mathbf{X}$ given $\mathbf{Y}$ is a set of (conditional) probability distributions over $\mathbf{X}$ given $\mathbf{Y}$.

Definition 1 (BN model).

Let $\mathbf{X}$ be a set of variables and $\mathcal{G}$ be a DAG over $\mathbf{X}$. The BN model defined by $\mathcal{G}$, denoted by $\mathbf{BNM}(\mathcal{G})$, is the set of distributions $P$ of $\mathbf{X}$ that satisfy the Markov condition with $\mathcal{G}$, that is, every variable $X$ in $\mathbf{X}$ is independent of its non-descendants and non-parents in $\mathcal{G}$ given its parents in $\mathcal{G}$:

X \Perp_{P} \mathbf{X}\setminus(\mathbf{DE}_{\mathcal{G}}(X)\cup\mathbf{PA}_{\mathcal{G}}(X)) \mid \mathbf{PA}_{\mathcal{G}}(X)

$\mathcal{G}$ is referred to as the structure of $\mathbf{BNM}(\mathcal{G})$.

Let $\mathbf{X}$ be a set of categorical variables, $\mathcal{G}$ be a DAG over $\mathbf{X}$, and $P\in\mathbf{BNM}(\mathcal{G})$. Then $P$ equals the product of the conditional distributions of the nodes in $\mathcal{G}$ given their parents in $\mathcal{G}$ (Neapolitan, 2004, Theorem 1.4):

p(\mathbf{x})=\prod_{X\in\mathbf{X}}p(x\mid\mathbf{pa}_{\mathcal{G}}(X))

Conversely, suppose that conditional distributions of the nodes in $\mathcal{G}$ given their parents in $\mathcal{G}$ are specified. Then the product of these distributions is in $\mathbf{BNM}(\mathcal{G})$ (Neapolitan, 2004, Theorem 1.5).
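As a concrete illustration of this factorisation, the following sketch builds the joint distribution of a hypothetical three-node DAG $O_{1}\rightarrow O_{2}\rightarrow O_{3}$ (binary variables, arbitrarily chosen conditional probability tables, none of which appear in the paper) as the product of the specified conditionals and checks that the result is a valid distribution.

import itertools
import numpy as np

# Hypothetical CPTs for the DAG O1 -> O2 -> O3 (all variables binary).
p_o1 = np.array([0.3, 0.7])                      # p(o1)
p_o2_given_o1 = np.array([[0.8, 0.2],            # p(o2 | o1)
                          [0.4, 0.6]])
p_o3_given_o2 = np.array([[0.9, 0.1],            # p(o3 | o2)
                          [0.5, 0.5]])

# Joint distribution via the BN factorisation p(x) = prod_X p(x | pa_G(X)).
joint = np.zeros((2, 2, 2))
for o1, o2, o3 in itertools.product(range(2), repeat=3):
    joint[o1, o2, o3] = p_o1[o1] * p_o2_given_o1[o1, o2] * p_o3_given_o2[o2, o3]

assert np.isclose(joint.sum(), 1.0)              # the product is a distribution (cf. Theorem 1.5)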

BN models are widely used for causal inference. A causal BN model is defined by a causal DAG, whose edges denote direct causal relationships between the variables (see Neapolitan, 2004, for the exact definition of (direct) causation used). In the absence of hidden common causes, causal feedback loops, and selection bias, the distribution over the variables in a causal DAG $\mathcal{G}$ is in $\mathbf{BNM}(\mathcal{G})$ (Neapolitan, 2004). Therefore, structure learning from a sample amounts to learning direct causal relationships between the variables in the sample.

Different DAGs may impose the same factorisation on the distribution over the variables and, therefore, impose the same CI constraints and define the same BN model; such DAGs are called Markov equivalent and are said to belong to the same Markov equivalence class. Two DAGs are in the same class if and only if they have the same skeleton and the same unshielded colliders. A Markov equivalence class of DAGs can be represented by a completed partially directed acyclic graph (CPDAG), a PDAG with the same skeleton as the DAGs in the class and exactly those directed edges that are present in every DAG in the class. Any DAG in the class can be obtained by orienting the undirected edges in the CPDAG, as long as no directed cycles or new unshielded colliders are created. Without further assumptions, structure-learning algorithms cannot distinguish between Markov equivalent DAGs; therefore, DAGs can be learned only up to Markov equivalence.

Definition 2 (Hierarchical model).

Let $\mathbf{X}$ be a set of variables and $\mathbf{F}$ be a set of inclusion-maximal subsets of $\mathbf{X}$ such that every variable in $\mathbf{X}$ is included in at least one set in $\mathbf{F}$. The hierarchical model defined by $\mathbf{F}$ (denoted by $\mathbf{HM}(\mathbf{F})$) is the set of distributions $P$ of $\mathbf{X}$ that are the normalised product of nonnegative functions of the values of the sets in $\mathbf{F}$:

p(\mathbf{x})=\frac{1}{c}\prod_{\mathbf{A}\in\mathbf{F}}\phi_{\mathbf{A}}(\mathbf{a})

where $\phi_{\mathbf{A}}$ is a nonnegative function of the values of $\mathbf{A}$ and

c=\sum_{\mathbf{x}}\prod_{\mathbf{A}\in\mathbf{F}}\phi_{\mathbf{A}}(\mathbf{a})>0.

3. Results

We start by formally defining conditioning and selection of models.

Definition 3 (Model under conditioning).

Let $\mathbf{M}$ be a model over $\mathbf{O}\dot{\cup}\mathbf{C}\dot{\cup}\mathbf{S}$. Then $\mathbf{M}$ after conditioning on $\mathbf{C}$ and $\mathbf{S}=\hat{\mathbf{s}}$, denoted by $\mathbf{M}[^{\mathbf{C},\mathbf{S}=\hat{\mathbf{s}}}$, is a conditional model over $\mathbf{O}$ given $\mathbf{C}$ defined as follows:

Q\in\mathbf{M}[^{\mathbf{C},\mathbf{S}=\hat{\mathbf{s}}}\iff\exists P\in\mathbf{M}\text{ s.t. }P[^{\mathbf{C},\mathbf{S}=\hat{\mathbf{s}}}=Q

When $\mathbf{C}=\emptyset$, $\mathbf{M}$ is said to be under selection. The variables in $\mathbf{O}$, $\mathbf{C}$, and $\mathbf{S}$ are referred to as observed variables, conditioning variables, and selection variables, respectively.

According to the following lemma, model conditioning is commutative.

Lemma 1.

Let $\mathbf{M}$ be a model over $\mathbf{O}\dot{\cup}\mathbf{C}_{1}\dot{\cup}\mathbf{C}_{2}\dot{\cup}\mathbf{S}_{1}\dot{\cup}\mathbf{S}_{2}$. Then

(\mathbf{M}[^{\mathbf{C}_{1},\mathbf{S}_{1}=\hat{\mathbf{s}}_{1}})[^{\mathbf{C}_{2},\mathbf{S}_{2}=\hat{\mathbf{s}}_{2}}=\mathbf{M}[^{\mathbf{C}_{1}\cup\mathbf{C}_{2},\mathbf{S}_{1}\cup\mathbf{S}_{2}=\hat{\mathbf{s}}_{1}\cup\hat{\mathbf{s}}_{2}}.
Proof.

The proof follows from commutativity of conditioning and selection of distributions. ∎

Let $\mathbf{O}$ and $\mathbf{S}=\{S_{1},\ldots,S_{n}\}$ be distinct sets of categorical variables, $\mathbf{F}=\{\mathbf{F}_{1},\ldots,\mathbf{F}_{n}\}$ be a set of inclusion-maximal subsets of $\mathbf{O}$ such that every variable in $\mathbf{O}$ is included in at least one set in $\mathbf{F}$, $\mathbf{E}_{i}$ ($1\leq i\leq n$) be the set of edges from each node in $\mathbf{F}_{i}$ to $S_{i}$, and $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\bigcup_{1\leq i\leq n}\mathbf{E}_{i})$. Lauritzen (1999) proved that $\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}=\mathbf{HM}(\mathbf{F})$, which means that every categorical hierarchical model is a BN model under selection.

Example 2.

Let $\mathcal{G}$ be the DAG in Figure 1. Then

\mathbf{BNM}(\mathcal{G})[^{\{S_{1},S_{2},S_{3}\}=\{\hat{s}_{1},\hat{s}_{2},\hat{s}_{3}\}}=\mathbf{HM}(\{\{O_{1},O_{2}\},\{O_{2},O_{3}\},\{O_{3},O_{1}\}\}).

This hierarchical model is called the no three-way interaction model in the statistical literature (Jobson, 2012). Note that the corresponding MAG model is the saturated model; therefore, the constraints in the hierarchical model are non-CI constraints.

Figure 1. The DAG of a BN model which, under conditioning on the values of certain variables ($S_{1}$, $S_{2}$, and $S_{3}$), equals a non-saturated hierarchical model (see Example 2).

In fact, every BN model under selection is a submodel of a (not necessarily saturated) hierarchical model; consider the following example.

Example 3.

Let $\mathcal{G}$ be the DAG in Figure 2. For $P\in\mathbf{BNM}(\mathcal{G})$ such that $p(\hat{s})>0$ it holds that

p(o_{1},o_{2},o_{3}\mid\hat{s}) = c\cdot p(o_{1},o_{2},o_{3},\hat{s}) = c\cdot p(o_{1})p(o_{2}\mid o_{1})p(o_{3}\mid o_{1})p(\hat{s}\mid o_{2},o_{3})
= c\cdot\phi_{\{O_{1},O_{2}\}}(o_{1},o_{2})\phi_{\{O_{1},O_{3}\}}(o_{1},o_{3})\phi_{\{O_{2},O_{3}\}}(o_{2},o_{3})

where $c=1/p(\hat{s})$ and $\phi_{\mathbf{A}}$ is a nonnegative function of the values of $\mathbf{A}$. That is, $P[^{S=\hat{s}}\in\mathbf{HM}(\{\{O_{1},O_{2}\},\{O_{1},O_{3}\},\{O_{2},O_{3}\}\})$. Note that this is the same hierarchical model as in Example 2.
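This membership can also be verified numerically. The sketch below is only an illustration under assumed, arbitrarily chosen conditional probability tables for the DAG of Figure 2 (all variables binary): it computes the conditional distribution given $S=1$ and checks that the three-way log-linear interaction, which vanishes exactly for strictly positive binary distributions in the above hierarchical model, is zero.

import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical CPTs for the DAG of Figure 2: O1 -> O2, O1 -> O3, O2 -> S, O3 -> S.
p_o1 = rng.dirichlet([1, 1])                    # p(o1)
p_o2 = rng.dirichlet([1, 1], size=2)            # p(o2 | o1)
p_o3 = rng.dirichlet([1, 1], size=2)            # p(o3 | o1)
p_s = rng.dirichlet([1, 1], size=(2, 2))        # p(s | o2, o3)

# Joint over (O1, O2, O3, S) via the BN factorisation, then condition on S = 1.
joint = np.zeros((2, 2, 2, 2))
for o1, o2, o3, s in itertools.product(range(2), repeat=4):
    joint[o1, o2, o3, s] = p_o1[o1] * p_o2[o1, o2] * p_o3[o1, o3] * p_s[o2, o3, s]
cond = joint[..., 1] / joint[..., 1].sum()      # p(o1, o2, o3 | S = 1)

# Three-way log-linear interaction; it is zero for the no-three-way-interaction model.
interaction = sum((-1) ** (o1 + o2 + o3) * np.log(cond[o1, o2, o3])
                  for o1, o2, o3 in itertools.product(range(2), repeat=3))
print(abs(interaction) < 1e-10)                 # True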

Figure 2. The DAG of a BN model which, under conditioning on the value of a certain variable ($S$), is a submodel of a non-saturated hierarchical model (see Example 3).
Definition 4 (Selection hierarchical model).

Let $\mathbf{O}\dot{\cup}\mathbf{S}$ be a set of variables, $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\mathbf{E})$ be a DAG, and $\mathbf{I}$ be the set of the intersections of the families in $\mathcal{G}$ with $\mathbf{O}$:

\mathbf{I}=\{(X\cup\mathbf{PA}_{\mathcal{G}}(X))\cap\mathbf{O}:X\in\mathbf{O}\cup\mathbf{S}\}

and let $\mathbf{F}$ be the set of inclusion-maximal sets among the distinct members of $\mathbf{I}$. $\mathbf{HM}(\mathbf{F})$ is called the selection hierarchical model of $\mathbf{BNM}(\mathcal{G})$ with respect to $\mathbf{S}=\hat{\mathbf{s}}$ (denoted by $\mathbf{SHM}(\mathcal{G},\mathbf{S}=\hat{\mathbf{s}})$).

Lemma 2.

Let $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\mathbf{E})$, where $\mathbf{O}\dot{\cup}\mathbf{S}$ is a set of categorical variables. Then

\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}\subseteq\mathbf{SHM}(\mathcal{G},\mathbf{S}=\hat{\mathbf{s}}).

We refer to the constraints in the selection hierarchical model as factorisation constraints. At the end of this section, an example of a non-CI, non-factorisation constraint is given.

The results that follow simplify the characterisation problem.

Owing to the lemma below, it is sufficient to characterise BN models under selection when the selection nodes are sinks.

Lemma 3.

Let $\mathbf{O}\dot{\cup}\mathbf{S}$ be a set of categorical variables, $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\mathbf{E})$, and $\mathcal{G}^{\prime}$ be the subgraph of $\mathcal{G}$ with all edges out of $\mathbf{S}$ removed. Then

\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}=\hat{\mathbf{s}}}=\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}.

Based on the following lemma and due to commutativity of model conditioning (Lemma 1), only cases where the sets of parents of the selection nodes are non-nested need to be considered.

Lemma 4.

Let $\mathbf{O}\dot{\cup}\mathbf{S}$ be a set of categorical variables, $\{S_{1},S_{2}\}\subseteq\mathbf{S}$, $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\mathbf{E})$ such that $\mathbf{CH}_{\mathcal{G}}(\mathbf{S})=\emptyset$ and $\mathbf{PA}_{\mathcal{G}}(S_{2})\subseteq\mathbf{PA}_{\mathcal{G}}(S_{1})$, $\mathbf{S}^{\prime}=\mathbf{S}\setminus\{S_{2}\}$, and $\mathcal{G}^{\prime}=\mathcal{G}_{\mathbf{O}\cup\mathbf{S}^{\prime}}$. Then

\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}^{\prime}=\hat{\mathbf{s}}^{\prime}}=\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}.

We note that Lemmas 3 and 4 are dual to Lemmas 1 and 2, respectively, of Evans (2016), which concern marginalisation.

The notion of a conditional BN model is needed for the theorem that follows.

Definition 5 (Conditional BN model).

Let $\mathbf{X}$ and $\mathbf{Y}$ be distinct sets of variables and $\mathcal{G}$ be a conditional DAG with random nodes $\mathbf{X}$ and fixed nodes $\mathbf{Y}$. The conditional BN model defined by $\mathcal{G}$, denoted by $\mathbf{BNM}(\mathcal{G})$, is the set of conditional distributions $P$ of $\mathbf{X}$ given $\mathbf{Y}$ that satisfy the conditional Markov condition with $\mathcal{G}$, that is, every variable $X$ in $\mathbf{X}$ is independent of its non-descendants and non-parents in $\mathcal{G}$ given its parents in $\mathcal{G}$:

X \Perp_{P} \mathbf{X}\setminus(\mathbf{DE}_{\mathcal{G}}(X)\cup\mathbf{PA}_{\mathcal{G}}(X)) \mid \mathbf{PA}_{\mathcal{G}}(X)

Clearly, a BN model is a conditional BN model.

Let $\mathbf{X}$ and $\mathbf{Y}$ be distinct sets of categorical variables, $\mathcal{G}$ be a conditional DAG with random nodes $\mathbf{X}$ and fixed nodes $\mathbf{Y}$, and $P\in\mathbf{BNM}(\mathcal{G})$. The proof of Theorem 1.4 in Neapolitan (2004) can easily be adapted to show that $P$ equals the product of the conditional distributions of the random nodes in $\mathcal{G}$ given their parents in $\mathcal{G}$:

p(\mathbf{x}\mid\mathbf{y})=\prod_{X\in\mathbf{X}}p(x\mid\mathbf{pa}_{\mathcal{G}}(X))

Conversely, suppose that conditional distributions of the random nodes in $\mathcal{G}$ given their parents in $\mathcal{G}$ are specified. It is straightforward to adapt the proof of Theorem 1.5 in Neapolitan (2004) to show that the product of these distributions is in $\mathbf{BNM}(\mathcal{G})$.

According to the theorem below, selection only affects the ancestors of the selection nodes; it is therefore sufficient to characterise the case where all nodes are ancestors of the selection nodes.

Theorem 1.

Let $\mathbf{O}\dot{\cup}\mathbf{S}$ be a set of categorical variables, $\mathcal{G}=(\mathbf{O}\cup\mathbf{S},\mathbf{E})$ such that $\mathbf{CH}_{\mathcal{G}}(\mathbf{S})=\emptyset$, $\mathbf{X}=\mathbf{AN}_{\mathcal{G}}(\mathbf{S})$, $\mathbf{Y}=\mathbf{O}\setminus\mathbf{X}$, $\mathcal{G}_{1}=\mathcal{G}_{\mathbf{X}}$, $\mathcal{G}_{2}=\phi_{\mathbf{X}\setminus\mathbf{S}}(\mathcal{G}_{\mathbf{O}})$, $P$ be a distribution over $\mathbf{O}$ such that $p(\mathbf{x}\setminus\mathbf{s})>0$, $P_{1}=P[_{\mathbf{Y}}$, and $P_{2}=P[^{\mathbf{X}\setminus\mathbf{S}}$. Then

P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}\iff P_{1}\in\mathbf{BNM}(\mathcal{G}_{1})[^{\mathbf{S}=\hat{\mathbf{s}}}\land P_{2}\in\mathbf{BNM}(\mathcal{G}_{2})
Example 4.

Let $\mathbf{O}=\{O_{1},\ldots,O_{6}\}$ be a set of categorical variables, $S$ be a variable not in $\mathbf{O}$, $\mathcal{G}$, $\mathcal{G}_{1}$, and $\mathcal{G}_{2}$ be the DAGs in Figures 3(a), 3(b), and 3(c), respectively, $P$ be a distribution over $\mathbf{O}$ such that $p(o_{1},o_{2},o_{3},o_{4})>0$, $P_{1}=P[_{\{O_{5},O_{6}\}}$, and $P_{2}=P[^{\{O_{1},O_{2},O_{3},O_{4}\}}$. Then Theorem 1 says that

P\in\mathbf{BNM}(\mathcal{G})[^{S=\hat{s}}\iff P_{1}\in\mathbf{BNM}(\mathcal{G}_{1})[^{S=\hat{s}}\land P_{2}\in\mathbf{BNM}(\mathcal{G}_{2}).

Therefore, only $\mathbf{BNM}(\mathcal{G}_{1})[^{S=\hat{s}}$ needs to be characterised.

(a) A DAG.
(b) The induced subgraph over $\{O_{1},\ldots,O_{4},S\}$ of the DAG in Figure 3(a).
(c) The result of fixing $\{O_{1},\ldots,O_{4}\}$ in the induced subgraph over $\{O_{1},\ldots,O_{6}\}$ of the DAG in Figure 3(a).
Figure 3. The graphs in Example 4.

In general, different Markov equivalent DAGs have different sets of ancestors of the selection nodes. When computing constraints in a model under selection, Theorem 1 may therefore be applied to a Markov equivalent DAG with the fewest ancestors of the selection nodes. As it turns out, in such a DAG the ancestors of the selection nodes are exactly their compelled ancestors. Let $\mathbf{G}$ be a Markov equivalence class of DAGs. $X$ is a compelled ancestor of $Y$ in $\mathbf{G}$ if $X\in\mathbf{AN}_{\mathcal{G}}(Y)$ for every $\mathcal{G}\in\mathbf{G}$. Clearly, compelled ancestry is a transitive relation. The union of the sets of compelled ancestors of $\mathbf{Y}$ in $\mathbf{G}$ is denoted by $\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$. Note that, if $\mathbf{G}$ is the Markov equivalence class of an (unknown) causal DAG, then $\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$ is the set of definite causes of $\mathbf{Y}$.

The theorem that follows states that, for a Markov equivalence class of DAGs, there exists a DAG in the class such that the ancestors of a set of variables in the DAG are the compelled ancestors of the variables in the class.

Theorem 2.

Let $\mathbf{G}$ be a Markov equivalence class of DAGs over $\mathbf{V}$ and $\mathbf{Y}\subseteq\mathbf{V}$. There exists $\mathcal{G}\in\mathbf{G}$ such that $\mathbf{AN}_{\mathcal{G}}(\mathbf{Y})=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$.

The lemma below is useful for obtaining such a DAG.

Lemma 5.

Let $\mathbf{G}$ be a Markov equivalence class of DAGs over $\mathbf{V}$ and $\mathcal{G}\in\mathbf{G}$. Let further $\mathbf{Y}\subseteq\mathbf{V}$ and $\mathbf{Z}=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$. Then $\mathbf{AN}_{\mathcal{G}}(\mathbf{Y})=\mathbf{Z}$ if and only if every edge between $Z\in\mathbf{Z}$ and $X\in\mathbf{V}\setminus\mathbf{Z}$ in $\mathcal{G}$ is out of $Z$.

The following theorem gives necessary and sufficient criteria for identifying compelled ancestors from a CPDAG, thereby eliminating the need to enumerate the DAGs in the Markov equivalence class and take the intersection of the ancestors in each DAG.

Theorem 3.

Let $\mathbf{G}$ be a Markov equivalence class of DAGs and $\mathcal{P}$ be its CPDAG. Then $X\in\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$ if and only if either

  (1) $X\in\mathbf{AN}_{\mathcal{P}}(\mathbf{Y})$, or

  (2) $X$ is on an unshielded undirected path between two members of $\mathbf{AN}_{\mathcal{P}}(\mathbf{Y})$ in $\mathcal{P}$.

Algorithm 1 can be used for the identification of compelled ancestors based on Theorem 3. The algorithm first identifies $\mathbf{A}=\mathbf{AN}_{\mathcal{P}}(\mathbf{Y})$ and then recursively finds the interior nodes $\mathbf{U}$ on those unshielded undirected paths in $\mathcal{P}$ from each member $X$ of $\mathbf{A}$ to $\mathbf{A}\setminus\{X\}$ that have no interior node in $\mathbf{A}$. The second part of the algorithm is based on the lemma below.

Lemma 6.

Let $\mathcal{P}$ be a CPDAG over $\mathbf{V}$ and $X\in\mathbf{V}$. The subgraph of $\mathcal{P}$ containing only the nodes and the edges on unshielded undirected paths from $X$ to $\mathbf{V}\setminus\{X\}$ in $\mathcal{P}$ is a tree.

For each member $X$ of $\mathbf{A}$, the function call find_uup_interior_nodes($\mathcal{U},X,\mathbf{A}$) at line 7 of Algorithm 1 implicitly performs a depth-first search (DFS) in the tree of unshielded undirected paths from $X$ to $\mathbf{V}\setminus\{X\}$ in $\mathcal{P}$, starting at $X$. When either a member of $\mathbf{A}$ or a leaf in the tree is encountered, the algorithm backtracks. When at node $Y$ during backtracking, if a member of $\mathbf{A}$ was encountered beyond $Y$ (as indicated by the boolean flag $found$), $Y$ is added to $\mathbf{U}$ (unless $Y\in\mathbf{A}$). Since identifying $\mathbf{A}$ is $O(|\mathbf{V}|)$, each DFS is $O(|\mathbf{V}|)$, and a DFS is performed for each member of $\mathbf{A}$, Algorithm 1 is $O(|\mathbf{V}|^{2})$.

Algorithm 1 Find compelled ancestors. $\mathcal{P}$ is a CPDAG over $\mathbf{V}$ and $\mathbf{Y}\subseteq\mathbf{V}$. In the output, $\mathbf{CA}=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$, where $\mathbf{G}$ is the Markov equivalence class of DAGs represented by $\mathcal{P}$. The "uup" in function names stands for unshielded undirected path.
1: Input: $\mathcal{P},\mathbf{Y}$
2: Output: $\mathbf{CA}$
3: $\mathbf{A}\leftarrow\mathbf{AN}_{\mathcal{P}}(\mathbf{Y})$
4: Let $\mathcal{U}$ be the subgraph of $\mathcal{P}$ containing only the undirected edges of $\mathcal{P}$
5: $\mathbf{U}\leftarrow\emptyset$
6: for all $X\in\mathbf{A}$ do
7:   $\mathbf{U}_{X}\leftarrow$ find_uup_interior_nodes($\mathcal{U},X,\mathbf{A}$)
8:   $\mathbf{U}\leftarrow\mathbf{U}\cup\mathbf{U}_{X}$
9: end for
10: $\mathbf{CA}\leftarrow\mathbf{A}\cup\mathbf{U}$
11: function find_uup_interior_nodes($\mathcal{U},X,\mathbf{A}$)
12:   $\mathbf{U}\leftarrow\emptyset$
13:   for all $Y\in\mathbf{ADJ}_{\mathcal{U}}(X)$ do
14:     $(found_{Y},\mathbf{U}_{Y})\leftarrow$ find_uup_interior_nodes_recursive($\mathcal{U},X,Y,\mathbf{A}$)
15:     $\mathbf{U}\leftarrow\mathbf{U}\cup\mathbf{U}_{Y}$
16:   end for
17:   return $\mathbf{U}$
18: end function
19: function find_uup_interior_nodes_recursive($\mathcal{U},X,Y,\mathbf{A}$)
20:   $found\leftarrow$ false
21:   $\mathbf{U}\leftarrow\emptyset$
22:   for all $Z\in\mathbf{ADJ}_{\mathcal{U}}(Y)\setminus(\mathbf{ADJ}_{\mathcal{U}}(X)\cup\{X\})$ do
23:     if $Z\in\mathbf{A}$ then
24:       $found_{Z}\leftarrow$ true
25:       $\mathbf{U}_{Z}\leftarrow\emptyset$
26:     else
27:       $(found_{Z},\mathbf{U}_{Z})\leftarrow$ find_uup_interior_nodes_recursive($\mathcal{U},Y,Z,\mathbf{A}$)
28:     end if
29:     $\mathbf{U}\leftarrow\mathbf{U}\cup\mathbf{U}_{Z}$
30:     $found\leftarrow found\lor found_{Z}$
31:   end for
32:   if $found$ then
33:     $\mathbf{U}\leftarrow\mathbf{U}\cup\{Y\}$
34:   end if
35:   return $(found,\mathbf{U})$
36: end function
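For readers who prefer executable code, the following Python sketch mirrors Algorithm 1 under an assumed, hypothetical CPDAG representation (directed edges as ordered pairs, undirected edges as unordered pairs); it is an illustration, not the authors' implementation. As in Algorithm 1, no visited set is needed in the recursion because, by Lemma 6, the relevant unshielded undirected paths form a tree.

def an_in_cpdag(directed, ys):
    # AN_P(Y): Y together with every node that reaches Y along directed edges.
    parents = {}
    for a, b in directed:                        # edge a -> b
        parents.setdefault(b, set()).add(a)
    an, stack = set(ys), list(ys)
    while stack:
        for p in parents.get(stack.pop(), set()):
            if p not in an:
                an.add(p)
                stack.append(p)
    return an

def find_compelled_ancestors(directed, undirected, ys):
    # Algorithm 1: compelled ancestors of Y in the class represented by the CPDAG.
    adj_u = {}
    for a, b in undirected:                      # undirected edge a - b
        adj_u.setdefault(a, set()).add(b)
        adj_u.setdefault(b, set()).add(a)
    a_set = an_in_cpdag(directed, ys)

    def recurse(x, y):                           # find_uup_interior_nodes_recursive
        found, interior = False, set()
        # Keep the path unshielded: skip x itself and the neighbours of x.
        for z in adj_u.get(y, set()) - adj_u.get(x, set()) - {x}:
            if z in a_set:
                found_z, interior_z = True, set()
            else:
                found_z, interior_z = recurse(y, z)
            interior |= interior_z
            found = found or found_z
        if found:
            interior.add(y)
        return found, interior

    u_set = set()
    for x in a_set:                              # find_uup_interior_nodes
        for y in adj_u.get(x, set()):
            u_set |= recurse(x, y)[1]
    return a_set | u_set

# Usage on an assumed CPDAG: O1 -> S <- O2 and the undirected path O1 - O3 - O2.
directed = {("O1", "S"), ("O2", "S")}
undirected = {("O1", "O3"), ("O2", "O3")}
print(find_compelled_ancestors(directed, undirected, {"S"}))
# Prints the set {'O1', 'O2', 'O3', 'S'}: O3 lies on an unshielded undirected path
# between the two ancestors O1 and O2 of S, so it is a compelled ancestor (Theorem 3).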

Let $\mathcal{G}$ be a DAG over $\mathbf{V}$, $\mathbf{G}$ be the Markov equivalence class of $\mathcal{G}$, $\mathbf{Y}\subseteq\mathbf{V}$, and $\mathbf{Z}=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y})$. A DAG $\mathcal{G}^{\prime}$ in $\mathbf{G}$ such that $\mathbf{AN}_{\mathcal{G}^{\prime}}(\mathbf{Y})=\mathbf{Z}$ can be obtained as follows. First, obtain the CPDAG $\mathcal{P}$ of $\mathbf{G}$ (using, e.g., the algorithm of Chickering, 1995). Then identify $\mathbf{Z}$ from $\mathcal{P}$ using Algorithm 1. Finally, orient the edges in $\mathcal{P}$ using Algorithm 10.3 in Neapolitan (2004), orienting edges between $Z\in\mathbf{Z}$ and $X\in\mathbf{V}\setminus\mathbf{Z}$ out of $Z$ in Step 1 of that algorithm.

Example 5.

Let $\mathcal{G}$ be the DAG in Figure 3(a) and $\mathbf{G}$ be the Markov equivalence class to which $\mathcal{G}$ belongs. Figure 4(a) shows the CPDAG of $\mathbf{G}$. According to Theorem 3,

\mathbf{compAN}_{\mathbf{G}}(S)=\{O_{1},O_{2},O_{3}\}.

Owing to Theorem 2, there exists $\mathcal{G}^{\prime}\in\mathbf{G}$ such that $\mathbf{AN}_{\mathcal{G}^{\prime}}(S)=\{O_{1},O_{2},O_{3}\}$. Such a $\mathcal{G}^{\prime}$ is the DAG in Figure 4(b). Let $\mathcal{G}^{\prime}_{1}$ and $\mathcal{G}^{\prime}_{2}$ be the DAGs in Figures 4(c) and 4(d), respectively, $P$ be a distribution over $\mathbf{O}$ such that $p(o_{1},o_{2},o_{3})>0$, $P^{\prime}_{1}=P[_{\{O_{4},O_{5},O_{6}\}}$, and $P^{\prime}_{2}=P[^{\{O_{1},O_{2},O_{3}\}}$. Then Theorem 1 implies that

P\in\mathbf{BNM}(\mathcal{G})[^{S=\hat{s}}=\mathbf{BNM}(\mathcal{G}^{\prime})[^{S=\hat{s}}\iff P^{\prime}_{1}\in\mathbf{BNM}(\mathcal{G}^{\prime}_{1})[^{S=\hat{s}}\land P^{\prime}_{2}\in\mathbf{BNM}(\mathcal{G}^{\prime}_{2}).

Therefore, only $\mathbf{BNM}(\mathcal{G}^{\prime}_{1})[^{S=\hat{s}}$, which has fewer variables than $\mathbf{BNM}(\mathcal{G}_{1})[^{S=\hat{s}}$ in Example 4, needs to be characterised.

(a) The CPDAG of the Markov equivalence class to which the DAG in Figure 3(a) belongs. $O_{1}$, $O_{2}$, and $O_{3}$ are the compelled ancestors of $S$ in the class.
(b) A DAG in the Markov equivalence class represented by the CPDAG in Figure 4(a) such that the only ancestors of $S$ are the compelled ancestors of $S$ in the class.
(c) The induced subgraph over $\{O_{1},O_{2},O_{3},S\}$ of the DAG in Figure 4(b).
(d) The result of fixing $\{O_{1},O_{2},O_{3}\}$ in the induced subgraph over $\{O_{1},\ldots,O_{6}\}$ of the DAG in Figure 4(b).
Figure 4. The graphs in Example 5.

Finally, the following theorem states that, in the case of a single selection node, selection reduces to conditioning on the parents of the node in the induced subgraph over the observed nodes.

Theorem 4.

Let $\mathbf{O}\dot{\cup}S$ be a set of categorical variables, $\mathcal{G}=(\mathbf{O}\cup S,\mathbf{E})$ such that $\mathbf{CH}_{\mathcal{G}}(S)=\emptyset$, and $P$ be a distribution over $\mathbf{O}$ such that $p(\mathbf{pa}_{\mathcal{G}}(S))>0$. Then

P\in\mathbf{BNM}(\mathcal{G})[^{S=\hat{s}}\iff P[^{\mathbf{PA}_{\mathcal{G}}(S)}\in\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})[^{\mathbf{PA}_{\mathcal{G}}(S)}

We conclude this section with an example of a non-CI, non-factorisation constraint in a categorical BN model under selection.

Example 6.

Let $\mathcal{G}$ be the DAG in Figure 5 and $\mathbf{O}=\{O_{1},O_{2},O_{3},O_{4}\}$ be a set of variables with domain $\{1,2\}$. Owing to Theorem 4, $P\in\mathbf{BNM}(\mathcal{G})[^{S=\hat{s}}\iff P[^{O_{4}}\in\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})[^{O_{4}}$. Therefore, only $\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})[^{O_{4}}$ needs to be characterised. $\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})$ is the set of distributions of $\mathbf{O}$ for which $O_{1}\Perp O_{3}\mid O_{2}$ holds. Thus, apart from being nonnegative and summing to one, the values of $p$ must satisfy the following system of polynomial equations:

(1) (p_{1,1,1,1}+p_{1,1,1,2})(p_{2,1,2,1}+p_{2,1,2,2})-(p_{1,1,2,1}+p_{1,1,2,2})(p_{2,1,1,1}+p_{2,1,1,2})=0
(2) (p_{1,2,1,1}+p_{1,2,1,2})(p_{2,2,2,1}+p_{2,2,2,2})-(p_{1,2,2,1}+p_{1,2,2,2})(p_{2,2,1,1}+p_{2,2,1,2})=0

where $p_{i,j,k,l}$ stands for $p(O_{1}=i,O_{2}=j,O_{3}=k,O_{4}=l)$. Suppose that $Q$ is a conditional distribution of $\{O_{1},O_{2},O_{3}\}$ given $O_{4}$. In order for $Q$ to be in $\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})[^{O_{4}}$, there must exist a distribution $R$ of $O_{4}$ such that $Q\cdot R\in\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})$. Let $q_{i,j,k,l}$ stand for $q(O_{1}=i,O_{2}=j,O_{3}=k\mid O_{4}=l)$ and $r_{l}$ stand for $r(O_{4}=l)$. Replacing $p_{i,j,k,l}$ with $q_{i,j,k,l}r_{l}$ in the system above and adding the equation $r_{1}+r_{2}=1$ results in the following system:

(3) (q_{1,1,1,1}r_{1}+q_{1,1,1,2}r_{2})(q_{2,1,2,1}r_{1}+q_{2,1,2,2}r_{2})-(q_{1,1,2,1}r_{1}+q_{1,1,2,2}r_{2})(q_{2,1,1,1}r_{1}+q_{2,1,1,2}r_{2})=0
(4) (q_{1,2,1,1}r_{1}+q_{1,2,1,2}r_{2})(q_{2,2,2,1}r_{1}+q_{2,2,2,2}r_{2})-(q_{1,2,2,1}r_{1}+q_{1,2,2,2}r_{2})(q_{2,2,1,1}r_{1}+q_{2,2,1,2}r_{2})=0
(5) r_{1}+r_{2}-1=0

The values of $q$ must be such that the system, considered as a system in $r_{1}$ and $r_{2}$, has a solution. Replacing 1 with the dummy variable $h$ in Equation (5) results in a system of homogeneous polynomial equations in $r_{1}$, $r_{2}$, and $h$, and the system must have a nontrivial solution (that is, a solution other than all variables being zero). For a system of $n$ homogeneous polynomials in $n$ variables, the resultant is the unique (up to a constant) irreducible polynomial in the coefficients of the polynomials (here, the values of $q$) whose vanishing is equivalent to the system having a nontrivial solution (Cox et al., 2006). We used Macaulay2 to compute the resultant, which has degree 8 and 218 terms. As both the MAG model over $\mathbf{O}$ and $\mathbf{SHM}(\mathcal{G},S=\hat{s})$ are saturated, the resultant is an example of a non-CI, non-factorisation constraint.
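The solvability requirement can be illustrated numerically. The sketch below is only an illustration under assumed, arbitrarily chosen rational conditional probability tables for a distribution satisfying $O_{1}\Perp O_{3}\mid O_{2}$: it eliminates $r_{2}$ via Equation (5) and tests whether Equations (3) and (4), viewed as univariate polynomials in $r_{1}$, share a root by computing their univariate resultant, which is related to (but not the same as) the degree-8 multivariate resultant reported above.

import itertools
import sympy as sp

r1, r2 = sp.symbols("r1 r2")

def equations(q):
    # Equations (3) and (4) for a table q[i, j, k, l] with indices in {1, 2}.
    eqs = []
    for j in (1, 2):
        eqs.append((q[1, j, 1, 1] * r1 + q[1, j, 1, 2] * r2)
                   * (q[2, j, 2, 1] * r1 + q[2, j, 2, 2] * r2)
                   - (q[1, j, 2, 1] * r1 + q[1, j, 2, 2] * r2)
                   * (q[2, j, 1, 1] * r1 + q[2, j, 1, 2] * r2))
    return eqs

# Hypothetical rational CPTs for a distribution with O1 independent of O3 given O2:
# p(o1, o2, o3, o4) = p(o1) p(o2 | o1) p(o3 | o2) p(o4 | o1, o2, o3).
p1 = {1: sp.Rational(1, 3), 2: sp.Rational(2, 3)}
p2 = {(1, 1): sp.Rational(1, 2), (1, 2): sp.Rational(1, 2),
      (2, 1): sp.Rational(1, 3), (2, 2): sp.Rational(2, 3)}
p3 = {(1, 1): sp.Rational(1, 4), (1, 2): sp.Rational(3, 4),
      (2, 1): sp.Rational(2, 5), (2, 2): sp.Rational(3, 5)}
p4 = {(i, j, k, l): sp.Rational(i + j + k, 10) if l == 1 else 1 - sp.Rational(i + j + k, 10)
      for i, j, k, l in itertools.product((1, 2), repeat=4)}

p = {(i, j, k, l): p1[i] * p2[i, j] * p3[j, k] * p4[i, j, k, l]
     for i, j, k, l in itertools.product((1, 2), repeat=4)}
p_o4 = {l: sum(p[i, j, k, l] for i, j, k in itertools.product((1, 2), repeat=3))
        for l in (1, 2)}
q = {key: p[key] / p_o4[key[3]] for key in p}    # q(o1, o2, o3 | o4)

# Eliminate r2 and check that (3) and (4) have a common root in r1: the resultant
# of the two univariate polynomials vanishes exactly when the system is solvable.
f, g = [sp.expand(e.subs(r2, 1 - r1)) for e in equations(q)]
print(sp.resultant(f, g, r1))                    # 0; a generic table q gives a nonzero value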

While our concern in this example is the existence of solutions to the system, Evans and Didelez (2015) derived a necessary and sufficient condition for the existence of finitely many solutions when $Q\in\mathbf{BNM}(\mathcal{G}_{\mathbf{O}})[^{O_{4}}$ (clearly, a solution exists in this case). Their goal was to recover a distribution $P\in\mathbf{BNM}(\mathcal{G})$ from its conditional $Q$ given $S$. If the number of solutions to the system is finite, then the system can (in principle) be solved in order to recover $P$, as $r_{1}=p_{1}$ and $r_{2}=p_{2}$, where $p_{l}=p(O_{4}=l)$, will be one of the solutions, and $p_{i,j,k,l}=q_{i,j,k,l}p_{l}$.

Figure 5. The DAG of a BN model which, under conditioning on the value of a certain variable ($S$), includes a non-CI, non-factorisation constraint (see Example 6).

4. Conclusion and future work

In this work, preliminary results were provided towards characterising the constraints imposed on a distribution in a BN model under selection. Specifically, it was shown that the only cases that need to be characterised are the ones where the selection nodes have no children, the sets of parents of the selection nodes are non-nested, and all nodes are compelled ancestors of the selection nodes. In addition, in the case of a single selection node, selection reduces to conditioning on the parents of the selection node in the induced subgraph over the observed nodes. Furthermore, an algorithm was designed for identifying compelled ancestors from the CPDAG, thereby eliminating the need to enumerate the DAGs in the Markov equivalence class. This is a useful result on its own in causal structure learning, as compelled ancestors correspond to definite causes. Finally, a non-CI, non-factorisation constraint in a BN model under selection was computed for the first time.

Future work includes further reducing the characterisation problem, interpreting the constraints, devising a graph representation of them, unifying that representation with mDAGs, characterising the equivalence classes of the unified graphs, and, ultimately, devising structure-learning algorithms for the graphs.

Appendix – proofs

Proof of Lemma 2.

The proof follows from the factorisation imposed by $\mathbf{BNM}(\mathcal{G})$ on its distributions. ∎

Proof of Lemma 3.

Since $\mathcal{G}^{\prime}\subseteq\mathcal{G}$, $\mathbf{BNM}(\mathcal{G}^{\prime})\subseteq\mathbf{BNM}(\mathcal{G})$. Thus, $\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}=\hat{\mathbf{s}}}\subseteq\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$. Now suppose $P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$. Then there exists $Q\in\mathbf{BNM}(\mathcal{G})$ such that $Q[^{\mathbf{S}=\hat{\mathbf{s}}}=P$. Let $R\in\mathbf{BNM}(\mathcal{G}^{\prime})$ such that $r(x\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X))=q(x\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X),\widehat{\mathbf{pa}}_{\mathcal{G}}(X)\cap\hat{\mathbf{s}})$ for each $X\in\mathbf{O}\cup\mathbf{S}$. Then

r(\mathbf{o},\hat{\mathbf{s}}) = \prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(O))\prod_{S\in\mathbf{S}}r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(S))
= \prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(O),\widehat{\mathbf{pa}}_{\mathcal{G}}(O)\cap\hat{\mathbf{s}})\prod_{S\in\mathbf{S}}q(\hat{s}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(S),\widehat{\mathbf{pa}}_{\mathcal{G}}(S)\cap\hat{\mathbf{s}})=q(\mathbf{o},\hat{\mathbf{s}})

Therefore, $r(\mathbf{o}\mid\hat{\mathbf{s}})=q(\mathbf{o}\mid\hat{\mathbf{s}})=p(\mathbf{o})$. Thus, $P\in\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}=\hat{\mathbf{s}}}$. ∎

Proof of Lemma 4.

Forward direction: Suppose $P\in\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}^{\prime}=\hat{\mathbf{s}}^{\prime}}$. Then there exists $Q\in\mathbf{BNM}(\mathcal{G}^{\prime})$ such that $Q[^{\mathbf{S}^{\prime}=\hat{\mathbf{s}}^{\prime}}=P$. Let $R\in\mathbf{BNM}(\mathcal{G})$ such that $r(z\mid\mathbf{pa}_{\mathcal{G}}(Z))=q(z\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(Z))$ for each $Z\in\mathbf{O}\cup\mathbf{S}^{\prime}$ and $r(\hat{s}_{2}\mid\mathbf{pa}_{\mathcal{G}}(S_{2}))=1$. Then

r(\mathbf{o},\hat{\mathbf{s}}) = r(\hat{s}_{2}\mid\mathbf{pa}_{\mathcal{G}}(S_{2}))\prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}}(O))\prod_{S^{\prime}\in\mathbf{S}^{\prime}}r(\hat{s}^{\prime}\mid\mathbf{pa}_{\mathcal{G}}(S^{\prime}))
= \prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))\prod_{S^{\prime}\in\mathbf{S}^{\prime}}q(\hat{s}^{\prime}\mid\mathbf{pa}_{\mathcal{G}}(S^{\prime}))=q(\mathbf{o},\hat{\mathbf{s}}^{\prime})

Therefore, $r(\mathbf{o}\mid\hat{\mathbf{s}})=q(\mathbf{o}\mid\hat{\mathbf{s}}^{\prime})=p(\mathbf{o})$. Thus, $P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$.

Reverse direction: Suppose $P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$. Then there exists $Q\in\mathbf{BNM}(\mathcal{G})$ such that $Q[^{\mathbf{S}=\hat{\mathbf{s}}}=P$. Let $\mathbf{Z}=\mathbf{S}^{\prime}\setminus\{S_{1}\}$ and $R\in\mathbf{BNM}(\mathcal{G}^{\prime})$ such that $r(u\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(U))=q(u\mid\mathbf{pa}_{\mathcal{G}}(U))$ for each $U\in\mathbf{O}\cup\mathbf{Z}$ and $r(\hat{s}_{1}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(S_{1}))=q(\hat{s}_{1}\mid\mathbf{pa}_{\mathcal{G}}(S_{1}))\cdot q(\hat{s}_{2}\mid\mathbf{pa}_{\mathcal{G}}(S_{2}))$. Then

r(\mathbf{o},\hat{\mathbf{s}}^{\prime}) = r(\hat{s}_{1}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(S_{1}))\prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(O))\prod_{Z\in\mathbf{Z}}r(\hat{z}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(Z))
= q(\hat{s}_{1}\mid\mathbf{pa}_{\mathcal{G}}(S_{1}))\cdot q(\hat{s}_{2}\mid\mathbf{pa}_{\mathcal{G}}(S_{2}))\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))\prod_{Z\in\mathbf{Z}}q(\hat{z}\mid\mathbf{pa}_{\mathcal{G}}(Z))
= \prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))\prod_{S\in\mathbf{S}}q(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))=q(\mathbf{o},\hat{\mathbf{s}})

Therefore, $r(\mathbf{o}\mid\hat{\mathbf{s}}^{\prime})=q(\mathbf{o}\mid\hat{\mathbf{s}})=p(\mathbf{o})$. Thus, $P\in\mathbf{BNM}(\mathcal{G}^{\prime})[^{\mathbf{S}^{\prime}=\hat{\mathbf{s}}^{\prime}}$. ∎

Proof of Theorem 1.

Forward direction: Suppose $P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$. Then there exists $Q\in\mathbf{BNM}(\mathcal{G})$ such that $Q[^{\mathbf{S}=\hat{\mathbf{s}}}=P$. Since $p(\mathbf{x}\setminus\mathbf{s})>0$, $q(\mathbf{x}\setminus\mathbf{s},\hat{\mathbf{s}})>0$. Let $R_{1}\in\mathbf{BNM}(\mathcal{G}_{1})$ such that $r_{1}(x\mid\mathbf{pa}_{\mathcal{G}_{1}}(X))=q(x\mid\mathbf{pa}_{\mathcal{G}}(X))$ for each $X\in\mathbf{X}$. Then

q(\mathbf{x}) = \sum_{\mathbf{y}}q(\mathbf{x},\mathbf{y})=\prod_{X\in\mathbf{X}}q(x\mid\mathbf{pa}_{\mathcal{G}}(X))\sum_{\mathbf{y}}\prod_{Y\in\mathbf{Y}}q(y\mid\mathbf{pa}_{\mathcal{G}}(Y)\cap\mathbf{x},\mathbf{pa}_{\mathcal{G}}(Y)\cap\mathbf{y})
= \prod_{X\in\mathbf{X}}q(x\mid\mathbf{pa}_{\mathcal{G}}(X))=\prod_{X\in\mathbf{X}}r_{1}(x\mid\mathbf{pa}_{\mathcal{G}}(X))=r_{1}(\mathbf{x})

Therefore, $r_{1}(\mathbf{x}\setminus\mathbf{s}\mid\hat{\mathbf{s}})=q(\mathbf{x}\setminus\mathbf{s}\mid\hat{\mathbf{s}})=p(\mathbf{x}\setminus\mathbf{s})=p_{1}(\mathbf{x}\setminus\mathbf{s})$. Thus, $P_{1}\in\mathbf{BNM}(\mathcal{G}_{1})[^{\mathbf{S}=\hat{\mathbf{s}}}$.

Let $R_{2}\in\mathbf{BNM}(\mathcal{G}_{2})$ such that $r_{2}(y\mid\mathbf{pa}_{\mathcal{G}_{2}}(Y))=q(y\mid\mathbf{pa}_{\mathcal{G}}(Y))$ for each $Y\in\mathbf{Y}$. Then

p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s}) = p(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})=q(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s},\hat{\mathbf{s}})=\frac{q(\mathbf{o},\hat{\mathbf{s}})}{q(\mathbf{x}\setminus\mathbf{s},\hat{\mathbf{s}})}=\prod_{Y\in\mathbf{Y}}q(y\mid\mathbf{pa}_{\mathcal{G}}(Y))
= \prod_{Y\in\mathbf{Y}}r_{2}(y\mid\mathbf{pa}_{\mathcal{G}_{2}}(Y))=r_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})

Thus, $P_{2}\in\mathbf{BNM}(\mathcal{G}_{2})$.

Reverse direction: Suppose $P_{1}\in\mathbf{BNM}(\mathcal{G}_{1})[^{\mathbf{S}=\hat{\mathbf{s}}}$ and $P_{2}\in\mathbf{BNM}(\mathcal{G}_{2})$. Then there exists $Q_{1}\in\mathbf{BNM}(\mathcal{G}_{1})$ such that $Q_{1}[^{\mathbf{S}=\hat{\mathbf{s}}}=P_{1}$. Let $R\in\mathbf{BNM}(\mathcal{G})$ such that $r(x\mid\mathbf{pa}_{\mathcal{G}}(X))=q_{1}(x\mid\mathbf{pa}_{\mathcal{G}_{1}}(X))$ for each $X\in\mathbf{X}$ and $r(y\mid\mathbf{pa}_{\mathcal{G}}(Y))=p_{2}(y\mid\mathbf{pa}_{\mathcal{G}_{2}}(Y))$ for each $Y\in\mathbf{Y}$. Then

r(\mathbf{o},\mathbf{s}) = \prod_{X\in\mathbf{X}}r(x\mid\mathbf{pa}_{\mathcal{G}}(X))\prod_{Y\in\mathbf{Y}}r(y\mid\mathbf{pa}_{\mathcal{G}}(Y))
= \prod_{X\in\mathbf{X}}q_{1}(x\mid\mathbf{pa}_{\mathcal{G}_{1}}(X))\prod_{Y\in\mathbf{Y}}p_{2}(y\mid\mathbf{pa}_{\mathcal{G}_{2}}(Y))=q_{1}(\mathbf{x})p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})

and

r(\hat{\mathbf{s}}) = \sum_{\mathbf{o}}r(\mathbf{o},\hat{\mathbf{s}})=\sum_{\mathbf{x}\setminus\mathbf{s}}q_{1}(\mathbf{x}\setminus\mathbf{s},\hat{\mathbf{s}})\sum_{\mathbf{y}}p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})=q_{1}(\hat{\mathbf{s}})>0

Therefore,

r(\mathbf{o}\mid\hat{\mathbf{s}}) = \frac{r(\mathbf{o},\hat{\mathbf{s}})}{r(\hat{\mathbf{s}})}=\frac{q_{1}(\mathbf{x}\setminus\mathbf{s},\hat{\mathbf{s}})p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})}{q_{1}(\hat{\mathbf{s}})}=q_{1}(\mathbf{x}\setminus\mathbf{s}\mid\hat{\mathbf{s}})p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})
= p_{1}(\mathbf{x}\setminus\mathbf{s})p_{2}(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})=p(\mathbf{x}\setminus\mathbf{s})p(\mathbf{y}\mid\mathbf{x}\setminus\mathbf{s})=p(\mathbf{o})

Thus, $P\in\mathbf{BNM}(\mathcal{G})[^{\mathbf{S}=\hat{\mathbf{s}}}$. ∎

Let $\mathcal{G}$ be a directed graph over $\mathbf{V}$ and $\mathbf{X}\subseteq\mathbf{V}$. $\mathbf{X}$ is ancestral if $\mathbf{AN}_{\mathcal{G}}(\mathbf{X})\subseteq\mathbf{X}$. Clearly, $\mathbf{X}$ is ancestral if and only if there are no edges from $\mathbf{V}\setminus\mathbf{X}$ to $\mathbf{X}$.

Lemma 7.

Let $\mathbf{G}$ be a Markov equivalence class of DAGs over $\mathbf{V}$, $\{\mathcal{G}_{1},\mathcal{G}_{2}\}\subseteq\mathbf{G}$, and $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ be ancestral sets in $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, respectively. There exists $\mathcal{G}\in\mathbf{G}$ such that $\mathbf{A}_{1}\cup\mathbf{A}_{2}$ is ancestral in $\mathcal{G}$.

Proof.

Let $\mathcal{G}$ be the graph over $\mathbf{V}$ with the edges between members of $\mathbf{A}_{2}$ taken from $\mathcal{G}_{1}$, and the edges between members of $\mathbf{V}\setminus\mathbf{A}_{2}$ or between a member of $\mathbf{A}_{2}$ and a member of $\mathbf{V}\setminus\mathbf{A}_{2}$ taken from $\mathcal{G}_{2}$. Clearly, $\mathcal{G}$ has the same skeleton as $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$.

Suppose that $c$ is a directed cycle in $\mathcal{G}$. If all nodes in $c$ were in $\mathbf{A}_{2}$ (resp. $\mathbf{V}\setminus\mathbf{A}_{2}$), then $c$ would be a directed cycle in $\mathcal{G}_{1}$ (resp. $\mathcal{G}_{2}$). Therefore, there exists a node in $c$ that is in $\mathbf{A}_{2}$ and one that is in $\mathbf{V}\setminus\mathbf{A}_{2}$, which implies that there is an edge $X\rightarrow Y$ in $c$ such that $X\in\mathbf{V}\setminus\mathbf{A}_{2}$ and $Y\in\mathbf{A}_{2}$. This is a contradiction, as $X\rightarrow Y$ is taken from $\mathcal{G}_{2}$ and $\mathbf{A}_{2}$ is ancestral in $\mathcal{G}_{2}$. Therefore, $\mathcal{G}$ is acyclic.

Let $(X,Y,Z)$ be an unshielded triple in $\mathcal{G}$. If both edges in the triple are taken from either $\mathcal{G}_{1}$ or $\mathcal{G}_{2}$, then $(X,Y,Z)$ is a collider in $\mathcal{G}$ if and only if it is a collider in $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$. Suppose that the first edge is taken from $\mathcal{G}_{1}$ and the second from $\mathcal{G}_{2}$, which implies that $\{X,Y\}\subseteq\mathbf{A}_{2}$ and $Z\in\mathbf{V}\setminus\mathbf{A}_{2}$. Since $\mathbf{A}_{2}$ is ancestral in $\mathcal{G}_{2}$, the second edge is out of $Y$. Therefore, $(X,Y,Z)$ is a noncollider in $\mathcal{G}$, $\mathcal{G}_{1}$, and $\mathcal{G}_{2}$. Thus, $\mathcal{G}\in\mathbf{G}$.

Suppose that there is an edge between $X\in\mathbf{A}_{1}\cup\mathbf{A}_{2}$ and $Y\in\mathbf{V}\setminus(\mathbf{A}_{1}\cup\mathbf{A}_{2})$ in $\mathcal{G}$. If $Y\in\mathbf{V}\setminus\mathbf{A}_{2}$, then the edge is taken from $\mathcal{G}_{2}$ and is out of $X$ because $\mathbf{A}_{2}$ is ancestral in $\mathcal{G}_{2}$. Otherwise, $Y\in\mathbf{A}_{2}\setminus\mathbf{A}_{1}$, so the edge is taken from $\mathcal{G}_{1}$ and is out of $X$ because $\mathbf{A}_{1}$ is ancestral in $\mathcal{G}_{1}$. Therefore, $\mathbf{A}_{1}\cup\mathbf{A}_{2}$ is ancestral in $\mathcal{G}$. ∎

Proof of Theorem 2.

Suppose that 𝐆={𝒢1,,𝒢i,,𝒢n}\mathbf{G}=\{\mathcal{G}_{1},\ldots,\mathcal{G}_{i},\ldots,\mathcal{G}_{n}\} and 𝐀i=𝐀𝐍𝒢i(𝐘)\mathbf{A}_{i}=\mathbf{AN}_{\mathcal{G}_{i}}(\mathbf{Y}). It will be proved by induction on ii that, for each ii, there exists 𝒢i𝐆\mathcal{G}^{\prime}_{i}\in\mathbf{G} such that 𝐀1𝐀i\mathbf{A}_{1}\cap\ldots\cap\mathbf{A}_{i} is ancestral in 𝒢i\mathcal{G}^{\prime}_{i}.

Base case: Since 𝐀1\mathbf{A}_{1} is ancestral in 𝒢1\mathcal{G}_{1}, 𝒢1=𝒢1\mathcal{G}^{\prime}_{1}=\mathcal{G}_{1}.

Inductive step: Suppose that there exists 𝒢i𝐆\mathcal{G}^{\prime}_{i}\in\mathbf{G} such that 𝐀1𝐀i\mathbf{A}_{1}\cap\ldots\cap\mathbf{A}_{i} is ancestral in 𝒢i\mathcal{G}^{\prime}_{i}. Since 𝐀i+1\mathbf{A}_{i+1} is ancestral in 𝒢i+1\mathcal{G}_{i+1}, Lemma 7 implies that there exists 𝒢i+1𝐆\mathcal{G}^{\prime}_{i+1}\in\mathbf{G} such that 𝐀1𝐀i+1\mathbf{A}_{1}\cap\ldots\cap\mathbf{A}_{i+1} is ancestral in 𝒢i+1\mathcal{G}^{\prime}_{i+1}.

Therefore, there exists 𝒢n𝐆\mathcal{G}^{\prime}_{n}\in\mathbf{G} such that 𝐀1𝐀n=𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{A}_{1}\cap\ldots\cap\mathbf{A}_{n}=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}) is ancestral in 𝒢n\mathcal{G}^{\prime}_{n}. Since 𝐘𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{Y}\subseteq\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}) and 𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}) is ancestral in 𝒢n\mathcal{G}^{\prime}_{n}, 𝐀𝐍𝒢n(𝐘)𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{AN}_{\mathcal{G}^{\prime}_{n}}(\mathbf{Y})\subseteq\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}); the reverse inclusion holds by the definition of 𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}). Thus, 𝐀𝐍𝒢n(𝐘)=𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)\mathbf{AN}_{\mathcal{G}^{\prime}_{n}}(\mathbf{Y})=\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}). ∎
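Given an explicit list of the DAGs in a Markov equivalence class (assumed to be available; how to enumerate them is not shown here), 𝐜𝐨𝐦𝐩𝐀𝐍_𝐆(𝐘) can be computed directly as the intersection used in the proof, reusing the ancestors helper from the earlier sketch.

    # compAN_G(Y): the nodes that are ancestors of Y (including Y) in every DAG.
    def compelled_ancestors(dags, targets):
        return set.intersection(*(ancestors(dag, targets) for dag in dags))

    # The Markov equivalence class of the chain a - b - c, with Y = {c}:
    mec = [{"a": set(), "b": {"a"}, "c": {"b"}},    # a -> b -> c
           {"a": {"b"}, "b": set(), "c": {"b"}},    # a <- b -> c
           {"a": {"b"}, "b": {"c"}, "c": set()}]    # a <- b <- c
    assert compelled_ancestors(mec, {"c"}) == {"c"}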

Proof of Lemma 5.

Forward direction: Suppose that 𝐀𝐍𝒢(𝐘)=𝐙\mathbf{AN}_{\mathcal{G}}(\mathbf{Y})=\mathbf{Z}. If there is an edge XZX\rightarrow Z such that X𝐕𝐙X\in\mathbf{V}\setminus\mathbf{Z} and Z𝐙Z\in\mathbf{Z} in 𝒢\mathcal{G}, then X𝐀𝐍𝒢(𝐘)X\in\mathbf{AN}_{\mathcal{G}}(\mathbf{Y}). This is a contradiction. Therefore, every edge between X𝐕𝐙X\in\mathbf{V}\setminus\mathbf{Z} and Z𝐙Z\in\mathbf{Z} is out of ZZ.

Reverse direction: Suppose that every edge between X𝐕𝐙X\in\mathbf{V}\setminus\mathbf{Z} and Z𝐙Z\in\mathbf{Z} is out of ZZ. Since directed paths from 𝐕𝐙\mathbf{V}\setminus\mathbf{Z} to 𝐘\mathbf{Y} must go through 𝐙\mathbf{Z}, there are no directed paths from 𝐕𝐙\mathbf{V}\setminus\mathbf{Z} to 𝐘\mathbf{Y} in 𝒢\mathcal{G}. Thus, 𝐀𝐍𝒢(𝐘)=𝐙\mathbf{AN}_{\mathcal{G}}(\mathbf{Y})=\mathbf{Z}. ∎

Lemma 8 (Meek (1995), Lemma 1).

Let 𝒫\mathcal{P} be a CPDAG. If triple XY ZX\rightarrow Y\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Z exists in 𝒫\mathcal{P}, then edge XZX\rightarrow Z exists in 𝒫\mathcal{P}.

A simple path from XX to YY where all directed edges are directed towards YY is called possibly directed.

Lemma 9.

Let pp be an unshielded possibly directed path from XX to YY in a CPDAG. If pp is out of XX, then pp is directed.

Proof.

It will be proved by induction on the number of nodes on a subpath of pp starting from XX that every such subpath is directed.

Base case: The subpath consisting of XX and its successor on pp is directed, since pp is out of XX.

Inductive step: Suppose that p(X,V)p(X,V) is a directed path. Let UU be the predecessor and WW be the successor of VV on pp. If the edge between VV and WW is undirected, then Lemma 8 says that edge UWU\rightarrow W exists. Since pp is unshielded, this is a contradiction. Therefore, the edge between VV and WW is directed towards WW, and p(X,W)p(X,W) is directed. ∎

Corollary 1.

In a CPDAG, an unshielded possibly directed path from XX to YY takes one of the following three forms:

  (1) XYX\rightarrow\cdots\rightarrow Y

  (2) X  YX\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}\cdots\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Y

  (3) X  YX\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}\cdots\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}\rightarrow\cdots\rightarrow Y

Proof.

Let pp be an unshielded possibly directed path from XX to YY in a CPDAG. If pp is undirected, it is of form (2). Otherwise, let AA be the first node on pp such that p(A,Y)p(A,Y) is out of AA. Owing to Lemma 9, p(A,Y)p(A,Y) is directed. If A=XA=X, then pp is of form (1). Otherwise, p(X,A)p(X,A) is undirected, as a directed edge on it would contradict the choice of AA, and pp is of form (3). ∎

Let p1=(X1,,Xk)p_{1}=(X_{1},\ldots,X_{k}) and p2=(Xk,,Xk+n1)p_{2}=(X_{k},\ldots,X_{k+n-1}) be two paths in graph 𝒢\mathcal{G}. Path (X1,,Xk+n1)(X_{1},\ldots,X_{k+n-1}) in 𝒢\mathcal{G} is the concatenation of p1p_{1} and p2p_{2} and is denoted by p1p2p_{1}\oplus p_{2}.

Lemma 10.

Let 𝒢\mathcal{G} be a DAG and pp be a directed path from XX to YY in 𝒢\mathcal{G}. Then there exists an unshielded directed path from XX to YY in 𝒢\mathcal{G} that goes through a subset of the nodes on pp.

Proof.

Suppose that pp is shielded and let (U,V,W)(U,V,W) be the first shielded triple on pp. The edge between UU and WW in 𝒢\mathcal{G} is out of UU, because otherwise (U,V,W)(U,V,W) would be a directed cycle in 𝒢\mathcal{G}. Therefore, p=p(X,U)p(W,Y)p^{\prime}=p(X,U)\oplus p(W,Y) is a directed path from XX to YY in 𝒢\mathcal{G}. Repeating the same procedure with pp^{\prime} results in an unshielded directed path from XX to YY in 𝒢\mathcal{G} that goes through a subset of the nodes on pp. ∎
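The proof is constructive, and the corresponding shortcutting procedure is easy to sketch (same illustrative parent-set representation as above; the input is assumed to be a directed path in the DAG).

    def adjacent(dag, u, v):
        return u in dag[v] or v in dag[u]

    # Repeatedly shortcut shielded triples (U, V, W) on a directed path: the chord
    # between U and W must be oriented U -> W (otherwise (U, V, W, U) would be a
    # directed cycle), so dropping V keeps the path directed.
    def unshield(dag, path):
        path, i = list(path), 0
        while i + 2 < len(path):
            u, w = path[i], path[i + 2]
            if adjacent(dag, u, w):
                del path[i + 1]          # shortcut the shielded triple
                i = max(i - 1, 0)        # the previous triple may now be shielded
            else:
                i += 1
        return path

    dag = {"a": set(), "b": {"a"}, "c": {"a", "b"}}   # a -> b, a -> c, b -> c
    assert unshield(dag, ["a", "b", "c"]) == ["a", "c"]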

In a graph 𝒢\mathcal{G}, an edge between two nonconsecutive nodes on a simple cycle is called a chord. 𝒢\mathcal{G} is chordal if every simple cycle with four or more distinct nodes has a chord. Let 𝐆\mathbf{G} be a Markov equivalence class of DAGs, 𝒫\mathcal{P} be its CPDAG, and 𝒰\mathcal{U} be the subgraph of 𝒫\mathcal{P} containing only the undirected edges of 𝒫\mathcal{P}. 𝒰\mathcal{U} is chordal (see Meek, 1995, proof of Theorem 3).

A clique in 𝒢\mathcal{G} is a set of nodes that are all adjacent to each other; a maximal clique is a clique that is not contained in another. A join tree 𝒯\mathcal{T} for graph 𝒢\mathcal{G} is a tree over the maximal cliques of 𝒢\mathcal{G} such that, for every pair of maximal cliques {𝐌1,𝐌2}\{\mathbf{M}_{1},\mathbf{M}_{2}\}, every X𝐌1𝐌2X\in\mathbf{M}_{1}\cap\mathbf{M}_{2}, and every maximal clique 𝐌\mathbf{M} on the simple path from 𝐌1\mathbf{M}_{1} to 𝐌2\mathbf{M}_{2} in 𝒯\mathcal{T}, X𝐌X\in\mathbf{M}. 𝒢\mathcal{G} has a join tree if and only if 𝒢\mathcal{G} is chordal (Beeri et al., 1983).
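The defining property of a join tree (for every node, the maximal cliques containing it must form a connected subtree) can be checked directly; the following sketch assumes that the maximal cliques and the candidate tree over them are given explicitly, with illustrative names.

    # Check the join-tree property: for every node X of the graph, the maximal
    # cliques containing X induce a connected subtree of the candidate tree.
    def is_join_tree(cliques, tree_edges):
        nbrs = {c: set() for c in cliques}
        for a, b in tree_edges:
            nbrs[a].add(b)
            nbrs[b].add(a)
        for x in set().union(*cliques):
            holding = {c for c in cliques if x in c}
            start = next(iter(holding))
            component, frontier = {start}, [start]
            while frontier:
                for n in nbrs[frontier.pop()] & holding:
                    if n not in component:
                        component.add(n)
                        frontier.append(n)
            if component != holding:
                return False
        return True

    # The chordal graph with edges a-b, b-c, a-c, c-d has maximal cliques
    # {a, b, c} and {c, d}; connecting them gives a join tree.
    c1, c2 = frozenset("abc"), frozenset("cd")
    assert is_join_tree([c1, c2], [(c1, c2)])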

A total order << on the nodes of undirected graph 𝒰\mathcal{U} induces an orientation of 𝒰\mathcal{U} into a directed graph 𝒢\mathcal{G}: if edge X YX\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Y exists in 𝒰\mathcal{U}, orient the edge as XYX\rightarrow Y if X<YX<Y (Meek, 1995). Clearly, 𝒢\mathcal{G} is acyclic. << is consistent with respect to 𝒰\mathcal{U} if 𝒢\mathcal{G} has no unshielded colliders.
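For illustration, a small sketch of orienting an undirected graph according to a total order and testing consistency (absence of unshielded colliders); the edge lists and orders are illustrative.

    # Orient every undirected edge from the earlier node to the later node.
    def orient(undirected_edges, order):
        rank = {v: i for i, v in enumerate(order)}
        return [(x, y) if rank[x] < rank[y] else (y, x) for x, y in undirected_edges]

    def has_unshielded_collider(directed_edges):
        parents, adj = {}, set()
        for x, y in directed_edges:
            parents.setdefault(y, set()).add(x)
            adj |= {(x, y), (y, x)}
        return any(a != b and (a, b) not in adj
                   for ps in parents.values() for a in ps for b in ps)

    edges = [("a", "b"), ("b", "c")]
    # The order (b, a, c) orients a - b - c as b -> a, b -> c: consistent.
    assert not has_unshielded_collider(orient(edges, ["b", "a", "c"]))
    # The order (a, c, b) orients it as a -> b <- c: an unshielded collider.
    assert has_unshielded_collider(orient(edges, ["a", "c", "b"]))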

A partial order π\pi is a tree order for tree 𝒯\mathcal{T} if adjacent nodes in 𝒯\mathcal{T} are comparable under π\pi (Meek, 1995). Any tree order for 𝒯\mathcal{T} can be obtained by choosing a root for 𝒯\mathcal{T} and ordering the nodes based on their distance from the root. A tree order π\pi for a join tree 𝒯\mathcal{T} for graph 𝒢\mathcal{G} induces a partial order π\prec_{\pi} on the nodes of 𝒢\mathcal{G} (Meek, 1995): if π(𝐌1,𝐌2)\pi(\mathbf{M}_{1},\mathbf{M}_{2}), then for all X𝐌1𝐌2X\in\mathbf{M}_{1}\cap\mathbf{M}_{2} and Y𝐌2Y\in\mathbf{M}_{2}, XπYX\prec_{\pi}Y. If XX and YY are both in the minimal element of π\pi, then they are not comparable under π\prec_{\pi}. (Our definition of π\prec_{\pi} differs from that of Meek (1995), since it can be shown that under the latter definition π\prec_{\pi} is not a partial order; however, the results of Meek (1995) still hold under our definition of π\prec_{\pi}.)

Lemma 11 (Meek (1995), Lemma 4).

Let 𝒰\mathcal{U} be a chordal graph, 𝒯\mathcal{T} be a join tree for 𝒰\mathcal{U}, and π\pi be a tree order for 𝒯\mathcal{T}. Any extension of π\prec_{\pi} to a total order is a consistent order with respect to 𝒰\mathcal{U}.

Lemma 12.

Let 𝒰\mathcal{U} be a chordal graph and cc be a simple cycle in 𝒰\mathcal{U}. There exist two shielded triples on cc.

Proof.

Let c=(X1,,Xn,X1)c=(X_{1},\ldots,X_{n},X_{1}). The result will be proved by induction on the number of edges nn of cc.

Base case: If n=3n=3, then (X1,X2,X3)(X_{1},X_{2},X_{3}) and (X2,X3,X1)(X_{2},X_{3},X_{1}) are shielded triples on cc.

Inductive step: Suppose that n4n\geq 4 and the result holds for simple cycles with up to n1n-1 edges. Since 𝒰\mathcal{U} is chordal, cc has a chord Xi XjX_{i}\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}X_{j}. By the induction hypothesis, there exist two shielded triples t1t_{1} and t2t_{2} on (Xi,,Xj,Xi)(X_{i},\ldots,X_{j},X_{i}) and two shielded triples t3t_{3} and t4t_{4} on (Xi,Xj,,Xn,X1,,Xi)(X_{i},X_{j},\ldots,X_{n},X_{1},\ldots,X_{i}). Even if t2=(Xj1,Xj,Xi)t_{2}=(X_{j-1},X_{j},X_{i}) and t3=(Xi,Xj,Xj+1)t_{3}=(X_{i},X_{j},X_{j+1}), t1t_{1} and t4t_{4} are distinct shielded triples on cc. ∎

Let p=(X1,,Xn)p=(X_{1},\ldots,X_{n}) be a path. Path (Xn,,X1)(X_{n},\ldots,X_{1}) is the reverse of pp and is denoted by p1p^{-1}. A simple cycle is shielded if there is a shielded triple on the cycle.

Proof of Theorem 3.

Forward direction: Suppose X𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)X\in\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}) and let 𝒢𝐆\mathcal{G}\in\mathbf{G}. There exists a directed path from XX to 𝐘\mathbf{Y} in 𝒢\mathcal{G}. Owing to Lemma 10, there exists an unshielded directed path from XX to 𝐘\mathbf{Y} in 𝒢\mathcal{G}. Therefore, there exists an unshielded possibly directed path pp from XX to 𝐘\mathbf{Y} in 𝒫\mathcal{P}. Suppose X𝐀𝐍𝒫(𝐘)X\notin\mathbf{AN}_{\mathcal{P}}(\mathbf{Y}). Then no unshielded possibly directed path from XX to 𝐘\mathbf{Y} in 𝒫\mathcal{P} is out of XX due to Lemma 9. Let ZZ be the successor of XX on pp and 𝒢𝐆\mathcal{G}\in\mathbf{G} such that edge X ZX\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Z is oriented as XZX\leftarrow Z in 𝒢\mathcal{G}. There exists a directed path from XX to 𝐘\mathbf{Y} that does not go through ZZ in 𝒢\mathcal{G}, because otherwise a directed cycle would occur in 𝒢\mathcal{G}. Lemma 10 then implies that there exists an unshielded directed path from XX to 𝐘\mathbf{Y} that does not go through ZZ in 𝒢\mathcal{G}. Therefore, there exists an unshielded possibly directed path from XX to 𝐘\mathbf{Y} that does not go through ZZ in 𝒫\mathcal{P}.

Suppose that for each pair of unshielded possibly directed paths (X,Z1,,Y1)(X,Z_{1},\ldots,Y_{1}) and (X,Z2,,Y2)(X,Z_{2},\ldots,Y_{2}) in 𝒫\mathcal{P} such that Z1Z2Z_{1}\neq Z_{2}, Y1𝐘Y_{1}\in\mathbf{Y}, and Y2𝐘Y_{2}\in\mathbf{Y}, Z1Z_{1} and Z2Z_{2} are adjacent in 𝒫\mathcal{P}. Let 𝒰\mathcal{U} be the subgraph of 𝒫\mathcal{P} containing only the undirected edges of 𝒫\mathcal{P} and 𝐙\mathbf{Z} be the set of nodes ZZ such that an unshielded possibly directed path (X,Z,,Y)(X,Z,\ldots,Y) (Y𝐘Y\in\mathbf{Y}) exists in 𝒫\mathcal{P}. {X}𝐙\{X\}\cup\mathbf{Z} is a clique in 𝒰\mathcal{U}; therefore, it is contained in some maximal clique 𝐌\mathbf{M}. Let 𝒯\mathcal{T} be a join tree for 𝒰\mathcal{U} with 𝐌\mathbf{M} as the root, π\pi be a tree order for 𝒯\mathcal{T}, π\prec^{\prime}_{\pi} be the extension of π\prec_{\pi} such that ZπXZ\prec^{\prime}_{\pi}X for each Z𝐙Z\in\mathbf{Z}, and α\alpha be a total order which extends π\prec^{\prime}_{\pi}. Owing to Lemma 11, α\alpha is a consistent order with respect to 𝒰\mathcal{U}. Let 𝒢𝐆\mathcal{G}\in\mathbf{G} be such that 𝒰\mathcal{U} is oriented according to α\alpha. There exists a directed path from XX to Y𝐘Y\in\mathbf{Y} in 𝒢\mathcal{G}. Owing to Lemma 10, there exists an unshielded directed path from XX to YY in 𝒢\mathcal{G}. Therefore, there exists an unshielded possibly directed path from XX to YY in 𝒫\mathcal{P} that is out of XX in 𝒢\mathcal{G}. This is a contradiction. Thus, there exists a pair of unshielded possibly directed paths p1=(X,Z1,,Y1)p_{1}=(X,Z_{1},\ldots,Y_{1}) and p2=(X,Z2,,Y2)p_{2}=(X,Z_{2},\ldots,Y_{2}) in 𝒫\mathcal{P} such that Z1Z2Z_{1}\neq Z_{2}, Y1𝐘Y_{1}\in\mathbf{Y}, Y2𝐘Y_{2}\in\mathbf{Y}, and ¬Adj𝒫(Z1,Z2)\neg\text{Adj}_{\mathcal{P}}(Z_{1},Z_{2}).

Owing to Corollary 1, p1p_{1} is of the form X Z1 Y1X\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Z_{1}\cdots\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}\rightarrow\cdots\rightarrow Y_{1} and p2p_{2} is of the form X Z2 Y2X\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}Z_{2}\cdots\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}\rightarrow\cdots\rightarrow Y_{2}. Let U1U_{1} be the first ancestor of Y1Y_{1} on p1p_{1} (which may be Y1Y_{1} itself), U2U_{2} be the first ancestor of Y2Y_{2} on p2p_{2} (which may be Y2Y_{2} itself), and s=p1(X,U1)1p2(X,U2)s=p_{1}(X,U_{1})^{-1}\oplus p_{2}(X,U_{2}). The path ss is undirected, and every triple on it is unshielded, because p1p_{1} and p2p_{2} are unshielded and ¬Adj𝒫(Z1,Z2)\neg\text{Adj}_{\mathcal{P}}(Z_{1},Z_{2}). Suppose that a subpath ss^{\prime} of ss is a simple cycle. Lemma 12 says that two shielded triples exist on ss^{\prime}, at least one of which is a triple on ss. This is a contradiction. Therefore, ss is a simple path. Thus, XX is an interior node on an unshielded undirected path in 𝒫\mathcal{P} between two members of 𝐀𝐍𝒫(𝐘)\mathbf{AN}_{\mathcal{P}}(\mathbf{Y}).

Reverse direction: Suppose X𝐀𝐍𝒫(𝐘)X\notin\mathbf{AN}_{\mathcal{P}}(\mathbf{Y}) and that XX is an interior node on an unshielded undirected path pp from Z1𝐀𝐍𝒫(𝐘)Z_{1}\in\mathbf{AN}_{\mathcal{P}}(\mathbf{Y}) to Z2𝐀𝐍𝒫(𝐘)Z_{2}\in\mathbf{AN}_{\mathcal{P}}(\mathbf{Y}) in 𝒫\mathcal{P}. Let U1U_{1} and U2U_{2} be the nodes on either side of XX on pp and 𝒢𝐆\mathcal{G}\in\mathbf{G}. Suppose that U1 XU_{1}\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}X in 𝒫\mathcal{P} is oriented as U1XU_{1}\leftarrow X in 𝒢\mathcal{G}. Then p(Z1,X)1p(Z_{1},X)^{-1} is a directed path in 𝒢\mathcal{G}, because otherwise there would be unshielded colliders in 𝒢\mathcal{G} that are not in 𝒫\mathcal{P}. Suppose instead that U1 XU_{1}\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}X in 𝒫\mathcal{P} is oriented as U1XU_{1}\rightarrow X in 𝒢\mathcal{G}. Then edge X U2X\mathrel{\rule[2.15277pt]{6.99997pt}{0.5pt}}U_{2} in 𝒫\mathcal{P} is oriented as XU2X\rightarrow U_{2} in 𝒢\mathcal{G} and p(X,Z2)p(X,Z_{2}) is a directed path in 𝒢\mathcal{G}, because otherwise there would be unshielded colliders in 𝒢\mathcal{G} that are not in 𝒫\mathcal{P}. Therefore, X𝐀𝐍𝒢(𝐘)X\in\mathbf{AN}_{\mathcal{G}}(\mathbf{Y}) and, since 𝒢\mathcal{G} was arbitrary, X𝐜𝐨𝐦𝐩𝐀𝐍𝐆(𝐘)X\in\mathbf{compAN}_{\mathbf{G}}(\mathbf{Y}). ∎
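The two directions above characterise the compelled ancestors in terms of 𝒫 alone, and the characterisation can be turned into a brute-force check on a CPDAG. The sketch below enumerates unshielded undirected paths explicitly, so it is exponential and intended only to make the criterion concrete; the CPDAG is assumed to be given as a set of directed edges (tail, head) and a set of undirected edges, and all names are illustrative rather than taken from the paper.

    # X is compelled to be an ancestor of Y iff X is in AN_P(Y) or X is an interior
    # node of an unshielded undirected path between two members of AN_P(Y).
    def compelled_ancestors_cpdag(directed, undirected, targets):
        nodes = {v for e in directed | undirected for v in e} | set(targets)
        adj = {v: set() for v in nodes}
        undirected_adj = {v: set() for v in nodes}
        for x, y in directed | undirected:
            adj[x].add(y)
            adj[y].add(x)
        for x, y in undirected:
            undirected_adj[x].add(y)
            undirected_adj[y].add(x)
        an = set(targets)                  # AN_P(Y): via directed edges only
        changed = True
        while changed:
            changed = False
            for x, y in directed:
                if y in an and x not in an:
                    an.add(x)
                    changed = True
        result = set(an)

        def extend(path):                  # grow unshielded undirected paths
            for nxt in undirected_adj[path[-1]] - set(path):
                if len(path) >= 2 and nxt in adj[path[-2]]:
                    continue               # the new triple would be shielded
                if nxt in an and len(path) >= 2:
                    result.update(path[1:])    # interior nodes are compelled
                extend(path + [nxt])

        for start in an:
            extend([start])
        return result

    # CPDAG with a -> y <- b and a - x - b (a and b not adjacent): x is compelled.
    directed = {("a", "y"), ("b", "y")}
    undirected = {("a", "x"), ("x", "b")}
    assert compelled_ancestors_cpdag(directed, undirected, {"y"}) == {"a", "b", "x", "y"}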

Proof of Lemma 6.

Let 𝒯\mathcal{T} be the subgraph of 𝒫\mathcal{P} containing only the nodes and the edges on unshielded undirected paths from XX to 𝐕{X}\mathbf{V}\setminus\{X\} in 𝒫\mathcal{P}. Clearly, 𝒯\mathcal{T} is a connected undirected graph. Suppose that there is a simple cycle cc in 𝒯\mathcal{T}. Owing to Lemma 12, there exist two shielded triples on cc, which is a contradiction. Therefore, 𝒯\mathcal{T} is a tree. ∎

Proof of Theorem 4.

Forward direction: Suppose P𝐁𝐍𝐌(𝒢)[S=s^P\in\operatorname{\mathbf{BNM}}(\mathcal{G})[^{S=\hat{s}}. Then there exists Q𝐁𝐍𝐌(𝒢)Q\in\operatorname{\mathbf{BNM}}(\mathcal{G}) such that Q[S=s^=PQ[^{S=\hat{s}}=P. Since p(𝐩𝐚𝒢(S))>0p(\mathbf{pa}_{\mathcal{G}}(S))>0, q(𝐩𝐚𝒢(S),s^)>0q(\mathbf{pa}_{\mathcal{G}}(S),\hat{s})>0. Let R𝐁𝐍𝐌(𝒢𝐎)R\in\operatorname{\mathbf{BNM}}(\mathcal{G}_{\mathbf{O}}) such that r(X𝐩𝐚𝒢𝐎(X))=q(X𝐩𝐚𝒢(X))r(X\mid\mathbf{pa}_{\mathcal{G}_{\mathbf{O}}}(X))=q(X\mid\mathbf{pa}_{\mathcal{G}}(X)) for each X𝐎X\in\mathbf{O}. Then

p(𝐨𝐩𝐚𝒢(S)𝐩𝐚𝒢(S))\displaystyle p(\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)\mid\mathbf{pa}_{\mathcal{G}}(S)) =q(𝐨𝐩𝐚𝒢(S)𝐩𝐚𝒢(S),s^)=q(𝐨,s^)q(𝐩𝐚𝒢(S),s^)\displaystyle=q(\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)\mid\mathbf{pa}_{\mathcal{G}}(S),\hat{s})=\frac{q(\mathbf{o},\hat{s})}{q(\mathbf{pa}_{\mathcal{G}}(S),\hat{s})}
=q(s^𝐩𝐚𝒢(S))O𝐎q(o𝐩𝐚𝒢(O))q(s^𝐩𝐚𝒢(S))𝐨𝐩𝐚𝒢(S)O𝐎q(o𝐩𝐚𝒢(O))\displaystyle=\frac{q(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))}{q(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))\sum_{\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)}\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))}
=O𝐎q(o𝐩𝐚𝒢(O))𝐨𝐩𝐚𝒢(S)O𝐎q(o𝐩𝐚𝒢(O))\displaystyle=\frac{\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))}{\sum_{\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)}\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}}(O))}
=O𝐎r(o𝐩𝐚𝒢𝐎(O))𝐨𝐩𝐚𝒢(S)O𝐎r(o𝐩𝐚𝒢𝐎(O))\displaystyle=\frac{\prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}_{\mathbf{O}}}(O))}{\sum_{\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)}\prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}_{\mathbf{O}}}(O))}
=r(𝐨)r(𝐩𝐚𝒢(S))=r(𝐨𝐩𝐚𝒢(S)𝐩𝐚𝒢(S))\displaystyle=\frac{r(\mathbf{o})}{r(\mathbf{pa}_{\mathcal{G}}(S))}=r(\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)\mid\mathbf{pa}_{\mathcal{G}}(S))

Therefore, P[𝐏𝐀𝒢(S)𝐁𝐍𝐌(𝒢𝐎)[𝐏𝐀𝒢(S)P[^{\mathbf{PA}_{\mathcal{G}}(S)}\in\operatorname{\mathbf{BNM}}(\mathcal{G}_{\mathbf{O}})[^{\mathbf{PA}_{\mathcal{G}}(S)}.

Reverse direction: Suppose P[𝐏𝐀𝒢(S)𝐁𝐍𝐌(𝒢𝐎)[𝐏𝐀𝒢(S)P[^{\mathbf{PA}_{\mathcal{G}}(S)}\in\operatorname{\mathbf{BNM}}(\mathcal{G}_{\mathbf{O}})[^{\mathbf{PA}_{\mathcal{G}}(S)}. Then there exists Q𝐁𝐍𝐌(𝒢𝐎)Q\in\operatorname{\mathbf{BNM}}(\mathcal{G}_{\mathbf{O}}) such that Q[𝐏𝐀𝒢(S)=P[𝐏𝐀𝒢(S)Q[^{\mathbf{PA}_{\mathcal{G}}(S)}=P[^{\mathbf{PA}_{\mathcal{G}}(S)}. Let R𝐁𝐍𝐌(𝒢)R\in\operatorname{\mathbf{BNM}}(\mathcal{G}) be such that r(o𝐩𝐚𝒢(O))=q(o𝐩𝐚𝒢𝐎(O))r(o\mid\mathbf{pa}_{\mathcal{G}}(O))=q(o\mid\mathbf{pa}_{\mathcal{G}_{\mathbf{O}}}(O)) for each O𝐎O\in\mathbf{O} and r(s^𝐩𝐚𝒢(S))=cp(𝐩𝐚𝒢(S))/q(𝐩𝐚𝒢(S))r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))=c\cdot p(\mathbf{pa}_{\mathcal{G}}(S))/q(\mathbf{pa}_{\mathcal{G}}(S)), where c=1/max𝐩𝐚𝒢(S)(p(𝐩𝐚𝒢(S))/q(𝐩𝐚𝒢(S)))c=1/\max_{\mathbf{pa}_{\mathcal{G}}(S)}(p(\mathbf{pa}_{\mathcal{G}}(S))/q(\mathbf{pa}_{\mathcal{G}}(S))) (ensuring that r(s^𝐩𝐚𝒢(S))1r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))\leq 1). Then

r(𝐨,s^)\displaystyle r(\mathbf{o},\hat{s}) =r(s^𝐩𝐚𝒢(S))O𝐎r(o𝐩𝐚𝒢(O))=r(s^𝐩𝐚𝒢(S))O𝐎q(o𝐩𝐚𝒢𝐎(O))\displaystyle=r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))\prod_{O\in\mathbf{O}}r(o\mid\mathbf{pa}_{\mathcal{G}}(O))=r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))\prod_{O\in\mathbf{O}}q(o\mid\mathbf{pa}_{\mathcal{G}_{\mathbf{O}}}(O))
=r(s^𝐩𝐚𝒢(S))q(𝐨)=r(s^𝐩𝐚𝒢(S))q(𝐩𝐚𝒢(S))q(𝐨𝐩𝐚𝒢(S)𝐩𝐚𝒢(S))\displaystyle=r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))q(\mathbf{o})=r(\hat{s}\mid\mathbf{pa}_{\mathcal{G}}(S))q(\mathbf{pa}_{\mathcal{G}}(S))q(\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)\mid\mathbf{pa}_{\mathcal{G}}(S))
=cp(𝐩𝐚𝒢(S))q(𝐩𝐚𝒢(S))q(𝐩𝐚𝒢(S))p(𝐨𝐩𝐚𝒢(S)𝐩𝐚𝒢(S))=cp(𝐨)\displaystyle=c\cdot\frac{p(\mathbf{pa}_{\mathcal{G}}(S))}{q(\mathbf{pa}_{\mathcal{G}}(S))}q(\mathbf{pa}_{\mathcal{G}}(S))p(\mathbf{o}\setminus\mathbf{pa}_{\mathcal{G}}(S)\mid\mathbf{pa}_{\mathcal{G}}(S))=c\cdot p(\mathbf{o})

and r(s^)=𝐨r(𝐨,s^)=c𝐨p(𝐨)=c>0r(\hat{s})=\sum_{\mathbf{o}}r(\mathbf{o},\hat{s})=c\sum_{\mathbf{o}}p(\mathbf{o})=c>0. Therefore,

r(𝐨s^)=r(𝐨,s^)r(s^)=cp(𝐨)c=p(𝐨)\displaystyle r(\mathbf{o}\mid\hat{s})=\frac{r(\mathbf{o},\hat{s})}{r(\hat{s})}=\frac{c\cdot p(\mathbf{o})}{c}=p(\mathbf{o})

Thus, P𝐁𝐍𝐌(𝒢)[S=s^P\in\operatorname{\mathbf{BNM}}(\mathcal{G})[^{S=\hat{s}}. ∎
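For a concrete check of the forward direction, the sketch below builds a BN QQ over {A,B,C,S}\{A,B,C,S\} with graph ABCA\rightarrow B\rightarrow C and SS a child of BB (so the parents of SS are {B}\{B\}), conditions on S=1S=1, and verifies numerically that the selected distribution still satisfies the constraint of the marginal model conditionally on BB, namely ACBA\perp C\mid B. All probability tables are made up for the example.

    from itertools import product

    # Illustrative CPTs for A -> B -> C with a selection node S whose parent is B.
    pA = {0: 0.3, 1: 0.7}
    pB = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.4, (1, 1): 0.6}   # pB[(b, a)] = q(b | a)
    pC = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}   # pC[(c, b)] = q(c | b)
    pS = {0: 0.25, 1: 0.75}                                      # pS[b] = q(S = 1 | b)

    # Joint q(a, b, c, s) and the selected distribution p(a, b, c) = q(a, b, c | S = 1).
    q = {(a, b, c, s): pA[a] * pB[(b, a)] * pC[(c, b)] * (pS[b] if s else 1 - pS[b])
         for a, b, c, s in product((0, 1), repeat=4)}
    z = sum(v for k, v in q.items() if k[3] == 1)
    p = {(a, b, c): q[(a, b, c, 1)] / z for a, b, c in product((0, 1), repeat=3)}

    def marg(dist, keep):
        out = {}
        for k, v in dist.items():
            kk = tuple(k[i] for i in keep)
            out[kk] = out.get(kk, 0.0) + v
        return out

    pb, pab, pcb = marg(p, [1]), marg(p, [0, 1]), marg(p, [2, 1])
    # Conditionally on B = pa(S), selection has not changed the factorisation:
    # p(a, c | b) = p(a | b) p(c | b) for every configuration.
    for a, b, c in product((0, 1), repeat=3):
        assert abs(p[(a, b, c)] / pb[(b,)]
                   - (pab[(a, b)] / pb[(b,)]) * (pcb[(c, b)] / pb[(b,)])) < 1e-12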

References

  • Beeri et al. [1983] Catriel Beeri, Ronald Fagin, David Maier, and Mihalis Yannakakis. On the desirability of acyclic database schemes. Journal of the ACM (JACM), 30(3):479–513, 1983.
  • Berkson [1946] Joseph Berkson. Limitations of the application of fourfold table analysis to hospital data. Biometrics Bulletin, 2(3):47–53, 1946.
  • Borboudakis and Tsamardinos [2015] Giorgos Borboudakis and Ioannis Tsamardinos. Bayesian network learning with discrete case-control data. In UAI, pages 151–160, 2015.
  • Chickering [1995] David Maxwell Chickering. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 87–98. Morgan Kaufmann Publishers Inc., 1995.
  • Cooper [2000] Gregory F Cooper. A Bayesian method for causal modeling and discovery under selection. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 98–106. Morgan Kaufmann Publishers Inc., 2000.
  • Cox et al. [2006] David A Cox, John Little, and Donal O’Shea. Using algebraic geometry, volume 185. Springer Science & Business Media, 2006.
  • Evans [2016] Robin J Evans. Graphs for margins of Bayesian networks. Scandinavian Journal of Statistics, 43(3):625–648, 2016.
  • Evans [2018] Robin J Evans. Margins of discrete Bayesian networks. The Annals of Statistics, 46(6A):2623–2656, 2018.
  • Evans and Didelez [2015] Robin J Evans and Vanessa Didelez. Recovering from selection bias using marginal structure in discrete models. In Proceedings of the UAI 2015 Conference on Advances in Causal Inference, volume 1504, pages 46–55. CEUR-WS.org, 2015.
  • Jobson [2012] J Dave Jobson. Applied multivariate data analysis, volume II: Categorical and multivariate methods. Springer Science & Business Media, 2012.
  • Kim and Pearl [1983] JinHyung Kim and Judea Pearl. A computational model for causal and diagnostic reasoning in inference systems. In International Joint Conference on Artificial Intelligence, 1983.
  • Lauritzen [1999] Steffen L Lauritzen. Generating mixed hierarchical interaction models by selection. 1999.
  • Meek [1995] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403–410. Morgan Kaufmann Publishers Inc., 1995.
  • Neapolitan [2004] Richard E Neapolitan. Learning Bayesian networks. Pearson Prentice Hall, Upper Saddle River, NJ, 2004. ISBN 0130125342.
  • Pearl [2009] Judea Pearl. Causality. Cambridge University Press, 2009.
  • Richardson and Spirtes [2002] Thomas Richardson and Peter Spirtes. Ancestral graph Markov models. Annals of Statistics, pages 962–1030, 2002.
  • Richardson et al. [2017] Thomas S Richardson, Robin J Evans, James M Robins, and Ilya Shpitser. Nested Markov properties for acyclic directed mixed graphs. arXiv preprint arXiv:1701.06686, 2017.
  • Robins [1986] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period — application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
  • Shpitser et al. [2014] Ilya Shpitser, Robin J Evans, Thomas S Richardson, and James M Robins. Introduction to nested Markov models. Behaviormetrika, 41(1):3–39, 2014.
  • Spirtes et al. [1999] Peter Spirtes, Christopher Meek, and Thomas Richardson. An algorithm for causal inference in the presence of latent variables and selection bias. In Computation, Causation, and Discovery, 1999.
  • Spirtes et al. [2000] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT Press, 2000. ISBN 0262194406.
  • Wolfe et al. [2016] Elie Wolfe, Robert W Spekkens, and Tobias Fritz. The inflation technique for causal inference with latent variables. arXiv preprint arXiv:1609.00672, 2016.
  • Zhang [2008] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16):1873–1896, 2008.