
On the Sparse DAG Structure Learning
Based on Adaptive Lasso

Danru Xu, Erdun Gao, Wei Huang, Menghan Wang, Andy Song, Mingming Gong Danru Xu, Erdun Gao, Wei Huang, and Mingming Gong are with the School of Mathematics and Statistics, Faculty of Science, The University of Melbourne, Melbourne, Australia (e-mail: danru.xu@student.unimelb.edu.au; erdun.gao@student.unimelb.edu.au; wei.huang@unimelb.edu.au; mingming.gong@unimelb.edu.au). Menghan Wang is with eBay Inc. (e-mail: wangmengh@zju.edu.cn). Andy Song is with the School of Computing Technologies, Royal Melbourne Institute of Technology, Melbourne, Australia (e-mail: andy.song@rmit.edu.au).
Abstract

Learning the underlying Bayesian networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments following NOTEARS [1] recast this combinatorial problem as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. However, these continuous optimization methods tend to produce non-sparse graphs after the numerical optimization, which makes it hard to rule out potentially cycle-inducing edges or false-discovery edges with small weights. To address this issue, in this paper we develop a completely data-driven DAG structure learning method that requires no predefined threshold for pruning small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL), which applies an adaptive penalty to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic data and a real-world dataset demonstrate that our method consistently outperforms NOTEARS.

Index Terms:
Bayesian Networks learning, Sparse DAG, Continuous Optimization, Adaptive Lasso

I Introduction

Bayesian networks (BNs) [2] are graphical models that bridge probability theory and graph theory to represent a joint distribution; they can also model the form and strength of the relationships between random variables by representing conditional dependencies as edges, with corresponding parameters, in a directed acyclic graph (DAG). With success in encoding uncertain expert knowledge in expert systems [3], BNs have been widely used in many applications such as causal inference [4], biology [5], and healthcare [6].

Over the last decades, much work has been devoted to discovering Bayesian networks from purely observational data. In general, the problem is ill-posed since multiple DAGs may produce the same distribution. Furthermore, regarding computational complexity [7], the search space of DAGs is discrete, so many standard numerical algorithms cannot be utilized, which renders the DAG structure learning problem NP-hard [8, 9]. In other words, the number of candidate DAGs to search over grows super-exponentially with the number of variables.

In general, there are two main classes of DAG structure learning methods: constraint-based algorithms and score-based algorithms. Constraint-based methods, such as SGS [10], PC [11], and GS [12], read the conditional independence relationships encoded in the data and seek a completed partially directed acyclic graph (CPDAG) that entails all (and only) the corresponding d-separations under the Markov and faithfulness assumptions [13]. That is to say, the output of these algorithms is an equivalence class of DAGs instead of the ground-truth DAG. The main idea of score-based methods is to establish a suitable score function as the objective and then search the solution space for the graph structure that maximizes it. Common score functions include the BGe score [14], BDe score [15], MDL [16], and BIC [14]. Building on order search and greedy search, several approaches that optimize these score functions over the DAG space have been proposed, such as order-based search (OBS) [17], the max-min hill-climbing method [18], greedy equivalence search (GES) [19], fast GES [20], and A* [21].

To improve search efficiency, Zheng et al. [1] recently introduced a new approach, named NOTEARS, which replaces the traditional combinatorial constraint with a smooth characterization of acyclicity whose value is zero if and only if the graph is a DAG. In this way, the combinatorial optimization problem is converted into a continuous one, so a wide range of numerical methods can be applied. As NOTEARS only considers the linear model, several follow-up works [22, 23, 24, 25, 26, 27] extend it to non-linear or non-parametric settings by leveraging neural networks or orthogonal series. Subsequent works further extend this framework to time-series data [28], unmeasured common confounders [29], interventional data [30], distributed learning [31, 32], and incomplete data [33, 34]. Further related works include GOLEM [35], a likelihood-based method that uses the likelihood as the learning objective with soft constraints; NOFEARS [36], which focuses on the optimization by considering the Karush-Kuhn-Tucker (KKT) conditions; and Truncated Matrix Power Iteration [37], which finds that enlarging the coefficients of higher-order terms better approximates the DAG constraint and leads to better graph recovery performance. The convergence of the augmented Lagrangian method and the quadratic penalty method is studied in [38].

Figure 1: Visualization of the ground-truth (left-hand side of legend) and estimated (right-hand side of legend) weighted adjacency matrices of a 2-node DAG with n=1000 and penalty level \lambda\in\{0,0.3,0.6,0.9,1.2,1.5,1.8,2.1,2.4\}. NOTEARS obtains all the solutions without thresholding.

It is worth noting that NOTEARS and most of its follow-up works [27, 26, 25, 24, 23, 35] cannot directly obtain a sparse solution, since the numerical optimization always introduces many small false-positive estimates. From Figure 1, we can observe that even with a large sample size on a 2-node graph, the false-positive edge (the right one) always exists and cannot be ruled out by regularization. Moreover, as the penalty level grows, the bias of the estimates becomes larger and larger, until the estimate is an all-zero matrix. This implies that, in NOTEARS, merely increasing the penalty coefficient cannot guarantee a consistent estimate, and extra procedures are necessary to obtain a sparse solution.

To deal with this issue, NOTEARS utilizes a hard threshold to post-process the initial estimates. The fixed-threshold strategy is a direct way to cut down the false-positive estimates caused by numerical error. However, it is neither a flexible nor a systematic way to reduce the number of false discoveries. First, a fixed, sub-optimal threshold cannot adapt to different types of graphs. Second, although the number of false-positive edges is diminished by hard thresholding, small coefficients in the true model may also be deleted indiscriminately (see Section IV-B).

In summary, NOTEARS cannot obtain a sparse solution without thresholding, while the thresholding strategy is inflexible and not systematic. Motivated by this, we develop a new DAG structure learning method based on the adaptive lasso, which abandons the fixed-threshold strategy and directly yields a sparse output. Our main contributions are as follows.

  1.

    We propose a purely data-driven method that requires no predefined threshold, and we describe the specific algorithm.

  2.

    We study the oracle properties of our method, including sparsity and asymptotic normality, which means that our method can select the correct subset model and converges to the true model at the optimal rate as the sample size increases.

  3.

    We use a larger and more suitable validation set in cross-validation, in line with our goal of finding the optimal subset model.

  4.

    We demonstrate the effectiveness of our method through simulation experiments with weights generated from a broader range, e.g., Gaussian-distributed weights with mean zero.

The rest of this paper is organized as follows. In Section II, we review some basic concepts of the DAG model and the specific algorithm of NOTEARS, as well as the limitation of its fixed-threshold strategy. Then, based on the idea of the adaptive Lasso [39], we introduce our method and study its asymptotic behavior as the sample size goes to infinity in Section III. In Section IV, to verify our theoretical study and the effectiveness of our method, we compare our method with NOTEARS and existing state-of-the-art methods through simulation experiments. Finally, we conclude our work in Section V.

II Related works

In this section, we review the linear DAG model and the NOTEARS method. In Section II-A, we first briefly review some fundamental concepts of the DAG model under the linear structure assumption and the corresponding score function used to maximize the likelihood of the observations. We then detail the NOTEARS method and its optimization algorithm in Section II-B.

II-A DAG with Linear Structure

A DAG can be encoded as G=(V,E), where V refers to the set of nodes and E is the set of directed edges. The nodes in G represent the variables under consideration, and the directed edges correspond to direct conditional relationships between parental nodes and child nodes. For example, V_{1}\rightarrow V_{2} represents that V_{1} is a parent of V_{2}. Figure 2 illustrates three possible three-node DAGs with two edges.

Figure 2: No cycle exists in any of these three graphs. Both X_{1} and X_{3} are parents of X_{2} (left); X_{2} is a parent of X_{1} and X_{3} (middle); X_{2} is the parent of X_{3} and the child of X_{1} (right).

The task of DAG structure learning is to seek, from the discrete space of DAGs over d nodes, the DAG that best fits the observational data \mathbb{X}. Let \mathbb{X}\in\mathbb{R}^{n\times d} represent the observations, consisting of n d-dimensional vectors. We assume that these vectors are realizations of n independent and identically distributed (iid) random vectors with joint distribution P(X), where X=(X_{1},X_{2},\ldots,X_{d}). Importantly, under the assumptions of no latent variables and Markovianity [40], the joint distribution P(X) can be decomposed as follows:

P(X)=\prod_{i=1}^{d}P(X_{i}|Pa(X_{i})), (1)

where Pa(X_{i}) represents the parental set of X_{i}. We then model the relationships among the variables in X by a structural equation model (SEM), denoted as X_{i}=f_{i}(Pa(X_{i}),\varepsilon_{i}), i=1,\ldots,d, where \varepsilon_{i} refers to the unobserved exogenous variables. In this paper, we narrow our focus to DAGs with additive noise models (ANMs). By introducing the weighted adjacency matrix W\in\mathbb{R}^{d\times d}, we can write the linear ANM as X=XW+N, where N is a d-dimensional random vector, denoted as N=(N_{1},N_{2},\ldots,N_{d}). Each N_{i} represents the noise term added to X_{i}, and we assume N_{1},N_{2},\ldots,N_{d} are independent and identically distributed. The structural information of the DAG is completely contained in W, which is defined in the following way:

\{V_{i}\rightarrow V_{j}\}\subset E\quad\Longleftrightarrow\quad W_{ij}\neq 0. (2)

We denote by G_{W} the graph induced by W. To estimate W, we can maximize the likelihood of the observations \mathbb{X}. By the well-known probability density transformation [35], this can equivalently be achieved by minimizing the least-squares loss \frac{1}{2n}||X-XW||_{2}^{2} over the search space. A regularization term \lambda||W||_{1} is added to induce a sparse solution. The overall score function is then

\mathcal{S}(W)=\frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}. (3)
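For concreteness, a minimal NumPy sketch of the score in Eq. (3) might look as follows (the function name and the default penalty value are our own choices, not part of the original method description):

```python
import numpy as np

def score(W, X, lam=0.1):
    """Least-squares loss plus an l1 penalty, i.e. the score S(W) of Eq. (3)."""
    n = X.shape[0]
    return np.sum((X - X @ W) ** 2) / (2 * n) + lam * np.abs(W).sum()
```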

II-B The NOTEARS Method

Previous score-based methods suffer from searching the discrete DAG space, leading to low efficiency. To overcome this problem, Zheng et al. [1] proposed the NOTEARS method, which uses a function h satisfying four desiderata: i) h(W)=0 is a necessary and sufficient condition for G_{W} to be a DAG; ii) its value reflects the level of acyclicity of the graph; iii) it is smooth; iv) its first derivative exists and is easy to compute. Based on the method in [41], which uses matrix algebra to count the cycles in a graph, NOTEARS verifies that h(W)=\text{Tr}(e^{W\circ W})-d is a proper function to characterize acyclicity, with derivative \nabla h(W)=(e^{W\circ W})^{T}\circ 2W. With this smooth equality constraint, the problem can be written as the following optimization program:

\min_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}, (4)
subject to h(W)=0.
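As an illustration, the acyclicity function h(W)=\text{Tr}(e^{W\circ W})-d and its gradient can be evaluated in a few lines of NumPy/SciPy; this is only a sketch, not the reference implementation of [1]:

```python
import numpy as np
from scipy.linalg import expm

def h_and_grad(W):
    """Acyclicity measure h(W) = Tr(exp(W∘W)) - d and its gradient."""
    d = W.shape[0]
    E = expm(W * W)          # matrix exponential of the Hadamard square
    h = np.trace(E) - d      # equals zero iff the graph induced by W is a DAG
    grad = E.T * 2 * W       # ∇h(W) = (e^{W∘W})^T ∘ 2W
    return h, grad
```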

To solve program (4), the constrained optimization problem is converted into a sequence of unconstrained problems by the augmented Lagrangian method [42]. The procedure is as follows.

W_{k}= \operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}+\frac{\rho_{k}}{2}|h(W)|^{2}+\alpha_{k}h(W), (5)
\alpha_{k+1}= \alpha_{k}+\rho_{k}h(W_{k}),
\rho_{k+1}= \beta\rho_{k}.

For each iteration, the first step can be solved by the L-BFGS proximal quasi-Newton method [43, 44] with derivative -\frac{1}{n}X^{T}(X-XW)+\lambda\mathbbm{1}_{d\times d}+\left(\rho|h(W)|+\alpha_{k}\right)\nabla h(W). To handle the non-smoothness of the penalty term at zero, i.e., at W_{ij}=0, i,j=1,\ldots,d, the weighted adjacency matrix is cast as the difference of two matrices (W^{1},W^{2}) with non-negative elements [45].
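A schematic of the dual-ascent outer loop in (5) is sketched below; solve_unconstrained is a hypothetical placeholder for the inner L-BFGS minimization with \alpha_{k} and \rho_{k} held fixed, and the stopping tolerance is our own choice:

```python
import numpy as np
from scipy.linalg import expm

def h_acyc(W):
    """Acyclicity violation h(W) = Tr(exp(W∘W)) - d."""
    return np.trace(expm(W * W)) - W.shape[0]

def augmented_lagrangian(X, lam, solve_unconstrained, beta=10.0,
                         rho0=1.0, h_tol=1e-8, max_iter=100):
    """Dual-ascent scheme of Eq. (5); `solve_unconstrained` stands in for the
    inner solve of the penalized least-squares objective."""
    d = X.shape[1]
    W, alpha, rho = np.zeros((d, d)), 0.0, rho0
    for _ in range(max_iter):
        W = solve_unconstrained(X, lam, rho, alpha, W)   # W_k update
        h_val = h_acyc(W)
        if h_val <= h_tol:                               # constraint satisfied
            break
        alpha = alpha + rho * h_val                      # alpha_{k+1} = alpha_k + rho_k h(W_k)
        rho = beta * rho                                 # rho_{k+1} = beta * rho_k
    return W
```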

Remark After obtaining the initial estimate, NOTEARS zeros out coefficients whose absolute values are less than a predefined threshold, in order to reduce the number of false-positive edges. However, this strategy leads to more missing edges with small weights and can hardly adapt to different scenarios. For example, if the weights of a DAG are generated from a uniform distribution over (0,2), then with w_{t}=0.3 the estimates of small weights located in the interval (0,0.3) have a high risk of being forced to 0 in the post-processing step. Since the simulation experiments in previous work commonly set a gap around zero (for instance, in the simulation part of [1], the edge weights are sampled uniformly from [-2\alpha,-0.5\alpha]\cup[0.5\alpha,2\alpha], where \alpha is a weight scale), this side effect is not obvious. The specific algorithm of NOTEARS with a fixed predefined threshold is presented in Algorithm 1.

Algorithm 1 NOTEARS with a fixed threshold.
  
  Step 0. Take \rho_{0}>0, initial value W_{0}\in\mathbb{R}^{d\times d}, \alpha_{0}\in\mathbb{R}, and some \xi\in(0,1). Set the threshold w_{t}=0.3 and \eta=10.
  Step k. (k\geq 1) Find the smallest non-negative integer j_{k} such that, with \rho=\eta^{j_{k}}\rho_{0},
h(W_{k+1})<\xi h(W_{k}),
where
W_{k+1}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}+\frac{\rho}{2}|h(W)|^{2}+\alpha_{k}h(W).
Compute \alpha_{k+1}=\alpha_{k}+\rho h(W_{k+1}), \rho_{k+1}=\beta\rho_{k}.
  
  Set \widehat{W}=W_{k+1} if converged.
  Final post-processing step. Set \widehat{W}\left[|\widehat{W}|<w_{t}\right]=0, then return \widehat{W}.

III The proposed method

In this section, we extend the NOTEARS method with the adaptive Lasso to learn sparse DAG structures from purely observational data, and name our method NOTEARS with adaptive Lasso (NOTEARS-AL). We first introduce the statistical tool used in our method, the adaptive lasso, in Section III-A. After detailing our method in Section III-B, we investigate the asymptotic behavior of NOTEARS-AL theoretically in Section III-C. We then develop an algorithm to implement this method in Section III-D and illustrate how we handle the model's hyper-parameters in Section III-E. Finally, we describe how to extend NOTEARS to generalized linear models in Section III-F.

III-A Adaptive Lasso

In recent decades, variable selection has been studied extensively. In particular, the Lasso [46] was introduced to improve prediction accuracy while simultaneously reducing overfitting. In the linear model, the lasso estimates are obtained by solving the following program:

\min_{\beta\in\mathbb{R}^{d}}\ \frac{1}{2n}||X-X\beta||_{2}^{2}+\lambda_{n}||\beta||_{1}, (6)

where \lambda_{n} depends on the sample size n. Although Lasso variable selection has been shown to be consistent under certain conditions [47], Zou [48] illustrated scenarios where the lasso is an inconsistent variable selection procedure, and argued that the setup of the lasso is somewhat unfair since it treats all coefficients with indistinctive penalty levels in the sparsity term. To fix this problem, Zou [48] proposed a new methodology called the adaptive lasso, in which the penalty weights are entirely data-driven and cleverly chosen. The adaptive lasso estimates are obtained by solving the following program:

\min_{\beta\in\mathbb{R}^{d}}\frac{1}{2n}||X-X\beta||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}c_{i}|\beta_{i}|, (7)

where c_{i} is a data-dependent penalty weight. For example, we can use the ordinary least squares estimate \widehat{\beta}_{ols} to define c_{i}: choose a \gamma>0 and set c_{i}=\frac{1}{|\widehat{\beta}_{ols,i}|^{\gamma}}, where \gamma is commonly taken to be 0.5, 1, or 2, or determined by cross-validation.
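A minimal sketch of how these penalty weights could be computed in practice (the small epsilon guard is our own addition, to avoid division by estimates that are exactly zero):

```python
import numpy as np

def adaptive_weights(X, y, gamma=1.0, eps=1e-8):
    """Penalty weights c_i = 1 / |beta_ols_i|^gamma from an OLS fit."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
    return 1.0 / (np.abs(beta_ols) + eps) ** gamma
```

In practice, the weighted program (7) can be handed to any standard lasso solver by rescaling the i-th column of the design matrix by 1/c_{i}, running an ordinary lasso, and dividing the resulting estimates by c_{i}.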

III-B The Overall Learning Objective

From Figure 1, we can see that when the penalty level is relatively low, the falsely learned edge cannot be zeroed out; on the other hand, assigning a larger weight to the regularization term may lead to even worse results. This implies that the Lasso puts much more penalty on large coefficients than on false-positive ones, which is not consistent with our requirement. Motivated by this observation, to improve criterion (4) for variable selection, one intuition is to apply adaptive penalty levels to different coefficients, as developed in [39]. Following this motivation, we modify Eq. (4) to

\min_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}\sum_{j=1}^{d}|c_{ij}w_{ij}|, (8)
subject to h(W)=0,

where c_{ij} is the penalty weight specified for W_{ij}. To shrink the small false-positive edges to zero while preserving the true edges, we would like the corresponding c_{ij} to be large for the former and small for the latter: small false edges are heavily penalized, and true positive edges are lightly penalized. More details on how to choose c_{ij} are given in Section III-C. With this modification of the score function, no predefined threshold is needed to post-process the initial estimate.

III-C Asymptotic Oracle Properties

In this section, we show that with a proper choice of the c_{ij} and under some mild conditions, our method possesses the oracle properties as the sample size n goes to infinity. Similarly to procedure (5), the equality-constrained program (ECP) (8) can be solved by a dual ascent process via the augmented Lagrangian method. The first step of each iteration can be regarded as finding a local minimum of the Lagrangian with a fixed Lagrange multiplier \alpha, that is, of

L_{n}(W)= \frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i,j}|c_{ij}w_{ij}|+\frac{\rho}{2}|h(W)|^{2}+\alpha h(W). (9)

Let W^{*} denote the underlying true adjacency matrix. We then define

\mathcal{A}=\{(i,j):w^{*}_{ij}\neq 0\},
\mathcal{A}_{c}=\{(i,j):w^{*}_{ij}=0\},

which means that \mathcal{A} collects the indices of the nonzero true parameters, and \mathcal{A}_{c} contains the indices of the terms that do not exist in the underlying true model. We first state two conditions:

  1.

    X_{i}=X_{i}W^{*}+N_{i}, i=1,2,\ldots,n, where N_{1},N_{2},\ldots,N_{n} are independent and identically distributed with mean \mathbb{0} and variance \sigma^{2}<\infty.

  2.

    \frac{1}{n}X^{T}X is finite and converges to a positive definite matrix D\in\mathbb{R}^{d\times d}.

Under these two conditions, the oracle properties of NOTEARS-AL are described in the following theorems. Rigorous proofs are given in Appendix VI.

Theorem 1 (Local Minimizer) Under the above two conditions, let a_{n}=\lambda_{n}c_{\mathcal{A}}, where c_{\mathcal{A}} refers to the c_{ij}'s associated with the nonzero coefficients of W^{*}. If a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2}, then there exists a local minimizer of L_{n}(W) that is \sqrt{n}-consistent for W^{*}. Let \widehat{W}_{n} denote the local minimizer in Theorem 1, which satisfies ||\widehat{W}_{n}-W^{*}||=O_{p}(n^{-\frac{1}{2}}). Consistency is a nice start, and it motivates us to study further asymptotic properties as the sample size n goes to infinity. Before introducing the oracle properties, we need additional notation, because we now analyze the estimators of the nonzero parameters in the underlying true model. We define

\mathcal{A}_{i}=\{(i,j):(i,j)\in\mathcal{A}\},
d_{i}=|\mathcal{A}_{i}|.

In condition 2), we already assume \frac{1}{n}X^{T}X\to D as n\to\infty. Furthermore, without loss of generality, we assume \mathcal{A}_{i}=\{(i,1),(i,2),\ldots,(i,d_{i})\}. Let

D=\left(\begin{matrix}D_{i0}&D_{i1}\\ D_{i2}&D_{i3}\end{matrix}\right),

where D_{i0} is a d_{i}\times d_{i} matrix corresponding to the coefficients with indices in \mathcal{A}_{i}.

Theorem 2 (Oracle Properties) Under the two conditions, let b_{n}=\lambda_{n}c_{\mathcal{A}_{c}}, where c_{\mathcal{A}_{c}} refers to the c_{ij}'s associated with the zero coefficients of W^{*}. If \sqrt{n}b_{n}\rightarrow\infty as n\rightarrow\infty, then the local minimizer \widehat{W}_{n} in Theorem 1 must satisfy the following:

  • (Sparsity) \lim\limits_{n\to\infty}P(\widehat{W}_{\mathcal{A}_{c}}=0)=1

  • (Asymptotic normality) \forall i\in\{1,2,\ldots,d\}, \sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-{W^{*}}_{\mathcal{A}_{i}})\xrightarrow{d}N(0,\sigma^{2}D_{i0}^{-1}) as n\rightarrow\infty.

Firstly, for the zero-coefficient part, Theorem 2 shows that given a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2} and \sqrt{n}b_{n}\rightarrow\infty, NOTEARS-AL consistently removes all irrelevant variables with probability tending to 1. Secondly, after magnifying the difference by \sqrt{n} for the nonzero estimators, the limiting distribution turns out to be normal. Only one question remains: how to determine the c_{ij}'s such that the nonzero-related part satisfies a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2} and the zero-related part satisfies \sqrt{n}b_{n}\rightarrow\infty. Using the fact that the ordinary least squares (OLS) estimate \widehat{W}_{ols} is a \sqrt{n}-consistent estimate of W^{*} (the proof follows that of Theorem 1), and supposing that \lambda_{n}\sqrt{n}\rightarrow 0 and \lambda_{n}n^{\frac{\gamma+1}{2}}\rightarrow\infty, we can define the c_{ij}'s from \widehat{W}_{ols} with some specific power \gamma so that the above two requirements are met. Specific details on determining \lambda_{n} are given in Section III-E.

\widehat{W}_{ols}= \operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}\sum_{j=1}^{d}|c_{ij}w_{ij}|, (10)
subject to h(W)=0,\ c_{ij}=\frac{1}{|\widehat{W}_{ols,ij}|^{\gamma}}.

III-D NOTEARS with Adaptive Lasso

The algorithm can be divided into two parts. The first part finds the solution \widehat{W}_{ols} of Eq. (10); the second part plugs the adaptive penalty parameters into Eq. (9) and solves it. These equality-constrained programs (ECPs) are then solved by the augmented Lagrangian method, i.e., using the Lagrangian method to handle the hard DAG constraint h(W)=0. The first part follows the same procedure as the original NOTEARS without the thresholding step. We then obtain a matrix C\in\mathbb{R}^{d\times d} with elements c_{ij}=\frac{1}{|\widehat{W}_{ols,ij}|^{\gamma}}. Let W_{C}\vcentcolon=C\circ W, where \circ is the Hadamard product. Thus, W_{ij}=W_{C,ij}/c_{ij}, or in matrix notation W=W_{C}\varoslash C, where \varoslash refers to Hadamard division. Then, ECP (8) can be transformed into

\widehat{W_{C}}= \operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1} (11)
subject to h(W_{C}\varoslash C)=0,
\widehat{W}_{als}= \widehat{W_{C}}\varoslash C,

where \widehat{W}_{als} is the final estimate of W. Then, similarly to (5), the above ECP (11) can be solved by a dual ascent process via the augmented Lagrangian method.

W_{Ck}= \operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1}+\frac{\rho}{2}|h(W_{C}\varoslash C)|^{2}+\alpha_{k}h(W_{C}\varoslash C), (12)
\alpha_{k+1}= \alpha_{k}+\rho h(W_{C}\varoslash C).

To complete the above optimization, we need a new form of the gradient for the first step in Eq. (12). Here we again reformulate W as the difference of two matrices with non-negative elements. For the loss term:

\nabla\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}=-\frac{1}{n}X^{T}(X-XW_{C}\varoslash C)\varoslash C.

For the acyclicity term h:

\nabla h(W_{C}\varoslash C)=2\left(e^{(W_{C}\circ W_{C})\varoslash(C\circ C)}\right)^{T}\circ W_{C}\varoslash(C\circ C).
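The two gradients above might be sketched in NumPy as follows (a sketch under the W=W_{C}\varoslash C reparameterization, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import expm

def grads_reparam(X, W_C, C):
    """Gradients of the loss and acyclicity terms with respect to W_C,
    where W = W_C / C elementwise (Hadamard division)."""
    n = X.shape[0]
    W = W_C / C                         # W = W_C ⊘ C
    R = X - X @ W                       # residuals
    grad_loss = -(X.T @ R) / n / C      # ∇_{W_C} (1/2n)||X - XW||_2^2
    E = expm(W * W)                     # e^{W∘W}
    grad_h = 2.0 * E.T * W / C          # (e^{W∘W})^T ∘ 2W, then ⊘ C
    return grad_loss, grad_h
```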

To sum up, the algorithm proceeds as Algorithm 2.

Algorithm 2 NOTEARS with the adaptive lasso.
  
  Step 0. Take \rho_{0}>0, initial value W_{0}\in\mathbb{R}^{d\times d}, \alpha_{0}\in\mathbb{R}, and some \xi\in(0,1). Set \eta=10.
  OLS loop k. (k\geq 1) Using the complete data set X, find the smallest non-negative integer j_{k} such that, with \rho=\eta^{j_{k}}\rho_{0},
h(W_{k+1})<\xi h(W_{k}),
where
W_{k+1}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\frac{\rho}{2}|h(W)|^{2}+\alpha_{k}h(W).
Compute \alpha_{k+1}=\alpha_{k}+\rho h(W_{k+1}).
  
  Set \widehat{W}_{ols}=W_{k+1} if converged. Then, define C\vcentcolon=1\varoslash|\widehat{W}_{ols}|^{\gamma}, W_{C}\vcentcolon=C\circ W.
  
  Adaptive lasso loop t. (t\geq 1) Using the complete data set X, for each \lambda_{n}, find the smallest non-negative integer i_{t} such that, with \rho=\eta^{i_{t}}\rho_{0},
h(W_{C,t+1}\varoslash C)<\xi h(W_{C,t}\varoslash C),
where
W_{C,t+1}=\operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1}+\frac{\rho}{2}|h(W_{C}\varoslash C)|^{2}+\alpha_{t}h(W_{C}\varoslash C).
Compute \alpha_{t+1}=\alpha_{t}+\rho h(W_{C,t+1}\varoslash C).
  Set \widehat{W}_{C,n}=W_{C,t+1} if converged.
  Return the adaptive estimate \widehat{W}_{n}\vcentcolon=\widehat{W}_{C,n}\varoslash C.

III-E Hyper-parameter

Choosing the tuning parameter \lambda_{n} is an important issue. We found that the results of NOTEARS-based methods are sensitive to the value of the hyper-parameter, so it is essential to choose a good one. Our goal is to choose the optimal model s_{n}, which contains as many of the true variables as possible. Following Zou [48], we utilize cross-validation to find an optimal \lambda_{n}. From NOTEARS-AL, we obtain a set of candidate models S=\{s_{n},n=1,2,\ldots,N\}, where s_{n}=\{(i,j)\in(1:d)\times(1:d):\widehat{W}_{n,ij}\neq 0\}. The cross-validation procedure we use is shown in Algorithm 3.

Algorithm 3 The Cross-validation method.
  
  Step 0. Divide the data into a validation set X_{v} and a training set X_{t}, with |X_{v}|=n_{v}, |X_{t}|=n_{t}, and n_{v}+n_{t}=n.
  For model s_{n} (n=1,2,\ldots,N). Using the training set X_{t}, compute the solution \widetilde{W}_{n}^{t}, where
\widetilde{W}_{n}^{t}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d},W_{(-s_{n})}=0}\frac{1}{2n_{t}}||X_{t}-X_{t}W||_{2}^{2}.
  
  Evaluate the prediction performance of \widetilde{W}_{n}^{t} on the validation set X_{v} by the loss \frac{1}{2n_{v}}||X_{v}-X_{v}\widetilde{W}_{n}^{t}||_{2}^{2}.

Another key modification is that we need a larger validation set to find the best model and achieve the restricted model-selection consistency, especially when the candidate model set is large [49]. Motivated by this, instead of using the traditional K-fold cross-validation method, whose validation set contains only 1/K of the data, we suggest swapping the proportions of the validation and training sets.
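A minimal sketch of this model-selection loop is given below; fit_restricted is a hypothetical helper that refits W on the training split with the zero pattern of a candidate model s_{n} enforced, and the 80/20 split only illustrates the swapped proportions:

```python
import numpy as np

def select_model(X, candidate_supports, fit_restricted, val_frac=0.8, seed=0):
    """Pick the candidate support (model s_n) with the smallest validation loss.
    `candidate_supports` is a list of boolean d-by-d masks; `fit_restricted`
    refits W on the training split with entries outside the mask fixed to 0."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))                  # large validation set
    X_val, X_train = X[idx[:n_val]], X[idx[n_val:]]
    losses = []
    for mask in candidate_supports:
        W = fit_restricted(X_train, mask)           # W_(-s_n) = 0 enforced
        losses.append(0.5 * np.mean((X_val - X_val @ W) ** 2))
    return int(np.argmin(losses)), losses
```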

III-F Extension to Generalized Linear Models

Zheng et al.[27] also extend the idea of NOTEARS to generalized linear models (GLMs). Traditionally, a GLM is formulated as follows:

\mathbb{E}[X_{i}|Pa(X_{i})]=g_{i}(d_{i}^{T}X), (13)

where g_{i} is a link function used in GLMs, and d_{i}\in\mathbb{R}^{d}. To extend the DAG constraint beyond the linear setting, the key is to redefine the weighted adjacency matrix W in a sensible way, since W has no explicit form in the GLM setting. Based on the idea in [50], which encodes the importance of each variable via the corresponding partial derivatives, it is easy to show that the dependence among variables can be precisely measured by the L^{2}-norm of the corresponding partial derivatives [27]. This implies that the weighted adjacency matrix W in GLMs can be defined as follows:

W_{ki}=||\frac{\partial f_{i}}{\partial x_{k}}||_{L^{2}}=d_{ik}^{2}. (14)

Thus, for linear mean functions, i.e., f_{i}(X)=d_{i}^{T}X, W_{ki}=0 is equivalent to d_{ik}=0. We can simplify the DAG constraint h(W)=0 by plugging in W=({d_{1}}^{T},{d_{2}}^{T},\ldots,{d_{d}}^{T}). To construct the score function for GLMs, we need a more suitable loss function. In this paper, we focus on one specific GLM, the logistic regression model, where X_{i}\in\{0,1\}, i=1,\ldots,d. For logistic regression, the least-squares loss is no longer appropriate since it results in a non-convex objective. Here we adopt the log loss, which is commonly used as the loss function for logistic regression, defined as follows:

l=\frac{1}{n}\sum_{i=1}^{n}\left[-x_{i}\log\left(g(t_{i})\right)-(1-x_{i})\log\left(1-g(t_{i})\right)\right], (15)

where t_{i}=x_{i}\times({d_{1}}^{T},{d_{2}}^{T},\ldots,{d_{d}}^{T}).
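For reference, a vectorized sketch of the log loss in Eq. (15) for the linear-logit model (we assume, for illustration, that column j of the matrix D below holds the coefficient vector d_{j}):

```python
import numpy as np

def logistic_log_loss(X, D, eps=1e-12):
    """Average log loss of Eq. (15) for binary data X under logits T = X @ D."""
    T = X @ D                            # logits for every sample and variable
    P = 1.0 / (1.0 + np.exp(-T))         # link g(t) = e^t / (e^t + 1)
    P = np.clip(P, eps, 1.0 - eps)       # numerical safety
    return np.mean(-X * np.log(P) - (1.0 - X) * np.log(1.0 - P))
```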

IV Experiments

We conduct experiments with a large sample size to verify the asymptotic oracle properties (Section III-C). To validate the effectiveness of our proposed method, we compare it against NOTEARS with a fixed threshold, greedy equivalence search (GES) [19, 20], and the PC algorithm [4]. For NOTEARS, we set the threshold to 0.1 and the penalty level to \lambda=0.1 in our experiments. This setting not only guarantees a DAG output but also maintains relatively good performance.

The basic setup of our experiments is as follows. For each experiment, the ground-truth DAG is generated from the Erdös-Rényi (ER) random graph model. We use ER2 to denote an ER graph with s_{0}=2d edges, and similarly for ER1 and ER4. Given the ground-truth DAG, we generate the weights of the existing edges from 1) a normal distribution with mean 0 and variance \sigma^{2}; or 2) a uniform distribution over the interval (-c,c), where c\in\mathbb{R}. We then simulate the sample X via X_{i}=f_{i}(X_{pa(i)})+z_{i}, i=1,\ldots,d, with noise z_{i}\sim N(0,1). Here, we mainly consider the linear case and extend to one type of GLM, the logistic regression model. The metrics we use to evaluate the estimated DAGs are the structural Hamming distance (SHD), true positive rate (TPR), and false discovery rate (FDR).
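The data-generating process described above can be sketched as follows (a simplified sketch for the linear Gaussian-weight case; function and argument names are our own):

```python
import numpy as np

def simulate_linear_sem(d=10, k=2, n=1000, weight_std=2.0, seed=0):
    """Sample an ER-style DAG with about k*d expected edges, Gaussian edge
    weights, and n observations from the linear SEM X = XW + N with
    standard normal noise."""
    rng = np.random.default_rng(seed)
    p_edge = min(1.0, 2.0 * k / (d - 1))
    # A strictly upper-triangular support guarantees acyclicity.
    mask = np.triu(rng.random((d, d)) < p_edge, k=1)
    W = np.where(mask, rng.normal(0.0, weight_std, size=(d, d)), 0.0)
    X = np.zeros((n, d))
    for j in range(d):                   # columns are already in topological order
        X[:, j] = X @ W[:, j] + rng.normal(size=n)
    perm = rng.permutation(d)            # shuffle variable labels
    return W[np.ix_(perm, perm)], X[:, perm]
```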

IV-A Numerical Results: Gaussian Weights

Compared to the original weight-generating interval of NOTEARS, Uniform([-2.0,-0.5]\cup[0.5,2.0]), there are two aspects we aim to improve. The first concerns the upper and lower bounds, which are set to \pm 2\alpha: we expect our method to extend NOTEARS to a wider weight range with better estimates. The second concerns coefficients near 0, which cannot be recovered under the fixed-thresholding post-processing step. To examine both points, we first perform a structure learning study in the Gaussian-weight case, where the weights of existing edges are sampled from a normal distribution with mean 0 and variance 2^{2}. In this way, we can investigate the performance of NOTEARS-AL over a wider range of coefficients.

In our experiments, we generate {ER1, ER2, ER4} graphs with different numbers of nodes d=\{10,20,50\}, 15 graphs for each type, and then simulate n=1000 samples with iid Gaussian noise. The results are shown in Figure 3. For better visualization, only the average values of the metrics are reported. Consistent with previous work, GES shows a significant advantage in the ER1 case under every metric, but its performance deteriorates rapidly as the graphs become denser. Moreover, as seen in Figure 3, for ER4 with 50 nodes, GES takes too long to produce results because of the large search space and low search efficiency. Not surprisingly, the performances of NOTEARS and NOTEARS-AL are positively related. This is natural since NOTEARS-AL extracts the adaptive penalty weights from conventional NOTEARS, so its performance is also sensitive to the results of the conventional method. Nonetheless, our approach performs uniformly better than NOTEARS across different graphs in terms of all metrics.
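For reference, the three metrics reported in Figure 3 can be computed from binary adjacency matrices as in the following simplified sketch (a reversed edge is counted once in the SHD):

```python
import numpy as np

def shd_tpr_fdr(B_est, B_true):
    """Structural Hamming distance, true positive rate, and false discovery
    rate between binary adjacency matrices."""
    B_est, B_true = B_est.astype(bool), B_true.astype(bool)
    diff = B_est ^ B_true
    shd = int(diff.sum() - (diff & diff.T).sum() // 2)   # merge reversal pairs
    tp = (B_est & B_true).sum()
    fp = (B_est & ~B_true).sum()
    tpr = tp / max(B_true.sum(), 1)
    fdr = fp / max(B_est.sum(), 1)
    return shd, tpr, fdr
```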

Figure 3: Numerical results in terms of SHD (\downarrow), TPR (\uparrow), and FDR (\downarrow) on ER graphs, with sample size n=1000 and Gaussian noise. For TPR, higher is better; for the other criteria, lower is better. Columns: random graph types, where ERk denotes ER graphs with k\times d expected edges.

Figure 4 and Figure 5 show structure recovery results for SEMs with exponential noise and Gumbel noise, respectively, where the weights of existing edges are sampled from a normal distribution with mean 0 and variance 2^{2}.

Figure 4: Numerical results in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000 and exponential noise.
Figure 5: Numerical results in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000 and Gumbel noise.

IV-B Coefficients in the Neighborhood of 0

As mentioned, one side effect of the fixed-threshold strategy is that small coefficients in the true model may be deleted indiscriminately. Since the simulation experiments in previous work commonly set a gap around zero, namely \alpha[-2,-0.5]\cup\alpha[0.5,2], this side effect is not obvious. Hence, we investigate the weights of missing edges when the weights are generated from ranges that cover zero, e.g., Uniform(-5,5), Uniform(-10,10), Gaussian(0,2^{2}), and Gaussian(0,3^{2}). For each distribution, we generated 200 DAG structures from the ER1 model with 10 nodes and then simulated 1000 samples. Given the different methodologies of the two methods, one with a hard threshold of 0.1 and the other with adaptive penalty levels, our method is more flexible and better suited to handling coefficients around zero. Indeed, Figure 6 confirms this intuition. We plot the histogram of missing edges. In all cases, our method loses fewer edges than NOTEARS with hard thresholding, especially in the region around zero. Moreover, as the range widens or the variance increases, the gap between the two methods becomes more and more significant.

Figure 6: Histogram of missing edges. Blue bars refer to NOTEARS with a hard threshold of 0.1; yellow bars refer to NOTEARS-AL. In the first row, the coefficients of the ground-truth graph are generated from the uniform distributions (-5,5) and (-10,10), from left to right. In the second row, the true coefficients are drawn from normal distributions with variances 2^{2} and 3^{2}, respectively.

IV-C High-dimension Cases

To investigate the performance of our method for estimating DAGs from high-dimensional graphical models, we increase the number of nodes to 100 and compare the results with those of fixed-threshold NOTEARS. To illustrate the validity of the evaluation criteria, we draw error bars with sample standard deviations. The results are shown in Figure 7. Under nearly the same SHD, the true positive rate of NOTEARS-AL is significantly higher than that of NOTEARS with a fixed threshold. Besides, in terms of FDR (lower is better), NOTEARS-AL has lower values for all graphs. In general, NOTEARS-AL consistently outperforms the original fixed-threshold method, even in high-dimensional settings.

Figure 7: Numerical results in terms of SHD, TPR, and FDR on ER2 graphs with Gaussian (0,2^{2}) weights, sample size n=1000, and iid Gaussian noise.

IV-D Generalized Linear Models

For GLMs, we concentrate on logistic regression for dichotomous data X_{i}\in\{0,1\}, i=1,\ldots,d, with link function g(t)=e^{t}/(e^{t}+1). As introduced in Section III-F, for linear mean functions, the elements of W in the DAG constraint h(W)=0 can be substituted by the d_{ik}'s, the coefficients of the linear mean functions. We compare the performance of fixed-threshold NOTEARS with that of NOTEARS-AL, with edge weights drawn from Gaussian (0,2^{2}).

Figure 8: Numerical results for logistic regression in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000. For TPR, higher is better; for the other two metrics, lower is better. Columns: random graph types, where ERk denotes graphs with k\times d expected edges.

Figure 8 shows the structure recovery results for n=1000. By comparison, the TPR of NOTEARS-AL is significantly higher than that of NOTEARS with a fixed threshold, while in terms of the other two metrics, SHD and FDR, both methods remain at a similar level. In general, NOTEARS-AL consistently outperforms the original NOTEARS.

IV-E Real Data

Finally, consistent with Zheng et al. [1], we compare our method with NOTEARS on the same real dataset, which records the expression levels of 11 phosphoproteins and phospholipids in human cells. This dataset is commonly used in the literature on Bayesian network structure inference, as it provides a network that is widely accepted by the biological community. Based on d=11 variables and n=911 observational samples, the ground-truth structure given in [5] contains 17 edges. In our experiments, NOTEARS with a 0.1 threshold estimates 22 edges with an SHD of 20, compared to 20 estimated edges for NOTEARS-AL with an SHD of 19.

V Conclusion

Based on the methodology of NOTEARS, we present a method for score-based learning of DAG models that does not need a hard-thresholding post-processing step. The key technical design is to use an adaptive penalty level for each edge rather than a common one. Notably, we prove that our method enjoys the oracle properties when the adaptive penalties satisfy some mild conditions. With a suitable choice of the hyper-parameter in the second step, handled by a modified K-fold cross-validation, our method can zero out edges automatically in each iteration. Our simulation experiments show the effectiveness of our method, and it is worth emphasizing that the adaptive strategy is generally more flexible than the hard-threshold strategy since it is purely data-driven, which enables our method to adapt to different types of graphs.

References

  • [1] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing, “Dags with no tears: Continuous optimization for structure learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [2] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference.   Morgan kaufmann, 1988.
  • [3] D. Heckerman, “A tutorial on learning with bayesian networks,” Innovations in Bayesian networks, pp. 33–82, 2008.
  • [4] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, prediction, and search.   MIT press, 2000.
  • [5] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
  • [6] P. J. Lucas, L. C. Van der Gaag, and A. Abu-Hanna, “Bayesian networks in biomedicine and health-care,” Artificial intelligence in medicine, vol. 30, no. 3, pp. 201–214, 2004.
  • [7] R. Ganian and V. Korchemna, “The complexity of bayesian network learning: Revisiting the superstructure,” Advances in Neural Information Processing Systems, vol. 34, pp. 430–442, 2021.
  • [8] M. Chickering, D. Heckerman, and C. Meek, “Large-sample learning of bayesian networks is np-hard,” Journal of Machine Learning Research, vol. 5, pp. 1287–1330, 2004.
  • [9] D. M. Chickering, “Learning bayesian networks is np-complete,” in Learning from data.   Springer, 1996, pp. 121–130.
  • [10] P. Spirtes, C. Glymour, and R. Scheines, “Causality from probability,” 1989.
  • [11] P. Spirtes and C. Glymour, “An algorithm for fast recovery of sparse causal graphs,” Social science computer review, vol. 9, no. 1, pp. 62–72, 1991.
  • [12] D. Margaritis and S. Thrun, “Bayesian network induction via local neighborhoods,” Advances in neural information processing systems, vol. 12, 1999.
  • [13] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms.   The MIT Press, 2017.
  • [14] D. Geiger and D. Heckerman, “Learning gaussian networks,” in Uncertainty Proceedings 1994.   Elsevier, 1994, pp. 235–243.
  • [15] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning bayesian networks: The combination of knowledge and statistical data,” Machine learning, vol. 20, no. 3, pp. 197–243, 1995.
  • [16] R. R. Bouckaert, “Probabilistic network construction using the minimum description length principle,” in European conference on symbolic and quantitative approaches to reasoning and uncertainty.   Springer, 1993, pp. 41–48.
  • [17] M. Teyssier and D. Koller, “Ordering-based search: A simple and effective algorithm for learning bayesian networks,” arXiv preprint arXiv:1207.1429, 2012.
  • [18] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing bayesian network structure learning algorithm,” Machine Learning, vol. 65, no. 1, pp. 31–78, 2006.
  • [19] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of machine learning research, vol. 3, no. Nov, pp. 507–554, 2002.
  • [20] J. Ramsey, M. Glymour, R. Sanchez-Romero, and C. Glymour, “A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images,” International journal of data science and analytics, vol. 3, no. 2, pp. 121–129, 2017.
  • [21] I. Ng, Y. Zheng, J. Zhang, and K. Zhang, “Reliable causal discovery with improved exact search and weaker assumptions,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [22] D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag, “Structural agnostic modeling: Adversarial learning of causal graphs,” arXiv preprint arXiv:1803.04929, 2018.
  • [23] I. Ng, S. Zhu, Z. Fang, H. Li, Z. Chen, and J. Wang, “Masked gradient-based causal structure learning,” in SIAM International Conference on Data Mining, 2022, pp. 424–432.
  • [24] Y. Yu, J. Chen, T. Gao, and M. Yu, “DAG-GNN: DAG structure learning with graph neural networks,” in International Conference on Machine Learning, 2019.
  • [25] S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien, “Gradient-based neural dag learning,” in International Conference on Learning Representations, 2020.
  • [26] S. Zhu, I. Ng, and Z. Chen, “Causal discovery with reinforcement learning,” in International Conference on Learning Representations, 2020.
  • [27] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing, “Learning sparse nonparametric dags,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 3414–3425.
  • [28] R. Pamfil, N. Sriwattanaworachai, S. Desai, P. Pilgerstorfer, K. Georgatzis, P. Beaumont, and B. Aragam, “Dynotears: Structure learning from time-series data,” in International Conference on Artificial Intelligence and Statistics, 2020.
  • [29] R. Bhattacharya, T. Nagarajan, D. Malinsky, and I. Shpitser, “Differentiable causal discovery under unmeasured confounding,” in International Conference on Artificial Intelligence and Statistics, 2021.
  • [30] P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin, “Differentiable causal discovery from interventional data,” in Advances in Neural Information Processing Systems, 2020.
  • [31] I. Ng and K. Zhang, “Towards federated bayesian network structure learning with continuous optimization,” in International Conference on Artificial Intelligence and Statistics, 2022.
  • [32] E. Gao, J. Chen, L. Shen, T. Liu, M. Gong, and H. Bondell, “FedDAG: Federated DAG structure learning,” Transactions on Machine Learning Research, 2022.
  • [33] T. Geffner, J. Antoran, A. Foster, W. Gong, C. Ma, E. Kiciman, A. Sharma, A. Lamb, M. Kukla, N. Pawlowski et al., “Deep end-to-end causal inference,” arXiv preprint arXiv:2202.02195, 2022.
  • [34] E. Gao, I. Ng, M. Gong, L. Shen, W. Huang, T. Liu, K. Zhang, and H. Bondell, “MissDAG: Causal discovery in the presence of missing data with continuous additive noise models,” in Advances in Neural Information Processing Systems, 2022.
  • [35] I. Ng, A. Ghassami, and K. Zhang, “On the role of sparsity and dag constraints for learning linear dags,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 943–17 954, 2020.
  • [36] D. Wei, T. Gao, and Y. Yu, “Dags with no fears: A closer look at continuous optimization for learning bayesian networks,” in Advances in Neural Information Processing Systems, 2020.
  • [37] Z. Zhang, I. Ng, D. Gong, Y. Liu, E. M. Abbasnejad, M. Gong, K. Zhang, and J. Q. Shi, “Truncated matrix power iteration for differentiable DAG learning,” in Advances in Neural Information Processing Systems, 2022.
  • [38] I. Ng, S. Lachapelle, N. R. Ke, S. Lacoste-Julien, and K. Zhang, “On the convergence of continuous constrained optimization for structure learning,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 8176–8198.
  • [39] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American statistical association, vol. 101, no. 476, pp. 1418–1429, 2006.
  • [40] P. Spirtes, C. Glymour, R. Scheines et al., Causation, Prediction, and Search.   The MIT Press, 2001, vol. 1.
  • [41] F. Harary and B. Manvel, “On the number of cycles in a graph,” Matematickỳ časopis, vol. 21, no. 1, pp. 55–63, 1971.
  • [42] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
  • [43] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on scientific computing, vol. 16, no. 5, pp. 1190–1208, 1995.
  • [44] K. Zhong, I. E.-H. Yen, I. S. Dhillon, and P. K. Ravikumar, “Proximal quasi-newton for computationally intensive l1-regularized m-estimators,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [45] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. P. Xing, “Learning sparse nonparametric dags,” 2019. [Online]. Available: https://arxiv.org/abs/1909.13189
  • [46] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [47] N. Meinshausen and P. Bühlmann, “Variable selection and high-dimensional graphs with the lasso,” Ann Stat, vol. 34, pp. 1436–1462, 2006.
  • [48] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006. [Online]. Available: http://www.jstor.org/stable/27639762
  • [49] Y. Feng and Y. Yu, “The restricted consistency property of leave-n_{v}-out cross-validation for high-dimensional variable selection,” 2013. [Online]. Available: https://arxiv.org/abs/1308.5390
  • [50] L. A. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri, “Nonparametric sparsity and regularization,” 2013.
  • [51] N. H. Choi, W. Li, and J. Zhu, “Variable selection with the strong heredity constraint and its oracle property,” Journal of the American Statistical Association, vol. 105, no. 489, pp. 354–364, 2010.
  • [52] A. W. Van der Vaart, Asymptotic statistics.   Cambridge university press, 2000, vol. 3.

VI Proof of Theorems

VI-A Proof of Theorem 1

Let \eta_{n}=n^{p}, p<0, \bm{\delta}=\begin{pmatrix}\delta_{1,1}&\delta_{1,2}&\cdots&\delta_{1,d}\\ \delta_{2,1}&\delta_{2,2}&\cdots&\delta_{2,d}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{d,1}&\delta_{d,2}&\cdots&\delta_{d,d}\end{pmatrix}, and ||\bm{\delta}||\leq d, where d is a constant.

D_{n}(\bm{\delta})= L_{n}(W^{*}+\eta_{n}\bm{\delta})-L_{n}(W^{*})
= l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})+
\lambda_{n}\sum_{i}\sum_{j}c_{ij}\left\{|W^{*}_{ij}+\eta_{n}\delta_{ij}|-|W^{*}_{ij}|\right\}+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\},

where l_{n}(W)=\frac{1}{2n}||X-XW||_{2}^{2}.

Let a_{n,ij}=\lambda_{n}c_{ij}, (i,j)\in\mathcal{A}; we can derive

D_{n}(\bm{\delta})\geq l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})+
\sum_{(i,j)\in\mathcal{A}}a_{n,ij}\left\{|W^{*}_{ij}+\eta_{n}\delta_{ij}|-|W^{*}_{ij}|\right\}+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
\geq l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})-
\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta})+
\frac{1}{2}(\eta_{n}\bm{\delta})^{T}[\nabla^{2}l_{n}(W^{*})](\eta_{n}\bm{\delta})(1+o_{p}(1))-
\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
\triangleq \mathrm{I}+\mathrm{II}+\mathrm{III}+\mathrm{IV},

where we define

\mathrm{I}= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta}),
\mathrm{II}= \frac{1}{2}(\eta_{n}\bm{\delta})^{T}[\nabla^{2}l_{n}(W^{*})](\eta_{n}\bm{\delta})(1+o_{p}(1)),
\mathrm{III}= -\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|,
\mathrm{IV}= \frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}.

For \mathrm{I},

\mathrm{I}= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta})
= \eta_{n}\left(\nabla l_{n}(W^{*})^{T}\right)\bm{\delta}
= -\eta_{n}\left(\frac{1}{n}(X-XW^{*})^{T}X^{T}\right)\bm{\delta}
= -\frac{1}{\sqrt{n}}\eta_{n}\left(\sqrt{n}\frac{1}{n}(X-XW^{*})^{T}X^{T}\right)\bm{\delta}
= -O_{p}\left(\frac{1}{\sqrt{n}}\eta_{n}\right)\bm{\delta}.

For \mathrm{II},

\mathrm{II}=\frac{1}{2}\eta_{n}^{2}\left\{\bm{\delta}^{T}\left[\frac{1}{n}X^{T}X\right]\bm{\delta}\right\}(1+o_{P}(1))>0.

For \mathrm{III},

\mathrm{III}= -\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|
\geq -\eta_{n}An^{q}d,

where A=\max\{a_{n,ij}/n^{q}:(i,j)\in\mathcal{A}\}, and A\in O(1).

For \mathrm{IV}, since h(W)\geq 0 for all W\in\mathbb{R}^{d\times d}, we can derive

\mathrm{IV}=\left[h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right]\left\{\frac{\rho}{2}\left[h(W^{*}+\eta_{n}\bm{\delta})+h(W^{*})\right]+\alpha\right\}

\Rightarrow\qquad\mathrm{IV}\geq 0.

Thus,

D_{n}(\bm{\delta})\geq -O_{p}\left(\frac{1}{\sqrt{n}}\eta_{n}\right)\bm{\delta}+\frac{1}{2}\eta_{n}^{2}\left\{\bm{\delta}^{T}\left[\frac{1}{n}X^{T}X\right]\bm{\delta}\right\}(1+o_{p}(1))-\eta_{n}An^{q}d+\mathrm{IV}.

We find that when p>-\frac{1}{2} and q\leq-\frac{1}{2}, \mathrm{II} dominates \mathrm{I} and \mathrm{III}; hence D_{n}(\bm{\delta})>0.

Thus, we can conclude that there exists a local minimizer of L_{n}(W) with the optimal convergence rate ||\widehat{W}_{n}-W^{*}||=O_{p}(n^{-\frac{1}{2}}) [51].

VI-B Proof of Theorem 2

Firstly, for the sparsity property, it is sufficient to prove that \forall(i,j)\in\mathcal{A}_{c},

\begin{cases}\frac{\partial L_{n}(W)}{\partial W_{ij}}\big|_{W=\widehat{W}}<0&\text{ if }-\varepsilon_{n}<\widehat{W}_{ij}<0\\ \frac{\partial L_{n}(W)}{\partial W_{ij}}\big|_{W=\widehat{W}}>0&\text{ if }0<\widehat{W}_{ij}<\varepsilon_{n}\end{cases} (16)

with probability tending to 1, where \varepsilon_{n}=O(n^{-\frac{1}{2}}).

Then,

\frac{\partial L_{n}(\widehat{W})}{\partial W_{ij}}= -\frac{1}{n}X_{.i}^{T}(X_{.j}-XW_{.j}^{*})+\sum_{k=1}^{d}\frac{1}{n}X_{.i}^{T}X_{kj}(\widehat{W}_{ij}-W^{*}_{kj})+\lambda_{n}c_{ij}\,\mathrm{sgn}\{\widehat{W}_{ij}\}+2\widehat{W}_{ij}(\rho h(\widehat{W})+\alpha)(e^{\widehat{W}\circ\widehat{W}})^{T}_{ij},

where we can find

\frac{1}{n}X_{.i}^{T}(X_{.j}-XW_{.j}^{*})=O_{p}(n^{-\frac{1}{2}}),
\sum_{k=1}^{d}\frac{1}{n}X_{.i}^{T}X_{kj}(\widehat{W}_{ij}-W^{*}_{kj})=O_{p}(n^{-\frac{1}{2}}),
2\widehat{W}_{ij}(\rho h(\widehat{W})+\alpha)(e^{\widehat{W}\circ\widehat{W}})^{T}_{ij}\begin{cases}<0&\text{ if }-\varepsilon_{n}<\widehat{W}_{ij}<0\\ >0&\text{ if }0<\widehat{W}_{ij}<\varepsilon_{n}\end{cases}

Let b_{n,ij}=\lambda_{n}c_{ij}. Therefore, for \forall(i,j)\in\mathcal{A}_{c}, if \sqrt{n}b_{n,ij}\rightarrow\infty, then the sign of \frac{\partial L_{n}(\widehat{W})}{\partial W_{ij}} is dominated by \mathrm{sgn}\{\widehat{W}_{ij}\}.

Secondly, for the asymptotic normality property, we concentrate on the set \mathcal{A}_{i}, i=1,2,\ldots,d, and use W_{\mathcal{A}_{i}} to denote the corresponding sub adjacency matrix, X_{\mathcal{A}_{i}} the sub observation data of the corresponding variables, and C_{\mathcal{A}_{i}} the corresponding adaptive penalty weights.

Denote

\mathrm{I}(W)=\frac{1}{2n}||X-XW||^{2}_{2},
\mathrm{II}(W)=\lambda_{n}\sum_{i}\sum_{j}|c_{ij}W_{ij}|,
\mathrm{III}(W)=\frac{\rho}{2}|h(W)|^{2}+\alpha h(W).

Then,

\nabla_{\mathcal{A}_{i}}L_{n}(\widehat{W}_{\mathcal{A}_{i}})=\nabla_{\mathcal{A}_{i}}\mathrm{I}(\widehat{W}_{\mathcal{A}_{i}})+\nabla_{\mathcal{A}_{i}}\mathrm{II}(\widehat{W}_{\mathcal{A}_{i}})+\nabla_{\mathcal{A}_{i}}\mathrm{III}(\widehat{W}_{\mathcal{A}_{i}}).

By Taylor expansion at \widehat{W}_{\mathcal{A}_{i}}=W^{*}_{\mathcal{A}_{i}}, we have

\nabla_{\mathcal{A}_{i}}\mathrm{I}(\widehat{W}_{\mathcal{A}_{i}})= \nabla_{\mathcal{A}_{i}}\mathrm{I}(W^{*}_{\mathcal{A}_{i}})+\left[\nabla^{2}_{\mathcal{A}_{i}}\mathrm{I}(W^{*}_{\mathcal{A}_{i}})\right](\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})
= \frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}}),
\nabla_{\mathcal{A}_{i}}\mathrm{II}(\widehat{W}_{\mathcal{A}_{i}})= \lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(\widehat{W}_{\mathcal{A}_{i}})
= \lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(W^{*}_{\mathcal{A}_{i}})+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1),
\nabla_{\mathcal{A}_{i}}\mathrm{III}(\widehat{W}_{\mathcal{A}_{i}})= (\rho|h(\widehat{W}_{\mathcal{A}_{i}})|+\alpha)\nabla h(\widehat{W}_{\mathcal{A}_{i}})
= (\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1).

Therefore, we now have

\nabla_{\mathcal{A}_{i}}L_{n}(\widehat{W}_{\mathcal{A}_{i}})= \frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})+\lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(W^{*}_{\mathcal{A}_{i}})+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1)+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1)
= 0.

From Theorem 1, we have a_{n,ij}=o_{p}(n^{-\frac{1}{2}}), \forall(i,j)\in\mathcal{A}, and ||\widehat{W}_{\mathcal{A}}-W^{*}_{\mathcal{A}}||=O_{p}(n^{-\frac{1}{2}}); then,

\frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})+o_{p}(n^{-\frac{1}{2}})=0
\Rightarrow\sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})= \left(\frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}\right)^{-1}\times\left[\frac{1}{\sqrt{n}}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})-o_{p}(1)\right].

Therefore, under condition 2), by the Lindeberg-Feller central limit theorem in the linear regression case, since our design satisfies the following mild condition [52],

\max_{1\leq k\leq n}|(X_{\mathcal{A}_{i}})_{k}|=o(n^{1/2}),

we have, \forall i\in\{1,2,\ldots,d\}, \sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-{W^{*}}_{\mathcal{A}_{i}})\xrightarrow{d}N(0,\sigma^{2}D_{i0}^{-1}).