
On the Sparse DAG Structure Learning
Based on Adaptive Lasso

Danru Xu, Erdun Gao, Wei Huang, Menghan Wang, Andy Song, Mingming Gong Danru Xu, Erdun Gao, Wei Huang, and Mingming Gong are with the School of Mathematics and Statistics, Faculty of Science, The University of Melbourne, Melbourne, Australia (e-mail: danru.xu@student.unimelb.edu.au; erdun.gao@student.unimelb.edu.au; wei.huang@unimelb.edu.au; mingming.gong@unimelb.edu.au). Menghan Wang is with eBay Inc. (e-mail: wangmengh@zju.edu.cn). Andy Song is with the School of Computing Technologies, Royal Melbourne Institute of Technology, Melbourne, Australia (e-mail: andy.song@rmit.edu.au).
Abstract

Learning the underlying Bayesian networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments following NOTEARS [1] recast this combinatorial problem as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. However, these continuous optimization methods tend to produce non-sparse graphs after the numerical optimization, which makes it hard to rule out potentially cycle-inducing edges or false-discovery edges with small weights. To address this issue, in this paper we develop a completely data-driven DAG structure learning method that requires no predefined threshold for pruning small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL), which applies an adaptive penalty to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic data and a real-world dataset demonstrate that our method consistently outperforms NOTEARS.

Index Terms:
Bayesian Networks learning, Sparse DAG, Continuous Optimization, Adaptive Lasso

I Introduction

Bayesian networks (BNs) [2] are graphical models that bridge probability theory and graph theory to represent a joint distribution; they can also model the form and strength of the relationships between random variables by representing conditional dependencies as edges, with corresponding parameters, in a directed acyclic graph (DAG). With success in encoding uncertain expert knowledge in expert systems [3], BNs have been widely used in many applications such as causal inference [4], biology [5], and healthcare [6].

Over the last decades, much work has been devoted to discovering Bayesian networks from purely observational data. In general, the problem is ill-posed since multiple DAGs may produce the same distribution. Furthermore, regarding computational complexity [7], the search space of DAGs is discrete, so many standard numerical algorithms cannot be utilized, which renders the DAG structure learning problem NP-hard [8, 9]. In other words, the number of candidate DAGs to search over grows super-exponentially with the number of variables.

In general, there are two main classes of DAG structure learning methods: constraint-based algorithms and score-based algorithms. Constraint-based methods, such as SGS [10], PC [11], and GS [12], read the conditional independence relationships encoded in the data and seek a completed partially directed acyclic graph (CPDAG) that entails all (and only) the corresponding d-separations under the Markov and faithfulness assumptions [13]. That is to say, the output of these algorithms is an equivalence class of DAGs instead of the ground-truth DAG. The main idea of score-based methods is to establish a suitable score function as the objective and then search the solution space for the graph structure that maximizes it. Common score functions include the BGe score [14], BDe score [15], MDL [16], and BIC [14]. Building on order search and greedy search, several approaches that optimize these score functions over the DAG space have been proposed, such as order-based search (OBS) [17], the max-min hill-climbing method [18], greedy equivalence search (GES) [19], fast GES [20], and A* [21].

To improve search efficiency, Zheng et al. [1] recently introduced a new approach, named NOTEARS, which replaces the traditional combinatorial constraint with a smooth characterization of acyclicity whose value is zero if and only if the graph is a DAG. In this way, the combinatorial optimization problem is converted into a continuous one, so a wide range of numerical methods can be applied. As NOTEARS only considers the linear model, several follow-up works [22, 23, 24, 25, 26, 27] extend it to non-linear or non-parametric settings by leveraging neural networks or orthogonal series. Subsequent works further extend this framework to time-series data [28], unmeasured common confounders [29], interventional data [30], distributed learning [31, 32], and incomplete data [33, 34]. Further related works include GOLEM [35], a likelihood-based method that uses the likelihood as the learning objective with soft constraints; NOFEARS [36], which focuses on the optimization by considering the Karush-Kuhn-Tucker (KKT) conditions; and Truncated Matrix Power Iteration [37], which finds that enlarging the coefficients of higher-order terms better approximates the DAG constraint and leads to better graph recovery performance. The convergence of the augmented Lagrangian method and the quadratic penalty method is studied in [38].

Figure 1: Visualization of the ground-truth (left-hand side of legend) and estimated (right-hand side of legend) weighted adjacency matrices of a 2-node DAG with n=1000 and penalty level \lambda\in\{0,0.3,0.6,0.9,1.2,1.5,1.8,2.1,2.4\}. NOTEARS obtains all the solutions without thresholding.

It is worth noting that NOTEARS and most of its follow-up works [27, 26, 25, 24, 23, 35] cannot directly obtain a sparse solution, since the numerical optimization always introduces many small false-positive estimates. From Figure 1, we can observe that even with a large sample size on a 2-node graph, the false-positive edge (the right one) always exists and cannot be ruled out by regularization. Moreover, as the penalty level grows, the bias of the estimates becomes larger and larger, until the estimate is an all-zero matrix. This implies that, in NOTEARS, merely increasing the penalty coefficient cannot guarantee a consistent estimate, and extra procedures are necessary to obtain a sparse solution.

To deal with this issue, NOTEARS utilizes a hard threshold to post-process the initial estimates. The fixed-threshold strategy is a direct way to cut down the false-positive estimates caused by numerical error. However, it is neither a flexible nor a systematic way to reduce the number of false discoveries. First, a fixed, sub-optimal threshold cannot adapt to different types of graphs. Second, although the number of false-positive edges is diminished by hard thresholding, small coefficients in the true model may also be deleted indiscriminately (see Section IV-B).

In summary, NOTEARS cannot obtain a sparse solution without thresholding, while the thresholding strategy is inflexible and not systematic. Motivated by this, we develop a new DAG structure learning method based on the adaptive lasso, which abandons the fixed-threshold strategy and directly yields a sparse output. Our main contributions are as follows.

  1.

    We propose a purely data-driven method that requires no predefined threshold, and we describe the specific algorithm.

  2.

    We study the oracle properties of our method, including sparsity and asymptotic normality, which means that our method can select the correct subset model and converges to the true model at the optimal rate as the sample size increases.

  3.

    We use a larger and more suitable validation set in cross-validation, in line with our goal of finding the optimal subset model.

  4.

    We demonstrate the effectiveness of our method through simulation experiments with weights generated from a broader range, e.g., Gaussian-distributed weights with mean zero.

The rest of this paper is organized as follows. In Section II, we review some basic concepts of the DAG model and the specific algorithm of NOTEARS, as well as the limitation of its fixed-threshold strategy. Then, based on the idea of the adaptive Lasso [39], we introduce our method and study its asymptotic behavior as the sample size goes to infinity in Section III. In Section IV, to verify our theoretical study and the effectiveness of our method, we compare our method with NOTEARS and existing state-of-the-art methods through simulation experiments. Finally, we conclude our work in Section V.

II Related works

In this section, we review the linear DAG model and the NOTEARS method. In Section II-A, we first briefly review some fundamental concepts of the DAG model under the linear structure assumption and the corresponding score function used to maximize the likelihood of the observations. We then detail the NOTEARS method and its optimization algorithm in Section II-B.

II-A DAG with Linear Structure

A DAG can be encoded as G=(V,E), where V refers to the set of nodes and E is the set of directed edges. The nodes in G represent the variables under consideration, and the directed edges correspond to direct conditional relationships between parental nodes and child nodes. For example, V_{1}\rightarrow V_{2} represents that V_{1} is a parent of V_{2}. Figure 2 illustrates three possible three-node DAGs with two edges.

Figure 2: No cycle exists in any of these three graphs. Both X_{1} and X_{3} are parents of X_{2} (left); X_{2} is a parent of X_{1} and X_{3} (middle); X_{2} is the parent of X_{3} and the child of X_{1} (right).

The task of DAG structure learning is to seek, from the discrete space of DAGs over d nodes, the DAG that best fits the observational data \mathbb{X}. Let \mathbb{X}\in\mathbb{R}^{n\times d} represent the observations, consisting of n d-dimensional vectors. We assume that these vectors are realizations of n independent and identically distributed (iid) random vectors with joint distribution P(X), where X=(X_{1},X_{2},\ldots,X_{d}). Importantly, under the assumptions of no latent variables and Markovianity [40], the joint distribution P(X) can be decomposed as follows:

P(X)=\prod_{i=1}^{d}P(X_{i}|Pa(X_{i})), (1)

where Pa(X_{i}) represents the parental set of X_{i}. We then model the relationships among the variables in X by a structural equation model (SEM), denoted as X_{i}=f_{i}(Pa(X_{i}),\varepsilon_{i}), i=1,\ldots,d, where \varepsilon_{i} refers to the unobserved exogenous variables. In this paper, we narrow our focus to DAGs with additive noise models (ANMs). By introducing the weighted adjacency matrix W\in\mathbb{R}^{d\times d}, we can write the linear ANM as X=XW+N, where N is a d-dimensional random vector, denoted as N=(N_{1},N_{2},\ldots,N_{d}). Each N_{i} represents the noise term added to X_{i}, and we assume N_{1},N_{2},\ldots,N_{d} are independent and identically distributed. The structural information of the DAG is completely contained in W, which is defined in the following way:

\{V_{i}\rightarrow V_{j}\}\subset E\quad\Longleftrightarrow\quad W_{ij}\neq 0. (2)

We denote by G_{W} the graph induced by W. To estimate W, we can maximize the likelihood of the observations \mathbb{X}. By the well-known probability density transformation [35], this can equivalently be achieved by minimizing the least-squares loss \frac{1}{2n}||X-XW||_{2}^{2} over the search space. A regularization term \lambda||W||_{1} is added to induce a sparse solution. The overall score function is then

\mathcal{S}(W)=\frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}. (3)
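For concreteness, a minimal NumPy sketch of the score in Eq. (3) might look as follows (the function name and the default penalty value are our own choices, not part of the original method description):

```python
import numpy as np

def score(W, X, lam=0.1):
    """Least-squares loss plus an l1 penalty, i.e. the score S(W) of Eq. (3)."""
    n = X.shape[0]
    return np.sum((X - X @ W) ** 2) / (2 * n) + lam * np.abs(W).sum()
```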

II-B The NOTEARS Method

Previous score-based methods suffer from searching the discrete DAG space, leading to low efficiency. To overcome this problem, Zheng et al. [1] proposed the NOTEARS method, which uses a function h satisfying four desiderata: i) h(W)=0 is a necessary and sufficient condition for G_{W} to be a DAG; ii) its value reflects the level of acyclicity of the graph; iii) it is smooth; iv) its first derivative exists and is easy to compute. Based on the method in [41], which uses matrix algebra to count the cycles in a graph, NOTEARS verifies that h(W)=\text{Tr}(e^{W\circ W})-d is a proper function to characterize acyclicity, with derivative \nabla h(W)=(e^{W\circ W})^{T}\circ 2W. With this smooth equality constraint, the problem can be written as the following optimization program:

\min_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}, (4)
subject to h(W)=0.
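As an illustration, the acyclicity function h(W)=\text{Tr}(e^{W\circ W})-d and its gradient can be evaluated in a few lines of NumPy/SciPy; this is only a sketch, not the reference implementation of [1]:

```python
import numpy as np
from scipy.linalg import expm

def h_and_grad(W):
    """Acyclicity measure h(W) = Tr(exp(W∘W)) - d and its gradient."""
    d = W.shape[0]
    E = expm(W * W)          # matrix exponential of the Hadamard square
    h = np.trace(E) - d      # equals zero iff the graph induced by W is a DAG
    grad = E.T * 2 * W       # ∇h(W) = (e^{W∘W})^T ∘ 2W
    return h, grad
```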

To solve program (4), the constrained optimization problem is converted into a sequence of unconstrained problems by the augmented Lagrangian method [42]. The procedure is as follows.

W_{k}= \operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}+\frac{\rho_{k}}{2}|h(W)|^{2}+\alpha_{k}h(W), (5)
\alpha_{k+1}= \alpha_{k}+\rho_{k}h(W_{k}),
\rho_{k+1}= \beta\rho_{k}.

For each iteration, the first step can be solved by the L-BFGS proximal quasi-Newton method [43, 44] with derivative -\frac{1}{n}X^{T}(X-XW)+\lambda\mathbbm{1}_{d\times d}+\left(\rho|h(W)|+\alpha_{k}\right)\nabla h(W). To handle the non-smoothness of the penalty term at zero, i.e., at W_{ij}=0, i,j=1,\ldots,d, the weighted adjacency matrix is cast as the difference of two matrices (W^{1},W^{2}) with non-negative elements [45].
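A schematic of the dual-ascent outer loop in (5) is sketched below; solve_unconstrained is a hypothetical placeholder for the inner L-BFGS minimization with \alpha_{k} and \rho_{k} held fixed, and the stopping tolerance is our own choice:

```python
import numpy as np
from scipy.linalg import expm

def h_acyc(W):
    """Acyclicity violation h(W) = Tr(exp(W∘W)) - d."""
    return np.trace(expm(W * W)) - W.shape[0]

def augmented_lagrangian(X, lam, solve_unconstrained, beta=10.0,
                         rho0=1.0, h_tol=1e-8, max_iter=100):
    """Dual-ascent scheme of Eq. (5); `solve_unconstrained` stands in for the
    inner solve of the penalized least-squares objective."""
    d = X.shape[1]
    W, alpha, rho = np.zeros((d, d)), 0.0, rho0
    for _ in range(max_iter):
        W = solve_unconstrained(X, lam, rho, alpha, W)   # W_k update
        h_val = h_acyc(W)
        if h_val <= h_tol:                               # constraint satisfied
            break
        alpha = alpha + rho * h_val                      # alpha_{k+1} = alpha_k + rho_k h(W_k)
        rho = beta * rho                                 # rho_{k+1} = beta * rho_k
    return W
```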

Remark After obtaining the initial estimate, NOTEARS zeros out coefficients whose absolute values are less than a predefined threshold, in order to reduce the number of false-positive edges. However, this strategy leads to more missing edges with small weights and can hardly adapt to different scenarios. For example, if the weights of a DAG are generated from a uniform distribution over (0,2), then with w_{t}=0.3 the estimates of small weights located in the interval (0,0.3) have a high risk of being forced to 0 in the post-processing step. Since the simulation experiments in previous work commonly set a gap around zero (for instance, in the simulation part of [1], the edge weights are sampled uniformly from [-2\alpha,-0.5\alpha]\cup[0.5\alpha,2\alpha], where \alpha is a weight scale), this side effect is not obvious. The specific algorithm of NOTEARS with a fixed predefined threshold is presented in Algorithm 1.

Algorithm 1 NOTEARS with a fixed threshold.
  
  Step 0. Take \rho_{0}>0, initial value W_{0}\in\mathbb{R}^{d\times d}, \alpha_{0}\in\mathbb{R}, and some \xi\in(0,1). Set the threshold w_{t}=0.3 and \eta=10.
  Step k. (k\geq 1) Find the smallest non-negative integer j_{k} such that, with \rho=\eta^{j_{k}}\rho_{0},
h(W_{k+1})<\xi h(W_{k}),
where
W_{k+1}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda||W||_{1}+\frac{\rho}{2}|h(W)|^{2}+\alpha_{k}h(W).
Compute \alpha_{k+1}=\alpha_{k}+\rho h(W_{k+1}), \rho_{k+1}=\beta\rho_{k}.
  
  Set \widehat{W}=W_{k+1} if converged.
  Final post-processing step. Set \widehat{W}\left[|\widehat{W}|<w_{t}\right]=0, then return \widehat{W}.

III The proposed method

In this section, we extend the NOTEARS method with the adaptive Lasso to learn sparse DAG structures from purely observational data, and name our method NOTEARS with adaptive Lasso (NOTEARS-AL). We first introduce the statistical tool used in our method, the adaptive lasso, in Section III-A. After detailing our method in Section III-B, we investigate the asymptotic behavior of NOTEARS-AL theoretically in Section III-C. We then develop an algorithm to implement this method in Section III-D and illustrate how we handle the model's hyper-parameters in Section III-E. Finally, we describe how to extend NOTEARS to generalized linear models in Section III-F.

III-A Adaptive Lasso

In recent decades, variable selection has been studied extensively. In particular, the Lasso [46] was introduced to improve prediction accuracy while simultaneously reducing overfitting. In the linear model, the lasso estimates are obtained by solving the following program:

\min_{\beta\in\mathbb{R}^{d}}\ \frac{1}{2n}||X-X\beta||_{2}^{2}+\lambda_{n}||\beta||_{1}, (6)

where \lambda_{n} depends on the sample size n. Although Lasso variable selection has been shown to be consistent under certain conditions [47], Zou [48] illustrated scenarios where the lasso is an inconsistent variable selection procedure, and argued that the setup of the lasso is somewhat unfair since it treats all coefficients with indistinctive penalty levels in the sparsity term. To fix this problem, Zou [48] proposed a new methodology called the adaptive lasso, in which the penalty weights are entirely data-driven and cleverly chosen. The adaptive lasso estimates are obtained by solving the following program:

\min_{\beta\in\mathbb{R}^{d}}\frac{1}{2n}||X-X\beta||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}c_{i}|\beta_{i}|, (7)

where c_{i} is a data-dependent penalty weight. For example, we can use the ordinary least squares estimate \widehat{\beta}_{ols} to define c_{i}: choose a \gamma>0 and set c_{i}=\frac{1}{|\widehat{\beta}_{ols,i}|^{\gamma}}, where \gamma is commonly taken to be 0.5, 1, or 2, or determined by cross-validation.
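A minimal sketch of how these penalty weights could be computed in practice (the small epsilon guard is our own addition, to avoid division by estimates that are exactly zero):

```python
import numpy as np

def adaptive_weights(X, y, gamma=1.0, eps=1e-8):
    """Penalty weights c_i = 1 / |beta_ols_i|^gamma from an OLS fit."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
    return 1.0 / (np.abs(beta_ols) + eps) ** gamma
```

In practice, the weighted program (7) can be handed to any standard lasso solver by rescaling the i-th column of the design matrix by 1/c_{i}, running an ordinary lasso, and dividing the resulting estimates by c_{i}.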

III-B The Overall Learning Objective

From Figure 1, we can see that when the penalty level is relatively low, the falsely learned edge cannot be zeroed out; on the other hand, assigning a larger weight to the regularization term may lead to even worse results. This implies that the Lasso puts much more penalty on large coefficients than on false-positive ones, which is not consistent with our requirement. Motivated by this observation, to improve criterion (4) for variable selection, one intuition is to apply adaptive penalty levels to different coefficients, as developed in [39]. Following this motivation, we modify Eq. (4) to

\min_{W\in\mathbb{R}^{d\times d}} \frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}\sum_{j=1}^{d}|c_{ij}w_{ij}|, (8)
subject to h(W)=0,

where c_{ij} is the penalty weight specified for W_{ij}. To shrink the small false-positive edges to zero while preserving the true edges, we would like the corresponding c_{ij} to be large for the former and small for the latter: small false edges are heavily penalized, and true positive edges are lightly penalized. More details on how to choose c_{ij} are given in Section III-C. With this modification of the score function, no predefined threshold is needed to post-process the initial estimate.

III-C Asymptotic Oracle Properties

In this section, we show that with a proper choice of the c_{ij} and under some mild conditions, our method possesses the oracle properties as the sample size n goes to infinity. Similarly to procedure (5), the equality-constrained program (ECP) (8) can be solved by a dual ascent process via the augmented Lagrangian method. The first step of each iteration can be regarded as finding a local minimum of the Lagrangian with a fixed Lagrange multiplier \alpha, that is, of

L_{n}(W)= \frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i,j}|c_{ij}w_{ij}|+\frac{\rho}{2}|h(W)|^{2}+\alpha h(W). (9)

Let W^{*} denote the underlying true adjacency matrix. We then define

\mathcal{A}=\{(i,j):w^{*}_{ij}\neq 0\},
\mathcal{A}_{c}=\{(i,j):w^{*}_{ij}=0\},

which means that \mathcal{A} collects the indices of the nonzero true parameters, and \mathcal{A}_{c} contains the indices of the terms that do not exist in the underlying true model. We first state two conditions:

  1.

    X_{i}=X_{i}W^{*}+N_{i}, i=1,2,\ldots,n, where N_{1},N_{2},\ldots,N_{n} are independent and identically distributed with mean \mathbb{0} and variance \sigma^{2}<\infty.

  2.

    \frac{1}{n}X^{T}X is finite and converges to a positive definite matrix D\in\mathbb{R}^{d\times d}.

Under these two conditions, the oracle properties of NOTEARS-AL are described in the following theorems. Rigorous proofs are given in Appendix VI.

Theorem 1 (Local Minimizer) Under the above two conditions, let a_{n}=\lambda_{n}c_{\mathcal{A}}, where c_{\mathcal{A}} refers to the c_{ij}'s associated with the nonzero coefficients of W^{*}. If a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2}, then there exists a local minimizer of L_{n}(W) that is \sqrt{n}-consistent for W^{*}. Let \widehat{W}_{n} denote the local minimizer in Theorem 1, which satisfies ||\widehat{W}_{n}-W^{*}||=O_{p}(n^{-\frac{1}{2}}). Consistency is a nice start, and it motivates us to study further asymptotic properties as the sample size n goes to infinity. Before introducing the oracle properties, we need additional notation, because we now analyze the estimators of the nonzero parameters in the underlying true model. We define

\mathcal{A}_{i}=\{(i,j):(i,j)\in\mathcal{A}\},
d_{i}=|\mathcal{A}_{i}|.

In condition 2), we already assume \frac{1}{n}X^{T}X\to D as n\to\infty. Furthermore, without loss of generality, we assume \mathcal{A}_{i}=\{(i,1),(i,2),\ldots,(i,d_{i})\}. Let

D=\left(\begin{matrix}D_{i0}&D_{i1}\\ D_{i2}&D_{i3}\end{matrix}\right),

where D_{i0} is a d_{i}\times d_{i} matrix corresponding to the coefficients with indices in \mathcal{A}_{i}.

Theorem 2 (Oracle Properties) Under the two conditions, let b_{n}=\lambda_{n}c_{\mathcal{A}_{c}}, where c_{\mathcal{A}_{c}} refers to the c_{ij}'s associated with the zero coefficients of W^{*}. If \sqrt{n}b_{n}\rightarrow\infty as n\rightarrow\infty, then the local minimizer \widehat{W}_{n} in Theorem 1 must satisfy the following:

  • (Sparsity) \lim\limits_{n\to\infty}P(\widehat{W}_{\mathcal{A}_{c}}=0)=1

  • (Asymptotic normality) \forall i\in\{1,2,\ldots,d\}, \sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-{W^{*}}_{\mathcal{A}_{i}})\xrightarrow{d}N(0,\sigma^{2}D_{i0}^{-1}) as n\rightarrow\infty.

Firstly, for the zero-coefficient part, Theorem 2 shows that given a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2} and \sqrt{n}b_{n}\rightarrow\infty, NOTEARS-AL consistently removes all irrelevant variables with probability tending to 1. Secondly, after magnifying the difference by \sqrt{n} for the nonzero estimators, the limiting distribution turns out to be normal. Only one question remains: how to determine the c_{ij}'s such that the nonzero-related part satisfies a_{n}=o_{p}(n^{q}) for some q\leq-\frac{1}{2} and the zero-related part satisfies \sqrt{n}b_{n}\rightarrow\infty. Using the fact that the ordinary least squares (OLS) estimate \widehat{W}_{ols} is a \sqrt{n}-consistent estimate of W^{*} (the proof follows that of Theorem 1), and supposing that \lambda_{n}\sqrt{n}\rightarrow 0 and \lambda_{n}n^{\frac{\gamma+1}{2}}\rightarrow\infty, we can define the c_{ij}'s from \widehat{W}_{ols} with some specific power \gamma so that the above two requirements are met. Specific details on determining \lambda_{n} are given in Section III-E.

\widehat{W}_{ols}= \operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\lambda_{n}\sum_{i=1}^{d}\sum_{j=1}^{d}|c_{ij}w_{ij}|, (10)
subject to h(W)=0,\ c_{ij}=\frac{1}{|\widehat{W}_{ols,ij}|^{\gamma}}.

III-D NOTEARS with Adaptive Lasso

The algorithm can be divided into two parts. The first part finds the solution \widehat{W}_{ols} of Eq. (10); the second part plugs the adaptive penalty parameters into Eq. (9) and solves it. These equality-constrained programs (ECPs) are then solved by the augmented Lagrangian method, i.e., using the Lagrangian method to handle the hard DAG constraint h(W)=0. The first part follows the same procedure as the original NOTEARS without the thresholding step. We then obtain a matrix C\in\mathbb{R}^{d\times d} with elements c_{ij}=\frac{1}{|\widehat{W}_{ols,ij}|^{\gamma}}. Let W_{C}\vcentcolon=C\circ W, where \circ is the Hadamard product. Thus, W_{ij}=W_{C,ij}/c_{ij}, or in matrix notation W=W_{C}\varoslash C, where \varoslash refers to Hadamard division. Then, ECP (8) can be transformed into

\widehat{W_{C}}= \operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1} (11)
subject to h(W_{C}\varoslash C)=0,
\widehat{W}_{als}= \widehat{W_{C}}\varoslash C,

where \widehat{W}_{als} is the final estimate of W. Then, similarly to (5), the above ECP (11) can be solved by a dual ascent process via the augmented Lagrangian method.

W_{Ck}= \operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1}+\frac{\rho}{2}|h(W_{C}\varoslash C)|^{2}+\alpha_{k}h(W_{C}\varoslash C), (12)
\alpha_{k+1}= \alpha_{k}+\rho h(W_{C}\varoslash C).

To complete the above optimization, we need a new form of the gradient for the first step in Eq. (12). Here we again reformulate W as the difference of two matrices with non-negative elements. For the loss term:

\nabla\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}=-\frac{1}{n}X^{T}(X-XW_{C}\varoslash C)\varoslash C.

For the acyclicity term h:

\nabla h(W_{C}\varoslash C)=2\left(e^{(W_{C}\circ W_{C})\varoslash(C\circ C)}\right)^{T}\circ W_{C}\varoslash(C\circ C).
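The two gradients above might be sketched in NumPy as follows (a sketch under the W=W_{C}\varoslash C reparameterization, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import expm

def grads_reparam(X, W_C, C):
    """Gradients of the loss and acyclicity terms with respect to W_C,
    where W = W_C / C elementwise (Hadamard division)."""
    n = X.shape[0]
    W = W_C / C                         # W = W_C ⊘ C
    R = X - X @ W                       # residuals
    grad_loss = -(X.T @ R) / n / C      # ∇_{W_C} (1/2n)||X - XW||_2^2
    E = expm(W * W)                     # e^{W∘W}
    grad_h = 2.0 * E.T * W / C          # (e^{W∘W})^T ∘ 2W, then ⊘ C
    return grad_loss, grad_h
```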

To sum up, the algorithm proceeds as Algorithm 2.

Algorithm 2 NOTEARS with the adaptive lasso.
  
  Step 0. Take \rho_{0}>0, initial value W_{0}\in\mathbb{R}^{d\times d}, \alpha_{0}\in\mathbb{R}, and some \xi\in(0,1). Set \eta=10.
  OLS loop k. (k\geq 1) Using the complete data set X, find the smallest non-negative integer j_{k} such that, with \rho=\eta^{j_{k}}\rho_{0},
h(W_{k+1})<\xi h(W_{k}),
where
W_{k+1}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW||_{2}^{2}+\frac{\rho}{2}|h(W)|^{2}+\alpha_{k}h(W).
Compute \alpha_{k+1}=\alpha_{k}+\rho h(W_{k+1}).
  
  Set \widehat{W}_{ols}=W_{k+1} if converged. Then, define C\vcentcolon=1\varoslash|\widehat{W}_{ols}|^{\gamma}, W_{C}\vcentcolon=C\circ W.
  
  Adaptive lasso loop t. (t\geq 1) Using the complete data set X, for each \lambda_{n}, find the smallest non-negative integer i_{t} such that, with \rho=\eta^{i_{t}}\rho_{0},
h(W_{C,t+1}\varoslash C)<\xi h(W_{C,t}\varoslash C),
where
W_{C,t+1}=\operatorname*{arg\,min}_{W_{C}\in\mathbb{R}^{d\times d}}\frac{1}{2n}||X-XW_{C}\varoslash C||_{2}^{2}+\lambda_{n}||W_{C}||_{1}+\frac{\rho}{2}|h(W_{C}\varoslash C)|^{2}+\alpha_{t}h(W_{C}\varoslash C).
Compute \alpha_{t+1}=\alpha_{t}+\rho h(W_{C,t+1}\varoslash C).
  Set \widehat{W}_{C,n}=W_{C,t+1} if converged.
  Return the adaptive estimate \widehat{W}_{n}\vcentcolon=\widehat{W}_{C,n}\varoslash C.

III-E Hyper-parameter

Choosing the tuning parameter \lambda_{n} is an important issue. We found that the results of NOTEARS-based methods are sensitive to the value of the hyper-parameter, so it is essential to choose a good one. Our goal is to choose the optimal model s_{n}, which contains as many of the true variables as possible. Following Zou [48], we utilize cross-validation to find an optimal \lambda_{n}. From NOTEARS-AL, we obtain a set of candidate models S=\{s_{n},n=1,2,\ldots,N\}, where s_{n}=\{(i,j)\in(1:d)\times(1:d):\widehat{W}_{n,ij}\neq 0\}. The cross-validation procedure we use is shown in Algorithm 3.

Algorithm 3 The Cross-validation method.
  
  Step 0. Divide the data into a validation set X_{v} and a training set X_{t}, with |X_{v}|=n_{v}, |X_{t}|=n_{t}, and n_{v}+n_{t}=n.
  For model s_{n} (n=1,2,\ldots,N). Using the training set X_{t}, compute the solution \widetilde{W}_{n}^{t}, where
\widetilde{W}_{n}^{t}=\operatorname*{arg\,min}_{W\in\mathbb{R}^{d\times d},W_{(-s_{n})}=0}\frac{1}{2n_{t}}||X_{t}-X_{t}W||_{2}^{2}.
  
  Evaluate the prediction performance of \widetilde{W}_{n}^{t} on the validation set X_{v} by the loss \frac{1}{2n_{v}}||X_{v}-X_{v}\widetilde{W}_{n}^{t}||_{2}^{2}.

Another key modification is that we need a larger validation set to find the best model and achieve the restricted model-selection consistency, especially when the candidate model set is large [49]. Motivated by this, instead of using the traditional K-fold cross-validation method, whose validation set contains only 1/K of the data, we suggest swapping the proportions of the validation and training sets.
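A minimal sketch of this model-selection loop is given below; fit_restricted is a hypothetical helper that refits W on the training split with the zero pattern of a candidate model s_{n} enforced, and the 80/20 split only illustrates the swapped proportions:

```python
import numpy as np

def select_model(X, candidate_supports, fit_restricted, val_frac=0.8, seed=0):
    """Pick the candidate support (model s_n) with the smallest validation loss.
    `candidate_supports` is a list of boolean d-by-d masks; `fit_restricted`
    refits W on the training split with entries outside the mask fixed to 0."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))                  # large validation set
    X_val, X_train = X[idx[:n_val]], X[idx[n_val:]]
    losses = []
    for mask in candidate_supports:
        W = fit_restricted(X_train, mask)           # W_(-s_n) = 0 enforced
        losses.append(0.5 * np.mean((X_val - X_val @ W) ** 2))
    return int(np.argmin(losses)), losses
```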

III-F Extension to Generalized Linear Models

Zheng et al.[27] also extend the idea of NOTEARS to generalized linear models (GLMs). Traditionally, a GLM is formulated as follows:

\mathbb{E}[X_{i}|Pa(X_{i})]=g_{i}(d_{i}^{T}X), (13)

where g_{i} is a link function used in GLMs, and d_{i}\in\mathbb{R}^{d}. To extend the DAG constraint beyond the linear setting, the key is to redefine the weighted adjacency matrix W in a sensible way, since W has no explicit form in the GLM setting. Based on the idea in [50], which encodes the importance of each variable via the corresponding partial derivatives, it is easy to show that the dependence among variables can be precisely measured by the L^{2}-norm of the corresponding partial derivatives [27]. This implies that the weighted adjacency matrix W in GLMs can be defined as follows:

W_{ki}=||\frac{\partial f_{i}}{\partial x_{k}}||_{L^{2}}=d_{ik}^{2}. (14)

Thus, for linear mean functions, i.e., f_{i}(X)=d_{i}^{T}X, W_{ki}=0 is equivalent to d_{ik}=0. We can simplify the DAG constraint h(W)=0 by plugging in W=({d_{1}}^{T},{d_{2}}^{T},\ldots,{d_{d}}^{T}). To construct the score function for GLMs, we need a more suitable loss function. In this paper, we focus on one specific GLM, the logistic regression model, where X_{i}\in\{0,1\}, i=1,\ldots,d. For logistic regression, the least-squares loss is no longer appropriate since it results in a non-convex objective. Here we adopt the log loss, which is commonly used as the loss function for logistic regression, defined as follows:

l=\frac{1}{n}\sum_{i=1}^{n}\left[-x_{i}\log\left(g(t_{i})\right)-(1-x_{i})\log\left(1-g(t_{i})\right)\right], (15)

where t_{i}=x_{i}\times({d_{1}}^{T},{d_{2}}^{T},\ldots,{d_{d}}^{T}).
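For reference, a vectorized sketch of the log loss in Eq. (15) for the linear-logit model (we assume, for illustration, that column j of the matrix D below holds the coefficient vector d_{j}):

```python
import numpy as np

def logistic_log_loss(X, D, eps=1e-12):
    """Average log loss of Eq. (15) for binary data X under logits T = X @ D."""
    T = X @ D                            # logits for every sample and variable
    P = 1.0 / (1.0 + np.exp(-T))         # link g(t) = e^t / (e^t + 1)
    P = np.clip(P, eps, 1.0 - eps)       # numerical safety
    return np.mean(-X * np.log(P) - (1.0 - X) * np.log(1.0 - P))
```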

IV Experiments

We conduct experiments with a large sample size to verify the asymptotic oracle properties (Section III-C). To validate the effectiveness of our proposed method, we compare it against NOTEARS with a fixed threshold, greedy equivalence search (GES) [19, 20], and the PC algorithm [4]. For NOTEARS, we set the threshold to 0.1 and the penalty level to \lambda=0.1 in our experiments. This setting not only guarantees a DAG output but also maintains relatively good performance.

The basic setup of our experiments is as follows. For each experiment, the ground-truth DAG is generated from the Erdös-Rényi (ER) random graph model. We use ER2 to denote an ER graph with s_{0}=2d edges, and similarly for ER1 and ER4. Given the ground-truth DAG, we generate the weights of the existing edges from 1) a normal distribution with mean 0 and variance \sigma^{2}; or 2) a uniform distribution over the interval (-c,c), where c\in\mathbb{R}. We then simulate the sample X via X_{i}=f_{i}(X_{pa(i)})+z_{i}, i=1,\ldots,d, with noise z_{i}\sim N(0,1). Here, we mainly consider the linear case and extend to one type of GLM, the logistic regression model. The metrics we use to evaluate the estimated DAGs are the structural Hamming distance (SHD), true positive rate (TPR), and false discovery rate (FDR).
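The data-generating process described above can be sketched as follows (a simplified sketch for the linear Gaussian-weight case; function and argument names are our own):

```python
import numpy as np

def simulate_linear_sem(d=10, k=2, n=1000, weight_std=2.0, seed=0):
    """Sample an ER-style DAG with about k*d expected edges, Gaussian edge
    weights, and n observations from the linear SEM X = XW + N with
    standard normal noise."""
    rng = np.random.default_rng(seed)
    p_edge = min(1.0, 2.0 * k / (d - 1))
    # A strictly upper-triangular support guarantees acyclicity.
    mask = np.triu(rng.random((d, d)) < p_edge, k=1)
    W = np.where(mask, rng.normal(0.0, weight_std, size=(d, d)), 0.0)
    X = np.zeros((n, d))
    for j in range(d):                   # columns are already in topological order
        X[:, j] = X @ W[:, j] + rng.normal(size=n)
    perm = rng.permutation(d)            # shuffle variable labels
    return W[np.ix_(perm, perm)], X[:, perm]
```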

IV-A Numerical Results: Gaussian Weights

Compared to the original weight-generating interval of NOTEARS, Uniform([-2.0,-0.5]\cup[0.5,2.0]), there are two aspects we aim to improve. The first concerns the upper and lower bounds, which are set to \pm 2\alpha: we expect our method to extend NOTEARS to a wider weight range with better estimates. The second concerns coefficients near 0, which cannot be recovered under the fixed-thresholding post-processing step. To examine both points, we first perform a structure learning study in the Gaussian-weight case, where the weights of existing edges are sampled from a normal distribution with mean 0 and variance 2^{2}. In this way, we can investigate the performance of NOTEARS-AL over a wider range of coefficients.

In our experiments, we generate {ER1, ER2, ER4} graphs with different numbers of nodes d=\{10,20,50\}, 15 graphs for each type, and then simulate n=1000 samples with iid Gaussian noise. The results are shown in Figure 3. For better visualization, only the average values of the metrics are reported. Consistent with previous work, GES shows a significant advantage in the ER1 case under every metric, but its performance deteriorates rapidly as the graphs become denser. Moreover, as seen in Figure 3, for ER4 with 50 nodes, GES takes too long to produce results because of the large search space and low search efficiency. Not surprisingly, the performances of NOTEARS and NOTEARS-AL are positively related. This is natural since NOTEARS-AL extracts the adaptive penalty weights from conventional NOTEARS, so its performance is also sensitive to the results of the conventional method. Nonetheless, our approach performs uniformly better than NOTEARS across different graphs in terms of all metrics.
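For reference, the three metrics reported in Figure 3 can be computed from binary adjacency matrices as in the following simplified sketch (a reversed edge is counted once in the SHD):

```python
import numpy as np

def shd_tpr_fdr(B_est, B_true):
    """Structural Hamming distance, true positive rate, and false discovery
    rate between binary adjacency matrices."""
    B_est, B_true = B_est.astype(bool), B_true.astype(bool)
    diff = B_est ^ B_true
    shd = int(diff.sum() - (diff & diff.T).sum() // 2)   # merge reversal pairs
    tp = (B_est & B_true).sum()
    fp = (B_est & ~B_true).sum()
    tpr = tp / max(B_true.sum(), 1)
    fdr = fp / max(B_est.sum(), 1)
    return shd, tpr, fdr
```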

Figure 3: Numerical results in terms of SHD (\downarrow), TPR (\uparrow), and FDR (\downarrow) on ER graphs, with sample size n=1000 and Gaussian noise. For TPR, higher is better; for the other criteria, lower is better. Columns: random graph types, where ERk denotes ER graphs with k\times d expected edges.

Figure 4 and Figure 5 show structure recovery results for SEMs with exponential noise and Gumbel noise, respectively, where the weights of existing edges are sampled from a normal distribution with mean 0 and variance 2^{2}.

Figure 4: Numerical results in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000 and exponential noise.
Figure 5: Numerical results in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000 and Gumbel noise.

IV-B Coefficients in the Neighborhood of 0

As mentioned, one side effect of the fixed-threshold strategy is that small coefficients in the true model may be deleted indiscriminately. Since the simulation experiments in previous work commonly set a gap around zero, namely \alpha[-2,-0.5]\cup\alpha[0.5,2], this side effect is not obvious. Hence, we investigate the weights of missing edges when the weights are generated from ranges that cover zero, e.g., Uniform(-5,5), Uniform(-10,10), Gaussian(0,2^{2}), and Gaussian(0,3^{2}). For each distribution, we generated 200 DAG structures from the ER1 model with 10 nodes and then simulated 1000 samples. Given the different methodologies of the two methods, one with a hard threshold of 0.1 and the other with adaptive penalty levels, our method is more flexible and better suited to handling coefficients around zero. Indeed, Figure 6 confirms this intuition. We plot the histogram of missing edges. In all cases, our method loses fewer edges than NOTEARS with hard thresholding, especially in the region around zero. Moreover, as the range widens or the variance increases, the gap between the two methods becomes more and more significant.

Figure 6: Histogram of missing edges. Blue bars refer to NOTEARS with a hard threshold of 0.1; yellow bars refer to NOTEARS-AL. In the first row, the coefficients of the ground-truth graph are generated from the uniform distributions (-5,5) and (-10,10), from left to right. In the second row, the true coefficients are drawn from normal distributions with variances 2^{2} and 3^{2}, respectively.

IV-C High-dimension Cases

To investigate the performance of our method for estimating DAGs from high-dimensional graphical models, we increase the number of nodes to 100 and compare the results with those of fixed-threshold NOTEARS. To illustrate the validity of the evaluation criteria, we draw error bars with sample standard deviations. The results are shown in Figure 7. Under nearly the same SHD, the true positive rate of NOTEARS-AL is significantly higher than that of NOTEARS with a fixed threshold. Besides, in terms of FDR (lower is better), NOTEARS-AL has lower values for all graphs. In general, NOTEARS-AL consistently outperforms the original fixed-threshold method, even in high-dimensional settings.

Figure 7: Numerical results in terms of SHD, TPR, and FDR on ER2 graphs with Gaussian (0,2^{2}) weights, sample size n=1000, and iid Gaussian noise.

IV-D Generalized Linear Models

For GLMs, we concentrate on logistic regression for dichotomous data X_{i}\in\{0,1\}, i=1,\ldots,d, with link function g(t)=e^{t}/(e^{t}+1). As introduced in Section III-F, for linear mean functions, the elements of W in the DAG constraint h(W)=0 can be substituted by the d_{ik}'s, the coefficients of the linear mean functions. We compare the performance of fixed-threshold NOTEARS with that of NOTEARS-AL, with edge weights drawn from Gaussian (0,2^{2}).

Figure 8: Numerical results for logistic regression in terms of SHD, TPR, and FDR on ER graphs, with sample size n=1000. For TPR, higher is better; for the other two metrics, lower is better. Columns: random graph types, where ERk denotes graphs with k\times d expected edges.

Figure 8 shows the structure recovery results for n=1000. By comparison, the TPR of NOTEARS-AL is significantly higher than that of NOTEARS with a fixed threshold, while in terms of the other two metrics, SHD and FDR, both methods remain at a similar level. In general, NOTEARS-AL consistently outperforms the original NOTEARS.

IV-E Real Data

Finally, consistent with Zheng et al. [1], we compare our method with NOTEARS on the same real dataset, which records the expression levels of 11 phosphoproteins and phospholipids in human cells. This dataset is commonly used in the literature on Bayesian network structure inference, as it provides a network that is widely accepted by the biological community. Based on d=11 variables and n=911 observational samples, the ground-truth structure given in [5] contains 17 edges. In our experiments, NOTEARS with a 0.1 threshold estimates 22 edges with an SHD of 20, compared to 20 estimated edges for NOTEARS-AL with an SHD of 19.

V Conclusion

Based on the methodology of NOTEARS, we present a method for score-based learning of DAG models that does not need a hard-thresholding post-processing step. The key technical design is to use an adaptive penalty level for each edge rather than a common one. Notably, we prove that our method enjoys the oracle properties when the adaptive penalties satisfy some mild conditions. With a suitable choice of the hyper-parameter in the second step, handled by a modified K-fold cross-validation, our method can zero out edges automatically in each iteration. Our simulation experiments show the effectiveness of our method, and it is worth emphasizing that the adaptive strategy is generally more flexible than the hard-threshold strategy since it is purely data-driven, which enables our method to adapt to different types of graphs.

References

  • [1] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing, “Dags with no tears: Continuous optimization for structure learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [2] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference.   Morgan kaufmann, 1988.
  • [3] D. Heckerman, “A tutorial on learning with bayesian networks,” Innovations in Bayesian networks, pp. 33–82, 2008.
  • [4] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, prediction, and search.   MIT press, 2000.
  • [5] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
  • [6] P. J. Lucas, L. C. Van der Gaag, and A. Abu-Hanna, “Bayesian networks in biomedicine and health-care,” Artificial intelligence in medicine, vol. 30, no. 3, pp. 201–214, 2004.
  • [7] R. Ganian and V. Korchemna, “The complexity of bayesian network learning: Revisiting the superstructure,” Advances in Neural Information Processing Systems, vol. 34, pp. 430–442, 2021.
  • [8] M. Chickering, D. Heckerman, and C. Meek, “Large-sample learning of bayesian networks is np-hard,” Journal of Machine Learning Research, vol. 5, pp. 1287–1330, 2004.
  • [9] D. M. Chickering, “Learning bayesian networks is np-complete,” in Learning from data.   Springer, 1996, pp. 121–130.
  • [10] P. Spirtes, C. Glymour, and R. Scheines, “Causality from probability,” 1989.
  • [11] P. Spirtes and C. Glymour, “An algorithm for fast recovery of sparse causal graphs,” Social science computer review, vol. 9, no. 1, pp. 62–72, 1991.
  • [12] D. Margaritis and S. Thrun, “Bayesian network induction via local neighborhoods,” Advances in neural information processing systems, vol. 12, 1999.
  • [13] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms.   The MIT Press, 2017.
  • [14] D. Geiger and D. Heckerman, “Learning gaussian networks,” in Uncertainty Proceedings 1994.   Elsevier, 1994, pp. 235–243.
  • [15] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning bayesian networks: The combination of knowledge and statistical data,” Machine learning, vol. 20, no. 3, pp. 197–243, 1995.
  • [16] R. R. Bouckaert, “Probabilistic network construction using the minimum description length principle,” in European conference on symbolic and quantitative approaches to reasoning and uncertainty.   Springer, 1993, pp. 41–48.
  • [17] M. Teyssier and D. Koller, “Ordering-based search: A simple and effective algorithm for learning bayesian networks,” arXiv preprint arXiv:1207.1429, 2012.
  • [18] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing bayesian network structure learning algorithm,” Machine Learning, vol. 65, no. 1, pp. 31–78, 2006.
  • [19] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of machine learning research, vol. 3, no. Nov, pp. 507–554, 2002.
  • [20] J. Ramsey, M. Glymour, R. Sanchez-Romero, and C. Glymour, “A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images,” International journal of data science and analytics, vol. 3, no. 2, pp. 121–129, 2017.
  • [21] I. Ng, Y. Zheng, J. Zhang, and K. Zhang, “Reliable causal discovery with improved exact search and weaker assumptions,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [22] D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag, “Structural agnostic modeling: Adversarial learning of causal graphs,” arXiv preprint arXiv:1803.04929, 2018.
  • [23] I. Ng, S. Zhu, Z. Fang, H. Li, Z. Chen, and J. Wang, “Masked gradient-based causal structure learning,” in SIAM International Conference on Data Mining, 2022, pp. 424–432.
  • [24] Y. Yu, J. Chen, T. Gao, and M. Yu, “DAG-GNN: DAG structure learning with graph neural networks,” in International Conference on Machine Learning, 2019.
  • [25] S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien, “Gradient-based neural dag learning,” in International Conference on Learning Representations, 2020.
  • [26] S. Zhu, I. Ng, and Z. Chen, “Causal discovery with reinforcement learning,” in International Conference on Learning Representations, 2020.
  • [27] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing, “Learning sparse nonparametric dags,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 3414–3425.
  • [28] R. Pamfil, N. Sriwattanaworachai, S. Desai, P. Pilgerstorfer, K. Georgatzis, P. Beaumont, and B. Aragam, “Dynotears: Structure learning from time-series data,” in International Conference on Artificial Intelligence and Statistics, 2020.
  • [29] R. Bhattacharya, T. Nagarajan, D. Malinsky, and I. Shpitser, “Differentiable causal discovery under unmeasured confounding,” in International Conference on Artificial Intelligence and Statistics, 2021.
  • [30] P. Brouillard, S. Lachapelle, A. Lacoste, S. Lacoste-Julien, and A. Drouin, “Differentiable causal discovery from interventional data,” in Advances in Neural Information Processing Systems, 2020.
  • [31] I. Ng and K. Zhang, “Towards federated bayesian network structure learning with continuous optimization,” in International Conference on Artificial Intelligence and Statistics, 2022.
  • [32] E. Gao, J. Chen, L. Shen, T. Liu, M. Gong, and H. Bondell, “FedDAG: Federated DAG structure learning,” Transactions on Machine Learning Research, 2022.
  • [33] T. Geffner, J. Antoran, A. Foster, W. Gong, C. Ma, E. Kiciman, A. Sharma, A. Lamb, M. Kukla, N. Pawlowski et al., “Deep end-to-end causal inference,” arXiv preprint arXiv:2202.02195, 2022.
  • [34] E. Gao, I. Ng, M. Gong, L. Shen, W. Huang, T. Liu, K. Zhang, and H. Bondell, “MissDAG: Causal discovery in the presence of missing data with continuous additive noise models,” in Advances in Neural Information Processing Systems, 2022.
  • [35] I. Ng, A. Ghassami, and K. Zhang, “On the role of sparsity and dag constraints for learning linear dags,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 943–17 954, 2020.
  • [36] D. Wei, T. Gao, and Y. Yu, “Dags with no fears: A closer look at continuous optimization for learning bayesian networks,” in Advances in Neural Information Processing Systems, 2020.
  • [37] Z. Zhang, I. Ng, D. Gong, Y. Liu, E. M. Abbasnejad, M. Gong, K. Zhang, and J. Q. Shi, “Truncated matrix power iteration for differentiable DAG learning,” in Advances in Neural Information Processing Systems, 2022.
  • [38] I. Ng, S. Lachapelle, N. R. Ke, S. Lacoste-Julien, and K. Zhang, “On the convergence of continuous constrained optimization for structure learning,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 8176–8198.
  • [39] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American statistical association, vol. 101, no. 476, pp. 1418–1429, 2006.
  • [40] P. Spirtes, C. Glymour, R. Scheines et al., Causation, Prediction, and Search.   The MIT Press, 2001, vol. 1.
  • [41] F. Harary and B. Manvel, “On the number of cycles in a graph,” Matematickỳ časopis, vol. 21, no. 1, pp. 55–63, 1971.
  • [42] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
  • [43] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on scientific computing, vol. 16, no. 5, pp. 1190–1208, 1995.
  • [44] K. Zhong, I. E.-H. Yen, I. S. Dhillon, and P. K. Ravikumar, “Proximal quasi-newton for computationally intensive l1-regularized m-estimators,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [45] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. P. Xing, “Learning sparse nonparametric dags,” 2019. [Online]. Available: https://arxiv.org/abs/1909.13189
  • [46] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [47] N. Meinshausen and P. Bühlmann, “Variable selection and high-dimensional graphs with the lasso,” Ann Stat, vol. 34, pp. 1436–1462, 2006.
  • [48] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006. [Online]. Available: http://www.jstor.org/stable/27639762
  • [49] Y. Feng and Y. Yu, “The restricted consistency property of leave-n_{v}-out cross-validation for high-dimensional variable selection,” 2013. [Online]. Available: https://arxiv.org/abs/1308.5390
  • [50] L. A. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri, “Nonparametric sparsity and regularization,” 2013.
  • [51] N. H. Choi, W. Li, and J. Zhu, “Variable selection with the strong heredity constraint and its oracle property,” Journal of the American Statistical Association, vol. 105, no. 489, pp. 354–364, 2010.
  • [52] A. W. Van der Vaart, Asymptotic statistics.   Cambridge university press, 2000, vol. 3.

VI Proof of Theorems

VI-A Proof of Theorem 1

Let \eta_{n}=n^{p}, p<0, \bm{\delta}=\begin{pmatrix}\delta_{1,1}&\delta_{1,2}&\cdots&\delta_{1,d}\\ \delta_{2,1}&\delta_{2,2}&\cdots&\delta_{2,d}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{d,1}&\delta_{d,2}&\cdots&\delta_{d,d}\end{pmatrix}, and ||\bm{\delta}||\leq d, where d is a constant.

D_{n}(\bm{\delta})= L_{n}(W^{*}+\eta_{n}\bm{\delta})-L_{n}(W^{*})
= l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})+
\lambda_{n}\sum_{i}\sum_{j}c_{ij}\left\{|W^{*}_{ij}+\eta_{n}\delta_{ij}|-|W^{*}_{ij}|\right\}+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\},

where l_{n}(W)=\frac{1}{2n}||X-XW||_{2}^{2}.

Let a_{n,ij}=\lambda_{n}c_{ij}, (i,j)\in\mathcal{A}; we can derive

D_{n}(\bm{\delta})\geq l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})+
\sum_{(i,j)\in\mathcal{A}}a_{n,ij}\left\{|W^{*}_{ij}+\eta_{n}\delta_{ij}|-|W^{*}_{ij}|\right\}+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
\geq l_{n}(W^{*}+\eta_{n}\bm{\delta})-l_{n}(W^{*})-
\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta})+
\frac{1}{2}(\eta_{n}\bm{\delta})^{T}[\nabla^{2}l_{n}(W^{*})](\eta_{n}\bm{\delta})(1+o_{p}(1))-
\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|+
\frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}
\triangleq \mathrm{I}+\mathrm{II}+\mathrm{III}+\mathrm{IV},

where we define

\mathrm{I}= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta}),
\mathrm{II}= \frac{1}{2}(\eta_{n}\bm{\delta})^{T}[\nabla^{2}l_{n}(W^{*})](\eta_{n}\bm{\delta})(1+o_{p}(1)),
\mathrm{III}= -\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|,
\mathrm{IV}= \frac{\rho}{2}\left\{|h(W^{*}+\eta_{n}\bm{\delta})|^{2}-|h(W^{*})|^{2}\right\}+
\alpha\left\{h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right\}.

For \mathrm{I},

\mathrm{I}= \nabla l_{n}(W^{*})^{T}(\eta_{n}\bm{\delta})
= \eta_{n}\left(\nabla l_{n}(W^{*})^{T}\right)\bm{\delta}
= -\eta_{n}\left(\frac{1}{n}(X-XW^{*})^{T}X^{T}\right)\bm{\delta}
= -\frac{1}{\sqrt{n}}\eta_{n}\left(\sqrt{n}\frac{1}{n}(X-XW^{*})^{T}X^{T}\right)\bm{\delta}
= -O_{p}\left(\frac{1}{\sqrt{n}}\eta_{n}\right)\bm{\delta}.

For \mathrm{II},

\mathrm{II}=\frac{1}{2}\eta_{n}^{2}\left\{\bm{\delta}^{T}\left[\frac{1}{n}X^{T}X\right]\bm{\delta}\right\}(1+o_{P}(1))>0.

For \mathrm{III},

\mathrm{III}= -\eta_{n}\sum_{(i,j)\in\mathcal{A}}a_{n,ij}|\delta_{ij}|
\geq -\eta_{n}An^{q}d,

where A=\max\{a_{n,ij}/n^{q}:(i,j)\in\mathcal{A}\}, and A\in O(1).

For \mathrm{IV}, since h(W)\geq 0 for all W\in\mathbb{R}^{d\times d}, we can derive

\mathrm{IV}=\left[h(W^{*}+\eta_{n}\bm{\delta})-h(W^{*})\right]\left\{\frac{\rho}{2}\left[h(W^{*}+\eta_{n}\bm{\delta})+h(W^{*})\right]+\alpha\right\}

\Rightarrow\qquad\mathrm{IV}\geq 0.

Thus,

D_{n}(\bm{\delta})\geq -O_{p}\left(\frac{1}{\sqrt{n}}\eta_{n}\right)\bm{\delta}+\frac{1}{2}\eta_{n}^{2}\left\{\bm{\delta}^{T}\left[\frac{1}{n}X^{T}X\right]\bm{\delta}\right\}(1+o_{p}(1))-\eta_{n}An^{q}d+\mathrm{IV}.

We find that when p>-\frac{1}{2} and q\leq-\frac{1}{2}, \mathrm{II} dominates \mathrm{I} and \mathrm{III}; hence D_{n}(\bm{\delta})>0.

Thus, we can conclude that there exists a local minimizer of L_{n}(W) with the optimal convergence rate ||\widehat{W}_{n}-W^{*}||=O_{p}(n^{-\frac{1}{2}}) [51].

VI-B Proof of Theorem 2

Firstly, for the sparsity property, it is sufficient to prove that \forall(i,j)\in\mathcal{A}_{c},

\begin{cases}\frac{\partial L_{n}(W)}{\partial W_{ij}}\big|_{W=\widehat{W}}<0&\text{ if }-\varepsilon_{n}<\widehat{W}_{ij}<0\\ \frac{\partial L_{n}(W)}{\partial W_{ij}}\big|_{W=\widehat{W}}>0&\text{ if }0<\widehat{W}_{ij}<\varepsilon_{n}\end{cases} (16)

with probability tending to 1, where \varepsilon_{n}=O(n^{-\frac{1}{2}}).

Then,

\frac{\partial L_{n}(\widehat{W})}{\partial W_{ij}}= -\frac{1}{n}X_{.i}^{T}(X_{.j}-XW_{.j}^{*})+\sum_{k=1}^{d}\frac{1}{n}X_{.i}^{T}X_{kj}(\widehat{W}_{ij}-W^{*}_{kj})+\lambda_{n}c_{ij}\,\mathrm{sgn}\{\widehat{W}_{ij}\}+2\widehat{W}_{ij}(\rho h(\widehat{W})+\alpha)(e^{\widehat{W}\circ\widehat{W}})^{T}_{ij},

where we can find

\frac{1}{n}X_{.i}^{T}(X_{.j}-XW_{.j}^{*})=O_{p}(n^{-\frac{1}{2}}),
\sum_{k=1}^{d}\frac{1}{n}X_{.i}^{T}X_{kj}(\widehat{W}_{ij}-W^{*}_{kj})=O_{p}(n^{-\frac{1}{2}}),
2\widehat{W}_{ij}(\rho h(\widehat{W})+\alpha)(e^{\widehat{W}\circ\widehat{W}})^{T}_{ij}\begin{cases}<0&\text{ if }-\varepsilon_{n}<\widehat{W}_{ij}<0\\ >0&\text{ if }0<\widehat{W}_{ij}<\varepsilon_{n}\end{cases}

Let b_{n,ij}=\lambda_{n}c_{ij}. Therefore, for \forall(i,j)\in\mathcal{A}_{c}, if \sqrt{n}b_{n,ij}\rightarrow\infty, then the sign of \frac{\partial L_{n}(\widehat{W})}{\partial W_{ij}} is dominated by \mathrm{sgn}\{\widehat{W}_{ij}\}.

Secondly, for the asymptotic normality property, we concentrate on the set \mathcal{A}_{i}, i=1,2,\ldots,d, and use W_{\mathcal{A}_{i}} to denote the corresponding sub adjacency matrix, X_{\mathcal{A}_{i}} the sub observation data of the corresponding variables, and C_{\mathcal{A}_{i}} the corresponding adaptive penalty weights.

Denote

\mathrm{I}(W)=\frac{1}{2n}||X-XW||^{2}_{2},
\mathrm{II}(W)=\lambda_{n}\sum_{i}\sum_{j}|c_{ij}W_{ij}|,
\mathrm{III}(W)=\frac{\rho}{2}|h(W)|^{2}+\alpha h(W).

Then,

\nabla_{\mathcal{A}_{i}}L_{n}(\widehat{W}_{\mathcal{A}_{i}})=\nabla_{\mathcal{A}_{i}}\mathrm{I}(\widehat{W}_{\mathcal{A}_{i}})+\nabla_{\mathcal{A}_{i}}\mathrm{II}(\widehat{W}_{\mathcal{A}_{i}})+\nabla_{\mathcal{A}_{i}}\mathrm{III}(\widehat{W}_{\mathcal{A}_{i}}).

By Taylor expansion at \widehat{W}_{\mathcal{A}_{i}}=W^{*}_{\mathcal{A}_{i}}, we have

\nabla_{\mathcal{A}_{i}}\mathrm{I}(\widehat{W}_{\mathcal{A}_{i}})= \nabla_{\mathcal{A}_{i}}\mathrm{I}(W^{*}_{\mathcal{A}_{i}})+\left[\nabla^{2}_{\mathcal{A}_{i}}\mathrm{I}(W^{*}_{\mathcal{A}_{i}})\right](\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})
= \frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}}),
\nabla_{\mathcal{A}_{i}}\mathrm{II}(\widehat{W}_{\mathcal{A}_{i}})= \lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(\widehat{W}_{\mathcal{A}_{i}})
= \lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(W^{*}_{\mathcal{A}_{i}})+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1),
\nabla_{\mathcal{A}_{i}}\mathrm{III}(\widehat{W}_{\mathcal{A}_{i}})= (\rho|h(\widehat{W}_{\mathcal{A}_{i}})|+\alpha)\nabla h(\widehat{W}_{\mathcal{A}_{i}})
= (\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1).

Therefore, we now have

\nabla_{\mathcal{A}_{i}}L_{n}(\widehat{W}_{\mathcal{A}_{i}})= \frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})+\lambda_{n}C_{\mathcal{A}_{i}}\mathrm{sgn}(W^{*}_{\mathcal{A}_{i}})+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1)+(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})o_{p}(1)
= 0.

From Theorem 1, we have a_{n,ij}=o_{p}(n^{-\frac{1}{2}}), \forall(i,j)\in\mathcal{A}, and ||\widehat{W}_{\mathcal{A}}-W^{*}_{\mathcal{A}}||=O_{p}(n^{-\frac{1}{2}}); then,

\frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})-\frac{1}{n}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})+o_{p}(n^{-\frac{1}{2}})=0
\Rightarrow\sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-W^{*}_{\mathcal{A}_{i}})= \left(\frac{1}{n}X^{T}_{\mathcal{A}_{i}}X_{\mathcal{A}_{i}}\right)^{-1}\times\left[\frac{1}{\sqrt{n}}X^{T}_{\mathcal{A}_{i}}(X_{\mathcal{A}_{i}}-X_{\mathcal{A}_{i}}W^{*}_{\mathcal{A}_{i}})-o_{p}(1)\right].

Therefore, under condition 2), by the Lindeberg-Feller central limit theorem in the linear regression case, since our design satisfies the following mild condition [52],

\max_{1\leq k\leq n}|(X_{\mathcal{A}_{i}})_{k}|=o(n^{1/2}),

we have, \forall i\in\{1,2,\ldots,d\}, \sqrt{n}(\widehat{W}_{\mathcal{A}_{i}}-{W^{*}}_{\mathcal{A}_{i}})\xrightarrow{d}N(0,\sigma^{2}D_{i0}^{-1}).