This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks

Sehwan Kim and Faming Liang Correspondence author: Faming Liang, email: fmliang@purdue.edu. Department of Statistics, Ewha Womans University, Seoul 03760, Republic of Korea. Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.
Abstract

Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep neural networks are used to model the treatment and control effect functions, while an additional neural network is employed to estimate their parameters. The universal approximation capability of deep neural networks ensures the broad applicability of this method. Numerical results highlight the superior performance of the proposed Double-NN method compared to the conformal quantile regression (CQR) method in individual treatment effect estimation. From the perspective of statistical inference, this work advances the theory and methodology for statistical inference of large models. Specifically, it is theoretically proven that the proposed method permits the model size to increase with the sample size nn at a rate of O(nζ)O(n^{\zeta}) for some 0ζ<10\leq\zeta<1, while still maintaining proper quantification of uncertainty in the model parameters. This result marks a significant improvement compared to the range 0ζ<120\leq\zeta<\frac{1}{2} required by the classical central limit theorem. Furthermore, this work provides a rigorous framework for quantifying the uncertainty of deep neural networks under the neural scaling law, representing a substantial contribution to the statistical understanding of large-scale neural network models.

Keywords: Causal Inference, Deep Learning, Fiducial Inference, Stochastic Gradient MCMC, Uncertainty Quantification

1 Introduction

Causal inference is a fundamental problem in many disciplines such as medicine, econometrics, and social science. Formally, let {(y1,𝒙1,t1),(y2,𝒙2,t2),,(yn,𝒙n,tn)}\{(y_{1},{\boldsymbol{x}}_{1},t_{1}),(y_{2},{\boldsymbol{x}}_{2},t_{2}),\ldots,(y_{n},{\boldsymbol{x}}_{n},t_{n})\} denote a set of observations drawn from the following data-generating equations:

yi=c(𝒙i)+τ(𝒙i)ti+σzi,i=1,2,,n,y_{i}=c({\boldsymbol{x}}_{i})+\tau({\boldsymbol{x}}_{i})t_{i}+\sigma z_{i},\quad i=1,2,\ldots,n, (1)

where 𝒙id{\boldsymbol{x}}_{i}\in\mathbb{R}^{d} represents a vector of covariates of subject ii, ti{0,1}t_{i}\in\{0,1\} represents the treatment assignment to subject ii; c()c(\cdot) represents the expected outcome of subject ii if assigned to the control group (with ti=0)t_{i}=0), and τ(𝒙i)\tau({\boldsymbol{x}}_{i}) is the expected treatment effect of subject ii if assigned to the treatment group (with ti=1t_{i}=1); σ>0\sigma>0 is the standard deviation, and ziz_{i} represent a standardized random error that is not necessarily Gaussian. Under the potential outcome framework (Rubin, 1974), each individual receives only one assignment of the treatment with ti=0t_{i}=0 or 1, but not both. The goal of causal inference is to make inference for the average treatment effect (ATE) or individual treatment effect (ITE).

The ATE is defined as

τ0=𝔼(τ(𝒙))=𝒳τ(𝒙)𝑑F(𝒙),\tau_{0}=\mathbb{E}(\tau({\boldsymbol{x}}))=\int_{\mathcal{X}}\tau({\boldsymbol{x}})dF({\boldsymbol{x}}), (2)

where 𝒳\mathcal{X} denotes the sample space of 𝒙{\boldsymbol{x}}, and F(𝒙)F({\boldsymbol{x}}) denotes the cumulative distribution function of 𝒙{\boldsymbol{x}}. To estimate ATE, a variety of methods, including outcome regression, augmented/inverse probability weighting (AIPW/IPW) and matching, have been developed. See Imbens (2004) and Rosenbaum (2002) for overviews.

The ITE is often defined as the conditional average treatment effect (CATE):

τ(𝒙)=𝔼(Y|T=1,𝒙)𝔼(Y|T=0,𝒙),\tau({\boldsymbol{x}})=\mathbb{E}(Y|T=1,{\boldsymbol{x}})-\mathbb{E}(Y|T=0,{\boldsymbol{x}}), (3)

see e.g., Shalit et al. (2017) and Lu et al. (2018). Recently, Lei and Candès (2021) proposed to make predictive inference of the ITE by quantifying the uncertainty of

τ~i:=Y(T=1,𝒙i)Y(T=0,𝒙i):=Yi(1)Yi(0),\tilde{\tau}_{i}:=Y(T=1,{\boldsymbol{x}}_{i})-Y(T=0,{\boldsymbol{x}}_{i}):=Y_{i}(1)-Y_{i}(0), (4)

where Yi(ti)Y_{i}(t_{i}) denotes the potential outcome of subject ii with treatment assignment ti{0,1}t_{i}\in\{0,1\}. Henceforth, we will call τ~i\tilde{\tau}_{i} the predictive ITE.

It is known that ATE and ITE are identifiable if the conditions ‘strong ignorability’ and ‘overlapping’ are satisfied. The former means that, after accounting for observed covariates, the treatment assignment is independent of potential outcomes; and the latter ensures that every subject in the study has a positive probability of receiving either assignment, allowing for meaningful comparisons between treatment and control groups. Mathematically, the two conditions can be expressed as:

(i)strong ignorability:{Y(1),Y(0)}T|𝒙;(ii)overlapping: 0<P(T=1|𝒙)<1,𝒙𝒳,(i)\ \mbox{\it strong ignorability:}\ \{Y(1),Y(0)\}\rotatebox[origin={c}]{90.0}{$\models$}T|{\boldsymbol{x}};\ \ (ii)\ \mbox{\it overlapping:}\ 0<P(T=1|{\boldsymbol{x}})<1,\ \ \forall{\boldsymbol{x}}\in\mathcal{X},

where T{0,1}T\in\{0,1\} represents the treatment assignment variable, and \models denotes conditional independence. Together, they ensure that the causal effect can be correctly estimated without bias. See e.g. Guan and Yang (2019) for more discussions on this issue.

However, even under these assumptions, accurate inference for ATE and ITE can still be challenging. Specifically, the inference task can be complicated by unknown nonlinear forms of c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}). To address these issues, some authors have proposed to approximate them using a machine learning model, such as random forest (RF) (Breiman, 2001), Bayesian additive regression trees (BART) (Chipman et al., 2010), and neural networks. Refer to e.g., Foster et al. (2011), Hill (2011), Shalit et al. (2017), Wager and Athey (2018), and Hahn et al. (2020) for the details. Unfortunately, these methods often yield point estimates for the ATE and ITE, while failing to correctly quantifying their uncertainty due to the complexity of the machine learning models. Quite recently, Lei and Candès (2021) proposed to quantify the uncertainty of the predictive ITE using the conformal inference method (Vovk et al., 2005; Shafer and Vovk, 2008). This method provides coverage-guaranteed confidence intervals for the predictive ITE, but the intervals may become overly wide when the machine learning model is not consistently estimated. In short, while machine learning models, particularly neural networks, can effectively model complex, nonlinear functions such as c()c(\cdot) and τ()\tau(\cdot) for causal inference, performing accurate uncertainty quantification with these models remains a significant challenge. This is because these models typically have a complex functional form and involve a large number of parameters.

In this paper, we propose to conduct causal inference using an extended fiducial inference (EFI) method (Liang et al., 2024), with the goal of addressing the uncertainty quantification issue associated with treatment effect estimation. EFI provides an innovative framework for inferring model uncertainty based solely on observed data, aligning with the goal of fiducial inference (Fisher, 1935; Hannig, 2009). Specifically, it aims to solve the data-generating equations by explicitly imputing the unobserved random errors and approximating the model parameters from the observations and imputed random errors using a neural network; it then infers the uncertainty of the model parameters based on the learned neural network function and the imputed random errors (see Section 2 for a brief review). To make the EFI method feasible for causal effect estimation with accurate uncertainty quantification, we extend the method in two key aspects:

  • (i)

    We approximate each of the unknown functions, c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}), by a deep neural network (DNN) model. The DNN possesses universal approximation capability (Hornik et al., 1989; Hornik, 1991; Kidger and Lyons, 2020), meaning it can approximate any continuous function to an arbitrary degree of accuracy, provided it is sufficiently wide and deep. This property makes the proposed method applicable to a wide range of data-generating processes.

  • (ii)

    We theoretically prove that the dimensions (i.e., the number of parameters) of the DNN models used to approximate c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}) are allowed to increase with the sample size nn at a rate of O(nζ)O(n^{\zeta}) for some 0<ζ<10<\zeta<1, while the uncertainty of the DNN models can still be correctly quantified. That is, we are able to correctly quantify the uncertainty of the causal effect although it has to be approximated using large models.

In this paper, we regard a model as ‘large’ if its dimension increases with nn at a rate of 1/2ζ<11/2\leq\zeta<1. We note that part (ii) represents a significant theoretical innovation in statistical inference for large models. In the literature on this area, most efforts have focused on linear models, featuring techniques such as desparsified Lasso (Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014), post-selection inference (Lee et al., 2016), and Markov neighborhood regression (Liang et al., 2022a). For nonlinear models, the research landscape appears to be more scattered. Portnoy (1986, 1988) showed that for independently and identically distributed (i.i.d) random vectors with the dimension pp increasing with the sample size nn, the central limit theorem (CLT) holds if p=O(nζ)p=O(n^{\zeta}) for some 0ζ<1/20\leq\zeta<1/2. It is worth noting that Bayesian methods, despite being sampling-based, do not permit the dimension of the true model to increase with nn at a higher rate. For example, even in the case of generalized linear models, to ensure the posterior consistency, the dimension of the true model is only allowed to increase with nn at a rate 0ζ<1/40\leq\zeta<1/4 (see Theorem 2 and Remark 2 of Jiang (2007)). Under its current theoretical framework developed by Liang et al. (2024), EFI can only be applied to make inference for the models whose dimension is fixed or increases with nn at a very low rate. This paper extends the theoretical framework of EFI further, establishing its applicability for statistical inference of large models.

It is worth noting that a DNN model with size p=O(nζ)p=O(n^{\zeta}), where ζ\zeta is close to (but less than) 1, has been shown to be sufficiently large for approximating many data generation processes. This is supported by the theory established in Sun et al. (2022) and Farrell et al. (2021). In Sun et al. (2022), it is shown that, as nn\to\infty, a sparse DNN model of this size can provide accurate approximations for multiple classes of functions, such as bounded α\alpha-Hölder smooth functions (Schmidt-Hieber, 2020), piecewise smooth functions with fixed input dimensions (Petersen and Voigtlaender, 2018), and functions representable by an affine system (Bolcskei et al., 2019). Similar results have also been obtained in Farrell et al. (2021), where it is shown that a multi-layer perceptron (MLP) with this model size and the ReLU activation function can provide an accurate approximation to the functions that lie in a Sobolev ball with certain smoothness. The approximation capability of DNNs of this size has also been empirically validated by Hestness et al. (2017), where a neural scaling law of p=O(nζ)p=O(n^{\zeta}) with 0.5ζ<10.5\leq\zeta<1 was identified through extensive studies across various model architectures in machine translation, language modeling, image processing, and speech recognition.

To highlight the strength of EFI in uncertainty quantification and to facilitate comparison with the conformal inference method, this study focuses on inference for predictive ITEs, although the proposed method can also be extended to ATE and CATE. Our numerical results demonstrate the superiority of the proposed method over the conformal inference method.

The remaining part of this paper is organized as follows. Section 2 provides a brief review of the EFI method. Section 3 extends EFI to statistical inference for large statistical models. Section 4 provides an illustrative example for EFI. Section 5 applies the proposed method to statistical inference for predictive ITEs, with both simulated and real data examples. Section 6 concludes the paper with a brief discussion.

2 A Brief Review of the EFI Method

While fiducial inference was widely considered as a big blunder by R.A. Fisher, the goal he initially set —inferring the uncertainty of model parameters on the basis of observations — has been continually pursued by many statisticians, see e.g. Zabell (1992), Hannig (2009), Hannig et al. (2016), Murph et al. (2022), and Martin (2023). To this end, Liang et al. (2024) developed the EFI method based on the fundamental concept of structural inference (Fraser, 1966, 1968). Consider a regression model:

Y=f(𝑿,Z,𝜽),Y=f({\boldsymbol{X}},Z,{\boldsymbol{\theta}}), (5)

where YY\in\mathbb{R} and 𝑿d{\boldsymbol{X}}\in\mathbb{R}^{d} represent the response and explanatory variables, respectively; 𝜽p{\boldsymbol{\theta}}\in\mathbb{R}^{p} represents the vector of parameters; and ZZ\in\mathbb{R} represents a scaled random error following a known distribution π0()\pi_{0}(\cdot). For the model (1), the treatment assignment TT should be included as a part of 𝑿{\boldsymbol{X}}.

Suppose that a random sample of size nn has been collected from the model, denoted by {(y1,𝒙1),(y2,𝒙2),,(yn,𝒙n)}\{(y_{1},{\boldsymbol{x}}_{1}),(y_{2},{\boldsymbol{x}}_{2}),\ldots,(y_{n},{\boldsymbol{x}}_{n})\}. In the point of view of structural inference (Fraser, 1966, 1968), they can be expressed in the data generating equations as follow:

yi=f(𝒙i,zi,𝜽),i=1,2,,n.y_{i}=f({\boldsymbol{x}}_{i},z_{i},{\boldsymbol{\theta}}),\quad i=1,2,\ldots,n. (6)

This system of equations consists of n+pn+p unknowns, namely, {𝜽,z1,z2,,zn}\{{\boldsymbol{\theta}},z_{1},z_{2},\ldots,z_{n}\}, while there are only nn equations. Therefore, the values of 𝜽{\boldsymbol{\theta}} cannot be uniquely determined by the data-generating equations, and this lack of uniqueness of unknowns introduces uncertainty in 𝜽{\boldsymbol{\theta}}.

Let 𝒁n={z1,z2,,zn}{\boldsymbol{Z}}_{n}=\{z_{1},z_{2},\ldots,z_{n}\} denote the unobservable random errors, which are also called latent variables in EFI. Let G()G(\cdot) denote an inverse function/mapping for the parameter 𝜽{\boldsymbol{\theta}}, i.e.,

𝜽=G(𝒀n,𝑿n,𝒁n).{\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}). (7)

It is worth noting that the inverse function is generally non-unique. For example, it can be constructed by solving any pp equations in (6) for 𝜽{\boldsymbol{\theta}}. As noted by Liang et al. (2024), this non-uniqueness of inverse function mirrors the flexibility of frequentist methods, where different estimators of 𝜽{\boldsymbol{\theta}} can be designed for different purposes.

As a general method, Liang et al. (2024) proposed to approximate the inverse function G()G(\cdot) using a sparse DNN, see Figure 1 for illustration. They also introduced an adaptive stochastic gradient Langevin dynamics (SGLD) algorithm, which facilitates the simultaneous training of the sparse DNN and simulation of the latent variables 𝒛{\boldsymbol{z}}. This is briefly described as follows.

Refer to caption
Figure 1: Illustration of the EFI network (Liang et al., 2024), where the orange nodes and orange links form a DNN (parameterized by the weights 𝒘n{\boldsymbol{w}}_{n}, with the subscript nn indicating its dependence on the training sample size nn), the green node represents latent variable to impute, and the black lines represent deterministic functions.

Let 𝜽^i:=g^(yi,𝒙i,zi,𝒘n)\hat{{\boldsymbol{\theta}}}_{i}:=\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}) denote the DNN prediction function parameterized by the weights 𝒘n{\boldsymbol{w}}_{n} in the EFI network, and let

𝜽¯:=1ni=1n𝜽^i=1ni=1ng^(yi,𝒙i,zi,𝒘n),\bar{{\boldsymbol{\theta}}}:=\frac{1}{n}\sum_{i=1}^{n}\hat{{\boldsymbol{\theta}}}_{i}=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}), (8)

which serves as an estimator of G()G(\cdot). The EFI network has two output nodes defined, respectively, by

ei1:=𝜽^i𝜽¯2,ei2:=d(yi,y~i):=d(yi,𝒙i,zi,𝜽¯),e_{i1}:=\|\hat{{\boldsymbol{\theta}}}_{i}-\bar{{\boldsymbol{\theta}}}\|^{2},\quad e_{i2}:=d(y_{i},\tilde{y}_{i}):=d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}}), (9)

where y~i=f(𝒙i,zi,𝜽¯)\tilde{y}_{i}=f({\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}}), f()f(\cdot) is as specified in (6), and d()d(\cdot) is a function that measures the difference between yiy_{i} and y~i\tilde{y}_{i}. For example, for a normal linear/nonlinear regression, it can be defined as

d(yi,𝒙i,zi,𝜽¯)=yif(𝒙i,zi,𝜽¯)2.d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})=\|y_{i}-f({\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})\|^{2}. (10)

For logistic regression, it is defined as a squared ReLU function, see Liang et al. (2024) for the details. Furthermore, EFI defines an energy function as follows:

Un(𝒀n,𝑿n,𝒁n,𝒘n)=i=1nd(yi,𝒙i,zi,𝜽¯)+ηi=1n𝜽^i𝜽¯2,U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})=\sum_{i=1}^{n}d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})+\eta\sum_{i=1}^{n}\|\hat{{\boldsymbol{\theta}}}_{i}-\bar{{\boldsymbol{\theta}}}\|^{2}, (11)

for some regularization coefficient η>0\eta>0, where first term measures the fitting error of the model as implied by equation (10), and the second term regularizes the variation of 𝜽^i\hat{{\boldsymbol{\theta}}}_{i}, ensuring that the neural network forms a proper estimator of the inverse function. Given this energy function, we define the likelihood function as

πϵ(𝒀n|𝑿n,𝒁n,𝒘n)eUn(𝒀n,𝑿n,𝒁n,𝒘n)/ϵ,\pi_{\epsilon}({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\propto e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon}, (12)

for some constant ϵ\epsilon close to 0. As discussed in Liang et al. (2024), the choice of η\eta does not have much affect on the performance of EFI as long as ϵ\epsilon is sufficiently small.

Subsequently, the posterior of 𝒘n{\boldsymbol{w}}_{n} is given by

πϵ(𝒘n|𝑿n,𝒀n,𝒁n)π(𝒘n)eUn(𝒀n,𝑿n,𝒁n,𝒘n)/ϵ,\begin{split}\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})&\propto\pi({\boldsymbol{w}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon},\end{split} (13)

where π(𝒘n)\pi({\boldsymbol{w}}_{n}) denotes the prior of 𝒘n{\boldsymbol{w}}_{n}; and the predictive distribution of 𝒁n{\boldsymbol{Z}}_{n} is given by

πϵ(𝒁n|𝑿n,𝒀n,𝒘n)π0n(𝒁n)eUn(𝒀n,𝑿n,𝒁n,𝒘n)/ϵ.\begin{split}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})&\propto\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon}.\end{split} (14)

In EFI, 𝒘n{\boldsymbol{w}}_{n} is estimated through maximizing the posterior πϵ(𝒘n|𝑿n,𝒀n)\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}) given the observations {𝑿n,𝒀n}\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}\}. By the Bayesian version of Fisher’s identity (Song et al., 2020), the gradient equation 𝒘nlogπϵ(𝒘n|𝑿n,𝒀n)\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}) =0=0 can be re-expressed as

𝒘nlogπϵ(𝒘n|𝑿n,𝒀n)=𝒘nlogπϵ(𝒘n|𝑿n,𝒀n,𝒁n)πϵ(𝒁n|𝑿n,𝒀n,𝒘n)𝑑𝒘n=0,\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})=\int\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})d{\boldsymbol{w}}_{n}=0, (15)

which can be solved using an adaptive stochastic gradient MCMC algorithm (Liang et al., 2022b; Deng et al., 2019). The algorithm works by iterating between two steps:

  • (a)

    Latent variable sampling: draw 𝒁n(k+1){\boldsymbol{Z}}_{n}^{(k+1)} according to a Markov transition kernel that leaves πϵ(𝒛|𝑿n,𝒀n,𝒘n(k))\pi_{\epsilon}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)}) to be invariant;

  • (b)

    Parameter updating: update 𝒘n(k){\boldsymbol{w}}_{n}^{(k)} toward the maximum of logπϵ(𝒘n|𝑿n,𝒀n,𝒁n)\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}) using stochastic approximation (Robbins and Monro, 1951), based on the sample 𝒁n(k+1){\boldsymbol{Z}}_{n}^{(k+1)}.

See Algorithm 1 for the pseudo-code. This algorithm is termed “adaptive” because the transition kernel in the latent variable sampling step changes with the working parameter estimate of 𝒘n{\boldsymbol{w}}_{n}. The parameter updating step can be implemented using mini-batch SGD, and the latent variable sampling step can be executed in parallel for each observation (yi,𝒙i)(y_{i},{\boldsymbol{x}}_{i}). Hence, the algorithm is scalable with respect to large datasets.

(i) (Initialization) Initialize 𝒘n(0){\boldsymbol{w}}_{n}^{(0)}, 𝒁n(0){\boldsymbol{Z}}_{n}^{(0)}, MM (the number of fiducial samples to collect), and 𝒦\mathcal{K} (burn-in iterations).
for k=1,2,…,𝒦+M\mathcal{K}+M do
      
      (ii) (Latent variable sampling) Given 𝒘n(k){\boldsymbol{w}}_{n}^{(k)}, simulate 𝒁n(k+1){\boldsymbol{Z}}_{n}^{(k+1)} by the SGHMC algorithm (Chen et al., 2014):
𝑽n(k+1)=(1ϖ)𝑽n(k)+υk+1^𝒁nlogπϵ(𝒁n(k)|𝑿n,𝒀n,𝒘n(k))+2ϖτ~υk+1𝒆(k+1),𝒁n(k+1)=𝒁n(k)+𝑽n(k+1),\begin{split}{\boldsymbol{V}}_{n}^{(k+1)}&=(1-\varpi){\boldsymbol{V}}_{n}^{(k)}+\upsilon_{k+1}\widehat{\nabla}_{{\boldsymbol{Z}}_{n}}\log\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{(k)}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)})+\sqrt{2\varpi\tilde{\tau}\upsilon_{k+1}}{\boldsymbol{e}}^{(k+1)},\\ {\boldsymbol{Z}}_{n}^{(k+1)}&={\boldsymbol{Z}}_{n}^{(k)}+{\boldsymbol{V}}_{n}^{(k+1)},\end{split}
where ϖ\varpi is the moment parameter, υk+1\upsilon_{k+1} is the learning rate, τ~=1\tilde{\tau}=1 is the temperature, and 𝒆(k+1)N(0,Id𝒛){\boldsymbol{e}}^{(k+1)}\sim N(0,I_{d_{{\boldsymbol{z}}}}).
      (iii) (Parameter updating) Draw a minibatch {(y1,𝒙1,z1(k)),,(ym,𝒙m,zm(k))}\{(y_{1},{\boldsymbol{x}}_{1},z_{1}^{(k)}),\ldots,(y_{m},{\boldsymbol{x}}_{m},z_{m}^{(k)})\} and update the network weights by the SGD algorithm:
𝒘n(k+1)=𝒘n(k)+γk+1[nmi=1m𝒘nlogπϵ(yi|𝒙i,zi(k),𝒘n(k))+𝒘nlogπ(𝒘n(k))],{\boldsymbol{w}}_{n}^{(k+1)}={\boldsymbol{w}}_{n}^{(k)}+\gamma_{k+1}\left[\frac{n}{m}\sum_{i=1}^{m}\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}(y_{i}|{\boldsymbol{x}}_{i},z_{i}^{(k)},{\boldsymbol{w}}_{n}^{(k)})+\nabla_{{\boldsymbol{w}}_{n}}\log\pi({\boldsymbol{w}}_{n}^{(k)})\right], (16)
where γk+1\gamma_{k+1} is the step size, and logπϵ(yi|𝒙i,zi(k),𝒘n(k))\log\pi_{\epsilon}(y_{i}|{\boldsymbol{x}}_{i},z_{i}^{(k)},{\boldsymbol{w}}_{n}^{(k)}) can be appropriately defined according to (12).
      (iv) (Fiducial sample collection) If k+1>𝒦k+1>\mathcal{K}, calculate 𝜽^i(k+1)=g^(yi,𝒙i,zi(k+1),𝒘n(k+1))\hat{{\boldsymbol{\theta}}}_{i}^{(k+1)}=\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i}^{(k+1)},{\boldsymbol{w}}_{n}^{(k+1)}) for each i{1,2,,n}i\in\{1,2,\ldots,n\} and average them to get a fiducial 𝜽¯\bar{{\boldsymbol{\theta}}}-sample as calculated in (8).
end for
(v) (Statistical Inference) Conducting statistical inference for the model based on the collected fiducial samples.
Algorithm 1 Adaptive SGHMC for Extended Fiducial Inference

Under mild conditions for adaptive stochastic gradient MCMC algorithms (Deng et al., 2019; Liang et al., 2022b), it was shown in Liang et al. (2024) that

𝒘n(k)𝒘np0,as k,\|{\boldsymbol{w}}_{n}^{(k)}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $k\to\infty$}, (17)

where 𝒘n{\boldsymbol{w}}_{n}^{*} denotes a solution to equation (15) and p\stackrel{{\scriptstyle p}}{{\to}} denotes convergence in probability, and that

𝒁n(k)dπϵ(𝒁n|𝑿n,𝒀n,𝒘n),as k,{\boldsymbol{Z}}_{n}^{(k)}\stackrel{{\scriptstyle d}}{{\rightsquigarrow}}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}),\quad\mbox{as $k\to\infty$}, (18)

in 2-Wasserstein distance, where d\stackrel{{\scriptstyle d}}{{\rightsquigarrow}} denotes weak convergence.

To study the limit of (18) as ϵ\epsilon decays to 0, i.e.,

pn(𝒛|𝒀n,𝑿n,𝒘n)=limϵ0πϵ(𝒁n|𝑿n,𝒀n,𝒘n),p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*})=\lim_{\epsilon\downarrow 0}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}),

where pn(𝒛|𝒀n,𝑿n,𝒘n)p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*}) is referred to as the extended fiducial density (EFD) of 𝒁n{\boldsymbol{Z}}_{n} learned in EFI, it is necessary for 𝒘n{\boldsymbol{w}}_{n}^{*} to be a consistent estimator of 𝒘{\boldsymbol{w}}_{*}, the parameters of the underlying true EFI network. To ensure this consistency, Liang et al. (2024) impose some conditions on the structure of the DNN and the prior distribution π(𝒘n)\pi({\boldsymbol{w}}_{n}). Specifically, they assume that 𝒘n{\boldsymbol{w}}_{n} takes values in a compact space 𝒲\mathcal{W}; π(𝒘n)\pi({\boldsymbol{w}}_{n}) is a truncated mixture Gaussian distribution on 𝒲\mathcal{W}; and the DNN structure satisfies certain constraints given in Sun et al. (2022), e.g., the width of the output layer (i.e., the dimension of 𝜽{\boldsymbol{\theta}}) is fixed or grows very slowly with nn. They then justify the consistency of 𝒘n{\boldsymbol{w}}_{n}^{*} based on the sparse deep learning theory developed in Sun et al. (2022). The consistency of 𝒘n{\boldsymbol{w}}_{n}^{*} further implies that

G(𝒀n,𝑿n,𝒁n)=1ni=1ng^(yi,𝒙i,zi,𝒘n),G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}^{*}),

serves as a consistent estimator for the inverse function/mapping 𝜽=G(𝒀n,𝑿n,𝒁n){\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}).

By Theorem 3.2 in Liang et al. (2024), for the target model (1), which is a noise-additive model, the EFD of 𝒁n{\boldsymbol{Z}}_{n} is invariant to the choice of the inverse function, provided that d()d(\cdot) is specified as in (10) in defining the energy function. Further, by Lemma 4.2 in Liang et al. (2024), pn(𝒛|𝒀n,𝑿m,𝒘n)p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{m},{\boldsymbol{w}}_{n}^{*}) is given by

dPn(𝒛|𝑿n,𝒀n,𝒘n)dν=π0n(𝒛)𝒵nπ0n(𝒛)𝑑ν,\frac{dP_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})}{d\nu}=\frac{\pi_{0}^{\otimes n}({\boldsymbol{z}})}{\int_{\mathcal{Z}_{n}}\pi_{0}^{\otimes n}({\boldsymbol{z}})d\nu}, (19)

where Pn(𝒛|𝑿n,𝒀n,𝒘n)P_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}) represents the cumulative distribution function (CDF) corresponding to pn(𝒛|𝑿n,𝒀n,𝒘n)p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}); 𝒵n={𝒛:Un(𝒀n,𝑿n,𝒁n,𝒘n)=0}\mathcal{Z}_{n}=\{{\boldsymbol{z}}:U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{*})=0\} represents the zero-energy set, which forms a manifold in the space n\mathbb{R}^{n}; and ν\nu is the sum of intrinsic measures on the pp-dimensional manifold in 𝒵n\mathcal{Z}_{n}. That is, under the consistency of 𝒘n{\boldsymbol{w}}_{n}^{*}, pn(𝒛|𝑿n,𝒀n,𝒘n)p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}) is reduced to a truncated density function of π0n(𝒛)\pi_{0}^{\otimes n}({\boldsymbol{z}}) on the manifold 𝒵n\mathcal{Z}_{n}, while 𝒵n\mathcal{Z}_{n} itself is also invariant to the choice of the inverse function as shown in Lemma 3.1 of Liang et al. (2024). In other words, for the model (1), the EFD of 𝒁n{\boldsymbol{Z}}_{n} is asymptotically invariant to the inverse function we learned given its consistency.

Let Θ:={𝜽p:𝜽=G(𝒀n,𝑿n,𝒛),𝒛𝒵n}\Theta:=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p}:{\boldsymbol{\theta}}=G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{z}}),{\boldsymbol{z}}\in\mathcal{Z}_{n}\} denote the parameter space of the target model, which represents the set of all possible values of 𝜽{\boldsymbol{\theta}} that G()G^{*}(\cdot) takes when 𝒛{\boldsymbol{z}} runs over 𝒵n\mathcal{Z}_{n}. Then, for any function b(𝜽)b({\boldsymbol{\theta}}) of interest, its EFD μn(|𝒀n,𝑿n)\mu_{n}^{*}(\cdot|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n}) associated with G()G^{*}(\cdot) is given by

μn(B|𝒀n,𝑿n)=𝒵n(B)𝑑Pn(𝒛|𝒀n,𝑿n,𝒘n),for any measurable set BΘ,\begin{split}\mu_{n}^{*}(B|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n})&=\int_{\mathcal{Z}_{n}(B)}dP_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*}),\quad\mbox{for any measurable set $B\subset\Theta$},\end{split} (20)

where 𝒵n(B)={𝒛𝒵n:b(G(𝒀n,𝑿n,𝒛))B}\mathcal{Z}_{n}(B)=\{{\boldsymbol{z}}\in\mathcal{Z}_{n}:b(G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{z}}))\in B\}. The EFD provides an uncertainty measure for b(𝜽)b({\boldsymbol{\theta}}). Practically, the EFD of b(𝜽)b({\boldsymbol{\theta}}) can be constructed based on the samples {b(𝜽¯1),b(𝜽¯2),,b(𝜽¯M)}\{b(\bar{{\boldsymbol{\theta}}}_{1}),b(\bar{{\boldsymbol{\theta}}}_{2}),\ldots,b(\bar{{\boldsymbol{\theta}}}_{M})\}, where {𝜽¯1,𝜽¯2,,𝜽¯M}\{\bar{{\boldsymbol{\theta}}}_{1},\bar{{\boldsymbol{\theta}}}_{2},\ldots,\bar{{\boldsymbol{\theta}}}_{M}\} denotes the fiducial 𝜽¯\bar{{\boldsymbol{\theta}}}-samples collected at step (iv) of Algorithm 1.

Finally, we note that, as discussed in Liang et al. (2024), the invariance property of 𝒵n\mathcal{Z}_{n} is not crucial to the validity of EFI, although it does enhance the robustness of the inference. Additionally, for a neural network model, its parameters are only unique up to certain loss-invariant transformations, such as reordering hidden neurons within the same hidden layer or simultaneously altering the sign or scale of certain connection weights, see Sun et al. (2022) for discussions. Therefore, in EFI, the consistency of 𝒘n{\boldsymbol{w}}_{n}^{*} refers to its consistency with respect to one of the equivalent solutions to (15), while mathematically 𝒘n{\boldsymbol{w}}_{n}^{*} can still be treated as unique. Refer to Section §\S1.1 (of the supplement) for more discussions on this issue.

3 EFI for Large Models

In this section, we first establish the consistence of the inverse function/mapping learned in EFI for large models, and then discuss its application for uncertainty quantification of deep neural networks.

3.1 Consistency of Inverse Mapping Learned in EFI for Large Models

It is important to note that the sparse deep learning theory of Sun et al. (2022) is developed under the general constraint dim(𝒘n)=O(n1δ)dim({\boldsymbol{w}}_{n})=O(n^{1-\delta}) for some 0<δ<10<\delta<1, which restricts the dimension of the output layer of the DNN model to be fixed or grows very slowly with the sample size nn. Therefore, under its current theoretical framework, EFI can only be applied to the models for which the dimension is fixed or increases very slowly with nn.

To extend EFI to large models, where the dimension of 𝜽{\boldsymbol{\theta}} can grow with nn at a rate of O(nζ)O(n^{\zeta}), particularly for 1/2ζ<11/2\leq\zeta<1, we provide a new proof for the consistency of G(𝒀n,𝑿n,𝒁n)G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}) based on the theory of stochastic deep learning (Liang et al., 2022b). Specifically, we establish the following theorem, where the output layer width of the DNN in the EFI network is set to match the dimension of 𝜽{\boldsymbol{\theta}}. The proof is lengthy and provided in the supplement.

Theorem 3.1

Suppose Assumptions 1-6 hold (see the supplement), ϵ\epsilon is sufficiently small, and

l=1Hdln,\sum_{l=1}^{H}d_{l}\prec n, (21)

where dld_{l} denotes the width of layer ll, dH=dim(𝛉)d_{H}=dim({\boldsymbol{\theta}}), and HH denotes the depth of the DNN in the EFI network. Then G(𝐘n,𝐗n,𝐙n)=1ni=1ng^(yi,𝐱i,zi,𝐰n)G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}^{*}) constitutes a consistent estimator of the inverse function.

As implied by (21), we have dlnd_{l}\prec n holds for each layer l=1,2,,Hl=1,2,\ldots,H. We call such a neural network a narrow DNN. For narrow DNNs, by the existing theory, see e.g., Kidger and Lyons (2020), Park et al. (2020), and Kim et al. (2023), the universal approximation can be achieved with a minimum hidden layer width of max{d0+1,dH}\max\{d_{0}+1,d_{H}\}, where d0d_{0} and dHd_{H} represent the widths of the input and output layers, respectively. Hence, (21) implies that EFI can be applied to statistical inference for a large model of dimension

dim(𝜽)=dH=O(nζ),0ζ<1,dim({\boldsymbol{\theta}})=d_{H}=O(n^{\zeta}),\quad 0\leq\zeta<1,

under the narrow DNN setting with the depth H=O(nβ)H=O(n^{\beta}) for some 0<β<1ζ0<\beta<1-\zeta. Here, Without loss of generality, we assume d0dHd_{0}\preceq d_{H}. For such a DNN, the total dimension of 𝒘n{\boldsymbol{w}}_{n}:

dim(𝒘n)=i=1Hdi(di1+1)=O(n2ζ+β),dim({\boldsymbol{w}}_{n})=\sum_{i=1}^{H}d_{i}(d_{i-1}+1)=O(n^{2\zeta+\beta}),

can be much greater than nn, where ‘1’ represents the bias parameter of each neuron at the hidden and output layers. Specifically, we can have dim(𝒘n)ndim({\boldsymbol{w}}_{n})\succ n with appropriate choices of ζ\zeta and β\beta. However, leveraging the asymptotic equivalence between the DNN and an auxiliary stochastic neural network (StoNet) (Liang et al., 2022b), we can still prove that the resulting estimator of 𝜽{\boldsymbol{\theta}} is consistent, see the supplement for the detail.

Regarding this extension of the EFI method for statistical inference of large models, we have an additional remark:

Remark 1

In this paper, we impose a mixture Gaussian prior on 𝐰n{\boldsymbol{w}}_{n} to ensure the consistency of 𝐰n{\boldsymbol{w}}_{n}^{*} and, consequently, the consistency of the inverse mapping G(𝐘n,𝐗n,𝐙n)G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}). However, this Bayesian treatment of 𝐰n{\boldsymbol{w}}_{n} is not strictly necessary, although it introduces sparsity that improves the efficiency of EFI. For the narrow DNN, the consistency of the 𝐰n{\boldsymbol{w}}_{n} estimator can also be established under the frequentist framework by leveraging the asymptotic equivalence between the DNN and the auxiliary StoNet, using the same technique introduced in the supplement (see Section §\S1.2). In this narrow and deep setting, each of the regressions formed by the StoNet is low-dimensional (with dlnd_{l}\prec n), making the Bayesian treatment of 𝐰n{\boldsymbol{w}}_{n} unnecessary while still achieving a consistent estimator of 𝐰n{\boldsymbol{w}}_{n}.

3.2 Double-NN Method

Suppose a DNN is used for modeling the data, i.e., approximating the function f()f(\cdot) in (5). By Sun et al. (2022) and Farrell et al. (2021), a DNN of size O(nζ)O(n^{\zeta}) for some 0<ζ<10<\zeta<1 has been large enough for approximating many classes of functions. Therefore, EFI can be used for making inference for such a DNN model. In this case, EFI involves two neural networks, one is for modeling the data, which is called the ‘data modeling network’ and parameterized by 𝜽{\boldsymbol{\theta}}; and the other one is for approximating the inverse function, which is called the ‘inverse mapping network’ and parameterized by 𝒘n{\boldsymbol{w}}_{n}. Therefore, the proposed method is coined as ‘double-NN’. Note that during the EFI training process, only the parameters 𝒘n{\boldsymbol{w}}_{n} of the inverse mapping network are updated in equation (16) of Algorithm 1. The parameters of the data modeling network are subsequently updated in response to the adjustment of 𝒘n{\boldsymbol{w}}_{n}, based on the formula given in (8).

In our theoretical study for the double-NN method, we actually assume that the true data-generating model Y=f(𝑿,Z,𝜽)Y=f({\boldsymbol{X}},Z,{\boldsymbol{\theta}}) is a neural network, thereby omitting the approximation error of the data modeling network, based on its universal approximation capability. In practice, we have observed that the double-NN method is robust to this approximation error. Specifically, even when the true model is not a neural network, EFI can still recover the true random errors with high accuracy and achieve the zero-energy solution as nn\to\infty and ϵ0\epsilon\to 0. A further theoretical exploration of this phenomenon would be of interest.

As mentioned previously, for a neural network model, its parameters are only unique up to certain loss-invariant transformations. As the training sample size nn becomes large, we expect that the optimizers 𝜽^:=argmax𝜽πϵ(𝒁n|𝑿n,𝒀n,𝜽)\hat{{\boldsymbol{\theta}}}:=\arg\max_{{\boldsymbol{\theta}}}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{\theta}}) are all equivalent. Thus, in this paper, the consistency of 𝜽^\hat{{\boldsymbol{\theta}}} refers to its consistency with respect to one of the equivalent global optimizers, while mathematically 𝜽^\hat{{\boldsymbol{\theta}}} can still be treated as unique. A similar issue occurs to the parameters of the inverse mapping network, as discussed in Section §\S1.1 of the supplement.

4 An Illustrative Example for EFI

To illustrate how EFI works for statistical inference problems, we consider a linear regression example:

yi=τTi+μ+𝒙i𝜷+σzi,i=1,2,,n,y_{i}=\tau T_{i}+\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}}+\sigma z_{i},\quad i=1,2,\ldots,n, (22)

where Ti{0,1}T_{i}\in\{0,1\} is a binary variable indicating the treatment assignment, τ\tau is the treatment effect, 𝒙id{\boldsymbol{x}}_{i}\in\mathbb{R}^{d} are confounders/covariates, ziN(0,1)z_{i}\sim N(0,1) is the standardized random noise, and 𝜷d{\boldsymbol{\beta}}\in\mathbb{R}^{d} and σ+\sigma\in\mathbb{R}_{+} are unknown parameters. For this example, τ\tau represents the ATE as well as the CATE, due to its independence of the covariates 𝒙{\boldsymbol{x}}. In the simulation study, we set τ=1\tau=1, μ=1\mu=1, d=4d=4, and 𝜷=(1,1,1,1){\boldsymbol{\beta}}=(-1,1,-1,1)^{\top}; generate 𝒙iN(0,Id){\boldsymbol{x}}_{i}\sim N(0,I_{d}); and generate the treatment variable via a logistic regression:

P(Ti=1)=11+exp{ν𝝃𝒙i},P(T_{i}=1)=\frac{1}{1+\exp\{-\nu-{\boldsymbol{\xi}}^{\top}{\boldsymbol{x}}_{i}\}}, (23)

where ν=1\nu=1 and 𝝃=(1,1,1,1){\boldsymbol{\xi}}=(-1,1,-1,1)^{\top}. We consider three different cases with the sample size n=250n=250, 500 and 1000, respectively. For each case, we generate 100 datasets.

Statistical inference for the parameters in the model (22) can be made with EFI under its standard framework. Let 𝜽=(τ,μ,𝜷,logσ){\boldsymbol{\theta}}=(\tau,\mu,{\boldsymbol{\beta}}^{\top},\log\sigma)^{\top} be the parameter vector. EFI approximates the inverse function 𝜽=g(y,T,𝒙,z){\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z) by a DNN, for which (y,T,𝒙,z)(y,T,{\boldsymbol{x}},z) serves as input variables and 𝜽{\boldsymbol{\theta}} as output variables. The results are summarized in Table 1.

For comparison, a variety of methods, including Unadj (Imbens and Rubin, 2015), inverse probability weighting (IPW) (Rosenbaum, 1987), double-robust (DR) (Robins et al., 1994; Bang and Robins, 2005), and BART (Hill, 2011), have been applied to this example. These methods fall into distinct categories. The Unadj is straightforward, estimating the ATE by calculating the difference between the treatment and control groups, i.e., τ^=1nti=1ntYi(1)1nci=1ncYi(0)\hat{\tau}=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}Y_{i}(1)-\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}Y_{i}(0), where the effect of confounders is not adjusted. Both IPW and DR are widely used ATE estimation methods, which adjust the effect of confounders based on propensity scores. They both are implemented using the R package drgee (Zetterqvist and Sjölander, 2015). The BART employs Bayesian additive regression trees to learn the outcome function, which naturally accommodates heterogeneous treatment effects as well as nonlinearity of the outcome function. It is implemented using the R package bartcause (Dorie and Hill, 2020).

Table 1: Comparison of EFI with various ATE estimation methods, where “coverage” refers to the averaged coverage rate of τ\tau, “length” refers to the averaged width of confidence intervals, and the number in the parentheses refers to the standard deviation of the averaged width. The averages and standard deviations were calculated based on 100 datasets.
n=250n=250 n=500n=500 n=1000n=1000
Method coverage length coverage length coverage length
Unadj 0.95 1.161(0.066) 0.93 0.822(0.032) 0.97 0.424(0.017)
BART 0.99 0.857(0.070) 0.98 0.611(0.047) 0.96 0.428(0.024)
IPW 0.90 0.710(0.157) 0.92 0.560(0.141) 0.92 0.417(0.101)
DR 0.96 0.652(0.058) 0.93 0.465(0.033) 0.94 0.331(0.017)
EFI 0.95 0.647(0.033) 0.95 0.438(0.021) 0.95 0.338(0.012)

The comparison indicates that EFI performs very well for this standard ATE estimation problem. Specifically, EFI generates confidence intervals of nearly the same length as DR, but with more accurate coverage rates. This is remarkable, as DR has often been considered as the golden standard for ATE estimation and is consistent if either the outcome or propensity score models is correctly specified, and locally efficient if both are correctly specified. Furthermore, EFI produces much shorter confidence intervals compared to Unadj, IPW, and BART, while maintaining more accurate coverage rates.

We attribute the superior performance of EFI on this example to its fidelity in parameter estimation, an attractive property of EFI as discussed in Liang et al. (2024). As implied by (14), EFI essentially estimates 𝜽{\boldsymbol{\theta}} by maximizing the predictive likelihood function πϵ(𝒁n|𝑿n,𝒀n,𝜽)π0n(𝒁n)eUn(𝒀n,𝑿n,𝒁n,𝒘)/ϵ\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{\theta}})\propto\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}})/\epsilon}, which balances the likelihood of 𝒁n{\boldsymbol{Z}}_{n} and the model fitting errors coded in Un()U_{n}(\cdot). In contrast, the maximum likelihood estimation (MLE) method sets 𝜽^MLE=argmax𝜽π0n(𝒁n)\hat{{\boldsymbol{\theta}}}_{MLE}=\arg\max_{{\boldsymbol{\theta}}}\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n}), where 𝒁n{\boldsymbol{Z}}_{n} is expressed as a function of (𝒀n,𝑿n,𝜽)({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{\theta}}). In general, MLE is inclined to be influenced by the outliers and deviations of covariates especially when the sample size is not sufficiently large. It is important to note that the MLE serves as the core for all the IPW, DR and BART methods in estimating the outcome and propensity score models. For this reason, various adjustments for confounding and heterogeneous treatment effects have been developed in the literature.

Compared to the existing causal inference methods, EFI works as a solver for the data-generating equation (as ϵ0\epsilon\downarrow 0), providing a coherent way to address the confounding and heterogeneous treatment effects and resulting in faithful estimates for the model parameters and their uncertainty as well. This example illustrates the performance of EFI in ATE estimation when confounders are present, while the examples in the next section showcase the performance of EFI in dealing with heterogeneous treatment effects via DNN modeling. Extensive comparisons with BART and other nonparametric modeling methods are also presented.

In this example, we omit the estimation of the propensity score model. As discussed in Section 6, the proposed method can be extended by including an additional DNN to approximate the propensity score, enabling the use of inverse probability weighting for ATE estimation. However, the ATE estimation is not the focus of this work.

5 Causal Inference for Individual Treatment Effects

This section demonstrates how EFI can be used to perform statistical inference of the predictive ITE for the data-generating model (1). Let 𝜽c{\boldsymbol{\theta}}_{c} denote the vector of parameters for modeling the function c(𝒙)c({\boldsymbol{x}}), let 𝜽τ{\boldsymbol{\theta}}_{\tau} denote the vector of parameters for modeling the function τ(𝒙)\tau({\boldsymbol{x}}), and let 𝜽={𝜽c,𝜽τ,log(σ)}{\boldsymbol{\theta}}=\{{\boldsymbol{\theta}}_{c},{\boldsymbol{\theta}}_{\tau},\log(\sigma)\} denote the whole set of parameters for the model (1). We model the inverse function 𝜽=g(y,T,𝒙,z){\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z) by a DNN. Also, we can model each of the functions c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}) by a DNN if their functional forms are unknown. For convenience, we refer to the DNN for modeling c(𝒙)c({\boldsymbol{x}}) as ‘cc-network’ and that for modeling τ(𝒙)\tau({\boldsymbol{x}}) as ‘τ\tau-network’, and 𝜽c{\boldsymbol{\theta}}_{c} and 𝜽τ{\boldsymbol{\theta}}_{\tau} represent their weights, respectively. As mentioned previously, we can restrict the sizes of the cc-network and τ\tau-network to the order of O(nζ~)O(n^{\tilde{\zeta}}) for some 0<ζ~<10<\tilde{\zeta}<1.

Note that in solving the data generating equations (1), the proposed method involves two types of neural networks: one for modeling causal effects and the other for approximating the inverse function 𝜽=g(y,T,𝒙,z){\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z). While we still refer to the proposed method as ‘Double-NN’, it actually involves three DNNs.

5.1 ITE prediction intervals

Assume the training set consists of ntrainn_{train} subjects, and the test set consists of ntestn_{test} subjects. The subjects in the test set can be grouped into three categories: (i) {(𝒙i,0,Yi(1),Yiobs(0))\{({\boldsymbol{x}}_{i},0,Y_{i}(1),Y_{i}^{obs}(0)), ic}i\in\mathcal{I}_{c}\}, where the responses under the control are observed; (ii) {(𝒙i,1,Yiobs(1),Yi(0)):it}\{({\boldsymbol{x}}_{i},1,Y_{i}^{obs}(1),Y_{i}(0)):i\in\mathcal{I}_{t}\}, where the responses under the treatment are observed; and (iii) {(𝒙i,Ti,Yi(1),Yi(0)):im}\{({\boldsymbol{x}}_{i},T_{i},Y_{i}(1),Y_{i}(0)):i\in\mathcal{I}_{m}\}, where only covariates are observed. Here, we use c\mathcal{I}_{c}, t\mathcal{I}_{t}, and m\mathcal{I}_{m} to denote the index sets of the subjects in the respective categories and, therefore, ctm={1,,ntest}\mathcal{I}_{c}\cup\mathcal{I}_{t}\cup\mathcal{I}_{m}=\{1,\dots,n_{test}\}. For the ITE of each subject in the test set, we can construct the prediction interval with a desired confidence level of 1α1-\alpha in the following procedure:

  • (i)

    For subject ici\in\mathcal{I}_{c}: At each iteration kk of Algorithm 1, calculate the prediction Y^i(k)(1)=c^(k)(𝒙i)+τ^(k)(𝒙i)+σ^(k)Znew(k,1)\hat{Y}_{i}^{(k)}(1)=\hat{c}^{(k)}({\boldsymbol{x}}_{i})+\hat{\tau}^{(k)}({\boldsymbol{x}}_{i})+\hat{\sigma}^{(k)}Z_{new}^{(k,1)}, where Znew(k,1)N(0,1)Z_{new}^{(k,1)}\sim N(0,1). Let cl(𝒙i,1)c_{l}({\boldsymbol{x}}_{i},1) and cu(𝒙i,1)c_{u}({\boldsymbol{x}}_{i},1) denote, respectively, the α2\frac{\alpha}{2}- and (1α2)(1-\frac{\alpha}{2})-quantiles of {Y^i(k)(1):k=𝒦+1,𝒦+2,,𝒦+M}\{\hat{Y}_{i}^{(k)}(1):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\} collected over iterations. Since Yiobs(0)Y_{i}^{obs}(0) is observed, (cl(𝒙i,1)Yiobs(0),cu(𝒙i,1)Yiobs(0))(c_{l}({\boldsymbol{x}}_{i},1)-Y_{i}^{obs}(0),c_{u}({\boldsymbol{x}}_{i},1)-Y_{i}^{obs}(0)) forms a (1α)(1-\alpha)-prediction interval for the ITE Yi(1)Yiobs(0)Y_{i}(1)-Y_{i}^{obs}(0).

  • (ii)

    For subject iti\in\mathcal{I}_{t}: At each iteration kk of Algorithm 1, calculate the prediction Y^i(k)(0)=c^(k)(𝒙i)+σ^(k)Znew(k,2)\hat{Y}_{i}^{(k)}(0)=\hat{c}^{(k)}({\boldsymbol{x}}_{i})+\hat{\sigma}^{(k)}Z_{new}^{(k,2)}, where Znew(k,2)N(0,1)Z_{new}^{(k,2)}\sim N(0,1). Let cl(𝒙i,0)c_{l}({\boldsymbol{x}}_{i},0) and cu(𝒙i,0)c_{u}({\boldsymbol{x}}_{i},0) denote, respectively, the α2\frac{\alpha}{2}- and (1α2)(1-\frac{\alpha}{2})-quantiles of {Y^i(k)(0):k=𝒦+1,𝒦+2,,𝒦+M}\{\hat{Y}_{i}^{(k)}(0):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\} collected over iterations. Since Yiobs(1)Y_{i}^{obs}(1) is observed, (Yiobs(1)cu(𝒙i,0),Yiobs(1)cl(𝒙i,1))(Y_{i}^{obs}(1)-c_{u}({\boldsymbol{x}}_{i},0),Y_{i}^{obs}(1)-c_{l}({\boldsymbol{x}}_{i},1)) forms a (1α)(1-\alpha)-prediction interval for the ITE Yiobs(1)Yi(0)Y_{i}^{obs}(1)-Y_{i}(0).

  • (iii)

    For subject imi\in\mathcal{I}_{m}: At each iteration kk of Algorithm 1, calculate the prediction Y^i(k)(1)Y^i(k)(0)=τ^(k)(𝒙i)+2σ^(k)Znew(k,3)\hat{Y}_{i}^{(k)}(1)-\hat{Y}_{i}^{(k)}(0)=\hat{\tau}^{(k)}({\boldsymbol{x}}_{i})+\sqrt{2}\hat{\sigma}^{(k)}Z_{new}^{(k,3)}, where Znew(k,3)N(0,1)Z_{new}^{(k,3)}\sim N(0,1). Let cl(𝒙i)c_{l}({\boldsymbol{x}}_{i}) and cu(𝒙i)c_{u}({\boldsymbol{x}}_{i}) denote, respectively, the α2\frac{\alpha}{2}- and (1α2)(1-\frac{\alpha}{2})-quantiles of {Y^i(k)(1)Y^i(k)(0):k=𝒦+1,𝒦+2,,𝒦+M}\{\hat{Y}_{i}^{(k)}(1)-\hat{Y}_{i}^{(k)}(0):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\} collected over iterations. Then (cl(𝒙i),cu(𝒙i))(c_{l}({\boldsymbol{x}}_{i}),c_{u}({\boldsymbol{x}}_{i})) forms a (1α)(1-\alpha)-prediction interval for the ITE Yi(1)Yi(0)Y_{i}(1)-Y_{i}(0).

5.2 Simulation Study

Example 1

Consider the data-generating equation

yi=μ+𝒙i𝜷+(η0+η(𝒙i))Ti+σzi,i=1,2,,n,y_{i}=\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}}+(\eta_{0}+\eta({\boldsymbol{x}}_{i}))T_{i}+\sigma z_{i},\quad i=1,2,\ldots,n, (24)

where 𝒙i=(xi,1,xi,2){\boldsymbol{x}}_{i}=(x_{i,1},x_{i,2})^{\top} with each element drawn independently from Unif(0,1)Unif(0,1), μ=1\mu=1, 𝜷=(1,1){\boldsymbol{\beta}}=(1,1)^{\top}, η0=1\eta_{0}=1, σ=1\sigma=1, ziN(0,1)z_{i}\sim N(0,1), and η(𝒙i)=s(xi1)s(xi2)E(s(xi1)s(xi2))\eta({\boldsymbol{x}}_{i})=s(x_{i1})s(x_{i2})-E(s(x_{i1})s(x_{i2})). As in Lei and Candès (2021), we set s(a)=21+exp(12(a0.5))s(a)=\frac{2}{1+exp(-12(a-0.5))}, and generate the treatment variable TiT_{i} according to the propensity score model:

e(𝒙i)=14(1+β2,4(xi,1)),e({\boldsymbol{x}}_{i})=\frac{1}{4}(1+\beta_{2,4}(x_{i,1})), (25)

where β2,4\beta_{2,4} is the CDF of the beta distribution with parameters (2,4), ensuring e(𝒙i)[0.25,0.5]e({\boldsymbol{x}}_{i})\in[0.25,0.5] and thereby sufficient overlap between the treatment and control groups. In terms of equation (1), we have c(𝒙i)=μ+𝒙i𝜷c({\boldsymbol{x}}_{i})=\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}} and τ(𝒙i)=η0+η(𝒙i)\tau({\boldsymbol{x}}_{i})=\eta_{0}+\eta({\boldsymbol{x}}_{i}). We generated 20 datasets from the model (24) independently, each consisting of ntrain=500n_{train}=500 training samples and ntest=1000n_{test}=1000 test samples.

Table 2: Comparison of Double-NN and CQR for inference of the predictive ITE for Example (24), where the coverage and length of the prediction intervals were calculated by averaging over 20 datasets with the standard deviation given in the parentheses.

Case c\mathcal{I}_{c} Case t\mathcal{I}_{t} Case m\mathcal{I}_{m} Method Coverage Length Coverage Length Method Coverage Length Double-NN 0.9549 4.2004 0.9581 4.1812 Double-NN 0.9583 5.6056 (0.0095) (0.1567) (0.0098) (0.1541) (0.0103) (0.2207) CQR-BART 0.9472 4.2702 0.9533 4.4024 CQR(inexact) 0.9530 6.3244 (0.0342) (0.5225) (0.0341) (0.8972) (0.0198) (0.5426) CQR-Boosting 0.9556 5.5199 0.9548 4.4493 CQR(exact) 1.0000 13.4005 (0.0294) (0.5866) (0.0259) (0.5097) (0.0002) (2.4936) CQR-RF 0.9529 5.4609 0.9652 4.6428 CQR(naive) 0.9998 12.8861 (0.0233) (0.5172) (0.0171) (0.5408) (0.0004) (1.5275) CQR-NN 0.9570 6.4072 0.9755 5.8125 (0.0195) (0.8087) (0.0199) (1.4332)

For this example, we assume the functional form of c(𝒙)c({\boldsymbol{x}}) is known and model τ(𝒙)\tau({\boldsymbol{x}}) by a DNN. The DNN has two hidden layers, each consisting of 10 hidden neurons. The number of parameters of the DNN is |𝜽τ|=151|{\boldsymbol{\theta}}_{\tau}|=151, and the total dimension of 𝜽=(μ,η0,𝜷,𝜽τ,log(σ)){\boldsymbol{\theta}}=(\mu,\eta_{0},{\boldsymbol{\beta}},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top} is 156 (ntrain0.81)(\approx n_{train}^{0.81}), which falls into the class of large models.

Refer to Section §\S3 of the supplement for parameter settings for the Double-NN method. For comparison, the conformal quantile regression (CQR) method (Romano et al., 2019; Lei and Candès, 2021) was applied to this example, where the outcome function was approximated using different machine learning methods, including BART (Chipman et al., 2010), Boosting (Schapire, 1990; Breiman, 1998), and random forest (RF) (Breiman, 2001), and neural network (NN). Refer to Section §\S2 of the supplement for a brief description of the CQR method. For CQR-NN, we used a neural network of structure (p+1)(p+1)-10-10-2 to model the outcome quantiles, where the extra input variable is for treatment and the two output neurons are for (α/2,1α/2)(\alpha/2,1-\alpha/2)-quantiles of the outcome (Romano et al., 2019). Additionally, we used a neural network of structure pp-10-10-1 to model the propensity score in order to compute weighted CQR as in Lei and Candès (2021).

The other CQR methods were implemented using the R package cfcausal (Lei and Candès, 2021). For the case m\mathcal{I}_{m}, we considered CQR-BART only, given its relative superiority over other CQR methods in the cases c\mathcal{I}_{c} and t\mathcal{I}_{t}.

The results were summarized in Table 2. The comparison shows that the Double-NN method outperforms the CQR methods in both the coverage rate and length of the prediction intervals under all the three cases c\mathcal{I}_{c}, t\mathcal{I}_{t}, and m\mathcal{I}_{m}. Specifically, the prediction intervals resulting from the Double-NN method tend to be shorter, while their coverage rates tend to be closer to the nominal level.

Refer to caption
Figure 2: Demonstration of the Double-NN method for a dataset simulated from (24): (left) scatter plot of 𝒛^i\hat{{\boldsymbol{z}}}_{i} (yy-axis) versus 𝒛i{\boldsymbol{z}}_{i} (xx-axis); (middle) Q-Q plot of 𝒛^i\hat{{\boldsymbol{z}}}_{i} and 𝒛i{\boldsymbol{z}}_{i}; (right) scatter plot of τ(𝒙i)\tau({\boldsymbol{x}}_{i}) (yy-axis) versus τ^(𝒙i)\hat{\tau}({\boldsymbol{x}}_{i}) (xx-axis).

Figure 2 demonstrates the rationale underlying the Double-NN method. The left scatter plot compares the imputed and true values of the latent variables for a dataset simulated from (24), where the imputed values were collected at the last iteration of Algorithm 1. The comparison reveals a close match between the imputed and true latent variable values, with the variability of the imputed values representing the source of uncertainty in the data-generating system. This variability in the latent variables can be propagated to 𝜽{\boldsymbol{\theta}} through the estimated inverse function G()G(\cdot), leading to the uncertainty in parameters and, consequently, the uncertainty in predictions. The middle scatter plot shows that the imputed latent variable values follows the standard Gaussian distribution, as expected. The right scatter plot compares the estimated and true values of the function τ(𝒙i)\tau({\boldsymbol{x}}_{i}), with the variability of the estimator representing its uncertainty. This plot further implies that the Double-NN method not only works for performing inference for the predictive ITE but also works for performing inference for CATE.

Example 2

Consider the data-generating equation

yi=c(𝒙i)+τ(𝒙i)Ti+σzi,i=1,2,,n,y_{i}=c({\boldsymbol{x}}_{i})+\tau({\boldsymbol{x}}_{i})T_{i}+\sigma z_{i},\quad i=1,2,\ldots,n, (26)

where 𝒙i=(xi,1,xi,2,,xi,5){\boldsymbol{x}}_{i}=(x_{i,1},x_{i,2},\ldots,x_{i,5})^{\top} with each element drawn independently from Unif(0,1)Unif(0,1), τ(𝒙)\tau({\boldsymbol{x}}) and TiT_{i} are generated as in Example 1 except that 𝒙i{\boldsymbol{x}}_{i} contains three extra false covariates, c(𝒙i)=2xi,11+5xi,22c({\boldsymbol{x}}_{i})=\frac{2x_{i,1}}{1+5x_{i,2}^{2}}, σ=1\sigma=1, and ziN(0,1)z_{i}\sim N(0,1). We simulated 20 datasets from this equation, each consisting of ntrain=1000n_{train}=1000 training samples and ntest=1000n_{test}=1000 test samples.

For this example, we modeled both c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}) using DNNs. Each of the DNNs consists of two hidden layers, each layer consisting of 10 hidden neurons. In consequence, 𝜽=(𝜽c,𝜽τ,log(σ)){\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top} has a total dimension of 363 (ntrain0.85)(\approx n_{train}^{0.85}).

Similar to Example 1, we also applied the CQR methods (Lei and Candès, 2021) to this example for comparison. The CQR methods were implemented as described in Example 1. The results were summarized in Table 3, which indicates again that the Double-NN method outperforms the CQR methods under all the three cases c\mathcal{I}_{c}, t\mathcal{I}_{t}, and m\mathcal{I}_{m}. The prediction intervals resulting from the Double-NN method tend to be shorter, while their coverage rates tend to be closer to the nominal level.

Similar to Figure 2, Figure 3 demonstrates the rationale underlying the Double-NN method, as well as its capability for CATE inference. The left plot demonstrates the variability embedded in the latent variables of the data-generating system. The middle-left plot shows that the imputed latent variables are distributed according to the standard Gaussian distribution, as expected. The right two plots display the estimates of c(𝒙i)c({\boldsymbol{x}}_{i}) and τ(𝒙i)\tau({\boldsymbol{x}}_{i}), respectively. Once again, we note that the variations of the estimates of c(𝒙i)c({\boldsymbol{x}}_{i}) and τ(𝒙i)\tau({\boldsymbol{x}}_{i}), as depicted in their respective scatter plots, reflect their uncertainty according to the theory of EFI.

Table 3: Comparison of Double-NN and CQR for inference of the predictive ITE for Example (26), where the coverage and length of the prediction intervals were calculated by averaging over 20 datasets with the standard deviation given in the parentheses.

Case c\mathcal{I}_{c} Case t\mathcal{I}_{t} Case m\mathcal{I}_{m} Method Coverage Length Coverage Length Method Coverage Length Double-NN 0.9519 4.2727 0.9645 4.246 Double-NN 0.9604 6.0079 (0.0111) (0.0101) (0.0069) (0.0967) (0.0946) (0.1363) CQR-BART 0.9584 4.3586 0.9545 4.2658 CQR(inexact) 0.9386 6.0492 (0.0220) (0.4392) (0.0230) (0.4586) (0.0270) (0.6062) CQR-Boosting 0.9536 4.9942 0.9572 4.4393 CQR(exact) 0.9996 12.1252 (0.0175) (0.4044) (0.0194) (0.4213) (0.0007) (1.1022) CQR-RF 0.9563 5.6658 0.9580 4.4399 CQR(naive) 0.9988 11.5566 (0.0198) (0.4777) (0.0232) (0.5044) (0.0014) (0.9309) CQR-NN 0.9595 4.6748 0.9452 3.9579 (0.0165) (0.6015) (0.0185) (0.4301)

Refer to caption
Figure 3: Demonstration of the Double-NN method for a dataset simulated from (26): (left) scatter plot of 𝒛^i\hat{{\boldsymbol{z}}}_{i} (yy-axis) versus 𝒛i{\boldsymbol{z}}_{i} (xx-axis); (middle-left) Q-Q plot of 𝒛^i\hat{{\boldsymbol{z}}}_{i} and 𝒛i{\boldsymbol{z}}_{i}; (middle-right) scatter plot of c(𝒙i)c({\boldsymbol{x}}_{i}) (yy-axis) versus c^(𝒙i)\hat{c}({\boldsymbol{x}}_{i}) (xx-axis); (right) scatter plot of τ(𝒙i)\tau({\boldsymbol{x}}_{i}) (yy-axis) versus τ^(𝒙i)\hat{\tau}({\boldsymbol{x}}_{i}) (xx-axis).
Precision in Estimation of Heterogeneous Effects

As demonstrated in Figure 2 and Figure 3, the Double-NN method can also be used for inference of CATE. The performance in CATE estimation is often measured using the expected Precision in Estimation of Heterogeneous Effects (PEHE), which is defined as:

ϵPEHE=𝒳(τ^(𝒙)τ(𝒙))2𝑑F(𝒙),\epsilon_{PEHE}=\int_{\mathcal{X}}(\hat{\tau}({\boldsymbol{x}})-\tau({\boldsymbol{x}}))^{2}dF({\boldsymbol{x}}),

where F(𝒙)F({\boldsymbol{x}}) denotes the distribution function of the covariates 𝑿{\boldsymbol{X}}. As we can see, ϵPEHE\epsilon_{PEHE} summarizes the precision of the CATE over the entire sample space 𝒳\mathcal{X} (Hill, 2011; Shalit et al., 2017; Caron et al., 2022). In practice, since we only observe the treatment effect on the treatment group, the target of interest is generally only for the treatment group, i.e ϵPEHE(T)=𝒳(τ^(𝒙)τ(𝒙))2𝑑FT(𝒙)\epsilon_{PEHE}^{(T)}=\int_{\mathcal{X}}(\hat{\tau}({\boldsymbol{x}})-\tau({\boldsymbol{x}}))^{2}dF_{T}({\boldsymbol{x}}), where FT(𝒙)F_{T}({\boldsymbol{x}}) denotes the distribution function of the covariates in the treatment group. We estimated ϵPEHE(T)\epsilon_{PEHE}^{(T)} by ϵ^PEHE(T)=1ntiIt(τ^(𝒙i)τ(𝒙i))2\hat{\epsilon}_{PEHE}^{(T)}=\frac{1}{n_{t}}\sum_{i\in I_{t}}(\hat{\tau}({\boldsymbol{x}}_{i})-\tau({\boldsymbol{x}}_{i}))^{2}. For the Double-NN method, we set τ^(𝒙i)=1Mk=𝒦+1𝒦+Mτ^(k)(𝒙i)\hat{\tau}({\boldsymbol{x}}_{i})=\frac{1}{M}\sum_{k=\mathcal{K}+1}^{\mathcal{K}+M}\hat{\tau}^{(k)}({\boldsymbol{x}}_{i}), where MM denotes the number of estimates of τ(𝒙i)\tau({\boldsymbol{x}}_{i}) collected in a run of Algorithm 1.

For comparison, the existing CATE estimation methods, including single-learner (S-learner), two-learner (T-learner), and X-learner (Künzel et al., 2019), have been applied to the datasets generated above, where the RF and BART are used as the base learners. In the S-learner, a single outcome function is estimated using a base learner with all available covariates, where the treatment indicator is treated as a covariate, and then estimate CATE by τ^S=μ^(𝒙,1)μ^(𝒙,0)\hat{\tau}_{S}=\hat{\mu}({\boldsymbol{x}},1)-\hat{\mu}({\boldsymbol{x}},0), where μ^(𝒙,t)\hat{\mu}({\boldsymbol{x}},t) denotes the outcome function estimator. The T-learner estimates the outcome functions using a base learner separately for the units under the control and those under the treatment, and then estimate CATE by τ^T(𝒙)=μ^1(𝒙)μ^0(𝒙)\hat{\tau}_{T}({\boldsymbol{x}})=\hat{\mu}_{1}({\boldsymbol{x}})-\hat{\mu}_{0}({\boldsymbol{x}}), where μ^t(𝒙)\hat{\mu}_{t}({\boldsymbol{x}}) denote the outcome function estimator for the assignment group t{0,1}t\in\{0,1\}. The X-learner builds on the T-learner; it uses the observed outcomes to estimate the unobserved ITEs, and then estimate the CATE in another step as if the ITEs were observed. Refer to Künzel et al. (2019) and Caron et al. (2022) for the detail. We implemented the S-learner, T-learner, and X-leaner using the package downloaded at https://github.com/albicaron/EstITE.

Table 4 compares the values of ϵ^PEHE(T)\hat{\epsilon}_{PEHE}^{(T)} resulting from the Double-NN, S-learners, T-learners, and X-learners. for the models (24) and (26). The comparison shows that the Double-NN method outperforms the existing ones in achieving consistent CATE estimates over different covariate values. This is remarkable! As explained in Section 4, we would attribute this performance of the Double-NN method to its fidelity in parameter estimation (Liang et al., 2024). Compared to the MLE method, which serves as the prototype for the base learners, the Double-NN method is forced to be more robust to covariates due to added penalty term Un(𝒀n,𝑿n,𝒁n,𝒘n)/ϵU_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon.

Table 4: Comparison of Double-NN and other methods in ϵPEHE(T)\epsilon_{PEHE}^{(T)}, where each of the mean and standard deviations was calculated based on 20 datasets generated from (24) or (26).
Model (24) Model (26)
Method Training Test Training Test
S-RF 0.3769 ±\pm 0.0170 0.3660 ±\pm 0.0188 0.3722 ±\pm 0.0074 0.3377 ±\pm 0.0100
S-BART 0.4233 ±\pm 0.0156 0.4344 ±\pm 0.0149 0.3371 ±\pm 0.0099 0.3418 ±\pm 0.0102
T-RF 0.4545 ±\pm 0.0114 0.4198 ±\pm 0.0118 0.4095 ±\pm 0.0064 0.3488 ±\pm 0.0084
T-BART 0.4190 ±\pm 0.0139 0.4236 ±\pm 0.0127 0.4308 ±\pm 0.0092 0.4298 ±\pm 0.0093
X-RF 0.3416 ±\pm 0.0153 0.3451 ±\pm 0.0162 0.2761 ±\pm 0.0106 0.2789 ±\pm 0.0106
X-BART 0.3863 ±\pm 0.0137 0.3972 ±\pm 0.0128 0.3853 ±\pm 0.0102 0.3862 ±\pm 0.0097
Double-NN 0.2962 ±\pm 0.0167 0.3139 ±\pm 0.0178 0.3788 ±\pm 0.0105 0.3899 ±\pm 0.0110

5.3 Real Data Analysis

5.3.1 Lalonde

The ‘LaLonde’ data is a well-known dataset used in causal inference to evaluate the effectiveness of a job training program in improving the employment prospects of participants. We used the dataset given in the package “twang” (Cefalu et al., 2021) among various versions. The dataset includes earning data in 1978 on 614 individuals, with 185 receiving job training and 429 in the control group. There are 8 covariates including various demographic, educational, and employment-related variables. While the LaLonde dataset has been widely used for ATE estimation, we use it to illustrate the Double-NN method for constructing ITE prediction intervals.

To evaluate the performance of different methods, we randomly split the LaLonde dataset into a training set and a test set. The training set, denoted by 𝒟train\mathcal{D}_{train}, consists of ntrain=600n_{train}=600 observations; while the test set, denoted by 𝒟test\mathcal{D}_{test}, consists of ntest=14n_{test}=14 observations. We trained the Double-NN on 𝒟train\mathcal{D}_{train} and constructed prediction intervals for each subject in 𝒟test\mathcal{D}_{test} with a confidence level of 1α=0.51-\alpha=0.5. For the Double-NN, we modeled both c(𝒙)c({\boldsymbol{x}}) and τ(𝒙)\tau({\boldsymbol{x}}) using DNNs. Each of the DNNs consists of two hidden layers, with each layer consisting of 10 hidden neurons. In consequence, 𝜽=(𝜽c,𝜽τ,log(σ)){\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top} has a dimension of 423 (ntrain0.95)(\approx n_{train}^{0.95}), a challenging task for uncertainty quantification of the model.

Figure 4 displays the constructed ITE prediction intervals for the test data, comparing the proposed method to the CQR method (Lei and Candès, 2021). The comparison shows that the prediction intervals resulting from the proposed method are shorter than those from the CQR method, while the centers of those intervals are similar. This suggests that the proposed method is able to estimate the ITEs with a higher degree of precision.

Refer to caption
Figure 4: Comparison of prediction intervals resulting from Double-NN (labeled as EFI) and CQR (labeled as conformal) for the subjects in the test set of Lalonde.

5.3.2 NLSM

This subsection conducts an analysis on the ‘National Study of Learning Mindsets’ (NLSM) dataset used in the 2018 Atlantic Causal Inference Conference workshop (Yeager et al., 2019; Carvalho et al., 2019). NSLM records the results of a randomized evaluation for a “nudge-like” intervention designed to instill students with a growth mindset. The dataset is available at https://github.com/grf-labs/grf/tree/master/experiments/acic18, which includes 10,391 students from 76 schools, with four student-level covariates and six school-level students. After factoring the categorical variables, the dimension of covariates 𝒙{\boldsymbol{x}} increases to 29.

Due to unavailability of the true treatment effect values, we performed an exploratory analysis as in Lei and Candès (2021). In order to construct prediction intervals for the ITE, we split the dataset into two sets: 𝒟train\mathcal{D}_{train} and 𝒟test\mathcal{D}_{test}. The former has a sample size of ntrain=5200n_{train}=5200, and the latter has a sample size of ntest=5191n_{test}=5191. For the Double-DNN method, we used DNNs to model the functions τ(𝒙)\tau({\boldsymbol{x}}) and c(𝒙)c({\boldsymbol{x}}). Each DNN consists of two hidden layers, with each hidden layer consisting of 10 hidden neurons. Therefore, the dimension of 𝜽=(𝜽c,𝜽τ,log(σ)){\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top} is 843 (ntrain0.79)(\approx n_{train}^{0.79}).

The Double-DNN was trained on 𝒟train\mathcal{D}_{train} and the prediction intervals were constructed on 𝒟test\mathcal{D}_{test}, which corresponds to case (iii) described in Section 5.1. This process was repeated 20 times. For comparison, the CQR method (Lei and Candès, 2021) was also applied to this example.

Refer to caption
Figure 5: Comparison of the average length of intervals obtained by the Double-NN (labeled as EFI) and CQR (labeled as conformal) for the NLSM data.

Figure 5 displays the average length of prediction intervals, obtained by Double-DNN and CQR, as a function of α\alpha, with the upper and lower envelops being respectively the 95%95\% and 5%5\% quantiles across 20 runs. For this example, we implemented CQR using the “inexact” method, and therefore, its interval lengths tend to be short with approximate validity. However, as shown in Figure 5, the prediction intervals resulting from the Double-NN method tend to be even shorter than those from CQR as α\alpha increases. Figure 6 (a) compares the fractions of the prediction intervals, obtained by Double-NN and CQR, that cover positive values only. While Figure 6 (b) compares the fractions of the prediction intervals that cover negative values only. In summary, the Double-NN can provide more accurate predictions for the ITE than CQR for this example. Specifically, the Double-NN identified fewer subjects with significant ITEs than the CQR, as implied by Figure 6 (a) and (b); while each has a narrow prediction interval, as implied by Figure 5.

(a) (b)
Refer to caption Refer to caption
Figure 6: Fractions of the intervals obtained by Double-NN (labeled as EFI) and CQR (labeled as conformal) with (a) positive lower bounds and (b) negative upper bounds, where the upper and lower envelops are respectively 95%95\% and 5%5\% quantiles across 20 runs.

6 Discussion

This paper extends EFI to statistical inference for large statistical models and applies the proposed Double-NN method to treatment effect estimation. The numerical results demonstrate that the Double-NN method significantly outperforms the existing CQR method in ITE prediction. As mentioned in the paper, we attribute the superior performance of the Double-NN method to its fidelity in parameter estimation. Due to the universal approximation ability of deep neural networks, the Double-NN method is generally applicable for causal effect estimation.

From the perspective of statistical inference, this paper advances the theory and methodology for making inference of large statistical models, allowing the model size to increase with the sample size nn at a rate of O(nζ)O(n^{\zeta}) for any exponent 0ζ<10\leq\zeta<1. In particular, the Double-NN method provides a rigorous approach for quantifying the uncertainty of deep neural networks. In this paper, we have tested the performance of the Double-NN method on numerical examples with the exponent ranging 0.79ζ0.950.79\leq\zeta\leq 0.95, which all falls into the class of large models.

The Double-NN method can be further extended toward a general nonparametric approach for causal inference. Specifically, we can include an additional neural network to approximate the propensity score, enabling the outcome and propensity score functions to be simultaneously estimated. This extension will enable the use of inverse probability weighting methods to further improve ATE estimation, especially in the scenario where the covariate distributions in the treatment and control groups are imbalanced (Shalit et al., 2017; Hahn et al., 2020). From the perspective of EFI, this just corresponds to making inference for a different b(𝜽)b({\boldsymbol{\theta}}) function. Similarly, for inference of ITE, a different b(𝜽)b({\boldsymbol{\theta}}) function, including those adjusted with propensity scores, can also be used. The key advantage of EFI is its ability to automatically quantify the uncertainty of these functions as prescribed in (20), even when the functions are highly complex.

Regarding the size of large models, our theory does not preclude applications to large-scale DNNs with millions or even billions of parameters, as supported by the neural scaling law. As mentioned previously, Hestness et al. (2017) investigated the relationship between the DNN model size and the dataset size: they discovered a sub-linear scaling law of dim(𝜽)ndim({\boldsymbol{\theta}})\prec n across various model architectures in machine learning applications, including machine translation, language modeling, image processing, and speech recognition. Their findings suggest that Theorem 3.1 remains valid for large-scale DNNs by choosing an appropriate growth rate for their depth.

In practice, we often encounter small-nn-large-pp problems. For such a problem, we need to deal with a model of dimension dim(𝜽)ndim({\boldsymbol{\theta}})\succeq n, which is often termed as an over-parameterized model. A further extension of EFI for over-parameterized models is possible by imposing an appropriate sparsity constraint on 𝜽{\boldsymbol{\theta}}. How to make post-selection inference with EFI for the over-parameterized models will be studied in future work.

Finally, we note that a recent work by Williams (2023) demonstrates how conformal prediction sets arise from a generalized fiducial distribution. Given the inherent connections between GFI and EFI, we believe that the results established in Williams (2023) should also apply to EFI. In particular, EFI follows the same switching principle as GFI (Hannig et al., 2016), which infers the uncertainty of the model parameters from the distribution of unobserved random errors. Further research on EFI from this perspective is of great interest, as it could potentially alleviate EFI’s reliance on assumptions about the underlying data distribution in prediction uncertainty quantification.

Availability

The code that implements the Double NN method can be found at https://github.com/sehwankimstat/DoubleNN.

Acknowledgments

Liang’s research is supported in part by the NSF grants DMS-2015498 and DMS-2210819, and the NIH grant R01-GM152717. The authors thank the editor, associate editor, and referee for their constructive comments, which have led to significant improvement of this paper.

Appendix: Supplement for “Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks ”


This material is organized as follows. Section §\S1 provides a proof for Theorem 3.1. Section §\S2 provides a brief description for the CQR method. Section §\S3 provides parameter settings for the experiments reported in the main text and this supplement.

Appendix §\S1 Theoretical Proofs

To prove the validity of the proposed method, it is sufficient to prove that g^()\hat{g}(\cdot) constitutes a consistent estimator of 𝜽{\boldsymbol{\theta}}, building on the theory developed in Liang et al. (2024). To ensure the self-contained nature of this paper, we provide a concise overview of the theory presented in Liang et al. (2024) in Section §\S1.1 of this supplement. Subsequently, our study will center on establishing the consistency of g^()\hat{g}(\cdot).

§\S1.1 Outline of the Proof

First of all, we note that the theoretical study is conducted under the assumption that the EFI network has been correctly specified such that a sparse EFI network 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*} exists, from which the complete data (𝑿n,𝒀n,𝒁n)({\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}) can be generated, where 𝒁n{\boldsymbol{Z}}_{n}^{*} represents the values of the latent variables realized in the observed samples. Specifically, we assume 𝒁nπϵ(𝒁|𝑿n,𝒀n,𝒘~n){\boldsymbol{Z}}_{n}^{*}\sim\pi_{\epsilon}({\boldsymbol{Z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*}) holds as ϵ0\epsilon\downarrow 0.

For the EFI network, we define

𝒢^(𝒘n|𝒘~n):=1nlogπϵ(𝒀n,𝒁n|𝑿n,𝒘n)+1nlogπ(𝒘n).\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*}):=\frac{1}{n}\log\pi_{\epsilon}({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n}). (S1)

Therefore,

𝒘^n:=argmax𝒘n𝒲n𝒢^(𝒘n|𝒘~n),\hat{{\boldsymbol{w}}}_{n}^{*}:=\arg\max_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*}),

is the global maximizer of the posterior πϵ(𝒘n|𝑿n,𝒀n,𝒁n)\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}). Also, we define

𝒢~(𝒘n|𝒘~n):=1nlogπϵ(𝒀n,𝒁n|𝑿n,𝒘n)𝑑πϵ(𝒁n|𝑿n,𝒀n,𝒘~n)+1nlogπ(𝒘n).=1n{logπ(𝒘n|𝑿n,𝒀n)logπ(𝒁n|𝑿n,𝒀n,𝒘~n)π(𝒁n|𝑿n,𝒀n,𝒘n)dπ(𝒁n|𝑿n,𝒀n,𝒘~n)+logπ(𝒁n|𝑿n,𝒀n,𝒘~n)dπ(𝒁n|𝑿n,𝒀n,𝒘~n)+c},\begin{split}\widetilde{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})&:=\frac{1}{n}\int\log\pi_{\epsilon}({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})d\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n}).\\ &=\frac{1}{n}\Big{\{}\log\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})-\int\log\frac{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})}{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})}d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})\\ &+\int\log\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})+c\Big{\}},\\ \end{split} (S2)

where c=log𝒲nπ(𝒀n|𝑿n,𝒘n)π(𝒘n)𝑑𝒘nc=\log\int_{\mathcal{W}_{n}}\pi({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})\pi({\boldsymbol{w}}_{n})d{\boldsymbol{w}}_{n} is the log-normalizing constant of the posterior π(𝒘n|𝑿n,𝒀n)\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}). Note that in the above derivation, 𝑿n{\boldsymbol{X}}_{n} can be ignored for simplicity since it is constant. For simplicity of notation, we let

DKL(𝒘n)=logπ(𝒁n|𝑿n,𝒀n,𝒘~n)π(𝒁n|𝑿n,𝒀n,𝒘n)dπ(𝒁n|𝑿n,𝒀n,𝒘~n),D_{KL}({\boldsymbol{w}}_{n})=\int\log\frac{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})}{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})}d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*}),

be the Kullback-Leibler divergence between π(𝒁n|𝑿n,𝒀n,𝒘~n)\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*}) and π(𝒁n|𝑿n,𝒀n,𝒘n)\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}). Regarding the EFI network, we make the following assumption:

Assumption 1

The EFI network satisfies the conditions:

  • (i)

    The parameter space 𝒲n\mathcal{W}_{n} (of 𝒘n{\boldsymbol{w}}_{n}) is convex and compact.

  • (ii)

    𝔼(logπϵ(Y,Z|X,𝒘n))2<\mathbb{E}(\log\pi_{\epsilon}(Y,Z|X,{\boldsymbol{w}}_{n}))^{2}<\infty for any 𝒘n𝒲n{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}, where (X,Y,Z)(X,Y,Z) denotes a generic sample as those in (𝒀n,𝒁n,𝑿n)({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{X}}_{n}).

Define Q(𝒘n)=𝔼(logπϵ(Y,Z|X,𝒘n))+1nlogπ(𝒘n)Q^{*}({\boldsymbol{w}}_{n})=\mathbb{E}(\log\pi_{\epsilon}(Y,Z|X,{\boldsymbol{w}}_{n}))+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n}). By Assumption 1 and the weak law of large numbers,

1nlogπϵ(𝒘n|𝑿n,𝒀n,𝒁n)Q(𝒘n)𝑝0,\frac{1}{n}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})-Q^{*}({\boldsymbol{w}}_{n})\overset{p}{\rightarrow}0, (S3)

holds uniformly over the parameter space 𝒲n\mathcal{W}_{n}, where p\stackrel{{\scriptstyle p}}{{\to}} denotes convergence in probability. Further, we make the following assumption on Q(𝒘n)Q^{*}({\boldsymbol{w}}_{n}), which is essentially an identifiability condition of 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*}.

Assumption 2

(i) Q(𝐰n)Q^{*}({\boldsymbol{w}}_{n}) is continuous in 𝐰n{\boldsymbol{w}}_{n} and is uniquely maximized at some point 𝐰n{\boldsymbol{w}}_{n}^{\diamond}; (ii) for any ϵ>0\epsilon>0, the value sup𝐰n𝒲n\B(ϵ)Q(𝐰n)sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}\backslash B(\epsilon)}Q^{*}({\boldsymbol{w}}_{n}) exists, where B(ϵ)={𝐰n:𝐰n𝐰n<ϵ}B(\epsilon)=\{{\boldsymbol{w}}_{n}:\|{\boldsymbol{w}}_{n}-{\boldsymbol{w}}_{n}^{\diamond}\|<\epsilon\}, and δ:=Q(𝐰n)sup𝐰n𝒲n\B(ϵ)Q(𝐰n)>0\delta:=Q^{*}({\boldsymbol{w}}_{n}^{\diamond})-\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}\backslash B(\epsilon)}Q^{*}({\boldsymbol{w}}_{n})>0.

Assumption 2 restricts the shape of Q(𝒘n)Q^{*}({\boldsymbol{w}}_{n}) around the global maximizer, which cannot be discontinuous or too flat. Given nonidentifiability of the neural network model, see e.g. Sun et al. (2022), we have implicitly assumed that each 𝒘n{\boldsymbol{w}}_{n} is unique up to loss-invariant transformations, e.g., reordering the hidden neurons within the same hidden layer or simultaneously altering the signs of certain weights and biases. The same assumption has often been used in theoretical studies of neural networks, see e.g. Liang et al. (2018b).

On the other hand, by Theorem 1 of Liang et al. (2018a), we have

sup𝒘n𝒲n|𝒢^(𝒘n|𝒘~n)𝒢~(𝒘n|𝒘~n)|p0,as n,\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}\left|\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})-\widetilde{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})\right|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}, (S4)

under some regularity conditions. Under Assumptions 1-2, Liang et al. (2024) proved the following lemma:

Lemma S1

(Lemma 4.1; Liang et al. (2024)) Suppose Assumptions 1-2 hold, and the joint likelihood function π(𝐘n,𝐙n|𝐗n,𝐰n)\pi({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}) is continuous in 𝐰n{\boldsymbol{w}}_{n}. If 𝐰^n\hat{{\boldsymbol{w}}}_{n}^{*} is unique, then 𝐰n{\boldsymbol{w}}_{n}^{*} that maximizes π(𝐰n|𝐗n,𝐘n)\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}) as well as minimizes DKL(𝐰n)D_{KL}({\boldsymbol{w}}_{n}) is unique and, subsequently, 𝐰^n𝐰np0\|\hat{{\boldsymbol{w}}}_{n}^{*}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0 holds as nn\to\infty.

For Lemma S1, the uniqueness of 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*}, up to some loss-invariant transformations as discussed previously, can be ensured by its consistency as established in the followed sections of this supplement, see equation (S20). The condition minimizing DKL(𝒘n)D_{KL}({\boldsymbol{w}}_{n}) is generally implied by U~n(𝒀n,𝑿n,𝒁n,𝒘n)=0\tilde{U}_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})=0 provided the consistency of 𝜽¯n\bar{{\boldsymbol{\theta}}}_{n}^{*}, while the convergence of 𝒘n{\boldsymbol{w}}_{n}^{*} to a maximizer of π(𝒘n|𝑿n,𝒀n)\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}) is generally implied by the Monte Carlo nature of Algorithm 1. Therefore, by eq. (17) of the main text, if 𝒘n(k){\boldsymbol{w}}_{n}^{(k)} converges and U~n(𝒀n,𝑿n,𝒁n,𝒘n(k))\tilde{U}_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{(k)}) converges to 0, we would have 𝒘^n𝒘np0\|\hat{{\boldsymbol{w}}}_{n}^{*}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0 provided that the prior has been appropriately chosen such that 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is consistent for 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*} and g^(yi,xi,zi,𝒘^n)\hat{g}(y_{i},x_{i},z_{i},\hat{{\boldsymbol{w}}}_{n}^{*}) is consistent for 𝜽{\boldsymbol{\theta}}^{*}.

In summary, by Lemma S1, if we can choose an appropriate prior π(𝒘)\pi({\boldsymbol{w}}) such that 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is consistent for 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*}, then 𝒘n{\boldsymbol{w}}_{n}^{*} is also consistent for 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*} and g^(y,𝒙,z,𝒘n)\hat{g}(y,{\boldsymbol{x}},z,{\boldsymbol{w}}_{n}^{*}) is consistent for 𝜽{\boldsymbol{\theta}}^{*}. Establishing the consistency of 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} will be the focus of Section §\S1.2 of this supplement. Note that showing the consistency of 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is relatively simpler than directly working on 𝒘n{\boldsymbol{w}}_{n}^{*}, as 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is based on the complete data.

Then, following from the consistency of g^()\hat{g}(\cdot), we immediately have

1ni=1ng^(yi,xi,zi,𝒘n)𝜽p0,as n.\|\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*})-{\boldsymbol{\theta}}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}. (S5)

As a slight relaxation for the definition of G()G(\cdot), we can write equation (7) of the main text as

𝜽=limnG(𝒀n,𝑿n,𝒁n),{\boldsymbol{\theta}}^{*}=\lim_{n\to\infty}G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}), (S6)

where 𝒁n{\boldsymbol{Z}}_{n} is assumed to be known. By combining (S5) and (S6), we have

1ni=1ng^(yi,xi,zi,𝒘n)G(𝒀n,𝑿n,𝒁n)p0,as n,\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*})-G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\right\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$},

i.e., the EFI estimator 𝜽¯:=1ni=1ng^(yi,xi,zi,𝒘n)\bar{{\boldsymbol{\theta}}}^{*}:=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*}) is consistent for the inverse mapping G(𝒀n,𝑿n,𝒁n)G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}). Further, by Slutsky’s theorem, the uncertainty of 𝒁n{\boldsymbol{Z}}_{n} can be propagated to 𝜽{\boldsymbol{\theta}} via the EFI estimator. Therefore, the extended fiducial density (EFD) function of 𝜽{\boldsymbol{\theta}} can be approximated by

μ~n(d𝜽)=1k=1δ𝜽¯,k(d𝜽),as ,\tilde{\mu}_{n}(d{\boldsymbol{\theta}})=\frac{1}{\mathcal{M}}\sum_{k=1}^{\mathcal{M}}\delta_{\bar{{\boldsymbol{\theta}}}^{*,k}}(d{\boldsymbol{\theta}}),\quad\mbox{as $\mathcal{M}\to\infty$}, (S7)

where δa\delta_{a} stands for the Dirac measure at a given point aa, 𝜽¯,k:=1ni=1ng^(xi,yi,zi,k,𝒘n)\bar{{\boldsymbol{\theta}}}^{*,k}:=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(x_{i},y_{i},z_{i}^{*,k},{\boldsymbol{w}}_{n}^{*}), and 𝒁n,k:=(z1,k,z2,k,,zn,k){\boldsymbol{Z}}_{n}^{*,k}:=(z_{1}^{*,k},z_{2}^{*,k},\ldots,z_{n}^{*,k}) for k=1,2,,k=1,2,\ldots,\mathcal{M} denotes \mathcal{M} random draws from the distribution π(𝒁n|𝑿n,𝒀n,𝒘n)\pi({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}) under the limit setting of ϵ\epsilon.

§\S1.2 On the Consistency of g^()\hat{g}(\cdot) under the Large Model Scenario

In this section, we first show that 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is consistent under the framework of imputation-regularized optimization (IRO) algorithm (Liang et al., 2018a) by assuming that the true values of 𝒁n{\boldsymbol{Z}}_{n} are known and 𝒘n{\boldsymbol{w}}_{n} is subject to a mixture Gaussian prior. Subsequently, we show that g^()\hat{g}(\cdot) is consistent.

§\S1.2.1 An Auxiliary Stochastic Neural Network Model

To show 𝒘^n\hat{{\boldsymbol{w}}}_{n}^{*} is consistent, we introduce an auxiliary stochastic neural network (StoNet) model. For each of the hidden and output neurons of the model, we introduce a random noise:

u~l,i=j=0dl1wn,l,i,jul1,j+el,i:=vl,i+el,i,ul,i=Ψl(u~l,i),\begin{split}\tilde{u}_{l,i}&=\sum_{j=0}^{d_{l-1}}w_{n,l,i,j}u_{l-1,j}+e_{l,i}:=v_{l,i}+e_{l,i},\\ u_{l,i}&=\Psi_{l}(\tilde{u}_{l,i}),\\ \end{split} (S8)

where l{1,2,,H}l\in\{1,2,\ldots,H\} indexes the layers of the DNN, and i{1,2,,dl}i\in\{1,2,\ldots,d_{l}\} indexes the neurons at layer ll of the DNN, the random noise el,iN(0,σl2)e_{l,i}\sim N(0,\sigma_{l}^{2}), wn,l.i,jw_{n,l.i,j} denotes the weight on the connection from neuron ii of layer ll to neuron jj of layer l1l-1, and Ψl()\Psi_{l}(\cdot) denotes the activation function used for layer ll. Note that σl2\sigma_{l}^{2}’s are all known and pre-specified by the user, and l=0l=0 represents the input layer. As a consequence of introducing the random noise, we can treat U~l\tilde{U}_{l}’s as latent variables, where U~l=(u~l,1,u~l,2,,u~l,dl)\tilde{U}_{l}=(\tilde{u}_{l,1},\tilde{u}_{l,2},\ldots,\tilde{u}_{l,d_{l}})^{\top}.

When considering a dataset of size nn, we let u~l,i,(k)\tilde{u}_{l,i,(k)} denote the latent variable imputed for neuron ii of layer ll for observation k{1,2,,n}k\in\{1,2,\ldots,n\}, let U~l,(k)=(u~l,1,(k),u~l,2,(k),,u~l,dl,(k))\tilde{U}_{l,(k)}=(\tilde{u}_{l,1,(k)},\tilde{u}_{l,2,(k)},\ldots,\tilde{u}_{l,d_{l},(k)})^{\top}, and let 𝑼~l={U~l,(1),U~l,(2),,U~l,(n)}\tilde{{\boldsymbol{U}}}_{l}=\{\tilde{U}_{l,(1)},\tilde{U}_{l,(2)},\ldots,\tilde{U}_{l,(n)}\} denote the latent variables at layer ll for all nn observations. Recall that we have defined 𝒀n={y1,y2,,yn}{\boldsymbol{Y}}_{n}=\{y_{1},y_{2},\ldots,y_{n}\}, 𝑿n={𝒙1,𝒙2,,𝒙n}{\boldsymbol{X}}_{n}=\{{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2},\ldots,{\boldsymbol{x}}_{n}\}, and 𝒁n={z1,z2,,zn}{\boldsymbol{Z}}_{n}=\{z_{1},z_{2},\ldots,z_{n}\}. Given a set of pseudo-complete data (𝒀n,𝑼~H,,𝑼~1,𝑿n,𝒁n)({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}), we define the posterior distribution of 𝒘n{\boldsymbol{w}}_{n} as:

πϵ(𝒘n|𝒀n,𝑼~H,,𝑼~1,𝑿n,𝒁n)π(𝒘n)k=1npϵ(yk|𝒙k,zk,𝜽¯)l=1Hk=1nπl(U~l,(k)|U~l1,(k),𝒘n,l),\begin{split}&\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\propto\pi({\boldsymbol{w}}_{n})\prod_{k=1}^{n}p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}})\prod_{l=1}^{H}\prod_{k=1}^{n}\pi_{l}(\tilde{U}_{l,(k)}|\tilde{U}_{l-1,(k)},{\boldsymbol{w}}_{n,l}),\end{split} (S9)

where 𝒘n={𝒘n,1,𝒘n,2,,𝒘n,H}{\boldsymbol{w}}_{n}=\{{\boldsymbol{w}}_{n,1},{\boldsymbol{w}}_{n,2},\ldots,{\boldsymbol{w}}_{n,H}\} with 𝒘n,l{\boldsymbol{w}}_{n,l} being the connection weights for layer ll,

pϵ(yk|𝒙k,zk,𝜽¯)exp{(d(yk,xk,zk,𝜽¯)+η𝜽^k𝜽¯2)/ϵ},p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}})\propto\exp\{-(d(y_{k},x_{k},z_{k},\bar{{\boldsymbol{\theta}}})+\eta\|\hat{{\boldsymbol{\theta}}}_{k}-\bar{{\boldsymbol{\theta}}}\|^{2})/\epsilon\},

𝑼~0,(k)={yk,𝒙k,zk}\tilde{{\boldsymbol{U}}}_{0,(k)}=\{y_{k},{\boldsymbol{x}}_{k},z_{k}\}, 𝜽^k=U~H,(k)\hat{{\boldsymbol{\theta}}}_{k}=\tilde{U}_{H,(k)}, and πl()\pi_{l}(\cdot) represents a dld_{l}-dimensional multivariate Gaussian distribution.

Remark S1

It is interesting to point out that for the stochastic version of the EFI network, (𝐘n,𝐔~H,,𝐔~1)({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1}) forms a directed cyclic graph (DCG): 𝐘n𝐔~1𝐔~H𝐘n{\boldsymbol{Y}}_{n}\to\tilde{{\boldsymbol{U}}}_{1}\to\cdots\to\tilde{{\boldsymbol{U}}}_{H}\to{\boldsymbol{Y}}_{n}. See e.g. Sethuraman et al. (2023) for analysis of DCGs. For our case, 𝐘n{\boldsymbol{Y}}_{n} is known, which serves as intervention variables and greatly simplifies the problem. Specifically, as ϵ0\epsilon\downarrow 0, 𝛉^k\hat{{\boldsymbol{\theta}}}_{k}’s can be uniquely determined via pϵ()p_{\epsilon}(\cdot) for a given set of (𝐘n,𝐗n,𝐙n)({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}) and, therefore, (S9) is reduced to the posterior distribution for a conventional StoNet with (yk,𝐱k,zk)(y_{k},{\boldsymbol{x}}_{k},z_{k}) serving as the input and 𝛉^k\hat{{\boldsymbol{\theta}}}_{k} serving as the target output.

Assumption 3

(i) The activation function Ψl()\Psi_{l}(\cdot) used for each hidden neuron is cc^{\prime}-Lipschitz continuous for some constant cc^{\prime}; (ii) dllog(dl)n/log(n)d_{l}\log(d_{l})\prec n/\log(n) for l=0,1,2,,Hl=0,1,2,\ldots,H, where d0d_{0} denotes the dimension of (Y,X,Z)(Y,X,Z); (iii) the network’s depth HH and widths dld_{l}’s, for l=0,1,2,,Hl=0,1,2,\ldots,H, are all allowed to increase with the sample size nn.

Assumption 3-(i) has covered many commonly used activation functions such as ReLU, tanh, and sigmoid. Assumption 3-(ii) restricts the width of the DNN. To model a dataset of dimension d0d_{0}, we generally need a model of dimension dim(𝜽)d0dim({\boldsymbol{\theta}})\succeq d_{0}. Since we do not aim to impose any sparsity constraints on 𝜽{\boldsymbol{\theta}} in our main theory, we assume d0d_{0} also satisfies the constraint d0log(d0)nd_{0}\log(d_{0})\prec n such that dim(𝜽)ndim({\boldsymbol{\theta}})\prec n is still possible to hold.

Suppose Assumption 1 and Assumption 3 hold. Following Liang et al. (2022b) and Sun and Liang (2022), we can establish the existence of a small value τ(d1,d2,,dH)\tau(d_{1},d_{2},\ldots,d_{H}), as a function of d1,d2,,dHd_{1},d_{2},\ldots,d_{H}, such that if max{σ1,σ2,,σH}τ(d1,d2,,dH)\max\{\sigma_{1},\sigma_{2},\ldots,\sigma_{H}\}\prec\tau(d_{1},d_{2},\ldots,d_{H}) and ϵ\epsilon\to\infty, then

sup𝒘n𝒲n1n|logπϵ(𝒘n|𝒀n,𝑼~H,,𝑼~1,𝑿n,𝒁n)logπϵ(𝒘n|𝒀n,𝑿n,𝒁n)|p0,as n.\begin{split}\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}&\frac{1}{n}\Big{|}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})-\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\Big{|}\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}.\end{split} (S10)

In other words, the DNN model in the EFI network and stochastic DNN model introduced above have asymptotically the same loss function as long as σ1,,σH\sigma_{1},\ldots,\sigma_{H} are sufficiently small and ϵ\epsilon is sufficiently small.

Suppose that we want to estimate 𝒘n{\boldsymbol{w}}_{n} by maximizing the posterior distribution of the pseudo-complete data, i.e.,

𝒘^nu=argmax𝒘n{logπϵ(𝒘n|𝒀n,𝑼~H,,𝑼~1,𝑿n,𝒁n)}.\begin{split}\hat{{\boldsymbol{w}}}_{n}^{u}&=\arg\max_{{\boldsymbol{w}}_{n}}\Big{\{}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\Big{\}}.\end{split} (S11)

In the next subsection, we establish the consistency of 𝒘^nu\hat{{\boldsymbol{w}}}_{n}^{u}, as an estimator of 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*}, under appropriate conditions. Note that by Assumptions 1-2 and (S10), we have

𝒘^nu𝒘^np0,as n.\|\hat{{\boldsymbol{w}}}_{n}^{u}-\hat{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}. (S12)

§\S1.2.2 Consistency of the Sparse DNN Model Estimation

This section gives a constructive proof for the consistency of 𝒘^nu\hat{{\boldsymbol{w}}}_{n}^{u} based on the IRO algorithm (Liang et al., 2018a). To solve the optimization problem in (S11), the IRO algorithm starts with an initial weight setting 𝒘^n(0)\hat{{\boldsymbol{w}}}_{n}^{(0)} and then iterates between the imputation and regularized-optimization steps:

  • Imputation: For each block, conditioned on the current parameter estimate 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)}, simulate the latent variables (𝑼~H(t+1),,𝑼~1(t+1))(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}) from the predictive distribution

    π(𝑼~H(t+1),,𝑼~1(t+1)|𝒀n,𝑿n,𝒁n,𝒘^n(t))k=1npϵ(yk|𝒙k,zk,𝜽¯(t+1))×l=1Hk=1nπl(U~l,(k)(t+1)|U~l1,(k)(t+1),𝒘^n,l(t)),\begin{split}\pi(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t)})&\propto\prod_{k=1}^{n}p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}}^{(t+1)})\\ &\times\prod_{l=1}^{H}\prod_{k=1}^{n}\pi_{l}(\tilde{U}_{l,(k)}^{(t+1)}|\tilde{U}_{l-1,(k)}^{(t+1)},\hat{{\boldsymbol{w}}}_{n,l}^{(t)}),\\ \end{split} (S13)

    where tt indexes iterations, 𝜽¯(t+1)=1nk=1nU~H,(k)(t+1)\bar{{\boldsymbol{\theta}}}^{(t+1)}=\frac{1}{n}\sum_{k=1}^{n}\tilde{U}_{H,(k)}^{(t+1)}, and 𝒘^n,l(t)\hat{{\boldsymbol{w}}}_{n,l}^{(t)} denotes the component of 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)} corresponding to the weights at the ll-th layer.

  • Regularized-optimization: Given the pseudo-complete data {𝑼~H(t+1),,𝑼~1(t+1)\{\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}, 𝒀n{\boldsymbol{Y}}_{n}, 𝑿n{\boldsymbol{X}}_{n}, 𝒁n}{\boldsymbol{Z}}_{n}\}, update 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)} by maximizing the penalized log-likelihood function:

    𝒘^n(t+1)=argmax𝒘n{logπ(𝒀n,𝑼~H(t+1),,𝑼~1(t+1)|𝑿n,𝒁n,𝒘n)+logπ(𝒘n)},\begin{split}\hat{{\boldsymbol{w}}}_{n}^{(t+1)}&=\arg\max_{{\boldsymbol{w}}_{n}}\Big{\{}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})+\log\pi({\boldsymbol{w}}_{n})\Big{\}},\end{split} (S14)

    which, by the decomposition (S9), can be reduced to solving for 𝒘^n,1(t+1),,𝒘^n,H(t+1)\hat{{\boldsymbol{w}}}_{n,1}^{(t+1)},\ldots,\hat{{\boldsymbol{w}}}_{n,H}^{(t+1)}, separately. The penalty function logπ(𝒘n)\log\pi({\boldsymbol{w}}_{n}) should be chosen such that 𝒘^n(t+1)\hat{{\boldsymbol{w}}}_{n}^{(t+1)} forms a consistent estimator for the working true parameter

    𝒘n,(t+1)=argmax𝒘n𝔼𝒘^n(t)logπ(𝒀n,𝑼~H(t+1),,𝑼~1(t+1)|𝑿n,𝒁n,𝒘n)=argmax𝒘nlogπ(𝒀n,𝑼~H(t+1),,𝑼~1(t+1)|𝑿n,𝒁n,𝒘n)×π(𝑼~H(t+1),,𝑼~1(t+1)|𝒀n,𝑿n,𝒁n,𝒘^n(t))πϵ(𝒀n|𝑿n,𝒁n,𝒘~n)d𝑼~H(t+1)d𝑼~1(t+1)d𝒀n,\begin{split}&{\boldsymbol{w}}_{n,*}^{(t+1)}=\arg\max_{{\boldsymbol{w}}_{n}}\mathbb{E}_{\hat{{\boldsymbol{w}}}_{n}^{(t)}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\\ &=\arg\max_{{\boldsymbol{w}}_{n}}\int\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\\ &\quad\times\pi(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t)})\pi_{\epsilon}({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})d\tilde{{\boldsymbol{U}}}_{H}^{(t+1)}\cdots d\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}d{\boldsymbol{Y}}_{n},\end{split} (S15)

    where the likelihood function of the latent variables (𝑼~H(t+1),,𝑼~1(t+1))(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}) is evaluated at 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)}, and 𝒘~n\tilde{{\boldsymbol{w}}}_{n}^{*} corresponds to the true parameters of the underlying sparse DNN model.

To prove the consistency of 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)} as nn and tt approach to infinity, we need Assumptions 4-6 as specified below. For a matrix 𝚺{\boldsymbol{\Sigma}}, we define the mm-sparse minimal eigenvalues as follows:

ϕmin(m|𝚺)=min𝜷:𝜷0m𝜷𝚺𝜷𝜷𝜷,\phi_{\rm\min}(m|{\boldsymbol{\Sigma}})=\min_{{\boldsymbol{\beta}}:\|{\boldsymbol{\beta}}\|_{0}\leq m}\frac{{\boldsymbol{\beta}}^{\top}{\boldsymbol{\Sigma}}{\boldsymbol{\beta}}}{{\boldsymbol{\beta}}^{\top}{\boldsymbol{\beta}}},

which represents the minimal eigenvalues of any m×mm\times m-dimensional principal submatrix, and 𝜷0\|{\boldsymbol{\beta}}\|_{0} denotes the number of nonzero elements in 𝜷{\boldsymbol{\beta}}. Let 𝚺l,i(t)dl1×dl1{\boldsymbol{\Sigma}}_{l,i}^{(t)}\in\mathbb{R}^{d_{l-1}\times d_{l-1}} denote the sample covariance matrix of the input variables for the regression formed for neuron ii of layer ll at iteration tt, and let sl,i(t)s_{l,i}^{(t)} denote the size of the true regression model as implied by the working true parameter 𝒘n,(t){\boldsymbol{w}}_{n,*}^{(t)}.

Assumption 4

(i) All the variables {𝐗n,𝐘n,𝐙n(t)}\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{(t)}\} are uniformly bounded, where 𝐙n(t){\boldsymbol{Z}}_{n}^{(t)} denotes the imputed values of 𝐙n{\boldsymbol{Z}}_{n} at iteration tt; (ii) there exist some constants κ0,1\kappa_{0,1} and s1,i(t)s~1,i(t)min{d0,n}s_{1,i}^{(t)}\leq\tilde{s}_{1,i}^{(t)}\leq\min\{d_{0},n\} such that ϕmin(s~1,i(t)|𝚺1,i(t))κ0,1\phi_{\rm\min}(\tilde{s}_{1,i}^{(t)}|{\boldsymbol{\Sigma}}_{1,i}^{(t)})\geq\kappa_{0,1} holds uniformly for any neuron i{1,2,,d1}i\in\{1,2,\ldots,d_{1}\} and any iteration t{1,2,,T}t\in\{1,2,\ldots,T\}, where 𝚺1,i(t){\boldsymbol{\Sigma}}_{1,i}^{(t)} denotes the covariance matrix of {𝐗n,𝐘n,𝐙n(t)}\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{(t)}\}.

Assumption 4-(i) restricts the pseudo-complete data (𝒙,𝒚,𝒛)({\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}) to be uniformly bounded. To satisfy this condition, we can add a data transformation/normalization layer to the DNN model, ensuring that the transformed input values fall within the bounded set. Specifically, the transformation/normalization layer can form a bijective mapping and contain no tuning parameters. For example, when dealing with standard Gaussian random variables, we can transform them to be uniform over (0,1) via the probability integral transformation Φ()\Phi(\cdot), the cumulative distribution function (CDF) of the standard Gaussian random variable.

Assumption 4-(ii) is natural for the problem. As implied by Assumption 3-(ii), the upper bound s1,i(t)s~1,i(t)min{d0,n}s_{1,i}^{(t)}\leq\tilde{s}_{1,i}^{(t)}\leq\min\{d_{0},n\} can always hold. For other layers, the upper bound is also true by Assumption 3-(ii), and the sparse eigenvalue property can be directly established, see Lemma S3 below.

Lemma S2

Consider a random matrix 𝐔n×d{\boldsymbol{U}}\in\mathbb{R}^{n\times d} with ndn\geq d. Suppose that the eigenvalues of 𝐔𝐔{\boldsymbol{U}}^{\top}{\boldsymbol{U}} are upper bounded, i.e., λmax(𝐔𝐔)nκmax\lambda_{\max}({\boldsymbol{U}}^{\top}{\boldsymbol{U}})\leq n\kappa_{\max} for some constant κmax>0\kappa_{\max}>0. Let Ψ(𝐔)\Psi({\boldsymbol{U}}) denote an elementwise transformation of 𝐔{\boldsymbol{U}}. Then

λmax((Ψ(𝑼))(Ψ(𝑼)))nκmax,\lambda_{\max}\left((\Psi({\boldsymbol{U}}))^{\top}(\Psi({\boldsymbol{U}}))\right)\leq n\kappa_{\max}, (S16)

for the tanh, sigmoid and ReLU transformations.

  • Proof:

    For ReLU, (S16) follows from Lemma 5 of Dittmer et al. (2018). For tanh and sigmoid, since they are Lipschitz continuous with a Lipschitz constant of 1, Lemma 5 of Dittmer et al. (2018) also applies.  \Box

Lemma S3

Consider an auxiliary stochastic neural network as defined in (S8) with an activation function tanh, sigmoid, or ReLU. Then for any any layer l{2,3,,H}l\in\{2,3,\ldots,H\}, neuron i{1,2,,dl}i\in\{1,2,\ldots,d_{l}\}, and iteration t{1,2,,T}t\in\{1,2,\ldots,T\}, there exists a number s~l,i(t)\tilde{s}_{l,i}^{(t)} such that sl,i(t)s~l,i(t)min{dl1,n}s_{l,i}^{(t)}\leq\tilde{s}_{l,i}^{(t)}\leq\min\{d_{l-1},n\} and ϕmin(s~l,i(t)|𝚺l,i(t))κ0,l\phi_{\rm\min}(\tilde{s}_{l,i}^{(t)}|{\boldsymbol{\Sigma}}_{l,i}^{(t)})\geq\kappa_{0,l} hold.

  • Proof:

    We use U~ldl\tilde{U}_{l}\in\mathbb{R}^{d_{l}} and VldlV_{l}\in\mathbb{R}^{d_{l}} to denote generic vectors corresponding to the ll-th layer, where U~k=Vl+𝒆l\tilde{U}_{k}=V_{l}+{\boldsymbol{e}}_{l}. Additionally, we use 𝑼~ln×dl\tilde{{\boldsymbol{U}}}_{l}\in\mathbb{R}^{n\times d_{l}} and 𝑽ln×dl{\boldsymbol{V}}_{l}\in\mathbb{R}^{n\times d_{l}} to denote the matrices (for all observations) corresponding to the ll-th layer. For the auxiliary stochastic neural network, since σl2\sigma_{l}^{2}’s have been set to very small values, it follows from (S8) that

    Ψ(U~l)Ψ(Vl)+VlΨ(Vl)𝒆l,l=1,2,,H1,\Psi(\tilde{U}_{l})\approx\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l},\quad l=1,2,\ldots,H-1,

    where \circ denotes elementwise product, Vl=(vl,1,vl,2,,vl,dl)V_{l}=(v_{l,1},v_{l,2},\ldots,v_{l,d_{l}})^{\top}, and 𝒆l=(el,1,el,2,,el,dl){\boldsymbol{e}}_{l}=(e_{l,1},e_{l,2},\ldots,e_{l,d_{l}})^{\top}. Then, for any i{1,2,,dl+1}i\in\{1,2,\ldots,d_{l+1}\},

    𝚺l+1,iVar(𝔼(Ψ(Vl)+VlΨ(Vl)𝒆l|Vl))+𝔼(Var(Ψ(Vl)+VlΨ(Vl)𝒆l|Vl))=Var(Ψ(Vl))+Diag{σl2𝔼[VlΨ(Vl)VlΨ(Vl)]},\begin{split}{\boldsymbol{\Sigma}}_{l+1,i}&\approx{\rm Var}(\mathbb{E}(\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l}|V_{l}))+\mathbb{E}({\rm Var}(\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l}|V_{l}))\\ &={\rm Var}(\Psi(V_{l}))+{\rm Diag}\left\{\sigma_{l}^{2}\mathbb{E}[\nabla_{V_{l}}\Psi(V_{l})\circ\nabla_{V_{l}}\Psi(V_{l})]\right\},\\ \end{split} (S17)

    where Diag{𝒂}{\rm Diag}\{{\boldsymbol{a}}\}, 𝒂dl{\boldsymbol{a}}\in\mathbb{R}^{d_{l}}, denotes a dl×dld_{l}\times d_{l} diagonal matrix with the diagonal elements given by the vector 𝒂{\boldsymbol{a}}.

    By induction, building on Assumption 4-(i) and Lemma S2, we can assume that the maximum eigenvalue of (Ψ(𝑼~l1))Ψ(𝑼~l1)(\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1}) is upper bounded by nκmaxn\kappa_{\max}. By an extension of Ostrowski’s theorem, see Theorem 3.2 of Higham and Cheng (1998), we have

    λmax(𝑽lT𝑽l)=max𝒖=1𝒖𝒘n,l(Ψ(𝑼~l1))Ψ(𝑼~l1)𝒘n,l𝒖λmax((Ψ(𝑼~l1))Ψ(𝑼~l1))max𝒖=1𝒖𝒘n,l𝒘n,l𝒖=nκmaxτmax,\begin{split}\lambda_{\max}({\boldsymbol{V}}_{l}^{T}{\boldsymbol{V}}_{l})&=\max_{\|{\boldsymbol{u}}\|=1}{\boldsymbol{u}}^{\top}{\boldsymbol{w}}_{n,l}(\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1}){\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{u}}\\ &\leq\lambda_{\max}((\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))\max_{\|{\boldsymbol{u}}\|=1}{\boldsymbol{u}}^{\top}{\boldsymbol{w}}_{n,l}{\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{u}}\\ &=n\kappa_{\max}\tau_{\max},\\ \end{split}

    where the existence of the upper bound λmax(𝒘n,l𝒘n,l)τmax\lambda_{\max}({\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{w}}_{n,l})\leq\tau_{\max} follows from the boundedness of 𝒘n{\boldsymbol{w}}_{n} as implied by Assumption 1-(i). By choosing 𝒖{\boldsymbol{u}} as a one-hot vector, it is easy to see that for any i{1,2,,dl}i\in\{1,2,\ldots,d_{l}\},

    j=1nVl,i,j2λmax(𝑽l𝑽l)nκmaxτmax,\sum_{j=1}^{n}V_{l,i,j}^{2}\leq\lambda_{\max}({\boldsymbol{V}}_{l}^{\top}{\boldsymbol{V}}_{l})\leq n\kappa_{\max}\tau_{\max}, (S18)

    where Vl,i,jV_{l,i,j} denotes the (j,i)(j,i)th element of 𝑽l{\boldsymbol{V}}_{l}. This further implies, as nn\to\infty,

    𝔼(Vl,i,j2)κmaxτmax.\mathbb{E}(V_{l,i,j}^{2})\leq\kappa_{\max}\tau_{\max}. (S19)

    In words, Vl,i,jV_{l,i,j} has bounded mean and variance.

    By Markov’s inequality, we can bound Vl,i,jV_{l,i,j} to a closed interval with a high probability, i.e., P(|Vl,i,j|C)1κmaxτmax/C2P(|V_{l,i,j}|\leq C)\geq 1-\kappa_{\max}\tau_{\max}/C^{2} for some large constant CC. Therefore, for any activation function which has nonzero gradients on any closed interval, e.g., tanh and sigmoid, there exists a constant c>0c>0 such that

    𝔼[Vl,i,jΨ(Vl,i,j)]2c.\mathbb{E}[\nabla_{V_{l,i,j}}\Psi(V_{l,i,j})]^{2}\geq c.

    which implies ϕmin(dl|𝚺l)cσl2\phi_{\min}(d_{l}|{\boldsymbol{\Sigma}}_{l})\geq c\sigma_{l}^{2}.

    For the ReLU activation function, if a hidden neuron belongs to the true neuron set at iteration tt (as determined by 𝒘n,(t){\boldsymbol{w}}_{n,*}^{(t)}), then Ψ(Vl,i,j(t))\Psi(V_{l,i,j}^{(t)}) cannot be constantly 0 over all nn samples. Therefore, it is reasonable to assume that there exists a threshold qmin(0,1)q_{\rm min}\in(0,1) such that 𝔼[Vl,i,jΨ(Vl,i,j(t))]2qmin\mathbb{E}[\nabla_{V_{l,i,j}}\Psi(V_{l,i,j}^{(t)})]^{2}\geq q_{\rm min} for any true neuron in all iterations. Under this assumption, we would at least have

    ϕmin(|𝒔l(t)||𝚺l)σl2qmin,l=1,2,,h;t=1,2,,T,\phi_{\rm min}(|{\boldsymbol{s}}_{l}^{(t)}||{\boldsymbol{\Sigma}}_{l})\geq\sigma_{l}^{2}q_{\rm min},\quad l=1,2,\ldots,h;\ \ t=1,2,\ldots,T,

    where 𝒔l(t)𝑺(t){\boldsymbol{s}}_{l}^{(t)}\subset{\boldsymbol{S}}^{(t)} denotes the set of true neurons at layer ll.  \Box

To study the property of the coefficient estimator for each regression formed in the auxiliary stochastic neural network, we introduce the following lemma, which is a restatement of Theorem 3.4 of Song and Liang (2023).

Lemma S4

(Theorem 3.4; Song and Liang (2023)) Consider a linear regression

𝒚=𝑿𝜷+σϵ,{\boldsymbol{y}}={\boldsymbol{X}}{\boldsymbol{\beta}}+\sigma{\boldsymbol{\epsilon}},

where 𝐲n{\boldsymbol{y}}\in\mathbb{R}^{n}, 𝐗n×pn{\boldsymbol{X}}\in\mathbb{R}^{n\times p_{n}}, 𝛃pn{\boldsymbol{\beta}}\in\mathbb{R}^{p_{n}}, σ>0\sigma>0, and ϵ𝒩(0,Ipn){\boldsymbol{\epsilon}}\sim\mathcal{N}(0,I_{p_{n}}) is Gaussian noise. Suppose that the model satisfies the following conditions:

  • (A1)

    (i) All the covariates are uniformly bounded; (ii) the dimensionality can be high with pnnp_{n}\geq n; and (iii) there exists some integer p¯\bar{p} (depending on nn and pnp_{n}) and a fixed constant κ0\kappa_{0} such that p¯sn\bar{p}\succ s_{n} and λmin(𝑿ξ𝑿ξ)nκ0\lambda_{\rm\min}({\boldsymbol{X}}_{\xi}^{\top}{\boldsymbol{X}}_{\xi})\geq n\kappa_{0} for any subset model |ξ|p¯|\xi|\leq\bar{p}, where sns_{n} denotes the size of the true model ξ\xi^{*}, and λmin()\lambda_{\rm\min}(\cdot) denotes the minimum eigenvalue of a square matrix. Let 𝜷^ξ\hat{{\boldsymbol{\beta}}}_{\xi} denote the MLE of the true model.

  • (A2)

    (i) snlogpnns_{n}\log p_{n}\prec n; (ii) max{|βj/σ|:j=1,2,,pn}γ3En\max\{|\beta_{j}^{*}/\sigma^{*}|:j=1,2,\ldots,p_{n}\}\leq\gamma_{3}E_{n}, where βj\beta_{j}^{*}’s and σ\sigma^{*} denote the true parameter values of the regression model, γ3(0,1)\gamma_{3}\in(0,1) is a fixed constant, and EnE_{n} is nondecreasing with respect to nn.

  • (A3)

    Let each component of 𝜷{\boldsymbol{\beta}} be subject to the following mixture Gaussian prior distribution

    βj/σ(1ρ)𝒩(0,σ~02)+ρ𝒩(0,σ~12),j=1,2,,pn,\beta_{j}/\sigma^{*}\sim(1-\rho)\mathcal{N}(0,\tilde{\sigma}_{0}^{2})+\rho\mathcal{N}(0,\tilde{\sigma}_{1}^{2}),\quad j=1,2,\ldots,p_{n},

    where En/σ~12+logσ~1logpnE_{n}/\tilde{\sigma}_{1}^{2}+\log\tilde{\sigma}_{1}\asymp\log p_{n}, ρ=1/pn1+u\rho=1/p_{n}^{1+u}, and σ~0an/2(1+u)logpn\tilde{\sigma}_{0}\leq a_{n}/\sqrt{2(1+u)\log p_{n}} for some constant u>1u>1 and sequence an1/(nsnlogpn)/pna_{n}\prec\sqrt{1/(ns_{n}\log p_{n})}/p_{n}. Additionally, snEnsnlogpn/ns_{n}E_{n}\sqrt{s_{n}\log p_{n}/n} σ~12\prec\tilde{\sigma}_{1}^{2} and minjξ|βj|M1logpn/n\min_{j\in\xi^{*}}|\beta_{j}^{*}|\geq M_{1}\sqrt{\log p_{n}/n} for some sufficiently large M1>0M_{1}>0.

Let 𝛃^n\hat{{\boldsymbol{\beta}}}_{n} denote the MAP estimator of 𝛃{\boldsymbol{\beta}}. Then there exists a constant cc such that

𝔼𝜷^n𝜷2=c(σ)2𝑿ξ𝑿ξ1c(σ)2nκ0,\mathbb{E}\|\hat{{\boldsymbol{\beta}}}_{n}-{\boldsymbol{\beta}}^{*}\|^{2}=c(\sigma^{*})^{2}\|{\boldsymbol{X}}_{\xi^{*}}^{\top}{\boldsymbol{X}}_{\xi^{*}}\|^{-1}\leq\frac{c(\sigma^{*})^{2}}{n\kappa_{0}},

Specifically, with dominating probability, we have 𝛃^n,j=𝛃^ξ,j\hat{{\boldsymbol{\beta}}}_{n,j}=\hat{{\boldsymbol{\beta}}}_{\xi^{*},j^{\prime}} if jξj\in\xi^{*} and 0 otherwise, where 𝛃^n,j\hat{{\boldsymbol{\beta}}}_{n,j} denotes the jjth element of 𝛃^n\hat{{\boldsymbol{\beta}}}_{n}, and 𝛃^ξ,j\hat{{\boldsymbol{\beta}}}_{\xi^{*},j^{\prime}} denotes the element of 𝛃^ξ\hat{{\boldsymbol{\beta}}}_{\xi^{*}} that corresponds to 𝛃n,j{\boldsymbol{\beta}}_{n,j}.

It is easy to see that under Assumption 1, Assumption 3, and Assumption 4, each linear regression formed in the stochastic neural network satisfies conditions (A1) and (A2) of Lemma S4. In particular, we can set EnE_{n} as the diameter of 𝒲n\mathcal{W}_{n}. To ensure the condition (A3) to be satisfied, we make the following assumption:

Assumption 5

Let each connection weight wn,l,i,jw_{n,l,i,j} be subject to the following mixture Gaussian priro distribution:

wn,l,i,j/σl(1ρ)𝒩(0,σ~0,l2)+ρ𝒩(0,σ~1,l2),l=1,2,,H,i=1,2,,dl,j=1,2,,dl1,w_{n,l,i,j}/\sigma_{l}\sim(1-\rho)\mathcal{N}(0,\tilde{\sigma}_{0,l}^{2})+\rho\mathcal{N}(0,\tilde{\sigma}_{1,l}^{2}),\ \ l=1,2,\ldots,H,\ \ i=1,2,\ldots,d_{l},\ \ j=1,2,\ldots,d_{l-1},

where we set ρ\rho, σ~0,l\tilde{\sigma}_{0,l} and σ~1,l\tilde{\sigma}_{1,l} such that ρ=1/dl11+u\rho=1/d_{l-1}^{1+u}, σ~0,l1/(dl13/2log(dl1)2(1+u)n)\tilde{\sigma}_{0,l}\prec 1/(d_{l-1}^{3/2}\log(d_{l-1})\sqrt{2(1+u)n}), and En/σ~1,l2+logσ~1,llogdl1E_{n}/\tilde{\sigma}_{1,l}^{2}+\log\tilde{\sigma}_{1,l}\asymp\log d_{l-1} for some constant u>1u>1 and any l=1,2,,Hl=1,2,\ldots,H. Additionally, there exists some constant M1>0M_{1}>0 such that min1idl,jξi|wn,l,i,j|M1logdl1/n\min_{1\leq i\leq d_{l},j\in\xi_{i}^{*}}|w_{n,l,i,j}^{*}|\geq M_{1}\sqrt{\log d_{l-1}/n}.

It is easy to verify that the condition (A3) holds under Assumption 3-(ii) and Assumption 5. In addition, the ww-min condition, i.e., min1idl,jξi|wn,l,i,j|M1logdl1/n\min_{1\leq i\leq d_{l},j\in\xi_{i}^{*}}|w_{n,l,i,j}^{*}|\geq M_{1}\sqrt{\log d_{l-1}/n}, is rather weak and can be generally satisfied as nn becomes large.

Theorem S1

Suppose that a mixture Gaussian penalty is imposed on the weights of the DNN model in the EFI network and Assumptions 1 and 3-5 hold. Furthermore, suppose l=1Hdlσl2/σl12n\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2}\prec n holds, where σ0=O(1)\sigma_{0}=O(1) represents a constant. Then there exist a constant cc such that

E𝒘^n(t)𝒘n,(t)2cnl=1Hdlσl2σl12:=rno(1).E\|\hat{{\boldsymbol{w}}}_{n}^{(t)}-{\boldsymbol{w}}_{n,*}^{(t)}\|^{2}\leq\frac{c}{n}\sum_{l=1}^{H}d_{l}\frac{\sigma_{l}^{2}}{\sigma_{l-1}^{2}}:=r_{n}\prec o(1).
  • Proof:

    By the above analysis, each regression formed in the stochastic neural network satisfies the conditions of Lemma S4. Therefore, the sparse eigenvalue lower bounds established in Lemma S3 hold for the stochastic neural network. Further, by summarizing the l2l_{2}-errors of coefficient estimation for all l=1Hdl\sum_{l=1}^{H}d_{l} linear regressions, we can conclude the proof.  \Box

It is important to note that the condition l=1Hdlσl2/σl12n\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2}\prec n allows the width of each layer of the neural network to increase with nn at a rate as high as O(n/log(n))O(n/\log(n)). This accommodates the scenarios where dim(𝜽)=O(nζ)dim({\boldsymbol{\theta}})=O(n^{\zeta}) for some 12ζ<1\frac{1}{2}\leq\zeta<1. In this case, we have dH=dim(𝜽)d_{H}=dim({\boldsymbol{\theta}}) for the DNN model in the EFI network, which enables the uncertainty of 𝜽{\boldsymbol{\theta}} to be properly quantified as implied by Theorem 3.1 proved below.

Further, let’s consider the mapping M(𝒘n)M({\boldsymbol{w}}_{n}) as defined in (S15), i.e.,

M(𝒘n)=argmax𝒘n𝔼𝒘nlogπ(𝒀n,𝑼~H,,𝑼~1|𝑿n,𝒁n,𝒘n).M({\boldsymbol{w}}_{n})=\arg\max_{{\boldsymbol{w}}_{n}^{\prime}}\mathbb{E}_{{\boldsymbol{w}}_{n}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{\prime}).

As argued in Liang et al. (2018a) and Nielsen (2000), it is reasonable to assume that the mapping is contractive. A recursive application of the mapping, i.e., setting 𝒘^n(t+1)=𝒘n,(t+1)=M(𝒘^n(t))\hat{{\boldsymbol{w}}}_{n}^{(t+1)}={\boldsymbol{w}}_{n,*}^{(t+1)}=M(\hat{{\boldsymbol{w}}}_{n}^{(t)}), leads to a monotone increase of the target expectations

𝔼𝒘^n(t)logπ(𝒀n,𝑼~H(t+1),,𝑼~1(t+1)|𝑿n,𝒁n,𝒘^n(t+1))\mathbb{E}_{\hat{{\boldsymbol{w}}}_{n}^{(t)}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t+1)})

for t=1,2,,Tt=1,2,\ldots,T.

Assumption 6

The mapping M(𝐰n)M({\boldsymbol{w}}_{n}) is differentiable. Let λmax(M𝐰n)\lambda_{\rm\max}(M_{{\boldsymbol{w}}_{n}}) be the largest singular value of M(𝐰n)/𝐰n\partial M({\boldsymbol{w}}_{n})/\partial{\boldsymbol{w}}_{n}. There exists a number λ<1\lambda^{*}<1 such that λmax(M𝐰n)λ\lambda_{\rm\max}(M_{{\boldsymbol{w}}_{n}})\leq\lambda^{*} for all 𝐰n𝒲n{\boldsymbol{w}}_{n}\in\mathcal{W}_{n} for sufficiently large nn and almost every training dataset DnD_{n}.

Theorem S2

Suppose that a mixture Gaussian penalty is imposed on the weights of the DNN model in the EFI network; Assumptions 1 and 3-6 hold; ϵ\epsilon is sufficiently small; and l=1Hdlσl2/σl12\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2} n\prec n holds, where max{σ1,σ2,,σH}τ(d1,d2,,dH)\max\{\sigma_{1},\sigma_{2},\ldots,\sigma_{H}\}\prec\tau(d_{1},d_{2},\ldots,d_{H}) as defined in Section §\S1.2.1 and σ0=O(1)\sigma_{0}=O(1) represents a constant. Then 𝐰^n(t)𝐰~np0\|\hat{{\boldsymbol{w}}}_{n}^{(t)}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0 for sufficiently large nn and sufficiently large tt and almost every training dataset DnD_{n}.

  • Proof:

    This lemma directly follows from Theorem 4 of Liang et al. (2018a) that the estimator 𝒘^n(t)\hat{{\boldsymbol{w}}}_{n}^{(t)} is consistent when both nn and tt are sufficiently large.  \Box

In summary, we have given a constructive proof for the consistency of sparse stochastic neural network based on the sparse learning theory developed for high-dimensional linear regression. Our proof implies that under the conditions of Theorem S2, such a consistent estimator can also be obtained by a direct calculation of 𝒘^nu\hat{{\boldsymbol{w}}}_{n}^{u} as defined in (S11). Furthermore, it follows from (S12) that

𝒘^n𝒘~np0,as n.\|\hat{{\boldsymbol{w}}}_{n}^{*}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$.} (S20)
Proof of Theorem 3.1
  • Proof:

    As a summary of Lemma S1 and (S20), we have

    𝒘n𝒘~np0,as n,\|{\boldsymbol{w}}_{n}^{*}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$},

    by setting σ1=σ2==σHτ(d1,d2,,dH)\sigma_{1}=\sigma_{2}=\cdots=\sigma_{H}\prec\tau(d_{1},d_{2},\ldots,d_{H}). Subsequently, g^(y,𝒙,z,𝒘n)\hat{g}(y,{\boldsymbol{x}},z,{\boldsymbol{w}}_{n}^{*}) constitutes a consistent estimator of 𝜽{\boldsymbol{\theta}}, and so does G(𝒀n,𝑿n,𝒁n)G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}) following the arguments in Section §\S1.1.  \Box

In summary, through the introduction of an auxiliary stochastic neural network model and the utilization of the convergence theory of the IRO algorithm, we have justified the consistency of the sparse DNN model under mild conditions.

Remark S2

The result presented in Theorem 3.1 is interesting: we can set the width of each layer of the DNN model to be of order O(nα)O(n^{\alpha}) for some 1/2α<11/2\leq\alpha<1 and set its depth to be O(nα)O(n^{\alpha^{\prime}}) for some 0<α<1α0<\alpha^{\prime}<1-\alpha. Given the approximation ability of the sparse DNN model as studied in Sun et al. (2022), where it was proven that a neural network of size O(nα~)O(n^{\tilde{\alpha}}) for some 0<α~<10<\tilde{\alpha}<1 has been large enough for approximating many classes of functions, EFI can be used for uncertainty quantification for deep neural networks if their sizes are appropriately chosen. Specifically, for the Double-NN approach, we can set the size of the first neural network (for approximation of the inverse function 𝛉=G(𝐘n,𝐗n,𝐙n){\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})) to be large under the constraint l=1Hdln\sum_{l=1}^{H}d_{l}\prec n, and set the second neural network (for approximation of the function 𝐘n=f(𝐗n,𝐙n;𝛉){\boldsymbol{Y}}_{n}=f({\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n};{\boldsymbol{\theta}})) to be relatively small with the size O(nα~)O(n^{\tilde{\alpha}}) for some 0<α~<10<\tilde{\alpha}<1. By the theory developed in this paper, the uncertainty of the second neural network can still be correctly quantified using EFI.

Appendix §\S2 CQR Method

For a test point 𝒙{\boldsymbol{x}} belonging to the class c\mathcal{I}_{c}, CQR aims to find the intervals C^1(𝒙)\hat{C}_{1}({\boldsymbol{x}}) and C^ITE(𝒙)\hat{C}_{ITE}({\boldsymbol{x}}) such that

P(Y(1;𝒙)C^1(𝒙))1α,P(Y(1;𝒙)Yobs(0;𝒙)C^ITE(𝒙))1α.P(Y(1;{\boldsymbol{x}})\in\hat{C}_{1}({\boldsymbol{x}}))\geq 1-\alpha,\quad P(Y(1;{\boldsymbol{x}})-Y^{obs}(0;{\boldsymbol{x}})\in\hat{C}_{ITE}({\boldsymbol{x}}))\geq 1-\alpha.

Similarly, for a test point 𝒙{\boldsymbol{x}} belonging to the class t\mathcal{I}_{t}, CQR aims to find the intervals C^0(𝒙)\hat{C}_{0}({\boldsymbol{x}}) and C^ITE(𝒙)\hat{C}_{ITE}({\boldsymbol{x}}) such that

P(Y(0;𝒙)C^0(𝒙))1α,P(Yobs(1;𝒙)Y(0;𝒙)C^ITE(𝒙))1α.\begin{split}P(Y(0;{\boldsymbol{x}})\in\hat{C}_{0}({\boldsymbol{x}}))&\geq 1-\alpha,\quad P(Y^{obs}(1;{\boldsymbol{x}})-Y(0;{\boldsymbol{x}})\in\hat{C}_{ITE}({\boldsymbol{x}}))\geq 1-\alpha.\end{split}

To achieve the above goals, CQR initially splits the training dataset 𝒳train\mathcal{X}_{train} to (𝒳train,𝒳valid)(\mathcal{X}_{train},\mathcal{X}_{valid}). A quantile regression model, such as BART or random forest, is trained on 𝒳train\mathcal{X}_{train}. Denote the output of the quantile regression by [q^α/2(t,𝒙),q^1α/2(t,𝒙)][\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}),\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}})], and calculate the score si(t)=max(q^α/2(t,𝒙i)yi(t),yi(t)q^1α/2(t,𝒙i))s_{i}(t)=max(\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}_{i})-y_{i}(t),y_{i}(t)-\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}}_{i})) for each 𝒙i𝒳valid{\boldsymbol{x}}_{i}\in\mathcal{X}_{valid}. Note that sis_{i} acts like a residual derived from the validation set. Let s^(t,α)=Quantile({si(t)}𝒙i𝒳valid;[(n+1)α]n])\hat{s}(t,\alpha)=Quantile(\{s_{i}(t)\}_{{\boldsymbol{x}}_{i}\in\mathcal{X}_{valid}};[\frac{(n+1)\alpha]}{n}]). Then C^t(𝒙j)\hat{C}_{t}({\boldsymbol{x}}_{j}) given below is the conformal prediction interval for any 𝒙j𝒳test{\boldsymbol{x}}_{j}\in\mathcal{X}_{test}:

C^t(𝒙j)=[q^α/2(t,𝒙j)s^(t,1α),q^1α/2(t,𝒙j)+s^(t,1α)]=[Y^L(t;𝒙j),Y^R(t;𝒙j)]\begin{split}\hat{C}_{t}({\boldsymbol{x}}_{j})&=[\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}_{j})-\hat{s}(t,1-\alpha),\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}}_{j})+\hat{s}(t,1-\alpha)]\\ &=[\hat{Y}^{L}(t;{\boldsymbol{x}}_{j}),\hat{Y}^{R}(t;{\boldsymbol{x}}_{j})]\end{split}

For a more refined approach, one can consider incorporating propensity scores as weights as discussed in Tibshirani et al. (2019).

Case (i) and Case (ii) described in Section 5.1 can be addressed by the above approach. However, for Case (iii) there, the conformal prediction needs further steps. First, construct a pair of prediction intervals at level 1α/21-\alpha/2; namely, [Y^L(1;𝒙),Y^R(1;𝒙)][\hat{Y}^{L}(1;{\boldsymbol{x}}),\hat{Y}^{R}(1;{\boldsymbol{x}})] for Y(1)Y(1) and [Y^L(0;𝒙),Y^R(0;𝒙)][\hat{Y}^{L}(0;{\boldsymbol{x}}),\hat{Y}^{R}(0;{\boldsymbol{x}})] for Y(0)Y(0). Then, construct an interval for ITE as follows:

C^ITEnaive(𝒙)=[Y^L(1;𝒙)Y^R(0;𝒙),Y^R(1;𝒙)Y^L(0;𝒙)].\hat{C}^{naive}_{ITE}({\boldsymbol{x}})=[\hat{Y}^{L}(1;{\boldsymbol{x}})-\hat{Y}^{R}(0;{\boldsymbol{x}}),\hat{Y}^{R}(1;{\boldsymbol{x}})-\hat{Y}^{L}(0;{\boldsymbol{x}})].

We refer to this approach as the “naive” approach, which usually leads to very wide prediction intervals.

Another option for case (iii) is the so-called “nested” approach by splitting 𝒳train\mathcal{X}_{train} into two folds, denoted by (𝒳train,1,𝒳train,2)(\mathcal{X}_{train,1},\mathcal{X}_{train,2}). On the first fold, train C^(1,𝒙)\hat{C}(1,{\boldsymbol{x}}) and C^(0,𝒙)\hat{C}(0,{\boldsymbol{x}}) by applying conformal inference. On the second fold, for each 𝒙i𝒳train,2{\boldsymbol{x}}_{i}\in\mathcal{X}_{train,2}, compute

C^(𝒙i)={[Yobs(1;𝒙i)Y^R(0;𝒙i),Yobs(1,𝒙i)Y^L(0;𝒙i)],if T=1,[Y^L(1;𝒙i)Yobs(0;𝒙i),Y^R(1;𝒙i)Yobs(0;𝒙i)],if T=0,\hat{C}({\boldsymbol{x}}_{i})=\begin{cases}[Y^{obs}(1;{\boldsymbol{x}}_{i})-\hat{Y}^{R}(0;{\boldsymbol{x}}_{i}),Y^{obs}(1,{\boldsymbol{x}}_{i})-\hat{Y}^{L}(0;{\boldsymbol{x}}_{i})],&\textit{if \ }T=1,\\ [\hat{Y}^{L}(1;{\boldsymbol{x}}_{i})-Y^{obs}(0;{\boldsymbol{x}}_{i}),\hat{Y}^{R}(1;{\boldsymbol{x}}_{i})-Y^{obs}(0;{\boldsymbol{x}}_{i})],&\textit{if \ }T=0,\\ \end{cases}

where Y^L\hat{Y}^{L} and Y^R\hat{Y}^{R} are estimated based on 𝒳train,1\mathcal{X}_{train,1}. Note that the conformal inference is also applicable for the data with interval outcomes. Applying the conformal inference method with interval outcomes on (𝒙i,C^(𝒙i))({\boldsymbol{x}}_{i},\hat{C}({\boldsymbol{x}}_{i})) for 𝒙i𝒳train,2{\boldsymbol{x}}_{i}\in\mathcal{X}_{train,2}, yielding the interval C^ITEexact(𝒙)\hat{C}^{exact}_{ITE}({\boldsymbol{x}}). Applying the conformal inference twice results in a sparse utilization of data for training the regression model and, subsequently, wider prediction intervals. The inexact method involves fitting conditional quantiles of C^L\hat{C}^{L} and C^R\hat{C}^{R}, yielding an interval C~ITEinexact(𝒙)\tilde{C}_{ITE}^{inexact}({\boldsymbol{x}}). The inexact method does not guarantee the coverage rate. Refer to Lei and Candès (2021) for the detail.

Appendix §\S3 Experimental Settings

To enforce a sparse DNN to be learned for the inverse function g()g(\cdot), we impose the following mixture Gaussian prior on each element of 𝒘n{\boldsymbol{w}}_{n}:

π(w)ρN(0,σ12)+(1ρ)N(0,σ02),\pi(w)\sim\rho N(0,\sigma_{1}^{2})+(1-\rho)N(0,\sigma_{0}^{2}), (S21)

where ww represents a generic element of 𝒘n{\boldsymbol{w}}_{n} and, unless stated otherwise, we set ρ=1e2\rho=1e-2, σ0=1e2\sigma_{0}=1e-2 and σ1=1\sigma_{1}=1. The elements of 𝒘n{\boldsymbol{w}}_{n} are a priori independent.

For EFI, we employ SGHMC in latent variable sampling, i.e., we simulate 𝒁n(k+1){\boldsymbol{Z}}_{n}^{(k+1)} in the following formula:

𝑽n(k+1)=(1ϖ)𝑽n(k)+υk+1^𝒁nlogπϵ(𝒁n(k)|𝑿n,𝒀n,𝒘n(k))+2ϖτυk+1𝒆(k+1),𝒁n(k+1)=𝒁n(k)+𝑽n(k+1),\begin{split}{\boldsymbol{V}}_{n}^{(k+1)}&=(1-\varpi){\boldsymbol{V}}_{n}^{(k)}+\upsilon_{k+1}\widehat{\nabla}_{{\boldsymbol{Z}}_{n}}\log\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{(k)}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)})+\sqrt{2\varpi\tau\upsilon_{k+1}}{\boldsymbol{e}}^{(k+1)},\\ {\boldsymbol{Z}}_{n}^{(k+1)}&={\boldsymbol{Z}}_{n}^{(k)}+{\boldsymbol{V}}_{n}^{(k+1)},\end{split}

where τ=1\tau=1, 0<ϖ10<\varpi\leq 1 is the momentum parameter, 𝒆(k+1)N(0,Id𝒛){\boldsymbol{e}}^{(k+1)}\sim N(0,I_{d_{{\boldsymbol{z}}}}), and υk+1\upsilon_{k+1} is the learning rate. It is worth noting that the algorithm is reduced to SGLD if we set ϖ=1\varpi=1.

In the simulations, we set the learning rate sequence {υk:k=1,2,}\{\upsilon_{k}:k=1,2,\ldots\} and the step size sequence {γk:k=1,2,}\{\gamma_{k}:k=1,2,\ldots\} in the forms:

υk=Cυcυ+kα,γk=Cγcγ+kα,\upsilon_{k}=\frac{C_{\upsilon}}{c_{\upsilon}+k^{\alpha}},\quad\gamma_{k}=\frac{C_{\gamma}}{c_{\gamma}+k^{\alpha}},

for some constants Cυ>0C_{\upsilon}>0, cυ>0c_{\upsilon}>0, Cγ>0C_{\gamma}>0 and cγ>0c_{\gamma}>0, and α(0,1]\alpha\in(0,1]. The values of CυC_{\upsilon}, cυc_{\upsilon}, CγC_{\gamma},cγc_{\gamma} and α\alpha used in different experiments are given below.

§\S3.1 ATE

From the point of view of equation solving, the model (22) in the main text can also be written as

yi=τTi+μ+𝒙i𝜷+σzi,i=1,2,,n,y_{i}=\tau^{\prime}T_{i}^{\prime}+\mu^{\prime}+{\boldsymbol{x}}_{i}{\boldsymbol{\beta}}+\sigma z_{i},\quad i=1,2,\ldots,n, (S22)

where Ti{1,1}T_{i}^{\prime}\in\{-1,1\}, τ=τ/2\tau^{\prime}=\tau/2, and μ=μ+τ/2\mu^{\prime}=\mu+\tau/2. Let 𝜽=(τ,μ,𝜷,σ){\boldsymbol{\theta}}=(\tau^{\prime},\mu^{\prime},{\boldsymbol{\beta}}^{\top},\sigma)^{\top}. Equation (S22) can be solved under the standard framework of EFI. Unless otherwise noted, all results in this paper are based on solving this type of transformed equations.

We set α=1/7\alpha=1/7 and ϖ=0.1\varpi=0.1. For n=250n=250, we set (Cυ,cυ,Cγ,cγ)=((C_{\upsilon},c_{\upsilon},C_{\gamma},c_{\gamma})=(200000, 1000000, 54000, 1000000). For n=500n=500 and 1000, we set (Cυ,cυ,Cγ,cγ)=(500000,1000000(C_{\upsilon},c_{\upsilon},C_{\gamma},c_{\gamma})=(500000,1000000, 54000, 1000000).

We set 𝒦=5000\mathcal{K}=5000 as the number of burn-in iterations, and set M=50,000M=50,000 as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of B=5B=5; that is, we collected M/B=10,000M/B=10,000 samples in each run.

We set η=500\eta=500 and ϵ=1/10\epsilon=1/10 in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations (for finding a reasonably good initial point).

The DNN in the EFI network has two hidden layers, with the widths given by d1=90d_{1}=90 and d2=30d_{2}=30, respectively.

§\S3.2 Linear control response with Non-linear treatment response

Refer to caption
Figure S1: Description of the structure of the DNN model used in the Double-NN method: 𝜽{\boldsymbol{\theta}} (i.e., the output layer of the DNN model) can be partitioned into three parts, namely, 𝜽c{\boldsymbol{\theta}}_{c}, 𝜽τ{\boldsymbol{\theta}}_{\tau}, and σ\sigma, where ωcL\omega_{c}^{L}, ωτL\omega_{\tau}^{L} and ωσL\omega_{\sigma}^{L} correspond to the parameters for 𝜽c{\boldsymbol{\theta}}_{c}, 𝜽τ{\boldsymbol{\theta}}_{\tau}, and log(σ)\log(\sigma), respectively. In the case that 𝜽c{\boldsymbol{\theta}}_{c} or 𝜽τ{\boldsymbol{\theta}}_{\tau} represents a neural network, we re-scale them by dividing a factor of 25 such that the output values of each output neuron are close to each other, easing the training process.

We set α=1/7\alpha=1/7, ϖ=0.1\varpi=0.1, and (Cυ,cυ)=(200000,1000000)(C_{\upsilon},c_{\upsilon})=(200000,1000000) and (Cγ,cγ)=(20,20000)(C_{\gamma},c_{\gamma})=(20,20000) for ωτH\omega^{H}_{\tau} and (Cγ,cγ)=(20000,200000)(C_{\gamma},c_{\gamma})=(20000,200000) for all other parameters. Refer to Figure S1 for the definition of ωτH\omega_{\tau}^{H}.

For initialization, we update the 𝒘n{\boldsymbol{w}}_{n} with randomly sampled 𝒁n(t+1)π0n(𝒁n){\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n}) for the first 5000 iterations. We set 𝒦=20,000\mathcal{K}=20,000 as the number of burn-in iterations, and set M=50,000M=50,000 as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of B=5B=5; that is, we collected M/B=10,000M/B=10,000 samples in each run. We set η=10\eta=10 and ϵ=1/10\epsilon=1/10 in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations.

The DNN in the EFI network has two hidden layers, with the widths given by d1=90d_{1}=90 and d2=30d_{2}=30, respectively. The DNN used for modeling the treatment effect has two hidden layers, each hidden layer consisting of 10 hidden neurons.

§\S3.3 Non-linear control response with Non-linear treatment response

We set α=1/7\alpha=1/7, ϖ=0.1\varpi=0.1, and (Cυ,cυ)=(500000,1000000)(C_{\upsilon},c_{\upsilon})=(500000,1000000) and (Cγ,cγ)=(2.5,1000000)(C_{\gamma},c_{\gamma})=(2.5,1000000) for ωτH\omega^{H}_{\tau} and ωcH\omega_{c}^{H}, and set (Cγ,cγ)=(20000,200000)(C_{\gamma},c_{\gamma})=(20000,200000) for all other parameters. Refer to Figure S1 for the definitions of ωτH\omega_{\tau}^{H} and ωcH\omega_{c}^{H}.

For initialization, we update the 𝒘n{\boldsymbol{w}}_{n} with randomly sampled 𝒁n(t+1)π0n(𝒛n){\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n}) for the first 5000 iterations. We set 𝒦=20,000\mathcal{K}=20,000 as the number of burn-in iterations, and set M=50,000M=50,000 as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of B=5B=5; that is, we collected M/B=10,000M/B=10,000 samples in each run. We set η=10\eta=10 and ϵ=1/10\epsilon=1/10 in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations.

The DNN in the EFI network has two hidden layers, with the widths given by d1=90d_{1}=90 and d2=30d_{2}=30, respectively. The DNN used for modeling the treatment effect has two hidden layers, each hidden layer consisting of 10 hidden neurons. The DNN used for modeling the response function under the control has two hidden layers, each hidden layer consisting of 10 hidden neurons.

§\S3.4 Lalonde

We set α=1/7\alpha=1/7, ϖ=0.1\varpi=0.1, and (Cυ,cυ)=(500000,1000000)(C_{\upsilon},c_{\upsilon})=(500000,1000000) and (Cγ,cγ)=(2.5,1000000)(C_{\gamma},c_{\gamma})=(2.5,1000000) for ωτH\omega^{H}_{\tau} and ωcH\omega^{H}_{c}, and (Cγ,cγ)=(1000,1000000)(C_{\gamma},c_{\gamma})=(1000,1000000) for all other parameters. Refer to Figure S1 for the definitions of ωτH\omega_{\tau}^{H} and ωcH\omega_{c}^{H}.

For initialization, we update the 𝒘n{\boldsymbol{w}}_{n} with randomly sampled 𝒁n(t+1)π0n(𝒛n){\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n}) for the first 10,000 iterations. We set 𝒦=20,000\mathcal{K}=20,000 as the number of burn-in iterations and set M=50,000M=50,000 as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of B=5B=5; that is, we collected M/B=10,000M/B=10,000 samples in each run. We set η=10\eta=10 and ϵ=1/10\epsilon=1/10 in construction of the energy function, and perform gradient clipping by norm with 5000 for first 100 iterations.

The DNN in the EFI network has two hidden layers, with the widths given by d1=90d_{1}=90 and d2=30d_{2}=30, respectively. The DNN used for modeling the treatment effect contains two hidden layers, each hidden layer containing 10 hidden neurons. The DNN used for modeling the response function under the control contains two hidden layers, each hidden layer containing 10 hidden neurons.

§\S3.5 NLSM

We set α=1/7\alpha=1/7, ϖ=0.1\varpi=0.1, and (Cυ,cυ)=(500000,1000000)(C_{\upsilon},c_{\upsilon})=(500000,1000000) and (Cγ,cγ)=(2.5,1000000)(C_{\gamma},c_{\gamma})=(2.5,1000000) for ωτH\omega^{H}_{\tau} and ωcH\omega^{H}_{c}, and (Cγ,cγ)=(5000,1000000)(C_{\gamma},c_{\gamma})=(5000,1000000) for all other parameters. Refer to Figure S1 for the definitions of ωτH\omega_{\tau}^{H} and ωcH\omega_{c}^{H}.

For initialization, we update the 𝒘n{\boldsymbol{w}}_{n} with randomly sampled Zn(t+1)π0n(𝒛n)Z_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n}) for the first 10,000 iterations. We set 𝒦=20,000\mathcal{K}=20,000 as the number of burn-in iterations and set M=50,000M=50,000 as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of B=5B=5; that is, we collected M/B=10,000M/B=10,000 samples in each run. We set η=10\eta=10 and ϵ=1/10\epsilon=1/10 in construction of the energy function, and perform gradient clipping by norm with 5000 for first 100 iterations.

The DNN in the EFI network contains two hidden layers, with the widths given by d1=90d_{1}=90 and d2=30d_{2}=30, respectively. The DNN used for modeling the treatment effect contains two hidden layers, each hidden layer containing 10 hidden neurons. The DNN used for modeling the response function under control contains two hidden layers, each hidden layer containing 10 hidden neurons.

References

  • Bang and Robins (2005) Bang, H. and Robins, J. M. (2005), “Doubly Robust Estimation in Missing Data and Causal Inference Models,” Biometrics, 61, 962–972.
  • Bolcskei et al. (2019) Bolcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. (2019), “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, 1, 8–45.
  • Breiman (1998) Breiman, L. (1998), “Arcing classifier (with discussion and a rejoinder by the author),” Annals of Statistics, 26, 801–849.
  • Breiman (2001) — (2001), “Random forests,” Machine learning, 45, 5–32.
  • Caron et al. (2022) Caron, A., Baio, G., and Manolopoulou, I. (2022), “Estimating Individual Treatment Effects using Non-Parametric Regression Models: a Review,” Journal of the Royal Statistical Society Series A: Statistics in Society, 185, 1115–1149.
  • Carvalho et al. (2019) Carvalho, C. M., Feller, A., Murray, J., Woody, S., and Yeager, D. S. (2019), “Assessing Treatment Effect Variation in Observational Studies: Results from a Data Challenge,” Observational Studies.
  • Cefalu et al. (2021) Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A., and Burgette, L. (2021), “Package ‘twang’: Toolkit for Weighting and Analysis of Nonequivalent Groups,” R Package.
  • Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. (2014), “Stochastic gradient hamiltonian monte carlo,” in International conference on machine learning, pp. 1683–1691.
  • Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010), “BART: Bayesian Additive Regression Trees,” The Annals of Applied Statistics, 4, 266–298.
  • Deng et al. (2019) Deng, W., Zhang, X., Liang, F., and Lin, G. (2019), “An adaptive empirical Bayesian method for sparse deep learning,” Advances in neural information processing systems, 32.
  • Dittmer et al. (2018) Dittmer, S., King, E. J., and Maass, P. (2018), “Singular Values for ReLU Layers,” IEEE Transactions on Neural Networks and Learning Systems, 31, 3594–3605.
  • Dorie and Hill (2020) Dorie, V. and Hill, J. L. (2020), “Bartcause: Causal Inference using Bayesian Additive Regression Trees [R package bartCause version 1.0-4],” R Package.
  • Farrell et al. (2021) Farrell, M., Liang, T., and Misra, S. (2021), “Deep Neural Networks for Estimation and Inference,” Econometrica, 89, 181–213.
  • Fisher (1935) Fisher, R. A. (1935), “The fiducial argument in statistical inference,” Annals of Eugenics, 6, 391–398.
  • Foster et al. (2011) Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011), “Subgroup identification from randomized clinical trial data,” Statistics in Medicine, 30.
  • Fraser (1966) Fraser, D. A. S. (1966), “Structural probability and a generalization,” Biometrika, 53, 1–9.
  • Fraser (1968) — (1968), The Structure of Inference, New York-London-Sydney: John Wiley & Sons.
  • Guan and Yang (2019) Guan, Q. and Yang, S. (2019), “A Unified Framework for Causal Inference with Multiple Imputation Using Martingale,” arXiv: Methodology.
  • Hahn et al. (2020) Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020), “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion),” Bayesian Analysis, 15, 965 – 2020.
  • Hannig (2009) Hannig, J. (2009), “On generalized fiducial inference,” Statistica Sinica, 19, 491–544.
  • Hannig et al. (2016) Hannig, J., Iyer, H., Lai, R. C. S., and Lee, T. C. M. (2016), “Generalized Fiducial Inference: A Review and New Results,” Journal of the American Statistical Association, 111, 1346–1361.
  • Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017), “Deep Learning Scaling is Predictable, Empirically,” ArXiv, abs/1712.00409.
  • Higham and Cheng (1998) Higham, N. J. and Cheng, S. H. (1998), “Modifying the inertia of matrices arising in optimization,” Linear Algebra and its Applications, 261–279.
  • Hill (2011) Hill, J. L. (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217–240.
  • Hornik (1991) Hornik, K. (1991), “Approximation capabilities of multilayer feedforward networks,” Neural Networks, 4, 251–257.
  • Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. (1989), “Multilayer feedforward networks are universal approximators,” Neural Networks, 2, 359–366.
  • Imbens (2004) Imbens, G. (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86, 4–29.
  • Imbens and Rubin (2015) Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, USA: Cambridge University Press.
  • Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014), “Confidence iReferntervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869–2909.
  • Jiang (2007) Jiang, W. (2007), “Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities,” The Annals of Statistics, 35, 1487–1511.
  • Kidger and Lyons (2020) Kidger, P. and Lyons, T. (2020), “Universal Approximation with Deep Narrow Networks,” Proceedings of Machine Learning Research, 125, 1–22.
  • Kim et al. (2023) Kim, N., Min, C., and Park, S. (2023), “Minimum width for universal approximation using ReLU networks on compact domain,” ArXiv, abs/2309.10402.
  • Künzel et al. (2019) Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for estimating heterogeneous treatment effects using machine learning,” Proceedings of the National Academy of Sciences of the United States of America, 116, 4156 – 4165.
  • Lee et al. (2016) Lee, J., Sun, D., Sun, Y., and Taylor, J. (2016), “Exact post-selection inference, with application to the Lasso,” Annals of Statistics, 44, 907–927.
  • Lei and Candès (2021) Lei, L. and Candès, E. J. (2021), “Conformal Inference of Counterfactuals and Individual Treatment Effects,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 83, 911–938.
  • Liang et al. (2018a) Liang, F., Jia, B., Xue, J., Li, Q., and Luo, Y. (2018a), “An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond,” Journal of the Royal Statistical Society, Series B, 80, 899–926.
  • Liang et al. (2024) Liang, F., Kim, S., and Sun, Y. (2024), “Exended Fiducial Inference: Toward an Automated Process of Statistical Inference,” Journal of the Royal Statistical Society, Series B, in press.
  • Liang et al. (2018b) Liang, F., Li, Q., and Zhou, L. (2018b), “Bayesian Neural Networks for Selection of Drug Sensitive Genes,” Journal of the American Statistical Association, 113, 955–972.
  • Liang et al. (2022a) Liang, F., Xue, J., and Jia, B. (2022a), “Markov neighborhood regression for high-dimensional inference,” Journal of the American Statistical Association, 117, 1200–1214.
  • Liang et al. (2022b) Liang, S., Sun, Y., and Liang, F. (2022b), “Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network,” NeurIPS 2022.
  • Lu et al. (2018) Lu, M., Sadiq, S., Feaster, D. J., and Ishwaran, H. (2018), “Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods,” Journal of Computational and Graphical Statistics, 27, 209 – 219.
  • Martin (2023) Martin, R. (2023), “Fiducial inference viewed through a possibility-theoretic inferential model lens,” Journal of Machine Learning Research, 215, 299–310.
  • Murph et al. (2022) Murph, A. C., Hannig, J., and Williams, J. P. (2022), “Generalized Fiducial Inference on Differentiable Manifolds,” arXiv:2209.15473.
  • Nielsen (2000) Nielsen, S. (2000), “The stochastic EM algorithm: Estimation and asymptotic results,” Bernoulli, 6, 457–489.
  • Park et al. (2020) Park, S., Yun, C., Lee, J., and Shin, J. (2020), “Minimum Width for Universal Approximation,” ArXiv, abs/2006.08859.
  • Petersen and Voigtlaender (2018) Petersen, P. and Voigtlaender, F. (2018), “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” Neural Networks, 108, 296–330.
  • Portnoy (1986) Portnoy, S. (1986), “On the central limit theorem in p\mathbb{R}^{p} when pp\to\infty,” Probability Theory and Related Fields, 73, 571–583.
  • Portnoy (1988) — (1988), “Asymptotic behavior of likelihood methods for exponential families when the number of parameters tend to infinity,” Annals of Statistics, 16, 356–366.
  • Robbins and Monro (1951) Robbins, H. and Monro, S. (1951), “A Stochastic Approximation Method,” Annals of Mathematical Statistics, 22, 400–407.
  • Robins et al. (1994) Robins, J. M., Rotnitsky, A., and Zhao, L. P. (1994), “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American Statistical Association, 89, 846–866.
  • Romano et al. (2019) Romano, Y., Patterson, E., and Candès, E. J. (2019), “Conformalized Quantile Regression,” in Neural Information Processing Systems.
  • Rosenbaum (1987) Rosenbaum, P. R. (1987), “Model-Based Direct Adjustment,” Journal of the American Statistical Association, 82, 387–394.
  • Rosenbaum (2002) — (2002), Observational Studies (2nd edition), New York: Springer.
  • Rubin (1974) Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology, 66, 688–701.
  • Schapire (1990) Schapire, R. E. (1990), “The Strength of Weak Learnability,” Machine Learning, 5, 197–227.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020), “Nonparametric regression using deep neural networks with ReLU activation function,” The Annals of Statistics, 48, 1875–1897.
  • Sethuraman et al. (2023) Sethuraman, M. G., Lopez, R., Mohan, R. V., Fekri, F., Biancalani, T., and Hutter, J.-C. (2023), “NODAGS-Flow: Nonlinear Cyclic Causal Structure Learning,” in International Conference on Artificial Intelligence and Statistics.
  • Shafer and Vovk (2008) Shafer, G. and Vovk, V. (2008), “A Tutorial on Conformal Prediction,” J. Mach. Learn. Res., 9, 371–421.
  • Shalit et al. (2017) Shalit, U., Johansson, F. D., and Sontag, D. A. (2017), “Estimating individual treatment effect: generalization bounds and algorithms,” in Proceedings of the 34th International Conference on Machine Learning, PMLR, pp. 3076–3085.
  • Song and Liang (2023) Song, Q. and Liang, F. (2023), “Nearly optimal Bayesian Shrinkage for high dimensional regression,” China Science Mathematics, 66, 409–442.
  • Song et al. (2020) Song, Q., Sun, Y., Ye, M., and Liang, F. (2020), “Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection,” Biometrika, 107, 997–1004.
  • Sun and Liang (2022) Sun, Y. and Liang, F. (2022), “A kernel-expanded stochastic neural network,” Journal of the Royal Statistical Society Series B, 84, 547–578.
  • Sun et al. (2022) Sun, Y., Song, Q., and Liang, F. (2022), “Consistent Sparse Deep Learning: Theory and Computation,” Journal of the American Statistical Association, 117, 1981–1995.
  • Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019), “Conformal Prediction Under Covariate Shift,” in Advances in Neural Information Processing Systems, eds. Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., Curran Associates, Inc., vol. 32.
  • van de Geer et al. (2014) van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On asymptotically optimal confidence regions and tests for high-dimensional models,” Ann. Statist., 42, 1166–1202.
  • Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World, Springer.
  • Wager and Athey (2018) Wager, S. and Athey, S. (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242.
  • Williams (2023) Williams, J. P. (2023), “Model-free generalized fiducial inference,” .
  • Yeager et al. (2019) Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C., Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., Paunesku, D., Romero, C., Flint, K., Roberts, A., Trott, J., Iachan, R., Buontempo, J., Yang, S. M., Carvalho, C. M., Hahn, P. R., Gopalan, M., Mhatre, P., Ferguson, R., Duckworth, A. L., and Dweck, C. S. (2019), “A national experiment reveals where a growth mindset improves achievement,” Nature, 573.
  • Zabell (1992) Zabell, S. L. (1992), “R. A. Fisher and Fiducial Argument,” Statistical Science, 7, 369–387.
  • Zetterqvist and Sjölander (2015) Zetterqvist, J. and Sjölander, A. (2015), “Doubly Robust Estimation with the R Package drgee,” Epidemiologic Methods, 4, 69–86.
  • Zhang and Zhang (2014) Zhang, C.-H. and Zhang, S. S. (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 217–242.