Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks

Sehwan Kim^† and Faming Liang Correspondence author: Faming Liang, email: fmliang@purdue.edu. ^†Department of Statistics, Ewha Womans University, Seoul 03760, Republic of Korea.^∗ Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.

Abstract

Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep neural networks are used to model the treatment and control effect functions, while an additional neural network is employed to estimate their parameters. The universal approximation capability of deep neural networks ensures the broad applicability of this method. Numerical results highlight the superior performance of the proposed Double-NN method compared to the conformal quantile regression (CQR) method in individual treatment effect estimation. From the perspective of statistical inference, this work advances the theory and methodology for statistical inference of large models. Specifically, it is theoretically proven that the proposed method permits the model size to increase with the sample size $n$ at a rate of $O(n^{\zeta})$ for some $0\leq\zeta<1$ , while still maintaining proper quantification of uncertainty in the model parameters. This result marks a significant improvement compared to the range $0\leq\zeta<\frac{1}{2}$ required by the classical central limit theorem. Furthermore, this work provides a rigorous framework for quantifying the uncertainty of deep neural networks under the neural scaling law, representing a substantial contribution to the statistical understanding of large-scale neural network models.

Keywords: Causal Inference, Deep Learning, Fiducial Inference, Stochastic Gradient MCMC, Uncertainty Quantification

1 Introduction

Causal inference is a fundamental problem in many disciplines such as medicine, econometrics, and social science. Formally, let $\{(y_{1},{\boldsymbol{x}}_{1},t_{1}),(y_{2},{\boldsymbol{x}}_{2},t_{2}),\ldots,(y_{n},{\boldsymbol{x}}_{n},t_{n})\}$ denote a set of observations drawn from the following data-generating equations:

y_{i}=c({\boldsymbol{x}}_{i})+\tau({\boldsymbol{x}}_{i})t_{i}+\sigma z_{i},\quad i=1,2,\ldots,n,

(1)

where ${\boldsymbol{x}}_{i}\in\mathbb{R}^{d}$ represents a vector of covariates of subject $i$ , $t_{i}\in\{0,1\}$ represents the treatment assignment to subject $i$ ; $c(\cdot)$ represents the expected outcome of subject $i$ if assigned to the control group (with $t_{i}=0)$ , and $\tau({\boldsymbol{x}}_{i})$ is the expected treatment effect of subject $i$ if assigned to the treatment group (with $t_{i}=1$ ); $\sigma>0$ is the standard deviation, and $z_{i}$ represent a standardized random error that is not necessarily Gaussian. Under the potential outcome framework (Rubin, 1974), each individual receives only one assignment of the treatment with $t_{i}=0$ or 1, but not both. The goal of causal inference is to make inference for the average treatment effect (ATE) or individual treatment effect (ITE).

The ATE is defined as

\tau_{0}=\mathbb{E}(\tau({\boldsymbol{x}}))=\int_{\mathcal{X}}\tau({\boldsymbol{x}})dF({\boldsymbol{x}}),

(2)

where $\mathcal{X}$ denotes the sample space of ${\boldsymbol{x}}$ , and $F({\boldsymbol{x}})$ denotes the cumulative distribution function of ${\boldsymbol{x}}$ . To estimate ATE, a variety of methods, including outcome regression, augmented/inverse probability weighting (AIPW/IPW) and matching, have been developed. See Imbens (2004) and Rosenbaum (2002) for overviews.

The ITE is often defined as the conditional average treatment effect (CATE):

\tau({\boldsymbol{x}})=\mathbb{E}(Y|T=1,{\boldsymbol{x}})-\mathbb{E}(Y|T=0,{\boldsymbol{x}}),

(3)

see e.g., Shalit et al. (2017) and Lu et al. (2018). Recently, Lei and Candès (2021) proposed to make predictive inference of the ITE by quantifying the uncertainty of

\tilde{\tau}_{i}:=Y(T=1,{\boldsymbol{x}}_{i})-Y(T=0,{\boldsymbol{x}}_{i}):=Y_{i}(1)-Y_{i}(0),

(4)

where $Y_{i}(t_{i})$ denotes the potential outcome of subject $i$ with treatment assignment $t_{i}\in\{0,1\}$ . Henceforth, we will call $\tilde{\tau}_{i}$ the predictive ITE.

It is known that ATE and ITE are identifiable if the conditions ‘strong ignorability’ and ‘overlapping’ are satisfied. The former means that, after accounting for observed covariates, the treatment assignment is independent of potential outcomes; and the latter ensures that every subject in the study has a positive probability of receiving either assignment, allowing for meaningful comparisons between treatment and control groups. Mathematically, the two conditions can be expressed as:

(i)\ \mbox{\it strong ignorability:}\ \{Y(1),Y(0)\}\rotatebox[origin={c}]{90.0}{$\models$}T|{\boldsymbol{x}};\ \ (ii)\ \mbox{\it overlapping:}\ 0<P(T=1|{\boldsymbol{x}})<1,\ \ \forall{\boldsymbol{x}}\in\mathcal{X},

where $T\in\{0,1\}$ represents the treatment assignment variable, and $\models$ denotes conditional independence. Together, they ensure that the causal effect can be correctly estimated without bias. See e.g. Guan and Yang (2019) for more discussions on this issue.

However, even under these assumptions, accurate inference for ATE and ITE can still be challenging. Specifically, the inference task can be complicated by unknown nonlinear forms of $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ . To address these issues, some authors have proposed to approximate them using a machine learning model, such as random forest (RF) (Breiman, 2001), Bayesian additive regression trees (BART) (Chipman et al., 2010), and neural networks. Refer to e.g., Foster et al. (2011), Hill (2011), Shalit et al. (2017), Wager and Athey (2018), and Hahn et al. (2020) for the details. Unfortunately, these methods often yield point estimates for the ATE and ITE, while failing to correctly quantifying their uncertainty due to the complexity of the machine learning models. Quite recently, Lei and Candès (2021) proposed to quantify the uncertainty of the predictive ITE using the conformal inference method (Vovk et al., 2005; Shafer and Vovk, 2008). This method provides coverage-guaranteed confidence intervals for the predictive ITE, but the intervals may become overly wide when the machine learning model is not consistently estimated. In short, while machine learning models, particularly neural networks, can effectively model complex, nonlinear functions such as $c(\cdot)$ and $\tau(\cdot)$ for causal inference, performing accurate uncertainty quantification with these models remains a significant challenge. This is because these models typically have a complex functional form and involve a large number of parameters.

In this paper, we propose to conduct causal inference using an extended fiducial inference (EFI) method (Liang et al., 2024), with the goal of addressing the uncertainty quantification issue associated with treatment effect estimation. EFI provides an innovative framework for inferring model uncertainty based solely on observed data, aligning with the goal of fiducial inference (Fisher, 1935; Hannig, 2009). Specifically, it aims to solve the data-generating equations by explicitly imputing the unobserved random errors and approximating the model parameters from the observations and imputed random errors using a neural network; it then infers the uncertainty of the model parameters based on the learned neural network function and the imputed random errors (see Section 2 for a brief review). To make the EFI method feasible for causal effect estimation with accurate uncertainty quantification, we extend the method in two key aspects:

(i)

We approximate each of the unknown functions, $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ , by a deep neural network (DNN) model. The DNN possesses universal approximation capability (Hornik et al., 1989; Hornik, 1991; Kidger and Lyons, 2020), meaning it can approximate any continuous function to an arbitrary degree of accuracy, provided it is sufficiently wide and deep. This property makes the proposed method applicable to a wide range of data-generating processes.
(ii)

We theoretically prove that the dimensions (i.e., the number of parameters) of the DNN models used to approximate $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ are allowed to increase with the sample size $n$ at a rate of $O(n^{\zeta})$ for some $0<\zeta<1$ , while the uncertainty of the DNN models can still be correctly quantified. That is, we are able to correctly quantify the uncertainty of the causal effect although it has to be approximated using large models.

In this paper, we regard a model as ‘large’ if its dimension increases with $n$ at a rate of $1/2\leq\zeta<1$ . We note that part (ii) represents a significant theoretical innovation in statistical inference for large models. In the literature on this area, most efforts have focused on linear models, featuring techniques such as desparsified Lasso (Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014), post-selection inference (Lee et al., 2016), and Markov neighborhood regression (Liang et al., 2022a). For nonlinear models, the research landscape appears to be more scattered. Portnoy (1986, 1988) showed that for independently and identically distributed (i.i.d) random vectors with the dimension $p$ increasing with the sample size $n$ , the central limit theorem (CLT) holds if $p=O(n^{\zeta})$ for some $0\leq\zeta<1/2$ . It is worth noting that Bayesian methods, despite being sampling-based, do not permit the dimension of the true model to increase with $n$ at a higher rate. For example, even in the case of generalized linear models, to ensure the posterior consistency, the dimension of the true model is only allowed to increase with $n$ at a rate $0\leq\zeta<1/4$ (see Theorem 2 and Remark 2 of Jiang (2007)). Under its current theoretical framework developed by Liang et al. (2024), EFI can only be applied to make inference for the models whose dimension is fixed or increases with $n$ at a very low rate. This paper extends the theoretical framework of EFI further, establishing its applicability for statistical inference of large models.

It is worth noting that a DNN model with size $p=O(n^{\zeta})$ , where $\zeta$ is close to (but less than) 1, has been shown to be sufficiently large for approximating many data generation processes. This is supported by the theory established in Sun et al. (2022) and Farrell et al. (2021). In Sun et al. (2022), it is shown that, as $n\to\infty$ , a sparse DNN model of this size can provide accurate approximations for multiple classes of functions, such as bounded $\alpha$ -Hölder smooth functions (Schmidt-Hieber, 2020), piecewise smooth functions with fixed input dimensions (Petersen and Voigtlaender, 2018), and functions representable by an affine system (Bolcskei et al., 2019). Similar results have also been obtained in Farrell et al. (2021), where it is shown that a multi-layer perceptron (MLP) with this model size and the ReLU activation function can provide an accurate approximation to the functions that lie in a Sobolev ball with certain smoothness. The approximation capability of DNNs of this size has also been empirically validated by Hestness et al. (2017), where a neural scaling law of $p=O(n^{\zeta})$ with $0.5\leq\zeta<1$ was identified through extensive studies across various model architectures in machine translation, language modeling, image processing, and speech recognition.

To highlight the strength of EFI in uncertainty quantification and to facilitate comparison with the conformal inference method, this study focuses on inference for predictive ITEs, although the proposed method can also be extended to ATE and CATE. Our numerical results demonstrate the superiority of the proposed method over the conformal inference method.

The remaining part of this paper is organized as follows. Section 2 provides a brief review of the EFI method. Section 3 extends EFI to statistical inference for large statistical models. Section 4 provides an illustrative example for EFI. Section 5 applies the proposed method to statistical inference for predictive ITEs, with both simulated and real data examples. Section 6 concludes the paper with a brief discussion.

2 A Brief Review of the EFI Method

While fiducial inference was widely considered as a big blunder by R.A. Fisher, the goal he initially set —inferring the uncertainty of model parameters on the basis of observations — has been continually pursued by many statisticians, see e.g. Zabell (1992), Hannig (2009), Hannig et al. (2016), Murph et al. (2022), and Martin (2023). To this end, Liang et al. (2024) developed the EFI method based on the fundamental concept of structural inference (Fraser, 1966, 1968). Consider a regression model:

Y=f({\boldsymbol{X}},Z,{\boldsymbol{\theta}}),

(5)

where $Y\in\mathbb{R}$ and ${\boldsymbol{X}}\in\mathbb{R}^{d}$ represent the response and explanatory variables, respectively; ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$ represents the vector of parameters; and $Z\in\mathbb{R}$ represents a scaled random error following a known distribution $\pi_{0}(\cdot)$ . For the model (1), the treatment assignment $T$ should be included as a part of ${\boldsymbol{X}}$ .

Suppose that a random sample of size $n$ has been collected from the model, denoted by $\{(y_{1},{\boldsymbol{x}}_{1}),(y_{2},{\boldsymbol{x}}_{2}),\ldots,(y_{n},{\boldsymbol{x}}_{n})\}$ . In the point of view of structural inference (Fraser, 1966, 1968), they can be expressed in the data generating equations as follow:

y_{i}=f({\boldsymbol{x}}_{i},z_{i},{\boldsymbol{\theta}}),\quad i=1,2,\ldots,n.

(6)

This system of equations consists of $n+p$ unknowns, namely, $\{{\boldsymbol{\theta}},z_{1},z_{2},\ldots,z_{n}\}$ , while there are only $n$ equations. Therefore, the values of ${\boldsymbol{\theta}}$ cannot be uniquely determined by the data-generating equations, and this lack of uniqueness of unknowns introduces uncertainty in ${\boldsymbol{\theta}}$ .

Let ${\boldsymbol{Z}}_{n}=\{z_{1},z_{2},\ldots,z_{n}\}$ denote the unobservable random errors, which are also called latent variables in EFI. Let $G(\cdot)$ denote an inverse function/mapping for the parameter ${\boldsymbol{\theta}}$ , i.e.,

{\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}).

(7)

It is worth noting that the inverse function is generally non-unique. For example, it can be constructed by solving any $p$ equations in (6) for ${\boldsymbol{\theta}}$ . As noted by Liang et al. (2024), this non-uniqueness of inverse function mirrors the flexibility of frequentist methods, where different estimators of ${\boldsymbol{\theta}}$ can be designed for different purposes.

As a general method, Liang et al. (2024) proposed to approximate the inverse function $G(\cdot)$ using a sparse DNN, see Figure 1 for illustration. They also introduced an adaptive stochastic gradient Langevin dynamics (SGLD) algorithm, which facilitates the simultaneous training of the sparse DNN and simulation of the latent variables ${\boldsymbol{z}}$ . This is briefly described as follows.

Refer to caption — Figure 1: Illustration of the EFI network (Liang et al., 2024), where the orange nodes and orange links form a DNN (parameterized by the weights ${\boldsymbol{w}}_{n}$ , with the subscript $n$ indicating its dependence on the training sample size $n$ ), the green node represents latent variable to impute, and the black lines represent deterministic functions.

Let $\hat{{\boldsymbol{\theta}}}_{i}:=\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n})$ denote the DNN prediction function parameterized by the weights ${\boldsymbol{w}}_{n}$ in the EFI network, and let

\bar{{\boldsymbol{\theta}}}:=\frac{1}{n}\sum_{i=1}^{n}\hat{{\boldsymbol{\theta}}}_{i}=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}),

(8)

which serves as an estimator of $G(\cdot)$ . The EFI network has two output nodes defined, respectively, by

e_{i1}:=\|\hat{{\boldsymbol{\theta}}}_{i}-\bar{{\boldsymbol{\theta}}}\|^{2},\quad e_{i2}:=d(y_{i},\tilde{y}_{i}):=d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}}),

(9)

where $\tilde{y}_{i}=f({\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})$ , $f(\cdot)$ is as specified in (6), and $d(\cdot)$ is a function that measures the difference between $y_{i}$ and $\tilde{y}_{i}$ . For example, for a normal linear/nonlinear regression, it can be defined as

d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})=\|y_{i}-f({\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})\|^{2}.

(10)

For logistic regression, it is defined as a squared ReLU function, see Liang et al. (2024) for the details. Furthermore, EFI defines an energy function as follows:

U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})=\sum_{i=1}^{n}d(y_{i},{\boldsymbol{x}}_{i},z_{i},\bar{{\boldsymbol{\theta}}})+\eta\sum_{i=1}^{n}\|\hat{{\boldsymbol{\theta}}}_{i}-\bar{{\boldsymbol{\theta}}}\|^{2},

(11)

for some regularization coefficient $\eta>0$ , where first term measures the fitting error of the model as implied by equation (10), and the second term regularizes the variation of $\hat{{\boldsymbol{\theta}}}_{i}$ , ensuring that the neural network forms a proper estimator of the inverse function. Given this energy function, we define the likelihood function as

\pi_{\epsilon}({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\propto e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon},

(12)

for some constant $\epsilon$ close to 0. As discussed in Liang et al. (2024), the choice of $\eta$ does not have much affect on the performance of EFI as long as $\epsilon$ is sufficiently small.

Subsequently, the posterior of ${\boldsymbol{w}}_{n}$ is given by

\begin{split}\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})&\propto\pi({\boldsymbol{w}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon},\end{split}

(13)

where $\pi({\boldsymbol{w}}_{n})$ denotes the prior of ${\boldsymbol{w}}_{n}$ ; and the predictive distribution of ${\boldsymbol{Z}}_{n}$ is given by

\begin{split}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})&\propto\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon}.\end{split}

(14)

In EFI, ${\boldsymbol{w}}_{n}$ is estimated through maximizing the posterior $\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})$ given the observations $\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n}\}$ . By the Bayesian version of Fisher’s identity (Song et al., 2020), the gradient equation $\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})$ $=0$ can be re-expressed as

\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})=\int\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})d{\boldsymbol{w}}_{n}=0,

(15)

which can be solved using an adaptive stochastic gradient MCMC algorithm (Liang et al., 2022b; Deng et al., 2019). The algorithm works by iterating between two steps:

(a)

Latent variable sampling: draw ${\boldsymbol{Z}}_{n}^{(k+1)}$ according to a Markov transition kernel that leaves $\pi_{\epsilon}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)})$ to be invariant;
(b)

Parameter updating: update ${\boldsymbol{w}}_{n}^{(k)}$ toward the maximum of $\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})$ using stochastic approximation (Robbins and Monro, 1951), based on the sample ${\boldsymbol{Z}}_{n}^{(k+1)}$ .

See Algorithm 1 for the pseudo-code. This algorithm is termed “adaptive” because the transition kernel in the latent variable sampling step changes with the working parameter estimate of ${\boldsymbol{w}}_{n}$ . The parameter updating step can be implemented using mini-batch SGD, and the latent variable sampling step can be executed in parallel for each observation $(y_{i},{\boldsymbol{x}}_{i})$ . Hence, the algorithm is scalable with respect to large datasets.

(i) (Initialization) Initialize

{\boldsymbol{w}}_{n}^{(0)}

{\boldsymbol{Z}}_{n}^{(0)}

M

(the number of fiducial samples to collect), and

\mathcal{K}

(burn-in iterations).

for k=1,2,…, $\mathcal{K}+M$ do

(ii) (Latent variable sampling) Given

{\boldsymbol{w}}_{n}^{(k)}

, simulate

{\boldsymbol{Z}}_{n}^{(k+1)}

by the SGHMC algorithm (Chen et al., 2014):

\begin{split}{\boldsymbol{V}}_{n}^{(k+1)}&=(1-\varpi){\boldsymbol{V}}_{n}^{(k)}+\upsilon_{k+1}\widehat{\nabla}_{{\boldsymbol{Z}}_{n}}\log\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{(k)}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)})+\sqrt{2\varpi\tilde{\tau}\upsilon_{k+1}}{\boldsymbol{e}}^{(k+1)},\\ {\boldsymbol{Z}}_{n}^{(k+1)}&={\boldsymbol{Z}}_{n}^{(k)}+{\boldsymbol{V}}_{n}^{(k+1)},\end{split}

where

\varpi

is the moment parameter,

\upsilon_{k+1}

is the learning rate,

\tilde{\tau}=1

is the temperature, and

{\boldsymbol{e}}^{(k+1)}\sim N(0,I_{d_{{\boldsymbol{z}}}})

(iii) (Parameter updating) Draw a minibatch

\{(y_{1},{\boldsymbol{x}}_{1},z_{1}^{(k)}),\ldots,(y_{m},{\boldsymbol{x}}_{m},z_{m}^{(k)})\}

and update the network weights by the SGD algorithm:

{\boldsymbol{w}}_{n}^{(k+1)}={\boldsymbol{w}}_{n}^{(k)}+\gamma_{k+1}\left[\frac{n}{m}\sum_{i=1}^{m}\nabla_{{\boldsymbol{w}}_{n}}\log\pi_{\epsilon}(y_{i}|{\boldsymbol{x}}_{i},z_{i}^{(k)},{\boldsymbol{w}}_{n}^{(k)})+\nabla_{{\boldsymbol{w}}_{n}}\log\pi({\boldsymbol{w}}_{n}^{(k)})\right],

(16)

where

\gamma_{k+1}

is the step size, and

\log\pi_{\epsilon}(y_{i}|{\boldsymbol{x}}_{i},z_{i}^{(k)},{\boldsymbol{w}}_{n}^{(k)})

can be appropriately defined according to (12).

(iv) (Fiducial sample collection) If

k+1>\mathcal{K}

, calculate

\hat{{\boldsymbol{\theta}}}_{i}^{(k+1)}=\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i}^{(k+1)},{\boldsymbol{w}}_{n}^{(k+1)})

for each

i\in\{1,2,\ldots,n\}

and average them to get a fiducial

\bar{{\boldsymbol{\theta}}}

-sample as calculated in (8).

end for

(v) (Statistical Inference) Conducting statistical inference for the model based on the collected fiducial samples.

Algorithm 1 Adaptive SGHMC for Extended Fiducial Inference

Under mild conditions for adaptive stochastic gradient MCMC algorithms (Deng et al., 2019; Liang et al., 2022b), it was shown in Liang et al. (2024) that

\|{\boldsymbol{w}}_{n}^{(k)}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $k\to\infty$},

(17)

where ${\boldsymbol{w}}_{n}^{*}$ denotes a solution to equation (15) and $\stackrel{{\scriptstyle p}}{{\to}}$ denotes convergence in probability, and that

{\boldsymbol{Z}}_{n}^{(k)}\stackrel{{\scriptstyle d}}{{\rightsquigarrow}}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}),\quad\mbox{as $k\to\infty$},

(18)

in 2-Wasserstein distance, where $\stackrel{{\scriptstyle d}}{{\rightsquigarrow}}$ denotes weak convergence.

To study the limit of (18) as $\epsilon$ decays to 0, i.e.,

p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*})=\lim_{\epsilon\downarrow 0}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*}),

where $p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*})$ is referred to as the extended fiducial density (EFD) of ${\boldsymbol{Z}}_{n}$ learned in EFI, it is necessary for ${\boldsymbol{w}}_{n}^{*}$ to be a consistent estimator of ${\boldsymbol{w}}_{*}$ , the parameters of the underlying true EFI network. To ensure this consistency, Liang et al. (2024) impose some conditions on the structure of the DNN and the prior distribution $\pi({\boldsymbol{w}}_{n})$ . Specifically, they assume that ${\boldsymbol{w}}_{n}$ takes values in a compact space $\mathcal{W}$ ; $\pi({\boldsymbol{w}}_{n})$ is a truncated mixture Gaussian distribution on $\mathcal{W}$ ; and the DNN structure satisfies certain constraints given in Sun et al. (2022), e.g., the width of the output layer (i.e., the dimension of ${\boldsymbol{\theta}}$ ) is fixed or grows very slowly with $n$ . They then justify the consistency of ${\boldsymbol{w}}_{n}^{*}$ based on the sparse deep learning theory developed in Sun et al. (2022). The consistency of ${\boldsymbol{w}}_{n}^{*}$ further implies that

G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}^{*}),

serves as a consistent estimator for the inverse function/mapping ${\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ .

By Theorem 3.2 in Liang et al. (2024), for the target model (1), which is a noise-additive model, the EFD of ${\boldsymbol{Z}}_{n}$ is invariant to the choice of the inverse function, provided that $d(\cdot)$ is specified as in (10) in defining the energy function. Further, by Lemma 4.2 in Liang et al. (2024), $p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{m},{\boldsymbol{w}}_{n}^{*})$ is given by

\frac{dP_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})}{d\nu}=\frac{\pi_{0}^{\otimes n}({\boldsymbol{z}})}{\int_{\mathcal{Z}_{n}}\pi_{0}^{\otimes n}({\boldsymbol{z}})d\nu},

(19)

where $P_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})$ represents the cumulative distribution function (CDF) corresponding to $p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})$ ; $\mathcal{Z}_{n}=\{{\boldsymbol{z}}:U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{*})=0\}$ represents the zero-energy set, which forms a manifold in the space $\mathbb{R}^{n}$ ; and $\nu$ is the sum of intrinsic measures on the $p$ -dimensional manifold in $\mathcal{Z}_{n}$ . That is, under the consistency of ${\boldsymbol{w}}_{n}^{*}$ , $p_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})$ is reduced to a truncated density function of $\pi_{0}^{\otimes n}({\boldsymbol{z}})$ on the manifold $\mathcal{Z}_{n}$ , while $\mathcal{Z}_{n}$ itself is also invariant to the choice of the inverse function as shown in Lemma 3.1 of Liang et al. (2024). In other words, for the model (1), the EFD of ${\boldsymbol{Z}}_{n}$ is asymptotically invariant to the inverse function we learned given its consistency.

Let $\Theta:=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p}:{\boldsymbol{\theta}}=G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{z}}),{\boldsymbol{z}}\in\mathcal{Z}_{n}\}$ denote the parameter space of the target model, which represents the set of all possible values of ${\boldsymbol{\theta}}$ that $G^{*}(\cdot)$ takes when ${\boldsymbol{z}}$ runs over $\mathcal{Z}_{n}$ . Then, for any function $b({\boldsymbol{\theta}})$ of interest, its EFD $\mu_{n}^{*}(\cdot|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n})$ associated with $G^{*}(\cdot)$ is given by

\begin{split}\mu_{n}^{*}(B|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n})&=\int_{\mathcal{Z}_{n}(B)}dP_{n}^{*}({\boldsymbol{z}}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n}^{*}),\quad\mbox{for any measurable set $B\subset\Theta$},\end{split}

(20)

where $\mathcal{Z}_{n}(B)=\{{\boldsymbol{z}}\in\mathcal{Z}_{n}:b(G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{z}}))\in B\}$ . The EFD provides an uncertainty measure for $b({\boldsymbol{\theta}})$ . Practically, the EFD of $b({\boldsymbol{\theta}})$ can be constructed based on the samples $\{b(\bar{{\boldsymbol{\theta}}}_{1}),b(\bar{{\boldsymbol{\theta}}}_{2}),\ldots,b(\bar{{\boldsymbol{\theta}}}_{M})\}$ , where $\{\bar{{\boldsymbol{\theta}}}_{1},\bar{{\boldsymbol{\theta}}}_{2},\ldots,\bar{{\boldsymbol{\theta}}}_{M}\}$ denotes the fiducial $\bar{{\boldsymbol{\theta}}}$ -samples collected at step (iv) of Algorithm 1.

Finally, we note that, as discussed in Liang et al. (2024), the invariance property of $\mathcal{Z}_{n}$ is not crucial to the validity of EFI, although it does enhance the robustness of the inference. Additionally, for a neural network model, its parameters are only unique up to certain loss-invariant transformations, such as reordering hidden neurons within the same hidden layer or simultaneously altering the sign or scale of certain connection weights, see Sun et al. (2022) for discussions. Therefore, in EFI, the consistency of ${\boldsymbol{w}}_{n}^{*}$ refers to its consistency with respect to one of the equivalent solutions to (15), while mathematically ${\boldsymbol{w}}_{n}^{*}$ can still be treated as unique. Refer to Section $\S$ 1.1 (of the supplement) for more discussions on this issue.

3 EFI for Large Models

In this section, we first establish the consistence of the inverse function/mapping learned in EFI for large models, and then discuss its application for uncertainty quantification of deep neural networks.

3.1 Consistency of Inverse Mapping Learned in EFI for Large Models

It is important to note that the sparse deep learning theory of Sun et al. (2022) is developed under the general constraint $dim({\boldsymbol{w}}_{n})=O(n^{1-\delta})$ for some $0<\delta<1$ , which restricts the dimension of the output layer of the DNN model to be fixed or grows very slowly with the sample size $n$ . Therefore, under its current theoretical framework, EFI can only be applied to the models for which the dimension is fixed or increases very slowly with $n$ .

To extend EFI to large models, where the dimension of ${\boldsymbol{\theta}}$ can grow with $n$ at a rate of $O(n^{\zeta})$ , particularly for $1/2\leq\zeta<1$ , we provide a new proof for the consistency of $G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ based on the theory of stochastic deep learning (Liang et al., 2022b). Specifically, we establish the following theorem, where the output layer width of the DNN in the EFI network is set to match the dimension of ${\boldsymbol{\theta}}$ . The proof is lengthy and provided in the supplement.

Theorem 3.1

Suppose Assumptions 1-6 hold (see the supplement), $\epsilon$ is sufficiently small, and

\sum_{l=1}^{H}d_{l}\prec n,

(21)

where $d_{l}$ denotes the width of layer $l$ , $d_{H}=dim({\boldsymbol{\theta}})$ , and $H$ denotes the depth of the DNN in the EFI network. Then $G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},{\boldsymbol{x}}_{i},z_{i},{\boldsymbol{w}}_{n}^{*})$ constitutes a consistent estimator of the inverse function.

As implied by (21), we have $d_{l}\prec n$ holds for each layer $l=1,2,\ldots,H$ . We call such a neural network a narrow DNN. For narrow DNNs, by the existing theory, see e.g., Kidger and Lyons (2020), Park et al. (2020), and Kim et al. (2023), the universal approximation can be achieved with a minimum hidden layer width of $\max\{d_{0}+1,d_{H}\}$ , where $d_{0}$ and $d_{H}$ represent the widths of the input and output layers, respectively. Hence, (21) implies that EFI can be applied to statistical inference for a large model of dimension

dim({\boldsymbol{\theta}})=d_{H}=O(n^{\zeta}),\quad 0\leq\zeta<1,

under the narrow DNN setting with the depth $H=O(n^{\beta})$ for some $0<\beta<1-\zeta$ . Here, Without loss of generality, we assume $d_{0}\preceq d_{H}$ . For such a DNN, the total dimension of ${\boldsymbol{w}}_{n}$ :

dim({\boldsymbol{w}}_{n})=\sum_{i=1}^{H}d_{i}(d_{i-1}+1)=O(n^{2\zeta+\beta}),

can be much greater than $n$ , where ‘1’ represents the bias parameter of each neuron at the hidden and output layers. Specifically, we can have $dim({\boldsymbol{w}}_{n})\succ n$ with appropriate choices of $\zeta$ and $\beta$ . However, leveraging the asymptotic equivalence between the DNN and an auxiliary stochastic neural network (StoNet) (Liang et al., 2022b), we can still prove that the resulting estimator of ${\boldsymbol{\theta}}$ is consistent, see the supplement for the detail.

Regarding this extension of the EFI method for statistical inference of large models, we have an additional remark:

Remark 1

In this paper, we impose a mixture Gaussian prior on ${\boldsymbol{w}}_{n}$ to ensure the consistency of ${\boldsymbol{w}}_{n}^{*}$ and, consequently, the consistency of the inverse mapping $G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ . However, this Bayesian treatment of ${\boldsymbol{w}}_{n}$ is not strictly necessary, although it introduces sparsity that improves the efficiency of EFI. For the narrow DNN, the consistency of the ${\boldsymbol{w}}_{n}$ estimator can also be established under the frequentist framework by leveraging the asymptotic equivalence between the DNN and the auxiliary StoNet, using the same technique introduced in the supplement (see Section $\S$ 1.2). In this narrow and deep setting, each of the regressions formed by the StoNet is low-dimensional (with $d_{l}\prec n$ ), making the Bayesian treatment of ${\boldsymbol{w}}_{n}$ unnecessary while still achieving a consistent estimator of ${\boldsymbol{w}}_{n}$ .

3.2 Double-NN Method

Suppose a DNN is used for modeling the data, i.e., approximating the function $f(\cdot)$ in (5). By Sun et al. (2022) and Farrell et al. (2021), a DNN of size $O(n^{\zeta})$ for some $0<\zeta<1$ has been large enough for approximating many classes of functions. Therefore, EFI can be used for making inference for such a DNN model. In this case, EFI involves two neural networks, one is for modeling the data, which is called the ‘data modeling network’ and parameterized by ${\boldsymbol{\theta}}$ ; and the other one is for approximating the inverse function, which is called the ‘inverse mapping network’ and parameterized by ${\boldsymbol{w}}_{n}$ . Therefore, the proposed method is coined as ‘double-NN’. Note that during the EFI training process, only the parameters ${\boldsymbol{w}}_{n}$ of the inverse mapping network are updated in equation (16) of Algorithm 1. The parameters of the data modeling network are subsequently updated in response to the adjustment of ${\boldsymbol{w}}_{n}$ , based on the formula given in (8).

In our theoretical study for the double-NN method, we actually assume that the true data-generating model $Y=f({\boldsymbol{X}},Z,{\boldsymbol{\theta}})$ is a neural network, thereby omitting the approximation error of the data modeling network, based on its universal approximation capability. In practice, we have observed that the double-NN method is robust to this approximation error. Specifically, even when the true model is not a neural network, EFI can still recover the true random errors with high accuracy and achieve the zero-energy solution as $n\to\infty$ and $\epsilon\to 0$ . A further theoretical exploration of this phenomenon would be of interest.

As mentioned previously, for a neural network model, its parameters are only unique up to certain loss-invariant transformations. As the training sample size $n$ becomes large, we expect that the optimizers $\hat{{\boldsymbol{\theta}}}:=\arg\max_{{\boldsymbol{\theta}}}\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{\theta}})$ are all equivalent. Thus, in this paper, the consistency of $\hat{{\boldsymbol{\theta}}}$ refers to its consistency with respect to one of the equivalent global optimizers, while mathematically $\hat{{\boldsymbol{\theta}}}$ can still be treated as unique. A similar issue occurs to the parameters of the inverse mapping network, as discussed in Section $\S$ 1.1 of the supplement.

4 An Illustrative Example for EFI

To illustrate how EFI works for statistical inference problems, we consider a linear regression example:

y_{i}=\tau T_{i}+\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}}+\sigma z_{i},\quad i=1,2,\ldots,n,

(22)

where $T_{i}\in\{0,1\}$ is a binary variable indicating the treatment assignment, $\tau$ is the treatment effect, ${\boldsymbol{x}}_{i}\in\mathbb{R}^{d}$ are confounders/covariates, $z_{i}\sim N(0,1)$ is the standardized random noise, and ${\boldsymbol{\beta}}\in\mathbb{R}^{d}$ and $\sigma\in\mathbb{R}_{+}$ are unknown parameters. For this example, $\tau$ represents the ATE as well as the CATE, due to its independence of the covariates ${\boldsymbol{x}}$ . In the simulation study, we set $\tau=1$ , $\mu=1$ , $d=4$ , and ${\boldsymbol{\beta}}=(-1,1,-1,1)^{\top}$ ; generate ${\boldsymbol{x}}_{i}\sim N(0,I_{d})$ ; and generate the treatment variable via a logistic regression:

P(T_{i}=1)=\frac{1}{1+\exp\{-\nu-{\boldsymbol{\xi}}^{\top}{\boldsymbol{x}}_{i}\}},

(23)

where $\nu=1$ and ${\boldsymbol{\xi}}=(-1,1,-1,1)^{\top}$ . We consider three different cases with the sample size $n=250$ , 500 and 1000, respectively. For each case, we generate 100 datasets.

Statistical inference for the parameters in the model (22) can be made with EFI under its standard framework. Let ${\boldsymbol{\theta}}=(\tau,\mu,{\boldsymbol{\beta}}^{\top},\log\sigma)^{\top}$ be the parameter vector. EFI approximates the inverse function ${\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z)$ by a DNN, for which $(y,T,{\boldsymbol{x}},z)$ serves as input variables and ${\boldsymbol{\theta}}$ as output variables. The results are summarized in Table 1.

For comparison, a variety of methods, including Unadj (Imbens and Rubin, 2015), inverse probability weighting (IPW) (Rosenbaum, 1987), double-robust (DR) (Robins et al., 1994; Bang and Robins, 2005), and BART (Hill, 2011), have been applied to this example. These methods fall into distinct categories. The Unadj is straightforward, estimating the ATE by calculating the difference between the treatment and control groups, i.e., $\hat{\tau}=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}Y_{i}(1)-\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}Y_{i}(0)$ , where the effect of confounders is not adjusted. Both IPW and DR are widely used ATE estimation methods, which adjust the effect of confounders based on propensity scores. They both are implemented using the R package drgee (Zetterqvist and Sjölander, 2015). The BART employs Bayesian additive regression trees to learn the outcome function, which naturally accommodates heterogeneous treatment effects as well as nonlinearity of the outcome function. It is implemented using the R package bartcause (Dorie and Hill, 2020).

Table 1: Comparison of EFI with various ATE estimation methods, where “coverage” refers to the averaged coverage rate of

\tau

, “length” refers to the averaged width of confidence intervals, and the number in the parentheses refers to the standard deviation of the averaged width. The averages and standard deviations were calculated based on 100 datasets.

Method	coverage	length	coverage	length	coverage	length
	$n=250$		$n=500$		$n=1000$
Unadj	0.95	1.161(0.066)	0.93	0.822(0.032)	0.97	0.424(0.017)
BART	0.99	0.857(0.070)	0.98	0.611(0.047)	0.96	0.428(0.024)
IPW	0.90	0.710(0.157)	0.92	0.560(0.141)	0.92	0.417(0.101)
DR	0.96	0.652(0.058)	0.93	0.465(0.033)	0.94	0.331(0.017)
EFI	0.95	0.647(0.033)	0.95	0.438(0.021)	0.95	0.338(0.012)

The comparison indicates that EFI performs very well for this standard ATE estimation problem. Specifically, EFI generates confidence intervals of nearly the same length as DR, but with more accurate coverage rates. This is remarkable, as DR has often been considered as the golden standard for ATE estimation and is consistent if either the outcome or propensity score models is correctly specified, and locally efficient if both are correctly specified. Furthermore, EFI produces much shorter confidence intervals compared to Unadj, IPW, and BART, while maintaining more accurate coverage rates.

We attribute the superior performance of EFI on this example to its fidelity in parameter estimation, an attractive property of EFI as discussed in Liang et al. (2024). As implied by (14), EFI essentially estimates ${\boldsymbol{\theta}}$ by maximizing the predictive likelihood function $\pi_{\epsilon}({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{\theta}})\propto\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})e^{-U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}})/\epsilon}$ , which balances the likelihood of ${\boldsymbol{Z}}_{n}$ and the model fitting errors coded in $U_{n}(\cdot)$ . In contrast, the maximum likelihood estimation (MLE) method sets $\hat{{\boldsymbol{\theta}}}_{MLE}=\arg\max_{{\boldsymbol{\theta}}}\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})$ , where ${\boldsymbol{Z}}_{n}$ is expressed as a function of $({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{\theta}})$ . In general, MLE is inclined to be influenced by the outliers and deviations of covariates especially when the sample size is not sufficiently large. It is important to note that the MLE serves as the core for all the IPW, DR and BART methods in estimating the outcome and propensity score models. For this reason, various adjustments for confounding and heterogeneous treatment effects have been developed in the literature.

Compared to the existing causal inference methods, EFI works as a solver for the data-generating equation (as $\epsilon\downarrow 0$ ), providing a coherent way to address the confounding and heterogeneous treatment effects and resulting in faithful estimates for the model parameters and their uncertainty as well. This example illustrates the performance of EFI in ATE estimation when confounders are present, while the examples in the next section showcase the performance of EFI in dealing with heterogeneous treatment effects via DNN modeling. Extensive comparisons with BART and other nonparametric modeling methods are also presented.

In this example, we omit the estimation of the propensity score model. As discussed in Section 6, the proposed method can be extended by including an additional DNN to approximate the propensity score, enabling the use of inverse probability weighting for ATE estimation. However, the ATE estimation is not the focus of this work.

5 Causal Inference for Individual Treatment Effects

This section demonstrates how EFI can be used to perform statistical inference of the predictive ITE for the data-generating model (1). Let ${\boldsymbol{\theta}}_{c}$ denote the vector of parameters for modeling the function $c({\boldsymbol{x}})$ , let ${\boldsymbol{\theta}}_{\tau}$ denote the vector of parameters for modeling the function $\tau({\boldsymbol{x}})$ , and let ${\boldsymbol{\theta}}=\{{\boldsymbol{\theta}}_{c},{\boldsymbol{\theta}}_{\tau},\log(\sigma)\}$ denote the whole set of parameters for the model (1). We model the inverse function ${\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z)$ by a DNN. Also, we can model each of the functions $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ by a DNN if their functional forms are unknown. For convenience, we refer to the DNN for modeling $c({\boldsymbol{x}})$ as ‘ $c$ -network’ and that for modeling $\tau({\boldsymbol{x}})$ as ‘ $\tau$ -network’, and ${\boldsymbol{\theta}}_{c}$ and ${\boldsymbol{\theta}}_{\tau}$ represent their weights, respectively. As mentioned previously, we can restrict the sizes of the $c$ -network and $\tau$ -network to the order of $O(n^{\tilde{\zeta}})$ for some $0<\tilde{\zeta}<1$ .

Note that in solving the data generating equations (1), the proposed method involves two types of neural networks: one for modeling causal effects and the other for approximating the inverse function ${\boldsymbol{\theta}}=g(y,T,{\boldsymbol{x}},z)$ . While we still refer to the proposed method as ‘Double-NN’, it actually involves three DNNs.

5.1 ITE prediction intervals

Assume the training set consists of $n_{train}$ subjects, and the test set consists of $n_{test}$ subjects. The subjects in the test set can be grouped into three categories: (i) $\{({\boldsymbol{x}}_{i},0,Y_{i}(1),Y_{i}^{obs}(0))$ , $i\in\mathcal{I}_{c}\}$ , where the responses under the control are observed; (ii) $\{({\boldsymbol{x}}_{i},1,Y_{i}^{obs}(1),Y_{i}(0)):i\in\mathcal{I}_{t}\}$ , where the responses under the treatment are observed; and (iii) $\{({\boldsymbol{x}}_{i},T_{i},Y_{i}(1),Y_{i}(0)):i\in\mathcal{I}_{m}\}$ , where only covariates are observed. Here, we use $\mathcal{I}_{c}$ , $\mathcal{I}_{t}$ , and $\mathcal{I}_{m}$ to denote the index sets of the subjects in the respective categories and, therefore, $\mathcal{I}_{c}\cup\mathcal{I}_{t}\cup\mathcal{I}_{m}=\{1,\dots,n_{test}\}$ . For the ITE of each subject in the test set, we can construct the prediction interval with a desired confidence level of $1-\alpha$ in the following procedure:

(i)

For subject $i\in\mathcal{I}_{c}$ : At each iteration $k$ of Algorithm 1, calculate the prediction $\hat{Y}_{i}^{(k)}(1)=\hat{c}^{(k)}({\boldsymbol{x}}_{i})+\hat{\tau}^{(k)}({\boldsymbol{x}}_{i})+\hat{\sigma}^{(k)}Z_{new}^{(k,1)}$ , where $Z_{new}^{(k,1)}\sim N(0,1)$ . Let $c_{l}({\boldsymbol{x}}_{i},1)$ and $c_{u}({\boldsymbol{x}}_{i},1)$ denote, respectively, the $\frac{\alpha}{2}$ - and $(1-\frac{\alpha}{2})$ -quantiles of $\{\hat{Y}_{i}^{(k)}(1):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\}$ collected over iterations. Since $Y_{i}^{obs}(0)$ is observed, $(c_{l}({\boldsymbol{x}}_{i},1)-Y_{i}^{obs}(0),c_{u}({\boldsymbol{x}}_{i},1)-Y_{i}^{obs}(0))$ forms a $(1-\alpha)$ -prediction interval for the ITE $Y_{i}(1)-Y_{i}^{obs}(0)$ .
(ii)

For subject $i\in\mathcal{I}_{t}$ : At each iteration $k$ of Algorithm 1, calculate the prediction $\hat{Y}_{i}^{(k)}(0)=\hat{c}^{(k)}({\boldsymbol{x}}_{i})+\hat{\sigma}^{(k)}Z_{new}^{(k,2)}$ , where $Z_{new}^{(k,2)}\sim N(0,1)$ . Let $c_{l}({\boldsymbol{x}}_{i},0)$ and $c_{u}({\boldsymbol{x}}_{i},0)$ denote, respectively, the $\frac{\alpha}{2}$ - and $(1-\frac{\alpha}{2})$ -quantiles of $\{\hat{Y}_{i}^{(k)}(0):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\}$ collected over iterations. Since $Y_{i}^{obs}(1)$ is observed, $(Y_{i}^{obs}(1)-c_{u}({\boldsymbol{x}}_{i},0),Y_{i}^{obs}(1)-c_{l}({\boldsymbol{x}}_{i},1))$ forms a $(1-\alpha)$ -prediction interval for the ITE $Y_{i}^{obs}(1)-Y_{i}(0)$ .
(iii)

For subject $i\in\mathcal{I}_{m}$ : At each iteration $k$ of Algorithm 1, calculate the prediction $\hat{Y}_{i}^{(k)}(1)-\hat{Y}_{i}^{(k)}(0)=\hat{\tau}^{(k)}({\boldsymbol{x}}_{i})+\sqrt{2}\hat{\sigma}^{(k)}Z_{new}^{(k,3)}$ , where $Z_{new}^{(k,3)}\sim N(0,1)$ . Let $c_{l}({\boldsymbol{x}}_{i})$ and $c_{u}({\boldsymbol{x}}_{i})$ denote, respectively, the $\frac{\alpha}{2}$ - and $(1-\frac{\alpha}{2})$ -quantiles of $\{\hat{Y}_{i}^{(k)}(1)-\hat{Y}_{i}^{(k)}(0):k=\mathcal{K}+1,\mathcal{K}+2,\ldots,\mathcal{K}+M\}$ collected over iterations. Then $(c_{l}({\boldsymbol{x}}_{i}),c_{u}({\boldsymbol{x}}_{i}))$ forms a $(1-\alpha)$ -prediction interval for the ITE $Y_{i}(1)-Y_{i}(0)$ .

5.2 Simulation Study

Example 1

Consider the data-generating equation

y_{i}=\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}}+(\eta_{0}+\eta({\boldsymbol{x}}_{i}))T_{i}+\sigma z_{i},\quad i=1,2,\ldots,n,

(24)

where ${\boldsymbol{x}}_{i}=(x_{i,1},x_{i,2})^{\top}$ with each element drawn independently from $Unif(0,1)$ , $\mu=1$ , ${\boldsymbol{\beta}}=(1,1)^{\top}$ , $\eta_{0}=1$ , $\sigma=1$ , $z_{i}\sim N(0,1)$ , and $\eta({\boldsymbol{x}}_{i})=s(x_{i1})s(x_{i2})-E(s(x_{i1})s(x_{i2}))$ . As in Lei and Candès (2021), we set $s(a)=\frac{2}{1+exp(-12(a-0.5))}$ , and generate the treatment variable $T_{i}$ according to the propensity score model:

e({\boldsymbol{x}}_{i})=\frac{1}{4}(1+\beta_{2,4}(x_{i,1})),

(25)

where $\beta_{2,4}$ is the CDF of the beta distribution with parameters (2,4), ensuring $e({\boldsymbol{x}}_{i})\in[0.25,0.5]$ and thereby sufficient overlap between the treatment and control groups. In terms of equation (1), we have $c({\boldsymbol{x}}_{i})=\mu+{\boldsymbol{x}}_{i}^{\top}{\boldsymbol{\beta}}$ and $\tau({\boldsymbol{x}}_{i})=\eta_{0}+\eta({\boldsymbol{x}}_{i})$ . We generated 20 datasets from the model (24) independently, each consisting of $n_{train}=500$ training samples and $n_{test}=1000$ test samples.

Table 2: Comparison of Double-NN and CQR for inference of the predictive ITE for Example (24), where the coverage and length of the prediction intervals were calculated by averaging over 20 datasets with the standard deviation given in the parentheses.

Case $\mathcal{I}_{c}$ Case $\mathcal{I}_{t}$ Case $\mathcal{I}_{m}$ Method Coverage Length Coverage Length Method Coverage Length Double-NN 0.9549 4.2004 0.9581 4.1812 Double-NN 0.9583 5.6056 (0.0095) (0.1567) (0.0098) (0.1541) (0.0103) (0.2207) CQR-BART 0.9472 4.2702 0.9533 4.4024 CQR(inexact) 0.9530 6.3244 (0.0342) (0.5225) (0.0341) (0.8972) (0.0198) (0.5426) CQR-Boosting 0.9556 5.5199 0.9548 4.4493 CQR(exact) 1.0000 13.4005 (0.0294) (0.5866) (0.0259) (0.5097) (0.0002) (2.4936) CQR-RF 0.9529 5.4609 0.9652 4.6428 CQR(naive) 0.9998 12.8861 (0.0233) (0.5172) (0.0171) (0.5408) (0.0004) (1.5275) CQR-NN 0.9570 6.4072 0.9755 5.8125 (0.0195) (0.8087) (0.0199) (1.4332)

For this example, we assume the functional form of $c({\boldsymbol{x}})$ is known and model $\tau({\boldsymbol{x}})$ by a DNN. The DNN has two hidden layers, each consisting of 10 hidden neurons. The number of parameters of the DNN is $|{\boldsymbol{\theta}}_{\tau}|=151$ , and the total dimension of ${\boldsymbol{\theta}}=(\mu,\eta_{0},{\boldsymbol{\beta}},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top}$ is 156 $(\approx n_{train}^{0.81})$ , which falls into the class of large models.

Refer to Section $\S$ 3 of the supplement for parameter settings for the Double-NN method. For comparison, the conformal quantile regression (CQR) method (Romano et al., 2019; Lei and Candès, 2021) was applied to this example, where the outcome function was approximated using different machine learning methods, including BART (Chipman et al., 2010), Boosting (Schapire, 1990; Breiman, 1998), and random forest (RF) (Breiman, 2001), and neural network (NN). Refer to Section $\S$ 2 of the supplement for a brief description of the CQR method. For CQR-NN, we used a neural network of structure $(p+1)$ -10-10-2 to model the outcome quantiles, where the extra input variable is for treatment and the two output neurons are for $(\alpha/2,1-\alpha/2)$ -quantiles of the outcome (Romano et al., 2019). Additionally, we used a neural network of structure $p$ -10-10-1 to model the propensity score in order to compute weighted CQR as in Lei and Candès (2021).

The other CQR methods were implemented using the R package cfcausal (Lei and Candès, 2021). For the case $\mathcal{I}_{m}$ , we considered CQR-BART only, given its relative superiority over other CQR methods in the cases $\mathcal{I}_{c}$ and $\mathcal{I}_{t}$ .

The results were summarized in Table 2. The comparison shows that the Double-NN method outperforms the CQR methods in both the coverage rate and length of the prediction intervals under all the three cases $\mathcal{I}_{c}$ , $\mathcal{I}_{t}$ , and $\mathcal{I}_{m}$ . Specifically, the prediction intervals resulting from the Double-NN method tend to be shorter, while their coverage rates tend to be closer to the nominal level.

Figure 2 demonstrates the rationale underlying the Double-NN method. The left scatter plot compares the imputed and true values of the latent variables for a dataset simulated from (24), where the imputed values were collected at the last iteration of Algorithm 1. The comparison reveals a close match between the imputed and true latent variable values, with the variability of the imputed values representing the source of uncertainty in the data-generating system. This variability in the latent variables can be propagated to ${\boldsymbol{\theta}}$ through the estimated inverse function $G(\cdot)$ , leading to the uncertainty in parameters and, consequently, the uncertainty in predictions. The middle scatter plot shows that the imputed latent variable values follows the standard Gaussian distribution, as expected. The right scatter plot compares the estimated and true values of the function $\tau({\boldsymbol{x}}_{i})$ , with the variability of the estimator representing its uncertainty. This plot further implies that the Double-NN method not only works for performing inference for the predictive ITE but also works for performing inference for CATE.

Example 2

Consider the data-generating equation

y_{i}=c({\boldsymbol{x}}_{i})+\tau({\boldsymbol{x}}_{i})T_{i}+\sigma z_{i},\quad i=1,2,\ldots,n,

(26)

where ${\boldsymbol{x}}_{i}=(x_{i,1},x_{i,2},\ldots,x_{i,5})^{\top}$ with each element drawn independently from $Unif(0,1)$ , $\tau({\boldsymbol{x}})$ and $T_{i}$ are generated as in Example 1 except that ${\boldsymbol{x}}_{i}$ contains three extra false covariates, $c({\boldsymbol{x}}_{i})=\frac{2x_{i,1}}{1+5x_{i,2}^{2}}$ , $\sigma=1$ , and $z_{i}\sim N(0,1)$ . We simulated 20 datasets from this equation, each consisting of $n_{train}=1000$ training samples and $n_{test}=1000$ test samples.

For this example, we modeled both $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ using DNNs. Each of the DNNs consists of two hidden layers, each layer consisting of 10 hidden neurons. In consequence, ${\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top}$ has a total dimension of 363 $(\approx n_{train}^{0.85})$ .

Similar to Example 1, we also applied the CQR methods (Lei and Candès, 2021) to this example for comparison. The CQR methods were implemented as described in Example 1. The results were summarized in Table 3, which indicates again that the Double-NN method outperforms the CQR methods under all the three cases $\mathcal{I}_{c}$ , $\mathcal{I}_{t}$ , and $\mathcal{I}_{m}$ . The prediction intervals resulting from the Double-NN method tend to be shorter, while their coverage rates tend to be closer to the nominal level.

Similar to Figure 2, Figure 3 demonstrates the rationale underlying the Double-NN method, as well as its capability for CATE inference. The left plot demonstrates the variability embedded in the latent variables of the data-generating system. The middle-left plot shows that the imputed latent variables are distributed according to the standard Gaussian distribution, as expected. The right two plots display the estimates of $c({\boldsymbol{x}}_{i})$ and $\tau({\boldsymbol{x}}_{i})$ , respectively. Once again, we note that the variations of the estimates of $c({\boldsymbol{x}}_{i})$ and $\tau({\boldsymbol{x}}_{i})$ , as depicted in their respective scatter plots, reflect their uncertainty according to the theory of EFI.

Table 3: Comparison of Double-NN and CQR for inference of the predictive ITE for Example (26), where the coverage and length of the prediction intervals were calculated by averaging over 20 datasets with the standard deviation given in the parentheses.

Case $\mathcal{I}_{c}$ Case $\mathcal{I}_{t}$ Case $\mathcal{I}_{m}$ Method Coverage Length Coverage Length Method Coverage Length Double-NN 0.9519 4.2727 0.9645 4.246 Double-NN 0.9604 6.0079 (0.0111) (0.0101) (0.0069) (0.0967) (0.0946) (0.1363) CQR-BART 0.9584 4.3586 0.9545 4.2658 CQR(inexact) 0.9386 6.0492 (0.0220) (0.4392) (0.0230) (0.4586) (0.0270) (0.6062) CQR-Boosting 0.9536 4.9942 0.9572 4.4393 CQR(exact) 0.9996 12.1252 (0.0175) (0.4044) (0.0194) (0.4213) (0.0007) (1.1022) CQR-RF 0.9563 5.6658 0.9580 4.4399 CQR(naive) 0.9988 11.5566 (0.0198) (0.4777) (0.0232) (0.5044) (0.0014) (0.9309) CQR-NN 0.9595 4.6748 0.9452 3.9579 (0.0165) (0.6015) (0.0185) (0.4301)

Precision in Estimation of Heterogeneous Effects

As demonstrated in Figure 2 and Figure 3, the Double-NN method can also be used for inference of CATE. The performance in CATE estimation is often measured using the expected Precision in Estimation of Heterogeneous Effects (PEHE), which is defined as:

\epsilon_{PEHE}=\int_{\mathcal{X}}(\hat{\tau}({\boldsymbol{x}})-\tau({\boldsymbol{x}}))^{2}dF({\boldsymbol{x}}),

where $F({\boldsymbol{x}})$ denotes the distribution function of the covariates ${\boldsymbol{X}}$ . As we can see, $\epsilon_{PEHE}$ summarizes the precision of the CATE over the entire sample space $\mathcal{X}$ (Hill, 2011; Shalit et al., 2017; Caron et al., 2022). In practice, since we only observe the treatment effect on the treatment group, the target of interest is generally only for the treatment group, i.e $\epsilon_{PEHE}^{(T)}=\int_{\mathcal{X}}(\hat{\tau}({\boldsymbol{x}})-\tau({\boldsymbol{x}}))^{2}dF_{T}({\boldsymbol{x}})$ , where $F_{T}({\boldsymbol{x}})$ denotes the distribution function of the covariates in the treatment group. We estimated $\epsilon_{PEHE}^{(T)}$ by $\hat{\epsilon}_{PEHE}^{(T)}=\frac{1}{n_{t}}\sum_{i\in I_{t}}(\hat{\tau}({\boldsymbol{x}}_{i})-\tau({\boldsymbol{x}}_{i}))^{2}$ . For the Double-NN method, we set $\hat{\tau}({\boldsymbol{x}}_{i})=\frac{1}{M}\sum_{k=\mathcal{K}+1}^{\mathcal{K}+M}\hat{\tau}^{(k)}({\boldsymbol{x}}_{i})$ , where $M$ denotes the number of estimates of $\tau({\boldsymbol{x}}_{i})$ collected in a run of Algorithm 1.

For comparison, the existing CATE estimation methods, including single-learner (S-learner), two-learner (T-learner), and X-learner (Künzel et al., 2019), have been applied to the datasets generated above, where the RF and BART are used as the base learners. In the S-learner, a single outcome function is estimated using a base learner with all available covariates, where the treatment indicator is treated as a covariate, and then estimate CATE by $\hat{\tau}_{S}=\hat{\mu}({\boldsymbol{x}},1)-\hat{\mu}({\boldsymbol{x}},0)$ , where $\hat{\mu}({\boldsymbol{x}},t)$ denotes the outcome function estimator. The T-learner estimates the outcome functions using a base learner separately for the units under the control and those under the treatment, and then estimate CATE by $\hat{\tau}_{T}({\boldsymbol{x}})=\hat{\mu}_{1}({\boldsymbol{x}})-\hat{\mu}_{0}({\boldsymbol{x}})$ , where $\hat{\mu}_{t}({\boldsymbol{x}})$ denote the outcome function estimator for the assignment group $t\in\{0,1\}$ . The X-learner builds on the T-learner; it uses the observed outcomes to estimate the unobserved ITEs, and then estimate the CATE in another step as if the ITEs were observed. Refer to Künzel et al. (2019) and Caron et al. (2022) for the detail. We implemented the S-learner, T-learner, and X-leaner using the package downloaded at https://github.com/albicaron/EstITE.

Table 4 compares the values of $\hat{\epsilon}_{PEHE}^{(T)}$ resulting from the Double-NN, S-learners, T-learners, and X-learners. for the models (24) and (26). The comparison shows that the Double-NN method outperforms the existing ones in achieving consistent CATE estimates over different covariate values. This is remarkable! As explained in Section 4, we would attribute this performance of the Double-NN method to its fidelity in parameter estimation (Liang et al., 2024). Compared to the MLE method, which serves as the prototype for the base learners, the Double-NN method is forced to be more robust to covariates due to added penalty term $U_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})/\epsilon$ .

Table 4: Comparison of Double-NN and other methods in

\epsilon_{PEHE}^{(T)}

, where each of the mean and standard deviations was calculated based on 20 datasets generated from (24) or (26).

	Model (24)		Model (26)
Method	Training	Test	Training	Test
S-RF	0.3769 $\pm$ 0.0170	0.3660 $\pm$ 0.0188	0.3722 $\pm$ 0.0074	0.3377 $\pm$ 0.0100
S-BART	0.4233 $\pm$ 0.0156	0.4344 $\pm$ 0.0149	0.3371 $\pm$ 0.0099	0.3418 $\pm$ 0.0102
T-RF	0.4545 $\pm$ 0.0114	0.4198 $\pm$ 0.0118	0.4095 $\pm$ 0.0064	0.3488 $\pm$ 0.0084
T-BART	0.4190 $\pm$ 0.0139	0.4236 $\pm$ 0.0127	0.4308 $\pm$ 0.0092	0.4298 $\pm$ 0.0093
X-RF	0.3416 $\pm$ 0.0153	0.3451 $\pm$ 0.0162	0.2761 $\pm$ 0.0106	0.2789 $\pm$ 0.0106
X-BART	0.3863 $\pm$ 0.0137	0.3972 $\pm$ 0.0128	0.3853 $\pm$ 0.0102	0.3862 $\pm$ 0.0097
Double-NN	0.2962 $\pm$ 0.0167	0.3139 $\pm$ 0.0178	0.3788 $\pm$ 0.0105	0.3899 $\pm$ 0.0110

5.3 Real Data Analysis

5.3.1 Lalonde

The ‘LaLonde’ data is a well-known dataset used in causal inference to evaluate the effectiveness of a job training program in improving the employment prospects of participants. We used the dataset given in the package “twang” (Cefalu et al., 2021) among various versions. The dataset includes earning data in 1978 on 614 individuals, with 185 receiving job training and 429 in the control group. There are 8 covariates including various demographic, educational, and employment-related variables. While the LaLonde dataset has been widely used for ATE estimation, we use it to illustrate the Double-NN method for constructing ITE prediction intervals.

To evaluate the performance of different methods, we randomly split the LaLonde dataset into a training set and a test set. The training set, denoted by $\mathcal{D}_{train}$ , consists of $n_{train}=600$ observations; while the test set, denoted by $\mathcal{D}_{test}$ , consists of $n_{test}=14$ observations. We trained the Double-NN on $\mathcal{D}_{train}$ and constructed prediction intervals for each subject in $\mathcal{D}_{test}$ with a confidence level of $1-\alpha=0.5$ . For the Double-NN, we modeled both $c({\boldsymbol{x}})$ and $\tau({\boldsymbol{x}})$ using DNNs. Each of the DNNs consists of two hidden layers, with each layer consisting of 10 hidden neurons. In consequence, ${\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top}$ has a dimension of 423 $(\approx n_{train}^{0.95})$ , a challenging task for uncertainty quantification of the model.

Figure 4 displays the constructed ITE prediction intervals for the test data, comparing the proposed method to the CQR method (Lei and Candès, 2021). The comparison shows that the prediction intervals resulting from the proposed method are shorter than those from the CQR method, while the centers of those intervals are similar. This suggests that the proposed method is able to estimate the ITEs with a higher degree of precision.

5.3.2 NLSM

This subsection conducts an analysis on the ‘National Study of Learning Mindsets’ (NLSM) dataset used in the 2018 Atlantic Causal Inference Conference workshop (Yeager et al., 2019; Carvalho et al., 2019). NSLM records the results of a randomized evaluation for a “nudge-like” intervention designed to instill students with a growth mindset. The dataset is available at https://github.com/grf-labs/grf/tree/master/experiments/acic18, which includes 10,391 students from 76 schools, with four student-level covariates and six school-level students. After factoring the categorical variables, the dimension of covariates ${\boldsymbol{x}}$ increases to 29.

Due to unavailability of the true treatment effect values, we performed an exploratory analysis as in Lei and Candès (2021). In order to construct prediction intervals for the ITE, we split the dataset into two sets: $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ . The former has a sample size of $n_{train}=5200$ , and the latter has a sample size of $n_{test}=5191$ . For the Double-DNN method, we used DNNs to model the functions $\tau({\boldsymbol{x}})$ and $c({\boldsymbol{x}})$ . Each DNN consists of two hidden layers, with each hidden layer consisting of 10 hidden neurons. Therefore, the dimension of ${\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{c}^{\top},{\boldsymbol{\theta}}_{\tau}^{\top},\log(\sigma))^{\top}$ is 843 $(\approx n_{train}^{0.79})$ .

The Double-DNN was trained on $\mathcal{D}_{train}$ and the prediction intervals were constructed on $\mathcal{D}_{test}$ , which corresponds to case (iii) described in Section 5.1. This process was repeated 20 times. For comparison, the CQR method (Lei and Candès, 2021) was also applied to this example.

Figure 5 displays the average length of prediction intervals, obtained by Double-DNN and CQR, as a function of $\alpha$ , with the upper and lower envelops being respectively the $95\%$ and $5\%$ quantiles across 20 runs. For this example, we implemented CQR using the “inexact” method, and therefore, its interval lengths tend to be short with approximate validity. However, as shown in Figure 5, the prediction intervals resulting from the Double-NN method tend to be even shorter than those from CQR as $\alpha$ increases. Figure 6 (a) compares the fractions of the prediction intervals, obtained by Double-NN and CQR, that cover positive values only. While Figure 6 (b) compares the fractions of the prediction intervals that cover negative values only. In summary, the Double-NN can provide more accurate predictions for the ITE than CQR for this example. Specifically, the Double-NN identified fewer subjects with significant ITEs than the CQR, as implied by Figure 6 (a) and (b); while each has a narrow prediction interval, as implied by Figure 5.

6 Discussion

This paper extends EFI to statistical inference for large statistical models and applies the proposed Double-NN method to treatment effect estimation. The numerical results demonstrate that the Double-NN method significantly outperforms the existing CQR method in ITE prediction. As mentioned in the paper, we attribute the superior performance of the Double-NN method to its fidelity in parameter estimation. Due to the universal approximation ability of deep neural networks, the Double-NN method is generally applicable for causal effect estimation.

From the perspective of statistical inference, this paper advances the theory and methodology for making inference of large statistical models, allowing the model size to increase with the sample size $n$ at a rate of $O(n^{\zeta})$ for any exponent $0\leq\zeta<1$ . In particular, the Double-NN method provides a rigorous approach for quantifying the uncertainty of deep neural networks. In this paper, we have tested the performance of the Double-NN method on numerical examples with the exponent ranging $0.79\leq\zeta\leq 0.95$ , which all falls into the class of large models.

The Double-NN method can be further extended toward a general nonparametric approach for causal inference. Specifically, we can include an additional neural network to approximate the propensity score, enabling the outcome and propensity score functions to be simultaneously estimated. This extension will enable the use of inverse probability weighting methods to further improve ATE estimation, especially in the scenario where the covariate distributions in the treatment and control groups are imbalanced (Shalit et al., 2017; Hahn et al., 2020). From the perspective of EFI, this just corresponds to making inference for a different $b({\boldsymbol{\theta}})$ function. Similarly, for inference of ITE, a different $b({\boldsymbol{\theta}})$ function, including those adjusted with propensity scores, can also be used. The key advantage of EFI is its ability to automatically quantify the uncertainty of these functions as prescribed in (20), even when the functions are highly complex.

Regarding the size of large models, our theory does not preclude applications to large-scale DNNs with millions or even billions of parameters, as supported by the neural scaling law. As mentioned previously, Hestness et al. (2017) investigated the relationship between the DNN model size and the dataset size: they discovered a sub-linear scaling law of $dim({\boldsymbol{\theta}})\prec n$ across various model architectures in machine learning applications, including machine translation, language modeling, image processing, and speech recognition. Their findings suggest that Theorem 3.1 remains valid for large-scale DNNs by choosing an appropriate growth rate for their depth.

In practice, we often encounter small- $n$ -large- $p$ problems. For such a problem, we need to deal with a model of dimension $dim({\boldsymbol{\theta}})\succeq n$ , which is often termed as an over-parameterized model. A further extension of EFI for over-parameterized models is possible by imposing an appropriate sparsity constraint on ${\boldsymbol{\theta}}$ . How to make post-selection inference with EFI for the over-parameterized models will be studied in future work.

Finally, we note that a recent work by Williams (2023) demonstrates how conformal prediction sets arise from a generalized fiducial distribution. Given the inherent connections between GFI and EFI, we believe that the results established in Williams (2023) should also apply to EFI. In particular, EFI follows the same switching principle as GFI (Hannig et al., 2016), which infers the uncertainty of the model parameters from the distribution of unobserved random errors. Further research on EFI from this perspective is of great interest, as it could potentially alleviate EFI’s reliance on assumptions about the underlying data distribution in prediction uncertainty quantification.

Availability

The code that implements the Double NN method can be found at https://github.com/sehwankimstat/DoubleNN.

Acknowledgments

Liang’s research is supported in part by the NSF grants DMS-2015498 and DMS-2210819, and the NIH grant R01-GM152717. The authors thank the editor, associate editor, and referee for their constructive comments, which have led to significant improvement of this paper.

Appendix: Supplement for “Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks ”

This material is organized as follows. Section $\S$ 1 provides a proof for Theorem 3.1. Section $\S$ 2 provides a brief description for the CQR method. Section $\S$ 3 provides parameter settings for the experiments reported in the main text and this supplement.

Appendix $\S$ 1 Theoretical Proofs

To prove the validity of the proposed method, it is sufficient to prove that $\hat{g}(\cdot)$ constitutes a consistent estimator of ${\boldsymbol{\theta}}$ , building on the theory developed in Liang et al. (2024). To ensure the self-contained nature of this paper, we provide a concise overview of the theory presented in Liang et al. (2024) in Section $\S$ 1.1 of this supplement. Subsequently, our study will center on establishing the consistency of $\hat{g}(\cdot)$ .

$\S$ 1.1 Outline of the Proof

First of all, we note that the theoretical study is conducted under the assumption that the EFI network has been correctly specified such that a sparse EFI network $\tilde{{\boldsymbol{w}}}_{n}^{*}$ exists, from which the complete data $({\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*})$ can be generated, where ${\boldsymbol{Z}}_{n}^{*}$ represents the values of the latent variables realized in the observed samples. Specifically, we assume ${\boldsymbol{Z}}_{n}^{*}\sim\pi_{\epsilon}({\boldsymbol{Z}}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})$ holds as $\epsilon\downarrow 0$ .

For the EFI network, we define

\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*}):=\frac{1}{n}\log\pi_{\epsilon}({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n}).

(S1)

Therefore,

\hat{{\boldsymbol{w}}}_{n}^{*}:=\arg\max_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*}),

is the global maximizer of the posterior $\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*})$ . Also, we define

\begin{split}\widetilde{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})&:=\frac{1}{n}\int\log\pi_{\epsilon}({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})d\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n}).\\ &=\frac{1}{n}\Big{\{}\log\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})-\int\log\frac{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})}{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})}d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})\\ &+\int\log\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})+c\Big{\}},\\ \end{split}

(S2)

where $c=\log\int_{\mathcal{W}_{n}}\pi({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})\pi({\boldsymbol{w}}_{n})d{\boldsymbol{w}}_{n}$ is the log-normalizing constant of the posterior $\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})$ . Note that in the above derivation, ${\boldsymbol{X}}_{n}$ can be ignored for simplicity since it is constant. For simplicity of notation, we let

D_{KL}({\boldsymbol{w}}_{n})=\int\log\frac{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})}{\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})}d\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*}),

be the Kullback-Leibler divergence between $\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})$ and $\pi({\boldsymbol{Z}}_{n}^{*}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n})$ . Regarding the EFI network, we make the following assumption:

Assumption 1

The EFI network satisfies the conditions:

(i)

The parameter space $\mathcal{W}_{n}$ (of ${\boldsymbol{w}}_{n}$ ) is convex and compact.
(ii)

$\mathbb{E}(\log\pi_{\epsilon}(Y,Z|X,{\boldsymbol{w}}_{n}))^{2}<\infty$ for any ${\boldsymbol{w}}_{n}\in\mathcal{W}_{n}$ , where $(X,Y,Z)$ denotes a generic sample as those in $({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{X}}_{n})$ .

Define $Q^{*}({\boldsymbol{w}}_{n})=\mathbb{E}(\log\pi_{\epsilon}(Y,Z|X,{\boldsymbol{w}}_{n}))+\frac{1}{n}\log\pi({\boldsymbol{w}}_{n})$ . By Assumption 1 and the weak law of large numbers,

\frac{1}{n}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n})-Q^{*}({\boldsymbol{w}}_{n})\overset{p}{\rightarrow}0,

(S3)

holds uniformly over the parameter space $\mathcal{W}_{n}$ , where $\stackrel{{\scriptstyle p}}{{\to}}$ denotes convergence in probability. Further, we make the following assumption on $Q^{*}({\boldsymbol{w}}_{n})$ , which is essentially an identifiability condition of $\hat{{\boldsymbol{w}}}_{n}^{*}$ .

Assumption 2

(i) $Q^{*}({\boldsymbol{w}}_{n})$ is continuous in ${\boldsymbol{w}}_{n}$ and is uniquely maximized at some point ${\boldsymbol{w}}_{n}^{\diamond}$ ; (ii) for any $\epsilon>0$ , the value $sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}\backslash B(\epsilon)}Q^{*}({\boldsymbol{w}}_{n})$ exists, where $B(\epsilon)=\{{\boldsymbol{w}}_{n}:\|{\boldsymbol{w}}_{n}-{\boldsymbol{w}}_{n}^{\diamond}\|<\epsilon\}$ , and $\delta:=Q^{*}({\boldsymbol{w}}_{n}^{\diamond})-\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}\backslash B(\epsilon)}Q^{*}({\boldsymbol{w}}_{n})>0$ .

Assumption 2 restricts the shape of $Q^{*}({\boldsymbol{w}}_{n})$ around the global maximizer, which cannot be discontinuous or too flat. Given nonidentifiability of the neural network model, see e.g. Sun et al. (2022), we have implicitly assumed that each ${\boldsymbol{w}}_{n}$ is unique up to loss-invariant transformations, e.g., reordering the hidden neurons within the same hidden layer or simultaneously altering the signs of certain weights and biases. The same assumption has often been used in theoretical studies of neural networks, see e.g. Liang et al. (2018b).

On the other hand, by Theorem 1 of Liang et al. (2018a), we have

\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}\left|\widehat{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})-\widetilde{\mathcal{G}}({\boldsymbol{w}}_{n}|\tilde{{\boldsymbol{w}}}_{n}^{*})\right|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$},

(S4)

under some regularity conditions. Under Assumptions 1-2, Liang et al. (2024) proved the following lemma:

Lemma S1

(Lemma 4.1; Liang et al. (2024)) Suppose Assumptions 1-2 hold, and the joint likelihood function $\pi({\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{w}}_{n})$ is continuous in ${\boldsymbol{w}}_{n}$ . If $\hat{{\boldsymbol{w}}}_{n}^{*}$ is unique, then ${\boldsymbol{w}}_{n}^{*}$ that maximizes $\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})$ as well as minimizes $D_{KL}({\boldsymbol{w}}_{n})$ is unique and, subsequently, $\|\hat{{\boldsymbol{w}}}_{n}^{*}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0$ holds as $n\to\infty$ .

For Lemma S1, the uniqueness of $\hat{{\boldsymbol{w}}}_{n}^{*}$ , up to some loss-invariant transformations as discussed previously, can be ensured by its consistency as established in the followed sections of this supplement, see equation (S20). The condition minimizing $D_{KL}({\boldsymbol{w}}_{n})$ is generally implied by $\tilde{U}_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})=0$ provided the consistency of $\bar{{\boldsymbol{\theta}}}_{n}^{*}$ , while the convergence of ${\boldsymbol{w}}_{n}^{*}$ to a maximizer of $\pi({\boldsymbol{w}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n})$ is generally implied by the Monte Carlo nature of Algorithm 1. Therefore, by eq. (17) of the main text, if ${\boldsymbol{w}}_{n}^{(k)}$ converges and $\tilde{U}_{n}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{(k)})$ converges to 0, we would have $\|\hat{{\boldsymbol{w}}}_{n}^{*}-{\boldsymbol{w}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0$ provided that the prior has been appropriately chosen such that $\hat{{\boldsymbol{w}}}_{n}^{*}$ is consistent for $\tilde{{\boldsymbol{w}}}_{n}^{*}$ and $\hat{g}(y_{i},x_{i},z_{i},\hat{{\boldsymbol{w}}}_{n}^{*})$ is consistent for ${\boldsymbol{\theta}}^{*}$ .

In summary, by Lemma S1, if we can choose an appropriate prior $\pi({\boldsymbol{w}})$ such that $\hat{{\boldsymbol{w}}}_{n}^{*}$ is consistent for $\tilde{{\boldsymbol{w}}}_{n}^{*}$ , then ${\boldsymbol{w}}_{n}^{*}$ is also consistent for $\tilde{{\boldsymbol{w}}}_{n}^{*}$ and $\hat{g}(y,{\boldsymbol{x}},z,{\boldsymbol{w}}_{n}^{*})$ is consistent for ${\boldsymbol{\theta}}^{*}$ . Establishing the consistency of $\hat{{\boldsymbol{w}}}_{n}^{*}$ will be the focus of Section $\S$ 1.2 of this supplement. Note that showing the consistency of $\hat{{\boldsymbol{w}}}_{n}^{*}$ is relatively simpler than directly working on ${\boldsymbol{w}}_{n}^{*}$ , as $\hat{{\boldsymbol{w}}}_{n}^{*}$ is based on the complete data.

Then, following from the consistency of $\hat{g}(\cdot)$ , we immediately have

\|\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*})-{\boldsymbol{\theta}}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}.

(S5)

As a slight relaxation for the definition of $G(\cdot)$ , we can write equation (7) of the main text as

{\boldsymbol{\theta}}^{*}=\lim_{n\to\infty}G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n}),

(S6)

where ${\boldsymbol{Z}}_{n}$ is assumed to be known. By combining (S5) and (S6), we have

\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*})-G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\right\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$},

i.e., the EFI estimator $\bar{{\boldsymbol{\theta}}}^{*}:=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i},x_{i},z_{i},{\boldsymbol{w}}_{n}^{*})$ is consistent for the inverse mapping $G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ . Further, by Slutsky’s theorem, the uncertainty of ${\boldsymbol{Z}}_{n}$ can be propagated to ${\boldsymbol{\theta}}$ via the EFI estimator. Therefore, the extended fiducial density (EFD) function of ${\boldsymbol{\theta}}$ can be approximated by

\tilde{\mu}_{n}(d{\boldsymbol{\theta}})=\frac{1}{\mathcal{M}}\sum_{k=1}^{\mathcal{M}}\delta_{\bar{{\boldsymbol{\theta}}}^{*,k}}(d{\boldsymbol{\theta}}),\quad\mbox{as $\mathcal{M}\to\infty$},

(S7)

where $\delta_{a}$ stands for the Dirac measure at a given point $a$ , $\bar{{\boldsymbol{\theta}}}^{*,k}:=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(x_{i},y_{i},z_{i}^{*,k},{\boldsymbol{w}}_{n}^{*})$ , and ${\boldsymbol{Z}}_{n}^{*,k}:=(z_{1}^{*,k},z_{2}^{*,k},\ldots,z_{n}^{*,k})$ for $k=1,2,\ldots,\mathcal{M}$ denotes $\mathcal{M}$ random draws from the distribution $\pi({\boldsymbol{Z}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{*})$ under the limit setting of $\epsilon$ .

$\S$ 1.2 On the Consistency of $\hat{g}(\cdot)$ under the Large Model Scenario

In this section, we first show that $\hat{{\boldsymbol{w}}}_{n}^{*}$ is consistent under the framework of imputation-regularized optimization (IRO) algorithm (Liang et al., 2018a) by assuming that the true values of ${\boldsymbol{Z}}_{n}$ are known and ${\boldsymbol{w}}_{n}$ is subject to a mixture Gaussian prior. Subsequently, we show that $\hat{g}(\cdot)$ is consistent.

$\S$ 1.2.1 An Auxiliary Stochastic Neural Network Model

To show $\hat{{\boldsymbol{w}}}_{n}^{*}$ is consistent, we introduce an auxiliary stochastic neural network (StoNet) model. For each of the hidden and output neurons of the model, we introduce a random noise:

\begin{split}\tilde{u}_{l,i}&=\sum_{j=0}^{d_{l-1}}w_{n,l,i,j}u_{l-1,j}+e_{l,i}:=v_{l,i}+e_{l,i},\\ u_{l,i}&=\Psi_{l}(\tilde{u}_{l,i}),\\ \end{split}

(S8)

where $l\in\{1,2,\ldots,H\}$ indexes the layers of the DNN, and $i\in\{1,2,\ldots,d_{l}\}$ indexes the neurons at layer $l$ of the DNN, the random noise $e_{l,i}\sim N(0,\sigma_{l}^{2})$ , $w_{n,l.i,j}$ denotes the weight on the connection from neuron $i$ of layer $l$ to neuron $j$ of layer $l-1$ , and $\Psi_{l}(\cdot)$ denotes the activation function used for layer $l$ . Note that $\sigma_{l}^{2}$ ’s are all known and pre-specified by the user, and $l=0$ represents the input layer. As a consequence of introducing the random noise, we can treat $\tilde{U}_{l}$ ’s as latent variables, where $\tilde{U}_{l}=(\tilde{u}_{l,1},\tilde{u}_{l,2},\ldots,\tilde{u}_{l,d_{l}})^{\top}$ .

When considering a dataset of size $n$ , we let $\tilde{u}_{l,i,(k)}$ denote the latent variable imputed for neuron $i$ of layer $l$ for observation $k\in\{1,2,\ldots,n\}$ , let $\tilde{U}_{l,(k)}=(\tilde{u}_{l,1,(k)},\tilde{u}_{l,2,(k)},\ldots,\tilde{u}_{l,d_{l},(k)})^{\top}$ , and let $\tilde{{\boldsymbol{U}}}_{l}=\{\tilde{U}_{l,(1)},\tilde{U}_{l,(2)},\ldots,\tilde{U}_{l,(n)}\}$ denote the latent variables at layer $l$ for all $n$ observations. Recall that we have defined ${\boldsymbol{Y}}_{n}=\{y_{1},y_{2},\ldots,y_{n}\}$ , ${\boldsymbol{X}}_{n}=\{{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2},\ldots,{\boldsymbol{x}}_{n}\}$ , and ${\boldsymbol{Z}}_{n}=\{z_{1},z_{2},\ldots,z_{n}\}$ . Given a set of pseudo-complete data $({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ , we define the posterior distribution of ${\boldsymbol{w}}_{n}$ as:

\begin{split}&\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\propto\pi({\boldsymbol{w}}_{n})\prod_{k=1}^{n}p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}})\prod_{l=1}^{H}\prod_{k=1}^{n}\pi_{l}(\tilde{U}_{l,(k)}|\tilde{U}_{l-1,(k)},{\boldsymbol{w}}_{n,l}),\end{split}

(S9)

where ${\boldsymbol{w}}_{n}=\{{\boldsymbol{w}}_{n,1},{\boldsymbol{w}}_{n,2},\ldots,{\boldsymbol{w}}_{n,H}\}$ with ${\boldsymbol{w}}_{n,l}$ being the connection weights for layer $l$ ,

p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}})\propto\exp\{-(d(y_{k},x_{k},z_{k},\bar{{\boldsymbol{\theta}}})+\eta\|\hat{{\boldsymbol{\theta}}}_{k}-\bar{{\boldsymbol{\theta}}}\|^{2})/\epsilon\},

$\tilde{{\boldsymbol{U}}}_{0,(k)}=\{y_{k},{\boldsymbol{x}}_{k},z_{k}\}$ , $\hat{{\boldsymbol{\theta}}}_{k}=\tilde{U}_{H,(k)}$ , and $\pi_{l}(\cdot)$ represents a $d_{l}$ -dimensional multivariate Gaussian distribution.

Remark S1

It is interesting to point out that for the stochastic version of the EFI network, $({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1})$ forms a directed cyclic graph (DCG): ${\boldsymbol{Y}}_{n}\to\tilde{{\boldsymbol{U}}}_{1}\to\cdots\to\tilde{{\boldsymbol{U}}}_{H}\to{\boldsymbol{Y}}_{n}$ . See e.g. Sethuraman et al. (2023) for analysis of DCGs. For our case, ${\boldsymbol{Y}}_{n}$ is known, which serves as intervention variables and greatly simplifies the problem. Specifically, as $\epsilon\downarrow 0$ , $\hat{{\boldsymbol{\theta}}}_{k}$ ’s can be uniquely determined via $p_{\epsilon}(\cdot)$ for a given set of $({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ and, therefore, (S9) is reduced to the posterior distribution for a conventional StoNet with $(y_{k},{\boldsymbol{x}}_{k},z_{k})$ serving as the input and $\hat{{\boldsymbol{\theta}}}_{k}$ serving as the target output.

Assumption 3

(i) The activation function $\Psi_{l}(\cdot)$ used for each hidden neuron is $c^{\prime}$ -Lipschitz continuous for some constant $c^{\prime}$ ; (ii) $d_{l}\log(d_{l})\prec n/\log(n)$ for $l=0,1,2,\ldots,H$ , where $d_{0}$ denotes the dimension of $(Y,X,Z)$ ; (iii) the network’s depth $H$ and widths $d_{l}$ ’s, for $l=0,1,2,\ldots,H$ , are all allowed to increase with the sample size $n$ .

Assumption 3-(i) has covered many commonly used activation functions such as ReLU, tanh, and sigmoid. Assumption 3-(ii) restricts the width of the DNN. To model a dataset of dimension $d_{0}$ , we generally need a model of dimension $dim({\boldsymbol{\theta}})\succeq d_{0}$ . Since we do not aim to impose any sparsity constraints on ${\boldsymbol{\theta}}$ in our main theory, we assume $d_{0}$ also satisfies the constraint $d_{0}\log(d_{0})\prec n$ such that $dim({\boldsymbol{\theta}})\prec n$ is still possible to hold.

Suppose Assumption 1 and Assumption 3 hold. Following Liang et al. (2022b) and Sun and Liang (2022), we can establish the existence of a small value $\tau(d_{1},d_{2},\ldots,d_{H})$ , as a function of $d_{1},d_{2},\ldots,d_{H}$ , such that if $\max\{\sigma_{1},\sigma_{2},\ldots,\sigma_{H}\}\prec\tau(d_{1},d_{2},\ldots,d_{H})$ and $\epsilon\to\infty$ , then

\begin{split}\sup_{{\boldsymbol{w}}_{n}\in\mathcal{W}_{n}}&\frac{1}{n}\Big{|}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})-\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\Big{|}\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}.\end{split}

(S10)

In other words, the DNN model in the EFI network and stochastic DNN model introduced above have asymptotically the same loss function as long as $\sigma_{1},\ldots,\sigma_{H}$ are sufficiently small and $\epsilon$ is sufficiently small.

Suppose that we want to estimate ${\boldsymbol{w}}_{n}$ by maximizing the posterior distribution of the pseudo-complete data, i.e.,

\begin{split}\hat{{\boldsymbol{w}}}_{n}^{u}&=\arg\max_{{\boldsymbol{w}}_{n}}\Big{\{}\log\pi_{\epsilon}({\boldsymbol{w}}_{n}|{\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})\Big{\}}.\end{split}

(S11)

In the next subsection, we establish the consistency of $\hat{{\boldsymbol{w}}}_{n}^{u}$ , as an estimator of $\tilde{{\boldsymbol{w}}}_{n}^{*}$ , under appropriate conditions. Note that by Assumptions 1-2 and (S10), we have

\|\hat{{\boldsymbol{w}}}_{n}^{u}-\hat{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$}.

(S12)

$\S$ 1.2.2 Consistency of the Sparse DNN Model Estimation

This section gives a constructive proof for the consistency of $\hat{{\boldsymbol{w}}}_{n}^{u}$ based on the IRO algorithm (Liang et al., 2018a). To solve the optimization problem in (S11), the IRO algorithm starts with an initial weight setting $\hat{{\boldsymbol{w}}}_{n}^{(0)}$ and then iterates between the imputation and regularized-optimization steps:

•

Imputation: For each block, conditioned on the current parameter estimate $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ , simulate the latent variables $(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)})$ from the predictive distribution

\begin{split}\pi(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t)})&\propto\prod_{k=1}^{n}p_{\epsilon}(y_{k}|{\boldsymbol{x}}_{k},z_{k},\bar{{\boldsymbol{\theta}}}^{(t+1)})\\ &\times\prod_{l=1}^{H}\prod_{k=1}^{n}\pi_{l}(\tilde{U}_{l,(k)}^{(t+1)}|\tilde{U}_{l-1,(k)}^{(t+1)},\hat{{\boldsymbol{w}}}_{n,l}^{(t)}),\\ \end{split}

(S13)

where $t$ indexes iterations, $\bar{{\boldsymbol{\theta}}}^{(t+1)}=\frac{1}{n}\sum_{k=1}^{n}\tilde{U}_{H,(k)}^{(t+1)}$ , and $\hat{{\boldsymbol{w}}}_{n,l}^{(t)}$ denotes the component of $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ corresponding to the weights at the $l$ -th layer.

•

Regularized-optimization: Given the pseudo-complete data $\{\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}$ , ${\boldsymbol{Y}}_{n}$ , ${\boldsymbol{X}}_{n}$ , ${\boldsymbol{Z}}_{n}\}$ , update $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ by maximizing the penalized log-likelihood function:

\begin{split}\hat{{\boldsymbol{w}}}_{n}^{(t+1)}&=\arg\max_{{\boldsymbol{w}}_{n}}\Big{\{}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})+\log\pi({\boldsymbol{w}}_{n})\Big{\}},\end{split}

(S14)

which, by the decomposition (S9), can be reduced to solving for $\hat{{\boldsymbol{w}}}_{n,1}^{(t+1)},\ldots,\hat{{\boldsymbol{w}}}_{n,H}^{(t+1)}$ , separately. The penalty function $\log\pi({\boldsymbol{w}}_{n})$ should be chosen such that $\hat{{\boldsymbol{w}}}_{n}^{(t+1)}$ forms a consistent estimator for the working true parameter

\begin{split}&{\boldsymbol{w}}_{n,*}^{(t+1)}=\arg\max_{{\boldsymbol{w}}_{n}}\mathbb{E}_{\hat{{\boldsymbol{w}}}_{n}^{(t)}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\\ &=\arg\max_{{\boldsymbol{w}}_{n}}\int\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n})\\ &\quad\times\pi(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t)})\pi_{\epsilon}({\boldsymbol{Y}}_{n}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\tilde{{\boldsymbol{w}}}_{n}^{*})d\tilde{{\boldsymbol{U}}}_{H}^{(t+1)}\cdots d\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}d{\boldsymbol{Y}}_{n},\end{split}

(S15)

where the likelihood function of the latent variables $(\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)})$ is evaluated at $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ , and $\tilde{{\boldsymbol{w}}}_{n}^{*}$ corresponds to the true parameters of the underlying sparse DNN model.

To prove the consistency of $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ as $n$ and $t$ approach to infinity, we need Assumptions 4-6 as specified below. For a matrix ${\boldsymbol{\Sigma}}$ , we define the $m$ -sparse minimal eigenvalues as follows:

\phi_{\rm\min}(m|{\boldsymbol{\Sigma}})=\min_{{\boldsymbol{\beta}}:\|{\boldsymbol{\beta}}\|_{0}\leq m}\frac{{\boldsymbol{\beta}}^{\top}{\boldsymbol{\Sigma}}{\boldsymbol{\beta}}}{{\boldsymbol{\beta}}^{\top}{\boldsymbol{\beta}}},

which represents the minimal eigenvalues of any $m\times m$ -dimensional principal submatrix, and $\|{\boldsymbol{\beta}}\|_{0}$ denotes the number of nonzero elements in ${\boldsymbol{\beta}}$ . Let ${\boldsymbol{\Sigma}}_{l,i}^{(t)}\in\mathbb{R}^{d_{l-1}\times d_{l-1}}$ denote the sample covariance matrix of the input variables for the regression formed for neuron $i$ of layer $l$ at iteration $t$ , and let $s_{l,i}^{(t)}$ denote the size of the true regression model as implied by the working true parameter ${\boldsymbol{w}}_{n,*}^{(t)}$ .

Assumption 4

(i) All the variables $\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{(t)}\}$ are uniformly bounded, where ${\boldsymbol{Z}}_{n}^{(t)}$ denotes the imputed values of ${\boldsymbol{Z}}_{n}$ at iteration $t$ ; (ii) there exist some constants $\kappa_{0,1}$ and $s_{1,i}^{(t)}\leq\tilde{s}_{1,i}^{(t)}\leq\min\{d_{0},n\}$ such that $\phi_{\rm\min}(\tilde{s}_{1,i}^{(t)}|{\boldsymbol{\Sigma}}_{1,i}^{(t)})\geq\kappa_{0,1}$ holds uniformly for any neuron $i\in\{1,2,\ldots,d_{1}\}$ and any iteration $t\in\{1,2,\ldots,T\}$ , where ${\boldsymbol{\Sigma}}_{1,i}^{(t)}$ denotes the covariance matrix of $\{{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{Z}}_{n}^{(t)}\}$ .

Assumption 4-(i) restricts the pseudo-complete data $({\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}})$ to be uniformly bounded. To satisfy this condition, we can add a data transformation/normalization layer to the DNN model, ensuring that the transformed input values fall within the bounded set. Specifically, the transformation/normalization layer can form a bijective mapping and contain no tuning parameters. For example, when dealing with standard Gaussian random variables, we can transform them to be uniform over (0,1) via the probability integral transformation $\Phi(\cdot)$ , the cumulative distribution function (CDF) of the standard Gaussian random variable.

Assumption 4-(ii) is natural for the problem. As implied by Assumption 3-(ii), the upper bound $s_{1,i}^{(t)}\leq\tilde{s}_{1,i}^{(t)}\leq\min\{d_{0},n\}$ can always hold. For other layers, the upper bound is also true by Assumption 3-(ii), and the sparse eigenvalue property can be directly established, see Lemma S3 below.

Lemma S2

Consider a random matrix ${\boldsymbol{U}}\in\mathbb{R}^{n\times d}$ with $n\geq d$ . Suppose that the eigenvalues of ${\boldsymbol{U}}^{\top}{\boldsymbol{U}}$ are upper bounded, i.e., $\lambda_{\max}({\boldsymbol{U}}^{\top}{\boldsymbol{U}})\leq n\kappa_{\max}$ for some constant $\kappa_{\max}>0$ . Let $\Psi({\boldsymbol{U}})$ denote an elementwise transformation of ${\boldsymbol{U}}$ . Then

\lambda_{\max}\left((\Psi({\boldsymbol{U}}))^{\top}(\Psi({\boldsymbol{U}}))\right)\leq n\kappa_{\max},

(S16)

for the tanh, sigmoid and ReLU transformations.

Proof:

For ReLU, (S16) follows from Lemma 5 of Dittmer et al. (2018). For tanh and sigmoid, since they are Lipschitz continuous with a Lipschitz constant of 1, Lemma 5 of Dittmer et al. (2018) also applies. $\Box$

Lemma S3

Consider an auxiliary stochastic neural network as defined in (S8) with an activation function tanh, sigmoid, or ReLU. Then for any any layer $l\in\{2,3,\ldots,H\}$ , neuron $i\in\{1,2,\ldots,d_{l}\}$ , and iteration $t\in\{1,2,\ldots,T\}$ , there exists a number $\tilde{s}_{l,i}^{(t)}$ such that $s_{l,i}^{(t)}\leq\tilde{s}_{l,i}^{(t)}\leq\min\{d_{l-1},n\}$ and $\phi_{\rm\min}(\tilde{s}_{l,i}^{(t)}|{\boldsymbol{\Sigma}}_{l,i}^{(t)})\geq\kappa_{0,l}$ hold.

Proof:

We use $\tilde{U}_{l}\in\mathbb{R}^{d_{l}}$ and $V_{l}\in\mathbb{R}^{d_{l}}$ to denote generic vectors corresponding to the $l$ -th layer, where $\tilde{U}_{k}=V_{l}+{\boldsymbol{e}}_{l}$ . Additionally, we use $\tilde{{\boldsymbol{U}}}_{l}\in\mathbb{R}^{n\times d_{l}}$ and ${\boldsymbol{V}}_{l}\in\mathbb{R}^{n\times d_{l}}$ to denote the matrices (for all observations) corresponding to the $l$ -th layer. For the auxiliary stochastic neural network, since $\sigma_{l}^{2}$ ’s have been set to very small values, it follows from (S8) that

\Psi(\tilde{U}_{l})\approx\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l},\quad l=1,2,\ldots,H-1,

where $\circ$ denotes elementwise product, $V_{l}=(v_{l,1},v_{l,2},\ldots,v_{l,d_{l}})^{\top}$ , and ${\boldsymbol{e}}_{l}=(e_{l,1},e_{l,2},\ldots,e_{l,d_{l}})^{\top}$ . Then, for any $i\in\{1,2,\ldots,d_{l+1}\}$ ,

\begin{split}{\boldsymbol{\Sigma}}_{l+1,i}&\approx{\rm Var}(\mathbb{E}(\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l}|V_{l}))+\mathbb{E}({\rm Var}(\Psi(V_{l})+\nabla_{V_{l}}\Psi(V_{l})\circ{\boldsymbol{e}}_{l}|V_{l}))\\ &={\rm Var}(\Psi(V_{l}))+{\rm Diag}\left\{\sigma_{l}^{2}\mathbb{E}[\nabla_{V_{l}}\Psi(V_{l})\circ\nabla_{V_{l}}\Psi(V_{l})]\right\},\\ \end{split}

(S17)

where ${\rm Diag}\{{\boldsymbol{a}}\}$ , ${\boldsymbol{a}}\in\mathbb{R}^{d_{l}}$ , denotes a $d_{l}\times d_{l}$ diagonal matrix with the diagonal elements given by the vector ${\boldsymbol{a}}$ .

By induction, building on Assumption 4-(i) and Lemma S2, we can assume that the maximum eigenvalue of $(\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1})$ is upper bounded by $n\kappa_{\max}$ . By an extension of Ostrowski’s theorem, see Theorem 3.2 of Higham and Cheng (1998), we have

\begin{split}\lambda_{\max}({\boldsymbol{V}}_{l}^{T}{\boldsymbol{V}}_{l})&=\max_{\|{\boldsymbol{u}}\|=1}{\boldsymbol{u}}^{\top}{\boldsymbol{w}}_{n,l}(\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1}){\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{u}}\\ &\leq\lambda_{\max}((\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))^{\top}\Psi(\tilde{{\boldsymbol{U}}}_{l-1}))\max_{\|{\boldsymbol{u}}\|=1}{\boldsymbol{u}}^{\top}{\boldsymbol{w}}_{n,l}{\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{u}}\\ &=n\kappa_{\max}\tau_{\max},\\ \end{split}

where the existence of the upper bound $\lambda_{\max}({\boldsymbol{w}}_{n,l}^{\top}{\boldsymbol{w}}_{n,l})\leq\tau_{\max}$ follows from the boundedness of ${\boldsymbol{w}}_{n}$ as implied by Assumption 1-(i). By choosing ${\boldsymbol{u}}$ as a one-hot vector, it is easy to see that for any $i\in\{1,2,\ldots,d_{l}\}$ ,

\sum_{j=1}^{n}V_{l,i,j}^{2}\leq\lambda_{\max}({\boldsymbol{V}}_{l}^{\top}{\boldsymbol{V}}_{l})\leq n\kappa_{\max}\tau_{\max},

(S18)

where $V_{l,i,j}$ denotes the $(j,i)$ th element of ${\boldsymbol{V}}_{l}$ . This further implies, as $n\to\infty$ ,

\mathbb{E}(V_{l,i,j}^{2})\leq\kappa_{\max}\tau_{\max}.

(S19)

In words, $V_{l,i,j}$ has bounded mean and variance.

By Markov’s inequality, we can bound $V_{l,i,j}$ to a closed interval with a high probability, i.e., $P(|V_{l,i,j}|\leq C)\geq 1-\kappa_{\max}\tau_{\max}/C^{2}$ for some large constant $C$ . Therefore, for any activation function which has nonzero gradients on any closed interval, e.g., tanh and sigmoid, there exists a constant $c>0$ such that

\mathbb{E}[\nabla_{V_{l,i,j}}\Psi(V_{l,i,j})]^{2}\geq c.

which implies $\phi_{\min}(d_{l}|{\boldsymbol{\Sigma}}_{l})\geq c\sigma_{l}^{2}$ .

For the ReLU activation function, if a hidden neuron belongs to the true neuron set at iteration $t$ (as determined by ${\boldsymbol{w}}_{n,*}^{(t)}$ ), then $\Psi(V_{l,i,j}^{(t)})$ cannot be constantly 0 over all $n$ samples. Therefore, it is reasonable to assume that there exists a threshold $q_{\rm min}\in(0,1)$ such that $\mathbb{E}[\nabla_{V_{l,i,j}}\Psi(V_{l,i,j}^{(t)})]^{2}\geq q_{\rm min}$ for any true neuron in all iterations. Under this assumption, we would at least have

\phi_{\rm min}(|{\boldsymbol{s}}_{l}^{(t)}||{\boldsymbol{\Sigma}}_{l})\geq\sigma_{l}^{2}q_{\rm min},\quad l=1,2,\ldots,h;\ \ t=1,2,\ldots,T,

where ${\boldsymbol{s}}_{l}^{(t)}\subset{\boldsymbol{S}}^{(t)}$ denotes the set of true neurons at layer $l$ . $\Box$

To study the property of the coefficient estimator for each regression formed in the auxiliary stochastic neural network, we introduce the following lemma, which is a restatement of Theorem 3.4 of Song and Liang (2023).

Lemma S4

(Theorem 3.4; Song and Liang (2023)) Consider a linear regression

{\boldsymbol{y}}={\boldsymbol{X}}{\boldsymbol{\beta}}+\sigma{\boldsymbol{\epsilon}},

where ${\boldsymbol{y}}\in\mathbb{R}^{n}$ , ${\boldsymbol{X}}\in\mathbb{R}^{n\times p_{n}}$ , ${\boldsymbol{\beta}}\in\mathbb{R}^{p_{n}}$ , $\sigma>0$ , and ${\boldsymbol{\epsilon}}\sim\mathcal{N}(0,I_{p_{n}})$ is Gaussian noise. Suppose that the model satisfies the following conditions:

(A1)

(i) All the covariates are uniformly bounded; (ii) the dimensionality can be high with $p_{n}\geq n$ ; and (iii) there exists some integer $\bar{p}$ (depending on $n$ and $p_{n}$ ) and a fixed constant $\kappa_{0}$ such that $\bar{p}\succ s_{n}$ and $\lambda_{\rm\min}({\boldsymbol{X}}_{\xi}^{\top}{\boldsymbol{X}}_{\xi})\geq n\kappa_{0}$ for any subset model $|\xi|\leq\bar{p}$ , where $s_{n}$ denotes the size of the true model $\xi^{*}$ , and $\lambda_{\rm\min}(\cdot)$ denotes the minimum eigenvalue of a square matrix. Let $\hat{{\boldsymbol{\beta}}}_{\xi}$ denote the MLE of the true model.
(A2)

(i) $s_{n}\log p_{n}\prec n$ ; (ii) $\max\{|\beta_{j}^{*}/\sigma^{*}|:j=1,2,\ldots,p_{n}\}\leq\gamma_{3}E_{n}$ , where $\beta_{j}^{*}$ ’s and $\sigma^{*}$ denote the true parameter values of the regression model, $\gamma_{3}\in(0,1)$ is a fixed constant, and $E_{n}$ is nondecreasing with respect to $n$ .

(A3)

Let each component of ${\boldsymbol{\beta}}$ be subject to the following mixture Gaussian prior distribution

\beta_{j}/\sigma^{*}\sim(1-\rho)\mathcal{N}(0,\tilde{\sigma}_{0}^{2})+\rho\mathcal{N}(0,\tilde{\sigma}_{1}^{2}),\quad j=1,2,\ldots,p_{n},

where $E_{n}/\tilde{\sigma}_{1}^{2}+\log\tilde{\sigma}_{1}\asymp\log p_{n}$ , $\rho=1/p_{n}^{1+u}$ , and $\tilde{\sigma}_{0}\leq a_{n}/\sqrt{2(1+u)\log p_{n}}$ for some constant $u>1$ and sequence $a_{n}\prec\sqrt{1/(ns_{n}\log p_{n})}/p_{n}$ . Additionally, $s_{n}E_{n}\sqrt{s_{n}\log p_{n}/n}$ $\prec\tilde{\sigma}_{1}^{2}$ and $\min_{j\in\xi^{*}}|\beta_{j}^{*}|\geq M_{1}\sqrt{\log p_{n}/n}$ for some sufficiently large $M_{1}>0$ .

Let $\hat{{\boldsymbol{\beta}}}_{n}$ denote the MAP estimator of ${\boldsymbol{\beta}}$ . Then there exists a constant $c$ such that

\mathbb{E}\|\hat{{\boldsymbol{\beta}}}_{n}-{\boldsymbol{\beta}}^{*}\|^{2}=c(\sigma^{*})^{2}\|{\boldsymbol{X}}_{\xi^{*}}^{\top}{\boldsymbol{X}}_{\xi^{*}}\|^{-1}\leq\frac{c(\sigma^{*})^{2}}{n\kappa_{0}},

Specifically, with dominating probability, we have $\hat{{\boldsymbol{\beta}}}_{n,j}=\hat{{\boldsymbol{\beta}}}_{\xi^{*},j^{\prime}}$ if $j\in\xi^{*}$ and 0 otherwise, where $\hat{{\boldsymbol{\beta}}}_{n,j}$ denotes the $j$ th element of $\hat{{\boldsymbol{\beta}}}_{n}$ , and $\hat{{\boldsymbol{\beta}}}_{\xi^{*},j^{\prime}}$ denotes the element of $\hat{{\boldsymbol{\beta}}}_{\xi^{*}}$ that corresponds to ${\boldsymbol{\beta}}_{n,j}$ .

It is easy to see that under Assumption 1, Assumption 3, and Assumption 4, each linear regression formed in the stochastic neural network satisfies conditions (A1) and (A2) of Lemma S4. In particular, we can set $E_{n}$ as the diameter of $\mathcal{W}_{n}$ . To ensure the condition (A3) to be satisfied, we make the following assumption:

Assumption 5

Let each connection weight $w_{n,l,i,j}$ be subject to the following mixture Gaussian priro distribution:

w_{n,l,i,j}/\sigma_{l}\sim(1-\rho)\mathcal{N}(0,\tilde{\sigma}_{0,l}^{2})+\rho\mathcal{N}(0,\tilde{\sigma}_{1,l}^{2}),\ \ l=1,2,\ldots,H,\ \ i=1,2,\ldots,d_{l},\ \ j=1,2,\ldots,d_{l-1},

where we set $\rho$ , $\tilde{\sigma}_{0,l}$ and $\tilde{\sigma}_{1,l}$ such that $\rho=1/d_{l-1}^{1+u}$ , $\tilde{\sigma}_{0,l}\prec 1/(d_{l-1}^{3/2}\log(d_{l-1})\sqrt{2(1+u)n})$ , and $E_{n}/\tilde{\sigma}_{1,l}^{2}+\log\tilde{\sigma}_{1,l}\asymp\log d_{l-1}$ for some constant $u>1$ and any $l=1,2,\ldots,H$ . Additionally, there exists some constant $M_{1}>0$ such that $\min_{1\leq i\leq d_{l},j\in\xi_{i}^{*}}|w_{n,l,i,j}^{*}|\geq M_{1}\sqrt{\log d_{l-1}/n}$ .

It is easy to verify that the condition (A3) holds under Assumption 3-(ii) and Assumption 5. In addition, the $w$ -min condition, i.e., $\min_{1\leq i\leq d_{l},j\in\xi_{i}^{*}}|w_{n,l,i,j}^{*}|\geq M_{1}\sqrt{\log d_{l-1}/n}$ , is rather weak and can be generally satisfied as $n$ becomes large.

Theorem S1

Suppose that a mixture Gaussian penalty is imposed on the weights of the DNN model in the EFI network and Assumptions 1 and 3-5 hold. Furthermore, suppose $\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2}\prec n$ holds, where $\sigma_{0}=O(1)$ represents a constant. Then there exist a constant $c$ such that

E\|\hat{{\boldsymbol{w}}}_{n}^{(t)}-{\boldsymbol{w}}_{n,*}^{(t)}\|^{2}\leq\frac{c}{n}\sum_{l=1}^{H}d_{l}\frac{\sigma_{l}^{2}}{\sigma_{l-1}^{2}}:=r_{n}\prec o(1).

Proof:

By the above analysis, each regression formed in the stochastic neural network satisfies the conditions of Lemma S4. Therefore, the sparse eigenvalue lower bounds established in Lemma S3 hold for the stochastic neural network. Further, by summarizing the $l_{2}$ -errors of coefficient estimation for all $\sum_{l=1}^{H}d_{l}$ linear regressions, we can conclude the proof. $\Box$

It is important to note that the condition $\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2}\prec n$ allows the width of each layer of the neural network to increase with $n$ at a rate as high as $O(n/\log(n))$ . This accommodates the scenarios where $dim({\boldsymbol{\theta}})=O(n^{\zeta})$ for some $\frac{1}{2}\leq\zeta<1$ . In this case, we have $d_{H}=dim({\boldsymbol{\theta}})$ for the DNN model in the EFI network, which enables the uncertainty of ${\boldsymbol{\theta}}$ to be properly quantified as implied by Theorem 3.1 proved below.

Further, let’s consider the mapping $M({\boldsymbol{w}}_{n})$ as defined in (S15), i.e.,

M({\boldsymbol{w}}_{n})=\arg\max_{{\boldsymbol{w}}_{n}^{\prime}}\mathbb{E}_{{\boldsymbol{w}}_{n}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H},\ldots,\tilde{{\boldsymbol{U}}}_{1}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},{\boldsymbol{w}}_{n}^{\prime}).

As argued in Liang et al. (2018a) and Nielsen (2000), it is reasonable to assume that the mapping is contractive. A recursive application of the mapping, i.e., setting $\hat{{\boldsymbol{w}}}_{n}^{(t+1)}={\boldsymbol{w}}_{n,*}^{(t+1)}=M(\hat{{\boldsymbol{w}}}_{n}^{(t)})$ , leads to a monotone increase of the target expectations

\mathbb{E}_{\hat{{\boldsymbol{w}}}_{n}^{(t)}}\log\pi({\boldsymbol{Y}}_{n},\tilde{{\boldsymbol{U}}}_{H}^{(t+1)},\ldots,\tilde{{\boldsymbol{U}}}_{1}^{(t+1)}|{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n},\hat{{\boldsymbol{w}}}_{n}^{(t+1)})

for $t=1,2,\ldots,T$ .

Assumption 6

The mapping $M({\boldsymbol{w}}_{n})$ is differentiable. Let $\lambda_{\rm\max}(M_{{\boldsymbol{w}}_{n}})$ be the largest singular value of $\partial M({\boldsymbol{w}}_{n})/\partial{\boldsymbol{w}}_{n}$ . There exists a number $\lambda^{*}<1$ such that $\lambda_{\rm\max}(M_{{\boldsymbol{w}}_{n}})\leq\lambda^{*}$ for all ${\boldsymbol{w}}_{n}\in\mathcal{W}_{n}$ for sufficiently large $n$ and almost every training dataset $D_{n}$ .

Theorem S2

Suppose that a mixture Gaussian penalty is imposed on the weights of the DNN model in the EFI network; Assumptions 1 and 3-6 hold; $\epsilon$ is sufficiently small; and $\sum_{l=1}^{H}d_{l}\sigma_{l}^{2}/\sigma_{l-1}^{2}$ $\prec n$ holds, where $\max\{\sigma_{1},\sigma_{2},\ldots,\sigma_{H}\}\prec\tau(d_{1},d_{2},\ldots,d_{H})$ as defined in Section $\S$ 1.2.1 and $\sigma_{0}=O(1)$ represents a constant. Then $\|\hat{{\boldsymbol{w}}}_{n}^{(t)}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0$ for sufficiently large $n$ and sufficiently large $t$ and almost every training dataset $D_{n}$ .

Proof:

This lemma directly follows from Theorem 4 of Liang et al. (2018a) that the estimator $\hat{{\boldsymbol{w}}}_{n}^{(t)}$ is consistent when both $n$ and $t$ are sufficiently large. $\Box$

In summary, we have given a constructive proof for the consistency of sparse stochastic neural network based on the sparse learning theory developed for high-dimensional linear regression. Our proof implies that under the conditions of Theorem S2, such a consistent estimator can also be obtained by a direct calculation of $\hat{{\boldsymbol{w}}}_{n}^{u}$ as defined in (S11). Furthermore, it follows from (S12) that

\|\hat{{\boldsymbol{w}}}_{n}^{*}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$.}

(S20)

Proof of Theorem 3.1

Proof:

As a summary of Lemma S1 and (S20), we have

$\|{\boldsymbol{w}}_{n}^{*}-\tilde{{\boldsymbol{w}}}_{n}^{*}\|\stackrel{{\scriptstyle p}}{{\to}}0,\quad\mbox{as $n\to\infty$},$

by setting $\sigma_{1}=\sigma_{2}=\cdots=\sigma_{H}\prec\tau(d_{1},d_{2},\ldots,d_{H})$ . Subsequently, $\hat{g}(y,{\boldsymbol{x}},z,{\boldsymbol{w}}_{n}^{*})$ constitutes a consistent estimator of ${\boldsymbol{\theta}}$ , and so does $G^{*}({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ following the arguments in Section $\S$ 1.1. $\Box$

In summary, through the introduction of an auxiliary stochastic neural network model and the utilization of the convergence theory of the IRO algorithm, we have justified the consistency of the sparse DNN model under mild conditions.

Remark S2

The result presented in Theorem 3.1 is interesting: we can set the width of each layer of the DNN model to be of order $O(n^{\alpha})$ for some $1/2\leq\alpha<1$ and set its depth to be $O(n^{\alpha^{\prime}})$ for some $0<\alpha^{\prime}<1-\alpha$ . Given the approximation ability of the sparse DNN model as studied in Sun et al. (2022), where it was proven that a neural network of size $O(n^{\tilde{\alpha}})$ for some $0<\tilde{\alpha}<1$ has been large enough for approximating many classes of functions, EFI can be used for uncertainty quantification for deep neural networks if their sizes are appropriately chosen. Specifically, for the Double-NN approach, we can set the size of the first neural network (for approximation of the inverse function ${\boldsymbol{\theta}}=G({\boldsymbol{Y}}_{n},{\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n})$ ) to be large under the constraint $\sum_{l=1}^{H}d_{l}\prec n$ , and set the second neural network (for approximation of the function ${\boldsymbol{Y}}_{n}=f({\boldsymbol{X}}_{n},{\boldsymbol{Z}}_{n};{\boldsymbol{\theta}})$ ) to be relatively small with the size $O(n^{\tilde{\alpha}})$ for some $0<\tilde{\alpha}<1$ . By the theory developed in this paper, the uncertainty of the second neural network can still be correctly quantified using EFI.

Appendix $\S$ 2 CQR Method

For a test point ${\boldsymbol{x}}$ belonging to the class $\mathcal{I}_{c}$ , CQR aims to find the intervals $\hat{C}_{1}({\boldsymbol{x}})$ and $\hat{C}_{ITE}({\boldsymbol{x}})$ such that

P(Y(1;{\boldsymbol{x}})\in\hat{C}_{1}({\boldsymbol{x}}))\geq 1-\alpha,\quad P(Y(1;{\boldsymbol{x}})-Y^{obs}(0;{\boldsymbol{x}})\in\hat{C}_{ITE}({\boldsymbol{x}}))\geq 1-\alpha.

Similarly, for a test point ${\boldsymbol{x}}$ belonging to the class $\mathcal{I}_{t}$ , CQR aims to find the intervals $\hat{C}_{0}({\boldsymbol{x}})$ and $\hat{C}_{ITE}({\boldsymbol{x}})$ such that

\begin{split}P(Y(0;{\boldsymbol{x}})\in\hat{C}_{0}({\boldsymbol{x}}))&\geq 1-\alpha,\quad P(Y^{obs}(1;{\boldsymbol{x}})-Y(0;{\boldsymbol{x}})\in\hat{C}_{ITE}({\boldsymbol{x}}))\geq 1-\alpha.\end{split}

To achieve the above goals, CQR initially splits the training dataset $\mathcal{X}_{train}$ to $(\mathcal{X}_{train},\mathcal{X}_{valid})$ . A quantile regression model, such as BART or random forest, is trained on $\mathcal{X}_{train}$ . Denote the output of the quantile regression by $[\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}),\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}})]$ , and calculate the score $s_{i}(t)=max(\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}_{i})-y_{i}(t),y_{i}(t)-\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}}_{i}))$ for each ${\boldsymbol{x}}_{i}\in\mathcal{X}_{valid}$ . Note that $s_{i}$ acts like a residual derived from the validation set. Let $\hat{s}(t,\alpha)=Quantile(\{s_{i}(t)\}_{{\boldsymbol{x}}_{i}\in\mathcal{X}_{valid}};[\frac{(n+1)\alpha]}{n}])$ . Then $\hat{C}_{t}({\boldsymbol{x}}_{j})$ given below is the conformal prediction interval for any ${\boldsymbol{x}}_{j}\in\mathcal{X}_{test}$ :

\begin{split}\hat{C}_{t}({\boldsymbol{x}}_{j})&=[\hat{q}_{\alpha/2}(t,{\boldsymbol{x}}_{j})-\hat{s}(t,1-\alpha),\hat{q}_{1-\alpha/2}(t,{\boldsymbol{x}}_{j})+\hat{s}(t,1-\alpha)]\\ &=[\hat{Y}^{L}(t;{\boldsymbol{x}}_{j}),\hat{Y}^{R}(t;{\boldsymbol{x}}_{j})]\end{split}

For a more refined approach, one can consider incorporating propensity scores as weights as discussed in Tibshirani et al. (2019).

Case (i) and Case (ii) described in Section 5.1 can be addressed by the above approach. However, for Case (iii) there, the conformal prediction needs further steps. First, construct a pair of prediction intervals at level $1-\alpha/2$ ; namely, $[\hat{Y}^{L}(1;{\boldsymbol{x}}),\hat{Y}^{R}(1;{\boldsymbol{x}})]$ for $Y(1)$ and $[\hat{Y}^{L}(0;{\boldsymbol{x}}),\hat{Y}^{R}(0;{\boldsymbol{x}})]$ for $Y(0)$ . Then, construct an interval for ITE as follows:

\hat{C}^{naive}_{ITE}({\boldsymbol{x}})=[\hat{Y}^{L}(1;{\boldsymbol{x}})-\hat{Y}^{R}(0;{\boldsymbol{x}}),\hat{Y}^{R}(1;{\boldsymbol{x}})-\hat{Y}^{L}(0;{\boldsymbol{x}})].

We refer to this approach as the “naive” approach, which usually leads to very wide prediction intervals.

Another option for case (iii) is the so-called “nested” approach by splitting $\mathcal{X}_{train}$ into two folds, denoted by $(\mathcal{X}_{train,1},\mathcal{X}_{train,2})$ . On the first fold, train $\hat{C}(1,{\boldsymbol{x}})$ and $\hat{C}(0,{\boldsymbol{x}})$ by applying conformal inference. On the second fold, for each ${\boldsymbol{x}}_{i}\in\mathcal{X}_{train,2}$ , compute

\hat{C}({\boldsymbol{x}}_{i})=\begin{cases}[Y^{obs}(1;{\boldsymbol{x}}_{i})-\hat{Y}^{R}(0;{\boldsymbol{x}}_{i}),Y^{obs}(1,{\boldsymbol{x}}_{i})-\hat{Y}^{L}(0;{\boldsymbol{x}}_{i})],&\textit{if \ }T=1,\\ [\hat{Y}^{L}(1;{\boldsymbol{x}}_{i})-Y^{obs}(0;{\boldsymbol{x}}_{i}),\hat{Y}^{R}(1;{\boldsymbol{x}}_{i})-Y^{obs}(0;{\boldsymbol{x}}_{i})],&\textit{if \ }T=0,\\ \end{cases}

where $\hat{Y}^{L}$ and $\hat{Y}^{R}$ are estimated based on $\mathcal{X}_{train,1}$ . Note that the conformal inference is also applicable for the data with interval outcomes. Applying the conformal inference method with interval outcomes on $({\boldsymbol{x}}_{i},\hat{C}({\boldsymbol{x}}_{i}))$ for ${\boldsymbol{x}}_{i}\in\mathcal{X}_{train,2}$ , yielding the interval $\hat{C}^{exact}_{ITE}({\boldsymbol{x}})$ . Applying the conformal inference twice results in a sparse utilization of data for training the regression model and, subsequently, wider prediction intervals. The inexact method involves fitting conditional quantiles of $\hat{C}^{L}$ and $\hat{C}^{R}$ , yielding an interval $\tilde{C}_{ITE}^{inexact}({\boldsymbol{x}})$ . The inexact method does not guarantee the coverage rate. Refer to Lei and Candès (2021) for the detail.

Appendix $\S$ 3 Experimental Settings

To enforce a sparse DNN to be learned for the inverse function $g(\cdot)$ , we impose the following mixture Gaussian prior on each element of ${\boldsymbol{w}}_{n}$ :

\pi(w)\sim\rho N(0,\sigma_{1}^{2})+(1-\rho)N(0,\sigma_{0}^{2}),

(S21)

where $w$ represents a generic element of ${\boldsymbol{w}}_{n}$ and, unless stated otherwise, we set $\rho=1e-2$ , $\sigma_{0}=1e-2$ and $\sigma_{1}=1$ . The elements of ${\boldsymbol{w}}_{n}$ are a priori independent.

For EFI, we employ SGHMC in latent variable sampling, i.e., we simulate ${\boldsymbol{Z}}_{n}^{(k+1)}$ in the following formula:

\begin{split}{\boldsymbol{V}}_{n}^{(k+1)}&=(1-\varpi){\boldsymbol{V}}_{n}^{(k)}+\upsilon_{k+1}\widehat{\nabla}_{{\boldsymbol{Z}}_{n}}\log\pi_{\epsilon}({\boldsymbol{Z}}_{n}^{(k)}|{\boldsymbol{X}}_{n},{\boldsymbol{Y}}_{n},{\boldsymbol{w}}_{n}^{(k)})+\sqrt{2\varpi\tau\upsilon_{k+1}}{\boldsymbol{e}}^{(k+1)},\\ {\boldsymbol{Z}}_{n}^{(k+1)}&={\boldsymbol{Z}}_{n}^{(k)}+{\boldsymbol{V}}_{n}^{(k+1)},\end{split}

where $\tau=1$ , $0<\varpi\leq 1$ is the momentum parameter, ${\boldsymbol{e}}^{(k+1)}\sim N(0,I_{d_{{\boldsymbol{z}}}})$ , and $\upsilon_{k+1}$ is the learning rate. It is worth noting that the algorithm is reduced to SGLD if we set $\varpi=1$ .

In the simulations, we set the learning rate sequence $\{\upsilon_{k}:k=1,2,\ldots\}$ and the step size sequence $\{\gamma_{k}:k=1,2,\ldots\}$ in the forms:

\upsilon_{k}=\frac{C_{\upsilon}}{c_{\upsilon}+k^{\alpha}},\quad\gamma_{k}=\frac{C_{\gamma}}{c_{\gamma}+k^{\alpha}},

for some constants $C_{\upsilon}>0$ , $c_{\upsilon}>0$ , $C_{\gamma}>0$ and $c_{\gamma}>0$ , and $\alpha\in(0,1]$ . The values of $C_{\upsilon}$ , $c_{\upsilon}$ , $C_{\gamma}$ , $c_{\gamma}$ and $\alpha$ used in different experiments are given below.

$\S$ 3.1 ATE

From the point of view of equation solving, the model (22) in the main text can also be written as

y_{i}=\tau^{\prime}T_{i}^{\prime}+\mu^{\prime}+{\boldsymbol{x}}_{i}{\boldsymbol{\beta}}+\sigma z_{i},\quad i=1,2,\ldots,n,

(S22)

where $T_{i}^{\prime}\in\{-1,1\}$ , $\tau^{\prime}=\tau/2$ , and $\mu^{\prime}=\mu+\tau/2$ . Let ${\boldsymbol{\theta}}=(\tau^{\prime},\mu^{\prime},{\boldsymbol{\beta}}^{\top},\sigma)^{\top}$ . Equation (S22) can be solved under the standard framework of EFI. Unless otherwise noted, all results in this paper are based on solving this type of transformed equations.

We set $\alpha=1/7$ and $\varpi=0.1$ . For $n=250$ , we set $(C_{\upsilon},c_{\upsilon},C_{\gamma},c_{\gamma})=($ 200000, 1000000, 54000, 1000000). For $n=500$ and 1000, we set $(C_{\upsilon},c_{\upsilon},C_{\gamma},c_{\gamma})=(500000,1000000$ , 54000, 1000000).

We set $\mathcal{K}=5000$ as the number of burn-in iterations, and set $M=50,000$ as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of $B=5$ ; that is, we collected $M/B=10,000$ samples in each run.

We set $\eta=500$ and $\epsilon=1/10$ in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations (for finding a reasonably good initial point).

The DNN in the EFI network has two hidden layers, with the widths given by $d_{1}=90$ and $d_{2}=30$ , respectively.

$\S$ 3.2 Linear control response with Non-linear treatment response

We set $\alpha=1/7$ , $\varpi=0.1$ , and $(C_{\upsilon},c_{\upsilon})=(200000,1000000)$ and $(C_{\gamma},c_{\gamma})=(20,20000)$ for $\omega^{H}_{\tau}$ and $(C_{\gamma},c_{\gamma})=(20000,200000)$ for all other parameters. Refer to Figure S1 for the definition of $\omega_{\tau}^{H}$ .

For initialization, we update the ${\boldsymbol{w}}_{n}$ with randomly sampled ${\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{Z}}_{n})$ for the first 5000 iterations. We set $\mathcal{K}=20,000$ as the number of burn-in iterations, and set $M=50,000$ as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of $B=5$ ; that is, we collected $M/B=10,000$ samples in each run. We set $\eta=10$ and $\epsilon=1/10$ in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations.

$\S$ 3.3 Non-linear control response with Non-linear treatment response

We set $\alpha=1/7$ , $\varpi=0.1$ , and $(C_{\upsilon},c_{\upsilon})=(500000,1000000)$ and $(C_{\gamma},c_{\gamma})=(2.5,1000000)$ for $\omega^{H}_{\tau}$ and $\omega_{c}^{H}$ , and set $(C_{\gamma},c_{\gamma})=(20000,200000)$ for all other parameters. Refer to Figure S1 for the definitions of $\omega_{\tau}^{H}$ and $\omega_{c}^{H}$ .

For initialization, we update the ${\boldsymbol{w}}_{n}$ with randomly sampled ${\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n})$ for the first 5000 iterations. We set $\mathcal{K}=20,000$ as the number of burn-in iterations, and set $M=50,000$ as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of $B=5$ ; that is, we collected $M/B=10,000$ samples in each run. We set $\eta=10$ and $\epsilon=1/10$ in construction of the energy function, and perform gradient clipping by norm with 5000 during the first 100 iterations.

The DNN in the EFI network has two hidden layers, with the widths given by $d_{1}=90$ and $d_{2}=30$ , respectively. The DNN used for modeling the treatment effect has two hidden layers, each hidden layer consisting of 10 hidden neurons. The DNN used for modeling the response function under the control has two hidden layers, each hidden layer consisting of 10 hidden neurons.

$\S$ 3.4 Lalonde

We set $\alpha=1/7$ , $\varpi=0.1$ , and $(C_{\upsilon},c_{\upsilon})=(500000,1000000)$ and $(C_{\gamma},c_{\gamma})=(2.5,1000000)$ for $\omega^{H}_{\tau}$ and $\omega^{H}_{c}$ , and $(C_{\gamma},c_{\gamma})=(1000,1000000)$ for all other parameters. Refer to Figure S1 for the definitions of $\omega_{\tau}^{H}$ and $\omega_{c}^{H}$ .

For initialization, we update the ${\boldsymbol{w}}_{n}$ with randomly sampled ${\boldsymbol{Z}}_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n})$ for the first 10,000 iterations. We set $\mathcal{K}=20,000$ as the number of burn-in iterations and set $M=50,000$ as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of $B=5$ ; that is, we collected $M/B=10,000$ samples in each run. We set $\eta=10$ and $\epsilon=1/10$ in construction of the energy function, and perform gradient clipping by norm with 5000 for first 100 iterations.

The DNN in the EFI network has two hidden layers, with the widths given by $d_{1}=90$ and $d_{2}=30$ , respectively. The DNN used for modeling the treatment effect contains two hidden layers, each hidden layer containing 10 hidden neurons. The DNN used for modeling the response function under the control contains two hidden layers, each hidden layer containing 10 hidden neurons.

$\S$ 3.5 NLSM

We set $\alpha=1/7$ , $\varpi=0.1$ , and $(C_{\upsilon},c_{\upsilon})=(500000,1000000)$ and $(C_{\gamma},c_{\gamma})=(2.5,1000000)$ for $\omega^{H}_{\tau}$ and $\omega^{H}_{c}$ , and $(C_{\gamma},c_{\gamma})=(5000,1000000)$ for all other parameters. Refer to Figure S1 for the definitions of $\omega_{\tau}^{H}$ and $\omega_{c}^{H}$ .

For initialization, we update the ${\boldsymbol{w}}_{n}$ with randomly sampled $Z_{n}^{(t+1)}\sim\pi_{0}^{\otimes n}({\boldsymbol{z}}_{n})$ for the first 10,000 iterations. We set $\mathcal{K}=20,000$ as the number of burn-in iterations and set $M=50,000$ as the number of iterations used for fiducial sample collection. We thinned the Markov chain by a factor of $B=5$ ; that is, we collected $M/B=10,000$ samples in each run. We set $\eta=10$ and $\epsilon=1/10$ in construction of the energy function, and perform gradient clipping by norm with 5000 for first 100 iterations.

The DNN in the EFI network contains two hidden layers, with the widths given by $d_{1}=90$ and $d_{2}=30$ , respectively. The DNN used for modeling the treatment effect contains two hidden layers, each hidden layer containing 10 hidden neurons. The DNN used for modeling the response function under control contains two hidden layers, each hidden layer containing 10 hidden neurons.

References

Bang and Robins (2005) Bang, H. and Robins, J. M. (2005), “Doubly Robust Estimation in Missing Data and Causal Inference Models,” Biometrics, 61, 962–972.
Bolcskei et al. (2019) Bolcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. (2019), “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, 1, 8–45.
Breiman (1998) Breiman, L. (1998), “Arcing classifier (with discussion and a rejoinder by the author),” Annals of Statistics, 26, 801–849.
Breiman (2001) — (2001), “Random forests,” Machine learning, 45, 5–32.
Caron et al. (2022) Caron, A., Baio, G., and Manolopoulou, I. (2022), “Estimating Individual Treatment Effects using Non-Parametric Regression Models: a Review,” Journal of the Royal Statistical Society Series A: Statistics in Society, 185, 1115–1149.
Carvalho et al. (2019) Carvalho, C. M., Feller, A., Murray, J., Woody, S., and Yeager, D. S. (2019), “Assessing Treatment Effect Variation in Observational Studies: Results from a Data Challenge,” Observational Studies.
Cefalu et al. (2021) Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A., and Burgette, L. (2021), “Package ‘twang’: Toolkit for Weighting and Analysis of Nonequivalent Groups,” R Package.
Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. (2014), “Stochastic gradient hamiltonian monte carlo,” in International conference on machine learning, pp. 1683–1691.
Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010), “BART: Bayesian Additive Regression Trees,” The Annals of Applied Statistics, 4, 266–298.
Deng et al. (2019) Deng, W., Zhang, X., Liang, F., and Lin, G. (2019), “An adaptive empirical Bayesian method for sparse deep learning,” Advances in neural information processing systems, 32.
Dittmer et al. (2018) Dittmer, S., King, E. J., and Maass, P. (2018), “Singular Values for ReLU Layers,” IEEE Transactions on Neural Networks and Learning Systems, 31, 3594–3605.
Dorie and Hill (2020) Dorie, V. and Hill, J. L. (2020), “Bartcause: Causal Inference using Bayesian Additive Regression Trees [R package bartCause version 1.0-4],” R Package.
Farrell et al. (2021) Farrell, M., Liang, T., and Misra, S. (2021), “Deep Neural Networks for Estimation and Inference,” Econometrica, 89, 181–213.
Fisher (1935) Fisher, R. A. (1935), “The fiducial argument in statistical inference,” Annals of Eugenics, 6, 391–398.
Foster et al. (2011) Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011), “Subgroup identification from randomized clinical trial data,” Statistics in Medicine, 30.
Fraser (1966) Fraser, D. A. S. (1966), “Structural probability and a generalization,” Biometrika, 53, 1–9.
Fraser (1968) — (1968), The Structure of Inference, New York-London-Sydney: John Wiley & Sons.
Guan and Yang (2019) Guan, Q. and Yang, S. (2019), “A Unified Framework for Causal Inference with Multiple Imputation Using Martingale,” arXiv: Methodology.
Hahn et al. (2020) Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020), “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion),” Bayesian Analysis, 15, 965 – 2020.
Hannig (2009) Hannig, J. (2009), “On generalized fiducial inference,” Statistica Sinica, 19, 491–544.
Hannig et al. (2016) Hannig, J., Iyer, H., Lai, R. C. S., and Lee, T. C. M. (2016), “Generalized Fiducial Inference: A Review and New Results,” Journal of the American Statistical Association, 111, 1346–1361.
Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017), “Deep Learning Scaling is Predictable, Empirically,” ArXiv, abs/1712.00409.
Higham and Cheng (1998) Higham, N. J. and Cheng, S. H. (1998), “Modifying the inertia of matrices arising in optimization,” Linear Algebra and its Applications, 261–279.
Hill (2011) Hill, J. L. (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217–240.
Hornik (1991) Hornik, K. (1991), “Approximation capabilities of multilayer feedforward networks,” Neural Networks, 4, 251–257.
Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. (1989), “Multilayer feedforward networks are universal approximators,” Neural Networks, 2, 359–366.
Imbens (2004) Imbens, G. (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86, 4–29.
Imbens and Rubin (2015) Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, USA: Cambridge University Press.
Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014), “Confidence iReferntervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869–2909.
Jiang (2007) Jiang, W. (2007), “Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities,” The Annals of Statistics, 35, 1487–1511.
Kidger and Lyons (2020) Kidger, P. and Lyons, T. (2020), “Universal Approximation with Deep Narrow Networks,” Proceedings of Machine Learning Research, 125, 1–22.
Kim et al. (2023) Kim, N., Min, C., and Park, S. (2023), “Minimum width for universal approximation using ReLU networks on compact domain,” ArXiv, abs/2309.10402.
Künzel et al. (2019) Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for estimating heterogeneous treatment effects using machine learning,” Proceedings of the National Academy of Sciences of the United States of America, 116, 4156 – 4165.
Lee et al. (2016) Lee, J., Sun, D., Sun, Y., and Taylor, J. (2016), “Exact post-selection inference, with application to the Lasso,” Annals of Statistics, 44, 907–927.
Lei and Candès (2021) Lei, L. and Candès, E. J. (2021), “Conformal Inference of Counterfactuals and Individual Treatment Effects,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 83, 911–938.
Liang et al. (2018a) Liang, F., Jia, B., Xue, J., Li, Q., and Luo, Y. (2018a), “An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond,” Journal of the Royal Statistical Society, Series B, 80, 899–926.
Liang et al. (2024) Liang, F., Kim, S., and Sun, Y. (2024), “Exended Fiducial Inference: Toward an Automated Process of Statistical Inference,” Journal of the Royal Statistical Society, Series B, in press.
Liang et al. (2018b) Liang, F., Li, Q., and Zhou, L. (2018b), “Bayesian Neural Networks for Selection of Drug Sensitive Genes,” Journal of the American Statistical Association, 113, 955–972.
Liang et al. (2022a) Liang, F., Xue, J., and Jia, B. (2022a), “Markov neighborhood regression for high-dimensional inference,” Journal of the American Statistical Association, 117, 1200–1214.
Liang et al. (2022b) Liang, S., Sun, Y., and Liang, F. (2022b), “Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network,” NeurIPS 2022.
Lu et al. (2018) Lu, M., Sadiq, S., Feaster, D. J., and Ishwaran, H. (2018), “Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods,” Journal of Computational and Graphical Statistics, 27, 209 – 219.
Martin (2023) Martin, R. (2023), “Fiducial inference viewed through a possibility-theoretic inferential model lens,” Journal of Machine Learning Research, 215, 299–310.
Murph et al. (2022) Murph, A. C., Hannig, J., and Williams, J. P. (2022), “Generalized Fiducial Inference on Differentiable Manifolds,” arXiv:2209.15473.
Nielsen (2000) Nielsen, S. (2000), “The stochastic EM algorithm: Estimation and asymptotic results,” Bernoulli, 6, 457–489.
Park et al. (2020) Park, S., Yun, C., Lee, J., and Shin, J. (2020), “Minimum Width for Universal Approximation,” ArXiv, abs/2006.08859.
Petersen and Voigtlaender (2018) Petersen, P. and Voigtlaender, F. (2018), “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” Neural Networks, 108, 296–330.
Portnoy (1986) Portnoy, S. (1986), “On the central limit theorem in $\mathbb{R}^{p}$ when $p\to\infty$ ,” Probability Theory and Related Fields, 73, 571–583.
Portnoy (1988) — (1988), “Asymptotic behavior of likelihood methods for exponential families when the number of parameters tend to infinity,” Annals of Statistics, 16, 356–366.
Robbins and Monro (1951) Robbins, H. and Monro, S. (1951), “A Stochastic Approximation Method,” Annals of Mathematical Statistics, 22, 400–407.
Robins et al. (1994) Robins, J. M., Rotnitsky, A., and Zhao, L. P. (1994), “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American Statistical Association, 89, 846–866.
Romano et al. (2019) Romano, Y., Patterson, E., and Candès, E. J. (2019), “Conformalized Quantile Regression,” in Neural Information Processing Systems.
Rosenbaum (1987) Rosenbaum, P. R. (1987), “Model-Based Direct Adjustment,” Journal of the American Statistical Association, 82, 387–394.
Rosenbaum (2002) — (2002), Observational Studies (2nd edition), New York: Springer.
Rubin (1974) Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology, 66, 688–701.
Schapire (1990) Schapire, R. E. (1990), “The Strength of Weak Learnability,” Machine Learning, 5, 197–227.
Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020), “Nonparametric regression using deep neural networks with ReLU activation function,” The Annals of Statistics, 48, 1875–1897.
Sethuraman et al. (2023) Sethuraman, M. G., Lopez, R., Mohan, R. V., Fekri, F., Biancalani, T., and Hutter, J.-C. (2023), “NODAGS-Flow: Nonlinear Cyclic Causal Structure Learning,” in International Conference on Artificial Intelligence and Statistics.
Shafer and Vovk (2008) Shafer, G. and Vovk, V. (2008), “A Tutorial on Conformal Prediction,” J. Mach. Learn. Res., 9, 371–421.
Shalit et al. (2017) Shalit, U., Johansson, F. D., and Sontag, D. A. (2017), “Estimating individual treatment effect: generalization bounds and algorithms,” in Proceedings of the 34th International Conference on Machine Learning, PMLR, pp. 3076–3085.
Song and Liang (2023) Song, Q. and Liang, F. (2023), “Nearly optimal Bayesian Shrinkage for high dimensional regression,” China Science Mathematics, 66, 409–442.
Song et al. (2020) Song, Q., Sun, Y., Ye, M., and Liang, F. (2020), “Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection,” Biometrika, 107, 997–1004.
Sun and Liang (2022) Sun, Y. and Liang, F. (2022), “A kernel-expanded stochastic neural network,” Journal of the Royal Statistical Society Series B, 84, 547–578.
Sun et al. (2022) Sun, Y., Song, Q., and Liang, F. (2022), “Consistent Sparse Deep Learning: Theory and Computation,” Journal of the American Statistical Association, 117, 1981–1995.
Tibshirani et al. (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019), “Conformal Prediction Under Covariate Shift,” in Advances in Neural Information Processing Systems, eds. Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., Curran Associates, Inc., vol. 32.
van de Geer et al. (2014) van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On asymptotically optimal confidence regions and tests for high-dimensional models,” Ann. Statist., 42, 1166–1202.
Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World, Springer.
Wager and Athey (2018) Wager, S. and Athey, S. (2018), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242.
Williams (2023) Williams, J. P. (2023), “Model-free generalized fiducial inference,” .
Yeager et al. (2019) Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C., Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., Paunesku, D., Romero, C., Flint, K., Roberts, A., Trott, J., Iachan, R., Buontempo, J., Yang, S. M., Carvalho, C. M., Hahn, P. R., Gopalan, M., Mhatre, P., Ferguson, R., Duckworth, A. L., and Dweck, C. S. (2019), “A national experiment reveals where a growth mindset improves achievement,” Nature, 573.
Zabell (1992) Zabell, S. L. (1992), “R. A. Fisher and Fiducial Argument,” Statistical Science, 7, 369–387.
Zetterqvist and Sjölander (2015) Zetterqvist, J. and Sjölander, A. (2015), “Doubly Robust Estimation with the R Package drgee,” Epidemiologic Methods, 4, 69–86.
Zhang and Zhang (2014) Zhang, C.-H. and Zhang, S. S. (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 217–242.