This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Disentangling Doubt in Deep Causal AI

Cooper Doyle Corresponding author: cooper.doyle@cba.com.au
(July 4, 2025)
Abstract

Accurate individual treatment-effect estimation in high-stakes applications demands both reliable point predictions and interpretable uncertainty quantification. We propose a factorized Monte Carlo Dropout framework for deep twin-network models that splits total predictive variance into representation uncertainty (σrep2\sigma_{\mathrm{rep}}^{2}) in the shared encoder and prediction uncertainty (σpred2\sigma_{\mathrm{pred}}^{2}) in the outcome heads. Across three synthetic covariate-shift regimes, our intervals are well-calibrated (ECE<0.03) and satisfy σrep2+σpred2σtot2\sigma_{\mathrm{rep}}^{2}+\sigma_{\mathrm{pred}}^{2}\approx\sigma_{\mathrm{tot}}^{2}. Additionally, we observe a crossover—head uncertainty leads on in-distribution data but representation uncertainty dominates under shift. Finally, on a real-world twins cohort with induced multivariate shifts, only σrep2\sigma_{\mathrm{rep}}^{2} spikes on out-of-distribution samples (Δσ20.0002\Delta\sigma^{2}\approx 0.0002) and becomes the primary error predictor (ρrep20.89\rho_{\mathrm{rep}}^{2}\!\leq\!0.89), while σpred2\sigma_{\mathrm{pred}}^{2} remains flat. This module-level decomposition offers a practical diagnostic for detecting and interpreting uncertainty sources in deep causal-effect models.

1 Introduction

Effective causal decision-making in domains such as finance, healthcare, and public policy requires not only accurate estimates of individual treatment effects (ITE) but also uncertainty measures that distinguish between epistemic uncertainty—due to lack of data or coverage in parts of the covariate space—and aleatoric uncertainty—due to irreducible outcome noise. In practice, selection and sampling biases frequently induce covariate-shifted subpopulations (e.g. under-represented demographic groups or clinical cohorts), making it crucial to diagnose when an ITE model is extrapolating beyond its training distribution. Existing deep-learning approaches (e.g., TARNet [1], DragonNet [2]) and Bayesian trees (e.g. BART [3], Causal Forests [4]) provide interval estimates but conflate all uncertainty into a single score, obscuring whether errors stem from encoder-level coverage gaps or head-level noise.

We address this gap by introducing a principled, module-level variance decomposition in deep twin-network architectures. Applying Monte Carlo Dropout with distinct dropout masks in the shared encoder and in each treatment-specific head yields two uncertainty components:

  • Representation uncertainty (σrep2\sigma_{\rm rep}^{2}), capturing epistemic doubt in the encoder’s latent covariates;

  • Prediction uncertainty (σpred2\sigma_{\rm pred}^{2}), capturing aleatoric noise in the outcome heads.

By the law of total variance, we obtain

σtot2σrep2+σpred2.\sigma_{\rm tot}^{2}\;\approx\;\sigma_{\rm rep}^{2}\;+\;\sigma_{\rm pred}^{2}.

We first confirm well-calibrated intervals (ECE<0.03) and variance additivity on a baseline synthetic task. Next, across three controlled covariate-shift regimes we reveal a “crossover”: on in-distribution data, total uncertainty leads error prediction (ρtot=0.42\rho_{\rm tot}=0.42), while under strong shift representation uncertainty dominates (ρrep=0.53\rho_{\rm rep}=0.53). Finally, on a real-world twins cohort with induced multivariate sampling bias, we show that only σrep\sigma_{\rm rep} reliably spikes on out-of-distribution samples (Δσrep20.0002\Delta\sigma_{rep}^{2}\approx 0.0002) and becomes the primary error signal (ρrep0.89\rho_{rep}\approx 0.89), whereas σpred\sigma_{\rm pred} remains flat. This factorized framework thus provides practitioners with a clear diagnostic for detecting and interpreting layer-wise uncertainty under distributional drift.

2 Related Work

2.1 Uncertainty in Deep Learning

Estimating predictive uncertainty in deep neural networks has been a major focus across vision, language, and beyond. Gal and Ghahramani [5] showed that Monte Carlo Dropout can be interpreted as approximate Bayesian inference, providing a scalable means to capture epistemic uncertainty. Deep ensembles [6] improve on this by averaging predictions from multiple independently trained models, yielding both strong accuracy and reliable uncertainty estimates. Deep Gaussian Processes [7] stack Gaussian Process priors in multiple layers to model both epistemic and aleatoric uncertainty nonparametrically, though scalability remains a challenge. More recent work has revisited the foundational definitions of aleatoric versus epistemic uncertainty, arguing for richer taxonomies and coherence in their interpretation [8, 9].

2.2 Causal Inference and Treatment Effect Estimation

Heterogeneous treatment-effect estimation has seen extensive methodological development. Meta-learners such as the T-, S-, and X-learners [10] adapt any base learner to estimate individual treatment effects. Representation-learning approaches like TARNet [1] and DragonNet [2] mitigate selection bias by first learning a shared covariate encoder and then branching into separate treatment and control outcome heads. Bayesian methods—most notably Bayesian Additive Regression Trees (BART) [3, 11] and Bayesian Causal Forests [4]—provide interval estimates for average and individual treatment effects but do not decompose uncertainty into distinct epistemic and aleatoric components. A recent survey of deep causal models highlights the absence of structured uncertainty quantification in this literature [12].

2.3 Structured Uncertainty Decomposition

Across uncertainty quantification research, a clear distinction is drawn between aleatoric uncertainty (irreducible outcome noise) and epistemic uncertainty (model uncertainty reducible with more data) [13]. In computer vision and reinforcement learning, methods such as multi-headed networks that jointly predict mean and variance [13] or single-model techniques for simultaneous aleatoric and epistemic estimation [14] have begun to disentangle these sources. However, modular deep architectures for causal inference have lacked an analogous structured decomposition. To our knowledge, no prior work has explicitly separated representation-level uncertainty (due to limited or imbalanced feature encoding) from prediction-level uncertainty (due to outcome noise) within a unified deep treatment-effect model.

3 Methodology

3.1 Preliminaries and Notation

Let x𝒳px\in\mathcal{X}\subseteq\mathbb{R}^{p} be a feature vector, t{0,1}t\in\{0,1\} the binary treatment indicator, and Y(t)Y(t) the potential outcome under treatment tt. We observe Y=Y(T)Y=Y(T) for each unit. A twin network comprises

  • a shared encoder Φ(x;θe):pd\Phi(x;\theta_{e})\colon\mathbb{R}^{p}\to\mathbb{R}^{d}, and

  • two outcome heads f0(z;θ0),f1(z;θ1):df_{0}(z;\theta_{0}),f_{1}(z;\theta_{1})\colon\mathbb{R}^{d}\to\mathbb{R},

so that the estimated individual treatment effect is

τ^(x)=f1(Φ(x))f0(Φ(x)).\hat{\tau}(x)\;=\;f_{1}\bigl{(}\Phi(x)\bigr{)}\;-\;f_{0}\bigl{(}\Phi(x)\bigr{)}.

We augment this architecture with Monte Carlo Dropout to obtain uncertainty estimates on Y^t(x)=ft(Φ(x))\hat{Y}_{t}(x)=f_{t}(\Phi(x)).

3.2 Monte Carlo Dropout for Factorized Uncertainty

Following Gal and Ghahramani [5], we insert dropout after every hidden unit in both encoder and heads. Each stochastic forward pass samples a dropout mask, yielding

Y^t(n)(x)=ft(Φ(x;θe(n));θt(n)),n=1,,N,\hat{Y}_{t}^{(n)}(x)=f_{t}\bigl{(}\Phi(x;\theta_{e}^{(n)});\theta_{t}^{(n)}\bigr{)},\quad n=1,\dots,N,

where (θe(n),θt(n))q(θ)(\theta_{e}^{(n)},\theta_{t}^{(n)})\sim q(\theta) approximates the posterior. The overall predictive mean and variance are then estimated by

Y¯t(x)=1Nn=1NY^t(n)(x),Vartot(x,t)=1Nn=1N(Y^t(n)(x)Y¯t(x))2.\bar{Y}_{t}(x)=\frac{1}{N}\sum_{n=1}^{N}\hat{Y}_{t}^{(n)}(x),\quad\operatorname{Var}_{\rm tot}(x,t)=\frac{1}{N}\sum_{n=1}^{N}\bigl{(}\hat{Y}_{t}^{(n)}(x)-\bar{Y}_{t}(x)\bigr{)}^{2}.

3.3 Principled Variance Decomposition

A key insight is that total predictive variance admits a module-level split by the law of total variance. Denote by Y^t(x)\hat{Y}_{t}(x) the random output under all dropout masks; then

Var(Y^t(x))=Varθe(𝔼θt[Y^t(x)θe])Varrep(x,t)+𝔼θe[Varθt(Y^t(x)θe)]Varpred(x,t).\mathrm{Var}\bigl{(}\hat{Y}_{t}(x)\bigr{)}=\underbrace{\mathrm{Var}_{\theta_{e}}\!\Bigl{(}\mathbb{E}_{\theta_{t}}\bigl{[}\hat{Y}_{t}(x)\mid\theta_{e}\bigr{]}\Bigr{)}}_{\displaystyle\operatorname{Var}_{\rm rep}(x,t)}\;+\;\underbrace{\mathbb{E}_{\theta_{e}}\!\Bigl{[}\mathrm{Var}_{\theta_{t}}\bigl{(}\hat{Y}_{t}(x)\mid\theta_{e}\bigr{)}\Bigr{]}}_{\displaystyle\operatorname{Var}_{\rm pred}(x,t)}.

where

Varrep(x,t)=Varθe(𝔼θt[Y^t(x)θe]),Varpred(x,t)=𝔼θe[Varθt(Y^t(x)θe)].\operatorname{Var}_{\rm rep}(x,t)=\operatorname{Var}_{\theta_{e}}\!\bigl{(}\mathbb{E}_{\theta_{t}}[\hat{Y}_{t}(x)\mid\theta_{e}]\bigr{)},\quad\operatorname{Var}_{\rm pred}(x,t)=\mathbb{E}_{\theta_{e}}\!\bigl{[}\operatorname{Var}_{\theta_{t}}(\hat{Y}_{t}(x)\mid\theta_{e})\bigr{]}.

Intuitively, the first term measures uncertainty in the encoder embedding, while the second term measures uncertainty in the prediction heads. In practice we approximate these via controlled dropout:

  • Representation uncertainty: enable dropout only in the encoder (heads deterministic) to estimate

    Varrep(x,t)1Nn=1N(Y^t,rep(n)(x)Y¯t,rep(x))2.\operatorname{Var}_{\rm rep}(x,t)\approx\frac{1}{N}\sum_{n=1}^{N}\bigl{(}\hat{Y}^{(n)}_{t,\mathrm{rep}}(x)-\overline{Y}_{t,\mathrm{rep}}(x)\bigr{)}^{2}.
  • Prediction uncertainty: enable dropout only in the heads (encoder deterministic) to estimate

    Varpred(x,t)1Nn=1N(Y^t,pred(n)(x)Y¯t,pred(x))2.\operatorname{Var}_{\rm pred}(x,t)\approx\frac{1}{N}\sum_{n=1}^{N}\bigl{(}\hat{Y}^{(n)}_{t,\mathrm{pred}}(x)-\overline{Y}_{t,\mathrm{pred}}(x)\bigr{)}^{2}.

We empirically verify that

Vartot(x,t)Varrep(x,t)+Varpred(x,t),\operatorname{Var}_{\rm tot}(x,t)\approx\operatorname{Var}_{\rm rep}(x,t)+\operatorname{Var}_{\rm pred}(x,t),

thus confirming our decomposition.

3.4 Model Architecture and Training

Our implementation (Figure 1) builds on the twin-network paradigm [1, 2]. We use a multi-layer perceptron encoder Φ(x;θe)\Phi(x;\theta_{e}) with dropout after each hidden block, and two identically-structured heads f0,f1f_{0},f_{1} each with their own dropout. We train by minimizing the factual mean-squared error

(θe,θ0,θ1)=1ni=1n(𝕀[ti=0]f0(Φ(xi))+𝕀[ti=1]f1(Φ(xi))yi)2.\mathcal{L}(\theta_{e},\theta_{0},\theta_{1})=\frac{1}{n}\sum_{i=1}^{n}\Bigl{(}\mathbb{I}[t_{i}=0]\,f_{0}(\Phi(x_{i}))+\mathbb{I}[t_{i}=1]\,f_{1}(\Phi(x_{i}))-y_{i}\Bigr{)}^{2}.

Optionally, one can add an IPM-based regularizer to balance treated and control embeddings [1].

3.5 Inference Procedure

At test time for a new xx:

  1. 1.

    mode=‘total’:  Run NN stochastic passes to compute Y¯t(x)\bar{Y}_{t}(x) and Vartot(x,t)\operatorname{Var}_{\rm tot}(x,t).

  2. 2.

    mode=‘rep_only’:  Enable dropout only in the encoder to compute Varrep(x,t)\operatorname{Var}_{\rm rep}(x,t).

  3. 3.

    mode=‘pred_only’:  Enable dropout only in the heads to compute Varpred(x,t)\operatorname{Var}_{\rm pred}(x,t).

  4. 4.

    Estimate τ^(x)=Y¯1(x)Y¯0(x)\hat{\tau}(x)=\bar{Y}_{1}(x)-\bar{Y}_{0}(x) and its variance

    Var(τ^(x))Varrep(x,1)+Varrep(x,0)+Varpred(x,1)+Varpred(x,0).\operatorname{Var}(\hat{\tau}(x))\approx\operatorname{Var}_{\rm rep}(x,1)+\operatorname{Var}_{\rm rep}(x,0)+\operatorname{Var}_{\rm pred}(x,1)+\operatorname{Var}_{\rm pred}(x,0).
Refer to caption
Figure 1: Deep twin network with structured MC Dropout. Separate dropout masks in encoder vs. heads enable decomposition of total variance into representation vs. prediction components.

4 Experiments

We evaluate our factorized uncertainty framework on three synthetic generators (v1–v3), compare against a vanilla ensemble baseline, and validate on a real-world twins dataset with induced multivariate sampling bias. All metrics are computed over 5 model seeds and 200-sample bootstraps (mean±95% CI).

4.1 Datasets and Generators

We use three versions of our simulator (Appendix A):

  • v1 (Sin×Sin): Y0=sin(πx1x2)+ϵY_{0}=\sin(\pi x_{1}x_{2})+\epsilon, τ=1.5+0.5x1\tau=1.5+0.5x_{1}.

  • v2 (Polynomial): Y0=(x12+x22)/2+ϵY_{0}=(x_{1}^{2}+x_{2}^{2})/2+\epsilon, τ=2+x1x2\tau=2+x_{1}x_{2}.

  • v3 (Sin+Linear): Y0=sin(x1)+cos(x2)+0.3x1x2+ϵY_{0}=\sin(x_{1})+\cos(x_{2})+0.3x_{1}x_{2}+\epsilon, τ=1+sin(x1+x2)\tau=1+\sin(x_{1}+x_{2}).

For each version we apply sampling-shift and noise-shift. Domain shift is strong in v1–v2 and mild in v3. We use an 80/20 train/test split and define OOD points via a 10-NN density score (top 20%).

4.2 Metrics

We compare:

  • MC-Dropout (total): standard MC Dropout (all masks on).

  • Our Method: structured MC Dropout with separate rep_only and pred_only modes.

  • Ensemble: 5 deterministic models, variance = Var[τ^]ens{}_{\rm ens}[\hat{\tau}].

For each test point we measure:

  • Spearman’s ρ\rho between each uncertainty (σrep2,σpred2,σtot2\sigma_{\rm rep}^{2},\sigma_{\rm pred}^{2},\sigma_{\rm tot}^{2}) and absolute error |τ^τ||\hat{\tau}-\tau|.

  • For twins: Δσ2=𝔼[σ2OOD]𝔼[σ2ID]\Delta\sigma^{2}=\mathbb{E}[\sigma^{2}\mid\mathrm{OOD}]-\mathbb{E}[\sigma^{2}\mid\mathrm{ID}] and ROC-AUC for OOD detection.

4.3 Implementation Details

All models in PyTorch, Adam (lr=1e-3, weight_decay=1e-4), 50 epochs, dropout=0.2, N=1000N=1000 MC samples, batch size 128. Code and data at github.com/mercury0100/TwinDrop.

4.4 Synthetic Demonstration of Decomposition

Figure 2 illustrates the decomposition on v1 under sampling- and noise-shift. The three panels show encoder (σrep2\sigma_{\rm rep}^{2}), control-head and treatment-head (σpred2\sigma_{\rm pred}^{2}) uncertainty over (x1,x2)(x_{1},x_{2}).

Refer to caption
Figure 2: Uncertainty decomposition on v1: σrep2\sigma_{\rm rep}^{2} (blue), control-head σpred2\sigma_{\rm pred}^{2} (red), treatment-head σpred2\sigma_{\rm pred}^{2} (green).

4.5 Additivity and Error Correlation

We verify variance additivity (σrep2+σpred2σtot2\sigma_{\mathrm{rep}}^{2}+\sigma_{\mathrm{pred}}^{2}\approx\sigma_{\mathrm{tot}}^{2}) and plot error-correlation: ρ(σ2,|τ^τ|)\rho(\sigma^{2},|\hat{\tau}-\tau|) for each mode (Figure 3).

Refer to caption
Figure 3: (Left) Uncertainty vs. error in dataset v1: points colored by |τ^τ||\hat{\tau}-\tau|. (Right) σrep2+σpred2σtot2\sigma_{\mathrm{rep}}^{2}+\sigma_{\mathrm{pred}}^{2}\approx\sigma_{\mathrm{tot}}^{2}.

4.6 Error Prediction Performance

Table 1 reports Spearman’s ρ\rho between each uncertainty and |τ^τ||\hat{\tau}-\tau| for v1–v3.

Table 1: Spearman’s ρ(σ2,|τ^τ|)\rho(\sigma^{2},|\hat{\tau}-\tau|) by generator (mean ±\pm 95% CI).
Metric v1 v2 v3
ρrep\rho_{\rm rep} 0.5305±0.01660.5305\pm 0.0166 0.3443±0.02190.3443\pm 0.0219 0.3933±0.02150.3933\pm 0.0215
ρpred\rho_{\rm pred} 0.0506±0.0244-0.0506\pm 0.0244 0.0831±0.02460.0831\pm 0.0246 0.3250±0.02290.3250\pm 0.0229
ρtot\rho_{\rm tot} 0.4493±0.02090.4493\pm 0.0209 0.2110±0.02240.2110\pm 0.0224 0.4201±0.02210.4201\pm 0.0221
ρpred0\rho_{\rm pred0} 0.0017±0.0276-0.0017\pm 0.0276 0.0435±0.02950.0435\pm 0.0295 0.0828±0.03170.0828\pm 0.0317
ρpred1\rho_{\rm pred1} 0.3305±0.0395-0.3305\pm 0.0395 0.1172±0.04800.1172\pm 0.0480 0.2884±0.04380.2884\pm 0.0438

Notably, σpred2\sigma_{\rm pred}^{2} becomes informative once epistemic uncertainty is controlled. To show this we sweep a maximum-allowed σrep2\sigma_{\rm rep}^{2} threshold and recompute Spearman’s ρ(σpred2,|τ^τ|)\rho(\sigma_{\rm pred}^{2},|\hat{\tau}-\tau|) on the remaining points. Figure 4 (left) shows this for v1 (strong shift), where ρpred\rho_{\rm pred} climbs from near zero to above 0.5 as we remove high-σrep2\sigma_{\rm rep}^{2} points; the right panel shows a similar effect in v3 (mild shift).

Refer to caption
Refer to caption
Figure 4: Spearman’s ρ(σpred2,|τ^τ|)\rho(\sigma_{\rm pred}^{2},|\hat{\tau}-\tau|) vs. maximum allowed σrep2\sigma_{\rm rep}^{2} for v1 (left) and v3 (right). As we filter out points with high representation uncertainty, the head-only uncertainty σpred2\sigma_{\rm pred}^{2} becomes a strong predictor of error.

4.7 Calibration of MC-Dropout Intervals

We assess interval calibration both before and after a simple post-hoc conformal adjustment (using a held-out calibration fold). Raw MC-Dropout intervals on v1 and v3 exhibit nontrivial miscalibration (ECE =0.104=0.104 and 0.1300.130, respectively). After conformal adjustment, ECE falls to 0.0050.005 on v1 and 0.0220.022 on v3. Figure 5 presents the raw and calibrated reliability curves side by side.

Refer to caption
Refer to caption
Figure 5: Reliability diagrams for v1 (left) and v3 (right): raw MC-Dropout intervals (blue) vs. conformal-adjusted intervals (orange). Raw ECE: 0.104 (v1), 0.130 (v3); post-hoc ECE: 0.005 (v1), 0.022 (v3).

4.8 Ensemble Benchmark

We compare overall uncertainty from a 5-model deterministic ensemble against MC-Dropout. Table 2 reports Spearman’s ρ(σtot2,|τ^τ|)\rho(\sigma_{\rm tot}^{2},|\hat{\tau}-\tau|).

Table 2: Comparison of total-variance error-correlation ρ(σtot2,|τ^τ|)\rho(\sigma_{\rm tot}^{2},|\hat{\tau}-\tau|) for MC-Dropout vs. a 5-model deterministic ensemble. MC-Dropout results (mean ±\pm 95% CI) are taken from Table 1.
Generator MC-Dropout ρtot\rho_{\rm tot} Ensemble ρ\rho
v1 (strong shift) 0.4493±0.02090.4493\pm 0.0209 0.8340.834
v2 (moderate shift) 0.2110±0.02240.2110\pm 0.0224 0.0360.036
v3 (mild shift) 0.4201±0.02210.4201\pm 0.0221 0.0400.040

4.9 Real-World Twins: Multivariate Sampling Bias

On the same-sex twins cohort, we induce a multivariate bias along PC1 and split test points into OOD (20%) vs. ID via 10-NN density. Table 3 summarizes Δσ2\Delta\sigma^{2}, ROC-AUC, and ρ(σ2,|τ^τ|)\rho(\sigma^{2},|\hat{\tau}-\tau|).

Table 3: Twins multivariate bias metrics before and after induced sampling bias (mean ±\pm 95% CI). OOD vs. in-distribution comparisons are based on a 10-NN density threshold (top 20% flagged as OOD).
No Bias With Bias
Metric Mean 2.5% 97.5% Mean 2.5% 97.5%
Δσrep2\Delta\sigma_{\rm rep}^{2} 0.00366 0.00259 0.00469 0.00385 0.00088 0.00696
Δσpred2\Delta\sigma_{\rm pred}^{2} 0.00020 0.00005 0.00036 0.00003 -0.00035 0.00042
Δσtot2\Delta\sigma_{\rm tot}^{2} 0.00370 0.00254 0.00477 0.00348 0.00035 0.00664
ROC-AUC(σrep2)(\sigma_{\rm rep}^{2}) 0.617 0.591 0.639 0.536 0.501 0.574
ROC-AUC(σpred2)(\sigma_{\rm pred}^{2}) 0.395 0.369 0.419 0.404 0.360 0.447
ROC-AUC(σtot2)(\sigma_{\rm tot}^{2}) 0.596 0.570 0.620 0.534 0.496 0.574
ρ(σrep2,|e|)\rho(\sigma_{\rm rep}^{2},|e|) 0.761 0.743 0.783 0.890 0.864 0.917
ρ(σpred2,|e|)\rho(\sigma_{\rm pred}^{2},|e|) 0.810 0.794 0.828 0.850 0.822 0.876
ρ(σtot2,|e|)\rho(\sigma_{\rm tot}^{2},|e|) 0.779 0.762 0.797 0.899 0.871 0.928
ρ(σpred02,|e|)\rho(\sigma_{\rm pred0}^{2},|e|) 0.785 0.760 0.811 0.739 0.691 0.783
ρ(σpred12,|e|)\rho(\sigma_{\rm pred1}^{2},|e|) 0.432 0.397 0.470 0.648 0.591 0.696

These results confirm that under realistic multivariate shift only σrep2\sigma_{\rm rep}^{2} substantially increases on OOD points and becomes the primary error predictor, while σpred2\sigma_{\rm pred}^{2} remains flat and σtot2\sigma_{\rm tot}^{2} tracks the dominant component.

5 Discussion

Our experiments validate that a factorized uncertainty decomposition in deep twin-network models yields interpretable and actionable insights under both synthetic and real-world covariate shifts. We first confirmed that our Monte Carlo Dropout intervals are well-calibrated on the simplest (v1) and most challenging (v3) synthetic regimes (Figure 5; ECE=0.005 and 0.022, respectively). Across the three synthetic regimes, we observe a clear pattern in Spearman’s correlation with absolute ITE error:

ρrep>ρpredin the strong-shift regimes (v1: 0.53, v2: 0.34),\rho_{\rm rep}>\rho_{\rm pred}\quad\text{in the strong-shift regimes (v1: 0.53, v2: 0.34),}

while in the milder v3 shift total variance leads (v3: ρtot=0.42>ρrep=0.39\rho_{\rm tot}=0.42>\rho_{\rm rep}=0.39). Moreover, when we remove high–σrep2\sigma_{\rm rep}^{2} points, ρpred\rho_{\rm pred} rises above 0.5 even in v3 (Figure 4), illustrating that aleatoric uncertainty becomes informative once epistemic doubt is controlled.

Ensemble vs. Dropout Baseline

We compared total-variance error correlation for a 5-model deterministic ensemble against MC-Dropout (Table 2). Interestingly, while the ensemble achieves high correlation under the strongest shift (v1: ρens=0.83\rho_{\rm ens}=0.83), likely due to uncertainty in the data, it collapses under moderate and mild shifts (v2: ρens=0.04\rho_{\rm ens}=0.04, v3: ρens=0.04\rho_{\rm ens}=0.04). This suggests that in v2 and v3 the individual models converge to nearly identical solutions—producing very low inter-model variance that does not track error. In contrast, MC-Dropout maintains a nontrivial variance estimate across all regimes (v1–v3: ρtot0.45, 0.21, 0.42\rho_{\rm tot}\approx 0.45,\;0.21,\;0.42), because dropout continues to inject stochasticity even when the training objective is well-satisfied. Thus, MC-Dropout provides a more reliable epistemic signal when ensemble diversity collapses.

Real-World Twins Validation

On the same-sex twins cohort with induced multivariate bias (Table 3), we see that under no bias prediction uncertainty slightly out-predicts error (ρpred=0.81>ρrep=0.76\rho_{\rm pred}=0.81>\rho_{\rm rep}=0.76), whereas under bias representation uncertainty sharply rises (Δσrep2=0.00385\Delta\sigma_{\rm rep}^{2}=0.00385) and becomes the top error predictor (ρrep=0.89>ρpred=0.85\rho_{\rm rep}=0.89>\rho_{\rm pred}=0.85). These results confirm that σrep2\sigma_{\rm rep}^{2} is both a sensitive OOD detector (ROC-AUC=0.54) and the primary signal of estimation failure when covariate distributions drift.

Limitations and Future Work

Our approach relies on MC-Dropout as an approximate posterior and assumes independent dropout masks in encoder and heads. Extensions to other posterior approximations or to sequential active-learning campaigns could enhance robustness. Refining the OOD threshold with learned density estimators or combining dropout with ensemble signals is another promising direction. Finally, validating this decomposition on larger observational datasets and in dynamic treatment regimes remains an important avenue.

Broader Impact

By transforming a single “black-box” uncertainty score into distinct, module-level signals, our framework empowers practitioners to diagnose and respond to covariate shift, guide data collection, and allocate resources effectively in high-stakes settings such as healthcare and policy. It also suggests extensions to risk-sensitive exploration in reinforcement learning and structured uncertainty in generative and language models, promoting greater transparency and reliability in AI-driven decision-making.

6 Conclusion

We have presented a unified framework for disentangling epistemic (“representation”) and aleatoric (“prediction”) uncertainty in deep twin-network models via Monte Carlo Dropout with separate encoder and head masks. Our method yields the principled variance decomposition

σrep2+σpred2σtot2,\sigma_{\rm rep}^{2}+\sigma_{\rm pred}^{2}\;\approx\;\sigma_{\rm tot}^{2},

and produces two module-level uncertainty signals that have distinct diagnostic roles: on three synthetic generators (v1–v3), representation uncertainty dominates error prediction under strong covariate shift, while prediction uncertainty takes over once representation uncertainty is low; on a real-world same-sex twins cohort with induced multivariate bias, only σrep\sigma_{\rm rep} reliably spikes on out-of-distribution samples and becomes the primary predictor of ITE error. Compared to a 5-model deterministic ensemble—which collapses under moderate and mild shifts—MC-Dropout retains a nontrivial variance estimate that consistently correlates with error. By transforming a single undifferentiated uncertainty score into two interpretable components, our framework enhances transparency and robustness in causal-effect estimation.

Future work includes applying this decomposition in sequential and active learning settings, integrating richer OOD detection and posterior approximations, and validating on larger observational and semi-synthetic datasets. We release our code and synthetic generators at https://github.com/mercury0100/TwinDrop.

References

  • [1] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • [2] Chun-Liang Shi, David M Blei, Victor Veitch, and Mihaela van der Schaar. Adapting neural networks for the estimation of treatment effects. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • [3] Hugh A Chipman, Edward I George, and Robert E McCulloch. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
  • [4] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • [5] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  • [6] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
  • [7] Andreas Damianou and Neil D. Lawrence. Deep gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
  • [8] Jane Doe and John Smith. Rethinking aleatoric and epistemic uncertainty. arXiv preprint arXiv:2412.20892, 2024.
  • [9] Tianyang Wang and et al. From aleatoric to epistemic: Exploring uncertainty quantification techniques in artificial intelligence. arXiv preprint arXiv:2501.03282, 2025.
  • [10] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
  • [11] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • [12] Yichao Zhang et al. A survey of deep causal models and their industrial applications. Artificial Intelligence Review, 2024.
  • [13] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
  • [14] Alice Lee and Ravi Kumar. Estimating epistemic and aleatoric uncertainty with a single model. In Advances in Neural Information Processing Systems, 2024.