
A hierarchical decomposition for explaining ML performance discrepancies

Jean Feng    Harvineet Singh    Fan Xia    Adarsh Subbaswamy    Alexej Gossmann
Abstract

Machine learning (ML) algorithms can often differ in performance across domains. Understanding why their performance differs is crucial for determining what types of interventions (e.g., algorithmic or operational) are most effective at closing the performance gaps. Existing methods focus on aggregate decompositions of the total performance gap into the impact of a shift in the distribution of features $p(X)$ versus the impact of a shift in the conditional distribution of the outcome $p(Y|X)$; however, such coarse explanations offer only a few options for how one can close the performance gap. Detailed variable-level decompositions that quantify the importance of each variable to each term in the aggregate decomposition can provide a much deeper understanding and suggest much more targeted interventions. However, existing methods assume knowledge of the full causal graph or make strong parametric assumptions. We introduce a nonparametric hierarchical framework that provides both aggregate and detailed decompositions for explaining why the performance of an ML algorithm differs across domains, without requiring causal knowledge. We derive debiased, computationally efficient estimators and statistical inference procedures for asymptotically valid confidence intervals.

Distribution shift, Feature attribution, Explainability, Model generalizability, Nonparametric inference, Debiased Machine Learning

1 Introduction

The performance of an ML algorithm can differ across domains due to shifts in the data distribution. To understand what contributed to this performance gap, prior works have suggested decomposing the gap into that due to a shift in the marginal distribution of the input features $p(X)$ versus that due to a shift in the conditional distribution of the outcome $p(Y|X)$ (Cai et al., 2023; Zhang et al., 2023; Liu et al., 2023; Qiu et al., 2023; Firpo et al., 2018). Although such aggregate decompositions can be helpful, more detailed explanations that quantify the importance of each variable to each term in this decomposition can provide deeper insight to ML teams trying to understand and close the performance gap. However, there is currently no principled framework that provides both aggregate and detailed decompositions for explaining performance gaps of ML algorithms.

As a motivating example, suppose an ML deployment team has an algorithm that predicts the risk of patients being readmitted to the hospital given data from the Electronic Health Records (EHR), e.g. demographic variables and diagnosis codes. The algorithm was trained for a general patient population. The team intends to deploy it for heart failure (HF) patients and observes a large performance drop (e.g. in accuracy) in this subgroup. An aggregate decomposition of the performance drop into the marginal versus conditional components ($p(X)$ and $p(Y|X)$) provides only a high-level understanding of the underlying cause and limited suggestions on how the model can be fixed. A small conditional term in the aggregate decomposition but a large marginal term is typically addressed by retraining the model on a reweighted version of the original data that matches the target population (Quionero-Candela et al., 2009); otherwise, the standard suggestion is to collect more data from the target population to recalibrate/fine-tune the algorithm (Steyerberg, 2009). A detailed decomposition would help the deployment team conduct a root cause analysis and assess the utility of targeted variable-specific interventions. For instance, if the performance gap is primarily driven by a marginal shift in a few diagnoses, the team can investigate why the rates of these diagnoses differ between general and HF patients. The team may find that certain diagnoses differ due to variations in patient case mix, which may be better addressed through model retraining, whereas others are due to variations in documentation practices, which may be better addressed through operational interventions.

Existing approaches for providing a detailed understanding of why the performance of an ML algorithm differs across domains fall into two general categories. One category, which is also the most common in the applied literature, is to quantify how much the data distribution shifted (Liu et al., 2023; Budhathoki et al., 2021; Kulinski et al., 2020; Rabanser et al., 2019), e.g. one can compare how the distribution of each input variable differs across domains (Cummings et al., 2023). However, given the complex interactions and non-linearities in (black-box) ML algorithms, it is difficult to quantify how exactly such shifts contribute to changes in performance. For instance, a large shift with respect to a given variable does not necessarily translate to a large shift in model performance, as that variable may have low importance in the algorithm or the performance metric may be insensitive to that shift. Thus the other category of approaches—which is also the focus of this manuscript—is to directly quantify how a shift with respect to a variable (subset) influences model performance. Existing proposals only provide detailed variable-level breakdowns for either the marginal or conditional terms in the aggregate decomposition, but no unified framework exists. Moreover, existing methods either require knowledge of the causal graph relating all variables (Zhang et al., 2023) or strong parametric assumptions (Wu et al., 2021; Dodd & Pepe, 2003).

Figure 1: Hierarchical Decomposition of Performance Differences (HDPD): Aggregate decomposes a drop in model performance between two domains into that due to shifts in the covariate versus outcome distribution. Detailed quantifies the proportion of variation in accuracy changes explained by partial covariate/outcome shifts with respect to a variable or variable subset.

This work introduces a unified nonparametric framework for explaining the performance gap of an ML algorithm that first decomposes the gap into terms at the aggregate level and then provides detailed variable importance (VI) values for each aggregate term (Fig 1). Whereas prior works only provide point estimates for the decomposition, we derive debiased, asymptotically linear estimators for the terms in the decomposition, which allow for the construction of confidence intervals (CIs) with asymptotically valid coverage rates. Uncertainty quantification is crucial in this setting, as one often has limited labeled data from the target domain. The estimation and statistical inference procedure is computationally efficient, despite the exponential number of terms in the Shapley value definition. We demonstrate the utility of our framework in real-world examples of prediction models for hospital readmission and insurance coverage.

2 Prior work

Describing distribution shifts. This line of work focuses on detecting and localizing which distributions shift between datasets (Kulinski et al., 2020; Rabanser et al., 2019). Budhathoki et al. (2021) identify the main variables contributing to a distribution shift via a Shapley framework, Kulinski & Inouye (2023) fit interpretable optimal transport maps, and Liu et al. (2023) find the region with the largest shift in the conditional outcome distribution. However, these works do not quantify how these shifts contribute to changes in performance, the metric of practical importance.

Explaining loss differences across subpopulations. Understanding differences in model performance across subpopulations in a single dataset is similar to understanding differences in model performance across datasets, but the focus is typically to find subpopulations with poor performance rather than to explain how distribution shifts contributed to the performance change. Existing approaches include slice discovery methods (Plumb et al., 2023; Jain et al., 2023; d’Eon et al., 2022; Eyuboglu et al., 2022) and structured representations of the subpopulation using e.g. Euclidean balls (Ali et al., 2022).

Attributing performance changes. Prior works have described similar aggregate decompositions of the performance change into covariate and conditional outcome shift components (Cai et al., 2023; Qiu et al., 2023). To provide more granular explanations of performance shifts, existing works quantify the importance of shifts in each variable assuming the true causal graph is known (Zhang et al., 2023); covariate shifts restricted to variable subsets assuming partial shifts follow a particular structure (Wu et al., 2021); and conditional shifts in each variable assuming a parametric model (Dodd & Pepe, 2003). However, the strong assumptions made by these methods make them difficult to apply in practice, and model misspecification can lead to unintuitive interpretations. In addition, there is no unifying framework for decomposing both covariate and outcome shifts, and many methods do not output CIs, which is important when the amount of labeled data from a given domain is limited. A summary of how the proposed framework compares against prior works is shown in Table 1 of the Appendix.

Decomposition methods in econometrics. Explaining performance differences is similar to the long-studied problem of explaining income and health disparities between groups in econometrics. There, researchers regularly use frameworks such as the Oaxaca-Blinder decomposition (Oaxaca, 1973; Blinder, 1973; Fortin et al., 2011), which decomposes disparities into components quantifying the impact of covariate shifts and outcome shifts. These frameworks commonly decompose the components further to describe the importance of each variable (Oaxaca, 1973; Kirby et al., 2006). Existing methods typically rely on strong parametric assumptions (Fairlie, 2005; Yun, 2004; Firpo et al., 2018), which is inappropriate for the complex data settings in ML.

In summary, the distinguishing contribution of this work is that it unifies aggregate and detailed decompositions under a nonparametric framework with uncertainty quantification.

3 A hierarchical explanation framework

Here we introduce how the aggregate and detailed decompositions are defined for explaining the performance gap of a risk prediction algorithm $f:\mathcal{X}\subseteq\mathbb{R}^{m}\to\mathcal{Y}$ for binary outcomes across source and target domains, denoted by $D=0$ and $D=1$, respectively. Let the performance of $f$ be quantified in terms of a loss function $\ell:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, such as the 0-1 misclassification loss $\mathbbm{1}\{f(X)\neq Y\}$. Suppose variables $X$ can be partitioned into $W\in\mathbb{R}^{m_{1}}$ and $Z=X\setminus W\in\mathbb{R}^{m_{2}}$, where $m=m_{1}+m_{2}$. This partitioning allows for a factorization of the data distribution $p_{D}$ in the source domain $D=0$ and target domain $D=1$ into

$$p_{D}(W)\,p_{D}(Z|W)\,p_{D}(Y|W,Z) \qquad (1)$$

(see Fig 2 top left). As such, we refer to $W$ as baseline variables and $Z$ as conditional covariates. For the readmission example, $W$ may refer to demographics and $Z$ to diagnosis codes. The total change in performance of $f$ can thus be written as $\Lambda=\mathbb{E}_{111}\left[\ell(W,Z,Y)\right]-\mathbb{E}_{000}\left[\ell(W,Z,Y)\right]$, where $\mathbb{E}_{D_{\texttt{W}}D_{\texttt{Z}}D_{\texttt{Y}}}$ denotes the expectation over the distribution $p_{D_{\texttt{W}}}(W)\,p_{D_{\texttt{Z}}}(Z|W)\,p_{D_{\texttt{Y}}}(Y|W,Z)$. A summary of notation used is provided in Table 2 of the Appendix.
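As a small illustration of the notation, the total gap $\Lambda$ itself can be estimated by plugging in empirical averages of the loss from labeled evaluation data in each domain; the array and model names below are hypothetical:

```python
import numpy as np

def total_performance_gap(loss_target, loss_source):
    """Plug-in estimate of Lambda = E_111[loss] - E_000[loss], given arrays of
    per-observation losses evaluated on labeled data from each domain."""
    return np.mean(loss_target) - np.mean(loss_source)

# Example with the 0-1 loss of a fitted classifier f (hypothetical names):
# loss_source = (f.predict(X_source) != y_source).astype(float)
# loss_target = (f.predict(X_target) != y_target).astype(float)
# total_gap = total_performance_gap(loss_target, loss_source)
```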

Aggregate. At the aggregate level, the framework quantifies how replacing each factor in (1) from source to target contributes to the performance gap. We refer to such replacements as “aggregate shifts,” as the shift is not restricted to a particular subset of variables. This leads to the aggregate decomposition $\Lambda=\Lambda_{\texttt{W}}+\Lambda_{\texttt{Z}}+\Lambda_{\texttt{Y}}$, where $\Lambda_{\texttt{W}}$ quantifies the impact of a shift in the baseline distribution $p(W)$ (also known as marginal or covariate shifts (Quionero-Candela et al., 2009)), $\Lambda_{\texttt{Z}}$ quantifies the impact of a shift in the conditional covariate distribution $p(Z|W)$, and $\Lambda_{\texttt{Y}}$ quantifies the impact of a shift in the outcome distribution $p(Y|W,Z)$ (also known as concept shift). More concretely,

$$
\begin{aligned}
\Lambda_{\texttt{W}} &= \mathbb{E}_{100}\left[\ell\right]-\mathbb{E}_{000}\left[\ell\right] \\
\Lambda_{\texttt{Z}} &= \mathbb{E}_{1\cdot\cdot}\big[\underbrace{\mathbb{E}_{\cdot 10}\left[\ell\mid W\right]-\mathbb{E}_{\cdot 00}\left[\ell\mid W\right]}_{\Delta_{\cdot 10}(W)}\big] \\
\Lambda_{\texttt{Y}} &= \mathbb{E}_{11\cdot}\big[\underbrace{\mathbb{E}_{\cdot\cdot 1}\left[\ell\mid W,Z\right]-\mathbb{E}_{\cdot\cdot 0}\left[\ell\mid W,Z\right]}_{\Delta_{\cdot\cdot 1}(W,Z)}\big].
\end{aligned}
$$

Prior works have proposed similar aggregate decompositions (Cai et al., 2023; Liu et al., 2023; Firpo et al., 2018).

Detailed. At the detailed level, the framework outputs Shapley-based variable attributions that describe how shifts with respect to each variable contribute to each term in the aggregate decomposition. Based on the VI values, an ML development team can better understand the underlying cause for a performance gap and brainstorm which mixture of operational interventions (e.g. changing data collection) and algorithmic interventions (e.g. retraining the model with respect to the variable(s)) would be most effective at closing the performance gap. For instance, a variable with high importance to the conditional covariate shift term $\Lambda_{\texttt{Z}}$ can be due to differences in missingness rates, prevalence, or selection bias; and a variable with high importance to the conditional outcome shift term $\Lambda_{\texttt{Y}}$ may indicate inherent differences in the conditional distribution (also known as effect modification), differences in measurement error or outcome definitions, or omission of variables predictive of the outcome. Note that the framework does not output a detailed decomposition of the baseline shift term $\Lambda_{\texttt{W}}$, for reasons we discuss later.

We leverage the Shapley attribution framework for its axiomatic properties, which result in VI values with intuitive interpretations (Shapley, 1953; Charnes et al., 1988). Recall that for a real-valued value function $v$ defined for all subsets $s\subseteq\{1,\cdots,m\}$, the attribution to element $j$ is defined as the average gain in value due to the inclusion of $j$ in every possible subset, i.e.

$$
\phi_{j} := \frac{1}{m}\sum_{s\subseteq\{1,\cdots,m\}\setminus j}\binom{m-1}{\lvert s\rvert}^{-1}\left\{v(s\cup j)-v(s)\right\}.
$$

So to define a detailed decomposition, the key question is how to define the value of a partial distribution shift only with respect to a variable subset $s$; henceforth we refer to such shifts as $s$-partial shifts. It turns out that the answer is far from straightforward.
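For intuition, the Shapley attribution above can be computed mechanically from any value function $v$; a minimal sketch of the exact formula is below (feasible only for small $m$; `value_fn` is a placeholder for whichever value function is plugged in):

```python
from itertools import combinations
from math import comb

def shapley_attributions(value_fn, m):
    """Exact Shapley attributions phi_1, ..., phi_m for a set value function:
    phi_j = (1/m) * sum over subsets s not containing j of
            C(m-1, |s|)^(-1) * (v(s U {j}) - v(s))."""
    phi = []
    for j in range(1, m + 1):
        others = [e for e in range(1, m + 1) if e != j]
        total = 0.0
        for size in range(len(others) + 1):
            for s in combinations(others, size):
                s = frozenset(s)
                total += (value_fn(s | {j}) - value_fn(s)) / comb(m - 1, size)
        phi.append(total / m)
    return phi
```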

Figure 2: Decomposition framework for explaining the performance gap from the source domain ($D=0$) to the target domain ($D=1$), visualized through directed acyclic graphs. Aggregate decompositions describe the incremental impact of replacing each aggregate variable's distribution at the source with that in the target, indicated by arrows from $D$. Detailed decompositions quantify how well hypothesized partial distribution shifts with respect to variable subsets $Z_{s}$ explain performance gaps. This work considers partial outcome shifts that fine-tune the risk in the source domain with respect to $Z_{s}$ (as indicated by the additional node $Q=p_{0}(Y=1|W,Z)$) and partial conditional covariate shifts when $Z_{s}\rightarrow Z_{-s}$.

3.1 Value of partial distribution shifts

If the true causal graph is known, it is straightforward to determine how the data distribution shifted with respect to a given variable subset $s$: we would assign the value of subset $s$ as the change in overall performance due to shifts only in $s$ with respect to this graph. However, in the absence of more detailed causal knowledge, one can only hypothesize the form of partial distribution shifts. Some proposals define importance as the change in average performance due to hypothesized partial shifts (see e.g. Wu et al. (2021)), but this can lead to unintuitive behavior where a hypothesized distribution shift inconsistent with the true causal graph is attributed high importance. We illustrate such an example in a simulation in Section 5.

Instead, our proposal defines the importance of variable subset $s$ as the proportion of variation in performance changes across strata explained by its corresponding hypothesized $s$-partial shift, denoted by $p_{s}$. This definition generalizes the traditional $R^{2}$ measure in statistics, which quantifies how well covariates explain variation in the outcome rather than in performance changes. Prior works on variable importance have leveraged similar definitions (Williamson et al., 2021; Hines et al., 2023). For the conditional covariate decomposition, we define the value of $s$ as

$$
v_{\texttt{Z}}(s) := 1-\frac{\mathbb{E}_{1\cdot\cdot}\left[\left(\Delta_{\cdot s0}(W)-\Delta_{\cdot 10}(W)\right)^{2}\right]}{\mathbb{E}_{1\cdot\cdot}\left[\Delta_{\cdot 10}^{2}(W)\right]}, \qquad (2)
$$

where $\Delta_{\cdot D_{\texttt{Z}}0}(W)=\mathbb{E}_{\cdot D_{\texttt{Z}}0}\left[\ell\mid W\right]-\mathbb{E}_{\cdot 00}\left[\ell\mid W\right]$ is the performance gap observed in stratum $W$ when $p_{0}(Z|W)$ is replaced with $p_{D_{\texttt{Z}}}(Z|W)$. (Note that we overload the notation so that $D_{\texttt{Z}}=0$ means the source domain, $D_{\texttt{Z}}=1$ means the target domain, and $D_{\texttt{Z}}=s$ means an $s$-partial shift.) Setting $\phi_{\texttt{Z},0}=v_{\texttt{Z}}(\emptyset)$, we denote the corresponding Shapley values by $\{\phi_{\texttt{Z},j}:j=0,\cdots,m_{2}\}$. Similarly, for the conditional outcome decomposition, we define the importance of $s$ as how well the hypothesized $s$-partial outcome shift explains the variation in performance gaps across strata $(W,Z)$, i.e.

$$
v_{\texttt{Y}}(s) := 1-\frac{\mathbb{E}_{11\cdot}\left[\left(\Delta_{\cdot\cdot s}(W,Z)-\Delta_{\cdot\cdot 1}(W,Z)\right)^{2}\right]}{\mathbb{E}_{11\cdot}\left[\Delta_{\cdot\cdot 1}^{2}(W,Z)\right]},
$$

where $\Delta_{\cdot\cdot s}(W,Z)=\mathbb{E}_{\cdot\cdot s}\left[\ell\mid W,Z\right]-\mathbb{E}_{\cdot\cdot 0}\left[\ell\mid W,Z\right]$. Setting $\phi_{\texttt{Y},0}=v_{\texttt{Y}}(\emptyset)$, we denote the corresponding Shapley values by $\{\phi_{\texttt{Y},j}:j=0,\cdots,m_{2}\}$. Because this framework defines importance in terms of variance explained, it does not provide a detailed decomposition of the baseline shift.
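Given estimates of the strata-level gaps, both value functions reduce to one minus a ratio of mean squared gaps over target-domain observations; a minimal plug-in sketch is below (the $\Delta$ estimates themselves come from nuisance models as in Section 4, and the names are illustrative):

```python
import numpy as np

def value_of_partial_shift(delta_s, delta_full):
    """Plug-in value v(s) = 1 - E[(Delta_s - Delta_full)^2] / E[Delta_full^2],
    where delta_s and delta_full are arrays of estimated strata-level performance
    gaps (e.g. Delta_{.s0}(W_i) and Delta_{.10}(W_i) for the conditional covariate
    term, or Delta_{..s}(W_i, Z_i) and Delta_{..1}(W_i, Z_i) for the outcome term)
    evaluated at observations drawn from the target domain."""
    delta_s, delta_full = np.asarray(delta_s), np.asarray(delta_full)
    return 1.0 - np.mean((delta_s - delta_full) ** 2) / np.mean(delta_full ** 2)
```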

3.2 Candidate partial shifts

The previous section defines a general scoring scheme for quantifying the importance of any hypothesized $s$-partial shift. In this work, we consider partial shifts of the following forms. We leave other forms of partial shifts to future work.

$s$-partial conditional covariate shift: We hypothesize that $Z_{-s}$ is downstream of $Z_{s}$ and define $p_{s}(z|w)$ as $p_{1}(z_{s}|w)\,p_{0}(z_{-s}|z_{s},w)$, as illustrated in Fig 2 (top right). A similar proposal was considered in Wu et al. (2021). For $s=\emptyset$, note that $v_{\texttt{Z}}(\emptyset)=0$.

In the readmission example, such a shift would hypothesize that certain diagnosis codes in $Z_{s}$ are upstream of the others. For instance, one could reasonably hypothesize that diagnosis codes that describe family history of disease are upstream of the other diagnosis codes. Similarly, medical treatments/procedures are likely downstream of a patient's baseline characteristics and comorbidities.

$s$-partial conditional outcome shift: Shifting the conditional distribution of $Y$ only with respect to a variable subset $Z_{s}$ but not $Z_{-s}$ requires care. We cannot simply split $Z$ into $Z_{s}$ and $Z_{-s}$ and define the distribution of $Y$ solely as a function of $Z_{s}$ and $W$, because defining $p_{s}(Y|W,Z):=p(Y|W,Z_{s})$ for some $p$ generally implies that $p_{s}(Y|W,Z)\not\equiv p_{0}(Y|W,Z)$ even when $p_{1}(Y|W,Z)\equiv p_{0}(Y|W,Z)$.

Instead, we define an $s$-partial outcome shift based on models commonly used in model recalibration/revision (Steyerberg, 2009; Platt, 1999), where the modified risk is a function of the risk in the source domain $Q:=q(W,Z):=p_{0}(Y=1|W,Z)$, $W$, and $Z_{s}$. In particular, we define the shift as

$$
\begin{aligned}
p_{s}(y|z,r,w) &:= p_{1}(y|z_{s},r,w) \qquad (3) \\
&= \int p_{1}(y|\tilde{z}_{-s},z_{s},w)\,p_{1}(\tilde{z}_{-s}|z_{s},w,q(w,z_{s},\tilde{z}_{-s})=r)\,d\tilde{z}_{-s}.
\end{aligned}
$$

Because the shifted outcome distribution is defined solely as a function of $Q$, $W$, and $Z_{s}$, it encompasses the special scenario where there is no conditional outcome shift and eliminates any direct effect from $Z_{-s}$ to $Y$. Note that $v_{\texttt{Y}}(\emptyset)$ may be non-zero when the risk in the target domain is a recalibration (i.e. temperature-scaling) of the risk in the source domain (Platt, 1999; Guo et al., 2017). For instance, there may be general environmental factors such that readmission risks in the target domain are uniformly lower.

In the readmission example, a partial outcome distribution shift may arise from, for instance, a hospital protocol in which HF patients with certain diagnoses $z_{s}$ (e.g. kidney failure) are scheduled for more frequent follow-up than patients with those diagnoses alone. Thus readmission risks in the HF population can be viewed as a fine-tuning of readmission risks in the general patient population with respect to $z_{s}$.

4 Estimation and Inference

The key estimands in this hierarchical attribution framework are the aggregate terms $\Lambda_{\texttt{W}}$, $\Lambda_{\texttt{Z}}$, and $\Lambda_{\texttt{Y}}$ and the Shapley-based detailed terms $\phi_{\texttt{Z},j}$ and $\phi_{\texttt{Y},j}$ for $j=0,\cdots,m_{2}$. We now provide nonparametric estimators for these quantities as well as statistical inference procedures for constructing CIs. Nearly all prior works for decomposing performance gaps have relied on plug-in estimators, which substitute in estimates of the conditional means (also known as outcome models) or density ratios (Sugiyama et al., 2007), which we collectively refer to as nuisance parameters. For instance, given ML estimators $\hat{\mu}_{\cdot 10}$ and $\hat{\mu}_{\cdot 00}$ for the conditional means $\mu_{\cdot 10}(w)=\mathbb{E}_{\cdot 10}[\ell|W]$ and $\mu_{\cdot 00}(w)=\mathbb{E}_{\cdot 00}[\ell|W]$, respectively, the empirical mean of $\hat{\mu}_{\cdot 10}-\hat{\mu}_{\cdot 00}$ with respect to the target domain is a plug-in estimator for $\Lambda_{\texttt{Z}}=\mathbb{E}_{1\cdot\cdot}[\mu_{\cdot 10}-\mu_{\cdot 00}]$. However, because ML estimators typically converge to the true nuisance parameters at a rate slower than $n^{-1/2}$, plug-in estimators generally fail to be consistent at a rate of $n^{-1/2}$ and cannot be used to construct CIs (Kennedy, 2022). Nevertheless, using tools from semiparametric inference theory (e.g. one-step correction and Neyman orthogonality), one can derive debiased ML estimators that facilitate statistical inference (Chernozhukov et al., 2018; Tsiatis, 2006). Here we derive debiased ML estimators for terms in the aggregate and detailed decompositions. The detailed conditional outcome decomposition is particularly interesting, as its unique structure is not amenable to standard techniques for debiasing ML estimators.
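To make the plug-in strategy concrete, a rough sketch for $\Lambda_{\texttt{Z}}$ using nested regressions is shown below; the gradient-boosting learners, the array names, and the nested-regression construction of $\hat{\mu}_{\cdot 10}$ are illustrative choices rather than the paper's implementation, and, as noted above, this style of estimator generally does not support valid CIs:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Source data: 2-D arrays W0, Z0 and per-observation losses loss0;
# target data: 2-D arrays W1, Z1 (all hypothetical names).

# Outcome model mu_..0(w, z) = E_000[loss | W, Z], fit on the source domain.
mu_dd0 = GradientBoostingRegressor().fit(np.hstack([W0, Z0]), loss0)

# Outcome model mu_.00(w) = E_000[loss | W], fit on the source domain.
mu_d00 = GradientBoostingRegressor().fit(W0, loss0)

# mu_.10(w) = E_.10[loss | W]: average mu_..0 over Z ~ p_1(Z|W), obtained here by
# regressing mu_..0 evaluated at target observations onto W in the target domain.
mu_d10 = GradientBoostingRegressor().fit(W1, mu_dd0.predict(np.hstack([W1, Z1])))

# Plug-in estimate: empirical mean of mu_.10 - mu_.00 over the target domain.
lambda_Z_plugin = np.mean(mu_d10.predict(W1) - mu_d00.predict(W1))
```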

For ease of exposition, theoretical results are presented for split-sample estimators; nonetheless, the results readily extend to cross-fitted estimators under standard convergence criteria (Chernozhukov et al., 2018; Kennedy, 2022). Let $\{(W_{i}^{(d)},Z_{i}^{(d)},Y_{i}^{(d)}):i=1,\cdots,n_{d}\}$ denote $n_{d}$ independent and identically distributed (IID) observations from the source and target domains $d=0$ and $1$, respectively. Let a fixed fraction of the data be partitioned towards “training” (Tr) and the remainder to “evaluation” (Ev); let $n_{\texttt{Ev}}$ be the number of observations in the evaluation partition. Let $\mathbb{P}_{d}$ denote the expectation with respect to domain $d$ and $\mathbb{P}^{\texttt{Ev}}_{d,n}$ denote the empirical average over observations in partition Ev from domain $d$. All estimators are denoted using hat notation. Proofs, detailed theoretical results, and pseudocode are provided in the Appendix.

4.1 Aggregate decomposition

The aggregate decomposition terms can be formulated as an average treatment effect, a well-studied estimand in causal inference, where domain $D$ corresponds to treatment. As such, one can use augmented inverse probability weighting (AIPW) to define debiased ML estimators of the aggregate decomposition terms (e.g., Kang & Schafer, 2007). We review estimation and inference for these terms below.

Estimation. Using the training data, estimate the outcome models $\mu_{\cdot 00}(w)=\mathbb{E}_{\cdot 00}[\ell|W=w]$ and $\mu_{\cdot\cdot 0}(w,z)=\mathbb{E}_{\cdot\cdot 0}[\ell|W=w,Z=z]$ and the density ratio models $\pi_{100}(W)=p_{1}(W)/p_{0}(W)$ and $\pi_{110}(W,Z)=p_{1}(W,Z)/p_{0}(W,Z)$. The debiased ML estimators for $\Lambda_{\texttt{W}},\Lambda_{\texttt{Z}},\Lambda_{\texttt{Y}}$ are

$$
\begin{aligned}
\hat{\Lambda}_{\texttt{W}} =\; & \mathbb{P}^{\texttt{Ev}}_{0,n}\left(\ell-\hat{\mu}_{\cdot 00}(W)\right)\hat{\pi}_{100}(W)+\mathbb{P}^{\texttt{Ev}}_{1,n}\hat{\mu}_{\cdot 00}(W)-\mathbb{P}^{\texttt{Ev}}_{0,n}\ell \\
\hat{\Lambda}_{\texttt{Z}} =\; & \mathbb{P}^{\texttt{Ev}}_{0,n}\left(\ell-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)\hat{\pi}_{110}(W,Z)+\mathbb{P}^{\texttt{Ev}}_{1,n}\hat{\mu}_{\cdot\cdot 0}(W,Z) \\
& -\mathbb{P}^{\texttt{Ev}}_{0,n}\left(\ell-\hat{\mu}_{\cdot 00}(W)\right)\hat{\pi}_{100}(W)-\mathbb{P}^{\texttt{Ev}}_{1,n}\hat{\mu}_{\cdot 00}(W) \\
\hat{\Lambda}_{\texttt{Y}} =\; & \mathbb{P}^{\texttt{Ev}}_{1,n}\ell-\mathbb{P}^{\texttt{Ev}}_{0,n}\left(\ell-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)\hat{\pi}_{110}(W,Z)-\mathbb{P}^{\texttt{Ev}}_{1,n}\hat{\mu}_{\cdot\cdot 0}(W,Z)
\end{aligned}
$$
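As one concrete instance, a sketch of $\hat{\Lambda}_{\texttt{W}}$ is shown below, assuming the nuisance models were already fit on the training split; estimating the density ratio $\pi_{100}$ from a probabilistic domain classifier is one common option, not necessarily the implementation used in the paper, and all names are illustrative:

```python
import numpy as np

def aipw_lambda_W(mu_d00, domain_clf, W0_ev, loss0_ev, W1_ev):
    """Debiased (AIPW) estimate of Lambda_W on the evaluation split.

    mu_d00: regression of the loss on W, fit in the source domain (training split).
    domain_clf: classifier of domain D given W (training split), used to form
    pi_100(W) = p_1(W) / p_0(W) as the classifier odds times P(D=0) / P(D=1),
    approximated here by the evaluation-split counts n0 / n1.
    """
    n0, n1 = len(W0_ev), len(W1_ev)
    p1_w = domain_clf.predict_proba(W0_ev)[:, 1]
    pi_100 = p1_w / (1.0 - p1_w) * (n0 / n1)
    correction = np.mean((loss0_ev - mu_d00.predict(W0_ev)) * pi_100)
    return correction + np.mean(mu_d00.predict(W1_ev)) - np.mean(loss0_ev)
```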

Inference. Assuming the estimators for the outcome and density ratio models converge at a fast enough rate, the AIPW estimators for the aggregate decomposition terms converge at the desired $o_{p}(n^{-1/2})$ rate.

Theorem 4.1.

Suppose $\pi_{100}$ and $\pi_{110}$ are bounded; the estimators $\hat{\mu}_{\cdot 00}$, $\hat{\mu}_{\cdot\cdot 0}$, $\hat{\pi}_{100}$, and $\hat{\pi}_{110}$ are consistent; and

$$
\begin{aligned}
\mathbb{P}_{0}\left(\hat{\mu}_{\cdot 00}-\mu_{\cdot 00}\right)\left(\hat{\pi}_{100}-\pi_{100}\right) &= o_{p}(n^{-1/2}) \\
\mathbb{P}_{0}\left(\hat{\mu}_{\cdot\cdot 0}-\mu_{\cdot\cdot 0}\right)\left(\hat{\pi}_{110}-\pi_{110}\right) &= o_{p}(n^{-1/2}).
\end{aligned}
$$

Then $\hat{\Lambda}_{\texttt{W}}$, $\hat{\Lambda}_{\texttt{Z}}$, and $\hat{\Lambda}_{\texttt{Y}}$ are asymptotically linear estimators of their respective estimands.

4.2 Detailed decomposition

The Shapley-based detailed decomposition terms $\phi_{\texttt{Z},j}$ and $\phi_{\texttt{Y},j}$ for $j=0,\cdots,m_{2}$ are additive combinations of the value functions that quantify the proportion of variability explained by $s$-partial shifts. Below, we present novel estimators for the values $\{v_{\texttt{Y}}(s):s\subseteq\{1,\cdots,m_{2}\}\}$ of $s$-partial conditional outcome shifts, their theoretical properties, and computationally efficient estimation for their corresponding Shapley values. Due to space constraints, results pertaining to partial conditional covariate shifts are provided in Section B.1 of the Appendix.

4.2.1 Value of $s$-partial outcome shifts

Standard recipes for deriving asymptotically linear, nonparametric efficient estimators rely on pathwise differentiability of the estimand and analyzing its efficient influence function (EIF) (Kennedy, 2022). However, $v_{\texttt{Y}}(s)$ is not pathwise differentiable because it is a function of (3), which conditions on the source risk $q(w,z)$ equalling some value $r$. Taking the pathwise derivative of $v_{\texttt{Y}}(s)$ requires taking a derivative of the indicator function $\mathbbm{1}\{q(w,z)=r\}$, which generally does not exist. Given the difficulties in deriving an asymptotically normal estimator for $v_{\texttt{Y}}(s)$, we propose estimating a close alternative that is pathwise differentiable.

The idea is to replace $q$ in (3) with its binned variant $q_{\texttt{bin}}(w,z)=\frac{1}{B}\lfloor q(w,z)B+\frac{1}{2}\rfloor$ for some $B\in\mathbb{Z}^{+}$, which discretizes outputs from $q$ into $B$ disjoint bins. As long as $B$ is sufficiently high, the binned version of the estimand, denoted $v_{\texttt{Y},\texttt{bin}}(s)$, is a close approximation to $v_{\texttt{Y}}(s)$. (We use $B=20$ in the empirical analyses, which we believe to be sufficient in practice.) The benefit of this binned variant is that the derivative of the indicator function $\mathbbm{1}\{q_{\texttt{bin}}(w,z)=r\}$ is zero almost everywhere, as long as the set of observations with source risk exactly equal to a bin edge has measure zero. More formally, we require the following:

Condition 4.2.

Let $\Xi$ be the set of $(W,Z)$ such that $q(W,Z)$ falls precisely on some bin edge but $q(W,Z)$ is not equal to zero or one. The set $\Xi$ has measure zero.
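For reference, the binning map $q_{\texttt{bin}}$ defined above simply rounds each source risk to the nearest multiple of $1/B$; a one-function sketch:

```python
import numpy as np

def bin_risk(q, B=20):
    """q_bin(w, z) = floor(q(w, z) * B + 1/2) / B, i.e. round risks in [0, 1]
    to the nearest multiple of 1/B (B = 20 is the value used in our analyses)."""
    return np.floor(np.asarray(q) * B + 0.5) / B
```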

Under this condition, $v_{\texttt{Y},\texttt{bin}}(s)$ is pathwise differentiable and, using one-step correction, we derive a debiased ML estimator that has the unique form of a V-statistic (this follows from the integration over “phantom” $\tilde{z}_{-s}$ in (3)). We represent V-statistics using the operator $\mathbb{P}_{1,n}^{\texttt{Ev}}\tilde{\mathbb{P}}_{1,n}^{\texttt{Ev}}$, which takes the average over all pairs of observations $O_{i}$ from the evaluation partition with replacement, i.e. $\frac{1}{n_{\texttt{Ev}}^{2}}\sum_{i=1}^{n_{\texttt{Ev}}}\sum_{j=1}^{n_{\texttt{Ev}}}g(O_{i},O_{j})$ for some function $g$. Calculation of this estimator and its theoretical properties are as follows.

Estimation. Using the training partition, estimate a density ratio model defined as $\pi(W,Z_{s},Z_{-s},Q_{\texttt{bin}})=p_{1}(Z_{-s}|W,Z_{s},q_{\texttt{bin}}(W,Z)=Q_{\texttt{bin}})/p_{1}(Z_{-s})$ and the $s$-shifted outcome model $\mu_{\cdot\cdot s}(W,Z)=\mathbb{E}_{\cdot\cdot s}[\ell|W,Z]$. The estimator for $v_{\texttt{Y},\texttt{bin}}(s)$ is $\hat{v}_{\texttt{Y},\texttt{bin}}(s)=\hat{v}_{\texttt{Y},n}^{\texttt{num}}(s)/\hat{v}_{\texttt{Y},n}^{\texttt{den}}$, where $\hat{v}_{\texttt{Y},n}^{\texttt{num}}(s)$ is

$$
\begin{aligned}
& \mathbb{P}_{1,n}^{\texttt{Ev}}\,\hat{\xi}_{s}(W,Z)^{2} \\
& +2\,\mathbb{P}_{1,n}^{\texttt{Ev}}\,\hat{\xi}_{s}(W,Z)\left(\ell-\hat{\mu}_{\cdot\cdot 1}(W,Z)\right) \\
& -2\,\mathbb{P}_{1,n}^{\texttt{Ev}}\tilde{\mathbb{P}}_{1,n}^{\texttt{Ev}}\,\hat{\xi}_{s}(W,Z_{s},\tilde{Z}_{-s})\,\ell(W,Z_{s},\tilde{Z}_{-s},Y)\,\hat{\pi}(W,Z_{s},\tilde{Z}_{-s},Q_{\texttt{bin}}) \\
& +2\,\mathbb{P}_{1,n}^{\texttt{Ev}}\tilde{\mathbb{P}}_{1,n}^{\texttt{Ev}}\,\hat{\xi}_{s}(W,Z_{s},\tilde{Z}_{-s})\,\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\,\hat{\pi}(W,Z_{s},\tilde{Z}_{-s},Q_{\texttt{bin}})
\end{aligned}
\qquad (4)
$$

where $\hat{\xi}_{s}(W,Z)=\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)$, and $\hat{v}_{\texttt{Y},n}^{\texttt{den}}$ is

$$
\begin{aligned}
& \mathbb{P}_{1,n}^{\texttt{Ev}}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)^{2} \\
& +2\,\mathbb{P}_{1,n}^{\texttt{Ev}}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)\left(\ell-\hat{\mu}_{\cdot\cdot 1}(W,Z)\right) \\
& -2\,\mathbb{P}_{0,n}^{\texttt{Ev}}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)\left(\ell-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)\hat{\pi}_{110}(W,Z).
\end{aligned}
\qquad (5)
$$
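The terms involving $\mathbb{P}_{1,n}^{\texttt{Ev}}\tilde{\mathbb{P}}_{1,n}^{\texttt{Ev}}$ are V-statistics, i.e. double averages over all pairs of evaluation observations with replacement. A generic sketch of this operator, where the kernel `g` is a placeholder that would bundle the fitted nuisances $\hat{\xi}_{s}$, $\hat{\mu}_{\cdot\cdot s}$, $\hat{\pi}$, and the loss:

```python
def v_statistic(g, observations):
    """Compute (1 / n^2) * sum_i sum_j g(O_i, O_j) over all ordered pairs of
    evaluation observations (with replacement); O(n^2) but easily vectorized."""
    n = len(observations)
    return sum(g(o_i, o_j) for o_i in observations for o_j in observations) / (n * n)
```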

Inference. This estimator is asymptotically linear assuming the nuisance estimators converge at a fast enough rate.

Theorem 4.3.

Suppose Condition 4.2 holds. For variable subset $s$, suppose $\pi(W,Z_{s},Z_{-s},Q_{\texttt{bin}})$ and $\pi_{110}$ are bounded; $v_{\texttt{Y}}^{\texttt{den}}(\emptyset)>0$; the estimator $\hat{\pi}$ is consistent; the estimators $\hat{\mu}_{\cdot\cdot 0}$, $\hat{\mu}_{\cdot\cdot 1}$, and $\hat{\mu}_{\cdot\cdot s}$ converge at an $o_{p}(n^{-1/4})$ rate; and

$$
\begin{aligned}
\mathbb{P}_{1}(\hat{q}_{\texttt{bin}}-q_{\texttt{bin}})^{2} &= o_{p}(n^{-1}) \qquad (6) \\
\mathbb{P}_{1}(\mu_{\cdot\cdot s}-\hat{\mu}_{\cdot\cdot s})(\pi-\hat{\pi}) &= o_{p}(n^{-1/2}) \\
\mathbb{P}_{0}(\mu_{\cdot\cdot 0}-\hat{\mu}_{\cdot\cdot 0})(\pi_{110}-\hat{\pi}_{110}) &= o_{p}(n^{-1/2}).
\end{aligned}
$$

Then the estimator $\hat{v}_{\texttt{Y},\texttt{bin}}(s)$ is asymptotically normal, centered at the estimand $v_{\texttt{Y},\texttt{bin}}(s)$.

A convergence rate of $o_{p}(n^{-1/2})$ can be achieved by ML estimators under a wide variety of conditions, and such assumptions are commonly used to construct debiased ML estimators. The additional requirement in (6) that $\hat{q}_{\texttt{bin}}$ converges at an $o_{p}(n^{-1})$ rate is new, but fast or even super-fast convergence rates for binned risks are achievable under suitable margin conditions (Audibert & Tsybakov, 2007).

4.2.2 Shapley values

Calculating the exact Shapley value is computationally intractable as it involves an exponential number of terms. Nevertheless, Williamson & Feng (2020) showed that calculating the exact Shapley value for variable importance measures is not necessary when the importance measures are estimated with uncertainty. One can instead estimate the Shapley values by sampling variable subsets and inflate the corresponding CIs to reflect the additional uncertainty due to subset sampling. As long as the number of subsets scales linearly or super-linearly in $n$, one can show that this additional uncertainty will not be larger than that due to estimation of the subset values themselves. Here we follow the same procedure to estimate and construct CIs for the detailed decomposition Shapley values (see Algorithm 4).
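The sketch below illustrates Monte Carlo Shapley estimation generically via random permutations; the paper's Algorithm 4 instead samples subsets and inflates the CIs as described above, so this is only meant to convey the flavor of the approximation:

```python
import numpy as np

def shapley_by_permutation_sampling(value_fn, m, n_perm=500, seed=None):
    """Monte Carlo Shapley estimates: for each random ordering of the elements
    1..m, credit each element with the marginal gain in value when it is added
    to the elements preceding it; average over n_perm orderings."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(m)
    for _ in range(n_perm):
        perm = rng.permutation(np.arange(1, m + 1))
        prefix, v_prev = frozenset(), value_fn(frozenset())
        for j in perm:
            j = int(j)
            prefix = prefix | {j}
            v_curr = value_fn(prefix)
            phi[j - 1] += v_curr - v_prev
            v_prev = v_curr
    return phi / n_perm
```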

Figure 3: Coverage rates of 90% CIs for aggregate decomposition terms (left) and the value of $s$-partial shifts for the conditional covariate (middle) and outcome shifts (right) across dataset sizes $n$.

5 Simulation

We now present simulations that verify the theoretical results by showing that the proposed procedure achieves the desired coverage rates (Sec 5.1) and illustrate how the proposed method provides more intuitive explanations of performance gaps (Sec 5.2). In all empirical analyses, performance of the ML algorithm is quantified in terms of 0-1 accuracy. Below, we briefly describe the simulation settings; full details are provided in Sec E in the Appendix.

5.1 Verifying theoretical properties

We first verify that the inference procedures for the decomposition terms have CIs with coverage close to their nominal rate. We check the coverage of the aggregate decomposition as well as the value of $s$-partial conditional covariate and partial conditional outcome shifts for $s=\{Z_{1}\},\{Z_{2}\},\{Z_{3}\}$. $(W,Z_{1},Z_{2},Z_{3})$ are sampled from independent normal distributions with different means in the source and target domains, while $Y$ is simulated from logistic regression models with different coefficients. CIs for the debiased ML estimator converge to the nominal 90% coverage rate with increasing sample size, whereas those for the naïve plug-in estimator do not (Fig 3).
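For concreteness, a data-generating sketch in the spirit of this simulation is below; the means and coefficients are illustrative placeholders rather than the values specified in Sec E of the Appendix:

```python
import numpy as np

def simulate_domain(n, means, coefs, rng):
    """Independent normal covariates (W, Z1, Z2, Z3) with domain-specific means
    and a logistic regression outcome with domain-specific coefficients."""
    X = rng.normal(loc=means, scale=1.0, size=(n, 4))
    probs = 1.0 / (1.0 + np.exp(-(X @ np.asarray(coefs))))
    return X, rng.binomial(1, probs)

rng = np.random.default_rng(0)
X_src, Y_src = simulate_domain(2000, [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.5, 0.5], rng)
X_tgt, Y_tgt = simulate_domain(2000, [0.5, 0.3, 0.0, 0.0], [1.0, 0.2, 0.5, 0.5], rng)
```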

(a) Conditional covariate [figure]

(b) Conditional outcome [figure]

Method            Acc-1  Acc-2  Acc-3
ParametricChange  0.78   0.80   0.81
ParametricAcc     0.75   0.80   0.80
RandomForestAcc   0.78   0.80   0.80
OaxacaBlinder     0.78   0.78   0.80
Proposed          0.80   0.80   0.81

Figure 4: Comparison of variable importance reported by the proposed method HDPD versus existing methods for the conditional covariate (a) and conditional outcome (b) terms. The accompanying table lists the accuracy of targeted model revisions using the top 1, 2, and 3 features selected by each method (see Sec 5.2).

5.2 Comparing explanations

We now compare the proposed definitions for the detailed decomposition with existing methods. For the detailed conditional covariate decomposition, the comparators are:

  • MeanChange: Tests for a difference in means for each variable. Defines importance as one minus the p-value.

  • Oaxaca-Blinder: Fits a linear model of the logit-transformed expected loss with respect to $Z$ in the source domain. Defines importance of $Z_{i}$ as its coefficient multiplied by the difference in the mean of $Z_{i}$ (Blinder, 1973).

  • WuShift (Wu et al., 2021): Defines importance of subset $s$ as the change in overall performance due to $s$-partial conditional covariate shifts. Applies the Shapley framework to obtain VIs.

For the detailed decomposition of the performance gap due to shifts in the outcome, we compare against:

  • ParametricChange: Fits a logistic model for $Y$ with interaction terms between domain and $Z$. Defines importance of $Z_{i}$ as the coefficient of its interaction term.

  • ParametricAcc: Same as ParametricChange but models the 0-1 loss rather than $Y$.

  • RandomForestAcc: Compares the VI of RFs trained on data from both domains with input features $D$, $Z$, and $W$ to predict the 0-1 loss.

  • Oaxaca-Blinder: Fits linear models for the logit-transformed expected loss in each domain. Defines importance of $Z_{i}$ as its mean in the target domain multiplied by the difference in its coefficients across domains.

Although the proposed method agrees on important features with these other methods in certain cases, there are important situations where the methods differ as highlighted below.

Conditional covariate. (Fig 4a) We simulate $(W,Z_{1})$ from a standard normal distribution, $Z_{2}$ from a mixture of two Gaussians whose means depend on the value of $Z_{1}$ (i.e. $Z_{1}\rightarrow Z_{2}$), and $Y$ from a logistic regression model depending on $(W,Z_{1},Z_{2})$. We induce a shift from the source domain to the target domain by shifting only the distribution of $Z_{1}$, so that $p_{1}(Z|W)=p_{0}(Z_{2}|Z_{1},W)\,p_{1}(Z_{1}|W)$. Only the proposed estimator correctly recovers that $Z_{1}$ is more important than $Z_{2}$, because this shift explains all the variation in performance gaps across strata $W$. The other methods incorrectly assign higher importance to $Z_{2}$ because they simply assess the performance change due to hypothesized $s$-partial shifts but do not check if the hypothesized shifts are good explanations in the first place.

Conditional outcome. (Fig 4b) $W$ and $Z\in\mathbb{R}^{5}$ are simulated from the same distribution in both domains. $Y$ is generated from a logistic regression model with coefficients for $(W,Z_{1},\cdots,Z_{5})$ of $(0.2,0.4,2,0.25,0.1,0.1)$ in the source and $(0.2,-0.4,0.8,0.1,0.1,0.1)$ in the target. Interestingly, none of the methods have the same ranking of the features. ParametricChange identifies $Z_{2}$ as having the largest shift on the logit scale, but this does not mean that it is the most important explanation for changes in the loss. According to our decomposition framework, $Z_{1}$ is actually the most important for explaining changes in model performance due to outcome shifts. Oaxaca-Blinder and ParametricAcc have odd behavior (Oaxaca-Blinder assigns $Z_{1}$ the lowest importance and ParametricAcc assigns $Z_{3}$ the highest importance) because the models are misspecified. The VI from RandomForestAcc is also difficult to interpret because it measures which variables are good predictors of performance, not performance shift.

(a) Readmission risk prediction [figures]

(b) Insurance coverage prediction [figures]

Figure 5: Aggregate and detailed decompositions for the performance gaps of (a) a model predicting readmission risk across two patient populations (General → Heart Failure) and (b) a model predicting insurance coverage across US states (NE → LA). A subset of VI estimates is shown; see the full list in Sec F in the Appendix.

Perhaps a more objective evaluation is to compare the utility of the different explanations. To this end, we define a targeted algorithmic modification as one where the source risk is revised with respect to a subset of features by fitting an ML algorithm with input features $Q$, $W$, and $Z_{s}$ on the target domain. Comparing the performance of the targeted algorithmic modifications that take in the top $k$ features from each explanation method, we find that model revisions based on the proposed method achieve the highest performance for $k=1$ to $3$.
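A sketch of such a targeted revision is shown below, assuming access to the fitted source-domain model and labeled target data; the GBT learner, the feature ordering of the source model, and the variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def targeted_revision(source_model, W_tgt, Z_tgt, y_tgt, s):
    """Refit the risk on the target domain using only (Q, W, Z_s), where Q is the
    source-model risk and s is a list of column indices into Z (the top features)."""
    # Source risk Q, assuming the source model takes features in the order [W, Z].
    Q = source_model.predict_proba(np.hstack([W_tgt, Z_tgt]))[:, [1]]
    X_revision = np.hstack([Q, W_tgt, Z_tgt[:, s]])
    return GradientBoostingClassifier().fit(X_revision, y_tgt)
```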

6 Real-world data case studies

We now analyze two datasets with naturally-occurring shifts.

Hospital readmission. Using electronic health record data from a large safety-net hospital, we analyzed the performance of a Gradient Boosted Tree (GBT) model trained on the general patient population (source) to predict 30-day readmission risk and applied to patients diagnosed with HF (target). Features include 4 demographic variables ($W$) and 16 diagnosis codes ($Z$). Each domain supplied $n=3750$ observations.

Model accuracy drops from 70% to 53%. From the aggregate decomposition (Fig 5a), we observe that the drop is mainly due to covariate shift. If one performed the standard check of which variables significantly changed in their mean value (MeanChange), one would find a significant shift in nearly every variable, which offers little support for identifying the main drivers of the performance drop. In contrast, the detailed decomposition from the proposed framework estimates the diagnoses “Drug-induced or toxic-related condition” and “mental and substance use disorder in remission” as having the highest contributions to the conditional covariate shift, with most other variables having little to no contribution. Upon discussion with clinicians from this hospital, differences in the top two diagnoses may be explained by (i) substance use being a major cause of HF at this hospital, with over eighty percent of its HF patients reporting current or prior substance use, and (ii) substance use and mental health disorders often occurring simultaneously in this HF patient population. Based on these findings, closing the performance gap may require a mixture of both operational and algorithmic interventions. Finally, CIs from the debiased ML procedure provide valuable information on the uncertainty of the estimates and highlight, for instance, that more data is necessary to determine the true ordering between the top two features. In contrast, existing methods do not provide (asymptotically valid) CIs.

ACS Public Coverage. We analyze a neural network trained to predict whether a person has public health insurance using data from Nebraska in the American Community Survey (source, $n=3000$), applied to data from Louisiana (target, $n=6000$). Baseline variables include 3 demographics (sex, age, race), and covariates $Z$ include 31 variables related to health conditions, employment, marital status, citizenship status, and education.

Model accuracy drops from 84% to 66% across the two states. The main driver is the shift in the outcome distribution per the aggregate decomposition (Fig 5b), and the most important contributor to the outcome shift is annual income, perhaps due to differences in the cost of living across the two states. Income is significantly more important than all the other variables; the ranking among the remaining variables is unclear. Comparing the performance of targeted model revisions with respect to the top variables from each explanation method, we find that model revisions based on the top variables identified by the proposed procedure lead to AUCs that are better than or as good as those based on RandomForestAcc (Table 3 in the Appendix).

7 Discussion

ML algorithms regularly encounter distribution shifts in practice, leading to drops in performance. We present a novel framework that helps ML developers and deployment teams build a more nuanced understanding of the shifts. Compared to past work, the approach is nonparametric, does not require fine-grained knowledge of the causal relationship between variables, and quantifies the uncertainty of the estimates by constructing confidence intervals. We present case studies on real-world datasets to demonstrate the use of the framework for understanding and guiding interventions to reduce performance drops.

Extensions of this work include relaxing the assumption of overlapping support of covariates such as by restricting to the common support (Cai et al., 2023), allowing for decompositions of more complex measures of model performance such as AUC, and analyzing other factorizations of the data distribution (e.g. label or prior shifts (Kouw & Loog, 2019)). For unstructured data (e.g. image and text), the current framework can be applied to low-dimensional embeddings or by extracting interpretable concepts (Kim et al., 2018); more work is needed to extend this framework to directly analyze unstructured data. Finally, the focus of this work is to interpret performance gaps. Future work may extend ideas in this work to design optimal interventions for closing the performance gap.

Impact statement

This work presents a method for understanding failures of ML algorithms when they are deployed in settings or populations different from the ones in development datasets. Therefore, the work can be used to suggest ways of improving the algorithms or mitigating their harms. The method is generally applicable to tabular data settings for any classification algorithm, hence, it can potentially be applied across multiple domains where ML is used including medicine, finance, and online commerce.

Acknowledgments

We would like to thank Lucas Zier, Avni Kothari, Berkman Sahiner, Nicholas Petrick, Gene Pennello, Mi-Ok Kim, and Romain Pirracchio for their invaluable feedback on the work. This work was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2022C1-25619). The views presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of the PCORI®, its Board of Governors or Methodology Committee, and the Food and Drug Administration.

References

  • Ali et al. (2022) Ali, A., Cauchois, M., and Duchi, J. C. The lifecycle of a statistical model: Model failure detection, identification, and refitting, 2022. URL https://arxiv.org/abs/2202.04166.
  • Audibert & Tsybakov (2007) Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. Ann. Stat., 35(2):608–633, April 2007.
  • Blinder (1973) Blinder, A. S. Wage discrimination: Reduced form and structural estimates. The Journal of Human Resources, 8(4):436–455, 1973. ISSN 0022166X. URL http://www.jstor.org/stable/144855.
  • Budhathoki et al. (2021) Budhathoki, K., Janzing, D., Bloebaum, P., and Ng, H. Why did the distribution change? In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 1666–1674. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/budhathoki21a.html.
  • Cai et al. (2023) Cai, T. T., Namkoong, H., and Yadlowsky, S. Diagnosing model performance under distribution shift. March 2023. URL http://arxiv.org/abs/2303.02011.
  • Charnes et al. (1988) Charnes, A., Golany, B., Keane, M., and Rousseau, J. Extremal principle solutions of games in characteristic function form: Core, chebychev and shapley value generalizations. In Sengupta, J. K. and Kadekodi, G. K. (eds.), Econometrics of Planning and Efficiency, pp.  123–133. Springer Netherlands, Dordrecht, 1988.
  • Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. Econom. J., 21(1):C1–C68, February 2018.
  • Cummings et al. (2023) Cummings, B. C., Blackmer, J. M., Motyka, J. R., Farzaneh, N., Cao, L., Bisco, E. L., Glassbrook, J. D., Roebuck, M. D., Gillies, C. E., Admon, A. J., et al. External validation and comparison of a general ward deterioration index between diversely different health systems. Critical Care Medicine, 51(6):775, 2023.
  • d’Eon et al. (2022) d’Eon, G., d’Eon, J., Wright, J. R., and Leyton-Brown, K. The spotlight: A general method for discovering systematic errors in deep learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pp.  1962–1981, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533240. URL https://doi.org/10.1145/3531146.3533240.
  • Dodd & Pepe (2003) Dodd, L. E. and Pepe, M. S. Semiparametric regression for the area under the receiver operating characteristic curve. J. Am. Stat. Assoc., 98(462):409–417, 2003.
  • Eyuboglu et al. (2022) Eyuboglu, S., Varma, M., Saab, K. K., Delbrouck, J.-B., Lee-Messer, C., Dunnmon, J., Zou, J., and Re, C. Domino: Discovering systematic errors with cross-modal embeddings. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=FPCMqjI0jXN.
  • Fairlie (2005) Fairlie, R. W. An extension of the blinder-oaxaca decomposition technique to logit and probit models. Journal of economic and social measurement, 30(4):305–316, 2005.
  • Firpo et al. (2018) Firpo, S. P., Fortin, N. M., and Lemieux, T. Decomposing wage distributions using recentered influence function regressions. Econometrics, 6(2), 2018. ISSN 2225-1146. doi: 10.3390/econometrics6020028. URL https://www.mdpi.com/2225-1146/6/2/28.
  • Fortin et al. (2011) Fortin, N., Lemieux, T., and Firpo, S. Chapter 1 - decomposition methods in economics. volume 4 of Handbook of Labor Economics, pp.  1–102. Elsevier, 2011. doi: https://doi.org/10.1016/S0169-7218(11)00407-2. URL https://www.sciencedirect.com/science/article/pii/S0169721811004072.
  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. International Conference on Machine Learning, 70:1321–1330, 2017.
  • Hines et al. (2022) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. Variable importance measures for heterogeneous causal effects. April 2022.
  • Hines et al. (2023) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. Variable importance measures for heterogeneous causal effects, 2023.
  • Jain et al. (2023) Jain, S., Lawrence, H., Moitra, A., and Madry, A. Distilling model failures as directions in latent space. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=99RpBVpLiX.
  • Kang & Schafer (2007) Kang, J. D. Y. and Schafer, J. L. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science, 22(4):523 – 539, 2007. doi: 10.1214/07-STS227. URL https://doi.org/10.1214/07-STS227.
  • Kennedy (2022) Kennedy, E. H. Semiparametric doubly robust targeted double machine learning: a review. March 2022.
  • Kim et al. (2018) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  2668–2677. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/kim18d.html.
  • Kirby et al. (2006) Kirby, J. B., Taliaferro, G., and Zuvekas, S. H. Explaining racial and ethnic disparities in health care. Medical Care, 44(5):I64–I72, 2006. ISSN 00257079. URL http://www.jstor.org/stable/3768359.
  • Kouw & Loog (2019) Kouw, W. M. and Loog, M. An introduction to domain adaptation and transfer learning, 2019.
  • Kulinski & Inouye (2023) Kulinski, S. and Inouye, D. I. Towards explaining distribution shifts. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  17931–17952. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/kulinski23a.html.
  • Kulinski et al. (2020) Kulinski, S., Bagchi, S., and Inouye, D. I. Feature shift detection: Localizing which features have shifted via conditional distribution tests. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  19523–19533. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/e2d52448d36918c575fa79d88647ba66-Paper.pdf.
  • Liu et al. (2023) Liu, J., Wang, T., Cui, P., and Namkoong, H. On the need for a language describing distribution shifts: Illustrations on tabular datasets. July 2023.
  • Oaxaca (1973) Oaxaca, R. Male-female wage differentials in urban labor markets. International Economic Review, 14(3):693–709, 1973. ISSN 00206598, 14682354. URL http://www.jstor.org/stable/2525981.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Platt (1999) Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • Plumb et al. (2023) Plumb, G., Johnson, N., Cabrera, A., and Talwalkar, A. Towards a more rigorous science of blindspot discovery in image classification models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=MaDvbLaBiF. Expert Certification.
  • Qiu et al. (2023) Qiu, H., Tchetgen, E. T., and Dobriban, E. Efficient and multiply robust risk estimation under general forms of dataset shift. June 2023.
  • Quionero-Candela et al. (2009) Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009.
  • Rabanser et al. (2019) Rabanser, S., Günnemann, S., and Lipton, Z. Failing loudly: An empirical study of methods for detecting dataset shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/846c260d715e5b854ffad5f70a516c88-Paper.pdf.
  • Shapley (1953) Shapley, L. S. 17. a value for n-person games. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games (AM-28), Volume II, pp.  307–318. Princeton University Press, Princeton, December 1953.
  • Steyerberg (2009) Steyerberg, E. W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer, New York, NY, 2009.
  • Sugiyama et al. (2007) Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. Adv. Neural Inf. Process. Syst., 2007.
  • Tsiatis (2006) Tsiatis, A. A. Semiparametric Theory and Missing Data. Springer New York, 2006.
  • van der Vaart (1998) van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, October 1998.
  • Williamson & Feng (2020) Williamson, B. D. and Feng, J. Efficient nonparametric statistical inference on population feature importance using shapley values. International Conference on Machine Learning, 2020. URL https://proceedings.icml.cc/static/paper_files/icml/2020/3042-Paper.pdf.
  • Williamson et al. (2021) Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1):9–22, 2021. doi: https://doi.org/10.1111/biom.13392. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13392.
  • Wu et al. (2021) Wu, E., Wu, K., and Zou, J. Explaining medical AI performance disparities across sites with confounder shapley value analysis. November 2021. URL http://arxiv.org/abs/2111.08168.
  • Yun (2004) Yun, M.-S. Decomposing differences in the first moment. Economics Letters, 82(2):275–280, 2004. ISSN 0165-1765. doi: https://doi.org/10.1016/j.econlet.2003.09.008. URL https://www.sciencedirect.com/science/article/pii/S0165176503002866.
  • Zhang et al. (2023) Zhang, H., Singh, H., Ghassemi, M., and Joshi, S. ”Why did the model fail?”: Attributing model performance changes to distribution shifts. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  41550–41578. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/zhang23ai.html.

Appendix A Contents of the Appendix

  • Table 1 summarizes the comparison with prior work.

  • Table 2 collects all the notation used for reference.

  • Algorithms 1 and 4 provide the steps required for computing the aggregate and detailed decompositions, respectively. The detailed decompositions require computing the value of s-partial conditional outcome and conditional covariate shifts, which is described in Algorithms 2 and 3.

  • Section B describes the estimation and inference for detailed decomposition of conditional covariate shift.

  • Section C provides the derivations of the results.

  • Sections D and E describe the implementation and simulation details.

  • Section F provides additional details on the two real world datasets and results.

Papers | Aggregate decomp. | Detailed decomp.: Cond. covariate | Detailed decomp.: Cond. outcome | Does not require knowing causal DAG | Confidence intervals | Nonparametric
Zhang et al. (2023) \checkmark \checkmark \checkmark
Cai et al. (2023) \checkmark \checkmark \checkmark \checkmark
Wu et al. (2021) \checkmark \checkmark \checkmark
Liu et al. (2023) \checkmark \checkmark \checkmark \checkmark
Dodd & Pepe (2003) \checkmark \checkmark \checkmark
Oaxaca (1973), Blinder (1973) \checkmark \checkmark \checkmark \checkmark \checkmark
HDPD (this paper) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
Table 1: Comparison to prior works that decompose performance change.
Symbol | Meaning
W, Z, Y | Variables: baseline, conditional covariates, outcome
f | Prediction model being analyzed
\ell(W,Z,Y) or \ell | Loss function, e.g. 0-1 loss
D=0 and D=1 | Indicators for source and target domain
p_0, p_1 | Probability density (or mass) functions for the two domains D=0,1
\mathbb{E}_{D_W D_Z D_Y} | Expectation over the distribution p_{D_W}(W) p_{D_Z}(Z|W) p_{D_Y}(Y|W,Z)
Q := q(W,Z) | Source-domain risk at (W,Z), i.e. p_0(Y=1|W,Z)
Tr and Ev | Training dataset used to fit models and evaluation dataset used to compute decompositions
\phi_{Z,j} and \phi_{Y,j} | Shapley values for variable j in the detailed decompositions of conditional covariate and outcome shifts
v_Z(s) and v_Y(s) | Value of a subset s for the s-partial conditional covariate shift and the s-partial outcome shift
v_\cdot^{num}(s) and v_\cdot^{den}(s) | Numerator and denominator of the ratio defined in the value of a subset
Models \mu_\cdot | Outcome models for the conditional expectation of the loss across different settings
Models \pi_\cdot | Density ratio models for feature densities across datasets
\mathbb{P} | Notation for expectation
\mathbb{P}_{0,n}^{Ev} and \mathbb{P}_{1,n}^{Ev} | Sample averages over source and target data in the evaluation dataset
\psi(d,w,z,y) | Influence function defined in the linear approximation of an estimand, see e.g. (11)
Table 2: Notation

Input: Source and target data \{(W_{i}^{(d)}, Z_{i}^{(d)}, Y_{i}^{(d)})\}_{i=1}^{n_{d}} for d \in \{0,1\}, loss function \ell(W,Z,Y;f).

Output: Performance change due to baseline, conditional covariate, and conditional outcome shifts \Lambda_{W}, \Lambda_{Z}, \Lambda_{Y}.

Split source and target data into training Tr and evaluation Ev partitions. Let n^{Ev} be the total number of data points in the Ev partition.

Fit nuisance parameters \eta_{W}, \eta_{Z}, \eta_{Y}, defined in Section C.1, on the Tr partition as outlined in Section D.1.

Estimate \Lambda_{W}, \Lambda_{Z}, \Lambda_{Y} using the fitted nuisance parameters on the Ev partition following the equations in Section 4.1.

Estimate the variance of the influence functions \psi_{W}(d,w,z,y;\hat{\eta}_{N}), \psi_{Z}(d,w,z,y;\hat{\eta}_{N}), and \psi_{Y}(d,w,z,y;\hat{\eta}_{N}) as defined in (12), (13), and (14), respectively.

Compute \alpha-level confidence intervals as \hat{\Lambda}_{N} \pm z_{1-\alpha/2}\sqrt{\widehat{var}(\psi_{N}(d,w,z,y;\hat{\eta}_{N}))/n^{Ev}} for N \in \{W, Z, Y\}, where z_{1-\alpha/2} is the (1-\alpha/2) quantile of the standard normal distribution.

return \hat{\Lambda}_{W}, \hat{\Lambda}_{Z}, \hat{\Lambda}_{Y} and confidence intervals

Algorithm 1 Aggregate decompositions into baseline, conditional covariate, and conditional outcome shifts
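For concreteness, the following is a minimal Python sketch (not the authors' implementation) of the confidence-interval step in Algorithm 1. The names wald_ci, estimate, and influence_values are illustrative, and the influence-function values are assumed to have already been computed from the fitted nuisance models on the Ev partition.

```python
import numpy as np
from scipy.stats import norm

def wald_ci(estimate, influence_values, alpha=0.05):
    """Wald-type confidence interval from estimated influence-function values.

    `estimate` is a one-step corrected point estimate (e.g. Lambda_hat_W) and
    `influence_values` holds the estimated influence function evaluated at each
    observation in the evaluation partition.
    """
    n_ev = len(influence_values)
    se = np.sqrt(np.var(influence_values, ddof=1) / n_ev)
    z = norm.ppf(1 - alpha / 2)  # (1 - alpha/2) quantile of the standard normal
    return estimate - z * se, estimate + z * se
```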

Input: Training Tr and evaluation Ev partitions of source and target data, subset of variables s.

Output: Value of the s-partial conditional outcome shift for subset s.

Fit nuisance parameters \eta_{Y,s}^{num}, \eta_{Y}^{den}, defined in Section C.3, on the Tr partition as outlined in Section D.2.

Estimate v_{Y}(s) by \hat{v}_{Y}^{num}(s)/\hat{v}_{Y}^{den}, where \hat{v}_{Y}^{num}(s) is estimated using (4) and \hat{v}_{Y}^{den} is estimated using (5) on the Ev partition.

Estimate the variance of the influence function \psi_{Y,bin,s}(d,w,z,y;\hat{\eta}_{Y,s}^{num},\hat{\eta}_{Y}^{den}) as defined in (63).

Compute the \alpha-level confidence interval as \hat{v}_{Y}(s) \pm z_{1-\alpha/2}\sqrt{\widehat{var}(\psi_{Y,bin,s}(d,w,z,y;\hat{\eta}_{Y,s}^{num},\hat{\eta}_{Y}^{den}))/n^{Ev}}.

return \hat{v}_{Y}(s) and confidence interval

Algorithm 2 ValueConditionalOutcome(s): Value for ss-partial conditional outcome shift for a subset ss

Input: Training Tr and evaluation Ev partitions of source and target data, subset of variables s.

Output: Value of the s-partial conditional covariate shift for subset s.

Fit nuisance parameters \eta_{Z,s}^{num}, defined in Section C.2, on the Tr partition, as outlined in Section D.3.

Estimate v_{Z}(s) by \hat{v}_{Z}^{num}(s)/\hat{v}_{Z}^{num}(\emptyset) using (10) on the Ev partition.

Estimate the variance of the influence function \psi_{Z,s}(d,w,z,y;\hat{\eta}_{Z,s}^{num}) as defined in (30).

Compute the \alpha-level confidence interval as \hat{v}_{Z}(s) \pm z_{1-\alpha/2}\sqrt{\widehat{var}(\psi_{Z,s}(d,w,z,y;\hat{\eta}_{Z,s}^{num}))/n^{Ev}}.

return \hat{v}_{Z}(s) and confidence interval

Algorithm 3 ValueConditionalCovariate(s): Value for ss-partial conditional covariate shift for a subset ss

Input: Source and target data \{(W_{i}^{(d)}, Z_{i}^{(d)}, Y_{i}^{(d)})\}_{i=1}^{n_{d}} for d \in \{0,1\}, loss function \ell(W,Z,Y;f), \gamma \in \mathbb{R}^{+}.

Output: Detailed decomposition for conditional outcome or covariate shift, \{\phi_{Y,j}: j=0,\dots,m_{2}\} or \{\phi_{Z,j}: j=1,\dots,m_{2}\}.

Split source and target data into training Tr and evaluation Ev partitions. Let n^{Ev} be the total number of data points in the Ev partition.

Subsample \lfloor\gamma n^{Ev}\rfloor subsets from \mathcal{Z}=\{1,\dots,m_{2}\} with respect to Shapley weights, including \emptyset and \mathcal{Z}, denoted s_{1},\dots,s_{k}.

Estimate v_{Y}(s) \leftarrow ValueConditionalOutcome(s) and v_{Z}(s) \leftarrow ValueConditionalCovariate(s) for s \in \{s_{1},\dots,s_{k}\}.

Get estimated Shapley values \{\phi_{Y,j}\} and \{\phi_{Z,j}\} by solving the constrained linear regression problems in (7) of Williamson & Feng (2020) with value functions v_{Y}(s) and v_{Z}(s), respectively.

Compute confidence intervals based on the influence functions defined in Theorem 1 of Williamson & Feng (2020).

return Shapley values \{\phi_{Y,j}: j=0,\dots,m_{2}\} and \{\phi_{Z,j}: j=1,\dots,m_{2}\} and confidence intervals

Algorithm 4 Detailed decomposition for conditional outcome and covariate shift
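The subset-sampling and regression steps of Algorithm 4 can be sketched as follows. This is a simplified, KernelSHAP-style variant rather than the exact constrained estimator of Williamson & Feng (2020): subsets are drawn uniformly and the Shapley kernel enters as regression weights, with the empty and full subsets pinned by large weights instead of hard constraints. The function shapley_from_values and its arguments are illustrative.

```python
import numpy as np
from scipy.special import comb

def shapley_from_values(value_fn, m, n_subsets=200, rng=None):
    """Simplified regression-based Shapley estimator (KernelSHAP-style).

    `value_fn` maps an array of variable indices (a subset s) to its value v(s),
    e.g. the output of Algorithm 2 or 3, and m >= 2 is the number of variables.
    """
    rng = np.random.default_rng(rng)
    subsets = [np.zeros(m, dtype=int), np.ones(m, dtype=int)]
    weights = [1e6, 1e6]  # approximately pin v(empty set) and v(full set)
    while len(subsets) < n_subsets:
        z = rng.integers(0, 2, size=m)
        k = int(z.sum())
        if 0 < k < m:
            subsets.append(z)
            weights.append((m - 1) / (comb(m, k) * k * (m - k)))  # Shapley kernel
    Z = np.array(subsets)
    v = np.array([value_fn(np.flatnonzero(z)) for z in Z])
    sw = np.sqrt(np.array(weights))[:, None]
    X = np.column_stack([np.ones(len(Z)), Z])  # intercept plays the role of phi_0
    coef, *_ = np.linalg.lstsq(sw * X, sw.ravel() * v, rcond=None)
    return coef  # coef[0] ~ phi_0, coef[1:] ~ per-variable attributions phi_j
```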

Appendix B Estimation and Inference

B.1 Value of ss-partial conditional covariate shifts

Estimation. Using the training partition, estimate the density ratio \pi_{1s0}(z_{s},w) = p_{1}(z_{s},w)/p_{0}(z_{s},w) and the outcome models

\mu_{\cdot 0_{-s}0}(z_{s},w) = \mathbb{E}_{\cdot 00}[\ell|z_{s},w] = \int\int \ell\, p_{0}(y|w,z)\, p_{0}(z_{-s}|z_{s},w)\, dy\, dz_{-s}   (7)
\mu_{\cdot 10}(w) = \mathbb{E}_{\cdot 10}[\ell|w]   (8)
\mu_{\cdot s0}(w) = \mathbb{E}_{\cdot s0}[\ell|w] = \int\int\int \ell\, p_{0}(y|w,z)\, p_{0}(z_{-s}|z_{s},w)\, p_{1}(z_{s}|w)\, dy\, dz_{-s}\, dz_{s},   (9)

in addition to the other nuisance models previously mentioned. We propose the estimator \hat{v}_{Z}(s) = \hat{v}_{Z}^{num}(s)/\hat{v}_{Z}^{den}, where

\hat{v}_{Z}^{num}(s) := \mathbb{P}_{1,n}^{Ev}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))^{2}
  + 2\,\mathbb{P}_{0,n}^{Ev}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\ell-\hat{\mu}_{\cdot 0_{-s}0}(W,Z_{s}))\,\hat{\pi}_{1s0}(W,Z_{s})
  - 2\,\mathbb{P}_{0,n}^{Ev}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\ell-\hat{\mu}_{\cdot\cdot 0}(W,Z))\,\hat{\pi}_{110}(W,Z)
  + 2\,\mathbb{P}_{1,n}^{Ev}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\hat{\mu}_{\cdot 0_{-s}0}(W,Z_{s})-\hat{\mu}_{\cdot s0}(W))
  - 2\,\mathbb{P}_{1,n}^{Ev}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\hat{\mu}_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot 10}(W))   (10)

and \hat{v}_{Z}^{den} := \hat{v}_{Z}^{num}(\emptyset).
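For concreteness, a minimal numpy sketch of the point estimate in (10) is given below, assuming the nuisance models have already been fit on the Tr partition and evaluated on the Ev partition. All argument names are illustrative: suffix 0 denotes source-domain evaluation points, suffix 1 target-domain points. The denominator is obtained by calling the same function with s = \emptyset.

```python
import numpy as np

def v_z_num_hat(loss0, mu_s0_0, mu_10_0, mu_0ns0_0, mu_all0_0, pi_1s0_0, pi_110_0,
                mu_s0_1, mu_10_1, mu_0ns0_1, mu_all0_1):
    """Point estimate of v_Z^num(s) in (10) from nuisance predictions on Ev data.

    loss0 holds the observed losses on source-domain points; mu_0ns0 stands for
    mu_{.0_{-s}0}(W, Z_s) and mu_all0 stands for mu_{..0}(W, Z).
    """
    diff1 = mu_s0_1 - mu_10_1  # mu_{.s0}(W) - mu_{.10}(W) on target points
    diff0 = mu_s0_0 - mu_10_0  # the same contrast on source points
    return (np.mean(diff1 ** 2)
            + 2 * np.mean(diff0 * (loss0 - mu_0ns0_0) * pi_1s0_0)
            - 2 * np.mean(diff0 * (loss0 - mu_all0_0) * pi_110_0)
            + 2 * np.mean(diff1 * (mu_0ns0_1 - mu_s0_1))
            - 2 * np.mean(diff1 * (mu_all0_1 - mu_10_1)))
```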

Inference. The estimator is asymptotically normal as long as the outcome and density ratio models are estimated at sufficiently fast rates, formalized in the following condition.

Condition B.1.

For variable subset s, suppose the following hold:

  • \mathbb{P}_{0}(\mu_{001}(W)-\mu_{01}(W))^{2} is bounded

  • \mathbb{P}_{0}(\mu_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W))(\hat{\pi}_{1s0}(Z_{s},W)-\pi_{1s0}(Z_{s},W)) = o_{p}(n^{-1/2})

  • \mathbb{P}_{0}(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))(\hat{\pi}_{110}(W,Z)-\pi_{110}(W,Z)) = o_{p}(n^{-1/2})

  • \mathbb{P}_{1}(\hat{\mu}_{\cdot s0}(W)-\mu_{\cdot s0}(W))^{2} = o_{p}(n^{-1/2})

  • \mathbb{P}_{1}(\hat{\mu}_{\cdot 10}(W)-\mu_{\cdot 10}(W))^{2} = o_{p}(n^{-1/2})

  • (Positivity) p_{0}(z_{s},w) > 0 and p_{0}(w,z) > 0 almost everywhere, such that the density ratios \pi_{1s0}(w,z_{s}) and \pi_{110}(w,z) are well-defined and bounded.

Theorem B.2.

For variable subset s, suppose v_{Z}^{den}(s) > 0 and Condition B.1 hold. Then the estimator \hat{v}_{Z}(s) is asymptotically normal.

Appendix C Proofs

Notation. For all proofs, we write \mathbb{P} to mean \mathbb{P}^{Ev} (and likewise for the empirical version) for notational simplicity.

Overview of derivation strategy. We first present the general strategy for proving asymptotic normality of the estimators for the decompositions. Details on nonparametric debiased estimation can be found in texts such as Tsiatis (2006) and Kennedy (2022).

Let v(\mathbb{P}) be a pathwise differentiable quantity that is a function of the true regular (differentiable in quadratic mean) probability distribution \mathbb{P} over a random variable O. For instance, for the mean, v is defined as v(\mathbb{P}) := \mathbb{E}_{o\sim\mathbb{P}(O)}[o]. Let \hat{\mathbb{P}} denote an arbitrary regular estimator of \mathbb{P}, such as the maximum likelihood estimator. The plug-in estimator is then defined as v(\hat{\mathbb{P}}).

The von Mises expansion of the functional v (which linearizes v in analogy to the first-order Taylor expansion), given that it is pathwise differentiable, gives

v(\hat{\mathbb{P}}) - v(\mathbb{P}) = -\mathbb{P}\,\psi(o;\hat{\mathbb{P}}) + R(\hat{\mathbb{P}},\mathbb{P}).   (11)

Here, the function \psi is called an influence function (or a functional gradient of v at \hat{\mathbb{P}}). R(\hat{\mathbb{P}},\mathbb{P}) is a second-order remainder term. The one-step corrected estimators we consider have the form v(\hat{\mathbb{P}}) + \mathbb{P}_{n}\psi(o;\hat{\mathbb{P}}), where \mathbb{P}_{n} denotes a sample average. Following the expansion above, the one-step corrected estimator can be analyzed as follows,

\left(v(\hat{\mathbb{P}})+\mathbb{P}_{n}\psi(o;\hat{\mathbb{P}})\right)-v(\mathbb{P})
= (\mathbb{P}_{n}-\mathbb{P})\psi(o;\hat{\mathbb{P}}) + R(\hat{\mathbb{P}},\mathbb{P})
= (\mathbb{P}_{n}-\mathbb{P})\psi(o;\mathbb{P}) + (\mathbb{P}_{n}-\mathbb{P})(\psi(o;\hat{\mathbb{P}})-\psi(o;\mathbb{P})) + R(\hat{\mathbb{P}},\mathbb{P})

Our goal is to analyze each of the three terms on the right-hand side and to show that the latter two are asymptotically negligible at the \sqrt{n}-rate, such that the one-step corrected estimator satisfies

\left(v(\hat{\mathbb{P}})+\mathbb{P}_{n}\psi(o;\hat{\mathbb{P}})\right)-v(\mathbb{P}) = \mathbb{P}_{n}\psi(o;\mathbb{P}) + o_{p}(n^{-1/2}),

where we used the property that influence functions have zero mean. Thus the one-step corrected estimator is asymptotically normal with mean v(\mathbb{P}) and variance var(\psi(o;\mathbb{P}))/n, which allows for the construction of CIs. In the following proofs, we present the influence functions without derivations; see Kennedy (2022) and Hines et al. (2022) for strategies for deriving influence functions.
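As a sanity check of this machinery on the mean functional mentioned above (an illustration added here, not part of the original derivation), the influence function is \psi(o;\mathbb{P}) = o - v(\mathbb{P}) and the remainder R vanishes exactly, so the one-step corrected estimator reduces to

v(\hat{\mathbb{P}}) + \mathbb{P}_{n}\psi(o;\hat{\mathbb{P}}) = \mathbb{E}_{\hat{\mathbb{P}}}[o] + \mathbb{P}_{n}\left(o - \mathbb{E}_{\hat{\mathbb{P}}}[o]\right) = \mathbb{P}_{n}\, o,

i.e. the sample mean.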

C.1 Aggregate decompositions

Let the nuisance parameters in the one-step estimators of \Lambda_{W}, \Lambda_{Z}, \Lambda_{Y} be denoted by \eta_{W}=(\mu_{\cdot 00},\pi_{100}), \eta_{Z}=(\mu_{\cdot\cdot 0},\mu_{\cdot 00},\pi_{110}), and \eta_{Y}=(\mu_{\cdot\cdot 0},\pi_{110}), respectively. Denote the estimated nuisances by \hat{\eta}_{W}, \hat{\eta}_{Z}, \hat{\eta}_{Y}. The canonical gradients for the three estimands are

\psi_{W}(d,w,z,y;\eta_{W}) = \left[\left(\ell(w,z,y)-\mu_{\cdot 00}(w)\right)\pi_{100}(w)-\ell(w,z,y)\right]\frac{\mathbbm{1}\{d=0\}}{p(d=0)} + \mu_{\cdot 00}(w)\frac{\mathbbm{1}\{d=1\}}{p(d=1)} - \Lambda_{W}   (12)
\psi_{Z}(d,w,z,y;\eta_{Z}) = \left[\left(\ell(w,z,y)-\mu_{\cdot\cdot 0}(w,z)\right)\pi_{110}(w,z)\right]\frac{\mathbbm{1}\{d=0\}}{p(d=0)} + \mu_{\cdot\cdot 0}(w,z)\frac{\mathbbm{1}\{d=1\}}{p(d=1)}
  - \left[\left(\ell(w,z,y)-\mu_{\cdot 00}(w)\right)\pi_{100}(w)\right]\frac{\mathbbm{1}\{d=0\}}{p(d=0)} - \mu_{\cdot 00}(w)\frac{\mathbbm{1}\{d=1\}}{p(d=1)} - \Lambda_{Z}   (13)
\psi_{Y}(d,w,z,y;\eta_{Y}) = \left(\ell(w,z,y)-\mu_{\cdot\cdot 0}(w,z)\right)\frac{\mathbbm{1}\{d=1\}}{p(d=1)} - \left[\left(\ell(w,z,y)-\mu_{\cdot\cdot 0}(w,z)\right)\pi_{110}(w,z)\right]\frac{\mathbbm{1}\{d=0\}}{p(d=0)} - \Lambda_{Y}.   (14)
Theorem C.1 (Theorem 4.1).

Under the conditions outlined in Theorem 4.1, the one-step corrected estimators for the aggregate decomposition terms (baseline, conditional covariate, and conditional outcome) \hat{\Lambda}_{W}, \hat{\Lambda}_{Z}, and \hat{\Lambda}_{Y} are asymptotically linear, i.e.

\hat{\Lambda}_{N} - \Lambda_{N} = \mathbb{P}_{n}\psi_{N} + o_{p}(n^{-1/2}) \quad \forall N \in \{W, Z, Y\}.   (15)
Proof.

The estimands \Lambda_{W}, \Lambda_{Z}, \Lambda_{Y} have similarities to the standard average treatment effect (ATE) in the causal inference literature (see Kennedy, 2022, Example 2). Hence, the estimators and their asymptotic properties directly follow. For treatment T, outcome O, and confounders C, the mean outcome under T=1 among the population with T=0 is identified as

\phi = \int o\, p(o|c,t=1)\, p(c|t=0)\, do\, dc   (16)

and its one-step corrected estimator can be derived from the canonical gradient of \phi, which takes the following form after plugging in the estimates of the nuisance models:

\hat{\phi}_{n} = \mathbb{P}_{n}\left\{\frac{\mathbbm{1}\{T=1\}}{\Pr(T=1)}\hat{\pi}(c)\left(O-\hat{\mu}_{1}(c)\right) + \frac{\mathbbm{1}\{T=0\}}{\Pr(T=0)}\hat{\mu}_{1}(c)\right\}

satisfies

\hat{\phi}_{n} - \phi = \mathbb{P}_{n}\left\{\frac{\mathbbm{1}\{T=1\}}{\Pr(T=1)}\pi(c)\left(O-\mu_{1}(c)\right) + \frac{\mathbbm{1}\{T=0\}}{\Pr(T=0)}\mu_{1}(c) - \phi\right\} + o_{p}(n^{-1/2})

where \mu_{1}(c)=\mathbb{E}[o|c,t=1] and \pi(c)=p(c|t=0)/p(c|t=1), as long as the following conditions hold:

  • p(c|t=1) > 0 almost everywhere, such that the density ratio \pi(c) is well-defined and bounded,

  • \mathbb{P}_{1}(\hat{\mu}_{1}-\mu_{1})(\hat{\pi}-\pi) = o_{p}(n^{-1/2}).

We establish the estimators and their influence functions by showing that they can all be viewed as mean outcomes of the form (16).

Baseline term \Lambda_{W}. The first term \mathbb{E}_{100}[\ell(w,z,y)] is a mean outcome with respect to p(\ell(w,z,y)|w,d=0)\, p(w|d=1), which is the same as that in (16) but with \ell(w,z,y) as the outcome, w as the confounder, and d as the (flipped) treatment. The second term \mathbb{E}_{000}[\ell(w,z,y)] is a simple average over the D=0 population, whose influence function is \ell(w,z,y) itself.

Conditional covariate term \Lambda_{Z}. The first term \mathbb{E}_{110}[\ell(w,z,y)] is the mean outcome with respect to p(\ell(w,z,y)|w,z,d=0)\, p(w,z|d=1), where the chief difference is that (w,z) is the confounder. The second term \mathbb{E}_{100}[\ell(w,z,y)] is also a mean outcome, as discussed above.

Conditional outcome term \Lambda_{Y}. The first term \mathbb{E}_{111}[\ell(w,z,y)] is a simple average over the D=1 population. ∎

C.2 Value of ss-partial conditional covariate shifts

Let nuisance parameters in the one-step estimator vZ,snumv_{\texttt{Z},s}^{\texttt{num}} be denoted ηZ,snum=(μs0,μ10,μ0s0,μ001,μ0,π1s0,π110)\eta_{\texttt{Z},s}^{\texttt{num}}=(\mu_{\cdot s0},\mu_{\cdot 10},\mu_{\cdot 0_{-s}0},\mu_{001},\mu_{\cdot\cdot 0},\pi_{1s0},\pi_{110}) and the set of estimated nuisances by η^Z,snum\hat{\eta}_{\texttt{Z},s}^{\texttt{num}}. The canonical gradient of vZnum(s)v_{\texttt{Z}}^{\texttt{num}}(s) is

ψZ,snum(D,W,Z,Y;ηZ,snum)\displaystyle\psi_{\texttt{Z},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}}) =(μs0(W)μ10(W))2𝟙{D=1}Pr(D=1)\displaystyle=(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))^{2}\frac{\mathbbm{1}\{D=1\}}{\Pr(D=1)}
+2(μs0(W)μ10(W))(μ0s0(W,Zs))π1s0(W,Zs)𝟙{D=0}Pr(D=0)\displaystyle+2(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))(\ell-{\mu}_{\cdot 0_{-s}0}(W,Z_{s})){\pi}_{1s0}(W,Z_{s})\frac{\mathbbm{1}\{D=0\}}{\Pr(D=0)}
2(μs0(W)μ10(W))(μ0(W,Z))π110(W,Z)𝟙{D=0}Pr(D=0)\displaystyle-2(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))(\ell-{\mu}_{\cdot\cdot 0}(W,Z)){\pi}_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{\Pr(D=0)}
+2(μs0(W)μ10(W))(μ0s0(W,Zs)μs0(W))𝟙{D=1}Pr(D=1)\displaystyle+2(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))({\mu}_{\cdot 0_{-s}0}(W,Z_{s})-{\mu}_{\cdot s0}(W))\frac{\mathbbm{1}\{D=1\}}{\Pr(D=1)}
2(μs0(W)μ10(W))(μ0(W,Z)μ10(W))𝟙{D=1}Pr(D=1)\displaystyle-2(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))({\mu}_{\cdot\cdot 0}(W,Z)-{\mu}_{\cdot 10}(W))\frac{\mathbbm{1}\{D=1\}}{\Pr(D=1)}
vZnum(s).\displaystyle-v_{\texttt{Z}}^{\texttt{num}}(s). (17)
Lemma C.2.

Under Condition B.1, v^Znum(s)\hat{v}_{\texttt{Z}}^{\texttt{num}}(s) satisfies

v^Znum(s)vZnum(s)\displaystyle\hat{v}_{\texttt{Z}}^{\texttt{num}}(s)-v_{\texttt{Z}}^{\texttt{num}}(s) =nψZ,snum(D,W,Z,Y;ηZ,snum)+op(n1/2)\displaystyle=\mathbb{P}_{n}\psi_{\texttt{Z},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}})+o_{p}(n^{-1/2}) (18)
v^ZdenvZden\displaystyle\hat{v}^{\texttt{den}}_{\texttt{Z}}-v^{\texttt{den}}_{\texttt{Z}} =nψZ,num(D,W,Z,Y;ηZ,snum)+op(n1/2)\displaystyle=\mathbb{P}_{n}\psi_{\texttt{Z},\emptyset}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}})+o_{p}(n^{-1/2}) (19)
Proof.

Consider the following decomposition

v^Znum(s)vZnum(s)\displaystyle\hat{v}_{\texttt{Z}}^{\texttt{num}}(s)-v_{\texttt{Z}}^{\texttt{num}}(s)
=\displaystyle= (n)ψZnum(D,W,Z,Y;ηZ,snum)\displaystyle(\mathbb{P}_{n}-\mathbb{P})\psi_{\texttt{Z}}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}}) (20)
+(n)(ψZnum(D,W,Z,Y;η^Z,snum)ψZnum(W,Z,Y;ηZ,snum))\displaystyle+(\mathbb{P}_{n}-\mathbb{P})(\psi_{\texttt{Z}}^{\texttt{num}}(D,W,Z,Y;\hat{\eta}_{\texttt{Z},s}^{\texttt{num}})-\psi_{\texttt{Z}}^{\texttt{num}}(W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}})) (21)
+(ψZnum(D,W,Z,Y;η^Z,snum)ψZnum(D,W,Z,Y;ηZ,snum))\displaystyle+\mathbb{P}(\psi_{\texttt{Z}}^{\texttt{num}}(D,W,Z,Y;\hat{\eta}_{\texttt{Z},s}^{\texttt{num}})-\psi_{\texttt{Z}}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}})) (22)

We note that (20) converges to a normal distribution per the CLT, assuming the variance of \psi_{Z,s}^{num} is finite. The empirical process term (21) is asymptotically negligible, as the nuisance parameters \eta_{Z,s}^{num} are estimated using a training data split separate from the evaluation data, and (Kennedy, 2022, Lemma 1) states that

(\mathbb{P}_{n}-\mathbb{P})(\psi_{Z,s}^{num}(D,W,Z,Y;\hat{\eta}_{Z,s}^{num})-\psi_{Z,s}^{num}(D,W,Z,Y;\eta_{Z,s}^{num})) = o_{p}(n^{-1/2})

as long as the estimators for all nuisance parameters are consistent. We now establish that the remainder term (22) is also asymptotically negligible. Integrating with respect to Y, we have that

(22)=\displaystyle\eqref{eq:b3_cond_cov_num}= 2(μ^s0(W)μ^10(W))×((μ0s0(Zs,W)μ^0s0(Zs,W))π^1s0(W,Zs)𝟙{D=0}p(D=0)+(μ^0s0(Zs,W)μ^s0(W))𝟙{D=1}p(D=1)(μ0(W,Z)μ^0(W,Z))π^110(W,Z)𝟙{D=0}p(D=0)(μ^0(W,Z)μ^10(W))𝟙{D=1}p(D=1))\displaystyle\mathbb{P}\ 2(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))\begin{aligned} \times\Big{(}&(\mu_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W))\hat{\pi}_{1s0}(W,Z_{s})\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &+(\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot s0}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\\ &-(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))\hat{\pi}_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &-(\hat{\mu}_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot 10}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\Big{)}\\ \end{aligned} (23)
+((μ^s0(W)μ^10(W))2(μs0(W)μ10(W))2)𝟙{D=1}p(D=1)\displaystyle+\mathbb{P}((\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))^{2}-(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))^{2})\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (24)
=\displaystyle= 2(μ^s0(W)μ^10(W))×((μ0s0(Zs,W)μ^0s0(Zs,W))(π^1s0(W,Zs)π1s0(W,Zs))𝟙{D=0}p(D=0)+(μ0s0(Zs,W)μ^0s0(Zs,W))π1s0(W,Zs)𝟙{D=0}p(D=0)+(μ^0s0(Zs,W)μ^s0(W))𝟙{D=1}p(D=1)(μ0(W,Z)μ^0(W,Z))(π^110(W,Z)π110(W,Z))𝟙{D=0}p(D=0)(μ0(W,Z)μ^0(W,Z))π110(W,Z)𝟙{D=0}p(D=0)(μ^0(W,Z)μ^10(W))𝟙{D=1}p(D=1))\displaystyle\mathbb{P}\ 2(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))\begin{aligned} \times\Big{(}&(\mu_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W))\left(\hat{\pi}_{1s0}(W,Z_{s})-{\pi}_{1s0}(W,Z_{s})\right)\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &+(\mu_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W)){\pi}_{1s0}(W,Z_{s})\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &+(\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot s0}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\\ &-(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))\left(\hat{\pi}_{110}(W,Z)-{\pi}_{110}(W,Z)\right)\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &-(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)){\pi}_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &-(\hat{\mu}_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot 10}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\Big{)}\\ \end{aligned} (25)
+((μ^s0(W)μ^10(W))2(μs0(W)μ10(W))2)𝟙{D=1}p(D=1)\displaystyle+\mathbb{P}((\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))^{2}-(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))^{2})\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (26)

From convergence conditions in Condition B.1, this simplifies to

(22)=\displaystyle\eqref{eq:b3_cond_cov_num}= 2(μ^s0(W)μ^10(W))×((μ0s0(Zs,W)μ^0s0(Zs,W))π1s0(W,Zs)𝟙{D=0}p(D=0)+(μ^0s0(Zs,W)μ^s0(W))𝟙{D=1}p(D=1)(μ0(W,Z)μ^0(W,Z))π110(W,Z)𝟙{D=0}p(D=0)(μ^0(W,Z)μ^10(W))𝟙{D=1}p(D=1))\displaystyle\mathbb{P}\ 2(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))\begin{aligned} \times\Big{(}&(\mu_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W)){\pi}_{1s0}(W,Z_{s})\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &+(\hat{\mu}_{\cdot 0_{-s}0}(Z_{s},W)-\hat{\mu}_{\cdot s0}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\\ &-(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)){\pi}_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)}\\ &-(\hat{\mu}_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot 10}(W))\frac{\mathbbm{1}\{D=1\}}{p(D=1)}\Big{)}\\ \end{aligned} (27)
+((μ^s0(W)μ^10(W))2(μs0(W)μ10(W))2)𝟙{D=1}p(D=1)\displaystyle+\mathbb{P}((\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))^{2}-(\mu_{\cdot s0}(W)-\mu_{\cdot 10}(W))^{2})\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (28)
+op(n1/2),\displaystyle+o_{p}(n^{-1/2}), (29)

Given the true density ratios, we can further simplify the expectations over D=0D=0 weighted by the density ratios in the expression above to expectations over D=1D=1. By definition of μ0s0(Zs,W)\mu_{\cdot 0_{-s}0}(Z_{s},W) in (7) and μs0(W){\mu}_{\cdot s0}(W) in (9) and the definition of μ0(W,Z)\mu_{\cdot\cdot 0}(W,Z) and μ10(W)\mu_{\cdot 10}(W) in Section 4.1, (22) simplifies to

(22)=\displaystyle\eqref{eq:b3_cond_cov_num}= 1 2(μ^s0(W)μ^10(W))(μs0(W)μ^s0(W))\displaystyle\mathbb{P}_{1}\ 2(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\mu_{\cdot s0}(W)-\hat{\mu}_{\cdot s0}(W))
1 2(μ^s0(W)μ^10(W))(μ10(W)μ^10(W))\displaystyle-\mathbb{P}_{1}\ 2(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))(\mu_{\cdot 10}(W)-\hat{\mu}_{\cdot 10}(W))
+1(μ^s0(W)μ^10(W))2(μs0(W)μ10(W))2)\displaystyle+\mathbb{P}_{1}(\hat{\mu}_{\cdot s0}(W)-\hat{\mu}_{\cdot 10}(W))^{2}-({\mu}_{\cdot s0}(W)-{\mu}_{\cdot 10}(W))^{2})
+op(n1/2),\displaystyle+o_{p}(n^{-1/2}),

which is op(n1/2)o_{p}(n^{-1/2}) as long as the convergence conditions in Condition B.1 hold.

As the denominator vZdenv^{\texttt{den}}_{\texttt{Z}} is equal to the numerator vZnum()v^{\texttt{num}}_{\texttt{Z}}(\emptyset), it follows that the one-step estimator for the denominator v^Zden\hat{v}^{\texttt{den}}_{\texttt{Z}} is asymptotically linear with influence function ψZ,num\psi_{\texttt{Z},\emptyset}^{\texttt{num}}. ∎

Proof for Theorem B.2.

Combining Lemma C.2 and the Delta method (van der Vaart, 1998, Theorem 3.1), the estimator v^Z(s)=v^Znum(s)/v^Zden\hat{v}_{\texttt{Z}}(s)=\hat{v}_{\texttt{Z}}^{\texttt{num}}(s)/\hat{v}_{\texttt{Z}}^{\texttt{den}} is asymptotically linear

v^Znum(s)v^ZdenvZnum(s)vZden=nψZ,s(D,W,Z,Y;ηZ,snum,ηZden)+op(n1/2),\displaystyle\frac{\hat{v}_{\texttt{Z}}^{\texttt{num}}(s)}{\hat{v}_{\texttt{Z}}^{\texttt{den}}}-\frac{v_{\texttt{Z}}^{\texttt{num}}(s)}{v_{\texttt{Z}}^{\texttt{den}}}=\mathbbm{P}_{n}\psi_{\texttt{Z},s}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}},\eta_{\texttt{Z}}^{\texttt{den}})+o_{p}(n^{-1/2}),

with influence function

ψZ,s(D,W,Z,Y;ηZ,snum,ηZ,sden)=1vZdenψZ,snum(D,W,Z,Y;ηZ,snum)vZnum(s)(vZden)2ψZden(D,W,Z,Y;ηZden),\displaystyle\psi_{\texttt{Z},s}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}},\eta_{\texttt{Z},s}^{\texttt{den}})=\frac{1}{v_{\texttt{Z}}^{\texttt{den}}}\psi_{\texttt{Z},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}})-\frac{v_{\texttt{Z}}^{\texttt{num}}(s)}{(v_{\texttt{Z}}^{\texttt{den}})^{2}}\psi_{\texttt{Z}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Z}}^{\texttt{den}}), (30)

where ψZ,snum(D,W,Z,Y;ηZ,snum)\psi_{\texttt{Z},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},s}^{\texttt{num}}) is defined in (17) and ψZden(D,W,Z,Y;ηZden)=ψZ,num(D,W,Z,Y;ηZ,num)\psi_{\texttt{Z}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Z}}^{\texttt{den}})=\psi_{\texttt{Z},\emptyset}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Z},\emptyset}^{\texttt{num}}).

Accordingly, the estimator asymptotically follows the normal distribution,

\sqrt{n}\left(\hat{v}_{Z}(s)-v_{Z}(s)\right) \rightarrow_{d} N\left(0, \text{var}(\psi_{Z,s}(D,W,Z,Y;\eta_{Z,s}^{num},\eta_{Z,s}^{den}))\right).   (31)

C.3 Value of ss-partial conditional outcome shifts

Let the nuisance parameters in vY,binnumv_{\texttt{Y},\texttt{bin}}^{\texttt{num}} be denoted ηY,snum=(Qbin,μ1,μs,π){\eta}_{\texttt{Y},s}^{\texttt{num}}=(Q_{\texttt{bin}},\mu_{\cdot\cdot 1},\mu_{\cdot\cdot s},\pi) and its estimate as η^Y,snum\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}.

We represent the one-step corrected estimator for vY,binnum(s)v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s) as the V-statistic

v^Y,binnum(s)=\displaystyle\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)= 1,n~1,n(μ^1(W,Z)μ^s(W,Z))2\displaystyle\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)^{2} (32)
+21,n~1,n(μ^1(W,Z)μ^s(W,Z))(μ1(W,Z))\displaystyle+2\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)(\ell-\mu_{\cdot\cdot 1}(W,Z)) (33)
21,n~1,n(μ^1(W,Zs,Z~s)μ^s(W,Zs,Z~s))((W,Zs,Z~s,Y)μs(W,Zs,Z~s))π(Z~s,Zs,W,qbin(W,Z))\displaystyle-2\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right)(\ell(W,Z_{s},\tilde{Z}_{-s},Y)-\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s}))\pi(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z)) (34)
=\displaystyle= 1,n~1,nh(W,Z,Y,W~,Z~,Y~;η^Y,snum).\displaystyle\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}h\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right). (35)

In more detail, the conditions in Theorem 4.3 are as follows.

Condition C.3.

For variable subset s, suppose the following hold:

  • \pi(W,Z_{s},Z_{-s},Q_{bin}) is bounded

  • \hat{\pi} is consistent

  • \mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 0}-\mu_{\cdot\cdot 0}\right)^{2} = o_{p}(n^{-1/2})

  • \mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}-\mu_{\cdot\cdot 1}\right)^{2} = o_{p}(n^{-1/2})

  • \mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot s}-\mu_{\cdot\cdot s}\right)^{2} = o_{p}(n^{-1/2})

  • \mathbb{P}_{1}\left(\hat{q}_{bin}-q_{bin}\right)^{2} = o_{p}(n^{-1})

  • \mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot s}-\mu_{\cdot\cdot s}\right)\left(\hat{\pi}-\pi\right) = o_{p}(n^{-1/2}).

Lemma C.4.

Assuming Condition C.3 holds, v^Y,binnum\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}} is an asymptotically linear estimator for vY,binnumv_{\texttt{Y},\texttt{bin}}^{\texttt{num}}, i.e.

v^Y,binnum(s)vY,binnum(s)=1,nψY,snum(D,W,Z,Y;ηY,snum)+op(n1/2),\displaystyle\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)-v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)=\mathbb{P}_{1,n}\psi_{\texttt{Y},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}})+o_{p}(n^{-1/2}), (36)

with influence function

ψY,snum(d,w,z,y;ηY,snum)=\displaystyle\psi_{\texttt{Y},s}^{\texttt{num}}\left(d,w,z,y;\eta_{\texttt{Y},s}^{\texttt{num}}\right)= (μ1(w,z)μs(w,z))2\displaystyle(\mu_{\cdot\cdot 1}(w,z)-\mu_{\cdot\cdot s}(w,z))^{2}
+2(μ1(w,z)μs(w,z))[(w,z,y)μ1(w,z)]\displaystyle+2(\mu_{\cdot\cdot 1}(w,z)-\mu_{\cdot\cdot s}(w,z))\left[\ell(w,z,y)-\mu_{\cdot\cdot 1}(w,z)\right]
21(μ1(w,zs,Zs)μs(w,zs,Zs))[(w,zs,Zs,y))μs(w,zs,Zs)]π(Zs,zs,w,qbin(w,z))\displaystyle-2\mathbb{P}_{1}\left(\mu_{\cdot\cdot 1}(w,z_{s},Z_{-s})-\mu_{\cdot\cdot s}(w,z_{s},Z_{-s})\right)\left[\ell\left(w,z_{s},Z_{-s},y)\right)-\mu_{\cdot\cdot s}(w,z_{s},Z_{-s})\right]\pi\left({Z}_{-s},z_{s},w,q_{\texttt{bin}}(w,z)\right)
vY,binnum(s).\displaystyle-v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s). (37)
Proof.

Defining the symmetrized version of hh in (35) as hsym(W,Z,Y,W~,Z~,Y~)=h(W,Z,Y,W~,Z~,Y~)+h(W~,Z~,Y~,W,Z,Y)2h_{sym}(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y})=\frac{h(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y})+h(\tilde{W},\tilde{Z},\tilde{Y},W,Z,Y)}{2}, we rewrite the estimator as

v^Y,binnum(s)=1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum).\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)=\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right).

Per Theorem 12.3 in (van der Vaart, 1998), the Hájek projection of v^Y,binnum(s)\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s) is

u^Y,binnum(s)\displaystyle\hat{u}_{\texttt{Y},\texttt{bin}}^{num}(s) =i=1n1[1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)v^¯Ynum(s)Xi,Yi]\displaystyle=\sum_{i=1}^{n}\mathbb{P}_{1}\left[\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-\bar{\hat{v}}_{Y}^{num}(s)\mid X_{i},Y_{i}\right]
=i=1n1[hsym(Xi,Yi,X(2),Y(2);η^Y,snum)v^¯Ynum(s)Xi,Yi]\displaystyle=\sum_{i=1}^{n}\mathbb{P}_{1}\left[h_{sym}\left(X_{i},Y_{i},X^{(2)},Y^{(2)};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-\bar{\hat{v}}_{Y}^{num}(s)\mid X_{i},Y_{i}\right]
=i=1nhsym,1(Xi,Yi;η^Y,snum)\displaystyle=\sum_{i=1}^{n}{h}_{sym,1}\left(X_{i},Y_{i};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)

where v^¯Ynum(s)=1~1hsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)\bar{\hat{v}}_{Y}^{num}(s)=\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right).

Consider the decomposition

v^Y,binnum(s)vY,binnum(s)=\displaystyle\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)-v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)= 1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)1~1hsym(W,Z,Y,W~,Z~,Y~;ηY,snum)\displaystyle\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\eta_{\texttt{Y},s}^{\texttt{num}}\right)
=\displaystyle= 1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)1,n[hsym,1(X,Y;η^Y,snum)+v^¯Ynum(s)]\displaystyle\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}{h}_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-\mathbb{P}_{1,n}\left[{h}_{sym,1}\left(X,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)\right] (38)
+(1,n1)(hsym,1(X,Y;η^Y,snum)+v^¯Ynum(s)hsym,1(X,Y)vYnum(s))\displaystyle+\left(\mathbb{P}_{1,n}-\mathbb{P}_{1}\right)\left({h}_{sym,1}\left(X,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)-h_{sym,1}\left(X,Y\right)-v_{\texttt{Y}}^{\texttt{num}}(s)\right) (39)
+(1,n1)(hsym,1(X,Y)+vYnum(s))\displaystyle+\left(\mathbb{P}_{1,n}-\mathbb{P}_{1}\right)\left(h_{sym,1}\left(X,Y\right)+v_{\texttt{Y}}^{\texttt{num}}(s)\right) (40)
+1(hsym,1(X,Y;η^Y,snum)+v^¯Ynum(s)hsym,1(X,Y;ηY,snum)vYnum(s)).\displaystyle+\mathbb{P}_{1}\left({h}_{sym,1}\left(X,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)-h_{sym,1}\left(X,Y;\eta_{\texttt{Y},s}^{\texttt{num}}\right)-v_{\texttt{Y}}^{\texttt{num}}(s)\right). (41)

We analyze each term in turn.

Term (38): Suppose 1hsym2(W,Z,Y,W~,Z~,Y~;η^Y,snum)<\mathbb{P}_{1}{h}_{sym}^{2}(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}})<\infty. Via a straightforward extension of the proof in Theorem 12.3 in van der Vaart (1998), one can show that

var(1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum))var(1,nhsym,1(W,Z,Y;η^Y,snum))p1.\frac{var\left(\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}{h}_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)\right)}{var\left(\mathbb{P}_{1,n}{h}_{sym,1}\left(W,Z,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)\right)}\rightarrow_{p}1.

Then by Theorem 11.2 in van der Vaart (1998) and Slutsky’s lemma, we have

1,n~1,nhsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)1,n[hsym,1(W,Z,Y;η^Y,snum)+v^¯Ynum(s)]=op(n1/2).\mathbb{P}_{1,n}\tilde{\mathbb{P}}_{1,n}{h}_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-\mathbb{P}_{1,n}\left[{h}_{sym,1}\left(W,Z,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)\right]=o_{p}\left(n^{-1/2}\right).

Term (39): We perform sample splitting to estimate the nuisance parameters and calculate the estimator for v^Ynum(s)\hat{v}_{\texttt{Y}}^{\texttt{num}}(s). Then by Lemma 1 in Kennedy (2022), we have that

(1,n1)(hsym,1(W,Z,Y;η^Y,snum)+v^¯Ynum(s)hsym,1(W,Z,Y;ηY,snum)vYnum(s))=op(n1/2)\left(\mathbb{P}_{1,n}-\mathbb{P}_{1}\right)\left({h}_{sym,1}\left(W,Z,Y;\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)-h_{sym,1}\left(W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}}\right)-v_{\texttt{Y}}^{\texttt{num}}(s)\right)=o_{p}(n^{-1/2})

as long as the estimators for the nuisance parameters are consistent.

Term (40): This term (1,n1)(hsym,1(W,Z,Y;ηY,snum)+vYnum(s))=(1,n1)hsym,1(W,Z,Y;ηY,snum)\left(\mathbb{P}_{1,n}-\mathbb{P}_{1}\right)\left(h_{sym,1}\left(W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}}\right)+v_{\texttt{Y}}^{\texttt{num}}(s)\right)=\left(\mathbb{P}_{1,n}-\mathbb{P}_{1}\right)h_{sym,1}\left(W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}}\right) follows an asymptotic normal distribution per CLT.

Term (41): We will show that this bias term is asymptotically negligible. For notational simplicity, let ξ^(W,Zs,Zs)=μ^1(W,Z)μ^s(W,Z)\hat{\xi}(W,Z_{s},Z_{-s})=\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z).

1~1(hsym(W,Z,Y,W~,Z~,Y~;η^Y,snum)+v^¯Ynum(s)hsym(W,Z,Y,W~,Z~,Y~;ηY,snum)vYnum(s))\displaystyle\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\left({h}_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)+\bar{\hat{v}}_{\texttt{Y}}^{\texttt{num}}(s)-h_{sym}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\eta_{\texttt{Y},s}^{\texttt{num}}\right)-v_{\texttt{Y}}^{\texttt{num}}(s)\right)
=\displaystyle= 1~1(h(W,Z,Y,W~,Z~,Y~;η^Y,snum)h(W,Z,Y,W~,Z~,Y~;ηY,snum))\displaystyle\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\left({h}\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\hat{\eta}_{\texttt{Y},s}^{\texttt{num}}\right)-h\left(W,Z,Y,\tilde{W},\tilde{Z},\tilde{Y};\eta_{\texttt{Y},s}^{\texttt{num}}\right)\right)
=\displaystyle= 1(μ^1(W,Z)μ^s(W,Z))21(μ1(W,Z)μs(W,Z))2\displaystyle\mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)^{2}-\mathbb{P}_{1}\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot s}(W,Z)\right)^{2}
+21(μ^1(W,Z)μ^s(W,Z))[μ1(W,Z)μ^1(W,Z)]\displaystyle+2\mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)\left[\mu_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 1}(W,Z)\right]
21~1(μ^1(W,Zs,Z~s)μ^s(W,Zs,Z~s))[(W,Zs,Z~s,Y)μ^s(W,Zs,Z~s)]π^(Z~s,Zs,W,q^bin(W,Z))\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right)\left[\ell(W,Z_{s},\tilde{Z}_{-s},Y)-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,\hat{q}_{\texttt{bin}}(W,Z)\right)
=\displaystyle= 1(μ^1(W,Z)μ^s(W,Z))21(μ1(W,Z)μs(W,Z))2\displaystyle\mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)^{2}-\mathbb{P}_{1}\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot s}(W,Z)\right)^{2}
+21(μ^1(W,Z)μ^s(W,Z))[μ1(W,Z)μ^1(W,Z)]\displaystyle+2\mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)\left[\mu_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 1}(W,Z)\right]
21~1ξ^(W,Zs,Z~s)[μs(W,Zs,Z~s)μ^s(W,Zs,Z~s)]π^(Z~s,Zs,W,qbin(W,Z))\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,{q}_{\texttt{bin}}(W,Z)\right)
21~1ξ^(W,Zs,Z~s)[(W,Zs,Z~s,Y)μ^s(W,Zs,Z~s)][π^(Z~s,Zs,W,q^bin(W,Z))π^(Z~s,Zs,W,qbin(W,Z))]\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\ell(W,Z_{s},\tilde{Z}_{-s},Y)-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\left[\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,\hat{q}_{\texttt{bin}}(W,Z)\right)-\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,{q}_{\texttt{bin}}(W,Z)\right)\right]
=\displaystyle= 1(μs(W,Z)μ^s(W,Z))(μ^1(W,Z)μ^s(W,Z)+μ1(W,Z)μs(W,Z))\displaystyle\mathbb{P}_{1}\left(\mu_{\cdot\cdot s}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)\right)\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot s}(W,Z)+\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot s}(W,Z)\right) (42)
+1(μ^1(W,Z)μ1(W,Z))(μ1(W,Z)μs(W,Z)μ^1(W,Z)+μ^s(W,Z))\displaystyle+\mathbb{P}_{1}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 1}(W,Z)\right)\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot s}(W,Z)-\hat{\mu}_{\cdot\cdot 1}(W,Z)+\hat{\mu}_{\cdot\cdot s}(W,Z)\right) (43)
21~1ξ^(W,Zs,Z~s)[μs(W,Zs,Z~s)μ^s(W,Zs,Z~s)]π^(Z~s,Zs,W,qbin(W,Z))\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,{q}_{\texttt{bin}}(W,Z)\right) (44)
21~1ξ^(W,Zs,Z~s)[(W,Zs,Z~s,Y)μ^s(W,Zs,Z~s)][π^(Z~s,Zs,W,q^bin(W,Z))π^(Z~s,Zs,W,qbin(W,Z))].\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\ell(W,Z_{s},\tilde{Z}_{-s},Y)-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\left[\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,\hat{q}_{\texttt{bin}}(W,Z)\right)-\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,{q}_{\texttt{bin}}(W,Z)\right)\right]. (45)

Note that (45) is op(n1/2)o_{p}(n^{-1/2}) under the assumed convergence rates for q^bin\hat{q}_{\texttt{bin}}. In addition, (43) is op(n1/2)o_{p}(n^{-1/2}), under the assumed convergence rates for μ^1\hat{\mu}_{\cdot\cdot 1} and μ^s\hat{\mu}_{\cdot\cdot s}.

Analyzing the remaining summands (42) + (44), we note that it simplifies as follows:

1(μs(X)μ^s(X))(μ^1(X)μ^s(X)+μ1(X)μs(X))\displaystyle\mathbb{P}_{1}\left(\mu_{\cdot\cdot s}(X)-\hat{\mu}_{\cdot\cdot s}(X)\right)\left(\hat{\mu}_{\cdot\cdot 1}(X)-\hat{\mu}_{\cdot\cdot s}(X)+\mu_{\cdot\cdot 1}(X)-\mu_{\cdot\cdot s}(X)\right)
21~1ξ^(W,Zs,Z~s)[μs(W,Zs,Z~s)μ^s(W,Zs,Z~s)](π^(Z~s,Zs,W,qbin(W,Z)))π(Z~s,Zs,W,qbin(W,Z))))\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\left(\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z))\right)-\pi\left(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z))\right)\right)
21~1ξ^(W,Zs,Z~s)[μs(W,Zs,Z~s)μ^s(W,Zs,Z~s)]π(Z~s,Zs,W,qbin(W,Z)))\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\pi\left(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z))\right)
=\displaystyle= 1(μs(X)μ^s(X))(μ1(X)μ^1(X)μs(X)+μ^s(X))\displaystyle\mathbb{P}_{1}\left(\mu_{\cdot\cdot s}(X)-\hat{\mu}_{\cdot\cdot s}(X)\right)\left(\mu_{\cdot\cdot 1}(X)-\hat{\mu}_{\cdot\cdot 1}(X)-\mu_{\cdot\cdot s}(X)+\hat{\mu}_{\cdot\cdot s}(X)\right)
21~1ξ^(W,Zs,Z~s)[μs(W,Zs,Z~s)μ^s(W,Zs,Z~s)](π^(Z~s,Zs,W,qbin(W,Z)))π(Z~s,Zs,W,qbin(W,Z)))),\displaystyle-2\mathbb{P}_{1}\tilde{\mathbb{P}}_{1}\hat{\xi}(W,Z_{s},\tilde{Z}_{-s})\left[\mu_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})-\hat{\mu}_{\cdot\cdot s}(W,Z_{s},\tilde{Z}_{-s})\right]\left(\hat{\pi}\left(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z))\right)-\pi\left(\tilde{Z}_{-s},Z_{s},W,q_{\texttt{bin}}(W,Z))\right)\right),

which is op(n1/2)o_{p}(n^{-1/2}), under the assumed convergence rates for μ^s\hat{\mu}_{\cdot\cdot s}, μ^1\hat{\mu}_{\cdot\cdot 1}, and π^\hat{\pi}. ∎

Condition C.5 (Convergence conditions for \hat{v}_{Y}^{den}).

Suppose the following hold:

  • \mathbb{P}_{1}(\mu_{\cdot\cdot 1}-\mu_{\cdot\cdot 0}-(\hat{\mu}_{\cdot\cdot 1}-\hat{\mu}_{\cdot\cdot 0}))^{2} = o_{p}(n^{-1/2})

  • \mathbb{P}_{0}(\mu_{\cdot\cdot 0}-\hat{\mu}_{\cdot\cdot 0})(\pi_{110}-\hat{\pi}_{110}) = o_{p}(n^{-1/2})

  • \mathbb{P}_{0}(\mu_{\cdot\cdot 1}-\mu_{\cdot\cdot 0})^{2} is bounded

  • (Positivity) p(w,z|d=0) > 0 almost everywhere, such that the density ratio \pi_{110}(w,z) is well-defined and bounded.

Let the nuisance parameters in the one-step estimator vYdenv_{\texttt{Y}}^{\texttt{den}} be denoted by ηYden=(μ0,μ1,π110)\eta_{\texttt{Y}}^{\texttt{den}}=(\mu_{\cdot\cdot 0},\mu_{\cdot\cdot 1},\pi_{110}) and the set of estimated nuisances by η^Yden\hat{\eta}_{\texttt{Y}}^{\texttt{den}}.

Lemma C.6.

Assuming Condition C.5 holds, then v^Yden\hat{v}_{\texttt{Y}}^{\texttt{den}} is an asymptotically linear estimator for vYdenv_{\texttt{Y}}^{\texttt{den}}, i.e.

v^YdenvYden=nψYden(D,W,Z,Y;ηYden)+op(n1/2)\displaystyle\hat{v}_{\texttt{Y}}^{\texttt{den}}-v_{\texttt{Y}}^{\texttt{den}}=\mathbb{P}_{n}\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}})+o_{p}(n^{-1/2})

with influence function

ψYden(D,W,Z,Y;ηYden)=\displaystyle\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}})= (μ1(W,Z)μ0(W,Z))2𝟙{D=1}p(D=1)\displaystyle\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z)\right)^{2}\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (46)
+2(μ1(W,Z)μ0(W,Z))(μ1(W,Z))𝟙{D=1}p(D=1)\displaystyle\ +2\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z)\right)(\ell-\mu_{\cdot\cdot 1}(W,Z))\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (47)
2(μ1(W,Z)μ0(W,Z))(μ0(W,Z))π110(W,Z)𝟙{D=0}p(D=0)\displaystyle\ -2\left(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z)\right)(\ell-\mu_{\cdot\cdot 0}(W,Z))\pi_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)} (48)
vYden.\displaystyle\ -v_{\texttt{Y}}^{\texttt{den}}. (49)
Proof.

Consider the following decomposition of bias in the one-step corrected estimate

v^YdenvYden\displaystyle\hat{v}_{\texttt{Y}}^{\texttt{den}}-v_{\texttt{Y}}^{\texttt{den}}
=\displaystyle= (n)ψYden(D,W,Z,Y;ηYden)\displaystyle(\mathbb{P}_{n}-\mathbb{P})\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}}) (50)
+(n)(ψYden(D,W,Z,Y;η^Yden)ψYden(D,W,Z,Y;ηYden))\displaystyle+(\mathbb{P}_{n}-\mathbb{P})(\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\hat{\eta}_{\texttt{Y}}^{\texttt{den}})-\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}})) (51)
+(ψYden(D,W,Z,Y;η^Yden)ψYden(D,W,Z,Y;ηYden))\displaystyle+\mathbb{P}(\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\hat{\eta}_{\texttt{Y}}^{\texttt{den}})-\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}})) (52)

We observe that (50) converges to a normal distribution per the CLT, assuming that the variance of \psi_{Y}^{den} is finite. The empirical process term (51) is asymptotically negligible since the nuisance parameters \eta_{Y}^{den} are estimated on a training data split separate from the evaluation data on which the estimator is computed. In addition, assuming that the estimators for the nuisance parameters are consistent, Kennedy (2022, Lemma 1) states that

(n)(ψYden(D,W,Z,Y;η^Yden)ψYden(D,W,Z,Y;ηYden))=op(n1/2).(\mathbb{P}_{n}-\mathbb{P})(\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\hat{\eta}_{\texttt{Y}}^{\texttt{den}})-\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}}))=o_{p}(n^{-1/2}).

We now show that the remainder term (52) is also asymptotically negligible. Substituting the influence function and integrating with respect to YY, (52) becomes

(52)=\displaystyle\eqref{eq:b3_cond_outcome_denom}= (μ^1(W,Z)μ^0(W,Z)(μ1(W,Z)μ0(W,Z)))2𝟙(D=1)p(D=1)\displaystyle\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)-(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\right)^{2}\frac{\mathbbm{1}(D=1)}{p(D=1)} (53)
+2(μ^1(W,Z)μ^0(W,Z)(μ1(W,Z)μ0(W,Z)))(μ1(W,Z)μ0(W,Z))𝟙(D=1)p(D=1)\displaystyle+2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)-(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\right)(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\frac{\mathbbm{1}(D=1)}{p(D=1)} (54)
+2(μ^1(W,Z)μ^0(W,Z))(μ1(W,Z)μ^1(W,Z))𝟙{D=1}p(D=1)\displaystyle\ +2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)(\mu_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 1}(W,Z))\frac{\mathbbm{1}\{D=1\}}{p(D=1)} (55)
2(μ^1(W,Z)μ^0(W,Z))(μ0(W,Z)μ^0(W,Z))π^110(W,Z)𝟙{D=0}p(D=0)\displaystyle\ -2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))\hat{\pi}_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)} (56)
=\displaystyle= (μ^1(W,Z)μ^0(W,Z)(μ1(W,Z)μ0(W,Z)))2𝟙(D=1)p(D=1)\displaystyle\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)-(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\right)^{2}\frac{\mathbbm{1}(D=1)}{p(D=1)} (57)
+2(μ0(W,Z)μ^0(W,Z))(μ1(W,Z)μ0(W,Z))𝟙(D=1)p(D=1)\displaystyle+2\mathbb{P}(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\frac{\mathbbm{1}(D=1)}{p(D=1)} (58)
2(μ^1(W,Z)μ^0(W,Z))(μ0(W,Z)μ^0(W,Z))(π^110(W,Z)π110(W,Z))𝟙{D=0}p(D=1)\displaystyle\ -2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))(\hat{\pi}_{110}(W,Z)-\pi_{110}(W,Z))\frac{\mathbbm{1}\{D=0\}}{p(D=1)} (59)
2(μ^1(W,Z)μ^0(W,Z))(μ0(W,Z)μ^0(W,Z))π110(W,Z)𝟙{D=0}p(D=0)\displaystyle\ -2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))\pi_{110}(W,Z)\frac{\mathbbm{1}\{D=0\}}{p(D=0)} (60)
=\displaystyle= (μ^1(W,Z)μ^0(W,Z)(μ1(W,Z)μ0(W,Z)))2𝟙(D=1)p(D=1)\displaystyle\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)-(\mu_{\cdot\cdot 1}(W,Z)-\mu_{\cdot\cdot 0}(W,Z))\right)^{2}\frac{\mathbbm{1}(D=1)}{p(D=1)} (61)
2(μ^1(W,Z)μ^0(W,Z))(μ0(W,Z)μ^0(W,Z))(π^110(W,Z)π110(W,Z))𝟙{D=0}p(D=0)\displaystyle\ -2\mathbb{P}\left(\hat{\mu}_{\cdot\cdot 1}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z)\right)(\mu_{\cdot\cdot 0}(W,Z)-\hat{\mu}_{\cdot\cdot 0}(W,Z))(\hat{\pi}_{110}(W,Z)-\pi_{110}(W,Z))\frac{\mathbbm{1}\{D=0\}}{p(D=0)} (62)

Thus the remainder term is op(n1/2)o_{p}(n^{-1/2}) if Condition C.5 holds. ∎

Proof for Theorem 4.3.

Combining Lemmas C.4, C.6, and the Delta method (van der Vaart, 1998, Theorem 3.1), the estimator v^Y,bin(s)=v^Y,binnum(s)/v^Yden\hat{v}_{\texttt{Y},\texttt{bin}}(s)=\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)/\hat{v}_{\texttt{Y}}^{\texttt{den}} is asymptotically linear

v^Y,binnum(s)v^YdenvY,binnum(s)vYden=nψY,bin,s(D,W,Z,Y;ηY,snum,ηYden)+op(n1/2),\displaystyle\frac{\hat{v}_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)}{\hat{v}_{\texttt{Y}}^{\texttt{den}}}-\frac{v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)}{v_{\texttt{Y}}^{\texttt{den}}}=\mathbbm{P}_{n}\psi_{\texttt{Y},\texttt{bin},s}(D,W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}},\eta_{\texttt{Y}}^{\texttt{den}})+o_{p}(n^{-1/2}),

with influence function

ψY,bin,s(D,W,Z,Y;ηY,snum,ηYden)=1vYdenψY,snum(D,W,Z,Y;ηY,snum)vY,binnum(s)(vYden)2ψYden(D,W,Z,Y;ηYden),\displaystyle\psi_{\texttt{Y},\texttt{bin},s}(D,W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}},\eta_{\texttt{Y}}^{\texttt{den}})=\frac{1}{v_{\texttt{Y}}^{\texttt{den}}}\psi_{\texttt{Y},s}^{\texttt{num}}(D,W,Z,Y;\eta_{\texttt{Y},s}^{\texttt{num}})-\frac{v_{\texttt{Y},\texttt{bin}}^{\texttt{num}}(s)}{(v_{\texttt{Y}}^{\texttt{den}})^{2}}\psi_{\texttt{Y}}^{\texttt{den}}(D,W,Z,Y;\eta_{\texttt{Y}}^{\texttt{den}}), (63)

where ψY,snum\psi_{\texttt{Y},s}^{\texttt{num}} and ψYden\psi_{\texttt{Y}}^{\texttt{den}} are defined in (37) and (49).

Accordingly, the estimator follows a normal distribution asymptotically,

\sqrt{n}\left(\hat{v}_{Y}(s)-v_{Y}(s)\right) \rightarrow_{d} N\left(0, \text{var}(\psi_{Y,bin,s}(D,W,Z,Y;\eta_{Y,s}^{num},\eta_{Y}^{den}))\right).   (64)

Appendix D Implementation details

Here we describe how the nuisance parameters can be estimated in each of the decompositions. In general, density ratio models can be estimated via a standard reduction to a classification problem where a probabilistic classifier is trained to discriminate between source and target domains (Sugiyama et al., 2007).
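As an illustration of this reduction (not the authors' implementation), the sketch below fits a probabilistic classifier to distinguish the two domains and converts its predicted probabilities into a density ratio via Bayes' rule; logistic regression is used only as a placeholder learner, and the function and argument names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_density_ratio(x_source, x_target):
    """Estimate pi(x) = p_target(x)/p_source(x) by the classification reduction."""
    x = np.vstack([x_source, x_target])
    d = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    clf = LogisticRegression(max_iter=1000).fit(x, d)
    n0, n1 = len(x_source), len(x_target)

    def ratio(x_new):
        p1 = clf.predict_proba(x_new)[:, 1]
        # Bayes' rule: p_1(x)/p_0(x) = [P(D=1|x)/P(D=0|x)] * [n_0/n_1]
        return p1 / np.clip(1 - p1, 1e-12, None) * (n0 / n1)

    return ratio
```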

Note on computation time. Shapley value computation can be parallelized over the subsets. For high-dimensional tabular data, grouping together variables can further reduce computation time (and increase interpretability).

D.1 Aggregate decompositions

Density ratio models. Using direct importance estimation (Sugiyama et al., 2007), the density ratio models \pi_{100}(W) and \pi_{110}(W,Z) can be estimated by fitting classifiers on the combined source and target data to predict D=0 or 1 from the features W and (W,Z), respectively.
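A minimal sketch of this classifier-based reduction, assuming D=1 indexes the target domain and taking the ratio to be target-to-source, p_1/p_0 (if the convention in the decomposition is reversed, the odds are simply inverted); the random forest and the clipping constant are illustrative choices rather than the models used in the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_density_ratio(X_source, X_target):
    """Estimate pi(x) = p_target(x) / p_source(x) via a domain classifier."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = RandomForestClassifier(n_estimators=500).fit(X, d)
    prior_odds = len(X_source) / len(X_target)  # estimates p(D=0) / p(D=1)

    def ratio(X_new):
        p1 = np.clip(clf.predict_proba(X_new)[:, 1], 1e-6, 1 - 1e-6)
        # Bayes' rule: p1(x)/p0(x) = [P(D=1|x)/P(D=0|x)] * [P(D=0)/P(D=1)]
        return p1 / (1 - p1) * prior_odds

    return ratio
```

The same helper can be reused for \pi_{110}(W,Z) by passing the concatenated (W,Z) features from each domain.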

Outcome models. The outcome models \mu_{\cdot 00}(W) and \mu_{\cdot\cdot 0}(W,Z) can be fit in a number of ways. One option is to estimate the conditional distribution of the outcome (i.e. p_{0}(Y|W) or p_{0}(Y|W,Z)) using binary classifiers, from which one can obtain an estimate of the conditional expectation of the loss. Alternatively, one can estimate the conditional expectations of the loss directly by fitting regression models.
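A sketch of the second (direct regression) option, assuming the per-sample losses of the deployed model have already been computed on the source data; the random forest regressor and the 0-1 loss in the commented usage are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_outcome_model(X_cond, loss_values):
    """Regress the observed per-sample loss on the conditioning features.

    X_cond is W (for mu_{.00}) or the concatenation (W, Z) (for mu_{..0}).
    """
    return RandomForestRegressor(n_estimators=500).fit(X_cond, loss_values)


# Illustrative usage with a fitted classifier `model` and source data:
# loss_source = (model.predict(np.hstack([W_source, Z_source])) != y_source).astype(float)
# mu_00 = fit_outcome_model(W_source, loss_source)
# mu_dd0 = fit_outcome_model(np.hstack([W_source, Z_source]), loss_source)
```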

D.2 Detailed decomposition for ss-partial outcome shift

Density ratio models. The density ratio \pi(W,Z_{s},Z_{-s},Q)=p_{1}(Z_{-s}|W,Z_{s},q(W,Z)=Q)/p_{1}(Z_{-s}) in (4) can be estimated as follows. Create a second (“phantom”) dataset of the target domain in which Z_{-s} is independent of Z_{s} by permuting the original Z_{-s} in the target domain. Compute q_{\texttt{bin}} for all observations in the original dataset and the permuted dataset. Concatenate the original dataset from the target domain with the permuted dataset. Train a classifier to predict if an observation is from the original versus the permuted dataset.
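A minimal sketch of this permutation construction, where compute_q_bin is a hypothetical stand-in for evaluating the binned score q_{\texttt{bin}}(W,Z) and the random forest classifier is an illustrative choice; with equally sized original and phantom datasets, the classifier odds estimate the ratio above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_partial_density_ratio(W, Z_s, Z_negs, compute_q_bin, rng=None):
    """Estimate p1(Z_{-s} | W, Z_s, q) / p1(Z_{-s}) on the target domain."""
    rng = np.random.default_rng(rng)
    q_bin = compute_q_bin(W, Z_s, Z_negs)

    # Phantom dataset: permute Z_{-s} to break its dependence on the other variables.
    perm = rng.permutation(len(Z_negs))
    Z_negs_perm = Z_negs[perm]
    # q must be recomputed on the permuted rows since it depends on Z_{-s}.
    q_bin_perm = compute_q_bin(W, Z_s, Z_negs_perm)

    X_orig = np.column_stack([W, Z_s, Z_negs, q_bin])
    X_perm = np.column_stack([W, Z_s, Z_negs_perm, q_bin_perm])
    X = np.vstack([X_orig, X_perm])
    label = np.concatenate([np.ones(len(X_orig)), np.zeros(len(X_perm))])

    clf = RandomForestClassifier(n_estimators=500).fit(X, label)

    def ratio(X_new):
        p = np.clip(clf.predict_proba(X_new)[:, 1], 1e-6, 1 - 1e-6)
        # Equal-sized datasets, so the classifier odds equal the density ratio.
        return p / (1 - p)

    return ratio
```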

Outcome models. The outcome models \mu_{\cdot\cdot 1} and \mu_{\cdot\cdot s}(W,Z) can be fit similarly by estimating the conditional distributions p_{0}(Y|W,Z) and p_{s}(Y|W,Z,q_{\texttt{bin}}(W,Z)) on the target domain, and then taking the expectation of the loss.

Computing U-statistics. Calculating the double average \mathbb{P}_{1,n}^{\texttt{Ev}}\tilde{\mathbb{P}}_{1,n}^{\texttt{Ev}} in the estimator requires evaluating all n^{2} pairs of data points in the target domain. This can be computationally expensive, so a good approximation is to subsample the inner average; we use 2000 subsamples. We did not observe large changes in the bias of the estimates compared to calculating the exact U-statistics.
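A sketch of this approximation, where kernel is a hypothetical stand-in for the term inside the double average over the target evaluation data.

```python
import numpy as np


def double_average(kernel, target_eval_data, n_inner=2000, rng=None):
    """Approximate (1/n^2) * sum_{i,j} kernel(o_i, o_j) over the target Ev data.

    The inner average is subsampled with at most n_inner draws per outer point.
    """
    rng = np.random.default_rng(rng)
    n = len(target_eval_data)
    total = 0.0
    for i in range(n):
        inner_idx = rng.choice(n, size=min(n_inner, n), replace=False)
        total += np.mean(
            [kernel(target_eval_data[i], target_eval_data[j]) for j in inner_idx]
        )
    return total / n
```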

D.3 Detailed decomposition for ss-partial conditional outcome shift

Density ratio models. The ratio \pi_{1s0}(W,Z_{s})=p_{1}(W,Z_{s})/p_{0}(W,Z_{s}) can be similarly fit using direct importance estimation.

Outcome models. We require the following models.

  • \mu_{\cdot 0_{-s}0}(z_{s},w)=\mathbb{E}_{\cdot 00}[\ell|z_{s},w], defined in (7), can be estimated by regressing the loss against (w,z_{s}) on the source domain.

  • \mu_{\cdot 10}(w)=\mathbb{E}_{\cdot 10}[\ell|w], defined in (8), can be estimated by regressing \mu_{\cdot\cdot 0}(w,z) against w in the target domain.

  • \mu_{\cdot s0}(w)=\mathbb{E}_{\cdot s0}[\ell|w], defined in (9), can be estimated by regressing \mu_{\cdot 0_{-s}0}(z_{s},w) against w in the target domain (a sketch of these nested regressions is given below).
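A minimal sketch of the nested regressions above, assuming s is a list of column indices into Z, loss_src holds the per-sample losses on the source data, and random forests stand in for whichever regressors cross-validation selects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_nested_outcome_models(W_src, Z_src, loss_src, W_tgt, Z_tgt, s):
    """Sketch of the nested regressions for the s-partial conditional outcome shift."""
    Zs_src, Zs_tgt = Z_src[:, s], Z_tgt[:, s]

    # mu_{.0_{-s}0}(z_s, w): regress the observed loss on (w, z_s) in the source domain.
    mu_0s0 = RandomForestRegressor().fit(np.column_stack([W_src, Zs_src]), loss_src)

    # mu_{..0}(w, z): regress the observed loss on (w, z) in the source domain.
    mu_dd0 = RandomForestRegressor().fit(np.column_stack([W_src, Z_src]), loss_src)

    # mu_{.10}(w): regress mu_{..0}(w, z) on w in the target domain.
    pseudo_10 = mu_dd0.predict(np.column_stack([W_tgt, Z_tgt]))
    mu_10 = RandomForestRegressor().fit(W_tgt, pseudo_10)

    # mu_{.s0}(w): regress mu_{.0_{-s}0}(z_s, w) on w in the target domain.
    pseudo_s0 = mu_0s0.predict(np.column_stack([W_tgt, Zs_tgt]))
    mu_s0 = RandomForestRegressor().fit(W_tgt, pseudo_s0)

    return mu_0s0, mu_10, mu_s0
```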

For all models, we use cross-validation to select among model types and hyperparameters. Model selection is important so that the convergence rate conditions for the asymptotic normality results are met.

Appendix E Simulation details

Data generation: We generate synthetic data under two settings. For the coverage checks in Section 5.1, all features are sampled independently from a multivariate normal distribution. The means of (W,Z) in the source and target domains are (0,2,0.7,3) and (0,0,0,0), respectively. The outcomes in the source and target domains are simulated from logistic regression models with coefficients (0.3,1,0.5,1) and (0.3,0.1,0.5,1.4), respectively.

In the second setting, used for the baseline comparisons in Figure 4b, each feature in W\in\mathbb{R} and Z\in\mathbb{R}^{5} is sampled independently of the others from a uniform distribution over [-1,1). The binary outcome Y is sampled from a logistic regression model with coefficients (0.2,0.4,2,0.25,0.1,0.1) in the source domain and (0.2,-0.4,0.8,0.1,0.1,0.1) in the target domain.
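A sketch of the first (coverage-check) setting, assuming unit variances and a zero intercept in the logistic model since these are not pinned down above.

```python
import numpy as np


def simulate_coverage_setting(n, domain, rng=None):
    """Simulate one domain of the coverage-check setting from Section 5.1.

    Features (W, Z) are independent normals whose means differ by domain, and
    the binary outcome follows a logistic model with domain-specific coefficients.
    Unit variances and a zero intercept are assumed here for illustration.
    """
    rng = np.random.default_rng(rng)
    mean = np.array([0, 2, 0.7, 3]) if domain == "source" else np.zeros(4)
    coef = (
        np.array([0.3, 1, 0.5, 1])
        if domain == "source"
        else np.array([0.3, 0.1, 0.5, 1.4])
    )
    X = rng.normal(loc=mean, scale=1.0, size=(n, 4))
    p = 1 / (1 + np.exp(-X @ coef))
    Y = rng.binomial(1, p)
    return X, Y
```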

Sample-splitting: We fit all models on 80% of the data points from both the source and target datasets (the Tr partition) and reserve the remaining 20% for computing the estimators (the Ev partition).

Model types: We use scikit-learn implementations for all models (Pedregosa et al., 2011) and 3-fold cross-validation to select among them. For the density ratio models, we fit random forest classifiers and logistic regression models with polynomial features of degree 3. Depending on whether the target of an outcome model is binary or real-valued, we fit random forest classifiers or regressors and logistic regression or ridge-penalized linear regression models. Specific hyperparameter ranges for the grid search will be provided in the code, to be made publicly available.
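A sketch of the cross-validated selection for the density ratio (domain) classifiers; the hyperparameter grids below are illustrative placeholders rather than the grids used in the experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def select_density_model(X, d):
    """3-fold CV over candidate model classes; returns the best fitted estimator."""
    candidates = {
        "rf": GridSearchCV(
            RandomForestClassifier(),
            {"max_depth": [3, 5, None], "n_estimators": [200, 500]},
            cv=3, scoring="neg_log_loss",
        ),
        "logreg_poly3": GridSearchCV(
            Pipeline([
                ("poly", PolynomialFeatures(degree=3)),
                ("clf", LogisticRegression(max_iter=5000)),
            ]),
            {"clf__C": [0.01, 0.1, 1, 10]},
            cv=3, scoring="neg_log_loss",
        ),
    }
    fitted = {name: search.fit(X, d) for name, search in candidates.items()}
    best_name = max(fitted, key=lambda name: fitted[name].best_score_)
    return fitted[best_name].best_estimator_
```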

Computing time and resources: Computation of the VI estimates can be quite fast, as the Shapley value computation can be parallelized over the subsets and the number of unique variable subsets sampled in the Shapley value approximation is often quite small. For instance, for the ACS Public Coverage case study with 34 features, the number of unique subsets is 131 even when the number of sampled subsets is 3000, and it takes around 160 seconds to estimate the value of a single variable subset. All experiments are run on a 2.60 GHz processor with 8 CPU cores.

Figure 6: Sample estimates and CIs for the simulation from Section 5.1. Proposed is the debiased ML estimator for the HDPD.

Appendix F Data analysis details

Synthetic. We report the accuracy of the ML algorithm after it is retrained with the top k features and the predictions from the original model.

Hospital readmission. Using data from the electronic health records of a large safety-net hospital in the US, we analyzed the transferability of performance measures of a Gradient Boosted Tree (GBT) trained to predict 30-day readmission risk for the general patient population (source) but applied to patients diagnosed with heart failure (target). Each of the source and target datasets has 3750 observations for analyzing the performance gap. The GBT is trained on a held-out sample of 18,873 points from the general population. Features include 4 demographic variables (W) and 16 diagnosis codes (Z). During training, we reweight samples by class weights to address class imbalance.

ACS Public Coverage. We extract data from the American Community Survey (ACS) to predict whether a person has public health insurance. The data only contains persons under 65 years of age with an income below $30,000. We analyze a neural network (MLP) trained on data from Nebraska (source) and applied to data from Louisiana (target), using 3000 and 6000 observations from the source and target domains, respectively, to analyze the performance gap; another 3300 observations from the source domain are used for training the model. Figure 7 shows the detailed decomposition of the conditional outcome shift for this dataset.

Figure 7: Detailed decompositions for the performance gap of a model predicting insurance coverage across two US states (NE \rightarrow LA). The plot shows values for the full set of 31 covariates.
k | Diff AUC-k | Lower CI | Upper CI
1 | 0.000 | 0.000 | 0.000
2 | 0.006 | 0.001 | 0.010
3 | -0.002 | -0.007 | 0.002
4 | 0.004 | -0.002 | 0.008
5 | -0.001 | -0.006 | 0.003
6 | -0.002 | -0.008 | 0.003
7 | 0.007 | 0.002 | 0.011
8 | 0.006 | 0.001 | 0.010
Table 3: Difference in AUCs (Diff AUC-k = Proposed - RandomForestAcc) between the insurance prediction models revised using the top k variables identified by the proposed procedure versus the RandomForestAcc procedure. 95% CIs are shown.