Orthogonal Causal Calibration

Justin Whitehouse Stanford University Christopher Jung Meta Research Vasilis Syrgkanis Stanford University Bryan Wilder Carnegie Mellon University Zhiwei Steven Wu Carnegie Mellon University
Abstract

Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimates are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters.

In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called universal orthogonality, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model: one term that measures the error in estimating unknown nuisance parameters and another that measures calibration error in a hypothetical world where the learned nuisances are true. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.

1 Introduction

Estimates of heterogeneous causal effects such as conditional average treatment effects (CATEs), conditional average causal derivatives (CACDs), and conditional quantile treatment effects (CQTEs) play a pervasive role in understanding various statistical, scientific, and economic problems. Due to the partially-observed nature of causal data, estimating causal quantities is naturally more difficult than estimating non-causal ones. Despite this, a vast literature has developed focused on estimating causal effects using both classical statistical methods [34, 36, 46, 1, 19] and modern machine learning (ML) approaches [18, 7, 6, 53, 14, 30]. In this work, we focus on a complementary and relatively understudied direction: developing algorithms for calibrating estimates of heterogeneous causal effects.

Calibration is a notion of model consistency that has been extensively studied both in theory and practice by the general ML community [23, 43, 39, 42, 56, 31, 50, 29, 13, 12, 17]. Classically, a model $\widehat{\theta}(X)$ predicting $Y$ is said to be calibrated if, when it makes a given prediction, the observed outcome on average equals that prediction, i.e. if

\[ \mathbb{E}[Y\mid\widehat{\theta}(X)=v]=v\qquad\text{for any }v\in\mathrm{range}(\widehat{\theta}). \]

There are deep connections between calibrated predictions and optimal downstream decision-making [48, 47]. Namely, for many utility functions, when a decision maker is equipped with calibrated predictions, the optimal mapping from prediction to decision simply treats those predictions as if they were the ground truth. Given that models predicting heterogeneous causal effects or counterfactual quantities are leveraged in decision-making in domains such as medicine [15, 37], advertising and e-commerce [32, 20], and public policy [61], it is essential that these estimates are calibrated. In non-causal settings, there exists a plethora of simple algorithms for performing calibration, such as histogram binning [25], isotonic calibration [2, 58], linear calibration, and Platt scaling [51, 45]. These algorithms are all univariate regression procedures, regressing an observed outcome $Y$ onto a model prediction $\widehat{\theta}(X)$ over an appropriately-defined function class. However, given the partially-observed nature of causal problems, it is non-obvious how to extend these algorithms for calibrating estimates of general causal effects.
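To make the "univariate regression" view concrete, the following is a minimal sketch of histogram binning in plain Python. The function name `fit_histogram_binning` and the equal-mass binning rule are illustrative assumptions, not taken from the cited works: predictions are sorted, split into roughly equal-sized bins, and each bin is mapped to the mean observed outcome inside it.

```python
from bisect import bisect_left
from statistics import mean


def fit_histogram_binning(preds, outcomes, n_bins=10):
    """Return a function mapping a raw prediction to its bin's mean outcome.

    Hypothetical helper for illustration: bins are equal-mass, formed from
    the sorted predictions of the calibration sample.
    """
    n = len(preds)
    order = sorted(range(n), key=lambda i: preds[i])
    # Split the sorted sample into at most n_bins contiguous, non-empty chunks.
    chunks = [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]
    chunks = [c for c in chunks if c]
    # Largest prediction in every bin except the last; used to route new inputs.
    edges = [preds[c[-1]] for c in chunks[:-1]]
    bin_means = [mean(outcomes[i] for i in c) for c in chunks]

    def calibrated(p):
        # bisect_left sends a value equal to an edge to the bin that owns it.
        return bin_means[bisect_left(edges, p)]

    return calibrated
```

The returned function is piecewise constant in the original prediction, which is exactly a regression of $Y$ onto $\widehat{\theta}(X)$ over the class of step functions on the chosen bins.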

To illustrate the importance and difficulty of calibrating causal models, we look at the example of a doctor using model predictions to aid in prescribing medicine to prevent heart attacks. The doctor may have access to observations of the form $Z=(X,A,Y)$, where $X\in\mathcal{X}$ represents covariates, $A\in\{0,1\}$ is a binary treatment, and $Y=AY(1)+(1-A)Y(0)\in\mathbb{R}$ represents an observed outcome. For instance, a doctor may use a patient’s age, height, and blood pressure (covariates $X$) to decide whether or not to give medicine (treatment $A$) to a patient in order to prevent a heart attack (outcome $Y$). To aid in their decision making, the doctor may want to employ a model $\theta(X)$ predicting the conditional average treatment effect (CATE), which is defined as

\[ \theta_{\mathrm{CATE}}(x):=\mathbb{E}[Y(1)-Y(0)\mid X=x]. \]

In this example, the CATE measures the difference in probability of a heart attack under treatment and control. Naturally, a doctor may decide to prescribe medication to a patient if $\theta(X)<0$, i.e. if the model predicts the probability of a heart attack will go down under treatment. If the model is miscalibrated, for instance if we have

\[ \mathbb{E}[Y(1)-Y(0)\mid\theta(X)=-0.15]>0 \]

the doctor may actually harm patients by prescribing medication! If $\theta(X)$ were calibrated, such a risk would not occur. However, the doctor cannot run one of the aforementioned calibration algorithms because they only ever observe the outcome under treatment $Y(1)$ or control $Y(0)$, never the target individual treatment effect $Y(1)-Y(0)$.

To work around this, one typically needs to estimate nuisance parameters: functions of the underlying data-generating distribution that are used to “de-bias” partial observations (either $Y(1)$ or $Y(0)$ in the previous example) into pseudo-outcomes, i.e. quantities that, on average, look like the target treatment effect (here $Y(1)-Y(0)$). For the CATE, these nuisance functions are the propensity score $\pi_{0}(x):=\mathbb{P}(A=1\mid X=x)$ and the expected outcome mapping $\mu_{0}(a,x):=\mathbb{E}[Y\mid X=x,A=a]$, and pseudo-outcomes are of the form [35]

\[ \chi_{\mathrm{CATE}}((\mu,\pi);W):=\mu(1,X)-\mu(0,X)+\left(\frac{A}{\pi(X)}-\frac{1-A}{1-\pi(X)}\right)(Y-\mu(A,X)). \]

Naturally, the nuisances and pseudo-outcomes will be different for other heterogeneous causal effects and can typically be derived by the statistician.
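As an illustration of the CATE pseudo-outcome formula above, the sketch below evaluates it for a single observation $(x,a,y)$. The function name `cate_pseudo_outcome` and the convention of passing the nuisance estimates as callables `mu(a, x)` and `pi(x)` are assumptions made for this example only.

```python
def cate_pseudo_outcome(x, a, y, mu, pi):
    """Doubly-robust pseudo-outcome chi_CATE((mu, pi); (x, a, y)).

    mu(a, x) is an estimate of E[Y | X=x, A=a]; pi(x) is an estimate of
    P(A=1 | X=x). Both are supplied by the caller (hypothetical interface).
    """
    p = pi(x)
    if not 0.0 < p < 1.0:
        # The inverse-propensity weights are undefined at 0 and 1.
        raise ValueError("propensity estimate must lie strictly in (0, 1)")
    # Inverse-propensity weight: +1/p for treated units, -1/(1-p) for controls.
    weight = a / p - (1 - a) / (1 - p)
    # Plug-in effect estimate plus a weighted residual correction.
    return mu(1, x) - mu(0, x) + weight * (y - mu(a, x))
```

When the outcome model is exact, the residual `y - mu(a, x)` averages to zero and the correction term vanishes on average; when it is not, the weighted residual corrects the plug-in difference `mu(1, x) - mu(0, x)`.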

What is known about causal calibration? In the setting of the above CATE example, if we assume conditional ignorability (the condition that the potential outcomes $Y(0)$ and $Y(1)$ are conditionally independent of the treatment $A$ given covariates $X$, i.e. $Y(0),Y(1)\perp A\mid X$), van der Laan et al. [58] show that by estimating the propensity score and expected outcome mappings, one can perform isotonic regression of the doubly-robust pseudo-outcomes $\chi_{\mathrm{CATE}}$ onto model predictions $\theta(X)$ to ensure asymptotic calibration. However, this result is limited in application, and does not extend to other heterogeneous effects of interest like conditional average causal derivatives, conditional local average treatment effects, or even CATEs in the presence of unobserved confounding (in which case one may need instrumental variables). Furthermore, the results of van der Laan et al. [58] do not allow a learner to calibrate using other algorithms like Platt scaling, histogram binning, or even simple linear regression. While one could derive individual results for calibrating an estimate of “parameter A” under “algorithm B”, this would likely lead to repeated reinvention of the wheel. The question considered in this paper is thus as follows: can one construct a framework that allows a statistician to calibrate estimates of general heterogeneous causal effects using arbitrary, off-the-shelf calibration algorithms?

1.1 Our Contributions

In this paper, we reduce the problem of causal calibration, or calibrating estimates of heterogeneous causal effects, to the well-studied problem of calibrating non-causal predictive models. We assume the scientist is interested in calibrating an estimate $\theta(X)$ of some heterogeneous causal effect $\theta_{0}(X)$ that can be specified as the conditional minimizer of a loss function, i.e. $\theta_{0}(x)\in\arg\min_{\nu}\mathbb{E}[\ell(\nu,g_{0};W)\mid X=x]$. Here, $\ell(\nu,g;w)$ involves a nuisance component, and we assume that there exists a true, unknown nuisance parameter $g_{0}$. We say $\theta$ is perfectly calibrated with respect to $\ell$ if $\mathbb{E}[\partial\ell(\theta(X),g_{0};Z)\mid\theta(X)]=0$ (here, $\partial\ell(\theta(x),g;z)$ denotes the partial derivative of $\ell$ with respect to its first argument, i.e. the quantity $\frac{\partial}{\partial\nu}\ell(\nu,g;z)|_{\nu=\theta(x)}$), and that $\theta$ is approximately calibrated if the $L^{2}$ error

\[ \mathrm{Cal}(\theta,g)=\mathbb{E}\left(\mathbb{E}[\partial\ell(\theta(X),g;Z)\mid\theta(X)]^{2}\right)^{1/2} \]

is small when $g$ is set as the true nuisance $g_{0}$. In words, $\theta$ is calibrated if the predictions of $\theta$ are “unimprovable” with respect to current level sets, a condition similar to the one outlined by Noarov and Roth [47] and the concept of swap regret [4, 17].

As a concrete example, $\theta_{\mathrm{CATE}}(X)$ is the conditional minimizer of the doubly-robust loss $\ell_{\mathrm{CATE}}(\nu,(\mu,\pi);z):=\frac{1}{2}\big(\nu-\chi_{\mathrm{CATE}}((\mu,\pi);z)\big)^{2}$, where $\pi_{0}$, $\mu_{0}$, and $\chi_{\mathrm{CATE}}$ are as defined above. Perfect calibration for a CATE estimate $\theta$ becomes $\mathbb{E}[Y(1)-Y(0)\mid\theta(X)]=\theta(X)$, and approximate $L^{2}$ calibration becomes $\mathrm{Cal}(\theta,(\mu_{0},\pi_{0})):=\mathbb{E}\left(\left(\theta(X)-\mathbb{E}[Y(1)-Y(0)\mid\theta(X)]\right)^{2}\right)^{1/2}\approx 0$.

Our reduction and algorithms are based around a robustness condition on the underlying loss $\ell$ called Neyman orthogonality. We say a loss $\ell$ is Neyman orthogonal if

\[ D_{g}\mathbb{E}[\partial\ell(\theta_{0}(X),g_{0};Z)\mid X](g-g_{0})=0, \tag{1} \]

i.e. if $\ell$ is insensitive to small estimation errors in the nuisance parameters $g_{0}$ and loss minimizing parameter $\theta_{0}$ (here, $D_{g}$ denotes a Gateaux derivative). In our work, we consider two mild variants of Neyman orthogonality. The first is universal orthogonality [18], a stronger notion of orthogonality in which $\theta_{0}(x)$ in Equation (1) can be replaced by any estimate $\theta(x)$. Examples of causal effects that minimize such a loss are CATEs, conditional average causal derivatives (CACDs), and conditional local average treatment effects (CLATEs). Another condition we consider is called conditional orthogonality, in which instead of conditioning on covariates $X$ in Equation (1), one conditions on a post-processing $\varphi(X)$ of covariates instead. Conditional orthogonality is a natural condition for calibration tasks, as we ultimately care about assessing the quality of an estimator $\theta(X)$ conditional on its own predictions. An example causal parameter that can be specified as the minimizer of a conditionally orthogonal loss is the conditional quantile under treatment (CQUT). Our specific contributions, broken down by assumption on the loss, are as follows:

  1. In Section 3, we study calibration with respect to universally orthogonal loss functions. We present a sample splitting algorithm (Algorithm 1) that uses black-box ML algorithms to estimate unknown nuisance functions on the first half of the data, transforms the second half into de-biased “pseudo-outcomes”, and then applies any off-the-shelf calibration algorithm on the second half of the data treating pseudo-outcomes as true labels. We provide high-probability, finite-sample guarantees for our algorithm that only depend on convergence rates for the black-box nuisance estimation and calibration algorithms. We additionally provide a cross-calibration algorithm (Algorithm 2) that makes more efficient use of the data and show that our calibration algorithms do not significantly increase the risk of a model (Appendix C).

  2. In Section 4, we study the calibration of models predicting effects that minimize a conditionally orthogonal loss function. In this more general setting, we describe a similar sample splitting algorithm that uses half of the data to learn nuisances and then runs an off-the-shelf algorithm for “generalized” calibration on the second half of the data. We provide examples of such calibration algorithms in Appendix D, which are motivated by extending algorithms like isotonic calibration and linear calibration to general losses. We also prove finite-sample guarantees for this algorithm and construct a similar cross-calibration algorithm (Algorithm 4).

  3. Lastly, in Section 5, we empirically evaluate the performance of the aforementioned algorithms on a mix of observational and semi-synthetic data. In one experiment, we use observational data alongside Algorithm 2 to show how one can calibrate models predicting the CATE of 401(k) eligibility and CLATE of 401(k) participation on an individual’s net total financial assets. Likewise, in another experiment on synthetically generated data, we show how Algorithm 4 can be used to decrease both $L^{2}$ calibration error and average loss for models predicting CQUTs.
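The three-step recipe described in the first contribution (fit nuisances on one fold, build pseudo-outcomes on the other, then run an ordinary calibrator) can be sketched as follows for the CATE. This is a rough illustration under strong simplifying assumptions, not a reproduction of Algorithm 1: covariates are assumed discrete so that cell means can stand in for black-box nuisance learners, linear calibration stands in for "any off-the-shelf calibration algorithm", and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_nuisances(fold):
    """Cell-mean estimates of mu(a, x) and pi(x) for a discrete covariate x."""
    outcomes, treatments = defaultdict(list), defaultdict(list)
    for x, a, y in fold:
        outcomes[(a, x)].append(y)
        treatments[x].append(a)

    def mu(a, x):
        cell = outcomes.get((a, x))
        return mean(cell) if cell else 0.0

    def pi(x):
        cell = treatments.get(x)
        # Clip so the inverse-propensity weights stay bounded.
        return min(0.95, max(0.05, mean(cell))) if cell else 0.5

    return mu, pi


def pseudo_outcome(x, a, y, mu, pi):
    """Doubly-robust CATE pseudo-outcome at the supplied nuisance estimates."""
    p = pi(x)
    return mu(1, x) - mu(0, x) + (a / p - (1 - a) / (1 - p)) * (y - mu(a, x))


def fit_linear_calibrator(preds, targets):
    """Least-squares regression of targets on preds (linear calibration)."""
    p_bar, t_bar = mean(preds), mean(targets)
    var = sum((p - p_bar) ** 2 for p in preds)
    cov = sum((p - p_bar) * (t - t_bar) for p, t in zip(preds, targets))
    slope = cov / var if var else 0.0
    return lambda v: t_bar + slope * (v - p_bar)


def causal_calibrate(data, theta):
    """Sample-splitting calibration of theta; data is a list of (x, a, y)."""
    half = len(data) // 2
    mu, pi = fit_nuisances(data[:half])                  # step 1: nuisances
    fold2 = data[half:]
    chi = [pseudo_outcome(x, a, y, mu, pi) for x, a, y in fold2]  # step 2
    calibrator = fit_linear_calibrator([theta(x) for x, _, _ in fold2], chi)
    return lambda x: calibrator(theta(x))                # step 3: post-process
```

The calibrator never sees the raw outcome, only the pseudo-outcome, so swapping in histogram binning or isotonic regression would only change `fit_linear_calibrator`.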

The design of our algorithms is inspired by a generic upper bound on the calibration error of any estimator $\theta(X)$. We show that the $L^{2}$ calibration error of any estimator can be bounded above by two decoupled terms: one involving nuisance estimation error and another representing calibration error under the orthogonalized loss evaluated at the learned nuisances.

Informal Theorem 1.

Suppose $\theta:\mathcal{X}\rightarrow\mathbb{R}$ is some estimator, $\ell$ is some base loss, and $\widetilde{\ell}$ is the corresponding orthogonalized loss. Let $g_{0}$ denote the true, unknown nuisance functions, and $g$ arbitrary nuisance estimates. We have

\[ \underbrace{\mathrm{Cal}(\theta,g_{0})}_{\text{$L^{2}$ calibration error under $\ell(\theta(x),g_{0};w)$}}\lesssim\underbrace{\mathrm{err}(g,g_{0})}_{\text{nuisance estimation error}}+\underbrace{\mathrm{Cal}(\theta,g)}_{\text{$L^{2}$ calibration error under $\ell(\theta(x),g;w)$}} \]

We view Informal Theorem 1 as a “change of measure” or “change of nuisance” result, allowing the learner to essentially pretend the learned nuisances represent reality while only paying a small error for misestimation. In many settings where there are two nuisance functions (e.g. $g=(\eta,\zeta)$ and $g_{0}=(\eta_{0},\zeta_{0})$), we will have $\mathrm{err}(g,g_{0})=O(\|(\eta-\eta_{0})\cdot(\zeta-\zeta_{0})\|)$, and thus the error will be small if we estimate at least one nuisance function sufficiently well. This in particular is the case for the aforementioned CATE, where $g_{0}=(\mu_{0},\pi_{0})$, as noted above. More broadly, we will have $\mathrm{err}(g,g_{0})=O(\|g-g_{0}\|^{2})$. While simple in appearance, the above bound depends on deep connections between Neyman orthogonality and calibration error. We prove a bound of the above form in each of the main sections of the paper. This decoupled bound naturally suggests using some fraction of the data to learn nuisance parameters, thus minimizing the first term, and then using fresh data with the learned nuisances to perform calibration, thus minimizing the second term.
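To see where the product structure comes from in the CATE case, one can compute the conditional bias of the pseudo-outcome directly. The following is a standard calculation, sketched here under conditional ignorability and assuming $\pi(X)\in(0,1)$; it is meant as an illustration and is not the paper's formal definition of $\mathrm{err}(g,g_{0})$. It uses only $\mathbb{E}[A(Y-\mu(1,X))\mid X]=\pi_{0}(X)(\mu_{0}(1,X)-\mu(1,X))$, the analogous identity for $1-A$, and $\theta_{\mathrm{CATE}}(X)=\mu_{0}(1,X)-\mu_{0}(0,X)$.

```latex
\begin{align*}
\mathbb{E}[\chi_{\mathrm{CATE}}((\mu,\pi);Z)\mid X]-\theta_{\mathrm{CATE}}(X)
&=\big(\mu(1,X)-\mu_{0}(1,X)\big)\left(1-\frac{\pi_{0}(X)}{\pi(X)}\right)
 -\big(\mu(0,X)-\mu_{0}(0,X)\big)\left(1-\frac{1-\pi_{0}(X)}{1-\pi(X)}\right)\\
&=\big(\pi(X)-\pi_{0}(X)\big)\left[\frac{\mu(1,X)-\mu_{0}(1,X)}{\pi(X)}
 +\frac{\mu(0,X)-\mu_{0}(0,X)}{1-\pi(X)}\right].
\end{align*}
```

The right-hand side is a product of a propensity error and outcome-regression errors, so it vanishes whenever either nuisance is exact.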

Given that our framework and results are quite general, we go through concrete examples to help with building intuition. In fact, we show that the work of van der Laan et al. [58] can be seen as a special instantiation of our framework.

1.2 Related Work

Calibration:

Important to our work is the vast literature on calibration. Calibration was considered first in the context of producing calibrated probabilities, both in the online [12, 17] and i.i.d. [51, 62] settings, but has since been considered in other contexts such as distribution calibration [54], threshold calibration [52, 59], and parity calibration [11]. Calibration is typically orthogonal to model training, and usually occurs as a simple post-processing routine. Some well-known algorithms for post-hoc calibration include Platt scaling [51, 26], histogram binning [62, 25], and isotonic regression [63, 2]. Many of these algorithms simultaneously offer strong theoretical guarantees (see Gupta [24] for an overview) and strong empirical performance when applied to practically-relevant ML models [23]. We view our work as complementary to existing, non-causal results on calibration. Our two-step algorithm allows a practitioner to directly apply any of the above listed algorithms, inheriting existing error guarantees so long as nuisance estimation is efficiently performed.

Double/debiased Machine Learning:

In our work, we also draw heavily from the literature on double/debiased machine learning [6, 9, 8]. Methods relating to double machine learning aim to eschew classical non-parametric assumptions (e.g. Donsker properties) on nuisance functions, often through simple sample splitting schemes [28, 34, 3]. In particular, if target population parameters are estimated using a Neyman orthogonal loss function [44, 18], then these works show that empirical estimates of the population parameters converge rapidly to either a population or conditional loss minimizer.

Of the various works related to double/debiased machine learning, we draw most heavily on ideas from the framework of orthogonal statistical learning [18]. In their work, Foster and Syrgkanis [18] develop a simple two-step framework for statistical learning in the presence of nuisance estimation. In particular, they show that when the underlying loss is Neyman orthogonal, the excess risk can be bounded by two decoupled error terms: the error from estimating nuisances and the error incurred from applying a learning algorithm with a fixed nuisance. Following its introduction, the orthogonal statistical learning framework has found applications in tasks such as the design of causal random forests [49] and causal model ensembling via Q-aggregation [40]. In this work, we show that central ideas from orthogonal statistical learning are naturally applicable to the problem of calibrating estimators of causal parameters.

Lastly, our work can be seen as a significant generalization of existing results on the calibration of causal parameters. Primarily, we compare our results to the work of van der Laan et al. [58]. In their work, the authors construct a sample-splitting scheme for calibrating estimates of conditional average treatment effects (CATEs). The specific algorithm leveraged by the authors uses one half of the data to estimate nuisance parameters, namely propensities and expected outcomes under control/treatment. After nuisances are learned, the algorithm transforms the second half of the data into pseudo-observations and runs isotonic regression as a calibration procedure. Our results are applicable to estimates of any causal parameter that can be specified as the population minimizer of a loss function, not just CATEs. Additionally, our generic procedure allows the scientist to plug in any black-box method for calibration, not just a specific algorithm such as isotonic regression. Likewise, our work is also significantly more general than the work of Leng and Dimmery [41], who provide a maximum-likelihood based approach for performing linear calibration, a weaker notion of calibration, of CATE estimates.

2 Calibration of Causal Effects

We are interested in calibrating some estimator/ML model $\theta\in\Theta\subset\{f:\mathcal{X}\rightarrow\mathbb{R}\}$ that is predicting some heterogeneous causal effect $\theta_{0}(x)$. We assume $\theta_{0}$ is specified as the conditional minimizer of some loss $\ell(\theta(x),g;z)$, i.e.

\[ \theta_{0}(x)\in\arg\min_{\nu\in\mathbb{R}}\mathbb{E}[\ell(\nu,g_{0};Z)\mid X=x]. \]

We assume $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ is some generic loss function involving nuisance. We assume $\mathcal{Z}$ is some space containing observations, and write $Z$ as a prototypical random element from this space, and $P_{Z}$ as the distribution on $\mathcal{Z}$ from which $Z$ is drawn. In the above, $\mathcal{G}$ is some space of nuisance functions, which are of the form $g:\mathcal{W}\rightarrow\mathbb{R}^{d}$. We assume that there is some true nuisance parameter $g_{0}\in\mathcal{G}$, but that this parameter is unknown to the learner and must be estimated. We generally assume $\mathcal{G}$ is a subset of $L^{2}(P_{W}):=L^{2}(\mathcal{W},P_{W})$, and so as a norm we can consider $\|g\|_{L^{2}(P_{W})}:=\left(\int_{\mathcal{W}}\|g(w)\|^{2}P_{W}(dw)\right)^{1/2}$, where $\|\cdot\|$ denotes the standard Euclidean norm on $\mathbb{R}^{d}$. Given a function $T:\mathcal{G}\rightarrow\mathbb{R}$ and functions $f,g\in\mathcal{G}$, we let $D_{g}T(f)(g)$ and $D_{g}^{2}T(f)(g,g)$ denote respectively the first and second Gateaux derivatives of $T$ at $f$ in the direction $g$.

We typically have $Z=(X,A,Y)$, where $X\in\mathcal{X}$ represents covariates, $A\in\mathcal{A}$ represents treatment, and $Y\in\mathbb{R}$ represents an outcome in an experiment. More generally, we assume the nested structure $\mathcal{X}\subset\mathcal{W}\subset\mathcal{Z}$, where $\mathcal{X}$ intuitively represents the space of covariates or features, $\mathcal{W}$ represents an extended set of parameters on which the true nuisance parameter may also depend (e.g. instrumental variables, whether or not an individual actually accepted a treatment), and $\mathcal{Z}$ may contain additional observable information (e.g. outcome $Y$ under the given treatment). We write the marginal distributions of $X$ and $W$ respectively as $P_{X}$ and $P_{W}$. We typically write $\ell(\theta,g;w)$ instead of $\ell(\theta(x),g;w)$ for succinctness, and we let $\partial\ell(\theta,g;w):=\frac{\partial}{\partial\theta(x)}\ell(\theta(x),g;w)$ be the partial derivative of $\ell$ with respect to its first argument, $\theta(x)$.

In our work, we consider a general definition of calibration error that holds for any loss involving a nuisance component. This general notion of calibration, which is similar to notions that have been considered in Noarov and Roth [47], Gopalan et al. [22], Foster and Vohra [16], and Globus-Harris et al. [21], implies an estimator cannot be “improved” on any level set of its prediction.

Definition 2.1.

Let $\theta:\mathcal{X}\rightarrow\mathbb{R}$ be an estimator, $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ a nuisance-dependent loss function, and $g\in\mathcal{G}$ a fixed nuisance parameter. We define the $L^{2}$ calibration error of $\theta$ with respect to $\ell$ and $g$ to be

\begin{align*}
\mathrm{Cal}(\theta,g) &:=\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]^{2}\right)^{1/2}\\
&=\left(\int_{\mathcal{X}}\mathbb{E}[\partial\ell(\theta,g;Z)\mid\theta(X)=\theta(x)]^{2}P_{X}(dx)\right)^{1/2}.
\end{align*}

We say $\theta$ is perfectly calibrated if $\mathrm{Cal}(\theta,g_{0})=0$, where $g_{0}$ is the true, unknown nuisance parameter.

In the special case where the loss $\ell(\nu,y)$ is the squared loss and does not involve a nuisance component, we recover a more classical notion of calibration error.

Definition 2.2 (Classical calibration error).

Let $\theta:\mathcal{X}\rightarrow\mathbb{R}$ be a fixed estimator, and let $(X,Y)\sim P$, where $P$ is some arbitrary distribution on $\mathcal{X}\times\mathbb{R}$. The classical $L^{2}$ calibration error is defined by

\[ \mathrm{Cal}(\theta;P):=\int_{\mathcal{X}}\left(\theta(x)-\mathbb{E}_{P_{Y}}[Y\mid\theta(X)=\theta(x)]\right)^{2}P_{X}(dx), \]

where we make dependence on the underlying distribution PP clear for convenience.

We will leverage the classical $L^{2}$ calibration error in the sequel when reasoning about the convergence of our sample splitting algorithm. The key for now is that Definition 2.1 generalizes the above definition to arbitrary losses involving a nuisance component.
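For intuition, the classical calibration error of Definition 2.2 has a simple plug-in estimate when the model takes finitely many values (for instance after binning). The sketch below is an illustrative assumption rather than an estimator from the paper: it groups samples by exact prediction value, replaces the conditional expectation by the within-group mean, and weights each group by its empirical frequency. The name `plug_in_calibration_error` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def plug_in_calibration_error(preds, outcomes):
    """Plug-in estimate of the integral in Definition 2.2 (no square root).

    Groups samples by exact prediction value, so it is only sensible when
    the model outputs finitely many distinct values.
    """
    groups = defaultdict(list)
    for p, y in zip(preds, outcomes):
        groups[p].append(y)
    n = len(preds)
    # Frequency-weighted squared gap between prediction and group mean outcome.
    return sum(len(ys) / n * (p - mean(ys)) ** 2 for p, ys in groups.items())
```

A value of zero means every predicted value equals the average outcome among the samples that received it. For continuous-valued models, some form of binning or smoothing would be needed before an estimate of this kind is meaningful.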

We are always interested in controlling $\mathrm{Cal}(\theta,g_{0})$. In words, $\theta$ is calibrated if, on the level set $\{x\in\mathcal{X}:\theta(x)=\nu\}$, there is no constant value $\omega\in\mathbb{R}$ to which we can switch the prediction $\theta(x)$ to obtain lower loss. We can glean further semantic meaning from the additional examples below.

Example 2.3.

Below, we almost always assume observations are of the form $Z=(X,A,Y)$, where $X$ are covariates, $A\in\{0,1\}$ or $[0,1]$ indicates treatment, and $Y=\sum_{a}Y(a)\mathbbm{1}_{A=a}$ indicates an outcome. We assume conditional ignorability of the treatment given covariates. The one exception is for conditional local average treatment effects, when assigned treatment may be ignored by the subject.

  1. Conditional Average Treatment Effect: Recall from earlier that the CATE $\theta_{\mathrm{CATE}}(x):=\mathbb{E}[Y(1)-Y(0)\mid X=x]$ can be specified as the conditional minimizer of the doubly-robust loss $\ell_{\mathrm{CATE}}$ given by $\ell_{\mathrm{CATE}}(\nu,(\mu,\pi);z):=\frac{1}{2}\left(\nu-\chi_{\mathrm{CATE}}((\mu,\pi);z)\right)^{2}$ with $\chi_{\mathrm{CATE}}$ given by

    \[ \chi_{\mathrm{CATE}}((\mu,\pi);z):=\mu(1,x)-\mu(0,x)+\left(\frac{a}{\pi(x)}-\frac{1-a}{1-\pi(x)}\right)(y-\mu(a,x)). \tag{2} \]

    Here, the true nuisances are $\pi_{0}(x):=\mathbb{P}(A=1\mid X=x)$ and $\mu_{0}(a,x):=\mathbb{E}[Y\mid X=x,A=a]$. It is clear that perfect calibration becomes $\mathbb{E}[Y(1)-Y(0)\mid\theta(X)]=\theta(X)$.

  2. Conditional Average Causal Derivative: In a setting where treatments are not binary, but rather continuous (e.g. $A\in[0,1]$), it no longer makes sense to consider treatment effects as a difference. Instead, we can consider the conditional average causal derivative, which is defined by

    \[ \theta_{\mathrm{ACD}}(x):=\mathbb{E}[\partial_{a}Y(A)\mid X]. \]

    $\theta_{\mathrm{ACD}}$ is in fact the conditional minimizer of the loss $\ell_{\mathrm{ACD}}$ given by

    \[ \ell_{\mathrm{ACD}}(\theta,g;z)=\frac{1}{2}(\theta(x)-\chi_{\mathrm{ACD}}(g;Z))^{2},\qquad\chi_{\mathrm{ACD}}(g;Z)=\partial_{a}\mu(a,x)+\frac{\partial_{a}\pi(a\mid x)}{\pi(a\mid x)}(y-\mu(a,x)), \]

    where $g_{0}=(\mu_{0},\pi_{0})$, $\mu_{0}(a,x)=\mathbb{E}[Y\mid X=x,A=a]$ is again the expected outcome mapping, and $\pi_{0}(a\mid x)$ is the density of the treatment given covariates (we could alternatively directly estimate the nuisance $\zeta_{0}(a,x)=\frac{\partial_{a}\pi_{0}(a\mid x)}{\pi_{0}(a\mid x)}$; more on this later). Naturally, $\theta$ is perfectly calibrated with respect to $\ell_{\mathrm{ACD}}$ if

    \[ \theta(X)=\mathbb{E}\left(\partial_{a}\mathbb{E}[Y(A)\mid X]\mid\theta(X)\right). \]
  3. Conditional Local Average Treatment Effect: In settings with non-compliance, the prescribed treatment $D\in\{0,1\}$ given to an individual may not be equivalent to the received treatment $A\in\{0,1\}$. Formally, we have $Z=(X,D,A,Y)$, where we assume $A=A(1)D+A(0)(1-D)$, and $Y=AY(1)+(1-A)Y(0)$, where $A(d),Y(a)$ represent potential outcomes for treatment assignment $a,d\in\{0,1\}$. We also assume monotonicity, i.e. that $D(1)\geq D(0)$, and that the propensity for the recommended treatment $\pi_{0}(x):=\mathbb{P}(D=1\mid X=x)$ is known. The parameter of interest here is

    \[ \theta_{\mathrm{LATE}}(x):=\mathbb{E}[Y(1)-Y(0)\mid A(1)>A(0),X=x], \]

    which is identified (following standard computations, see Lan and Syrgkanis [40]) as

    \[ \theta_{\mathrm{LATE}}(x)=\frac{\mathbb{E}[Y\mid A=1,X=x]-\mathbb{E}[Y\mid A=0,X=x]}{\mathbb{E}[D\mid A=1,X=x]-\mathbb{E}[D\mid A=0,X=x]}=:\frac{p_{0}(x)}{q_{0}(x)}. \tag{3} \]

    It follows that $\theta_{\mathrm{LATE}}$ conditionally minimizes the following somewhat complicated loss:

    \begin{align*}
    &\ell_{\mathrm{LATE}}(\theta,g;z)=\frac{1}{2}\left(\theta(x)-\chi_{\mathrm{LATE}}(g;z)\right)^{2}\qquad\text{where}\\
    &\chi_{\mathrm{LATE}}(g;z):=\frac{p(x)}{q(x)}+\frac{a(d-\pi_{0}(x))}{q(x)}\left(\frac{p(x)}{q(x)}-\frac{y(d-\pi_{0}(x))}{a(d-\pi_{0}(x))}\right). \tag{4}
    \end{align*}

    Calibration with respect to $\ell_{\mathrm{LATE}}$ becomes

    \[ \theta(X)=\mathbb{E}\left[Y(1)-Y(0)\mid D(1)>D(0),\theta(X)\right]. \]
  4. Conditional Quantile Under Treatment: Lastly, we consider the conditional $Q$th quantile under treatment, which, assuming $Y(1)$ admits a conditional Lebesgue density, is specified as $\theta_{\mathrm{QUT}}(x)=F^{-1}_{Y(1)}(Q\mid X=x)$ (here, $F_{Y(1)}(\cdot\mid X=x)$ denotes the conditional CDF of $Y(1)$ given covariates $X$). More generally, the $Q$th quantile under treatment is specified as

    \[ \theta_{\mathrm{QUT}}(x)\in\arg\min_{\nu\in\mathbb{R}}\mathbb{E}\left[\ell_{\mathrm{QUT}}(\nu,p_{0};Z)\mid X=x\right]. \]

    In the above, $\ell_{\mathrm{QUT}}:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ denotes the $Q$-pinball loss, which is defined as

    \[ \ell_{\mathrm{QUT}}(\theta,p;z):=ap(x)(y-\theta(x))\left(Q-\mathbbm{1}_{y\leq\theta(x)}\right), \]

    where the true, unknown nuisance is the inverse propensity score $p_{0}(x):=\frac{1}{\pi_{0}(x)}$, with $\pi_{0}(x):=\mathbb{P}(A=1\mid X=x)$. A direct computation yields that calibration under $\ell_{\mathrm{QUT}}$ becomes

    \[ \mathbb{P}\left(Y(1)\leq\theta(X)\mid\theta(X)\right)=Q. \]

The first three losses considered above are “easy” to calibrate with respect to, as CATE,ACD,\ell_{\mathrm{CATE}},\ell_{\mathrm{ACD}}, and LATE\ell_{\mathrm{LATE}} are universally orthogonal losses, a concept discussed in Section 3. The pinball loss, on the other hand, is the quintessential example of a “hard” loss to calibrate with respect to. Handling more complicated losses like this is the subject of Section 4.
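To make the weighted pinball loss from item 4 concrete, the following Python sketch (not taken from the paper) evaluates it on simulated data and checks that its empirical minimizer over a grid lands near the $Q$th quantile of $Y(1)$. The data-generating process, the constant propensity of 0.5, and the names `qut_loss`, `grid`, and `nu_hat` are assumptions made only for this illustration.

```python
import numpy as np

def qut_loss(nu, a, p, y, Q):
    """Weighted pinball loss a * p(x) * (y - nu) * (Q - 1{y <= nu}), elementwise."""
    indicator = (np.asarray(y) <= nu).astype(float)
    return a * p * (y - nu) * (Q - indicator)

# Hypothetical simulation: Y(1) ~ N(0, 1), treatment assigned with probability 0.5.
rng = np.random.default_rng(0)
n, Q = 20000, 0.75
a = rng.binomial(1, 0.5, size=n)
# Untreated outcomes are drawn from a different law; the factor `a` zeroes them out.
y = np.where(a == 1, rng.normal(size=n), rng.normal(loc=5.0, size=n))
p = np.full(n, 2.0)  # true inverse propensity 1 / 0.5

# Minimize the empirical risk over a grid of candidate constants nu.
grid = np.linspace(-2.0, 2.0, 401)
risk = np.array([qut_loss(nu, a, p, y, Q).mean() for nu in grid])
nu_hat = float(grid[risk.argmin()])
print(nu_hat)  # should be close to the 0.75 quantile of N(0, 1), roughly 0.67
```

Because the loss is only piecewise linear in the prediction, its derivative involves the indicator $\mathbbm{1}_{y\leq\theta(x)}$, which is one way to see why it behaves differently from the squared-loss examples.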

3 Calibration for Universally Orthogonal Losses

In this section, we develop algorithms for calibrating estimates $\theta$ of causal effects $\theta_{0}$ that are specified as conditional minimizers of universally orthogonal losses $\ell$. Universal orthogonality, first introduced in Foster and Syrgkanis [18], can be viewed as a robustness property of losses that are “close” to squared losses. Heuristically, a loss is universally orthogonal if it is insensitive to small errors in estimating the nuisance functions regardless of the current estimate of the conditional loss minimizer, i.e. Equation (1) holds when $\theta_{0}$ is replaced by any function $\theta:\mathcal{X}\rightarrow\mathbb{R}$. We formalize this in the following definition.

Definition 3.1 (Universal Orthogonality).

Let :×𝒢×𝒵\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R} be a loss involving nuisance, and let g0g_{0} denote the true nuisance parameter associated with \ell. We say \ell is universally orthogonal, if for any θ:𝒳\theta:\mathcal{X}\rightarrow\mathbb{R} and g𝒢g\in\mathcal{G}, we have

$$D_{g}\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid X\right](g-g_{0})=0.$$

By direct computation, one can verify that CATE,ACD,\ell_{\mathrm{CATE}},\ell_{\mathrm{ACD}}, and LATE\ell_{\mathrm{LATE}} all satisfy Definition 3.1, whereas QUT\ell_{\mathrm{QUT}} does not — this aligns with the idea of universally orthogonal losses behaving like squared loss. The following example gives a general class of losses that are universally orthogonal.

Example 3.2.

Assume that we have a vector of nuisances $g_{0}=(\eta_{0},\zeta_{0})$ for some $\eta_{0},\zeta_{0}:\mathcal{W}\rightarrow\mathbb{R}^{k}$. Further, assume there is some function $m(\eta;z)$ such that the following holds, where $\zeta_{0}$ can be thought of as the (conditional) Riesz representer of the linear functional $D_{\eta}\mathbb{E}[m(\eta;Z)\mid X]$:

$$D_{\eta}\mathbb{E}[m(\eta;Z)\mid X](\eta-\eta_{0})=\mathbb{E}[\langle\zeta_{0}(W),(\eta-\eta_{0})(W)\rangle\mid X]\qquad\text{for any }g-g_{0}\in\mathcal{G}.$$

Then, any loss $\ell(\theta,g;z)$ that obeys the score equation

$$\partial\ell(\theta,g;z)=\theta(x)-\underbrace{\left[m(\eta;z)+\mathrm{Corr}(g;z)\right]}_{=:\chi(g;z)} \tag{5}$$

with $\mathbb{E}[\mathrm{Corr}(g;Z)\mid X]=\mathbb{E}[\langle\zeta(W),(\eta_{0}-\eta)(W)\rangle\mid X]$ will satisfy Definition 3.1. We think of $\chi(g;z):=m(\eta;z)+\mathrm{Corr}(g;z)$ as representing a generalized pseudo-outcome that, on average, looks like the desired heterogeneous causal effect. In fact, any loss of the above form is equivalent to a loss that looks like $\ell(\nu,g;z)=\frac{1}{2}\left(\nu-\chi(g;z)\right)^{2}$.

All universally orthogonal losses we have seen up until this point can be written in terms of pseudo-outcomes in the canonical form above. We note the particular settings of $\eta_{0},\zeta_{0},m,$ and $\mathrm{Corr}$ for each of these losses in Table 1. Typically, we assume the statistician will estimate $\eta_{0}$ and $\zeta_{0}$ by plugging in estimates for each unknown constituent function, e.g. by plugging in estimates for $\pi_{0}$ and $\mu_{0}$ in the CATE example.

| Loss | $\eta_{0}$ | $\zeta_{0}$ | $m(\eta;z)$ | $\mathrm{Corr}(g;z)$ |
| --- | --- | --- | --- | --- |
| $\ell_{\mathrm{CATE}}$ | $\mu_{0}(a,x)$ | $\left(\frac{a}{\pi_{0}(x)}-\frac{1-a}{1-\pi_{0}(x)}\right)$ | $\eta(1,x)-\eta(0,x)$ | $\zeta(a,x)\cdot(y-\eta(a,x))$ |
| $\ell_{\mathrm{ACD}}$ | $\mu_{0}(a,x)$ | $\frac{\partial_{a}\pi_{0}(a\mid x)}{\pi_{0}(a\mid x)}$ | $\partial_{a}\mu(a,x)$ | $\zeta(a,x)\cdot(y-\eta(a,x))$ |
| $\ell_{\mathrm{LATE}}$ | $\frac{p_{0}(x)}{q_{0}(x)}$ | $\frac{a(d-\pi_{0}(x))}{q_{0}(x)}$ | $\eta(x)$ | $\zeta(a,d,x)\cdot\left(\frac{y(d-\pi_{0}(x))}{a(d-\pi_{0}(x))}-\eta(x)\right)$ |

Table 1: Canonical representations of universally orthogonal losses in terms of base nuisances $\eta_{0}$, Riesz representers $\zeta_{0}$, and pseudo-outcome components $m(\eta;z)$ and $\mathrm{Corr}(g;z)$.
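As an illustration of the CATE row of Table 1, the sketch below assembles the pseudo-outcome $\chi(g;z)=m(\eta;z)+\mathrm{Corr}(g;z)$ from plug-in nuisance values. The function name `cate_pseudo_outcome` and its array-based signature are assumptions of this example, not something the paper prescribes.

```python
import numpy as np

def cate_pseudo_outcome(y, a, mu0_hat, mu1_hat, pi_hat):
    """Pseudo-outcome chi = m(eta; z) + Corr(g; z) for the CATE loss.

    mu0_hat, mu1_hat: estimates of mu_0(0, x) and mu_0(1, x) at each sample.
    pi_hat:           estimate of the propensity pi_0(x) at each sample.
    """
    m = mu1_hat - mu0_hat                        # eta(1, x) - eta(0, x)
    zeta = a / pi_hat - (1 - a) / (1 - pi_hat)   # Riesz representer zeta(a, x)
    eta_a = np.where(a == 1, mu1_hat, mu0_hat)   # eta(a, x) at the observed arm
    return m + zeta * (y - eta_a)                # m + Corr

# Two hand-checkable samples: one treated, one control.
y = np.array([3.0, 1.0])
a = np.array([1, 0])
chi = cate_pseudo_outcome(y, a,
                          mu0_hat=np.array([1.0, 1.0]),
                          mu1_hat=np.array([2.0, 2.0]),
                          pi_hat=np.array([0.5, 0.5]))
# Treated unit: 1 + 2 * (3 - 2) = 3; control unit: 1 - 2 * (1 - 1) = 1.
```

The residual term vanishes on average when the outcome model is correct, which is why $\chi$ looks like the causal effect in expectation.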

Universal orthogonality allows us to bound $\mathrm{Cal}(\theta,g_{0})$ above by two decoupled terms. The first, denoted $\mathrm{err}(g,g_{0};\theta)$, intuitively measures the distance between some fixed nuisance estimate $g\in\mathcal{G}$ and the unknown, true nuisance parameters $g_{0}\in\mathcal{G}$. This error is second order in nature, generally depending on the squared norm of the nuisance estimation error. The second term is $\mathrm{Cal}(\theta,g)$, and represents the calibration error of $\theta$ in a reality where the learned nuisances $g$ were actually the true nuisances. As mentioned in the introduction, we view our bound as a “change of nuisance” result (akin to change of measure), allowing the learner to pay an upfront price for nuisance misestimation and then subsequently reason about the calibration error under potentially incorrect, learned nuisances. We prove the following in Appendix A.

Theorem 3.3.

Let :×𝒢×𝒵\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R} be universally orthogonal, per Definition 3.1. Let g0𝒢g_{0}\in\mathcal{G} denote the true nuisance parameters associated with \ell. Suppose Dg2𝔼[(θ,f;Z)X](gg0,gg0)D^{2}_{g}\mathbb{E}[\partial\ell(\theta,f;Z)\mid X](g-g_{0},g-g_{0}) exists for any g,f𝒢g,f\in\mathcal{G}. Then, for any g𝒢g\in\mathcal{G} and θ:𝒳\theta:\mathcal{X}\rightarrow\mathbb{R}, we have

$$\mathrm{Cal}(\theta,g_{0})\leq\frac{1}{2}\mathrm{err}(g,g_{0};\theta)+\mathrm{Cal}(\theta,g),$$

where $\mathrm{err}(g,h;\theta):=\sup_{f\in[g,h]}\sqrt{\mathbb{E}\left(\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,f;Z)\mid X\right](h-g,h-g)\right\}^{2}\right)}$. (For $f,h\in\mathcal{G}$, we let the interval $[f,h]:=\{\lambda(w)f(w)+(1-\lambda(w))h(w):\lambda(w)\in[0,1]\}$.)

While the expression defining $\mathrm{err}(g,g_{0};\theta)$ looks unpalatable, when $g_{0}=(\eta_{0},\zeta_{0})$ (as in Example 3.2), this term can quite generally be bounded above by the cross-error in nuisance estimation, i.e. the quantity $\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})}$. The formal statement of this fact is presented in Proposition 3.4 below. More broadly, if the conditional Hessian satisfies $|D_{g}^{2}\mathbb{E}[\partial\ell(\theta,f;Z)\mid X](\Delta,\Delta)|\leq C\|\Delta\|^{2}_{L^{2}(P_{W})}$ for any $\Delta\in\mathcal{G}-\mathcal{G}$ and $f\in\mathcal{G}$, then we simply have $\mathrm{err}(g,g_{0};\theta)\leq C\|g-g_{0}\|^{2}_{L^{2}(P_{W})}$. Second-order bounds of this sort appear in other works on causal estimation [18, 58, 40, 6].

We view the above result as a major generalization of the main result (Theorem 1) of van der Laan et al. [58], which shows a similar bound on the calibration error of estimates of conditional average treatment effects when calibration is performed via isotonic regression. Our result, which more explicitly leverages the concept of Neyman orthogonality, can be used to recover that of van der Laan et al. [58] as a special case, including the error rate in nuisance estimation.

Proposition 3.4.

Suppose the loss $\ell$ satisfies the score condition outlined in Equation (5) and suppose $m(\eta;z)$ is linear in $\eta$. Then, we have

$$\mathrm{err}(g,g_{0};\theta)\leq 2\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})},$$

where $g_{0}=(\eta_{0},\zeta_{0})$ represent the true, unknown nuisance parameters, and $g=(\eta,\zeta)$ represent arbitrary, fixed nuisance estimates. If instead one has $|D_{g}^{2}\mathbb{E}[\partial\ell(\theta,f;Z)\mid X](\Delta,\Delta)|\leq C\|\Delta\|_{L^{2}(P_{W})}^{2}$ for any $\Delta$ and $\theta$, then one has

$$\mathrm{err}(g,g_{0};\theta)\leq C\|g-g_{0}\|_{L^{2}(P_{W})}^{2}.$$
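A short calculation, offered here as a sketch rather than as the proof from the appendix, suggests where the cross-error comes from. It uses only the two identities of Example 3.2, together with the assumption that $m$ is linear in $\eta$, so that the derivative identity holds exactly and not just to first order:

```latex
\begin{align*}
\mathbb{E}[\chi(g;Z)\mid X]-\mathbb{E}[\chi(g_{0};Z)\mid X]
 &=\mathbb{E}[m(\eta;Z)-m(\eta_{0};Z)\mid X]
   +\mathbb{E}[\langle\zeta(W),(\eta_{0}-\eta)(W)\rangle\mid X]\\
 &=\mathbb{E}[\langle\zeta_{0}(W),(\eta-\eta_{0})(W)\rangle\mid X]
   -\mathbb{E}[\langle\zeta(W),(\eta-\eta_{0})(W)\rangle\mid X]\\
 &=-\,\mathbb{E}[\langle(\zeta-\zeta_{0})(W),(\eta-\eta_{0})(W)\rangle\mid X].
\end{align*}
```

The first line uses $\mathbb{E}[\mathrm{Corr}(g_{0};Z)\mid X]=0$, and the second uses linearity of $m$. The score therefore changes only through a product of the two nuisance errors, and by Jensen's inequality its $L^{2}(P_{X})$ norm is at most $\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})}$. Note also that the factor $2$ in the proposition cancels against the factor $\frac{1}{2}$ in Theorem 3.3.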

3.1 Sample Splitting Algorithm

Throughout the remainder of this section, we make the following assumption on the loss $\ell$. Any loss of the form presented in Equation (5) naturally satisfies this assumption, and thus our results apply to $\ell_{\mathrm{CATE}},\ell_{\mathrm{ACD}},$ and $\ell_{\mathrm{LATE}}$ described earlier.

Assumption 1 (Linear Score).

We assume there is some function χ:𝒢×𝒵\chi:\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R} such that the loss :×𝒢×𝒵\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R} satisfies

$$\mathbb{E}[\partial\ell(\theta,g;Z)\mid X]=\theta(X)-\mathbb{E}[\chi(g;Z)\mid X]$$

for any g𝒢g\in\mathcal{G} and θ:𝒳\theta:\mathcal{X}\rightarrow\mathbb{R}.

The largely theoretical bound presented in Theorem 3.3 suggests a natural algorithm for performing causal calibration. First, a learner should use some fraction (say half) of the data to produce a nuisance estimate $\widehat{g}$ using any black-box algorithm. Then, the learner should transform the second half of the data into generalized pseudo-outcomes $\chi(\widehat{g};Z)$ using the learned nuisances. Finally, they should apply some off-the-shelf calibration algorithm to the transformed data points. We formalize this in Algorithm 1.

Algorithm 1 Sample-Splitting for Universally Orthogonal Losses
Input: Samples $Z_{1},\dots,Z_{2n}\sim P_{Z}$, nuisance alg. $\mathcal{A}_{1}$, calibration alg. $\mathcal{A}_{2}$, estimator $\theta$.
1: Estimate nuisances: $\widehat{g}\leftarrow\mathcal{A}_{1}(Z_{1:n})$.
2: Compute pseudo-outcomes: $\chi_{n+1},\dots,\chi_{2n}:=\chi(\widehat{g};Z_{n+1}),\dots,\chi(\widehat{g};Z_{2n})$.
3: Run calibration: $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,(X_{m},\chi_{m})_{m=n+1}^{2n})$.
Output: Calibrated estimator $\widehat{\theta}$.
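The three steps of Algorithm 1 can be sketched in Python for the CATE loss as follows. Everything specific here is an assumption made for illustration: the linear outcome models and constant propensity standing in for $\mathcal{A}_{1}$, uniform-mass histogram binning standing in for $\mathcal{A}_{2}$, and the simulated data. The paper leaves both algorithms as black boxes.

```python
import numpy as np

def fit_nuisances(x, a, y):
    """Stand-in for A1: per-arm linear outcome models and a constant propensity."""
    def linfit(xs, ys):
        design = np.column_stack([np.ones_like(xs), xs])
        coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
        return lambda t: coef[0] + coef[1] * t
    mu1 = linfit(x[a == 1], y[a == 1])
    mu0 = linfit(x[a == 0], y[a == 0])
    pi = float(np.clip(a.mean(), 0.05, 0.95))
    return mu0, mu1, (lambda t: np.full(len(t), pi))

def pseudo_outcomes(g_hat, x, a, y):
    """chi(g_hat; Z) for the CATE loss: m + Corr."""
    mu0, mu1, pi = g_hat
    m0, m1, p = mu0(x), mu1(x), pi(x)
    zeta = a / p - (1 - a) / (1 - p)
    return m1 - m0 + zeta * (y - np.where(a == 1, m1, m0))

def histogram_binning(theta, x, chi, n_bins=5):
    """Stand-in for A2: uniform-mass binning of theta(X) against pseudo-outcomes."""
    scores = theta(x)
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.searchsorted(edges, scores, side="right")
    # Empty bins (not expected with continuous scores) fall back to 0.0.
    means = np.array([chi[bins == b].mean() if np.any(bins == b) else 0.0
                      for b in range(n_bins)])
    return lambda t: means[np.searchsorted(edges, theta(t), side="right")]

def algorithm1(x, a, y, theta, n_bins=5):
    n = len(y) // 2
    g_hat = fit_nuisances(x[:n], a[:n], y[:n])           # step 1: first half
    chi = pseudo_outcomes(g_hat, x[n:], a[n:], y[n:])    # step 2: second half
    return histogram_binning(theta, x[n:], chi, n_bins), chi  # step 3

# Hypothetical data with true CATE 1 + 2x and a miscalibrated initial estimate 3x.
rng = np.random.default_rng(1)
N = 4000
x = rng.uniform(size=N)
a = rng.binomial(1, 0.5, size=N)
y = x + a * (1 + 2 * x) + rng.normal(scale=0.5, size=N)
theta = lambda t: 3 * t
theta_hat, chi = algorithm1(x, a, y, theta)
```

On the calibration half, each distinct value of `theta_hat` equals the average pseudo-outcome over the samples receiving that value, which is the empirical analogue of being calibrated with respect to the learned nuisances.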

Algorithm 1 generalizes the main algorithm in van der Laan et al. [58] (Algorithm 1) to allow for the calibration of an estimate of any heterogeneous causal effect using any off-the-shelf nuisance estimation and calibration algorithms. To make efficient use of the data, we recommend instead running the cross calibration procedure outlined in Algorithm 2. We do not provide a convergence analysis for this algorithm.

Algorithm 2 Cross Calibration for Universally Orthogonal Losses
Input: Samples $\mathcal{D}=Z_{1},\dots,Z_{n}\sim P_{Z}$, nuisance alg. $\mathcal{A}_{1}$, calibration alg. $\mathcal{A}_{2}$, estimator $\theta$.
1: Split $\mathcal{I}:=[n]$ into $K$ equally-sized folds $\mathcal{I}_{1},\dots,\mathcal{I}_{K}$.
2: Define $\mathcal{D}_{k}:=\{Z_{i}:i\in\mathcal{I}_{k}\}$ for all $k\in[K]$.
3: for $k\in 1,\dots,K$ do
4:     Estimate nuisances: $\widehat{g}^{(-k)}\leftarrow\mathcal{A}_{1}(\mathcal{D}\setminus\mathcal{D}_{k})$.
5:     Compute pseudo-outcomes $\chi_{i}:=\chi(\widehat{g}^{(-k)};Z_{i})$ for $i\in\mathcal{I}_{k}$.
6: end for
7: Run calibration: $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,(X_{m},\chi_{m})_{m=1}^{n})$.
Output: Calibrated estimator $\widehat{\theta}$.
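The cross-fitting loop of Algorithm 2 can be sketched generically in Python. The callables `fit_nuisance` and `pseudo_outcome` are hypothetical placeholders for $\mathcal{A}_{1}$ and $\chi$, and the random fold assignment is a choice of this example. The toy usage treats the training mean as the "nuisance" purely to keep the arithmetic checkable.

```python
import numpy as np

def cross_fit_pseudo_outcomes(z, n_folds, fit_nuisance, pseudo_outcome, seed=0):
    """Compute chi_i using nuisances fit on every fold except the one containing i."""
    n = len(z)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)   # near-equal folds I_1..I_K
    chi = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)      # indices of D \ D_k
        g_hat = fit_nuisance(z[train])                    # g-hat^{(-k)}
        chi[held_out] = pseudo_outcome(g_hat, z[held_out])
    return chi, folds

# Toy usage: the "nuisance" is the training mean and the "pseudo-outcome" centers by it.
z = np.arange(10, dtype=float)
chi, folds = cross_fit_pseudo_outcomes(
    z, n_folds=5,
    fit_nuisance=np.mean,
    pseudo_outcome=lambda g, zs: zs - g,
)
```

The resulting `chi`, paired with the covariates of all $n$ samples, would then be passed to the calibration algorithm together with the initial estimator, as in the final step of Algorithm 2.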

We now prove convergence guarantees for Algorithm 1. We start by enumerating several assumptions needed to guarantee convergence.

Assumption 2.

Let 𝒜1:𝒵𝒢\mathcal{A}_{1}:\mathcal{Z}^{\ast}\rightarrow\mathcal{G} be a nuisance estimation algorithm taking in an arbitrary number of points, and let 𝒜2:Θ×(𝒳×)Θ\mathcal{A}_{2}:\Theta\times(\mathcal{X}\times\mathbb{R})^{\ast}\rightarrow\Theta be a calibration algorithm taking some initial estimator and an arbitrary number of covariate/label pairs. We assume

  1.

    For any distribution $P_{Z}$ on $\mathcal{Z}$, $Z_{1},\dots,Z_{n}\sim P_{Z}$ i.i.d., and failure probability $\delta_{1}\in(0,1)$, we have, with probability at least $1-\delta_{1}$,

    $$\mathrm{err}(\widehat{g},g_{0};\theta)\leq\mathrm{rate}_{1}(n,\delta_{1};P_{Z}),$$

    where $\widehat{g}\leftarrow\mathcal{A}_{1}(Z_{1:n})$ and $\mathrm{rate}_{1}$ is some rate function.

  2.

    For any distribution $Q$ on $\mathcal{X}\times\mathbb{R}$, $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})\sim Q$ i.i.d., initial estimator $\theta:\mathcal{X}\rightarrow\mathbb{R}$, and failure probability $\delta_{2}\in(0,1)$, we have, with probability at least $1-\delta_{2}$,

    $$\mathrm{Cal}(\widehat{\theta};Q)\leq\mathrm{rate}_{2}(n,\delta_{2};Q),$$

    where $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,\{(X_{i},Y_{i})\}_{i=1}^{n})$ and $\mathrm{rate}_{2}$ is some rate function.

We briefly parse the above assumptions. For the first assumption, whenever $g_{0}=(\eta_{0},\zeta_{0})$ and $\mathrm{err}(\widehat{g},g_{0};\theta)\lesssim\|(\eta-\eta_{0})(\zeta-\zeta_{0})\|_{L^{2}(P_{W})}$ or $\mathrm{err}(\widehat{g},g_{0};\theta)\lesssim\|\widehat{g}-g_{0}\|_{L^{2}(P_{W})}^{2}$, one can directly apply ML, non-parametric, or semi-parametric methods to estimate the unknown nuisances. For instance, if the nuisances are assumed to satisfy Hölder continuity assumptions or to belong to a ball in a reproducing kernel Hilbert space, one can apply classical kernel smoothing methods or kernel ridge regression, respectively, to estimate the unknown parameters at optimal rates [58, 57, 60]. Further, many well-known calibration algorithms satisfy the second assumption, often in a manner that doesn’t depend on the underlying distribution $Q$. For instance, results in Gupta and Ramdas [25] on $L^{\infty}$ calibration error bounds directly imply that if $\mathcal{A}_{2}$ is taken to be uniform mass/histogram binning, then the rate function $\mathrm{rate}_{2}$ can be taken as $\mathrm{rate}_{2}(n,\delta;Q)=O\left(\|Y\|_{\infty}\sqrt{\frac{B\log(B/\delta)}{n}}\right)$, where $B$ denotes the number of bins/buckets, $\|Y\|_{\infty}$ denotes the essential supremum of $Y$ (which will be finite so long as nuisances and observations are bounded), and the unknown constant is independent of $n$, $Q$, and $\delta$. Likewise, the convergence in probability results for isotonic calibration proven by van der Laan et al. [58] can naturally be extended to high-probability guarantees using standard concentration of measure results.
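To get a feel for the histogram-binning rate quoted above, the following sketch evaluates the expression $\|Y\|_{\infty}\sqrt{B\log(B/\delta)/n}$ numerically. The argument `const` stands in for the unspecified constant hidden in the $O(\cdot)$ and is purely an assumption of this example, as is the function name `binning_rate`.

```python
import math

def binning_rate(n, n_bins, delta, y_sup, const=1.0):
    """const * ||Y||_inf * sqrt(B * log(B / delta) / n).

    `const` is a placeholder for the unknown constant in the O(.) bound.
    """
    return const * y_sup * math.sqrt(n_bins * math.log(n_bins / delta) / n)

r1 = binning_rate(n=1000, n_bins=10, delta=0.05, y_sup=1.0)
r4 = binning_rate(n=4000, n_bins=10, delta=0.05, y_sup=1.0)
# With B and delta fixed, quadrupling n halves the rate (n^{-1/2} scaling).
```

For a fixed number of bins the expression shrinks like $n^{-1/2}$, while more bins inflate it through the $B\log(B/\delta)$ factor.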

Theorem 3.5.

Suppose $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ is an arbitrary, universally orthogonal loss satisfying Assumption 1. Let $\mathcal{A}_{1},\mathcal{A}_{2},$ and $\chi$ satisfy Assumption 2. Then, with probability at least $1-\delta_{1}-\delta_{2}$, the output $\widehat{\theta}$ of Algorithm 1 run on a calibration dataset of $2n$ i.i.d. data points $Z_{1},\dots,Z_{2n}\sim P_{Z}$ satisfies

$$\mathrm{Cal}(\widehat{\theta},g_{0})\leq\frac{1}{2}\mathrm{rate}_{1}(n,\delta_{1};P_{Z})+\mathrm{rate}_{2}(n,\delta_{2};P^{\chi}_{\widehat{g}}),$$

where $P^{\chi}_{\widehat{g}}$ denotes the (random) distribution of $(X,\chi(\widehat{g};Z))$ where $Z\sim P_{Z}$ and again $X\subset Z$.

We prove the above theorem in Appendix B. The above result can be thought of as an analogue of Theorem 1 of Foster and Syrgkanis [18], which shows a similar bound on excess parameter risk, and also a generalization of Theorem 1 of van der Laan et al. [58], which shows a similar bound when isotonic regression is used to calibrate CATE estimates.

We note that while we state our bounds and assumptions in terms of high-probability guarantees, we could have equivalently assumed $\mathrm{err}(\widehat{g},g_{0};\theta)=O_{\mathbb{P}}(\mathrm{rate}_{1}(n;P_{Z}))$ and $\mathrm{Cal}(\widehat{\theta};Q)=O_{\mathbb{P}}(\mathrm{rate}_{2}(n))$ for appropriately chosen rate functions $\mathrm{rate}_{1}$ and $\mathrm{rate}_{2}$. (Here, $\mathrm{rate}_{2}(n)$ would have to be a function independent of the distribution of pseudo-outcomes to obtain convergence in probability guarantees. Such functions exist for isotonic calibration and histogram binning.) This, for instance, would be useful if one wanted to apply the results on the convergence of isotonic regression due to van der Laan et al. [58], who show $\mathrm{Cal}(\theta;P)=O_{\mathbb{P}}\left(n^{-1/3}\right)$.

Lastly, we note that it is desirable for calibration algorithms to possess a “do no harm” guarantee, which ensures that the risk of the calibrated parameter $\widehat{\theta}$ is not much larger than the risk of the original parameter. We present such a guarantee in Theorem C.1 in Appendix C, which follows using standard risk bounding techniques due to Foster and Syrgkanis [18].

4 Calibration for Conditionally Orthogonal Losses

We now consider the more challenging problem where the causal effect $\theta_{0}(X)$ is not the minimizer of a universally orthogonal loss. To aid in our exposition, we introduce calibration functions. In short, the calibration function gives a canonical choice of a post-processing $\widehat{\theta}=\tau\circ\theta:\mathcal{X}\rightarrow\mathbb{R}$ such that $\mathrm{Cal}(\widehat{\theta};g_{0})=0$. While computing the calibration function exactly requires knowledge of the data-generating distribution, it can be approximated from finitely many samples.

Definition 4.1 (Calibration Function).

Given any θ:𝒳\theta:\mathcal{X}\rightarrow\mathbb{R} and g𝒢g\in\mathcal{G}, we define the calibration function for θ\theta at gg as the mapping γθ(;g):𝒳\gamma_{\theta}(\cdot;g):\mathcal{X}\rightarrow\mathbb{R} given by

$$\gamma_{\theta}(x;g):=\arg\min_{\nu}\mathbb{E}[\ell(\nu,g;Z)\mid\theta(X)=\theta(x)].$$

In particular, when g=g0g=g_{0}, we call γθ:=γθ(;g0)\gamma_{\theta}^{\ast}:=\gamma_{\theta}(\cdot;g_{0}) the true calibration function.

As hinted, first-order optimality conditions alongside the tower rule for conditional expectations imply that $\mathbb{E}[\partial\ell(\gamma_{\theta}(\cdot;g),g;Z)\mid\gamma_{\theta}(X;g)]=0$ for any $g\in\mathcal{G}$. This, in particular, implies that $\gamma_{\theta}^{\ast}$ is perfectly calibrated. As an example, when a loss satisfies Assumption 1, $\gamma_{\theta}(x;g)=\mathbb{E}[\chi(g;Z)\mid\theta(X)=\theta(x)]$.
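For losses satisfying Assumption 1, the calibration function is a conditional mean of pseudo-outcomes given the prediction, so both it and the $L^{2}$ calibration error have simple empirical analogues when the predictor takes finitely many values. The sketch below assumes exactly that discrete setting. The function names are hypothetical, and treating the pseudo-outcomes `chi` as given amounts to fixing the nuisance $g$.

```python
import numpy as np

def calibration_function(theta_vals, chi):
    """Empirical E[chi | theta(X) = v], evaluated at every sample (discrete theta)."""
    out = np.empty(len(chi), dtype=float)
    for v in np.unique(theta_vals):
        mask = theta_vals == v
        out[mask] = chi[mask].mean()
    return out

def calibration_error(theta_vals, chi):
    """Empirical L2 norm of theta(X) - E[chi | theta(X)]."""
    gamma = calibration_function(theta_vals, chi)
    return float(np.sqrt(np.mean((theta_vals - gamma) ** 2)))

theta_vals = np.array([0.0, 0.0, 1.0, 1.0])
chi = np.array([1.0, 3.0, 1.0, 1.0])
gamma = calibration_function(theta_vals, chi)   # level-set means: [2, 2, 1, 1]
cal_before = calibration_error(theta_vals, chi) # sqrt((4 + 4 + 0 + 0) / 4) = sqrt(2)
cal_after = calibration_error(gamma, chi)       # post-processed predictor: 0
```

Replacing each prediction by its level-set mean drives the empirical calibration error to zero, mirroring the statement that the calibration function is perfectly calibrated.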

We now ask under what general assumptions on \ell can we achieve calibration. In analogy with Foster and Syrgkanis [18], who consider the general task of empirical risk minimization in the presence of nuisance, the hope would be to weaken the assumption of universal orthogonality to that of Neyman orthogonality. Two commonly-used definitions for Neyman orthogonality (a marginal version and a version conditional on covariates) are provided below.

Definition 4.2 (Neyman Orthogonality).

We say \ell is Neyman orthogonal conditional on covariates XX (or marginally) if, for g𝒢g\in\mathcal{G}, we have

Dg𝔼[(θ0,g0;Z)X](gg0)=0(respectively, Dg𝔼[(ω0,h0;Z)](gh0)=0),D_{g}\mathbb{E}[\partial\ell(\theta_{0},g_{0};Z)\mid X](g-g_{0})=0\qquad(\text{respectively, }D_{g}\mathbb{E}[\partial\ell(\omega_{0},h_{0};Z)](g-h_{0})=0),

where θ0(x):=argminν𝔼[(ν,g0;Z)X=x]\theta_{0}(x):=\arg\min_{\nu}\mathbb{E}[\ell(\nu,g_{0};Z)\mid X=x] and g0g_{0} denotes the true nuisances (respectively ω0:=argminν𝔼[(ν,h0;Z)]\omega_{0}:=\arg\min_{\nu}\mathbb{E}[\ell(\nu,h_{0};Z)] and h0h_{0} for the latter).

Neyman orthogonality is useful for a task such as risk minimization because it allows the statistician to relate the risk $\mathbb{E}[\ell(\widehat{\theta},g_{0};Z)]$ under the true nuisances to the risk under the computed nuisances $\mathbb{E}[\ell(\widehat{\theta},\widehat{g};Z)]$ up to second order errors. Why do we need two separate conditions on the loss $\ell$? In general, the conditions in Definition 4.2 are not equivalent. To illustrate this, we can look at the example of the conditional/marginal quantile under treatment (we will let $\theta_{\mathrm{QUT}}(X)$ denote the former and $\omega_{\mathrm{QUT}}$ the latter). Recalling that we defined the pinball loss $\widetilde{\ell}_{\mathrm{QUT}}(\nu,p;z):=ap(x)(y-\nu)(Q-\mathbbm{1}_{y\leq\nu})$, it is not hard to see that we have

$$\theta_{\mathrm{QUT}}(x)=\arg\min_{\nu}\mathbb{E}[\widetilde{\ell}_{\mathrm{QUT}}(\nu,p_{0};Z)\mid X=x]\quad\text{and}\quad\omega_{\mathrm{QUT}}=\arg\min_{\nu}\mathbb{E}[\widetilde{\ell}_{\mathrm{QUT}}(\nu,p_{0};Z)],$$

where $p_{0}(x):=\mathbb{P}(A=1\mid X=x)^{-1}$ denotes the inverse propensity. Straightforward computation yields that $\widetilde{\ell}_{\mathrm{QUT}}$ is Neyman orthogonal conditional on covariates, but not marginally orthogonal. However, as noted in Kallus et al. [33], one can define the more complicated loss $\ell_{\mathrm{QUT}}(\nu,(p,f);z)$ by performing a first order correction:

$$\ell_{\mathrm{QUT}}(\nu,(p,f);z):=\widetilde{\ell}_{\mathrm{QUT}}(\nu,p;z)-\nu\cdot\underbrace{(ap(x)(f(x)-Q)-f(x)+Q)}_{=:\mathrm{Corr}((p,f);z)} \tag{6}$$

where f0(x):=(Y(1)ωQUTX=x)f_{0}(x):=\mathbb{P}(Y(1)\leq\omega_{\mathrm{QUT}}\mid X=x) is an additional nuisance that must be estimated from the data. One can check that QUT\ell_{\mathrm{QUT}} satisfies the second condition of Definition 4.2, i.e. marginal Neyman orthogonality.
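One property of the correction term in Equation (6) can be checked by hand: at the true inverse propensity it has conditional mean zero given $X$, whatever $f$ is, so adding it does not move the conditional minimizer. The sketch below verifies this exactly for a binary treatment by summing over $A\in\{0,1\}$. The numerical values and function names are arbitrary choices for illustration.

```python
def corr_qut(a, p, f, Q):
    """Correction term a * p(x) * (f(x) - Q) - f(x) + Q from Equation (6)."""
    return a * p * (f - Q) - f + Q

def mean_corr_given_x(pi0, p, f, Q):
    """Exact E[Corr | X = x] when A | X = x is Bernoulli(pi0)."""
    return pi0 * corr_qut(1, p, f, Q) + (1 - pi0) * corr_qut(0, p, f, Q)

pi0, f, Q = 0.4, 0.7, 0.5
at_truth = mean_corr_given_x(pi0, 1 / pi0, f, Q)  # p = p_0 = 1 / pi_0: mean is 0
off_truth = mean_corr_given_x(pi0, 2.0, f, Q)     # misspecified p: 0.08 - 0.12 = -0.04
```

With a misspecified inverse propensity the mean is no longer zero, and this nonzero term is what offsets the first-order effect of propensity error in the uncorrected pinball score.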

In calibration, we care about the quality of a model conditional on its own predictions. More specifically, given any initial model $\theta(X)$, the goal of any calibration algorithm (e.g. histogram binning, isotonic regression) is to compute the calibration function $\gamma_{\theta}^{\ast}$ (Definition 4.1). (We remind the reader that $\gamma_{\theta}^{\ast}(x):=\arg\min_{\nu}\mathbb{E}[\ell(\nu,g_{0};Z)\mid\theta(X)=\theta(x)]$.) If $\theta(X)$ were a constant function, then we would have $\gamma_{\theta}^{\ast}(x)\equiv\omega_{0}$, and thus we would want to leverage a loss satisfying marginal orthogonality. Likewise, if $\theta(X)$ were roughly of the same complexity as $\theta_{0}(X)$, we may want to leverage a loss $\ell$ satisfying the form of Neyman orthogonality conditional on covariates $X$. In general, the complexity of the initial estimate $\theta(X)$ will interpolate between these two extremes.

For a variant of Neyman orthogonality to thus be useful, we would need the cross-derivative to vanish (a) when evaluated at the calibration function $\gamma_{\theta}^{\ast}$ instead of $\theta_{0}(X)$ or $\omega_{0}$ and (b) when the expectation is taken conditionally on the prediction $\theta(X)$ instead of either conditionally on $X$ or marginally. The extra structure provided by universal orthogonality allowed us to side-step this issue, as the cross-derivative of the loss vanished when evaluated at any estimate of $\theta_{0}$ so long as nuisances were estimated correctly. The following, quite technical condition will give us the structure we need to perform calibration of estimates of more general heterogeneous causal effects.

Definition 4.3 (Conditional Orthogonality).

Suppose $\widetilde{\ell}(\theta,\eta;z)$ is some initial loss with true nuisance $\eta_{0}$. Define the “corrected” loss $\ell$ by

$$\ell(\theta,(\eta,\zeta);z):=\widetilde{\ell}(\theta,\eta;z)-\theta(x)\cdot\mathrm{Corr}((\eta,\zeta);z),$$

where $\mathrm{Corr}((\eta,\zeta);z)$ is any correction term satisfying $\mathbb{E}[\mathrm{Corr}((\eta_{0},\zeta);Z)\mid X]=0$. Then, we say $\ell$ is conditionally orthogonal if, for any $\varphi:\mathcal{X}\rightarrow\mathbb{R}$, there exists $\zeta_{\varphi}\in L^{2}(P_{X})$ such that

$$D_{g}\mathbb{E}[\partial\ell(\gamma_{\varphi}^{\ast},g_{\varphi};Z)\mid\varphi(X)](g-g_{0})=0,$$

for all $g\in\mathcal{G}$, where $g_{\varphi}:=(\eta_{0},\zeta_{\varphi})$.

Definition 4.3 may be difficult to parse, but we can work through several examples to gain some intuition. Returning to the example of conditional quantile under treatment and the (corrected) pinball loss QUT(ν,(p,f);z)\ell_{\mathrm{QUT}}(\nu,(p,f);z) defined above, we simply took the correction term to be Corr((p,f);z):=ap(x)(f(x)Q)f(x)+Q\mathrm{Corr}((p,f);z):=ap(x)(f(x)-Q)-f(x)+Q. From a straightforward calculation, one can check QUT\ell_{\mathrm{QUT}} satisfies Definition 4.3 with additional nuisance fφf_{\varphi} given by fφ(x):=(Y(1)γφ(X)X=x)f_{\varphi}(x):=\mathbb{P}(Y(1)\leq\gamma_{\varphi}^{\ast}(X)\mid X=x).

More broadly, given some initial loss $\widetilde{\ell}$, one can use the Riesz representation theorem to obtain $\zeta_{\varphi}:\mathcal{W}\rightarrow\mathbb{R}$ satisfying

$$D_{\eta}\mathbb{E}[\partial\widetilde{\ell}(\gamma_{\varphi}^{\ast},\eta_{0};Z)\mid\varphi(X)](\eta-\eta_{0})=\mathbb{E}[\langle\zeta_{\varphi}(W),(\eta-\eta_{0})(W)\rangle\mid\varphi(X)]$$

almost surely. Then, if we can find some variable $U\subset Z$ such that $\mathbb{E}[U\mid W]=\eta_{0}(W)$, we can simply take $\mathrm{Corr}((\eta,\zeta);z):=\langle\zeta(w),\eta(w)-u\rangle$, which gives us a “corrected” loss

$$\ell(\theta,(\eta,\zeta);z):=\widetilde{\ell}(\theta,\eta;z)-\theta(x)\cdot\langle\zeta(w),\eta(w)-u\rangle.$$

We conclude by pointing out the following observations about calibration with respect to such losses $\ell$, which follow immediately from Definition 4.3.

Corollary 4.4.

Suppose a loss \ell satisfies Definition 4.3. Then, the following hold:

  1.

    The calibration function

    $$\gamma_{\varphi}(x;(\eta,\zeta)):=\arg\min_{\nu}\mathbb{E}[\ell(\nu,(\eta,\zeta);Z)\mid\varphi(X)=\varphi(x)]$$

    doesn’t depend on $\zeta$ when $\eta=\eta_{0}$. Thus, we can write $\gamma_{\varphi}^{\ast}(x)=\gamma_{\varphi}(x;(\eta_{0},\zeta))$ for any $\zeta$.

  2.

    The calibration error of any estimate $\theta:\mathcal{X}\rightarrow\mathbb{R}$, given by

    $$\mathrm{Cal}(\theta,(\eta,\zeta)):=\left(\int_{\mathcal{X}}\mathbb{E}[\partial\ell(\theta,(\eta,\zeta);Z)\mid\theta(X)]^{2}P_{X}(dx)\right)^{1/2}$$

    also does not depend on $\zeta$ when $\eta=\eta_{0}$. Thus, we can write $\mathrm{Cal}(\theta,\eta_{0})$ in place of $\mathrm{Cal}(\theta,(\eta_{0},\zeta))$ for any $\zeta$.

  3.

    Lastly, not only do $\ell$ and $\widetilde{\ell}$ possess the same conditional minimizer when evaluated at $\eta_{0}$ (regardless of the choice of $\zeta$), but we also have

    $$\mathrm{Cal}(\theta,(\eta_{0},\zeta))=\widetilde{\mathrm{Cal}}(\theta,\eta_{0}),$$

    where $\theta:\mathcal{X}\rightarrow\mathbb{R}$ is arbitrary and $\widetilde{\mathrm{Cal}}$ denotes the calibration error under $\widetilde{\ell}$.

4.1 A General Bound on Calibration Error

We now prove a decoupled bound on the calibration error Cal(θ,g0)\mathrm{Cal}(\theta,g_{0}) under the assumption that \ell is conditionally orthogonal. This bound serves as a direct analogue of the one presented in Theorem 3.3, just in a more general setting. As we will see, bounding the calibration error of losses that are not universally orthogonalizable is a much more delicate task.

To prove our result, we will need to place convexity assumptions on the underlying loss $\ell$. We note that these convexity assumptions are akin to those made in existing works, namely in the work of Foster and Syrgkanis [18].

Assumption 3.

We assume that the loss function conditioned on covariates $X$ is $\alpha$-strongly convex, i.e. for any $v\in\mathbb{R}$ and any $g\in\mathcal{G}$, we have

$$\mathbb{E}\left[\partial^{2}\ell(v,g;Z)\mid X\right]\geq\alpha.$$

Assumption 4.

We assume that the loss function conditioned on covariates $X$ is $\beta$-smooth, i.e. for any $v\in\mathbb{R}$ and any $g\in\mathcal{G}$, we have

$$\mathbb{E}\left[\partial^{2}\ell(v,g;Z)\mid X\right]\leq\beta.$$

We now state the main theorem of this section. The bound below appears largely identical to the one presented in Theorem 3.3 modulo two minor differences. First, we pay a multiplicative factor of β/α\beta/\alpha in both of the decoupled terms, which ultimately is just the condition number of the loss \ell. Second, the error term err(g,gθ;γθ)\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast}) is evaluated at the calibration function γθ(x):=argminν𝔼[(ν,g0;Z)θ(X)=θ(x)]\gamma_{\theta}^{\ast}(x):=\arg\min_{\nu}\mathbb{E}[\ell(\nu,g_{0};Z)\mid\theta(X)=\theta(x)] instead of the parameter estimate θ\theta. This difference is due to the fact that, in the proof of Theorem 4.5, we must perform a functional Taylor expansion around γθ\gamma_{\theta}^{\ast} in order to invoke the orthogonality condition. This subtlety was absent in the case of universal orthogonality, as \ell was insensitive to nuisance misestimation for any parameter estimate θ\theta. We ultimately view this difference as minor, as for many examples (e.g. the pinball loss QUT\ell_{\mathrm{QUT}}) the dependence on γθ\gamma_{\theta}^{\ast} vanishes. We prove Theorem 4.5 in Appendix A.2.

Theorem 4.5.

Let $\ell$ be a conditionally orthogonal loss (Definition 4.3) that is $\alpha$-strongly convex (Assumption 3) and $\beta$-smooth (Assumption 4). Suppose $D_{g}^{2}\mathbb{E}[\partial\ell(\gamma_{\theta}^{\ast},f;Z)\mid X](g-g_{0},g-g_{0})$ exists for $f,g\in\mathcal{G}$. Then, for any estimate $\theta:\mathcal{X}\rightarrow\mathbb{R}$ and nuisance parameter $g=(\eta,\zeta)$, we have

$$\mathrm{Cal}(\theta,\eta_{0})\leq\frac{\beta}{2\alpha}\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast})+\frac{\beta}{\alpha}\mathrm{Cal}(\theta,g),$$

where $g_{\theta}=(\eta_{0},\zeta_{\theta})$ are the true, unknown nuisance functions, $\gamma_{\theta}^{\ast}$ is the calibration function associated with $\theta$, and $\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast})$ is as defined in Theorem 3.3.

Although the bound in Theorem 4.5 looks similar in spirit to the one presented in Theorem 3.3, there still remain questions to answer. For instance, what does the condition number β/α\beta/\alpha look like for practically-relevant losses? Likewise, will the error term err(g,gθ;γθ)\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast}) simplify into a cross-error term as in the case of universally orthogonalizable losses? We interpret Theorem 4.5 by spending some time looking at the example of the pinball loss QUT\ell_{\mathrm{QUT}}.

Example 4.6.

First, for any fixed quantile $Q$, we assess the strong convexity/smoothness properties of $\ell_{\mathrm{QUT}}$. Let $p:\mathcal{X}\rightarrow\mathbb{R}_{\geq 0}$ represent any inverse-propensity estimate, and let $\pi_{0}$ represent the true propensity score. Assume $Y(1)$ admits a conditional density $f_{Y(1)}(y\mid x)$ with respect to the Lebesgue measure on $\mathbb{R}$. Straightforward calculation yields

$$\partial^{2}\mathbb{E}[\ell_{\mathrm{QUT}}(\theta,p;Z)\mid X]=\frac{p(X)}{p_{0}(X)}f_{Y(1)}(\theta(X)\mid X).$$

Thus, if $l\leq f_{Y(1)}(y\mid x)\leq u$ for all $y\in\mathbb{R},x\in\mathcal{X}$ and $\epsilon<p_{0}(x)^{-1},p(x)^{-1}\leq 1-\epsilon$ for all $x\in\mathcal{X}$ for some $0<\epsilon<1/2$, then we have:

$$\frac{\epsilon}{1-\epsilon}l\leq\partial^{2}\mathbb{E}[\ell_{\mathrm{QUT}}(\theta,p;Z)\mid X]\leq\frac{1-\epsilon}{\epsilon}u,$$

i.e. that $\ell_{\mathrm{QUT}}$ satisfies Assumption 4 with $\beta=\frac{1-\epsilon}{\epsilon}u$ and Assumption 3 with $\alpha=\frac{\epsilon}{1-\epsilon}l$.

We can further interpret the error term $\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast})$ in the case of the pinball loss, where again $g=(\eta,f)$. In particular, straightforward calculation yields

\[\mathbb{E}[\partial\ell_{\mathrm{QUT}}(\theta,g;Z)\mid X]=\frac{p(X)}{p_{0}(X)}(\mathbb{P}(Y(1)\leq\theta(X)\mid X)-Q)-\mathbb{E}\left[A(f(X)-Q)(p(W)-p_{0}(W))\mid X\right].\]

As the first term is linear in the nuisance estimate $p$ and doesn’t depend on $f$, its second Gateaux derivative (with respect to $g$) is identically zero. Thus, the error term does not depend on $\gamma_{\theta}^{\ast}$, so we can write $\mathrm{err}(g,g_{\theta})$ instead. Further, using the same analysis as in Proposition 3.4 and writing $\zeta(a,x)=a(f(x)-Q)$, we have that

\[\begin{aligned}
\frac{1}{2}\mathrm{err}(g,g_{\theta}) &\leq\|(p-p_{0})(\zeta-\zeta_{\theta})\|_{L^{2}(P_{W})}\\
&=\left\|A(f(X)-f_{\theta}(X))(p(X)-p_{0}(X))\right\|_{L^{2}(P_{W})},
\end{aligned}\]

where we recall that $f_{\theta}(x):=\mathbb{P}(Y(1)\leq\gamma_{\theta}^{\ast}\mid X=x)$. Thus, even in the general case of conditional orthogonality, we can often obtain simple-looking bounds on the error in nuisance estimation.

4.2 A Sample Splitting Algorithm

We conclude the section by presenting two algorithms for performing causal calibration with respect to conditionally orthogonal losses. As in Section 3, we first present a sample splitting algorithm (Algorithm 3) that enjoys finite sample convergence guarantees. We then present a corresponding cross-calibration algorithm that is likely more useful in practice.

Algorithm 3 Sample-Splitting for Conditionally Orthogonal Losses
Input: Samples $Z_{1},\dots,Z_{2n}\sim P_{Z}$, nuisance alg. $\mathcal{A}_{1}$, general loss calibration alg. $\mathcal{A}_{2}$, estimator $\theta$, loss involving nuisance $\ell$.
1: Estimate nuisances: $\widehat{g}\leftarrow\mathcal{A}_{1}(Z_{1:n})$.
2: Compute loss partial-evaluations: $\ell_{n+1},\dots,\ell_{2n}:=\ell(\cdot,\widehat{g};Z_{n+1}),\dots,\ell(\cdot,\widehat{g};Z_{2n})$.
3: Run calibration: $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,(X_{m},\ell_{m})_{m=n+1}^{2n})$.
Output: Calibrated estimator $\widehat{\theta}$.
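
As a rough illustration, the sample-splitting procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `sample_split_calibrate` and `shift_calibrate`, the convention that each sample is a tuple whose first entry is the covariate, and the toy shift-only calibrator (standing in for a generic $\mathcal{A}_{2}$) are all hypothetical choices made for illustration.

```python
def sample_split_calibrate(samples, theta, loss, fit_nuisance, calibrate):
    """Nuisances on the first half of the data, calibration on the second half."""
    n = len(samples) // 2
    g_hat = fit_nuisance(samples[:n])          # plays the role of A_1
    held_out = samples[n:2 * n]
    # Partially evaluated losses nu -> loss(nu, g_hat, z), paired with the covariate.
    pairs = [(z[0], (lambda nu, z=z: loss(nu, g_hat, z))) for z in held_out]
    return calibrate(theta, pairs)             # plays the role of A_2


def shift_calibrate(theta, pairs, grid=None):
    """Toy calibrator (an assumption): pick the constant shift c on a grid that
    minimizes the summed partially evaluated losses of theta(x) + c."""
    grid = grid if grid is not None else [i / 10 for i in range(-30, 31)]
    best = min(grid, key=lambda c: sum(l(theta(x) + c) for x, l in pairs))
    return lambda x: theta(x) + best


# Toy usage: samples are (x, y) with y = x + 1, there is no real nuisance,
# and the loss is the squared loss, so the best shift of theta(x) = x is 1.
samples = [(float(i), float(i) + 1.0) for i in range(8)]
theta_hat = sample_split_calibrate(
    samples,
    theta=lambda x: x,
    loss=lambda nu, g, z: 0.5 * (nu - z[1]) ** 2,
    fit_nuisance=lambda data: None,
    calibrate=shift_calibrate,
)
```

The calibrator only ever sees the partially evaluated losses, never the raw outcomes, which is the design point of the algorithm: any calibrator that can minimize a sum of such losses can be plugged in.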

Algorithm 3 is essentially a generalization of Algorithm 1 to general losses. The key difference is that we can no longer compute pseudo-outcomes for general losses. Instead, we assume that the calibration algorithm $\mathcal{A}_{2}$ passed to Algorithm 3 can calibrate “with respect to general losses $\ell$”. What does this mean? Many calibration algorithms, such as linear calibration, Platt scaling, and isotonic calibration, compute a mapping $\widehat{\tau}$ satisfying

\[\widehat{\tau}\in\arg\min_{\tau\in\mathcal{F}}\sum_{i=1}^{n}\ell\left((\tau\circ\theta)(X_{i}),Y_{i}\right) \tag{7}\]

where $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ denotes a calibration dataset, $\mathcal{F}\subset\{f:\mathbb{R}\rightarrow\mathbb{R}\}$ is some appropriately-defined class of functions, and $\ell$ is an appropriately chosen loss function. Table 2 below outlines the choices of $\mathcal{F}$ and $\ell$ for common calibration algorithms.

Algorithm | Loss $\ell$ | Function class $\mathcal{F}$
--- | --- | ---
Isotonic Calibration | $\ell(\nu,y)=\frac{1}{2}(\nu-y)^{2}$ | $\mathcal{F}=\{\tau(x)\text{ is non-decreasing}\}$
Linear Calibration | $\ell(\nu,y)=\frac{1}{2}(\nu-y)^{2}$ | $\mathcal{F}=\{\tau(x)=ax+b,\;a,b\in\mathbb{R}\}$
Histogram binning | $\ell(\nu,y)=\frac{1}{2}(\nu-y)^{2}$ | $\mathcal{F}=\{\tau(x)=\sum_{b=1}^{B}c_{b}\mathbb{1}_{\theta(x)\in[\theta(X)_{((b-1)n/B)},\theta(X)_{(bn/B)})}\}$
Platt Scaling | $\ell(\nu,y)=-y\log(\nu)-(1-y)\log(1-\nu)$ | $\mathcal{F}=\{\tau(x)=\frac{1}{1+\exp(ax+b)},\;a,b\in\mathbb{R}\}$

Table 2: Representations of classical calibration algorithms in the form outlined in Equation (7).
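
To make the representation in Equation (7) concrete, the following sketch implements two rows of Table 2 for the squared loss: linear calibration (ordinary least squares of outcomes on predictions) and histogram binning with equal-mass bins. It is an illustrative sketch only; the function names and the convention that a bin contains its left endpoint are assumptions, and it presumes at least as many calibration points as bins and non-constant predictions.

```python
from bisect import bisect_right


def linear_calibration(preds, ys):
    """Least-squares fit of tau(v) = a * v + b to (prediction, outcome) pairs."""
    n = len(preds)
    mean_v, mean_y = sum(preds) / n, sum(ys) / n
    var_v = sum((v - mean_v) ** 2 for v in preds)
    a = sum((v - mean_v) * (y - mean_y) for v, y in zip(preds, ys)) / var_v
    b = mean_y - a * mean_v
    return lambda v: a * v + b


def histogram_binning(preds, ys, num_bins):
    """Piecewise-constant tau: the mean outcome within each equal-mass bin."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    n = len(order)
    edges, means = [], []
    for b in range(num_bins):
        idx = order[b * n // num_bins:(b + 1) * n // num_bins]
        means.append(sum(ys[i] for i in idx) / len(idx))
        if b > 0:
            edges.append(preds[idx[0]])      # left endpoint of bin b
    # bisect_right counts the edges <= v, i.e. the index of the bin containing v.
    return lambda v: means[bisect_right(edges, v)]


# Toy usage with outcomes y = 2 * prediction + 1.
preds, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
tau_linear = linear_calibration(preds, ys)
tau_binned = histogram_binning(preds, ys, num_bins=2)
```

Both functions return the post-processing map $\tau$; the calibrated estimator is then the composition $\tau\circ\theta$.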

Thus, for general losses involving nuisance $\ell$, it makes sense that the calibration algorithm $\mathcal{A}_{2}$ should compute the following minimizer:

\[\widehat{\tau}\in\arg\min_{\tau\in\mathcal{F}}\sum_{i=1}^{n}\underbrace{\ell((\tau\circ\theta)(X_{i}),\widehat{g}_{i};Z_{i})}_{=:\ell_{i}((\tau\circ\theta)(X_{i}))}.\]

We outline a general template for $\mathcal{A}_{2}$ in Algorithm 5 in Appendix D. In the above, $Z_{1},\dots,Z_{n}\sim P_{Z}$ is now the calibration sample and $\widehat{g}_{i}$ denotes a nuisance estimate, which generically may depend on the sample index $i$. Also in Appendix D, we prove the convergence of a simple three-way sample-splitting algorithm based on uniform mass binning, establishing a finite-sample guarantee on its $L^{2}$ calibration error. We do not include this algorithm in the main paper, as the focus of the work is on presenting a general framework for performing causal calibration.
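
One hedged way to picture such a general-loss calibrator is a brute-force search over affine maps $\tau(v)=av+b$ that minimizes the summed partially evaluated losses. The sketch below is not the template from Appendix D; the name `affine_loss_calibration`, the grids, and the use of the plain pinball loss in the usage example are assumptions for illustration.

```python
from itertools import product


def affine_loss_calibration(preds, losses, a_grid, b_grid):
    """Return (a, b) minimizing sum_i losses[i](a * preds[i] + b) over a grid.

    `losses[i]` is a partially evaluated loss: a function of the prediction only,
    with the nuisance estimate and the i-th observation already plugged in.
    """
    def total(ab):
        a, b = ab
        return sum(l(a * v + b) for v, l in zip(preds, losses))

    return min(product(a_grid, b_grid), key=total)


def pinball(y, q):
    """Partially evaluated pinball loss nu -> (y - nu) * (q - 1{y <= nu})."""
    return lambda nu: (y - nu) * (q - (1.0 if y <= nu else 0.0))


# Toy usage: outcomes are exactly twice the initial predictions,
# so the minimizing affine map is tau(v) = 2 v.
preds = [1.0, 2.0, 3.0]
losses = [pinball(2.0 * v, q=0.5) for v in preds]
a_grid = [i / 4 for i in range(0, 13)]    # 0, 0.25, ..., 3
b_grid = [i / 4 for i in range(-8, 9)]    # -2, -1.75, ..., 2
a_hat, b_hat = affine_loss_calibration(preds, losses, a_grid, b_grid)
```

A grid search is used only because it needs no smoothness of the loss; in practice one would likely swap in a proper optimizer suited to the loss at hand.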

We can similarly define a version of cross calibration for conditionally orthogonal losses, which we present in Algorithm 4.

Algorithm 4 Cross Calibration for Conditionally Orthogonal Losses
Input: Samples $\mathcal{D}:=Z_{1},\dots,Z_{n}\sim P_{Z}$, nuisance alg. $\mathcal{A}_{1}$, general loss calibration alg. $\mathcal{A}_{2}$, estimator $\theta$, loss involving nuisance $\ell$.
1: Split $\mathcal{I}:=[n]$ into $K$ equally-sized folds $\mathcal{I}_{1},\dots,\mathcal{I}_{K}$.
2: Define $\mathcal{D}_{k}:=\{Z_{i}:i\in\mathcal{I}_{k}\}$ for all $k\in[K]$.
3: for $k=1,\dots,K$ do
4:     Estimate nuisances: $\widehat{g}^{(-k)}\leftarrow\mathcal{A}_{1}(\theta,\mathcal{D}\setminus\mathcal{D}_{k})$.
5:     Compute loss partial evaluations: $\ell_{i}:=\ell(\cdot,\widehat{g}^{(-k)};Z_{i})$ for $i\in\mathcal{I}_{k}$.
6: Run calibration: $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,(X_{m},\ell_{m})_{m=1}^{n})$.
Output: Calibrated estimator $\widehat{\theta}$.
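
The cross-calibration loop above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the name `cross_calibrate`, the interleaved fold assignment (which assumes the samples are already in random order), and the tuple layout with the covariate first are assumptions.

```python
def cross_calibrate(samples, theta, loss, fit_nuisance, calibrate, num_folds):
    """Each sample's loss uses nuisances fit on the other folds only."""
    n = len(samples)
    folds = [list(range(k, n, num_folds)) for k in range(num_folds)]
    pairs = [None] * n
    for fold in folds:
        held = set(fold)
        train = [samples[i] for i in range(n) if i not in held]
        g_hat = fit_nuisance(theta, train)           # out-of-fold nuisance estimate
        for i in fold:
            z = samples[i]
            # Bind z and g_hat now so later folds do not overwrite them.
            pairs[i] = (z[0], (lambda nu, z=z, g=g_hat: loss(nu, g, z)))
    return calibrate(theta, pairs)                   # calibrate on all n samples


# Toy check of the out-of-fold property: the "nuisance" is just the set of
# covariates it was trained on, and the loss returns that set.
samples = [(float(i), 0.0) for i in range(10)]
pairs = cross_calibrate(
    samples,
    theta=lambda x: x,
    loss=lambda nu, g, z: g,
    fit_nuisance=lambda theta, train: {z[0] for z in train},
    calibrate=lambda theta, pairs: pairs,            # return the pairs for inspection
    num_folds=5,
)
```

In contrast to sample splitting, every sample contributes to the final calibration step, while its own loss never depends on a nuisance estimate trained on it.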

We now focus our efforts on proving a convergence guarantee for Algorithm 3. We present a set of assumptions we need to prove our convergence guarantee.

Assumption 5.

Let $\mathcal{A}_{1}:\mathcal{Z}^{\ast}\times\Theta\rightarrow\mathcal{G}$ be a nuisance estimation algorithm taking in an arbitrary number of points, and let $\mathcal{A}_{2}:\Theta\times(\mathcal{X}\rightarrow\mathbb{R})^{\ast}\rightarrow\Theta$ be a general loss calibration algorithm taking some initial estimator and a sequence of partially-evaluated loss functions $\ell_{i}:\mathcal{X}\rightarrow\mathbb{R}$.

  1. For any distribution $P_{Z}$ on $\mathcal{Z}$, $Z_{1},\dots,Z_{n}\sim_{i.i.d.}P_{Z}$, $\theta\in\Theta$, and failure probability $\delta_{1}\in(0,1)$, we have

    \[\mathrm{err}(\widehat{g},g_{\theta};\gamma_{\theta}^{\ast})\leq\mathrm{rate}_{1}(n,\delta_{1},\theta;P_{Z}),\]

    where $\widehat{g}=(\widehat{\eta},\widehat{\zeta})\leftarrow\mathcal{A}_{1}(Z_{1:n},\theta)$, $g_{\theta}=(\eta_{0},\zeta_{\theta})$, and $\mathrm{rate}_{1}$ is some rate function.

  2. For any distribution $P_{Z}$ on $\mathcal{Z}$, $Z_{1},\dots,Z_{n}\sim_{i.i.d.}P_{Z}$, $\theta\in\Theta$, $g\in\mathrm{range}(\mathcal{A}_{1})\subset\mathcal{G}$, and failure probability $\delta_{2}\in(0,1)$, we have

    \[\mathrm{Cal}(\widehat{\theta};g)\leq\mathrm{rate}_{2}(n,\delta_{2},\ell;P_{Z}),\]

    where $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,\{(X_{m},\ell_{m})\}_{m=1}^{n})$, $\ell_{m}:=\ell(\cdot,\widehat{g};Z_{m})$, and $\mathrm{rate}_{2}$ is some rate function.

  3. With probability one, $\widehat{\theta}=\tau\circ\theta$ for some injective mapping $\tau:\mathbb{R}\rightarrow\mathbb{R}$.

The first two assumptions are direct analogues of those made in Section 3, giving the learner control over nuisance estimation and calibration rates. The third is needed because, in general, $\zeta_{\theta}\neq\zeta_{\widehat{\theta}}$ when $\widehat{\theta}=\tau\circ\theta$ for an arbitrary univariate mapping $\tau:\mathbb{R}\rightarrow\mathbb{R}$. This is a problem, as the learner will estimate the additional nuisance associated with the initial parameter, $\zeta_{\theta}$, but Theorem 4.5 will be instantiated with respect to the nuisance $\zeta_{\widehat{\theta}}$. The following lemma shows that injectivity of $\tau$ ensures $\zeta_{\theta}=\zeta_{\widehat{\theta}}$.

Lemma 4.7.

Suppose $\ell$ is conditionally orthogonal, and suppose $\varphi_{1},\varphi_{2}:\mathcal{X}\rightarrow\mathbb{R}$ have the same level sets, i.e. they satisfy $\{\varphi_{1}^{-1}(c):c\in\mathrm{range}(\varphi_{1})\}=\{\varphi_{2}^{-1}(c):c\in\mathrm{range}(\varphi_{2})\}$ (here, for a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ that is not necessarily injective, we let $f^{-1}(c):=\{x\in\mathcal{X}:f(x)=c\}$). Then, the calibration functions satisfy $\gamma_{\varphi_{1}}^{\ast}=\gamma_{\varphi_{2}}^{\ast}$. Additionally, without loss of generality, we can assume $\zeta_{\varphi_{1}}\equiv\zeta_{\varphi_{2}}$.

Calibration algorithms that learn an injective post-processing mapping, such as Platt scaling and linear calibration, will preserve level sets. For an algorithm like isotonic calibration, one can either (a) estimate $\zeta_{\theta}$ and hope $\zeta_{\theta}\approx\zeta_{\widehat{\theta}}$ or (b) if $\widehat{\tau}\in\{f:\mathbb{R}\rightarrow\mathbb{R}\mid f\text{ is monotonic}\}$ learned via isotonic regression is not strictly increasing, release $\widehat{\theta}=(\widehat{\tau}+\xi)\circ\theta$ where $\xi$ is any strictly increasing map. For algorithms such as histogram binning and uniform mass binning, one can first learn the level sets of $\widehat{\theta}$, which are just based on the quantiles of the predictions of the initial estimator $\theta(X)$. Then, these level sets entirely specify the target nuisance $\zeta_{\widehat{\theta}}$ to estimate. We provide a version of Algorithm 3 in Appendix D that does this. We now provide a convergence guarantee for Algorithm 3.

Theorem 4.8.

Suppose $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ is an arbitrary, conditionally orthogonal loss. Let $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ satisfy Assumption 5, and suppose $\widehat{\theta}=\widehat{\tau}\circ\theta$ where $\widehat{\tau}$ is almost surely injective. Then, with probability at least $1-\delta_{1}-\delta_{2}$, the output $\widehat{\theta}$ of Algorithm 3 run on a calibration dataset of $2n$ i.i.d. data points $Z_{1},\dots,Z_{2n}\sim P_{Z}$ satisfies

\[\mathrm{Cal}(\widehat{\theta},\eta_{0})\leq\frac{1}{2}\mathrm{rate}_{1}(n,\delta_{1},\theta;P_{Z})+\mathrm{rate}_{2}(n,\delta_{2},\ell;P_{Z}).\]
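
The injectivity requirement, and the fix of adding a strictly increasing map $\xi$ to a non-injective $\widehat{\tau}$ discussed before the theorem, can be illustrated on a finite sample. In this sketch the choice $\xi(v)=\text{slope}\cdot v$, the helper names, and the floor function standing in for a flat isotonic fit are all assumptions made for illustration.

```python
import math


def level_sets(values):
    """Partition of sample indices induced by equal values."""
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(v, set()).add(i)
    return {frozenset(s) for s in groups.values()}


def make_injective(tau_hat, slope=1e-3):
    """Add the strictly increasing map xi(v) = slope * v to a non-decreasing tau_hat."""
    return lambda v: tau_hat(v) + slope * v


preds = [0.1, 0.4, 1.2, 0.4]                 # initial predictions theta(X_i)
step = lambda v: float(math.floor(v))        # non-decreasing but not injective
fixed = make_injective(step)

# The step map merges level sets of theta; the perturbed map preserves them.
merged = level_sets([step(v) for v in preds]) != level_sets(preds)
preserved = level_sets([fixed(v) for v in preds]) == level_sets(preds)
```

Because the level sets of the perturbed estimator coincide with those of $\theta$ on the sample, the nuisance estimated for $\theta$ remains the relevant one after calibration, which is what Lemma 4.7 relies on.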

5 Experiments

We now evaluate the performance of our cross-calibration algorithms, namely Algorithms 2 and 4. We consider two settings. First, we examine the viability of our calibration algorithm for universally orthogonal losses using observational data. In particular, we show how Algorithm 2 can be used to calibrate estimates of the CATE of 401(k) eligibility and the conditional LATE of 401(k) participation on an individual’s net financial assets. Second, we measure the ability of Algorithm 4 to calibrate estimates of conditional quantiles under treatment on synthetic data. We examine the performance of our algorithm for several quantiles, both in terms of calibration error and average loss.

5.1 Effects of 401(k) Participation/Eligibility on Financial Assets

First, we consider the task of constructing and calibrating estimates of the heterogeneous effect of 401(k) eligibility/participation on an individual’s net financial assets. To do this, we use the 401(k) dataset leveraged in Chernozhukov and Hansen [5], Kallus et al. [33], and Chernozhukov et al. [10]. Since Chernozhukov and Hansen [5] argue that eligibility for 401(k) satisfies conditional ignorability given a sufficiently rich set of features (the entire set of features is [‘age’, ‘inc’, ‘fsize’, ‘educ’, ‘db’, ‘marr’, ‘male’, ‘twoearn’, ‘pira’, ‘nohs’, ‘hs’, ‘smcol’, ‘col’, ‘hown’]), we aim to measure the CATE of eligibility on net financial assets. This ignorability is not known to be satisfied for 401(k) participation, and thus we instead aim to measure the conditional LATE of 401(k) participation on net financial assets, with eligibility serving as an instrument. For each parameter (either CATE or conditional LATE), we randomly split the dataset into three folds of uneven size: we use 60% of the data to construct the initial parameter estimate, 25% of the data to perform calibration, and reserve 15% of the data as a test set.

Model Training and Calibration:

To fit the initial CATE/conditional LATE model, we split the training data randomly into $K=5$ evenly-sized folds. We use cross-fitting to construct appropriate pseudo-outcomes, i.e. for each $k\in[5]$ we use data in all but the $k$th fold to estimate nuisances and then use these estimates to transform observations in the $k$th fold. In the case of the CATE, we produce estimates $\widehat{\pi}^{(-k)}$, $\widehat{\mu}^{(-k)}$ of the propensity score $\pi_{0}(x):=\mathbb{P}(A=1\mid X=x)$ and expected outcome mapping $\mu_{0}(a,x):=\mathbb{E}[Y\mid X=x,A=a]$ using gradient-boosted decision and regression trees, respectively. We then produce pseudo-outcomes on the $k$th fold, per Equation (2), and use gradient-boosted regression trees to regress these pseudo-outcomes onto covariates.

In the case of the conditional LATE, as the instrument policy $\zeta_{0}(x):=\mathbb{P}(Z=1\mid X=x)$ is not assumed to be known, rather than using the universally orthogonal loss discussed in Equation (4), we use the following loss detailed in Syrgkanis et al. [55] and Lan and Syrgkanis [40]:

\[\ell(\theta,(\mu,\pi,\zeta);W):=\frac{1}{2}\Big(\theta(x)-\underbrace{\tau(x)-\frac{(\widetilde{y}-\tau(x)\widetilde{a})\widetilde{Z}}{\zeta(x)}}_{=:\chi_{\mathrm{LATE}}^{\prime}(g;W)}\Big)^{2}\]

where $\tau(x):=\frac{\mu(1,x)-\mu(0,x)}{\zeta(x)}$, $\widetilde{y}:=y-\mu(a,x)$, $\widetilde{Z}:=Z-\zeta(x)$, and $\widetilde{a}:=a-\pi(x)$, where $\mu_{0}$ and $\pi_{0}$ are as defined above. We estimate all nuisances and construct pseudo-outcomes $\chi^{\prime}_{\mathrm{LATE}}$ as in the case of the CATE, once again using either gradient-boosted regression or decision trees as appropriate. Again, we regress pseudo-outcomes onto covariates using gradient-boosted regression trees to obtain initial parameter estimates.

After initial models are trained, we run Algorithm 2 (cross-calibration) on the 25% of the data reserved for calibration, once again using $K=5$ folds. We again estimate all nuisances using either gradient-boosted decision or regression trees. We perform calibration using three different algorithms: isotonic calibration, histogram binning with $B=20$ buckets, and linear calibration, which performs simple linear regression with intercept of the constructed pseudo-outcomes onto the initial model predictions.

Comparing Calibration Error in Quartiles

We assess calibration of both the pre- and post-calibrated models by approximating the actual target treatment effect with the quartiles of the models’ predictions. We let $\widehat{\theta}$ denote either the pre- or post-calibrated model (either CATE or conditional LATE). Re-using the calibration dataset (25% of the data), we compute the order statistics of model predictions $\widehat{\theta}(X)_{(1)},\dots,\widehat{\theta}(X)_{(N)}$. We then define four buckets based on the quartiles of the above order statistics, i.e.

\[B_{i}=\left\{x\in\mathcal{X}:\widehat{\theta}(x)\in\Big(\widehat{\theta}(X)_{(\lfloor iN/4\rfloor)},\widehat{\theta}(X)_{(\lfloor(i+1)N/4\rfloor)}\Big]\right\}\qquad\text{for }i=0,1,2,3\]

where we assume $\widehat{\theta}(X)_{(0)}=-\infty$ and $\widehat{\theta}(X)_{(N)}=\infty$. Next, we use cross-fitting with $K=5$ folds to transform the 15% of the data reserved for testing into pseudo-outcomes $\chi_{m}$ (this is done in the same manner as discussed in the previous subsection). We assign each transformed sample $(X_{m},\chi_{m})$ in the test set to an appropriate bin based on the predicted value of $\widehat{\theta}(X_{m})$, and average the pseudo-outcomes falling into each bin. We then approximate the $L^{2}$ calibration error, which is computed as $\widehat{\mathrm{Cal}}(\widehat{\theta},g_{0})=\frac{1}{4}\sum_{i=0}^{3}(\overline{\chi}_{i}-\overline{\theta}_{i})^{2}$, where $\overline{\theta}_{i}$ denotes the average of the $\widehat{\theta}(X_{m})$ falling into bin $i$ and $\overline{\chi}_{i}$ denotes the average pseudo-outcome $\chi_{m}$ in bin $i$.
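
The binned calibration-error computation just described can be sketched as follows. This is a hedged sketch rather than the evaluation code used for the experiments: the function name, the treatment of empty bins (they contribute zero), and the requirement of at least as many calibration predictions as bins are assumptions.

```python
from bisect import bisect_left


def binned_calibration_error(calib_preds, test_preds, test_pseudo, num_bins=4):
    """Average over bins of (mean pseudo-outcome - mean prediction)^2.

    Bin boundaries are order statistics of the calibration-set predictions;
    test points are assigned to half-open bins (lower cut, upper cut].
    """
    ranked = sorted(calib_preds)
    n = len(ranked)
    # Interior cut points: order statistics floor(i * N / num_bins), 1-indexed.
    cuts = [ranked[(i * n) // num_bins - 1] for i in range(1, num_bins)]
    totals = [[0.0, 0.0, 0] for _ in range(num_bins)]  # [sum theta, sum chi, count]
    for pred, chi in zip(test_preds, test_pseudo):
        b = bisect_left(cuts, pred)   # number of cut points strictly below pred
        totals[b][0] += pred
        totals[b][1] += chi
        totals[b][2] += 1
    # Empty bins contribute zero (an assumption; the text does not discuss them).
    return sum(((s_chi - s_theta) / c) ** 2
               for s_theta, s_chi, c in totals if c) / num_bins


# Toy usage: pseudo-outcomes are uniformly one unit above the predictions,
# so every bin has a gap of exactly one.
preds = [float(i) for i in range(1, 9)]
biased = binned_calibration_error(preds, preds, [p + 1.0 for p in preds])
perfect = binned_calibration_error(preds, preds, preds)
```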

Figure 1: (a) Calibration Error for CATE. (b) Calibration Error for LATE.

We repeat the above experiment over $M=100$ random splits of the initial 401(k) dataset. Figure 1 shows, as box-and-whisker plots, the empirical distribution of the $L^{2}$ calibration errors over these $M$ runs for the base (uncalibrated) model and for the model calibrated using each of the three aforementioned calibration algorithms. Each box is centered at the median calibration error $m$, the bottom of the box is given by the first quartile of calibration error $Q1$, and the top of the box is given by the third quartile $Q3$ of calibration error. The top (bottom) of the whiskers are given by the maximum (resp. minimum) of the observations falling within $m\pm 1.5\times(Q3-Q1)$.

From Figure 1, we see that all calibration algorithms result in noticeably smaller calibration error. In particular, in the setting of the CATE of 401(k) eligibility on net financial assets, the third quartile of the calibration error using isotonic calibration and linear calibration falls below the first quartile of calibration error for the uncalibrated model. Likewise, in the LATE model, which aims to measure the effect of 401(k) participation on net financial assets, the third quartile of calibration error under histogram binning and linear calibration falls below the first quartile of calibration error for the uncalibrated model. This indicates that our calibration algorithms have a significant impact on the calibration error of the produced models.

5.2 Calibrating Estimates of Conditional Quantiles Under Treatment

We now show how Algorithm 4 (cross-calibration for conditionally orthogonal loss functions) can be used to calibrate estimates of the conditional $Q$th quantile under treatment. In particular, we study the impact that sample size and the chosen quantile have on both $L^{2}$ calibration error and average loss.

For this experiment, we leverage synthetically generated data. At the start of the experiment, we generate two slope vectors $\beta^{Y},\beta^{\pi}\in\mathbb{R}^{100}$, where $\beta^{Y}_{1:20},\beta^{\pi}_{1:20}\sim\mathcal{N}(0,I_{20})$ are i.i.d. and $\beta^{Y}_{i}=\beta^{\pi}_{i}=0$ for all $i>20$. Then, for $N\in\{500,1000,1500,2000,2500,3000\}$, we generate i.i.d. covariates $X_{1},\dots,X_{N}\sim\mathcal{N}(0,I_{100})$ and treatments $A_{1},\dots,A_{N}$ where $A_{i}\sim\mathrm{Bern}(p_{i})$ and

\[p_{i}:=\max\left\{C,\min\left\{\frac{1}{1+\exp(\langle\beta^{\pi},X_{i}\rangle)},1-C\right\}\right\}.\]

In the above, $C=0.05$ is a fixed clipping parameter. Finally, we generate potential outcomes under treatment and control as $Y_{i}(1)=Y_{i}(0)=\langle\beta^{Y},X_{i}\rangle+\epsilon_{i}$, where $\epsilon_{1},\dots,\epsilon_{N}$ are i.i.d. standard normal noise variables. We note that, because we are only interested in the conditional quantile under treatment, we make the potential outcomes identical under treatment and control for simplicity.
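
The data-generating process just described can be sketched with the Python standard library as follows. The function names, the seed, and the tuple layout `(x, a, y, p)` are assumptions for illustration; the sketch follows the formulas as written, including the sign convention in the propensity.

```python
import math
import random


def draw_slopes(dim=100, active=20, rng=None):
    """Slope vector with standard normal entries on the first `active` coordinates."""
    rng = rng if rng is not None else random.Random(0)
    return [rng.gauss(0.0, 1.0) if i < active else 0.0 for i in range(dim)]


def generate_data(num_samples, beta_y, beta_pi, clip=0.05, rng=None):
    """Draw (x, a, y, p): covariates, treatment, outcome, clipped propensity."""
    rng = rng if rng is not None else random.Random(0)
    data = []
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(len(beta_y))]
        score = sum(b * xi for b, xi in zip(beta_pi, x))
        # Clipped propensity: max{C, min{1 / (1 + exp(<beta_pi, x>)), 1 - C}}.
        p = max(clip, min(1.0 / (1.0 + math.exp(score)), 1.0 - clip))
        a = 1 if rng.random() < p else 0
        # Y(1) = Y(0), so the observed outcome does not depend on the treatment.
        y = sum(b * xi for b, xi in zip(beta_y, x)) + rng.gauss(0.0, 1.0)
        data.append((x, a, y, p))
    return data


rng = random.Random(0)
beta_y, beta_pi = draw_slopes(rng=rng), draw_slopes(rng=rng)
data = generate_data(500, beta_y, beta_pi, rng=rng)
```

Returning the true clipped propensity alongside each sample is a convenience for evaluating losses at the true nuisances; an estimator would of course only see `(x, a, y)`.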

Given data as generated above, we then train an initial model $\widehat{\theta}$ using gradient-boosted regression trees. In more detail, in training these regression trees, we leverage the loss $\widetilde{\ell}_{\mathrm{QUT}}(\theta,p;z):=a\cdot p(x)(y-\theta(x))(Q-\mathbb{1}_{y\leq\theta(x)})$, where we determine the nuisances using $K=5$ fold cross-fitting, i.e. we estimate the propensity $\pi_{0}(x):=\mathbb{P}(A=1\mid X=x)\equiv p_{0}(x)^{-1}$ using gradient-boosted decision trees. We note that this loss is Neyman orthogonal conditional on covariates $X$, and thus using this loss allows for efficient estimation of the quantile under treatment. However, this loss is not conditionally orthogonal, and thus is a poor fit for performing calibration.
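
For concreteness, the inverse-propensity-weighted pinball loss $\widetilde{\ell}_{\mathrm{QUT}}$ above can be evaluated as in the following sketch. The function names and argument order are assumptions; `inv_propensity` stands for $p(x)$, the (estimated or true) inverse propensity.

```python
def quantile_loss_under_treatment(pred, inv_propensity, a, y, q):
    """a * p(x) * (y - pred) * (q - 1{y <= pred}) for a single observation."""
    indicator = 1.0 if y <= pred else 0.0
    return a * inv_propensity * (y - pred) * (q - indicator)


def average_loss(preds, inv_propensities, treatments, outcomes, q):
    """Sample average of the weighted pinball loss."""
    terms = [quantile_loss_under_treatment(v, p, a, y, q)
             for v, p, a, y in zip(preds, inv_propensities, treatments, outcomes)]
    return sum(terms) / len(terms)


# Under-prediction is penalized at rate q, over-prediction at rate 1 - q,
# and untreated units (a = 0) contribute nothing.
under = quantile_loss_under_treatment(1.0, 2.0, 1, 3.0, 0.9)
over = quantile_loss_under_treatment(3.0, 2.0, 1, 1.0, 0.9)
untreated = quantile_loss_under_treatment(1.0, 2.0, 0, 3.0, 0.9)
```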

We then use an additional $N$ samples generated in the same manner as above to perform calibration using Algorithm 4 and loss function $\ell_{\mathrm{QUT}}(\theta,(p,f);z)$ (outlined in Equation (6)). We estimate the inverse propensity $p_{0}(x)$ by using gradient-boosted decision trees to estimate the propensity $\pi_{0}(x)=p_{0}(x)^{-1}$. We estimate the additional CDF-like nuisance $f_{\widehat{\theta}}(x):=\mathbb{P}(Y(1)\leq\gamma_{\widehat{\theta}}^{\ast}(X)\mid X=x)$ by instead estimating $\mathbb{P}(Y(1)\leq\widehat{\theta}(X)\mid X=x)$, again using gradient-boosted decision trees. Heuristically, we are hoping that $\widehat{\theta}$ is a “reasonable rough estimate” for the calibration function $\gamma_{\widehat{\theta}}^{\ast}$. After we estimate the nuisances, we run the cross-calibration algorithm using linear calibration. We repeat the entire above process $M=50$ times for three quantile values $Q\in\{0.5,0.75,0.9\}$, and plot both the empirical $L^{2}$ calibration errors and average losses below (measuring loss according to $\widetilde{\ell}_{\mathrm{QUT}}(\theta,p_{0};z)$, i.e. the loss evaluated at the true nuisances).

Results:

Figure 2 displays the results of the experimental procedure outlined above. Displayed in the left-hand column are plots demonstrating the empirical $L^{2}$ calibration error at various sample sizes, averaged over the $M=50$ runs. Likewise, in the right-hand column, the sample loss is displayed, averaged again over $M$ runs. We include point-wise valid 95% confidence intervals for all plots.

Regardless of the sample size $N$ and chosen quantile $Q$, we see a significant decrease in the $L^{2}$ calibration error. This shows that not only does Algorithm 4 exhibit favorable theoretical convergence guarantees, but that it also offers strong performance in practice. Typically, one hopes calibration algorithms enjoy a “do no harm” property, i.e. that calibrating a parameter estimate will not significantly increase loss. While we do not formally prove this, the right-hand column of Figure 2 demonstrates that calibrating typically decreases loss, as desired. Moreover, the loss obtained by using $N$ samples to construct an initial estimate and $N$ to perform calibration is comparable to the loss had $2N$ samples directly been used to estimate the unknown conditional parameter. This suggests that reserving some samples for calibration yields significantly lower calibration error without adversely affecting performance.

Figure 2: (a) Calibration Error for $Q=0.6$. (b) Average Loss for $Q=0.6$. (c) Calibration Error for $Q=0.75$. (d) Average Loss for $Q=0.75$. (e) Calibration Error for $Q=0.9$. (f) Average Loss for $Q=0.9$. We plot the performance of Algorithm 4 in calibrating estimates of conditional quantiles under treatment using linear calibration. We display both the sample $L^{2}$ calibration error and the average loss for $N\in\{500,1000,1500,2000,2500,3000\}$ and $Q\in\{0.6,0.75,0.9\}$ (where an additional $N$ samples are used for calibration). We also display corresponding 95% pointwise-valid confidence intervals. Cross-calibration not only decreases calibration error (as expected), but also decreases loss.

6 Conclusion

In this work, we constructed a framework for calibrating general estimates of heterogeneous causal effects with respect to nuisance-dependent losses. By leveraging variants of Neyman orthogonality, we were able to bound the $L^{2}$ calibration error of $\theta$ by two decoupled terms. One term, roughly, represented the error in nuisance estimation, while the other term represented the $L^{2}$ calibration error of $\theta$ in a world where the learned nuisances were true. These bounds suggested natural sample-splitting and cross-calibration algorithms, for which we proved high-probability convergence guarantees. Our algorithms also admitted “do no harm” style guarantees. We empirically evaluated our algorithms in Section 5, in which we considered both observational and synthetic data.

While our provided contributions are quite general, there are still interesting directions for future work. First, in our work, we only measure the convergence of our algorithms via the $L^{2}$ calibration error. Depending on the situation, other notions of calibration error may be more appropriate. For instance, Gupta and Ramdas [25] analyze the convergence of histogram/uniform mass binning in terms of $L^{\infty}$ calibration error. Likewise, Globus-Harris et al. [21] study $L^{1}$ multi-calibration error. We leave it as interesting future work to extend our results to the general setting of measuring $L^{p}$ calibration error. Further, it would be interesting to study how the calibration of estimates of heterogeneous causal effects impacts utility in decision-making tasks.

7 Acknowledgments

VS is supported by NSF Award IIS 2337916.

References

  • Abrevaya et al. [2015] Jason Abrevaya, Yu-Chin Hsu, and Robert P Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015.
  • Barlow and Brunk [1972] Richard E Barlow and Hugh D Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
  • Bickel et al. [1993] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Springer, 1993.
  • Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • Chernozhukov and Hansen [2004] Victor Chernozhukov and Christian Hansen. The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. Review of Economics and Statistics, 86(3):735–751, 2004.
  • Chernozhukov et al. [2018] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • Chernozhukov et al. [2022a] Victor Chernozhukov, Whitney Newey, Vıctor M Quintas-Martınez, and Vasilis Syrgkanis. Riesznet and forestriesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning, pages 3901–3914. PMLR, 2022a.
  • Chernozhukov et al. [2022b] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Automatic debiased machine learning of causal and structural effects. Econometrica, 90(3):967–1027, 2022b.
  • Chernozhukov et al. [2022c] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Debiased machine learning of global and local parameters using regularized riesz representers. The Econometrics Journal, 25(3):576–601, 2022c.
  • Chernozhukov et al. [2024] Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. Applied causal inference powered by ml and ai. arXiv preprint arXiv:2403.02467, 2024.
  • Chung et al. [2023] Youngseog Chung, Aaron Rumack, and Chirag Gupta. Parity calibration. arXiv preprint arXiv:2305.18655, 2023.
  • Dawid [1982] A Philip Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
  • Dawid [1985] A Philip Dawid. Calibration-based empirical probability. The Annals of Statistics, 13(4):1251–1274, 1985.
  • Fan et al. [2022] Qingliang Fan, Yu-Chin Hsu, Robert P Lieli, and Yichong Zhang. Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, 40(1):313–327, 2022.
  • Feng et al. [2012] Ping Feng, Xiao-Hua Zhou, Qing-Ming Zou, Ming-Yu Fan, and Xiao-Song Li. Generalized propensity score for estimating the average treatment effect of multiple treatments. Statistics in medicine, 31(7):681–697, 2012.
  • Foster and Vohra [1999] Dean P Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999.
  • Foster and Vohra [1998] Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
  • Foster and Syrgkanis [2023] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.
  • Frölich and Melly [2010] Markus Frölich and Blaise Melly. Estimation of quantile treatment effects with stata. The Stata Journal, 10(3):423–457, 2010.
  • Gao et al. [2024] Chen Gao, Yu Zheng, Wenjie Wang, Fuli Feng, Xiangnan He, and Yong Li. Causal inference in recommender systems: A survey and future directions. ACM Transactions on Information Systems, 42(4):1–32, 2024.
  • Globus-Harris et al. [2023] Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica Sorrell. Multicalibration as boosting for regression. arXiv preprint arXiv:2301.13767, 2023.
  • Gopalan et al. [2024] Parikshit Gopalan, Michael Kim, and Omer Reingold. Swap agnostic learning, or characterizing omniprediction via multicalibration. Advances in Neural Information Processing Systems, 36, 2024.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017.
  • Gupta [2022] Chirag Gupta. Post-hoc calibration without distributional assumptions. PhD thesis, Carnegie Mellon University, Pittsburgh, PA 15213, USA, 2022.
  • Gupta and Ramdas [2021] Chirag Gupta and Aaditya Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, pages 3942–3952. PMLR, 2021.
  • Gupta and Ramdas [2023] Chirag Gupta and Aaditya Ramdas. Online platt scaling with calibeating. arXiv preprint arXiv:2305.00070, 2023.
  • Gupta et al. [2020] Chirag Gupta, Aleksandr Podkopaev, and Aaditya Ramdas. Distribution-free binary classification: prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems, 33:3711–3723, 2020.
  • Hasminskii and Ibragimov [1979] Rafail Z Hasminskii and Ildar A Ibragimov. On the nonparametric estimation of functionals. In Proceedings of the Second Prague Symposium on Asymptotic Statistics, volume 473, pages 474–482. North-Holland Amsterdam, 1979.
  • Hébert-Johnson et al. [2018] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
  • Heckman and Vytlacil [2005] James J Heckman and Edward Vytlacil. Structural equations, treatment effects, and econometric policy evaluation 1. Econometrica, 73(3):669–738, 2005.
  • Hendrycks et al. [2019] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019.
  • Hitsch et al. [2024] Günter J Hitsch, Sanjog Misra, and Walter W Zhang. Heterogeneous treatment effects and optimal targeting policy evaluation. Quantitative Marketing and Economics, 22(2):115–168, 2024.
  • Kallus et al. [2019] Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. arXiv preprint arXiv:1912.12945, 2019.
  • Kennedy [2022] Edward H Kennedy. Semiparametric doubly robust targeted double machine learning: a review. arXiv preprint arXiv:2203.06469, 2022.
  • Kennedy [2023] Edward H Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.
  • Kennedy et al. [2023] Edward H Kennedy, Sivaraman Balakrishnan, and LA Wasserman. Semiparametric counterfactual density estimation. Biometrika, 110(4):875–896, 2023.
  • Kent et al. [2018] David M Kent, Ewout Steyerberg, and David Van Klaveren. Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. BMJ, 363, 2018.
  • Kumar et al. [2019] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  • Lan and Syrgkanis [2023] Hui Lan and Vasilis Syrgkanis. Causal q-aggregation for cate model selection. arXiv preprint arXiv:2310.16945, 2023.
  • Leng and Dimmery [2024] Yan Leng and Drew Dimmery. Calibration of heterogeneous treatment effects in randomized experiments. Information Systems Research, 2024.
  • Malinin and Gales [2018] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018.
  • Minderer et al. [2021] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694, 2021.
  • Neyman [1979] Jerzy Neyman. C($\alpha$) tests and their use. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 41(1/2):1–21, 1979. ISSN 0581572X. URL http://www.jstor.org/stable/25050174.
  • Niculescu-Mizil and Caruana [2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632, 2005.
  • Nie and Wager [2021] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
  • Noarov and Roth [2023] Georgy Noarov and Aaron Roth. The scope of multicalibration: Characterizing multicalibration via property elicitation. arXiv preprint arXiv:2302.08507, 2023.
  • Noarov et al. [2023] Georgy Noarov, Ramya Ramalingam, Aaron Roth, and Stephan Xie. High-dimensional prediction for sequential decision making. arXiv preprint arXiv:2310.17651, 2023.
  • Oprescu et al. [2019] Miruna Oprescu, Vasilis Syrgkanis, and Zhiwei Steven Wu. Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941. PMLR, 2019.
  • Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  • Platt et al. [1999] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • Sahoo et al. [2021] Roshni Sahoo, Shengjia Zhao, Alyssa Chen, and Stefano Ermon. Reliable decisions with threshold calibration. Advances in Neural Information Processing Systems, 34:1831–1844, 2021.
  • Semenova and Chernozhukov [2021] Vira Semenova and Victor Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2):264–289, 2021.
  • Song et al. [2019] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. PMLR, 2019.
  • Syrgkanis et al. [2019] Vasilis Syrgkanis, Victor Lei, Miruna Oprescu, Maggie Hei, Keith Battocchi, and Greg Lewis. Machine learning estimation of heterogeneous treatment effects with instruments. Advances in Neural Information Processing Systems, 32, 2019.
  • Thulasidasan et al. [2019] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. Advances in neural information processing systems, 32, 2019.
  • Tsybakov [2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, 2009. ISBN 978-0-387-79051-0. doi: 10.1007/B13794. URL https://doi.org/10.1007/b13794.
  • van der Laan et al. [2023] Lars van der Laan, Ernesto Ulloa-Pérez, Marco Carone, and Alex Luedtke. Causal isotonic calibration for heterogeneous treatment effects. arXiv preprint arXiv:2302.14011, 2023.
  • Vovk [2002] Vladimir Vovk. On-line confidence machines are well-calibrated. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings., pages 187–196. IEEE, 2002.
  • Wainwright [2019] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wilder and Welle [2024] Bryan Wilder and Pim Welle. Learning treatment effects while treating those in need. arXiv preprint arXiv:2407.07596, 2024.
  • Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616, 2001.
  • Zadrozny and Elkan [2002] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694–699, 2002.

Appendix A Calibration Error Decomposition Proofs

In this appendix, we prove the main error decompositions from Sections 3 and 4. These results provide two-term, decoupled bounds on the $L^{2}$ calibration error of an arbitrary, fixed parameter $\theta:\mathcal{X}\rightarrow\mathbb{R}$: one term is the $L^{2}$ calibration error computed as if the learned nuisances were correct, and the other measures the distance between the learned nuisances and the true, unknown nuisance parameters.

A.1 Universal Orthogonality

We start by proving Theorem 3.3, which provides the claimed decoupled bound under the assumption that the loss $\ell$ is universally orthogonal (Definition 3.1). See Theorem 3.3 in the main text for the statement.

Proof of Theorem 3.3.

We start by adding a useful form of zero to the integrand, which yields

$$
\begin{aligned}
\mathrm{Cal}(\theta,g_{0})^{2} &= \mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]^{2}\right)\\
&= \underbrace{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]\cdot\left\{\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]-\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]\right\}\right)}_{T_{1}}\\
&\qquad+\underbrace{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]\cdot\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]\right)}_{T_{2}}.
\end{aligned}
$$

We bound $T_{1}$ and $T_{2}$ separately. As a first step in bounding $T_{1}$, note that by a second-order Taylor expansion with respect to the nuisance estimate $g\in\mathcal{G}$ with Lagrange remainder, we have

$$
\begin{aligned}
&-\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]+\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]\\
&\qquad=D_{g}\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right](g-g_{0})+\frac{1}{2}D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,\overline{g};Z)\mid\theta(X)\right](g-g_{0},g-g_{0})\\
&\qquad=\frac{1}{2}D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,\overline{g};Z)\mid\theta(X)\right](g-g_{0},g-g_{0}).
\end{aligned}
$$

In the above, $\overline{g}\in[g_{0},g]$, and the first-order derivative (with respect to $g$) vanishes by universal orthogonality (Definition 3.1), since we have Taylor expanded around the true, unknown nuisance $g_{0}$.

With this, we can apply the Cauchy-Schwarz inequality, which furnishes

$$
\begin{aligned}
T_{1} &\leq\frac{1}{2}\sqrt{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]^{2}\right)}\sqrt{\mathbb{E}\left(\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,\overline{g};Z)\mid\theta(X)\right](g-g_{0},g-g_{0})\right\}^{2}\right)}\\
&\leq\frac{1}{2}\sqrt{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]^{2}\right)}\sqrt{\mathbb{E}\left(\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,\overline{g};Z)\mid X\right](g-g_{0},g-g_{0})\right\}^{2}\right)}\\
&\leq\frac{1}{2}\mathrm{Cal}(\theta,g_{0})\cdot\mathrm{err}(g,g_{0};\theta).
\end{aligned}
$$

In the second line, we use the tower property to rewrite the conditional expectation given $\theta(X)$ in terms of the conditional expectation given $X$, and then apply Jensen's inequality inside the conditional expectation. The third line follows from the definitions of $\mathrm{Cal}(\theta,g_{0})$ and $\mathrm{err}(g,g_{0};\theta)$.

Bounding $T_{2}$ is more straightforward. Applying the Cauchy-Schwarz inequality, we have:

$$
\begin{aligned}
T_{2} &\leq\sqrt{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{0};Z)\mid\theta(X)\right]^{2}\right)}\sqrt{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]^{2}\right)}\\
&=\mathrm{Cal}(\theta,g_{0})\cdot\mathrm{Cal}(\theta,g).
\end{aligned}
$$

This line of reasoning, in total, yields that

$$\mathrm{Cal}(\theta,g_{0})^{2}\leq\frac{1}{2}\mathrm{Cal}(\theta,g_{0})\cdot\mathrm{err}(g,g_{0};\theta)+\mathrm{Cal}(\theta,g_{0})\cdot\mathrm{Cal}(\theta,g).$$

If $\mathrm{Cal}(\theta,g_{0})=0$, the claimed bound holds trivially. Otherwise, dividing through by $\mathrm{Cal}(\theta,g_{0})$ yields the claimed bound. ∎

A.2 Conditional Orthogonality

We now turn to proving the second error bound, which holds when $\ell$ satisfies the weaker assumption of conditional orthogonality (Definition 4.3). We generically write $g=(\eta,\zeta)$ for a fixed pair of nuisance parameters. Further, for any fixed post-processing function $\varphi:\mathcal{X}\rightarrow\mathcal{X}^{\prime}$, we write $g_{\varphi}=(\eta_{0},\zeta_{\varphi})$, where $\zeta_{\varphi}$ is a nuisance parameter ensuring the vanishing cross-derivative condition. To prove Theorem 4.5, we will need two technical lemmas. In what follows, we remind the reader that $\gamma_{\varphi}:\mathcal{X}\times\mathcal{G}\rightarrow\mathbb{R}$ denotes the calibration function under the orthogonalized loss $\ell$, i.e. $\gamma_{\varphi}$ is specified by

$$\gamma_{\varphi}(x;g):=\arg\min_{\nu\in\mathbb{R}}\mathbb{E}[\ell(\nu,g;Z)\mid\varphi(X)=\varphi(x)].$$

We recall the identity $\gamma_{\varphi}^{\ast}\equiv\gamma_{\varphi}(\cdot;(\eta_{0},\zeta))$ for any estimate $\zeta$ of the second nuisance parameter, which will be useful in the sequel.

The first lemma measures the distance (in the $L^{2}(P_{X})$ norm) between the true calibration function $\gamma_{\theta}^{\ast}$ and the calibration function $\gamma_{\theta}(\cdot;g)$ under any other nuisance pair $g=(\eta,\zeta)$. Here, $\theta:\mathcal{X}\rightarrow\mathbb{R}$ should be viewed as an arbitrary estimator. We bound this distance in terms of the error term first introduced in Theorem 3.3. Although this term looks complicated, it often simplifies considerably, as seen earlier when we computed it for the task of calibrating estimates of conditional $Q$th quantiles under treatment.

Lemma A.1.

Let $\theta:\mathcal{X}\rightarrow\mathbb{R}$ be an arbitrary function, assume $\ell$ is conditionally orthogonal, and let $g_{\theta}=(\eta_{0},\zeta_{\theta})$ be a pair of nuisance functions guaranteeing the vanishing cross-derivative condition in Definition 4.3. Let $g=(\eta,\zeta)$ be some fixed pair of nuisance functions. Then, assuming the base loss $\ell$ satisfies $\alpha$-strong convexity (Assumption 3), we have

$$\|\gamma_{\theta}^{\ast}-\gamma_{\theta}(\cdot;g)\|_{L^{2}(P_{X})}\leq\frac{1}{2\alpha}\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast}),$$

where we define $\mathrm{err}(g,h;\varphi):=\sup_{f\in[g,h]}\sqrt{\mathbb{E}\left(\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\varphi,f;Z)\mid X\right](g-h,g-h)\right\}^{2}\right)}$.

Proof.

First, strong convexity (Assumption 3), in the equivalent form $\alpha(x-y)^{2}\leq(\partial f(x)-\partial f(y))(x-y)$, yields:

$$
\begin{aligned}
&\alpha\left(\gamma_{\theta}(X;g_{\theta})-\gamma_{\theta}(X;g)\right)^{2}\\
&\qquad\leq\left(\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g;Z)\mid\theta(X)\right]-\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g),g;Z)\mid\theta(X)\right]\right)\left(\gamma_{\theta}(X;g_{\theta})-\gamma_{\theta}(X;g)\right)\\
&\qquad=\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g;Z)\mid\theta(X)\right]\left(\gamma_{\theta}(X;g_{\theta})-\gamma_{\theta}(X;g)\right).
\end{aligned}
$$

In the above, the equality on the third line follows from the definition of $\gamma_{\theta}(\cdot;g)$, as first-order optimality conditions on $\gamma_{\theta}(x;g)=\arg\min_{\nu\in\mathbb{R}}\mathbb{E}\left[\ell(\nu,g;Z)\mid\theta(X)\right]$ imply $\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g),g;Z)\mid\theta(X)\right]=0$.

Rearranging the above inequality and taking absolute values yields

$$\alpha\left|\gamma_{\theta}(X;g_{\theta})-\gamma_{\theta}(X;g)\right|\leq\left|\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g;Z)\mid\theta(X)\right]\right|.$$

Next, observe that from the condition $\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g_{\theta};Z)\mid\theta(X)\right]=0$, alongside a second-order Taylor expansion (with respect to nuisance pairs $g$) with Lagrange form remainder plus conditional orthogonality (Definition 4.3), we have

$$
\begin{aligned}
&\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g;Z)\mid\theta(X)\right]\\
&\qquad=\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g;Z)\mid\theta(X)\right]-\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g_{\theta};Z)\mid\theta(X)\right]\\
&\qquad=D_{g}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),g_{\theta};Z)\mid\theta(X)\right](g-g_{\theta})+\frac{1}{2}D_{g}^{2}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),\overline{g};Z)\mid\theta(X)\right](g-g_{\theta},g-g_{\theta})\\
&\qquad=\frac{1}{2}D_{g}^{2}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),\overline{g};Z)\mid\theta(X)\right](g-g_{\theta},g-g_{\theta})\\
&\qquad=\frac{1}{2}\mathbb{E}\left(D_{g}^{2}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),\overline{g};Z)\mid X\right](g-g_{\theta},g-g_{\theta})\mid\theta(X)\right),
\end{aligned}
$$

where $\overline{g}\in[g,g_{\theta}]$ (here, $\overline{g}\in[g,g_{\theta}]$ indicates $\overline{g}=\lambda g+(1-\lambda)g_{\theta}$ for some $\lambda\in[0,1]$). Consequently, we have

$$
\begin{aligned}
&\|\gamma_{\theta}(X;g_{\theta})-\gamma_{\theta}(X;g)\|_{L^{2}(P_{X})}\\
&\qquad=\left(\int_{\mathcal{X}}\left|\gamma_{\theta}(x;g_{\theta})-\gamma_{\theta}(x;g)\right|^{2}P_{X}(dx)\right)^{1/2}\\
&\qquad\leq\frac{1}{\alpha}\left(\int_{\mathcal{X}}\mathbb{E}\left[\partial\ell(\gamma_{\theta}^{\ast},g;Z)\mid\theta(X)=\theta(x)\right]^{2}P_{X}(dx)\right)^{1/2}\\
&\qquad\leq\frac{1}{2\alpha}\left(\int_{\mathcal{X}}\mathbb{E}\left(D_{g}^{2}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),\overline{g};Z)\mid X\right](g-g_{\theta},g-g_{\theta})\mid\theta(X)=\theta(x)\right)^{2}P_{X}(dx)\right)^{1/2}\\
&\qquad\leq\frac{1}{2\alpha}\left(\int_{\mathcal{X}}\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g_{\theta}),\overline{g};Z)\mid X=x\right](g-g_{\theta},g-g_{\theta})\right\}^{2}P_{X}(dx)\right)^{1/2}\\
&\qquad\leq\frac{1}{2\alpha}\mathrm{err}(g,g_{\theta};\gamma_{\theta}(\cdot;g_{\theta})).
\end{aligned}
$$

Noting the identity $\gamma_{\theta}(\cdot;g_{\theta})\equiv\gamma_{\theta}^{\ast}$ proves the claimed result. ∎
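The equivalent characterization of strong convexity used in the first step of the proof above can be verified in two lines. The derivation below is a sketch that assumes $\alpha$-strong convexity of a differentiable $f:\mathbb{R}\rightarrow\mathbb{R}$ is understood in the usual first-order sense; Assumption 3 in the main text may be phrased differently (for example, as a lower bound on the second derivative), in which case the same conclusion follows from the mean value theorem.

```latex
% alpha-strong convexity (first-order form), applied at (x, y) and at (y, x):
\begin{align*}
f(y) &\geq f(x) + \partial f(x)(y - x) + \tfrac{\alpha}{2}(y - x)^{2},\\
f(x) &\geq f(y) + \partial f(y)(x - y) + \tfrac{\alpha}{2}(x - y)^{2}.
\end{align*}
% Adding the two inequalities and cancelling f(x) + f(y) on both sides:
\begin{equation*}
0 \geq -\bigl(\partial f(x) - \partial f(y)\bigr)(x - y) + \alpha(x - y)^{2}
\quad\Longleftrightarrow\quad
\alpha(x - y)^{2} \leq \bigl(\partial f(x) - \partial f(y)\bigr)(x - y).
\end{equation*}
```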

The second lemma bounds the $L^{2}(P_{X})$ distance between the parameter estimate $\theta$ and the calibration function $\gamma_{\theta}(\cdot;g)$ under a fixed pair of nuisances $g=(\eta,\zeta)$ in terms of the calibration error.

Lemma A.2.

Let $\theta:\mathcal{X}\rightarrow\mathbb{R}$ be a fixed estimator, $g=(\eta,\zeta)\in\mathcal{G}$ an arbitrary, fixed nuisance pair, and $\gamma_{\theta}:\mathcal{X}\times\mathcal{G}\rightarrow\mathbb{R}$ the calibration function associated with $\theta$. Assume $\ell$ is $\alpha$-strongly convex (Assumption 3). We have

$$\left\|\theta-\gamma_{\theta}(\cdot;g)\right\|_{L^{2}(P_{X})}\leq\frac{1}{\alpha}\mathrm{Cal}(\theta,g).$$
Proof.

First, observe that from strong convexity (as in the proof of the above lemma), we have

$$
\begin{aligned}
&\alpha\left(\theta(X)-\gamma_{\theta}(X;g)\right)^{2}\\
&\qquad\leq\left(\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]-\mathbb{E}\left[\partial\ell(\gamma_{\theta}(\cdot;g),g;Z)\mid\theta(X)\right]\right)\left(\theta(X)-\gamma_{\theta}(X;g)\right)\\
&\qquad=\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]\left(\theta(X)-\gamma_{\theta}(X;g)\right).
\end{aligned}
$$

Thus, dividing through and taking the absolute value yields:

$$\left|\theta(X)-\gamma_{\theta}(X;g)\right|\leq\frac{1}{\alpha}\left|\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]\right|.$$

We now integrate to get the desired result. In particular, we have that

$$
\begin{aligned}
\|\theta-\gamma_{\theta}(\cdot;g)\|_{L^{2}(P_{X})} &=\left(\int_{\mathcal{X}}\left|\theta(x)-\gamma_{\theta}(x;g)\right|^{2}P_{X}(dx)\right)^{1/2}\\
&\leq\frac{1}{\alpha}\left(\int_{\mathcal{X}}\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)=\theta(x)\right]^{2}P_{X}(dx)\right)^{1/2}\\
&=\frac{1}{\alpha}\mathrm{Cal}(\theta,g).
\end{aligned}
$$

∎
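As a sanity check, the bound of Lemma A.2 is attained with equality for the squared loss. The worked example below assumes a pseudo-outcome loss of the form $\ell(\nu,g;z)=\frac{1}{2}(\nu-\chi(g;z))^{2}$, for which $\alpha=1$; it is an illustration under that assumption rather than part of the proof.

```latex
\begin{align*}
\partial\ell(\nu,g;z) &= \nu-\chi(g;z), \qquad \partial^{2}\ell(\nu,g;z)=1 \;\Rightarrow\; \alpha=1,\\
\gamma_{\theta}(x;g) &= \arg\min_{\nu\in\mathbb{R}}\mathbb{E}\left[\tfrac{1}{2}(\nu-\chi(g;Z))^{2}\mid\theta(X)=\theta(x)\right]
  = \mathbb{E}\left[\chi(g;Z)\mid\theta(X)=\theta(x)\right],\\
\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right] &= \theta(X)-\gamma_{\theta}(X;g),\\
\mathrm{Cal}(\theta,g) &= \sqrt{\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g;Z)\mid\theta(X)\right]^{2}\right)}
  = \left\|\theta-\gamma_{\theta}(\cdot;g)\right\|_{L^{2}(P_{X})}.
\end{align*}
```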

With the above two lemmas in hand, we can now prove Theorem 4.5, which we recall gives a decoupled bound on the calibration error of a parameter $\theta$ with respect to a conditionally orthogonal loss function $\ell$.

See Theorem 4.5 in the main text for the statement.

Proof of Theorem 4.5.

First, we note that by Corollary 4.4, we have that

$$\mathrm{Cal}(\theta,\eta_{0})\equiv\mathrm{Cal}(\theta,(\eta_{0},\zeta))$$

for any second nuisance parameter $\zeta$, so it suffices to bound $\mathrm{Cal}(\theta,g_{\theta})$ where $g_{\theta}=(\eta_{0},\zeta_{\theta})$. We have:

$$
\begin{aligned}
\mathbb{E}\left[\partial\ell(\theta,g_{\theta};Z)\mid\theta(X)\right] &=\mathbb{E}\left[\partial\ell(\theta,g_{\theta};Z)\mid\theta(X)\right]-\mathbb{E}\left[\partial\ell(\gamma_{\theta}^{\ast},g_{\theta};Z)\mid\theta(X)\right]\\
&=\mathbb{E}\left[\partial^{2}\ell(\overline{\theta},g_{\theta};Z)\mid\theta(X)\right](\theta(X)-\gamma_{\theta}^{\ast}(X))\\
&=\mathbb{E}\left[\partial^{2}\ell(\overline{\theta},g_{\theta};Z)\mid\theta(X)\right](\theta(X)-\gamma_{\theta}(X;g))\\
&\quad+\mathbb{E}\left[\partial^{2}\ell(\overline{\theta},g_{\theta};Z)\mid\theta(X)\right](\gamma_{\theta}(X;g)-\gamma_{\theta}^{\ast}(X)),
\end{aligned}
$$

where the first equality follows from the first-order optimality condition $\mathbb{E}\left[\partial\ell(\gamma_{\theta}^{\ast},g_{\theta};Z)\mid\theta(X)\right]=0$, which holds because $\gamma_{\theta}(x;(\eta_{0},\zeta))=\gamma_{\theta}^{\ast}(x)$ regardless of the choice of the additional nuisance $\zeta$; the second from a first-order Taylor expansion with Lagrange form remainder in $\theta(X)$ (here $\overline{\theta}\in[\theta,\gamma_{\theta}^{\ast}]$); and the final equality from adding and subtracting $\gamma_{\theta}(X;g)$. Thus, we have

$$
\begin{aligned}
&\mathrm{Cal}(\theta,g_{\theta})^{2}=\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{\theta};Z)\mid\theta(X)\right]^{2}\right)\\
&\qquad=\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{\theta};Z)\mid\theta(X)\right]\cdot\mathbb{E}\left[\partial^{2}\ell(\overline{\theta},g_{\theta};Z)\mid\theta(X)\right]\cdot(\theta(X)-\gamma_{\theta}(X;g))\right)\\
&\qquad\quad+\mathbb{E}\left(\mathbb{E}\left[\partial\ell(\theta,g_{\theta};Z)\mid\theta(X)\right]\cdot\mathbb{E}\left[\partial^{2}\ell(\overline{\theta},g_{\theta};Z)\mid\theta(X)\right]\cdot(\gamma_{\theta}(X;g)-\gamma_{\theta}^{\ast}(X))\right)\\
&\qquad\leq\beta\,\mathrm{Cal}(\theta,g_{\theta})\sqrt{\mathbb{E}\left[\left(\theta(X)-\gamma_{\theta}(X;g)\right)^{2}\right]}+\beta\,\mathrm{Cal}(\theta,g_{\theta})\sqrt{\mathbb{E}\left[\left(\gamma_{\theta}(X;g)-\gamma_{\theta}^{\ast}(X)\right)^{2}\right]}\\
&\qquad=\beta\,\mathrm{Cal}(\theta,g_{\theta})\left\{\left\|\theta-\gamma_{\theta}(\cdot;g)\right\|_{L^{2}(P_{X})}+\left\|\gamma_{\theta}(\cdot;g)-\gamma_{\theta}^{\ast}\right\|_{L^{2}(P_{X})}\right\},
\end{aligned}
$$

where the inequality uses the Cauchy-Schwarz inequality and the upper bound $\beta$ on $\partial^{2}\ell$. Now, dividing through by $\mathrm{Cal}(\theta,g_{\theta})$, plugging in the bounds provided by Lemma A.1 and Lemma A.2, and again leveraging the equivalence $\mathrm{Cal}(\theta,\eta_{0})=\mathrm{Cal}(\theta,g_{\theta})$, we have

$$\mathrm{Cal}(\theta,\eta_{0})\leq\frac{\beta}{\alpha}\mathrm{Cal}(\theta,g)+\frac{\beta}{2\alpha}\mathrm{err}(g,g_{\theta};\gamma_{\theta}^{\ast}),$$

which is precisely the claimed result. ∎

Appendix B Algorithm Convergence Proofs

B.1 Universal Orthogonality

We now restate and prove the convergence guarantee of the sample splitting algorithm for calibration with respect to universally orthogonalizable loss functions. The result below is largely an application of Theorem 3.3, with the only caveat being that some care must be taken to handle the fact that the output parameter $\widehat{\theta}$ and the nuisance estimate $\widehat{g}$ are now random variables, not fixed parameters. See Theorem 3.5 in the main text for the statement.

Proof of Theorem 3.5.

First, observe that, under Assumption 1, for any fixed $\theta$ and $g$, we have

$$\underbrace{\mathrm{Cal}(\theta,g)}_{\text{causal cal. error}}=\underbrace{\mathrm{Cal}(\theta,P^{\chi}_{g})}_{\text{non-causal cal. error}},$$

where $P^{\chi}_{g}$ denotes the joint distribution of $(X,\chi(g;Z))$ over draws $Z\sim P_{Z}$, and the latter quantity is controlled with high probability by Assumption 2. Since the above equality holds for any $\theta$ and $g$, it still holds when $\theta$ is replaced by $\widehat{\theta}$, the random output of Algorithm 1, and when $g$ is replaced by $\widehat{g}$, the corresponding nuisance estimate obtained from $\mathcal{A}_{1}$.
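Once pseudo-outcomes are formed, the identity above means the causal calibration error can be estimated like an ordinary one. The sketch below is illustrative only. It assumes the standard doubly robust (AIPW) pseudo-outcome for the CATE as the choice of $\chi(g;Z)$, a discrete-valued $\theta$ so that level sets can be enumerated, and a simulated randomized experiment with known nuisances. The helper names `dr_pseudo_outcome`, `l2_calibration_error`, and `simulate` are hypothetical and do not come from the paper.

```python
import math
import random

def dr_pseudo_outcome(y, a, mu0, mu1, pi):
    """Doubly robust (AIPW) pseudo-outcome for the CATE (assumed form of chi(g; z))."""
    return mu1 - mu0 + a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1 - pi)

def l2_calibration_error(theta_vals, chi_vals):
    """Plug-in L2 calibration error for squared loss: square root of the
    level-set-weighted squared gap between theta and the mean pseudo-outcome."""
    sums, counts = {}, {}
    for t, c in zip(theta_vals, chi_vals):
        sums[t] = sums.get(t, 0.0) + c
        counts[t] = counts.get(t, 0) + 1
    n = len(theta_vals)
    total = sum(counts[t] / n * (t - sums[t] / counts[t]) ** 2 for t in counts)
    return math.sqrt(total)

def simulate(n=40000, seed=0):
    """Simulated randomized trial with known nuisances: tau(x) = 1 + x, pi = 0.5."""
    rng = random.Random(seed)
    xs, chis = [], []
    for _ in range(n):
        x = rng.randint(0, 1)
        a = rng.randint(0, 1)                  # treated with probability 0.5
        mu0, mu1 = float(x), 2.0 * x + 1.0     # so tau(x) = mu1 - mu0 = 1 + x
        y = (mu1 if a else mu0) + rng.gauss(0.0, 1.0)
        xs.append(x)
        chis.append(dr_pseudo_outcome(y, a, mu0, mu1, 0.5))
    return xs, chis

xs, chis = simulate()
cal_true = l2_calibration_error([1.0 + x for x in xs], chis)    # theta = tau
cal_shift = l2_calibration_error([1.5 + x for x in xs], chis)   # theta = tau + 0.5
print(cal_true, cal_shift)
```

In this simulation the true CATE should have calibration error close to zero, while the estimator shifted by 0.5 should have calibration error close to 0.5, up to sampling noise.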

Under Assumption 1, it is also clear that $\mathrm{err}(g,g_{0};\theta)$ does not depend on $\theta$, so it suffices to write $\mathrm{err}(g,g_{0})$ going forward. Define the “bad” events as $B_{1}:=\{\mathrm{err}(\widehat{g},g_{0})>\mathrm{rate}_{1}(n,\delta_{1};P_{Z})\}$ and $B_{2}:=\{\mathrm{Cal}(\widehat{\theta},P^{\chi}_{\widehat{g}})>\mathrm{rate}_{2}(n,\delta_{2};P^{\chi}_{\widehat{g}})\}$. The first part of Assumption 2 yields $\mathbb{P}(B_{1})\leq\delta_{1}$. Likewise, the second part of Assumption 2 yields $\mathbb{P}(B_{2}\mid Z_{1:n})\leq\delta_{2}$, as fixing $Z_{1},\dots,Z_{n}$ fixes the learned nuisance estimate $\widehat{g}$, per Algorithm 1. Thus, applying the law of total probability, the marginal probability of $B_{2}$ (over draws of both $Z_{1},\dots,Z_{n}$ and $Z_{n+1},\dots,Z_{2n}$) is bounded by

$$\mathbb{P}(B_{2})=\mathbb{E}[\mathbb{P}(B_{2}\mid Z_{1:n})]\leq\delta_{2}.$$

This is because the $Z_{i}$ are independent and conditioning on the first $n$ observations fixes the nuisance estimate $\widehat{g}$. Thus, by a union bound, the “good” event $B_{1}^{c}\cap B_{2}^{c}$ occurs with probability at least $1-\delta_{1}-\delta_{2}$, and on this event we have

$$
\begin{aligned}
\mathrm{Cal}(\widehat{\theta},g_{0}) &\leq\frac{1}{2}\mathrm{err}(\widehat{g},g_{0})+\mathrm{Cal}(\widehat{\theta},\widehat{g})\\
&=\frac{1}{2}\mathrm{err}(\widehat{g},g_{0})+\mathrm{Cal}(\widehat{\theta};P^{\chi}_{\widehat{g}})\\
&\leq\frac{1}{2}\mathrm{rate}_{1}(n,\delta_{1};P_{Z})+\mathrm{rate}_{2}(n,\delta_{2};P^{\chi}_{\widehat{g}}),
\end{aligned}
$$

where the first inequality follows from Theorem 3.3 and the equality on the second line follows from the preamble at the beginning of this proof. ∎
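The sample-splitting structure that this proof relies on can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: covariates are discrete so that nuisances can be estimated by cell means, the pseudo-outcome is the doubly robust (AIPW) one for the CATE, and the off-the-shelf calibrator is plain level-set averaging (histogram binning on the values of $\theta$). One half of the sample fits the nuisances and the other half is used for calibration. All function names are hypothetical.

```python
import random
from collections import defaultdict

def fit_nuisances(data):
    """Cell-mean estimates of mu_a(x) = E[Y | X=x, A=a] and pi(x) = P(A=1 | X=x)."""
    y_sum, y_cnt = defaultdict(float), defaultdict(int)
    a_sum, x_cnt = defaultdict(float), defaultdict(int)
    for x, a, y in data:
        y_sum[(x, a)] += y
        y_cnt[(x, a)] += 1
        a_sum[x] += a
        x_cnt[x] += 1
    mu = {k: y_sum[k] / y_cnt[k] for k in y_cnt}
    pi = {x: a_sum[x] / x_cnt[x] for x in x_cnt}
    return mu, pi

def pseudo_outcome(x, a, y, mu, pi):
    """Doubly robust (AIPW) pseudo-outcome evaluated at the estimated nuisances."""
    m0, m1, p = mu[(x, 0)], mu[(x, 1)], pi[x]
    return m1 - m0 + a * (y - m1) / p - (1 - a) * (y - m0) / (1 - p)

def calibrate(theta, nuisance_half, calibration_half):
    """Fit nuisances on one half, then average pseudo-outcomes over each
    level set of theta on the other half."""
    mu, pi = fit_nuisances(nuisance_half)
    sums, counts = defaultdict(float), defaultdict(int)
    for x, a, y in calibration_half:
        level = theta(x)
        sums[level] += pseudo_outcome(x, a, y, mu, pi)
        counts[level] += 1
    level_means = {v: sums[v] / counts[v] for v in counts}
    return lambda x: level_means[theta(x)]

def draw(n, rng):
    """Simulated randomized trial with true CATE tau(x) = x for x in {0, 1, 2, 3}."""
    out = []
    for _ in range(n):
        x, a = rng.randint(0, 3), rng.randint(0, 1)
        out.append((x, a, a * x + rng.gauss(0.0, 1.0)))
    return out

rng = random.Random(0)
theta = lambda x: 0.0 if x < 2 else 1.0   # coarse, miscalibrated initial estimate
theta_hat = calibrate(theta, draw(20000, rng), draw(20000, rng))
print(theta_hat(0), theta_hat(3))
```

Because the calibrated map is a function of $\theta(x)$ only, it preserves the level sets of the initial estimate. In this simulation the two level sets should be mapped to roughly 0.5 and 2.5, the average true effects over $\{0,1\}$ and $\{2,3\}$.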

B.2 Conditional Orthogonality

We now prove the convergence guarantees for Algorithm 3. The proof is largely the same as the convergence proof for Algorithm 1, but we nonetheless include it for completeness. The key differences are that (a) we no longer have access to “pseudo-outcomes” that, in expectation, look like the target treatment effect and (b) we have to be careful, since the second nuisance parameter that is used in the definition of conditional orthogonality does not satisfy $\zeta_{\theta}=\zeta_{\widehat{\theta}}$ in general. Once again, $\theta:\mathcal{X}\rightarrow\mathbb{R}$ is the fixed, initial estimate treated as an input to Algorithm 3 and $\widehat{\theta}$ is the random, calibrated estimate that is the output of the algorithm. See Theorem 4.8 in the main text for the statement.

Proof of Theorem 4.8.

First, we observe that, by the assumption that $\widehat{\theta}=\widehat{\tau}\circ\theta$ where the random map $\widehat{\tau}:\mathbb{R}\rightarrow\mathbb{R}$ is almost surely injective, the initial estimate $\theta$ and the calibrated estimate $\widehat{\theta}$ have the same level sets. Consequently, by Lemma 4.7, we have $\zeta_{\theta}=\zeta_{\widehat{\theta}}$ without loss of generality. Further, the calibration functions corresponding to $\theta$ and $\widehat{\theta}$ coincide, i.e. we have $\gamma_{\theta}^{\ast}\equiv\gamma_{\widehat{\theta}}^{\ast}$. Thus, letting $g_{\varphi}=(\eta_{0},\zeta_{\varphi})$ for any $\varphi:\mathcal{X}\rightarrow\mathbb{R}$, we have

$$\mathrm{err}(\widehat{g},g_{\widehat{\theta}};\gamma_{\widehat{\theta}}^{\ast})=\mathrm{err}(\widehat{g},g_{\theta};\gamma_{\theta}^{\ast}).$$

The rest of the proof is now more or less identical to that of Theorem 3.5, but we nonetheless include it for completeness. Define the “bad” events as $B_{1}:=\{\mathrm{err}(\widehat{g},g_{\theta};\gamma_{\theta}^{\ast})>\mathrm{rate}_{1}(n,\delta_{1},\theta;P_{Z})\}$ and $B_{2}:=\{\mathrm{Cal}(\widehat{\theta};\widehat{g})>\mathrm{rate}_{2}(n,\delta_{2},\ell;P_{Z})\}$. The first part of Assumption 5 yields $\mathbb{P}(B_{1})\leq\delta_{1}$. Likewise, the second part of Assumption 5 yields $\mathbb{P}(B_{2}\mid Z_{1:n})\leq\delta_{2}$, as fixing $Z_{1},\dots,Z_{n}$ fixes the learned nuisance estimate $\widehat{g}$, per Algorithm 3. Thus, applying the law of total probability, the marginal probability of $B_{2}$ (over draws of both $Z_{1},\dots,Z_{n}$ and $Z_{n+1},\dots,Z_{2n}$) is bounded by

$$\mathbb{P}(B_{2})=\mathbb{E}[\mathbb{P}(B_{2}\mid Z_{1:n})]\leq\delta_{2}.$$

This is because the $Z_{i}$ are independent and conditioning on the first $n$ observations fixes the nuisance estimate $\widehat{g}$. Thus, by a union bound, the “good” event $B_{1}^{c}\cap B_{2}^{c}$ occurs with probability at least $1-\delta_{1}-\delta_{2}$, and on this event we have

$$
\begin{aligned}
\mathrm{Cal}(\widehat{\theta},\eta_{0}) &\leq\frac{\beta}{2\alpha}\mathrm{err}(\widehat{g},g_{\widehat{\theta}};\gamma_{\widehat{\theta}}^{\ast})+\frac{\beta}{\alpha}\mathrm{Cal}(\widehat{\theta},\widehat{g})\\
&=\frac{\beta}{2\alpha}\mathrm{err}(\widehat{g},g_{\theta};\gamma_{\theta}^{\ast})+\frac{\beta}{\alpha}\mathrm{Cal}(\widehat{\theta};\widehat{g})\\
&\leq\frac{\beta}{2\alpha}\mathrm{rate}_{1}(n,\delta_{1},\theta;P_{Z})+\frac{\beta}{\alpha}\mathrm{rate}_{2}(n,\delta_{2},\ell;P_{Z}),
\end{aligned}
$$

where the first inequality follows from Theorem 4.5 and the equality on the second line follows from the preamble at the beginning of this proof. ∎

Appendix C A Do No Harm Property for Universally Orthogonal Losses

Throughout the main body of the paper, we focused on deriving bounds on excess calibration error for our causal calibration algorithm. One desideratum for any calibration algorithm is that the calibrated estimator does not have significantly higher loss than the original estimator. This property, called a “do no harm” property, is satisfied by many off-the-shelf calibration algorithms. We prove such a bound for universally orthogonal losses in the following section. In what follows, we denote the risk of a predictor $\theta:\mathcal{X}\rightarrow\mathbb{R}$ with respect to a nuisance $g:\mathcal{W}\rightarrow\mathbb{R}^{k}$ as $R(\theta,g):=\mathbb{E}\,\ell(\theta,g;Z)$.

We start by proving a generic upper bound on the difference in risk between two arbitrary estimators. The argument presented below is similar in spirit to the proof of Theorem 2 in Foster and Syrgkanis [2023].

Theorem C.1.

Let $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ be a universally orthogonal loss function. Let $\theta,\theta^{\prime}:\mathcal{X}\rightarrow\mathbb{R}$ be two estimators, $g\in\mathcal{G}$ a nuisance function, and $g_{0}\in\mathcal{G}$ the true nuisance function. Then, we have

\[
R(\theta^{\prime},g_{0})-R(\theta,g_{0})\leq R(\theta^{\prime},g)-R(\theta,g)+\mathrm{err}^{\prime}(g,g_{0}),
\]

where $\mathrm{err}^{\prime}(g,g_{0}):=C\|g-g_{0}\|_{L^{2}(P_{W})}^{2}$ if we assume $\ell$ is $C$-smooth in $g$, i.e. if we have

\[
|D_{g}^{2}R(\varphi,h)[g-g_{0},g-g_{0}]|\leq C\|g-g_{0}\|_{L^{2}(P_{W})}^{2}\qquad\forall h\in[g,g_{0}],\ \varphi\in\Theta. \tag{8}
\]

If we instead assume $g=(\eta,\zeta)$ and $\ell(\theta,g;z)=\frac{1}{2}(\theta(x)-\chi(g;z))^{2}$ where

\[
\chi(g;z):=m(\eta;z)+\mathrm{Corr}(g;z),
\]

$m(\eta;z)$ is linear in $\eta$, and $\mathbb{E}[\mathrm{Corr}(g_{0};Z)\mid X]=\mathbb{E}[\langle\zeta(W),(\eta_{0}-\eta)(W)\rangle\mid X]$, then we take

\[
\mathrm{err}^{\prime}(g,g_{0}):=\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})}^{2}.
\]
Proof.

First, observe that we have the decomposition

\[
R(\theta^{\prime},g_{0})-R(\theta,g_{0})=\underbrace{R(\theta^{\prime},g_{0})-R(\theta^{\prime},g)}_{T_{1}}+\underbrace{R(\theta^{\prime},g)-R(\theta,g)}_{T_{2}}+\underbrace{R(\theta,g)-R(\theta,g_{0})}_{T_{3}}.
\]

Now, by performing a second order Taylor expansion with respect to the nuisance component around $g_{0}$, we have

\begin{align*}
T_{1} &=-D_{g}R(\theta^{\prime},g_{0})(g-g_{0})-\frac{1}{2}D_{g}^{2}R(\theta^{\prime},\overline{g})(g-g_{0},g-g_{0}),\\
T_{3} &=D_{g}R(\theta,g_{0})(g-g_{0})+\frac{1}{2}D_{g}^{2}R(\theta,\overline{h})(g-g_{0},g-g_{0}),
\end{align*}

where $\overline{g},\overline{h}\in[g,g_{0}]$, understood in a point-wise sense. Thus, we can write the sum as

\begin{align*}
T_{1}+T_{3} &=\underbrace{D_{g}R(\theta,g_{0})(g-g_{0})-D_{g}R(\theta^{\prime},g_{0})(g-g_{0})}_{\text{first order difference}}\\
&\qquad+\underbrace{\frac{1}{2}D_{g}^{2}R(\theta,\overline{h})(g-g_{0},g-g_{0})-\frac{1}{2}D_{g}^{2}R(\theta^{\prime},\overline{g})(g-g_{0},g-g_{0})}_{\text{second order difference}}.
\end{align*}

We can use universal orthogonality to show the first order difference vanishes. In particular, we have

\begin{align*}
&D_{g}R(\theta,g_{0})(g-g_{0})-D_{g}R(\theta^{\prime},g_{0})(g-g_{0})\\
&\qquad=\mathbb{E}\left[D_{g}\mathbb{E}(\ell(\theta,g_{0};Z)\mid X)(g-g_{0})-D_{g}\mathbb{E}(\ell(\theta^{\prime},g_{0};Z)\mid X)(g-g_{0})\right]\\
&\qquad=\mathbb{E}\left[D_{g}\mathbb{E}(\partial\ell(\overline{\theta},g_{0};Z)\mid X)(g-g_{0})\cdot(\theta-\theta^{\prime})(X)\right]\\
&\qquad=0,
\end{align*}

where the second equality holds for some $\overline{\theta}\in[\theta,\theta^{\prime}]$ (understood in the sense that $\overline{\theta}(x)\in[\theta(x),\theta^{\prime}(x)]$ for each $x\in\mathcal{X}$) by performing a first-order Taylor expansion, and the final equality follows from universal orthogonality of $\ell$.

If we are in the first setting, i.e. $\ell$ is $C$-smooth in $g$ in the sense of (8), then applying (8) to each of the two Hessian terms bounds the second order difference above as

\[
\frac{1}{2}D_{g}^{2}R(\theta,\overline{h})(g-g_{0},g-g_{0})-\frac{1}{2}D_{g}^{2}R(\theta^{\prime},\overline{g})(g-g_{0},g-g_{0})\leq C\|g-g_{0}\|_{L^{2}(P_{W})}^{2}.
\]

Else, if we instead assume the linear score condition outlined in Assumption 1 and further assume $m(\eta;z)$ is linear in $\eta$, and thus have $\ell(\theta,g;z)=\frac{1}{2}(\theta(x)-\chi(g;z))^{2}$ for some pseudo-outcome mapping $\chi:\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$, then it is clear we have for any $h\in\mathcal{G}$

\begin{align*}
D_{\eta}^{2}\mathbb{E}[\ell(\theta,h;Z)\mid X](g-g_{0},g-g_{0}) &=0,\\
D_{\zeta}^{2}\mathbb{E}[\ell(\theta,h;Z)\mid X](g-g_{0},g-g_{0}) &=0,\\
D_{\zeta,\eta}\mathbb{E}[\ell(\theta,h;Z)\mid X](g-g_{0},g-g_{0}) &=-\left(\mathbb{E}[\langle(\zeta-\zeta_{0})(W),(\eta-\eta_{0})(W)\rangle\mid X]\right)^{2}.
\end{align*}

Thus, by leveraging Jensen’s inequality, we can bound the second order difference by

\[
\leq 2\mathbb{E}[\langle(\eta-\eta_{0})(W),(\zeta-\zeta_{0})(W)\rangle^{2}]=\mathrm{err}^{\prime}(g,g_{0}),
\]

which completes the proof.

As before, with the above error decomposition bound, we can now prove a general convergence result given access to nuisance estimation and calibration algorithms that satisfy certain high-probability convergence guarantees. Given the similarity of this result to Theorems 3.5 and 4.8 and the fact that Theorem C.1 does the majority of the heavy lifting, we omit the proof.

Theorem C.2.

Assume the same setup as Theorem C.1. Assume $\mathcal{A}_{1}:\mathcal{Z}^{\ast}\rightarrow\mathcal{G}$ is a nuisance estimation algorithm and $\mathcal{A}_{2}:\Theta\times(\mathcal{X}\rightarrow\mathbb{R})^{\ast}\rightarrow\Theta$ is a general loss calibration algorithm. Assume

  1. For any distribution $P_{Z}$ on $\mathcal{Z}$, $Z_{1},\dots,Z_{n}\sim_{i.i.d.}P_{Z}$, $\theta\in\Theta$, and failure probability $\delta_{1}\in(0,1)$, we have

    \[
    \mathrm{err}(\widehat{g},g_{0})\leq\mathrm{rate}_{3}(n,\delta_{1},\theta;P_{Z}),
    \]

    where $\widehat{g}\leftarrow\mathcal{A}_{1}(Z_{1:n})$ and $\mathrm{rate}_{3}$ is some rate function.

  2. For any distribution $P_{Z}$ on $\mathcal{Z}$, $Z_{1},\dots,Z_{n}\sim_{i.i.d.}P_{Z}$, $\theta\in\Theta$, $g\in\mathrm{range}(\mathcal{A}_{1})$, and failure probability $\delta_{2}\in(0,1)$, we have

    \[
    R(\widehat{\theta},\widehat{g})-R(\theta,\widehat{g})\leq\mathrm{rate}_{4}(n,\delta_{2},\ell;P_{Z}),
    \]

    where $\widehat{\theta}\leftarrow\mathcal{A}_{2}(\theta,\{(X_{m},\ell_{m})\}_{m=1}^{n})$, $\ell_{m}:=\ell(\cdot,\widehat{g};Z_{m})$, and $\mathrm{rate}_{4}$ is some rate function.

Then, with probability at least $1-\delta_{1}-\delta_{2}$, we have

\[
R(\widehat{\theta},g_{0})-R(\theta,g_{0})\leq\mathrm{rate}_{3}(n,\delta_{1},\theta;P_{Z})+\mathrm{rate}_{4}(n,\delta_{2},\ell;P_{Z}).
\]
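Although the proof is omitted, the argument can be outlined in a single display. What follows is a hedged sketch rather than the authors' proof: it assumes the first condition is read as controlling the error term $\mathrm{err}^{\prime}(\widehat{g},g_{0})$ of Theorem C.1, and that $\widehat{g}$ is learned on a sample independent of the one used for calibration, so that both guarantees hold simultaneously with probability at least $1-\delta_{1}-\delta_{2}$ by a union bound. On that event, Theorem C.1 with $\theta^{\prime}=\widehat{\theta}$ and $g=\widehat{g}$ gives:

```latex
\begin{align*}
R(\widehat{\theta},g_{0})-R(\theta,g_{0})
&\leq R(\widehat{\theta},\widehat{g})-R(\theta,\widehat{g})+\mathrm{err}^{\prime}(\widehat{g},g_{0})
&& \text{(Theorem C.1)}\\
&\leq \mathrm{rate}_{4}(n,\delta_{2},\ell;P_{Z})+\mathrm{rate}_{3}(n,\delta_{1},\theta;P_{Z})
&& \text{(conditions 2 and 1)}.
\end{align*}
```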

Appendix D Calibration Algorithms for General Losses

Here, we present analogues of classical calibration algorithms defined with respect to general losses involving nuisance components. In particular, we give generalizations of histogram binning/uniform mass binning, isotonic calibration, and linear calibration. We only prove convergence guarantees for one algorithm (uniform mass binning), but we showed experimentally in Section 5 that other algorithms (e.g. linear calibration) work well in practice.

We first present a general algorithm that computes a univariate post-processing over a suitable class of functions $\mathcal{F}\subset\{f:\mathbb{R}\rightarrow\mathbb{R}\}$ and then returns a calibrated parameter estimate.

Algorithm 5 General ERM-based Calibration
Input: Samples $Z_{1},\dots,Z_{n}\sim P_{Z}$, losses $\ell_{1},\dots,\ell_{n}:=\ell(\cdot,g_{1};Z_{1}),\dots,\ell(\cdot,g_{n};Z_{n})$, estimator $\theta$, function class $\mathcal{F}$.
1: Compute $\widehat{\tau}\in\arg\min_{\tau\in\mathcal{F}}\sum_{m=1}^{n}\ell_{m}(\tau\circ\theta(X_{m}))$.
2: Define $\widehat{\theta}:=\widehat{\tau}\circ\theta$.
Output: Calibrated estimator $\widehat{\theta}$.
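As an illustration of Algorithm 5, the sketch below instantiates it for one specific and simple case: the squared pseudo-outcome loss $\ell_{m}(\nu)=\frac{1}{2}(\nu-\chi_{m})^{2}$ and the linear class $\mathcal{F}=\{v\mapsto av+b\}$, for which the empirical risk minimizer has a closed form. The function name `linear_calibrate`, the synthetic data, and the stand-in pseudo-outcomes `chi` are assumptions made for this example and are not taken from the paper.

```python
import random


def linear_calibrate(theta_vals, pseudo_outcomes):
    """ERM over F = {v -> a*v + b} for the squared loss 0.5*(nu - chi)^2."""
    n = len(theta_vals)
    mean_t = sum(theta_vals) / n
    mean_c = sum(pseudo_outcomes) / n
    var_t = sum((t - mean_t) ** 2 for t in theta_vals)
    cov_tc = sum((t - mean_t) * (c - mean_c)
                 for t, c in zip(theta_vals, pseudo_outcomes))
    slope = cov_tc / var_t                 # closed-form least-squares slope
    intercept = mean_c - slope * mean_t
    return lambda v: slope * v + intercept  # this plays the role of tau_hat


# Synthetic illustration; every number below is made up for the sketch.
rng = random.Random(0)
theta_vals = [rng.random() for _ in range(500)]       # theta(X_m)
chi = [2.0 * t + 0.5 + rng.gauss(0.0, 0.1)            # stand-in pseudo-outcomes
       for t in theta_vals]
tau_hat = linear_calibrate(theta_vals, chi)
calibrated = [tau_hat(t) for t in theta_vals]         # theta_hat = tau_hat o theta
```

Other choices of $\mathcal{F}$ (for instance monotone functions) would replace the closed-form fit with a different solver, while the composition step stays the same.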

Depending on how $\mathcal{F}$ is defined, we obtain generalizations of many classical algorithms. We refer the reader to Table 2 in the paper for appropriate choices of $\mathcal{F}$. As noted in Section 4, our convergence guarantee required that we learn an injective post-processing map from model predictions to calibrated predictions. We discussed in detail how many common algorithms either naturally fit this desideratum or can be easily modified to fit it. The one exception is uniform mass/histogram binning, which maps model predictions to a finite number of values. To address this, we prove the convergence of a separate, three-way sample splitting algorithm that performs uniform mass binning for general losses.

D.1 An analogue of Algorithm 3 for UMB

We now develop a three-way sample splitting algorithm for causal calibration based on uniform mass/histogram binning [27, 25, 38]. As in the case of Algorithm 3, our algorithm (Algorithm 6 below) is implicitly based on the calibration error decomposition presented in Theorem 4.5. To prove the convergence of Algorithm 3, we needed to assume $\widehat{\theta}=\widehat{\tau}\circ\theta$ for some injective, randomized post-processing map $\widehat{\tau}:\mathbb{R}\rightarrow\mathbb{R}$. While this desideratum is satisfied for some calibration algorithms (e.g. linear calibration and Platt scaling) and can be made to be satisfied for others (e.g. by slightly perturbing the post-processing map learned by isotonic calibration), it is clearly not a reasonable assumption for uniform mass binning/histogram binning. This is because, as the name suggests, binning algorithms “compress” the initial estimator into a pre-specified number of data-dependent bins.

Why did we assume $\widehat{\theta}$ was an injective post-processing of $\theta$? In any causal calibration algorithm, we need to estimate the unknown nuisance functions. For conditionally orthogonal losses, there are two parameters we need to estimate. The first parameter, $\eta_{0}$, does not depend on either the initial model $\theta(X)$ or the calibrated model $\widehat{\theta}(X)$, and thus can be readily estimated from data. However, the second parameter, $\zeta_{\widehat{\theta}}$, depends on the calibrated model $\widehat{\theta}$, and in general cannot be reliably estimated from data. Assuming the injectivity of the post-processing map $\widehat{\tau}$ allowed us to conclude (via Lemma 4.7) that $\zeta_{\theta}=\zeta_{\widehat{\theta}}$. Since we know $\theta$ at the outset of the problem, it is reasonable to assume we can produce a convergent estimate $\widehat{\zeta}$ of $\zeta_{\widehat{\theta}}$.

The question is now what we can do for binning-based algorithms, which generally cannot be written as an injective post-processing of the initial model. The key idea is that, for histogram binning/uniform mass binning, the level sets of $\widehat{\theta}(X)$ only depend on the order statistics of $\theta(X_{1}),\dots,\theta(X_{n})$, not on the nuisance parameters. Thus, we propose a natural three-way sample-splitting analogue of uniform mass binning for conditionally orthogonal losses $\ell$.

Our algorithm works as follows. First, we use the covariates $X_{1},\dots,X_{n}$ associated with one third of the data to compute the buckets/bins of $\widehat{\theta}$ (these will be determined by evenly-spaced order statistics of $\theta(X_{1}),\dots,\theta(X_{n})$). Fixing the level sets up front like this in turn fixes the additional nuisance we need to estimate. This is because, under Lemma 4.7, any function $\varphi:\mathcal{X}\rightarrow\mathbb{R}$ with the same level sets as $\widehat{\theta}$ will have $\zeta_{\varphi}=\zeta_{\widehat{\theta}}$. Then, we use the second third $Z_{n+1},\dots,Z_{2n}$ to estimate the unknown nuisance functions $\eta_{0}$ and $\zeta_{\widehat{\theta}}$. Finally, we use the learned nuisance parameters and the final third $Z_{2n+1},\dots,Z_{3n}$ to compute the empirical loss minimizer in each bucket. We formally state this heuristically-described procedure below in Algorithm 6.

Algorithm 6 Three-Way Uniform Mass Binning (UMB)
Input: Sample $Z_{1},\dots,Z_{3n}\sim_{i.i.d.}P_{Z}$, loss $\ell(\nu,g;z)$, estimator $\theta$, nuisance estimation algorithm $\mathcal{A}_{1}$, calibration algorithm $\mathcal{A}_{2}$, number of buckets $B$.
1: Compute order statistics $\theta(X)_{(1)},\dots,\theta(X)_{(n)}$ of $\theta(X_{1}),\dots,\theta(X_{n})$.
2: Set $\theta(X)_{(0)}=0$, $\theta(X)_{(n)}=\infty$.
3: for $b\in[B]$ do
4:     Set $V_{b}:=[\theta(X)_{(\lfloor(b-1)n/B\rfloor)},\theta(X)_{(\lfloor bn/B\rfloor)})$
5: Define $\varphi(x):=\sum_{b=1}^{B}\nu_{b}\mathbbm{1}[\theta(x)\in V_{b}]$ for any distinct $\nu_{1},\dots,\nu_{B}$.
6: Compute nuisances $\widehat{g}=(\widehat{\eta},\widehat{\zeta})\leftarrow\mathcal{A}_{1}(\{Z_{m}\}_{m=n+1}^{2n},\varphi)$
7: for $b\in[B]$ do
8:     $\widehat{\nu}_{b}:=\arg\min_{\nu}\sum_{m=2n+1}^{3n}\ell(\nu,\widehat{g};Z_{m})\mathbbm{1}[\theta(X_{m})\in V_{b}]$
9: Set $\widehat{\theta}(x):=\sum_{b=1}^{B}\widehat{\nu}_{b}\mathbbm{1}[\theta(x)\in V_{b}]$
Output: Calibrated estimator $\widehat{\theta}$.
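The three stages of Algorithm 6 can be sketched as follows. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: the loss is taken to be the squared pseudo-outcome loss, so the per-bucket minimizer is the per-bucket mean of pseudo-outcomes; the callable `fit_pseudo_outcome` is a hypothetical stand-in for the nuisance estimation algorithm $\mathcal{A}_{1}$; and the data-generating process is synthetic.

```python
import bisect
import random


def umb_edges(theta_fold1, num_bins):
    """Interior bin edges: order statistics theta_(floor(b*n/B)), b = 1..B-1."""
    s = sorted(theta_fold1)
    n = len(s)  # requires n >= num_bins
    return [s[(b * n) // num_bins - 1] for b in range(1, num_bins)]


def three_way_umb(theta, fold1, fold2, fold3, fit_pseudo_outcome, num_bins):
    # Stage 1: buckets from the first fold only (no nuisances involved).
    edges = umb_edges([theta(z["x"]) for z in fold1], num_bins)

    def bin_of(x):
        # V_b is left-closed, so a value equal to an edge goes to the upper bin.
        return bisect.bisect_right(edges, theta(x))

    # Stage 2: nuisances from the second fold. `fit_pseudo_outcome` is a
    # hypothetical stand-in for A_1; it returns a map z -> chi(g_hat; z).
    chi = fit_pseudo_outcome(fold2, bin_of)

    # Stage 3: per-bucket minimizer of the squared loss = per-bucket mean.
    sums, counts = [0.0] * num_bins, [0] * num_bins
    for z in fold3:
        b = bin_of(z["x"])
        sums[b] += chi(z)
        counts[b] += 1
    nu_hat = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: nu_hat[bin_of(x)]


rng = random.Random(0)


def draw(n):
    xs = [rng.random() for _ in range(n)]
    return [{"x": x, "y": x * x + rng.gauss(0.0, 0.05)} for x in xs]


fold1, fold2, fold3 = draw(300), draw(300), draw(300)
theta_hat = three_way_umb(
    theta=lambda x: x,
    fold1=fold1, fold2=fold2, fold3=fold3,
    # Trivial nuisance step for the demo: the pseudo-outcome is just y.
    fit_pseudo_outcome=lambda fold, bin_of: (lambda z: z["y"]),
    num_bins=5,
)
```

Passing `bin_of` to the nuisance step mirrors the role of $\varphi$ in the algorithm: the nuisance estimator only needs the level sets, which are fixed before any nuisance is learned.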

We now present the set of assumptions that we will use to prove convergence of the above algorithm.

Assumption 6.

Let $\mathcal{A}_{1}:\Theta\times\mathcal{Z}^{\ast}\rightarrow\mathcal{G}$ be a nuisance estimation algorithm and let $\varphi:\mathcal{X}\rightarrow\mathbb{R}$ be any initial estimator. We make the following assumptions.

  1. $\mathrm{range}(\theta)\subset[0,1]$.

  2. The conditionally orthogonal loss function $\ell:\mathbb{R}\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$ satisfies

    (a) For any $g=(\eta,\zeta)\in\mathcal{G}$ and $z\in\mathcal{Z}$, the minimizer of the loss $\nu^{\ast}=\arg\min_{\nu}\ell(\nu,g;z)$ satisfies $\nu^{\ast}\in[0,1]$.

    (b) For any $g=(\eta,\zeta)\in\mathcal{G}$, $z\in\mathcal{Z}$, and $\nu\in[0,1]$, we have $\partial\ell(\nu,g;z)\in[-C,C]$.

  3. For any distribution $P_{Z}$ on $\mathcal{Z}$, $n\geq 1$, and failure probability $\delta_{1}\in(0,1)$, with probability at least $1-\delta_{1}$ over the draws of $Z_{1},\dots,Z_{n}\sim P_{Z}$ i.i.d., we have

    \[
    \mathrm{err}(\widehat{g},g_{\varphi};\gamma_{\varphi}^{\ast})\leq\mathrm{rate}_{1}(n,\delta_{1},\varphi;P_{Z}),
    \]

    where $\widehat{g}:=(\widehat{\eta},\widehat{\zeta})\leftarrow\mathcal{A}_{1}(\varphi,Z_{1:n})$ and $\mathrm{rate}_{1}$ is some rate function.

The above assumptions are mostly standard and essentially say that (a) the initial estimate $\theta(X)$ is bounded, (b) the partial derivative of the loss is bounded and the minimizer takes values in a bounded interval, and (c) we have access to some algorithm that can accurately estimate nuisance parameters.

For simplicity, we assume that the values $\widehat{\nu}_{1},\dots,\widehat{\nu}_{B}$ that Algorithm 6 assigns to each of the buckets $V_{1},\dots,V_{B}$ are unique. This is to ensure two distinct buckets $V_{i}\neq V_{j}$ do not merge, which would invalidate our application of Lemma 4.7. If, in practice, we have $\widehat{\nu}_{i}=\widehat{\nu}_{j}$ for $i\neq j$, the learner can simply add $\mathcal{U}([-\epsilon,\epsilon])$ noise to $\widehat{\nu}_{i}$ for $\epsilon>0$ arbitrarily small to guarantee uniqueness.
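The tie-breaking step described above could be implemented along the following lines. This is only a sketch: the helper name `break_ties`, the default `eps`, and the fixed seed are choices made for the example, and the code does not attempt to keep perturbed values inside $[0,1]$ when a tie occurs exactly at the boundary.

```python
import random


def break_ties(nu_hat, eps=1e-9, seed=0):
    """Perturb repeated bucket values by Uniform([-eps, eps]) noise."""
    rng = random.Random(seed)
    out, seen = [], set()
    for v in nu_hat:
        w = v
        while w in seen:                  # only perturb when a tie occurs
            w = v + rng.uniform(-eps, eps)
        seen.add(w)
        out.append(w)
    return out


# Three buckets share the value 0.2; the first occurrence is left unchanged.
perturbed = break_ties([0.2, 0.2, 0.5, 0.2])
```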

Assumption 7.

With probability 1 over the draws $Z_{1},\dots,Z_{3n}\sim P_{Z}$ i.i.d., we have that $\widehat{\nu}_{1},\dots,\widehat{\nu}_{B}\in[0,1]$ are distinct.

We now state the main result of this subsection, a technical convergence guarantee for Algorithm 6. We prove Theorem D.1 (along with requisite lemmas and propositions) in the next subsection of this appendix.

Theorem D.1.

Fix any initial estimator $\theta:\mathcal{X}\rightarrow\mathbb{R}$, conditionally orthogonal loss function $\ell:[0,1]\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$, and failure probabilities $\delta_{1},\delta_{2}\in(0,1)$. Suppose Assumptions 3, 4, and 6 hold, and assume $n\gtrsim B\log(B/\min(\delta_{1},\delta_{2}))$. Then, with probability at least $1-\delta_{1}-\delta_{2}$ over the randomness of $Z_{1},\dots,Z_{3n}\sim P_{Z}$, the output $\widehat{\theta}$ of Algorithm 6 satisfies

\[
\mathrm{Cal}(\widehat{\theta},\eta_{0})\leq\frac{\beta}{2\alpha}\mathrm{rate}_{1}(n,\delta_{1},\widehat{\theta};P_{Z})+\frac{2\beta}{\alpha}\left(\frac{\beta}{n}+C\sqrt{\frac{2B\log(8nB/\delta_{2})}{n}}\right),
\]

where $C>0$ is some constant that bounds the partial derivative as discussed in Assumption 6, i.e. $|\partial\ell(\nu,g;z)|<C$.

D.2 Proving Theorem D.1

We now pivot to the task of proving Theorem D.1. We first cite a result that guarantees, with high probability, that close to a 1/B1/B fraction of points will fall into each bucket.

Lemma D.2 ([38], Lemma 4.3; [27], Lemma 13).

For a universal constant $c>0$, if $n\geq cB\log(B/\delta)$, the buckets $\{V_{b}:b\in[B]\}$ produced in Algorithm 6 will satisfy

\[
\frac{1}{2B}\leq\mathbb{P}_{X\sim P_{X}}(\theta(X)\in V_{b})\leq\frac{2}{B}, \tag{9}
\]

simultaneously for all $b\in[B]$ with probability at least $1-\delta$ over the randomness of $X_{1},\dots,X_{n}$.
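A quick simulation can make the content of this lemma concrete. The sketch below is an illustration only, under the assumption that $\theta(X)$ is uniform on $[0,1]$, in which case the population mass of a bucket is simply its length; the values of `n` and `num_bins` are arbitrary, and the simulation says nothing about the universal constant $c$.

```python
import random


def bin_masses_uniform(n, num_bins, seed=0):
    """Population mass of each UMB bucket when theta(X) ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    s = sorted(rng.random() for _ in range(n))
    # Interior edges are the order statistics theta_(floor(b*n/B)), b = 1..B-1.
    interior = [s[(b * n) // num_bins - 1] for b in range(1, num_bins)]
    # The outer edges 0 and infinity are truncated to the support [0, 1].
    edges = [0.0] + interior + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]


masses = bin_masses_uniform(n=2000, num_bins=10)
# Check the two-sided bound 1/(2B) <= mass <= 2/B for every bucket.
ok = all(1 / (2 * 10) <= m <= 2 / 10 for m in masses)
print(ok)
```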

Next, we argue that, conditional on the random bins satisfying Lemma D.2, the population average derivative conditional on the observation falling into any given bucket will be close to the corresponding sample average.

Lemma D.3.

Fix any initial estimator $\theta:\mathcal{X}\rightarrow\mathbb{R}$, conditionally orthogonal loss function $\ell:[0,1]\times\mathcal{G}\times\mathcal{Z}\rightarrow\mathbb{R}$, and a nuisance estimate $g=(\eta,\zeta)\in\mathcal{G}$. Suppose the buckets $\{V_{b}\}_{b\in[B]}$ are such that

\[
\frac{1}{2B}\leq\mathbb{P}(\theta(X)\in V_{b})\leq\frac{2}{B}
\]

for every $b\in[B]$, $n\geq 8B\log(B/\delta)$, and Assumption 6 holds. Then, with probability at least $1-\delta$ over the randomness of $Z_{2n+1},\dots,Z_{3n}\sim P^{n}_{Z}$, we have for all $b\in[B]$ and all $\nu\in[0,1]$

\[
\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]\right|\leq\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(\frac{4nB}{\delta})}{n}},
\]

where

\[
\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]:=\frac{\sum_{m=2n+1}^{3n}\partial\ell(\nu,g;Z_{m})\cdot\mathbbm{1}[\theta(X_{m})\in V_{b}]}{\sum_{m=2n+1}^{3n}\mathbbm{1}[\theta(X_{m})\in V_{b}]}
\]

denotes the empirical conditional mean over the calibration dataset $Z_{2n+1},\dots,Z_{3n}$.

Proof.

For convenience, let

\[
S_{b}:=\{2n+1\leq m\leq 3n:\theta(X_{m})\in V_{b}\}
\]

denote the set of indices that fall in $V_{b}$. Given that $n\geq 8B\log(B/\delta)$, the multiplicative Chernoff bound (Lemma D.4) tells us that with probability $1-\delta$,

\[
|S_{b}|\geq\frac{n}{4B} \tag{10}
\]

for all $b\in[B]$.

Note that $\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]$ is the empirical mean over $|S_{b}|$ many points. Therefore, with inequality (10) and Hoeffding’s inequality (Lemma D.5), we have for any fixed $\nu\in[0,1]$, with probability at least $1-\delta$, simultaneously for all $b\in[B]$,

\[
\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]\right|\leq 2C\sqrt{\frac{\log(4B/\delta)}{2|S_{b}|}}\leq 2C\sqrt{\frac{2B\log(4B/\delta)}{n}}.
\]

Now, for some $\epsilon>0$ to be chosen later (which we will implicitly assume satisfies $1/\epsilon\in\mathbb{N}$), we take a union bound over an $\epsilon$-covering of $[0,1]$: with probability $1-\delta$, we have for all $\nu\in\{0,\epsilon,2\epsilon,\dots,1-\epsilon\}$ and $b\in[B]$:

\[
\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]\right|\leq 2C\sqrt{\frac{2B\log(\frac{4B}{\epsilon\delta})}{n}}.
\]

For any $\nu\not\in\{0,\epsilon,2\epsilon,\dots,1-\epsilon\}$, taking its closest point $\nu_{\epsilon}$ in the $\epsilon$-grid yields

\begin{align*}
\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]\right| &\leq\beta\epsilon,\\
\left|\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]\right| &\leq\beta\epsilon
\end{align*}

as $\partial\ell$ is $\beta$-Lipschitz by Assumption 4. Hence, with probability $1-\delta$, we have for any $b\in[B]$ and $\nu\in[0,1]$,

\begin{align*}
&\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]\right|\\
&\quad\leq\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]\right|\\
&\qquad+\left|\mathbb{E}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]\right|\\
&\qquad+\left|\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu_{\epsilon},g;Z)\mid\theta(X)\in V_{b}]\right|\\
&\quad\leq 2\beta\epsilon+2C\sqrt{\frac{2B\log(\frac{4B}{\epsilon\delta})}{n}}.
\end{align*}

Therefore, we have with probability $1-\delta$,

\[
\left|\mathbb{E}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]-\mathbb{E}_{n}[\partial\ell(\nu,g;Z)\mid\theta(X)\in V_{b}]\right|\leq 2\beta\epsilon+2C\sqrt{\frac{2B\log(\frac{4B}{\epsilon\delta})}{n}}
\]

for all $b\in[B]$ and $\nu\in[0,1]$. Setting $\epsilon=\frac{1}{n}$ yields the result. ∎
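To make the last step explicit, the substitution can be written out as follows. It only plugs $\epsilon=1/n$ into the preceding bound and uses $\log(4B/(\epsilon\delta))=\log(4nB/\delta)$; note that this choice also satisfies the implicit requirement $1/\epsilon\in\mathbb{N}$.

```latex
\[
\left.2\beta\epsilon+2C\sqrt{\frac{2B\log\!\left(\frac{4B}{\epsilon\delta}\right)}{n}}\;\right|_{\epsilon=1/n}
=\frac{2\beta}{n}+2C\sqrt{\frac{2B\log\!\left(\frac{4nB}{\delta}\right)}{n}}.
\]
```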

Lemma D.4 (Multiplicative Chernoff Bound).

Let $X_{1},\dots,X_{n}$ be independent random variables such that $X_{i}\in\{0,1\}$ and $\mathbb{E}[X_{i}]=p$ for all $i\in[n]$. For all $t\in(0,1)$,

\[
\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\leq(1-t)np\right)\leq\exp\left(-\frac{npt^{2}}{3}\right).
\]
Lemma D.5 (Hoeffding’s Inequality).

Let $X_{1},\dots,X_{n}$ be independent random variables such that $X_{i}\in[a,b]$. Consider the sum of these random variables $S_{n}=X_{1}+\dots+X_{n}$. For all $t>0$,

\[
\mathbb{P}\left(|S_{n}-\mathbb{E}[S_{n}]|>t\right)\leq 2\exp\left(-\frac{2t^{2}}{n(b-a)^{2}}\right).
\]
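As a sanity check on the form of the bound, the sketch below compares the two-sided Hoeffding bound $2\exp(-2t^{2}/(n(b-a)^{2}))$ with a Monte Carlo estimate of the tail probability. The choice of Uniform$[0,1]$ summands, the sample sizes, and the function names are assumptions made purely for illustration.

```python
import math
import random


def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound 2*exp(-2 t^2 / (n (b - a)^2))."""
    return 2.0 * math.exp(-2.0 * t * t / (n * (b - a) ** 2))


def empirical_tail(n, t, trials=2000, seed=0):
    """Monte Carlo frequency of |S_n - E[S_n]| > t for Uniform[0,1] summands."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n))
        if abs(s_n - n * 0.5) > t:   # E[S_n] = n/2 for Uniform[0,1]
            hits += 1
    return hits / trials


bound = hoeffding_bound(n=100, t=10.0)
freq = empirical_tail(n=100, t=10.0)
print(freq <= bound)
```

The bound is typically loose for well-behaved summands like these, which is expected: it holds for every distribution supported on $[a,b]$.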

We use Lemma D.3 to now prove the following Proposition.

Proposition D.6.

Assume the same setup as Lemma D.3 and suppose Assumption 6 holds. Let $Z_{1},\dots,Z_{n}\sim P_{Z}$ be i.i.d., let $g=(\eta,\zeta)\in\mathcal{G}$ be a fixed nuisance pair, and set $\widehat{\theta}(x):=\sum_{b=1}^{B}\widehat{\nu}_{b}\mathbbm{1}[\theta(x)\in V_{b}]$, where

\[
\widehat{\nu}_{b}=\arg\min_{\nu}\sum_{m=1}^{n}\ell(\nu,g;Z_{m})\mathbbm{1}[\theta(X_{m})\in V_{b}].
\]

Assume the $\widehat{\nu}_{b}$ are distinct almost surely. Then, for any $\delta\in(0,1)$, we have with probability at least $1-\delta$,

\[
\mathrm{Cal}(\widehat{\theta},g)\leq\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(4nB/\delta)}{n}},
\]

where $C>0$ is some constant that bounds the partial derivative as discussed in Assumption 6, i.e. $|\partial\ell(\nu,g;z)|<C$.

Proof.

As we have assumed that the $\widehat{\nu}_{b}$ are all distinct, we have

\[
\mathrm{Cal}(\widehat{\theta},g)=\sqrt{\sum_{b\in[B]}\mathbb{P}(\widehat{\theta}(X)=\widehat{\nu}_{b})\cdot\mathbb{E}\left[\partial\ell(\widehat{\nu}_{b},g;Z)\mid\widehat{\theta}(X)=\widehat{\nu}_{b}\right]^{2}}.
\]

Since we have assumed $n\geq 8B\log(B/\delta)$, we have from Lemma D.3 that, with probability at least $1-\delta$, simultaneously for each $b\in[B]$,

\[
\left|\mathbb{E}[\partial\ell(\widehat{\nu}_{b},g;Z)\mid\widehat{\theta}(X)=\widehat{\nu}_{b}]\right|=\left|\mathbb{E}[\partial\ell(\widehat{\nu}_{b},g;Z)\mid\theta(X)\in V_{b}]\right|\leq\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(\frac{4nB}{\delta})}{n}},
\]

which follows since $\mathbb{E}_{n}[\partial\ell(\widehat{\nu}_{b},g;Z)\mid\theta(X)\in V_{b}]=0$ by definition of $\widehat{\nu}_{b}$. Thus, with probability at least $1-\delta$, we get

\begin{align*}
\mathrm{Cal}(\widehat{\theta},g) &=\sqrt{\sum_{b\in[B]}\mathbb{P}(\theta(X)\in V_{b})\cdot\mathbb{E}\left[\partial\ell(\widehat{\nu}_{b},g;Z)\mid\widehat{\theta}(X)=\widehat{\nu}_{b}\right]^{2}}\\
&\leq\sqrt{\sum_{b\in[B]}\mathbb{P}(\theta(X)\in V_{b})\cdot\left(\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(\frac{4nB}{\delta})}{n}}\right)^{2}}\\
&=\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(\frac{4nB}{\delta})}{n}}.
\end{align*}
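The final equality in this chain uses only that the bucket probabilities sum to one, since the buckets $V_{1},\dots,V_{B}$ partition $[0,\infty)$, which contains the range of $\theta$. Writing $p_{b}:=\mathbb{P}(\theta(X)\in V_{b})$ and $t$ for the common bound (both symbols are introduced here only for illustration), the step reads:

```latex
\[
\sqrt{\sum_{b\in[B]}p_{b}\,t^{2}}=t\sqrt{\sum_{b\in[B]}p_{b}}=t,
\qquad
t:=\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(4nB/\delta)}{n}}.
\]
```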

We now have the requisite tools to prove Theorem D.1. Our argument proceeds in largely the same way as the proofs of Theorems 3.5 and 4.8: we start by defining appropriate “good” events, and then bound the overall probability of their failure.

Proof of Theorem D.1.

As before, we start by defining some “bad” events. In particular, consider the events $B_{1}$ and $B_{2}$ defined respectively by

\[
B_{1}:=\left\{\mathrm{err}(\widehat{g},g_{\widehat{\theta}};\gamma_{\widehat{\theta}}^{\ast})>\mathrm{rate}_{1}(n,\delta_{1},\widehat{\theta};P_{Z})\right\}\quad\text{and}\quad B_{2}:=\left\{\mathrm{Cal}(\widehat{\theta},\widehat{g})>\frac{2\beta}{n}+2C\sqrt{\frac{2B\log(8nB/\delta_{2})}{n}}\right\},
\]

where $\widehat{g}=(\widehat{\eta},\widehat{\zeta})\leftarrow\mathcal{A}_{1}(\varphi,Z_{n+1:2n})$ and $\varphi$ is as defined in Algorithm 6. It is clear that $\mathbb{P}(B_{1}\mid Z_{1:n})\leq\delta_{1}$ by Assumption 6, since fixing $Z_{1:n}$ fixes the estimate $\varphi$. Thus, the law of total expectation yields that $\mathbb{P}(B_{1})\leq\delta_{1}$.

Bounding $\mathbb{P}(B_{2})$ takes mildly more care. Define the event

\[
E:=\left\{\forall b\in[B],\ \frac{1}{2B}\leq\mathbb{P}(\theta(X)\in V_{b})\leq\frac{2}{B}\right\},
\]

which by Lemma D.2 occurs with probability at least $1-\delta_{2}/2$ by the assumption that $n\geq cB\log(2B/\delta_{2})$. We have the following bound:

\begin{align*}
\mathbb{P}(B_{2}) &=\mathbb{P}(B_{2}\mid E)\mathbb{P}(E)+\mathbb{P}(B_{2}\mid E^{c})\mathbb{P}(E^{c})\\
&\leq\mathbb{P}(B_{2}\mid E)+\mathbb{P}(E^{c})\\
&=\mathbb{E}\left[\mathbb{P}(B_{2}\mid E,Z_{1:2n})\mid E\right]+\mathbb{P}(E^{c})\\
&\leq\delta_{2}/2+\delta_{2}/2=\delta_{2},
\end{align*}

where the final inequality follows from the fact that $\mathbb{P}(B_{2}\mid E,Z_{1:2n})\leq\delta_{2}/2$ by Proposition D.6 (applied with failure probability $\delta_{2}/2$), together with $\mathbb{P}(E^{c})\leq\delta_{2}/2$. We now apply Theorem 4.5 to see that, on the “good” event $G=B_{1}^{c}\cap B_{2}^{c}$ (which occurs with probability at least $1-\delta_{1}-\delta_{2}$), we have

\begin{align*}
\mathrm{Cal}(\widehat{\theta},\eta_{0}) &\leq\frac{\beta}{2\alpha}\mathrm{err}(\widehat{g},g_{\widehat{\theta}};\gamma_{\widehat{\theta}}^{\ast})+\frac{\beta}{\alpha}\mathrm{Cal}(\widehat{\theta},\widehat{g})\\
&\leq\frac{\beta}{2\alpha}\mathrm{rate}_{1}(n,\delta_{1},\widehat{\theta};P_{Z})+\frac{2\beta}{\alpha}\left(\frac{\beta}{n}+C\sqrt{\frac{2B\log(8nB/\delta_{2})}{n}}\right),
\end{align*}

which proves the desired result. ∎

Appendix E Additional Proofs

In this appendix, we provide proofs of additional claims that do not constitute our primary results. We proceed by section of the paper. See 3.4

Proof of Proposition 3.4.

Let $\overline{g}=(\overline{\eta},\overline{\zeta})\in[g,g_{0}]$. Observe that we have

\begin{align*}
&D_{g}^{2}\mathbb{E}[\partial\ell(\theta,\overline{g};Z)\mid X](g-g_{0},g-g_{0})\\
&\quad=D_{g}^{2}\mathbb{E}[\theta(X)-m(\overline{\eta};Z)\mid X](g-g_{0},g-g_{0})\\
&\qquad-D_{g}^{2}\mathbb{E}[\langle\overline{\zeta}(W),\overline{\eta}(W)-\eta_{0}(W)\rangle\mid X](g-g_{0},g-g_{0}) && \text{(Assumption 1)}\\
&\quad=-D_{g}^{2}\mathbb{E}[\langle\overline{\zeta}(W),(\overline{\eta}-\eta_{0})(W)\rangle\mid X](g-g_{0},g-g_{0}) && \text{(Linearity of $m(\eta;z)$)}.
\end{align*}

Now, we compute the Hessian term that appears on the final line. In particular, further calculation yields that

\begin{align*}
&-D_{g}^{2}\mathbb{E}[\langle\overline{\zeta}(W),\overline{\eta}(W)-\eta_{0}(W)\rangle\mid X](g-g_{0},g-g_{0})\\
&\qquad=\mathbb{E}\left[\left(\eta(W)-\eta_{0}(W),\zeta(W)-\zeta_{0}(W)\right)^{\top}\begin{pmatrix}0&-1\\ -1&0\end{pmatrix}\left(\eta(W)-\eta_{0}(W),\zeta(W)-\zeta_{0}(W)\right)\mid X\right].
\end{align*}

Thus, we can write the error term down as

\begin{align*}
&\mathrm{err}(g,g_{0};\theta)^{2}:=\sup_{f\in[g,g_{0}]}\mathbb{E}\left(\left\{D_{g}^{2}\mathbb{E}\left[\partial\ell(\theta,f;Z)\mid X\right](g-g_{0},g-g_{0})\right\}^{2}\right)\\
&\qquad=\mathbb{E}\left(\mathbb{E}\left[((\eta-\eta_{0})(W),(\zeta-\zeta_{0})(W))^{\top}\begin{pmatrix}0&-1\\ -1&0\end{pmatrix}((\eta-\eta_{0})(W),(\zeta-\zeta_{0})(W))\mid X\right]^{2}\right)\\
&\qquad=4\mathbb{E}\left(\mathbb{E}\left[\langle(\eta-\eta_{0})(W),(\zeta-\zeta_{0})(W)\rangle\mid X\right]^{2}\right)\\
&\qquad\leq 4\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})}^{2}\qquad\text{(Jensen's Inequality and Tower Rule)}.
\end{align*}

Thus, taking square roots, we see that we have

\[
\mathrm{err}(g,g_{0};\theta)\leq 2\|\langle\eta-\eta_{0},\zeta-\zeta_{0}\rangle\|_{L^{2}(P_{W})}.
\]

Dividing both sides by two yields the claimed result. The second claim is trivial and follows from an application of Jensen’s inequality.

See 4.7

Proof of Lemma 4.7.

The first claim is straightforward, as we have

\[
\mathbb{E}[\ell(\nu,g;Z)\mid\varphi_{1}(X)]=\mathbb{E}[\ell(\nu,g;Z)\mid\varphi_{2}(X)]
\]

for all $\nu\in\mathbb{R}$ and $g=(\eta,\zeta)\in\mathcal{G}$, so we just plug in $g_{0}=(\eta_{0},\zeta)$ for any $\zeta$. The equivalence of calibration functions, in particular, implies we have

\[
0=D_{g}\mathbb{E}[\partial\ell(\gamma_{\varphi_{c_{2}}}^{\ast},(\eta,\zeta_{\varphi_{c_{1}}});Z)\mid\varphi_{c_{1}}(X)](g-g_{0})=D_{g}\mathbb{E}[\partial\ell(\gamma_{\varphi_{c_{2}}}^{\ast},(\eta_{0},\zeta_{\varphi_{c_{1}}});Z)\mid\varphi_{c_{2}}(X)](g-g_{0})
\]

for any direction $g\in\mathcal{G}$ and choice of $c_{1},c_{2}\in\{1,2\}$. That is, $\ell$ satisfies the second point of Definition 4.3 when either $\zeta_{\varphi_{1}}$ or $\zeta_{\varphi_{2}}$ is plugged in. Thus, without loss of generality, we can assume they are the same. ∎