
Geodesic Causal Inference

Daisuke Kurisu, Yidong Zhou, Taisuke Otsu, and Hans-Georg Müller

Center for Spatial Information Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8568, Japan. daisukekurisu@csis.u-tokyo.ac.jp
Department of Statistics, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA. ydzhou@ucdavis.edu
Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK. t.otsu@lse.ac.uk
Department of Statistics, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA. hgmueller@ucdavis.edu
(Date: First version: January 6, 2024. This version: )
Abstract.

Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for scalar outcomes and recently for distributional outcomes. We introduce here a general framework for causal inference when outcomes reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the characterization of treatment effects through the concept of the geodesic average treatment effect. Using ideas from Fréchet regression, we develop estimation methods for the geodesic average treatment effect and derive consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Our methodology is illustrated by a simulation study and by real data examples: compositional outcomes from U.S. state-wise energy source data to study the effect of coal mining; network data of New York taxi trips, where the effect of the COVID-19 pandemic is of interest; and brain functional connectivity network data to study the effect of Alzheimer's disease.

Key words and phrases:
Doubly robust estimation, Fréchet regression, geodesic average treatment effect, metric statistics, networks, random objects
The first two authors contributed equally to the paper and are listed alphabetically. The work of D. K. was partially supported by JSPS KAKENHI Grant Number 23K12456, and the work of H.G.M. was partially supported by NSF grant DMS-2310450.

1. Introduction

Causal inference aims to mitigate the effects of confounders in the assessment of treatment effects and has garnered increasing attention in modern data science (Imbens and Rubin, 2015; Peters et al., 2017). While inadequate adjustment for confounders leads to biased causal effect estimates, Rosenbaum and Rubin (1983) demonstrated that the propensity score, defined as the probability of treatment assignment conditional on confounders, can be used to correct this bias. Robins et al. (1994) proposed a doubly robust estimation method combining outcome regression and propensity score modeling; see also Scharfstein et al. (1999) and Bang and Robins (2005). Hahn (1998) established the semiparametric efficiency bound for estimation of the average treatment effect, and Hirano et al. (2003) developed a semiparametrically efficient estimator; see Athey and Imbens (2017) for a survey.

Almost all existing work in causal inference has focused on scalar outcomes or, more generally, outcomes situated in a Euclidean space. The only exceptions are distributional data, for which causal inference was recently studied by Gunsilius (2023) and Lin et al. (2023). These approaches rely on the fact that for univariate distributions one can make use of restricted linear operations in the space of quantile functions, which forms a convex subset of a Hilbert space. These approaches are thus specific to univariate distributional outcomes and do not directly extend to the case of general random objects, i.e., metric space-valued random variables. Here we aim at a major generalization and develop a novel methodology for causal inference on general random objects. Specifically, we propose a generalized notion of causal effects and corresponding estimation methods when outcomes are located in a geodesic metric space, while treatments are discrete and potential confounders are Euclidean. Our framework covers Euclidean and univariate distributional outcomes as special cases.

A basic issue one then needs to address is how to quantify treatment effects when outcomes are complex random objects. Conventional approaches based on algebraic averaging or differencing in Euclidean spaces do not work in geodesic metric spaces and therefore need to be generalized. For our methodology and theory, we consider uniquely extendable geodesic spaces and demonstrate how a geodesic calculus that allows for certain geodesic operations in such spaces can be harnessed to define notions such as summation, subtraction, and scalar multiplication, which lead to the concept of the geodesic average treatment effect (GATE).

We then proceed to construct estimation methods for the GATE in this general setting and furthermore derive consistency and rates of convergence for the estimators towards their population targets. To facilitate practical applications of the proposed methodology, we also provide uncertainty quantification and an inference method for the GATE. Lastly, we demonstrate the finite sample behavior of the proposed estimation methods by a simulation study and, importantly, illustrate the proposed methodology by relevant real world examples for networks and compositional outcomes.

A brief overview is as follows. Basic concepts and results for all of the following are presented in Section 2. Specifically, Section 2.1 provides a brief review of uniquely extendable geodesic metric spaces and the geodesics in such spaces, where we introduce a scalar multiplication for geodesics that serves as a key tool for subsequent developments. Following these essential preparations, Section 2.1 also presents specific examples of uniquely extendable geodesic metric spaces that are encountered in real world applications, motivating the proposed methodology. Then in Section 2.2 we introduce the notion of the GATE in geodesic metric spaces, which generalizes the conventional notion of differences between potential outcomes in Euclidean spaces.

In Section 3 we introduce doubly robust and cross-fitting estimators for the GATE, where basic notions such as propensity scores and implementations of conditional Fréchet means through Fréchet regression (Petersen and Müller, 2019), as well as perturbations of random objects, are introduced in Section 3.1. These notions and auxiliary results are then used in Section 3.2 to define our doubly robust estimator. As main results, we establish asymptotic properties of the doubly robust and cross-fitting estimators in Section 4. Specifically, we demonstrate the consistency and convergence rates of these estimators under regularity conditions (Section 4.1) and discuss methods for assessing the uncertainty of the estimators (Section 4.2).

In Section 5 we present simulation results that evaluate the finite-sample performance of the proposed estimators for two metric spaces, the space of covariance matrices with the Frobenius metric and the space of compositional data with the geodesic metric on spheres. Section 6 is devoted to showcasing the proposed methodology for three types of real-world data, where we aim to infer the impact of coal mining in U.S. states on the composition of electricity generation sources, represented as spherical data; the impact of the COVID-19 pandemic on taxi usage in the Manhattan area, quantified as networks and represented as graph Laplacians; and the effect of an Alzheimer's disease diagnosis on brain functional connectivity networks. The paper closes with a brief discussion in Section 7.

2. Treatment effects for random objects

2.1. Preliminaries on uniquely extendable geodesic metric spaces

Consider a uniquely geodesic metric space $(\Omega,d)$ that is endowed with a probability measure $P$. For any $\alpha,\beta\in\Omega$, the unique geodesic connecting $\alpha$ and $\beta$ is a curve $\gamma_{\alpha,\beta}:[0,1]\mapsto\Omega$ such that $d(\gamma_{\alpha,\beta}(s),\gamma_{\alpha,\beta}(t))=d(\alpha,\beta)\,|t-s|$ for $s,t\in[0,1]$. For $\alpha,\beta,\zeta\in\Omega$, define the following simple operations on geodesics,
\[
\gamma_{\alpha,\zeta}\oplus\gamma_{\zeta,\beta}:=\gamma_{\alpha,\beta},\qquad \ominus\gamma_{\alpha,\beta}:=\gamma_{\beta,\alpha},\qquad \mathrm{id}_{\alpha}:=\gamma_{\alpha,\alpha}.
\]

The space $(\Omega,d)$ is further assumed to be geodesically complete (Gallier and Quaintance, 2020), meaning that every geodesic can be extended to a geodesic defined on $\mathbb{R}$. Specifically, for any $\alpha,\beta\in\Omega$, the geodesic $\gamma_{\alpha,\beta}:[0,1]\mapsto\Omega$ can be extended to $\bar{\gamma}_{\alpha,\beta}:\mathbb{R}\mapsto\Omega$, whose restriction to $[0,1]$ coincides with $\gamma_{\alpha,\beta}$, i.e., $\bar{\gamma}_{\alpha,\beta}|_{[0,1]}=\gamma_{\alpha,\beta}$. We can then define a scalar multiplication for geodesics $\gamma_{\alpha,\beta}$ with a factor $\rho\in\mathbb{R}$ by
\[
\rho\odot\gamma_{\alpha,\beta}:=\begin{cases}\{\bar{\gamma}_{\alpha,\beta}(t):t\in[0,\rho]\}&\text{if }\rho>0,\\ \{\bar{\gamma}_{\alpha,\beta}(t):t\in[\rho,0]\}&\text{if }\rho<0,\\ \mathrm{id}_{\alpha}&\text{if }\rho=0,\end{cases}\qquad \rho\odot\mathrm{id}_{\alpha}:=\mathrm{id}_{\alpha}.
\]
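For orientation, we note (as our own illustration, not part of the formal development) how these operations read in a Euclidean space, where $\bar{\gamma}_{\alpha,\beta}(t)=\alpha+t(\beta-\alpha)$:
\[
\rho\odot\gamma_{\alpha,\beta}\ \text{is the segment from}\ \alpha\ \text{to}\ \alpha+\rho(\beta-\alpha)\ \text{for}\ \rho>0,\qquad \ominus\gamma_{\alpha,\beta}=\gamma_{\beta,\alpha}\ \text{traverses the segment from}\ \beta\ \text{to}\ \alpha.
\]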

In practical applications, the relevant metric space often constitutes a subset $(\mathcal{M},d)$ of $(\Omega,d)$, which is typically closed and convex. Prominent instances of such spaces are given in Examples 2.1–2.4 below. We use these same example spaces to demonstrate the practical performance of the proposed geodesic treatment effect in simulations and real-world data analyses. In Appendix B we demonstrate that these examples satisfy the pertinent assumptions outlined in Sections 3 and 4; see Propositions B.1–B.4. Geodesic extensions are confined to operate within the closed subset $\mathcal{M}\subset\Omega$, in some cases necessitating suitable modifications to ensure that they do not cross the boundary (Zhu and Müller, 2023).

Considering the forward extension of a geodesic, assume that the geodesic $\gamma_{\alpha,\beta}$ extends to the boundary point $\zeta$ and denote the extended geodesic by $\gamma_{\alpha,\zeta}:[0,1]\mapsto\mathcal{M}$. For $\rho>1$, we then define a scalar multiplication by
\[
\rho\odot\gamma_{\alpha,\beta}=\{\gamma_{\alpha,\zeta}(t):t\in[0,h(\rho)]\},\qquad\text{where}\qquad
h(\rho)=1-\left(1-\frac{d(\alpha,\beta)}{d(\alpha,\zeta)}\right)^{\rho}.
\]

It is straightforward to verify that the start point of $\rho\odot\gamma_{\alpha,\beta}$ is always $\alpha$, while the end point is $\beta$ for $\rho=1$ and $\zeta$ for $\rho=\infty$: indeed, $h(1)=d(\alpha,\beta)/d(\alpha,\zeta)$, so $\gamma_{\alpha,\zeta}(h(1))$ lies at distance $d(\alpha,\beta)$ from $\alpha$ along the geodesic, i.e., equals $\beta$, while $h(\rho)\to 1$ as $\rho\to\infty$; see Figure 1 for an illustration. Similar modifications apply to the reverse extension where $\rho<-1$. To simplify the notation, we write the end point of $\rho\odot\gamma_{\alpha,\beta}$ as $\gamma_{\alpha,\beta}(\rho)$.

Figure 1. Illustration of geodesic extension. The circles symbolize objects in the geodesic metric space $\mathcal{M}$. The arrows emanating from these circles represent the geodesic paths connecting them.
Example 2.1 (Space of networks).

Consider simple, undirected, and weighted networks with a fixed number of nodes $m$ and bounded edge weights. There is a one-to-one correspondence between such a network and its graph Laplacian. The space of graph Laplacians equipped with the Frobenius (power) metric $d_F$ can thus be used to characterize the space of networks (Zhou and Müller, 2022), and it is a bounded and convex subset of $\mathbb{R}^{m^2}$.
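To make the correspondence concrete, the following minimal R sketch (function and variable names are ours) maps a weight matrix to its graph Laplacian and evaluates the Frobenius metric:

# Graph Laplacian L = D - W of a simple, undirected, weighted network,
# where W is the symmetric weight matrix with zero diagonal and D the
# diagonal degree matrix.
graph_laplacian <- function(W) {
  stopifnot(isSymmetric(W), all(diag(W) == 0))
  diag(rowSums(W)) - W
}

# Frobenius metric between two graph Laplacians
d_F <- function(L1, L2) sqrt(sum((L1 - L2)^2))

W1 <- matrix(c(0, 1, 2,
               1, 0, 0,
               2, 0, 0), 3, 3)   # toy 3-node network
W2 <- matrix(c(0, 2, 1,
               2, 0, 1,
               1, 1, 0), 3, 3)
d_F(graph_laplacian(W1), graph_laplacian(W2))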

Example 2.2 (Space of covariance matrices).

Consider $m$-dimensional covariance matrices with bounded variances. The space of such covariance matrices equipped with the Frobenius metric $d_F$ is a bounded and convex subset of $\mathbb{R}^{m^2}$. This holds true for the space of $m$-dimensional correlation matrices as well (Dryden et al., 2009).

Example 2.3 (Space of compositional data).

Compositional data take values in the simplex
\[
\Delta^{d-1}=\Big\{\bm{y}\in\mathbb{R}^{d}:y_{j}\geq 0,\ j=1,\ldots,d,\ \text{and}\ \sum_{j=1}^{d}y_{j}=1\Big\},
\]
reflecting that such data are non-negative proportions that sum to 1. Taking the component-wise square root $\sqrt{\bm{y}}=(\sqrt{y_{1}},\ldots,\sqrt{y_{d}})'$ of an element $\bm{y}\in\Delta^{d-1}$, the simplex $\Delta^{d-1}$ can be mapped to the first orthant of the unit sphere, $\mathcal{S}^{d-1}_{+}=\{\bm{z}\in\mathcal{S}^{d-1}:z_{j}\geq 0,\ j=1,\ldots,d\}$ (Scealy and Welsh, 2011, 2014). We then consider $\mathcal{M}$ to be the space of compositional data $\mathcal{S}^{d-1}_{+}$, equipped with the geodesic (Riemannian) metric on the sphere,
\[
d_{g}(\bm{y}_{1},\bm{y}_{2})=\arccos(\bm{y}_{1}'\bm{y}_{2}),\quad\text{for }\bm{y}_{1},\bm{y}_{2}\in\mathcal{S}^{d-1}_{+}.
\]

This idea can be extended to distributions by considering a separable Hilbert space $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. The Hilbert sphere $\mathcal{S}=\{g\in\mathcal{H}:\|g\|_{\mathcal{H}}=1\}$ is an infinite-dimensional extension of finite-dimensional spheres. The space of (multivariate) absolutely continuous distributions can be equipped with the Fisher-Rao metric
\[
d_{R}(f_{1},f_{2})=\arccos\big(\langle\sqrt{f_{1}},\sqrt{f_{2}}\rangle_{\mathcal{H}}\big)
\]
and then modeled as a subset of the Hilbert sphere $\mathcal{S}$, where the Fisher-Rao metric pertains to the geodesic metric between the square roots of the densities $f_{1}$ and $f_{2}$.

Example 2.4 (Wasserstein space).

Consider univariate probability measures on $\mathbb{R}$ with finite second moments. The space of such probability measures, equipped with the Wasserstein metric $d_{\mathcal{W}}$, is known as the Wasserstein space $(\mathcal{W},d_{\mathcal{W}})$, which is a complete and separable metric space. The Wasserstein metric between any two probability measures $\mu$ and $\nu$ is
\[
d_{\mathcal{W}}^{2}(\mu,\nu)=\int_{0}^{1}\{F_{\mu}^{-1}(p)-F_{\nu}^{-1}(p)\}^{2}\,dp,
\]
where $F_{\mu}^{-1}(\cdot)$ and $F_{\nu}^{-1}(\cdot)$ denote the quantile functions of $\mu$ and $\nu$, respectively.
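As an illustration, a minimal R sketch (ours) approximates $d_{\mathcal{W}}$ between two samples by discretizing the integral of squared quantile differences:

# Empirical Wasserstein metric between univariate samples x and y,
# approximating the integral over p in (0,1) on an equidistant grid.
wasserstein_dist <- function(x, y, M = 1000) {
  p <- (seq_len(M) - 0.5) / M
  sqrt(mean((quantile(x, p) - quantile(y, p))^2))
}

set.seed(1)
wasserstein_dist(rnorm(500), rnorm(500, mean = 1))   # population value is 1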

In Examples 2.1 and 2.2, for any two matrices $\alpha,\beta$ in the space, the geodesic is the line segment connecting $\alpha$ and $\beta$, i.e., $\gamma_{\alpha,\beta}(t)=\alpha+(\beta-\alpha)t$. In Example 2.3, for any two points $\alpha,\beta\in\mathcal{S}_{+}^{d-1}$, the geodesic connecting $\alpha$ and $\beta$ is
\[
\gamma_{\alpha,\beta}(t)=\cos(\theta t)\,\alpha+\sin(\theta t)\,\frac{\beta-(\alpha'\beta)\alpha}{\|\beta-(\alpha'\beta)\alpha\|},
\]
where $\theta=\arccos(\alpha'\beta)$ represents the angle between $\alpha$ and $\beta$.
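A minimal R sketch (ours) of this spherical geodesic; evaluating it at $t>1$, as long as the path stays within $\mathcal{S}^{d-1}_{+}$, yields the geodesic extension used in the scalar multiplication of Section 2.1:

# Geodesic on the sphere from alpha to beta, evaluated at time t;
# alpha, beta are unit vectors (e.g., square roots of compositions).
sphere_geo <- function(alpha, beta, t) {
  theta <- acos(sum(alpha * beta))        # angle between alpha and beta
  u <- beta - sum(alpha * beta) * alpha   # tangent direction at alpha
  u <- u / sqrt(sum(u^2))
  cos(theta * t) * alpha + sin(theta * t) * u
}

alpha <- sqrt(c(0.5, 0.3, 0.2))           # sqrt-transformed compositions
beta  <- sqrt(c(0.2, 0.5, 0.3))
mid   <- sphere_geo(alpha, beta, 0.5)
acos(sum(alpha * mid)) / acos(sum(alpha * beta))   # = 0.5, as required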

Write $\tau_{\#}\mu$ for the pushforward measure of $\mu$ by the transport map $\tau$. In Example 2.4, for any two distributions $\alpha,\beta\in\mathcal{W}$, the geodesic between them is given by McCann's interpolant (McCann, 1997),
\[
\gamma_{\alpha,\beta}(t)=\big(\mathrm{id}+t(F_{\beta}^{-1}\circ F_{\alpha}-\mathrm{id})\big)_{\#}\alpha,
\]
where $\mathrm{id}$, $F_{\alpha}$, and $F_{\beta}^{-1}$ denote the identity map, the cumulative distribution function of $\alpha$, and the quantile function of $\beta$, respectively.
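For univariate distributions, McCann's interpolant is equivalent to linear interpolation of quantile functions, $F^{-1}_{\gamma_{\alpha,\beta}(t)}=(1-t)F_{\alpha}^{-1}+tF_{\beta}^{-1}$; a minimal R sketch (ours), representing each distribution by its quantile values on a grid:

# Wasserstein geodesic between alpha and beta via quantile functions
mccann <- function(q_alpha, q_beta, t) (1 - t) * q_alpha + t * q_beta

p <- seq_len(999) / 1000
q_alpha <- qnorm(p)                       # alpha = N(0, 1)
q_beta  <- qnorm(p, mean = 2)             # beta  = N(2, 1)
q_mid   <- mccann(q_alpha, q_beta, 0.5)   # quantiles of the midpoint N(1, 1)
max(abs(q_mid - qnorm(p, mean = 1)))      # numerically zero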

2.2. Geodesic average treatment effects

Let $(\mathcal{M},d)$ be a uniquely extendable geodesic metric space. Units $i=1,\dots,n$ either receive a treatment or are not treated. For the $i$-th unit, we observe a treatment indicator variable $T_{i}$, where $T_{i}=1$ if the $i$-th unit is treated and $T_{i}=0$ otherwise, along with an outcome
\[
Y_{i}=\begin{cases}Y_{i}(0)&\text{if }T_{i}=0,\\ Y_{i}(1)&\text{if }T_{i}=1,\end{cases}
\]
where $Y_{i}(0),Y_{i}(1)\in\mathcal{M}$ are the potential outcomes for $T_{i}=0$ and $1$, respectively. Additionally, we observe Euclidean covariates $X_{i}\in\mathbb{R}^{p}$. We assume that $\{Y_{i},T_{i},X_{i}\}_{i=1}^{n}$ is an i.i.d. sample from a joint distribution $P$, where a generic random variable distributed according to $P$ is $(Y,T,X)\in\mathcal{M}\times\{0,1\}\times\mathcal{X}$, and we assume that all conditional distributions are well-defined; expectations and conditional expectations in the following are with respect to $P$. We assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^{p}$ and write $Y=Y(t)$ if $T=t\in\{0,1\}$.

A standard way to quantify causal effects in the usual situation where outcomes are located in Euclidean spaces is to target the difference of the potential outcomes. When outcomes take values in a general metric space this is no longer an option, due to the absence of a linear structure in general metric spaces (see, e.g., Example 2.3), which means that the notion of a difference does not exist. This requires extending the conventional notion of causal effects to the case of random objects in $\mathcal{M}$; to address this challenge we propose to replace the conventional difference by the geodesic that connects the Fréchet means of the potential outcomes.

Define the Fréchet mean of a random element $A\in\mathcal{M}$ and the conditional Fréchet mean of $A\in\mathcal{M}$ given $Z\in\mathbb{R}^{p}$ as
\[
\operatorname{E}_{\oplus}[A]:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[d^{2}(\nu,A)],\qquad
\operatorname{E}_{\oplus}[A|Z]:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[d^{2}(\nu,A)|Z],\tag{2.1}
\]
respectively. Assuming the minimizers exist and are unique, we introduce the following notions of causal effects in geodesic metric spaces.

Definition 2.1 (Geodesic individual/average treatment effect).

The geodesic individual treatment effect of $T$ on $Y$ for the $i$-th unit is defined as the geodesic connecting $Y_{i}(0)$ and $Y_{i}(1)$, i.e., $\gamma_{Y_{i}(0),Y_{i}(1)}$. The geodesic average treatment effect (GATE) of $T$ on $Y$ is defined as the geodesic connecting $\operatorname{E}_{\oplus}[Y(0)]$ and $\operatorname{E}_{\oplus}[Y(1)]$, i.e., $\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}$.

In Euclidean spaces, the geodesic individual treatment effect and the GATE reduce to the difference of the potential outcomes for the $i$-th unit, $Y_{i}(1)-Y_{i}(0)$, and the difference of the expected potential outcomes, $\operatorname{E}[Y(1)]-\operatorname{E}[Y(0)]$, respectively. These can be interpreted as quantities that represent the shortest distance and directional information from $Y_{i}(0)$ to $Y_{i}(1)$ or from $\operatorname{E}[Y(0)]$ to $\operatorname{E}[Y(1)]$ with respect to the Euclidean metric. Since a geodesic connecting points $\alpha$ and $\beta$ in a geodesic metric space $(\mathcal{M},d)$ embodies both the shortest distance and the directional information from $\alpha$ to $\beta$ with respect to the metric $d$, Definition 2.1 emerges as a natural extension for quantifying treatment effects in geodesic metric spaces.

3. Doubly robust estimation of geodesic average treatment effects

3.1. Preliminaries

To construct doubly robust estimators in metric spaces, one needs to extend the notion of doubly robust estimation of causal effects to the case of random objects in a geodesic metric space. In analogy to the Euclidean case, key ingredients are the propensity score $p(x)=\operatorname{P}(T=1|X=x)$ and the conditional Fréchet means of the potential outcomes, $m_{t}(x)=\operatorname{E}_{\oplus}[Y(t)|X=x]$ for $t\in\{0,1\}$. We impose the following conditions.

Assumption 3.1.
  • (i) $(\mathcal{M},d)$ is a uniquely extendable geodesic metric space that is complete, separable, and totally bounded.
  • (ii) There exists a constant $\eta_{0}\in(0,1/2)$ such that $\eta_{0}\leq p(x)\leq 1-\eta_{0}$ for each $x\in\mathcal{X}$.
  • (iii) $T$ and $\{Y(0),Y(1)\}$ are conditionally independent given $X$.

Assumptions 3.1 (ii) and (iii) are the standard overlap and unconfoundedness (or ignorability) conditions of causal inference. Assumption 3.1 (ii) says that for any $x\in\mathcal{X}$, there is a positive probability of finding units of both the treatment and control groups in any arbitrarily small neighborhood of $x$. Assumption 3.1 (iii) says that conditionally on $X$, the treatment is assigned independently of the potential outcomes. In the Euclidean case, this setup leads to various estimation methods for the average treatment effect, including inverse probability weighting, outcome regression, and doubly robust estimation methods.

Our goal here is to extend doubly robust estimation to geodesic metric spaces, targeting the GATE. To this end, we need to introduce the notion of a (Fréchet) regression model of $Y(t)$ on $X$ in geodesic metric spaces. More precisely, we impose the following conditions, where $\operatorname{E}_{\oplus}$ is as in (2.1).

Assumption 3.2.

There exist random perturbation maps $\mathcal{P}_{t}:\mathcal{M}\to\mathcal{M}$ and conditional Fréchet mean functions $m_{t}:\mathcal{X}\to\mathcal{M}$ for $t\in\{0,1\}$ such that
  • (i) $Y(t)=\mathcal{P}_{t}(m_{t}(X))$,
  • (ii) $\operatorname{E}_{\oplus}[\mathcal{P}_{t}(m_{t}(X))|X]=m_{t}(X)$,
  • (iii) $\operatorname{E}_{\oplus}[\mathcal{P}_{t}(m_{t}(X))]=\operatorname{E}_{\oplus}[m_{t}(X)]$.

The random perturbation map $\mathcal{P}_{t}$ is a generalization of the additive noise $\varepsilon$ in the Euclidean case; it is needed here because an addition operation is not available, and it generalizes the error assumption $\operatorname{E}[\varepsilon|X]=0$. Specifically, Assumption 3.2 (ii) is equivalent to $\operatorname{E}_{\oplus}[Y(t)|X]=m_{t}(X)$, so that $m_{t}$ becomes the Fréchet regression function of $Y(t)$ on $X$ (Petersen and Müller, 2019). In the following, we refer to $m_{t}$ as the outcome regression function. Our conditions on the perturbation maps generalize those considered in Chen and Müller (2022).

To obtain a doubly robust representation of the GATE, we additionally introduce an expectation operation for geodesics. For random objects $A,B\in\mathcal{M}$ and Euclidean random variables $\kappa\in\mathbb{R}$ and $X\in\mathbb{R}^{p}$, define
\begin{align*}
\operatorname{E}_{G}[\gamma_{A,B}]&:=\gamma_{\operatorname{E}_{\oplus}[A],\operatorname{E}_{\oplus}[B]},\\
\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}|X]&:=\operatorname{E}[\kappa|X]\odot\gamma_{\operatorname{E}_{\oplus}[A|X],\operatorname{E}_{\oplus}[B|X]},\\
\operatorname{E}_{G,X}[\kappa\odot\gamma_{A,B}]&:=\operatorname{E}_{G}\left[\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}|X]\right].
\end{align*}

These definitions of the geodesic expectations $\operatorname{E}_{G}[\cdot]$, $\operatorname{E}_{G}[\cdot|X]$, and $\operatorname{E}_{G,X}[\cdot]$ are consistent with the corresponding (conditional) expectations in the Euclidean case if $\kappa$ and $\{A,B\}$ are conditionally independent given $X$. Indeed, if $\kappa$ and $\{A,B\}$ are conditionally independent given $X$ and $A,B$ are Euclidean, it is easy to see that $\operatorname{E}_{G,X}[\kappa\odot\gamma_{A,B}]=\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}]$ by the law of iterated expectations.

Based on the above assumptions and notions, a doubly robust representation of the GATE for geodesic metric spaces is as follows.

Proposition 3.1.

Let $e(x)$ and $\mu_{t}(x)$, $t\in\{0,1\}$, be models for the propensity score $p(x)$ and the outcome regression functions $m_{t}(x)$, $t\in\{0,1\}$. Suppose that Assumptions 3.1 and 3.2 hold. Then for any $\mu\in(\mathcal{M},d)$, it holds that
\[
\gamma_{\mu,\operatorname{E}_{\oplus}[Y(t)]}=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{t}(X)}\oplus\left\{\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\odot\gamma_{\mu_{t}(X),Y}\right\}\right],\quad t\in\{0,1\},\tag{3.1}
\]
provided that $e(x)=p(x)$ or $\mu_{t}(x)=m_{t}(x)$ for $t\in\{0,1\}$ (one or both of these must hold), and hence
\begin{align}
\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}&=\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\nonumber\\
&\qquad\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].\tag{3.2}
\end{align}

This proposition shows that the GATE is a natural generalization of the average treatment effect for Euclidean data to the case of geodesic metric spaces (see Remark 3.1 below). Furthermore, the representation has the double robustness property: either $e(x)$ or $\mu_{t}(x)$, $t\in\{0,1\}$, may be misspecified (but not both) while still achieving consistent estimation of the GATE. Since the GATE is defined via the geodesic between the Fréchet means of the potential outcomes, we first establish the representation from an arbitrary starting point $\mu$ in (3.1). Although the representation of the GATE in (3.2) also involves an arbitrary starting point $\mu$, the doubly robust, cross-fitting, and outcome regression GATE estimators presented below do not require choosing $\mu$; implementing the inverse probability weighting estimator, however, requires the choice of a starting point $\mu$, which can be arbitrary (see Remark 3.3 below for further details). We illustrate in Figure 2 the doubly robust representation of the GATE when the outcome regression model or alternatively the propensity score model is correctly specified.

Figure 2. Doubly robust representation of the GATE when either (A) the outcome regression model or (B) the propensity score model is correctly specified. In both panels, blue solid circles represent objects in the geodesic metric space $\mathcal{M}$. Arrows depict geodesics connecting them. Loop arrows represent geodesics that start and end at the same point. The red arrow in each panel represents the proposed GATE estimator, which reduces to the target $\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}$ in both cases, confirming the double robustness.
Remark 3.1 (Connection with the established doubly robust estimator for Euclidean data).

For the Euclidean case, $(\mathcal{M},d)=(\mathbb{R},\|\cdot\|)$, where $\|\cdot\|$ is the standard Euclidean metric. Then the average treatment effect of the treatment $T$ on the outcome $Y$ is $\operatorname{E}[Y(1)]-\operatorname{E}[Y(0)]$, and Assumption 3.1 guarantees
\begin{align}
\operatorname{E}[Y(t)]&=\operatorname{E}\left[\mu_{t}(X)+\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)(Y-\mu_{t}(X))\right]\nonumber\\
&=\operatorname{E}\left[\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right]\nonumber\\
&=\mathop{\rm arg\,min}_{\nu\in\mathbb{R}}\operatorname{E}\left[\left\|\nu-\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right\|^{2}\right],\tag{3.3}
\end{align}
provided $e(x)=p(x)$ or $\mu_{t}(x)=\operatorname{E}[Y(t)|X=x]$ for $t\in\{0,1\}$, and this leads to the construction of a doubly robust estimator for the Euclidean case. Additionally, we can show that for any $\mu\in\mathbb{R}$,
\begin{align}
\gamma_{\operatorname{E}[Y(0)],\operatorname{E}[Y(1)]}&=\ominus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\nonumber\\
&\qquad\oplus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].\tag{3.4}
\end{align}

Adopting this perspective, Proposition 3.1 is seen to be indeed an extension of the doubly robust representation of the average treatment effect in the Euclidean setting to geodesic metric spaces, with the Euclidean scenario as a special case; see Appendix C for the proof of (3.1).
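To connect (3.3) with familiar practice, the following minimal R sketch (all names are ours) implements the Euclidean doubly robust (AIPW) estimator of $\operatorname{E}[Y(t)]$ with a logistic propensity model and a linear outcome regression; it illustrates only the special case of this remark, not the general object-space estimator:

# Doubly robust estimate of E[Y(t)] in the Euclidean case of Remark 3.1
aipw_mean <- function(Y, T, X, t) {
  e_hat  <- fitted(glm(T ~ X, family = binomial))   # e(X; phi_hat)
  mu_hat <- predict(lm(Y ~ X, subset = (T == t)),
                    newdata = data.frame(X = X))    # mu_t(X_i)
  kappa  <- t * T / e_hat + (1 - t) * (1 - T) / (1 - e_hat)
  mean(mu_hat + kappa * (Y - mu_hat))   # endpoint of gamma_{mu_t(X),Y}(kappa)
}

set.seed(1)
n <- 500
X <- runif(n, -1, 1)
T <- rbinom(n, 1, plogis(0.75 * X))
Y <- T + X + rnorm(n)
aipw_mean(Y, T, X, 1) - aipw_mean(Y, T, X, 0)   # estimates the ATE (here 1)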

3.2. Doubly robust estimation

Building upon Proposition 3.1, we can construct a sample-based doubly robust estimator for the GATE. Note that the right-hand side of (3.2) can be rewritten as $\operatorname{E}_{G,X}[\gamma_{A_{0},A_{1}}]$ with
\[
A_{t}=\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right),\quad t\in\{0,1\},
\]
which corresponds to the representation in (3.1). By moving to the sample counterpart, a doubly robust estimator for the GATE is obtained as $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$, where
\begin{align}
\widehat{\Theta}_{t}^{(\mathrm{DR})}&:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}Q_{n,t}(\nu;\widehat{\mu}_{t},\widehat{\varphi}),\nonumber\\
Q_{n,t}(\nu;\mu,\varphi)&={1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\mu(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\varphi)}+{(1-t)(1-T_{i})\over 1-e(X_{i};\varphi)}\right)\right).\tag{3.5}
\end{align}

Here $e(x;\varphi)$ is a parametric model for the propensity score $p(x)$, $\widehat{\varphi}$ is an estimator of $\varphi$, and $\widehat{\mu}_{t}(x)$ is an estimator of the outcome regression function $m_{t}(x)$. In our real data analysis, we estimate the propensity score by a logistic regression model and the outcome regression function by global Fréchet regression (Petersen and Müller, 2019):
\[
\widehat{\mu}_{t}(x):=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over N_{t}}\sum_{i\in I_{t}}\{1+(X_{i}-\bar{X})'\widehat{\Sigma}^{-1}(x-\bar{X})\}\,d^{2}(\nu,Y_{i}),\quad t\in\{0,1\},\tag{3.6}
\]
where $I_{t}=\{1\leq i\leq n:T_{i}=t\}$, $N_{t}$ is the cardinality of $I_{t}$, $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$, and $\widehat{\Sigma}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})'$.

Remark 3.2 (Cross-fitting estimator).

In the Euclidean case, doubly robust estimation methods have been combined with cross-fitting to mitigate over-fitting (Chernozhukov et al., 2018). The proposed doubly robust estimator $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$ can be similarly adapted to incorporate cross-fitting, as follows.

Let $\{S_{k}\}_{k=1}^{K}$ be a (random) partition of $\{1,\dots,n\}$ into $K$ subsets. For each $k$, let $\widehat{\varphi}_{k}$ and $\widehat{\mu}_{t,k}(x)$ be the estimators of $\varphi$ and $\mu_{t}(x)$ computed from the data $\{Y_{i},T_{i},X_{i}\}_{i\in S_{-k}}$ outside the $k$-th subset, where $S_{-k}=\cup_{j\neq k}S_{j}$. Setting
\[
\widehat{\Theta}_{t,k}^{(\mathrm{DR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n_{k}}\sum_{i\in S_{k}}d^{2}\left(\nu,\gamma_{\widehat{\mu}_{t,k}(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\widehat{\varphi}_{k})}+{(1-t)(1-T_{i})\over 1-e(X_{i};\widehat{\varphi}_{k})}\right)\right),\quad t\in\{0,1\},
\]
where $n_{k}$ is the cardinality of $S_{k}$, the cross-fitting doubly robust estimator for the GATE is obtained as $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{CF})},\widehat{\Theta}_{1}^{(\mathrm{CF})}}$, with
\[
\widehat{\Theta}_{t}^{(\mathrm{CF})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\sum_{k=1}^{K}{n_{k}\over n}\,d^{2}(\nu,\widehat{\Theta}_{t,k}^{(\mathrm{DR})}),\quad t\in\{0,1\}.\tag{3.7}
\]
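A minimal R sketch (ours) of this cross-fitting scheme in the Euclidean special case, where the aggregation in (3.7) reduces to a weighted average of the fold-wise estimates:

# Cross-fitting doubly robust estimate of E[Y(t)]: nuisance models are
# fit on S_{-k} and evaluated on S_k; fold estimates are combined with
# weights n_k / n.
crossfit_mean <- function(Y, T, X, t, K = 5) {
  fold <- sample(rep(1:K, length.out = length(Y)))
  theta_k <- sapply(1:K, function(k) {
    train <- fold != k
    test  <- fold == k
    e_hat  <- predict(glm(T ~ X, family = binomial, subset = train),
                      newdata = data.frame(X = X[test]), type = "response")
    mu_hat <- predict(lm(Y ~ X, subset = train & (T == t)),
                      newdata = data.frame(X = X[test]))
    kappa <- t * T[test] / e_hat + (1 - t) * (1 - T[test]) / (1 - e_hat)
    mean(mu_hat + kappa * (Y[test] - mu_hat))
  })
  weighted.mean(theta_k, w = tabulate(fold, nbins = K))   # weights n_k / n
}

With the simulated data from the sketch in Remark 3.1, crossfit_mean(Y, T, X, 1) - crossfit_mean(Y, T, X, 0) again estimates the average treatment effect.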

In the next section, we provide the asymptotic properties of these doubly robust and cross-fitting sample based estimators.

Remark 3.3 (Outcome regression and inverse probability weighting estimators).

If the models for the outcome regression functions are correctly specified, i.e., $\mu_{t}=m_{t}$, we obtain $\operatorname{E}_{\oplus}[Y(t)]=\operatorname{E}_{\oplus}[\mu_{t}(X)]$, and the GATE can be estimated by taking the sample counterpart of $\gamma_{\operatorname{E}_{\oplus}[\mu_{0}(X)],\operatorname{E}_{\oplus}[\mu_{1}(X)]}$, which we refer to as the outcome regression estimator. Specifically, the outcome regression estimator for the GATE is given by $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{OR})},\widehat{\Theta}_{1}^{(\mathrm{OR})}}$, where
\[
\widehat{\Theta}_{t}^{(\mathrm{OR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}(\nu,\widehat{\mu}_{t}(X_{i})),\quad t\in\{0,1\}.
\]

On the other hand, if the model for the propensity score is correctly specified, i.e., $e(x)=p(x)$, one finds
\[
\operatorname{E}_{G,X}\left[\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\odot\gamma_{\mu,Y}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(t)]}
\]
for any $\mu\in\mathcal{M}$. Thus the GATE can be obtained by the inverse probability weighting method, which corresponds to the sample counterpart of
\[
\ominus\operatorname{E}_{G,X}\left[\left({1-T\over 1-e(X)}\right)\odot\gamma_{\mu,Y}\right]\oplus\operatorname{E}_{G,X}\left[\left({T\over e(X)}\right)\odot\gamma_{\mu,Y}\right].\tag{3.8}
\]

Specifically, since (3.8) can be rewritten as $\operatorname{E}_{G,X}[\gamma_{B_{0},B_{1}}]$ with
\[
B_{t}=\gamma_{\mu,Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right),\quad t\in\{0,1\},
\]
the inverse probability weighting estimator for the GATE is given by $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{IPW})},\widehat{\Theta}_{1}^{(\mathrm{IPW})}}$, where
\[
\widehat{\Theta}_{t}^{(\mathrm{IPW})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\widehat{\mu},Y_{i}}\left({tT_{i}\over\widehat{e}(X_{i})}+{(1-t)(1-T_{i})\over 1-\widehat{e}(X_{i})}\right)\right),\quad t\in\{0,1\},
\]
and $\widehat{\mu}$ is an estimator of $\mu\in\mathcal{M}$. While the inverse probability weighting estimator requires the choice of a starting point $\mu$ for the geodesic $\gamma_{\mu,Y}$ in (3.8), this starting point can be chosen arbitrarily; a simple choice that we adopt in our implementations is $\widehat{\mu}$, the sample Fréchet mean of the $\{Y_{i}\}$. The asymptotic properties of $\widehat{\Theta}_{t}^{(\mathrm{OR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{IPW})}$ are provided in Appendix A, and the finite sample performance of the outcome regression and inverse probability weighting estimators is examined through simulations in Section 5.

4. Main results

4.1. Consistency and convergence rates

To examine the asymptotic properties of the proposed doubly robust and cross-fitting estimators, we impose the following assumptions on the models for the propensity score and outcome regression functions. Let $\mathcal{M}_{e}=\{e(x;\varphi):x\in\mathcal{X},\varphi\in\Phi\}$ be a class of parametric models for the propensity score $p(x)$, where $\Phi\subset\mathbb{R}^{p}$ is a compact set. For $t\in\{0,1\}$, let $\widehat{\mu}_{t}(\cdot)$ be an estimator of the outcome regression function $m_{t}(\cdot)$.

Assumption 4.1.
  • (i) For $\varphi_{1},\varphi_{2}\in\Phi$, assume that $|e(x;\varphi_{1})-e(x;\varphi_{2})|\leq C_{e}\|\varphi_{1}-\varphi_{2}\|$ for some positive constant $C_{e}$, and for all $x\in\mathcal{X}$ and $\varphi\in\Phi$, $\eta_{0}\leq e(x;\varphi)\leq 1-\eta_{0}$, where $\eta_{0}$ is the same constant as in Assumption 3.1 (ii).
  • (ii) There exist $\varphi_{*}\in\Phi$ and an estimator $\widehat{\varphi}$ such that $\|\widehat{\varphi}-\varphi_{*}\|=O_{p}(\varrho_{n})$ with $\varrho_{n}\to 0$ as $n\to\infty$.
  • (iii) There exist functions $\mu_{t}(\cdot)$, $t\in\{0,1\}$, such that $\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),\mu_{t}(x))=O_{p}(r_{n})$, $t\in\{0,1\}$, with $r_{n}\to 0$ as $n\to\infty$.

Assumptions 4.1 (i) and (ii) imply the uniform convergence rate $\sup_{x\in\mathcal{X}}|e(x;\widehat{\varphi})-e(x;\varphi_{*})|=O_{p}(\varrho_{n})$ for the propensity score estimator, regardless of whether the propensity score model is misspecified or not. These assumptions are typically satisfied with $\varrho_{n}=n^{-1/2}$ under mild regularity conditions. Assumption 4.1 (iii) concerns the outcome regression. If one employs the global Fréchet regression model (3.6) for estimating $m_{t}$, then under some regularity conditions this assumption is satisfied with $r_{n}=n^{-\alpha'}$ for some $\alpha'>1/2$ (see Theorem 1 in Petersen and Müller (2019)) and
\[
\mu_{t}(x)=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[\{1+(X-\operatorname{E}[X])'\Sigma^{-1}(x-\operatorname{E}[X])\}\,d^{2}(\nu,Y)\,|\,T=t],
\]
where $\Sigma=\operatorname{Var}(X)$, again regardless of whether the model is misspecified or not for the conditional Fréchet mean, which is the true target. Alternatively, one can employ local Fréchet regression as the outcome regression method, which is more flexible but comes with slower rates of convergence and is subject to the curse of dimensionality (see Theorem 1 in Chen and Müller (2022)).

To establish the consistency of $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ in (3.5) and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in (3.7), we additionally require that the geodesics in $\mathcal{M}$ satisfy the following assumption.

Assumption 4.2.

For all $\alpha_{1},\alpha_{2}\in(\mathcal{M},d)$,
\[
\sup_{\beta\in\mathcal{M},\,\kappa\in[1/(1-\eta_{0}),1/\eta_{0}]}d\left(\gamma_{\alpha_{1},\beta}(\kappa),\gamma_{\alpha_{2},\beta}(\kappa)\right)\leq C_{0}\,d(\alpha_{1},\alpha_{2}),\tag{4.1}
\]
for some positive constant $C_{0}$ depending only on $\eta_{0}$.

This is a Lipschitz-type condition to control the criterion function in (3.5), which involves the geodesics $\gamma_{\mu(X_{i}),Y_{i}}$. When the space $\Omega$ is a CAT(0) space (i.e., a geodesic metric space that is globally non-positively curved in the sense of Alexandrov), this assumption is automatically satisfied. Many metric spaces of statistical interest are CAT(0) spaces, such as the Wasserstein space of univariate distributions, the space of networks expressed as graph Laplacians with the (power) Frobenius metric, covariance/correlation matrices with the affine-invariant metric (Thanwerdas and Pennec, 2023), the Cholesky metric (Lin, 2019), and various other metrics (Pigoli et al., 2014), as well as the phylogenetic tree space with the BHV metric (Billera et al., 2001; Lin and Müller, 2021). Assumption 4.2 also holds for some relevant positively curved spaces such as open hemispheres or orthants of spheres, which are relevant for compositional data with the Riemannian metric; these spaces are not CAT(0), see Example 2.3.

For $t\in\{0,1\}$, denote by $\Theta_{t}^{(\mathrm{DR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}Q_{t}(\nu;\mu_{t},\varphi_{*})$ the minimizer of the population criterion function, where
\[
Q_{t}(\nu;\mu,\varphi)=\operatorname{E}\left[d^{2}\left(\nu,\gamma_{\mu(X),Y}\left({tT\over e(X;\varphi)}+{(1-t)(1-T)\over 1-e(X;\varphi)}\right)\right)\right].
\]

To guarantee the identification of the object of interest $\operatorname{E}_{\oplus}[Y(t)]$ by the doubly robust estimation approach, we add the following conditions.

Assumption 4.3.

Assume that for $t\in\{0,1\}$,
  • (i) the objects $\Theta_{t}^{(\mathrm{DR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ exist and are unique, and for any $\varepsilon>0$,
\[
\inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})>\varepsilon}Q_{t}(\nu;\mu_{t},\varphi_{*})>Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu_{t},\varphi_{*});
\]
  • (ii) $\Theta_{t}^{(\mathrm{DR})}=\operatorname{E}_{\oplus}[Y(t)]$.

Assumption 4.3 (i) is a standard separation condition for achieving the consistency of the M-estimator (see, e.g., Chapter 3.2 in van der Vaart and Wellner (2023)). Note that the existence of the minimizers follows immediately if $\mathcal{M}$ is compact. Assumption 4.3 (ii) requires that at least one of the models for the propensity score and the outcome regression is correctly specified.

The following result establishes the consistency of the doubly robust and cross-fitting estimators $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ in (3.5) and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in (3.7).

Theorem 4.1.

Suppose that Assumptions 3.1, 3.2, 4.2, and 4.3 hold.
  • (i) Under Assumption 4.1, as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\quad t\in\{0,1\}.
\]
  • (ii) Suppose that Assumption 4.1 holds for $\{\widehat{\varphi}_{k},\widehat{\mu}_{t,k}\}_{k=1}^{K}$ and there exist constants $c_{1}$ and $c_{2}$ such that $0<c_{1}\leq n_{k}/n\leq c_{2}<1$ for all $n$ and $k=1,\dots,K$. Then as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\quad t\in\{0,1\}.
\]

To obtain rates of convergence for the estimators $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in the metric space $(\Omega,d)$, we need additional entropy conditions to quantify the complexity of the space. Let $B_{\delta}(\omega)$ be the ball of radius $\delta$ centered at $\omega\in\Omega$ and $N(\varepsilon,B_{\delta}(\omega),d)$ its covering number using balls of size $\varepsilon$.

Assumption 4.4.

For $t\in\{0,1\}$,
  • (i) As $\delta\to 0$,
\begin{align}
J_{t}(\delta)&:=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{DR})}),d)}\,d\varepsilon=O(1),\tag{4.2}\\
J_{\mu_{t}}(\delta)&:=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta'_{1}}(\mu_{t}),d_{\infty})}\,d\varepsilon=O(\delta^{-\varpi}),\tag{4.3}
\end{align}
for some $\delta'_{1}>0$ and $\varpi\in(0,1)$, where for $\nu,\mu:\mathcal{X}\to\mathcal{M}$, $d_{\infty}(\nu,\mu):=\sup_{x\in\mathcal{X}}d(\nu(x),\mu(x))$.
  • (ii) There exist positive constants $\eta,\eta_{1},C,C_{1}$, and $\beta>1$ such that
\[
\inf_{\begin{subarray}{c}d_{\infty}(\mu,\mu_{t})\leq\eta_{1}\\ \|\varphi-\varphi_{*}\|\leq\eta_{1}\end{subarray}}\ \inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})<\eta}\left\{Q_{t}(\nu;\mu,\varphi)-Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu,\varphi)-C\,d(\nu,\Theta_{t}^{(\mathrm{DR})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}\,d(\nu,\Theta_{t}^{(\mathrm{DR})})^{{\beta\over 2}}\right\}\geq 0.\tag{4.4}
\]

These assumptions are used to control the behavior of the (centered) criterion function $Q_{n,t}-Q_{t}$ around the minimum. Assumption 4.4 (i) postulates an entropy condition for $\Theta_{t}^{(\mathrm{DR})}$ and $\mu_{t}$. A sufficient condition for (4.3) is $J'_{\mu_{t}}(\delta):=\int_{0}^{1}\sup_{x\in\mathcal{X}}\sqrt{1+\log N(\bar{c}\delta\varepsilon,B_{\delta'_{1}}(\mu_{t}(x)),d)}\,d\varepsilon=O(\delta^{-\varpi'})$ for some positive constant $\bar{c}$ and $\varpi'\in(0,1)$. One can verify (4.2) and
\[
J'_{\mu_{t}}(\delta)=O(-\log\delta),\quad\text{as }\delta\to 0,\tag{4.5}
\]
for common metric spaces including Examples 2.1–2.4, for which Assumption 4.4 (i) is therefore satisfied. If (4.5) holds, the $\varpi\in(0,1)$ in Theorem 4.2 below can be taken arbitrarily small.

Assumption 4.4 (ii) constrains the shape of the population criterion function $Q_{t}$. This assumption is also satisfied for common metric spaces; indeed, Examples 2.1–2.4 satisfy this condition with $\beta=2$ (see Propositions B.1–B.4 below). This assumption is an adaptation of Assumption B3 in Delsol and Van Keilegom (2020), who developed an asymptotic theory for M-estimators with plugged-in estimated nuisance parameters. In the Euclidean case, this assumption is typically satisfied with $\beta=2$, and the third and fourth terms in (4.4) are then the quadratic and linear terms approximating the population criterion function $Q_{t}$ in $\nu$. If $Q_{t}$ does not involve nuisance parameters, the fourth term in (4.4) vanishes, in analogy to Assumption (U2) in Petersen and Müller (2019).

Under these assumptions, the convergence rates of the doubly robust and cross-fitting estimators are obtained as follows.

Theorem 4.2.

Suppose that Assumptions 3.1, 3.2, 4.2, and 4.3 hold.
  • (i) If Assumptions 4.1 and 4.4 hold, then for each $\beta'\in(0,1)$, as $n\to\infty$, with $\varrho_{n}$ and $r_{n}$ as in Assumption 4.1 and $\beta,\varpi$ as in Assumption 4.4,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta'\over\beta-1}}\right),\quad t\in\{0,1\}.\tag{4.6}
\]
  • (ii) If Assumptions 4.1 and 4.4 hold for $\{\widehat{\varphi}_{k},\widehat{\mu}_{t,k}\}_{k=1}^{K}$ and there exist constants $c_{1}$ and $c_{2}$ such that $0<c_{1}\leq n_{k}/n\leq c_{2}<1$ for all $n$ and $k=1,\dots,K$, then for each $\beta'\in(0,1)$, as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta'\over\beta-1}}\right),\quad t\in\{0,1\}.\tag{4.7}
\]

Note that both estimators achieve the same convergence rate. The first term corresponds to the rate for an infeasible estimator for which the nuisance parameters $\varphi$ and $\mu_{t}$ are known to lie in neighborhoods of their (pseudo-)true values; if the pseudo-true values are completely known, this term reduces to the rate with $\varpi=0$ for the conventional Fréchet mean. The second term in the convergence rate is due to the estimation of the nuisance parameters. In general, neither term dominates. In a typical scenario we have $\beta=2$, $\varrho_{n}=n^{-1/2}$, and $r_{n}=n^{-\alpha_{1}}$ with some $\alpha_{1}>1/2$, so that the rate becomes $O_{p}(n^{-{1\over 2(1+\varpi)}}+n^{-\alpha_{1}\beta'})$ for any $\varpi,\beta'\in(0,1)$, which can be arbitrarily close to the parametric rate.

Remark 4.1 (Convergence rates of outcome regression and inverse probability weighting estimators).

In Appendix A, we provide convergence rates for the estimators $\widehat{\Theta}_{t}^{(\mathrm{OR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{IPW})}$, which rely on the correct specification of the outcome regression function and the propensity score, respectively. Under some regularity conditions one can show that for each $\beta'\in(0,1)$, as $n\to\infty$,
\begin{align*}
d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])&=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{1})}}+r_{n,1}^{{\beta'\over\beta-1}}\right),\\
d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])&=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{2})}}+(v_{n}+\varrho_{n,1})^{{\beta'\over\beta-1}}\right),
\end{align*}
where $\varpi_{1},\varpi_{2}\in(0,1)$ are constants arising from entropy conditions for the models of the outcome regression function and the propensity score, corresponding to $\varpi$ in (4.3). Here $r_{n,1}$, $v_{n}$, and $\varrho_{n,1}$ are null sequences corresponding to the convergence rates of $\widehat{\mu}_{t}(\cdot)$, $\widehat{\mu}$, and $\widehat{e}(\cdot)$, respectively, i.e.,
\[
\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),m_{t}(x))=O_{p}(r_{n,1}),\quad d(\widehat{\mu},\mu)=O_{p}(v_{n}),\quad \sup_{x\in\mathcal{X}}|\widehat{e}(x)-p(x)|=O_{p}(\varrho_{n,1}).
\]

4.2. Uncertainty quantification

To obtain uncertainty quantification for the GATE estimators $\widehat{\Theta}_{t}^{(\mathrm{M})}$, $\mathrm{M}\in\{\mathrm{DR},\mathrm{CF}\}$, we adopt the (adaptive) HulC of Kuchibhotla et al. (2024) to construct confidence regions for the contrast $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$, implemented as follows.

Let $\{S_{b}\}_{b=1}^{B}$ be a (random) partition of $\{1,\dots,n\}$ into $B$ subsets, and let $\{d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\}_{b=1}^{B}$ be estimators of $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ computed from the subsamples $\{Y_{i},T_{i},X_{i}:i\in S_{b}\}_{b=1}^{B}$. Define the maximum median bias of the estimators $d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})$ for $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ as
\[
\Delta:=\max_{1\leq b\leq B}\left\{0,\ {1\over 2}-\min\left\{\operatorname{P}\big(\widetilde{U}_{b}\geq 0\big),\operatorname{P}\big(\widetilde{U}_{b}\leq 0\big)\right\}\right\},
\]
where $\widetilde{U}_{b}=d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})-d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$.

Adopting the HulC algorithm, we can construct a confidence interval with coverage probability $1-\alpha$ for $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ as follows:

  • (Step 1) Find the smallest integer $B=B_{\alpha,\Delta}\geq 1$ such that
\[
\operatorname{P}(B;\Delta):=\left({1\over 2}-\Delta\right)^{B}+\left({1\over 2}+\Delta\right)^{B}\leq\alpha.
\]
  • (Step 2) Generate a random variable $U$ from the uniform distribution on $[0,1]$ and set
\[
B^{*}:=\begin{cases}B_{\alpha,\Delta}&\text{if }U\leq\tau_{\alpha,\Delta},\\ B_{\alpha,\Delta}-1&\text{if }U>\tau_{\alpha,\Delta},\end{cases}\qquad\text{where}\qquad \tau_{\alpha,\Delta}:={\alpha-\operatorname{P}(B_{\alpha,\Delta};\Delta)\over\operatorname{P}(B_{\alpha,\Delta}-1;\Delta)-\operatorname{P}(B_{\alpha,\Delta};\Delta)}.
\]
  • (Step 3) Randomly split $\{Y_{i},T_{i},X_{i}\}_{i=1}^{n}$ into $B^{*}$ disjoint sets $\{Y_{i},T_{i},X_{i}:i\in S_{b}\}_{b=1}^{B^{*}}$ and compute $\{d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\}_{b=1}^{B^{*}}$.
  • (Step 4) Compute the confidence interval
\[
\widehat{C}_{\alpha,\Delta}:=\left[\min_{1\leq b\leq B^{*}}d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})}),\ \max_{1\leq b\leq B^{*}}d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\right].
\]

From Theorem 1 in Kuchibhotla et al. (2024), one obtains a finite sample guarantee for the coverage of this confidence interval, i.e.,
\[
\operatorname{P}\big(d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})\in\widehat{C}_{\alpha,\Delta}\big)\geq 1-\alpha.
\]
For the implementation of $\widehat{C}_{\alpha,\Delta}$, one can estimate $\Delta$ by a subsampling method (see Section 4 of Kuchibhotla et al. (2024)).
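A minimal R sketch (ours) of Steps 1–4 for a generic scalar estimator est applied to subsamples of a data frame, with $\Delta$ supplied by the user (e.g., set to 0 or to a subsampling estimate):

hulc_ci <- function(est, data, alpha = 0.05, Delta = 0) {
  P <- function(B) (1/2 - Delta)^B + (1/2 + Delta)^B
  B1 <- 1
  while (P(B1) > alpha) B1 <- B1 + 1                       # Step 1
  tau <- (alpha - P(B1)) / (P(B1 - 1) - P(B1))             # Step 2
  B_star <- if (runif(1) <= tau) B1 else B1 - 1
  fold <- sample(rep(1:B_star, length.out = nrow(data)))   # Step 3
  ests <- sapply(1:B_star,
                 function(b) est(data[fold == b, , drop = FALSE]))
  range(ests)                                              # Step 4
}

set.seed(1)
dat <- data.frame(y = rnorm(300, mean = 1))
hulc_ci(function(d) mean(d$y), dat)   # 95% HulC interval for E[y]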

5. Simulation studies

5.1. Implementation details and simulation scenarios

The algorithm for the proposed approach is outlined in Algorithm 1. The global Fréchet regression in the second step is implemented using the R package frechet (Chen et al., 2023). The minimization problems one needs to solve require specific considerations for the various geodesic spaces. For the spaces in Examples 2.1, 2.2, and 2.4, the minimization can be reduced to convex quadratic optimization (Stellato et al., 2020). For Example 2.3, the necessary optimization can be performed using a trust region algorithm (Geyer, 2020). The third step involves the calculation of Fréchet means, which reduces to the entry-wise mean of matrices for Examples 2.1 and 2.2 and to the mean of quantile functions for Example 2.4, due to the convexity of these spaces. The Fréchet mean for Example 2.3 can be computed using the R package manifold (Dai, 2022).

Input: data $\{(Y_{i},T_{i},X_{i})\}_{i=1}^{n}$.
Output: doubly robust estimator of the geodesic average treatment effect (GATE), $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$.
1. $e(x;\widehat{\varphi})\longleftarrow$ the estimated propensity score, using logistic regression with data $\{T_{i},X_{i}\}_{i=1}^{n}$;
2. $\widehat{\mu}_{t}(x)\longleftarrow$ the estimated outcome regression function, using global Fréchet regression:
\[
\widehat{\mu}_{t}(x)=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over N_{t}}\sum_{i\in I_{t}}\{1+(X_{i}-\bar{X})'\widehat{\Sigma}^{-1}(x-\bar{X})\}\,d^{2}(\nu,Y_{i}),\quad t\in\{0,1\},
\]
where $I_{t}=\{1\leq i\leq n:T_{i}=t\}$, $N_{t}$ is the cardinality of $I_{t}$, $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$, and $\widehat{\Sigma}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})'$;
3. $\widehat{\Theta}_{t}^{(\mathrm{DR})}\longleftarrow$ the doubly robust estimate of $\operatorname{E}_{\oplus}[Y(t)]$:
\[
\widehat{\Theta}_{t}^{(\mathrm{DR})}=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\widehat{\mu}_{t}(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\widehat{\varphi})}+{(1-t)(1-T_{i})\over 1-e(X_{i};\widehat{\varphi})}\right)\right),
\]
where $e(x;\widehat{\varphi})$ and $\widehat{\mu}_{t}(x)$ are the estimated propensity score and outcome regression function;
4. $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}\longleftarrow$ the doubly robust estimator of the GATE.

Algorithm 1: Geodesic Causal Inference
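As a concrete instance, the following minimal R sketch (all function names are ours) implements Algorithm 1 for the covariance matrices of Example 2.2 with the Frobenius metric, where geodesics are line segments, $\gamma_{A,B}(\kappa)=A+\kappa(B-A)$, and weighted Fréchet means reduce to weighted Euclidean means followed by projection onto the positive semi-definite cone; the projection step and the assumption that the Fréchet weights have a positive sum are simplifications on our part:

psd_project <- function(S) {          # nearest PSD matrix in Frobenius norm
  e <- eigen((S + t(S)) / 2, symmetric = TRUE)
  e$vectors %*% diag(pmax(e$values, 0)) %*% t(e$vectors)
}

frechet_reg <- function(Ylist, X, x, Xbar, Sxx) {   # Step 2, display (3.6)
  w <- 1 + (X - Xbar) * (x - Xbar) / Sxx            # global Frechet weights
  psd_project(Reduce(`+`, Map(`*`, w, Ylist)) / sum(w))
}

gate_dr_cov <- function(Ylist, T, X) {
  e_hat <- fitted(glm(T ~ X, family = binomial))    # Step 1
  Xbar <- mean(X); Sxx <- mean((X - Xbar)^2)
  lapply(0:1, function(t) {                         # Theta_t^(DR), t = 0, 1
    It <- which(T == t)
    ends <- lapply(seq_along(Ylist), function(i) {
      mu_i  <- frechet_reg(Ylist[It], X[It], X[i], Xbar, Sxx)
      kappa <- if (t == 1) T[i] / e_hat[i] else (1 - T[i]) / (1 - e_hat[i])
      mu_i + kappa * (Ylist[[i]] - mu_i)    # gamma_{mu_t(X_i), Y_i}(kappa)
    })
    psd_project(Reduce(`+`, ends) / length(ends))   # Step 3: Frechet mean
  })    # the GATE estimate is the segment between the two returned matrices
}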

To assess the performance of the proposed doubly robust estimators, we report here the results of simulations for various settings, specifically for the space of covariance matrices equipped with the Frobenius metric and the space of three-dimensional compositional data $\mathcal{S}_{+}^{2}$ equipped with the geodesic metric on the sphere $\mathcal{S}^{2}$; see Examples 2.2 and 2.3.

We consider sample sizes n=100,300,1000, with Q=500 Monte Carlo runs. For the qth Monte Carlo run, with \gamma_{\widehat{\Theta}_{0}^{q},\widehat{\Theta}_{1}^{q}} denoting the GATE estimator, the average quality of the estimation over the Q=500 Monte Carlo runs is assessed by the average squared error (ASE)

\mathrm{ASE}=\frac{1}{Q}\sum_{q=1}^{Q}\{d^{2}(\widehat{\Theta}_{0}^{q},\operatorname{E}_{\oplus}[Y(0)])+d^{2}(\widehat{\Theta}_{1}^{q},\operatorname{E}_{\oplus}[Y(1)])\},

where d is the Frobenius metric d_{F} for covariance matrices and the geodesic metric d_{g} for compositional data.

In all simulations, the confounder X follows a uniform distribution on [-1,1]. The treatment T has a conditional Bernoulli distribution depending on X with \operatorname{P}(T=1|X)=\mathrm{expit}(0.75X), where \mathrm{expit}(\cdot)=\exp(\cdot)/(1+\exp(\cdot)). We consider two specifications for the outcome regression: a global Fréchet regression model with predictor X in m_{t}(X) (correct specification) and one with the predictor X^{2} (incorrect specification), and two specifications for the propensity score model: a logistic regression model with predictor X (correct specification) and a model with predictor X^{2} (incorrect specification). To demonstrate the double robustness of the estimators \widehat{\Theta}_{t}^{(\mathrm{DR})} and \widehat{\Theta}_{t}^{(\mathrm{CF})}, we compare them with outcome regression (OR) and inverse probability weighting (IPW) estimators.

5.2. Covariance matrices

Random covariance matrices Y are generated as follows. First, the strictly lower-triangular entries of Y are independently generated as Y_{jk}=T+X+2+\epsilon for 1\leq k<j\leq m, where \epsilon is independently drawn from a uniform distribution on [-0.1,0.1]. By symmetry, the upper-triangular entries of Y are then set to Y_{kj}=Y_{jk} for 1\leq k<j\leq m. To ensure positive semi-definiteness, the diagonal entries Y_{jj} are set equal to the sum of all off-diagonal entries in the same row, \sum_{k\neq j}Y_{jk}. This construction yields a diagonally dominant matrix, which guarantees positive semi-definiteness.
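A minimal data-generating sketch for this design, also encoding the confounder and treatment mechanism described above (the seed is arbitrary), might read as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_covariance_data(n, m=10):
    X = rng.uniform(-1, 1, n)                          # confounder
    T = rng.binomial(1, 1 / (1 + np.exp(-0.75 * X)))   # P(T=1|X) = expit(0.75X)
    Y = np.empty((n, m, m))
    for i in range(n):
        E = rng.uniform(-0.1, 0.1, (m, m))
        A = np.tril(T[i] + X[i] + 2 + E, k=-1)         # strictly lower triangle
        A = A + A.T                                    # symmetric, zero diagonal
        np.fill_diagonal(A, A.sum(axis=1))             # diagonal dominance -> PSD
        Y[i] = A
    return X.reshape(-1, 1), T, Y

X, T, Y = simulate_covariance_data(300)
```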

We aim to estimate the GATE, whose true value is \gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}, where (\operatorname{E}_{\oplus}[Y(0)])_{jk}=2, (\operatorname{E}_{\oplus}[Y(0)])_{jj}=18, (\operatorname{E}_{\oplus}[Y(1)])_{jk}=3, and (\operatorname{E}_{\oplus}[Y(1)])_{jj}=27 for 1\leq j\neq k\leq m. Table 1 presents the simulation results for 10\times 10 matrices, i.e., for m=10. The ASE of the doubly robust estimator decreases as the sample size increases when either the OR or the IPW model is correctly specified, confirming the double robustness property. In contrast, neither the OR nor the IPW estimator demonstrates double robustness; their ASEs can be large even with a sample size of 1000 when the corresponding model is misspecified.

Table 1. Simulation results for covariance matrices. Average squared errors and standard deviations (in parentheses) for GATE using four different estimation procedures. The model specifications are in the first two columns. DR: doubly robust; CF: cross-fitting; OR: outcome regression; IPW: inverse probability weighting.

OR model  | IPW model | Sample size | DR             | CF            | OR            | IPW
Correct   | Correct   | 100         | 5.837 (8.694)  | 5.839 (8.696) | 5.837 (8.690) | 6.289 (9.105)
Correct   | Correct   | 300         | 2.058 (2.860)  | 2.057 (2.859) | 2.058 (2.860) | 2.128 (2.946)
Correct   | Correct   | 1000        | 0.610 (0.808)  | 0.611 (0.809) | 0.610 (0.808) | 0.639 (0.838)
Correct   | Incorrect | 100         | 5.837 (8.690)  | 5.835 (8.687) | 5.837 (8.690) | 31.91 (23.45)
Correct   | Incorrect | 300         | 2.058 (2.860)  | 2.057 (2.860) | 2.058 (2.860) | 26.26 (13.37)
Correct   | Incorrect | 1000        | 0.610 (0.808)  | 0.611 (0.808) | 0.610 (0.808) | 24.02 (7.127)
Incorrect | Correct   | 100         | 9.071 (12.317) | 11.09 (16.19) | 37.40 (26.18) | 6.289 (9.105)
Incorrect | Correct   | 300         | 3.005 (3.848)  | 3.098 (4.061) | 30.82 (14.19) | 2.128 (2.946)
Incorrect | Correct   | 1000        | 0.986 (1.180)  | 0.993 (1.201) | 28.23 (7.680) | 0.639 (0.838)

5.3. Compositional data

Writing \phi=\pi(X+2)/8\in[\pi/8,3\pi/8], we model the true regression functions m_{0}(\cdot) and m_{1}(\cdot) as

m_{0}(X)=\left(\cos(\phi),\frac{1}{2}\sin(\phi),\frac{\sqrt{3}}{2}\sin(\phi)\right),\quad m_{1}(X)=\left(\cos(\phi),\frac{\sqrt{3}}{2}\sin(\phi),\frac{1}{2}\sin(\phi)\right),

which are illustrated in Figure 3 as red and blue lines, respectively. The random response Y on \mathcal{S}_{+}^{2} for T=0 is then generated by adding a small perturbation to the true regression function. To this end, we first construct an orthonormal basis (e_{1},e_{2}) for the tangent space at m_{0}(X), where

e_{1}=\left(\sin(\phi),-\frac{1}{2}\cos(\phi),-\frac{\sqrt{3}}{2}\cos(\phi)\right),\quad e_{2}=\left(0,\frac{\sqrt{3}}{2},-\frac{1}{2}\right).

Consider random tangent vectors U=Z_{1}e_{1}+Z_{2}e_{2}, where Z_{1},Z_{2} are two independent uniformly distributed random variables on [-0.1,0.1]. The random response Y is obtained as the exponential map at m_{0}(X) applied to the tangent vector U,

Y=\mathrm{Exp}_{m_{0}(X)}(U)=\cos(\|U\|)m_{0}(X)+\sin(\|U\|)\frac{U}{\|U\|}.

A similar generation procedure is used for T=1, where the orthonormal basis (e_{1},e_{2}) for the tangent space at m_{1}(X) is

e_{1}=\left(\sin(\phi),-\frac{\sqrt{3}}{2}\cos(\phi),-\frac{1}{2}\cos(\phi)\right),\quad e_{2}=\left(0,\frac{1}{2},-\frac{\sqrt{3}}{2}\right).

Figure 3 illustrates randomly generated responses using the above generation procedure for a sample size n=200, which are seen to be distributed around the true regression functions.
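A sketch of this generation mechanism for a single response, using the bases above (seed arbitrary), could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sphere_response(x, t):
    phi = np.pi * (x + 2) / 8
    s3 = np.sqrt(3) / 2
    if t == 0:
        m  = np.array([np.cos(phi), 0.5 * np.sin(phi), s3 * np.sin(phi)])
        e1 = np.array([np.sin(phi), -0.5 * np.cos(phi), -s3 * np.cos(phi)])
        e2 = np.array([0.0, s3, -0.5])
    else:
        m  = np.array([np.cos(phi), s3 * np.sin(phi), 0.5 * np.sin(phi)])
        e1 = np.array([np.sin(phi), -s3 * np.cos(phi), -0.5 * np.cos(phi)])
        e2 = np.array([0.0, 0.5, -s3])
    z1, z2 = rng.uniform(-0.1, 0.1, 2)
    U = z1 * e1 + z2 * e2                      # random tangent vector at m
    r = np.linalg.norm(U)
    if r == 0.0:                               # degenerate perturbation
        return m
    return np.cos(r) * m + np.sin(r) * U / r   # exponential map Exp_m(U)
```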

We are interested in estimating the GATE, with true value \gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}, where

\operatorname{E}_{\oplus}[Y(0)]=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{4},\frac{\sqrt{6}}{4}\right),\quad\operatorname{E}_{\oplus}[Y(1)]=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{6}}{4},\frac{\sqrt{2}}{4}\right).

Table 2 summarizes the simulation results. The ASE of the doubly robust estimator decreases as the sample size increases when either the OR or the IPW model is correct, again confirming the double robustness property. In comparison, neither the OR nor the IPW estimator exhibits the double robustness property: when the corresponding model is misspecified, their ASEs do not converge as the sample size increases.

Figure 3. Simulation example for compositional data represented on the sphere with the geodesic metric for a sample of size n=200. Red: T=0; blue: T=1. The true regression functions m_{0}(\cdot) and m_{1}(\cdot) are shown as red and blue lines.
Table 2. Simulation results for compositional data. Average squared errors and standard deviations (in parentheses) for GATE using four different estimation procedures. The model specifications are in the first two columns. DR: doubly robust; CF: cross-fitting; OR: outcome regression; IPW: inverse probability weighting.

OR model  | IPW model | Sample size | DR            | CF            | OR            | IPW
Correct   | Correct   | 100         | 0.044 (0.026) | 0.044 (0.026) | 0.044 (0.026) | 0.067 (0.030)
Correct   | Correct   | 300         | 0.026 (0.015) | 0.026 (0.015) | 0.026 (0.015) | 0.047 (0.016)
Correct   | Correct   | 1000        | 0.014 (0.008) | 0.014 (0.008) | 0.014 (0.008) | 0.037 (0.009)
Correct   | Incorrect | 100         | 0.044 (0.026) | 0.044 (0.026) | 0.044 (0.026) | 0.105 (0.039)
Correct   | Incorrect | 300         | 0.026 (0.015) | 0.026 (0.015) | 0.026 (0.015) | 0.093 (0.024)
Correct   | Incorrect | 1000        | 0.014 (0.008) | 0.014 (0.008) | 0.014 (0.008) | 0.094 (0.014)
Incorrect | Correct   | 100         | 0.061 (0.034) | 0.067 (0.037) | 0.105 (0.042) | 0.067 (0.030)
Incorrect | Correct   | 300         | 0.039 (0.018) | 0.040 (0.019) | 0.096 (0.026) | 0.047 (0.016)
Incorrect | Correct   | 1000        | 0.031 (0.011) | 0.031 (0.011) | 0.098 (0.015) | 0.037 (0.009)

6. Real world applications

6.1. U.S. electricity generation data

Compositional data are ubiquitous and do not lie in a vector space, as they correspond to vectors with non-negative elements that sum to 1; see Example 2.3. Various approaches have been developed to address the inherent nonlinearity of compositional data (Aitchison, 1986; Scealy and Welsh, 2014; Filzmoser et al., 2018), which are common in the analysis of geochemical and microbiome data.

Here we focus on U.S. electricity generation data, publicly available on the U.S. Energy Information Administration website (http://www.eia.gov/electricity). The data reflect net electricity generation from various sources for each state in the year 2020. In preprocessing, we excluded the “pumped storage” and “other” categories due to data errors and consolidated the remaining energy sources into three categories: Natural Gas (corresponding to the category of “natural gas”), Other Fossil (combining “coal,” “petroleum,” and “other gases”), and Renewables and Nuclear (combining “hydroelectric conventional,” “solar thermal and photovoltaic,” “geothermal,” “wind,” “wood and wood-derived fuels,” “other biomass,” and “nuclear”). This yielded a sample of n=50 compositional observations taking values in the 2-simplex \Delta^{2}, as illustrated with a ternary plot in Figure 4. Following the approach described in Example 2.3, we applied the component-wise square root transformation, resulting in compositional outcomes as elements of the sphere \mathcal{S}^{2}, equipped with the geodesic metric.
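The square root transformation is elementary: since the components of a composition are non-negative and sum to 1, their square roots form a unit vector, and the geodesic metric on the sphere is the arc length, the arccosine of the inner product. A minimal sketch with toy compositions:

```python
import numpy as np

def to_sphere(p):
    # composition: non-negative entries summing to 1 -> point on the unit sphere
    return np.sqrt(np.asarray(p, dtype=float))

def d_g(u, v):
    # geodesic (great-circle) distance on the unit sphere
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

u = to_sphere([0.5, 0.3, 0.2])   # toy state compositions, not real data
v = to_sphere([0.6, 0.2, 0.2])
print(d_g(u, v))
```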

In our analysis, the exposure (treatment) of interest is whether the state produced coal in 2020; 29 states produced coal in 2020 while 21 states did not. The outcomes are the compositional data discussed in Example 2.3, and we consider two possible confounders: gross domestic product (GDP) per capita (millions of chained 2012 dollars) and the proportion of electricity generated from coal and petroleum in 2010 for each state. We implemented the proposed approach to obtain the DR, CF, OR, and IPW estimators, with results shown in Figure 4. The DR, CF, and OR estimators yield similar results, suggesting that coal production leads to a smaller proportion of renewables & nuclear. In contrast, the IPW estimator yields slightly different results, possibly due to misspecification of the propensity score model, and therefore should not be relied upon here.

To quantify the size and uncertainty of the treatment effect, we computed the geodesic distance between the two mean potential compositions using the DR estimator, d_{g}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}), and obtained its finite-sample valid confidence region using the adaptive HulC algorithm. Specifically, d_{g}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}) is found to be 0.133, with the corresponding 95% finite-sample valid confidence region estimated as (0.112, 0.269). This suggests that the effect of coal production on the composition of electricity generation sources is significant at the 0.05 level.

Figure 4. Ternary plot illustrating the composition of electricity generation in 2020 across 50 U.S. states and mean potential outcomes with and without production of coal in 2020 using four different methods. States are shape-coded based on whether they produced coal or not in 2020. Mean potential outcomes are color-coded based on which method is used. The geodesic distance between mean potential outcomes for the DR, CF, OR, and IPW estimators is 0.133, 0.108, 0.131, and 0.198, respectively, where the GATE obtained from IPW differs substantially from the GATE obtained with the other estimators.

6.2. New York Yellow taxi system after COVID-19 outbreak

Yellow taxi trip records in New York City (NYC), containing details such as pick-up and drop-off dates/times, locations, trip distances, payment methods, and driver-reported passenger counts, can be accessed at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Additionally, NYC Coronavirus Disease 2019 (COVID-19) data are available at https://github.com/nychealth/coronavirus-data, providing citywide and borough-specific daily counts of probable and confirmed COVID-19 cases in NYC since February 29, 2020.

We focused on taxi trip records in Manhattan, which experiences the highest taxi traffic. Following preprocessing procedures outlined in Zhou and Müller, (2022), we grouped the 66 taxi zones (excluding islands) into 13 regions. We restricted our analysis to the period comprising 172 days from April 12, 2020 to September 30, 2020, during which taxi ridership per day in Manhattan steadily increased following a decline due to the COVID-19 outbreak. For each day, we constructed a daily undirected network with nodes corresponding to the 13 regions and edge weights representing the number of people traveling between connected regions. Self-loops in the networks were removed as the focus was on connections between different regions. Thus, we have observations consisting of a simple undirected weighted network for each of the 172 days, each associated with a 13\times 13 graph Laplacian.
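For concreteness, a sketch of the construction of one daily graph Laplacian from a matrix of trip counts might read as follows; how directed pick-up/drop-off counts are combined into undirected edge weights is an assumption here, not a detail given in the text.

```python
import numpy as np

def daily_laplacian(trips):
    """Graph Laplacian L = D - W of a daily taxi network.

    trips: 13x13 array of trip counts between the 13 regions."""
    W = np.asarray(trips, dtype=float)
    W = (W + W.T) / 2            # undirected: symmetrize the counts (assumption)
    np.fill_diagonal(W, 0.0)     # drop self-loops within a region
    return np.diag(W.sum(axis=1)) - W
```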

We aim to investigate the causal effect of COVID-19 new cases on daily taxi networks in Manhattan. Daily COVID-19 new cases were dichotomized, taking the value 0 if below 60 and 1 otherwise, resulting in 79 and 93 days classified into the low (0) and high (1) COVID-19 new case groups, respectively. The outcomes are graph Laplacians as discussed in Example 2.1. Confounders of interest include a weekend indicator and daily average temperature.

We obtained DR and CF estimators using the proposed approach, along with OR and IPW estimators for comparison. In Figure 5, we illustrate the entry-wise differences between the adjacency matrices corresponding to low and high COVID-19 new cases for the different estimators. The DR, CF, and OR estimators exhibit similar performance, indicating that high COVID-19 case counts led to less traffic. Regions with the largest differences include regions 105, 106, and 108, primarily residential areas that also contain popular locations such as Penn Station, Grand Central Terminal, and the Metropolitan Museum of Art. The impact of COVID-19 new cases on traffic networks is seen to be negative, especially for traffic in residential areas. Conversely, the IPW estimator appears ineffective in capturing this causal effect.

Similarly to Section 6.1, the Frobenius distance between the two mean potential networks using the DR estimator, d_{F}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}), is found to be 5216, with a corresponding 95% finite-sample valid confidence region of (2362, 10979). This suggests that the effect of COVID-19 new cases on daily taxi networks in Manhattan is significant at the 0.05 level.

Figure 5. Average treatment effects (differences between adjacency matrices) represented as heatmaps using four different methods: (a) cross-fitting; (b) doubly robust; (c) outcome regression; (d) inverse probability weighting.

6.3. Brain functional connectivity networks and Alzheimer’s disease

Resting-state functional Magnetic Resonance Imaging (fMRI) methodology allows for the study of brain activation and the identification of brain regions exhibiting similar activity during the resting state (Ferreira and Busatto, 2013; Sala-Llonch et al., 2015). In resting-state fMRI, a time series of Blood Oxygen Level Dependent (BOLD) signals is obtained for each region of interest (ROI). The temporal coherence between pairwise ROIs is typically measured by the temporal Pearson correlation coefficients (PCC) of the fMRI time series, forming an m\times m correlation matrix when considering m distinct ROIs. Alzheimer’s disease has been found to be associated with anomalies in the functional integration and segregation of ROIs (Zhang et al., 2010; Damoiseaux et al., 2012).

Our study utilized data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The dataset in our analysis consists of 372 cognitively normal (CN) and 145 Alzheimer’s disease (AD) subjects. We used the automated anatomical labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002) to parcellate the whole brain into 90 ROIs, with 45 ROIs in each hemisphere, so that m=90. The BOLD signals were preprocessed with standard procedures, including slice-timing correction and head motion correction. A PCC matrix was calculated for all time series pairs for each subject. These matrices were then converted into simple, undirected, weighted networks by setting diagonal entries to 0 and thresholding the absolute values of the remaining correlations. Density-based thresholding was employed, where the threshold varied from subject to subject to achieve a desired, fixed connection density. Specifically, we retained the 15% strongest connections.
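Density-based thresholding itself is straightforward; a minimal sketch keeping the strongest 15% of absolute correlations as weighted edges is given below (the function name is hypothetical, and retaining weights rather than binarizing follows the description above).

```python
import numpy as np

def threshold_by_density(pcc, density=0.15):
    """Keep the strongest |correlation| edges at a fixed connection density."""
    A = np.abs(np.asarray(pcc, dtype=float))
    np.fill_diagonal(A, 0.0)                        # simple network: no self-loops
    upper = A[np.triu_indices_from(A, k=1)]
    k = max(1, int(np.ceil(density * upper.size)))  # number of edges retained
    cut = np.sort(upper)[-k]                        # subject-specific threshold
    return np.where(A >= cut, A, 0.0)               # weighted, undirected network
```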

In our analysis, Alzheimer’s disease, coded as a binary variable (0 for cognitively normal, 1 for diagnosed with Alzheimer’s disease), serves as the exposure of interest. The outcomes are brain functional connectivity networks, and the two confounders are the age and gender of each subject. To quantify the impact of Alzheimer’s disease on brain functional connectivity, we computed the entry-wise differences between the adjacency matrices of the two mean potential networks for the DR estimator, as depicted in Figure 6. Our observations reveal that Alzheimer’s disease has notable effects on various regions of the human brain. Specifically, the central region, the orbital surface of the frontal lobe, the temporal lobe, the occipital lobe, and subcortical areas exhibit more pronounced influences, displaying clustered patterns consistent with findings in previous literature (Planche et al., 2022). Notably, damage to the frontal lobe is associated with issues in judgment, intelligence, and behavior, while damage to the temporal lobe can impact memory.

Figure 6. Average treatment effect (difference between adjacency matrices) represented as a heatmap using the doubly robust procedure DR. The five regions circled by squares, from top left to bottom right, correspond to the central region, orbital surface in the frontal lobe, temporal lobe, occipital lobe, and subcortical areas, respectively.

To assess structural changes in brain functional connectivity networks following Alzheimer’s disease, Table 3 provides a summary of commonly adopted network measures that quantify functional integration and segregation, with definitions provided in Rubinov and Sporns, (2010). Our observations indicate a decrease in all functional integration and segregation measures of brain functional connectivity networks, with local efficiency experiencing the most substantial reduction. Functional segregation, denoting the brain’s ability for specialized processing within densely interconnected groups of regions, and functional integration, representing the capacity to rapidly combine specialized information from distributed brain regions, are both compromised. This finding implies that Alzheimer’s disease leads to lower global and local efficiency in human brains, disrupting crucial functions such as judgment and memory (Ahmadi et al., 2023).

Table 3. Network measures of mean potential networks for the doubly robust estimator.

Category               | Measure                    | Cognitively normal | Alzheimer’s disease
Functional integration | Global efficiency          | 3.117              | 2.914
Functional integration | Characteristic path length | 10.319             | 9.593
Functional segregation | Local efficiency           | 3.599              | 3.035
Functional segregation | Clustering coefficient     | 0.154              | 0.142

7. Discussion

We mention here a few additional issues. Although our assumptions are satisfied for most metric spaces of interest and guarantee relatively fast convergence rates of the proposed doubly robust estimators, one may consider alternative assumptions. For example, one can replace Assumption 4.4 (ii) with a condition analogous to Assumption (U2) in Petersen and Müller, (2019), namely

\inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})<\eta}\left(Q_{t}(\nu;\mu_{t},\varphi_{*})-Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu_{t},\varphi_{*})-Cd(\nu,\Theta_{t}^{(\mathrm{DR})})^{\beta}\right)\geq 0.

However, under this relaxed assumption, the convergence rates of the estimators \widehat{\Theta}_{t}^{(\mathrm{DR})} and \widehat{\Theta}_{t}^{(\mathrm{CF})} will be slower than those reported in Theorem 4.2. Specifically, the term (\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}} would need to be replaced with (\varrho_{n}+r_{n})^{{1\over\beta}}.

Even when the objects that represent outcomes of interest are given, for example covariance matrices, there remains the issue of choosing a metric and the ensuing geodesics, where interpretability of the resulting GATE will be a consideration. For a brief discussion of this and related issues, see Dubey et al., (2024).

There are numerous scenarios where metric space-valued outcomes are of interest in causal inference, as we have demonstrated in our empirical examples. Our tools to construct estimators for treatment effects are inspired by their linear counterparts but are substantially different, and notions of Euclidean treatment effects need to be revisited to arrive at canonical generalizations for metric spaces. These investigations also shed new light on and lead to a reinterpretation of the conventional Euclidean case. Future exploration of technical aspects and real world applications for object data will enhance the applicability of causal inference methods across a range of complex data.

Appendix A Asymptotic properties of OR and IPW estimators

A.1. Outcome regression

In this section, we provide asymptotic properties of outcome regression estimators.

Assumption A.1.

Let \widehat{\mu}_{t}(\cdot), t\in\{0,1\}, be estimators for the outcome regression functions m_{t}(\cdot), t\in\{0,1\}. Assume that \sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),m_{t}(x))=O_{p}(r_{n,1}), t\in\{0,1\}, with r_{n,1}\to 0 as n\to\infty.

Assumption A.2.

Define \Theta_{t}^{(\mathrm{OR})}=\mathop{\rm arg~min}\limits_{\nu\in\mathcal{M}}Q^{(\mathrm{OR})}(\nu;m_{t}), where Q^{(\mathrm{OR})}(\nu;\mu)=\operatorname{E}[d^{2}(\nu,\mu(X))]. The objects \Theta_{t}^{(\mathrm{OR})} and \widehat{\Theta}_{t}^{(\mathrm{OR})} exist and are unique, and for any \varepsilon>0, \inf_{d(\nu,\Theta_{t}^{(\mathrm{OR})})>\varepsilon}Q^{(\mathrm{OR})}(\nu;m_{t})>Q^{(\mathrm{OR})}(\Theta_{t}^{(\mathrm{OR})};m_{t}).

Assumption A.3.

For t\in\{0,1\},

  • (i) As \delta\to 0,

J_{t}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{OR})}),d)}\,d\varepsilon=O(1), (A.1)
J_{m_{t}}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta^{\prime}_{1}}(m_{t}),d_{\infty})}\,d\varepsilon=O(\delta^{-\varpi_{1}}) (A.2)

for some \delta^{\prime}_{1}>0 and \varpi_{1}\in(0,1).

  • (ii) There exist positive constants \eta, \eta_{1}, C, C_{1}, and \beta>1 such that

\inf_{d_{\infty}(\mu,\mu_{t})\leq\eta_{1}}\inf_{d(\nu,\Theta_{t}^{(\mathrm{OR})})\leq\eta}\left\{Q^{(\mathrm{OR})}(\nu;\mu)-Q^{(\mathrm{OR})}(\Theta_{t}^{(\mathrm{OR})};\mu)-Cd(\nu,\Theta_{t}^{(\mathrm{OR})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}d(\nu,\Theta_{t}^{(\mathrm{OR})})^{{\beta\over 2}}\right\}\geq 0.

One can see that Examples 2.1–2.4 satisfy Assumption A.3 (ii) with \beta=2.

The following results provide the consistency and convergence rates of the outcome regression estimators. The proofs are similar to those of Theorems 4.1 and 4.2 and thus omitted.

Theorem A.1.

Suppose that Assumptions 3.1(i), (iii), 3.2, A.1, and A.2 hold. Then, as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\ t\in\{0,1\}. (A.3)

Theorem A.2.

Suppose that Assumptions 3.1(i), (iii), 3.2, A.1, A.2, and A.3 hold. Then for each \beta^{\prime}\in(0,1), as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{1})}}+r_{n,1}^{{\beta^{\prime}\over\beta-1}}\right),\ t\in\{0,1\}. (A.4)

A.2. Inverse probability weighting

In this section, we provide asymptotic properties of inverse probability weighting estimators.

Assumption A.4.

Let \widehat{\mu} and \widehat{e}(\cdot) be estimators for \mu\in\mathcal{M} and the propensity score p(\cdot), respectively. Assume that d(\widehat{\mu},\mu)=O_{p}(v_{n}) and \sup_{x\in\mathcal{X}}|\widehat{e}(x)-p(x)|=O_{p}(\varrho_{n,1}), with v_{n},\varrho_{n,1}\to 0 as n\to\infty.

Assumption A.5.

Define \Theta_{t}^{(\mathrm{IPW})}=\mathop{\rm arg~min}\limits_{\nu\in\mathcal{M}}Q_{t}^{(\mathrm{IPW})}(\nu;\mu,p), where

Q_{t}^{(\mathrm{IPW})}(\nu;\alpha,e)=\operatorname{E}\left[d^{2}\left(\nu,\gamma_{\alpha,Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right)\right],\ t\in\{0,1\}.

Assume that \Theta_{t}^{(\mathrm{IPW})}=\operatorname{E}_{\oplus}[Y(t)], that the objects \Theta_{t}^{(\mathrm{IPW})} and \widehat{\Theta}_{t}^{(\mathrm{IPW})} exist and are unique, and that for any \varepsilon>0, \inf_{d(\nu,\Theta_{t}^{(\mathrm{IPW})})>\varepsilon}Q_{t}^{(\mathrm{IPW})}(\nu;\mu,p)>Q_{t}^{(\mathrm{IPW})}(\Theta_{t}^{(\mathrm{IPW})};\mu,p).

Assumption A.6.

For t\in\{0,1\},

  • (i) As \delta\to 0,

J_{t}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{IPW})}),d)}\,d\varepsilon=O(1), (A.5)
J_{p}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta^{\prime}_{1}}(p),\|\cdot\|_{\infty})}\,d\varepsilon=O(\delta^{-\varpi_{2}}) (A.6)

for some \delta^{\prime}_{1}>0 and \varpi_{2}\in(0,1), where for p_{1},p_{2}:\mathcal{X}\to\mathbb{R}, \|p_{1}-p_{2}\|_{\infty}:=\sup_{x\in\mathcal{X}}|p_{1}(x)-p_{2}(x)|.

  • (ii) There exist positive constants \eta, \eta_{1}, C, C_{1}, and \beta>1 such that

\inf_{d(\alpha,\mu)\leq\eta_{1},\ \|e-p\|_{\infty}\leq\eta_{1}}\inf_{d(\nu,\Theta_{t}^{(\mathrm{IPW})})\leq\eta}\left\{Q_{t}^{(\mathrm{IPW})}(\nu;\alpha,e)-Q_{t}^{(\mathrm{IPW})}(\Theta_{t}^{(\mathrm{IPW})};\alpha,e)-Cd(\nu,\Theta_{t}^{(\mathrm{IPW})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}d(\nu,\Theta_{t}^{(\mathrm{IPW})})^{{\beta\over 2}}\right\}\geq 0.

One can see that Examples 2.1–2.4 satisfy Assumption A.6 (ii) with \beta=2.

The following results provide the consistency and convergence rates of the inverse probability weighting estimators. The proofs are again similar to those of Theorems 4.1 and 4.2.

Theorem A.3.

Suppose that Assumptions 3.1, 4.2, A.4, and A.5 hold. Then, as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\ t\in\{0,1\}. (A.7)

Theorem A.4.

Suppose that Assumptions 3.1, 4.2, A.4, A.5, and A.6 hold. Then for each \beta^{\prime}\in(0,1), as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{2})}}+(v_{n}+\varrho_{n,1})^{{\beta^{\prime}\over\beta-1}}\right),\ t\in\{0,1\}. (A.8)

Appendix B Verification of Model Assumptions

In this section, we verify our assumptions for Examples 2.1–2.4.

Proposition B.1.

The space of graph Laplacians defined in Example 2.1 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

Proof.

Note that the geodesic in the space of graph Laplacians (equipped with the Frobenius metric) is simply the line segment connecting the two endpoints. It is easy to verify that Assumption 4.2 holds with C_{0}=1/\eta_{0}-1. Following the argument in Theorem 2 of Zhou and Müller, (2022), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii). Assumption 3.1(i) holds as the space of graph Laplacians is a bounded and closed subset of \mathbb{R}^{m^{2}} (cf. Proposition 1 of Zhou and Müller, (2022)).

To prove that Assumptions 3.2 (ii) and (iii) hold, observe that we can define the perturbation of m_{t}(X) by adding errors component-wise to m_{t}(X), \mathcal{P}_{t}(m_{t}(X))=m_{t}(X)+\varepsilon, where \varepsilon is an m\times m random graph Laplacian with \operatorname{E}[\varepsilon_{jk}|X]=0 and \operatorname{E}[\varepsilon_{jk}^{2}|X]<\infty for 1\leq j<k\leq m. The definition of the perturbation map implies Assumption 3.2 (ii). Let \mathrm{tr}(A) denote the trace of an m\times m matrix A. For any graph Laplacian \nu, we have \operatorname{E}[d_{F}^{2}(\nu,\mathcal{P}_{t}(m_{t}(X)))]=\operatorname{E}[d_{F}^{2}(\nu,m_{t}(X))]+\operatorname{E}[\mathrm{tr}(\varepsilon^{\prime}\varepsilon)], which implies Assumption 3.2 (iii). ∎

Proposition B.2.

The space of covariance or correlation matrices defined in Example 2.2 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

The proof of Proposition B.2 is similar to that of Proposition B.1 and is omitted.

Proposition B.3.

Consider the space of compositional data (\mathcal{S}^{d-1}_{+},d_{g}) defined in Example 2.3. Let T_{\mu}\mathcal{M} be the tangent space at \mu\in\mathcal{M} and let \mathrm{Exp}_{\mu} and \mathrm{Log}_{\mu} denote the exponential and logarithmic maps at \mu, respectively. Define \widetilde{\nu}=\mathrm{Log}_{\Theta_{t}^{(\mathrm{DR})}}(\nu) and \widetilde{\gamma}=\mathrm{Log}_{\gamma_{t}}(\bar{\gamma}_{t}), where

\gamma_{t}=\gamma_{\mu_{t}(X),Y(t)}\left({tT\over e(X;\varphi_{*})}+{(1-t)(1-T)\over 1-e(X;\varphi_{*})}\right),\quad\bar{\gamma}_{t}=\gamma_{\mu(X),Y(t)}\left({tT\over e(X;\varphi)}+{(1-t)(1-T)\over 1-e(X;\varphi)}\right).

Additionally, for v\in T_{\Theta_{t}^{(\mathrm{DR})}}\mathcal{M} and w\in T_{\gamma_{t}}\mathcal{M}, define g(v,w)=d_{g}^{2}(\mathrm{Exp}_{\Theta_{t}^{(\mathrm{DR})}}(v),\mathrm{Exp}_{\gamma_{t}}(w)). If there exist positive constants \bar{\eta}_{1} and \bar{\eta} such that

\inf_{\|v^{*}\|\leq\bar{\eta}_{1},\ \|\varphi-\varphi_{*}\|\leq\bar{\eta},\ \sup_{x\in\mathcal{X}}d_{g}(\mu(x),\mu_{t}(x))\leq\bar{\eta}}\lambda_{\mathrm{min}}\left(\left.{\partial^{2}\over\partial v\partial v^{\prime}}\operatorname{E}[g(v,\widetilde{\gamma})]\right|_{v=v^{*}}\right)>0, (B.1)

then the space (\mathcal{S}^{d-1}_{+},d_{g}) satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4, where \lambda_{\mathrm{min}}(A) denotes the smallest eigenvalue of a square matrix A.

Proof.

Assumption 4.2 holds trivially for \alpha_{1}=\alpha_{2}, so we verify Assumption 4.2 for \alpha_{1}\neq\alpha_{2}. Note that for any q_{1},q_{2}\in\mathcal{S}_{+}^{d-1}, we have \|q_{1}-q_{2}\|\leq d_{g}(q_{1},q_{2})\leq{\pi\over 2\sqrt{2}}\|q_{1}-q_{2}\|, which implies the equivalence of the geodesic metric and the ambient Euclidean metric. One can find a positive constant C^{\prime} depending only on \eta_{0} such that \sup_{\beta\in\mathcal{S}^{d-1}_{+},\,\kappa\in[1/(1-\eta_{0}),1/\eta_{0}]}\|\gamma_{\alpha_{1},\beta}(\kappa)-\gamma_{\alpha_{2},\beta}(\kappa)\|\leq C^{\prime}\|\alpha_{1}-\alpha_{2}\|, and this yields (4.1) with C_{0}=C^{\prime}\pi/(2\sqrt{2}).

Following the argument in Proposition 3 of Petersen and Müller, (2019), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii) under (B.1). Assumption 3.1 (i) holds as \mathcal{S}^{d-1}_{+} is a bounded and closed subset of the unit sphere \mathcal{S}^{d-1}.

Define the perturbation of m_{t}(X) as a perturbation in the tangent space, \mathcal{P}_{t}(m_{t}(X))=\mathrm{Exp}_{m_{t}(X)}(U), where U=\sum_{j=1}^{d-1}Z_{j}e_{j}\in\mathbb{R}^{d} is a random tangent vector, (e_{1},\dots,e_{d-1}) denotes an orthonormal basis for the tangent space at m_{t}(X), and Z_{1},\dots,Z_{d-1} are real-valued independent random variables with mean zero. Since \operatorname{E}[U]=0, it follows that \operatorname{E}_{\oplus}[\mathrm{Exp}_{m_{t}(X)}(U)|X]=m_{t}(X) and thus Assumption 3.2 (ii) holds. One can similarly show that Assumption 3.2 (iii) is satisfied. ∎

Proposition B.4.

The Wasserstein space defined in Example 2.4 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

Proof.

For any probability measure \mu\in\mathcal{W}, the map Q:\mu\mapsto F_{\mu}^{-1} is an isometry from \mathcal{W} to the subset of the Hilbert space L^{2}(0,1) formed by equivalence classes of left-continuous nondecreasing functions on (0,1). The Wasserstein space can thus be viewed as a convex and closed subset of L^{2}(0,1) (Bigot et al., 2017). The corresponding Wasserstein metric d_{\mathcal{W}} coincides with the L^{2} metric d_{L^{2}} between quantile functions. It is straightforward to verify that Assumption 4.2 is satisfied for the Hilbert space (L^{2}(0,1),d_{L^{2}}) with C_{0}=1/\eta_{0}-1. Since the Wasserstein space is a closed subset of L^{2}(0,1), the geodesic extension necessitates adjustments to prevent boundary crossings, as discussed in Section 2. Such adjustments can only reduce the left-hand side of (4.1), thereby ensuring that Assumption 4.2 holds with C_{0}=1/\eta_{0}-1. Assumption 3.1(i) holds as the Wasserstein space is a bounded and closed subset of L^{2}(0,1). Following the argument in Proposition 1 of Petersen and Müller, (2019), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii).

For m_{t}(X)\in\mathcal{W}, we can define the perturbation of m_{t}(X) as a perturbation of the quantile function of m_{t}(X): F_{\mathcal{P}_{t}(m_{t}(X))}^{-1}(s)=F_{m_{t}(X)}^{-1}(s)+\varepsilon, where \varepsilon is a real-valued random variable with \operatorname{E}[\varepsilon|X]=0 and \operatorname{E}[\varepsilon^{2}|X]<\infty. The definition of the perturbation map implies Assumption 3.2 (ii). For any \nu\in\mathcal{W}, we have \operatorname{E}[d_{\mathcal{W}}^{2}(\nu,\mathcal{P}_{t}(m_{t}(X)))]=\operatorname{E}[d_{\mathcal{W}}^{2}(\nu,m_{t}(X))]+\operatorname{E}[\varepsilon^{2}], which implies Assumption 3.2 (iii). ∎

Appendix C Proofs for Section 3

C.1. Proof of Proposition 3.1

Step 1. It is sufficient to check (3.2) when

  • (i) the propensity score is correctly specified, that is, e(x)=p(x) (Step 2), or

  • (ii) the outcome regression functions are correctly specified, that is, \mu_{t}(x)=m_{t}(x), t\in\{0,1\} (Step 3).

Now we provide some auxiliary results. Observe that

\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}=\begin{cases}0\odot\gamma_{\mu_{1}(X),Y(0)}=\mathrm{id}_{\mu_{1}(X)}=0\odot\gamma_{\mu_{1}(X),Y(1)}&\text{if $T=0$},\\ \left(\frac{1}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y(1)}&\text{if $T=1$}.\end{cases}

This yields

\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}=\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y(1)}.

Then we have

\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{p(X)}{e(X)}\right)\odot\gamma_{\mu_{1}(X),m_{1}(X)}\right]. (C.1)

Likewise, we have

\operatorname{E}_{G,X}\left[\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{1-p(X)}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),m_{0}(X)}\right]. (C.2)

Further, for random objects A,B,R\in(\mathcal{M},d), we have

\operatorname{E}_{G,X}[\gamma_{A,B}]=\operatorname{E}_{G,X}[\gamma_{A,R}\oplus\gamma_{R,B}]=\operatorname{E}_{G,X}[\gamma_{A,R}]\oplus\operatorname{E}_{G,X}[\gamma_{R,B}]. (C.3)

Step 2. Now we assume e(x)=p(x). Note that (C.1) yields

\operatorname{E}_{G,X}\left[\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}[\gamma_{\mu_{1}(X),m_{1}(X)}].

Combining this and (C.3), we have

\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\right]\oplus\operatorname{E}_{G,X}\left[\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\gamma_{\mu_{1}(X),m_{1}(X)}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]
=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}. (C.4)

Likewise, (C.2) and (C.3) yield

\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-p(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}. (C.5)

Combining (C.4) and (C.5), we have

\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-p(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]
=\ominus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}\oplus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\mu}\oplus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}.

Step 3. Now we assume \mu_{t}(x)=m_{t}(x), t\in\{0,1\}. In this case, (C.1) yields

\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{p(X)}{e(X)}\right)\odot\gamma_{m_{1}(X),m_{1}(X)}\right]=\operatorname{E}_{G}\left[\mathrm{id}_{m_{1}(X)}\right].

Then an argument similar to the one leading to (C.4) yields

\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right\}\right]=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]\oplus\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\mathrm{id}_{m_{1}(X)}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]
=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}. (C.6)

Likewise, (C.2) yields

\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{m_{0}(X),Y}\right\}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}. (C.7)

Combining (C.6) and (C.7), we have

\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{m_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right\}\right]
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}. ∎

C.2. Proof of (3.1)

Assume that e(x)=p(x) or \mu_{t}(x)=m_{t}(x) for t\in\{0,1\}. Note that the geodesic in \mathbb{R} is the line segment connecting two endpoints and any geodesic \gamma_{\alpha,\beta}=\{\alpha+t(\beta-\alpha):t\in[0,1]\} is extendable, that is, for any \rho\in\mathbb{R}, \rho\odot\gamma_{\alpha,\beta}=\gamma_{\alpha,\alpha+\rho(\beta-\alpha)}.

Observe that

\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))\right]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))}\right]
=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\gamma_{\mu_{1}(X),\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))}\right]
=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]. (C.8)

If e(x)=p(x) or \mu_{1}(x)=m_{1}(x), we have

\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))\right]}=\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+{p(X)\over e(X)}(m_{1}(X)-\mu_{1}(X))\right]}
=\gamma_{\mu,\operatorname{E}[m_{1}(X)]}
=\gamma_{\mu,\operatorname{E}[Y(1)]}. (C.9)

Combining (C.8) and (C.9),

\gamma_{\mu,\operatorname{E}[Y(1)]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].

Likewise, if e(x)=p(x) or \mu_{0}(x)=m_{0}(x), then

\gamma_{\mu,\operatorname{E}[Y(0)]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]

and one obtains

\gamma_{\operatorname{E}[Y(0)],\operatorname{E}[Y(1)]}=\ominus\gamma_{\mu,\operatorname{E}[Y(0)]}\oplus\gamma_{\mu,\operatorname{E}[Y(1)]}
=\ominus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].

Appendix D Proofs for Section 4

For any positive sequences a_{n},b_{n}, we write a_{n}\lesssim b_{n} if there is a constant C>0, independent of n, such that a_{n}\leq Cb_{n} for all n.

D.1. Proof of Theorem 4.1 (i)

Define \mathrm{diam}(\mathcal{M})=\sup_{\mu_{1},\mu_{2}\in\mathcal{M}}d(\mu_{1},\mu_{2}). By Corollary 3.2.3 in van der Vaart and Wellner, (2023), it is sufficient to show

\sup_{\nu\in\mathcal{M}}\left|Q_{n,t}(\nu;\widehat{\mu}_{t},\widehat{\varphi})-Q_{t}(\nu;\mu_{t},\varphi_{*})\right|\stackrel{p}{\to}0,\ t\in\{0,1\},

where

Q_{n,t}(\nu;\mu,\varphi)={1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\mu(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\varphi)}+{(1-t)(1-T_{i})\over 1-e(X_{i};\varphi)}\right)\right),\ t\in\{0,1\}.

Now we show \sup_{\nu\in\mathcal{M}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\mu_{1},\varphi_{*})\right|\stackrel{p}{\to}0; the proof for t=0 is similar. For this, we show that Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi}) converges weakly to Q_{1}(\nu;\mu_{1},\varphi_{*}) in \ell^{\infty}(\mathcal{M}) and then apply Theorem 1.3.6 in van der Vaart and Wellner, (2023). From Theorem 1.5.4 in van der Vaart and Wellner, (2023), this weak convergence follows by showing that

  • (i) Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})\stackrel{p}{\to}Q_{1}(\nu;\mu_{1},\varphi_{*}) for each \nu\in\mathcal{M} as n\to\infty;

  • (ii) Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi}) is asymptotically equicontinuous in probability, i.e., for each \varepsilon,\eta>0, there exists \delta>0 such that

\limsup_{n\to\infty}\operatorname{P}\left(\sup_{d(\nu_{1},\nu_{2})<\delta}|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|>\varepsilon\right)<\eta.

For an arbitrary \nu\in\mathcal{M}, for (i) we have

|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\mu_{1},\varphi_{*})|
\leq|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu;\widehat{\mu}_{1},\varphi_{*})|+|Q_{n,1}(\nu;\widehat{\mu}_{1},\varphi_{*})-Q_{n,1}(\nu;\mu_{1},\varphi_{*})|+|Q_{n,1}(\nu;\mu_{1},\varphi_{*})-Q_{1}(\nu;\mu_{1},\varphi_{*})|
=:Q_{1}+Q_{2}+Q_{3},

and observe that

Q_{1}\leq{1\over n}\sum_{i=1}^{n}\left|d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\widehat{\varphi})}\right)\right)-d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)\right|
\leq{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}d\left(\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\widehat{\varphi})}\right),\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)
\lesssim{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}\left|{T_{i}\over e(X_{i};\widehat{\varphi})}-{T_{i}\over e(X_{i};\varphi_{*})}\right|d(\widehat{\mu}_{1}(X_{i}),Y_{i})
\leq{2\mathrm{diam}(\mathcal{M})^{2}\over\eta_{0}^{2}}\sup_{x\in\mathcal{X}}|e(x;\widehat{\varphi})-e(x;\varphi_{*})|=O_{p}(\varrho_{n}), (D.1)

Q_{2}\leq{1\over n}\sum_{i=1}^{n}\left|d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)-d^{2}\left(\nu,\gamma_{\mu_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)\right|
\leq{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}d\left(\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right),\gamma_{\mu_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)
\leq 2C_{0}\mathrm{diam}(\mathcal{M})\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{1}(x),\mu_{1}(x))=O_{p}(r_{n}), (D.2)

\operatorname{E}[Q_{3}^{2}]\leq{1\over n}\operatorname{E}\left[d^{4}\left(\nu,\gamma_{\mu_{1}(X),Y}\left({T\over e(X;\varphi_{*})}\right)\right)\right]\leq{\mathrm{diam}^{4}(\mathcal{M})\over n}=O(n^{-1}). (D.3)

Combining (D.1), (D.2), and (D.3), we obtain (i).

Pick any \nu_{1},\nu_{2}\in\mathcal{M}. For (ii), similarly to the argument leading to (D.1),

|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|\leq 2\mathrm{diam}(\mathcal{M})d(\nu_{1},\nu_{2}),

and this implies \sup_{d(\nu_{1},\nu_{2})<\delta}|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|=O_{p}(\delta), so that we obtain (ii), which completes the proof. ∎

D.2. Proof of Theorem 4.1 (ii)

Note that one can show \max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1) by applying arguments similar to those in the proof of Theorem 4.1 (i). From the definition of \widehat{\Theta}_{t}^{(\mathrm{CF})},

\sum_{k=1}^{K}{n_{k}\over n}d^{2}(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,k}^{(\mathrm{DR})})\leq\sum_{k=1}^{K}{n_{k}\over n}d^{2}(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})})\leq\max_{1\leq k\leq K}d^{2}(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})}),

which implies that d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})})\leq\max_{1\leq k\leq K}d(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})}) for some \bar{k}\in\{1,\dots,K\}. Then

d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])\leq d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})})+d(\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])\leq 2\max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1). (D.4) ∎

D.3. Proof of Theorem 4.2(i)

Now we show (4.6) for t=1; the case t=0 is analogous. We adapt the proof of Theorem 3.2.5 in van der Vaart and Wellner, (2023). Define

h_{\nu,\mu,\varphi}(x,y,t)=d^{2}\left(\nu,\gamma_{\mu(x),y}\left({t\over e(x;\varphi)}\right)\right)-d^{2}\left(\Theta_{1}^{(\mathrm{DR})},\gamma_{\mu_{1}(x),y}\left({t\over e(x;\varphi_{*})}\right)\right).

For some \delta_{0}>0 and \delta^{\prime}_{1}>0, consider the class of functions

\mathcal{H}_{\delta,\delta^{\prime}_{1}}:=\left\{h_{\nu,\mu,\varphi}(\cdot):d(\nu,\Theta_{1}^{(\mathrm{DR})})\leq\delta,\ d_{\infty}(\mu,\mu_{1})\leq\delta^{\prime}_{1},\ \|\varphi-\varphi_{*}\|\leq\delta^{\prime}_{1}\right\}

for any \delta\leq\delta_{0}. An envelope function for \mathcal{H}_{\delta,\delta^{\prime}_{1}} is 2\delta\mathrm{diam}(\mathcal{M}). Observe that

|h_{\widetilde{\nu},\widetilde{\mu},\widetilde{\varphi}}(x,y,t)-h_{\nu,\mu,\varphi}(x,y,t)|\leq 2\mathrm{diam}(\mathcal{M})\left(1+C_{0}+{C_{e}\mathrm{diam}(\mathcal{M})\over\eta_{0}^{2}}\right)\left\{d(\widetilde{\nu},\nu)+d_{\infty}(\widetilde{\mu},\mu)+\|\widetilde{\varphi}-\varphi\|\right\}.

Then for \varepsilon>0, following Theorem 2.7.17 of van der Vaart and Wellner, (2023), the \delta\varepsilon bracketing number of \mathcal{H}_{\delta,\delta^{\prime}_{1}} is bounded by a multiple of N(c_{1}\delta\varepsilon,B_{\delta}(\Theta_{1}^{(\mathrm{DR})}),d)\,N(c_{2}\delta\varepsilon,B_{\delta^{\prime}_{1}}(\mu_{1}),d_{\infty})\,(\delta\varepsilon)^{-c_{3}}, where c_{1},c_{2}, and c_{3} are constants depending only on p, \mathrm{diam}(\mathcal{M}), C_{e}, and C_{0}.

Assumption 4.4 (i) and Theorem 2.14.16 of van der Vaart and Wellner, (2023) imply that, for small enough \delta>0,

\operatorname{E}\left[\sup_{d(\nu,\Theta_{1})\leq\delta,\ \|\varphi-\varphi_{*}\|\leq\delta^{\prime}_{1},\ d_{\infty}(\mu,\mu_{1})\leq\delta^{\prime}_{1}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\right]
\lesssim{2\delta\mathrm{diam}(\mathcal{M})\over\sqrt{n}}\left(J_{1}(\delta)+J_{\mu_{1}}(\delta)+\int_{0}^{1}\sqrt{-\log(\delta\varepsilon)}\,d\varepsilon\right)\lesssim{\delta(1+\delta^{-\varpi})\over\sqrt{n}}\lesssim{\delta^{1-\varpi}\over\sqrt{n}}. (D.5)

For any \beta^{\prime}\in(0,1), set q_{n}^{-1}=\max\{n^{-{\beta\over 4(\beta-1+\varpi)}},(\varrho_{n}+r_{n})^{{\beta\beta^{\prime}\over 2(\beta-1)}}\} and

S_{j,n}=\left\{\nu:2^{j-1}<q_{n}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}\right\}.

Choose \eta>0 to satisfy Assumption 4.4 (ii) and also small enough so that Assumption 4.4 (i) holds for all \delta<\eta, and set \widetilde{\eta}=\eta^{\beta/2}. For any L,\delta_{1},K>0,

\operatorname{P}\left(q_{n}d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})^{\beta/2}>2^{L}\right)\leq\operatorname{P}(A_{n}^{c})+\operatorname{P}(2d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})\geq\eta)
\quad+\sum_{j\geq L,\ 2^{j}\leq q_{n}\widetilde{\eta}}\operatorname{P}\left(\left\{\sup_{\nu\in S_{j,n}}\left\{Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right\}\leq Kq_{n}^{-2}\right\}\cap A_{n}\right), (D.6)

where A_{n}=\{d_{\infty}(\widehat{\mu}_{1},\mu_{1})\leq\delta_{1}q_{n}^{-{2(\beta-1)\over\beta}},\ \|\widehat{\varphi}-\varphi_{*}\|\leq\delta_{1}q_{n}^{-{2(\beta-1)\over\beta}}\}.

Note that for any \varepsilon>0, there exists an integer n_{\varepsilon} such that, for all n\geq n_{\varepsilon}, \operatorname{P}(A_{n}^{c})<\varepsilon/3, \operatorname{P}(2d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})\geq\eta)<\varepsilon/3, and \delta_{1}q_{n}^{-{2(\beta-1)\over\beta}}\leq\delta^{\prime}_{1}.

Define A_{n,\varepsilon} as A_{n} with n\geq n_{\varepsilon}. Then, on A_{n,\varepsilon}, for each fixed j such that 2^{j}\leq q_{n}\widetilde{\eta}, one has for all \nu\in S_{j,n},

Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})
\geq Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})
\quad-\sup_{d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq{2^{j}\over q_{n}}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right|
\geq{2^{2j-2}\over q_{n}^{2}}\left({C\over 2}-C_{1}\delta_{1}^{{\beta\over 2(\beta-1)}}2^{2-j}\right)
\quad-\sup_{d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq{2^{j}\over q_{n}}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right|.

Then for large $j$ such that $C/2-C_{1}\delta_{1}^{{\beta\over 2(\beta-1)}}2^{2-j}-K2^{2-2j}\geq C/4$, we have

\begin{align*}
&\operatorname{P}\left(\left\{\inf_{\nu\in S_{j,n}}\left\{Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right\}\leq Kq_{n}^{-2}\right\}\cap A_{n,\varepsilon}\right)\\
&\leq\operatorname{P}\left(\sup_{\begin{subarray}{c}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}/q_{n},\\ d_{\infty}(\mu,\mu_{1})\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta},\\ \|\varphi-\varphi_{*}\|\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta}\end{subarray}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\geq{C2^{2j-2}\over 4q_{n}^{2}}\right),\tag{D.7}
\end{align*}
where the supremum with the plug-in estimates $(\widehat{\mu}_{1},\widehat{\varphi})$ is dominated, on $A_{n,\varepsilon}$, by the supremum over deterministic $(\mu,\varphi)$ in the corresponding balls.

For each $j$ in the sum on the right-hand side of (D.6), we have $d(\nu,\Theta_{1}^{(\mathrm{DR})})\leq(2^{j}/q_{n})^{2/\beta}\leq\eta$ for all $\nu\in S_{j,n}$, so that (D.5) applies with $\delta=(2^{j}/q_{n})^{2/\beta}$. Combining (D.6) and (D.7) with the Markov inequality and then invoking (D.5),

\begin{align*}
&\operatorname{P}\left(q_{n}\,d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})^{\beta/2}>2^{L}\right)\\
&\lesssim\sum_{\begin{subarray}{c}j\geq L\\ 2^{j}\leq q_{n}\widetilde{\eta}\end{subarray}}{2^{-2j}\over q_{n}^{-2}}\operatorname{E}\left[\sup_{\begin{subarray}{c}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}/q_{n},\\ d_{\infty}(\mu,\mu_{1})\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta},\\ \|\varphi-\varphi_{*}\|\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta}\end{subarray}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\right]+{2\varepsilon\over 3}\\
&\lesssim\sum_{\begin{subarray}{c}j\geq L\\ 2^{j}\leq q_{n}\widetilde{\eta}\end{subarray}}{2^{-2j+{2j\over\beta}(1-\varpi)}\over q_{n}^{-2+{2\over\beta}(1-\varpi)}\sqrt{n}}+{2\varepsilon\over 3}\leq\sum_{j\geq L}\left({1\over 4^{{\beta-1+\varpi\over\beta}}}\right)^{j}+{2\varepsilon\over 3}.
\end{align*}
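The final comparison is exponent bookkeeping: writing $\gamma=(\beta-1+\varpi)/\beta$ (a shorthand introduced here only for this verification) and using $q_{n}\leq n^{{\beta\over 4(\beta-1+\varpi)}}=n^{{1\over 4\gamma}}$, so that $q_{n}^{2\gamma}\leq\sqrt{n}$, each summand satisfies
\[
{2^{-2j+{2j\over\beta}(1-\varpi)}\over q_{n}^{-2+{2\over\beta}(1-\varpi)}\sqrt{n}}={(2^{-2j}q_{n}^{2})^{\gamma}\over\sqrt{n}}\leq 2^{-2j\gamma}=\left({1\over 4^{{\beta-1+\varpi\over\beta}}}\right)^{j}.
\]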

Since $\beta>1$, the last series converges, and hence the first term on the far right-hand side can be made smaller than $\varepsilon/3$ for large $L$. Therefore, we obtain the desired result,

\[
d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(1)])=O_{p}(q_{n}^{-{2\over\beta}})=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}}\right).
\]

D.4. Proof of Theorem 4.2(ii)

Applying almost the same argument as in the proof of Theorem 4.2 (i), one can show

\[
\max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}}\right).
\]

Then (D.4) yields the result.

References

  • Ahmadi, H., Fatemizadeh, E., and Motie-Nasrabadi, A. (2023). A comparative study of correlation methods in functional connectivity analysis using fMRI data of Alzheimer's patients. Journal of Biomedical Physics & Engineering, 13(2):125.
  • Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd.
  • Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32.
  • Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Bigot, J., Gouet, R., Klein, T., and López, A. (2017). Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré B: Probability and Statistics, 53:1–26.
  • Billera, L. J., Holmes, S. P., and Vogtmann, K. (2001). Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27:733–767.
  • Chen, Y. and Müller, H.-G. (2022). Uniform convergence of local Fréchet regression with applications to locating extrema and time warping for metric space valued trajectories. Annals of Statistics, 50(3):1573–1592.
  • Chen, Y., Zhou, Y., Chen, H., Gajardo, A., Fan, J., Zhong, Q., Dubey, P., Han, K., Bhattacharjee, S., Zhu, C., Iao, S. I., Kundu, P., Petersen, A., and Müller, H.-G. (2023). frechet: Statistical Analysis for Random Objects and Non-Euclidean Data. R package version 0.3.0.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Dai, X. (2022). Statistical inference on the Hilbert sphere with application to random densities. Electronic Journal of Statistics, 16(1):700–736.
  • Damoiseaux, J. S., Prater, K. E., Miller, B. L., and Greicius, M. D. (2012). Functional connectivity tracks clinical deterioration in Alzheimer's disease. Neurobiology of Aging, 33(4):828.e19–828.e30.
  • Delsol, L. and Van Keilegom, I. (2020). Semiparametric M-estimation with non-smooth criterion functions. Annals of the Institute of Statistical Mathematics, 72(2):577–605.
  • Dryden, I. L., Koloydenko, A., and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Annals of Applied Statistics, 3:1102–1123.
  • Dubey, P., Chen, Y., and Müller, H.-G. (2024). Metric statistics: Exploration and inference for random objects with distance profiles. Annals of Statistics, 52:757–792.
  • Ferreira, L. K. and Busatto, G. F. (2013). Resting-state functional connectivity in normal brain aging. Neuroscience & Biobehavioral Reviews, 37(3):384–400.
  • Filzmoser, P., Hron, K., and Templ, M. (2018). Applied Compositional Data Analysis: With Worked Examples in R. Springer.
  • Gallier, J. and Quaintance, J. (2020). Differential Geometry and Lie Groups: A Computational Perspective, volume 12. Springer.
  • Geyer, C. J. (2020). trust: Trust Region Optimization. R package version 0.1.8.
  • Gunsilius, F. F. (2023). Distributional synthetic controls. Econometrica, 91(3):1105–1117.
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331.
  • Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
  • Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
  • Kuchibhotla, A. K., Balakrishnan, S., and Wasserman, L. (2024). The HulC: Confidence regions from convex hulls. Journal of the Royal Statistical Society: Series B (Statistical Methodology), in press.
  • Lin, Z. (2019). Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353–1370.
  • Lin, Z., Kong, D., and Wang, L. (2023). Causal inference on distribution functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 85(2):378–398.
  • Lin, Z. and Müller, H.-G. (2021). Total variation regularized Fréchet regression for metric-space valued data. Annals of Statistics, 49:3510–3533.
  • McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179.
  • Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.
  • Petersen, A. and Müller, H.-G. (2019). Fréchet regression for random objects with Euclidean predictors. Annals of Statistics, 47(2):691–719.
  • Pigoli, D., Aston, J. A., Dryden, I. L., and Secchi, P. (2014). Distances and inference for covariance operators. Biometrika, 101:409–422.
  • Planche, V., Manjon, J. V., Mansencal, B., Lanuza, E., Tourdias, T., Catheline, G., and Coupé, P. (2022). Structural progression of Alzheimer's disease over decades: The MRI staging scheme. Brain Communications, 4(3):fcac109.
  • Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rubinov, M. and Sporns, O. (2010). Complex network measures of brain connectivity: Uses and interpretations. NeuroImage, 52(3):1059–1069.
  • Sala-Llonch, R., Bartrés-Faz, D., and Junqué, C. (2015). Reorganization of brain networks in aging: A review of functional connectivity studies. Frontiers in Psychology, 6.
  • Scealy, J. and Welsh, A. (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):351–375.
  • Scealy, J. and Welsh, A. (2014). Colours and cocktails: Compositional data analysis. Australian & New Zealand Journal of Statistics, 56(2):145–169.
  • Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120.
  • Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. (2020). OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637–672.
  • Thanwerdas, Y. and Pennec, X. (2023). O(n)-invariant Riemannian metrics on SPD matrices. Linear Algebra and its Applications, 661:163–201.
  • Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B., and Joliot, M. (2002). Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15(1):273–289.
  • van der Vaart, A. W. and Wellner, J. A. (2023). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • Zhang, H.-Y., Wang, S.-J., Liu, B., Ma, Z.-L., Yang, M., Zhang, Z.-J., and Teng, G.-J. (2010). Resting brain connectivity: Changes during the progress of Alzheimer disease. Radiology, 256(2):598–606.
  • Zhou, Y. and Müller, H.-G. (2022). Network regression with graph Laplacians. Journal of Machine Learning Research, 23:1–41.
  • Zhu, C. and Müller, H.-G. (2023). Geodesic optimal transport regression. arXiv preprint arXiv:2312.15376.