
Geodesic Causal Inference

Daisuke Kurisu, Yidong Zhou, Taisuke Otsu, and Hans-Georg Müller

Center for Spatial Information Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8568, Japan. daisukekurisu@csis.u-tokyo.ac.jp
Department of Statistics, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA. ydzhou@ucdavis.edu
Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK. t.otsu@lse.ac.uk
Department of Statistics, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA. hgmueller@ucdavis.edu
(Date: First version: January 6, 2024. This version: )
Abstract.

Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for scalar outcomes and recently for distributional outcomes. We introduce here a general framework for causal inference when outcomes reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the characterization of treatment effects through the concept of the geodesic average treatment effect. Using ideas from Fréchet regression, we develop estimation methods for the geodesic average treatment effect and derive consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Our methodology is illustrated by a simulation study and by real data examples: compositional outcomes from U.S. state-wise energy source data to study the effect of coal mining; network data of New York taxi trips, where the effect of the COVID-19 pandemic is of interest; and brain functional connectivity network data to study the effect of Alzheimer's disease.

Key words and phrases:
Doubly robust estimation, Fréchet regression, geodesic average treatment effect, metric statistics, networks, random objects
The first two authors contributed equally to the paper and are listed alphabetically. The work of D. K. was partially supported by JSPS KAKENHI Grant Number 23K12456, and the work of H.G.M. was partially supported by NSF grant DMS-2310450.

1. Introduction

Causal inference aims to mitigate the effects of confounders in the assessment of treatment effects and has garnered increasing attention in modern data science (Imbens and Rubin, 2015; Peters et al., 2017). While inadequate adjustment for confounders leads to biased causal effect estimates, Rosenbaum and Rubin (1983) demonstrated that the propensity score, defined as the probability of treatment assignment conditional on confounders, can be used to correct this bias. Robins et al. (1994) proposed a doubly robust estimation method combining outcome regression and propensity score modeling; see also Scharfstein et al. (1999) and Bang and Robins (2005). Hahn (1998) established the semiparametric efficiency bound for estimation of the average treatment effect, and Hirano et al. (2003) developed a semiparametrically efficient estimator; see Athey and Imbens (2017) for a survey.

Almost all existing work in causal inference has focused on scalar outcomes or, more generally, outcomes situated in a Euclidean space. The only exceptions are distributional data, for which causal inference was recently studied by Gunsilius (2023) and Lin et al. (2023). These approaches rely on the fact that for univariate distributions one can make use of restricted linear operations in the space of quantile functions, which forms a convex subset of a Hilbert space. These approaches are thus specific to univariate distributional outcomes and do not directly extend to the case of general random objects, i.e., metric space-valued random variables. Here we aim at a major generalization and develop a novel methodology for causal inference on general random objects. Specifically, we propose a generalized notion of causal effects and corresponding estimation methods when outcomes are located in a geodesic metric space, while treatments are discrete and potential confounders are Euclidean. Our framework covers Euclidean and univariate distributional outcomes as special cases.

A basic issue one then needs to address is how to quantify treatment effects when outcomes are complex random objects. Conventional approaches based on algebraic averaging or differencing in Euclidean spaces do not work in geodesic metric spaces and therefore need to be generalized. For our methodology and theory, we consider uniquely extendable geodesic spaces and demonstrate how a geodesic calculus that allows for certain geodesic operations in such spaces can be harnessed to define notions such as summation, subtraction, and scalar multiplication, which lead to the concept of the geodesic average treatment effect (GATE).

We then proceed to construct estimation methods for the GATE in this general setting and furthermore derive consistency and rates of convergence for the estimators towards their population targets. To facilitate practical applications of the proposed methodology, we also provide uncertainty quantification and an inference method for the GATE. Lastly, we demonstrate the finite sample behavior of the proposed estimation methods by a simulation study and, importantly, illustrate the proposed methodology by relevant real world examples for networks and compositional outcomes.

A brief overview is as follows. Basic concepts and results for all of the following are presented in Section 2. Specifically, Section 2.1 provides a brief review of uniquely extendable geodesic metric spaces and the geodesics in such spaces, where we introduce a scalar multiplication for geodesics that serves as a key tool for subsequent developments. Following these essential preparations, Section 2.1 also presents specific examples of uniquely extendable geodesic metric spaces that are encountered in real world applications, motivating the proposed methodology. Then in Section 2.2 we introduce the notion of the GATE in geodesic metric spaces, which generalizes the conventional notion of differences between potential outcomes in Euclidean spaces.

In Section 3 we introduce doubly robust and cross-fitting estimators for the GATE, where basic notions such as propensity scores and implementations of conditional Fréchet means through Fréchet regression (Petersen and Müller, 2019), as well as perturbations of random objects, are introduced in Section 3.1. These notions and auxiliary results are then used in Section 3.2 to define our doubly robust estimator. As main results, we establish asymptotic properties of the doubly robust and cross-fitting estimators in Section 4. Specifically, we demonstrate the consistency and convergence rates of these estimators under regularity conditions (Section 4.1) and discuss methods for assessing the uncertainty of the estimators (Section 4.2).

In Section 5 we present simulation results that evaluate the finite-sample performance of the proposed estimators for two metric spaces, the space of covariance matrices with the Frobenius metric and the space of compositional data with the geodesic metric on spheres. Section 6 is devoted to showcasing the proposed methodology for three types of real-world data, where we aim to infer the impact of coal mining in U.S. states on the composition of electricity generation sources, represented as spherical data; the impact of the COVID-19 pandemic on taxi usage in the Manhattan area, quantified as networks and represented as graph Laplacians; and the effect of an Alzheimer's disease diagnosis on brain functional connectivity networks. The paper closes with a brief discussion in Section 7.

2. Treatment effects for random objects

2.1. Preliminaries on uniquely extendable geodesic metric spaces

Consider a uniquely geodesic metric space $(\Omega,d)$ that is endowed with a probability measure $P$. For any $\alpha,\beta\in\Omega$, the unique geodesic connecting $\alpha$ and $\beta$ is a curve $\gamma_{\alpha,\beta}:[0,1]\mapsto\Omega$ such that $d(\gamma_{\alpha,\beta}(s),\gamma_{\alpha,\beta}(t))=d(\alpha,\beta)\,|t-s|$ for $s,t\in[0,1]$. For $\alpha,\beta,\zeta\in\Omega$, define the following simple operations on geodesics,
\[
\gamma_{\alpha,\zeta}\oplus\gamma_{\zeta,\beta}:=\gamma_{\alpha,\beta},\qquad \ominus\gamma_{\alpha,\beta}:=\gamma_{\beta,\alpha},\qquad \mathrm{id}_{\alpha}:=\gamma_{\alpha,\alpha}.
\]

The space $(\Omega,d)$ is further assumed to be geodesically complete (Gallier and Quaintance, 2020), meaning that every geodesic can be extended to a geodesic defined on $\mathbb{R}$. Specifically, for any $\alpha,\beta\in\Omega$, the geodesic $\gamma_{\alpha,\beta}:[0,1]\mapsto\Omega$ can be extended to $\bar{\gamma}_{\alpha,\beta}:\mathbb{R}\mapsto\Omega$, whose restriction to $[0,1]$ coincides with $\gamma_{\alpha,\beta}$, i.e., $\bar{\gamma}_{\alpha,\beta}|_{[0,1]}=\gamma_{\alpha,\beta}$. We can then define a scalar multiplication for geodesics $\gamma_{\alpha,\beta}$ with a factor $\rho\in\mathbb{R}$ by
\[
\rho\odot\gamma_{\alpha,\beta}:=\begin{cases}\{\bar{\gamma}_{\alpha,\beta}(t):t\in[0,\rho]\}&\text{if }\rho>0,\\ \{\bar{\gamma}_{\alpha,\beta}(t):t\in[\rho,0]\}&\text{if }\rho<0,\\ \mathrm{id}_{\alpha}&\text{if }\rho=0,\end{cases}\qquad \rho\odot\mathrm{id}_{\alpha}:=\mathrm{id}_{\alpha}.
\]
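For orientation, we note (as our own illustration, not part of the formal development) how these operations read in a Euclidean space, where $\bar{\gamma}_{\alpha,\beta}(t)=\alpha+t(\beta-\alpha)$:
\[
\rho\odot\gamma_{\alpha,\beta}\ \text{is the segment from}\ \alpha\ \text{to}\ \alpha+\rho(\beta-\alpha)\ \text{for}\ \rho>0,\qquad \ominus\gamma_{\alpha,\beta}=\gamma_{\beta,\alpha}\ \text{traverses the segment from}\ \beta\ \text{to}\ \alpha.
\]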

In practical applications, the relevant metric space often constitutes a subset $(\mathcal{M},d)$ of $(\Omega,d)$, which is typically closed and convex. Prominent instances of such spaces are given in Examples 2.1–2.4 below. We use these same example spaces to demonstrate the practical performance of the proposed geodesic treatment effect in simulations and real-world data analyses. In Appendix B we demonstrate that these examples satisfy the pertinent assumptions outlined in Sections 3 and 4; see Propositions B.1–B.4. Geodesic extensions are confined to operate within the closed subset $\mathcal{M}\subset\Omega$, in some cases necessitating suitable modifications to ensure that they do not cross the boundary (Zhu and Müller, 2023).

Considering the forward extension of a geodesic, assume that the geodesic $\gamma_{\alpha,\beta}$ extends to the boundary point $\zeta$ and denote the extended geodesic by $\gamma_{\alpha,\zeta}:[0,1]\mapsto\mathcal{M}$. For $\rho>1$, we then define a scalar multiplication by
\[
\rho\odot\gamma_{\alpha,\beta}=\{\gamma_{\alpha,\zeta}(t):t\in[0,h(\rho)]\},\qquad\text{where}\qquad
h(\rho)=1-\left(1-\frac{d(\alpha,\beta)}{d(\alpha,\zeta)}\right)^{\rho}.
\]

It is straightforward to verify that the start point of $\rho\odot\gamma_{\alpha,\beta}$ is always $\alpha$, while the end point is $\beta$ for $\rho=1$ and $\zeta$ for $\rho=\infty$: indeed, $h(1)=d(\alpha,\beta)/d(\alpha,\zeta)$, so $\gamma_{\alpha,\zeta}(h(1))$ lies at distance $d(\alpha,\beta)$ from $\alpha$ along the geodesic, i.e., equals $\beta$, while $h(\rho)\to 1$ as $\rho\to\infty$; see Figure 1 for an illustration. Similar modifications apply to the reverse extension where $\rho<-1$. To simplify the notation, we write the end point of $\rho\odot\gamma_{\alpha,\beta}$ as $\gamma_{\alpha,\beta}(\rho)$.

Figure 1. Illustration of geodesic extension. The circles symbolize objects in the geodesic metric space $\mathcal{M}$. The arrows emanating from these circles represent the geodesic paths connecting them.
Example 2.1 (Space of networks).

Consider simple, undirected, and weighted networks with a fixed number of nodes $m$ and bounded edge weights. There is a one-to-one correspondence between such a network and its graph Laplacian. The space of graph Laplacians equipped with the Frobenius (power) metric $d_F$ can thus be used to characterize the space of networks (Zhou and Müller, 2022), and it is a bounded and convex subset of $\mathbb{R}^{m^2}$.
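To make the correspondence concrete, the following minimal R sketch (function and variable names are ours) maps a weight matrix to its graph Laplacian and evaluates the Frobenius metric:

# Graph Laplacian L = D - W of a simple, undirected, weighted network,
# where W is the symmetric weight matrix with zero diagonal and D the
# diagonal degree matrix.
graph_laplacian <- function(W) {
  stopifnot(isSymmetric(W), all(diag(W) == 0))
  diag(rowSums(W)) - W
}

# Frobenius metric between two graph Laplacians
d_F <- function(L1, L2) sqrt(sum((L1 - L2)^2))

W1 <- matrix(c(0, 1, 2,
               1, 0, 0,
               2, 0, 0), 3, 3)   # toy 3-node network
W2 <- matrix(c(0, 2, 1,
               2, 0, 1,
               1, 1, 0), 3, 3)
d_F(graph_laplacian(W1), graph_laplacian(W2))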

Example 2.2 (Space of covariance matrices).

Consider $m$-dimensional covariance matrices with bounded variances. The space of such covariance matrices equipped with the Frobenius metric $d_F$ is a bounded and convex subset of $\mathbb{R}^{m^2}$. This holds true for the space of $m$-dimensional correlation matrices as well (Dryden et al., 2009).

Example 2.3 (Space of compositional data).

Compositional data take values in the simplex
\[
\Delta^{d-1}=\Big\{\bm{y}\in\mathbb{R}^{d}:y_{j}\geq 0,\ j=1,\ldots,d,\ \text{and}\ \sum_{j=1}^{d}y_{j}=1\Big\},
\]
reflecting that such data are non-negative proportions that sum to 1. Taking the component-wise square root $\sqrt{\bm{y}}=(\sqrt{y_{1}},\ldots,\sqrt{y_{d}})'$ of an element $\bm{y}\in\Delta^{d-1}$, the simplex $\Delta^{d-1}$ can be mapped to the first orthant of the unit sphere, $\mathcal{S}^{d-1}_{+}=\{\bm{z}\in\mathcal{S}^{d-1}:z_{j}\geq 0,\ j=1,\ldots,d\}$ (Scealy and Welsh, 2011, 2014). We then consider $\mathcal{M}$ to be the space of compositional data $\mathcal{S}^{d-1}_{+}$, equipped with the geodesic (Riemannian) metric on the sphere,
\[
d_{g}(\bm{y}_{1},\bm{y}_{2})=\arccos(\bm{y}_{1}'\bm{y}_{2}),\quad\text{for }\bm{y}_{1},\bm{y}_{2}\in\mathcal{S}^{d-1}_{+}.
\]

This idea can be extended to distributions by considering a separable Hilbert space $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. The Hilbert sphere $\mathcal{S}=\{g\in\mathcal{H}:\|g\|_{\mathcal{H}}=1\}$ is an infinite-dimensional extension of finite-dimensional spheres. The space of (multivariate) absolutely continuous distributions can be equipped with the Fisher-Rao metric
\[
d_{R}(f_{1},f_{2})=\arccos\big(\langle\sqrt{f_{1}},\sqrt{f_{2}}\rangle_{\mathcal{H}}\big)
\]
and then modeled as a subset of the Hilbert sphere $\mathcal{S}$, where the Fisher-Rao metric pertains to the geodesic metric between the square roots of the densities $f_{1}$ and $f_{2}$.

Example 2.4 (Wasserstein space).

Consider univariate probability measures on $\mathbb{R}$ with finite second moments. The space of such probability measures, equipped with the Wasserstein metric $d_{\mathcal{W}}$, is known as the Wasserstein space $(\mathcal{W},d_{\mathcal{W}})$, which is a complete and separable metric space. The Wasserstein metric between any two probability measures $\mu$ and $\nu$ is
\[
d_{\mathcal{W}}^{2}(\mu,\nu)=\int_{0}^{1}\{F_{\mu}^{-1}(p)-F_{\nu}^{-1}(p)\}^{2}\,dp,
\]
where $F_{\mu}^{-1}(\cdot)$ and $F_{\nu}^{-1}(\cdot)$ denote the quantile functions of $\mu$ and $\nu$, respectively.
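As an illustration, a minimal R sketch (ours) approximates $d_{\mathcal{W}}$ between two samples by discretizing the integral of squared quantile differences:

# Empirical Wasserstein metric between univariate samples x and y,
# approximating the integral over p in (0,1) on an equidistant grid.
wasserstein_dist <- function(x, y, M = 1000) {
  p <- (seq_len(M) - 0.5) / M
  sqrt(mean((quantile(x, p) - quantile(y, p))^2))
}

set.seed(1)
wasserstein_dist(rnorm(500), rnorm(500, mean = 1))   # population value is 1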

In Examples 2.1 and 2.2, for any two matrices $\alpha,\beta$ in the space, the geodesic is the line segment connecting $\alpha$ and $\beta$, i.e., $\gamma_{\alpha,\beta}(t)=\alpha+(\beta-\alpha)t$. In Example 2.3, for any two points $\alpha,\beta\in\mathcal{S}_{+}^{d-1}$, the geodesic connecting $\alpha$ and $\beta$ is
\[
\gamma_{\alpha,\beta}(t)=\cos(\theta t)\,\alpha+\sin(\theta t)\,\frac{\beta-(\alpha'\beta)\alpha}{\|\beta-(\alpha'\beta)\alpha\|},
\]
where $\theta=\arccos(\alpha'\beta)$ represents the angle between $\alpha$ and $\beta$.
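A minimal R sketch (ours) of this spherical geodesic; evaluating it at $t>1$, as long as the path stays within $\mathcal{S}^{d-1}_{+}$, yields the geodesic extension used in the scalar multiplication of Section 2.1:

# Geodesic on the sphere from alpha to beta, evaluated at time t;
# alpha, beta are unit vectors (e.g., square roots of compositions).
sphere_geo <- function(alpha, beta, t) {
  theta <- acos(sum(alpha * beta))        # angle between alpha and beta
  u <- beta - sum(alpha * beta) * alpha   # tangent direction at alpha
  u <- u / sqrt(sum(u^2))
  cos(theta * t) * alpha + sin(theta * t) * u
}

alpha <- sqrt(c(0.5, 0.3, 0.2))           # sqrt-transformed compositions
beta  <- sqrt(c(0.2, 0.5, 0.3))
mid   <- sphere_geo(alpha, beta, 0.5)
acos(sum(alpha * mid)) / acos(sum(alpha * beta))   # = 0.5, as required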

Write $\tau_{\#}\mu$ for the pushforward measure of $\mu$ by the transport map $\tau$. In Example 2.4, for any two distributions $\alpha,\beta\in\mathcal{W}$, the geodesic between them is given by McCann's interpolant (McCann, 1997),
\[
\gamma_{\alpha,\beta}(t)=\big(\mathrm{id}+t(F_{\beta}^{-1}\circ F_{\alpha}-\mathrm{id})\big)_{\#}\alpha,
\]
where $\mathrm{id}$, $F_{\alpha}$, and $F_{\beta}^{-1}$ denote the identity map, the cumulative distribution function of $\alpha$, and the quantile function of $\beta$, respectively.
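For univariate distributions, McCann's interpolant is equivalent to linear interpolation of quantile functions, $F^{-1}_{\gamma_{\alpha,\beta}(t)}=(1-t)F_{\alpha}^{-1}+tF_{\beta}^{-1}$; a minimal R sketch (ours), representing each distribution by its quantile values on a grid:

# Wasserstein geodesic between alpha and beta via quantile functions
mccann <- function(q_alpha, q_beta, t) (1 - t) * q_alpha + t * q_beta

p <- seq_len(999) / 1000
q_alpha <- qnorm(p)                       # alpha = N(0, 1)
q_beta  <- qnorm(p, mean = 2)             # beta  = N(2, 1)
q_mid   <- mccann(q_alpha, q_beta, 0.5)   # quantiles of the midpoint N(1, 1)
max(abs(q_mid - qnorm(p, mean = 1)))      # numerically zero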

2.2. Geodesic average treatment effects

Let $(\mathcal{M},d)$ be a uniquely extendable geodesic metric space. Units $i=1,\dots,n$ either receive a treatment or are not treated. For the $i$-th unit, we observe a treatment indicator variable $T_{i}$, where $T_{i}=1$ if the $i$-th unit is treated and $T_{i}=0$ otherwise, along with an outcome
\[
Y_{i}=\begin{cases}Y_{i}(0)&\text{if }T_{i}=0,\\ Y_{i}(1)&\text{if }T_{i}=1,\end{cases}
\]
where $Y_{i}(0),Y_{i}(1)\in\mathcal{M}$ are the potential outcomes for $T_{i}=0$ and $1$, respectively. Additionally, we observe Euclidean covariates $X_{i}\in\mathbb{R}^{p}$. We assume that $\{Y_{i},T_{i},X_{i}\}_{i=1}^{n}$ is an i.i.d. sample from a joint distribution $P$, where a generic random variable distributed according to $P$ is $(Y,T,X)\in\mathcal{M}\times\{0,1\}\times\mathcal{X}$, and we assume that all conditional distributions are well-defined; expectations and conditional expectations in the following are with respect to $P$. We assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^{p}$ and write $Y=Y(t)$ if $T=t\in\{0,1\}$.

A standard way to quantify causal effects in the usual situation where outcomes are located in Euclidean spaces is to target the difference of the potential outcomes. When outcomes take values in a general metric space this is no longer an option, due to the absence of a linear structure in general metric spaces (see, e.g., Example 2.3), which means that the notion of a difference does not exist. This requires extending the conventional notion of causal effects to the case of random objects in $\mathcal{M}$; to address this challenge we propose to replace the conventional difference by the geodesic that connects the Fréchet means of the potential outcomes.

Define the Fréchet mean of a random element $A\in\mathcal{M}$ and the conditional Fréchet mean of $A\in\mathcal{M}$ given $Z\in\mathbb{R}^{p}$ as
\[
\operatorname{E}_{\oplus}[A]:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[d^{2}(\nu,A)],\qquad
\operatorname{E}_{\oplus}[A|Z]:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[d^{2}(\nu,A)|Z],\tag{2.1}
\]
respectively. Assuming the minimizers exist and are unique, we introduce the following notions of causal effects in geodesic metric spaces.

Definition 2.1 (Geodesic individual/average treatment effect).

The geodesic individual treatment effect of $T$ on $Y$ for the $i$-th unit is defined as the geodesic connecting $Y_{i}(0)$ and $Y_{i}(1)$, i.e., $\gamma_{Y_{i}(0),Y_{i}(1)}$. The geodesic average treatment effect (GATE) of $T$ on $Y$ is defined as the geodesic connecting $\operatorname{E}_{\oplus}[Y(0)]$ and $\operatorname{E}_{\oplus}[Y(1)]$, i.e., $\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}$.

In Euclidean spaces, the geodesic individual treatment effect and the GATE reduce to the difference of the potential outcomes for the $i$-th unit, $Y_{i}(1)-Y_{i}(0)$, and the difference of the expected potential outcomes, $\operatorname{E}[Y(1)]-\operatorname{E}[Y(0)]$, respectively. These can be interpreted as quantities that represent the shortest distance and directional information from $Y_{i}(0)$ to $Y_{i}(1)$ or from $\operatorname{E}[Y(0)]$ to $\operatorname{E}[Y(1)]$ with respect to the Euclidean metric. Since a geodesic connecting points $\alpha$ and $\beta$ in a geodesic metric space $(\mathcal{M},d)$ embodies both the shortest distance and the directional information from $\alpha$ to $\beta$ with respect to the metric $d$, Definition 2.1 emerges as a natural extension for quantifying treatment effects in geodesic metric spaces.

3. Doubly robust estimation of geodesic average treatment effects

3.1. Preliminaries

To construct doubly robust estimators in metric spaces, one needs to extend the notion of doubly robust estimation of causal effects to the case of random objects in a geodesic metric space. In analogy to the Euclidean case, key ingredients are the propensity score $p(x)=\operatorname{P}(T=1|X=x)$ and the conditional Fréchet means of the potential outcomes, $m_{t}(x)=\operatorname{E}_{\oplus}[Y(t)|X=x]$ for $t\in\{0,1\}$. We impose the following conditions.

Assumption 3.1.
  • (i) $(\mathcal{M},d)$ is a uniquely extendable geodesic metric space that is complete, separable, and totally bounded.
  • (ii) There exists a constant $\eta_{0}\in(0,1/2)$ such that $\eta_{0}\leq p(x)\leq 1-\eta_{0}$ for each $x\in\mathcal{X}$.
  • (iii) $T$ and $\{Y(0),Y(1)\}$ are conditionally independent given $X$.

Assumptions 3.1 (ii) and (iii) are the standard overlap and unconfoundedness (or ignorability) conditions of causal inference. Assumption 3.1 (ii) says that for any $x\in\mathcal{X}$, there is a positive probability of finding units of both the treatment and control groups in any arbitrarily small neighborhood of $x$. Assumption 3.1 (iii) says that conditionally on $X$, the treatment is assigned independently of the potential outcomes. In the Euclidean case, this setup leads to various estimation methods for the average treatment effect, including inverse probability weighting, outcome regression, and doubly robust estimation methods.

Our goal here is to extend doubly robust estimation to geodesic metric spaces, targeting the GATE. To this end, we need to introduce the notion of a (Fréchet) regression model of $Y(t)$ on $X$ in geodesic metric spaces. More precisely, we impose the following conditions, where $\operatorname{E}_{\oplus}$ is as in (2.1).

Assumption 3.2.

There exist random perturbation maps $\mathcal{P}_{t}:\mathcal{M}\to\mathcal{M}$ and conditional Fréchet mean functions $m_{t}:\mathcal{X}\to\mathcal{M}$ for $t\in\{0,1\}$ such that
  • (i) $Y(t)=\mathcal{P}_{t}(m_{t}(X))$,
  • (ii) $\operatorname{E}_{\oplus}[\mathcal{P}_{t}(m_{t}(X))|X]=m_{t}(X)$,
  • (iii) $\operatorname{E}_{\oplus}[\mathcal{P}_{t}(m_{t}(X))]=\operatorname{E}_{\oplus}[m_{t}(X)]$.

The random perturbation map $\mathcal{P}_{t}$ is a generalization of the additive noise $\varepsilon$ in the Euclidean case; it is needed here because an addition operation is not available, and it generalizes the error assumption $\operatorname{E}[\varepsilon|X]=0$. Specifically, Assumption 3.2 (ii) is equivalent to $\operatorname{E}_{\oplus}[Y(t)|X]=m_{t}(X)$, so that $m_{t}$ becomes the Fréchet regression function of $Y(t)$ on $X$ (Petersen and Müller, 2019). In the following, we refer to $m_{t}$ as the outcome regression function. Our conditions on the perturbation maps generalize those considered in Chen and Müller (2022).

To obtain a doubly robust representation of the GATE, we additionally introduce an expectation operation for geodesics. For random objects $A,B\in\mathcal{M}$ and Euclidean random variables $\kappa\in\mathbb{R}$ and $X\in\mathbb{R}^{p}$, define
\begin{align*}
\operatorname{E}_{G}[\gamma_{A,B}]&:=\gamma_{\operatorname{E}_{\oplus}[A],\operatorname{E}_{\oplus}[B]},\\
\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}|X]&:=\operatorname{E}[\kappa|X]\odot\gamma_{\operatorname{E}_{\oplus}[A|X],\operatorname{E}_{\oplus}[B|X]},\\
\operatorname{E}_{G,X}[\kappa\odot\gamma_{A,B}]&:=\operatorname{E}_{G}\left[\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}|X]\right].
\end{align*}

These definitions of the geodesic expectations $\operatorname{E}_{G}[\cdot]$, $\operatorname{E}_{G}[\cdot|X]$, and $\operatorname{E}_{G,X}[\cdot]$ are consistent with the corresponding (conditional) expectations in the Euclidean case if $\kappa$ and $\{A,B\}$ are conditionally independent given $X$. Indeed, if $\kappa$ and $\{A,B\}$ are conditionally independent given $X$ and $A,B$ are Euclidean, it is easy to see that $\operatorname{E}_{G,X}[\kappa\odot\gamma_{A,B}]=\operatorname{E}_{G}[\kappa\odot\gamma_{A,B}]$ by the law of iterated expectations.

Based on the above assumptions and notions, a doubly robust representation of the GATE for geodesic metric spaces is as follows.

Proposition 3.1.

Let $e(x)$ and $\mu_{t}(x)$, $t\in\{0,1\}$, be models for the propensity score $p(x)$ and the outcome regression functions $m_{t}(x)$, $t\in\{0,1\}$. Suppose that Assumptions 3.1 and 3.2 hold. Then for any $\mu\in(\mathcal{M},d)$, it holds that
\[
\gamma_{\mu,\operatorname{E}_{\oplus}[Y(t)]}=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{t}(X)}\oplus\left\{\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\odot\gamma_{\mu_{t}(X),Y}\right\}\right],\quad t\in\{0,1\},\tag{3.1}
\]
provided that $e(x)=p(x)$ or $\mu_{t}(x)=m_{t}(x)$ for $t\in\{0,1\}$ (one or both of these must hold), and hence
\begin{align}
\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}&=\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\nonumber\\
&\qquad\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].\tag{3.2}
\end{align}

This proposition shows that the GATE is a natural generalization of the average treatment effect for Euclidean data to the case of geodesic metric spaces (see Remark 3.1 below). Furthermore, the representation has the double robustness property: either $e(x)$ or $\mu_{t}(x)$, $t\in\{0,1\}$, may be misspecified (but not both) while still achieving consistent estimation of the GATE. Since the GATE is defined via the geodesic between the Fréchet means of the potential outcomes, we first establish the representation from an arbitrary starting point $\mu$ in (3.1). Although the representation of the GATE in (3.2) also involves an arbitrary starting point $\mu$, the doubly robust, cross-fitting, and outcome regression GATE estimators presented below do not require choosing $\mu$; implementing the inverse probability weighting estimator, however, requires the choice of a starting point $\mu$, which can be arbitrary (see Remark 3.3 below for further details). We illustrate in Figure 2 the doubly robust representation of the GATE when the outcome regression model or alternatively the propensity score model is correctly specified.

Figure 2. Doubly robust representation of the GATE when either (A) the outcome regression model or (B) the propensity score model is correctly specified. In both panels, blue solid circles represent objects in the geodesic metric space $\mathcal{M}$. Arrows depict geodesics connecting them. Loop arrows represent geodesics that start and end at the same point. The red arrow in each panel represents the proposed GATE estimator, which reduces to the target $\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}$ in both cases, confirming the double robustness.
Remark 3.1 (Connection with the established doubly robust estimator for Euclidean data).

For the Euclidean case, $(\mathcal{M},d)=(\mathbb{R},\|\cdot\|)$, where $\|\cdot\|$ is the standard Euclidean metric. Then the average treatment effect of the treatment $T$ on the outcome $Y$ is $\operatorname{E}[Y(1)]-\operatorname{E}[Y(0)]$, and Assumption 3.1 guarantees
\begin{align}
\operatorname{E}[Y(t)]&=\operatorname{E}\left[\mu_{t}(X)+\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)(Y-\mu_{t}(X))\right]\nonumber\\
&=\operatorname{E}\left[\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right]\nonumber\\
&=\mathop{\rm arg\,min}_{\nu\in\mathbb{R}}\operatorname{E}\left[\left\|\nu-\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right\|^{2}\right],\tag{3.3}
\end{align}
provided $e(x)=p(x)$ or $\mu_{t}(x)=\operatorname{E}[Y(t)|X=x]$ for $t\in\{0,1\}$, and this leads to the construction of a doubly robust estimator for the Euclidean case. Additionally, we can show that for any $\mu\in\mathbb{R}$,
\begin{align}
\gamma_{\operatorname{E}[Y(0)],\operatorname{E}[Y(1)]}&=\ominus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\nonumber\\
&\qquad\oplus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].\tag{3.4}
\end{align}

Adopting this perspective, Proposition 3.1 is seen to be indeed an extension of the doubly robust representation of the average treatment effect in the Euclidean setting to geodesic metric spaces, with the Euclidean scenario as a special case; see Appendix C for the proof of (3.1).
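To connect (3.3) with familiar practice, the following minimal R sketch (all names are ours) implements the Euclidean doubly robust (AIPW) estimator of $\operatorname{E}[Y(t)]$ with a logistic propensity model and a linear outcome regression; it illustrates only the special case of this remark, not the general object-space estimator:

# Doubly robust estimate of E[Y(t)] in the Euclidean case of Remark 3.1
aipw_mean <- function(Y, T, X, t) {
  e_hat  <- fitted(glm(T ~ X, family = binomial))   # e(X; phi_hat)
  mu_hat <- predict(lm(Y ~ X, subset = (T == t)),
                    newdata = data.frame(X = X))    # mu_t(X_i)
  kappa  <- t * T / e_hat + (1 - t) * (1 - T) / (1 - e_hat)
  mean(mu_hat + kappa * (Y - mu_hat))   # endpoint of gamma_{mu_t(X),Y}(kappa)
}

set.seed(1)
n <- 500
X <- runif(n, -1, 1)
T <- rbinom(n, 1, plogis(0.75 * X))
Y <- T + X + rnorm(n)
aipw_mean(Y, T, X, 1) - aipw_mean(Y, T, X, 0)   # estimates the ATE (here 1)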

3.2. Doubly robust estimation

Building upon Proposition 3.1, we can construct a sample-based doubly robust estimator for the GATE. Note that the right-hand side of (3.2) can be rewritten as $\operatorname{E}_{G,X}[\gamma_{A_{0},A_{1}}]$ with
\[
A_{t}=\gamma_{\mu_{t}(X),Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right),\quad t\in\{0,1\},
\]
which corresponds to the representation in (3.1). By moving to the sample counterpart, a doubly robust estimator for the GATE is obtained as $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$, where
\begin{align}
\widehat{\Theta}_{t}^{(\mathrm{DR})}&:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}Q_{n,t}(\nu;\widehat{\mu}_{t},\widehat{\varphi}),\nonumber\\
Q_{n,t}(\nu;\mu,\varphi)&={1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\mu(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\varphi)}+{(1-t)(1-T_{i})\over 1-e(X_{i};\varphi)}\right)\right).\tag{3.5}
\end{align}

Here $e(x;\varphi)$ is a parametric model for the propensity score $p(x)$, $\widehat{\varphi}$ is an estimator of $\varphi$, and $\widehat{\mu}_{t}(x)$ is an estimator of the outcome regression function $m_{t}(x)$. In our real data analysis, we estimate the propensity score by a logistic regression model and the outcome regression function by global Fréchet regression (Petersen and Müller, 2019):
\[
\widehat{\mu}_{t}(x):=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over N_{t}}\sum_{i\in I_{t}}\{1+(X_{i}-\bar{X})'\widehat{\Sigma}^{-1}(x-\bar{X})\}\,d^{2}(\nu,Y_{i}),\quad t\in\{0,1\},\tag{3.6}
\]
where $I_{t}=\{1\leq i\leq n:T_{i}=t\}$, $N_{t}$ is the cardinality of $I_{t}$, $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$, and $\widehat{\Sigma}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})'$.

Remark 3.2 (Cross-fitting estimator).

In the Euclidean case, doubly robust estimation methods have been combined with cross-fitting to mitigate over-fitting (Chernozhukov et al., 2018). The proposed doubly robust estimator $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$ can be similarly adapted to incorporate cross-fitting, as follows.

Let $\{S_{k}\}_{k=1}^{K}$ be a (random) partition of $\{1,\dots,n\}$ into $K$ subsets. For each $k$, let $\widehat{\varphi}_{k}$ and $\widehat{\mu}_{t,k}(x)$ be the estimators of $\varphi$ and $\mu_{t}(x)$ computed from the data $\{Y_{i},T_{i},X_{i}\}_{i\in S_{-k}}$ outside the $k$-th subset, where $S_{-k}=\cup_{j\neq k}S_{j}$. Setting
\[
\widehat{\Theta}_{t,k}^{(\mathrm{DR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n_{k}}\sum_{i\in S_{k}}d^{2}\left(\nu,\gamma_{\widehat{\mu}_{t,k}(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\widehat{\varphi}_{k})}+{(1-t)(1-T_{i})\over 1-e(X_{i};\widehat{\varphi}_{k})}\right)\right),\quad t\in\{0,1\},
\]
where $n_{k}$ is the cardinality of $S_{k}$, the cross-fitting doubly robust estimator for the GATE is obtained as $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{CF})},\widehat{\Theta}_{1}^{(\mathrm{CF})}}$, with
\[
\widehat{\Theta}_{t}^{(\mathrm{CF})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\sum_{k=1}^{K}{n_{k}\over n}\,d^{2}(\nu,\widehat{\Theta}_{t,k}^{(\mathrm{DR})}),\quad t\in\{0,1\}.\tag{3.7}
\]
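A minimal R sketch (ours) of this cross-fitting scheme in the Euclidean special case, where the aggregation in (3.7) reduces to a weighted average of the fold-wise estimates:

# Cross-fitting doubly robust estimate of E[Y(t)]: nuisance models are
# fit on S_{-k} and evaluated on S_k; fold estimates are combined with
# weights n_k / n.
crossfit_mean <- function(Y, T, X, t, K = 5) {
  fold <- sample(rep(1:K, length.out = length(Y)))
  theta_k <- sapply(1:K, function(k) {
    train <- fold != k
    test  <- fold == k
    e_hat  <- predict(glm(T ~ X, family = binomial, subset = train),
                      newdata = data.frame(X = X[test]), type = "response")
    mu_hat <- predict(lm(Y ~ X, subset = train & (T == t)),
                      newdata = data.frame(X = X[test]))
    kappa <- t * T[test] / e_hat + (1 - t) * (1 - T[test]) / (1 - e_hat)
    mean(mu_hat + kappa * (Y[test] - mu_hat))
  })
  weighted.mean(theta_k, w = tabulate(fold, nbins = K))   # weights n_k / n
}

With the simulated data from the sketch in Remark 3.1, crossfit_mean(Y, T, X, 1) - crossfit_mean(Y, T, X, 0) again estimates the average treatment effect.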

In the next section, we provide the asymptotic properties of these doubly robust and cross-fitting sample based estimators.

Remark 3.3 (Outcome regression and inverse probability weighting estimators).

If the models for the outcome regression functions are correctly specified, i.e., $\mu_{t}=m_{t}$, we obtain $\operatorname{E}_{\oplus}[Y(t)]=\operatorname{E}_{\oplus}[\mu_{t}(X)]$, and the GATE can be estimated by taking the sample counterpart of $\gamma_{\operatorname{E}_{\oplus}[\mu_{0}(X)],\operatorname{E}_{\oplus}[\mu_{1}(X)]}$, which we refer to as the outcome regression estimator. Specifically, the outcome regression estimator for the GATE is given by $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{OR})},\widehat{\Theta}_{1}^{(\mathrm{OR})}}$, where
\[
\widehat{\Theta}_{t}^{(\mathrm{OR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}(\nu,\widehat{\mu}_{t}(X_{i})),\quad t\in\{0,1\}.
\]

On the other hand, if the model for the propensity score is correctly specified, i.e., $e(x)=p(x)$, one finds
\[
\operatorname{E}_{G,X}\left[\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\odot\gamma_{\mu,Y}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(t)]}
\]
for any $\mu\in\mathcal{M}$. Thus the GATE can be obtained by the inverse probability weighting method, which corresponds to the sample counterpart of
\[
\ominus\operatorname{E}_{G,X}\left[\left({1-T\over 1-e(X)}\right)\odot\gamma_{\mu,Y}\right]\oplus\operatorname{E}_{G,X}\left[\left({T\over e(X)}\right)\odot\gamma_{\mu,Y}\right].\tag{3.8}
\]

Specifically, since (3.8) can be rewritten as $\operatorname{E}_{G,X}[\gamma_{B_{0},B_{1}}]$ with
\[
B_{t}=\gamma_{\mu,Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right),\quad t\in\{0,1\},
\]
the inverse probability weighting estimator for the GATE is given by $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{IPW})},\widehat{\Theta}_{1}^{(\mathrm{IPW})}}$, where
\[
\widehat{\Theta}_{t}^{(\mathrm{IPW})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\widehat{\mu},Y_{i}}\left({tT_{i}\over\widehat{e}(X_{i})}+{(1-t)(1-T_{i})\over 1-\widehat{e}(X_{i})}\right)\right),\quad t\in\{0,1\},
\]
and $\widehat{\mu}$ is an estimator of $\mu\in\mathcal{M}$. While the inverse probability weighting estimator requires the choice of a starting point $\mu$ for the geodesic $\gamma_{\mu,Y}$ in (3.8), this starting point can be chosen arbitrarily; a simple choice that we adopt in our implementations is $\widehat{\mu}$, the sample Fréchet mean of the $\{Y_{i}\}$. The asymptotic properties of $\widehat{\Theta}_{t}^{(\mathrm{OR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{IPW})}$ are provided in Appendix A, and the finite sample performance of the outcome regression and inverse probability weighting estimators is examined through simulations in Section 5.

4. Main results

4.1. Consistency and convergence rates

To examine the asymptotic properties of the proposed doubly robust and cross-fitting estimators, we impose the following assumptions on the models for the propensity score and outcome regression functions. Let $\mathcal{M}_{e}=\{e(x;\varphi):x\in\mathcal{X},\varphi\in\Phi\}$ be a class of parametric models for the propensity score $p(x)$, where $\Phi\subset\mathbb{R}^{p}$ is a compact set. For $t\in\{0,1\}$, let $\widehat{\mu}_{t}(\cdot)$ be an estimator of the outcome regression function $m_{t}(\cdot)$.

Assumption 4.1.
  • (i) For $\varphi_{1},\varphi_{2}\in\Phi$, assume that $|e(x;\varphi_{1})-e(x;\varphi_{2})|\leq C_{e}\|\varphi_{1}-\varphi_{2}\|$ for some positive constant $C_{e}$, and for all $x\in\mathcal{X}$ and $\varphi\in\Phi$, $\eta_{0}\leq e(x;\varphi)\leq 1-\eta_{0}$, where $\eta_{0}$ is the same constant as in Assumption 3.1 (ii).
  • (ii) There exist $\varphi_{*}\in\Phi$ and an estimator $\widehat{\varphi}$ such that $\|\widehat{\varphi}-\varphi_{*}\|=O_{p}(\varrho_{n})$ with $\varrho_{n}\to 0$ as $n\to\infty$.
  • (iii) There exist functions $\mu_{t}(\cdot)$, $t\in\{0,1\}$, such that $\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),\mu_{t}(x))=O_{p}(r_{n})$, $t\in\{0,1\}$, with $r_{n}\to 0$ as $n\to\infty$.

Assumptions 4.1 (i) and (ii) imply the uniform convergence rate $\sup_{x\in\mathcal{X}}|e(x;\widehat{\varphi})-e(x;\varphi_{*})|=O_{p}(\varrho_{n})$ for the propensity score estimator, regardless of whether the propensity score model is misspecified or not. These assumptions are typically satisfied with $\varrho_{n}=n^{-1/2}$ under mild regularity conditions. Assumption 4.1 (iii) concerns the outcome regression. If one employs the global Fréchet regression model (3.6) for estimating $m_{t}$, then under some regularity conditions this assumption is satisfied with $r_{n}=n^{-\alpha'}$ for some $\alpha'>1/2$ (see Theorem 1 in Petersen and Müller (2019)) and
\[
\mu_{t}(x)=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}\operatorname{E}[\{1+(X-\operatorname{E}[X])'\Sigma^{-1}(x-\operatorname{E}[X])\}\,d^{2}(\nu,Y)\,|\,T=t],
\]
where $\Sigma=\operatorname{Var}(X)$, again regardless of whether the model is misspecified or not for the conditional Fréchet mean, which is the true target. Alternatively, one can employ local Fréchet regression as the outcome regression method, which is more flexible but comes with slower rates of convergence and is subject to the curse of dimensionality (see Theorem 1 in Chen and Müller (2022)).

To establish the consistency of $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ in (3.5) and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in (3.7), we additionally require that the geodesics in $\mathcal{M}$ satisfy the following assumption.

Assumption 4.2.

For all $\alpha_{1},\alpha_{2}\in(\mathcal{M},d)$,
\[
\sup_{\beta\in\mathcal{M},\,\kappa\in[1/(1-\eta_{0}),1/\eta_{0}]}d\left(\gamma_{\alpha_{1},\beta}(\kappa),\gamma_{\alpha_{2},\beta}(\kappa)\right)\leq C_{0}\,d(\alpha_{1},\alpha_{2}),\tag{4.1}
\]
for some positive constant $C_{0}$ depending only on $\eta_{0}$.

This is a Lipschitz-type condition to control the criterion function in (3.5), which involves the geodesics $\gamma_{\mu(X_{i}),Y_{i}}$. When the space $\Omega$ is a CAT(0) space (i.e., a geodesic metric space that is globally non-positively curved in the sense of Alexandrov), this assumption is automatically satisfied. Many metric spaces of statistical interest are CAT(0) spaces, such as the Wasserstein space of univariate distributions, the space of networks expressed as graph Laplacians with the (power) Frobenius metric, covariance/correlation matrices with the affine-invariant metric (Thanwerdas and Pennec, 2023), the Cholesky metric (Lin, 2019), and various other metrics (Pigoli et al., 2014), as well as the phylogenetic tree space with the BHV metric (Billera et al., 2001; Lin and Müller, 2021). Assumption 4.2 also holds for some relevant positively curved spaces such as open hemispheres or orthants of spheres, which are relevant for compositional data with the Riemannian metric; these spaces are not CAT(0), see Example 2.3.

For $t\in\{0,1\}$, denote by $\Theta_{t}^{(\mathrm{DR})}:=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}Q_{t}(\nu;\mu_{t},\varphi_{*})$ the minimizer of the population criterion function, where
\[
Q_{t}(\nu;\mu,\varphi)=\operatorname{E}\left[d^{2}\left(\nu,\gamma_{\mu(X),Y}\left({tT\over e(X;\varphi)}+{(1-t)(1-T)\over 1-e(X;\varphi)}\right)\right)\right].
\]

To guarantee the identification of the object of interest $\operatorname{E}_{\oplus}[Y(t)]$ by the doubly robust estimation approach, we add the following conditions.

Assumption 4.3.

Assume that for $t\in\{0,1\}$,
  • (i) the objects $\Theta_{t}^{(\mathrm{DR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ exist and are unique, and for any $\varepsilon>0$,
\[
\inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})>\varepsilon}Q_{t}(\nu;\mu_{t},\varphi_{*})>Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu_{t},\varphi_{*});
\]
  • (ii) $\Theta_{t}^{(\mathrm{DR})}=\operatorname{E}_{\oplus}[Y(t)]$.

Assumption 4.3 (i) is a standard separation condition for achieving the consistency of the M-estimator (see, e.g., Chapter 3.2 in van der Vaart and Wellner (2023)). Note that the existence of the minimizers follows immediately if $\mathcal{M}$ is compact. Assumption 4.3 (ii) requires that at least one of the models for the propensity score and the outcome regression is correctly specified.

The following result establishes the consistency of the doubly robust and cross-fitting estimators $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ in (3.5) and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in (3.7).

Theorem 4.1.

Suppose that Assumptions 3.1, 3.2, 4.2, and 4.3 hold.
  • (i) Under Assumption 4.1, as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\quad t\in\{0,1\}.
\]
  • (ii) Suppose that Assumption 4.1 holds for $\{\widehat{\varphi}_{k},\widehat{\mu}_{t,k}\}_{k=1}^{K}$ and there exist constants $c_{1}$ and $c_{2}$ such that $0<c_{1}\leq n_{k}/n\leq c_{2}<1$ for all $n$ and $k=1,\dots,K$. Then as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\quad t\in\{0,1\}.
\]

To obtain rates of convergence for the estimators $\widehat{\Theta}_{t}^{(\mathrm{DR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{CF})}$ in the metric space $(\Omega,d)$, we need additional entropy conditions to quantify the complexity of the space. Let $B_{\delta}(\omega)$ be the ball of radius $\delta$ centered at $\omega\in\Omega$ and $N(\varepsilon,B_{\delta}(\omega),d)$ its covering number using balls of size $\varepsilon$.

Assumption 4.4.

For $t\in\{0,1\}$,
  • (i) As $\delta\to 0$,
\begin{align}
J_{t}(\delta)&:=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{DR})}),d)}\,d\varepsilon=O(1),\tag{4.2}\\
J_{\mu_{t}}(\delta)&:=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta'_{1}}(\mu_{t}),d_{\infty})}\,d\varepsilon=O(\delta^{-\varpi}),\tag{4.3}
\end{align}
for some $\delta'_{1}>0$ and $\varpi\in(0,1)$, where for $\nu,\mu:\mathcal{X}\to\mathcal{M}$, $d_{\infty}(\nu,\mu):=\sup_{x\in\mathcal{X}}d(\nu(x),\mu(x))$.
  • (ii) There exist positive constants $\eta,\eta_{1},C,C_{1}$, and $\beta>1$ such that
\[
\inf_{\begin{subarray}{c}d_{\infty}(\mu,\mu_{t})\leq\eta_{1}\\ \|\varphi-\varphi_{*}\|\leq\eta_{1}\end{subarray}}\ \inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})<\eta}\left\{Q_{t}(\nu;\mu,\varphi)-Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu,\varphi)-C\,d(\nu,\Theta_{t}^{(\mathrm{DR})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}\,d(\nu,\Theta_{t}^{(\mathrm{DR})})^{{\beta\over 2}}\right\}\geq 0.\tag{4.4}
\]

These assumptions are used to control the behavior of the (centered) criterion function $Q_{n,t}-Q_{t}$ around the minimum. Assumption 4.4 (i) postulates an entropy condition for $\Theta_{t}^{(\mathrm{DR})}$ and $\mu_{t}$. A sufficient condition for (4.3) is $J'_{\mu_{t}}(\delta):=\int_{0}^{1}\sup_{x\in\mathcal{X}}\sqrt{1+\log N(\bar{c}\delta\varepsilon,B_{\delta'_{1}}(\mu_{t}(x)),d)}\,d\varepsilon=O(\delta^{-\varpi'})$ for some positive constant $\bar{c}$ and $\varpi'\in(0,1)$. One can verify (4.2) and
\[
J'_{\mu_{t}}(\delta)=O(-\log\delta),\quad\text{as }\delta\to 0,\tag{4.5}
\]
for common metric spaces including Examples 2.1–2.4, for which Assumption 4.4 (i) is therefore satisfied. If (4.5) holds, the $\varpi\in(0,1)$ in Theorem 4.2 below can be taken arbitrarily small.

Assumption 4.4 (ii) constrains the shape of the population criterion function $Q_{t}$. This assumption is also satisfied for common metric spaces; indeed, Examples 2.1–2.4 satisfy this condition with $\beta=2$ (see Propositions B.1–B.4 below). This assumption is an adaptation of Assumption B3 in Delsol and Van Keilegom (2020), who developed an asymptotic theory for M-estimators with plugged-in estimated nuisance parameters. In the Euclidean case, this assumption is typically satisfied with $\beta=2$, and the third and fourth terms in (4.4) are then the quadratic and linear terms approximating the population criterion function $Q_{t}$ in $\nu$. If $Q_{t}$ does not involve nuisance parameters, the fourth term in (4.4) vanishes, in analogy to Assumption (U2) in Petersen and Müller (2019).

Under these assumptions, the convergence rates of the doubly robust and cross-fitting estimators are obtained as follows.

Theorem 4.2.

Suppose that Assumptions 3.1, 3.2, 4.2, and 4.3 hold.
  • (i) If Assumptions 4.1 and 4.4 hold, then for each $\beta'\in(0,1)$, as $n\to\infty$, with $\varrho_{n}$ and $r_{n}$ as in Assumption 4.1 and $\beta,\varpi$ as in Assumption 4.4,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta'\over\beta-1}}\right),\quad t\in\{0,1\}.\tag{4.6}
\]
  • (ii) If Assumptions 4.1 and 4.4 hold for $\{\widehat{\varphi}_{k},\widehat{\mu}_{t,k}\}_{k=1}^{K}$ and there exist constants $c_{1}$ and $c_{2}$ such that $0<c_{1}\leq n_{k}/n\leq c_{2}<1$ for all $n$ and $k=1,\dots,K$, then for each $\beta'\in(0,1)$, as $n\to\infty$,
\[
d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta'\over\beta-1}}\right),\quad t\in\{0,1\}.\tag{4.7}
\]

Note that both estimators achieve the same convergence rate. The first term corresponds to the rate for an infeasible estimator for which the nuisance parameters $\varphi$ and $\mu_{t}$ are known to lie in neighborhoods of their (pseudo-)true values; if the pseudo-true values are completely known, this term reduces to the rate with $\varpi=0$ for the conventional Fréchet mean. The second term in the convergence rate is due to the estimation of the nuisance parameters. In general, neither term dominates. In a typical scenario we have $\beta=2$, $\varrho_{n}=n^{-1/2}$, and $r_{n}=n^{-\alpha_{1}}$ with some $\alpha_{1}>1/2$, so that the rate becomes $O_{p}(n^{-{1\over 2(1+\varpi)}}+n^{-\alpha_{1}\beta'})$ for any $\varpi,\beta'\in(0,1)$, which can be arbitrarily close to the parametric rate.

Remark 4.1 (Convergence rates of outcome regression and inverse probability weighting estimators).

In Appendix A, we provide convergence rates for the estimators $\widehat{\Theta}_{t}^{(\mathrm{OR})}$ and $\widehat{\Theta}_{t}^{(\mathrm{IPW})}$, which rely on the correct specification of the outcome regression function and the propensity score, respectively. Under some regularity conditions one can show that for each $\beta'\in(0,1)$, as $n\to\infty$,
\begin{align*}
d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])&=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{1})}}+r_{n,1}^{{\beta'\over\beta-1}}\right),\\
d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])&=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{2})}}+(v_{n}+\varrho_{n,1})^{{\beta'\over\beta-1}}\right),
\end{align*}
where $\varpi_{1},\varpi_{2}\in(0,1)$ are constants arising from entropy conditions for the models of the outcome regression function and the propensity score, corresponding to $\varpi$ in (4.3). Here $r_{n,1}$, $v_{n}$, and $\varrho_{n,1}$ are null sequences corresponding to the convergence rates of $\widehat{\mu}_{t}(\cdot)$, $\widehat{\mu}$, and $\widehat{e}(\cdot)$, respectively, i.e.,
\[
\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),m_{t}(x))=O_{p}(r_{n,1}),\quad d(\widehat{\mu},\mu)=O_{p}(v_{n}),\quad \sup_{x\in\mathcal{X}}|\widehat{e}(x)-p(x)|=O_{p}(\varrho_{n,1}).
\]

4.2. Uncertainty quantification

To obtain uncertainty quantification for the GATE estimators $\widehat{\Theta}_{t}^{(\mathrm{M})}$, $\mathrm{M}\in\{\mathrm{DR},\mathrm{CF}\}$, we adopt the (adaptive) HulC of Kuchibhotla et al. (2024) to construct confidence regions for the contrast $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$, implemented as follows.

Let $\{S_{b}\}_{b=1}^{B}$ be a (random) partition of $\{1,\dots,n\}$ into $B$ subsets, and let $\{d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\}_{b=1}^{B}$ be estimators of $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ computed from the subsamples $\{Y_{i},T_{i},X_{i}:i\in S_{b}\}_{b=1}^{B}$. Define the maximum median bias of the estimators $d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})$ for $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ as
\[
\Delta:=\max_{1\leq b\leq B}\left\{0,\ {1\over 2}-\min\left\{\operatorname{P}\big(\widetilde{U}_{b}\geq 0\big),\operatorname{P}\big(\widetilde{U}_{b}\leq 0\big)\right\}\right\},
\]
where $\widetilde{U}_{b}=d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})-d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$.

Adopting the HulC algorithm, we can construct a confidence interval with coverage probability $1-\alpha$ for $d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})$ as follows:

  • (Step 1) Find the smallest integer $B=B_{\alpha,\Delta}\geq 1$ such that
\[
\operatorname{P}(B;\Delta):=\left({1\over 2}-\Delta\right)^{B}+\left({1\over 2}+\Delta\right)^{B}\leq\alpha.
\]
  • (Step 2) Generate a random variable $U$ from the uniform distribution on $[0,1]$ and set
\[
B^{*}:=\begin{cases}B_{\alpha,\Delta}&\text{if }U\leq\tau_{\alpha,\Delta},\\ B_{\alpha,\Delta}-1&\text{if }U>\tau_{\alpha,\Delta},\end{cases}\qquad\text{where}\qquad \tau_{\alpha,\Delta}:={\alpha-\operatorname{P}(B_{\alpha,\Delta};\Delta)\over\operatorname{P}(B_{\alpha,\Delta}-1;\Delta)-\operatorname{P}(B_{\alpha,\Delta};\Delta)}.
\]
  • (Step 3) Randomly split $\{Y_{i},T_{i},X_{i}\}_{i=1}^{n}$ into $B^{*}$ disjoint sets $\{Y_{i},T_{i},X_{i}:i\in S_{b}\}_{b=1}^{B^{*}}$ and compute $\{d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\}_{b=1}^{B^{*}}$.
  • (Step 4) Compute the confidence interval
\[
\widehat{C}_{\alpha,\Delta}:=\left[\min_{1\leq b\leq B^{*}}d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})}),\ \max_{1\leq b\leq B^{*}}d(\Theta_{b,0}^{(\mathrm{M})},\Theta_{b,1}^{(\mathrm{M})})\right].
\]

From Theorem 1 in Kuchibhotla et al. (2024), one obtains a finite sample guarantee for the coverage of this confidence interval, i.e.,
\[
\operatorname{P}\big(d(\Theta_{0}^{(\mathrm{M})},\Theta_{1}^{(\mathrm{M})})\in\widehat{C}_{\alpha,\Delta}\big)\geq 1-\alpha.
\]
For the implementation of $\widehat{C}_{\alpha,\Delta}$, one can estimate $\Delta$ by a subsampling method (see Section 4 of Kuchibhotla et al. (2024)).
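A minimal R sketch (ours) of Steps 1–4 for a generic scalar estimator est applied to subsamples of a data frame, with $\Delta$ supplied by the user (e.g., set to 0 or to a subsampling estimate):

hulc_ci <- function(est, data, alpha = 0.05, Delta = 0) {
  P <- function(B) (1/2 - Delta)^B + (1/2 + Delta)^B
  B1 <- 1
  while (P(B1) > alpha) B1 <- B1 + 1                       # Step 1
  tau <- (alpha - P(B1)) / (P(B1 - 1) - P(B1))             # Step 2
  B_star <- if (runif(1) <= tau) B1 else B1 - 1
  fold <- sample(rep(1:B_star, length.out = nrow(data)))   # Step 3
  ests <- sapply(1:B_star,
                 function(b) est(data[fold == b, , drop = FALSE]))
  range(ests)                                              # Step 4
}

set.seed(1)
dat <- data.frame(y = rnorm(300, mean = 1))
hulc_ci(function(d) mean(d$y), dat)   # 95% HulC interval for E[y]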

5. Simulation studies

5.1. Implementation details and simulation scenarios

The algorithm for the proposed approach is outlined in Algorithm 1. The global Fréchet regression in the second step is implemented using the R package frechet (Chen et al., 2023). The minimization problems one needs to solve require specific considerations for the various geodesic spaces. For the spaces in Examples 2.1, 2.2, and 2.4, the minimization can be reduced to convex quadratic optimization (Stellato et al., 2020). For Example 2.3, the necessary optimization can be performed using a trust region algorithm (Geyer, 2020). The third step involves the calculation of Fréchet means, which reduces to the entry-wise mean of matrices for Examples 2.1 and 2.2 and to the mean of quantile functions for Example 2.4, due to the convexity of these spaces. The Fréchet mean for Example 2.3 can be computed using the R package manifold (Dai, 2022).

Input: data $\{(Y_{i},T_{i},X_{i})\}_{i=1}^{n}$.
Output: doubly robust estimator of the geodesic average treatment effect (GATE), $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}$.
1. $e(x;\widehat{\varphi})\longleftarrow$ the estimated propensity score, using logistic regression with data $\{T_{i},X_{i}\}_{i=1}^{n}$;
2. $\widehat{\mu}_{t}(x)\longleftarrow$ the estimated outcome regression function, using global Fréchet regression:
\[
\widehat{\mu}_{t}(x)=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over N_{t}}\sum_{i\in I_{t}}\{1+(X_{i}-\bar{X})'\widehat{\Sigma}^{-1}(x-\bar{X})\}\,d^{2}(\nu,Y_{i}),\quad t\in\{0,1\},
\]
where $I_{t}=\{1\leq i\leq n:T_{i}=t\}$, $N_{t}$ is the cardinality of $I_{t}$, $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$, and $\widehat{\Sigma}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})'$;
3. $\widehat{\Theta}_{t}^{(\mathrm{DR})}\longleftarrow$ the doubly robust estimate of $\operatorname{E}_{\oplus}[Y(t)]$:
\[
\widehat{\Theta}_{t}^{(\mathrm{DR})}=\mathop{\rm arg\,min}_{\nu\in\mathcal{M}}{1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\widehat{\mu}_{t}(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\widehat{\varphi})}+{(1-t)(1-T_{i})\over 1-e(X_{i};\widehat{\varphi})}\right)\right),
\]
where $e(x;\widehat{\varphi})$ and $\widehat{\mu}_{t}(x)$ are the estimated propensity score and outcome regression function;
4. $\gamma_{\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}}\longleftarrow$ the doubly robust estimator of the GATE.

Algorithm 1: Geodesic Causal Inference
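As a concrete instance, the following minimal R sketch (all function names are ours) implements Algorithm 1 for the covariance matrices of Example 2.2 with the Frobenius metric, where geodesics are line segments, $\gamma_{A,B}(\kappa)=A+\kappa(B-A)$, and weighted Fréchet means reduce to weighted Euclidean means followed by projection onto the positive semi-definite cone; the projection step and the assumption that the Fréchet weights have a positive sum are simplifications on our part:

psd_project <- function(S) {          # nearest PSD matrix in Frobenius norm
  e <- eigen((S + t(S)) / 2, symmetric = TRUE)
  e$vectors %*% diag(pmax(e$values, 0)) %*% t(e$vectors)
}

frechet_reg <- function(Ylist, X, x, Xbar, Sxx) {   # Step 2, display (3.6)
  w <- 1 + (X - Xbar) * (x - Xbar) / Sxx            # global Frechet weights
  psd_project(Reduce(`+`, Map(`*`, w, Ylist)) / sum(w))
}

gate_dr_cov <- function(Ylist, T, X) {
  e_hat <- fitted(glm(T ~ X, family = binomial))    # Step 1
  Xbar <- mean(X); Sxx <- mean((X - Xbar)^2)
  lapply(0:1, function(t) {                         # Theta_t^(DR), t = 0, 1
    It <- which(T == t)
    ends <- lapply(seq_along(Ylist), function(i) {
      mu_i  <- frechet_reg(Ylist[It], X[It], X[i], Xbar, Sxx)
      kappa <- if (t == 1) T[i] / e_hat[i] else (1 - T[i]) / (1 - e_hat[i])
      mu_i + kappa * (Ylist[[i]] - mu_i)    # gamma_{mu_t(X_i), Y_i}(kappa)
    })
    psd_project(Reduce(`+`, ends) / length(ends))   # Step 3: Frechet mean
  })    # the GATE estimate is the segment between the two returned matrices
}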

To assess the performance of the proposed doubly robust estimators, we report here the results of simulations for various settings, specifically for the space of covariance matrices equipped with the Frobenius metric and the space of three-dimensional compositional data $\mathcal{S}_{+}^{2}$ equipped with the geodesic metric on the sphere $\mathcal{S}^{2}$; see Examples 2.2 and 2.3.

We consider sample sizes n=100,300,1000, with Q=500 Monte Carlo runs. For the qth Monte Carlo run, with \gamma_{\widehat{\Theta}_{0}^{q},\widehat{\Theta}_{1}^{q}} denoting the GATE estimator, the average quality of the estimation over the Q=500 Monte Carlo runs is assessed by the average squared error (ASE)

\mathrm{ASE}=\frac{1}{Q}\sum_{q=1}^{Q}\{d^{2}(\widehat{\Theta}_{0}^{q},\operatorname{E}_{\oplus}[Y(0)])+d^{2}(\widehat{\Theta}_{1}^{q},\operatorname{E}_{\oplus}[Y(1)])\},

where d is the Frobenius metric d_{F} for covariance matrices and the geodesic metric d_{g} for compositional data.

In all simulations, the confounder X follows a uniform distribution on [-1,1]. The treatment T has a conditional Bernoulli distribution depending on X with \operatorname{P}(T=1|X)=\mathrm{expit}(0.75X), where \mathrm{expit}(\cdot)=\exp(\cdot)/(1+\exp(\cdot)). We consider two specifications for the outcome regression: a global Fréchet regression model with predictor X in m_{t}(X) (correct specification) and one with the predictor X^{2} (incorrect specification), and two specifications for the propensity score model: a logistic regression model with predictor X (correct specification) and a model with predictor X^{2} (incorrect specification). To demonstrate the double robustness of the estimators \widehat{\Theta}_{t}^{(\mathrm{DR})} and \widehat{\Theta}_{t}^{(\mathrm{CF})}, we compare them with outcome regression (OR) and inverse probability weighting (IPW) estimators.

5.2. Covariance matrices

Random covariance matrices Y are generated as follows. First, the strictly lower-triangular entries of Y are independently generated as Y_{jk}=T+X+2+\epsilon for 1\leq k<j\leq m, where \epsilon is independently drawn from a uniform distribution on [-0.1,0.1]. By symmetry, the upper-triangular entries of Y are then set to Y_{kj}=Y_{jk} for 1\leq k<j\leq m. To ensure positive semi-definiteness, the diagonal entries Y_{jj} are set equal to the sum of all off-diagonal entries in the same row, \sum_{k\neq j}Y_{jk}. This construction yields a diagonally dominant matrix, which guarantees positive semi-definiteness.
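A minimal data-generating sketch for this design, also encoding the confounder and treatment mechanism described above (the seed is arbitrary), might read as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_covariance_data(n, m=10):
    X = rng.uniform(-1, 1, n)                          # confounder
    T = rng.binomial(1, 1 / (1 + np.exp(-0.75 * X)))   # P(T=1|X) = expit(0.75X)
    Y = np.empty((n, m, m))
    for i in range(n):
        E = rng.uniform(-0.1, 0.1, (m, m))
        A = np.tril(T[i] + X[i] + 2 + E, k=-1)         # strictly lower triangle
        A = A + A.T                                    # symmetric, zero diagonal
        np.fill_diagonal(A, A.sum(axis=1))             # diagonal dominance -> PSD
        Y[i] = A
    return X.reshape(-1, 1), T, Y

X, T, Y = simulate_covariance_data(300)
```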

We aim to estimate the GATE, whose true value is \gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}, where (\operatorname{E}_{\oplus}[Y(0)])_{jk}=2, (\operatorname{E}_{\oplus}[Y(0)])_{jj}=18, (\operatorname{E}_{\oplus}[Y(1)])_{jk}=3, and (\operatorname{E}_{\oplus}[Y(1)])_{jj}=27 for 1\leq j\neq k\leq m. Table 1 presents the simulation results for 10\times 10 matrices, i.e., for m=10. The ASE of the doubly robust estimator decreases as the sample size increases when either the OR or the IPW model is correctly specified, confirming the double robustness property. In contrast, neither the OR nor the IPW estimator demonstrates double robustness; their ASEs can be large even with a sample size of 1000 when the corresponding model is misspecified.

Table 1. Simulation results for covariance matrices. Average squared errors and standard deviations (in parentheses) for GATE using four different estimation procedures. The model specifications are in the first two columns. DR: doubly robust; CF: cross-fitting; OR: outcome regression; IPW: inverse probability weighting.

OR model  | IPW model | Sample size | DR             | CF            | OR            | IPW
Correct   | Correct   | 100         | 5.837 (8.694)  | 5.839 (8.696) | 5.837 (8.690) | 6.289 (9.105)
Correct   | Correct   | 300         | 2.058 (2.860)  | 2.057 (2.859) | 2.058 (2.860) | 2.128 (2.946)
Correct   | Correct   | 1000        | 0.610 (0.808)  | 0.611 (0.809) | 0.610 (0.808) | 0.639 (0.838)
Correct   | Incorrect | 100         | 5.837 (8.690)  | 5.835 (8.687) | 5.837 (8.690) | 31.91 (23.45)
Correct   | Incorrect | 300         | 2.058 (2.860)  | 2.057 (2.860) | 2.058 (2.860) | 26.26 (13.37)
Correct   | Incorrect | 1000        | 0.610 (0.808)  | 0.611 (0.808) | 0.610 (0.808) | 24.02 (7.127)
Incorrect | Correct   | 100         | 9.071 (12.317) | 11.09 (16.19) | 37.40 (26.18) | 6.289 (9.105)
Incorrect | Correct   | 300         | 3.005 (3.848)  | 3.098 (4.061) | 30.82 (14.19) | 2.128 (2.946)
Incorrect | Correct   | 1000        | 0.986 (1.180)  | 0.993 (1.201) | 28.23 (7.680) | 0.639 (0.838)

5.3. Compositional data

Writing \phi=\pi(X+2)/8\in[\pi/8,3\pi/8], we model the true regression functions m_{0}(\cdot) and m_{1}(\cdot) as

m_{0}(X)=\left(\cos(\phi),\frac{1}{2}\sin(\phi),\frac{\sqrt{3}}{2}\sin(\phi)\right),\quad m_{1}(X)=\left(\cos(\phi),\frac{\sqrt{3}}{2}\sin(\phi),\frac{1}{2}\sin(\phi)\right),

which are illustrated in Figure 3 as red and blue lines, respectively. The random response Y on \mathcal{S}_{+}^{2} for T=0 is then generated by adding a small perturbation to the true regression function. To this end, we first construct an orthonormal basis (e_{1},e_{2}) for the tangent space at m_{0}(X), where

e_{1}=\left(\sin(\phi),-\frac{1}{2}\cos(\phi),-\frac{\sqrt{3}}{2}\cos(\phi)\right),\quad e_{2}=\left(0,\frac{\sqrt{3}}{2},-\frac{1}{2}\right).

Consider random tangent vectors U=Z_{1}e_{1}+Z_{2}e_{2}, where Z_{1},Z_{2} are two independent uniformly distributed random variables on [-0.1,0.1]. The random response Y is obtained as the exponential map at m_{0}(X) applied to the tangent vector U,

Y=\mathrm{Exp}_{m_{0}(X)}(U)=\cos(\|U\|)m_{0}(X)+\sin(\|U\|)\frac{U}{\|U\|}.

A similar generation procedure is used for T=1, where the orthonormal basis (e_{1},e_{2}) for the tangent space at m_{1}(X) is

e_{1}=\left(\sin(\phi),-\frac{\sqrt{3}}{2}\cos(\phi),-\frac{1}{2}\cos(\phi)\right),\quad e_{2}=\left(0,\frac{1}{2},-\frac{\sqrt{3}}{2}\right).

Figure 3 illustrates randomly generated responses using the above generation procedure for a sample size n=200, which are seen to be distributed around the true regression functions.
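A sketch of this generation mechanism for a single response, using the bases above (seed arbitrary), could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sphere_response(x, t):
    phi = np.pi * (x + 2) / 8
    s3 = np.sqrt(3) / 2
    if t == 0:
        m  = np.array([np.cos(phi), 0.5 * np.sin(phi), s3 * np.sin(phi)])
        e1 = np.array([np.sin(phi), -0.5 * np.cos(phi), -s3 * np.cos(phi)])
        e2 = np.array([0.0, s3, -0.5])
    else:
        m  = np.array([np.cos(phi), s3 * np.sin(phi), 0.5 * np.sin(phi)])
        e1 = np.array([np.sin(phi), -s3 * np.cos(phi), -0.5 * np.cos(phi)])
        e2 = np.array([0.0, 0.5, -s3])
    z1, z2 = rng.uniform(-0.1, 0.1, 2)
    U = z1 * e1 + z2 * e2                      # random tangent vector at m
    r = np.linalg.norm(U)
    if r == 0.0:                               # degenerate perturbation
        return m
    return np.cos(r) * m + np.sin(r) * U / r   # exponential map Exp_m(U)
```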

We are interested in estimating the GATE, with true value \gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}, where

\operatorname{E}_{\oplus}[Y(0)]=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{4},\frac{\sqrt{6}}{4}\right),\quad\operatorname{E}_{\oplus}[Y(1)]=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{6}}{4},\frac{\sqrt{2}}{4}\right).

Table 2 summarizes the simulation results. The ASE of the doubly robust estimator decreases as the sample size increases when either the OR or the IPW model is correct, again confirming the double robustness property. In comparison, neither the OR nor the IPW estimator exhibits the double robustness property: when the corresponding model is misspecified, their ASEs do not converge as the sample size increases.

Figure 3. Simulation example for compositional data represented on the sphere with the geodesic metric for a sample of size n=200. Red: T=0; blue: T=1. The true regression functions m_{0}(\cdot) and m_{1}(\cdot) are shown as red and blue lines.
Table 2. Simulation results for compositional data. Average squared errors and standard deviations (in parentheses) for GATE using four different estimation procedures. The model specifications are in the first two columns. DR: doubly robust; CF: cross-fitting; OR: outcome regression; IPW: inverse probability weighting.

OR model  | IPW model | Sample size | DR            | CF            | OR            | IPW
Correct   | Correct   | 100         | 0.044 (0.026) | 0.044 (0.026) | 0.044 (0.026) | 0.067 (0.030)
Correct   | Correct   | 300         | 0.026 (0.015) | 0.026 (0.015) | 0.026 (0.015) | 0.047 (0.016)
Correct   | Correct   | 1000        | 0.014 (0.008) | 0.014 (0.008) | 0.014 (0.008) | 0.037 (0.009)
Correct   | Incorrect | 100         | 0.044 (0.026) | 0.044 (0.026) | 0.044 (0.026) | 0.105 (0.039)
Correct   | Incorrect | 300         | 0.026 (0.015) | 0.026 (0.015) | 0.026 (0.015) | 0.093 (0.024)
Correct   | Incorrect | 1000        | 0.014 (0.008) | 0.014 (0.008) | 0.014 (0.008) | 0.094 (0.014)
Incorrect | Correct   | 100         | 0.061 (0.034) | 0.067 (0.037) | 0.105 (0.042) | 0.067 (0.030)
Incorrect | Correct   | 300         | 0.039 (0.018) | 0.040 (0.019) | 0.096 (0.026) | 0.047 (0.016)
Incorrect | Correct   | 1000        | 0.031 (0.011) | 0.031 (0.011) | 0.098 (0.015) | 0.037 (0.009)

6. Real world applications

6.1. U.S. electricity generation data

Compositional data are ubiquitous and do not lie in a vector space, as they correspond to vectors with non-negative elements that sum to 1; see Example 2.3. Various approaches have been developed to address the inherent nonlinearity of compositional data (Aitchison, 1986; Scealy and Welsh, 2014; Filzmoser et al., 2018), which are common in the analysis of geochemical and microbiome data.

Here we focus on U.S. electricity generation data, publicly available on the U.S. Energy Information Administration website (http://www.eia.gov/electricity). The data reflect net electricity generation from various sources for each state in the year 2020. In preprocessing, we excluded the “pumped storage” and “other” categories due to data errors and consolidated the remaining energy sources into three categories: Natural Gas (corresponding to the category of “natural gas”), Other Fossil (combining “coal,” “petroleum,” and “other gases”), and Renewables and Nuclear (combining “hydroelectric conventional,” “solar thermal and photovoltaic,” “geothermal,” “wind,” “wood and wood-derived fuels,” “other biomass,” and “nuclear”). This yielded a sample of n=50 compositional observations taking values in the 2-simplex \Delta^{2}, as illustrated with a ternary plot in Figure 4. Following the approach described in Example 2.3, we applied the component-wise square root transformation, resulting in compositional outcomes as elements of the sphere \mathcal{S}^{2}, equipped with the geodesic metric.
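The square root transformation is elementary: since the components of a composition are non-negative and sum to 1, their square roots form a unit vector, and the geodesic metric on the sphere is the arc length, the arccosine of the inner product. A minimal sketch with toy compositions:

```python
import numpy as np

def to_sphere(p):
    # composition: non-negative entries summing to 1 -> point on the unit sphere
    return np.sqrt(np.asarray(p, dtype=float))

def d_g(u, v):
    # geodesic (great-circle) distance on the unit sphere
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

u = to_sphere([0.5, 0.3, 0.2])   # toy state compositions, not real data
v = to_sphere([0.6, 0.2, 0.2])
print(d_g(u, v))
```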

In our analysis, the exposure (treatment) of interest is whether the state produced coal in 2020; 29 states produced coal in 2020 while 21 states did not. The outcomes are the compositional data discussed in Example 2.3, and we consider two possible confounders: gross domestic product (GDP) per capita (millions of chained 2012 dollars) and the proportion of electricity generated from coal and petroleum in 2010 for each state. We implemented the proposed approach to obtain the DR, CF, OR, and IPW estimators, with results shown in Figure 4. The DR, CF, and OR estimators yield similar results, suggesting that coal production leads to a smaller proportion of renewables & nuclear. In contrast, the IPW estimator yields slightly different results, possibly due to misspecification of the propensity score model, and therefore should not be relied upon here.

To quantify the size and uncertainty of the treatment effect, we computed the geodesic distance between the two mean potential compositions using the DR estimator, d_{g}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}), and obtained its finite-sample valid confidence region using the adaptive HulC algorithm. Specifically, d_{g}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}) is found to be 0.133, with the corresponding 95% finite-sample valid confidence region estimated as (0.112, 0.269). This suggests that the effect of coal production on the composition of electricity generation sources is significant at the 0.05 level.

Figure 4. Ternary plot illustrating the composition of electricity generation in 2020 across 50 U.S. states and mean potential outcomes with and without production of coal in 2020 using four different methods. States are shape-coded based on whether they produced coal or not in 2020. Mean potential outcomes are color-coded based on which method is used. The geodesic distance between mean potential outcomes for the DR, CF, OR, and IPW estimators is 0.133, 0.108, 0.131, and 0.198, respectively, where the GATE obtained from IPW differs substantially from the GATE obtained with the other estimators.

6.2. New York Yellow taxi system after COVID-19 outbreak

Yellow taxi trip records in New York City (NYC), containing details such as pick-up and drop-off dates/times, locations, trip distances, payment methods, and driver-reported passenger counts, can be accessed at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Additionally, NYC Coronavirus Disease 2019 (COVID-19) data are available at https://github.com/nychealth/coronavirus-data, providing citywide and borough-specific daily counts of probable and confirmed COVID-19 cases in NYC since February 29, 2020.

We focused on taxi trip records in Manhattan, which experiences the highest taxi traffic. Following preprocessing procedures outlined in Zhou and Müller, (2022), we grouped the 66 taxi zones (excluding islands) into 13 regions. We restricted our analysis to the period comprising 172 days from April 12, 2020 to September 30, 2020, during which taxi ridership per day in Manhattan steadily increased following a decline due to the COVID-19 outbreak. For each day, we constructed a daily undirected network with nodes corresponding to the 13 regions and edge weights representing the number of people traveling between connected regions. Self-loops in the networks were removed as the focus was on connections between different regions. Thus, we have observations consisting of a simple undirected weighted network for each of the 172 days, each associated with a 13\times 13 graph Laplacian.
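For concreteness, a sketch of the construction of one daily graph Laplacian from a matrix of trip counts might read as follows; how directed pick-up/drop-off counts are combined into undirected edge weights is an assumption here, not a detail given in the text.

```python
import numpy as np

def daily_laplacian(trips):
    """Graph Laplacian L = D - W of a daily taxi network.

    trips: 13x13 array of trip counts between the 13 regions."""
    W = np.asarray(trips, dtype=float)
    W = (W + W.T) / 2            # undirected: symmetrize the counts (assumption)
    np.fill_diagonal(W, 0.0)     # drop self-loops within a region
    return np.diag(W.sum(axis=1)) - W
```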

We aim to investigate the causal effect of COVID-19 new cases on daily taxi networks in Manhattan. Daily COVID-19 new cases were dichotomized, taking the value 0 if below 60 and 1 otherwise, resulting in 79 and 93 days classified into the low (0) and high (1) COVID-19 new case groups, respectively. The outcomes are graph Laplacians as discussed in Example 2.1. Confounders of interest include a weekend indicator and daily average temperature.

We obtained DR and CF estimators using the proposed approach, along with OR and IPW estimators for comparison. In Figure 5, we illustrate the entry-wise differences between the adjacency matrices corresponding to low and high COVID-19 new cases for the different estimators. The DR, CF, and OR estimators exhibit similar performance, indicating that high COVID-19 case counts led to less traffic. Regions with the largest differences include regions 105, 106, and 108, primarily residential areas that also contain popular locations such as Penn Station, Grand Central Terminal, and the Metropolitan Museum of Art. The impact of COVID-19 new cases on traffic networks is seen to be negative, especially for traffic in residential areas. Conversely, the IPW estimator appears ineffective in capturing this causal effect.

Similarly to Section 6.1, the Frobenius distance between the two mean potential networks using the DR estimator, d_{F}(\widehat{\Theta}_{0}^{(\mathrm{DR})},\widehat{\Theta}_{1}^{(\mathrm{DR})}), is found to be 5216, with a corresponding 95% finite-sample valid confidence region of (2362, 10979). This suggests that the effect of COVID-19 new cases on daily taxi networks in Manhattan is significant at the 0.05 level.

Figure 5. Average treatment effects (differences between adjacency matrices) represented as heatmaps using four different methods: (a) cross-fitting; (b) doubly robust; (c) outcome regression; (d) inverse probability weighting.

6.3. Brain functional connectivity networks and Alzheimer’s disease

Resting-state functional Magnetic Resonance Imaging (fMRI) methodology allows for the study of brain activation and the identification of brain regions exhibiting similar activity during the resting state (Ferreira and Busatto, 2013; Sala-Llonch et al., 2015). In resting-state fMRI, a time series of Blood Oxygen Level Dependent (BOLD) signals is obtained for each region of interest (ROI). The temporal coherence between pairwise ROIs is typically measured by the temporal Pearson correlation coefficients (PCC) of the fMRI time series, forming an m\times m correlation matrix when considering m distinct ROIs. Alzheimer’s disease has been found to be associated with anomalies in the functional integration and segregation of ROIs (Zhang et al., 2010; Damoiseaux et al., 2012).

Our study utilized data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The dataset in our analysis consists of 372 cognitively normal (CN) and 145 Alzheimer’s disease (AD) subjects. We used the automated anatomical labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002) to parcellate the whole brain into 90 ROIs, with 45 ROIs in each hemisphere, so that m=90. The BOLD signals were preprocessed with standard procedures, including slice-timing correction and head motion correction. A PCC matrix was calculated for all time series pairs for each subject. These matrices were then converted into simple, undirected, weighted networks by setting diagonal entries to 0 and thresholding the absolute values of the remaining correlations. Density-based thresholding was employed, where the threshold varied from subject to subject to achieve a desired, fixed connection density. Specifically, we retained the 15% strongest connections.
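Density-based thresholding itself is straightforward; a minimal sketch keeping the strongest 15% of absolute correlations as weighted edges is given below (the function name is hypothetical, and retaining weights rather than binarizing follows the description above).

```python
import numpy as np

def threshold_by_density(pcc, density=0.15):
    """Keep the strongest |correlation| edges at a fixed connection density."""
    A = np.abs(np.asarray(pcc, dtype=float))
    np.fill_diagonal(A, 0.0)                        # simple network: no self-loops
    upper = A[np.triu_indices_from(A, k=1)]
    k = max(1, int(np.ceil(density * upper.size)))  # number of edges retained
    cut = np.sort(upper)[-k]                        # subject-specific threshold
    return np.where(A >= cut, A, 0.0)               # weighted, undirected network
```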

In our analysis, Alzheimer’s disease, coded as a binary variable (0 for cognitively normal, 1 for diagnosed with Alzheimer’s disease), serves as the exposure of interest. The outcomes are brain functional connectivity networks, and the two confounders are the age and gender of each subject. To quantify the impact of Alzheimer’s disease on brain functional connectivity, we computed the entry-wise differences between the adjacency matrices of the two mean potential networks for the DR estimator, as depicted in Figure 6. Our observations reveal that Alzheimer’s disease has notable effects on various regions of the human brain. Specifically, the central region, the orbital surface of the frontal lobe, the temporal lobe, the occipital lobe, and subcortical areas exhibit more pronounced influences, displaying clustered patterns consistent with findings in previous literature (Planche et al., 2022). Notably, damage to the frontal lobe is associated with issues in judgment, intelligence, and behavior, while damage to the temporal lobe can impact memory.

Figure 6. Average treatment effect (difference between adjacency matrices) represented as a heatmap using the doubly robust procedure DR. The five regions circled by squares, from top left to bottom right, correspond to the central region, orbital surface in the frontal lobe, temporal lobe, occipital lobe, and subcortical areas, respectively.

To assess structural changes in brain functional connectivity networks following Alzheimer’s disease, Table 3 provides a summary of commonly adopted network measures that quantify functional integration and segregation, with definitions provided in Rubinov and Sporns, (2010). Our observations indicate a decrease in all functional integration and segregation measures of brain functional connectivity networks, with local efficiency experiencing the most substantial reduction. Functional segregation, denoting the brain’s ability for specialized processing within densely interconnected groups of regions, and functional integration, representing the capacity to rapidly combine specialized information from distributed brain regions, are both compromised. This finding implies that Alzheimer’s disease leads to lower global and local efficiency in human brains, disrupting crucial functions such as judgment and memory (Ahmadi et al., 2023).

Table 3. Network measures of mean potential networks for the doubly robust estimator.

Category               | Measure                    | Cognitively normal | Alzheimer’s disease
Functional integration | Global efficiency          | 3.117              | 2.914
Functional integration | Characteristic path length | 10.319             | 9.593
Functional segregation | Local efficiency           | 3.599              | 3.035
Functional segregation | Clustering coefficient     | 0.154              | 0.142

7. Discussion

We mention here a few additional issues. Although our assumptions are satisfied for most metric spaces of interest and guarantee relatively fast convergence rates of the proposed doubly robust estimators, one may consider alternative assumptions. For example, one can replace Assumption 4.4 (ii) with a condition analogous to Assumption (U2) in Petersen and Müller, (2019), namely

\inf_{d(\nu,\Theta_{t}^{(\mathrm{DR})})<\eta}\left(Q_{t}(\nu;\mu_{t},\varphi_{*})-Q_{t}(\Theta_{t}^{(\mathrm{DR})};\mu_{t},\varphi_{*})-Cd(\nu,\Theta_{t}^{(\mathrm{DR})})^{\beta}\right)\geq 0.

However, under this relaxed assumption, the convergence rates of the estimators \widehat{\Theta}_{t}^{(\mathrm{DR})} and \widehat{\Theta}_{t}^{(\mathrm{CF})} will be slower than those reported in Theorem 4.2. Specifically, the term (\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}} would need to be replaced with (\varrho_{n}+r_{n})^{{1\over\beta}}.

Even when the objects that represent outcomes of interest are given, for example covariance matrices, there remains the issue of choosing a metric and the ensuing geodesics, where interpretability of the resulting GATE will be a consideration. For a brief discussion of this and related issues, see Dubey et al., (2024).

There are numerous scenarios where metric space-valued outcomes are of interest in causal inference, as we have demonstrated in our empirical examples. Our tools to construct estimators for treatment effects are inspired by their linear counterparts but are substantially different, and notions of Euclidean treatment effects need to be revisited to arrive at canonical generalizations for metric spaces. These investigations also shed new light on and lead to a reinterpretation of the conventional Euclidean case. Future exploration of technical aspects and real world applications for object data will enhance the applicability of causal inference methods across a range of complex data.

Appendix A Asymptotic properties of OR and IPW estimators

A.1. Outcome regression

In this section, we provide asymptotic properties of outcome regression estimators.

Assumption A.1.

Let \widehat{\mu}_{t}(\cdot), t\in\{0,1\}, be estimators for the outcome regression functions m_{t}(\cdot), t\in\{0,1\}. Assume that \sup_{x\in\mathcal{X}}d(\widehat{\mu}_{t}(x),m_{t}(x))=O_{p}(r_{n,1}), t\in\{0,1\}, with r_{n,1}\to 0 as n\to\infty.

Assumption A.2.

Define \Theta_{t}^{(\mathrm{OR})}=\mathop{\rm arg~min}\limits_{\nu\in\mathcal{M}}Q^{(\mathrm{OR})}(\nu;m_{t}), where Q^{(\mathrm{OR})}(\nu;\mu)=\operatorname{E}[d^{2}(\nu,\mu(X))]. The objects \Theta_{t}^{(\mathrm{OR})} and \widehat{\Theta}_{t}^{(\mathrm{OR})} exist and are unique, and for any \varepsilon>0, \inf_{d(\nu,\Theta_{t}^{(\mathrm{OR})})>\varepsilon}Q^{(\mathrm{OR})}(\nu;m_{t})>Q^{(\mathrm{OR})}(\Theta_{t}^{(\mathrm{OR})};m_{t}).

Assumption A.3.

For t\in\{0,1\},

  • (i) As \delta\to 0,

J_{t}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{OR})}),d)}\,d\varepsilon=O(1), (A.1)
J_{m_{t}}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta^{\prime}_{1}}(m_{t}),d_{\infty})}\,d\varepsilon=O(\delta^{-\varpi_{1}}) (A.2)

for some \delta^{\prime}_{1}>0 and \varpi_{1}\in(0,1).

  • (ii) There exist positive constants \eta, \eta_{1}, C, C_{1}, and \beta>1 such that

\inf_{d_{\infty}(\mu,\mu_{t})\leq\eta_{1}}\inf_{d(\nu,\Theta_{t}^{(\mathrm{OR})})\leq\eta}\left\{Q^{(\mathrm{OR})}(\nu;\mu)-Q^{(\mathrm{OR})}(\Theta_{t}^{(\mathrm{OR})};\mu)-Cd(\nu,\Theta_{t}^{(\mathrm{OR})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}d(\nu,\Theta_{t}^{(\mathrm{OR})})^{{\beta\over 2}}\right\}\geq 0.

One can see that Examples 2.1–2.4 satisfy Assumption A.3 (ii) with \beta=2.

The following results provide the consistency and convergence rates of the outcome regression estimators. The proofs are similar to those of Theorems 4.1 and 4.2 and thus omitted.

Theorem A.1.

Suppose that Assumptions 3.1(i), (iii), 3.2, A.1, and A.2 hold. Then, as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\ t\in\{0,1\}. (A.3)

Theorem A.2.

Suppose that Assumptions 3.1(i), (iii), 3.2, A.1, A.2, and A.3 hold. Then for each \beta^{\prime}\in(0,1), as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{OR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{1})}}+r_{n,1}^{{\beta^{\prime}\over\beta-1}}\right),\ t\in\{0,1\}. (A.4)

A.2. Inverse probability weighting

In this section, we provide asymptotic properties of inverse probability weighting estimators.

Assumption A.4.

Let \widehat{\mu} and \widehat{e}(\cdot) be estimators for \mu\in\mathcal{M} and the propensity score p(\cdot), respectively. Assume that d(\widehat{\mu},\mu)=O_{p}(v_{n}) and \sup_{x\in\mathcal{X}}|\widehat{e}(x)-p(x)|=O_{p}(\varrho_{n,1}), with v_{n},\varrho_{n,1}\to 0 as n\to\infty.

Assumption A.5.

Define \Theta_{t}^{(\mathrm{IPW})}=\mathop{\rm arg~min}\limits_{\nu\in\mathcal{M}}Q_{t}^{(\mathrm{IPW})}(\nu;\mu,p), where

Q_{t}^{(\mathrm{IPW})}(\nu;\alpha,e)=\operatorname{E}\left[d^{2}\left(\nu,\gamma_{\alpha,Y}\left({tT\over e(X)}+{(1-t)(1-T)\over 1-e(X)}\right)\right)\right],\ t\in\{0,1\}.

Assume that \Theta_{t}^{(\mathrm{IPW})}=\operatorname{E}_{\oplus}[Y(t)], that the objects \Theta_{t}^{(\mathrm{IPW})} and \widehat{\Theta}_{t}^{(\mathrm{IPW})} exist and are unique, and that for any \varepsilon>0, \inf_{d(\nu,\Theta_{t}^{(\mathrm{IPW})})>\varepsilon}Q_{t}^{(\mathrm{IPW})}(\nu;\mu,p)>Q_{t}^{(\mathrm{IPW})}(\Theta_{t}^{(\mathrm{IPW})};\mu,p).

Assumption A.6.

For t\in\{0,1\},

  • (i) As \delta\to 0,

J_{t}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta}(\Theta_{t}^{(\mathrm{IPW})}),d)}\,d\varepsilon=O(1), (A.5)
J_{p}(\delta):=\int_{0}^{1}\sqrt{1+\log N(\delta\varepsilon,B_{\delta^{\prime}_{1}}(p),\|\cdot\|_{\infty})}\,d\varepsilon=O(\delta^{-\varpi_{2}}) (A.6)

for some \delta^{\prime}_{1}>0 and \varpi_{2}\in(0,1), where for p_{1},p_{2}:\mathcal{X}\to\mathbb{R}, \|p_{1}-p_{2}\|_{\infty}:=\sup_{x\in\mathcal{X}}|p_{1}(x)-p_{2}(x)|.

  • (ii) There exist positive constants \eta, \eta_{1}, C, C_{1}, and \beta>1 such that

\inf_{d(\alpha,\mu)\leq\eta_{1},\ \|e-p\|_{\infty}\leq\eta_{1}}\inf_{d(\nu,\Theta_{t}^{(\mathrm{IPW})})\leq\eta}\left\{Q_{t}^{(\mathrm{IPW})}(\nu;\alpha,e)-Q_{t}^{(\mathrm{IPW})}(\Theta_{t}^{(\mathrm{IPW})};\alpha,e)-Cd(\nu,\Theta_{t}^{(\mathrm{IPW})})^{\beta}+C_{1}\eta_{1}^{{\beta\over 2(\beta-1)}}d(\nu,\Theta_{t}^{(\mathrm{IPW})})^{{\beta\over 2}}\right\}\geq 0.

One can see that Examples 2.1–2.4 satisfy Assumption A.6 (ii) with \beta=2.

The following results provide the consistency and convergence rates of the inverse probability weighting estimators. The proofs are again similar to those of Theorems 4.1 and 4.2.

Theorem A.3.

Suppose that Assumptions 3.1, 4.2, A.4, and A.5 hold. Then, as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1),\ t\in\{0,1\}. (A.7)

Theorem A.4.

Suppose that Assumptions 3.1, 4.2, A.4, A.5, and A.6 hold. Then for each \beta^{\prime}\in(0,1), as n\to\infty,

d(\widehat{\Theta}_{t}^{(\mathrm{IPW})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi_{2})}}+(v_{n}+\varrho_{n,1})^{{\beta^{\prime}\over\beta-1}}\right),\ t\in\{0,1\}. (A.8)

Appendix B Verification of Model Assumptions

In this section, we verify our assumptions for Examples 2.1–2.4.

Proposition B.1.

The space of graph Laplacians defined in Example 2.1 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

Proof.

Note that the geodesic in the space of graph Laplacians (equipped with the Frobenius metric) is simply the line segment connecting the two endpoints. It is easy to verify that Assumption 4.2 holds with C_{0}=1/\eta_{0}-1. Following the argument in Theorem 2 of Zhou and Müller, (2022), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii). Assumption 3.1(i) holds as the space of graph Laplacians is a bounded and closed subset of \mathbb{R}^{m^{2}} (cf. Proposition 1 of Zhou and Müller, (2022)).

To prove that Assumptions 3.2 (ii) and (iii) hold, observe that we can define the perturbation of m_{t}(X) by adding errors component-wise to m_{t}(X), \mathcal{P}_{t}(m_{t}(X))=m_{t}(X)+\varepsilon, where \varepsilon is an m\times m random graph Laplacian with \operatorname{E}[\varepsilon_{jk}|X]=0 and \operatorname{E}[\varepsilon_{jk}^{2}|X]<\infty for 1\leq j<k\leq m. The definition of the perturbation map implies Assumption 3.2 (ii). Let \mathrm{tr}(A) denote the trace of an m\times m matrix A. For any graph Laplacian \nu, we have \operatorname{E}[d_{F}^{2}(\nu,\mathcal{P}_{t}(m_{t}(X)))]=\operatorname{E}[d_{F}^{2}(\nu,m_{t}(X))]+\operatorname{E}[\mathrm{tr}(\varepsilon^{\prime}\varepsilon)], which implies Assumption 3.2 (iii). ∎

Proposition B.2.

The space of covariance or correlation matrices defined in Example 2.2 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

The proof of Proposition B.2 is similar to that of Proposition B.1 and is omitted.

Proposition B.3.

Consider the space of compositional data (\mathcal{S}^{d-1}_{+},d_{g}) defined in Example 2.3. Let T_{\mu}\mathcal{M} be the tangent space at \mu\in\mathcal{M} and let \mathrm{Exp}_{\mu} and \mathrm{Log}_{\mu} denote the exponential and logarithmic maps at \mu, respectively. Define \widetilde{\nu}=\mathrm{Log}_{\Theta_{t}^{(\mathrm{DR})}}(\nu) and \widetilde{\gamma}=\mathrm{Log}_{\gamma_{t}}(\bar{\gamma}_{t}), where

\gamma_{t}=\gamma_{\mu_{t}(X),Y(t)}\left({tT\over e(X;\varphi_{*})}+{(1-t)(1-T)\over 1-e(X;\varphi_{*})}\right),\quad\bar{\gamma}_{t}=\gamma_{\mu(X),Y(t)}\left({tT\over e(X;\varphi)}+{(1-t)(1-T)\over 1-e(X;\varphi)}\right).

Additionally, for v\in T_{\Theta_{t}^{(\mathrm{DR})}}\mathcal{M} and w\in T_{\gamma_{t}}\mathcal{M}, define g(v,w)=d_{g}^{2}(\mathrm{Exp}_{\Theta_{t}^{(\mathrm{DR})}}(v),\mathrm{Exp}_{\gamma_{t}}(w)). If there exist positive constants \bar{\eta}_{1} and \bar{\eta} such that

\inf_{\|v^{*}\|\leq\bar{\eta}_{1},\ \|\varphi-\varphi_{*}\|\leq\bar{\eta},\ \sup_{x\in\mathcal{X}}d_{g}(\mu(x),\mu_{t}(x))\leq\bar{\eta}}\lambda_{\mathrm{min}}\left(\left.{\partial^{2}\over\partial v\partial v^{\prime}}\operatorname{E}[g(v,\widetilde{\gamma})]\right|_{v=v^{*}}\right)>0, (B.1)

then the space (\mathcal{S}^{d-1}_{+},d_{g}) satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4, where \lambda_{\mathrm{min}}(A) denotes the smallest eigenvalue of a square matrix A.

Proof.

Assumption 4.2 holds trivially for \alpha_{1}=\alpha_{2}, so we verify Assumption 4.2 for \alpha_{1}\neq\alpha_{2}. Note that for any q_{1},q_{2}\in\mathcal{S}_{+}^{d-1}, we have \|q_{1}-q_{2}\|\leq d_{g}(q_{1},q_{2})\leq{\pi\over 2\sqrt{2}}\|q_{1}-q_{2}\|, which implies the equivalence of the geodesic metric and the ambient Euclidean metric. One can find a positive constant C^{\prime} depending only on \eta_{0} such that \sup_{\beta\in\mathcal{S}^{d-1}_{+},\,\kappa\in[1/(1-\eta_{0}),1/\eta_{0}]}\|\gamma_{\alpha_{1},\beta}(\kappa)-\gamma_{\alpha_{2},\beta}(\kappa)\|\leq C^{\prime}\|\alpha_{1}-\alpha_{2}\|, and this yields (4.1) with C_{0}=C^{\prime}\pi/(2\sqrt{2}).

Following the argument in Proposition 3 of Petersen and Müller, (2019), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii) under (B.1). Assumption 3.1 (i) holds as \mathcal{S}^{d-1}_{+} is a bounded and closed subset of the unit sphere \mathcal{S}^{d-1}.

Define the perturbation of m_{t}(X) as a perturbation in the tangent space, \mathcal{P}_{t}(m_{t}(X))=\mathrm{Exp}_{m_{t}(X)}(U), where U=\sum_{j=1}^{d-1}Z_{j}e_{j}\in\mathbb{R}^{d} is a random tangent vector, (e_{1},\dots,e_{d-1}) denotes an orthonormal basis for the tangent space at m_{t}(X), and Z_{1},\dots,Z_{d-1} are real-valued independent random variables with mean zero. Since \operatorname{E}[U]=0, it follows that \operatorname{E}_{\oplus}[\mathrm{Exp}_{m_{t}(X)}(U)|X]=m_{t}(X) and thus Assumption 3.2 (ii) holds. One can similarly show that Assumption 3.2 (iii) is satisfied. ∎

Proposition B.4.

The Wasserstein space defined in Example 2.4 satisfies Assumptions 3.1(i), 3.2(ii), (iii), 4.2, 4.3(i), and 4.4.

Proof.

For any probability measure \mu\in\mathcal{W}, the map Q:\mu\mapsto F_{\mu}^{-1} is an isometry from \mathcal{W} to the subset of the Hilbert space L^{2}(0,1) formed by equivalence classes of left-continuous nondecreasing functions on (0,1). The Wasserstein space can thus be viewed as a convex and closed subset of L^{2}(0,1) (Bigot et al., 2017). The corresponding Wasserstein metric d_{\mathcal{W}} coincides with the L^{2} metric d_{L^{2}} between quantile functions. It is straightforward to verify that Assumption 4.2 is satisfied for the Hilbert space (L^{2}(0,1),d_{L^{2}}) with C_{0}=1/\eta_{0}-1. Since the Wasserstein space is a closed subset of L^{2}(0,1), the geodesic extension necessitates adjustments to prevent boundary crossings, as discussed in Section 2. Such adjustments can only reduce the left-hand side of (4.1), thereby ensuring that Assumption 4.2 holds with C_{0}=1/\eta_{0}-1. Assumption 3.1(i) holds as the Wasserstein space is a bounded and closed subset of L^{2}(0,1). Following the argument in Proposition 1 of Petersen and Müller, (2019), one can verify that Assumptions 4.3 (i) and 4.4 hold with \beta=2 in Assumption 4.4 (ii).

For m_{t}(X)\in\mathcal{W}, we can define the perturbation of m_{t}(X) as a perturbation of the quantile function of m_{t}(X): F_{\mathcal{P}_{t}(m_{t}(X))}^{-1}(s)=F_{m_{t}(X)}^{-1}(s)+\varepsilon, where \varepsilon is a real-valued random variable with \operatorname{E}[\varepsilon|X]=0 and \operatorname{E}[\varepsilon^{2}|X]<\infty. The definition of the perturbation map implies Assumption 3.2 (ii). For any \nu\in\mathcal{W}, we have \operatorname{E}[d_{\mathcal{W}}^{2}(\nu,\mathcal{P}_{t}(m_{t}(X)))]=\operatorname{E}[d_{\mathcal{W}}^{2}(\nu,m_{t}(X))]+\operatorname{E}[\varepsilon^{2}], which implies Assumption 3.2 (iii). ∎

Appendix C Proofs for Section 3

C.1. Proof of Proposition 3.1

Step 1. It is sufficient to check (3.2) when

  • (i) the propensity score is correctly specified, that is, e(x)=p(x) (Step 2), or

  • (ii) the outcome regression functions are correctly specified, that is, \mu_{t}(x)=m_{t}(x), t\in\{0,1\} (Step 3).

Now we provide some auxiliary results. Observe that

\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}=\begin{cases}0\odot\gamma_{\mu_{1}(X),Y(0)}=\mathrm{id}_{\mu_{1}(X)}=0\odot\gamma_{\mu_{1}(X),Y(1)}&\text{if $T=0$},\\ \left(\frac{1}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y(1)}&\text{if $T=1$}.\end{cases}

This yields

\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}=\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y(1)}.

Then we have

\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{p(X)}{e(X)}\right)\odot\gamma_{\mu_{1}(X),m_{1}(X)}\right]. (C.1)

Likewise, we have

\operatorname{E}_{G,X}\left[\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{1-p(X)}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),m_{0}(X)}\right]. (C.2)

Further, for random objects A,B,R\in(\mathcal{M},d), we have

\operatorname{E}_{G,X}[\gamma_{A,B}]=\operatorname{E}_{G,X}[\gamma_{A,R}\oplus\gamma_{R,B}]=\operatorname{E}_{G,X}[\gamma_{A,R}]\oplus\operatorname{E}_{G,X}[\gamma_{R,B}]. (C.3)

Step 2. Now we assume e(x)=p(x). Note that (C.1) yields

\operatorname{E}_{G,X}\left[\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}[\gamma_{\mu_{1}(X),m_{1}(X)}].

Combining this and (C.3), we have

\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\right]\oplus\operatorname{E}_{G,X}\left[\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\gamma_{\mu_{1}(X),m_{1}(X)}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]
=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}. (C.4)

Likewise, (C.2) and (C.3) yield

\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-p(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}. (C.5)

Combining (C.4) and (C.5), we have

\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-p(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{p(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]
=\ominus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}\oplus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\mu}\oplus\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}.

Step 3. Now we assume \mu_{t}(x)=m_{t}(x), t\in\{0,1\}. In this case, (C.1) yields

\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right]=\operatorname{E}_{G}\left[\left(\frac{p(X)}{e(X)}\right)\odot\gamma_{m_{1}(X),m_{1}(X)}\right]=\operatorname{E}_{G}\left[\mathrm{id}_{m_{1}(X)}\right].

Then an argument similar to the one leading to (C.4) yields

\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right\}\right]=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]\oplus\operatorname{E}_{G,X}\left[\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\mathrm{id}_{m_{1}(X)}\right]
=\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\right]
=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(1)]}. (C.6)

Likewise, (C.2) yields

\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{m_{0}(X),Y}\right\}\right]=\gamma_{\mu,\operatorname{E}_{\oplus}[Y(0)]}. (C.7)

Combining (C.6) and (C.7), we have

\ominus\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{m_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G,X}\left[\gamma_{\mu,m_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{m_{1}(X),Y}\right\}\right]
=\gamma_{\operatorname{E}_{\oplus}[Y(0)],\operatorname{E}_{\oplus}[Y(1)]}. ∎

C.2. Proof of (3.1)

Assume that e(x)=p(x) or \mu_{t}(x)=m_{t}(x) for t\in\{0,1\}. Note that the geodesic in \mathbb{R} is the line segment connecting two endpoints and any geodesic \gamma_{\alpha,\beta}=\{\alpha+t(\beta-\alpha):t\in[0,1]\} is extendable, that is, for any \rho\in\mathbb{R}, \rho\odot\gamma_{\alpha,\beta}=\gamma_{\alpha,\alpha+\rho(\beta-\alpha)}.

Observe that

\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))\right]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))}\right]
=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\gamma_{\mu_{1}(X),\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))}\right]
=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right]. (C.8)

If e(x)=p(x) or \mu_{1}(x)=m_{1}(x), we have

\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+\frac{T}{e(X)}(Y-\mu_{1}(X))\right]}=\gamma_{\mu,\operatorname{E}\left[\mu_{1}(X)+{p(X)\over e(X)}(m_{1}(X)-\mu_{1}(X))\right]}
=\gamma_{\mu,\operatorname{E}[m_{1}(X)]}
=\gamma_{\mu,\operatorname{E}[Y(1)]}. (C.9)

Combining (C.8) and (C.9),

\gamma_{\mu,\operatorname{E}[Y(1)]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].

Likewise, if e(x)=p(x) or \mu_{0}(x)=m_{0}(x), then

\gamma_{\mu,\operatorname{E}[Y(0)]}=\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]

and one obtains

\gamma_{\operatorname{E}[Y(0)],\operatorname{E}[Y(1)]}=\ominus\gamma_{\mu,\operatorname{E}[Y(0)]}\oplus\gamma_{\mu,\operatorname{E}[Y(1)]}
=\ominus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{0}(X)}\oplus\left\{\left(\frac{1-T}{1-e(X)}\right)\odot\gamma_{\mu_{0}(X),Y}\right\}\right]\oplus\operatorname{E}_{G}\left[\gamma_{\mu,\mu_{1}(X)}\oplus\left\{\left(\frac{T}{e(X)}\right)\odot\gamma_{\mu_{1}(X),Y}\right\}\right].

Appendix D Proofs for Section 4

For any positive sequences a_{n},b_{n}, we write a_{n}\lesssim b_{n} if there is a constant C>0, independent of n, such that a_{n}\leq Cb_{n} for all n.

D.1. Proof of Theorem 4.1 (i)

Define \mathrm{diam}(\mathcal{M})=\sup_{\mu_{1},\mu_{2}\in\mathcal{M}}d(\mu_{1},\mu_{2}). By Corollary 3.2.3 in van der Vaart and Wellner, (2023), it is sufficient to show

\sup_{\nu\in\mathcal{M}}\left|Q_{n,t}(\nu;\widehat{\mu}_{t},\widehat{\varphi})-Q_{t}(\nu;\mu_{t},\varphi_{*})\right|\stackrel{p}{\to}0,\ t\in\{0,1\},

where

Q_{n,t}(\nu;\mu,\varphi)={1\over n}\sum_{i=1}^{n}d^{2}\left(\nu,\gamma_{\mu(X_{i}),Y_{i}}\left({tT_{i}\over e(X_{i};\varphi)}+{(1-t)(1-T_{i})\over 1-e(X_{i};\varphi)}\right)\right),\ t\in\{0,1\}.

Now we show \sup_{\nu\in\mathcal{M}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\mu_{1},\varphi_{*})\right|\stackrel{p}{\to}0; the proof for t=0 is similar. For this, we show that Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi}) converges weakly to Q_{1}(\nu;\mu_{1},\varphi_{*}) in \ell^{\infty}(\mathcal{M}) and then apply Theorem 1.3.6 in van der Vaart and Wellner, (2023). From Theorem 1.5.4 in van der Vaart and Wellner, (2023), this weak convergence follows by showing that

  • (i) Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})\stackrel{p}{\to}Q_{1}(\nu;\mu_{1},\varphi_{*}) for each \nu\in\mathcal{M} as n\to\infty;

  • (ii) Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi}) is asymptotically equicontinuous in probability, i.e., for each \varepsilon,\eta>0, there exists \delta>0 such that

\limsup_{n\to\infty}\operatorname{P}\left(\sup_{d(\nu_{1},\nu_{2})<\delta}|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|>\varepsilon\right)<\eta.

For an arbitrary \nu\in\mathcal{M}, for (i) we have

|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\mu_{1},\varphi_{*})|
\leq|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu;\widehat{\mu}_{1},\varphi_{*})|+|Q_{n,1}(\nu;\widehat{\mu}_{1},\varphi_{*})-Q_{n,1}(\nu;\mu_{1},\varphi_{*})|+|Q_{n,1}(\nu;\mu_{1},\varphi_{*})-Q_{1}(\nu;\mu_{1},\varphi_{*})|
=:Q_{1}+Q_{2}+Q_{3},

and observe that

Q_{1}\leq{1\over n}\sum_{i=1}^{n}\left|d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\widehat{\varphi})}\right)\right)-d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)\right|
\leq{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}d\left(\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\widehat{\varphi})}\right),\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)
\lesssim{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}\left|{T_{i}\over e(X_{i};\widehat{\varphi})}-{T_{i}\over e(X_{i};\varphi_{*})}\right|d(\widehat{\mu}_{1}(X_{i}),Y_{i})
\leq{2\mathrm{diam}(\mathcal{M})^{2}\over\eta_{0}^{2}}\sup_{x\in\mathcal{X}}|e(x;\widehat{\varphi})-e(x;\varphi_{*})|=O_{p}(\varrho_{n}), (D.1)

Q_{2}\leq{1\over n}\sum_{i=1}^{n}\left|d^{2}\left(\nu,\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)-d^{2}\left(\nu,\gamma_{\mu_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)\right|
\leq{2\mathrm{diam}(\mathcal{M})\over n}\sum_{i=1}^{n}d\left(\gamma_{\widehat{\mu}_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right),\gamma_{\mu_{1}(X_{i}),Y_{i}}\left({T_{i}\over e(X_{i};\varphi_{*})}\right)\right)
\leq 2C_{0}\mathrm{diam}(\mathcal{M})\sup_{x\in\mathcal{X}}d(\widehat{\mu}_{1}(x),\mu_{1}(x))=O_{p}(r_{n}), (D.2)

\operatorname{E}[Q_{3}^{2}]\leq{1\over n}\operatorname{E}\left[d^{4}\left(\nu,\gamma_{\mu_{1}(X),Y}\left({T\over e(X;\varphi_{*})}\right)\right)\right]\leq{\mathrm{diam}^{4}(\mathcal{M})\over n}=O(n^{-1}). (D.3)

Combining (D.1), (D.2), and (D.3), we obtain (i).

Pick any \nu_{1},\nu_{2}\in\mathcal{M}. For (ii), similarly to the argument leading to (D.1),

|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|\leq 2\mathrm{diam}(\mathcal{M})d(\nu_{1},\nu_{2}),

and this implies \sup_{d(\nu_{1},\nu_{2})<\delta}|Q_{n,1}(\nu_{1};\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\nu_{2};\widehat{\mu}_{1},\widehat{\varphi})|=O_{p}(\delta), so that we obtain (ii), which completes the proof. ∎

D.2. Proof of Theorem 4.1 (ii)

Note that one can show \max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1) by applying arguments similar to those in the proof of Theorem 4.1 (i). From the definition of \widehat{\Theta}_{t}^{(\mathrm{CF})},

\sum_{k=1}^{K}{n_{k}\over n}d^{2}(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,k}^{(\mathrm{DR})})\leq\sum_{k=1}^{K}{n_{k}\over n}d^{2}(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})})\leq\max_{1\leq k\leq K}d^{2}(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})}),

which implies that d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})})\leq\max_{1\leq k\leq K}d(\operatorname{E}_{\oplus}[Y(t)],\widehat{\Theta}_{t,k}^{(\mathrm{DR})}) for some \bar{k}\in\{1,\dots,K\}. Then

d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\operatorname{E}_{\oplus}[Y(t)])\leq d(\widehat{\Theta}_{t}^{(\mathrm{CF})},\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})})+d(\widehat{\Theta}_{t,\bar{k}}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])\leq 2\max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=o_{p}(1). (D.4) ∎

D.3. Proof of Theorem 4.2(i)

Now we show (4.6) for t=1; the case t=0 is analogous. We adapt the proof of Theorem 3.2.5 in van der Vaart and Wellner, (2023). Define

h_{\nu,\mu,\varphi}(x,y,t)=d^{2}\left(\nu,\gamma_{\mu(x),y}\left({t\over e(x;\varphi)}\right)\right)-d^{2}\left(\Theta_{1}^{(\mathrm{DR})},\gamma_{\mu_{1}(x),y}\left({t\over e(x;\varphi_{*})}\right)\right).

For some \delta_{0}>0 and \delta^{\prime}_{1}>0, consider the class of functions

\mathcal{H}_{\delta,\delta^{\prime}_{1}}:=\left\{h_{\nu,\mu,\varphi}(\cdot):d(\nu,\Theta_{1}^{(\mathrm{DR})})\leq\delta,\ d_{\infty}(\mu,\mu_{1})\leq\delta^{\prime}_{1},\ \|\varphi-\varphi_{*}\|\leq\delta^{\prime}_{1}\right\}

for any \delta\leq\delta_{0}. An envelope function for \mathcal{H}_{\delta,\delta^{\prime}_{1}} is 2\delta\mathrm{diam}(\mathcal{M}). Observe that

|h_{\widetilde{\nu},\widetilde{\mu},\widetilde{\varphi}}(x,y,t)-h_{\nu,\mu,\varphi}(x,y,t)|\leq 2\mathrm{diam}(\mathcal{M})\left(1+C_{0}+{C_{e}\mathrm{diam}(\mathcal{M})\over\eta_{0}^{2}}\right)\left\{d(\widetilde{\nu},\nu)+d_{\infty}(\widetilde{\mu},\mu)+\|\widetilde{\varphi}-\varphi\|\right\}.

Then for \varepsilon>0, following Theorem 2.7.17 of van der Vaart and Wellner, (2023), the \delta\varepsilon bracketing number of \mathcal{H}_{\delta,\delta^{\prime}_{1}} is bounded by a multiple of N(c_{1}\delta\varepsilon,B_{\delta}(\Theta_{1}^{(\mathrm{DR})}),d)\,N(c_{2}\delta\varepsilon,B_{\delta^{\prime}_{1}}(\mu_{1}),d_{\infty})\,(\delta\varepsilon)^{-c_{3}}, where c_{1},c_{2}, and c_{3} are constants depending only on p, \mathrm{diam}(\mathcal{M}), C_{e}, and C_{0}.

Assumption 4.4 (i) and Theorem 2.14.16 of van der Vaart and Wellner, (2023) imply that, for small enough \delta>0,

\operatorname{E}\left[\sup_{d(\nu,\Theta_{1})\leq\delta,\ \|\varphi-\varphi_{*}\|\leq\delta^{\prime}_{1},\ d_{\infty}(\mu,\mu_{1})\leq\delta^{\prime}_{1}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\right]
\lesssim{2\delta\mathrm{diam}(\mathcal{M})\over\sqrt{n}}\left(J_{1}(\delta)+J_{\mu_{1}}(\delta)+\int_{0}^{1}\sqrt{-\log(\delta\varepsilon)}\,d\varepsilon\right)\lesssim{\delta(1+\delta^{-\varpi})\over\sqrt{n}}\lesssim{\delta^{1-\varpi}\over\sqrt{n}}. (D.5)

For any \beta^{\prime}\in(0,1), set q_{n}^{-1}=\max\{n^{-{\beta\over 4(\beta-1+\varpi)}},(\varrho_{n}+r_{n})^{{\beta\beta^{\prime}\over 2(\beta-1)}}\} and

S_{j,n}=\left\{\nu:2^{j-1}<q_{n}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}\right\}.

Choose \eta>0 to satisfy Assumption 4.4 (ii) and also small enough so that Assumption 4.4 (i) holds for all \delta<\eta, and set \widetilde{\eta}=\eta^{\beta/2}. For any L,\delta_{1},K>0,

\operatorname{P}\left(q_{n}d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})^{\beta/2}>2^{L}\right)\leq\operatorname{P}(A_{n}^{c})+\operatorname{P}(2d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})\geq\eta)
\quad+\sum_{j\geq L,\ 2^{j}\leq q_{n}\widetilde{\eta}}\operatorname{P}\left(\left\{\sup_{\nu\in S_{j,n}}\left\{Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right\}\leq Kq_{n}^{-2}\right\}\cap A_{n}\right), (D.6)

where A_{n}=\{d_{\infty}(\widehat{\mu}_{1},\mu_{1})\leq\delta_{1}q_{n}^{-{2(\beta-1)\over\beta}},\ \|\widehat{\varphi}-\varphi_{*}\|\leq\delta_{1}q_{n}^{-{2(\beta-1)\over\beta}}\}.

Note that for any \varepsilon>0, there exists an integer n_{\varepsilon} such that, for all n\geq n_{\varepsilon}, \operatorname{P}(A_{n}^{c})<\varepsilon/3, \operatorname{P}(2d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})\geq\eta)<\varepsilon/3, and \delta_{1}q_{n}^{-{2(\beta-1)\over\beta}}\leq\delta^{\prime}_{1}.

Define A_{n,\varepsilon} as A_{n} with n\geq n_{\varepsilon}. Then, on A_{n,\varepsilon}, for each fixed j such that 2^{j}\leq q_{n}\widetilde{\eta}, one has for all \nu\in S_{j,n},

Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})
\geq Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})
\quad-\sup_{d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq{2^{j}\over q_{n}}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right|
\geq{2^{2j-2}\over q_{n}^{2}}\left({C\over 2}-C_{1}\delta_{1}^{{\beta\over 2(\beta-1)}}2^{2-j}\right)
\quad-\sup_{d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq{2^{j}\over q_{n}}}\left|Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})-Q_{1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right|.

Then for large $j$ such that $C/2-C_{1}\delta_{1}^{{\beta\over 2(\beta-1)}}2^{2-j}-K2^{2-2j}\geq C/4$, we have

\begin{align*}
&\operatorname{P}\left(\left\{\inf_{\nu\in S_{j,n}}\left\{Q_{n,1}(\nu;\widehat{\mu}_{1},\widehat{\varphi})-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\widehat{\mu}_{1},\widehat{\varphi})\right\}\leq Kq_{n}^{-2}\right\}\cap A_{n,\varepsilon}\right)\\
&\leq\operatorname{P}\left(\sup_{\begin{subarray}{c}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}/q_{n},\\ d_{\infty}(\mu,\mu_{1})\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta},\\ \|\varphi-\varphi_{*}\|\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta}\end{subarray}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\geq{C2^{2j-2}\over 4q_{n}^{2}}\right),\tag{D.7}
\end{align*}
where the supremum with the plug-in estimates $(\widehat{\mu}_{1},\widehat{\varphi})$ is dominated, on $A_{n,\varepsilon}$, by the supremum over deterministic $(\mu,\varphi)$ in the corresponding balls.

For each $j$ in the sum on the right-hand side of (D.6), we have $d(\nu,\Theta_{1}^{(\mathrm{DR})})\leq(2^{j}/q_{n})^{2/\beta}\leq\eta$ for all $\nu\in S_{j,n}$, so that (D.5) applies with $\delta=(2^{j}/q_{n})^{2/\beta}$. Combining (D.6) and (D.7) with the Markov inequality and then invoking (D.5),

\begin{align*}
&\operatorname{P}\left(q_{n}\,d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\Theta_{1}^{(\mathrm{DR})})^{\beta/2}>2^{L}\right)\\
&\lesssim\sum_{\begin{subarray}{c}j\geq L\\ 2^{j}\leq q_{n}\widetilde{\eta}\end{subarray}}{2^{-2j}\over q_{n}^{-2}}\operatorname{E}\left[\sup_{\begin{subarray}{c}d(\nu,\Theta_{1}^{(\mathrm{DR})})^{\beta/2}\leq 2^{j}/q_{n},\\ d_{\infty}(\mu,\mu_{1})\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta},\\ \|\varphi-\varphi_{*}\|\leq\delta_{1}q_{n}^{-2(\beta-1)/\beta}\end{subarray}}\left|Q_{n,1}(\nu;\mu,\varphi)-Q_{n,1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)-Q_{1}(\nu;\mu,\varphi)+Q_{1}(\Theta_{1}^{(\mathrm{DR})};\mu,\varphi)\right|\right]+{2\varepsilon\over 3}\\
&\lesssim\sum_{\begin{subarray}{c}j\geq L\\ 2^{j}\leq q_{n}\widetilde{\eta}\end{subarray}}{2^{-2j+{2j\over\beta}(1-\varpi)}\over q_{n}^{-2+{2\over\beta}(1-\varpi)}\sqrt{n}}+{2\varepsilon\over 3}\leq\sum_{j\geq L}\left({1\over 4^{{\beta-1+\varpi\over\beta}}}\right)^{j}+{2\varepsilon\over 3}.
\end{align*}
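The final comparison is exponent bookkeeping: writing $\gamma=(\beta-1+\varpi)/\beta$ (a shorthand introduced here only for this verification) and using $q_{n}\leq n^{{\beta\over 4(\beta-1+\varpi)}}=n^{{1\over 4\gamma}}$, so that $q_{n}^{2\gamma}\leq\sqrt{n}$, each summand satisfies
\[
{2^{-2j+{2j\over\beta}(1-\varpi)}\over q_{n}^{-2+{2\over\beta}(1-\varpi)}\sqrt{n}}={(2^{-2j}q_{n}^{2})^{\gamma}\over\sqrt{n}}\leq 2^{-2j\gamma}=\left({1\over 4^{{\beta-1+\varpi\over\beta}}}\right)^{j}.
\]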

Since $\beta>1$, the last series converges, and hence the first term on the far right-hand side can be made smaller than $\varepsilon/3$ for large $L$. Therefore, we obtain the desired result,

\[
d(\widehat{\Theta}_{1}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(1)])=O_{p}(q_{n}^{-{2\over\beta}})=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}}\right).
\]

D.4. Proof of Theorem 4.2(ii)

Applying almost the same argument as in the proof of Theorem 4.2 (i), one can show

\[
\max_{1\leq k\leq K}d(\widehat{\Theta}_{t,k}^{(\mathrm{DR})},\operatorname{E}_{\oplus}[Y(t)])=O_{p}\left(n^{-{1\over 2(\beta-1+\varpi)}}+(\varrho_{n}+r_{n})^{{\beta^{\prime}\over\beta-1}}\right).
\]

Then (D.4) yields the result.

References

  • Ahmadi, H., Fatemizadeh, E., and Motie-Nasrabadi, A. (2023). A comparative study of correlation methods in functional connectivity analysis using fMRI data of Alzheimer's patients. Journal of Biomedical Physics & Engineering, 13(2):125.
  • Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd.
  • Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32.
  • Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Bigot, J., Gouet, R., Klein, T., and López, A. (2017). Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré B: Probability and Statistics, 53:1–26.
  • Billera, L. J., Holmes, S. P., and Vogtmann, K. (2001). Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27:733–767.
  • Chen, Y. and Müller, H.-G. (2022). Uniform convergence of local Fréchet regression with applications to locating extrema and time warping for metric space valued trajectories. Annals of Statistics, 50(3):1573–1592.
  • Chen, Y., Zhou, Y., Chen, H., Gajardo, A., Fan, J., Zhong, Q., Dubey, P., Han, K., Bhattacharjee, S., Zhu, C., Iao, S. I., Kundu, P., Petersen, A., and Müller, H.-G. (2023). frechet: Statistical Analysis for Random Objects and Non-Euclidean Data. R package version 0.3.0.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Dai, X. (2022). Statistical inference on the Hilbert sphere with application to random densities. Electronic Journal of Statistics, 16(1):700–736.
  • Damoiseaux, J. S., Prater, K. E., Miller, B. L., and Greicius, M. D. (2012). Functional connectivity tracks clinical deterioration in Alzheimer's disease. Neurobiology of Aging, 33(4):828.e19–828.e30.
  • Delsol, L. and Van Keilegom, I. (2020). Semiparametric M-estimation with non-smooth criterion functions. Annals of the Institute of Statistical Mathematics, 72(2):577–605.
  • Dryden, I. L., Koloydenko, A., and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Annals of Applied Statistics, 3:1102–1123.
  • Dubey, P., Chen, Y., and Müller, H.-G. (2024). Metric statistics: Exploration and inference for random objects with distance profiles. Annals of Statistics, 52:757–792.
  • Ferreira, L. K. and Busatto, G. F. (2013). Resting-state functional connectivity in normal brain aging. Neuroscience & Biobehavioral Reviews, 37(3):384–400.
  • Filzmoser, P., Hron, K., and Templ, M. (2018). Applied Compositional Data Analysis: With Worked Examples in R. Springer.
  • Gallier, J. and Quaintance, J. (2020). Differential Geometry and Lie Groups: A Computational Perspective, volume 12. Springer.
  • Geyer, C. J. (2020). trust: Trust Region Optimization. R package version 0.1.8.
  • Gunsilius, F. F. (2023). Distributional synthetic controls. Econometrica, 91(3):1105–1117.
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331.
  • Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
  • Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
  • Kuchibhotla, A. K., Balakrishnan, S., and Wasserman, L. (2024). The HulC: Confidence regions from convex hulls. Journal of the Royal Statistical Society: Series B (Statistical Methodology), in press.
  • Lin, Z. (2019). Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353–1370.
  • Lin, Z., Kong, D., and Wang, L. (2023). Causal inference on distribution functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 85(2):378–398.
  • Lin, Z. and Müller, H.-G. (2021). Total variation regularized Fréchet regression for metric-space valued data. Annals of Statistics, 49:3510–3533.
  • McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179.
  • Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.
  • Petersen, A. and Müller, H.-G. (2019). Fréchet regression for random objects with Euclidean predictors. Annals of Statistics, 47(2):691–719.
  • Pigoli, D., Aston, J. A., Dryden, I. L., and Secchi, P. (2014). Distances and inference for covariance operators. Biometrika, 101:409–422.
  • Planche, V., Manjon, J. V., Mansencal, B., Lanuza, E., Tourdias, T., Catheline, G., and Coupé, P. (2022). Structural progression of Alzheimer's disease over decades: The MRI staging scheme. Brain Communications, 4(3):fcac109.
  • Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rubinov, M. and Sporns, O. (2010). Complex network measures of brain connectivity: Uses and interpretations. NeuroImage, 52(3):1059–1069.
  • Sala-Llonch, R., Bartrés-Faz, D., and Junqué, C. (2015). Reorganization of brain networks in aging: A review of functional connectivity studies. Frontiers in Psychology, 6.
  • Scealy, J. and Welsh, A. (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):351–375.
  • Scealy, J. and Welsh, A. (2014). Colours and cocktails: Compositional data analysis. Australian & New Zealand Journal of Statistics, 56(2):145–169.
  • Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120.
  • Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. (2020). OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637–672.
  • Thanwerdas, Y. and Pennec, X. (2023). O(n)-invariant Riemannian metrics on SPD matrices. Linear Algebra and its Applications, 661:163–201.
  • Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B., and Joliot, M. (2002). Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15(1):273–289.
  • van der Vaart, A. W. and Wellner, J. A. (2023). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • Zhang, H.-Y., Wang, S.-J., Liu, B., Ma, Z.-L., Yang, M., Zhang, Z.-J., and Teng, G.-J. (2010). Resting brain connectivity: Changes during the progress of Alzheimer disease. Radiology, 256(2):598–606.
  • Zhou, Y. and Müller, H.-G. (2022). Network regression with graph Laplacians. Journal of Machine Learning Research, 23:1–41.
  • Zhu, C. and Müller, H.-G. (2023). Geodesic optimal transport regression. arXiv preprint arXiv:2312.15376.