
Causal Effect of Functional Treatment

Ruoxu Tan
Department of Statistics and Actuarial Science, University of Hong Kong
and
Wei Huang   
School of Mathematics and Statistics, University of Melbourne
and
Zheng Zhang
Institute of Statistics & Big Data, Renmin University of China
and
Guosheng Yin
Department of Statistics and Actuarial Science, University of Hong Kong
Corresponding authors: wei.huang@unimelb.edu.au; zhengzhang@ruc.edu.cn
Abstract

Functional data often arise in areas where the causal treatment effect is of interest. However, research concerning the effect of a functional variable on an outcome is typically restricted to exploring the association rather than the causal relationship. The generalized propensity score, often used to calibrate selection bias, is not directly applicable to a functional treatment variable because probability density functions are generally not well defined for functional data. Based on the functional linear model for the average dose-response functional, we propose three estimators, namely the functional stabilized weight estimator, the outcome regression estimator and the doubly robust estimator, each of which has its own merits. We study their theoretical properties, which are corroborated through extensive numerical experiments. A real data application on electroencephalography data and disease severity demonstrates the practical value of our methods.


Keywords: double robustness; functional data analysis; functional linear model; functional stabilized weight; treatment effect.

1 Introduction

In observational studies, a fundamental question is to investigate the causal effect of a treatment variable on an outcome. Traditionally, a large body of work has focused on a binary treatment variable (i.e., treatment versus control); see, e.g., Rosenbaum and Rubin (1983); Hirano et al. (2003); Imai and Ratkovic (2014, 2015); Chan et al. (2016); Athey et al. (2018); Ding et al. (2019); Tan (2020); Guo et al. (2021). Recently, there has been growing interest in studying more complex treatment variables, e.g., categorical (Lopez and Gutman, 2017a, b; Li and Li, 2019) and continuous variables (Hirano and Imbens, 2004; Galvao and Wang, 2015; Kennedy et al., 2017; Fong et al., 2018; Kennedy, 2019; Ai et al., 2021; Bonvini and Kennedy, 2022).

Our interest is to estimate the average dose-response functional (ADRF) of a functional treatment variable. Functional data, usually collected repeatedly over a continuous domain, arise in many fields such as astronomy, economics and biomedical sciences. Such data are essentially of infinite dimension and thus fundamentally different from scalar variables; see Wang et al. (2016) for a review of functional data analysis. Our setting is substantially distinct from those of observational longitudinal studies (Robins, 2000; Robins et al., 2000; Moodie and Stephens, 2012; Imai and Ratkovic, 2015; Zhao et al., 2015; Kennedy, 2019), where the treatment variable takes binary values at each time point and the outcome variable also depends on time. In contrast, the functional treatment variable considered here is real-valued over a continuous domain (not limited to time but possibly space, frequency, etc.) and the outcome variable is a scalar that does not vary with respect to (w.r.t.) that continuous domain. Real data applications of our setting can be found in the literature on scalar-on-function regression; see Morris (2015) for a review. For example, we may be interested in the causal relationship between daily temperature and crop harvest (Wong et al., 2019), spectrometric data and the fat content of certain types of food (Ferraty and Vieu, 2006), daily air pollution and mortality rate (Kong et al., 2016), etc. Our work is motivated by an electroencephalography (EEG) dataset (Ciarleglio et al., 2022), which was collected to investigate possible causation between human neuronal activity and the severity of major depressive disorder. A random subsample of 20 frontal asymmetric curves measuring the intensity of human neuronal activity is shown in Figure 1.

Figure 1: A random subsample of 20 frontal asymmetric curves from the EEG dataset.

In a nonrandomized observational study, it is impossible to identify the causal effect without imposing any assumption. In addition to basic assumptions such as the stable unit treatment value assumption, consistency and positivity, a routinely used condition for identification is the unconfoundedness assumption (Rosenbaum and Rubin, 1983), i.e., the potential outcome is independent of the treatment given all the confounding covariates. For a continuous scalar treatment, under the unconfoundedness assumption, the generalized propensity score or the stabilized weight (Hirano and Imbens, 2004; Imai and van Dyk, 2004; Galvao and Wang, 2015; Kennedy et al., 2017; Ai et al., 2021; Bonvini and Kennedy, 2022) has been widely used to calibrate the selection bias; both rely on the (conditional) density of the treatment variable. However, the probability density function of functional data generally does not exist (Delaigle and Hall, 2010). Therefore, in our context, neither the generalized propensity score nor the stabilized weight is well defined, and the traditional methods are not directly applicable. To circumvent this problem, Zhang et al. (2021) proposed to approximate the functional treatment variable via functional principal component analysis (FPCA) and defined a stabilized weight function through the (conditional) probability density of the PC scores to adjust for the selection bias. However, their estimator is not well defined when the number of PCs tends to infinity, and thus asymptotic consistency is not guaranteed.

Under the unconfoundedness assumption in the context of a functional treatment variable, we propose three estimators of the ADRF: a functional stabilized weight (FSW) method, a partially linear outcome regression (OR) method, and a doubly robust (DR) approach that combines the two. Specifically, the first estimator assumes a functional linear model for the ADRF and incorporates a new FSW. In contrast to Zhang et al. (2021), our FSW is well defined and guarantees the identification of the ADRF. By constructing a novel consistent nonparametric estimator of the FSW, we can estimate the ADRF by functional linear regression with the weighted outcome. Our second estimator assumes a more restrictive partially linear model that takes the confounding variables into account; it quantifies the ADRF as the expectation of the linear model with respect only to the confounding variables, which is computed through a backfitting algorithm. Further, by combining the first two estimators, we propose a new DR estimator, which is consistent if either of the first two estimators is consistent. In particular, it enjoys the same fast convergence rate as the second estimator if the partially linear model is correctly specified, while, like the first estimator, it imposes less restrictive modelling assumptions and remains consistent, albeit at a slower convergence rate, when the partially linear model is misspecified. Such double robustness differs from that in the literature on a continuous scalar treatment variable (see e.g., Kennedy et al., 2017; Bonvini and Kennedy, 2022), where the DR estimator is consistent if one of two models, for the conditional outcome and for the generalized propensity score, is correctly specified. In our case, the functional linear model for the ADRF is required for the consistency of all three estimators, and the FSW, serving the role of the generalized propensity score, is estimated nonparametrically.

The rest of the paper is organized as follows. We introduce the basic setup for identification of ADRF in Section 2, followed by development of our three estimators in Section 3. We investigate the convergence rates of the estimators in Section 4, and provide detailed cross-validation procedures for selecting tuning parameters in Section 5. The simulation and the real data analysis are presented in Sections 6 and 7, respectively. Section 8 concludes with discussion, and the supplementary material contains all the technical details.

2 Identification of Average Dose-Response Functional

We consider a functional treatment variable such that the observed treatment, denoted by $Z=Z(\cdot):\mathcal{T}\to\mathbb{R}$, is a smooth real-valued random function defined on a compact interval $\mathcal{T}$, with $E\{\int_{\mathcal{T}}Z^{2}(t)\,dt\}<\infty$. Let $Y^{*}(z)$ be the potential outcome given the treatment $Z=z$ for $z\in L^{2}(\mathcal{T})$, the Hilbert space of square integrable functions on $\mathcal{T}$. For each individual, we only observe a random treatment $Z$ and the corresponding outcome $Y=Y^{*}(Z)\in\mathbb{R}$. We are interested in estimating the ADRF, $E\{Y^{*}(z)\}$, for any $z\in L^{2}(\mathcal{T})$.

Unlike an observational longitudinal study (Robins, 2000; Robins et al., 2000; Moodie and Stephens, 2012; Imai and Ratkovic, 2015; Zhao et al., 2015; Kennedy, 2019), here we assume that the outcome variable $Y$ does not depend on $t\in\mathcal{T}$, where the interval $\mathcal{T}$ is not limited to representing a time interval, but can also be a space, a domain of frequencies, etc. Instead of considering the association between $Z$ and $Y$ as in the traditional regression setting, here we investigate their causal relationship. We assume that the functional variable $Z$ is fully or densely observed. In principle, our methodology can be applied when $Z$ is only sparsely observed, as long as the trajectory of $Z$ can be well reconstructed, although the development of asymptotic properties in such cases would be much more challenging (Zhang and Wang, 2016).

A fundamental problem of causal inference is that we never observe $Y^{*}(z)$ simultaneously for all $z\in L^{2}(\mathcal{T})$. This issue is more severe in our context than in the traditional scalar treatment case, because $Z$ is a functional variable valued in an infinite dimensional space. It is then impossible to identify $E\{Y^{*}(z)\}$ from the observed data alone without any assumption. Let $\mathbf{X}=(X_{1},\ldots,X_{p})^{\top}\in\mathbb{R}^{p}$ be the $p$-dimensional vector of observable confounding covariates that are related to both $Y^{*}(z)$ and $Z$. We impose the following assumptions in order to identify $E\{Y^{*}(z)\}$.

Assumption 1

  • (i) Unconfoundedness: Given $\mathbf{X}$, $Z$ is independent of $Y^{*}(z)$ for all $z\in L^{2}(\mathcal{T})$, i.e., $\{Y^{*}(z)\}_{z\in L^{2}(\mathcal{T})}\perp Z\mid\mathbf{X}$.

  • (ii) Stable unit treatment value assumption: There is no interference among the units, i.e., each individual's outcome depends only on their own level of treatment intensity.

  • (iii) Consistency: $Y=Y^{*}(z)$ a.s. if $Z=z$.

  • (iv) Positivity: The conditional density $f_{\mathbf{X}|Z}$ satisfies $f_{\mathbf{X}|Z}(\mathbf{X}|Z)>0$ a.s.

Assumption 1 is a natural extension of the classical identification assumptions in the literature on scalar treatment effects (Rosenbaum and Rubin, 1983; Hirano and Imbens, 2004; Kennedy et al., 2017) to the context of a functional treatment variable. The positivity condition (iv) in Assumption 1 differs from the one widely used for identifying a scalar treatment effect, where the generalized propensity score, defined as the probability density of the treatment variable conditional on the confounding covariates, is required to be positive. However, the marginal and conditional probability density functions of functional data, i.e., $f_{Z}$ and $f_{Z|\mathbf{X}}$, generally do not exist (Delaigle and Hall, 2010). To circumvent this problem, we impose the positivity condition on the conditional density $f_{\mathbf{X}|Z}$, which is well defined.

Under Assumption 1, the ADRF can be identified from the observable variables (𝐗,Y,Z)(\mathbf{X},Y,Z) in the following three ways,

E\{Y^{*}(z)\}=E[E\{Y^{*}(z)|\mathbf{X}\}]=E[E\{Y^{*}(z)|\mathbf{X},Z=z\}]=E\{E(Y|\mathbf{X},Z=z)\}\,; (2.1)

Alternatively, we have

E\{Y^{*}(z)\}=E\{E(Y|\mathbf{X},Z=z)\}=\int_{\mathbb{R}^{p}}E(Y|\mathbf{X}=\mathbf{x},Z=z)f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}
=\int_{\mathbb{R}^{p}}E(Y|\mathbf{X}=\mathbf{x},Z=z)\dfrac{f_{\mathbf{X}}(\mathbf{x})}{f_{\mathbf{X}|Z}(\mathbf{x}|z)}f_{\mathbf{X}|Z}(\mathbf{x}|z)\,d\mathbf{x}
=E\{\pi_{0}(Z,\mathbf{X})Y|Z=z\}\,, (2.2)

where $\pi_{0}(Z,\mathbf{X})=f_{\mathbf{X}}(\mathbf{X})/f_{\mathbf{X}|Z}(\mathbf{X}|Z)$ is called the functional stabilized weight (FSW). Note that if $Z$ were a scalar random variable, we would have $f_{\mathbf{X}}(\mathbf{X})/f_{\mathbf{X}|Z}(\mathbf{X}|Z)=f_{Z}(Z)/f_{Z|\mathbf{X}}(Z|\mathbf{X})$, where the right-hand side is the classical stabilized weight. Let $E_{\mathbf{X}}$ denote the expectation w.r.t. $\mathbf{X}$. Noting that $E[E_{\mathbf{X}}\{E(Y|\mathbf{X},Z)\}|Z=z]=E\{E(Y|\mathbf{X},Z=z)\}$ and $E\{E(Y|\mathbf{X},Z)\pi_{0}(\mathbf{X},Z)|Z=z\}=E\{Y\pi_{0}(\mathbf{X},Z)|Z=z\}$, together with (2.1) and (2.2), we can also identify $E\{Y^{*}(z)\}$ as

E\{Y^{*}(z)\}=E[\{Y-E(Y|\mathbf{X},Z)\}\pi_{0}(\mathbf{X},Z)+E_{\mathbf{X}}\{E(Y|\mathbf{X},Z)\}|Z=z]\,. (2.3)

We assume that our observed data $(\mathbf{X}_{i},Y_{i},Z_{i})$, for $i=1,\ldots,n$, are i.i.d. copies of $(\mathbf{X},Y,Z)$. Based on the identification strategies (2.1), (2.2) and (2.3), we develop three estimators of $E\{Y^{*}(z)\}$ under different model assumptions, each of which has its own merits.

3 Models and Estimation Methods

For all three methods, we make the following functional linear model assumption on the ADRF:

E\{Y^{*}(z)\}=a+\int_{\mathcal{T}}b(t)z(t)\,dt\,, (3.1)

where $a$ is the scalar intercept and $b\in L^{2}(\mathcal{T})$ is the slope function. Such a model is widely used in the scalar-on-function regression literature. Under this model, the slope function $b=b(\cdot)$ becomes the quantity of primary interest for causal inference, because it measures the intensity of the functional treatment effect. For example, to compute the average treatment effect (ATE), $E\{Y^{*}(z_{1})-Y^{*}(z_{2})\}$, of two given $z_{1},z_{2}\in L^{2}(\mathcal{T})$, it suffices to compute $\int_{\mathcal{T}}b(t)\{z_{1}(t)-z_{2}(t)\}\,dt$.
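As a small numerical illustration of this last point, once an estimate of $b$ is available on a grid, the ATE between two treatment curves reduces to a one-dimensional integral. The sketch below uses a placeholder slope function and made-up curves (not the paper's estimator) and approximates $\int_{\mathcal{T}}b(t)\{z_{1}(t)-z_{2}(t)\}\,dt$ by a Riemann sum:

```python
import numpy as np

# Hypothetical sketch: b_hat plays the role of an estimated slope function on
# a grid over T = [0, 1]; z1 and z2 are two example treatment curves. All
# names and functional forms here are placeholders for illustration only.
t = np.linspace(0.0, 1.0, 201)
dt = t[1] - t[0]
b_hat = np.sin(2.0 * np.pi * t)        # placeholder for an estimate of b(t)
z1 = t ** 2                            # first treatment curve z1(t)
z2 = t                                 # second treatment curve z2(t)

# ATE = integral of b(t){z1(t) - z2(t)} dt, approximated on the grid
ate = float(np.sum(b_hat * (z1 - z2)) * dt)
```

For these particular placeholder curves the integral happens to vanish analytically, so the numerical value is close to zero; with real estimates the same one-line quadrature applies.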

3.1 Functional Stabilized Weight Estimator

Our functional stabilized weight (FSW) estimator does not rely on any model assumption other than (3.1). In particular, equation (2.2) suggests that we can estimate $E\{Y^{*}(z)\}$ in (3.1) using the classical functional linear regression technique (e.g., Hall and Horowitz, 2007) with the response variable $Y$ replaced by $\widehat{\pi}(Z,\mathbf{X})Y$, where $\widehat{\pi}$ is an estimator of $\pi_{0}$. To facilitate the development, suppose for now that such an estimator $\widehat{\pi}$ is available; its construction is given later in this subsection. Let $G(t,s)=\operatorname{Cov}\{Z(t),Z(s)\}$ be the covariance function of $Z$. Its eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq 0$ and the corresponding orthonormal eigenfunctions $\{\phi_{j}\}_{j=1}^{\infty}$ are defined by

\int_{\mathcal{T}}G(t,s)\phi_{j}(s)\,ds=\lambda_{j}\phi_{j}(t)\,,\ \text{for all } t\in\mathcal{T}\,, (3.2)

where $\{\phi_{j}\}_{j=1}^{\infty}$, also called the PC basis, is a complete orthonormal basis of $L^{2}(\mathcal{T})$. According to the Karhunen–Loève expansion, we have

Z(t)=\mu_{Z}(t)+\sum_{j=1}^{\infty}\xi_{j}\phi_{j}(t)\,, (3.3)

where $\mu_{Z}(t)=E\{Z(t)\}$ and $\xi_{j}=\int_{\mathcal{T}}\{Z(t)-\mu_{Z}(t)\}\phi_{j}(t)\,dt$.

With no ambiguity, let $G$ also denote the linear operator such that $(G\circ\psi)(t)=\int_{\mathcal{T}}G(t,s)\psi(s)\,ds$ for a function $\psi\in L^{2}(\mathcal{T})$. Combining (2.2), (3.1) and (3.3), we can write

E\{\pi_{0}(Z,\mathbf{X})Y|Z=z\}-\mu_{Y,\pi}=\int_{\mathcal{T}}b(t)\{z(t)-\mu_{Z}(t)\}\,dt\,,

where $\mu_{Y,\pi}=E\{\pi_{0}(Z,\mathbf{X})Y\}=a+\int_{\mathcal{T}}b(t)\mu_{Z}(t)\,dt$. It follows that $(G\circ b)(t)=e(t)$, where $e(t)=E[\{\pi_{0}(Z,\mathbf{X})Y-\mu_{Y,\pi}\}\{Z(t)-\mu_{Z}(t)\}]$. Noting that $\{\phi_{j}\}_{j=1}^{\infty}$ is an orthonormal complete basis of $L^{2}(\mathcal{T})$, we can write $b(t)=\sum_{j=1}^{\infty}b_{j}\phi_{j}(t)$ and $e(t)=\sum_{j=1}^{\infty}e_{j}\phi_{j}(t)$, and combining with (3.2), we obtain for all $j$ that

b_{j}=\int_{\mathcal{T}}b(t)\phi_{j}(t)\,dt\,,\ \text{ and }\ e_{j}=\int_{\mathcal{T}}e(t)\phi_{j}(t)\,dt=\int_{\mathcal{T}}(G\circ b)(t)\phi_{j}(t)\,dt=b_{j}\lambda_{j}\,.

The above argument suggests that we can estimate $b(t)$ by $\widehat{b}_{\mathrm{FSW}}(t)=\sum_{j=1}^{q}\widehat{b}_{\mathrm{FSW},j}\widehat{\phi}_{j}(t)$, with $q$ a truncation parameter and $\widehat{b}_{\mathrm{FSW},j}=\widehat{\lambda}_{j}^{-1}\widehat{e}_{\mathrm{FSW},j}$, where the $\widehat{\lambda}_{j}$'s and $\widehat{\phi}_{j}$'s are the eigenvalues and eigenfunctions of the empirical covariance function $\widehat{G}(s,t)=n^{-1}\sum_{i=1}^{n}\{Z_{i}(s)-\widehat{\mu}_{Z}(s)\}\{Z_{i}(t)-\widehat{\mu}_{Z}(t)\}$, and $\widehat{\mu}_{Z}(t)=n^{-1}\sum_{i=1}^{n}Z_{i}(t)$ is the empirical estimator of $\mu_{Z}(t)$. Also, the estimator of $e(t)$ is given by

\widehat{e}_{\mathrm{FSW}}(t)=\dfrac{1}{n}\sum_{i=1}^{n}\{Y_{i}\widehat{\pi}(Z_{i},\mathbf{X}_{i})-\widehat{\mu}_{Y,\pi}\}\{Z_{i}(t)-\widehat{\mu}_{Z}(t)\}

with $\widehat{\mu}_{Y,\pi}=n^{-1}\sum_{i=1}^{n}Y_{i}\widehat{\pi}(Z_{i},\mathbf{X}_{i})$. We then define $\widehat{e}_{\mathrm{FSW},j}=\int_{\mathcal{T}}\widehat{e}_{\mathrm{FSW}}(t)\widehat{\phi}_{j}(t)\,dt$. Finally, the estimator of $a$ is given by $\widehat{a}_{\mathrm{FSW}}=\widehat{\mu}_{Y,\pi}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{FSW}}(t)\widehat{\mu}_{Z}(t)\,dt$.
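The FPCA-based regression steps above can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assumes curves observed on a common grid, uses simulated data, sets the weights $\widehat{\pi}$ to 1 (as in a randomized trial) so that the regression machinery itself can be isolated, and fixes the truncation at $q=2$ rather than selecting it by cross-validation.

```python
import numpy as np

# Simulated scalar-on-function data: Z_i(t) = a_i sin(pi t) + c_i cos(pi t),
# Y_i = \int b(t) Z_i(t) dt + noise, with b(t) = sin(pi t).
rng = np.random.default_rng(0)
n, m, q = 200, 101, 2                    # sample size, grid size, truncation
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]

Z = rng.normal(size=(n, 1)) * np.sin(np.pi * t) \
    + rng.normal(size=(n, 1)) * np.cos(np.pi * t)
b_true = np.sin(np.pi * t)
Y = Z @ b_true * dt + 0.1 * rng.normal(size=n)
pi_hat = np.ones(n)                      # placeholder FSW (randomized trial)

W = pi_hat * Y                           # weighted outcome pi_hat * Y
Zc = Z - Z.mean(axis=0)                  # center the curves
G = Zc.T @ Zc / n * dt                   # discretized covariance operator
lam, phi = np.linalg.eigh(G)             # eigh returns ascending order
lam, phi = lam[::-1], phi[:, ::-1] / np.sqrt(dt)   # sort, L2-normalize
e = (W - W.mean()) @ Zc / n              # cross-covariance e(t) on the grid
e_j = e @ phi[:, :q] * dt                # Fourier coefficients of e
b_hat = phi[:, :q] @ (e_j / lam[:q])     # b_hat = sum_j (e_j / lambda_j) phi_j
a_hat = W.mean() - (Z.mean(axis=0) @ b_hat) * dt
```

With the oracle weights used here, `b_hat` recovers `b_true` up to sampling and discretization error; in the actual procedure `pi_hat` would be replaced by the nonparametric FSW estimator described next.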

It remains to estimate the FSW, $\pi_{0}(Z,\mathbf{X})=f_{\mathbf{X}}(\mathbf{X})/f_{\mathbf{X}|Z}(\mathbf{X}|Z)$. In particular, we need to estimate the weights at the sample points, $\pi_{0}(Z_{i},\mathbf{X}_{i})$, for $i=1,\ldots,n$. A naive approach would first estimate the densities $f_{\mathbf{X}}$ and $f_{\mathbf{X}|Z}$ separately and then take their ratio, which, however, may lead to unstable estimates, because an estimator of the denominator $f_{\mathbf{X}|Z}$ is difficult to derive and may be too close to zero. Approximating $Z$ by its PC scores would not alleviate this challenge but would pose additional difficulty in the theoretical justification. To circumvent these problems, we treat $\pi_{0}$ as a whole and develop a robust nonparametric estimator of it. The idea of estimating the density ratio directly has also been exploited by Ai et al. (2021) for a scalar treatment variable; however, their method is not applicable to a functional treatment.

Specifically, for a scalar treatment $Z$, Ai et al. (2021) found that the moment equation

E\{\pi_{0}(Z,\mathbf{X})v(\mathbf{X})u(Z)\}=E\{u(Z)\}E\{v(\mathbf{X})\}\,, (3.4)

holds for any integrable functions $v(\mathbf{X})$ and $u(Z)$, and that it identifies $\pi_{0}(\cdot,\cdot)$. Denoting the support of $\mathbf{X}$ by $\mathcal{X}$, they further estimated the function $\pi_{0}(\cdot,\cdot):\mathbb{R}\times\mathcal{X}\to\mathbb{R}$ by maximizing a generalized empirical likelihood, subject to the empirical counterpart of (3.4), where the spaces of integrable functions $v(\mathbf{X})$ and $u(Z)$ are approximated by a growing number of basis functions; see Section 4 of Ai et al. (2021) for details. However, those restrictions are not applicable in our context, where $Z$ is a random function in $L^{2}(\mathcal{T})$. In particular, it is unclear whether (3.4) still identifies $\pi_{0}(\cdot,\cdot):L^{2}(\mathcal{T})\times\mathcal{X}\to\mathbb{R}$. Moreover, the space of functionals $\{u(\cdot):L^{2}(\mathcal{T})\to\mathbb{R}\}$ is generally nonseparable, which impedes a consistent approximation of $\pi_{0}(\cdot,\cdot)$ by basis expansion.

To overcome the challenge of estimating $\pi_{0}$, we derive a novel expanding set of equations that identifies $\pi_{0}$ from functional data and avoids approximating a general functional $u(\cdot):L^{2}(\mathcal{T})\to\mathbb{R}$. Specifically, instead of estimating the function $\pi_{0}(\cdot,\cdot):L^{2}(\mathcal{T})\times\mathcal{X}\to\mathbb{R}$, we estimate its projection $\pi_{0}(z,\cdot):\mathcal{X}\to\mathbb{R}$ for every fixed $z\in L^{2}(\mathcal{T})$, and find that

E\{\pi_{0}(z,\mathbf{X})v(\mathbf{X})|Z=z\}=\int_{\mathbb{R}^{p}}\dfrac{f_{\mathbf{X}}(\mathbf{x})}{f_{\mathbf{X}|Z}(\mathbf{x}|z)}v(\mathbf{x})f_{\mathbf{X}|Z}(\mathbf{x}|z)\,d\mathbf{x}=E\{v(\mathbf{X})\} (3.5)

holds for any integrable function $v(\mathbf{X})$. In particular, the following proposition shows that (3.5) indeed identifies the function $\pi_{0}(z,\cdot)$ for every fixed $z\in L^{2}(\mathcal{T})$.

Proposition 3.1.

For any fixed $z\in L^{2}(\mathcal{T})$,

E\{\pi(z,\mathbf{X})v(\mathbf{X})|Z=z\}=E\{v(\mathbf{X})\}\,,

holds for all integrable functions $v$ if and only if $\pi(z,\mathbf{X})=\pi_{0}(z,\mathbf{X})$ a.s.

The proof is given in Supplement A. This result indicates that $\pi_{0}$ is fully characterized by (3.5), and thus we can use it to define our estimator. Since (3.5) is defined for a fixed $z$ and we need to estimate $\pi_{0}$ at all sample points $Z_{i}$, we consider the leave-one-out index set $\mathcal{S}_{-i}=\{j:j\neq i,\,j=1,\ldots,n\}$ to estimate $\pi_{0}$ at each $Z_{i}$. Specifically, for a given sample point $z=Z_{i}$, the right-hand side of (3.5) can be estimated by its empirical version $\sum_{j\in\mathcal{S}_{-i}}v(\mathbf{X}_{j})/(n-1)$, and we propose to estimate the left-hand side of (3.5) by the Nadaraya–Watson estimator for a functional covariate (see e.g., Ferraty and Vieu, 2006),

\dfrac{\sum_{j\in\mathcal{S}_{-i}}\pi_{0}(Z_{i},\mathbf{X}_{j})v(\mathbf{X}_{j})K\{d(Z_{j},Z_{i})/h\}}{\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}}\,,

where $d(\cdot,\cdot)$ denotes a semi-metric on $L^{2}(\mathcal{T})$, $h>0$ is a bandwidth, and $K$ is a kernel function quantifying the proximity of $Z_{j}$ to $Z_{i}$. Commonly used choices for $d$ include the $L^{2}$ norm and the projection metric $d(z_{1},z_{2})=[\int_{\mathcal{T}}\{\Pi(z_{1})-\Pi(z_{2})\}^{2}\,dt]^{1/2}$, where $\Pi$ denotes a projection onto a subspace of $L^{2}(\mathcal{T})$. The projection metric may yield better asymptotic properties; see Section 4 for more details, and see Section 5 on how to choose $h$ and $K$.
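For concreteness, the normalized kernel weights $K\{d(Z_{j},Z_{i})/h\}/\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}$ appearing in this estimator can be computed as follows. This is an illustrative sketch assuming curves sampled on a common grid, with the $L^{2}$ semi-metric and the Epanechnikov kernel as placeholder choices; the function names and the bandwidth are hypothetical.

```python
import numpy as np

def l2_dist(z1, z2, dt):
    """L2 semi-metric between two curves sampled on a common grid."""
    return np.sqrt(np.sum((z1 - z2) ** 2) * dt)

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def nw_weights(Z, i, h, dt):
    """Normalized kernel weights K{d(Z_j, Z_i)/h} over the leave-one-out
    set j != i, where Z is an (n, m) array of curves on a common grid."""
    d = np.array([l2_dist(Z[j], Z[i], dt) for j in range(len(Z))])
    w = epanechnikov(d / h)
    w[i] = 0.0                       # leave observation i out
    return w / w.sum()               # normalized Nadaraya-Watson weights
```

The left-hand side of (3.5) at $z=Z_{i}$ is then estimated by the weighted sum of $\pi(Z_{i},\mathbf{X}_{j})v(\mathbf{X}_{j})$ with these weights.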

As it is not possible to solve (3.5) for an infinite number of functions $v$ from a given finite sample, we use a growing number of basis functions to approximate any suitable $v$. Specifically, let $\nu_{k}(\mathbf{X})=(v_{1}(\mathbf{X}),\ldots,v_{k}(\mathbf{X}))^{\top}$ be a set of known basis functions, which serves as a finite dimensional sieve approximation to the original space of all integrable functions. We define our estimator of $\pi_{0}(Z_{i},\mathbf{X}_{j})$, for $j\in\mathcal{S}_{-i}$, as a solution to the $k$ equations

\dfrac{\sum_{j\in\mathcal{S}_{-i}}\pi(Z_{i},\mathbf{X}_{j})\nu_{k}(\mathbf{X}_{j})K\{d(Z_{j},Z_{i})/h\}}{\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}}=\dfrac{1}{n-1}\sum_{j\in\mathcal{S}_{-i}}\nu_{k}(\mathbf{X}_{j})\,, (3.6)

which asymptotically identifies $\pi_{0}$ as $k\to\infty$ and $h\to 0$. In practice, however, a finite number $k$ of equations as in (3.6) cannot identify a unique solution. Indeed, for any strictly increasing and concave function $\rho$, replacing $\pi(Z_{i},\mathbf{X}_{j})$ by $\rho^{\prime}\{\widehat{\eta}_{Z_{i}}^{\top}\nu_{k}(\mathbf{X}_{j})\}$ satisfies (3.6), where $\rho^{\prime}(\cdot)$ is the first derivative of $\rho(\cdot)$ and $\widehat{\eta}_{Z_{i}}$ is the $k$-dimensional vector maximizing the strictly concave function $H_{hk,Z_{i}}$ defined as

H_{hk,Z_{i}}(\eta)=\dfrac{\sum_{j\in\mathcal{S}_{-i}}\rho\{\eta^{\top}\nu_{k}(\mathbf{X}_{j})\}K\{d(Z_{j},Z_{i})/h\}}{\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}}-\eta^{\top}\bigg\{\dfrac{1}{n-1}\sum_{j\in\mathcal{S}_{-i}}\nu_{k}(\mathbf{X}_{j})\bigg\}\,. (3.7)

This can be verified by noting that the gradient condition $\nabla H_{hk,Z_{i}}(\widehat{\eta}_{Z_{i}})=0$ corresponds to (3.6) with $\pi(Z_{i},\mathbf{X}_{j})$ replaced by $\rho^{\prime}\{\widehat{\eta}_{Z_{i}}^{\top}\nu_{k}(\mathbf{X}_{j})\}$. Therefore, for a given strictly increasing and concave function $\rho$ together with $\widehat{\eta}_{Z_{i}}$ defined above, we define our estimator of $\pi_{0}(Z_{i},\mathbf{X}_{j})$ as $\widehat{\pi}_{hk}(Z_{i},\mathbf{X}_{j})=\rho^{\prime}\{\widehat{\eta}_{Z_{i}}^{\top}\nu_{k}(\mathbf{X}_{j})\}$, which also induces the estimator $\widehat{\pi}_{hk}(Z_{i},\mathbf{X}_{i})=\rho^{\prime}\{\widehat{\eta}_{Z_{i}}^{\top}\nu_{k}(\mathbf{X}_{i})\}$ upon replacing $\mathbf{X}_{j}$ by $\mathbf{X}_{i}$.

To gain more insight into the function $\rho$, we investigate the generalized empirical likelihood interpretation of our estimator. As shown in Supplement B, the estimator defined by (3.7) is the dual solution to the following local generalized empirical likelihood maximization problem: for each $Z_{i}$, $i=1,\ldots,n$,

\max_{\{\pi(Z_{i},\mathbf{X}_{j})\}_{j\in\mathcal{S}_{-i}}}-\dfrac{\sum_{j\in\mathcal{S}_{-i}}D\{\pi(Z_{i},\mathbf{X}_{j})\}K\{d(Z_{j},Z_{i})/h\}}{\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}}\,,
\mathrm{s.t.}\quad\dfrac{\sum_{j\in\mathcal{S}_{-i}}\pi(Z_{i},\mathbf{X}_{j})\nu_{k}(\mathbf{X}_{j})K\{d(Z_{j},Z_{i})/h\}}{\sum_{j\in\mathcal{S}_{-i}}K\{d(Z_{j},Z_{i})/h\}}=\dfrac{1}{n-1}\sum_{j\in\mathcal{S}_{-i}}\nu_{k}(\mathbf{X}_{j})\,, (3.8)

where $D(w)$ is a distance measure between $w$ and $1$ for $w\in\mathbb{R}$. The function $D$ is continuously differentiable and satisfies $D(1)=0$ and $\rho(-u)=D\{D^{\prime(-1)}(u)\}-u\cdot D^{\prime(-1)}(u)$, where $D^{\prime(-1)}$ is the inverse function of $D^{\prime}$, the first derivative of $D$. The sample moment restriction in (3.8) stabilizes our weight estimator.

Our estimator $\widehat{\pi}_{hk}$ is thus the weight closest to the baseline uniform weight $1$ locally within a small neighbourhood of each $Z_{i}$, subject to the finite sample version of the moment restriction in Proposition 3.1. The uniform weight can be regarded as a baseline because, when $Z$ and $Y^{*}(z)$ are unconditionally independent for all $z\in L^{2}(\mathcal{T})$ (i.e., a randomized trial without confounding), the functional stabilized weights are uniformly equal to $1$. Different choices of $\rho$ correspond to different distance measures. For example, if we take $\rho(v)=-\exp(-v-1)$, then $D(v)=-v\log(v)$ is the information entropy and the weights correspond to exponential tilting (Imbens et al., 1998). Other choices include $\rho(v)=\log(v)+1$ with $D(v)=-\log(v)$, the empirical likelihood (Owen, 1988), and $\rho(v)=-(1-v)^{2}/2$ with $D(v)=(1-v)^{2}/2$, yielding the implied weights of the continuously updated generalized method of moments (Hansen et al., 1996). We suggest choosing $\rho(v)=-\exp(-v-1)$, which guarantees a positive weight estimator $\widehat{\pi}_{hk}(Z_{i},\mathbf{X}_{i})$.
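The exponential-tilting choice can be made concrete as follows. The sketch below is a toy illustration (not the paper's code) of maximizing an objective of the form (3.7) for one fixed $Z_{i}$, with given normalized kernel weights `wK`, a small placeholder basis $\nu_{k}$, and $\rho(v)=-\exp(-v-1)$; the concave dual is solved by damped Newton ascent, and the fitted weights $\rho^{\prime}\{\widehat{\eta}^{\top}\nu_{k}(\mathbf{X}_{j})\}=\exp(-\widehat{\eta}^{\top}\nu_{k}(\mathbf{X}_{j})-1)$ are automatically positive and satisfy the sample moment restriction.

```python
import numpy as np

# Toy data standing in for one leave-one-out neighbourhood of Z_i: nu holds
# nu_k(X_j) (intercept + 2 covariates), wK are normalized kernel weights.
rng = np.random.default_rng(1)
n_loo, k = 100, 3
nu = np.column_stack([np.ones(n_loo), rng.normal(size=(n_loo, k - 1))])
wK = rng.uniform(0.5, 1.5, size=n_loo)
wK /= wK.sum()
target = nu.mean(axis=0)                  # (1/(n-1)) sum_j nu_k(X_j)

def H(eta):
    """Objective (3.7) with rho(v) = -exp(-v-1), kernel weights wK."""
    return wK @ (-np.exp(-nu @ eta - 1.0)) - eta @ target

eta = np.zeros(k)
for _ in range(50):                       # damped Newton ascent on concave H
    p = np.exp(-nu @ eta - 1.0)           # rho'(eta' nu_j): implied weights
    grad = (wK * p) @ nu - target
    hess = -(nu * (wK * p)[:, None]).T @ nu   # negative definite Hessian
    step = np.linalg.solve(hess, grad)
    s = 1.0
    while H(eta - s * step) < H(eta):     # backtrack to guarantee ascent
        s *= 0.5
    eta = eta - s * step

pi_hat = np.exp(-nu @ eta - 1.0)          # rho'(eta' nu_j) > 0 at all points
```

At the maximizer, the first-order condition reproduces the constraint in (3.8): the kernel-weighted moment of $\pi\,\nu_{k}$ matches the empirical mean of $\nu_{k}$.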

3.2 Outcome Regression Estimator and Doubly Robust Estimator

To estimate the ADRF based on the outcome regression (OR) identification strategy in (2.1), an estimator of $E(Y|\mathbf{X},Z)$ is required. Thus, in addition to the functional linear model in (3.1), we further assume the following partially linear model,

Y=a+\int_{\mathcal{T}}b(t)Z(t)\,dt+\theta^{\top}\mathbf{X}+\epsilon\,, (3.9)

where $a$ and $b$ are the same as those in (3.1), $\theta\in\mathbb{R}^{p}$ is the slope coefficient for $\mathbf{X}$, and $\epsilon$ is the error variable with $E(\epsilon|Z,\mathbf{X})=0$ and $E(\epsilon^{2}|Z,\mathbf{X})<\infty$ uniformly over all $Z$ and $\mathbf{X}$. Without loss of generality, we assume that $E(\theta^{\top}\mathbf{X})=0$; otherwise the non-zero mean can be absorbed into $a$.

Recalling $E\{Y^{*}(z)\}=E\{E(Y|\mathbf{X},Z=z)\}$ from (2.1), we see that (3.9) implies (3.1). Indeed, model (3.9) imposes a stronger structure than model (3.1); for example, the linear part $\theta^{\top}\mathbf{X}$ could be replaced by any nonparametric function of $\mathbf{X}$ and (3.1) would still hold. However, model (3.9) yields a simpler estimator of $E\{Y^{*}(z)\}$ with a faster convergence rate and better interpretability of the effects of the treatment and the confounding variables on the outcome.

It remains to estimate the unknown parameters $a$ and $b$ in (3.9), for which the backfitting algorithm (Buja et al., 1989) can be adapted to our context. In the first step, we set $\theta$ to zero and apply the same method as in Section 3.1 to regress $Y$ on $Z$, which gives the initial estimators of $a$ and $b$ in (3.9). Specifically, we estimate $b(t)$ initially by $\widehat{b}_{\mathrm{ini}}(t)=\sum_{j=1}^{q}\widehat{b}_{\mathrm{ini},j}\widehat{\phi}_{j}(t)$ with $\widehat{b}_{\mathrm{ini},j}=\widehat{\lambda}_{j}^{-1}\widehat{e}_{\mathrm{ini},j}$, where the $\widehat{\lambda}_{j}$'s and $\widehat{\phi}_{j}$'s are the same as those in Section 3.1, while $\widehat{e}_{\mathrm{ini}}(t)=n^{-1}\sum_{i=1}^{n}(Y_{i}-\widehat{\mu}_{Y})\{Z_{i}(t)-\widehat{\mu}_{Z}(t)\}$ with $\widehat{\mu}_{Y}=n^{-1}\sum_{i=1}^{n}Y_{i}$. We then define $\widehat{e}_{\mathrm{ini},j}=\int_{\mathcal{T}}\widehat{e}_{\mathrm{ini}}(t)\widehat{\phi}_{j}(t)\,dt$. Finally, the initial estimator $\widehat{a}_{\mathrm{ini}}$ of $a$ is defined as $\widehat{\mu}_{Y}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{ini}}(t)\widehat{\mu}_{Z}(t)\,dt$.

In the second step, we use ordinary least squares to regress the residual $Y_{i}-\widehat{a}_{\mathrm{ini}}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{ini}}(t)Z_{i}(t)\,dt$ on $\mathbf{X}_{i}$, for $i=1,\ldots,n$. Specifically, we estimate $\theta$ by $\widehat{\theta}=(\Xi^{\top}\Xi)^{-1}\Xi^{\top}\mathbf{Y}_{\mathrm{res}}$, where $\Xi$ is the $n\times p$ matrix with $(i,j)$-th entry $X_{ij}$, the $j$-th element of $\mathbf{X}_{i}$, and

𝐘res=(Y1a^ini𝒯b^ini(t)Z1(t)𝑑t,,Yna^ini𝒯b^ini(t)Zn(t)𝑑t).\mathbf{Y}_{\mathrm{res}}=\Big{(}Y_{1}-\widehat{a}_{\mathrm{ini}}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{ini}}(t)Z_{1}(t)\,dt,\ldots,Y_{n}-\widehat{a}_{\mathrm{ini}}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{ini}}(t)Z_{n}(t)\,dt\Big{)}^{\top}\,.

Subsequently, we can repeat the first step with YiY_{i} replaced by Yiθ^𝐗iY_{i}-\widehat{\theta}^{\top}\mathbf{X}_{i} to update a^ini\widehat{a}_{\mathrm{ini}} and b^ini\widehat{b}_{\mathrm{ini}}. This procedure is iterated until the outcome regression (OR) estimators of aa, bb and θ\theta, denoted by a^OR\widehat{a}_{\mathrm{OR}}, b^OR\widehat{b}_{\mathrm{OR}} and θ^OR\widehat{\theta}_{\mathrm{OR}}, stabilize.
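The full backfitting iteration can be sketched as below, with the functional part represented by centred estimated PC scores (which do not change across iterations, since ZZ is fixed). The interface and the convergence criterion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def backfit_partially_linear(Y, X, xi, lam, tol=1e-8, max_iter=200):
    """Illustrative backfitting for the partially linear model
    Y = a + <b, Z> + theta' X + eps, with the functional part represented
    by centred estimated PC scores xi (n x q) and eigenvalues lam (q,)."""
    n = len(Y)
    theta = np.zeros(X.shape[1])                 # first step: theta = 0
    for _ in range(max_iter):
        R = Y - X @ theta                        # partial residual
        b_coef = (R - R.mean()) @ xi / n / lam   # b_j = cov(R, xi_j) / lam_j
        a = R.mean()                             # equals a + <b, mu_Z> in the
                                                 # paper's parametrization
        R2 = Y - a - xi @ b_coef                 # regress remaining residual on X
        theta_new, *_ = np.linalg.lstsq(X, R2, rcond=None)
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return a, b_coef, theta
```

The alternation is a linear fixed-point iteration, so it converges whenever the empirical correlation between the score space and the covariate space is bounded away from one.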

The OR estimator is easy to implement but requires strong parametric assumptions, while the FSW estimator requires fewer modelling assumptions but is subject to a slower convergence rate. It is thus of interest to develop an estimator that combines the two and possesses the attractive properties of both.

Note that (2.3) satisfies the so-called doubly robust (DR) property: for two generic functions E~(Y|𝐗,Z)\widetilde{E}(Y|\mathbf{X},Z) and π~(𝐗,Z)\widetilde{\pi}(\mathbf{X},Z), it holds that

E{Y(z)}=E[{YE~(Y|𝐗,Z)}π~(𝐗,Z)+E𝐗{E~(Y|𝐗,Z)}|Z=z],\displaystyle E\{Y^{*}(z)\}=E[\{Y-\widetilde{E}(Y|\mathbf{X},Z)\}\widetilde{\pi}(\mathbf{X},Z)+E_{\mathbf{X}}\{\widetilde{E}(Y|\mathbf{X},Z)\}|Z=z]\,, (3.10)

if either E~(Y|𝐗,Z)=E(Y|𝐗,Z)\widetilde{E}(Y|\mathbf{X},Z)=E(Y|\mathbf{X},Z) or π~(𝐗,Z)=π0(𝐗,Z)\widetilde{\pi}(\mathbf{X},Z)=\pi_{0}(\mathbf{X},Z) holds, but not necessarily both; see Supplement C for the derivation.

Under model (3.1), (3.10) equals a+𝒯b(t)z(t)𝑑ta+\int_{\mathcal{T}}b(t)z(t)\,dt. To estimate aa and bb, it suffices to conduct the functional linear regression as in Section 3.1 to regress an estimator of {YE(Y|𝐗,Z)}π(𝐗,Z)+E𝐗{E(Y|𝐗,Z)}\{Y-E(Y|\mathbf{X},Z)\}\pi(\mathbf{X},Z)+E_{\mathbf{X}}\{E(Y|\mathbf{X},Z)\} on ZZ. Specifically, we estimate E(Y|𝐗,Z)E(Y|\mathbf{X},Z) by the OR estimator, E^(Y|𝐗,Z)=a^OR+𝒯b^OR(t)Z(t)𝑑t+θ^OR𝐗\widehat{E}(Y|\mathbf{X},Z)=\widehat{a}_{\mathrm{OR}}+\int_{\mathcal{T}}\widehat{b}_{\mathrm{OR}}(t)Z(t)\,dt+\widehat{\theta}_{\mathrm{OR}}^{\top}\mathbf{X}, and π0(𝐗,Z)\pi_{0}(\mathbf{X},Z) by π^hk(𝐗,Z)\widehat{\pi}_{hk}(\mathbf{X},Z). We define the DR estimator of bb as b^DR(t)=j=1qb^DR,jϕ^j(t)\widehat{b}_{\mathrm{DR}}(t)=\sum_{j=1}^{q}\widehat{b}_{\mathrm{DR},j}\widehat{\phi}_{j}(t), where b^DR,j=λ^j1e^DR,j\widehat{b}_{\mathrm{DR},j}=\widehat{\lambda}_{j}^{-1}\widehat{e}_{\mathrm{DR},j} with e^DR,j=𝒯e^DR(t)ϕ^j(t)𝑑t\widehat{e}_{\mathrm{DR},j}=\int_{\mathcal{T}}\widehat{e}_{\mathrm{DR}}(t)\widehat{\phi}_{j}(t)\,dt. Here, e^DR(t)\widehat{e}_{\mathrm{DR}}(t) is defined as

1ni=1n[{YiE^(Y|𝐗i,Zi)}π^hk(𝐗i,Zi)+1nj=1n{E^(Y|𝐗j,Zi)}μ^Y,DR]{Zi(t)μ^Z(t)},\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}\bigg{[}\{Y_{i}-\widehat{E}(Y|\mathbf{X}_{i},Z_{i})\}\widehat{\pi}_{hk}(\mathbf{X}_{i},Z_{i})+\dfrac{1}{n}\sum_{j=1}^{n}\{\widehat{E}(Y|\mathbf{X}_{j},Z_{i})\}-\widehat{\mu}_{Y,\mathrm{DR}}\bigg{]}\{Z_{i}(t)-\widehat{\mu}_{Z}(t)\}\,,

where

μ^Y,DR=1ni=1n[{YiE^(Y|𝐗i,Zi)}π^hk(𝐗i,Zi)+1nj=1n{E^(Y|𝐗j,Zi)}].\displaystyle\widehat{\mu}_{Y,\mathrm{DR}}=\dfrac{1}{n}\sum_{i=1}^{n}\bigg{[}\{Y_{i}-\widehat{E}(Y|\mathbf{X}_{i},Z_{i})\}\widehat{\pi}_{hk}(\mathbf{X}_{i},Z_{i})+\dfrac{1}{n}\sum_{j=1}^{n}\{\widehat{E}(Y|\mathbf{X}_{j},Z_{i})\}\bigg{]}\,.

Also, the DR estimator of aa is defined as a^DR=μ^Y,DR𝒯b^DR(t)μ^Z(t)𝑑t\widehat{a}_{\mathrm{DR}}=\widehat{\mu}_{Y,\mathrm{DR}}-\int_{\mathcal{T}}\widehat{b}_{\mathrm{DR}}(t)\widehat{\mu}_{Z}(t)\,dt.

Provided that model (3.1) is correctly specified, according to the DR property (3.10), the DR estimator E^DR{Y(z)}=a^DR+𝒯b^DR(t)z(t)𝑑t\widehat{E}_{\mathrm{DR}}\{Y^{*}(z)\}=\widehat{a}_{\mathrm{DR}}+\int_{\mathcal{T}}\widehat{b}_{\mathrm{DR}}(t)z(t)\,dt is consistent if either E^(Y|𝐗,Z)\widehat{E}(Y|\mathbf{X},Z) or π^hk(𝐗,Z)\widehat{\pi}_{hk}(\mathbf{X},Z) is consistent. The consistency of E^(Y|𝐗,Z)\widehat{E}(Y|\mathbf{X},Z) mainly depends on correct specification of the partially linear model (3.9), while the consistency of π^hk(𝐗,Z)\widehat{\pi}_{hk}(\mathbf{X},Z), as a nonparametric estimator, relies on less restrictive (but technical) assumptions; see Section 4 for details.
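The pseudo-outcome inside the square brackets above, which is then regressed on ZZ as in Section 3.1, can be sketched as follows. The matrix layout E_hat[i, j] for the fitted values Ê(Y|X_j, Z_i) is an assumed interface for illustration only.

```python
import numpy as np

def dr_pseudo_outcome(Y, E_hat, pi_hat):
    """Illustrative doubly robust pseudo-outcome: E_hat[i, j] holds an
    estimate of E(Y | X_j, Z_i) (assumed layout) and pi_hat[i] an estimate
    of the FSW pi_0(X_i, Z_i).  Its empirical mean gives mu_hat_{Y,DR}."""
    fitted_own = np.diag(E_hat)        # E_hat evaluated at (X_i, Z_i)
    augment = E_hat.mean(axis=1)       # (1/n) sum_j E_hat(Y | X_j, Z_i)
    return (Y - fitted_own) * pi_hat + augment
```

With a correctly specified outcome model, the residual term vanishes and the weight drops out entirely, which illustrates one side of the DR property (3.10).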

4 Asymptotic Properties

We investigate asymptotic convergence rates of our proposed three estimators. Recall that our main model is E{Y(z)}=a+𝒯b(t)z(t)𝑑tE\{Y^{*}(z)\}=a+\int_{\mathcal{T}}b(t)z(t)\,dt as in (3.1). Although the estimators of aa are based on the estimators of bb, they are essentially empirical mean estimators, which can achieve the n1/2{n}^{-1/2} convergence rate; see Shin (2009). Thus, our primary focus is to derive the convergence rates of our estimators of bb, which, depending on the smoothness of ZZ and bb, are slower than those in finite-dimensional settings.

Recall that 𝒳p\mathcal{X}\subset\mathbb{R}^{p} denotes the support of 𝐗\mathbf{X}, and let 𝒵L2(𝒯)\mathcal{Z}\subset L^{2}(\mathcal{T}) be the support of ZZ. To derive the convergence rate of b^FSW\widehat{b}_{\mathrm{FSW}}, we first provide the uniform convergence rate of the estimator π^hk\widehat{\pi}_{hk} of the FSW π0\pi_{0} over 𝒳\mathcal{X} and 𝒵\mathcal{Z}, which requires the following conditions.

Condition A

  1. (A1)

    The set 𝒳\mathcal{X} is compact. For all z𝒵z\in\mathcal{Z} and 𝐱𝒳\mathbf{x}\in\mathcal{X}, π0(z,𝐱)\pi_{0}(z,\mathbf{x}) is strictly bounded away from zero and infinity.

  2. (A2)

    For all z𝒵z\in\mathcal{Z}, ηzk\exists~{}\eta_{z}\in\mathbb{R}^{k} and a constant α>0\alpha>0 such that sup(z,𝐱)𝒵×𝒳|ρ(1){π0(z,𝐱)}ηzνk(𝐱)|=O(kα)\sup_{(z,\mathbf{x})\in\mathcal{Z}\times\mathcal{X}}|\rho^{\prime(-1)}\{\pi_{0}(z,\mathbf{x})\}-\eta_{z}^{\top}\nu_{k}(\mathbf{x})|=O(k^{-\alpha}), where ρ(1)\rho^{\prime(-1)} is the inverse function of ρ\rho^{\prime}.

  3. (A3)

    The eigenvalues of E{νk(𝐗)νk(𝐗)|Z=z}E\{\nu_{k}(\mathbf{X})\nu_{k}(\mathbf{X})^{\top}|Z=z\} are bounded away from zero and infinity uniformly in kk and zz. There exists a sequence of constants ζ(k)\zeta(k) satisfying sup𝐱𝒳νk(𝐱)ζ(k)\sup_{\mathbf{x}\in\mathcal{X}}\|\nu_{k}(\mathbf{x})\|\leq\zeta(k).

  4. (A4)

    There exists a continuously differentiable function μ(h)>0\mu(h)>0, for h>0h>0, such that for all z𝒵z\in\mathcal{Z}, P{d(Z,z)h}/μ(h)P\{d(Z,z)\leq h\}/\mu(h) is strictly bounded away from zero and infinity. There exists h0>0h_{0}>0 such that suph(0,h0)μ(h)\sup_{h\in(0,h_{0})}\mu^{\prime}(h) is bounded.

  5. (A5)

    The kernel K:+K:\mathbb{R}\to\mathbb{R}^{+} is bounded, Lipschitz and supported on [0,1][0,1] with K(1)>0K(1)>0.

  6. (A6)

    For gk1,k2,z(𝐗)=ρ{ηzνk(𝐗)}vk1(𝐗)g_{k_{1},k_{2},z}(\mathbf{X})=\rho^{\prime}\{\eta_{z}^{\top}\nu_{k}(\mathbf{X})\}v_{k_{1}}(\mathbf{X}) and vk1(𝐗)vk2(𝐗)v_{k_{1}}(\mathbf{X})v_{k_{2}}(\mathbf{X}), there exists λ>0\lambda>0 such that for all k1,k2=1,,kk_{1},k_{2}=1,\ldots,k and all kk, all z1,z2𝒵z_{1},z_{2}\in\mathcal{Z} and all m2m\geq 2, |E[gk1,k2,z1(𝐗)|Z=z1]E[gk1,k2,z2(𝐗)|Z=z2]|/dλ(z1,z2)\big{|}E[g_{k_{1},k_{2},z_{1}}(\mathbf{X})|Z=z_{1}]-E[g_{k_{1},k_{2},z_{2}}(\mathbf{X})|Z=z_{2}]\big{|}/d^{\lambda}(z_{1},z_{2}) and E[|gk1,k2,z1(𝐗)|m|Z=z1]E[|g_{k_{1},k_{2},z_{1}}(\mathbf{X})|^{m}|Z=z_{1}] are bounded.

  7. (A7)

    Let ψ𝒵(ϵ)\psi_{\mathcal{Z}}(\epsilon) be the Kolmogorov ϵ\epsilon-entropy of 𝒵\mathcal{Z}, i.e., ψ𝒵(ϵ)=log{Nϵ(𝒵)}\psi_{\mathcal{Z}}(\epsilon)=\log\{N_{\epsilon}(\mathcal{Z})\}, where Nϵ(𝒵)N_{\epsilon}(\mathcal{Z}) is the minimal number of open balls in L2(𝒯)L^{2}(\mathcal{T}) of radius ϵ\epsilon (with the semi-metric dd) covering 𝒵\mathcal{Z}. It satisfies n=1exp[(δ1)ψ𝒵{(logn)/n}]<\sum_{n=1}^{\infty}\exp[-(\delta-1)\psi_{\mathcal{Z}}\{(\log n)/n\}]<\infty for some δ>1\delta>1, and (logn)2/{nμ(h)}<ψ𝒵{(logn)/n}<nμ(h)/logn(\log n)^{2}/\{n\mu(h)\}<\psi_{\mathcal{Z}}\{(\log n)/n\}<n\mu(h)/\log n for nn large enough.

Condition (A1) requires that 𝐗\mathbf{X} is compactly supported, which can be relaxed by restricting the tail behaviour of the distribution of 𝐗\mathbf{X} at the cost of more tedious technical arguments. The boundedness of π0\pi_{0} in Condition (A1) (or some equivalent condition) is also commonly required in the literature (Kennedy et al., 2017; Ai et al., 2021; D’Amour et al., 2021). This condition can be relaxed if other smoothness assumptions are made on the potential outcome distribution (Ma and Wang, 2020). Condition (A2) essentially assumes that the convergence rate of the sieve approximation ηzνk(𝐱)\eta_{z}^{\top}\nu_{k}(\mathbf{x}) for ρ(1){π0(z,𝐱)}\rho^{\prime(-1)}\{\pi_{0}(z,\mathbf{x})\} is polynomial. This can be satisfied, for example, with α=+\alpha=+\infty if 𝐗\mathbf{X} is discrete, and with α=s/r\alpha=s/r if 𝐗\mathbf{X} has rr continuous components and νk\nu_{k} is a power series, where ss is the degree of smoothness of ρ(1){π0(z,)}\rho^{\prime(-1)}\{\pi_{0}(z,\cdot)\} w.r.t. the continuous components in 𝐗\mathbf{X} for any fixed zL2(𝒯)z\in L^{2}(\mathcal{T}). Further, Condition (A3) ensures that the sieve approximation conditional on a functional variable is not degenerate. Similar conditions are routinely assumed in the sieve approximation literature (Newey, 1997). Conditions (A4) to (A6) are standard in functional nonparametric regression (Ferraty and Vieu, 2006; Ferraty et al., 2010). In particular, the function μ(h)\mu(h) in Condition (A4), also referred to as the small ball probability, has been studied extensively in the literature (Li and Shao, 2001; Ferraty and Vieu, 2006); Condition (A5) requires that KK is compactly supported, which is not satisfied by the Gaussian kernel but is convenient for technical arguments; the variable gk1,k2,z(𝐗)g_{k_{1},k_{2},z}(\mathbf{X}) in Condition (A6) plays the role of the response variable in the traditional regression setting.
Condition (A7) is less standard: it regularizes the support 𝒵\mathcal{Z} by controlling its Kolmogorov entropy, which is used to establish the uniform convergence rate on 𝒵\mathcal{Z}. This condition is satisfied, for example, if 𝒵\mathcal{Z} is a compact set, d(,)d(\cdot,\cdot) is a projection metric and (logn)2=O{nμ(h)}(\log n)^{2}=O\{n\mu(h)\}; see Ferraty et al. (2010) for other examples.

Theorem 4.1.

Under Conditions (A1) to (A7), assuming that kk\to\infty and h0h\to 0 as nn\to\infty, we have

sup(z,𝐱)𝒵×𝒳|π^hk(z,𝐱)π0(z,𝐱)|=O[ζ(k){kα+hλk+ψ𝒵{(logn)/n}knμ(h)}]\displaystyle\sup_{(z,\mathbf{x})\in\mathcal{Z}\times\mathcal{X}}|\widehat{\pi}_{hk}(z,\mathbf{x})-\pi_{0}(z,\mathbf{x})|=O\bigg{[}\zeta(k)\bigg{\{}k^{-\alpha}+h^{\lambda}\sqrt{k}+\sqrt{\dfrac{\psi_{\mathcal{Z}}\{(\log n)/n\}k}{n\mu(h)}}\bigg{\}}\bigg{]}

holds almost surely.

Theorem 4.1 provides the sup-norm convergence rate of π^hk\widehat{\pi}_{hk} over the support of 𝐗\mathbf{X} and ZZ. The proof is given in Supplement D. The term ζ(k)(kα+hλk)\zeta(k)(k^{-\alpha}+h^{\lambda}\sqrt{k}) and the term ζ(k)ψ𝒵{(logn)/n}k/{nμ(h)}\zeta(k)\sqrt{\psi_{\mathcal{Z}}\{(\log n)/n\}k/\{n\mu(h)\}} can be viewed as the convergence rates of the bias and standard deviation, respectively.

To provide the convergence rate of b^FSW\widehat{b}_{\mathrm{FSW}}, we recall that E{π0(Z,𝐗)Y|Z=z}=a+𝒯b(t)z(t)𝑑tE\{\pi_{0}(Z,\mathbf{X})Y|Z=z\}=a+\int_{\mathcal{T}}b(t)z(t)\,dt under model (3.1). Let ϵπ=π0(Z,𝐗)Ya𝒯b(t)Z(t)𝑑t\epsilon_{\pi}=\pi_{0}(Z,\mathbf{X})Y-a-\int_{\mathcal{T}}b(t)Z(t)\,dt be the residual variable. Let γ>1\gamma>1 and β>γ/2+1\beta>\gamma/2+1 be two constants and the cjc_{j}’s be some generic positive constants.

Condition B

  1. (B1)

    The variable ϵπ\epsilon_{\pi} has a finite fourth moment, i.e., E(ϵπ4)<E(\epsilon_{\pi}^{4})<\infty.

  2. (B2)

    The functional variable ZZ has a finite fourth moment, i.e., 𝒯E{Z4(t)}𝑑t<\int_{\mathcal{T}}E\{Z^{4}(t)\}\,dt<\infty. The PC score ξj\xi_{j} defined in (3.3) satisfies E(ξj4)c1λj2E(\xi_{j}^{4})\leq c_{1}\lambda_{j}^{2}, and the eigenvalue λj\lambda_{j} defined in (3.2) satisfies λjλj+1c21jγ1\lambda_{j}-\lambda_{j+1}\geq c_{2}^{-1}j^{-\gamma-1} for all j1j\geq 1.

  3. (B3)

    The coefficient bjb_{j} satisfies |bj|c3jβ|b_{j}|\leq c_{3}j^{-\beta}, for all j1j\geq 1.

Condition (B1) imposes a mild moment restriction on ϵπ\epsilon_{\pi}. Condition (B2) imposes regularity conditions on the random process ZZ, which, in particular, requires that the differences of adjacent eigenvalues do not decay too fast. Condition (B3) assumes an upper bound for the decay rate of bjb_{j}, the coefficients of bb projected on {ϕj}j=1\{\phi_{j}\}_{j=1}^{\infty}. The latter two conditions are adopted from Hall and Horowitz (2007) for technical purposes.

Theorem 4.2.

Under Conditions (A1) to (A7), (B1) to (B3) and model (3.1), assuming that kk\to\infty and h0h\to 0 as nn\to\infty and choosing the truncation parameter qn1/(γ+2β)q\asymp n^{1/(\gamma+2\beta)}, we have

𝒯{b^FSW(t)b(t)}2𝑑t=Op[{k2α+h2λk+ψ𝒵{(logn)/n}knμ(h)}ζ2(k)n(γ+1)/(γ+2β)].\displaystyle\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{FSW}}(t)-b(t)\}^{2}\,dt=O_{p}\bigg{[}\bigg{\{}k^{-2\alpha}+h^{2\lambda}k+\dfrac{\psi_{\mathcal{Z}}\{(\log n)/n\}k}{n\mu(h)}\bigg{\}}\zeta^{2}(k)n^{(\gamma+1)/(\gamma+2\beta)}\bigg{]}\,.

The proof of Theorem 4.2 is given in Supplement E. The convergence rate of 𝒯{b^FSW(t)b(t)}2𝑑t\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{FSW}}(t)-b(t)\}^{2}\,dt in Theorem 4.2 can be quite slow, mainly because of the small ball probability function μ(h)\mu(h). In the case of a Gaussian process endowed with a metric, μ(h)\mu(h) decays exponentially in hh (Li and Shao, 2001), so the resulting bound on 𝒯{b^FSW(t)b(t)}2𝑑t\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{FSW}}(t)-b(t)\}^{2}\,dt does not even tend to zero. However, μ(h)\mu(h) can be made much larger by choosing a proper semi-metric dd (Ferraty and Vieu, 2006; Ling and Vieu, 2018), so that the final convergence rate of 𝒯{b^FSW(t)b(t)}2𝑑t\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{FSW}}(t)-b(t)\}^{2}\,dt becomes polynomial in nn. Another way to improve the convergence rate is to impose additional structural assumptions. For example, if we assume that the slope function bb can be fully characterized by a finite number of PC basis functions, or that the random process ZZ essentially lies on a finite-dimensional space (not necessarily Euclidean; see Lin and Yao, 2021), then the truncation parameter qq does not need to tend to infinity and the convergence rate of 𝒯{b^FSW(t)b(t)}2𝑑t\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{FSW}}(t)-b(t)\}^{2}\,dt can be much faster. Without such assumptions, estimating bb in an infinite-dimensional space with a nonparametric estimator of π0\pi_{0} leads to a cumbersome convergence rate.

The OR estimator b^OR\widehat{b}_{\mathrm{OR}} relies on FPCA and the partially linear model (3.9). Additional conditions are needed to derive the convergence rate of b^OR\widehat{b}_{\mathrm{OR}}.

Condition C

  1. (C1)

    The covariate 𝐗\mathbf{X} has finite fourth moment, i.e., E(𝐗4)<E(\|\mathbf{X}\|^{4})<\infty. For m=1,,pm=1,\ldots,p and j1j\geq 1, |Cov{Xm,𝒯Z(t)ϕj(t)𝑑t}|<c4jγβ|\operatorname{Cov}\{X_{m},\int_{\mathcal{T}}Z(t)\phi_{j}(t)\,dt\}|<c_{4}j^{-\gamma-\beta}.

  2. (C2)

    Let gm(t)=j=1Cov{Xm,𝒯Z(t)ϕj(t)𝑑t}ϕj(t)/λjg_{m}(t)=\sum_{j=1}^{\infty}\operatorname{Cov}\{X_{m},\int_{\mathcal{T}}Z(t)\phi_{j}(t)\,dt\}\phi_{j}(t)/\lambda_{j} and eim=XimE(Xm)𝒯gm(t){Zi(t)μZ(t)}𝑑te_{im}=X_{im}-E(X_{m})-\int_{\mathcal{T}}g_{m}(t)\{Z_{i}(t)-\mu_{Z}(t)\}\,dt. For m=1,,pm=1,\ldots,p, E(e1m2|Z1)<E(e_{1m}^{2}|Z_{1})<\infty, and E(𝐞1𝐞1)E(\mathbf{e}_{1}^{\top}\mathbf{e}_{1}) is positive definite with 𝐞1=(e11,,e1p)\mathbf{e}_{1}=(e_{11},\ldots,e_{1p})^{\top}.

Conditions (C1) and (C2), adopted from Shin (2009), impose mild assumptions on the covariate 𝐗\mathbf{X} to ensure n\sqrt{n}-consistency of the estimated coefficient θ^OR\widehat{\theta}_{\mathrm{OR}}.

Theorem 4.3.

Under Conditions (B2), (B3), (C1), (C2) and model (3.9), choosing the truncation parameter qn1/(γ+2β)q\asymp n^{1/(\gamma+2\beta)}, we have

𝒯{b^OR(t)b(t)}2𝑑t=Op(n(2β+1)/(γ+2β)).\displaystyle\int_{\mathcal{T}}\{\widehat{b}_{\mathrm{OR}}(t)-b(t)\}^{2}\,dt=O_{p}(n^{(-2\beta+1)/(\gamma+2\beta)})\,.

Because b^OR\widehat{b}_{\mathrm{OR}} is theoretically equivalent to the least squares estimator proposed by Shin (2009) provided that each component estimator is unique (Buja et al., 1989), the theorem above follows directly from Theorem 3.2 in Shin (2009), and thus its proof is omitted. The convergence rate of b^OR\widehat{b}_{\mathrm{OR}} is the same as that of the estimated slope function in the traditional functional linear regression (see Hall and Horowitz, 2007); i.e., the rate remains unchanged despite the additional estimation of θ\theta. Compared with Theorem 4.2, the convergence rate here, Op(n(2β+1)/(γ+2β))=Op(n1)Op(n(γ+1)/(γ+2β))O_{p}(n^{(-2\beta+1)/(\gamma+2\beta)})=O_{p}(n^{-1})\cdot O_{p}(n^{(\gamma+1)/(\gamma+2\beta)}), is much faster. Specifically, the factor Op(ζ2(k)[k2α+h2λk+ψ𝒵{(logn)/n}k/{nμ(h)}])O_{p}\big{(}\zeta^{2}(k)\big{[}k^{-2\alpha}+h^{2\lambda}k+\psi_{\mathcal{Z}}\{(\log n)/n\}k/\{n\mu(h)\}\big{]}\big{)}, which is the convergence rate of sup(z,𝐱)𝒵×𝒳|π^hk(z,𝐱)π0(z,𝐱)|2\sup_{(z,\mathbf{x})\in\mathcal{Z}\times\mathcal{X}}|\widehat{\pi}_{hk}(z,\mathbf{x})-\pi_{0}(z,\mathbf{x})|^{2}, is replaced by the faster Op(n1)O_{p}(n^{-1}) because the nonparametric estimator π^hk\widehat{\pi}_{hk} is not used.
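To make the rate comparison concrete, a quick computation with hypothetical smoothness constants (γ, β) = (2, 3), which satisfy the requirements γ > 1 and β > γ/2 + 1 but are otherwise our own choice:

```python
# Hypothetical smoothness constants; Theorems 4.2 and 4.3 only require
# gamma > 1 and beta > gamma/2 + 1, so (2, 3) is an illustrative choice.
gamma, beta = 2, 3
or_rate = (-2 * beta + 1) / (gamma + 2 * beta)   # OR exponent: n^{-5/8}
fsw_extra = (gamma + 1) / (gamma + 2 * beta)     # extra factor in FSW: n^{3/8}
# The algebraic identity (-2b+1)/(g+2b) = -1 + (g+1)/(g+2b) makes the
# decomposition O_p(n^{-1}) * O_p(n^{(gamma+1)/(gamma+2beta)}) explicit.
```

So in this illustrative case the OR estimator attains an n^{-5/8} rate in squared error, while the FSW estimator pays an additional n^{3/8} factor on top of the weight-estimation error.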

According to the DR property (3.10), the estimator b^DR\widehat{b}_{\mathrm{DR}} inherits advantages from b^OR\widehat{b}_{\mathrm{OR}} and b^FSW\widehat{b}_{\mathrm{FSW}}, depending on which conditions are satisfied. Specifically, if model (3.1) is correctly specified, b^DR\widehat{b}_{\mathrm{DR}} has the same convergence rate as b^FSW\widehat{b}_{\mathrm{FSW}} given that the conditions in Theorem 4.2 are satisfied. If the stronger model assumption (3.9) holds, then b^DR\widehat{b}_{\mathrm{DR}} enjoys the same faster convergence rate as b^OR\widehat{b}_{\mathrm{OR}}, provided that Condition C is satisfied.

5 Selection of Tuning Parameters

For the FSW estimator, our methodology requires a kernel function KK with a metric dd, basis functions νk\nu_{k} and tuning parameters hh, kk and qq. Using a compactly supported kernel KK would make the denominator of the first term in (3.7) too small for some ZiZ_{i} and destabilize the maximization of Hhk,ZiH_{hk,Z_{i}} in (3.7). Therefore, we suggest using the Gaussian kernel, K(u)=exp(u2)K(u)=\exp(-u^{2}), with the semi-metric dd being the L2L^{2} norm. To develop an estimation strategy for π0\pi_{0} that is applicable to a generic dataset, we standardize the 𝐗i\mathbf{X}_{i}’s and use a unified set of choices for νk\nu_{k}. Letting 𝐗min=(mini{Xi1}i=1n,,mini{Xip}i=1n)\mathbf{X}_{\mathrm{min}}=(\min_{i}\{X_{i1}\}_{i=1}^{n},\ldots,\min_{i}\{X_{ip}\}_{i=1}^{n})^{\top}, 𝐗max=(maxi{Xi1}i=1n,,maxi{Xip}i=1n)\mathbf{X}_{\mathrm{max}}=(\max_{i}\{X_{i1}\}_{i=1}^{n},\ldots,\max_{i}\{X_{ip}\}_{i=1}^{n})^{\top}, we transform the 𝐗i\mathbf{X}_{i}’s,

𝐗i,st=2(𝐗i𝐗min)𝐗max𝐗min1,\displaystyle\mathbf{X}_{i,\mathrm{st}}=\dfrac{2(\mathbf{X}_{i}-\mathbf{X}_{\mathrm{min}})}{\mathbf{X}_{\mathrm{max}}-\mathbf{X}_{\mathrm{min}}}-1\,,

so that they have support on [1,1][-1,1]. Let Xij,stX_{ij,\mathrm{st}} be the jj-th component of 𝐗i,st\mathbf{X}_{i,\mathrm{st}} and 𝒫\mathcal{P}_{\ell} be the \ell-th standard Legendre polynomial on [1,1][-1,1]. Given kk, we define v1(𝐗i)=1v_{1}(\mathbf{X}_{i})=1 and vp(1)+j+1(𝐗i)=𝒫(Xij,st)v_{p(\ell-1)+j+1}(\mathbf{X}_{i})=\mathcal{P}_{\ell}(X_{ij,\mathrm{st}}), for i=1,,ni=1,\ldots,n, j=1,,pj=1,\ldots,p and =1,,(k1)/p\ell=1,\ldots,(k-1)/p. We restrict the range of kk such that (k1)/p(k-1)/p is a positive integer, and orthonormalize the resulting matrix (νk(𝐗1),,νk(𝐗n))\big{(}\nu_{k}(\mathbf{X}_{1}),\ldots,\nu_{k}(\mathbf{X}_{n})\big{)}^{\top}.
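The standardization and Legendre sieve construction above can be sketched as follows; the helper name is ours, and numpy's Legendre utilities stand in for whatever implementation one prefers.

```python
import numpy as np
from numpy.polynomial import legendre

def sieve_basis(X, k):
    """Illustrative sieve design matrix: standardize each column of X to
    [-1, 1], prepend a constant, stack the standard Legendre polynomials
    P_1, ..., P_{(k-1)/p} of every standardized column, then orthonormalize."""
    n, p = X.shape
    assert (k - 1) % p == 0, "require (k - 1)/p to be a positive integer"
    Xmin, Xmax = X.min(axis=0), X.max(axis=0)
    Xst = 2 * (X - Xmin) / (Xmax - Xmin) - 1      # componentwise, as in the text
    cols = [np.ones(n)]
    for ell in range(1, (k - 1) // p + 1):
        coef = np.zeros(ell + 1)
        coef[ell] = 1                             # coefficient vector selecting P_ell
        for j in range(p):
            cols.append(legendre.legval(Xst[:, j], coef))
    V = np.column_stack(cols)                     # n x k design matrix
    Q, _ = np.linalg.qr(V)                        # orthonormalize the columns
    return Q
```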

We suggest selecting h,kh,k and qq jointly by an LL-fold cross validation (CV). Specifically, we randomly split the dataset into LL parts, 𝒮1,,𝒮L\mathcal{S}_{1},\ldots,\mathcal{S}_{L}. Let 𝒮\mathcal{S}_{-\ell} denote the remaining sample with 𝒮\mathcal{S}_{\ell} excluded. We define the CV loss,

CVLFSW(h,k,q)\displaystyle\mathrm{CV}_{L}^{\mathrm{FSW}}(h,k,q)
==1Li𝒮[Yiπ^hk,(Zi,𝐗i)a^,FSWj=1qb^j,,FSW{𝒯ϕ^j(t)μ^Z(t)𝑑t+ξ^ij}]2,\displaystyle=\sum_{\ell=1}^{L}\sum_{i\in\mathcal{S}_{\ell}}\bigg{[}Y_{i}\widehat{\pi}_{hk,-\ell}(Z_{i},\mathbf{X}_{i})-\widehat{a}_{-\ell,\mathrm{FSW}}-\sum_{j=1}^{q}\widehat{b}_{j,-\ell,\mathrm{FSW}}\bigg{\{}\int_{\mathcal{T}}\widehat{\phi}_{j}(t)\widehat{\mu}_{Z}(t)\,dt+\widehat{\xi}_{ij}\bigg{\}}\bigg{]}^{2}\,,

where ξ^ij=𝒯{Zi(t)μ^Z(t)}ϕ^j(t)𝑑t\widehat{\xi}_{ij}=\int_{\mathcal{T}}\{Z_{i}(t)-\widehat{\mu}_{Z}(t)\}\widehat{\phi}_{j}(t)\,dt is the estimated PC score, and a^,FSW\widehat{a}_{-\ell,\mathrm{FSW}}, b^j,,FSW\widehat{b}_{j,-\ell,\mathrm{FSW}} and π^hk,\widehat{\pi}_{hk,-\ell} are obtained by applying the method in Section 3.1 to 𝒮\mathcal{S}_{-\ell}. From a given candidate set of (h,k,q)(h,k,q), we choose the combination that minimizes CVLFSW(h,k,q)\mathrm{CV}_{L}^{\mathrm{FSW}}(h,k,q).
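A generic sketch of the LL-fold CV grid search follows; the `cv_loss` callback abstracting the FSW (or OR/DR) fitting routine is an assumed interface, not the authors' code.

```python
import numpy as np

def cv_select(params, cv_loss, n, L=10, seed=0):
    """Illustrative L-fold CV grid search over candidate tuning parameters.
    `cv_loss(train_idx, test_idx, p)` is a user-supplied callback returning
    the held-out squared-error loss of the estimator fitted on `train_idx`
    with tuning parameters p."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, L)                # L roughly equal folds
    best, best_loss = None, np.inf
    for p in params:
        loss = sum(cv_loss(np.setdiff1d(idx, fold), fold, p) for fold in folds)
        if loss < best_loss:
            best, best_loss = p, loss
    return best
```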

For the OR estimator, we need to select the truncation parameter qq, which can be chosen by an LL-fold CV similar to the above. Specifically, we consider

CVLOR(q)==1Li𝒮[Yia^,ORj=1qb^j,,OR{𝒯ϕ^j(t)μ^Z(t)𝑑t+ξ^ij}θ^,OR𝐗i]2,\displaystyle\mathrm{CV}_{L}^{\textrm{OR}}(q)=\sum_{\ell=1}^{L}\sum_{i\in\mathcal{S}_{\ell}}\bigg{[}Y_{i}-\widehat{a}_{-\ell,\textrm{OR}}-\sum_{j=1}^{q}\widehat{b}_{j,-\ell,\textrm{OR}}\bigg{\{}\int_{\mathcal{T}}\widehat{\phi}_{j}(t)\widehat{\mu}_{Z}(t)\,dt+\widehat{\xi}_{ij}\bigg{\}}-\widehat{\theta}^{\top}_{-\ell,\textrm{OR}}\mathbf{X}_{i}\bigg{]}^{2}\,,

where a^,OR\widehat{a}_{-\ell,\textrm{OR}}, b^j,,OR\widehat{b}_{j,-\ell,\textrm{OR}} and θ^,OR\widehat{\theta}_{-\ell,\textrm{OR}} are obtained by applying the method in Section 3.2 to 𝒮\mathcal{S}_{-\ell}. From a given candidate set of qq, we choose the one that minimizes CVLOR(q)\mathrm{CV}_{L}^{\textrm{OR}}(q).

Similarly, for the DR estimator, we still need to select the truncation parameter qq. This can be done by minimizing the CV loss,

CVLDR(q)\displaystyle\mathrm{CV}_{L}^{\mathrm{DR}}(q) ==1Li𝒮[{YiE^(Y|𝐗i,Zi)}π^hk,(𝐗i,Zi)+1|𝒮|j𝒮{E^(Y|𝐗j,Zi)}\displaystyle=\sum_{\ell=1}^{L}\sum_{i\in\mathcal{S}_{\ell}}\bigg{[}\{Y_{i}-\widehat{E}_{-\ell}(Y|\mathbf{X}_{i},Z_{i})\}\widehat{\pi}_{hk,-\ell}(\mathbf{X}_{i},Z_{i})+\dfrac{1}{|\mathcal{S}_{\ell}|}\sum_{j\in\mathcal{S}_{\ell}}\{\widehat{E}_{-\ell}(Y|\mathbf{X}_{j},Z_{i})\}
a^,DRj=1qb^j,,DR{𝒯ϕ^j(t)μ^Z(t)dt+ξ^ij}]2,\displaystyle\quad-\widehat{a}_{-\ell,\mathrm{DR}}-\sum_{j=1}^{q}\widehat{b}_{j,-\ell,\mathrm{DR}}\bigg{\{}\int_{\mathcal{T}}\widehat{\phi}_{j}(t)\widehat{\mu}_{Z}(t)\,dt+\widehat{\xi}_{ij}\bigg{\}}\bigg{]}^{2}\,,

where |𝒮||\mathcal{S}_{\ell}| denotes the number of elements in 𝒮\mathcal{S}_{\ell}, and a^,DR\widehat{a}_{-\ell,\mathrm{DR}}, E^(Y|𝐗i,Zi)\widehat{E}_{-\ell}(Y|\mathbf{X}_{i},Z_{i}), π^hk,\widehat{\pi}_{hk,-\ell} and b^j,,DR\widehat{b}_{j,-\ell,\mathrm{DR}} are obtained by applying the methods in Section 3.2 to 𝒮\mathcal{S}_{-\ell}.

6 Simulation Study

As the ADRF is parametrized by a functional linear model, E{Y(z)}=a+𝒯b(t)z(t)𝑑tE\{Y^{*}(z)\}=a+\int_{\mathcal{T}}b(t)z(t)\,dt, we mainly focus on the performance of the estimators of bb, which measures the intensity of the functional treatment effect. In addition to the methods of functional stabilized weight (FSW), outcome regression (OR) and double robustness (DR) developed in Section 3, we also consider the method using the nonparametric principal component weight (PCW) proposed by Zhang et al. (2021), which assumes the same functional linear model for the ADRF, as well as the Naive method, which directly applies the functional linear regression of YY on ZZ without adjusting for 𝐗\mathbf{X}.

We choose the number of PCs used in estimating the PCW so as to explain 95% of the variance of ZZ following Zhang et al. (2021), and take (h^,k^,q^)=argmin(h,k,q)CVLFSW(\widehat{h},\widehat{k},\widehat{q})=\operatorname{argmin}_{(h,k,q)}\mathrm{CV}_{L}^{\mathrm{FSW}} for FSW with L=10L=10. For a fair comparison, we use the same truncation parameter q^\widehat{q} in estimating bb for all other methods.

Table 1: Mean (standard deviation) of the integrated squared error (ISE ×102\times 10^{2}) of the slope function with the smallest values highlighted in boldface.
Model nn FSW OR DR PCW Naive
(i) 200200 20.44 (32.44) 19.39 (26.33) 20.34 (29.14) 37.24 (47.63) 47.32 (33.41)
500500 8.66 (11.91) 7.73 (14.05) 7.94 (14.08) 23.58 (34.93) 33.56 (16.27)
(ii) 200200 62.32 (68.12) 79.71 (124.73) 64.26 (100.92) 69.14 (169.68) 83.96 (131.40)
500500 37.71 (42.06) 39.23 (54.55) 34.15 (49.04) 35.62 (48.69) 43.02 (57.89)
(iii) 200200 39.97 (39.36) 43.99 (68.86) 46.68 (72.61) 96.56 (81.09) 148.51 (69.76)
500500 21.34 (13.11) 13.94 (14.82) 14.86 (17.36) 66.68 (68.84) 118.98 (27.22)
(iv) 200200 26.16 (27.40) 58.16 (45.99) 61.16 (54.35) 114.26 (55.43) 156.95 (51.66)
500500 18.27 (17.56) 48.42 (30.19) 49.88 (35.27) 91.86 (30.02) 139.97 (32.99)

We consider the sample sizes n=200n=200 and 500. For i=1,,ni=1,\ldots,n, we generate the functional treatment Zi(t)Z_{i}(t), for t[0,1]t\in[0,1], by Zi(t)=j=16Aijϕj(t)Z_{i}(t)=\sum_{j=1}^{6}A_{ij}\phi_{j}(t), where Ai1=4Ui1A_{i1}=4U_{i1}, Ai2=23Ui2A_{i2}=2\sqrt{3}U_{i2}, Ai3=22Ui3A_{i3}=2\sqrt{2}U_{i3}, Ai4=2Ui4A_{i4}=2U_{i4}, Ai5=Ui5A_{i5}=U_{i5}, Ai6=Ui6/2A_{i6}=U_{i6}/\sqrt{2} with the UijU_{ij}’s being independent standard normal variables, ϕ2m1(t)=2sin(2mπt)\phi_{2m-1}(t)=\sqrt{2}\sin(2m\pi t) and ϕ2m(t)=2cos(2mπt)\phi_{2m}(t)=\sqrt{2}\cos(2m\pi t) for m=1,2,3m=1,2,3. For the slope function bb, we define b(t)=2ϕ1(t)+ϕ2(t)+ϕ3(t)/2+ϕ4(t)/2b(t)=2\phi_{1}(t)+\phi_{2}(t)+\phi_{3}(t)/2+\phi_{4}(t)/2. The ZiZ_{i}’s and bb are the same as those in Zhang et al. (2021). We consider four models for the covariate 𝐗i\mathbf{X}_{i} and the outcome variable YiY_{i}:

  1. (i)

    Xi=Ui1+ϵi1X_{i}=U_{i1}+\epsilon_{i1} and Yi=1+01b(t)Zi(t)𝑑t+2Xi+ϵi2Y_{i}=1+\int_{0}^{1}b(t)Z_{i}(t)\,dt+2X_{i}+\epsilon_{i2},

  2. (ii)

    Xi=Ui1+ϵi1X_{i}=U_{i1}+\epsilon_{i1} and Yi=1+01b(t)Zi(t)𝑑t+3Xi2+1.5sin(Xi)+ϵi2Y_{i}=1+\int_{0}^{1}b(t)Z_{i}(t)\,dt+3X_{i}^{2}+1.5\sin(X_{i})+\epsilon_{i2},

  3. (iii)

    𝐗i=(Xi1,Xi2)=((Ui1+1)2+ϵi1,Ui2)\mathbf{X}_{i}=(X_{i1},X_{i2})^{\top}=((U_{i1}+1)^{2}+\epsilon_{i1},U_{i2})^{\top} and Yi=1+01b(t)Zi(t)𝑑t+2Xi1+2Xi2+ϵi2Y_{i}=1+\int_{0}^{1}b(t)Z_{i}(t)\,dt+2X_{i1}+2X_{i2}+\epsilon_{i2},

  4. (iv)

    𝐗i=(Xi1,Xi2)=((Ui1+1)2+ϵi1,Ui2)\mathbf{X}_{i}=(X_{i1},X_{i2})^{\top}=((U_{i1}+1)^{2}+\epsilon_{i1},U_{i2})^{\top} and Yi=1+01b(t)Zi(t)𝑑t+2Xi1+2cos(Xi1)+5.5sin(Xi2)+ϵi2Y_{i}=1+\int_{0}^{1}b(t)Z_{i}(t)\,dt+2X_{i1}+2\cos(X_{i1})+5.5\sin(X_{i2})+\epsilon_{i2},

where ϵi1N(0,1)\epsilon_{i1}\sim N(0,1) and ϵi2N(0,25)\epsilon_{i2}\sim N(0,25) are generated independently. The PC scores AijA_{ij} of Zi(t)Z_{i}(t) affect covariate 𝐗i\mathbf{X}_{i} linearly in models (i) and (ii) and nonlinearly in models (iii) and (iv), and covariate 𝐗i\mathbf{X}_{i} affects the outcome variable YiY_{i} linearly in models (i) and (iii) and nonlinearly in models (ii) and (iv). The functional linear model (3.1) for the ADRF is correctly specified for all four models, while the partially linear model (3.9) is only correctly specified for models (i) and (iii). Due to the confounding effect of 𝐗\mathbf{X}, the Naive method ignoring 𝐗\mathbf{X} is expected to be biased. For each combination of model and sample size, we run 200 simulation replications and evaluate the results by the integrated squared error (ISE) of the slope function, [0,1]{b^(t)b(t)}2𝑑t\int_{[0,1]}\{\widehat{b}(t)-b(t)\}^{2}\,dt, where b^(t)\widehat{b}(t) denotes the estimator produced by the FSW, OR, DR, PCW and Naive methods, respectively.
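The data-generating process for model (i) can be sketched as follows; the grid size and seed handling are our own choices for illustration.

```python
import numpy as np

def simulate_model_i(n, ngrid=101, seed=0):
    """Illustrative data-generating process for model (i) in Section 6.
    Returns (tgrid, Z, X, Y) with Z an n x ngrid matrix of treatment curves."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, ngrid)
    # Fourier basis phi_1, ..., phi_6 in the order (sin, cos) per frequency
    phis = [np.sqrt(2) * f(2 * m * np.pi * t) for m in (1, 2, 3)
            for f in (np.sin, np.cos)]
    scales = np.array([4, 2 * np.sqrt(3), 2 * np.sqrt(2), 2, 1, 1 / np.sqrt(2)])
    U = rng.standard_normal((n, 6))
    A = U * scales                                  # PC scores A_{i1}, ..., A_{i6}
    Z = A @ np.vstack(phis)
    b = 2 * phis[0] + phis[1] + phis[2] / 2 + phis[3] / 2
    dt = t[1] - t[0]
    X = U[:, 0] + rng.standard_normal(n)            # X_i = U_{i1} + eps_{i1}
    Y = 1 + (Z * b).sum(axis=1) * dt + 2 * X + rng.normal(0, 5, n)
    return t, Z, X, Y
```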

Table 1 summarizes the mean and the standard deviation of the ISEs ×102\times 10^{2} for all configurations. Our proposed estimators FSW, OR and DR show consistency as the sample size increases, and outperform PCW and Naive in almost all cases. In particular, our three estimators are close to each other and better than PCW and Naive under models (i) and (iii), where covariate 𝐗\mathbf{X} affects the outcome variable YY linearly. In the cases where the partially linear model (3.9) is misspecified, FSW and DR, followed closely by PCW, outperform OR under model (ii); FSW performs the best, followed by OR and DR, which still significantly outperform PCW and Naive under model (iv). Our results suggest that, when the partially linear model is unlikely to hold, FSW or DR is preferred over OR. The performance of PCW is better than Naive but inferior to our methods under all models.

Figure 2: True curve (—), first (– – –), second (– \cdot –) and third (\cdot \cdot \cdot) quartile estimated slope functions under model (i) with n=200n=200 and 500 for all methods.

To visualize representative estimated slope functions, Figure 2 shows, for model (i), the quartile curves, i.e., the estimates whose ISEs attain the first, second and third quartiles of the 200 ISE values. It can be seen that the estimated curves of all methods except Naive are close to the truth, and those produced by PCW and Naive tend to be more variable than the others. Figure 3 shows the corresponding results under model (iv), a more challenging model structure. The differences among the estimators are more pronounced: FSW performs the best, followed by OR and DR, while PCW and Naive are more biased and variable than the other three.

Figure 3: True curve (—), first (– – –), second (– \cdot –) and third (\cdot \cdot \cdot) quartile estimated slope functions under model (iv) with n=200n=200 and 500 for all methods.

7 Real Data Analysis

We illustrate estimation of the ADRF using the five methods, FSW, OR, DR, PCW and Naive, on the electroencephalography (EEG) dataset from Ciarleglio et al. (2022). The EEG is a relatively low-cost tool to measure human neuronal activities. The measurements are taken from multiple electrodes placed on the scalps of subjects, and are then processed and transformed to produce current source density curves on the frequency domain that provide information about the intensity of neuronal activities. In particular, the frontal asymmetry curves, considered as potential biomarkers for major depressive disorder, are treated as the functional treatment variable Z(t)Z(t), for t[4,31.75]t\in[4,31.75] Hz. The outcome variable YY is defined as log(Y~+1)\log(\widetilde{Y}+1), where Y~\widetilde{Y} is the quick inventory of depressive symptomatology (QIDS) score measuring the severity of depressive symptomatology. A larger value of YY indicates more severe depressive disorder. We are interested in investigating the causal effect of the intensity of neuronal activities, represented by the frontal asymmetry curves, on the severity of major depressive disorder. Potential confounding covariates 𝐗=(X1,X2,X3)\mathbf{X}=(X_{1},X_{2},X_{3})^{\top} include age X1X_{1}, sex X2X_{2} (11 for female and 0 for male) and Edinburgh Handedness Inventory score X3X_{3} (ranging from 100-100 to 100100, corresponding to completely left to completely right handedness). Individuals with missing ZZ are removed, which results in a sample of 85 males and 151 females. The means (standard deviations) of YY, X1X_{1} and X3X_{3} are 2.73 (0.72), 35.97 (13.07) and 72.69 (48.76), respectively. For comparison of the estimated ADRF across different methods, the confounding variables 𝐗\mathbf{X} are centred.

As in Section 6, we use the number of PCs explaining 95% of the variance of ZZ (equal to 1111 for this dataset) in estimating PCW. We minimize CVLFSW\mathrm{CV}_{L}^{\mathrm{FSW}} with L=20L=20 to choose the truncation parameter qq used in estimating the slope function bb for all methods, as well as the tuning parameters hh and kk used in FSW. As a result, q=2q=2, explaining about 80% of the variance of ZZ, is chosen for estimating the slope function.

The left panel of Figure 4 shows the estimated slope functions. All of them share the same increasing trend over frequencies, while taking negative values at low frequencies, implying a negative effect on the outcome variable there. In other words, subjects with higher frontal asymmetry values on the low frequency domain tend to be healthier. This is consistent with the findings of Ciarleglio et al. (2022), who reported a negative association between the frontal asymmetry curves and major depressive disorder status using their functional regression model. More specifically, in the left panel of Figure 4, the estimated slope function of FSW is the steepest and shows the largest effect in the low frequency domain. In contrast, the estimated slope function of Naive has the smallest absolute values overall; the other three curves are quite close to each other.

To visualize the negative effect found from the estimated slope function, we compute the estimated ADRF E\{Y^*(z)\} at three curves, which correspond to the smallest, median and largest average frontal asymmetry values over the low frequencies 4 to 15 Hz. The middle panel of Figure 4 displays the selected curves, and the right panel shows the corresponding estimates of E\{Y^*(z)\} from all methods, which are around 2.9, 2.8 and 2.3 for the three curves, respectively.
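Under the functional linear model for the ADRF, evaluating the estimate at a particular curve reduces to an intercept plus an inner product of the estimated slope function with that curve. The sketch below illustrates this evaluation and the selection of the smallest/median/largest-average curves; the slope function, intercept and curves are synthetic placeholders, not the estimates from the EEG data.

```python
import numpy as np

def adrf_linear(z, b_hat, t, a_hat=0.0):
    """Fitted functional-linear ADRF: a_hat + integral of b_hat(t) z(t) dt,
    approximated by the trapezoidal rule on the grid t."""
    f = b_hat * z
    return a_hat + float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2)

rng = np.random.default_rng(1)
t = np.linspace(4, 31.75, 112)
curves = 0.05 * rng.normal(size=(30, len(t))).cumsum(axis=1)  # toy curves
b_hat = (t - 10) / 200.0  # hypothetical slope: negative at low frequencies

# Select the curves with the smallest, median and largest average value
# over the low frequencies 4 to 15 Hz, as in the text.
avg_low = curves[:, t <= 15].mean(axis=1)
order = np.argsort(avg_low)
picks = order[[0, len(order) // 2, len(order) - 1]]
for i in picks:
    print(round(adrf_linear(curves[i], b_hat, t, a_hat=2.7), 3))
```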

Figure 4: Left: the estimated slope functions using FSW (—), OR (– –), DR (– \cdot –), PCW (\cdots) and Naive (\circ); middle: the frontal asymmetry curves corresponding to the smallest (\circ), median (+) and largest (*) average values over 4 to 15 Hz; right: the estimated E\{Y^*(z)\} of the three selected curves, with marks corresponding to the middle panel.

8 Discussion

In the case of a functional treatment variable, we propose three estimators of the ADRF, namely the FSW, OR and DR estimators, based on the functional linear model for the ADRF. The consistency of the FSW estimator relies only on the functional linear model for E\{Y^*(z)\}, with the weight estimated nonparametrically, whereas the OR estimator requires a more restrictive linear model for Y in terms of (Z,\mathbf{X}). The DR estimator is consistent if either of the first two estimators is consistent. Asymptotic convergence rates are provided, and the numerical results demonstrate the advantages of our methods.

It is of interest to construct a confidence band for the slope function to better quantify the ADRF or the ATE, which, however, is difficult even in the simpler context of functional linear regression. As shown by Cardot et al. (2007), no estimator of the slope function can converge in distribution to a nondegenerate random element under the norm topology. For the OR estimator, it is possible to adapt the method of Imaizumi and Kato (2019) to our context to construct an approximate confidence band. For the other two estimators, constructing a confidence band is challenging because of the less restrictive modelling assumption and the nonparametric estimation of the FSW, which warrants further investigation.

References

  • Ai et al. (2021) Ai, C., O. Linton, K. Motegi, and Z. Zhang (2021). A unified framework for efficient estimation of general treatment models. Quantitative Economics 12(3), 779–816.
  • Athey et al. (2018) Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(4), 597–623.
  • Bonvini and Kennedy (2022) Bonvini, M. and E. H. Kennedy (2022). Fast convergence rates for dose-response estimation. arXiv preprint arXiv:2207.11825.
  • Buja et al. (1989) Buja, A., T. Hastie, and R. Tibshirani (1989). Linear smoothers and additive models. The Annals of Statistics, 453–510.
  • Cardot et al. (2007) Cardot, H., A. Mas, and P. Sarda (2007). CLT in functional linear regression models. Probability Theory and Related Fields 138(3), 325–361.
  • Chan et al. (2016) Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(3), 673–700.
  • Ciarleglio et al. (2022) Ciarleglio, A., E. Petkova, and O. Harel (2022). Elucidating age and sex-dependent association between frontal EEG asymmetry and depression: An application of multiple imputation in functional regression. Journal of the American Statistical Association 117(537), 12–26.
  • Delaigle and Hall (2010) Delaigle, A. and P. Hall (2010). Defining probability density for a distribution of random functions. The Annals of Statistics 38(2), 1171–1193.
  • Ding et al. (2019) Ding, P., A. Feller, and L. Miratrix (2019). Decomposing treatment effect variation. Journal of the American Statistical Association 114(525), 304–317.
  • D’Amour et al. (2021) D’Amour, A., P. Ding, A. Feller, L. Lei, and J. Sekhon (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics 221(2), 644–654.
  • Ferraty et al. (2010) Ferraty, F., A. Laksaci, A. Tadj, and P. Vieu (2010). Rate of uniform consistency for nonparametric estimates with functional variables. Journal of Statistical Planning and Inference 140(2), 335–352.
  • Ferraty and Vieu (2006) Ferraty, F. and P. Vieu (2006). Nonparametric functional data analysis: theory and practice, Volume 76. Springer.
  • Fong et al. (2018) Fong, C., C. Hazlett, and K. Imai (2018). Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics 12, 156–177.
  • Galvao and Wang (2015) Galvao, A. F. and L. Wang (2015). Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment. Journal of the American Statistical Association 110(512), 1528–1542.
  • Guo et al. (2021) Guo, W., X.-H. Zhou, and S. Ma (2021). Estimation of optimal individualized treatment rules using a covariate-specific treatment effect curve with high-dimensional covariates. Journal of the American Statistical Association 116(533), 309–321.
  • Hall and Horowitz (2007) Hall, P. and J. L. Horowitz (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics 35(1), 70–91.
  • Hansen et al. (1996) Hansen, L. P., J. Heaton, and A. Yaron (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14(3), 262–280.
  • Hirano and Imbens (2004) Hirano, K. and G. W. Imbens (2004). The propensity score with continuous treatments. Applied Bayesian Modeling and Causal Inference from Incomplete-data Perspectives 226164, 73–84.
  • Hirano et al. (2003) Hirano, K., G. W. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189.
  • Imai and Ratkovic (2014) Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 243–263.
  • Imai and Ratkovic (2015) Imai, K. and M. Ratkovic (2015). Robust estimation of inverse probability weights for marginal structural models. Journal of the American Statistical Association 110(511), 1013–1023.
  • Imai and van Dyk (2004) Imai, K. and D. A. van Dyk (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99, 854–866.
  • Imaizumi and Kato (2019) Imaizumi, M. and K. Kato (2019). A simple method to construct confidence bands in functional linear regression. Statistica Sinica 29(4), 2055–2081.
  • Imbens et al. (1998) Imbens, G. W., R. H. Spady, and P. Johnson (1998). Information theoretic approaches to inference in moment condition models. Econometrica 66(2), 333.
  • Kennedy (2019) Kennedy, E. H. (2019). Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association 114(526), 645–656.
  • Kennedy et al. (2017) Kennedy, E. H., Z. Ma, M. D. McHugh, and D. S. Small (2017). Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(4), 1229–1245.
  • Kong et al. (2016) Kong, D., K. Xue, F. Yao, and H. H. Zhang (2016). Partially functional linear regression in high dimensions. Biometrika 103(1), 147–159.
  • Li and Li (2019) Li, F. and F. Li (2019). Propensity score weighting for causal inference with multiple treatments. The Annals of Applied Statistics 13(4), 2389–2415.
  • Li and Shao (2001) Li, W. V. and Q.-M. Shao (2001). Gaussian processes: inequalities, small ball probabilities and applications. Handbook of Statistics 19, 533–597.
  • Lin and Yao (2021) Lin, Z. and F. Yao (2021). Functional regression on the manifold with contamination. Biometrika 108(1), 167–181.
  • Ling and Vieu (2018) Ling, N. and P. Vieu (2018). Nonparametric modelling for functional data: selected survey and tracks for future. Statistics 52(4), 934–949.
  • Lopez and Gutman (2017a) Lopez, M. J. and R. Gutman (2017a). Estimating the average treatment effects of nutritional label use using subclassification with regression adjustment. Statistical Methods in Medical Research 26(2), 839–864.
  • Lopez and Gutman (2017b) Lopez, M. J. and R. Gutman (2017b). Estimation of causal effects with multiple treatments: a review and new ideas. Statistical Science, 432–454.
  • Ma and Wang (2020) Ma, X. and J. Wang (2020). Robust inference using inverse probability weighting. Journal of the American Statistical Association 115(532), 1851–1860.
  • Moodie and Stephens (2012) Moodie, E. E. and D. A. Stephens (2012). Estimation of dose–response functions for longitudinal data using the generalised propensity score. Statistical Methods in Medical Research 21(2), 149–166.
  • Morris (2015) Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application 2, 321–359.
  • Newey (1997) Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79(1), 147–168.
  • Owen (1988) Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75(2), 237–249.
  • Robins (2000) Robins, J. M. (2000). Marginal structural models versus structural nested models as tools for causal inference. In Statistical models in epidemiology, the environment, and clinical trials, pp.  95–133. Springer.
  • Robins et al. (2000) Robins, J. M., M. A. Hernan, and B. Brumback (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11(5), 550–560.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 45–55.
  • Shin (2009) Shin, H. (2009). Partial functional linear regression. Journal of Statistical Planning and Inference 139(10), 3405–3418.
  • Tan (2020) Tan, Z. (2020). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. The Annals of Statistics 48(2), 811–837.
  • Wang et al. (2016) Wang, J.-L., J.-M. Chiou, and H.-G. Müller (2016). Functional data analysis. Annual Review of Statistics and its Application 3, 257–295.
  • Wong et al. (2019) Wong, R. K., Y. Li, and Z. Zhu (2019). Partially linear functional additive models for multivariate functional data. Journal of the American Statistical Association 114(525), 406–418.
  • Zhang and Wang (2016) Zhang, X. and J.-L. Wang (2016). From sparse to dense functional data and beyond. The Annals of Statistics 44(5), 2281–2321.
  • Zhang et al. (2021) Zhang, X., W. Xue, and Q. Wang (2021). Covariate balancing functional propensity score for functional treatments in cross-sectional observational studies. Computational Statistics & Data Analysis.
  • Zhao et al. (2015) Zhao, Y.-Q., D. Zeng, E. B. Laber, and M. R. Kosorok (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association 110(510), 583–598.