
Estimation of conditional average treatment effects on distributed confidential data

Yuji Kawamata Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Japan, yjkawamata@gmail.com    Ryoki Motai Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan, motai.ryoki.tg@alumni.tsukuba.ac.jp    Yukihiko Okada Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Japan, okayu@sk.tsukuba.ac.jp    Akira Imakura Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Japan, imakura@cs.tsukuba.ac.jp    Tetsuya Sakurai Center for Artificial Intelligence Research, University of Tsukuba, Tsukuba, Japan, sakurai@cs.tsukuba.ac.jp
Abstract

Estimation of conditional average treatment effects (CATEs) is an important topic in many scientific fields. CATEs can be estimated with high accuracy if data distributed across multiple parties can be centralized. However, it is difficult to aggregate such data owing to confidentiality or privacy concerns. To address this issue, we propose data collaboration double machine learning, a method that can estimate CATE models from privacy-preserving fusion data constructed from distributed data, and evaluate our method through simulations. Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric CATE models enable estimation and testing that are more robust to model mis-specification than parametric models. Second, our method enables collaborative estimation between multiple time points and different parties through the accumulation of a knowledge base. Third, our method performed as well as or better than other methods in simulations using synthetic, semi-synthetic and real-world datasets.

Keywords— Statistical causal inference; Conditional average treatment effect; Privacy-preserving; Distributed data; Double machine learning

1 Introduction

The Neyman–Rubin model (Imbens and Rubin, 2015), or potential outcomes framework, is one of the main causal inference frameworks for estimating average treatment (intervention) effects and has been developed and applied in many studies since its establishment by Rubin. In recent years, there have been innovations that enable the estimation of individual or conditional average treatment effects (CATEs) by adapting machine learning as well as statistical methods (e.g., Künzel et al. (2019); Athey et al. (2019)). Most methods for estimating CATEs assume that the data can be centralized in one place. However, if the distributed data contain confidential or private information, it is difficult to centralize them in one place.

Conversely, in the field of machine learning, a privacy-preserving analysis approach called federated learning has been developed in recent years (Konečnỳ et al., 2016; McMahan et al., 2017; Rodríguez-Barroso et al., 2020; Criado et al., 2022; Rodríguez-Barroso et al., 2023; Rafi et al., 2024). Data collaboration (DC) analysis (Imakura and Sakurai, 2020; Bogdanova et al., 2020; Imakura et al., 2021b, 2023b, 2023a), one such federated learning framework, is a method that enables collaborative analysis on privacy-preserving fusion data. The privacy-preserving fusion data are constructed by sharing dimensionality-reduced intermediate representations instead of raw data. DC analysis was originally proposed to address regression and classification problems on distributed data. Recently, Kawamata et al. (2024) proposed the data collaboration quasi-experiment (DC-QE), which extends DC analysis to the estimation of the average treatment effect (ATE). It is important to note that although those studies often focus their discussions on private data (personal information), the technologies can also be used to protect management information held by companies. Therefore, in this paper, we collectively refer to such data that are difficult to disclose to the outside as confidential information. Below, we discuss techniques for preserving confidential information.

In this paper, we propose data collaboration double machine learning (DC-DML) as a method to estimate CATEs while preserving the confidentiality of distributed data by incorporating double machine learning (DML) (Chernozhukov et al., 2018), a machine learning-based treatment effect estimation method, into DC-QE. Moreover, through numerical simulations, we evaluate the performance of our method compared to existing methods.

This paper relates to partially linear models, a class of semi-parametric models. Since Engle et al. (1986), partially linear models have been used in empirical studies where assuming a linear model between the covariates and the outcome is not appropriate, i.e., to avoid model mis-specification owing to linearity assumptions. A significant recent advance in partially linear models is DML (Chernozhukov et al., 2018), which provides a least squares estimator that can incorporate common machine learning models. Various causal inference methods have been developed by extending DML (Fan et al., 2022; Bia et al., 2023). In this paper, we extend DML into a method that enables CATE estimation on distributed data while preserving confidentiality.

The effects of policy interventions by governments or medical treatments by hospitals are likely to differ across people. It is possible to improve the overall performance of a treatment if one knows which subjects should be treated. For example, by concentrating scholarship support on those whose future income increases will be higher owing to the treatment, a limited budget can be used more efficiently. In addition, by concentrating medication on those who are more likely to improve or have fewer side effects from the treatment, medication can be used more effectively and safely. One approach to estimating CATEs for these situations with high accuracy is to centralize data from many parties, but this is difficult owing to confidentiality or privacy concerns. Our method enables the estimation of CATEs with high accuracy while maintaining the confidentiality of data distributed across many parties.

Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric (or non-parametric) CATE models enable estimation and testing that are more robust to model mis-specification than parametric models. However, to our knowledge, no communication-efficient method has been proposed for estimating and testing semi-parametric (or non-parametric) CATE models on distributed data. Second, our method enables collaborative estimation between different parties as well as multiple time points because the dimensionality-reduced intermediate representations can be accumulated as a knowledge base. Third, our method performed as well as or better than other methods in evaluation simulations using synthetic, semi-synthetic and real-world datasets. Fig. 1 shows the summary of our study.

Figure 1: The summary of our study.

The remainder of this paper is organized as follows. In Section 2, we present related works on causal inference and machine learning. We describe the basics of causal inference and distributed data in Section 3. We describe our method in Section 4. In Section 5, we report the numerical simulations. In Section 6, we discuss the results. Finally, we conclude in Section 7. Acronyms and their definitions used in this paper are summarized in Appendix A.

2 Related works

The field of treatment effect estimation has advanced significantly in recent years through the incorporation of machine learning methods as well as statistics. However, studies that take into account confidentiality preservation, which is our aim, are limited. Here, we briefly review existing methods.

DML is an innovative method proposed by Chernozhukov et al. (2018) that allows estimation of the ATE using arbitrary machine learning methods. DML is a semi-parametric method that uses machine learning to estimate nuisance parameters needed for estimating the target parameters. More specifically, DML first estimates the nuisance parameters using machine learning, and then estimates the target parameters using the estimated nuisance parameters. DML resolves the influences of regularization bias and over-fitting on the estimates of target parameters through Neyman–orthogonal scores and cross-fitting, respectively. As shown in Section 3, DML can be extended easily to allow estimation of CATEs. DML is a useful method for estimating treatment effects, but it requires centralized data. To achieve our goal, we need to extend DML to address distributed data. In Section 3, we describe DML in detail.

As representative methods that utilize machine learning for CATE estimation, generalized random forest (GRF) (Athey et al., 2019) and X-Learner (Künzel et al., 2019) have been proposed, but these methods cannot be applied to distributed data. (Many other methods for estimating CATEs have been proposed, such as non-parametric Bayesian methods (Hill, 2011; Taddy et al., 2016) and model averaging (Rolling et al., 2019; Shi et al., 2024); however, these methods also cannot be applied to distributed data.) GRF is a non-parametric method for estimating treatment effects by solving local moment equations using random forests. GRF first constructs a random forest, then calculates weights for each sample from the random forest, and finally derives the treatment effect by solving a moment equation using the calculated weights. X-Learner is a non-parametric method that can estimate treatment effects from any machine learning model. X-Learner uses propensity scores (Rosenbaum and Rubin, 1983) to resolve the performance degradation that occurs when the controlled or treated group is much larger than the other group.

Conversely, Secure regression (SR) (Karr et al., 2005) and FedCI (Vo et al., 2022) are methods that can estimate CATEs from distributed data. SR is a parametric method for estimating linear regression models. In SR, the statistics necessary to compute least squares estimates are calculated by each local party, and these statistics are shared. SR is communication-efficient because it enables estimation of a linear model by sharing statistics only once. However, in SR, bias owing to mis-specification can occur if the assumptions of the linear model are incorrect. FedCI is a non-parametric method that constructs a potential outcome model based on Gaussian processes through a federated learning approach. In FedCI, communication between the clients and the server is iterated: each client sends the model gradient calculated from its local data to the server, then the server averages the aggregated gradients, updates its own model parameters, and returns those parameters to the clients. When local data are stored in locations isolated from the Internet, or when synchronized communication is not possible, FedCI is difficult to execute.

While the purpose of our method is to estimate CATEs, ifedtree (Tan et al., 2022), Federated MLE, IPW-MLE and AIPW (Xiong et al., 2023), and DC-QE (Kawamata et al., 2024) have been proposed for different estimation targets. These methods can address distributed data. ifedtree is a tree-based model averaging approach that improves CATE estimation at a specific party by sharing models derived from other parties instead of raw data. Our method aims to estimate CATEs on the aggregated whole data, rather than on a specific party's data. Federated MLE, IPW-MLE and AIPW are parametric methods that aggregate individually estimated gradients for estimating propensity scores and treatment effects. Federated MLE, IPW-MLE and AIPW estimate the ATE, which is not conditioned on covariates.

DC-QE is a semi-parametric method for estimating the ATE based on the DC analysis framework. DC-QE first constructs collaborative representations from dimensionality-reduced intermediate representations collected from all parties instead of raw data, then estimates propensity scores, and finally estimates the ATE using the estimated propensity scores (see Appendix B for the pseudo-code of DC-QE). Our method extends DC-QE to incorporate DML for CATE estimation.

Table 1 summarizes the five methods described in this section that are particularly relevant to our method. In Section 4, we propose a method for estimating CATEs with a semi-parametric model while preserving the confidentiality of distributed data. Although DML, GRF and X-Learner are methods for centralized data, our method addresses distributed data. Moreover, our method is more robust than SR to model mis-specification and more communication-efficient than FedCI.

Methods | Can it deal with distributed data? | Assumptions of CATE model | Does it need iterative communication?
DML (Chernozhukov et al., 2018) | No | Semi-parametric | -
GRF (Athey et al., 2019) | No | Non-parametric | -
X-Learner (Künzel et al., 2019) | No | Non-parametric | -
SR (Karr et al., 2005) | Yes | Parametric | No
FedCI (Vo et al., 2022) | Yes | Non-parametric | Yes
DC-DML (our method) | Yes | Semi-parametric | No
Table 1: Existing and our methods.

3 DML with the linear CATE model

In this section, we describe the DML method with the linear CATE model. Under the Neyman–Rubin causal model, let the variables for subject $i$ be $y_{i}(0)$, $y_{i}(1)$, $z_{i}$ and $\boldsymbol{x}_{i}$, where $\boldsymbol{x}_{i}=(x_{i}^{1},\cdots,x_{i}^{m})^{\top}\in\mathbb{R}^{m}$ is the $m$-dimensional covariate vector, $z_{i}\in\{0,1\}$ is the treatment, $y_{i}(1)$ is the outcome that subject $i$ would have if treated ($z_{i}=1$), and $y_{i}(0)$ is that if not (controlled; $z_{i}=0$). Moreover, we denote $y_{i}=y_{i}(z_{i})$. With this definition, the CATE is defined as

\tau(\boldsymbol{x}_{i})=\mathbb{E}[y_{i}(1)-y_{i}(0)|\boldsymbol{x}_{i}].

In DML, we make the following structural equation assumptions on the data generating process as

y_{i}=\theta(\boldsymbol{x}_{i})\cdot z_{i}+u(\boldsymbol{x}_{i})+\varepsilon_{i}, \quad (1)
z_{i}=h(\boldsymbol{x}_{i})+\eta_{i}. \quad (2)

$\theta$ is the function that represents the effect of the treatment conditional on the covariates $\boldsymbol{x}_{i}$. The covariates $\boldsymbol{x}_{i}$ affect the treatment $z_{i}$ via the function $h$ and the outcome via the function $u$, where $h$ and $u$ belong to convex subsets $\mathbb{H}$ and $\mathbb{U}$ of some normed vector spaces. $\varepsilon_{i}$ and $\eta_{i}$ are disturbances such that $\mathbb{E}[\eta_{i}|\boldsymbol{x}_{i}]=0$, $\mathbb{E}[\varepsilon_{i}|\boldsymbol{x}_{i}]=0$ and $\mathbb{E}[\eta_{i}\varepsilon_{i}|\boldsymbol{x}_{i}]=0$. In this data generating process, the function of our interest is $\theta$ because obviously $\tau(\boldsymbol{x}_{i})=\theta(\boldsymbol{x}_{i})$, while $h$ and $u$ are not functions of our interest. Note that Chernozhukov et al. (2018) assumed that $\theta$ is constant, i.e., independent of $\boldsymbol{x}_{i}$. However, as we discuss below, we can estimate $\theta$ similarly to Chernozhukov et al. (2018) even when $\theta$ depends linearly on $\boldsymbol{x}_{i}$.

In this paper, we derive $\theta$ by a partialling-out approach as follows: subtracting the conditional expectation of (1) given $\boldsymbol{x}_{i}$ from (1), we can re-write the structural equations as

y_{i}-\mathbb{E}[y_{i}|\boldsymbol{x}_{i}]=\theta(\boldsymbol{x}_{i})\cdot(z_{i}-\mathbb{E}[z_{i}|\boldsymbol{x}_{i}])+\varepsilon_{i}.

Here, define $q(\boldsymbol{x}_{i})=\mathbb{E}[y_{i}|\boldsymbol{x}_{i}]$, where $q$ belongs to a convex subset $\mathbb{Q}$ of some normed vector space. Then, the residuals for $q$ and $h$ are

\zeta_{i}=y_{i}-q(\boldsymbol{x}_{i}),
\eta_{i}=z_{i}-h(\boldsymbol{x}_{i}).

So we can obtain

\zeta_{i}=\theta(\boldsymbol{x}_{i})\cdot\eta_{i}+\varepsilon_{i}. \quad (3)

Note that $\mathbb{E}[\zeta_{i}|\boldsymbol{x}_{i}]=0$ because $\mathbb{E}[\eta_{i}|\boldsymbol{x}_{i}]=0$ and $\mathbb{E}[\varepsilon_{i}|\boldsymbol{x}_{i}]=0$.

We assume $\theta$ to be a linear function with a constant term such that

\theta(\boldsymbol{x}_{i})=[1,\boldsymbol{x}_{i}^{\top}]\boldsymbol{\beta}=\beta^{0}+x_{i}^{1}\beta^{1}+\cdots+x_{i}^{m}\beta^{m},

where $\boldsymbol{\beta}=[\beta^{0},\cdots,\beta^{m}]^{\top}\in\mathbb{R}^{m+1}$. Re-write (3) as

\zeta_{i}=\eta_{i}[1,\boldsymbol{x}_{i}^{\top}]\boldsymbol{\beta}+\varepsilon_{i}.

To derive $\boldsymbol{\beta}$, consider the following score vector:

\psi(\boldsymbol{x}_{i};\hat{\boldsymbol{\beta}},\hat{q},\hat{h})=[1,\boldsymbol{x}_{i}^{\top}]^{\top}(z_{i}-\hat{h}(\boldsymbol{x}_{i}))(y_{i}-\hat{q}(\boldsymbol{x}_{i})-[1,\boldsymbol{x}_{i}^{\top}]\hat{\boldsymbol{\beta}}(z_{i}-\hat{h}(\boldsymbol{x}_{i}))). \quad (4)

Define the moment condition

\mathbb{E}[\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q,h)]=\boldsymbol{0}, \quad (5)

and Neyman–orthogonality (Chernozhukov et al., 2018)

\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}[\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q+r(q^{\prime}-q),h+r(h^{\prime}-h))]=\boldsymbol{0}, \quad (6)

where $r\in[0,1)$, $q^{\prime}\in\mathbb{Q}$ and $h^{\prime}\in\mathbb{H}$. Neyman–orthogonality implies first-order insensitivity to perturbations of the nuisance functions $q$ and $h$. Then, we have Theorem 1, whose proof is given in Appendix C.

Theorem 1.

The score vector (4) satisfies both the moment condition (5) and Neyman–orthogonality (6).

Given $\hat{q}$ and $\hat{h}$, $\hat{\boldsymbol{\beta}}$ can be obtained by solving the empirical analogue of (5) on the dataset:

\sum_{i}\psi(\boldsymbol{x}_{i};\hat{\boldsymbol{\beta}},\hat{q},\hat{h})=\sum_{i}\left(\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]^{\top}(\hat{\zeta}_{i}-\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]\hat{\boldsymbol{\beta}})\right)=\boldsymbol{0}. \quad (7)

Equation (7) indicates that $\hat{\boldsymbol{\beta}}$ is the solution of the regression of $\hat{\zeta}_{i}$ on $\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]$. That is,

\hat{\boldsymbol{\beta}}=\operatorname*{arg\,min}_{\boldsymbol{\beta}^{\prime}}\left(\sum_{i}\left(\hat{\zeta}_{i}-\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]\boldsymbol{\beta}^{\prime}\right)^{2}\right).

DML uses machine learning models to estimate $q$ and $h$, and cross-fitting to reduce the effect of over-fitting on the estimation of the parameter $\theta$. In this paper, the data are split into two folds for cross-fitting. To derive the variance of $\hat{\boldsymbol{\beta}}$, re-write (4) as

\psi(\boldsymbol{x}_{i};\hat{\boldsymbol{\beta}},\hat{q},\hat{h})=-\psi^{a}(\boldsymbol{x}_{i};\hat{h})\hat{\boldsymbol{\beta}}+[1,\boldsymbol{x}_{i}^{\top}]^{\top}(z_{i}-\hat{h}(\boldsymbol{x}_{i}))(y_{i}-\hat{q}(\boldsymbol{x}_{i})),

where

\psi^{a}(\boldsymbol{x}_{i};\hat{h})=(z_{i}-\hat{h}(\boldsymbol{x}_{i}))^{2}[1,\boldsymbol{x}_{i}^{\top}]^{\top}[1,\boldsymbol{x}_{i}^{\top}].

From Theorem 3.2 in Chernozhukov et al. (2018), the asymptotic variance of $\sqrt{n}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})$ is

\hat{\sigma}^{2}=\hat{J}^{-1}\frac{1}{L}\sum_{l=1}^{L}\frac{1}{|N_{l}|}\sum_{i\in N_{l}}\psi(\boldsymbol{x}_{i};\hat{\boldsymbol{\beta}},\hat{q}_{l},\hat{h}_{l})\psi(\boldsymbol{x}_{i};\hat{\boldsymbol{\beta}},\hat{q}_{l},\hat{h}_{l})^{\top}(\hat{J}^{-1})^{\top}. \quad (8)

Here, $L$ is the number of folds, $N_{l}$ is the set of subjects in fold $l$, and $\hat{q}_{l}$ and $\hat{h}_{l}$ are the nuisance functions used for fold $l$ (estimated on the remaining folds in cross-fitting). Moreover,

\hat{J}=\frac{1}{L}\sum_{l=1}^{L}\frac{1}{|N_{l}|}\sum_{i\in N_{l}}\psi^{a}(\boldsymbol{x}_{i};\hat{h}_{l}). \quad (9)

Then, the variance of $\hat{\boldsymbol{\beta}}$ is

\textup{Var}(\hat{\boldsymbol{\beta}})=\hat{\sigma}^{2}/n. \quad (10)

In summary, the DML procedure can be expressed as the following two steps. First, $q$ and $h$ are estimated over the two folds using machine learning models, with cross-fitting, to obtain the estimated residuals $\hat{\zeta}_{i}$ and $\hat{\eta}_{i}$. Second, $\boldsymbol{\beta}$ is estimated by regressing $\hat{\zeta}_{i}$ on $\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]$.
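As a concrete illustration of these two steps, the following minimal Python sketch estimates the linear CATE coefficients with two-fold cross-fitting. The random-forest learners, the single random split and the helper name dml_linear_cate are illustrative assumptions for exposition, not the authors' exact implementation; the variance in (8)-(10) can be added analogously.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def dml_linear_cate(X, z, y, seed=0):
    # sketch of DML with a linear CATE model and two-fold cross-fitting
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    folds = [idx[: n // 2], idx[n // 2:]]
    zeta, eta = np.empty(n), np.empty(n)
    for k in (0, 1):
        train, test = folds[1 - k], folds[k]
        q_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        h_hat = RandomForestClassifier(random_state=seed).fit(X[train], z[train])
        zeta[test] = y[test] - q_hat.predict(X[test])             # outcome residual
        eta[test] = z[test] - h_hat.predict_proba(X[test])[:, 1]  # treatment residual
    # regress zeta_i on eta_i * [1, x_i^T] to obtain beta (constant term + coefficients)
    D = eta[:, None] * np.hstack([np.ones((n, 1)), X])
    beta, *_ = np.linalg.lstsq(D, zeta, rcond=None)
    return beta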

4 Method

We describe our method in this section. Prior to that, we explain the data distribution that our method focuses on. Let $n$ be the number of subjects (sample size) in a dataset. Moreover, let $X=[\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}]^{\top}\in\mathbb{R}^{n\times m}$, $Z=[z_{1},\cdots,z_{n}]^{\top}\in\{0,1\}^{n}$ and $Y=[y_{1},\cdots,y_{n}]^{\top}\in\mathbb{R}^{n}$ be the datasets of covariates, treatments, and outcomes, respectively. We consider that the $n$ subjects are partitioned horizontally into $c$ parties, as follows:

X=\left[\begin{array}{c}X_{1}\\ \vdots\\ X_{c}\end{array}\right],\quad Z=\left[\begin{array}{c}Z_{1}\\ \vdots\\ Z_{c}\end{array}\right],\quad Y=\left[\begin{array}{c}Y_{1}\\ \vdots\\ Y_{c}\end{array}\right].

The $k$th party has a partial covariate dataset and the corresponding treatments and outcomes, $X_{k}\in\mathbb{R}^{n_{k}\times m}$, $Z_{k}\in\{0,1\}^{n_{k}}$ and $Y_{k}\in\mathbb{R}^{n_{k}}$, where $n_{k}$ is the number of subjects held by the $k$th party ($n=\sum_{k=1}^{c}n_{k}$).
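For intuition, such a horizontally partitioned dataset can be represented as in the short Python sketch below; the three-party split, sample size and covariate dimension are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n, m, c = 900, 10, 3                        # subjects, covariates, parties (illustrative)
X = rng.normal(size=(n, m))                 # covariates
Z = rng.integers(0, 2, size=n)              # treatments
Y = rng.normal(size=n)                      # outcomes

# horizontal (row-wise) partition: party k holds (X_k, Z_k, Y_k)
parts = np.array_split(np.arange(n), c)
X_parts = [X[p] for p in parts]
Z_parts = [Z[p] for p in parts]
Y_parts = [Y[p] for p in parts]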

This section proceeds as follows: we describe our method in Section 4.1, analyze the correctness of our method in Section 4.2, propose a dimensionality reduction method for DC-DML in Section 4.3, discuss the confidentiality preservation of our method in Section 4.4, and finally discuss the advantages and disadvantages in Section 4.5.

4.1 DC-DML

Our method, DC-DML, is based on DC-QE (Kawamata et al., 2024) and DML (Chernozhukov et al., 2018), and enables CATE estimation while preserving the confidentiality of distributed data. As in Kawamata et al. (2024), DC-DML involves two roles: users and an analyst. Each user $k$ has confidential covariates $X_{k}$, treatments $Z_{k}$ and outcomes $Y_{k}$. DC-DML is conducted in three stages: construction of collaborative representations, estimation by the analyst, and estimation by the users. In the first stage, each user individually constructs intermediate representations and shares them with the analyst instead of the confidential dataset, and the analyst then constructs the collaborative representations. In the second stage, the analyst estimates its own coefficients and variance using DML, and shares them with all users. In the final stage, each user calculates its own coefficients and variances. Each stage is described in turn in Sections 4.1.1 to 4.1.3. Fig. 2 shows an overall illustration of DC-DML. Algorithm 1 is the pseudo-code of DC-DML.

Figure 2: Overall illustration of the proposed method.
Algorithm 1 Data collaboration double machine learning (DC-DML)
1: Input: covariates $X$, treatments $Z$, outcomes $Y$.
2: Output: $\boldsymbol{\beta}_{k}$, $\text{Var}(\boldsymbol{\gamma}_{k})$ and $\text{Var}(\alpha_{k})$.
3:
4: user-$k$-side
5:
6: Generate anchor dataset $X_{k}^{\text{anc}}$ and share it with all users.
7: Set $X^{\text{anc}}$.
8: Generate $F_{k}$ and $\boldsymbol{\mu}_{k}$.
9: Compute $\widetilde{X}_{k}=(X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}$.
10: Compute $\widetilde{X}_{k}^{\text{anc}}=(X^{\text{anc}}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}$.
11: Share $[\boldsymbol{1},\widetilde{X}_{k}]$, $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]$, $Z_{k}$ and $Y_{k}$ with the analyst.
12:
13: analyst-side
14:
15: Get $[\boldsymbol{1},\widetilde{X}_{k}]$, $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]$, $Z_{k}$ and $Y_{k}$ for all $k$.
16: Set $[\boldsymbol{1},\widetilde{X}_{k}]$ and $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]$.
17: Compute $G_{k}$ from $\widetilde{X}_{k}^{\text{anc}}$ for all $k$ such that $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]G_{k}\approx[\boldsymbol{1},\widetilde{X}_{k^{\prime}}^{\text{anc}}]G_{k^{\prime}}~(k\neq k^{\prime})$.
18: Compute $\check{X}_{k}=[\boldsymbol{1},\widetilde{X}_{k}]G_{k}$ for all $k$.
19: Set $\check{X}$, $Z$ and $Y$.
20: Compute functions $\check{q}$ and $\check{h}$ using $\check{X}$, $Z$ and $Y$.
21: Compute residuals $\hat{\boldsymbol{\eta}}$ and $\hat{\boldsymbol{\zeta}}$ using $\check{q}$, $\check{h}$, $\check{X}$, $Z$ and $Y$.
22: Obtain $\check{\boldsymbol{\gamma}}$ and $\text{Var}(\check{\boldsymbol{\gamma}})$ as the least squares solution of the regression equation in Section 4.1.2.
23: Return $R_{k}^{\text{Point}}$ and $R_{k}^{\text{Var}}$ to each user $k$.
24:
25: user-$k$-side
26:
27: Set $R_{k}^{\text{Point}}$ and $R_{k}^{\text{Var}}$.
28: Compute $\boldsymbol{\beta}_{k}$, $\text{Var}(\boldsymbol{\gamma}_{k})$ and $\text{Var}(\alpha_{k})$.

4.1.1 Construction of collaborative representations

In the first stage, construction of collaborative representations, we use the aggregation method proposed in Imakura and Sakurai (2020). First, the users generate and share an anchor dataset $X^{\text{anc}}\in\mathbb{R}^{r\times m}$, which is a shareable dataset consisting of public or randomly constructed dummy data, where $r$ is the number of subjects in the anchor data. The anchor dataset is used to compute the transformation matrices (described below as $G_{k}$) required to construct the collaborative representations. Second, each user $k$ constructs intermediate representations using a linear row-wise dimensionality reduction function $f_{k}$, such as principal component analysis (PCA) (Pearson, 1901), locality preserving projection (LPP) (He and Niyogi, 2003) or factor analysis (FA) (Spearman, 1904). $f_{k}$ consists of a transformation matrix $F_{k}\in\mathbb{R}^{m\times\widetilde{m}_{k}}$ and a shift vector $\boldsymbol{\mu}_{k}\in\mathbb{R}^{m}$, where $\widetilde{m}_{k}~(<m)$ is the reduced dimension of user $k$. $\boldsymbol{\mu}_{k}$ is a mean vector for functions $f_{k}$ with centering, such as PCA and FA, or a zero vector for functions without centering, such as LPP. Note that $f_{k}$ is a private function of user $k$ and should not be shared with other users. As described later in Section 4.4, the requirement that $f_{k}$ be a dimensionality reduction function and not be shared with other users is important for preserving the confidentiality of the covariates. The dimensionality reduction function is applied as

\widetilde{X}_{k}=f_{k}(X_{k})=(X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}\in\mathbb{R}^{n_{k}\times\widetilde{m}_{k}},
\widetilde{X}_{k}^{\text{anc}}=f_{k}(X^{\text{anc}})=(X^{\text{anc}}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}\in\mathbb{R}^{r\times\widetilde{m}_{k}},

where $\boldsymbol{1}=[1,\cdots,1]^{\top}$. Moreover, $f_{k}$ is private and can differ across users. To account for the constant term in the linear CATE model, we append $\boldsymbol{1}$ to $\widetilde{X}_{k}$ and $\widetilde{X}_{k}^{\text{anc}}$. Formally, we can represent these as

[\boldsymbol{1},\widetilde{X}_{k}]=[\boldsymbol{1},X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top}]\bar{F_{k}},
[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]=[\boldsymbol{1},X^{\text{anc}}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top}]\bar{F_{k}},

where

\bar{F_{k}}=\left[\begin{array}{cc}1&\\ &F_{k}\end{array}\right].

Finally, each user $k$ shares the intermediate representations $[\boldsymbol{1},\widetilde{X}_{k}]$ and $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]$, the treatments $Z_{k}$ and the outcomes $Y_{k}$ with the analyst.
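A minimal user-side sketch of this stage is given below; the PCA-based choice of $f_{k}$, the uniform anchor data and the function names are illustrative assumptions rather than the authors' reference implementation.

import numpy as np
from sklearn.decomposition import PCA

def make_anchor(X, r, seed=0):
    # anchor data: uniform random values between the min and max of each covariate
    rng = np.random.default_rng(seed)
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=(r, X.shape[1]))

def intermediate_representation(X_k, X_anc, dim):
    # user k's private dimensionality reduction f_k (here: PCA with centering)
    pca = PCA(n_components=dim).fit(X_k)
    F_k, mu_k = pca.components_.T, pca.mean_        # transformation matrix and shift vector
    Xt_k = (X_k - mu_k) @ F_k                       # intermediate representation of local data
    Xt_anc_k = (X_anc - mu_k) @ F_k                 # intermediate representation of the anchor
    add_ones = lambda A: np.hstack([np.ones((A.shape[0], 1)), A])
    # only [1, Xt_k], [1, Xt_anc_k], Z_k and Y_k are shared; F_k and mu_k stay private
    return add_ones(Xt_k), add_ones(Xt_anc_k), (F_k, mu_k)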

Next, the analyst transforms the shared intermediate representations into a common form called collaborative representations. To do this, the analyst computes transformation matrices $G_{k}$ for all $k$ such that

[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]G_{k}\approx[\boldsymbol{1},\widetilde{X}_{k^{\prime}}^{\text{anc}}]G_{k^{\prime}}\in\mathbb{R}^{r\times\check{m}}~(k\neq k^{\prime}),

where $G_{k}\in\mathbb{R}^{(\widetilde{m}_{k}+1)\times\check{m}}$ and $\check{m}$ is the dimension of the collaborative representations. Then, we have the collaborative representation for party $k$ as

\check{X}_{k}=[\boldsymbol{1},\widetilde{X}_{k}]G_{k}\in\mathbb{R}^{n_{k}\times\check{m}}.

Let

\left[[\boldsymbol{1},\widetilde{X}_{1}^{\text{anc}}],\cdots,[\boldsymbol{1},\widetilde{X}_{c}^{\text{anc}}]\right]=[U_{1},U_{2}]\left[\begin{array}{cc}\Sigma_{1}&\\ &\Sigma_{2}\end{array}\right]\left[\begin{array}{c}V_{1}^{\top}\\ V_{2}^{\top}\end{array}\right]\approx U_{1}\Sigma_{1}V_{1}^{\top}

be a low-rank approximation based on the singular value decomposition of the matrix combining the $[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]$, where $\Sigma_{1}\in\mathbb{R}^{\check{m}\times\check{m}}$ is a diagonal matrix whose diagonal entries are the largest singular values, and $U_{1}$ and $V_{1}$ are column-orthogonal matrices whose columns are the corresponding left and right singular vectors, respectively. The matrix $G_{k}$ is then computed as

G_{k}=\left([\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]\right)^{\dagger}U_{1},

where $\dagger$ denotes the Moore–Penrose inverse. Then, the collaborative representations are given by

\check{X}=\left[\begin{array}{c}\check{X}_{1}\\ \vdots\\ \check{X}_{c}\end{array}\right]\in\mathbb{R}^{n\times\check{m}}.
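The following Python sketch illustrates how the analyst could compute the $G_{k}$ and the collaborative representations from the shared anchor representations via the truncated SVD above; the variable names and the use of numpy's SVD and pseudo-inverse are illustrative assumptions.

import numpy as np

def collaboration_matrices(anchor_reps, m_check):
    # anchor_reps: list of the matrices [1, Xt_anc_k] (one per user), all with r rows
    combined = np.hstack(anchor_reps)                     # [[1,Xt_1^anc], ..., [1,Xt_c^anc]]
    U, s, Vt = np.linalg.svd(combined, full_matrices=False)
    U1 = U[:, :m_check]                                   # left singular vectors of the largest singular values
    # G_k = pinv([1, Xt_anc_k]) U1 for each user k
    return [np.linalg.pinv(A) @ U1 for A in anchor_reps]

def collaborative_representations(intermediate_reps, G):
    # X_check_k = [1, Xt_k] G_k, stacked over all users
    return np.vstack([A @ G_k for A, G_k in zip(intermediate_reps, G)])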

4.1.2 Estimation by the analyst

In the second stage, estimation by the analyst, the analyst estimates their own CATE model based on the DML procedure that we described in Section 3. The analyst considers the data generation process as

y_{i}=\theta(\check{\boldsymbol{x}}_{i})\cdot z_{i}+u(\check{\boldsymbol{x}}_{i})+\varepsilon_{i},
z_{i}=h(\check{\boldsymbol{x}}_{i})+\eta_{i},

and the linear CATE model as

\theta(\check{\boldsymbol{x}}_{i})=\check{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\gamma}=\check{x}_{i}^{1}\gamma^{1}+\cdots+\check{x}_{i}^{\check{m}}\gamma^{\check{m}},

where $\boldsymbol{\gamma}=[\gamma^{1},\cdots,\gamma^{\check{m}}]^{\top}\in\mathbb{R}^{\check{m}}$ and $\check{\boldsymbol{x}}_{i}$ is the collaborative representation of $\boldsymbol{x}_{i}$. The score vectors for the analyst are defined as

\check{\psi}(\check{\boldsymbol{x}}_{i};\check{\boldsymbol{\gamma}},\check{q},\check{h})=\check{\boldsymbol{x}}_{i}(z_{i}-\check{h}(\check{\boldsymbol{x}}_{i}))(y_{i}-\check{q}(\check{\boldsymbol{x}}_{i})-\check{\boldsymbol{x}}_{i}^{\top}\check{\boldsymbol{\gamma}}(z_{i}-\check{h}(\check{\boldsymbol{x}}_{i}))),
\check{\psi}^{a}(\check{\boldsymbol{x}}_{i};\check{h})=(z_{i}-\check{h}(\check{\boldsymbol{x}}_{i}))^{2}\check{\boldsymbol{x}}_{i}\check{\boldsymbol{x}}_{i}^{\top},

where $\check{q}$ and $\check{h}$ are the analyst's nuisance functions, and $\check{\boldsymbol{\gamma}}\in\mathbb{R}^{\check{m}}$ is the analyst's estimate.

The analyst's purpose in the second stage is to obtain $\check{\boldsymbol{\gamma}}$. To accomplish this, the analyst first constructs $\check{q}$ and $\check{h}$ from $\check{X}$, $Z$ and $Y$ using machine learning models to estimate the residuals $\zeta_{i}$ and $\eta_{i}$. The estimated residuals are expressed as

\hat{\zeta}_{i}=y_{i}-\check{q}(\check{\boldsymbol{x}}_{i}),
\hat{\eta}_{i}=z_{i}-\check{h}(\check{\boldsymbol{x}}_{i}).

Second, the analyst calculates $\check{\boldsymbol{\gamma}}$ and $\text{Var}(\check{\boldsymbol{\gamma}})\in\mathbb{R}^{\check{m}\times\check{m}}$. $\check{\boldsymbol{\gamma}}$ is the least squares solution of the following regression equation:

\hat{\zeta}_{i}=\hat{\eta}_{i}\check{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\gamma}+\varepsilon_{i}.

That is,

\check{\boldsymbol{\gamma}}=\operatorname*{arg\,min}_{\boldsymbol{\gamma}^{\prime}}\left(\sum_{i}\left(\hat{\zeta}_{i}-\hat{\eta}_{i}\check{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\gamma}^{\prime}\right)^{2}\right).

The analyst can calculate the variance of $\check{\boldsymbol{\gamma}}$ from (8), (9) and (10), replacing $\psi$, $\psi^{a}$ and $\hat{\boldsymbol{\beta}}$ with $\check{\psi}$, $\check{\psi}^{a}$ and $\check{\boldsymbol{\gamma}}$, respectively.
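A minimal sketch of this stage is shown below, applying the DML steps from Section 3 to the collaborative representations (which already contain a constant column) and computing the sandwich variance; the random-forest learners, the single split, and the pooling of folds in the variance averages are illustrative simplifications.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def analyst_estimate(X_check, z, y, seed=0):
    # estimate gamma_check and Var(gamma_check) from the collaborative representations
    rng = np.random.default_rng(seed)
    n = X_check.shape[0]
    idx = rng.permutation(n)
    folds = [idx[: n // 2], idx[n // 2:]]
    zeta, eta = np.empty(n), np.empty(n)
    for k in (0, 1):
        tr, te = folds[1 - k], folds[k]
        q = RandomForestRegressor(random_state=seed).fit(X_check[tr], y[tr])
        h = RandomForestClassifier(random_state=seed).fit(X_check[tr], z[tr])
        zeta[te] = y[te] - q.predict(X_check[te])
        eta[te] = z[te] - h.predict_proba(X_check[te])[:, 1]
    D = eta[:, None] * X_check                          # regress zeta_i on eta_i * x_check_i^T
    gamma, *_ = np.linalg.lstsq(D, zeta, rcond=None)
    # sandwich variance in the spirit of (8)-(10), averaging over all subjects for brevity
    psi = D * (zeta - D @ gamma)[:, None]               # score vectors psi_check
    J = (D[:, :, None] * D[:, None, :]).mean(axis=0)    # mean of eta_i^2 x_i x_i^T (psi_check^a)
    S = (psi[:, :, None] * psi[:, None, :]).mean(axis=0)
    J_inv = np.linalg.inv(J)
    var_gamma = J_inv @ S @ J_inv.T / n
    return gamma, var_gamma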

4.1.3 Estimation by the users

In the final stage, estimation by the users, the users estimate their own CATE models based on the parameters that the analyst returns. The analyst's CATE model is written as

\check{X}_{k}\check{\boldsymbol{\gamma}}=[\boldsymbol{1},\widetilde{X}_{k}]G_{k}\check{\boldsymbol{\gamma}}=[\boldsymbol{1},X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top}]\bar{F_{k}}G_{k}\check{\boldsymbol{\gamma}}.

Then, for user $k$, subject $i$'s estimated CATE $\hat{\tau}_{i}$ is written as

\hat{\tau}_{i}=\gamma_{k}^{0}+(x_{i}^{1}-\mu_{k}^{1})\gamma_{k}^{1}+\cdots+(x_{i}^{m}-\mu_{k}^{m})\gamma_{k}^{m} \quad (11)
=[1,-\mu_{k}^{1},\cdots,-\mu_{k}^{m}]\boldsymbol{\gamma}_{k}+\gamma_{k}^{1}x_{i}^{1}+\cdots+\gamma_{k}^{m}x_{i}^{m}
=\alpha_{k}+\gamma_{k}^{1}x_{i}^{1}+\cdots+\gamma_{k}^{m}x_{i}^{m},

where

\boldsymbol{\gamma}_{k}=[\gamma_{k}^{0},\gamma_{k}^{1},\cdots,\gamma_{k}^{m}]^{\top}=\bar{F_{k}}G_{k}\check{\boldsymbol{\gamma}},

and

\alpha_{k}=[1,-\mu_{k}^{1},\cdots,-\mu_{k}^{m}]\boldsymbol{\gamma}_{k}. \quad (12)

For user $k$, $\alpha_{k}$ is the constant term and $\gamma_{k}^{1},\cdots,\gamma_{k}^{m}$ are the coefficients for $x_{i}^{1},\cdots,x_{i}^{m}$ in their CATE model. Thus, user $k$'s coefficients are expressed as $\boldsymbol{\beta}_{k}=[\alpha_{k},\gamma_{k}^{1},\cdots,\gamma_{k}^{m}]^{\top}$.

The variance of $\boldsymbol{\gamma}_{k}$ is

\text{Var}(\boldsymbol{\gamma}_{k})=\bar{F_{k}}G_{k}\text{Var}(\check{\boldsymbol{\gamma}})G_{k}^{\top}\bar{F_{k}}^{\top}.

Then, the variance of $\alpha_{k}$ is

\text{Var}(\alpha_{k})=[1,-\mu_{k}^{1},\cdots,-\mu_{k}^{m}]\text{Var}(\boldsymbol{\gamma}_{k})[1,-\mu_{k}^{1},\cdots,-\mu_{k}^{m}]^{\top}. \quad (13)

User $k$ can statistically test their CATE model if they obtain $\boldsymbol{\beta}_{k}$, $\text{Var}(\boldsymbol{\gamma}_{k})$ and $\text{Var}(\alpha_{k})$. To accomplish this, the following procedure is conducted in the final stage. First, the analyst returns $R_{k}^{\text{Point}}=G_{k}\check{\boldsymbol{\gamma}}$ and $R_{k}^{\text{Var}}=G_{k}\text{Var}(\check{\boldsymbol{\gamma}})G_{k}^{\top}$ to user $k$. Then, user $k$ calculates $\boldsymbol{\gamma}_{k}$ and $\text{Var}(\boldsymbol{\gamma}_{k})$ as

\boldsymbol{\gamma}_{k}=\bar{F_{k}}R_{k}^{\text{Point}},
\text{Var}(\boldsymbol{\gamma}_{k})=\bar{F_{k}}R_{k}^{\text{Var}}\bar{F_{k}}^{\top}.

Moreover, user $k$ calculates $\alpha_{k}$ and $\text{Var}(\alpha_{k})$ using (12) and (13), respectively. User $k$ thus obtains $\boldsymbol{\beta}_{k}$, $\text{Var}(\boldsymbol{\gamma}_{k})$ and $\text{Var}(\alpha_{k})$, and can statistically test $\boldsymbol{\beta}_{k}$.
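A user-side sketch of this back-transformation is given below; the matrix names follow the paper's notation, while the helper structure is an illustrative assumption.

import numpy as np

def user_estimate(R_point, R_var, F_k, mu_k):
    # recover user k's coefficients and variances from the analyst's return values
    m = F_k.shape[0]
    F_bar = np.zeros((m + 1, F_k.shape[1] + 1))
    F_bar[0, 0] = 1.0
    F_bar[1:, 1:] = F_k                                   # F_bar_k = diag(1, F_k)
    gamma_k = F_bar @ R_point                             # gamma_k = F_bar_k R_k^Point
    var_gamma_k = F_bar @ R_var @ F_bar.T                 # Var(gamma_k) = F_bar_k R_k^Var F_bar_k^T
    w = np.concatenate([[1.0], -np.asarray(mu_k)])        # [1, -mu_k^1, ..., -mu_k^m]
    alpha_k = w @ gamma_k                                 # constant term, eq. (12)
    var_alpha_k = w @ var_gamma_k @ w                     # Var(alpha_k), eq. (13)
    beta_k = np.concatenate([[alpha_k], gamma_k[1:]])     # user k's CATE coefficients
    return beta_k, var_gamma_k, var_alpha_k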

For user $k$, subject $i$'s estimated CATE $\hat{\tau}_{i}$ is calculated from (11). To derive the variance of subject $i$'s CATE, re-write (11) as

\hat{\tau}_{i}=[1,x_{i}^{1}-\mu_{k}^{1},\cdots,x_{i}^{m}-\mu_{k}^{m}]\boldsymbol{\gamma}_{k}.

Then,

\text{Var}(\hat{\tau}_{i})=[1,x_{i}^{1}-\mu_{k}^{1},\cdots,x_{i}^{m}-\mu_{k}^{m}]\text{Var}(\boldsymbol{\gamma}_{k})[1,x_{i}^{1}-\mu_{k}^{1},\cdots,x_{i}^{m}-\mu_{k}^{m}]^{\top}.

Thus, user $k$ can statistically test $\hat{\tau}_{i}$ based on $\text{Var}(\hat{\tau}_{i})$.

4.2 Correctness of DC-DML

In this section, through an approach similar to Imakura et al. (2023b), we discuss the correctness of DC-DML. We consider the following subspace depending on $k$,

\mathcal{S}_{k}=\mathcal{R}(\bar{F_{k}}G_{k})\subset\mathbb{R}^{m+1},~\text{dim}(\mathcal{S}_{k})=\check{m},

where $\mathcal{R}$ denotes the range of a matrix. Define centralized analysis (CA) as collecting and analyzing the raw data, i.e., $X$, $Z$ and $Y$, in one place. This is an ideal analysis, but it is not possible if the raw data contain confidential information. Then, we have Theorem 2, whose proof is given in Appendix D.

Theorem 2.

Let $\boldsymbol{\beta}_{\textup{CA}}$ be the coefficients, and $q_{\textup{CA}}$ and $h_{\textup{CA}}$ be the functions computed in CA. Moreover, consider $\boldsymbol{\gamma}_{\textup{CA}}=[\gamma_{\textup{CA}}^{0},\gamma_{\textup{CA}}^{1},\cdots,\gamma_{\textup{CA}}^{m}]^{\top}$ such that

\gamma_{\textup{CA}}^{0}=[1,\mu_{1}^{1},\cdots,\mu_{1}^{m}]\boldsymbol{\beta}_{\textup{CA}},\quad\gamma_{\textup{CA}}^{j}=\beta_{\textup{CA}}^{j}~(j\neq 0).

If $\boldsymbol{\gamma}_{\textup{CA}}\in\mathcal{S}_{1}=\cdots=\mathcal{S}_{c}$, $\boldsymbol{\mu}_{1}=\cdots=\boldsymbol{\mu}_{c}$, $\textup{rank}(X^{\textup{anc}}\bar{F}_{k})=\widetilde{m}$ for all $k\in\{1,\cdots,c\}$, $q_{\textup{CA}}(\boldsymbol{x}_{i})=\check{q}(\check{\boldsymbol{x}}_{i})$ and $h_{\textup{CA}}(\boldsymbol{x}_{i})=\check{h}(\check{\boldsymbol{x}}_{i})$ for all $i\in\{1,\cdots,n\}$, then $\boldsymbol{\beta}_{\textup{CA}}=\boldsymbol{\beta}_{k}$.

Theorem 2 indicates that if the assumptions are satisfied, DC-DML can obtain the parameters computed in CA. Although it is unlikely that all of the assumptions are satisfied, Theorem 2 provides some hints for dimensionality reduction methods as we describe in Section 4.3.

4.3 A dimension reduction method for DC-DML

In this section, we propose a dimensionality reduction method for DC-DML with reference to Theorem 2. The dimensionality reduction method strongly affects the performance of DC-DML. In DC-DML, the coefficient vector $\boldsymbol{\gamma}_{k}$ is obtained from the subspace $\mathcal{S}_{k}$. If we can obtain $\boldsymbol{\gamma}_{k}$ close to $\boldsymbol{\gamma}_{\text{CA}}$, we can achieve $\boldsymbol{\beta}_{k}$ close to $\boldsymbol{\beta}_{\text{CA}}$. To accomplish this, the subspace $\mathcal{S}_{k}$ should contain $\boldsymbol{\gamma}_{\text{CA}}$.

Using an approach similar to Imakura et al. (2023b), we consider a dimensionality reduction method for DC-DML. Algorithm 2 shows its pseudo-code. The basic idea of the approach is that each user $k$ constructs $F_{k}$ based on estimates of $\boldsymbol{\gamma}_{\text{CA}}$ obtained from their own local dataset. The estimate of $\boldsymbol{\gamma}_{\text{CA}}$ obtained by each user could be a good approximation of $\boldsymbol{\gamma}_{\text{CA}}$. Thus, a bootstrap-based method can be a reasonable choice to achieve a good subspace. Let $0<p<1$ be the sampling rate. In the bootstrap-based method, user $k$ first obtains estimates $\boldsymbol{\gamma}_{1},\cdots,\boldsymbol{\gamma}_{\widetilde{m}_{k}}$ of $\boldsymbol{\gamma}_{\text{CA}}$ by running the DML procedure on random samples of size $pn_{k}$. Then, user $k$ constructs $F_{k}=[\bar{\boldsymbol{\gamma}}_{1},\cdots,\bar{\boldsymbol{\gamma}}_{\widetilde{m}_{k}}]$, where $\bar{\boldsymbol{\gamma}}_{b}\in\mathbb{R}^{m}$ is the vector excluding the first element of $\boldsymbol{\gamma}_{b}\in\mathbb{R}^{m+1}$. The reason for excluding the first element of $\boldsymbol{\gamma}_{b}$ is that $F_{k}$ should be a transformation matrix for the $m$-dimensional covariates. However, since the first column of $\bar{F}_{k}$ is $[1,0,\cdots,0]^{\top}$, as shown in Section 4.1.1, the subspace contains the full range that the first element of $\boldsymbol{\gamma}_{\text{CA}}$ can take.

Algorithm 2 A bootstrap-based dimensionality reduction for DC-DML
1: Input: covariates $X_{k}$, treatments $Z_{k}$, outcomes $Y_{k}$.
2: Output: $F_{k}$.
3:
4: for $b=1,\cdots,\widetilde{m}_{k}$ do
5:   Set $X_{k}^{b}$, $Z_{k}^{b}$ and $Y_{k}^{b}$ by a random sampling of size $pn_{k}$ from $X_{k}$, $Z_{k}$ and $Y_{k}$.
6:   Compute $\boldsymbol{\gamma}_{b}$ from $X_{k}^{b}$, $Z_{k}^{b}$ and $Y_{k}^{b}$ by the DML procedure.
7: end for
8: Return $F_{k}=[\bar{\boldsymbol{\gamma}}_{1},\cdots,\bar{\boldsymbol{\gamma}}_{\widetilde{m}_{k}}]$.

In addition, as in Imakura et al. (2023b), the subspace can be constructed by combining multiple dimensionality reduction methods; a sketch of the combined construction follows below. Let $F_{\text{BS}}\in\mathbb{R}^{m\times\widetilde{m}_{\text{BS}}}$ and $F_{\text{DR}}\in\mathbb{R}^{m\times\widetilde{m}_{\text{DR}}}$ be the transformation matrices of the bootstrap-based and the other method, respectively, where $\widetilde{m}_{k}=\widetilde{m}_{\text{BS}}+\widetilde{m}_{\text{DR}}$. Then, $F_{k}$ is

F_{k}=[F_{\text{BS}},F_{\text{DR}}].
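The sketch below illustrates Algorithm 2 in Python, optionally concatenating a PCA-based block as in the combined construction above; the reuse of the dml_linear_cate sketch from Section 3, the subsampling without replacement and the PCA choice are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

def bootstrap_reduction(X_k, Z_k, Y_k, dim_bs, p=0.5, dim_dr=0, seed=0):
    # Algorithm 2: build F_k from repeated DML estimates of gamma_CA on subsamples
    rng = np.random.default_rng(seed)
    n_k = X_k.shape[0]
    cols = []
    for b in range(dim_bs):
        sub = rng.choice(n_k, size=int(p * n_k), replace=False)
        gamma_b = dml_linear_cate(X_k[sub], Z_k[sub], Y_k[sub], seed=seed + b)  # sketch from Section 3
        cols.append(gamma_b[1:])                    # drop the constant term
    F_bs = np.column_stack(cols)                    # m x dim_bs
    if dim_dr > 0:
        F_dr = PCA(n_components=dim_dr).fit(X_k).components_.T
        return np.hstack([F_bs, F_dr])              # F_k = [F_BS, F_DR]
    return F_bs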

4.4 Discussion on confidentiality preservation

In this section, we discuss how DC-DML preserves the confidentiality of covariates. DC-DML prevents the original covariates from being recovered or inferred from the shared intermediate representations by the following two layers of confidentiality, inherited from DC technology (Imakura and Sakurai, 2020).

The first layer is to construct intermediate representations by dimensionality reduction of the covariates. Since the dimensionality reduction results in the loss of some of the original covariate information, it is impossible to recover the original covariates from the intermediate representations.

The second layer is that each user $k$ does not share the dimensionality reduction function $f_{k}$ with the other users. If $f_{k}$ were shared, the others could construct an approximate inverse function $f_{k}^{-1}$, and then infer the original covariates by applying $f_{k}^{-1}$ to the shared intermediate representations of user $k$. The construction of $f_{k}^{-1}$ is possible for those who know both $X^{\text{anc}}$ and $\widetilde{X}_{k}^{\text{anc}}$, but as long as the algorithm of the DC technology is followed, there is no such person except user $k$.

Various attacks could attempt to recover or infer the original covariates from the shared intermediate representations. One example is collusion between users and the analyst. Depending on the form of collusion, the confidentiality of the covariates may be violated; see Imakura et al. (2021a) for more details. However, we believe that future research could create DC technology that is robust against a variety of attacks.

4.5 Advantage and disadvantage

In this section, we describe the advantages and disadvantages of DC-DML. These are attributed to DC-QE or to DML with the linear CATE model described in Section 3.

There are four advantages. First, as mentioned in Section 4.4, dimensionality reduction preserves the confidentiality of the covariates. Second, since DC-DML can collect a larger number of subjects from distributed data and accumulate a knowledge base, it can obtain better performance than individual analysis. Third, DC-DML does not require repetitive communication (it is communication-efficient). Finally, DC-DML is less susceptible to model mis-specification because it assumes a semi-parametric model.

Conversely, there are three disadvantages. First, DC-DML cannot be used directly if confidential information is included in the treatments or outcomes. Solutions to this issue include, for example, transforming treatments or outcomes into a confidentiality-preserving format through encryption or similar techniques. Second, the performance of the resulting model may deteriorate because the dimensionality reduction reduces the information in the raw data. However, obtaining more subjects from the distributed data can improve performance by more than the deterioration caused by dimensionality reduction. Finally, if the linear CATE model is not valid, i.e., the true CATE is non-linear, DC-DML may reach incorrect conclusions.

5 Simulations

In this section, we describe three simulations we conducted to demonstrate the effectiveness of DC-DML. (We performed the simulations using Python code, which is available from the corresponding author upon reasonable request.) In Simulation I, which uses synthetic data, we show that DC-DML is useful in data distribution settings that would be incorrectly estimated by individual analysis (IA), the analysis performed by a user using only their own dataset. In Simulation II, which uses semi-synthetic data generated from the infant dataset, we compare the performance of DC-DML with that of other existing methods. In Simulation III, which uses real-world data from the financial assets and jobs datasets, we investigate the robustness of DC-DML performance in the presence of heterogeneous numbers of subjects or treatment group proportions per party.

In Simulation II, the methods compared to DC-DML are as follows. First are DML in IA and CA, which we denote for convenience as IA-DML and CA-DML, respectively. As described above, IA-DML and CA-DML are, respectively, the baseline that DC-DML should outperform and the target that it should approach. Second are GRF and X-Learner in IA and CA, denoted for convenience with the prefixes IA- and CA-: IA-GRF, CA-GRF, IA-XL, and CA-XL. Finally, we compare with the distributed data analysis methods SR and FedCI.

In the simulations, we set up the machine learning models used in DML and X-Learner as follows. First, the machine learning models used to estimate both $q$ and $h$ in the DML methods (IA-DML, CA-DML and DC-DML) in Simulation I are random forests. Second, the settings of the machine learning models used in Simulations II and III are as follows. We selected the machine learning models used to estimate $q$ in the DML methods through preliminary simulations from among five regression methods: multiple regression, random forests, K-nearest neighbors, light gradient boosting machine (LGBM) and support vector machine (SVM) with standardization. In the preliminary simulations for $q$, we repeated the regression of the outcomes on the covariates for the candidate methods in the CA setting 50 times and selected the method with the smallest average root mean squared error (RMSE). For $h$, we similarly selected by preliminary simulations from among five classification methods: logistic regression, random forests, K-nearest neighbors, LGBM and SVM. In the preliminary simulations for $h$, we followed a similar procedure as for $q$, selecting the method with the smallest average Brier score. Here, the Brier score is a mean squared error measure of probability forecasting (Brier, 1950). The hyperparameters for the candidate methods were the default values shown in Appendix E. As a result of our preliminary simulations, we chose SVM as the machine learning model for estimating both $q$ and $h$ in Simulation II. For the financial assets dataset in Simulation III, we set $q$ and $h$ to multiple regression and logistic regression, respectively. For the jobs dataset, we set $q$ and $h$ to multiple regression and random forest, respectively. We use the same models for the outcome and propensity score in X-Learner (IA-XL and CA-XL) as in DML in all simulations. The results of the preliminary simulations are given in Appendix F.

The settings used in DC-DML are as follows. As the intermediate representation construction methods, we used six: PCA, LPP, FA, PCA+B, LPP+B, and FA+B. Here, +B represents the combination with the bootstrap-based dimensionality reduction method. The intermediate representation dimension and the collaborative representation dimension were set as $\widetilde{m}_{k}=m-1$ for all $k$ and $\check{m}=m$, respectively. In the combinations with the bootstrap-based dimensionality reduction method, the BS dimension was set to three in Simulation I and to $\lceil 0.1m\rceil$ in Simulations II and III. Here, $\lceil\cdot\rceil$ is the ceiling function. Anchor data were generated from uniform random numbers between the minimum and maximum values of each covariate, and their number of subjects $r$ was the same as that of the original dataset $X$.

As described above, in Simulation II we used the five methods described in Section 2 as comparison methods: DML, GRF, X-Learner, SR and FedCI. In Simulation III, we used DML and SR as comparison methods. See Appendix G for the detailed settings of those methods.

In Simulations II and III, the evaluation measures used were how close the estimates were to the benchmarks. The benchmarks in Simulations II and III were the true values and the average values over 50 trials of CA-DML, respectively. In this study, we consider four evaluation measures: the RMSE of CATE, the consistency rate of significance of CATE, the RMSE of coefficients, and the consistency rate of significance of coefficients. Note that, for simplicity of explanation, results other than the RMSE of CATE in Simulation III are shown in Appendix H.

The RMSE of CATE

The RMSE of CATE is a measure of how closely the predicted CATE matches the benchmark CATE, defined as

\sqrt{\frac{1}{N}\Sigma_{i}\left(\tau_{i,\textup{BM}}-\hat{\tau}_{i}\right)^{2}},

where $\tau_{i,\textup{BM}}$ is the benchmark CATE and $\hat{\tau}_{i}$ is the estimated CATE. The lower this measure, the better.

The consistency rate of significance of CATE

The consistency rate of significance of CATE is the rate at which the estimated CATE obtains a statistical test result of "significantly positive," "significantly negative," or "not significant" at the 5% level when the benchmark CATE is positive, negative, or zero, respectively. For Simulation III, we set the benchmark CATE to positive, negative, or zero when the estimate in CA-DML is significantly positive, significantly negative, or not significant, respectively, at the 5% level. The higher this measure, the better. This measure was not calculated for X-Learner and FedCI, because statistical testing methods for CATE are not provided in those methods.

The RMSE of coefficients

The RMSE of coefficients is a measure of how closely the estimated coefficients match those of the benchmark, defined as

\sqrt{\frac{1}{m+1}\Sigma_{j}\left(\beta_{\textup{BM}}^{j}-\beta_{k}^{j}\right)^{2}},

where $\beta_{\textup{BM}}^{j}$ is the benchmark coefficient. The lower this measure, the better. Since GRF, X-Learner and FedCI do not estimate a linear CATE model, this measure was not calculated for those methods.

The consistency rate of significance of coefficients

The consistency rate of significance of coefficients is the rate at which the estimated coefficient obtains a statistical test result of "significantly positive," "significantly negative," or "not significant" at the 5% level when the benchmark coefficient is positive, negative, or zero, respectively. For Simulation III, we set the benchmark coefficient to positive, negative, or zero when the estimate in CA-DML is significantly positive, significantly negative, or not significant, respectively, at the 5% level. The higher this measure, the better. As with the RMSE of coefficients, this measure was not calculated for GRF, X-Learner and FedCI.
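For concreteness, the two RMSE measures could be computed as in the short sketch below; the function names are illustrative and not part of the paper's released code.

import numpy as np

def rmse_of_cate(tau_benchmark, tau_hat):
    # RMSE between benchmark and estimated CATEs over all subjects
    return np.sqrt(np.mean((np.asarray(tau_benchmark) - np.asarray(tau_hat)) ** 2))

def rmse_of_coefficients(beta_benchmark, beta_k):
    # RMSE between benchmark and estimated coefficients (constant term included)
    return np.sqrt(np.mean((np.asarray(beta_benchmark) - np.asarray(beta_k)) ** 2))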

5.1 Simulation I: Proof of concepts in synthetic data

In Simulation I, we considered the situation where two parties ($c=2$) each had a dataset consisting of 300 subjects ($n_{k}=300$) and 10 covariates ($m=10$). The covariates $x_{i}^{1}$ and $x_{i}^{2}$ were distributed as in Fig. 3, where the blue and orange markers represent those of parties 1 and 2, respectively. The other covariates were drawn from a normal distribution $\mathcal{N}(0,1)$.

Figure 3: The data distribution in Simulation I. The blue and orange markers represent the covariates of party 1 and 2, respectively.

The setting of the data generation process in Simulation I was

\theta(\boldsymbol{x}_{i})=1+x_{i}^{1}+x_{i}^{2},
u(\boldsymbol{x}_{i})=|x_{i}^{1}|+|x_{i}^{2}|,
h(\boldsymbol{x}_{i})=\frac{1}{1+\exp{(-x_{i}^{1}-x_{i}^{2})}},
\varepsilon_{i}\sim\mathcal{N}(0,0.1),
\eta_{i}=\left\{\begin{array}{ll}1-h(\boldsymbol{x}_{i})&\textup{w.p.}~h(\boldsymbol{x}_{i})\\ -h(\boldsymbol{x}_{i})&\textup{w.p.}~1-h(\boldsymbol{x}_{i})\end{array}\right.

where w.p. means "with probability". In this setting, the constant term and the coefficients of $x_{i}^{1}$ and $x_{i}^{2}$ should be estimated to be significantly equal to one.
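A hedged sketch of this data generating process is shown below. The logistic propensity and the outcome equation follow the formulas above, while the per-party spreads of $x^{1}$ and $x^{2}$ (broad vs. narrow, as in Fig. 3) and the interpretation of $\mathcal{N}(0,0.1)$ as a standard deviation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_k, m = 300, 10

def generate_party(scale1, scale2):
    # party-specific spreads for x^1 and x^2 (illustrative; see Fig. 3)
    X = rng.normal(size=(n_k, m))
    X[:, 0] *= scale1
    X[:, 1] *= scale2
    h = 1.0 / (1.0 + np.exp(-X[:, 0] - X[:, 1]))        # propensity
    z = rng.binomial(1, h)                               # treatment
    theta = 1 + X[:, 0] + X[:, 1]                        # true CATE
    y = theta * z + np.abs(X[:, 0]) + np.abs(X[:, 1]) + rng.normal(0, 0.1, n_k)
    return X, z, y

X1, z1, y1 = generate_party(scale1=2.0, scale2=0.2)      # party 1: broad x^1, narrow x^2
X2, z2, y2 = generate_party(scale1=0.2, scale2=2.0)      # party 2: narrow x^1, broad x^2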

Party 1 has a dataset with a broad distribution for covariate 1 and a narrow distribution for covariate 2. Thus, party 1 would be able to estimate the coefficient of covariate 1 relatively correctly in IA-DML, but would have difficulty estimating the coefficient of covariate 2. For party 2, the opposite holds.

The result of Simulation I is shown in Table 2. In CA-DML, the constant term and the coefficients of $x_{i}^{1}$ and $x_{i}^{2}$ are significant and close to their true values, and the coefficients of the other covariates are not significant. Thus, CA-DML obtained reasonable estimation results. Conversely, IA-DML obtained some incorrect estimation results for both parties.

Covariates | CA-DML | IA-DML (Party 1) | IA-DML (Party 2) | DC-DML(PCA+B) (Party 1) | DC-DML(PCA+B) (Party 2)
const. | 1.0283 ** | 0.6300 ** | 1.0068 ** | 0.9563 ** | 1.0207 **
$x^{1}$ | 1.0610 ** | 0.9952 ** | 0.6950 | 1.0578 ** | 1.0119 **
$x^{2}$ | 1.1290 ** | 2.0799 | 1.0359 ** | 0.9042 ** | 0.9761 **
$x^{3}$ | -0.1628 | -0.1161 | -0.3211 * | -0.0961 | -0.1269
$x^{4}$ | -0.1595 | -0.0220 | -0.0021 | -0.1489 | -0.1264
$x^{5}$ | 0.0566 | -0.0924 | 0.1679 | 0.0147 | 0.0190
$x^{6}$ | 0.0905 | 0.3367 * | -0.2482 | 0.1608 | 0.1354
$x^{7}$ | -0.0434 | 0.0771 | -0.1849 | -0.0317 | -0.0960
$x^{8}$ | -0.0494 | -0.0397 | -0.1370 | -0.0336 | -0.0065
$x^{9}$ | 0.0607 | -0.0955 | 0.1867 | 0.0418 | 0.0536
$x^{10}$ | -0.1773 | -0.2022 | -0.0543 | -0.1270 | -0.1391
** $p<0.01$,   * $p<0.05$
Table 2: The result of Simulation I.

DC-DML with the PCA+B dimensionality reduction method obtained estimation results similar to those of CA-DML. This result suggests that even in situations where it is difficult to make valid estimates using IA-DML, DC-DML can produce more valid estimates than IA-DML.

5.2 Simulation II: Comparison with other methods in semi-synthetic data

In Simulation II, we evaluate the performance of DC-DML on semi-synthetic data generated from the infant health and development program (IHDP) dataset. The IHDP dataset is "a randomized simulation that began in 1985, targeted low-birth-weight, premature infants, and provided the treatment group with both intensive high-quality child care and home visits from a trained provider." (Hill, 2011) (The IHDP dataset is published as a supplementary file at https://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.) The IHDP dataset consists of 25 covariates and a treatment variable and does not include an outcome variable. Thus, the outcomes need to be synthetically generated for the simulation. Hill (2011) removed data on all children with nonwhite mothers from the treatment group to imbalance the covariates in the treatment and control groups. We followed this as well, and as a result, the dataset in Simulation II consisted of 139 subjects in the treatment group and 608 subjects in the control group.

In Simulation II, we considered a situation where three parties (c = 3) each had a dataset consisting of 249 subjects (n_k = 249) and 25 covariates (m = 25). The ratio of the treatment group was equal across all parties. We set the outcome variable in Simulation II as

\begin{aligned}
\theta(\boldsymbol{x}_{i})&=\textup{const.}+\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta},\\
\boldsymbol{\beta}&=\left[\frac{1}{\sigma^{1}},0,-\frac{1}{\sigma^{3}},\frac{1}{\sigma^{4}},0,-\frac{1}{\sigma^{6}},\cdots\right]^{\top},\\
\textup{const.}&=-\boldsymbol{x}_{i}^{\top}[1,0,-1,1,0,-1,\cdots]^{\top},\\
u(\boldsymbol{x}_{i})&=\left|\frac{x_{i}^{1}-\bar{x}^{1}}{\sigma^{1}}\right|+\cdots+\left|\frac{x_{i}^{m}-\bar{x}^{m}}{\sigma^{m}}\right|,\\
\varepsilon_{i}&\sim\mathcal{N}(0,0.1),
\end{aligned}

where \bar{x}^{j} and \sigma^{j} are the average and standard deviation, respectively, of covariate j in the dataset. Note that h(\boldsymbol{x}_{i}) and \eta_{i} were not defined because the treatments were included in the dataset.
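A hedged NumPy sketch of this outcome construction is given below. `X` and `z` stand for the IHDP covariate matrix (747 x 25 after Hill's preprocessing) and the treatment vector; these names are assumptions, and because the constant term is written ambiguously in the converted text, it is taken here as the value that centers the linear CATE at the covariate means, which is also an assumption.

```python
# Hedged sketch of the Simulation II outcome construction on the IHDP covariates.
# The partially linear form y = theta(x) z + u(x) + eps and the centering constant
# const = -x_bar @ beta are assumptions made for this sketch.
import numpy as np

def simulate_outcome(X, z, rng):
    n, m = X.shape
    x_bar, sigma = X.mean(axis=0), X.std(axis=0)

    signs = np.tile([1.0, 0.0, -1.0], m)[:m]      # repeating pattern [1, 0, -1, ...]
    beta = signs / sigma                          # beta_j in {1/sigma_j, 0, -1/sigma_j}
    const = -x_bar @ beta                         # assumed centering constant
    theta = const + X @ beta                      # linear CATE

    u = np.abs((X - x_bar) / sigma).sum(axis=1)   # nuisance term u(x_i)
    eps = rng.normal(0.0, np.sqrt(0.1), size=n)   # N(0, 0.1) noise
    return theta * z + u + eps                    # synthetic outcome
```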

The results for the four evaluation measures in Simulation II are shown in Fig. 4(a)-(d). In this simulation, we conducted 50 trials with different random seed values for the data distribution and machine learning. The symbols + and - in parentheses in the figures represent significant positive and significant negative differences, respectively, from the IA-DML results in t-tests. There was no method that did not differ significantly from DC-DML.
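The significance marking relative to IA-DML can be reproduced with a simple test over the 50 seeded trials. The sketch below assumes a paired design (both methods evaluated on the same 50 seeds) and reads "+" as a significantly better score than IA-DML (for RMSE, significantly lower); both the pairing and the direction convention are assumptions, since the text does not specify them.

```python
# Sketch of the (+)/(-)/(0) marking relative to IA-DML over the 50 trials.
# `scores` and `scores_ia` are length-50 arrays of one evaluation measure from
# the same random seeds (the paired design is an assumption).
import numpy as np
from scipy import stats

def mark(scores, scores_ia, lower_is_better=True, alpha=0.05):
    t_stat, p_value = stats.ttest_rel(scores, scores_ia)
    if p_value >= alpha:
        return "0"                                # no significant difference
    better = (np.mean(scores) < np.mean(scores_ia)) if lower_is_better \
        else (np.mean(scores) > np.mean(scores_ia))
    return "+" if better else "-"
```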

Figure 4: Results of Simulation II: (a) the RMSE of CATE; (b) the consistency rate of significance of CATE; (c) the RMSE of coefficients; (d) the consistency rate of significance of coefficients. The dashed line is the average of the CA-DML results, and the triangle markers represent average values. The vertical axes of the RMSE panels, (a) and (c), are on logarithmic scales.

As described below, Fig. 4(a)-(d) show that DC-DML performs well on each of the evaluation measures in Simulation II. First, as shown in Fig. 4(a), the result for the RMSE of CATE shows that DC-DML significantly outperformed IA-DML; moreover, DC-DML obtained better results than the existing methods. Second, as shown in Fig. 4(b), the result for the consistency rate of significance of CATE shows that DC-DML significantly outperformed IA-DML and again obtained better results than the existing methods. Third, as shown in Fig. 4(c), the result for the RMSE of coefficients shows that DC-DML significantly outperformed IA-DML and is comparable to CA-DML and SR. Finally, as shown in Fig. 4(d), the result for the consistency rate of significance of coefficients shows that DC-DML significantly outperformed IA-DML and obtained better results than SR. These results suggest that DC-DML can outperform IA-DML and obtain results equal to or better than the existing methods.

5.3 Simulation III: Robustness to distribution settings in real-world data

In Simulation III, based on real-world datasets on financial assets and jobs, we investigate the robustness of DC-DML when the number of subjects or the rate of the treatment group is imbalanced across parties. As the financial assets dataset, we used the survey of income and program participation (SIPP) dataset, which was used in Chernozhukov et al. (2018) to estimate the effect of 401(k) eligibility on accumulated assets; it is available from the repository of Chernozhukov et al. (2018) (https://github.com/VC2015/DMLonGitHub/). The SIPP dataset consists of 9915 observations and includes nine subject attributes (age, inc, educ, fsize, marr, twoearn, db, pira, hown) as covariates, 401(k) eligibility as the treatment, and net financial assets as the outcome. As the jobs dataset, we used Dehejia and Wahba's dataset (Dehejia and Wahba, 1999), which was used to estimate the treatment effects of work experience and counseling on incomes; it is available from the author's website (https://users.nber.org/~rdehejia/nswdata.html). The jobs dataset consists of 2675 observations and includes six subject attributes (age, black, hispanic, married, no_degree, re74) as covariates, job training as the treatment, and real earnings in 1978 as the outcome.

In Simulation III, we considered three parties (c = 3) and three data distribution settings in which the rate of the treatment group and the number of subjects are either even or uneven across parties, as shown in Table 3. In setting A, all parties have the same number of subjects and the same rate of the treatment group. In setting B, the number of subjects is even, but the rate of the treatment group differs across parties. In setting C, the number of subjects differs, but the rate of the treatment group is even across parties.

                          setting A               setting B               setting C
subjects                  even                    even                    uneven
treatment rate            even                    uneven                  even
dataset           party   all   treated  control  all   treated  control  all   treated  control
financial assets    1     3304   1227     2077    3304   2549      755    6864   2549     4315
                    2     3304   1227     2077    3304    849     2455    2287    849     1438
                    3     3304   1227     2077    3304    283     3021     762    283      479
jobs                1      891     61      830     891     92      799    1337     92     1245
                    2      891     61      830     891     61      830     891     61      830
                    3      891     61      830     891     30      861     445     30      415
Table 3: The number of subjects in each setting of Simulation III.
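A minimal pandas sketch of how one dataset can be distributed to the three parties according to Table 3 follows. The DataFrame name `sipp_df` and the `treat` column are hypothetical, and the (treated, control) counts shown are those of setting B for the financial assets dataset.

```python
# Sketch of distributing one dataset to three parties as in Table 3.
# `sipp_df` and the `treat` column name are hypothetical placeholders.
import pandas as pd

def split_parties(df, counts, treat_col="treat", seed=0):
    treated = df[df[treat_col] == 1].sample(frac=1, random_state=seed)
    control = df[df[treat_col] == 0].sample(frac=1, random_state=seed)
    parties, t0, c0 = [], 0, 0
    for n_t, n_c in counts:
        part = pd.concat([treated.iloc[t0:t0 + n_t], control.iloc[c0:c0 + n_c]])
        parties.append(part.sample(frac=1, random_state=seed).reset_index(drop=True))
        t0, c0 = t0 + n_t, c0 + n_c
    return parties

# Setting B of the financial assets dataset: (treated, control) per party.
party_dfs = split_parties(sipp_df, [(2549, 755), (849, 2455), (283, 3021)])
```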

In Simulation III, the results of CA-DML are used as the benchmark. Table 4 shows the average results over 50 trials of CA-DML. Note that comparisons with GRF, X-Learner, and FedCI are not meaningful here because those methods assume a non-linear CATE model. Thus, only CA-DML, IA-DML, and SR were considered for comparison with DC-DML.

financial assets                          jobs
covariate     coefficient                 covariate     coefficient
const.        -9705.1794                  const.        -7533.7519
age             172.0307                  age             226.4912 *
inc              -0.1297                  black          2493.7939
educ            642.9040                  hispanic       1519.3141
fsize         -1003.0686                  married       -2215.7011
marr           1102.9090                  no_degree      -858.5624
twoearn        5607.3989                  re74             -0.3354
db             5657.7264 *
pira          -1032.3599
hown           5324.2854 *
* p < 0.05
Table 4: The benchmark coefficients of Simulation III.

Fig. 5 shows the RMSE of CATE in Simulation III for the financial assets dataset. The symbols +, -, and 0 in parentheses in the figures represent significant positive, significant negative, and non-significant differences, respectively, from the IA-DML results in t-tests. The results for parties 1, 2, and 3 are shown on the left, middle, and right of the figures, respectively. In all cases shown in Fig. 5, the DC-DML results are better than IA-DML and SR. In addition, the dimensionality reduction methods combined with bootstrapping show robust results. However, the DC-DML results using LPP are relatively worse than those using the other dimensionality reduction methods. None of the SR results are better than IA-DML.

Figure 5: The RMSE of CATE in Simulation III for the financial assets dataset. The dashed line is the average of the CA-DML results, and the triangle markers represent average values. The vertical axis is on a logarithmic scale.

Fig. 6 shows the RMSE of CATE in Simulation III for the jobs dataset. As with the financial assets dataset, the DC-DML results are better than IA-DML and SR in all cases in Fig. 6. Although, as shown in Appendix H, DC-DML does not always exhibit better performance than IA-DML on the other measures, the DC-DML results are often better than IA-DML. As with the financial assets dataset, the dimensionality reduction methods combined with bootstrapping show robust results, and the SR results are often worse than IA-DML.

Figure 6: The RMSE of CATE in Simulation III for the jobs dataset. The meaning of the symbols, lines, and axes is the same as in Fig. 5.

The simulation results on both datasets suggest that the choice of dimensionality reduction method has a significant impact on the performance of DC-DML. Which dimensionality reduction method performs well for DC-DML is likely to be data-dependent; how to select one is left as a future issue.

In summary, in Simulation III with two datasets, DC-DML achieved better results than IA-DML, although not in all cases. These results suggest that, across various data distribution settings, DC-DML can obtain results closer to those of CA-DML than IA-DML can.

6 Discussion

Utilizing data distributed across multiple parties is an important topic in data science, but confidential or privacy concerns need to be addressed. Multi-source data fusion for constructing CATE estimation models can help provide better policies or medical treatments for subjects. Our method provides statistical test results for treatment effects, which allows decisions such as applying the treatment only to subjects with significantly better effects, or discontinuing it for subjects with significantly worse effects. Providing procedures that take the subjects' covariates into account is essential for safer and more efficient policy and healthcare implementation.

Our method allows the collaborative representation to be constructed anew whenever an additional party joins the collaboration. In addition, our method requires communication with the analyst only when sharing intermediate representations or receiving estimation results, and does not require long-term synchronization. This advantage enables intermediate representations to be accumulated over a long period from a larger number of parties, after which CATEs can be estimated with refined models from the large accumulated data, that is, a knowledge base.

Here, we briefly discuss the relationship between our method and readily identifiability, an important issue in the privacy preservation literature. In the DC analysis literature, readily identifiability means that the original data containing private information can be readily identified from the intermediate representations (Imakura et al., 2023a). Solving the readily identifiability issue could enhance the privacy preservation of our method. As shown in Appendix I, our method can be extended to use intermediate representations that are not readily identifiable with the original data. For a more detailed discussion of readily identifiability, see Appendix I.

Our study has some limitations. First, our method assumes a linear CATE model. As mentioned in Section 4.5, the linear CATE model may lead to incorrect conclusions on data where this assumption does not hold. Second, it is not clear how the dimensionality reduction method should be selected in DC-DML. Addressing these issues is left for future work.

7 Conclusion

In this study, we proposed DC-DML, a method that estimates CATEs from privacy-preserving fusion data constructed from distributed data. Our method enables flexible estimation of CATE models from fusion data through a semi-parametric approach and does not require iterative communication. Moreover, our method enables collaborative estimation across multiple time points and different parties through the accumulation of a knowledge base. These are important characteristics for the practical application of our method. Through the three simulations, we demonstrated the effectiveness of our method as follows. First, even in situations where it is difficult to make valid estimates using IA-DML, DC-DML can produce more valid estimates. Second, DC-DML can outperform IA-DML and obtain results equal to or better than the existing methods. Third, DC-DML can obtain results closer to those of CA-DML than IA-DML can across various data distribution settings.

In the future, we will extend our method to nonlinear CATE models. Other directions include extending our method to vertically distributed data and conducting application experiments on marketing datasets.

Acknowledgement

The authors gratefully acknowledge the New Energy and Industrial Technology Development Organization (NEDO) (No. JPNP18010), Japan Science and Technology Agency (JST) (No. JPMJPF2017), Japan Society for the Promotion of Science (JSPS), and Grants-in-Aid for Scientific Research (Nos. JP19KK0255, JP21H03451, JP23K22166, JP22K19767).

Appendix A Acronym table

A list of acronyms and definitions used in this paper is shown in Table 5.

Acronym Definition
AIPW Augmented inverse propensity weighted
ATE Average treatment effect
CA Centralized analysis
CATE Conditional average treatment effect
DC Data collaboration
DC-DML Data collaboration double machine learning
DC-QE Data collaboration quasi-experiment
DML Double machine learning
FA Factor analysis
FedCI Federated causal inference
GRF Generalized random forest
IA Individual analysis
IHDP Infant health and development program
IPW Inverse propensity-weighted
LGBM Light gradient boosting machine
LPP Locality preserving projection
MLE Maximum likelihood estimator
PCA Principal component analysis
RMSE Root mean squared error
SIPP Survey of income and program participation
SR Secure regression
SVM Support vector machine
Table 5: Acronym table.

Appendix B Algorithm of data collaboration quasi-experiment

Algorithm 3 is the pseudo-code of data collaboration quasi-experiment (DC-QE).

Algorithm 3 Data collaboration quasi-experiment (DC-QE)
1: Input: covariates X, treatments Z, outcomes Y.
2: Output: estimated treatment effect \hat{\tau}.
3:
4: user-k-side
5:
6: Generate anchor dataset X_{k}^{\text{anc}} and share it with all users.
7: Set X^{\text{anc}}.
8: Generate f_{k}.
9: Compute \widetilde{X}_{k}=f_{k}(X_{k}).
10: Compute \widetilde{X}_{k}^{\text{anc}}=f_{k}(X^{\text{anc}}).
11: Share \widetilde{X}_{k}, \widetilde{X}_{k}^{\text{anc}}, Z_{k} and Y_{k} with the analyst.
12:
13: analyst-side
14:
15: Get \widetilde{X}_{k}, \widetilde{X}_{k}^{\text{anc}}, Z_{k} and Y_{k} for all k.
16: Set \widetilde{X}_{k} and \widetilde{X}_{k}^{\text{anc}}.
17: Compute g_{k} from \widetilde{X}_{k}^{\text{anc}} for all k such that g_{k}(\widetilde{X}_{k}^{\text{anc}})\approx g_{k^{\prime}}(\widetilde{X}_{k^{\prime}}^{\text{anc}})~(k\neq k^{\prime}).
18: Compute \check{X}_{k}=g_{k}(\widetilde{X}_{k}) for all k.
19: Set \check{X}, Z and Y.
20: Estimate propensity scores \hat{\boldsymbol{\alpha}} from \check{X} and Z.
21: Estimate the average treatment effect \hat{\tau} from \hat{\boldsymbol{\alpha}} and Y using existing approaches (matching, weighting, and so on).
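A minimal sketch of the representation-sharing and collaboration steps of Algorithm 3 is shown below. It assumes each user adopts PCA as f_k and the analyst builds g_k by mapping every party's anchor representation onto a common target taken from an SVD of the stacked anchor representations; this is one simple construction rather than the only possible choice, and `party_covariates`, the anchor size, and the target dimension are illustrative assumptions.

```python
# Sketch of the representation sharing and collaboration steps of Algorithm 3.
# Assumptions: f_k is PCA with 5 components, the anchor data are random, and
# g_k maps each party's anchor representation onto a common SVD-based target.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dim = 5
X_anc = rng.normal(size=(200, party_covariates[0].shape[1]))  # shared anchor data

# user-k side: dimensionality-reduced intermediate representations
reps, reps_anc = [], []
for X_k in party_covariates:              # list of covariate matrices (assumed given)
    f_k = PCA(n_components=dim).fit(X_k)
    reps.append(f_k.transform(X_k))
    reps_anc.append(f_k.transform(X_anc))

# analyst side: common target from the stacked anchor representations, then
# g_k as the least-squares map from each anchor representation onto it
U, _, _ = np.linalg.svd(np.hstack(reps_anc), full_matrices=False)
Z_target = U[:, :dim]
collab = []
for R_anc, R in zip(reps_anc, reps):
    G_k, *_ = np.linalg.lstsq(R_anc, Z_target, rcond=None)
    collab.append(R @ G_k)                # collaborative representation for party k
X_check = np.vstack(collab)               # pooled; then used for propensity scores
```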

Appendix C The proof of Theorem 1

Proof.

Denote \bar{\boldsymbol{x}}_{i}=[1,\boldsymbol{x}_{i}^{\top}]^{\top}. First, we can prove the moment condition as

\begin{aligned}
\mathbb{E}[\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q,h)]
&=\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}(z_{i}-h(\boldsymbol{x}_{i}))(y_{i}-q(\boldsymbol{x}_{i})-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(z_{i}-h(\boldsymbol{x}_{i})))\right]\\
&=\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}\eta_{i}\varepsilon_{i}\right]\\
&=\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}\,\mathbb{E}\left[\eta_{i}\varepsilon_{i}\,|\,\boldsymbol{x}_{i}\right]\right]\\
&=\boldsymbol{0}.
\end{aligned}

Second, for the Neyman orthogonality, consider

\begin{aligned}
\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q+r\delta_{q},h+r\delta_{h})
&=\bar{\boldsymbol{x}}_{i}(z_{i}-h(\boldsymbol{x}_{i})-r\delta_{h}(\boldsymbol{x}_{i}))(y_{i}-q(\boldsymbol{x}_{i})-r\delta_{q}(\boldsymbol{x}_{i})-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(z_{i}-h(\boldsymbol{x}_{i})-r\delta_{h}(\boldsymbol{x}_{i})))\\
&=\bar{\boldsymbol{x}}_{i}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i})-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i})))\\
&=\bar{\boldsymbol{x}}_{i}\left\{(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}\right\},
\end{aligned}

where \delta_{h}=h^{\prime}-h and \delta_{q}=q^{\prime}-q. Then,

\begin{aligned}
&\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q+r\delta_{q},h+r\delta_{h})\right]\\
&=\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}\,\mathbb{E}\left[\left\{(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}\right\}|\boldsymbol{x}_{i}\right]\right]\\
&=\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\left\{(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}\right\}|\boldsymbol{x}_{i}\right]\right].
\end{aligned}

Here,

\begin{aligned}
&\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\left\{(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))-\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}\right\}|\boldsymbol{x}_{i}\right]\\
&=\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))|\boldsymbol{x}_{i}\right]-\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}|\boldsymbol{x}_{i}\right].
\end{aligned}

The first term is

\begin{aligned}
&\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))(\zeta_{i}-r\delta_{q}(\boldsymbol{x}_{i}))|\boldsymbol{x}_{i}\right]\\
&=\lim_{r\to 0}\mathbb{E}\left[-\eta_{i}\delta_{q}(\boldsymbol{x}_{i})-\zeta_{i}\delta_{h}(\boldsymbol{x}_{i})+2r\delta_{h}(\boldsymbol{x}_{i})\delta_{q}(\boldsymbol{x}_{i})|\boldsymbol{x}_{i}\right]\\
&=\mathbb{E}\left[-\eta_{i}\delta_{q}(\boldsymbol{x}_{i})-\zeta_{i}\delta_{h}(\boldsymbol{x}_{i})|\boldsymbol{x}_{i}\right]\\
&=-\delta_{q}(\boldsymbol{x}_{i})\mathbb{E}\left[\eta_{i}|\boldsymbol{x}_{i}\right]-\delta_{h}(\boldsymbol{x}_{i})\mathbb{E}\left[\zeta_{i}|\boldsymbol{x}_{i}\right]\\
&=0,
\end{aligned}

and the second term is

\begin{aligned}
&\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}|\boldsymbol{x}_{i}\right]\\
&=\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))^{2}|\boldsymbol{x}_{i}\right]\\
&=\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}\lim_{r\to 0}\mathbb{E}\left[-2(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))\delta_{h}(\boldsymbol{x}_{i})|\boldsymbol{x}_{i}\right]\\
&=-2\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}\,\delta_{h}(\boldsymbol{x}_{i})\lim_{r\to 0}\mathbb{E}\left[(\eta_{i}-r\delta_{h}(\boldsymbol{x}_{i}))|\boldsymbol{x}_{i}\right]\\
&=-2\bar{\boldsymbol{x}}_{i}^{\top}\boldsymbol{\beta}\,\delta_{h}(\boldsymbol{x}_{i})\mathbb{E}\left[\eta_{i}|\boldsymbol{x}_{i}\right]\\
&=0.
\end{aligned}

Thus,

\lim_{r\to 0}\frac{\partial}{\partial r}\mathbb{E}\left[\psi(\boldsymbol{x}_{i};\boldsymbol{\beta},q+r\delta_{q},h+r\delta_{h})\right]=\boldsymbol{0}. ∎
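These two properties can also be checked numerically. The following Monte-Carlo sketch uses the Simulation I data-generating process purely as an illustration: the sample moment of ψ is close to zero at the true nuisances, and when q and h are perturbed by rδ_q and rδ_h, the deviation shrinks roughly quadratically in r, reflecting the absence of a first-order effect.

```python
# Numerical sanity check of the moment condition and Neyman orthogonality,
# using the Simulation I DGP as an illustrative example (an assumption).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=(n, 2))
theta = 1.0 + X[:, 0] + X[:, 1]
h = 1.0 / (1.0 + np.exp(-X[:, 0] - X[:, 1]))        # true propensity
u = np.abs(X[:, 0]) + np.abs(X[:, 1])
z = rng.binomial(1, h)
y = theta * z + u + rng.normal(0, np.sqrt(0.1), n)
q = theta * h + u                                    # q(x) = E[y | x]

Xbar = np.column_stack([np.ones(n), X])              # [1, x_i]
beta = np.array([1.0, 1.0, 1.0])                     # true coefficients

def psi_mean(q_fun, h_fun):
    eta = z - h_fun
    resid = y - q_fun - (Xbar @ beta) * eta
    return (Xbar * (eta * resid)[:, None]).mean(axis=0)

print(psi_mean(q, h))                                # ~ 0: moment condition holds
# Perturbing the nuisances: the deviation shrinks roughly by a factor of 4 when
# r is halved (an O(r^2) effect), i.e. there is no first-order effect.
for r in (0.5, 0.25):
    print(r, psi_mean(q + 2.0 * r * X[:, 0], h + 0.3 * r * X[:, 1]))
```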

Appendix D The proof of Theorem 2

Proof.

From Theorem 1 in Imakura et al. (2021a), if

\mathcal{S}_{1}=\cdots=\mathcal{S}_{c}\quad\textup{and}\quad\boldsymbol{\mu}_{1}=\cdots=\boldsymbol{\mu}_{c}\quad\textup{and}\quad\textup{rank}(X^{\textup{anc}}\bar{F}_{k})=\widetilde{m},

then we have

\check{X}=W\bar{F}_{1}G_{1}=\cdots=W\bar{F}_{c}G_{c},

where W=[\boldsymbol{1},X-\boldsymbol{1}\boldsymbol{\mu}_{1}^{\top}]. Denote \boldsymbol{w}_{i}^{\top} as the i-th row of W. Here, consider

t(\boldsymbol{\beta}^{\prime})=\left[[1,\mu_{1}^{1},\cdots,\mu_{1}^{m}]\boldsymbol{\beta}^{\prime},\beta^{\prime 1},\cdots,\beta^{\prime m}\right]^{\top}\in\mathbb{R}^{m+1}.

Then, we can express \boldsymbol{\gamma}_{\textup{CA}}=t(\boldsymbol{\beta}_{\textup{CA}}). Since [1,\boldsymbol{x}_{i}^{\top}]\boldsymbol{\beta}^{\prime}=\boldsymbol{w}_{i}^{\top}t(\boldsymbol{\beta}^{\prime}),

\begin{aligned}
\boldsymbol{\beta}_{\textup{CA}}&=\operatorname*{arg\,min}_{\boldsymbol{\beta}^{\prime}\in\mathbb{R}^{m+1}}\sum_{i}(\hat{\zeta}_{i}-\hat{\eta}_{i}[1,\boldsymbol{x}_{i}^{\top}]\boldsymbol{\beta}^{\prime})^{2}\\
&=\operatorname*{arg\,min}_{\boldsymbol{\beta}^{\prime}\in\mathbb{R}^{m+1}}\sum_{i}(\hat{\zeta}_{i}-\hat{\eta}_{i}\boldsymbol{w}_{i}^{\top}t(\boldsymbol{\beta}^{\prime}))^{2}.
\end{aligned}

Thus, \boldsymbol{\gamma}_{\textup{CA}} is equal to the solution of the following least squares problem:

\boldsymbol{\gamma}_{\textup{CA}}=\operatorname*{arg\,min}_{\boldsymbol{\gamma}^{\prime}\in\mathbb{R}^{m+1}}\sum_{i}(\hat{\zeta}_{i}-\hat{\eta}_{i}\boldsymbol{w}_{i}^{\top}\boldsymbol{\gamma}^{\prime})^{2}.

From the assumption \boldsymbol{\gamma}_{\textup{CA}}\in\mathcal{S}_{k},

\begin{aligned}
\boldsymbol{\gamma}_{k}&=\bar{F}_{k}G_{k}\check{\boldsymbol{\gamma}}\\
&=\operatorname*{arg\,min}_{\bar{F}_{k}G_{k}\boldsymbol{\gamma}^{\prime}\in\mathcal{S}_{k}}\sum_{i}(\hat{\zeta}_{i}-\hat{\eta}_{i}\boldsymbol{w}_{i}^{\top}\bar{F}_{k}G_{k}\boldsymbol{\gamma}^{\prime})^{2}\\
&=\operatorname*{arg\,min}_{\boldsymbol{\gamma}^{\prime}\in\mathcal{S}_{k}}\sum_{i}(\hat{\zeta}_{i}-\hat{\eta}_{i}\boldsymbol{w}_{i}^{\top}\boldsymbol{\gamma}^{\prime})^{2}\\
&=\boldsymbol{\gamma}_{\textup{CA}}.
\end{aligned}

We used the assumptions q_{\text{CA}}(\boldsymbol{x}_{i})=\check{q}(\check{\boldsymbol{x}}_{i}) and h_{\text{CA}}(\boldsymbol{x}_{i})=\check{h}(\check{\boldsymbol{x}}_{i}) to transform the equation from line 2 to line 3. It is obvious that \boldsymbol{\beta}_{\text{CA}} and \boldsymbol{\beta}_{k} are equal in all but the first element. However,

\begin{aligned}
\alpha_{k}&=[1,-\mu_{k}^{1},\cdots,-\mu_{k}^{m}]\boldsymbol{\gamma}_{k}\\
&=[1,-\mu_{1}^{1},\cdots,-\mu_{1}^{m}]\boldsymbol{\gamma}_{\text{CA}}\\
&=\alpha_{\text{CA}},
\end{aligned}

where \alpha_{\text{CA}} is the first element of \boldsymbol{\beta}_{\text{CA}}. Then, \boldsymbol{\beta}_{\text{CA}}=\boldsymbol{\beta}_{k}. ∎

Appendix E Hyperparameters for candidate methods

Most hyperparameters of the candidate methods for q and h are the default parameters of scikit-learn (V1.2.2) and LightGBM (V3.3.5). The hyperparameters for q estimation are as follows.

LinearRegression (scikit-learn)

fit_intercept: True, positive: False

RandomForestRegressor (scikit-learn)

bootstrap: True, ccp_alpha: 0.0, criterion: squared_error, max_depth: None, max_features: 1.0, max_leaf_nodes: None, max_samples: None, min_impurity_decrease: 0.0, min_samples_leaf: 1, min_samples_split: 2, min_weight_fraction_leaf: 0.0, n_estimators: 100, oob_score: False, warm_start: False

KNeighborsRegressor (scikit-learn)

algorithm: auto, leaf_size: 30, metric: minkowski, metric_params: None, n_neighbors: 5, p: 2, weights: uniform

LGBMRegressor (LightGBM)

boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, max_depth: -1, min_child_samples: 20, min_child_weight: 0.001, min_split_gain: 0.0, n_estimators: 100, num_leaves: 31, objective: None, reg_alpha: 0.0, reg_lambda: 0.0, subsample: 1.0, subsample_for_bin: 200000, subsample_freq: 0

SVR (scikit-learn)

C: 1.0, cache_size: 200, coef0: 0.0, degree: 3, epsilon: 0.1, gamma: scale, kernel: rbf, max_iter: -1, shrinking: True, tol: 0.001

The hyperparameters for hh estimation are as follows.

LogisticRegression (scikit-learn)

C: 1.0, class_weight: None, dual: False, fit_intercept: False, intercept_scaling: 1, l1_ratio: None, max_iter: 100, multi_class: auto, n_jobs: None, penalty: none, solver: lbfgs, tol: 0.0001, warm_start: False

RandomForestClassifier (scikit-learn)

bootstrap: True, ccp_alpha: 0.0, class_weight: None, criterion: gini, max_depth: None, max_features: sqrt, max_leaf_nodes: None, max_samples: None, min_impurity_decrease: 0.0, min_samples_leaf: 1, min_samples_split: 2, min_weight_fraction_leaf: 0.0, n_estimators: 100, oob_score: False, warm_start: False

KNeighborsClassifier (scikit-learn)

algorithm: auto, leaf_size: 30, metric: minkowski, metric_params: None, n_neighbors: 5, p: 2, weights: uniform

LGBMClassifier (LightGBM)

boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, max_depth: -1, min_child_samples: 20, min_child_weight: 0.001, min_split_gain: 0.0, n_estimators: 100, num_leaves: 31, objective: None, reg_alpha: 0.0, reg_lambda: 0.0, subsample: 1.0, subsample_for_bin: 200000, subsample_freq: 0

SVC (scikit-learn)

C: 1.0, break_ties: False, cache_size: 200, class_weight: None, coef0: 0.0, decision_function_shape: ovr, degree: 3, gamma: scale, kernel: rbf, max_iter: -1, probability: True, shrinking: True, tol: 0.001
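For reference, the candidate learners above can be instantiated as follows; most settings are library defaults in scikit-learn 1.2 and LightGBM 3.3, so only the non-default values are passed explicitly in this sketch.

```python
# The candidate learners for q and h listed above; only non-default settings
# (relative to scikit-learn 1.2 / LightGBM 3.3 defaults) are passed explicitly.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVR, SVC
from lightgbm import LGBMRegressor, LGBMClassifier

q_candidates = {
    "Multiple regression": LinearRegression(),
    "Random forests": RandomForestRegressor(),
    "K-nearest neighbor": KNeighborsRegressor(),
    "LGBM": LGBMRegressor(),
    "SVM": SVR(),
}
h_candidates = {
    "Logistic regression": LogisticRegression(fit_intercept=False, penalty=None),
    "Random forests": RandomForestClassifier(),
    "K-nearest neighbor": KNeighborsClassifier(),
    "LGBM": LGBMClassifier(),
    "SVM": SVC(probability=True),  # probability=True so propensity scores are available
}
```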

Appendix F The results of the preliminary simulations

In the preliminary simulations, the RMSE was calculated for the estimation of q and the Brier score for the estimation of h using two-fold cross-validation on the raw data. Table 6 shows the averages over 50 trials.

Simulation II
Candidate for q   RMSE   Candidate for h   Brier score
Multiple regression 2.5613 Logistic regression 0.2579
Random forests 2.5308 Random forests 0.1532
K-nearest neighbor 2.8690 K-nearest neighbor 0.1665
LGBM 2.5734 LGBM 0.1733
SVM 2.2333 SVM 0.1488
Simulation III for the financial assets dataset
Candidate for q   RMSE   Candidate for h   Brier score
Multiple regression 55827.4190 Logistic regression 0.2014
Random forests 58048.7840 Random forests 0.2151
K-nearest neighbor 57861.6099 K-nearest neighbor 0.2339
LGBM 56665.7031 LGBM 0.2047
SVM 65450.8516 SVM 0.2080
Simulation III for the jobs dataset
Candidate for q   RMSE   Candidate for h   Brier score
Multiple regression 10963.5250 Logistic regression 0.1699
Random forests 12058.0681 Random forests 0.1628
K-nearest neighbor 11981.2699 K-nearest neighbor 0.1751
LGBM 11503.7386 LGBM 0.1719
SVM 15613.6765 SVM 0.1912
Table 6: The results of the preliminary simulations.
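A sketch of this preliminary selection step is given below, assuming the candidate dictionaries from Appendix E and pooled raw arrays X, y, and z (the array names are assumptions).

```python
# Two-fold cross-validation on the raw data: RMSE for the q-candidates and the
# Brier score for the h-candidates. X, y, z are assumed NumPy arrays.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, brier_score_loss

def preliminary_scores(X, y, z, q_candidates, h_candidates, seed=0):
    kf = KFold(n_splits=2, shuffle=True, random_state=seed)
    rmse = {name: [] for name in q_candidates}
    brier = {name: [] for name in h_candidates}
    for train, test in kf.split(X):
        for name, est in q_candidates.items():
            pred = clone(est).fit(X[train], y[train]).predict(X[test])
            rmse[name].append(mean_squared_error(y[test], pred, squared=False))
        for name, est in h_candidates.items():
            prob = clone(est).fit(X[train], z[train]).predict_proba(X[test])[:, 1]
            brier[name].append(brier_score_loss(z[test], prob))
    return ({k: np.mean(v) for k, v in rmse.items()},
            {k: np.mean(v) for k, v in brier.items()})
```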

Appendix G Settings for the existing methods

To conduct the existing methods, we used EconML (V0.14.1) and FedCI (https://github.com/vothanhvinh/FedCI). All hyperparameters of GRF (CausalForest) and FedCI are the default values, as follows.

CausalForest (EconML)

criterion: mse, fit_intercept: True, honest: True, inference: True, max_depth: None, max_features: auto, max_samples: 0.45, min_balancedness_tol: 0.45, min_impurity_decrease: 0.0, min_samples_leaf: 5, min_samples_split: 10, min_var_fraction_leaf: None, min_var_leaf_on_val: False, min_weight_fraction_leaf: 0.0, n_estimators: 100, subforest_size: 4, warm_start: False

FedCI

n_iterations:2000, learning_rate:1e-3

For X-Learner, the candidate methods for estimating the outcome and the propensity score have the hyperparameters described in Appendix E. For SR, we considered a regression model consisting of a constant term and cross terms between the covariates and the treatment:

y_{i}=\textup{const.}+z_{i}(\beta^{0}+x_{i}^{1}\beta^{1}+\cdots+x_{i}^{m}\beta^{m})+\varepsilon_{i}.
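This SR model can be fitted by ordinary least squares on a design matrix consisting of a constant, the treatment, and the treatment-covariate cross terms. The statsmodels-based sketch below is only an illustration of the model form, since SR itself computes the same least-squares solution from securely shared summary statistics (Karr et al., 2005).

```python
# OLS fit of the SR comparison model: constant + treatment + treatment-covariate
# cross terms. statsmodels is used here only for convenience.
import numpy as np
import statsmodels.api as sm

def fit_sr_model(X, z, y):
    design = np.column_stack([np.ones(len(y)), z, X * z[:, None]])
    result = sm.OLS(y, design).fit()
    return result  # result.params[1:] are beta^0, beta^1, ..., beta^m
```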

Appendix H The all results of Simulation III

In this section, we include all results of Simulation III.

Figs. 5, 7, 8, and 9 show the results of Simulation III for the financial assets dataset. For the RMSE of coefficients and the consistency rate of significance of coefficients shown in Figs. 8 and 9, the DC-DML results for party 1 using LPP are worse than those of IA-DML in some cases. However, in most of the other cases shown in Figs. 5, 7, 8, and 9, the DC-DML results are better than IA-DML. In addition, the dimensionality reduction methods combined with bootstrapping show robust results. None of the SR results are better than IA-DML.

Figure 7: The consistency rate of significance of CATE in Simulation III for the financial assets dataset.
Figure 8: The RMSE of coefficients in Simulation III for the financial assets dataset.
Figure 9: The consistency rate of significance of coefficients in Simulation III for the financial assets dataset.

Figs. 6, 10, 11, and 12 show the results of Simulation III for the jobs dataset. The DC-DML results are worse than IA-DML when using LPP for the consistency rate of significance of CATE, and when using LPP, LPP+B, or FA+B for the consistency rate of significance of coefficients. However, the DC-DML results are often better than IA-DML; in particular, DC-DML performs better than IA-DML on the RMSEs of CATE and coefficients in most cases. Moreover, as with the financial assets dataset, the dimensionality reduction methods combined with bootstrapping show robust results. The SR results are often worse than IA-DML.

Figure 10: The consistency rate of significance of CATE in Simulation III for the jobs dataset.
Figure 11: The RMSE of coefficients in Simulation III for the jobs dataset.
Figure 12: The consistency rate of significance of coefficients in Simulation III for the jobs dataset.

On both datasets, there were cases, especially with LPP, where DC-DML did not achieve better performance. This suggests that the choice of dimensionality reduction method has a significant impact on the performance of DC-DML.

Appendix I Non-readily identifiable method

The definition of readily identifiability (Imakura et al., 2023a) is as follows.

Definition 1.

Let x_{i}^{\textup{p}} and x_{i}^{\textup{np}} be a pair of data including and not including, respectively, personal information that can directly identify a specific subject for the i-th person. We let \mathscr{X}^{\textup{p}}=\{x_{1}^{\textup{p}},\cdots,x_{n}^{\textup{p}}\} and \mathscr{X}^{\textup{np}}=\{x_{1}^{\textup{np}},\cdots,x_{n}^{\textup{np}}\} be the personal information and non-personal information datasets for the same n persons, respectively.

For non-personal information x^{\textup{np}}\in\mathscr{X}^{\textup{np}}, if and only if someone else holds a key to correctly collate the corresponding personal information x^{\textup{p}}\in\mathscr{X}^{\textup{p}}, or can generate such a key on their own, then the non-personal data x^{\textup{np}} are defined as "readily identifiable" to the personal dataset \mathscr{X}^{\textup{p}}.

Imakura et al. (2023a) showed that the intermediate representations in DC analysis are readily identifiable to the original dataset. This is also true for DC-DML, i.e., \widetilde{X}_{k} is readily identifiable to X_{k}.

To solve the readily identifiability issue in DC analysis, Imakura et al. (2023a) proposed strategies that make the intermediate representations non-readily identifiable to the original dataset. To make DC-DML non-readily identifiable, the following strategies are needed.

  • The sample indices of each user's dataset are permuted randomly.

  • The dimensionality reduction functions f_{k} are erasable and cannot be reconstructed.

According to the above strategies, we develop a non-readily identifiable DC-DML (NI-DC-DML) as follows. In the second stage, the construction of collaborative representations, there are three differences in procedure between DC-DML and NI-DC-DML. First, in NI-DC-DML, user k constructs F_{k}^{\prime}=F_{k}E_{k} instead of F_{k} as the dimensionality reduction function, where E_{k}\in\mathbb{R}^{\widetilde{m}_{k}\times\widetilde{m}_{k}} is a random matrix. Second, user k constructs a permutation matrix P_{k}. Thus, the intermediate representations are constructed as

\begin{aligned}
\widetilde{X}_{k}&=P_{k}(X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}^{\prime}\in\mathbb{R}^{n_{k}\times\widetilde{m}_{k}},\\
\widetilde{X}_{k}^{\text{anc}}&=(X^{\text{anc}}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}^{\prime}\in\mathbb{R}^{r\times\widetilde{m}_{k}}.
\end{aligned}

Note that P_{k} is not applied in the construction of \widetilde{X}_{k}^{\text{anc}}. Then, user k erases P_{k} and F_{k}^{\prime}; they cannot be reconstructed if a random number generator with a different random seed for each user is used. Third, user k shares the permuted treatments Z_{k}^{\prime}=P_{k}Z_{k} and outcomes Y_{k}^{\prime}=P_{k}Y_{k}, instead of Z_{k} and Y_{k}, respectively, with the analyst. Then, user k erases \widetilde{X}_{k}, \widetilde{X}_{k}^{\text{anc}}, Z_{k} and Y_{k}.
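A user-side sketch of this construction is shown below. F_k is taken to be a PCA projection matrix purely as an illustration (the paper leaves the dimensionality reduction method open), and the random matrix E_k and the permutation P_k are generated and then discarded exactly as described above.

```python
# User-side sketch of the non-readily identifiable intermediate representations:
#   X~_k     = P_k (X_k - 1 mu_k^T) F_k',  with F_k' = F_k E_k,
#   X~_k^anc = (X^anc - 1 mu_k^T) F_k'     (P_k is not applied to the anchors).
# F_k is a PCA projection here only as an illustrative choice.
import numpy as np
from sklearn.decomposition import PCA

def make_intermediate(X_k, z_k, y_k, X_anc, n_dim, rng):
    mu_k = X_k.mean(axis=0)
    F_k = PCA(n_components=n_dim).fit(X_k - mu_k).components_.T  # m x n_dim projection
    E_k = rng.normal(size=(n_dim, n_dim))                        # random, a.s. invertible
    F_prime = F_k @ E_k
    perm = rng.permutation(len(X_k))                             # P_k as an index permutation

    X_tilde = (X_k - mu_k)[perm] @ F_prime
    X_tilde_anc = (X_anc - mu_k) @ F_prime
    return X_tilde, X_tilde_anc, z_k[perm], y_k[perm]            # F_k', P_k are then discarded
```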

In the final stage, the estimation on the user side, there are two differences in procedure between DC-DML and NI-DC-DML. First, the analyst returns R_{k}^{\text{Point-anc}} and R_{k}^{\text{Var-anc}} instead of R_{k}^{\text{Point}} and R_{k}^{\text{Var}}, respectively, where

\begin{aligned}
R_{k}^{\text{Point-anc}}&=\check{X}_{k}^{\text{anc}}\check{\boldsymbol{\gamma}},\\
R_{k}^{\text{Var-anc}}&=\check{X}_{k}^{\text{anc}}\text{Var}(\check{\boldsymbol{\gamma}})(\check{X}_{k}^{\text{anc}})^{\top},\\
\check{X}_{k}^{\text{anc}}&=[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]G_{k}.
\end{aligned}

Second, user k calculates \boldsymbol{\gamma}_{k} and \text{Var}(\boldsymbol{\gamma}_{k}) as

\begin{aligned}
\boldsymbol{\gamma}_{k}&=((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}R_{k}^{\text{Point-anc}}\quad\left(=\bar{F}_{k}G_{k}\check{\boldsymbol{\gamma}}\right),\\
\text{Var}(\boldsymbol{\gamma}_{k})&=((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}R_{k}^{\text{Var-anc}}\bar{X}^{\text{anc}}((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}\quad\left(=\bar{F}_{k}G_{k}\text{Var}(\check{\boldsymbol{\gamma}})G_{k}^{\top}\bar{F}_{k}^{\top}\right),\\
\bar{X}^{\text{anc}}&=[\boldsymbol{1},X^{\text{anc}}].
\end{aligned}

The above calculation is correct because

\begin{aligned}
&((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}\check{X}_{k}^{\text{anc}}\\
&=((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}[\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]G_{k}\\
&=((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}[\boldsymbol{1},X^{\text{anc}}]\bar{F}_{k}G_{k}\\
&=((\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}})^{-1}(\bar{X}^{\text{anc}})^{\top}\bar{X}^{\text{anc}}\bar{F}_{k}G_{k}\\
&=\bar{F}_{k}G_{k}.
\end{aligned}

The other procedures in NI-DC-DML are the same as in DC-DML. Algorithm 4 is the pseudo-code of NI-DC-DML.

Algorithm 4 Non-readily identifiable data collaboration double machine learning (NI-DC-DML)
1: Input: covariates X, treatments Z, outcomes Y.
2: Output: \boldsymbol{\beta}_{k}, \text{Var}(\boldsymbol{\gamma}_{k}) and \text{Var}(\alpha_{k}).
3:
4: user-k-side
5:
6: Generate anchor dataset X_{k}^{\text{anc}} and share it with all users.
7: Set X^{\text{anc}}.
8: Generate F_{k}^{\prime}, \boldsymbol{\mu}_{k} and P_{k}.
9: Compute \widetilde{X}_{k}=P_{k}(X_{k}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}^{\prime}, \widetilde{X}_{k}^{\text{anc}}=(X^{\text{anc}}-\boldsymbol{1}\boldsymbol{\mu}_{k}^{\top})F_{k}^{\prime}, Z_{k}^{\prime}=P_{k}Z_{k} and Y_{k}^{\prime}=P_{k}Y_{k}.
10: Erase F_{k}^{\prime} and P_{k}.
11: Share [\boldsymbol{1},\widetilde{X}_{k}], [\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}], Z_{k}^{\prime} and Y_{k}^{\prime} with the analyst and erase them.
12:
13: analyst-side
14:
15: Get [\boldsymbol{1},\widetilde{X}_{k}], [\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}], Z_{k}^{\prime} and Y_{k}^{\prime} for all k.
16: Set [\boldsymbol{1},\widetilde{X}_{k}] and [\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}].
17: Compute G_{k} from \widetilde{X}_{k}^{\text{anc}} for all k such that [\boldsymbol{1},\widetilde{X}_{k}^{\text{anc}}]G_{k}\approx[\boldsymbol{1},\widetilde{X}_{k^{\prime}}^{\text{anc}}]G_{k^{\prime}}~(k\neq k^{\prime}).
18: Compute \check{X}_{k}=[\boldsymbol{1},\widetilde{X}_{k}]G_{k} for all k.
19: Set \check{X}, Z^{\prime} and Y^{\prime}.
20: Compute functions \check{q} and \check{h} using \check{X}, Z^{\prime} and Y^{\prime}.
21: Compute residuals \hat{\boldsymbol{\eta}} and \hat{\boldsymbol{\zeta}} using \check{q}, \check{h}, \check{X}, Z^{\prime} and Y^{\prime}.
22: Obtain \check{\boldsymbol{\gamma}} and \text{Var}(\check{\boldsymbol{\gamma}}) as the least squares solution of (4.1.2) (in our paper).
23: Return R_{k}^{\text{Point-anc}} and R_{k}^{\text{Var-anc}} to each user k.
24:
25: user-k-side
26:
27: Set R_{k}^{\text{Point-anc}} and R_{k}^{\text{Var-anc}}.
28: Compute \boldsymbol{\beta}_{k}, \text{Var}(\boldsymbol{\gamma}_{k}) and \text{Var}(\alpha_{k}).

The estimation result obtained by NI-DC-DML when E_{k} is the identity matrix is equal to that obtained by DC-DML. Conversely, if E_{k} is not the identity matrix, the NI-DC-DML and DC-DML estimation results differ. However, we think the difference is not large. As an example, Table 7 shows the estimation results of NI-DC-DML in the situation of Simulation I. The results are correct except that the coefficient of x^{3} for party 2 is significant.

Strictly speaking, if y_{i} or z_{i} are continuous values, NI-DC-DML cannot satisfy non-readily identifiability. A solution to this issue could be the use of other privacy preservation technologies such as k-anonymization, differential privacy, or cryptography; this is one of the important future issues. Further investigation of the relationship between readily identifiability and DC-DML deviates significantly from the purpose of this paper, so we end our discussion here.

Covariates   NI-DC-DML (PCA+B)
             Party 1      Party 2
const.        0.9133 **    0.9772 **
x^1           1.0451 **    0.9593 **
x^2           0.8847 **    1.0420 **
x^3          -0.1566      -0.2426 *
x^4          -0.0559       0.0028
x^5           0.0189       0.0158
x^6           0.1160       0.0359
x^7           0.0598      -0.1224
x^8          -0.0413       0.0371
x^9           0.0492       0.0802
x^10         -0.1166      -0.1608
** p < 0.01,   * p < 0.05
Table 7: The estimation results of NI-DC-DML in the situation of Simulation I.

References

  • Athey et al. (2019) S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148, 2019.
  • Bia et al. (2023) M. Bia, M. Huber, and L. Lafférs. Double machine learning for sample selection models. Journal of Business & Economic Statistics, pages 1–12, 2023.
  • Bogdanova et al. (2020) A. Bogdanova, A. Nakai, Y. Okada, A. Imakura, and T. Sakurai. Federated learning system without model sharing through integration of dimensional reduced data representations. In Proceedings of IJCAI 2020 International Workshop on Federated Learning for User Privacy and Data Confidentiality, 2020.
  • Brier (1950) G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
  • Chernozhukov et al. (2018) V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 01 2018. ISSN 1368-4221. doi: 10.1111/ectj.12097. URL https://doi.org/10.1111/ectj.12097.
  • Criado et al. (2022) M. F. Criado, F. E. Casado, R. Iglesias, C. V. Regueiro, and S. Barro. Non-iid data and continual learning processes in federated learning: A long road ahead. Information Fusion, 88:263–280, 2022.
  • Dehejia and Wahba (1999) R. H. Dehejia and S. Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American statistical Association, 94(448):1053–1062, 1999.
  • Engle et al. (1986) R. F. Engle, C. W. J. Granger, J. Rice, and A. Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American statistical Association, 81(394):310–320, 1986.
  • Fan et al. (2022) Q. Fan, Y. Hsu, R. P. Lieli, and Y. Zhang. Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, 40(1):313–327, 2022.
  • He and Niyogi (2003) X. He and P. Niyogi. Locality preserving projections. Advances in neural information processing systems, 16, 2003.
  • Hill (2011) J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Imakura and Sakurai (2020) A. Imakura and T. Sakurai. Data collaboration analysis framework using centralization of individual intermediate representations for distributed data sets. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering, 6(2):04020018, 2020.
  • Imakura et al. (2021a) A. Imakura, A. Bogdanova, T. Yamazoe, K. Omote, and T. Sakurai. Accuracy and privacy evaluations of collaborative data analysis. In The Second AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI-21), 2021a.
  • Imakura et al. (2021b) A. Imakura, H. Inaba, Y. Okada, and T. Sakurai. Interpretable collaborative data analysis on distributed data. Expert Systems with Applications, 177:114891, 2021b.
  • Imakura et al. (2023a) A. Imakura, T. Sakurai, Y. Okada, T. Fujii, T. Sakamoto, and H. Abe. Non-readily identifiable data collaboration analysis for multiple datasets including personal information. Information Fusion, 98:101826, 2023a.
  • Imakura et al. (2023b) A. Imakura, R. Tsunoda, R. Kagawa, K. Yamagata, and T. Sakurai. DC-COX: Data collaboration cox proportional hazards model for privacy-preserving survival analysis on multiple parties. Journal of Biomedical Informatics, 137:104264, 2023b.
  • Imbens and Rubin (2015) G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  • Karr et al. (2005) A. F. Karr, X. Lin, A. P. Sanil, and J. P. Reiter. Secure regression on distributed databases. Journal of Computational and Graphical Statistics, 14(2):263–279, 2005.
  • Kawamata et al. (2024) Y. Kawamata, R. Motai, Y. Okada, A. Imakura, and T. Sakurai. Collaborative causal inference on distributed data. Expert Systems with Applications, 244:123024, 2024.
  • Konečnỳ et al. (2016) J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, 2016. URL https://arxiv.org/abs/1610.05492.
  • Künzel et al. (2019) S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165, 2019.
  • McMahan et al. (2017) B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Pearson (1901) K. Pearson. LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559–572, 1901.
  • Rafi et al. (2024) T. H. Rafi, F. A. Noor, T. Hussain, and D. Chae. Fairness and privacy preserving in federated learning: A survey. Information Fusion, 105:102198, 2024.
  • Rodríguez-Barroso et al. (2020) N. Rodríguez-Barroso, G. Stipcich, D. Jiménez-López, J. A. Ruiz-Millán, E. Martínez-Cámara, G. González-Seco, M. V. Luzón, M. A. Veganzones, and F. Herrera. Federated learning and differential privacy: Software tools analysis, the sherpa. ai fl framework and methodological guidelines for preserving data privacy. Information Fusion, 64:270–292, 2020.
  • Rodríguez-Barroso et al. (2023) N. Rodríguez-Barroso, D. Jiménez-López, M. V. Luzón, F. Herrera, and E. Martínez-Cámara. Survey on federated learning threats: Concepts, taxonomy on attacks and defences, experimental study and challenges. Information Fusion, 90:148–173, 2023.
  • Rolling et al. (2019) C. A. Rolling, Y. Yang, and D. Velez. Combining estimates of conditional treatment effects. Econometric Theory, 35(6):1089–1110, 2019.
  • Rosenbaum and Rubin (1983) P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Shi et al. (2024) P. Shi, X. Zhang, and W. Zhong. Estimating conditional average treatment effects with heteroscedasticity by model averaging and matching. Economics Letters, 238:111679, 2024.
  • Spearman (1904) C. Spearman. “general intelligence,” objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.
  • Taddy et al. (2016) M. Taddy, M. Gardner, L. Chen, and D. Draper. A nonparametric bayesian analysis of heterogenous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672, 2016.
  • Tan et al. (2022) X. Tan, C. H. Chang, L. Zhou, and L. Tang. A tree-based model averaging approach for personalized treatment effect estimation from heterogeneous data sources. In International Conference on Machine Learning, pages 21013–21036. PMLR, 2022.
  • Vo et al. (2022) T. V. Vo, Y. Lee, T. N. Hoang, and T.-Y. Leong. Bayesian federated estimation of causal effects from observational data. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. URL https://openreview.net/forum?id=BEl3vP8sqlc.
  • Xiong et al. (2023) R. Xiong, A. Koenecke, M. Powell, Z. Shen, J. T. Vogelstein, and S. Athey. Federated causal inference in heterogeneous observational data. Statistics in Medicine, 42(24):4418–4439, 2023.