
Sequential Treatment Effect Estimation with Unmeasured Confounders

Yingrong Wang    Anpeng Wu    Baohong Li    Ziyang Xiao    Ruoxuan Xiong    Qing Han    Kun Kuang
Abstract

This paper studies the cumulative causal effects of sequential treatments in the presence of unmeasured confounders. This is a critical issue in sequential decision-making scenarios where treatment decisions and outcomes dynamically evolve over time. Advanced causal methods apply transformers as a backbone to model such time sequences, showing superiority in capturing long-term dependencies and periodic patterns via the attention mechanism. However, even when they control for the observed confounding, these estimators still suffer from unmeasured confounders, which influence both treatment assignments and outcomes. How to adjust for the latent confounding bias in sequential treatment effect estimation remains an open challenge. Therefore, we propose a novel Decomposing Sequential Instrumental Variable framework for CounterFactual Regression (DSIV-CFR), relying on a common negative control assumption. Specifically, an instrumental variable (IV) is a special negative control exposure, while the previous outcome serves as a negative control outcome. This allows us to recover the IVs latent in the observed variables and estimate sequential treatment effects via a generalized moment condition. We conducted experiments on 4 datasets and achieved significant performance in one- and multi-step prediction, based on which we can identify optimal treatments for dynamic systems.

Machine Learning, ICML

1 Introduction

Sequential decision-making is fundamental to many real-world applications, including personalized medicine (Huang & Ning, 2012; Feuerriegel et al., 2024), financial investment (Ang et al., 2006; Gârleanu & Pedersen, 2013), and policy making (Jeunen et al., 2020; Hizli et al., 2023). These scenarios involve decisions that must dynamically adapt to evolving conditions, where the outcomes of previous decisions directly influence subsequent choices. For example, consider a cancer patient undergoing treatment, as illustrated in Figure 1. The medical team must regularly adjust the treatment plan based on the patient's condition (confounders) to control tumor volume (outcome) (Geng. et al., 2017). After surgery, if the tumor grows to 55 cm³, the team may choose chemotherapy for the next stage after comprehensively considering the patient's current health and the expected treatment effects. Similar settings arise in various domains, such as financial investment, where decisions must adapt to fluctuating market conditions, and supply chain management, where strategies are adjusted dynamically based on demand and logistical feedback. Understanding the causal effects of sequential treatments is crucial for optimizing decision-making in such scenarios. However, accurately estimating these effects is complicated by the presence of unmeasured confounders that simultaneously influence treatment assignments and outcomes but remain unobserved. Unmeasured confounding introduces significant bias into causal estimates, leading to unreliable conclusions and potentially suboptimal or harmful decisions (Robins & Greenland, 1986; Kuroki & Pearl, 2014).


Figure 1: A case of counterfactual prediction and decision making on the time series data in a medical setting.

Several studies have explored causal effect estimation from observational time series data. One significant challenge in these settings is the time dependency of variables, where all variables are influenced not only by causal relationships but also by their prior values over time. Early works leveraged the Markov property to simplify this issue, modeling the problem as a state transition chain where outcomes depend only on the immediate past (Battocchi et al., 2021; Tran et al., 2024). More broadly, autoregressive models such as RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997) incorporate all historical data to estimate future outcomes, with approaches such as ACTIN (Wang et al., 2024) adapting LSTMs for causal inference in time sequences. However, these methods often face computational challenges when handling high-dimensional, long time series data. Recently, transformers (Vaswani et al., 2017) have emerged as a more powerful tool for identifying long-term dependencies and periodic patterns due to their attention mechanism, which dynamically captures the relationships between different positions in a sequence and flexibly allocates their weights. However, these approaches (Melnychuk et al., 2022; Shirakawa et al., 2024) rely on the unconfoundedness assumption, which presumes that all confounders are observed. In practice, this assumption is rarely met due to the difficulty of measuring critical latent factors, leading to biased causal effect estimates and unreliable findings.

Unmeasured confounding remains a critical challenge in causal inference, particularly in sequential treatment settings. For example, as illustrated in Figure 1, in cancer care, factors like socioeconomic status, mental health, and unmonitored lifestyle habits significantly influence treatment decisions and outcomes, yet these factors are often difficult or impossible to quantify or measure. Ignoring these variables would lead to biased estimates of causal effects, distorting the true relationships between treatments and outcomes. While existing approaches, such as Time Series Deconfounder (Bica et al., 2020), have made strides in addressing unmeasured confounding, they are often constrained by assumptions that may not hold in real-world scenarios, such as restrictive data requirements or limited flexibility in model design. Additionally, many of these methods face challenges with scalability and robustness, particularly when applied to high-dimensional, long time series data where the complexity of relationships between variables increases exponentially.

To address these challenges, we propose a novel framework: the Decomposing Sequential Instrumental Variable Framework for Counterfactual Regression (DSIV-CFR). This framework builds on the common negative control assumption, treating instrumental variables (IVs) as special negative control exposures and prior outcomes as negative control outcomes. These relationships allow the recovery of instrumental variables from observed covariates, enabling the framework to mitigate bias introduced by unmeasured confounders. DSIV-CFR effectively decomposes the problem into manageable tasks using generalized moment conditions, making it robust for handling high-dimensional, time-dependent data. Unlike existing methods, DSIV-CFR does not rely on the unconfoundedness assumption, ensuring reliable causal effect estimates even in the presence of latent confounders. Through extensive experiments on synthetic and real-world datasets, we demonstrate that DSIV-CFR significantly outperforms existing methods, providing a scalable, accurate, and practical solution for optimizing sequential decision-making in dynamic systems.


Figure 2: Causal graphs illustrating the relationships between the variables in time series data. On the left, we describe the causalities in detail, where the subscript $t$ in each variable reflects the time dependence. Unobserved variables are marked in shadow. In this paper, we focus on the cumulative effect of the treatments $(\boldsymbol{A}_{1},\dots,\boldsymbol{A}_{t})$ on the outcome $\boldsymbol{Y}_{t}$. For simplicity, we give its summary graph on the right, where the historical information of all treatments and confounders is denoted as $\bar{\boldsymbol{A}}_{t-1}$ and $\bar{\boldsymbol{C}}_{t-1}$.

The contributions of this paper can be summarized as follows:

  • We systematically study the problem of sequential treatment effect estimation, particularly the complex causal relationships among variables in time series data under unmeasured confounders.

  • We propose a novel method, DSIV-CFR, which effectively addresses the bias caused by unobserved confounders by leveraging instrumental variables and prior outcomes as negative controls, enabling accurate estimation of sequential treatment effects.

  • Extensive experiments demonstrate the superiority of DSIV-CFR in mitigating unmeasured confounding bias, improving causal estimation accuracy, and identifying optimal treatment strategies in dynamic systems.

2 Related Work

Treatment effect estimation for time series data. A comprehensive survey (Moraffah et al., 2021) explores basic problems of causal inference for time series analysis. Treatments in these scenarios can be time-invariant or time-varying. A time-invariant treatment refers to an intervention that occurs at a specific time and remains unchanged afterward. To estimate the effect of such treatments, the difference-in-differences method (Ashenfelter, 1978; Athey & Imbens, 2006) is one of the most widely used tools in economics, which depends on the parallel trends assumption. The estimation of effects for time-varying treatments is most relevant to this paper. One solution for modeling the time series data is to utilize the Markov assumption, where the complex temporal issue can be simplified into a problem of state transitions between adjacent timestamps (Battocchi et al., 2021; Tran et al., 2024), i.e., the data at time $t$ is influenced only by the data at time $t-1$. The other way is to apply autoregressive models to connect causal models at single timestamps, where all historical information serves as input to affect the current time point $t$. For example, the Time Series Deconfounder (Bica et al., 2020), which uses an RNN (Elman, 1990), can be regarded as an extension of the Deconfounder (Wang & Blei, 2019) to temporal sequences. LSTMs (Hochreiter & Schmidhuber, 1997) have also been applied to extend the ideas of CFR (Shalit et al., 2017) to the estimation of causal effects in time series data (Wang et al., 2024). In recent works (Melnychuk et al., 2022; Shirakawa et al., 2024), the transformer (Vaswani et al., 2017) serves as a more powerful tool to model complex time series when estimating the treatment effect.

Treatment effect estimation with unobserved confounders. The instrumental variable (IV) is a powerful tool for addressing latent confounding bias. It is independent of unmeasured confounders, serves as a cause of the treatment, but has no direct influence on the outcome. However, standard IV methods require a predefined, strong, and valid IV, which is often difficult to find in real-world applications. Therefore, many researchers have studied the generation or synthesis of IVs to address this problem. For example, ModelIV (Hartford et al., 2021) identifies valid IVs by selecting those in the tightest cluster of estimation points, assuming that valid candidates yield approximately the same causal effect, thus relaxing the requirement that more than half of the candidates be valid. AutoIV (Yuan et al., 2022) generates IV representations by learning a disentangled representation from the observational data on the basis of independence conditions. However, the representations learned by this methodology may sometimes be a reverse IV rather than a true IV; that is, there is a causal link from the treatment variable that leads to the so-called IV. Negative control is another tool widely used to address unmeasured confounding. Negative controls are observational variables that serve as proxies to indirectly indicate the effects of unmeasured confounders (Miao et al., 2018). A cross-moment approach (Kivva et al., 2023) has been introduced, which also leverages the idea of negative controls to mitigate the impact of unobserved confounders. Moreover, negative controls can be naturally extended to time series data analysis (Miao et al., 2024). In this context, the treatment applied at the next moment can be regarded as a negative control exposure, while the outcome at the previous timestamp serves as a negative control outcome.

3 Problem Setup

In this paper, we aim to study the cumulative treatment effects of sequential treatments under unmeasured confounders. For observational data, as shown in Figure 2, we observe $\mathcal{D}=\{\boldsymbol{A}_{t},\boldsymbol{X}_{t},\boldsymbol{Y}_{t}\}_{t=1}^{T}$ collected over $T$ time steps. Among the observed variables, $\boldsymbol{A}_{t}=\{\boldsymbol{A}_{t}^{i}\}_{i=1}^{n}$ represents the treatments applied at timestamp $t$ to $n$ units. Throughout this paper, the superscript $i$ denotes the index of each unit, while the subscript $t$ denotes the timestamp. $\boldsymbol{X}_{t}$ and $\boldsymbol{Y}_{t}$ are the observed covariates and factual outcomes. In the presence of unmeasured confounding, there exists a set of latent variables $\boldsymbol{U}_{t}$, which are missing and may simultaneously affect both the treatments $\boldsymbol{A}_{t}$ and the outcomes $\boldsymbol{Y}_{t-1}$ and $\boldsymbol{Y}_{t}$. These unobserved confounders introduce dependencies that confound causal inference. We use $\bar{\boldsymbol{A}}_{t-1}=(\boldsymbol{A}_{1},\dots,\boldsymbol{A}_{t-1})$ to denote the history of the treatment variable $\boldsymbol{A}$ up to time $t-1$, and similarly define $\bar{\boldsymbol{X}}_{t-1}$, $\bar{\boldsymbol{Y}}_{t-1}$, and $\bar{\boldsymbol{U}}_{t-1}$. For simplicity, we use $\bar{\boldsymbol{H}}_{t-1}=\{\bar{\boldsymbol{X}}_{t-1},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{Y}}_{t-1}\}$ to denote the observed history before treatment $\boldsymbol{A}_{t}$ is applied. In the sequential treatment effect problem, the sequential values of $\{\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{X}}_{t-1},\bar{\boldsymbol{U}}_{t-1}\}$ act as confounders that jointly influence both the current treatment $\boldsymbol{A}_{t}$ and the outcome $\boldsymbol{Y}_{t}$.

One classical solution to reduce the bias from $\boldsymbol{U}_{t-1}$ is to use a predefined IV, leveraging its exogeneity to infer causal effects. However, such IVs are hard to find in practice. Therefore, we aim to recover IVs from the observations $\boldsymbol{X}_{t-1}$ by decomposing them into the following two parts (in time series analysis, $\boldsymbol{Z}$ and $\boldsymbol{C}$ are considered two independent sources of $\boldsymbol{X}$, with the mapping relationships assumed to be static, meaning they remain unchanged over time): (1) confounders $\boldsymbol{C}_{t-1}$ such that $\boldsymbol{A}_{t}\leftarrow\boldsymbol{C}_{t-1}\rightarrow\boldsymbol{Y}_{t}$; (2) instruments $\boldsymbol{Z}_{t-1}$ that affect $\boldsymbol{A}_{t}$ but are not influenced by $\boldsymbol{U}_{t-1}$. Under the proposed time-series framework, we reconstruct these variables as a summary graph, shown on the right of Figure 2. The components of this summary graph include the newly applied treatment $\boldsymbol{A}_{t}$, the outcome of interest $\boldsymbol{Y}_{t}$, instruments $\boldsymbol{Z}_{t-1}$, confounders $\{\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{U}_{t-1}\}$, and auxiliary information $\boldsymbol{Y}_{t-1}$. Our task is to predict the counterfactual outcome for the choice of the optimal treatment given $\bar{\boldsymbol{H}}_{t-1}$:

\mathbb{E}[\boldsymbol{Y}_{t}(\boldsymbol{a}_{t})\mid\bar{\boldsymbol{H}}_{t-1}]. (1)

We make the following assumptions throughout this paper.

Assumption 3.1.

(Consistency). If a unit receives a treatment, i.e., $\boldsymbol{A}_{t}=\boldsymbol{a}_{t}$, then the observed outcomes $\boldsymbol{Y}_{t}$ are identical to the potential outcomes $\boldsymbol{Y}_{t}(\boldsymbol{a}_{t})$.

Assumption 3.2.

(Overlap). For any intervention, there is always a positive probability of receiving it given the confounders, i.e., $\sup_{\boldsymbol{a}_{t}\in\boldsymbol{A}_{t}}\left|\frac{\mathbb{P}(\boldsymbol{A}_{t}=\boldsymbol{a}_{t}\mid\bar{\boldsymbol{C}}_{t-1})}{\mathbb{P}(\boldsymbol{A}_{t}=\boldsymbol{a}_{t}\mid\bar{\boldsymbol{C}}_{t-1},\boldsymbol{U}_{t-1})}\right|<\infty$.

Assumption 3.3.

(Sequential Latent Ignorability). By controlling for all confounders, including the observed ones $\bar{\boldsymbol{C}}_{t-1}$ and the unobserved ones $\boldsymbol{U}_{t-1}$, the effect of treatment $\boldsymbol{A}_{t}$ on the outcome $\boldsymbol{Y}_{t}$ can be estimated without bias. Formally, $\boldsymbol{Y}_{t}(\boldsymbol{a}_{t})\perp\!\!\!\perp\boldsymbol{A}_{t}\mid\{\bar{\boldsymbol{C}}_{t-1},\boldsymbol{U}_{t-1}\}$ for all $\boldsymbol{a}_{t}$.

Assumption 3.4.

(Additive Noise Model). Error terms are independent of each other; that is, the unobserved confounders from the previous time point do not affect the next ones. Formally, $\boldsymbol{Y}_{t-1}\not\rightarrow\boldsymbol{Y}_{t}$ and $\boldsymbol{U}_{t-1}\not\rightarrow\boldsymbol{U}_{t}$. To estimate the potential outcome, we also assume $\mathbb{E}[\boldsymbol{U}\mid\boldsymbol{X}]=0$.

Assumption 3.5.

(Time-invariant Relationship). The treatment effect function, denoted as $h:\boldsymbol{X}\times\boldsymbol{A}\rightarrow\boldsymbol{Y}$, does not change over time, although its value fluctuates due to the confounding heterogeneity of $\boldsymbol{X}$.


Figure 3: Overview of our model DSIV-CFR. Further explanations are displayed in the right panel. Historical observations $\bar{\boldsymbol{H}}$ are input into the first transformer $\psi(\cdot)$ to learn the representations of IVs $\phi_{Z}$ and confounders $\phi_{C}$, which are optimized by the mutual information (MI) loss. The second transformer $h(\cdot)$ is trained as a backbone with the objective of accurately predicting future potential outcomes $\boldsymbol{Y}$, measured by the MSE loss. In addition, an adversarial loss derived from the IVs and confounders builds up a GMM framework, requiring another bridge function $f(\cdot)$ to learn a weight $\boldsymbol{M}$ and train against $h(\cdot)$.
Proposition 3.6 (IV Decomposition).

Following the IV conditions, if all variables are observable, we can decompose the instrument $\boldsymbol{Z}_{t-1}$ from the observed covariates $\bar{\boldsymbol{X}}_{t-1}$ as follows: $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\bar{\boldsymbol{C}}_{t-1}$ and $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\boldsymbol{Y}_{t}\mid\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{U}_{t-1}\}$.

These conditions imply that the instrumental variable $\boldsymbol{Z}_{t-1}$ is independent of the outcome $\boldsymbol{Y}_{t}$ given the appropriate set of covariates, including the unmeasured confounders $\boldsymbol{U}_{t-1}$. However, in practice, we cannot directly identify the instrument $\boldsymbol{Z}_{t-1}$ because $\boldsymbol{U}_{t-1}$ is unobserved. As a result, instrumental variable identification becomes challenging and requires additional assumptions. Motivated by the idea of negative control (Miao et al., 2018, 2024), we utilize a pair of proxy variables, a negative control exposure and a negative control outcome, to help identify causal effects.

Assumption 3.7.

(Negative Control). Negative controls comprise negative control exposures (NCEs) and negative control outcomes (NCOs). The NCE cannot directly affect the outcome $\boldsymbol{Y}_{t}$, and neither the NCE nor the treatment $\boldsymbol{A}_{t}$ can affect the NCO. The effect of the unmeasured confounders $\boldsymbol{U}_{t-1}$ on the NCE and $\boldsymbol{A}_{t}$, as well as on the NCO and $\boldsymbol{Y}_{t}$, is proportional.

Generally, we can treat the instrument $\boldsymbol{Z}_{t-1}$ as a special case of an NCE, while the previous outcome $\boldsymbol{Y}_{t-1}$ can be regarded as a special case of an NCO.

Theorem 3.8 (IV Identification).

Based on the negative control assumptions, given $\bar{\boldsymbol{C}}_{t-1}$, we can decompose the instrument $\boldsymbol{Z}_{t-1}$ from the observed covariates $\bar{\boldsymbol{X}}_{t-1}$ as follows: $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\bar{\boldsymbol{C}}_{t-1}$ and $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\boldsymbol{Y}_{t}\mid\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Y}_{t-1}\}$.

Proof.

Under Assumptions 3.1-3.5 and 3.7, we assume that the effect of $\boldsymbol{U}_{t-1}$ on $\boldsymbol{Y}_{t-1}$ and $\boldsymbol{Y}_{t}$ is the same, so the outcome feedback function $h$ can be reformulated as:

\begin{split}\boldsymbol{Y}_{t}&=h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})+\epsilon(\boldsymbol{U}_{t})+\epsilon(\boldsymbol{U}_{t-1})\\ \boldsymbol{Y}_{t-1}&=h(\boldsymbol{A}_{t-1},\bar{\boldsymbol{A}}_{t-2},\bar{\boldsymbol{C}}_{t-2})+\epsilon(\boldsymbol{U}_{t-2})+\epsilon(\boldsymbol{U}_{t-1}),\end{split} (2)

where $\epsilon(\boldsymbol{U}_{t-1})$ denotes the unmeasured common causes of treatment $\boldsymbol{A}_{t}$ and outcome $\boldsymbol{Y}_{t}$. Given $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\epsilon(\boldsymbol{U}_{t})\mid\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1}\}$ and $\boldsymbol{Z}_{t-1}\not\!\perp\!\!\!\perp\epsilon(\boldsymbol{U}_{t-1})\mid\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1}\}$, we can reformulate the decomposition condition in Proposition 3.6 as follows:

\begin{split}\boldsymbol{Z}_{t-1}\perp\!\!\!\perp&\{h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})+\epsilon(\boldsymbol{U}_{t-1})\}\\ \mid\ &\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\epsilon(\boldsymbol{U}_{t-1})\}.\end{split} (3)

Therefore, we can identify the IV $\boldsymbol{Z}_{t-1}$ from:

\begin{split}\boldsymbol{Z}_{t-1}\perp\!\!\!\perp&\{h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})+\epsilon(\boldsymbol{U}_{t-1})\}\\ \mid\ &\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Y}_{t-1}\}.\end{split} (4)

Then, we can use the instrument $\boldsymbol{Z}_{t-1}$ to help isolate the direct causal effect $h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})$ from the unmeasured confounding bias. ∎

By taking the expectation of both sides of Equation (2) conditioned on $\{\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1}\}$ and using Assumption 3.4, i.e., $\mathbb{E}[\boldsymbol{U}_{t}\mid\bar{\boldsymbol{X}}_{t-1}]=0$, we have:

\begin{split}&\mathbb{E}[\boldsymbol{Y}_{t}\mid\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1}]\\ =&\ \mathbb{E}[h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})\mid\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1}]+\mathbb{E}[\epsilon(\boldsymbol{U}_{t-1})\mid\bar{\boldsymbol{C}}_{t-1}]\\ =&\ \int h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})\ dF(\boldsymbol{A}_{t}\mid\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1}),\end{split} (5)

where $F(\boldsymbol{A}_{t}\mid\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1})$ is the conditional treatment distribution. Equation (5) expresses an inverse problem for the function $h$ in terms of $\mathbb{E}[\boldsymbol{Y}_{t}\mid\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1}]$ and $F(\boldsymbol{A}_{t}\mid\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1})$ (Newey & Powell, 2003). A standard approach is two-stage regression: first, use the observed confounders and instrument to estimate the conditional treatment distribution $\hat{F}(\boldsymbol{A}_{t}\mid\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1})$; then, regress the outcome on treatments $\hat{\boldsymbol{A}}_{t}$ sampled from $\hat{F}$ and the observed confounders, i.e., $\hat{h}(\hat{\boldsymbol{A}}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})$. This can be regarded as a special form of the generalized method of moments (GMM) (Hall, 2003). In the context of negative controls (Miao et al., 2024), we extend this consistent estimation by applying GMM.
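To make the two-stage idea concrete, the following is a minimal sketch in a purely linear, single-step setting (not the paper's transformer-based estimator): the instrument predicts the treatment, and the fitted treatment replaces the observed one in the outcome regression, removing the bias that the unmeasured confounder induces in naive regression. All coefficients and dimensions are illustrative assumptions.

```python
import numpy as np

# Hedged linear sketch of two-stage IV regression (2SLS) as a special case of GMM.
rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=(n, 1))          # instrument Z_{t-1}
C = rng.normal(size=(n, 1))          # observed confounder C_{t-1}
U = rng.normal(size=(n, 1))          # unmeasured confounder U_{t-1}
A = 1.5 * Z + 0.8 * C + U + 0.1 * rng.normal(size=(n, 1))   # treatment A_t
Y = 2.0 * A + 1.0 * C + U + 0.1 * rng.normal(size=(n, 1))   # outcome Y_t (true effect = 2.0)

# Stage 1: regress the treatment on instrument and confounder to obtain A_hat.
W1 = np.hstack([Z, C, np.ones((n, 1))])
A_hat = W1 @ np.linalg.lstsq(W1, A, rcond=None)[0]

# Stage 2: regress the outcome on A_hat and the confounder; the coefficient of
# A_hat recovers the causal effect, whereas naive OLS on A is biased by U.
W2 = np.hstack([A_hat, C, np.ones((n, 1))])
beta_2sls = np.linalg.lstsq(W2, Y, rcond=None)[0]
W_naive = np.hstack([A, C, np.ones((n, 1))])
beta_ols = np.linalg.lstsq(W_naive, Y, rcond=None)[0]
print("2SLS effect:", beta_2sls[0, 0], "  naive OLS effect:", beta_ols[0, 0])
```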

4 Methodology

Our proposed method, the Decomposing Sequential Instrumental Variable framework for Counterfactual Regression (DSIV-CFR), is illustrated in Figure 3. It consists of two key modules designed to address the challenges of sequential treatment effect estimation with unmeasured confounders. The first module focuses on leveraging transformers to model long sequential dependencies and learning the representations of instrumental variables (IVs) and confounders. The second module employs the generalized method of moments (GMM) to estimate counterfactual outcomes while effectively utilizing the learned representations.

4.1 Long Sequential Modeling for Learning IV and Confounder Representations

In our setting for estimating the effect of sequential treatments, the relationships between variables are influenced not only by immediate past values but also by long-term dependencies. To capture these complex dependencies, we employ a transformer (Vaswani et al., 2017; Melnychuk et al., 2022) as our backbone, which has shown state-of-the-art performance in modeling long-range interactions in sequential data. Specifically, we build a 6-layer transformer with 8 heads. The input to this module consists of the observed covariates, treatment variables, and outcomes at each time step, i.e., $\bar{\boldsymbol{H}}_{t-1}=\{\bar{\boldsymbol{X}}_{t-1},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{Y}}_{t-1}\}$. In each head $j$, we calculate the keys $K^{(j)}$, queries $Q^{(j)}$, and values $V^{(j)}$ from a linear transformation of $\bar{\boldsymbol{H}}_{t-1}$ to obtain:

\text{Atten}^{(j)}=\text{softmax}\left(\frac{Q^{(j)}K^{(j)\top}}{\sqrt{d_{qkv}}}\right)V^{(j)}, (6)

where $d_{qkv}$ is the dimensionality. We use the concatenation as the output of this sub-module with $J$ heads:

\text{MHA}=\text{concat}\left(\text{Atten}^{(1)},\dots,\text{Atten}^{(J)}\right). (7)

A feed-forward layer with ReLU activation is then used for the non-linear transformation of the MHA output. Techniques such as layer normalization (Ba et al., 2016) and dropout (Srivastava et al., 2014) are also applied to enhance the stability and robustness of model training. The output of this module is a basic embedding $\psi(\bar{\boldsymbol{H}}_{t-1})$, which is then followed by $\phi_{Z}(\cdot)$ and $\phi_{C}(\cdot)$ to further disentangle IVs and confounders.
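As a rough illustration of this backbone, the sketch below wires a standard PyTorch transformer encoder over the concatenated history and attaches two linear heads playing the roles of $\phi_{Z}(\cdot)$ and $\phi_{C}(\cdot)$; the hidden sizes, causal masking, and head structure are our own assumptions rather than the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

# Hedged sketch of the backbone psi(.) from Equations (6)-(7) with heads phi_Z and phi_C.
class HistoryEncoder(nn.Module):
    def __init__(self, input_dim, d_model=64, n_heads=8, n_layers=6, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # representation heads standing in for phi_Z and phi_C
        self.phi_Z = nn.Linear(d_model, d_model // 2)
        self.phi_C = nn.Linear(d_model, d_model // 2)

    def forward(self, history):                     # history: (batch, T, input_dim) = [X, A, Y]
        T = history.size(1)
        # causal mask so the representation at step t only attends to steps <= t
        mask = torch.triu(torch.full((T, T), float("-inf"), device=history.device), diagonal=1)
        psi = self.encoder(self.embed(history), mask=mask)
        return self.phi_Z(psi), self.phi_C(psi)     # per-step IV and confounder representations

# usage: batch of 4 sequences, 20 steps, 12-dimensional concatenated history features
enc = HistoryEncoder(input_dim=12)
Z_repr, C_repr = enc(torch.randn(4, 20, 12))
```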

Learning IV Representation. The IV should satisfy two conditions: (1) Relevance. Instruments $\boldsymbol{Z}_{t-1}$ are required to be correlated with the treatment $\boldsymbol{A}_{t}$. Inspired by AutoIV (Yuan et al., 2022), we encourage the information of $\boldsymbol{Z}_{t-1}$ related to $\boldsymbol{A}_{t}$ to enter the IV representations $\phi_{Z}(\psi(\bar{\boldsymbol{H}}_{t-1}))$ by minimizing the additive inverse of the contrastive log-ratio upper bound (Cheng et al., 2020):

\begin{split}\mathcal{L}_{ZA,t}=&-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\big[\log q_{\theta_{ZA}}(\boldsymbol{A}^{i}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\\ &-\log q_{\theta_{ZA}}(\boldsymbol{A}^{j}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\big].\end{split} (8)

The variational distribution $q_{\theta_{ZA}}(\boldsymbol{A}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}_{t-1})))$ is parameterized by $\theta_{ZA}$ to approximate the true conditional distribution $\mathbb{P}(\boldsymbol{A}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}_{t-1})))$. The log-likelihood loss of the variational approximation is defined as:

\mathcal{LLD}_{ZA,t}=-\frac{1}{n}\sum_{i=1}^{n}\log q_{\theta_{ZA}}(\boldsymbol{A}^{i}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1}))). (9)

(2) Exclusion. As stated in Theorem 3.8, we require $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\boldsymbol{Y}_{t}\mid\{\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Y}_{t-1}\}$. Similarly,

\begin{split}\mathcal{L}_{ZY,t}=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\big\{w^{ij}_{t-1}\big[&\log q_{\theta_{ZY}}(\boldsymbol{Y}^{i}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\\ -&\log q_{\theta_{ZY}}(\boldsymbol{Y}^{j}_{t}\mid\phi_{Z}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\big]\big\},\end{split} (10)

where $w_{t-1}^{ij}$ is the weight of the sample pair $(i,j)$ used to achieve the conditional independence mentioned above. It is determined by the discrepancy between $v_{t}^{i}=[\boldsymbol{A}_{t}^{i},\bar{\boldsymbol{A}}_{t-1}^{i},\bar{\boldsymbol{C}}_{t-1}^{i},\boldsymbol{Y}_{t-1}^{i}]$ and $v_{t}^{j}=[\boldsymbol{A}_{t}^{j},\bar{\boldsymbol{A}}_{t-1}^{j},\bar{\boldsymbol{C}}_{t-1}^{j},\boldsymbol{Y}_{t-1}^{j}]$ under an RBF kernel:

w_{t-1}^{ij}=\text{softmax}\left(\exp\left(-\frac{\lVert v_{t}^{i}-v_{t}^{j}\rVert^{2}}{2\sigma^{2}}\right)\right). (11)

The hyper-parameter $\sigma$ controls the width of the Gaussian function, which we set to 1 in the experiments.
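The sketch below illustrates, under simplifying assumptions (a Gaussian variational model and unweighted pairs), how a CLUB-style term such as Equation (8) and its log-likelihood counterpart in Equation (9) can be computed from a batch of representations; the weighted variant in Equation (10) additionally multiplies each pair by $w^{ij}_{t-1}$. Network sizes and batch shapes are illustrative.

```python
import torch
import torch.nn as nn

class GaussianVariational(nn.Module):
    """Variational model q_theta(target | representation) with a diagonal Gaussian."""
    def __init__(self, repr_dim, target_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(repr_dim, hidden), nn.ReLU(), nn.Linear(hidden, target_dim))
        self.logvar = nn.Sequential(nn.Linear(repr_dim, hidden), nn.ReLU(), nn.Linear(hidden, target_dim))

    def log_prob(self, rep, target):
        mu, logvar = self.mu(rep), self.logvar(rep)
        return (-(target - mu) ** 2 / logvar.exp() - logvar).sum(-1) / 2

def club_upper_bound(q, rep, target):
    # CLUB: positive-pair term log q(a_i | z_i) minus the all-pairs average log q(a_j | z_i).
    n = rep.size(0)
    pos = q.log_prob(rep, target)
    rep_tiled = rep.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
    tgt_tiled = target.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
    neg = q.log_prob(rep_tiled, tgt_tiled).view(n, n).mean(dim=1)
    return (pos - neg).mean()

q_ZA = GaussianVariational(repr_dim=8, target_dim=1)
Z_rep, A_t = torch.randn(32, 8), torch.randn(32, 1)
loss_ZA = -club_upper_bound(q_ZA, Z_rep, A_t)        # relevance term, cf. Equation (8)
loss_LLD = -q_ZA.log_prob(Z_rep, A_t).mean()         # variational fit, cf. Equation (9)
```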

Learning Confounder Representation. There are also two restrictions to take into account: (1) $\boldsymbol{A}_{t}\leftarrow\boldsymbol{C}_{t-1}\rightarrow\boldsymbol{Y}_{t}$. Therefore, we minimize:

\begin{split}\mathcal{L}_{CA,t}=&-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\big[\log q_{\theta_{CA}}(\boldsymbol{A}^{i}_{t}\mid\phi_{C}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\\ &-\log q_{\theta_{CA}}(\boldsymbol{A}^{j}_{t}\mid\phi_{C}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\big],\end{split} (12)
\begin{split}\mathcal{L}_{CY,t}=&-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\big[\log q_{\theta_{CY}}(\boldsymbol{Y}^{i}_{t}\mid\phi_{C}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\\ &-\log q_{\theta_{CY}}(\boldsymbol{Y}^{j}_{t}\mid\phi_{C}(\psi(\bar{\boldsymbol{H}}^{i}_{t-1})))\big],\end{split} (13)

(2) Following Theorem 3.8, we require $\boldsymbol{Z}_{t-1}\perp\!\!\!\perp\bar{\boldsymbol{C}}_{t-1}$:

\begin{split}\mathcal{L}_{ZC,t}=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\big[&\log q_{\theta_{ZC}}(\boldsymbol{Z}_{t-1}^{i}\mid\bar{\boldsymbol{C}}_{t-1}^{i})\\ -&\log q_{\theta_{ZC}}(\boldsymbol{Z}_{t-1}^{j}\mid\bar{\boldsymbol{C}}_{t-1}^{i})\big],\end{split} (14)

where $\boldsymbol{Z}_{t-1}$ and $\bar{\boldsymbol{C}}_{t-1}$ are approximated by $\phi_{Z}(\psi(\bar{\boldsymbol{H}}_{t-1}))$ and $\{\phi_{C}(\psi(\bar{\boldsymbol{H}}_{1})),\dots,\phi_{C}(\psi(\bar{\boldsymbol{H}}_{t-1}))\}$, respectively.

The overall loss function of these mutual information restrictions can be summarized as:

\mathcal{L}_{MI,t}=\mathcal{L}_{ZA,t}+\mathcal{L}_{ZY,t}+\mathcal{L}_{CA,t}+\mathcal{L}_{CY,t}+\mathcal{L}_{ZC,t}, (15)

where $\mathcal{L}_{MI}=\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{MI,t}$ and $T$ denotes the length of the historical time series. To optimize the parameters of the variational distributions, i.e., $\{\theta_{ZA},\theta_{ZY},\theta_{CA},\theta_{CY},\theta_{ZC}\}$, we combine all the variational approximation losses as:

\begin{split}\mathcal{L}_{LLD}=&\frac{1}{T}\sum_{t=1}^{T}(\mathcal{LLD}_{ZA,t}+\mathcal{LLD}_{ZY,t}\\ +&\mathcal{LLD}_{CA,t}+\mathcal{LLD}_{CY,t}+\mathcal{LLD}_{ZC,t}), (16)\end{split}

where the definitions of $\mathcal{LLD}_{ZY,t}$, $\mathcal{LLD}_{CA,t}$, $\mathcal{LLD}_{CY,t}$, and $\mathcal{LLD}_{ZC,t}$ are similar to that of $\mathcal{LLD}_{ZA,t}$ in Equation (9) and are omitted due to space limitations. Using contrastive learning with this upper bound for modeling the representations, we automatically separate $\boldsymbol{Z}$ and $\boldsymbol{C}$ from $\boldsymbol{X}$ via Equation (14), while ensuring the validity of $\boldsymbol{Z}$ via Equations (8)-(10) and the confounding properties via Equations (12)-(13). Although we stated static relationships between $\boldsymbol{Z}$, $\boldsymbol{C}$, and $\boldsymbol{X}$ in Section 3, our model is also applicable when these relationships are dynamic.

Table 1: Statistics of datasets.
Dataset | Type of A | dim_A | dim_X | dim_U | T | # Train | # Validation | # Test
Simulation | binary | 1 | 10 | 3 | 100 | 10,000 | 1,000 | 1,000
Tumor growth | continuous | 2 | 3 | 3 | 60 | 10,000 | 1,000 | 21,741
Cryptocurrency | continuous | 1 | 5 | unknown | 5 | 86 | 29 | 29
MIMIC-III | binary | 2 | 69 | unknown | 60 | 4,293 | 920 | 920
Table 2: Results of one-step-ahead outcome prediction on synthetic and real-world datasets.
Method | Simulation | Tumor | Cryptocurrency | MIMIC-III
Time Series Deconfounder | 0.930 ± 0.101 | 0.753 ± 0.042 | 2.141 ± 0.051 | 1.071 ± 0.039
Causal Transformer | 1.256 ± 0.081 | 0.716 ± 0.003 | 2.426 ± 0.014 | 1.229 ± 0.026
ACTIN | 1.481 ± 0.129 | 1.041 ± 0.002 | 2.135 ± 0.005 | 1.162 ± 0.043
ORL | 0.798 ± 0.115 | 0.347 ± 0.010 | 1.861 ± 0.012 | 1.253 ± 0.024
Deep LTMLE | 0.531 ± 0.077 | 0.202 ± 0.074 | 1.553 ± 0.061 | 1.260 ± 0.134
DSIV-CFR | 0.105 ± 0.017 | 0.047 ± 0.003 | 0.375 ± 0.039 | 0.634 ± 0.019
* Lower = better (best results in the last row).

4.2 GMM for Counterfactual Regression

As discussed in Section 3, we establish a GMM framework to accurately estimate the potential outcome in the presence of unmeasured confounders. We also apply a transformer $h(\cdot)$ as the backbone of the estimator $\hat{\boldsymbol{Y}}_{t}=h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})$ mentioned in Equation (5). It is designed to predict the potential outcome $\boldsymbol{Y}_{t}$ from the treatment $\boldsymbol{A}_{t}$ and the historical information $\{\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1}\}$, which can be seen as the confounders illustrated on the right of Figure 2. The basic prediction error is taken into account:

\min_{h}\quad\mathcal{L}_{MSE}=\frac{1}{n\cdot T}\sum_{t=1}^{T}\sum_{i=1}^{n}\left(\hat{\boldsymbol{Y}}_{t}^{i}-\boldsymbol{Y}_{t}^{i}\right)^{2}, (17)

which can be viewed as a second-order moment constraint. To further implement the process described by Equation (5), we build a bridge function $f(\cdot)$ to obtain learnable weights $\boldsymbol{M}_{t-1}=f(\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1})$. The objective is defined as:

\min_{h}\max_{f}\quad\mathcal{L}_{adv}=\frac{1}{n\cdot T}\sum_{t=2}^{T+1}\sum_{i=1}^{n}\boldsymbol{M}_{t-1}^{i}\left(\hat{\boldsymbol{Y}}_{t}^{i}-\boldsymbol{Y}_{t}^{i}\right), (18)

where $\boldsymbol{M}_{t-1}^{i}$ is the weight of the $i$-th sample in $\boldsymbol{M}_{t-1}$.
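A minimal sketch of one alternating update of this min-max objective is given below, with small feed-forward networks standing in for the outcome transformer $h(\cdot)$ and the bridge function $f(\cdot)$; the feature layouts, dimensions, and optimizer settings are illustrative assumptions rather than the actual training configuration.

```python
import torch
import torch.nn as nn

h = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))   # stand-in for the outcome model h(.)
f = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))   # bridge function f(.) producing weights M
opt_h = torch.optim.Adam(h.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
beta = 0.1

treat_conf = torch.randn(128, 10)   # illustrative features for (A_t, A_hist, C_hist)
iv_conf = torch.randn(128, 12)      # illustrative features for (A_hist, C_hist, Z_{t-1})
Y = torch.randn(128, 1)

# (1) maximize the weighted moment violation with respect to the bridge function f
opt_f.zero_grad()
loss_f = -(f(iv_conf) * (h(treat_conf).detach() - Y)).mean()
loss_f.backward()
opt_f.step()

# (2) minimize the MSE plus the weighted moment term with respect to h
opt_h.zero_grad()
residual = h(treat_conf) - Y
loss_h = (residual ** 2).mean() + beta * (f(iv_conf).detach() * residual).mean()
loss_h.backward()
opt_h.step()
```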

The overall objective of DSIV-CFR can be summarized as minimizing the following loss function:

\min_{h}\max_{f}\quad\mathcal{L}=\mathcal{L}_{MSE}+\alpha\cdot\mathcal{L}_{MI}+\beta\cdot\mathcal{L}_{adv}, (19)

where $\{\alpha,\beta\}$ are two hyper-parameters that control the importance of the different modules. We summarize the model optimization process in Algorithm 1 in Appendix A.

5 Experiments

5.1 Baselines

In Section 2, we discussed several works on causal inference for time series data. We applied these methods as baselines for comparison with our DSIV-CFR, including the Time Series Deconfounder (Bica et al., 2020), the surrogate-based approach (Battocchi et al., 2021), the Causal Transformer (Melnychuk et al., 2022), ACTIN (Wang et al., 2024), ORL (Tran et al., 2024), and Deep LTMLE (Shirakawa et al., 2024). More details are given in Appendix B.

5.2 Datasets

To comprehensively evaluate the performance of all models under various data conditions, we use both synthetic and real-world datasets. Statistics of these datasets are summarized in Table 1. The data generation process for the fully synthetic dataset, which we call the Simulation dataset, is described in Appendix B.

Tumor growth (Geng. et al., 2017). Following previous works (Melnychuk et al., 2022; Wang et al., 2024), we apply a tumor growth simulator for data generation. We select the patient type and two static features as covariates. Treatments include chemotherapy and radiation therapy. The outcome of interest is the tumor volume. Considering the range of the data, we normalized $\boldsymbol{X}$ and $\boldsymbol{Y}$.

Cryptocurrency (available at https://github.com/binance/binance-public-data). We also collected real-life time series data of Bitcoin and Ethereum from July 1, 2023 to October 29, 2024. Covariates include the opening price, lowest price, highest price, closing price, and trading volume for the day. Treatment refers to the positions held by individuals, while the outcome of interest is the return rate (%). We manually divided the complete sequence into time segments of length 5, setting the time gap between segments to 5 as well to prevent data leakage.
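A small sketch of this segmentation, under our reading that a gap of 5 steps is skipped between consecutive length-5 windows, is given below; the input array is a stand-in for one daily feature column.

```python
import numpy as np

def segment_series(series, length=5, gap=5):
    """Cut a 1-D series into non-overlapping windows of `length`, skipping `gap` steps between them."""
    segments, start = [], 0
    while start + length <= len(series):
        segments.append(series[start:start + length])
        start += length + gap
    return np.stack(segments)

daily_returns = np.arange(100)               # stand-in for one daily feature column
print(segment_series(daily_returns).shape)   # (10, 5)
```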

MIMIC-III (Johnson et al., 2016). MIMIC-III is a comprehensive, publicly available database containing de-identified health data from patients admitted to critical care units at a large tertiary care hospital. Following the data processing of the Causal Transformer (Melnychuk et al., 2022), we obtain 69 covariates on the vital signs of patients. Vasopressors and mechanical ventilation are the two treatments taken into account. We choose blood pressure as the outcome.

5.3 Experimental Results

5.3.1 One-step-ahead prediction

Our target estimand is the counterfactual outcome at the future time step, denoted as $\hat{\boldsymbol{Y}}_{t}(\boldsymbol{a}_{t}\mid\bar{\boldsymbol{H}}_{t-1})$. To evaluate the performance of the estimators, we use the MSE metric defined in Equation (17). To ensure the robustness and reliability of our experimental results, we independently repeated each experiment 5 times, each with a different random seed, and report the mean and standard deviation (std) of the MSE in the format mean ± std. The experimental results on the four datasets are summarized in Table 2, which comprehensively indicates the performance of the models in diverse settings. Our DSIV-CFR outperforms the state-of-the-art baselines on all datasets by a large margin.

The average running time of this experiment is detailed in Table 4 (Appendix D). Although introducing the mutual information loss $\mathcal{L}_{MI}$ and the adversarial loss $\mathcal{L}_{adv}$ increases training time, the increase is justified by the significant improvement in model performance, and the overall training time remains within an acceptable range.

Since many baselines rely on the unconfoundedness assumption, i.e., they do not account for unmeasured confounders, we conducted an additional trial for fairness: we set all confounding variables to be observable and retested these methods. Even in this case, where the baselines have access to more information than our approach, DSIV-CFR still achieves better performance.

5.3.2 Hyper-parameter analysis


Figure 4: Results of the hyper-parameter analysis on the Cryptocurrency dataset. Setting $\alpha$ to 0 is equivalent to removing $\mathcal{L}_{MI}$, and $\beta=0$ means $\mathcal{L}_{adv}$ is removed. The heatmap on the right represents the changes in performance: cells closer to a warm color (orange) indicate that the model trained with the corresponding $\{\alpha,\beta\}$ combination performs better, while cells closer to a cool color (green) indicate poorer performance.

We explore combinations of $\alpha$ and $\beta$, which control the importance of the mutual information constraints ($\mathcal{L}_{MI}$) for IV decomposition and of the adversarial function learning ($\mathcal{L}_{adv}$) in the GMM framework, respectively. When $\alpha$ or $\beta$ is set to 0, the corresponding module is ablated. This evaluation is conducted on the Cryptocurrency dataset. According to Figure 4, the effectiveness of each module is validated: when the two losses are moderately incorporated, model performance is better than when focusing solely on the MSE. The best performance is achieved with $\alpha=0.1$ and $\beta=0.1$. However, blindly increasing their weights may conflict with the primary objective of minimizing the prediction error.

5.3.3 Sequential treatment decision making

CT (Melnychuk et al., 2022) and ACTIN (Wang et al., 2024) can be extended to predict future outcomes $\tau$ steps ahead, i.e., $\boldsymbol{Y}_{t+\tau}(\boldsymbol{a}_{t:t+\tau}\mid\bar{\boldsymbol{H}}_{t-1})$. The main idea is to additionally predict the next values of $\boldsymbol{X}$, so that in subsequent steps the estimated values $\hat{\boldsymbol{X}}_{t:t+\tau-1}$ and $\hat{\boldsymbol{Y}}_{t:t+\tau-1}$ are used in place of the required observations to continue the prediction cycle. Although our method could also be easily extended in this manner, i.e., by iteratively repeating the single-step estimation to obtain the result at the final time step, we attempted to directly predict the future outcomes of sequential treatments. We naturally apply this model to a downstream task, i.e., decision-making for the optimal sequential treatments in the next time period. The implementation of this methodological extension is clarified in Appendix C. We conducted an experiment on multi-step-ahead decision-making and set $\tau$ to 5. To obtain the oracle outcomes of all possible treatment sequences, we use the simulated dataset and select the two baselines (CT and ACTIN) tailored for this task. The gaps of the chosen treatments $\boldsymbol{a}^{*}_{t:t+\tau}$, i.e., $\boldsymbol{Y}_{t+\tau}(\boldsymbol{a}^{*}_{t:t+\tau})$ compared to the oracle, are illustrated in Figure 5. Our method performs well and achieves results that are close to the oracle.


Figure 5: Results of decision making 5 steps ahead. Values on the vertical axis represent the differences between the oracle and the corresponding results of the predicted optimal treatments (lower = better).

6 Conclusion

The presence of unmeasured confounders is a critical challenge in estimating the cumulative effects of time-varying treatments from time series data. We propose DSIV-CFR, a novel method that leverages negative controls and mutual information constraints to decompose IVs and confounders from the observations. DSIV-CFR accurately estimates future potential outcomes under unmeasured confounding through a modular design that combines transformers for sequential dependency modeling with GMM-based counterfactual estimation using the learned IVs. The model can also be naturally extended to sequential treatment decision making, which holds significant potential for real-world scenarios.

Impact Statement

This paper introduces DSIV-CFR, a novel method that estimates future potential outcomes on time series data with unobserved confounders. This advancement in machine learning could improve decision making in various fields such as medicine, finance, and meteorology. For example, in personalized medicine, DSIV-CFR can predict the counterfactual outcome of a treatment that may be taken in the next period with the help of historical records, allowing doctors to provide better treatment strategies. The limitation of our method lies in the constraints of Assumption 3.4: if $\mathbb{E}[\boldsymbol{U}\mid\boldsymbol{X}]\neq 0$, the estimate of the potential outcome will be biased, although the method still provides a consistent estimate of the treatment effect.

References

  • Ang et al. (2006) Ang, A., Hodrick, R. J., Xing, Y., and Zhang, X. The cross-section of volatility and expected returns. The journal of finance, 61(1):259–299, 2006.
  • Ashenfelter (1978) Ashenfelter, O. Estimating the effect of training programs on earnings. The Review of Economics and Statistics, 60(1):47–57, 1978.
  • Athey & Imbens (2006) Athey, S. and Imbens, G. W. Identification and inference in nonlinear difference-in-differences models. Econometrica, 74(2):431–497, 2006.
  • Ba et al. (2016) Ba, L. J., Kiros, J. R., and Hinton, G. E. Layer normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450.
  • Battocchi et al. (2021) Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oprescu, M., and Syrgkanis, V. Estimating the long-term effects of novel treatments. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp.  2925–2935, 2021.
  • Bica et al. (2020) Bica, I., Alaa, A. M., and van der Schaar, M. Time series deconfounder: Estimating treatment effects over time in the presence of hidden confounders. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  884–895. PMLR, 2020.
  • Cheng et al. (2020) Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. CLUB: A contrastive log-ratio upper bound of mutual information. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  1779–1788. PMLR, 2020.
  • Elman (1990) Elman, J. L. Finding structure in time. Cogn. Sci., 14(2):179–211, 1990.
  • Feuerriegel et al. (2024) Feuerriegel, S., Frauen, D., Melnychuk, V., Schweisthal, J., Hess, K., Curth, A., Bauer, S., Kilbertus, N., Kohane, I. S., and van der Schaar, M. Causal machine learning for predicting treatment outcomes. Nature Medicine, 30(4):958–968, 2024.
  • Gârleanu & Pedersen (2013) Gârleanu, N. and Pedersen, L. H. Dynamic trading with predictable returns and transaction costs. The Journal of Finance, 68(6):2309–2340, 2013.
  • Geng. et al. (2017) Geng., C., Paganetti, H., and Grassberger, C. Prediction of treatment response for combined chemo- and radiation therapy for non-small cell lung cancer patients using a bio-mathematical model. Scientific Reports, 7(1):13542, 2017.
  • Hall (2003) Hall, A. R. Generalized method of moments. A companion to theoretical econometrics, pp.  230–255, 2003.
  • Hartford et al. (2021) Hartford, J. S., Veitch, V., Sridhar, D., and Leyton-Brown, K. Valid causal inference with (some) invalid instruments. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  4096–4106. PMLR, 2021.
  • Hizli et al. (2023) Hizli, C., John, S. T., Juuti, A. T., Saarinen, T. T., Pietiläinen, K. H., and Marttinen, P. Causal modeling of policy interventions from treatment-outcome sequences. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  13050–13084. PMLR, 2023.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
  • Huang & Ning (2012) Huang, X. and Ning, J. Analysis of multi-stage treatments for recurrent diseases. Statistics in medicine, 31(24):2805–2821, 2012.
  • Jeunen et al. (2020) Jeunen, O., Rohde, D., Vasile, F., and Bompaire, M. Joint policy-value learning for recommendation. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pp.  1223–1233. ACM, 2020.
  • Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. Mimic-iii: a freely accessible critical care database. Scientific Data, 3(1):160035, 2016.
  • Kivva et al. (2023) Kivva, Y., Salehkaleybar, S., and Kiyavash, N. A cross-moment approach for causal effect estimation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Kuroki & Pearl (2014) Kuroki, M. and Pearl, J. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
  • Melnychuk et al. (2022) Melnychuk, V., Frauen, D., and Feuerriegel, S. Causal transformer for estimating counterfactual outcomes. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  15293–15329. PMLR, 2022.
  • Miao et al. (2018) Miao, W., Geng, Z., and Tchetgen, E. J. T. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):pp. 987–993, 2018.
  • Miao et al. (2024) Miao, W., Shi, X., Li, Y., and Tchetgen, E. T. A confounding bridge approach for double negative control inference on causal effects, 2024. URL https://arxiv.org/abs/1808.04945.
  • Moraffah et al. (2021) Moraffah, R., Sheth, P., Karami, M., Bhattacharya, A., Wang, Q., Tahir, A., Raglin, A., and Liu, H. Causal inference for time series analysis: problems, methods and evaluation. Knowl. Inf. Syst., 63(12):3041–3085, 2021.
  • Newey & Powell (2003) Newey, W. K. and Powell, J. L. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.
  • Robins & Greenland (1986) Robins, J. M. and Greenland, S. The role of model selection in causal inference from nonexperimental data. American Journal of Epidemiology, 123(3):392–402, 1986.
  • Shalit et al. (2017) Shalit, U., Johansson, F. D., and Sontag, D. A. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp.  3076–3085. PMLR, 2017.
  • Shirakawa et al. (2024) Shirakawa, T., Li, Y., Wu, Y., Qiu, S., Li, Y., Zhao, M., Iso, H., and van der Laan, M. J. Longitudinal targeted minimum loss-based estimation with temporal-difference heterogeneous transformer. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
  • Tran et al. (2024) Tran, A., Bibaut, A., and Kallus, N. Inferring the long-term causal effects of long-term treatments from short-term experiments. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  5998–6008, 2017.
  • Wang et al. (2024) Wang, X., Lyu, S., Yang, L., Zhan, Y., and Chen, H. A dual-module framework for counterfactual estimation over time. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  • Wang & Blei (2019) Wang, Y. and Blei, D. M. The blessings of multiple causes. Journal of the American Statistical Association, 114(528):1574–1596, 2019.
  • Yuan et al. (2022) Yuan, J., Wu, A., Kuang, K., Li, B., Wu, R., Wu, F., and Lin, L. Auto IV: counterfactual prediction via automatic instrumental variable decomposition. ACM Trans. Knowl. Discov. Data, 16(4):74:1–74:20, 2022.

Appendix A Pseudo-Code

As stated in Section 4, we propose a novel DSIV-CFR method to accurately estimate the effect of sequential treatment in the presence of unmeasured confounders. It comprises two key modules. The first module utilizes transformers to model long sequential dependencies and captures the representations of instrumental variables (IVs) and confounders. The second module then employs the Generalized Method of Moments (GMM) to estimate counterfactual outcomes by effectively leveraging the learned representations. The detailed pseudo-code of DSIV-CFR is provided in Algorithm 1.

Algorithm 1 Decomposing Sequential Instrumental Variable framework for CounterFactual Regression
  Input: dataset $\mathcal{D}=\{\bar{\boldsymbol{H}}_{t-1},\boldsymbol{A}_{t}\}_{t=1}^{T+1}$, transformers $\psi$ and $h$, representation heads $\phi_{Z}$ and $\phi_{C}$, bridge function $f$, hyper-parameters $\alpha$ and $\beta$, maximum iteration $K$, alternating training rounds $R$
  Output: future outcomes $\hat{\boldsymbol{Y}}_{t=T+1}$
  Initialize parameters $\theta$ of $\{\psi,\phi_{Z},\phi_{C}\}$, $\xi$ of $h$, $\zeta$ of $f$
  for $i\leftarrow 1$ to $K$ do
     for $t\leftarrow 1$ to $T$ do
        // Module 1: Long Sequential Modeling for Learning IV and Confounder Representations
        Obtain basic embedding $\psi(\bar{\boldsymbol{H}}_{t-1})$
        Obtain instrumental variable representation $\boldsymbol{Z}_{t-1}\leftarrow\phi_{Z}(\psi(\bar{\boldsymbol{H}}_{t-1}))$
        Obtain confounder representation $\boldsymbol{C}_{t-1}\leftarrow\phi_{C}(\psi(\bar{\boldsymbol{H}}_{t-1}))$
        Calculate mutual information loss $\mathcal{L}_{MI}$ by Equation (15)
        // Module 2: GMM for Counterfactual Regression
        Predict the potential outcome at the next time step $\hat{\boldsymbol{Y}}_{t}\leftarrow h(\boldsymbol{A}_{t},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t-1})$
        Calculate MSE loss $\mathcal{L}_{MSE}$ by Equation (17)
        Obtain learnable weights $\boldsymbol{M}_{t-1}\leftarrow f(\bar{\boldsymbol{C}}_{t-1},\boldsymbol{Z}_{t-1})$
        Calculate adversarial loss $\mathcal{L}_{adv}$ by Equation (18)
     end for
     Calculate the overall loss $\mathcal{L}$ by Equation (19)
     Update $\xi\leftarrow\xi^{\prime}$ after $\mathcal{L}$.backward( )
     // Alternating training for variational distribution approximation
     for $j\leftarrow 1$ to $R$ do
        Obtain $\bar{\boldsymbol{Z}}_{t-1}$ and $\bar{\boldsymbol{C}}_{t-1}$ by repeating Module 1
        Calculate log-likelihood loss $\mathcal{L}_{LLD}$ by Equation (16)
        Update $\theta\leftarrow\theta^{\prime}$ after $\mathcal{L}_{LLD}$.backward( )
     end for
     // Alternating training for bridge function
     for $j\leftarrow 1$ to $R$ do
        Obtain $[\hat{Y}_{t},\dots]$ and $[\boldsymbol{M}_{t-1},\dots]$ by repeating Module 2
        Calculate adversarial loss $\mathcal{L}_{adv}$ by Equation (18)
        Update $\zeta\leftarrow\zeta^{\prime}$ after $\mathcal{L}_{adv}$.backward( )
     end for
     Decompose historical observations $\bar{\boldsymbol{H}}_{T}$ into confounders $\bar{\boldsymbol{C}}_{T}$ and IV $\boldsymbol{Z}_{T}$ by networks $\{\psi,\phi_{Z},\phi_{C}\}$
     Predict future outcomes $\hat{\boldsymbol{Y}}_{t=T+1}\leftarrow h(\boldsymbol{A}_{T+1},\bar{\boldsymbol{A}}_{T},\bar{\boldsymbol{C}}_{T})$
  end for
  return $\hat{\boldsymbol{Y}}_{t=T+1}$
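
To make the alternating optimization concrete, a minimal PyTorch-style sketch of the training loop in Algorithm 1 is given below. This is an illustrative sketch rather than our released implementation: simple MLPs stand in for the transformer backbones $\psi$ and $h$, the dimensions are placeholders, the losses $\mathcal{L}_{MI}$, $\mathcal{L}_{LLD}$, and $\mathcal{L}_{adv}$ are passed in as abstract callables because their exact forms are given by Equations (15)–(18), and the combination $\mathcal{L}=\mathcal{L}_{MSE}+\alpha\mathcal{L}_{MI}+\beta\mathcal{L}_{adv}$ is our assumed reading of Equation (19).

```python
# Hypothetical sketch of Algorithm 1 (DSIV-CFR alternating training).
# MLPs stand in for the transformer backbones psi and h; loss_mi, loss_lld,
# and loss_adv are abstract callables implementing Equations (15)-(18).
import torch
import torch.nn as nn


def mlp(d_in, d_out, d_hidden=64):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))


d_hist, d_z, d_c, d_a, d_y = 32, 4, 8, 1, 1        # placeholder dimensions
psi = mlp(d_hist, 16)                              # backbone over history H_{t-1}
phi_z = mlp(16, d_z)                               # IV representation head
phi_c = mlp(16, d_c)                               # confounder representation head
h = mlp(2 * d_a + d_c, d_y)                        # outcome model h(A_t, A_{t-1}, C_{t-1})
f = mlp(d_c + d_z, 1)                              # bridge function producing weights M_{t-1}

opt_theta = torch.optim.Adam(                      # theta: representation networks
    list(psi.parameters()) + list(phi_z.parameters()) + list(phi_c.parameters()), lr=1e-3)
opt_xi = torch.optim.Adam(h.parameters(), lr=1e-3)      # xi: outcome model
opt_zeta = torch.optim.Adam(f.parameters(), lr=1e-3)    # zeta: bridge function


def train(batches, loss_mi, loss_lld, loss_adv, alpha, beta, K=100, R=2):
    """batches: list of (H_prev, A_prev, A_cur, Y) mini-batch tensors."""
    for _ in range(K):
        # Main update of the outcome model xi on the overall loss (assumed Equation (19)).
        for H_prev, A_prev, A_cur, Y in batches:
            Z, C = phi_z(psi(H_prev)), phi_c(psi(H_prev))
            Y_hat = h(torch.cat([A_cur, A_prev, C], dim=-1))
            M = f(torch.cat([C, Z], dim=-1))
            loss = (nn.functional.mse_loss(Y_hat, Y)
                    + alpha * loss_mi(Z, C) + beta * loss_adv(Y_hat, Y, M))
            opt_xi.zero_grad()
            loss.backward()
            opt_xi.step()
        # Alternating update of the representation networks theta (Equation (16)).
        for _ in range(R):
            for H_prev, A_prev, A_cur, Y in batches:
                Z, C = phi_z(psi(H_prev)), phi_c(psi(H_prev))
                lld = loss_lld(Z, C)
                opt_theta.zero_grad()
                lld.backward()
                opt_theta.step()
        # Alternating update of the bridge function zeta (Equation (18)).
        for _ in range(R):
            for H_prev, A_prev, A_cur, Y in batches:
                Z, C = phi_z(psi(H_prev)), phi_c(psi(H_prev))
                Y_hat = h(torch.cat([A_cur, A_prev, C], dim=-1))
                M = f(torch.cat([C, Z], dim=-1))
                adv = loss_adv(Y_hat, Y, M)
                opt_zeta.zero_grad()
                adv.backward()
                opt_zeta.step()
```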

Appendix B Implementation details of experiments

Baselines. Table 3 summarizes the baselines used in this paper. Among them, Time Series Deconfounder (Bica et al., 2020), abbreviated as TSD, considers the scenario with unobserved confounders, while the other methods all rely on the unconfoundedness assumption. We also list the backbones they use to model time series data. Their implementations are all available at the linked repositories. Although there is also a surrogate-based method (Battocchi et al., 2021), we have not included it in the comparison, as we could not find an open-source implementation.

Table 3: Details of baselines.
Method Consider $\boldsymbol{U}$ Backbone Open-source
TSD $\checkmark$ RNN https://github.com/vanderschaarlab/mlforhealthlabpub/tree/main/alg/time_series_deconfounder
CT $\times$ Transformer https://github.com/Valentyn1997/CausalTransformer
ACTIN $\times$ LSTM https://github.com/wangxin/ACTIN
ORL $\times$ RL https://github.com/allentran/long-term-ate-orl
DLTMLE $\times$ Transformer https://github.com/shirakawatoru/dltmle-icml-2024

Data generation for one-step ahead prediction. For each moment, we generate 3-dimensional IVs $\boldsymbol{Z}_{t}$ and 7-dimensional confounders $\boldsymbol{C}_{t}$, each dimension of which follows a uniform distribution $\mathcal{U}(0,1)$. There are also unobserved confounders $\boldsymbol{U}_{t}$ of 3 dimensions, and each dimension is randomly sampled from $\mathcal{N}(0,1)$. Specifically,

$$\left\{\begin{aligned}&\ \boldsymbol{Z}_{t+1}\leftarrow 0.4\cdot\boldsymbol{Z}_{t}+0.6\cdot\boldsymbol{Z}_{t+1}+0.3\cdot\sin t\\ &\ \boldsymbol{C}_{t+1}\leftarrow 0.3\cdot\boldsymbol{C}_{t}+0.7\cdot\boldsymbol{C}_{t+1}+0.2\cdot\sin t\\ &\ \boldsymbol{U}_{t+1}\leftarrow\boldsymbol{U}_{t+1}-0.1\cdot\cos t\end{aligned}\right.\qquad(20)$$

For the training and validation data, the treatment $\boldsymbol{A}_{t+1}$ at each time step is simulated by the following process.

$$\begin{split}&\text{logit}=\sum_{i=1}^{d}\left(coef_{a,i}\cdot\boldsymbol{V}_{t,i}-\cos{\boldsymbol{V}_{t,i}^{2}}\right)-0.5\cdot\boldsymbol{A}_{t}+0.2\cdot\boldsymbol{Y}_{t}-0.1\cdot\sin t,\quad\boldsymbol{V}_{t}=\{\boldsymbol{Z}_{t},\boldsymbol{C}_{t},\boldsymbol{U}_{t}\}\\ &\mathbb{P}(\boldsymbol{A}_{t+1})=\frac{1}{1+\exp(-\text{logit})},\quad\boldsymbol{A}_{t+1}=\left\{\begin{aligned}&\ 0,\ \text{if}\ \ \mathbb{P}(\boldsymbol{A}_{t+1})<0.5\\ &\ 1,\ \text{if}\ \ \mathbb{P}(\boldsymbol{A}_{t+1})\geq 0.5\end{aligned}\right.\end{split}\qquad(21)$$

The coefficients $coef_{a}$ are randomly generated. We initialize $\boldsymbol{A}_{0}=[0,\dots,0]$ and $\boldsymbol{Y}_{0}=[0,\dots,0]$. To evaluate the counterfactual prediction capability of each model, the treatment sequences $\boldsymbol{A}$ in the test set are generated at random. In all sets, the generation of $\boldsymbol{Y}_{t+1}$ can be described as:

$$\boldsymbol{Y}_{t+1}=\sum_{i=1}^{d}\left(coef_{y,i}\cdot\boldsymbol{V}_{t,i}^{\prime}\right)-0.2\cdot\sin\boldsymbol{A}_{t+1}+0.5\cdot\sin\frac{t}{5},\quad\boldsymbol{V}_{t}^{\prime}=\{\bar{\boldsymbol{C}}_{t},\boldsymbol{U}_{t},\boldsymbol{U}_{t-1}\}\qquad(22)$$
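
For concreteness, a hypothetical NumPy sketch of this generating process is shown below. It follows Equations (20)–(22) with the dimensionalities stated above, but makes a few simplifying assumptions that are not part of the original description: $\boldsymbol{V}_{t}^{\prime}$ uses the current $\boldsymbol{C}_{t}$ rather than the full history $\bar{\boldsymbol{C}}_{t}$, the treatment is a single binary value per time step, and $\boldsymbol{U}_{-1}$ is initialized to zero.

```python
# Hypothetical NumPy sketch of the one-step-ahead data generation, Eqs. (20)-(22).
import numpy as np

rng = np.random.default_rng(0)
T, d_z, d_c, d_u = 30, 3, 7, 3                       # horizon and dimensionalities

coef_a = rng.normal(size=d_z + d_c + d_u)            # random coefficients in Eq. (21)
coef_y = rng.normal(size=d_c + 2 * d_u)              # random coefficients in Eq. (22)


def simulate_one_unit(random_treatment=False):
    """random_treatment=True mimics the test set, where A is assigned at random."""
    Z = rng.uniform(0, 1, size=d_z)
    C = rng.uniform(0, 1, size=d_c)
    U = rng.normal(size=d_u)
    U_prev = np.zeros(d_u)                           # assumption: U_{-1} = 0
    A, Y = 0.0, 0.0                                  # A_0 = 0 and Y_0 = 0
    trajectory = []
    for t in range(T):
        # Eq. (21): treatment assignment from {Z_t, C_t, U_t}, A_t, Y_t, and time.
        V = np.concatenate([Z, C, U])
        logit = (np.sum(coef_a * V - np.cos(V ** 2))
                 - 0.5 * A + 0.2 * Y - 0.1 * np.sin(t))
        p = 1.0 / (1.0 + np.exp(-logit))
        A_next = float(rng.integers(0, 2)) if random_treatment else float(p >= 0.5)
        # Eq. (22): outcome from confounders and current/lagged unmeasured confounders.
        V_out = np.concatenate([C, U, U_prev])
        Y_next = np.sum(coef_y * V_out) - 0.2 * np.sin(A_next) + 0.5 * np.sin(t / 5)
        trajectory.append((Z.copy(), C.copy(), A_next, Y_next))
        # Eq. (20): evolve covariates, mixing the previous value with a fresh draw.
        Z = 0.4 * Z + 0.6 * rng.uniform(0, 1, size=d_z) + 0.3 * np.sin(t)
        C = 0.3 * C + 0.7 * rng.uniform(0, 1, size=d_c) + 0.2 * np.sin(t)
        U_prev, U = U, rng.normal(size=d_u) - 0.1 * np.cos(t)
        A, Y = A_next, Y_next
    return trajectory
```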

Appendix C Sequential treatment decision making

Methodology extension. Our method extends naturally to selecting the optimal treatment plan $\tau$ steps ahead. Specifically, the candidate treatment $\boldsymbol{a}_{t:t+\tau-1}$ is a sequence of length $\tau$, each element of which can independently take one of $d$ possible treatments. In our setting, we use a time window of $\tau=5$ with binary treatments ($d=2$). We first traverse all $2^{5}=32$ possible alternatives, infer the corresponding outcomes, and select the optimal one. At each subsequent moment, we still use the transformer $\psi(\cdot)$ in Section 4.1 to learn a basic representation of the observations $\{\bar{\boldsymbol{X}}_{t+3},\bar{\boldsymbol{A}}_{t+3},\bar{\boldsymbol{Y}}_{t-1}\}$, followed by $\phi_{Z}$ and $\phi_{C}$ to obtain the IVs $\{\boldsymbol{Z}_{t-1},\boldsymbol{Z}_{t},\boldsymbol{Z}_{t+1},\boldsymbol{Z}_{t+2},\boldsymbol{Z}_{t+3}\}$ and confounders $\{\boldsymbol{C}_{t-1},\boldsymbol{C}_{t},\boldsymbol{C}_{t+1},\boldsymbol{C}_{t+2},\boldsymbol{C}_{t+3}\}$. It is important to note that, since IVs are exogenous variables, the covariates used for decomposition, i.e., $\{\boldsymbol{X}_{t-1},\boldsymbol{X}_{t},\boldsymbol{X}_{t+1},\boldsymbol{X}_{t+2},\boldsymbol{X}_{t+3}\}$, must be observable in our setting; if we instead predicted $\boldsymbol{X}_{t-1:t+3}$ ourselves, as in CT (Melnychuk et al., 2022), the predicted covariates would no longer contain the IVs. We then modify the counterfactual regression module in Section 4.2 so that the second transformer $h(\cdot)$ directly predicts the 5-step-ahead outcome via $\boldsymbol{Y}_{t+4}=h(\boldsymbol{A}_{t:t+4},\bar{\boldsymbol{A}}_{t-1},\bar{\boldsymbol{C}}_{t+3})$. For the adversarial loss, we rebuild the bridge function $f(\cdot)$ to learn the weights $\boldsymbol{M}_{t+3}=f(\bar{\boldsymbol{A}}_{t+3},\bar{\boldsymbol{C}}_{t+3},\boldsymbol{Z}_{t-1:t+3})$. The other parts of DSIV-CFR remain unchanged, as described in the previous sections.
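
The exhaustive search itself takes only a few lines. The sketch below enumerates the $2^{5}=32$ binary plans and scores each with a trained outcome model; the function signature and tensor shapes are illustrative assumptions rather than our exact interface.

```python
# Hypothetical sketch of selecting the optimal 5-step treatment plan by
# enumerating all 2^5 = 32 binary sequences and scoring each with the
# trained outcome model h(A_{t:t+4}, A_bar_{t-1}, C_bar_{t+3}).
import itertools
import torch


def select_plan(h, A_hist, C_hist, tau=5):
    """A_hist: past treatments A_bar_{t-1}; C_hist: decomposed confounders C_bar_{t+3}."""
    best_plan, best_outcome = None, -float("inf")
    for plan in itertools.product([0.0, 1.0], repeat=tau):       # the 32 alternatives
        a_seq = torch.tensor(plan)
        y_hat = h(torch.cat([a_seq, A_hist, C_hist], dim=-1))    # predicted Y_{t+4}
        if y_hat.item() > best_outcome:
            best_plan, best_outcome = plan, y_hat.item()
    return best_plan, best_outcome
```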

Data generation for five-step ahead decision making. The dimensions of $\boldsymbol{Z}$, $\boldsymbol{C}$, and $\boldsymbol{U}$ are 3, 12, and 5, respectively. Each dimension of $\boldsymbol{Z}$ and $\boldsymbol{C}$ follows a uniform distribution $\mathcal{U}(0,3)$. For the training and validation sets, we generate 2,000 and 200 samples, respectively, roughly following the data generation process described in Appendix B, except that $\boldsymbol{Y}$ is affected by the interventions at the prior 5 time steps, which can be expressed by the following formulation:

$$\boldsymbol{Y}_{t+1}=0.2\sum_{i=1}^{d}\left(coef_{y,i}\cdot\boldsymbol{V}_{t,i}^{\prime}\right)-0.5\sum_{j=0}^{4}coef_{seq,j}\boldsymbol{A}_{t+1-j}+\sin t,\quad\boldsymbol{V}_{t}^{\prime}=\{\bar{\boldsymbol{C}}_{t},\boldsymbol{U}_{t},\boldsymbol{U}_{t-1}\},\qquad(23)$$

where the coefficients $coef_{seq}$ are randomly generated. These two sets include the information of $\{\bar{\boldsymbol{H}}_{29},\boldsymbol{Y}_{30}\}$. To generate the test set for evaluating decision making, we first generate the historical information of 25 time steps following the aforementioned steps. Afterwards, for each treatment plan $b$ among the 32 alternatives $\mathcal{B}$, we directly apply the assigned interventions $\{\boldsymbol{A}_{26}^{b},\dots,\boldsymbol{A}_{30}^{b}\}$, instead of inferring them with Equation (21), and iteratively calculate $\{\boldsymbol{Y}_{26}^{b},\dots,\boldsymbol{Y}_{30}^{b}\}$ by Equation (23). The oracle is recorded as $\max_{b\in\mathcal{B}}\boldsymbol{Y}_{30}^{b}$, i.e., the maximum outcome at timestamp $T=30$ after the optimal treatment plan $b^{*}$ is applied. This set includes 100 records covering the information of $\{\bar{\boldsymbol{X}}_{29},\bar{\boldsymbol{A}}_{25},\bar{\boldsymbol{Y}}_{25},\boldsymbol{Y}^{*}_{30}\}$. The evaluation results of each model's early decision-making ability compared with the oracle are visualized in Figure 6.
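
For completeness, the oracle and the gap visualized in Figure 6 can be computed as in the sketch below, where simulate_outcomes(history, plan) is an assumed helper that rolls the generating process above forward from $t=25$ under the fixed interventions of a plan and returns $\{\boldsymbol{Y}_{26},\dots,\boldsymbol{Y}_{30}\}$.

```python
# Hypothetical sketch of the oracle and evaluation gap for 5-step decision making.
# simulate_outcomes(history, plan) is an assumed helper returning [Y_26, ..., Y_30]
# under the fixed interventions {A_26, ..., A_30} given by the plan.
import itertools


def oracle_outcome(history, simulate_outcomes, tau=5):
    """Best achievable Y_30 over the 32 alternative plans (the oracle)."""
    return max(simulate_outcomes(history, plan)[-1]
               for plan in itertools.product([0, 1], repeat=tau))


def gap_to_oracle(history, chosen_plan, simulate_outcomes):
    """Difference between the oracle and the model's chosen plan (lower = better)."""
    return oracle_outcome(history, simulate_outcomes) - simulate_outcomes(history, chosen_plan)[-1]
```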

Real-world applications. Our model is capable of forecasting future outcomes over multiple steps ahead and identifying the optimal treatments through exhaustive computation. This capability endows it with broad applicability and significant potential for generalization. For example, in the field of autonomous driving, our model can predict the future states of the road environment, such as the trajectories of vehicles and pedestrians, as well as changes in traffic signals. This allows it to plan the optimal driving route in advance. In the medical field, the model can predict the progression of a patient’s condition and calculate the best treatment plan ahead of time. In the economic and financial sectors, the model can be used to forecast market trends and changes in economic indicators, thereby assisting investors in devising optimal investment strategies.

Statistics of policy making results compared to oracle
Method Min Max Avg Std
CT 0.054 3.491 1.722 0.657
ACTIN 0.054 3.719 1.759 0.692
DSIV-CFR \textbf{0.012} \textbf{1.791} \textbf{0.529} \textbf{0.407}
* Lower = better (best in bold)
Figure 6: Results of decision making 5 steps ahead. It is a detailed version of Figure 5. The visualization is shown on the left, where values on the vertical axis represent the differences between the oracle and the outcomes of the predicted optimal treatments (lower = better). Detailed statistics are reported on the right.

Appendix D Time complexity

When conducting the one-step-ahead outcome prediction experiment, we also recorded the training time of each method included in the evaluation. Results are shown in Table 4. Larger sample sizes, longer time series, and higher feature dimensionality all increase training time. In addition, computing $\mathcal{L}_{MI}$ and $\mathcal{L}_{adv}$ takes considerably longer than $\mathcal{L}_{MSE}$, but these losses play an important role in improving model performance.

Table 4: Running time (minutes) of one-step-ahead prediction.
Method Simulation Tumor Cryptocurrency MIMIC-III
Time Series Deconfounder 16.55 3.200 0.200 2.450
Causal Transformer 15.27 2.100 0.417 1.970
ACTIN 17.05 1.633 0.550 0.267
ORL 13.78 3.100 0.333 2.800
Deep LTMLE 9.267 3.350 0.183 2.217
DSIV-CFR 27.87 25.75 1.317 10.30

Appendix E Experimental results without unmeasured confounders

Many relevant methods do not account for the impact of unobserved confounders. However, in the evaluation data used in Table 2, the outcome $\boldsymbol{Y}$ is influenced by $\boldsymbol{U}$. Therefore, we make $\boldsymbol{U}$ observable, so that the unconfoundedness assumption is satisfied, and re-evaluate the performance of the baselines. Results are reported in Table 5. If $\boldsymbol{U}$ contains a significant amount of important information, observing it can improve outcome prediction; however, if it mainly contains noise, it may instead degrade performance.

Table 5: Results of baselines for one-step-ahead prediction without unmeasured confounders.
Method Input Simulation Tumor
Time Series Deconfounder $\boldsymbol{X}$ 0.930 ± 0.101 0.753 ± 0.042
Causal Transformer $\boldsymbol{X}$ 1.256 ± 0.081 0.716 ± 0.003
$\{\boldsymbol{X},\boldsymbol{U}\}$ 0.804 ± 0.007 0.703 ± 0.013
ACTIN $\boldsymbol{X}$ 1.481 ± 0.129 1.041 ± 0.002
$\{\boldsymbol{X},\boldsymbol{U}\}$ 1.355 ± 0.152 1.049 ± 0.003
ORL $\boldsymbol{X}$ 0.798 ± 0.115 0.347 ± 0.010
$\{\boldsymbol{X},\boldsymbol{U}\}$ 0.788 ± 0.109 0.351 ± 0.007
Deep LTMLE $\boldsymbol{X}$ 0.531 ± 0.077 0.202 ± 0.074
$\{\boldsymbol{X},\boldsymbol{U}\}$ 0.467 ± 0.041 0.179 ± 0.073
DSIV-CFR $\boldsymbol{X}$ \textbf{0.105 ± 0.017} \textbf{0.047 ± 0.003}
* Lower = better (best in bold)