
Imputation-based randomization tests for randomized experiments with interference

Tingxuan Han (Department of Statistics and Data Science, Tsinghua University); Ke Zhu* (Department of Statistics, North Carolina State University; Department of Biostatistics and Bioinformatics, Duke University); Hanzhong Liu (Department of Statistics and Data Science, Tsinghua University); Ke Deng** (Department of Statistics and Data Science, Tsinghua University)

*These authors have contributed equally to this work.
**Co-corresponding authors: lhz2016@tsinghua.edu.cn; kdeng@tsinghua.edu.cn
Abstract

The presence of interference renders classic Fisher randomization tests infeasible due to nuisance unknowns. To address this issue, we propose imputing the nuisance unknowns and computing Fisher randomization p-values multiple times, then averaging them. We term this approach the imputation-based randomization test and provide theoretical results on its asymptotic validity. Our method leverages the merits of randomization and the flexibility of the Bayesian framework: for multiple imputations, we can either employ the empirical distribution of observed outcomes to achieve robustness against model mis-specification or utilize a parametric model to incorporate prior information. Simulation results demonstrate that our method effectively controls the type I error rate and significantly enhances the testing power compared to existing randomization tests for randomized experiments with interference. We apply our method to a two-round randomized experiment with multiple treatments and one-way interference, where existing randomization tests exhibit limited power.

1 Introduction

In causal inference, the "no interference" assumption posits that the potential outcomes of each unit are unaffected by the treatment assignments of other units (Cox, 1958; Rubin, 1980). In practice, however, interference arises in many scenarios. For instance, whether other people have been vaccinated against COVID-19 can affect one's own probability of infection. Causal inference in the presence of interference has received much attention in recent years (Rosenbaum, 2007; Hudgens and Halloran, 2008; Tchetgen and VanderWeele, 2012; Liu and Hudgens, 2014; Loh et al., 2020; Imai et al., 2021; Yu et al., 2022; Leung, 2022a; Forastiere et al., 2022; Vazquez-Bare, 2023; Shirani and Bayati, 2024).

As the gold standard for drawing causal conclusions, randomized experiments provide a solid basis for causal inference and make Fisher randomization tests (FRTs) exactly valid in finite samples under the assumption of no interference (Fisher, 1935). In the presence of interference, however, classic FRTs are infeasible, since the null hypothesis of interest can no longer determine all potential outcomes with certainty. To address this challenge, conditional randomization tests (CRTs) have been proposed to conduct causal inference for a selected subset of units (referred to as the focal units) with respect to a carefully selected subset of treatment assignments (referred to as the focal assignments), so that the null hypothesis is sharp for the conditional causal inference and the classic FRT remains applicable (Aronow, 2012; Athey et al., 2018; Basse et al., 2019; Puelz et al., 2022; Yanagi and Sei, 2022; Owusu, 2023; Tiwari and Basu, 2024; Basse et al., 2024; Zhang and Zhao, 2024). Although CRTs achieve valid causal inference, they often suffer from severe power loss due to the aggressive selection of focal units and assignments in conditioning (Puelz et al., 2022). Zhong (2024) proposed partial null randomization tests (PNRTs), which avoid conditioning on focal assignments by conducting pairwise comparisons of the test statistic between the observed assignment and all possible assignments. This method allows distinct focal unit sets for different assignments, enhancing statistical power. While PNRTs relax the restrictions of CRTs, they remain underpowered in challenging cases where the focal unit proportion is low, as demonstrated in our simulations and real data application.

Alternatively, multiple imputation (MI), pioneered by Rubin (1977, 1978, 1996, 2004a, 2004b), is another popular approach to handling nuisance unknowns in causal inference, following the Bayesian idea. Examples include imputing compliance status to enable more powerful FRTs with noncompliance (Rubin, 1998; Lee et al., 2017; Forastiere et al., 2018), imputing potential outcomes in FRTs to facilitate partially post hoc subgroup analyses (Lee and Rubin, 2015), handling unspecified factorial effects (Espinosa et al., 2016), addressing missing outcomes (Ivanova et al., 2022), and managing disruptions (e.g., COVID-19) in clinical trials (Uschner et al., 2024). Notably, an FRT with MI often leads to a posterior predictive p-value (Rubin, 1984), enabling us to address more complex scenarios in causal inference from the Bayesian perspective (Ding and Guo, 2023; Leavitt, 2023; Li et al., 2023; Ray and van der Vaart, 2020; Breunig et al., 2022).

In this paper, we follow this imputation-based idea for implementing FRTs and propose an imputation-based randomization test (IRT) for randomized experiments with interference. By imputing, from an appropriate imputation distribution, the unobserved potential outcomes that cannot be fully specified under the null hypothesis of interest, we obtain a series of FRT p-values, each of which comes from an FRT under a specific imputation of the unobserved potential outcomes. A working p-value, referred to as the IRT p-value, is obtained by averaging these FRT p-values and is used to assess the statistical significance of the causal effect of interest. Retaining all units and all potential treatment assignments, the proposed IRT is more powerful than CRTs and PNRTs and can be easily applied to various challenging scenarios with interference. Theoretical analyses show that the proposed IRT asymptotically controls the type I error rate at most twice the significance level when the imputation distribution is properly specified. Furthermore, numerical results show that the IRT empirically maintains the type I error rate at the nominal significance level and substantially improves power over existing methods.

2 Preliminaries

2.1 The Potential Outcome Model in the Presence of Interference

Consider a finite population $\mathbb{U}=\{1,\ldots,N\}$ with $N$ units, each of which can potentially be assigned one of $K$ different treatments in the treatment set $\mathcal{T}=\{t_1,\ldots,t_K\}$. For simplicity, we mainly consider the case with $K=2$ in this study, though our approach extends easily to more complicated scenarios with $K>2$ (as demonstrated in the real data application in Section 5). Let $\mathcal{T}=\{0,1\}$ be the binary treatment set, with 0 as the control treatment and 1 as the active treatment, and let $Z_i\in\mathcal{T}$ be the treatment assignment indicator for the $i$-th unit in $\mathbb{U}$. The collective treatment assignment for all $N$ units is represented by the binary vector $\boldsymbol{Z}=(Z_1,\ldots,Z_N)\in\mathcal{Z}\subseteq\{0,1\}^N$. Let $\boldsymbol{Z}^{\rm obs}=(Z_1^{\rm obs},\ldots,Z_N^{\rm obs})\in\mathcal{Z}$ be the treatment assignment actually implemented in the randomized experiment. Throughout this study, we assume that $\boldsymbol{Z}^{\rm obs}$ is a random sample from a known randomized design $\boldsymbol{F}_{\mathcal{Z}}$ defined over $\mathcal{Z}$.

In the presence of interference, each unit $i$ generally has a collection of potential outcomes $\{Y_i(\boldsymbol{z})\}_{\boldsymbol{z}\in\mathcal{Z}}$, because there could be up to $2^N$ distinct exposures for each unit in the most general case where interference occurs between any two units. In practice, however, interference typically occurs only among certain units, so the number of distinct exposures per unit is much smaller than $2^N$. Researchers therefore often assume that only a small exposure set $\mathcal{E}$ is available, with the actual exposure received by unit $i$ under treatment assignment vector $\boldsymbol{Z}$ determined by an exposure mapping $m_i:\mathcal{Z}\rightarrow\mathcal{E}$. This leads to the assumption of correct exposure mapping as outlined below (Manski, 2013; Aronow and Samii, 2017; Hoshino and Yanagi, 2023). Under such an assumption, the potential outcomes of unit $i$ reduce to a much smaller set $\{Y_i(a)\}_{a\in\mathcal{E}}$.

Assumption 1 (Correct exposure mapping).

For all $i\in\mathbb{U}$ and $\boldsymbol{z},\boldsymbol{z}^{\prime}\in\mathcal{Z}$, if $m_i(\boldsymbol{z})=m_i(\boldsymbol{z}^{\prime})$, then $Y_i(\boldsymbol{z})=Y_i(\boldsymbol{z}^{\prime})$.

Different exposure mappings define different interference mechanisms. A general framework for establishing an interference mechanism is to introduce an interference network $\mathcal{N}$ over the units of interest, where units $i$ and $j$ are connected (denoted $i\sim j$) if and only if interference may occur between them, and to define

m_i(\boldsymbol{z}) = \begin{cases} 2, & \text{if } z_i = 1; \\ 1, & \text{if } z_i = 0 \text{ and } z_j = 1 \text{ for some unit } j \sim i; \\ 0, & \text{if } z_i = 0 \text{ and } z_j = 0 \text{ for every unit } j \sim i. \end{cases} \qquad (1)

Under such an interference mechanism, each unit $i$ has three possible treatment exposures, i.e., $\mathcal{E}=\{0,1,2\}$: exposure 2 means that unit $i$ itself is assigned to treatment; exposure 1 means that unit $i$ is assigned to control while at least one of its direct neighbors is treated; and exposure 0 means that unit $i$ and all its direct neighbors are under control. Many popular interference mechanisms can be viewed as special cases of this framework. Below are three concrete examples; a short code sketch of the mapping in Eq. (1) follows them.

Example 1 (Clustered Interference).

Assuming that the $N$ units in the finite population of interest are grouped into $K$ disjoint clusters (such as villages) and that interference occurs solely within clusters, Basse et al. (2019) considered an interference mechanism known as clustered interference, where $i\sim j$ if and only if units $i$ and $j$ belong to the same cluster.

Example 2 (Spatial Interference).

Assuming that the $N$ units of interest are located in space with $d(i,j)$ denoting the spatial distance between units $i$ and $j$, and that interference occurs only between two units whose spatial distance is below a threshold $r$, Puelz et al. (2022) considered an interference mechanism known as spatial interference, where $i\sim j$ if and only if $d(i,j)\leq r$.

Example 3 (No Interference).

In the case where there exists no $j\neq i$ such that $i\sim j$, $m_i(\boldsymbol{z})$ in Eq. (1) degenerates to $m_i(\boldsymbol{z})=2z_i$ for all $i\in\mathbb{U}$ with $\mathcal{E}=\{0,2\}$, which can be further simplified to $m_i(\boldsymbol{z})=z_i$ and $\mathcal{E}=\{0,1\}$ after rescaling. In this special case, the exposure unit $i$ receives is completely determined by $Z_i$, implying no interference between units. This setting is a critical component of the stable unit treatment value assumption (SUTVA) proposed by Rubin (1980).
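To make the exposure mapping in Eq. (1) concrete, the following minimal Python sketch computes exposures from an adjacency matrix of the interference network; the function name and the toy network are our own illustrative choices, not part of the paper.

```python
import numpy as np

def exposure_mapping(z, adj):
    """Exposures under Eq. (1): 2 = treated; 1 = control with at least one
    treated neighbor; 0 = control with all neighbors under control.

    z   : (N,) binary treatment assignment vector
    adj : (N, N) symmetric 0/1 adjacency matrix of the interference network
    """
    z = np.asarray(z)
    has_treated_neighbor = adj @ z > 0
    return np.where(z == 1, 2, np.where(has_treated_neighbor, 1, 0))

# Toy network: units 0 and 1 are connected, unit 2 is isolated.
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]])
print(exposure_mapping([1, 0, 0], adj))  # -> [2 1 0]
```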

2.2 Fisher Randomization Test

In the case of no interference, a classic problem in randomized experiments is to test the sharp null hypothesis of no treatment effect for all units. Following the notation in Example 3,

H_0^{\mathcal{F}}: Y_i(1) = Y_i(0), \quad 1 \leq i \leq N, \qquad (2)

where $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes of unit $i$ under treatment and control, respectively. Under $H_0^{\mathcal{F}}$, define $Y_i(1)=Y_i(0)=\theta_i$ for all $i\in\mathbb{U}$ and $\boldsymbol{\theta}=(\theta_1,\cdots,\theta_N)$. With the observed outcomes $\boldsymbol{Y}^{\rm obs}=\{Y_i^{\rm obs}\}_{i\in\mathbb{U}}$, where $Y_i^{\rm obs}=Z_i^{\rm obs}Y_i(1)+(1-Z_i^{\rm obs})Y_i(0)=\theta_i$, the vector $\boldsymbol{\theta}$ is fully observed. Consider a finite-population framework where the potential outcomes are fixed quantities and the only source of randomness is the treatment assignment. The classic difference-in-means test statistic under a treatment assignment $\boldsymbol{z}$ is

T(\boldsymbol{z},\boldsymbol{\theta}) = \left|\frac{\sum_{i=1}^{N} z_i\theta_i}{\sum_{i=1}^{N} z_i} - \frac{\sum_{i=1}^{N}(1-z_i)\theta_i}{\sum_{i=1}^{N}(1-z_i)}\right|. \qquad (3)

Because $\boldsymbol{\theta}$ is a fixed vector after the randomized experiment is conducted, the null distribution of $T(\boldsymbol{Z},\boldsymbol{\theta})$ is determined by the randomization distribution $\boldsymbol{F}_{\mathcal{Z}}$. Based on these facts, the Fisher randomization test (FRT) can be conducted by calculating the p-value below:

p(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}) = \mathbb{P}_{\boldsymbol{Z}}\left(T(\boldsymbol{Z},\boldsymbol{\theta}) \geq T(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})\right), \qquad (4)

where $\boldsymbol{Z}^{\rm obs}$ stands for the realized assignment and $\mathbb{P}_{\boldsymbol{Z}}$ is the probability defined by $\boldsymbol{Z}\sim\boldsymbol{F}_{\mathcal{Z}}$. According to Fisher (1935), the FRT is finite-sample exact in the sense that under $H_0^{\mathcal{F}}$,

\mathbb{P}\left(p(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}) \leq \alpha\right) \leq \alpha \ \text{ for all } 0 \leq \alpha \leq 1, \text{ when } \boldsymbol{Z}^{\rm obs} \sim \boldsymbol{F}_{\mathcal{Z}}.

The hypothesis $H_0^{\mathcal{F}}$ is commonly referred to as the sharp null hypothesis, since the reference distribution of the test statistic $T(\boldsymbol{Z},\boldsymbol{\theta})$ is fully determined under $H_0^{\mathcal{F}}$.
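As a small illustration of Eqs. (3)-(4), the Python sketch below approximates the FRT p-value by Monte Carlo under a simple complete randomization design; the design and all names are illustrative assumptions rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def diff_in_means(z, theta):
    # Test statistic T(z, theta) in Eq. (3)
    return abs(theta[z == 1].mean() - theta[z == 0].mean())

def frt_pvalue(z_obs, theta, draw_assignment, n_draws=10_000):
    """Approximate p(Z_obs; theta) in Eq. (4) by resampling assignments
    from the known design F_Z, here passed in as `draw_assignment`."""
    t_obs = diff_in_means(z_obs, theta)
    t_ref = np.array([diff_in_means(draw_assignment(), theta)
                      for _ in range(n_draws)])
    return (t_ref >= t_obs).mean()

# Example design: complete randomization with N/2 treated units.
N = 100
theta = rng.normal(size=N)        # under H0, Y_i(1) = Y_i(0) = theta_i
def draw_assignment():
    return rng.permutation(np.repeat([1, 0], N // 2))

print(frt_pvalue(draw_assignment(), theta, draw_assignment))
```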

2.3 Conditional and Partial Null Randomization Tests

In the presence of interference, however, the situation becomes more complicated. Given the treatment exposure mapping in Eq. (1), we aim to test the null hypothesis on the contrast between two exposures $a$ and $b$ in $\mathcal{E}$, i.e.,

H_0^{\{a,b\}}: Y_i(a) = Y_i(b), \quad 1 \leq i \leq N. \qquad (5)

For example, we are often interested in testing the null hypothesis of no spillover effect, i.e., $H_0^{\{0,1\}}$. Under $H_0^{\{a,b\}}$, define $Y_i(a)=Y_i(b)=\theta_i$ for all $i\in\mathbb{U}$ and $\boldsymbol{\theta}=(\theta_1,\cdots,\theta_N)$; then the classic difference-in-means test statistic under treatment assignment $\boldsymbol{z}$ becomes

T_{a,b}(\boldsymbol{z},\boldsymbol{\theta}) = \left|\frac{1}{N_b(\boldsymbol{z})}\sum_{i\in\mathbb{U}_b(\boldsymbol{z})}\theta_i - \frac{1}{N_a(\boldsymbol{z})}\sum_{i\in\mathbb{U}_a(\boldsymbol{z})}\theta_i\right|, \qquad (6)

where $\mathbb{U}_b(\boldsymbol{z})=\{i\in\mathbb{U}: m_i(\boldsymbol{z})=b\}$, $N_b(\boldsymbol{z})=|\mathbb{U}_b(\boldsymbol{z})|$, $\mathbb{U}_a(\boldsymbol{z})=\{i\in\mathbb{U}: m_i(\boldsymbol{z})=a\}$, and $N_a(\boldsymbol{z})=|\mathbb{U}_a(\boldsymbol{z})|$. The statistic $T_{a,b}(\boldsymbol{z},\boldsymbol{\theta})$ is well defined as long as $N_a(\boldsymbol{z}),N_b(\boldsymbol{z})>0$ for all $\boldsymbol{z}\in\mathcal{Z}$.

A critical challenge in testing $H_0^{\{a,b\}}$ in a randomized experiment with interference, however, lies in the fact that even under $H_0^{\{a,b\}}$, not all elements of $\boldsymbol{\theta}$ are observable after the experiment is conducted with the realized assignment $\boldsymbol{Z}^{\rm obs}$. To be concrete, defining $\mathbb{U}_{a,b}^{\rm obs}=\mathbb{U}_a(\boldsymbol{Z}^{\rm obs})\cup\mathbb{U}_b(\boldsymbol{Z}^{\rm obs})$, the partially observed version of $\boldsymbol{\theta}$ can be expressed as

\boldsymbol{\theta}^{\rm obs} = (\theta_1^{\rm obs},\ldots,\theta_N^{\rm obs}) \ \text{ with } \ \theta_i^{\rm obs} = \theta_i \ \text{ for } i \in \mathbb{U}_{a,b}^{\rm obs} \ \text{ and } \ \theta_i^{\rm obs} = \,? \ \text{ for } i \notin \mathbb{U}_{a,b}^{\rm obs}, \qquad (7)

where the question mark '$?$' stands for a missing value. Define $N_{a,b}^{\rm obs}=|\mathbb{U}_{a,b}^{\rm obs}|$. Because only $N_{a,b}^{\rm obs}$ elements of $\boldsymbol{\theta}$ are observable, the null hypothesis $H_0^{\{a,b\}}$ is no longer sharp, resulting in an unknown reference distribution of the test statistic $T_{a,b}(\boldsymbol{Z},\boldsymbol{\theta})$ under $H_0^{\{a,b\}}$.

To deal with this critical issue, Aronow (2012) and Athey et al. (2018) proposed conditional randomization tests (CRTs) to substitute for the classic FRT, which is not applicable in this setting. The main idea is to avoid nuisance unknowns by focusing only on a subset of carefully selected units, referred to as "focal units", and conducting the randomization analysis on them with a randomization distribution restricted to a subset of treatment assignments, referred to as "focal assignments". This approach ensures that the null hypothesis $H_0^{\{a,b\}}$ becomes sharp in the conditional inference. Because the reference distribution of the test statistic over the focal units is precisely known conditional on the focal assignments and under the null hypothesis, an FRT-like testing procedure can be conducted. Basse et al. (2019) showed that CRTs are exact both conditionally and unconditionally under mild conditions. Nonetheless, because CRTs employ only a restrictive subset of treatment assignments on a small subset of units of interest, they often suffer from severe power loss. To mitigate this power loss, it is advisable to utilize as large a set of focal units, associated with a set of focal assignments, as possible, because a larger focal unit set often ensures higher power, whereas a larger focal assignment set offers a higher resolution in obtaining the p-value. To achieve this, Puelz et al. (2022) proposed the null exposure graph (NEG), a bipartite graph between units and assignments with an edge between unit $i\in\mathbb{U}$ and assignment $\boldsymbol{z}\in\mathcal{Z}$ if and only if unit $i$ is exposed to either $a$ or $b$ under $\boldsymbol{z}$. Encoding the impact of treatment assignments on units in terms of the observability of the $\theta_i$'s under a given assignment, the NEG offers a way to better select focal units and focal assignments: every biclique in the NEG defines a set of appropriate focal units and focal assignments. This insight leads to a series of CRTs based on biclique decompositions of the NEG, whose efficiency was further improved by Yanagi and Sei (2022) via power analysis. More recently, Zhong (2024) expanded the core concept of CRTs into a more general framework named partial null randomization tests (PNRTs), which reports an ensemble p-value by averaging multiple p-values obtained from a series of CRTs. Each CRT utilizes a small focal assignment set comprising only two focal assignments, yet incorporates an enlarged focal unit set including all focal units compatible with the specific pair of focal assignments. In PNRTs, the larger focal unit set of each individual CRT enhances testing power, while the ensemble p-value, obtained through averaging, addresses the extremely low resolution of the individual p-values.

Despite these efforts, randomization tests based on the idea of conditional inference still suffer from diminishing power. For instance, consider a scenario where a high proportion of units receive treatment ($z_i=1$) due to ethical concerns (e.g., the treatment significantly benefits participants). This typically leads to a high proportion of exposure $m_i(\boldsymbol{z})=2$. If the goal is to test for no spillover effect, $H_0^{\{0,1\}}$, the proportion of units with observable potential outcomes under this null hypothesis, $N_{0,1}^{\rm obs}/N$, tends to be small for many assignments $\boldsymbol{z}\in\mathcal{Z}$. This results in a sparse NEG, posing significant challenges in identifying bicliques large enough to conduct a randomization test with realistic power.

2.4 Imputation-based p-value for Testing a Complex Null Hypothesis

For a parametric statistical model $f_{\boldsymbol{\theta}}(x)$ with $\boldsymbol{\theta}\in\Theta$ as the vector of parameters, consider testing the complex null hypothesis $H_0:\boldsymbol{\theta}\in\Theta_0$, where $\Theta_0$ is a non-empty subset of the parameter space $\Theta$. Given a collection of independent and identically distributed (i.i.d.) samples $\boldsymbol{X}^{\rm obs}=(X_1,\cdots,X_n)$ from model $f_{\boldsymbol{\theta}}(x)$, researchers often test $H_0$ based on a parameter-dependent test statistic $D(\boldsymbol{X}^{\rm obs},\boldsymbol{\theta})$. In the case where $\Theta_0$ contains only one element, so that $\boldsymbol{\theta}$ is fully specified under $H_0$, the p-value for testing $H_0$ can be naturally established as

p(\boldsymbol{X}^{\rm obs};\boldsymbol{\theta}) = \mathbb{P}_{\boldsymbol{X}}\left(D(\boldsymbol{X},\boldsymbol{\theta}) \geq D(\boldsymbol{X}^{\rm obs},\boldsymbol{\theta})\right), \qquad (8)

where the probability $\mathbb{P}_{\boldsymbol{X}}$ is derived from the random sample $\boldsymbol{X}\sim f_{\boldsymbol{\theta}}(x)$. In the case where $\Theta_0$ contains more than one element, so that $\boldsymbol{\theta}$ is not fully specified under $H_0$, the most classic way to establish a p-value for $H_0$ is to find a proper estimate of $\boldsymbol{\theta}$ (e.g., the MLE) first and then replace the unspecified $\boldsymbol{\theta}$ in (8) by its estimate $\hat{\boldsymbol{\theta}}$, resulting in the classic plug-in p-value (PIP) $p(\boldsymbol{X}^{\rm obs};\hat{\boldsymbol{\theta}})$.

From the Bayesian perspective, $p(\boldsymbol{X}^{\rm obs};\hat{\boldsymbol{\theta}})$ imputes the unspecified parameter $\boldsymbol{\theta}$ with a point estimate based on the observed data, which is suboptimal. A natural alternative is to impute $\boldsymbol{\theta}$ based on its posterior distribution under $H_0$. Based on this insight, Meng (1994) and Gelman et al. (1996) proposed to test $H_0$ by the posterior predictive p-value (PPP), defined as

p_B(\boldsymbol{X}^{\rm obs};\Theta_0) = \mathbb{E}\left[p(\boldsymbol{X}^{\rm obs};\boldsymbol{\theta}) \mid \boldsymbol{X}^{\rm obs}; H_0\right] = \int p(\boldsymbol{X}^{\rm obs};\boldsymbol{\theta})\,\pi(\boldsymbol{\theta}\mid\boldsymbol{X}^{\rm obs},\Theta_0)\,d\boldsymbol{\theta}, \qquad (9)

where $\pi(\boldsymbol{\theta}\mid\boldsymbol{X}^{\rm obs},\Theta_0)\propto f(\boldsymbol{X}^{\rm obs}\mid\boldsymbol{\theta})\,\pi(\boldsymbol{\theta}\mid\Theta_0)$ is the posterior distribution of $\boldsymbol{\theta}$ given $\boldsymbol{X}^{\rm obs}$ and a prior distribution $\pi(\boldsymbol{\theta}\mid\Theta_0)$ defined over $\Theta_0$. Notably, because $\pi(\boldsymbol{\theta}\mid\boldsymbol{X}^{\rm obs},\Theta_0)$ and $\hat{\boldsymbol{\theta}}$ typically both converge to the true parameter as the sample size approaches infinity, the PPP $p_B(\boldsymbol{X}^{\rm obs};\Theta_0)$ and the PIP $p(\boldsymbol{X}^{\rm obs};\hat{\boldsymbol{\theta}})$ often share the same asymptotic behavior. A key advantage of the PPP over the PIP is its ability to account for the uncertainty in specifying $\boldsymbol{\theta}$ via the posterior distribution $\pi(\boldsymbol{\theta}\mid\boldsymbol{X}^{\rm obs},\Theta_0)$, which makes the inference procedure more robust.
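As a toy illustration of Eq. (9), the sketch below approximates a PPP by Monte Carlo for a normal mean with a composite null $\Theta_0=[-0.1,0.1]$ and a flat prior on $\Theta_0$; the model and all names are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy model: X_i ~ N(theta, 1); composite null Theta_0 = [-0.1, 0.1].
n, lo, hi = 50, -0.1, 0.1
x_obs = rng.normal(0.05, 1.0, size=n)

def pvalue_given_theta(x, theta):
    # Two-sided p-value for D(X, theta) = sqrt(n) |mean(X) - theta|
    d = np.sqrt(len(x)) * abs(x.mean() - theta)
    return 2 * (1 - stats.norm.cdf(d))

def draw_theta_posterior():
    # Posterior of theta under a flat prior on Theta_0: a truncated normal
    m, s = x_obs.mean(), 1 / np.sqrt(n)
    a, b = (lo - m) / s, (hi - m) / s
    return stats.truncnorm.rvs(a, b, loc=m, scale=s, random_state=rng)

# Eq. (9): average the theta-specific p-value over posterior draws.
ppp = np.mean([pvalue_given_theta(x_obs, draw_theta_posterior())
               for _ in range(2000)])
print(ppp)
```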

Many causal inference problems with complicated structures conform to the PPP framework. For example, Rubin (1998) utilized the PPP to improve the power of the FRT in randomized experiments with one-sided noncompliance. Under the null hypothesis of no treatment effect for compliers, the unknown compliance types of the units serve as the nuisance parameter $\boldsymbol{\theta}$, whose value can be imputed from its posterior predictive distribution under a parametric model with an appropriate prior distribution. Averaging the resulting p-values across the imputed datasets according to Eq. (9) yields the PPP, which summarizes the observed evidence against the null hypothesis. Earlier efforts to utilize multiple imputation based on posterior predictive distributions for Bayesian model checking were summarized in Guttman (1967) and Rubin (1981, 1984). In this paper, we utilize the imputation idea to address the unobservable potential outcomes in randomization tests with interference.

3 Imputation-based Randomization Test

3.1 Methodology

Following the spirit of the PPP, we propose the following imputation-based randomization test. First, we impute the missing elements of the partially observed $\boldsymbol{\theta}$ according to an imputation distribution $\pi$ to obtain a complete version of $\boldsymbol{\theta}$, referred to as $\boldsymbol{\theta}^{\rm imp}$ hereafter, and conduct the classic FRT based on this single imputation to obtain the p-value

p_{\rm SI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm imp}) = \mathbb{P}_{\boldsymbol{Z}}\left(T(\boldsymbol{Z},\boldsymbol{\theta}^{\rm imp}) \geq T(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta}^{\rm imp})\right), \qquad (10)

with the partially observed $\boldsymbol{\theta}$ in Eq. (4) substituted by the fully imputed $\boldsymbol{\theta}^{\rm imp}$. Next, we average $p_{\rm SI}$ over the imputation distribution $\pi(\cdot\mid\boldsymbol{\theta}^{\rm obs})$, which depends on the realized assignment and observed outcomes, to obtain the ensemble p-value

p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) = \mathbb{E}\left(p_{\rm SI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm imp})\right) = \int p_{\rm SI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm imp})\,\pi(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs})\,d\boldsymbol{\theta}^{\rm imp}, \qquad (11)

and reject $H_0^{\{a,b\}}$ if $p_{\rm MI}$ is smaller than a pre-specified significance level $\alpha$. We term this testing procedure the imputation-based randomization test (IRT). Because the involved FRT p-values typically have no analytical form, $p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$ is difficult to calculate exactly; instead, we can approximate it via Monte Carlo simulation, as suggested by Rubin (1998), after conducting the randomized experiment according to the realized assignment $\boldsymbol{Z}^{\rm obs}$. The algorithm below summarizes the IRT procedure, where $\boldsymbol{I}(\cdot)$ denotes the indicator function.

Input: Observed assignment $\boldsymbol{Z}^{\rm obs}$, observed elements of the nuisance parameters $\boldsymbol{\theta}^{\rm obs}$, imputation distribution $\pi(\cdot\mid\boldsymbol{\theta}^{\rm obs})$, treatment assignment mechanism $\boldsymbol{F}_{\mathcal{Z}}$, significance level $\alpha$, number of simulations $K$.
for $k=1$ to $K$ do
      (Imputation) Sample $\boldsymbol{\theta}^{\rm imp}_{(k)}$ from the imputation distribution $\pi(\cdot\mid\boldsymbol{\theta}^{\rm obs})$.
      (Randomization) Randomly sample $\boldsymbol{z}_{(k)}$ from $\boldsymbol{F}_{\mathcal{Z}}$.
      Calculate $R_k = \boldsymbol{I}\big(T(\boldsymbol{z}_{(k)},\boldsymbol{\theta}^{\rm imp}_{(k)}) \geq T(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta}^{\rm imp}_{(k)})\big)$.
end for
(Averaging) $\hat{p}_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) = K^{-1}\sum_{k=1}^{K} R_k$.
Output: p-value $\hat{p}_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$; reject $H_0^{\{a,b\}}$ if $\hat{p}_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) \leq \alpha$.
Algorithm 1: Imputation-based Randomization Test (IRT)
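As a companion to Algorithm 1, the following Python sketch is a minimal Monte Carlo implementation; the function signature and the callables draw_assignment, draw_imputation, and test_stat are our own illustrative interface, to be supplied for a concrete design, imputation distribution, and test statistic.

```python
import numpy as np

def irt_pvalue(z_obs, theta_obs, observed_mask,
               draw_assignment, draw_imputation, test_stat, K=10_000):
    """Monte Carlo approximation of the IRT p-value (Algorithm 1).

    z_obs           : realized assignment Z_obs
    theta_obs       : nuisance vector; entries outside U_ab^obs are ignored
    observed_mask   : boolean vector, True where theta_i is observed
    draw_assignment : callable returning one draw from the design F_Z
    draw_imputation : callable returning one draw of the missing theta_i's
                      from pi(. | theta_obs)
    test_stat       : callable computing T(z, theta)
    """
    hits = 0
    for _ in range(K):
        theta_imp = np.array(theta_obs, dtype=float)     # copy observed part
        theta_imp[~observed_mask] = draw_imputation()    # (Imputation)
        z_k = draw_assignment()                          # (Randomization)
        hits += test_stat(z_k, theta_imp) >= test_stat(z_obs, theta_imp)
    return hits / K                                      # (Averaging)
```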

The imputation distribution $\pi$ clearly plays an important role in the construction of the IRT. Assuming that the units in $\mathbb{U}$ are random samples from a super-population $\mathcal{P}$ with $g(\cdot)$ as the marginal density of $\theta_i$ for $i\in\mathbb{U}$, the structure of the hypothesis testing problem in Eq. (5) naturally suggests the following strategy for imputing $\boldsymbol{\theta}=(\theta_1,\cdots,\theta_N)$ given its observed version $\boldsymbol{\theta}^{\rm obs}$: we set $\theta_i^{\rm imp}=\theta_i^{\rm obs}$ for $i\in\mathbb{U}_{a,b}^{\rm obs}$, and we impute $\theta_i^{\rm imp}$ with a random sample from $g(\cdot)$ for $i\notin\mathbb{U}_{a,b}^{\rm obs}$, leading to the oracle imputation distribution

\pi_o(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs}) = \prod_{i\notin\mathbb{U}_{a,b}^{\rm obs}} g(\theta_i^{\rm imp}) \cdot \prod_{i\in\mathbb{U}_{a,b}^{\rm obs}} \delta(\theta_i^{\rm imp}-\theta_i^{\rm obs}), \qquad (12)

where $\delta(\cdot)$ is the Dirac delta function. We refer to the IRT based on $\pi_o(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs})$ as IRTo.

In practice, however, the distribution $g(\cdot)$ is often unknown and needs to be estimated from the observed subset of $\boldsymbol{\theta}^{\rm obs}$, namely $\tilde{\boldsymbol{\theta}}^{\rm obs}=\{\theta_i^{\rm obs}\}_{i\in\mathbb{U}_{a,b}^{\rm obs}}$ as in Eq. (7). Letting $\hat{g}(\cdot\mid\tilde{\boldsymbol{\theta}}^{\rm obs})$ denote the estimated marginal distribution of $\theta_i$, we obtain the following imputation distribution in practice:

\hat{\pi}(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs}) = \prod_{i\notin\mathbb{U}_{a,b}^{\rm obs}} \hat{g}(\theta_i^{\rm imp}\mid\tilde{\boldsymbol{\theta}}^{\rm obs}) \cdot \prod_{i\in\mathbb{U}_{a,b}^{\rm obs}} \delta(\theta_i^{\rm imp}-\theta_i^{\rm obs}). \qquad (13)

When $g(\cdot)$ has a parametric form with model parameters $\boldsymbol{\gamma}$, we can learn $g(\cdot)$ from $\tilde{\boldsymbol{\theta}}^{\rm obs}$ by inferring $\boldsymbol{\gamma}$ in either a frequentist or a Bayesian manner. We term this version of the IRT IRTp.

If no appropriate parametric model is available, we can instead estimate $g(\cdot)$ via non-parametric approaches. For example, we can approximate the unknown $g(\cdot)$ by the empirical distribution of the observed potential outcomes $\tilde{\boldsymbol{\theta}}^{\rm obs}$, i.e.,

\hat{g}_e(\theta\mid\tilde{\boldsymbol{\theta}}^{\rm obs}) = \frac{1}{N_{a,b}^{\rm obs}} \sum_{i\in\mathbb{U}_{a,b}^{\rm obs}} \delta(\theta-\theta_i^{\rm obs}). \qquad (14)

Alternatively, when $g(\cdot)$ is known to be a continuous distribution, we can rely on a smoothed version of the empirical distribution, as suggested by van der Vaart (1994), i.e.,

\hat{g}_s(\theta\mid\tilde{\boldsymbol{\theta}}^{\rm obs}) = \frac{1}{N_{a,b}^{\rm obs}\,h_N} \sum_{i\in\mathbb{U}_{a,b}^{\rm obs}} K\left(\frac{\theta-\theta_i^{\rm obs}}{h_N}\right), \qquad (15)

where $K(\cdot)$ is a symmetric kernel function with bandwidth $h_N$, satisfying $\int K(u)\,du=1$, $\int u^2K(u)\,du<\infty$, and $K(u)=K(-u)$. We refer to the IRTs based on $\hat{g}_e$ and $\hat{g}_s$ as IRTe and IRTs, respectively.
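To make Eqs. (14)-(15) operational, the sketch below gives one way to draw the missing $\theta_i$'s for IRTe and IRTs; the Gaussian kernel and the Silverman-style default bandwidth are illustrative assumptions rather than the paper's prescriptions.

```python
import numpy as np

def draw_irte(theta_obs_vals, n_missing, rng):
    # IRTe: draw missing theta_i from the empirical distribution in Eq. (14).
    return rng.choice(theta_obs_vals, size=n_missing, replace=True)

def draw_irts(theta_obs_vals, n_missing, rng, h=None):
    # IRTs: draw from the kernel-smoothed estimate in Eq. (15) with a Gaussian
    # kernel: resample an observed value, then add kernel noise of scale h.
    if h is None:  # Silverman-style bandwidth, an illustrative default choice
        h = 1.06 * theta_obs_vals.std() * len(theta_obs_vals) ** (-0.2)
    base = rng.choice(theta_obs_vals, size=n_missing, replace=True)
    return base + h * rng.normal(size=n_missing)
```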

3.2 Theoretical Results

In this section, we present theoretical results for the IRT under the null hypothesis in Eq. (5), both when the oracle imputation distribution is precisely known and when it must be estimated from the observables. Some of the results are closely related to the results for the PPP in Meng (1994). A critical difference between the IRT and the PPP, however, is that the imputation distribution in the IRT does not converge to a point mass as the sample size increases, unlike in the classic PPP. This fact leads to some unique issues in the theoretical analysis of the IRT, making it non-trivial to establish its validity. We start with the simple case where the IRT is conducted with the oracle imputation distribution, i.e., IRTo.

Theorem 1.

Assume that (i) the null hypothesis in Eq. (5) holds with $Y_i(a)=Y_i(b)=\theta_i$, (ii) the $\theta_i$ are i.i.d. samples from a distribution with probability density $g(\cdot)$, (iii) $\boldsymbol{Z}^{\rm obs}\sim\boldsymbol{F}_{\mathcal{Z}}$, and (iv) $\boldsymbol{\theta}^{\rm imp}$ is imputed according to the oracle imputation distribution $\pi_o(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs})$ defined in Eq. (12). Then $p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$ is stochastically less variable than a uniform random variable with the same mean, in the sense that: (i) $\mathbb{E}_{(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})}[p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})]=\mathbb{E}_U[U]=1/2$, and (ii) $\mathbb{E}_{(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})}\left[h\left(p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})\right)\right]\leq\mathbb{E}_U[h(U)]$ for all convex functions $h$ on $[0,1]$, where $U$ is a random sample from the uniform distribution on $[0,1]$, the expectation $\mathbb{E}_{(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})}$ is taken with respect to the joint distribution of $(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})$, and $\mathbb{E}_U$ is taken with respect to $U$.

Theorem 1 shows that the IRTo p-value is more concentrated around $1/2$ than $U\sim{\rm Unif}(0,1)$ when the null hypothesis holds. The proof is detailed in the Supplementary Material. The theorem immediately leads to the following corollary, based on Lemma 1 in Meng (1994).

Corollary 1.

Under the same assumptions as in Theorem 1, we have

\mathbb{P}\left(p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) \leq \alpha\right) \leq 2\alpha, \quad \forall\ \alpha\in(0,1),

where $\mathbb{P}$ is the probability with respect to $\boldsymbol{Z}^{\rm obs}\sim\boldsymbol{F}_{\mathcal{Z}}$ and $\theta_1,\ldots,\theta_N\stackrel{\rm i.i.d.}{\sim}g(\cdot)$.

Corollary 1 ensures that IRTo controls the type I error rate at most twice the significance level $\alpha$, a result that is well known for the PPP. Although this result cannot be strengthened to the more desirable form $\mathbb{P}\left(p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})\leq\alpha\right)\leq\alpha$ for all $\alpha>0$, as shown by the counterexample in Dahl (2006), our empirical studies in Sections 4 and 5 show that the type I error rates of IRTo and its extensions are often well controlled at $\alpha$. Interestingly, the same phenomenon occurs in other model-free, finite-sample exact inference methods, such as cross-conformal prediction, jackknife+, and CV+ (Vovk et al., 2018; Vovk and Wang, 2020; Barber et al., 2021; Guan, 2024). For instance, Barber et al. (2021) recommended the jackknife+ prediction interval for its typical empirical coverage of $1-\alpha$, although theoretically it can only guarantee $1-2\alpha$ coverage due to certain pathological cases. These practices provide extra support for the application of the IRT.

Next, we focus on the more challenging case where the IRT is conducted with the estimated imputation distribution in Eq. (13). Define

\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) = \int p_{\rm SI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm imp})\left[\hat{\pi}(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs}) - \pi_o(\boldsymbol{\theta}^{\rm imp}\mid\boldsymbol{\theta}^{\rm obs})\right]d\boldsymbol{\theta}^{\rm imp}

as the difference between the ensemble p-values of the two versions of the IRT with the estimated and oracle imputation distributions. Intuitively, if the distribution of $\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$, which is determined by the joint distribution of $(\boldsymbol{Z}^{\rm obs},\boldsymbol{\theta})$, concentrates at 0, the IRT under the estimated imputation distribution will closely resemble IRTo, with similar properties. The following lemma confirms this intuition, ensuring that if $\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$ converges to 0 in probability, the asymptotic type I error rate of the IRT with the corresponding estimated imputation distribution is also bounded by twice the significance level.

Lemma 1.

Assume that (i) the null hypothesis in Eq. (5) holds with $Y_i(a)=Y_i(b)=\theta_i$, (ii) the $\theta_i$ are i.i.d. samples from a distribution with probability density $g(\cdot)$, and (iii) $\boldsymbol{Z}^{\rm obs}\sim\boldsymbol{F}_{\mathcal{Z}}$. If $\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})=o_p(1)$, then for $p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})$ with the imputation in Eq. (13), we have

\limsup_{N\rightarrow\infty} \mathbb{P}\left(p_{\rm MI}(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs}) \leq \alpha\right) \leq 2\alpha, \quad \forall\ \alpha\in(0,1). \qquad (16)

Denote by $\boldsymbol{F}_N$ the sampling distribution of $N_{a,b}^{\rm obs}$ induced by $\boldsymbol{Z}^{\rm obs}\sim\boldsymbol{F}_{\mathcal{Z}}$, with the corresponding probability mass function (p.m.f.)

\mathbb{P}_N(N_{a,b}^{\rm obs}=n) = \sum_{\boldsymbol{z}:\,|\mathbb{U}_a(\boldsymbol{z})\cup\mathbb{U}_b(\boldsymbol{z})|=n} \mathbb{P}_{\boldsymbol{Z}}(\boldsymbol{Z}=\boldsymbol{z}), \qquad (17)

where $\mathbb{P}_{\boldsymbol{Z}}$ is defined over $\boldsymbol{Z}\sim\boldsymbol{F}_{\mathcal{Z}}$. The following theorem provides sufficient conditions under which $\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})=o_p(1)$, and thus Eq. (16) holds.

Theorem 2.

Under the same assumptions as in Lemma 1, if $g$ and $\hat{g}$ correspond to continuous distributions, a sufficient condition for $\Delta(\boldsymbol{Z}^{\rm obs};\boldsymbol{\theta}^{\rm obs})=o_p(1)$, and thus Eq. (16), is

\mathbb{E}\left|\prod_{i=N_{a,b}^{\rm obs}+1}^{N} \frac{\hat{g}\left(\vartheta_i\mid\bm{\vartheta}_{1:N_{a,b}^{\rm obs}}\right)}{g(\vartheta_i)} - 1\right| = o(1) \ \text{ as } N\rightarrow\infty, \qquad (18)

where $\mathbb{E}$ is the expectation taken with respect to $\vartheta_1,\ldots,\vartheta_N$ i.i.d. with probability density function $g(\cdot)$, $\bm{\vartheta}_{1:N_{a,b}^{\rm obs}}=\{\vartheta_1,\ldots,\vartheta_{N_{a,b}^{\rm obs}}\}$, and $N_{a,b}^{\rm obs}\sim\boldsymbol{F}_N$. If $g$ and $\hat{g}$ correspond to discrete distributions with p.m.f.s $p$ and $\hat{p}$, the sufficient condition is

\mathbb{E}\left|\prod_{i=N_{a,b}^{\rm obs}+1}^{N} \frac{\hat{p}\left(\vartheta_i\mid\bm{\vartheta}_{1:N_{a,b}^{\rm obs}}\right)}{p(\vartheta_i)} - 1\right| = o(1) \ \text{ as } N\rightarrow\infty, \qquad (19)

where $\mathbb{E}$ is the expectation taken with respect to $\vartheta_1,\ldots,\vartheta_N$ i.i.d. with p.m.f. $p(\cdot)$, $\bm{\vartheta}_{1:N_{a,b}^{\rm obs}}=\{\vartheta_1,\ldots,\vartheta_{N_{a,b}^{\rm obs}}\}$, and $N_{a,b}^{\rm obs}\sim\boldsymbol{F}_N$.

These sufficient conditions imply that when the likelihood ratio between the estimated distribution $\hat{g}$ and the true distribution $g$ on the unobservables converges to 1 in $L_1$, the IRT with the estimated imputation distribution will behave similarly to IRTo, whose behavior was established above. The corollary below shows that in a simple case where $g(\cdot)$ is a normal density with known variance, the sufficient conditions can be theoretically verified under specific conditions.

Corollary 2.

Under the same assumptions as in Lemma 1, let the $\theta_i$ be i.i.d. samples from $\mathcal{N}(\mu,\sigma^2)$ with known variance $\sigma^2$ and unknown mean $\mu$. Suppose the sampling distribution $\boldsymbol{F}_N$ of $N_{a,b}^{\rm obs}$ is a point mass, so that the missing rate of the nuisance parameters, defined as $r_N^{\rm mis}=1-N_{a,b}^{\rm obs}/N$, is a constant depending on $N$. If we estimate the unknown density $g(\cdot)$ by its posterior predictive distribution $\hat{g}(\cdot\mid\tilde{\boldsymbol{\theta}}^{\rm obs})$, given the observed outcomes $\tilde{\boldsymbol{\theta}}^{\rm obs}$ and a normal prior for $\mu$, then (16) holds when $r_N^{\rm mis}=o(1)$.

For more complex cases, theoretical verification of the sufficient conditions in Theorem 2 becomes challenging, but they can be verified numerically when $r_N^{\rm mis}$ converges to 0 at a sufficiently fast rate, e.g., $O(N^{-1/2})$. Further details are provided in the Supplementary Material.

4 Simulations

4.1 Clustered Interference

In this section, we examine the proposed method under various interference mechanisms via simulation. First, we consider the clustered interference defined in Example 1. For a group of $N=300$ units equally divided into $K\in\{30,75,150\}$ clusters, we are interested in testing the null hypothesis of no spillover effect, $H_0^{\{0,1\}}$. Since the potential outcome $Y_i(2)$ is irrelevant to our interest, we only generate the potential outcomes $Y_i(0)$ and $Y_i(1)$ by

Y_i(0) \sim \mathcal{N}(0,1), \quad Y_i(1) = Y_i(0) + \tau, \quad i = 1,\ldots,N, \qquad (20)

where $\tau\in\{0,0.1,\cdots,1\}$. To identify the spillover effect under clustered interference, we consider the two-stage randomized experiment suggested by Hudgens and Halloran (2008): in the first stage, we randomly assign $\lfloor K/2\rfloor$ clusters to the treatment group and the remaining clusters to the control group; in the second stage, we randomly select one unit from each cluster in the treatment group to receive the treatment, with all other units untreated. For each of the $3\times 11$ configurations of $(K,\tau)$, we generate 10 independent datasets according to the data simulation model in Eq. (20). For each simulated dataset, with its specific set of potential outcomes, we conduct 1,000 parallel randomized experiments with the treatment assignment $\boldsymbol{Z}$ randomly drawn from the randomized design $\boldsymbol{F}_{\mathcal{Z}}$ of the two-stage experiment.
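For concreteness, a minimal sketch of one draw from this two-stage design is given below; the function name and interface are our own illustrative choices.

```python
import numpy as np

def two_stage_assignment(K, cluster_size, rng):
    """One draw from the two-stage design of Hudgens and Halloran (2008):
    K/2 clusters are treated; within each treated cluster, one unit gets z=1."""
    N = K * cluster_size
    z = np.zeros(N, dtype=int)
    treated_clusters = rng.choice(K, size=K // 2, replace=False)
    for c in treated_clusters:
        unit = c * cluster_size + rng.integers(cluster_size)
        z[unit] = 1
    return z

rng = np.random.default_rng(2024)
z = two_stage_assignment(K=30, cluster_size=10, rng=rng)  # N = 300 as in the text
```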

Using the difference-in-means in Eq. (6) as the test statistic, with the significance level for rejecting the null hypothesis set to $\alpha=0.05$, we compare the proposed IRT methods, including IRTo, IRTp, IRTe, and IRTs, to a group of CRT methods with focal units and assignments selected by various algorithms (CRTo, CRTb, CRTdb, and CRTpb) and two PNRT methods (PNRTp and PNRTm). Detailed configurations of the 10 competing methods can be found in Table 1. For each of these randomization tests, we measure the type I error rate via the simulated datasets with $\tau=0$ and the power via the simulated datasets with $\tau\neq 0$.

Methods | Description | Clustered | Spatial
CRTo | CRT with one focal unit per cluster (Basse et al., 2019) | ✓ |
CRTb | CRT with general biclique decomposition (Puelz et al., 2022) | ✓ | ✓
CRTdb | CRT with design-assisted biclique decomposition (Puelz et al., 2022) | ✓ |
CRTpb | CRT with power-analysis-assisted biclique decomposition (Yanagi and Sei, 2022) | ✓ | ✓
PNRTp | The pairwise comparison-based PNRT (Zhong, 2024) | ✓ | ✓
PNRTm | The minimization-based PNRT (Zhong, 2024) | ✓ | ✓
IRTo | IRT imputed with the oracle distribution | ✓ | ✓
IRTp | IRT imputed with a parametric model | ✓ | ✓
IRTe | IRT imputed with the empirical distribution | ✓ | ✓
IRTs | IRT imputed with a smooth kernel density function | ✓ | ✓

Table 1: Methods evaluated in the simulation study. Check marks indicate the interference types (clustered or spatial) under which each method is applicable.

We present the type I error rate across the 10 datasets and the average power of the 10 competing methods in Fig. 1. For the cases with $K=150$, the results of CRTb and CRTpb are absent because these two methods become infeasible: when $K=150$, the average missing rate of $\theta_i$ reaches 25%, leading to a null exposure graph with density lower than 0.75. Consequently, it is impractical to find a biclique decomposition meeting the minimum size requirements of Puelz et al. (2022) and Yanagi and Sei (2022), which specify at least 20 focal assignments and 20 focal units. From the results, we make the following observations. First, while the IRT and CRT methods successfully control the type I error rate around 0.05 across all 10 datasets, the two PNRT methods are overly conservative, yielding type I error rates much smaller than 0.05 in all cases. Although our theory only guarantees a $2\alpha$ bound on the type I error rate of the IRT methods asymptotically, these simulation results suggest that the IRT can often control the type I error rate below $\alpha$ in practice. Second, the IRT methods demonstrate significantly higher power than the CRT and PNRT methods in all cases, especially when the number of clusters $K$ is large. Although CRTo and CRTdb also yield competitively high power in these cases, they are specifically designed for clustered interference and cannot be extended to other settings. Third, the IRT methods based on estimated distributions, namely IRTe, IRTp, and IRTs, exhibit performance similar to IRTo, even though the missing rate of the nuisance parameters is far from 0. These results confirm that the proposed IRTs are powerful tools for causal inference in the presence of clustered interference.

Additional experiments, in which the $Y_i(0)$ values are generated from heavier-tailed distributions such as the $\chi^2$ or $t$ distributions but are still modeled as normal, are reported in the Supplementary Material; they show that IRTp is robust to model mis-specification.

Figure 1: Simulation results under clustered interference. The type I error rate across 10 datasets and the average power over these datasets are visualized for the 10 competing methods under different specifications of the causal effect $\tau$ and the number of clusters $K$. The horizontal dashed line indicates the significance level $\alpha=0.05$.

4.2 Spatial Interference

Next, we consider the spatial interference defined in Example 2. For a group of $N=1{,}000$ units with spatial interference, the potential outcomes are simulated according to Eq. (20). We adopt a setting similar to that of Yanagi and Sei (2022) to specify the interference mechanism: each unit $i$ is associated with a 2-dimensional coordinate $\boldsymbol{x}_i\in\mathbb{R}^2$ according to

\boldsymbol{x}_i \sim \mathcal{N}\big((0.5,0.5)^{\mathrm{T}}, 0.1^2 I_2\big)\cdot I(1\leq i\leq 500) + \mathcal{N}\big((0.25,0.75)^{\mathrm{T}}, 0.075^2 I_2\big)\cdot I(500<i\leq 800) + \mathcal{N}\big((0.3,0.3)^{\mathrm{T}}, 0.075^2 I_2\big)\cdot I(800<i\leq 1000),

where $I_2$ denotes the $2\times 2$ identity matrix, and we allow interference to occur between two units $i$ and $j$ if and only if their Euclidean distance $d(i,j)=\|\boldsymbol{x}_i-\boldsymbol{x}_j\|_2\leq r$, with the interference distance $r\in\{0.005,0.01,0.02\}$. For each configuration of $(r,p,\tau)$, we generate 10 independent datasets of sample size $N$ according to Eq. (20). For each simulated dataset, the treatment assignment $\boldsymbol{Z}=(Z_1,\ldots,Z_N)$ is sampled randomly 1,000 times according to independent Bernoulli trials with $Z_i\sim{\rm Bernoulli}(p)$ for $i=1,\ldots,N$ and $p\in\{0.2,0.5,0.8\}$.
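Under this stated simulation setting, the sketch below generates the coordinates, the distance-based interference network, and one Bernoulli($p$) assignment; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, r, p = 1000, 0.02, 0.5

# Mixture of three Gaussian clouds for the coordinates, as in the text.
x = np.vstack([
    rng.normal((0.5, 0.5), 0.1, size=(500, 2)),
    rng.normal((0.25, 0.75), 0.075, size=(300, 2)),
    rng.normal((0.3, 0.3), 0.075, size=(200, 2)),
])

# i ~ j iff the Euclidean distance between units i and j is at most r.
dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
adj = (dist <= r).astype(int)
np.fill_diagonal(adj, 0)

z = rng.binomial(1, p, size=N)   # independent Bernoulli(p) assignment
```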

Figure 2: Simulation results under spatial interference. Panels A, B, and C show the type I error rate and average power of the 10 competing methods for $p=0.2$, $0.5$, and $0.8$, under different specifications of the causal effect $\tau$ and the interference distance $r$. The horizontal dashed line indicates the significance level $\alpha=0.05$.

Using the difference-in-means in Eq. (6) as the test statistic, we test the null hypothesis of no spillover effect, $H_0^{\{0,1\}}$, at significance level $\alpha=0.05$. The type I error rate and average power across the 10 datasets are measured for the methods described in Section 4.1. The results are presented in Fig. 2, where CRTo and CRTdb are not included because they are inapplicable under spatial interference. In this figure, panels A, B, and C depict the type I error rate and power for $p=0.2$, $0.5$, and $0.8$, respectively. For $p=0.5$ and $0.8$, the results of CRTb and CRTpb are absent due to the high average missing rate of the nuisance parameters, which can reach 50% and 80%, respectively.

These results show that the proposed IRTs significantly outperform existing methods under spatial interference, just as under clustered interference. First, the proposed IRT methods effectively control the type I error rate and exhibit significantly higher power than the CRTs and PNRTs, especially when $p=0.8$, where the CRTs are not applicable and the PNRTs have almost no power at all. Second, as under clustered interference, the other IRT methods perform similarly to IRTo even with a large missing rate of the nuisance parameters. Additional results for $Y_i(0)$ generated from $\chi^2(4)$ and $t(4)$ distributions are given in the Supplementary Material and show that the IRT methods remain robust across different distributions under spatial interference as well.

Because all simulation results show that the IRTs based on different estimated imputation distributions perform similarly to IRTo, we recommend using IRTe in practice for its ease of implementation and because it requires no additional assumptions on the marginal distribution of $\theta_i$.

5 Real Data Application

5.1 The Insurance Purchase Data

We re-analyze the insurance purchase data from a two-round randomized experiment with multiple treatments conducted by Cai et al. (2015) to study the influence of the social network among farmers on their decisions to purchase a new weather insurance product. In this experiment, $N=4{,}902$ households of rice farmers were randomly assigned to the first or the second round of insurance introduction sessions. Households in both rounds were randomized to receive either a simple or an intensive introduction to the insurance product. Households assigned to the second round were further divided into three random groups that received additional information about the first-round purchase decisions in three different ways: no additional information (NoInfo), summary information in terms of the overall purchase rate (Overall), or detailed information on the concrete purchase decisions of all first-round households (Indiv). In total, the households were randomized into $2+2\times 3=8$ treatment groups, as summarized in Fig. 3. A stratified randomized experiment was conducted to randomly assign the 8 treatments to households.

Figure 3: Eight treatment groups in the two-round experiment conducted by Cai et al. (2015). Numbers in brackets indicate how many households were assigned to each treatment group in the experiment.

Households were asked to make a purchase decision immediately after the insurance introduction sessions, ensuring that there was no interference between households in the same round. However, there was a 3-day gap between the two rounds, allowing households from the first round to personally discuss the insurance with their friends in the second round and implying a one-way interference from the first round to the second round. The social network of the 4,902 households was known in the experiment. Let $\boldsymbol{W}=(W_{ij})_{N\times N}$ be the adjacency matrix of the social network, with $W_{ij}=1$ if households $i$ and $j$ are friends and $W_{ij}=0$ otherwise. The purchase decisions of all households, i.e., the $Y_i$'s, were recorded as the output of the experiment, where $Y_i=1$ if household $i$ made a purchase and $Y_i=0$ otherwise.

5.2 Testing the Spillover Effect by IRTs

Containing rich ingredients, the insurance purchase data can be analyzed in different ways to answer different scientific questions. Here, we are interested in testing the spillover effect from the first-round households to the second-round households. To be concrete, let $\mathbb{U}=\{1,\ldots,N\}$ be the $N=4{,}902$ households of interest, $\mathcal{T}=\{1,\ldots,8\}$ be the treatment set shown in Fig. 3, $Z_i^{\rm obs}\in\mathcal{T}$ be the observed treatment assignment of household $i\in\mathbb{U}$, and $\boldsymbol{Z}^{\rm obs}=(Z_1^{\rm obs},\ldots,Z_N^{\rm obs})$. Due to the possible interference between households through their social network, a household assigned to treatment $t\in\mathcal{T}$ may receive distinct exposures depending on how much prior information its friend households bring to it before treatment: if some of its friend households received the intensive introduction of the insurance product in the first round, treatment $t$ is conducted with strong prior information, leading to exposure $t^{(s)}$; if some of its friend households received only the simple introduction, treatment $t$ is conducted with weak prior information, leading to exposure $t^{(w)}$; and if none of its friend households received the simple or intensive introduction beforehand, treatment $t$ is conducted with no prior information at all, leading to exposure $t^{(n)}$. Although we allow the insurance content to spill over through the social network, for simplicity we assume that households cannot obtain information about the first-round purchase decisions from the social network; they can only obtain this information from the experimenter, as intended. Based on these insights, we work with the following exposure mapping from $\mathcal{Z}$ to the exposure set $\mathcal{E}=\cup_{t\in\mathcal{T}}\{t^{(n)},t^{(w)},t^{(s)}\}$:

m_{i}({\boldsymbol{z}})=z_{i}^{(a_{i})},\quad \forall\, i\in{\mathbb{U}},\ {\boldsymbol{z}}\in{\mathcal{Z}}, \qquad (21)

where the indicator $a_{i}$ takes values in the set $\{n,w,s\}$, indicating the level of prior information household $i$ receives before treatment. Note that because the $a_{i}$'s are fully determined by the treatment assignment ${\boldsymbol{z}}$ and the social network of the households, $m_{i}({\boldsymbol{z}})$ in Eq. (21) is a well-defined exposure mapping.
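To make the mapping concrete, the following minimal R sketch computes the exposure labels from an assignment vector z and the adjacency matrix W. The coding of the first-round treatments as 1 (simple introduction) and 2 (intensive introduction) is an assumption made for illustration only; the actual labels follow Fig. 3.

```r
# Minimal sketch of the exposure mapping in Eq. (21). Treatment labels
# 1 (first-round simple) and 2 (first-round intensive) are assumed here.
exposure_map <- function(z, W) {
  vapply(seq_along(z), function(i) {
    friends <- which(W[i, ] == 1)
    if (any(z[friends] == 2)) {
      a <- "s"  # some friend received the intensive introduction a priori
    } else if (any(z[friends] == 1)) {
      a <- "w"  # friends received the simple introduction a priori only
    } else {
      a <- "n"  # no friend took part in the first round
    }
    paste0(z[i], "(", a, ")")  # exposure label t^(a_i)
  }, character(1))
}
```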

For treatment $t\in{\mathcal{T}}$, we consider testing the following null hypothesis of no spillover effect, which contrasts $t^{(w)}$ and $t^{(s)}$ within treatment group $t$:

H_{0}^{t}:\ Y_{i}(t^{(w)})=Y_{i}(t^{(s)}),\quad 1\leq i\leq N. \qquad (22)

For this purpose, we use the difference in means between the $t^{(s)}$ exposure group and the $t^{(w)}$ exposure group as the estimator of the spillover effect and as the test statistic for testing $H_{0}^{t}$, and compute two-sided p-values via various randomization tests. For IRTs, we consider IRTe and IRTp with the following Beta-Binomial model for the $\theta_{i}$'s: $\theta_{i}\sim\text{Binomial}(1,q)$, $q\sim\text{Beta}(1,1)$. For PNRTs, we try both PNRTp and PNRTm under their default settings. For CRTs, however, we find that no member of this method family is applicable to this real data example: CRTo and CRTdb cannot be applied to two-round experiments with multiple treatments and one-way interference, and CRTb and CRTpb are infeasible because the low density (e.g., 0.107 when $t=3$) of the null exposure graph renders the biclique decomposition impractical. Therefore, we only compare IRTe and IRTp with PNRTp and PNRTm here.
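For reference, a minimal R sketch of this test statistic and of a single-imputation randomization p-value is given below; IRTe and IRTp average such p-values over multiple imputations of the unobservable outcomes. Here draw_assignment() is a hypothetical stand-in for a re-draw from the stratified design, and exposure_map() is the sketch above.

```r
# Difference in means between the t^(s) and t^(w) exposure groups, used as
# both the estimator of the spillover effect and the test statistic for H_0^t.
diff_in_means <- function(y, expos, t) {
  mean(y[expos == paste0(t, "(s)")]) - mean(y[expos == paste0(t, "(w)")])
}

# One single-imputation randomization p-value (two-sided); y_imp is a complete
# outcome vector with the nuisance unknowns imputed under H_0^t, and
# draw_assignment() is a hypothetical re-draw from the stratified design.
frt_pvalue <- function(y_imp, z_obs, W, t, n_draw = 1000) {
  t_obs  <- diff_in_means(y_imp, exposure_map(z_obs, W), t)
  t_null <- replicate(n_draw, {
    z_new <- draw_assignment()
    diff_in_means(y_imp, exposure_map(z_new, W), t)
  })
  mean(abs(t_null) >= abs(t_obs))
}
```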

Treatment   Estimated          p-values reported by different randomization tests
            spillover effect   CRT    PNRTp   PNRTm   IRTe    IRTp
--------------------------------------------------------------------------------
1           -0.009             -      0.560   1.000   0.805   0.796
2           -0.019             -      0.470   1.000   0.600   0.593
................................................................................
3            0.133             -      0.423   1.000   0.002   0.006
4           -0.022             -      0.590   1.000   0.739   0.749
5            0.113             -      0.503   1.000   0.085   0.080
6            0.011             -      0.564   1.000   0.795   0.829
7            0.138             -      0.504   1.000   0.038   0.035
8            0.128             -      0.510   1.000   0.048   0.059

Table 2: Testing results for the two-round experiment. "-" indicates that CRTs are not applicable; the dotted rule separates treatments 1-2 from treatments 3-8. The significance level is $\alpha=0.05$, and the Bonferroni-adjusted significance level is $\alpha/8=0.00625$.

Table 2 summarizes the estimated spillover effects for $t\in\{1,\ldots,8\}$ and the corresponding p-values obtained by different methods. From the table, we can see the following. First, IRTs, as well as PNRTs, suggest that the spillover effects are not significant for treatments 1 and 2, which is consistent with the domain knowledge of no interference within the same round. Second, both IRT methods reveal significant spillover effects for treatments 3 and 7, although only the spillover effect for treatment 3 remains significant after the Bonferroni adjustment for multiple testing. Since households in the treatment 3 group received only a simple introduction with no additional information on the first-round purchase decisions, it is reasonable that information leaked through the social network would significantly influence their purchase decisions. However, both PNRT methods miss these signals, with all p-values from PNRTm far from 0.05.

5.3 Simulation Based on Real Data

To further assess the performance of the proposed methods in practice, we train a predictive regression model for the potential outcomes based on the real data and generate simulated data from the fitted model for performance evaluation and comparison. To this end, we select 9 relevant covariates of the households according to Cai et al. (2015) as predictors to build a logistic regression model for the binary potential outcomes, resulting in a $4902\times 9$ covariate matrix ${\boldsymbol{X}}$. Because 6 of the 9 columns in ${\boldsymbol{X}}$ have missing values, we impute these missing values for convenient model training and data simulation. Following the strategy suggested by Zhao and Ding (2024), we impute all missing values in ${\boldsymbol{X}}$ with 0 and include the binary missingness indicators (1 for missing entries and 0 otherwise) of the 6 incomplete covariates as additional covariates, resulting in an extra $4902\times 6$ covariate matrix ${\boldsymbol{M}}$. Moreover, dummy variables recording the actual exposures received by households in the experiment, i.e., ${\boldsymbol{E}}=(E_{i,e})_{i\in{\mathbb{U}},e\in{\mathcal{E}}}$, are also included in the model as covariates, where $E_{i,e}=1$ if household $i$ received exposure $e$ in the experiment. With these covariates, we fit the logistic model for the observed outcomes ${\boldsymbol{Y}}^{{\rm{obs}}}$ as our base model for data generation: $\text{logistic}({\boldsymbol{Y}}^{{\rm{obs}}}\sim 1+{\boldsymbol{E}}+{\boldsymbol{X}}+{\boldsymbol{M}})$. Simulated datasets can then be generated from this base model to evaluate the performance of various methods for testing spillover effects. For illustration, we focus on testing $H_{0}^{t}$ with $t=3$, which examines the spillover effect between exposures $3^{(s)}$ and $3^{(w)}$. To be specific, let $\text{coef}(s)$ and $\text{coef}(w)$ be the coefficients of the dummy variables $E_{:,3^{(s)}}$ and $E_{:,3^{(w)}}$ in the fitted base model, and let $\Delta(s,w)$ be their difference. We modify the base model by resetting $\text{coef}(s)=\text{coef}(w)+\Delta(s,w)\cdot r$ for some $r\in[0,1)$, and generate potential outcomes ${\boldsymbol{Y}}(3^{(s)})$ and ${\boldsymbol{Y}}(3^{(w)})$ with the same sample size as the original experiment according to the modified model. If $r=0$, then $\text{coef}(s)=\text{coef}(w)$ in the modified model and $H_{0}^{3}$ holds; if $r>0$, $H_{0}^{3}$ fails. Here, we adjust $r$ to align the difference in means of the generated ${\boldsymbol{Y}}(3^{(s)})$ and ${\boldsymbol{Y}}(3^{(w)})$ (denoted as $\tau$) with eight different values, ranging from 0 to 0.27, to simulate scenarios with varying signal strengths.
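The data-generating step can be summarized by the following R sketch, in which the data frame df, the dummy names E3s and E3w, and the exposure-column prefix "E" are hypothetical placeholders for the actual design:

```r
# Sketch of the data-generating mechanism, assuming df stacks the observed
# outcome Y with the exposure dummies (columns prefixed "E"), the covariates
# X, and the missing indicators M; "E3s" and "E3w" are hypothetical names
# for the dummies of exposures 3^(s) and 3^(w).
fit  <- glm(Y ~ ., data = df, family = binomial())  # base model
beta <- coef(fit)

r <- 0.5                                   # in [0, 1); tuned to hit a target tau
delta_sw    <- beta["E3s"] - beta["E3w"]   # Delta(s, w)
beta["E3s"] <- beta["E3w"] + r * delta_sw  # modified model

D   <- model.matrix(fit)                   # design matrix of the fit
D3s <- D3w <- D
D3s[, grep("^E", colnames(D))] <- 0; D3s[, "E3s"] <- 1
D3w[, grep("^E", colnames(D))] <- 0; D3w[, "E3w"] <- 1
Y3s <- rbinom(nrow(D), 1, plogis(D3s %*% beta))  # potential outcomes Y(3^(s))
Y3w <- rbinom(nrow(D), 1, plogis(D3w %*% beta))  # potential outcomes Y(3^(w))
```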

Figure 4: Power of different randomization tests for simulated datasets generated from modified versions of a base model fitted to the real data. The horizontal dashed line indicates the significance level $\alpha=0.05$.

We apply PNRTp, PNRTm, IRTe, and IRTp to these simulated datasets and visualize their type I error rates (for cases with $\tau=0$) and power (for cases with $\tau>0$) in Fig. 4. The results indicate that IRTs effectively control the type I error rate and have high power to detect spillover effects, while PNRTs have almost no power at all.

6 Discussion

In this paper, we proposed the IRT method, which imputes the unknown nuisance parameters multiple times and averages the resulting single-imputation p-values. Theoretically, this method asymptotically bounds the type I error rate at twice the significance level when the missing rate of the nuisance parameters converges to 0 in probability. Simulation results show that even when the missing rate of the nuisance parameters is relatively high, the IRT method maintains good empirical control of the type I error rate. Compared to CRTs and PNRTs, IRT methods significantly improve power and maintain robust performance regardless of the true distribution, particularly when the proportion of imputable outcomes under the null hypothesis is small. Overall, we recommend IRTe, which imputes the unknowns using the empirical distribution of the observed nuisance parameters, as it is easy to implement and free of parametric model specification.

Our IRT framework could be extended in the following directions. First, covariates could be incorporated into IRT to further improve power, including using covariate information and machine learning methods to impute unobserved potential outcomes, and employing covariate-adjusted test statistics (Zhao and Ding, 2024). Second, studentized test statistics could be used to construct asymptotically valid tests for weak null hypotheses (Wu and Ding, 2021; Cohen and Fogarty, 2022). Third, we could construct randomization-based confidence intervals by extending the analytical inversion approach of FRTs to IRTs (Zhu and Liu, 2023; Fiksel, 2024). Finally, the IRT framework could be applied to other complex network experiments and interference structures (Basse and Airoldi, 2018; Leung, 2022b; Airoldi and Christakis, 2024; Liu et al., 2024).

7 Funding

This work was supported by the National Natural Science Foundation of China (12371269 & 12071242).

Supplementary Material

Supplementary Material available online includes proofs and additional simulation results. An R package implementing the proposed methods is available at https://github.com/htx113/imprt.

References

  • Airoldi and Christakis, (2024) Airoldi, E. M. and Christakis, N. A. (2024). Induction of social contagion for diverse outcomes in structured experiments in isolated villages. Science, 384(6695):eadi5147.
  • Aronow, (2012) Aronow, P. M. (2012). A general method for detecting interference between units in randomized experiments. Sociological Methods & Research, 41(1):3–16.
  • Aronow and Samii, (2017) Aronow, P. M. and Samii, C. (2017). Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947.
  • Athey et al., (2018) Athey, S., Eckles, D., and Imbens, G. W. (2018). Exact p-values for network interference. Journal of the American Statistical Association, 113(521):230–240.
  • Barber et al., (2021) Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2021). Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507.
  • Basse et al., (2024) Basse, G., Ding, P., Feller, A., and Toulis, P. (2024). Randomization tests for peer effects in group formation experiments. Econometrica, 92(2):567–590.
  • Basse and Airoldi, (2018) Basse, G. W. and Airoldi, E. M. (2018). Model-assisted design of experiments in the presence of network-correlated outcomes. Biometrika, 105(4):849–858.
  • Basse et al., (2019) Basse, G. W., Feller, A., and Toulis, P. (2019). Randomization tests of causal effects under interference. Biometrika, 106(2):487–494.
  • Breunig et al., (2022) Breunig, C., Liu, R., and Yu, Z. (2022). Double robust Bayesian inference on average treatment effects. arXiv preprint, arXiv:2211.16298.
  • Cai et al., (2015) Cai, J., De Janvry, A., and Sadoulet, E. (2015). Social networks and the decision to insure. American Economic Journal: Applied Economics, 7(2):81–108.
  • Cohen and Fogarty, (2022) Cohen, P. L. and Fogarty, C. B. (2022). Gaussian prepivoting for finite population causal inference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):295–320.
  • Cox, (1958) Cox, D. R. (1958). Planning of Experiments. New York: John Wiley & Sons.
  • Dahl, (2006) Dahl, F. A. (2006). On the conservativeness of posterior predictive p-values. Statistics & Probability Letters, 76(11):1170–1174.
  • Ding and Guo, (2023) Ding, P. and Guo, T. (2023). Posterior predictive propensity scores and p-values. Observational Studies, 9(1):3–18.
  • Espinosa et al., (2016) Espinosa, V., Dasgupta, T., and Rubin, D. B. (2016). A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes. Technometrics, 58(1):62–73.
  • Fiksel, (2024) Fiksel, J. (2024). On exact randomization-based covariate-adjusted confidence intervals. Biometrics, 80(2):ujae051.
  • Fisher, (1935) Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd, Edinburgh, London, 1st edition.
  • Forastiere et al., (2018) Forastiere, L., Mealli, F., and Miratrix, L. (2018). Posterior predictive p-values with Fisher randomization tests in noncompliance settings: Test statistics vs discrepancy measures. Bayesian Analysis, 13(3):681 – 701.
  • Forastiere et al., (2022) Forastiere, L., Mealli, F., Wu, A., and Airoldi, E. M. (2022). Estimating causal effects under network interference with Bayesian generalized propensity scores. Journal of Machine Learning Research, 23(289):1–61.
  • Gelman et al., (1996) Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6:733–760.
  • Guan, (2024) Guan, L. (2024). A conformal test of linear models via permutation-augmented regressions. The Annals of Statistics, in press.
  • Guttman, (1967) Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society: Series B (Methodological), 29(1):83–100.
  • Hoshino and Yanagi, (2023) Hoshino, T. and Yanagi, T. (2023). Randomization test for the specification of interference structure. arXiv preprint, arXiv:2301.05580.
  • Hudgens and Halloran, (2008) Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842.
  • Imai et al., (2021) Imai, K., Jiang, Z., and Malani, A. (2021). Causal inference with interference and noncompliance in two-stage randomized experiments. Journal of the American Statistical Association, 116(534):632–644.
  • Ivanova et al., (2022) Ivanova, A., Lederman, S., Stark, P. B., Sullivan, G., and Vaughn, B. (2022). Randomization tests in clinical trials with multiple imputation for handling missing data. Journal of Biopharmaceutical Statistics, 32(3):441–449.
  • Leavitt, (2023) Leavitt, T. (2023). Randomization-based, Bayesian inference of causal effects. Journal of Causal Inference, 11(1):20220025.
  • Lee et al., (2017) Lee, J. J., Forastiere, L., Miratrix, L., and Pillai, N. S. (2017). More powerful multiple testing in randomized experiments with non-compliance. Statistica Sinica, 27:1319–1345.
  • Lee and Rubin, (2015) Lee, J. J. and Rubin, D. B. (2015). Valid randomization-based p-values for partially post hoc subgroup analyses. Statistics in Medicine, 34(24):3214–3222.
  • Leung, (2022a) Leung, M. P. (2022a). Causal inference under approximate neighborhood interference. Econometrica, 90(1):267–293.
  • Leung, (2022b) Leung, M. P. (2022b). Rate-optimal cluster-randomized designs for spatial interference. The Annals of Statistics, 50(5):3064–3087.
  • Li et al., (2023) Li, F., Ding, P., and Mealli, F. (2023). Bayesian causal inference: a critical review. Philosophical Transactions of the Royal Society A, 381(2247):20220153.
  • Liu and Hudgens, (2014) Liu, L. and Hudgens, M. G. (2014). Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association, 109(505):288–301.
  • Liu et al., (2024) Liu, Y., Zhou, Y., Li, P., and Hu, F. (2024). Cluster-adaptive network A/B testing: From randomization to estimation. Journal of Machine Learning Research, 25(170):1–48.
  • Loh et al., (2020) Loh, W. W., Hudgens, M. G., Clemens, J. D., Ali, M., and Emch, M. E. (2020). Randomization inference with general interference and censoring. Biometrics, 76(1):235–245.
  • Manski, (2013) Manski, C. F. (2013). Identification of treatment response with social interactions. The Econometrics Journal, 16(1):S1–S23.
  • Meng, (1994) Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3):1142–1160.
  • Owusu, (2023) Owusu, J. (2023). Randomization inference of heterogeneous treatment effects under network interference. arXiv preprint, arXiv:2308.00202.
  • Puelz et al., (2022) Puelz, D., Basse, G., Feller, A., and Toulis, P. (2022). A graph-theoretic approach to randomization tests of causal effects under general interference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):174–204.
  • Ray and van der Vaart, (2020) Ray, K. and van der Vaart, A. (2020). Semiparametric Bayesian causal inference. The Annals of Statistics, 48(5):2999–3020.
  • Rosenbaum, (2007) Rosenbaum, P. R. (2007). Interference between units in randomized experiments. Journal of the American Statistical Association, 102(477):191–200.
  • Rubin, (1977) Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359):538–543.
  • Rubin, (1978) Rubin, D. B. (1978). Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. In Proceedings of the Survey Research Methods Section of the American Statistical Association, volume 1, pages 20–34. American Statistical Association Alexandria, VA.
  • Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.
  • Rubin, (1981) Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377–401.
  • Rubin, (1984) Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172.
  • Rubin, (1996) Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489.
  • Rubin, (1998) Rubin, D. B. (1998). More powerful randomization-based p-values in double-blind trials with non-compliance. Statistics in Medicine, 17(3):371–385.
  • Rubin, (2004a) Rubin, D. B. (2004a). The design of a general and flexible system for handling nonresponse in sample surveys. The American Statistician, 58(4):298–302.
  • Rubin, (2004b) Rubin, D. B. (2004b). Multiple Imputation for Nonresponse in Surveys, volume 81. John Wiley & Sons.
  • Shirani and Bayati, (2024) Shirani, S. and Bayati, M. (2024). Causal message-passing for experiments with unknown and general network interference. Proceedings of the National Academy of Sciences, 121(40):e2322232121.
  • Tchetgen and VanderWeele, (2012) Tchetgen, E. J. T. and VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75.
  • Tiwari and Basu, (2024) Tiwari, S. and Basu, P. (2024). Quasi-randomization tests for network interference. arXiv preprint, arXiv:2403.16673.
  • Uschner et al., (2024) Uschner, D., Sverdlov, O., Carter, K., Chipman, J., Kuznetsova, O., Renteria, J., Lane, A., Barker, C., Geller, N., Proschan, M., et al. (2024). Using randomization tests to address disruptions in clinical trials: A report from the NISS Ingram Olkin Forum Series on unplanned clinical trial disruptions. Statistics in Biopharmaceutical Research, 16(4):405–413.
  • van der Vaart, (1994) van der Vaart, A. (1994). Weak convergence of smoothed empirical processes. Scandinavian Journal of Statistics, 21(4):501–504.
  • Vazquez-Bare, (2023) Vazquez-Bare, G. (2023). Identification and estimation of spillover effects in randomized experiments. Journal of Econometrics, 237(1):105237.
  • Vovk et al., (2018) Vovk, V., Nouretdinov, I., Manokhin, V., and Gammerman, A. (2018). Cross-conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 37–51. PMLR.
  • Vovk and Wang, (2020) Vovk, V. and Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4):791–808.
  • Wu and Ding, (2021) Wu, J. and Ding, P. (2021). Randomization tests for weak null hypotheses in randomized experiments. Journal of the American Statistical Association, 116(536):1898–1913.
  • Yanagi and Sei, (2022) Yanagi, M. and Sei, T. (2022). Improving randomization tests under interference based on power analysis. arXiv preprint, arXiv:2203.10469.
  • Yu et al., (2022) Yu, C. L., Airoldi, E. M., Borgs, C., and Chayes, J. T. (2022). Estimating the total treatment effect in randomized experiments with unknown network structure. Proceedings of the National Academy of Sciences, 119(44):e2208975119.
  • Zhang and Zhao, (2024) Zhang, Y. and Zhao, Q. (2024). Multiple conditional randomization tests for lagged and spillover treatment effects. Biometrika, in press.
  • Zhao and Ding, (2024) Zhao, A. and Ding, P. (2024). To adjust or not to adjust? Estimating the average treatment effect in randomized experiments with missing covariates. Journal of the American Statistical Association, 119(545):450–460.
  • Zhong, (2024) Zhong, L. (2024). Unconditional randomization tests for interference. arXiv preprint, arXiv:2409.09243.
  • Zhu and Liu, (2023) Zhu, K. and Liu, H. (2023). Pair-switching rerandomization. Biometrics, 79(3):2127–2142.

Supplementary Material

Section A provides proofs of theoretical results. Section B provides additional simulation results.

Appendix A Proofs

A.1 Proof of Theorem 1

By the law of iterated expectation, we have

{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}\left[h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\right]={\mathbb{E}}_{{\boldsymbol{Z}}^{\rm{obs}}}\left[{\mathbb{E}}_{{\boldsymbol{\theta}}|{\boldsymbol{Z}}^{\rm{obs}}}\left(h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\mid{\boldsymbol{Z}}^{{\rm{obs}}}\right)\right].

By definition, when the oracle imputation distribution is known,

p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=\int p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

For any convex function $h$ on $[0,1]$, Jensen's inequality applied to the imputation distribution $\pi_{o}$ gives

h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\leq\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

Since $h$ is bounded on $[0,1]$, Fubini's theorem implies that

\begin{split}{\mathbb{E}}_{{\boldsymbol{\theta}}|{\boldsymbol{Z}}^{\rm{obs}}}\left(h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\mid{\boldsymbol{Z}}^{{\rm{obs}}}\right)&\leq\int\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}\,g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}\\&=\int\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}\,d{\boldsymbol{\theta}}^{{\rm{imp}}}.\end{split}

By definition,

\begin{split}\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})&=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\cdot\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta^{{\rm{imp}}}_{i}-\theta_{i}^{\rm{obs}})\\&=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\cdot\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta^{{\rm{imp}}}_{i}-\theta_{i}).\end{split}

Hence, conditional on ${\boldsymbol{Z}}^{\rm{obs}}$,

\begin{split}\int\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}&=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\cdot\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\int\delta(\theta^{{\rm{imp}}}_{i}-\theta_{i})g(\theta_{i})d\theta_{i}\\&=\prod_{i=1}^{N}g(\theta_{i}^{{\rm{imp}}})=g({\boldsymbol{\theta}}^{{\rm{imp}}}),\end{split}

and we have

{\mathbb{E}}_{{\boldsymbol{\theta}}|{\boldsymbol{Z}}^{\rm{obs}}}\left(h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\mid{\boldsymbol{Z}}^{{\rm{obs}}}\right)\leq\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))g({\boldsymbol{\theta}}^{{\rm{imp}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

Hence,

\begin{split}{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}\left[h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\right]&={\mathbb{E}}_{{\boldsymbol{Z}}^{\rm{obs}}}\left[{\mathbb{E}}_{{\boldsymbol{\theta}}|{\boldsymbol{Z}}^{\rm{obs}}}\left(h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\mid{\boldsymbol{Z}}^{{\rm{obs}}}\right)\right]\\&\leq\int\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))g({\boldsymbol{\theta}}^{{\rm{imp}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}d{\boldsymbol{F}}_{\mathcal{Z}}({\boldsymbol{Z}}^{{\rm{obs}}})\\&=\int\int h(p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}}))d{\boldsymbol{F}}_{\mathcal{Z}}({\boldsymbol{Z}}^{{\rm{obs}}})g({\boldsymbol{\theta}}^{{\rm{imp}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}\\&=\int{\mathbb{E}}_{U}[h(U)]g({\boldsymbol{\theta}}^{{\rm{imp}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}={\mathbb{E}}_{U}[h(U)],\end{split}

where $U\sim U[0,1]$. Moreover, when $h(U)\equiv U$, we have

{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}\left[h(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}))\right]={\mathbb{E}}_{U}[U]=\frac{1}{2}.

This completes the proof. ∎
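As a quick numerical companion to Theorem 1, the toy R snippet below averages valid p-values, where i.i.d. uniform draws serve as a stand-in assumption for the single-imputation p-values under the oracle imputation, and checks that the average has mean 1/2 and a rejection rate below $2\alpha$:

```r
# Toy numerical companion to Theorem 1: i.i.d. uniform draws stand in for
# valid single-imputation p-values under the oracle imputation distribution.
set.seed(2024)
p_mi <- replicate(1e4, mean(runif(100)))  # averaged p-value over 100 imputations
mean(p_mi)          # approximately 1/2, matching E[p_MI] = 1/2
mean(p_mi <= 0.05)  # rejection rate, within the 2 * 0.05 bound of Corollary 1
```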

A.2 Proof of Corollary 1

We first restate Lemma 1 of Meng (1994) as the following Lemma A1.

Lemma A1.

Let $G(\alpha)$ be the c.d.f. of a random variable $W$ on $[0,1]$. If $W$ is stochastically less variable than $U[0,1]$ but with the same mean, then for all $\alpha\in[0,1]$,

\alpha-\left[\alpha^{2}-2\int_{0}^{\alpha}G(t)dt\right]^{1/2}\leq G(\alpha)\leq\alpha+\left[\alpha^{2}-2\int_{0}^{\alpha}G(t)dt\right]^{1/2}\leq 1. \qquad (A1)

The first or second inequality becomes an equality for all $\alpha$ if and only if $G(\alpha)\equiv\alpha$.

By Theorem 1, the IRTo p-value $p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})$ is stochastically less variable than $U[0,1]$ but with the same mean when ${\boldsymbol{Z}}^{\rm{obs}}\sim{\boldsymbol{F}}_{\mathcal{Z}}$ and $\theta_{1},\ldots,\theta_{N}$ are i.i.d. with known density $g(\cdot)$. Denote the c.d.f. of $p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})$ by $G(\cdot)$. Lemma A1 then gives lower and upper bounds for $G(\alpha)$, and we have

{\mathbb{P}}(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha)=G(\alpha)\leq\alpha+\left[\alpha^{2}-2\int_{0}^{\alpha}G(t)dt\right]^{1/2}\leq 2\alpha,

where the last inequality holds because $G(t)\geq 0$ implies $\alpha^{2}-2\int_{0}^{\alpha}G(t)dt\leq\alpha^{2}$, and ${\mathbb{P}}$ is the probability taken with respect to (w.r.t.) ${\boldsymbol{Z}}^{\rm{obs}}\sim{\boldsymbol{F}}_{\mathcal{Z}}$ and $\theta_{1},\ldots,\theta_{N}$ i.i.d. with density $g(\cdot)$. ∎

A.3 Proof of Lemma 1

Denote the p-value of IRTo by $p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})$. Then, for the p-value $p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})$ of the IRT with imputation function (13), we have

\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})-p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}}),

and

\begin{split}{\mathbb{P}}\left(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha\right)&={\mathbb{P}}\left(p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})+\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha\right)\\&={\mathbb{P}}\left(p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})+\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha,\,|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|>\delta\right)\\&\quad+{\mathbb{P}}\left(p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})+\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha,\,|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|\leq\delta\right)\\&\leq{\mathbb{P}}\left(|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|>\delta\right)+{\mathbb{P}}\left(p_{{\rm{MI}}}^{o}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha+\delta\right)\\&\leq{\mathbb{P}}\left(|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|>\delta\right)+2\alpha+2\delta.\end{split}

Since $|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|=o_{p}(1)$, for any $\delta>0$ we have

\lim_{N\rightarrow\infty}{\mathbb{P}}\left(|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|>\delta\right)=0.

Therefore,

\limsup_{N\rightarrow\infty}{\mathbb{P}}\left(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha\right)\leq 2\alpha+2\delta,\quad\forall\,\delta>0.

Letting $\delta\rightarrow 0$, we obtain

\limsup_{N\rightarrow\infty}{\mathbb{P}}\left(p_{{\rm{MI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})\leq\alpha\right)\leq 2\alpha.

This completes the proof. ∎

A.4 Proof of Theorem 2

By definition, ${\boldsymbol{\theta}}^{{\rm{obs}}}$ is determined by $({\boldsymbol{\theta}},{\boldsymbol{Z}}^{\rm{obs}})$. In the following, we denote by $\tilde{{\mathbb{E}}}$ the expectation with respect to the joint distribution of $({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}},{\boldsymbol{\theta}}^{\rm{imp}})$ for ${\boldsymbol{Z}}^{\rm{obs}}\sim{\boldsymbol{F}}_{\mathcal{Z}}$, $\theta_{i}\sim g(\cdot)$ i.i.d., and $({\boldsymbol{\theta}}^{\rm{imp}}\mid{\boldsymbol{\theta}},{\boldsymbol{Z}}^{\rm{obs}})\sim\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})$.

(1) Continuous distribution: When $g(\cdot)$ and $\hat{g}(\cdot\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})$ correspond to continuous distributions, let

h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{g}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{g(\theta_{i}^{{\rm{imp}}})}-1.

Then

\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=\int p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

The distribution of $\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})$ depends on $({\boldsymbol{Z}}^{{\rm{obs}}},{\boldsymbol{\theta}})$, and by Markov's inequality it suffices to show that

{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|\rightarrow 0,

where the expectation is taken with respect to the joint distribution of $({\boldsymbol{Z}}^{{\rm{obs}}},{\boldsymbol{\theta}})$. Note that

|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|\leq\int|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

Therefore,

\begin{split}{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|&\leq\int\int\int|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}d{\boldsymbol{F}}_{\mathcal{Z}}({\boldsymbol{Z}}^{\rm{obs}})\\&=\tilde{{\mathbb{E}}}|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|,\end{split}

where $\tilde{{\mathbb{E}}}$ is the expectation defined at the beginning of the proof. Therefore, a sufficient condition for $\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=o_{p}(1)$ is

\tilde{{\mathbb{E}}}|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\rightarrow 0. \qquad (A2)

By definition,

\begin{split}\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})&=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\cdot\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta^{{\rm{imp}}}_{i}-\theta_{i}^{\rm{obs}})\\&=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\cdot\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta^{{\rm{imp}}}_{i}-\theta_{i}),\end{split}

and $\tilde{{\boldsymbol{\theta}}}^{\rm{obs}}=\{\theta_{i}^{\rm{obs}}:i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}\}\equiv\{\theta_{i}:i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}\}$. For a given ${\boldsymbol{Z}}^{\rm{obs}}$, the set ${\mathbb{U}}_{a,b}^{\rm{obs}}$ is determined. Denote ${\mathbb{U}}_{a,b}^{\rm{obs}}=\{i_{1},\ldots,i_{N_{a,b}^{\rm{obs}}}\}$ and $({\mathbb{U}}_{a,b}^{\rm{obs}})^{c}=\{i_{N_{a,b}^{\rm{obs}}+1},\ldots,i_{N}\}$, whose union is ${\mathbb{U}}=\{1,\ldots,N\}$. Then the conditional expectation of $|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|$ given ${\boldsymbol{Z}}^{\rm{obs}}$ is

\begin{split}&{{\mathbb{E}}}\left(|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}\right)\\&=\int\int|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}^{{\rm{imp}}}d{\boldsymbol{\theta}}\\&=\int\int\left|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{g}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{g(\theta_{i}^{{\rm{imp}}})}-1\right|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta_{i}^{\rm{imp}}-\theta_{i})\cdot\prod_{i=1}^{N}g(\theta_{i})d{\boldsymbol{\theta}}^{\rm{imp}}d{\boldsymbol{\theta}}\\&=\int\int\left|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{g}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{g(\theta_{i}^{{\rm{imp}}})}-1\right|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i})\,d{\boldsymbol{\theta}}^{\rm{imp}}_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}d{\boldsymbol{\theta}}_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\\&=\int\int\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left(\theta_{i_{k}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}_{i_{1}:i_{N_{a,b}^{\rm{obs}}}}\right)}{g(\theta_{i_{k}}^{{\rm{imp}}})}-1\right|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}g(\theta_{i_{k}}^{{\rm{imp}}})\prod_{j=1}^{N_{a,b}^{\rm{obs}}}g(\theta_{i_{j}})\,d{\boldsymbol{\theta}}^{\rm{imp}}_{i_{N_{a,b}^{\rm{obs}}+1}:i_{N}}d{\boldsymbol{\theta}}_{i_{1}:i_{N_{a,b}^{\rm{obs}}}}.\end{split}

We make the following variable substitution: ${\vartheta}_{k}=\theta_{i_{k}}$ for $k=1,\ldots,N_{a,b}^{\rm{obs}}$, and ${\vartheta}_{k}=\theta_{i_{k}}^{\rm{imp}}$ for $k=N_{a,b}^{\rm{obs}}+1,\ldots,N$. Then we have

{{\mathbb{E}}}\left(|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}\right)=\int\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|\prod_{k=1}^{N}g({\vartheta}_{k})d{\bm{{\vartheta}}}_{1:N},

which is equal to the conditional expectation of

\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|

with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$, given $N_{a,b}^{\rm{obs}}$. Hence,

\begin{split}\tilde{{\mathbb{E}}}|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|&={\mathbb{E}}\left[{{\mathbb{E}}}\left(|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}\right)\right]\\&=\sum_{{\boldsymbol{z}}\in{\mathcal{Z}}}{{\mathbb{E}}}\left(|h({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}={\boldsymbol{z}}\right){\mathbb{P}}_{\boldsymbol{Z}}({\boldsymbol{Z}}={\boldsymbol{z}})\\&=\sum_{{\boldsymbol{z}}\in{\mathcal{Z}}}{{\mathbb{E}}}\left(\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|\mid N_{a,b}^{\rm{obs}}=|{\mathbb{U}}_{a}({\boldsymbol{z}})\cup{\mathbb{U}}_{b}({\boldsymbol{z}})|\right){\mathbb{P}}_{\boldsymbol{Z}}({\boldsymbol{Z}}={\boldsymbol{z}})\\&=\sum_{n}{\mathbb{E}}\left(\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|\mid N_{a,b}^{\rm{obs}}=n\right)\sum_{{\boldsymbol{z}}:|{\mathbb{U}}_{a}({\boldsymbol{z}})\cup{\mathbb{U}}_{b}({\boldsymbol{z}})|=n}{\mathbb{P}}_{\boldsymbol{Z}}({\boldsymbol{Z}}={\boldsymbol{z}})\\&=\sum_{n}{\mathbb{E}}\left(\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|\mid N_{a,b}^{\rm{obs}}=n\right){\mathbb{P}}_{N}(N_{a,b}^{\rm{obs}}=n)\\&={\mathbb{E}}\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|,\end{split}

where ${\mathbb{P}}_{N}(N_{a,b}^{\rm{obs}}=n)=\sum_{{\boldsymbol{z}}:|{\mathbb{U}}_{a}({\boldsymbol{z}})\cup{\mathbb{U}}_{b}({\boldsymbol{z}})|=n}{\mathbb{P}}_{\boldsymbol{Z}}({\boldsymbol{Z}}={\boldsymbol{z}})$ is the p.m.f. of ${\boldsymbol{F}}_{N}$, and the last expectation is with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$ and $N_{a,b}^{\rm{obs}}\sim{\boldsymbol{F}}_{N}$. Thus, sufficient condition (A2) is equivalent to:

{\mathbb{E}}\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{g}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:N_{a,b}^{\rm{obs}}}\right)}{g({\vartheta}_{k})}-1\right|\rightarrow 0, \qquad (A3)

where the expectation is taken with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$ and $N_{a,b}^{\rm{obs}}\sim{\boldsymbol{F}}_{N}$.

(2) Discrete distribution: When $g(\cdot)$ and $\hat{g}(\cdot\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})$ correspond to discrete distributions, we denote their corresponding probability mass functions by $p(\cdot)$ and $\hat{p}(\cdot\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})$, i.e.,

g(\theta_{i}^{\rm{imp}})=\sum_{\psi\in\Psi}p(\psi)\delta(\theta_{i}^{\rm{imp}}-\psi),\quad\hat{g}(\theta_{i}^{\rm{imp}}\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})=\sum_{\psi\in\Psi}\hat{p}(\psi\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})\delta(\theta_{i}^{\rm{imp}}-\psi),

where $\Psi$ is the set of all $\psi$ such that $p(\psi)>0$. Then, for any function $\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})$,

\begin{split}&\int\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{\rm{imp}}\\&=\int\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\cdot\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\sum_{\psi\in\Psi}p(\psi)\delta(\theta_{i}^{\rm{imp}}-\psi)\cdot\prod_{i\in{\mathbb{U}}_{a,b}^{\rm{obs}}}\delta(\theta_{i}^{\rm{imp}}-\theta_{i}^{\rm{obs}})d{\boldsymbol{\theta}}^{\rm{imp}}\\&=\int\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\cdot\sum_{\psi_{i}\in\Psi\text{ for }i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}p(\psi_{i})\delta(\theta_{i}^{\rm{imp}}-\psi_{i})\cdot\prod_{i\in{\mathbb{U}}_{a,b}^{\rm{obs}}}\delta(\theta_{i}^{\rm{imp}}-\theta_{i}^{\rm{obs}})d{\boldsymbol{\theta}}^{\rm{imp}}\\&=\sum_{\psi_{i}\in\Psi\text{ for }i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\int\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}p(\psi_{i})\cdot\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\delta(\theta_{i}^{\rm{imp}}-\psi_{i})\cdot\prod_{i\in{\mathbb{U}}_{a,b}^{\rm{obs}}}\delta(\theta_{i}^{\rm{imp}}-\theta_{i}^{\rm{obs}})d{\boldsymbol{\theta}}^{\rm{imp}}\\&=\sum_{{\boldsymbol{\theta}}^{\rm{imp}}\in\Theta^{\rm{obs}}}\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}p(\theta_{i}^{\rm{imp}}),\end{split}

where $\Theta^{\rm{obs}}=\{{\boldsymbol{\theta}}^{\rm{imp}}:\theta_{i}^{\rm{imp}}=\theta_{i}^{\rm{obs}}\text{ for }i\in{\mathbb{U}}_{a,b}^{\rm{obs}},\text{ and }\theta_{i}^{\rm{imp}}\in\Psi\text{ for }i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}\}$. Similarly, we have

\int\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\hat{\pi}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{\rm{imp}}=\sum_{{\boldsymbol{\theta}}^{\rm{imp}}\in\Theta^{\rm{obs}}}\nu({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{\rm{obs}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\hat{p}(\theta_{i}^{\rm{imp}}\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}}).

Hence,

\begin{split}\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})&=\int p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})\left[\hat{\pi}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})-\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})\right]d{\boldsymbol{\theta}}^{{\rm{imp}}}\\&=\sum_{{\boldsymbol{\theta}}^{\rm{imp}}\in\Theta^{\rm{obs}}}p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\hat{p}(\theta_{i}^{\rm{imp}}\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})-\sum_{{\boldsymbol{\theta}}^{\rm{imp}}\in\Theta^{\rm{obs}}}p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}p(\theta_{i}^{\rm{imp}})\\&=\sum_{{\boldsymbol{\theta}}^{\rm{imp}}\in\Theta^{\rm{obs}}}p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}p(\theta_{i}^{\rm{imp}})\left(\prod_{i\notin{\mathbb{U}}_{a,b}^{\rm{obs}}}\frac{\hat{p}(\theta_{i}^{\rm{imp}}\mid\tilde{\boldsymbol{\theta}}^{\rm{obs}})}{p(\theta_{i}^{\rm{imp}})}-1\right),\end{split}

where $\Theta^{\rm{obs}}$ is defined as above. Let

h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})=\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{p}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{p(\theta_{i}^{{\rm{imp}}})}-1;

then

\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=\int p_{{\rm{SI}}}({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{imp}}})h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}. \qquad (A4)

As in the continuous case, the target is

{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|\rightarrow 0,

where the expectation is taken with respect to the joint distribution of $({\boldsymbol{Z}}^{{\rm{obs}}},{\boldsymbol{\theta}})$, and

|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|\leq\int|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}.

Therefore,

\begin{split}{\mathbb{E}}_{({\boldsymbol{Z}}^{\rm{obs}},{\boldsymbol{\theta}})}|\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})|&\leq\int\int\int|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})d{\boldsymbol{\theta}}^{{\rm{imp}}}g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}d{\boldsymbol{F}}_{\mathcal{Z}}({\boldsymbol{Z}}^{\rm{obs}})\\&=\tilde{{\mathbb{E}}}|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|,\end{split}

and the following is a sufficient condition for $\Delta({\boldsymbol{Z}}^{{\rm{obs}}};{\boldsymbol{\theta}}^{{\rm{obs}}})=o_{p}(1)$:

\tilde{{\mathbb{E}}}|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\rightarrow 0. \qquad (A5)

Similarly, the conditional expectation of $|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|$ given ${\boldsymbol{Z}}^{\rm{obs}}$ is

\begin{split}&{{\mathbb{E}}}\left(|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}\right)\\&=\int\int|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\pi_{o}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})g({\boldsymbol{\theta}})d{\boldsymbol{\theta}}^{{\rm{imp}}}d{\boldsymbol{\theta}}\\&=\int\int\left|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{p}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{p(\theta_{i}^{{\rm{imp}}})}-1\right|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\delta(\theta_{i}^{\rm{imp}}-\theta_{i})\cdot\prod_{i=1}^{N}g(\theta_{i})d{\boldsymbol{\theta}}^{\rm{imp}}d{\boldsymbol{\theta}}\\&=\int\int\left|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\frac{\hat{p}(\theta_{i}^{{\rm{imp}}}\mid\tilde{\boldsymbol{\theta}}^{{\rm{obs}}})}{p(\theta_{i}^{{\rm{imp}}})}-1\right|\prod_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i}^{{\rm{imp}}})\prod_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}g(\theta_{i})\,d{\boldsymbol{\theta}}^{\rm{imp}}_{i\notin{\mathbb{U}}^{{\rm{obs}}}_{a,b}}d{\boldsymbol{\theta}}_{i\in{\mathbb{U}}^{{\rm{obs}}}_{a,b}}\\&=\int\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{p}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:{N_{a,b}^{\rm{obs}}}}\right)}{p({\vartheta}_{k})}-1\right|\prod_{k=1}^{N}g({\vartheta}_{k})d{\bm{{\vartheta}}}_{1:N},\end{split}

which is equal to the conditional expectation of

\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{p}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:{N_{a,b}^{\rm{obs}}}}\right)}{p({\vartheta}_{k})}-1\right|

with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$, given $N_{a,b}^{\rm{obs}}$. Hence, we can similarly show that

\begin{split}\tilde{{\mathbb{E}}}|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|&={\mathbb{E}}\left[{{\mathbb{E}}}\left(|h_{d}({\boldsymbol{\theta}}^{{\rm{imp}}}\mid{\boldsymbol{\theta}}^{{\rm{obs}}})|\mid{\boldsymbol{Z}}^{\rm{obs}}\right)\right]\\&={\mathbb{E}}\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{p}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:{N_{a,b}^{\rm{obs}}}}\right)}{p({\vartheta}_{k})}-1\right|,\end{split}

where the last expectation is with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$ and $N_{a,b}^{\rm{obs}}\sim{\boldsymbol{F}}_{N}$. Thus, sufficient condition (A5) is equivalent to:

{\mathbb{E}}\left|\prod_{k=N_{a,b}^{\rm{obs}}+1}^{N}\frac{\hat{p}\left({\vartheta}_{k}\mid{\bm{{\vartheta}}}_{1:{N_{a,b}^{\rm{obs}}}}\right)}{p({\vartheta}_{k})}-1\right|\rightarrow 0, \qquad (A6)

where the expectation is taken with respect to ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. with probability density $g(\cdot)$ and $N_{a,b}^{\rm{obs}}\sim{\boldsymbol{F}}_{N}$. This completes the proof. ∎

A.5 Proof of Corollary 2

Denote $N_{1}=N_{a,b}^{\rm{obs}}=N(1-r_{N}^{\rm{mis}})$ and $N_{0}=N-N_{a,b}^{\rm{obs}}=N\cdot r_{N}^{\rm{mis}}$. Suppose ${\vartheta}_{1},\ldots,{\vartheta}_{N_{1}}\sim{\mathcal{N}}(\mu,\sigma^{2})$ i.i.d. with $\sigma^{2}$ known, with the conjugate prior $\mu\sim{\mathcal{N}}(\mu_{0},\sigma_{0}^{2})$. Let $g$ be the p.d.f. of ${\mathcal{N}}(\mu,\sigma^{2})$. For ${\bm{{\vartheta}}}_{1:N_{1}}=({\vartheta}_{1},\ldots,{\vartheta}_{N_{1}})$, define the average $\bar{{\bm{{\vartheta}}}}_{1:N_{1}}=N_{1}^{-1}\sum_{i=1}^{N_{1}}{\vartheta}_{i}$. Then the posterior distribution of $\mu$ is $\mu\mid{\bm{{\vartheta}}}_{1:N_{1}}\sim{\mathcal{N}}\left(\mu_{N_{1}},\sigma_{N_{1}}^{2}\right)$, where

\sigma_{N_{1}}^{2}=\frac{1}{\frac{1}{\sigma_{0}^{2}}+\frac{N_{1}}{\sigma^{2}}},\quad\mu_{N_{1}}=\frac{\frac{\mu_{0}}{\sigma_{0}^{2}}+\frac{N_{1}\bar{{\bm{{\vartheta}}}}_{1:N_{1}}}{\sigma^{2}}}{\frac{1}{\sigma_{0}^{2}}+\frac{N_{1}}{\sigma^{2}}}=\bar{{\bm{{\vartheta}}}}_{1:N_{1}}-\frac{\frac{\bar{{\bm{{\vartheta}}}}_{1:N_{1}}-\mu_{0}}{\sigma_{0}^{2}}}{\frac{1}{\sigma_{0}^{2}}+\frac{N_{1}}{\sigma^{2}}},

and the posterior predictive distribution of each unobserved ${\vartheta}$ is

{\vartheta}\mid{\bm{{\vartheta}}}_{1:N_{1}}\sim{\mathcal{N}}\left(\mu_{N_{1}},\sigma_{N_{1}}^{2}+\sigma^{2}\right).
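For reference, this conjugate-normal posterior predictive is straightforward to compute numerically; a minimal R helper, with illustrative argument names, is:

```r
# Conjugate-normal posterior predictive N(mu_N1, sigma_N1^2 + sigma^2)
# matching the displays above; argument names are illustrative only.
post_pred <- function(theta_obs, mu0, sigma0_sq, sigma_sq) {
  n1       <- length(theta_obs)
  sigma_n1 <- 1 / (1 / sigma0_sq + n1 / sigma_sq)  # posterior variance of mu
  mu_n1    <- sigma_n1 * (mu0 / sigma0_sq + sum(theta_obs) / sigma_sq)
  c(mean = mu_n1, var = sigma_n1 + sigma_sq)
}
```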

Hence, for $i=N_{1}+1,\ldots,N$,

\hat{g}({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}})=\frac{1}{\sqrt{2\pi(\sigma_{N_{1}}^{2}+\sigma^{2})}}\exp\left\{-\frac{({\vartheta}_{i}-\mu_{N_{1}})^{2}}{2(\sigma_{N_{1}}^{2}+\sigma^{2})}\right\},

and

\frac{\hat{g}({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}})}{g({\vartheta}_{i})}=\left(\frac{\sigma^{2}}{\sigma_{N_{1}}^{2}+\sigma^{2}}\right)^{1/2}\exp\left\{\frac{({\vartheta}_{i}-\mu)^{2}}{2\sigma^{2}}-\frac{({\vartheta}_{i}-\mu_{N_{1}})^{2}}{2(\sigma_{N_{1}}^{2}+\sigma^{2})}\right\}.

Write $R_{N}=\prod_{i=N_{1}+1}^{N}\hat{g}({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}})/g({\vartheta}_{i})$. Since each factor integrates to one given ${\bm{{\vartheta}}}_{1:N_{1}}$, we have ${\mathbb{E}}(R_{N})=1$, and hence ${\mathbb{E}}|R_{N}-1|\leq\{{\mathbb{E}}(R_{N}-1)^{2}\}^{1/2}=\{{\mathbb{E}}(R_{N}^{2})-1\}^{1/2}$. It therefore suffices to show that

{\mathbb{E}}(R_{N}^{2})={\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}\rightarrow 1\text{ as }N\rightarrow\infty.

Given $\bar{\bm{{\vartheta}}}_{1:N_{1}}$, the ratios $\hat{g}\left({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}}\right)/g({\vartheta}_{i})$ are i.i.d. for $i=N_{1}+1,\ldots,N$. Since ${\boldsymbol{F}}_{N}$ is a point mass, $N_{1}$ and $N_{0}$ are deterministic for each $N$, and

{\mathbb{E}}\left[\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}\mid\bar{\bm{{\vartheta}}}_{1:N_{1}}\right]=\left({\mathbb{E}}\left[\left(\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{{\vartheta}}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}\mid\bar{\bm{{\vartheta}}}_{1:N_{1}}\right]\right)^{N_{0}},

where

\begin{split}
{\mathbb{E}}\left[\left(\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}\mid\bar{\bm{\vartheta}}_{1:N_{1}}\right]
&=\int\frac{\hat{g}^{2}\left({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta})}\,d{\vartheta}\\
&=\frac{\sigma}{\sqrt{2\pi}(\sigma_{N_{1}}^{2}+\sigma^{2})}\int\exp\left\{\frac{({\vartheta}-\mu)^{2}}{2\sigma^{2}}-\frac{({\vartheta}-\mu_{N_{1}})^{2}}{\sigma_{N_{1}}^{2}+\sigma^{2}}\right\}d{\vartheta}\\
&=\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{1/2}\exp\left\{\frac{(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}\right\}.
\end{split}

Hence,

\begin{split}
{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}
&=\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}{\mathbb{E}}_{\bar{\bm{\vartheta}}_{1:N_{1}}}\left[\exp\left\{\frac{N_{0}(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}\right\}\right]\\
&=\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}\frac{\sqrt{N_{1}}}{\sqrt{2\pi}\sigma}\int\exp\left\{\frac{N_{0}(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}-\frac{N_{1}(\bar{\bm{\vartheta}}_{1:N_{1}}-\mu)^{2}}{2\sigma^{2}}\right\}d\bar{\bm{\vartheta}}_{1:N_{1}}.
\end{split}

Since $r_{N}^{\rm{mis}}\rightarrow 0$ implies $N_{0}/N_{1}\rightarrow 0$ and $N_{1}\rightarrow\infty$, there exists $\tilde{N}>1$ such that for any $N>\tilde{N}$, we have $N_{0}/N_{1}\leq 1/8$, $\sigma_{N_{1}}^{2}/\sigma^{2}\leq 1/2$, and $\sigma_{N_{1}}^{2}/\sigma^{2}_{0}\leq 1/2$. Let $a_{N}=\sigma^{2}_{N_{1}}/\sigma^{2}_{0}$; then

\mu_{N_{1}}-\mu=(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)(1-a_{N})+(\mu_{0}-\mu)a_{N},

and

\begin{split}
&\frac{N_{0}(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}-\frac{N_{1}(\bar{\bm{\vartheta}}_{1:N_{1}}-\mu)^{2}}{2\sigma^{2}}\\
&=-\frac{N_{1}}{2\sigma^{2}}\left[1-\frac{2N_{0}\sigma^{2}(1-a_{N})^{2}}{N_{1}(\sigma^{2}-\sigma^{2}_{N_{1}})}\right](\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)^{2}+\frac{2N_{0}a_{N}(1-a_{N})(\mu_{0}-\mu)}{\sigma^{2}-\sigma^{2}_{N_{1}}}(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)+\frac{N_{0}(\mu_{0}-\mu)^{2}a_{N}^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}.
\end{split}

Let

\begin{split}
b_{N}&=1-\frac{2N_{0}\sigma^{2}(1-a_{N})^{2}}{N_{1}(\sigma^{2}-\sigma^{2}_{N_{1}})},\\
c_{N}&=\frac{2\sigma^{2}N_{0}a_{N}(1-a_{N})(\mu_{0}-\mu)}{N_{1}(\sigma^{2}-\sigma^{2}_{N_{1}})},\\
d_{N}&=\frac{N_{0}(\mu_{0}-\mu)^{2}a_{N}^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}},
\end{split}

then for N>N~N>\tilde{N}, we have 0aN1/20\leq a_{N}\leq 1/2, 1/2bN11/2\leq b_{N}\leq 1, and

\begin{split}
&\frac{N_{0}(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}-\frac{N_{1}(\bar{\bm{\vartheta}}_{1:N_{1}}-\mu)^{2}}{2\sigma^{2}}\\
&=-\frac{N_{1}}{2\sigma^{2}}\left[b_{N}(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)^{2}-2c_{N}(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)\right]+d_{N}\\
&=-\frac{N_{1}b_{N}}{2\sigma^{2}}\left[(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)^{2}-\frac{2c_{N}}{b_{N}}(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)\right]+d_{N}\\
&=-\frac{N_{1}b_{N}}{2\sigma^{2}}\left[(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu)-\frac{c_{N}}{b_{N}}\right]^{2}+d_{N}+\frac{N_{1}c_{N}^{2}}{2b_{N}\sigma^{2}}.
\end{split}

Hence, for N>N~N>\tilde{N},

\frac{\sqrt{N_{1}}}{\sqrt{2\pi}\sigma}\int\exp\left\{\frac{N_{0}(\mu_{N_{1}}-\mu)^{2}}{\sigma^{2}-\sigma^{2}_{N_{1}}}-\frac{N_{1}(\bar{\bm{\vartheta}}_{1:N_{1}}-\mu)^{2}}{2\sigma^{2}}\right\}d\bar{\bm{\vartheta}}_{1:N_{1}}=\frac{1}{\sqrt{b_{N}}}\exp\left\{d_{N}+\frac{N_{1}c_{N}^{2}}{2b_{N}\sigma^{2}}\right\},

and

{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}=\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}\frac{1}{\sqrt{b_{N}}}\exp\left\{d_{N}+\frac{N_{1}c_{N}^{2}}{2b_{N}\sigma^{2}}\right\}.

Since $\sigma^{2}_{N_{1}}/\sigma^{2}=1/(N_{1}+\sigma^{2}/\sigma_{0}^{2})\leq 1/N_{1}$, we have

\begin{split}
1\leq\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}
&=\left(1+\frac{\sigma^{4}_{N_{1}}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}=\left(1+\frac{\sigma^{4}_{N_{1}}/\sigma^{4}}{1-\sigma_{N_{1}}^{4}/\sigma^{4}}\right)^{N_{0}/2}\\
&\leq\left(1+\frac{1/N_{1}^{2}}{1-1/N_{1}^{2}}\right)^{N_{0}/2}\leq\left(1+\frac{2}{N_{1}^{2}}\right)^{N_{0}/2}\\
&=\left[\left(1+\frac{2}{N_{1}^{2}}\right)^{N_{1}^{2}/2}\right]^{N_{0}/N_{1}^{2}}\leq e^{N_{0}/N_{1}^{2}}\rightarrow 1,
\end{split}

i.e.,

\lim_{N\rightarrow\infty}\left(\frac{\sigma^{4}}{\sigma^{4}-\sigma_{N_{1}}^{4}}\right)^{N_{0}/2}=1.

Since $\lim_{N\rightarrow\infty}b_{N}=1$,

\begin{split}
\lim_{N\rightarrow\infty}N_{0}a_{N}^{2}&=\lim_{N\rightarrow\infty}\frac{N_{0}\sigma^{4}}{(N_{1}\sigma_{0}^{2}+\sigma^{2})^{2}}=0,\\
\lim_{N\rightarrow\infty}N_{1}c_{N}^{2}&=\lim_{N\rightarrow\infty}\frac{4\sigma^{4}}{(\sigma^{2}-\sigma_{N_{1}}^{2})^{2}}\cdot\frac{N_{0}}{N_{1}}\cdot N_{0}a_{N}^{2}(1-a_{N})^{2}=0,
\end{split}

we have $\lim_{N\rightarrow\infty}d_{N}=0$ and

\lim_{N\rightarrow\infty}{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}=1.

Moreover,

\begin{split}
{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)
&=\int\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\cdot\prod_{i=1}^{N}g({\vartheta}_{i})\,d{\bm{\vartheta}}\\
&=\int\prod_{i=N_{1}+1}^{N}\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)\cdot\prod_{i=1}^{N_{1}}g({\vartheta}_{i})\,d{\bm{\vartheta}}\equiv 1.
\end{split}

Therefore,

\lim_{N\rightarrow\infty}{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}-1\right)^{2}=\lim_{N\rightarrow\infty}{\mathbb{E}}\left(\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}\right)^{2}-1=0,

and thus, by the Cauchy-Schwarz inequality,

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}\left({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}}\right)}{g({\vartheta}_{i})}-1\right|=o(1).

This completes the proof. ∎
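Before turning to the numerical verifications, we note that the closed-form second moment used in the proof above is easy to sanity-check numerically. The following Python sketch is our own illustration; all constants ($\mu$, $\sigma$, $\mu_{0}$, $\sigma_{0}$, $N_{1}$, and the value of $\bar{{\bm{\vartheta}}}_{1:N_{1}}$) are hypothetical choices, not values from the paper. It compares a quadrature evaluation of $\int\hat{g}^{2}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})/g({\vartheta})\,d{\vartheta}$ against the closed form.

import numpy as np
from scipy.integrate import quad

# Hypothetical constants, for illustration only.
mu, sigma = 0.3, 1.2        # true mean and standard deviation
mu0, sigma0 = 0.0, 2.0      # prior hyperparameters
N1, theta_bar = 50, 0.5     # number of observed outcomes and their average

# Posterior and posterior predictive parameters, as in the proof.
sigma_N1_sq = 1.0 / (1.0 / sigma0**2 + N1 / sigma**2)
mu_N1 = (mu0 / sigma0**2 + N1 * theta_bar / sigma**2) * sigma_N1_sq
s_sq = sigma_N1_sq + sigma**2   # predictive variance

def g(x):      # true density N(mu, sigma^2)
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi) / sigma

def ghat(x):   # posterior predictive density N(mu_N1, s_sq)
    return np.exp(-(x - mu_N1)**2 / (2 * s_sq)) / np.sqrt(2 * np.pi * s_sq)

lhs, _ = quad(lambda x: ghat(x)**2 / g(x), -np.inf, np.inf)
rhs = np.sqrt(sigma**4 / (sigma**4 - sigma_N1_sq**2)) * \
      np.exp((mu_N1 - mu)**2 / (sigma**2 - sigma_N1_sq))
print(lhs, rhs)   # the two values agree up to quadrature error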

A.6 Numerical verification of sufficient conditions in Theorem 2

For complex cases, theoretically verifying the sufficient conditions outlined in Theorem 2 is challenging. Therefore, we provide numerical verifications for two distributions used in the simulations and real-data analyses. Hereafter, we denote $N_{1}=N_{a,b}^{\rm{obs}}=N(1-r_{N}^{\rm{mis}})$ and $N_{0}=N-N_{a,b}^{\rm{obs}}=Nr_{N}^{\rm{mis}}$. We assume that ${\boldsymbol{F}}_{N}$ is a point mass distribution, so that $N_{1}$ and $r_{N}^{\rm{mis}}$ are deterministic.

1. Normal distribution with both mean and variance unknown. Consider the case where ${\vartheta}_{1},\ldots,{\vartheta}_{N}\sim{\mathcal{N}}(\mu,\sigma^{2})$ i.i.d. with $\mu$ and $\sigma^{2}$ unknown. The probability density is $g$.

1.1. IRTp. We use the conjugate prior $\sigma^{2}\sim\text{Inverse-Gamma}(\alpha_{0},\beta_{0})$ and $\mu\mid\sigma^{2}\sim{\mathcal{N}}(\mu_{0},\sigma^{2}/\kappa_{0})$. Then the posterior distribution is

\begin{split}
\sigma^{2}\mid{\bm{\vartheta}}_{1:N_{1}}&\sim\text{Inverse-Gamma}\left(\alpha_{N_{1}},\beta_{N_{1}}\right),\\
\mu\mid\sigma^{2},{\bm{\vartheta}}_{1:N_{1}}&\sim{\mathcal{N}}\left(\mu_{N_{1}},\sigma^{2}/\kappa_{N_{1}}\right),
\end{split}

where

\alpha_{N_{1}}=\alpha_{0}+\frac{N_{1}}{2},\quad \beta_{N_{1}}=\beta_{0}+\frac{1}{2}\left[(N_{1}-1)S^{2}+\frac{N_{1}\kappa_{0}}{\kappa_{N_{1}}}(\bar{{\bm{\vartheta}}}_{1:N_{1}}-\mu_{0})^{2}\right],\quad \mu_{N_{1}}=\frac{\kappa_{0}\mu_{0}+N_{1}\bar{{\bm{\vartheta}}}_{1:N_{1}}}{\kappa_{N_{1}}},

and $S^{2}=(N_{1}-1)^{-1}\sum_{i=1}^{N_{1}}({\vartheta}_{i}-\bar{{\bm{\vartheta}}}_{1:N_{1}})^{2}$, $\kappa_{N_{1}}=\kappa_{0}+N_{1}$. The posterior predictive distribution is then a location-scale t-distribution:

{\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}}\sim t_{2\alpha_{N_{1}}}\left(\mu_{N_{1}},\ \frac{\beta_{N_{1}}}{\alpha_{N_{1}}}\left(1+\frac{1}{\kappa_{N_{1}}}\right)\right).

Let $\sigma^{2}_{N_{1}}=\frac{\beta_{N_{1}}}{\alpha_{N_{1}}}\left(1+\frac{1}{\kappa_{N_{1}}}\right)$; then

\hat{g}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})=\frac{\Gamma(\alpha_{N_{1}}+1/2)}{\Gamma(\alpha_{N_{1}})\sqrt{2\alpha_{N_{1}}\pi\sigma^{2}_{N_{1}}}}\left(1+\frac{({\vartheta}-\mu_{N_{1}})^{2}}{2\alpha_{N_{1}}\sigma^{2}_{N_{1}}}\right)^{-\frac{2\alpha_{N_{1}}+1}{2}}.

In the following, we verify that

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|=o(1)

holds for $r_{N}^{\rm{mis}}=O(N^{-1/2})$ through simulations. For $N$ ranging from 100 to 100,000, let $r_{N}^{\rm{mis}}\propto N^{-\text{rate}}$ for $\text{rate}\in\{0.5,0.6,0.7\}$. We generate ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. from the true distribution ${\mathcal{N}}(0,1)$, set $\alpha_{0}=\beta_{0}=\kappa_{0}=1$ and $\mu_{0}=0$, and compute

\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|.

We repeat the above procedure 1000 times; the average of the resulting 1000 values approximates

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|.
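A minimal Python sketch of this Monte Carlo check follows. It is our own illustration: the function name is hypothetical, and the proportionality constant in $r_{N}^{\rm{mis}}\propto N^{-\text{rate}}$ is taken as 1, which the paper does not specify.

import numpy as np
from scipy.stats import norm, t

def check_irtp(N, rate, n_rep=1000, seed=0):
    """Approximate E| prod ghat/g - 1 | for r_N^mis = N**(-rate)."""
    rng = np.random.default_rng(seed)
    alpha0 = beta0 = kappa0 = 1.0
    mu0 = 0.0
    N1 = int(round(N * (1 - N ** (-rate))))   # observed outcomes; N0 = N - N1 missing
    vals = np.empty(n_rep)
    for rep in range(n_rep):
        theta = rng.standard_normal(N)        # true g is N(0, 1)
        obs, mis = theta[:N1], theta[N1:]
        tbar, S2 = obs.mean(), obs.var(ddof=1)
        # Posterior hyperparameters, as displayed above.
        kappaN = kappa0 + N1
        muN = (kappa0 * mu0 + N1 * tbar) / kappaN
        alphaN = alpha0 + N1 / 2
        betaN = beta0 + 0.5 * ((N1 - 1) * S2
                               + N1 * kappa0 / kappaN * (tbar - mu0) ** 2)
        scale = np.sqrt(betaN / alphaN * (1 + 1 / kappaN))
        # log of prod_{i > N1} ghat(theta_i | theta_{1:N1}) / g(theta_i)
        log_ratio = (t.logpdf(mis, df=2 * alphaN, loc=muN, scale=scale)
                     - norm.logpdf(mis)).sum()
        vals[rep] = abs(np.exp(log_ratio) - 1)
    return vals.mean()

print(check_irtp(N=1000, rate=0.5))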

Values of the approximation are shown in Fig. A1. The results indicate that the faster $r_{N}^{\rm{mis}}$ converges to 0, the faster ${\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})/g({\vartheta}_{i})-1\right|$ converges to 0.

Figure A1: Simulation results for IRTp in the case of the normal distribution with both mean and variance unknown.

1.2. IRTs. When $g(\cdot)$ is known to be continuous, let

\hat{g}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})=\frac{1}{N_{1}h_{N}}\sum_{j=1}^{N_{1}}K\left(\frac{{\vartheta}-{\vartheta}_{j}}{h_{N}}\right), \quad (A7)

where the kernel $K$ satisfies $\int K(u)\,du=1$, $\int u^{2}K(u)\,du<\infty$, and $K(u)=K(-u)$. IRT with imputation function (A7) can be seen as a smoothed version of IRTe, and we denote this method by IRTs. We take $K$ to be a Gaussian kernel and let $g$ be the normal density above. We choose $h_{N}\propto N_{1}^{-1/5}$ via a plug-in method and verify that

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|=o(1)

holds for $r_{N}^{\rm{mis}}=O(N^{-1/2})$ through simulations. For $N$ ranging from 100 to 100,000, let $r_{N}^{\rm{mis}}\propto N^{-\text{rate}}$ for $\text{rate}\in\{0.5,0.6,0.7\}$. We generate ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. from the true distribution ${\mathcal{N}}(0,1)$ and compute

\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|.

We repeat the above procedure 1000 times; the average of the resulting 1000 values approximates

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{g({\vartheta}_{i})}-1\right|.
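For IRTs, the same Monte Carlo scheme applies with $\hat{g}$ replaced by the kernel estimate (A7). A minimal sketch of one replicate (our own illustration; we rely on scipy's gaussian_kde, whose default Scott bandwidth scales as $N_{1}^{-1/5}$, consistent with the bandwidth choice above, and the split is an arbitrary example):

import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
theta = rng.standard_normal(1000)             # true g is N(0, 1)
N1 = 968                                      # illustrative split; N0 = 32
obs, mis = theta[:N1], theta[N1:]
ghat = gaussian_kde(obs)                      # kernel estimate (A7)
ratio = np.prod(ghat(mis) / norm.pdf(mis))    # product over the missing indices
print(abs(ratio - 1))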

Values of the approximation are shown in Fig. A2. The results indicate that the faster $r_{N}^{\rm{mis}}$ converges to 0, the faster ${\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\hat{g}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})/g({\vartheta}_{i})-1\right|$ converges to 0.

Figure A2: Simulation results for IRTs in the case of the normal distribution with both mean and variance unknown.

2. Binomial distribution. Consider the case where ${\vartheta}_{1},\ldots,{\vartheta}_{N}\sim\text{Binomial}(m,q)$ i.i.d. with $m$ known and $q$ unknown. The corresponding probability mass function is $p(\cdot)$.

2.1. IRTp. We use the prior $q\sim\text{Beta}(\alpha,\beta)$ with $\alpha,\beta>0$. Then the posterior distribution is $q\mid{\bm{\vartheta}}_{1:N_{1}}\sim\text{Beta}(\alpha_{N_{1}},\beta_{N_{1}})$, where

\alpha_{N_{1}}=\alpha+N_{1}\bar{{\bm{\vartheta}}}_{1:N_{1}},\quad \beta_{N_{1}}=\beta+N_{1}(m-\bar{{\bm{\vartheta}}}_{1:N_{1}}).

Then the posterior predictive distribution is

{\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}}\sim\text{Beta-Binomial}(m,\alpha_{N_{1}},\beta_{N_{1}}).
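For reference (a standard fact about the Beta-Binomial distribution), the corresponding probability mass function is

\Pr({\vartheta}=k\mid{\bm{\vartheta}}_{1:N_{1}})=\binom{m}{k}\,\frac{B(k+\alpha_{N_{1}},\,m-k+\beta_{N_{1}})}{B(\alpha_{N_{1}},\beta_{N_{1}})},\quad k=0,1,\ldots,m,

where $B(\cdot,\cdot)$ denotes the beta function.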

Denote the p.m.f. of this posterior predictive distribution by $\hat{p}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})$. In the following, we assume that the sequence $r_{N}^{\rm{mis}}$ is deterministic under $N_{1}\sim{\boldsymbol{F}}_{N}$ and verify that

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|=o(1)

holds for $r_{N}^{\rm{mis}}=O(N^{-1/2})$ through simulations. For $N$ ranging from 100 to 100,000, let $r_{N}^{\rm{mis}}\propto N^{-\text{rate}}$ for $\text{rate}\in\{0.5,0.6,0.7\}$. We generate ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. from the true distribution $\text{Binomial}(100,0.1)$, set $\alpha=\beta=1$, and compute

\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|.

We repeat the above procedure 1000 times; the average of the resulting 1000 values approximates

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|.
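A minimal Python sketch of the corresponding check (our own illustration; scipy's betabinom implements the posterior predictive p.m.f. above, and the proportionality constant in $r_{N}^{\rm{mis}}$ is taken as 1, which the paper does not specify):

import numpy as np
from scipy.stats import binom, betabinom

def check_binom(N, rate, m=100, q=0.1, a=1.0, b=1.0, n_rep=1000, seed=0):
    """Approximate E| prod phat/p - 1 | for r_N^mis = N**(-rate)."""
    rng = np.random.default_rng(seed)
    N1 = int(round(N * (1 - N ** (-rate))))   # observed outcomes
    vals = np.empty(n_rep)
    for rep in range(n_rep):
        theta = rng.binomial(m, q, size=N)    # true p is Binomial(m, q)
        obs, mis = theta[:N1], theta[N1:]
        aN = a + obs.sum()                    # alpha + N1 * mean(obs)
        bN = b + N1 * m - obs.sum()           # beta + N1 * (m - mean(obs))
        log_ratio = (betabinom.logpmf(mis, m, aN, bN)
                     - binom.logpmf(mis, m, q)).sum()
        vals[rep] = abs(np.exp(log_ratio) - 1)
    return vals.mean()

print(check_binom(N=1000, rate=0.5))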

Values of the approximation are shown in Fig. A3. The results indicate that the faster $r_{N}^{\rm{mis}}$ converges to 0, the faster ${\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})/p({\vartheta}_{i})-1\right|$ converges to 0.

Figure A3: Simulation results for IRTp in the case of the binomial distribution.

2.2. IRTe. When $g(\cdot)$ corresponds to a discrete distribution, it takes the form

g({\vartheta})=\sum_{\psi\in\Psi}p(\psi)\delta({\vartheta}-\psi), \quad (A8)

where $\Psi$ is the set of all $\psi$ such that $p(\psi)>0$. Suppose we estimate $p(\cdot)$ with the empirical distribution

\hat{g}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\delta({\vartheta}-{\vartheta}_{j}). \quad (A9)

Then the empirical distribution can be written as

\hat{g}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})=\sum_{\psi\in\Psi}\left(\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}1\{{\vartheta}_{j}=\psi\}\right)\delta({\vartheta}-\psi),

i.e.,

\hat{p}({\vartheta}\mid{\bm{\vartheta}}_{1:N_{1}})=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}1\{{\vartheta}_{j}={\vartheta}\}.
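In code, $\hat{p}(\cdot\mid{\bm{\vartheta}}_{1:N_{1}})$ is simply the observed frequency among the first $N_{1}$ draws. A short sketch of one replicate of the resulting ratio product (our own illustration with an arbitrary split; note the frequency can be zero for a missing value never observed, in which case the product is zero):

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
theta = rng.binomial(100, 0.1, size=1000)     # true p is Binomial(100, 0.1)
obs, mis = theta[:968], theta[968:]           # illustrative split; N1 = 968
phat = np.array([np.mean(obs == v) for v in mis])   # empirical p.m.f. at each missing value
print(abs(np.prod(phat / binom.pmf(mis, 100, 0.1)) - 1))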

In the following, we verify that

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|=o(1)

holds for $r_{N}^{\rm{mis}}=O(N^{-1/2})$ through simulations. For $N$ ranging from 100 to 100,000, let $r_{N}^{\rm{mis}}\propto N^{-\text{rate}}$ for $\text{rate}\in\{0.5,0.6,0.7\}$. We generate ${\vartheta}_{1},\ldots,{\vartheta}_{N}$ i.i.d. from the true distribution $\text{Binomial}(100,0.1)$ and compute

\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|.

We repeat the above procedure 1000 times; the average of the resulting 1000 values approximates

{\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\frac{\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})}{p({\vartheta}_{i})}-1\right|.

Values of the approximation are shown in Fig. A4. The results indicate that the faster $r_{N}^{\rm{mis}}$ converges to 0, the faster ${\mathbb{E}}\left|\prod_{i=N_{1}+1}^{N}\hat{p}({\vartheta}_{i}\mid{\bm{\vartheta}}_{1:N_{1}})/p({\vartheta}_{i})-1\right|$ converges to 0.

Figure A4: Simulation results for IRTe in the case of the binomial distribution.

Appendix B Additional simulation results

B.1 Clustered Interference

For the simulations in Section 4.1, the type I error rate across 10 datasets and the average power over these datasets with $Y_{i}(0)\sim\chi^{2}(4)$ and $t(4)$ are shown in Fig. B1 and Fig. B2. In these scenarios, the prior for IRTp is still set to a normal distribution. Similar to the observations in Section 4.1, all methods control the type I error rate well, and our IRT methods outperform CRTs and PNRTs in terms of power when $Y_{i}(0)\sim\chi^{2}(4)$ or $t(4)$. These results show that although the models used in IRTp (i.e., normal) and IRTe differ from the true distribution, their performance is robust. Moreover, all IRT methods with estimated imputation distributions are close to IRTo. Additionally, when $g$ corresponds to a continuous distribution, IRTe and IRTs are particularly close to each other in every case, which suggests that IRTe also works well when the true distribution is continuous.

Figure B1: Simulation results under clustered interference for $Y_{i}(0)\sim\chi^{2}(4)$. The type I error rate across 10 datasets and the average power over these datasets are visualized for 10 competing methods under different specifications of causal effect $\tau$ and number of clusters $K$. The horizontal dashed line indicates the significance level of $\alpha=0.05$.
Figure B2: Simulation results under clustered interference for $Y_{i}(0)\sim t(4)$. The type I error rate across 10 datasets and the average power over these datasets are visualized for 10 competing methods under different specifications of causal effect $\tau$ and number of clusters $K$. The horizontal dashed line indicates the significance level of $\alpha=0.05$.

B.2 Spatial Interference

For the simulations in Section 4.2, the type I error rate across 10 datasets and the average power when $Y_{i}(0)\sim\chi^{2}(4)$ and $t(4)$ are presented in Fig. B3 and Fig. B4. The observations are similar to those in Section 4.2, implying that our IRT methods outperform CRTs and PNRTs, especially when $p=0.8$, where the missing rate of imputable outcomes is large. Moreover, the results show that the IRT methods are robust when the model is mis-specified.

Figure B3: Simulation results under spatial interference for $Y_{i}(0)\sim\chi^{2}(4)$: A, B, and C show the type I error rate and average power of 10 competing methods for $p=0.2,0.5$, and $0.8$, respectively, under different specifications of causal effect $\tau$ and distance of interference $r$. The horizontal dashed line indicates the significance level of $\alpha=0.05$.
Figure B4: Simulation results under spatial interference for $Y_{i}(0)\sim t(4)$: A, B, and C show the type I error rate and average power of 10 competing methods for $p=0.2,0.5$, and $0.8$, respectively, under different specifications of causal effect $\tau$ and distance of interference $r$. The horizontal dashed line indicates the significance level of $\alpha=0.05$.