

Posterior risk of modular and semi-modular Bayesian inference

David T. Frazier Corresponding author: david.frazier@monash.edu Department of Econometrics and Business Statistics, Monash University, Clayton VIC 3800, Australia David J. Nott Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546 Operations Research and Analytics Cluster, National University of Singapore, Singapore 119077
Abstract

Modular Bayesian methods perform inference in models that are specified through a collection of coupled sub-models, known as modules. These modules often arise from modeling different data sources or from combining domain knowledge from different disciplines. “Cutting feedback” is a Bayesian inference method that ensures misspecification of one module does not affect inferences for parameters in other modules, and produces what is known as the cut posterior. However, choosing between the cut posterior and the standard Bayesian posterior is challenging. When misspecification is not severe, cutting feedback can greatly increase posterior uncertainty without a large reduction of estimation bias, leading to a bias-variance trade-off. This trade-off motivates semi-modular posteriors, which interpolate between standard and cut posteriors based on a tuning parameter. In this work, we provide the first precise formulation of the bias-variance trade-off that is present in cutting feedback, and we propose a new semi-modular posterior that takes advantage of it. Under general regularity conditions, we prove that this semi-modular posterior is more accurate than the cut posterior according to a notion of posterior risk. An important implication of this result is that point inferences made under the cut posterior are inadmissible. The new method is demonstrated in a number of examples.

Keywords. Bayesian modular inference, Cutting feedback, Model misspecification, Posterior shrinkage

1 Introduction

In many applications, statistical models arise which can be viewed as a combination of coupled submodels (referred to as modules in the literature). Such models are often complex, frequently containing both shared and module-specific parameters, and module-specific data sources. Examples include pharmacokinetic/pharmacodynamic (PK/PD) models (Bennett and Wakefield, 2001; Lunn et al., 2009), which couple a PK module describing movement of a drug through the body with a PD module describing its biological effect, or models for health effects of air pollution (Blangiardo et al., 2011) with separate modules for predicting pollutant concentrations and predicting health outcomes based on exposure. See Liu et al. (2009) and Jacob et al. (2017) for further examples.

In principle, Bayesian inference is attractive in modular settings due to its ability to combine the different sources of information and update uncertainties about unknowns coherently conditional on all the available data. However, it is well-known that Bayesian inference can be unreliable when the model is misspecified (Kleijn and van der Vaart, 2012). For conventional Bayesian inference in multi-modular models, misspecification in one module can adversely impact inferences about parameters in correctly specified modules. “Cutting feedback” approaches modify Bayesian inference to address this issue. They consider a sequential or conditional decomposition of the posterior distribution following the modular structure, and then modify certain terms so that unreliable information is isolated and cannot influence inferences of interest which may be sensitive to the misspecification.

Cutting feedback is only one technique belonging to a wider class of modular Bayesian inference methods (Liu et al., 2009). Good introductions to the basic idea and applications of cutting feedback are given by Lunn et al. (2009), Plummer (2015) and Jacob et al. (2017). Computational aspects of the approach are discussed in Plummer (2015), Jacob et al. (2020), Liu and Goudie (2022b), Yu et al. (2023) and Carmona and Nicholls (2022). Most of the above references deal only with cutting feedback in a certain “two module” system considered in Plummer (2015), which although simple is general enough to encompass many practical applications of cutting feedback methods. We also consider this two module system throughout the rest of the paper. Some recent progress in defining modules and cut posteriors in greater generality is reported in Liu and Goudie (2022a).

A useful extension of cutting feedback is the semi-modular posterior (SMP) approach of Carmona and Nicholls (2020), which avoids the binary decision of using either the cut or full posterior distribution, and can be viewed as a continuous interpolation between two distributions. Further developments and applications are discussed in Liu and Goudie (2023), Carmona and Nicholls (2022), Nicholls et al. (2022) and Frazier and Nott (2024). The motivation for semi-modular inference is explained clearly in Carmona and Nicholls (2022): “In Cut-model inference, feedback from the suspect module is completely cut. However, […] if the parameters of a well-specified module are poorly informed by “local” information then limited information from misspecified modules may allow us to bring the uncertainty down without introducing significant bias”. The above quote nicely describes the intuition behind SMI. However, there is no formal treatment of the bias-variance trade-off that exists in SMI, nor is there any rigorous discussion as to how SMI could “leverage” such a trade-off in practice.

Herein, we make three fundamental contributions to the literature on cutting feedback and misspecified Bayesian models. First, we formally demonstrate that when model misspecification is not too severe, cut posteriors can deliver inferences with smaller bias but more variability than standard Bayesian posteriors, which provides formal evidence for the bias-variance trade-off that motivates SMI and SMPs. Second, we use this result to develop a novel SMP that leverages this trade-off. Lastly, using a notion that captures the risk of a posterior, we demonstrate that the proposed SMP is preferable to the cut, as well as the full posterior under certain conditions. More specifically, under this notion of risk we show that the cut posteriors produce point estimators that are inadmissible, while, under additional conditions, the standard posterior is also shown to deliver inadmissible point estimators.

The remainder of the paper is organized as follows. In Section 2 we give the general framework and make rigorous the conditions necessary for the existence of a bias-variance trade-off in modular inference problems. In Section 3 we discuss semi-modular inference and describe our semi-modular posterior approach. In this section, a simple example is presented in which the new method produces superior results to those based on the cut and full posteriors. In Section 4 we prove, under ‘classical’ regularity conditions, that our semi-modular posterior outperforms the standard and cut posteriors according to a notion of asymptotic risk for a posterior. Section 5 gives two empirical examples, and Section 6 concludes. Proofs of all stated results are given in the supplementary material.

2 Setup and Discussion of the Cut Posterior

Our first contribution is to formalize the potential benefits of using semi-modular inference methods (Carmona and Nicholls, 2020) in misspecified models. The semi-modular inference approach was originally introduced by Carmona and Nicholls (2020) using a two-module system discussed by Plummer (2015). In our current work, we focus on the same two-module system. It is important to describe our motivation for this choice, in the context of previous work on Bayesian modular inference.

One method for defining cutting feedback methods in misspecified models with more than two modules uses an implicit approach by modifying sampling steps in Markov chain Monte Carlo (MCMC) algorithms. The cut function in the WinBUGS and OpenBUGS software packages implements this in a modified Gibbs sampling approach. See Lunn et al. (2009) for a detailed description. However, defining a cut posterior in terms of an algorithm, while quite general, does not allow us to easily understand the implications of cutting feedback, or the general structure of the posterior, due to the implicit nature of such a definition. To give a better understanding, Plummer (2015) considered a two-module system where the cut posterior can be defined explicitly. In many models where cut methods are used, there might be one model component of particular concern, and a definition of the modules in a two-module system can often be made based on this. Many applications of cutting feedback use such a two-module system. Two module systems also play an important role in the recent attempt by Liu and Goudie (2022a) to explicitly define multi-modular systems and cutting feedback generally, where existing modules can be split into two recursively based on partitioning of the data and using the graphical structure of the model. We now describe the two-module system precisely and describe cutting feedback methods for it, before giving a motivating example.

2.1 Setup

We observe a sequence of data $z_{1:n}=(z_{1},\ldots,z_{n})$, $z_{i}\in\mathcal{Z}$ for each $i$. The observations $z_{1:n}$ are considered an observation of a random vector $Z$, and we wish to conduct Bayesian inference on the unknown parameters $\theta$ in the assumed joint likelihood $f(z_{1:n}\mid\theta)$, where $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$, $\theta\in\Theta:=\Theta_{1}\times\Theta_{2}\subseteq\mathbb{R}^{d}$, $\Theta_{1}\subseteq\mathbb{R}^{d_{1}}$, and $\Theta_{2}\subseteq\mathbb{R}^{d_{2}}$. Our prior beliefs over $\Theta$ are expressed via a prior density $\pi(\theta)=\pi(\theta_{1})\pi(\theta_{2}\mid\theta_{1})$.

In modular Bayesian inference, the joint likelihood $f(z_{1:n}\mid\theta)$ can often be expressed as a product, with terms for data sources from different modules. The simplest case is the two-module system described in Plummer (2015), where the random variables are $Z=(X,Y)$. The first module consists of a likelihood term depending on $X$ and $\theta_{1}$, given by $f_{1}(\cdot\mid\theta_{1})$, and the prior $\pi(\theta_{1})$. The second module consists of a likelihood term depending on $Y$ and $\theta_{1},\theta_{2}$, given by $f_{2}(\cdot\mid\theta_{1},\theta_{2})$, and the conditional prior $\pi(\theta_{2}\mid\theta_{1})$. An example of such a model is given below. A model with this structure leads to the following full posterior

\[ \pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})=\pi(\theta_{1}\mid x_{1:n},y_{1:n})\,\pi(\theta_{2}\mid y_{1:n},\theta_{1}), \qquad (1) \]

in which the conditional posterior for $\theta_{2}$ given $\theta_{1}$ does not depend on the observations for the random variable $X$, denoted as $x_{1:n}=(x_{1},\dots,x_{n})$. The parameter $\theta_{1}$ is shared between the two modules, while $\theta_{2}$ is specific to the second module.

While the full posterior in (1) delivers reliable inference when the model is correctly specified, in the presence of model misspecification Bayesian inference can sometimes be unreliable and not “fit for purpose”; see, e.g., Grünwald and Van Ommen (2017) for examples in the case of linear models, and Kleijn and van der Vaart (2012) for a general discussion. Following the literature on cutting feedback methods, we restrict our attention to settings where misspecification is confined to the second module, while specification of the first module is not impacted. Because the parameter $\theta_{1}$ is shared between modules, inference about this parameter can be corrupted by misspecification of the second module. This can also impact the interpretation of inference about $\theta_{2}$, which can be of interest even if the second module is misspecified, provided inference is done conditionally on values of $\theta_{1}$ consistent with the interpretation of this parameter in the first module.

Rather than use the posterior (1), in possibly misspecified models it has been argued that we can cut the link between the modules to produce more reliable inferences for $\theta_{1}$. This is the idea of cutting feedback; see Figure 1 for a graphical depiction of this “cutting” mechanism where, for simplicity, we assume $\pi(\theta_{2}\mid\theta_{1})$ does not depend on $\theta_{1}$. In the case of the two-module system, the cut (indicated by the vertical dotted line in Figure 1) severs the feedback between the modules and allows us to carry out inference for $\theta_{1}$ based on module one, using the likelihood $f_{1}(\cdot\mid\theta_{1})$; inference for $\theta_{2}$ can then be carried out conditional on $\theta_{1}$. In the case of Bayesian inference, this philosophy has led researchers to conduct inference using the cut posterior distribution (see Plummer, 2015; Jacob et al., 2017).

Figure 1: Graphical structure of the two-module system. The dashed line indicates the cut. Shaded nodes are observed quantities.

As shown in Carmona and Nicholls (2020) and Nicholls et al. (2022), the cut posterior is a “generalized” posterior distribution (see, e.g., Bissiri et al., 2016) that restricts the information flow to guard against model misspecification (Frazier and Nott, 2024). In the canonical two-module system, the cut posterior takes the form

\[ \pi_{\mathrm{cut}}(\theta\mid z_{1:n}):=\pi_{\mathrm{cut}}(\theta_{1}\mid x_{1:n})\,\pi(\theta_{2}\mid y_{1:n},\theta_{1}),\qquad \pi_{\mathrm{cut}}(\theta_{1}\mid x_{1:n})\propto f_{1}(x_{1:n}\mid\theta_{1})\,\pi(\theta_{1}). \]

The common argument given for the use of $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ instead of $\pi(\theta\mid z_{1:n})$ is the assumption that misspecification adversely impacts inferences for $\theta_{1}$; see, e.g., Liu et al. (2009) and Jacob et al. (2017). The cut posterior uses only information from the data $x_{1:n}$ in making inference about $\theta_{1}$, ensuring inference is insensitive to misspecification of the model for $y_{1:n}$. However, uncertainty about $\theta_{1}$ can still be propagated through for inference on $\theta_{2}$ via the conditional posterior $\pi(\theta_{2}\mid\theta_{1},y_{1:n})$.
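To make the two-stage structure concrete, the following is a minimal sketch of cut-posterior sampling, assuming user-supplied samplers for the two stages; the helper names `sample_theta1_given_x` and `sample_theta2_given_y_theta1` are hypothetical stand-ins for whatever exact or MCMC samplers a given model admits.

```python
# Minimal sketch of two-stage cut-posterior sampling (hypothetical helper names).
# Stage 1 uses ONLY the first module's data x, so misspecification of module 2
# cannot feed back into inference for theta_1.
import numpy as np

def sample_cut_posterior(x, y, sample_theta1_given_x,
                         sample_theta2_given_y_theta1,
                         n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        theta1 = sample_theta1_given_x(x, rng)                 # targets pi_cut(theta1 | x)
        theta2 = sample_theta2_given_y_theta1(y, theta1, rng)  # targets pi(theta2 | y, theta1)
        draws.append((theta1, theta2))
    return draws
```

Because the second stage conditions on each first-stage draw, uncertainty about $\theta_{1}$ is propagated into inference for $\theta_{2}$ without any feedback in the other direction.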

Motivating example: HPV prevalence and cervical cancer incidence

We now discuss a simple example described in Plummer (2015) that illustrates some of the benefits of cut model inference. The example, which is discussed further in the supplementary material, is based on data from a real epidemiological study (Maucort-Boulch et al., 2008). Of interest is the international correlation between high-risk human papillomavirus (HPV) prevalence and cervical cancer incidence, for women in a certain age group. There are two data sources. The first is survey data on HPV prevalence for 13 countries. There are $X_{i}$ women with high-risk HPV in a sample of size $n_{i}$ for country $i$, $i=1,\dots,13$. There is also data on cervical cancer incidence, with $Y_{i}$ cases in country $i$ in $T_{i}$ woman-years of follow-up. The data are modelled as

\[ X_{i}\sim\mathrm{Binomial}(n_{i},\theta_{1,i}),\qquad i=1,\dots,13, \]
\[ Y_{i}\sim\mathrm{Poisson}(\lambda_{i}),\qquad \log\lambda_{i}=\log T_{i}+\theta_{2,1}+\theta_{2,2}\theta_{1,i}. \]

The prior for $\theta_{1}=(\theta_{1,1},\dots,\theta_{1,13})^{\top}$ assumes independent components with uniform marginals on $[0,1]$. The prior for $\theta_{2}=(\theta_{2,1},\theta_{2,2})^{\top}$ assumes independent normal components, $N(0,1000)$.

Module 1 consists of $\pi(\theta_{1})$ and $f(X\mid\theta_{1})$ (survey data module) and module 2 consists of $\pi(\theta_{2})$ and $f(Y\mid\theta_{2},\theta_{1})$ (cancer incidence module). The Poisson regression model in the second module is misspecified. Because of the coupling of the survey and cancer incidence modules, with the HPV prevalence values $\theta_{1,i}$ appearing as covariates in the Poisson regression for cancer incidence, the cancer incidence module contributes misleading information about the HPV prevalence parameters. The cut posterior estimates these parameters based on the survey data only, preventing contamination of the estimates by the misspecified module. This in turn results in more interpretable estimates of the parameter $\theta_{2}$ in the misspecified module, since $\theta_{2}$ summarizes the relationship between HPV prevalence and cancer incidence, but the summary produced can only be useful when the inputs to the regression (i.e., the HPV prevalence covariate values) are properly estimated.
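Since the first module is conjugate (uniform prior, binomial likelihood), stage one of the cut posterior can be sampled exactly; below is a minimal sketch, with synthetic placeholder data standing in for the actual study counts.

```python
# Stage one of the cut posterior in the HPV example: with a uniform prior,
# theta_{1,i} | X_i ~ Beta(X_i + 1, n_i - X_i + 1) exactly, country by country.
# The data below are synthetic placeholders, not the real survey counts.
import numpy as np

rng = np.random.default_rng(0)
theta_true = rng.uniform(0.02, 0.25, size=13)   # hypothetical HPV prevalences
n = rng.integers(50, 500, size=13)              # hypothetical survey sample sizes
X = rng.binomial(n, theta_true)                 # women with high-risk HPV

n_draws = 5000
theta1_draws = rng.beta(X + 1, n - X + 1, size=(n_draws, 13))
# Each row of theta1_draws would then feed stage two: an MCMC update targeting
# pi(theta2 | Y, theta1) under the Poisson regression, with no feedback to theta1.
```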

Figure 2 shows the marginal posterior distributions of $\theta_{2}$ under the full and cut posteriors, which are very different, illustrating how misspecification of the cancer incidence module distorts inference about HPV prevalence in the full posterior, resulting in uninterpretable estimation of $\theta_{2}$. Further discussion of this example in the supplement shows the benefits of a semi-modular inference approach in which the tuning parameter $\omega$ interpolating between the cut and full posteriors can be chosen based on a user-defined loss function reflecting the purpose of the analysis. A more complex real example is considered in Section 5.


Figure 2: Marginal full and cut posterior distributions for $\theta_{2}$ in the HPV example.

2.2 The Impact of Misspecification in Modular Inference

In the remainder we consider a generalization of the canonical two-module system by assuming that the joint likelihood takes the form

\[ f(z_{1:n}\mid\theta)=f_{1}(z_{1:n}\mid\theta_{1})\,f_{2}(z_{1:n}\mid\theta_{1},\theta_{2}), \]

where the individual models may or may not depend on the entire dataset $z_{1:n}$. In the expression above, $f_{1}(z_{1:n}\mid\theta_{1})$ and $f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ need not be densities as functions of $z_{1:n}$ so long as $f(z_{1:n}\mid\theta)$ itself is a density; the terms $f_{1}$ and $f_{2}$ describe a decomposition of the likelihood having the dependence indicated by the arguments. The goal of modular Bayesian inference is to conduct inference on $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$ in the specific setting where inference for $\theta_{1}$ based only on $f_{1}(z_{1:n}\mid\theta_{1})$ is reasonable, but where the module $f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ contains additional useful information on $\theta_{1}$; see the regularity conditions in Appendix C.1 for specific details. Modular inference on $\theta$ can be carried out using the cut posterior

\[ \pi_{\mathrm{cut}}(\theta\mid z_{1:n}):=\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\pi(\theta_{2}\mid z_{1:n},\theta_{1}),\qquad \pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\propto f_{1}(z_{1:n}\mid\theta_{1})\,\pi(\theta_{1}). \qquad (2) \]

The cut posterior is beneficial when misspecification of $f_{2}(\cdot\mid\theta_{1},\theta_{2})$ makes full Bayesian inference on $\theta_{1}$ less accurate than Bayesian inference using only $f_{1}(\cdot\mid\theta_{1})$. The benefits of cutting feedback can be formalized by assuming that misspecification is limited to $f_{2}(\cdot\mid\theta_{1},\theta_{2})$ and that the true data generating process (DGP) has a density of the form (strictly speaking, the analysis extends beyond density functions, but we maintain the use of densities to simplify the discussion)

\[ h(z\mid\theta_{1,0},\delta_{0})=f_{1}(z\mid\theta_{1,0})\,\delta_{0}(z), \]

for some unknown component $\delta_{0}(z)$ that controls the amount of model misspecification. The form of $h(z\mid\theta_{1,0},\delta_{0})$ allows us to capture both gross model misspecification and situations where misspecification is ambiguous. We first restrict our attention to gross model misspecification by imposing the following assumption.

Assumption 1 (Gross Misspecification).

The observed data $\{z_{i}:i\geq 1,n\geq 1\}$ is independent and identically distributed according to $h(z\mid\theta_{1,0},\delta_{0})=f_{1}(z\mid\theta_{1,0})\delta_{0}(z)$. For some $\widetilde{\mathcal{Z}}\subseteq\mathcal{Z}$, with $\int_{\widetilde{\mathcal{Z}}}h(z\mid\theta_{1,0},\delta_{0})\mathrm{d}z>0$, and all $z\in\widetilde{\mathcal{Z}}$, $\delta_{0}(z)\neq f_{2}(z\mid\theta_{1,0},\theta_{2})$ for any $\theta_{2}\in\Theta_{2}$. There exists $\theta_{\star}\in\mathrm{Int}(\Theta)$ that minimizes $\theta\mapsto\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$, the Kullback-Leibler divergence of $f(z\mid\theta)$ from $h(z\mid\theta_{1,0},\delta_{0})$, over $\Theta$.

Under Assumption 1 and regularity conditions, the cut posterior $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ can be shown to deliver accurate inferences for $\theta_{1,0}$, while the full posterior $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})$ does not. To state this result, denote the cut posterior distribution for $\theta_{1}$ as $\Pi_{\mathrm{cut}}(\theta_{1}\in\cdot\mid z_{1:n})$, and let $\Pi_{\mathrm{full}}(\theta_{1}\in\cdot\mid z_{1:n})$ denote the full posterior distribution for $\theta_{1}$. For any $\varepsilon>0$, define $\Theta_{1}(\varepsilon):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,0}\|\leq\varepsilon\}$. For $P_{0}^{(n)}$ the true data generating process, we say that $X_{n}\xrightarrow{P}a$ (or $X_{n}=a+o_{p}(1)$) if for all $\varepsilon>0$, $P_{0}^{(n)}(\|X_{n}-a\|\geq\varepsilon)=o(1)$ as $n\rightarrow\infty$.

Lemma 1.

If Assumption 1 and the regularity conditions in Appendix B are satisfied, then for any $\varepsilon>0$, $\Pi_{\mathrm{cut}}\{\theta_{1}\in\Theta_{1}(\varepsilon)\mid z_{1:n}\}\xrightarrow{P}1$. For $\varepsilon>0$ such that $\theta_{1,\star}\notin\Theta_{1}(\varepsilon)$, $\Pi_{\mathrm{full}}\{\theta_{1}\in\Theta_{1}(\varepsilon)\mid z_{1:n}\}\xrightarrow{P}0$.

Assumption 1 embodies cases where a practitioner can confidently determine that the second module is misspecified. Lemma 1 then shows that, in these cases, the cut posterior concentrates on the true value $\theta_{1,0}$, while the full posterior does not. By restricting the flow of information across modules, cutting feedback produces inferences for $\theta_{1,0}$ that are more accurate than full Bayesian inference. Although Lemma 1 implies that cut posterior inference is more accurate than full posterior inference when the second module is grossly misspecified, it is of limited use when $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ and $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})$ are similarly located, which can occur when misspecification of the second module is not severe. In such cases, the full posterior will have less uncertainty than the cut posterior, while also being similarly located, making it unclear whether to prefer the cut or full posterior.

To capture the empirically relevant setting where, for any fixed $n$, it is unclear which posterior to use, we investigate the behavior of these posteriors under a certain class of locally misspecified DGPs. Following the literature on robust asymptotic statistical analysis (see, e.g., Ch. 3 of Rieder, 2012), we approximate the empirical situation where neither method is clearly preferable using a local perturbation about the assumed model:

\[ h(z\mid\theta_{0},\delta_{n}):=f_{1}(z\mid\theta_{1,0})\{1+\psi^{\top}\zeta(z)/\sqrt{n}\}\,f_{2}(z\mid\theta_{0}),\qquad \delta_{n}=\psi^{\top}\zeta/\sqrt{n}, \qquad (3) \]

which depends on a (random) direction of misspecification $\zeta\in\mathbb{R}^{d}$, and a magnitude $\psi$ that takes values in $\Delta\subset\mathbb{R}^{d}$. Under the local perturbation framework in (3), the cut and full posteriors have similar locations in $\Theta$ for small-to-moderate sample sizes, and conventional specification testing methods cannot consistently detect which method delivers more accurate inferences for $\theta_{0}$. As such, this class of perturbations allows us to compare cut and full posterior inference when correct specification of the second module is ambiguous.
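For intuition, data from the locally perturbed density in (3) can be simulated by rejection, provided one can draw from the assumed model and $|\psi^{\top}\zeta(z)|$ is bounded; the sketch below works under those assumptions, with hypothetical helper names.

```python
# Rejection sampler for h(z | theta_0, delta_n) proportional to
# f(z | theta_0) * {1 + psi' zeta(z) / sqrt(n)}, assuming |psi' zeta(z)| <= B
# (and n >= B**2, so the perturbation factor stays nonnegative).
import numpy as np

def sample_perturbed(sample_f, psi_zeta, B, n, size, rng):
    """sample_f(rng): one draw from the assumed model f1*f2;
    psi_zeta(z): the scalar psi' zeta(z)."""
    c = 1.0 + B / np.sqrt(n)                 # envelope constant: target <= c * f
    out = []
    while len(out) < size:
        z = sample_f(rng)                    # proposal from the assumed model
        if rng.random() < (1.0 + psi_zeta(z) / np.sqrt(n)) / c:
            out.append(z)
    return out
```

As $n$ grows, the acceptance ratio tends to one, reflecting that the perturbation, and hence the ambiguity about misspecification, shrinks at the $1/\sqrt{n}$ rate.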

Assumption 2 (Local Misspecification).

The triangular array $\{z_{i,n}:1\leq i\leq n,\,n\geq 1\}$ is independent and identically distributed according to $h(z\mid\theta_{0},\delta_{n})$ in (3) for fixed $n$. For $\psi\in\Delta\subset\mathbb{R}^{d}$, $\Delta$ compact, $\zeta(z)$ satisfies: (i) $\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\zeta(z)^{\top}\psi\, f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\,\mathrm{d}\mu(z)=0$; (ii) for $\eta=(\eta_{1}^{\top},\eta_{2}^{\top})^{\top}$ partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$, $\eta=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\zeta(z)^{\top}\psi\, f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\,\mathrm{d}\mu(z)$, and $0\leq\|\eta\|<\infty$.

Remark 1.

Assumption 2 is a device that allows us to rigorously compare the cut and full posteriors when neither is clearly preferable. The local misspecification framework ensures that, for any $n$, there remains some ambiguity about model misspecification, and thus about which posterior to prefer. It provides an appropriate theoretical framework for analyzing cut and full posteriors when the choice between them is unclear. Assumption 2 further maintains that the direction of misspecification, $\zeta(z)$, does not adversely affect the location of the cut posterior for $\theta_{1}$ but will impact the full posterior.

Remark 2.

Assumption 2 resembles, but is distinct from, the misspecification device employed in Hjort and Claeskens (2003) and Claeskens and Hjort (2003) to construct methods that combine and choose between different frequentist point estimators. The approach outlined in Section 8 of Claeskens and Hjort (2003) is not appropriate here since their framework can produce cut posterior inferences for $\theta_{1,0}$ that are less accurate than the full posterior, which contradicts the underlying reasoning for using cut posteriors. The misspecification framework in Assumption 2 also differs from the designs in Pompe and Jacob (2021) and Frazier and Nott (2024), which are similar to Assumption 1 and ensure that, with probability converging to one, the researcher knows the model is misspecified.

Under Assumption 2, cut posterior inference for $\theta_{1}$ is not impacted by misspecification of the second module. To show this, we require further notation. First, note that

\[ \ell(\theta)=\log f(z_{1:n}\mid\theta)=\log f_{1}(z_{1:n}\mid\theta_{1})+\log f_{2}(z_{1:n}\mid\theta)\equiv\ell_{p}(\theta_{1})+\ell_{c}(\theta_{1},\theta_{2}), \]

where $\ell_{p}(\theta_{1})$ denotes the partial log-likelihood for $\theta_{1}$ used in the cut posterior, and $\ell_{c}(\theta)$ denotes the log-likelihood used in the conditional posterior. Denote the derivative of the full log-likelihood by $\dot{\ell}(\theta):=\partial\ell(\theta)/\partial\theta$, and the second derivative by $\ddot{\ell}(\theta):=\partial^{2}\ell(\theta)/\partial\theta\partial\theta^{\top}$. For $j,k\in\{1,2\}$, define the first and second partial derivatives as $\dot{\ell}_{(j)}(\theta):=\partial\ell(\theta)/\partial\theta_{j}$ and $\ddot{\ell}_{(ij)}(\theta):=\partial^{2}\ell(\theta)/\partial\theta_{j}\partial\theta_{i}^{\top}$. Similar notation will be used to denote derivatives of $\ell_{u}(\theta)$, for $u\in\{c,p\}$. Let $\mathbb{E}_{n}[g(z)]$ denote the expectation of $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$ under $h(z\mid\theta_{0},\delta_{n})$ in Assumption 2, and define the following information matrices: $\mathcal{I}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}(\theta_{0})]$, $\mathcal{I}_{jk}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(jk)}(\theta_{0})]$, and $\mathcal{I}_{p(11)}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{p}(\theta_{1,0})]$. Let $\overline{\theta}_{\mathrm{cut}}:=\int_{\Theta}\theta\,\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\mathrm{d}\theta$, $\overline{\theta}_{\mathrm{full}}:=\int_{\Theta}\theta\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta$, and let $\Rightarrow$ denote weak convergence under $P^{(n)}_{0}$.

Lemma 2.

Under Assumption 2, and the regularity conditions in Appendix C.1:

(i) $\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1})$.

(ii) $\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})\Rightarrow N(0,\mathcal{I}^{-1}_{p(11)})$.

(iii) $\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0})\Rightarrow N(\mathcal{I}_{22}^{-1}\eta_{2},\,\mathcal{I}^{-1}_{22}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})$.

(iv) Only credible sets calculated under $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ have valid frequentist coverage.

(v) The squared bias for $\theta_{1,0}$ and $\theta_{2,0}$ under $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ is smaller than that under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$.

Remark 3.

Lemma 2 shows that inferences for $\theta_{1}$ using the cut posterior have no asymptotic bias, whereas the full posterior for $\theta_{1}$ has asymptotic bias, and the two posteriors have different biases for $\theta_{2}$. The presence of asymptotic bias implies that, if $\eta\neq 0$, only credible sets for $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ correctly quantify uncertainty, i.e., only $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ has calibrated credible sets. Since the bias due to misspecification, $\eta$, is unknown, it does not seem feasible to determine whether $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ or $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ more reliably quantifies uncertainty in general, except when $\eta=0$, where both methods accurately quantify uncertainty.

Lemma 2 implies that the user is faced with a trade-off between conducting inference using the cut or full posterior: inferences under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ have the smallest variability possible (via Cramér-Rao) but exhibit a bias of unknown magnitude, whereas inferences under $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ are guaranteed to have smaller bias than those under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ but have (weakly) larger variability. Lemma 2 is the first result to formally show that a bias-variance trade-off exists between the cut and full posteriors. Since the bias due to misspecification ($\eta$) is unobservable, Lemma 2 exemplifies the situation where it is ambiguous whether we should base inferences on the posterior that exhibits low bias (the cut posterior) or on the posterior that exhibits larger bias but much smaller variability (the full posterior).

3 Semi-Modular Inference

Lemma 2 suggests that if we consider the accuracy of posteriors using a criterion that measures both the bias and variance of posterior inference, it should be possible to combine the cut and full posteriors to deliver inferences that are more accurate than either alone. The goal of semi-modular inference (SMI) is to interpolate between the cut and full posteriors to reduce the uncertainty about $\theta$ while maintaining a tolerable level of bias. However, there are many ways to interpolate between two probability distributions, and there is no a priori reason to suspect one method of interpolation will deliver better results than others; see Nicholls et al. (2022) for a discussion of some of the possibilities.

Following Chakraborty et al. (2022) and Yu et al. (2023), we focus on conducting SMI using linear opinion pools (Stone, 1961), which produces a semi-modular posterior (SMP) that is a convex combination of the cut and full posteriors: for $\omega\in[0,1]$,

\[ \pi_{\omega}(\theta\mid z_{1:n}):=\pi_{\omega}(\theta_{1}\mid z_{1:n})\,\pi(\theta_{2}\mid z_{1:n},\theta_{1}), \qquad \pi_{\omega}(\theta_{1}\mid z_{1:n}):=(1-\omega)\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})+\omega\,\pi(\theta_{1}\mid z_{1:n}). \qquad (4) \]

The pooling weight $\omega\in[0,1]$ in the SMP determines the level of interpolation between the posteriors. Chakraborty et al. (2022) suggest choosing $\omega$ using prior-predictive conflict checks, while Carmona and Nicholls (2022) propose out-of-sample predictive methods to select $\omega$. In contrast, we propose a novel choice of pooling weight that leverages the bias-variance trade-off between cut and full posterior inferences.
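Given draws of $\theta_{1}$ from the cut and full posteriors, sampling from the linear pool in (4) only requires Bernoulli($\omega$) component labels; here is a minimal sketch (the array shapes are assumptions), with $\theta_{2}$ then drawn from the shared conditional $\pi(\theta_{2}\mid z_{1:n},\theta_{1})$ in either case.

```python
# Draw theta_1 from the SMP mixture in (4): with probability omega take a
# full-posterior draw, otherwise a cut-posterior draw.
import numpy as np

def smp_theta1_draws(theta1_cut, theta1_full, omega, rng):
    """theta1_cut, theta1_full: (n_draws, d1) arrays of posterior samples."""
    m = min(len(theta1_cut), len(theta1_full))
    use_full = rng.random(m) < omega          # Bernoulli(omega) component labels
    return np.where(use_full[:, None], theta1_full[:m], theta1_cut[:m])
```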

3.1 Shrinkage-based semi-modular posteriors

To build intuition as to how $\pi_{\omega}(\theta\mid z_{1:n})$ can utilize the bias-variance trade-off between $\pi_{\mathrm{cut}}$ and $\pi_{\mathrm{full}}$, we first focus on the behavior of the SMP for $\theta_{1}$, i.e., $\pi_{\omega}(\theta_{1}\mid z_{1:n})$ in (4), and analyze the behavior of $\pi_{\omega}(\theta\mid z_{1:n})$ in subsequent sections. (Differences between the cut, full, and SMP posteriors for $\theta_{2}$ are attributable to differences in the posterior for $\theta_{1}\mid z_{1:n}$, since each posterior shares the same conditional posterior for $\theta_{2}$ given $\theta_{1}$.) Focusing on $\pi_{\omega}(\theta_{1}\mid z_{1:n})$ in (4), note that the SMP point estimator for $\theta_{1}$ is

\[ \overline{\theta}_{1}(\omega):=(1-\omega)\,\overline{\theta}_{1,\mathrm{cut}}+\omega\,\overline{\theta}_{1,\mathrm{full}}\equiv\overline{\theta}_{1,\mathrm{cut}}-\omega(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}}), \qquad (5) \]

where $\overline{\theta}_{1,\mathrm{cut}}:=\int_{\Theta_{1}}\theta_{1}\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}$ and $\overline{\theta}_{1,\mathrm{full}}:=\int_{\Theta}\theta_{1}\,\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta$. From Lemma 2, the asymptotic mean of $\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})$ is zero, while that of $\sqrt{n}(\overline{\theta}_{1,\mathrm{full}}-\theta_{1,0})$ depends on the misspecification bias $\eta$. Hence, the SMP posterior mean $\overline{\theta}_{1}(\omega)$ combines an asymptotically unbiased estimator with high variance, $\overline{\theta}_{1,\mathrm{cut}}$, and an asymptotically biased estimator with small variance, $\overline{\theta}_{1,\mathrm{full}}$. Therefore, if we are willing to tolerate some bias, it is possible to choose $\omega$ to deliver inferences on $\theta_{1,0}$ that are more accurate than either the cut or full posterior alone, at least so long as our measure of “accuracy” accounts for both bias and variance.

The idea of combining biased and unbiased estimators has a long history in statistics, with the most commonly encountered estimators of this kind being shrinkage and James-Stein estimators; see Chapter 5 of Lehmann and Casella (2006) for a textbook discussion. Indeed, the form of $\overline{\theta}_{1}(\omega)$ in (5) is reminiscent of estimators encountered in the shrinkage literature. For two normally distributed estimators, one biased and the other unbiased, Green and Strawderman (1991) demonstrated how to optimally combine such estimators to deliver a shrinkage estimator that is optimal in terms of expected squared error loss. The approach of Green and Strawderman (1991) was extended to more general settings and loss functions by Kim and White (2001) and Judge and Mittelhammer (2004).

While $\overline{\theta}_{1,\mathrm{cut}}$ and $\overline{\theta}_{1,\mathrm{full}}$ are not normally distributed, they are asymptotically normal, and so one could choose $\omega$ in the SMP using existing shrinkage estimation proposals. Following ideas similar to Green and Strawderman (1991) and Kim and White (2001), we could choose $\omega$ in (4) as

\[ \widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}\},\qquad \widetilde{\omega}=\frac{\gamma}{n(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}})^{\top}\Upsilon(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}})}, \qquad (6) \]

for some $\gamma>0$ and $\Upsilon$ a positive definite $(d_{1}\times d_{1})$-matrix. Since $\overline{\theta}_{1,\mathrm{full}}$ is asymptotically biased while $\overline{\theta}_{1,\mathrm{cut}}$ is not, using $\widetilde{\omega}_{+}$ within the SMP allows us to interpret the SMP as shrinking cut posterior inferences towards those of the biased full posterior, and so using $\widetilde{\omega}_{+}$ delivers a type of “shrinkage” SMP (S-SMP).

Semi-modular Bayesian inference is clearly much more general than the linear Gaussian models analyzed in Green and Strawderman (1991), Kim and White (2001), and Judge and Mittelhammer (2004). Nevertheless, applying shrinkage estimation ideas within semi-modular Bayesian inference should allow us to effectively combine the cut and full posteriors. In the following sections, we show that this is indeed the case: empirically and theoretically, SMPs based on weights similar to (6) deliver inferences that can be shown to be optimal according to a general notion of asymptotic risk. Before presenting any formal analysis, we first illustrate the behavior of the S-SMP based on $\widetilde{\omega}_{+}$ empirically in a simple example from the cutting feedback literature.
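For concreteness, a weight of the form (6) can be computed directly from posterior draws; the sketch below takes $\Upsilon=I$ and sets $\gamma$ via the trace of the posterior variance gap, matching the empirical version used in the next subsection. This is one choice among many, not a definitive recommendation.

```python
# Shrinkage pooling weight computed from posterior draws, with Upsilon = I and
# gamma equal to the trace of the posterior covariance gap.
import numpy as np

def shrinkage_weight(theta1_cut, theta1_full):
    """theta1_cut, theta1_full: (n_draws, d1) arrays of posterior draws."""
    cov_cut = np.atleast_2d(np.cov(theta1_cut, rowvar=False))
    cov_full = np.atleast_2d(np.cov(theta1_full, rowvar=False))
    var_gap = np.trace(cov_cut) - np.trace(cov_full)
    if var_gap <= 0:            # cut posterior no less precise: keep omega = 0
        return 0.0
    gap = theta1_cut.mean(0) - theta1_full.mean(0)
    return min(1.0, var_gap / float(gap @ gap))
```

Note that the posterior covariances already carry the $1/n$ scaling, so no explicit factor of $n$ appears in the ratio.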

3.2 Example: Biased Mean

To demonstrate the benefits of the SMP, we consider a minor modification of the biased mean example in Section 2.1 of Liu et al. (2009). We observe two datasets generated from independent random variables with the same unknown mean $\theta_{1}$, but where the assumed model for the second dataset is incorrect, resulting in biased estimation of the parameter of interest. The first dataset corresponds to $n_{1}$ observations on a $(d_{1}\times 1)$-dimensional, $d_{1}\geq 1$, random vector that we assume is generated from the model:

\[ y_{1,i}=\theta_{1}+\epsilon_{1,i}, \]

where $\theta_{1}=(\theta_{1,1},\dots,\theta_{1,d_{1}})^{\top}$ and $\epsilon_{1,i}$, $i=1,\dots,n_{1}$, are iid $N(0,I_{d_{1}})$. However, we also observe an additional dataset, comprising $n_{2}>n_{1}$ observations, which is assumed to be from the model:

\[ y_{2,i}=\theta_{1}+\sigma\epsilon_{2,i}, \]

where $\epsilon_{2,i}$, $i=1,\dots,n_{2}$, are iid $N(0,I_{d_{1}})$ for unknown $\sigma>0$. The prior density for $\theta_{1}$ is $N(0,I_{d_{1}})$, and for $\theta_{2}:=\sigma^{2}$ it is $\pi(\sigma^{2})\propto 1/\sigma^{2}$. The parameter of interest is $\theta_{1}$, and we wish to determine how much the second dataset should influence inference about $\theta_{1}$ when its assumed model is incorrect, leading to biased estimation of $\theta_{1}$. (The original example of Liu et al. (2009) is such that, for even moderate values of the bias, we always prefer the cut posterior. Given this, we have slightly modified the original example to ensure a meaningful trade-off exists between the cut and full posteriors; without this modification, the S-SMP simply returns the cut posterior in the vast majority of cases.)

Suppose that in the actual data-generating process $\epsilon_{2,i}$ is not $N(0,I_{d_{1}})$, but

\[ \epsilon_{2,i}\overset{iid}{\sim}\begin{cases} N(-0.25\,\iota_{d_{1}},\,0.10\,I_{d_{1}}) & \text{with prob. }\delta\\ N(0,\,0.5\,I_{d_{1}}) & \text{with prob. }1-\delta \end{cases} \]

for $i=1,\dots,n_{2}$, where $\iota_{d_{1}}$ denotes a $d_{1}\times 1$ vector of ones. When $\delta=0$, this reduces to the assumed model, but when $\delta>0$ we obtain biased estimation of $\theta_{1}$ under the assumed model.

For our experiments, we assume $\sigma^{2}=1/2$ and use an equally spaced grid of values for the contamination $\delta\in\{0:0.05:0.9\}$. For each value in this grid, we generate 1000 replications from the above process with $n_{1}=100$ and $n_{2}=1000$, and consider two different values of $d_{1}\in\{1,5\}$. For each dataset, the S-SMP is based on the following version of the weights in (6): for $\widehat{\mathcal{I}}_{p(11)}^{-1}=\mathrm{Cov}_{\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})}(\theta_{1})$ and $\widehat{\mathcal{I}}_{11.2}^{-1}=\mathrm{Cov}_{\pi(\theta_{1}\mid z_{1:n})}(\theta_{1})$,

\[ \widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}\},\qquad \widetilde{\omega}=\frac{\mathrm{tr}[\widehat{\mathcal{I}}_{p(11)}^{-1}-\widehat{\mathcal{I}}_{11.2}^{-1}]}{\|\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}}\|^{2}}\,\mathbb{I}\big\{\mathrm{tr}[\widehat{\mathcal{I}}_{p(11)}^{-1}-\widehat{\mathcal{I}}_{11.2}^{-1}]>0\big\}. \]

To compare the impact of misspecification on the different modular posteriors, we plot a Monte Carlo estimate of the expected squared error of the point estimators, based on 1000 replicated samples, across the grid of values for $\delta$. The results are presented in Figure 3 and show that for relatively small levels of contamination, the full posterior has lower expected squared error than the cut posterior, due to its much smaller variance, while at higher levels of contamination the cut posterior has lower expected squared error. However, the expected squared error of the S-SMP is always lower than that of the cut posterior, which demonstrates that the SMP is able to trade off between the two posteriors to reduce squared error risk across all levels of contamination. We note that when $d_{1}=1$ and $\delta=0.90$, the S-SMP and cut posterior give very similar results. (Appendix D in the supplementary material contains additional experiments for this example. These results show that the S-SMP delivers reasonable results for all choices of $d_{1}$ and becomes more accurate as the dimension $d_{1}$ increases.)
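A self-contained sketch of this experiment is given below, under the simplifying assumption that $\sigma^{2}=1/2$ is known, so that both posteriors are Gaussian in closed form (the paper instead places a prior on $\sigma^{2}$); it is illustrative code, not the authors' implementation, so the numbers will differ somewhat from Figure 3.

```python
# Simplified biased-mean experiment: conjugate cut and full posteriors under a
# N(0, I) prior on theta_1, treating sigma^2 as known (an assumption made here
# to keep the posteriors in closed form).
import numpy as np

rng = np.random.default_rng(1)
d1, n1, n2, sigma2, delta = 5, 100, 1000, 0.5, 0.3
theta1_true = np.zeros(d1)

def one_replication():
    y1 = theta1_true + rng.standard_normal((n1, d1))
    # Contaminated second dataset: shifted component w.p. delta, inflated otherwise.
    contam = rng.random(n2) < delta
    eps2 = np.where(contam[:, None],
                    -0.25 + np.sqrt(0.10) * rng.standard_normal((n2, d1)),
                    np.sqrt(0.5) * rng.standard_normal((n2, d1)))
    y2 = theta1_true + np.sqrt(sigma2) * eps2
    # Posterior precisions and means under the conjugate normal model.
    prec_cut, prec_full = n1 + 1.0, n1 + n2 / sigma2 + 1.0
    mean_cut = n1 * y1.mean(0) / prec_cut
    mean_full = (n1 * y1.mean(0) + (n2 / sigma2) * y2.mean(0)) / prec_full
    # Shrinkage pooling weight: trace of covariance gap over squared mean gap.
    var_gap = d1 / prec_cut - d1 / prec_full
    gap = mean_cut - mean_full
    omega = min(1.0, var_gap / float(gap @ gap))
    mean_smp = (1 - omega) * mean_cut + omega * mean_full
    return [float(np.sum((m - theta1_true) ** 2))
            for m in (mean_cut, mean_full, mean_smp)]

errs = np.array([one_replication() for _ in range(1000)])
print("MC squared error, cut / full / S-SMP:", errs.mean(axis=0))
```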

Figure 3: Monte Carlo estimate of expected risk under squared error loss across different levels of contamination ($\delta$). Full refers to the expected risk for $\theta_{1,0}$ associated with the full posterior based on both datasets; Cut is the risk associated with the cut posterior; SMP is the risk for the proposed semi-modular posterior.

4 Measuring posterior accuracy through risk

The example in Section 3.2 suggests that the S-SMP can deliver more accurate inferences, in terms of expected squared error, than the cut or full posterior. More generally, this suggests that the S-SMP can deliver inferences that are accurate according to a criterion that measures both the bias and variance of posterior inferences. However, such a notion of accuracy is only one way to measure the accuracy of posterior inferences. In modular Bayesian inference, Jacob et al. (2017) and Pompe and Jacob (2021) have suggested choosing between full and cut posteriors using out-of-sample predictive accuracy, while Carmona and Nicholls (2020) have suggested a similar approach for calibrating an SMI tuning parameter. While such an approach is undoubtedly useful, the empirical analysis in Jacob et al. (2017), as well as the empirical and theoretical analysis in Pompe and Jacob (2021), suggests that the preferred method according to such criteria is example specific, with neither method likely to be generally preferable. Further, Lemma 2 demonstrates that cut and full posterior inferences for $\theta_{0}$ exhibit asymptotic bias, and it is not clear how notions of predictive accuracy account for the impact of this bias on the resulting inferences.

In contrast to predictive approaches, we propose to measure the accuracy of modular and semi-modular posteriors through an inferential criterion that accounts for both the bias and variance of the resulting inferences. Given that the SMP includes the cut ($\omega=0$) and full ($\omega=1$) posteriors as special cases, we evaluate the accuracy of different posteriors using the “posterior risk” associated with different $\omega$ values. This notion of risk has previously been used by Castillo (2014) and Lee and Lee (2018) to choose between different prior classes in Bayesian inference, and is capable of capturing both the bias and variance associated with posterior inferences.

Given a user-chosen loss function $q:\Theta\times\Theta\rightarrow\mathbb{R}_{+}$, at the point $\theta_{0}\in\Theta$ we can measure the loss associated with $\omega\in[0,1]$ via the posterior risk $\mathbb{E}_{\pi_{\omega}}\{q(\theta,\theta_{0})\mid z_{1:n}\}:=\int_{\Theta}q(\theta,\theta_{0})\pi_{\omega}(\theta\mid z_{1:n})\mathrm{d}\theta$. For $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$, recall that $\mathbb{E}_{n}[g(z)]=\int_{\mathcal{Z}}g(z)h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)$, with $h(z\mid\theta_{0},\delta_{n})$ as in Assumption 2. The trimmed posterior risk of $\pi_{\omega}$ at $\theta_{0}$ (hereafter referred to simply as P-risk) is defined as follows (trimming is necessary to ensure that the expectation exists, and can be disregarded in practical terms):

\[ \mathrm{R}_{q}(\pi_{\omega},\theta_{0}):=\lim_{\nu\rightarrow\infty}\liminf_{n\rightarrow\infty}\mathbb{E}_{n}\min\big\{\mathbb{E}_{\pi_{\omega}}\{n\,q(\theta,\theta_{0})\mid z_{1:n}\},\nu\big\}. \]

The loss function $q:\Theta\times\Theta\rightarrow\mathbb{R}_{+}$ satisfies the following assumptions.

Assumption 3.

For all $\theta,\delta\in\Theta$, the loss function $q(\delta,\theta)$ satisfies: i) $q(\delta,\theta)\geq 0$, and $q(\delta,\theta)=0$ if and only if $\theta=\delta$; ii) the loss is absolutely continuous and three times continuously differentiable in $\delta$; iii) for $\delta$ in a neighbourhood of $\theta_{0}$, the matrix $\Upsilon(\delta)=\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}$ is continuous and positive semi-definite; iv) for each $i=1,\dots,d$, and for all $x\in\mathbb{R}^{d}$, $\delta\in\Theta$, $|x^{\top}\{\partial\Upsilon(\delta)/\partial\delta_{i}\}x|\leq M\|x\|^{2}$, where $\delta_{i}$ denotes the $i$-th component of $\delta$, and $M>0$.

Remark 4.

Assumption 3 includes losses such as squared error loss $q(\delta,\theta)=\|\delta-\theta\|^{2}$, but also permits intrinsic measures of accuracy for densities. The Kullback-Leibler divergence,

\[ q(\theta,\theta_{0})=\int_{\mathcal{Z}}f(z\mid\theta_{0})\log\frac{f(z\mid\theta_{0})}{f_{1}(z\mid\theta_{1})f_{2}(z\mid\theta_{2},\theta_{1})}\,\mathrm{d}z, \]

and various scoring rules, such as kernel scores or mean-variance scores (see Gneiting and Raftery, 2007, Sections 4 and 5), will satisfy Assumption 3. Assumption 3 does exclude discontinuous functions, such as those needed to measure the accuracy of quantiles. Extending our results to the case of discontinuous losses is the focus of subsequent work by the authors.

P-risk is related to expected asymptotic risk, which is often used to gauge the accuracy of frequentist point estimators. Since $\mathrm{R}_{q}(\pi_{\omega},\theta_{0})$ is calculated from a chosen posterior based on a decision made for $\omega$, we refer to this notion as P-risk to distinguish it from asymptotic risk. Risk has a long history in statistical analysis, and we refer to Chapter 6 of Lehmann and Casella (2006) and Chapter 8 of Van der Vaart (2000) for textbook treatments. The key benefit of using P-risk to compare different choices of $\omega$ is that, for the chosen loss $q(\cdot,\cdot)$, P-risk delivers a concrete ranking across inference procedures relative to this choice. Furthermore, our use of P-risk is not at odds with Bayesian inference, and has already been used by others, albeit in slightly different contexts. More generally, as argued by Lehmann and Casella (2006, p. 310): “The Bayesian paradigm is well suited for the construction of possible estimators, but is less well suited for their evaluation.” Consequently, we follow this suggestion and carry out inference via Bayesian methods but evaluate the accuracy of these methods using our notion of asymptotic risk (i.e., P-risk).
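In practice, the inner posterior expectation in the P-risk definition is easily estimated from posterior draws; a minimal sketch for squared error loss is given below (the outer expectation over datasets would be approximated by averaging this quantity over replications, as in the Monte Carlo experiments).

```python
# Plug-in estimate of the inner quantity in the P-risk definition,
# min{ E_pi[ n * q(theta, theta_0) | z ], nu }, for squared error loss q.
import numpy as np

def trimmed_posterior_risk(draws, theta0, n, nu=np.inf):
    """draws: (n_draws, d) posterior samples for a single dataset of size n."""
    inner = n * np.mean(np.sum((draws - theta0) ** 2, axis=1))
    return min(inner, nu)
```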

4.1 The P-risk of SMPs

As we saw in Sections 3.1 and 3.2, weights of the form in (6) deliver an SMP whose expected squared error was smaller than that of the cut posterior. Thus, in the remainder we focus our theoretical analysis on SMPs with pooling weights that generalize those in (6): for $\gamma_{n}$ a user-chosen sequence such that $\gamma_{n}\xrightarrow{P}\gamma$, $\gamma>0$, define the pooling weight

\[ \widehat{\omega}_{+}:=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{\gamma_{n}}{n(\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}})^{\top}\Upsilon(\overline{\theta}_{\mathrm{cut}})(\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}})}. \qquad (7) \]

In contrast to the weight $\widetilde{\omega}_{+}$ suggested in (6), the weight $\widehat{\omega}_{+}$ in (7) depends on the entire vector $\theta$ and on the curvature of the loss function, as captured by the matrix $\Upsilon(\theta)=\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}|_{\delta=\theta}$. Incorporating the curvature of the loss within the pooling weight is necessary for the SMP to deliver inferences that are accurate according to the chosen loss; see Theorem C.3 of Appendix C.3 for further discussion.

The pooling weight in (7) yields the shrinkage SMP (S-SMP):

\[ \pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n}):=\big\{(1-\widehat{\omega}_{+})\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})+\widehat{\omega}_{+}\,\pi(\theta_{1}\mid z_{1:n})\big\}\,\pi(\theta_{2}\mid\theta_{1},z_{1:n}). \qquad (8) \]

Different choices of $\gamma_{n}$ in (7) deliver different weights and ultimately different posteriors. However, the following result shows that across a range of choices for $\gamma_{n}$, the P-risk of $\pi_{\widehat{\omega}_{+}}$ in (8) is dominated by that of the cut posterior, and, under certain conditions, by that of the full posterior. To state this result simply, define the $d\times d$ matrix $\Upsilon:=\Upsilon(\theta_{0})$, and denote the $(d_{2}\times d_{2})$-block of $\Upsilon$ by $\Upsilon_{22}$ (see Assumption 3); let $\mathcal{M}$ be a $d\times d$ matrix with $(d_{1}\times d_{1})$-block $W:=(\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1})$, $(d_{2}\times d_{2})$-block $V:=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}$, and $(d_{1}\times d_{2})$-block $-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}$, where $\mathcal{I}_{11.2}:=\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}$. We remind the reader that $\mathcal{I}$, $\mathcal{I}_{jk}$ ($j,k\in\{1,2\}$) and $\mathcal{I}_{p(11)}$ are defined in Section 2.2.

Theorem 1.

Suppose Assumptions 2-3 and the regularity conditions in Appendix C.1 are satisfied. If $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ and $0<\gamma\leq 2(\mathrm{tr}\,\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)$, then:

(i) $\sup_{\eta:\|\eta\|<\infty}\{\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})\}\leq 0$;

(ii) $\sup_{\eta\in\mathcal{E}}\{\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})\}\leq 0$ for $\mathcal{E}:=\{\eta:\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}-\|\mathcal{I}_{22}^{-1}\eta_{2}\|^{2}_{\Upsilon_{22}}\geq\mathrm{tr}\,\Upsilon\mathcal{M}\}$.

Remark 5.

Under the class of misspecified DGPs in Assumption 2, and in terms of P-risk, Theorem 1 demonstrates that the cut posterior delivers a point estimator that is inadmissible, with a similar result also being true for the full posterior under additional conditions. Hence, from the standpoint of P-risk, the S-SMP is preferable to the cut posterior, and under certain conditions to the full posterior as well. However, it is unclear if the S-SMP has the lowest possible P-risk, i.e., if it is minimax. Answering this question requires deriving a local minimax efficiency bound for the class of DGPs in Assumption 2 (see, e.g., Chapter 8 of Van der Vaart, 2000 for a discussion of asymptotic minimax estimators), which is outside the scope of the current paper and is left for future research.

Remark 6.

As suggested by an anonymous referee, a Bayesian may also be interested in the uncertainty quantification of the S-SMP. As discussed in Remark 3 after Lemma 2, only credible sets based on $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ deliver calibrated uncertainty quantification. Therefore, if $\widehat{\omega}_{+}>0$, the credible sets of the S-SMP will not be calibrated. More generally, the credible sets calculated from the cut posterior, full posterior and S-SMP all depend, in different ways, on the magnitude of the bias induced by misspecification. Since $\eta$ is unknown in practice, it is not obvious how one should theoretically compare the behavior of such credible sets. In light of this issue, we believe that measuring the accuracy of the posteriors using P-risk is the most direct approach.

Remark 7.

The condition $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ in Theorem 1 is related to the inefficiency that results from using the cut posterior relative to the full posterior. This condition is more likely to be satisfied when the efficiency gap between the cut and full posteriors is large, or when $d$, the dimension of $\theta$, is large. That being said, the condition $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ is sufficient but not necessary, and the S-SMP may deliver smaller P-risk even when it is not satisfied.

Remark 8.

The example in Section 3.2 demonstrated that an S-SMP focused on inference for $\theta_{1}$ delivered smaller expected P-risk for $\theta_{1,0}$ than the cut posterior. However, Theorem 1 makes clear that the S-SMP can also deliver smaller P-risk for the full vector $\theta_{0}$. Returning to the example in Section 3.2, we now analyze the P-risk at $\theta_{0}$ for the S-SMP under the shrinkage weight

\[ \widehat{\omega}_{+}=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{\mathrm{tr}[\mathrm{Cov}_{\pi_{\mathrm{cut}}}(\theta)-\mathrm{Cov}_{\pi_{\mathrm{full}}}(\theta)]}{\|\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}}\|^{2}}\,\mathbb{I}\big\{\mathrm{tr}[\mathrm{Cov}_{\pi_{\mathrm{cut}}}(\theta)-\mathrm{Cov}_{\pi_{\mathrm{full}}}(\theta)]>0\big\}. \]

To this end, we repeat the Monte Carlo experiment in Section 3.2 under two different dimensions $d_{1}\in\{1,5\}$, so that $d\in\{2,6\}$, and present the results in Figure 4. These results show that the P-risk of the S-SMP is dominated by that of the cut posterior, and in certain cases by that of the full posterior; as with the example in Section 3.2, for $\delta=0.90$ and $d=2$, the S-SMP and cut posterior give very similar results. Theorem 1 is asymptotic, and given the sample sizes considered in this experiment it is not surprising that at large levels of contamination the S-SMP can perform slightly worse than the cut posterior (when $d=2$), since non-negligible weight is placed on the full posterior. However, at $\delta=0.90$ the median pooling weight across the replications is about $0.30$, so that most of the pooling weight corresponds to the cut posterior. As we shall see shortly, under higher levels of misspecification and as $n$ increases, the S-SMP resembles the cut posterior.


Figure 4: Monte Carlo estimate of expected risk for $\theta_{0}$ under squared error loss across different levels of contamination ($\delta$). Please see Section 3.2 and Figure 3 for further details.

Part one of Theorem 1 implies that if the cut posterior is inefficient relative to the full posterior, as measured by $\text{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$, then the S-SMP will be at least as accurate (in P-risk) as the cut posterior, and potentially more accurate than the full posterior. (Theorem 1 applies even if $\eta=0$: when there is no misspecification bias the cut and full posterior means will be similar and the weight $\widehat{\omega}_{+}$ will be close to unity, so that the S-SMP will resemble the full posterior.) The second part of Theorem 1 gives a sufficient, but not necessary, condition which guarantees that the P-risk of the S-SMP is no larger than that of the full posterior. This condition is likely to be satisfied when the difference in posterior locations is larger than the difference in posterior variances.

When $d>2$ it is also possible to obtain an analytic expression for the P-risk of the S-SMP if we set $q(\theta,\theta_{0})=\frac{1}{2}(\theta-\theta_{0})^{\top}\mathcal{M}^{-1}(\theta-\theta_{0})$, so that $\Upsilon=\mathcal{M}^{-1}$. The requirement that $d>2$ is commonly encountered in the risk analysis of James-Stein estimators and is a consequence of the fact that $\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})$ can be viewed as performing a type of posterior shrinkage.

Theorem 2.

Suppose that Assumptions 2-3 and the regularity conditions in Appendix C.1 are satisfied. If $\Upsilon=\mathcal{M}^{-1}$, $d>2$, and $0<\gamma\leq 2(d-2)$, then for any finite $\eta$,

\[\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\frac{\gamma\{2(d-2)-\gamma\}(d-3)!}{2\,d!}\,{}_{1}\mathrm{F}_{1}(d-1;d;\lambda)\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}),\]

where ${}_{1}\mathrm{F}_{1}(k-1;k;\lambda)$ is the confluent hypergeometric function.

Theorem 2 is useful as it gives an exact bound on the P-risk, and an easily interpretable set of conditions on the value of $\gamma$ in the weight $\widehat{\omega}_{+}$. Under $\Upsilon=\mathcal{M}^{-1}$, the condition $d>2$ is necessary to guarantee that $\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})$ has smaller P-risk than $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$. This condition is related to Stein's phenomenon (see Ch. 6 of Lehmann and Casella, 2006 for a discussion) and implies that using the cut posterior by itself is sub-optimal (in terms of P-risk) when $d>2$. We stress that this interpretation is only valid when $\Upsilon=\mathcal{M}^{-1}$, and that a similar phenomenon does not necessarily extend to other loss functions.
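To get a feel for the size of the exact risk reduction in Theorem 2, the expression can be evaluated directly; a small sketch follows, using the reconstruction ${}_{1}\mathrm{F}_{1}(d-1;d;\lambda)$ above, with arbitrary illustrative values of $d$, $\gamma$ and $\lambda$.

```python
from math import factorial
from scipy.special import hyp1f1

def risk_reduction(d: int, gamma: float, lam: float) -> float:
    """gamma * {2(d-2) - gamma} * (d-3)! / (2 d!) * 1F1(d-1; d; lam)."""
    assert d > 2 and 0.0 < gamma <= 2 * (d - 2), "conditions of Theorem 2"
    coef = gamma * (2 * (d - 2) - gamma) * factorial(d - 3) / (2 * factorial(d))
    return coef * hyp1f1(d - 1, d, lam)

for d in (3, 6, 10):   # the reduction relative to the cut posterior's P-risk
    print(d, risk_reduction(d, gamma=float(d - 2), lam=-1.0))
```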

Theorems 1-2 demonstrate that, under certain conditions and in terms of P-risk, the S-SMP is more accurate than the cut posterior and possibly the full posterior. However, Theorems 1-2 implicitly require that the difference between the cut and full posterior locations does not diverge, which is a consequence of the asymptotic regime in Assumption 2. This raises the question of what happens to the S-SMP when we move from the case of local model misspecification (Assumption 2) to gross model misspecification (Assumption 1). The following result demonstrates that under gross model misspecification the S-SMP converges to the cut posterior, and so is robust to either form of misspecification.

Corollary 1.

Suppose that Assumption 1, Assumption 3, and the regularity conditions in Appendix B.1 are satisfied. If $\|\overline{\theta}_{\mathrm{cut}}-\theta_{0}\|=O_{p}(n^{-1/2})$ and $\|\overline{\theta}_{\mathrm{full}}-\theta_{\star}\|=O_{p}(n^{-1/2})$, then $\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta=o_{p}(1)$.

5 Additional examples

5.1 Normal-normal random effects model

We first apply the S-SMP to the misspecified normal-normal random effects model presented in Liu et al. (2009). The observed data are $z_{ij}$, comprising observations on groups $i=1,\dots,N$, with $j=1,\dots,J$ observations in each group, which we assume are generated from the model $z_{ij}\mid\beta_{i},\varphi_{i}^{2}\stackrel{iid}{\sim}N(\beta_{i},\varphi_{i}^{2})$, with random effects $\beta_{i}\mid\nu\stackrel{iid}{\sim}N(0,\nu^{2})$. The goal of the analysis is to conduct inference on the standard deviation of the random effects, $\nu$, and the residual standard deviation parameters $\varphi=(\varphi_{1},\dots,\varphi_{N})^{\top}$. Below we write $\beta=(\beta_{1},\dots,\beta_{N})^{\top}$ and $\zeta=(\nu,\beta^{\top})^{\top}$.

For $\bar{z}_{i}=J^{-1}\sum_{j=1}^{J}z_{ij}$ and $s_{i}^{2}=\sum_{j=1}^{J}(z_{ij}-\bar{z}_{i})^{2}$, $i=1,\dots,N$, the likelihood for $(\zeta,\varphi)$ can be written to depend only on the sufficient statistics $\bar{z}=(\bar{z}_{1},\dots,\bar{z}_{N})^{\top}$ and $s^{2}=(s_{1}^{2},\dots,s_{N}^{2})^{\top}$, where, independently for $i=1,\dots,N$,

\[\bar{z}_{i}\mid\zeta,\varphi\sim N(\beta_{i},\varphi_{i}^{2}/J),\quad s^{2}_{i}\mid\varphi\sim\text{Gamma}\left(\frac{J-1}{2},\frac{1}{2\varphi_{i}^{2}}\right).\]

Letting $\theta_{1}=\varphi$ and $\theta_{2}=\zeta$, the random effects model can then be written as a two-module system of the form shown in Figure 1: module one depends on $(s^{2},\theta_{1})$, with $X=s^{2}$, and module two depends on $(\bar{z},\theta_{2},\theta_{1})$, with $Y=\bar{z}$.

Let $\text{Gamma}(x;A,B)$ denote the value of the $\text{Gamma}(A,B)$ density evaluated at $x$, and $N(x;\mu,\sigma^{2})$ the value of the $N(\mu,\sigma^{2})$ density evaluated at $x$. The first module has likelihood $f_{1}(X\mid\theta_{1})=\prod_{i=1}^{N}\text{Gamma}\big(s_{i}^{2};\frac{J-1}{2},\frac{1}{2\theta_{1,i}^{2}}\big)$, while the second module has likelihood $f_{2}(Y\mid\theta)=\prod_{i=1}^{N}N(\bar{Z}_{i};\beta_{i},\theta_{1,i}^{2}/J)$.

When the Gaussian prior for the random effects term $\beta_{i}$ conflicts with the likelihood information, inferences for $\theta_{1,i}=\varphi_{i}$ can be adversely impacted. Such an outcome will occur when, for instance, a value of $\beta_{i}$ differs markedly from its assumed model. Liu et al. (2009) argue that the thin-tailed Gaussian assumption for the random effects can produce poor inferences for $\theta_{1,i}$ due to the feedback induced by the likelihood term $N(\bar{Z}_{i};\beta_{i},\varphi_{i}^{2}/J)$ in the second module. To guard against this, Liu et al. (2009) propose cut posterior inference for $\theta_{1}\mid s^{2}$, which can be accommodated by simply updating the posterior for $\theta_{1}$ using only the information in the corresponding summary statistics $s^{2}$: given $\pi(\theta_{1,i}^{2})\propto(\theta_{1,i}^{2})^{-1}$, independently across $i=1,\dots,N$, the cut posterior for $\theta_{1}^{2}$ (where this denotes the elementwise square of $\theta_{1}$) is

\[\pi_{\mathrm{cut}}(\theta_{1}^{2}\mid X)\propto\prod_{i=1}^{N}(\theta_{1,i}^{2})^{-\frac{J+1}{2}}\exp\left\{-\frac{J s_{i}^{2}}{2\theta_{1,i}^{2}}\right\}.\]

Summaries of the cut posterior for $\theta_{1}$ can be obtained by sampling from the cut posterior for $\theta_{1}^{2}$ and transforming the samples; a sampling sketch is given below. Joint inferences for $(\zeta,\varphi)$ can be carried out using the cut posterior distribution

\[\pi_{\mathrm{cut}}(\zeta,\varphi\mid X,Y)=\pi_{\mathrm{cut}}(\varphi\mid X)\,\pi(\zeta\mid Y,\varphi)=\pi_{\mathrm{cut}}(\varphi\mid s^{2})\,\pi(\zeta\mid\bar{Z},\varphi),\quad\varphi=\theta_{1},\]

where the conditional posterior $\pi(\zeta\mid\bar{Z},\varphi)$ is obtained from the joint posterior for $(\theta_{1},\theta_{2})$.
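Because the displayed cut posterior factorizes over groups, with each $\theta_{1,i}^{2}$ following an inverse-gamma density, exact sampling is straightforward. A minimal sketch (not the authors' code) is:

```python
import numpy as np
from scipy.stats import invgamma

def sample_cut_theta1(s2: np.ndarray, J: int, n_draws: int, seed: int = 0) -> np.ndarray:
    """Cut-posterior draws of theta_1 = phi, given s2 = (s_1^2, ..., s_N^2)."""
    rng = np.random.default_rng(seed)
    # (theta^2)^{-(J+1)/2} exp{-J s_i^2/(2 theta^2)} is InvGamma((J-1)/2, J s_i^2/2).
    theta1_sq = invgamma.rvs(a=(J - 1) / 2, scale=J * s2 / 2,
                             size=(n_draws, s2.size), random_state=rng)
    return np.sqrt(theta1_sq)  # transform draws of theta_1^2 to draws of theta_1
```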

We now demonstrate that the S-SMP delivers inferences for $\theta_{1,i}$ that are more accurate than those of the cut posterior across different levels of misspecification. We generate 500 repeated samples from the normal-normal random effects model with $N=100$ groups, each with random effect component $\beta_{i}$, $i=1,\dots,N$. For each group we set $\theta_{1,i}:=\varphi_{i}=0.50$, and we set $\nu=1$. We induce model misspecification through the random effect term $\beta_{1}$. Following the design of Liu and Goudie (2022b), we induce misspecification by forcing $\beta_{1}$ to be an outlier; however, unlike Liu and Goudie (2022b), we let the magnitude of the outlier decrease as the number of individual observations in each group, $J$, increases. We set the random effect for the first group as $\beta_{1}=50/J$, and consider $J\in\{5,10,20,50,100\}$. When $J$ is small the cut posterior delivers more accurate inferences for $\theta_{1,1}$ than the full posterior, as the feedback between this outlier and $\theta_{1,1}$ has been removed. As the magnitude of the outlier shrinks, the full posterior for $\theta_{1,1}$ becomes more accurate and a meaningful trade-off between the cut and full posteriors exists.
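For concreteness, the simulation design just described can be generated as follows (a sketch; the function name and defaults are ours):

```python
import numpy as np

def simulate(J: int, N: int = 100, phi: float = 0.50, nu: float = 1.0,
             seed: int = 0) -> np.ndarray:
    """Generate one (N, J) dataset from the outlier design described above."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, nu, size=N)     # random effects beta_i ~ N(0, nu^2)
    beta[0] = 50.0 / J                     # misspecification: outlying first group
    return rng.normal(beta[:, None], phi, size=(N, J))  # z_ij ~ N(beta_i, phi^2)
```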

Our goal is to measure the impact of misspecification on the inferences for $\theta_{1,1}$, and so we choose the weight in the S-SMP using squared error loss for this component only. This produces a pooling weight that is similar to that discussed in Section 3.2, but based on $\theta_{1,1}$ rather than the entire vector $\theta_{1}$. (Since only the first random effect component, $\beta_{1}$, is misspecified, inferences for $\theta_{1,2},\dots,\theta_{1,N}$ are not impacted by misspecification. Thus, even if the pooling weight were estimated based on the entire vector $\theta_{1}$, it would be the $\theta_{1,1}$ component that drives the pooling weight, since the cut and full posterior means are very similar for the remaining components.) We present Monte Carlo estimates (calculated over the 500 replicated datasets) of the corresponding P-risk under squared error loss for the posteriors in Table 1. Across all values of $J$, the S-SMP has P-risk no larger than that of the cut posterior, and in many cases smaller than that of the full posterior as well. Further, the cut posterior is more accurate than the full posterior when $J$ is small, but the full posterior becomes more accurate as $J$ increases. Correspondingly, the weight on the full posterior in the S-SMP is close to zero for small $J$ and increases as $J$ increases. However, for large $J$ the cut and full posteriors behave similarly, and the S-SMP maintains most of the weight on the cut posterior.

                          J=5        J=10     J=20     J=50     J=100
S-SMP                     10.5365    2.9705   1.1753   0.3723   0.1897
Full                      8678.9763  4.6894   1.3124   0.5491   0.2634
Cut                       10.5365    3.5806   1.4721   0.5102   0.2581
$\widehat{\omega}_{+}$    0.00       0.18     0.23     0.15     0.15

Table 1: P-risk values under squared error loss, multiplied by 100 for readability. The S-SMP attains the lowest (or tied lowest) risk in every column. $\widehat{\omega}_{+}$ is the average pooling weight across the replications. Misspecification decreases as the number of individual observations per group ($J$) increases.

5.2 Archaeological example

Our final example, discussed in Styring et al. (2017), Carmona and Nicholls (2020) and Yu et al. (2023), involves data collected to evaluate an “extensification hypothesis” for early Mesopotamian agricultural practices. The hypothesis states that as cities grew, agriculture extended over larger areas with less intensive cultivation, rather than more intensively farming existing areas to meet food demands.

The analysis uses two data sources: an archaeological dataset and a modern experimental dataset. Figure 5, which is similar to Figure 6 from Carmona and Nicholls (2020), shows a graphical representation of the model which comprises two modules. The first, the “HM module”, is a Gaussian linear regression model incorporating random effects. The second, the “PO module”, is a proportional odds model used to impute a missing categorical covariate for the HM module.

Figure 5: Graphical representation of model for the agricultural extensification example.

In the HM module's regression, the response is the nitrogen level of cereal grains, denoted $Z$. We follow the notation of Yu et al. (2023) and use subscripts $A$ and $M$ to denote archaeological and modern values of any variable. So, for example, $Z_{A}$ and $Z_{M}$ are nitrogen levels of cereal grains for archaeological and modern data, respectively. For covariates in the HM module we have crop category $C\in\{\text{Wheat},\text{Barley}\}$, site location $P$ (a categorical variable), site size $S$, rainfall $R$, and manure level $M\in\{m_{\text{low}},m_{\text{medium}},m_{\text{high}}\}$. Archaeological data for rainfall and manure level, $R_{A}$ and $M_{A}$, are missing. The HM module is a linear regression model with fixed effects for rainfall and manure level, a random effect for site location, and error variance based on crop category.

The imputation model in the PO module imputes the missing manuring level covariates for the archaeological data, with parameters $\theta_{2}=(\gamma,\alpha,\xi,\sigma_{\xi})^{\top}$. The prior on $M_{A}$ is a proportional odds model with covariates site size and site location. The parameter $\gamma$ is the site size coefficient; a negative $\gamma$ supports the extensification hypothesis. The parameter $\xi$ is a vector of random effects for five archaeological site locations in the proportional odds model, $\sigma_{\xi}$ is the standard deviation of the random effects, and $\alpha$ is a vector of two threshold parameters. Further details on the model and priors are available in Appendix B.1 of Yu et al. (2023). Bayesian modular inference is relevant in this example because the PO module may be poorly specified. Therefore, we can cut feedback so that $M_{A}$ is imputed based solely on the hierarchical model for cereal grain nitrogen levels (the HM module in Figure 5), ensuring that the imputation of $M_{A}$ and the interpretation of $\gamma$ are unaffected by any misspecification in the PO module.

In Figure 5, the red line indicates a “cut” between the modules. Figure 6 provides a simplified model structure.

Figure 6: Simplified graphical representation of the model for the agricultural extensification example.

Although it looks different from the two-module system in Figure 1, the cut posterior has the same form, allowing cut and semi-modular inference to proceed similarly. Here we consider how semi-modular inference changes according to the choice of the loss function. For a given scalar parameter $\tau$, we will consider an S-SMP posterior using mixing weight

\[\widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}(\tau)\},\qquad\widetilde{\omega}(\tau)=\frac{\sigma_{\tau,\mathrm{cut}}^{2}-\sigma_{\tau,\mathrm{full}}^{2}}{(\overline{\tau}_{\mathrm{cut}}-\overline{\tau}_{\mathrm{full}})^{2}}\,\mathbb{I}(\sigma_{\tau,\mathrm{cut}}^{2}-\sigma_{\tau,\mathrm{full}}^{2}>0),\]

where $\sigma_{\tau,\mathrm{cut}}^{2}$ and $\sigma_{\tau,\mathrm{full}}^{2}$ are the cut and full marginal posterior variances for $\tau$, and $\overline{\tau}_{\mathrm{cut}}$ and $\overline{\tau}_{\mathrm{full}}$ are the cut and full marginal posterior means for $\tau$. This is similar to the S-SMP in Section 3.2, but based on the marginal full and cut posteriors for $\tau$.
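Given draws of the scalar $\tau$ under the two posteriors, the marginal weight is a one-line computation; a sketch (with array names of our choosing) is:

```python
import numpy as np

def marginal_weight(tau_cut: np.ndarray, tau_full: np.ndarray) -> float:
    """omega-tilde-plus for a scalar parameter, from 1-d arrays of posterior draws."""
    var_gap = tau_cut.var(ddof=1) - tau_full.var(ddof=1)
    if var_gap <= 0.0:                  # the indicator term in omega-tilde
        return 0.0
    return min(1.0, var_gap / (tau_cut.mean() - tau_full.mean()) ** 2)
```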

Figure 7 shows the cut, full and S-SMP posteriors for $\gamma$ (the parameter of main interest) and the proportional odds regression random effects $\xi_{1},\dots,\xi_{5}$. When the mixing weight is based on $\gamma$ (top left), the S-SMP reduces to the cut posterior (a full cut), while basing the mixing weights on $\xi_{1},\dots,\xi_{5}$ results in mixing weights varying between $0.2$ and $1$. In this example, the shrinkage weight can vary a great deal depending on which scalar parameter is targeted in the loss function, and so the use of an appropriately defined loss function for the application is crucial. The S-SMP for $\gamma$ shows weaker evidence for the extensification hypothesis than the standard posterior, in the sense that the posterior probability of $\gamma<0$ is smaller. Details of the MCMC approach for generating samples from the cut posterior, as well as an SMC method to generate samples from the full posterior, are given in Yu et al. (2023).

Figure 7: Marginal cut, conventional and S-SMP posteriors for different parameters in the proportional odds regression model for the archaeological example. The title for each graph shows the parameter used, and the S-SMP mixing weight used for that parameter.

6 Discussion

Choosing between the cut and full posteriors is difficult when model misspecification is not severe, and in such cases the semi-modular posteriors (SMPs) proposed in Carmona and Nicholls (2020) are an attractive alternative. While SMPs are motivated by the presence of a bias-variance trade-off between cut and full posterior inferences, this paper is the first to formalize the existence of such a trade-off. Using SMPs based on linear opinion pooling, we devise a novel pooling weight that allows the SMP to leverage this bias-variance trade-off. Our proposed shrinkage SMP is simple to implement and possesses useful theoretical guarantees that other SMP approaches do not: the posterior risk of our shrinkage SMP is no larger than that of the cut posterior and, under certain conditions, that of the full posterior. An interesting future direction would be to determine whether our theoretical results can be extended to other types of SMPs, such as those of Carmona and Nicholls (2020) and Nicholls et al. (2022).

As suggested by a referee, the notion of asymptotic risk we consider is only one criterion with which to judge the accuracy of posterior inferences, with posterior predictive accuracy and the validity of posterior credible sets being alternative measures. While assessing the accuracy of different methods based on posterior predictive accuracy is empirically feasible, it is not obvious that a ranking across different modular Bayesian methods can be deduced from this criterion under our maintained assumptions. Further, the random weighting of the SMP means that determining the asymptotic shape of this posterior, and thus the behavior of its credible sets, is not straightforward. We leave these interesting topics for future study.

Acknowledgments

David T. Frazier gratefully acknowledges support from the Australian Research Council through grant DE200101070. David Nott's research was supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 2 (MOE-T2EP20123-0009). We thank seminar participants at the Weierstrass Institute for Applied Analysis and Stochastics, and participants at the Computational methods for unifying multiple statistical analyses (Fusion) workshop, for helpful comments. In addition, we thank Pierre Jacob for helpful comments on some of the stated results. The authors also thank the associate editor and referees for very helpful comments that significantly improved the paper.

References

  • Bennett and Wakefield (2001) Bennett, J. and J. Wakefield (2001). Errors-in-variables in joint population pharmacokinetic/pharmacodynamic modeling. Biometrics 57(3), 803–812.
  • Bissiri et al. (2016) Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society. Series B, Statistical methodology 78(5), 1103.
  • Blangiardo et al. (2011) Blangiardo, M., A. Hansell, and S. Richardson (2011). A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment 45(2), 379–386.
  • Carmona and Nicholls (2020) Carmona, C. and G. Nicholls (2020). Semi-modular inference: enhanced learning in multi-modular models by tempering the influence of components. In International Conference on Artificial Intelligence and Statistics, pp.  4226–4235. PMLR.
  • Carmona and Nicholls (2022) Carmona, C. and G. Nicholls (2022). Scalable semi-modular inference with variational meta-posteriors. arXiv:2204.00296.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76(1), 1–32.
  • Castillo (2014) Castillo, I. (2014). On Bayesian supremum norm contraction rates. Annals of Statistics.
  • Chakraborty et al. (2022) Chakraborty, A., D. J. Nott, C. Drovandi, D. T. Frazier, and S. A. Sisson (2022). Modularized Bayesian analyses and cutting feedback in likelihood-free inference. Statistics and Computing (To appear).
  • Claeskens and Hjort (2003) Claeskens, G. and N. L. Hjort (2003). The focused information criterion. Journal of the American Statistical Association 98(464), 900–916.
  • Davidson (1994) Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. OUP Oxford.
  • Frazier and Nott (2024) Frazier, D. T. and D. J. Nott (2024). Cutting feedback and modularized analyses in generalized Bayesian inference. Bayesian Analysis 1(1), 1–29.
  • Gneiting and Raftery (2007) Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association 102(477), 359–378.
  • Green and Strawderman (1991) Green, E. J. and W. E. Strawderman (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators. Journal of the American Statistical Association 86(416), 1001–1006.
  • Grünwald and Van Ommen (2017) Grünwald, P. and T. Van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12(4), 1069–1103.
  • Hansen (2016) Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics 190(1), 115–132.
  • Hjort and Claeskens (2003) Hjort, N. L. and G. Claeskens (2003). Frequentist model average estimators. Journal of the American Statistical Association 98(464), 879–899.
  • Jacob et al. (2017) Jacob, P. E., L. M. Murray, C. C. Holmes, and C. P. Robert (2017). Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
  • Jacob et al. (2020) Jacob, P. E., J. O’Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 543–600.
  • Judge and Mittelhammer (2004) Judge, G. G. and R. C. Mittelhammer (2004). A semiparametric basis for combining estimation problems under quadratic loss. Journal of the American Statistical Association 99(466), 479–487.
  • Kim and White (2001) Kim, T.-H. and H. White (2001). James-Stein-type estimators in large samples with application to the least absolute deviations estimator. Journal of the American Statistical Association 96(454), 697–705.
  • Kleijn and van der Vaart (2012) Kleijn, B. and A. van der Vaart (2012). The Bernstein-von-Mises theorem under misspecification. Electron. J. Statist. 6, 354–381.
  • Lee and Lee (2018) Lee, K. and J. Lee (2018). Optimal Bayesian minimax rates for unconstrained large covariance matrices. Bayesian Analysis.
  • Lehmann and Casella (2006) Lehmann, E. L. and G. Casella (2006). Theory of point estimation. Springer Science & Business Media.
  • Liu et al. (2009) Liu, F., M. Bayarri, and J. Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis 4(1), 119–150.
  • Liu and Goudie (2022a) Liu, Y. and R. J. B. Goudie (2022a). A general framework for cutting feedback within modularized Bayesian inference. arXiv:2211.03274.
  • Liu and Goudie (2022b) Liu, Y. and R. J. B. Goudie (2022b). Stochastic approximation cut algorithm for inference in modularized Bayesian models. Statistics and Computing 32(7).
  • Liu and Goudie (2023) Liu, Y. and R. J. B. Goudie (2023). Generalized geographically weighted regression model within a modularized Bayesian framework. Bayesian Analysis (To appear).
  • Lunn et al. (2009) Lunn, D., N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander (2009). Combining MCMC with ‘sequential’ PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics 36, 19–38.
  • Maucort-Boulch et al. (2008) Maucort-Boulch, D., S. Franceschi, and M. Plummer (2008). International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers 17(3), 717–720.
  • Miller (2021) Miller, J. W. (2021). Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research 22(168), 1–53.
  • Newey (1985) Newey, W. K. (1985). Maximum likelihood specification testing and conditional moment tests. Econometrica: Journal of the Econometric Society, 1047–1070.
  • Nicholls et al. (2022) Nicholls, G. K., J. E. Lee, C.-H. Wu, and C. U. Carmona (2022). Valid belief updates for prequentially additive loss functions arising in semi-modular inference. arXiv preprint arXiv:2201.09706.
  • Plummer (2015) Plummer, M. (2015). Cuts in Bayesian graphical models. Statistics and Computing 25(1), 37–43.
  • Pompe and Jacob (2021) Pompe, E. and P. E. Jacob (2021). Asymptotics of cut distributions and robust modular inference using posterior bootstrap. arXiv preprint arXiv:2110.11149.
  • Rieder (2012) Rieder, H. (2012). Robust Asymptotic Statistics: Volume I. Springer Science & Business Media.
  • Rousseau (1997) Rousseau, J. (1997). Asymptotic bayes risks for a general class of losses. Statistics & probability letters 35(2), 115–121.
  • Shen and Wasserman (2001) Shen, X. and L. Wasserman (2001). Rates of convergence of posterior distributions. The Annals of Statistics 29(3), 687–714.
  • Stone (1961) Stone, M. (1961). The opinion pool. The Annals of Mathematical Statistics, 1339–1342.
  • Styring et al. (2017) Styring, A., M. Charles, F. Fantone, M. Hald, A. McMahon, R. Meadow, G. Nicholls, A. Patel, M. Pitre, A. Smith, A. Sołtysiak, G. Stein, J. Weber, H. Weiss, and A. Bogaard (2017). Isotope evidence for agricultural extensification reveals how the world’s first cities were fed. Nature Plants 3, 17076.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
  • Yu et al. (2023) Yu, X., D. J. Nott, and M. S. Smith (2023). Variational inference for cutting feedback in misspecified models. Statistical Science 38(3), 490–509.

Appendix A Supplementary Material

This supplementary material contains the regularity conditions used to obtain the results in the main text, proofs of all stated results and several lemmas used to prove the main results. The regularity conditions and proofs are broken up into two sections that depend on whether the analysis is conducted under gross model misspecification (Assumption 1), or local model misspecification (Assumption 2). In addition, this material contains further details of the HPV and cervical cancer incidence example introduced in Section 2.1 of the main text, and additional experiments for the biased means example in Section 3.2.

Appendix B Gross Misspecification: Assumption 1

B.1 Regularity Conditions

The regularity conditions used to prove Lemma 1 are similar to those used to deduce posterior concentration rates in generalized Bayesian methods; see, e.g., Shen and Wasserman (2001), as well as Miller (2021). We state the assumptions separately for the cut posterior and the full posterior. Recall $\ell_{p}(\theta_{1})=\log f_{1}(z_{1:n}\mid\theta_{1})$, and rewrite the cut posterior for $\theta_{1}$ as

\[\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})=\frac{\exp\{\ell_{p}(\theta_{1})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}=\frac{\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}.\]

For $A\subseteq\Theta_{1}$, write $M_{1,n}(A)=\int_{A}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}$, so that $\Pi_{\mathrm{cut}}(\theta_{1}\in A\mid z_{1:n})=M_{1,n}(A)/M_{1,n}(\Theta_{1})$. For $\varepsilon>0$, define $\Theta_{1}(\sqrt{n\varepsilon^{2}}):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,0}\|\leq\sqrt{n\varepsilon^{2}}\}$.

Assumption B1.

The following are satisfied.

  1. (i) For any $\delta>0$ there exists $\varepsilon>0$ and a sufficiently large $K>0$ such that
\[P^{(n)}_{0}\left[\sup_{\|\theta_{1}-\theta_{1,0}\|\geq\delta}\frac{1}{n}\left\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\right\}\geq-K\varepsilon\right]=o(1).\]

  2. (ii) For all $n\geq 1$ and $\varepsilon>0$, with $b_{n}=(1/2)\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}e^{-2n\varepsilon^{2}}$,
\[P^{(n)}_{0}\left\{M_{1,n}(\Theta_{1})\leq b_{n}\right\}\leq 2/(n\varepsilon^{2}).\]

  3. (iii) For $\varepsilon>0$ and $c>0$, $\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}\gtrsim\exp\{-cn\varepsilon^{2}\}$.

  4. (iv) For any $\theta_{1},\theta_{1}^{\prime}\in\Theta_{1}$, if $\theta_{1}\neq\theta_{1}^{\prime}$, then $f_{1}(z\mid\theta_{1})\neq f_{1}(z\mid\theta_{1}^{\prime})$ with positive probability.

Remark 9.

Parts (i) and (iii) of Assumption B1 are identical to those maintained in Shen and Wasserman (2001), but stated for the partial log-likelihood $\ell_{p}(\theta_{1})$, while the bound on the posterior denominator in Assumption B1(ii) is maintained to simplify the proofs and can be removed at the cost of additional technicalities, for instance, by using arguments similar to those of Lemma 1 in Shen and Wasserman (2001).

From the definition of $\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$,

\begin{align*}\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}&=\int_{\mathcal{Z}}\log\frac{f_{1}(z\mid\theta_{1,0})\delta_{0}(z)}{f_{1}(z\mid\theta_{1})f_{2}(z\mid\theta)}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z\\&=\int_{\mathcal{Z}}\log\frac{f_{1}(z\mid\theta_{1,0})}{f_{1}(z\mid\theta_{1})}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z+\int_{\mathcal{Z}}\log\frac{\delta_{0}(z)}{f_{2}(z\mid\theta)}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z.\end{align*}

Setting $\theta_{1}=\theta_{1,0}$ minimizes the first component of $\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$, but does not minimize both components. Hence, under Assumption 1, $\theta_{\star}:=\operatorname{argmin}_{\theta\in\Theta}\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$ is the value onto which we would expect the full posterior to concentrate as $n\rightarrow\infty$. Thus, conducting joint Bayesian inference on $\theta$ under Assumption 1 results in a posterior for which $\theta_{1}$ will not concentrate onto $\theta_{1,0}$. To formally prove this result, recall the definition $\ell(\theta)=\log f(z_{1:n}\mid\theta)$, and consider the following regularity conditions, which are analogues of Assumption B1 for $\ell(\theta)$ and $\pi(\theta)$.

Assumption B2.

The following are satisfied.

  1. (i) For any $\delta>0$ there exists $\varepsilon>0$ and a sufficiently large $K>0$ such that, for some $\theta_{\star}\in\Theta$,
\[P^{(n)}_{0}\left[\sup_{\|\theta-\theta_{\star}\|\geq\delta}\frac{1}{n}\left\{\ell(\theta)-\ell(\theta_{\star})\right\}\geq-K\varepsilon\right]=o(1).\]

  2. (ii) For all $n\geq 1$ and $\varepsilon>0$, with $\Theta(\sqrt{n\varepsilon^{2}}):=\{\theta\in\Theta:\|\theta-\theta_{\star}\|\leq\sqrt{n\varepsilon^{2}}\}$ and $b_{n}=(1/2)\Pi\{\Theta(\sqrt{n\varepsilon^{2}})\}e^{-2n\varepsilon^{2}}$,
\[P^{(n)}_{0}\left\{M_{n}(\Theta)\leq b_{n}\right\}\leq 2/(n\varepsilon^{2}).\]

  3. (iii) For $\varepsilon>0$ and $c>0$, $\Pi\{\Theta(\sqrt{n\varepsilon^{2}})\}\gtrsim\exp\{-2cn\varepsilon^{2}\}$.

  4. (iv) For any $\theta,\theta^{\prime}\in\Theta$, if $\theta\neq\theta^{\prime}$, then $f(z\mid\theta)\neq f(z\mid\theta^{\prime})$ with positive probability.

B.2 Proofs of Main Results: Gross Misspecification

Proof of Lemma 1.

We prove the two cases separately, starting with the cut posterior.

Part 1: Cut posterior.

For $\varepsilon>0$, recall $\Theta_{1}(\varepsilon):=\{\theta\in\Theta:\|\theta_{1}-\theta_{1,0}\|\leq\varepsilon\}$, and consider

\[\Pi_{\mathrm{cut}}\{\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}=\int_{\Theta_{1}(\varepsilon)^{c}}\frac{\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}\mathrm{d}\theta_{1}=\frac{M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}}{M_{1,n}\{\Theta_{1}\}}.\]

Apply Assumption B1(ii) to see that, with probability at least $1-2/(n\varepsilon^{2})$,

\begin{align}\Pi_{\mathrm{cut}}\{\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}\leq\frac{M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}}{b_{n}}\leq 2\frac{e^{2n\varepsilon^{2}}}{\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}\lesssim e^{4n\varepsilon^{2}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\},\tag{9}\end{align}

where the final bound follows by Assumption B1(iii).

Focus on the term $M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}$ in (9). Since $\Theta_{1}(\varepsilon)$ is bounded for any finite $n$, the log-likelihood ratio $\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})$ is also bounded for any $n$, and so by Assumption B1(i),

\begin{align*}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}&=\int\mathbf{1}\left\{\Theta_{1}(\varepsilon)^{c}\right\}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}\\&\leq\exp\{-nK\varepsilon^{2}\}\Pi\{\Theta_{1}(\varepsilon)^{c}\}\leq\exp\{-nK\varepsilon^{2}\},\end{align*}

with probability converging to one (since $\Pi(A)\leq 1$ for all $A\subseteq\Theta$).

Placing this bound into equation (9), and taking $K=8c$, we obtain

\begin{align*}\Pi_{\mathrm{cut}}\{\theta\in\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}&\lesssim e^{4n\varepsilon^{2}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}\lesssim\exp\{4cn\varepsilon^{2}-cKn\varepsilon^{2}\}=\exp\{-4cn\varepsilon^{2}\}.\end{align*}

For any $\varepsilon\leq\log(n)/\sqrt{n}$, the stated result follows.

Part 2: Exact posterior. Repeating arguments similar to those above, but for the set $\Theta_{\star}(\varepsilon):=\{\theta\in\Theta:\|\theta-\theta_{\star}\|\leq\varepsilon\}$, proves that, with probability converging to one,

\[\Pi_{\mathrm{full}}\{\theta\in\Theta_{\star}(\varepsilon)\mid z_{1:n}\}\gtrsim 1-\exp\{-4cn\varepsilon^{2}\}.\]

However, defining $\Theta_{1,\star}(\varepsilon):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,\star}\|\leq\varepsilon\}$, since $\Theta_{\star}(\varepsilon)\subset\Theta_{1,\star}(\varepsilon)\times\Theta_{2}$, it follows that

\begin{align*}1-C\exp\{-4cn\varepsilon^{2}\}\leq\Pi_{\mathrm{full}}\{\theta\in\Theta_{\star}(\varepsilon)\mid z_{1:n}\}&=\int_{\Theta_{\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&\leq\int_{\Theta_{1,\star}(\varepsilon)}\int_{\Theta_{2}}\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&=\int_{\Theta_{1,\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}.\end{align*}

Hence, with probability converging to one, $\int_{\Theta_{1,\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\rightarrow 1$. Since $\theta_{1,0}\neq\theta_{1,\star}$ under Assumption 1, there exists some $\varepsilon>0$ such that $\theta_{1,0}\not\in\Theta_{1,\star}(\varepsilon)$, so that for any $\widetilde{\varepsilon}\leq\varepsilon$, $\int_{\Theta_{1}(\widetilde{\varepsilon})}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\rightarrow 0$ in probability. ∎

Proof of Corollary 1.

Write the SMP as

\begin{align}\pi_{\omega}(\theta\mid z_{1:n})&=\pi_{\mathrm{cut}}(\theta\mid z_{1:n})+\omega\{\pi_{\mathrm{full}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\}\nonumber\\&=\pi_{\mathrm{cut}}(\theta\mid z_{1:n})+\omega\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}\pi(\theta_{2}\mid\theta_{1},z_{1:n}),\tag{10}\end{align}

where $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})=\int_{\Theta_{2}}\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta_{2}$. Apply (10) and Fubini's theorem to obtain

\begin{align*}\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta&=\int_{\Theta}|\widehat{\omega}_{+}\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}\pi(\theta_{2}\mid\theta_{1},z_{1:n})|\mathrm{d}\theta\\&=\int_{\Theta_{1}}\int_{\Theta_{2}}|\widehat{\omega}_{+}\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}|\pi(\theta_{2}\mid\theta_{1},z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&=\widehat{\omega}_{+}\int_{\Theta_{1}}|\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})|\mathrm{d}\theta_{1}.\end{align*}

We can then write

\[\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta\leq\widehat{\omega}_{+}\left[\int_{\Theta_{1}}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}+\int_{\Theta_{1}}\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\right]=2\widehat{\omega}_{+}.\]

The stated result now follows if $\widehat{\omega}_{+}=o_{p}(1)$ as $n\rightarrow+\infty$.

To show this, let $X_{n,\mathrm{cut}}:=\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})$, $X_{n,\mathrm{full}}:=\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{\star})$, $X_{n}:=X_{n,\mathrm{full}}-X_{n,\mathrm{cut}}$, and $Y_{n}=\sqrt{n}(\theta_{0}-\theta_{\star})$. Then, for $\Upsilon_{n}=\Upsilon(\overline{\theta}_{\mathrm{cut}})$,

\[\widehat{\omega}=\frac{\gamma_{n}}{n\|\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}}\|^{2}_{\Upsilon_{n}}}=\frac{\gamma_{n}}{\|X_{n,\mathrm{cut}}-X_{n,\mathrm{full}}+Y_{n}\|^{2}_{\Upsilon_{n}}}=\frac{\gamma_{n}}{\|Y_{n}-X_{n}\|^{2}_{\Upsilon_{n}}}.\]

By the reverse triangle inequality,

\[\widehat{\omega}\leq\frac{\gamma_{n}}{|\|X_{n}\|_{\Upsilon_{n}}^{2}-\|Y_{n}\|_{\Upsilon_{n}}^{2}|}.\]

By the hypothesis of the result, $\|X_{n}\|=O_{p}(1)$, while under Assumption 1, $\theta_{1,0}\neq\theta_{1,\star}$, so that $\|Y_{n}\|\rightarrow+\infty$ as $n\rightarrow+\infty$, and the stated result follows. ∎

Appendix C Local Misspecification: Assumption 2

C.1 Regularity Conditions: Local Misspecification

Before stating the regularity conditions we maintain in this section, we recall several notations previously defined in the main text. Let $\ell(\theta)=\log f(z_{1:n}\mid\theta)$, and denote the joint log-likelihood for the $i$-th observation as $\ell(z_{i}\mid\theta)=\log f(z_{i}\mid\theta)$. Denote the full derivative of the log-likelihood as $\dot{\ell}(\theta):=\partial\ell(\theta)/\partial\theta$, and the second derivative as $\ddot{\ell}(\theta):=\partial^{2}\ell(\theta)/\partial\theta\partial\theta^{\top}$. For $j,k\in\{1,2\}$, define the partial derivatives $\dot{\ell}_{(j)}(\theta)=\partial\ell(\theta)/\partial\theta_{j}$ and the second partial derivatives $\ddot{\ell}_{(jk)}(\theta)=\partial^{2}\ell(\theta)/\partial\theta_{j}\partial\theta_{k}^{\top}$. For a function $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$, let $\mathbb{E}_{n}[g(z)]$ denote the expectation of $g(z)$ under $h(z\mid\theta_{0},\delta_{n})$ in Assumption 2; i.e., $\mathbb{E}_{n}[g(z)]=\int_{\mathcal{Z}}g(z)h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)$. Define the matrices $\mathcal{I}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}(\theta_{0})]$ and $\mathcal{I}_{jk}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(jk)}(\theta_{0})]$. Recall that $\eta=(\eta_{1}^{\top},\eta_{2}^{\top})^{\top}$ in Assumption 2 is partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$.

In addition, note that

\[\ell(\theta)=\log f_{1}(z_{1:n}\mid\theta_{1})+\log f_{2}(z_{1:n}\mid\theta)=\ell_{p}(\theta_{1})+\ell_{c}(\theta),\]

where $\ell_{p}(\theta_{1}):=\log f_{1}(z_{1:n}\mid\theta_{1})$ signifies the 'partial log-likelihood' term, and $\ell_{c}(\theta):=\log f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ signifies the log-likelihood term that is used in cut inference to construct the conditional posterior for $\theta_{2}$ given $\theta_{1}$. Define the partial derivatives of $\ell_{p}(\theta_{1})$ as $\dot{\ell}_{p(1)}(\theta_{1})=\partial\ell_{p}(\theta_{1})/\partial\theta_{1}$ and $\ddot{\ell}_{p(11)}(\theta_{1})=\partial^{2}\ell_{p}(\theta_{1})/\partial\theta_{1}\partial\theta_{1}^{\top}$, and recall $\mathcal{I}_{p(11)}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{p(11)}(\theta_{1,0})]$. For $\ell_{c}(\theta)$ and $j,k\in\{1,2\}$, define $\dot{\ell}_{c(j)}(\theta):=\partial\ell_{c}(\theta)/\partial\theta_{j}$ and $\ddot{\ell}_{c(jk)}(\theta):=\partial^{2}\ell_{c}(\theta)/\partial\theta_{j}\partial\theta_{k}^{\top}$. From the structure of the log-likelihood $\ell(\theta)$, note that

\begin{align*}\mathcal{I}_{12}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(12)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(12)}(\theta_{0})],\\ \mathcal{I}_{21}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(21)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(21)}(\theta_{0})],\\ \mathcal{I}_{22}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(22)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(22)}(\theta_{0})].\end{align*}

To formalize the impact of the misspecification in Assumption 2, we impose the following regularity conditions on the density $h(z\mid\theta,\delta_{n})$. (We eschew measurability conditions and assume that all objects written are measurable.)

Assumption C1.

For $\upsilon:=(\theta^{\top},\psi^{\top})^{\top}$, let $\upsilon_{0}:=(\theta_{0}^{\top},\psi_{0}^{\top})^{\top}$ be an element of the interior of $\Theta\times\Delta$, where $\Delta\subset\mathbb{R}^{d}$ is compact. The function $h_{n}(z\mid\upsilon)=f_{1}(z\mid\theta)f_{2}(z\mid\theta)\{1+\psi^{\top}\zeta(z)/\sqrt{n}\}$ is twice continuously differentiable in $\upsilon$ for almost all $z\in\mathcal{Z}$. There exist positive functions $a(z)$ and $b(z)$ such that, for all $z\in\mathcal{Z}$ except on sets of measure zero, $\ell_{n}(z\mid\upsilon)=\log h_{n}(z\mid\upsilon)$ satisfies the following: $\exp\ell_{n}(z\mid\upsilon)\leq a(z)$; and, for all $\|\upsilon-\upsilon_{0}\|\leq\nu_{0}/\sqrt{n}$ and some $\nu_{0}>0$, each of $|\ell_{n}(z\mid\upsilon)|$, $\|\dot{\ell}_{n}(z\mid\upsilon)\|^{2}$ and $\|\ddot{\ell}_{n}(z\mid\upsilon)\|^{2}$ is less than $b(z)$. Further, $\mathbb{E}_{n}[a(z)],\mathbb{E}_{n}[b(z)],\mathbb{E}_{n}[a(z)b(z)]<+\infty$, and the set $\{z\in\mathcal{Z}:h_{n}(z\mid\upsilon)>0\}$ does not depend on $\upsilon$.

In addition, we maintain the following identification condition on the assumed model $f(z\mid\theta)$.

Assumption C2.

For any $\theta,\theta^{\prime}\in\Theta$, if $\theta\neq\theta^{\prime}$, then $\ell(\theta)\neq\ell(\theta^{\prime})$ with positive probability.

Remark 10.

Assumption C1 is similar to the regularity conditions employed by Claeskens and Hjort (2003) to deduce large sample theory for frequentist model averaging estimators. Assumption C2 is a classical identification condition.

C.2 Preliminary Results

The following intermediate results are used to state and prove our main results.

Lemma C.1.

If Assumptions 2, C1 and C2 are satisfied, then the following results are satisfied.

  1. $\lim_{n}n^{-1}\mathbb{E}_{n}[-\ddot{\ell}_{p(11)}(\theta_{1,0})]=\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{p(1)}(\theta_{1,0})^{\top}]$.

  2. $\lim_{n}n^{-1}\mathbb{E}_{n}[-\ddot{\ell}_{c(11)}(\theta_{0})]=\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{c(1)}(\theta_{0})\dot{\ell}_{c(1)}(\theta_{0})^{\top}]$.

  3. $\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{c(1)}(\theta_{0})^{\top}]=0$.

  4. $\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{c(2)}(\theta_{0})^{\top}]=0$.

The following Lemma is used to prove the results under the drifting sequences of DGPs constructed in Assumption 2.

Lemma C.2.

Suppose Assumptions 2, C1, and C2 are satisfied. Then $\phi(\theta,\delta)=\int_{\mathcal{Z}}\dot{\ell}(z\mid\theta)h(z\mid\theta,\delta)\mathrm{d}\mu(z)$ exists and is continuous on $\Theta\times\Delta$, and for all $\varepsilon>0$ and any compact $\tilde{\Theta}\subset\Theta$,

\[\lim_{n\rightarrow+\infty}\operatorname{Pr}\left(\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\geqslant\varepsilon\right)=0.\]

The following result is a consequence of Lemmas C.1 and C.2.

Lemma C.3.

If Assumptions 2, C1 and C2 are satisfied, then the following results are satisfied.

  1. $\lim_{n}\mathbb{E}_{n}[\dot{\ell}(\theta_{0})/\sqrt{n}]=\eta$.

  2. $\dot{\ell}(\theta_{0})/\sqrt{n}\Rightarrow N(\eta,\mathcal{I})$.

  3. $\dot{\ell}_{p(1)}(\theta_{0})/\sqrt{n}\Rightarrow N(0,\mathcal{I}_{p(11)})$.

  4. $\dot{\ell}_{c(2)}(\theta_{0})/\sqrt{n}\Rightarrow N(\eta_{2},\mathcal{I}_{(22)})$.

The following is a useful extension of Stein’s Lemma.

Lemma C.4 (Lemma 2 of Hansen, 2016).

If $\xi\sim N(0,V)$ is an $m\times 1$ vector, $\Psi$ is an $m\times m$ matrix, and $\varphi:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is continuously differentiable, then for $h\in\mathbb{R}^{m}$,

\[\mathbb{E}\left\{\varphi(\xi+h)^{\top}\Psi\xi\right\}=\mathbb{E}\,\text{tr}\left\{\frac{\partial}{\partial x}\varphi(\xi+h)^{\top}\Psi V\right\}.\tag{11}\]
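The identity (11) can be verified by simulation for a simple smooth map; the sketch below uses the elementwise map $\varphi(x)=\tanh(x)$, whose Jacobian is diagonal, with illustrative choices of $V$, $\Psi$ and $h$ of our own making.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_mc = 3, 400_000
A = rng.normal(size=(m, m))
V = A @ A.T + np.eye(m)                 # positive-definite variance of xi
Psi = rng.normal(size=(m, m))
h = np.array([0.5, -1.0, 0.25])

xi = rng.normal(size=(n_mc, m)) @ np.linalg.cholesky(V).T   # xi ~ N(0, V)
phi = np.tanh(xi + h)                                       # phi(xi + h), elementwise

lhs = np.mean(np.einsum("ni,ij,nj->n", phi, Psi, xi))       # E{phi(xi+h)^T Psi xi}
# For elementwise tanh the Jacobian is diag(1 - tanh^2), so the trace term
# tr{(d/dx) phi^T Psi V} reduces to sum_i (1 - phi_i^2) (Psi V)_{ii}.
rhs = np.mean((1.0 - phi**2) @ np.diag(Psi @ V))
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
```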

To simplify the proofs of our results, we use the following block-matrix notation: for $j,k\in\{1,2\}$, let $\mathbf{0}_{d_{j}\times d_{k}}$ denote a $d_{j}\times d_{k}$ matrix of zeros, and define

\begin{align*}\Gamma_{1,\mathrm{cut}}&:=(\mathcal{I}_{p(11)}^{-1}:\mathbf{0}_{d_{1}\times d_{1}}:\mathbf{0}_{d_{1}\times d_{2}}),&\Gamma_{1,\mathrm{full}}&:=(\mathbf{0}_{d_{1}\times d_{1}}:\mathcal{I}_{11.2}^{-1}:-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}),\\ \Gamma_{2,\mathrm{cut}}&:=(-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}:\mathbf{0}_{d_{2}\times d_{1}}:\mathcal{I}_{22}^{-1}),&\Gamma_{2,\mathrm{full}}&:=(\mathbf{0}_{d_{2}\times d_{1}}:-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}:\mathcal{I}_{22}^{-1}).\end{align*}

In addition, define

\[Z_{n}:=\frac{1}{\sqrt{n}}\begin{pmatrix}\dot{\ell}_{p(1)}(\theta_{1,0})\\ \dot{\ell}_{p(1)}(\theta_{1,0})+\dot{\ell}_{c(1)}(\theta_{0})\\ \dot{\ell}_{c(2)}(\theta_{0})\end{pmatrix},\quad\Gamma_{\mathrm{cut}}:=\begin{pmatrix}\Gamma_{1,\mathrm{cut}}\\ \Gamma_{2,\mathrm{cut}}\end{pmatrix},\quad\Gamma_{\mathrm{full}}:=\begin{pmatrix}\Gamma_{1,\mathrm{full}}\\ \Gamma_{2,\mathrm{full}}\end{pmatrix},\]

where we note that $\Gamma_{\mathrm{cut}}$ and $\Gamma_{\mathrm{full}}$ have dimension $(d_{1}+d_{2})\times(2d_{1}+d_{2})$. Recall

\[W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1},\quad\mathcal{M}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}.\]

When no confusion is likely to result, terms evaluated at $\theta_{0}$ will have their dependence on this value suppressed; e.g., for $j,k\in\{1,2\}$ we will often write $\ell_{(jk)}=\ell_{(jk)}(\theta_{0})$.

Lemma C.5.

Under Assumptions 2, C1 and C2, the following are satisfied.

  1. $Z_{n}\Rightarrow\xi+\tau$, where $\tau=(\mathbf{0}_{d_{1}\times 1}^{\top},\eta^{\top})^{\top}$, and where $\xi$ is a $(2d_{1}+d_{2})$-dimensional normal random variable with mean zero and variance
\begin{align*}\Omega:=\lim_{n\rightarrow+\infty}\mathbb{E}_{n}[Z_{n}Z_{n}^{\top}]&=\lim_{n\rightarrow+\infty}n^{-1}\begin{pmatrix}\mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbf{0}_{d_{1}\times d_{2}}\\ \mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}_{n}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{c(1)}\dot{\ell}_{c(2)}^{\top}]\\ \mathbf{0}_{d_{2}\times d_{1}}&\mathbb{E}_{n}[\dot{\ell}_{c(2)}\dot{\ell}_{c(1)}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}]\end{pmatrix}\\&=\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&\mathbf{0}_{d_{1}\times d_{2}}\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ \mathbf{0}_{d_{2}\times d_{1}}&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}.\end{align*}

  2. $(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\mathcal{M}$.

  3. $\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\mathcal{M}$.

To analyze the behavior of the S-SMP, we must first deduce the behavior of $\overline{\theta}_{\mathrm{cut}}$ and $\overline{\theta}_{\mathrm{full}}$. To this end, define the statistics

\[T_{n,\mathrm{full}}=\Gamma_{\mathrm{full}}Z_{n}/\sqrt{n}=\begin{pmatrix}T_{1,\mathrm{full},n}\\ T_{2,\mathrm{full},n}\end{pmatrix},\quad T_{n,\mathrm{cut}}=\Gamma_{\mathrm{cut}}Z_{n}/\sqrt{n}=\begin{pmatrix}T_{1,\mathrm{cut},n}\\ T_{2,\mathrm{cut},n}\end{pmatrix},\]

which are partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$. Define

\begin{align*}t&:=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}})=(t_{1}^{\top},t_{2}^{\top})^{\top},&\mathcal{T}&:=\{t=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}}):\theta\in\Theta\},\\ \vartheta&:=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{cut}})=(\vartheta_{1}^{\top},\vartheta_{2}^{\top})^{\top},&\mathcal{V}&:=\{\vartheta=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{cut}}):\theta\in\Theta\}.\end{align*}

The cut and full posteriors for $\vartheta$ and $t$ are given by

\begin{align*}\pi_{\mathrm{cut}}(\vartheta\mid z_{1:n})&=\frac{1}{\sqrt{n}^{d_{\theta}}}\pi_{\mathrm{cut}}\left(\theta_{0}+\frac{\vartheta}{\sqrt{n}}+T_{n,\mathrm{cut}}\mid z_{1:n}\right),\\ \pi_{\mathrm{full}}(t\mid z_{1:n})&=\frac{1}{\sqrt{n}^{d_{\theta}}}\pi_{\mathrm{full}}\left(\theta_{0}+\frac{t}{\sqrt{n}}+T_{n,\mathrm{full}}\mid z_{1:n}\right).\end{align*}
Theorem C.1.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then $\int_{\mathcal{T}}|\pi_{\mathrm{full}}(t\mid z_{1:n})-N(t;0,\mathcal{I}^{-1})|\mathrm{d}t=o_{p}(1)$ and $\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1})$.

Theorem C.2.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then

\begin{align*}&\int_{\mathcal{V}_{1}}|\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})-N(\vartheta_{1};0,\mathcal{I}_{p(11)}^{-1})|\mathrm{d}\vartheta_{1}=o_{p}(1),\\&\int_{\mathcal{V}_{2}}|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}^{-1}_{22}+\mathcal{I}^{-1}_{22}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\mathrm{d}\vartheta_{2}=o_{p}(1).\end{align*}

In addition,

\begin{align*}&\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})\Rightarrow N(0,\mathcal{I}_{p(11)}^{-1}),\\&\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0})\Rightarrow N(\mathcal{I}_{(22)}^{-1}\eta_{2},\mathcal{I}^{-1}_{22}+\mathcal{I}^{-1}_{22}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}).\end{align*}

Corollary C.1.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then, for $j\in\{1,2\}$,

\[\lim_{n\rightarrow+\infty}\left([\mathbb{E}_{n}\{\sqrt{n}(\overline{\theta}_{j,\mathrm{full}}-\theta_{j,0})\}]^{2}-[\mathbb{E}_{n}\{\sqrt{n}(\overline{\theta}_{j,\mathrm{cut}}-\theta_{j,0})\}]^{2}\right)\geq 0.\]
Lemma C.6.

If Assumptions 2 and C1–C2 are satisfied, then

  1. \sqrt{n}\begin{pmatrix}\overline{\theta}_{\mathrm{cut}}-\theta_{0}\\ \overline{\theta}_{\mathrm{full}}-\theta_{0}\end{pmatrix}=\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}Z_{n}+o_{p}(1)\Rightarrow\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}(\xi+\tau).

  2. \widehat{\omega}_{+}\Rightarrow\overline{\omega}:=\min\left\{1,\frac{\gamma}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}\right\}, where \Upsilon=\partial^{2}q(\theta,\theta_{0})/\partial\theta\partial\theta^{\top}|_{\theta=\theta_{0}}.

  3. \sqrt{n}\{\overline{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\Rightarrow\Gamma_{\mathrm{cut}}(\xi+\tau)-\overline{\omega}\{(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)\}.

Remark 11.

We do not explicitly consider the case in which the cut and full posteriors are computed from samples of different sizes; e.g., n_{2} for the full posterior and n_{1} for the cut. While such an extension is useful, the difference in sample sizes will not have a significant impact on the resulting behavior of the SMP so long as n_{1}/n_{2}\rightarrow\alpha\in(0,\infty). If one wishes to impose such a condition, the only consequence is a slight change in the definition of the matrix \Omega in Lemma C.5 to account for the fact that \lim_{n}n_{1}/n_{2}\neq 1. As such, all results presented herein can be extended to this case at the cost of minor additional technicalities. To see this, let n_{2} be the larger sample size associated with the full posterior, and n_{1} the smaller sample size associated with the cut posterior, which satisfies \lim_{n}n_{1}/n_{2}=\alpha for some \alpha<1. Then, our results go through with n=n_{2} since

\displaystyle\sqrt{n_{2}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})=\alpha^{-1/2}\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})+o_{p}\left\{\left(\frac{1}{n_{1}/n_{2}}-\frac{1}{\alpha}\right)\|\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\|\right\}
\displaystyle=\alpha^{-1/2}\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})+o_{p}(1),

where the second equality follows from Theorem C.2 (when \pi_{\mathrm{cut}}(\theta\mid z_{1:n}) is based on n_{1} observations).

C.3 Proofs of Main Results

Recall the definitions \mathcal{I}_{11.2}=\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}, and

W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1},\quad\mathcal{M}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}. (12)
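Because (12) and Lemma C.5 are pure block-matrix identities, they can be verified numerically before working through the algebra below. The following sketch continues the toy construction from the sketch after Corollary C.1 (reusing Ip, Iful, I11, I12, I21, I22, W, and the dimensions defined there, all of which remain illustrative assumptions).

# Continues the toy sketch given after Corollary C.1; illustrative only.
I22i = inv(I22)
M = np.block([[W, -W @ I12 @ I22i],
              [-I22i @ I21 @ W, I22i @ I21 @ W @ I12 @ I22i]])   # M as in (12)

# Omega: the limiting covariance of Z_n from Lemma C.5, Result 1.
Z0 = np.zeros((d1, d2))
Omega = np.block([[Ip, Ip, Z0], [Ip, I11, I12], [Z0.T, I21, I22]])

# Gamma_cut and Gamma_full as written out in the proofs of Section C.4.
G_cut = np.block([[inv(Ip), np.zeros((d1, d1)), Z0],
                  [-I22i @ I21 @ inv(Ip), np.zeros((d2, d1)), I22i]])
G_full = np.block([np.zeros((d, d1)), inv(Iful)])

D = G_cut - G_full
assert np.allclose(D @ Omega @ D.T, M)        # Lemma C.5, Result 2
assert np.allclose(G_cut @ Omega @ D.T, M)    # Lemma C.5, Result 3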

To prove Theorem 1 in the main text, we first prove the following result.

Theorem C.3.

Suppose that Assumptions 2–3 and the regularity conditions in Assumptions C1–C2 are satisfied. If \text{\rm tr}\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|, and 0\leq\gamma\leq 2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|), then

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathbb{E}\left[\frac{\gamma\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}\right].
Proof of Theorem C.3.

For \Upsilon=[\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}]_{\delta=\theta_{0}}, and \|X\|_{\Upsilon}^{2}=X^{\top}\Upsilon X, following arguments similar to those in Theorem 1 of Rousseau (1997), it can be shown that

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}):=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\int_{\Theta}nq(\theta,\theta_{0})\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\mathrm{d}\theta,\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{nq(\overline{\theta}_{\mathrm{cut}},\theta_{0}),\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\|\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\|_{\Upsilon}^{2},\nu\right\},

and similarly (under Assumption 2),

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}):=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\int_{\Theta}nq(\theta,\theta_{0})\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta,\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{nq(\overline{\theta}_{\mathrm{full}},\theta_{0}),\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\|\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\|_{\Upsilon}^{2},\nu\right\},

which together imply that

\mathrm{R}_{q}(\pi_{\omega},\theta_{0})=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\{\|\sqrt{n}\{\bar{\theta}(\omega)-\theta_{0}\}\|^{2}_{\Upsilon},\nu\}.

Write

\displaystyle\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}=\sqrt{n}\{(1-\widehat{\omega}_{+})\overline{\theta}_{\mathrm{cut}}+\widehat{\omega}_{+}\overline{\theta}_{\mathrm{full}}-\theta_{0}\}
\displaystyle=\sqrt{n}(\bar{\theta}_{\mathrm{cut}}-\theta_{0})-\widehat{\omega}_{+}\{\sqrt{n}(\bar{\theta}_{\mathrm{cut}}-\theta_{0})-\sqrt{n}(\bar{\theta}_{\mathrm{full}}-\theta_{0})\}.

By Lemma C.6,

\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\Rightarrow\Psi:=\Gamma_{\mathrm{cut}}(\xi+\tau)-\overline{\omega}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau),

where \xi\sim N(0,\Omega) and \tau=(0^{\top},\eta^{\top})^{\top} are defined in Lemma C.5, and where

\overline{\omega}=\frac{\gamma}{(\xi+\tau)^{\top}P(\xi+\tau)},\quad P=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}}).

For \mathcal{Q}_{n}:=\|\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\|^{2}_{\Upsilon}, and \nu\geq 0, let

\Psi_{n,\nu}=\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\cdot\mathbb{I}[\mathcal{Q}_{n}\leq\nu]+\nu\cdot\mathbb{I}[\mathcal{Q}_{n}>\nu].

By Theorem 1.8.8 of Lehmann and Casella (2006),

\liminf_{n\rightarrow\infty}\mathbb{E}\left[\|\Psi_{n,\nu}\|_{{\Upsilon}}^{2}\right]=\mathbb{E}\left[\|\Psi\|_{{\Upsilon}}^{2}\mathbb{I}(\|\Psi\|_{\Upsilon}^{2}\leq\nu)\right]+\nu\,\text{Pr}(\|\Psi\|^{2}_{\Upsilon}>\nu).

As \nu\rightarrow\infty, the RHS of the above converges to \mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right], and we have that

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right].

Expanding \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}):

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right]
\displaystyle=\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]+\gamma^{2}\mathbb{E}\frac{(\xi+\tau)^{\top}P(\xi+\tau)}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}
\displaystyle\quad-2\gamma\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}
\displaystyle=\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]+\gamma^{2}\mathbb{E}\frac{1}{[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\quad-2\gamma\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}.

From the definitions of \Gamma_{\mathrm{cut}},\xi,\tau,

\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}).

We now focus on the last term in \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}). Define the mapping \varphi(x)=x^{\top}/(x^{\top}Px) and note that

\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\{\varphi(\xi+\tau)(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\}.

Recall that \mathbb{E}(\xi)=0, and \Omega=\mathbb{E}(\xi\xi^{\top}). Using Lemma C.4, and for K:=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega, we have

\displaystyle\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\,\text{\rm tr}\{\partial\varphi(x)/\partial x|_{x=\xi+\tau}\}K
\displaystyle=\mathbb{E}\frac{\text{\rm tr}K}{(\xi+\tau)^{\top}P(\xi+\tau)}-2\mathbb{E}\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}. (13)

From the properties of \text{\rm tr}(\cdot), and Lemma C.5,

\text{\rm tr}K=\text{\rm tr}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\text{\rm tr}{\Upsilon}\mathcal{M}.

For the second term in (13), we have that

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=\text{\rm tr}(\xi+\tau)^{\top}KP(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\mathcal{M}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau),

where the last equality again follows from Lemma C.5. Consequently, for {\Upsilon} positive semi-definite,

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\mathcal{M}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle\leq\|{\Upsilon}^{1/2}\mathcal{M}{\Upsilon}^{1/2}\|\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}
\displaystyle=\|\Upsilon\mathcal{M}\|\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}, (14)

where the final equality holds for any matrix norm \|\cdot\| such that if A and B are positive semi-definite, then \|AB\|=\|BA\|; e.g., the Frobenius norm. Substituting (14) into (13), we have

\displaystyle-\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}
\displaystyle\leq-\mathbb{E}\left[\frac{\text{\rm tr}\Upsilon\mathcal{M}}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]+2\|\Upsilon\mathcal{M}\|\mathbb{E}\left[\frac{\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}}{\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}^{2}}\right]
\displaystyle=-\left(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|\right)\mathbb{E}\left[\frac{1}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]. (15)

Substituting (15) into the expansion of \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}), and collecting terms, yields

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})+\gamma^{2}\mathbb{E}\left[\frac{1}{\{(\xi+\tau)^{\top}P(\xi+\tau)\}}\right]-2\gamma\mathbb{E}\left[\frac{(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]
\displaystyle=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\mathbb{E}\left[\frac{\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]. (16) ∎
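A Monte Carlo rendering of this argument is also possible by sampling directly from the limit experiment of Lemma C.6. The sketch below continues the two earlier sketches; the loss \Upsilon=\mathcal{M}^{+} (a Moore-Penrose stand-in for the \mathcal{M}^{-1} loss of Theorem 2) and the particular \gamma are assumptions made for the illustration only.

# Monte Carlo sketch of Theorem C.3 in the limit experiment of Lemma C.6;
# reuses rng, eta2, Omega, M, G_cut, D, d1, d2 from the earlier sketches.
Ups = np.linalg.pinv(M)              # assumed loss Hessian Upsilon
trUM = np.trace(Ups @ M)             # equals rank(M) = d1 for this choice
nUM = np.linalg.norm(Ups @ M, 2)     # spectral norm, equals 1 here
gamma = trUM - 2.0 * nUM             # interior point of (0, 2(tr - 2||.||))
assert gamma > 0                     # requires d1 > 2

tau = np.concatenate([np.zeros(2 * d1), eta2])   # tau = (0, 0, eta2)
P = D.T @ Ups @ D

L = np.linalg.cholesky(Omega)
X = tau + (L @ rng.standard_normal((2 * d1 + d2, 200_000))).T    # xi + tau
quad = np.einsum('ij,jk,ik->i', X, P, X)
w_bar = np.minimum(1.0, gamma / quad)            # limiting weight, Lemma C.6

Psi_cut = X @ G_cut.T
Psi_smp = Psi_cut - w_bar[:, None] * (X @ D.T)
risk = lambda Psi: np.einsum('ij,jk,ik->i', Psi, Ups, Psi).mean()
print("cut risk:", risk(Psi_cut), "  SMP risk:", risk(Psi_smp))

Under these assumed choices, the printed SMP risk should fall below the cut risk by roughly the margin in (16).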

Theorem C.3 yields the following corollary.

Corollary C.2.

Suppose that Assumptions 2–3 and the regularity conditions in Assumptions C1–C2 are satisfied. If \text{\rm tr}\Upsilon\mathcal{M}>2\|\Upsilon\mathcal{M}\|, and 0<\gamma<2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|), then for any \eta such that \|\eta\|<\infty,

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\frac{\gamma\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{\text{\rm tr}{\Upsilon}\mathcal{M}+\tau^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\tau}<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}). (17)

Furthermore, if \tau=(0^{\top},\eta^{\top})^{\top} is such that \tau^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\tau\geq\text{\rm tr}\Upsilon\mathcal{M}, then

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}).
Proof of Corollary C.2.

Since \xi is Gaussian with mean zero and variance \Omega,

\mathbb{E}[(\xi+\tau)^{\top}P(\xi+\tau)]=\tau^{\top}P\tau+\text{\rm tr}P\Omega. (18)

From equation (16) in the proof of Theorem C.3,

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\mathbb{E}\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{\mathbb{E}[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{\tau^{\top}P\tau+\text{\rm tr}P\Omega}
\displaystyle<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}),

where the second inequality follows by Jensen's inequality, the third by substituting the moment of the quadratic form in equation (18), and the final (strict) inequality follows since 0<\gamma<2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|).

Since, under our hypotheses, \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}), to prove the second part of the result we need only prove that

\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})\geq 0.

Taking \omega=0 and \omega=1 in the proof of Theorem C.3, we see that

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})=\mathbb{E}\{(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}\Upsilon\Gamma_{\mathrm{cut}}(\xi+\tau)\}=\tau^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\tau+\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}
\displaystyle=\|\mathcal{I}_{22}^{-1}\eta_{2}\|^{2}_{\Upsilon_{22}}+\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top};
\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\mathbb{E}\{(\xi+\tau)^{\top}\Gamma_{\mathrm{full}}^{\top}\Upsilon\Gamma_{\mathrm{full}}(\xi+\tau)\}=\tau^{\top}\Gamma_{\mathrm{full}}^{\top}{\Upsilon}\Gamma_{\mathrm{full}}\tau+\text{\rm tr}\Upsilon\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}
\displaystyle=\|\mathcal{I}^{-1}\eta\|^{2}_{\Upsilon}+\text{\rm tr}\Upsilon\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}.

Consequently,

\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}-\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}+\text{\rm tr}\Upsilon\{\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}-\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}\}.

From the definitions of \Gamma_{\mathrm{cut}},\Gamma_{\mathrm{full}}, \Omega, and \mathcal{M} in (12),

\displaystyle\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}-\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}
\displaystyle=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}-\begin{pmatrix}\mathcal{I}_{11.2}^{-1}&-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}
\displaystyle=\mathcal{M}.

Hence, \mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}-\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}+\text{\rm tr}\Upsilon\mathcal{M}, and

\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})\leq 0,

when \|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}\geq\text{\rm tr}\Upsilon\mathcal{M}+\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}, as maintained in the stated result. ∎

Proof of Theorem 1 (main text).

Theorem 1 is a direct consequence of Corollary C.2. To see this, note that Corollary C.2 is true pointwise for any \eta such that \|\eta\|<\infty. Note also that \mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}) and \mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}) are convex in \eta, and that the RHS of the bound for \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}) in equation (17) is also convex as a function of \eta. Hence, Corollary C.2 holds uniformly, which verifies Theorem 1. ∎

Proof of Theorem 2.

Recall the expanded expression for \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}) and equation (13), both derived in the proof of Theorem C.3:

\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\frac{\text{\rm tr}K}{(\xi+\tau)^{\top}P(\xi+\tau)}-2\mathbb{E}\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}},

where K=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon\Gamma_{\mathrm{cut}}\Omega. Under our choice of loss, we can show that

\text{\rm tr}K=\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\text{\rm tr}\Upsilon\mathcal{M}=d.

For the second term in (13),

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=(\xi+\tau)^{\top}KP(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau).

Applying the above, the second term in (13) becomes

\displaystyle\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}=\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}{\{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)\}^{2}}
\displaystyle=\frac{1}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}. (19)

From Lemma C.5, \text{Cov}[(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)]=\mathcal{M}. Hence, the reciprocal of the random variable in (19), i.e., the quadratic form itself, is distributed as non-central chi-squared with \kappa=d degrees of freedom and non-centrality parameter \lambda=\tau^{\top}P\tau, with density function

f_{\chi}(z):=\exp(-\lambda)\sum_{j=0}^{\infty}\frac{\lambda^{j}}{j!}\frac{z^{\frac{1}{2}(\kappa+2j)-1}\exp({-\frac{1}{2}z})}{2^{\frac{1}{2}(\kappa+2j)}\Gamma[(\kappa+2j)/2]}.

To calculate \mathbb{E}[Z^{-1}], we first write

\mathbb{E}[Z^{-1}]=\int z^{-1}f_{\chi}(z)\mathrm{d}\mu(z)=\exp(-\lambda)\sum_{j=0}^{\infty}\int_{0}^{\infty}\frac{\lambda^{j}}{j!}\frac{z^{\frac{1}{2}(\kappa+2j)-2}\exp({-\frac{1}{2}z})}{2^{\frac{1}{2}(\kappa+2j)}\Gamma[(\kappa+2j)/2]}\mathrm{d}\mu(z),

and note that, since \kappa/2>1 when d>2, for all j\geq 0,

\int z^{\frac{1}{2}(\kappa+2j)-2}\exp(-z/2)\mathrm{d}\mu(z)=2^{(\kappa/2+j)-1}\Gamma[(\kappa/2+j)-1],

so that we can rewrite the expectation as

\displaystyle\mathbb{E}[Z^{-1}]=2^{-1}\frac{\Gamma(\kappa/2-1)}{\Gamma(\kappa/2)}\exp(-\lambda)\sum_{j=0}^{\infty}\frac{(\kappa/2-1)_{j}}{(\kappa/2)_{j}}\left(\frac{\lambda^{j}}{j!}\right)
\displaystyle=2^{-1}\frac{\Gamma(\kappa/2-1)}{\Gamma(\kappa/2)}\exp(-\lambda){}_{1}F_{1}(\kappa/2-1;\kappa/2;\lambda).

Taking \kappa=d then yields the result. ∎
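The closed form just derived is straightforward to sanity-check by simulation. In the sketch below, \kappa and \lambda are arbitrary assumed values; note that, with the Poisson-mixing convention used in the density above, scipy's non-centrality parameter corresponds to 2\lambda.

# Illustrative check of E[1/Z] for the non-central chi-squared variable above.
import numpy as np
from scipy.special import gamma as gfun, hyp1f1
from scipy.stats import ncx2

kappa, lam = 7, 1.3                  # requires kappa = d > 2
closed = 0.5 * gfun(kappa / 2 - 1) / gfun(kappa / 2) * np.exp(-lam) \
    * hyp1f1(kappa / 2 - 1, kappa / 2, lam)
mc = np.mean(1.0 / ncx2.rvs(df=kappa, nc=2 * lam, size=10**6, random_state=0))
print(closed, mc)                    # should agree up to Monte Carlo error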

C.4 Proofs of Preliminary Results in Section C.2

When no confusion is likely to result, quantities that are evaluated at \theta_{0} will have their dependence on \theta_{0} suppressed; i.e., we write \mathcal{I}_{p(11)}(\theta_{0})=\mathcal{I}_{p(11)}, etc.

Proof of Lemma C.1.

Note that, under the regularity conditions in Assumption C1, we can exchange the order of integration and differentiation. Furthermore, note that for \delta_{n} as in Assumption 2 and z\in\mathcal{Z}, we have that

\displaystyle\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}[\dot{\ell}(\theta_{0})\dot{\ell}(\theta_{0})^{\top}]=\lim_{n\rightarrow+\infty}\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}h(z\mid\theta_{0},0)\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}[\dot{\ell}(z_{i}\mid\theta_{0})\dot{\ell}(z_{i}\mid\theta_{0})^{\top}],

where recall that

\dot{\ell}(z_{i}\mid\theta_{0})=\partial\log f(z_{i}\mid\theta_{0})/\partial\theta,\text{ and }\ddot{\ell}(z_{i}\mid\theta_{0})=\partial^{2}\log f(z_{i}\mid\theta_{0})/\partial\theta\partial\theta^{\top}.

Similarly, from the structure of the score equations, and the information matrix equality, we have that

\displaystyle-\mathbb{E}\ddot{\ell}(z_{i}\mid\theta_{0})=\mathbb{E}[\dot{\ell}(z_{i}\mid\theta_{0})\dot{\ell}(z_{i}\mid\theta_{0})^{\top}]
\displaystyle-\mathbb{E}\begin{pmatrix}\ddot{\ell}_{p(11)}+\ddot{\ell}_{c(11)}&\ddot{\ell}_{c(12)}\\ \ddot{\ell}_{c(21)}&\ddot{\ell}_{c(22)}\end{pmatrix}=\mathbb{E}\begin{pmatrix}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}&\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{c(2)}^{\top}\\ \dot{\ell}_{c(2)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}&\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}\end{pmatrix}
\displaystyle\begin{pmatrix}\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}&\mathcal{I}_{12}\\ \mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}=\mathbb{E}\begin{pmatrix}\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}+\dot{\ell}_{c(1)}\dot{\ell}_{c(1)}^{\top}+2\dot{\ell}_{p(1)}\dot{\ell}_{c(1)}^{\top}&\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}+\dot{\ell}_{c(1)}\dot{\ell}_{c(2)}^{\top}\\ \dot{\ell}_{c(2)}\dot{\ell}_{p(1)}^{\top}+\dot{\ell}_{c(2)}\dot{\ell}_{c(1)}^{\top}&\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}\end{pmatrix}.

Analysing the above, we see that

\displaystyle\mathcal{I}_{p(11)}=-\int_{\mathcal{Z}}\frac{\partial^{2}\log f_{1}(z\mid\theta_{1})}{\partial\theta_{1}\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{1}{f_{1}(z\mid\theta_{1,0})^{2}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle\hskip 14.22636pt-\int_{\mathcal{Z}}\frac{\partial^{2}f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}\partial\theta_{1}^{\top}}f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}, (20)

where the last line follows from rewriting \frac{1}{f_{1}(z\mid\theta_{1,0})}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}=\dot{\ell}_{p(1)}(z\mid\theta_{1,0}), and from exchanging integration and differentiation to see that the second term is zero. This proves part one of the result. Repeating the argument for \mathcal{I}_{p(11)} in the case of \mathcal{I}_{c(11)} yields

\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top},

which proves the second part of the result.

Now, let us investigate the equivalence of each term. For the first term we have that

\displaystyle\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle\quad+2\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}. (21)

Applying the above equations for \mathcal{I}_{j(11)}, j\in\{p,c\}, in equation (21), we have

\displaystyle\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle\quad+2\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top},

which is satisfied if and only if \mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}=0. Hence, we have part three of the stated result.

Lastly, we investigate the term \mathcal{I}_{21}. Again, using the definition of this term, we have

\displaystyle\mathcal{I}_{21}=-\int_{\mathcal{Z}}\frac{\partial^{2}\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{1}{f_{2}(z\mid\theta_{0})^{2}}\frac{\partial f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}\frac{\partial f_{2}(z\mid\theta_{0})}{\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)-\int_{\mathcal{Z}}\frac{\partial^{2}f_{2}(z\mid\theta_{0})}{\partial\theta_{2}\partial\theta_{1}^{\top}}f_{1}(z\mid\theta_{1,0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}, (22)

where the last equality again follows from exchanging integration and differentiation in the second term, and rewriting the derivatives in the first term. Therefore, from equation (22) and the general matrix information equality, we have

\mathcal{I}_{21}=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}+\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{p(1)}(z\mid\theta_{1,0})^{\top},

which is satisfied if and only if \mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{p(1)}(z\mid\theta_{1,0})^{\top}=0, which proves the last result. ∎

Proof of Lemma C.2.

To simplify the proof, let \delta_{n}(z)=\psi^{\top}\xi(z)/\sqrt{n}. By Assumption C1, under Assumption 2, we see that \|\dot{\ell}(z\mid\theta)\|h(z\mid\theta,\delta_{n})\leq b(z)\{1+\delta_{n}(z)\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})+o(b(z)/\sqrt{n}). By Assumption 2, there exists some a_{1}(z)\geq 0 with \sup_{z\in\mathcal{Z}}a_{1}(z)<\infty such that \delta_{n}(z)\leq a_{1}(z). Hence, for each n\geq 1, h(z\mid\theta,\delta_{n})\leq\{1+a_{1}(z)/\sqrt{n}\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0}) and |\dot{\ell}(z\mid\theta)h(z\mid\theta,\delta_{n})|\leq b(z)\{1+a_{1}(z)/\sqrt{n}\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0}). From Assumption C1, we have that \int_{\mathcal{Z}}a_{1}(z)b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)<\infty and \int_{\mathcal{Z}}b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)<\infty. By the dominated convergence theorem, we then have that \phi(\theta,\delta_{n}) exists for each \delta_{n}\in\Delta.

To prove the second part of the result, we note that, by Assumption 2,

\displaystyle\lim_{n\rightarrow\infty}\sup_{\theta\in\tilde{\Theta}}|\phi(\theta,\delta_{n})-\phi(\theta,0)|=\lim_{n\rightarrow\infty}\frac{1}{\sqrt{n}}\sup_{\theta\in\tilde{\Theta}}\left|\int_{\mathcal{Z}}\dot{\ell}(z\mid\theta)\delta_{n}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)\right|
\displaystyle\leq\lim_{n\rightarrow\infty}\frac{1}{\sqrt{n}}\left|\int_{\mathcal{Z}}b(z)a_{1}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)+o\left\{\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)\right\}\right|
\displaystyle=o(1/\sqrt{n}), (23)

where the second line follows since, by Assumption C1, \|\dot{\ell}(z\mid\theta)\|\leq b(z).

Now, by the triangle inequality,

\displaystyle\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,0\right)\right|\leq\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|+\sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|.

From equation (23), for any \varepsilon>0, there exists an n_{1} large enough so that for all n\geq n_{1}, \sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|\leq\varepsilon/2. To handle the first term, note that by Assumption C1, for any \theta,\theta^{\prime}\in\Theta, and some \overline{\theta}\in\Theta,

\|\dot{\ell}(z\mid\theta)-\dot{\ell}(z\mid\theta^{\prime})\|\leq\|\ddot{\ell}(z\mid\overline{\theta})\|\|\theta-\theta^{\prime}\|\leq b(z)\|\theta-\theta^{\prime}\|.

Since \mathbb{E}[b(z)]<\infty, it follows directly from Theorem 21.10 of Davidson (1994) that

\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\leq\varepsilon/2

with probability converging to one. Hence, for any \varepsilon>0,

\displaystyle\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,0\right)\right|\geq\varepsilon\right]\leq\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\geq\varepsilon/2\right]
\displaystyle\quad+\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|\geq\varepsilon/2\right]
\displaystyle=o(1). ∎

Proof of Lemma C.3.

The result uses Lemma C.2, the nature of the model misspecification in Assumption 2, and the separable structure of the likelihood.

Result 1. Recall the expectation

\phi(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta)}{\partial\theta}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z).

From Lemma C.2, and the regularity conditions on h(z\mid\theta,\delta) in Assumption C1, \phi(\theta,\delta_{n}) exists and is continuous on \Theta\times\Delta. The second portion of Lemma C.2 implies that, for any 0<\varepsilon<+\infty,

\sup_{\theta\in\Theta(\varepsilon)}|n^{-1}\dot{\ell}(\theta)-\phi(\theta,0)|=o_{p}(1).

Define \phi_{n}:=\phi(\theta_{0},\delta_{n}) and note that, since \phi(\cdot,\delta) is continuous in \delta, as n\rightarrow\infty,

\phi_{n}\rightarrow\phi(\theta_{0},\delta_{0}).

However, since we can exchange differentiation and integration (Assumption C1), and since h(\cdot\mid\theta,\delta) is continuous in \delta, for each \theta\in\Theta, we have

\displaystyle\phi(\theta_{0},\delta_{n})=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\{1+\delta_{n}(z)/\sqrt{n}\}f(z\mid\theta_{0})\mathrm{d}\mu(z)\{1+o(1/\sqrt{n})\}
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\delta_{n}(z)\mathrm{d}\mu(z)\{1+o(1/\sqrt{n})\}. (24)

The first term in the above is zero since

\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\int_{\mathcal{Z}}\frac{\partial}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\frac{\partial}{\partial\theta}\int_{\mathcal{Z}}f(z\mid\theta_{0})\mathrm{d}\mu(z)=0.

Considering the second term, using Assumption 2 in the main text and Assumption C1, we see that the integral portion of the second term satisfies

\left\|\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\delta_{n}(z)\mathrm{d}\mu(z)\right\|\leq\int_{\mathcal{Z}}b(z)a(z)\mathrm{d}\mu(z)<\infty.

Hence, \lim_{n\rightarrow+\infty}\sqrt{n}\phi(\theta_{0},\delta_{n})=\eta, which yields the first part of the stated result.


Result 2. To deduce the second result, we need only establish a CLT for \{n^{-1/2}\dot{\ell}(\theta_{0})-\sqrt{n}\phi(\theta_{0},\delta_{n})\}. This can be established using arguments similar to those given in Lemma 2.1 of Newey (1985). In particular, let \lambda\in\mathbb{R}^{d} be a non-zero vector, with \|\lambda\|<\infty, and define Y_{n}(z):=\lambda^{\top}[\dot{\ell}(z\mid\theta_{0})-\phi_{n}] and Y_{i,n}=Y_{n}(z_{i}) (i=1,\ldots,n). For each n\geq 1, the random variable Y_{i,n} is mean-zero and has a strictly positive variance \lambda^{\top}\Omega_{n}\lambda, for all n large enough, where \Omega_{n}:=\mathbb{E}_{n}[\dot{\ell}(z\mid\theta_{0})\dot{\ell}(z\mid\theta_{0})^{\top}]. For \varepsilon>0, let A_{n}(\varepsilon)=\{z:|\dot{\ell}(z,\theta_{0})-\phi_{n}|>\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}\}. For each n, the Y_{i,n} (i=1,\ldots,n) are identically distributed, so that by Assumption C1, for any \varepsilon>0,

\displaystyle\left[1/\left(n\lambda^{\top}\Omega_{n}\lambda\right)\right]\cdot\sum_{i=1}^{n}\int_{|Y_{i,n}|\geq\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}}|Y_{i,n}|^{2}h\left(z_{i}\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu\left(z_{i}\right)
\displaystyle=\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\cdot\int_{A_{n}(\varepsilon)}\left|Y_{n}(z)\right|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\int_{A_{n}(\varepsilon)}\|\dot{\ell}(z\mid\theta_{0})-\phi_{n}\|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\left\{\|\phi_{n}\|^{2}+\int_{A_{n}(\varepsilon)}\|\dot{\ell}(z\mid\theta_{0})\|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)\right\}
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\left[\|\phi_{n}\|^{2}+\int_{A_{n}(\varepsilon)}b(z)a(z)\mathrm{d}\mu(z)\right].

Since h(z\mid\theta_{0},\delta_{n})\rightarrow h(z\mid\theta_{0},\delta_{0})\equiv f(z\mid\theta_{0}), by continuity \Omega_{n}\rightarrow\mathbb{E}[\dot{\ell}(z\mid\theta_{0})\dot{\ell}(z\mid\theta_{0})^{\top}] and \lim\lambda^{\top}\Omega_{n}\lambda=\lambda^{\top}\Omega\lambda>0, so that by Lemma A.1 of Newey (1985), the set A_{n}(\varepsilon)\rightarrow\emptyset as n\rightarrow+\infty. Hence, by Assumption C1, \lim_{n}\int_{A_{n}(\varepsilon)}b(z)a(z)\mathrm{d}\mu(z)=0. Lastly, since \phi_{n}\rightarrow 0, we have that

\left[1/\left(n\lambda^{\top}\Omega_{n}\lambda\right)\right]\cdot\sum_{i=1}^{n}\int_{|Y_{i,n}|\geq\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}}|Y_{i,n}|^{2}h\left(z_{i}\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu\left(z_{i}\right)=o(1),

which implies that the Lindeberg–Feller condition for the central limit theorem is satisfied, so that by the Cramér–Wold device,

\{n^{-1/2}\dot{\ell}\left(\theta_{0}\right)-\sqrt{n}\phi_{n}\}\Rightarrow N(0,\Omega),

and the second part of the stated result is satisfied.


Result 3. First, define

\phi_{1}(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1})}{\partial\theta_{1}}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z).

Applying the definition of h(z\mid\theta,\delta_{n}) in Assumption 2 yields, up to an o(1/\sqrt{n}) term,

\displaystyle\phi_{1}(\theta,\delta_{n})=\phi_{1}(\theta,\delta_{0})+\partial\phi_{1}(\theta_{0},\bar{\delta})/\partial\delta
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\delta_{n}(z)f(z\mid\theta_{0})\mathrm{d}\mu(z)\frac{1}{\sqrt{n}}. (25)

For the first term in (25), we have

\displaystyle\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\int_{\mathcal{Z}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\frac{\partial}{\partial\theta_{1}}\int f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=0.

By Assumption 2(iii), the second term in (25) satisfies

\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\delta_{n}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)=0.

Repeating the same arguments as in the proof of Result 2, but only for \dot{\ell}_{p}(z\mid\theta_{1})=\partial\log f_{1}(z\mid\theta_{1})/\partial\theta_{1}, we have that

n^{-1/2}\dot{\ell}_{p}\left(\theta_{1,0}\right)\Rightarrow N\{0,\mathbb{E}(\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top})\}\equiv N(0,\mathcal{I}_{p(11)}).

Result 4. Define

\phi_{2}(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta)}{\partial\theta_{2}}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z)

and apply the definition of h(z\mid\theta,\delta_{n}) in Assumption 2 to see that, up to an o(1/\sqrt{n}) term,

\phi_{2}(\theta,\delta_{n})=\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}\delta_{n}(z)f(z\mid\theta_{0})\mathrm{d}\mu(z)\frac{1}{\sqrt{n}}.

Similar arguments to those used to prove Result 3 yield

\phi_{2}(\theta_{0},\delta_{n})=0+\eta/\sqrt{n}.

The distributional result then follows similarly to Result 2. ∎

Proof of Lemma C.5.

We prove each result in turn.


Result 1. Write \tau_{n}=(\mathbf{0}_{d_{1}\times 1}^{\top},\phi_{n}^{\top})^{\top}, where \phi_{n}=\phi(\theta_{0},\delta_{n}) was defined in the proof of Lemma C.3, and recall that \tau=(\mathbf{0}_{d_{1}\times 1}^{\top},\eta^{\top})^{\top}. The first and second results in Lemma C.3 together imply that

\{Z_{n}-\sqrt{n}\tau_{n}\}\Rightarrow N(0,\Omega),

where

\displaystyle\Omega:=\lim_{n\rightarrow+\infty}\mathbb{E}[Z_{n}Z_{n}^{\top}]
\displaystyle=\begin{pmatrix}\mathbb{E}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\dot{\ell}_{p(1)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}]\\ \mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{c(2)}^{\top}]\\ \mathbb{E}[\dot{\ell}_{c(2)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\dot{\ell}_{c(2)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}]\end{pmatrix}.

However, from Lemma C.1, we know that

\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}\left[\dot{\ell}_{p(1)}\dot{\ell}_{c(1)}^{\top}\right]=0,\text{ and }\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}\left[\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}\right]=0.

Applying these relationships, we then have the general form of \Omega as stated in the result. Since \sqrt{n}\tau_{n}\rightarrow\tau, the first stated result follows.

The second part of the result follows by applying the results of Lemma C.1 to see that we can rewrite \Omega as

\Omega=\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&0\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ 0&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix},

where

\mathcal{I}_{11}=\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)},\quad\mathcal{I}_{22}=\mathcal{I}_{c(22)},\quad\mathcal{I}_{21}=\mathcal{I}_{c(21)}.
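This block form of \Omega can be checked empirically. The sketch below (continuing the toy construction and the names Ip, Ic, I11, I12, I21, I22, Z0, and rng from the earlier illustrative sketches, which remain assumptions) simulates score-like draws with the orthogonality of Lemma C.1 and compares the sample covariance of Z_{n}-type vectors with the block matrix above.

# Empirical check (illustrative) of the block structure of Omega: draw
# u ~ score of the theta_1 module and (v, w) ~ cut-module scores, with u
# independent of (v, w), then Z = (u, u + v, w) has covariance Omega.
Lp = np.linalg.cholesky(Ip)
Lc = np.linalg.cholesky(Ic)
u = (Lp @ rng.standard_normal((d1, 100_000))).T
vw = (Lc @ rng.standard_normal((d, 100_000))).T
Z = np.hstack([u, u + vw[:, :d1], vw[:, d1:]])
emp = Z.T @ Z / len(Z)
ref = np.block([[Ip, Ip, Z0], [Ip, I11, I12], [Z0.T, I21, I22]])
assert np.allclose(emp, ref, atol=0.5)   # loose tolerance for Monte Carlo error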

Result 2. Define

V_{22}:=\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1},\quad\mathcal{I}_{22.1}^{-1}:=\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1},

and recall the definitions of \Gamma_{\mathrm{cut}} and \Gamma_{\mathrm{full}} to see that

\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}}=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}.

Firstly,

\displaystyle(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&0\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ 0&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}
\displaystyle=\begin{pmatrix}I_{d_{1}}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&I-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}_{d_{1}\times d_{2}}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+V_{22}\mathcal{I}_{21}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}+{V_{22}}\mathcal{I}_{22}\end{pmatrix}.

Therefore,

\displaystyle(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=
\displaystyle\begin{pmatrix}I_{d_{1}}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&I-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}_{d_{1}\times d_{2}}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+V_{22}\mathcal{I}_{21}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}+{V_{22}}\mathcal{I}_{22}\end{pmatrix}\times
\displaystyle\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ \mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}. (26)

To derive the stated result, we analyze each of the corresponding blocks in the d\times d matrix.

(d_{1}\times d_{1})-Block. Let \mathcal{M}_{11} denote the (d_{1}\times d_{1})-block of \mathcal{M}. Multiplying the first row and first column of (26), we obtain

\displaystyle\mathcal{M}_{11}=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}
\displaystyle=W-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}\underbrace{(\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21})}_{=\mathcal{I}_{11.2}}\mathcal{I}_{11.2}^{-1}
\displaystyle=W-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}
\displaystyle=W,

where we recall that W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}.

(d_{1}\times d_{2})-Block. Let \mathcal{M}_{12} denote the (d_{1}\times d_{2})-block of \mathcal{M}. Multiplying the first row and second column of (26), we obtain

\displaystyle\mathcal{M}_{12}=-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{11.2}^{-1}(\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21})\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}.

(d_{2}\times d_{2})-Block. Let \mathcal{M}_{22} denote the (d_{2}\times d_{2})-block of \mathcal{M}. Multiplying the second row and second column of (26), and using the fact that \mathcal{I}_{22.1}^{-1}=\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}, so that V_{22}=\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1}=-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}, we obtain

\displaystyle\mathcal{M}_{22}=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+V_{22}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}.

Substituting the blocks $\mathcal{M}_{11},\mathcal{M}_{12},\mathcal{M}_{22}$ into equation (26) yields

\[
(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}=\mathcal{M}.
\]

Result 3. Arguing similarly to Result 2,

\begin{align*}
\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top} &=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&\mathbf{O}&\mathbf{O}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathbf{O}&\mathcal{I}_{22}^{-1}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&\mathbf{O}\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ \mathbf{O}&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\\
&=\begin{pmatrix}I_{d_{1}}&I_{d_{1}}&\mathbf{O}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}&I_{d_{2}}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ \mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}\\
&=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}\\
&=\mathcal{M}.
\end{align*}
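The block identities in Results 2 and 3 are purely algebraic, so they can be checked numerically. The following minimal sketch assumes only that $\mathcal{I}$ and $\mathcal{I}_{p(11)}$ are symmetric positive definite and takes $W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}$, the form consistent with the cancellations above; the variable names and random inputs are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 2, 3

def spd(d):
    # random symmetric positive-definite matrix
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# full information matrix I (partitioned) and prior-module block I_{p(11)}
I = spd(d1 + d2)
I11, I12 = I[:d1, :d1], I[:d1, d1:]
I21, I22 = I[d1:, :d1], I[d1:, d1:]
Ip11 = spd(d1)

iI22 = np.linalg.inv(I22)
I11_2 = I11 - I12 @ iI22 @ I21                  # I_{11.2}
A, B = np.linalg.inv(Ip11), np.linalg.inv(I11_2)
W = A - B                                       # assumed form of W
V22 = -iI22 @ I21 @ B @ I12 @ iI22              # V_{22}

Z = np.zeros
Gamma_cut = np.block([[A,               Z((d1, d1)), Z((d1, d2))],
                      [-iI22 @ I21 @ A, Z((d2, d1)), iI22]])
# Gamma_cut - Gamma_full, read off from the second display in Result 3
D = np.block([[A,               -B,             B @ I12 @ iI22],
              [-iI22 @ I21 @ A, iI22 @ I21 @ B, V22]])
Omega = np.block([[Ip11,        Ip11, Z((d1, d2))],
                  [Ip11,        I11,  I12],
                  [Z((d2, d1)), I21,  I22]])
M = np.block([[W,               -W @ I12 @ iI22],
              [-iI22 @ I21 @ W, iI22 @ I21 @ W @ I12 @ iI22]])

assert np.allclose(D @ Omega @ D.T, M)           # Result 2
assert np.allclose(Gamma_cut @ Omega @ D.T, M)   # Result 3
print("block identities verified")
```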

Proof of Theorem C.1.

Recall $\overline{\theta}_{\mathrm{full}}:=\int_{\Theta}\theta\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\,\mathrm{d}\theta$, and decompose

\begin{align*}
\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0}) &=\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0}-T_{n,\mathrm{full}})+\sqrt{n}T_{n,\mathrm{full}}\\
&=\int_{\Theta}\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}})\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\,\mathrm{d}\theta+\sqrt{n}T_{n,\mathrm{full}}\\
&=\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t+\sqrt{n}T_{n,\mathrm{full}},
\end{align*}

where $\mathcal{T}:=\{t=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}}):\theta\in\Theta\}$ and

\[
T_{n,\mathrm{full}}=\Gamma_{\mathrm{full}}Z_{n}/\sqrt{n}=\mathcal{I}^{-1}\dot{\ell}(\theta_{0})/n.
\]

From Lemma C.3,

\[
\sqrt{n}T_{n,\mathrm{full}}\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1}).
\]

Hence, the stated result follows if $\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t=o_{p}(1)$.

To show this, we first note that

\[
\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t=\int_{\mathcal{T}}t[\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}]\,\mathrm{d}t,
\]

since $0=\int_{\mathcal{T}}tN\{t;0,\mathcal{I}^{-1}\}\,\mathrm{d}t$. We then have

\[
\left\|\int_{\mathcal{T}}t[\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}]\,\mathrm{d}t\right\|\leq\int_{\mathcal{T}}\|t\|\,|\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}|\,\mathrm{d}t.
\]

The regularity conditions in Assumptions C1-C2 are sufficient to apply Theorem 8.2 of Lehmann and Casella (2006) and deduce that

\[
\int_{\mathcal{T}}\|t\|\,|\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}|\,\mathrm{d}t=o_{p}(1).
\]
∎

Proof of Theorem C.2.

The proof follows similar arguments to the proof of Theorem C.1 if we set

\[
T_{n,\mathrm{cut}}=\begin{pmatrix}T_{1,n,\mathrm{cut}}\\ T_{2,n,\mathrm{cut}}\end{pmatrix}=\begin{pmatrix}\Gamma_{1,\mathrm{cut}}Z_{n}/\sqrt{n}\\ \Gamma_{2,\mathrm{cut}}Z_{n}/\sqrt{n}\end{pmatrix}\equiv\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}\dot{\ell}_{p(1)}(\theta_{1,0})/n\\ \mathcal{I}_{22}^{-1}\{\dot{\ell}_{c(2)}(\theta_{0})/n-\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\dot{\ell}_{p(1)}(\theta_{1,0})/n\}\end{pmatrix}.
\]

In particular, if we replace the full posterior in the proof of Theorem C.1 with the cut posterior $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$, and the full posterior mean $\overline{\theta}_{\mathrm{full}}$ with the cut posterior mean $\overline{\theta}_{1,\mathrm{cut}}=\int_{\Theta_{1}}\theta_{1}\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\mathrm{d}\theta_{1}$, we have that, for $\mathcal{V}_{1}:=\{\vartheta_{1}=\sqrt{n}(\theta_{1}-\theta_{1,0}-T_{1,n,\mathrm{cut}}):\theta_{1}\in\Theta_{1}\}$,

\[
\int_{\mathcal{V}_{1}}\|\vartheta_{1}\|\,|\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})-N\{\vartheta_{1};0,\mathcal{I}_{p(11)}^{-1}\}|\,\mathrm{d}\vartheta_{1}=o_{p}(1).
\]

Using the above results and the following decomposition

\begin{align*}
\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0}) &=\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0}-T_{1,n,\mathrm{cut}})+\sqrt{n}T_{1,n,\mathrm{cut}}\\
&=\int_{\Theta_{1}}\sqrt{n}(\theta_{1}-\theta_{1,0}-T_{1,n,\mathrm{cut}})\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\mathrm{d}\theta_{1}+\sqrt{n}T_{1,n,\mathrm{cut}}\\
&=\int_{\mathcal{V}_{1}}\vartheta_{1}\,\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})\,\mathrm{d}\vartheta_{1}+\sqrt{n}T_{1,n,\mathrm{cut}},
\end{align*}

we have that

\[
\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})=\sqrt{n}T_{1,n,\mathrm{cut}}+o_{p}(1).
\]

Appealing to Lemma C.3 we have that

\[
\sqrt{n}T_{1,n,\mathrm{cut}}\Rightarrow N(0,\mathcal{I}_{p(11)}^{-1}),
\]

which yields the stated results for the θ1\theta_{1} dimension.

To prove the results for the θ2\theta_{2} dimension, first write

\begin{align*}
\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0}) &=\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0}-T_{2,n,\mathrm{cut}})+\sqrt{n}T_{2,n,\mathrm{cut}}\\
&=\int_{\Theta_{2}}\sqrt{n}(\theta_{2}-\theta_{2,0}-T_{2,n,\mathrm{cut}})\,\pi_{\mathrm{cut}}(\theta_{2}\mid z_{1:n})\,\mathrm{d}\theta_{2}+\sqrt{n}T_{2,n,\mathrm{cut}}\\
&=\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}+\sqrt{n}T_{2,n,\mathrm{cut}},
\end{align*}

where $\mathcal{V}_{2}:=\{\vartheta_{2}=\sqrt{n}(\theta_{2}-\theta_{2,0}-T_{2,n,\mathrm{cut}}):\theta_{2}\in\Theta_{2}\}$. From Lemma C.3, the second term satisfies

\[
\sqrt{n}T_{2,n,\mathrm{cut}}\Rightarrow N(\mathcal{I}_{22}^{-1}\eta_{2},\,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}),
\]

and the result then follows if $\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}=o_{p}(1)$. Arguing as in the proof of Theorem C.1,

\begin{align*}
\left\|\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}\right\|\leq{} &\int_{\mathcal{V}_{2}}\|\vartheta_{2}\|\,|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\,\mathrm{d}\vartheta_{2}\\
&+\left\|\int_{\mathcal{V}_{2}}\vartheta_{2}\,N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})\,\mathrm{d}\vartheta_{2}\right\|,
\end{align*}

where the second term vanishes by the same argument used for $\int_{\mathcal{T}}tN\{t;0,\mathcal{I}^{-1}\}\,\mathrm{d}t$ in the proof of Theorem C.1. The regularity conditions in Assumptions C1-C2 are sufficient to apply Corollary 1 of Frazier and Nott (2024) to deduce that

\[
\int_{\mathcal{V}_{2}}\|\vartheta_{2}\|\,|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\,\mathrm{d}\vartheta_{2}=o_{p}(1).
\]
∎

Proof of Corollary C.1.

For $j=1$, we need only note that the cut posterior mean exhibits no asymptotic bias, so the result is immediate.

For $j=2$, recall that by Theorem C.2, the bias of the cut posterior mean is given by $\mathcal{I}_{22}^{-1}\eta_{2}$. By Theorem C.1, the full posterior for $\theta_{2}$ exhibits the asymptotic bias

\[
\left(\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\right)\eta_{2}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1}.
\]

The squared difference between the biases for $\theta_{2,0}$ under the cut and full posteriors is then positive so long as

\begin{align*}
&\eta_{2}^{\top}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2}+\eta_{1}^{\top}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1}\\
&\qquad>2\eta_{1}^{\top}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2}.
\end{align*}

Writing

\[
X=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2},\qquad Y=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1},
\]

we can rewrite the above restriction as

\[
X^{\top}X+Y^{\top}Y-2X^{\top}Y=\|X-Y\|^{2}\geq 0,
\]

which always holds. ∎

Proof of Lemma C.6.

First, note that in the proof of Theorem C.1 we have shown that

\[
\|\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})-\Gamma_{\mathrm{full}}Z_{n}\|=o_{p}(1),
\]

and that, by Theorem C.2,

\[
\|\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})-\Gamma_{\mathrm{cut}}Z_{n}\|=o_{p}(1).
\]

Using these results and Lemmas C.1-C.5, it follows that

\[
\begin{pmatrix}\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\\ \sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\end{pmatrix}=\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}Z_{n}+o_{p}(1)\Rightarrow\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}(\xi+\tau).
\]

The continuous mapping theorem then yields the stated results. ∎

Appendix D HPV prevalence and cancer incidence

We return to the example in Section 2.1 of the main text, concerning the relationship between HPV prevalence and cervical cancer incidence. We use it to demonstrate how the semi-modular posterior can vary for different loss functions. In particular, we consider an S-SMP for a loss function targeting the HPV prevalence in country $j$, $\theta_{1,j}$, and show how the S-SMP mixing weight and semi-modular inferences vary across $j$. We use the mixing weight

\[
\widehat{\omega}_{+}=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{{\sigma_{\text{cut}}^{(j)}}^{2}-{\sigma^{(j)}}^{2}}{(\bar{\theta}_{1,\text{cut}}^{(j)}-\bar{\theta}_{1}^{(j)})^{2}}\,\mathbb{I}({\sigma_{\text{cut}}^{(j)}}^{2}-{\sigma^{(j)}}^{2}>0),
\]

where ${\sigma_{\text{cut}}^{(j)}}^{2}$ and ${\sigma^{(j)}}^{2}$ are the cut and full marginal posterior variances for $\theta_{1,j}$, and $\bar{\theta}_{1,\text{cut}}^{(j)}$ and $\bar{\theta}_{1}^{(j)}$ are the cut and full marginal posterior means. Figure 8 shows kernel density estimates of the marginal SMP densities for each $\theta_{1,j}$, together with the SMP mixing weights $\widehat{\omega}_{+}$ and the cut and full posterior $\theta_{1,j}$ marginals. The countries are ordered in the plot (left to right, top to bottom) according to the cut posterior mean values. Interestingly, the pooling weights vary widely according to which country is under analysis; when the difference in location of the cut and full posterior distributions is large compared to the posterior variability, the SMP is close to the cut posterior, whereas the SMP is closer to the full posterior otherwise.
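For concreteness, the sketch below shows one way $\widehat{\omega}_{+}$ and draws from the corresponding SMP could be computed from marginal posterior samples. It assumes the S-SMP is the two-component mixture placing weight $\widehat{\omega}_{+}$ on the full posterior and $1-\widehat{\omega}_{+}$ on the cut posterior, consistent with the limiting behavior described in the next paragraph; the function names are ours and not part of any released code.

```python
import numpy as np

def smp_weight(theta_cut, theta_full):
    """S-SMP mixing weight from marginal posterior draws for a scalar
    functional, following the displayed formula for omega-hat."""
    var_cut, var_full = np.var(theta_cut), np.var(theta_full)
    if var_cut - var_full <= 0:
        return 0.0                      # indicator term in the formula
    gap2 = (np.mean(theta_cut) - np.mean(theta_full)) ** 2
    if gap2 == 0.0:
        return 1.0                      # omega-hat is +inf; truncate at one
    return min(1.0, (var_cut - var_full) / gap2)

def smp_sample(theta_cut, theta_full, size, rng=None):
    """Draws from the assumed two-component SMP mixture."""
    if rng is None:
        rng = np.random.default_rng()
    w = smp_weight(theta_cut, theta_full)
    use_full = rng.random(size) < w     # weight w on the full posterior
    return np.where(use_full,
                    rng.choice(theta_full, size),
                    rng.choice(theta_cut, size))
```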

The results in Figure 8 give novel insight into the impact of model misspecification in this example. While it is known that the Poisson assumption is inadequate, there is substantial variability in the adequacy of this assumption for the purpose of inference in specific countries: for certain countries we obtain a pooling weight of unity, so that the SMP corresponds to the full posterior, while for other countries the pooling weight is close to zero, so that the SMP corresponds to the cut posterior. In view of Corollary 1, when the model is grossly misspecified the SMP produces a pooling weight that is asymptotically zero. Hence, the empirical results in Figure 8 demonstrate that there are certain countries for which the Poisson model is adequate and others for which it is not. Critically, however, the SMP we have proposed does not require us to choose a priori which posterior to use for which country; instead, the SMP lets the observed data determine which posterior is better supported. (MCMC computations for the full posterior distribution were done using Stan (Carpenter et al., 2017), whereas cut posterior samples for $\theta_{1}$ given $X$ are generated directly using conjugacy, followed by sampling importance resampling based on 1,000 draws from an asymptotic normal approximation to the conditional posterior of $\theta_{2}$ given $\theta_{1}$ and $Y$, to obtain each $\theta_{2}$ sample. Kernel density estimates in the plots are based on 1,000 posterior samples.)
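The sampling importance resampling (SIR) step mentioned in the parenthetical note can be sketched generically as follows. Here `log_post`, `mu`, and `Sigma` are placeholders for the (unnormalized) log conditional posterior of $\theta_{2}$ given $\theta_{1}$ and $Y$ and its asymptotic normal approximation; none of this is taken from the authors' code.

```python
import numpy as np

def sir_draw(log_post, mu, Sigma, n_prop=1000, rng=None):
    """One SIR draw: propose from N(mu, Sigma), weight by target/proposal,
    then resample a single proposal with probability proportional to weight."""
    if rng is None:
        rng = np.random.default_rng()
    props = rng.multivariate_normal(mu, Sigma, size=n_prop)
    diff = props - mu
    # proposal log-density, up to an additive constant
    log_q = -0.5 * np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    log_w = np.array([log_post(p) for p in props]) - log_q
    w = np.exp(log_w - log_w.max())     # stabilize before normalizing
    return props[rng.choice(n_prop, p=w / w.sum())]
```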


Figure 8: Marginal cut, full and semi-modular posterior (SMP) densities for HPV prevalence in different countries. The SMP obtained for country $j$ uses a loss function depending on $j$ as described in the text. $\widehat{\omega}_{+}$ is the estimated mixing weight in the SMP. Countries are ordered (left to right, top to bottom) according to the cut posterior mean values.

Appendix E Additional Results for Biased Means Example

In this section, we present the repeated sampling behavior of the cut, full and S-SMP posteriors in the simulation design presented in Section 3.2 of the main text, together with additional results on the behavior of the S-SMP for further choices of $d_{1}$. Please refer to the main text for full details of the simulation design.

Table 2 compares the behavior of the posterior mean for the three methods across different values of $\delta$ in the case where $d_{1}=1$. Similar results for the case $d_{1}=5$ are reported in Table 3. In each table we report the bias of the posterior mean (Bias), the average posterior variance across the replications (Var), and the Monte Carlo coverage of the 95% posterior credible set (Cov).
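As a reference point, the three reported summaries can be computed from the replication output as in the short sketch below; the array names are hypothetical, with each replication assumed to supply a posterior mean, a posterior variance, and the endpoints of a 95% credible interval.

```python
import numpy as np

def table_summaries(post_means, post_vars, ci_lo, ci_hi, theta_true):
    """Bias, Var and Cov columns of Tables 2-3 for a scalar parameter,
    from arrays of length R (one entry per Monte Carlo replication)."""
    bias = np.mean(post_means) - theta_true        # bias of the posterior mean
    avg_var = np.mean(post_vars)                   # average posterior variance
    cov = np.mean((ci_lo <= theta_true) & (theta_true <= ci_hi))  # MC coverage
    return bias, avg_var, cov
```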

Regarding the results for $d_{1}=1$ and $d_{1}=5$, we see that in both cases the S-SMP accepts some bias, relative to the cut posterior, in exchange for a significant reduction in variability. In both cases, the average cut posterior variance is about 0.01 across all values of $\delta$, while the average S-SMP variance ranges from about 0.003 to 0.005, which is significantly smaller than that of the cut posterior. Further, the bias exhibited by the S-SMP is generally much smaller than that exhibited by the full posterior, but is, by construction, larger than that exhibited by the cut posterior.

Recall that, as discussed in the remarks following the statement of Lemma 2 in the main text, under model misspecification only the cut posterior for $\theta_{1}$ can deliver credible sets that are also valid frequentist confidence sets. Analyzing the coverage of the posteriors across the two designs, we see that this is indeed the case, with only the cut posterior delivering accurate coverage. In comparison, in both cases the full posterior has coverage that is zero or nearly zero once $\delta>0.10$. The S-SMP fares better than the full posterior, but can deliver unreliable coverage when $d_{1}=1$. In contrast, as the dimension of $\theta_{1}$ increases to $d_{1}=5$, the coverage of the S-SMP improves dramatically and is only slightly below the nominal level for most values of $\delta$. While not reported for the sake of brevity, extending these experiments to larger values of $d_{1}$ shows that the coverage of the S-SMP improves as $d_{1}$ increases, so that when $d_{1}=25$ the coverage of the S-SMP is close to the nominal level across all values of $\delta$ used in the experiment.

Lastly, we demonstrate the impact of increasing $d_{1}$ on the accuracy of the S-SMP, as measured by expected squared error. To this end, we present the expected squared error of the cut, full and S-SMP posteriors for $d_{1}=1,5,10,25$; these results are plotted in Figure 9. Analyzing these results, we see that as $d_{1}$ increases, and for $\delta$ small, the S-SMP resembles the full posterior, but as $\delta$ increases the S-SMP more closely resembles the cut posterior. This is a consequence of the behavior of the pooling weight used in the S-SMP, which shifts more easily between the cut and full posteriors when $d>2$; we refer to Section 4 of the main paper for a technical explanation of this phenomenon.
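The risk curves of Figure 9 can be estimated as in the sketch below, under the assumption that the P-risk is the expected posterior expectation of squared error; `draws` is a hypothetical array of posterior samples collected across replications.

```python
import numpy as np

def mc_posterior_risk(draws, theta_true):
    """Monte Carlo P-risk estimate under squared error loss.
    draws: shape (R, S, d1), S posterior draws from each of R replications."""
    sq_err = np.sum((draws - theta_true) ** 2, axis=-1)  # (R, S)
    return sq_err.mean()          # average over draws and replications
```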

        Cut                   Full                  S-SMP
δ       Bias    Var    Cov    Bias    Var    Cov    Bias    Var    Cov
0.1000 -0.0081 0.0099 0.9400 -0.0131 0.0001 0.7540 -0.0098 0.0028 0.8190
0.1500 -0.0080 0.0099 0.9410 -0.0194 0.0001 0.5430 -0.0121 0.0029 0.7040
0.2000 -0.0082 0.0099 0.9410 -0.0257 0.0001 0.3100 -0.0143 0.0029 0.5800
0.2500 -0.0080 0.0099 0.9430 -0.0318 0.0001 0.1410 -0.0161 0.0030 0.5050
0.3000 -0.0082 0.0099 0.9400 -0.0380 0.0001 0.0360 -0.0182 0.0030 0.4560
0.3500 -0.0080 0.0099 0.9430 -0.0441 0.0001 0.0060 -0.0198 0.0031 0.4520
0.4000 -0.0081 0.0099 0.9400 -0.0503 0.0001 0.0010 -0.0218 0.0032 0.4610
0.4500 -0.0081 0.0099 0.9420 -0.0566 0.0001 0.0000 -0.0234 0.0033 0.4720
0.5000 -0.0082 0.0099 0.9420 -0.0628 0.0001 0.0000 -0.0248 0.0034 0.4810
0.5500 -0.0081 0.0099 0.9430 -0.0690 0.0001 0.0000 -0.0261 0.0036 0.4950
0.6000 -0.0082 0.0099 0.9420 -0.0752 0.0001 0.0000 -0.0276 0.0037 0.5010
0.6500 -0.0081 0.0099 0.9400 -0.0814 0.0001 0.0000 -0.0289 0.0038 0.4950
0.7000 -0.0081 0.0099 0.9430 -0.0877 0.0001 0.0000 -0.0302 0.0040 0.5010
0.7500 -0.0081 0.0099 0.9410 -0.0939 0.0001 0.0000 -0.0314 0.0042 0.5170
0.8000 -0.0079 0.0099 0.9420 -0.1001 0.0001 0.0000 -0.0321 0.0043 0.5440
0.8500 -0.0081 0.0099 0.9410 -0.1065 0.0001 0.0000 -0.0332 0.0045 0.5580
0.9000 -0.0082 0.0099 0.9430 -0.1126 0.0001 0.0000 -0.0340 0.0047 0.5790
Table 2: Repeated sampling results for $\beta_{1}$ in the biased means example presented in Section 3.2 of the main text when $d_{1}=1$. Bias is the average bias of the posterior mean across the replications. Var is the average of the posterior variance, and Cov is the actual coverage of the 95% posterior credible set.
        Cut                   Full                  S-SMP
δ       Bias    Var    Cov    Bias    Var    Cov    Bias    Var    Cov
0.1000 -0.0074 0.0099 0.9550 -0.0128 0.0005 0.9960 -0.0107 0.0039 0.9800
0.1500 -0.0075 0.0099 0.9530 -0.0186 0.0005 0.9900 -0.0135 0.0039 0.9790
0.2000 -0.0075 0.0099 0.9540 -0.0247 0.0005 0.9640 -0.0162 0.0040 0.9770
0.2500 -0.0073 0.0099 0.9540 -0.0306 0.0005 0.8840 -0.0185 0.0041 0.9560
0.3000 -0.0075 0.0099 0.9530 -0.0366 0.0005 0.7440 -0.0213 0.0042 0.9220
0.3500 -0.0075 0.0099 0.9570 -0.0426 0.0005 0.5120 -0.0235 0.0043 0.8750
0.4000 -0.0074 0.0099 0.9540 -0.0486 0.0005 0.2570 -0.0256 0.0045 0.8360
0.4500 -0.0073 0.0099 0.9550 -0.0546 0.0004 0.1060 -0.0275 0.0047 0.8130
0.5000 -0.0074 0.0099 0.9550 -0.0605 0.0004 0.0270 -0.0293 0.0049 0.8150
0.5500 -0.0074 0.0099 0.9550 -0.0665 0.0004 0.0060 -0.0306 0.0051 0.8290
0.6000 -0.0074 0.0099 0.9550 -0.0726 0.0004 0.0000 -0.0318 0.0053 0.8400
0.6500 -0.0073 0.0099 0.9560 -0.0787 0.0004 0.0000 -0.0326 0.0055 0.8570
0.7000 -0.0074 0.0099 0.9560 -0.0848 0.0004 0.0000 -0.0335 0.0058 0.8640
0.7500 -0.0073 0.0099 0.9530 -0.0908 0.0004 0.0000 -0.0338 0.0060 0.8790
0.8000 -0.0075 0.0099 0.9550 -0.0970 0.0004 0.0000 -0.0342 0.0062 0.8830
0.8500 -0.0074 0.0099 0.9560 -0.1031 0.0004 0.0000 -0.0343 0.0065 0.8980
0.9000 -0.0074 0.0099 0.9550 -0.1093 0.0003 0.0000 -0.0344 0.0067 0.8990
Table 3: Repeated sampling results for $\beta_{1}$ in the biased means example presented in Section 3.2 of the main text when $d_{1}=5$. Bias is the average bias of the posterior mean across the replications. Var is the average of the posterior variance, and Cov is the actual coverage of the 95% posterior credible set.


Figure 9: Monte Carlo estimate of expected risk under squared error loss across different levels of contamination ($\delta$) when $d_{1}\in\{1,5,10,25\}$. Full refers to the expected risk for $\theta_{1,0}$ associated with the full posterior based on both sets of data; Cut is the P-risk associated with the cut posterior; SMP is the P-risk for the proposed semi-modular posterior.

References

  • Bennett, J. and J. Wakefield (2001). Errors-in-variables in joint population pharmacokinetic/pharmacodynamic modeling. Biometrics 57(3), 803–812.
  • Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(5), 1103–1130.
  • Blangiardo, M., A. Hansell, and S. Richardson (2011). A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment 45(2), 379–386.
  • Carmona, C. and G. Nicholls (2020). Semi-modular inference: enhanced learning in multi-modular models by tempering the influence of components. In International Conference on Artificial Intelligence and Statistics, pp. 4226–4235. PMLR.
  • Carmona, C. and G. Nicholls (2022). Scalable semi-modular inference with variational meta-posteriors. arXiv preprint arXiv:2204.00296.
  • Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: a probabilistic programming language. Journal of Statistical Software 76(1), 1–32.
  • Castillo, I. (2014). On Bayesian supremum norm contraction rates. The Annals of Statistics.
  • Chakraborty, A., D. J. Nott, C. Drovandi, D. T. Frazier, and S. A. Sisson (2022). Modularized Bayesian analyses and cutting feedback in likelihood-free inference. Statistics and Computing (to appear).
  • Claeskens, G. and N. L. Hjort (2003). The focused information criterion. Journal of the American Statistical Association 98(464), 900–916.
  • Davidson, J. (1994). Stochastic Limit Theory: An Introduction for Econometricians. Oxford University Press.
  • Frazier, D. T. and D. J. Nott (2024). Cutting feedback and modularized analyses in generalized Bayesian inference. Bayesian Analysis 1(1), 1–29.
  • Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378.
  • Green, E. J. and W. E. Strawderman (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators. Journal of the American Statistical Association 86(416), 1001–1006.
  • Grünwald, P. and T. van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12(4), 1069–1103.
  • Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics 190(1), 115–132.
  • Hjort, N. L. and G. Claeskens (2003). Frequentist model average estimators. Journal of the American Statistical Association 98(464), 879–899.
  • Jacob, P. E., L. M. Murray, C. C. Holmes, and C. P. Robert (2017). Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
  • Jacob, P. E., J. O'Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 543–600.
  • Judge, G. G. and R. C. Mittelhammer (2004). A semiparametric basis for combining estimation problems under quadratic loss. Journal of the American Statistical Association 99(466), 479–487.
  • Kim, T.-H. and H. White (2001). James-Stein-type estimators in large samples with application to the least absolute deviations estimator. Journal of the American Statistical Association 96(454), 697–705.
  • Kleijn, B. and A. van der Vaart (2012). The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics 6, 354–381.
  • Lee, K. and J. Lee (2018). Optimal Bayesian minimax rates for unconstrained large covariance matrices. Bayesian Analysis.
  • Lehmann, E. L. and G. Casella (2006). Theory of Point Estimation. Springer Science & Business Media.
  • Liu, F., M. Bayarri, and J. Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis 4(1), 119–150.
  • Liu, Y. and R. J. B. Goudie (2022a). A general framework for cutting feedback within modularized Bayesian inference. arXiv preprint arXiv:2211.03274.
  • Liu, Y. and R. J. B. Goudie (2022b). Stochastic approximation cut algorithm for inference in modularized Bayesian models. Statistics and Computing 32(7).
  • Liu, Y. and R. J. B. Goudie (2023). Generalized geographically weighted regression model within a modularized Bayesian framework. Bayesian Analysis (to appear).
  • Lunn, D., N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander (2009). Combining MCMC with 'sequential' PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics 36, 19–38.
  • Maucort-Boulch, D., S. Franceschi, and M. Plummer (2008). International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers 17(3), 717–720.
  • Miller, J. W. (2021). Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research 22(168), 1–53.
  • Newey, W. K. (1985). Maximum likelihood specification testing and conditional moment tests. Econometrica, 1047–1070.
  • Nicholls, G. K., J. E. Lee, C.-H. Wu, and C. U. Carmona (2022). Valid belief updates for prequentially additive loss functions arising in semi-modular inference. arXiv preprint arXiv:2201.09706.
  • Plummer, M. (2015). Cuts in Bayesian graphical models. Statistics and Computing 25(1), 37–43.
  • Pompe, E. and P. E. Jacob (2021). Asymptotics of cut distributions and robust modular inference using posterior bootstrap. arXiv preprint arXiv:2110.11149.
  • Rieder, H. (2012). Robust Asymptotic Statistics: Volume I. Springer Science & Business Media.
  • Rousseau, J. (1997). Asymptotic Bayes risks for a general class of losses. Statistics & Probability Letters 35(2), 115–121.
  • Shen, X. and L. Wasserman (2001). Rates of convergence of posterior distributions. The Annals of Statistics 29(3), 687–714.
  • Stone, M. (1961). The opinion pool. The Annals of Mathematical Statistics, 1339–1342.
  • Styring, A., M. Charles, F. Fantone, M. Hald, A. McMahon, R. Meadow, G. Nicholls, A. Patel, M. Pitre, A. Smith, A. Sołtysiak, G. Stein, J. Weber, H. Weiss, and A. Bogaard (2017). Isotope evidence for agricultural extensification reveals how the world's first cities were fed. Nature Plants 3, 17076.
  • Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Yu, X., D. J. Nott, and M. S. Smith (2023). Variational inference for cutting feedback in misspecified models. Statistical Science 38(3), 490–509.