

Posterior risk of modular and semi-modular Bayesian inference

David T. Frazier Corresponding author: david.frazier@monash.edu Department of Econometrics and Business Statistics, Monash University, Clayton VIC 3800, Australia David J. Nott Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546 Operations Research and Analytics Cluster, National University of Singapore, Singapore 119077
Abstract

Modular Bayesian methods perform inference in models that are specified through a collection of coupled sub-models, known as modules. These modules often arise from modeling different data sources or from combining domain knowledge from different disciplines. “Cutting feedback” is a Bayesian inference method that ensures misspecification of one module does not affect inferences for parameters in other modules, and produces what is known as the cut posterior. However, choosing between the cut posterior and the standard Bayesian posterior is challenging. When misspecification is not severe, cutting feedback can greatly increase posterior uncertainty without a large reduction of estimation bias, leading to a bias-variance trade-off. This trade-off motivates semi-modular posteriors, which interpolate between standard and cut posteriors based on a tuning parameter. In this work, we provide the first precise formulation of the bias-variance trade-off that is present in cutting feedback, and we propose a new semi-modular posterior that takes advantage of it. Under general regularity conditions, we prove that this semi-modular posterior is more accurate than the cut posterior according to a notion of posterior risk. An important implication of this result is that point inferences made under the cut posterior are inadmissible. The new method is demonstrated in a number of examples.

Keywords. Bayesian modular inference, Cutting feedback, Model misspecification, Posterior shrinkage

1 Introduction

In many applications, statistical models arise which can be viewed as a combination of coupled submodels (referred to as modules in the literature). Such models are often complex, frequently containing both shared and module-specific parameters, and module-specific data sources. Examples include pharmacokinetic/pharmacodynamic (PK/PD) models (Bennett and Wakefield, 2001; Lunn et al., 2009), which couple a PK module describing movement of a drug through the body with a PD module describing its biological effect, or models for health effects of air pollution (Blangiardo et al., 2011) with separate modules for predicting pollutant concentrations and predicting health outcomes based on exposure. See Liu et al. (2009) and Jacob et al. (2017) for further examples.

In principle, Bayesian inference is attractive in modular settings due to its ability to combine the different sources of information and update uncertainties about unknowns coherently conditional on all the available data. However, it is well-known that Bayesian inference can be unreliable when the model is misspecified (Kleijn and van der Vaart, 2012). For conventional Bayesian inference in multi-modular models, misspecification in one module can adversely impact inferences about parameters in correctly specified modules. “Cutting feedback” approaches modify Bayesian inference to address this issue. They consider a sequential or conditional decomposition of the posterior distribution following the modular structure, and then modify certain terms so that unreliable information is isolated and cannot influence inferences of interest which may be sensitive to the misspecification.

Cutting feedback is only one technique belonging to a wider class of modular Bayesian inference methods (Liu et al., 2009). Good introductions to the basic idea and applications of cutting feedback are given by Lunn et al. (2009), Plummer (2015) and Jacob et al. (2017). Computational aspects of the approach are discussed in Plummer (2015), Jacob et al. (2020), Liu and Goudie (2022b), Yu et al. (2023) and Carmona and Nicholls (2022). Most of the above references deal only with cutting feedback in a certain “two module” system considered in Plummer (2015), which although simple is general enough to encompass many practical applications of cutting feedback methods. We also consider this two module system throughout the rest of the paper. Some recent progress in defining modules and cut posteriors in greater generality is reported in Liu and Goudie (2022a).

A useful extension of cutting feedback is the semi-modular posterior (SMP) approach of Carmona and Nicholls (2020), which avoids the binary decision of using either the cut or full posterior distribution, and can be viewed as a continuous interpolation between two distributions. Further developments and applications are discussed in Liu and Goudie (2023), Carmona and Nicholls (2022), Nicholls et al. (2022) and Frazier and Nott (2024). The motivation for semi-modular inference is explained clearly in Carmona and Nicholls (2022): “In Cut-model inference, feedback from the suspect module is completely cut. However, […] if the parameters of a well-specified module are poorly informed by “local” information then limited information from misspecified modules may allow us to bring the uncertainty down without introducing significant bias”. The above quote nicely describes the intuition behind SMI. However, there is no formal treatment of the bias-variance trade-off that exists in SMI, nor is there any rigorous discussion as to how SMI could “leverage” such a trade-off in practice.

Herein, we make three fundamental contributions to the literature on cutting feedback and misspecified Bayesian models. First, we formally demonstrate that when model misspecification is not too severe, cut posteriors can deliver inferences with smaller bias but more variability than standard Bayesian posteriors, which provides formal evidence for the bias-variance trade-off that motivates SMI and SMPs. Second, we use this result to develop a novel SMP that leverages this trade-off. Lastly, using a notion that captures the risk of a posterior, we demonstrate that the proposed SMP is preferable to the cut, as well as the full posterior under certain conditions. More specifically, under this notion of risk we show that the cut posteriors produce point estimators that are inadmissible, while, under additional conditions, the standard posterior is also shown to deliver inadmissible point estimators.

The remainder of the paper is organized as follows. In Section 2 we give the general framework and make rigorous the conditions necessary for the existence of a bias-variance trade-off in modular inference problems. In Section 3 we discuss semi-modular inference and describe our semi-modular posterior approach. In this section, a simple example is presented in which the new method produces superior results to those based on the cut and full posteriors. In Section 4 we prove, under ‘classical’ regularity conditions, that our semi-modular posterior outperforms the standard and cut posteriors according to a notion of asymptotic risk for a posterior. Section 5 gives two empirical examples, and Section 6 concludes. Proofs of all stated results are given in the supplementary material.

2 Setup and Discussion of the Cut Posterior

Our first contribution is to formalize the potential benefits of using semi-modular inference methods (Carmona and Nicholls, 2020) in misspecified models. The semi-modular inference approach was originally introduced by Carmona and Nicholls (2020) using a two-module system discussed by Plummer (2015). In our current work, we focus on the same two-module system. It is important to describe our motivation for this choice, in the context of previous work on Bayesian modular inference.

One method for defining cutting feedback methods in misspecified models with more than two modules uses an implicit approach by modifying sampling steps in Markov chain Monte Carlo (MCMC) algorithms. The cut function in the WinBUGS and OpenBUGS software packages implements this in a modified Gibbs sampling approach. See Lunn et al. (2009) for a detailed description. However, defining a cut posterior in terms of an algorithm, while quite general, does not allow us to easily understand the implications of cutting feedback, or the general structure of the posterior, due to the implicit nature of such a definition. To give a better understanding, Plummer (2015) considered a two-module system where the cut posterior can be defined explicitly. In many models where cut methods are used, there might be one model component of particular concern, and a definition of the modules in a two-module system can often be made based on this. Many applications of cutting feedback use such a two-module system. Two module systems also play an important role in the recent attempt by Liu and Goudie (2022a) to explicitly define multi-modular systems and cutting feedback generally, where existing modules can be split into two recursively based on partitioning of the data and using the graphical structure of the model. We now describe the two-module system precisely and describe cutting feedback methods for it, before giving a motivating example.

2.1 Setup

We observe a sequence of data $z_{1:n}=(z_{1},\ldots,z_{n})$, $z_{i}\in\mathcal{Z}$ for each $i$. The observations $z_{1:n}$ are considered an observation of a random vector $Z$, and we wish to conduct Bayesian inference on the unknown parameters $\theta$ in the assumed joint likelihood $f(z_{1:n}\mid\theta)$, where $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$, $\theta\in\Theta:=\Theta_{1}\times\Theta_{2}\subseteq\mathbb{R}^{d}$, $\Theta_{1}\subseteq\mathbb{R}^{d_{1}}$, and $\Theta_{2}\subseteq\mathbb{R}^{d_{2}}$. Our prior beliefs over $\Theta$ are expressed via a prior density $\pi(\theta)=\pi(\theta_{1})\pi(\theta_{2}\mid\theta_{1})$.

In modular Bayesian inference, the joint likelihood $f(z_{1:n}\mid\theta)$ can often be expressed as a product, with terms for data sources from different modules. The simplest case is the two-module system described in Plummer (2015), where the random variables are $Z=(X,Y)$. The first module consists of a likelihood term depending on $X$ and $\theta_{1}$, given by $f_{1}(\cdot\mid\theta_{1})$, and the prior $\pi(\theta_{1})$. The second module consists of a likelihood term depending on $Y$ and $\theta_{1},\theta_{2}$, given by $f_{2}(\cdot\mid\theta_{1},\theta_{2})$, and the conditional prior $\pi(\theta_{2}\mid\theta_{1})$. An example of such a model is given below. A model with this structure leads to the following full posterior

\[ \pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})=\pi(\theta_{1}\mid x_{1:n},y_{1:n})\,\pi(\theta_{2}\mid y_{1:n},\theta_{1}), \qquad (1) \]

in which the conditional posterior for $\theta_{2}$ given $\theta_{1}$ does not depend on the observations for the random variable $X$, denoted as $x_{1:n}=(x_{1},\dots,x_{n})$. The parameter $\theta_{1}$ is shared between the two modules, while $\theta_{2}$ is specific to the second module.

While the full posterior in (1) delivers reliable inference when the model is correctly specified, in the presence of model misspecification Bayesian inference can sometimes be unreliable and not “fit for purpose”; see, e.g., Grünwald and Van Ommen (2017) for examples in the case of linear models, and Kleijn and van der Vaart (2012) for a general discussion. Following the literature on cutting feedback methods, we restrict our attention to settings where misspecification is confined to the second module, while specification of the first module is not impacted. Because the parameter $\theta_{1}$ is shared between modules, inference about this parameter can be corrupted by misspecification of the second module. This can also impact the interpretation of inference about $\theta_{2}$, which can be of interest even if the second module is misspecified, provided inference is done conditionally on values of $\theta_{1}$ consistent with the interpretation of this parameter in the first module.

Rather than use the posterior (1), in possibly misspecified models it has been argued that we can cut the link between the modules to produce more reliable inferences for $\theta_{1}$. This is the idea of cutting feedback; see Figure 1 for a graphical depiction of this “cutting” mechanism where, for simplicity, we assume $\pi(\theta_{2}\mid\theta_{1})$ does not depend on $\theta_{1}$. In the case of the two-module system, the cut (indicated by the vertical dotted line in Figure 1) severs the feedback between the modules and allows us to carry out inference for $\theta_{1}$ based on module one, using the likelihood $f_{1}(\cdot\mid\theta_{1})$; inference for $\theta_{2}$ can then be carried out conditional on $\theta_{1}$. In the case of Bayesian inference, this philosophy has led researchers to conduct inference using the cut posterior distribution (see Plummer, 2015; Jacob et al., 2017).

Figure 1: Graphical structure of the two-module system. The dashed line indicates the cut. Shaded nodes are observed quantities.

As shown in Carmona and Nicholls (2020) and Nicholls et al. (2022), the cut posterior is a “generalized” posterior distribution (see, e.g., Bissiri et al., 2016) that restricts the information flow to guard against model misspecification (Frazier and Nott, 2024). In the canonical two-module system, the cut posterior takes the form

\[ \pi_{\mathrm{cut}}(\theta\mid z_{1:n}):=\pi_{\mathrm{cut}}(\theta_{1}\mid x_{1:n})\,\pi(\theta_{2}\mid y_{1:n},\theta_{1}),\qquad \pi_{\mathrm{cut}}(\theta_{1}\mid x_{1:n})\propto f_{1}(x_{1:n}\mid\theta_{1})\,\pi(\theta_{1}). \]

The common argument given for the use of $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ instead of $\pi(\theta\mid z_{1:n})$ is the assumption that misspecification adversely impacts inferences for $\theta_{1}$; see, e.g., Liu et al. (2009) and Jacob et al. (2017). The cut posterior uses only information from the data $x_{1:n}$ in making inference about $\theta_{1}$, ensuring inference is insensitive to misspecification of the model for $y_{1:n}$. However, uncertainty about $\theta_{1}$ can still be propagated through for inference on $\theta_{2}$ via the conditional posterior $\pi(\theta_{2}\mid\theta_{1},y_{1:n})$.
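To make the two-stage structure concrete, the following is a minimal sketch of cut-posterior sampling, assuming user-supplied samplers for the two stages; the helper names `sample_theta1_given_x` and `sample_theta2_given_y_theta1` are hypothetical stand-ins for whatever exact or MCMC samplers a given model admits.

```python
# Minimal sketch of two-stage cut-posterior sampling (hypothetical helper names).
# Stage 1 uses ONLY the first module's data x, so misspecification of module 2
# cannot feed back into inference for theta_1.
import numpy as np

def sample_cut_posterior(x, y, sample_theta1_given_x,
                         sample_theta2_given_y_theta1,
                         n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        theta1 = sample_theta1_given_x(x, rng)                 # targets pi_cut(theta1 | x)
        theta2 = sample_theta2_given_y_theta1(y, theta1, rng)  # targets pi(theta2 | y, theta1)
        draws.append((theta1, theta2))
    return draws
```

Because the second stage conditions on each first-stage draw, uncertainty about $\theta_{1}$ is propagated into inference for $\theta_{2}$ without any feedback in the other direction.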

Motivating example: HPV prevalence and cervical cancer incidence

We now discuss a simple example described in Plummer (2015) that illustrates some of the benefits of cut model inference. The example, which is discussed further in the supplementary material, is based on data from a real epidemiological study (Maucort-Boulch et al., 2008). Of interest is the international correlation between high-risk human papillomavirus (HPV) prevalence and cervical cancer incidence, for women in a certain age group. There are two data sources. The first is survey data on HPV prevalence for 13 countries. There are $X_{i}$ women with high-risk HPV in a sample of size $n_{i}$ for country $i$, $i=1,\dots,13$. There is also data on cervical cancer incidence, with $Y_{i}$ cases in country $i$ in $T_{i}$ woman-years of follow-up. The data are modelled as

\[ X_{i}\sim\mathrm{Binomial}(n_{i},\theta_{1,i}),\qquad i=1,\dots,13, \]
\[ Y_{i}\sim\mathrm{Poisson}(\lambda_{i}),\qquad \log\lambda_{i}=\log T_{i}+\theta_{2,1}+\theta_{2,2}\theta_{1,i}. \]

The prior for $\theta_{1}=(\theta_{1,1},\dots,\theta_{1,13})^{\top}$ assumes independent components with uniform marginals on $[0,1]$. The prior for $\theta_{2}=(\theta_{2,1},\theta_{2,2})^{\top}$ assumes independent normal components, $N(0,1000)$.

Module 1 consists of $\pi(\theta_{1})$ and $f(X\mid\theta_{1})$ (survey data module) and module 2 consists of $\pi(\theta_{2})$ and $f(Y\mid\theta_{2},\theta_{1})$ (cancer incidence module). The Poisson regression model in the second module is misspecified. Because of the coupling of the survey and cancer incidence modules, with the HPV prevalence values $\theta_{1,i}$ appearing as covariates in the Poisson regression for cancer incidence, the cancer incidence module contributes misleading information about the HPV prevalence parameters. The cut posterior estimates these parameters based on the survey data only, preventing contamination of the estimates by the misspecified module. This in turn results in more interpretable estimates of the parameter $\theta_{2}$ in the misspecified module, since $\theta_{2}$ summarizes the relationship between HPV prevalence and cancer incidence, but the summary produced can only be useful when the inputs to the regression (i.e., the HPV prevalence covariate values) are properly estimated.
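Since the first module is conjugate (uniform prior, binomial likelihood), stage one of the cut posterior can be sampled exactly; below is a minimal sketch, with synthetic placeholder data standing in for the actual study counts.

```python
# Stage one of the cut posterior in the HPV example: with a uniform prior,
# theta_{1,i} | X_i ~ Beta(X_i + 1, n_i - X_i + 1) exactly, country by country.
# The data below are synthetic placeholders, not the real survey counts.
import numpy as np

rng = np.random.default_rng(0)
theta_true = rng.uniform(0.02, 0.25, size=13)   # hypothetical HPV prevalences
n = rng.integers(50, 500, size=13)              # hypothetical survey sample sizes
X = rng.binomial(n, theta_true)                 # women with high-risk HPV

n_draws = 5000
theta1_draws = rng.beta(X + 1, n - X + 1, size=(n_draws, 13))
# Each row of theta1_draws would then feed stage two: an MCMC update targeting
# pi(theta2 | Y, theta1) under the Poisson regression, with no feedback to theta1.
```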

Figure 2 shows the marginal posterior distributions of $\theta_{2}$ under the full and cut posteriors, which are very different, illustrating how misspecification of the cancer incidence module distorts inference about HPV prevalence in the full posterior, resulting in uninterpretable estimation of $\theta_{2}$. Further discussion of this example in the supplement shows the benefits of a semi-modular inference approach in which the tuning parameter $\omega$ interpolating between the cut and full posteriors can be chosen based on a user-defined loss function reflecting the purpose of the analysis. A more complex real example is considered in Section 5.


Figure 2: Marginal full and cut posterior distributions for $\theta_{2}$ in the HPV example.

2.2 The Impact of Misspecification in Modular Inference

In the remainder we consider a generalization of the canonical two-module system by assuming that the joint likelihood takes the form

\[ f(z_{1:n}\mid\theta)=f_{1}(z_{1:n}\mid\theta_{1})\,f_{2}(z_{1:n}\mid\theta_{1},\theta_{2}), \]

where the individual models may or may not depend on the entire dataset $z_{1:n}$. In the expression above, $f_{1}(z_{1:n}\mid\theta_{1})$ and $f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ need not be densities as functions of $z_{1:n}$ so long as $f(z_{1:n}\mid\theta)$ itself is a density; the terms $f_{1}$ and $f_{2}$ describe a decomposition of the likelihood having the dependence indicated by the arguments. The goal of modular Bayesian inference is to conduct inference on $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$ in the specific setting where inference for $\theta_{1}$ based only on $f_{1}(z_{1:n}\mid\theta_{1})$ is reasonable, but where the module $f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ contains additional useful information on $\theta_{1}$; see the regularity conditions in Appendix C.1 for specific details. Modular inference on $\theta$ can be carried out using the cut posterior

\[ \pi_{\mathrm{cut}}(\theta\mid z_{1:n}):=\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\pi(\theta_{2}\mid z_{1:n},\theta_{1}),\qquad \pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\propto f_{1}(z_{1:n}\mid\theta_{1})\,\pi(\theta_{1}). \qquad (2) \]

The cut posterior is beneficial when misspecification of $f_{2}(\cdot\mid\theta_{1},\theta_{2})$ makes full Bayesian inference on $\theta_{1}$ less accurate than Bayesian inference using only $f_{1}(\cdot\mid\theta_{1})$. The benefits of cutting feedback can be formalized by assuming that misspecification is limited to $f_{2}(\cdot\mid\theta_{1},\theta_{2})$ and that the true data generating process (DGP) has a density of the form (strictly speaking, the analysis extends beyond density functions, but we maintain the use of densities to simplify the discussion)

\[ h(z\mid\theta_{1,0},\delta_{0})=f_{1}(z\mid\theta_{1,0})\,\delta_{0}(z), \]

for some unknown component $\delta_{0}(z)$ that controls the amount of model misspecification. The form of $h(z\mid\theta_{1,0},\delta_{0})$ allows us to capture both gross model misspecification and situations where misspecification is ambiguous. We first restrict our attention to gross model misspecification by imposing the following assumption.

Assumption 1 (Gross Misspecification).

The observed data $\{z_{i}:i\geq 1,n\geq 1\}$ is independent and identically distributed according to $h(z\mid\theta_{1,0},\delta_{0})=f_{1}(z\mid\theta_{1,0})\delta_{0}(z)$. For some $\widetilde{\mathcal{Z}}\subseteq\mathcal{Z}$, with $\int_{\widetilde{\mathcal{Z}}}h(z\mid\theta_{1,0},\delta_{0})\mathrm{d}z>0$, and all $z\in\widetilde{\mathcal{Z}}$, $\delta_{0}(z)\neq f_{2}(z\mid\theta_{1,0},\theta_{2})$ for any $\theta_{2}\in\Theta_{2}$. There exists $\theta_{\star}\in\mathrm{Int}(\Theta)$ that minimizes $\theta\mapsto\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$, the Kullback-Leibler divergence of $f(z\mid\theta)$ from $h(z\mid\theta_{1,0},\delta_{0})$, over $\Theta$.

Under Assumption 1 and regularity conditions, the cut posterior $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ can be shown to deliver accurate inferences for $\theta_{1,0}$, while the full posterior $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})$ does not. To state this result, denote the cut posterior distribution for $\theta_{1}$ as $\Pi_{\mathrm{cut}}(\theta_{1}\in\cdot\mid z_{1:n})$, and let $\Pi_{\mathrm{full}}(\theta_{1}\in\cdot\mid z_{1:n})$ denote the full posterior distribution for $\theta_{1}$. For any $\varepsilon>0$, define $\Theta_{1}(\varepsilon):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,0}\|\leq\varepsilon\}$. For $P_{0}^{(n)}$ the true data generating process, we say that $X_{n}\xrightarrow{P}a$ (or $X_{n}=a+o_{p}(1)$) if for all $\varepsilon>0$, $P_{0}^{(n)}(\|X_{n}-a\|\geq\varepsilon)=o(1)$ as $n\rightarrow\infty$.

Lemma 1.

If Assumption 1 and the regularity conditions in Appendix B are satisfied, then for any $\varepsilon>0$, $\Pi_{\mathrm{cut}}\{\theta_{1}\in\Theta_{1}(\varepsilon)\mid z_{1:n}\}\xrightarrow{P}1$. For $\varepsilon>0$ such that $\theta_{1,\star}\notin\Theta_{1}(\varepsilon)$, $\Pi_{\mathrm{full}}\{\theta_{1}\in\Theta_{1}(\varepsilon)\mid z_{1:n}\}\xrightarrow{P}0$.

Assumption 1 embodies cases where a practitioner can confidently determine that the second module is misspecified. Lemma 1 then shows that, in these cases, the cut posterior concentrates on the true value $\theta_{1,0}$, while the full posterior does not. By restricting the flow of information across modules, cutting feedback produces inferences for $\theta_{1,0}$ that are more accurate than full Bayesian inference. Although Lemma 1 implies that cut posterior inference is more accurate than full posterior inference when the second module is grossly misspecified, it is of limited use when $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ and $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})$ are similarly located, which can occur when misspecification of the second module is not severe. In such cases, the full posterior will have less uncertainty than the cut posterior, while also being similarly located, making it unclear whether to prefer the cut or full posterior.

To capture the empirically relevant setting where, for any fixed $n$, it is unclear which posterior to use, we investigate the behavior of these posteriors under a certain class of locally misspecified DGPs. Following the literature on robust asymptotic statistical analysis (see, e.g., Ch. 3 of Rieder, 2012), we approximate the empirical situation where neither method is clearly preferable using a local perturbation about the assumed model:

\[ h(z\mid\theta_{0},\delta_{n}):=f_{1}(z\mid\theta_{1,0})\{1+\psi^{\top}\zeta(z)/\sqrt{n}\}\,f_{2}(z\mid\theta_{0}),\qquad \delta_{n}=\psi^{\top}\zeta/\sqrt{n}, \qquad (3) \]

which depends on a (random) direction of misspecification $\zeta\in\mathbb{R}^{d}$, and a magnitude $\psi$ that takes values in $\Delta\subset\mathbb{R}^{d}$. Under the local perturbation framework in (3), the cut and full posteriors have similar locations in $\Theta$ for small-to-moderate sample sizes, and conventional specification testing methods cannot consistently detect which method delivers more accurate inferences for $\theta_{0}$. As such, this class of perturbations allows us to compare cut and full posterior inference when correct specification of the second module is ambiguous.
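For intuition, data from the locally perturbed density in (3) can be simulated by rejection, provided one can draw from the assumed model and $|\psi^{\top}\zeta(z)|$ is bounded; the sketch below works under those assumptions, with hypothetical helper names.

```python
# Rejection sampler for h(z | theta_0, delta_n) proportional to
# f(z | theta_0) * {1 + psi' zeta(z) / sqrt(n)}, assuming |psi' zeta(z)| <= B
# (and n >= B**2, so the perturbation factor stays nonnegative).
import numpy as np

def sample_perturbed(sample_f, psi_zeta, B, n, size, rng):
    """sample_f(rng): one draw from the assumed model f1*f2;
    psi_zeta(z): the scalar psi' zeta(z)."""
    c = 1.0 + B / np.sqrt(n)                 # envelope constant: target <= c * f
    out = []
    while len(out) < size:
        z = sample_f(rng)                    # proposal from the assumed model
        if rng.random() < (1.0 + psi_zeta(z) / np.sqrt(n)) / c:
            out.append(z)
    return out
```

As $n$ grows, the acceptance ratio tends to one, reflecting that the perturbation, and hence the ambiguity about misspecification, shrinks at the $1/\sqrt{n}$ rate.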

Assumption 2 (Local Misspecification).

The triangular array $\{z_{i,n}:1\leq i\leq n,\,n\geq 1\}$ is independent and identically distributed according to $h(z\mid\theta_{0},\delta_{n})$ in (3) for fixed $n$. For $\psi\in\Delta\subset\mathbb{R}^{d}$, $\Delta$ compact, $\zeta(z)$ satisfies: (i) $\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\zeta(z)^{\top}\psi\, f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\,\mathrm{d}\mu(z)=0$; (ii) for $\eta=(\eta_{1}^{\top},\eta_{2}^{\top})^{\top}$ partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$, $\eta=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\zeta(z)^{\top}\psi\, f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\,\mathrm{d}\mu(z)$, and $0\leq\|\eta\|<\infty$.

Remark 1.

Assumption 2 is a device that allows us to rigorously compare the cut and full posteriors when neither is clearly preferable. The local misspecification framework ensures that, for any $n$, there remains some ambiguity about model misspecification, and thus about which posterior to prefer. It provides an appropriate theoretical framework for analyzing cut and full posteriors when the choice between them is unclear. Assumption 2 further maintains that the direction of misspecification, $\zeta(z)$, does not adversely affect the location of the cut posterior for $\theta_{1}$ but will impact the full posterior.

Remark 2.

Assumption 2 resembles, but is distinct from, the misspecification device employed in Hjort and Claeskens (2003) and Claeskens and Hjort (2003) to construct methods that combine and choose between different frequentist point estimators. The approach outlined in Section 8 of Claeskens and Hjort (2003) is not appropriate here since their framework can produce cut posterior inferences for $\theta_{1,0}$ that are less accurate than the full posterior, which contradicts the underlying reasoning for using cut posteriors. The misspecification framework in Assumption 2 also differs from the designs in Pompe and Jacob (2021) and Frazier and Nott (2024), which are similar to Assumption 1 and ensure that, with probability converging to one, the researcher knows the model is misspecified.

Under Assumption 2, cut posterior inference for $\theta_{1}$ is not impacted by misspecification of the second module. To show this, we require further notation. First, note that

\[ \ell(\theta)=\log f(z_{1:n}\mid\theta)=\log f_{1}(z_{1:n}\mid\theta_{1})+\log f_{2}(z_{1:n}\mid\theta)\equiv\ell_{p}(\theta_{1})+\ell_{c}(\theta_{1},\theta_{2}), \]

where $\ell_{p}(\theta_{1})$ denotes the partial log-likelihood for $\theta_{1}$ used in the cut posterior, and $\ell_{c}(\theta)$ denotes the log-likelihood used in the conditional posterior. Denote the derivative of the full log-likelihood by $\dot{\ell}(\theta):=\partial\ell(\theta)/\partial\theta$, and the second derivative by $\ddot{\ell}(\theta):=\partial^{2}\ell(\theta)/\partial\theta\partial\theta^{\top}$. For $j,k\in\{1,2\}$, define the first and second partial derivatives as $\dot{\ell}_{(j)}(\theta):=\partial\ell(\theta)/\partial\theta_{j}$ and $\ddot{\ell}_{(ij)}(\theta):=\partial^{2}\ell(\theta)/\partial\theta_{j}\partial\theta_{i}^{\top}$. Similar notation will be used to denote derivatives of $\ell_{u}(\theta)$, for $u\in\{c,p\}$. Let $\mathbb{E}_{n}[g(z)]$ denote the expectation of $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$ under $h(z\mid\theta_{0},\delta_{n})$ in Assumption 2, and define the following information matrices: $\mathcal{I}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}(\theta_{0})]$, $\mathcal{I}_{jk}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(jk)}(\theta_{0})]$, and $\mathcal{I}_{p(11)}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{p}(\theta_{1,0})]$. Let $\overline{\theta}_{\mathrm{cut}}:=\int_{\Theta}\theta\,\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\mathrm{d}\theta$, $\overline{\theta}_{\mathrm{full}}:=\int_{\Theta}\theta\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta$, and let $\Rightarrow$ denote weak convergence under $P^{(n)}_{0}$.

Lemma 2.

Under Assumption 2, and the regularity conditions in Appendix C.1:

(i) $\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1})$.

(ii) $\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})\Rightarrow N(0,\mathcal{I}^{-1}_{p(11)})$.

(iii) $\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0})\Rightarrow N(\mathcal{I}_{22}^{-1}\eta_{2},\,\mathcal{I}^{-1}_{22}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})$.

(iv) Only credible sets calculated under $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ have valid frequentist coverage.

(v) The squared bias for $\theta_{1,0}$ and $\theta_{2,0}$ under $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ is smaller than that under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$.

Remark 3.

Lemma 2 shows that inferences for $\theta_{1}$ using the cut posterior have no asymptotic bias, whereas the full posterior for $\theta_{1}$ has asymptotic bias, and the two posteriors have different biases for $\theta_{2}$. The presence of asymptotic bias implies that, if $\eta\neq 0$, only credible sets for $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ correctly quantify uncertainty, i.e., only $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ has calibrated credible sets. Since the bias due to misspecification, $\eta$, is unknown, it does not seem feasible to determine whether $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ or $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ more reliably quantifies uncertainty in general, except when $\eta=0$, where both methods accurately quantify uncertainty.

Lemma 2 implies that the user is faced with a trade-off between conducting inference using the cut or full posterior: inferences under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ have the smallest variability possible (via Cramér-Rao) but exhibit a bias of unknown magnitude, whereas inferences under $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$ are guaranteed to have smaller bias than those under $\pi_{\mathrm{full}}(\theta\mid z_{1:n})$ but have (weakly) larger variability. Lemma 2 is the first result to formally show that a bias-variance trade-off exists between the cut and full posteriors. Since the bias due to misspecification ($\eta$) is unobservable, Lemma 2 exemplifies the situation where it is ambiguous whether we should base inferences on the posterior that exhibits low bias (the cut posterior) or on the posterior that exhibits larger bias but much smaller variability (the full posterior).

3 Semi-Modular Inference

Lemma 2 suggests that if we consider the accuracy of posteriors using a criterion that measures both the bias and variance of posterior inference, it should be possible to combine the cut and full posteriors to deliver inferences that are more accurate than either alone. The goal of semi-modular inference (SMI) is to interpolate between the cut and full posteriors to reduce the uncertainty about $\theta$ while maintaining a tolerable level of bias. However, there are many ways to interpolate between two probability distributions, and there is no a priori reason to suspect one method of interpolation will deliver better results than others; see Nicholls et al. (2022) for a discussion of some of the possibilities.

Following Chakraborty et al. (2022) and Yu et al. (2023), we focus on conducting SMI using linear opinion pools (Stone, 1961), which produces a semi-modular posterior (SMP) that is a convex combination of the cut and full posteriors: for $\omega\in[0,1]$,

\[ \pi_{\omega}(\theta\mid z_{1:n}):=\pi_{\omega}(\theta_{1}\mid z_{1:n})\,\pi(\theta_{2}\mid z_{1:n},\theta_{1}), \qquad \pi_{\omega}(\theta_{1}\mid z_{1:n}):=(1-\omega)\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})+\omega\,\pi(\theta_{1}\mid z_{1:n}). \qquad (4) \]

The pooling weight $\omega\in[0,1]$ in the SMP determines the level of interpolation between the posteriors. Chakraborty et al. (2022) suggest choosing $\omega$ using prior-predictive conflict checks, while Carmona and Nicholls (2022) propose out-of-sample predictive methods to select $\omega$. In contrast, we propose a novel choice of pooling weight that leverages the bias-variance trade-off between cut and full posterior inferences.
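Given draws of $\theta_{1}$ from the cut and full posteriors, sampling from the linear pool in (4) only requires Bernoulli($\omega$) component labels; here is a minimal sketch (the array shapes are assumptions), with $\theta_{2}$ then drawn from the shared conditional $\pi(\theta_{2}\mid z_{1:n},\theta_{1})$ in either case.

```python
# Draw theta_1 from the SMP mixture in (4): with probability omega take a
# full-posterior draw, otherwise a cut-posterior draw.
import numpy as np

def smp_theta1_draws(theta1_cut, theta1_full, omega, rng):
    """theta1_cut, theta1_full: (n_draws, d1) arrays of posterior samples."""
    m = min(len(theta1_cut), len(theta1_full))
    use_full = rng.random(m) < omega          # Bernoulli(omega) component labels
    return np.where(use_full[:, None], theta1_full[:m], theta1_cut[:m])
```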

3.1 Shrinkage-based semi-modular posteriors

To build intuition as to how $\pi_{\omega}(\theta\mid z_{1:n})$ can utilize the bias-variance trade-off between $\pi_{\mathrm{cut}}$ and $\pi_{\mathrm{full}}$, we first focus on the behavior of the SMP for $\theta_{1}$, i.e., $\pi_{\omega}(\theta_{1}\mid z_{1:n})$ in (4), and analyze the behavior of $\pi_{\omega}(\theta\mid z_{1:n})$ in subsequent sections. (Differences between the cut, full, and SMP posteriors for $\theta_{2}$ are attributable to differences in the posterior for $\theta_{1}\mid z_{1:n}$, since each posterior shares the same conditional posterior for $\theta_{2}$ given $\theta_{1}$.) Focusing on $\pi_{\omega}(\theta_{1}\mid z_{1:n})$ in (4), note that the SMP point estimator for $\theta_{1}$ is

\[ \overline{\theta}_{1}(\omega):=(1-\omega)\,\overline{\theta}_{1,\mathrm{cut}}+\omega\,\overline{\theta}_{1,\mathrm{full}}\equiv\overline{\theta}_{1,\mathrm{cut}}-\omega(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}}), \qquad (5) \]

where $\overline{\theta}_{1,\mathrm{cut}}:=\int_{\Theta_{1}}\theta_{1}\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}$ and $\overline{\theta}_{1,\mathrm{full}}:=\int_{\Theta}\theta_{1}\,\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta$. From Lemma 2, the asymptotic mean of $\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})$ is zero, while that of $\sqrt{n}(\overline{\theta}_{1,\mathrm{full}}-\theta_{1,0})$ depends on the misspecification bias $\eta$. Hence, the SMP posterior mean $\overline{\theta}_{1}(\omega)$ combines an asymptotically unbiased estimator with high variance, $\overline{\theta}_{1,\mathrm{cut}}$, and an asymptotically biased estimator with small variance, $\overline{\theta}_{1,\mathrm{full}}$. Therefore, if we are willing to tolerate some bias, it is possible to choose $\omega$ to deliver inferences on $\theta_{1,0}$ that are more accurate than either the cut or full posterior alone, at least so long as our measure of “accuracy” accounts for both bias and variance.

The idea of combining biased and unbiased estimators has a long history in statistics, with the most commonly encountered estimators of this kind being shrinkage and James-Stein estimators; see Chapter 5 of Lehmann and Casella (2006) for a textbook discussion. Indeed, the form of $\overline{\theta}_{1}(\omega)$ in (5) is reminiscent of estimators encountered in the shrinkage literature. For two normally distributed estimators, one biased and the other unbiased, Green and Strawderman (1991) demonstrated how to optimally combine such estimators to deliver a shrinkage estimator that is optimal in terms of expected squared error loss. The approach of Green and Strawderman (1991) was extended to more general settings and loss functions by Kim and White (2001) and Judge and Mittelhammer (2004).

While $\overline{\theta}_{1,\mathrm{cut}}$ and $\overline{\theta}_{1,\mathrm{full}}$ are not normally distributed, they are asymptotically normal, and so one could choose $\omega$ in the SMP using existing shrinkage estimation proposals. Following ideas similar to Green and Strawderman (1991) and Kim and White (2001), we could choose $\omega$ in (4) as

\[ \widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}\},\qquad \widetilde{\omega}=\frac{\gamma}{n(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}})^{\top}\Upsilon(\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}})}, \qquad (6) \]

for some $\gamma>0$ and $\Upsilon$ a positive definite $(d_{1}\times d_{1})$-matrix. Since $\overline{\theta}_{1,\mathrm{full}}$ is asymptotically biased while $\overline{\theta}_{1,\mathrm{cut}}$ is not, using $\widetilde{\omega}_{+}$ within the SMP allows us to interpret the SMP as shrinking cut posterior inferences towards those of the biased full posterior, and so using $\widetilde{\omega}_{+}$ delivers a type of “shrinkage” SMP (S-SMP).

Semi-modular Bayesian inference is clearly much more general than the linear Gaussian models analyzed in Green and Strawderman (1991), Kim and White (2001), and Judge and Mittelhammer (2004). Nevertheless, applying shrinkage estimation ideas within semi-modular Bayesian inference should allow us to effectively combine the cut and full posteriors. In the following sections, we show that this is indeed the case: empirically and theoretically, SMPs based on weights similar to (6) deliver inferences that can be shown to be optimal according to a general notion of asymptotic risk. Before presenting any formal analysis, we first illustrate the behavior of the S-SMP based on $\widetilde{\omega}_{+}$ empirically in a simple example from the cutting feedback literature.
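For concreteness, a weight of the form (6) can be computed directly from posterior draws; the sketch below takes $\Upsilon=I$ and sets $\gamma$ via the trace of the posterior variance gap, matching the empirical version used in the next subsection. This is one choice among many, not a definitive recommendation.

```python
# Shrinkage pooling weight computed from posterior draws, with Upsilon = I and
# gamma equal to the trace of the posterior covariance gap.
import numpy as np

def shrinkage_weight(theta1_cut, theta1_full):
    """theta1_cut, theta1_full: (n_draws, d1) arrays of posterior draws."""
    cov_cut = np.atleast_2d(np.cov(theta1_cut, rowvar=False))
    cov_full = np.atleast_2d(np.cov(theta1_full, rowvar=False))
    var_gap = np.trace(cov_cut) - np.trace(cov_full)
    if var_gap <= 0:            # cut posterior no less precise: keep omega = 0
        return 0.0
    gap = theta1_cut.mean(0) - theta1_full.mean(0)
    return min(1.0, var_gap / float(gap @ gap))
```

Note that the posterior covariances already carry the $1/n$ scaling, so no explicit factor of $n$ appears in the ratio.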

3.2 Example: Biased Mean

To demonstrate the benefits of the SMP, we consider a minor modification of the biased mean example in Section 2.1 of Liu et al. (2009). We observe two datasets generated from independent random variables with the same unknown mean $\theta_{1}$, but where the assumed model for the second dataset is incorrect, resulting in biased estimation of the parameter of interest. The first dataset corresponds to $n_{1}$ observations on a $(d_{1}\times 1)$-dimensional, $d_{1}\geq 1$, random vector that we assume is generated from the model:

\[ y_{1,i}=\theta_{1}+\epsilon_{1,i}, \]

where $\theta_{1}=(\theta_{1,1},\dots,\theta_{1,d_{1}})^{\top}$ and $\epsilon_{1,i}$, $i=1,\dots,n_{1}$, are iid $N(0,I_{d_{1}})$. However, we also observe an additional dataset, comprising $n_{2}>n_{1}$ observations, which is assumed to be from the model:

\[ y_{2,i}=\theta_{1}+\sigma\epsilon_{2,i}, \]

where $\epsilon_{2,i}$, $i=1,\dots,n_{2}$, are iid $N(0,I_{d_{1}})$ for unknown $\sigma>0$. The prior density for $\theta_{1}$ is $N(0,I_{d_{1}})$, and for $\theta_{2}:=\sigma^{2}$ it is $\pi(\sigma^{2})\propto 1/\sigma^{2}$. The parameter of interest is $\theta_{1}$, and we wish to determine how much the second dataset should influence inference about $\theta_{1}$ when its assumed model is incorrect, leading to biased estimation of $\theta_{1}$. (The original example of Liu et al. (2009) is such that, for even moderate values of the bias, we always prefer the cut posterior. Given this, we have slightly modified the original example to ensure a meaningful trade-off exists between the cut and full posteriors; without this modification, the S-SMP simply returns the cut posterior in the vast majority of cases.)

Suppose that in the actual data-generating process $\epsilon_{2,i}$ is not $N(0,I_{d_{1}})$, but

\[ \epsilon_{2,i}\overset{iid}{\sim}\begin{cases} N(-0.25\,\iota_{d_{1}},\,0.10\,I_{d_{1}}) & \text{with prob. }\delta\\ N(0,\,0.5\,I_{d_{1}}) & \text{with prob. }1-\delta \end{cases} \]

for $i=1,\dots,n_{2}$, where $\iota_{d_{1}}$ denotes a $d_{1}\times 1$ vector of ones. When $\delta=0$, this reduces to the assumed model, but when $\delta>0$ we obtain biased estimation of $\theta_{1}$ under the assumed model.

For our experiments, we assume $\sigma^{2}=1/2$ and use an equally spaced grid of values for the contamination $\delta\in\{0:0.05:0.9\}$. For each value in this grid, we generate 1000 replications from the above process with $n_{1}=100$ and $n_{2}=1000$, and consider two different values of $d_{1}\in\{1,5\}$. For each dataset, the S-SMP is based on the following version of the weights in (6): for $\widehat{\mathcal{I}}_{p(11)}^{-1}=\mathrm{Cov}_{\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})}(\theta_{1})$ and $\widehat{\mathcal{I}}_{11.2}^{-1}=\mathrm{Cov}_{\pi(\theta_{1}\mid z_{1:n})}(\theta_{1})$,

\[ \widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}\},\qquad \widetilde{\omega}=\frac{\mathrm{tr}[\widehat{\mathcal{I}}_{p(11)}^{-1}-\widehat{\mathcal{I}}_{11.2}^{-1}]}{\|\overline{\theta}_{1,\mathrm{cut}}-\overline{\theta}_{1,\mathrm{full}}\|^{2}}\,\mathbb{I}\big\{\mathrm{tr}[\widehat{\mathcal{I}}_{p(11)}^{-1}-\widehat{\mathcal{I}}_{11.2}^{-1}]>0\big\}. \]

To compare the impact of misspecification on the different modular posteriors, we plot a Monte Carlo estimate of the expected squared error of the point estimators, based on 1000 replicated samples, across the grid of values for $\delta$. The results are presented in Figure 3 and show that for relatively small levels of contamination, the full posterior has lower expected squared error than the cut posterior, due to its much smaller variance, while at higher levels of contamination the cut posterior has lower expected squared error. However, the expected squared error of the S-SMP is always lower than that of the cut posterior, which demonstrates that the SMP is able to trade off between the two posteriors to reduce squared error risk across all levels of contamination. We note that when $d_{1}=1$ and $\delta=0.90$, the S-SMP and cut posterior give very similar results. (Appendix D in the supplementary material contains additional experiments for this example. These results show that the S-SMP delivers reasonable results for all choices of $d_{1}$ and becomes more accurate as the dimension $d_{1}$ increases.)
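A self-contained sketch of this experiment is given below, under the simplifying assumption that $\sigma^{2}=1/2$ is known, so that both posteriors are Gaussian in closed form (the paper instead places a prior on $\sigma^{2}$); it is illustrative code, not the authors' implementation, so the numbers will differ somewhat from Figure 3.

```python
# Simplified biased-mean experiment: conjugate cut and full posteriors under a
# N(0, I) prior on theta_1, treating sigma^2 as known (an assumption made here
# to keep the posteriors in closed form).
import numpy as np

rng = np.random.default_rng(1)
d1, n1, n2, sigma2, delta = 5, 100, 1000, 0.5, 0.3
theta1_true = np.zeros(d1)

def one_replication():
    y1 = theta1_true + rng.standard_normal((n1, d1))
    # Contaminated second dataset: shifted component w.p. delta, inflated otherwise.
    contam = rng.random(n2) < delta
    eps2 = np.where(contam[:, None],
                    -0.25 + np.sqrt(0.10) * rng.standard_normal((n2, d1)),
                    np.sqrt(0.5) * rng.standard_normal((n2, d1)))
    y2 = theta1_true + np.sqrt(sigma2) * eps2
    # Posterior precisions and means under the conjugate normal model.
    prec_cut, prec_full = n1 + 1.0, n1 + n2 / sigma2 + 1.0
    mean_cut = n1 * y1.mean(0) / prec_cut
    mean_full = (n1 * y1.mean(0) + (n2 / sigma2) * y2.mean(0)) / prec_full
    # Shrinkage pooling weight: trace of covariance gap over squared mean gap.
    var_gap = d1 / prec_cut - d1 / prec_full
    gap = mean_cut - mean_full
    omega = min(1.0, var_gap / float(gap @ gap))
    mean_smp = (1 - omega) * mean_cut + omega * mean_full
    return [float(np.sum((m - theta1_true) ** 2))
            for m in (mean_cut, mean_full, mean_smp)]

errs = np.array([one_replication() for _ in range(1000)])
print("MC squared error, cut / full / S-SMP:", errs.mean(axis=0))
```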

Figure 3: Monte Carlo estimate of expected risk under squared error loss across different levels of contamination ($\delta$). Full refers to the expected risk for $\theta_{1,0}$ associated with the full posterior based on both datasets; Cut is the risk associated with the cut posterior; SMP is the risk for the proposed semi-modular posterior.

4 Measuring posterior accuracy through risk

The example in Section 3.2 suggests that the S-SMP can deliver more accurate inferences, in terms of expected squared error, than the cut or full posterior. More generally, this suggests that the S-SMP can deliver inferences that are accurate according to a criterion that measures both the bias and variance of posterior inferences. However, such a notion of accuracy is only one way to measure the accuracy of posterior inferences. In modular Bayesian inference, Jacob et al. (2017) and Pompe and Jacob (2021) have suggested choosing between full and cut posteriors using out-of-sample predictive accuracy, while Carmona and Nicholls (2020) have suggested a similar approach for calibrating an SMI tuning parameter. While such an approach is undoubtedly useful, the empirical analysis in Jacob et al. (2017), as well as the empirical and theoretical analysis in Pompe and Jacob (2021), suggests that the preferred method according to such criteria is example specific, with neither method likely to be generally preferable. Further, Lemma 2 demonstrates that cut and full posterior inferences for $\theta_{0}$ exhibit asymptotic bias, and it is not clear how notions of predictive accuracy account for the impact of this bias on the resulting inferences.

In contrast to predictive approaches, we propose to measure the accuracy of modular and semi-modular posteriors through an inferential criterion that accounts for both the bias and variance of the resulting inferences. Given that the SMP includes the cut ($\omega=0$) and full ($\omega=1$) posteriors as special cases, we evaluate the accuracy of different posteriors using the “posterior risk” associated with different $\omega$ values. This notion of risk has previously been used by Castillo (2014) and Lee and Lee (2018) to choose between different prior classes in Bayesian inference, and is capable of capturing both the bias and variance associated with posterior inferences.

Given a user-chosen loss function $q:\Theta\times\Theta\rightarrow\mathbb{R}_{+}$, at the point $\theta_{0}\in\Theta$ we can measure the loss associated with $\omega\in[0,1]$ via the posterior risk $\mathbb{E}_{\pi_{\omega}}\{q(\theta,\theta_{0})\mid z_{1:n}\}:=\int_{\Theta}q(\theta,\theta_{0})\pi_{\omega}(\theta\mid z_{1:n})\mathrm{d}\theta$. For $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$, recall that $\mathbb{E}_{n}[g(z)]=\int_{\mathcal{Z}}g(z)h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)$, with $h(z\mid\theta_{0},\delta_{n})$ as in Assumption 2. The trimmed posterior risk of $\pi_{\omega}$ at $\theta_{0}$ (hereafter referred to simply as P-risk) is defined as follows (trimming is necessary to ensure that the expectation exists, and can be disregarded in practical terms):

\[ \mathrm{R}_{q}(\pi_{\omega},\theta_{0}):=\lim_{\nu\rightarrow\infty}\liminf_{n\rightarrow\infty}\mathbb{E}_{n}\min\big\{\mathbb{E}_{\pi_{\omega}}\{n\,q(\theta,\theta_{0})\mid z_{1:n}\},\nu\big\}. \]

The loss function $q:\Theta\times\Theta\rightarrow\mathbb{R}_{+}$ satisfies the following assumptions.

Assumption 3.

For all $\theta,\delta\in\Theta$, the loss function $q(\delta,\theta)$ satisfies: i) $q(\delta,\theta)\geq 0$, and $q(\delta,\theta)=0$ if and only if $\theta=\delta$; ii) the loss is absolutely continuous and three times continuously differentiable in $\delta$; iii) for $\delta$ in a neighbourhood of $\theta_{0}$, the matrix $\Upsilon(\delta)=\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}$ is continuous and positive semi-definite; iv) for each $i=1,\dots,d$, and for all $x\in\mathbb{R}^{d}$, $\delta\in\Theta$, $|x^{\top}\{\partial\Upsilon(\delta)/\partial\delta_{i}\}x|\leq M\|x\|^{2}$, where $\delta_{i}$ denotes the $i$-th component of $\delta$, and $M>0$.

Remark 4.

Assumption 3 includes losses such as squared error loss $q(\delta,\theta)=\|\delta-\theta\|^{2}$, but also permits intrinsic measures of accuracy for densities. The Kullback-Leibler divergence,

\[ q(\theta,\theta_{0})=\int_{\mathcal{Z}}f(z\mid\theta_{0})\log\frac{f(z\mid\theta_{0})}{f_{1}(z\mid\theta_{1})f_{2}(z\mid\theta_{2},\theta_{1})}\,\mathrm{d}z, \]

and various scoring rules, such as kernel scores or mean-variance scores (see Gneiting and Raftery, 2007, Sections 4 and 5), will satisfy Assumption 3. Assumption 3 does exclude discontinuous functions, such as those needed to measure the accuracy of quantiles. Extending our results to the case of discontinuous losses is the focus of subsequent work by the authors.

P-risk is related to expected asymptotic risk, which is often used to gauge the accuracy of frequentist point estimators. Since $\mathrm{R}_{q}(\pi_{\omega},\theta_{0})$ is calculated from a chosen posterior based on a decision made for $\omega$, we refer to this notion as P-risk to distinguish it from asymptotic risk. Risk has a long history in statistical analysis, and we refer to Chapter 6 of Lehmann and Casella (2006) and Chapter 8 of Van der Vaart (2000) for textbook treatments. The key benefit of using P-risk to compare different choices of $\omega$ is that, for the chosen loss $q(\cdot,\cdot)$, P-risk delivers a concrete ranking across inference procedures relative to this choice. Furthermore, our use of P-risk is not at odds with Bayesian inference, and has already been used by others, albeit in slightly different contexts. More generally, as argued by Lehmann and Casella (2006, p. 310): “The Bayesian paradigm is well suited for the construction of possible estimators, but is less well suited for their evaluation.” Consequently, we follow this suggestion and carry out inference via Bayesian methods but evaluate the accuracy of these methods using our notion of asymptotic risk (i.e., P-risk).
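In practice, the inner posterior expectation in the P-risk definition is easily estimated from posterior draws; a minimal sketch for squared error loss is given below (the outer expectation over datasets would be approximated by averaging this quantity over replications, as in the Monte Carlo experiments).

```python
# Plug-in estimate of the inner quantity in the P-risk definition,
# min{ E_pi[ n * q(theta, theta_0) | z ], nu }, for squared error loss q.
import numpy as np

def trimmed_posterior_risk(draws, theta0, n, nu=np.inf):
    """draws: (n_draws, d) posterior samples for a single dataset of size n."""
    inner = n * np.mean(np.sum((draws - theta0) ** 2, axis=1))
    return min(inner, nu)
```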

4.1 The P-risk of SMPs

As we saw in Sections 3.1 and 3.2, weights of the form in (6) deliver an SMP whose expected squared error was smaller than that of the cut posterior. Thus, in the remainder we focus our theoretical analysis on SMPs with pooling weights that generalize those in (6): for $\gamma_{n}$ a user-chosen sequence such that $\gamma_{n}\xrightarrow{P}\gamma$, $\gamma>0$, define the pooling weight

\[ \widehat{\omega}_{+}:=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{\gamma_{n}}{n(\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}})^{\top}\Upsilon(\overline{\theta}_{\mathrm{cut}})(\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}})}. \qquad (7) \]

In contrast to the weight $\widetilde{\omega}_{+}$ suggested in (6), the weight $\widehat{\omega}_{+}$ in (7) depends on the entire vector $\theta$ and on the curvature of the loss function, as captured by the matrix $\Upsilon(\theta)=\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}|_{\delta=\theta}$. Incorporating the curvature of the loss within the pooling weight is necessary for the SMP to deliver inferences that are accurate according to the chosen loss; see Theorem C.3 of Appendix C.3 for further discussion.

The pooling weight in (7) yields the shrinkage SMP (S-SMP):

\[ \pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n}):=\big\{(1-\widehat{\omega}_{+})\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})+\widehat{\omega}_{+}\,\pi(\theta_{1}\mid z_{1:n})\big\}\,\pi(\theta_{2}\mid\theta_{1},z_{1:n}). \qquad (8) \]

Different choices of $\gamma_{n}$ in (7) deliver different weights and ultimately different posteriors. However, the following result shows that across a range of choices for $\gamma_{n}$, the P-risk of $\pi_{\widehat{\omega}_{+}}$ in (8) is dominated by that of the cut posterior, and, under certain conditions, by that of the full posterior. To state this result simply, define the $d\times d$ matrix $\Upsilon:=\Upsilon(\theta_{0})$, and denote the $(d_{2}\times d_{2})$-block of $\Upsilon$ by $\Upsilon_{22}$ (see Assumption 3); let $\mathcal{M}$ be a $d\times d$ matrix with $(d_{1}\times d_{1})$-block $W:=(\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1})$, $(d_{2}\times d_{2})$-block $V:=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}$, and $(d_{1}\times d_{2})$-block $-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}$, where $\mathcal{I}_{11.2}:=\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}$. We remind the reader that $\mathcal{I}$, $\mathcal{I}_{jk}$ ($j,k\in\{1,2\}$) and $\mathcal{I}_{p(11)}$ are defined in Section 2.2.

Theorem 1.

Suppose Assumptions 2-3 and the regularity conditions in Appendix C.1 are satisfied. If $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ and $0<\gamma\leq 2(\mathrm{tr}\,\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)$, then:

(i) $\sup_{\eta:\|\eta\|<\infty}\{\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})\}\leq 0$;

(ii) $\sup_{\eta\in\mathcal{E}}\{\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})\}\leq 0$ for $\mathcal{E}:=\{\eta:\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}-\|\mathcal{I}_{22}^{-1}\eta_{2}\|^{2}_{\Upsilon_{22}}\geq\mathrm{tr}\,\Upsilon\mathcal{M}\}$.

Remark 5.

Under the class of misspecified DGPs in Assumption 2, and in terms of P-risk, Theorem 1 demonstrates that the cut posterior delivers a point estimator that is inadmissible, with a similar result also being true for the full posterior under additional conditions. Hence, from the standpoint of P-risk, the S-SMP is preferable to the cut posterior, and under certain conditions to the full posterior as well. However, it is unclear if the S-SMP has the lowest possible P-risk, i.e., if it is minimax. Answering this question requires deriving a local minimax efficiency bound for the class of DGPs in Assumption 2 (see, e.g., Chapter 8 of Van der Vaart, 2000 for a discussion of asymptotic minimax estimators), which is outside the scope of the current paper and is left for future research.

Remark 6.

As suggested by an anonymous referee, a Bayesian may also be interested in the uncertainty quantification of the S-SMP. As discussed in Remark 3 after Lemma 2, only credible sets based on $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$ deliver calibrated uncertainty quantification. Therefore, if $\widehat{\omega}_{+}>0$, the credible sets of the S-SMP will not be calibrated. More generally, the credible sets calculated from the cut posterior, full posterior and S-SMP all depend, in different ways, on the magnitude of the bias induced by misspecification. Since $\eta$ is unknown in practice, it is not obvious how one should theoretically compare the behavior of such credible sets. In light of this issue, we believe that measuring the accuracy of the posteriors using P-risk is the most direct approach.

Remark 7.

The condition $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ in Theorem 1 is related to the inefficiency that results from using the cut posterior relative to the full posterior. This condition is more likely to be satisfied when the efficiency gap between the cut and full posteriors is large, or when $d$, the dimension of $\theta$, is large. That being said, the condition $\mathrm{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$ is sufficient but not necessary, and the S-SMP may deliver smaller P-risk even when it is not satisfied.

Remark 8.

The example in Section 3.2 demonstrated that an S-SMP focused on inference for $\theta_{1}$ delivered smaller expected P-risk for $\theta_{1,0}$ than the cut posterior. However, Theorem 1 makes clear that the S-SMP can also deliver smaller P-risk for the full vector $\theta_{0}$. Returning to the example in Section 3.2, we now analyze the P-risk at $\theta_{0}$ for the S-SMP under the shrinkage weight

\[ \widehat{\omega}_{+}=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{\mathrm{tr}[\mathrm{Cov}_{\pi_{\mathrm{cut}}}(\theta)-\mathrm{Cov}_{\pi_{\mathrm{full}}}(\theta)]}{\|\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}}\|^{2}}\,\mathbb{I}\big\{\mathrm{tr}[\mathrm{Cov}_{\pi_{\mathrm{cut}}}(\theta)-\mathrm{Cov}_{\pi_{\mathrm{full}}}(\theta)]>0\big\}. \]

To this end, we repeat the Monte Carlo experiment in Section 3.2 under two different dimensions $d_{1}\in\{1,5\}$, so that $d\in\{2,6\}$, and present the results in Figure 4. These results show that the P-risk of the S-SMP is dominated by that of the cut posterior, and in certain cases by that of the full posterior; as with the example in Section 3.2, for $\delta=0.90$ and $d=2$, the S-SMP and cut posterior give very similar results. Theorem 1 is asymptotic, and given the sample sizes considered in this experiment it is not surprising that at large levels of contamination the S-SMP can perform slightly worse than the cut posterior (when $d=2$), since non-negligible weight is placed on the full posterior. However, at $\delta=0.90$ the median pooling weight across the replications is about $0.30$, so that most of the pooling weight corresponds to the cut posterior. As we shall see shortly, under higher levels of misspecification and as $n$ increases, the S-SMP resembles the cut posterior.


Figure 4: Monte Carlo estimate of expected risk for $\theta_{0}$ under squared error loss across different levels of contamination ($\delta$). Please see Section 3.2 and Figure 3 for further details.

Part one of Theorem 1 implies that if the cut posterior is inefficient relative to the full posterior, as measured by $\text{tr}\,\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|$, then the S-SMP will be at least as accurate (in P-risk) as the cut posterior, and potentially more accurate than the full posterior. (Theorem 1 applies even if $\eta=0$: when there is no misspecification bias the cut and full posterior means will be similar and the weight $\widehat{\omega}_{+}$ will be close to unity, so that the S-SMP will resemble the full posterior.) The second part of Theorem 1 gives a sufficient, but not necessary, condition which guarantees that the P-risk of the S-SMP is no larger than that of the full posterior. This condition is likely to be satisfied when the difference in posterior locations is larger than the difference in posterior variances.

When $d>2$ it is also possible to obtain an analytic expression for the P-risk of the S-SMP if we set $q(\theta,\theta_{0})=\frac{1}{2}(\theta-\theta_{0})^{\top}\mathcal{M}^{-1}(\theta-\theta_{0})$, so that $\Upsilon=\mathcal{M}^{-1}$. The requirement that $d>2$ is commonly encountered in the risk analysis of James-Stein estimators and is a consequence of the fact that $\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})$ can be viewed as performing a type of posterior shrinkage.

Theorem 2.

Suppose that Assumptions 2-3 and the regularity conditions in Appendix C.1 are satisfied. If $\Upsilon=\mathcal{M}^{-1}$, $d>2$, and $0<\gamma\leq 2(d-2)$, then for any finite $\eta$,

\[\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\frac{\gamma\{2(d-2)-\gamma\}(d-3)!}{2\,d!}\,{}_{1}\mathrm{F}_{1}(d-1;d;\lambda)\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}),\]

where ${}_{1}\mathrm{F}_{1}(k-1;k;\lambda)$ is the confluent hypergeometric function.

Theorem 2 is useful as it gives an exact bound on the P-risk, and an easily interpretable set of conditions on the value of $\gamma$ in the weight $\widehat{\omega}_{+}$. Under $\Upsilon=\mathcal{M}^{-1}$, the condition $d>2$ is necessary to guarantee that $\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})$ has smaller P-risk than $\pi_{\mathrm{cut}}(\theta\mid z_{1:n})$. This condition is related to Stein's phenomenon (see Ch. 6 of Lehmann and Casella, 2006 for a discussion) and implies that using the cut posterior by itself is sub-optimal (in terms of P-risk) when $d>2$. We stress that this interpretation is only valid when $\Upsilon=\mathcal{M}^{-1}$, and that a similar phenomenon does not necessarily extend to other loss functions.
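To get a feel for the size of the exact risk reduction in Theorem 2, the expression can be evaluated directly; a small sketch follows, using the reconstruction ${}_{1}\mathrm{F}_{1}(d-1;d;\lambda)$ above, with arbitrary illustrative values of $d$, $\gamma$ and $\lambda$.

```python
from math import factorial
from scipy.special import hyp1f1

def risk_reduction(d: int, gamma: float, lam: float) -> float:
    """gamma * {2(d-2) - gamma} * (d-3)! / (2 d!) * 1F1(d-1; d; lam)."""
    assert d > 2 and 0.0 < gamma <= 2 * (d - 2), "conditions of Theorem 2"
    coef = gamma * (2 * (d - 2) - gamma) * factorial(d - 3) / (2 * factorial(d))
    return coef * hyp1f1(d - 1, d, lam)

for d in (3, 6, 10):   # the reduction relative to the cut posterior's P-risk
    print(d, risk_reduction(d, gamma=float(d - 2), lam=-1.0))
```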

Theorems 1-2 demonstrate that, under certain conditions and in terms of P-risk, the S-SMP is more accurate than the cut posterior and possibly the full posterior. However, Theorems 1-2 implicitly require that the difference between the cut and full posterior locations does not diverge, which is a consequence of the asymptotic regime in Assumption 2. This raises the question of what happens to the S-SMP when we move from the case of local model misspecification (Assumption 2) to gross model misspecification (Assumption 1). The following result demonstrates that under gross model misspecification the S-SMP converges to the cut posterior, and so is robust to either form of misspecification.

Corollary 1.

Suppose that Assumption 1, Assumption 3, and the regularity conditions in Appendix B.1 are satisfied. If $\|\overline{\theta}_{\mathrm{cut}}-\theta_{0}\|=O_{p}(n^{-1/2})$ and $\|\overline{\theta}_{\mathrm{full}}-\theta_{\star}\|=O_{p}(n^{-1/2})$, then $\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta=o_{p}(1)$.

5 Additional examples

5.1 Normal-normal random effects model

We first apply the S-SMP to the misspecified normal-normal random effects model presented in Liu et al. (2009). The observed data are $z_{ij}$, comprising observations on groups $i=1,\dots,N$, with $j=1,\dots,J$ observations in each group, which we assume are generated from the model $z_{ij}\mid\beta_{i},\varphi_{i}^{2}\stackrel{iid}{\sim}N(\beta_{i},\varphi_{i}^{2})$, with random effects $\beta_{i}\mid\nu\stackrel{iid}{\sim}N(0,\nu^{2})$. The goal of the analysis is to conduct inference on the standard deviation of the random effects, $\nu$, and the residual standard deviation parameters $\varphi=(\varphi_{1},\dots,\varphi_{N})^{\top}$. Below we write $\beta=(\beta_{1},\dots,\beta_{N})^{\top}$ and $\zeta=(\nu,\beta^{\top})^{\top}$.

For $\bar{z}_{i}=J^{-1}\sum_{j=1}^{J}z_{ij}$ and $s_{i}^{2}=\sum_{j=1}^{J}(z_{ij}-\bar{z}_{i})^{2}$, $i=1,\dots,N$, the likelihood for $(\zeta,\varphi)$ can be written to depend only on the sufficient statistics $\bar{z}=(\bar{z}_{1},\dots,\bar{z}_{N})^{\top}$ and $s^{2}=(s_{1}^{2},\dots,s_{N}^{2})^{\top}$, where, independently for $i=1,\dots,N$,

\[\bar{z}_{i}\mid\zeta,\varphi\sim N(\beta_{i},\varphi_{i}^{2}/J),\quad s^{2}_{i}\mid\varphi\sim\text{Gamma}\left(\frac{J-1}{2},\frac{1}{2\varphi_{i}^{2}}\right).\]

Letting $\theta_{1}=\varphi$ and $\theta_{2}=\zeta$, the random effects model can then be written as a two-module system of the form shown in Figure 1: module one depends on $(s^{2},\theta_{1})$, with $X=s^{2}$, and module two depends on $(\bar{z},\theta_{2},\theta_{1})$, with $Y=\bar{z}$.

Let $\text{Gamma}(x;A,B)$ denote the value of the $\text{Gamma}(A,B)$ density evaluated at $x$, and $N(x;\mu,\sigma^{2})$ the value of the $N(\mu,\sigma^{2})$ density evaluated at $x$. The first module has likelihood $f_{1}(X\mid\theta_{1})=\prod_{i=1}^{N}\text{Gamma}\big(s_{i}^{2};\frac{J-1}{2},\frac{1}{2\theta_{1,i}^{2}}\big)$, while the second module has likelihood $f_{2}(Y\mid\theta)=\prod_{i=1}^{N}N(\bar{Z}_{i};\beta_{i},\theta_{1,i}^{2}/J)$.

When the Gaussian prior for the random effects term $\beta_{i}$ conflicts with the likelihood information, inferences for $\theta_{1,i}=\varphi_{i}$ can be adversely impacted. Such an outcome will occur when, for instance, a value of $\beta_{i}$ differs markedly from its assumed model. Liu et al. (2009) argue that the thin-tailed Gaussian assumption for the random effects can produce poor inferences for $\theta_{1,i}$ due to the feedback induced by the likelihood term $N(\bar{Z}_{i};\beta_{i},\varphi_{i}^{2}/J)$ in the second module. To guard against this, Liu et al. (2009) propose cut posterior inference for $\theta_{1}\mid s^{2}$, which can be accommodated by simply updating the posterior for $\theta_{1}$ using only the information in the corresponding summary statistics $s^{2}$: given $\pi(\theta_{1,i}^{2})\propto(\theta_{1,i}^{2})^{-1}$, independently across $i=1,\dots,N$, the cut posterior for $\theta_{1}^{2}$ (where this denotes the elementwise square of $\theta_{1}$) is

\[\pi_{\mathrm{cut}}(\theta_{1}^{2}\mid X)\propto\prod_{i=1}^{N}(\theta_{1,i}^{2})^{-\frac{J+1}{2}}\exp\left\{-\frac{J s_{i}^{2}}{2\theta_{1,i}^{2}}\right\}.\]

Summaries of the cut posterior for $\theta_{1}$ can be obtained by sampling from the cut posterior for $\theta_{1}^{2}$ and transforming the samples; a sampling sketch is given below. Joint inferences for $(\zeta,\varphi)$ can be carried out using the cut posterior distribution

\[\pi_{\mathrm{cut}}(\zeta,\varphi\mid X,Y)=\pi_{\mathrm{cut}}(\varphi\mid X)\,\pi(\zeta\mid Y,\varphi)=\pi_{\mathrm{cut}}(\varphi\mid s^{2})\,\pi(\zeta\mid\bar{Z},\varphi),\quad\varphi=\theta_{1},\]

where the conditional posterior $\pi(\zeta\mid\bar{Z},\varphi)$ is obtained from the joint posterior for $(\theta_{1},\theta_{2})$.
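Because the displayed cut posterior factorizes over groups, with each $\theta_{1,i}^{2}$ following an inverse-gamma density, exact sampling is straightforward. A minimal sketch (not the authors' code) is:

```python
import numpy as np
from scipy.stats import invgamma

def sample_cut_theta1(s2: np.ndarray, J: int, n_draws: int, seed: int = 0) -> np.ndarray:
    """Cut-posterior draws of theta_1 = phi, given s2 = (s_1^2, ..., s_N^2)."""
    rng = np.random.default_rng(seed)
    # (theta^2)^{-(J+1)/2} exp{-J s_i^2/(2 theta^2)} is InvGamma((J-1)/2, J s_i^2/2).
    theta1_sq = invgamma.rvs(a=(J - 1) / 2, scale=J * s2 / 2,
                             size=(n_draws, s2.size), random_state=rng)
    return np.sqrt(theta1_sq)  # transform draws of theta_1^2 to draws of theta_1
```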

We now demonstrate that the S-SMP delivers inferences for $\theta_{1,i}$ that are more accurate than those of the cut posterior across different levels of misspecification. We generate 500 repeated samples from the normal-normal random effects model with $N=100$ groups, each with random effect component $\beta_{i}$, $i=1,\dots,N$. For each group we set $\theta_{1,i}:=\varphi_{i}=0.50$, and we set $\nu=1$. We induce model misspecification through the random effect term $\beta_{1}$. Following the design of Liu and Goudie (2022b), we induce misspecification by forcing $\beta_{1}$ to be an outlier; however, unlike Liu and Goudie (2022b), we let the magnitude of the outlier decrease as the number of individual observations in each group, $J$, increases. We set the random effect for the first group as $\beta_{1}=50/J$, and consider $J\in\{5,10,20,50,100\}$. When $J$ is small the cut posterior delivers more accurate inferences for $\theta_{1,1}$ than the full posterior, as the feedback between this outlier and $\theta_{1,1}$ has been removed. As the magnitude of the outlier shrinks, the full posterior for $\theta_{1,1}$ becomes more accurate and a meaningful trade-off between the cut and full posteriors exists.
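For concreteness, the simulation design just described can be generated as follows (a sketch; the function name and defaults are ours):

```python
import numpy as np

def simulate(J: int, N: int = 100, phi: float = 0.50, nu: float = 1.0,
             seed: int = 0) -> np.ndarray:
    """Generate one (N, J) dataset from the outlier design described above."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, nu, size=N)     # random effects beta_i ~ N(0, nu^2)
    beta[0] = 50.0 / J                     # misspecification: outlying first group
    return rng.normal(beta[:, None], phi, size=(N, J))  # z_ij ~ N(beta_i, phi^2)
```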

Our goal is to measure the impact of misspecification on the inferences for $\theta_{1,1}$, and so we choose the weight in the S-SMP using squared error loss for this component only. This produces a pooling weight that is similar to that discussed in Section 3.2, but based on $\theta_{1,1}$ rather than the entire vector $\theta_{1}$. (Since only the first random effect component, $\beta_{1}$, is misspecified, inferences for $\theta_{1,2},\dots,\theta_{1,N}$ are not impacted by misspecification. Thus, even if the pooling weight were estimated based on the entire vector $\theta_{1}$, it would be the $\theta_{1,1}$ component that drives the pooling weight, since the cut and full posterior means are very similar for the remaining components.) We present Monte Carlo estimates (calculated over the 500 replicated datasets) of the corresponding P-risk under squared error loss for the posteriors in Table 1. Across all values of $J$, the S-SMP has P-risk no larger than that of the cut posterior, and in many cases smaller than that of the full posterior as well. Further, the cut posterior is more accurate than the full posterior when $J$ is small, but the full posterior becomes more accurate as $J$ increases. Correspondingly, the weight on the full posterior in the S-SMP is close to zero for small $J$ and increases as $J$ increases. However, for large $J$ the cut and full posteriors behave similarly, and the S-SMP maintains most of the weight on the cut posterior.

                          J=5        J=10     J=20     J=50     J=100
S-SMP                     10.5365    2.9705   1.1753   0.3723   0.1897
Full                      8678.9763  4.6894   1.3124   0.5491   0.2634
Cut                       10.5365    3.5806   1.4721   0.5102   0.2581
$\widehat{\omega}_{+}$    0.00       0.18     0.23     0.15     0.15

Table 1: P-risk values under squared error loss, multiplied by 100 for readability. The S-SMP attains the lowest (or tied lowest) risk in every column. $\widehat{\omega}_{+}$ is the average pooling weight across the replications. Misspecification decreases as the number of individual observations per group ($J$) increases.

5.2 Archaeological example

Our final example, discussed in Styring et al. (2017), Carmona and Nicholls (2020) and Yu et al. (2023), involves data collected to evaluate an “extensification hypothesis” for early Mesopotamian agricultural practices. The hypothesis states that as cities grew, agriculture extended over larger areas with less intensive cultivation, rather than more intensively farming existing areas to meet food demands.

The analysis uses two data sources: an archaeological dataset and a modern experimental dataset. Figure 5, which is similar to Figure 6 from Carmona and Nicholls (2020), shows a graphical representation of the model which comprises two modules. The first, the “HM module”, is a Gaussian linear regression model incorporating random effects. The second, the “PO module”, is a proportional odds model used to impute a missing categorical covariate for the HM module.

Figure 5: Graphical representation of model for the agricultural extensification example.

In the HM module's regression, the response is the nitrogen level of cereal grains, denoted $Z$. We follow the notation of Yu et al. (2023) and use subscripts $A$ and $M$ to denote archaeological and modern values of any variable. So, for example, $Z_{A}$ and $Z_{M}$ are nitrogen levels of cereal grains for archaeological and modern data, respectively. For covariates in the HM module we have crop category $C\in\{\text{Wheat},\text{Barley}\}$, site location $P$ (a categorical variable), site size $S$, rainfall $R$, and manure level $M\in\{m_{\text{low}},m_{\text{medium}},m_{\text{high}}\}$. Archaeological data for rainfall and manure level, $R_{A}$ and $M_{A}$, are missing. The HM module is a linear regression model with fixed effects for rainfall and manure level, a random effect for site location, and error variance based on crop category.

The imputation model in the PO module imputes the missing manuring level covariates for the archaeological data, with parameters $\theta_{2}=(\gamma,\alpha,\xi,\sigma_{\xi})^{\top}$. The prior on $M_{A}$ is a proportional odds model with covariates site size and site location. The parameter $\gamma$ is the site size coefficient; a negative $\gamma$ supports the extensification hypothesis. The parameter $\xi$ is a vector of random effects for five archaeological site locations in the proportional odds model, $\sigma_{\xi}$ is the standard deviation of the random effects, and $\alpha$ is a vector of two threshold parameters. Further details on the model and priors are available in Appendix B.1 of Yu et al. (2023). Bayesian modular inference is relevant in this example because the PO module may be poorly specified. Therefore, we can cut feedback so that $M_{A}$ is imputed based solely on the hierarchical model for cereal grain nitrogen levels (the HM module in Figure 5), ensuring that the imputation of $M_{A}$ and the interpretation of $\gamma$ are unaffected by any misspecification in the PO module.

In Figure 5, the red line indicates a “cut” between the modules. Figure 6 provides a simplified model structure.

Figure 6: Simplified graphical representation of the model for the agricultural extensification example.

Although it looks different from the two-module system in Figure 1, the cut posterior has the same form, allowing cut and semi-modular inference to proceed similarly. Here we consider how semi-modular inference changes according to the choice of the loss function. For a given scalar parameter $\tau$, we will consider an S-SMP posterior using mixing weight

\[\widetilde{\omega}_{+}=\min\{1,\widetilde{\omega}(\tau)\},\qquad\widetilde{\omega}(\tau)=\frac{\sigma_{\tau,\mathrm{cut}}^{2}-\sigma_{\tau,\mathrm{full}}^{2}}{(\overline{\tau}_{\mathrm{cut}}-\overline{\tau}_{\mathrm{full}})^{2}}\,\mathbb{I}(\sigma_{\tau,\mathrm{cut}}^{2}-\sigma_{\tau,\mathrm{full}}^{2}>0),\]

where $\sigma_{\tau,\mathrm{cut}}^{2}$ and $\sigma_{\tau,\mathrm{full}}^{2}$ are the cut and full marginal posterior variances for $\tau$, and $\overline{\tau}_{\mathrm{cut}}$ and $\overline{\tau}_{\mathrm{full}}$ are the cut and full marginal posterior means for $\tau$. This is similar to the S-SMP in Section 3.2, but based on the marginal full and cut posteriors for $\tau$.
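Given draws of the scalar $\tau$ under the two posteriors, the marginal weight is a one-line computation; a sketch (with array names of our choosing) is:

```python
import numpy as np

def marginal_weight(tau_cut: np.ndarray, tau_full: np.ndarray) -> float:
    """omega-tilde-plus for a scalar parameter, from 1-d arrays of posterior draws."""
    var_gap = tau_cut.var(ddof=1) - tau_full.var(ddof=1)
    if var_gap <= 0.0:                  # the indicator term in omega-tilde
        return 0.0
    return min(1.0, var_gap / (tau_cut.mean() - tau_full.mean()) ** 2)
```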

Figure 7 shows the cut, full and S-SMP posteriors for $\gamma$ (the parameter of main interest) and the proportional odds regression random effects $\xi_{1},\dots,\xi_{5}$. When the mixing weight is based on $\gamma$ (top left), the S-SMP reduces to the cut posterior (a full cut), while basing the mixing weights on $\xi_{1},\dots,\xi_{5}$ results in mixing weights varying between $0.2$ and $1$. In this example, the shrinkage weight can vary a great deal depending on which scalar parameter is targeted in the loss function, and so the use of an appropriately defined loss function for the application is crucial. The S-SMP for $\gamma$ shows weaker evidence for the extensification hypothesis than the standard posterior, in the sense that the posterior probability of $\gamma<0$ is smaller. Details of the MCMC approach for generating samples from the cut posterior, as well as an SMC method to generate samples from the full posterior, are given in Yu et al. (2023).

Figure 7: Marginal cut, conventional and S-SMP posteriors for different parameters in the proportional odds regression model for the archaeological example. The title for each graph shows the parameter used, and the S-SMP mixing weight used for that parameter.

6 Discussion

Choosing between the cut and full posteriors is difficult when model misspecification is not severe, and in such cases the semi-modular posteriors (SMPs) proposed in Carmona and Nicholls (2020) are an attractive alternative. While SMPs are motivated by the presence of a bias-variance trade-off between cut and full posterior inferences, this paper is the first to formalize the existence of such a trade-off. Using SMPs based on linear opinion pooling, we devise a novel pooling weight that allows the SMP to leverage this bias-variance trade-off. Our proposed shrinkage SMP is simple to implement and possesses useful theoretical guarantees that other SMP approaches do not: the posterior risk of our shrinkage SMP is no larger than that of the cut posterior and, under certain conditions, that of the full posterior. An interesting future direction would be to determine whether our theoretical results can be extended to other types of SMPs, such as those of Carmona and Nicholls (2020) and Nicholls et al. (2022).

As suggested by a referee, the notion of asymptotic risk we consider is only one criterion with which to judge the accuracy of posterior inferences, with posterior predictive accuracy and the validity of posterior credible sets being alternative measures. While assessing the accuracy of different methods based on posterior predictive accuracy is empirically feasible, it is not obvious that a ranking across different modular Bayesian methods can be deduced from this criterion under our maintained assumptions. Further, the random weighting of the SMP means that determining the asymptotic shape of this posterior, and thus the behavior of its credible sets, is not straightforward. We leave these interesting topics for future study.

Acknowledgments

David T. Frazier gratefully acknowledges support from the Australian Research Council through grant DE200101070. David Nott's research was supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 2 (MOE-T2EP20123-0009). We thank seminar participants at the Weierstrass Institute for Applied Analysis and Stochastics, and participants at the Computational methods for unifying multiple statistical analyses (Fusion) workshop, for helpful comments. In addition, we thank Pierre Jacob for helpful comments on some of the stated results. The authors also thank the associate editor and referees for very helpful comments that significantly improved the paper.

References

  • Bennett and Wakefield (2001) Bennett, J. and J. Wakefield (2001). Errors-in-variables in joint population pharmacokinetic/pharmacodynamic modeling. Biometrics 57(3), 803–812.
  • Bissiri et al. (2016) Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society. Series B, Statistical methodology 78(5), 1103.
  • Blangiardo et al. (2011) Blangiardo, M., A. Hansell, and S. Richardson (2011). A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment 45(2), 379–386.
  • Carmona and Nicholls (2020) Carmona, C. and G. Nicholls (2020). Semi-modular inference: enhanced learning in multi-modular models by tempering the influence of components. In International Conference on Artificial Intelligence and Statistics, pp.  4226–4235. PMLR.
  • Carmona and Nicholls (2022) Carmona, C. and G. Nicholls (2022). Scalable semi-modular inference with variational meta-posteriors. arXiv:2204.00296.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76(1), 1–32.
  • Castillo (2014) Castillo, I. (2014). On Bayesian supremum norm contraction rates. Annals of Statistics.
  • Chakraborty et al. (2022) Chakraborty, A., D. J. Nott, C. Drovandi, D. T. Frazier, and S. A. Sisson (2022). Modularized Bayesian analyses and cutting feedback in likelihood-free inference. Statistics and Computing (To appear).
  • Claeskens and Hjort (2003) Claeskens, G. and N. L. Hjort (2003). The focused information criterion. Journal of the American Statistical Association 98(464), 900–916.
  • Davidson (1994) Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. OUP Oxford.
  • Frazier and Nott (2024) Frazier, D. T. and D. J. Nott (2024). Cutting feedback and modularized analyses in generalized Bayesian inference. Bayesian Analysis 1(1), 1–29.
  • Gneiting and Raftery (2007) Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association 102(477), 359–378.
  • Green and Strawderman (1991) Green, E. J. and W. E. Strawderman (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators. Journal of the American Statistical Association 86(416), 1001–1006.
  • Grünwald and Van Ommen (2017) Grünwald, P. and T. Van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12(4), 1069–1103.
  • Hansen (2016) Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics 190(1), 115–132.
  • Hjort and Claeskens (2003) Hjort, N. L. and G. Claeskens (2003). Frequentist model average estimators. Journal of the American Statistical Association 98(464), 879–899.
  • Jacob et al. (2017) Jacob, P. E., L. M. Murray, C. C. Holmes, and C. P. Robert (2017). Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
  • Jacob et al. (2020) Jacob, P. E., J. O’Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 543–600.
  • Judge and Mittelhammer (2004) Judge, G. G. and R. C. Mittelhammer (2004). A semiparametric basis for combining estimation problems under quadratic loss. Journal of the American Statistical Association 99(466), 479–487.
  • Kim and White (2001) Kim, T.-H. and H. White (2001). James-Stein-type estimators in large samples with application to the least absolute deviations estimator. Journal of the American Statistical Association 96(454), 697–705.
  • Kleijn and van der Vaart (2012) Kleijn, B. and A. van der Vaart (2012). The Bernstein-von-Mises theorem under misspecification. Electron. J. Statist. 6, 354–381.
  • Lee and Lee (2018) Lee, K. and J. Lee (2018). Optimal Bayesian minimax rates for unconstrained large covariance matrices. Bayesian Analysis.
  • Lehmann and Casella (2006) Lehmann, E. L. and G. Casella (2006). Theory of point estimation. Springer Science & Business Media.
  • Liu et al. (2009) Liu, F., M. Bayarri, and J. Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis 4(1), 119–150.
  • Liu and Goudie (2022a) Liu, Y. and R. J. B. Goudie (2022a). A general framework for cutting feedback within modularized Bayesian inference. arXiv:2211.03274.
  • Liu and Goudie (2022b) Liu, Y. and R. J. B. Goudie (2022b). Stochastic approximation cut algorithm for inference in modularized Bayesian models. Statistics and Computing 32(7).
  • Liu and Goudie (2023) Liu, Y. and R. J. B. Goudie (2023). Generalized geographically weighted regression model within a modularized Bayesian framework. Bayesian Analysis (To appear).
  • Lunn et al. (2009) Lunn, D., N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander (2009). Combining MCMC with ‘sequential’ PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics 36, 19–38.
  • Maucort-Boulch et al. (2008) Maucort-Boulch, D., S. Franceschi, and M. Plummer (2008). International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers 17(3), 717–720.
  • Miller (2021) Miller, J. W. (2021). Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research 22(168), 1–53.
  • Newey (1985) Newey, W. K. (1985). Maximum likelihood specification testing and conditional moment tests. Econometrica: Journal of the Econometric Society, 1047–1070.
  • Nicholls et al. (2022) Nicholls, G. K., J. E. Lee, C.-H. Wu, and C. U. Carmona (2022). Valid belief updates for prequentially additive loss functions arising in semi-modular inference. arXiv preprint arXiv:2201.09706.
  • Plummer (2015) Plummer, M. (2015). Cuts in Bayesian graphical models. Statistics and Computing 25(1), 37–43.
  • Pompe and Jacob (2021) Pompe, E. and P. E. Jacob (2021). Asymptotics of cut distributions and robust modular inference using posterior bootstrap. arXiv preprint arXiv:2110.11149.
  • Rieder (2012) Rieder, H. (2012). Robust Asymptotic Statistics: Volume I. Springer Science & Business Media.
  • Rousseau (1997) Rousseau, J. (1997). Asymptotic bayes risks for a general class of losses. Statistics & probability letters 35(2), 115–121.
  • Shen and Wasserman (2001) Shen, X. and L. Wasserman (2001). Rates of convergence of posterior distributions. The Annals of Statistics 29(3), 687–714.
  • Stone (1961) Stone, M. (1961). The opinion pool. The Annals of Mathematical Statistics, 1339–1342.
  • Styring et al. (2017) Styring, A., M. Charles, F. Fantone, M. Hald, A. McMahon, R. Meadow, G. Nicholls, A. Patel, M. Pitre, A. Smith, A. Sołtysiak, G. Stein, J. Weber, H. Weiss, and A. Bogaard (2017). Isotope evidence for agricultural extensification reveals how the world’s first cities were fed. Nature Plants 3, 17076.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
  • Yu et al. (2023) Yu, X., D. J. Nott, and M. S. Smith (2023). Variational inference for cutting feedback in misspecified models. Statistical Science 38(3), 490–509.

Appendix A Supplementary Material

This supplementary material contains the regularity conditions used to obtain the results in the main text, proofs of all stated results and several lemmas used to prove the main results. The regularity conditions and proofs are broken up into two sections that depend on whether the analysis is conducted under gross model misspecification (Assumption 1), or local model misspecification (Assumption 2). In addition, this material contains further details of the HPV and cervical cancer incidence example introduced in Section 2.1 of the main text, and additional experiments for the biased means example in Section 3.2.

Appendix B Gross Misspecification: Assumption 1

B.1 Regularity Conditions

The regularity conditions used to prove Lemma 1 are similar to those used to deduce posterior concentration rates in generalized Bayesian methods; see, e.g., Shen and Wasserman (2001), as well as Miller (2021). We state the assumptions separately for the cut posterior and the full posterior. Recall $\ell_{p}(\theta_{1})=\log f_{1}(z_{1:n}\mid\theta_{1})$, and rewrite the cut posterior for $\theta_{1}$ as

\[\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})=\frac{\exp\{\ell_{p}(\theta_{1})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}=\frac{\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}.\]

For $A\subseteq\Theta_{1}$, write $M_{1,n}(A)=\int_{A}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}$, so that $\Pi_{\mathrm{cut}}(\theta_{1}\in A\mid z_{1:n})=M_{1,n}(A)/M_{1,n}(\Theta_{1})$. For $\varepsilon>0$, define $\Theta_{1}(\sqrt{n\varepsilon^{2}}):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,0}\|\leq\sqrt{n\varepsilon^{2}}\}$.

Assumption B1.

The following are satisfied.

  1. (i) For any $\delta>0$ there exists $\varepsilon>0$ and a sufficiently large $K>0$ such that
\[P^{(n)}_{0}\left[\sup_{\|\theta_{1}-\theta_{1,0}\|\geq\delta}\frac{1}{n}\left\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\right\}\geq-K\varepsilon\right]=o(1).\]

  2. (ii) For all $n\geq 1$ and $\varepsilon>0$, with $b_{n}=(1/2)\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}e^{-2n\varepsilon^{2}}$,
\[P^{(n)}_{0}\left\{M_{1,n}(\Theta_{1})\leq b_{n}\right\}\leq 2/(n\varepsilon^{2}).\]

  3. (iii) For $\varepsilon>0$ and $c>0$, $\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}\gtrsim\exp\{-cn\varepsilon^{2}\}$.

  4. (iv) For any $\theta_{1},\theta_{1}^{\prime}\in\Theta_{1}$, if $\theta_{1}\neq\theta_{1}^{\prime}$, then $f_{1}(z\mid\theta_{1})\neq f_{1}(z\mid\theta_{1}^{\prime})$ with positive probability.

Remark 9.

Parts (i) and (iii) of Assumption B1 are identical to those maintained in Shen and Wasserman (2001), but stated for the partial log-likelihood $\ell_{p}(\theta_{1})$, while the bound on the posterior denominator in Assumption B1(ii) is maintained to simplify the proofs and can be removed at the cost of additional technicalities, for instance, by using arguments similar to those of Lemma 1 in Shen and Wasserman (2001).

From the definition of $\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$,

\begin{align*}\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}&=\int_{\mathcal{Z}}\log\frac{f_{1}(z\mid\theta_{1,0})\delta_{0}(z)}{f_{1}(z\mid\theta_{1})f_{2}(z\mid\theta)}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z\\&=\int_{\mathcal{Z}}\log\frac{f_{1}(z\mid\theta_{1,0})}{f_{1}(z\mid\theta_{1})}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z+\int_{\mathcal{Z}}\log\frac{\delta_{0}(z)}{f_{2}(z\mid\theta)}f_{1}(z\mid\theta_{1,0})\delta_{0}(z)\mathrm{d}z.\end{align*}

Setting $\theta_{1}=\theta_{1,0}$ minimizes the first component of $\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$, but does not minimize both components. Hence, under Assumption 1, $\theta_{\star}:=\operatorname{argmin}_{\theta\in\Theta}\mathrm{KL}\{h(z\mid\theta_{1,0},\delta_{0})\|f(z\mid\theta)\}$ is the value onto which we would expect the full posterior to concentrate as $n\rightarrow\infty$. Thus, conducting joint Bayesian inference on $\theta$ under Assumption 1 results in a posterior for which $\theta_{1}$ will not concentrate onto $\theta_{1,0}$. To formally prove this result, recall the definition $\ell(\theta)=\log f(z_{1:n}\mid\theta)$, and consider the following regularity conditions, which are analogues of Assumption B1 for $\ell(\theta)$ and $\pi(\theta)$.

Assumption B2.

The following are satisfied.

  1. (i) For any $\delta>0$ there exists $\varepsilon>0$ and a sufficiently large $K>0$ such that, for some $\theta_{\star}\in\Theta$,
\[P^{(n)}_{0}\left[\sup_{\|\theta-\theta_{\star}\|\geq\delta}\frac{1}{n}\left\{\ell(\theta)-\ell(\theta_{\star})\right\}\geq-K\varepsilon\right]=o(1).\]

  2. (ii) For all $n\geq 1$ and $\varepsilon>0$, with $\Theta(\sqrt{n\varepsilon^{2}}):=\{\theta\in\Theta:\|\theta-\theta_{\star}\|\leq\sqrt{n\varepsilon^{2}}\}$ and $b_{n}=(1/2)\Pi\{\Theta(\sqrt{n\varepsilon^{2}})\}e^{-2n\varepsilon^{2}}$,
\[P^{(n)}_{0}\left\{M_{n}(\Theta)\leq b_{n}\right\}\leq 2/(n\varepsilon^{2}).\]

  3. (iii) For $\varepsilon>0$ and $c>0$, $\Pi\{\Theta(\sqrt{n\varepsilon^{2}})\}\gtrsim\exp\{-2cn\varepsilon^{2}\}$.

  4. (iv) For any $\theta,\theta^{\prime}\in\Theta$, if $\theta\neq\theta^{\prime}$, then $f(z\mid\theta)\neq f(z\mid\theta^{\prime})$ with positive probability.

B.2 Proofs of Main Results: Gross Misspecification

Proof of Lemma 1.

We prove the two cases separately, starting with the cut posterior.

Part 1: Cut posterior.

For $\varepsilon>0$, recall $\Theta_{1}(\varepsilon):=\{\theta\in\Theta:\|\theta_{1}-\theta_{1,0}\|\leq\varepsilon\}$, and consider

\[\Pi_{\mathrm{cut}}\{\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}=\int_{\Theta_{1}(\varepsilon)^{c}}\frac{\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})}{\int_{\Theta_{1}}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}}\mathrm{d}\theta_{1}=\frac{M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}}{M_{1,n}\{\Theta_{1}\}}.\]

Apply Assumption B1(ii) to see that, with probability at least $1-2/(n\varepsilon^{2})$,

\begin{align}\Pi_{\mathrm{cut}}\{\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}\leq\frac{M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}}{b_{n}}\leq 2\frac{e^{2n\varepsilon^{2}}}{\Pi\{\Theta_{1}(\sqrt{n\varepsilon^{2}})\}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}\lesssim e^{4n\varepsilon^{2}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\},\tag{9}\end{align}

where the final bound follows by Assumption B1(iii).

Focus on the term $M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}$ in (9). Since $\Theta_{1}(\varepsilon)$ is bounded for any finite $n$, the log-likelihood ratio $\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})$ is also bounded for any $n$, and so by Assumption B1(i),

\begin{align*}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}&=\int\mathbf{1}\left\{\Theta_{1}(\varepsilon)^{c}\right\}\exp\{\ell_{p}(\theta_{1})-\ell_{p}(\theta_{1,0})\}\pi(\theta_{1})\mathrm{d}\theta_{1}\\&\leq\exp\{-nK\varepsilon^{2}\}\Pi\{\Theta_{1}(\varepsilon)^{c}\}\leq\exp\{-nK\varepsilon^{2}\},\end{align*}

with probability converging to one (since $\Pi(A)\leq 1$ for all $A\subseteq\Theta$).

Placing this bound into equation (9), and taking $K=8c$, we obtain

\begin{align*}\Pi_{\mathrm{cut}}\{\theta\in\Theta_{1}(\varepsilon)^{c}\mid z_{1:n}\}&\lesssim e^{4n\varepsilon^{2}}M_{1,n}\{\Theta_{1}(\varepsilon)^{c}\}\lesssim\exp\{4cn\varepsilon^{2}-cKn\varepsilon^{2}\}=\exp\{-4cn\varepsilon^{2}\}.\end{align*}

For any $\varepsilon\leq\log(n)/\sqrt{n}$, the stated result follows.

Part 2: Exact posterior. Repeating arguments similar to those above, but for the set $\Theta_{\star}(\varepsilon):=\{\theta\in\Theta:\|\theta-\theta_{\star}\|\leq\varepsilon\}$, proves that, with probability converging to one,

\[\Pi_{\mathrm{full}}\{\theta\in\Theta_{\star}(\varepsilon)\mid z_{1:n}\}\gtrsim 1-\exp\{-4cn\varepsilon^{2}\}.\]

However, defining $\Theta_{1,\star}(\varepsilon):=\{\theta_{1}\in\Theta_{1}:\|\theta_{1}-\theta_{1,\star}\|\leq\varepsilon\}$, since $\Theta_{\star}(\varepsilon)\subset\Theta_{1,\star}(\varepsilon)\times\Theta_{2}$, it follows that

\begin{align*}1-C\exp\{-4cn\varepsilon^{2}\}\leq\Pi_{\mathrm{full}}\{\theta\in\Theta_{\star}(\varepsilon)\mid z_{1:n}\}&=\int_{\Theta_{\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&\leq\int_{\Theta_{1,\star}(\varepsilon)}\int_{\Theta_{2}}\pi_{\mathrm{full}}(\theta_{1},\theta_{2}\mid z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&=\int_{\Theta_{1,\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}.\end{align*}

Hence, with probability converging to one, $\int_{\Theta_{1,\star}(\varepsilon)}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\rightarrow 1$. Since $\theta_{1,0}\neq\theta_{1,\star}$ under Assumption 1, there exists some $\varepsilon>0$ such that $\theta_{1,0}\not\in\Theta_{1,\star}(\varepsilon)$, so that for any $\widetilde{\varepsilon}\leq\varepsilon$, $\int_{\Theta_{1}(\widetilde{\varepsilon})}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\rightarrow 0$ in probability. ∎

Proof of Corollary 1.

Write the SMP as

\begin{align}\pi_{\omega}(\theta\mid z_{1:n})&=\pi_{\mathrm{cut}}(\theta\mid z_{1:n})+\omega\{\pi_{\mathrm{full}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\}\nonumber\\&=\pi_{\mathrm{cut}}(\theta\mid z_{1:n})+\omega\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}\pi(\theta_{2}\mid\theta_{1},z_{1:n}),\tag{10}\end{align}

where $\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})=\int_{\Theta_{2}}\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta_{2}$. Apply (10) and Fubini's theorem to obtain

\begin{align*}\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta&=\int_{\Theta}|\widehat{\omega}_{+}\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}\pi(\theta_{2}\mid\theta_{1},z_{1:n})|\mathrm{d}\theta\\&=\int_{\Theta_{1}}\int_{\Theta_{2}}|\widehat{\omega}_{+}\{\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\}|\pi(\theta_{2}\mid\theta_{1},z_{1:n})\mathrm{d}\theta_{2}\mathrm{d}\theta_{1}\\&=\widehat{\omega}_{+}\int_{\Theta_{1}}|\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})|\mathrm{d}\theta_{1}.\end{align*}

We can then write

\[\int_{\Theta}|\pi_{\widehat{\omega}_{+}}(\theta\mid z_{1:n})-\pi_{\mathrm{cut}}(\theta\mid z_{1:n})|\mathrm{d}\theta\leq\widehat{\omega}_{+}\left[\int_{\Theta_{1}}\pi_{\mathrm{full}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}+\int_{\Theta_{1}}\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\mathrm{d}\theta_{1}\right]=2\widehat{\omega}_{+}.\]

The stated result now follows if $\widehat{\omega}_{+}=o_{p}(1)$ as $n\rightarrow+\infty$.

To show this, let $X_{n,\mathrm{cut}}:=\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})$, $X_{n,\mathrm{full}}:=\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{\star})$, $X_{n}:=X_{n,\mathrm{full}}-X_{n,\mathrm{cut}}$, and $Y_{n}=\sqrt{n}(\theta_{0}-\theta_{\star})$. Then, for $\Upsilon_{n}=\Upsilon(\overline{\theta}_{\mathrm{cut}})$,

\[\widehat{\omega}=\frac{\gamma_{n}}{n\|\overline{\theta}_{\mathrm{cut}}-\overline{\theta}_{\mathrm{full}}\|^{2}_{\Upsilon_{n}}}=\frac{\gamma_{n}}{\|X_{n,\mathrm{cut}}-X_{n,\mathrm{full}}+Y_{n}\|^{2}_{\Upsilon_{n}}}=\frac{\gamma_{n}}{\|Y_{n}-X_{n}\|^{2}_{\Upsilon_{n}}}.\]

By the reverse triangle inequality,

\[\widehat{\omega}\leq\frac{\gamma_{n}}{|\|X_{n}\|_{\Upsilon_{n}}^{2}-\|Y_{n}\|_{\Upsilon_{n}}^{2}|}.\]

By the hypothesis of the result, $\|X_{n}\|=O_{p}(1)$, while under Assumption 1, $\theta_{1,0}\neq\theta_{1,\star}$, so that $\|Y_{n}\|\rightarrow+\infty$ as $n\rightarrow+\infty$, and the stated result follows. ∎

Appendix C Local Misspecification: Assumption 2

C.1 Regularity Conditions: Local Misspecification

Before stating the regularity conditions we maintain in this section, we recall several notations previously defined in the main text. Let $\ell(\theta)=\log f(z_{1:n}\mid\theta)$, and denote the joint log-likelihood for the $i$-th observation as $\ell(z_{i}\mid\theta)=\log f(z_{i}\mid\theta)$. Denote the full derivative of the log-likelihood as $\dot{\ell}(\theta):=\partial\ell(\theta)/\partial\theta$, and the second derivative as $\ddot{\ell}(\theta):=\partial^{2}\ell(\theta)/\partial\theta\partial\theta^{\top}$. For $j,k\in\{1,2\}$, define the partial derivatives $\dot{\ell}_{(j)}(\theta)=\partial\ell(\theta)/\partial\theta_{j}$ and the second partial derivatives $\ddot{\ell}_{(jk)}(\theta)=\partial^{2}\ell(\theta)/\partial\theta_{j}\partial\theta_{k}^{\top}$. For a function $g:\mathcal{Z}\rightarrow\mathbb{R}^{d}$, let $\mathbb{E}_{n}[g(z)]$ denote the expectation of $g(z)$ under $h(z\mid\theta_{0},\delta_{n})$ in Assumption 2; i.e., $\mathbb{E}_{n}[g(z)]=\int_{\mathcal{Z}}g(z)h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)$. Define the matrices $\mathcal{I}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}(\theta_{0})]$ and $\mathcal{I}_{jk}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(jk)}(\theta_{0})]$. Recall that $\eta=(\eta_{1}^{\top},\eta_{2}^{\top})^{\top}$ in Assumption 2 is partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$.

In addition, note that

\[\ell(\theta)=\log f_{1}(z_{1:n}\mid\theta_{1})+\log f_{2}(z_{1:n}\mid\theta)=\ell_{p}(\theta_{1})+\ell_{c}(\theta),\]

where $\ell_{p}(\theta_{1}):=\log f_{1}(z_{1:n}\mid\theta_{1})$ signifies the 'partial log-likelihood' term, and $\ell_{c}(\theta):=\log f_{2}(z_{1:n}\mid\theta_{1},\theta_{2})$ signifies the log-likelihood term that is used in cut inference to construct the conditional posterior for $\theta_{2}$ given $\theta_{1}$. Define the partial derivatives of $\ell_{p}(\theta_{1})$ as $\dot{\ell}_{p(1)}(\theta_{1})=\partial\ell_{p}(\theta_{1})/\partial\theta_{1}$ and $\ddot{\ell}_{p(11)}(\theta_{1})=\partial^{2}\ell_{p}(\theta_{1})/\partial\theta_{1}\partial\theta_{1}^{\top}$, and recall $\mathcal{I}_{p(11)}:=-\lim_{n}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{p(11)}(\theta_{1,0})]$. For $\ell_{c}(\theta)$ and $j,k\in\{1,2\}$, define $\dot{\ell}_{c(j)}(\theta):=\partial\ell_{c}(\theta)/\partial\theta_{j}$ and $\ddot{\ell}_{c(jk)}(\theta):=\partial^{2}\ell_{c}(\theta)/\partial\theta_{j}\partial\theta_{k}^{\top}$. From the structure of the log-likelihood $\ell(\theta)$, note that

\begin{align*}\mathcal{I}_{12}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(12)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(12)}(\theta_{0})],\\ \mathcal{I}_{21}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(21)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(21)}(\theta_{0})],\\ \mathcal{I}_{22}&:=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{(22)}(\theta_{0})]=-\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}_{n}[\ddot{\ell}_{c(22)}(\theta_{0})].\end{align*}

To formalize the impact of the misspecification in Assumption 2, we impose the following regularity conditions on the density $h(z\mid\theta,\delta_{n})$. (We eschew measurability conditions and assume that all objects written are measurable.)

Assumption C1.

For $\upsilon:=(\theta^{\top},\psi^{\top})^{\top}$, let $\upsilon_{0}:=(\theta_{0}^{\top},\psi_{0}^{\top})^{\top}$ be an element of the interior of $\Theta\times\Delta$, where $\Delta\subset\mathbb{R}^{d}$ is compact. The function $h_{n}(z\mid\upsilon)=f_{1}(z\mid\theta)f_{2}(z\mid\theta)\{1+\psi^{\top}\zeta(z)/\sqrt{n}\}$ is twice continuously differentiable in $\upsilon$ for almost all $z\in\mathcal{Z}$. There exist positive functions $a(z)$ and $b(z)$ such that, for all $z\in\mathcal{Z}$ except on sets of measure zero, $\ell_{n}(z\mid\upsilon)=\log h_{n}(z\mid\upsilon)$ satisfies the following: $\exp\ell_{n}(z\mid\upsilon)\leq a(z)$; and, for all $\|\upsilon-\upsilon_{0}\|\leq\nu_{0}/\sqrt{n}$ and some $\nu_{0}>0$, each of $|\ell_{n}(z\mid\upsilon)|$, $\|\dot{\ell}_{n}(z\mid\upsilon)\|^{2}$ and $\|\ddot{\ell}_{n}(z\mid\upsilon)\|^{2}$ is less than $b(z)$. Further, $\mathbb{E}_{n}[a(z)],\mathbb{E}_{n}[b(z)],\mathbb{E}_{n}[a(z)b(z)]<+\infty$, and the set $\{z\in\mathcal{Z}:h_{n}(z\mid\upsilon)>0\}$ does not depend on $\upsilon$.

In addition, we maintain the following identification condition on the assumed model $f(z\mid\theta)$.

Assumption C2.

For any $\theta,\theta^{\prime}\in\Theta$, if $\theta\neq\theta^{\prime}$, then $\ell(\theta)\neq\ell(\theta^{\prime})$ with positive probability.

Remark 10.

Assumption C1 is similar to the regularity conditions employed by Claeskens and Hjort (2003) to deduce large sample theory for frequentist model averaging estimators. Assumption C2 is a classical identification condition.

C.2 Preliminary Results

The following intermediate results are used to state and prove our main results.

Lemma C.1.

If Assumptions 2, C1 and C2 are satisfied, then the following results are satisfied.

  1. $\lim_{n}n^{-1}\mathbb{E}_{n}[-\ddot{\ell}_{p(11)}(\theta_{1,0})]=\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{p(1)}(\theta_{1,0})^{\top}]$.

  2. $\lim_{n}n^{-1}\mathbb{E}_{n}[-\ddot{\ell}_{c(11)}(\theta_{0})]=\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{c(1)}(\theta_{0})\dot{\ell}_{c(1)}(\theta_{0})^{\top}]$.

  3. $\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{c(1)}(\theta_{0})^{\top}]=0$.

  4. $\lim_{n}n^{-1}\mathbb{E}_{n}[\dot{\ell}_{p(1)}(\theta_{1,0})\dot{\ell}_{c(2)}(\theta_{0})^{\top}]=0$.

The following Lemma is used to prove the results under the drifting sequences of DGPs constructed in Assumption 2.

Lemma C.2.

Suppose Assumptions 2, C1, and C2 are satisfied. Then $\phi(\theta,\delta)=\int_{\mathcal{Z}}\dot{\ell}(z\mid\theta)h(z\mid\theta,\delta)\mathrm{d}\mu(z)$ exists and is continuous on $\Theta\times\Delta$, and for all $\varepsilon>0$ and any compact $\tilde{\Theta}\subset\Theta$,

\[\lim_{n\rightarrow+\infty}\operatorname{Pr}\left(\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\geqslant\varepsilon\right)=0.\]

The following result is a consequence of Lemmas C.1 and C.2.

Lemma C.3.

If Assumptions 2, C1 and C2 are satisfied, then the following results are satisfied.

  1. $\lim_{n}\mathbb{E}_{n}[\dot{\ell}(\theta_{0})/\sqrt{n}]=\eta$.

  2. $\dot{\ell}(\theta_{0})/\sqrt{n}\Rightarrow N(\eta,\mathcal{I})$.

  3. $\dot{\ell}_{p(1)}(\theta_{0})/\sqrt{n}\Rightarrow N(0,\mathcal{I}_{p(11)})$.

  4. $\dot{\ell}_{c(2)}(\theta_{0})/\sqrt{n}\Rightarrow N(\eta_{2},\mathcal{I}_{(22)})$.

The following is a useful extension of Stein’s Lemma.

Lemma C.4 (Lemma 2 of Hansen, 2016).

If $\xi\sim N(0,V)$ is an $m\times 1$ vector, $\Psi$ is an $m\times m$ matrix, and $\varphi:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is continuously differentiable, then for $h\in\mathbb{R}^{m}$,

\[\mathbb{E}\left\{\varphi(\xi+h)^{\top}\Psi\xi\right\}=\mathbb{E}\,\text{tr}\left\{\frac{\partial}{\partial x}\varphi(\xi+h)^{\top}\Psi V\right\}.\tag{11}\]
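The identity (11) can be verified by simulation for a simple smooth map; the sketch below uses the elementwise map $\varphi(x)=\tanh(x)$, whose Jacobian is diagonal, with illustrative choices of $V$, $\Psi$ and $h$ of our own making.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_mc = 3, 400_000
A = rng.normal(size=(m, m))
V = A @ A.T + np.eye(m)                 # positive-definite variance of xi
Psi = rng.normal(size=(m, m))
h = np.array([0.5, -1.0, 0.25])

xi = rng.normal(size=(n_mc, m)) @ np.linalg.cholesky(V).T   # xi ~ N(0, V)
phi = np.tanh(xi + h)                                       # phi(xi + h), elementwise

lhs = np.mean(np.einsum("ni,ij,nj->n", phi, Psi, xi))       # E{phi(xi+h)^T Psi xi}
# For elementwise tanh the Jacobian is diag(1 - tanh^2), so the trace term
# tr{(d/dx) phi^T Psi V} reduces to sum_i (1 - phi_i^2) (Psi V)_{ii}.
rhs = np.mean((1.0 - phi**2) @ np.diag(Psi @ V))
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
```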

To simplify the proofs of our results, we use the following block-matrix notation: for $j,k\in\{1,2\}$, let $\mathbf{0}_{d_{j}\times d_{k}}$ denote a $d_{j}\times d_{k}$ matrix of zeros, and define

\begin{align*}\Gamma_{1,\mathrm{cut}}&:=(\mathcal{I}_{p(11)}^{-1}:\mathbf{0}_{d_{1}\times d_{1}}:\mathbf{0}_{d_{1}\times d_{2}}),&\Gamma_{1,\mathrm{full}}&:=(\mathbf{0}_{d_{1}\times d_{1}}:\mathcal{I}_{11.2}^{-1}:-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}),\\ \Gamma_{2,\mathrm{cut}}&:=(-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}:\mathbf{0}_{d_{2}\times d_{1}}:\mathcal{I}_{22}^{-1}),&\Gamma_{2,\mathrm{full}}&:=(\mathbf{0}_{d_{2}\times d_{1}}:-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}:\mathcal{I}_{22}^{-1}).\end{align*}

In addition, define

\[Z_{n}:=\frac{1}{\sqrt{n}}\begin{pmatrix}\dot{\ell}_{p(1)}(\theta_{1,0})\\ \dot{\ell}_{p(1)}(\theta_{1,0})+\dot{\ell}_{c(1)}(\theta_{0})\\ \dot{\ell}_{c(2)}(\theta_{0})\end{pmatrix},\quad\Gamma_{\mathrm{cut}}:=\begin{pmatrix}\Gamma_{1,\mathrm{cut}}\\ \Gamma_{2,\mathrm{cut}}\end{pmatrix},\quad\Gamma_{\mathrm{full}}:=\begin{pmatrix}\Gamma_{1,\mathrm{full}}\\ \Gamma_{2,\mathrm{full}}\end{pmatrix},\]

where we note that $\Gamma_{\mathrm{cut}}$ and $\Gamma_{\mathrm{full}}$ have dimension $(d_{1}+d_{2})\times(2d_{1}+d_{2})$. Recall

\[W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1},\quad\mathcal{M}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}.\]

When no confusion is likely to result, terms evaluated at $\theta_{0}$ will have their dependence on this value suppressed; e.g., for $j,k\in\{1,2\}$ we will often write $\ell_{(jk)}=\ell_{(jk)}(\theta_{0})$.

Lemma C.5.

Under Assumptions 2, C1 and C2, the following are satisfied.

  1. $Z_{n}\Rightarrow\xi+\tau$, where $\tau=(\mathbf{0}_{d_{1}\times 1}^{\top},\eta^{\top})^{\top}$, and where $\xi$ is a $(2d_{1}+d_{2})$-dimensional normal random variable with mean zero and variance
\begin{align*}\Omega:=\lim_{n\rightarrow+\infty}\mathbb{E}_{n}[Z_{n}Z_{n}^{\top}]&=\lim_{n\rightarrow+\infty}n^{-1}\begin{pmatrix}\mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbf{0}_{d_{1}\times d_{2}}\\ \mathbb{E}_{n}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}_{n}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{c(1)}\dot{\ell}_{c(2)}^{\top}]\\ \mathbf{0}_{d_{2}\times d_{1}}&\mathbb{E}_{n}[\dot{\ell}_{c(2)}\dot{\ell}_{c(1)}^{\top}]&\mathbb{E}_{n}[\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}]\end{pmatrix}\\&=\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&\mathbf{0}_{d_{1}\times d_{2}}\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ \mathbf{0}_{d_{2}\times d_{1}}&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}.\end{align*}

  2. $(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\mathcal{M}$.

  3. $\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\mathcal{M}$.

To analyze the behavior of the S-SMP, we must first deduce the behavior of $\overline{\theta}_{\mathrm{cut}}$ and $\overline{\theta}_{\mathrm{full}}$. To this end, define the statistics

\[T_{n,\mathrm{full}}=\Gamma_{\mathrm{full}}Z_{n}/\sqrt{n}=\begin{pmatrix}T_{1,\mathrm{full},n}\\ T_{2,\mathrm{full},n}\end{pmatrix},\quad T_{n,\mathrm{cut}}=\Gamma_{\mathrm{cut}}Z_{n}/\sqrt{n}=\begin{pmatrix}T_{1,\mathrm{cut},n}\\ T_{2,\mathrm{cut},n}\end{pmatrix},\]

which are partitioned conformably with $\theta=(\theta_{1}^{\top},\theta_{2}^{\top})^{\top}$. Define

\begin{align*}t&:=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}})=(t_{1}^{\top},t_{2}^{\top})^{\top},&\mathcal{T}&:=\{t=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}}):\theta\in\Theta\},\\ \vartheta&:=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{cut}})=(\vartheta_{1}^{\top},\vartheta_{2}^{\top})^{\top},&\mathcal{V}&:=\{\vartheta=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{cut}}):\theta\in\Theta\}.\end{align*}

The cut and full posteriors for $\vartheta$ and $t$ are given by

\begin{align*}\pi_{\mathrm{cut}}(\vartheta\mid z_{1:n})&=\frac{1}{\sqrt{n}^{d_{\theta}}}\pi_{\mathrm{cut}}\left(\theta_{0}+\frac{\vartheta}{\sqrt{n}}+T_{n,\mathrm{cut}}\mid z_{1:n}\right),\\ \pi_{\mathrm{full}}(t\mid z_{1:n})&=\frac{1}{\sqrt{n}^{d_{\theta}}}\pi_{\mathrm{full}}\left(\theta_{0}+\frac{t}{\sqrt{n}}+T_{n,\mathrm{full}}\mid z_{1:n}\right).\end{align*}
Theorem C.1.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then $\int_{\mathcal{T}}|\pi_{\mathrm{full}}(t\mid z_{1:n})-N(t;0,\mathcal{I}^{-1})|\mathrm{d}t=o_{p}(1)$ and $\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1})$.

Theorem C.2.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then

\begin{align*}&\int_{\mathcal{V}_{1}}|\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})-N(\vartheta_{1};0,\mathcal{I}_{p(11)}^{-1})|\mathrm{d}\vartheta_{1}=o_{p}(1),\\&\int_{\mathcal{V}_{2}}|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}^{-1}_{22}+\mathcal{I}^{-1}_{22}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\mathrm{d}\vartheta_{2}=o_{p}(1).\end{align*}

In addition,

\begin{align*}&\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})\Rightarrow N(0,\mathcal{I}_{p(11)}^{-1}),\\&\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0})\Rightarrow N(\mathcal{I}_{(22)}^{-1}\eta_{2},\mathcal{I}^{-1}_{22}+\mathcal{I}^{-1}_{22}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}).\end{align*}

Corollary C.1.

If Assumption 2 in the main text and Assumptions C1 and C2 are satisfied, then, for $j\in\{1,2\}$,

\[\lim_{n\rightarrow+\infty}\left([\mathbb{E}_{n}\{\sqrt{n}(\overline{\theta}_{j,\mathrm{full}}-\theta_{j,0})\}]^{2}-[\mathbb{E}_{n}\{\sqrt{n}(\overline{\theta}_{j,\mathrm{cut}}-\theta_{j,0})\}]^{2}\right)\geq 0.\]
Lemma C.6.

If Assumptions 2 and C1–C2 are satisfied, then

  1. \sqrt{n}\begin{pmatrix}\overline{\theta}_{\mathrm{cut}}-\theta_{0}\\ \overline{\theta}_{\mathrm{full}}-\theta_{0}\end{pmatrix}=\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}Z_{n}+o_{p}(1)\Rightarrow\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}(\xi+\tau).

  2. \widehat{\omega}_{+}\Rightarrow\overline{\omega}:=\min\left\{1,\frac{\gamma}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}\right\}, where \Upsilon=\partial^{2}q(\theta,\theta_{0})/\partial\theta\partial\theta^{\top}|_{\theta=\theta_{0}}.

  3. \sqrt{n}\{\overline{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\Rightarrow\Gamma_{\mathrm{cut}}(\xi+\tau)-\overline{\omega}\{(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)\}.

Remark 11.

We do not explicitly consider the case in which the cut and full posteriors are computed from samples of different sizes; e.g., n_{2} for the full posterior and n_{1} for the cut. While such an extension is useful, the difference in sample sizes will not have a significant impact on the resulting behavior of the SMP so long as n_{1}/n_{2}\rightarrow\alpha\in(0,\infty). If one wishes to impose such a condition, the only consequence is a slight change in the definition of the matrix \Omega in Lemma C.5 to account for the fact that \lim_{n}n_{1}/n_{2}\neq 1. As such, all results presented herein can be extended to this case at the cost of minor additional technicalities. To see this, let n_{2} be the larger sample size associated with the full posterior, and n_{1} the smaller sample size associated with the cut posterior, which satisfies \lim_{n}n_{1}/n_{2}=\alpha for some \alpha<1. Then, our results go through with n=n_{2} since

\displaystyle\sqrt{n_{2}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})=\alpha^{-1/2}\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})+o_{p}\left\{\left(\frac{1}{n_{1}/n_{2}}-\frac{1}{\alpha}\right)\|\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\|\right\}
\displaystyle=\alpha^{-1/2}\sqrt{n_{1}}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})+o_{p}(1),

where the second equality follows from Theorem C.2 (when \pi_{\mathrm{cut}}(\theta\mid z_{1:n}) is based on n_{1} observations).

C.3 Proofs of Main Results

Recall the definitions \mathcal{I}_{11.2}=\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}, and

W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1},\quad\mathcal{M}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}. (12)
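Because (12) and Lemma C.5 are pure block-matrix identities, they can be verified numerically before working through the algebra below. The following sketch continues the toy construction from the sketch after Corollary C.1 (reusing Ip, Iful, I11, I12, I21, I22, W, and the dimensions defined there, all of which remain illustrative assumptions).

# Continues the toy sketch given after Corollary C.1; illustrative only.
I22i = inv(I22)
M = np.block([[W, -W @ I12 @ I22i],
              [-I22i @ I21 @ W, I22i @ I21 @ W @ I12 @ I22i]])   # M as in (12)

# Omega: the limiting covariance of Z_n from Lemma C.5, Result 1.
Z0 = np.zeros((d1, d2))
Omega = np.block([[Ip, Ip, Z0], [Ip, I11, I12], [Z0.T, I21, I22]])

# Gamma_cut and Gamma_full as written out in the proofs of Section C.4.
G_cut = np.block([[inv(Ip), np.zeros((d1, d1)), Z0],
                  [-I22i @ I21 @ inv(Ip), np.zeros((d2, d1)), I22i]])
G_full = np.block([np.zeros((d, d1)), inv(Iful)])

D = G_cut - G_full
assert np.allclose(D @ Omega @ D.T, M)        # Lemma C.5, Result 2
assert np.allclose(G_cut @ Omega @ D.T, M)    # Lemma C.5, Result 3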

To prove Theorem 1 in the main text, we first prove the following result.

Theorem C.3.

Suppose that Assumptions 2–3 and the regularity conditions in Assumptions C1–C2 are satisfied. If \text{\rm tr}\Upsilon\mathcal{M}\geq 2\|\Upsilon\mathcal{M}\|, and 0\leq\gamma\leq 2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|), then

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathbb{E}\left[\frac{\gamma\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}\right].
Proof of Theorem C.3.

For \Upsilon=[\partial^{2}q(\delta,\theta_{0})/\partial\delta\partial\delta^{\top}]_{\delta=\theta_{0}}, and \|X\|_{\Upsilon}^{2}=X^{\top}\Upsilon X, following arguments similar to those in Theorem 1 of Rousseau (1997), it can be shown that

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}):=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\int_{\Theta}nq(\theta,\theta_{0})\pi_{\mathrm{cut}}(\theta\mid z_{1:n})\mathrm{d}\theta,\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{nq(\overline{\theta}_{\mathrm{cut}},\theta_{0}),\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\|\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\|_{\Upsilon}^{2},\nu\right\},

and similarly (under Assumption 2),

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}):=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\int_{\Theta}nq(\theta,\theta_{0})\pi_{\mathrm{full}}(\theta\mid z_{1:n})\mathrm{d}\theta,\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{nq(\overline{\theta}_{\mathrm{full}},\theta_{0}),\nu\right\}
\displaystyle=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\left\{\|\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\|_{\Upsilon}^{2},\nu\right\},

which together imply that

\mathrm{R}_{q}(\pi_{\omega},\theta_{0})=\lim_{\nu\rightarrow+\infty}\liminf_{n\rightarrow+\infty}\mathbb{E}_{n}\min\{\|\sqrt{n}\{\bar{\theta}(\omega)-\theta_{0}\}\|^{2}_{\Upsilon},\nu\}.

Write

\displaystyle\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}=\sqrt{n}\{(1-\widehat{\omega}_{+})\overline{\theta}_{\mathrm{cut}}+\widehat{\omega}_{+}\overline{\theta}_{\mathrm{full}}-\theta_{0}\}
\displaystyle=\sqrt{n}(\bar{\theta}_{\mathrm{cut}}-\theta_{0})-\widehat{\omega}_{+}\{\sqrt{n}(\bar{\theta}_{\mathrm{cut}}-\theta_{0})-\sqrt{n}(\bar{\theta}_{\mathrm{full}}-\theta_{0})\}.

By Lemma C.6,

\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\Rightarrow\Psi:=\Gamma_{\mathrm{cut}}(\xi+\tau)-\overline{\omega}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau),

where \xi\sim N(0,\Omega) and \tau=(0^{\top},\eta^{\top})^{\top} are defined in Lemma C.5, and where

\overline{\omega}=\frac{\gamma}{(\xi+\tau)^{\top}P(\xi+\tau)},\quad P=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}}).

For \mathcal{Q}_{n}:=\|\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\|^{2}_{\Upsilon}, and \nu\geq 0, let

\Psi_{n,\nu}=\sqrt{n}\{\bar{\theta}(\widehat{\omega}_{+})-\theta_{0}\}\cdot\mathbb{I}[\mathcal{Q}_{n}\leq\nu]+\nu\cdot\mathbb{I}[\mathcal{Q}_{n}>\nu].

By Theorem 1.8.8 of Lehmann and Casella (2006),

\liminf_{n\rightarrow\infty}\mathbb{E}\left[\|\Psi_{n,\nu}\|_{{\Upsilon}}^{2}\right]=\mathbb{E}\left[\|\Psi\|_{{\Upsilon}}^{2}\mathbb{I}(\|\Psi\|_{\Upsilon}^{2}\leq\nu)\right]+\nu\,\text{Pr}(\|\Psi\|^{2}_{\Upsilon}>\nu).

As \nu\rightarrow\infty, the RHS of the above converges to \mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right], and we have that

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right].

Expanding \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}):

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})=\mathbb{E}\left[\Psi^{\top}{\Upsilon}\Psi\right]
\displaystyle=\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]+\gamma^{2}\mathbb{E}\frac{(\xi+\tau)^{\top}P(\xi+\tau)}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}
\displaystyle\quad-2\gamma\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}
\displaystyle=\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]+\gamma^{2}\mathbb{E}\frac{1}{[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\quad-2\gamma\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}.

From the definitions of \Gamma_{\mathrm{cut}},\xi,\tau,

\mathbb{E}\left[(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\right]=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}).

We now focus on the last term in \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}). Define the mapping \varphi(x)=x^{\top}/(x^{\top}Px) and note that

\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\{\varphi(\xi+\tau)(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)\}.

Recall that \mathbb{E}(\xi)=0, and \Omega=\mathbb{E}(\xi\xi^{\top}). Using Lemma C.4, and for K:=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega, we have

\displaystyle\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\,\text{\rm tr}\{\partial\varphi(x)/\partial x|_{x=\xi+\tau}\}K
\displaystyle=\mathbb{E}\frac{\text{\rm tr}K}{(\xi+\tau)^{\top}P(\xi+\tau)}-2\mathbb{E}\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}. (13)

From the properties of \text{\rm tr}(\cdot), and Lemma C.5,

\text{\rm tr}K=\text{\rm tr}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\text{\rm tr}{\Upsilon}\mathcal{M}.

For the second term in (13), we have that

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=\text{\rm tr}(\xi+\tau)^{\top}KP(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\mathcal{M}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau),

where the last equality again follows from Lemma C.5. Consequently, for {\Upsilon} positive semi-definite,

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\mathcal{M}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle\leq\|{\Upsilon}^{1/2}\mathcal{M}{\Upsilon}^{1/2}\|\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}
\displaystyle=\|\Upsilon\mathcal{M}\|\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}, (14)

where the final equality holds for any matrix norm \|\cdot\| such that if A and B are positive semi-definite, then \|AB\|=\|BA\|; e.g., the Frobenius norm. Substituting (14) into (13), we have

\displaystyle-\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}
\displaystyle\leq-\mathbb{E}\left[\frac{\text{\rm tr}\Upsilon\mathcal{M}}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]+2\|\Upsilon\mathcal{M}\|\mathbb{E}\left[\frac{\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}}{\left\{(\xi+\tau)^{\top}P(\xi+\tau)\right\}^{2}}\right]
\displaystyle=-\left(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|\right)\mathbb{E}\left[\frac{1}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]. (15)

Substituting (15) into the expansion of \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}), and collecting terms, yields

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})+\gamma^{2}\mathbb{E}\left[\frac{1}{\{(\xi+\tau)^{\top}P(\xi+\tau)\}}\right]-2\gamma\mathbb{E}\left[\frac{(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]
\displaystyle=\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\mathbb{E}\left[\frac{\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{(\xi+\tau)^{\top}P(\xi+\tau)}\right]. (16) ∎
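A Monte Carlo rendering of this argument is also possible by sampling directly from the limit experiment of Lemma C.6. The sketch below continues the two earlier sketches; the loss \Upsilon=\mathcal{M}^{+} (a Moore-Penrose stand-in for the \mathcal{M}^{-1} loss of Theorem 2) and the particular \gamma are assumptions made for the illustration only.

# Monte Carlo sketch of Theorem C.3 in the limit experiment of Lemma C.6;
# reuses rng, eta2, Omega, M, G_cut, D, d1, d2 from the earlier sketches.
Ups = np.linalg.pinv(M)              # assumed loss Hessian Upsilon
trUM = np.trace(Ups @ M)             # equals rank(M) = d1 for this choice
nUM = np.linalg.norm(Ups @ M, 2)     # spectral norm, equals 1 here
gamma = trUM - 2.0 * nUM             # interior point of (0, 2(tr - 2||.||))
assert gamma > 0                     # requires d1 > 2

tau = np.concatenate([np.zeros(2 * d1), eta2])   # tau = (0, 0, eta2)
P = D.T @ Ups @ D

L = np.linalg.cholesky(Omega)
X = tau + (L @ rng.standard_normal((2 * d1 + d2, 200_000))).T    # xi + tau
quad = np.einsum('ij,jk,ik->i', X, P, X)
w_bar = np.minimum(1.0, gamma / quad)            # limiting weight, Lemma C.6

Psi_cut = X @ G_cut.T
Psi_smp = Psi_cut - w_bar[:, None] * (X @ D.T)
risk = lambda Psi: np.einsum('ij,jk,ik->i', Psi, Ups, Psi).mean()
print("cut risk:", risk(Psi_cut), "  SMP risk:", risk(Psi_smp))

Under these assumed choices, the printed SMP risk should fall below the cut risk by roughly the margin in (16).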

Theorem C.3 yields the following corollary.

Corollary C.2.

Suppose that Assumptions 2–3 and the regularity conditions in Assumptions C1–C2 are satisfied. If \text{\rm tr}\Upsilon\mathcal{M}>2\|\Upsilon\mathcal{M}\|, and 0<\gamma<2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|), then for any \eta such that \|\eta\|<\infty,

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\frac{\gamma\{2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma\}}{\text{\rm tr}{\Upsilon}\mathcal{M}+\tau^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\tau}<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}). (17)

Furthermore, if \tau=(0^{\top},\eta^{\top})^{\top} is such that \tau^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}{\Upsilon}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\tau\geq\text{\rm tr}\Upsilon\mathcal{M}, then

\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}).
Proof of Corollary C.2.

Since \xi is Gaussian with mean zero and variance \Omega,

\mathbb{E}[(\xi+\tau)^{\top}P(\xi+\tau)]=\tau^{\top}P\tau+\text{\rm tr}P\Omega. (18)

From equation (16) in the proof of Theorem C.3,

\displaystyle\mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\mathbb{E}\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{\mathbb{E}[(\xi+\tau)^{\top}P(\xi+\tau)]}
\displaystyle\leq\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\gamma\frac{[2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|)-\gamma]}{\tau^{\top}P\tau+\text{\rm tr}P\Omega}
\displaystyle<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}),

where the second inequality follows by Jensen's inequality, the third by substituting the moment of the quadratic form in equation (18), and the final (strict) inequality follows since 0<\gamma<2(\text{\rm tr}\Upsilon\mathcal{M}-2\|\Upsilon\mathcal{M}\|).

Since, under our hypotheses, \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0})<\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}), to prove the second part of the result we need only prove that

\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})\geq 0.

Taking \omega=0 and \omega=1 in the proof of Theorem C.3, we see that

\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})=\mathbb{E}\{(\xi+\tau)^{\top}\Gamma_{\mathrm{cut}}^{\top}\Upsilon\Gamma_{\mathrm{cut}}(\xi+\tau)\}=\tau^{\top}\Gamma_{\mathrm{cut}}^{\top}{\Upsilon}\Gamma_{\mathrm{cut}}\tau+\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}
\displaystyle=\|\mathcal{I}_{22}^{-1}\eta_{2}\|^{2}_{\Upsilon_{22}}+\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top};
\displaystyle\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\mathbb{E}\{(\xi+\tau)^{\top}\Gamma_{\mathrm{full}}^{\top}\Upsilon\Gamma_{\mathrm{full}}(\xi+\tau)\}=\tau^{\top}\Gamma_{\mathrm{full}}^{\top}{\Upsilon}\Gamma_{\mathrm{full}}\tau+\text{\rm tr}\Upsilon\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}
\displaystyle=\|\mathcal{I}^{-1}\eta\|^{2}_{\Upsilon}+\text{\rm tr}\Upsilon\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}.

Consequently,

\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}-\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}+\text{\rm tr}\Upsilon\{\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}-\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}\}.

From the definitions of \Gamma_{\mathrm{cut}},\Gamma_{\mathrm{full}}, \Omega, and \mathcal{M} in (12),

\displaystyle\Gamma_{\mathrm{cut}}\Omega\Gamma_{\mathrm{cut}}^{\top}-\Gamma_{\mathrm{full}}\Omega\Gamma_{\mathrm{full}}^{\top}
\displaystyle=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}-\begin{pmatrix}\mathcal{I}_{11.2}^{-1}&-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}
\displaystyle=\mathcal{M}.

Hence, \mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})=\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}-\|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}+\text{\rm tr}\Upsilon\mathcal{M}, and

\mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0})-\mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0})\leq 0,

when \|\mathcal{I}^{-1}\eta\|_{\Upsilon}^{2}\geq\text{\rm tr}\Upsilon\mathcal{M}+\|\mathcal{I}_{22}^{-1}\eta_{2}\|_{\Upsilon_{22}}^{2}, as maintained in the stated result. ∎

Proof of Theorem 1 (main text).

Theorem 1 is a direct consequence of Corollary C.2. To see this, note that Corollary C.2 is true pointwise for any \eta such that \|\eta\|<\infty. Note also that \mathrm{R}_{q}(\pi_{\mathrm{cut}},\theta_{0}) and \mathrm{R}_{q}(\pi_{\mathrm{full}},\theta_{0}) are convex in \eta, and that the RHS of the bound for \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}) in equation (17) is also convex as a function of \eta. Hence, Corollary C.2 holds uniformly, which verifies Theorem 1. ∎

Proof of Theorem 2.

Recall the expanded expression for \mathrm{R}_{q}(\pi_{\widehat{\omega}_{+}},\theta_{0}) and equation (13), both derived in the proof of Theorem C.3:

\mathbb{E}\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon\Gamma_{\mathrm{cut}}(\xi+\tau)}{(\xi+\tau)^{\top}P(\xi+\tau)}=\mathbb{E}\frac{\text{\rm tr}K}{(\xi+\tau)^{\top}P(\xi+\tau)}-2\mathbb{E}\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}},

where K=(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\Upsilon\Gamma_{\mathrm{cut}}\Omega. Under our choice of loss, we can show that

\text{\rm tr}K=\text{\rm tr}\Upsilon\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\text{\rm tr}\Upsilon\mathcal{M}=d.

For the second term in (13),

\displaystyle\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K=(\xi+\tau)^{\top}KP(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)
\displaystyle=(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau).

Applying the above, the second term in (13) becomes

\displaystyle\frac{\text{\rm tr}P(\xi+\tau)(\xi+\tau)^{\top}K}{[(\xi+\tau)^{\top}P(\xi+\tau)]^{2}}=\frac{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}{\{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)\}^{2}}
\displaystyle=\frac{1}{(\xi+\tau)^{\top}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\mathcal{M}^{-1}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)}. (19)

From Lemma C.5, \text{Cov}[(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})(\xi+\tau)]=\mathcal{M}. Hence, the reciprocal of the random variable in (19), i.e., the quadratic form itself, is distributed as non-central chi-squared with \kappa=d degrees of freedom and non-centrality parameter \lambda=\tau^{\top}P\tau, with density function

f_{\chi}(z):=\exp(-\lambda)\sum_{j=0}^{\infty}\frac{\lambda^{j}}{j!}\frac{z^{\frac{1}{2}(\kappa+2j)-1}\exp({-\frac{1}{2}z})}{2^{\frac{1}{2}(\kappa+2j)}\Gamma[(\kappa+2j)/2]}.

To calculate \mathbb{E}[Z^{-1}], we first write

\mathbb{E}[Z^{-1}]=\int z^{-1}f_{\chi}(z)\mathrm{d}\mu(z)=\exp(-\lambda)\sum_{j=0}^{\infty}\int_{0}^{\infty}\frac{\lambda^{j}}{j!}\frac{z^{\frac{1}{2}(\kappa+2j)-2}\exp({-\frac{1}{2}z})}{2^{\frac{1}{2}(\kappa+2j)}\Gamma[(\kappa+2j)/2]}\mathrm{d}\mu(z),

and note that, since \kappa/2>1 when d>2, for all j\geq 0,

\int z^{\frac{1}{2}(\kappa+2j)-2}\exp(-z/2)\mathrm{d}\mu(z)=2^{(\kappa/2+j)-1}\Gamma[(\kappa/2+j)-1],

so that we can rewrite the expectation as

\displaystyle\mathbb{E}[Z^{-1}]=2^{-1}\frac{\Gamma(\kappa/2-1)}{\Gamma(\kappa/2)}\exp(-\lambda)\sum_{j=0}^{\infty}\frac{(\kappa/2-1)_{j}}{(\kappa/2)_{j}}\left(\frac{\lambda^{j}}{j!}\right)
\displaystyle=2^{-1}\frac{\Gamma(\kappa/2-1)}{\Gamma(\kappa/2)}\exp(-\lambda){}_{1}F_{1}(\kappa/2-1;\kappa/2;\lambda).

Taking \kappa=d then yields the result. ∎
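The closed form just derived is straightforward to sanity-check by simulation. In the sketch below, \kappa and \lambda are arbitrary assumed values; note that, with the Poisson-mixing convention used in the density above, scipy's non-centrality parameter corresponds to 2\lambda.

# Illustrative check of E[1/Z] for the non-central chi-squared variable above.
import numpy as np
from scipy.special import gamma as gfun, hyp1f1
from scipy.stats import ncx2

kappa, lam = 7, 1.3                  # requires kappa = d > 2
closed = 0.5 * gfun(kappa / 2 - 1) / gfun(kappa / 2) * np.exp(-lam) \
    * hyp1f1(kappa / 2 - 1, kappa / 2, lam)
mc = np.mean(1.0 / ncx2.rvs(df=kappa, nc=2 * lam, size=10**6, random_state=0))
print(closed, mc)                    # should agree up to Monte Carlo error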

C.4 Proofs of Preliminary Results in Section C.2

When no confusion is likely to result, quantities that are evaluated at \theta_{0} will have their dependence on \theta_{0} suppressed; i.e., we write \mathcal{I}_{p(11)}(\theta_{0})=\mathcal{I}_{p(11)}, etc.

Proof of Lemma C.1.

Note that, under the regularity conditions in Assumption C1, we can exchange the order of integration and differentiation. Furthermore, note that for \delta_{n} as in Assumption 2 and z\in\mathcal{Z}, we have that

\displaystyle\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}[\dot{\ell}(\theta_{0})\dot{\ell}(\theta_{0})^{\top}]=\lim_{n\rightarrow+\infty}\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}h(z\mid\theta_{0},0)\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}[\dot{\ell}(z_{i}\mid\theta_{0})\dot{\ell}(z_{i}\mid\theta_{0})^{\top}],

where recall that

\dot{\ell}(z_{i}\mid\theta_{0})=\partial\log f(z_{i}\mid\theta_{0})/\partial\theta,\text{ and }\ddot{\ell}(z_{i}\mid\theta_{0})=\partial^{2}\log f(z_{i}\mid\theta_{0})/\partial\theta\partial\theta^{\top}.

Similarly, from the structure of the score equations, and the information matrix equality, we have that

\displaystyle-\mathbb{E}\ddot{\ell}(z_{i}\mid\theta_{0})=\mathbb{E}[\dot{\ell}(z_{i}\mid\theta_{0})\dot{\ell}(z_{i}\mid\theta_{0})^{\top}]
\displaystyle-\mathbb{E}\begin{pmatrix}\ddot{\ell}_{p(11)}+\ddot{\ell}_{c(11)}&\ddot{\ell}_{c(12)}\\ \ddot{\ell}_{c(21)}&\ddot{\ell}_{c(22)}\end{pmatrix}=\mathbb{E}\begin{pmatrix}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}&\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{c(2)}^{\top}\\ \dot{\ell}_{c(2)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}&\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}\end{pmatrix}
\displaystyle\begin{pmatrix}\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}&\mathcal{I}_{12}\\ \mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}=\mathbb{E}\begin{pmatrix}\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}+\dot{\ell}_{c(1)}\dot{\ell}_{c(1)}^{\top}+2\dot{\ell}_{p(1)}\dot{\ell}_{c(1)}^{\top}&\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}+\dot{\ell}_{c(1)}\dot{\ell}_{c(2)}^{\top}\\ \dot{\ell}_{c(2)}\dot{\ell}_{p(1)}^{\top}+\dot{\ell}_{c(2)}\dot{\ell}_{c(1)}^{\top}&\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}\end{pmatrix}.

Analysing the above, we see that

\displaystyle\mathcal{I}_{p(11)}=-\int_{\mathcal{Z}}\frac{\partial^{2}\log f_{1}(z\mid\theta_{1})}{\partial\theta_{1}\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{1}{f_{1}(z\mid\theta_{1,0})^{2}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle\hskip 14.22636pt-\int_{\mathcal{Z}}\frac{\partial^{2}f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}\partial\theta_{1}^{\top}}f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}, (20)

where the last line follows from rewriting \frac{1}{f_{1}(z\mid\theta_{1,0})}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}=\dot{\ell}_{p(1)}(z\mid\theta_{1,0}), and from exchanging integration and differentiation to see that the second term is zero. This proves part one of the result. Repeating the argument for \mathcal{I}_{p(11)} in the case of \mathcal{I}_{c(11)} yields

\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top},

which proves the second part of the result.

Now, let us investigate the equivalence of each term. For the first term we have that

\displaystyle\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle\quad+2\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}. (21)

Applying the above equations for \mathcal{I}_{j(11)}, j\in\{p,c\}, in equation (21), we have

\displaystyle\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)}=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle=\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})^{\top}+\mathbb{E}\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}
\displaystyle\quad+2\mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{1,0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top},

which is satisfied if and only if \mathbb{E}\dot{\ell}_{p(1)}(z_{i}\mid\theta_{0})\dot{\ell}_{c(1)}(z_{i}\mid\theta_{0})^{\top}=0. Hence, we have part three of the stated result.

Lastly, we investigate the term \mathcal{I}_{21}. Again, using the definition of this term, we have

\displaystyle\mathcal{I}_{21}=-\int_{\mathcal{Z}}\frac{\partial^{2}\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{1}{f_{2}(z\mid\theta_{0})^{2}}\frac{\partial f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}\frac{\partial f_{2}(z\mid\theta_{0})}{\partial\theta_{1}^{\top}}f(z\mid\theta_{0})\mathrm{d}\mu(z)-\int_{\mathcal{Z}}\frac{\partial^{2}f_{2}(z\mid\theta_{0})}{\partial\theta_{2}\partial\theta_{1}^{\top}}f_{1}(z\mid\theta_{1,0})\mathrm{d}\mu(z)
\displaystyle=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}, (22)

where the last equality again follows from exchanging integration and differentiation in the second term, and rewriting the derivatives in the first term. Therefore, from equation (22) and the general matrix information equality, we have

\mathcal{I}_{21}=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}=\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{c(1)}(z\mid\theta_{0})^{\top}+\mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{p(1)}(z\mid\theta_{1,0})^{\top},

which is satisfied if and only if \mathbb{E}\dot{\ell}_{c(2)}(z\mid\theta_{0})\dot{\ell}_{p(1)}(z\mid\theta_{1,0})^{\top}=0, which proves the last result. ∎

Proof of Lemma C.2.

To simplify the proof, let \delta_{n}(z)=\psi^{\top}\xi(z)/\sqrt{n}. By Assumption C1, under Assumption 2, we see that \|\dot{\ell}(z\mid\theta)\|h(z\mid\theta,\delta_{n})\leq b(z)\{1+\delta_{n}(z)\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})+o(b(z)/\sqrt{n}). By Assumption 2, there exists some a_{1}(z)\geq 0 with \sup_{z\in\mathcal{Z}}a_{1}(z)<\infty such that \delta_{n}(z)\leq a_{1}(z). Hence, for each n\geq 1, h(z\mid\theta,\delta_{n})\leq\{1+a_{1}(z)/\sqrt{n}\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0}) and |\dot{\ell}(z\mid\theta)h(z\mid\theta,\delta_{n})|\leq b(z)\{1+a_{1}(z)/\sqrt{n}\}f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0}). From Assumption C1, we have that \int_{\mathcal{Z}}a_{1}(z)b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)<\infty and \int_{\mathcal{Z}}b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)<\infty. By the dominated convergence theorem, we then have that \phi(\theta,\delta_{n}) exists for each \delta_{n}\in\Delta.

To prove the second part of the result, we note that, by Assumption 2,

\displaystyle\lim_{n\rightarrow\infty}\sup_{\theta\in\tilde{\Theta}}|\phi(\theta,\delta_{n})-\phi(\theta,0)|=\lim_{n\rightarrow\infty}\frac{1}{\sqrt{n}}\sup_{\theta\in\tilde{\Theta}}\left|\int_{\mathcal{Z}}\dot{\ell}(z\mid\theta)\delta_{n}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)\right|
\displaystyle\leq\lim_{n\rightarrow\infty}\frac{1}{\sqrt{n}}\left|\int_{\mathcal{Z}}b(z)a_{1}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)+o\left\{\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}b(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)\right\}\right|
\displaystyle=o(1/\sqrt{n}), (23)

where the second line follows since, by Assumption C1, \|\dot{\ell}(z\mid\theta)\|\leq b(z).

Now, by the triangle inequality,

\displaystyle\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,0\right)\right|\leq\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|+\sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|.

From equation (23), for any \varepsilon>0, there exists an n_{1} large enough so that for all n\geq n_{1}, \sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|\leq\varepsilon/2. To handle the first term, note that by Assumption C1, for any \theta,\theta^{\prime}\in\Theta, and some \overline{\theta}\in\Theta,

\|\dot{\ell}(z\mid\theta)-\dot{\ell}(z\mid\theta^{\prime})\|\leq\|\ddot{\ell}(z\mid\overline{\theta})\|\|\theta-\theta^{\prime}\|\leq b(z)\|\theta-\theta^{\prime}\|.

Since \mathbb{E}[b(z)]<\infty, it follows directly from Theorem 21.10 of Davidson (1994) that

\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\leq\varepsilon/2

with probability converging to one. Hence, for any \varepsilon>0,

\displaystyle\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,0\right)\right|\geq\varepsilon\right]\leq\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|n^{-1}\dot{\ell}(\theta)-\phi\left(\theta,\delta_{n}\right)\right|\geq\varepsilon/2\right]
\displaystyle\quad+\lim_{n\rightarrow+\infty}\text{Pr}\left[\sup_{\theta\in\tilde{\Theta}}\left|\phi\left(\theta,\delta_{n}\right)-\phi\left(\theta,0\right)\right|\geq\varepsilon/2\right]
\displaystyle=o(1). ∎

Proof of Lemma C.3.

The result uses Lemma C.2, the nature of the model misspecification in Assumption 2, and the separable structure of the likelihood.

Result 1. Recall the expectation

\phi(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta)}{\partial\theta}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z).

From Lemma C.2, and the regularity conditions on h(z\mid\theta,\delta) in Assumption C1, \phi(\theta,\delta_{n}) exists and is continuous on \Theta\times\Delta. The second portion of Lemma C.2 implies that, for any 0<\varepsilon<+\infty,

\sup_{\theta\in\Theta(\varepsilon)}|n^{-1}\dot{\ell}(\theta)-\phi(\theta,0)|=o_{p}(1).

Define \phi_{n}:=\phi(\theta_{0},\delta_{n}) and note that, since \phi(\cdot,\delta) is continuous in \delta, as n\rightarrow\infty,

\phi_{n}\rightarrow\phi(\theta_{0},\delta_{0}).

However, since we can exchange differentiation and integration (Assumption C1), and since h(\cdot\mid\theta,\delta) is continuous in \delta, for each \theta\in\Theta, we have

\displaystyle\phi(\theta_{0},\delta_{n})=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}h(z\mid\theta_{0},\delta_{n})\mathrm{d}\mu(z)
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\{1+\delta_{n}(z)/\sqrt{n}\}f(z\mid\theta_{0})\mathrm{d}\mu(z)\{1+o(1/\sqrt{n})\}
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\delta_{n}(z)\mathrm{d}\mu(z)\{1+o(1/\sqrt{n})\}. (24)

The first term in the above is zero since

\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\int_{\mathcal{Z}}\frac{\partial}{\partial\theta}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\frac{\partial}{\partial\theta}\int_{\mathcal{Z}}f(z\mid\theta_{0})\mathrm{d}\mu(z)=0.

Considering the second term, using Assumption 2 in the main text and Assumption C1, we see that the integral portion of the second term satisfies

\left\|\int_{\mathcal{Z}}\frac{\partial\log f(z\mid\theta_{0})}{\partial\theta}\delta_{n}(z)\mathrm{d}\mu(z)\right\|\leq\int_{\mathcal{Z}}b(z)a(z)\mathrm{d}\mu(z)<\infty.

Hence, \lim_{n\rightarrow+\infty}\sqrt{n}\phi(\theta_{0},\delta_{n})=\eta, which yields the first part of the stated result.


Result 2. To deduce the second result, we need only establish a CLT for \{n^{-1/2}\dot{\ell}(\theta_{0})-\sqrt{n}\phi(\theta_{0},\delta_{n})\}. This can be established using arguments similar to those given in Lemma 2.1 of Newey (1985). In particular, let \lambda\in\mathbb{R}^{d} be a non-zero vector, with \|\lambda\|<\infty, and define Y_{n}(z):=\lambda^{\top}[\dot{\ell}(z\mid\theta_{0})-\phi_{n}] and Y_{i,n}=Y_{n}(z_{i}) (i=1,\ldots,n). For each n\geq 1, the random variable Y_{i,n} is mean-zero and has a strictly positive variance \lambda^{\top}\Omega_{n}\lambda, for all n large enough, where \Omega_{n}:=\mathbb{E}_{n}[\dot{\ell}(z\mid\theta_{0})\dot{\ell}(z\mid\theta_{0})^{\top}]. For \varepsilon>0, let A_{n}(\varepsilon)=\{z:|\dot{\ell}(z,\theta_{0})-\phi_{n}|>\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}\}. For each n, the Y_{i,n} (i=1,\ldots,n) are identically distributed, so that by Assumption C1, for any \varepsilon>0,

\displaystyle\left[1/\left(n\lambda^{\top}\Omega_{n}\lambda\right)\right]\cdot\sum_{i=1}^{n}\int_{|Y_{i,n}|\geq\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}}|Y_{i,n}|^{2}h\left(z_{i}\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu\left(z_{i}\right)
\displaystyle=\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\cdot\int_{A_{n}(\varepsilon)}\left|Y_{n}(z)\right|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\int_{A_{n}(\varepsilon)}\|\dot{\ell}(z\mid\theta_{0})-\phi_{n}\|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\left\{\|\phi_{n}\|^{2}+\int_{A_{n}(\varepsilon)}\|\dot{\ell}(z\mid\theta_{0})\|^{2}h\left(z\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu(z)\right\}
\displaystyle\leq\left(1/\lambda^{\top}\Omega_{n}\lambda\right)\|\lambda\|^{2}\left[\|\phi_{n}\|^{2}+\int_{A_{n}(\varepsilon)}b(z)a(z)\mathrm{d}\mu(z)\right].

Since h(z\mid\theta_{0},\delta_{n})\rightarrow h(z\mid\theta_{0},\delta_{0})\equiv f(z\mid\theta_{0}), by continuity \Omega_{n}\rightarrow\mathbb{E}[\dot{\ell}(z\mid\theta_{0})\dot{\ell}(z\mid\theta_{0})^{\top}] and \lim\lambda^{\top}\Omega_{n}\lambda=\lambda^{\top}\Omega\lambda>0, so that by Lemma A.1 of Newey (1985), the set A_{n}(\varepsilon)\rightarrow\emptyset as n\rightarrow+\infty. Hence, by Assumption C1, \lim_{n}\int_{A_{n}(\varepsilon)}b(z)a(z)\mathrm{d}\mu(z)=0. Lastly, since \phi_{n}\rightarrow 0, we have that

\left[1/\left(n\lambda^{\top}\Omega_{n}\lambda\right)\right]\cdot\sum_{i=1}^{n}\int_{|Y_{i,n}|\geq\varepsilon\sqrt{n\lambda^{\top}\Omega_{n}\lambda}}|Y_{i,n}|^{2}h\left(z_{i}\mid\theta_{0},\delta_{n}\right)\mathrm{d}\mu\left(z_{i}\right)=o(1),

which implies that the Lindeberg–Feller condition for the central limit theorem is satisfied, so that by the Cramér–Wold device,

\{n^{-1/2}\dot{\ell}\left(\theta_{0}\right)-\sqrt{n}\phi_{n}\}\Rightarrow N(0,\Omega),

and the second part of the stated result is satisfied.


Result 3. First, define

\phi_{1}(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1})}{\partial\theta_{1}}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z).

Applying the definition of h(z\mid\theta,\delta_{n}) in Assumption 2 yields, up to an o(1/\sqrt{n}) term,

\displaystyle\phi_{1}(\theta,\delta_{n})=\phi_{1}(\theta,\delta_{0})+\partial\phi_{1}(\theta_{0},\bar{\delta})/\partial\delta
\displaystyle=\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\delta_{n}(z)f(z\mid\theta_{0})\mathrm{d}\mu(z)\frac{1}{\sqrt{n}}. (25)

For the first term in (25), we have

\displaystyle\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f(z\mid\theta_{0})\mathrm{d}\mu(z)=\int_{\mathcal{Z}}\frac{\partial f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=\frac{\partial}{\partial\theta_{1}}\int f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)
\displaystyle=0.

By Assumption 2(iii), the second term in (25) satisfies

\frac{1}{\sqrt{n}}\int_{\mathcal{Z}}\frac{\partial\log f_{1}(z\mid\theta_{1,0})}{\partial\theta_{1}}\delta_{n}(z)f_{1}(z\mid\theta_{1,0})f_{2}(z\mid\theta_{0})\mathrm{d}\mu(z)=0.

Repeating the same arguments as in the proof of Result 2, but only for \dot{\ell}_{p}(z\mid\theta_{1})=\partial\log f_{1}(z\mid\theta_{1})/\partial\theta_{1}, we have that

n^{-1/2}\dot{\ell}_{p}\left(\theta_{1,0}\right)\Rightarrow N\{0,\mathbb{E}(\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top})\}\equiv N(0,\mathcal{I}_{p(11)}).

Result 4. Define

\phi_{2}(\theta,\delta_{n}):=\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta)}{\partial\theta_{2}}h(z\mid\theta,\delta_{n})\mathrm{d}\mu(z)

and apply the definition of h(z\mid\theta,\delta_{n}) in Assumption 2 to see that, up to an o(1/\sqrt{n}) term,

\phi_{2}(\theta,\delta_{n})=\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}f(z\mid\theta_{0})\mathrm{d}\mu(z)+\int_{\mathcal{Z}}\frac{\partial\log f_{2}(z\mid\theta_{0})}{\partial\theta_{2}}\delta_{n}(z)f(z\mid\theta_{0})\mathrm{d}\mu(z)\frac{1}{\sqrt{n}}.

Similar arguments to those used to prove Result 3 yield

\phi_{2}(\theta_{0},\delta_{n})=0+\eta/\sqrt{n}.

The distributional result then follows similarly to Result 2. ∎

Proof of Lemma C.5.

We prove each result in turn.


Result 1. Write \tau_{n}=(\mathbf{0}_{d_{1}\times 1}^{\top},\phi_{n}^{\top})^{\top}, where \phi_{n}=\phi(\theta_{0},\delta_{n}) was defined in the proof of Lemma C.3, and recall that \tau=(\mathbf{0}_{d_{1}\times 1}^{\top},\eta^{\top})^{\top}. The first and second results in Lemma C.3 together imply that

\{Z_{n}-\sqrt{n}\tau_{n}\}\Rightarrow N(0,\Omega),

where

\displaystyle\Omega:=\lim_{n\rightarrow+\infty}\mathbb{E}[Z_{n}Z_{n}^{\top}]
\displaystyle=\begin{pmatrix}\mathbb{E}[\dot{\ell}_{p(1)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\dot{\ell}_{p(1)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}]\\ \mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}\dot{\ell}_{c(2)}^{\top}]\\ \mathbb{E}[\dot{\ell}_{c(2)}\dot{\ell}_{p(1)}^{\top}]&\mathbb{E}[\dot{\ell}_{c(2)}\{\dot{\ell}_{p(1)}+\dot{\ell}_{c(1)}\}^{\top}]&\mathbb{E}[\dot{\ell}_{c(2)}\dot{\ell}_{c(2)}^{\top}]\end{pmatrix}.

However, from Lemma C.1, we know that

\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}\left[\dot{\ell}_{p(1)}\dot{\ell}_{c(1)}^{\top}\right]=0,\text{ and }\lim_{n\rightarrow+\infty}n^{-1}\mathbb{E}_{n}\left[\dot{\ell}_{p(1)}\dot{\ell}_{c(2)}^{\top}\right]=0.

Applying these relationships, we then have the general form of \Omega as stated in the result. Since \sqrt{n}\tau_{n}\rightarrow\tau, the first stated result follows.

The second part of the result follows by applying the results of Lemma C.1 to see that we can rewrite \Omega as

\Omega=\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&0\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ 0&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix},

where

\mathcal{I}_{11}=\mathcal{I}_{p(11)}+\mathcal{I}_{c(11)},\quad\mathcal{I}_{22}=\mathcal{I}_{c(22)},\quad\mathcal{I}_{21}=\mathcal{I}_{c(21)}.
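This block form of \Omega can be checked empirically. The sketch below (continuing the toy construction and the names Ip, Ic, I11, I12, I21, I22, Z0, and rng from the earlier illustrative sketches, which remain assumptions) simulates score-like draws with the orthogonality of Lemma C.1 and compares the sample covariance of Z_{n}-type vectors with the block matrix above.

# Empirical check (illustrative) of the block structure of Omega: draw
# u ~ score of the theta_1 module and (v, w) ~ cut-module scores, with u
# independent of (v, w), then Z = (u, u + v, w) has covariance Omega.
Lp = np.linalg.cholesky(Ip)
Lc = np.linalg.cholesky(Ic)
u = (Lp @ rng.standard_normal((d1, 100_000))).T
vw = (Lc @ rng.standard_normal((d, 100_000))).T
Z = np.hstack([u, u + vw[:, :d1], vw[:, d1:]])
emp = Z.T @ Z / len(Z)
ref = np.block([[Ip, Ip, Z0], [Ip, I11, I12], [Z0.T, I21, I22]])
assert np.allclose(emp, ref, atol=0.5)   # loose tolerance for Monte Carlo error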

Result 2. Define

V_{22}:=\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1},\quad\mathcal{I}_{22.1}^{-1}:=\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1},

and recall the definitions of \Gamma_{\mathrm{cut}} and \Gamma_{\mathrm{full}} to see that

\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}}=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}.

Firstly,

\displaystyle(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&0\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ 0&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}
\displaystyle=\begin{pmatrix}I_{d_{1}}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&I-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}_{d_{1}\times d_{2}}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+V_{22}\mathcal{I}_{21}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}+{V_{22}}\mathcal{I}_{22}\end{pmatrix}.

Therefore,

\displaystyle(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=
\displaystyle\begin{pmatrix}I_{d_{1}}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&I-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}_{d_{1}\times d_{2}}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{p(11)}&-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}+V_{22}\mathcal{I}_{21}&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}+{V_{22}}\mathcal{I}_{22}\end{pmatrix}\times
\displaystyle\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ \mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}. (26)

To derive the stated result, we analyze each of the corresponding blocks in the d\times d matrix.

(d_{1}\times d_{1})-Block. Let \mathcal{M}_{11} denote the (d_{1}\times d_{1})-block of \mathcal{M}. Multiplying the first row and first column of (26), we obtain

\displaystyle\mathcal{M}_{11}=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}
\displaystyle=W-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}\underbrace{(\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21})}_{=\mathcal{I}_{11.2}}\mathcal{I}_{11.2}^{-1}
\displaystyle=W-\mathcal{I}_{11.2}^{-1}+\mathcal{I}_{11.2}^{-1}
\displaystyle=W,

where we recall that W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}.

(d_{1}\times d_{2})-Block. Let \mathcal{M}_{12} denote the (d_{1}\times d_{2})-block of \mathcal{M}. Multiplying the first row and second column of (26), we obtain

\displaystyle\mathcal{M}_{12}=-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{11.2}^{-1}(\mathcal{I}_{11}-\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21})\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}.

(d_{2}\times d_{2})-Block. Let \mathcal{M}_{22} denote the (d_{2}\times d_{2})-block of \mathcal{M}. Multiplying the second row and second column of (26), and using the fact that \mathcal{I}_{22.1}^{-1}=\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}, so that V_{22}=\mathcal{I}_{22}^{-1}-\mathcal{I}_{22.1}^{-1}=-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}, we obtain

\displaystyle\mathcal{M}_{22}=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+V_{22}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{11}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle\quad+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}V_{22}+V_{22}\mathcal{I}_{22}V_{22}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}
\displaystyle=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}.

Substituting the blocks $\mathcal{M}_{11},\mathcal{M}_{12},\mathcal{M}_{22}$ into equation (26) yields

\[
(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}=\mathcal{M}.
\]

Result 3. Arguing similarly to Result 2,

\begin{align*}
\Gamma_{\mathrm{cut}}\Omega(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top} &=\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&\mathbf{O}&\mathbf{O}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}&\mathbf{O}&\mathcal{I}_{22}^{-1}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}&\mathcal{I}_{p(11)}&\mathbf{O}\\ \mathcal{I}_{p(11)}&\mathcal{I}_{11}&\mathcal{I}_{12}\\ \mathbf{O}&\mathcal{I}_{21}&\mathcal{I}_{22}\end{pmatrix}(\Gamma_{\mathrm{cut}}-\Gamma_{\mathrm{full}})^{\top}\\
&=\begin{pmatrix}I_{d_{1}}&I_{d_{1}}&\mathbf{O}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}&\mathbf{O}&I_{d_{2}}\end{pmatrix}\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}&-\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{11.2}^{-1}&\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ \mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}&V_{22}\end{pmatrix}\\
&=\begin{pmatrix}W&-W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\\ -\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W&\mathcal{I}_{22}^{-1}\mathcal{I}_{21}W\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\end{pmatrix}\\
&=\mathcal{M}.
\end{align*}
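The block identities in Results 2 and 3 are purely algebraic, so they can be checked numerically. The following minimal sketch assumes only that $\mathcal{I}$ and $\mathcal{I}_{p(11)}$ are symmetric positive definite and takes $W=\mathcal{I}_{p(11)}^{-1}-\mathcal{I}_{11.2}^{-1}$, the form consistent with the cancellations above; the variable names and random inputs are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 2, 3

def spd(d):
    # random symmetric positive-definite matrix
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

# full information matrix I (partitioned) and prior-module block I_{p(11)}
I = spd(d1 + d2)
I11, I12 = I[:d1, :d1], I[:d1, d1:]
I21, I22 = I[d1:, :d1], I[d1:, d1:]
Ip11 = spd(d1)

iI22 = np.linalg.inv(I22)
I11_2 = I11 - I12 @ iI22 @ I21                  # I_{11.2}
A, B = np.linalg.inv(Ip11), np.linalg.inv(I11_2)
W = A - B                                       # assumed form of W
V22 = -iI22 @ I21 @ B @ I12 @ iI22              # V_{22}

Z = np.zeros
Gamma_cut = np.block([[A,               Z((d1, d1)), Z((d1, d2))],
                      [-iI22 @ I21 @ A, Z((d2, d1)), iI22]])
# Gamma_cut - Gamma_full, read off from the second display in Result 3
D = np.block([[A,               -B,             B @ I12 @ iI22],
              [-iI22 @ I21 @ A, iI22 @ I21 @ B, V22]])
Omega = np.block([[Ip11,        Ip11, Z((d1, d2))],
                  [Ip11,        I11,  I12],
                  [Z((d2, d1)), I21,  I22]])
M = np.block([[W,               -W @ I12 @ iI22],
              [-iI22 @ I21 @ W, iI22 @ I21 @ W @ I12 @ iI22]])

assert np.allclose(D @ Omega @ D.T, M)           # Result 2
assert np.allclose(Gamma_cut @ Omega @ D.T, M)   # Result 3
print("block identities verified")
```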

Proof of Theorem C.1.

Recall $\overline{\theta}_{\mathrm{full}}:=\int_{\Theta}\theta\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\,\mathrm{d}\theta$, and decompose

\begin{align*}
\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0}) &=\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0}-T_{n,\mathrm{full}})+\sqrt{n}T_{n,\mathrm{full}}\\
&=\int_{\Theta}\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}})\,\pi_{\mathrm{full}}(\theta\mid z_{1:n})\,\mathrm{d}\theta+\sqrt{n}T_{n,\mathrm{full}}\\
&=\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t+\sqrt{n}T_{n,\mathrm{full}},
\end{align*}

where $\mathcal{T}:=\{t=\sqrt{n}(\theta-\theta_{0}-T_{n,\mathrm{full}}):\theta\in\Theta\}$ and

\[
T_{n,\mathrm{full}}=\Gamma_{\mathrm{full}}Z_{n}/\sqrt{n}=\mathcal{I}^{-1}\dot{\ell}(\theta_{0})/n.
\]

From Lemma C.3,

\[
\sqrt{n}T_{n,\mathrm{full}}\Rightarrow N(\mathcal{I}^{-1}\eta,\mathcal{I}^{-1}).
\]

Hence, the stated result follows if $\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t=o_{p}(1)$.

To show this, we first note that

\[
\int_{\mathcal{T}}t\,\pi_{\mathrm{full}}(t\mid z_{1:n})\,\mathrm{d}t=\int_{\mathcal{T}}t[\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}]\,\mathrm{d}t,
\]

since $0=\int_{\mathcal{T}}tN\{t;0,\mathcal{I}^{-1}\}\,\mathrm{d}t$. We then have

\[
\left\|\int_{\mathcal{T}}t[\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}]\,\mathrm{d}t\right\|\leq\int_{\mathcal{T}}\|t\|\,|\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}|\,\mathrm{d}t.
\]

The regularity conditions in Assumptions C1-C2 are sufficient to apply Theorem 8.2 of Lehmann and Casella (2006) and deduce that

\[
\int_{\mathcal{T}}\|t\|\,|\pi_{\mathrm{full}}(t\mid z_{1:n})-N\{t;0,\mathcal{I}^{-1}\}|\,\mathrm{d}t=o_{p}(1).
\]
∎

Proof of Theorem C.2.

The proof follows similar arguments to the proof of Theorem C.1 if we set

\[
T_{n,\mathrm{cut}}=\begin{pmatrix}T_{1,n,\mathrm{cut}}\\ T_{2,n,\mathrm{cut}}\end{pmatrix}=\begin{pmatrix}\Gamma_{1,\mathrm{cut}}Z_{n}/\sqrt{n}\\ \Gamma_{2,\mathrm{cut}}Z_{n}/\sqrt{n}\end{pmatrix}\equiv\begin{pmatrix}\mathcal{I}_{p(11)}^{-1}\dot{\ell}_{p(1)}(\theta_{1,0})/n\\ \mathcal{I}_{22}^{-1}\{\dot{\ell}_{c(2)}(\theta_{0})/n-\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\dot{\ell}_{p(1)}(\theta_{1,0})/n\}\end{pmatrix}.
\]

In particular, if we replace the full posterior in the proof of Theorem C.1 with the cut posterior $\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})$, and the full posterior mean $\overline{\theta}_{\mathrm{full}}$ with the cut posterior mean $\overline{\theta}_{1,\mathrm{cut}}=\int_{\Theta_{1}}\theta_{1}\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\mathrm{d}\theta_{1}$, we have that, for $\mathcal{V}_{1}:=\{\vartheta_{1}=\sqrt{n}(\theta_{1}-\theta_{1,0}-T_{1,n,\mathrm{cut}}):\theta_{1}\in\Theta_{1}\}$,

\[
\int_{\mathcal{V}_{1}}\|\vartheta_{1}\|\,|\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})-N\{\vartheta_{1};0,\mathcal{I}_{p(11)}^{-1}\}|\,\mathrm{d}\vartheta_{1}=o_{p}(1).
\]

Using the above results and the following decomposition

\begin{align*}
\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0}) &=\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0}-T_{1,n,\mathrm{cut}})+\sqrt{n}T_{1,n,\mathrm{cut}}\\
&=\int_{\Theta_{1}}\sqrt{n}(\theta_{1}-\theta_{1,0}-T_{1,n,\mathrm{cut}})\,\pi_{\mathrm{cut}}(\theta_{1}\mid z_{1:n})\,\mathrm{d}\theta_{1}+\sqrt{n}T_{1,n,\mathrm{cut}}\\
&=\int_{\mathcal{V}_{1}}\vartheta_{1}\,\pi_{\mathrm{cut}}(\vartheta_{1}\mid z_{1:n})\,\mathrm{d}\vartheta_{1}+\sqrt{n}T_{1,n,\mathrm{cut}},
\end{align*}

we have that

\[
\sqrt{n}(\overline{\theta}_{1,\mathrm{cut}}-\theta_{1,0})=\sqrt{n}T_{1,n,\mathrm{cut}}+o_{p}(1).
\]

Appealing to Lemma C.3 we have that

\[
\sqrt{n}T_{1,n,\mathrm{cut}}\Rightarrow N(0,\mathcal{I}_{p(11)}^{-1}),
\]

which yields the stated results for the θ1\theta_{1} dimension.

To prove the results for the θ2\theta_{2} dimension, first write

\begin{align*}
\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0}) &=\sqrt{n}(\overline{\theta}_{2,\mathrm{cut}}-\theta_{2,0}-T_{2,n,\mathrm{cut}})+\sqrt{n}T_{2,n,\mathrm{cut}}\\
&=\int_{\Theta_{2}}\sqrt{n}(\theta_{2}-\theta_{2,0}-T_{2,n,\mathrm{cut}})\,\pi_{\mathrm{cut}}(\theta_{2}\mid z_{1:n})\,\mathrm{d}\theta_{2}+\sqrt{n}T_{2,n,\mathrm{cut}}\\
&=\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}+\sqrt{n}T_{2,n,\mathrm{cut}},
\end{align*}

where $\mathcal{V}_{2}:=\{\vartheta_{2}=\sqrt{n}(\theta_{2}-\theta_{2,0}-T_{2,n,\mathrm{cut}}):\theta_{2}\in\Theta_{2}\}$. From Lemma C.3, the second term satisfies

\[
\sqrt{n}T_{2,n,\mathrm{cut}}\Rightarrow N(\mathcal{I}_{22}^{-1}\eta_{2},\,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}),
\]

and the result then follows if $\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}=o_{p}(1)$. Arguing as in the proof of Theorem C.1,

\begin{align*}
\left\|\int_{\mathcal{V}_{2}}\vartheta_{2}\,\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})\,\mathrm{d}\vartheta_{2}\right\|\leq{} &\int_{\mathcal{V}_{2}}\|\vartheta_{2}\|\,|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\,\mathrm{d}\vartheta_{2}\\
&+\left\|\int_{\mathcal{V}_{2}}\vartheta_{2}\,N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})\,\mathrm{d}\vartheta_{2}\right\|,
\end{align*}

where the second term vanishes by the same argument used for $\int_{\mathcal{T}}tN\{t;0,\mathcal{I}^{-1}\}\,\mathrm{d}t$ in the proof of Theorem C.1. The regularity conditions in Assumptions C1-C2 are sufficient to apply Corollary 1 of Frazier and Nott (2024) to deduce that

\[
\int_{\mathcal{V}_{2}}\|\vartheta_{2}\|\,|\pi_{\mathrm{cut}}(\vartheta_{2}\mid z_{1:n})-N(\vartheta_{2};0,\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{p(11)}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1})|\,\mathrm{d}\vartheta_{2}=o_{p}(1).
\]
∎

Proof of Corollary C.1.

For $j=1$, we need only note that the cut posterior mean exhibits no asymptotic bias, so the result is immediate.

For $j=2$, recall that by Theorem C.2, the bias of the cut posterior mean is given by $\mathcal{I}_{22}^{-1}\eta_{2}$. By Theorem C.1, the full posterior for $\theta_{2}$ exhibits the asymptotic bias

\[
\left(\mathcal{I}_{22}^{-1}+\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\right)\eta_{2}-\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1}.
\]

The squared difference between the biases for $\theta_{2,0}$ under the cut and full posteriors is then positive so long as

\begin{align*}
&\eta_{2}^{\top}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2}+\eta_{1}^{\top}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1}\\
&\qquad>2\eta_{1}^{\top}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2}.
\end{align*}

Writing

\[
X=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\mathcal{I}_{12}\mathcal{I}_{22}^{-1}\eta_{2},\qquad Y=\mathcal{I}_{22}^{-1}\mathcal{I}_{21}\mathcal{I}_{11.2}^{-1}\eta_{1},
\]

we can rewrite the above restriction as

\[
X^{\top}X+Y^{\top}Y-2X^{\top}Y=\|X-Y\|^{2}\geq 0,
\]

which always holds. ∎

Proof of Lemma C.6.

First, note that in the proof of Theorem C.1 we have shown that

\[
\|\sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})-\Gamma_{\mathrm{full}}Z_{n}\|=o_{p}(1),
\]

and that, by Theorem C.2,

\[
\|\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})-\Gamma_{\mathrm{cut}}Z_{n}\|=o_{p}(1).
\]

Using these results and Lemmas C.1-C.5, it follows that

\[
\begin{pmatrix}\sqrt{n}(\overline{\theta}_{\mathrm{cut}}-\theta_{0})\\ \sqrt{n}(\overline{\theta}_{\mathrm{full}}-\theta_{0})\end{pmatrix}=\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}Z_{n}+o_{p}(1)\Rightarrow\begin{pmatrix}\Gamma_{\mathrm{cut}}\\ \Gamma_{\mathrm{full}}\end{pmatrix}(\xi+\tau).
\]

The continuous mapping theorem then yields the stated results. ∎

Appendix D HPV prevalence and cancer incidence

We return to the example in Section 2.1 of the main text, concerning the relationship between HPV prevalence and cervical cancer incidence. We use it to demonstrate how the semi-modular posterior can vary for different loss functions. In particular, we consider an S-SMP for a loss function targeting the HPV prevalence in country $j$, $\theta_{1,j}$, and show how the S-SMP mixing weight and semi-modular inferences vary across $j$. We use the mixing weight

\[
\widehat{\omega}_{+}=\min\{1,\widehat{\omega}\},\qquad \widehat{\omega}=\frac{{\sigma_{\text{cut}}^{(j)}}^{2}-{\sigma^{(j)}}^{2}}{(\bar{\theta}_{1,\text{cut}}^{(j)}-\bar{\theta}_{1}^{(j)})^{2}}\,\mathbb{I}({\sigma_{\text{cut}}^{(j)}}^{2}-{\sigma^{(j)}}^{2}>0),
\]

where ${\sigma_{\text{cut}}^{(j)}}^{2}$ and ${\sigma^{(j)}}^{2}$ are the cut and full marginal posterior variances for $\theta_{1,j}$, and $\bar{\theta}_{1,\text{cut}}^{(j)}$ and $\bar{\theta}_{1}^{(j)}$ are the cut and full marginal posterior means. Figure 8 shows kernel density estimates of the marginal SMP densities for each $\theta_{1,j}$, together with the SMP mixing weights $\widehat{\omega}_{+}$ and the cut and full posterior $\theta_{1,j}$ marginals. The countries are ordered in the plot (left to right, top to bottom) according to the cut posterior mean values. Interestingly, the pooling weights vary widely according to which country is under analysis; when the difference in location of the cut and full posterior distributions is large compared to the posterior variability, the SMP is close to the cut posterior, whereas the SMP is closer to the full posterior otherwise.
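For concreteness, the sketch below shows one way $\widehat{\omega}_{+}$ and draws from the corresponding SMP could be computed from marginal posterior samples. It assumes the S-SMP is the two-component mixture placing weight $\widehat{\omega}_{+}$ on the full posterior and $1-\widehat{\omega}_{+}$ on the cut posterior, consistent with the limiting behavior described in the next paragraph; the function names are ours and not part of any released code.

```python
import numpy as np

def smp_weight(theta_cut, theta_full):
    """S-SMP mixing weight from marginal posterior draws for a scalar
    functional, following the displayed formula for omega-hat."""
    var_cut, var_full = np.var(theta_cut), np.var(theta_full)
    if var_cut - var_full <= 0:
        return 0.0                      # indicator term in the formula
    gap2 = (np.mean(theta_cut) - np.mean(theta_full)) ** 2
    if gap2 == 0.0:
        return 1.0                      # omega-hat is +inf; truncate at one
    return min(1.0, (var_cut - var_full) / gap2)

def smp_sample(theta_cut, theta_full, size, rng=None):
    """Draws from the assumed two-component SMP mixture."""
    if rng is None:
        rng = np.random.default_rng()
    w = smp_weight(theta_cut, theta_full)
    use_full = rng.random(size) < w     # weight w on the full posterior
    return np.where(use_full,
                    rng.choice(theta_full, size),
                    rng.choice(theta_cut, size))
```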

The results in Figure 8 give novel insight into the impact of model misspecification in this example. While it is known that the Poisson assumption is inadequate, there is substantial variability in the adequacy of this assumption for the purpose of inference in specific countries: for certain countries we obtain a pooling weight of unity, so that the SMP corresponds to the full posterior, while for other countries the pooling weight is close to zero, so that the SMP corresponds to the cut posterior. In view of Corollary 1, when the model is grossly misspecified the SMP produces a pooling weight that is asymptotically zero. Hence, the empirical results in Figure 8 demonstrate that there are certain countries for which the Poisson model is adequate and others for which it is not. Critically, however, the SMP we have proposed does not require us to choose a priori which posterior to use for which country; instead, the SMP lets the observed data determine which posterior is better supported. (MCMC computations for the full posterior distribution were done using Stan (Carpenter et al., 2017), whereas cut posterior samples for $\theta_{1}$ given $X$ are generated directly using conjugacy, followed by sampling importance resampling based on 1,000 draws from an asymptotic normal approximation to the conditional posterior of $\theta_{2}$ given $\theta_{1}$ and $Y$, to obtain each $\theta_{2}$ sample. Kernel density estimates in the plots are based on 1,000 posterior samples.)
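The sampling importance resampling (SIR) step mentioned in the parenthetical note can be sketched generically as follows. Here `log_post`, `mu`, and `Sigma` are placeholders for the (unnormalized) log conditional posterior of $\theta_{2}$ given $\theta_{1}$ and $Y$ and its asymptotic normal approximation; none of this is taken from the authors' code.

```python
import numpy as np

def sir_draw(log_post, mu, Sigma, n_prop=1000, rng=None):
    """One SIR draw: propose from N(mu, Sigma), weight by target/proposal,
    then resample a single proposal with probability proportional to weight."""
    if rng is None:
        rng = np.random.default_rng()
    props = rng.multivariate_normal(mu, Sigma, size=n_prop)
    diff = props - mu
    # proposal log-density, up to an additive constant
    log_q = -0.5 * np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    log_w = np.array([log_post(p) for p in props]) - log_q
    w = np.exp(log_w - log_w.max())     # stabilize before normalizing
    return props[rng.choice(n_prop, p=w / w.sum())]
```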


Figure 8: Marginal cut, full and semi-modular posterior (SMP) densities for HPV prevalence in different countries. The SMP obtained for country $j$ uses a loss function depending on $j$ as described in the text. $\widehat{\omega}_{+}$ is the estimated mixing weight in the SMP. Countries are ordered (left to right, top to bottom) according to the cut posterior mean values.

Appendix E Additional Results for Biased Means Example

In this section, we present the repeated sampling behavior of the cut, full and S-SMP posteriors in the simulation design presented in Section 3.2 of the main text, together with additional results on the behavior of the S-SMP for further choices of $d_{1}$. Please refer to the main text for full details of the simulation design.

Table 2 compares the behavior of the posterior mean for the three methods across different values of $\delta$ in the case where $d_{1}=1$. Similar results for the case $d_{1}=5$ are reported in Table 3. In each table we report the bias of the posterior mean (Bias), the average posterior variance across the replications (Var), and the Monte Carlo coverage of the 95% posterior credible set (Cov).
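As a reference point, the three reported summaries can be computed from the replication output as in the short sketch below; the array names are hypothetical, with each replication assumed to supply a posterior mean, a posterior variance, and the endpoints of a 95% credible interval.

```python
import numpy as np

def table_summaries(post_means, post_vars, ci_lo, ci_hi, theta_true):
    """Bias, Var and Cov columns of Tables 2-3 for a scalar parameter,
    from arrays of length R (one entry per Monte Carlo replication)."""
    bias = np.mean(post_means) - theta_true        # bias of the posterior mean
    avg_var = np.mean(post_vars)                   # average posterior variance
    cov = np.mean((ci_lo <= theta_true) & (theta_true <= ci_hi))  # MC coverage
    return bias, avg_var, cov
```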

Regarding the results for $d_{1}=1$ and $d_{1}=5$, we see that in both cases the S-SMP accepts some bias, relative to the cut posterior, in exchange for a significant reduction in variability. In both cases, the average cut posterior variance is about 0.01 across all values of $\delta$, while the average S-SMP variance ranges from about 0.003 to 0.005, which is significantly smaller than that of the cut posterior. Further, the bias exhibited by the S-SMP is generally much smaller than that exhibited by the full posterior, but is, by construction, larger than that exhibited by the cut posterior.

Recall that, as discussed in the remarks following the statement of Lemma 2 in the main text, under model misspecification only the cut posterior for $\theta_{1}$ can deliver credible sets that are also valid frequentist confidence sets. Analyzing the coverage of the posteriors across the two designs, we see that this is indeed the case, with only the cut posterior delivering accurate coverage. In comparison, in both cases the full posterior has coverage that is zero or nearly zero once $\delta>0.10$. The S-SMP fares better than the full posterior, but can deliver unreliable coverage when $d_{1}=1$. In contrast, as the dimension of $\theta_{1}$ increases to $d_{1}=5$, the coverage of the S-SMP improves dramatically and is only slightly below the nominal level for most values of $\delta$. While not reported for the sake of brevity, extending these experiments to larger values of $d_{1}$ shows that the coverage of the S-SMP improves as $d_{1}$ increases, so that when $d_{1}=25$ the coverage of the S-SMP is close to the nominal level across all values of $\delta$ used in the experiment.

Lastly, we demonstrate the impact of increasing $d_{1}$ on the accuracy of the S-SMP, as measured by expected squared error. To this end, we present the expected squared error of the cut, full and S-SMP posteriors for $d_{1}=1,5,10,25$; these results are plotted in Figure 9. Analyzing these results, we see that as $d_{1}$ increases, and for $\delta$ small, the S-SMP resembles the full posterior, but as $\delta$ increases the S-SMP more closely resembles the cut posterior. This is a consequence of the behavior of the pooling weight used in the S-SMP, which shifts more easily between the cut and full posteriors when $d>2$; we refer to Section 4 of the main paper for a technical explanation of this phenomenon.
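The risk curves of Figure 9 can be estimated as in the sketch below, under the assumption that the P-risk is the expected posterior expectation of squared error; `draws` is a hypothetical array of posterior samples collected across replications.

```python
import numpy as np

def mc_posterior_risk(draws, theta_true):
    """Monte Carlo P-risk estimate under squared error loss.
    draws: shape (R, S, d1), S posterior draws from each of R replications."""
    sq_err = np.sum((draws - theta_true) ** 2, axis=-1)  # (R, S)
    return sq_err.mean()          # average over draws and replications
```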

        Cut                   Full                  S-SMP
δ       Bias    Var    Cov    Bias    Var    Cov    Bias    Var    Cov
0.1000 -0.0081 0.0099 0.9400 -0.0131 0.0001 0.7540 -0.0098 0.0028 0.8190
0.1500 -0.0080 0.0099 0.9410 -0.0194 0.0001 0.5430 -0.0121 0.0029 0.7040
0.2000 -0.0082 0.0099 0.9410 -0.0257 0.0001 0.3100 -0.0143 0.0029 0.5800
0.2500 -0.0080 0.0099 0.9430 -0.0318 0.0001 0.1410 -0.0161 0.0030 0.5050
0.3000 -0.0082 0.0099 0.9400 -0.0380 0.0001 0.0360 -0.0182 0.0030 0.4560
0.3500 -0.0080 0.0099 0.9430 -0.0441 0.0001 0.0060 -0.0198 0.0031 0.4520
0.4000 -0.0081 0.0099 0.9400 -0.0503 0.0001 0.0010 -0.0218 0.0032 0.4610
0.4500 -0.0081 0.0099 0.9420 -0.0566 0.0001 0.0000 -0.0234 0.0033 0.4720
0.5000 -0.0082 0.0099 0.9420 -0.0628 0.0001 0.0000 -0.0248 0.0034 0.4810
0.5500 -0.0081 0.0099 0.9430 -0.0690 0.0001 0.0000 -0.0261 0.0036 0.4950
0.6000 -0.0082 0.0099 0.9420 -0.0752 0.0001 0.0000 -0.0276 0.0037 0.5010
0.6500 -0.0081 0.0099 0.9400 -0.0814 0.0001 0.0000 -0.0289 0.0038 0.4950
0.7000 -0.0081 0.0099 0.9430 -0.0877 0.0001 0.0000 -0.0302 0.0040 0.5010
0.7500 -0.0081 0.0099 0.9410 -0.0939 0.0001 0.0000 -0.0314 0.0042 0.5170
0.8000 -0.0079 0.0099 0.9420 -0.1001 0.0001 0.0000 -0.0321 0.0043 0.5440
0.8500 -0.0081 0.0099 0.9410 -0.1065 0.0001 0.0000 -0.0332 0.0045 0.5580
0.9000 -0.0082 0.0099 0.9430 -0.1126 0.0001 0.0000 -0.0340 0.0047 0.5790
Table 2: Repeated sampling results for $\beta_{1}$ in the biased means example presented in Section 3.2 of the main text when $d_{1}=1$. Bias is the average bias of the posterior mean across the replications. Var is the average of the posterior variance, and Cov is the actual coverage of the 95% posterior credible set.
        Cut                   Full                  S-SMP
δ       Bias    Var    Cov    Bias    Var    Cov    Bias    Var    Cov
0.1000 -0.0074 0.0099 0.9550 -0.0128 0.0005 0.9960 -0.0107 0.0039 0.9800
0.1500 -0.0075 0.0099 0.9530 -0.0186 0.0005 0.9900 -0.0135 0.0039 0.9790
0.2000 -0.0075 0.0099 0.9540 -0.0247 0.0005 0.9640 -0.0162 0.0040 0.9770
0.2500 -0.0073 0.0099 0.9540 -0.0306 0.0005 0.8840 -0.0185 0.0041 0.9560
0.3000 -0.0075 0.0099 0.9530 -0.0366 0.0005 0.7440 -0.0213 0.0042 0.9220
0.3500 -0.0075 0.0099 0.9570 -0.0426 0.0005 0.5120 -0.0235 0.0043 0.8750
0.4000 -0.0074 0.0099 0.9540 -0.0486 0.0005 0.2570 -0.0256 0.0045 0.8360
0.4500 -0.0073 0.0099 0.9550 -0.0546 0.0004 0.1060 -0.0275 0.0047 0.8130
0.5000 -0.0074 0.0099 0.9550 -0.0605 0.0004 0.0270 -0.0293 0.0049 0.8150
0.5500 -0.0074 0.0099 0.9550 -0.0665 0.0004 0.0060 -0.0306 0.0051 0.8290
0.6000 -0.0074 0.0099 0.9550 -0.0726 0.0004 0.0000 -0.0318 0.0053 0.8400
0.6500 -0.0073 0.0099 0.9560 -0.0787 0.0004 0.0000 -0.0326 0.0055 0.8570
0.7000 -0.0074 0.0099 0.9560 -0.0848 0.0004 0.0000 -0.0335 0.0058 0.8640
0.7500 -0.0073 0.0099 0.9530 -0.0908 0.0004 0.0000 -0.0338 0.0060 0.8790
0.8000 -0.0075 0.0099 0.9550 -0.0970 0.0004 0.0000 -0.0342 0.0062 0.8830
0.8500 -0.0074 0.0099 0.9560 -0.1031 0.0004 0.0000 -0.0343 0.0065 0.8980
0.9000 -0.0074 0.0099 0.9550 -0.1093 0.0003 0.0000 -0.0344 0.0067 0.8990
Table 3: Repeated sampling results for $\beta_{1}$ in the biased means example presented in Section 3.2 of the main text when $d_{1}=5$. Bias is the average bias of the posterior mean across the replications. Var is the average of the posterior variance, and Cov is the actual coverage of the 95% posterior credible set.


Figure 9: Monte Carlo estimate of expected risk under squared error loss across different levels of contamination ($\delta$) when $d_{1}\in\{1,5,10,25\}$. Full refers to the expected risk for $\theta_{1,0}$ associated with the full posterior based on both sets of data; Cut is the P-risk associated with the cut posterior; SMP is the P-risk for the proposed semi-modular posterior.

References

  • Bennett, J. and J. Wakefield (2001). Errors-in-variables in joint population pharmacokinetic/pharmacodynamic modeling. Biometrics 57(3), 803–812.
  • Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(5), 1103–1130.
  • Blangiardo, M., A. Hansell, and S. Richardson (2011). A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment 45(2), 379–386.
  • Carmona, C. and G. Nicholls (2020). Semi-modular inference: enhanced learning in multi-modular models by tempering the influence of components. In International Conference on Artificial Intelligence and Statistics, pp. 4226–4235. PMLR.
  • Carmona, C. and G. Nicholls (2022). Scalable semi-modular inference with variational meta-posteriors. arXiv preprint arXiv:2204.00296.
  • Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: a probabilistic programming language. Journal of Statistical Software 76(1), 1–32.
  • Castillo, I. (2014). On Bayesian supremum norm contraction rates. The Annals of Statistics.
  • Chakraborty, A., D. J. Nott, C. Drovandi, D. T. Frazier, and S. A. Sisson (2022). Modularized Bayesian analyses and cutting feedback in likelihood-free inference. Statistics and Computing (to appear).
  • Claeskens, G. and N. L. Hjort (2003). The focused information criterion. Journal of the American Statistical Association 98(464), 900–916.
  • Davidson, J. (1994). Stochastic Limit Theory: An Introduction for Econometricians. Oxford University Press.
  • Frazier, D. T. and D. J. Nott (2024). Cutting feedback and modularized analyses in generalized Bayesian inference. Bayesian Analysis 1(1), 1–29.
  • Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378.
  • Green, E. J. and W. E. Strawderman (1991). A James-Stein type estimator for combining unbiased and possibly biased estimators. Journal of the American Statistical Association 86(416), 1001–1006.
  • Grünwald, P. and T. van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12(4), 1069–1103.
  • Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics 190(1), 115–132.
  • Hjort, N. L. and G. Claeskens (2003). Frequentist model average estimators. Journal of the American Statistical Association 98(464), 879–899.
  • Jacob, P. E., L. M. Murray, C. C. Holmes, and C. P. Robert (2017). Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
  • Jacob, P. E., J. O'Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 543–600.
  • Judge, G. G. and R. C. Mittelhammer (2004). A semiparametric basis for combining estimation problems under quadratic loss. Journal of the American Statistical Association 99(466), 479–487.
  • Kim, T.-H. and H. White (2001). James-Stein-type estimators in large samples with application to the least absolute deviations estimator. Journal of the American Statistical Association 96(454), 697–705.
  • Kleijn, B. and A. van der Vaart (2012). The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics 6, 354–381.
  • Lee, K. and J. Lee (2018). Optimal Bayesian minimax rates for unconstrained large covariance matrices. Bayesian Analysis.
  • Lehmann, E. L. and G. Casella (2006). Theory of Point Estimation. Springer Science & Business Media.
  • Liu, F., M. Bayarri, and J. Berger (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis 4(1), 119–150.
  • Liu, Y. and R. J. B. Goudie (2022a). A general framework for cutting feedback within modularized Bayesian inference. arXiv preprint arXiv:2211.03274.
  • Liu, Y. and R. J. B. Goudie (2022b). Stochastic approximation cut algorithm for inference in modularized Bayesian models. Statistics and Computing 32(7).
  • Liu, Y. and R. J. B. Goudie (2023). Generalized geographically weighted regression model within a modularized Bayesian framework. Bayesian Analysis (to appear).
  • Lunn, D., N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander (2009). Combining MCMC with 'sequential' PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics 36, 19–38.
  • Maucort-Boulch, D., S. Franceschi, and M. Plummer (2008). International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers 17(3), 717–720.
  • Miller, J. W. (2021). Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research 22(168), 1–53.
  • Newey, W. K. (1985). Maximum likelihood specification testing and conditional moment tests. Econometrica, 1047–1070.
  • Nicholls, G. K., J. E. Lee, C.-H. Wu, and C. U. Carmona (2022). Valid belief updates for prequentially additive loss functions arising in semi-modular inference. arXiv preprint arXiv:2201.09706.
  • Plummer, M. (2015). Cuts in Bayesian graphical models. Statistics and Computing 25(1), 37–43.
  • Pompe, E. and P. E. Jacob (2021). Asymptotics of cut distributions and robust modular inference using posterior bootstrap. arXiv preprint arXiv:2110.11149.
  • Rieder, H. (2012). Robust Asymptotic Statistics: Volume I. Springer Science & Business Media.
  • Rousseau, J. (1997). Asymptotic Bayes risks for a general class of losses. Statistics & Probability Letters 35(2), 115–121.
  • Shen, X. and L. Wasserman (2001). Rates of convergence of posterior distributions. The Annals of Statistics 29(3), 687–714.
  • Stone, M. (1961). The opinion pool. The Annals of Mathematical Statistics, 1339–1342.
  • Styring, A., M. Charles, F. Fantone, M. Hald, A. McMahon, R. Meadow, G. Nicholls, A. Patel, M. Pitre, A. Smith, A. Sołtysiak, G. Stein, J. Weber, H. Weiss, and A. Bogaard (2017). Isotope evidence for agricultural extensification reveals how the world's first cities were fed. Nature Plants 3, 17076.
  • Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Yu, X., D. J. Nott, and M. S. Smith (2023). Variational inference for cutting feedback in misspecified models. Statistical Science 38(3), 490–509.