
Posterior covariance information criterion for general loss functions

Yukito Iba and Keisuke Yano
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa City, Tokyo 190-8562, Japan.
Abstract

We propose a novel, computationally low-cost method for estimating a general predictive measure of generalised Bayesian inference. The proposed method utilises the posterior covariance and provides estimators of the Gibbs and the plug-in generalisation errors. We present theoretical guarantees for the proposed method, clarifying the connection to Bayesian sensitivity analysis and to the infinitesimal jackknife approximation of Bayesian leave-one-out cross validation. We illustrate several applications of our methods, including differential privacy-preserving learning, Bayesian hierarchical modeling, Bayesian regression in the presence of influential observations, and bias reduction of the widely applicable information criterion. The applicability in high dimensions is also discussed.

keywords:
Bayesian statistics; predictive risk; Markov chain Monte Carlo; infinitesimal jackknife approximation; sensitivity; widely-applicable information criterion


1 Introduction

Bayesian statistics has achieved great success in many applied fields because it can represent complex shapes of distributions, naturally quantify uncertainty, and accommodate the prior information commonly accepted in applied fields. The development of Markov chain Monte Carlo (MCMC) methods has broadened the practical use of Bayesian statistics. Nowadays, advanced MCMC algorithms are available and utilised in applied fields, and several software packages implement MCMC and enhance the accessibility of Bayesian methods.

Once Bayesian models are fitted to data, their goodness of fit is evaluated. Measuring predictive performance is one method for this evaluation (e.g., Vehtari and Ojanen, 2012); it is also known as generalisation ability and is an important concept in the machine learning literature. Many studies have been conducted to estimate the predictive risks of Bayesian models using specific evaluation functions. To construct an asymptotically unbiased estimator of the posterior mean of an expected log-likelihood, Spiegelhalter et al. (2002) employed the difference between the posterior mean of the log-likelihood and the log-likelihood evaluated at the posterior mean, and proposed the deviance information criterion (DIC). Ando (2007) modified DIC to accommodate model misspecification. As an alternative approach to constructing an asymptotically unbiased estimator, Watanabe (2010b) employed the sample mean of the posterior variances of sample-wise log-likelihoods, and proposed the widely applicable information criterion (WAIC). As its name suggests, WAIC has become popular in Bayesian data analysis. Okuno and Yano (2023) showed that WAIC is applicable to overparametrized models. Iba and Yano (2023) extended WAIC for use in weighted inference. More recently, Silva and Zanella (2023) proposed yet another estimator based on the mixture representation of a leave-one-out predictive distribution. These criteria successfully evaluate the posterior mean of the log-likelihood using samples from a single run of posterior simulation with the primary data; additional simulations with leave-one-out data are unnecessary.

However, such useful tools for Bayesian models cannot accommodate an arbitrary evaluation function for prediction or an arbitrary score function for training. Because the choice of a predictive evaluation function is application-specific and reflects the benefit expected from the model's predictions, handling more general evaluation functions is desirable. Further, the generalised Bayesian approach, which accommodates score functions other than the log-likelihood within the Bayesian framework, has attracted growing attention (Chernozhukov and Hong, 2003; Jiang and Tanner, 2008; Ribatet et al., 2012; Bissiri et al., 2016; Miller and Dunson, 2019; Giummolè et al., 2019; Nakagawa and Hashimoto, 2020; Matsubara et al., 2022). This approach makes Bayesian inference more robust against misspecification of the prior distribution and the model. Cross validation (Stone, 1974; Geisser, 1975) is a ubiquitous tool for handling arbitrary evaluation and score functions. Although leave-one-out cross validation (LOOCV) is intuitive and accurate, its brute-force implementation requires recomputing the posterior distribution repeatedly and is almost prohibitive. Importance sampling cross validation (IS-CV; Gelfand et al., 1992; Vehtari and Lampinen, 2003) estimates LOOCV without using leave-one-out posterior distributions. Pareto-smoothed importance sampling cross validation (PSIS-CV; Vehtari et al., 2024) extends the range in which an importance sampling technique can be safely used. Vehtari (2002) proposed a generalisation of DIC to an arbitrary evaluation function. Underhill and Smith (2016) proposed the Bayesian predictive score information criterion (BPSIC) using information matrices, and showed that BPSIC is asymptotically unbiased for the generalisation error with a differentiable evaluation function.

In this study, we develop yet another generalisation error estimate that accommodates general evaluation and score functions and works in line with Bayesian computation. The proposed method employs the posterior covariance to provide a bias correction for empirical errors, and yields an asymptotically unbiased estimator of generalisation errors. The proposed method has significant computational and theoretical features. First, it avoids the naive importance sampling technique, which is sensitive to the presence of influential observations (Peruggia, 1997); the proposed method is therefore expected to be numerically stable with respect to such observations. Second, it is theoretically supported by its asymptotic unbiasedness. Third, and most importantly, users can simply subtract the posterior covariance to estimate an arbitrary generalisation error; this is rooted in the theoretical foundation discussed in the next paragraph and avoids the unstable computation of information matrices and their inverses. These advantages are illustrated in applications to differential privacy-preserving learning (Dwork, 2006) (Subsection 3.1), Bayesian hierarchical modeling (Subsection 3.2), Bayesian regression in the presence of influential observations (Subsection 3.3), bias reduction of WAIC (Subsection 3.4), and high-dimensional models (Subsection 4.1).

Why does the posterior covariance appear in the proposal? This question leads us to an interesting connection to Bayesian local sensitivity and the infinitesimal jackknife approximation of LOOCV. The Bayesian sensitivity formula (e.g., Pérez et al., 2006; Millar and Stewart, 2007; Thomas et al., 2018; Giordano et al., 2018; Iba, 2023; Giordano and Broderick, 2023) implies that the posterior covariance appears when the posterior distribution is perturbed. This, together with the infinitesimal jackknife approximation of LOOCV (e.g., Beirami et al., 2017; Giordano et al., 2019; Rad and Maleki, 2020; Giordano and Broderick, 2023), shows that the proposed method corresponds to an infinitesimal jackknife approximation of the Bayesian LOOCV, which explains the appearance of the posterior covariance; see Subsection 2.2. This aspect is reminiscent of the classical result on the asymptotic equivalence between LOOCV and the Akaike information criterion (AIC; Akaike, 1973) discovered by Stone (1974). It also reduces to the asymptotic equivalence between WAIC and Bayesian LOOCV, given in Section 8 of Watanabe (2018), when the evaluation function is the log-likelihood.

The organization of this paper is as follows. Section 2 introduces our methodologies and presents the infinitesimal jackknife interpretation of our methods; this interpretation gives theoretical support (Theorem 2.4) to our methods. Section 3 illustrates applications of our methods using both synthetic and real datasets. Section 4 discusses the applicability to high-dimensional models, the application to defining a case-influence measure, and the possibility of different definitions of the generalisation errors. The appendices include the proofs of the theorems and additional figures.

2 Proposed method

Suppose that we have $n$ observations $X^{n}=(X_{1},\ldots,X_{n})$, with each observation following an unknown sampling distribution on a sample space $\mathcal{X}$ in $\mathbb{R}^{d}$, and our target is to predict $X_{n+1}$, which follows the same sampling distribution and is independent of the current observations. Let $\Theta$ be the parameter space in $\mathbb{R}^{p}$.

2.1 Predictive evaluation

This subsection presents our methodologies. We work with the generalised posterior distribution given by

\pi(\theta\,;\,X^{n})=\frac{\exp\{\sum_{i=1}^{n}s(X_{i},\theta)\}\pi(\theta)}{\int\exp\{\sum_{i=1}^{n}s(X_{i},\theta^{\prime})\}\pi(\theta^{\prime})d\theta^{\prime}}, \qquad (1)

where $\pi(\theta)$ is a prior density on $\Theta$ and $s(x,\theta)$ is a score function, defined as the negative of a loss function. The score function may differ from the log-likelihood.

For the predictive evaluation of the generalised Bayesian approach, consider an evaluation function $\nu(x,\theta)$ for a future observation $x\in\mathcal{X}$ and a parameter vector $\theta\in\Theta$. Examples include the mean squared error $\nu(x,\theta)=\|x-\mathbb{E}[X\mid\theta]\|^{2}$, the $\ell_{1}$ error $\nu(x,\theta)=\|x-\mathbb{E}[X\mid\theta]\|$, the log-likelihood $\nu(x,\theta)=\log p(x\mid\theta)$, the likelihood $\nu(x,\theta)=p(x\mid\theta)$, and the p-value $\nu(x,\theta)=\int_{x}^{\infty}p(x^{\prime}\mid\theta)\,dx^{\prime}$, where $\{p(x\mid\theta):\theta\in\Theta\}$ is a parametric model and $\mathbb{E}[\cdot\mid\theta]$ is the expectation with respect to $p(x\mid\theta)$.

On the basis of $\nu(x,\theta)$, we consider two types of predictive measures: the Gibbs generalisation error

\mathcal{G}_{\mathrm{G},n}=\mathcal{G}_{\mathrm{G},n}(X^{n})=\mathbb{E}_{X_{n+1}}[\,\mathbb{E}_{\mathrm{pos}}[\nu(X_{n+1},\theta)\mid X^{n}]\,]

and the plug-in generalisation error

\mathcal{G}_{\mathrm{P},n}=\mathcal{G}_{\mathrm{P},n}(X^{n})=\mathbb{E}_{X_{n+1}}[\,\nu(X_{n+1},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}])\,],

where $\mathbb{E}_{\mathrm{pos}}[\,\cdot\mid X^{n}]$ is the generalised posterior expectation and $\mathbb{E}_{X_{n+1}}[\,\cdot\,]$ is the expectation with respect to $X_{n+1}$.

To estimate the Gibbs and plug-in generalisation errors on the basis of the current observations, we propose the Gibbs and plug-in posterior covariance information criteria $\mathrm{PCIC}_{\mathrm{G}}$ and $\mathrm{PCIC}_{\mathrm{P}}$:

\mathrm{PCIC}_{\mathrm{G}} =\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{n}]-\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)\mid X^{n}]\quad\text{and} \qquad (2)
\mathrm{PCIC}_{\mathrm{P}} =\frac{1}{n}\sum_{i=1}^{n}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}])-\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)\mid X^{n}], \qquad (3)

where $\mathrm{Cov}_{\mathrm{pos}}[\,\cdot\,,\,\cdot\mid X^{n}]$ is the generalised posterior covariance. Table 1 summarises how we obtain estimates of the Gibbs and the plug-in generalisation errors. The first terms correspond to the empirical errors

\mathcal{E}_{\mathrm{G},n}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{n}]\quad\text{and}\quad\mathcal{E}_{\mathrm{P},n}=\frac{1}{n}\sum_{i=1}^{n}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}]);

the proposed methods employ the posterior covariance as a bias correction of the empirical errors.

Table 1: Algorithm for the generalisation error estimation.
Input: observations $X_{1},\ldots,X_{n}$ and posterior samples $\theta_{1},\ldots,\theta_{M}$ of size $M$ from $\pi(\theta;X^{n})$.
Output: an estimate of $\mathcal{G}_{\mathrm{G},n}$ and an estimate of $\mathcal{G}_{\mathrm{P},n}$.
Calculate
T_{G}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{M}\sum_{k=1}^{M}\nu(X_{i},\theta_{k})\quad\text{and}\quad T_{P}=\frac{1}{n}\sum_{i=1}^{n}\nu\left(X_{i},\frac{1}{M}\sum_{k=1}^{M}\theta_{k}\right).
Calculate also
V=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{M}\sum_{k=1}^{M}\{\nu(X_{i},\theta_{k})s(X_{i},\theta_{k})\}-\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{M}\sum_{k=1}^{M}\nu(X_{i},\theta_{k})\right\}\left\{\frac{1}{M}\sum_{k=1}^{M}s(X_{i},\theta_{k})\right\}.
Return $(T_{G}-V)$ as an estimate of $\mathcal{G}_{\mathrm{G},n}$ and $(T_{P}-V)$ as an estimate of $\mathcal{G}_{\mathrm{P},n}$.
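The algorithm in Table 1 maps directly onto posterior draws from any sampler. The following is a minimal NumPy sketch of it, assuming that $\nu(X_{i},\theta_{k})$ and $s(X_{i},\theta_{k})$ have already been evaluated on the draws and stored as $M\times n$ arrays; the function name and array layout are illustrative assumptions, not part of the proposal.

```python
import numpy as np

def pcic(nu_vals, s_vals, nu_at_mean=None):
    """Sketch of the algorithm in Table 1.

    nu_vals, s_vals : (M, n) arrays of nu(X_i, theta_k) and s(X_i, theta_k)
        over posterior draws theta_1, ..., theta_M.
    nu_at_mean : optional (n,) array of nu(X_i, posterior mean of theta),
        needed only for the plug-in criterion PCIC_P.
    """
    # Empirical errors T_G and T_P.
    T_G = nu_vals.mean()
    T_P = nu_at_mean.mean() if nu_at_mean is not None else None

    # Bias correction V: sample-wise posterior covariance of nu and s,
    # averaged over the n observations.
    V = np.mean((nu_vals * s_vals).mean(axis=0)
                - nu_vals.mean(axis=0) * s_vals.mean(axis=0))

    pcic_G = T_G - V
    pcic_P = T_P - V if T_P is not None else None
    return pcic_G, pcic_P
```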
Remark 2.1 (Computation).

An important point of the proposed criteria is that their computation fits well with standard Bayesian computation. Typical generalisation error estimates such as the Takeuchi information criterion (TIC; Takeuchi, 1976), the regularisation information criterion (RIC; Shibata, 1989), and the generalised information criterion (GIC; Konishi and Kitagawa, 1996) employ information matrices such as $\widehat{J}_{s}:=(1/n)\sum_{i=1}^{n}\{-\nabla_{\theta}\nabla_{\theta}^{\top}s(X_{i},\theta_{s})\}$ and their inverses, where $\nabla_{\theta}$ is the gradient with respect to $\theta$. In particular, the computation of $\widehat{J}^{-1}_{s}$ is unstable and demanding in standard Bayesian computation. Instead, the proposed method utilises the posterior covariance to avoid such a demanding computation.

Remark 2.2 (Connection to WAIC).

When working with a parametric model $\{p(x\mid\theta):\theta\in\Theta\}$, we set the negative log-likelihood as the loss function and consider the posterior distribution with learning rate $\beta>0$. We then have $\nu(X,\theta)=-\log p(X\mid\theta)$ and $s(X,\theta)=\beta\log p(X\mid\theta)$, so that the posterior covariance of $\nu(X_{i},\theta)$ and $s(X_{i},\theta)$ equals $-\beta$ times the posterior variance of $\log p(X_{i}\mid\theta)$. In this case, $\mathrm{PCIC}_{\mathrm{G}}$ reduces to $\mathrm{WAIC}_{2}$ given in Section 8.3 of Watanabe (2018):

\mathrm{WAIC}_{2}=-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathrm{pos}}[\log p(X_{i}\mid\theta)\mid X^{n}]+\frac{\beta}{n}\sum_{i=1}^{n}\mathbb{V}_{\mathrm{pos}}[\log p(X_{i}\mid\theta)\mid X^{n}],

where $\mathbb{V}_{\mathrm{pos}}[\,\cdot\mid X^{n}]$ is the generalised posterior variance.

2.2 Infinitesimal jackknife interpretation

Before presenting theoretical results, we discuss the infinitesimal jackknife (IJ) interpretation of the proposed method. The IJ approach is a general methodology that approximates algorithms requiring the re-fitting of models, such as cross validation and the bootstrap. In the recent machine learning literature, this methodology has been rekindled as a linear approximation of LOOCV; see Beirami et al. (2017); Koh and Liang (2017); Giordano et al. (2019); Rad and Maleki (2020). We briefly explain the IJ approximation of the leave-one-out $Z$-estimates given by

\widehat{\theta}^{(-i)}=\theta\quad\text{such that}\quad\sum_{j\neq i}\nabla_{\theta}s(X_{j},\theta)=0,\quad i=1,\ldots,n.

To describe the IJ approximation, we introduce the weighted $Z$-estimate

\widehat{\theta}_{w}:=\theta\quad\text{such that}\quad\sum_{i=1}^{n}w_{i}\nabla_{\theta}s(X_{i},\theta)=0\qquad\text{for}\qquad w=(w_{1},\ldots,w_{n})^{\top}\in\mathbb{R}^{n},

where the leave-one-out $Z$-estimate is denoted by

\widehat{\theta}^{(-i)}=\widehat{\theta}_{\mathbbm{1}_{-i}}\quad\text{with}\quad\mathbbm{1}_{-i}=(1,\ldots,1,\underbrace{0}_{\text{the $i$-th component}},1,\ldots,1)^{\top}.

Then, using the introduced weight $w$, the leave-one-out estimate $\widehat{\theta}^{(-i)}$ is linearly approximated as

\widehat{\theta}^{(-i)}_{\mathrm{IJ}} =\widehat{\theta}+\sum_{j=1}^{n}\frac{\partial\widehat{\theta}_{w}}{\partial w_{j}}\Big{|}_{w=\mathbbm{1}}(\mathbbm{1}_{-i}-\mathbbm{1})_{j}\quad\text{with}\quad\mathbbm{1}=(1,\ldots,1)^{\top}
=\widehat{\theta}-\frac{\partial\widehat{\theta}_{w}}{\partial w_{i}}\Big{|}_{w=\mathbbm{1}}, \qquad (4)

which is known as the infinitesimal jackknife (IJ) approximation of leave-one-out estimates. By the implicit function theorem, the derivative with respect to the weight is evaluated as

\frac{\partial\widehat{\theta}_{w}}{\partial w_{j}}\Bigg{|}_{w=\mathbbm{1}}=\left(-\sum_{k=1}^{n}\nabla_{\theta}\nabla_{\theta}^{\top}s(X_{k},\theta)\Big{|}_{\theta=\widehat{\theta}}\right)^{-1}\left(\nabla_{\theta}s(X_{j},\theta)\Big{|}_{\theta=\widehat{\theta}}\right)

and then the IJ approximation becomes

\widehat{\theta}^{(-i)}_{\mathrm{IJ}} =\widehat{\theta}-\left(-\sum_{k=1}^{n}\nabla_{\theta}\nabla_{\theta}^{\top}s(X_{k},\theta)\Big{|}_{\theta=\widehat{\theta}}\right)^{-1}\left(\nabla_{\theta}s(X_{i},\theta)\Big{|}_{\theta=\widehat{\theta}}\right).
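To make the formula concrete, consider the toy case $s(x,\theta)=-(x-\theta)^{2}/2$ (estimation of a normal mean), where the $Z$-estimate is the sample mean and the summed negative Hessian is simply $n$. The sketch below, with illustrative names of our own choosing, compares the IJ leave-one-out estimates with the exact ones; it uses one Hessian solve and no model refits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
n = len(x)

# Z-estimate for s(x, theta) = -(x - theta)^2 / 2: the sample mean.
theta_hat = x.mean()

# Per-observation gradient of s at theta_hat and the summed negative Hessian.
grad = x - theta_hat          # d/dtheta s(x_i, theta) at theta = theta_hat
neg_hess_sum = float(n)       # -sum_i d^2/dtheta^2 s(x_i, theta) = n

# IJ approximation of the leave-one-out estimates (equation (4)).
theta_ij = theta_hat - grad / neg_hess_sum

# Exact leave-one-out estimates for comparison.
theta_loo = (n * theta_hat - x) / (n - 1)
print(np.max(np.abs(theta_ij - theta_loo)))   # discrepancy of order 1/n^2
```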

What is the Bayesian version of this formula? Here we consider the Gibbs LOOCV

\mathrm{CV}_{\mathrm{G}}:=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}] \qquad (5)

and the plug-in LOOCV

\mathrm{CV}_{\mathrm{P}}:=\frac{1}{n}\sum_{i=1}^{n}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{-i}]), \qquad (6)

where $\mathbb{E}_{\mathrm{pos}}[\cdot\mid X^{-i}]$ is the expectation with respect to the leave-one-out generalised posterior distribution:

\pi(\theta\,;\,X^{-i})=\frac{\exp\{\sum_{j\neq i}s(X_{j},\theta)\}\pi(\theta)}{\int\exp\{\sum_{j\neq i}s(X_{j},\theta^{\prime})\}\pi(\theta^{\prime})d\theta^{\prime}}.

To investigate the IJ approximation of these cross validations, we define the weighted generalised posterior distribution

\pi_{w}(\theta\,;\,X^{n}):=\frac{\exp\{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)\}\pi(\theta)}{\int\exp\{\sum_{i=1}^{n}w_{i}s(X_{i},\theta^{\prime})\}\pi(\theta^{\prime})\mathrm{d}\theta^{\prime}}\quad\text{for}\quad w=(w_{1},\ldots,w_{n})^{\top}\in\mathbb{R}^{n}, \qquad (7)

and denote by $\mathbb{E}_{\mathrm{pos}}^{w}[\,\cdot\,]$ the expectation with respect to $\pi_{w}(\theta\,;\,X^{n})$. Observe that we have

\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}] =\mathbb{E}_{\mathrm{pos}}^{\mathbbm{1}_{-i}}[\nu(X_{i},\theta)]\quad\text{and}
\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{-i}]) =\nu(X_{i},\mathbb{E}_{\mathrm{pos}}^{\mathbbm{1}_{-i}}[\theta]).

This, together with a linearisation analogous to equation (4), leads us to the following IJ approximations:

(\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}])_{\mathrm{IJ}} :=\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{n}]-\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)\mid X^{n}]\Big{|}_{w=\mathbbm{1}},
(\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{-i}]))_{\mathrm{IJ}} :=\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}])-\frac{\partial}{\partial w_{i}}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}^{w}[\theta\mid X^{n}])\Big{|}_{w=\mathbbm{1}}.

To evaluate the second terms in the IJ approximations, we employ the following variants of the local sensitivity formula (Millar and Stewart, 2007; Giordano et al., 2018). For more details on Bayesian local sensitivity, refer to Thomas et al. (2018).

Lemma 2.3.

Under Condition C3 in Appendix A, we have, for $k=1,2$,

\frac{\partial^{k}}{\partial w_{i}^{k}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)] =\mathbb{E}_{\mathrm{pos}}^{w}\left[\{\nu(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]\}\{s(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]\}^{k}\right],
\frac{\partial^{k}}{\partial w_{i}^{k}}\mathbb{E}_{\mathrm{pos}}^{w}[\theta] =\mathbb{E}_{\mathrm{pos}}^{w}\left[\{\theta-\mathbb{E}_{\mathrm{pos}}^{w}[\theta]\}\{s(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]\}^{k}\right].

In particular, we have

\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)] =\mathrm{Cov}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta),s(X_{i},\theta)],\quad\text{and}
\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\theta] =\mathrm{Cov}_{\mathrm{pos}}^{w}[\theta,s(X_{i},\theta)].

The proof is given in Appendix B. We note that the final equation in the lemma is crucial for assessing the frequentist variance; see Giordano and Broderick (2023).
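The case $k=1$ of Lemma 2.3 can be checked numerically by tilting posterior draws with self-normalised weights proportional to $\exp\{(w_{i}-1)s(X_{i},\theta)\}$ and taking a finite difference in $w_{i}$. The sketch below does this for a conjugate normal model whose posterior can be sampled exactly; the model, prior, and variable names are illustrative assumptions rather than examples from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=20)
n = len(x)

# Exact posterior for a N(theta, 1) likelihood with a N(0, 10^2) prior.
post_var = 1.0 / (n + 1.0 / 10.0**2)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=200_000)

s1 = -0.5 * (x[0] - theta) ** 2    # score s(X_1, theta): Gaussian log-likelihood kernel
nu1 = (x[0] - theta) ** 2          # evaluation function nu(X_1, theta)

def tilted_mean(eps):
    # E_pos^w[nu(X_1, theta)] with w_1 = 1 + eps, via self-normalised reweighting.
    w = np.exp(eps * s1)
    return np.sum(w * nu1) / np.sum(w)

eps = 1e-3
finite_diff = (tilted_mean(eps) - tilted_mean(-eps)) / (2 * eps)
post_cov = np.mean(nu1 * s1) - np.mean(nu1) * np.mean(s1)
print(finite_diff, post_cov)   # the two values should essentially coincide
```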

For the Gibbs LOOCV, Lemma 2.3 yields the form of the IJ approximation given by

(\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}])_{\mathrm{IJ}} =\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{n}]-\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]. \qquad (8)

For the plug-in LOOCV, further calculation is required. If $\nu(X_{i},\theta)$ is twice continuously differentiable with respect to $\theta$ and its Hessian with respect to $\theta$ is bounded, then the Taylor expansion around $\widehat{\theta}=\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}]$ yields

\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]
=\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta)-\nu(X_{i},\widehat{\theta})\,,\,s(X_{i},\theta)]
=\mathrm{Cov}_{\mathrm{pos}}\left[\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\{\theta^{a}-\widehat{\theta}^{a}\}+O_{P}(\|\theta-\widehat{\theta}\|^{2})\,,\,s(X_{i},\theta)\right]
=\mathrm{Cov}_{\mathrm{pos}}\left[\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\{\theta^{a}-\widehat{\theta}^{a}\}\,,\,s(X_{i},\theta)\right]+\mathrm{rem}
=\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\mathrm{Cov}_{\mathrm{pos}}[\theta^{a}-\widehat{\theta}^{a}\,,\,s(X_{i},\theta)]+\mathrm{rem},

where $\theta=(\theta^{1},\ldots,\theta^{d})$ and $\mathrm{rem}$ is negligible under additional assumptions. Here Lemma 2.3 gives

\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\mathrm{Cov}_{\mathrm{pos}}[\theta^{a}-\widehat{\theta}^{a}\,,\,s(X_{i},\theta)]=\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\theta^{a}]\Big{|}_{w=\mathbbm{1}}.

Observe that the chain rule gives

\sum_{a=1}^{d}\frac{\partial\nu(X_{i},\theta)}{\partial\theta^{a}}\Big{|}_{\theta=\widehat{\theta}}\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\theta^{a}]\Big{|}_{w=\mathbbm{1}}=\frac{\partial}{\partial w_{i}}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}^{w}[\theta])\Big{|}_{w=\mathbbm{1}},

implying that

\frac{\partial}{\partial w_{i}}\nu(X_{i},\mathbb{E}_{\mathrm{pos}}^{w}[\theta])\Big{|}_{w=\mathbbm{1}}=\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]+\mathrm{rem}.

This concludes that the IJ approximation for the plug-in LOOCV is

(\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{-i}]))_{\mathrm{IJ}}=\nu(X_{i},\mathbb{E}_{\mathrm{pos}}[\theta\mid X^{n}])-\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]+\mathrm{rem}, \qquad (9)

where the remainder term $\mathrm{rem}$ is negligible.

Thus, the IJ approximations (8) and (9) of the LOOCV yield $\mathrm{PCIC}_{\mathrm{G}}$ and $\mathrm{PCIC}_{\mathrm{P}}$. In the next subsection, we show that the residual between the IJ approximations and the expected generalisation error is negligible.

In the literature on information criteria, the IJ methodology has been used to show the asymptotic equivalence between information criteria and LOOCV (Stone, 1974; Watanabe, 2010a); see also Konishi and Kitagawa (1996). The IJ approximation of the leave-one-out $Z$-estimate requires the second-order derivatives and their inverse (c.f., Beirami et al., 2017; Koh and Liang, 2017). The discussion here emphasises that, in the Bayesian framework, these calculations are avoidable and there are user-friendly surrogates, namely, the posterior covariances. Recently, Giordano and Broderick (2023) analysed the Bayesian infinitesimal jackknife approximation for estimating the frequentist covariance in detail, showing consistency in finite-dimensional models and discussing the behaviour in high-dimensional models.

2.3 Theoretical results

On the basis of the IJ approximation discussed in the previous subsection, this subsection presents the theoretical support for the use of $\mathrm{PCIC}_{\mathrm{G}}$. The same result holds for $\mathrm{PCIC}_{\mathrm{P}}$ and is therefore omitted.

The following is the main theorem stating the asymptotic unbiasedness of the proposed criteria. The proof consists of two steps: (1) the analysis of the residual between the IJ approximation and the LOOCV; (2) the analysis of the expected value of the difference between the generalisation error and the LOOCV. Suppose that the current $n$ observations $X^{n}=(X_{1},\ldots,X_{n})$ are independent and identically distributed from an unknown sampling distribution on a sample space $\mathcal{X}$ in $\mathbb{R}^{d}$, and that our prediction target $X_{n+1}$ follows the same sampling distribution and is independent of $X^{n}$. The expectation $\mathbb{E}[\cdot]$ denotes the expectation with respect to $(X_{1},\ldots,X_{n+1})$.

Theorem 2.4.

Under Conditions C1-C3, the criterion $\mathrm{PCIC}_{\mathrm{G}}$ is asymptotically unbiased for the Gibbs generalisation error:

\mathbb{E}[\mathrm{PCIC}_{\mathrm{G}}]=\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]+o\left(\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]-\min_{\theta^{\prime}\in\Theta}\mathbb{E}[\nu(X_{n+1},\theta^{\prime})]\right). \qquad (10)
Proof of Theorem 2.4.

First, we subtract $\min_{\theta^{\prime}\in\Theta}\mathbb{E}[\nu(X_{n+1},\theta^{\prime})]$ from $\nu(X_{n+1},\theta)$ and denote the result again by $\nu$. This does not affect the proof, as we simply subtract the same quantity from both sides of equation (10).

Next, we analyse the Gibbs LOOCV (5) by continuing the Taylor expansion from the first order (8) to the second order. Then we have

\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}]=(\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}])_{\mathrm{IJ}}+\frac{(-1)^{2}}{2}\frac{\partial^{2}}{\partial w_{i}^{2}}\bigg{|}_{w_{i}=w_{i}^{*}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)],

where $\mathbb{E}_{\mathrm{pos}}^{w}[\cdot]$ is defined in (7), $(\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}])_{\mathrm{IJ}}$ is given in (8), and $w_{i}^{*}$ is a point in $[0,1]$. Since Lemma 2.3 gives the form of the second term on the right-hand side above, we get

\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}]=(\mathbb{E}_{\mathrm{pos}}[\nu(X_{i},\theta)\mid X^{-i}])_{\mathrm{IJ}}+\frac{1}{2}\kappa^{w^{*,i}}_{3}[i], \qquad (11)

where $w^{*,i}=(1,\ldots,1,w^{*}_{i},1,\ldots,1)$ and

\kappa^{w^{*,i}}_{3}[i]=\mathbb{E}_{\mathrm{pos}}^{w^{*,i}}\left[\left\{\nu(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w^{*,i}}[\nu(X_{i},\theta)]\right\}\left\{s(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w^{*,i}}[s(X_{i},\theta)]\right\}^{2}\right].

Summing (11) over $i$, together with the form (8) of the IJ approximation, yields the form of the Gibbs LOOCV

\mathrm{CV}_{\mathrm{G}}=\mathrm{PCIC}_{\mathrm{G}}+\frac{1}{2n}\sum_{i=1}^{n}\kappa_{3}^{w^{*,i}}[i]. \qquad (12)

Observe that the expected Gibbs LOOCV equals the expected Gibbs generalisation error with sample size $n-1$:

\mathbb{E}[\mathcal{G}_{\mathrm{G},n-1}]=\mathbb{E}[\mathrm{CV}_{\mathrm{G}}].

Thus we get

\mathbb{E}[\mathcal{G}_{\mathrm{G},n}] =\mathbb{E}[\mathrm{PCIC}_{\mathrm{G}}]+\underbrace{\{\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]-\mathbb{E}[\mathcal{G}_{\mathrm{G},n-1}]\}}_{=:A}+\underbrace{\frac{1}{2n}\sum_{i=1}^{n}\mathbb{E}[\kappa_{3}^{w^{*,i}}[i]]}_{=:B}.

Condition C1 makes $A$ equal to $o(\mathbb{E}[\mathcal{G}_{\mathrm{G},n}])$ and Condition C2 makes $B$ equal to $o(\mathbb{E}[\mathcal{G}_{\mathrm{G},n}])$, which completes the proof. ∎

We remark that the first part of the proof is essentially the same as the cumulant expansion used in the proof of asymptotic unbiasedness of WAIC (Watanabe, 2018).

3 Applications

This section presents applications of the proposed methods in comparison with the following existing methods:

  • Exact leave-one-out cross validation (LOOCV);

  • A generalisation of DIC (GDIC; Vehtari, 2002);

  • Bayesian predictive score information criterion (BPSIC; Underhill and Smith, 2016);

  • Importance sampling cross validation (IS-CV; Gelfand et al., 1992);

  • Pareto smoothed importance sampling cross validation (PSIS-CV; Vehtari et al., 2024).

Our applications in this section are summarised as follows:

  • We first present an application to differentially private learning;

  • We then apply the methods to Bayesian hierarchical modeling;

  • We next apply them to Bayesian regression models in the presence of influential observations;

  • We lastly apply PCIC to eliminate the bias of WAIC due to strong priors.

Further, we remark that in Subsection 4.1, we apply PCIC to high-dimensional models.

3.1 Evaluation of differentially private learning methods

Figure 1: Means of generalisation error estimates relative to the generalisation errors for the banknote authentication dataset, with standard deviations represented as shaded areas. The vertical axes represent values relative to the generalisation errors, while the horizontal axes correspond to $\beta$ in the generalised posterior distribution. (a) Generalisation error estimates for the Brier loss. (b) Estimates for the misclassification loss. (c) Estimates for the spherical loss.
Figure 2: Means of generalisation error estimates relative to the generalisation errors for the adult dataset, with standard deviations represented as shaded areas. The vertical axes represent values relative to the generalisation errors, while the horizontal axes correspond to $\beta$ in the generalised posterior distribution. (a) Generalisation error estimates for the Brier loss. (b) Estimates for the misclassification loss. (c) Estimates for the spherical loss.

Differential privacy is a cryptographic framework designed to ensure the privacy of individual users’ data while enabling meaningful statistical analyses with a desired level of efficiency (Dwork, 2006). It provides a formal guarantee that the inclusion or exclusion of a single individual’s data does not significantly affect the output of the analysis, thereby protecting sensitive information. Differential privacy has many real-world applications (for example, see Desfontaines and Pejó, 2020), and many strategies have been developed (Abadi et al., 2016; Wang et al., 2015a; Minami et al., 2016; Mironov, 2017; Jewson et al., 2023).

The foundational concept for ensuring differential privacy is the $(\varepsilon,\delta)$-differentially private learner. For $\varepsilon,\delta\geq 0$, an $(\varepsilon,\delta)$-differentially private learner is defined as a randomised estimator $\widehat{\theta}$ that satisfies the following property: for any pair of adjacent datasets $X^{n},\widetilde{X}^{n}\in\mathcal{X}^{n}$, that is, datasets with

\text{the number of }\{i:\text{$X_{i}$ and $\widetilde{X}_{i}$ are different}\}\leq 1,

and for any measurable set $A$, the inequality

\mathbb{P}\left(\widehat{\theta}(X^{n})\in A\right)\leq\exp(\varepsilon)\mathbb{P}\left(\widehat{\theta}(\widetilde{X}^{n})\in A\right)+\delta

holds. Here, $\varepsilon$ represents the privacy budget, quantifying the level of privacy, while $\delta$ accounts for a small probability of privacy violation.

A widely employed strategy for achieving $(\varepsilon,\delta)$-differential privacy is the one-posterior-sample (OPS) estimator (Wang et al., 2015b; Minami et al., 2016). This estimator is a single sample from a generalised posterior distribution given by

\pi_{\beta}(\theta;X^{n})\propto\exp\left[-\beta\sum_{i=1}^{n}L(X_{i},\theta)\right],

where $L(\cdot,\cdot)$ denotes a user-specified loss or score function. The hyperparameter $\beta$ is intricately controlled by the privacy parameters $(\varepsilon,\delta)$, the choice of the loss function $L$, and the dataset size $n$. This approach leverages the flexibility of posterior sampling to balance the trade-off between privacy guarantees and statistical utility.

Here, we demonstrate the application of $\mathrm{PCIC}_{\mathrm{G}}$ to understanding the predictive behaviour of OPS estimators with different values of $\beta$. We use two classification datasets from the UCI Machine Learning Repository (Dua and Graff, 2017), namely, the banknote authentication dataset and the adult dataset. The banknote authentication dataset classifies genuine and forged banknote-like specimens based on four image features. The adult dataset predicts whether income exceeds 50K/yr based on 14 features from census data. We work with the generalised posterior distribution based on the logistic regression

\pi_{\beta}(\theta;Y^{n},X^{n})\propto\exp\left[\beta\sum_{i=1}^{n}\left\{Y_{i}\log\sigma(X_{i}\theta)+(1-Y_{i})\log(1-\sigma(X_{i}\theta))\right\}\right]\exp\{-\theta^{\top}\theta/2\},

where $\sigma(\cdot)$ is the sigmoid function $\sigma(x):=1/(1+\exp(-x))$. We consider predictive evaluation of OPS estimators using three major classification losses, the Brier loss, the misclassification loss, and the spherical loss:

\nu_{\mathrm{Brier}}(x,p) =(x-p)^{2},
\nu_{\mathrm{misclass}}(x,p) =\begin{cases}-1&\text{if $x=1$ and $p>1/2$, or $x=0$ and $p<1/2$},\\ 0&\text{otherwise},\end{cases}
\nu_{\mathrm{spherical}}(x,p) =-\frac{xp+(1-x)(1-p)}{\sqrt{p^{2}+(1-p)^{2}}},

where, for the $i$-th observation, $p$ is given by $\sigma(X_{i}\theta)$. The misclassification loss is non-differentiable with respect to $\theta$, and our theoretical results may not imply accurate generalisation error estimation; we therefore confirm the applicability by numerical experiments. First, we randomly split the whole data into a training dataset of sample size 50 and a test dataset consisting of the remaining observations, repeating this 20 times. We then calculate the empirical errors using the training dataset, and the average of the generalisation errors using the test dataset. We used 3980 MCMC samples after thinning by 5 and a burn-in period of length 100. The MCMC algorithm here employs the Gibbs sampler based on Polya-Gamma augmentation (Polson et al., 2013).
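As an illustration of how $\mathrm{PCIC}_{\mathrm{G}}$ is computed in this setting, the following sketch evaluates the criterion for the Brier loss from posterior draws of $\theta$ under the generalised logistic posterior above; the function name and array layout are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcic_g_brier(theta_draws, X, y, beta):
    """PCIC_G for the Brier loss under the generalised logistic posterior.

    theta_draws : (M, p) posterior draws; X : (n, p) covariates; y : (n,) labels in {0, 1}.
    beta : learning rate of the generalised posterior.
    """
    prob = sigmoid(theta_draws @ X.T)                         # (M, n) success probabilities
    nu = (y[None, :] - prob) ** 2                             # Brier loss per draw and observation
    s = beta * (y[None, :] * np.log(prob)
                + (1.0 - y[None, :]) * np.log(1.0 - prob))    # score per draw and observation
    empirical_error = nu.mean()                               # Gibbs empirical error
    bias = np.mean((nu * s).mean(axis=0)
                   - nu.mean(axis=0) * s.mean(axis=0))        # average posterior covariance
    return empirical_error - bias
```

The same routine handles the misclassification and spherical losses by replacing the definition of nu, since the bias correction only requires evaluating $\nu$ and $s$ on the posterior draws.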

Figures 1 and 2 display the average values of the generalisation error estimates relative to the average of the generalisation errors. The exact LOOCV is very close to the average of the generalisation error regardless of the loss and the dataset, but its computational cost is almost prohibitive. GDIC is not close to the average of the generalisation error. BPSIC works only for $\beta=1$ and a differentiable evaluation function. The proposed method $\mathrm{PCIC}_{\mathrm{G}}$ successfully estimates the generalisation errors of OPS estimators for all three evaluation functions, including the non-differentiable one. IS-CV and PSIS-CV are comparable to the proposed method; therefore, we cannot reach a conclusion regarding the performance difference in this setting.

3.2 Application to Bayesian hierarchical models

Figure 3: Generalisation error estimates relative to the generalisation errors with varying data scaling factors. The averages are computed over different chains, with colored shades representing $\pm\sigma$ for these chains. (a) Plots for the 8-schools data; (b) plots for the meta-analysis data.

Figure 4: Generalisation error estimates relative to the generalisation errors with varying prior scaling factors. The averages are computed over different chains, with colored shades representing $\pm\sigma$ for these chains. (a) Plots for the 8-schools data; (b) plots for the meta-analysis data.

We then apply the proposed methods to predictive evaluation of Bayesian hierarchical modeling. Here we shall work with a simple two-level normal model:

X_{ij} \sim\mathcal{N}(\theta_{j},v_{j}^{2}),\quad i=1,\ldots,n_{j},\quad j=1,\ldots,J,
\theta_{j} \sim\mathcal{N}(\mu,\tau^{2}),\quad j=1,\ldots,J,

where the $v_{j}^{2}$ are assumed to be given, $n_{j}$ is the sample size of the $j$-th group ($j=1,\ldots,J$), and $J$ is the number of groups. Here we assume that $n_{j}$ is equal to 1 and denote $X_{ij}$ by $X_{j}$, which is always possible by sufficiency reduction. The prior distributions of $\mu$ and $\tau$ are specified by a scale mixture of the normal distribution with the half-Cauchy distribution and by the half-normal distribution, respectively:

\mu\mid\sigma^{2}_{\mu} \sim\mathcal{N}(0,\sigma^{2}_{\mu}),\quad\sigma^{2}_{\mu}\sim\mathrm{Half}\mathcal{C}(\gamma),\quad\text{and}
\tau \sim\mathrm{Half}\mathcal{N}(\sigma_{\tau}),

where $\gamma$ and $\sigma_{\tau}$ are hyperparameters.

We measure the prediction accuracy of the information fusion across groups in the Bayesian hierarchical structure by the (unscaled) $\ell_{2}^{2}$ loss

\nu(X_{j},\mu)=|X_{j}-\mu|^{2}.

The working posterior in this application is rewritten as

\pi(\mu,\tau;X^{J})\propto\exp\Bigg{(}\sum_{j=1}^{J}s_{j}(X_{j};\mu,\tau)\Bigg{)}\pi(\mu,\tau)\quad\text{with}\quad s_{j}(x;\mu,\tau)=\log\phi(x\,;\,\mu,\tau^{2}+v_{j}^{2}),

where $\phi(x\,;\,a,b)$ is the normal density with mean $a$ and variance $b$. Then $\mathrm{PCIC}_{\mathrm{G}}$ becomes

\mathrm{PCIC}_{\mathrm{G}}=\frac{1}{J}\sum_{j=1}^{J}\mathbb{E}_{\mathrm{pos}}[\nu(X_{j},\mu)\mid X^{J}]-\frac{1}{J}\sum_{j=1}^{J}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{j},\mu),s_{j}(X_{j};\mu,\tau)\mid X^{J}].
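A minimal sketch of this computation from posterior draws of $(\mu,\tau)$ is given below, assuming the draws are already available (for instance from PyMC); the function name and array layout are our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def pcic_g_hierarchical(mu_draws, tau_draws, X, v):
    """PCIC_G for the two-level normal model with the squared-error evaluation.

    mu_draws, tau_draws : (M,) posterior draws of mu and tau.
    X, v : group observations X_j and their known standard deviations v_j, shape (J,).
    """
    nu = (X[None, :] - mu_draws[:, None]) ** 2                     # (M, J) evaluation values
    sd = np.sqrt(tau_draws[:, None] ** 2 + v[None, :] ** 2)        # marginal std of X_j given (mu, tau)
    s = norm.logpdf(X[None, :], loc=mu_draws[:, None], scale=sd)   # (M, J) score values
    empirical_error = nu.mean()
    bias = np.mean((nu * s).mean(axis=0) - nu.mean(axis=0) * s.mean(axis=0))
    return empirical_error - bias
```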

To check the behaviour, we use two datasets commonly used in demonstrations of Bayesian hierarchical modeling. One is the 8-schools data: the dataset consists of coaching effects, with their standard deviations, in $J=8$ different high schools in New Jersey; see Table 5.2 of Gelman et al. (2013). The other is the meta-analysis data from Yusuf et al. (1985): the dataset consists of $J=22$ clinical trials of $\beta$-blockers for reducing mortality after myocardial infarction; see Table 5.4 of Gelman et al. (2013). We apply the empirical log-odds transformation so as to employ the normal model described above, as in Section 5.8 of Gelman et al. (2013).

We design two types of experiments. One changes the data scale by multiplying the $X_{j}$ by a data scaling factor $\delta_{X}\in\{1,2,\ldots,9,10\}$. The other changes the prior scale by multiplying $\gamma$ by a prior scaling factor $\delta_{\gamma}\in\{0.1i\,:\,i=1,\ldots,10\}$. We fix $\sigma_{\tau}=10$ and set $\gamma=10$ as initial values. In the experiments, we first randomly split the original datasets of $J$ groups into training data of $\lfloor J/2\rfloor$ groups and test data of the remaining $J-\lfloor J/2\rfloor$ groups, repeating this 20 times. We then obtain 5000 posterior samples using PyMC (Abril-Pla et al., 2023), calculate the empirical errors and the generalisation error estimates using the training data, and calculate the average of the generalisation errors using the test data.

Figure 3 displays the generalisation error estimates relative to the average generalisation errors as the data scaling factor $\delta_{X}$ varies. Figure 4 displays those as the prior scaling factor $\delta_{\gamma}$ varies. As the data scaling factor becomes larger, all estimates deteriorate (move away from the generalisation error), but the deterioration rates of IS-CV and PCIC are slower than that of PSIS-CV. For any prior scaling factor, IS-CV and PCIC are closer to the generalisation error than PSIS-CV. GDIC is a relatively good metric for the meta-analysis data but not for the 8-schools data.

3.3 Application to Bayesian regression in the presence of influential observations

We next apply the proposed method to Bayesian regression in the presence of influential observations. The presence of influential observations impacts the variability of the case-deletion importance sampling weights, as discussed in Peruggia (1997), which results in the instability of IS-CV.

Figure 5: Performance of $\mathrm{PCIC}_{\mathrm{G}}$, IS-CV, and PSIS-CV for the synthetic Bayesian regression in the presence of influential observations. The vertical axes present values relative to the generalisation errors, while the horizontal axes correspond to the magnitude of the outlier. Averages are computed across different experiments, with colored shades representing $\pm\sigma$. (a) Results for the $\ell_{2}^{2}$ loss; (b) results for the scaled $\ell_{1}$ loss.

Here, we focus on the performance comparison between our method and IS-CV in Bayesian regression with influential observations. We work with the following Bayesian regression model from Peruggia (1997): let $n$ be the sample size and let $X_{i}\in\mathbb{R}^{d}$ ($i=1,\ldots,n$) be the $d$ covariates. Then,

\begin{split}Y_{i}\mid X_{i},\beta&\sim\mathcal{N}(X_{i}^{\top}\beta,\sigma),\quad i=1,\ldots,n,\\ \beta\mid\Sigma&\sim\mathcal{N}(0,\Sigma),\\ \sigma^{2}&\sim\mathrm{IG}(a_{\sigma},b_{\sigma}),\\ \Sigma&\sim\mathrm{IW}(\kappa\,I_{d\times d},\kappa),\end{split} \qquad (13)

where $\mathrm{IG}$ denotes the inverse gamma distribution, $\mathrm{IW}$ denotes the inverse Wishart distribution, and $I_{d\times d}$ denotes the $d\times d$ identity matrix. The three quantities $a_{\sigma},\,b_{\sigma},\,\kappa$ are hyperparameters, and we fix $a_{\sigma}=b_{\sigma}=\kappa=1$. The score function $s(x,\theta)$ in this subsection is simply the log-likelihood. For the loss functions, we use the $\ell_{2}$, $\ell_{1}$, scaled $\ell_{2}$, and scaled $\ell_{1}$ losses:

\nu_{\ell_{2}}(Y_{i},X_{i},\beta,\sigma) =|Y_{i}-X_{i}^{\top}\beta|^{2}\quad\text{and}\quad\nu_{\ell_{1}}(Y_{i},X_{i},\beta,\sigma)=|Y_{i}-X_{i}^{\top}\beta|,
\nu_{\text{scaled}\,\ell_{2}}(Y_{i},X_{i},\beta,\sigma) =|Y_{i}-X_{i}^{\top}\beta|^{2}/\sigma^{2},\quad\text{and}\quad\nu_{\text{scaled}\,\ell_{1}}(Y_{i},X_{i},\beta,\sigma)=|Y_{i}-X_{i}^{\top}\beta|/\sigma.

We begin with the following synthetic Bayesian regression model: for a given $R$, the covariates $\widetilde{X}_{1},\ldots,\widetilde{X}_{n}$ are fixed to

\widetilde{X}_{i} =\begin{cases}0.01i&\text{if $i<n$},\\ R&\text{if $i=n$},\end{cases}

and then $Y_{1},\ldots,Y_{n}$ are generated as

Y_{i} =\beta_{0}+\widetilde{X}_{i}\beta_{1}+\sigma\varepsilon_{i},

where the $\varepsilon_{i}$ are i.i.d. from the standard Gaussian distribution, and $\beta_{0}$, $\beta_{1}$, and $\sigma$ are unknown parameters. We assign $(0,1,1)$ to the true values of $(\beta_{0},\beta_{1},\sigma)$ and set the sample size $n$ to 50. For each $R$, we simulate the training dataset 50 times and obtain 50 values of $\mathrm{PCIC}_{\mathrm{G}}$, PSIS-CV, and IS-CV. We then calculate the average of the generalisation errors using a test dataset with a sample size of 10. Using the same Gibbs sampler as in Peruggia (1997), we obtained 2000 MCMC samples after thinning by 10 and a burn-in period of length 10000.

Figure 5 displays the comparison between $\mathrm{PCIC}_{\mathrm{G}}$ and IS-CV. Consider $\nu_{\ell_{2}}$. For $R\leq 2$, the performance of the two methods does not differ. For $R\geq 3$, the bias of IS-CV relative to the average of the generalisation errors is larger than that of $\mathrm{PCIC}_{\mathrm{G}}$. Consider $\nu_{\mathrm{scaled}\,\ell_{1}}$. In this case, for all $R$, the bias of IS-CV relative to the average of the generalisation errors is larger than those of $\mathrm{PCIC}_{\mathrm{G}}$ and PSIS-CV. In particular, PSIS-CV performs best for this loss.

We next employ two real datasets containing influential observations. One is the stack loss data: the data consist of $n=21$ days of measurements with three condition records $x_{i}$. The other is the Gesell data: this consists of $n=21$ children's records of the age $x_{i}$ at which they first spoke and their Gesell Adaptive Score $y_{i}$. These datasets are known to contain influential observations; the indices of the influential observations are $i=21$ and $i=18$, respectively. In the experiments, we first randomly split the original datasets into training data of $\lfloor n/2\rfloor$ samples and test data of $n-\lfloor n/2\rfloor$ samples, repeating this 20 times. We then obtain posterior samples with different sample sizes, calculate the empirical errors and the generalisation error estimates using the training data, and calculate the average of the generalisation errors using the test data.

Figures D.1 and D.2 display the results. Overall, $\mathrm{PCIC}_{\mathrm{G}}$ and IS-CV work better than PSIS-CV for the unscaled losses (the $\ell_{1}$ and $\ell_{2}$ losses), whereas PSIS-CV works best for the scaled losses. One reason is that the scaled losses behave similarly to the log-likelihood (in fact, half the scaled $\ell_{2}$ loss equals the negative log-likelihood up to terms not involving $\beta$), and the weight smoothing in PSIS-CV depends only on the log-likelihood.

3.4 Application to eliminating biases due to strong priors

When strong priors are used, WAIC is known to have a bias in generalisation error estimation (e.g., Vehtari et al., 2017; Ninomiya, 2021). Here we employ $\mathrm{PCIC}_{\mathrm{G}}$ to eliminate this bias. We begin by focusing on a simple example that admits a fully analytic calculation and then provide a scheme applicable to general models.

Analytic calculation in a location-shift model: Consider a simple location-shift model, where the observations $X^{n}=(X_{1},\ldots,X_{n})$ follow

X_{i}=\theta^{*}+\varepsilon_{i},\quad i=1,\ldots,n.

Here, $\theta^{*}$ is a vector in $\mathbb{R}^{d}$, and the error terms $\varepsilon_{1},\ldots,\varepsilon_{n}$ are i.i.d. from a possibly non-Gaussian distribution with mean zero and identity covariance matrix. Consider the generalised posterior distribution given by

\pi(\theta;X^{n})\propto\exp{\left\{-\beta\frac{\sum_{i=1}^{n}\|X_{i}-\theta\|^{2}}{2}-\frac{\|\theta\|^{2}}{2\tau}\right\}}\quad\text{with}\quad\beta>0\ \text{and}\ \tau>0,

where $\|\cdot\|$ is the $\ell_{2}$ norm in $\mathbb{R}^{d}$; then the generalised posterior distribution is $\mathcal{N}(\widehat{\theta},S)$ with

\widehat{\theta}=\frac{n\beta\tau}{n\beta\tau+1}\overline{X}\quad\text{and}\quad S=\frac{1}{n\beta+1/\tau}I_{d},

where $\overline{X}:=\sum_{i=1}^{n}X_{i}/n$ and $I_{d}$ is the $d\times d$ identity matrix; the score function in this case is $s(x,\theta)=-(\beta/2)\|x-\theta\|^{2}$.

Consider the evaluation function $\nu(x,\theta)=(x-\theta)^{\top}A(x-\theta)$ with a symmetric positive definite matrix $A\in\mathbb{R}^{d\times d}$. Then the Gibbs generalisation gap is given by

\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]-\mathbb{E}[\mathcal{E}_{\mathrm{G},n}]=\frac{2}{n}\frac{n\beta\tau}{n\beta\tau+1}\mathrm{tr}(A), \qquad (14)

while the posterior covariance is given by

\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]=-\frac{2}{n}\frac{n\beta\tau}{n\beta\tau+1}\frac{\sum_{i=1}^{n}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}}{n}-\frac{\beta}{(n\beta+1/\tau)^{2}}\mathrm{tr}(A), \qquad (15)

where $\widetilde{X}_{i}:=X_{i}-\widehat{\theta}$; the detailed calculations are given in Appendix C. Thus, we obtain

\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]-\mathbb{E}[\mathcal{E}_{\mathrm{G},n}]
=-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]\right]+\frac{2}{n}\frac{n\beta\tau}{n\beta\tau+1}\frac{(\theta^{*})^{\top}A\theta^{*}}{(n\beta\tau+1)^{2}}+\mathrm{rem}, \qquad (16)

where we let

\mathrm{rem}=\frac{\beta\mathrm{tr}(A)}{(n\beta+1/\tau)^{2}}+\left\{\left(\frac{n\beta\tau}{n\beta\tau+1}\right)^{3}-2\left(\frac{n\beta\tau}{n\beta\tau+1}\right)^{2}\right\}\frac{2\mathrm{tr}(A)}{n^{2}}.
Figure 6: Bias correction of WAIC for the Poisson model (a) and the multivariate Cauchy model (b). The vertical axes denote the values relative to the generalisation errors, while the horizontal axes denote the scaling factor of the true values.

From (16), we conclude that in simple location-shift models, under the assumption

(\theta^{*})^{\top}A\theta^{*}/(n\beta\tau+1)^{2}=o(1), \qquad (17)

even the naive $\mathrm{PCIC}_{\mathrm{G}}$ estimates the Gibbs generalisation error well, regardless of the dimension $d$ and the distribution of the error term. For non-strong priors ($\tau=O(1)$), assumption (17) holds. For strong priors ($1/\tau\sim n$), we cannot expect the assumption to hold, and the bias term

\frac{2}{n}\frac{n\beta\tau}{n\beta\tau+1}\frac{(\theta^{*})^{\top}A\theta^{*}}{(n\beta\tau+1)^{2}}

remains. Our generalised Bayesian framework provides a simple modification that removes this bias: add the log-prior density (up to a constant), divided by $n$, to $s(x,\theta)$:

s^{\prime}(x,\theta) =s(x,\theta)+(1/n)\log\pi(\theta)
=s(x,\theta)-(1/n)\|\theta\|^{2}/(2\tau)-(d/(2n))\log(2\pi\tau).

Then, we have

\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]-\mathbb{E}[\mathcal{E}_{\mathrm{G},n}]=-\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s^{\prime}(X_{i},\theta)]\right]+\mathrm{rem}_{2}, \qquad (18)

where $\mathrm{rem}_{2}$ is an $O(n^{-2})$ term independent of $\theta^{*}$. This implies that the bias term in the naive $\mathrm{PCIC}_{\mathrm{G}}$ can be removed by changing the naive average posterior covariance to

\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}\left[\nu(X_{i},\theta)\,,\,s(X_{i},\theta)+\frac{1}{n}\log\pi(\theta)\right].

A scheme for general models: The above scheme is applicable to eliminating the bias of $\mathrm{WAIC}_{2}$ in general models by changing the average posterior variance to

\frac{1}{n}\sum_{i=1}^{n}\mathrm{Cov}_{\mathrm{pos}}\left[\log p(X_{i}\mid\theta)\,,\,\log p(X_{i}\mid\theta)+\frac{1}{n}\log\pi(\theta)\mid X^{n}\right].
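In practice, this correction only changes the second argument of the posterior covariance. The following is a minimal sketch, assuming the per-draw log-likelihood matrix and log-prior values have already been computed (the function name and array layout are ours):

```python
import numpy as np

def corrected_bias_term(loglik_draws, logprior_draws):
    """Average posterior covariance with the log-prior correction of Section 3.4.

    loglik_draws : (M, n) values of log p(X_i | theta_k) over posterior draws.
    logprior_draws : (M,) values of log pi(theta_k), up to a constant.
    """
    n = loglik_draws.shape[1]
    s_prime = loglik_draws + logprior_draws[:, None] / n       # corrected score s'(X_i, theta_k)
    cov = ((loglik_draws * s_prime).mean(axis=0)
           - loglik_draws.mean(axis=0) * s_prime.mean(axis=0)) # per-observation posterior covariance
    return cov.mean()
```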

We check the effectiveness of this scheme using the Poisson model and the Cauchy model. Figure 6 shows the results for the Poisson model with the conjugate gamma prior and for the multivariate Cauchy model with a Cauchy prior; that is, the observation and the prior in the Poisson model are described as

X\sim\mathrm{Po}(\lambda_{0})\quad\text{and}\quad\lambda\sim\mathrm{Gamma}(1,1),

while those in the Cauchy model are described as

X\sim\mathrm{Multivariate}\,\mathcal{C}(\mu_{0}\,,\,I_{5\times 5})\quad\text{and}\quad\mu\sim\mathrm{Multivariate}\,\mathcal{C}(0\,,\,I_{5\times 5}).

After setting $\lambda_{0}=\mu$ and $\mu_{0}=\mu\mathbbm{1}$, we vary the scaling factor $\mu$. In the Poisson conjugate model, the bias correction works successfully, as depicted in Figure 6 (a). In the Cauchy model, the bias correction does work, but an additional non-negligible bias appears. Note that such a bias also appears with large data scaling in the application to Bayesian hierarchical models; see Figure 3.

4 Discussions

In this section, we discuss the following three topics:

  • the applicability in high dimensions;

  • the application to defining a case-influence measure; and

  • possibility of different definitions of generalisation errors.

4.1 Applicability in high dimensions

We first check the applicability of our methodologies, as well as IS-CV and PSIS-CV, in high-dimensional models. Viewed from a different perspective, the arguments for the location-shift model in Subsection 3.4 suggest that PCIC can be applied even in high-dimensional settings. This does not contradict the results of Okuno and Yano (2023), where WAIC is shown to work even in overparameterised linear regression models. However, since these results fully rely on the models' structures, further experiments are required.

Here, we employ the Poisson sequence model, a canonical model for count-data analysis. Incorporating the high-dimensional structure with Poisson sequence model becomes important in recent count-data analysis (Komaki, 2006; Datta and Dunson, 2016; Yano et al., 2021; Hamura et al., 2022; Paul and Chakrabarti, 2024). The working model here is

Y(i)=(Y1(i)Y2(i)Yd(i))λ=(λ1λ2λd)j=1dPoisson(λj)(i=1,,n),Y(i)=\begin{pmatrix}Y_{1}(i)\\ Y_{2}(i)\\ \vdots\\ Y_{d}(i)\end{pmatrix}\quad\mid\quad\lambda=\begin{pmatrix}\lambda_{1}\\ \lambda_{2}\\ \vdots\\ \lambda_{d}\end{pmatrix}\quad\sim\quad\otimes_{j=1}^{d}\mathrm{Poisson}(\lambda_{j})\quad(i=1,\ldots,n),

where ii indexes the observations, jj indexes the coordinates, and \otimes denotes the product of measures. In this model, each observation Y(i)Y(i) given λ\lambda follows the dd-dimensional independent Poisson distribution. For example, in spatio-temporal count-data analysis, ii may index the year and jj may index the observation site, such as a district; see Datta and Dunson (2016); Yano et al. (2021); Hamura et al. (2022) for details.

Figure 7: Application of PCICG\mathrm{PCIC}_{\mathrm{G}}, PSIS-CV, and IS-CV to high-dimensional Poisson models. Subfigure (a) displays the results for 2\ell_{2} loss; Subfigure (b) displays the results for 2\ell_{2} loss, where PCICG\mathrm{PCIC}_{\mathrm{G}} is equal to WAIC2\mathrm{WAIC}_{2} in this case.
Figure 8: Application of PCICG\mathrm{PCIC}_{\mathrm{G}}, PSIS-CV, and IS-CV to the Japanese pickpocket data (d=978)(d=978). The vertical axes exhibit values of the generalisation error estimates, while the horizontal axes depict the sample size. Subfigure (a) displays the results for 22\ell_{2}^{2} loss; Subfigure (b) displays the results for 1\ell_{1} loss; Subfigure (c) displays the results for 1\ell_{1} loss of square roots (νsr\nu_{\mathrm{sr}}).

We start by checking the behaviour of the generalisation error estimates as the number dd of dimensions changes. We fix the sample size nn at 10. In this numerical experiment, we work with the Stein-type shrinkage prior proposed by Komaki (2006):

π(λ)\displaystyle\pi(\lambda) =λ1β11λdβd1(λ1++λd)α,\displaystyle=\frac{\lambda_{1}^{\beta_{1}-1}\cdots\lambda_{d}^{\beta_{d}-1}}{(\lambda_{1}+\cdots+\lambda_{d})^{\alpha}},

where α>0\alpha>0 and β=(β1,,βd)\beta=(\beta_{1},\ldots,\beta_{d}). The reason for this prior choice is that it is non-informative, efficiently combines information across coordinates, and admits an efficient exact sampling algorithm. We set β=(3,,3)\beta=(3,\ldots,3) and α=j=1dβj1\alpha=\sum_{j=1}^{d}\beta_{j}-1. For the true values of the λj\lambda_{j}s, we set

λj\displaystyle\lambda_{j} ={0.001ifjis odd,2otherwise(j=1,,d).\displaystyle=\begin{cases}0.001&\text{if}\,\,j\,\,\text{is odd},\\ 2&\text{otherwise}\end{cases}\quad(j=1,\ldots,d).
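For concreteness, the synthetic data in this experiment can be generated as in the following sketch (the seed and function name are ours):

```python
import numpy as np

def simulate_poisson_sequence(n=10, d=100, seed=0):
    """Simulate the setting above: lambda_j = 0.001 for odd j and 2 for even j
    (1-based indexing), with n independent d-dimensional observations."""
    rng = np.random.default_rng(seed)
    lam = np.where(np.arange(1, d + 1) % 2 == 1, 0.001, 2.0)
    Y = rng.poisson(lam, size=(n, d))   # row i is the observation Y(i)
    return Y, lam
```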

Figure 7 depicts the success of PCICG\mathrm{PCIC}_{\mathrm{G}} even in high-dimensional models, though we have not established this theoretically. Okuno and Yano (2023) show the success of WAIC2\mathrm{WAIC}_{2} even in overparameterised linear regression models; Giordano and Broderick (2023) discuss the Bayesian infinitesimal jackknife approximation of the variance in high dimensions. Together with our results, these might open a new research direction focused on understanding the behaviour of the posterior covariance in high dimensions. Such an understanding might require further theoretical investigations into the approximation of cross validation in high dimensions (e.g., Rad and Maleki, 2020) and the Bayesian central limit theorem in high dimensions (e.g., Panov and Spokoiny, 2015; Yano and Kato, 2020; Kasprzaki et al., 2023).

We next check the behaviour in an application to real data. We employ the Japanese pickpocket data from the Tokyo Metropolitan Police Department (2025). These data report the total number of pickpocketing incidents in each year in Tokyo Prefecture, classified by town and by type of crime. We use the pickpocket data from 2012 to 2018 for 978 towns in 8 wards; thus, we work with the Poisson sequence model (d=978d=978, n=7n=7, where nn is the number of years used in the analysis). For N=2,3,4,5,6N=2,3,4,5,6, we consider all combinations of selecting NN out of the n=7n=7 years, using each combination as training data and the remaining years as test data, as sketched below. For each training and test pair, we calculate the generalisation error as well as its estimates (PCICG\mathrm{PCIC}_{\mathrm{G}}, PSIS-CV, IS-CV, and the empirical error), and take their averages.
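A minimal sketch of this train/test enumeration (the function name is ours; the full analysis code is in the released repository) is:

```python
from itertools import combinations

def year_splits(years, N):
    """Enumerate all splits that use N training years; the rest form the test set."""
    years = list(years)
    for train in combinations(years, N):
        yield list(train), [y for y in years if y not in train]

# Example: all splits of the 2012-2018 data with N = 3 training years.
# for train_years, test_years in year_splits(range(2012, 2019), 3):
#     ...
```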

Figure 8 displays the results for three loss functions: 22\ell_{2}^{2} loss, 1\ell_{1} loss, and

νsr(Y(i),λ)=j=1d|Yj(i)λj|.\nu_{\mathrm{sr}}(Y(i),\lambda)=\sum_{j=1}^{d}\left|\sqrt{Y_{j}(i)}-\sqrt{\lambda_{j}}\right|.

For all loss functions, PCICG\mathrm{PCIC}_{\mathrm{G}} estimates the generalisation error well.
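For reference, the three loss functions can be written as short numpy routines (a sketch; the function names are ours):

```python
import numpy as np

def loss_l2_squared(y, lam):
    """Squared l2 loss between an observation y and a rate vector lam."""
    return np.sum((y - lam) ** 2)

def loss_l1(y, lam):
    """l1 loss."""
    return np.sum(np.abs(y - lam))

def loss_sqrt_l1(y, lam):
    """l1 loss of square roots (nu_sr in the text)."""
    return np.sum(np.abs(np.sqrt(y) - np.sqrt(lam)))
```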

4.2 Application to defining a case-influence measure

Figure 9: Influential observations in the Gesell data via MνM_{\nu} for the generalisation. The values are divided by their maximum. Subfigures (a) and (b) exhibit results for the posterior with the hyperparameter κ\kappa in equation (13) set to be 11. Subfigures (c) and (d) exhibit those for κ\kappa in equation (13) set to 1010.

We next consider an application of the proposed method to defining yet another case-influence measure. From the perspective of estimating the generalisation error, the posterior covariance

Mν,i:=Covpos[ν(Xi,θ),s(Xi,θ)]\displaystyle M_{\nu,i}:=\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]

expresses the contribution of each observation to filling the generalisation gap. So, we can utilise it to define yet another Bayesian case-influence measure in prediction. When s(,)s(\cdot,\cdot) is the log-likelihood, this coincides with the Bayesian local case sensitivity defined in Millar and Stewart (2007); Thomas et al. (2018). Note that this quantity is not always positive except for ν(,)=s(,)\nu(\cdot,\cdot)=s(\cdot,\cdot); so we standardise it and take the absolute value.
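Given posterior draws, the standardised measure can be computed in a few lines of numpy. The sketch below (the array layout and function name are ours) mirrors the normalisation used in Figure 9, where the values are divided by their maximum.

```python
import numpy as np

def case_influence(loss, score):
    """Standardised |M_{nu,i}| across observations (a sketch).

    loss  : (S, n) array of nu(X_i, theta_s) over S posterior draws
    score : (S, n) array of s(X_i, theta_s), e.g. the log-likelihood
    """
    loss_c = loss - loss.mean(axis=0, keepdims=True)
    score_c = score - score.mean(axis=0, keepdims=True)
    m = np.abs((loss_c * score_c).mean(axis=0))   # |Cov_pos[nu_i, s_i]| for each i
    return m / m.max()
```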

Figure 9 showcases how the measure MνM_{\nu} works on the Gesell data discussed in Subsection 3.3, where we use the unscaled 22\ell_{2}^{2} and the scaled 22\ell_{2}^{2} losses, and the score function s(x,θ)s(x,\theta) is the log-likelihood. For the relatively light-tailed prior (κ=1\kappa=1), the top-1 influential observation is the same for both losses (Figure 9 (a)). For the relatively heavy-tailed prior (κ=10\kappa=10), the top-1 influential observations differ (Figure 9 (c)).

Here we compare our results with those in Millar and Stewart (2007), where two types of sensitivities (parameter-scale and observation-scale) are defined. First, our results with the unscaled loss are similar to their observation-scale sensitivity results, and our results with the scaled loss are similar to their parameter-scale sensitivity results. Second, our results with the relatively light-tailed prior are similar to Millar and Stewart (2007)'s results in the known-variance case, and our results with the relatively heavy-tailed prior are similar to their results in the unknown-variance case with Bernardo's reference prior. Our results, as well as those of Millar and Stewart (2007) and Thomas et al. (2018), suggest that the case influences vary depending on the choice of loss function; our results, together with Millar and Stewart (2007), also suggest that they vary depending on the prior specification.

4.3 Possibility of different definitions of generalisation errors

Although we consider only the Gibbs and plug-in generalisation errors, one could, following the literature (e.g., Ando, 2007; Underhill and Smith, 2016), alternatively define the generalisation error as

𝒢S,n=𝔼Xn+1[S(Xn+1,𝔼pos[p(θ)])],\mathcal{G}_{\mathrm{S},n}=\mathbb{E}_{X_{n+1}}\Bigl{[}S\bigl{(}X_{n+1},\,\mathbb{E}_{\mathrm{pos}}[p(\cdot\mid\theta)]\bigr{)}\Bigr{]},

where S(,)S(\cdot,\cdot) is a scoring rule (Gneiting and Raftery, 2007; Dawid et al., 2016), and {p(xθ)}\{p(x\mid\theta)\} is a parametric model. This may be viewed as a natural extension of the generalisation error used in WAIC (Watanabe, 2010b).

In the binary prediction setting discussed in Subsection 3.1, the plug-in generalisation error coincides with this definition. However, our current methodologies do not accommodate this scoring-rule-based generalisation error in more general prediction scenarios. We have not pursued this direction in this paper because it remains unclear whether the use of the Bayesian predictive density 𝔼pos[p(θ)]\mathbb{E}_{\mathrm{pos}}[p(\cdot\mid\theta)] is theoretically justified: while the Bayesian predictive density is the Bayes solution under the log score (Aitchison, 1975), it is not necessarily the Bayes solution under other loss functions (see, e.g., Corcuera and Giummolè, 1999).

Recently, a quasi-Bayesian framework associated with scoring rules has been developed (Giummolè et al., 2019; Matsubara et al., 2024). Extending our understanding of scoring-rule-based generalisation errors in conjunction with such quasi-Bayesian theory would therefore be an important direction for future research.

5 Conclusion

We have proposed a novel, computationally low-cost method for estimating the Gibbs and plugin generalisation errors for arbitrary loss functions. We have demonstrated the usefulness of the proposed method in privacy-preserving learning, Bayesian hierarchical modeling, and Bayesian regression modeling in the presence of influential observations, and have discussed its applicability to high-dimensional statistical models.

Our numerical experiments suggest that IS-CV, PSIS-CV, and PCIC are useful tools for assessing generalisation errors. Notably, PCIC remains effective even in challenging settings such as high-dimensional models (Subsection 4.1) or data of large magnitude (Subsection 3.2). Detailed theoretical analyses remain an important direction for future research.

An important practical implication of this study is that the posterior covariance provides an easy-to-implement generalisation error estimate for arbitrary loss functions, avoiding both the cumbersome refitting required by LOOCV and the importance sampling techniques that are sensitive to the presence of influential observations. Moreover, our proof of asymptotic unbiasedness clarifies theoretical connections between WAIC, Bayesian sensitivity analysis, and the infinitesimal jackknife approximation of Bayesian LOOCV.

Acknowledgement

The authors would like to thank the handling editor, the associate editor, and two anonymous referees for their comments, which have improved the quality of the paper. The authors would also like to thank Akifumi Okuno, Hironori Fujisawa, Yoshiyuki Ninomiya, and Yusaku Ohkubo for fruitful discussions.

Funding

This work was supported by Japan Society for the Promotion of Science (JSPS) [Grant Nos. 19K20222, 21H05205, 21K12067], the Japan Science and Technology Agency’s Core Research for Evolutional Science and Technology (JST CREST) [Grant No. JPMJCR1763], and the MEXT Project for Seismology toward Research Innovation with Data of Earthquake (STAR-E) [Grant No. JPJ010217].

Code availability statements

The Python code is available at https://github.com/kyanostat/PCIC4GL.

References

  • Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K. and Zhang, L. (2016) Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318.
  • Abril-Pla et al. (2023) Abril-Pla, O., Andreani, V., Carroll, C., Dong, L., Fonnesbeck, C. J., Kochurov, M., Kumar, R., Lao, J., Luhmann, C. C., Martin, O. A. et al. (2023) Pymc: a modern, and comprehensive probabilistic programming framework in python. PeerJ Computer Science, 9, e1516.
  • Aitchison (1975) Aitchison, J. (1975) Goodness of prediction fit. Biometrika, 62, 547–554.
  • Akaike (1973) Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory (eds. B. Petrov and F. Csáki), 267–281.
  • Ando (2007) Ando, T. (2007) Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 94, 443–458.
  • Beirami et al. (2017) Beirami, A., Razaviyayn, M., Shahrampour, S. and Tartokh, V. (2017) On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems, 3458–3468.
  • Bissiri et al. (2016) Bissiri, P., Holmes, C. and Walker, S. (2016) A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 1103–1130.
  • Chernozhukov and Hong (2003) Chernozhukov, V. and Hong, H. (2003) An MCMC approach to classical estimation. Journal of Econometrics, 115, 293–346.
  • Corcuera and Giummolè (1999) Corcuera, J. and Giummolè, F. (1999) A generalized Bayes rule for prediction. Scandinavian Journal of Statistics, 26, 265–279.
  • Datta and Dunson (2016) Datta, J. and Dunson, D. (2016) Bayesian inference on quasi-sparse count data. Biometrika, 103, 971–983.
  • Dawid et al. (2016) Dawid, A., Musio, M. and Ventura, L. (2016) Minimum scoring rule inference. Scandinavian Journal of Statistics, 43, 123–138.
  • Desfontaines and Pejó (2020) Desfontaines, D. and Pejó, B. (2020) Sok: Differential privacies. Proceedings on Privacy Enhancing Technologies, 2, 288–313.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017) UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
  • Dwork (2006) Dwork, C. (2006) Differential privacy. In Proceedings of the 33rd International conference on Automata, Languages and Programming, Part II, 1–12.
  • Geisser (1975) Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320–328.
  • Gelfand et al. (1992) Gelfand, A., Dey, D. and Chang, H. (1992) Model determination using predictive distributions with implementation via sampling-based methods. In Bayesian Statistics (eds. J. Bernardo, J. Berger, A. Dawid and A. Smith), 147–167. Oxford University Press.
  • Gelman et al. (2013) Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A. and Rubin, D. (2013) Bayesian Data Analysis, 3rd Edition. Chapman and Hall/CRC.
  • Giordano and Broderick (2023) Giordano, R. and Broderick, T. (2023) The Bayesian infinitesimal jackknife for variance. ArXiv:2305.06466.
  • Giordano et al. (2018) Giordano, R., Broderick, T. and Jordan, M. (2018) Covariances, robustness, and variational Bayes. Journal of Machine Learning Research, 19, 1–49. URL: http://jmlr.org/papers/v19/17-670.html.
  • Giordano et al. (2019) Giordano, R., Stephenson, W., Liu, R., Jordan, M. and Broderick, T. (2019) A Swiss army infinitesimal jackknife. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, 89.
  • Giummolè et al. (2019) Giummolè, F., Mameli, V., Ruli, E. and Ventura, L. (2019) Objective Bayesian inference with proper scoring rules. Test, 28, 728–755.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
  • Hamura et al. (2022) Hamura, H., Irie, K. and Sugasawa, S. (2022) On global-local shrinkage priors for count data. Bayesian Analysis, 17, 545–564.
  • Iba (2023) Iba, Y. (2023) W-kernel and essential subspace for frequencist’s evaluation of Bayesian estimators. ArXiv:2311.13017.
  • Iba and Yano (2023) Iba, Y. and Yano, K. (2023) Posterior covariance information criterion for weighted inference. Neural computation, 35, 1–22.
  • Jewson et al. (2023) Jewson, J., Ghalebikesabi, S. and Holmes, C. (2023) Differentially private statistical inference through β\beta-divergence one posterior sampling. Proceedings of 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
  • Jiang and Tanner (2008) Jiang, W. and Tanner, M. (2008) Gibbs posterior for variable selection in high-dimensional classification and data mining. The Annals of Statistics, 36, 2207–2231.
  • Kasprzaki et al. (2023) Kasprzaki, M., Giordano, R. and Broderick, T. (2023) How good is your Laplace approximation of the Bayesian posterior? finite-sample computable error bounds for a variety of useful divergences. ArXiv:2209.14992.
  • Koh and Liang (2017) Koh, P. and Liang, P. (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, 1885–1894.
  • Komaki (2006) Komaki, F. (2006) A class of proper priors for Bayesian simultaneous prediction of independent Poisson observables. Journal of Multivariate Analysis, 97, 1815–1828.
  • Konishi and Kitagawa (1996) Konishi, S. and Kitagawa, G. (1996) Generalised information criteria in model selection. Biometrika, 83, 875–890.
  • Matsubara et al. (2022) Matsubara, T., Knoblauch, J., Briol, F. and Oates, C. (2022) Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84, 997–1022.
  • Matsubara et al. (2024) Matsubara, T., Knoblauch, J., Briol, F.-X. and Oates, C. (2024) Generalized Bayesian inference for discrete intractable likelihood. Journal of the American Statistical Association, 119, 2345–2355.
  • Millar and Stewart (2007) Millar, R. and Stewart, W. (2007) Assessment of locally influential observations in Bayesian models. Bayesian Analysis, 2, 365–384.
  • Miller and Dunson (2019) Miller, J. and Dunson, D. (2019) Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114, 1113–1125.
  • Minami et al. (2016) Minami, K., Arai, H., Sato, I. and Nakagawa, H. (2016) Differential privacy without sensitivity. In Advances in Neural Information Processing Systems (eds. D. Lee, M. Sugiyama, U. Luxburg, I. Guyon and R. Garnett), vol. 29.
  • Mironov (2017) Mironov, I. (2017) Rényi differential privacy. 2017 IEEE 30th Computer Security Foundations Symposium (CSF), 263–275.
  • Nakagawa and Hashimoto (2020) Nakagawa, T. and Hashimoto, S. (2020) Robust Bayesian inference via γ\gamma-divergence. Communications in Statistics - Theory and Methods, 49, 343–360.
  • Ninomiya (2021) Ninomiya, Y. (2021) Prior intensified information criterion. ArXiv: 2110.12145.
  • Okuno and Yano (2023) Okuno, A. and Yano, K. (2023) A generalization gap estimation for overparameterized models via the Langevin functional variance. Journal of Computational and Graphical Statistics, 32, 1287–1295.
  • Panov and Spokoiny (2015) Panov, M. and Spokoiny, V. (2015) Finite sample Bernstein–von Mises theorem for semiparametric problems. Bayesian Analysis, 10, 665–710.
  • Paul and Chakrabarti (2024) Paul, S. and Chakrabarti, A. (2024) Asymptotic Bayes optimality for sparse count data. URL: https://arxiv.org/abs/2401.05693.
  • Pérez et al. (2006) Pérez, C., Martin, J. and Rufo, M. (2006) MCMC-based local parametric sensitivity estimations. Computational Statistics & Data Analysis, 51, 823–835.
  • Peruggia (1997) Peruggia, M. (1997) On the variability of case-deletion importance sampling weights in the Bayesian linear model. Journal of the American Statistical Association, 92.
  • Polson et al. (2013) Polson, N., Scott, J. and Windle, J. (2013) Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
  • Rad and Maleki (2020) Rad, K. and Maleki, A. (2020) A scalable estimate of the extra-sample prediction error via approximate leave-one-out. Journal of the Royal Statistical Society: Series B, 82, 965–996.
  • Ribatet et al. (2012) Ribatet, M., Cooley, D. and Davison, A. (2012) Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica, 22, 813–845.
  • Shibata (1989) Shibata, R. (1989) Statistical aspects of model selection, 215–240. Springer-Verlag.
  • Silva and Zanella (2023) Silva, L. and Zanella, G. (2023) Robust leave-one-out cross-validation for high-dimensional Bayesian models. Journal of the American Statistical Association. In press.
  • Spiegelhalter et al. (2002) Spiegelhalter, D., Best, N., Carlin, B. and van der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64, 583–639.
  • Stone (1974) Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society: Series B, 36, 111–147.
  • Takeuchi (1976) Takeuchi, K. (1976) Distribution of informational statistics and a criterion of model fitting (in Japanese). Suri-Kagaku (Mathematical Sciences), 153, 12–18.
  • Thomas et al. (2018) Thomas, Z., MacEachern, S. and Peruggia, M. (2018) Reconciling curvature and importance sampling based procedures for summarizing case influence in Bayesian models. Journal of the American Statistical Association, 113, 1669–1683.
  • Tokyo Metropolitan Police Department (2025) Tokyo Metropolitan Police Department (2025) The number of crimes in Tokyo Prefecture by town and type. https://www.keishicho.metro.tokyo.lg.jp/about_mpd/jokyo_tokei/jokyo/.
  • Underhill and Smith (2016) Underhill, N. and Smith, J. (2016) Context-dependent score based Bayesian information criteria. Bayesian Analysis, 11, 1005–1033.
  • Vehtari (2002) Vehtari, A. (2002) Discussion of "Bayesian measures of model complexity and fit" by Spiegelhalter et al. Journal of the Royal Statistical Society: Series B, 64, 620.
  • Vehtari et al. (2017) Vehtari, A., Gelman, A. and Gabry, J. (2017) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
  • Vehtari and Lampinen (2003) Vehtari, A. and Lampinen, J. (2003) Expected utility estimation via cross-validation. In Bayesian statistics (eds. J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West). Oxford University Press.
  • Vehtari and Ojanen (2012) Vehtari, A. and Ojanen, J. (2012) A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6, 142–228.
  • Vehtari et al. (2024) Vehtari, A., Simpson, D., Gelman, A., Yao, Y. and Gabry, J. (2024) Pareto smoothed importance sampling. Journal of Machine Learning Research, 25, 1–58.
  • Wang et al. (2015a) Wang, Y.-X., Fienberg, S. and Smola, A. (2015a) Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2493–2502.
  • Wang et al. (2015b) — (2015b) Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. Proceeding of the 32nd International Conference on Machine Learning (ICML’15), 2493–2502.
  • Watanabe (2010a) Watanabe, S. (2010a) Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
  • Watanabe (2010b) — (2010b) Equations of states in singular statistical estimation. Neural Networks, 23, 20–34.
  • Watanabe (2018) — (2018) Mathematical Theory of Bayesian Statistics. Chapman & Hall/CRC.
  • Yano et al. (2021) Yano, K., Kaneko, R. and Komaki, F. (2021) Minimax predictive density for sparse count data. Bernoulli, 27, 1212–1238.
  • Yano and Kato (2020) Yano, K. and Kato, K. (2020) On frequentist coverage errors of Bayesian credible sets in moderately high dimensions. Bernoulli, 26, 616–641.
  • Yusuf et al. (1985) Yusuf, S., Peto, R., Lewis, J., Collins, R. and Sleight, P. (1985) Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases, 27, 335–371.

Appendix A Conditions for the theorem

The conditions in this paper are as follows: In the conditions, ν\nu is redefined by subtracting minθΘ𝔼[ν(Xn+1,θ)]\min_{\theta^{\prime}\in\Theta}\mathbb{E}[\nu(X_{n+1},\theta^{\prime})] from the original ν\nu.

  1. (C1)

    The difference {𝔼Xn1[𝒢G,n1]𝔼[𝒢G,n]}\{\mathbb{E}_{X^{n-1}}[\mathcal{G}_{\mathrm{G},n-1}]-\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]\} is of the order o(𝔼[𝒢G,n])o(\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]);

  2. (C2)

    The following relation holds:

    𝔼[supw𝒲(1)|𝔼posw[{ν(X1,θ)𝔼posw[ν(X1,θ)]}{s(X1,θ)𝔼posw[s(X1,θ)]}2]|]\displaystyle\mathbb{E}\left[\sup_{w\in\mathcal{W}^{(-1)}}\left|\mathbb{E}_{\mathrm{pos}}^{w}\left[\left\{\nu(X_{1},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{1},\theta)]\right\}\left\{s(X_{1},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{1},\theta)]\right\}^{2}\right]\right|\right]
    =o(𝔼[𝒢G,n]);\displaystyle=o(\mathbb{E}[\mathcal{G}_{\mathrm{G},n}]);
  3. (C3)

    There exists an integrable function M(θ)M(\theta) such that for all w𝒲(1)w\in\mathcal{W}^{(-1)},

    |π(θ)exp{w1s(X1,θ)+k1s(Xk,θ)}νl(X1,θ)sj(X1,θ)|\displaystyle\left|\pi(\theta)\exp\left\{w_{1}s(X_{1},\theta)+\sum_{k\neq 1}s(X_{k},\theta)\right\}\nu^{l}(X_{1},\theta)s^{j}(X_{1},\theta)\right|
    M(θ),l=0,1,j=0,1.\displaystyle\leq M(\theta),\quad l=0,1,\,j=0,1.
Remark A.1 (Discussion on the conditions).

First, consider Condition C1. For regular statistical models and smooth evaluation functions, the result of Underhill and Smith (2016) implies that the order of 𝔼[𝒢G,n]\mathbb{E}[\mathcal{G}_{G,n}] is 1/n1/n; that is, the expected generalisation error based on the recentred evaluation function

ν~(Xi,θ)=ν(Xi,θ)minθΘ𝔼[ν(Xn+1,θ)]\widetilde{\nu}(X_{i},\theta)=\nu(X_{i},\theta)-\min_{\theta^{\prime}\in\Theta}\mathbb{E}[\nu(X_{n+1},\theta^{\prime})]

is of order 1/n1/n. We briefly sketch the argument. Let θ\theta^{*} be the minimizer of 𝔼[ν(Xn+1,θ)]\mathbb{E}[\nu(X_{n+1},\theta^{\prime})]. A Taylor expansion yields

𝔼pos[ν(Xn+1,θ)Xn]\displaystyle\mathbb{E}_{\mathrm{pos}}[\nu(X_{n+1},\theta)\mid X^{n}] =𝔼pos[ν(Xn+1,θ)Xn]+θν(Xn+1,θ)𝔼pos[θθXn]\displaystyle=\mathbb{E}_{\mathrm{pos}}[\nu(X_{n+1},\theta^{*})\mid X^{n}]+\nabla_{\theta^{*}}\nu(X_{n+1},\theta)\mathbb{E}_{\mathrm{pos}}[\theta-\theta^{*}\mid X^{n}]
+12𝔼pos[(θθ)θ2ν(Xn+1,θ)(θθ)Xn]+oP(θθ2).\displaystyle\,+\frac{1}{2}\mathbb{E}_{\mathrm{pos}}[(\theta-\theta^{*})^{\top}\nabla_{\theta^{*}}^{2}\nu(X_{n+1},\theta)(\theta-\theta^{*})\mid X^{n}]+o_{P}(\|\theta-\theta^{*}\|^{2}).

Thus, we get

𝔼pos[ν~(Xn+1,θ)Xn]\displaystyle\mathbb{E}_{\mathrm{pos}}[\widetilde{\nu}(X_{n+1},\theta)\mid X^{n}] =θν(Xn+1,θ)𝔼pos[θθXn]=:T1\displaystyle=\underbrace{\nabla_{\theta^{*}}\nu(X_{n+1},\theta)\mathbb{E}_{\mathrm{pos}}[\theta-\theta^{*}\mid X^{n}]}_{=:T_{1}}
+12𝔼pos[(θθ)θ2ν(Xn+1,θ)(θθ)Xn]=:T2+oP(θθ2).\displaystyle\,+\frac{1}{2}\underbrace{\mathbb{E}_{\mathrm{pos}}[(\theta-\theta^{*})^{\top}\nabla_{\theta^{*}}^{2}\nu(X_{n+1},\theta)(\theta-\theta^{*})\mid X^{n}]}_{=:T_{2}}+o_{P}(\|\theta-\theta^{*}\|^{2}).

The order of the term T1T_{1} is controlled by the difference between the posterior mean and θ\theta^{*}, so it is of order 1/n1/n. The order of the term T2T_{2} is determined by Lemma 3 of Underhill and Smith (2016), so it is also of order 1/n1/n; together, these establish the order of the expected (recentred) generalisation error. For singular statistical models and log-likelihood losses, the result of Watanabe (2010b) implies that the order of 𝔼[𝒢G,n]\mathbb{E}[\mathcal{G}_{G,n}] (with its minimum subtracted) is 1/n1/n. Thus, Condition C1 usually holds.

Condition C2 is a condition on the residual term. For regular statistical models and smooth evaluation functions, a Taylor expansion implies that the order of the left-hand side is n3/2n^{-3/2}, which is smaller than the order of 𝔼[𝒢G,n]\mathbb{E}[\mathcal{G}_{G,n}]. For singular statistical models and log-likelihood losses, we refer to Watanabe (2018). Finally, Condition C3 is a mild condition ensuring the existence of the expectations involved.

Appendix B Proof for Lemma 2.3

This appendix provides the proof of Lemma 2.3.

The Lebesgue dominated convergence theorem ensures the exchange of differentiation and integration in (k/wik)𝔼posw[ν(Xi,θ)](\partial^{k}/\partial w_{i}^{k})\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)] under Condition C3. Then, considering

F(w):=exp{i=1nwis(Xi,θ)}π(θ)𝑑θF(w):=\int\exp\left\{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)\right\}\pi(\theta)d\theta

gives

wi𝔼posw[ν(Xi,θ)]\displaystyle\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)] =s(Xi,θ)ν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF(w)F2(w)\displaystyle=\frac{\int s(X_{i},\theta)\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta F(w)}{F^{2}(w)}
s(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF2(w),\displaystyle\quad-\frac{\int s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\int\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta}{F^{2}(w)},

which implies

wi𝔼posw[ν(Xi,θ)]=𝔼posw[{ν(Xi,θ)𝔼posw[ν(Xi,θ)]}{s(Xi,θ)𝔼posw[s(Xi,θ)]}].\displaystyle\frac{\partial}{\partial w_{i}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]=\mathbb{E}_{\mathrm{pos}}^{w}\left[\{\nu(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]\}\{s(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]\}\right].

Further, we have

wis(Xi,θ)ν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF(w)F2(w)\displaystyle\frac{\partial}{\partial w_{i}}\frac{\int s(X_{i},\theta)\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta F(w)}{F^{2}(w)}
=s2(Xi,θ)ν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF(w)\displaystyle=\frac{\int s^{2}(X_{i},\theta)\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta}{F(w)}
s(Xi,θ)ν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θs(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF2(w)\displaystyle\quad-\frac{\int s(X_{i},\theta)\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\int s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta}{F^{2}(w)}
=𝔼posw[s2(Xi,θ)ν(Xi,θ)]𝔼posw[s(Xi,θ)ν(Xi,θ)]𝔼posw[s(Xi,θ)]\displaystyle=\mathbb{E}_{\mathrm{pos}}^{w}[s^{2}(X_{i},\theta)\nu(X_{i},\theta)]-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)\nu(X_{i},\theta)]\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]

and

wis(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF2(w)\displaystyle\frac{\partial}{\partial w_{i}}\frac{\int s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\int\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta}{F^{2}(w)}
=s2(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF2(w)F4(w)\displaystyle=\frac{\int s^{2}(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\int\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta F^{2}(w)}{F^{4}(w)}
+s(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θν(Xi,θ)s(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF2(w)F4(w)\displaystyle\quad+\frac{\int s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\int\nu(X_{i},\theta)s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta F^{2}(w)}{F^{4}(w)}
2{s(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θ}2ν(Xi,θ)ei=1nwis(Xi,θ)π(θ)𝑑θF(w)F4(w)\displaystyle\quad-2\frac{\{\int s(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta\}^{2}\int\nu(X_{i},\theta)\mathrm{e}^{\sum_{i=1}^{n}w_{i}s(X_{i},\theta)}\pi(\theta)d\theta F(w)}{F^{4}(w)}
=𝔼posw[s2(Xi,θ)]𝔼posw[ν(Xi,θ)]+𝔼posw[s(Xi,θ)ν(Xi,θ)]𝔼posw[s(Xi,θ)]\displaystyle=\mathbb{E}_{\mathrm{pos}}^{w}[s^{2}(X_{i},\theta)]\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]+\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)\nu(X_{i},\theta)]\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]
2(𝔼posw[s(Xi,θ)])2𝔼posw[ν(Xi,θ)],\displaystyle\quad-2(\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)])^{2}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)],

which implies

2wi2𝔼posw[ν(Xi,θ)]=𝔼posw[{ν(Xi,θ)𝔼posw[ν(Xi,θ)]}{s(Xi,θ)𝔼posw[s(Xi,θ)]}2]\displaystyle\frac{\partial^{2}}{\partial w_{i}^{2}}\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]=\mathbb{E}_{\mathrm{pos}}^{w}\left[\{\nu(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[\nu(X_{i},\theta)]\}\{s(X_{i},\theta)-\mathbb{E}_{\mathrm{pos}}^{w}[s(X_{i},\theta)]\}^{2}\right]

and completes the proof. ∎
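As an informal numerical check of the first-derivative identity above, one can compare a finite-difference derivative with the posterior covariance on a toy one-dimensional model using a grid approximation of the weighted posterior; the model, grid, and step size below are illustrative assumptions.

```python
import numpy as np

# Check d/dw_i E_pos^w[nu(X_i, theta)] = Cov_pos^w[nu(X_i, theta), s(X_i, theta)]
# on a toy 1-D model with a grid approximation of the weighted posterior.
theta = np.linspace(-5, 5, 4001)
prior = np.exp(-theta**2 / 2)                        # unnormalised N(0, 1) prior
x = np.array([0.3, -0.7, 1.2])                       # three observations
s = -(x[:, None] - theta[None, :])**2 / 2            # s(X_i, theta): Gaussian log-likelihood up to a constant
nu = np.abs(x[:, None] - theta[None, :])             # nu(X_i, theta): an l1-type loss

def post_mean(f, w):
    """E_pos^w[f(theta)] under the posterior proportional to pi(theta) exp(sum_i w_i s(X_i, theta))."""
    dens = prior * np.exp(w @ s)
    dens /= np.trapz(dens, theta)
    return np.trapz(f * dens, theta)

w, i, eps = np.ones(3), 0, 1e-5
w_plus, w_minus = w.copy(), w.copy()
w_plus[i] += eps
w_minus[i] -= eps
fd = (post_mean(nu[i], w_plus) - post_mean(nu[i], w_minus)) / (2 * eps)
cov = post_mean(nu[i] * s[i], w) - post_mean(nu[i], w) * post_mean(s[i], w)
print(fd, cov)                                       # the two values should agree closely
```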

Appendix C Detailed calculations used in Section 3.4

This appendix provides the detailed calculations used in Section 3.4. The calculations employ the following lemma.

Lemma C.1 (Kumar, 1973).

Let ww be a random vector from N(0,Id)N(0,I_{d}). For d×dd\times d symmetric matrices BB and CC, we have

𝔼[(wBw)(wCw)]=2tr(BC)+(trB)(trC).\displaystyle\mathbb{E}[(w^{\top}Bw)(w^{\top}Cw)]=2\mathrm{tr}(BC)+(\mathrm{tr}B)\,(\mathrm{tr}C).
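As an informal numerical sanity check of Lemma C.1, one can compare a Monte Carlo estimate with the closed form; the dimension, seed, and number of draws below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.normal(size=(d, d)); B = (B + B.T) / 2      # random symmetric matrices
C = rng.normal(size=(d, d)); C = (C + C.T) / 2

w = rng.normal(size=(200_000, d))                   # draws from N(0, I_d)
qB = np.einsum('si,ij,sj->s', w, B, w)              # w' B w for each draw
qC = np.einsum('si,ij,sj->s', w, C, w)              # w' C w for each draw
mc = np.mean(qB * qC)
exact = 2 * np.trace(B @ C) + np.trace(B) * np.trace(C)
print(mc, exact)                                    # should agree up to Monte Carlo error
```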

Step 1. Establishing (15): Let us begin by expanding Covpos[s(Xi,θ),ν(Xi,θ)]\mathrm{Cov}_{\mathrm{pos}}[s(X_{i},\theta),\nu(X_{i},\theta)]. For i=1,,ni=1,\ldots,n, we have

𝔼pos[(Xiθ)A(Xiθ)(Xiθ)(Xiθ)]\displaystyle\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta)(X_{i}-\theta)^{\top}(X_{i}-\theta)]
=𝔼pos[{(Xiθ^)A(Xiθ^)+2(θ^θ)A(Xiθ^)+(θ^θ)A(θ^θ)}\displaystyle=\mathbb{E}_{\mathrm{pos}}\Big{[}\Big{\{}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})+2(\widehat{\theta}-\theta)^{\top}A(X_{i}-\widehat{\theta})+(\widehat{\theta}-\theta)^{\top}A(\widehat{\theta}-\theta)\Big{\}}
{(Xiθ^)(Xiθ^)+2(θ^θ)(Xiθ^)+(θ^θ)(θ^θ)}]\displaystyle\quad\quad\quad\quad\quad\quad\Big{\{}(X_{i}-\widehat{\theta})^{\top}(X_{i}-\widehat{\theta})+2(\widehat{\theta}-\theta)^{\top}(X_{i}-\widehat{\theta})+(\widehat{\theta}-\theta)^{\top}(\widehat{\theta}-\theta)\Big{\}}\Big{]}
=X~iAX~iX~iX~i+𝔼pos[2X~iAX~iθ~X~i]+𝔼pos[X~iAX~iθ~θ~]\displaystyle=\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}+\mathbb{E}_{\mathrm{pos}}\big{[}2\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}\widetilde{\theta}^{\top}\widetilde{X}_{i}\big{]}+\mathbb{E}_{\mathrm{pos}}\big{[}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}\widetilde{\theta}^{\top}\widetilde{\theta}\big{]}
+𝔼pos[2θ~AX~iX~iX~i]+𝔼pos[4θ~AX~iθ~X~i]+𝔼pos[2θ~AX~iθ~θ~]\displaystyle\quad+\mathbb{E}_{\mathrm{pos}}\big{[}2\widetilde{\theta}^{\top}A\widetilde{X}_{i}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}\big{]}+\mathbb{E}_{\mathrm{pos}}\big{[}4\widetilde{\theta}^{\top}A\widetilde{X}_{i}\widetilde{\theta}^{\top}\widetilde{X}_{i}\big{]}+\mathbb{E}_{\mathrm{pos}}\big{[}2\widetilde{\theta}^{\top}A\widetilde{X}_{i}\widetilde{\theta}^{\top}\widetilde{\theta}\big{]}
+𝔼pos[θ~Aθ~X~iX~i]+𝔼pos[2θ~Aθ~θ~X~i]+𝔼pos[θ~Aθ~θ~θ~],\displaystyle\quad+\mathbb{E}_{\mathrm{pos}}\big{[}\widetilde{\theta}^{\top}A\widetilde{\theta}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}\big{]}+\mathbb{E}_{\mathrm{pos}}\big{[}2\widetilde{\theta}^{\top}A\widetilde{\theta}\widetilde{\theta}^{\top}\widetilde{X}_{i}\big{]}+\mathbb{E}_{\mathrm{pos}}\big{[}\widetilde{\theta}^{\top}A\widetilde{\theta}\widetilde{\theta}^{\top}\widetilde{\theta}\big{]},

where X~i:=Xiθ^\widetilde{X}_{i}:=X_{i}-\widehat{\theta} and θ~:=θ^θ\widetilde{\theta}:=\widehat{\theta}-\theta. Lemma C.1 gives

𝔼pos[(θ^θ)(θ^θ)(θ^θ)A(θ^θ)]=2+d(nβ+1/τ)2tr(A)\displaystyle\mathbb{E}_{\mathrm{pos}}[(\widehat{\theta}-\theta)^{\top}(\widehat{\theta}-\theta)(\widehat{\theta}-\theta)^{\top}A(\widehat{\theta}-\theta)]=\frac{2+d}{(n\beta+1/\tau)^{2}}\mathrm{tr}(A)

and thus we get

𝔼pos[(Xiθ)(Xiθ)(Xiθ)A(Xiθ)]\displaystyle\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}(X_{i}-\theta)(X_{i}-\theta)^{\top}A(X_{i}-\theta)]
=X~iAX~iX~iX~i+4+dnβ+1/τX~iAX~i+tr(A)nβ+1/τX~iX~i+2+d(nβ+1/τ)2tr(A)\displaystyle=\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}+\frac{4+d}{n\beta+1/\tau}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}+\frac{2+d}{(n\beta+1/\tau)^{2}}\mathrm{tr}(A) (C.1)

Further, for i=1,,ni=1,\ldots,n, we have

𝔼pos[(Xiθ)A(Xiθ)]𝔼pos[(Xiθ)(Xiθ)]\displaystyle\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta)]\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}(X_{i}-\theta)]
={X~iX~i+dnβ+1/τ}{X~iAX~i+tr(A)nβ+1/τ}\displaystyle=\left\{\widetilde{X}_{i}^{\top}\widetilde{X}_{i}+\frac{d}{n\beta+1/\tau}\right\}\left\{\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\right\}
=X~iX~iX~iAX~i+tr(A)nβ+1/τX~iX~i+dnβ+1/τX~iAX~i+dnβ+1/τtr(A)nβ+1/τ.\displaystyle=\widetilde{X}_{i}^{\top}\widetilde{X}_{i}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\widetilde{X}_{i}^{\top}\widetilde{X}_{i}+\frac{d}{n\beta+1/\tau}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}+\frac{d}{n\beta+1/\tau}\frac{\mathrm{tr}(A)}{n\beta+1/\tau}. (C.2)

Combining (C.1) and (C.2) yields

Covpos[ν(Xi,θ),s(Xi,θ)]=β2{4nβ+1/τX~iAX~i+2tr(A)(nβ+1/τ)2},\displaystyle\mathrm{Cov}_{\mathrm{pos}}[\nu(X_{i},\theta),s(X_{i},\theta)]=-\frac{\beta}{2}\left\{\frac{4}{n\beta+1/\tau}\widetilde{X}_{i}^{\top}A\widetilde{X}_{i}+\frac{2\mathrm{tr}(A)}{(n\beta+1/\tau)^{2}}\right\},

which implies (15).

Step 2. Establishing (16): Next, we take the expectation of (1/n)i=1n(Xiθ^)A(Xiθ^)(1/n)\sum_{i=1}^{n}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta}). Let a:=(nβτ)/(nβτ+1)a:=(n\beta\tau)/(n\beta\tau+1). Then, we have

𝔼[1ni=1n(Xiθ^)A(Xiθ^)]\displaystyle\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})\right]
=𝔼[1ni=1n(Xiθ)A(Xiθ)]+(1a)2(θ)Aθ\displaystyle=\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\theta^{*})^{\top}A(X_{i}-\theta^{*})\right]+(1-a)^{2}(\theta^{*})^{\top}A\theta^{*}
+(a22a)𝔼[(X¯θ)A(X¯θ)]\displaystyle\qquad\qquad+(a^{2}-2a)\mathbb{E}\left[(\bar{X}-\theta^{*})^{\top}A(\bar{X}-\theta^{*})\right]
=tr(A)+(1a)2(θ)Aθ+(a22a)tr(A)n,\displaystyle=\mathrm{tr}(A)+(1-a)^{2}(\theta^{*})^{\top}A\theta^{*}+(a^{2}-2a)\frac{\mathrm{tr}(A)}{n},

which implies (16).

Step 3. Establishing (18): Finally, we consider Covpos[(Xiθ)A(Xiθ),θθ]\mathrm{Cov}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta),\theta^{\top}\theta]. For i=1,,ni=1,\ldots,n, we have

𝔼pos[(Xiθ)A(Xiθ)θθ]\displaystyle\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta)\theta^{\top}\theta]
=𝔼pos[{(Xiθ^)A(Xiθ^)+(θ^θ)A(θ^θ)+2(Xiθ^)A(θ^θ)}\displaystyle=\mathbb{E}_{\mathrm{pos}}\Big{[}\Big{\{}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})+(\widehat{\theta}-\theta)^{\top}A(\widehat{\theta}-\theta)+2(X_{i}-\widehat{\theta})^{\top}A(\widehat{\theta}-\theta)\Big{\}}
{(θ^θ)(θ^θ)2(θ^θ)θ^+θ^θ^}]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\Big{\{}(\widehat{\theta}-\theta)^{\top}(\widehat{\theta}-\theta)-2(\widehat{\theta}-\theta)^{\top}\widehat{\theta}+\widehat{\theta}^{\top}\widehat{\theta}\Big{\}}\Big{]}
=1nβ+1/τ(Xiθ^)A(Xiθ^)+(Xiθ^)A(Xiθ^)θ^θ^+tr(A)nβ+1/τθ^θ^\displaystyle=\frac{1}{n\beta+1/\tau}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})+(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})\widehat{\theta}^{\top}\widehat{\theta}+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\widehat{\theta}^{\top}\widehat{\theta}
+2+d(nβ+1/τ)2tr(A)+4nβ+1/τ(Xiθ^)Aθ^\displaystyle\quad+\frac{2+d}{(n\beta+1/\tau)^{2}}\mathrm{tr}(A)+\frac{4}{n\beta+1/\tau}(X_{i}-\widehat{\theta})^{\top}A\widehat{\theta}

and we have

𝔼pos[(Xiθ)A(Xiθ)]𝔼pos[θθ]\displaystyle\mathbb{E}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta)]\mathbb{E}_{\mathrm{pos}}[\theta^{\top}\theta]
={(Xiθ^)A(Xiθ^)+tr(A)nβ+1/τ}{θ^θ^+1nβ+1/τ}\displaystyle=\Big{\{}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\Big{\}}\Big{\{}\widehat{\theta}^{\top}\widehat{\theta}+\frac{1}{n\beta+1/\tau}\Big{\}}
=1nβ+1/τ(Xiθ^)A(Xiθ^)+(Xiθ^)A(Xiθ^)θ^θ^+tr(A)nβ+1/τθ^θ^\displaystyle=\frac{1}{n\beta+1/\tau}(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})+(X_{i}-\widehat{\theta})^{\top}A(X_{i}-\widehat{\theta})\widehat{\theta}^{\top}\widehat{\theta}+\frac{\mathrm{tr}(A)}{n\beta+1/\tau}\widehat{\theta}^{\top}\widehat{\theta}
+tr(A)(nβ+1/τ)2.\displaystyle\qquad\qquad+\frac{\mathrm{tr}(A)}{(n\beta+1/\tau)^{2}}.

Combining these yields

Covpos[(Xiθ)A(Xiθ),θθ]=4nβ+1/τ(Xiθ^)Aθ^+1+d(nβ+1/τ)2tr(A).\displaystyle\mathrm{Cov}_{\mathrm{pos}}[(X_{i}-\theta)^{\top}A(X_{i}-\theta),\theta^{\top}\theta]=-\frac{4}{n\beta+1/\tau}(X_{i}-\widehat{\theta})^{\top}A\widehat{\theta}+\frac{1+d}{(n\beta+1/\tau)^{2}}\mathrm{tr}(A).

This, together with the identity

𝔼[1ni=1n(Xiθ^)Aθ^]=a(1a)𝔼[X¯AX¯]=a(1a){(θ)Aθ+tr(A)/n},\displaystyle\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\widehat{\theta})^{\top}A\widehat{\theta}\right]=a(1-a)\mathbb{E}\left[\overline{X}^{\top}A\overline{X}\right]=a(1-a)\{(\theta^{*})^{\top}A\theta^{*}+\mathrm{tr}(A)/n\},

yields (18).

References for this appendix

  1. [1]

    Kumar, A. (1973) Expectation of products of quadratic forms. Sankhyā, 35, 359–362.

Appendix D Additional figures

This appendix provides additional figures. Figure D.1 depicts the result for the stack loss data; see Subsection 3.2 for details. Figure D.2 depicts the result for the Gesell data; see Subsection 3.3 for details.

Figure D.1: Performance of PCICG\mathrm{PCIC}_{\mathrm{G}}, IS-CV, and PSIS-CV for the stack loss dataset. The vertical axes present values relative to the generalisation errors, while the horizontal axes correspond to the number of posterior samples. Averages are computed across different experiments, with colored shades representing ±σ\pm\sigma. (a) results for 22\ell_{2}^{2} loss; (b) those for 1\ell_{1} loss; (c) those for scaled 22\ell_{2}^{2} loss; (d) those for scaled 1\ell_{1} loss.
Figure D.2: Performance of PCICG\mathrm{PCIC}_{\mathrm{G}}, IS-CV, and PSIS-CV for the Gesell dataset. The vertical axes present values relative to the generalisation errors, while the horizontal axes correspond to the number of posterior samples. Averages are computed across different experiments, with colored shades representing ±σ\pm\sigma. (a) results for 22\ell_{2}^{2} loss; (b) those for 1\ell_{1} loss; (c) those for scaled 22\ell_{2}^{2} loss; (d) those for scaled 1\ell_{1} loss.