
Objective Priors: An Introduction for Frequentists

Malay Ghosh
University of Florida
Malay Ghosh is Distinguished Professor, University of Florida, 223 Griffin–Floyd Hall, Gainesville, Florida 32611-8545, USA (e-mail: ghoshm@stat.ufl.edu).
(2011)
Abstract

Bayesian methods are increasingly applied these days in the theory and practice of statistics. Any Bayesian inference depends on a likelihood and a prior. Ideally one would like to elicit a prior from related sources of information or past data. In its absence, however, Bayesian methods need to rely on some “objective” or “default” priors, and the resulting posterior inference can still be quite valuable.

Not surprisingly, over the years, the catalog of objective priors has become prohibitively large, and one has to set some specific criteria for the selection of such priors. Our aim is to review some of these criteria, compare their performance, and illustrate them with some simple examples. While for very large sample sizes it possibly does not matter much which objective prior one uses, the selection of such a prior does influence inference for small or moderate samples. For regular models where asymptotic normality holds, Jeffreys’ general rule prior, the positive square root of the determinant of the Fisher information matrix, enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, however, there are many other priors which emerge as optimal depending on the criterion selected. One new feature of this article is that a prior different from Jeffreys’ is shown to be optimal under the chi-square divergence criterion even in the absence of nuisance parameters. This new prior is also invariant under one-to-one reparameterization.

Asymptotic expansion,
divergence criterion,
first-order probability matching,
Jeffreys’ prior,
left Haar priors,
location family,
location–scale family,
multiparameter,
orthogonality,
reference priors,
right Haar priors,
scale family,
second-order probability matching,
shrinkage argument,
doi: 10.1214/10-STS338
Volume 26, Issue 2

Discussed in 10.1214/11-STS338A and 10.1214/11-STS338B; rejoinder at 10.1214/11-STS338REJ.

1 Introduction

Bayesian methods have been increasingly used in recent years in the theory and practice of statistics. Their implementation requires specification of both a likelihood and a prior. With enough historical data, it is possible to elicit a prior distribution fairly accurately. However, even in its absence, Bayesian methods, if judiciously used, can produce meaningful inferences based on the so-called “objective” or “default” priors.

The main focus of this article is to introduce certain objective priors which could be potentially useful even for frequentist inference. One such example, where frequentists are yet to reach a consensus about an “optimal” approach, is the construction of confidence intervals for the ratio of two normal means, the celebrated Fieller–Creasy problem. It is shown in Section 4 of this paper how an “objective” prior produces a credible interval in this case which meets the target coverage probability of a frequentist confidence interval even for small or moderate sample sizes. Another situation, which has often become a real challenge for frequentists, is to find a suitable method for elimination of nuisance parameters when the dimension of the parameter grows in direct proportion to the sample size. This is what is usually referred to as the Neyman–Scott phenomenon. We will illustrate in Section 3, with an example, how an objective prior can sometimes overcome this problem.

Before getting into the main theme of this paper, we recount briefly the early history of objective priors. One of the earliest uses is usually attributed to Bayes (1763) and Laplace (1812), who recommended using a uniform prior for the binomial proportion $p$ in the absence of any other information. While intuitively quite appealing, this prior has often been criticized due to its lack of invariance under one-to-one reparameterization. For example, a uniform prior for $p$ in the binomial case does not result in a uniform prior for $p^{2}$. A more compelling example is that a uniform prior for $\sigma$, the population standard deviation, does not result in a uniform prior for $\sigma^{2}$, and the converse is also true. In a situation like this, it is not at all clear whether there can be any preference to assign a uniform prior to either $\sigma$ or $\sigma^{2}$.

In contrast, Jeffreys’ (1961) general rule prior, namely, the positive square root of the determinant of the Fisher information matrix, is invariant under one-to-one reparameterization of parameters. We will motivate this prior from several asymptotic considerations. In particular, for regular models where asymptotic normality holds, Jeffreys’ prior enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, this prior suffers from many problems, the marginalization paradox and the Neyman–Scott problem being two examples. Indeed, for location–scale models, Jeffreys himself recommended alternative priors.

There are several criteria for the construction of objective priors. The present article primarily reviews two of these criteria in some detail, namely, “divergence priors” and “probability matching priors,” and finds optimal priors under these criteria. The class of divergence priors includes the “reference priors” introduced by Bernardo (1979). The “probability matching priors” were introduced by Welch and Peers (1963), and there have been many generalizations in the past two decades. The development of both these classes of priors relies on asymptotic considerations. Somewhat more briefly, I have also discussed a few other priors, including the “right” and “left” Haar priors.

The paper does not attempt the extensive, thorough and comprehensive review of Kass and Wasserman (1996), nor does it aspire to the somewhat narrowly focused, but very comprehensive, reviews of probability matching priors given in Ghosh and Mukerjee (1998), Datta and Mukerjee (2004) and Datta and Sweeting (2005). A very comprehensive review of reference priors is now available in Bernardo (2005), and a unified approach is given in the recent article of Berger, Bernardo and Sun (2009).

While primarily a review, the present article has been able to unify as well as generalize some of the previously considered criteria, for example, viewing the reference priors as members of a bigger class of divergence priors. Interestingly, with some of these criteria as presented here, it is possible to construct some alternatives to Jeffreys’ prior even in the absence of nuisance parameters.

The outline of the remaining sections is as follows. In Section 2 we introduce two basic tools to be used repeatedly in the subsequent sections. One such tool, involving an asymptotic expansion of the posterior density, is due to Johnson (1970) and Ghosh, Sinha and Joshi (1982), and is discussed quite extensively in Ghosh, Delampady and Samanta (2006) and Datta and Mukerjee (2004). The second tool involves a shrinkage argument suggested by Dawid and used extensively by J. K. Ghosh and his co-authors. It is shown in Section 3 that this shrinkage argument can also be used in deriving priors under the criterion of maximizing the distance between the prior and the posterior. The distance measures used include, but are not limited to, the Kullback–Leibler (K–L) distance considered in Bernardo (1979) for constructing two-group “reference priors.” Also, in this section we consider a new prior, different from Jeffreys’, even in the one-parameter case, which is also invariant under one-to-one reparameterization. Section 4 addresses the construction of priors under probability matching criteria. Certain other priors are introduced in Section 5, and it is pointed out that some of these priors can often provide exact and not just asymptotic matching. Some final remarks are made in Section 6.

Throughout this paper the results are presented more or less in a heuristic fashion, that is, without paying much attention to the regularity conditions needed to justify these results. More emphasis is placed on the application of these results in the construction of objective priors.

2 Two Basic Tools

An asymptotic expansion of the posterior density began with Johnson (1970), followed up later by Ghosh, Sinha and Joshi (1982), and many others. The result goes beyond the Bernstein–von Mises theorem, which provides asymptotic normality of the posterior density. Typically, such an expansion is centered around the MLE (and occasionally the posterior mode), and requires only derivatives of the log-likelihood with respect to the parameters, evaluated at their MLEs. These expansions are available even for heavy-tailed densities such as the Cauchy, because finiteness of moments of the distribution is not needed. The result goes a long way in finding asymptotic expansions for the posterior moments of parameters of interest, as well as in finding asymptotic posterior predictive distributions.

The asymptotic expansion of the posterior resembles that of an Edgeworth expansion, but, unlike the latter, this approach does not need use of cumulants of the distribution. Finding cumulants, though conceptually easy, can become quite formidable, especially in the presence of multiple parameters, demanding evaluation of mixed cumulants.

We have used this expansion as a first step in the derivation of objective priors under different criteria. Together with the shrinkage argument as mentioned earlier in the Introduction, and to be discussed later in this section, one can easily unify and extend many of the known results on prior selection. In particular, we will see later in this section how some of the reference priors of Bernardo (1979) can be found via application of these two tools. The approach also leads to a somewhat surprising result involving asymptotic expansion of the distribution function of the MLE in a fairly general setup, and is not restricted to any particular family of distributions, for example, the exponential family, or the location–scale family. A detailed exposition is available in Datta and Mukerjee (2004, pages 5–8).

For simplicity of exposition, we consider primarily the one-parameter case. Results needed for the multiparameter case will occasionally be mentioned, and, in most cases, these are straightforward, albeit often cumbersome, extensions of one-parameter results. Moreover, as stated in the Introduction, the results will be given without full rigor, that is, without any specific mention of the needed regularity conditions.

We begin with $X_{1},\dots,X_{n}|\theta$ i.i.d. with common p.d.f. $f(X|\theta)$. Let $\hat{\theta}_{n}$ denote the MLE of $\theta$. The likelihood function is denoted by $L_{n}(\theta)=\prod^{n}_{1}f(X_{i}|\theta)$, and let $\ell_{n}(\theta)=\log L_{n}(\theta)$. Let $a_{i}=n^{-1}[d^{i}\ell_{n}(\theta)/d\theta^{i}]_{\theta=\hat{\theta}_{n}}$, $i=1,2,\dots,$ and let $\hat{I}_{n}=-a_{2}$, the observed per unit Fisher information number. Consider a twice differentiable prior $\pi$. Let $T_{n}=\sqrt{n}(\theta-\hat{\theta}_{n})\hat{I}_{n}^{1/2}$, and let $\pi^{*}_{n}(t)$ denote the posterior p.d.f. of $T_{n}$ given $X_{1},\ldots,X_{n}$. Then, under certain regularity conditions, we have the following result.

Theorem 1

$\pi^{*}_{n}(t)=\phi(t)[1+n^{-1/2}\gamma_{1}(t;X_{1},\ldots,X_{n})+n^{-1}\gamma_{2}(t;X_{1},\dots,X_{n})]+O_{p}(n^{-3/2})$, where $\phi(t)$ is the standard normal p.d.f.,
\[
\gamma_{1}(t;X_{1},\dots,X_{n})=\frac{a_{3}t^{3}}{6\hat{I}_{n}^{3/2}}+\frac{t}{\hat{I}_{n}^{1/2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}
\]
and
\begin{eqnarray*}
\gamma_{2}(t;X_{1},\dots,X_{n})&=&\frac{a_{4}t^{4}}{24\hat{I}_{n}^{2}}+\frac{a^{2}_{3}t^{6}}{72\hat{I}_{n}^{3}}+\frac{t^{2}}{2\hat{I}_{n}}\frac{\pi^{\prime\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}+\frac{a_{3}t^{4}}{6\hat{I}_{n}^{2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}\\
&&{}-\frac{a_{4}}{8\hat{I}_{n}^{2}}-\frac{15a^{2}_{3}}{72\hat{I}_{n}^{3}}-\frac{1}{2\hat{I}_{n}}\frac{\pi^{\prime\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}-\frac{a_{3}}{2\hat{I}_{n}^{2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}.
\end{eqnarray*}

The proof is given in Ghosh, Delampady and Samanta (2006, pages 107–108). The statement there involves a few minor typos which can be corrected easily. We outline here only a few key steps needed in the proof.

We begin with the posterior p.d.f.,

\begin{equation}
\pi(\theta|X_{1},\dots,X_{n})=\exp[\ell_{n}(\theta)]\pi(\theta)\Big/\int\exp[\ell_{n}(\theta)]\pi(\theta)\,d\theta. \tag{1}
\end{equation}

Substituting $t=\sqrt{n}(\theta-\hat{\theta}_{n})\hat{I}_{n}^{1/2}$, the posterior p.d.f. of $T_{n}$ is given by
\begin{eqnarray}
\pi^{*}_{n}(t)&=&C^{-1}_{n}\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]\nonumber\\
&&{}\cdot\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}, \tag{2}\\
\mbox{where }C_{n}&=&\int\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]\nonumber\\
&&\hphantom{\int}{}\cdot\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}\,dt. \tag{3}
\end{eqnarray}

The rest of the proof involves a Taylor expansion of $\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]$ and $\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}$ around $\hat{\theta}_{n}$ up to a desired order, and collecting the coefficients of $n^{-1/2}$, $n^{-1}$, etc. The other component is the evaluation of $C_{n}$ via moments of the N(0, 1) distribution.

Remark 1.

The above result is useful in finding certain expansions for the posterior moments as well. In particular, noting that $\theta=\hat{\theta}_{n}+(n\hat{I}_{n})^{-1/2}T_{n}$, it follows that the asymptotic expansion of the posterior mean of $\theta$ is given by
\begin{equation}
E(\theta|X_{1},\dots,X_{n})=\hat{\theta}_{n}+n^{-1}\biggl\{\frac{a_{3}}{2\hat{I}_{n}^{2}}+\frac{\pi^{\prime}(\hat{\theta}_{n})}{\hat{I}_{n}\pi(\hat{\theta}_{n})}\biggr\}+O_{p}(n^{-3/2}). \tag{5}
\end{equation}

Also, $V(\theta|X_{1},\ldots,X_{n})=(n\hat{I}_{n})^{-1}+O_{p}(n^{-3/2})$.
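As a quick numerical sanity check of this expansion (not part of the original argument), one can compare it with an exact posterior mean in a conjugate setting. The Python sketch below assumes an Exponential model with rate $\theta$ and an Exponential(1) prior, both chosen here purely for illustration; the posterior is then Gamma with shape $n+1$ and rate $\sum X_{i}+1$, so the exact mean is available in closed form.

import numpy as np

# Sketch: compare the exact posterior mean with the O(n^{-1}) expansion of Remark 1.
# Assumed model (for illustration only): X_i ~ Exponential(rate theta), prior pi(theta) = exp(-theta).
rng = np.random.default_rng(1)
theta_true, n = 1.5, 50
x = rng.exponential(scale=1.0 / theta_true, size=n)

theta_hat = 1.0 / x.mean()                    # MLE
I_hat = 1.0 / theta_hat**2                    # observed per-unit information, -a_2
a3 = 2.0 / theta_hat**3                       # a_3 = n^{-1} d^3 l_n / d theta^3 at the MLE
prior_score = -1.0                            # pi'(theta)/pi(theta) for pi(theta) = exp(-theta)

exact_mean = (n + 1) / (x.sum() + 1)          # mean of the Gamma(n+1, sum(x)+1) posterior
approx_mean = theta_hat + (a3 / (2 * I_hat**2) + prior_score / I_hat) / n
print(exact_mean, approx_mean)                # the two agree up to O_p(n^{-3/2})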

A multiparameter extension of Theorem 1 is as follows. Suppose that $\theta=(\theta_{1},\dots,\theta_{p})^{T}$ is the parameter vector and $\hat{\theta}_{n}$ is the MLE of $\theta$. Let

\begin{eqnarray*}
a_{jr}&=&-\hat{I}_{njr}=n^{-1}\frac{\partial^{2}\ell_{n}(\theta)}{\partial\theta_{j}\,\partial\theta_{r}}\bigg|_{\theta=\hat{\theta}_{n}},\\
a_{jrs}&=&n^{-1}\frac{\partial^{3}\ell_{n}(\theta)}{\partial\theta_{j}\,\partial\theta_{r}\,\partial\theta_{s}}\bigg|_{\theta=\hat{\theta}_{n}}
\end{eqnarray*}
and $\hat{I}_{n}=((\hat{I}_{njr}))$. Then, retaining only terms up to the order $n^{-1/2}$, the posterior of $W_{n}=\sqrt{n}(\theta-\hat{\theta}_{n})$ is given by
\begin{eqnarray}
\pi^{*}_{n}(w)&=&(2\pi)^{-p/2}|\hat{I}_{n}|^{1/2}\exp[-(1/2)w^{T}\hat{I}_{n}w]\nonumber\\
&&{}\cdot\Biggl[1+n^{-1/2}\Biggl\{\sum^{p}_{j=1}w_{j}\biggl(\frac{\partial\log\pi}{\partial\theta_{j}}\biggr)\bigg|_{\theta=\hat{\theta}_{n}}+\frac{1}{6}\sum_{j,r,s}w_{j}w_{r}w_{s}a_{jrs}\Biggr\}+O_{p}(n^{-1})\Biggr]. \tag{6}
\end{eqnarray}

Next we present the basic shrinkage argument of J. K. Ghosh, discussed in detail in Datta and Mukerjee (2004). The prime objective here is the evaluation of $E[q(X,\theta)|\theta]=\lambda(\theta)$, say, where $X$ and $\theta$ can be real- or vector-valued. The idea is to find first $\int\lambda(\theta)\bar{\pi}_{m}(\theta)\,d\theta$ through a sequence of priors $\{\bar{\pi}_{m}(\theta)\}$ defined on a compact set, and then to shrink the prior to degeneracy at some interior point, say $\theta$, of the compact set. The interesting point is that one never needs explicit specification of $\bar{\pi}_{m}(\theta)$ in carrying out this evaluation. We will see several illustrations of this in this article.

First, we present the shrinkage argument in a nutshell. Consider a proper prior $\bar{\pi}(\cdot)$ supported on a compact rectangle in the parameter space, vanishing on the boundary of its support while remaining positive in the interior; the support is the closure of this set. Consider the posterior of $\theta$ under $\bar{\pi}(\cdot)$ and, hence, obtain $E^{\bar{\pi}}[q(X,\theta)|X]$. Then find $E[\{E^{\bar{\pi}}(q(X,\theta)|X)\}|\theta]=\lambda(\theta)$ for $\theta$ in the interior of the support of $\bar{\pi}(\cdot)$. Finally, integrate $\lambda(\cdot)$ with respect to $\bar{\pi}(\cdot)$, and then allow $\bar{\pi}(\cdot)$ to converge to the prior degenerate at the true value of $\theta$, an interior point of the support. This yields $E[q(X,\theta)|\theta]$. The calculation assumes integrability of $q(X,\theta)$ over the joint distribution of $X$ and $\theta$; such integrability allows a change in the order of integration.

When executed up to the desired order of approximation, under suitable assumptions, these steps can lead to significant reduction in the algebra underlying higher order frequentist asymptotics. The simplification arises from two counts. First, although the Bayesian approach to frequentist asymptotics requires Edgeworth type assumptions, it avoids an explicit Edgeworth expansion involving calculation of approximate cumulants. Second, as we will see, it helps establish the results in an easily interpretable compact form. The following two sections will demonstrate multiple usage of these two basic tools.

3 Objective Priors Via Maximization of the Distance Between the Prior and the Posterior

3.1 Reference Priors

We begin with an alternate derivation of the reference prior of Bernardo. Following Lindley (1956), Bernardo (1979) suggested the Kullback–Leibler (K–L) divergence between the prior and the posterior, namely, $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$, where the expectation is taken over the joint distribution of $X$ and $\theta$. The target is to find a prior $\pi$ which maximizes the above distance. It is shown in Berger and Bernardo (1989) that if one does this maximization for a fixed $n$, this may lead to a discrete prior with finitely many jumps, a far cry from a diffuse prior. Hence, one needs an asymptotic maximization.

First write $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$ as

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}\pi(\theta|X)m^{\pi}(X)\,d\theta\,dX\nonumber\\
&=&\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}L_{n}(\theta)\pi(\theta)\,dX\,d\theta \tag{7}\\
&=&\int\pi(\theta)E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\Big|\theta\biggr]\,d\theta,\nonumber
\end{eqnarray}

where $X=(X_{1},\dots,X_{n})$, $L_{n}(\theta)=\prod_{1}^{n}f(X_{i}|\theta)$ is the likelihood function, and $m^{\pi}(X)$ denotes the marginal of $X$ after integrating out $\theta$. The integrations are carried out with respect to a prior $\pi$ having a compact support, subsequently passing to the limit as and when necessary.

Without any nuisance parameters, Bernardo (1979) showed somewhat heuristically that Jeffreys’ prior achieves the necessary maximization. A more rigorous proof was supplied later by Clarke and Barron (1990, 1994). We demonstrate heuristically how the shrinkage argument can also lead to the reference priors derived in Bernardo (1979). To this end, we first consider the one-parameter case for a regular family of distributions. We rewrite

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&\int\pi(\theta)E[\log\pi(\theta|X)|\theta]\,d\theta\\
&&{}-\int\pi(\theta)\log\pi(\theta)\,d\theta.
\end{eqnarray*}

Next we write

\[
E^{\bar{\pi}}[\log\pi(\theta|X)|X]=\int\log\pi(\theta|X)\bar{\pi}(\theta|X)\,d\theta.
\]

From the asymptotic expansion of the posterior, one gets

\begin{eqnarray*}
\log\pi(\theta|X)&=&(1/2)\log(n)-(1/2)\log(2\pi)\\
&&{}-\frac{n(\theta-\hat{\theta}_{n})^{2}}{2}\hat{I}_{n}+(1/2)\log(\hat{I}_{n})\\
&&{}+O_{p}(n^{-1/2}).
\end{eqnarray*}

Since $n(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}$ converges a posteriori to a $\chi_{1}^{2}$ distribution as $n\rightarrow\infty$, irrespective of the prior $\pi$, by the Bernstein–von Mises and Slutsky theorems, one gets

\begin{eqnarray}
E^{\bar{\pi}}[\log\pi(\theta|X)]&=&(1/2)\log(n)-(1/2)\log(2\pi e)\nonumber\\
&&{}+(1/2)\log(\hat{I}_{n})+O_{p}(n^{-1/2}). \tag{9}
\end{eqnarray}

Since the leading term on the right-hand side of (9) does not involve the prior $\bar{\pi}$, and $\hat{I}_{n}$ converges almost surely ($P_{\theta}$) to $I(\theta)$, applying the shrinkage argument, one gets from (9)

\begin{eqnarray*}
E[\log\pi(\theta|X)|\theta]&=&(1/2)\log(n)-(1/2)\log(2\pi e)\\
&&{}+\log(I^{1/2}(\theta))+O(n^{-1/2}).
\end{eqnarray*}

In view of the decomposition given earlier in this subsection, and considering only the leading terms in the above expansion, one needs to find a prior $\pi$ which maximizes $\int\log\{\frac{I^{1/2}(\theta)}{\pi(\theta)}\}\pi(\theta)\,d\theta$. This integral being nonpositive, by the property of the Kullback–Leibler information number, its maximum value is zero, which is attained for $\pi(\theta)=I^{1/2}(\theta)$, leading once again to Jeffreys’ prior.
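For a concrete, if crude, check of this asymptotic claim, one can estimate $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$ by Monte Carlo for a simple model and compare proper priors. The Python sketch below assumes a Binomial$(n,\theta)$ model (our choice, for illustration only) and compares the normalized Jeffreys prior Beta$(1/2,1/2)$ with the uniform Beta$(1,1)$; for moderate $n$ the estimate is typically slightly larger under Jeffreys’ prior, in line with the asymptotics.

import numpy as np
from scipy.stats import beta, binom

# Monte Carlo sketch of E[log pi(theta|X)/pi(theta)] for Binomial(n, theta) data.
# Under a Beta(a, b) prior the posterior is Beta(a + x, b + n - x), so the log ratio is explicit.
rng = np.random.default_rng(0)
n, reps = 30, 200_000

def expected_kl(a, b):
    theta = beta.rvs(a, b, size=reps, random_state=rng)
    x = binom.rvs(n, theta, random_state=rng)
    return np.mean(beta.logpdf(theta, a + x, b + n - x) - beta.logpdf(theta, a, b))

print("Jeffreys Beta(1/2,1/2):", expected_kl(0.5, 0.5))
print("Uniform  Beta(1,1)    :", expected_kl(1.0, 1.0))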

The multiparameter generalization of the above result without any nuisance parameters is based on the asymptotic expansion

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&(p/2)\log(n)-(p/2)\log(2\pi e)\\
&&{}+\int\log\{|I(\theta)|^{1/2}/\pi(\theta)\}\pi(\theta)\,d\theta\\
&&{}+O(n^{-1/2}),
\end{eqnarray*}

and maximization of the leading term yields once again Jeffreys’ general rule prior $\pi(\theta)=|I(\theta)|^{1/2}$.

In the presence of nuisance parameters, however, Jeffreys’ general rule prior is no longer the distance maximizer. We will demonstrate this in the case when the parameter vector is split into two groups, one group consisting of the parameters of interest, and the other involving the nuisance parameters. In particular, Bernardo’s (1979) two-group reference prior will be included as a special case.

To this end, suppose $\theta=(\theta_{1},\theta_{2})$, where $\theta_{1}$ ($p_{1}\times1$) is the parameter of interest and $\theta_{2}$ ($p_{2}\times1$) is the nuisance parameter. We partition the Fisher information matrix $I(\theta)$ as

\[
I(\theta)=\begin{pmatrix}I_{11}(\theta)&I_{12}(\theta)\\ I_{21}(\theta)&I_{22}(\theta)\end{pmatrix}.
\]

First begin with a general conditional prior $\pi(\theta_{2}|\theta_{1})=\phi(\theta)$ (say). Bernardo (1979) considered $\phi(\theta)=|I_{22}(\theta)|^{1/2}$. The marginal prior $\pi(\theta_{1})$ for $\theta_{1}$ is then obtained by maximizing the distance $E[\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}]$. We begin by writing

\begin{equation}
\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}=\log\frac{\pi(\theta|X)}{\pi(\theta)}-\log\frac{\pi(\theta_{2}|\theta_{1},X)}{\pi(\theta_{2}|\theta_{1})}. \tag{11}
\end{equation}

Writing $\pi(\theta)=\pi(\theta_{1})\phi(\theta)$ and $|I(\theta)|=|I_{22}(\theta)|\cdot|I_{11.2}(\theta)|$, where $I_{11.2}(\theta)=I_{11}(\theta)-I_{12}(\theta)I_{22}^{-1}(\theta)I_{21}(\theta)$, the asymptotic expansion and the shrinkage argument together yield

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&(p/2)\log(n)-(p/2)\log(2\pi e)\nonumber\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{22}(\theta)|^{1/2}|I_{11.2}(\theta)|^{1/2}}{\pi(\theta_{1})\phi(\theta)}\,d\theta_{2}\biggr\}\,d\theta_{1} \tag{12}\\
&&{}+O(n^{-1/2})\nonumber
\end{eqnarray}

and

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta_{2}|\theta_{1},X)}{\pi(\theta_{2}|\theta_{1})}\biggr]&=&(p_{2}/2)\log(n)-(p_{2}/2)\log(2\pi e)\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{22}(\theta)|^{1/2}}{\phi(\theta)}\,d\theta_{2}\biggr\}\,d\theta_{1}\\
&&{}+O(n^{-1/2}).
\end{eqnarray*}

From (11) and the last two expansions, retaining only the leading terms,

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}\biggr]&\approx&(p_{1}/2)\log(n)-(p_{1}/2)\log(2\pi e)\nonumber\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{11.2}(\theta)|^{1/2}}{\pi(\theta_{1})}\,d\theta_{2}\biggr\}\,d\theta_{1}. \tag{14}
\end{eqnarray}

Writing $\log\psi(\theta_{1})=\int\phi(\theta)\log|I_{11.2}(\theta)|^{1/2}\,d\theta_{2}$, once again by the property of the Kullback–Leibler information number, it follows that the maximizing prior is $\pi(\theta_{1})=\psi(\theta_{1})$.

We have purposely not set limits for these integrals. An important point to note [as pointed out in Berger and Bernardo (1989)] is that the evaluation of all these integrals is carried out over an increasing sequence of compact sets $K_{i}$ whose union is the entire parameter space. This is because most often we are working with improper priors, and direct evaluation of these integrals over the entire parameter space will simply give $+\infty$, which does not help in finding any prior. As an illustration, if the parameter space is $\mathcal{R}\times\mathcal{R}^{+}$, as is typically the case for the location–scale family of distributions, then one can take the increasing sequence of compact sets as $[-i,i]\times[i^{-1},i]$, $i\geq2$. All the proofs are usually carried out by taking a sequence of priors $\pi_{i}$ with compact support $K_{i}$, and eventually making $i\rightarrow\infty$. This important point should be borne in mind in the actual derivation of reference priors. We will now illustrate this for the location–scale family of distributions when one of the two parameters is the parameter of interest, while the other one is the nuisance parameter.

Example 1 (Location–scale models).

Suppose $X_{1},\ldots,X_{n}$ are i.i.d. with common p.d.f. $\sigma^{-1}f((x-\mu)/\sigma)$, where $\mu\in(-\infty,\infty)$ and $\sigma\in(0,\infty)$. Consider the sequence of priors $\pi_{i}$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ We may note that $I(\mu,\sigma)=\sigma^{-2}\bigl(\begin{smallmatrix}c_{1}&c_{2}\\ c_{2}&c_{3}\end{smallmatrix}\bigr)$, where the constants $c_{1}$, $c_{2}$ and $c_{3}$ are functions of $f$ and do not involve either $\mu$ or $\sigma$. So, if $\mu$ is the parameter of interest, and $\sigma$ is the nuisance parameter, following Bernardo’s (1979) prescription, one begins with the sequence of priors $\pi_{i2}(\sigma|\mu)=k_{i2}\sigma^{-1}$ where, solving $1=k_{i2}\int_{i^{-1}}^{i}\sigma^{-1}\,d\sigma$, one gets $k_{i2}=(2\log i)^{-1}$. Next one finds the prior $\pi_{i1}(\mu)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log(\sigma^{-1})\,d\sigma]$, which is a constant not depending on either $\mu$ or $\sigma$. Hence, the resulting joint prior $\pi_{i}(\mu,\sigma)=\pi_{i1}(\mu)\pi_{i2}(\sigma|\mu)\propto\sigma^{-1}$, which is the desired reference prior. Incidentally, this is Jeffreys’ independence prior rather than Jeffreys’ general rule prior, the latter being proportional to $\sigma^{-2}$. Conversely, when $\sigma$ is the parameter of interest and $\mu$ is the nuisance parameter, one begins with $\pi_{i2}(\mu|\sigma)=(2i)^{-1}$ and then, following Bernardo (1979) again, one finds $\pi_{i1}(\sigma)=c_{i1}\exp[\int_{-i}^{i}(2i)^{-1}\log(1/\sigma)\,d\mu]\propto\sigma^{-1}$. Thus, once again one gets Jeffreys’ independence prior. We will see in Section 5 that Jeffreys’ independence prior is a right Haar prior, while Jeffreys’ general rule prior is a left Haar prior for the location–scale family of distributions.

Example 2 (Noncentrality parameter).

Let $X_{1},\ldots,X_{n}|\mu,\sigma$ be i.i.d. N($\mu,\sigma^{2}$), where $\mu$ real and $\sigma(>0)$ are both unknown. Suppose the parameter of interest is $\theta=\mu/\sigma$, the noncentrality parameter. With the reparameterization from $(\mu,\sigma)$ to $(\theta,\sigma)$, the likelihood is rewritten as $L(\theta,\sigma)\propto\sigma^{-n}\exp[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(X_{i}-\theta\sigma)^{2}]$. Then the per observation Fisher information matrix is given by $I(\theta,\sigma)=\bigl(\begin{smallmatrix}1&\theta/\sigma\\ \theta/\sigma&(\theta^{2}+2)/\sigma^{2}\end{smallmatrix}\bigr)$. Consider once again the sequence of priors $\pi_{i}$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ Again, following Bernardo, $\pi_{i2}(\sigma|\theta)=k_{i2}\sigma^{-1}$, where $k_{i2}=(2\log i)^{-1}$. Noting that $I_{11.2}(\theta,\sigma)=1-\theta^{2}/(\theta^{2}+2)=2/(\theta^{2}+2)$, one gets $\pi_{i1}(\theta)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log(\sqrt{2}/(\theta^{2}+2)^{1/2})\,d\sigma]\propto(\theta^{2}+2)^{-1/2}$. Hence, the reference prior in this example is given by $\pi_{R}(\theta,\sigma)\propto(\theta^{2}+2)^{-1/2}\sigma^{-1}$. Due to its invariance property (Datta and Ghosh, 1996), in the original $(\mu,\sigma)$ parameterization, the two-group reference prior turns out to be $\pi_{R}(\mu,\sigma)\propto\sigma^{-1}(\mu^{2}+2\sigma^{2})^{-1/2}$.
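The entries of $I(\theta,\sigma)$ and the resulting $I_{11.2}(\theta,\sigma)$ quoted above can be checked symbolically. The following Python/SymPy sketch (ours, for verification only) computes the per observation information for a single $X\sim$ N($\theta\sigma,\sigma^{2}$).

import sympy as sp

# Symbolic check of the Fisher information in Example 2: X ~ N(theta*sigma, sigma^2).
x, z, theta = sp.symbols('x z theta', real=True)
sigma = sp.symbols('sigma', positive=True)
logf = -sp.log(2*sp.pi*sigma**2)/2 - (x - theta*sigma)**2/(2*sigma**2)

def info(p1, p2):
    # E[-d^2 log f / (dp1 dp2)], expectation taken with X = theta*sigma + sigma*Z, Z ~ N(0,1)
    expr = (-sp.diff(logf, p1, p2)).subs(x, theta*sigma + sigma*z)
    return sp.simplify(sp.integrate(expr*sp.exp(-z**2/2)/sp.sqrt(2*sp.pi), (z, -sp.oo, sp.oo)))

I11, I12, I22 = info(theta, theta), info(theta, sigma), info(sigma, sigma)
print(I11, I12, I22)                       # expect 1, theta/sigma, (theta**2 + 2)/sigma**2
print(sp.simplify(I11 - I12**2/I22))       # expect I_{11.2} = 2/(theta**2 + 2)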

Things simplify considerably if $\theta_{1}$ and $\theta_{2}$ are orthogonal in the Fisherian sense, namely, $I_{12}(\theta)=0$ (Huzurbazar, 1950; Cox and Reid, 1987). Then, if $I_{11}(\theta)$ and $I_{22}(\theta)$ factor respectively as $h_{11}(\theta_{1})h_{12}(\theta_{2})$ and $h_{21}(\theta_{1})h_{22}(\theta_{2})$, as a special case of a more general result of Datta and Ghosh (1995c), it follows that the two-group reference prior is given by $h_{11}^{1/2}(\theta_{1})h_{22}^{1/2}(\theta_{2})$.

Example 3.

As an illustration of the above, consider the celebrated Neyman–Scott problem (Berger and Bernardo, 1992a, 1992b). Consider a fixed effects one-way balanced normal ANOVA model where the number of observations per cell is fixed, but the number of cells grows to infinity. In symbols, let $X_{i1},\ldots,X_{ik}|\theta_{i}$ be mutually independent N($\theta_{i},\sigma^{2}$), $k\geq2$, $i=1,\ldots,n$, all parameters being assumed unknown. Let $S=\sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij}-\bar{X}_{i})^{2}/(n(k-1))$. Then the MLE of $\sigma^{2}$ is given by $(k-1)S/k$, which converges in probability, as $n\rightarrow\infty$, to $(k-1)\sigma^{2}/k$, and hence is inconsistent. Interestingly, Jeffreys’ prior in this case also produces an inconsistent estimator of $\sigma^{2}$, but the Berger–Bernardo reference prior does not.

To see this, we begin with the Fisher information matrix $I(\theta_{1},\ldots,\theta_{n},\sigma^{2})=k\operatorname{Diag}(\sigma^{-2},\ldots,\sigma^{-2},(1/2)n\sigma^{-4})$. Hence, Jeffreys’ prior is $\pi_{J}(\theta_{1},\ldots,\theta_{n},\sigma^{2})\propto(\sigma^{2})^{-n/2-1}$, which leads to the marginal posterior $\pi_{J}(\sigma^{2}|X)\propto(\sigma^{2})^{-nk/2-1}\exp[-n(k-1)S/(2\sigma^{2})]$ of $\sigma^{2}$, $X$ denoting the entire data set. Then the posterior mean of $\sigma^{2}$ is given by $n(k-1)S/(nk-2)$, while the posterior mode is given by $n(k-1)S/(nk+2)$. Both are inconsistent estimators of $\sigma^{2}$, as these converge in probability to $(k-1)\sigma^{2}/k$ as $n\rightarrow\infty$.

In contrast, by the result of Datta and Ghosh (1995c), the two-group reference prior is $\pi_{R}(\theta_{1},\ldots,\theta_{n},\sigma^{2})\propto(\sigma^{2})^{-1}$. This leads to the marginal posterior $\pi_{R}(\sigma^{2}|X)\propto(\sigma^{2})^{-n(k-1)/2-1}\exp[-n(k-1)S/(2\sigma^{2})]$ of $\sigma^{2}$. Now the posterior mean is given by $n(k-1)S/(n(k-1)-2)$, while the posterior mode is given by $n(k-1)S/(n(k-1)+2)$. Both are consistent estimators of $\sigma^{2}$.
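The contrast between the two posterior means is easy to see numerically. The Python sketch below (a simulation of our own, with $k=2$ and arbitrarily chosen cell means) uses the closed-form posterior means quoted above; the Jeffreys-based estimate stays near $(k-1)\sigma^{2}/k$, while the reference-prior estimate recovers $\sigma^{2}$.

import numpy as np

# Simulation sketch of the Neyman-Scott phenomenon in Example 3 (k = 2 observations per cell).
rng = np.random.default_rng(7)
n_cells, k, sigma2 = 5000, 2, 4.0
theta = rng.normal(0.0, 10.0, size=n_cells)                       # arbitrary cell means
X = rng.normal(theta[:, None], np.sqrt(sigma2), size=(n_cells, k))

S = np.sum((X - X.mean(axis=1, keepdims=True))**2) / (n_cells * (k - 1))
jeffreys_mean  = n_cells * (k - 1) * S / (n_cells * k - 2)          # posterior mean under pi_J
reference_mean = n_cells * (k - 1) * S / (n_cells * (k - 1) - 2)    # posterior mean under pi_R
print(jeffreys_mean, reference_mean)                                # approx. 2.0 versus approx. 4.0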

Example 4 (Ratio of normal means).

Let $X_{1}$ and $X_{2}$ be independent, with $X_{1}\sim$ N($\theta\mu,1$) and $X_{2}\sim$ N($\mu,1$), where the parameter of interest is $\theta$. This is the celebrated Fieller–Creasy problem. The Fisher information matrix in this case is $I(\theta,\mu)=\bigl(\begin{smallmatrix}\mu^{2}&\mu\theta\\ \mu\theta&1+\theta^{2}\end{smallmatrix}\bigr)$. With the transformation $\phi=\mu(1+\theta^{2})^{1/2}$, one obtains $I(\theta,\phi)=\operatorname{Diag}(\phi^{2}(1+\theta^{2})^{-2},1)$. Again, by Datta and Ghosh (1995c), the two-group reference prior is $\pi_{R}(\theta,\phi)\propto(1+\theta^{2})^{-1}$.
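The orthogonalizing transformation can be verified directly from the stated information matrix. The short SymPy sketch below (ours) transforms $I(\theta,\mu)$ to the $(\theta,\phi)$ parameterization with $\phi=\mu(1+\theta^{2})^{1/2}$.

import sympy as sp

# Sketch: check that phi = mu*(1 + theta^2)^{1/2} makes theta and phi orthogonal in Example 4.
theta, phi = sp.symbols('theta phi', real=True)
mu = phi / sp.sqrt(1 + theta**2)
I_mu = sp.Matrix([[mu**2, mu*theta], [mu*theta, 1 + theta**2]])    # information in (theta, mu)
J = sp.Matrix([[1, 0], [sp.diff(mu, theta), sp.diff(mu, phi)]])    # Jacobian of (theta, mu) w.r.t. (theta, phi)
I_phi = sp.simplify(J.T * I_mu * J)
print(I_phi)                                                       # expect Diag(phi^2/(1+theta^2)^2, 1)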

Example 5 (Random effects model).

This example has been visited and revisited on several occasions. Berger and Bernardo (1992b) first found reference priors for variance components in this problem when the number of observations per cell is the same. Later, Ye (1994) and Datta and Ghosh (1995c, 1995d) also found reference priors for this problem. The case involving unequal number of observations per cell was considered by Chaloner (1987) and Datta, Ghosh and Kim (2002).

For simplicity, we consider here only the case with an equal number of observations per cell. Let $Y_{ij}=m+\alpha_{i}+e_{ij}$, $j=1,\ldots,n$, $i=1,\ldots,k$. Here $m$ is an unknown parameter, while the $\alpha_{i}$’s and the $e_{ij}$’s are mutually independent, with the $\alpha_{i}$’s i.i.d. N($0,\sigma_{\alpha}^{2}$) and the $e_{ij}$’s i.i.d. N($0,\sigma^{2}$). The parameters $m$, $\sigma_{\alpha}^{2}$ and $\sigma^{2}$ are all unknown. We write $\bar{Y}_{i}=\sum_{j=1}^{n}Y_{ij}/n$, $i=1,\ldots,k$, and $\bar{Y}=\sum_{i=1}^{k}\bar{Y}_{i}/k$. The minimal sufficient statistic is $(\bar{Y},T,S)$, where $T=n\sum_{i=1}^{k}(\bar{Y}_{i}-\bar{Y})^{2}$ and $S=\sum_{i=1}^{k}\sum_{j=1}^{n}(Y_{ij}-\bar{Y}_{i})^{2}$.

The different parameters of interest that we consider are $m$, $\sigma_{\alpha}^{2}/\sigma^{2}$ and $\sigma^{2}$. The common mean $m$ is of great relevance in meta analysis (cf. Morris and Normand, 1992). Ye (1994) pointed out that the variance ratio $\sigma_{\alpha}^{2}/\sigma^{2}$ is of considerable interest in genetic studies. The parameter is also of importance to animal breeders, psychologists and others. Datta and Ghosh (1995d) have discussed the importance of $\sigma^{2}$, the error variance. In order to find reference priors for each one of these parameters, we first make the one-to-one transformation from $(m,\sigma_{\alpha}^{2},\sigma^{2})$ to $(m,r,u)$, where $r=\sigma^{-2}$ and $u=\sigma^{2}/(n\sigma_{\alpha}^{2}+\sigma^{2})$. Thus, $\sigma_{\alpha}^{2}/\sigma^{2}=(1-u)/(nu)$, and the likelihood $L(m,r,u)$ can be expressed as

\[
L(m,r,u)=r^{nk/2}u^{k/2}\exp[-(r/2)\{nku(\bar{Y}-m)^{2}+uT+S\}].
\]

Then the Fisher information matrix simplifies to $I(m,r,u)=k\operatorname{Diag}(nru,n/(2r^{2}),1/(2u^{2}))$. From Theorem 1 of Datta and Ghosh (1995c), it follows now that when $m$, $r$ and $u$ are the respective parameters of interest, while the other two are nuisance parameters, the reference priors are given respectively by $\pi_{1R}(m,r,u)=1$, $\pi_{2R}(m,r,u)=r^{-1}$ and $\pi_{3R}(m,r,u)=u^{-1}$.

3.2 General Divergence Priors

Next, back to the one-parameter case, we consider the more general distance (Amari, 1982; Cressie and Read, 1984)

\begin{equation}
D^{\pi}=\biggl[1-E\biggl\{\frac{\pi(\theta|X)}{\pi(\theta)}\biggr\}^{-\beta}\biggr]\Big/\{\beta(1-\beta)\},\quad\beta<1, \tag{15}
\end{equation}

which is to be interpreted as its limit when $\beta\rightarrow0$. This limit is the K–L distance as considered in Bernardo (1979). Also, $\beta=1/2$ gives the Bhattacharyya–Hellinger (Bhattacharyya, 1943; Hellinger, 1909) distance, and $\beta=-1$ leads to the chi-square distance (Clarke and Sun, 1997, 1999). In order to maximize $D^{\pi}$ with respect to a prior $\pi$, one re-expresses (15) as

\begin{eqnarray}
D^{\pi}&=&\biggl[1-\int\!\!\int\pi^{\beta+1}(\theta)\pi^{-\beta}(\theta|X)L_{n}(\theta)\,dX\,d\theta\biggr]\Big/\{\beta(1-\beta)\}\nonumber\\
&=&\biggl[1-\int\pi^{\beta+1}(\theta)E[\{\pi^{-\beta}(\theta|X)\}|\theta]\,d\theta\biggr]\Big/\{\beta(1-\beta)\}. \tag{16}
\end{eqnarray}

Hence, from (16), maximization of $D^{\pi}$ amounts to minimization (maximization) of

\begin{equation}
\int\pi^{\beta+1}(\theta)E[\{\pi^{-\beta}(\theta|X)\}|\theta]\,d\theta \tag{17}
\end{equation}

for $0<\beta<1$ ($\beta<0$). First consider the case $0<|\beta|<1$. From Theorem 1, the posterior of $\theta$ is

\begin{equation}
\pi(\theta|X)=\frac{\sqrt{n}\hat{I}_{n}^{1/2}}{(2\pi)^{1/2}}\exp\biggl[-\frac{n}{2}(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\biggr][1+O_{p}(n^{-1/2})]. \tag{18}
\end{equation}

Thus,

\begin{eqnarray}
\pi^{-\beta}(\theta|X)&=&n^{-\beta/2}(2\pi)^{\beta/2}\hat{I}_{n}^{-\beta/2}\nonumber\\
&&{}\cdot\exp\biggl[\frac{n\beta}{2}(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\biggr][1+O_{p}(n^{-1/2})]. \tag{19}
\end{eqnarray}

Following the shrinkage argument, and noting that, conditional on $\theta$, $\hat{I}_{n}\stackrel{p}{\rightarrow}I(\theta)$, while $n(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\stackrel{d}{\rightarrow}\chi_{1}^{2}$, it follows heuristically from (19) that

\begin{eqnarray}
E[\pi^{-\beta}(\theta|X)]&=&n^{-\beta/2}(2\pi)^{\beta/2}[I(\theta)]^{-\beta/2}(1-\beta)^{-1/2}\nonumber\\
&&{}\cdot[1+O_{p}(n^{-1/2})]. \tag{20}
\end{eqnarray}

Hence, from (20), considering only the leading term, for $0<\beta<1$, minimization of (17) with respect to $\pi$ amounts to minimization of $\int[\pi(\theta)/I^{1/2}(\theta)]^{\beta}\pi(\theta)\,d\theta$ with respect to $\pi$ subject to $\int\pi(\theta)\,d\theta=1$. A simple application of Hölder’s inequality shows that this minimization takes place when $\pi(\theta)\propto I^{1/2}(\theta)$. Similarly, for $-1<\beta<0$, $\pi(\theta)\propto I^{1/2}(\theta)$ provides the desired maximization of the expected distance between the prior and the posterior. The K–L distance, that is, the case $\beta\rightarrow0$, has already been considered earlier.
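A small numerical illustration of the Hölder step may be helpful. The Python sketch below (our construction) evaluates $\int[\pi(\theta)/I^{1/2}(\theta)]^{\beta}\pi(\theta)\,d\theta$ with $\beta=1/2$ for the Bernoulli model, where $I(\theta)=1/\{\theta(1-\theta)\}$, over a few proper Beta priors; the normalized Jeffreys prior Beta$(1/2,1/2)$ gives the smallest value.

import numpy as np
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

# Numerical sketch of the Holder-inequality step for the Bernoulli model, beta = 1/2.
beta_exp = 0.5
I_half = lambda t: 1.0 / np.sqrt(t * (1 - t))        # I^{1/2}(theta) for the Bernoulli model

def criterion(a, b):
    pi = lambda t: beta_dist.pdf(t, a, b)
    return quad(lambda t: (pi(t) / I_half(t))**beta_exp * pi(t), 0.0, 1.0)[0]

print("Jeffreys Beta(1/2,1/2):", criterion(0.5, 0.5))
print("Uniform  Beta(1,1)    :", criterion(1.0, 1.0))
print("Beta(2,2)             :", criterion(2.0, 2.0))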

Remark 2.

Equation (20) also holds for $\beta<-1$. However, in this case, it is shown in Ghosh, Mergel and Liu (2011) that the integral $\int\{\pi(\theta)/I^{1/2}(\theta)\}^{-\beta}\pi(\theta)\,d\theta$ is uniquely minimized with respect to $\pi(\theta)\propto I^{1/2}(\theta)$, and there exists no maximizer of this integral when $\int\pi(\theta)\,d\theta=1$. Thus, in this case, there does not exist any prior which maximizes the posterior–prior distance.

Remark 3.

Surprisingly, Jeffreys’ prior is not necessarily the solution when $\beta=-1$ (the chi-square divergence). In this case, the first-order asymptotics does not work since $\pi^{\beta+1}(\theta)=1$ for all $\theta$. However, retaining also the $O_{p}(n^{-1})$ term as given in Theorem 1, Ghosh, Mergel and Liu (2011) have found in this case the solution $\pi(\theta)\propto\exp[\int^{\theta}\frac{2g_{3}(t)-I^{\prime}(t)}{4I(t)}\,dt]$, where $g_{3}(t)=E[-\frac{d^{3}\log p(X_{1}|t)}{dt^{3}}|t]$. We shall refer to this prior as $\pi_{\mathrm{GML}}(\theta)$. We will show by examples that this prior may differ from Jeffreys’ prior. But first we will establish a hitherto unknown invariance property of this prior under one-to-one reparameterization.

Theorem 2

Suppose that $\phi$ is a one-to-one, twice differentiable function of $\theta$. Then $\pi_{\mathrm{GML}}(\phi)=C\pi_{\mathrm{GML}}(\theta)|\frac{d\theta}{d\phi}|$, where $C(>0)$, the constant of proportionality, does not involve any parameters.

Proof.

Without loss of generality, assume that $\phi$ is a nondecreasing function of $\theta$. By the identity

\[
g_{3}(\phi)=I^{\prime}(\phi)+E\biggl[\biggl(\frac{d^{2}\log f}{d\phi^{2}}\biggr)\biggl(\frac{d\log f}{d\phi}\biggr)\biggr],
\]

$\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)$ reduces to

\begin{equation}
\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)=\frac{I^{\prime}(\phi)+2E[(d^{2}\log f/d\phi^{2})(d\log f/d\phi)]}{4I(\phi)}. \tag{21}
\end{equation}

Next, from the relation $I(\phi)=I(\theta)(d\theta/d\phi)^{2}$, one gets the identities

\begin{eqnarray}
I^{\prime}(\phi)&=&I^{\prime}(\theta)\biggl(\frac{d\theta}{d\phi}\biggr)^{3}+2I(\theta)(d\theta/d\phi)(d^{2}\theta/d\phi^{2}); \tag{22}\\
\biggl(\frac{d^{2}\log f}{d\phi^{2}}\biggr)\biggl(\frac{d\log f}{d\phi}\biggr)&=&\biggl\{\frac{d^{2}\log f}{d\theta^{2}}\biggl(\frac{d\theta}{d\phi}\biggr)^{2}+\frac{d\log f}{d\theta}\cdot\frac{d^{2}\theta}{d\phi^{2}}\biggr\}\cdot\biggl(\frac{d\log f}{d\theta}\cdot\frac{d\theta}{d\phi}\biggr). \tag{23}
\end{eqnarray}

From (21)–(23), one gets, after simplification,

\begin{equation}
\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)=\frac{\pi_{\mathrm{GML}}^{\prime}(\theta)}{\pi_{\mathrm{GML}}(\theta)}\frac{d\theta}{d\phi}+\frac{d^{2}\theta/d\phi^{2}}{d\theta/d\phi}. \tag{24}
\end{equation}

Now, on integration, it follows from (24) that $\pi_{\mathrm{GML}}(\phi)=C\pi_{\mathrm{GML}}(\theta)(d\theta/d\phi)$, which proves the theorem.

Example 6.

Consider the one-parameter exponential family of distributions with $p(X|\theta)=\exp[\theta X-\psi(\theta)+h(X)]$. Then $g_{3}(\theta)=I^{\prime}(\theta)$, so that $\pi(\theta)\propto\exp[\frac{1}{4}\int\frac{I^{\prime}(\theta)}{I(\theta)}\,d\theta]=I^{1/4}(\theta)$, which is different from Jeffreys’ $I^{1/2}(\theta)$ prior. Because of the invariance result proved in Theorem 2, in particular, for the $\operatorname{Binomial}(n,p)$ problem, noting that $p=\exp(\theta)/[1+\exp(\theta)]$, one gets $\pi_{\mathrm{GML}}(p)\propto p^{-3/4}(1-p)^{-3/4}$, which is a $\operatorname{Beta}(\frac{1}{4},\frac{1}{4})$ prior, different from Jeffreys’ $\operatorname{Beta}(\frac{1}{2},\frac{1}{2})$ prior, Laplace’s $\operatorname{Beta}(1,1)$ prior or Haldane’s improper $\operatorname{Beta}(0,0)$ prior. Similarly, for the Poisson($\lambda$) case, one gets $\pi_{\mathrm{GML}}(\lambda)\propto\lambda^{-3/4}$, again different from Jeffreys’ $\pi_{J}(\lambda)\propto\lambda^{-1/2}$ prior. However, for the N($\theta,1$) distribution, since $I(\theta)=1$ and $g_{3}(\theta)=I^{\prime}(\theta)=0$, $\pi_{\mathrm{GML}}(\theta)=c(>0)$, a constant, which is the same as Jeffreys’ prior. It may also be pointed out that for the one-parameter exponential family, under the chi-square divergence, $\pi_{\mathrm{GML}}$ differs from Hartigan’s (1998) maximum likelihood prior $\pi_{H}(\theta)=I(\theta)$.
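The Beta$(\frac{1}{4},\frac{1}{4})$ form can also be obtained by applying the defining formula for $\pi_{\mathrm{GML}}$ directly in the $p$ parameterization, which doubles as a check of the invariance result. A SymPy sketch of our own:

import sympy as sp

# Sketch: compute pi_GML for a single Bernoulli(p) observation directly in the p parameterization.
# The claim is that the integrand (2 g3 - I')/(4 I) integrates to log[(p(1-p))^{-3/4}].
p, x = sp.symbols('p x', positive=True)
logf = x*sp.log(p) + (1 - x)*sp.log(1 - p)

Ex = lambda expr: expr.subs(x, p)                  # E[X] = p; the expressions below are linear in x
I  = sp.simplify(Ex(-sp.diff(logf, p, 2)))         # Fisher information, 1/(p*(1 - p))
g3 = sp.simplify(Ex(-sp.diff(logf, p, 3)))         # g3(p) = -E[d^3 log f / dp^3]

integrand = sp.simplify((2*g3 - sp.diff(I, p)) / (4*I))
target = sp.diff(-sp.Rational(3, 4)*sp.log(p*(1 - p)), p)
print(sp.simplify(integrand - target))             # expect 0, confirming the Beta(1/4, 1/4) form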

Example 7.

For the one-parameter location family of distributions with $p(X|\theta)=f(X-\theta)$, where $f$ is a p.d.f., both $g_{3}(\theta)$ and $I(\theta)$ are constants, implying $I^{\prime}(\theta)=0$. Hence, $\pi_{\mathrm{GML}}(\theta)$ is of the form $\pi_{\mathrm{GML}}(\theta)=\exp(k\theta)$ for some constant $k$. However, for the special case of a symmetric $f$, that is, $f(X)=f(-X)$ for all $X$, $g_{3}(\theta)=0$, and then $\pi_{\mathrm{GML}}(\theta)$ reduces once again to $\pi(\theta)=c$, which is the same as Jeffreys’ prior.

Example 8.

For the general scale family of distributions with $p(X|\theta)=\theta^{-1}f(\frac{X}{\theta})$, $\theta>0$, where $f$ is a p.d.f., $I(\theta)=\frac{c_{1}}{\theta^{2}}$ for some constant $c_{1}(>0)$, while $g_{3}(\theta)=\frac{c_{2}}{\theta^{3}}$ for some constant $c_{2}$. Then $\pi_{\mathrm{GML}}(\theta)\propto\exp(c\log\theta)=\theta^{c}$ for some constant $c$. In particular, when $p(X|\theta)=\theta^{-1}\exp(-\frac{X}{\theta})$, $\pi_{\mathrm{GML}}(\theta)\propto\theta^{-3/2}$, different from Jeffreys’ $\pi_{J}(\theta)\propto\theta^{-1}$ for the general scale family of distributions.
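The $\theta^{-3/2}$ claim for the exponential scale model can be checked in the same way; the SymPy sketch below (ours) evaluates $I(\theta)$, $g_{3}(\theta)$ and the integrand in the definition of $\pi_{\mathrm{GML}}$.

import sympy as sp

# Sketch: pi_GML for the exponential scale model p(x|theta) = theta^{-1} exp(-x/theta).
theta = sp.symbols('theta', positive=True)
x = sp.symbols('x', positive=True)
logf = -sp.log(theta) - x/theta

def E(expr):
    # expectation under the Exponential density with scale theta
    return sp.simplify(sp.integrate(expr*sp.exp(-x/theta)/theta, (x, 0, sp.oo)))

I  = E(-sp.diff(logf, theta, 2))                          # expect 1/theta^2
g3 = E(-sp.diff(logf, theta, 3))                          # expect -4/theta^3
print(sp.simplify((2*g3 - sp.diff(I, theta)) / (4*I)))    # expect -3/(2*theta), giving theta^{-3/2}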

The multiparameter extension of the general divergence prior has been explored in the Ph.D. dissertation of Liu (2009). Among other things, he has shown that in the absence of any nuisance parameters, for $|\beta|<1$, the divergence prior is Jeffreys’ prior. However, on the boundary, namely, $\beta=-1$, priors other than Jeffreys’ prior emerge.

4 Probability Matching Priors

4.1 Motivation and First-Order Matching

As mentioned in the Introduction, probability matching priors are intended to achieve a Bayes–frequentist synthesis. Specifically, these priors are required to make the asymptotic coverage probabilities of Bayesian credible intervals agree with those of the corresponding frequentist counterparts. Over the years, there have been several versions of such priors: quantile matching priors, matching priors for distribution functions, HPD matching priors and matching priors associated with likelihood ratio statistics. Datta and Mukerjee (2004) provided a detailed account of all these priors. In this article I will be concerned only with quantile matching priors.

A general definition of quantile matching priors is as follows. Suppose $X_{1},\dots,X_{n}|\theta$ are i.i.d. with common p.d.f. $f(X|\theta)$, where $\theta$ is a real-valued parameter. Assume all the regularity conditions needed for the asymptotic expansion of the posterior around $\hat{\theta}_{n}$, the MLE of $\theta$. We continue with the notation of the previous section. For $0<\alpha<1$, let $\theta^{\pi}_{1-\alpha}(X_{1},\dots,X_{n})\equiv\theta^{\pi}_{1-\alpha}$ denote the $(1-\alpha)$th asymptotic posterior quantile of $\theta$ based on the prior $\pi$, that is,

\begin{equation}
P^{\pi}[\theta\leq\theta^{\pi}_{1-\alpha}|X_{1},\dots,X_{n}]=1-\alpha+O_{p}(n^{-r}) \tag{25}
\end{equation}

for some $r>0$. If now $P[\theta\leq\theta^{\pi}_{1-\alpha}|\theta]=1-\alpha+O(n^{-r})$, then some order of probability matching is achieved. If $r=1$, we call $\pi$ a first-order probability matching prior. If $r=3/2$, we call $\pi$ a second-order probability matching prior.

We first provide an intuitive argument for why Jeffreys’ prior is a first-order probability matching prior in the absence of nuisance parameters. If $X_{1},\dots,X_{n}|\theta$ are i.i.d. N($\theta,1$) and $\pi(\theta)=1$, $-\infty<\theta<\infty$, then the posterior $\pi(\theta|X_{1},\dots,X_{n})$ is N($\bar{X}_{n},n^{-1}$). Now, writing $z_{1-\alpha}$ for the $100(1-\alpha)\%$ quantile of the N($0,1$) distribution, one gets

\begin{eqnarray}
P\bigl[\sqrt{n}(\theta-\bar{X}_{n})\leq z_{1-\alpha}|X_{1},\dots,X_{n}\bigr]&=&1-\alpha\nonumber\\
&=&P\bigl[\sqrt{n}(\bar{X}_{n}-\theta)\geq-z_{1-\alpha}|\theta\bigr], \tag{26}
\end{eqnarray}

so that the one-sided credible interval $(-\infty,\bar{X}_{n}+z_{1-\alpha}/\sqrt{n}]$ for $\theta$ has exact frequentist coverage probability $1-\alpha$.

The above exact matching does not always hold. However, if $X_{1},\dots,X_{n}|\theta$ are i.i.d., then $\hat{\theta}_{n}|\theta$ is asymptotically N($\theta,(nI(\theta))^{-1}$). Then, by the delta method, $g(\hat{\theta}_{n})|\theta\sim$ N$[g(\theta),(g^{\prime}(\theta))^{2}(nI(\theta))^{-1}]$. So if $g^{\prime}(\theta)=I^{1/2}(\theta)$, so that $g(\theta)=\int^{\theta}I^{1/2}(t)\,dt$, then $\sqrt{n}[g(\hat{\theta}_{n})-g(\theta)]|\theta$ is asymptotically N($0,1$). Hence, arguing as in (26), with the uniform prior $\pi(\phi)=1$ for $\phi=g(\theta)$, coverage matching is asymptotically achieved for $\phi$. This leads to the prior $\pi(\theta)=\frac{d\phi}{d\theta}=g^{\prime}(\theta)=I^{1/2}(\theta)$ for $\theta$.
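Coverage matching of this kind is easy to examine by simulation. The Python sketch below (our example) takes $X_{1},\dots,X_{n}$ i.i.d. Exponential with rate $\lambda$ and Jeffreys’ prior $\pi(\lambda)\propto\lambda^{-1}$, under which the posterior is Gamma with shape $n$ and rate $\sum X_{i}$; the frequentist coverage of the one-sided 95% credible bound comes out essentially equal to 0.95 (for this scale model the matching is in fact exact, a point taken up again in Section 5).

import numpy as np
from scipy.stats import gamma

# Monte Carlo sketch of quantile matching for an Exponential(rate lambda) model with Jeffreys' prior.
rng = np.random.default_rng(3)
lam, n, alpha, reps = 2.0, 10, 0.05, 100_000

x_sum = rng.gamma(shape=n, scale=1.0 / lam, size=reps)     # sufficient statistic: sum of the X_i
upper = gamma.ppf(1 - alpha, a=n, scale=1.0 / x_sum)       # posterior (1 - alpha) quantile of lambda
print(np.mean(lam <= upper))                               # expect approximately 0.95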

Datta and Mukerjee (2004, pages 14–21) proved the result in a formal manner. They used the two basic tools of Section 3. In the absence of nuisance parameters, they showed that a first-order matching prior for $\theta$ is a solution of the differential equation

\begin{equation}
\frac{d}{d\theta}\bigl(\pi(\theta)I^{-1/2}(\theta)\bigr)=0, \tag{27}
\end{equation}

so that Jeffreys’ prior is the unique first-order matching prior. However, it does not always satisfy the second-order matching property.

4.2 Second-Order Matching

In order that the matching be accomplished up to $O(n^{-3/2})$ (second-order matching), one needs an asymptotic expansion of the posterior distribution function up to the $O(n^{-1})$ term, and one must set up a second differential equation in addition to (27). This equation is given by (cf. Mukerjee and Dey, 1993; Mukerjee and Ghosh, 1997)

\begin{equation}
\frac{1}{3}\frac{d}{d\theta}[\pi(\theta)I^{-2}(\theta)g_{3}(\theta)]+\frac{d^{2}}{d\theta^{2}}[\pi(\theta)I^{-1}(\theta)]=0, \tag{28}
\end{equation}

where, as before, $g_{3}(\theta)=-E[\frac{d^{3}\log f(X|\theta)}{d\theta^{3}}|\theta]$. If Jeffreys’ prior satisfies (28), then it is the unique second-order matching prior. While for the location and scale families of distributions this is indeed the case, it is not true in general. Of course, in such an instance, there does not exist any second-order matching prior.

To see this, for $\pi_{J}(\theta)=I^{1/2}(\theta)$, (28) reduces to

\[
\frac{1}{3}\frac{d}{d\theta}[I^{-3/2}(\theta)g_{3}(\theta)]+\frac{d^{2}}{d\theta^{2}}[I^{-1/2}(\theta)]=0,
\]

which requires $\frac{1}{3}I^{-3/2}(\theta)g_{3}(\theta)+\frac{d}{d\theta}(I^{-1/2}(\theta))$ to be a constant free of $\theta$. After some algebra, the above expression simplifies to $(1/6)E[(\frac{d\log f}{d\theta})^{3}|\theta]/I^{3/2}(\theta)$. It is easy to check now that for the one-parameter location and scale families of distributions, the above expression does not depend on $\theta$. However, for the one-parameter exponential family of distributions with canonical parameter $\theta$, the same holds if and only if $I^{\prime}(\theta)/I^{3/2}(\theta)$ does not depend on $\theta$, that is, if and only if $I^{-1/2}(\theta)$ is a linear function of $\theta$. Another interesting example is given below.

Example 9.

Let $(X_{1},X_{2})^{T}\sim\mathrm{N}_{2}\bigl[\bigl(\begin{smallmatrix}0\\ 0\end{smallmatrix}\bigr),\bigl(\begin{smallmatrix}1&\rho\\ \rho&1\end{smallmatrix}\bigr)\bigr]$. One can verify that $I(\rho)=(1+\rho^{2})/(1-\rho^{2})^{2}$ and that $L_{1,1,1}\equiv E[(\partial\log f/\partial\rho)^{3}|\rho]=-\frac{2\rho(3+\rho^{2})}{(1-\rho^{2})^{3}}$, so that $L_{1,1,1}/I^{3/2}(\rho)$ is not a constant. Hence, $\pi_{J}$ is not a second-order matching prior, and there does not exist any second-order matching prior in this example.

4.3 First-Order Quantile Matching Priors in the Presence of Nuisance Parameters

The parameter of interest is still real-valued, but there may be one or more nuisance parameters. To fix ideas, suppose $\theta=(\theta_{1},\dots,\theta_{p})$, where $\theta_{1}$ is the parameter of interest, while $\theta_{2},\dots,\theta_{p}$ are the nuisance parameters. As shown by Welch and Peers (1963), and later more rigorously by Datta and Ghosh (1995a) and Datta (1996), writing $I^{-1}=((I^{jk}))$, the probability matching equation is given by

\begin{equation}
\sum^{p}_{j=1}\frac{\partial}{\partial\theta_{j}}\{\pi(\theta)I^{j1}(I^{11})^{-1/2}\}=0. \tag{29}
\end{equation}
Example 1 (Continued).

First consider $\mu$ as the parameter of interest, and $\sigma$ the nuisance parameter. Since each element of the inverse of the Fisher information matrix is a constant multiple of $\sigma^{2}$, any prior $\pi(\mu,\sigma)\propto g(\sigma)$, $g$ arbitrary, satisfies (29). Conversely, when $\sigma$ is the parameter of interest, and $\mu$ is the nuisance parameter, any prior $\pi(\mu,\sigma)\propto\sigma^{-1}g(\mu)$ satisfies (29).

A special case considered in Tibshirani (1989) is of interest. Here \theta_{1} is orthogonal to (\theta_{2},\dots,\theta_{p}) in the Fisherian sense, that is, I^{j1}=0 for j=2,3,\dots,p.

With orthogonality, (29) simplifies to

\frac{\partial}{\partial\theta_{1}}\{\pi(\theta)I^{-1/2}_{11}\}=0

(since I^{11}=I^{-1}_{11}). This leads to \pi(\theta)=I^{1/2}_{11}h(\theta_{2},\dots,\theta_{p}), where h is arbitrary. Often a second-order matching prior removes the arbitrariness of h; we will see an example later in this section. However, this need not always be the case: as seen earlier in the one-parameter case, second-order matching priors may not always exist. We will address this issue later in this section.

A special choice is h\equiv 1. The resultant prior \pi(\theta)=I_{11}^{1/2} has some intuitive appeal: under orthogonality, \sqrt{n}(\hat{\theta}_{1n}-\theta_{1}) given \theta is asymptotically \mathrm{N}(0,I^{-1}_{11}(\theta)), so one may expect I^{1/2}_{11}(\theta) to be a first-order probability matching prior. This prior is, however, only one member of the class of priors \pi(\theta)=I^{1/2}_{11}h(\theta_{2},\dots,\theta_{p}) found by Tibshirani (1989), and admittedly need not be second-order matching even when the latter exists. A recent article by Staicu and Reid (2008) has proved some interesting properties of the prior \pi(\theta)=I^{1/2}_{11}(\theta). This prior is also considered in Ghosh and Mukerjee (1992).

For a symmetric location–scale family of distributions, that is, when f(x)=f(-x), we have c_{2}=0, so that \mu and \sigma are orthogonal. Now, when \mu is the parameter of interest and \sigma is the nuisance parameter, the class of first-order matching priors consists of the priors \pi_{1}(\mu,\sigma)=h_{1}(\sigma), where h_{1} is arbitrary. Similarly, when \sigma is the parameter of interest and \mu is the nuisance parameter, the class of first-order matching priors consists of the priors \pi_{2}(\mu,\sigma)=\sigma^{-1}h_{2}(\mu), where h_{2} is arbitrary. The intersection of the two classes leads again to the unique prior \pi(\mu,\sigma)\propto\sigma^{-1}.

Example 2 ((Continued)).

Let X_{1},\dots,X_{n}|\mu,\sigma be i.i.d. \mathrm{N}(\mu,\sigma^{2}), where \theta=\mu/\sigma is again the parameter of interest. In order to find a parameter \phi which is orthogonal to \theta, we rewrite the p.d.f. in the form

f(X|\theta,\sigma)=(2\pi\sigma^{2})^{-1/2}\exp\biggl[-\frac{1}{2\sigma^{2}}(X-\theta\sigma)^{2}\biggr]. (30)

Then the Fisher information matrix is

I(\theta,\sigma)=\pmatrix{1&\theta/\sigma\cr\theta/\sigma&\sigma^{-2}(\theta^{2}+2)}.

It turns out that if we reparameterize from (\theta,\sigma) to (\theta,\phi), where \phi=\sigma(\theta^{2}+2)^{1/2}, then \theta and \phi are orthogonal, with the corresponding Fisher information matrix given by I(\theta,\phi)=\operatorname{Diag}[2(\theta^{2}+2)^{-1},\phi^{-2}(\theta^{2}+2)]. Hence, the class of first-order matching priors when \theta is the parameter of interest is given by \pi(\theta,\phi)=(\theta^{2}+2)^{-1/2}h(\phi), where h is arbitrary.
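
The orthogonality of \theta and \phi can also be checked symbolically. The sketch below (my own, assuming sympy is available) transforms the information matrix in (\theta,\sigma) by the Jacobian of the reparameterization and recovers the stated diagonal form.

import sympy as sp

theta, sigma, phi = sp.symbols('theta sigma phi', positive=True)

# Fisher information in the (theta, sigma) parameterization
I_old = sp.Matrix([[1, theta/sigma],
                   [theta/sigma, (theta**2 + 2)/sigma**2]])

# sigma expressed through the candidate orthogonal parameter phi
sigma_of = phi / sp.sqrt(theta**2 + 2)

# Jacobian of (theta, sigma) with respect to (theta, phi)
J = sp.Matrix([[1, 0],
               [sp.diff(sigma_of, theta), sp.diff(sigma_of, phi)]])

I_new = sp.simplify(J.T * I_old.subs(sigma, sigma_of) * J)
print(I_new)   # Diag[2/(theta^2 + 2), (theta^2 + 2)/phi^2]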

4.4 Second-Order Quantile Matching Priors in the Presence of Nuisance Parameters

When \theta_{1} is the parameter of interest and (\theta_{2},\ldots,\theta_{p}) is the vector of nuisance parameters, the general class of second-order quantile matching priors is characterized in (2.4.11) and (2.4.12) of Datta and Mukerjee (2004, page 12). For simplicity, we consider only the case when \theta_{1} is orthogonal to (\theta_{2},\ldots,\theta_{p}). In this case a first-order quantile matching prior \pi(\theta_{1},\theta_{2},\ldots,\theta_{p})\propto I_{11}^{1/2}(\theta)h(\theta_{2},\ldots,\theta_{p}) is also second-order matching if and only if h satisfies the differential equation (cf. Datta and Mukerjee, 2004, page 27)

\sum_{s=2}^{p}\sum_{u=2}^{p}\frac{\partial}{\partial\theta_{u}}\biggl\{I_{11}^{-1/2}I^{su}E\biggl(\frac{\partial^{3}\log f}{\partial\theta_{1}^{2}\,\partial\theta_{s}}\Bigl|\theta\biggr)h\biggr\}+\frac{h}{6}\frac{\partial}{\partial\theta_{1}}\biggl\{I_{11}^{-3/2}E\biggl(\biggl(\frac{\partial\log f}{\partial\theta_{1}}\biggr)^{3}\Bigl|\theta\biggr)\biggr\}=0. (31)

We revisit Examples 1–5 and provide complete, or at least partial, characterization of second-order quantile matching priors.

Example 1 ((Continued)).

Let f be symmetric so that \mu and \sigma are orthogonal. First let \mu be the parameter of interest and \sigma the nuisance parameter. Since both terms in (31) are then zero, every first-order quantile matching prior of the form \sigma^{-1}h(\sigma)=q(\sigma), say, is also second-order matching. In other words, any prior \pi(\mu,\sigma) that depends on \sigma alone is second-order matching. On the other hand, if \sigma is the parameter of interest and \mu is the nuisance parameter, then, since the second term in (31) is zero, a first-order quantile matching prior of the form \sigma^{-1}h(\mu) is also second-order matching if and only if h(\mu) is a constant. Thus, the unique second-order quantile matching prior in this case is proportional to \sigma^{-1}, which is Jeffreys' independence prior.

Example 2 ((Continued)).

Recall that in this case, writing \theta=\mu/\sigma and \phi=\sigma(\theta^{2}+2)^{1/2}, the Fisher information matrix is I(\theta,\phi)=\operatorname{Diag}[2(\theta^{2}+2)^{-1},\phi^{-2}(\theta^{2}+2)]. Also, E[(\frac{\partial\log f}{\partial\theta})^{3}|\theta,\phi]=-\frac{8\theta(\theta^{2}+3)}{(\theta^{2}+2)^{3}} and E(\frac{\partial^{3}\log f}{\partial\theta^{2}\,\partial\phi}|\theta,\phi)=(4/\phi)(\theta^{2}+2)^{-2}. Hence, (31) holds if and only if h(\phi) is a constant. This leads to the unique second-order quantile matching prior \pi(\theta,\phi)\propto(\theta^{2}+2)^{-1/2}. Transforming back to the original (\mu,\sigma) parameterization, this is the prior \pi(\mu,\sigma)\propto\sigma^{-1}, Jeffreys' independence prior.
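
Although the matching here is only asymptotic, it is already quite accurate for small samples. The following Monte Carlo sketch (my own; the sample size, true values and replicate counts are arbitrary choices) checks the frequentist coverage of one-sided posterior quantiles of \theta=\mu/\sigma under \pi(\mu,\sigma)\propto\sigma^{-1}, using the standard form of the normal posterior under that prior.

import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true, n, alpha = 1.0, 2.0, 10, 0.05
n_rep, n_post = 5000, 4000
theta_true = mu_true / sigma_true

cover = 0
for _ in range(n_rep):
    x = rng.normal(mu_true, sigma_true, n)
    xbar, s2 = x.mean(), x.var(ddof=1)
    # under pi(mu, sigma) ∝ 1/sigma:
    # sigma^2 | x ~ (n-1) S^2 / chi^2_{n-1},  mu | sigma^2, x ~ N(xbar, sigma^2 / n)
    sig2 = (n - 1) * s2 / rng.chisquare(n - 1, n_post)
    mu = rng.normal(xbar, np.sqrt(sig2 / n))
    theta_draws = mu / np.sqrt(sig2)
    cover += (theta_true <= np.quantile(theta_draws, 1 - alpha))

print(cover / n_rep)   # close to 0.95 even for n = 10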

Example 3 ((Continued)).

Consider once again the Neyman–Scott example. Since the Fisher information matrix is I(\theta_{1},\ldots,\theta_{n},\sigma^{2})=\operatorname{Diag}(k\sigma^{-2},\ldots,k\sigma^{-2},nk\sigma^{-4}/2), \sigma^{2} is orthogonal to (\theta_{1},\ldots,\theta_{n}). Now, the class of first-order matching priors is given by \sigma^{-2}h(\theta_{1},\ldots,\theta_{n}), where h is arbitrary. Simple algebra shows that in this case both the first and second terms in (31) are zero, so that every first-order quantile matching prior is also second-order matching.

Example 4 ((Continued)).

From Tibshirani (1989), it follows that the class of first-order quantile matching priors for \theta is of the form (1+\theta^{2})^{-1}h(\phi), where h is arbitrary. Once again, since both the first and second terms in (31) are zero, every first-order quantile matching prior is also second-order matching.

Example 5 ((Continued)).

Again from Tibshirani (1989), the classes of first-order matching priors when m, r and u are the parameters of interest are given respectively by h_{1}(r,u), r^{-1}h_{2}(m,u) and u^{-1}h_{3}(m,r), where h_{1}, h_{2} and h_{3} are arbitrary nonnegative functions. The prior \pi_{S}(r,u)\propto(ru)^{-3/2} is second-order matching when m is the parameter of interest, while any first-order matching prior is also second-order matching when either r or u is the parameter of interest.

It may be of interest to find an example where a reference prior is not a second-order matching prior. Consider the gamma p.d.f. f(x|\mu,\lambda)=(\lambda^{\lambda}/\Gamma(\lambda))\exp[-\lambda x/\mu]x^{\lambda-1}\mu^{-\lambda}, where the mean \mu is the parameter of interest. The Fisher information matrix is \operatorname{Diag}(\lambda\mu^{-2},\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-1/\lambda). The two-group reference prior of Bernardo (1979) is then given by \mu^{-1}[\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-(1/\lambda)]^{1/2}, while the unique second-order quantile matching prior is given by \lambda\mu^{-1}[\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-(1/\lambda)].
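
As a quick check of the orthogonality underlying this example, the sketch below (my own, assuming sympy is available) recomputes the Fisher information matrix for the gamma density in the (\mu,\lambda) parameterization; the second derivatives of the log density are linear in x, so substituting E(X)=\mu yields the exact expectations.

import sympy as sp

x, mu, lam = sp.symbols('x mu lambda', positive=True)
logf = lam*sp.log(lam) - sp.log(sp.gamma(lam)) - lam*sp.log(mu) \
       + (lam - 1)*sp.log(x) - lam*x/mu

def info_entry(a, b):
    # E[- d^2 log f / (da db)]; substitute x -> mu since the derivative is linear in x
    return sp.simplify(-sp.diff(logf, a, b).subs(x, mu))

I = sp.Matrix([[info_entry(mu, mu), info_entry(mu, lam)],
               [info_entry(lam, mu), info_entry(lam, lam)]])
print(I)   # Diag[lambda/mu^2, polygamma(1, lambda) - 1/lambda]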

In some of these examples, especially for the location and location–scale families, one gets exact rather than merely asymptotic matching. This happens, in particular, when the matching prior is a right-invariant Haar prior. We will see some examples in the next section.

5 Other Priors

5.1 Invariant Priors

Very often objective priors are derived via some invariance criterion. We illustrate with the location–scale family of distributions.

Let X have p.d.f. p(x|\mu,\sigma)=\sigma^{-1}f((x-\mu)/\sigma), -\infty<\mu<\infty, 0<\sigma<\infty, where f is a p.d.f. Then, as found in Section 4, the Fisher information matrix is of the form I(\mu,\sigma)=\sigma^{-2}\pmatrix{c_{1}&c_{2}\cr c_{2}&c_{3}}. Hence, Jeffreys' general rule prior is \pi_{J}(\mu,\sigma)\propto\sigma^{-2}. This prior, as we will see in this section, corresponds to a left-invariant Haar prior. In contrast, Jeffreys' independence prior \pi_{I}(\mu,\sigma)\propto\sigma^{-1} corresponds to a right-invariant Haar prior.

In order to demonstrate this, consider the group of linear transformations G=\{g_{a,b}: -\infty<a<\infty, b>0\}, where g_{a,b}(x)=a+bx. The induced group of transformations on the parameter space will be denoted by \bar{G}=\{\bar{g}_{a,b}\}, where \bar{g}_{a,b}(\mu,\sigma)=(a+b\mu,b\sigma). The general theory of locally compact groups states that there exist two measures \eta_{1} and \eta_{2} on \bar{G} such that \eta_{1} is left-invariant and \eta_{2} is right-invariant. What this means is that for all \bar{g}\in\bar{G} and all subsets A of \bar{G}, \eta_{1}(\bar{g}A)=\eta_{1}(A) and \eta_{2}(A\bar{g})=\eta_{2}(A), where \bar{g}A=\{\bar{g}\bar{g}_{*}: \bar{g}_{*}\in A\} and A\bar{g}=\{\bar{g}_{*}\bar{g}: \bar{g}_{*}\in A\}. The measures \eta_{1} and \eta_{2} are referred to, respectively, as left- and right-invariant Haar measures. For the location–scale family of distributions, the left- and right-invariant Haar priors turn out to be \pi_{L}(\mu,\sigma)\propto\sigma^{-2} and \pi_{R}(\mu,\sigma)\propto\sigma^{-1}, respectively (cf. Berger, 1985, pages 406–407; Ghosh, Delampady and Samanta, 2006, pages 136–138).

The right-Haar prior usually enjoys more optimality properties than the left-Haar prior. Some optimality properties of left-Haar priors are given in Datta and Ghosh (1995b). In Example 1, for the location–scale family of distributions, the right-Haar prior is Bernardo’s reference prior when either μ\mu or σ\sigma is the parameter of interest, while the other parameter is the nuisance parameter. Also, it is shown in Datta, Ghosh and Mukerjee (2000) that for the location–scale family of distributions, the right-Haar prior yields exact matching of the coverage probabilities of Bayesian credible intervals and the corresponding frequentist confidence intervals when either μ\mu or σ\sigma is the parameter of interest, while the other parameter is the nuisance parameter.

For simplicity, we demonstrate this only for the normal example. Let X_{1},\ldots,X_{n}|\mu,\sigma^{2} be i.i.d. \mathrm{N}(\mu,\sigma^{2}), where n\geq 2. With the right-Haar prior \pi_{R}(\mu,\sigma)\propto\sigma^{-1}, the marginal posterior distribution of \mu is Student's t with location parameter \bar{X}=\sum_{i=1}^{n}X_{i}/n, scale parameter S/\sqrt{n}, where (n-1)S^{2}=\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}, and n-1 degrees of freedom. Hence, if \mu_{1-\alpha} denotes the 100(1-\alpha)th percentile of this marginal posterior, then

1-\alpha = P(\mu\leq\mu_{1-\alpha}|X_{1},\ldots,X_{n})
= P[\sqrt{n}(\mu-\bar{X})/S\leq\sqrt{n}(\mu_{1-\alpha}-\bar{X})/S\,|\,X_{1},\ldots,X_{n}]
= P[t_{n-1}\leq\sqrt{n}(\mu_{1-\alpha}-\bar{X})/S],

so that \sqrt{n}(\mu_{1-\alpha}-\bar{X})/S=t_{n-1,1-\alpha}, the 100(1-\alpha)th percentile of t_{n-1}. Now

P(\mu\leq\mu_{1-\alpha}|\mu,\sigma)=P[\sqrt{n}(\bar{X}-\mu)/S\geq-t_{n-1,1-\alpha}\,|\,\mu,\sigma]=1-\alpha=P(\mu\leq\mu_{1-\alpha}|X_{1},\ldots,X_{n}).

This provides the exact coverage matching probability for μ\mu.

Next, with the same setup, when \sigma^{2} is the parameter of interest, its marginal posterior is Inverse \operatorname{Gamma}((n-1)/2,(n-1)S^{2}/2). Now, if \sigma_{1-\alpha}^{2} denotes the 100(1-\alpha)th percentile of this marginal posterior, then \sigma_{1-\alpha}^{2}=(n-1)S^{2}/\chi_{n-1;\alpha}^{2}, where \chi_{n-1;\alpha}^{2} is the 100\alpha th percentile of the \chi_{n-1}^{2} distribution. Now

P(\sigma^{2}\leq\sigma_{1-\alpha}^{2}|\mu,\sigma)=P[(n-1)S^{2}/\sigma^{2}\geq\chi_{n-1;\alpha}^{2}\,|\,\mu,\sigma]=1-\alpha,

showing once again the exact coverage matching.
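
A short numerical sketch (my own, assuming numpy and scipy are available; the sample size and true value are arbitrary) confirms both that the posterior quantile of \sigma^{2} agrees with the chi-square expression above and that its frequentist coverage is exact.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, alpha, sigma_true = 8, 0.05, 1.5

# the 1 - alpha posterior quantile of sigma^2, computed two ways for one data set
x = rng.normal(0.0, sigma_true, n)
s2 = x.var(ddof=1)
q_invgamma = stats.invgamma.ppf(1 - alpha, a=(n - 1)/2, scale=(n - 1)*s2/2)
q_pivot = (n - 1)*s2 / stats.chi2.ppf(alpha, df=n - 1)
print(q_invgamma, q_pivot)        # identical

# frequentist coverage of the one-sided credible bound over repeated samples
chi_a = stats.chi2.ppf(alpha, df=n - 1)
s2_rep = rng.normal(0.0, sigma_true, (100000, n)).var(axis=1, ddof=1)
print(np.mean(sigma_true**2 <= (n - 1)*s2_rep/chi_a))   # approximately 0.95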

The general definition of a right-invariant Haar density on \bar{\mathcal{G}}, which we will denote by h_{r}, is that it must satisfy \int_{A\bar{g}}h_{r}(x)\,dx=\int_{A}h_{r}(x)\,dx for all \bar{g}\in\bar{\mathcal{G}}, where A\bar{g}=\{\bar{g}_{*}\bar{g}: \bar{g}_{*}\in A\}. Similarly, a left-invariant Haar density h_{l} on \bar{\mathcal{G}} must satisfy \int_{\bar{g}A}h_{l}(x)\,dx=\int_{A}h_{l}(x)\,dx, where \bar{g}A=\{\bar{g}\bar{g}_{*}: \bar{g}_{*}\in A\}. An alternative representation of the right- and left-Haar densities is P^{h_{r}}(A\bar{g})=P^{h_{r}}(A) and P^{h_{l}}(\bar{g}A)=P^{h_{l}}(A), respectively.

It is shown in Halmos (1950) and Nachbin (1965) that the right- and left-invariant Haar densities exist and are unique up to a multiplicative constant. Berger (1985) provides the calculation of h_{r} and h_{l} in a very general framework. He points out that if \bar{\mathcal{G}} is isomorphic to the parameter space \Theta, then one can construct right- and left-invariant Haar priors on \Theta. A very substantial account of invariant Haar densities is available in Datta and Ghosh (1995b). Severini, Mukerjee and Ghosh (2002) have demonstrated the exact matching property of right-invariant Haar densities in a prediction context under fairly general conditions.

5.2 Moment Matching Priors

Here we discuss a new matching criterion, which we will refer to as the “moment matching criterion.” For a regular family of distributions, the classical Bernstein–von Mises result (see, e.g., Ferguson, 1996, page 141; Ghosh, Delampady and Samanta, 2006, page 104) establishes the asymptotic normality of the posterior of a parameter vector, centered at the maximum likelihood estimator or the posterior mode, with variance equal to the inverse of the observed Fisher information matrix evaluated at the maximum likelihood estimator or the posterior mode. We utilize the same asymptotic expansion to find priors which provide a high order of matching between the posterior mean and the maximum likelihood estimator. For simplicity of exposition, we shall primarily confine ourselves to priors which achieve the matching of the first moment, although it is easy to see how higher order moment matching is equally possible.

The motivation for moment matching priors stems from several considerations. First, these priors lead to posterior means which share the asymptotic optimality of the MLE's up to a high order. In particular, if one is interested in asymptotic bias or MSE reduction of the MLE's through some adjustment, the same adjustment applies directly to the posterior means. In this way, it is possible to achieve a Bayes–frequentist synthesis of point estimates. A second important aspect of these priors is that they provide new viable alternatives to Jeffreys' prior, motivated by the proposed criterion, even for real-valued parameters in the absence of nuisance parameters. A third motivation, which will be made clear later in this section, is that with moment matching priors it is possible to construct credible regions for parameters of interest based only on the posterior mean and the posterior variance, which match the maximum likelihood based confidence intervals to a high order of approximation. We will confine ourselves primarily to regular families of distributions.

Let X_{1},X_{2},\ldots,X_{n}|\theta be independent and identically distributed with common density function f(x|\theta), where \theta\in\Theta, an interval of the real line. Consider a general class of priors \pi(\theta), \theta\in\Theta, for \theta. Throughout, it is assumed that both f and \pi satisfy all the needed regularity conditions as given in Johnson (1970) and Bickel and Ghosh (1990).

Let \hat{\theta}_{n} denote the maximum likelihood estimator of \theta. Under the prior \pi, we denote the posterior mean of \theta by \hat{\theta}^{B}_{n}. The formal asymptotic expansion given in Section 2 now leads to \hat{\theta}^{B}_{n}=\hat{\theta}_{n}+n^{-1}(\frac{a_{3}}{2\hat{I}_{n}^{2}}+\frac{1}{\hat{I}_{n}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})})+O_{p}(n^{-3/2}), where a_{3} and \hat{I}_{n} are defined in Theorem 1. The law of large numbers and the consistency of the MLE now give n(\hat{\theta}^{B}_{n}-\hat{\theta}_{n})\stackrel{P}{\rightarrow}(-\frac{g_{3}(\theta)}{2I^{2}(\theta)}+\frac{1}{I(\theta)}\frac{\pi^{\prime}(\theta)}{\pi(\theta)}). With the choice \pi(\theta)=\exp[\frac{1}{2}\int^{\theta}\frac{g_{3}(t)}{I(t)}\,dt], which makes this limit zero, one gets \hat{\theta}^{B}_{n}-\hat{\theta}_{n}=O_{p}(n^{-3/2}). We will denote this prior by \pi_{M}(\theta).
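
To illustrate the recipe on a concrete model, the following sketch (my own, assuming sympy is available) solves (\log\pi)^{\prime}(\theta)=g_{3}(\theta)/\{2I(\theta)\} for the Poisson mean \theta; the resulting moment matching prior is proportional to \theta^{-1}, which differs from Jeffreys' prior \theta^{-1/2} but is consistent with the general result for the mean of an exponential family given in Example 6 below.

import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
logf = x*sp.log(theta) - theta       # the -log(x!) term does not involve theta

def Ex(expr):
    # the derivatives below are linear in x, so E[.] just replaces x by theta
    return sp.simplify(expr.subs(x, theta))

I  = Ex(-sp.diff(logf, theta, 2))    # 1/theta
g3 = Ex(-sp.diff(logf, theta, 3))    # -2/theta^2
piM = sp.exp(sp.integrate(g3/(2*I), theta))
print(sp.simplify(piM))              # 1/theta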

Ghosh and Liu (2011) have shown that if \phi is a one-to-one function of \theta, then the moment matching prior \pi_{M}(\phi) for \phi is given by \pi_{M}(\phi)=\pi_{M}(\theta)|\frac{d\theta}{d\phi}|^{3/2}. We now see an application of this result.

Example 6 ((Continued)).

Consider the regular one-parameter exponential family of densities given by f(x|\theta)=\exp[\theta x-\psi(\theta)+h(x)]. For the canonical parameter \theta, noting that I(\theta)=\psi^{\prime\prime}(\theta) and g_{3}(\theta)=\psi^{\prime\prime\prime}(\theta)=I^{\prime}(\theta), we get \pi_{M}(\theta)=\exp[\frac{1}{2}\int I^{\prime}(\theta)/I(\theta)\,d\theta]=I^{1/2}(\theta), which is Jeffreys' prior. On the other hand, for the population mean \phi=\psi^{\prime}(\theta), which is a strictly increasing function of \theta [since \psi^{\prime\prime}(\theta)=V(X|\theta)>0], the moment matching prior is \pi_{M}(\phi)=I(\phi). In particular, for the binomial proportion p, one gets the Haldane prior \pi_{H}(p)\propto p^{-1}(1-p)^{-1}, which is the same as Hartigan's (1964, 1998) maximum likelihood prior. However, for the canonical parameter \theta=\operatorname{logit}(p), whereas we get Jeffreys' prior, Hartigan (1964, 1998) gets the Laplace \operatorname{uniform}(0,1) prior.
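
For the binomial case the matching of the first moment is in fact exact, not merely asymptotic: under the (improper) Haldane prior, the posterior of p is \operatorname{Beta}(x,n-x) whenever 0<x<n, whose mean is the MLE x/n. A tiny numerical illustration (my own) contrasts this with Jeffreys' prior, for which the posterior mean differs from the MLE by a term of order 1/n.

n, x = 20, 7
mle = x / n
haldane_mean = x / (x + (n - x))          # Beta(x, n - x) posterior mean: exactly the MLE
jeffreys_mean = (x + 0.5) / (n + 1.0)     # Beta(x + 0.5, n - x + 0.5) posterior mean
print(mle, haldane_mean, jeffreys_mean)   # 0.35 0.35 0.3571...; the gap shrinks like 1/n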

Remark 4.

It is now clear that a fundamental difference between priors obtained by matching probabilities and those obtained by matching moments is the lack of invariance of the latter under one-to-one reparameterization. It may be interesting to find conditions under which a moment matching prior agrees with Jeffreys' prior I^{1/2}(\theta) or with the uniform constant prior. The former holds if and only if g_{3}(\theta)=I^{\prime}(\theta), while the latter holds if and only if g_{3}(\theta)=0.

The if parts of the above results are immediate from the definition of \pi_{M}(\theta). To prove the only if parts, note that if \pi_{M}(\theta)=I^{1/2}(\theta), then taking logarithms and differentiating with respect to \theta gives \frac{I^{\prime}(\theta)}{2I(\theta)}=\frac{g_{3}(\theta)}{2I(\theta)}, so that g_{3}(\theta)=I^{\prime}(\theta). On the other hand, if \pi_{M}(\theta)=c, then taking logarithms and differentiating with respect to \theta gives g_{3}(\theta)=0.

The above approach can be extended to the matching of higher moments as well. Noting that V_{\pi}(\theta|X_{1},\ldots,X_{n})=E_{\pi}[(\theta-\hat{\theta}_{n})^{2}|X_{1},\ldots,X_{n}]-(\hat{\theta}_{n}^{B}-\hat{\theta}_{n})^{2}, it follows immediately that under the moment matching prior \pi_{M}, V_{\pi}(\theta|X_{1},\ldots,X_{n})=(n\hat{I}_{n})^{-1}+O_{p}(n^{-2}). This fact facilitates the construction of credible intervals for \theta, the parameter of interest, centered at the posterior mean and scaled by the posterior standard deviation, which enjoy the same asymptotic properties as the confidence interval centered at the MLE and scaled by the square root of the reciprocal of the observed Fisher information number.

6 Summary and Conclusion

As mentioned in the Introduction, this article provides a selective review of objective priors, reflecting my own interest in and familiarity with the topics. I am well aware that many important contributions are left out. For instance, I have discussed only the two-group reference priors of Bernardo (1979). A more appealing later contribution by Berger and Bernardo (1992a) provided an algorithm for the construction of multi-group reference priors when these groups are arranged in accordance with their order of importance. In particular, the one-at-a-time reference priors, as advocated by these authors, have proved to be quite useful in practice. Ghosal (1997, 1999) provided the construction of reference priors in nonregular cases, while a formal definition of reference priors encompassing both regular and nonregular cases has recently been proposed by Berger, Bernardo and Sun (2009).

Regarding probability matching priors, we have discussed only the quantile matching criterion. There are several other, possibly equally important, probability matching criteria. Notable among these are the highest posterior density matching criterion, as well as matching via inversion of test statistics, such as the likelihood ratio, Rao or Wald statistics. Extensive discussion of such matching priors is given in Datta and Mukerjee (2004). Datta et al. (2000) constructed matching priors via a prediction criterion, and related exact results in this context are available in Fraser and Reid (2002). The issue of matching priors in the context of conditional inference has been discussed quite extensively in Reid (1996).

A different class of priors, called “the maximum likelihood prior,” was developed by Hartigan (1964, 1998). Roughly speaking, these priors are found by maximizing the expected distance between the prior and the posterior under a truncated Kullback–Leibler distance. Like the proposed moment matching priors, the maximum likelihood prior densities, when they exist, result in posterior means whose difference from the MLE's is asymptotically negligible. I have alluded to some of these priors for comparison with the other priors considered in this paper.

With the exception of the right- and left-invariant Haar priors, the derivations of the remaining priors are based essentially on the asymptotic expansion of the posterior density, together with the shrinkage argument of J. K. Ghosh. This approach provides a nice unified tool for the development of objective priors. I believe very strongly that many new priors will be found in the future by either a direct application or a slight modification of these tools.

The results of this article show that, in the absence of nuisance parameters, Jeffreys' prior is a clear winner in most situations. The only exception is the chi-square divergence, where different priors may emerge. But that corresponds only to one special case, namely, the boundary of the class of divergence priors, while Jeffreys' prior retains its optimality in the interior. In the presence of nuisance parameters, my own recommendation is to find two- or multi-group reference priors following the algorithm of Berger and Bernardo (1992a), and then narrow down this class of priors by finding its intersection with the class of probability matching priors. This approach can even lead to a unique objective prior in some situations; some simple illustrations are given in this article. I also want to point out the versatility of reference priors. For example, for nonregular models, Jeffreys' general rule prior does not work. But as shown in Ghosal (1997) and Berger, Bernardo and Sun (2009), one can extend the definition of reference priors to cover these situations as well.

The examples given in this paper are purposely quite simple, mainly to aid readers not at all familiar with the topic. Quite rightfully, they can be criticized as somewhat stylized. Both reference and probability matching priors, however, have been developed for more complex problems of practical importance. Among others, I may refer to Berger and Yang (1994), Berger, De Oliveira and Sanso (2001), Ghosh and Heo (2003), Ghosh, Carlin and Srivastava (1994) and Ghosh, Yin and Kim (2003). The topics of these papers include time series models, spatial models and inverse problems, such as linear calibration, as well as problems in bioassay, in particular, slope ratio and parallel line assays. One can easily extend this list. A very useful source for all these papers is Bernardo (2005).

Acknowledgments

This research was supported in part by NSF Grant Number SES-0631426 and NSA Grant Number MSPF-076-097. The comments of the Guest Editor and a reviewer led to substantial improvement of the manuscript.

References

  • (1) Amari, S. (1982). Differential geometry of curved exponential families—Curvatures and information loss. Ann. Statist. 10 357–387. \MR0653513
  • (2) Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 53 370–418.
  • (3) Berger, J. O. (1985). Statistical Decision Theory and Related Topics, 2nd ed. Springer, New York. \MR0804611
  • (4) Berger, J. O. and Bernardo, J. M. (1989). Estimating a product of means. Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84 200–207. \MR0999679
  • (5) Berger, J. O. and Bernardo, J. M. (1992a). On the development of reference priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press, New York. \MR1380269
  • (6) Berger, J. O. and Bernardo, J. M. (1992b). Reference priors in a variance components problem. In Bayesian Analysis in Statistics and Econometrics (P. K. Goel and N. S. Iyengar, eds.) 177–194. Springer, New York. \MR1194392
  • (7) Berger, J. O., Bernardo, J. M. and Sun, D. (2009). The formal definition of reference priors. Ann. Statist. 37 905–938. \MR2502655
  • (8) Berger, J. O., de Oliveira, V. and Sanso, B. (2001). Objective Bayesian analysis of spatially correlated data. J. Amer. Statist. Assoc. 96 1361–1374. \MR1946582
  • (9) Berger, J. O. and Yang, R. (1994). Noninformative priors and Bayesian testing for the AR(1) model. Econometric Theory 10 461–482. \MR1309107
  • (10) Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 41 113–147. \MR0547240
  • (11) Bernardo, J. M. (2005). Reference analysis. In Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.). North-Holland, Amsterdam. \MR2490522
  • (12) Bhattacharyya, A. K. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35 99–109. \MR0010358
  • (13) Bickel, P. J. and Ghosh, J. K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction—A Bayesian argument. Ann. Statist. 18 1070–1090. \MR1062699
  • (14) Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36 453–471. \MR1053841
  • (15) Clarke, B. and Barron, A. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference 41 37–60. \MR1292146
  • (16) Clarke, B. and Sun, D. (1997). Reference priors under the chi-square distance. Sankhyā A 59 215–231. \MR1665703
  • (17) Clarke, B. and Sun, D. (1999). Asymptotics of the expected posterior. Ann. Inst. Statist. Math. 51 163–185. \MR1704652
  • (18) Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46 440–464. \MR0790631
  • (19) Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 1–39. \MR0893334
  • (20) Datta, G. S. (1996). On priors providing frequentist validity of Bayesian inference for multiple parametric functions. Biometrika 83 287–298. \MR1439784
  • (21) Datta, G. S. and Ghosh, J. K. (1995a). On priors providing frequentist validity for Bayesian inference. Biometrika 82 37–45. \MR1332838
  • (22) Datta, G. S. and Ghosh, J. K. (1995b). Noninformative priors for maximal invariant in group models. Test 4 95–114. \MR1365042
  • (23) Datta, G. S. and Ghosh, M. (1995c). Some remarks on noninformative priors. J. Amer. Statist. Assoc. 90 1357–1363. \MR1379478
  • (24) Datta, G. S. and Ghosh, M. (1995d). Hierarchical Bayes estimators of the error variance in one-way ANOVA models. J. Statist. Plann. Inference 45 399–411. \MR1341333
  • (25) Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141–159. \MR1389884
  • (26) Datta, G. S., Ghosh, M. and Mukerjee, R. (2000). Some new results on probability matching priors. Calcutta Statist. Assoc. Bull. 50 179–192. \MR1843620
  • (27) Datta, G. S., Ghosh, M. and Kim, Y. (2002). Probability matching priors for one-way unbalanced random effects models. Statist. Decisions 20 29–51. \MR1904422
  • (28) Datta, G. S. and Mukerjee, R. (2004). Probability Matching Priors: Higher Order Asymptotics. Springer, New York. \MR2053794
  • (29) Datta, G. S., Mukerjee, R., Ghosh, M. and Sweeting, T. J. (2000). Bayesian prediction with approximate frequentist validity. Ann. Statist. 28 1414–1426. \MR1805790
  • (30) Datta, G. S. and Sweeting, T. J. (2005). Probability matching priors. In Bayesian Thinking, Modeling and Computation. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.). North-Holland, Amsterdam. \MR2490523
  • (31) Ferguson, T. (1996). A Course in Large Sample Theory. Chapman & Hall/CRC Press, Boca Raton, FL. \MR1699953
  • (32) Fraser, D. A. S. and Reid, N. (2002). Strong matching of frequentist and Bayesian parametric inference. J. Statist. Plann. Inference 103 263–285. \MR1896996
  • (33) Ghosal, S. (1997). Reference priors in multiparameter nonregular cases. Test 6 159–186. \MR1466439
  • (34) Ghosal, S. (1999). Probability matching priors for nonregular cases. Biometrika 86 956–964. \MR1741992
  • (35) Ghosh, J. K., Delampady, M. and Samanta, T. (2006). An Introduction to Bayesian Analysis. Springer, New York. \MR2247439
  • (36) Ghosh, M. and Liu, R. (2011). Moment matching priors. Sankhyā A. To appear.
  • (37) Ghosh, J. K. and Mukerjee, R. (1992). Non-informative priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 195–210. Oxford Univ. Press, New York. \MR1380277
  • (38) Ghosh, J. K., Sinha, B. K. and Joshi, S. N. (1982). Expansion for posterior probability and integrated Bayes risk. In Statistical Decision Theory and Related Topics III 1 403–456. Academic Press, New York. \MR0705299
  • (39) Ghosh, M., Carlin, B. P. and Srivastava, M. S. (1994). Probability matching priors for linear calibration. Test 4 333–357. \MR1379796
  • (40) Ghosh, M. and Heo, J. (2003). Default Bayesian priors for regression models with second-order autoregressive residuals. J. Time Ser. Anal. 24 269–282. \MR1984597
  • (41) Ghosh, M., Mergel, V. and Liu, R. (2011). A general divergence criterion for prior selection. Ann. Inst. Statist. Math. 63 43–58.
  • (42) Ghosh, M. and Mukerjee, R. (1998). Recent developments on probability matching priors. In Applied Statistical Science III (S. E. Ahmed, M. Ahsanullah and B. K. Sinha, eds.) 227–252. Nova Science Publishers, New York. \MR1673669
  • (43) Ghosh, M., Yin, M. and Kim, Y.-H. (2003). Objective Bayesian inference for ratios of regression coefficients in linear models. Statist. Sinica 13 409–422. \MR1977734
  • (44) Halmos, P. (1950). Measure Theory. Van Nostrand, New York. \MR0033869
  • (45) Hartigan, J. A. (1964). Invariant prior densities. Ann. Math. Statist. 35 836–845. \MR0161406
  • (46) Hartigan, J. A. (1998). The maximum likelihood prior. Ann. Statist. 26 2083–2103. \MR1700222
  • (47) Hellinger, E. (1909). Neue Begründung der Theorie quadratischen Formen von unendlichen vielen Veränderlichen. J. Reine Angew. Math. 136 210–271.
  • (48) Huzurbazar, V. S. (1950). Probability distributions and orthogonal parameters. Proc. Camb. Phil. Soc. 46 281–284. \MR0034567
  • (49) Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford Univ. Press, Oxford.
  • (50) Johnson, R. A. (1970). Asymptotic expansions associated with posterior distribution. Ann. Math. Statist. 41 851–864. \MR0263198
  • (51) Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. J. Amer. Statist. Assoc. 91 1343–1370. \MR1478684
  • (52) Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
  • (53) Lindley, D. V. (1956). On the measure of the information provided by an experiment. Ann. Math. Statist. 27 986–1005. \MR0083936
  • (54) Liu, R. (2009). On some new contributions towards objective priors. Unpublished Ph.D. dissertation. Dept. Statistics, Univ. Florida, Gainesville, FL. \MR2714091
  • (55) Morris, C. N. and Normand, S. L. (1992). Hierarchical models for combining information and for meta-analysis. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 321–344. Oxford. Univ. Press, New York. \MR1380284
  • (56) Mukerjee, R. and Dey, D. K. (1993). Frequentist validity of posterior quantiles in the presence of a nuisance parameter: Higher-order asymptotics. Biometrika 80 499–505. \MR1248017
  • (57) Mukerjee, R. and Ghosh, M. (1997). Second-order probability matching priors. Biometrika 84 970–975. \MR1625016
  • (58) Nachbin, L. (1965). The Haar Integral. Van Nostrand, New York. \MR0175995
  • (59) Reid, N. (1996). Likelihood and Bayesian approximation methods (with discussion). In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 351–368. Oxford Univ. Press, New York. \MR1425414
  • (60) Severini, T. A., Mukerjee, R. and Ghosh, M. (2002). On an exact probability matching property of right-invariant priors. Biometrika 89 952–957. \MR1946524
  • (61) Staicu, A.-M. and Reid, N. (2008). On probability matching priors. Canad. J. Statist. 36 613–622. \MR2532255
  • (62) Tibshirani, R. J. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608. \MR1040654
  • (63) Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329. \MR0173309
  • (64) Ye, K. (1994). Bayesian reference prior analysis on the ratio of variances for the balanced one-way random effect model. J. Statist. Plann. Inference 41 267–280. \MR1309613