
Objective Priors: An Introduction for Frequentists

Malay Ghosh
University of Florida
Malay Ghosh is Distinguished Professor, University of Florida, 223 Griffin–Floyd Hall, Gainesville, Florida 32611-8545, USA (e-mail: ghoshm@stat.ufl.edu).
(2011)
Abstract

Bayesian methods are increasingly applied these days in the theory and practice of statistics. Any Bayesian inference depends on a likelihood and a prior. Ideally one would like to elicit a prior from related sources of information or past data. In its absence, however, Bayesian methods need to rely on some “objective” or “default” priors, and the resulting posterior inference can still be quite valuable.

Not surprisingly, over the years, the catalog of objective priors has become prohibitively large, and one has to set some specific criteria for the selection of such priors. Our aim is to review some of these criteria, compare their performance, and illustrate them with some simple examples. While for very large sample sizes it possibly does not matter much which objective prior one uses, the selection of such a prior does influence inference for small or moderate samples. For regular models where asymptotic normality holds, Jeffreys’ general rule prior, the positive square root of the determinant of the Fisher information matrix, enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, however, there are many other priors which emerge as optimal depending on the criterion selected. One new feature of this article is that a prior different from Jeffreys’ is shown to be optimal under the chi-square divergence criterion even in the absence of nuisance parameters. This new prior is also invariant under one-to-one reparameterization.

Asymptotic expansion,
divergence criterion,
first-order probability matching,
Jeffreys’ prior,
left Haar priors,
location family,
location–scale family,
multiparameter,
orthogonality,
reference priors,
right Haar priors,
scale family,
second-order probability matching,
shrinkage argument,
doi: 10.1214/10-STS338
Volume 26, Issue 2

Discussed in 10.1214/11-STS338A and 10.1214/11-STS338B; rejoinder at 10.1214/11-STS338REJ.

1 Introduction

Bayesian methods have been increasingly used in recent years in the theory and practice of statistics. Their implementation requires specification of both a likelihood and a prior. With enough historical data, it is possible to elicit a prior distribution fairly accurately. However, even in its absence, Bayesian methods, if judiciously used, can produce meaningful inferences based on the so-called “objective” or “default” priors.

The main focus of this article is to introduce certain objective priors which could be potentially useful even for frequentist inference. One such example, where frequentists are yet to reach a consensus about an “optimal” approach, is the construction of confidence intervals for the ratio of two normal means, the celebrated Fieller–Creasy problem. It is shown in Section 4 of this paper how an “objective” prior produces a credible interval in this case which meets the target coverage probability of a frequentist confidence interval even for small or moderate sample sizes. Another situation, which has often become a real challenge for frequentists, is to find a suitable method for elimination of nuisance parameters when the dimension of the parameter grows in direct proportion to the sample size. This is what is usually referred to as the Neyman–Scott phenomenon. We will illustrate in Section 3, with an example, how an objective prior can sometimes overcome this problem.

Before getting into the main theme of this paper, we recount briefly the early history of objective priors. One of the earliest uses is usually attributed to Bayes (1763) and Laplace (1812), who recommended using a uniform prior for the binomial proportion $p$ in the absence of any other information. While intuitively quite appealing, this prior has often been criticized due to its lack of invariance under one-to-one reparameterization. For example, a uniform prior for $p$ in the binomial case does not result in a uniform prior for $p^{2}$. A more compelling example is that a uniform prior for $\sigma$, the population standard deviation, does not result in a uniform prior for $\sigma^{2}$, and the converse is also true. In a situation like this, it is not at all clear whether there can be any preference to assign a uniform prior to either $\sigma$ or $\sigma^{2}$.

In contrast, Jeffreys’ (1961) general rule prior, namely, the positive square root of the determinant of the Fisher information matrix, is invariant under one-to-one reparameterization of parameters. We will motivate this prior from several asymptotic considerations. In particular, for regular models where asymptotic normality holds, Jeffreys’ prior enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, this prior suffers from many problems, the marginalization paradox and the Neyman–Scott problem being two examples. Indeed, for location–scale models, Jeffreys himself recommended alternative priors.

There are several criteria for the construction of objective priors. The present article primarily reviews two of these criteria in some detail, namely, “divergence priors” and “probability matching priors,” and finds optimal priors under these criteria. The class of divergence priors includes the “reference priors” introduced by Bernardo (1979). The “probability matching priors” were introduced by Welch and Peers (1963), and there have been many generalizations in the past two decades. The development of both these classes of priors relies on asymptotic considerations. Somewhat more briefly, I have also discussed a few other priors, including the “right” and “left” Haar priors.

The paper does not attempt the extensive, thorough and comprehensive review of Kass and Wasserman (1996), nor does it aspire to the somewhat narrowly focused, but very comprehensive, reviews of probability matching priors given in Ghosh and Mukerjee (1998), Datta and Mukerjee (2004) and Datta and Sweeting (2005). A very comprehensive review of reference priors is now available in Bernardo (2005), and a unified approach is given in the recent article of Berger, Bernardo and Sun (2009).

While primarily a review, the present article has been able to unify as well as generalize some of the previously considered criteria, for example, viewing the reference priors as members of a bigger class of divergence priors. Interestingly, with some of these criteria as presented here, it is possible to construct some alternatives to Jeffreys’ prior even in the absence of nuisance parameters.

The outline of the remaining sections is as follows. In Section 2 we introduce two basic tools to be used repeatedly in the subsequent sections. One such tool, involving an asymptotic expansion of the posterior density, is due to Johnson (1970) and Ghosh, Sinha and Joshi (1982), and is discussed quite extensively in Ghosh, Delampady and Samanta (2006) and Datta and Mukerjee (2004). The second tool involves a shrinkage argument suggested by Dawid and used extensively by J. K. Ghosh and his co-authors. It is shown in Section 3 that this shrinkage argument can also be used in deriving priors under the criterion of maximizing the distance between the prior and the posterior. The distance measures used include, but are not limited to, the Kullback–Leibler (K–L) distance considered in Bernardo (1979) for constructing two-group “reference priors.” Also, in this section we consider a new prior, different from Jeffreys’, even in the one-parameter case, which is also invariant under one-to-one reparameterization. Section 4 addresses the construction of priors under probability matching criteria. Certain other priors are introduced in Section 5, and it is pointed out that some of these priors can often provide exact and not just asymptotic matching. Some final remarks are made in Section 6.

Throughout this paper the results are presented more or less in a heuristic fashion, that is, without paying much attention to the regularity conditions needed to justify these results. More emphasis is placed on the application of these results in the construction of objective priors.

2 Two Basic Tools

An asymptotic expansion of the posterior density began with Johnson (1970), followed up later by Ghosh, Sinha and Joshi (1982), and many others. The result goes beyond the Bernstein–von Mises theorem, which provides asymptotic normality of the posterior density. Typically, such an expansion is centered around the MLE (and occasionally the posterior mode), and requires only derivatives of the log-likelihood with respect to the parameters, evaluated at their MLEs. These expansions are available even for heavy-tailed densities such as the Cauchy, because finiteness of moments of the distribution is not needed. The result goes a long way in finding asymptotic expansions for the posterior moments of parameters of interest, as well as in finding asymptotic posterior predictive distributions.

The asymptotic expansion of the posterior resembles that of an Edgeworth expansion, but, unlike the latter, this approach does not need use of cumulants of the distribution. Finding cumulants, though conceptually easy, can become quite formidable, especially in the presence of multiple parameters, demanding evaluation of mixed cumulants.

We have used this expansion as a first step in the derivation of objective priors under different criteria. Together with the shrinkage argument as mentioned earlier in the Introduction, and to be discussed later in this section, one can easily unify and extend many of the known results on prior selection. In particular, we will see later in this section how some of the reference priors of Bernardo (1979) can be found via application of these two tools. The approach also leads to a somewhat surprising result involving asymptotic expansion of the distribution function of the MLE in a fairly general setup, and is not restricted to any particular family of distributions, for example, the exponential family, or the location–scale family. A detailed exposition is available in Datta and Mukerjee (2004, pages 5–8).

For simplicity of exposition, we consider primarily the one-parameter case. Results needed for the multiparameter case will occasionally be mentioned, and, in most cases, these are straightforward, albeit often cumbersome, extensions of one-parameter results. Moreover, as stated in the Introduction, the results will be given without full rigor, that is, without any specific mention of the needed regularity conditions.

We begin with $X_{1},\dots,X_{n}|\theta$ i.i.d. with common p.d.f. $f(X|\theta)$. Let $\hat{\theta}_{n}$ denote the MLE of $\theta$. The likelihood function is denoted by $L_{n}(\theta)=\prod^{n}_{1}f(X_{i}|\theta)$, and let $\ell_{n}(\theta)=\log L_{n}(\theta)$. Let $a_{i}=n^{-1}[d^{i}\ell_{n}(\theta)/d\theta^{i}]_{\theta=\hat{\theta}_{n}}$, $i=1,2,\dots,$ and let $\hat{I}_{n}=-a_{2}$, the observed per unit Fisher information number. Consider a twice differentiable prior $\pi$. Let $T_{n}=\sqrt{n}(\theta-\hat{\theta}_{n})\hat{I}_{n}^{1/2}$, and let $\pi^{*}_{n}(t)$ denote the posterior p.d.f. of $T_{n}$ given $X_{1},\ldots,X_{n}$. Then, under certain regularity conditions, we have the following result.

Theorem 1

$\pi^{*}_{n}(t)=\phi(t)[1+n^{-1/2}\gamma_{1}(t;X_{1},\ldots,X_{n})+n^{-1}\gamma_{2}(t;X_{1},\dots,X_{n})]+O_{p}(n^{-3/2})$, where $\phi(t)$ is the standard normal p.d.f.,
\[
\gamma_{1}(t;X_{1},\dots,X_{n})=\frac{a_{3}t^{3}}{6\hat{I}_{n}^{3/2}}+\frac{t}{\hat{I}_{n}^{1/2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}
\]
and
\begin{eqnarray*}
\gamma_{2}(t;X_{1},\dots,X_{n})&=&\frac{a_{4}t^{4}}{24\hat{I}_{n}^{2}}+\frac{a^{2}_{3}t^{6}}{72\hat{I}_{n}^{3}}+\frac{t^{2}}{2\hat{I}_{n}}\frac{\pi^{\prime\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}+\frac{a_{3}t^{4}}{6\hat{I}_{n}^{2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}\\
&&{}-\frac{a_{4}}{8\hat{I}_{n}^{2}}-\frac{15a^{2}_{3}}{72\hat{I}_{n}^{3}}-\frac{1}{2\hat{I}_{n}}\frac{\pi^{\prime\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}-\frac{a_{3}}{2\hat{I}_{n}^{2}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})}.
\end{eqnarray*}

The proof is given in Ghosh, Delampady and Samanta (2006, pages 107–108). The statement there involves a few minor typos which can be corrected easily. We outline here only a few key steps needed in the proof.

We begin with the posterior p.d.f.,

\begin{equation}
\pi(\theta|X_{1},\dots,X_{n})=\exp[\ell_{n}(\theta)]\pi(\theta)\Big/\int\exp[\ell_{n}(\theta)]\pi(\theta)\,d\theta. \tag{1}
\end{equation}

Substituting $t=\sqrt{n}(\theta-\hat{\theta}_{n})\hat{I}_{n}^{1/2}$, the posterior p.d.f. of $T_{n}$ is given by
\begin{eqnarray}
\pi^{*}_{n}(t)&=&C^{-1}_{n}\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]\nonumber\\
&&{}\cdot\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}, \tag{2}\\
\mbox{where }C_{n}&=&\int\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]\nonumber\\
&&\hphantom{\int}{}\cdot\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}\,dt. \tag{3}
\end{eqnarray}

The rest of the proof involves a Taylor expansion of $\exp[\ell_{n}\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}-\ell_{n}(\hat{\theta}_{n})]$ and $\pi\{\hat{\theta}_{n}+t(n\hat{I}_{n})^{-1/2}\}$ around $\hat{\theta}_{n}$ up to a desired order, and collecting the coefficients of $n^{-1/2}$, $n^{-1}$, etc. The other component is the evaluation of $C_{n}$ via moments of the N(0, 1) distribution.

Remark 1.

The above result is useful in finding certain expansions for the posterior moments as well. In particular, noting that $\theta=\hat{\theta}_{n}+(n\hat{I}_{n})^{-1/2}T_{n}$, it follows that the asymptotic expansion of the posterior mean of $\theta$ is given by
\begin{equation}
E(\theta|X_{1},\dots,X_{n})=\hat{\theta}_{n}+n^{-1}\biggl\{\frac{a_{3}}{2\hat{I}_{n}^{2}}+\frac{\pi^{\prime}(\hat{\theta}_{n})}{\hat{I}_{n}\pi(\hat{\theta}_{n})}\biggr\}+O_{p}(n^{-3/2}). \tag{5}
\end{equation}

Also, $V(\theta|X_{1},\ldots,X_{n})=(n\hat{I}_{n})^{-1}+O_{p}(n^{-3/2})$.
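As a quick numerical sanity check of this expansion (not part of the original argument), one can compare it with an exact posterior mean in a conjugate setting. The Python sketch below assumes an Exponential model with rate $\theta$ and an Exponential(1) prior, both chosen here purely for illustration; the posterior is then Gamma with shape $n+1$ and rate $\sum X_{i}+1$, so the exact mean is available in closed form.

import numpy as np

# Sketch: compare the exact posterior mean with the O(n^{-1}) expansion of Remark 1.
# Assumed model (for illustration only): X_i ~ Exponential(rate theta), prior pi(theta) = exp(-theta).
rng = np.random.default_rng(1)
theta_true, n = 1.5, 50
x = rng.exponential(scale=1.0 / theta_true, size=n)

theta_hat = 1.0 / x.mean()                    # MLE
I_hat = 1.0 / theta_hat**2                    # observed per-unit information, -a_2
a3 = 2.0 / theta_hat**3                       # a_3 = n^{-1} d^3 l_n / d theta^3 at the MLE
prior_score = -1.0                            # pi'(theta)/pi(theta) for pi(theta) = exp(-theta)

exact_mean = (n + 1) / (x.sum() + 1)          # mean of the Gamma(n+1, sum(x)+1) posterior
approx_mean = theta_hat + (a3 / (2 * I_hat**2) + prior_score / I_hat) / n
print(exact_mean, approx_mean)                # the two agree up to O_p(n^{-3/2})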

A multiparameter extension of Theorem 1 is as follows. Suppose that $\theta=(\theta_{1},\dots,\theta_{p})^{T}$ is the parameter vector and $\hat{\theta}_{n}$ is the MLE of $\theta$. Let

\begin{eqnarray*}
a_{jr}&=&-\hat{I}_{njr}=n^{-1}\frac{\partial^{2}\ell_{n}(\theta)}{\partial\theta_{j}\,\partial\theta_{r}}\bigg|_{\theta=\hat{\theta}_{n}},\\
a_{jrs}&=&n^{-1}\frac{\partial^{3}\ell_{n}(\theta)}{\partial\theta_{j}\,\partial\theta_{r}\,\partial\theta_{s}}\bigg|_{\theta=\hat{\theta}_{n}}
\end{eqnarray*}
and $\hat{I}_{n}=((\hat{I}_{njr}))$. Then, retaining only terms up to the order $n^{-1/2}$, the posterior of $W_{n}=\sqrt{n}(\theta-\hat{\theta}_{n})$ is given by
\begin{eqnarray}
\pi^{*}_{n}(w)&=&(2\pi)^{-p/2}|\hat{I}_{n}|^{1/2}\exp[-(1/2)w^{T}\hat{I}_{n}w]\nonumber\\
&&{}\cdot\Biggl[1+n^{-1/2}\Biggl\{\sum^{p}_{j=1}w_{j}\biggl(\frac{\partial\log\pi}{\partial\theta_{j}}\biggr)\bigg|_{\theta=\hat{\theta}_{n}}+\frac{1}{6}\sum_{j,r,s}w_{j}w_{r}w_{s}a_{jrs}\Biggr\}+O_{p}(n^{-1})\Biggr]. \tag{6}
\end{eqnarray}

Next we present the basic shrinkage argument of J. K. Ghosh, discussed in detail in Datta and Mukerjee (2004). The prime objective here is the evaluation of $E[q(X,\theta)|\theta]=\lambda(\theta)$, say, where $X$ and $\theta$ can be real- or vector-valued. The idea is to find first $\int\lambda(\theta)\bar{\pi}_{m}(\theta)\,d\theta$ through a sequence of priors $\{\bar{\pi}_{m}(\theta)\}$ defined on a compact set, and then to shrink the prior to degeneracy at some interior point, say $\theta$, of the compact set. The interesting point is that one never needs explicit specification of $\bar{\pi}_{m}(\theta)$ in carrying out this evaluation. We will see several illustrations of this in this article.

First, we present the shrinkage argument in a nutshell. Consider a proper prior $\bar{\pi}(\cdot)$ supported on a compact rectangle in the parameter space, vanishing on the boundary of its support while remaining positive in the interior; the support is the closure of this set. Consider the posterior of $\theta$ under $\bar{\pi}(\cdot)$ and, hence, obtain $E^{\bar{\pi}}[q(X,\theta)|X]$. Then find $E[\{E^{\bar{\pi}}(q(X,\theta)|X)\}|\theta]=\lambda(\theta)$ for $\theta$ in the interior of the support of $\bar{\pi}(\cdot)$. Finally, integrate $\lambda(\cdot)$ with respect to $\bar{\pi}(\cdot)$, and then allow $\bar{\pi}(\cdot)$ to converge to the prior degenerate at the true value of $\theta$, an interior point of the support. This yields $E[q(X,\theta)|\theta]$. The calculation assumes integrability of $q(X,\theta)$ over the joint distribution of $X$ and $\theta$; such integrability allows a change in the order of integration.

When executed up to the desired order of approximation, under suitable assumptions, these steps can lead to significant reduction in the algebra underlying higher order frequentist asymptotics. The simplification arises from two counts. First, although the Bayesian approach to frequentist asymptotics requires Edgeworth type assumptions, it avoids an explicit Edgeworth expansion involving calculation of approximate cumulants. Second, as we will see, it helps establish the results in an easily interpretable compact form. The following two sections will demonstrate multiple usage of these two basic tools.

3 Objective Priors Via Maximization of the Distance Between the Prior and the Posterior

3.1 Reference Priors

We begin with an alternate derivation of the reference prior of Bernardo. Following Lindley (1956), Bernardo (1979) suggested the Kullback–Leibler (K–L) divergence between the prior and the posterior, namely, $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$, where the expectation is taken over the joint distribution of $X$ and $\theta$. The target is to find a prior $\pi$ which maximizes the above distance. It is shown in Berger and Bernardo (1989) that if one does this maximization for a fixed $n$, this may lead to a discrete prior with finitely many jumps, a far cry from a diffuse prior. Hence, one needs an asymptotic maximization.

First write $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$ as

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}\pi(\theta|X)m^{\pi}(X)\,d\theta\,dX\nonumber\\
&=&\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}L_{n}(\theta)\pi(\theta)\,dX\,d\theta \tag{7}\\
&=&\int\pi(\theta)E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\Big|\theta\biggr]\,d\theta,\nonumber
\end{eqnarray}

where $X=(X_{1},\dots,X_{n})$, $L_{n}(\theta)=\prod_{1}^{n}f(X_{i}|\theta)$ is the likelihood function, and $m^{\pi}(X)$ denotes the marginal of $X$ after integrating out $\theta$. The integrations are carried out with respect to a prior $\pi$ having a compact support, subsequently passing to the limit as and when necessary.

Without any nuisance parameters, Bernardo (1979) showed somewhat heuristically that Jeffreys’ prior achieves the necessary maximization. A more rigorous proof was supplied later by Clarke and Barron (1990, 1994). We demonstrate heuristically how the shrinkage argument can also lead to the reference priors derived in Bernardo (1979). To this end, we first consider the one-parameter case for a regular family of distributions. We rewrite

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&\int\pi(\theta)E[\log\pi(\theta|X)|\theta]\,d\theta\\
&&{}-\int\pi(\theta)\log\pi(\theta)\,d\theta.
\end{eqnarray*}

Next we write

\[
E^{\bar{\pi}}[\log\pi(\theta|X)|X]=\int\log\pi(\theta|X)\bar{\pi}(\theta|X)\,d\theta.
\]

From the asymptotic expansion of the posterior, one gets

\begin{eqnarray*}
\log\pi(\theta|X)&=&(1/2)\log(n)-(1/2)\log(2\pi)\\
&&{}-\frac{n(\theta-\hat{\theta}_{n})^{2}}{2}\hat{I}_{n}+(1/2)\log(\hat{I}_{n})\\
&&{}+O_{p}(n^{-1/2}).
\end{eqnarray*}

Since $n(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}$ converges a posteriori to a $\chi_{1}^{2}$ distribution as $n\rightarrow\infty$, irrespective of the prior $\pi$, by the Bernstein–von Mises and Slutsky theorems, one gets

\begin{eqnarray}
E^{\bar{\pi}}[\log\pi(\theta|X)]&=&(1/2)\log(n)-(1/2)\log(2\pi e)\nonumber\\
&&{}+(1/2)\log(\hat{I}_{n})+O_{p}(n^{-1/2}). \tag{9}
\end{eqnarray}

Since the leading term on the right-hand side of (9) does not involve the prior $\bar{\pi}$, and $\hat{I}_{n}$ converges almost surely ($P_{\theta}$) to $I(\theta)$, applying the shrinkage argument, one gets from (9)

\begin{eqnarray*}
E[\log\pi(\theta|X)|\theta]&=&(1/2)\log(n)-(1/2)\log(2\pi e)\\
&&{}+\log(I^{1/2}(\theta))+O(n^{-1/2}).
\end{eqnarray*}

In view of the decomposition given earlier in this subsection, and considering only the leading terms in the above expansion, one needs to find a prior $\pi$ which maximizes $\int\log\{\frac{I^{1/2}(\theta)}{\pi(\theta)}\}\pi(\theta)\,d\theta$. This integral being nonpositive, by the property of the Kullback–Leibler information number, its maximum value is zero, which is attained for $\pi(\theta)=I^{1/2}(\theta)$, leading once again to Jeffreys’ prior.
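For a concrete, if crude, check of this asymptotic claim, one can estimate $E[\log\frac{\pi(\theta|X)}{\pi(\theta)}]$ by Monte Carlo for a simple model and compare proper priors. The Python sketch below assumes a Binomial$(n,\theta)$ model (our choice, for illustration only) and compares the normalized Jeffreys prior Beta$(1/2,1/2)$ with the uniform Beta$(1,1)$; for moderate $n$ the estimate is typically slightly larger under Jeffreys’ prior, in line with the asymptotics.

import numpy as np
from scipy.stats import beta, binom

# Monte Carlo sketch of E[log pi(theta|X)/pi(theta)] for Binomial(n, theta) data.
# Under a Beta(a, b) prior the posterior is Beta(a + x, b + n - x), so the log ratio is explicit.
rng = np.random.default_rng(0)
n, reps = 30, 200_000

def expected_kl(a, b):
    theta = beta.rvs(a, b, size=reps, random_state=rng)
    x = binom.rvs(n, theta, random_state=rng)
    return np.mean(beta.logpdf(theta, a + x, b + n - x) - beta.logpdf(theta, a, b))

print("Jeffreys Beta(1/2,1/2):", expected_kl(0.5, 0.5))
print("Uniform  Beta(1,1)    :", expected_kl(1.0, 1.0))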

The multiparameter generalization of the above result without any nuisance parameters is based on the asymptotic expansion

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&(p/2)\log(n)-(p/2)\log(2\pi e)\\
&&{}+\int\log\{|I(\theta)|^{1/2}/\pi(\theta)\}\pi(\theta)\,d\theta\\
&&{}+O(n^{-1/2}),
\end{eqnarray*}

and maximization of the leading term yields once again Jeffreys’ general rule prior $\pi(\theta)=|I(\theta)|^{1/2}$.

In the presence of nuisance parameters, however, Jeffreys’ general rule prior is no longer the distance maximizer. We will demonstrate this in the case when the parameter vector is split into two groups, one group consisting of the parameters of interest, and the other involving the nuisance parameters. In particular, Bernardo’s (1979) two-group reference prior will be included as a special case.

To this end, suppose $\theta=(\theta_{1},\theta_{2})$, where $\theta_{1}$ ($p_{1}\times1$) is the parameter of interest and $\theta_{2}$ ($p_{2}\times1$) is the nuisance parameter. We partition the Fisher information matrix $I(\theta)$ as

\[
I(\theta)=\begin{pmatrix}I_{11}(\theta)&I_{12}(\theta)\\ I_{21}(\theta)&I_{22}(\theta)\end{pmatrix}.
\]

First begin with a general conditional prior $\pi(\theta_{2}|\theta_{1})=\phi(\theta)$ (say). Bernardo (1979) considered $\phi(\theta)=|I_{22}(\theta)|^{1/2}$. The marginal prior $\pi(\theta_{1})$ for $\theta_{1}$ is then obtained by maximizing the distance $E[\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}]$. We begin by writing

\begin{equation}
\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}=\log\frac{\pi(\theta|X)}{\pi(\theta)}-\log\frac{\pi(\theta_{2}|\theta_{1},X)}{\pi(\theta_{2}|\theta_{1})}. \tag{11}
\end{equation}

Writing $\pi(\theta)=\pi(\theta_{1})\phi(\theta)$ and $|I(\theta)|=|I_{22}(\theta)|\cdot|I_{11.2}(\theta)|$, where $I_{11.2}(\theta)=I_{11}(\theta)-I_{12}(\theta)I_{22}^{-1}(\theta)I_{21}(\theta)$, the asymptotic expansion and the shrinkage argument together yield

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]&=&(p/2)\log(n)-(p/2)\log(2\pi e)\nonumber\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{22}(\theta)|^{1/2}|I_{11.2}(\theta)|^{1/2}}{\pi(\theta_{1})\phi(\theta)}\,d\theta_{2}\biggr\}\,d\theta_{1} \tag{12}\\
&&{}+O(n^{-1/2})\nonumber
\end{eqnarray}

and

\begin{eqnarray*}
E\biggl[\log\frac{\pi(\theta_{2}|\theta_{1},X)}{\pi(\theta_{2}|\theta_{1})}\biggr]&=&(p_{2}/2)\log(n)-(p_{2}/2)\log(2\pi e)\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{22}(\theta)|^{1/2}}{\phi(\theta)}\,d\theta_{2}\biggr\}\,d\theta_{1}\\
&&{}+O(n^{-1/2}).
\end{eqnarray*}

From (11) and the last two expansions, retaining only the leading terms,

\begin{eqnarray}
E\biggl[\log\frac{\pi(\theta_{1}|X)}{\pi(\theta_{1})}\biggr]&\approx&(p_{1}/2)\log(n)-(p_{1}/2)\log(2\pi e)\nonumber\\
&&{}+\int\pi(\theta_{1})\biggl\{\int\phi(\theta)\log\frac{|I_{11.2}(\theta)|^{1/2}}{\pi(\theta_{1})}\,d\theta_{2}\biggr\}\,d\theta_{1}. \tag{14}
\end{eqnarray}

Writing $\log\psi(\theta_{1})=\int\phi(\theta)\log|I_{11.2}(\theta)|^{1/2}\,d\theta_{2}$, once again by the property of the Kullback–Leibler information number, it follows that the maximizing prior is $\pi(\theta_{1})=\psi(\theta_{1})$.

We have purposely not set limits for these integrals. An important point to note [as pointed out in Berger and Bernardo (1989)] is that the evaluation of all these integrals is carried out over an increasing sequence of compact sets $K_{i}$ whose union is the entire parameter space. This is because most often we are working with improper priors, and direct evaluation of these integrals over the entire parameter space will simply give $+\infty$, which does not help in finding any prior. As an illustration, if the parameter space is $\mathcal{R}\times\mathcal{R}^{+}$, as is typically the case for the location–scale family of distributions, then one can take the increasing sequence of compact sets as $[-i,i]\times[i^{-1},i]$, $i\geq2$. All the proofs are usually carried out by taking a sequence of priors $\pi_{i}$ with compact support $K_{i}$, and eventually making $i\rightarrow\infty$. This important point should be borne in mind in the actual derivation of reference priors. We will now illustrate this for the location–scale family of distributions when one of the two parameters is the parameter of interest, while the other one is the nuisance parameter.

Example 1 (Location–scale models).

Suppose $X_{1},\ldots,X_{n}$ are i.i.d. with common p.d.f. $\sigma^{-1}f((x-\mu)/\sigma)$, where $\mu\in(-\infty,\infty)$ and $\sigma\in(0,\infty)$. Consider the sequence of priors $\pi_{i}$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ We may note that $I(\mu,\sigma)=\sigma^{-2}\bigl(\begin{smallmatrix}c_{1}&c_{2}\\ c_{2}&c_{3}\end{smallmatrix}\bigr)$, where the constants $c_{1}$, $c_{2}$ and $c_{3}$ are functions of $f$ and do not involve either $\mu$ or $\sigma$. So, if $\mu$ is the parameter of interest, and $\sigma$ is the nuisance parameter, following Bernardo’s (1979) prescription, one begins with the sequence of priors $\pi_{i2}(\sigma|\mu)=k_{i2}\sigma^{-1}$ where, solving $1=k_{i2}\int_{i^{-1}}^{i}\sigma^{-1}\,d\sigma$, one gets $k_{i2}=(2\log i)^{-1}$. Next one finds the prior $\pi_{i1}(\mu)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log(\sigma^{-1})\,d\sigma]$, which is a constant not depending on either $\mu$ or $\sigma$. Hence, the resulting joint prior $\pi_{i}(\mu,\sigma)=\pi_{i1}(\mu)\pi_{i2}(\sigma|\mu)\propto\sigma^{-1}$, which is the desired reference prior. Incidentally, this is Jeffreys’ independence prior rather than Jeffreys’ general rule prior, the latter being proportional to $\sigma^{-2}$. Conversely, when $\sigma$ is the parameter of interest and $\mu$ is the nuisance parameter, one begins with $\pi_{i2}(\mu|\sigma)=(2i)^{-1}$ and then, following Bernardo (1979) again, one finds $\pi_{i1}(\sigma)=c_{i1}\exp[\int_{-i}^{i}(2i)^{-1}\log(1/\sigma)\,d\mu]\propto\sigma^{-1}$. Thus, once again one gets Jeffreys’ independence prior. We will see in Section 5 that Jeffreys’ independence prior is a right Haar prior, while Jeffreys’ general rule prior is a left Haar prior for the location–scale family of distributions.

Example 2 (Noncentrality parameter).

Let $X_{1},\ldots,X_{n}|\mu,\sigma$ be i.i.d. N($\mu,\sigma^{2}$), where $\mu$ real and $\sigma(>0)$ are both unknown. Suppose the parameter of interest is $\theta=\mu/\sigma$, the noncentrality parameter. With the reparameterization from $(\mu,\sigma)$ to $(\theta,\sigma)$, the likelihood is rewritten as $L(\theta,\sigma)\propto\sigma^{-n}\exp[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(X_{i}-\theta\sigma)^{2}]$. Then the per observation Fisher information matrix is given by $I(\theta,\sigma)=\bigl(\begin{smallmatrix}1&\theta/\sigma\\ \theta/\sigma&(\theta^{2}+2)/\sigma^{2}\end{smallmatrix}\bigr)$. Consider once again the sequence of priors $\pi_{i}$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ Again, following Bernardo, $\pi_{i2}(\sigma|\theta)=k_{i2}\sigma^{-1}$, where $k_{i2}=(2\log i)^{-1}$. Noting that $I_{11.2}(\theta,\sigma)=1-\theta^{2}/(\theta^{2}+2)=2/(\theta^{2}+2)$, one gets $\pi_{i1}(\theta)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log(\sqrt{2}/(\theta^{2}+2)^{1/2})\,d\sigma]\propto(\theta^{2}+2)^{-1/2}$. Hence, the reference prior in this example is given by $\pi_{R}(\theta,\sigma)\propto(\theta^{2}+2)^{-1/2}\sigma^{-1}$. Due to its invariance property (Datta and Ghosh, 1996), in the original $(\mu,\sigma)$ parameterization, the two-group reference prior turns out to be $\pi_{R}(\mu,\sigma)\propto\sigma^{-1}(\mu^{2}+2\sigma^{2})^{-1/2}$.
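The entries of $I(\theta,\sigma)$ and the resulting $I_{11.2}(\theta,\sigma)$ quoted above can be checked symbolically. The following Python/SymPy sketch (ours, for verification only) computes the per observation information for a single $X\sim$ N($\theta\sigma,\sigma^{2}$).

import sympy as sp

# Symbolic check of the Fisher information in Example 2: X ~ N(theta*sigma, sigma^2).
x, z, theta = sp.symbols('x z theta', real=True)
sigma = sp.symbols('sigma', positive=True)
logf = -sp.log(2*sp.pi*sigma**2)/2 - (x - theta*sigma)**2/(2*sigma**2)

def info(p1, p2):
    # E[-d^2 log f / (dp1 dp2)], expectation taken with X = theta*sigma + sigma*Z, Z ~ N(0,1)
    expr = (-sp.diff(logf, p1, p2)).subs(x, theta*sigma + sigma*z)
    return sp.simplify(sp.integrate(expr*sp.exp(-z**2/2)/sp.sqrt(2*sp.pi), (z, -sp.oo, sp.oo)))

I11, I12, I22 = info(theta, theta), info(theta, sigma), info(sigma, sigma)
print(I11, I12, I22)                       # expect 1, theta/sigma, (theta**2 + 2)/sigma**2
print(sp.simplify(I11 - I12**2/I22))       # expect I_{11.2} = 2/(theta**2 + 2)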

Things simplify considerably if $\theta_{1}$ and $\theta_{2}$ are orthogonal in the Fisherian sense, namely, $I_{12}(\theta)=0$ (Huzurbazar, 1950; Cox and Reid, 1987). Then, if $I_{11}(\theta)$ and $I_{22}(\theta)$ factor respectively as $h_{11}(\theta_{1})h_{12}(\theta_{2})$ and $h_{21}(\theta_{1})h_{22}(\theta_{2})$, as a special case of a more general result of Datta and Ghosh (1995c), it follows that the two-group reference prior is given by $h_{11}^{1/2}(\theta_{1})h_{22}^{1/2}(\theta_{2})$.

Example 3.

As an illustration of the above, consider the celebrated Neyman–Scott problem (Berger and Bernardo, 1992a, 1992b). Consider a fixed effects one-way balanced normal ANOVA model where the number of observations per cell is fixed, but the number of cells grows to infinity. In symbols, let $X_{i1},\ldots,X_{ik}|\theta_{i}$ be mutually independent N($\theta_{i},\sigma^{2}$), $k\geq2$, $i=1,\ldots,n$, all parameters being assumed unknown. Let $S=\sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij}-\bar{X}_{i})^{2}/(n(k-1))$. Then the MLE of $\sigma^{2}$ is given by $(k-1)S/k$, which converges in probability, as $n\rightarrow\infty$, to $(k-1)\sigma^{2}/k$, and hence is inconsistent. Interestingly, Jeffreys’ prior in this case also produces an inconsistent estimator of $\sigma^{2}$, but the Berger–Bernardo reference prior does not.

To see this, we begin with the Fisher information matrix $I(\theta_{1},\ldots,\theta_{n},\sigma^{2})=k\operatorname{Diag}(\sigma^{-2},\ldots,\sigma^{-2},(1/2)n\sigma^{-4})$. Hence, Jeffreys’ prior is $\pi_{J}(\theta_{1},\ldots,\theta_{n},\sigma^{2})\propto(\sigma^{2})^{-n/2-1}$, which leads to the marginal posterior $\pi_{J}(\sigma^{2}|X)\propto(\sigma^{2})^{-nk/2-1}\exp[-n(k-1)S/(2\sigma^{2})]$ of $\sigma^{2}$, $X$ denoting the entire data set. Then the posterior mean of $\sigma^{2}$ is given by $n(k-1)S/(nk-2)$, while the posterior mode is given by $n(k-1)S/(nk+2)$. Both are inconsistent estimators of $\sigma^{2}$, as these converge in probability to $(k-1)\sigma^{2}/k$ as $n\rightarrow\infty$.

In contrast, by the result of Datta and Ghosh (1995c), the two-group reference prior is $\pi_{R}(\theta_{1},\ldots,\theta_{n},\sigma^{2})\propto(\sigma^{2})^{-1}$. This leads to the marginal posterior $\pi_{R}(\sigma^{2}|X)\propto(\sigma^{2})^{-n(k-1)/2-1}\exp[-n(k-1)S/(2\sigma^{2})]$ of $\sigma^{2}$. Now the posterior mean is given by $n(k-1)S/(n(k-1)-2)$, while the posterior mode is given by $n(k-1)S/(n(k-1)+2)$. Both are consistent estimators of $\sigma^{2}$.
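The contrast between the two posterior means is easy to see numerically. The Python sketch below (a simulation of our own, with $k=2$ and arbitrarily chosen cell means) uses the closed-form posterior means quoted above; the Jeffreys-based estimate stays near $(k-1)\sigma^{2}/k$, while the reference-prior estimate recovers $\sigma^{2}$.

import numpy as np

# Simulation sketch of the Neyman-Scott phenomenon in Example 3 (k = 2 observations per cell).
rng = np.random.default_rng(7)
n_cells, k, sigma2 = 5000, 2, 4.0
theta = rng.normal(0.0, 10.0, size=n_cells)                       # arbitrary cell means
X = rng.normal(theta[:, None], np.sqrt(sigma2), size=(n_cells, k))

S = np.sum((X - X.mean(axis=1, keepdims=True))**2) / (n_cells * (k - 1))
jeffreys_mean  = n_cells * (k - 1) * S / (n_cells * k - 2)          # posterior mean under pi_J
reference_mean = n_cells * (k - 1) * S / (n_cells * (k - 1) - 2)    # posterior mean under pi_R
print(jeffreys_mean, reference_mean)                                # approx. 2.0 versus approx. 4.0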

Example 4 (Ratio of normal means).

Let $X_{1}$ and $X_{2}$ be independent, with $X_{1}\sim$ N($\theta\mu,1$) and $X_{2}\sim$ N($\mu,1$), where the parameter of interest is $\theta$. This is the celebrated Fieller–Creasy problem. The Fisher information matrix in this case is $I(\theta,\mu)=\bigl(\begin{smallmatrix}\mu^{2}&\mu\theta\\ \mu\theta&1+\theta^{2}\end{smallmatrix}\bigr)$. With the transformation $\phi=\mu(1+\theta^{2})^{1/2}$, one obtains $I(\theta,\phi)=\operatorname{Diag}(\phi^{2}(1+\theta^{2})^{-2},1)$. Again, by Datta and Ghosh (1995c), the two-group reference prior is $\pi_{R}(\theta,\phi)\propto(1+\theta^{2})^{-1}$.
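The orthogonalizing transformation can be verified directly from the stated information matrix. The short SymPy sketch below (ours) transforms $I(\theta,\mu)$ to the $(\theta,\phi)$ parameterization with $\phi=\mu(1+\theta^{2})^{1/2}$.

import sympy as sp

# Sketch: check that phi = mu*(1 + theta^2)^{1/2} makes theta and phi orthogonal in Example 4.
theta, phi = sp.symbols('theta phi', real=True)
mu = phi / sp.sqrt(1 + theta**2)
I_mu = sp.Matrix([[mu**2, mu*theta], [mu*theta, 1 + theta**2]])    # information in (theta, mu)
J = sp.Matrix([[1, 0], [sp.diff(mu, theta), sp.diff(mu, phi)]])    # Jacobian of (theta, mu) w.r.t. (theta, phi)
I_phi = sp.simplify(J.T * I_mu * J)
print(I_phi)                                                       # expect Diag(phi^2/(1+theta^2)^2, 1)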

Example 5 (Random effects model).

This example has been visited and revisited on several occasions. Berger and Bernardo (1992b) first found reference priors for variance components in this problem when the number of observations per cell is the same. Later, Ye (1994) and Datta and Ghosh (1995c, 1995d) also found reference priors for this problem. The case involving unequal number of observations per cell was considered by Chaloner (1987) and Datta, Ghosh and Kim (2002).

For simplicity, we consider here only the case with an equal number of observations per cell. Let $Y_{ij}=m+\alpha_{i}+e_{ij}$, $j=1,\ldots,n$, $i=1,\ldots,k$. Here $m$ is an unknown parameter, while the $\alpha_{i}$’s and the $e_{ij}$’s are mutually independent, with the $\alpha_{i}$’s i.i.d. N($0,\sigma_{\alpha}^{2}$) and the $e_{ij}$’s i.i.d. N($0,\sigma^{2}$). The parameters $m$, $\sigma_{\alpha}^{2}$ and $\sigma^{2}$ are all unknown. We write $\bar{Y}_{i}=\sum_{j=1}^{n}Y_{ij}/n$, $i=1,\ldots,k$, and $\bar{Y}=\sum_{i=1}^{k}\bar{Y}_{i}/k$. The minimal sufficient statistic is $(\bar{Y},T,S)$, where $T=n\sum_{i=1}^{k}(\bar{Y}_{i}-\bar{Y})^{2}$ and $S=\sum_{i=1}^{k}\sum_{j=1}^{n}(Y_{ij}-\bar{Y}_{i})^{2}$.

The different parameters of interest that we consider are $m$, $\sigma_{\alpha}^{2}/\sigma^{2}$ and $\sigma^{2}$. The common mean $m$ is of great relevance in meta analysis (cf. Morris and Normand, 1992). Ye (1994) pointed out that the variance ratio $\sigma_{\alpha}^{2}/\sigma^{2}$ is of considerable interest in genetic studies. The parameter is also of importance to animal breeders, psychologists and others. Datta and Ghosh (1995d) have discussed the importance of $\sigma^{2}$, the error variance. In order to find reference priors for each one of these parameters, we first make the one-to-one transformation from $(m,\sigma_{\alpha}^{2},\sigma^{2})$ to $(m,r,u)$, where $r=\sigma^{-2}$ and $u=\sigma^{2}/(n\sigma_{\alpha}^{2}+\sigma^{2})$. Thus, $\sigma_{\alpha}^{2}/\sigma^{2}=(1-u)/(nu)$, and the likelihood $L(m,r,u)$ can be expressed as

\[
L(m,r,u)=r^{nk/2}u^{k/2}\exp[-(r/2)\{nku(\bar{Y}-m)^{2}+uT+S\}].
\]

Then the Fisher information matrix simplifies to $I(m,r,u)=k\operatorname{Diag}(nru,n/(2r^{2}),1/(2u^{2}))$. From Theorem 1 of Datta and Ghosh (1995c), it follows now that when $m$, $r$ and $u$ are the respective parameters of interest, while the other two are nuisance parameters, the reference priors are given respectively by $\pi_{1R}(m,r,u)=1$, $\pi_{2R}(m,r,u)=r^{-1}$ and $\pi_{3R}(m,r,u)=u^{-1}$.

3.2 General Divergence Priors

Next, back to the one-parameter case, we consider the more general distance (Amari, 1982; Cressie and Read, 1984)

\begin{equation}
D^{\pi}=\biggl[1-E\biggl\{\frac{\pi(\theta|X)}{\pi(\theta)}\biggr\}^{-\beta}\biggr]\Big/\{\beta(1-\beta)\},\quad\beta<1, \tag{15}
\end{equation}

which is to be interpreted as its limit when $\beta\rightarrow0$. This limit is the K–L distance as considered in Bernardo (1979). Also, $\beta=1/2$ gives the Bhattacharyya–Hellinger (Bhattacharyya, 1943; Hellinger, 1909) distance, and $\beta=-1$ leads to the chi-square distance (Clarke and Sun, 1997, 1999). In order to maximize $D^{\pi}$ with respect to a prior $\pi$, one re-expresses (15) as

\begin{eqnarray}
D^{\pi}&=&\biggl[1-\int\!\!\int\pi^{\beta+1}(\theta)\pi^{-\beta}(\theta|X)L_{n}(\theta)\,dX\,d\theta\biggr]\Big/\{\beta(1-\beta)\}\nonumber\\
&=&\biggl[1-\int\pi^{\beta+1}(\theta)E[\{\pi^{-\beta}(\theta|X)\}|\theta]\,d\theta\biggr]\Big/\{\beta(1-\beta)\}. \tag{16}
\end{eqnarray}

Hence, from (16), maximization of $D^{\pi}$ amounts to minimization (maximization) of

\begin{equation}
\int\pi^{\beta+1}(\theta)E[\{\pi^{-\beta}(\theta|X)\}|\theta]\,d\theta \tag{17}
\end{equation}

for $0<\beta<1$ ($\beta<0$). First consider the case $0<|\beta|<1$. From Theorem 1, the posterior of $\theta$ is

\begin{equation}
\pi(\theta|X)=\frac{\sqrt{n}\hat{I}_{n}^{1/2}}{(2\pi)^{1/2}}\exp\biggl[-\frac{n}{2}(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\biggr][1+O_{p}(n^{-1/2})]. \tag{18}
\end{equation}

Thus,

\begin{eqnarray}
\pi^{-\beta}(\theta|X)&=&n^{-\beta/2}(2\pi)^{\beta/2}\hat{I}_{n}^{-\beta/2}\nonumber\\
&&{}\cdot\exp\biggl[\frac{n\beta}{2}(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\biggr][1+O_{p}(n^{-1/2})]. \tag{19}
\end{eqnarray}

Following the shrinkage argument, and noting that, conditional on $\theta$, $\hat{I}_{n}\stackrel{p}{\rightarrow}I(\theta)$, while $n(\theta-\hat{\theta}_{n})^{2}\hat{I}_{n}\stackrel{d}{\rightarrow}\chi_{1}^{2}$, it follows heuristically from (19) that

\begin{eqnarray}
E[\pi^{-\beta}(\theta|X)]&=&n^{-\beta/2}(2\pi)^{\beta/2}[I(\theta)]^{-\beta/2}(1-\beta)^{-1/2}\nonumber\\
&&{}\cdot[1+O_{p}(n^{-1/2})]. \tag{20}
\end{eqnarray}

Hence, from (20), considering only the leading term, for $0<\beta<1$, minimization of (17) with respect to $\pi$ amounts to minimization of $\int[\pi(\theta)/I^{1/2}(\theta)]^{\beta}\pi(\theta)\,d\theta$ with respect to $\pi$ subject to $\int\pi(\theta)\,d\theta=1$. A simple application of Hölder’s inequality shows that this minimization takes place when $\pi(\theta)\propto I^{1/2}(\theta)$. Similarly, for $-1<\beta<0$, $\pi(\theta)\propto I^{1/2}(\theta)$ provides the desired maximization of the expected distance between the prior and the posterior. The K–L distance, that is, the case $\beta\rightarrow0$, has already been considered earlier.
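A small numerical illustration of the Hölder step may be helpful. The Python sketch below (our construction) evaluates $\int[\pi(\theta)/I^{1/2}(\theta)]^{\beta}\pi(\theta)\,d\theta$ with $\beta=1/2$ for the Bernoulli model, where $I(\theta)=1/\{\theta(1-\theta)\}$, over a few proper Beta priors; the normalized Jeffreys prior Beta$(1/2,1/2)$ gives the smallest value.

import numpy as np
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

# Numerical sketch of the Holder-inequality step for the Bernoulli model, beta = 1/2.
beta_exp = 0.5
I_half = lambda t: 1.0 / np.sqrt(t * (1 - t))        # I^{1/2}(theta) for the Bernoulli model

def criterion(a, b):
    pi = lambda t: beta_dist.pdf(t, a, b)
    return quad(lambda t: (pi(t) / I_half(t))**beta_exp * pi(t), 0.0, 1.0)[0]

print("Jeffreys Beta(1/2,1/2):", criterion(0.5, 0.5))
print("Uniform  Beta(1,1)    :", criterion(1.0, 1.0))
print("Beta(2,2)             :", criterion(2.0, 2.0))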

Remark 2.

Equation (20) also holds for $\beta<-1$. However, in this case, it is shown in Ghosh, Mergel and Liu (2011) that the integral $\int\{\pi(\theta)/I^{1/2}(\theta)\}^{-\beta}\pi(\theta)\,d\theta$ is uniquely minimized with respect to $\pi(\theta)\propto I^{1/2}(\theta)$, and there exists no maximizer of this integral when $\int\pi(\theta)\,d\theta=1$. Thus, in this case, there does not exist any prior which maximizes the posterior–prior distance.

Remark 3.

Surprisingly, Jeffreys’ prior is not necessarily the solution when $\beta=-1$ (the chi-square divergence). In this case, the first-order asymptotics does not work since $\pi^{\beta+1}(\theta)=1$ for all $\theta$. However, retaining also the $O_{p}(n^{-1})$ term as given in Theorem 1, Ghosh, Mergel and Liu (2011) have found in this case the solution $\pi(\theta)\propto\exp[\int^{\theta}\frac{2g_{3}(t)-I^{\prime}(t)}{4I(t)}\,dt]$, where $g_{3}(t)=E[-\frac{d^{3}\log p(X_{1}|t)}{dt^{3}}|t]$. We shall refer to this prior as $\pi_{\mathrm{GML}}(\theta)$. We will show by examples that this prior may differ from Jeffreys’ prior. But first we will establish a hitherto unknown invariance property of this prior under one-to-one reparameterization.

Theorem 2

Suppose that $\phi$ is a one-to-one, twice differentiable function of $\theta$. Then $\pi_{\mathrm{GML}}(\phi)=C\pi_{\mathrm{GML}}(\theta)|\frac{d\theta}{d\phi}|$, where $C(>0)$, the constant of proportionality, does not involve any parameters.

Proof.

Without loss of generality, assume that $\phi$ is a nondecreasing function of $\theta$. By the identity

\[
g_{3}(\phi)=I^{\prime}(\phi)+E\biggl[\biggl(\frac{d^{2}\log f}{d\phi^{2}}\biggr)\biggl(\frac{d\log f}{d\phi}\biggr)\biggr],
\]

$\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)$ reduces to

\begin{equation}
\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)=\frac{I^{\prime}(\phi)+2E[(d^{2}\log f/d\phi^{2})(d\log f/d\phi)]}{4I(\phi)}. \tag{21}
\end{equation}

Next, from the relation $I(\phi)=I(\theta)(d\theta/d\phi)^{2}$, one gets the identities

\begin{eqnarray}
I^{\prime}(\phi)&=&I^{\prime}(\theta)\biggl(\frac{d\theta}{d\phi}\biggr)^{3}+2I(\theta)(d\theta/d\phi)(d^{2}\theta/d\phi^{2}); \tag{22}\\
\biggl(\frac{d^{2}\log f}{d\phi^{2}}\biggr)\biggl(\frac{d\log f}{d\phi}\biggr)&=&\biggl\{\frac{d^{2}\log f}{d\theta^{2}}\biggl(\frac{d\theta}{d\phi}\biggr)^{2}+\frac{d\log f}{d\theta}\cdot\frac{d^{2}\theta}{d\phi^{2}}\biggr\}\cdot\biggl(\frac{d\log f}{d\theta}\cdot\frac{d\theta}{d\phi}\biggr). \tag{23}
\end{eqnarray}

From (21)–(23), one gets, after simplification,

\begin{equation}
\pi_{\mathrm{GML}}^{\prime}(\phi)/\pi_{\mathrm{GML}}(\phi)=\frac{\pi_{\mathrm{GML}}^{\prime}(\theta)}{\pi_{\mathrm{GML}}(\theta)}\frac{d\theta}{d\phi}+\frac{d^{2}\theta/d\phi^{2}}{d\theta/d\phi}. \tag{24}
\end{equation}

Now, on integration, it follows from (24) that $\pi_{\mathrm{GML}}(\phi)=C\pi_{\mathrm{GML}}(\theta)(d\theta/d\phi)$, which proves the theorem.

Example 6.

Consider the one-parameter exponential family of distributions with $p(X|\theta)=\exp[\theta X-\psi(\theta)+h(X)]$. Then $g_{3}(\theta)=I^{\prime}(\theta)$, so that $\pi(\theta)\propto\exp[\frac{1}{4}\int\frac{I^{\prime}(\theta)}{I(\theta)}\,d\theta]=I^{1/4}(\theta)$, which is different from Jeffreys’ $I^{1/2}(\theta)$ prior. Because of the invariance result proved in Theorem 2, in particular, for the $\operatorname{Binomial}(n,p)$ problem, noting that $p=\exp(\theta)/[1+\exp(\theta)]$, one gets $\pi_{\mathrm{GML}}(p)\propto p^{-3/4}(1-p)^{-3/4}$, which is a $\operatorname{Beta}(\frac{1}{4},\frac{1}{4})$ prior, different from Jeffreys’ $\operatorname{Beta}(\frac{1}{2},\frac{1}{2})$ prior, Laplace’s $\operatorname{Beta}(1,1)$ prior or Haldane’s improper $\operatorname{Beta}(0,0)$ prior. Similarly, for the Poisson($\lambda$) case, one gets $\pi_{\mathrm{GML}}(\lambda)\propto\lambda^{-3/4}$, again different from Jeffreys’ $\pi_{J}(\lambda)\propto\lambda^{-1/2}$ prior. However, for the N($\theta,1$) distribution, since $I(\theta)=1$ and $g_{3}(\theta)=I^{\prime}(\theta)=0$, $\pi_{\mathrm{GML}}(\theta)=c(>0)$, a constant, which is the same as Jeffreys’ prior. It may also be pointed out that for the one-parameter exponential family, under the chi-square divergence, $\pi_{\mathrm{GML}}$ differs from Hartigan’s (1998) maximum likelihood prior $\pi_{H}(\theta)=I(\theta)$.
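The Beta$(\frac{1}{4},\frac{1}{4})$ form can also be obtained by applying the defining formula for $\pi_{\mathrm{GML}}$ directly in the $p$ parameterization, which doubles as a check of the invariance result. A SymPy sketch of our own:

import sympy as sp

# Sketch: compute pi_GML for a single Bernoulli(p) observation directly in the p parameterization.
# The claim is that the integrand (2 g3 - I')/(4 I) integrates to log[(p(1-p))^{-3/4}].
p, x = sp.symbols('p x', positive=True)
logf = x*sp.log(p) + (1 - x)*sp.log(1 - p)

Ex = lambda expr: expr.subs(x, p)                  # E[X] = p; the expressions below are linear in x
I  = sp.simplify(Ex(-sp.diff(logf, p, 2)))         # Fisher information, 1/(p*(1 - p))
g3 = sp.simplify(Ex(-sp.diff(logf, p, 3)))         # g3(p) = -E[d^3 log f / dp^3]

integrand = sp.simplify((2*g3 - sp.diff(I, p)) / (4*I))
target = sp.diff(-sp.Rational(3, 4)*sp.log(p*(1 - p)), p)
print(sp.simplify(integrand - target))             # expect 0, confirming the Beta(1/4, 1/4) form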

Example 7.

For the one-parameter location family of distributions with $p(X|\theta)=f(X-\theta)$, where $f$ is a p.d.f., both $g_{3}(\theta)$ and $I(\theta)$ are constants, implying $I^{\prime}(\theta)=0$. Hence, $\pi_{\mathrm{GML}}(\theta)$ is of the form $\pi_{\mathrm{GML}}(\theta)=\exp(k\theta)$ for some constant $k$. However, for the special case of a symmetric $f$, that is, $f(X)=f(-X)$ for all $X$, $g_{3}(\theta)=0$, and then $\pi_{\mathrm{GML}}(\theta)$ reduces once again to $\pi(\theta)=c$, which is the same as Jeffreys’ prior.

Example 8.

For the general scale family of distributions with $p(X|\theta)=\theta^{-1}f(\frac{X}{\theta})$, $\theta>0$, where $f$ is a p.d.f., $I(\theta)=\frac{c_{1}}{\theta^{2}}$ for some constant $c_{1}(>0)$, while $g_{3}(\theta)=\frac{c_{2}}{\theta^{3}}$ for some constant $c_{2}$. Then $\pi_{\mathrm{GML}}(\theta)\propto\exp(c\log\theta)=\theta^{c}$ for some constant $c$. In particular, when $p(X|\theta)=\theta^{-1}\exp(-\frac{X}{\theta})$, $\pi_{\mathrm{GML}}(\theta)\propto\theta^{-3/2}$, different from Jeffreys’ $\pi_{J}(\theta)\propto\theta^{-1}$ for the general scale family of distributions.
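The $\theta^{-3/2}$ claim for the exponential scale model can be checked in the same way; the SymPy sketch below (ours) evaluates $I(\theta)$, $g_{3}(\theta)$ and the integrand in the definition of $\pi_{\mathrm{GML}}$.

import sympy as sp

# Sketch: pi_GML for the exponential scale model p(x|theta) = theta^{-1} exp(-x/theta).
theta = sp.symbols('theta', positive=True)
x = sp.symbols('x', positive=True)
logf = -sp.log(theta) - x/theta

def E(expr):
    # expectation under the Exponential density with scale theta
    return sp.simplify(sp.integrate(expr*sp.exp(-x/theta)/theta, (x, 0, sp.oo)))

I  = E(-sp.diff(logf, theta, 2))                          # expect 1/theta^2
g3 = E(-sp.diff(logf, theta, 3))                          # expect -4/theta^3
print(sp.simplify((2*g3 - sp.diff(I, theta)) / (4*I)))    # expect -3/(2*theta), giving theta^{-3/2}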

The multiparameter extension of the general divergence prior has been explored in the Ph.D. dissertation of Liu (2009). Among other things, he has shown that in the absence of any nuisance parameters, for $|\beta|<1$, the divergence prior is Jeffreys’ prior. However, on the boundary, namely, $\beta=-1$, priors other than Jeffreys’ prior emerge.

4 Probability Matching Priors

4.1 Motivation and First-Order Matching

As mentioned in the Introduction, probability matching priors are intended to achieve a Bayes–frequentist synthesis. Specifically, these priors are required to make the asymptotic coverage probabilities of Bayesian credible intervals agree with those of the corresponding frequentist counterparts. Over the years, there have been several versions of such priors: quantile matching priors, matching priors for distribution functions, HPD matching priors and matching priors associated with likelihood ratio statistics. Datta and Mukerjee (2004) provided a detailed account of all these priors. In this article I will be concerned only with quantile matching priors.

A general definition of quantile matching priors is as follows. Suppose $X_{1},\dots,X_{n}|\theta$ are i.i.d. with common p.d.f. $f(X|\theta)$, where $\theta$ is a real-valued parameter. Assume all the regularity conditions needed for the asymptotic expansion of the posterior around $\hat{\theta}_{n}$, the MLE of $\theta$. We continue with the notation of the previous section. For $0<\alpha<1$, let $\theta^{\pi}_{1-\alpha}(X_{1},\dots,X_{n})\equiv\theta^{\pi}_{1-\alpha}$ denote the $(1-\alpha)$th asymptotic posterior quantile of $\theta$ based on the prior $\pi$, that is,

\begin{equation}
P^{\pi}[\theta\leq\theta^{\pi}_{1-\alpha}|X_{1},\dots,X_{n}]=1-\alpha+O_{p}(n^{-r}) \tag{25}
\end{equation}

for some $r>0$. If now $P[\theta\leq\theta^{\pi}_{1-\alpha}|\theta]=1-\alpha+O(n^{-r})$, then some order of probability matching is achieved. If $r=1$, we call $\pi$ a first-order probability matching prior. If $r=3/2$, we call $\pi$ a second-order probability matching prior.

We first provide an intuitive argument for why Jeffreys’ prior is a first-order probability matching prior in the absence of nuisance parameters. If $X_{1},\dots,X_{n}|\theta$ are i.i.d. N($\theta,1$) and $\pi(\theta)=1$, $-\infty<\theta<\infty$, then the posterior $\pi(\theta|X_{1},\dots,X_{n})$ is N($\bar{X}_{n},n^{-1}$). Now, writing $z_{1-\alpha}$ for the $100(1-\alpha)\%$ quantile of the N($0,1$) distribution, one gets

\begin{eqnarray}
P\bigl[\sqrt{n}(\theta-\bar{X}_{n})\leq z_{1-\alpha}|X_{1},\dots,X_{n}\bigr]&=&1-\alpha\nonumber\\
&=&P\bigl[\sqrt{n}(\bar{X}_{n}-\theta)\geq-z_{1-\alpha}|\theta\bigr], \tag{26}
\end{eqnarray}

so that the one-sided credible interval $(-\infty,\bar{X}_{n}+z_{1-\alpha}/\sqrt{n}]$ for $\theta$ has exact frequentist coverage probability $1-\alpha$.

The above exact matching does not always hold. However, if $X_{1},\dots,X_{n}|\theta$ are i.i.d., then $\hat{\theta}_{n}|\theta$ is asymptotically N($\theta,(nI(\theta))^{-1}$). Then, by the delta method, $g(\hat{\theta}_{n})|\theta\sim$ N$[g(\theta),(g^{\prime}(\theta))^{2}(nI(\theta))^{-1}]$. So if $g^{\prime}(\theta)=I^{1/2}(\theta)$, so that $g(\theta)=\int^{\theta}I^{1/2}(t)\,dt$, then $\sqrt{n}[g(\hat{\theta}_{n})-g(\theta)]|\theta$ is asymptotically N($0,1$). Hence, arguing as in (26), with the uniform prior $\pi(\phi)=1$ for $\phi=g(\theta)$, coverage matching is asymptotically achieved for $\phi$. This leads to the prior $\pi(\theta)=\frac{d\phi}{d\theta}=g^{\prime}(\theta)=I^{1/2}(\theta)$ for $\theta$.
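Coverage matching of this kind is easy to examine by simulation. The Python sketch below (our example) takes $X_{1},\dots,X_{n}$ i.i.d. Exponential with rate $\lambda$ and Jeffreys’ prior $\pi(\lambda)\propto\lambda^{-1}$, under which the posterior is Gamma with shape $n$ and rate $\sum X_{i}$; the frequentist coverage of the one-sided 95% credible bound comes out essentially equal to 0.95 (for this scale model the matching is in fact exact, a point taken up again in Section 5).

import numpy as np
from scipy.stats import gamma

# Monte Carlo sketch of quantile matching for an Exponential(rate lambda) model with Jeffreys' prior.
rng = np.random.default_rng(3)
lam, n, alpha, reps = 2.0, 10, 0.05, 100_000

x_sum = rng.gamma(shape=n, scale=1.0 / lam, size=reps)     # sufficient statistic: sum of the X_i
upper = gamma.ppf(1 - alpha, a=n, scale=1.0 / x_sum)       # posterior (1 - alpha) quantile of lambda
print(np.mean(lam <= upper))                               # expect approximately 0.95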

Datta and Mukerjee (2004, pages 14–21) proved the result in a formal manner. They used the two basic tools of Section 3. In the absence of nuisance parameters, they showed that a first-order matching prior for $\theta$ is a solution of the differential equation

\begin{equation}
\frac{d}{d\theta}\bigl(\pi(\theta)I^{-1/2}(\theta)\bigr)=0, \tag{27}
\end{equation}

so that Jeffreys’ prior is the unique first-order matching prior. However, it does not always satisfy the second-order matching property.

4.2 Second-Order Matching

In order that the matching be accomplished up to $O(n^{-3/2})$ (second-order matching), one needs an asymptotic expansion of the posterior distribution function up to the $O(n^{-1})$ term, and one must set up a second differential equation in addition to (27). This equation is given by (cf. Mukerjee and Dey, 1993; Mukerjee and Ghosh, 1997)

\begin{equation}
\frac{1}{3}\frac{d}{d\theta}[\pi(\theta)I^{-2}(\theta)g_{3}(\theta)]+\frac{d^{2}}{d\theta^{2}}[\pi(\theta)I^{-1}(\theta)]=0, \tag{28}
\end{equation}

where, as before, $g_{3}(\theta)=-E[\frac{d^{3}\log f(X|\theta)}{d\theta^{3}}|\theta]$. If Jeffreys’ prior satisfies (28), then it is the unique second-order matching prior. While for the location and scale families of distributions this is indeed the case, it is not true in general. Of course, in such an instance, there does not exist any second-order matching prior.

To see this, for $\pi_{J}(\theta)=I^{1/2}(\theta)$, (28) reduces to

\[
\frac{1}{3}\frac{d}{d\theta}[I^{-3/2}(\theta)g_{3}(\theta)]+\frac{d^{2}}{d\theta^{2}}[I^{-1/2}(\theta)]=0,
\]

which requires $\frac{1}{3}I^{-3/2}(\theta)g_{3}(\theta)+\frac{d}{d\theta}(I^{-1/2}(\theta))$ to be a constant free of $\theta$. After some algebra, the above expression simplifies to $(1/6)E[(\frac{d\log f}{d\theta})^{3}|\theta]/I^{3/2}(\theta)$. It is easy to check now that for the one-parameter location and scale families of distributions, the above expression does not depend on $\theta$. However, for the one-parameter exponential family of distributions with canonical parameter $\theta$, the same holds if and only if $I^{\prime}(\theta)/I^{3/2}(\theta)$ does not depend on $\theta$, that is, if and only if $I^{-1/2}(\theta)$ is a linear function of $\theta$. Another interesting example is given below.

Example 9.

Let $(X_{1},X_{2})^{T}\sim\mathrm{N}_{2}\bigl[\bigl(\begin{smallmatrix}0\\ 0\end{smallmatrix}\bigr),\bigl(\begin{smallmatrix}1&\rho\\ \rho&1\end{smallmatrix}\bigr)\bigr]$. One can verify that $I(\rho)=(1+\rho^{2})/(1-\rho^{2})^{2}$ and that $L_{1,1,1}\equiv E[(\partial\log f/\partial\rho)^{3}|\rho]=-\frac{2\rho(3+\rho^{2})}{(1-\rho^{2})^{3}}$, so that $L_{1,1,1}/I^{3/2}(\rho)$ is not a constant. Hence, $\pi_{J}$ is not a second-order matching prior, and there does not exist any second-order matching prior in this example.

4.3 First-Order Quantile Matching Priors in the Presence of Nuisance Parameters

The parameter of interest is still real-valued, but there may be one or more nuisance parameters. To fix ideas, suppose $\theta=(\theta_{1},\dots,\theta_{p})$, where $\theta_{1}$ is the parameter of interest, while $\theta_{2},\dots,\theta_{p}$ are the nuisance parameters. As shown by Welch and Peers (1963), and later more rigorously by Datta and Ghosh (1995a) and Datta (1996), writing $I^{-1}=((I^{jk}))$, the probability matching equation is given by

\begin{equation}
\sum^{p}_{j=1}\frac{\partial}{\partial\theta_{j}}\{\pi(\theta)I^{j1}(I^{11})^{-1/2}\}=0. \tag{29}
\end{equation}
Example 1 (Continued).

First consider $\mu$ as the parameter of interest, and $\sigma$ the nuisance parameter. Since each element of the inverse of the Fisher information matrix is a constant multiple of $\sigma^{2}$, any prior $\pi(\mu,\sigma)\propto g(\sigma)$, $g$ arbitrary, satisfies (29). Conversely, when $\sigma$ is the parameter of interest, and $\mu$ is the nuisance parameter, any prior $\pi(\mu,\sigma)\propto\sigma^{-1}g(\mu)$ satisfies (29).

A special case considered in Tibshirani (1989) is of interest. Here \theta_{1} is orthogonal to (\theta_{2},\dots,\theta_{p}) in the Fisherian sense, that is, I^{j1}=0 for j=2,3,\dots,p.

With orthogonality, (29) simplifies to

\frac{\partial}{\partial\theta_{1}}\{\pi(\theta)I^{-1/2}_{11}\}=0

(since I^{11}=I^{-1}_{11}). This leads to \pi(\theta)=I^{1/2}_{11}h(\theta_{2},\dots,\theta_{p}), where h is arbitrary. Often a second-order matching prior removes the arbitrariness of h; we will see an example later in this section. However, this need not always be the case: as seen earlier in the one-parameter case, second-order matching priors may not always exist. We will address this issue later in this section.

A special choice is h\equiv 1. The resultant prior \pi(\theta)=I_{11}^{1/2} has some intuitive appeal: under orthogonality, \sqrt{n}(\hat{\theta}_{1n}-\theta_{1}) given \theta is asymptotically \mathrm{N}(0,I^{-1}_{11}(\theta)), so one may expect I^{1/2}_{11}(\theta) to be a first-order probability matching prior. This prior is, however, only one member of the class of priors \pi(\theta)=I^{1/2}_{11}h(\theta_{2},\dots,\theta_{p}) found by Tibshirani (1989), and admittedly need not be second-order matching even when the latter exists. A recent article by Staicu and Reid (2008) has proved some interesting properties of the prior \pi(\theta)=I^{1/2}_{11}(\theta). This prior is also considered in Ghosh and Mukerjee (1992).

For a symmetric location–scale family of distributions, that is, when f(x)=f(-x), we have c_{2}=0, so that \mu and \sigma are orthogonal. Now, when \mu is the parameter of interest and \sigma is the nuisance parameter, the class of first-order matching priors consists of the priors \pi_{1}(\mu,\sigma)=h_{1}(\sigma), where h_{1} is arbitrary. Similarly, when \sigma is the parameter of interest and \mu is the nuisance parameter, the class of first-order matching priors consists of the priors \pi_{2}(\mu,\sigma)=\sigma^{-1}h_{2}(\mu), where h_{2} is arbitrary. The intersection of the two classes leads again to the unique prior \pi(\mu,\sigma)\propto\sigma^{-1}.

Example 2 ((Continued)).

Let X_{1},\dots,X_{n}|\mu,\sigma be i.i.d. \mathrm{N}(\mu,\sigma^{2}), where \theta=\mu/\sigma is again the parameter of interest. In order to find a parameter \phi which is orthogonal to \theta, we rewrite the p.d.f. in the form

f(X|\theta,\sigma)=(2\pi\sigma^{2})^{-1/2}\exp\biggl[-\frac{1}{2\sigma^{2}}(X-\theta\sigma)^{2}\biggr]. (30)

Then the Fisher information matrix is

I(\theta,\sigma)=\pmatrix{1&\theta/\sigma\cr\theta/\sigma&\sigma^{-2}(\theta^{2}+2)}.

It turns out that if we reparameterize from (\theta,\sigma) to (\theta,\phi), where \phi=\sigma(\theta^{2}+2)^{1/2}, then \theta and \phi are orthogonal, with the corresponding Fisher information matrix given by I(\theta,\phi)=\operatorname{Diag}[2(\theta^{2}+2)^{-1},\phi^{-2}(\theta^{2}+2)]. Hence, the class of first-order matching priors when \theta is the parameter of interest is given by \pi(\theta,\phi)=(\theta^{2}+2)^{-1/2}h(\phi), where h is arbitrary.
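
The orthogonality of \theta and \phi can also be checked symbolically. The sketch below (my own, assuming sympy is available) transforms the information matrix in (\theta,\sigma) by the Jacobian of the reparameterization and recovers the stated diagonal form.

import sympy as sp

theta, sigma, phi = sp.symbols('theta sigma phi', positive=True)

# Fisher information in the (theta, sigma) parameterization
I_old = sp.Matrix([[1, theta/sigma],
                   [theta/sigma, (theta**2 + 2)/sigma**2]])

# sigma expressed through the candidate orthogonal parameter phi
sigma_of = phi / sp.sqrt(theta**2 + 2)

# Jacobian of (theta, sigma) with respect to (theta, phi)
J = sp.Matrix([[1, 0],
               [sp.diff(sigma_of, theta), sp.diff(sigma_of, phi)]])

I_new = sp.simplify(J.T * I_old.subs(sigma, sigma_of) * J)
print(I_new)   # Diag[2/(theta^2 + 2), (theta^2 + 2)/phi^2]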

4.4 Second-Order Quantile Matching Priors in the Presence of Nuisance Parameters

When \theta_{1} is the parameter of interest and (\theta_{2},\ldots,\theta_{p}) is the vector of nuisance parameters, the general class of second-order quantile matching priors is characterized in (2.4.11) and (2.4.12) of Datta and Mukerjee (2004, page 12). For simplicity, we consider only the case when \theta_{1} is orthogonal to (\theta_{2},\ldots,\theta_{p}). In this case a first-order quantile matching prior \pi(\theta_{1},\theta_{2},\ldots,\theta_{p})\propto I_{11}^{1/2}(\theta)h(\theta_{2},\ldots,\theta_{p}) is also second-order matching if and only if h satisfies the differential equation (cf. Datta and Mukerjee, 2004, page 27)

\sum_{s=2}^{p}\sum_{u=2}^{p}\frac{\partial}{\partial\theta_{u}}\biggl\{I_{11}^{-1/2}I^{su}E\biggl(\frac{\partial^{3}\log f}{\partial\theta_{1}^{2}\,\partial\theta_{s}}\Bigl|\theta\biggr)h\biggr\}+\frac{h}{6}\frac{\partial}{\partial\theta_{1}}\biggl\{I_{11}^{-3/2}E\biggl(\biggl(\frac{\partial\log f}{\partial\theta_{1}}\biggr)^{3}\Bigl|\theta\biggr)\biggr\}=0. (31)

We revisit Examples 1–5 and provide complete, or at least partial, characterization of second-order quantile matching priors.

Example 1 ((Continued)).

Let f be symmetric so that \mu and \sigma are orthogonal. First let \mu be the parameter of interest and \sigma the nuisance parameter. Since both terms in (31) are then zero, every first-order quantile matching prior of the form \sigma^{-1}h(\sigma)=q(\sigma), say, is also second-order matching. In other words, any prior \pi(\mu,\sigma) that depends on \sigma alone is second-order matching. On the other hand, if \sigma is the parameter of interest and \mu is the nuisance parameter, then, since the second term in (31) is zero, a first-order quantile matching prior of the form \sigma^{-1}h(\mu) is also second-order matching if and only if h(\mu) is a constant. Thus, the unique second-order quantile matching prior in this case is proportional to \sigma^{-1}, which is Jeffreys' independence prior.

Example 2 ((Continued)).

Recall that in this case, writing \theta=\mu/\sigma and \phi=\sigma(\theta^{2}+2)^{1/2}, the Fisher information matrix is I(\theta,\phi)=\operatorname{Diag}[2(\theta^{2}+2)^{-1},\phi^{-2}(\theta^{2}+2)]. Also, E[(\frac{\partial\log f}{\partial\theta})^{3}|\theta,\phi]=-\frac{8\theta(\theta^{2}+3)}{(\theta^{2}+2)^{3}} and E(\frac{\partial^{3}\log f}{\partial\theta^{2}\,\partial\phi}|\theta,\phi)=(4/\phi)(\theta^{2}+2)^{-2}. Hence, (31) holds if and only if h(\phi) is a constant. This leads to the unique second-order quantile matching prior \pi(\theta,\phi)\propto(\theta^{2}+2)^{-1/2}. Transforming back to the original (\mu,\sigma) parameterization, this is the prior \pi(\mu,\sigma)\propto\sigma^{-1}, Jeffreys' independence prior.
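
Although the matching here is only asymptotic, it is already quite accurate for small samples. The following Monte Carlo sketch (my own; the sample size, true values and replicate counts are arbitrary choices) checks the frequentist coverage of one-sided posterior quantiles of \theta=\mu/\sigma under \pi(\mu,\sigma)\propto\sigma^{-1}, using the standard form of the normal posterior under that prior.

import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true, n, alpha = 1.0, 2.0, 10, 0.05
n_rep, n_post = 5000, 4000
theta_true = mu_true / sigma_true

cover = 0
for _ in range(n_rep):
    x = rng.normal(mu_true, sigma_true, n)
    xbar, s2 = x.mean(), x.var(ddof=1)
    # under pi(mu, sigma) ∝ 1/sigma:
    # sigma^2 | x ~ (n-1) S^2 / chi^2_{n-1},  mu | sigma^2, x ~ N(xbar, sigma^2 / n)
    sig2 = (n - 1) * s2 / rng.chisquare(n - 1, n_post)
    mu = rng.normal(xbar, np.sqrt(sig2 / n))
    theta_draws = mu / np.sqrt(sig2)
    cover += (theta_true <= np.quantile(theta_draws, 1 - alpha))

print(cover / n_rep)   # close to 0.95 even for n = 10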

Example 3 ((Continued)).

Consider once again the Neyman–Scott example. Since the Fisher information matrix is I(\theta_{1},\ldots,\theta_{n},\sigma^{2})=\operatorname{Diag}(k\sigma^{-2},\ldots,k\sigma^{-2},nk\sigma^{-4}/2), \sigma^{2} is orthogonal to (\theta_{1},\ldots,\theta_{n}). Now, the class of first-order matching priors is given by \sigma^{-2}h(\theta_{1},\ldots,\theta_{n}), where h is arbitrary. Simple algebra shows that in this case both the first and second terms in (31) are zero, so that every first-order quantile matching prior is also second-order matching.

Example 4 ((Continued)).

From Tibshirani (1989), it follows that the class of first-order quantile matching priors for \theta is of the form (1+\theta^{2})^{-1}h(\phi), where h is arbitrary. Once again, since both the first and second terms in (31) are zero, every first-order quantile matching prior is also second-order matching.

Example 5 ((Continued)).

Again from Tibshirani (1989), the classes of first-order matching priors when m, r and u are the parameters of interest are given respectively by h_{1}(r,u), r^{-1}h_{2}(m,u) and u^{-1}h_{3}(m,r), where h_{1}, h_{2} and h_{3} are arbitrary nonnegative functions. The prior \pi_{S}(r,u)\propto(ru)^{-3/2} is second-order matching when m is the parameter of interest, while any first-order matching prior is also second-order matching when either r or u is the parameter of interest.

It may be of interest to find an example where a reference prior is not a second-order matching prior. Consider the gamma p.d.f. f(x|\mu,\lambda)=(\lambda^{\lambda}/\Gamma(\lambda))\exp[-\lambda x/\mu]x^{\lambda-1}\mu^{-\lambda}, where the mean \mu is the parameter of interest. The Fisher information matrix is \operatorname{Diag}(\lambda\mu^{-2},\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-1/\lambda). The two-group reference prior of Bernardo (1979) is then given by \mu^{-1}[\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-(1/\lambda)]^{1/2}, while the unique second-order quantile matching prior is given by \lambda\mu^{-1}[\frac{d^{2}\log\Gamma(\lambda)}{d\lambda^{2}}-(1/\lambda)].
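
As a quick check of the orthogonality underlying this example, the sketch below (my own, assuming sympy is available) recomputes the Fisher information matrix for the gamma density in the (\mu,\lambda) parameterization; the second derivatives of the log density are linear in x, so substituting E(X)=\mu yields the exact expectations.

import sympy as sp

x, mu, lam = sp.symbols('x mu lambda', positive=True)
logf = lam*sp.log(lam) - sp.log(sp.gamma(lam)) - lam*sp.log(mu) \
       + (lam - 1)*sp.log(x) - lam*x/mu

def info_entry(a, b):
    # E[- d^2 log f / (da db)]; substitute x -> mu since the derivative is linear in x
    return sp.simplify(-sp.diff(logf, a, b).subs(x, mu))

I = sp.Matrix([[info_entry(mu, mu), info_entry(mu, lam)],
               [info_entry(lam, mu), info_entry(lam, lam)]])
print(I)   # Diag[lambda/mu^2, polygamma(1, lambda) - 1/lambda]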

In some of these examples, especially for the location and location–scale families, one gets exact rather than merely asymptotic matching. This happens, in particular, when the matching prior is a right-invariant Haar prior. We will see some examples in the next section.

5 Other Priors

5.1 Invariant Priors

Very often objective priors are derived via some invariance criterion. We illustrate with the location–scale family of distributions.

Let X have p.d.f. p(x|\mu,\sigma)=\sigma^{-1}f((x-\mu)/\sigma), -\infty<\mu<\infty, 0<\sigma<\infty, where f is a p.d.f. Then, as found in Section 4, the Fisher information matrix is of the form I(\mu,\sigma)=\sigma^{-2}\pmatrix{c_{1}&c_{2}\cr c_{2}&c_{3}}. Hence, Jeffreys' general rule prior is \pi_{J}(\mu,\sigma)\propto\sigma^{-2}. This prior, as we will see in this section, corresponds to a left-invariant Haar prior. In contrast, Jeffreys' independence prior \pi_{I}(\mu,\sigma)\propto\sigma^{-1} corresponds to a right-invariant Haar prior.

In order to demonstrate this, consider the group of linear transformations G=\{g_{a,b}: -\infty<a<\infty, b>0\}, where g_{a,b}(x)=a+bx. The induced group of transformations on the parameter space will be denoted by \bar{G}=\{\bar{g}_{a,b}\}, where \bar{g}_{a,b}(\mu,\sigma)=(a+b\mu,b\sigma). The general theory of locally compact groups states that there exist two measures \eta_{1} and \eta_{2} on \bar{G} such that \eta_{1} is left-invariant and \eta_{2} is right-invariant. What this means is that for all \bar{g}\in\bar{G} and all subsets A of \bar{G}, \eta_{1}(\bar{g}A)=\eta_{1}(A) and \eta_{2}(A\bar{g})=\eta_{2}(A), where \bar{g}A=\{\bar{g}\bar{g}_{*}: \bar{g}_{*}\in A\} and A\bar{g}=\{\bar{g}_{*}\bar{g}: \bar{g}_{*}\in A\}. The measures \eta_{1} and \eta_{2} are referred to, respectively, as left- and right-invariant Haar measures. For the location–scale family of distributions, the left- and right-invariant Haar priors turn out to be \pi_{L}(\mu,\sigma)\propto\sigma^{-2} and \pi_{R}(\mu,\sigma)\propto\sigma^{-1}, respectively (cf. Berger, 1985, pages 406–407; Ghosh, Delampady and Samanta, 2006, pages 136–138).

The right-Haar prior usually enjoys more optimality properties than the left-Haar prior. Some optimality properties of left-Haar priors are given in Datta and Ghosh (1995b). In Example 1, for the location–scale family of distributions, the right-Haar prior is Bernardo’s reference prior when either μ\mu or σ\sigma is the parameter of interest, while the other parameter is the nuisance parameter. Also, it is shown in Datta, Ghosh and Mukerjee (2000) that for the location–scale family of distributions, the right-Haar prior yields exact matching of the coverage probabilities of Bayesian credible intervals and the corresponding frequentist confidence intervals when either μ\mu or σ\sigma is the parameter of interest, while the other parameter is the nuisance parameter.

For simplicity, we demonstrate this only for the normal example. Let X_{1},\ldots,X_{n}|\mu,\sigma^{2} be i.i.d. \mathrm{N}(\mu,\sigma^{2}), where n\geq 2. With the right-Haar prior \pi_{R}(\mu,\sigma)\propto\sigma^{-1}, the marginal posterior distribution of \mu is Student's t with location parameter \bar{X}=\sum_{i=1}^{n}X_{i}/n, scale parameter S/\sqrt{n}, where (n-1)S^{2}=\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}, and n-1 degrees of freedom. Hence, if \mu_{1-\alpha} denotes the 100(1-\alpha)th percentile of this marginal posterior, then

1-\alpha = P(\mu\leq\mu_{1-\alpha}|X_{1},\ldots,X_{n})
= P[\sqrt{n}(\mu-\bar{X})/S\leq\sqrt{n}(\mu_{1-\alpha}-\bar{X})/S\,|\,X_{1},\ldots,X_{n}]
= P[t_{n-1}\leq\sqrt{n}(\mu_{1-\alpha}-\bar{X})/S],

so that \sqrt{n}(\mu_{1-\alpha}-\bar{X})/S=t_{n-1,1-\alpha}, the 100(1-\alpha)th percentile of t_{n-1}. Now

P(\mu\leq\mu_{1-\alpha}|\mu,\sigma)=P[\sqrt{n}(\bar{X}-\mu)/S\geq-t_{n-1,1-\alpha}\,|\,\mu,\sigma]=1-\alpha=P(\mu\leq\mu_{1-\alpha}|X_{1},\ldots,X_{n}).

This provides the exact coverage matching probability for μ\mu.

Next, with the same setup, when \sigma^{2} is the parameter of interest, its marginal posterior is Inverse \operatorname{Gamma}((n-1)/2,(n-1)S^{2}/2). Now, if \sigma_{1-\alpha}^{2} denotes the 100(1-\alpha)th percentile of this marginal posterior, then \sigma_{1-\alpha}^{2}=(n-1)S^{2}/\chi_{n-1;\alpha}^{2}, where \chi_{n-1;\alpha}^{2} is the 100\alpha th percentile of the \chi_{n-1}^{2} distribution. Now

P(\sigma^{2}\leq\sigma_{1-\alpha}^{2}|\mu,\sigma)=P[(n-1)S^{2}/\sigma^{2}\geq\chi_{n-1;\alpha}^{2}\,|\,\mu,\sigma]=1-\alpha,

showing once again the exact coverage matching.
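
A short numerical sketch (my own, assuming numpy and scipy are available; the sample size and true value are arbitrary) confirms both that the posterior quantile of \sigma^{2} agrees with the chi-square expression above and that its frequentist coverage is exact.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, alpha, sigma_true = 8, 0.05, 1.5

# the 1 - alpha posterior quantile of sigma^2, computed two ways for one data set
x = rng.normal(0.0, sigma_true, n)
s2 = x.var(ddof=1)
q_invgamma = stats.invgamma.ppf(1 - alpha, a=(n - 1)/2, scale=(n - 1)*s2/2)
q_pivot = (n - 1)*s2 / stats.chi2.ppf(alpha, df=n - 1)
print(q_invgamma, q_pivot)        # identical

# frequentist coverage of the one-sided credible bound over repeated samples
chi_a = stats.chi2.ppf(alpha, df=n - 1)
s2_rep = rng.normal(0.0, sigma_true, (100000, n)).var(axis=1, ddof=1)
print(np.mean(sigma_true**2 <= (n - 1)*s2_rep/chi_a))   # approximately 0.95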

The general definition of a right-invariant Haar density on \bar{\mathcal{G}}, which we will denote by h_{r}, is that it must satisfy \int_{A\bar{g}}h_{r}(x)\,dx=\int_{A}h_{r}(x)\,dx for all \bar{g}\in\bar{\mathcal{G}}, where A\bar{g}=\{\bar{g}_{*}\bar{g}: \bar{g}_{*}\in A\}. Similarly, a left-invariant Haar density h_{l} on \bar{\mathcal{G}} must satisfy \int_{\bar{g}A}h_{l}(x)\,dx=\int_{A}h_{l}(x)\,dx, where \bar{g}A=\{\bar{g}\bar{g}_{*}: \bar{g}_{*}\in A\}. An alternative representation of the right- and left-Haar densities is P^{h_{r}}(A\bar{g})=P^{h_{r}}(A) and P^{h_{l}}(\bar{g}A)=P^{h_{l}}(A), respectively.

It is shown in Halmos (1950) and Nachbin (1965) that the right- and left-invariant Haar densities exist and are unique up to a multiplicative constant. Berger (1985) provides the calculation of h_{r} and h_{l} in a very general framework. He points out that if \bar{\mathcal{G}} is isomorphic to the parameter space \Theta, then one can construct right- and left-invariant Haar priors on \Theta. A very substantial account of invariant Haar densities is available in Datta and Ghosh (1995b). Severini, Mukerjee and Ghosh (2002) have demonstrated the exact matching property of right-invariant Haar densities in a prediction context under fairly general conditions.

5.2 Moment Matching Priors

Here we discuss a new matching criterion, which we will refer to as the “moment matching criterion.” For a regular family of distributions, the classical Bernstein–von Mises result (see, e.g., Ferguson, 1996, page 141; Ghosh, Delampady and Samanta, 2006, page 104) establishes the asymptotic normality of the posterior of a parameter vector, centered at the maximum likelihood estimator or the posterior mode, with variance equal to the inverse of the observed Fisher information matrix evaluated at the maximum likelihood estimator or the posterior mode. We utilize the same asymptotic expansion to find priors which provide a high order of matching between the posterior mean and the maximum likelihood estimator. For simplicity of exposition, we shall primarily confine ourselves to priors which achieve the matching of the first moment, although it is easy to see how higher order moment matching is equally possible.

The motivation for moment matching priors stems from several considerations. First, these priors lead to posterior means which share the asymptotic optimality of the MLE's up to a high order. In particular, if one is interested in asymptotic bias or MSE reduction of the MLE's through some adjustment, the same adjustment applies directly to the posterior means. In this way, it is possible to achieve a Bayes–frequentist synthesis of point estimates. A second important aspect of these priors is that they provide new viable alternatives to Jeffreys' prior, motivated by the proposed criterion, even for real-valued parameters in the absence of nuisance parameters. A third motivation, which will be made clear later in this section, is that with moment matching priors it is possible to construct credible regions for parameters of interest based only on the posterior mean and the posterior variance, which match the maximum likelihood based confidence intervals to a high order of approximation. We will confine ourselves primarily to regular families of distributions.

Let X_{1},X_{2},\ldots,X_{n}|\theta be independent and identically distributed with common density function f(x|\theta), where \theta\in\Theta, an interval of the real line. Consider a general class of priors \pi(\theta), \theta\in\Theta, for \theta. Throughout, it is assumed that both f and \pi satisfy all the needed regularity conditions as given in Johnson (1970) and Bickel and Ghosh (1990).

Let \hat{\theta}_{n} denote the maximum likelihood estimator of \theta. Under the prior \pi, we denote the posterior mean of \theta by \hat{\theta}^{B}_{n}. The formal asymptotic expansion given in Section 2 now leads to \hat{\theta}^{B}_{n}=\hat{\theta}_{n}+n^{-1}(\frac{a_{3}}{2\hat{I}_{n}^{2}}+\frac{1}{\hat{I}_{n}}\frac{\pi^{\prime}(\hat{\theta}_{n})}{\pi(\hat{\theta}_{n})})+O_{p}(n^{-3/2}), where a_{3} and \hat{I}_{n} are defined in Theorem 1. The law of large numbers and the consistency of the MLE now give n(\hat{\theta}^{B}_{n}-\hat{\theta}_{n})\stackrel{P}{\rightarrow}(-\frac{g_{3}(\theta)}{2I^{2}(\theta)}+\frac{1}{I(\theta)}\frac{\pi^{\prime}(\theta)}{\pi(\theta)}). With the choice \pi(\theta)=\exp[\frac{1}{2}\int^{\theta}\frac{g_{3}(t)}{I(t)}\,dt], which makes this limit zero, one gets \hat{\theta}^{B}_{n}-\hat{\theta}_{n}=O_{p}(n^{-3/2}). We will denote this prior by \pi_{M}(\theta).
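
To illustrate the recipe on a concrete model, the following sketch (my own, assuming sympy is available) solves (\log\pi)^{\prime}(\theta)=g_{3}(\theta)/\{2I(\theta)\} for the Poisson mean \theta; the resulting moment matching prior is proportional to \theta^{-1}, which differs from Jeffreys' prior \theta^{-1/2} but is consistent with the general result for the mean of an exponential family given in Example 6 below.

import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
logf = x*sp.log(theta) - theta       # the -log(x!) term does not involve theta

def Ex(expr):
    # the derivatives below are linear in x, so E[.] just replaces x by theta
    return sp.simplify(expr.subs(x, theta))

I  = Ex(-sp.diff(logf, theta, 2))    # 1/theta
g3 = Ex(-sp.diff(logf, theta, 3))    # -2/theta^2
piM = sp.exp(sp.integrate(g3/(2*I), theta))
print(sp.simplify(piM))              # 1/theta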

Ghosh and Liu (2011) have shown that if \phi is a one-to-one function of \theta, then the moment matching prior \pi_{M}(\phi) for \phi is given by \pi_{M}(\phi)=\pi_{M}(\theta)|\frac{d\theta}{d\phi}|^{3/2}. We now see an application of this result.

Example 6 ((Continued)).

Consider the regular one-parameter exponential family of densities given by f(x|\theta)=\exp[\theta x-\psi(\theta)+h(x)]. For the canonical parameter \theta, noting that I(\theta)=\psi^{\prime\prime}(\theta) and g_{3}(\theta)=\psi^{\prime\prime\prime}(\theta)=I^{\prime}(\theta), we get \pi_{M}(\theta)=\exp[\frac{1}{2}\int I^{\prime}(\theta)/I(\theta)\,d\theta]=I^{1/2}(\theta), which is Jeffreys' prior. On the other hand, for the population mean \phi=\psi^{\prime}(\theta), which is a strictly increasing function of \theta [since \psi^{\prime\prime}(\theta)=V(X|\theta)>0], the moment matching prior is \pi_{M}(\phi)=I(\phi). In particular, for the binomial proportion p, one gets the Haldane prior \pi_{H}(p)\propto p^{-1}(1-p)^{-1}, which is the same as Hartigan's (1964, 1998) maximum likelihood prior. However, for the canonical parameter \theta=\operatorname{logit}(p), whereas we get Jeffreys' prior, Hartigan (1964, 1998) gets the Laplace \operatorname{uniform}(0,1) prior.
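
For the binomial case the matching of the first moment is in fact exact, not merely asymptotic: under the (improper) Haldane prior, the posterior of p is \operatorname{Beta}(x,n-x) whenever 0<x<n, whose mean is the MLE x/n. A tiny numerical illustration (my own) contrasts this with Jeffreys' prior, for which the posterior mean differs from the MLE by a term of order 1/n.

n, x = 20, 7
mle = x / n
haldane_mean = x / (x + (n - x))          # Beta(x, n - x) posterior mean: exactly the MLE
jeffreys_mean = (x + 0.5) / (n + 1.0)     # Beta(x + 0.5, n - x + 0.5) posterior mean
print(mle, haldane_mean, jeffreys_mean)   # 0.35 0.35 0.3571...; the gap shrinks like 1/n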

Remark 4.

It is now clear that a fundamental difference between priors obtained by matching probabilities and those obtained by matching moments is the lack of invariance of the latter under one-to-one reparameterization. It may be interesting to find conditions under which a moment matching prior agrees with Jeffreys' prior I^{1/2}(\theta) or with the uniform constant prior. The former holds if and only if g_{3}(\theta)=I^{\prime}(\theta), while the latter holds if and only if g_{3}(\theta)=0.

The if parts of the above results are immediate from the definition of \pi_{M}(\theta). To prove the only if parts, note that if \pi_{M}(\theta)=I^{1/2}(\theta), then taking logarithms and differentiating with respect to \theta gives \frac{I^{\prime}(\theta)}{2I(\theta)}=\frac{g_{3}(\theta)}{2I(\theta)}, so that g_{3}(\theta)=I^{\prime}(\theta). On the other hand, if \pi_{M}(\theta)=c, then taking logarithms and differentiating with respect to \theta gives g_{3}(\theta)=0.

The above approach can be extended to the matching of higher moments as well. Noting that V_{\pi}(\theta|X_{1},\ldots,X_{n})=E_{\pi}[(\theta-\hat{\theta}_{n})^{2}|X_{1},\ldots,X_{n}]-(\hat{\theta}_{n}^{B}-\hat{\theta}_{n})^{2}, it follows immediately that under the moment matching prior \pi_{M}, V_{\pi}(\theta|X_{1},\ldots,X_{n})=(n\hat{I}_{n})^{-1}+O_{p}(n^{-2}). This fact facilitates the construction of credible intervals for \theta, the parameter of interest, centered at the posterior mean and scaled by the posterior standard deviation, which enjoy the same asymptotic properties as the confidence interval centered at the MLE and scaled by the square root of the reciprocal of the observed Fisher information number.

6 Summary and Conclusion

As mentioned in the Introduction, this article provides a selective review of objective priors, reflecting my own interest in and familiarity with the topics. I am well aware that many important contributions are left out. For instance, I have discussed only the two-group reference priors of Bernardo (1979). A more appealing later contribution by Berger and Bernardo (1992a) provided an algorithm for the construction of multi-group reference priors when these groups are arranged in accordance with their order of importance. In particular, the one-at-a-time reference priors, as advocated by these authors, have proved to be quite useful in practice. Ghosal (1997, 1999) provided the construction of reference priors in nonregular cases, while a formal definition of reference priors encompassing both regular and nonregular cases has recently been proposed by Berger, Bernardo and Sun (2009).

Regarding probability matching priors, we have discussed only the quantile matching criterion. There are several other, possibly equally important, probability matching criteria. Notable among these are the highest posterior density matching criterion, as well as matching via inversion of test statistics, such as the likelihood ratio, Rao or Wald statistics. Extensive discussion of such matching priors is given in Datta and Mukerjee (2004). Datta et al. (2000) constructed matching priors via a prediction criterion, and related exact results in this context are available in Fraser and Reid (2002). The issue of matching priors in the context of conditional inference has been discussed quite extensively in Reid (1996).

A different class of priors, called “the maximum likelihood prior,” was developed by Hartigan (1964, 1998). Roughly speaking, these priors are found by maximizing the expected distance between the prior and the posterior under a truncated Kullback–Leibler distance. Like the proposed moment matching priors, the maximum likelihood prior densities, when they exist, result in posterior means whose difference from the MLE's is asymptotically negligible. I have alluded to some of these priors for comparison with the other priors considered in this paper.

With the exception of the right- and left-invariant Haar priors, the derivations of the remaining priors are based essentially on the asymptotic expansion of the posterior density, together with the shrinkage argument of J. K. Ghosh. This approach provides a nice unified tool for the development of objective priors. I believe very strongly that many new priors will be found in the future by either a direct application or a slight modification of these tools.

The results of this article show that, in the absence of nuisance parameters, Jeffreys' prior is a clear winner in most situations. The only exception is the chi-square divergence, where different priors may emerge. But that corresponds only to one special case, namely, the boundary of the class of divergence priors, while Jeffreys' prior retains its optimality in the interior. In the presence of nuisance parameters, my own recommendation is to find two- or multi-group reference priors following the algorithm of Berger and Bernardo (1992a), and then narrow down this class of priors by finding its intersection with the class of probability matching priors. This approach can even lead to a unique objective prior in some situations; some simple illustrations are given in this article. I also want to point out the versatility of reference priors. For example, for nonregular models, Jeffreys' general rule prior does not work. But as shown in Ghosal (1997) and Berger, Bernardo and Sun (2009), one can extend the definition of reference priors to cover these situations as well.

The examples given in this paper are purposely quite simple, mainly to aid readers not at all familiar with the topic. Quite rightfully, they can be criticized as somewhat stylized. Both reference and probability matching priors, however, have been developed for more complex problems of practical importance. Among others, I may refer to Berger and Yang (1994), Berger, De Oliveira and Sanso (2001), Ghosh and Heo (2003), Ghosh, Carlin and Srivastava (1994) and Ghosh, Yin and Kim (2003). The topics of these papers include time series models, spatial models and inverse problems, such as linear calibration, as well as problems in bioassay, in particular, slope ratio and parallel line assays. One can easily extend this list. A very useful source for all these papers is Bernardo (2005).

Acknowledgments

This research was supported in part by NSF Grant Number SES-0631426 and NSA Grant Number MSPF-076-097. The comments of the Guest Editor and a reviewer led to substantial improvement of the manuscript.

References

  • (1) Amari, S. (1982). Differential geometry of curved exponential families—Curvatures and information loss. Ann. Statist. 10 357–387. \MR0653513
  • (2) Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 53 370–418.
  • (3) Berger, J. O. (1985). Statistical Decision Theory and Related Topics, 2nd ed. Springer, New York. \MR0804611
  • (4) Berger, J. O. and Bernardo, J. M. (1989). Estimating a product of means. Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84 200–207. \MR0999679
  • (5) Berger, J. O. and Bernardo, J. M. (1992a). On the development of reference priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press, New York. \MR1380269
  • (6) Berger, J. O. and Bernardo, J. M. (1992b). Reference priors in a variance components problem. In Bayesian Analysis in Statistics and Econometrics (P. K. Goel and N. S. Iyengar, eds.) 177–194. Springer, New York. \MR1194392
  • (7) Berger, J. O., Bernardo, J. M. and Sun, D. (2009). The formal definition of reference priors. Ann. Statist. 37 905–938. \MR2502655
  • (8) Berger, J. O., de Oliveira, V. and Sanso, B. (2001). Objective Bayesian analysis of spatially correlated data. J. Amer. Statist. Assoc. 96 1361–1374. \MR1946582
  • (9) Berger, J. O. and Yang, R. (1994). Noninformative priors and Bayesian testing for the AR(1) model. Econometric Theory 10 461–482. \MR1309107
  • (10) Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 41 113–147. \MR0547240
  • (11) Bernardo, J. M. (2005). Reference analysis. In Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.). North-Holland, Amsterdam. \MR2490522
  • (12) Bhattacharyya, A. K. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35 99–109. \MR0010358
  • (13) Bickel, P. J. and Ghosh, J. K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction—A Bayesian argument. Ann. Statist. 18 1070–1090. \MR1062699
  • (14) Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36 453–471. \MR1053841
  • (15) Clarke, B. and Barron, A. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference 41 37–60. \MR1292146
  • (16) Clarke, B. and Sun, D. (1997). Reference priors under the chi-square distance. Sankhyā A 59 215–231. \MR1665703
  • (17) Clarke, B. and Sun, D. (1999). Asymptotics of the expected posterior. Ann. Inst. Statist. Math. 51 163–185. \MR1704652
  • (18) Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46 440–464. \MR0790631
  • (19) Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 1–39. \MR0893334
  • (20) Datta, G. S. (1996). On priors providing frequentist validity of Bayesian inference for multiple parametric functions. Biometrika 83 287–298. \MR1439784
  • (21) Datta, G. S. and Ghosh, J. K. (1995a). On priors providing frequentist validity for Bayesian inference. Biometrika 82 37–45. \MR1332838
  • (22) Datta, G. S. and Ghosh, J. K. (1995b). Noninformative priors for maximal invariant in group models. Test 4 95–114. \MR1365042
  • (23) Datta, G. S. and Ghosh, M. (1995c). Some remarks on noninformative priors. J. Amer. Statist. Assoc. 90 1357–1363. \MR1379478
  • (24) Datta, G. S. and Ghosh, M. (1995d). Hierarchical Bayes estimators of the error variance in one-way ANOVA models. J. Statist. Plann. Inference 45 399–411. \MR1341333
  • (25) Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141–159. \MR1389884
  • (26) Datta, G. S., Ghosh, M. and Mukerjee, R. (2000). Some new results on probability matching priors. Calcutta Statist. Assoc. Bull. 50 179–192. \MR1843620
  • (27) Datta, G. S., Ghosh, M. and Kim, Y. (2002). Probability matching priors for one-way unbalanced random effects models. Statist. Decisions 20 29–51. \MR1904422
  • (28) Datta, G. S. and Mukerjee, R. (2004). Probability Matching Priors: Higher Order Asymptotics. Springer, New York. \MR2053794
  • (29) Datta, G. S., Mukerjee, R., Ghosh, M. and Sweeting, T. J. (2000). Bayesian prediction with approximate frequentist validity. Ann. Statist. 28 1414–1426. \MR1805790
  • (30) Datta, G. S. and Sweeting, T. J. (2005). Probability matching priors. In Bayesian Thinking, Modeling and Computation. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.). North-Holland, Amsterdam. \MR2490523
  • (31) Ferguson, T. (1996). A Course in Large Sample Theory. Chapman & Hall/CRC Press, Boca Raton, FL. \MR1699953
  • (32) Fraser, D. A. S. and Reid, N. (2002). Strong matching of frequentist and Bayesian parametric inference. J. Statist. Plann. Inference 103 263–285. \MR1896996
  • (33) Ghosal, S. (1997). Reference priors in multiparameter nonregular cases. Test 6 159–186. \MR1466439
  • (34) Ghosal, S. (1999). Probability matching priors for nonregular cases. Biometrika 86 956–964. \MR1741992
  • (35) Ghosh, J. K., Delampady, M. and Samanta, T. (2006). An Introduction to Bayesian Analysis. Springer, New York. \MR2247439
  • (36) Ghosh, M. and Liu, R. (2011). Moment matching priors. Sankhyā A. To appear.
  • (37) Ghosh, J. K. and Mukerjee, R. (1992). Non-informative priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 195–210. Oxford Univ. Press, New York. \MR1380277
  • (38) Ghosh, J. K., Sinha, B. K. and Joshi, S. N. (1982). Expansion for posterior probability and integrated Bayes risk. In Statistical Decision Theory and Related Topics III 1 403–456. Academic Press, New York. \MR0705299
  • (39) Ghosh, M., Carlin, B. P. and Srivastava, M. S. (1994). Probability matching priors for linear calibration. Test 4 333–357. \MR1379796
  • (40) Ghosh, M. and Heo, J. (2003). Default Bayesian priors for regression models with second-order autoregressive residuals. J. Time Ser. Anal. 24 269–282. \MR1984597
  • (41) Ghosh, M., Mergel, V. and Liu, R. (2011). A general divergence criterion for prior selection. Ann. Inst. Statist. Math. 63 43–58.
  • (42) Ghosh, M. and Mukerjee, R. (1998). Recent developments on probability matching priors. In Applied Statistical Science III (S. E. Ahmed, M. Ahsanullah and B. K. Sinha, eds.) 227–252. Nova Science Publishers, New York. \MR1673669
  • (43) Ghosh, M., Yin, M. and Kim, Y.-H. (2003). Objective Bayesian inference for ratios of regression coefficients in linear models. Statist. Sinica 13 409–422. \MR1977734
  • (44) Halmos, P. (1950). Measure Theory. Van Nostrand, New York. \MR0033869
  • (45) Hartigan, J. A. (1964). Invariant prior densities. Ann. Math. Statist. 35 836–845. \MR0161406
  • (46) Hartigan, J. A. (1998). The maximum likelihood prior. Ann. Statist. 26 2083–2103. \MR1700222
  • (47) Hellinger, E. (1909). Neue Begründung der Theorie quadratischen Formen von unendlichen vielen Veränderlichen. J. Reine Angew. Math. 136 210–271.
  • (48) Huzurbazar, V. S. (1950). Probability distributions and orthogonal parameters. Proc. Camb. Phil. Soc. 46 281–284. \MR0034567
  • (49) Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford Univ. Press, Oxford.
  • (50) Johnson, R. A. (1970). Asymptotic expansions associated with posterior distribution. Ann. Math. Statist. 41 851–864. \MR0263198
  • (51) Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. J. Amer. Statist. Assoc. 91 1343–1370. \MR1478684
  • (52) Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
  • (53) Lindley, D. V. (1956). On the measure of the information provided by an experiment. Ann. Math. Statist. 27 986–1005. \MR0083936
  • (54) Liu, R. (2009). On some new contributions towards objective priors. Unpublished Ph.D. dissertation. Dept. Statistics, Univ. Florida, Gainesville, FL. \MR2714091
  • (55) Morris, C. N. and Normand, S. L. (1992). Hierarchical models for combining information and for meta-analysis. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 321–344. Oxford. Univ. Press, New York. \MR1380284
  • (56) Mukerjee, R. and Dey, D. K. (1993). Frequentist validity of posterior quantiles in the presence of a nuisance parameter: Higher-order asymptotics. Biometrika 80 499–505. \MR1248017
  • (57) Mukerjee, R. and Ghosh, M. (1997). Second-order probability matching priors. Biometrika 84 970–975. \MR1625016
  • (58) Nachbin, L. (1965). The Haar Integral. Van Nostrand, New York. \MR0175995
  • (59) Reid, N. (1996). Likelihood and Bayesian approximation methods (with discussion). In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 351–368. Oxford Univ. Press, New York. \MR1425414
  • (60) Severini, T. A., Mukerjee, R. and Ghosh, M. (2002). On an exact probability matching property of right-invariant priors. Biometrika 89 952–957. \MR1946524
  • (61) Staicu, A.-M. and Reid, N. (2008). On probability matching priors. Canad. J. Statist. 36 613–622. \MR2532255
  • (62) Tibshirani, R. J. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608. \MR1040654
  • (63) Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329. \MR0173309
  • (64) Ye, K. (1994). Bayesian reference prior analysis on the ratio of variances for the balanced one-way random effect model. J. Statist. Plann. Inference 41 267–280. \MR1309613