
Functional Uniform Priors for Nonlinear Modelling

Björn Bornkamp (bjoern.bornkamp@novartis.com)
Abstract

This paper considers the topic of finding prior distributions when a major component of the statistical model depends on a nonlinear function. Using results on how to construct uniform distributions in general metric spaces, we propose a prior distribution that is uniform in the space of functional shapes of the underlying nonlinear function and then back-transform to obtain a prior distribution for the original model parameters. The primary application considered in this article is nonlinear regression, but the idea might be of interest beyond this case. For nonlinear regression the priors constructed in this way have the advantage that they are parametrization invariant, unlike uniform distributions on the parameters, and do not violate the likelihood principle, unlike the Jeffreys prior. The utility of the proposed priors is demonstrated in the context of nonlinear regression modelling in clinical dose-finding trials, through a real data example and simulation. In addition the proposed priors are used for the calculation of a Bayesian optimal design.

1 Introduction

Mathematical models of the real world are typically nonlinear; examples in medical or biological applications can be found, for instance, in Lindsey (2001) or Jones et al. (2010). Setting up prior distributions in a statistical analysis of nonlinear models, however, often remains a challenge. If external, numerical or non-numerical information exists, one can try to quantify it into a probability distribution, see for example the works of O'Hagan et al. (2006), Bornkamp and Ickstadt (2009), and Neuenschwander et al. (2010). The classical approach in the absence of substantive information is Jeffreys prior distribution (or variants), given by $p(\boldsymbol{\theta}) \propto \sqrt{\det(I(\boldsymbol{\theta}))}$, where $\boldsymbol{\theta} \in \boldsymbol{\Theta} \subset \mathbb{R}^p$ is the parameter and $I(\boldsymbol{\theta})$ the Fisher information matrix of the underlying statistical model. See Kass and Wasserman (1996), Ghosh et al. (2006, ch. 5) or Berger et al. (2009) for this approach and generalizations. A serious drawback is the fact that this prior can depend on observed covariates. In the case of nonlinear regression analysis, the prior depends on the design points and the relative allocations to these points and thus violates the likelihood principle. Apart from the foundational issues this raises (see, e.g., O'Hagan and Forster (2004, ch. 3)) it also has undesirable practical consequences. For Bayesian optimal design calculations in nonlinear regression models, for example, Jeffreys prior cannot be used, because it depends on the design points, which are exactly what we want to calculate in the optimal design problem. In the context of adaptive dose-finding clinical trials, patients are allocated dynamically to the available doses (see the works of Müller et al. (2006) or Dragalin et al. (2010)), so that under Jeffreys rule the sequential analysis of the data will differ from the analysis combining all data. In summary, the main issue with the Jeffreys prior distribution is that one cannot state it before data collection, which is crucial in some applications. Surprisingly few proposals have been made to overcome this situation. In current practice, uniform distributions for $\boldsymbol{\theta}$ on a reasonable compact subset of the parameter space are often used. This approach, however, is extremely sensitive to the chosen parametrization (which might be more or less arbitrary) and can be much more informative than one would expect intuitively.

To illustrate the point, we will use a simple example. Suppose one would like to analyse data using the exponential model $\exp(-\theta x)$, here with $x \in [0, 10]$, which could be the mean function in a regression analysis. Assume that no historical data or practical experiences related to the problem are available.


Figure 1: (i) Display of the uniform distribution on the $\theta$ scale; (ii) display of the regression function $\exp(-\theta x)$ for $\theta = 0$, $\theta = 5$ and the $\theta$ corresponding to the $i/10$ quantiles, $i = 1, \ldots, 9$, of the uniform distribution.

A first pragmatic approach in this situation is to use a uniform distribution on $\theta$ values leading to a reasonable shape coverage of the underlying regression function $\exp(-\theta x)$; for example, the interval $\theta \in [0, 5]$ covers the underlying shapes almost entirely. The consequences of assuming a uniform prior on $[0, 5]$ can be observed in Figure 1 (ii). While the prior is uniform in $\theta$ space, it places most of its prior probability mass on the functional shapes that decrease quickly towards zero, and we end up with a very informative prior distribution in the space of functional shapes. This is highly undesirable when limited prior information regarding the shape is available. In addition the prior depends crucially on the upper bound selected for $\theta$, and a uniform distribution in an alternative parameterization would lead to an entirely different prior in the space of shapes. One way to overcome these problems is to use a distribution that is uniform in the space of functional shapes of the underlying nonlinear function. This will be uninformative from the functional viewpoint and will not depend on the selected parameterization.
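
The effect is easy to visualize; the following short R sketch (our own illustration, not code from the original analysis) reproduces the idea behind Figure 1 (ii) by plotting the curves at the prior deciles:

## Curves exp(-theta * x) at the deciles of a Uniform(0, 5) prior on theta:
## most of them have already decayed to zero over a small part of [0, 10].
x  <- seq(0, 10, length.out = 101)
th <- qunif(seq(0.1, 0.9, by = 0.1), min = 0, max = 5)  # prior deciles
matplot(x, sapply(c(0, th, 5), function(t) exp(-t * x)), type = "l",
        lty = 1, col = 1, xlab = "x", ylab = "exp(-theta * x)")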

In finite dimensional situations it is a standard approach to use distributions that are uniform in an interpretable parameter transformation, when it is difficult to use the classical default prior distributions. In the context of Dirichlet process mixture modelling, one can use a uniform distribution on the probability that two observations cluster into one group and then transfer this into a prior distribution for the precision parameter of the Dirichlet process. In the challenging problem of assigning a prior distribution for variance parameters in hierarchical models, Daniels (1999) assumes a uniform distribution on the shrinkage coefficient and then transfers this to a prior distribution for the variance parameter. In these cases the standard change of variables theorem can be used to derive the necessary uniform distributions. When we want to impose a uniform distribution in the space of functional shapes of an underlying regression function, however, it is not entirely obvious how to construct such a distribution. In the next section we will review a methodology that allows one to construct uniform distributions on general metric spaces. In Section 2.2 we will adapt this to the nonlinear models that we consider in this article. Finally, in Section 3 we test the priors for nonlinear regression on a data set from a dose-finding trial, in a simulation study and in an optimal design problem.

2 Methodology

2.1 General Approach

Suppose one would like to find a prior distribution for a parameter $\boldsymbol{\theta}$ in a compact subspace $\boldsymbol{\Theta} \subset \mathbb{R}^p$, $p < \infty$. The approach proposed in this paper is to map the parameter $\boldsymbol{\theta}$ from $\boldsymbol{\Theta}$ into another compact metric space $(M, d)$, with metric $d$, using a differentiable bijective function $\varphi: \boldsymbol{\Theta} \mapsto M$, so that $\varphi(\boldsymbol{\theta}) = \boldsymbol{\phi} \in M$. The metric $d$ should ideally define a reasonable measure of closeness and distance between the parameters, and its choice will of course be model and application dependent. In the exponential regression example, for instance, it seems adequate to measure the distance between two parameter values $\theta'$ and $\theta''$ by a distance between the resulting functions $\exp(-x\theta')$ and $\exp(-x\theta'')$, rather than by the Euclidean distance between the plain parameter values. In this metric space $(M, d)$, one then imposes a uniform distribution, reflecting the appropriate notion of distance of the metric space $(M, d)$, and transforms this distribution back to the parameter scale.

The construction of a uniform distribution in general metric spaces has been described by Dembski (1990), using the notion of packing numbers. Ghosal et al. (1997) apply this result in two particular Bayesian applications (derivation of Jeffreys prior for parametric problems and nonparametric density estimation). In the following we review and adapt this theory to our situation. Some basic mathematical notions are needed to present the ideas: Define an $\epsilon$-net as a set $S_\epsilon \subset M$ such that $d(\boldsymbol{\phi}', \boldsymbol{\phi}'') \geq \epsilon$ holds for all $\boldsymbol{\phi}', \boldsymbol{\phi}'' \in S_\epsilon$, and the addition of any point to $S_\epsilon$ destroys this property. An $\epsilon$-lattice $S_\epsilon^m$ is an $\epsilon$-net of maximum possible cardinality. Dembski defines the uniform distribution on $M$ as the limit of a discrete uniform distribution on an $\epsilon$-lattice on $M$, as $\epsilon \rightarrow 0$.

Definition 1

The uniform distribution $\Pi$ on $M$ is defined as

$$\Pi(A) = \lim_{\epsilon \rightarrow 0} \Pi_\epsilon(A),$$

for $A \subset M$, where $\Pi_\epsilon(A)$ is the discrete uniform distribution supported on the points in $S_\epsilon^m$, i.e. $\Pi_\epsilon(A) = \frac{1}{|S_\epsilon^m|}\sum_{\boldsymbol{\phi} \in S_\epsilon^m}\delta_{\boldsymbol{\phi}}(A)$, with $|S_\epsilon^m|$ the cardinality of $S_\epsilon^m$.

Loosely speaking, the uniform distribution is hence defined as the limit of a discrete uniform distribution on an equally spaced grid, where the notion of "equally spaced" is determined by the distance metric underlying $(M, d)$. Even though this definition is intuitive, it is not constructive. Apart from special cases, generating an $\epsilon$-lattice is computationally difficult in a general metric space; calculating the limit of $\epsilon$-lattices even more so. In addition, it is unclear whether there is just one limiting distribution to which all $\epsilon$-lattices converge. To overcome these problems Dembski (1990) uses the closely related notion of packing numbers. The packing number $D(\epsilon, A, d)$ of a subset $A \subset M$ in the metric $d$ is defined as the cardinality of an $\epsilon$-lattice on $A$, and packing numbers are known for a number of metric spaces. An $\epsilon$-pseudo-probability can then be defined as $P_\epsilon(A) = \frac{D(\epsilon, A, d)}{D(\epsilon, M, d)}$. It is straightforward to see that $0 \leq P_\epsilon(A) \leq 1$ and that $P_\epsilon(M) = 1$, but packing numbers are sub-additive and hence $P_\epsilon$ is not a probability measure. However, for disjoint sets $A'$ and $A''$ with minimum distance $> \epsilon$ additivity holds, i.e. $P_\epsilon(A' \cup A'') = P_\epsilon(A') + P_\epsilon(A'')$. Dembski (1990) then shows that whenever $\lim_{\epsilon \rightarrow 0} P_\epsilon(A)$ exists for any $A$, the limit distribution is the unique uniform distribution on $(M, d)$ (see Dembski (1990) or Ghosal et al. (1997) for details). As packing numbers are known for a number of metric spaces, this result provides a constructive way of building uniform distributions, without the need to explicitly construct $\epsilon$-lattices.
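
As a toy illustration of the pseudo-probabilities (our own example, not from the original paper), consider $M = [0, 1]$ with the usual metric: the packing number of an interval of length $a$ is $\lfloor a/\epsilon \rfloor + 1$, so $P_\epsilon$ of an interval converges to its length, recovering the ordinary uniform distribution. In R:

## epsilon-pseudo-probability of [0, a] within M = [0, 1], Euclidean metric;
## an interval of length a holds floor(a / eps) + 1 points with spacing >= eps
Peps <- function(eps, a) (floor(a / eps) + 1) / (floor(1 / eps) + 1)
sapply(c(0.1, 0.01, 0.001), Peps, a = 0.3)  # tends to 0.3 as eps -> 0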

Subsequently we consider the practically important case of a finite number of parameters $p$ and assume that the metric of $(M, d)$, written as $d(\boldsymbol{\phi}, \boldsymbol{\phi}_0) = d(\varphi(\boldsymbol{\theta}), \varphi(\boldsymbol{\theta}_0)) = d^*(\boldsymbol{\theta}, \boldsymbol{\theta}_0)$ in terms of $\boldsymbol{\theta}$, can be approximated by a local quadratic approximation of the form

$$d^*(\boldsymbol{\theta}, \boldsymbol{\theta}_0) = c_1\sqrt{c_2(\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T\boldsymbol{V}(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^k)}, \qquad (1)$$

where $c_1, c_2 > 0$ are constants and $k \geq 3$. Equation (1) implies that $d(.,.)$ can locally be approximated by a Euclidean metric. This is not a very strong condition: for a sufficiently often differentiable metric $d(., \boldsymbol{\theta}_0)$ one can make use of a second-order Taylor expansion of $d(., \boldsymbol{\theta}_0)^2$ and apply the square root to obtain (1). The following theorem calculates the distribution induced on $\boldsymbol{\theta}$ by imposing a uniform distribution in $(M, d)$, when assumption (1) holds. The proof is only a slight adaptation of earlier results by Ghosal et al. (1997); see Appendix A.

Theorem 1

For a metric space $(M, d)$ and a bijective function $\varphi$ fulfilling (1), where $\boldsymbol{V}(\boldsymbol{\theta})$ is a symmetric matrix with finite, strictly positive eigenvalues for all $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ and continuous as a function of $\boldsymbol{\theta}$, $P_\epsilon(A) = \frac{D(\epsilon, A, d)}{D(\epsilon, M, d)}$ for $A \subset \boldsymbol{\Theta}$ converges to

$$\frac{\int_A \sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}\,d\boldsymbol{\theta}}{\int_{\boldsymbol{\Theta}} \sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}\,d\boldsymbol{\theta}}, \quad \mathit{as}\ \epsilon \rightarrow 0.$$

The density of the uniform probability distribution is hence given by

$$p(\boldsymbol{\theta}) = \frac{\sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}}{\int_{\boldsymbol{\Theta}} \sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}\,d\boldsymbol{\theta}}.$$

We note that the last result can also be obtained from considerations based on Riemannian manifolds, in which case (1) would be the Riemannian metric: Pennec (2006), for example, explicitly considers uniform distributions on Riemannian manifolds and obtains the same result. We concentrated on Dembski's derivation as it seems both more general and more intuitive.

It is important to note that the distribution so defined is independent of the parametrization. This is intuitively clear, as the space $(M, d)$, in which the uniform distribution is imposed, is fixed, no matter which parametrization is used. We illustrate this invariance property for the special case of a Taylor approximation in the theorem below; for a proof see Appendix B.

Theorem 2

Assume $(M, d)$ with $d(\boldsymbol{\theta}, \boldsymbol{\theta}_0)^2 = \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_0)'\boldsymbol{V}(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^3)$, where $\boldsymbol{V}(\boldsymbol{\theta}) = \left(\frac{\partial^2 d(\boldsymbol{\theta}, \boldsymbol{\theta}_0)^2}{\partial\theta_i\partial\theta_j}\right)_{i,j}$ evaluated at $\boldsymbol{\theta}$, which leads to a prior $p(\boldsymbol{\theta}) \propto \sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}$.
When calculating the uniform distribution associated with the transformed parameter $g(\boldsymbol{\theta}) = \boldsymbol{\gamma}$, with $g: \mathbb{R}^p \rightarrow \mathbb{R}^p$ a bijective, twice differentiable transformation, one obtains $p(\boldsymbol{\gamma}) \propto |\det(\boldsymbol{H}(\boldsymbol{\gamma}))|\sqrt{\det(\boldsymbol{V}(h(\boldsymbol{\gamma})))}$, where $h$ is the inverse of $g$ and $\boldsymbol{H}(\boldsymbol{\gamma}) = (\frac{\partial}{\partial\gamma_1}h(\boldsymbol{\gamma}), \ldots, \frac{\partial}{\partial\gamma_p}h(\boldsymbol{\gamma}))$ is the Jacobian matrix associated with $h$. This is the same result as applying the change of variables theorem to $p(\boldsymbol{\theta})$.

A technical restriction of the theory described in this section is its restriction to compact parameter spaces $\boldsymbol{\Theta}$. However, it is possible to extend it by taking limits of a sequence of growing compact spaces; see the works of Dembski (1990) and Ghosal et al. (1997) for details. Note that the resulting limiting density need not be integrable.

2.1.1 Examples

Non-functional uniform priors

While the approach outlined in Section 2.1 is developed for general metric spaces, it coincides with standard results about change of variables when the metric space $M$ is itself a compact subset of $\mathbb{R}^p$. Suppose one would like to use a uniform distribution for $\varphi(\boldsymbol{\theta})$, with $\varphi: \mathbb{R}^p \rightarrow \mathbb{R}^p$ a bijective, continuously differentiable function, and then back-transform to the $\boldsymbol{\theta}$ scale. Using the standard change of variables theorem one obtains $p(\boldsymbol{\theta}) \propto |\det(D(\boldsymbol{\theta}))|$, where $D(\boldsymbol{\theta}) = (\frac{\partial}{\partial\theta_1}\varphi(\boldsymbol{\theta}), \ldots, \frac{\partial}{\partial\theta_p}\varphi(\boldsymbol{\theta}))$ is the Jacobian matrix of the transformation $\varphi(\boldsymbol{\theta})$. Framed in the approach of the last section, the metric space $M$ is a compact subset of $\mathbb{R}^p$ with the Euclidean metric in the transformed space, $d(\varphi(\boldsymbol{\theta}), \varphi(\boldsymbol{\theta}_0)) = \sqrt{(\varphi(\boldsymbol{\theta}) - \varphi(\boldsymbol{\theta}_0))^T(\varphi(\boldsymbol{\theta}) - \varphi(\boldsymbol{\theta}_0))}$. A local linear approximation to $\varphi(\boldsymbol{\theta}) - \varphi(\boldsymbol{\theta}_0)$ is $D(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0)$ with remainder $O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^2)$. Hence, one obtains $d(\varphi(\boldsymbol{\theta}), \varphi(\boldsymbol{\theta}_0)) = \sqrt{(\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T D(\boldsymbol{\theta}_0)^T D(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^3)}$. Applying Theorem 1 one ends up with the desired distribution $p(\boldsymbol{\theta}) \propto \sqrt{\det(D(\boldsymbol{\theta})^T D(\boldsymbol{\theta}))} = |\det(D(\boldsymbol{\theta}))|$.

Jeffreys Prior

Another special case of this general approach is Jeffreys prior itself. Jeffreys (1961) described his rule by noting that (1) approximates the empirical Hellinger distance (as well as the empirical Kullback-Leibler divergence) between the residual distributions in a statistical model, when $\boldsymbol{V}(\boldsymbol{\theta})$ is the Fisher information matrix. In this situation the parameters $\boldsymbol{\theta}$ of a statistical model are mapped into the space of residual densities, and this space is used to define the notion of distance between the $\boldsymbol{\theta}$'s. Applying the machinery from the last section then leads to a uniform distribution on the space of residual densities. This interpretation of Jeffreys rule is rarely encountered, but has been noted, among others, by Kass and Wasserman (1996, ch. 3.6). Ghosal et al. (1997) and Balasubramanian (1997) explicitly derive Jeffreys rule from these principles. From this viewpoint Jeffreys prior is useful as a universal "default" prior, because it gives equal weight to all possible residual densities underlying a statistical model. However, the underlying metric can depend, for example, on the values of covariates, which is undesirable in the nonlinear regression application, as discussed in the introduction.

Triangular Distribution

In this example Definition 1 is used directly to numerically approximate a uniform distribution on a metric space. This can be done in the case $p = 1$, where $\epsilon$-lattices can easily be constructed numerically.

The triangular distribution, with density

$$p(x|\theta) = \begin{cases} 2x/\theta, & 0 < x \leq \theta\\ 2(1-x)/(1-\theta), & \theta \leq x < 1,\end{cases} \qquad (2)$$

for $\theta \in (0, 1)$, is a simple, yet versatile distribution, for which the Jeffreys prior does not exist (Berger et al., 2009). One possible metric space in which to impose the uniform distribution is the space of triangular densities or triangular distribution functions parametrized by $\theta$. Several metrics might be used; we will consider the Hellinger metric $d_H(\theta_1, \theta_2) = \left\{\int_0^1\left(\sqrt{p(x|\theta_1)} - \sqrt{p(x|\theta_2)}\right)^2 dx\right\}^{\frac{1}{2}}$ and the Kolmogorov metric $d_K(\theta_1, \theta_2) = \sup_{y \in [0,1]}|\int_0^y (p(x|\theta_1) - p(x|\theta_2))\,dx|$. Numerically calculating the corresponding $\epsilon$-lattices, one obtains the distributions displayed in Figure 2. Interestingly, the calculated uniform distribution in the Hellinger metric space equals a $Beta(1/2, 1/2)$ distribution (the reference prior in Berger et al. (2009)), while the calculated functional uniform distribution in the Kolmogorov metric results in a uniform distribution on $[0, 1]$.
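
A simple greedy scan over a fine grid can approximate such an $\epsilon$-lattice numerically. The following R sketch is our own illustration (the greedy $\epsilon$-separated set serves as a proxy for the maximal-cardinality lattice) and constructs an approximate 0.03-lattice in the Hellinger metric:

## triangular density and Hellinger distance between two members of the family
ptri <- function(x, th) ifelse(x <= th, 2 * x / th, 2 * (1 - x) / (1 - th))
dH <- function(t1, t2) {
  f <- function(x) (sqrt(ptri(x, t1)) - sqrt(ptri(x, t2)))^2
  sqrt(integrate(f, 0, 1)$value)
}
## greedy construction of an epsilon-separated set (approximate epsilon-lattice)
eps  <- 0.03
cand <- seq(0.001, 0.999, length.out = 2000)
net  <- cand[1]
for (g in cand[-1])
  if (min(sapply(net, dH, t2 = g)) >= eps) net <- c(net, g)
hist(net, breaks = 20, freq = FALSE)  # resembles the Beta(1/2, 1/2) density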


Figure 2: Numerically calculated uniform distributions in (i) the Hellinger metric and (ii) the Kolmogorov metric. The solid curves are based on interpolation of the empirical distribution functions of 0.005-lattices followed by differentiation; the dots represent a 0.03-lattice.

2.2 Nonlinear Regression

The implicit assumption when employing a nonlinear regression function $\mu(x, \boldsymbol{\theta})$ is that for some $\boldsymbol{\theta}$ the shape of the function $\mu(x, \boldsymbol{\theta})$ will adequately describe reality. It is usually unclear, however, which of these shapes is the right one. A uniform distribution on the functional shapes hence seems a reasonable prior. A suitable metric space is consequently the space of functions $\mu(., \boldsymbol{\theta})$, with $x \in \mathcal{X} \subset \mathbb{R}$, $\boldsymbol{\theta} \in K \subset \mathbb{R}^p$ with compact $K$ and metric for example given by the $L_2$ distance $d(\boldsymbol{\theta}, \boldsymbol{\theta}_0) = \sqrt{\int_{\mathcal{X}}(\mu(x, \boldsymbol{\theta}) - \mu(x, \boldsymbol{\theta}_0))^2 dx}$. By a first-order Taylor expansion one obtains $\mu(x, \boldsymbol{\theta}) - \mu(x, \boldsymbol{\theta}_0) = J_x(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^2)$, where $J_x(\boldsymbol{\theta}_0) = \frac{\partial}{\partial\boldsymbol{\theta}}\mu(x, \boldsymbol{\theta})$ is the row vector of first partial derivatives. This results in an approximation of the form $(\mu(x, \boldsymbol{\theta}) - \mu(x, \boldsymbol{\theta}_0))^2 = (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T J_x(\boldsymbol{\theta}_0)^T J_x(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^3)$. Integrating this with respect to $x$ and taking the square root leads to an approximation of $d(\boldsymbol{\theta}, \boldsymbol{\theta}_0)$ of the form $\sqrt{(\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T\boldsymbol{Z}^*(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0) + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|^3)}$, where $\boldsymbol{Z}^*(\boldsymbol{\theta}) = \int_{\mathcal{X}} J_x(\boldsymbol{\theta})^T J_x(\boldsymbol{\theta})\,dx$. Consequently, from Theorem 1 the functional uniform distribution for $\boldsymbol{\theta}$ equals $p(\boldsymbol{\theta}) \propto \sqrt{\det(\boldsymbol{Z}^*(\boldsymbol{\theta}))}$. In the special case of a linear model $\mu(x, \boldsymbol{\theta}) = f(x)^T\boldsymbol{\theta}$, the functional uniform distribution collapses to a constant prior distribution, which is the uniform distribution on $\boldsymbol{\Theta}$ for compact $\boldsymbol{\Theta}$ and improper when extending to non-compact $\boldsymbol{\Theta}$.
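
For models where $\boldsymbol{Z}^*(\boldsymbol{\theta})$ is not available in closed form, the kernel $\sqrt{\det(\boldsymbol{Z}^*(\boldsymbol{\theta}))}$ can be computed numerically. The following generic R sketch is our own illustration (it assumes the numDeriv package is available; fuKernel is a hypothetical helper name):

library(numDeriv)  # numerical gradients
fuKernel <- function(theta, mu, xl, xu) {
  p <- length(theta)
  Zij <- function(i, j)  # (i, j) entry of integral of J_x^T J_x over [xl, xu]
    integrate(function(x) sapply(x, function(xx) {
      J <- grad(function(th) mu(xx, th), theta)  # J_x(theta)
      J[i] * J[j]
    }), xl, xu)$value
  Z <- outer(seq_len(p), seq_len(p), Vectorize(Zij))
  sqrt(det(Z))  # unnormalized functional uniform density at theta
}
fuKernel(0.5, function(x, th) exp(-th * x), 0, 10)  # exponential example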

We now revisit the exponential regression example from the introduction. In this case one obtains $J_x(\theta) = -x\exp(-\theta x)$; calculating $\int_0^{10} J_x(\theta)^2\,dx$ and applying the square root, one obtains $p(\theta) \propto \exp(-10\theta)\sqrt{\frac{\exp(20\theta) - 200\theta^2 - 20\theta - 1}{\theta^3}}$; normalizing this leads to the prior displayed in Figure 3 (i). On the $\theta$ scale the shape-based functional uniform density hence leads to a rather non-uniform distribution. In Figure 3 (ii) one can observe that the probability mass is distributed uniformly over the different shapes, as desired.
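
As a quick sanity check (our own sketch), the closed-form kernel can be compared against direct numerical integration; since the prior is only defined up to proportionality, the two should agree up to a constant factor:

pKernel <- function(th) exp(-10 * th) * sqrt((exp(20 * th) - 200 * th^2 - 20 * th - 1) / th^3)
pNum    <- function(th) sqrt(integrate(function(x) x^2 * exp(-2 * th * x), 0, 10)$value)
th <- c(0.2, 1, 3)
sapply(th, pNum) / pKernel(th)  # constant ratio (here 1/2), i.e. the same kernel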


Figure 3: (i) Display of the functional uniform distribution on the $\theta$ scale; (ii) display of the regression function $\exp(-\theta x)$ for $\theta = 0$, $\theta = 5$ and the $\theta$ corresponding to the $i/10$ quantiles, $i = 1, \ldots, 9$, of the functional uniform distribution.

An advantage of the functional uniform prior over the uniform prior is that it is independent of the choice of parameterization and not particularly sensitive to the choice of the bounds, provided all major shapes of the underlying function are covered. In Figure 3 (i) one can see that the density is already rather small at $\theta = 5$, as most of the underlying functional shapes are already covered. In fact, in this example one can extend the functional uniform distribution from the compact interval $[0, 5]$ to a proper distribution on $[0, \infty)$.
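
The parametrization invariance can also be checked numerically. In the following sketch (our own illustration) the kernel is computed directly for the reparametrization $\gamma = \log\theta$ of the exponential model and compared with the change-of-variables result, i.e. the $\theta$ kernel times the Jacobian $e^\gamma$:

kTheta <- function(th) sqrt(integrate(function(x) x^2 * exp(-2 * th * x), 0, 10)$value)
kGamma <- function(g)  # d/dgamma exp(-exp(g) * x) = -x * exp(g) * exp(-exp(g) * x)
  sqrt(integrate(function(x) (x * exp(g) * exp(-exp(g) * x))^2, 0, 10)$value)
g <- log(0.7)
c(direct = kGamma(g), via_jacobian = exp(g) * kTheta(exp(g)))  # identical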

Although the choice of the $L_2$ metric for $d$ seems reasonable in a variety of situations, other choices are possible. One could, for example, use a weighted version of the $L_2$ distance when interest is in particular regions of the design space $\mathcal{X}$. In fact, Jeffreys prior can be identified as a special case when the assumed residual model is a homoscedastic normal distribution; in this situation the empirical measure on the design points is used as the weighting measure. The Jeffreys prior has also been mentioned by Bates and Watts (1988, p. 217) as a prior that is uniform on the response surfaces, but the possibility of alternative weighting measures has not been considered.

One potential obstacle to the use of the proposed functional uniform prior is the fact that it can be computationally challenging to calculate. In some situations it might be possible to calculate $\boldsymbol{Z}^*(\boldsymbol{\theta})$ analytically; in others one might need numerical integration to approximate the underlying integrals. However, the prior only needs to be calculated once, as it is independent of the observed data (it only depends on the design region $\mathcal{X}$ and on potential parameter bounds), and can then be approximated, for example, in terms of more commonly used distributions. This approximation can then be reused in different modelling situations.

3 Applications

In this section, we will evaluate the proposed functional uniform priors for nonlinear regression. One application of nonlinear regression is in the context of pharmaceutical dose-finding trials. A challenge in these trials is that the variability in the response is usually large and the number of doses used fairly small, so that the underlying inference problem is challenging, despite an often seemingly large sample size. The priors will first be tested on a real example; then the frequentist operating characteristics of the proposed functional uniform priors are assessed more formally in a simulation study for a binary endpoint. In the last example we will use the functional uniform distribution for the calculation of a Bayesian optimal design in the exponential regression example.

3.1 Irritable Bowel Syndrome Dose-Response Study

Here the IBScovars data set from the DoseFinding package will be used (Bornkamp et al., 2010). The data were part of a dose-ranging trial on a compound for the treatment of irritable bowel syndrome with four active doses 1, 2, 3, 4 equally distributed in the dose range $[0, 4]$ and placebo. The primary endpoint was a baseline-adjusted abdominal pain score with larger values corresponding to a better treatment effect. In total 369 patients completed the study, with nearly balanced allocation across the doses. Assume a normal distribution is used to model the residual error and that the hyperbolic Emax model $\mu(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x/(\theta_2 + x)$ was chosen to describe the dose-response relationship. The parameters $\theta_0$ and $\theta_1$ determine the placebo mean and the asymptotic maximum effect, while the parameter $\theta_2$ is the dose that gives 50 percent of the asymptotic maximum effect and hence determines the steepness of the curve. In clinical practice vague prior information typically exists for $\theta_0$ and $\theta_1$, but for illustration here we use improper constant priors for these two parameters and a prior proportional to $\sigma^{-2}$ for $\sigma$. For the nonlinear parameter $\theta_2$ we will use a uniform prior and the functional uniform prior distribution. When using a uniform distribution for $\theta_2$ it is necessary to assume bounds, as otherwise an improper posterior distribution may arise. We will use the bounds $[0.004, 6]$ here; this selection is based on the fact that practically all of the shapes of the underlying model are covered, taking into account that the dose range is $[0, 4]$. For comparability the same bounds were used for the functional uniform prior, although one can extend it to an integrable density on $[0, \infty)$. The functional uniform prior will be based on the function space defined by $x/(\theta_2 + x)$. Performing the calculations described in Section 2.2 one obtains $J_x^2 = x^2/(x + \theta_2)^4$; calculating the integral $\int_0^4 J_x^2(\theta_2)\,dx$ and applying the square root leads to $p(\theta_2) \propto 1/\sqrt{\theta_2^4 + 12\theta_2^3 + 48\theta_2^2 + 64\theta_2}\;1_{[0.004, 6]}(\theta_2)$. Similar to the exponential regression example in the introduction, a uniform distribution on the $\theta_2$ space induces an informative distribution in the space of functional shapes. Shapes corresponding to larger values of $\theta_2$ (say $> 3$) are almost linear, while only very small values of $\theta_2$ lead to more pronounced concave shapes. A uniform prior on $\theta_2$ hence induces a prior that favors linear shapes over steeply increasing model shapes.
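
Normalizing this density and drawing from it is straightforward; the following R sketch (our own illustration, using inverse-CDF sampling on a grid) could be used to generate prior draws for $\theta_2$:

pk   <- function(t) 1 / sqrt(t^4 + 12 * t^3 + 48 * t^2 + 64 * t)
nc   <- integrate(pk, 0.004, 6)$value                    # normalizing constant
grid <- seq(0.004, 6, length.out = 5000)
cdf  <- cumsum(pk(grid)) * (grid[2] - grid[1]) / nc      # Riemann-sum CDF
theta2 <- approxfun(cdf, grid, rule = 2)(runif(1000))    # 1000 approximate draws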


Figure 4: Posterior of the dose-response curve under a uniform and a functional uniform prior for $\theta_2$.

To implement the model we used sampling importance resampling based on a proposal distribution generated by the iterated Laplace approximation (see Bornkamp, 2011). In Figure 4 one can observe the posterior uncertainty intervals under the two prior distributions. As is visible, the bias towards linear shapes under the uniform distribution for $\theta_2$ persists in the posterior distribution. This happens despite the rather large sample size, and despite the fact that the responses at the doses 0, 4 and particularly at dose 1 are not very well fitted by a linear shape. The posterior hence seems to be rather sensitive to the uniform prior distribution. The posterior based on the shape-based functional uniform prior, in contrast, fits the data better at all doses and seems to provide a more realistic measure of uncertainty for the dose-response curve, particularly for $x \in (0, 1)$.

3.2 Simulations

One might expect that the functional uniform prior distribution works acceptably no matter which functional shape is the true one. To investigate this in more detail and to compare this prior to other prior distributions in terms of their frequentist performance, simulation studies have been conducted. Here we report results from simulations in the context of binary nonlinear regression.

For simulation the power model $\mu(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x^{\theta_2}$ will be used to model the response probability depending on $x$. The parameters $\theta_0$ and $\theta_1$ are hence subject to $\theta_0 + \theta_1 \leq 1$ and $\theta_0, \theta_1 \geq 0$, as a probability is modelled. Note that only $\theta_2$ enters the model function nonlinearly.

The doses 0, 0.05, 0.2, 0.6, 1 are to be used with equal allocations of 20 patients per dose. We use four scenarios in this case: in the first three the power model is used with $\theta_0 = 0.2$ and $\theta_1 = 0.6$, while $\theta_2$ is equal to 0.4 (Power 1), 1 (Linear) and 4 (Power 2). In addition we provide one scenario where an Emax model $0.2 + 0.6x/(x + 0.05)$ is the truth. The Emax scenario is added to investigate the behaviour under misspecification of the model. Each simulation scenario will be repeated 1000 times.

We will compare the functional uniform prior distribution to the uniform distribution on the parameters and to the Jeffreys prior distribution. For the uniform prior approach, uniform prior distributions were assumed for all parameters, and the nonlinear parameter $\theta_2$ was assumed to lie within $[0.05, 20]$ to ensure integrability. The same bounds are used for the two other approaches for comparability. For the functional uniform prior approach, uniform priors are used for $\theta_0$ and $\theta_1$, while for $\theta_2$ the functional uniform prior will be used on the function space defined by $x^{\theta_2}$. The prior can be calculated to be $p(\theta_2) \propto 1/\sqrt{2\theta_2^3/3 + \theta_2^2 + \theta_2/2 + 1/12}$. For the Jeffreys prior approach we used a prior proportional to $\sqrt{\det(I(\boldsymbol{\theta}))}$ within the imposed parameter bounds. For the analysis we used MCMC based on the HITRO algorithm, an MCMC sampler that combines the hit-and-run algorithm with the ratio-of-uniforms transformation. It does not need tuning and is hence well suited for a simulation study. The sampler is implemented in the Runuran package (Leydold and Hörmann, 2010); computations were performed with R (R Development Core Team, 2011). 10000 MCMC samples are used from the corresponding posterior distributions in each case, with a burn-in phase of 1000 and a thinning of 2.
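
A quick numerical check (our own sketch) confirms that the functional uniform kernel for $x^{\theta_2}$ on the design space $[0, 1]$ is proportional to $(2\theta_2 + 1)^{-3/2}$, matching the expression above:

pNumPow <- function(t) sqrt(integrate(function(x) (x^t * log(x))^2, 0, 1)$value)
t <- c(0.1, 1, 5)
sapply(t, pNumPow) * (2 * t + 1)^1.5  # constant (sqrt(2)) up to rounding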

Prior        Model    MAE1   MAE2   CP     ILE
Uniform      Linear   0.082  0.062  0.819  0.259
Uniform      Power 1  0.079  0.056  0.816  0.255
Uniform      Power 2  0.066  0.067  0.881  0.220
Uniform      Emax     0.073  0.056  0.780  0.226
Jeffreys     Linear   0.058  0.065  0.900  0.233
Jeffreys     Power 1  0.054  0.056  0.901  0.220
Jeffreys     Power 2  0.056  0.073  0.895  0.227
Jeffreys     Emax     0.056  0.055  0.845  0.196
Func. Unif.  Linear   0.060  0.060  0.892  0.240
Func. Unif.  Power 1  0.057  0.053  0.893  0.226
Func. Unif.  Power 2  0.057  0.070  0.912  0.240
Func. Unif.  Emax     0.059  0.054  0.834  0.203
Table 1: Estimation of dose-response; MAE1 and MAE2 correspond to the dose-response estimation error (for the posterior median and mode), CP denotes the average coverage probability of pointwise 0.9 credibility intervals and ILE denotes the average credibility interval lengths.

In Table 1 one can observe the estimation results in terms of the mean absolute estimation error for the dose-response function, $\mathrm{MAE} = \frac{1}{9}\sum_{i=0}^{8}|\mu(i/8) - \hat{\mu}(i/8)|$, where $\mu(.)$ is the underlying true function and $\hat{\mu}(.)$ is either the pointwise posterior median (corresponding to MAE1) or the prediction corresponding to the posterior mode for the parameters (MAE2); the posterior mode under the uniform prior is equal to the maximum likelihood estimate. The values displayed in Table 1 are the average $\mathrm{MAE}$ over 1000 repetitions. In addition, for each simulation the 0.9 credibility intervals at the dose levels $0, 1/8, 2/8, \ldots, 1$ have been calculated. The number given in the table is $CP = \frac{1}{9}\sum_{i=0}^{8}\hat{P}_{i/8}$, where $\hat{P}_d$ is the average coverage probability of the 0.9 credibility interval at dose $d$ over the 1000 simulation runs. The average length of the credibility intervals has been calculated as $ILE = \frac{1}{9}\sum_{i=0}^{8}\hat{L}_{i/8}$, where $\hat{L}_d$ is the average length of the 0.9 credibility interval at dose $d$ over the 1000 simulation runs. For estimation of the dose-response, Jeffreys prior and the functional uniform prior improve upon the uniform prior distribution, while the Jeffreys prior and the functional uniform prior are close, with slight advantages for the Jeffreys prior. In terms of the credibility intervals, the functional uniform and Jeffreys priors roughly keep their nominal level for the linear and the power model cases, while the uniform prior does not. None of the priors achieves the nominal level for the Emax model, which is probably due to the fact that the Emax model is too different from the power model. Interestingly, the credibility intervals of the uniform prior are larger than those of the other two priors, yet lead to a smaller coverage probability.
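
For concreteness, the per-simulation summaries entering these averages could be computed from a matrix of posterior draws of the mean curve as in the following R sketch (summarize_fit is a hypothetical helper, not code from the original study; its output would then be averaged over the 1000 repetitions):

summarize_fit <- function(muDraws, muTrue) {
  ## muDraws: (number of draws) x 9 matrix of mu(i/8, theta), i = 0, ..., 8
  ## muTrue:  true mean mu(i/8) at the same 9 dose levels
  med <- apply(muDraws, 2, median)
  ci  <- apply(muDraws, 2, quantile, probs = c(0.05, 0.95))  # 0.9 intervals
  c(MAE1 = mean(abs(med - muTrue)),
    CP   = mean(muTrue >= ci[1, ] & muTrue <= ci[2, ]),
    ILE  = mean(ci[2, ] - ci[1, ]))
}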

Table 2 provides the estimation results with respect to parameter estimation. The main message here is that all priors perform roughly equally well for estimation of the linear parameters $\theta_0$ and $\theta_1$. For the nonlinear parameter $\theta_2$, Jeffreys prior distribution and the functional uniform prior perform better than the uniform distribution.

In summary, the functional uniform prior performs roughly as well as the Jeffreys prior in these simulations. However, the functional uniform prior has the pragmatic and conceptual advantage that it does not depend on the observed covariates, and can thus be used, for example, for the calculation of a Bayesian optimal design, or in sequential situations.

                           Uniform Prior                  Functional Uniform Prior
Scenario     N      MAE1   MAE2   CP     ILE       MAE1   MAE2   CP     ILE
Sig. Emax 1  125    0.256  0.277  0.903  1.098     0.230  0.270  0.914  1.028
Sig. Emax 2  125    0.278  0.283  0.895  1.144     0.258  0.278  0.909  1.089
Sig. Emax 3  125    0.243  0.275  0.902  1.014     0.251  0.262  0.898  1.030
Linear       125    0.266  0.291  0.901  1.100     0.241  0.289  0.918  1.057
Quadratic    125    0.272  0.278  0.880  1.109     0.242  0.276  0.898  1.038
Sig. Emax 1  250    0.185  0.214  0.908  0.818     0.167  0.209  0.920  0.768
Sig. Emax 2  250    0.196  0.206  0.908  0.850     0.187  0.201  0.910  0.811
Sig. Emax 3  250    0.174  0.202  0.913  0.738     0.170  0.188  0.912  0.744
Linear       250    0.200  0.209  0.891  0.831     0.189  0.211  0.900  0.794
Quadratic    250    0.201  0.215  0.881  0.839     0.185  0.216  0.886  0.782
Table 2: Estimation of dose-response; MAE1 and MAE2 correspond to the estimation error at the doses $0, 1, \ldots, 8$ (for the posterior median and the posterior mode), CP denotes the average coverage probability of pointwise 0.9 credibility intervals and ILE denotes the average credibility interval lengths.

3.3 Bayesian optimal design for exponential regression

In this section we will use the prior distribution for the exponential regression model derived in Section 2.2 to calculate a Bayesian optimal design. When assuming a homoscedastic normal model, the Fisher information is $I(d, \theta) \propto \sum w_i x_i^2\exp(-2\theta x_i)$, where the design $d$ consists of the design points $x_i$ and their allocation weights $w_i$. Hence minimizing $-\log(I(d, \theta))$ will lead to the design with the most information. Unfortunately the expression depends on $\theta$, which is of course unknown before the experiment. One way of dealing with this uncertainty is Bayesian optimal design, where one optimizes the design criterion averaged with respect to a prior distribution: $-\int\log(I(d, \theta))p(\theta)\,d\theta$. Here we will use the uniform and the functional uniform prior distribution (see Figures 1 and 3), both on the interval $[0, 5]$, for the calculation of the optimal design. Restricting the design space to $x \in [0, 10]$ and performing the optimization with up to 5 design points, one ends up with the weights $\boldsymbol{w} = (0.956, 0.022, 0.022)$ on the design points $x = (0.38, 4.04, 10)$ for the uniform prior, while the functional uniform prior distribution leads to a design of the form $\boldsymbol{w} = (0.19, 0.3, 0.51)$ and $x = (0.54, 2.35, 10)$. The design corresponding to the functional uniform prior hence spreads its allocation weights more uniformly over the design range, whereas the uniform prior results in essentially one major design point.
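
A design of this type can be computed with generic optimization; the following R sketch is our own illustration (the prior is discretized on a grid, design points are mapped to $(0, 10)$ via a logistic transform and weights via a softmax) and should roughly recover the functional uniform prior design reported above, up to local-optimizer accuracy:

pk <- function(th) exp(-10 * th) * sqrt((exp(20 * th) - 200 * th^2 - 20 * th - 1) / th^3)
prNodes <- seq(0.005, 5, length.out = 500)   # grid discretizing the prior on [0, 5]
prW <- pk(prNodes); prW <- prW / sum(prW)
crit <- function(par) {                      # negative expected log-information
  x <- 10 * plogis(par[1:3])                 # design points in (0, 10)
  w <- exp(par[4:6]); w <- w / sum(w)        # weights on the simplex
  -sum(prW * sapply(prNodes, function(th) log(sum(w * x^2 * exp(-2 * th * x)))))
}
opt <- optim(c(qlogis(c(0.05, 0.25, 0.95)), 0, 0, 0), crit, control = list(maxit = 2000))
list(x = round(10 * plogis(opt$par[1:3]), 2),
     w = round(exp(opt$par[4:6]) / sum(exp(opt$par[4:6])), 2))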

One way of comparing the two calculated designs is to look at the efficiency $\mathrm{Eff}(d, \theta) = \exp(\log(I(d, \theta)) - \log(I(d_{opt}(\theta), \theta)))$ of the calculated designs with respect to the design $d_{opt}(\theta)$ that is locally optimal for the parameter value $\theta$, for a range of different shapes. In Figure 5 we plot the efficiency for the different shapes on the functional shape scale.
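
For the one-parameter exponential model the locally optimal design puts all its weight on a single point, the maximizer of $x^2\exp(-2\theta x)$ over $[0, 10]$, i.e. $\min(1/\theta, 10)$, so the efficiency is easy to evaluate; a sketch under this assumption:

eff <- function(x, w, th) {
  info <- sum(w * x^2 * exp(-2 * th * x))
  xo <- min(1 / th, 10)                 # locally optimal one-point design
  info / (xo^2 * exp(-2 * th * xo))
}
## efficiency of the functional uniform prior design at a few theta values
sapply(c(0.1, 0.5, 2), function(th) eff(c(0.54, 2.35, 10), c(0.19, 0.30, 0.51), th))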


Figure 5: Efficiency of the two designs for different shapes.

One can observe that the uniform prior design is only efficient for the sharply decreasing shapes with $\theta > 0.71$, but otherwise has very low efficiency. The functional uniform prior improves considerably over the uniform prior distribution for most of the functional shape space, and provides at least a reasonable efficiency for most shapes.

4 Conclusions

A main motivation for this work is the practical limitation that the classical Jeffreys prior cannot be used in nonlinear regression settings where the prior needs to be specified before data collection, for example when one wants to calculate a Bayesian optimal design or in adaptive dose-finding trials. For this purpose the functional uniform distribution has been introduced, which imposes a distribution on the parameters that is uniform in the space of functional shapes underlying the nonlinear regression function. This was achieved using a general framework for constructing uniform distributions based on earlier work by Dembski (1990) and Ghosal et al. (1997). We investigated the functional uniform prior for nonlinear regression in a real example, a simulation study and an optimal design problem, where it showed very satisfactory performance.

There is no reason to call the priors proposed in this article globally uninformative, because one needs to choose the space $M$ and in particular the metric $d$ in which to impose the uniform distribution. The priors derived from the theory in Section 2.1 might then be considered uninformative in the particular aspect that $(M, d)$ reflects. In the case of nonlinear regression we argue that the uniform distribution on the space of functional shapes is often, depending of course on the considered application, a reasonable assumption when particular prior information is lacking. However, this does not apply generally: the functional uniform prior might not be adequate, for example, when the considered nonlinear model is extremely flexible, containing virtually all continuous functions (as for neural network models). In this case it is often more adequate to concentrate most prior probability on a reasonable subset of the function space (e.g., smooth functions), rather than to build a uniform distribution on all potential shapes, including shapes that might be implausible a priori.

The theory outlined in Section 2 might be of interest for formulating functional uniform priors also for other types of models with a nonlinear aspect. In quite a few modelling situations one might be able to find a space $(M, d)$ in which imposing a uniform distribution is plausible, and then back-transform this distribution to the parameter scale. Ghosal et al. (1997) employ this idea when $(M, d)$ is a space of densities and define priors for nonparametric density estimation. Another application could be the estimation of covariance matrices: Dryden et al. (2009) discuss the use of more adequate non-Euclidean distance metrics for covariance matrices, which would in our framework define the metric space for imposing the uniform distribution. Paulo (2005) derives default priors for Gaussian process interpolation, which are rather time-consuming to evaluate. In this situation one might choose the space of covariance functions as $(M, d)$.

Appendix A Proof of Theorem 1

Ghosal et al. (1997) prove a closely related result, where the underlying metric is the Hellinger distance and $M$ the space of residual densities. We review their proof, adapt it to metrics of the form (1), and proceed in two parts. Part A summarizes the proof of Ghosal et al. (1997) for completeness, and part B provides additional lemmas needed in our situation.

Part A
The proof starts by covering $\boldsymbol{\Theta}$ with hypercubes and inner hypercubes placed inside these cubes. Let $A_1, \ldots, A_J$ be the intersections of $A$ with the hypercubes and $A'_1, \ldots, A'_J$ be the intersections of $A$ with the inner hypercubes. Now separate the hypercubes and inner hypercubes so that each inner hypercube is at least $\epsilon$ apart from any other in the $d(.,.)$ metric (this is possible when the results proved in Lemma 1 hold). By the sub-additivity of packing numbers one then has $\sum_j D(\epsilon, A'_j, d) \leq D(\epsilon, A, d) \leq \sum_j D(\epsilon, A_j, d)$ and $\sum_j D(\epsilon, \boldsymbol{\Theta}'_j, d) \leq D(\epsilon, \boldsymbol{\Theta}, d) \leq \sum_j D(\epsilon, \boldsymbol{\Theta}_j, d)$, where $\boldsymbol{\Theta}_j$ and $\boldsymbol{\Theta}'_j$ denote the intersections of $\boldsymbol{\Theta}$ with the hypercubes and inner hypercubes.

Now an upper and a lower bound for $D(\epsilon, A_j, d)$ are derived based on the local Euclidean approximation (1) to the metric $d$. For a Euclidean metric one can calculate the packing number explicitly, see Kolmogorov and Tihomirov (1961): up to proportionality, $D(\epsilon, A, \|.\|)$ is given by $vol(A)\epsilon^{-p}$; consequently, for a metric of the form $\sqrt{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta} - \boldsymbol{\theta}')}$, with $\boldsymbol{V}$ a fixed positive definite matrix, the packing number is up to proportionality $\sqrt{\det(\boldsymbol{V})}\,vol(A)\epsilon^{-p}$. Using the local Euclidean approximation (1) and Lemma 2 one can derive lower and upper bounds for $D(\epsilon, A_j, d)$ and $D(\epsilon, A'_j, d)$ in terms of $\sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}_j))}\,vol(A_j)\epsilon^{-p}$ and $\sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}_j))}\,vol(A'_j)\epsilon^{-p}$, and similarly for $D(\epsilon, \boldsymbol{\Theta}, d)$ and thus for $P_\epsilon(A) = D(\epsilon, A, d)/D(\epsilon, \boldsymbol{\Theta}, d)$. As the size of the hypercubes goes to zero the bounds become sharper (see Lemma 2) and the lower and upper bounds of $P_\epsilon(A)$ converge to $\frac{\int_A\sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}\,d\boldsymbol{\theta}}{\int_{\boldsymbol{\Theta}}\sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}\,d\boldsymbol{\theta}}$; see Ghosal et al. (1997) for the details of this argument.

Part B
Without loss of generality we set $c_1 = c_2 = 1$ in (1) for what follows.

Lemma 1

For symmetric, positive definite $\boldsymbol{V}(\boldsymbol{\theta})$ there exist $l^*, u^* > 0$ such that for $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \boldsymbol{\Theta}$

$$l^*\|\boldsymbol{\theta} - \boldsymbol{\theta}'\| \leq d(\boldsymbol{\theta}, \boldsymbol{\theta}') \leq u^*\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|.$$

Proof:

$$\frac{d(\boldsymbol{\theta}, \boldsymbol{\theta}')^2}{\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2} = \frac{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}')(\boldsymbol{\theta} - \boldsymbol{\theta}') + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^k)}{\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2}.$$

Now by an eigendecomposition, the compactness of $\boldsymbol{\Theta}$ and the continuity of $\boldsymbol{V}(\boldsymbol{\theta})$ one knows that there exist $l, u > 0$ so that

$$l(\boldsymbol{\theta} - \boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}') \leq (\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta})(\boldsymbol{\theta} - \boldsymbol{\theta}') \leq u(\boldsymbol{\theta} - \boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}').$$

In total we obtain the upper bound $u^{*2} = u + \kappa$, where $\kappa = \max(\max_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^{k-2}), 0)$, and a lower bound is obtained similarly. $\Box$

Lemma 2

For $\boldsymbol{\theta}, \boldsymbol{\theta}', \boldsymbol{\theta}^*$ lying in a hypercube $Q \subset \boldsymbol{\Theta}$ we obtain

$$k_1(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}') \leq d^2(\boldsymbol{\theta}, \boldsymbol{\theta}') \leq k_2(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}'),$$

where $k_1 \rightarrow c_0$ and $k_2 \rightarrow c_0$ for some $c_0 > 0$, when the side length of $Q$ converges to 0.

Proof:

$$\begin{aligned}\frac{d^2(\boldsymbol{\theta}, \boldsymbol{\theta}')}{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}')} &= \frac{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}')(\boldsymbol{\theta} - \boldsymbol{\theta}')}{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}')} + \frac{O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^k)}{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}')}\\ &= \frac{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}')(\boldsymbol{\theta} - \boldsymbol{\theta}')}{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}')} + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^{k-2})\\ &= 1 + \frac{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T(\boldsymbol{V}(\boldsymbol{\theta}') - \boldsymbol{V}(\boldsymbol{\theta}^*))(\boldsymbol{\theta} - \boldsymbol{\theta}')}{(\boldsymbol{\theta} - \boldsymbol{\theta}')^T\boldsymbol{V}(\boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}')} + O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^{k-2}).\end{aligned}$$

Now $\boldsymbol{V}(\boldsymbol{\theta})$ is continuous, so one can lower and upper bound the second summand on $Q$. It converges to zero, and hence the bounds converge towards each other, when the size of the hypercube shrinks (and with it $\boldsymbol{\theta}^* \rightarrow \boldsymbol{\theta}'$). Upper and lower bounding $O(\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^{k-2})$ implies the desired result. $\Box$


Appendix B Proof of Theorem 2

Consider the distance metric $d^*(\boldsymbol{\gamma}, \boldsymbol{\gamma}_0) = d(h(\boldsymbol{\gamma}), h(\boldsymbol{\gamma}_0))$. To show invariance of the proposed procedure, the uniform distribution derived from $d^*$ needs to be $p(\boldsymbol{\gamma}) \propto |\det(\boldsymbol{H}(\boldsymbol{\gamma}))|\sqrt{\det(\boldsymbol{V}(h(\boldsymbol{\gamma})))}$, which is the distribution derived from $p(\boldsymbol{\theta}) \propto \sqrt{\det(\boldsymbol{V}(\boldsymbol{\theta}))}$ using a change of variables.

A second-order Taylor expansion of $d^2(h(\boldsymbol{\gamma}), h(\boldsymbol{\gamma}_0))$ in $\boldsymbol{\gamma}_0$ leads to an approximation of the form $(\boldsymbol{\gamma} - \boldsymbol{\gamma}_0)'\boldsymbol{M}(\boldsymbol{\gamma}_0)(\boldsymbol{\gamma} - \boldsymbol{\gamma}_0)$, where the $(i, j)$ element of $\boldsymbol{M}(\boldsymbol{\gamma})$ is given by $\boldsymbol{M}(\boldsymbol{\gamma})_{(i,j)} = \sum_{l=1}^p\sum_{k=1}^p \frac{\partial^2 m(\boldsymbol{\theta})}{\partial\theta_k\partial\theta_l}\frac{\partial h_k(\boldsymbol{\gamma})}{\partial\gamma_i}\frac{\partial h_l(\boldsymbol{\gamma})}{\partial\gamma_j} + \sum_{k=1}^p \frac{\partial m(\boldsymbol{\theta})}{\partial\theta_k}\frac{\partial^2 h_k(\boldsymbol{\gamma})}{\partial\gamma_i\partial\gamma_j}$, with $m(\boldsymbol{\theta}) = d^2(\boldsymbol{\theta}, \boldsymbol{\theta}_0)$ and $\boldsymbol{\theta} = h(\boldsymbol{\gamma})$. When evaluating this expression at the expansion point, the second summand vanishes as the gradient is zero. Hence one obtains $\boldsymbol{M}(\boldsymbol{\gamma}) = \boldsymbol{H}(\boldsymbol{\gamma})^T\boldsymbol{V}(h(\boldsymbol{\gamma}))\boldsymbol{H}(\boldsymbol{\gamma})$, which results in the density $p(\boldsymbol{\gamma}) \propto \sqrt{\det(\boldsymbol{H}(\boldsymbol{\gamma})^T\boldsymbol{V}(h(\boldsymbol{\gamma}))\boldsymbol{H}(\boldsymbol{\gamma}))} = |\det(\boldsymbol{H}(\boldsymbol{\gamma}))|\sqrt{\det(\boldsymbol{V}(h(\boldsymbol{\gamma})))}$. $\Box$

References

  • Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation 9, 349–369.
  • Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. John Wiley and Sons, New York.
  • Berger, J. O., Bernardo, J. M., and Sun, D. (2009). The formal definition of reference priors. Annals of Statistics 37, 905–938.
  • Bornkamp, B. (2011). Approximating probability densities by iterated Laplace approximations. Journal of Computational and Graphical Statistics 00, 00–00.
  • Bornkamp, B. and Ickstadt, K. (2009). A note on B-splines for semiparametric elicitation. The American Statistician 63, 373–377.
  • Bornkamp, B., Pinheiro, J., and Bretz, F. (2010). DoseFinding: Planning and Analyzing Dose Finding Experiments. R package version 0.4-1.
  • Daniels, M. J. (1999). A prior for the variance in hierarchical models. Canadian Journal of Statistics 27, 567–578.
  • Dembski, W. A. (1990). Uniform probability. Journal of Theoretical Probability 3, 611–626.
  • Dragalin, V., Bornkamp, B., Bretz, F., Miller, F., Padmanabhan, S. K., Patel, N., Perevozskaya, I., Pinheiro, J., and Smith, J. R. (2010). A simulation study to compare new adaptive dose-ranging designs. Statistics in Biopharmaceutical Research 2, 487–512.
  • Dryden, I. L., Koloydenko, A., and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices with applications to diffusion tensor imaging. Annals of Applied Statistics 3, 1102–1123.
  • Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. (1997). Non-informative priors via sieves and packing numbers. In Panchapakesan, S. and Balakrishnan, N., editors, Advances in Statistical Decision Theory and Applications, pages 119–132. Birkhäuser, Boston.
  • Ghosh, J. K., Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer, New York.
  • Jeffreys, H. (1961). Theory of Probability. Oxford University Press, Oxford.
  • Jones, D. S., Plank, M. J., and Sleeman, B. D. (2010). Differential Equations and Mathematical Biology. Chapman and Hall, Boca Raton.
  • Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association 91, 1343–1370.
  • Kolmogorov, A. N. and Tihomirov, V. M. (1961). $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces. American Mathematical Society Translations, Ser. 2, 17, 277–364.
  • Leydold, J. and Hörmann, W. (2010). Runuran: R Interface to the UNU.RAN Random Variate Generators. R package version 0.15.0.
  • Lindsey, J. K. (2001). Nonlinear Models for Medical Statistics. Oxford University Press, Oxford.
  • Müller, P., Berry, D. A., Grieve, A. P., and Krams, M. (2006). A Bayesian decision-theoretic dose-finding trial. Decision Analysis 3, 197–207.
  • Neuenschwander, B., Capkun-Niggli, G., Branson, M., and Spiegelhalter, D. J. (2010). Summarizing historical information on controls in clinical trials. Clinical Trials 7, 5–18.
  • O'Hagan, A., Buck, C., Daneshkhah, A., Eiser, R., Garthwaite, P., Jenkinson, D., Oakley, J., and Rakow, T. (2006). Uncertain Judgements: Eliciting Expert Probabilities. John Wiley and Sons.
  • O'Hagan, A. and Forster, J. (2004). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, London, 2nd edition.
  • Paulo, R. (2005). Default priors for Gaussian processes. Annals of Statistics 33, 556–582.
  • Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision 25, 127–154.
  • R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.