
Epistemic confidence, the Dutch Book and relevant subsets

Yudi Pawitan, Hangbin Lee and Youngjo Lee
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet, Sweden
and
Department of Statistics
Seoul National University, South Korea
Abstract

We use a logical device called the Dutch Book to establish epistemic confidence, defined as the sense of confidence in an observed confidence interval. This epistemic property is unavailable – or even denied – in orthodox frequentist inference. In financial markets, including the betting market, the Dutch Book is also known as arbitrage or risk-free profitable transaction. A numerical confidence is deemed epistemic if its use as a betting price is protected from the Dutch Book by an external agent. Theoretically, to construct the Dutch Book, the agent must exploit unused information available in any relevant subset. Pawitan and Lee (2021) showed that confidence is an extended likelihood, and the likelihood principle states that the likelihood contains all the information in the data, hence leaving no relevant subset. Intuitively, this implies that confidence associated with the full likelihood is protected from the Dutch Book, and hence is epistemic. Our aim is to provide the theoretical support for this intuitive notion.

1 Introduction

Given data $Y=y$ – of arbitrary size or complexity – generated from a model $p_{\theta}(y)$ indexed by a scalar parameter of interest $\theta$, a confidence interval $\mbox{CI}(y)$ is computed with coverage probability

P_{\theta}(\theta\in\mbox{CI}(Y))=\gamma.

We are interested in the epistemic confidence, defined as the sense of confidence in the observed $\mbox{CI}(y)$. (For simplicity, we shall often drop the explicit dependence on $y$ from the CI.) Arguably, this is what we want from a CI, but the orthodox frequentist view is emphatic that the probability $\gamma$ does not apply to the observed interval $\mbox{CI}(y)$ but to the procedure. There are well-known examples justifying this position; see Example 1 below. In the confidence interval theory, the coverage probability is called the confidence level. So, in the frequentist theory, ‘confidence’ has actually no separate meaning from probability; in particular, it has no epistemic property. Schweder and Hjort (2016) and Schweder (2018) have been strong proponents of interpreting confidence as ‘epistemic probability.’ However, their view is not commonly accepted. Traditionally, only the Bayesians have no problem in stating that their subjective probability is epistemic. How do they achieve that? Is there a way to make a non-Bayesian confidence epistemic?

Frequentists interpret probability as either a long-term frequency or a propensity of the generating mechanism, such as a coin toss or a confidence interval procedure. So, for them, unique events, such as the next toss or the true status of an observed CI, do not have a probability. On the other hand, Bayesians can attach their subjective probability to such unique events. This interpretation is made possible using a logical device called the Dutch Book. As classically proposed by Frank Ramsey (1926) and Bruno de Finetti (1931), one’s subjective probability of an event $E$ is defined as the personal betting price one puts on the event. Though subjective, the price is not arbitrary, but it follows a normative rational consideration; it is a price such that no external agent can construct a Dutch Book against them, i.e., make a risk-free profit. In other words, it is irrational to make a bet that is guaranteed to lose. The Dutch Book is also known as arbitrage or free lunch. In the classical Dutch Book argument, the bet is made between two individuals.

Likewise, here we define confidence to be epistemic if it is protected from the Dutch Book, but crucially we assume that there is a betting market of a crowd of independent and intelligent players. In this market, bets are like a commodity with supply and demand from among the players. Assuming a perfect market condition – for instance, full competition, perfect information and no transaction cost – in accordance with the Arrow-Debreu theorem (Arrow and Debreu, 1954), there is an equilibrium price at which there is balance between supply and demand. Intuitively, if you are a seller and set your price too low, many would want to buy from you, thereby creating demand and increasing the price. Whereas if you set your price too high, nobody would want to buy from you, thus reducing demand and pressuring the price down. For the betting market in particular, the fundamental theorem of asset pricing (Ross, 1976) states that, assuming an objective probability model, there is no arbitrage if the price is determined by the objective probability. ‘Perfect information’ here means all players have access to the generic data $y$ and the sampling model $p_{\theta}(y)$. (If there is no objective probability model, the market, as evidenced by actual betting markets, can still have an agreed interpersonal price at any given time, though not a theoretically determined price.)

To illustrate the role of the betting market in the Dutch Book argument, suppose you and I are betting on the 2024 US presidential election. Suppose the betting market is currently giving the price of 0.25 for Donald Trump to win (this means you pay $0.25 to get $1 back if Trump wins, including your original $0.25). Suppose, for whatever reasons, you believe Trump will lose and hence set the probability of him winning at 0.1. Then I would construct a Dutch Book by ‘buying from you’ at $0.1 and immediately ‘selling it in the market’ at $0.25, thus making a risk-free profit. ‘Buying from you’ means treating you like a bookie: paying you $0.1 and getting $1 back if Trump wins. While ‘selling in the market’ for me means betting against the event, so I behave like a bookie: people pay me $0.25 so that they can get $1 back if Trump wins, but I keep the $0.25 if Trump loses. So, overall I would make $0.15 risk-free, i.e., regardless of whether Trump wins or loses. Note that this is not just a thought experiment – you can do all this buying and selling of bets in the online betting market.
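
To make the arithmetic concrete, the following minimal sketch (in Python, using the hypothetical prices above) checks that the combined position – buying the bet from you and selling the same bet in the market – pays the same amount whether Trump wins or loses:

    # Dutch Book sketch: buy low from you, sell high in the market.
    # Prices are the hypothetical ones from the text; payoffs are per $1 stake.
    your_price = 0.10      # I pay you $0.10 for a ticket paying $1 if Trump wins
    market_price = 0.25    # the market pays me $0.25 for an identical ticket I issue

    for trump_wins in (True, False):
        payoff_from_you = 1.0 if trump_wins else 0.0    # your ticket pays me
        payoff_to_market = -1.0 if trump_wins else 0.0  # I pay out on the market's ticket
        profit = -your_price + market_price + payoff_from_you + payoff_to_market
        print(f"Trump wins={trump_wins}: profit = {profit:+.2f}")
    # Both branches give +0.15: a risk-free profit, regardless of the outcome.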

It is worth emphasizing the difference between our setup and the classical Dutch Book argument used to establish the subjective Bayesian probability. In the latter, because it does not presume the betting market, bets are made only between you and me. To avoid the Dutch Book, you have to make your bets internally consistent by following (additive) probability laws. However, even if your bets are internally consistent (or coherent), if your prices do not match the market prices, I can make a risk-free profit by playing between you and the market; see Example 1. So, the presence of the market creates a stronger requirement for epistemic probability. We shall avoid the terms ‘subjective’ and ‘objective’; one might consider ‘epistemic’ to be subjective since it refers to personal decision-making about a unique event, but the market consideration makes it impersonal.

The present issue is when the confidence, as measured by the coverage probability, applies to the observed interval. One way to judge this is whether you are willing to bet on the true status of the CI using the confidence level as your personal price. Normatively, this should be the case if you know there is no better price. Intuitively, this is when you’re sure that you have used all the available information in the data, so nobody can exploit you, i.e., construct a Dutch Book against you. Theoretically, to construct the Dutch Book, an external agent must exploit unused information in the form of a relevant subset, conditioning on which he can obtain a different coverage probability.

Pawitan and Lee (2021) showed that the confidence is an extended likelihood (Lee et al., 2017). The classical likelihood principle (Birnbaum, 1962) and its extended version (Bjørnstad, 1996) state that the likelihood contains all the information in the data. Intuitively, this implies that the likelihood leaves no relevant subset, and is thus protected from the Dutch Book. In other words, we can attach the degree of confidence to the observed CI, i.e., confidence is epistemic, provided it is associated with the full likelihood. Our aim is to establish the theoretical justification for this intuitive notion.

To summarize briefly and highlight the plan of the paper, we describe three key concepts: relevant subset, confidence and ancillary statistic. We prove the main theorem that there are no relevant subsets if confidence is associated with the full likelihood. This condition is satisfied if the confidence is based on a sufficient statistic. When there is no sufficient statistic, but there is a maximal ancillary statistic, then this ancillary defines relevant subsets and there are no further relevant subsets.

2 Main theory

2.1 Relevant subsets

The idea of relevant subset appeared in Fisher’s writings on the nature of probability (Fisher, 1958). He considered probability meaningful provided there is no relevant subset. However, he treated the condition as an axiom – appealing to our intuition as to what ‘meaningful’ means – and did not establish what conditions are needed to guarantee no relevant subset. To avoid unnecessary philosophical discussions, we limit the idea of relevant subsets to confidence interval procedures, not to the probability concept in general.

Intuitively, we could use the coverage probability $\gamma$ as a betting price if there is no better price given the data at hand. So the question is, are there any features of the data that can be used to improve the price? Mathematically, these ‘features’ are some observed statistics that can be used to help predict the true coverage status. Given an arbitrary statistic $S(y)$, the conditional coverage probability $P_{\theta}(\theta\in\mbox{CI}|S(y))$ will in general be biased, i.e., different from the marginal coverage. However, the bias as a function of the unknown $\theta$ is generally not going to be consistently in one direction. For example, trivially, if we use the full data $S(y)=y$ itself, i.e., fixing the data as observed, then the conditional coverage is either one or zero depending on the true status of the CI, hence completely non-informative. In terms of betting, this means we cannot exploit an arbitrary feature of the data as a way to construct a Dutch Book against someone who sets the price at $\gamma$. The betting motivation also appeared in Buehler (1959) and Robinson (1979), though they only assumed two people betting against each other repeatedly, not the existence of a betting market. As we shall discuss in Example 1 and after Theorem 1, this has a significant impact on the interpretation of epistemic confidence.

A statistic $R(y)$ is defined to be relevant (cf. Buehler, 1959) if the conditional coverage is non-trivially and consistently biased in one direction. That is, for a positive bias, there is $\epsilon>0$ free of $\theta$, such that

P_{\theta}(\theta\in\mbox{CI}(Y)|R(y))\geq\gamma+\epsilon\quad\text{for all }\theta. (1)

Now, potentially the feature $R(y)$ can be used to construct a Dutch Book: Suppose you and I are betting, and I notice that the event $[R(y)]$ occurs. If you set the price at $\gamma$, then I would buy the bet from you and then sell it in the betting market at $\gamma+\epsilon$. So I make a risk-free profit of $\epsilon$. (We have assumed that the market contains intelligent players, so they would also have noticed the relevant statistic and set the price accordingly.) Similarly, for the negative bias, the relevant $R(y)$ has the property

P_{\theta}(\theta\in\mbox{CI}(Y)|R(y))\leq\gamma-\epsilon\quad\text{for all }\theta. (2)

Technically, $R(y)$ induces subsets of the sample space, known as the ‘relevant subset’; for convenience, we use the terms ‘relevant statistic’ and ‘relevant subset’ interchangeably. So, if there is a relevant subset, the confidence level $\gamma$ is not epistemic. Conversely, if there are no relevant subsets, the betting price determined by the confidence level is protected from the Dutch Book. So, mathematically, we establish epistemic confidence by showing that it corresponds to a coverage probability that has no relevant subsets.

Example 1. Let $y\equiv(y_{1},y_{2})$ be an iid sample from a uniform distribution on $\{\theta-1,\theta,\theta+1\}$, where the parameter $\theta$ is an integer. Let $y_{(1)}$ and $y_{(2)}$ be the minimum and maximum values of $y_{1}$ and $y_{2}$. We can show that the confidence interval $\mbox{CI}(y)\equiv[y_{(1)},y_{(2)}]$ has a coverage probability

P_{\theta}(\theta\in\mbox{CI})=7/9=0.78.

For example, on observing $y_{(1)}=3$ and $y_{(2)}=5$, the interval $[3,5]$ is formally a 78% CI for $\theta$. But, if we ponder a bit, in this case we can actually be sure that the true $\theta=4$. So, the probability of 7/9 is clearly a wrong price for this interval. This is a typical example justifying the frequentist objection to attaching the coverage probability as a sense of confidence in an observed CI.

Here the range $R\equiv R(y)\equiv y_{(2)}-y_{(1)}$ is relevant. If $R=2$ we know for sure that $\theta$ is equal to the midpoint of the interval, so the CI will always be correct. But if $R=0$, the CI is equal to the point $y_{1}$, and it falls with equal probability at the integers $\{\theta-1,\theta,\theta+1\}$. So, for all $\theta$, we have

P_{\theta}(\theta\in\mbox{CI}|R=2) = 1 > 7/9
P_{\theta}(\theta\in\mbox{CI}|R=1) = 1 > 7/9
P_{\theta}(\theta\in\mbox{CI}|R=0) = 1/3 < 7/9.

In the betting market, the range information will be used by the intelligent players to settle prices at these conditional probabilities. We can be sure, for example, that if $y_{1}=3$ and $y_{2}=5$, the intelligent players will not use 7/9 as the price and will instead use 1.00. So, the information can be used to construct a Dutch Book against anyone who ignores $R$ and unwittingly uses the unconditional coverage. How do we know that there is a relevant subset in this case? Moreover, given $R$, how do we know if there is no further relevant subset?
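
As a numerical check of these values, here is a minimal simulation sketch (Python with numpy is assumed; the chosen $\theta$ and seed are arbitrary) that reproduces the marginal coverage 7/9 and the conditional coverages given the range $R$:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 7                                   # any integer; the result does not depend on it
    y = rng.choice([theta - 1, theta, theta + 1], size=(100_000, 2))
    lo, hi = y.min(axis=1), y.max(axis=1)
    covered = (lo <= theta) & (theta <= hi)     # does CI = [y_(1), y_(2)] cover theta?
    R = hi - lo

    print("marginal coverage:", covered.mean())                   # about 7/9 = 0.778
    for r in (0, 1, 2):
        print(f"coverage given R={r}:", covered[R == r].mean())   # about 1/3, 1, 1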

To contrast with the classical Ramsey-de Finetti Dutch Book argument, suppose $y_{1}=y_{2}=3$. If for whatever subjective reasons, you set the price 7/9 for $[\theta\in\mbox{CI}]$, you are being internally consistent as long as you set the price 2/9 for $[\theta\not\in\mbox{CI}]$, since the two numbers constitute a valid probability measure. Internal consistency means that I cannot make a risk-free profit from you based on this single realization of $y$. Even if I know based on the conditional coverage that 1/3 is a better price, I cannot take any advantage of you because there is no betting market. So 7/9 is a valid subjective probability.

Now consider Buehler-Robinson’s setup, again assuming no betting market and supposing $y_{1}=y_{2}=3$. They would say the marginal price 7/9 is a bad idea, because there is a relevant subset giving a conditional probability 1/3. In a series of independent repeated bets, if you set the price 7/9 whenever $y_{1}=y_{2}=3$, I will be happy to ‘sell’ you the bet and be guaranteed to win in the long term. This is the usual frequentist interpretation; in any single bet I am not guaranteed a risk-free profit. As previously described, the presence of the betting market allows me to make a risk-free profit from a single realization of $y$. So, the threat of the market-based Dutch Book is more potent. The exact technical difference between Buehler-Robinson’s setup and ours will be discussed below after Theorem 1, where the former allows one to choose an arbitrary prior distribution. \Box

2.2 Confidence distribution

It turns out that establishing a no-relevant-subset condition relies on the concept of a confidence distribution. Let $t\equiv T(y)$ be a statistic for $\theta$, and define the right-side P-value function

C_{m}(\theta;t)\equiv P_{\theta}(T\geq t). (3)

Assuming that, for each $t$, it behaves formally like a proper cumulative distribution function, $C_{m}(\theta;t)$ is called the confidence distribution of $\theta$. The subscript $m$ is used to indicate that it is a ‘marginal’ confidence, as it depends on the marginal distribution of $T$. For continuous $T$, at the true parameter, the random variable $C_{m}(\theta;T)$ is standard uniform. For continuous $\theta$, the corresponding confidence density is

c_{m}(\theta)\equiv c_{m}(\theta;t)\equiv\partial C_{m}(\theta;t)/\partial\theta. (4)

The functions $C_{m}(\theta;t)$ and $c_{m}(\theta)$ across $\theta$ are realized statistics, which depend on both the data and the model, but not on the true unknown parameter $\theta_{0}$. We can view the confidence distribution simply as the collection of P-values or CIs. Suggestively, and with a slight abuse of notation, we define

C_{m}(\theta\in\mbox{CI})\equiv\int_{\mbox{CI}}c_{m}(\theta)\,d\theta (5)

to convey the ‘confidence of $\theta$ belonging in the CI’.

We assume a regularity condition, called R1 below, that for any $\alpha\in(0,1)$, the quantile function $q_{\alpha}(\theta)$ of $T$ is a strictly increasing function of $\theta$. Then the frequentist procedure based on $T$ gives a $\gamma$-level CI defined by

\mbox{CI}_{\gamma}(T)=\left(q_{\gamma_{2}}^{-1}(T),q_{\gamma_{1}}^{-1}(T)\right) (6)

for some $\gamma_{2}>\gamma_{1}>0$ with $\gamma_{2}-\gamma_{1}=\gamma$, to have a coverage probability

P_{\theta}(\theta\in\mbox{CI}_{\gamma}(T))=P_{\theta}\Big[T\in(q_{\gamma_{1}}(\theta),q_{\gamma_{2}}(\theta))\Big]=\gamma_{2}-\gamma_{1}=\gamma.

Here the coverage probability is a frequentist probability based on the distribution of $T$, whereas the confidence is for the observed interval $\mbox{CI}(t)$ based on the confidence density of $\theta$. The confidence becomes

C_{m}(\theta\in\mbox{CI}_{\gamma}(t);t) = C_{m}(\theta=q_{\gamma_{1}}^{-1}(t);t)-C_{m}(\theta=q_{\gamma_{2}}^{-1}(t);t)
 = P_{\theta=q_{\gamma_{1}}^{-1}(t)}(T\geq t)-P_{\theta=q_{\gamma_{2}}^{-1}(t)}(T\geq t)
 = (1-\gamma_{1})-(1-\gamma_{2})=\gamma
 = P_{\theta}(\theta\in\mbox{CI}_{\gamma}(T)).

Thus, we have the following lemma.

Lemma 1

Under the regularity condition R1,

P_{\theta}(\theta\in\mbox{CI}(T))=C_{m}(\theta\in\mbox{CI}(t);t), (7)

where $\mbox{CI}(t)$ is the observed interval of the confidence procedure $\mbox{CI}(T)$ defined in (6).

Fisher (1950) was against the idea of interpreting the level of significance as a long-term frequency in repeated samples from the same population. For $i=1,\cdots,n$, suppose the $T_{i}$’s are estimates of the $\theta_{i}$’s from different populations. Let $X_{i}=I(\theta_{i}\in\mbox{CI}(T_{i}))$ such that $\gamma=C_{m}(\theta_{i}\in\mbox{CI}(t_{i}))$, and let $\bar{X}=\sum X_{i}/n$. Then $\bar{X}=\gamma+O_{p}(1/\sqrt{n})$, so that $\gamma$ can be a long-term frequency of true coverage from different populations or experiments.
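
The following small sketch (Python, with an illustrative normal model $T_{i}\sim N(\theta_{i},1)$ and $\gamma=0.95$) illustrates this long-term frequency over many different ‘populations’:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000
    theta = rng.uniform(-5, 5, size=n)         # a different theta_i for each population
    t = rng.normal(theta, 1.0)                 # one estimate T_i ~ N(theta_i, 1) each
    z = 1.959964                               # 97.5% normal quantile, so gamma = 0.95
    covered = (t - z <= theta) & (theta <= t + z)
    print(covered.mean())                      # close to 0.95, up to O(1/sqrt(n)) noise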

Example 2. On observing $t$ from $N(\theta,1)$, we have the confidence distribution

C_{m}(\theta;t)=P_{\theta}(T\geq t)=1-\Phi(t-\theta),

where $\Phi(\cdot)$ is the standard normal distribution function, with corresponding confidence density $c_{m}(\theta)=\phi(t-\theta)$, i.e., the normal density centered at $\theta=t$. In principle, we can derive any confidence interval or P-value from this confidence density. This example applies in most large-sample situations where, under regularity conditions, the normal model is correct asymptotically. Furthermore, it illustrates clearly the canonical relationship between confidence and coverage probability. For instance, for a 95% CI, we have

C_{m}(\theta\in\mbox{CI}(t))=95\%,

reflecting the 95% confidence that the observed CI covers the true parameter. This confidence is associated with an exact coverage probability

P_{\theta}(\theta\in\mbox{CI}(T))=0.95,

so the confidence matches the coverage probability. \Box
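
A minimal numerical sketch of this example (Python with scipy assumed; the observed value $t$ is arbitrary) computes the confidence distribution, its density, and the confidence attached to the usual 95% interval:

    import numpy as np
    from scipy.stats import norm

    t = 1.3                                     # observed statistic from N(theta, 1)
    theta = np.linspace(t - 6, t + 6, 2001)
    C_m = 1 - norm.cdf(t - theta)               # confidence distribution C_m(theta; t)
    c_m = norm.pdf(t - theta)                   # confidence density, centered at theta = t

    lo, hi = t - 1.959964, t + 1.959964         # the usual 95% CI
    conf = norm.cdf(hi - t) - norm.cdf(lo - t)  # C_m(theta in CI) = 0.95 exactly
    print(conf)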

Fisher (1930, 1933) called $C_{m}(\theta;t)$ the fiducial distribution of $\theta$, but he required $T$ to be sufficient. However, the recent definition of the confidence distribution (e.g. Schweder and Hjort, 2016, p.58) requires only $C_{m}(\theta;T)$ to be uniform at the true parameter, thus guaranteeing a correct coverage probability. Lemma 1 states when Fisher’s fiducial probability $C_{m}(\theta;t)$ becomes a frequentist probability, which requires $T$ to be continuous. When $T$ is discrete, the equality is only achieved asymptotically; see Appendix A3 for an example.

However, as shown in Example 1, a correct coverage probability does not rule out relevant subsets. This means that the current definition of the confidence distribution does not guarantee epistemic confidence. The key step is to define a confidence distribution that uses the full information. Motivated by the Bayesian formulation and Efron (1993), first define the implied prior as

c_{0}(\theta)\equiv c_{0}(\theta;t)\equiv m(t)\frac{c_{m}(\theta;t)}{L(\theta;t)}, (8)

where $m(t)$ cancels out all the terms not involving $\theta$ in $c_{m}(\theta;t)/L(\theta;t)$. In this paper, the full confidence density is defined by

c_{f}(\theta)\equiv c_{f}(\theta;y)\propto c_{0}(\theta)L(\theta;y). (9)

The subscript $f$ is now used to indicate that it is associated with the full likelihood based on the whole data. When necessary for clarity, the dependence of the confidence density and the likelihood on $t$ and on the whole data $y$ will be made explicit. $c_{f}(\theta)$ is defined only up to a constant term to allow it to integrate to one. Obviously, if $T$ is sufficient, then $c_{m}(\theta)=c_{f}(\theta)$, but in general they are not equal. In Section 3, we show a more convenient way to construct $c_{f}(\theta)$. The confidence parallel to (5) can be denoted by $C_{f}(\cdot)$. Thus, the full confidence density looks like a Bayesian posterior. However, the implied prior is not subjectively selected, and can be improper and data-dependent.

The full confidence density $c_{f}(\theta)$ is used in general to compute the degree of confidence $\gamma$ of any observed $\mbox{CI}(y)$ as

\gamma=\int_{\mbox{CI}(y)}c_{f}(\theta)\,d\theta.

The CI has a coverage probability, which may or may not be equal to $\gamma$. We say that $c_{f}(\theta)$ has no relevant subsets if there is no $R(y)$ such that the conditional coverage probability is biased in one direction according to (1) or (2).

For our main theorem, we assume the following regularity conditions; the proof is given in Appendix A1. For completeness and easy access, R1 is restated in full here.

  • R1.

    $T=T(Y)$ is a continuous scalar statistic whose quantile function $q_{\alpha}(\theta)$, defined by

    P_{\theta}(T\leq q_{\alpha}(\theta))=\alpha,

    is a strictly increasing function of $\theta$ for any $\alpha\in(0,1)$.

  • R2.

    $c_{0}(\theta)$ is positive and locally integrable on the parameter space $\Theta$ such that

    \int_{J}c_{0}(\theta)\,d\theta<\infty,\quad\text{for any compact subset }J\subseteq\Theta.

  • R3.

    $\log c_{0}(\theta)$ is uniformly continuous in $y$.

  • R4.

    The confidence interval $\mbox{CI}(y)=(b_{L}(y),b_{U}(y))$ is locally bounded, i.e., for any compact set $K$ in the sample space of $y$, there exist $M_{1}$ and $M_{2}$ such that

    |b_{L}(y)|\leq M_{1}\ \text{and}\ |b_{U}(y)|\leq M_{2}\quad\text{for any }y\in K.

Theorem 1

Consider the full confidence density $c_{f}(\theta)\propto c_{0}(\theta)L(\theta;y)$, with $c_{0}(\theta)$ being the implied prior defined by (8) satisfying R2 and R3, based on $T(Y)$ that satisfies R1. Let $\gamma$ be the degree of confidence for the observed confidence interval $\mbox{CI}(y)$ that satisfies R4, such that

\gamma=\int_{\mbox{CI}(y)}c_{f}(\theta)\,d\theta,\quad\text{for all }y.

Then $c_{f}(\theta)$ has no relevant subsets.

Note we have two ways of computing the price of an observed CI: using $C_{f}(\theta\in\mbox{CI})$ or using $P_{\theta}(\theta\in\mbox{CI})$. The latter is not guaranteed to be free of relevant subsets, while the former is not guaranteed to match the coverage probability. If the two are equal, we have a confidence that corresponds to a coverage probability that has no relevant subsets, hence epistemic confidence. If $T$ is sufficient and satisfies R1, Lemma 1 implies that the frequentist CI satisfies

P_{\theta}(\theta\in\mbox{CI}(Y))=C_{m}(\theta\in\mbox{CI}(y))=C_{f}(\theta\in\mbox{CI}(y))=\gamma,\quad\text{for all }\theta\text{ and }y.

Thus, we summarize the first key result in the following corollary:

Corollary 1

Under the regularity conditions R1-R4, if $T$ is a sufficient statistic, the confidence based on $c_{m}(\theta;t)$ has a correct coverage probability and no relevant subsets. Hence the confidence is epistemic.

In Example 2, on observing $t$ from $N(\theta,1)$, $c_{m}(\theta)=c_{f}(\theta)=\phi(t-\theta)$. Furthermore, we also have $L(\theta)=\phi(t-\theta)$, so the implied prior $c_{0}(\theta)=1$. The coverage probabilities match the full confidence, and by Corollary 1, the confidence is epistemic.

We note that $P_{\theta}(\theta\in\mbox{CI}(Y))=C_{f}(\theta\in\mbox{CI}(y))$ holds asymptotically, regardless of whether $y$ is continuous or discrete. Corollary 1 specifies the conditions where it is true in finite samples.

For more generality, it is actually more convenient to prove the theorem using $c_{0}(\theta)\equiv c_{0}(\theta;y)$, an arbitrary function that satisfies R2 and R3, as long as it leads to a proper $c(\theta;y)$. In particular, it does not have to be an implied prior (8) that depends on the statistic $T$. If $c_{0}(\theta)$ is a proper probability density that does not depend on $y$, then $c_{f}(\theta)$ is a Bayesian posterior density, shown already by Robinson’s (1979) Proposition 7.4 not to have relevant subsets. For proper priors, $\log c_{0}(\theta)$ is trivially uniformly continuous in $y$, so the theorem extends his result to improper and data-dependent priors.

However, there is a significant impact on the interpretation. If you use an arbitrary $c_{0}(\theta;y)$ that is not the same as the implied prior, and there is a betting market, your price $\gamma$ will differ from the market price. So, as illustrated in the Introduction and in Example 1, I can construct a Dutch Book against you. This means that, in this case, the theorem is meaningful only for two people betting repeatedly against each other, with gains or losses expressed in terms of expected value or long-term average. It is exactly the setting described by Buehler (1959) and Robinson (1979). Crucially, in such a setting, the presence of relevant subsets does not allow an external agent to make a risk-free profit from a single bet. In this sense, it does not satisfy our original definition of epistemic confidence.

Lindley (1958) showed that, assuming $T$ is sufficient, Fisher’s fiducial probability – hence the marginal confidence – is equal to the Bayesian posterior if and only if the family $p_{\theta}(y)$ is transformable to a location family. However, his proof assumed $c_{0}(\theta)$ to be free of $y$. Condition R3 of the theorem allows $c_{0}(\theta)$ to depend on the data, so our result is not limited to the location family.

2.3 Ancillary statistics

The current definition of confidence (e.g. Schweder and Hjort, 2016, p.58) only requires $C_{m}(\theta;T)$ to follow a uniform distribution. However, if $T$ is not sufficient, the marginal confidence is not epistemic, because it does not use the full likelihood, so it is not guaranteed free of relevant subsets. Limiting ourselves to models with sufficient statistics to get epistemic confidence is overly restrictive, since sufficient statistics exist at arbitrary sample sizes in the full exponential family only (Pawitan, 2001, Section 4.9). Using non-sufficient statistics implies a potential loss of efficiency and of the epistemic property. Further progress depends on the ancillary statistic, a feature or a function of the data whose distribution is free of the unknown parameter. As reviewed by Ghosh et al. (2010), it is one of Fisher’s great insights from the 1920s. We first have a parallel development for the conditional confidence distribution given the ancillary $A(y)=a$:

C_{c}(\theta;t|a) \equiv P_{\theta}(T\geq t|a)
c_{c}(\theta;t|a) \equiv \partial C_{c}(\theta;t|a)/\partial\theta.

As for the marginal case, we have the following corollary from Lemma 1. Condition R1 needs a little modification, where it refers to the conditional statistic $T|a$ for each $a$.

Corollary 2

Under the regularity condition R1,

P_{\theta}(\theta\in\mbox{CI}|a)=C_{c}(\theta\in\mbox{CI};t|a), (10)

where CI is the confidence interval based on the conditional distribution of $T|a$.

Furthermore, define the implied prior as

c_{0}(\theta)\equiv c_{0}(\theta;t|a)\equiv m(t,a)\frac{c_{c}(\theta;t|a)}{L(\theta;t|a)}, (11)

where $m(t,a)$ cancels out all the terms not involving $\theta$ in $c_{c}(\theta;t|a)/L(\theta;t|a)$. As before, the full confidence is $c_{f}(\theta)\propto c_{0}(\theta)L(\theta;y)$.

Suppose $T(y)=t$ is not sufficient but $(t,a)$ is, where $a$ is an ancillary statistic. In this case, $a$ is called an ancillary complement, and in a qualitative sense it is a maximal ancillary, because

L(\theta;y) = L(\theta;t,a) (12)
 \propto p_{\theta}(t|a)p(a)
 \propto p_{\theta}(t|a)=L(\theta;t|a).

Thus, conditioning a non-sufficient statistic on a maximal ancillary recovers the lost information and restores the full-data likelihood. In particular, the conditional confidence becomes the full confidence: $c_{c}(\theta;t|a)=c_{f}(\theta)$. Note that (12) holds for any maximal ancillary, so if a maximal ancillary exists, then the full likelihood is automatically equal to the conditional likelihood given any maximal ancillary statistic. In its sampling theory form, when $t$ is the maximum likelihood estimator (MLE) $\hat{\theta}$, full information can be recovered from $p_{\theta}(\hat{\theta}|a)$, whose approximation has been studied by Barndorff-Nielsen (1983).

In conditional inference (Reid 1995), it is commonly stated that we condition on the ancillary to make our inference more ‘relevant’ to the data at hand, in other words, more epistemic. But this is typically stated on an intuitive basis; the following corollary provides a mathematical justification. Since we already condition on $A(y)$, a further relevant subset $R(y)$ is such that the conditional probability $P_{\theta}(\theta\in\mbox{CI}|A(y),R(y))$ is non-trivially and consistently biased in one direction from $P_{\theta}(\theta\in\mbox{CI}|A(y))$, in the same manner as (1). As we describe following Theorem 1 above, the result holds for an arbitrary $c_{0}(\theta)$ that satisfies R2-R3 and leads to a valid confidence density. So it applies to the $c_{0}(\theta)$ defined by (11). Following a similar reasoning as for the previous corollary, we can state our second key result:

Corollary 3

If $A(y)=a$ is a maximal ancillary for $T(y)$ and the CI is constructed from the conditional confidence density based on $T|a$, then under R1-R4, the conditional confidence $C_{c}(\theta\in\mbox{CI};t|a)$ has a correct coverage probability and no further relevant subsets. Hence the conditional confidence is epistemic.

Remark: In view of (12), the confidence is epistemic for any choice of the maximal ancillary. Basu (1959) showed under mild conditions that maximal ancillaries exist. However, they may not be unique; this is an issue traditionally considered most problematic in conditional inference. If the maximal ancillary is not unique, then the conditional coverage probability might depend upon the choice. However, this does not affect the absence of relevant subsets guaranteed by the corollary. We discuss this further in Section 4 and illustrate with an example in Appendix A4.

3 Examples

Our overall theory suggests that, regardless of the existence of a sufficient statistic, we can get epistemic confidence by computing CIs based on the full confidence density $c_{f}(\theta)\propto c_{0}(\theta)L(\theta;y)$. The corresponding coverage probability is either a marginal probability or a conditional probability given a maximal ancillary, depending on whether there exists a scalar sufficient statistic. The full likelihood $L(\theta;y)$ is almost always easy to compute. However, in order to get a correct coverage, $c_{0}(\theta)$ is defined by (8) or (11), which in practice can be difficult to evaluate. For example, if we use the MLE, in general it has no closed-form formula, and computing the P-value, even from its approximate distribution based on Barndorff-Nielsen’s (1983) formula, can be challenging. We illustrate through a series of examples some suitable approximations of $c_{0}(\theta)$ that are simpler to compute.

Suppose, for sample size $n=1$, there is a statistic $t_{1}\equiv T(y_{1})$ that satisfies R1, i.e. it allows us to construct a valid confidence density $c_{m}(\theta;t_{1})$. Then we can compute $c_{0}(\theta)$ based on $c_{m}(\theta;t_{1})/L(\theta;t_{1})$. First consider the case when $c_{0}(\theta)$ is free of the data. From the updating formula in Pawitan and Lee (2021), the confidence density based on the whole data is

c_{f}(\theta;y) \propto c_{m}(\theta;t_{1})L(\theta;y_{1}|t_{1})L(\theta;y_{2}\cdots y_{n}) (13)
 \propto c_{0}(\theta)L(\theta;t_{1})L(\theta;y_{1}|t_{1})L(\theta;y_{2}\cdots y_{n})
 = c_{0}(\theta)L(\theta;y_{1})L(\theta;y_{2}\cdots y_{n})
 \propto c_{0}(\theta)L(\theta;y).

The statistic $t_{1}$ trivially exists if $y_{1}$ itself leads to a valid confidence density. Once $c_{0}(\theta)$ is available, (13) is highly convenient, since it does not require any computation of a statistic such as the MLE, its distribution or the P-value based on the whole data. More importantly, as shown in some examples below, formula (13) works even when there is no sufficient statistic from the whole data for $n>1$. This is illustrated by the general location-family model in Section 3.2.

When $c_{0}(\theta)$ depends on the data, it matters which $y_{i}$ is used to compute it. In this case the updating formula (13) is only an approximation. As long as the contribution of $\log c_{0}(\theta)$ to $\log c_{f}(\theta)$ is of order $O(1/n)$, we expect a close approximation. This is illustrated in Example 6 below.

3.1 Simple models

Example 1 (continued). Based on $y_{1}$, the confidence density and the likelihood functions are proportional:

c(\theta;y_{1})\propto L(\theta;y_{1})=1,\quad\mbox{for }\theta\in\{y_{1}-1,y_{1},y_{1}+1\},

so the implied prior $c_{0}(\theta)=1$ for all $\theta$. The full likelihood based on $(y_{1},y_{2})$ is

L(\theta)=1,\quad\mbox{for }\theta\in\{y_{(2)}-1,\ldots,y_{(1)}+1\},

so the full confidence density is $c_{f}(\theta)\propto L(\theta)$. For example, if $y_{1}=3$ and $y_{2}=5$, we do have 100% confidence that $\theta=4$. And if $y_{1}=y_{2}=3$, we only have 33.3% confidence for $\theta=4$, though we have 100% confidence for $\theta\in\{2,3,4\}$.

The MLE of $\theta$ is not unique, but we can choose $\widehat{\theta}=\bar{y}$ as the MLE. It is not sufficient, but $(\bar{y},R)$ is, so $R$ is a maximal ancillary. Indeed the full confidence values match the conditional probabilities given the range $R$ as previously given. Furthermore, according to Corollary 3, there is no further relevant subset, so the confidence is epistemic. \Box

Example 3. Let $y=y_{1}$ be a single sample from the uniform distribution on $[\theta,\theta+1]$, where $\theta$ is a real number. As in the previous examples, the confidence density and the likelihood functions are

c_{f}(\theta)\propto L(\theta)=1,\quad\mbox{for }\theta\in[y-1,y].

For example, if $y=1.9$, then we are 100% confident in $0.9<\theta<1.9$, and 90% confident in $1.0<\theta<1.9$. The coverage probability of CIs of the form $[y-\gamma,y]$ is indeed

P_{\theta}(\theta\in\mbox{CI})=\gamma,

so the confidence is epistemic.

Now let’s consider betting on the value of $\lfloor\theta\rfloor$, the largest integer not exceeding $\theta$, based on observing $y$. What price would you give to the bet that $\lfloor\theta\rfloor=\lfloor y\rfloor=1$? According to the confidence distribution, it should be 0.9. But we can show that the random variable $\lfloor y\rfloor$ is Bernoulli shifted by $\lfloor\theta\rfloor$ and with success probability $\langle\theta\rangle\equiv\theta-\lfloor\theta\rfloor$, the fractional part of $\theta$. For example, if $\theta=1.6$ then $\lfloor\theta\rfloor=1$ and $\langle 1.6\rangle=0.6$, so $\lfloor y\rfloor$ is equal to $1+0=1$ or $1+1=2$ with probabilities 0.4 and 0.6, respectively. This means that, in general, using $\lfloor y\rfloor$ as a guess, the ‘coverage’ probability of being correct is

P_{\theta}\{\lfloor Y\rfloor=\lfloor\theta\rfloor\}=1-\langle\theta\rangle.

This probability varies from 0 to 1 across $\theta$, not matching the specific confidence – such as 0.9 above – derived from the confidence density.

The problem is that $\lfloor y\rfloor$ is no longer a sufficient statistic, so its marginal distribution is not fully informative. Now, the fractional part $\langle y\rangle$ is uniform between 0 and 1 for any $\theta$, so it is an ancillary statistic. We can show that, conditional on $\langle y\rangle$, the distribution of $\lfloor y\rfloor$ is degenerate: with probability 1, it is equal to $\lfloor\theta\rfloor+1$ if $\langle y\rangle<\langle\theta\rangle$, and equal to $\lfloor\theta\rfloor$ if $\langle y\rangle>\langle\theta\rangle$. This conditional distribution is distinct from the unconditional version, so $\langle y\rangle$ is relevant. Basu (1964) and Ghosh et al. (2010) used this example as a counter-example, where conditioning by an ancillary leads to a puzzling degenerate distribution. But actually, it is not so puzzling: The conditional likelihood is the same as the full likelihood, so $\langle y\rangle$ is maximal ancillary. This is of course as we should expect, since $\langle y\rangle$ together with $\lfloor y\rfloor$ form the full data $y$.

To illustrate with real numbers, for example, on observing $y=1.9$ we have $\langle y\rangle=0.9$, so $\lfloor\theta\rfloor=\lfloor y\rfloor=1$ if the unknown $\langle\theta\rangle<0.9$. Now, your betting situation is much clearer: you will bet that $\lfloor\theta\rfloor=1$ if you believe that $\langle\theta\rangle<0.9$, i.e. $1<\theta<1.9$. This is exactly the same logical situation you faced before with the full likelihood and confidence density. \Box
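
A small simulation sketch (Python, with the illustrative value $\theta=1.6$) confirms both the marginal probability $1-\langle\theta\rangle$ and the degenerate behaviour of $\lfloor y\rfloor$ given the ancillary $\langle y\rangle$:

    import numpy as np

    rng = np.random.default_rng(3)
    theta = 1.6                                        # floor(theta) = 1, <theta> = 0.6
    y = rng.uniform(theta, theta + 1, size=100_000)

    print(np.mean(np.floor(y) == np.floor(theta)))     # about 1 - 0.6 = 0.4
    frac = y - np.floor(y)                             # the ancillary <y>
    # conditional on <y>, floor(y) is degenerate:
    print(np.all(np.floor(y)[frac > 0.6] == 1))        # equals floor(theta) when <y> > <theta>
    print(np.all(np.floor(y)[frac < 0.6] == 2))        # equals floor(theta)+1 when <y> < <theta>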

3.2 Location family

Suppose $y_{1},\ldots,y_{n}$ are an iid sample from the location family with density

p_{\theta}(y_{i})=f(y_{i}-\theta),

where $f(\cdot)$ is an arbitrary but known density function, for example the Cauchy or normal densities. Immediately, based on $y_{1}$ alone, the confidence density is

c(\theta;y_{1})=f(y_{1}-\theta)=L(\theta;y_{1}),

so the implied prior $c_{0}(\theta)=1$. So, again using formula (13), the full confidence density is

c_{f}(\theta)\propto L(\theta)=\prod_{i=1}^{n}f(y_{i}-\theta). (14)

This is a remarkably simple way to arrive at the confidence density of θ\theta without having to find the MLE and its distribution.
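
As an illustration of (14), the following sketch (Python with scipy; the Cauchy sample and the candidate interval are arbitrary) normalizes the full likelihood on a grid and reads off the confidence of an interval by numerical integration:

    import numpy as np
    from scipy.stats import cauchy

    y = np.array([-1.2, 0.3, 0.8, 2.1, 4.0])       # illustrative Cauchy location sample
    theta = np.linspace(-10, 10, 4001)
    dtheta = theta[1] - theta[0]

    loglik = cauchy.logpdf(y[:, None] - theta).sum(axis=0)
    c_f = np.exp(loglik - loglik.max())
    c_f /= c_f.sum() * dtheta                      # full confidence density (14), normalized

    inside = (theta > -0.5) & (theta < 2.0)        # confidence of a candidate interval
    print(c_f[inside].sum() * dtheta)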

Without further specifications, the MLE $T\equiv\widehat{\theta}$ is not sufficient, so the marginal P-value $P_{\theta}(T>t)$ will not yield the full confidence. The distribution of the residuals $(y_{i}-\theta)$ is free of $\theta$, so the set of differences $(y_{i}-y_{j})$’s are ancillary. In his classic paper, Fisher (1934) showed that

p_{\theta}(\widehat{\theta}|a)=k(a)\frac{L(\theta)}{L(\widehat{\theta})},

where $a$ is the set of differences from the order statistics $y_{(1)},\ldots,y_{(n)}$. This means that the conditional likelihood based on $\widehat{\theta}|a$ matches the full likelihood (14), and the confidence of CIs based on (14) will match the conditional coverage probability. Indeed, here $(\widehat{\theta},a)$ is sufficient and $a$ is maximal ancillary. Overall, the confidence of CIs based on (14) is epistemic.

Example 4. Suppose that $y=(y_{1},\cdots,y_{n})$ is an iid sample from the uniform distribution on $[\theta-1,\theta+1]$. Let $y_{(1)}$ and $y_{(n)}$ be the minimum and maximum order statistics; then $(y_{(1)},y_{(n)})$ is a sufficient statistic. The likelihood is given by

L(\theta;y)\propto I(y_{(n)}-1\leq\theta\leq y_{(1)}+1)

and $\widehat{\theta}=y_{(1)}+1$ is an MLE. Since the uniform distribution on $[\theta-1,\theta+1]$ is a location family, we have $c_{0}(\theta)=1$, leading to the full confidence density

c_{f}(\theta)=\frac{1}{2-a}I(y_{(n)}-1\leq\theta\leq y_{(1)}+1)=\frac{1}{2-a}I(\widehat{\theta}-(2-a)\leq\theta\leq\widehat{\theta}),

where the range $a\equiv y_{(n)}-y_{(1)}$ is a maximal ancillary. \Box

3.3 Exponential family model

Suppose the dataset $y$ is an iid sample from the exponential family model with log-density of the form

\log p_{\theta}(y_{i})=\sum_{j=1}^{J}h_{j}(\theta)t_{j}(y_{i})-A(\theta)+c(y_{i}). (15)

The MLE is sufficient if $J=1$, but not if $J>1$. In the latter case, the family is called the curved exponential family. By Theorem 1, when $J=1$ confidence statements based on the MLE will be epistemic. (Our theory covers the continuous case in order to get exact coverage probabilities. Many important members are discrete, which is more complicated because the definition of the P-value is not unique, and the coverage probability function is guaranteed not to match any chosen confidence level. We discuss an example in Appendix A3.)

The standard evaluation of the confidence requires the tail probability of the distribution of the MLE, which in general has no closed-form formula. Barndorff-Nielsen’s (1983) approximate conditional density of the MLE $\widehat{\theta}$ is given by

p_{\theta}(\widehat{\theta}|a)=k|I(\widehat{\theta})|^{1/2}\frac{L(\theta)}{L(\widehat{\theta})}+O(n^{-1}), (16)

where the MLE is the solution of $nA^{\prime}(\theta)=\sum_{i}\sum_{j}h^{\prime}_{j}(\theta)t_{j}(y_{i})$, $a$ is the maximal ancillary and $k$ is a normalizing constant that is free of $\theta$. For $J=1$ and the canonical parameter $h_{1}(\theta)=\theta$, the ancillary is null, and the approximation leads to the right-side P-value

P_{\theta}\{Z>r^{\ast}(\theta)\},\qquad r^{\ast}(\theta)\equiv r+\frac{1}{r}\log\frac{z}{r}, (17)

where $Z$ is the standard normal variate and

r=\mbox{sign}(\widehat{\theta}-\theta)\sqrt{w},\qquad z=|I(\widehat{\theta})|^{1/2}(\widehat{\theta}-\theta),

with $w=2\log\{L(\widehat{\theta})/L(\theta)\}$ and $I(\widehat{\theta})$ the observed Fisher information.
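
A generic sketch of the P-value formula (17) is given below (Python with scipy; the user supplies the log-likelihood, the MLE and the observed information; the formula is undefined at $\theta=\widehat{\theta}$, where $r=0$):

    import numpy as np
    from scipy.stats import norm

    def rstar_pvalue(theta, theta_hat, loglik, obs_info):
        # Right-side P-value P(Z > r*(theta)) from formula (17).
        w = 2 * (loglik(theta_hat) - loglik(theta))      # likelihood-ratio statistic
        r = np.sign(theta_hat - theta) * np.sqrt(w)
        z = np.sqrt(obs_info) * (theta_hat - theta)      # Wald-type statistic
        r_star = r + np.log(z / r) / r
        return norm.sf(r_star)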

Example 5. Let $y=(y_{1},\cdots,y_{n})$ be an iid sample from the gamma distribution with mean one and shape parameter $\theta$. The density is given by

p_{\theta}(y_{i})=\frac{1}{\Gamma(\theta)}\theta^{\theta}y_{i}^{\theta-1}e^{-\theta y_{i}},

so we have an exponential family model with

t(y_{i})=-y_{i}+\log y_{i},\qquad A(\theta)=\log\Gamma(\theta)-\theta\log\theta.

To use formula (13), we first find the implied prior density using $t_{1}\equiv t(y_{1})$ alone:

c_{0}(\theta)\propto\frac{c(\theta;t_{1})}{L(\theta;t_{1})},

where $c(\theta;t_{1})=\partial\{P_{\theta}(T_{1}\geq t_{1})\}/\partial\theta$ and $L(\theta;t_{1})=p_{\theta}(y_{1})$. The probability $P_{\theta}(T_{1}\geq t_{1})$ is a gamma integral, which is computed numerically. The implied prior is shown in Figure 1(a). So from (13), we get the confidence density

c_{f}(\theta)\propto c_{0}(\theta)L(\theta)=c_{0}(\theta)\prod_{i=1}^{n}p_{\theta}(y_{i}).

For an example with $n=5$ and $\sum_{i}t(y_{i})=-5.8791$, which corresponds to the MLE $\widehat{\theta}=3$, the confidence density is given by the solid line in Figure 1(b). The normalized likelihood function is also shown by the dashed line, which is quite distinct from the confidence density.
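
To indicate how the implied prior here might be computed in practice, the following sketch (Python with scipy; the single observation $y_{1}$ is arbitrary) evaluates $P_{\theta}(T_{1}\geq t_{1})$ as a gamma integral and differentiates it numerically over $\theta$, as in (8) with a single observation:

    import numpy as np
    from scipy.stats import gamma
    from scipy.optimize import brentq

    y1 = 0.5                                      # a single observation
    t1 = -y1 + np.log(y1)                         # the statistic t(y1)

    def pvalue(theta):
        # t(y) = -y + log y is unimodal with maximum at y = 1, so the event
        # {t(Y) >= t1} is an interval [ylo, yhi] around y = 1.
        ylo = brentq(lambda y: -y + np.log(y) - t1, 1e-12, 1.0)
        yhi = brentq(lambda y: -y + np.log(y) - t1, 1.0, 1e6)
        g = gamma(a=theta, scale=1 / theta)       # gamma with mean 1, shape theta
        return g.cdf(yhi) - g.cdf(ylo)

    def implied_prior(theta, eps=1e-4):
        c1 = (pvalue(theta + eps) - pvalue(theta - eps)) / (2 * eps)  # c(theta; t1)
        L1 = gamma.pdf(y1, a=theta, scale=1 / theta)                  # L(theta; t1)
        return c1 / L1                                                # up to a constant

    print([round(implied_prior(th), 4) for th in (1.0, 2.0, 3.0)])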



Figure 1: (a) Implied prior of the gamma shape parameter $\theta$ computed using formula (8) (solid) and from the approximate P-value formula (17) (circles). Both are normalized such that they are equal to one at the MLE (black dot). (b) The confidence densities based on a sample with size $n=5$ using formula (13) (solid) and using the approximate P-value formula (17) (circles). The normalized likelihood function (dashed) is also shown.

To get the marginal confidence density based on the P-value formula (17), we need

w=2n\log\Gamma(\theta)-2n\log\Gamma(\widehat{\theta})+2n(\widehat{\theta}\log\widehat{\theta}-\theta\log\theta)+2(\widehat{\theta}-\theta)\sum_{i}(\log y_{i}-y_{i}),

where $\widehat{\theta}$ is the solution of

n\psi(\theta)-n\log\theta-n=\sum_{i}t(y_{i}),

with $\psi(\theta)\equiv\partial\log\Gamma(\theta)/\partial\theta$, and the observed Fisher information is

I(\widehat{\theta})=n\{\psi^{\prime}(\widehat{\theta})-1/\widehat{\theta}\}.

The circle points in Figure 1(b) are the marginal confidence density based on the same sample above. As expected, this tracks almost exactly the one given by formula (13). The corresponding implied prior based on $c_{m}(\theta)/L(\theta)$ is given in Figure 1(a), also closely matching the implied prior based on $n=1$. \Box

Example 6. This is an example where $c_{0}(\theta)$ is data dependent. Let $y=(y_{1},\ldots,y_{n})$ be an iid sample from $N(\theta,\theta)$ for $\theta>0$. The log-density is given by

\log p_{\theta}(y)=-\frac{n}{2}\log(2\pi\theta)-\frac{1}{2}\Big(\sum_{i}y_{i}^{2}/\theta-2\sum_{i}y_{i}+n\theta\Big),

so this is a regular exponential family with sufficient statistic $T(y)=\sum_{i}y_{i}^{2}$. The marginal confidence density $c_{m}(\theta)$ can be computed based on the non-central $\chi^{2}$ distribution for $T(y)$. For $n=1$, $T(y_{1})=y_{1}^{2}$ is sufficient, and

C(\theta;y_{1}) = P_{\theta}(Y_{1}^{2}>y_{1}^{2})
 = 1-\Phi\left(\frac{|y_{1}|-\theta}{\sqrt{\theta}}\right)+\Phi\left(\frac{-|y_{1}|-\theta}{\sqrt{\theta}}\right)
c(\theta;y_{1}) = \frac{1}{2}\phi\left(\frac{|y_{1}|-\theta}{\sqrt{\theta}}\right)\left(\frac{|y_{1}|}{\theta\sqrt{\theta}}+\frac{1}{\sqrt{\theta}}\right)+\frac{1}{2}\phi\left(\frac{-|y_{1}|-\theta}{\sqrt{\theta}}\right)\left(\frac{|y_{1}|}{\theta\sqrt{\theta}}-\frac{1}{\sqrt{\theta}}\right)
L(\theta;y_{1}) = \phi\left(\frac{y_{1}-\theta}{\sqrt{\theta}}\right),

so the implied prior is data-dependent. This means that the full confidence density depends on which $y_{i}$ is used to compute the implied prior:

c_{fi}(\theta)=c_{0}(\theta;y_{i})L(\theta;y).

In Figure 2, for $n=3$, we compare $c_{fi}(\theta)$ using three different versions of $c_{0}(\theta)$ based on $y_{i}$ for $i=1,2,3$. These are also compared with the marginal confidence $c_{m}(\theta)$. As shown in the figure, even for such a small dataset, the effects of the data dependence in this case are negligible. \Box
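
A small numerical check of the displayed formulas (Python with scipy; the value of $y_{1}$ is arbitrary) compares the stated confidence density $c(\theta;y_{1})$ with a numerical derivative of $C(\theta;y_{1})$ and forms the data-dependent implied prior as in the text:

    import numpy as np
    from scipy.stats import norm

    y1 = 1.0
    theta = np.linspace(0.05, 6, 2000)
    s = np.sqrt(theta)

    C = 1 - norm.cdf((abs(y1) - theta) / s) + norm.cdf((-abs(y1) - theta) / s)
    c = 0.5 * norm.pdf((abs(y1) - theta) / s) * (abs(y1) / (theta * s) + 1 / s) \
      + 0.5 * norm.pdf((-abs(y1) - theta) / s) * (abs(y1) / (theta * s) - 1 / s)
    L = norm.pdf((y1 - theta) / s)
    c0 = c / L                                  # implied prior c0(theta; y1) of the text, up to a constant

    print(np.max(np.abs(np.gradient(C, theta) - c)))   # small, up to grid error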



Figure 2: Confidence densities $c_{fi}(\theta)$ (solid) and $c_{m}(\theta)$ (circles) for the $N(\theta,\theta)$ model. The former is actually drawn three times, corresponding to three different versions of the implied prior based on $y_{i}$ for $i=1,2,3$. (a) Based on $y=(0.9,1,1.5)$. (b) Based on $y=(0.1,1,5)$.

Example 7. This example from a curved exponential family is used to illustrate complex models, where standard results sometimes fail. Let $y_{1},\ldots,y_{n}$ be an iid sample from $N(\theta,\theta^{2})$ for $\theta>0$. First consider the confidence distribution based on $y_{1}$,

C_{m}(\theta;y_{1})=P_{\theta}(Y_{1}\geq y_{1})=1-\Phi\left(\frac{y_{1}-\theta}{\theta}\right).

We can see immediately that if we use $y_{1}$ as the statistic, the term inside the bracket converges to $-1$ as $\theta\rightarrow\infty$, so the confidence distribution goes to $1-\Phi(-1)=0.84$. Hence $y_{1}$ does not satisfy the regularity condition R1.

Here $(\sum y_{i}^{2},\sum y_{i})$ is minimal sufficient, and the likelihood function is

L(\theta;y)=(2\pi\theta^{2})^{-n/2}\exp\left(-\sum y_{i}^{2}/2\theta^{2}+\sum y_{i}/\theta-n/2\right).

The MLE is given by

\widehat{\theta}=\widehat{\theta}(y)=\frac{-\sum y_{i}+\sqrt{(\sum y_{i})^{2}+4n\sum y_{i}^{2}}}{2n}

with a maximal ancillary

A(y)=\frac{\sum y_{i}}{\sqrt{\sum y_{i}^{2}}}.

In terms of $(\widehat{\theta},a)\equiv(\widehat{\theta}(y),A(y))$, the likelihood is

L(\theta;y)=(2\pi\theta^{2})^{-n/2}\exp\left(-\frac{(b+2n)\widehat{\theta}^{2}}{4\theta^{2}}+\frac{b\widehat{\theta}}{2\theta}-\frac{n}{2}\right),

where $b\equiv a^{2}+a\sqrt{a^{2}+4n}$.

Now, denote the MLE and the ancillary based only on $y_{1}$ by $\widehat{\theta}_{1}=\widehat{\theta}(y_{1})$ and $a_{1}$. The conditional confidence distribution based on $\widehat{\theta}_{1}|a_{1}$ is

C_{c}(\theta;\widehat{\theta}_{1}|a_{1})=P_{\theta}(\widehat{\theta}(Y_{1})\geq\widehat{\theta}_{1}|a_{1})=\frac{1}{\Phi(a_{1})}\Phi\left(\frac{-a_{1}-\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}+a_{1}\right),

which is now a valid confidence distribution, with density

c_{c}(\theta;\widehat{\theta}_{1}|a_{1})=\frac{\partial}{\partial\theta}C_{c}(\theta;\widehat{\theta}_{1}|a_{1})=\frac{1}{\Phi(a_{1})}\frac{a_{1}+\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta^{2}}\phi\left(\frac{-a_{1}-\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}+a_{1}\right).

The implied prior is $c_{0}(\theta;\widehat{\theta}_{1}|a_{1})\propto c_{c}(\theta;\widehat{\theta}_{1}|a_{1})/L(\theta;y_{1})\propto\theta^{-1}$. The updating rule gives the full confidence density

c_{f}(\theta;y)\propto c_{0}(\theta)L(\theta;y) \propto\frac{1}{\theta^{n+1}}\exp\left(-\sum\frac{y_{i}^{2}}{2\theta^{2}}+\sum\frac{y_{i}}{\theta}\right)
 \propto\frac{1}{\theta^{n+1}}\exp\left(-\frac{(b+2n)\widehat{\theta}^{2}}{4\theta^{2}}+\frac{b\widehat{\theta}}{2\theta}\right). (18)
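
The following sketch (Python; the simulated data and grid are illustrative) evaluates the full confidence density (18) on a grid, normalizes it, and reads off the confidence of a one-sided region:

    import numpy as np

    rng = np.random.default_rng(7)
    theta0, n = 1.2, 5
    y = rng.normal(theta0, theta0, size=n)          # an N(theta, theta^2) sample

    theta = np.linspace(0.05, 10, 5000)
    dtheta = theta[1] - theta[0]
    log_cf = (-(n + 1) * np.log(theta)
              - (y**2).sum() / (2 * theta**2) + y.sum() / theta)  # log of (18), up to a constant
    c_f = np.exp(log_cf - log_cf.max())
    c_f /= c_f.sum() * dtheta                       # normalized full confidence density

    print(c_f[theta < 1].sum() * dtheta)            # e.g. the full confidence that theta < 1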

In fact, in Appendix A2 we show that, even though it is not sufficient, $\widehat{\theta}_{1}$ still leads to a valid confidence distribution, and the implied prior based on $c_{m}(\theta;\widehat{\theta}_{1})/L(\theta;\widehat{\theta}_{1})$ is the same $c_{0}(\theta)=1/\theta$.

For completeness, Appendix A2 also shows the conditional confidence density derived using Barndorff-Nielsen’s formula (16), showing that we end up with the same implied prior. Instead here we use the exact result from Hinkley (1977). He derived the exact conditional density of $w=\theta^{-1}\sqrt{\sum y_{i}^{2}}$,

p(w|a)=w^{n-1}\exp\{-(w-a)^{2}/2\}/I_{n-1}(a),

where $I_{n-1}(a)=\int_{0}^{\infty}x^{n-1}\exp\{-(x-a)^{2}/2\}dx$. Let $T(y)=\sqrt{\sum y_{i}^{2}}$; then we have

C_{c}(\theta;t|a)=P_{\theta}(T\geq t|a)=P(W\geq w|a)=1-F_{a}(w),

where $F_{a}(w)=\int p(w|a)dw$. Then the confidence density becomes

c_{c}(\theta;t|a)=-\frac{\partial F_{a}(w)}{\partial\theta}=p(w=t/\theta|a)\,t/\theta^{2}=p_{\theta}(t|a)\,t/\theta,

so that the implied prior becomes $c_{0}(\theta;t|a)\propto c_{c}(\theta;t|a)/L(\theta;t|a)\propto 1/\theta$, which is the same as the result from $\widehat{\theta}|a$. Thus, we have

c_{c}(\theta;t|a)=c_{c}(\theta;\widehat{\theta}|a)=c_{f}(\theta;y).

In Appendix A2, it is also shown that $c_{m}(\theta;\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n})$ with $\widehat{\theta}_{i}=\widehat{\theta}(y_{i})$ is a valid confidence density because $C_{m}(\theta\in\mbox{CI}(\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n}))=P_{\theta}(\theta\in\mbox{CI}(\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n}))$. However, it is not epistemic because it does not use the full likelihood, so there is a loss of information.

As numerical illustrations, we compare the exact conditional P-value $P_{\theta}(T>t|a)$ for testing $H_{0}$: $\theta=1$, the corresponding full confidence $C_{f}(\theta)$ at $\theta=1$, and the P-value based on the score test. The latter was computed using the observed Fisher information, suggested by Hinkley (1977) as having good conditional properties. In Figure 3(a), we generate 100 datasets with $n=5$ from $N(\theta,\theta^{2})$ at $\theta=1.2$. The full confidence $C_{f}(\theta<1)$ is computed using the implied prior $c_{0}(\theta)\propto 1/\theta$, and a constant prior $c_{0}(\theta)\propto 1$. Panel (b) shows the result for $n=10$. The full confidence with the implied prior $c_{0}(\theta)\propto 1/\theta$ agrees with the exact conditional probability. The use of the non-implied prior $c_{0}(\theta)\propto 1$ cannot match the exact conditional probability. While the score test maintains the average conditional probability, it has a poor conditional property in these small samples. \Box



Figure 3: Example from the $N(\theta,\theta^{2})$ model. In each panel, the x-axis is the exact conditional P-value $P_{\theta}(T>t|a)$ given the ancillary $a$ for testing $H_{0}$: $\theta=1$. The y-axis is the full confidence $C_{f}(1)\equiv\int_{\theta<1}c_{0}(\theta)L(\theta;y)/m(y)\,d\theta$, using the constant prior $c_{0}(\theta)=1$ (‘+’ symbols) and the implied prior $c_{0}(\theta)=1/\theta$ (circles). Also shown is the corresponding P-value from the score test using Fisher’s observed information (triangles). (a) For $n=5$ and (b) for $n=10$. To show the quality of the approximation for small P-values, the y-axis is expressed as a ratio.

4 Discussion

We have described the Dutch Book argument to establish epistemic confidence, i.e., confidence that is meaningful for an observed confidence interval. Fisher tried to achieve the same purpose with fiducial probability, but the use of the word ’probability’ generated much confusion and controversy, so the concept of fiducial probability has been practically abandoned. The confidence concept, however, is mainstream, although it comes with a frequentist interpretation only, so it applies not to the observed interval but to the procedure. Confidence may not be a probability, but it is an extended likelihood (Pawitan and Lee, 2021), whose ratio is meaningful in hypothesis testing and statistical inference (Lee and Bjørnstad, 2013). The extended likelihood is logically distinct from the classical likelihood. It is well known that we cannot use the likelihood directly for inference, except in special circumstances such as the normal case. Our results show that we can turn a classical likelihood into a confidence density by multiplying it by an implied prior. Furthermore, we obtain epistemic confidence by establishing the absence of relevant subsets.

It might appear that, in trying to obtain epistemic confidence using the Dutch Book argument, we are simply recreating the Bayesian framework. But this is not the case. Bayesian subjective probability is established by an internal consistency requirement of a coherent betting strategy. If you are inconsistent, then an external agent can construct a Dutch Book against you. To avoid the Dutch Book, you have to use an (additive) probability measure. Crucially, in this Bayesian version, there is no reference to the betting market, which according to the fundamental theorem of asset pricing (Ross, 1976) will settle prices based on objective probabilities. So, in our setting, it is still possible to construct a market-linked Dutch Book against an internally consistent subjective Bayesian who ignores the objective probability.

Extra principles in the subjective probability framework have been proposed to deal with the mismatch between the subjective and objective probabilities. For example, in Lewis’s (1980) ‘Principal Principle’

Ps{A|Chance(A)=x}=x,P_{s}\{A|\mbox{Chance}(A)=x\}=x,

where $P_{s}$ denotes the subjective probability and ‘Chance’ the objective probability. So, the Principle simply declares that the subjective probability must be set equal to the objective probability, if the latter exists. Our Dutch Book argument can be used to justify the Principle, so the Principle does not have to come out of the blue with no logical motivation. However, it is worth noting that epistemic confidence is not simply set equal to probability. Instead, it is the consequence of a theorem that establishes the absence of relevant subsets in order to avoid the Dutch Book. In our setup, the frequentist probability applies to a market involving a large number of independent players. Moreover, the rational personal betting price is no longer ‘subjective’, for example, in the choice of the prior. Thus, the conceptual separation of the personal and the market prices allows both epistemic and frequentist meanings of confidence.

Our use of money and bets to define epistemic confidence has some echoes of Shafer and Vovk’s (2001) game-theoretic foundation of probability, an ambitious rebuilding of probability without measure theory. However, their key concept is a sequential game between two players; the word ‘sequential’ makes clear that the game does not involve the risk-free profit from a single transaction that we require in a Dutch Book. Our usage of probability is fully within the Kolmogorov axiomatic system, and we make a clear distinction between probability and confidence.

Conditional inference (Reid, 1995) has traditionally been the main area of statistics that tries to address the epistemic content of confidence intervals. The theory of ancillary statistics and relevant subsets has grown as a result (Basu, 1954, 1955; Ghosh et al., 2010). Conditioning on ancillary statistics is meant to make the inference ‘closer’ to the data at hand, but the proponents of conditional inference go only half-way to the end goal of epistemic confidence that Fisher wanted. The general lack of a unique maximal ancillary is a major stumbling block: it is then possible to come up with distinct relevant subsets with distinct conditional coverage probabilities. This raises an unanswerable question: what then is the ‘proper’ confidence for the observed interval? Our logical tool of the betting market overcomes this problem: in this case the market cannot settle on an unambiguous price, but Corollary 3 still holds in the sense that, as an individual, you are still protected from the Dutch Book. We discuss this further with an example in Appendix A4.

Schweder and Hjort (2016) and Schweder (2018) have been strong proponents of interpreting confidence as ‘epistemic probability.’ We are in general agreement with their sentiment, but it is unclear which version of probability this is. The only established and accepted epistemic probability is the Bayesian probability, but in their writing, the confidence concept is clearly non-Bayesian. Our use of the Dutch Book defines normatively the epistemic value of the confidence while staying within the non-Bayesian framework.

In logic and philosophy, the relevant subset problem is known as the ‘reference class problem’ (Reichenbach, 1949; Hajek, 2007). Venn (1876) in his classic book already recognized the problem: ‘It is obvious that every individual thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things…’, and this affects the probability assignment. We solve this problem by setting a limit to the amount of information as given by the observed data yy generated by the probability model pθ(y)p_{\theta}(y). This approach is in line with the theory of market equilibrium, where it is assumed that information is limited and available to all players. It is not possible to have an equilibrium – hence accepted prices – if information is indefinite, or when the players know that they have access to different pieces of information.

We have limited the current paper to the one-parameter case. Following the proof of Theorem 1, the same result actually holds in the multi-parameter case as long as we consider bounded confidence regions satisfying Condition R4. The problem arises for a marginal parameter of interest, which might implicitly assume unbounded regions for the nuisance parameters. For example, in the normal model with both mean $\mu$ and variance $\sigma^{2}$ unknown, the standard CI for the mean implicitly employs an unbounded interval for the variance:

μ(y¯tα/2sn,y¯+tα/2sn)andσ2(0,)\mu\in\left(\bar{y}-t_{\alpha/2}\frac{s}{\sqrt{n}},\bar{y}+t_{\alpha/2}\frac{s}{\sqrt{n}}\right)\ \text{and}\ \sigma^{2}\in(0,\infty)

where $t_{\alpha/2}$ is the upper $\alpha/2$ quantile of the $t$-distribution with $n-1$ degrees of freedom. Thus, the theorem cannot guarantee the epistemic property of the $t$-interval. In fact, Buehler and Feddersen (1963) and Brown (1967) showed the existence of relevant subsets for the $t$-interval. Extensions of our epistemic confidence results to this problem are of great interest.
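Although we do not pursue this further here, conditional coverage inside a candidate subset is easy to probe by simulation. The following Monte Carlo sketch (ours) only illustrates the checking procedure; the conditioning set $\{|\bar{y}|\le 1.5\,s\}$ is an illustrative choice, not Buehler and Feddersen’s construction, and the 50% interval with $n=2$ is used only as a small example.

```python
# A Monte Carlo sketch (ours) of how conditional coverage of the t-interval can be
# checked inside a candidate conditioning set; the set {|ybar| <= 1.5 s} is illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu, sigma, gamma, B = 2, 0.0, 1.0, 0.5, 200_000
tq = stats.t.ppf(0.5 + gamma / 2, df=n - 1)

y = rng.normal(mu, sigma, size=(B, n))
ybar, s = y.mean(axis=1), y.std(axis=1, ddof=1)
covered = np.abs(ybar - mu) <= tq * s / np.sqrt(n)   # gamma-level t-interval covers mu
subset = np.abs(ybar) <= 1.5 * s                     # candidate conditioning event

print(covered.mean())            # unconditional coverage, approximately gamma
print(covered[subset].mean())    # conditional coverage within the candidate subset
```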

References

Arrow, K. J. and Debreu, G. (1954). Existence of an equilibrium for a competitive economy. Econometrica 22, 265–290.

Barndorff-Nielsen O. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70: 343-365

Berger J.O. and Wolpert R.L. (1988). The likelihood principle. 2nd Edition. Institute of Mathematical Statistics, Hayward, CA

Birnbaum A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57, 269–326.

Bjørnstad J.F. (1996). On the generalization of the likelihood function and likelihood principle. Journal of the American Statistical Association, 91, 791–806.

Brown, L. (1967). The conditional level of Student’s t test. The Annals of Mathematical Statistics, 38, 1068–1071.

Buehler R.J. (1959). Some Validity Criteria for Statistical Inferences. The Annals of Mathematical Statistics, 30, 845–863

Buehler R.J. and Feddersen A.P. (1963). Note on a Conditional Property of Student’s tt. The Annals of Mathematical Statistics, 34, 1098–1100.

de Finetti, B. (1931). On the subjective meaning of probability. English translation in de Finetti (1993), Induction and Probability, pp. 291–321. Clueb: Bologna.

Fisher R.A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26, 528–535.

Fisher R.A. (1933). The concepts of inverse probability and fiducial probability referring to unknown parameters. Proceedings of the Royal Society of London, 139, 343–348.

Fisher R.A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London, 144A, 285.

Fisher R.A. (1950). Contributions to mathematical statistics (pp. 35.173a). Wiley.

Fisher R.A. (1958). The nature of probability. Centennial Review, 2, 261–274.

Fisher R.A. (1973). Statistical methods and scientific inference, 3rd Edition. New York: Hafner.

Ghosh M., Reid N. and Fraser D.A.S. (2010). Ancillary statistics: a review. Statistica Sinica 20, 1309–1332.

Hajek A. (2007). The reference class problem is your problem too. Synthese, 156: 563–585.

Lancaster H.O. (1961). Significance tests in discrete distributions. Journal of the American Statistical Association, 56: 223–234.

Lee Y. and Bjørnstad J.F. (2013). Extended likelihood approach to large-scale multiple testing. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 75, 553-575.

Lee Y., Nelder J.A. and Pawitan Y. (2017). Generalized linear models with random effects: unified analysis via H-likelihood (2nd ed.). CRC Press.

Lewis, D. (1980). A subjectivist’s guide to objective chance. In Ifs (pp. 267-297). Springer, Dordrecht.

Lindley D.V. (1958). Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 20, 102–107.

Pawitan Y. (2001). In all likelihood: Statistical modelling and inference using likelihood. Oxford University Press, Oxford, UK.

Pawitan Y. and Lee Y. (2021). Confidence as likelihood. Statistical Science. To appear.

Ramsey, F. (1926). Truth and probability. In Foundations of Mathematics and other Logical Essays. London: K. Paul, Trench, Trubner and Co. Reprinted in H.E. Kyburg and H.E. Smokler (eds.) (1980), Studies in Subjective Probability, 2nd edn (pp. 25–52). New York: Robert Krieger.

Reichenbach H. (1949). The Theory of Probability. University of California Press.

Reid N. (1995). The roles of conditioning in inference. Statistical Science, 10, 138–157.

Robinson G. K. (1979). Conditional properties of statistical procedures. The Annals of Statistics, 7, 742–755.

Ross, S. (1976), Return, risk and arbitrage. In: I. Friend and J. Bicksler (eds.), Studies in Risk and Return. Cambridge, MA: Ballinger

Schweder T. and Hjort N.L. (2016). Confidence, Likelihood, Probability. Cambridge University Press, Cambridge, UK.

Schweder T. (2018). Confidence is epistemic probability for empirical science. Journal of Statistical Planning and Inference 195, 116–125.

Venn, J. (1876). The Logic of Chance (2nd ed.). Macmillan and Co.

Appendix

A1. Proof of Theorem 1

One main consequence of condition R1 is that we start with a proper confidence density. However, as stated after the theorem, for greater generality our proof allows, instead of starting with $T$, an arbitrary function $c_{0}(\theta,y)$ that satisfies R2 and R3, as long as the resulting full confidence $c(\theta;y)\propto c_{0}(\theta,y)L(\theta;y)$ is proper.

Throughout this part, given a fixed γ(0,1)\gamma\in(0,1), let CIγ(y)\mbox{CI}_{\gamma}(y) denote the γ\gamma-level interval such that

C(θCIγ(y))=CIγ(y)c(θ;y)𝑑θ=γfor any y𝒴,C(\theta\in\mbox{CI}_{\gamma}(y))=\int_{CI_{\gamma}(y)}c(\theta;y)d\theta=\gamma\quad\text{for any }y\in\mathcal{Y},

where 𝒴\mathcal{Y} is the sample space of YY and let RR denote a relevant subset such that

Pθ(θCIγ(Y)|YR)γ+ϵfor any θΘP_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in R)\geq\gamma+\epsilon\quad\text{for any }\theta\in\Theta (19)

or, in the negatively biased case,

Pθ(θCIγ(Y)|YR)γϵfor any θΘP_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in R)\leq\gamma-\epsilon\quad\text{for any }\theta\in\Theta

for some ϵ>0\epsilon>0 where Θ\Theta is the parameter space of θ\theta.

Lemma. Suppose that $R$ is a relevant subset for an interval $\mbox{CI}_{\gamma}(Y)$ and let $\{B_{k}\}$ be a countable partition of $R$, so that $R=\bigsqcup_{k=1}^{\infty}B_{k}$. Then there exists a relevant subset $B_{i}\in\{B_{k}\}$.

Proof. First consider the positively biased case. If Pθ(θCIγ(Y)|YBk)<γ+ϵP_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in B_{k})<\gamma+\epsilon for all kk, then we have

Pθ(θCIγ(Y)|YR)\displaystyle P_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in R) =Pθ(θCIγ(Y)&YR)Pθ(YR)\displaystyle=\frac{P_{\theta}(\theta\in CI_{\gamma}(Y)\ \&\ Y\in R)}{P_{\theta}(Y\in R)}
=k=1Pθ(θCIγ(Y)&YBk)k=1Pθ(YBk)\displaystyle=\frac{\sum_{k=1}^{\infty}P_{\theta}(\theta\in CI_{\gamma}(Y)\ \&\ Y\in B_{k})}{\sum_{k=1}^{\infty}P_{\theta}(Y\in B_{k})}
<k=1(γ+ϵ)Pθ(YBk)k=1Pθ(YBk)=γ+ϵ,\displaystyle<\frac{\sum_{k=1}^{\infty}(\gamma+\epsilon)P_{\theta}(Y\in B_{k})}{\sum_{k=1}^{\infty}P_{\theta}(Y\in B_{k})}=\gamma+\epsilon,

which is a contradiction to (19). Therefore, there must be some $B_{i}\in\{B_{k}\}$ such that

Pθ(θCIγ(Y)|YBi)γ+ϵ\displaystyle P_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in B_{i})\geq\gamma+\epsilon

that is, $B_{i}$ is a relevant subset. The argument for the negatively biased case is analogous, yielding some $B_{i}$ with $P_{\theta}(\theta\in CI_{\gamma}(Y)|Y\in B_{i})\leq\gamma-\epsilon$. $\Box$

Proof of Theorem 1. By the definition of $c(\theta;y)$ above, there is a non-negative function $m(y)$ such that $c_{0}(\theta;y)=m(y)c(\theta;y)/L(\theta;y)$. First we show that the existence of a positively biased relevant subset $R$ leads to a contradiction. For an arbitrary $\gamma\in(0,1)$, suppose that there exists $\epsilon>0$ such that

Pθ(θCIγ(Y)|YR)γ+ϵfor any θΘ.P_{\theta}(\theta\in\mbox{CI}_{\gamma}(Y)|Y\in R)\geq\gamma+\epsilon\quad\text{for any }\theta\in\Theta.

Let δ\delta and ξ\xi be arbitrary numbers satisfying γ(1γϵ)/(γ+ϵ)<δ<(1γ)\gamma(1-\gamma-\epsilon)/(\gamma+\epsilon)<\delta<(1-\gamma) and

0<ξ<12log((γ+ϵ)(γ+δ)γ).\displaystyle 0<\xi<\frac{1}{2}\log\left(\frac{(\gamma+\epsilon)(\gamma+\delta)}{\gamma}\right). (20)

By the uniform continuity of logc0(θ;y)\log c_{0}(\theta;y), there exists d>0d>0 such that

y1y2<d|logc0(θ;y1)logc0(θ;y2)|<ξ\displaystyle\|y_{1}-y_{2}\|<d\Rightarrow|\log c_{0}(\theta;y_{1})-\log c_{0}(\theta;y_{2})|<\xi

Let $\{A_{k}\}$ be the partition of $\mathbb{R}^{n}$ induced by an $n$-dimensional grid with spacing $d/\sqrt{n}$, where $n$ is the sample size, so that any two points in the same cell are within distance $d$ of each other. Define $B_{k}=A_{k}\cap R$; then $\{B_{k}\}$ is a countable partition of the relevant subset $R$. By the Lemma, there exists a relevant subset $B_{i}\in\{B_{k}\}$ such that

Pθ(θCIγ(Y)|YBi)γ+ϵfor any θΘ.P_{\theta}(\theta\in\mbox{CI}_{\gamma}(Y)|Y\in B_{i})\geq\gamma+\epsilon\quad\text{for any }\theta\in\Theta.

In integral form, this is

BiI(θCIγ(y))fθ(y)𝑑yBi(γ+ϵ)fθ(y)𝑑yfor any θΘ.\int_{B_{i}}I_{(\theta\in CI_{\gamma}(y))}f_{\theta}(y)dy\geq\int_{B_{i}}(\gamma+\epsilon)f_{\theta}(y)dy\quad\text{for any }\theta\in\Theta. (21)

Since y1y2<d\|y_{1}-y_{2}\|<d for any y1,y2Biy_{1},y_{2}\in B_{i}, from the regularity condition R3 we have

eξ<c0(θ;y1)c0(θ;y2)<eξfor any y1,y2Bi.\displaystyle e^{-\xi}<\frac{c_{0}(\theta;y_{1})}{c_{0}(\theta;y_{2})}<e^{\xi}\quad\text{for any }y_{1},y_{2}\in B_{i}. (22)

Let CIγ+δ(Y)CIγ(Y)\mbox{CI}_{\gamma+\delta}(Y)\supset\mbox{CI}_{\gamma}(Y) be the (γ+δ)(\gamma+\delta)-level confidence interval. By the regularity condition R4, there exists a compact set ΘiΘ\Theta_{i}\subseteq\Theta such that CIγ+δ(y)Θi\mbox{CI}_{\gamma+\delta}(y)\subseteq\Theta_{i} for any yBiy\in B_{i}. Then,

Θic(θ;y)𝑑θCIγ+δ(y)c(θ;y)𝑑θ=γ+δfor any yBi.\displaystyle\int_{\Theta_{i}}c(\theta;y)d\theta\geq\int_{CI_{\gamma+\delta}(y)}c(\theta;y)d\theta=\gamma+\delta\quad\text{for any }y\in B_{i}. (23)

Now let $y^{*}$ be an arbitrary observation in $B_{i}$ and integrate (21) over $\Theta_{i}$ with weight $c_{0}(\theta;y^{*})$; then we have

ΘiBiI(θCIγ(y))fθ(y)𝑑yc0(θ;y)𝑑θΘiBi(γ+ϵ)fθ(y)𝑑yc0(θ;y)𝑑θ.\int_{\Theta_{i}}\int_{B_{i}}I_{(\theta\in CI_{\gamma}(y))}f_{\theta}(y)dyc_{0}(\theta;y^{*})d\theta\geq\int_{\Theta_{i}}\int_{B_{i}}(\gamma+\epsilon)f_{\theta}(y)dyc_{0}(\theta;y^{*})d\theta. (24)

By the regularity condition R2,

0\displaystyle 0 ΘiBiI(θCIγ(y))fθ(y)𝑑yc0(θ;y)𝑑θΘiBifθ(y)𝑑yc0(θ;y)𝑑θ\displaystyle\leq\int_{\Theta_{i}}\int_{B_{i}}I_{(\theta\in CI_{\gamma}(y))}f_{\theta}(y)dyc_{0}(\theta;y^{*})d\theta\leq\int_{\Theta_{i}}\int_{B_{i}}f_{\theta}(y)dyc_{0}(\theta;y^{*})d\theta
ΘiPθ(YBi)c0(θ;y)𝑑θΘic0(θ;y)𝑑θ<.\displaystyle\leq\int_{\Theta_{i}}P_{\theta}(Y\in B_{i})c_{0}(\theta;y^{*})d\theta\leq\int_{\Theta_{i}}c_{0}(\theta;y^{*})d\theta<\infty.

Thus, the order of integration in (24) can be exchanged by Fubini’s theorem to yield

BiΘiI(θCI(y))fθ(y)c0(θ;y)𝑑θ𝑑yBiΘi(γ+ϵ)fθ(y)c0(θ;y)𝑑θ𝑑y.\displaystyle\int_{B_{i}}\int_{\Theta_{i}}I_{(\theta\in CI(y))}f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy\geq\int_{B_{i}}\int_{\Theta_{i}}(\gamma+\epsilon)f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy. (25)

Note here that the inequality (22) implies that

eξc(θ;y)m(y)<fθ(y)c0(θ;y)=fθ(y)c0(θ;y)c0(θ;y)c0(θ;y)<eξc(θ;y)m(y).\displaystyle e^{-\xi}c(\theta;y)m(y)<f_{\theta}(y)c_{0}(\theta;y^{*})=f_{\theta}(y)c_{0}(\theta;y)\frac{c_{0}(\theta;y^{*})}{c_{0}(\theta;y)}<e^{\xi}c(\theta;y)m(y).

Then the left-hand-side of (25) becomes

BiΘiI(θCIγ(y))fθ(y)c0(θ;y)𝑑θ𝑑y\displaystyle\int_{B_{i}}\int_{\Theta_{i}}I_{(\theta\in CI_{\gamma}(y))}f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy =BiCIγ(y)fθ(y)c0(θ;y)𝑑θ𝑑y\displaystyle=\int_{B_{i}}\int_{CI_{\gamma}(y)}f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy
<BieξCIγ(y)c(θ;y)𝑑θm(y)𝑑y=eξγBim(y)𝑑y\displaystyle<\int_{B_{i}}e^{\xi}\int_{CI_{\gamma}(y)}c(\theta;y)d\theta m(y)dy=e^{\xi}\gamma\int_{B_{i}}m(y)dy

and the right-hand-side of (25) becomes

BiΘi(γ+ϵ)fθ(y)c0(θ;y)𝑑θ𝑑y\displaystyle\int_{B_{i}}\int_{\Theta_{i}}(\gamma+\epsilon)f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy >Bi(γ+ϵ)eξΘic(θ;y)𝑑θm(y)𝑑y\displaystyle>\int_{B_{i}}(\gamma+\epsilon)e^{-\xi}\int_{\Theta_{i}}c(\theta;y)d\theta m(y)dy
eξ(γ+ϵ)(γ+δ)Bim(y)𝑑y.\displaystyle\geq e^{-\xi}(\gamma+\epsilon)(\gamma+\delta)\int_{B_{i}}m(y)dy.

by the inequality (23). Thus we have

eξγ>eξ(γ+ϵ)(γ+δ),e^{\xi}\gamma>e^{-\xi}(\gamma+\epsilon)(\gamma+\delta),

which is a contradiction to (20).

For the negatively biased case, let $\xi$ be an arbitrary number satisfying

0<ξ<12log(γγϵ)\displaystyle 0<\xi<\frac{1}{2}\log\left(\frac{\gamma}{\gamma-\epsilon}\right) (26)

and replace the inequality (25) by

BiΘiI(θCI(y))fθ(y)c0(θ;y)𝑑θ𝑑yBiΘi(γϵ)fθ(y)c0(θ;y)𝑑θ𝑑y.\displaystyle\int_{B_{i}}\int_{\Theta_{i}}I_{(\theta\in CI(y))}f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy\leq\int_{B_{i}}\int_{\Theta_{i}}(\gamma-\epsilon)f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy.

Then the left-hand-side becomes

BiΘiI(θCI(y))fθ(y)c0(θ;y)𝑑θ𝑑y>eξγBim(y)𝑑y\displaystyle\int_{B_{i}}\int_{\Theta_{i}}I_{(\theta\in CI(y))}f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy>e^{-\xi}\gamma\int_{B_{i}}m(y)dy

and the right-hand-side becomes

BiΘi(γϵ)fθ(y)c0(θ;y)𝑑θ𝑑y<eξ(γϵ)Bim(y)𝑑y\displaystyle\int_{B_{i}}\int_{\Theta_{i}}(\gamma-\epsilon)f_{\theta}(y)c_{0}(\theta;y^{*})d\theta dy<e^{\xi}(\gamma-\epsilon)\int_{B_{i}}m(y)dy

which together contradict (26). $\Box$

A2. Curved exponential family: N(θ,θ2)N(\theta,\theta^{2})

We give more details of the N(θ,θ2)N(\theta,\theta^{2}) model. Denote the MLE based only on y1y_{1} by θ^(y1)=θ^1\widehat{\theta}(y_{1})=\widehat{\theta}_{1}. The confidence distribution is

Cm(θ;θ^1)\displaystyle C_{m}(\theta;\widehat{\theta}_{1}) =Pθ(θ^(Y1)θ^1)=Pθ(Y1+5Y122θ^1)\displaystyle=P_{\theta}(\widehat{\theta}(Y_{1})\geq\widehat{\theta}_{1})=P_{\theta}\left(\frac{-Y_{1}+\sqrt{5Y_{1}^{2}}}{2}\geq\widehat{\theta}_{1}\right)
=Pθ(Y12θ^151)+Pθ(Y12θ^15+1)\displaystyle=P_{\theta}\left(Y_{1}\geq\frac{2\widehat{\theta}_{1}}{\sqrt{5}-1}\right)+P_{\theta}\left(Y_{1}\leq\frac{-2\widehat{\theta}_{1}}{\sqrt{5}+1}\right)
=1Φ(1+52θ^1θ1)+Φ(152θ^1θ1).\displaystyle=1-\Phi\left(\frac{1+\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}-1\right)+\Phi\left(\frac{1-\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}-1\right).

Then $\lim_{\theta\rightarrow\infty}C_{m}(\theta;\widehat{\theta}_{1})=1$ and $\lim_{\theta\rightarrow 0}C_{m}(\theta;\widehat{\theta}_{1})=0$, so that the right-side P-value $P_{\theta}(\widehat{\theta}(Y_{1})\geq\widehat{\theta}_{1})$ leads to a proper distribution function. The corresponding confidence density is

cm1(θ;θ^1)=1+52θ^1θ2ϕ(1+52θ^1θ1)152θ^1θ2ϕ(152θ^1θ1).\displaystyle c_{m1}(\theta;\widehat{\theta}_{1})=\frac{1+\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta^{2}}\phi\left(\frac{1+\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}-1\right)-\frac{1-\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta^{2}}\phi\left(\frac{1-\sqrt{5}}{2}\frac{\widehat{\theta}_{1}}{\theta}-1\right).

The implied prior based on a single $\widehat{\theta}_{i}$ (in particular $\widehat{\theta}_{1}$) is

c0mi(θ;θ^i)cmi(θ;θ^i)/L(θ;θ^i)θ1c_{0mi}(\theta;\widehat{\theta}_{i})\propto c_{mi}(\theta;\widehat{\theta}_{i})/L(\theta;\widehat{\theta}_{i})\propto\theta^{-1}

where L(θ;θ^i)L(\theta;\widehat{\theta}_{i}) is the likelihood function from pθ(θ^i)p_{\theta}(\widehat{\theta}_{i}).
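The proportionality to $1/\theta$ can be checked numerically: multiplying the ratio $c_{m1}(\theta;\widehat{\theta}_{1})/p_{\theta}(\widehat{\theta}_{1})$ by $\theta$ should not depend on $\theta$. A small sketch (ours; the sampling density $p_{\theta}(\widehat{\theta}_{1})$ is assembled from the two branches $y_{1}=\frac{1+\sqrt{5}}{2}\widehat{\theta}_{1}$ and $y_{1}=\frac{1-\sqrt{5}}{2}\widehat{\theta}_{1}$ of the map $y_{1}\mapsto\widehat{\theta}(y_{1})$):

```python
# A small numerical check (ours) that the implied prior based on theta_hat_1 alone is
# proportional to 1/theta: theta * c_m1(theta; that1) / p_theta(that1) is constant in theta.
import numpy as np
from scipy.stats import norm

ALPHA, BETA = (1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2   # branches y1 = ALPHA*that1, y1 = BETA*that1

def c_m1(theta, that1):
    """Marginal confidence density of theta based on theta_hat_1 (formula above)."""
    return (ALPHA * that1 / theta**2 * norm.pdf(ALPHA * that1 / theta - 1)
            - BETA * that1 / theta**2 * norm.pdf(BETA * that1 / theta - 1))

def p_that1(theta, that1):
    """Sampling density of theta_hat(Y1) at that1, summing the two branches of y1."""
    return (ALPHA / theta * norm.pdf(ALPHA * that1 / theta - 1)
            + abs(BETA) / theta * norm.pdf(BETA * that1 / theta - 1))

that1 = 0.8
for theta in (0.5, 1.0, 2.0, 5.0):
    print(theta, theta * c_m1(theta, that1) / p_that1(theta, that1))   # constant in theta
```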

On the other hand, if we construct the full confidence densities by

cfi(θ;θ^(yi),y(i))cmi(θ;θ^(yi))L(θ;y(i)),c_{fi}(\theta;\widehat{\theta}(y_{i}),y_{(-i)})\propto c_{mi}(\theta;\widehat{\theta}(y_{i}))L(\theta;y_{(-i)}),

then the resulting confidence density depends on the choice of yiy_{i}. In this case we should consider cfi(θ;θ^i,y(i))c_{fi}(\theta;\widehat{\theta}_{i},y_{(-i)}) as an approximation to cf(θ;y)c_{f}(\theta;y). Figure 4 plots nn confidence densities cfi(θ;θ^i,y(i))c_{fi}(\theta;\widehat{\theta}_{i},y_{(-i)}) (solid) and cf(θ;y)c_{f}(\theta;y) (circle) with yiy_{i} from N(1,1)N(1,1). As shown in (b), when nn becomes large, the difference becomes negligible and cfi(θ;θ^i,y(i))c_{fi}(\theta;\widehat{\theta}_{i},y_{(-i)}) gets closer to cf(θ;y)c_{f}(\theta;y) (circle).

Figure 4: Confidence densities cfi(θ;θ^i)c_{fi}(\theta;\widehat{\theta}_{i}) for i=1,2,,ni=1,2,\cdots,n (solid), cf(θ;y)c_{f}(\theta;y) (circle), cm(θ;θ^1,,θ^n)c_{m}(\theta;\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n}) (cross). (a) n=3n=3, (b) n=10n=10.

So

cfi(θ;θ^(yi),y(i))θ1L(θ;θ^i)L(θ;y(i))=L(θ;θ^i)L(θ;yi)c0(θ)L(θ;y).c_{fi}(\theta;\widehat{\theta}(y_{i}),y_{(-i)})\propto\theta^{-1}L(\theta;\widehat{\theta}_{i})L(\theta;y_{(-i)})=\frac{L(\theta;\widehat{\theta}_{i})}{L(\theta;y_{i})}c_{0}(\theta)L(\theta;y).

There is a loss of information caused by using cmi(θ)c_{mi}(\theta), due to the sign of yiy_{i} as captured by L(θ;ai)L(\theta;a_{i}). This is negligible even in small samples; see Figure 4. However, the marginal confidence

cm(θ;θ^1,,θ^n)θ1L(θ;θ^1,,θ^n)c_{m}(\theta;\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n})\propto\theta^{-1}L(\theta;\widehat{\theta}_{1},\cdots,\widehat{\theta}_{n})

has a larger loss of information, as shown in both Figure 4(a) and (b).

Figure 5 plots the logarithms of the implied prior $c_{0m1}(\theta;\widehat{\theta}_{1})\propto 1/\theta$ (dotted) and of $q_{1}(\theta;y_{1})\propto c_{m1}(\theta;\widehat{\theta}_{1})/L(\theta;y_{1})$ (solid), properly scaled. Note that $q_{1}$ is not uniformly continuous in $y_{1}$, because the information in $L(\theta;y_{1})\propto L(\theta;\widehat{\theta}_{1}|a_{1})$ and in $L(\theta;\widehat{\theta}_{1})$ differ.

Figure 5: $\log(c_{0m1}(\theta;\widehat{\theta}_{1}))$ (dotted) and $\log(q_{1}(\theta;y_{1}))$ (solid) as $y_{1}$ varies.

It is also possible to compute the conditional confidence density using Barndorff-Nielsen’s formula (16) and to show that we end up with the same implied prior $c_{0}(\theta)=1/\theta$. First, the likelihood ratio is given by

L(θ)L(θ^)=θ^nθnexp[b+2n4(θ^2θ21)+b2(θ^θ1)],\displaystyle\frac{L(\theta)}{L(\widehat{\theta})}=\frac{\widehat{\theta}^{n}}{\theta^{n}}\exp\left[-\frac{b+2n}{4}\left(\frac{\widehat{\theta}^{2}}{\theta^{2}}-1\right)+\frac{b}{2}\left(\frac{\widehat{\theta}}{\theta}-1\right)\right],

where ba2+aa2+4nb\equiv a^{2}+a\sqrt{a^{2}+4n}, and the observed Fisher information

I(θ^)=2logL(θ)θ2|θ=θ^=b+4n2θ^2.\displaystyle I(\widehat{\theta})=-\frac{\partial^{2}\log L(\theta)}{\partial\theta^{2}}\Big{|}_{\theta=\widehat{\theta}}=\frac{b+4n}{2\widehat{\theta}^{2}}.

Then we have

pθ(θ^|a)\displaystyle p_{\theta}(\widehat{\theta}|a)\approx c2b+4nθ^θ^nθnexp[b+2n4(θ^2θ21)+b2(θ^θ1)].\displaystyle\frac{c}{\sqrt{2}}\frac{\sqrt{b+4n}}{\widehat{\theta}}\frac{\widehat{\theta}^{n}}{\theta^{n}}\exp\left[-\frac{b+2n}{4}\left(\frac{\widehat{\theta}^{2}}{\theta^{2}}-1\right)+\frac{b}{2}\left(\frac{\widehat{\theta}}{\theta}-1\right)\right].

Let Uθ^(Y)/θU\equiv\widehat{\theta}(Y)/\theta and let u=θ^(y)/θu=\widehat{\theta}(y)/\theta, then the conditional density of u|au|a becomes

pθ(u|a)\displaystyle p_{\theta}(u|a)\approx cb+4n2un1exp[b+2n4(u21)+b2(u1)],\displaystyle\frac{c\sqrt{b+4n}}{\sqrt{2}}u^{n-1}\exp\left[-\frac{b+2n}{4}\left(u^{2}-1\right)+\frac{b}{2}\left(u-1\right)\right],

which does not depend on $\theta$. Let $F_{a}(u)=\int_{0}^{u}p(x|a)\,dx$; then

Cc(θ;θ^|a)=Pθ(θ^(Y)θ^)=Pθ(Uθ^/θ)=1Fa(θ^/θ).\displaystyle C_{c}(\theta;\widehat{\theta}|a)=P_{\theta}(\widehat{\theta}(Y)\geq\widehat{\theta})=P_{\theta}(U\geq\widehat{\theta}/\theta)=1-F_{a}(\widehat{\theta}/\theta).

It gives the conditional confidence density

c_{c}(\theta;\widehat{\theta}|a)=-\frac{\partial F_{a}(\widehat{\theta}/\theta)}{\partial\theta}\approx\frac{c\sqrt{b+4n}}{\sqrt{2}}\frac{\widehat{\theta}^{n}}{\theta^{n+1}}\exp\left[-\frac{b+2n}{4}\left(\frac{\widehat{\theta}^{2}}{\theta^{2}}-1\right)+\frac{b}{2}\left(\frac{\widehat{\theta}}{\theta}-1\right)\right],

and the implied prior is $c_{0}(\theta;\widehat{\theta}|a)\propto c_{c}(\theta;\widehat{\theta}|a)/L(\theta;y)\propto 1/\theta$. Thus, the conditional confidence density from Barndorff-Nielsen’s formula becomes

cc(θ;θ^|a)=cf(θ;y)θ1L(θ;y),c_{c}(\theta;\widehat{\theta}|a)=c_{f}(\theta;y)\propto\theta^{-1}L(\theta;y),

which is the same as the full confidence (3.3).
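As a numerical cross-check (ours), the normalized Barndorff-Nielsen density of $u\,|\,a$ above can be compared with Hinkley’s exact conditional density after the change of variable $w=ku$ with $k=(\sqrt{a^{2}+4n}+a)/2$ (since $t=k\widehat{\theta}$); their agreement is consistent with the conclusion above.

```python
# A numerical cross-check (ours) that the normalized approximation p(u | a) above agrees
# with Hinkley's exact conditional density transformed to u = theta_hat / theta, using
# w = k * u with k = (sqrt(a^2 + 4n) + a) / 2; the values n = 5, a = 1.3 are illustrative.
import numpy as np
from scipy.integrate import quad

n, a = 5, 1.3
k = (np.sqrt(a**2 + 4 * n) + a) / 2
b = a**2 + a * np.sqrt(a**2 + 4 * n)

bn_kernel = lambda u: u**(n - 1) * np.exp(-(b + 2 * n) / 4 * (u**2 - 1) + b / 2 * (u - 1))
ex_kernel = lambda u: (k * u)**(n - 1) * np.exp(-(k * u - a)**2 / 2)

bn_norm, _ = quad(bn_kernel, 0, np.inf)
ex_norm, _ = quad(ex_kernel, 0, np.inf)

for u in (0.5, 0.8, 1.0, 1.2, 2.0):
    print(u, bn_kernel(u) / bn_norm, ex_kernel(u) / ex_norm)   # the two columns agree
```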

A3. Discrete case

A complication arises in the discrete case since the definition of the P-value is not unique, and the coverage probability cannot exactly match the chosen confidence level. Given the observed statistic $T=t$, among several candidates, the mid P-value

Pθ(T>t)+12Pθ(T=t)P_{\theta}(T>t)+\frac{1}{2}P_{\theta}(T=t)

is often considered the most appropriate (Lancaster, 1961).

We shall discuss the specific case of the binomial and negative binomial models: Y1Bin(n,θ)Y_{1}\sim Bin(n,\theta) and Y2NB(y,θ)Y_{2}\sim NB(y,\theta). The two models have an identical likelihood, proportional to L(θ)θy(1θ)ny,L(\theta)\propto\theta^{y}(1-\theta)^{n-y}, but have different probability mass functions, respectively

Pθ(Y1=y)=(ny)θy(1θ)ny and Pθ(Y2=n)=(n1y1)θy(1θ)nyP_{\theta}(Y_{1}=y)={\binom{n}{y}}\theta^{y}(1-\theta)^{n-y}\text{ and }P_{\theta}(Y_{2}=n)={\binom{n-1}{y-1}}\theta^{y}(1-\theta)^{n-y}

Thus, they have the common MLE θ^=θ^1=θ^2=y/n\widehat{\theta}=\widehat{\theta}_{1}=\widehat{\theta}_{2}=y/n. However, the two MLEs have different supports

θ^1{0,1n,,n1n,1} and θ^2{1,yy+1,yy+2},\widehat{\theta}_{1}\in\left\{0,\frac{1}{n},\cdots,\frac{n-1}{n},1\right\}\text{ and }\widehat{\theta}_{2}\in\left\{1,\frac{y}{y+1},\frac{y}{y+2}\cdots\right\},

and therefore $\widehat{\theta}_{1}$ and $\widehat{\theta}_{2}$ have different distributions, which lead to different P-values. Statistical models such as the binomial and negative binomial describe how unobserved future data will be generated. Thus, all the information about $\theta$ in the data and in the statistical model is contained in the extended likelihood. The use of the mid P-values

C(θ;y1=y)\displaystyle C(\theta;y_{1}=y) =12Pθ(θ^1=yn)+Pθ(θ^1>yn)=12(Iθ(y,ny+1)+Iθ(y+1,ny))\displaystyle=\frac{1}{2}P_{\theta}\left(\widehat{\theta}_{1}=\frac{y}{n}\right)+P_{\theta}\left(\widehat{\theta}_{1}>\frac{y}{n}\right)=\frac{1}{2}\Big{(}I_{\theta}(y,n-y+1)+I_{\theta}(y+1,n-y)\Big{)}
C(θ;y2=n)\displaystyle C(\theta;y_{2}=n) =12Pθ(θ^2=yn)+Pθ(θ^2>yn)=12(Iθ(y,ny+1)+Iθ(y,ny))\displaystyle=\frac{1}{2}P_{\theta}\left(\widehat{\theta}_{2}=\frac{y}{n}\right)+P_{\theta}\left(\widehat{\theta}_{2}>\frac{y}{n}\right)=\frac{1}{2}\Big{(}I_{\theta}(y,n-y+1)+I_{\theta}(y,n-y)\Big{)}

lead to different confidence densities

c(θ;y1)=\displaystyle c(\theta;y_{1})= 12(θy1(1θ)nyB(y,ny+1)+θy(1θ)ny1B(y+1,ny))\displaystyle\frac{1}{2}\left(\frac{\theta^{y-1}(1-\theta)^{n-y}}{B(y,n-y+1)}+\frac{\theta^{y}(1-\theta)^{n-y-1}}{B(y+1,n-y)}\right)
c(θ;y2)=\displaystyle c(\theta;y_{2})= 12(θy1(1θ)nyB(y,ny+1)+θy1(1θ)ny1B(y,ny))\displaystyle\frac{1}{2}\left(\frac{\theta^{y-1}(1-\theta)^{n-y}}{B(y,n-y+1)}+\frac{\theta^{y-1}(1-\theta)^{n-y-1}}{B(y,n-y)}\right)

where Iθ(,)I_{\theta}(\cdot,\cdot) is the regularized incomplete beta function and B(,)B(\cdot,\cdot) is the beta function.
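These expressions are easy to evaluate; a short sketch (ours, using scipy’s regularized incomplete beta function and beta density; the density formulas require $1\le y\le n-1$):

```python
# A short sketch (ours) of the mid-P confidence distributions and densities above,
# for the binomial and negative binomial models sharing the same likelihood.
import numpy as np
from scipy.special import betainc          # regularized incomplete beta I_x(a, b)
from scipy.stats import beta

def conf_binomial(theta, y, n):
    """C(theta; y1 = y) for Y1 ~ Bin(n, theta)."""
    return 0.5 * (betainc(y, n - y + 1, theta) + betainc(y + 1, n - y, theta))

def conf_negbinomial(theta, y, n):
    """C(theta; y2 = n) for Y2 ~ NB(y, theta)."""
    return 0.5 * (betainc(y, n - y + 1, theta) + betainc(y, n - y, theta))

def cdens_binomial(theta, y, n):
    return 0.5 * (beta.pdf(theta, y, n - y + 1) + beta.pdf(theta, y + 1, n - y))

def cdens_negbinomial(theta, y, n):
    return 0.5 * (beta.pdf(theta, y, n - y + 1) + beta.pdf(theta, y, n - y))

grid = np.linspace(0.05, 0.95, 5)
print(conf_binomial(grid, y=3, n=10))       # two confidence distributions for the same
print(conf_negbinomial(grid, y=3, n=10))    # likelihood, differing because the models differ
```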

Figure 6: Coverage probabilities of the intervals from the confidence distribution based on the mid p-value for binomial models at n=10,50,100n=10,50,100 (top) and negative binomial models at y=10,50,100y=10,50,100 (bottom).

Figure 6 shows the coverage probabilities of the 95% two-sided confidence procedure based on the mid P-value of $\widehat{\theta}$ for the binomial and negative binomial models. The coverage probabilities fluctuate around 0.95, but they are not consistently biased in one direction. Moreover, as $n$ or $y$ becomes larger, the difference between the coverage probability and the confidence becomes smaller. In the discrete case, it is not possible to achieve an exact objective coverage probability for the CI procedure; here the confidence is a consistent estimate of the objective coverage probability. In the negative binomial model with $\theta=1$ we have $y=n$ with probability 1, so the procedure behaves like the binomial confidence procedure with $n=y$.
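For the binomial case, the coverage curves in Figure 6 reduce to a finite sum, assuming equal-tailed intervals read off the mid-P confidence distribution: the 95% interval contains $\theta$ exactly when $0.025\le C(\theta;y)\le 0.975$. A minimal sketch (ours):

```python
# A minimal sketch (ours) of the binomial coverage computation behind Figure 6, assuming
# equal-tailed 95% intervals from the mid-P confidence distribution.
import numpy as np
from scipy.stats import binom

def midp_conf(theta, y, n):
    """Mid-P confidence distribution C(theta; y) = P_theta(Y > y) + 0.5 * P_theta(Y = y)."""
    return binom.sf(y, n, theta) + 0.5 * binom.pmf(y, n, theta)

def coverage(theta, n, lo=0.025, hi=0.975):
    # theta is covered by CI(y) exactly when lo <= C(theta; y) <= hi, because C is
    # increasing in theta; sum the binomial probabilities of those y.
    ys = np.arange(n + 1)
    c = midp_conf(theta, ys, n)
    inside = (c >= lo) & (c <= hi)
    return binom.pmf(ys, n, theta)[inside].sum()

for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(theta, coverage(theta, n=10))      # fluctuates around 0.95
```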

Besides the information in the likelihood, the confidence uses information from the statistical model. Consider two different statistical models, M1: $N=X+1$ where $X\sim$ Poisson($\eta_{1}$) and $Y_{1}|N=n\sim Bin(n,\theta)$, and M2: $Y=X+1$ where $X\sim$ Poisson($\eta_{2}$) and $Y_{2}|Y=y\sim NB(y,\theta)$. In M1, $Y_{1}|N=n$ and, in M2, $Y_{2}|Y=y$ have a common likelihood, but they are different models, so there is no reason for them to have a common confidence.

A4. When maximal ancillaries are not unique

When the maximal ancillary is not unique, the conditional coverage probability may depend on the choice of ancillary. However, the lack of a unique ancillary does not affect the validity of Corollary 3 on the absence of relevant subsets. We illustrate this with an example from Evans (2013). The data $y=(y_{1},y_{2})$ are sampled from a joint distribution with probabilities under $\theta$ given in the following table:

(y1,y2)(y_{1},y_{2}) (1,1)(1,1) (1,2)(1,2) (2,1)(2,1) (2,2)(2,2)
θ=1\theta=1 1/6 1/6 2/6 2/6
θ=2\theta=2 1/12 3/12 5/12 3/12

Here both the data $y$ and the parameter $\theta$ are discrete. Strictly speaking, our theory does not cover this case, but we use it because it clearly illustrates the issues with non-unique maximal ancillaries. The marginal probabilities are

Pθ(Y1=1)=1/3;Pθ(Y1=2)=2/3\displaystyle P_{\theta}(Y_{1}=1)=1/3;\ \ P_{\theta}(Y_{1}=2)=2/3
Pθ(Y2=1)=Pθ(Y2=2)=1/2,\displaystyle P_{\theta}(Y_{2}=1)=P_{\theta}(Y_{2}=2)=1/2,

for θ=1,2\theta=1,2. So both Y1Y_{1} and Y2Y_{2} are ancillaries, i.e., their probabilities do not depend on θ\theta. The conditional probabilities of (y1,y2)(y_{1},y_{2}) given Y1=1Y_{1}=1 are

(y1,y2)(y_{1},y_{2}) (1,1)(1,1) (1,2)(1,2)
θ=1\theta=1 1/2 1/2
θ=2\theta=2 1/4 3/4

and, given Y2=1Y_{2}=1 are

(y1,y2)(y_{1},y_{2}) (1,1)(1,1) (2,1)(2,1)
θ=1\theta=1 1/3 2/3
θ=2\theta=2 1/6 5/6.

Based on the unconditional model, on observing $(y_{1},y_{2})=(1,1)$ we have the likelihood function $L(\theta=1)=1$ and $L(\theta=2)=1/2$, so the MLE is $\widehat{\theta}=1$. For $(y_{1},y_{2})=(2,2)$ we have a different likelihood function, but still $\widehat{\theta}=1$. This means we cannot reconstruct the likelihood from the MLE alone, hence the MLE is not sufficient. But we can see immediately that we get the same likelihood function under the conditional model given $Y_{1}=1$ or given $Y_{2}=1$, so conditioning on each ancillary recovers the full likelihood and each ancillary is maximal.

Now consider using the MLE itself as a ‘CI’. Conditional on the ancillaries, the probability that the MLE is correct is

P1(θ^=θ|Y1=1)=P1(Y2=1|Y1=1)=1/2\displaystyle P_{1}(\widehat{\theta}=\theta|Y_{1}=1)=P_{1}(Y_{2}=1|Y_{1}=1)=1/2
P2(θ^=θ|Y1=1)=P2(Y2=2|Y1=1)=3/4\displaystyle P_{2}(\widehat{\theta}=\theta|Y_{1}=1)=P_{2}(Y_{2}=2|Y_{1}=1)=3/4
P_{1}(\widehat{\theta}=\theta|Y_{2}=1)=P_{1}(Y_{1}=1|Y_{2}=1)=1/3
P_{2}(\widehat{\theta}=\theta|Y_{2}=1)=P_{2}(Y_{1}=2|Y_{2}=1)=5/6

These conditional ‘coverage probabilities’ are indeed distinct from each other. However, comparing the conditional coverage probabilities given $Y_{1}$ with those given $Y_{2}$, there is no consistent non-trivial bias in one direction across $\theta$. So, if you use $Y_{1}$ as the ancillary, you cannot construct further relevant subsets based on $Y_{2}$. This is the essence of our remark after Corollary 3 that the lack of a unique maximal ancillary does not affect the validity of the corollary.
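The four conditional probabilities above are easy to verify mechanically; a tiny script (ours) encoding the table:

```python
# A tiny check (ours) of the example above: the joint table, the ancillarity of Y1 and Y2,
# and the conditional probabilities that the MLE is correct.
import numpy as np

# Joint probabilities of (y1, y2) = (1,1), (1,2), (2,1), (2,2) under theta = 1, 2.
p = {1: np.array([1/6, 1/6, 2/6, 2/6]),
     2: np.array([1/12, 3/12, 5/12, 3/12])}
y1 = np.array([1, 1, 2, 2])
y2 = np.array([1, 2, 1, 2])
mle = 1 + np.argmax(np.vstack([p[1], p[2]]), axis=0)   # MLE at each sample point: [1, 2, 2, 1]

# Y1 and Y2 are ancillary: their marginal probabilities are free of theta.
for theta in (1, 2):
    print(theta, p[theta][y1 == 1].sum(), p[theta][y2 == 1].sum())   # 1/3 and 1/2 for both theta

# Conditional probabilities that the MLE is correct, given each ancillary.
for theta in (1, 2):
    for anc, name in ((y1, "Y1"), (y2, "Y2")):
        keep = anc == 1
        cover = p[theta][keep & (mle == theta)].sum() / p[theta][keep].sum()
        print(f"P_{theta}(mle correct | {name} = 1) = {cover:.4f}")
# Expected: 1/2 and 1/3 for theta = 1;  3/4 and 5/6 for theta = 2.
```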

Unfortunately, in this example the P-value is not defined because the parameter $\theta$ can be an unordered label, so it is not possible to compute any version of the confidence function or any implied prior. In the continuous case, we define the CI to satisfy $P_{\theta}(\theta\in\mbox{CI})=\gamma$ for all $\theta$. However, in discrete cases it is often not possible for the coverage probability to be the same for all $\theta$, which violates the condition of Theorem 1. Fisher (1973, Chapter III) suggested that for problems such as this the structure is not sufficient to allow an unambiguous probability-based inference, so only the likelihood is available.