Epistemic confidence, the Dutch Book and relevant subsets
Abstract
We use a logical device called the Dutch Book to establish epistemic confidence, defined as the sense of confidence in an observed confidence interval. This epistemic property is unavailable – or even denied – in orthodox frequentist inference. In financial markets, including the betting market, the Dutch Book is also known as arbitrage or a risk-free profitable transaction. A numerical confidence is deemed epistemic if its use as a betting price is protected from a Dutch Book constructed by an external agent. Theoretically, to construct the Dutch Book, the agent must exploit unused information available in any relevant subset. Pawitan and Lee (2021) showed that confidence is an extended likelihood, and the likelihood principle states that the likelihood contains all the information in the data, hence leaving no relevant subset. Intuitively, this implies that confidence associated with the full likelihood is protected from the Dutch Book, and hence is epistemic. Our aim is to provide the theoretical support for this intuitive notion.
1 Introduction
Given data y – of arbitrary size or complexity – generated from a model p_θ(y) indexed by a scalar parameter of interest θ, a confidence interval CI(y) is computed with coverage probability P_θ{θ ∈ CI(y)} = γ.
We are interested in the epistemic confidence, defined as the sense of confidence in the observed CI(y). (For simplicity, we shall often drop the explicit dependence on y from the CI.) Arguably, this is what we want from a CI, but the orthodox frequentist view is emphatic that the probability γ does not apply to the observed interval but to the procedure. There are well-known examples justifying this position; see Example 1 below. In confidence interval theory, the coverage probability is called the confidence level. So, in the frequentist theory, ‘confidence’ has actually no separate meaning from probability; in particular, it has no epistemic property. Schweder and Hjort (2016) and Schweder (2018) have been strong proponents of interpreting confidence as ‘epistemic probability.’ However, their view is not commonly accepted. Traditionally, only the Bayesians have no problem in stating that their subjective probability is epistemic. How do they achieve that? Is there a way to make a non-Bayesian confidence epistemic?
Frequentists interpret probability as either a long-term frequency or a propensity of the generating mechanism, such as a coin toss or a confidence interval procedure. So, for them, unique events, such as the next toss or the true status of an observed CI, do not have a probability. On the other hand, Bayesians can attach their subjective probability to such unique events. This interpretation is made possible using a logical device called the Dutch Book. As classically proposed by Frank Ramsey (1926) and Bruno de Finetti (1931), one’s subjective probability of an event is defined as the personal betting price one puts on the event. Though subjective, the price is not arbitrary, but it follows a normative rational consideration; it is a price such that no external agent can construct a Dutch Book against them, i.e., make a risk-free profit. In other words, it is irrational to make a bet that is guaranteed to lose. The Dutch Book is also known as arbitrage or free lunch. In the classical Dutch Book argument, the bet is made between two individuals.
Likewise, here we define confidence to be epistemic if it is protected from the Dutch Book, but crucially we assume that there is a betting market of a crowd of independent and intelligent players. In this market, bets are like a commodity with supply and demand from among the players. Assuming a perfect market condition – for instance, full competition, perfect information and no transaction cost – in accordance with the Arrow-Debreu theorem (Arrow and Debreu, 1954), there is an equilibrium price at which there is balance between supply and demand. Intuitively, if you are a seller and set your price too low, many would want to buy from you, thereby creating demand and increasing the price. Whereas if you set your price too high, nobody would want to buy from you, thus reducing demand and pressuring the price down. For the betting market in particular, the fundamental theorem of asset pricing (Ross, 1976) states that, assuming an objective probability model, there is no arbitrage if the price is determined by the objective probability. ‘Perfect information’ here means all players have access to the generic data y and the sampling model p_θ(y). (If there is no objective probability model, the market, as evidenced by actual betting markets, can still have an agreed interpersonal price at any given time, though not a theoretically determined price.)
To illustrate the role of the betting market in the Dutch Book argument, suppose you and I are betting on the 2024 US presidential election. Suppose the betting market is currently giving the price of 0.25 for Donald Trump to win (this means you pay $0.25 to get $1 back if Trump wins, including your original $0.25). Suppose, for whatever reasons, you believe Trump will lose and hence set the probability of him winning at 0.1. Then I would construct a Dutch Book by ‘buying from you’ at $0.1 and immediately ‘selling it in the market’ at $0.25, thus making a risk-free profit. ‘Buying from you’ means treating you like a bookie: paying you $0.1 and getting $1 back if Trump wins. While ‘selling in the market’ for me means betting against the event, so I behave like a bookie: people pay me $0.25 so that they can get $1 back if Trump wins, but I keep the $0.25 if Trump loses. So, overall I would make $0.15 risk-free, i.e., regardless of whether Trump wins or loses. Note that this is not just a thought experiment – you can do all this buying and selling of bets in the online betting market.
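The arbitrage arithmetic in this example can be checked directly. A minimal sketch, using the hypothetical prices above (0.10 and 0.25); the function name and structure are for illustration only:

```python
# Dutch Book sketch: your price is 0.10, the market price is 0.25 for the same event.
# I buy the $1 bet from you at 0.10 and immediately sell it in the market at 0.25.
def arbitrage_profit(trump_wins: bool, your_price=0.10, market_price=0.25):
    # Buying from you: pay your_price now, receive $1 from you if the event occurs.
    cash = -your_price + (1.0 if trump_wins else 0.0)
    # Selling in the market: collect market_price now, pay $1 out if the event occurs.
    cash += market_price - (1.0 if trump_wins else 0.0)
    return round(cash, 10)

print(arbitrage_profit(True), arbitrage_profit(False))  # 0.15 0.15 -- risk-free
```

The payoff is 0.15 whether the event occurs or not, which is exactly what makes the profit risk-free.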
It is worth emphasizing the difference between our setup and the classical Dutch Book argument used to establish the subjective Bayesian probability. In the latter, because it does not presume the betting market, bets are made only between you and me. To avoid the Dutch Book, you have to make your bets internally consistent by following (additive) probability laws. However, even if your bets are internally consistent (or coherent), if your prices do not match the market prices, I can make a risk-free profit by playing between you and the market; see Example 1. So, the presence of the market creates a stronger requirement for epistemic probability. We shall avoid the terms ‘subjective’ and ‘objective’; one might consider ‘epistemic’ to be subjective since it refers to personal decision-making about a unique event, but the market consideration makes it impersonal.
The present issue is when the confidence, as measured by the coverage probability, applies to the observed interval. One way to judge this is whether you are willing to bet on the true status of the CI using the confidence level as your personal price. Normatively, this should be the case if you know there is no better price. Intuitively, this is when you are sure that you have used all the available information in the data, so nobody can exploit you, i.e., construct a Dutch Book against you. Theoretically, to construct the Dutch Book, an external agent must exploit unused information in the form of a relevant subset, by conditioning on which he can get a different coverage probability.
Pawitan and Lee (2021) showed that the confidence is an extended likelihood (Lee et al., 2017). The classical likelihood principle (Birnbaum, 1962) and its extended version (Bjørnstad, 1996) state that the likelihood contains all the information in the data. Intuitively, this implies that the likelihood leaves no relevant subset, and is thus protected from the Dutch Book. In other words, we can attach the degree of confidence to the observed CI, i.e., confidence is epistemic, provided it is associated with the full likelihood. Our aim is to establish the theoretical justification for this intuitive notion.
To summarize briefly and highlight the plan of the paper, we describe three key concepts: relevant subset, confidence and ancillary statistic. We prove the main theorem that there are no relevant subsets if the confidence is associated with the full likelihood. This condition is satisfied if the confidence is based on a sufficient statistic. When there is no sufficient statistic but there is a maximal ancillary statistic, this ancillary defines relevant subsets, and conditioning on it leaves no further relevant subsets.
2 Main theory
2.1 Relevant subsets
The idea of the relevant subset appeared in Fisher’s writings on the nature of probability (Fisher, 1958). He considered probability meaningful provided there is no relevant subset. However, he treated the condition as an axiom – appealing to our intuition as to what ‘meaningful’ means – and did not establish what conditions are needed to guarantee no relevant subset. To avoid unnecessary philosophical discussions, we limit the idea of the relevant subset to confidence interval procedures, not to the probability concept in general.
Intuitively, we could use the coverage probability γ as a betting price if there is no better price given the data at hand. So the question is, are there any features of the data that can be used to improve the price? Mathematically, these ‘features’ are observed statistics that can be used to help predict the true coverage status. Given an arbitrary statistic b(y), the conditional coverage probability P_θ{θ ∈ CI | b(y)} will in general be biased, i.e., different from the marginal coverage γ. However, the bias as a function of the unknown θ is generally not going to be consistently in one direction. For example, trivially, if we use the full data itself, i.e., fixing the data as observed, then the conditional coverage is either one or zero depending on the true status of the CI, hence completely non-informative. In terms of betting, this means we cannot exploit an arbitrary feature of the data as a way to construct a Dutch Book against someone who sets the price at γ. The betting motivation also appeared in Buehler (1959) and Robinson (1979), though they only assumed two people betting against each other repeatedly, without the existence of the betting market. As we shall discuss in Example 1 and after Theorem 1, this has a significant impact on the interpretation of epistemic confidence.
A statistic b(y) is defined to be relevant (cf. Buehler, 1959) if the conditional coverage is non-trivially and consistently biased in one direction. That is, for a positive bias, there is an ε > 0, free of θ, such that
P_θ{θ ∈ CI | b(y) = 1} ≥ γ + ε, for all θ.   (1)
Now, potentially the feature b(y) can be used to construct a Dutch Book: Suppose you and I are betting, and I notice that the event b(y) = 1 occurs. If you set the price at γ, then I would buy the bet from you at γ and then sell it in the betting market at γ + ε. So I make a risk-free profit of ε. (We have assumed that the market contains intelligent players, so they would also have noticed the relevant statistic and set the price accordingly.) Similarly, for the negative bias, the relevant b(y) has the property
P_θ{θ ∈ CI | b(y) = 1} ≤ γ − ε, for all θ.   (2)
Technically, b(y) induces subsets of the sample space, known as ‘relevant subsets’; for convenience, we use the terms ‘relevant statistic’ and ‘relevant subset’ interchangeably. So, if there is a relevant subset, the confidence level is not epistemic. Conversely, if there are no relevant subsets, the betting price determined by the confidence level is protected from the Dutch Book. So, mathematically, we establish epistemic confidence by showing that it corresponds to a coverage probability that has no relevant subsets.
Example 1. Let y1, y2 be an iid sample from the uniform distribution on {θ − 1, θ, θ + 1}, where the parameter θ is an integer. Let y(1) and y(2) be the minimum and maximum values of y1 and y2, and let R = y(2) − y(1) be the range. We can show that the confidence interval CI = [y(1), y(2)] has a coverage probability P_θ{θ ∈ CI} = 7/9.
For example, on observing y(1) = 1 and y(2) = 3, the interval [1, 3] is formally a 78% CI for θ. But, if we ponder a bit, in this case we can actually be sure that the true θ = 2. So, the probability of 7/9 is clearly a wrong price for this interval. This is a typical example justifying the frequentist objection to attaching the coverage probability as a sense of confidence in an observed CI.
Here the range R is relevant. If R = 2, we know for sure that θ is equal to the midpoint of the interval, so the CI will always be correct. But if R = 0, the CI is equal to the single point y1, and it falls with equal probability at the integers θ − 1, θ, θ + 1. So, for all θ, we have P_θ{θ ∈ CI | R = 0} = 1/3, while P_θ{θ ∈ CI | R ≥ 1} = 1.
In the betting market, the range information will be used by the intelligent players to settle prices at these conditional probabilities. We can be sure, for example, that if y(1) = 1 and y(2) = 3, the intelligent players will not use 7/9 as the price and will instead use 1.00. So, the information in R can be used to construct a Dutch Book against anyone who ignores R and unwittingly uses the unconditional coverage. How do we know that there is a relevant subset in this case? Moreover, given R, how do we know if there is no further relevant subset?
To contrast with the classical Ramsey-de Finetti Dutch Book argument, suppose R = 0. If, for whatever subjective reasons, you set the price 7/9 for the event θ ∈ CI, you are being internally consistent as long as you set the price 2/9 for θ ∉ CI, since the two numbers constitute a valid probability measure. Internal consistency means that I cannot make a risk-free profit from you based on this single realization of y. Even if I know from the conditional coverage that 1/3 is a better price, I cannot take any advantage of you because there is no betting market. So 7/9 is a valid subjective probability.
Now consider Buehler-Robinson’s setup, again assuming no betting market and supposing R = 0. They would say the marginal price 7/9 is a bad idea, because there is a relevant subset giving a conditional probability of 1/3. In a series of independent repeated bets, if you set the price 7/9 whenever R = 0, I will be happy to ‘sell’ you the bet and be guaranteed to win in the long term. This is the usual frequentist interpretation; in any single bet I am not guaranteed risk-free money. As previously described, the presence of the betting market allows me to make free money from a single realization of y. So, the threat of the market-based Dutch Book is more potent. The exact technical difference between Buehler-Robinson’s setup and ours will be discussed below after Theorem 1, where the former allows one to choose an arbitrary prior distribution.
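The coverage claims of Example 1 can be verified by exhaustive enumeration. A sketch, assuming the two-observation reading of the example (each observation uniform on {θ − 1, θ, θ + 1} and CI = [y(1), y(2)]):

```python
from itertools import product
from fractions import Fraction

theta = 5  # any integer; the probabilities below do not depend on it
support = [theta - 1, theta, theta + 1]

marg = [0, 0]   # marginal coverage: [hits, trials]
by_range = {}   # conditional coverage keyed by the range R
for y1, y2 in product(support, repeat=2):   # 9 equally likely samples
    hit = min(y1, y2) <= theta <= max(y1, y2)
    r = max(y1, y2) - min(y1, y2)
    marg[0] += hit
    marg[1] += 1
    h, t = by_range.get(r, (0, 0))
    by_range[r] = (h + hit, t + 1)

print(Fraction(*marg))   # 7/9 marginal coverage
print({r: Fraction(h, t) for r, (h, t) in sorted(by_range.items())})
# R = 0 gives conditional coverage 1/3; R = 1 and R = 2 give 1
```

This reproduces the marginal 7/9 and the conditional 1/3 used in the betting discussion above.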
2.2 Confidence distribution
It turns out that establishing a no-relevant-subset condition relies on the concept of a confidence distribution. Let t ≡ t(y) be a statistic for θ, and define the right-side P-value function
C_m(θ; t) ≡ P_θ(T ≥ t).   (3)
Assuming that, for each t, it behaves formally like a proper cumulative distribution function of θ, C_m(θ; t) is called the confidence distribution of θ. The subscript m is used to indicate that it is a ‘marginal’ confidence, as it depends on the marginal distribution of T. For continuous T, at the true parameter, the random variable C_m(θ; T) is standard uniform. For continuous θ, the corresponding confidence density is
c_m(θ; t) ≡ ∂C_m(θ; t)/∂θ.   (4)
The functions C_m(θ; t) and c_m(θ; t) across θ are realized statistics, which depend on both the data and the model, but not on the true unknown parameter θ. We can view the confidence distribution simply as the collection of P-values or CIs. Suggestively, and with a slight abuse of notation, we define

C_m(θ ∈ CI) ≡ ∫_CI c_m(θ; t) dθ   (5)

to convey the ‘confidence of θ belonging in the CI’.
We assume a regularity condition, called R1 below, that for any 0 < α < 1, the α-quantile function θ_α(t) of C_m(·; t) is a strictly increasing function of t. Then the frequentist procedure based on T gives a γ-level CI defined by
CI(t) ≡ (θ_{α1}(t), θ_{α2}(t))   (6)

for some α1 < α2 with α2 − α1 = γ, to have a coverage probability P_θ{θ ∈ CI(T)} = γ.
Here the coverage probability is a frequentist probability based on the distribution of T, whereas the confidence is for the observed interval based on the confidence density of θ. The confidence becomes C_m{CI(t)} = C_m(θ_{α2}(t); t) − C_m(θ_{α1}(t); t) = α2 − α1 = γ.
Thus, we have the following lemma.
Lemma 1
Under the regularity condition R1,

P_θ{θ ∈ CI(T)} = C_m{CI(t)} = γ,   (7)

where CI(t) is the observed interval of the confidence procedure defined in (6).
Fisher (1950) was against the idea of interpreting the level of significance as a long-term frequency in repeated samples from the same population. But the frequency need not come from the same population. For suppose the CI_i’s are γ-level intervals for parameters θ_i from different populations. Let Z_i ≡ I(θ_i ∈ CI_i), so that E(Z_i) = γ, and let Z̄_n ≡ n^{-1} Σ_i Z_i. Then Z̄_n → γ, so that γ can be a long-term frequency of true coverage from different populations or experiments.
Example 2. On observing y from N(θ, 1), we have the confidence distribution

C_m(θ; y) = Φ(θ − y),

where Φ is the standard normal distribution function, with corresponding confidence density c_m(θ; y) = φ(θ − y), i.e., the normal density centered at y. In principle, we can derive any confidence interval or P-value from this confidence density. This example applies in most large-sample situations where, under regularity conditions, the normal model is correct asymptotically. Furthermore, it illustrates clearly the canonical relationship between confidence and coverage probability. For instance, for a 95% CI, we have

C_m(y − 1.96 < θ < y + 1.96) = Φ(1.96) − Φ(−1.96) = 0.95,

reflecting the 95% confidence that the observed CI covers the true parameter. This confidence is associated with an exact coverage probability

P_θ(Y − 1.96 < θ < Y + 1.96) = 0.95,

so the confidence matches the coverage probability.
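The normal-model match between confidence and coverage is easy to verify numerically. A small sketch using only the standard normal cdf; the observed value y is arbitrary:

```python
# Example 2 sketch: y ~ N(theta, 1), so the confidence density is the N(y, 1) density.
# The confidence attached to the 95% interval (y - 1.96, y + 1.96) is
# Phi(1.96) - Phi(-1.96), matching the coverage probability.
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

y = 1.3                 # arbitrary observed value
lo, hi = y - 1.96, y + 1.96
confidence = Phi(hi - y) - Phi(lo - y)   # integral of the N(y, 1) density over the CI
print(round(confidence, 4))              # ~ 0.95
```

Because the confidence density is just the sampling density recentered at y, the same 0.95 plays both roles.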
Fisher (1930, 1933) called C_m(θ; t) the fiducial distribution of θ, but he required t to be sufficient. However, the recent definition of the confidence distribution (e.g. Schweder and Hjort, 2016, p.58) requires only C_m(θ; T) to be uniform at the true parameter, thus guaranteeing a correct coverage probability. Lemma 1 states when Fisher’s fiducial probability becomes a frequentist probability, which requires T to be continuous. When T is discrete, the equality is only achieved asymptotically; see Appendix A3 for an example.
However, as shown in Example 1, a correct coverage probability does not rule out relevant subsets. This means that the current definition of the confidence distribution does not guarantee epistemic confidence. The key step is to define a confidence distribution that uses the full information. Motivated by the Bayesian formulation and Efron (1993), first define the implied prior as

c_0(θ; t) ∝ c_m(θ; t) / L(θ; t),   (8)
where L(θ; t) is the likelihood based on t, and the proportionality sign means that all the terms not involving θ in c_m(θ; t)/L(θ; t) can be dropped. In this paper, the full confidence density is defined by

c_f(θ; y) ∝ c_0(θ; t) L(θ; y).   (9)
The subscript f is now used to indicate that the confidence is associated with the full likelihood L(θ; y) based on the whole data. When necessary for clarity, the dependence of the confidence density and the likelihood on t and on the whole data y will be made explicit. c_f(θ; y) is defined only up to a constant term to allow it to integrate to one. Obviously, if t is sufficient, then c_f(θ; y) = c_m(θ; t), but in general they are not equal. In Section 3, we show a more convenient way to construct c_f. The confidence parallel to (5) can be denoted by C_f. Thus, the full confidence density looks like a Bayesian posterior. However, the implied prior is not subjectively selected, and can be improper and data-dependent.
The full confidence density is used in general to compute the degree of confidence in any observed CI as

C_f(θ ∈ CI) ≡ ∫_CI c_f(θ; y) dθ.

The CI has a coverage probability, which may or may not be equal to C_f(θ ∈ CI). We say that the CI has no relevant subsets if there is no b(y) such that the conditional coverage probability is biased in one direction according to (1) or (2).
For our main theorem, we assume the following regularity conditions; the proof is given in Appendix A1. For completeness and easy access, R1 is restated in full here.
R1. t ≡ t(y) is a continuous scalar statistic whose quantile function θ_α(t), defined by C_m(θ_α(t); t) = α, is a strictly increasing function of t for any 0 < α < 1.

R2. The implied prior c_0(θ; t) is positive and locally integrable on the parameter space such that the full confidence density c_f(θ; y) in (9) is proper, i.e., ∫ c_0(θ; t) L(θ; y) dθ < ∞.

R3. c_0(θ; t) is uniformly continuous in t.

R4. The confidence interval is locally bounded, i.e., for any compact set K in the sample space of t, there exist θ_L and θ_U such that CI(t) ⊂ (θ_L, θ_U) for all t ∈ K.
Theorem 1
Consider the full confidence density c_f(θ; y), with c_0 being the implied prior defined by (8) satisfying R2 and R3, based on a statistic t that satisfies R1. Let γ be the degree of confidence for the observed confidence interval CI(t) that satisfies R4, such that

C_f{θ ∈ CI(t)} = γ, for all t.

Then CI(t) has no relevant subsets.
Note that we have two ways of computing the price of an observed CI: using the full confidence C_f or using the marginal confidence C_m. The latter is not guaranteed to be free of relevant subsets, while the former is not guaranteed to match the coverage probability. If the two are equal, we have a confidence that corresponds to a coverage probability that has no relevant subsets, hence epistemic confidence. If t is sufficient and satisfies R1, Lemma 1 implies that the frequentist CI satisfies

P_θ{θ ∈ CI(T)} = C_m{CI(t)} = C_f{CI(t)} = γ.
Thus, we summarize the first key result in the following corollary:
Corollary 1
Under the regularity conditions R1-R4, if t is a sufficient statistic, the confidence based on c_f has a correct coverage probability and no relevant subsets. Hence the confidence is epistemic.
In Example 2, on observing y from N(θ, 1), the statistic t = y is trivially sufficient. Furthermore, we also have c_m(θ; y) ∝ L(θ; y), so the implied prior c_0(θ) ∝ 1 and c_f = c_m. The coverage probabilities match the full confidence, and by Corollary 1, the confidence is epistemic.
We note that the match between confidence and coverage probability holds asymptotically, regardless of whether T is continuous or discrete. Corollary 1 specifies the conditions under which it is true in finite samples.
For more generality, it is actually more convenient to prove the theorem using an arbitrary function c_0(θ) that satisfies R2 and R3, as long as it leads to a proper c_f. In particular, it does not have to be an implied prior (8) that depends on the statistic t. If c_0(θ) is a proper probability density that does not depend on the data, then c_f is a Bayesian posterior density, shown already by Robinson’s (1979) Proposition 7.4 not to have relevant subsets. For proper priors, c_0 is trivially uniformly continuous in t, so the theorem extends his result to improper and data-dependent priors.
However, there is a significant impact on the interpretation. If you use an arbitrary c_0 that is not the same as the implied prior, and there is a betting market, your price will differ from the market price. So, as illustrated in the Introduction and in Example 1, I can construct a Dutch Book against you. This means that, in this case, the theorem is meaningful only for two people betting repeatedly against each other, with gains or losses expressed in terms of expected value or long-term average. It is exactly the setting described by Buehler (1959) and Robinson (1979). Crucially, in such a setting, the presence of relevant subsets does not guarantee that an external agent can make a risk-free profit from a single bet. In this sense, it does not satisfy our original definition of epistemic confidence.
Lindley (1958) showed that, assuming t is sufficient, Fisher’s fiducial probability – hence the marginal confidence – is equal to the Bayesian posterior if and only if the family is transformable to a location family. However, his proof assumed the prior to be free of the data. Condition R3 of the theorem allows c_0 to depend on the data, so our result is not limited to the location family.
2.3 Ancillary statistics
The current definition of confidence (e.g. Schweder and Hjort, 2016, p.58) only requires C(θ; T) to follow a uniform distribution at the true parameter. However, if t is not sufficient, the marginal confidence is not epistemic, because it does not use the full likelihood, so it is not guaranteed free of relevant subsets. Limiting ourselves to models with sufficient statistics to get epistemic confidence is overly restrictive, since sufficient statistics exist at arbitrary sample sizes in the full exponential family only (Pawitan, 2001, Section 4.9). Using non-sufficient statistics implies a potential loss of efficiency and of the epistemic property. Further progress depends on the ancillary statistic, a feature or a function of the data whose distribution is free of the unknown parameter. As reviewed by Ghosh et al. (2010), it is one of Fisher’s great insights from the 1920s. We first have a parallel development for the conditional confidence distribution given an ancillary a:

C_c(θ; t | a) ≡ P_θ(T ≥ t | a).
As for the marginal case, we have the following corollary of Lemma 1. Condition R1 needs a little modification, where it refers to the conditional statistic t | a for each a.
Corollary 2
Under the regularity condition R1,

P_θ{θ ∈ CI(T) | a} = C_c{CI(t)} = γ,   (10)

where the CI is the confidence interval based on the conditional distribution of T given a.
Furthermore, define the implied prior as

c_0(θ; t, a) ∝ c_c(θ; t | a) / L(θ; t, a),   (11)

where L(θ; t, a) is the likelihood based on (t, a), and the proportionality sign means that all the terms not involving θ can be dropped. As before, the full confidence density is

c_f(θ; y) ∝ c_0(θ; t, a) L(θ; y).
Suppose t is not sufficient but (t, a) is, where a is an ancillary statistic. In this case, a is called an ancillary complement, and in a qualitative sense it is a maximal ancillary, because

L(θ; y) = L(θ; t, a) = p_θ(t | a) p(a) ∝ p_θ(t | a),   (12)

since the density of a is free of θ. Thus, conditioning a non-sufficient statistic on a maximal ancillary has recovered the lost information and restored the full-data likelihood. In particular, the conditional confidence becomes the full confidence: C_c = C_f. Note that (12) holds for any maximal ancillary, so if a maximal ancillary exists, then the full likelihood is automatically equal to the conditional likelihood given any maximal ancillary statistic. In its sampling theory form, when t is the maximum likelihood estimator (MLE), full information can be recovered from the conditional density of the MLE given a, whose approximation has been studied by Barndorff-Nielsen (1983).
In conditional inference (Reid, 1995), it is commonly stated that we condition on the ancillary to make our inference more ‘relevant’ to the data at hand, in other words, more epistemic. But this is typically stated on an intuitive basis; the following corollary provides a mathematical justification. Since we already condition on a, a further relevant subset is induced by a statistic b(y) such that the conditional coverage probability given (a, b) is non-trivially and consistently biased in one direction from γ, in the same manner as (1). As we describe following Theorem 1 above, the result holds for an arbitrary c_0 that satisfies R2-R3 and leads to a valid confidence density. So it applies to c_0 defined by (11). Following a similar reasoning as for the previous corollary, we can state our second key result:
Corollary 3
If a is a maximal ancillary and the CI is constructed from the conditional confidence density based on t | a, then under R1-R4, the conditional confidence has a correct coverage probability and no further relevant subsets. Hence the conditional confidence is epistemic.
Remark: In view of (12), the confidence is epistemic for any choice of the maximal ancillary. Basu (1959) showed under mild conditions that maximal ancillaries exist. However, they may not be unique; this is an issue traditionally considered most problematic in conditional inference. If the maximal ancillary is not unique, then the conditional coverage probability might depend upon the choice. However, this does not affect the absence of relevant subsets guaranteed by the corollary. We discuss this further in Section 4 and illustrate with an example in Appendix A4.
3 Examples
Our overall theory suggests that, regardless of the existence of a sufficient statistic, we can get epistemic confidence by computing CIs based on the full confidence density c_f. The corresponding coverage probability is either a marginal probability or a conditional probability given a maximal ancillary, depending on whether there exists a scalar sufficient statistic. The full likelihood is almost always easy to compute. However, in order to get a correct coverage, the implied prior c_0 is defined by (8) or (11), which in practice can be difficult to evaluate. For example, if we use the MLE as the statistic, in general it has no closed-form distribution, and computing the P-value, even from its approximate distribution based on Barndorff-Nielsen’s (1983) formula, can be challenging. We illustrate through a series of examples some suitable approximations of c_0 that are simpler to compute.
Suppose, for some sample size k < n, there is a statistic t_k ≡ t(y_1, …, y_k) that satisfies R1, i.e. it allows us to construct a valid confidence density c(θ; t_k). Then we can compute c_0 based on t_k. First consider the case when c_0 is free of the data. From the updating formula in Pawitan and Lee (2021), the confidence density based on the whole data is

c_f(θ; y) ∝ c(θ; t_k) p_θ(y | t_k),   (13)

where p_θ(y | t_k) is the conditional density of the data given t_k. The statistic trivially exists if y_1 itself leads to a valid confidence density, in which case t_1 ≡ y_1 and, for iid data, p_θ(y | y_1) = ∏_{i=2}^n p_θ(y_i). Once c(θ; t_k) is available, (13) is highly convenient, since it does not require any computation of a statistic such as the MLE, its distribution or the P-value based on the whole data. More importantly, as shown in some examples below, formula (13) works even when there is no sufficient statistic from the whole data for θ. This is illustrated by the general location-family model in Section 3.2.

When c_0 depends on the data, it matters which t_k is used to compute it. In this case the updating formula (13) is only an approximation. As long as the contribution of t_k to c_0 is small relative to that of the whole data, we expect a close approximation. This is illustrated in Example 6 below.
3.1 Simple models
Example 1 (continued). Based on t ≡ (y(1), y(2)), the confidence density and the likelihood functions are proportional:

c_m(θ; t) ∝ L(θ; t),

so the implied prior c_0(θ) ∝ 1 for all θ. The full likelihood based on y is

L(θ; y) ∝ I(y(2) − 1 ≤ θ ≤ y(1) + 1), for integer θ,

so the full confidence density is c_f(θ; y) ∝ L(θ; y). For example, if y(1) = 1 and y(2) = 3, we do have 100% confidence that θ = 2. And if R = 0, we only have 33.3% confidence for θ = y1, though we have 100% confidence for θ ∈ {y1 − 1, y1, y1 + 1}.
The MLE of θ is not unique; for instance, we can choose θ̂ = y(1) + 1 as the MLE. It is not sufficient, but (θ̂, R) is, so R is a maximal ancillary. Indeed, the full confidence values match the conditional probabilities given the range as previously given. Furthermore, according to Corollary 3, there is no further relevant subset, so the confidence is epistemic.
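Under the two-observation reading of Example 1, these full-confidence values can be enumerated directly; `full_confidence` below is a hypothetical helper, not from the paper:

```python
from fractions import Fraction

def full_confidence(y1, y2):
    """Full confidence over integer theta for two draws, each uniform on
    {theta - 1, theta, theta + 1} (two-observation reading of Example 1)."""
    m, M = min(y1, y2), max(y1, y2)
    # The likelihood is constant (1/9) on every integer theta compatible with
    # both draws, so the full confidence is uniform over the feasible values.
    feasible = [th for th in range(M - 1, m + 2)
                if abs(y1 - th) <= 1 and abs(y2 - th) <= 1]
    p = Fraction(1, len(feasible))
    return {th: p for th in feasible}

print(full_confidence(1, 3))   # theta = 2 with confidence 1
print(full_confidence(4, 4))   # confidence 1/3 each on {3, 4, 5}
```

These match the conditional coverage probabilities given the range, as the text asserts.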
Example 3. Let y be a single sample from the uniform distribution on (θ, θ + 1), where θ is a real number. As in the previous examples, the confidence density and the likelihood functions are

c(θ; y) ∝ L(θ; y) = I(y − 1 < θ < y).

For example, if y = 5.9, then we are 100% confident in θ ∈ (4.9, 5.9), and 90% confident in θ ∈ (5.0, 5.9). The coverage probability of CIs of the form (y − 0.9, y) is indeed

P_θ(Y − 0.9 < θ < Y) = 0.9,

so the confidence is epistemic.
Now let’s consider trying to bet on the value ⌊θ⌋, the largest integer not exceeding θ, based on observing y. What price would you give to the bet that ⌊θ⌋ = ⌊y⌋? According to the confidence distribution, it should be 0.9. But we can show that the random variable ⌊Y⌋ is Bernoulli shifted by ⌊θ⌋ and with success probability frac(θ), the fractional part of θ. For example, if θ = 5.6 then ⌊θ⌋ = 5 and frac(θ) = 0.6, so ⌊Y⌋ is equal to 5 or 6 with probabilities 0.4 and 0.6, respectively. This means that, in general, using ⌊y⌋ as a guess, the ‘coverage’ probability of being correct is

P_θ(⌊Y⌋ = ⌊θ⌋) = 1 − frac(θ).

This probability varies from 0 to 1 across θ, not matching the specific confidence – such as 0.9 above – derived from the confidence density.
The problem is that ⌊y⌋ is no longer a sufficient statistic, so its marginal distribution is not fully informative. Now, the fractional part a ≡ y − ⌊y⌋ is uniform between 0 and 1 for any θ, so it is an ancillary statistic. We can show that, conditional on a, the distribution of ⌊y⌋ is degenerate: with probability 1, it is equal to ⌊θ⌋ if frac(θ) < a, and equal to ⌊θ⌋ + 1 if frac(θ) > a. This conditional distribution is distinct from the unconditional version, so a is relevant. Basu (1964) and Ghosh et al. (2010) used this example as a counter-example, where conditioning by an ancillary leads to a puzzling degenerate distribution. But actually, it is not so puzzling: The conditional likelihood is the same as the full likelihood, so a is a maximal ancillary. This is of course as we should expect, since a together with ⌊y⌋ forms the full data y.
To illustrate with real numbers, for example, on observing y = 5.9 we have ⌊y⌋ = 5, so ⌊θ⌋ = 5 if the unknown θ ≥ 5. Now, your betting situation is much clearer: you will bet that ⌊θ⌋ = 5 if you believe that θ ≥ 5, i.e. θ ∈ [5.0, 5.9). This is exactly the same logical situation you faced before with the full likelihood and confidence density.
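The frequency statement P(⌊Y⌋ = ⌊θ⌋) = 1 − frac(θ) can be checked by simulation; the value θ = 5.6 is the illustrative one used above:

```python
import math
import random

def floor_coverage(theta, n=200_000, seed=1):
    """Monte Carlo estimate of P(floor(Y) == floor(theta))
    for Y ~ Uniform(theta, theta + 1)."""
    rng = random.Random(seed)
    hits = sum(math.floor(theta + rng.random()) == math.floor(theta)
               for _ in range(n))
    return hits / n

# At theta = 5.6 the coverage is 1 - frac(theta) = 0.4, far from any fixed
# confidence such as 0.9 read off the confidence distribution.
print(round(floor_coverage(5.6), 2))   # ~ 0.4
```

The estimate varies with θ exactly as the formula says, confirming that no single confidence value can match this ‘coverage’ across θ.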
3.2 Location family
Suppose y_1, …, y_n are an iid sample from the location family with density

p_θ(y_i) = p_0(y_i − θ),

where p_0 is an arbitrary but known density function, for example the Cauchy or normal density. Immediately, based on y_1 alone, the confidence density is

c(θ; y_1) = p_0(y_1 − θ),

so the implied prior c_0(θ) ∝ 1. So, again using formula (13), the full confidence density is

c_f(θ; y) ∝ ∏_{i=1}^n p_0(y_i − θ).   (14)
This is a remarkably simple way to arrive at the confidence density of θ without having to find the MLE and its distribution.
Without further specifications, the MLE θ̂ is not sufficient, so the marginal P-value will not yield the full confidence. The distribution of the residuals y_i − θ̂ is free of θ, so the set of differences a_i ≡ y_i − θ̂ is ancillary. In his classic paper, Fisher (1934) showed that

p_θ(θ̂ | a) = c(a) ∏_{i=1}^n p_0(a_i + θ̂ − θ),

where a is equivalently the set of differences from the order statistics y_(1), …, y_(n), and c(a) is a normalizing constant. This means that the conditional likelihood based on p_θ(θ̂ | a) matches the full likelihood (14), and the confidence of CIs based on (14) will match the conditional coverage probability. Indeed, here (θ̂, a) is sufficient and a is a maximal ancillary. Overall, the confidence of CIs based on (14) is epistemic.
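Formula (14) is also easy to use numerically: normalize the product of location densities over a grid. A sketch for a standard Cauchy p_0 with made-up data; the grid limits are an assumption, and the small mass outside them is neglected:

```python
import math

def full_confidence_density(y, lo=-15.0, hi=15.0, n=4000):
    """Grid approximation of c(theta; y) ∝ prod_i p0(y_i - theta) for a
    standard-Cauchy p0, normalized by a Riemann sum on [lo, hi]."""
    h = (hi - lo) / n
    grid = [lo + i * h for i in range(n + 1)]
    p0 = lambda e: 1.0 / (math.pi * (1.0 + e * e))      # standard Cauchy density
    raw = [math.prod(p0(yi - th) for yi in y) for th in grid]
    z = sum(raw) * h                                     # normalizing constant
    return grid, [r / z for r in raw]

grid, c = full_confidence_density([-1.2, 0.4, 3.1])      # hypothetical data
h = grid[1] - grid[0]
print(round(sum(c) * h, 6))                              # 1.0 (integrates to one)
```

No MLE or sampling distribution is needed at any point, which is the convenience the text emphasizes.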
Example 4. Suppose that are an i.i.d. sample from the uniform distribution on . Let and be the order statistics; then is a sufficient statistic. The likelihood is given by
and is an MLE. Since the uniform distribution on is a location family, this leads to the full confidence
where the range is a maximal ancillary.
3.3 Exponential family model
Suppose the dataset is an iid sample from the exponential family model with log-density of the form
(15) |
The MLE is sufficient if , but not if . In the latter case, the family is called the curved exponential family. By Theorem 1, when confidence statements based on the MLE will be epistemic. (Our theory covers the continuous case in order to get exact coverage probabilities. Many important members are discrete; this case is more complicated because the definition of the P-value is not unique, and the coverage probability function cannot match any chosen confidence level exactly. We discuss an example in Appendix A3.)
The standard evaluation of the confidence requires the tail probability of the distribution of the MLE, which in general has no closed form formula. Barndorff-Nielsen’s (1983) approximate conditional density of the MLE is given by
(16) |
where the MLE is the solution of , is the maximal ancillary and is a normalizing constant that is free of . For and the canonical parameter , the ancillary is null, and the approximation leads to the right-side P-value
(17) |
where is the standard normal variate and
with and the observed Fisher information.
Example 5. Let be an iid sample from the gamma distribution with mean one and shape parameter . The density is given by
so we have an exponential family model with
To use formula (13), we first find the implied prior density using alone:
where and . The probability is a gamma integral, which is computed numerically. The implied prior is shown in Figure 1(a). So from (13), we get the confidence density
For an example with and , which corresponds to the MLE , the confidence density is given by the solid line in Figure 1(b). The normalized likelihood function is also shown by the dashed line, which is quite distinct from the confidence density.
To get the marginal confidence density based on the P-value formula (17), we need
where is the solution of
with , and the observed Fisher information is
The circle points in Figure 1(b) are the marginal confidence density based on the same sample above. As expected, this tracks almost exactly the one given by formula (13). The corresponding implied prior based on is given in Figure 1(a), also closely matching the implied prior based on .
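As a minimal numerical sketch of Example 5, the log-likelihood of the shape parameter can be maximized directly. We assume the usual mean-one parameterization f(y) = α^α y^(α−1) e^(−αy)/Γ(α); the sample values below are hypothetical, and the simple grid search stands in for a proper root-finder on the score equation:

```python
import math

def loglik(alpha, y):
    """Log-likelihood for an iid sample from the mean-one gamma density
    f(y) = alpha^alpha * y^(alpha - 1) * exp(-alpha * y) / Gamma(alpha)."""
    n = len(y)
    return (n * (alpha * math.log(alpha) - math.lgamma(alpha))
            + (alpha - 1) * sum(math.log(v) for v in y)
            - alpha * sum(y))

y = [0.35, 0.71, 1.80, 0.52, 1.10, 0.95, 1.43, 0.61, 0.88, 1.25]  # hypothetical
grid = [0.05 * k for k in range(1, 2001)]      # alpha in (0, 100]
alpha_hat = max(grid, key=lambda a: loglik(a, y))
```

Normalizing the likelihood over such a grid gives a curve analogous to the dashed line in Figure 1(b); multiplying it by the implied prior from (13) and renormalizing yields the confidence density.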
Example 6. This is an example where is data dependent. Let be an iid sample from for . The log-density is given by
so this is a regular exponential family with sufficient statistic . The marginal confidence density can be computed based on the non-central distribution for . For , is sufficient, and
so the implied prior is data-dependent. This means that the full confidence density depends on which is used to compute the implied prior:
In Figure 2, for , we compare using three different versions of based on for . These are also compared with the marginal confidence . As shown in the figure, even for such a small dataset, the effects of the data dependence in this case are negligible.
Example 7. This example from a curved exponential family is used to illustrate complex models, where standard results sometimes fail. Let be an iid sample from for . First consider the confidence distribution based on ,
We can see immediately that if we use as the statistic, the term inside the bracket converges to as , so the confidence distribution goes to . Hence does not satisfy the regularity condition R1.
Here is minimal sufficient, and the likelihood function is
The MLE is given by
with a maximal ancillary
In terms of , the likelihood is
where .
Now, denote the MLE and the ancillary based only on by and . The conditional confidence distribution based on is
which is now a valid confidence distribution, with density
The implied prior is . The updating rule gives the full confidence density
(18) |
In fact, in Appendix A2 we show that, even though it is not sufficient, still leads to a valid confidence distribution, and the implied prior based on is the same .
For completeness, Appendix A2 also shows the conditional confidence density derived using Barndorff-Nielsen’s formula (16), showing that we end up with the same implied prior. Instead here we use the exact result from Hinkley (1977). He derived the exact conditional density of ,
where . Let , then we have
where . Then the confidence density becomes
so that the implied prior becomes , which is the same as the result from . Thus, we have
In Appendix A2, it is also shown that with is a valid confidence density because . However, it is not epistemic because it does not use the full likelihood, so there is a loss of information.
As numerical illustrations, we compare the exact conditional P-value for testing : , the corresponding full confidence at and the P-value based on the score test. The latter was computed using the observed Fisher information, suggested by Hinkley (1977) as having good conditional properties. In Figure 3(a), we generate 100 datasets with from at . The full confidence is computed using the implied prior , and a constant prior . Panel (b) shows the result for . The full confidence with the implied prior agrees with the exact conditional probability. The use of a non-implied prior cannot match the exact conditional probability. While the score test maintains the average conditional probability, it has a poor conditional property in these small samples.
4 Discussion
We have described the Dutch Book argument to establish the epistemic confidence that is meaningful for an observed confidence interval. Fisher tried to achieve the same purpose with the fiducial probability, but the use of the word ‘probability’ generated much confusion and controversy, so the concept of fiducial probability has been practically abandoned. However, the confidence concept is mainstream, although it comes with a frequentist interpretation only, so it applies not to the observed interval but to the procedure. The confidence may not be a probability but an extended likelihood (Pawitan and Lee, 2021), whose ratio is meaningful in hypothesis testing and statistical inference (Lee and Bjørnstad, 2013). The extended likelihood is logically distinct from the classical likelihood. It is well known that we cannot use the likelihood directly for inference, except in special circumstances such as the normal case. Our results show that we can turn a classical likelihood into a confidence density by multiplying it with an implied prior. Furthermore, we get epistemic confidence by establishing the absence of relevant subsets.
It might appear that in trying to get epistemic confidence using the Dutch Book argument, we are simply recreating the Bayesian framework. But this is not the case. Bayesian subjective probability is established by an internal consistency requirement of a coherent betting strategy. If you’re being inconsistent, then an external agent can construct a Dutch Book against you. To avoid the Dutch Book, you have to use an (additive) probability measure. Crucially, in this Bayesian version, there is no reference to the betting market, which according to the fundamental theorem of asset pricing (Ross, 1976) will settle prices based on objective probabilities. So, in our setting, it is still possible to construct a market-linked Dutch Book against an internally consistent subjective Bayesian who ignores the objective probability.
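A toy arithmetic sketch of this last point, with hypothetical numbers of our own: suppose your internally coherent personal price for an event is 0.7 while the market settles at the objective probability 0.5. An agent selling you the bet at your price has a positive expected profit:

```python
personal = 0.7    # your (internally coherent) betting price for event A
objective = 0.5   # objective probability at which the market settles

stake = 1.0       # the ticket pays $1 if A occurs
# The agent sells you the ticket at your price:
#   if A occurs, the agent gains personal - stake (a loss here);
#   if A fails, the agent keeps personal.
agent_expected = objective * (personal - stake) + (1 - objective) * personal
# agent_expected is approximately 0.2 > 0
```

Internal coherence alone cannot rule this out; only pricing at the objective probability removes the agent's expected edge. Turning the agent's expected profit into the risk-free profit of a genuine Dutch Book is where relevant subsets enter, as described in the main text.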
Extra principles in the subjective probability framework have been proposed to deal with the mismatch between the subjective and objective probabilities. For example, in Lewis’s (1980) ‘Principal Principle’
where denotes the subjective probability and ‘Chance’ the objective probability. So, the Principle simply declares that the subjective probability must be set to be equal to the objective probability, if the latter exists. Our Dutch Book argument can be used to justify the Principle, so the principle does not have to come out of the blue with no logical motivation. However, it is worth noting that epistemic confidence is not simply set equal to probability. Instead, it is the consequence of a theorem that establishes no relevant subset in order to avoid the Dutch Book. In our setup, the frequentist probability applies to a market involving a large number of independent players. Moreover, the rational personal betting price is no longer ‘subjective’, for example, in the choice of the prior. Thus, the conceptual separation of the personal and the market prices allows both epistemic and frequentist meanings of confidence.
Our use of money and bets to define epistemic confidence has some echoes in Shafer and Vovk’s (2001) game-theoretic foundation of probability, an ambitious rebuilding of probability without measure theory. However, their key concept is a sequential game between two players. The word ‘sequential’ clearly implies that the game is not meant to involve the risk-free profit from a single transaction that we want in a Dutch Book. Our usage of probability is fully within the Kolmogorov axiomatic system, and we make a clear distinction between probability and confidence.
Conditional inference (Reid, 1995) has traditionally been the main area of statistics that tries to address the epistemic content of confidence intervals. The theory of ancillary statistics and relevant subsets has grown as a result (Basu, 1955, 1964; Ghosh et al., 2010). Conditioning on ancillary statistics is meant to make the inference ‘closer’ to the data at hand, but the proponents of conditional inference only go half-way to the end goal of epistemic confidence that Fisher wanted. The general lack of a unique maximal ancillary is a great stumbling block, as it is then possible to come up with distinct relevant subsets with distinct conditional coverage probabilities. This raises an unanswerable question: What is then the ‘proper’ confidence for the observed interval? Our logical tool of the betting market overcomes this problem – in this case, the market cannot settle on an unambiguous price. But Corollary 3 still holds in the sense that, as an individual, you are still protected from the Dutch Book. We discuss this further with an example in Appendix A4.
Schweder and Hjort (2016) and Schweder (2018) have been strong proponents of interpreting confidence as ‘epistemic probability.’ We are in general agreement with their sentiment, but it is unclear which version of probability this is. The only established and accepted epistemic probability is the Bayesian probability, but in their writing, the confidence concept is clearly non-Bayesian. Our use of the Dutch Book defines normatively the epistemic value of the confidence while staying within the non-Bayesian framework.
In logic and philosophy, the relevant subset problem is known as the ‘reference class problem’ (Reichenbach, 1949; Hajek, 2007). Venn (1876) in his classic book already recognized the problem: ‘It is obvious that every individual thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things…’, and this affects the probability assignment. We solve this problem by setting a limit to the amount of information as given by the observed data generated by the probability model . This approach is in line with the theory of market equilibrium, where it is assumed that information is limited and available to all players. It is not possible to have an equilibrium – hence accepted prices – if information is indefinite, or when the players know that they have access to different pieces of information.
We have limited our current paper to the one parameter case. Following the proof of Theorem 1, the same result actually holds in the multi-parameter case as long as we consider bounded confidence regions satisfying Condition R4. The problem arises for a marginal parameter of interest, which might implicitly assume unbounded regions for the nuisance parameters. For example, in the normal model, with both mean and variance unknown, the standard CI for the mean implicitly employs an unbounded interval for the variance:
where is the quantile of distribution with degrees of freedom. Thus, the theorem cannot guarantee the epistemic property of the -interval. In fact, Buehler and Feddersen (1963) and Brown (1967) showed the existence of relevant subsets for the -interval. Extensions of our epistemic confidence results to this problem are of great interest.
References
-
Arrow, K. J. and Debreu, G. (1954). Existence of an equilibrium for a competitive economy. Econometrica 22, 265–290.
-
Barndorff-Nielsen O. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70: 343-365
-
Berger J.O. and Wolpert R.L. (1988). The likelihood principle. 2nd Edition. Institute of Mathematical Statistics, Hayward, CA
-
Birnbaum A. (1962). On the foundation of statistical inference. Journal of the American Statistical Association, 57, 269–326.
-
Bjørnstad J.F. (1996). On the generalization of the likelihood function and likelihood principle. Journal of the American Statistical Association, 91, 791–806.
-
Brown, L. (1967). The conditional level of Student’s t test. The Annals of Mathematical Statistics, 38, 1068–1071.
-
Buehler R.J. (1959). Some Validity Criteria for Statistical Inferences. The Annals of Mathematical Statistics, 30, 845–863
-
Buehler R.J. and Feddersen A.P. (1963). Note on a Conditional Property of Student’s . The Annals of Mathematical Statistics, 34, 1098–1100.
-
de Finetti, B. (1931). On the subjective meaning of probability. English translation in de Finetti (1993), Induction and Probability, pp. 291–321. Clueb: Bologna.
-
Fisher R.A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26, 528–535.
-
Fisher R.A. (1933). The concepts of inverse probability and fiducial probability referring to unknown parameters. Proceedings of the Royal Society of London, 139, 343–348.
-
Fisher R.A. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London, 144A, 285.
-
Fisher R.A. (1950). Contributions to mathematical statistics (pp. 35.173a). Wiley.
-
Fisher R.A. (1958). The nature of probability. Centennial Review, 2, 261–274.
-
Fisher R.A. (1973). Statistical methods and scientific inference, 3rd Edition. New York: Hafner.
-
Ghosh M., Reid N. and Fraser D.A.S. (2010). Ancillary statistics: a review. Statistica Sinica 20, 1309–1332.
-
Hajek A. (2007). The reference class problem is your problem too. Synthese, 156: 563–585.
-
Lancaster H.O. (1961). Significance tests in discrete distributions. Journal of the American Statistical Association, 56: 223–234.
-
Lee Y. and Bjørnstad J.F. (2013). Extended likelihood approach to large-scale multiple testing. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 75, 553-575.
-
Lee Y., Nelder J.A. and Pawitan Y. (2017). Generalized linear models with random effects: unified analysis via H-likelihood (2nd ed.). CRC Press.
-
Lewis, D. (1980). A subjectivist’s guide to objective chance. In Ifs (pp. 267-297). Springer, Dordrecht.
-
Lindley D.V. (1958). Fiducial distributions and Bayes’ theorem. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 20, 102–107.
-
Pawitan Y. (2001). In all likelihood: Statistical modelling and inference using likelihood. Oxford University Press, Oxford, UK.
-
Pawitan Y. and Lee Y. (2021). Confidence as likelihood. Statistical Science. To appear.
-
Ramsey, F. (1926). Truth and probability. In Foundations of Mathematics and other Logical Essays. London: K. Paul, Trench, Trubner and Co. Reprinted in H.E. Kyburg and H.E. Smokler (eds.) (1980), Studies in Subjective Probability, 2nd edn (pp. 25-–52) New York: Robert Krieger.
-
Reichenbach H. (1949). The Theory of Probability. University of California Press.
-
Reid N. (1995). The roles of conditioning in inference. Statistical Science, 10, 138–157.
-
Robinson G.K. (1979). Conditional properties of statistical procedures. The Annals of Statistics, 7, 742–755.
-
Ross, S. (1976), Return, risk and arbitrage. In: I. Friend and J. Bicksler (eds.), Studies in Risk and Return. Cambridge, MA: Ballinger
-
Schweder T. and Hjort N.L. (2016). Confidence, Likelihood, Probability. Cambridge University Press, Cambridge, UK.
-
Schweder T. (2018). Confidence is epistemic probability for empirical science. Journal of Statistical Planning and Inference 195, 116–125.
-
Venn, J. (1876). The Logic of Chance (2nd ed.). Macmillan and Co.
Appendix
A1. Proof of Theorem 1
One main consequence of condition R1 is that we start with a proper confidence density. However, as stated after the statement of the theorem, instead of starting with , for more generality, our proof would allow an arbitrary function that satisfies R2 and R3, as long as the resulting full confidence is proper.
Throughout this part, given a fixed , let denote the -level interval such that
where is the sample space of and let denote a relevant subset such that
(19) |
or negatively,
for some where is the parameter space of .
Lemma. Suppose that is a relevant subset for an interval . Let be a countable partition of , so that , then there exists a relevant subset .
Proof. First consider the positively biased case. If for all , then we have
which contradicts (19). Therefore, there must exist such that
is a relevant subset. Similarly, supposing leads to the corresponding proof for a negatively biased relevant subset.
Proof of Theorem 1. By the definition of above, there is a non-negative function such that . First we show that the existence of a positively biased relevant subset leads to a contradiction. For an arbitrary , suppose that there exists such that
Let and be arbitrary numbers satisfying and
(20) |
By the uniform continuity of , there exists such that
Let be a partition of divided by a -dimensional grid with spacing where is the sample size. Define ; then becomes a countable partition of the relevant subset . By the Lemma, there exists a relevant subset such that
It can be written in integral form,
(21) |
Since for any , from the regularity condition R3 we have
(22) |
Let be the -level confidence interval. By the regularity condition R4, there exists a compact set such that for any . Then,
(23) |
Now let be an arbitrary observation in and integrate (21) over with the , then we have
(24) |
By the regularity condition R2,
Thus, the order of integration in (24) can be exchanged by Fubini’s theorem to yield
(25) |
Note here that the inequality (22) implies that
Then the left-hand-side of (25) becomes
and the right-hand-side of (25) becomes
by the inequality (23). Thus we have
which is a contradiction to (20).
A2. Curved exponential family:
We give more details of the model. Denote the MLE based only on by . The confidence distribution is
Then and , so that the right-side P-value leads to a proper distribution function. The corresponding confidence density is given by
The implied prior based on is
where is the likelihood function from .
On the other hand, if we construct the full confidence densities by
then the resulting confidence density depends on the choice of . In this case we should consider as an approximation to . Figure 4 plots confidence densities (solid) and (circle) with from . As shown in (b), when becomes large, the difference becomes negligible and gets closer to (circle).

So
There is a loss of information caused by using , due to the sign of as captured by . This is negligible even in small samples; see Figure 4. However, the marginal confidence
has a larger loss of information, as shown in both Figure 4(a) and (b).
Figure 5 plots the logarithms of implied prior (dotted) and (solid), properly scaled. Note that is not uniformly continuous on , because the information in and differ.

It is also possible to compute the conditional confidence density by using Barndorff-Nielsen’s formula (16), and to show that we end up with the same implied prior . Firstly, the likelihood ratio is given by
where , and the observed Fisher information
Then we have
Let and let , then the conditional density of becomes
which does not contain . Let , then
It gives the conditional confidence density
and implied prior . Thus, the conditional confidence density from Barndorff-Nielsen’s formula becomes
which is the same as the full confidence (3.3).
A3. Discrete case
A complication arises in the discrete case since the definition of the P-value is not unique, and the coverage probability function cannot match any chosen confidence level exactly. Given the observed statistic , among several candidates, the mid P-value
is often considered the most appropriate (Lancaster, 1961).
We shall discuss the specific case of the binomial and negative binomial models: and . The two models have an identical likelihood, proportional to but have different probability mass functions, respectively
Thus, they have the common MLE . However, the two MLEs have different supports
and therefore and have different distributions, which lead to different P-values. Statistical models such as the binomial and negative binomial models describe how the unobserved future data will be generated. Thus, all the information about in the data and in the statistical model is in the extended likelihood. The use of the mid P-values
leads to different confidence densities
where is the regularized incomplete beta function and is the beta function.

Figure 6 shows the coverage probabilities of the 95% two-sided confidence procedure based on the mid P-value of for binomial models and negative binomial models. We can see that the coverage probabilities fluctuate around 0.95 but they are not consistently biased in one direction. Moreover, as or becomes larger, the difference between the coverage probability and the confidence becomes smaller. In the discrete case, it is not possible to attain the exact objective coverage probability of the CI procedure. Here the confidence is a consistent estimate of the objective coverage probability. In negative binomial models with , with probability 1, so that it behaves like the binomial confidence procedure for .
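A minimal sketch of the binomial mid P-value viewed as a confidence distribution in θ. We assume the common right-sided convention P_θ(X > x) + ½P_θ(X = x); the observed counts are hypothetical:

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass function via exact combinatorics."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def mid_p_confidence(theta, x, n):
    """Right-side mid P-value P_theta(X > x) + 0.5 * P_theta(X = x),
    read as a confidence distribution C(theta) for the binomial model."""
    upper = sum(binom_pmf(k, n, theta) for k in range(x + 1, n + 1))
    return upper + 0.5 * binom_pmf(x, n, theta)

# Hypothetical data: x = 7 successes out of n = 20 trials
C = [mid_p_confidence(t / 100, 7, 20) for t in range(101)]
```

For 0 < x < n, C(θ) increases from 0 at θ = 0 to 1 at θ = 1, so it is a proper distribution function; its derivative in θ gives the mid-P confidence density discussed in the text.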
Besides the information in the likelihood, the confidence uses information from the statistical model. Consider two different statistical models, M1: where Poisson( and , and M2: where Poisson( and . In M1, and in M2, have a common likelihood, but they are different models, so they need not have a common confidence.
A4. When maximal ancillaries are not unique
When the maximal ancillary is not unique, the conditional coverage probability may depend on the choice of the ancillary. However, the lack of a unique maximal ancillary does not affect the validity of Corollary 3 on the absence of a relevant subset. We illustrate here with an example from Evans (2013). The data are sampled from a joint distribution with probabilities under given in the following table:
1/6 | 1/6 | 2/6 | 2/6 | |
1/12 | 3/12 | 5/12 | 3/12 |
Here both the data and parameter are discrete. Strictly, our theory does not cover this case, but we shall use it because it can still illustrate clearly the issues with non-unique maximal ancillaries. The marginal probabilities are
for . So both and are ancillaries, i.e., their probabilities do not depend on . The conditional probabilities of given are
1/2 | 1/2 | |
1/4 | 3/4 |
and, given are
1/3 | 2/3 | |
1/6 | 5/6. |
Based on the unconditional model, on observing , we have the likelihood function and , so the MLE . For we have a different likelihood, but still . This means we cannot reconstruct the likelihood based on the MLE alone, hence the MLE is not sufficient. But we can see immediately that we get the same likelihood function under the conditional model given or given , so conditioning on each ancillary recovers the full likelihood and each ancillary is maximal.
Now consider using the MLE itself as a ‘CI’. Conditional on the ancillaries, the probability that the MLE is correct is
These conditional ‘coverage probabilities’ are indeed distinct from each other. However, comparing the conditional coverage probabilities given to those given , there is no consistent non-trivial bias in one direction across . So if you use as the ancillary, you cannot construct further relevant subsets based on . This is the essence of our remark after Corollary 3 that the lack of a unique maximal ancillary does not affect the validity of the corollary.
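The computations in this example can be checked directly. In the sketch below the four sample points are labeled 1 to 4 and the two ancillary partitions are taken as {{1,2},{3,4}} and {{1,3},{2,4}}; these labels are our reconstruction, chosen to be consistent with the conditional tables above:

```python
from fractions import Fraction as F

# P(y | theta) over the four sample points (0-based indices)
p = {1: [F(1, 6), F(1, 6), F(2, 6), F(2, 6)],      # under theta_1
     2: [F(1, 12), F(3, 12), F(5, 12), F(3, 12)]}  # under theta_2

U1 = [{0, 1}, {2, 3}]   # partition {1,2} vs {3,4} (assumed labeling)
U2 = [{0, 2}, {1, 3}]   # partition {1,3} vs {2,4} (assumed labeling)

# Both partitions are ancillary: block probabilities are free of theta
for part in (U1, U2):
    for block in part:
        assert sum(p[1][i] for i in block) == sum(p[2][i] for i in block)

# MLE at each sample point: the theta with the larger likelihood
mle = [max(p, key=lambda th: p[th][i]) for i in range(4)]

def cond_correct(th, block):
    """P(MLE = theta | ancillary block, theta)."""
    tot = sum(p[th][i] for i in block)
    return sum(p[th][i] for i in block if mle[i] == th) / tot

# Under theta_1 the conditional coverage is 1/2 given either U1 block,
# but 1/3 and 2/3 given the two U2 blocks: distinct, yet with no
# consistent bias in one direction
coverage_U1 = [cond_correct(1, b) for b in U1]
coverage_U2 = [cond_correct(1, b) for b in U2]
```

The exact rational arithmetic makes the ancillarity checks and the distinct conditional coverages verifiable without rounding error.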
Unfortunately, in this example, the P-value is not defined because the parameter can be an unordered label. So it is not possible to compute any version of the confidence function or any implied prior. In the continuous case, we define the CI to satisfy for all . However, in discrete cases, it is often not possible for the coverage probabilities to be the same for all , which violates the condition of Theorem 1. Fisher (1973, Chapter III) suggested that for problems such as this, the structure is not sufficient to allow an unambiguous probability-based inference, so only the likelihood is available.