On the Confidence Intervals in Bioequivalence Studies
Abstract
A bioequivalence study is a type of clinical trial designed to compare the biological equivalence of two different formulations of a drug. Such studies are typically conducted in controlled clinical settings with human subjects, who are randomly assigned to receive the two formulations. The two formulations are then compared with respect to their pharmacokinetic profiles, which encompass the absorption, distribution, metabolism, and elimination of the drug. Under the guidance of the Food and Drug Administration (FDA), for a size-$\alpha$ bioequivalence test, the standard approach is to construct a $100(1-2\alpha)\%$ confidence interval and verify whether the confidence interval falls within the bioequivalence limits. In this work, we clarify that the $100(1-2\alpha)\%$ confidence interval approach for bioequivalence testing yields a size-$\alpha$ test only when the two one-sided tests in TOST are “equal-tailed”. Furthermore, a $100(1-\alpha)\%$ confidence interval approach is also discussed in the bioequivalence study.
Keywords: bioequivalence study; two one-sided tests; confidence interval.
1 Motivation
Bioequivalence studies are a type of clinical trial designed to compare the biological equivalence of two different formulations of a drug. The objective of these studies is to establish that the generic drug demonstrates the same rate and extent of absorption as the reference drug, and that its therapeutic effects are comparable. Such studies are typically conducted in controlled clinical settings with human subjects, who are randomly assigned to receive either the generic drug or the reference drug. The two formulations are then compared with respect to their pharmacokinetic profiles, which encompass the absorption, distribution, metabolism, and elimination of the drug. Acceptance criteria for bioequivalence are generally based on statistical methods that assess the similarities between the pharmacokinetic profiles of the two drug products. These comparisons often involve evaluating the area under the concentration-time curve (AUC) and the maximum concentration (Cmax) of the drug in the bloodstream. By verifying that the geometric mean ratios of both AUC and Cmax fall within pre-defined equivalence ranges, researchers can determine whether the two formulations are bioequivalent.
Suppose we want to assess the bioequivalence of two drugs or formulations. Under the guidance of the Food and Drug Administration (FDA), bioequivalence is claimed if a two-sided 90% confidence interval of the geometric mean ratio falls within 80%-125%. In the context of statistical hypothesis testing, to demonstrate bioequivalence, the following hypothesis testing problem at significance level $\alpha$ is considered:
$$H_0: \frac{\gamma_T}{\gamma_R} \le \delta_L \ \text{ or } \ \frac{\gamma_T}{\gamma_R} \ge \delta_U \quad \text{versus} \quad H_1: \delta_L < \frac{\gamma_T}{\gamma_R} < \delta_U, \tag{1}$$
where $\gamma_T, \gamma_R$ are the population geometric means of the test product and reference product respectively, and $\delta_L, \delta_U$ are the lower and upper bounds determined by the regulator. Based on the theory of intersection-union tests, Westlake, (1981); Schuirmann, (1981, 1987) proposed the two one-sided tests (TOST), which has become the standard approach to test (1). As its name implies, TOST decomposes the interval hypothesis into two size-$\alpha$ one-sided hypotheses and rejects $H_0$ if and only if both one-sided hypotheses are rejected. In general, when multiple size-$\alpha$ tests are combined, the overall size of the combined test is no longer $\alpha$. Fortunately, due to the theory of intersection-union tests, the size of TOST is still $\alpha$, which will be discussed in Section 2.1. It has also been shown that a size-$\alpha$ TOST procedure is identical to constructing a $100(1-2\alpha)\%$ confidence interval for $\gamma_T/\gamma_R$ and rejecting $H_0$ if the confidence interval falls entirely between $\delta_L$ and $\delta_U$. Even though the procedure seems simple, some questions still need to be clarified, for example:
• Why is the geometric mean preferred over the arithmetic mean?
• Why are the BE limits (0.8, 1.25) not symmetric about 1?
• Why is the product of the BE limits equal to 1 ($0.8 \times 1.25 = 1$), i.e., symmetric about 1 on the ratio scale?
• In general, combining two size-$\alpha$ tests does not yield a size-$\alpha$ test, so why does TOST work? Why is a size-$\alpha$ TOST identical to a $100(1-2\alpha)\%$ confidence interval, not a $100(1-\alpha)\%$ confidence interval? Is TOST the most powerful test?
The purpose of this paper is to answer these questions theoretically and to provide a clear and in-depth understanding of the bioequivalence study. In brief, the geometric mean is preferred in bioequivalence studies because pharmacokinetic parameters, such as AUC and Cmax, typically follow a lognormal distribution. For such right-skewed data, the geometric mean estimates the median and is less sensitive to extreme values than the arithmetic mean, and comparing geometric mean ratios is equivalent to comparing arithmetic mean differences of the log-transformed data. The bioequivalence limits (0.8, 1.25) are not symmetric about 1 on the original scale because they are specified on the log scale: since $\log 0.8 = -\log 1.25$, the limits are symmetric about zero after the log transformation, and back-transforming them yields limits that are asymmetric about 1. Equivalently, the product of the BE limits equals 1, so the criterion does not change when the roles of the test and reference products are exchanged, that is, the limits are symmetric about 1 on the ratio scale. TOST works because of the intersection-union test theory. In general, combining two size-$\alpha$ tests does not yield a size-$\alpha$ test: depending on how the tests are combined, the overall type I error rate may be inflated or may fall below $\alpha$. TOST rejects the null hypothesis only if both size-$\alpha$ one-sided hypotheses are rejected, and the intersection-union test theory shows that its overall size is exactly $\alpha$, although it is not the most powerful test in all situations. This makes TOST a suitable method for bioequivalence testing.
2 Theoretical Background
In this section, we delve into the theoretical background of bioequivalence studies for a univariate pharmacokinetic (PK) parameter. In a typical pharmacokinetic bioequivalence study, the log-transformed univariate response variables such as $\log(\mathrm{AUC})$ and $\log(C_{\max})$ are often assumed to follow a normal distribution, or equivalently, AUC and Cmax are assumed to follow a lognormal distribution (Shen et al., (2006), Shen et al., (2017)). Under FDA guidelines, bioequivalence is treated as a statistical hypothesis testing problem as defined in (1), which states that the population geometric mean ratio for the test and reference products should fall between 80% and 125%. At first glance, it may seem odd that the geometric mean ratio is used instead of the arithmetic mean difference. Moreover, it might appear strange that 80% and 125% are not symmetric about 100%, but $0.8 \times 1.25 = 1$, meaning that the lower and upper bounds are symmetric about 100% on a ratio scale. To explain the reason, we take the logarithm of (1) and obtain the following hypothesis testing problem for the difference:
$$H_0: \eta_T - \eta_R \le \theta_L \ \text{ or } \ \eta_T - \eta_R \ge \theta_U \quad \text{versus} \quad H_1: \theta_L < \eta_T - \eta_R < \theta_U, \tag{2}$$
where $\eta_T = \log \gamma_T$, $\eta_R = \log \gamma_R$, $\theta_L = \log \delta_L$ and $\theta_U = \log \delta_U$. The following theorem explains some of the reasons why the FDA chooses the geometric mean ratio as the test statistic and 80%-125% as the BE limits.
Theorem 1.
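Suppose $Y_1, \dots, Y_n$ are independent and identically distributed from the normal distribution $N(\mu, \sigma^2)$ and $X_i = \exp(Y_i)$, that is, $X_1, \dots, X_n$ are independent and identically distributed from the lognormal distribution with mean $\mu$ and variance $\sigma^2$ on the log-transformed scale. We let $\hat G = \left(\prod_{i=1}^n X_i\right)^{1/n}$ be the geometric mean of $X_1, \dots, X_n$. Then the following statements hold true:
(1) $\log \hat G = \frac{1}{n}\sum_{i=1}^n \log X_i$;
(2) $\operatorname{Med}(X_1) = e^{\mu}$ and $E(\hat G) = \exp\left\{\mu + \sigma^2/(2n)\right\}$;
(3) $E(\hat G) = \operatorname{Med}(X_1)\left\{1 + O(1/n)\right\}$.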
The proof is given in the supplementary material. According to Theorem 1, it is straightforward to obtain the following facts:
Remark 1.
(a) Since the PK parameters are lognormally distributed, which is right-skewed, it is more natural to compare the medians of the treatment and reference products rather than the arithmetic means;
(b) According to (1) in Theorem 1, the log-transformed geometric mean of PK data equals the arithmetic mean of the log-transformed PK data. In other words, comparing the geometric mean ratio between two products is equivalent to comparing the arithmetic mean difference between the log-transformed PK data;
(c) According to (2) and (3) in Theorem 1, the geometric mean of PK data is a nearly unbiased estimator of the PK median with a relative error of order $O(1/n)$. To be more specific, as $n$ approaches infinity, $E(\hat G) \to e^{\mu}$. This implies that, for large sample sizes, the geometric mean of PK data provides an accurate estimate of the PK median.
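The points in Remark 1 can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: it verifies that the log of the geometric mean equals the arithmetic mean of the logs, and evaluates the standard closed form $E(\hat G) = \exp\{\mu + \sigma^2/(2n)\}$ for i.i.d. lognormal data to show how quickly the relative bias against the median $e^{\mu}$ shrinks. The parameter values and function names are hypothetical and chosen only for illustration.

```python
import math
import random


def geometric_mean(xs):
    """Geometric mean computed as exp(mean(log x))."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def expected_geometric_mean(mu, sigma, n):
    """Closed form E[GM] = exp(mu + sigma^2 / (2 n)) for n i.i.d. lognormal draws."""
    return math.exp(mu + sigma ** 2 / (2 * n))


random.seed(0)
mu, sigma = 1.0, 0.5  # hypothetical log-scale mean and standard deviation
xs = [random.lognormvariate(mu, sigma) for _ in range(12)]

# The product-based definition and the log-based computation agree.
direct = math.prod(xs) ** (1 / len(xs))
print(abs(geometric_mean(xs) - direct) < 1e-9)

# Relative bias of the geometric mean as an estimator of the median exp(mu).
for n in (5, 50, 500):
    print(n, expected_geometric_mean(mu, sigma, n) / math.exp(mu) - 1)
```

The relative bias printed in the loop is $\exp\{\sigma^2/(2n)\} - 1$, which decreases roughly like $\sigma^2/(2n)$.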
Combining everything above, we can identify the advantages of using the geometric mean ratio as the test statistic: (1) the comparison of the geometric mean ratio can be converted to the comparison of the arithmetic mean difference; (2) the geometric mean ratio is a “good” estimator of the median ratio, which can be naturally applied to the distribution of PK parameters. Following the above analysis, we can further clarify the procedure of bioequivalence testing for a univariate PK parameter in practice:
• Step 1: Log-transform the PK data (e.g., AUC and Cmax) to convert the lognormal distribution into a normal distribution.
• Step 2: Test $H_0$ in (2), where $\eta_T, \eta_R$ are the population means of the log-transformed PK data from Step 1 and $\theta_L = \log 0.8$, $\theta_U = \log 1.25$.
• Step 3: Find a $100(1-2\alpha)\%$ confidence interval for $\eta_T - \eta_R$ and reject $H_0$ if and only if the confidence interval falls in $(\theta_L, \theta_U)$; the significance level of the corresponding procedure is $\alpha$.
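The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a regulatory implementation: it assumes two independent groups with a common variance on the log scale (pooled standard error), and the critical value `t_crit` is supplied by the caller, for example from a $t$ table with $n_T + n_R - 2$ degrees of freedom. The function name `tost_log_scale` and the data are hypothetical.

```python
import math
from statistics import mean, variance


def tost_log_scale(test, ref, t_crit, lower=0.8, upper=1.25):
    """Bioequivalence decision via the 100(1 - 2*alpha)% interval on the log scale.

    test, ref : raw PK values (e.g. AUC) from two independent groups
    t_crit    : upper-alpha t critical value, supplied by the caller
    """
    # Step 1: log-transform the PK data.
    y_t = [math.log(x) for x in test]
    y_r = [math.log(x) for x in ref]
    n_t, n_r = len(y_t), len(y_r)
    diff = mean(y_t) - mean(y_r)
    # Pooled variance: an assumption of this sketch (equal variances, parallel groups).
    s2 = ((n_t - 1) * variance(y_t) + (n_r - 1) * variance(y_r)) / (n_t + n_r - 2)
    se = math.sqrt(s2 * (1 / n_t + 1 / n_r))
    # Step 3: confidence interval for the difference of log-scale means.
    lo, hi = diff - t_crit * se, diff + t_crit * se
    # Step 2: the limits of the hypothesis on the log scale.
    theta_l, theta_u = math.log(lower), math.log(upper)
    return {
        "ci_ratio": (math.exp(lo), math.exp(hi)),
        "bioequivalent": theta_l < lo and hi < theta_u,
    }


# Made-up data; 1.812 is the upper 5% point of the t distribution with 10 df.
test_auc = [98, 102, 100, 105, 95, 101]
ref_auc = [100, 99, 103, 97, 104, 96]
print(tost_log_scale(test_auc, ref_auc, t_crit=1.812))
```

Back-transforming the interval with `exp` gives a confidence interval for the geometric mean ratio, which is the quantity compared against 80%-125% in practice.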
It is important to note that after taking the logarithm, the original BE limits (0.8, 1.25) become symmetric about zero, and this fact plays a crucial role in the confidence interval approach. Without the “equal-tail” property, the confidence interval approach is no longer valid. This is because the symmetric nature of the limits on the log scale simplifies the comparison of the test and reference products, allowing for a balanced evaluation of bioequivalence that accounts for both type I and type II errors. See Section 2.3.2 for more details.
2.1 Union-Intersection Tests and Intersection-Union Tests
In this section, we proceed to review the TOST, or two one-sided tests. The TOST involves performing two one-sided tests, one to test whether the difference between the treatments or populations is greater than the lower equivalence margin and another to test whether the difference is less than the upper equivalence margin. If the results of both tests are significant, it suggests that the treatments or populations are equivalent within the defined margins. Before proceeding, we first review the union-intersection test and the intersection-union test, which provide the theoretical foundation for the TOST. Suppose the parameter space of a hypothesis testing problem is $\Theta = \Theta_0 \cup \Theta_0^c$, where $\Theta_0, \Theta_0^c$ are the parameter spaces of the null and alternative hypotheses, respectively. Furthermore, let $R$ be the rejection region of the test; then the power function of the test, as a function of $\theta$, is defined as $\beta(\theta) = P_\theta(T \in R)$, where $T$ is the test statistic. Before moving on, we first clarify two definitions that are commonly confused with one another: the size and the level of a test.
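Definition 1.
For $0 \le \alpha \le 1$, a test with power function $\beta(\theta)$ is a size-$\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha$.
Definition 2.
For $0 \le \alpha \le 1$, a test with power function $\beta(\theta)$ is a level-$\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) \le \alpha$.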
These definitions help distinguish between the size and the level of a test, which are related but distinct concepts. The size of a test is the maximum probability of making a Type I error (i.e., rejecting the null hypothesis when it is true) over the null hypothesis parameter space $\Theta_0$. In contrast, the level of a test is an upper bound on the probability of making a Type I error. When the size of a test equals $\alpha$, it is a size-$\alpha$ test, whereas when the size of a test is less than or equal to $\alpha$, it is a level-$\alpha$ test. For a level-$\alpha$ test, the size of the test may be much less than $\alpha$, and in such a case, if we still use $\alpha$ as the significance level, the test will be less powerful.
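To make the distinction concrete, the sketch below evaluates the power function of a one-sided $z$-test of $H_0: \theta \le 0$ with known unit variance, and approximates the size by maximizing over a grid of null values. The setting, grid, and function names are assumptions made for illustration only. With the cutoff $z_{\alpha}$ the test has size $\alpha$; with the stricter cutoff $z_{\alpha/2}$ it is still a level-$\alpha$ test, but its size is only $\alpha/2$.

```python
from statistics import NormalDist

Z = NormalDist()


def power(theta, n, cutoff):
    """Power of the test that rejects when sqrt(n) * xbar > cutoff (sigma = 1)."""
    return 1 - Z.cdf(cutoff - n ** 0.5 * theta)


def approx_size(n, cutoff, null_grid):
    """Approximate size: the largest rejection probability over the null grid."""
    return max(power(theta, n, cutoff) for theta in null_grid)


alpha, n = 0.05, 20
null_grid = [-i / 100 for i in range(201)]  # theta in [-2, 0], i.e. H0: theta <= 0

size_exact = approx_size(n, Z.inv_cdf(1 - alpha), null_grid)
size_conservative = approx_size(n, Z.inv_cdf(1 - alpha / 2), null_grid)
print(size_exact, size_conservative)
```

In both cases the maximum over the null is attained at the boundary $\theta = 0$, which is why the grid includes that point.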
Obviously, the null hypothesis in (2) can be viewed as the union of two simpler null hypotheses $H_{01}: \eta_T - \eta_R \le \theta_L$ and $H_{02}: \eta_T - \eta_R \ge \theta_U$. Not only in bioequivalence studies, but also in other applications, tests for such complicated hypotheses can be developed from tests for simpler hypotheses. Two general constructions of this kind are the so-called intersection-union tests (IUT) and union-intersection tests (UIT), which play an important role in bioequivalence testing and multiple comparison procedures.
Definition 3.
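Consider a null hypothesis $H_0$ with rejection region $R$. A family of hypotheses $H_{0i}$ versus $H_{1i}$ for $i = 1, \dots, k$ with corresponding rejection regions $R_i$ is said to obey the intersection-union principle if
$$H_0 = \bigcup_{i=1}^{k} H_{0i}, \qquad H_1 = \bigcap_{i=1}^{k} H_{1i}, \tag{3}$$
and is said to obey the union-intersection principle if
$$H_0 = \bigcap_{i=1}^{k} H_{0i}, \qquad H_1 = \bigcup_{i=1}^{k} H_{1i}. \tag{4}$$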
It is straightforward to show that the rejection region of the IUT is $R = \bigcap_{i=1}^{k} R_i$. The logic behind this is simple. $H_0$ is false if and only if all of the $H_{0i}$ are false, so rejecting $H_0$ is equivalent to rejecting each individual $H_{0i}$. Similarly, for the UIT, if any one of the hypotheses $H_{0i}$ is rejected, then $H_0$ must also be rejected, and only if each of the $H_{0i}$ is accepted will the intersection $H_0$ be accepted. Thus, the rejection region of the UIT is $R = \bigcup_{i=1}^{k} R_i$. Clearly, the hypothesis testing problem in (2) belongs to the IUT. Now, a natural question arises: given the significance level of each individual test of $H_{0i}$, what is the significance level of the IUT? The following theorem gives an upper bound for the size of the IUT.
Theorem 2.
Let $\alpha_i$ be the size of the test of $H_{0i}$ with rejection region $R_i$. Then the IUT with rejection region $R = \bigcap_{i=1}^{k} R_i$ is a level $\alpha = \max_{1 \le i \le k} \alpha_i$ test.
It should be remarked that Theorem 2 only shows that the level of an IUT is $\alpha = \max_i \alpha_i$, that is, an upper bound for the size of the IUT. In fact, the size of the IUT could be much less than $\alpha$. The following theorem provides the conditions under which the size of the IUT is exactly $\alpha$, ensuring that the IUT is not too conservative.
Theorem 3.
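Consider testing $H_0: \theta \in \bigcup_{j=1}^{k} \Theta_j$ and let $R_j$ be the rejection region of a level-$\alpha$ test of $H_{0j}: \theta \in \Theta_j$. Suppose that for some $i \in \{1, \dots, k\}$, there exists a sequence of parameters $\theta_l \in \Theta_i$, $l = 1, 2, \dots$, such that
(1) $\lim_{l \to \infty} P_{\theta_l}(X \in R_i) = \alpha$,
(2) $\lim_{l \to \infty} P_{\theta_l}(X \in R_j) = 1$ for each $j \ne i$.
Then, the IUT with rejection region $R = \bigcap_{j=1}^{k} R_j$ is a size-$\alpha$ test.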
2.2 Two One-sided Tests (TOST)
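We are now in a position to formally introduce the method for bioequivalence testing in (2). Building on the theoretical guarantees of the IUT, Westlake, (1981); Schuirmann, (1981, 1987) proposed the so-called two one-sided tests (TOST), which is now one of the standard procedures for testing (2). As its name implies, TOST consists of two one-sided hypotheses
$$H_{01}: \eta_T - \eta_R \le \theta_L \quad \text{versus} \quad H_{11}: \eta_T - \eta_R > \theta_L \tag{5}$$
and
$$H_{02}: \eta_T - \eta_R \ge \theta_U \quad \text{versus} \quad H_{12}: \eta_T - \eta_R < \theta_U. \tag{6}$$
The $H_0$ in (2) can be expressed as the union of $H_{01}$ and $H_{02}$. This procedure establishes bioequivalence at significance level $\alpha$ if both $H_{01}$ and $H_{02}$ are rejected at level $\alpha$. The underlying rationale is simple. If one may conclude that $\eta_T - \eta_R > \theta_L$ and also $\eta_T - \eta_R < \theta_U$, then it has in effect been concluded that $\theta_L < \eta_T - \eta_R < \theta_U$. In practice, the size-$\alpha$ Student's $t$-test is used for (5) and (6), and $H_0$ is rejected if
$$T_L = \frac{\bar Y_T - \bar Y_R - \theta_L}{\mathrm{SE}} > t_{\alpha,\nu} \quad \text{and} \quad T_U = \frac{\bar Y_T - \bar Y_R - \theta_U}{\mathrm{SE}} < -t_{\alpha,\nu}, \tag{7}$$
where $\bar Y_T, \bar Y_R$ are the averages of the log-transformed PK data for the treatment and reference products, $\mathrm{SE}$ is the standard error of $\bar Y_T - \bar Y_R$ computed from $s_T$ and $s_R$, the sample standard deviations of the two groups, and $t_{\alpha,\nu}$ is the upper-$\alpha$ critical value of the $t$ distribution with $\nu$ degrees of freedom. In addition, it should be mentioned that the size of each individual $t$-test is $\alpha$, not $\alpha/2$. There is no need for a multiplicity adjustment in testing the two one-sided null hypotheses for a univariate PK parameter. Suppose the size of each $t$-test is $\alpha$. Since TOST belongs to the IUT, from Theorem 2 we know the level of TOST is $\alpha$, that is, the size is at most $\alpha$. To prove that the size of TOST is exactly $\alpha$, we need to check the conditions of Theorem 3. First, consider a sequence of parameter points with $\eta_T - \eta_R = \theta_U$ and standard deviation $\sigma_l \to 0$ as $l \to \infty$. Then we have
$$P(T_U < -t_{\alpha,\nu}) = \alpha \quad \text{for every } l,$$
thus the first condition in Theorem 3 holds true. Furthermore,
$$\lim_{l \to \infty} P(T_L > t_{\alpha,\nu}) = 1,$$
because $\bar Y_T - \bar Y_R - \theta_L$ converges in probability to $\theta_U - \theta_L > 0$ while $\mathrm{SE}$ converges to zero. Therefore, the second condition in Theorem 3 is also satisfied, and we summarize the above analysis in the following theorem.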
Theorem 4.
Suppose the size of each individual test in TOST is $\alpha$; then the size of TOST equals $\alpha$ exactly.
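The limiting argument behind Theorem 4 can be illustrated numerically in a simplified setting. The sketch below assumes the standard error is known, so that the two one-sided $t$-tests become $z$-tests and the rejection probability has a closed form; this simplification, the function name, and the numerical values are assumptions of the example. At the boundary of the null hypothesis (true difference equal to the upper limit), the rejection probability never exceeds $\alpha$ and approaches $\alpha$ as the standard error shrinks.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def tost_reject_prob(eta, se, theta_l, theta_u, alpha):
    """P(both one-sided z-tests reject) when D ~ N(eta, se^2) with se known.

    TOST rejects iff theta_l + z * se < D < theta_u - z * se, where z is the
    upper-alpha normal quantile; the probability is 0 if that range is empty.
    """
    z = Z.inv_cdf(1 - alpha)
    lo, hi = theta_l + z * se, theta_u - z * se
    if lo >= hi:
        return 0.0
    return Z.cdf((hi - eta) / se) - Z.cdf((lo - eta) / se)


alpha = 0.05
theta_l, theta_u = math.log(0.8), math.log(1.25)

# Rejection probability at the boundary point eta = theta_u for shrinking se.
for se in (0.2, 0.1, 0.05, 0.01):
    print(se, tost_reject_prob(theta_u, se, theta_l, theta_u, alpha))
```

For large standard errors the rejection probability at the boundary is far below $\alpha$, which is the conservativeness that Theorem 2 alone cannot rule out.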
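To compute the power of TOST, Phillips, (1990) derived the explicit form of the power as a function of $\eta_T - \eta_R$, given by
$$\beta(\eta_T - \eta_R) = Q_\nu\left(-t_{\alpha,\nu}, \delta_U; 0, R\right) - Q_\nu\left(t_{\alpha,\nu}, \delta_L; 0, R\right),$$
where $\delta_L = (\eta_T - \eta_R - \theta_L)/\sigma_D$, $\delta_U = (\eta_T - \eta_R - \theta_U)/\sigma_D$, $\sigma_D$ is the true standard deviation of $\bar Y_T - \bar Y_R$, $R = \sqrt{\nu}\,(\delta_L - \delta_U)/(2 t_{\alpha,\nu})$, and
$$Q_\nu(t, \delta; a, b) = \frac{\sqrt{2\pi}}{\Gamma(\nu/2)\, 2^{(\nu-2)/2}} \int_a^b \Phi\left(\frac{t x}{\sqrt{\nu}} - \delta\right) x^{\nu-1} \phi(x)\, dx$$
is referred to as Owen's $Q$ function, and $\Phi, \phi$ are the cdf and pdf of the standard normal distribution, respectively.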
In the previous sections, the theoretical foundations of the TOST method for bioequivalence testing have been established. However, the relationship between the TOST and the $100(1-\alpha)\%$ and $100(1-2\alpha)\%$ confidence intervals is not yet clear. In the next section, we explore the connections between the TOST and these two types of confidence intervals, and their implications for bioequivalence testing of univariate pharmacokinetic parameters.
2.3 Confidence Sets
Intuitively, a hypothesis test is associated with an equivalent confidence interval approach. Traditionally, bioequivalence is claimed if the two-sided $100(1-2\alpha)\%$ confidence interval of the geometric mean ratio, or equivalently, of the arithmetic mean difference of the log-transformed data, falls within the BE limits. The use of a $100(1-2\alpha)\%$ confidence interval for bioequivalence testing, rather than a $100(1-\alpha)\%$ confidence interval, has been a subject of debate among statisticians. The relationship between the size-$\alpha$ TOST procedure and the $100(1-2\alpha)\%$ confidence interval approach has been questioned by some researchers, who argue that the similarity is based more on an algebraic coincidence than on a true statistical equivalence. Brown et al., (1995) suggested that the association between the TOST procedure and the $100(1-2\alpha)\%$ confidence interval approach may be somewhat of a fiction, while Berger and Hsu, (1996) pointed out that using a $100(1-2\alpha)\%$ confidence interval for bioequivalence testing can be conservative and may only work in specific cases. Other discussions on the use of the confidence interval procedure in bioequivalence testing can be found in Westlake, (1976), Westlake, (1981), Rani et al., (2004), Choi et al., (2008). These discussions highlight the importance of understanding the underlying statistical concepts and assumptions when choosing a confidence interval procedure for a given study.
2.3.1 $100(1-\alpha)\%$ Confidence Interval
As we mentioned before, there are many different formulations of the bioequivalence hypothesis that lead to alternative tests and confidence intervals. In this section, we discuss a $100(1-\alpha)\%$ confidence interval approach that corresponds exactly to a size-$\alpha$ TOST. It is well known that there is a close relationship between a level-$\alpha$ hypothesis test and a $100(1-\alpha)\%$ confidence set, that is, $H_0$ is rejected if and only if the intersection of the confidence set and the null hypothesis is empty. We summarize this property in the following theorem.
Theorem 5.
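Let $\Theta$ be the parameter space. For each $\theta_0 \in \Theta$, let $T_{\theta_0}$ be a test statistic for $H_0: \theta = \theta_0$ with significance level $\alpha$ and acceptance region $A(\theta_0)$; then the set $C(\mathbf{x}) = \{\theta_0 : \mathbf{x} \in A(\theta_0)\}$ is a level-$(1-\alpha)$ confidence set for $\theta$.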
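In other words, suppose $C(\mathbf{X})$ is a confidence interval for the parameter $\theta$ with confidence coefficient equal to $1-\alpha$, that is, $\inf_{\theta} P_\theta(\theta \in C(\mathbf{X})) = 1-\alpha$. Consider the hypothesis testing problem $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$. Then, by the duality in Theorem 5, the test that rejects $H_0$ if and only if $C(\mathbf{X}) \cap \Theta_0 = \emptyset$ is a level-$\alpha$ test. In this section, we apply Theorem 5 to show that a size-$\alpha$ TOST is associated with a $100(1-\alpha)\%$ confidence interval.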
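The duality between tests and confidence sets can be made concrete by inverting a test over a grid of candidate parameter values. The sketch below is an illustration only: it assumes a two-sided $z$-test with known standard error, and the grid and function names are hypothetical. The set of values that are not rejected recovers, up to the grid resolution, the familiar interval $\bar x \pm z_{\alpha/2}\,\mathrm{se}$.

```python
from statistics import NormalDist


def accepts(theta0, xbar, se, alpha):
    """Two-sided level-alpha z-test of H0: theta = theta0 with known standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(xbar - theta0) <= z * se


def confidence_set(xbar, se, alpha, grid):
    """Invert the test: keep every theta0 on the grid that is not rejected."""
    return [theta0 for theta0 in grid if accepts(theta0, xbar, se, alpha)]


grid = [i / 1000 for i in range(-1000, 1001)]  # candidate values in [-1, 1]
cs = confidence_set(xbar=0.1, se=0.05, alpha=0.05, grid=grid)
print(min(cs), max(cs))
```

The same inversion applies to any family of tests, which is how the interval in the next theorem is tied to TOST.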
Theorem 6.
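Consider the hypothesis testing problem in (2) and define the following confidence interval for $\eta_T - \eta_R$:
$$C_{1-\alpha} = \left[\min(0, L),\ \max(0, U)\right], \tag{8}$$
where
$$L = \bar Y_T - \bar Y_R - t_{\alpha,\nu}\,\mathrm{SE}, \qquad U = \bar Y_T - \bar Y_R + t_{\alpha,\nu}\,\mathrm{SE}, \tag{9}$$
and $\mathrm{SE}$ is the standard error of $\bar Y_T - \bar Y_R$ computed from $s_T$ and $s_R$, the sample standard deviations of the two groups. Then the following two statements hold true:
(1) If $\eta_T - \eta_R \ne 0$, then the coverage probability of $C_{1-\alpha}$ is $1-\alpha$; otherwise, the coverage probability equals 1.
(2) $C_{1-\alpha}$ is associated with the size-$\alpha$ TOST for (2).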
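Following Theorem 6, we make the following remarks.
Remark 2.
(a) From (1) in Theorem 6, we know $C_{1-\alpha}$ is a level-$(1-\alpha)$ confidence interval.
(b) If $[L, U]$ contains zero, then $C_{1-\alpha}$ is identical to the $100(1-2\alpha)\%$ confidence interval which will be introduced in the next section.
(c) From a Bayesian point of view, if the prior distribution of $(\eta_T, \eta_R, \sigma^2)$ is noninformative, that is, $\pi(\eta_T, \eta_R, \sigma^2) \propto 1/\sigma^2$, then the posterior credible probability of $C_{1-\alpha}$ is exactly $1-2\alpha$ for $\bar Y_T - \bar Y_R = 0$ and converges to $1-\alpha$ as $|\bar Y_T - \bar Y_R|/\mathrm{SE} \to \infty$.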
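The association between the interval of Theorem 6 and TOST can be checked directly. The sketch below assumes the interval has the form $[\min(0, L), \max(0, U)]$ with $L$ and $U$ equal to the estimate minus and plus the critical value times the standard error, and that the log-scale limits satisfy $\theta_L < 0 < \theta_U$. The function names and the numerical grid are hypothetical. Under these assumptions, rejecting when the interval lies inside the limits gives the same decision as the two one-sided tests.

```python
import math


def interval_1_minus_alpha(diff, se, crit):
    """Interval [min(0, L), max(0, U)] with L = diff - crit*se and U = diff + crit*se."""
    lower, upper = diff - crit * se, diff + crit * se
    return min(0.0, lower), max(0.0, upper)


def tost_rejects(diff, se, crit, theta_l, theta_u):
    """TOST decision: both one-sided statistics pass the critical value."""
    return (diff - theta_l) / se > crit and (diff - theta_u) / se < -crit


def interval_rejects(diff, se, crit, theta_l, theta_u):
    """Interval decision: reject H0 when the interval lies inside (theta_l, theta_u)."""
    lo, hi = interval_1_minus_alpha(diff, se, crit)
    return theta_l < lo and hi < theta_u


theta_l, theta_u = math.log(0.8), math.log(1.25)
crit = 1.812  # upper 5% point of the t distribution with 10 df
agree = all(
    tost_rejects(d / 100, se, crit, theta_l, theta_u)
    == interval_rejects(d / 100, se, crit, theta_l, theta_u)
    for d in range(-40, 41)
    for se in (0.02, 0.05, 0.1)
)
print(agree)
```

The agreement relies on zero lying strictly between the limits: extending the interval to include zero can never move it outside $(\theta_L, \theta_U)$.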
2.3.2 $100(1-2\alpha)\%$ Confidence Interval
Following the guidance of the FDA, in practice, bioequivalence is claimed if the two-sided $100(1-2\alpha)\%$ confidence interval of the geometric mean ratio falls within the BE limits. However, from Theorem 5 we can only see the connection between TOST and the $100(1-\alpha)\%$ confidence interval. The reason why a $100(1-2\alpha)\%$ confidence interval yields a size-$\alpha$ test is still unclear. In this section, owing to the “equal-tail” property, we will see that this is not only an “algebraic coincidence”, but is also theoretically guaranteed. In general, the size of a test associated with a $100(1-2\alpha)\%$ confidence interval is not $\alpha$. For example, consider the following confidence interval
$$C_{\alpha_1,\alpha_2} = \left[\bar Y_T - \bar Y_R - t_{\alpha_1,\nu}\,\mathrm{SE},\ \bar Y_T - \bar Y_R + t_{\alpha_2,\nu}\,\mathrm{SE}\right], \tag{10}$$
where $\alpha_1 + \alpha_2 = 2\alpha$. It is obvious that $C_{\alpha_1,\alpha_2}$ defined above is a $100(1-2\alpha)\%$ confidence interval. As $\alpha_1 \to 0$, the confidence interval approaches $(-\infty,\ \bar Y_T - \bar Y_R + t_{2\alpha,\nu}\,\mathrm{SE}]$, and the size associated with this confidence interval approaches $2\alpha$, not $\alpha$. In fact, the lower confidence bound in (10) defines a size-$\alpha_1$ test, and similarly, the upper confidence bound in (10) defines a size-$\alpha_2$ test. From the IUT theory in Theorem 2 and Theorem 3, we know the size of the test defined by $C_{\alpha_1,\alpha_2}$ in (10) is $\max(\alpha_1, \alpha_2)$, which equals $\alpha$ only if $\alpha_1 = \alpha_2 = \alpha$. We summarize the above analysis in the following theorem.
Theorem 7.
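Let $\mathbf{X} = (X_1, \dots, X_n)$ be a random sample from a probability distribution with statistical parameter $\theta$. Suppose $[L(\mathbf{X}), \infty)$ is a $100(1-\alpha_1)\%$ lower confidence interval for $\theta$, $(-\infty, U(\mathbf{X})]$ is a $100(1-\alpha_2)\%$ upper confidence interval for $\theta$, and $L(\mathbf{X}) \le U(\mathbf{X})$. Then $[L(\mathbf{X}), U(\mathbf{X})]$ is a $100(1-\alpha_1-\alpha_2)\%$ confidence interval for $\theta$.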
Remark 3.
The above theorem reveals the importance of “equal-tail”, that is, $\alpha_1 = \alpha_2 = \alpha$: without this property, the $100(1-2\alpha)\%$ confidence interval approach will not yield a size-$\alpha$ test. In other words, the total error rate of $2\alpha$ should be spent half above and half below the observed mean difference. Moreover, this theorem also provides another reason why $(\theta_L, \theta_U)$ in (2) should be symmetric about zero. Otherwise, for example, if $|\theta_L| \ne |\theta_U|$, it may seem reasonable to put more weight on one of $\alpha_1, \alpha_2$ than on the other, in which case the size of the procedure is no longer $\alpha$.
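The effect of unequal tails can be illustrated numerically. The following sketch assumes a known standard error, so that normal quantiles replace the $t$ critical values, and evaluates the rejection probability of the interval-inclusion rule at the two boundary points of the null hypothesis with a small standard error, where the supremum is approached. The function names and numerical values are assumptions of the example. The resulting size is approximately the larger of the two tail error rates.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def reject_prob(eta, se, theta_l, theta_u, alpha1, alpha2):
    """P(interval inside (theta_l, theta_u)) for the interval
    [D - z_{alpha1} * se, D + z_{alpha2} * se] with D ~ N(eta, se^2), se known."""
    lo = theta_l + Z.inv_cdf(1 - alpha1) * se
    hi = theta_u - Z.inv_cdf(1 - alpha2) * se
    if lo >= hi:
        return 0.0
    return Z.cdf((hi - eta) / se) - Z.cdf((lo - eta) / se)


def approx_size(theta_l, theta_u, alpha1, alpha2, se=1e-3):
    """Rejection probability at the two boundary points for a small se;
    the larger of the two approximates the size of the procedure."""
    return max(
        reject_prob(theta_l, se, theta_l, theta_u, alpha1, alpha2),
        reject_prob(theta_u, se, theta_l, theta_u, alpha1, alpha2),
    )


theta_l, theta_u = math.log(0.8), math.log(1.25)
for a1, a2 in ((0.05, 0.05), (0.02, 0.08), (0.01, 0.09)):
    print(a1, a2, approx_size(theta_l, theta_u, a1, a2))
```

All three intervals have nominal coverage 90%, but only the equal-tailed choice keeps the size of the associated test at 5%.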
2.4 More Powerful Tests
From the previous analysis, we know that there are various procedures for testing (2), and a natural question arises: which one is better? Unfortunately, even though the FDA suggests using TOST to test bioequivalence, the procedure is not the best. In fact, TOST has been criticized by many authors. On the one hand, TOST yields a biased test; on the other hand, the power of TOST can be quite low. In practice, we would like a hypothesis test to be more likely to reject $H_0$ if $\theta \in \Theta_0^c$ than if $\theta \in \Theta_0$, that is, $\beta(\theta') \ge \beta(\theta'')$ for every $\theta' \in \Theta_0^c$ and $\theta'' \in \Theta_0$. The following definition describes the tests having this property.
Definition 4.
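Let $\alpha$ be the significance level. A hypothesis test with power function $\beta(\theta)$ is said to be unbiased if and only if $\beta(\theta) \le \alpha$ for all $\theta \in \Theta_0$ and $\beta(\theta) \ge \alpha$ for all $\theta \in \Theta_0^c$.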
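Unbiasedness requires the power to be at least the significance level everywhere in the alternative. The sketch below probes this requirement for a known-standard-error version of TOST; the simplification to $z$-tests, the function name, and the numerical values are assumptions made for illustration. At the center of the equivalence region the power is high when the standard error is small, but it drops below the significance level, and eventually to zero, when the standard error is large.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def tost_power(eta, se, theta_l, theta_u, alpha):
    """Rejection probability of TOST with known se when the true difference is eta."""
    z = Z.inv_cdf(1 - alpha)
    lo, hi = theta_l + z * se, theta_u - z * se
    if lo >= hi:  # the two one-sided rejection regions do not overlap
        return 0.0
    return Z.cdf((hi - eta) / se) - Z.cdf((lo - eta) / se)


alpha = 0.05
theta_l, theta_u = math.log(0.8), math.log(1.25)

# Power at eta = 0, the center of the alternative, for increasing se.
for se in (0.05, 0.10, 0.135, 0.14):
    print(se, tost_power(0.0, se, theta_l, theta_u, alpha))
```

A power below the significance level at a point of the alternative is exactly what the definition of an unbiased test excludes.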
Unfortunately, even when it is implemented through a confidence interval approach, TOST is still a biased test. Furthermore, it has also been shown that TOST is conservative and inefficient under an asymptotic setup. Much attention has been paid to the equivalence testing problem in (2) for decades; to mention but a few, see Anderson and Hauck, (1983), Brown et al., (1997), Martín Andrés and Herranz Tejedor, (2004), Choi et al., (2008), Fogarty and Small, (2014), Pesarin et al., (2016). See Wellek, (2002), Meyners, (2012) for a comprehensive review. Even though many methodologies have been proposed to test bioequivalence, the theoretical results are limited. Except for some special models, such as the normal distribution with known variance, no finite sample optimality theory is available for tests of equivalence. To the best of our knowledge, the work by Romano, (2005) is the first theoretical result where the asymptotic optimality theory is established. Since the optimality theory is beyond the scope of the paper, we only show two examples to give the reader some intuition about optimal testing of equivalence.
Example 1 (Normal Distribution with Known Variance).
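Suppose $X_1, \dots, X_n$ are i.i.d. $N(\theta, \sigma^2)$, where $\sigma^2$ is known. Consider the hypothesis testing problem $H_0: |\theta| \ge \Delta$ versus $H_1: |\theta| < \Delta$. Then the uniformly most powerful level-$\alpha$ test rejects $H_0$ if $|\bar X| \le C$, where $C$ satisfies
$$\Phi\left(\frac{\sqrt{n}\,(C - \Delta)}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}\,(-C - \Delta)}{\sigma}\right) = \alpha$$
and $\Phi$ is the cdf of the standard normal distribution.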
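The cutoff in Example 1 has no closed form but is easy to compute numerically. The sketch below assumes the test rejects when $|\bar X| \le C$, with $C$ chosen so that the rejection probability at the boundary $\theta = \Delta$ equals $\alpha$, and solves for $C$ by bisection. The function names and the numerical inputs are hypothetical. For comparison it also prints the cutoff $\Delta - z_\alpha \sigma/\sqrt{n}$ implied by the known-variance version of TOST.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def boundary_reject_prob(c, n, sigma, delta):
    """P(|Xbar| <= c) when the true mean equals delta, the edge of the null."""
    s = sigma / math.sqrt(n)
    return Z.cdf((c - delta) / s) - Z.cdf((-c - delta) / s)


def ump_cutoff(n, sigma, delta, alpha, tol=1e-12):
    """Solve boundary_reject_prob(c) = alpha for c by bisection.

    The probability is increasing in c, equal to 0 at c = 0 and close to 1
    at the upper bracket, so a unique root exists in between.
    """
    lo, hi = 0.0, delta + 10 * sigma / math.sqrt(n)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if boundary_reject_prob(mid, n, sigma, delta) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n, sigma, delta, alpha = 6, 0.25, math.log(1.25), 0.05
c_ump = ump_cutoff(n, sigma, delta, alpha)
c_tost = delta - Z.inv_cdf(1 - alpha) * sigma / math.sqrt(n)
print(c_ump, c_tost)
```

The rejection region $|\bar X| \le C$ is wider than the one implied by TOST in this setting, which is the source of the power gain.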
However, the uniformly most powerful test does not exist if $\sigma^2$ is unknown. The next example generalizes the result to the exponential family.
Example 2 (One-parameter Exponential Family).
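Suppose $X_1, \dots, X_n$ are i.i.d. generated from the following one-parameter exponential family
$$f(x; \theta) = C(\theta)\exp\{Q(\theta)\,T(x)\}\,h(x),$$
where $Q(\theta)$ is an increasing function of $\theta$ only, $h(x)$ is a function of $x$ only, and the family has monotone likelihood ratio in $T(x)$. Write $T = \sum_{i=1}^n T(X_i)$. Then the uniformly most powerful level-$\alpha$ test for the equivalence hypothesis testing problem $H_0: \theta \le \theta_1$ or $\theta \ge \theta_2$ versus $H_1: \theta_1 < \theta < \theta_2$ is
$$\phi(\mathbf{x}) = \begin{cases} 1, & C_1 < T < C_2, \\ \gamma_i, & T = C_i,\ i = 1, 2, \\ 0, & T < C_1 \text{ or } T > C_2, \end{cases}$$
where $C_1, C_2, \gamma_1, \gamma_2$ are determined by $E_{\theta_1}\phi(\mathbf{X}) = E_{\theta_2}\phi(\mathbf{X}) = \alpha$.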
3 Conclusion
The paper discusses the significance of bioequivalence studies, which compare the biological equivalence of two different formulations of a drug. These studies are conducted in clinical settings with human subjects to compare the pharmacokinetic profiles of the two formulations. Acceptance criteria for bioequivalence are generally based on statistical methods that evaluate the similarities between the two formulations. The paper explains the theory of intersection-union tests and how they can be used to determine bioequivalence. The TOST (two one-sided tests) approach is the standard approach to bioequivalence testing and has been shown to maintain a size-$\alpha$ test by rejecting the null hypothesis only if both one-sided hypotheses are rejected. The paper provides insights into the theory behind bioequivalence testing, including the use of the geometric mean, the non-symmetry of the bioequivalence limits around 1, and the product of the bioequivalence limits equaling 1. The paper also discusses the connection between the $100(1-\alpha)\%$ confidence interval approach and the $100(1-2\alpha)\%$ confidence interval approach.
References
- Anderson and Hauck, (1983) Anderson, S. and Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics-Theory and Methods, 12(23):2663–2692.
- Berger and Hsu, (1996) Berger, R. L. and Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4):283–319.
- Brown et al., (1995) Brown, L. D., Casella, G., and Gene Hwang, J. (1995). Optimal confidence sets, bioequivalence, and the limaçon of Pascal. Journal of the American Statistical Association, 90(431):880–889.
- Brown et al., (1997) Brown, L. D., Hwang, J. G., and Munk, A. (1997). An unbiased test for the bioequivalence problem. The Annals of Statistics, 25(6):2345–2367.
- Choi et al., (2008) Choi, L., Caffo, B., and Rohde, C. (2008). A survey of the likelihood approach to bioequivalence trials. Statistics in Medicine, 27(24):4874–4894.
- Fogarty and Small, (2014) Fogarty, C. B. and Small, D. S. (2014). Equivalence testing for functional data with an application to comparing pulmonary function devices. The Annals of Applied Statistics, pages 2002–2026.
- Martín Andrés and Herranz Tejedor, (2004) Martín Andrés, A. and Herranz Tejedor, I. (2004). Asymptotical tests on the equivalence, substantial difference and non-inferiority problems with two proportions. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 46(3):305–319.
- Meyners, (2012) Meyners, M. (2012). Equivalence tests–a review. Food quality and preference, 26(2):231–245.
- Pesarin et al., (2016) Pesarin, F., Salmaso, L., Carrozzo, E., and Arboretti, R. (2016). Union–intersection permutation solution for two-sample equivalence testing. Statistics and Computing, 26(3):693–701.
- Phillips, (1990) Phillips, K. F. (1990). Power of the two one-sided tests procedure in bioequivalence. Journal of pharmacokinetics and biopharmaceutics, 18(2):137–144.
- Rani et al., (2004) Rani, S., Pargal, A., et al. (2004). Bioequivalence: An overview of statistical concepts. Indian Journal of Pharmacology, 36:209–216.
- Romano, (2005) Romano, J. P. (2005). Optimal testing of equivalence hypotheses. The Annals of Statistics, 33(3):1036–1047.
- Schuirmann, (1981) Schuirmann, D. (1981). On hypothesis-testing to determine if the mean of a normal-distribution is contained in a known interval. Biometrics, 37:617.
- Schuirmann, (1987) Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of pharmacokinetics and biopharmaceutics, 15(6):657–680.
- Shen et al., (2006) Shen, H., Brown, L. D., and Zhi, H. (2006). Efficient estimation of log-normal means with application to pharmacokinetic data. Statistics in medicine, 25(17):3023–3038.
- Shen et al., (2017) Shen, M., Russek-Cohen, E., and Slud, E. V. (2017). Checking distributional assumptions for pharmacokinetic summary statistics based on simulations with compartmental models. Journal of Biopharmaceutical Statistics, 27(5):756–772.
- Wellek, (2002) Wellek, S. (2002). Testing statistical hypotheses of equivalence. Chapman and Hall/CRC.
- Westlake, (1981) Westlake, W. (1981). Bioequivalence testing–a need to rethink. Biometrics, 37(3):589–594.
- Westlake, (1976) Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, pages 741–744.
Appendix A Proof of Theorem 1
Proof.
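The proof of (1) can be accomplished by direct calculation. It can be shown that
$$\log \hat G = \log\left(\prod_{i=1}^n X_i\right)^{1/n} = \frac{1}{n}\sum_{i=1}^n \log X_i.$$
To prove (2), we let $F(x) = \frac{1}{2}\operatorname{erfc}\left(-\frac{\log x - \mu}{\sqrt{2}\,\sigma}\right)$ be the cumulative distribution function of the lognormal distribution, where $\operatorname{erfc}$ is the complementary error function. For simplicity, we let the median of $X_1$ equal $m$, that is, $F(m) = 1/2$. Then, from the definition of the median, we have $\frac{1}{2}\operatorname{erfc}\left(-\frac{\log m - \mu}{\sqrt{2}\,\sigma}\right) = \frac{1}{2}$, thus $\operatorname{erfc}\left(-\frac{\log m - \mu}{\sqrt{2}\,\sigma}\right) = 1$. Therefore, it follows that
$$\log m = \mu - \sqrt{2}\,\sigma\,\operatorname{erfc}^{-1}(1) = \mu,$$
where $\operatorname{erfc}^{-1}$ is the inverse of the complementary error function satisfying $\operatorname{erfc}^{-1}(1) = 0$; thus, the median of $X_1$ equals $e^{\mu}$. In fact, since the logarithm is a monotonically increasing function, the median of $X_1$ is the exponential of the median of $Y_1 = \log X_1$. Since $Y_1$ is normally distributed, the median of $Y_1$ equals $\mu$.
Let $f(x)$ be the pdf of the lognormal distribution. Consider the expected value of $X_i^{1/n}$, $i = 1, \dots, n$. It follows that
$$E\left(X_i^{1/n}\right) = \int_0^\infty x^{1/n} f(x)\,dx = \int_{-\infty}^{\infty} e^{y/n}\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}dy \tag{11}$$
$$= \exp\left\{\frac{\mu}{n} + \frac{\sigma^2}{2n^2}\right\}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu-\sigma^2/n)^2}{2\sigma^2}\right\}dy, \tag{12}$$
where (11) uses the change of variable $y = \log x$. Note that the integrand in (12) is the pdf of a normal distribution with mean $\mu + \sigma^2/n$ and variance $\sigma^2$, thus $E\left(X_i^{1/n}\right) = \exp\left\{\mu/n + \sigma^2/(2n^2)\right\}$, and therefore, by independence, $E(\hat G) = \prod_{i=1}^n E\left(X_i^{1/n}\right) = \exp\left\{\mu + \sigma^2/(2n)\right\}$.
The result of (3) follows from
$$E(\hat G) = e^{\mu}\exp\left\{\frac{\sigma^2}{2n}\right\} = e^{\mu}\left\{1 + \frac{\sigma^2}{2n} + O\left(\frac{1}{n^2}\right)\right\} = e^{\mu}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$
∎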
Appendix B Proof of Theorem 2
Proof.
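Let $\theta \in \Theta_0 = \bigcup_{i=1}^{k}\Theta_i$. Then $\theta \in \Theta_i$ for some $i$ and
$$P_\theta(X \in R) \le P_\theta(X \in R_i) \le \alpha_i \le \alpha.$$
Since the above inequality holds true for arbitrary $\theta \in \Theta_0$, the defined IUT is a level-$\alpha$ test. ∎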
Appendix C Proof of Theorem 3
Proof.
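By Theorem 2, we know the rejection region $R$ yields a level-$\alpha$ test, that is,
$$\sup_{\theta \in \Theta_0} P_\theta(X \in R) \le \alpha. \tag{13}$$
Next, we show that $\sup_{\theta \in \Theta_0} P_\theta(X \in R) \ge \alpha$. Because $\theta_l \in \Theta_i \subset \Theta_0$ for every $l$, we have
$$\sup_{\theta \in \Theta_0} P_\theta(X \in R) \ge \lim_{l \to \infty} P_{\theta_l}\left(X \in \bigcap_{j=1}^{k} R_j\right) \ge \lim_{l \to \infty}\left\{\sum_{j=1}^{k} P_{\theta_l}(X \in R_j) - (k-1)\right\},$$
where the last inequality follows from Bonferroni's inequality. By conditions (1) and (2), we get $\lim_{l \to \infty}\sum_{j=1}^{k} P_{\theta_l}(X \in R_j) = (k-1) + \alpha$. Thus,
$$\sup_{\theta \in \Theta_0} P_\theta(X \in R) \ge \alpha. \tag{14}$$
Combining (13) and (14), we prove that the size of the IUT with rejection region $R$ is exactly $\alpha$. ∎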
Appendix D Proof of Theorem 4
Appendix E Proof of Theorem 5
Proof.
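By the definition of the acceptance region, we have
$$P_{\theta_0}\left(\mathbf{X} \notin A(\theta_0)\right) \le \alpha,$$
which is the same as
$$P_{\theta_0}\left(\mathbf{X} \in A(\theta_0)\right) \ge 1 - \alpha.$$
Since the above statements hold true for all $\theta_0 \in \Theta$,
$$P_\theta\left(\theta \in C(\mathbf{X})\right) = P_\theta\left(\mathbf{X} \in A(\theta)\right) \ge 1 - \alpha,$$
which implies $C(\mathbf{X})$ is a level-$(1-\alpha)$ confidence set for $\theta$. ∎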
Appendix F Proof of Theorem 6
Proof.
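Write $\eta = \eta_T - \eta_R$. If $\eta = 0$, then $\eta \in C_{1-\alpha}$ for all samples, thus the coverage probability of $C_{1-\alpha}$ is 1 in this case. So we only need to consider the case when $\eta \ne 0$.
When $\eta > 0$, the event $\{\min(0, L) \le \eta\}$ is the whole sample space, thus
$$P(\eta \in C_{1-\alpha}) = P\left(\eta \le \max(0, U)\right).$$
And since $\eta > 0$, the event $\{\eta \le \max(0, U)\}$ is the same as the event $\{\eta \le U\}$, so the coverage probability equals
$$P(\eta \le U) = P\left(\frac{\bar Y_T - \bar Y_R - \eta}{\mathrm{SE}} \ge -t_{\alpha,\nu}\right) = 1 - \alpha.$$
Similarly, when $\eta < 0$, the event $\{\eta \le \max(0, U)\}$ is the whole sample space, thus
$$P(\eta \in C_{1-\alpha}) = P\left(\min(0, L) \le \eta\right).$$
Since $\eta < 0$, the event $\{\min(0, L) \le \eta\}$ is the same as the event $\{L \le \eta\}$, so the coverage probability equals
$$P(L \le \eta) = P\left(\frac{\bar Y_T - \bar Y_R - \eta}{\mathrm{SE}} \le t_{\alpha,\nu}\right) = 1 - \alpha.$$
Combining everything above, the proof is completed. ∎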
Appendix G Proof of Theorem 7
Proof.
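By definition, we have
$$P_\theta\left(L(\mathbf{X}) \le \theta\right) = 1 - \alpha_1, \qquad P_\theta\left(\theta \le U(\mathbf{X})\right) = 1 - \alpha_2.$$
Consider the events $A$ and $B$ defined as $A = \{L(\mathbf{X}) \le \theta\}$ and $B = \{\theta \le U(\mathbf{X})\}$. Therefore, the intersection of $A$ and $B$ is $A \cap B = \{L(\mathbf{X}) \le \theta \le U(\mathbf{X})\}$. It follows that
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
Since $L(\mathbf{X}) \le U(\mathbf{X})$, we have $P(A \cup B) = 1$. The result follows from
$$P(A \cap B) = P(A) + P(B) - 1 = 1 - \alpha_1 - \alpha_2.$$
∎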