
On the Confidence Intervals in Bioequivalence Studies

Kexuan Li, Susie Sinks, Peng Sun 
Global Analytics and Data Sciences, Biogen, Cambridge, Massachusetts, US.
and
Lingli Yang
Department of Mathematical Sciences, Worcester Polytechnic Institute
Corresponding author. Email Address: kexuan.li.77@gmail.com
Abstract

A bioequivalence study is a type of clinical trial designed to compare the biological equivalence of two different formulations of a drug. Such studies are typically conducted in controlled clinical settings with human subjects, who are randomly assigned to receive the two formulations. The two formulations are then compared with respect to their pharmacokinetic profiles, which encompass the absorption, distribution, metabolism, and elimination of the drug. Under the guidance of the Food and Drug Administration (FDA), for a size-$\alpha$ bioequivalence test, the standard approach is to construct a $100(1-2\alpha)\%$ confidence interval and verify whether the confidence interval falls within the equivalence limits. In this work, we clarify that the $100(1-2\alpha)\%$ confidence interval approach for bioequivalence testing yields a size-$\alpha$ test only when the two one-sided tests in TOST are “equal-tailed”. Furthermore, a $100(1-\alpha)\%$ confidence interval approach for the bioequivalence study is also discussed.


Keywords: bioequivalence study; two one-sided tests; confidence interval.

1 Motivation

Bioequivalence studies are a type of clinical trial designed to compare the biological equivalence of two different formulations of a drug. The objective of these studies is to establish that the generic drug demonstrates the same rate and extent of absorption as the reference drug, and that its therapeutic effects are comparable. Such studies are typically conducted in controlled clinical settings with human subjects, who are randomly assigned to receive either the generic drug or the reference drug. The two formulations are then compared with respect to their pharmacokinetic profiles, which encompass the absorption, distribution, metabolism, and elimination of the drug. Acceptance criteria for bioequivalence are generally based on statistical methods that assess the similarity between the pharmacokinetic profiles of the two drug products. These comparisons often involve evaluating the area under the concentration-time curve (AUC) and the maximum concentration (Cmax) of the drug in the bloodstream. By ensuring that both AUC and Cmax values fall within pre-defined equivalence ranges, researchers can determine whether the two formulations are bioequivalent.

Suppose we want to assess the bioequivalence of two drugs or formulations. Under the guidance of the Food and Drug Administration (FDA), bioequivalence is claimed if a $90\%$ two-sided confidence interval of the geometric mean ratio falls within $80$–$125\%$. In the context of statistical hypothesis testing, to demonstrate bioequivalence, the following significance-level-$\alpha$ hypothesis testing problem is considered:

H_{0}:\frac{\eta_{T}}{\eta_{R}}\leq\delta_{L}\textrm{ or }\frac{\eta_{T}}{\eta_{R}}\geq\delta_{U}\quad\textrm{versus}\quad H_{a}:\delta_{L}<\frac{\eta_{T}}{\eta_{R}}<\delta_{U}, \qquad (1)

where $\eta_{T},\eta_{R}$ are the population geometric means of the test and reference products, respectively, and $\delta_{L},\delta_{U}$ are the lower and upper equivalence bounds determined by the regulatory agency. Based on the theory of the intersection-union test, Westlake (1981) and Schuirmann (1981, 1987) proposed the two one-sided tests (TOST) procedure, which has since become the standard approach for testing (1). As its name implies, TOST decomposes the interval hypothesis into two size-$\alpha$ one-sided hypotheses and rejects $H_{0}$ if and only if both one-sided hypotheses are rejected. In general, when multiple size-$\alpha$ tests are combined, the overall size of the combined test is no longer $\alpha$. Fortunately, owing to the theory of the intersection-union test, the size of TOST is still $\alpha$, as will be discussed in Section 2.1. It has also been shown that a size-$\alpha$ TOST procedure is identical to constructing a $100(1-2\alpha)\%$ confidence interval for $\frac{\eta_{T}}{\eta_{R}}$ and rejecting $H_{0}$ if the confidence interval falls entirely between $\delta_{L}$ and $\delta_{U}$. Even though the procedure seems simple, some questions still need clarification, for example:

  • Why is the geometric mean preferred over the arithmetic mean?

  • Why are the BE limits (0.8, 1.25) not symmetric about 1?

  • Why is the product of the BE limits equal to 1 ($0.8\times 1.25=1$), i.e., symmetric about 1 on the ratio scale?

  • In general, combining two size-$\alpha$ tests does not yield a size-$\alpha$ test, so why does TOST work? Why is a size-$\alpha$ TOST identical to a $100(1-2\alpha)\%$ confidence interval rather than a $100(1-\alpha)\%$ confidence interval? Is TOST the most powerful test?

The purpose of this paper is to answer these questions theoretically and to provide a clear and in-depth understanding of the bioequivalence study. For example, the geometric mean is preferred in bioequivalence studies because pharmacokinetic parameters, such as AUC and Cmax, typically follow a log-normal distribution. Using the geometric mean ensures that the ratio of the means remains unaffected by extreme values or outliers, leading to a more accurate and robust comparison of the test and reference formulations. The bioequivalence limits (0.8, 1.25) are not symmetric around 1 because they are set on the log-transformed scale: the limits are symmetric about zero on the log scale, and when back-transformed to the original scale they yield asymmetric bounds. The product of the BE limits equals 1 so that the limits are symmetric about the true ratio of 1 on the ratio scale, treating deviations in either direction evenhandedly. TOST works because of the intersection-union test theory. While it may not be the most powerful test in all situations, it maintains an overall size-$\alpha$ test by decomposing the interval hypothesis into two size-$\alpha$ one-sided hypotheses and rejecting the null hypothesis only if both one-sided hypotheses are rejected. The intersection-union test theory ensures that the overall type I error rate remains controlled at the desired level $\alpha$ when the two one-sided tests are combined. In general, combining two size-$\alpha$ tests does not guarantee a size-$\alpha$ test, because the overall type I error rate may be inflated by multiple testing.
However, the TOST procedure, based on the intersection-union test theory, accounts for this issue and maintains the overall type I error rate at the desired level, making it a suitable method for bioequivalence testing.

The rest of the paper is organized as follows. Section 2 provides the theoretical background of the bioequivalence study for a univariate PK parameter. We summarize the paper in Section 3, and technical proofs are provided in the Appendix.

2 Theoretical Background

In this section, we delve into the theoretical background of bioequivalence studies for a univariate pharmacokinetic (PK) parameter. In a typical pharmacokinetic bioequivalence study, univariate response variables such as $\log(\textrm{AUC})$ and $\log(\textrm{C}_{\textrm{max}})$ are often assumed to follow a normal distribution, or equivalently, $\textrm{AUC}$ and $\textrm{C}_{\textrm{max}}$ follow a lognormal distribution (Shen et al., 2006; Shen et al., 2017). Under FDA guidelines, bioequivalence is treated as the statistical hypothesis testing problem defined in (1), which states that the population geometric mean ratio of the test and reference products should fall between $80$–$125\%$. At first glance, it may seem odd that the geometric mean ratio is used instead of the arithmetic mean difference. Moreover, it might appear strange that $80\%$ and $125\%$ are not symmetric about $100\%$, yet $80\%\times 125\%=100\%$, meaning that the lower and upper bounds are symmetric about $100\%$ on the ratio scale. To explain the reason, we take the logarithm of (1) and obtain the following hypothesis test for the difference:

H_{0}:\mu_{T}-\mu_{R}\leq\theta_{L}\textrm{ or }\mu_{T}-\mu_{R}\geq\theta_{U}\quad\textrm{versus}\quad H_{a}:\theta_{L}<\mu_{T}-\mu_{R}<\theta_{U}, \qquad (2)

where $\mu_{T}=\log(\eta_{T})$, $\mu_{R}=\log(\eta_{R})$, $\theta_{L}=\log(\delta_{L})$, and $\theta_{U}=\log(\delta_{U})$. The following theorem explains some of the reasons why the FDA chooses the geometric mean ratio as the test statistic and $80$–$125\%$ as the BE limits.

Theorem 1.

Suppose $X_{1},\ldots,X_{n}$ are independent and identically distributed $\mathcal{N}(\mu,\sigma^{2})$ random variables and $X^{*}_{i}=\exp(X_{i})$; that is, $X^{*}_{1},\ldots,X^{*}_{n}$ are independent and identically distributed from a lognormal distribution with mean $\mu$ and variance $\sigma^{2}$ on the log-transformed scale. Let $\textrm{GM}(X^{*})=(\prod_{i=1}^{n}X^{*}_{i})^{\frac{1}{n}}$ be the geometric mean of $X^{*}_{1},\ldots,X^{*}_{n}$. Then the following statements hold true:

  (1) $\exp(\bar{X})=\textrm{GM}(X^{*})$;

  (2) $\textrm{Median}(X^{*})=\exp(\mu)$;

  (3) $\mathbb{E}[\textrm{GM}(X^{*})]=\exp(\mu+\frac{\sigma^{2}}{2n})$.

The proof is given in the supplementary material. According to Theorem 1, it is straightforward to obtain the following facts:

Remark 1.
  (a) Since the PK parameters are lognormally distributed, hence right-skewed, it is more natural to compare the medians of the test and reference products rather than the arithmetic means;

  (b) According to (1) in Theorem 1, the log-transformed geometric mean of the PK data equals the arithmetic mean of the log-transformed PK data. In other words, comparing the geometric mean ratio between two products is equivalent to comparing the arithmetic mean difference of the log-transformed PK data;

  (c) According to (2) and (3) in Theorem 1, the geometric mean of the PK data is a nearly unbiased estimator of the PK median, with bias of order $O(1/n)$. More specifically, $\mathbb{E}[\textrm{GM}(X^{*})]\rightarrow\textrm{Median}(X^{*})$ as $n\rightarrow\infty$. This implies that, for large sample sizes, the geometric mean of the PK data provides an accurate estimate of the PK median.
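The facts in Theorem 1 and Remark 1 are easy to check numerically. The following sketch (Python standard library only; the parameters $\mu=1$, $\sigma=0.5$ and the sample size are arbitrary illustrations) verifies that exponentiating the mean of the logs reproduces the geometric mean, and that both are close to the lognormal median $\exp(\mu)$:

```python
import math
import random

random.seed(0)

mu, sigma, n = 1.0, 0.5, 200_000

# Lognormal sample: X*_i = exp(X_i) with X_i ~ N(mu, sigma^2)
log_x = [random.gauss(mu, sigma) for _ in range(n)]
x_star = [math.exp(v) for v in log_x]

# (1) exp(mean of the logs) equals the geometric mean; the GM is computed
# on the log scale to avoid overflowing the n-fold product.
gm = math.exp(sum(math.log(v) for v in x_star) / n)
arith_mean_log = sum(log_x) / n

# (2)-(3) the sample median and the GM both approximate exp(mu), since
# E[GM] = exp(mu + sigma^2 / (2n)) -> exp(mu) as n grows.
median = sorted(x_star)[n // 2]
print(f"GM = {gm:.4f}, median = {median:.4f}, exp(mu) = {math.exp(mu):.4f}")
```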

Combining everything above, we can identify the advantages of using the geometric mean ratio as the test statistic: (1) the comparison of the geometric mean ratio can be converted to a comparison of arithmetic mean differences; (2) the geometric mean ratio is a “good” estimator of the median ratio, which applies naturally to the distribution of PK parameters. Following the above analysis, we can further clarify the procedure of bioequivalence testing for a univariate PK parameter in practice:

  • Step 1: Log-transform the PK data (e.g., AUC and Cmax) to convert the lognormal distribution into a normal distribution.

  • Step 2: Test $H_{0}:\mu_{T}-\mu_{R}\leq\theta_{L}\textrm{ or }\mu_{T}-\mu_{R}\geq\theta_{U}$ versus $H_{a}:\theta_{L}<\mu_{T}-\mu_{R}<\theta_{U}$, where $\mu_{T},\mu_{R}$ are the population means of the log-transformed PK data from Step 1 and $\theta_{L}=\log(0.8)=-0.223$, $\theta_{U}=\log(1.25)=0.223$.

  • Step 3: Construct a $100(1-2\alpha)\%$ confidence interval for $\mu_{T}-\mu_{R}$ and reject $H_{0}$ if and only if the confidence interval falls in $[\theta_{L},\theta_{U}]$; the significance level of the corresponding procedure is $\alpha$.
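The three steps above can be sketched in code. In the sketch below, the AUC values are simulated placeholders rather than data from any real study, and the $t$ critical value is obtained from a small stdlib-only numerical routine (bisection on a Simpson-rule CDF) instead of a statistics package:

```python
import math
import random

def t_quantile(p, df, tol=1e-8):
    """Inverse CDF of Student's t with df degrees of freedom, by bisection
    on a numerically integrated density (stdlib-only sketch)."""
    def pdf(x):
        c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
        return c * (1 + x * x / df) ** (-(df + 1) / 2)
    def cdf(q):  # Simpson's rule on [0, q], plus 0.5 by symmetry
        h = q / 400
        s = pdf(0) + pdf(q)
        for i in range(1, 400):
            s += (4 if i % 2 else 2) * pdf(i * h)
        return 0.5 + s * h / 3
    lo, hi = 0.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical PK data (AUC); values are simulated, not from a real study.
random.seed(42)
auc_test = [math.exp(random.gauss(4.65, 0.05)) for _ in range(12)]
auc_ref = [math.exp(random.gauss(4.60, 0.05)) for _ in range(12)]

# Step 1: log-transform the PK data.
xt = [math.log(v) for v in auc_test]
xr = [math.log(v) for v in auc_ref]

# Step 2: BE limits on the log scale, symmetric about zero.
theta_L, theta_U = math.log(0.8), math.log(1.25)

# Step 3: 100(1 - 2*alpha)% = 90% confidence interval for mu_T - mu_R.
alpha = 0.05
nT, nR = len(xt), len(xr)
mt, mr = sum(xt) / nT, sum(xr) / nR
s2t = sum((v - mt) ** 2 for v in xt) / (nT - 1)
s2r = sum((v - mr) ** 2 for v in xr) / (nR - 1)
r = nT + nR - 2
sp = math.sqrt(((nT - 1) * s2t + (nR - 1) * s2r) / r)   # pooled SD
se = sp * math.sqrt(1 / nT + 1 / nR)
tcrit = t_quantile(1 - alpha, r)
lower, upper = (mt - mr) - tcrit * se, (mt - mr) + tcrit * se
bioequivalent = theta_L < lower and upper < theta_U
print(f"90% CI: ({lower:.4f}, {upper:.4f}), bioequivalent: {bioequivalent}")
```

With the equal-tailed limits $\theta_{L}=-\theta_{U}$, rejecting whenever the interval sits inside $(\theta_{L},\theta_{U})$ is exactly the size-$\alpha$ TOST decision discussed in Section 2.2.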

It is important to note that after taking the logarithm, the original BE limits (0.8, 1.25) become symmetric about zero, and this fact plays a crucial role in the $100(1-2\alpha)\%$ confidence interval approach: without the “equal-tail” property, the $100(1-2\alpha)\%$ confidence interval approach is no longer valid. The symmetry of the limits on the log scale allows the total type I error to be split equally between the two tails, giving a balanced evaluation of bioequivalence. See Section 2.3.2 for more details.

2.1 Union-Intersection Tests and Intersection-Union Tests

In this section, we proceed to review the TOST, or two one-sided tests. The TOST involves performing two one-sided tests: one to test whether the difference between the treatments or populations is greater than the lower equivalence margin, and another to test whether the difference is less than the upper equivalence margin. If the results of both tests are significant, the treatments or populations are declared equivalent within the defined margins. Before proceeding, we first review the union-intersection test and the intersection-union test, which provide the theoretical foundation for the TOST. Suppose the parameter space of a hypothesis test is $\Theta=\Theta_{0}\bigcup\Theta_{0}^{c}$, where $\Theta_{0},\Theta_{0}^{c}$ are the parameter spaces of the null and alternative hypotheses, respectively. Furthermore, let $R$ be the rejection region of the test; then the power function of the test, as a function of $\theta\in\Theta$, is defined as $\beta(\theta)=\mathbb{P}_{\theta}(T\in R)$, where $T$ is the test statistic. Before moving on, we clarify two definitions that are commonly confused with one another: the size and the level of a test.

Definition 1.

For $0\leq\alpha\leq 1$, a test with power function $\beta(\theta)$ is a size-$\alpha$ test if $\sup_{\theta\in\Theta_{0}}\beta(\theta)=\alpha$.

Definition 2.

For $0\leq\alpha\leq 1$, a test with power function $\beta(\theta)$ is a level-$\alpha$ test if $\sup_{\theta\in\Theta_{0}}\beta(\theta)\leq\alpha$.

These definitions help distinguish between the size and the level of a test, which are related but distinct concepts. The size of a test is the maximum probability of making a type I error (i.e., rejecting the null hypothesis when it is true) over the null hypothesis parameter space $\Theta_{0}$. In contrast, the level of a test is an upper bound on the probability of making a type I error. When the size of a test equals $\alpha$, it is a size-$\alpha$ test, whereas when the size is less than or equal to $\alpha$, it is a level-$\alpha$ test. For a level-$\alpha$ test, the size may be much less than $\alpha$, and in such a case, if we still use $\alpha$ as the significance level, the test will be less powerful than necessary.

Obviously, the null hypothesis in (2) can be viewed as the union of the two simple null hypotheses $H_{01}:\mu_{T}-\mu_{R}\leq\theta_{L}$ and $H_{02}:\mu_{T}-\mu_{R}\geq\theta_{U}$. Not only in bioequivalence studies but also in other applications, such complicated hypotheses can be built from tests of simpler hypotheses. These constructions are formalized by the so-called intersection-union tests (IUT) and union-intersection tests (UIT), which play an important role in bioequivalence testing and multiple comparison procedures.

Definition 3.

Consider a hypothesis test $H_{0}:\theta\in\Theta_{0}\textrm{ versus }H_{a}:\theta\in\Theta^{c}_{0}$ with rejection region $R$. A family of hypotheses $H_{0i}:\theta\in\Theta_{i}$ versus $H_{ia}:\theta\in\Theta^{c}_{i}$ for $i=1,\ldots,k$ with corresponding rejection regions $R_{i}$ is said to obey the intersection-union principle if

H_{0}:\theta\in\bigcup_{i=1}^{k}\Theta_{i},\textrm{ and }H_{a}:\theta\in\bigcap_{i=1}^{k}\Theta^{c}_{i}, \qquad (3)

and is said to obey the union-intersection principle if

H_{0}:\theta\in\bigcap_{i=1}^{k}\Theta_{i},\textrm{ and }H_{a}:\theta\in\bigcup_{i=1}^{k}\Theta^{c}_{i}. \qquad (4)

It is straightforward to show that the rejection region of the IUT is $\bigcap_{i=1}^{k}R_{i}$. The logic behind this is simple: $H_{0}$ is false if and only if all of the $H_{0i},i=1,\ldots,k$, are false, so rejecting $H_{0}:\theta\in\bigcup_{i=1}^{k}\Theta_{i}$ is equivalent to rejecting each individual $H_{0i}:\theta\in\Theta_{i}$. Similarly, for the UIT, by the same logic, if any one of the hypotheses $H_{0i}$ is rejected, then $H_{0}$ must also be rejected, and only if every $H_{0i}$ is accepted is the intersection $H_{0}$ accepted. Thus, the rejection region of the UIT is $\bigcup_{i=1}^{k}R_{i}$. Clearly, the hypothesis test in (2) belongs to the IUT family. Now a natural question arises: given the size of each individual test of $H_{0i}$, what is the significance level of the IUT? The following theorem gives an upper bound for the size of the IUT.

Theorem 2.

Let $\alpha_{i}$ be the size of the test of $H_{0i}$ with rejection region $R_{i}$. Then the IUT with rejection region $R=\bigcap_{i=1}^{k}R_{i}$ is a level-$\alpha$ test with $\alpha=\sup_{i=1,\ldots,k}\alpha_{i}$.

It should be remarked that Theorem 2 only shows that the level of an IUT is $\alpha=\sup_{i=1,\ldots,k}\alpha_{i}$; that is, it gives an upper bound for the size of the IUT. In fact, the size of the IUT could be much less than $\alpha$. The following theorem provides conditions under which the size of the IUT is exactly $\alpha$, ensuring that the IUT is not too conservative.
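As a toy illustration of how conservative an IUT can be (an example constructed here, not taken from the paper), consider equivalence limits $(-1,1)$, a single observation $X\sim\mathcal{N}(\mu,\sigma^{2})$ with known $\sigma$, and two size-$0.05$ one-sided $z$-tests combined by intersection:

```python
from statistics import NormalDist

# Toy IUT: H01: mu <= -1 rejected if X > -1 + z*sigma,
#          H02: mu >=  1 rejected if X <  1 - z*sigma.
# The IUT rejects iff -1 + z*sigma < X < 1 - z*sigma.
nd = NormalDist()
z = nd.inv_cdf(0.95)  # each one-sided z-test has size 0.05

def iut_size(sigma):
    """Rejection probability at the boundary mu = 1, where it is largest."""
    lo, hi = -1 + z * sigma, 1 - z * sigma
    if lo >= hi:  # empty rejection region: the IUT never rejects
        return 0.0
    return nd.cdf((hi - 1) / sigma) - nd.cdf((lo - 1) / sigma)

print(f"size at sigma=1: {iut_size(1.0):.4f}, size at sigma=0.1: {iut_size(0.1):.4f}")
```

With $\sigma=1$ the rejection region is empty, so the size is 0 even though each component test has size 0.05; as $\sigma\rightarrow 0$ the size climbs back to the nominal 0.05, which is precisely the role of the conditions in the next theorem.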

Theorem 3.

Consider testing $H_{0}:\theta\in\bigcup_{j=1}^{k}\Theta_{j}$ and let $R_{j}$ be the rejection region of a level-$\alpha$ test of $H_{0j}$. Suppose that for some $i\in\{1,\ldots,k\}$ there exists a sequence of parameter points $\theta_{l}\in\Theta_{i}$, $l=1,2,\ldots,$ such that

  (1) $\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{i})=\alpha$;

  (2) for each $j=1,\ldots,k$ with $j\neq i$, $\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{j})=1$.

Then the IUT with rejection region $R=\bigcap_{j=1}^{k}R_{j}$ is a size-$\alpha$ test.

2.2 Two One-sided Tests (TOST)

We are now in a position to formally introduce the method for the bioequivalence test in (2). Motivated by the good theoretical guarantees of the IUT, Westlake (1981) and Schuirmann (1981, 1987) proposed the so-called two one-sided tests (TOST), which has become one of the standard procedures for (2). As its name implies, TOST consists of two one-sided hypotheses

H_{01}:\mu_{T}-\mu_{R}\leq\theta_{L}\textrm{ versus }H_{a1}:\mu_{T}-\mu_{R}>\theta_{L} \qquad (5)

and

H_{02}:\mu_{T}-\mu_{R}\geq\theta_{U}\textrm{ versus }H_{a2}:\mu_{T}-\mu_{R}<\theta_{U}. \qquad (6)

The null hypothesis $H_{0}:\mu_{T}-\mu_{R}\leq\theta_{L}\textrm{ or }\mu_{T}-\mu_{R}\geq\theta_{U}$ in (2) can be expressed as the union of $H_{01}$ and $H_{02}$. The procedure establishes bioequivalence at significance level $\alpha$ if both $H_{01}$ and $H_{02}$ are rejected at level $\alpha$. The underlying rationale is simple: if one may conclude that $\mu_{T}-\mu_{R}>\theta_{L}$ and also $\mu_{T}-\mu_{R}<\theta_{U}$, then it has in effect been concluded that $\theta_{L}<\mu_{T}-\mu_{R}<\theta_{U}$. In practice, size-$\alpha$ Student's $t$-tests are used for (5) and (6), and $H_{0}$ is rejected if

T_{L}=\frac{(\bar{X}_{T}-\bar{X}_{R})-\theta_{L}}{\hat{\sigma}_{\Delta\bar{X}}}>t_{1-\alpha,r},\quad T_{U}=\frac{(\bar{X}_{T}-\bar{X}_{R})-\theta_{U}}{\hat{\sigma}_{\Delta\bar{X}}}<-t_{1-\alpha,r}, \qquad (7)

where $\bar{X}_{T},\bar{X}_{R}$ are the averages of the log-transformed PK data for the test and reference products, $\hat{\sigma}_{\Delta\bar{X}}=S_{p}\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}}$ is the standard error of $\bar{X}_{T}-\bar{X}_{R}$ with $S_{p}^{2}=\frac{1}{n_{T}+n_{R}-2}[(n_{T}-1)S_{T}^{2}+(n_{R}-1)S_{R}^{2}]$ and $S_{T},S_{R}$ the sample standard deviations of the two groups, and $t_{1-\alpha,r}$, defined by $\mathbb{P}(T\leq t_{1-\alpha,r})=1-\alpha$, is the critical value of the $t$ distribution with $r=n_{T}+n_{R}-2$ degrees of freedom. In addition, it should be emphasized that the size of each individual $t$-test of $H_{01},H_{02}$ is $\alpha$, not $\alpha/2$: no multiplicity adjustment is needed when testing the two one-sided null hypotheses for a univariate PK parameter. Since the size of each $t$-test is $\alpha$ and TOST is an IUT, Theorem 2 shows that the level of TOST is $\alpha$, that is, its size is at most $\alpha$. To prove that the size of TOST is exactly $\alpha$, we check the conditions of Theorem 3. First, consider the boundary point $\mu_{T}-\mu_{R}=\theta_{U}$; then we have

\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{2})=\mathbb{P}_{\theta_{U}}(\bm{X}\in R_{2})=\mathbb{P}_{\theta_{U}}(T_{U}<-t_{1-\alpha,r})=\alpha,

thus the first condition in Theorem 3 holds true. Furthermore,

\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{1})=\mathbb{P}_{\theta_{U}}(\bm{X}\in R_{1})=\mathbb{P}_{\theta_{U}}(T_{L}>t_{1-\alpha,r})\rightarrow 1,\textrm{ as }\sigma^{2}\rightarrow 0,

therefore, the second condition in Theorem 3 also holds. We summarize the above analysis in the following theorem.

Theorem 4.

Suppose the size of each individual test in TOST is $\alpha$; then the size of TOST equals $\alpha$ exactly.
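Theorem 4 can be checked by simulation: place the true difference on the boundary $\mu_{T}-\mu_{R}=\theta_{U}$ with a small $\sigma$, so that the lower one-sided test rejects with probability near one and the TOST rejection rate approaches $\alpha$. A minimal Monte Carlo sketch, with illustrative sample sizes and a table value for the $t$ critical point:

```python
import math
import random

random.seed(1)

# Monte Carlo check of Theorem 4 at the boundary mu_T - mu_R = theta_U.
alpha = 0.05
theta_L, theta_U = math.log(0.8), math.log(1.25)
nT = nR = 12
r = nT + nR - 2
t_crit = 1.7171          # t_{0.95,22}, from standard t tables
sigma = 0.01             # small within-group SD on the log scale
reps, rejections = 50_000, 0

for _ in range(reps):
    xt = [random.gauss(theta_U, sigma) for _ in range(nT)]  # mu_T - mu_R = theta_U
    xr = [random.gauss(0.0, sigma) for _ in range(nR)]
    mt, mr = sum(xt) / nT, sum(xr) / nR
    s2t = sum((v - mt) ** 2 for v in xt) / (nT - 1)
    s2r = sum((v - mr) ** 2 for v in xr) / (nR - 1)
    se = math.sqrt(((nT - 1) * s2t + (nR - 1) * s2r) / r) * math.sqrt(1 / nT + 1 / nR)
    TL = (mt - mr - theta_L) / se
    TU = (mt - mr - theta_U) / se
    if TL > t_crit and TU < -t_crit:   # TOST rejection rule (7)
        rejections += 1

rate = rejections / reps
print(f"empirical size at the boundary: {rate:.4f} (nominal {alpha})")
```

The empirical rate sits near $\alpha=0.05$, not $\alpha/2$ or $2\alpha$, consistent with the theorem.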

To compute the power of TOST, Phillips (1990) derived its explicit form as a function of $(\mu_{T},\mu_{R},n_{T},n_{R},\sigma^{2},\theta_{L},\theta_{U})$:

\textrm{Power}\left(\mu_{T},\mu_{R},n_{T},n_{R},\sigma^{2},\theta_{L},\theta_{U}\right)=Q_{r}\left(-t_{1-\alpha,r},\frac{\mu_{T}-\mu_{R}-\theta_{U}}{\sigma\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}}};0,\frac{\left(\theta_{U}-\theta_{L}\right)\sqrt{r}}{2\sigma\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}}\,t_{1-\alpha,r}}\right)-Q_{r}\left(t_{1-\alpha,r},\frac{\mu_{T}-\mu_{R}-\theta_{L}}{\sigma\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}}};0,\frac{\left(\theta_{U}-\theta_{L}\right)\sqrt{r}}{2\sigma\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}}\,t_{1-\alpha,r}}\right)

where

Q_{v}(t,\delta;a,b)=\frac{\sqrt{2\pi}}{\Gamma\left(\frac{v}{2}\right)\cdot 2^{\frac{v-2}{2}}}\int_{a}^{b}\Phi\left(\frac{tx}{\sqrt{v}}-\delta\right)x^{v-1}\phi(x)\,dx,\qquad r=n_{T}+n_{R}-2,

is referred to as Owen's $Q$ function, and $\Phi(\cdot),\phi(\cdot)$ are the cdf and pdf of the standard normal distribution, respectively.
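Owen's $Q$ function has no closed form, but the integral above can be evaluated numerically. The following sketch (stdlib-only Simpson's rule; the inputs are illustrative choices, and the $t$ critical value is taken from standard tables) computes the TOST power for equal true means:

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def owen_q(v, t, delta, a, b, steps=4000):
    """Owen's Q function, evaluated by Simpson's rule."""
    const = math.sqrt(2 * math.pi) / (math.gamma(v / 2) * 2 ** ((v - 2) / 2))
    def f(x):
        return Phi(t * x / math.sqrt(v) - delta) * x ** (v - 1) * phi(x)
    h = (b - a) / steps
    s = f(a) + f(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return const * s * h / 3

def tost_power(diff, nT, nR, sigma, theta_L, theta_U, t_crit):
    """Phillips' power formula for TOST, via two Owen's Q evaluations."""
    r = nT + nR - 2
    c = sigma * math.sqrt(1 / nT + 1 / nR)
    delta_U = (diff - theta_U) / c
    delta_L = (diff - theta_L) / c
    b = (theta_U - theta_L) * math.sqrt(r) / (2 * c * t_crit)
    return (owen_q(r, -t_crit, delta_U, 0, b)
            - owen_q(r, t_crit, delta_L, 0, b))

# Illustrative inputs (not from the paper): equal true means, sigma = 0.25
# on the log scale, 12 subjects per arm, alpha = 0.05.
t_crit = 1.7171  # t_{0.95,22}, from standard t tables
power = tost_power(0.0, 12, 12, 0.25, math.log(0.8), math.log(1.25), t_crit)
print(f"TOST power: {power:.4f}")
```

The result can be cross-checked against a Monte Carlo simulation of the TOST decision rule; the two should agree up to simulation error.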

In the previous sections, the theoretical foundations of the TOST method for bioequivalence testing have been established. However, the relationship between the TOST and the $100(1-\alpha)\%$ and $100(1-2\alpha)\%$ confidence intervals is not yet clear. In the next section, we explore the connections between the TOST and these two types of confidence intervals, shedding light on their similarities, differences, and implications for bioequivalence testing. Understanding these connections provides insight into how the methods relate to one another and how they can be effectively applied in bioequivalence testing for univariate pharmacokinetic parameters, contributing to more accurate and reliable conclusions about bioequivalence.

2.3 Confidence Sets

Intuitively, a hypothesis test is associated with an equivalent confidence interval approach. Traditionally, bioequivalence is claimed if the $100(1-2\alpha)\%$ two-sided confidence interval of the geometric mean ratio, or equivalently, of the arithmetic mean difference of the log-transformed data, falls within the BE limits. The use of a $100(1-2\alpha)\%$ confidence interval for bioequivalence testing, rather than a $100(1-\alpha)\%$ confidence interval, has been a subject of debate among statisticians. The relationship between the size-$\alpha$ TOST procedure and the $100(1-2\alpha)\%$ confidence interval approach has been questioned by researchers who argue that the similarity rests more on an algebraic coincidence than on a true statistical equivalence. Brown et al. (1995) suggested that the association between the TOST procedure and the $100(1-2\alpha)\%$ confidence interval approach may be somewhat of a fiction, while Berger and Hsu (1996) pointed out that using a $100(1-2\alpha)\%$ confidence interval for bioequivalence testing can be conservative and may only work in specific cases. Other discussions of the $100(1-2\alpha)\%$ confidence interval procedure in bioequivalence testing can be found in Westlake (1976, 1981), Rani et al. (2004), and Choi et al. (2008). These discussions highlight the ongoing debate about the appropriateness of the $100(1-2\alpha)\%$ confidence interval for bioequivalence testing, and the importance of understanding the underlying statistical concepts and assumptions when choosing a method for a given study.

2.3.1 $100(1-\alpha)\%$ Confidence Interval

As mentioned before, there are many different formulations of the bioequivalence hypothesis that lead to alternative tests and confidence intervals. In this section, we discuss a $100(1-\alpha)\%$ confidence interval approach that corresponds exactly to a size-$\alpha$ TOST. It is well known that there is a close relationship between a level-$\alpha$ hypothesis test and a $100(1-\alpha)\%$ confidence set: reject $H_{0}$ if and only if the intersection of the $100(1-\alpha)\%$ confidence set and the null hypothesis parameter space is empty. We summarize this property in the following theorem.

Theorem 5.

Let $\Theta$ be the parameter space. For each $\theta_{0}\in\Theta$, let $T_{\theta_{0}}$ be a test statistic for $H_{0}:\theta=\theta_{0}$ with significance level $\alpha$ and acceptance region $A(\theta_{0})$. Then the set $C(\bm{X})=\{\theta:\bm{X}\in A(\theta)\}$ is a $100(1-\alpha)\%$ confidence set for $\theta$.

In other words, suppose $[L(\bm{X}),U(\bm{X})]$ is a confidence interval for a parameter $\theta\in\Theta$ with confidence coefficient $1-\alpha$, that is, $\inf_{\theta\in\Theta}\mathbb{P}_{\theta}(\theta\in[L(\bm{X}),U(\bm{X})])=1-\alpha$. Consider testing $H_{0}:\theta\in\Theta_{0}$ versus $H_{a}:\theta\in\Theta_{0}^{c}$. Then, from Theorem 5, the test that rejects $H_{0}$ if and only if $[L(\bm{X}),U(\bm{X})]\bigcap\Theta_{0}=\emptyset$ is a level-$\alpha$ test. In this section, we apply Theorem 5 to show that a size-$\alpha$ TOST is associated with a $100(1-\alpha)\%$ confidence interval.

Theorem 6.

Consider the hypothesis test $H_{0}:\mu_{T}-\mu_{R}\leq\theta_{L}\textrm{ or }\mu_{T}-\mu_{R}\geq\theta_{U}$ versus $H_{a}:\theta_{L}<\mu_{T}-\mu_{R}<\theta_{U}$ in (2) and define the following confidence interval for $\mu_{T}-\mu_{R}$:

[L(\bm{X}),U(\bm{X})]=\left[\min\{0,(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\},\max\{0,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\right], \qquad (8)

where

\hat{\sigma}_{\Delta\bar{X}}=S_{p}\sqrt{\frac{1}{n_{T}}+\frac{1}{n_{R}}},\quad S_{p}=\sqrt{\frac{(n_{T}-1)S_{X_{T}}^{2}+(n_{R}-1)S_{X_{R}}^{2}}{n_{T}+n_{R}-2}},\quad r=n_{T}+n_{R}-2, \qquad (9)

and $S_{X_{T}},S_{X_{R}}$ are the sample standard deviations of the two groups. Then the following two statements hold true:

  (1) If $\mu_{T}-\mu_{R}\neq 0$, then the coverage probability of $[L(\bm{X}),U(\bm{X})]$ is $100(1-\alpha)\%$; otherwise, the coverage probability equals 1.

  (2) $[L(\bm{X}),U(\bm{X})]$ is associated with the size-$\alpha$ TOST for (2).

Following Theorem 6, we can make the following remarks.

Remark 2.
  (a) From (1) in Theorem 6, we know $[L(\bm{X}),U(\bm{X})]$ is a $100(1-\alpha)\%$ confidence interval.

  (b) If the $100(1-2\alpha)\%$ confidence interval introduced in the next section contains zero, then $[L(\bm{X}),U(\bm{X})]$ coincides with it.

  (c) From a Bayesian point of view, if the prior distribution of $\mu_{T}-\mu_{R}$ is noninformative, that is, $\pi(\mu_{T}-\mu_{R})\propto 1$, then the posterior credible probability of $[L(\bm{X}),U(\bm{X})]$ is exactly $(1-2\alpha)$ for $-t_{1-\alpha,r}\leq\mu_{T}-\mu_{R}\leq t_{1-\alpha,r}$ and converges to $(1-\alpha)$ as $|\mu_{T}-\mu_{R}|\rightarrow\infty$.
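A small sketch of the interval in (8), using made-up summary statistics, illustrates statement (2): since $[L(\bm{X}),U(\bm{X})]$ always contains zero and zero lies inside $(\theta_{L},\theta_{U})$, the interval falls within the BE limits exactly when the ordinary $90\%$ interval does, i.e., exactly when the size-$\alpha$ TOST rejects:

```python
import math

# Interval (8): the ordinary 90% interval, stretched to include zero.
# The summary statistics below are illustrative, not from a real study.
t_crit = 1.7171                      # t_{0.95,22}, from standard t tables
diff, se = 0.08, 0.05                # observed mean difference and its SE
theta_L, theta_U = math.log(0.8), math.log(1.25)

lo90, up90 = diff - t_crit * se, diff + t_crit * se
L, U = min(0.0, lo90), max(0.0, up90)        # the 100(1-alpha)% interval (8)

# Decision equivalence: theta_L < L iff theta_L < lo90, and U < theta_U
# iff up90 < theta_U, which are exactly the two TOST rejection conditions.
reject_tost = (diff - theta_L) / se > t_crit and (diff - theta_U) / se < -t_crit
reject_ci = theta_L < L and U < theta_U
print(f"[L, U] = [{L:.4f}, {U:.4f}], TOST rejects: {reject_tost}, CI rejects: {reject_ci}")
```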

2.3.2 $100(1-2\alpha)\%$ Confidence Interval

Following the guidance of the FDA, in practice, bioequivalence is claimed if the $100(1-2\alpha)\%$ two-sided confidence interval of the geometric mean ratio falls within the BE limits. Unfortunately, from Theorem 5 we can only see the connection between the TOST and a $100(1-\alpha)\%$ confidence interval; the reason why a $100(1-2\alpha)\%$ confidence interval yields a size-$\alpha$ test is still unclear. In this section, owing to the “equal-tail” property, we will see that this is not merely an “algebraic coincidence” but is theoretically guaranteed. In general, the size of the test associated with a $100(1-2\alpha)\%$ confidence interval is not $\alpha$. For example, consider the following $100(1-2\alpha)\%$ confidence interval:

[L(\bm{X}),U(\bm{X})]=\left[(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha_{1},r}\hat{\sigma}_{\Delta\bar{X}},(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha_{2},r}\hat{\sigma}_{\Delta\bar{X}}\right], \qquad (10)

where $\alpha_{1}+\alpha_{2}=2\alpha$. It is obvious that $[L(\bm{X}),U(\bm{X})]$ defined above is a $100(1-2\alpha)\%$ confidence interval. As $\alpha_{1}\rightarrow 0$, the confidence interval reduces to $\left(-\infty,(\bar{X}_{T}-\bar{X}_{R})+t_{1-2\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right]$, and the size of the test associated with this confidence interval is $2\alpha$, not $\alpha$. In fact, the lower confidence interval $[L(\bm{X}),\infty)=[(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha_{1},r}\hat{\sigma}_{\Delta\bar{X}},\infty)$ in (10) defines a size-$\alpha_{1}$ test, and similarly, the upper confidence interval $(-\infty,U(\bm{X})]=(-\infty,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha_{2},r}\hat{\sigma}_{\Delta\bar{X}}]$ in (10) defines a size-$\alpha_{2}$ test. From the IUT theory in Theorem 2 and Theorem 3, we know that the size of the test defined by $[L(\bm{X}),U(\bm{X})]$ in (10) is $\max\{\alpha_{1},\alpha_{2}\}$, which equals $\alpha$ only if $\alpha_{1}=\alpha_{2}$. We summarize the above analysis in the following theorem.

Theorem 7.

Let 𝐗\bm{X} be a random sample from a probability distribution with parameter θ\theta. Suppose [L(𝐗),)[L(\bm{X}),\infty) is a 100(1α1)%100(1-\alpha_{1})\% lower confidence interval for θ\theta and (,U(𝐗)](-\infty,U(\bm{X})] is a 100(1α2)%100(1-\alpha_{2})\% upper confidence interval for θ\theta. Then [L(𝐗),U(𝐗)][L(\bm{X}),U(\bm{X})] is a 100(1α1α2)%100(1-\alpha_{1}-\alpha_{2})\% confidence interval for θ\theta.

Remark 3.

The above theorem reveals the importance of the “equal-tail” property, that is, α1=α2\alpha_{1}=\alpha_{2}; without it, TOST does not yield a size-α\alpha test. In other words, the total uncertainty of 2α2\alpha should be spent half above and half below the observed mean. Moreover, this theorem also provides another reason why θL,θU\theta_{L},\theta_{U} in (2) should be symmetric about zero. Otherwise, for example, if θL+θU<0\theta_{L}+\theta_{U}<0, it is reasonable to put more weight on α1\alpha_{1} than on α2\alpha_{2}, in which case the size of the 100(12α)%100(1-2\alpha)\% procedure is no longer α\alpha.

2.4 More Powerful Tests

From the previous analysis, we know that there are various procedures for testing (2), and a natural question arises: which one is better? Unfortunately, even though the FDA suggests using the 100(12α)%100(1-2\alpha)\% confidence interval to test bioequivalence, this procedure is not the best. In fact, it should be remarked that TOST has been criticized by many authors: on the one hand, TOST yields a biased test; on the other hand, its power is quite low. In practice, we would like a hypothesis test to be more likely to reject H0H_{0} if θΘ0c\theta\in\Theta_{0}^{c} than if θΘ0\theta\in\Theta_{0}, that is, β(θ2)β(θ1)\beta(\theta_{2})\geq\beta(\theta_{1}) for every θ1Θ0\theta_{1}\in\Theta_{0} and θ2Θ0c\theta_{2}\in\Theta_{0}^{c}. The following definition formalizes this property.

Definition 4.

Let α\alpha be the significance level. A hypothesis test H0:θΘ0 versus Ha:θΘ0cH_{0}:\theta\in\Theta_{0}\textrm{ versus }H_{a}:\theta\in\Theta_{0}^{c} with power function β()\beta(\cdot) is said to be unbiased if and only if

β(θ)α for θΘ0 and β(θ)α for θΘ0c.\beta(\theta)\leq\alpha\textrm{ for }\theta\in\Theta_{0}\textrm{ and }\beta(\theta)\geq\alpha\textrm{ for }\theta\in\Theta_{0}^{c}.

Unfortunately, even using the 100(12α)%100(1-2\alpha)\% confidence interval approach, TOST is still a biased test. Furthermore, it has also been shown that TOST is conservative and inefficient in the asymptotic setup. Much attention has been paid to the equivalence testing problem in (2) for decades; to mention but a few, Anderson and Hauck, (1983), Brown et al., (1997), Martín Andrés and Herranz Tejedor, (2004), Choi et al., (2008), Fogarty and Small, (2014), Pesarin et al., (2016); see Wellek, (2002) and Meyners, (2012) for comprehensive reviews. Even though many methodologies have been proposed to test bioequivalence, the theoretical results are limited. Except for some special models, such as the normal distribution with known variance, no finite-sample optimality theory is available for tests of equivalence. To the best of our knowledge, the work by Romano, (2005) is the first to establish an asymptotic optimality theory. Since optimality theory is beyond the scope of this paper, we only present two examples to give the reader some intuition about optimal testing of equivalence.
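The bias of TOST can be made concrete in the simplest setting. For a normal mean difference with known standard error, TOST rejects exactly when the estimate falls in (θL+zse,θUzse)(\theta_L + z\,\mathrm{se},\,\theta_U - z\,\mathrm{se}), so its power has a closed form. The sketch below (an illustration, not the authors' computation; limits ±0.223 and the standard errors are arbitrary choices) shows that for a large standard error the power at θ=0\theta=0, the most favorable point of HaH_{a}, falls below α\alpha:

```python
import numpy as np
from scipy import stats

def tost_power(theta, se, theta_U=0.223, alpha=0.05):
    """Exact TOST power for a normal mean difference with known standard
    error: reject iff -theta_U + z*se < Xbar < theta_U - z*se."""
    z = stats.norm.ppf(1 - alpha)
    hi = (theta_U - theta) / se - z
    lo = (-theta_U - theta) / se + z
    return max(stats.norm.cdf(hi) - stats.norm.cdf(lo), 0.0)

print(tost_power(0.0, se=0.05))  # small se: power near 1 at the center of Ha
print(tost_power(0.0, se=0.30))  # large se: rejection region is empty, power 0 < alpha
```

When zseθUz\,\mathrm{se}\geq\theta_U the rejection interval is empty, so the power is identically zero on HaH_{a}, strictly below α\alpha; this is the bias and conservatism referred to above.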

Example 1 (Normal Distribution with Known Variance).

Suppose X1,,XnX_{1},\ldots,X_{n} are i.i.d. 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}), where σ\sigma is known. Consider the hypothesis testing problem H0:|μ|θ versus Ha:|μ|<θH_{0}:|\mu|\geq\theta\textrm{ versus }H_{a}:|\mu|<\theta. Then the uniformly most powerful level-α\alpha test rejects H0H_{0} if n1/2|X¯n|ψ(α,n1/2θ,σ)n^{1/2}|\bar{X}_{n}|\leq\psi(\alpha,n^{1/2}\theta,\sigma), where ψ(α,θ,σ)\psi(\alpha,\theta,\sigma) satisfies

Φ(ψθσ)Φ(ψθσ)=α,\displaystyle\Phi\left(\frac{\psi-\theta}{\sigma}\right)-\Phi\left(\frac{-\psi-\theta}{\sigma}\right)=\alpha,

and Φ()\Phi(\cdot) is the cdf of the standard normal distribution.
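The critical value ψ\psi has no closed form, but the defining equation is strictly increasing in ψ\psi, so it is easily solved by a bracketing root finder. A minimal sketch (the numeric inputs are illustrative):

```python
from scipy import stats, optimize

def psi(alpha, theta, sigma):
    """Solve Phi((psi - theta)/sigma) - Phi((-psi - theta)/sigma) = alpha
    for psi >= 0. The left side rises from -alpha at psi = 0 toward
    1 - alpha, so a unique root lies in the bracket below."""
    f = lambda p: (stats.norm.cdf((p - theta) / sigma)
                   - stats.norm.cdf((-p - theta) / sigma) - alpha)
    return optimize.brentq(f, 0.0, theta + 10.0 * sigma)

# reject H0: |mu| >= theta when sqrt(n)*|Xbar| <= psi(alpha, sqrt(n)*theta, sigma)
c = psi(0.05, 2.0, 1.0)
print(c)  # the rejection cutoff for sqrt(n)*|Xbar|
```

Substituting the returned cutoff back into the defining equation recovers α\alpha to numerical precision, confirming the size of the test.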

However, the uniformly most powerful test does not exist if σ\sigma is unknown. The next example generalizes the result to the one-parameter exponential family.

Example 2 (One-parameter Exponential Family).

Suppose X1,,XnX_{1},\ldots,X_{n} are i.i.d. generated from the following one-parameter exponential family

fθ(x)=exp{η(θ)Y(x)ξ(θ)}h(x),\displaystyle f_{\theta}(x)=\exp\{\eta(\theta)Y(x)-\xi(\theta)\}h(x),

where η(θ)\eta(\theta) is an increasing function of θ\theta only, h(x)h(x) is a function of xx only, and the family has a monotone likelihood ratio in Y(X1,,Xn)Y(X_{1},\ldots,X_{n}). Then the uniformly most powerful level-α\alpha test for the equivalence hypothesis H0:θθ1 or θθ2H_{0}:\theta\leq\theta_{1}\textrm{ or }\theta\geq\theta_{2} versus H1:θ1<θ<θ2H_{1}:\theta_{1}<\theta<\theta_{2} is

T(X1,,Xn)={1if c1<Y(X1,,Xn)<c2γiif Y(X1,,Xn)=ci,i=1,20if Y(X1,,Xn)<c1 or Y(X1,,Xn)>c2,T(X_{1},\ldots,X_{n})=\begin{cases}1&\text{if }c_{1}<Y(X_{1},\ldots,X_{n})<c_{2}\\ \gamma_{i}&\text{if }Y(X_{1},\ldots,X_{n})=c_{i},i=1,2\\ 0&\text{if }Y(X_{1},\ldots,X_{n})<c_{1}\text{ or }Y(X_{1},\ldots,X_{n})>c_{2},\end{cases}

where γ1,γ2,c1,c2\gamma_{1},\gamma_{2},c_{1},c_{2} are determined by β(θ1)=β(θ2)=α\beta(\theta_{1})=\beta(\theta_{2})=\alpha.
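For a concrete continuous case, where the distribution of YY has no atoms and the randomization constants γi\gamma_{i} are irrelevant, take XiX_{i} i.i.d. exponential with rate θ\theta, so that S=i=1nXiS=\sum_{i=1}^{n}X_{i} is sufficient and SGamma(n,rate θ)S\sim\textrm{Gamma}(n,\textrm{rate }\theta). The sketch below (an illustration, not from the paper; the sample size and boundary rates are arbitrary) solves β(θ1)=β(θ2)=α\beta(\theta_{1})=\beta(\theta_{2})=\alpha numerically for the cutoffs:

```python
import numpy as np
from scipy import stats, optimize

def ump_cutoffs(n, theta1, theta2, alpha=0.05):
    """Find c1 < c2 with P_theta(c1 < S < c2) = alpha at both boundary
    rates theta1 < theta2, where S = sum(X_i) ~ Gamma(n, rate=theta)."""
    def power(c, theta):
        g = stats.gamma(a=n, scale=1.0 / theta)   # scipy uses scale = 1/rate
        return g.cdf(c[1]) - g.cdf(c[0])
    mid = np.sqrt((n / theta1) * (n / theta2))    # between the boundary means
    sol = optimize.fsolve(lambda c: [power(c, theta1) - alpha,
                                     power(c, theta2) - alpha],
                          x0=[0.9 * mid, 1.1 * mid])
    return sol

c1, c2 = ump_cutoffs(n=10, theta1=0.5, theta2=2.0)
print(c1, c2)  # reject H0 when c1 < sum(X_i) < c2
```

The two boundary-power equations pin down (c1,c2)(c_1,c_2), and plugging the solution back into the Gamma cdf recovers α\alpha at both θ1\theta_{1} and θ2\theta_{2}.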

3 Conclusion

The paper discusses the significance of bioequivalence studies, which compare the biological equivalence of two different formulations of a drug. These studies are conducted in clinical settings with human subjects to compare the pharmacokinetic profiles of the two formulations. Acceptance criteria for bioequivalence are generally based on statistical methods that evaluate the similarities between the two formulations. The paper explains the theory of intersection-union tests and how they can be used to determine bioequivalence. The TOST (two one-sided tests) approach is the standard approach to bioequivalence testing and has been shown to maintain a size-α\alpha test by rejecting the null hypothesis only if both one-sided hypotheses are rejected. The paper provides insights into the theory behind bioequivalence testing, including the use of geometric mean, the non-symmetry of bioequivalence limits around 1, and the product of bioequivalence limits equaling 1. The paper also discusses the connection between the 100(12α)%100(1-2\alpha)\% confidence interval approach and 100(1α)%100(1-\alpha)\% confidence interval approach.

References

  • Anderson and Hauck, (1983) Anderson, S. and Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics-Theory and Methods, 12(23):2663–2692.
  • Berger and Hsu, (1996) Berger, R. L. and Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4):283–319.
  • Brown et al., (1995) Brown, L. D., Casella, G., and Gene Hwang, J. (1995). Optimal confidence sets, bioequivalence, and the limacon of pascal. Journal of the American Statistical Association, 90(431):880–889.
  • Brown et al., (1997) Brown, L. D., Hwang, J. G., and Munk, A. (1997). An unbiased test for the bioequivalence problem. The Annals of Statistics, 25(6):2345–2367.
  • Choi et al., (2008) Choi, L., Caffo, B., and Rohde, C. (2008). A survey of the likelihood approach to bioequivalence trials. Statistics in Medicine, 27(24):4874–4894.
  • Fogarty and Small, (2014) Fogarty, C. B. and Small, D. S. (2014). Equivalence testing for functional data with an application to comparing pulmonary function devices. The Annals of Applied Statistics, pages 2002–2026.
  • Martín Andrés and Herranz Tejedor, (2004) Martín Andrés, A. and Herranz Tejedor, I. (2004). Asymptotical tests on the equivalence, substantial difference and non-inferiority problems with two proportions. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 46(3):305–319.
  • Meyners, (2012) Meyners, M. (2012). Equivalence tests–a review. Food Quality and Preference, 26(2):231–245.
  • Pesarin et al., (2016) Pesarin, F., Salmaso, L., Carrozzo, E., and Arboretti, R. (2016). Union–intersection permutation solution for two-sample equivalence testing. Statistics and Computing, 26(3):693–701.
  • Phillips, (1990) Phillips, K. F. (1990). Power of the two one-sided tests procedure in bioequivalence. Journal of Pharmacokinetics and Biopharmaceutics, 18(2):137–144.
  • Rani et al., (2004) Rani, S., Pargal, A., et al. (2004). Bioequivalence: An overview of statistical concepts. Indian Journal of Pharmacology, 36:209–216.
  • Romano, (2005) Romano, J. P. (2005). Optimal testing of equivalence hypotheses. The Annals of Statistics, 33(3):1036–1047.
  • Schuirmann, (1981) Schuirmann, D. (1981). On hypothesis-testing to determine if the mean of a normal-distribution is contained in a known interval. Biometrics, 37:617–617.
  • Schuirmann, (1987) Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680.
  • Shen et al., (2006) Shen, H., Brown, L. D., and Zhi, H. (2006). Efficient estimation of log-normal means with application to pharmacokinetic data. Statistics in Medicine, 25(17):3023–3038.
  • Shen et al., (2017) Shen, M., Russek-Cohen, E., and Slud, E. V. (2017). Checking distributional assumptions for pharmacokinetic summary statistics based on simulations with compartmental models. Journal of Biopharmaceutical Statistics, 27(5):756–772.
  • Wellek, (2002) Wellek, S. (2002). Testing statistical hypotheses of equivalence. Chapman and Hall/CRC.
  • Westlake, (1981) Westlake, W. (1981). Bioequivalence testing–a need to rethink. Biometrics, 37(3):589–594.
  • Westlake, (1976) Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, pages 741–744.

Appendix A Proof of Theorem 1

Proof.

The proof of (1) can be accomplished by direct calculations. It can be shown that

exp(X¯)\displaystyle\exp(\bar{X}) =exp(n1i=1nXi)=exp{log[exp(i=1nXi/n)]}=exp{log[i=1nexp(Xi/n)]}\displaystyle=\exp(n^{-1}\sum_{i=1}^{n}X_{i})=\exp\{\log[\exp(\sum_{i=1}^{n}X_{i}/n)]\}=\exp\{\log[\prod_{i=1}^{n}\exp(X_{i}/n)]\}
=(i=1nexp(Xi))1n=(i=1nXi)1n=GM(X).\displaystyle=(\prod_{i=1}^{n}\exp(X_{i}))^{\frac{1}{n}}=(\prod_{i=1}^{n}X_{i}^{*})^{\frac{1}{n}}=\textrm{GM}(X^{*}).
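This identity is easy to verify numerically; a quick sketch with simulated lognormal data (the parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.lognormal(mean=0.3, sigma=0.8, size=50)  # X* = exp(X), X normal
x = np.log(x_star)                                     # the log-scale sample X

gm = np.prod(x_star) ** (1.0 / x_star.size)            # GM(X*)
print(np.isclose(np.exp(x.mean()), gm))                # exp(Xbar) = GM(X*): prints True
```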

To prove (2), we let FX(x)=12[1+erf(log(x)μ2σ)]F_{X^{*}}(x^{*})=\frac{1}{2}\left[1+\textrm{erf}\left(\frac{\log(x^{*})-\mu}{\sqrt{2}\sigma}\right)\right] be the cumulative distribution function of the lognormal distribution, where erf(x)=2π0xexp(t2)𝑑t\textrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x}\exp(-t^{2})dt is the error function. For simplicity, we let the median of XX^{*} equal ϱ\varrho, that is, Median(X)=ϱ\textit{Median}(X^{*})=\varrho. Then, from the definition of the median, we have FX(ϱ)=0.5F_{X^{*}}(\varrho)=0.5, thus ϱ=FX1(0.5)\varrho=F_{X^{*}}^{-1}(0.5). Therefore, it follows that

ϱ=FX1(0.5)=exp[σ2erf1(2p1)+μ]|p=0.5,\varrho=F_{X^{*}}^{-1}(0.5)=\left.\exp\left[\sigma\sqrt{2}\cdot\operatorname{erf}^{-1}(2p-1)+\mu\right]\right|_{p=0.5},

where erf1()\operatorname{erf}^{-1}(\cdot) is the inverse of the error function satisfying erf1(0)=0\operatorname{erf}^{-1}(0)=0; thus, the median of XX^{*} equals exp(μ)\exp(\mu). In fact, it can also be seen that the logarithm is a monotonically increasing function; therefore, the median of XX^{*} is the exponential of the median of log(X)\log(X^{*}). Since log(X)\log(X^{*}) is normally distributed, the median of XX^{*} equals exp(μ)\exp(\mu).

Let f(x)=1xσ2πexp((ln(x)μ)22σ2)f(x^{*})=\frac{1}{x^{*}\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln(x^{*})-\mu)^{2}}{2\sigma^{2}}\right) be the pdf of lognormal distribution. Consider the expected value of Xi1nX_{i}^{*\frac{1}{n}}, 𝔼[Xi1n]\mathbb{E}[X_{i}^{*\frac{1}{n}}]. It follows that

𝔼[Xi1n]\displaystyle\mathbb{E}[X_{i}^{*\frac{1}{n}}] =01xσ2πx1nexp((ln(x)μ)22σ2)𝑑x\displaystyle=\int_{0}^{\infty}\frac{1}{x^{*}\sigma\sqrt{2\pi}}x^{*\frac{1}{n}}\exp\left(-\frac{(\ln(x^{*})-\mu)^{2}}{2\sigma^{2}}\right)dx^{*}
=1σ2π1exp(t)exp(t)1/nexp((tμ)22σ2)exp(t)dt\displaystyle=\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\frac{1}{\exp(t)}\exp(t)^{1/n}\exp\left(-\frac{(t-\mu)^{2}}{2\sigma^{2}}\right)\exp(t)dt (11)
=1σ2πexp({(tμ)2n2σ2t}2nσ2)𝑑t\displaystyle=\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left\{(t-\mu)^{2}n-2\sigma^{2}t\right\}}{2n\sigma^{2}}\right)dt
=1σ2πexp({n(tμσ2n)2σ4n2σ2μ}2nσ2)𝑑t\displaystyle=\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left\{n\left(t-\mu-\frac{\sigma^{2}}{n}\right)^{2}-\frac{\sigma^{4}}{n}-2\sigma^{2}\mu\right\}}{2n\sigma^{2}}\right)dt
=1σ2πexp(n(tμσ2n)22nσ2)exp(σ4n+2μσ22nσ2)𝑑t\displaystyle=\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{n\left(t-\mu-\frac{\sigma^{2}}{n}\right)^{2}}{2n\sigma^{2}}\right)\exp\left(\frac{\frac{\sigma^{4}}{n}+2\mu\sigma^{2}}{2n\sigma^{2}}\right)dt
=exp(μn+σ22n2)1σ2πexp((tμσ2n)22σ2)𝑑t,\displaystyle=\exp\left(\frac{\mu}{n}+\frac{\sigma^{2}}{2n^{2}}\right)\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(t-\mu-\frac{\sigma^{2}}{n}\right)^{2}}{2\sigma^{2}}\right)dt, (12)

where (11) uses the change of variables x=exp(t)x^{*}=\exp(t). Note that 1σ2πexp((tμσ2n)22σ2)\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(t-\mu-\frac{\sigma^{2}}{n}\right)^{2}}{2\sigma^{2}}\right) in (12) is the pdf of a normal distribution with mean μ+σ2n\mu+\frac{\sigma^{2}}{n} and variance σ2\sigma^{2}, thus 1σ2πexp((tμσ2n)22σ2)𝑑t=1\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(t-\mu-\frac{\sigma^{2}}{n}\right)^{2}}{2\sigma^{2}}\right)dt=1, and therefore, 𝔼[Xi1n]=exp(μn+σ22n2)\mathbb{E}[X_{i}^{*\frac{1}{n}}]=\exp\left(\frac{\mu}{n}+\frac{\sigma^{2}}{2n^{2}}\right).

The result of (3) follows from

𝔼[GM(X)]=𝔼[(i=1nXi)1n]=i=1n𝔼[Xi1n]=i=1nexp(μn+σ22n2)=exp(μ+σ22n),\displaystyle\mathbb{E}[\textrm{GM}(X^{*})]=\mathbb{E}[(\prod_{i=1}^{n}X^{*}_{i})^{\frac{1}{n}}]=\prod_{i=1}^{n}\mathbb{E}[X^{*\frac{1}{n}}_{i}]=\prod_{i=1}^{n}\exp\left(\frac{\mu}{n}+\frac{\sigma^{2}}{2n^{2}}\right)=\exp\left(\mu+\frac{\sigma^{2}}{2n}\right),

where the second equality follows from the independence of X1,,XnX^{*}_{1},\ldots,X^{*}_{n}. ∎
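The closed form for the expected geometric mean can be checked by Monte Carlo; a minimal sketch (the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.2, 0.5, 8, 200_000

x = rng.lognormal(mean=mu, sigma=sigma, size=(reps, n))
gm = np.exp(np.log(x).mean(axis=1))        # GM of each simulated sample of size n
print(gm.mean())                           # Monte Carlo estimate of E[GM(X*)]
print(np.exp(mu + sigma**2 / (2 * n)))     # the closed form exp(mu + sigma^2/(2n))
```

The two printed values agree to Monte Carlo accuracy, confirming that the geometric mean is a biased estimator of the median exp(μ)\exp(\mu) for finite nn, with the bias vanishing as nn\rightarrow\infty.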

Appendix B Proof of Theorem 2

Proof.

Let θi=1kΘi\theta\in\bigcup_{i=1}^{k}\Theta_{i}. Then θΘi\theta\in\Theta_{i}, for some i=1,,ki=1,\ldots,k and

θ(𝑿R)=θ(𝑿i=1kRi)θ(𝑿Ri)αisupi=1,,kαi=α.\displaystyle\mathbb{P}_{\theta}(\bm{X}\in R)=\mathbb{P}_{\theta}(\bm{X}\in\bigcap_{i=1}^{k}R_{i})\leq\mathbb{P}_{\theta}(\bm{X}\in R_{i})\leq\alpha_{i}\leq\sup_{i=1,\ldots,k}\alpha_{i}=\alpha.

Since the above inequality holds for arbitrary θi=1kΘi\theta\in\bigcup_{i=1}^{k}\Theta_{i}, the defined IUT is a level-α\alpha test. ∎

Appendix C Proof of Theorem 3

Proof.

By Theorem 2, we know rejection region R=j=1kRjR=\bigcap_{j=1}^{k}R_{j} yields a level-α\alpha test, that is

supθj=1kΘjθ(𝑿R)α.\sup_{\theta\in\bigcup_{j=1}^{k}\Theta_{j}}\mathbb{P}_{\theta}(\bm{X}\in R)\leq\alpha. (13)

Next, we are going to show that supθj=1kΘjθ(𝑿R)α\sup_{\theta\in\bigcup_{j=1}^{k}\Theta_{j}}\mathbb{P}_{\theta}(\bm{X}\in R)\geq\alpha. Because θlj=1kΘj\theta_{l}\in\bigcup_{j=1}^{k}\Theta_{j} for every ll, we have

supθj=1kΘjθ(𝑿R)\displaystyle\sup_{\theta\in\bigcup_{j=1}^{k}\Theta_{j}}\mathbb{P}_{\theta}(\bm{X}\in R) limlθl(𝑿R)\displaystyle\geq\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R)
=limlθl(𝑿j=1kRj)\displaystyle=\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in\bigcap_{j=1}^{k}R_{j})
limlj=1kθl(𝑿Rj)k+1,\displaystyle\geq\lim_{l\rightarrow\infty}\sum_{j=1}^{k}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{j})-k+1,

where the last inequality follows from Bonferroni’s inequality. By conditions (1) and (2), we get limlj=1kθl(𝑿Rj)=k1+α\lim_{l\rightarrow\infty}\sum_{j=1}^{k}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{j})=k-1+\alpha. Thus,

supθj=1kΘjθ(𝑿R)k1+αk+1=α.\sup_{\theta\in\bigcup_{j=1}^{k}\Theta_{j}}\mathbb{P}_{\theta}(\bm{X}\in R)\geq k-1+\alpha-k+1=\alpha. (14)

Combining (13) and (14), we conclude that the size of the IUT with rejection region R=j=1kRjR=\bigcap_{j=1}^{k}R_{j} is exactly α\alpha. ∎

Appendix D Proof of Theorem 4

Proof.

Because TOST is an IUT, by Theorem 2, the size of TOST is at most α\alpha. Consider the constant sequence of parameter points θl=μTμR=θU{\theta_{l}}=\mu_{T}-\mu_{R}=\theta_{U}; then we have

limlθl(𝑿R2)=θU(𝑿R2)=θU(TU<t1α,r)=α.\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{2})=\mathbb{P}_{\theta_{U}}(\bm{X}\in R_{2})=\mathbb{P}_{\theta_{U}}(T_{U}<-t_{1-\alpha,r})=\alpha.

Furthermore,

limlθl(𝑿R1)=θU(𝑿R1)=θU(TL>t1α,r)1, as σ20,\lim_{l\rightarrow\infty}\mathbb{P}_{\theta_{l}}(\bm{X}\in R_{1})=\mathbb{P}_{\theta_{U}}(\bm{X}\in R_{1})=\mathbb{P}_{\theta_{U}}(T_{L}>t_{1-\alpha,r})\rightarrow 1,\textrm{ as }\sigma^{2}\rightarrow 0,

therefore, the conditions in Theorem 3 hold true. Thus, the size of TOST is exactly α\alpha. ∎

Appendix E Proof of Theorem 5

Proof.

By the definition of acceptance region, we have

supθ=θ0(𝑿A(θ0))=supθ=θ0(Tθ0=1)α,\displaystyle\sup_{\theta=\theta_{0}}\mathbb{P}(\bm{X}\notin A(\theta_{0}))=\sup_{\theta=\theta_{0}}\mathbb{P}(T_{\theta_{0}}=1)\leq\alpha,

which is the same as

1αinfθ=θ0(𝑿A(θ0))=infθ=θ0(θ0C(𝑿)).\displaystyle 1-\alpha\leq\inf_{\theta=\theta_{0}}\mathbb{P}(\bm{X}\in A(\theta_{0}))=\inf_{\theta=\theta_{0}}\mathbb{P}(\theta_{0}\in C(\bm{X})).

Since the above statements hold true for all θΘ\theta\in\Theta, thus

infθ0Θinfθ=θ0(θ0C(𝑿))1α,\displaystyle\inf_{\theta_{0}\in\Theta}\inf_{\theta=\theta_{0}}\mathbb{P}(\theta_{0}\in C(\bm{X}))\geq 1-\alpha,

which implies C(𝑿)={θ:𝑿A(θ)}C(\bm{X})=\{\theta:\bm{X}\in A(\theta)\} is a 100(1α)%100(1-\alpha)\% confidence set for θ\theta. ∎

Appendix F Proof of Theorem 6

Proof.

If μTμR=0\mu_{T}-\mu_{R}=0, then μTμR[L(𝑿),U(𝑿)]\mu_{T}-\mu_{R}\in[L(\bm{X}),U(\bm{X})] for all 𝑿\bm{X}; thus the coverage probability of [L(𝑿),U(𝑿)][L(\bm{X}),U(\bm{X})] is 1 in this case. So we only need to consider the case μTμR0\mu_{T}-\mu_{R}\neq 0.

When μTμR>0\mu_{T}-\mu_{R}>0, the event {μTμRmin{0,(X¯TX¯R)t1α,rσ^ΔX¯}}\{\mu_{T}-\mu_{R}\geq\min\{0,(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\} is the whole sample space, thus

({μTμRmin{0,(X¯TX¯R)t1α,rσ^ΔX¯}})=1.\displaystyle\mathbb{P}\left(\{\mu_{T}-\mu_{R}\geq\min\{0,(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\}\right)=1.

And if μTμR>0\mu_{T}-\mu_{R}>0, the event {μTμRmax{0,(X¯TX¯R)+t1α,rσ^ΔX¯}}\{\mu_{T}-\mu_{R}\leq\max\{0,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\} is the same as event {μTμR(X¯TX¯R)+t1α,rσ^ΔX¯}\left\{\mu_{T}-\mu_{R}\leq(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right\}, so the coverage probability equals

(μTμR[L(𝑿),U(𝑿)])\displaystyle\mathbb{P}\left(\mu_{T}-\mu_{R}\in[L(\bm{X}),U(\bm{X})]\right) =(μTμRmax{0,(X¯TX¯R)+t1α,rσ^ΔX¯})\displaystyle=\mathbb{P}\left(\mu_{T}-\mu_{R}\leq\max\{0,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\right)
=(μTμR(X¯TX¯R)+t1α,rσ^ΔX¯)\displaystyle=\mathbb{P}\left(\mu_{T}-\mu_{R}\leq(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right)
=(X¯TX¯R(μTμR)t1α,rσ^ΔX¯)\displaystyle=\mathbb{P}\left(\bar{X}_{T}-\bar{X}_{R}-(\mu_{T}-\mu_{R})\geq-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right)
=1α.\displaystyle=1-\alpha.

Similarly, when μTμR<0\mu_{T}-\mu_{R}<0, the event {μTμRmax{0,(X¯TX¯R)+t1α,rσ^ΔX¯}}\{\mu_{T}-\mu_{R}\leq\max\{0,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\} is the whole sample space, thus

({μTμRmax{0,(X¯TX¯R)+t1α,rσ^ΔX¯}})=1.\displaystyle\mathbb{P}\left(\{\mu_{T}-\mu_{R}\leq\max\{0,(\bar{X}_{T}-\bar{X}_{R})+t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\}\right)=1.

If μTμR<0\mu_{T}-\mu_{R}<0, the event {μTμRmin{0,(X¯TX¯R)t1α,rσ^ΔX¯}}\{\mu_{T}-\mu_{R}\geq\min\{0,(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\} is the same as event {μTμR(X¯TX¯R)t1α,rσ^ΔX¯}\left\{\mu_{T}-\mu_{R}\geq(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right\}, so the coverage probability equals

(μTμR[L(𝑿),U(𝑿)])\displaystyle\mathbb{P}\left(\mu_{T}-\mu_{R}\in[L(\bm{X}),U(\bm{X})]\right) =(μTμRmin{0,(X¯TX¯R)t1α,rσ^ΔX¯})\displaystyle=\mathbb{P}\left(\mu_{T}-\mu_{R}\geq\min\{0,(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\}\right)
=(μTμR(X¯TX¯R)t1α,rσ^ΔX¯)\displaystyle=\mathbb{P}\left(\mu_{T}-\mu_{R}\geq(\bar{X}_{T}-\bar{X}_{R})-t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right)
=(X¯TX¯R(μTμR)t1α,rσ^ΔX¯)\displaystyle=\mathbb{P}\left(\bar{X}_{T}-\bar{X}_{R}-(\mu_{T}-\mu_{R})\leq t_{1-\alpha,r}\hat{\sigma}_{\Delta\bar{X}}\right)
=1α.\displaystyle=1-\alpha.

Combining the cases above completes the proof. ∎
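Both cases of this coverage result can be checked by simulation. The following minimal sketch (not the authors' code; equal group sizes, a common variance, and the particular δ\delta values are illustrative assumptions) estimates the coverage of the expanded interval [min{0,L(𝑿)},max{0,U(𝑿)}][\min\{0,L(\bm{X})\},\max\{0,U(\bm{X})\}]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def coverage(delta, n=20, sigma=1.0, alpha=0.05, reps=50_000):
    """Coverage of [min(0, L), max(0, U)], where L and U are the usual
    one-sided 100(1 - alpha)% bounds for mu_T - mu_R = delta."""
    x_t = rng.normal(delta, sigma, (reps, n))
    x_r = rng.normal(0.0, sigma, (reps, n))
    diff = x_t.mean(1) - x_r.mean(1)
    s2 = (x_t.var(1, ddof=1) + x_r.var(1, ddof=1)) / 2.0   # pooled variance
    se = np.sqrt(2.0 * s2 / n)
    t = stats.t.ppf(1 - alpha, 2 * n - 2)
    lo = np.minimum(0.0, diff - t * se)
    hi = np.maximum(0.0, diff + t * se)
    return np.mean((lo <= delta) & (delta <= hi))

print(coverage(0.0))  # coverage is exactly 1 when mu_T - mu_R = 0
print(coverage(0.5))  # coverage is 1 - alpha = 0.95 otherwise
```

The first call returns exactly 1 (zero is always inside the expanded interval), while the second sits at 1α1-\alpha up to Monte Carlo error, matching the two cases in the proof.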

Appendix G Proof of Theorem 7

Proof.

By definition, we have

(L(𝑿)θ)=1α1,(U(𝑿)θ)=1α2.\mathbb{P}(L(\bm{X})\leq\theta)=1-\alpha_{1},\mathbb{P}(U(\bm{X})\geq\theta)=1-\alpha_{2}.

Consider the events 𝒜\mathcal{A} and \mathcal{B} defined as 𝒜={𝑿:L(𝑿)θ},={𝑿:U(𝑿)θ}\mathcal{A}=\{\bm{X}:L(\bm{X})\leq\theta\},\mathcal{B}=\{\bm{X}:U(\bm{X})\geq\theta\}. Therefore, the intersection of 𝒜\mathcal{A} and \mathcal{B} is 𝒜={𝑿:L(𝑿)θU(𝑿)}\mathcal{A}\bigcap\mathcal{B}=\{\bm{X}:L(\bm{X})\leq\theta\leq U(\bm{X})\}. Since L(𝑿)U(𝑿)L(\bm{X})\leq U(\bm{X}), it follows that

(𝒜)=(L(𝑿)θ or U(𝑿)θ)(L(𝑿)θ or L(𝑿)θ)=1.\mathbb{P}\left(\mathcal{A}\bigcup\mathcal{B}\right)=\mathbb{P}\left(L(\bm{X})\leq\theta\textrm{ or }U(\bm{X})\geq\theta\right)\geq\mathbb{P}\left(L(\bm{X})\leq\theta\textrm{ or }L(\bm{X})\geq\theta\right)=1.

Since (𝒜)1\mathbb{P}\left(\mathcal{A}\bigcup\mathcal{B}\right)\leq 1, we have (𝒜)=1\mathbb{P}\left(\mathcal{A}\bigcup\mathcal{B}\right)=1. The result follows from

(𝒜)=(𝒜)+()(𝒜)=1α1α2.\mathbb{P}\left(\mathcal{A}\bigcap\mathcal{B}\right)=\mathbb{P}(\mathcal{A})+\mathbb{P}(\mathcal{B})-\mathbb{P}\left(\mathcal{A}\bigcup\mathcal{B}\right)=1-\alpha_{1}-\alpha_{2}.
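The additivity of the two tail probabilities can be verified numerically. A minimal sketch (an illustration under assumed normal sampling; the split α1=0.01\alpha_{1}=0.01, α2=0.04\alpha_{2}=0.04, sample size, and mean are arbitrary choices) combines a one-sided lower and a one-sided upper tt-bound and checks the joint coverage:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# combine a 99% lower bound (a1 = 0.01) with a 96% upper bound (a2 = 0.04)
a1, a2, n, theta = 0.01, 0.04, 30, 1.0
x = rng.normal(theta, 1.0, (100_000, n))
se = x.std(1, ddof=1) / np.sqrt(n)
lo = x.mean(1) - stats.t.ppf(1 - a1, n - 1) * se
hi = x.mean(1) + stats.t.ppf(1 - a2, n - 1) * se
print(np.mean((lo <= theta) & (theta <= hi)))  # about 1 - a1 - a2 = 0.95
```

The empirical coverage lands at 1α1α21-\alpha_{1}-\alpha_{2} up to Monte Carlo error, exactly as Theorem 7 predicts for any split of the two tails.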