
Online Testing of Subgroup Treatment Effects Based on Value Difference

Miao Yu Department of Statistics
North Carolina State University
Raleigh, USA
myu12@ncsu.edu
   Wenbin Lu Department of Statistics
North Carolina State University
Raleigh, USA
wlu4@ncsu.edu
   Rui Song Department of Statistics
North Carolina State University
Raleigh, USA
rsong@ncsu.edu
Abstract

Online A/B testing plays a critical role in the high-tech industry to guide product development and accelerate innovation. It performs a null hypothesis statistical test to determine which variant is better. However, a typical A/B test presents two problems: (i) the fixed-horizon framework inflates the false-positive rate under continuous monitoring; (ii) the homogeneous-effects assumption fails to identify a subgroup with a beneficial treatment effect. In this paper, we propose a sequential test for subgroup treatment effects based on value difference, named SUBTLE, to address these two problems simultaneously. SUBTLE allows experimenters to "peek" at the results during the experiment without harming the statistical guarantees. It assumes heterogeneous treatment effects and aims to test whether some subgroup of the population would benefit from the investigative treatment. If the testing result indicates the existence of such a subgroup, the subgroup can be identified using a readily available estimated optimal treatment rule. We examine the empirical performance of our proposed test on both simulations and a real dataset. The results show that SUBTLE has high detection power with the type I error controlled at any time, is more robust to noise covariates, and can achieve early stopping compared with the corresponding fixed-horizon test.

Index Terms:
sequential testing, heterogeneous treatment effects

I Introduction

Online A/B testing, a kind of randomized controlled experiment, is widely used in the high-tech industry to assess the value of ideas in a scientific manner [1]. It randomly exposes users to one of two variants: control (A), the currently used version, or treatment (B), a new version being evaluated, and collects the metric of interest, such as conversion rate, revenue, etc. Then, a null hypothesis statistical test is performed to evaluate whether there is a statistically significant difference between the two variants on the metric of interest. This design helps control for external variation and thus establishes causality between the variants and the outcome. However, current A/B testing has limitations in terms of framework and model assumptions.

First of all, most A/B tests employ a fixed-horizon framework, whose validity requires that the sample size be fixed and determined before the experiment starts. However, experimenters, driven by fast-paced product evolution in practice, often "peek" at the experiment and hope to find significance as quickly as possible to avoid large (i) time costs: an A/B test may take a prohibitively long time to collect the predetermined number of samples; and (ii) opportunity costs: users assigned to a suboptimal variant will be stuck with a bad experience for a long time [2]. Continuously monitoring and concluding the experiment prematurely biases the analysis toward significant results and leads to very high false-positive probabilities, well in excess of the nominal significance level $\alpha$ [3, 4]. Another limitation of A/B tests is that they assume homogeneous treatment effects among the population and mainly focus on testing the average treatment effect. However, it is common for treatment effects to vary across subpopulations. Testing subgroup treatment effects helps decision-makers distinguish the subpopulation that may benefit from a particular treatment from those who may not, and thereby guides companies' marketing strategies in promoting new products.

The first problem can be addressed by the sequential testing framework. Sequential testing, in contrast to the classic fixed-horizon test, is a statistical testing procedure that checks for significance at every new sample and stops the test as soon as a significant result is detected, while controlling the type I error at any time. It generally requires a much smaller sample size than the fixed-horizon test with the same type I and type II error control [5], and thus is able to end an experiment much earlier. The field was initiated by Wald [5], who proposed the sequential probability ratio test (SPRT) for simple hypotheses using the likelihood ratio as the test statistic; it was then extended to composite hypotheses in subsequent literature [6, 7, 8, 9, 10]. A thorough review is given by Lai [11]. However, the advantage of sequential testing in online A/B testing was not recognized until recently, when Johari et al. [12] brought mSPRT, a variant of SPRT, to A/B tests.

The second problem calls for a test of subgroup treatment effects. Although sequential testing is rapidly developing in online A/B tests, little work focuses on subgroup treatment effect testing. Yu et al. [13] proposed a sequential score test (SST) based on the score statistic under a generalized linear model, which aims to test whether there is any difference between the treatment and control groups among any subjects. However, this test is based on a restrictive parametric assumption on the treatment-covariate interaction and cannot be used to test subgroup treatment effects.

In this paper, we consider a flexible model and propose a sequential test for SUBgroup Treatment effects based on vaLuE difference (SUBTLE), which aims to test whether some group of the population would benefit from the investigative treatment. Our method does not require specifying any parametric form of covariate-specific treatment effects. If the null hypothesis is rejected, a beneficial subgroup can be easily obtained based on the estimated optimal treatment rule.

The remainder of this paper is structured as follows. In Section II, we review the ideas of mSPRT and SST and discuss how they relate to our test. In Section III, we introduce our proposed method SUBTLE and provide a theoretical guarantee of its validity. We conduct simulations in Section IV and real data experiments in Section V to demonstrate the validity, detection power, robustness, and efficiency of our proposed test. Finally, in Section VI, we conclude the paper and present future directions.

II Related Work

II-A Mixture Sequential Probability Ratio Test

The mixture sequential probability ratio test (mSPRT) [9] supposes that the independent and identically distributed (i.i.d.) random variables $Y_{1},Y_{2},\cdots$ have a probability density function $f_{\theta}(x)$ induced by parameter $\theta$, and aims to test

\mbox{H}_{0}:\theta=\theta_{0}\quad vs.\quad\mbox{H}_{1}:\theta\neq\theta_{0}.

Its test statistic $\Lambda^{\pi}_{n}$ at sample size $n$ is a mixture of likelihood ratios weighted by a mixture density $\pi(\cdot)$ over the parameter space $\Theta$:

\Lambda^{\pi}_{n}=\int_{\Theta}\left\{\prod_{i=1}^{n}\frac{f_{\theta}(Y_{i})}{f_{\theta_{0}}(Y_{i})}\right\}\pi(\theta)\,d\theta.

The mSPRT stops the sampling at stage

N=\inf\{n\geq 1:\Lambda_{n}^{\pi}\geq 1/\alpha\} \qquad (1)

and rejects the null hypothesis $\mbox{H}_{0}$ in favor of $\mbox{H}_{1}$. If no such time exists, it continues sampling indefinitely and accepts $\mbox{H}_{0}$. Since the likelihood ratio under $\mbox{H}_{0}$ is a nonnegative martingale with initial value equal to 1, and so is the mixture of such likelihood ratios $\Lambda_{n}^{\pi}$, the type I error of mSPRT can be shown to be controlled at $\alpha$ at all times by an application of Markov's inequality and the optional stopping theorem: $P_{H_{0}}(\Lambda_{n}^{\pi}\geq\alpha^{-1})\leq\mathbb{E}_{H_{0}}(\Lambda_{n}^{\pi})/\alpha^{-1}=\mathbb{E}_{H_{0}}(\Lambda_{0}^{\pi})/\alpha^{-1}=\alpha$. Besides, mSPRT is a test of power one [14], which means that any small deviation from $\theta_{0}$ can be detected provided one waits long enough. It has also been shown that mSPRT is almost optimal, with respect to the expected stopping time, for data from an exponential family of distributions [15].

The mSPRT was brought to A/B tests by Johari et al. [12, 16], who assume that the observations in the control and treatment groups arrive in pairs $(Y_{i}^{(0)},Y_{i}^{(1)})$, $i=1,2,\cdots$. They restricted their data model to the two most common cases in practice: the normal distribution and the Bernoulli distribution, with $\mu_{\text{A}}$ and $\mu_{\text{B}}$ denoting the means of the control and treatment groups, respectively. They test the following hypotheses:

\mbox{H}_{0}:\theta:=\mu_{\text{B}}-\mu_{\text{A}}=0\quad vs.\quad\mbox{H}_{1}:\theta\neq 0,

by directly applying mSPRT to the distribution of the differences $Z_{i}=Y_{i}^{(1)}-Y_{i}^{(0)}$ (normal), or the joint distribution of the data pairs $(Y_{i}^{(0)},Y_{i}^{(1)})$ (Bernoulli), $i=1,2,\cdots$. After making some approximations to the likelihood ratio and choosing a normal mixture density $\pi(\theta)\sim\mbox{N}(0,\tau^{2})$, the test statistic $\Lambda^{\pi}_{n}$ admits a closed form for both normal and Bernoulli observations.
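For intuition, the normal case admits a compact implementation. The sketch below is our own minimal illustration, assuming i.i.d. paired differences $Z_{i}\sim\mbox{N}(\theta,2\sigma^{2})$ with known $\sigma^{2}$ and a $\mbox{N}(0,\tau^{2})$ mixture density; it is not the implementation of [12].

```python
import numpy as np

def msprt_normal(z, sigma2, tau2, alpha=0.05):
    """Sequential mSPRT for the mean of paired differences Z_i ~ N(theta, 2*sigma2).

    Uses the closed-form mixture likelihood ratio under a N(0, tau2) mixture
    density; returns the stopping index (or None) and the statistic path.
    """
    V = 2 * sigma2                      # variance of each difference Z_i
    s = np.cumsum(z)                    # running sums sum_{i<=n} Z_i
    n = np.arange(1, len(z) + 1)
    lam = np.sqrt(V / (V + n * tau2)) * np.exp(
        tau2 * s**2 / (2 * V * (V + n * tau2)))
    hits = np.nonzero(lam >= 1 / alpha)[0]
    stop = int(hits[0]) + 1 if hits.size else None
    return stop, lam

# Example: a true lift of 0.3 is typically detected after a few hundred pairs.
rng = np.random.default_rng(0)
z = rng.normal(0.3, np.sqrt(2.0), size=2000)
stop, lam = msprt_normal(z, sigma2=1.0, tau2=1.0)
print(stop)
```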

However, the mSPRT does not work well for testing heterogeneous treatment effects due to the complexity of the likelihood induced by individual covariates. Specifically, a conjugate prior $\pi(\cdot)$ for the likelihood ratio may no longer exist, making the computation of the test statistic challenging. The unknown baseline covariate effect also increases the difficulty of constructing and approximating the likelihood ratio [13].

II-B Sequential Score Test

The sequential score test (SST) [13] assumes a generalized linear model with a link function $g(\cdot)$ for the outcome $Y$:

g(\mathbb{E}(Y|A,\mathbf{X}))=\bm{\mu}^{T}\mathbf{X}+(\bm{\theta}^{T}\mathbf{X})A, \qquad (2)

where $A$ and $\mathbf{X}$ respectively denote the binary treatment indicator and the user covariate vector. It tests the multi-dimensional treatment-covariate interaction effect:

\mbox{H}_{0}:\bm{\theta}=\bm{0}\quad vs.\quad\mbox{H}_{1}:\bm{\theta}\neq\bm{0}, \qquad (3)

while accounting for the linear baseline covariate effect $\bm{\mu}^{T}\mathbf{X}$. For the test statistic $\Lambda_{n}^{\pi}$, instead of a mixture of likelihood ratios as in mSPRT, SST employs a mixture of asymptotic probability ratios of the score statistic. Since the probability ratio has the same martingale structure as the likelihood ratio, the type I error can still be controlled with the same decision rule (1) as mSPRT. The asymptotic normality of the score statistic also guarantees a closed form of $\Lambda_{n}^{\pi}$ with a multivariate normal mixture density $\pi(\cdot)$. However, the considered parametric model (2) can only be used to test whether there is a linear covariate-treatment interaction effect and may fail to detect the existence of a subgroup with enhanced treatment effects. In addition, the subgroup estimated based on the index $\bm{\theta}^{T}\mathbf{X}$ may be biased if the assumed linear model (2) is misspecified. Therefore, in this paper, we propose a subgroup treatment effect test, which is able to test the existence of a beneficial subgroup and does not require specifying the form of treatment effects.

III Subgroup Treatment Effects Test Based on Value Difference

III-A Problem Setup

Suppose we have i.i.d. data $\mathbf{O}_{i}=\{Y_{i},A_{i},\mathbf{X}_{i}\}$, $i=1,2,\cdots$, where $Y_{i}$, $A_{i}$, and $\mathbf{X}_{i}$ respectively denote the observed outcome, the binary treatment indicator, and the user covariate vector. Here, we consider a flexible generalized linear model:

g(\mathbb{E}(Y_{i}|A_{i},\mathbf{X}_{i}))=\mu(\mathbf{X}_{i})+\theta(\mathbf{X}_{i})A_{i}, \qquad (4)

where the baseline covariate effect $\mu(\cdot)$ and the treatment-covariate interaction effect $\theta(\cdot)$ are completely unspecified functions, and $g(\cdot)$ is a prespecified link function. For example, we use the identity link $g(\mu)=\mu$ for normal responses and the logit link $g(\mu)=\log\{\mu/(1-\mu)\}$ for binary outcomes.

Assuming $Y$ is coded such that larger values indicate a better outcome, we consider the following test of subgroup treatment effects:

\mbox{H}_{0}:\forall\,\mathbf{x}\in\mathcal{X},\;\theta(\mathbf{x})\leq 0\quad vs.\quad\mbox{H}_{1}:\exists\,\mathcal{X}_{0}\subset\mathcal{X}\;\text{such that}\;\theta(\mathbf{x})>0\;\text{for all}\;\mathbf{x}\in\mathcal{X}_{0}, \qquad (5)

where $\mathcal{X}_{0}$ is the beneficial subgroup with $P(\mathbf{X}\in\mathcal{X}_{0})>0$. Note that the above subgroup test is very different from the covariate-treatment interaction test considered in (3) and is much more challenging in several respects. First, both $\mu(\cdot)$ and $\theta(\cdot)$ are nonparametric and need to be estimated. Second, the considered hypotheses (5) are nonstandard moment inequalities. Third, it allows a nonregular setting, i.e. $P(\theta(\mathbf{X})=0)>0$, which makes the associated inference difficult. Here, we propose a test based on the value difference between the optimal treatment rule and a fixed treatment rule.

Let $Y^{*}(a)$ denote the potential outcome had the subject received treatment $a$ ($a=0,1$), and let $V(d)=\mathbb{E}_{\{Y^{*}(a),\mathbf{X}\}}\{Y^{*}(d(\mathbf{X}))\}$ denote the value function of a fixed treatment rule $d(\mathbf{X})$, which maps the information in $\mathbf{X}$ to a treatment in $\{0,1\}$; the subscript indicates that the expectation is taken with respect to the joint distribution of $\{Y^{*}(a),\mathbf{X}\}$. Consider the value difference $\Delta=V(d^{\text{opt}})-V(0)$ between the optimal treatment rule $d^{\text{opt}}=1\{\theta(\mathbf{X})>0\}$ and the treatment rule $d=0$ that assigns control to everyone, where $1\{\cdot\}$ is an indicator function. If the null hypothesis is true, no one would benefit from the treatment, the optimal treatment rule assigns everyone to control, and therefore the value difference is zero. However, if the alternative hypothesis is true, some people would have higher outcomes when assigned to treatment, and thus the value difference is positive. In this way, the testing hypotheses (5) can be equivalently transformed into the following pair:

\mbox{H}_{0}:\Delta=0\quad vs.\quad\mbox{H}_{1}:\Delta>0. \qquad (6)

We make the following standard causal inference assumptions: (i) consistency, which states that the observed outcome equals the potential outcome under the actual treatment received, i.e. $Y=Y^{*}(1)1(A=1)+Y^{*}(0)1(A=0)$; (ii) no unmeasured confounders, i.e. $Y^{*}(a)\perp\!\!\!\perp A\,|\,\mathbf{X}$ for $a=0,1$, which means that the potential outcome is independent of treatment given covariates; (iii) positivity, i.e. $P(A=a|\mathbf{X}=\mathbf{x})>0$ for $a=0,1$ and all $\mathbf{x}\in\mathcal{X}$ such that $P(\mathbf{X}=\mathbf{x})>0$. Under these assumptions, it can be shown that

V(d)=\mathbb{E}_{\mathbf{X}}\left[\mathbb{E}\left\{Y^{*}(d(\mathbf{X}))|\mathbf{X}\right\}\right]=\mathbb{E}_{\mathbf{X}}\left[\mathbb{E}\left\{Y^{*}(d(\mathbf{X}))|A=d(\mathbf{X}),\mathbf{X}\right\}\right]=\mathbb{E}_{\mathbf{X}}\left[\mathbb{E}\left\{Y|A=d(\mathbf{X}),\mathbf{X}\right\}\right],
where the second equality follows from the no-unmeasured-confounders assumption and the third from consistency.

III-B Algorithm and Implementation

We estimate the value function of a given treatment rule $d$ by the augmented inverse probability weighted (AIPW) estimator [17, 18]:

\hat{V}_{\text{AIPW}}(d)=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{Y_{i}\cdot 1(A_{i}=d)}{p_{A_{i}}(\mathbf{X}_{i})}-\left\{\frac{1(A_{i}=d)}{p_{A_{i}}(\mathbf{X}_{i})}-1\right\}\times\mathbb{E}(Y_{i}|A_{i}=d,\mathbf{X}_{i})\right],

where $p_{A}(\mathbf{X})=A\cdot p(\mathbf{X})+(1-A)\cdot(1-p(\mathbf{X}))$ and $p(\mathbf{X})=P(A=1|\mathbf{X})$ is the propensity score. This estimator is unbiased, i.e. $\mathbb{E}_{(Y,A,\mathbf{X})}\{\hat{V}_{\text{AIPW}}(d)\}=V(d)$. Moreover, the AIPW estimator is doubly robust, that is, it remains consistent if either the estimator of $\mathbb{E}(Y|A=d,\mathbf{X})$ or the estimator of the propensity score $p(\mathbf{X})$ is consistent, which affords much flexibility. Then the value difference $\Delta$ is unbiasedly estimated by

D(\mathbf{O}_{i};\mu,\theta,p):=\bigg[\frac{Y_{i}\cdot 1\left(A_{i}=1\{\theta(\mathbf{X}_{i})>0\}\right)}{p_{A_{i}}(\mathbf{X}_{i})}-\left\{\frac{1\left(A_{i}=1\{\theta(\mathbf{X}_{i})>0\}\right)}{p_{A_{i}}(\mathbf{X}_{i})}-1\right\}\times g^{-1}\left(\mu(\mathbf{X}_{i})+\theta(\mathbf{X}_{i})1\{\theta(\mathbf{X}_{i})>0\}\right)\bigg]
\qquad-\left[\frac{Y_{i}\cdot 1(A_{i}=0)}{1-p(\mathbf{X}_{i})}-\left\{\frac{1(A_{i}=0)}{1-p(\mathbf{X}_{i})}-1\right\}\times g^{-1}\left(\mu(\mathbf{X}_{i})\right)\right]

under our assumed model (4), where $g^{-1}(\cdot)$ is the inverse of the link function. That is,

\mathbb{E}_{(Y,A,\mathbf{X})}\left\{D(\mathbf{O}_{i};\mu,\theta,p)\right\}=\Delta. \qquad (7)
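To make the construction concrete, the following minimal NumPy sketch computes the per-observation contrast $D(\mathbf{O}_{i};\mu,\theta,p)$ for a continuous outcome with the identity link; the function name and the toy data-generating model are our own illustration, not part of the paper's implementation.

```python
import numpy as np

def aipw_contrast(y, a, x, mu, theta, p, ginv=lambda u: u):
    """Per-observation AIPW estimate D(O_i; mu, theta, p) of the value
    difference Delta = V(d_opt) - V(0), following the display above.

    mu, theta, p are functions of the covariates; ginv is the inverse link
    (identity here, for a continuous outcome).
    """
    d_opt = (theta(x) > 0).astype(float)        # estimated optimal rule 1{theta(x)>0}
    p1 = p(x)
    p_a = np.where(a == 1, p1, 1 - p1)          # p_A(X)
    ind_opt = (a == d_opt).astype(float)
    m_opt = ginv(mu(x) + theta(x) * d_opt)      # outcome model under d_opt
    m0 = ginv(mu(x))                            # outcome model under control
    v_opt = y * ind_opt / p_a - (ind_opt / p_a - 1) * m_opt
    v0 = y * (a == 0) / (1 - p1) - ((a == 0) / (1 - p1) - 1) * m0
    return v_opt - v0

# Toy check of (7): with the true mu, theta, p plugged in, the sample mean of D
# approximates Delta. Here theta(x) = x, so Delta = E[x * 1{x>0}] ~ 0.399.
rng = np.random.default_rng(1)
x = rng.normal(size=100000)
a = rng.binomial(1, 0.5, size=x.size)
y = x**2 + x * (x > 0) * a + rng.normal(size=x.size)
D = aipw_contrast(y, a, x, mu=lambda x: x**2, theta=lambda x: x,
                  p=lambda x: np.full_like(x, 0.5))
print(D.mean())
```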

Since $\mu(\cdot)$, $\theta(\cdot)$, and $p(\cdot)$ are usually unknown, we let the data come in batches and estimate them based on previous batches of data.

1. Initialize batch index $k=0$ and test statistic $\Lambda_{k}^{\pi}=0$. Choose a significance level $0<\alpha<1$, a batch size $m$, an initial batch size $l$, and a failure time $M$.
2. Sample $l$ observations to form the initial batch $\mathcal{C}_{0}$.
while True do
       (i) $k=k+1$.
      (ii) Let $\overline{\mathcal{C}}_{k-1}=\cup_{j=0}^{k-1}\mathcal{C}_{j}$. Estimate $\mu(\cdot)$, $\theta(\cdot)$, and $p(\cdot)$ based on data in $\overline{\mathcal{C}}_{k-1}$ to get $\hat{\mu}_{k-1}$, $\hat{\theta}_{k-1}$, and $\hat{p}_{k-1}$.
      (iii) Sample another $m$ observations to form batch $\mathcal{C}_{k}$. For each $\mathbf{O}_{i}\in\mathcal{C}_{k}$, calculate $D(\mathbf{O}_{i};\hat{\mu}_{k-1},\hat{\theta}_{k-1},\hat{p}_{k-1})$. Let
\bar{D}_{k}(\mathcal{C}_{k};\overline{\mathcal{C}}_{k-1})=\frac{1}{m}\sum_{\mathbf{O}_{i}\in\mathcal{C}_{k}}D(\mathbf{O}_{i};\hat{\mu}_{k-1},\hat{\theta}_{k-1},\hat{p}_{k-1}).
      (iv) Estimate the conditional standard deviation $\sigma_{k}=sd\left(\bar{D}_{k}(\mathcal{C}_{k};\overline{\mathcal{C}}_{k-1})|\overline{\mathcal{C}}_{k-1}\right)$ based on data in $\overline{\mathcal{C}}_{k-1}$ and denote it by $\hat{\sigma}_{k}$.
      (v) Calculate
R_{k}=\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1}) \qquad (8)
and
\Lambda_{k}^{\pi}=\int\frac{\psi_{\left(\frac{1}{\sqrt{k}}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta,\,1\right)}(R_{k})}{\psi_{\left(0,\,1\right)}(R_{k})}\,\pi(\Delta)\,d\Delta, \qquad (9)
where $\psi_{(\mu,\sigma^{2})}(\cdot)$ denotes the probability density function of a normal distribution with mean $\mu$ and variance $\sigma^{2}$.
      if $\Lambda_{k}^{\pi}>1/\alpha$ or $k\times m+l>M$ then
            break
      end if
end while
if $\Lambda_{k}^{\pi}>1/\alpha$ then
      Reject $\mbox{H}_{0}$. Estimate $\theta(\cdot)$ using all the data collected so far and identify the subgroup $1\{\hat{\theta}(\mathbf{X})>0\}$.
else
      Accept $\mbox{H}_{0}$.
end if
Algorithm 1 SUBTLE
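The batch loop can be sketched in code as follows. This is a simplified illustration, assuming a continuous outcome with identity link, a constant randomization probability, scikit-learn random forests for the outcome models, and the closed-form expression for $\Lambda_{k}^{\pi}$ derived later in this section; `sample_batch` is a hypothetical data source, not part of the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def subtle(sample_batch, alpha=0.05, m=20, l=300, M=2300, tau2=1.0):
    """Sketch of Algorithm 1 for a continuous outcome (identity link).

    sample_batch(n) must return arrays (y, a, X) of n new observations.
    Returns (reject, n_used). The statistic uses the closed-form mixture
    ratio under a N(0, tau2) density truncated to Delta > 0.
    """
    y, a, X = sample_batch(l)                    # initial batch C_0
    inv_sd, Dbars, k = [], [], 0
    while k * m + l <= M:                        # failure-time check
        k += 1
        # (ii) fit arm-specific outcome models on past data;
        # theta_hat(x) = f1(x) - f0(x), p_hat = treated fraction
        f1 = RandomForestRegressor(n_estimators=200).fit(X[a == 1], y[a == 1])
        f0 = RandomForestRegressor(n_estimators=200).fit(X[a == 0], y[a == 0])
        p1 = a.mean()
        def D(yb, ab, Xb):
            th = f1.predict(Xb) - f0.predict(Xb)
            d_opt = (th > 0).astype(float)
            p_a = np.where(ab == 1, p1, 1 - p1)
            ind = (ab == d_opt)
            m_opt = np.where(d_opt == 1, f1.predict(Xb), f0.predict(Xb))
            v_opt = yb * ind / p_a - (ind / p_a - 1) * m_opt
            v0 = yb * (ab == 0) / (1 - p1) - ((ab == 0) / (1 - p1) - 1) * f0.predict(Xb)
            return v_opt - v0
        # (iv) sigma_hat_k from the sample variance of D over past data
        sd_k = np.sqrt(D(y, a, X).var(ddof=1) / m)
        # (iii) new batch and its batch mean D_bar_k
        yb, ab, Xb = sample_batch(m)
        Dbars.append(D(yb, ab, Xb).mean())
        inv_sd.append(1.0 / sd_k)
        y, a, X = np.append(y, yb), np.append(a, ab), np.vstack([X, Xb])
        # (v) R_k and the closed-form mixture probability ratio
        S = np.sum(inv_sd)
        R = np.sum(np.array(inv_sd) * np.array(Dbars)) / np.sqrt(k)
        c2 = tau2 * S**2 + k
        lam = 2 * np.sqrt(k / c2) * np.exp(tau2 * (S * R)**2 / (2 * c2)) \
              * norm.sf(0, loc=np.sqrt(k) * tau2 * S * R / c2,
                        scale=np.sqrt(k * tau2 / c2))
        if lam > 1 / alpha:
            return True, k * m + l               # reject H0, early stop
    return False, k * m + l                      # failure time reached: accept H0
```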

Like SST, the key idea behind SUBTLE is to construct a statistic that has an (asymptotic) normal distribution with different means under the null and alternative hypotheses (6), and then build the test statistic as a mixture of (asymptotic) probability ratios of it. Define $\hat{\Delta}_{k}$ as below:

\hat{\Delta}_{k}:=\frac{\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})}{\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}}. \qquad (10)

Note that $\hat{\Delta}_{k}$ is an asymptotically unbiased estimator of $\Delta$ by (7). $R_{k}$ in (8) is a multiple of $\hat{\Delta}_{k}$ and thus has a mean related to the value difference $\Delta$. In Section III-C we will show that $R_{k}$ has an asymptotic normal distribution with the same variance but different means under the null and local alternative hypotheses, so that our test statistic $\Lambda_{k}^{\pi}$ in (9) is a mixture of asymptotic probability ratios of $R_{k}$. As discussed in Section II, the probability ratio has the same martingale structure as the likelihood ratio, so SUBTLE can be shown to control the type I error with the decision rule (1). Algorithm 1 summarizes the complete testing procedure.

In step (ii) of Algorithm 1, we estimate $\mu(\cdot)$ and $\theta(\cdot)$ by building random forests on the control and treatment observations in previous batches, respectively. The propensity score $p(\cdot)$ is estimated by the proportion of treatment observations ($A=1$) in previous batches. In step (iv) we estimate $\sigma_{k}$ with $\hat{\sigma}_{k}=\sqrt{s_{k}^{2}/m}$, where $s_{k}^{2}$ is the sample variance of $D(\mathbf{O}_{i};\hat{\mu}_{k-1},\hat{\theta}_{k-1},\hat{p}_{k-1})$, $\forall\mathbf{O}_{i}\in\overline{\mathcal{C}}_{k-1}$. Since the value difference is always nonnegative, in step (v) we choose a truncated normal $\pi(\Delta)=2/\sqrt{2\pi\tau^{2}}\cdot\exp\{-\Delta^{2}/2\tau^{2}\}\cdot 1\{\Delta>0\}$ as the mixture density. The normality of the mixture density guarantees a closed form for our test statistic:

\Lambda_{k}^{\pi}=2\left\{\frac{k}{k+(\tau\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1})^{2}}\right\}^{1/2}\times\exp\left[\frac{\left(\tau\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}R_{k}\right)^{2}}{2\left\{\left(\tau\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)^{2}+k\right\}}\right]\times\left\{1-F(0)\right\},

where $F(\cdot)$ is the cumulative distribution function of a normal distribution with mean $(\sqrt{k}\,\tau^{2}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}R_{k})/\{\tau^{2}(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1})^{2}+k\}$ and variance $k\tau^{2}/\{\tau^{2}(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1})^{2}+k\}$. In theory, the choice of the mixture density variance $\tau^{2}$ does not affect the type I error control. Johari et al. [12] proved that an optimal $\tau^{2}$ in terms of stopping time is the prior variance times a correction for truncating, so we suggest estimating $\tau^{2}$ from historical data. In the last step, we add a failure time $M$ to the decision rule to terminate the test externally and accept the null hypothesis if it is ever reached. If the null hypothesis is rejected, we can employ random forests to estimate $\theta(\cdot)$ based on all the data collected up to the time the experiment ends. The estimated treatment effect $\hat{\theta}(\mathbf{x})$ then naturally gives the beneficial subgroup $\mathcal{X}_{0}=\{\mathbf{x}:\hat{\theta}(\mathbf{x})>0\}$.
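As a sanity check, the closed form can be verified against a direct numerical evaluation of the mixture integral (9); the snippet below is our own sketch with illustrative inputs.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def lambda_closed_form(R, S, k, tau2):
    """Closed-form Lambda_k^pi for the truncated-normal mixture pi(Delta).

    R = R_k, S = sum_j 1/sigma_hat_j, k = number of batches; the mean of F
    carries the tau^2 factors as in the display above.
    """
    c2 = tau2 * S**2 + k
    F0 = norm.cdf(0, loc=np.sqrt(k) * tau2 * S * R / c2,
                  scale=np.sqrt(k * tau2 / c2))
    return 2 * np.sqrt(k / c2) * np.exp(tau2 * (S * R)**2 / (2 * c2)) * (1 - F0)

def lambda_numeric(R, S, k, tau2):
    """Direct numerical evaluation of the mixture integral (9)."""
    c = S / np.sqrt(k)
    integrand = lambda d: (norm.pdf(R, loc=c * d) / norm.pdf(R)
                           * 2 / np.sqrt(2 * np.pi * tau2)
                           * np.exp(-d**2 / (2 * tau2)))
    return quad(integrand, 0, np.inf)[0]

print(lambda_closed_form(1.5, 10.0, 5, 1.0))   # the two values should agree
print(lambda_numeric(1.5, 10.0, 5, 1.0))
```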

III-C Validity

In this section, we show that our proposed test SUBTLE is able to control the type I error at any time, that is, $P_{H_{0}}(\Lambda_{k}^{\pi}>1/\alpha)<\alpha$ for any $k\in\mathbb{N}^{+}$. Theorem III.1 gives the respective asymptotic distributions of $R_{k}$ under the null and local alternative hypotheses, which demonstrates that the test statistic $\Lambda_{k}^{\pi}$ is a mixture of asymptotic probability ratios weighted by $\pi(\cdot)$. Proposition 1 shows that this asymptotic probability ratio has a martingale structure under $\mbox{H}_{0}$ when the sample size is large enough. Combining these two results with the argument in Section II-A, we conclude that the type I error of SUBTLE is always controlled at $\alpha$.

The upcoming theorem relies on the following conditions:

(C1) The number of batches $k$ diverges to infinity as the sample size $n$ diverges to infinity.

(C2) Lindeberg-like condition: for all $\epsilon>0$,
\frac{1}{k}\sum_{j=1}^{k}\mathbb{E}\left[\left\{\frac{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})}{\hat{\sigma}_{j}}\right\}^{2}\times 1\left\{\frac{|\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})|}{\hat{\sigma}_{j}}>\epsilon\sqrt{k}\right\}\bigg|\overline{\mathcal{C}}_{j-1}\right]=o_{p}(1).
(C3) $\frac{1}{k}\sum_{j=1}^{k}\frac{\sigma_{j}^{2}}{\hat{\sigma}_{j}^{2}}\xrightarrow{P}1$.

(C4)
\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left[\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})|\overline{\mathcal{C}}_{j-1}\right\}-\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)|\overline{\mathcal{C}}_{j-1}\right\}\right]=o_{P}(k^{-1/2}),
where $\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)$ is $\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})$ with $\hat{\mu}_{j-1},\hat{\theta}_{j-1},\hat{p}_{j-1}$ replaced by the true values $\mu,\theta,p$, but with $\hat{d}^{opt}_{j-1}=1\{\hat{\theta}_{j-1}(\mathbf{X})>0\}$ unchanged.

(C5) $\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left[\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)|\overline{\mathcal{C}}_{j-1}\right\}-\Delta\right]=o_{P}(k^{-1/2})$.

Similar conditions have been used in the literature to study the properties of doubly robust estimators; their appropriateness was discussed in Section 7 of [19]. Both (C4) and (C5) rely on the convergence rates of the estimators of $\mu,\theta,p$. Wager and Athey [20] showed that under certain constraints on the subsampling rate, random forest predictions converge at the rate $n^{s-1/2}$, where $s$ is chosen to satisfy certain conditions. We assume that (C4) and (C5) hold under this rate.

Theorem III.1

For $\hat{\Delta}_{k}$ defined in (10), under conditions (C1)-(C5),

\frac{1}{\sqrt{k}}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\left(\hat{\Delta}_{k}-\Delta\right)\overset{d}{\rightarrow}\mbox{N}(0,1)\quad\mbox{as}\;k\rightarrow\infty,

where $\overset{d}{\rightarrow}$ denotes convergence in distribution. In particular, as $k\rightarrow\infty$, $R_{k}\overset{d}{\rightarrow}\mbox{N}(0,1)$ under the null hypothesis $\Delta=0$, while $R_{k}-k^{-1/2}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta\overset{d}{\rightarrow}\mbox{N}(0,1)$ under the local alternative $\Delta=\delta/\sqrt{k}$, where $\delta>0$ is fixed.

Proposition 1

Let

\lambda_{k}=\frac{\psi_{\left(\frac{1}{\sqrt{k}}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta,\,1\right)}(R_{k})}{\psi_{\left(0,\,1\right)}(R_{k})}

and let $\mathcal{F}_{k}$ denote the filtration containing all the historical information in the first $(k+1)$ batches $\overline{\mathcal{C}}_{k}$. Then under the null hypothesis $\mbox{H}_{0}:\Delta=0$, $\mathbb{E}(\lambda_{k+1}|\mathcal{F}_{k})$ is approximately equal to $\lambda_{k}\cdot\exp\{o_{P}(1)\}$.

The proofs of the above results are given in the appendices.

IV Simulated Experiments

In this section, we evaluate SUBTLE on three metrics: type I error, power, and sample size. We first compare SUBTLE with SST in terms of type I error and power under five models in Section IV-A. Then in Section IV-B, we present the impact of noise covariates on their power. Finally, in Section IV-C, we compare the stopping time of SUBTLE with the required sample size of a fixed-horizon value difference test. The significance level $\alpha=0.05$, initial batch size $l=300$, failure time $M=2300$, and mixture distribution variance $\tau^{2}=1$ are fixed for both SUBTLE and SST in all simulation settings unless otherwise specified.

IV-A Type I Error & Power

We consider five data generation models of the form (4) with the logistic link $g(\cdot)$. Data are generated in batches of size $m=20$ and randomly assigned to the two groups with a fixed propensity score $p(\mathbf{X})=0.5$. Each experiment is repeated 1000 times to estimate the type I error and power. For the first four models, we consider

Five covariates:
X_{1}\overset{iid}{\sim}\mbox{Ber}(0.5),\quad X_{2}\overset{iid}{\sim}\mbox{U}[-1,1],\quad X_{3},X_{4},X_{5}\overset{iid}{\sim}\mbox{N}(0,1).
Two baseline effects:
\mu_{1}(\mathbf{X})=-2-X_{1}+X_{3}^{2},\quad\mu_{2}(\mathbf{X})=-1.3+X_{1}+0.5X_{2}-X_{3}^{2}.
Two treatment-covariates interaction effects:
\theta_{1}(\mathbf{X})=c\cdot 1\{X_{1}+2X_{3}>0\},\quad\theta_{2}(\mathbf{X})=c\cdot 1\{X_{2}>0\;\text{or}\;X_{5}<-0.5\}.

Table I displays which covariates, $\mu(\mathbf{X})$, and $\theta(\mathbf{X})$ are employed in each model. For Model V, we consider the following high-dimensional setting:

X_{r}\overset{iid}{\sim}\mbox{N}(0.2r-0.6,1),\;r=1,2,3,4,5;\quad X_{r}\overset{iid}{\sim}\mbox{N}(0.2r-1.6,2),\;r=6,7,8,9,10;
X_{r}\overset{iid}{\sim}\mbox{U}[-0.5r+5,0.5r-5],\;r=11,12,13;\quad X_{14}\overset{iid}{\sim}\mbox{U}[-0.5,1.5];\quad X_{15}\overset{iid}{\sim}\mbox{U}[-1.5,0.5];
X_{r}\overset{iid}{\sim}\mbox{Ber}(0.2r-3.1),\;r=16,17,18,19,20;
\mu(\mathbf{X})=-0.8+X_{18}+0.5X_{12}-X_{3}^{2};\quad\theta(\mathbf{X})=c\cdot 1\{(X_{14}>-0.1)\,\&\,(X_{20}=1)\},

where $c$ varies over $\{-1,0,0.6,0.8,1\}$, indicating the intensity of the value difference. When $c=-1$ or $0$, the null hypothesis is true and the type I error is estimated; when $c=0.6,0.8,1$, the alternative is true and the power is estimated.

TABLE I: The covariates, $\mu(\mathbf{X})$, and $\theta(\mathbf{X})$ in Models I-IV
Model  Input covariates  $\mu(\mathbf{X})$  $\theta(\mathbf{X})$
I  $X_{1}$, $X_{3}$  $\mu_{1}(\mathbf{X})$  $\theta_{1}(\mathbf{X})$
II  $X_{1}$, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$  $\mu_{2}(\mathbf{X})$  $\theta_{2}(\mathbf{X})$
III  $X_{1}$, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$  $\mu_{1}(\mathbf{X})$  $\theta_{2}(\mathbf{X})$
IV  $X_{1}$, $X_{2}$, $X_{3}$, $X_{4}$, $X_{5}$  $\mu_{2}(\mathbf{X})$  $\theta_{1}(\mathbf{X})$
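For reference, a sketch of the Model I data-generating process (logistic link, randomization probability 0.5) might look as follows; the function name is our own illustration.

```python
import numpy as np

def gen_model_I(n, c, rng):
    """Model I: mu_1(X) = -2 - X1 + X3^2, theta_1(X) = c * 1{X1 + 2*X3 > 0},
    binary outcome through the logistic link."""
    x1 = rng.binomial(1, 0.5, n)          # X1 ~ Ber(0.5)
    x3 = rng.normal(0, 1, n)              # X3 ~ N(0, 1)
    a = rng.binomial(1, 0.5, n)           # randomized treatment, p(X) = 0.5
    logit = -2 - x1 + x3**2 + c * (x1 + 2 * x3 > 0) * a
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return y, a, np.column_stack([x1, x3])

y, a, X = gen_model_I(1000, c=0.8, rng=np.random.default_rng(0))
```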

Table II shows that SUBTLE is able to control the type I error and achieves competitive detection power, especially under the high-dimensional setting (Model V); in contrast, SST fails to control the type I error, especially when $c=-1$. This can be explained by two factors: (i) the linearity of model (2) is violated; (ii) SST tests whether there is any difference between the treatment and control groups among any subjects, rather than the existence of a beneficial subgroup. Specifically, SST tests whether the least false parameter $\bm{\theta}^{*}$, to which the MLE of $\bm{\theta}$ under model misspecification converges, is zero. We also perform experiments with batch size $m=40$, and the results (Table III) show little difference.

TABLE II: Estimated type I error and power of SUBTLE and SST with batch size 20
Model I II III IV V
$c$ SUBTLE SST SUBTLE SST SUBTLE SST SUBTLE SST SUBTLE SST
-1 0.009 0.695 0.002 0.589 0.003 0.224 0.004 0.411 0.002 0.008
0 0.015 0.134 0.010 0.023 0.006 0.095 0.010 0.023 0.006 0.038
0.6 0.323 0.564 0.491 0.513 0.269 0.389 0.424 0.425 0.559 0.170
0.8 0.623 0.845 0.878 0.900 0.719 0.723 0.822 0.824 0.925 0.390
1 0.911 0.974 0.988 0.996 0.952 0.943 0.985 0.982 0.997 0.742
TABLE III: Estimated type I error and power of SUBTLE and SST with batch size 40
Model I II III IV V
$c$ SUBTLE SST SUBTLE SST SUBTLE SST SUBTLE SST SUBTLE SST
-1 0.003 0.662 0.000 0.588 0.000 0.219 0.000 0.368 0.002 0.006
0 0.012 0.126 0.002 0.023 0.003 0.077 0.002 0.023 0.006 0.034
0.6 0.297 0.552 0.465 0.549 0.326 0.397 0.414 0.451 0.585 0.216
0.8 0.633 0.837 0.868 0.896 0.680 0.703 0.826 0.816 0.931 0.373
1 0.901 0.969 0.993 0.995 0.947 0.933 0.985 0.978 0.999 0.715

To evaluate the robustness of SUBTLE to the mixture density variance $\tau^{2}$, we conduct additional simulations with $\tau^{2}\in\{0.0001,0.001,0.01,0.1,1,10\}$. The data are generated from Model I with $c=0$ or $c=1$ and batch size $m=200$ for computational efficiency. When $c=0$ we estimate the type I error; when $c=1$ we estimate the power. The results in Table IV show that the type I error is always controlled below the significance level 0.05 and that the power is considerably robust to the choice of $\tau^{2}$.

TABLE IV: Estimated type I error and power for SUBTLE with varying mixture density variance
$\tau^{2}$ 0.0001 0.001 0.01 0.1 1 10
Type I error 0.003 0.024 0.021 0.016 0.005 0.002
Power 0.887 0.976 0.956 0.932 0.892 0.825

IV-B Noise Covariates

It is common in practice that a large number of covariates are collected in an experiment while the actual outcome depends on only a few of them. Some covariates, like $X_{4}$ in Models II, III, and IV, have no effect on the response; we call them noise covariates. In the following simulation, we explore the impact of noise covariates on the detection power of SUBTLE and SST. We choose Model I with $c=0.8$ as the base model and repeatedly add three noise covariates at a time, drawn respectively from the normal N$(0,1)$, uniform U$[-1,1]$, and Bernoulli Ber$(0.5)$ distributions. The batch size is set to $m=40$ for computational efficiency. Fig. 1 shows that the power of SST decreases steadily as the number of noise covariates increases, while the power of SUBTLE is more robust to noise covariates.


Figure 1: Estimated power of SUBTLE and SST with data generated from Model I as the number of noise covariates increases

IV-C Stopping Time

A key feature of sequential testing is that its expected sample size is smaller than that of fixed-horizon testing. For comparison, we consider a fixed-horizon version of SUBTLE, which leverages Theorem III.1 and rejects the null hypothesis $\mbox{H}_{0}:\Delta=0$ when $R_{k}>Z_{\alpha}$ for some predetermined $k$, where $Z_{\alpha}$ denotes the $(1-\alpha)$ quantile of the standard normal distribution. Assuming $\sigma^{-1}=\lim_{k\rightarrow\infty}k^{-1}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}$, the required number of batches can be calculated as $k=\sigma^{2}(Z_{\alpha}+Z_{1-\text{power}})^{2}/\Delta^{2}$, and thus the required sample size is $n=k\times m+l$. The true value difference $\Delta$ can be directly estimated from data generated under the true model with the two treatment rules, while $\sigma^{2}$ is estimated by the sample variance of $\hat{\Delta}_{k'}$ times $k'$ for some fixed large $k'$. Here, we choose Model V with $c=1$ and batch size $m=20$. The stopping sample size of the sequential SUBTLE over 1000 replicates is shown in Fig. 2, and the dashed vertical line indicates the required sample size for the fixed-horizon SUBTLE with the same power of 0.997 (as seen in Table II) under the same setting. Most of the time the sequential SUBTLE reaches a decision earlier than the fixed-horizon version, although occasionally it takes longer. The distribution of the stopping time for the sequential SUBTLE is right-skewed, in line with the findings in [12] and [2].
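The fixed-horizon sample size calculation can be sketched as below; the numeric inputs are illustrative placeholders, not values from the experiments.

```python
import math
from scipy.stats import norm

def fixed_horizon_n(delta, sigma, alpha=0.05, power=0.997, m=20, l=300):
    """Required sample size n = k*m + l of the fixed-horizon SUBTLE, with
    k = sigma^2 * (Z_alpha + Z_{1-power})^2 / Delta^2. Here Z_x denotes the
    (1-x) quantile of N(0,1), so Z_{1-power} = norm.ppf(power)."""
    k = sigma**2 * (norm.ppf(1 - alpha) + norm.ppf(power))**2 / delta**2
    return math.ceil(k) * m + l

# Illustrative only; sigma and delta would be estimated as described above.
print(fixed_horizon_n(delta=0.1, sigma=0.5))
```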


Figure 2: Histogram of stopping time of the sequential SUBTLE with data generated from Model V

V Real Data Experiments

We use the Yahoo dataset, which contains user click events on articles over 10 days, to examine the performance of SUBTLE. Each event has a timestamp, a unique article id (variant), a binary click indicator (response), and four independent user features (covariates). We choose the two articles (id 109520 and 109510) with the highest click-through rates as control and treatment, respectively. We set the significance level $\alpha=0.05$, initial batch size and batch size $l=m=200$, and failure time $M=50000$.

To demonstrate the false-positive control of our method, we conduct an A/A test and a permutation test using SUBTLE. For the A/A test, we only use data from article 109510 and randomly generate fake treatment indicators. Our method accepts the null hypothesis. For the permutation test, we use combined data from articles 109510 and 109520 and permute their response 1000 times while leaving the treatment indicator and covariates unchanged. The estimated false-positive rate is below the significance level.

Then we test whether there is any subgroup of users who would have a higher click-through rate on article 109510. In this experiment, SUBTLE rejects the null hypothesis with a sample size of 12400. We identify the beneficial subgroup $1\{\theta(\mathbf{X})>0\}$ by estimating $\hat{\theta}(\mathbf{X})$ with a random forest on the first 12400 observations. To get a structured optimal treatment rule, we then build a classification tree on the same 12400 samples with the random forest labels $1\{\hat{\theta}(\mathbf{X})>0\}$ as the targets. The resulting decision tree (Fig. 3) suggests that users in the subgroup defined by $\{X_{3}<0.7094\;\text{or}\;(X_{3}\geq 0.7094,\,X_{1}\geq 0.0318\;\text{and}\;X_{4}<0.0003)\}$ have a higher click-through rate on article 109510 than on article 109520.
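A sketch of this two-stage subgroup-extraction step (random forest labels distilled into a shallow classification tree) is given below; the placeholder data and hyperparameters are our own assumptions, not the exact pipeline behind Fig. 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder stand-ins for the first 12400 observations (4 covariates,
# treatment indicator a, click indicator y).
rng = np.random.default_rng(0)
X = rng.normal(size=(12400, 4))
a = rng.binomial(1, 0.5, 12400)
y = rng.binomial(1, 0.05 + 0.02 * (X[:, 2] < 0.7) * a)

# theta_hat(x) as the difference of treatment- and control-arm forests
f1 = RandomForestClassifier(n_estimators=200).fit(X[a == 1], y[a == 1])
f0 = RandomForestClassifier(n_estimators=200).fit(X[a == 0], y[a == 0])
theta_hat = f1.predict_proba(X)[:, 1] - f0.predict_proba(X)[:, 1]

# Distill the rule 1{theta_hat > 0} into a shallow, interpretable tree
tree = DecisionTreeClassifier(max_depth=3).fit(X, (theta_hat > 0).astype(int))
print(export_text(tree, feature_names=["X1", "X2", "X3", "X4"]))
```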


Figure 3: An optimal treatment rule in the Yahoo dataset by a decision tree

We then use the 50000 samples after the first 12400 as test data and compute the difference in click-through rates between articles 109510 and 109520 on the test data (the overall treatment effect), and the same difference within the identified subgroup of the test data (the subgroup treatment effect). The subgroup treatment effect of 0.009 is larger than the overall treatment effect of 0.006, which shows that the identified subgroup has an enhanced treatment effect relative to the overall population.

We further compute the inverse probability weighted (IPW) estimator $n^{-1}\sum_{i=1}^{n}[Y_{i}\cdot 1\{A_{i}=d(\mathbf{X}_{i})\}/p_{A_{i}}(\mathbf{X}_{i})]$ of the values of two treatment rules on the test data: the rule $d_{1}(\mathbf{X})=0$ that assigns everyone to article 109520, and the optimal treatment rule $d_{2}(\mathbf{X})=1\{\hat{\theta}(\mathbf{X})>0\}$ estimated by random forest. Their IPW estimates are 0.043 and 0.049, respectively, which suggests that the estimated optimal treatment rule is better, in terms of click-through rate, than the fixed rule that assigns all users to article 109520. It also implies that there exists a subgroup of the population with a higher click-through rate on article 109510 than on article 109520.
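For completeness, the IPW value estimate used here is essentially a one-line computation; a hedged sketch, assuming a constant randomization probability:

```python
import numpy as np

def ipw_value(y, a, d, p1):
    """IPW estimate of V(d): mean of Y * 1{A = d(X)} / p_A(X), where d is
    the rule's 0/1 assignment for each user and p1 = P(A=1)."""
    p_a = np.where(a == 1, p1, 1 - p1)
    return float(np.mean(y * (a == d) / p_a))

# e.g. value of "everyone to control" vs. the estimated optimal rule:
# ipw_value(y, a, np.zeros_like(a), 0.5) and ipw_value(y, a, d_opt_hat, 0.5)
```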

VI Conclusion

In this paper, we propose SUBTLE, which sequentially tests whether some subgroup of the population will benefit from the investigative treatment. If the null hypothesis is rejected, a beneficial subgroup can be easily identified based on the estimated optimal treatment rule. The validity of the test is supported by both theoretical and simulation results. The experiments also show that SUBTLE has high detection power, especially under high-dimensional settings, is robust to noise covariates, and allows quicker inference most of the time compared with fixed-horizon testing.

As with mSPRT and SST, the rejection threshold of SUBTLE may never be reached in some cases, especially when the true effect size is negligible. Thus, a failure time is needed to terminate the test externally and accept the null hypothesis if it is reached. How to choose a failure time that balances waiting time and power needs to be studied in the future. Another future direction is the application of our test under adaptive allocation, where users have higher probabilities of being assigned to a beneficial variant based on previous observations. However, validity may no longer be guaranteed under adaptive allocation, and more theoretical investigation is needed.

References

  • [1] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, “Controlled experiments on the web: survey and practical guide,” Data Mining and Knowledge Discovery, vol. 18, no. 1, pp. 140–181, 2009.
  • [2] N. Ju, D. Hu, A. Henderson, and L. Hong, “A sequential test for selecting the better variant: Online a/b testing, adaptive allocation, and continuous monitoring,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining.   ACM, 2019, pp. 492–500.
  • [3] J. P. Simmons, L. D. Nelson, and U. Simonsohn, “False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science, vol. 22, no. 11, pp. 1359–1366, 2011.
  • [4] M. Goodson. (2014) Most winning a/b test results are illusory. [Online]. Available: http://www.datascienceassn.org/sites/default/files/Most%20Winning%20A-B%20Test%20Results%20are%20Illusory.pdf
  • [5] A. Wald, “Sequential tests of statistical hypotheses,” The Annals of Mathematical Statistics, vol. 16, no. 2, pp. 117–186, 1945.
  • [6] G. Schwarz, “Asymptotic shapes of bayes sequential testing regions,” The Annals of Mathematical Statistics, vol. 33, no. 1, pp. 224–236, 1962.
  • [7] D. R. Cox, “Large sample sequential tests for composite hypotheses,” Sankhyā: The Indian Journal of Statistics, Series A, vol. 25, no. 1, pp. 5–12, 1963.
  • [8] P. Armitage, C. McPherson, and B. Rowe, “Repeated significance tests on accumulating data,” Journal of the Royal Statistical Society: Series A (General), vol. 132, no. 2, pp. 235–244, 1969.
  • [9] H. Robbins, “Statistical methods related to the law of the iterated logarithm,” The Annals of Mathematical Statistics, vol. 41, no. 5, pp. 1397–1409, 1970.
  • [10] T. L. Lai, “Nearly optimal sequential tests of composite hypotheses,” The Annals of Statistics, vol. 16, no. 2, pp. 856–886, 1988.
  • [11] ——, “Sequential analysis: some classical problems and new challenges,” Statistica Sinica, vol. 11, no. 2, pp. 303–351, 2001.
  • [12] R. Johari, L. Pekelis, and D. J. Walsh, “Always valid inference: Bringing sequential analysis to a/b testing,” arXiv preprint arXiv:1512.04922, 2015.
  • [13] M. Yu, W. Lu, and R. Song, “A new framework for online testing of heterogeneous treatment effect,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 06, 2020, pp. 10 310–10 317.
  • [14] H. Robbins and D. Siegmund, “The expected sample size of some tests of power one,” The Annals of Statistics, vol. 2, no. 3, pp. 415–436, 1974.
  • [15] M. Pollak, “Optimality and almost optimality of mixture stopping rules,” The Annals of Statistics, vol. 6, no. 4, pp. 910–916, 1978.
  • [16] R. Johari, P. Koomen, L. Pekelis, and D. Walsh, “Peeking at a/b tests: Why it matters, and what to do about it,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2017, pp. 1517–1525.
  • [17] J. M. Robins, A. Rotnitzky, and L. P. Zhao, “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American Statistical Association, vol. 89, no. 427, pp. 846–866, 1994.
  • [18] B. Zhang, A. A. Tsiatis, E. B. Laber, and M. Davidian, “A robust method for estimating optimal treatment regimes,” Biometrics, vol. 68, no. 4, pp. 1010–1018, 2012.
  • [19] A. R. Luedtke and M. J. Van Der Laan, “Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy,” The Annals of Statistics, vol. 44, no. 2, pp. 713–742, 2016.
  • [20] S. Wager and S. Athey, “Estimation and inference of heterogeneous treatment effects using random forests,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 1228–1242, 2018.

Appendices

A Proof of Theorem III.1

Let $\mathcal{F}_{j}$, $0\leq j\leq k$, denote the filtration generated by observations in the first $(j+1)$ batches $\overline{\mathcal{C}}_{j}=\cup_{r=0}^{j}\mathcal{C}_{r}$, and let $\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)$ denote the AIPW estimator of $\Delta$ with only the optimal decision rule estimated from previous batches:

\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p):=\frac{1}{m}\sum_{\mathbf{O}_{i}\in\mathcal{C}_{j}}\bigg(\bigg[\frac{Y_{i}\cdot 1\left(A_{i}=1\{\hat{\theta}_{j-1}(\mathbf{X}_{i})>0\}\right)}{p_{A_{i}}(\mathbf{X}_{i})}-\left\{\frac{1\left(A_{i}=1\{\hat{\theta}_{j-1}(\mathbf{X}_{i})>0\}\right)}{p_{A_{i}}(\mathbf{X}_{i})}-1\right\}\times g^{-1}\left(\mu(\mathbf{X}_{i})+\theta(\mathbf{X}_{i})1\{\hat{\theta}_{j-1}(\mathbf{X}_{i})>0\}\right)\bigg]
\qquad-\left[\frac{Y_{i}\cdot 1\left(A_{i}=0\right)}{1-p(\mathbf{X}_{i})}-\left\{\frac{1(A_{i}=0)}{1-p(\mathbf{X}_{i})}-1\right\}\times g^{-1}\left(\mu(\mathbf{X}_{i})\right)\right]\bigg).

Then

\frac{1}{k}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)(\hat{\Delta}_{k}-\Delta)
=\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left\{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})-\Delta\right\}
=\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left(\left[\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})-\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)|\overline{\mathcal{C}}_{j-1}\right\}\right]+\left[\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)|\overline{\mathcal{C}}_{j-1}\right\}-\Delta\right]\right)
=\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left[\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})-\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\hat{d}^{opt}_{j-1},\mu,\theta,p)|\overline{\mathcal{C}}_{j-1}\right\}\right]+o_{P}(k^{-1/2}) \qquad (11)
=\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\left[\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})-\mathbb{E}\left\{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})|\mathcal{F}_{j-1}\right\}\right]+o_{P}(k^{-1/2}). \qquad (12)

Above, (11) follows by condition (C5) and (12) follows by condition (C4). For $j=1,2,\cdots,k$, let

M_{k,j}=\frac{1}{\sqrt{k}}\cdot\frac{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})-\mathbb{E}\{\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})|\mathcal{F}_{j-1}\}}{\hat{\sigma}_{j}}.

Clearly, for each $k$, $M_{k,j}$, $1\leq j\leq k$, is a martingale difference sequence with respect to the filtration $\mathcal{F}_{j}$. In particular, for all $j\geq 1$, $\mathbb{E}(M_{k,j}|\mathcal{F}_{j-1})=0$ and $\sum_{j=1}^{k}\mathbb{E}(M_{k,j}^{2}|\mathcal{F}_{j-1})=k^{-1}\sum_{j=1}^{k}\sigma_{j}^{2}/\hat{\sigma}_{j}^{2}\xrightarrow{p}1$ as $k\rightarrow\infty$ by (C3). The conditional Lindeberg condition holds by (C2), so the martingale central limit theorem for triangular arrays gives

\sum_{j=1}^{k}M_{k,j}\xrightarrow{d}N(0,1).

Plugging it back into (12), we can get

\frac{1}{\sqrt{k}}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\left(\hat{\Delta}_{k}-\Delta\right)\xrightarrow{d}N(0,1). \qquad (13)

B Proof of Proposition 1

We first simplify the formula of $\lambda_{k}$ to:

\lambda_{k}=\frac{\psi_{\left(\frac{1}{\sqrt{k}}\left(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta,\,1\right)}(R_{k})}{\psi_{\left(0,\,1\right)}(R_{k})}
=\exp\left\{\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Delta\cdot R_{k}-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}\right\}
=\exp\left\{\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Delta\sum_{j=1}^{k}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}\right\},

where we write $\bar{D}_{j}$ for $\bar{D}_{j}(\mathcal{C}_{j};\overline{\mathcal{C}}_{j-1})$ for simplicity. By the delta method, we have

\mathbb{E}_{H_{0}}(\lambda_{k+1}|\mathcal{F}_{k})=\mathbb{E}_{H_{0}}\left[\exp\left\{\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Delta\sum_{j=1}^{k+1}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}\right\}\bigg|\mathcal{F}_{k}\right]
\approx\exp\left[\mathbb{E}_{H_{0}}\left\{\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Delta\sum_{j=1}^{k+1}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}\bigg|\mathcal{F}_{k}\right\}\right].

Then the expectation term can be calculated as

\mathbb{E}_{H_{0}}\left\{\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Delta\sum_{j=1}^{k+1}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}\bigg|\mathcal{F}_{k}\right\}
=\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Delta\left\{\sum_{j=1}^{k}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})+\hat{\sigma}^{-1}_{k+1}\cdot\underbrace{\mathbb{E}_{H_{0}}(\bar{D}_{k+1}|\mathcal{F}_{k})}_{0}\right\}-\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}
=\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Delta\sum_{j=1}^{k}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\Delta^{2}
\qquad+\left(\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}-\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta\sum_{j=1}^{k}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})
\qquad-\left\{\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\right\}\Delta^{2}.

Since the true value difference $\Delta$ is not very large in practice, we assume the local alternative $\Delta=O_{P}(k^{-1/2})$ here, as in Theorem III.1. Therefore,

\mathbb{E}_{H_{0}}(\lambda_{k+1}|\mathcal{F}_{k})
\approx\lambda_{k}\cdot\exp\left[\left(\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}-\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta\sum_{j=1}^{k}(\hat{\sigma}^{-1}_{j}\bar{D}_{j})-\left\{\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\right\}\Delta^{2}\right]
\overset{(10)}{=}\lambda_{k}\cdot\exp\left\{\left(\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}-\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)\Delta\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\hat{\Delta}_{k}-\left\{\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\right\}\Delta^{2}\right\}
=\lambda_{k}\cdot\exp\left\{\underbrace{\left(\frac{1}{k+1}\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}-\frac{1}{k}\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\right)}_{O_{P}(k^{-1})}\cdot\underbrace{\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\hat{\Delta}_{k}}_{O_{P}(k^{1/2})\text{ by }(13)}\cdot\,\Delta-\underbrace{\left\{\frac{1}{2(k+1)}\Big(\sum_{j=1}^{k+1}\hat{\sigma}_{j}^{-1}\Big)^{2}-\frac{1}{2k}\Big(\sum_{j=1}^{k}\hat{\sigma}_{j}^{-1}\Big)^{2}\right\}}_{O_{P}(1)}\cdot\,\Delta^{2}\right\}
=\lambda_{k}\cdot\exp\{o_{P}(1)\}.