
How to sample and when to stop sampling: The generalized Wald problem and minimax policies

Karun Adusumilli
Abstract.

The aim of this paper is to develop techniques for incorporating the cost of information into experimental design. Specifically, we study sequential experiments where sampling is costly and a decision-maker aims to determine the best treatment for full scale implementation by (1) adaptively allocating units to two possible treatments, and (2) stopping the experiment when the expected welfare (inclusive of sampling costs) from implementing the chosen treatment is maximized. Working under the diffusion limit, we describe the optimal policies under the minimax regret criterion. Under small cost asymptotics, the same policies are also optimal under parametric and non-parametric distributions of outcomes. The minimax optimal sampling rule is just the Neyman allocation; it is independent of sampling costs and does not adapt to previous outcomes. The decision-maker stops sampling when the average difference between the treatment outcomes, multiplied by the number of observations collected until that point, exceeds a specific threshold. The results derived here also apply to best arm identification with two arms.

The paper subsumes an unpublished note previously circulated as “Neyman allocation is minimax optimal for best arm identification with two arms” on arXiv at the following link: https://arxiv.org/abs/2204.05527.
I would like to thank Tim Armstrong, Federico Bugni, David Childers, Pepe Montiel-Olea, Chao Qin, Azeem Shaikh, Tim Vogelsang and seminar participants at various universities and conferences for helpful comments.
Department of Economics, University of Pennsylvania

1. Introduction

Acquiring information is expensive. Experimenters need to carefully choose how many units of each treatment to sample and when to stop sampling. This paper seeks to develop techniques for incorporating the cost of information into experimental design. Specifically, we focus our analysis of costly experimentation within the context of comparative trials where the aim is to determine the best of two treatments.

In the computer science literature, such experiments are referred to as A/B tests. Technology companies like Amazon, Google and Microsoft routinely run hundreds of A/B tests a week to evaluate product changes, such as a tweak to a website layout or an update to a search algorithm. However, experimentation is expensive, especially if the changes being tested are very small and require evaluation on large amounts of data; e.g., Deng et al. (2013) state that even hundreds of millions of users were considered insufficient at Google to detect the treatment effects they were interested in. Clinical or randomized trials are another example of A/B tests. Even here, reducing experimentation costs is a key goal. For instance, this has been a major objective for the FDA since 2004 when it introduced the ‘Critical Path Initiative’ for streamlining drug development; this in turn led the FDA to promote sequential designs in clinical trials (see, e.g., US Food and Drug Admin., 2018, for the current guidance, which was influenced by the need to reduce experimentation costs). For this reason, many of the recent clinical trials, such as the ones used to test the effectiveness of Covid vaccines (e.g., Zaks, 2020), now use multi-stage designs where the experiment can be terminated early if a particularly positive or negative effect is seen in early stages.

In practice, the cost of experimentation directly or indirectly enters the researchers’ experimental design when they choose an implicit or explicit stopping time (note that we use stopping time interchangeably with the number of observations in the experiment). For instance, in testing the efficacy of vaccines, experimenters stop after a pre-determined number of infections. In other cases, a power analysis may be used to determine sample size before the start of the experiment. But if the aim is to maximize welfare (or profits), neither of these procedures is optimal. [Footnote 1: See, e.g., Manski and Tetenov (2016) for a critique of the common use of power analysis for determining the sample size in randomized control trials.]

In this paper, we develop optimal experimentation designs that maximize social welfare (or profits) while also taking into account the cost of information. In particular, we study optimal sampling and stopping rules in sequential experiments where sampling is costly and the decision maker (DM) aims to determine the best of two treatments by: (1) adaptively allocating units to one of these treatments, and (2) stopping the experiment when the expected welfare, inclusive of sampling costs, is maximized. We term this the generalized Wald problem, and use minimax regret (Manski, 2021), a natural choice criterion under ambiguity aversion, to determine the optimal decision rule. [Footnote 2: We do not consider the minimax risk criterion as it leads to a trivial decision: the DM should never experiment and always apply the status quo treatment.]

We first derive the optimal decision rule in continuous time, under the diffusion regime (Wager and Xu, 2021; Fan and Glynn, 2021). Then, we show that analogues of this decision rule are also asymptotically optimal under parametric and non-parametric distributions of outcomes. The asymptotics, which appear to be novel, involve taking the marginal cost of experimentation to 0 at a specific rate. Section 4 delves into the rationale behind these ‘small cost asymptotics’, and argues that they are practically quite relevant. It is important to clarify here that ‘small costs’ need not literally imply the monetary costs of experimentation are close to 0. Rather, it denotes that these costs are small compared to the benefit of choosing the best treatment for full-scale implementation.

The optimal decision rule has a number of interesting, and perhaps, surprising properties. First, the optimal sampling rule is history independent and also independent of sampling costs. In fact, it is just the Neyman allocation, which is well known in the RCT literature as the (fixed) sampling strategy that minimizes estimation variance; our results state that one cannot better this even when allowing for adaptive strategies. Second, it is optimal to stop when the difference in average outcomes between the treatments, multiplied by the number of observations collected up to that point, exceeds a specific threshold. The threshold depends on sampling costs and the standard deviation of the treatment outcomes. Finally, at the conclusion of the experiment, the DM chooses the treatment with the highest average outcomes. The decision rule therefore has a simple form that makes it attractive for applications.

Our results also apply to the best arm identification problem with two arms. [Footnote 3: The results for best arm identification were previously circulated in an unpublished note by the author, accessible from arXiv at https://arxiv.org/abs/2204.05527. The current paper subsumes these results.] Best arm identification shares the same aim of determining the best treatment but the number of observations is now exogenously specified, even as the sampling strategy is allowed to be adaptive. Despite this difference, we find Neyman allocation to be the minimax-regret optimal sampling rule in this context as well. However, by not stopping adaptively, we lose on experimentation costs. Compared to best arm identification, we show that the use of an optimal stopping time allows us to attain the same regret, exclusive of sampling costs, with 40% fewer observations on average (under the least favorable prior); this is independent of model parameters such as sampling costs and outcome variances.

For the most part, this paper focuses on constant sampling costs (i.e., constant per observation). This has been a standard assumption since the classic work of Wald (1947), see also Arrow et al. (1949) and Fudenberg et al. (2018), among others. In fact, many online marketplaces for running experiments, e.g., Amazon Mechanical Turk, charge a fixed cost per query/observation. Note also that the costs may be indirect: for online platforms like Google or Microsoft that routinely run thousands of A/B tests, these could correspond to how much experimentation hurts user experience. Still, one may wonder whether and how our results change under other cost functions and modeling choices, e.g., when data is collected in batches, or, when we measure regret in terms of nonlinear or quantile welfare. We assess this in Section 6. Almost all our results still go through under these variations. We also identify a broader class of cost functions, nesting the constant case, in which the form of the optimal decision stays the same.

1.1. Related literature

The question of when to stop sampling has a rich history in economics and statistics. It was first studied by Wald (1947) and Arrow et al. (1949) with the goal being hypothesis testing, specifically, optimizing the trade-off between type I and type II errors, instead of welfare maximization. Still, one can place these results into the present framework by imagining that the distributions of outcomes under both treatments are known, but it is unknown which distribution corresponds to which treatment. This paper generalizes these results by allowing the distributions to be unknown. For this reason, we term the question studied here the generalized Wald problem.

Chernoff (1959) studied the sequential hypothesis testing problem under multiple hypotheses, using large deviation methods. The asymptotics there involve taking the sampling costs to 0, even as there is a fixed reward gap between the treatments. More recently, the stopping rules of Chernoff (1959) were incorporated into the $\delta$-PAC (Probably Approximately Correct) algorithms devised by Garivier and Kaufmann (2016) and Qin et al. (2017) for best arm identification with a fixed confidence. The aim in these studies is to minimize the amount of time needed to attain a pre-specified probability, $1-\delta$, of selecting the optimal arm. However, these algorithms do not directly minimize a welfare criterion, and the constraint of pre-specifying a $\delta$ could be misplaced, if, e.g., there is very little difference between the first and second best treatments. In fact, under the least favorable prior, our minimax decision rule mis-identifies the best treatment about 23% of the time. Qin and Russo (2022) study the costly sampling problem under fixed reward gap asymptotics using large deviation methods. The present paper differs in using local asymptotics and in appealing to a minimax regret criterion. However, unlike the papers cited above, we only study binary treatments.

A number of papers (Colton, 1963; Lai et al., 1980; Chernoff and Petkau, 1981) have studied sequential trials in which there is a population of $N$ units, and at each period, the DM randomly selects two individuals from this population, and assigns them to the two treatments. The DM is allowed to stop experimenting at any point and apply a single treatment on the remainder of the population. The setup in these papers is intermediate between our own and two-armed bandits: while the aim, as in here, is to minimize regret, acquiring samples is not by itself expensive and the outcomes in the experimentation phase matter for welfare. This literature also does not consider optimal sampling rules.

The paper is also closely related to the growing literature on information acquisition and design, see, Hébert and Woodford (2017); Fudenberg et al. (2018); Morris and Strack (2019); Liang et al. (2022), among others. Fudenberg et al. (2018) study the question of optimal stopping when there are two treatments and the goal is to maximize Bayes welfare (which is equivalent to minimizing Bayes regret) under normal priors and costly sampling. While the sampling rule in Fudenberg et al. (2018) is exogenously specified, Liang et al. (2022) study a more general version of this problem that allows for selecting this. In fact, for constant sampling costs, the setup in Liang et al. (2022) is similar to ours but the welfare criterion is different. The authors study a Bayesian version of the problem with normal priors, with the resulting decision rules having very different qualitative and quantitative properties from ours; see Section 3.2 for a detailed comparison. These differences arise because the minimax regret criterion corresponds to a least favorable prior with a specific two-point support. Thus, our results highlight the important role played by the prior in determining even the qualitative properties of the optimal decisions. This motivates the need for robust decision rules, and the minimax regret criterion provides one way to obtain them.

Our results also speak to the literature on drift-diffusion models (DDMs), which are widely used in neuroscience and psychology to study choice processes (Luce et al., 1986; Ratcliff and McKoon, 2008; Fehr and Rangel, 2011). DDMs are based on the classic binary state hypothesis testing problem of Wald (1947). Fudenberg et al. (2018) extend this model to allow for continuous states, using Gaussian priors, and show that the resulting optimal decision rules are very different, even qualitatively, from the predictions of DDM. In this paper, we show that if the DM is ambiguity averse and uses the minimax regret criterion, then the predictions of the DDM model are recovered even under continuous states. In other words, decision making under ignorance brings us back to DDM.

Finally, the results in this paper are unique in regards to all the above strands of literature in showing that any discrete time parametric and non-parametric version of the problem can be reduced to the diffusion limit under small cost asymptotics. Diffusion asymptotics were introduced by Wager and Xu (2021) and Fan and Glynn (2021) to study the properties of Thompson sampling in bandit experiments. The techniques for showing asymptotic equivalence to the limit experiment build on, and extend, previous work on sequential experiments by Adusumilli (2021). Relative to that paper, the novelty here is two-fold: first, we derive a sharp characterization of the minimax optimal decision rule for the Wald problem. Second, we introduce ‘small cost asymptotics’ that may be of independent interest in other, related problems where there is a ‘local-to-zero’ cost of continuing an experiment.

2. Setup under incremental learning

Following Fudenberg et al. (2018) and Liang et al. (2022), we start by describing the problem under a stylized setting where time is continuous and information arrives gradually in the form of Gaussian increments. In statistics and econometrics, this framework is also known as diffusion asymptotics (Adusumilli, 2021; Wager and Xu, 2021; Fan and Glynn, 2021). The benefit of the continuous time analysis is that it enables us to provide a sharp characterization of the minimax optimal decision rule; this is otherwise obscured by the discrete nature of the observations in a standard analysis. Section 4 describes how these asymptotics naturally arise under a limit of experiments perspective when we employ $n^{-1/2}$ scaling for the treatment effect.

The setup is as follows. There are two treatments $0,1$ corresponding to unknown mean rewards $\bm{\mu}:=(\mu_{1},\mu_{0})$ and known variances $\sigma_{1}^{2},\sigma_{0}^{2}$. The aim of the decision maker (DM) is to determine which treatment to implement on the population. To guide her choice, the DM is allowed to conduct a sequential experiment, while paying a flow cost $c$ as long as the experiment is in progress. At each moment in time, the DM chooses which treatment to sample according to the sampling rule $\pi_{a}(t)\equiv\pi(A=a|\mathcal{F}_{t})$, $a\in\{0,1\}$, which specifies the probability of selecting treatment $a$ given some filtration $\mathcal{F}_{t}$. The DM then observes signals, $x_{1}(t),x_{0}(t)$ from each of the treatments, as well as the fraction of times, $q_{1}(t),q_{0}(t)$ each treatment was sampled so far:

\begin{align}
dx_{a}(t) &= \mu_{a}\pi_{a}(t)dt+\sigma_{a}\sqrt{\pi_{a}(t)}dW_{a}(t), \tag{2.1}\\
dq_{a}(t) &= \pi_{a}(t)dt. \tag{2.2}
\end{align}

Here, $W_{1}(t),W_{0}(t)$ are independent one-dimensional Wiener processes. The experiment ends in accordance with an $\mathcal{F}_{t}$-adapted stopping time, $\tau$. At the conclusion of the experiment, the DM chooses an $\mathcal{F}_{\tau}$-measurable implementation rule, $\delta\in\{0,1\}$, specifying which treatment to implement on the population. The DM’s decision thus consists of the triple $\bm{d}:=(\pi,\tau,\delta)$.
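For intuition, the signal dynamics (2.1) and (2.2) can be approximated on a time grid. The following is a minimal sketch, assuming an Euler-Maruyama discretization and a sampling rule that is constant over time; the function name `simulate_signals`, the step size `dt` and the convention of indexing tuples by the treatment label `a` (so `mu[0]` is $\mu_0$) are illustrative choices, not part of the paper.

```python
import math
import random

def simulate_signals(mu, sigma, pi, horizon=1.0, dt=1e-3, seed=0):
    """Euler-Maruyama approximation of (2.1)-(2.2) for a constant sampling rule.

    mu, sigma, pi are indexed by treatment label a in {0, 1}
    (an illustrative convention), with pi[0] + pi[1] = 1.
    Returns the cumulative signals x and sampled fractions q at `horizon`.
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]
    q = [0.0, 0.0]
    steps = int(round(horizon / dt))
    for _ in range(steps):
        for a in (0, 1):
            # Increment of the Wiener process W_a over a step of length dt
            dW = rng.gauss(0.0, math.sqrt(dt))
            # (2.1): drift mu_a * pi_a dt, diffusion sigma_a * sqrt(pi_a) dW_a
            x[a] += mu[a] * pi[a] * dt + sigma[a] * math.sqrt(pi[a]) * dW
            # (2.2): fraction of time allocated to treatment a
            q[a] += pi[a] * dt
    return x, q
```

Setting `sigma` to zero removes the noise, in which case each signal reduces to `mu[a] * q[a]`, which is a quick sanity check on the drift term.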

Denote $s(t)=(x_{1}(t),x_{0}(t),q_{1}(t),q_{0}(t))$ and take $\mathcal{F}_{t}\equiv\sigma\{s(u);u\leq t\}$ to be the filtration generated by the state variables $s(\cdot)$ until time $t$. [Footnote 4: As in Liang et al. (2022), we restrict attention to sampling rules $\pi_{a}$ for which a weak solution to the functional SDEs (2.1), (2.2) exists. This is true if either $\pi_{a}:\left\{s(z)\right\}_{z\leq t}\to[0,1]$ is continuous, see Karatzas and Shreve (2012, Section 5.4), or, if it is any deterministic function of $t$.] Let $\mathbb{E}_{\bm{d}|\bm{\mu}}[\cdot]$ denote the expectation under a decision rule $\bm{d}$, given some value of $\bm{\mu}$. We evaluate decision rules under the minimax regret criterion, where the maximum regret is defined as

\begin{align}
V_{\max}(\bm{d}) &= \max_{\bm{\mu}}V\left(\bm{d},\bm{\mu}\right),\ \textrm{with} \notag\\
V\left(\bm{d},\bm{\mu}\right) &:= \mathbb{E}_{\bm{d}|\bm{\mu}}\left[\max\{\mu_{1}-\mu_{0},0\}-(\mu_{1}-\mu_{0})\delta+c\tau\right]. \tag{2.3}
\end{align}

We refer to $V(\bm{d},\bm{\mu})$ as the frequentist regret, i.e., the expected regret of $\bm{d}$ given $\bm{\mu}$. Recall that regret is the difference in utilities, $\mu_{0}+(\mu_{1}-\mu_{0})\delta-c\tau$, generated by the oracle decision rule $\{\tau=0,\delta=\mathbb{I}\{\mu_{1}>\mu_{0}\}\}$, and a given decision rule $\bm{d}$.
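As a small illustration of (2.3), the quantity inside the expectation can be evaluated for a single realization of the experiment. This is only a sketch; the function name `realized_regret` and its argument order are hypothetical.

```python
def realized_regret(mu1, mu0, delta, tau, c):
    """Integrand of (2.3) for one realization of the experiment.

    delta in {0, 1} is the implemented treatment, tau the realized stopping
    time, and c the flow cost of sampling.
    """
    # Utility of the oracle rule {tau = 0, delta = 1{mu1 > mu0}}
    oracle_utility = mu0 + max(mu1 - mu0, 0.0)
    # Utility of the rule actually used
    achieved_utility = mu0 + (mu1 - mu0) * delta - c * tau
    return oracle_utility - achieved_utility
```

Averaging this quantity over repeated simulated experiments at a fixed $\bm{\mu}$ would give a Monte Carlo estimate of the frequentist regret.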

2.1. Best arm identification

The best arm identification problem is a special case of the generalized Wald problem where the stopping time is fixed beforehand and set to $\tau=1$ without loss of generality. This is equivalent to choosing the number of observations before the start of the experiment; in fact, we show in Section 4 that a unit time interval corresponds to $n$ observations (the precise definition of $n$ is also given there). Thus, decisions now consist only of $\bm{d}=(\pi,\delta)$, but $\pi$ is still allowed to be adaptive. If we further restrict $\pi$ to be fixed (i.e., non-adaptive), we get back to the typical setting of Randomized Control Trials (RCTs).

Despite these differences, we show in Section 3 that the minimax-regret optimal sampling and implementation rules are the same in all cases; the optimal sampling rule is the Neyman allocation $\pi_{a}^{*}=\sigma_{a}/(\sigma_{1}+\sigma_{0})$, while the optimal implementation rule is to choose the treatment with the higher average outcomes. Somewhat surprisingly, then, there is no difference in the optimal strategy between best arm identification and standard RCTs (under minimax regret). The presence of $\tau$, however, makes the generalized Wald problem fundamentally different from the other two. We provide a relative comparison of the benefit of optimal stopping in Section 3.3.
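The Neyman allocation is simple to compute from the two standard deviations. A minimal sketch follows; the function name `neyman_allocation` is an illustrative assumption.

```python
def neyman_allocation(sigma1, sigma0):
    """Return (pi_1, pi_0) with pi_a = sigma_a / (sigma_1 + sigma_0)."""
    total = sigma1 + sigma0
    if total <= 0:
        raise ValueError("standard deviations must be positive")
    return sigma1 / total, sigma0 / total
```

For example, with $\sigma_1 = 2$ and $\sigma_0 = 1$, treatment 1 receives two thirds of the observations, regardless of sampling costs or of the outcomes observed so far.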

2.2. Bayesian formulation

It is convenient to first describe minimal regret under a Bayesian approach. Suppose the DM places a prior $p_{0}$ on $\bm{\mu}$. Bayes regret,

$$V(\bm{d},p_{0}):=\int V\left(\bm{d},\bm{\mu}\right)dp_{0}(\bm{\mu}),$$

provides one way to evaluate the decision rules $\bm{d}$. In the next section, we characterize minimax regret as Bayes regret under a least-favorable prior.

Let $p(\bm{\mu}|s)$ denote the posterior density of $\bm{\mu}$ given state $s$. By standard results in stochastic filtering (here, and in what follows, $\propto$ denotes equality up to a normalization constant),

\begin{align*}
p(\bm{\mu}|s) &\propto p(s|\bm{\mu})\cdot p_{0}(\bm{\mu})\\
&\propto p_{q_{1}}(x_{1}|\mu_{1})\cdot p_{q_{0}}(x_{0}|\mu_{0})\cdot p_{0}(\bm{\mu});\quad p_{q_{a}}(\cdot|\mu_{a})\equiv\mathcal{N}(\cdot|q_{a}\mu_{a},q_{a}\sigma_{a}^{2})
\end{align*}

where $\mathcal{N}(\cdot|\mu,\sigma^{2})$ is the normal density with mean $\mu$ and variance $\sigma^{2}$, and the second proportionality follows from the fact that $W_{1}(\cdot),W_{0}(\cdot)$ are independent Wiener processes.

Define $V^{*}(s;p_{0})$ as the minimal expected Bayes regret, given state $s$, i.e.,

$$V^{*}(s;p_{0})=\inf_{\bm{d}\in\mathcal{D}}\mathbb{E}_{\bm{\mu}|s}\left[V\left(\bm{d},\bm{\mu}\right)\right],$$

where $\mathcal{D}$ is the set of all decision rules that satisfy the measurability conditions set out previously. In principle, one could characterize $V^{*}(\cdot;p_{0})$ as a HJB Variational Inequality (HJB-VI; Øksendal, 2003, Chapter 10), compute it numerically and characterize the optimal Bayes decision rules. However, this can be computationally expensive, and moreover, does not provide a closed form characterization of the optimal decisions. Analytical expressions can be obtained under two types of priors:

2.2.1. Gaussian priors

In this case, the posterior is also Gaussian and its mean and variance can be computed analytically. Liang et al. (2022) derive the optimal decision rule in this setting. See Section 3.2 for a comparison with our proposals.

2.2.2. Two-point priors

Two-point priors are closely related to hypothesis testing and the sequential likelihood ratio procedures of Wald (1947) and Arrow et al. (1949). More importantly for us, the least favorable prior for minimax regret, described in the next section, has a two-point support.

Suppose the prior is supported on the two points $(a_{1},b_{1}),(a_{0},b_{0})$. Let $\theta=1$ denote the state when nature chooses $(a_{1},b_{1})$, and $\theta=0$ the state when nature chooses $(a_{0},b_{0})$. Also let $(\Omega,\mathbb{P}_{\pi},\mathcal{F}_{t})$ denote the relevant probability space given a (possibly) randomized policy $\pi$, where $\mathcal{F}_{t}$ is the filtration defined previously. Set $P_{\pi}^{0},P_{\pi}^{1}$ to be the probability measures $P_{\pi}^{0}(A):=\mathbb{P}_{\pi}(A|\theta=0)$ and $P_{\pi}^{1}(A):=\mathbb{P}_{\pi}(A|\theta=1)$ for any $A\in\mathcal{F}_{t}$.

Clearly, the likelihood ratio process $\varphi^{\pi}(t):=\frac{dP_{\pi}^{1}}{dP_{\pi}^{0}}(\mathcal{F}_{t})$ is a sufficient statistic for the DM. An application of the Girsanov theorem, noting that $W_{1}(\cdot),W_{0}(\cdot)$ are independent of each other, gives (see also Shiryaev, 2007, Section 4.2.1)

\begin{align}
\ln\varphi^{\pi}(t) &= \frac{(a_{1}-a_{0})}{\sigma_{1}^{2}}x_{1}(t)+\frac{(b_{1}-b_{0})}{\sigma_{0}^{2}}x_{0}(t)-\frac{(a_{1}^{2}-a_{0}^{2})}{2\sigma_{1}^{2}}q_{1}(t)-\frac{(b_{1}^{2}-b_{0}^{2})}{2\sigma_{0}^{2}}q_{0}(t). \tag{2.4}
\end{align}

Let $m_{0}$ denote the prior probability that $\theta=1$. Additionally, given a sampling rule $\pi$, let $m^{\pi}(t)=\mathbb{P}(\theta=1|\mathcal{F}_{t})$ denote the belief process describing the posterior probability that $\theta=1$. Following Shiryaev (2007, Section 4.2.1), $m^{\pi}(t)$ can be related to $\varphi^{\pi}(t)$ as

$$m^{\pi}(t)=\frac{m_{0}\varphi^{\pi}(t)}{(1-m_{0})+m_{0}\varphi^{\pi}(t)}.$$
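The two formulas above translate directly into code. The following is a minimal sketch of (2.4) and of the belief update; the function names and argument order are hypothetical, and the belief is computed in an algebraically equivalent logistic form to avoid overflow for large likelihood ratios.

```python
import math

def log_likelihood_ratio(x1, x0, q1, q0, a1, a0, b1, b0, sigma1, sigma0):
    """Log-likelihood ratio (2.4) for the two-point prior on (a1, b1), (a0, b0)."""
    return ((a1 - a0) / sigma1**2 * x1
            + (b1 - b0) / sigma0**2 * x0
            - (a1**2 - a0**2) / (2.0 * sigma1**2) * q1
            - (b1**2 - b0**2) / (2.0 * sigma0**2) * q0)

def posterior_belief(log_phi, m0):
    """Posterior probability that theta = 1, given ln(phi) and prior m0 in (0, 1).

    Equivalent to m0*phi / ((1 - m0) + m0*phi), written via the log-odds.
    """
    log_odds = math.log(m0 / (1.0 - m0)) + log_phi
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With no data (`log_phi = 0`), the posterior belief equals the prior `m0`, as expected.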

The Bayes optimal implementation rule at the end of the experiment is

\begin{align}
\delta^{\pi,\tau} &= \mathbb{I}\left\{a_{1}m^{\pi}(\tau)+a_{0}(1-m^{\pi}(\tau))\geq b_{1}m^{\pi}(\tau)+b_{0}(1-m^{\pi}(\tau))\right\} \notag\\
&= \mathbb{I}\left\{\ln\varphi^{\pi}(\tau)\geq\ln\frac{(b_{0}-a_{0})(1-m_{0})}{(a_{1}-b_{1})m_{0}}\right\}. \tag{2.5}
\end{align}

The superscript on $\delta$ highlights that the above implementation rule is conditional on a given choice of $(\pi,\tau)$. Relatedly, the Bayes regret at the end of the experiment (from employing the optimal implementation rule) is

\begin{align}
\varpi^{\pi}(\tau):=\min\left\{(a_{1}-b_{1})m^{\pi}(\tau),(b_{0}-a_{0})(1-m^{\pi}(\tau))\right\}. \tag{2.6}
\end{align}
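Given a terminal belief, the implementation rule in the first line of (2.5) and the terminal Bayes regret (2.6) are one-line computations. The sketch below uses hypothetical function names and assumes, as (2.5) and (2.6) implicitly do, that $a_1 > b_1$ and $b_0 > a_0$ (treatment 1 is better in state $\theta=1$, treatment 0 in state $\theta=0$).

```python
def implementation_rule(m, a1, a0, b1, b0):
    """First line of (2.5): implement treatment 1 iff its posterior mean is at least as high."""
    posterior_mean_1 = a1 * m + a0 * (1.0 - m)
    posterior_mean_0 = b1 * m + b0 * (1.0 - m)
    return 1 if posterior_mean_1 >= posterior_mean_0 else 0

def terminal_bayes_regret(m, a1, a0, b1, b0):
    """Terminal Bayes regret (2.6); assumes a1 > b1 and b0 > a0."""
    return min((a1 - b1) * m, (b0 - a0) * (1.0 - m))
```

The terminal regret is largest when the two terms inside the minimum are equal, i.e., when the DM is least certain about which treatment is better.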

Hence, for a given sampling rule $\pi$, the Bayes optimal stopping time $\tau^{\pi}$ can be obtained as the solution to the optimal stopping problem

\begin{align}
\tau^{\pi}=\inf_{\tau\in\mathcal{T}}\mathbb{E}_{\pi}\left[\varpi^{\pi}(\tau)+c\tau\right], \tag{2.7}
\end{align}

where $\mathcal{T}$ is the set of all $\mathcal{F}_{t}$ measurable stopping times, and $\mathbb{E}_{\pi}[\cdot]$ denotes the expectation under the sampling rule $\pi$.

3. Minimax regret and optimal decision rules

Following Wald (1945), we characterize minimax regret as the value of a zero-sum game played between nature and the DM. Nature’s action consists of choosing a prior, $p_{0}\in\mathcal{P}$, over $\bm{\mu}$, while the DM chooses the decision rule $\bm{d}$. The minimax regret can then be written as

\begin{align}
\inf_{\bm{d}\in\mathcal{D}}V_{\max}(\bm{d})=\inf_{\bm{d}\in\mathcal{D}}\sup_{p_{0}\in\mathcal{P}}V(\bm{d},p_{0}). \tag{3.1}
\end{align}

The equilibrium action of nature is termed the least-favorable prior, and that of the DM, the minimax decision rule.

The following is the main result of this section: Denote $\gamma_{0}^{*}\approx 0.536357$, $\Delta_{0}^{*}\approx 2.19613$, $\eta:=\left(\frac{2c}{\sigma_{1}+\sigma_{0}}\right)^{1/3}$, $\gamma^{*}=\gamma_{0}^{*}/\eta$ and $\Delta^{*}=\eta\Delta_{0}^{*}$.

Theorem 1.

The zero-sum two player game (3.1) has a Nash equilibrium with a unique minimax-regret value. The minimax-regret optimal decision rule is $\bm{d}^{*}:=(\pi^{*},\tau^{*},\delta^{*})$, where $\pi_{a}^{*}=\sigma_{a}/(\sigma_{1}+\sigma_{0})$ for $a\in\{0,1\}$,

$$\tau^{*}=\inf\left\{t:\left|\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right|\geq\gamma^{*}\right\},$$

and $\delta^{*}=\mathbb{I}\left\{\frac{x_{1}(\tau^{*})}{\sigma_{1}}-\frac{x_{0}(\tau^{*})}{\sigma_{0}}\geq 0\right\}$. Furthermore, the least favorable prior is a symmetric two-point distribution supported on $(\sigma_{1}\Delta^{*}/2,-\sigma_{0}\Delta^{*}/2),(-\sigma_{1}\Delta^{*}/2,\sigma_{0}\Delta^{*}/2)$.

Theorem 1 makes no claim as to the uniqueness of the Nash equilibrium. [Footnote 5: In fact, this would depend on the topology defined over $\mathcal{D}$ and $\mathcal{P}$.] Even if multiple equilibria were to exist, however, the value of the game $V^{*}=\inf_{\bm{d}\in\mathcal{D}}\sup_{p_{0}\in\mathcal{P}}V(\bm{d},p_{0})$ would be unique, and $\bm{d}^{*}$ would still be minimax-regret optimal.
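To make the objects in Theorem 1 concrete, the sketch below computes $\eta$, $\gamma^{*}$, $\Delta^{*}$ and the support of the least favorable prior from $(c,\sigma_1,\sigma_0)$, and evaluates the stopping and implementation rules at a given state. The numerical constants are the rounded values quoted before the theorem; all function names are illustrative assumptions.

```python
GAMMA0_STAR = 0.536357  # rounded value of gamma_0^* quoted in the text
DELTA0_STAR = 2.19613   # rounded value of Delta_0^* quoted in the text

def minimax_constants(c, sigma1, sigma0):
    """Return eta, gamma^* = gamma_0^*/eta and Delta^* = eta*Delta_0^*."""
    eta = (2.0 * c / (sigma1 + sigma0)) ** (1.0 / 3.0)
    return {"eta": eta,
            "gamma_star": GAMMA0_STAR / eta,
            "Delta_star": eta * DELTA0_STAR}

def least_favorable_support(Delta_star, sigma1, sigma0):
    """Two support points of the least favorable prior, each as (mu_1, mu_0)."""
    return ((sigma1 * Delta_star / 2.0, -sigma0 * Delta_star / 2.0),
            (-sigma1 * Delta_star / 2.0, sigma0 * Delta_star / 2.0))

def minimax_decision(x1, x0, sigma1, sigma0, gamma_star):
    """Return (stop, delta): whether the threshold is reached, and the
    treatment that would be implemented at the current state."""
    rho = x1 / sigma1 - x0 / sigma0
    stop = abs(rho) >= gamma_star
    delta = 1 if rho >= 0 else 0
    return stop, delta
```

Because $\gamma^{*}$ scales with $c^{-1/3}$, halving the sampling cost raises the stopping threshold by a factor of $2^{1/3}$ rather than doubling it.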

The optimal strategies under best-arm identification can be derived in the same manner as Theorem 1, but the proof is simpler as it does not involve a stopping rule. Let $\Phi(\cdot)$ denote the CDF of the standard normal distribution.

Corollary 1.

The minimax-regret optimal decision rule for best-arm identification is $\bm{d}_{\textrm{BAI}}^{*}:=(\pi^{*},\delta^{*})$, where $\pi^{*},\delta^{*}$ are defined in Theorem 1. The corresponding least-favorable prior is a symmetric two-point distribution supported on $(\sigma_{1}\bar{\Delta}_{0}^{*}/2,-\sigma_{0}\bar{\Delta}_{0}^{*}/2),(-\sigma_{1}\bar{\Delta}_{0}^{*}/2,\sigma_{0}\bar{\Delta}_{0}^{*}/2)$, where $\bar{\Delta}_{0}^{*}:=2\operatorname*{arg\,max}_{\delta}\delta\Phi(-\delta)$.
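The constant $\bar{\Delta}_{0}^{*}$ has no closed form but is easy to approximate numerically. The following is a minimal sketch, assuming a golden-section search over a bracketing interval (the objective $\delta\Phi(-\delta)$ has a single interior maximum on the positive half-line); the function names, the bracket $[0,5]$ and the tolerance are illustrative choices.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, expressed through the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def argmax_delta_phi(lo=0.0, hi=5.0, tol=1e-10):
    """Golden-section search for the maximizer of delta * Phi(-delta) on [lo, hi]."""
    def objective(d):
        return d * std_normal_cdf(-d)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if objective(c) > objective(d):
            b = d  # maximizer lies in [a, d]
        else:
            a = c  # maximizer lies in [c, b]
    return 0.5 * (a + b)

def delta_bar_0_star():
    """Delta-bar_0^* := 2 * argmax_delta delta * Phi(-delta)."""
    return 2.0 * argmax_delta_phi()
```

At the maximizer the first-order condition $\Phi(-\delta)=\delta\phi(\delta)$ holds, where $\phi$ is the standard normal density, which gives an independent check on the search output.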

3.1. Proof sketch of Theorem 1

We start by describing the best responses of the DM and nature to specific classes of actions on their opponents’ part. For the actions of nature, we consider the set of ‘indifference priors’ indexed by $\Delta\in\mathbb{R}$. These are two-point priors, $p_{\Delta}$, supported on $(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2),(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$ with a prior probability of $0.5$ at each support point. For the DM, we consider decision rules of the form $\tilde{\bm{d}}_{\gamma}=(\pi^{*},\tau_{\gamma},\delta^{*})$, where

$$\tau_{\gamma}:=\inf\left\{t:\left|\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right|\geq\gamma\right\};\ \gamma\in(0,\infty).$$

The DM’s response to $p_{\Delta}$.

The term ‘indifference priors’ indicates that these priors make the DM indifferent between any sampling rule $\pi$. The argument is as follows: let $\theta=1$ denote the state when $\bm{\mu}=(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$ and $\theta=0$ the state when $\bm{\mu}=(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$. Then, (2.4) implies

\begin{align}
\ln\varphi^{\pi}(t)=\Delta\left\{\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right\}. \tag{3.2}
\end{align}
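The step from (2.4) to (3.2) can be spelled out by direct substitution. The worked computation below is a sketch that takes $(a_{1},b_{1})=(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$ and $(a_{0},b_{0})=(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$, i.e., it identifies the first coordinate of each support point with $\mu_{1}$ and the second with $\mu_{0}$, as in the definition of the indifference priors.

```latex
\begin{align*}
\frac{a_{1}-a_{0}}{\sigma_{1}^{2}} &= \frac{\sigma_{1}\Delta}{\sigma_{1}^{2}} = \frac{\Delta}{\sigma_{1}},
\qquad
\frac{b_{1}-b_{0}}{\sigma_{0}^{2}} = \frac{-\sigma_{0}\Delta}{\sigma_{0}^{2}} = -\frac{\Delta}{\sigma_{0}},\\
a_{1}^{2}-a_{0}^{2} &= \frac{\sigma_{1}^{2}\Delta^{2}}{4}-\frac{\sigma_{1}^{2}\Delta^{2}}{4} = 0,
\qquad
b_{1}^{2}-b_{0}^{2} = \frac{\sigma_{0}^{2}\Delta^{2}}{4}-\frac{\sigma_{0}^{2}\Delta^{2}}{4} = 0,\\
\ln\varphi^{\pi}(t) &= \frac{\Delta}{\sigma_{1}}x_{1}(t)-\frac{\Delta}{\sigma_{0}}x_{0}(t)
= \Delta\left\{\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right\}.
\end{align*}
```

The terms in $q_{1}(t),q_{0}(t)$ vanish because the support points are symmetric about zero, which is why the likelihood ratio depends on the data only through $x_{1}(t)/\sigma_{1}-x_{0}(t)/\sigma_{0}$.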

Suppose $\theta=1$. By (2.1), (2.2),

\begin{align}
\frac{dx_{1}(t)}{\sigma_{1}}-\frac{dx_{0}(t)}{\sigma_{0}} &= \frac{\Delta}{2}dt+\sqrt{\pi_{1}}dW_{1}(t)-\sqrt{\pi_{0}}dW_{0}(t) \notag\\
&= \frac{\Delta}{2}dt+d\tilde{W}(t), \tag{3.3}
\end{align}

where $d\tilde{W}(t):=\sqrt{\pi_{1}}dW_{1}(t)-\sqrt{\pi_{0}}dW_{0}(t)$ defines a one-dimensional Wiener process $\tilde{W}(\cdot)$, being a linear combination of two independent Wiener processes with $\pi_{1}+\pi_{0}=1$. Plugging the above into (3.2) gives

$$d\ln\varphi^{\pi}(t)=\frac{\Delta^{2}}{2}dt+\Delta d\tilde{W}(t).$$

In a similar manner, we can show under $\theta=0$ that $d\ln\varphi^{\pi}(t)=-\frac{\Delta^{2}}{2}dt+\Delta d\tilde{W}(t)$. In either case, the choice of $\pi$ does not affect the evolution of the likelihood-ratio process $\varphi^{\pi}(t)$, and consequently, has no bearing on the evolution of the beliefs $m^{\pi}(t)$.

As the likelihood-ratio and belief processes, $\varphi^{\pi}(t),m^{\pi}(t)$, are independent of $\pi$, the Bayes optimal stopping time in (2.7) is also independent of $\pi$ for indifference priors (standard results in optimal stopping, see e.g., Øksendal, 2003, Chapter 10, imply that the optimal stopping time in (2.7) is a function only of $m^{\pi}(t)$, which is now independent of $\pi$). In fact, it has the same form as the optimal stopping time in the Bayesian hypothesis testing problem of Arrow et al. (1949), analyzed in continuous time by Shiryaev (2007, Section 4.2.1) and Morris and Strack (2019). An adaptation of their results (see Lemma 1 in Appendix A) shows that the Bayes optimal stopping time corresponding to $p_{\Delta}$ is

\begin{align}
\tau_{\gamma(\Delta)}=\inf\left\{t:\left|\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right|\geq\gamma(\Delta)\right\}, \tag{3.4}
\end{align}

where $\gamma(\Delta)$ is defined in Lemma 1. By (2.5) and (3.2), the corresponding Bayes optimal implementation rule is

$$\delta^{*}=\mathbb{I}\left\{\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\geq 0\right\},$$

and is independent of $\Delta$. Hence, the decision rule $(\pi^{*},\tau_{\gamma(\Delta)},\delta^{*})$ is a best response of the DM to nature’s choice of $p_{\Delta}$.

Nature’s response to $\tau_{\gamma}$.

Next, consider nature’s response to the DM choosing $\tilde{\bm{d}}_{\gamma}$. Lemma 2 in Appendix A shows that the frequentist regret $V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)$, given some $\bm{\mu}=(\mu_{1},\mu_{0})$, depends only on $|\mu_{1}-\mu_{0}|$. So, $V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)$ is maximized at $|\mu_{1}-\mu_{0}|=(\sigma_{1}+\sigma_{0})\Delta(\gamma)/2$, where $\Delta(\gamma)$ is some function of $\gamma$. The best response of nature to $\tilde{\bm{d}}_{\gamma}$ is then to pick any prior that is supported on $\left\{\bm{\mu}:|\mu_{1}-\mu_{0}|=(\sigma_{1}+\sigma_{0})\Delta(\gamma)/2\right\}$. Therefore, the two-point prior $p_{\Delta(\gamma)}$ is a best response to $\tilde{\bm{d}}_{\gamma}$.

Nash equilibrium.

Based on the above observations, we can obtain the Nash equilibrium by solving for the equilibrium values of $\gamma,\Delta$. This is done in Lemma 3 in Appendix A.

3.2. Discussion

3.2.1. Sampling rule

Perhaps the most striking aspect of the sampling rule is that it is just the Neyman allocation. It is not adaptive, and is also independent of sampling costs. In fact, Corollary 1 shows that the sampling and implementation rules are exactly the same as in the best arm identification problem.

The Neyman allocation is also well known as the sampling rule that minimizes the variance for the estimation of treatment effects $\mu_{1}-\mu_{0}$. Armstrong (2022) shows that for optimal estimation of $\mu_{1}-\mu_{0}$, the Neyman allocation cannot be bettered even if we allow the sampling strategy to be adaptive. However, the result of Armstrong (2022) does not apply to best-arm identification. Here we show that the Neyman allocation retains its optimality even in this instance. As a practical matter then, practitioners should continue employing the same randomization designs as those employed for standard (i.e., non-sequential) experiments.

By way of comparison, the optimal assignment rule under normal priors is also non-stochastic, but varies deterministically with time (Liang et al., 2022).

3.2.2. Stopping time

The stopping time is adaptive, but it is stationary and has a simple form: Define

\[\rho(t):=\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}=\frac{\mu_{1}-\mu_{0}}{\sigma_{1}+\sigma_{0}}t+\tilde{W}(t), \tag{3.5}\]

where $\tilde{W}(\cdot)$ is standard one-dimensional Brownian motion. Our decision rule states that the DM should end the experiment when $|\rho(t)|$ exceeds $\left(\frac{\sigma_{1}+\sigma_{0}}{2c}\right)^{1/3}\gamma_{0}^{*}$. The threshold is decreasing in $c$ and increasing in $\sigma_{1}+\sigma_{0}$. Let $\bar{x}_{a}(t):=x_{a}(t)/q_{a}(t)$ denote the sample average of outcomes from treatment $a$ at time $t$. Since $q_{a}(t)=\sigma_{a}t/(\sigma_{1}+\sigma_{0})$ under $\pi^{*}$, we can rewrite the optimal stopping rule as $\tau^{*}=\inf\left\{t:t\left|\bar{x}_{1}(t)-\bar{x}_{0}(t)\right|\geq(\sigma_{1}+\sigma_{0})\gamma^{*}\right\}$; note that time, $t$, is a measure of the number of observations collected so far. From the form of $\rho(\cdot)$ and $\tau^{*}$, we can also infer that earlier stopping is indicative of larger reward gaps $|\mu_{1}-\mu_{0}|$, with the average length of the experiment being longest when $\mu_{1}-\mu_{0}=0$.
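The equivalence between the two forms of the stopping rule can be checked numerically. The sketch below is illustrative only: the function names and the numerical inputs are assumptions introduced here, and `gamma` stands in for the threshold $\gamma^{*}$, whose value comes from Theorem 1 and is not reproduced in this snippet.

```python
def rho_from_sums(x1, x0, sigma1, sigma0):
    """rho(t) = x1(t)/sigma1 - x0(t)/sigma0, as in (3.5)."""
    return x1 / sigma1 - x0 / sigma0


def rho_from_averages(xbar1, xbar0, t, sigma1, sigma0):
    """Same statistic written as t * (xbar1 - xbar0) / (sigma1 + sigma0)."""
    return t * (xbar1 - xbar0) / (sigma1 + sigma0)


def should_stop(x1, x0, sigma1, sigma0, gamma):
    """Stop once |rho(t)| reaches the (user-supplied) threshold gamma."""
    return abs(rho_from_sums(x1, x0, sigma1, sigma0)) >= gamma


# Hypothetical inputs, chosen only for illustration.
sigma1, sigma0, t = 2.0, 1.0, 3.0
q1 = sigma1 * t / (sigma1 + sigma0)  # time allocated to treatment 1 under Neyman allocation
q0 = sigma0 * t / (sigma1 + sigma0)  # time allocated to treatment 0
x1, x0 = 1.8, -0.4                   # cumulative outcome processes at time t
xbar1, xbar0 = x1 / q1, x0 / q0      # sample averages

rho_a = rho_from_sums(x1, x0, sigma1, sigma0)
rho_b = rho_from_averages(xbar1, xbar0, t, sigma1, sigma0)
```

Both expressions return the same value of $\rho(t)$, so comparing $|\rho(t)|$ with $\gamma^{*}$ is the same as comparing $t|\bar{x}_{1}(t)-\bar{x}_{0}(t)|$ with $(\sigma_{1}+\sigma_{0})\gamma^{*}$.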

The stationarity of $\tau^{*}$ is in sharp contrast to the properties of the optimal stopping time under Bayes regret with normal priors. There, the optimal stopping time is time dependent (Fudenberg et al., 2018; Liang et al., 2022). The following intuition, adapted from Fudenberg et al. (2018), helps understand the difference: Suppose that $\rho(t)\approx 0$ for some large $t$. Under a normal prior, this is likely because $\mu_{1}-\mu_{0}$ is close to $0$, in which case there is no significant difference between the treatments and the DM should terminate the experiment straightaway. On the other hand, the least favorable prior under minimax regret has a two point support, and under this prior, $\rho(t)\approx 0$ would be interpreted as noise, so the DM should proceed henceforth as if starting the experiment from scratch. Thus, the qualitative properties of the stopping time are very different depending on the prior. The above intuition also suggests that the relation between $\mu_{1}-\mu_{0}$ and stopping times is more complicated under normal priors, and not monotone as is the case under minimax regret.

The stopping time, $\tau^{*}$, induces a specific probability of mis-identification of the optimal treatment under the least favorable prior. By Lemmas 2 and 3, this probability is

\[\alpha^{*}=\frac{1-e^{-\Delta^{*}\gamma^{*}}}{e^{\Delta^{*}\gamma^{*}}-e^{-\Delta^{*}\gamma^{*}}}=\frac{1-e^{-\Delta_{0}^{*}\gamma_{0}^{*}}}{e^{\Delta_{0}^{*}\gamma_{0}^{*}}-e^{-\Delta_{0}^{*}\gamma_{0}^{*}}}\approx 0.235. \tag{3.6}\]

Interestingly, $\alpha^{*}$ is independent of the model parameters $c,\sigma_{1},\sigma_{0}$. This is because the least favorable prior adjusts the reward gap in response to these quantities.
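The expression in (3.6) reduces to a logistic form, which makes the numerical value easy to check. Writing $x:=\Delta^{*}\gamma^{*}$ (a shorthand used only for this calculation),

\[
\alpha^{*}=\frac{1-e^{-x}}{e^{x}-e^{-x}}
=\frac{e^{-x}(e^{x}-1)}{e^{-x}(e^{x}-1)(e^{x}+1)}
=\frac{1}{1+e^{x}},
\qquad\textrm{so that}\qquad
x=\ln\frac{1-\alpha^{*}}{\alpha^{*}}.
\]

With $\alpha^{*}\approx 0.235$, this gives $\Delta^{*}\gamma^{*}\approx\ln(0.765/0.235)\approx 1.18$.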

Another remarkable property, following from Fudenberg et al. (2018, Theorem 1), is that the probability of mis-identification is independent of the stopping time for any given value of $\bm{\mu}$, i.e., $\mathbb{P}(\delta=1|\tau,\mu=b)=\mathbb{P}(\delta=1|\mu=b)$. This is again different from the setting with normal priors, where earlier stopping is indicative of higher probability of selecting the best treatment.

3.3. Benefit of adaptive experimentation

In best arm identification and standard RCTs, the number of units of experimentation is specified beforehand. As we have seen previously, the Neyman allocation is minimax optimal under both adaptive and non-adaptive experiments. The benefit of the decision rule, $\bm{d}^{*}$, however, is that it enables one to stop the experiment early, thus saving on experimental costs. To quantify this benefit, fix some values of $\sigma_{1},\sigma_{0},c$, and suppose that nature chooses the least favorable prior, $p_{\Delta^{*}}$, for the generalized Wald problem. Note that $p_{\Delta^{*}}$ is in general different from the least favorable prior for the best arm identification problem. However, the two coincide if the parameter values are such that $\eta:=\left(\frac{2c}{\sigma_{1}+\sigma_{0}}\right)^{1/3}=\bar{\Delta}_{0}^{*}/\Delta_{0}^{*}\approx 0.484$, where $\Delta_{0}^{*},\bar{\Delta}_{0}^{*}$ are universal constants defined in the contexts of Theorem 1 and Corollary 1.

Let

\[R^{*}:=\int\mathbb{E}_{\bm{d}^{*}|\bm{\mu}}\left[\max\{\mu_{1}-\mu_{0},0\}-(\mu_{1}-\mu_{0})\delta\right]dp_{\Delta^{*}}\]

denote the Bayes regret, under $p_{\Delta^{*}}$, of the minimax decision rule $\bm{d}^{*}$ net of sampling costs. In fact, by symmetry, the above is also the frequentist regret of $\bm{d}^{*}$ under both the support points of $p_{\Delta^{*}}$. Now, let $T_{R^{*}}$ denote the duration of time required in a non-adaptive experiment to achieve the same Bayes regret $R^{*}$ (also under the least-favorable prior and net of sampling costs). Then, making use of some results from Shiryaev (2007, Section 4.2.5), we show in Appendix B.1 that

\[\frac{\mathbb{E}[\tau^{*}]}{T_{R^{*}}}=\frac{1-2\alpha^{*}}{2\left(\Phi^{-1}(1-\alpha^{*})\right)^{2}}\ln\frac{1-\alpha^{*}}{\alpha^{*}}\approx 0.6. \tag{3.7}\]

In other words, the use of an adaptive stopping time enables us to attain the same regret with $40\%$ fewer observations on average. Interestingly, the above result is independent of $\sigma_{1},\sigma_{0},c$, though the values of $\mathbb{E}[\tau^{*}]$ and $T_{R^{*}}$ do depend on these quantities (it is only the ratio that is constant). Admittedly, (3.7) does not quantify the welfare gain from using an adaptive experiment (this will depend on the sampling costs), but it is nevertheless useful as an informal measure of how much the amount of experimentation can be reduced.
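The right-hand side of (3.7) can be evaluated directly from the rounded value $\alpha^{*}\approx 0.235$ reported in (3.6). The following is a minimal sketch; the function name is an assumption made here, and the only inputs are the formula in (3.7) and the standard normal quantile function from Python's standard library.

```python
from math import log
from statistics import NormalDist


def sample_size_ratio(alpha):
    """Right-hand side of (3.7) for a given mis-identification probability."""
    z = NormalDist().inv_cdf(1.0 - alpha)  # Phi^{-1}(1 - alpha)
    return (1.0 - 2.0 * alpha) / (2.0 * z ** 2) * log((1.0 - alpha) / alpha)


# alpha* rounded to three decimals, as reported in (3.6)
ratio = sample_size_ratio(0.235)
```

With this rounded input the ratio comes out close to $0.6$, in line with (3.7).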

4. Parametric regimes and small cost asymptotics

We now turn to the analysis of parametric models in discrete time. As before, the DM is tasked with selecting a treatment for implementation on the population. To this end, the DM experiments sequentially in periods $j=1,2,\dots$ after paying an ‘effective sampling cost’ $C$ per period. Let $1/n$ denote the time difference between successive time periods. To analyze asymptotic behavior in this setting, we introduce small cost asymptotics, wherein $C=c/n^{3/2}$ for some $c\in(0,\infty)$, and $n\to\infty$. (Footnote 6: The rationale behind the $n^{3/2}$ normalization is the same as that in time series models with linear drift terms. The author is grateful to Tim Vogelsang for pointing this out.)

Are small cost asymptotics realistic? We contend they are, as $C$ is not the actual cost of experimentation, but rather characterizes the tradeoff between these costs and the benefit accruing from full-scale implementation following the experiment. Indeed, one way to motivate our asymptotic regime is to imagine that there are $n^{3/2}$ population units in the implementation phase (so that the benefit of implementing treatment $a$ on the population is $n^{3/2}\mu_{a}$), $c$ is the cost of sampling an additional unit of observation, and time, $t$, is measured in units of $n$. This formalizes the intuition that, in practice, the cost of sampling is relatively small compared to the population size; this is particularly true for online platforms (Deng et al., 2013) and clinical trials. The scaling also suggests that if the population size is $n^{3/2}$, we should aim to experiment on a sample size of the order $n$ to achieve optimal welfare.

In each period, the DM assigns a treatment to a single unit of observation according to some sampling rule $\pi_{j}(\cdot)$. The treatment assignment is a random draw $A_{j}\sim\textrm{Bernoulli}(\pi_{j})$. This results in an outcome $Y^{(a)}\sim P_{\theta}^{(a)}$, with $P_{\theta}^{(a)}$ denoting the population distribution of outcomes under treatment $a$. In this section, we assume that this distribution is known up to some unknown $\theta^{(a)}\in\mathbb{R}^{d}$. It is without loss of generality to assume $P_{\theta^{(1)}}^{(1)},P_{\theta^{(0)}}^{(0)}$ are mutually independent (conditional on $\theta^{(1)},\theta^{(0)}$) as we only ever observe the outcomes from one treatment anyway. After observing the outcome, the DM can decide either to stop sampling, or call up the next unit. At the end of the experiment, the DM prescribes a treatment to apply on the population.

We use the ‘stack-of-rewards-representation’ for the outcomes from each arm (Lattimore and Szepesvári, 2020, Section 4.6). Specifically, $Y_{i}^{(a)}$ denotes the outcome for the $i$-th data point corresponding to treatment $a$. Also, ${\bf y}_{nq}^{(a)}:=\{Y_{i}^{(a)}\}_{i=1}^{\left\lfloor nq\right\rfloor}$ denotes the sequence of outcomes after $\left\lfloor nq\right\rfloor$ observations from treatment $a$. We can imagine that prior to the experiment, nature draws an infinite stack of outcomes, ${\bf y}^{(a)}:=\{Y_{i}^{(a)}\}_{i=1}^{\infty}$, corresponding to each treatment $a$, and at each period $j$, if $A_{j}=a$, the DM observes the outcome at the top of the stack (this outcome is then removed from the stack corresponding to that treatment).

Recall that $t$ is the number of periods elapsed divided by $n$. Let $q_{a}(t):=n^{-1}\sum_{j=1}^{\left\lfloor nt\right\rfloor}\mathbb{I}(A_{j}=a)$, and take $\mathcal{F}_{t}$ to be the $\sigma$-algebra generated by

\[\xi_{t}\equiv\left\{\{A_{j}\}_{j=1}^{\left\lfloor nt\right\rfloor},\{Y_{i}^{(1)}\}_{i=1}^{\left\lfloor nq_{1}(t)\right\rfloor},\{Y_{i}^{(0)}\}_{i=1}^{\left\lfloor nq_{0}(t)\right\rfloor}\right\},\]

the set of all actions and rewards until period $nt$. The sequence of $\sigma$-algebras, $\{\mathcal{F}_{t}\}_{t\in\mathcal{T}_{n}}$, where $\mathcal{T}_{n}:=\{1/n,2/n,\dots\}$, constitutes a filtration. We require $\pi_{nt}(\cdot)$ to be $\mathcal{F}_{t-1/n}$ measurable, the stopping time, $\tau$, to be $\mathcal{F}_{t-1/n}$ measurable, and the implementation rule, $\delta$, to be $\mathcal{F}_{\tau}$ measurable. The set of all decision rules $\bm{d}\equiv(\{\pi_{nt}\}_{t\in\mathcal{T}_{n}},\tau,\delta)$ satisfying these requirements is denoted by $\mathcal{D}_{n}$. As unbounded stopping times pose technical challenges, we generally work with $\mathcal{D}_{n,T}\equiv\left\{\bm{d}\in\mathcal{D}_{n}:\tau\leq T\ \textrm{a.s.}\right\}$, the set of all decision rules with stopping times bounded by some arbitrarily large, but finite, $T$.

The mean outcomes under a parameter $\theta$ are denoted by $\mu_{a}(\theta):=\mathbb{E}_{P_{\theta}^{(a)}}[Y_{i}^{(a)}]$. Following Hirano and Porter (2009), for each $a\in\{0,1\}$, we consider local perturbations of the form $\{\theta_{0}^{(a)}+h_{a}/\sqrt{n};h_{a}\in\mathbb{R}^{d}\}$, with $h_{a}$ unknown, around a reference parameter $\theta_{0}^{(a)}$. As in that paper, $\theta_{0}^{(a)}$ is chosen such that $\mu_{1}(\theta_{0}^{(1)})=\mu_{0}(\theta_{0}^{(0)})=0$; the last equality, which sets the quantities to $0$, is not necessary and is simply a convenient re-centering. This choice of $\theta_{0}^{(a)}$ defines the hardest instance of the generalized Wald problem, with

\[\mu_{n,a}(h):=\mu_{a}(\theta_{0}^{(a)}+h/\sqrt{n})\approx\dot{\mu}_{a}^{\intercal}h/\sqrt{n}\]

for each $h\in\mathbb{R}^{d}$, where $\dot{\mu}_{a}:=\nabla_{\theta}\mu_{a}(\theta_{0}^{(a)})$. When $\mu_{1}(\theta_{0}^{(1)})\neq\mu_{0}(\theta_{0}^{(0)})$, determining the best treatment is trivial under large $n$, and many decision rules, including the one we propose here (in Section 4.3), would achieve zero asymptotic regret.

Let $P_{h}^{(a)}:=P_{\theta_{0}^{(a)}+h/\sqrt{n}}^{(a)}$ and take $\mathbb{E}_{h}^{(a)}[\cdot]$ to be its corresponding expectation. We assume $P_{\theta}^{(a)}$ is differentiable in quadratic mean around $\theta_{0}^{(a)}$ with score functions $\psi_{a}(Y_{i})$ and information matrices $I_{a}:=\mathbb{E}_{0}^{(a)}[\psi_{a}\psi_{a}^{\intercal}]$. To reduce some notational overhead, we set $\theta_{0}^{(1)}=\theta_{0}^{(0)}=\theta_{0}$, and also suppose that $\mu_{n,a}(h)=-\mu_{n,a}(-h)$ for all $h$. In fact, the latter is always true asymptotically. Both simplifications can be easily dispensed with (at the expense of some additional notation). We emphasize that our results do not fundamentally require $\theta_{0}^{(1)},\theta_{0}^{(0)}$ to be the same or even have the same dimension.

4.1. Bayes and minimax regret under fixed $n$

Let $P_{n,h}^{(a)}$ denote the joint probability over ${\bf y}_{nT}^{(a)}:=\left\{Y_{1}^{(a)},\dots,Y_{nT}^{(a)}\right\}$ (the largest possible, under $\tau\leq T$, iid sequence of outcomes that can be observed from treatment $a$) when $Y^{(a)}\sim P_{h}^{(a)}$. Define $\bm{h}:=(h_{1},h_{0})$, take $P_{n,\bm{h}}$ to be the joint probability $P_{n,h_{1}}^{(1)}\times P_{n,h_{0}}^{(0)}$, and $\mathbb{E}_{n,\bm{h}}[\cdot]$ its corresponding expectation. The frequentist regret of decision rule $\bm{d}$ is defined as

\[
\begin{aligned}
V_{n}(\bm{d},\bm{h}) &\equiv V_{n}\left(\bm{d},\left(\mu_{n,1}(h_{1}),\mu_{n,0}(h_{0})\right)\right)\\
&:=\sqrt{n}\,\mathbb{E}_{n,\bm{h}}\left[\max\left\{\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0}),0\right\}-\left(\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0})\right)\delta+\frac{c}{n^{3/2}}n\tau\right]\\
&=\sqrt{n}\,\mathbb{E}_{n,\bm{h}}\left[\max\left\{\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0}),0\right\}-\left(\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0})\right)\delta\right]+c\,\mathbb{E}_{n,\bm{h}}[\tau],
\end{aligned}
\]

where the multiplication by $\sqrt{n}$ in the second line of the above equation is a normalization ensuring $V_{n}(\bm{d},\bm{h})$ converges to a non-trivial quantity.

Let $\nu$ denote a dominating measure over $\{P_{\theta}:\theta\in\Theta\}$, and define $p_{\theta}:=dP_{\theta}/d\nu$. Also, take $M_{0}$ to be some prior over $\bm{h}$, and $m_{0}$ its density with respect to some other dominating measure $\nu_{1}$. By Adusumilli (2021), the posterior density (wrt $\nu_{1}$), $p_{n}(\cdot|\mathcal{F}_{t})$, of $\bm{h}$ depends only on ${\bf y}_{nq_{a}(t)}^{(a)}=\{Y_{i}^{(a)}\}_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}$ for $a\in\{0,1\}$. Hence,

\[
\begin{aligned}
p_{n}(\bm{h}|\mathcal{F}_{t}) &=p_{n}\left(\bm{h}|{\bf y}_{nq_{1}(t)}^{(1)},{\bf y}_{nq_{0}(t)}^{(0)}\right)\\
&\propto\left\{\prod_{i=1}^{\left\lfloor nq_{1}(t)\right\rfloor}p_{\theta_{0}+h_{1}/\sqrt{n}}^{(1)}(Y_{i}^{(1)})\right\}\left\{\prod_{i=1}^{\left\lfloor nq_{0}(t)\right\rfloor}p_{\theta_{0}+h_{0}/\sqrt{n}}^{(0)}(Y_{i}^{(0)})\right\}m_{0}(\bm{h}).
\end{aligned}
\tag{4.1}
\]

The fixed $n$ Bayes regret of a decision rule $\bm{d}$ is given by $V_{n}(\bm{d},m_{0}):=\int V_{n}(\bm{d},\bm{h})dm_{0}(\bm{h})$.

Let $\xi_{\tau}$ denote the terminal state. From the form of $V_{n}(\bm{d},\bm{h})$, it is clear that the Bayes optimal implementation rule is $\delta^{*}(\xi_{\tau})=\mathbb{I}\left\{\mu_{n,1}(\xi_{\tau})\geq\mu_{n,0}(\xi_{\tau})\right\}$, and the resulting Bayes regret at the terminal state is

\[\varpi_{n}(\xi_{\tau}):=\mu_{n}^{\max}(\xi_{\tau})-\max\left\{\mu_{n,1}(\xi_{\tau}),\mu_{n,0}(\xi_{\tau})\right\}, \tag{4.2}\]

where $\mu_{n,a}(\xi_{\tau}):=\mathbb{E}_{\bm{h}|\xi_{\tau}}[\mu_{n,a}(h_{a})]$ and $\mu_{n}^{\max}(\xi_{\tau}):=\mathbb{E}_{\bm{h}|\xi_{\tau}}[\max\{\mu_{n,1}(h_{1}),\mu_{n,0}(h_{0})\}]$. We can thus associate each combination, $(\pi,\tau)$, of sampling rules and stopping times with the distribution $\mathbb{P}_{\pi,\tau}$ that they induce over $(\varpi_{n}(\xi_{\tau}),\tau)$. Thus,

\[V_{n}\left(\bm{d},m_{0}\right)=\mathbb{E}_{\pi,\tau}\left[\sqrt{n}\varpi_{n}(\xi_{\tau})+c\tau\right].\]

For any given $T<\infty$, the minimal Bayes regret in the fixed $n$ setting is therefore

\[V_{n,T}^{*}(m_{0})=\inf_{\bm{d}\in\mathcal{D}_{n,T}}\mathbb{E}_{\pi,\tau}\left[\sqrt{n}\varpi_{n}(\xi_{\tau})+c\tau\right].\]

While our interest is in minimax regret, $V_{n,T}^{*}:=\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}}V_{n}(\bm{d},\bm{h})$, the minimal Bayes regret is a useful theoretical device as it provides a lower bound, $V_{n,T}^{*}\geq V_{n,T}^{*}(m_{0})$, for any prior $m_{0}$.
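The lower bound holds because a supremum dominates any average. Spelled out, for any fixed $\bm{d}\in\mathcal{D}_{n,T}$ and any prior $m_{0}$,

\[
\sup_{\bm{h}}V_{n}(\bm{d},\bm{h})\;\geq\;\int V_{n}(\bm{d},\bm{h})\,dm_{0}(\bm{h})\;=\;V_{n}(\bm{d},m_{0})\;\geq\;\inf_{\bm{d}'\in\mathcal{D}_{n,T}}V_{n}(\bm{d}',m_{0})\;=\;V_{n,T}^{*}(m_{0}),
\]

and taking the infimum over $\bm{d}\in\mathcal{D}_{n,T}$ on the left-hand side gives $V_{n,T}^{*}\geq V_{n,T}^{*}(m_{0})$.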

4.2. Lower bound on minimax regret

We impose the following assumptions:

Assumption 1.

(i) The class $\{P_{\theta}^{(a)};\theta\in\mathbb{R}^{d}\}$ is differentiable in quadratic mean around $\theta_{0}$ for each $a\in\{0,1\}$.

(ii) $\mathbb{E}_{0}^{(a)}[\exp|\psi_{a}(Y_{i}^{(a)})|]<\infty$ for $a\in\{0,1\}$.

(iii) There exist $\dot{\mu}_{1},\dot{\mu}_{0}$ and $\epsilon_{n}\to 0$ s.t. $\sqrt{n}\mu\left(P_{h}^{(a)}\right)\equiv\sqrt{n}\mu_{n,a}(h)=\dot{\mu}_{a}^{\intercal}h+\epsilon_{n}|h|^{2}$ for each $a\in\{0,1\}$ and $h\in\mathbb{R}^{d}$.

The assumptions are standard, with the only onerous requirement being Assumption 1(ii), which requires the score function to have bounded exponential moments. This is needed due to the proof techniques, which are adapted from Adusumilli (2021).

Let $V^{*}$ denote the asymptotic minimax regret, defined as the value of the minimax problem in (3.1).

Theorem 2.

Suppose Assumptions 1(i)-(iii) hold. Then,

\[\sup_{\mathcal{J}}\lim_{T\to\infty}\liminf_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}\in\mathcal{J}}V_{n}(\bm{d},\bm{h})\geq V^{*},\]

where the outer supremum is taken over all finite subsets $\mathcal{J}$ of $\mathbb{R}^{d}\times\mathbb{R}^{d}$.

The proof proceeds as follows: Let $\sigma_{a}^{2}:=\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\dot{\mu}_{a}$,

\[h_{a}^{*}:=\frac{\sigma_{a}\Delta^{*}}{2\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\dot{\mu}_{a}}I_{a}^{-1}\dot{\mu}_{a},\]

and take $m_{0}^{*}$ to be the symmetric two-point prior supported on $(h_{1}^{*},-h_{0}^{*})$ and $(-h_{1}^{*},h_{0}^{*})$. This is the parametric counterpart to the least favorable prior described in Theorem 1. Clearly, there exist subsets $\mathcal{J}$ such that

\[\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}\in\mathcal{J}}V_{n}(\bm{d},\bm{h})\geq\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},m_{0}^{*}).\]

In Appendix A, we show

\[\lim_{T\to\infty}\lim_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},m_{0}^{*})=V^{*}. \tag{4.3}\]

To prove (4.3), we build on previous work in Adusumilli (2021). Standard techniques, such as asymptotic representation theorems (Van der Vaart, 2000), are not easily applicable here due to the continuous time nature of the problem. We instead employ a three step approach: First, we replace $P_{n,\bm{h}}$ with a simpler family of measures whose likelihood ratios (under different values of $\bm{h}$) are the same as those under Gaussian distributions. Then, for this family, we write down a HJB-Variational Inequality (HJB-VI) to characterize the optimal value function under fixed $n$. PDE approximation arguments then let us approximate the fixed $n$ value function with that under continuous time. The latter is shown to be $V^{*}$.
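As a consistency check on the prior $m_{0}^{*}$, the vector $h_{a}^{*}$ is scaled so that the implied local gap in means matches the least favorable gap from Section 3. Using $\sigma_{a}^{2}=\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\dot{\mu}_{a}$ and the linear approximation $\sqrt{n}\mu_{n,a}(h)\approx\dot{\mu}_{a}^{\intercal}h$,

\[
\dot{\mu}_{a}^{\intercal}h_{a}^{*}=\frac{\sigma_{a}\Delta^{*}}{2\sigma_{a}^{2}}\,\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\dot{\mu}_{a}=\frac{\sigma_{a}\Delta^{*}}{2},
\qquad\textrm{so}\qquad
\sqrt{n}\left(\mu_{n,1}(h_{1}^{*})-\mu_{n,0}(-h_{0}^{*})\right)\approx\frac{(\sigma_{1}+\sigma_{0})\Delta^{*}}{2}.
\]

The second support point, $(-h_{1}^{*},h_{0}^{*})$, gives the same gap with the opposite sign, so both support points have $|\mu_{1}-\mu_{0}|$ equal to $(\sigma_{1}+\sigma_{0})\Delta^{*}/2$ on the $\sqrt{n}$ scale.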

The definition of asymptotic minimax risk used in Theorem 2 is standard, see, e.g., Van der Vaart (2000, Theorem 8.11), apart from the $\lim_{T\to\infty}$ operation. The theorem asserts that $V^{*}$ is a lower bound on minimax regret under any bounded stopping time. The bound $T$ can be arbitrarily large. Our proof techniques require bounded stopping times as our approximation results, e.g., the SLAN property (see equation (5.2) in Appendix A), are only valid when the experiment is of bounded duration. (Footnote 7: For any given $\bm{h}$, the dominated convergence theorem implies $\lim_{T\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},\bm{h})=\inf_{\bm{d}\in\mathcal{D}_{n}}V_{n}(\bm{d},\bm{h})$. However, to allow $T=\infty$ in Theorem 2, we need to show that this equality holds uniformly over $n$. In specific instances, e.g., when the parametric family is Gaussian, this is indeed the case, but we are not aware of any general results in this direction.) Nevertheless, we conjecture that in practice there is no loss in setting $T=\infty$.

It is straightforward to extend Theorem 2 to best arm identification. We omit the formal statement for brevity.

4.3. Attaining the bound

We now describe a decision rule $\bm{d}_{n}=(\pi_{n},\tau_{n},\delta_{n})$ that is asymptotically minimax optimal. Let $\sigma_{a}^{2}=\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\dot{\mu}_{a}$ for each $a$ and

\[\rho_{n}(t):=\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}},\ \textrm{where}\quad x_{a}(t):=\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}}{\sqrt{n}}\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)}).\]

Note that $x_{a}(t)$ is the efficient influence function process for estimation of $\mu_{a}(\theta)$. We assume $\dot{\mu}_{a},I_{a},\sigma_{a}$ are known; but in practice, they should be replaced with consistent estimates (from a vanishingly small initial sample) so that they do not require knowledge of the reference parameter $\theta_{0}$. This can be done without affecting the asymptotic results, see Section 6.3.

Take $\pi_{n}$ to be any sampling rule such that

\[\left|\frac{q_{a}(t)}{t}-\frac{\sigma_{a}}{\sigma_{1}+\sigma_{0}}\right|\leq B\left\lfloor nt\right\rfloor^{-b_{0}}\ \textrm{uniformly over bounded }t, \tag{4.4}\]

for some $B<\infty$ and $b_{0}>1/2$. To simplify matters, we suppose that $\pi_{n}$ is deterministic, e.g., $\pi_{n,1}=\mathbb{I}\left\{q_{1}(t)\leq t\sigma_{1}/(\sigma_{1}+\sigma_{0})\right\}$. Fully randomized rules, $\pi_{n,1}=\sigma_{1}/(\sigma_{0}+\sigma_{1})$, do not satisfy the ‘fine-balance’ condition (4.4) and we indeed found them to perform poorly in practice. We further employ
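The deterministic example rule above can be traced period by period to see how tightly it tracks the Neyman shares. The sketch below is illustrative only: the function names are assumptions made here, and the rule is written in terms of raw counts, using $q_{1}(t)=n_{1}/n$ and $t=j/n$ after $j$ periods, so that the indicator becomes $n_{1}\leq j\sigma_{1}/(\sigma_{1}+\sigma_{0})$.

```python
def neyman_assignments(sigma1, sigma0, num_periods):
    """Deterministic rule: assign treatment 1 iff its count so far is at or below target."""
    target = sigma1 / (sigma1 + sigma0)
    n1 = 0                      # units assigned to treatment 1 so far
    assignments = []
    for j in range(num_periods):    # j = number of periods already elapsed
        a = 1 if n1 <= j * target else 0
        n1 += a
        assignments.append(a)
    return assignments


def max_scaled_imbalance(assignments, sigma1, sigma0):
    """Largest value of j * |n1(j)/j - target| over j = 1..len(assignments)."""
    target = sigma1 / (sigma1 + sigma0)
    n1, worst = 0, 0.0
    for j, a in enumerate(assignments, start=1):
        n1 += a
        worst = max(worst, abs(n1 - j * target))
    return worst


# Hypothetical standard deviations, chosen only for illustration.
assignments = neyman_assignments(sigma1=2.0, sigma0=1.0, num_periods=3000)
imbalance = max_scaled_imbalance(assignments, 2.0, 1.0)
```

For this rule the count of treatment-1 units never drifts more than one unit from its target, so $|q_{1}(t)/t-\sigma_{1}/(\sigma_{1}+\sigma_{0})|$ is below $1/\lfloor nt\rfloor$ and (4.4) holds with $B=1$ and $b_{0}=1$. A fully randomized rule would instead produce imbalances of order $\lfloor nt\rfloor^{-1/2}$.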

\[\tau_{n,T}=\inf\left\{t:\left|\rho_{n}(t)\right|\geq\gamma^{*}\right\}\wedge T\]

as the stopping time, and as the implementation rule, set $\delta_{n,T}=\mathbb{I}\left\{\rho_{n}(\tau_{n,T})\geq 0\right\}$.
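The three components of the rule can be combined into a short simulation. The sketch below rests on assumptions that are not part of the general setup: outcomes are Gaussian location models with known variances and $\theta_{0}=0$, in which case $\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(y)=y$ and $x_{a}(t)$ is the sum of outcomes scaled by $n^{-1/2}$. The function name and arguments are hypothetical, and `gamma` is a user-supplied stand-in for $\gamma^{*}$.

```python
import math
import random


def run_experiment(mu1, mu0, sigma1, sigma0, n, gamma, T, rng):
    """Simulate one run of (pi_n, tau_{n,T}, delta_{n,T}) with Gaussian outcomes.

    mu1 and mu0 are local means on the sqrt(n) scale, so each outcome from
    arm a is drawn as N(mu_a / sqrt(n), sigma_a^2). Returns (tau, delta).
    """
    target = sigma1 / (sigma1 + sigma0)
    sums = {1: 0.0, 0: 0.0}                    # running sums of outcomes per arm
    sigma = {1: sigma1, 0: sigma0}
    mean = {1: mu1 / math.sqrt(n), 0: mu0 / math.sqrt(n)}
    n1 = 0                                     # units assigned to arm 1 so far
    max_periods = int(n * T)
    rho = 0.0
    for j in range(max_periods):
        arm = 1 if n1 <= j * target else 0     # deterministic Neyman-type rule
        n1 += arm
        sums[arm] += rng.gauss(mean[arm], sigma[arm])
        # rho_n(t) = x_1(t)/sigma_1 - x_0(t)/sigma_0, x_a(t) = n^{-1/2} * sum of outcomes
        rho = (sums[1] / sigma1 - sums[0] / sigma0) / math.sqrt(n)
        if abs(rho) >= gamma:
            return (j + 1) / n, int(rho >= 0)  # stopped at the threshold
    return max_periods / n, int(rho >= 0)      # stopped at the bound T


# Hypothetical parameter values, chosen only for illustration.
tau, delta = run_experiment(50.0, -50.0, 1.0, 1.0, n=400, gamma=1.0, T=5.0,
                            rng=random.Random(0))
tau_capped, _ = run_experiment(0.0, 0.0, 1.0, 1.0, n=400, gamma=1e9, T=5.0,
                               rng=random.Random(1))
```

Only the running value of $\rho_{n}(t)$, together with the assignment count used by the sampling rule, needs to be carried from one period to the next.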

Intuitively, $\bm{d}_{n,T}=(\pi_{n},\tau_{n,T},\delta_{n,T})$ is the finite sample counterpart of the minimax optimal decision rule $\bm{d}^{*}$ from Section 3. The following theorem shows that it is asymptotically minimax optimal in that it attains the lower bound of Theorem 2.

Theorem 3.

Suppose Assumptions 1(i)-(iii) hold. Then,

\[\sup_{\mathcal{J}}\lim_{T\to\infty}\liminf_{n\to\infty}\sup_{\bm{h}\in\mathcal{J}}V_{n}(\bm{d}_{n,T},\bm{h})=V^{*},\]

where the outer supremum is taken over all finite subsets $\mathcal{J}$ of $\mathbb{R}^{d}\times\mathbb{R}^{d}$.

An important implication of Theorem 3 is that the minimax optimal decision rule only involves one state variable, $\rho_{n}(t)$. This is even though the state space in principle includes all the past observations until period $i$, for a total of at least $2i$ variables. The theorem thus provides a major reduction in dimension.

5. The non-parametric setting

We now turn to the non-parametric setting where there is no a-priori information about the distributions $P^{(1)},P^{(0)}$ of $Y_{i}^{(1)}$ and $Y_{i}^{(0)}$. Let $\mathcal{P}$ denote the class of probability measures with bounded variance, and dominated by some measure $\nu$. We fix some reference probability distribution $P_{0}^{(a)}\in\mathcal{P}$, and then, following Van der Vaart (2000), surround it with smooth one-dimensional sub-models of the form $\{P_{s,h}^{(a)}:s\leq\eta\}$ for some $\eta>0$, where $h(\cdot)$ is a measurable function satisfying

\[\int\left[\frac{\sqrt{dP_{s,h}^{(a)}}-\sqrt{dP_{0}^{(a)}}}{s}-\frac{1}{2}h\sqrt{dP_{0}^{(a)}}\right]^{2}d\nu\to 0\ \textrm{as}\ s\to 0. \tag{5.1}\]

By Van der Vaart (2000), (5.1) implies $\int h\,dP_{0}^{(a)}=0$ and $\int h^{2}dP_{0}^{(a)}<\infty$. The set of all such candidate $h$ is termed the tangent space $T(P_{0}^{(a)})$. This is a subset of the Hilbert space $L^{2}(P_{0}^{(a)})$, endowed with the inner product $\left\langle f,g\right\rangle_{a}=\mathbb{E}_{P_{0}^{(a)}}[fg]$ and norm $\left\|f\right\|_{a}=\mathbb{E}_{P_{0}^{(a)}}[f^{2}]^{1/2}$.

For any $h_{a}\in T(P_{0}^{(a)})$, let $P_{n,h_{a}}^{(a)}$ denote the joint probability measure over $Y_{1}^{(a)},\dots,Y_{nT}^{(a)}$, when each $Y_{i}^{(a)}$ is an iid draw from $P_{1/\sqrt{n},h_{a}}^{(a)}$. Also, denote $\bm{h}=(h_{1},h_{0})$, where each $h_{a}\in T(P_{0}^{(a)})$, and take $P_{n,\bm{h}}$ to be the joint probability $P_{n,h_{1}}^{(1)}\times P_{n,h_{0}}^{(0)}$, with $\mathbb{E}_{n,\bm{h}}[\cdot]$ being its corresponding expectation. An important implication of (5.1) is the SLAN property that for all $h\in T(P_{0}^{(a)})$,

\[\sum_{i=1}^{\left\lfloor nq\right\rfloor}\ln\frac{dP_{1/\sqrt{n},h}^{(a)}}{dP_{0}^{(a)}}(Y_{i}^{(a)})=\frac{1}{\sqrt{n}}\sum_{i=1}^{\left\lfloor nq\right\rfloor}h(Y_{i}^{(a)})-\frac{q}{2}\left\|h\right\|_{a}^{2}+o_{P_{n,0}^{(a)}}(1),\ \textrm{ uniformly over bounded }q. \tag{5.2}\]

See Adusumilli (2021, Lemma 2) for the proof.

The mean rewards under $P^{(a)}$ are given by $\mu(P^{(a)})=\int x\,dP^{(a)}(x)$. To obtain non-trivial regret bounds, we focus on the case where $\mu(P_{0}^{(a)})=0$ for $a\in\{0,1\}$. Let $\psi(x):=x$ and $\sigma_{a}^{2}:=\int x^{2}dP_{0}^{(a)}(x)$. Then, $\psi(\cdot)$ is the efficient influence function corresponding to estimation of $\mu$, in the sense that under some mild assumptions on $\{P_{s,h}^{(a)}\}$,

\[\frac{\mu(P_{s,h}^{(a)})-\mu(P_{0}^{(a)})}{s}-\left\langle\psi,h\right\rangle_{a}=\frac{\mu(P_{s,h}^{(a)})}{s}-\left\langle\psi,h\right\rangle_{a}=o(s). \tag{5.3}\]

The above implies $\mu(P_{1/\sqrt{n},h}^{(a)})\approx\left\langle\psi,h\right\rangle_{a}/\sqrt{n}$. This is just the right scaling for diffusion asymptotics. In what follows, we shall set $\mu_{n,a}(h):=\mu(P_{1/\sqrt{n},h}^{(a)})$.

It is possible to select $\{\phi_{a,1},\phi_{a,2},\dots\}\in T(P_{0}^{(a)})$ in such a manner that $\{\psi/\sigma_{a},\phi_{a,1},\phi_{a,2},\dots\}$ is a set of orthonormal basis functions for the closure of $T(P_{0}^{(a)})$; the division by $\sigma_{a}$ in the first component ensures $\left\|\psi/\sigma_{a}\right\|_{a}^{2}=\int x^{2}/\sigma_{a}^{2}\,dP_{0}^{(a)}(x)=1$. We can also choose these bases so they lie in $T(P_{0}^{(a)})$, i.e., $\mathbb{E}_{P_{0}^{(a)}}[\phi_{a,j}]=0$ for all $j$. By the Hilbert space isometry, each $h_{a}\in T(P_{0}^{(a)})$ is then associated with an element from the $l_{2}$ space of square integrable sequences, $(h_{a,0}/\sigma_{a},h_{a,1},\dots)$, where $h_{a,0}=\left\langle\psi,h_{a}\right\rangle_{a}$ and $h_{a,k}=\left\langle\phi_{a,k},h_{a}\right\rangle_{a}$ for all $k\neq 0$.

As in the previous sections, to derive the properties of minimax regret, it is convenient to first define a notion of Bayes regret. To this end, we follow Adusumilli (2021) and define Bayes regret in terms of priors on the tangent space $T(P_{0})$, or equivalently, in terms of priors on $l_{2}$. Let $(\varrho(1),\varrho(2),\dots)$ denote some permutation of $(1,2,\dots)$. For the purposes of deriving our theoretical results, we may restrict attention to priors, $m_{0}$, that are supported on a finite dimensional sub-space,

\[\mathcal{H}_{I}\equiv\left\{\bm{h}\in T(P_{0}^{(1)})\times T(P_{0}^{(0)}):h_{a}=\frac{1}{\sigma_{a}}\left\langle\psi,h_{a}\right\rangle_{a}\frac{\psi}{\sigma_{a}}+\sum_{k=1}^{I-1}\left\langle\phi_{a,\varrho(k)},h_{a}\right\rangle_{a}\phi_{a,\varrho(k)}\right\}\]

of $T(P_{0}^{(a)})$, or isometrically, on a subset of $l_{2}\times l_{2}$ of finite dimension $I\times I$. Note that the first component of $h_{a}\in l_{2}$ is always included in the prior; this is proportional to $\left\langle\psi,h_{a}\right\rangle_{a}$, the inner product with the efficient influence function.

In analogy with Section 4, the frequentist expected regret of decision rule 𝒅\bm{d} is defined as

\[
\begin{aligned}
V_{n}(\bm{d},\bm{h}) &\equiv\sqrt{n}\,\mathbb{E}_{n,\bm{h}}\left[\max\left\{\mu_{n}(h_{1})-\mu_{n}(h_{0}),0\right\}-\left(\mu_{n}(h_{1})-\mu_{n}(h_{0})\right)\delta+\frac{c}{n^{3/2}}n\tau\right]\\
&=\sqrt{n}\,\mathbb{E}_{n,\bm{h}}\left[\max\left\{\mu_{n}(h_{1})-\mu_{n}(h_{0}),0\right\}-\left(\mu_{n}(h_{1})-\mu_{n}(h_{0})\right)\delta\right]+c\,\mathbb{E}_{n,\bm{h}}[\tau].
\end{aligned}
\]

The corresponding Bayes regret is

\[V_{n}(\bm{d},m_{0})=\int V_{n}(\bm{d},\bm{h})dm_{0}(\bm{h}).\]

5.1. Lower bounds

The following assumptions are similar to Assumption 1:

Assumption 2.

(i) The sub-models $\{P_{s,h}^{(a)};h\in T(P_{0}^{(a)})\}$ satisfy (5.1) for each $a\in\{0,1\}$.

(ii) $\mathbb{E}_{P_{0}^{(a)}}[\exp|Y_{i}^{(a)}|]<\infty$ for $a\in\{0,1\}$.

(iii) There exists $\epsilon_{n}\to 0$ s.t. $\sqrt{n}\mu_{n,a}(h_{a})=h_{a,0}+\epsilon_{n}\left\|h_{a}\right\|_{a}^{2}$ for each $a\in\{0,1\}$ and $h_{a}\in T(P_{0}^{(a)})$.

We then have the following lower bound:

Theorem 4.

Suppose Assumptions 2(i)-(iii) hold. Then,

\[
\sup_{\mathcal{H}_{I}}\lim_{T\to\infty}\liminf_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}\in\mathcal{H}_{I}}V_{n}(\bm{d},\bm{h})\geq V^{*},
\]

where the outer supremum is taken over all possible finite dimensional subspaces, $\mathcal{H}_{I}$, of $T(P_{0}^{(1)})\times T(P_{0}^{(0)})$.

As with Theorem 2, the proof involves lower bounding minimax regret with Bayes regret under a suitable prior. Denote $h_{a,0}^{*}:=\sigma_{a}\Delta^{*}/2$ and take $m_{0}^{*}$ to be the symmetric two-point prior supported on $((h_{1,0}^{*},0,0,\dots),(-h_{0,0}^{*},0,0,\dots))$ and $((-h_{1,0}^{*},0,0,\dots),(h_{0,0}^{*},0,0,\dots))$. Here, $m_{0}^{*}$ is a probability distribution on the space $l_{2}\times l_{2}$. Then, there exist sub-spaces $\mathcal{H}_{I}$ such that

\[
\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}\in\mathcal{H}_{I}}V_{n}(\bm{d},\bm{h})\geq\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},m_{0}^{*}).
\]

We can then show

\[
\lim_{T\to\infty}\lim_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},m_{0}^{*})=V^{*}.
\]

The proof of the above uses the same arguments as that of Theorem 2, and is therefore omitted.

5.2. Attaining the bound

As in Section 4.3, take $\pi_{n}$ to be any deterministic sampling rule that satisfies (4.4). Let

\[
\rho_{n}(t):=\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}},\quad\textrm{where}\quad x_{a}(t):=\frac{1}{\sqrt{n}}\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}Y_{i}^{(a)}. \tag{5.4}
\]

Note that $x_{a}(t)$, which is the scaled sum of outcomes from each treatment, is again the efficient influence function process for estimation of $\mu(P^{(a)})$ in the non-parametric setting. We choose as the stopping time,

\[
\tau_{n,T}=\inf\left\{t:\left|\rho_{n}(t)\right|\geq\gamma^{*}\right\}\wedge T,
\]

and as the implementation rule, set $\delta_{n,T}=\mathbb{I}\left\{\rho_{n}(\tau_{n,T})\geq 0\right\}$, i.e., treatment 1 is implemented when the statistic is non-negative at the time of stopping.

The following theorem shows that the triple $\bm{d}_{n,T}=(\pi_{n},\tau_{n,T},\delta_{n,T})$ attains the minimax lower bound in the non-parametric regime.

Theorem 5.

Suppose Assumptions 2(i)-(iii) hold. Then,

\[
\sup_{\mathcal{H}_{I}}\lim_{T\to\infty}\liminf_{n\to\infty}\sup_{\bm{h}\in\mathcal{H}_{I}}V_{n}(\bm{d}_{n,T},\bm{h})=V^{*},
\]

where the outer supremum is taken over all possible finite dimensional subspaces, $\mathcal{H}_{I}$, of $T(P_{0}^{(1)})\times T(P_{0}^{(0)})$.

The proof is similar to that of Theorem 3 and is sketched in Appendix B.2.
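To make the mechanics of the rule in this subsection concrete, the following is a minimal Python sketch of one run of the procedure: a deterministic allocation that tracks the Neyman share $\sigma_{1}/(\sigma_{1}+\sigma_{0})$, the statistic $\rho_{n}(t)$ from (5.4), stopping at the first time $|\rho_{n}(t)|$ reaches a threshold (or at $T$), and implementation of treatment 1 when the statistic is non-negative. The function name `run_policy`, its arguments, and the particular allocation rule are illustrative assumptions, not part of the paper; the sketch also assumes that $\sigma_{1},\sigma_{0}$ and the threshold are supplied by the user and that outcomes are measured relative to the reference mean.

```python
import math
import random


def run_policy(draw1, draw0, sigma1, sigma0, gamma, n, T):
    """Run the sequential rule once; return (tau, delta, rho).

    draw1, draw0: zero-argument callables returning one outcome each (hypothetical interface).
    """
    pi1 = sigma1 / (sigma1 + sigma0)    # Neyman share for treatment 1
    sum1 = sum0 = 0.0                   # running outcome sums per treatment
    n1 = 0                              # units assigned to treatment 1 so far
    max_obs = int(math.floor(n * T))    # cap implied by the time horizon T
    rho = 0.0
    for i in range(1, max_obs + 1):
        # Deterministic allocation that keeps n1 / i close to pi1.
        if n1 < pi1 * i:
            sum1 += draw1()
            n1 += 1
        else:
            sum0 += draw0()
        # Statistic from (5.4): scaled, standardized difference of outcome sums.
        rho = (sum1 / sigma1 - sum0 / sigma0) / math.sqrt(n)
        if abs(rho) >= gamma:
            return i / n, int(rho >= 0), rho
    # Threshold never reached: stop at T and implement based on the sign.
    return T, int(rho >= 0), rho


# Illustrative run with Gaussian outcomes (all numbers are arbitrary choices).
random.seed(0)
tau, delta, rho = run_policy(
    lambda: random.gauss(0.05, 1.0),
    lambda: random.gauss(0.0, 1.0),
    sigma1=1.0, sigma0=1.0, gamma=0.5, n=1000, T=5.0,
)
print(tau, delta, rho)
```

Time is measured in units of $n$ observations, so the returned `tau` is the number of observations collected divided by $n$.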

6. Variations and extensions

We now consider various modifications of the basic setup and analyze if, and how, the optimal decisions change.

6.1. Batching

In practice, it may be that data is collected in batches instead of one at a time, and the DM can only make decisions after processing each batch. Let $B_{n}$ denote the number of observations considered in each batch. In the context of Section 4, this corresponds to a time duration of $B_{n}/n$. An analysis of the proof of Theorem 2 shows that it continues to hold as long as $B_{n}/n\to 0$. Thus, $\bm{d}_{n,T}$ remains asymptotically minimax optimal in this scenario.

Even for $B_{n}/n\to m\in(0,1)$, the optimal decision rules remain broadly unchanged. Asymptotically, we have equivalence to Gaussian experiments, so we can analyze batched experiments under the diffusion framework by imagining that the stopping time is only allowed to take on discrete values $\{0,1/m,2/m,\dots\}$. It is then clear from the discussion in Section 3.1 that the optimal sampling and implementation rules remain unchanged. The discrete nature of the setting makes determining the optimal stopping rule difficult, but it is easy to show that the decision rule $(\pi^{*},\tau_{m}^{*},\delta^{*})$, where

\[
\tau_{m}^{*}:=\inf\left\{t\in\{0,1/m,2/m,\dots\}:\left|\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right|\geq\gamma^{*}\right\},
\]

while not being exactly optimal, has a minimax regret that is arbitrarily close to $V^{*}$ for large enough $m$ (note that no batched experiment can attain a minimax regret that is lower than $V^{*}$).

6.2. Alternative cost functions

All our results so far were derived under constant sampling costs. The same techniques apply to other types of flow costs as long as these depend only on $\rho(t):=\sigma_{1}^{-1}x_{1}(t)-\sigma_{0}^{-1}x_{0}(t)$. In particular, suppose that the frequentist regret is given by

\[
V\left(\bm{d},\bm{\mu}\right)=\mathbb{E}_{\bm{d}|\bm{\mu}}\left[\max\{\mu_{1}-\mu_{0},0\}-(\mu_{1}-\mu_{0})\delta+\int_{0}^{\tau}c(\rho(t))\,dt\right],
\]

where $c(z)$ is the flow cost of experimentation when $\rho(t)=z$. We require $c(\cdot)$ to be (i) positive, (ii) bounded away from 0, i.e., $\inf_{z}c(z)\geq\underline{c}>0$, and (iii) symmetric, i.e., $c(z)=c(-z)$. By (3.5), $(\sigma_{1}+\sigma_{0})\rho(t)/t$ is an estimate of the treatment effect $\mu_{1}-\mu_{0}$, so the above allows for situations in which sampling costs depend on the magnitude of the estimated treatment effects. While we are not aware of any real world examples of such costs, they could arise if there is feedback between the observations and sampling costs, e.g., if it is harder to find subjects for experimentation when the treatment effect estimates are higher. When there are only two states, the ‘ex-ante’ entropy cost of Sims (2003) is also equivalent to a specific flow cost of the form $c(\cdot)$ above, see Morris and Strack (2019). [Footnote 8: However, we are not aware of any extension of this result to continuous states.]

For the above class of cost functions, we show in Appendix B.3 that the minimax optimal decision rule, $\bm{d}^{*}$, and the least-favorable prior, $p_{\Delta}^{*}$, have the same form as in Theorem 1, but the values of $\gamma^{*},\Delta^{*}$ are different and need to be calculated by solving the minimax problem

\[
\min_{\gamma}\max_{\Delta}\left\{\left(\frac{\sigma_{1}+\sigma_{0}}{2}\right)\frac{\left(1-e^{-\Delta\gamma}\right)\Delta}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{\left(1-e^{-\Delta\gamma}\right)\zeta_{\Delta}(\gamma)+\left(e^{\Delta\gamma}-1\right)\zeta_{\Delta}(-\gamma)}{e^{\Delta\gamma}-e^{-\Delta\gamma}}\right\},
\]

where

\[
\zeta_{\Delta}(x):=2\int_{0}^{x}\int_{0}^{y}e^{\Delta(z-y)}c(z)\,dz\,dy.
\]
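The function $\zeta_{\Delta}(\cdot)$ rarely has a closed form for a general flow cost, so in practice it would have to be evaluated numerically. The sketch below approximates the double integral with a nested midpoint rule in plain Python, and checks the approximation against the closed form obtained by direct integration when the cost is constant, $\zeta_{\Delta}(x)=\frac{2c}{\Delta}\left(x-\frac{1-e^{-\Delta x}}{\Delta}\right)$. The function names, the step count, and the example cost $c(z)=1+z^{2}$ are illustrative assumptions rather than choices made in the paper.

```python
import math


def zeta(x, delta, cost, steps=200):
    """Midpoint-rule approximation of 2 * int_0^x int_0^y exp(delta*(z - y)) * cost(z) dz dy.

    Works for negative x as well, since the step sizes carry the sign of x (oriented integrals).
    """
    hy = x / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * hy
        hz = y / steps
        inner = 0.0
        for j in range(steps):
            z = (j + 0.5) * hz
            inner += math.exp(delta * (z - y)) * cost(z)
        total += inner * hz * hy
    return 2.0 * total


def zeta_constant(x, delta, c):
    """Closed form of zeta when the flow cost is the constant c."""
    return (2.0 * c / delta) * (x - (1.0 - math.exp(-delta * x)) / delta)


# Sanity check against the constant-cost closed form, then a hypothetical symmetric cost.
print(zeta(0.5, 2.0, lambda z: 1.0), zeta_constant(0.5, 2.0, 1.0))
print(zeta(0.5, 2.0, lambda z: 1.0 + z * z))
```

A quadrature routine from a numerical library could replace the hand-written midpoint rule; the plain-Python version is used here only to keep the sketch self-contained.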

Beyond this class of sampling costs, however, it is easy to conceive of scenarios in which the optimal decision rule differs markedly from the one we obtain here. For instance, Neyman allocation would no longer be the optimal sampling rule if the costs for sampling each treatment were different. Alternatively, if $c(\cdot)$ were to depend on $t$, the optimal stopping time could be non-stationary. The analysis of these cost functions is not covered by the present techniques.

6.3. Unknown variances

Replacing unknown variances (and other population quantities) with consistent estimates has no effect on asymptotic regret. We suggest two approaches to attain the minimax lower bounds when the variances are unknown.

The first approach uses ‘forced exploration’ (see, e.g., Lattimore and Szepesvári, 2020, Chapter 33, Note 7): we set $\pi_{n}^{*}=1/2$ for the first $\bar{n}=n^{a}$ observations, where $a\in(0,1)$. This corresponds to a time duration of $\bar{t}=n^{a-1}$. We use the data from these periods to obtain consistent estimates, $\hat{\sigma}_{1}^{2},\hat{\sigma}_{0}^{2}$, of $\sigma_{1}^{2},\sigma_{0}^{2}$. From $\bar{t}$ onwards, we apply the minimax optimal decision $\bm{d}_{n,T}$ after plugging in $\hat{\sigma}_{1},\hat{\sigma}_{0}$ in place of $\sigma_{1},\sigma_{0}$. This strategy is asymptotically minimax optimal for any $a\in(0,1)$. Determining the optimal $a$ in finite samples requires going beyond an asymptotic analysis, and is outside the scope of this paper (in fact, this is also an open question in the computer science literature).

Our second suggestion is to place a prior on $\sigma_{1},\sigma_{0}$, and continuously update their values using posterior means. As a default, we suggest employing an inverse-gamma prior and computing the posterior by treating the outcomes as Gaussian (this is of course justified in the limit). This approach has the advantage of not requiring any tuning parameters.

6.4. Other regret measures

Instead of defining regret, $\max\{\mu(P^{(1)})-\mu(P^{(0)}),0\}-(\mu(P^{(1)})-\mu(P^{(0)}))\delta+c\tau$, using the mean values of $P^{(0)},P^{(1)}$, we can use other functionals of the outcome/welfare distribution in the implementation phase, e.g., $\mu(\cdot)$ could be a quantile function. Note, however, that we still require costs to be linear and additively separable. Let $\psi_{a}(\cdot)$ denote the efficient influence function corresponding to estimation of $\mu(P^{(a)})$. Then, a straightforward extension of the results in Section 5 shows that Theorems 4 and 5 continue to hold, with $x_{a}(t)$ in (5.4) replaced with the efficient influence function process $n^{-1/2}\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)})$, and $\sigma_{a}^{2}$ with $\mathbb{E}_{P_{0}^{(a)}}[\psi_{a}(Y_{i}^{(a)})^{2}]$. See Appendix B.4 for more details.

7. Numerical illustration

A/B testing is commonly used in online platforms for optimizing websites. Consequently, to assess the finite sample performance of our proposed policies, we run a Monte-Carlo simulation calibrated to a realistic example of such an A/B test. Suppose there are two candidate website layouts, with exit rates $p_{0},p_{1}$, and we want to run an A/B test to determine the one with the lowest exit rate. [Footnote 9: The exit rate is defined as the fraction of viewers of a webpage who exit from the website it is part of (i.e., without viewing other pages in that website).] The outcomes are binary, $Y^{(a)}\sim\textrm{Bernoulli}(p_{a})$. This is a parametric setting with score functions $\psi_{a}(Y_{i}^{(a)})=Y_{i}^{(a)}$. We calibrate $p_{0}=0.4$, which is a typical value for an exit rate. The cost of experimentation is normalized to $c=1$ and we consider various values of $n$, corresponding to different ‘population sizes’ (recall that the benefit during implementation is scaled as $n^{3/2}p_{a}$). We then set $p_{1}=p_{0}+\Delta/\sqrt{n}$, and describe the results under varying $\Delta$. We believe local asymptotics provide a good approximation in practice, as the raw performance gains are known to be generally small (typically, $|p_{1}-p_{0}|$ is of the order 0.05 or less; see, e.g., Deng et al., 2013), but they can translate to large profits when applied at scale, i.e., when $n$ is large.

Since $\sigma_{a}=\sqrt{p_{a}(1-p_{a})}$ is unknown, we employ ‘forced sampling’ with $\bar{n}=\max(50,0.05n)$, i.e., using about 5% of the sample, to estimate $\sigma_{1},\sigma_{0}$. Note that the asymptotically optimal sampling rule is always $1/2$ in the Bernoulli setting, so forced sampling is in fact asymptotically costless. We also experimented with a beta prior to continuously update $\sigma_{a}$, but found the results to be somewhat inferior (see Appendix B.5 for details). Figure 7.1, Panel A plots the finite sample frequentist regret profile of our policy rules, $\bm{d}_{n}\equiv\bm{d}_{n,\infty}$ (with $T=\infty$), for various values of $n$, along with that of the minimax optimal policy, $\bm{d}^{*}$, under the diffusion regime; the regret profile of the latter is derived analytically in Lemma 3. It is seen that diffusion asymptotics provide a very good approximation to the finite sample properties of $\bm{d}_{n}$, even for such relatively small values of $n$ as $n=1000$. In practice, A/B tests are run with tens, even hundreds, of thousands of observations. We also see that the max-regret of $\bm{d}_{n}$ is very close to the asymptotic lower bound $V^{*}$ (the max-regret of $\bm{d}^{*}$).

Figure 7.1, Panel B displays some summary statistics for the Bayes regret of $\bm{d}_{n}$ under the least favorable prior, $p_{\Delta^{*}}$. The regret distribution is positively skewed and heavy tailed. The finite sample Bayes regret is again very close to $V^{*}$.

Appendix B.5 reports additional simulation results using Gaussian outcomes.

[Figure 7.1. Finite sample performance of $\bm{d}_{n}$. Panel A: Frequentist regret profiles. Panel B: Performance under least-favorable prior.]

Note: The solid curve in Panel A is the regret profile of $\bm{d}^{*}$; the vertical red line denotes $\Delta^{*}$. We only plot the results for $\Delta>0$ as the values are close to symmetric. The dashed red line in Panel B is $V^{*}$, the asymptotic minimax regret. Black lines within the bars denote the Bayes regret in finite samples, under the least favorable prior. The bars describe the interquartile range of regret.
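A stripped-down version of this kind of Monte-Carlo exercise can be sketched in a few lines of Python. The sketch is not the simulation code behind Figure 7.1 and makes several simplifying assumptions that should be read as such: higher outcomes are treated as better (to match the regret definition), the forced-sampling step is skipped and a common plug-in value $\sigma=\sqrt{p_{0}(1-p_{0})}$ is used for both treatments, units are sampled in pairs (one per treatment, consistent with the $1/2$ allocation), the threshold is set to $\gamma_{0}^{*}/\eta$ with $\eta^{3}=2c/(\sigma_{1}+\sigma_{0})$ (the scaling implied by comparing Lemma 2 with the definition of $R$ in Lemma 3), and $\gamma_{0}^{*}\approx 0.536$ is an assumed approximate value. All function names and default arguments are hypothetical.

```python
import math
import random

GAMMA0 = 0.536  # assumed approximate value of gamma_0^* (equilibrium threshold when eta = 1)


def simulate_once(n, p0, delta, c, rng, T=50.0):
    """One run with Bernoulli outcomes; returns (regret, chose_treatment_1, tau)."""
    p1 = p0 + delta / math.sqrt(n)
    sigma = math.sqrt(p0 * (1.0 - p0))              # common plug-in sd (simplifying assumption)
    eta = (2.0 * c / (2.0 * sigma)) ** (1.0 / 3.0)  # eta^3 = 2c / (sigma_1 + sigma_0)
    gamma = GAMMA0 / eta
    diff = 0          # sum of treatment-1 outcomes minus sum of treatment-0 outcomes
    rho = 0.0
    pairs = 0
    max_pairs = int(n * T / 2)                      # large finite cap standing in for T = infinity
    while pairs < max_pairs:
        pairs += 1
        diff += int(rng.random() < p1) - int(rng.random() < p0)
        rho = diff / (sigma * math.sqrt(n))
        if abs(rho) >= gamma:
            break
    tau = 2.0 * pairs / n                           # time in units of n observations
    chose_1 = rho >= 0
    # Scaled regret: sqrt(n) * (max{mu1 - mu0, 0} - (mu1 - mu0) * delta_rule) + c * tau.
    regret = max(delta, 0.0) - delta * chose_1 + c * tau
    return regret, chose_1, tau


def monte_carlo(n=1000, p0=0.4, delta=2.0, c=1.0, reps=2000, seed=0):
    """Average regret, share of runs implementing treatment 1, and average stopping time."""
    rng = random.Random(seed)
    runs = [simulate_once(n, p0, delta, c, rng) for _ in range(reps)]
    mean_regret = sum(r[0] for r in runs) / reps
    share_1 = sum(r[1] for r in runs) / reps
    mean_tau = sum(r[2] for r in runs) / reps
    return mean_regret, share_1, mean_tau


mean_regret, share_1, mean_tau = monte_carlo()
print(f"mean regret {mean_regret:.3f}, share choosing treatment 1 {share_1:.3f}, mean tau {mean_tau:.3f}")
```

Repeating `monte_carlo` over a grid of `delta` values traces out a finite sample regret profile of the kind described for Panel A, under the simplifications listed above.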

8. Conclusion

This paper proposes a minimax optimal procedure for determining the best treatment when sampling is costly. The optimal sampling rule is just the Neyman allocation, while the optimal stopping rule is time-stationary and advises that the experiment be terminated when the average difference in outcomes multiplied by the number of observations exceeds a specific threshold. While these rules were derived under diffusion asymptotics, it is shown that finite sample counterparts of these rules remain optimal under both parametric and non-parametric regimes. The form of these rules is robust to a number of different variations of the original problem, e.g., under batching, different cost functions etc. Given the simple nature of these rules, and the potential for large sample efficiency gains (requiring, on average, 40% fewer observations than standard approaches), we believe they hold a lot of promise for practical use.

The paper also raises a number of avenues for future research. While our results were derived for binary treatments, multiple treatments are common in practice, and it would be useful to derive the optimal decision rules in this setting. We do expect, however, that in this case the optimal sampling rule would no longer be fixed, but history dependent. As noted previously, our setting also does not cover discounting and asymmetric cost functions. It is hoped that the techniques developed in this paper could help answer some of these outstanding questions.

References

  • Adusumilli (2021) K. Adusumilli, “Risk and optimal policies in bandit experiments,” arXiv preprint arXiv:2112.06363, 2021.
  • Armstrong (2022) T. B. Armstrong, “Asymptotic efficiency bounds for a class of experimental designs,” arXiv preprint arXiv:2205.02726, 2022.
  • Arrow et al. (1949) K. J. Arrow, D. Blackwell, and M. A. Girshick, “Bayes and minimax solutions of sequential decision problems,” Econometrica, pp. 213–244, 1949.
  • Barles and Souganidis (1991) G. Barles and P. E. Souganidis, “Convergence of approximation schemes for fully nonlinear second order equations,” Asymptotic analysis, vol. 4, no. 3, pp. 271–283, 1991.
  • Berger (2013) J. O. Berger, Statistical decision theory and Bayesian analysis.   Springer Science & Business Media, 2013.
  • Bertsekas (2012) D. Bertsekas, Dynamic programming and optimal control: Volume II.   Athena scientific, 2012, vol. 1.
  • Chernoff (1959) H. Chernoff, “Sequential design of experiments,” The Annals of Mathematical Statistics, vol. 30, no. 3, pp. 755–770, 1959.
  • Chernoff and Petkau (1981) H. Chernoff and A. J. Petkau, “Sequential medical trials involving paired data,” Biometrika, vol. 68, no. 1, pp. 119–132, 1981.
  • Colton (1963) T. Colton, “A model for selecting one of two medical treatments,” Journal of the American Statistical Association, vol. 58, no. 302, pp. 388–400, 1963.
  • Deng et al. (2013) A. Deng, Y. Xu, R. Kohavi, and T. Walker, “Improving the sensitivity of online controlled experiments by utilizing pre-experiment data,” in Proceedings of the sixth ACM international conference on Web search and data mining, 2013, pp. 123–132.
  • Fan and Glynn (2021) L. Fan and P. W. Glynn, “Diffusion approximations for thompson sampling,” arXiv preprint arXiv:2105.09232, 2021.
  • Fehr and Rangel (2011) E. Fehr and A. Rangel, “Neuroeconomic foundations of economic choice–recent advances,” Journal of Economic Perspectives, vol. 25, no. 4, pp. 3–30, 2011.
  • Fudenberg et al. (2018) D. Fudenberg, P. Strack, and T. Strzalecki, “Speed, accuracy, and the optimal timing of choices,” American Economic Review, vol. 108, no. 12, pp. 3651–84, 2018.
  • Garivier and Kaufmann (2016) A. Garivier and E. Kaufmann, “Optimal best arm identification with fixed confidence,” in Conference on Learning Theory.   PMLR, 2016, pp. 998–1027.
  • Hébert and Woodford (2017) B. Hébert and M. Woodford, “Rational inattention and sequential information sampling,” National Bureau of Economic Research, Tech. Rep., 2017.
  • Hirano and Porter (2009) K. Hirano and J. R. Porter, “Asymptotics for statistical treatment rules,” Econometrica, vol. 77, no. 5, pp. 1683–1701, 2009.
  • Karatzas and Shreve (2012) I. Karatzas and S. Shreve, Brownian motion and stochastic calculus.   Springer Science & Business Media, 2012, vol. 113.
  • Lai et al. (1980) T. Lai, B. Levin, H. Robbins, and D. Siegmund, “Sequential medical trials,” Proc. Natl. Acad. Sci. U.S.A., vol. 77, no. 6, pp. 3135–3138, 1980.
  • Lattimore and Szepesvári (2020) T. Lattimore and C. Szepesvári, Bandit algorithms.   Cambridge University Press, 2020.
  • Le Cam and Yang (2000) L. Le Cam and G. L. Yang, Asymptotics in statistics: some basic concepts.   Springer Science & Business Media, 2000.
  • Liang et al. (2022) A. Liang, X. Mu, and V. Syrgkanis, “Dynamically aggregating diverse information,” Econometrica, vol. 90, no. 1, pp. 47–80, 2022.
  • Luce et al. (1986) R. D. Luce et al., Response times: Their role in inferring elementary mental organization.   Oxford University Press on Demand, 1986, no. 8.
  • Manski (2021) C. F. Manski, “Econometrics for decision making: Building foundations sketched by haavelmo and wald,” Econometrica, vol. 89, no. 6, pp. 2827–2853, 2021.
  • Manski and Tetenov (2016) C. F. Manski and A. Tetenov, “Sufficient trial size to inform clinical practice,” Proc. Natl. Acad. Sci. U.S.A., vol. 113, no. 38, pp. 10 518–10 523, 2016.
  • Morris and Strack (2019) S. Morris and P. Strack, “The wald problem and the relation of sequential sampling and ex-ante information costs,” Available at SSRN 2991567, 2019.
  • Øksendal (2003) B. Øksendal, “Stochastic differential equations,” in Stochastic differential equations.   Springer, 2003, pp. 65–84.
  • Qin and Russo (2022) C. Qin and D. Russo, “Adaptivity and confounding in multi-armed bandit experiments,” arXiv preprint arXiv:2202.09036, 2022.
  • Qin et al. (2017) C. Qin, D. Klabjan, and D. Russo, “Improving the expected improvement algorithm,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • Ratcliff and McKoon (2008) R. Ratcliff and G. McKoon, “The diffusion decision model: theory and data for two-choice decision tasks,” Neural computation, vol. 20, no. 4, pp. 873–922, 2008.
  • Reikvam (1998) K. Reikvam, “Viscosity solutions of optimal stopping problems,” Stochastics and Stochastic Reports, vol. 62, no. 3-4, pp. 285–301, 1998.
  • Shiryaev (2007) A. N. Shiryaev, Optimal stopping rules.   Springer Science & Business Media, 2007.
  • Sims (2003) C. A. Sims, “Implications of rational inattention,” Journal of monetary Economics, vol. 50, no. 3, pp. 665–690, 2003.
  • Sion (1958) M. Sion, “On general minimax theorems.” Pacific Journal of mathematics, vol. 8, no. 1, pp. 171–176, 1958.
  • US Food and Drug Admin. (2018) US Food and Drug Admin., “FDA In Brief: FDA launches new pilot to advance innovative clinical trial designs as part agency’s broader program to modernize drug development and promote innovation in drugs targeted to unmet needs,” 2018. [Online]. Available: "https://www.fda.gov/news-events/fda-brief/fda-brief-fda-modernizes-clinical-trial-designs-and-approaches-drug-development-proposing-new"
  • Van der Vaart (2000) A. W. Van der Vaart, Asymptotic statistics.   Cambridge university press, 2000.
  • Van Der Vaart and Wellner (1996) A. W. Van Der Vaart and J. Wellner, Weak convergence and empirical processes: with applications to statistics.   Springer Science & Business Media, 1996.
  • Wager and Xu (2021) S. Wager and K. Xu, “Diffusion asymptotics for sequential experiments,” arXiv preprint arXiv:2101.09855, 2021.
  • Wald (1945) A. Wald, “Statistical decision functions which minimize the maximum risk,” Annals of Mathematics, pp. 265–280, 1945.
  • Wald (1947) ——, “Sequential analysis,” Tech. Rep., 1947.
  • Zaks (2020) T. Zaks, “A phase 3, randomized, stratified, observer-blind, placebo-controlled study to evaluate the efficacy, safety, and immunogenicity of mrna-1273 sars-cov-2 vaccine in adults aged 18 years and older,” Protocol Number mRNA-1273-P301. ModernaTX (20 August 2020) https://www. modernatx. com/sites/default/files/mRNA-1273-P301-Protocol. pdf, 2020.

Appendix A Proofs

A.1. Proof of Theorem 1

The proof makes use of the following lemmas:

Lemma 1.

Suppose nature sets $p_{0}$ to be a symmetric two-point prior supported on $(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$ and $(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$. Then the decision $d(\Delta)=(\pi^{*},\tau_{\gamma(\Delta)},\delta^{*})$, where $\gamma(\Delta)$ is defined in (A.3), is a best response by the DM.

Proof.

The prior is an indifference-inducing one, so by the argument given in Section 3.1, the DM is indifferent between any sampling rule $\pi$. Thus, $\pi_{a}^{*}=\sigma_{a}/(\sigma_{1}+\sigma_{0})$ is a best response to this prior. Also, the prior is symmetric with $m_{0}=1/2$, so by (2.5), the Bayes optimal implementation rule is

\[
\delta^{*}=\mathbb{I}\left\{\ln\varphi^{\pi}(\tau)\geq 0\right\}=\mathbb{I}\left\{\frac{x_{1}(\tau)}{\sigma_{1}}-\frac{x_{0}(\tau)}{\sigma_{0}}\geq 0\right\}.
\]

It remains to compute the Bayes optimal stopping time. Let $\theta=1$ denote the state when the prior is $(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$, with $\theta=0$ otherwise. The discussion in Section 3.1 implies that, conditional on $\theta$, the likelihood ratio process $\varphi^{\pi}(t)$ does not depend on $\pi$ and evolves as

\[
d\ln\varphi(t)=(2\theta-1)\frac{\Delta^{2}}{2}dt+\Delta\,d\tilde{W}(t),
\]

where $\tilde{W}(\cdot)$ is one-dimensional Brownian motion. By a similar argument as in Shiryaev (2007, Section 4.2.1), this in turn implies that the posterior probability $m^{\pi}(t):=\mathbb{P}^{\pi}(\theta=1|\mathcal{F}_{t})$ is also independent of $\pi$ and evolves as

\[
dm(t)=\Delta m(t)(1-m(t))\,d\tilde{W}(t).
\]

Therefore, by (2.7) the optimal stopping time also does not depend on $\pi$ and is given by

\begin{align}
\tau(\Delta) &=\inf_{\tau\in\mathcal{T}}\mathbb{E}\left[\varpi(m(\tau))+c\tau\right],\ \textrm{where} \tag{A.1}\\
\varpi(m) &:=\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\min\left\{m,1-m\right\}. \tag{A.2}
\end{align}

Inspection of the objective function in (A.1) shows that this is exactly the same objective as in the Bayesian hypothesis testing problem, analyzed previously by Arrow et al. (1949) and Morris and Strack (2019). We follow the analysis of the latter paper. Morris and Strack (2019) show that instead of choosing the stopping time $\tau$, it is equivalent to imagine that the DM chooses a probability distribution $G$ over the posterior beliefs $m(\tau)$ at an ‘ex-ante’ cost

\[
c(G)=\frac{2c}{\Delta^{2}}\int(1-2m)\ln\frac{1-m}{m}\,dG(m),
\]

subject to the constraint $\int m\,dG(m)=m_{0}=1/2$. Under the distribution $G$, the expected regret, exclusive of sampling costs, for the DM is

\[
\int\varpi(m)\,dG(m)=\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\int\min\{m,1-m\}\,dG(m).
\]

Hence, the stopping time, $\tau$, that solves (A.1) is the one that induces the distribution $G^{*}$, defined as

\begin{align*}
G^{*} &=\operatorname*{arg\,min}_{G:\int m\,dG(m)=\frac{1}{2}}\left\{c(G)+\int\varpi(m)\,dG(m)\right\}\\
&=\operatorname*{arg\,min}_{G:\int m\,dG(m)=\frac{1}{2}}\int f(m)\,dG(m),
\end{align*}

where

\[
f(m):=\frac{2c}{\Delta^{2}}(1-2m)\ln\frac{1-m}{m}+\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\min\{m,1-m\}.
\]

Clearly, $f(m)=f(1-m)$. Hence, setting

\[
\alpha(\Delta):=\operatorname*{arg\,min}_{\alpha\in\left[0,\frac{1}{2}\right]}\left\{\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\alpha+\frac{2c}{\Delta^{2}}(1-2\alpha)\ln\frac{1-\alpha}{\alpha}\right\},
\]

it is easy to see that $G^{*}$ is a two-point distribution, supported on $\alpha(\Delta),1-\alpha(\Delta)$ with equal probability $1/2$. By Shiryaev (2007, Section 4.2.1), this distribution is induced by the stopping time $\tau_{\gamma(\Delta)}$, where

\[
\gamma(\Delta):=\frac{1}{\Delta}\ln\frac{1-\alpha(\Delta)}{\alpha(\Delta)}. \tag{A.3}
\]

Hence, this stopping time is the best response to nature’s prior. ∎
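For given values of $\Delta,\sigma_{1},\sigma_{0},c$, the best response in Lemma 1 is straightforward to compute numerically. The sketch below minimizes the objective in the definition of $\alpha(\Delta)$ by ternary search (the objective is convex in $\alpha$ on $(0,1/2]$, being the sum of a linear term and a positive multiple of $(1-2\alpha)\ln\frac{1-\alpha}{\alpha}$), and then applies (A.3) to recover the threshold. The function name, the search bracket, and the iteration count are illustrative assumptions.

```python
import math


def best_response(delta, sigma1, sigma0, c, iters=200):
    """Return (alpha, gamma): the minimizer defining alpha(Delta) and the threshold from (A.3)."""
    scale = (sigma1 + sigma0) / 2.0

    def objective(alpha):
        info_cost = (2.0 * c / delta ** 2) * (1.0 - 2.0 * alpha) * math.log((1.0 - alpha) / alpha)
        return scale * delta * alpha + info_cost

    lo, hi = 1e-12, 0.5          # stay away from alpha = 0, where the log term diverges
    for _ in range(iters):       # ternary search on a convex objective
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    alpha = 0.5 * (lo + hi)
    gamma = math.log((1.0 - alpha) / alpha) / delta
    return alpha, gamma


# Illustrative call; the parameter values are arbitrary.
print(best_response(delta=2.0, sigma1=1.0, sigma0=1.0, c=1.0))
```

The returned `alpha` is the posterior belief at which the DM stops in favor of the less likely state, and `gamma` is the corresponding threshold for $\left|x_{1}(t)/\sigma_{1}-x_{0}(t)/\sigma_{0}\right|$.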

Lemma 2.

Suppose $\bm{\mu}$ is such that $|\mu_{1}-\mu_{0}|=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta$. Then, for any $\gamma,\Delta>0$,

\[
V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)=\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{2c\gamma}{\Delta}\frac{e^{\Delta\gamma}+e^{-\Delta\gamma}-2}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\]

Thus, the frequentist regret of $\tilde{\bm{d}}_{\gamma}$ depends on $\bm{\mu}$ only through $|\mu_{1}-\mu_{0}|$.

Proof.

Suppose that $\mu_{1}>\mu_{0}$. Define

\[
\lambda(t):=\Delta\left\{\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right\}.
\]

Note that under $\tilde{\bm{d}}_{\gamma}$ and $\bm{\mu}$,

\[
\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}=\frac{\Delta}{2}t+\tilde{W}(t),
\]

where $\tilde{W}(\cdot)$ is one-dimensional Brownian motion. Hence $\lambda(t)=\frac{\Delta^{2}}{2}t+\Delta\tilde{W}(t)$. We can write the stopping time $\tau_{\gamma}$ in terms of $\lambda(t)$ as

\[
\tau_{\gamma}=\inf\left\{t:\left|\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\right|\geq\gamma\right\}=\inf\left\{t:|\lambda(t)|\geq\Delta\gamma\right\},
\]

and the implementation rule as $\delta^{*}=\mathbb{I}\left\{\lambda(\tau)\geq 0\right\}=\mathbb{I}\left\{\lambda(\tau)=\Delta\gamma\right\}$.

Now, noting the form of $\lambda(t)$, we can apply similar arguments as in Shiryaev (2007, Section 4.2, Lemma 5), to show that

\[
\mathbb{E}\left[\tau_{\gamma}|\bm{\mu}\right]=\frac{2}{\Delta^{2}}\frac{\Delta\gamma\left(e^{\Delta\gamma}+e^{-\Delta\gamma}-2\right)}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\]

Since $\mu_{1}>\mu_{0}$, regret from implementation is incurred only when the DM wrongly selects treatment 0, i.e., when $\lambda(\cdot)$ exits through the lower boundary. Following Shiryaev (2007, Section 4.2, Lemma 4), the probability of this event is

\[
\mathbb{P}(\delta^{*}=0|\bm{\mu})=\mathbb{P}(\lambda(\tau)=-\Delta\gamma|\bm{\mu})=\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\]

Hence, the frequentist regret is given by

\begin{align*}
V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right) &=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta\,\mathbb{P}(\delta^{*}=0|\bm{\mu})+c\,\mathbb{E}\left[\tau_{\gamma}|\bm{\mu}\right]\\
&=\frac{(\sigma_{1}+\sigma_{0})}{2}\Delta\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{2c\gamma}{\Delta}\frac{e^{\Delta\gamma}+e^{-\Delta\gamma}-2}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\end{align*}

While the above was shown under $\mu_{1}>\mu_{0}$, an analogous argument under $\mu_{1}<\mu_{0}$ gives the same expression for $V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)$. ∎
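The two ratios appearing in Lemma 2 simplify considerably, which can be convenient for numerical work and for reading off limiting cases. The following identities are elementary algebra rather than part of the original argument. Writing $x=\Delta\gamma$ and using $e^{x}-e^{-x}=(1-e^{-x})(1+e^{x})=(e^{x/2}-e^{-x/2})(e^{x/2}+e^{-x/2})$ together with $e^{x}+e^{-x}-2=(e^{x/2}-e^{-x/2})^{2}$, we have

\[
\frac{1-e^{-x}}{e^{x}-e^{-x}}=\frac{1}{1+e^{x}},\qquad\frac{e^{x}+e^{-x}-2}{e^{x}-e^{-x}}=\tanh\left(\frac{x}{2}\right),
\]

so that the regret in Lemma 2 can equivalently be written as

\[
V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)=\frac{(\sigma_{1}+\sigma_{0})}{2}\frac{\Delta}{1+e^{\Delta\gamma}}+\frac{2c\gamma}{\Delta}\tanh\left(\frac{\Delta\gamma}{2}\right).
\]

In this form it is immediate that the first term tends to $(\sigma_{1}+\sigma_{0})\Delta/4$ as $\gamma\to 0$ (the DM decides by a coin flip and pays no sampling cost), while for large $\gamma$ the first term vanishes and the sampling cost grows like $2c\gamma/\Delta$.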

Lemma 3.

Consider a two-player zero-sum game in which nature chooses a symmetric two-point prior supported on $(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$ and $(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$ for some $\Delta>0$ and the DM chooses $\bm{d}_{\gamma}=(\pi^{*},\tau_{\gamma},\delta^{*})$ for some $\gamma>0$. There exists a unique Nash equilibrium to this game at $\Delta^{*}=\eta\Delta_{0}^{*}$ and $\gamma^{*}=\eta^{-1}\gamma_{0}^{*}$, where $\eta,\Delta_{0}^{*},\gamma_{0}^{*}$ are defined in Section 3.

Proof.

Let $p_{\Delta}$ be the symmetric two-point prior supported on $(\sigma_{1}\Delta/2,-\sigma_{0}\Delta/2)$ and $(-\sigma_{1}\Delta/2,\sigma_{0}\Delta/2)$. By Lemma 2, the frequentist regret under a given choice of $\Delta:=2|\mu_{1}-\mu_{0}|/(\sigma_{1}+\sigma_{0})$ and $\gamma$ is given by $\frac{(\sigma_{1}+\sigma_{0})}{2}R(\gamma,\Delta)$, where

\[
R(\gamma,\Delta):=\Delta\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{2\eta^{3}\gamma}{\Delta}\frac{e^{\Delta\gamma}+e^{-\Delta\gamma}-2}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\]

Lemma 2 further implies that the frequentist regret $V(\bm{d}^{*},\bm{\mu})$ depends on $\bm{\mu}$ only through $\Delta$. Therefore, the frequentist regret under both support points of $p_{\Delta}$ must be the same. Hence, the Bayes regret, $V(\bm{d}_{\gamma},p_{\Delta})$, is the same as the frequentist regret at each support point, i.e.,

\[
V(\bm{d}_{\gamma},p_{\Delta})=\frac{(\sigma_{1}+\sigma_{0})}{2}R(\gamma,\Delta). \tag{A.4}
\]

We aim to find a Nash equilibrium in a two-player game in which nature chooses $p_{\Delta}$, equivalently $\Delta$, to maximize $R(\gamma,\Delta)$, while the DM chooses $\bm{d}_{\gamma}$, equivalently $\gamma$, to minimize $R(\gamma,\Delta)$.

For $\eta=1$, the unique Nash equilibrium to this game is given by $\Delta=\Delta_{0}^{*}$ and $\gamma=\gamma_{0}^{*}$. We start by first demonstrating the existence of a unique Nash equilibrium. This is guaranteed by Sion’s minimax theorem (Sion, 1958) as long as $R(\gamma,\Delta)$ is continuous in both arguments (which is easily verified), and ‘convex quasi-concave’ on $\mathbb{R}^{+}\times\mathbb{R}^{+}\backslash\{0\}$. [Footnote 10: In fact, convexity can be replaced with quasi-convexity for the theorem.] To show convexity in the first argument, write $R(\cdot,\Delta)=R_{1}(\alpha(\cdot,\Delta),\Delta)$ where

\begin{align*}
R_{1}(\alpha,\Delta) &:=\Delta\alpha+\frac{2}{\Delta^{2}}(1-2\alpha)\ln\frac{1-\alpha}{\alpha};\ \textrm{and}\\
\alpha(\gamma,\Delta) &:=\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\end{align*}

Now, for any fixed $\Delta>0$, it is easy to verify that $R_{1}(\cdot,\Delta)$ and $\alpha(\cdot,\Delta)$ are convex over the domain $\mathbb{R}^{+}$. Since the composition of convex functions is also convex, this proves convexity of $R(\cdot,\Delta)$. To prove $R(\gamma,\cdot)$ is quasi-concave, write $R(\gamma,\cdot)=R_{2}(\gamma,\alpha(\gamma,\cdot))$, where

\[
R_{2}(\gamma,\alpha):=\frac{1}{\gamma}\alpha\ln\frac{1-\alpha}{\alpha}+2\gamma^{2}\frac{(1-2\alpha)}{\ln\frac{1-\alpha}{\alpha}}.
\]

Now, $\alpha\ln\frac{1-\alpha}{\alpha}$ and $(1-2\alpha)/\ln\frac{1-\alpha}{\alpha}$ are concave over $\mathbb{R}^{+}\backslash\{0\}$, so $R_{2}(\gamma,\cdot)$ is also concave over $\mathbb{R}^{+}\backslash\{0\}$ for any fixed $\gamma>0$. Concavity implies the level set $\left\{\alpha:R_{2}(\gamma,\alpha)\geq\nu\right\}$ is a closed interval in $\mathbb{R}^{+}\backslash\{0\}$ for any $\nu\in\mathbb{R}$. But $\alpha(\gamma,\cdot)$ is positive and strictly decreasing, so for a fixed $\gamma>0$,

\[
\left\{\Delta:R(\gamma,\Delta)\geq\nu\right\}\equiv\left\{\Delta:R_{2}(\gamma,\alpha(\gamma,\Delta))\geq\nu\right\}
\]

is also a closed interval in $\mathbb{R}^{+}\backslash\{0\}$, and therefore, convex, for any $\nu\in\mathbb{R}$. This proves quasi-concavity of $R(\gamma,\cdot)$ whenever $\gamma>0$. At the same time, $R(\gamma,\Delta)\to\Delta/2$ when $\gamma\to 0$; hence, $R(\gamma,\cdot)$ is in fact quasi-concave for any $\gamma\geq 0$. We thus conclude by Sion’s theorem that the Nash equilibrium exists and is unique. It is then routine to numerically compute $\Delta_{0}^{*},\gamma_{0}^{*}$ through first-order conditions; we skip these calculations, which are straightforward. Figure A.1 provides a graphical illustration of the Nash equilibrium.
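The numerical computation mentioned above can be illustrated with a minimal sketch. It is not the first-order-condition calculation skipped in the text: it assumes $\eta=1$, evaluates $R(\gamma,\Delta)=R_{1}(\alpha(\gamma,\Delta),\Delta)$ on a grid, and takes the minimum over $\gamma$ of the maximum over $\Delta$. The helper names (`alpha`, `regret`, `grid`, `saddle_point`) and the grid ranges are hypothetical choices made for illustration only.

```python
import math


def alpha(gamma, delta):
    """alpha(gamma, Delta) exactly as defined in the text."""
    x = delta * gamma
    return (1.0 - math.exp(-x)) / (math.exp(x) - math.exp(-x))


def regret(gamma, delta):
    """R(gamma, Delta) = R_1(alpha(gamma, Delta), Delta), assuming eta = 1."""
    a = alpha(gamma, delta)
    return delta * a + (2.0 / delta**2) * (1.0 - 2.0 * a) * math.log((1.0 - a) / a)


def grid(lo, hi, n):
    """Evenly spaced grid of n points on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def saddle_point(gammas, deltas):
    """Grid approximation of min over gamma of max over Delta of R."""
    best = None
    for g in gammas:
        # nature's (approximate) best response to this gamma
        worst_delta = max(deltas, key=lambda d: regret(g, d))
        worst_value = regret(g, worst_delta)
        if best is None or worst_value < best[2]:
            best = (g, worst_delta, worst_value)
    return best


# Grid ranges are illustrative assumptions, not values from the paper.
gamma0, delta0, value0 = saddle_point(grid(0.05, 1.5, 200), grid(0.05, 6.0, 600))
print(f"gamma ~ {gamma0:.3f}, Delta ~ {delta0:.3f}, R ~ {value0:.4f}")
```

Because existence and uniqueness of the equilibrium are established above, the grid min-max approximates the same point that solving the first-order conditions would; a finer grid or a root finder would sharpen the estimate.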

It remains to determine the Nash equilibrium under general $\eta$. By the form of $R(\gamma,\Delta)$, if $\gamma_{0}^{*}$ is a best response to $\Delta_{0}^{*}$ for $\eta=1$, then $\eta^{-1}\gamma_{0}^{*}$ is a best response to $\eta\Delta_{0}^{*}$ for general $\eta$. Similarly, if $\Delta_{0}^{*}$ is a best response to $\gamma_{0}^{*}$ for $\eta=1$, then $\eta\Delta_{0}^{*}$ is a best response to $\eta^{-1}\gamma_{0}^{*}$ for general $\eta$. This proves $\Delta^{*}:=\eta\Delta_{0}^{*}$ and $\gamma^{*}:=\eta^{-1}\gamma_{0}^{*}$ is a Nash equilibrium in the general case. ∎


Note: The red curve describes the best response of $\Delta$ to a given $\gamma$, while the blue curve describes the best response of $\gamma$ to a given $\Delta$. The point of intersection is the Nash equilibrium. This is for $\eta=1$.

Figure A.1. Best responses and Nash equilibrium

We now complete the proof of Theorem 1: By Lemma 1, $\bm{d}^{*}$ is the optimal Bayes decision corresponding to $p_{0}^{*}$. We now show

\[
\sup_{\bm{\mu}}V(\bm{d}^{*},\bm{\mu})=V(\bm{d}^{*},p_{0}^{*}), \tag{A.5}
\]

which implies $\bm{d}^{*}$ is minimax optimal according to the verification theorem in Berger (2013, Theorem 17). To this end, recall from Lemma 2 that the frequentist regret $V(\bm{d}^{*},\bm{\mu})$ depends on $\bm{\mu}$ only through $\Delta:=2|\mu_{1}-\mu_{0}|/(\sigma_{1}+\sigma_{0})$. Furthermore, by Lemma 3, $\Delta^{*}$ is the best response of nature to $\bm{d}^{*}$. These results imply

\[
\sup_{\bm{\mu}}V(\bm{d}^{*},\bm{\mu})=\frac{(\sigma_{1}+\sigma_{0})}{2}\sup_{\Delta}R(\gamma^{*},\Delta)=\frac{(\sigma_{1}+\sigma_{0})}{2}R(\gamma^{*},\Delta^{*}).
\]

But by (A.4), we also have $V(\bm{d}^{*},p_{0}^{*})=\frac{(\sigma_{1}+\sigma_{0})}{2}R(\gamma^{*},\Delta^{*})$. This proves (A.5).

A.2. Proof of Corollary 1

We employ the same strategy as in the proof of Theorem 1. Suppose nature employs the indifference prior $p_{\Delta}$, for any $\Delta>0$. Then by similar arguments as earlier, the DM is indifferent between any sampling rule $\pi$, and the optimal implementation rule is $\delta^{*}=\mathbb{I}\left\{\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}\geq 0\right\}$.

We now determine nature’s best response to the DM choosing $\bm{d}^{*}=(\pi^{*},\delta^{*})$, where $\pi^{*}$ is the Neyman allocation. Consider an arbitrary $\bm{\mu}=(\mu_{1},\mu_{0})$ such that $|\mu_{1}-\mu_{0}|=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta$. Suppose $\mu_{1}>\mu_{0}$. Under $\pi^{*}$,

\[
\frac{dx_{1}(t)}{\sigma_{1}}-\frac{dx_{0}(t)}{\sigma_{0}}=\frac{\Delta}{2}dt+d\tilde{W}(t),
\]

where $\tilde{W}(\cdot)$ is the standard Wiener process, so the expected regret under $\bm{d}^{*},\bm{\mu}$ is

\[
V\left(\bm{d}^{*},\bm{\mu}\right)=(\mu_{1}-\mu_{0})\mathbb{P}\left(\frac{x_{1}(1)}{\sigma_{1}}-\frac{x_{0}(1)}{\sigma_{0}}\leq 0\right)=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta\Phi\left(-\frac{\Delta}{2}\right). \tag{A.6}
\]

An analogous argument shows that the same expression holds when $\mu_{0}>\mu_{1}$ as well. Consequently, nature’s optimal choice of $\bm{\mu}$ is to set $\Delta$ to $\bar{\Delta}^{*}=2\arg\max_{\delta}\delta\Phi\left(-\delta\right)$, but is otherwise indifferent between any $\bm{\mu}$ such that $|\mu_{1}-\mu_{0}|=\frac{\sigma_{1}+\sigma_{0}}{2}\bar{\Delta}^{*}$. Thus, $p_{\bar{\Delta}^{*}}$ is a best response by nature to the DM’s choice of $\bm{d}^{*}=(\pi^{*},\delta^{*})$.
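As a rough numerical illustration of nature’s best response, the sketch below maximizes $\Delta\Phi(-\Delta/2)$, which is (A.6) divided by $(\sigma_{1}+\sigma_{0})/2$; its maximizer equals $2\arg\max_{\delta}\delta\Phi(-\delta)$. The function names, the ternary search, and the search interval $[0,10]$ are assumptions made for this example and are not taken from the paper.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF Phi(x), written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def scaled_regret(delta):
    """Delta * Phi(-Delta / 2), i.e. (A.6) divided by (sigma_1 + sigma_0) / 2."""
    return delta * std_normal_cdf(-delta / 2.0)


def argmax_unimodal(f, lo, hi, tol=1e-10):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# The search interval [0, 10] is an illustrative assumption.
delta_bar = argmax_unimodal(scaled_regret, 0.0, 10.0)
print(f"Delta-bar ~ {delta_bar:.4f}, scaled regret ~ {scaled_regret(delta_bar):.4f}")
```

Ternary search is adequate here because the objective rises from zero, peaks once, and then decays; any one-dimensional optimizer would serve equally well.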

We have thereby shown $p_{\bar{\Delta}^{*}},\bm{d}^{*}$ form a Nash equilibrium. That $\bm{d}^{*}$ is minimax optimal then follows by similar arguments as in the proof of Theorem 1.

A.3. Proof of Theorem 2

Our aim is to show (4.3). The outline of the proof is as follows: First, as in Adusumilli (2021), we replace the true marginal and posterior distributions with suitable approximations. Next, we apply dynamic programming arguments and viscosity solution techniques to obtain a HJB-variational inequality (HJB-VI) for the value function in the experiment. Finally, the HJB-VI is connected to the problem of determining the optimal stopping time under diffusion asymptotics.

Step 0 (Definitions and preliminary observations)

Under $m_{0}^{*}$, let $\gamma=1$ denote the state $(h_{1}^{*},-h_{0}^{*})$ and $\gamma=0$ the state $(-h_{1}^{*},h_{0}^{*})$. Also, let ${\bf y}_{nq}^{(a)}:=\{Y_{i}^{(a)}\}_{i=1}^{\left\lfloor nq\right\rfloor}$ denote the stacked representation of outcomes $Y_{i}^{(a)}$ from the first $nq$ observations corresponding to treatment $a$, and for any $\bm{h}:=(h_{1},h_{0})$, take $P_{nq_{1},nq_{0},\bm{h}}$ to be the distribution corresponding to the joint density $p_{nq_{1},h_{1}}({\bf y}_{nq_{1}}^{(1)})\cdot p_{nq_{0},h_{0}}({\bf y}_{nq_{0}}^{(0)})$, where

\[
p_{nq,h_{a}}({\bf y}_{nq}^{(a)}):=\prod_{i=1}^{nq}p_{n,h_{a}}(Y_{i}^{(a)}).
\]

Also, define $\bar{P}_{n}$ as the marginal distribution of $\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)$, i.e., it is the probability measure whose density, with respect to the dominating measure $\nu({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}):=\prod_{a\in\{0,1\}}\nu(Y_{1}^{(a)})\times\dots\times\nu(Y_{nT}^{(a)})$, is

\[
\bar{p}_{n}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)=\int p_{nT,h_{1}}({\bf y}_{nT}^{(1)})\cdot p_{nT,h_{0}}({\bf y}_{nT}^{(0)})dm_{0}^{*}(\bm{h}).
\]

Due to the two-point support of $m_{0}^{*}$, the posterior density $p_{n}(\cdot|\xi_{t})$ can be associated with a scalar,

\[
m_{n}(\xi_{t})\equiv m_{n}\left({\bf y}_{nq_{1}(t)}^{(1)},{\bf y}_{nq_{0}(t)}^{(0)}\right):=P_{n}\left(\gamma=1|{\bf y}_{nq_{1}(t)}^{(1)},{\bf y}_{nq_{0}(t)}^{(0)}\right).
\]

That the posterior depends on $\xi_{t}$ only via ${\bf y}_{nq_{1}(t)}^{(1)},{\bf y}_{nq_{0}(t)}^{(0)}$ is an immediate consequence of Adusumilli (2021, Lemma 1). Recalling the definition of $\varpi_{n}(\cdot)$ in (4.2), we have $\varpi_{n}(\xi_{t})=\varpi_{n}(m_{n}(\xi_{t}))$, where, for any $m\in[0,1]$,

\begin{align*}
\varpi_{n}(m) &:=\min\left\{\left\{\mu_{n,0}(h_{0}^{*})-\mu_{n,1}(-h_{1}^{*})\right\}(1-m),\left\{\mu_{n,1}(h_{1}^{*})-\mu_{n,0}(-h_{0}^{*})\right\}m\right\}\\
&=\left(\mu_{n,1}(h_{1}^{*})-\mu_{n,0}(-h_{0}^{*})\right)\min\{m,1-m\}.
\end{align*}

The first equation above always holds, while the second holds under the simplification $\mu_{n,a}(h)=-\mu_{n,a}(-h)$ described in Section 4.

Let

\[
z_{a,nq_{a}}:=\frac{I_{a}^{-1/2}}{\sqrt{n}}\sum_{i=1}^{\left\lfloor nq_{a}\right\rfloor}\psi_{a}(Y_{i}^{(a)}), \tag{A.7}
\]

denote the (standardized) score process. Under quadratic mean differentiability - Assumption 1(i) - the following SLAN property holds for both treatments:

\[
\sum_{i=1}^{\left\lfloor nq_{a}\right\rfloor}\ln\frac{dp_{\theta_{0}+h/\sqrt{n}}^{(a)}}{dp_{\theta_{0}}^{(a)}}=h^{\intercal}I_{a}^{1/2}z_{a,nq_{a}}-\frac{q_{a}}{2}h^{\intercal}I_{a}h+o_{P_{nT,\theta_{0}}^{(a)}}(1),\ \textrm{uniformly over bounded }q_{a}. \tag{A.8}
\]

See Adusumilli (2021, Lemma 2) for the proof.\footnote{It should be noted that the score process in that paper is defined slightly differently, as $I_{a}^{-1/2}z_{a,nq_{a}}$ under the present notation.}

As in Adusumilli (2021), we now define approximate versions of the true marginal and posterior by replacing the actual likelihood $\prod_{a}p_{nq_{a},h_{a}}^{(a)}({\bf y}_{nT}^{(a)})$ with

\begin{align*}
\prod_{a}\lambda_{nq,h_{a}}^{(a)}({\bf y}_{nq}^{(a)}) &\equiv\prod_{a}\frac{d\Lambda_{nq,h_{a}}^{(a)}({\bf y}_{nq}^{(a)})}{d\nu},\ \textrm{where }\\
\lambda_{nq,h}^{(a)}({\bf y}_{nq}^{(a)}) &:=\exp\left\{h^{\intercal}I_{a}^{1/2}z_{a,nq}-\frac{q}{2}h^{\intercal}I_{a}h\right\}p_{nq,\theta_{0}}^{(a)}({\bf y}_{nq}^{(a)})\ \forall\ q\in[0,T]. \tag{A.9}
\end{align*}

In other words, we approximate the true likelihood with the first two terms in the SLAN expansion (A.8).
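To make the approximation concrete, consider a worked special case that is not taken from the paper: a scalar Gaussian location model $Y_{i}^{(a)}\sim\mathcal{N}(\theta,1)$, for which the score is $\psi_{a}(Y)=Y-\theta_{0}$ and $I_{a}=1$. Under these assumptions the log-likelihood ratio is exactly quadratic in $h$, so the only error in (A.8) comes from replacing $\lfloor nq\rfloor/n$ by $q$:

```latex
\begin{align*}
\sum_{i=1}^{\lfloor nq\rfloor}\ln\frac{p_{\theta_{0}+h/\sqrt{n}}(Y_{i})}{p_{\theta_{0}}(Y_{i})}
 &=\sum_{i=1}^{\lfloor nq\rfloor}\left\{\frac{h}{\sqrt{n}}(Y_{i}-\theta_{0})-\frac{h^{2}}{2n}\right\}\\
 &=h\,z_{a,nq}-\frac{\lfloor nq\rfloor}{n}\,\frac{h^{2}}{2}
  \;=\;h\,z_{a,nq}-\frac{q}{2}h^{2}+O(n^{-1}).
\end{align*}
```

In this special case $\lambda_{nq,h}^{(a)}$ therefore differs from the true likelihood only through the rounding of $nq$.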

(Approximate marginal:) Denote by $\tilde{P}_{nq_{1},nq_{0},\bm{h}}$ the measure whose density is $\lambda_{nq_{1},h_{1}}^{(1)}({\bf y}_{nq_{1}}^{(1)})\cdot\lambda_{nq_{0},h_{0}}^{(0)}({\bf y}_{nq_{0}}^{(0)})$, and take $\tilde{\bar{P}}_{nq_{1},nq_{0}}$ to be its marginal over ${\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}$ given the prior $m_{0}^{*}(\bm{h})$. Note that the density (wrt $\nu$) of $\tilde{\bar{P}}_{nq_{1},nq_{0}}$ is

\[
\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)=\int\lambda_{nq_{1},h^{(1)}}^{(1)}\left({\bf y}_{nq_{1}}^{(1)}\right)\cdot\lambda_{nq_{0},h^{(0)}}^{(0)}\left({\bf y}_{nq_{0}}^{(0)}\right)dm_{0}^{*}(\bm{h}). \tag{A.10}
\]

Also, define $\tilde{\bar{p}}_{n}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right):=\tilde{\bar{p}}_{nT,nT}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)$. Then, $\tilde{\bar{p}}_{n}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)$ approximates the true marginal $\bar{p}_{n}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)$.

(Approximate posterior:) Next, let $\tilde{\varphi}(t)$ be the approximate likelihood ratio

\[
\tilde{\varphi}(t)=\frac{\lambda_{nq_{1},h_{1}^{*}}^{(1)}\left({\bf y}_{nq_{1}(t)}^{(1)}\right)\cdot\lambda_{nq_{0},-h_{0}^{*}}^{(0)}\left({\bf y}_{nq_{0}(t)}^{(0)}\right)}{\lambda_{nq_{1},-h_{1}^{*}}^{(1)}\left({\bf y}_{nq_{1}(t)}^{(1)}\right)\cdot\lambda_{nq_{0},h_{0}^{*}}^{(0)}\left({\bf y}_{nq_{0}(t)}^{(0)}\right)}=\exp\left\{\Delta^{*}\rho_{n}(t)\right\},
\]

where

\[
\rho_{n}(t):=\frac{\dot{\mu}_{1}^{\intercal}z_{1,nq_{1}(t)}}{\sigma_{1}}-\frac{\dot{\mu}_{0}^{\intercal}z_{0,nq_{0}(t)}}{\sigma_{0}}. \tag{A.11}
\]

Based on the above, an approximation to the true posterior is given by\footnote{Formally, this follows by the disintegration of measure, see, e.g., Adusumilli (2021, p.17).}

\[
\frac{\tilde{\varphi}(t)}{1+\tilde{\varphi}(t)}=\frac{\exp\left\{\Delta^{*}\rho_{n}(t)\right\}}{1+\exp\left\{\Delta^{*}\rho_{n}(t)\right\}}:=\tilde{m}(\rho_{n}(t)), \tag{A.12}
\]

where $\tilde{m}(\rho):=\exp(\Delta^{*}\rho)/(1+\exp(\Delta^{*}\rho))$ for $\rho\in\mathbb{R}$. When $\rho_{n}(t)=\rho$, the approximate posterior $\tilde{m}(\rho)$ in turn implies an approximate posterior, $\tilde{p}_{n}(\bm{h}|\rho)$, over $\bm{h}$ that takes the value $(h_{1}^{*},-h_{0}^{*})$ with probability $\tilde{m}(\rho)$ and $(-h_{1}^{*},h_{0}^{*})$ with probability $1-\tilde{m}(\rho)$.

Step 1 (Posterior and probability approximations)

Set $V_{n,T}^{*}=\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}^{*}(\bm{d},m_{0}^{*})$. Using dynamic programming arguments, it is straightforward to show that there exists a non-randomized sampling rule and stopping time that minimizes $V_{n}^{*}(\bm{d},m_{0})$ for any prior $m_{0}$. We therefore restrict $\mathcal{D}_{n,T}$ to the set of all deterministic rules, $\bar{\mathcal{D}}_{n,T}$. Under deterministic policies, the sampling rules $\pi_{nt}$, states $\xi_{t}$ and stopping times $\tau$ are all deterministic functions of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$. Recall that ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ are the stacked vectors of outcomes under $nT$ observations of each treatment. It is useful to think of $\{\pi_{nt}\}_{t=1/n}^{T},\tau$ as quantities mapping $({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)})$ to realizations of regret.\footnote{Note that $\pi,\tau$ still need to satisfy the measurability restrictions, and some components of ${\bf y}_{nT}^{(a)}$ may not be observed as both treatments cannot be sampled $nT$ times.} Taking $\bar{\mathbb{E}}_{n}[\cdot]$ to be the expectation under $\bar{P}_{n}$, we then have

\[
V_{n}^{*}(\bm{d},m_{0}^{*})=\bar{\mathbb{E}}_{n}\left[\sqrt{n}\varpi_{n}\left(m_{n}\left(\xi_{\tau}\right)\right)+c\tau\right],
\]

for any deterministic $\bm{d}\in\bar{\mathcal{D}}_{n,T}$.

Now, take $\tilde{\bar{\mathbb{E}}}_{n}[\cdot]$ to be the expectation under $\tilde{\bar{P}}_{n}$, and define

\[
\tilde{V}_{n}(\bm{d},m_{0}^{*})=\tilde{\bar{\mathbb{E}}}_{n}\left[\sqrt{n}\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)+c\tau\right]. \tag{A.13}
\]

By Lemma 7 in Appendix B.6,

\[
\lim_{n\to\infty}\sup_{\bm{d}\in\bar{\mathcal{D}}_{n,T}}\left|V_{n}^{*}(\bm{d},m_{0}^{*})-\tilde{V}_{n}(\bm{d},m_{0}^{*})\right|=0.
\]

This in turn implies $\lim_{n\to\infty}\left|V_{n,T}^{*}-\tilde{V}_{n,T}^{*}\right|=0$, where $\tilde{V}_{n,T}^{*}:=\inf_{\bm{d}\in\bar{\mathcal{D}}_{n,T}}\tilde{V}_{n}(\bm{d},m_{0}^{*})$.

Step 2 (Recursive formula for $\tilde{V}_{n,T}^{*}$)

We now employ dynamic programming arguments to obtain a recursion for $\tilde{V}_{n,T}^{*}$. This requires a bit of care since $\tilde{\bar{P}}_{n}$ is not a probability measure, even though it does integrate to 1 asymptotically.

Recall that $\tilde{p}_{n}(\bm{h}|\rho)$ is the probability measure on $\bm{h}$ that assigns probability $\tilde{m}(\rho)$ to $(h_{1}^{*},-h_{0}^{*})$ and probability $1-\tilde{m}(\rho)$ to $(-h_{1}^{*},h_{0}^{*})$. Define

\begin{align*}
\tilde{p}_{n}(Y^{(a)}|\rho) &=p_{\theta_{0}}^{(a)}(Y^{(a)})\cdot\int\exp\left\{\frac{1}{\sqrt{n}}h_{a}^{\intercal}\psi_{a}(Y^{(a)})-\frac{1}{2n}h_{a}^{\intercal}I_{a}h_{a}\right\}d\tilde{p}_{n}(\bm{h}|\rho),\\
\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0}) &=\int\frac{\lambda_{nT,h_{1}}^{(1)}\left({\bf y}_{nT}^{(1)}\right)\cdot\lambda_{nT,h_{0}}^{(0)}\left({\bf y}_{nT}^{(0)}\right)}{\lambda_{nq_{1},h_{1}}^{(1)}\left({\bf y}_{nq_{1}}^{(1)}\right)\cdot\lambda_{nq_{0},h_{0}}^{(0)}\left({\bf y}_{nq_{0}}^{(0)}\right)}d\tilde{p}_{n}(\bm{h}|\rho),\quad\textrm{and}\\
\eta(\rho,q_{1},q_{0}) &=\int d\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0}\right), \tag{A.14}
\end{align*}

where ${\bf y}_{-nq}^{(a)}:=\{Y_{nq+1}^{(a)},\dots,Y_{nT}^{(a)}\}$. In words, $\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0})$ is the approximate probability density over the future values of the stacked rewards $\{Y_{i}^{(a)}\}_{i=nq_{a}+1}^{nT}$ given the current state $\rho,q_{1},q_{0}$. Note that $\eta(\rho,q_{1},q_{0})$ is the normalization constant of $\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0})$.

By Lemma 8 in Appendix B.6, $\tilde{V}_{n,T}^{*}=\tilde{V}_{n,T}^{*}(0,0,0,0)$, where $\tilde{V}_{n,T}^{*}(\cdot)$ solves the recursion

\begin{align*}
\tilde{V}_{n,T}^{*}\left(\rho,q_{1},q_{0},t\right)=\min\Bigg\{ & \sqrt{n}\eta(\rho,q_{1},q_{0})\varpi_{n}(\tilde{m}(\rho)),\\
& \frac{\eta(\rho,q_{1},q_{0})c}{n}+\min_{a\in\{0,1\}}\int\tilde{V}_{n,T}^{*}\left(\rho+\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}},q_{1}+\frac{a}{n},q_{0}+\frac{1-a}{n},t+\frac{1}{n}\right)d\tilde{p}_{n}(Y^{(a)}|\rho)\Bigg\}, \tag{A.15}
\end{align*}

for $t\leq T$, and

\[
\tilde{V}_{n,T}^{*}\left(\rho,q_{1},q_{0},T\right)=\sqrt{n}\eta(\rho,q_{1},q_{0})\varpi_{n}(\tilde{m}(\rho)).
\]

The function $\eta(\cdot)$ accounts for the fact that $\tilde{\bar{P}}_{n}$ is not a probability measure.

Now, Lemma 9 in Appendix B.6 shows that

\[
\sup_{\rho,q_{1},q_{0}}\left|\eta(\rho,q_{1},q_{0})-1\right|\leq Mn^{-\vartheta} \tag{A.16}
\]

for some $M<\infty$ and any $\vartheta\in(0,1/2)$. Furthermore, by Assumption 1(iii),

\[
\lim_{n\to\infty}\sup_{m\in[0,1]}\left|\sqrt{n}\varpi_{n}(m)-\varpi(m)\right|=0, \tag{A.17}
\]

where $\varpi(m):=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta^{*}\min\{m,1-m\}$. Since $\varpi(\cdot)$ is uniformly bounded, it follows from (A.17) that $\sqrt{n}\varpi_{n}(\cdot)$ is also uniformly bounded. Then, (A.16) and (A.17) imply

\[
\lim_{n\to\infty}\left|\tilde{V}_{n,T}^{*}(0)-\breve{V}_{n,T}^{*}(0)\right|=0,
\]

where $\breve{V}_{n,T}^{*}(\rho,t)$ is defined as the solution to the recursion

\begin{align*}
\breve{V}_{n,T}^{*}\left(\rho,t\right) &=\min\left\{\varpi(\tilde{m}(\rho)),\frac{c}{n}+\min_{a\in\{0,1\}}\int\breve{V}_{n,T}^{*}\left(\rho+\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}},t+\frac{1}{n}\right)d\tilde{p}_{n}(Y^{(a)}|\rho)\right\}\quad\textrm{for }t\leq T, \tag{A.18}\\
\breve{V}_{n,T}^{*}\left(\rho,T\right) &=\varpi(\tilde{m}(\rho)).
\end{align*}

We can drop the state variables $q_{1},q_{0}$ in $\breve{V}_{n,T}^{*}\left(\cdot\right)$ as they enter the definition of $\tilde{V}_{n,T}^{*}\left(\rho,q_{1},q_{0},t\right)$ only via $\eta(\rho,q_{1},q_{0})$, which was shown in (A.16) to be uniformly close to 1.

Step 3 (PDE approximation and relationship to optimal stopping)

For any $\rho\in\mathbb{R}$, let

\[
\varpi(\rho):=\varpi(\tilde{m}(\rho))=\frac{(\sigma_{1}+\sigma_{0})\Delta^{*}}{2}\min\left\{\frac{\exp(\Delta^{*}\rho)}{1+\exp(\Delta^{*}\rho)},\frac{1}{1+\exp(\Delta^{*}\rho)}\right\}.
\]

Lemma 10 in Appendix B.6 shows that $\breve{V}_{n,T}^{*}(\cdot)$ converges locally uniformly to $V_{T}^{*}(\cdot)$, the unique viscosity solution of the HJB-VI

\begin{align*}
\min\left\{\varpi(\rho)-V_{T}^{*}(\rho,t),c+\partial_{t}V_{T}^{*}+\frac{\Delta^{*}}{2}(2\tilde{m}(\rho)-1)\partial_{\rho}V_{T}^{*}+\frac{1}{2}\partial_{\rho}^{2}V_{T}^{*}\right\} &=0\ \textrm{for }t\leq T,\\
V_{T}^{*}(\rho,T) &=\varpi(\rho). \tag{A.19}
\end{align*}

Note that the sampling rule does not enter the HJB-VI. This is a consequence of the choice of the prior, $m_{0}^{*}$.
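One way to build intuition for (A.19) is to approximate its solution numerically. The sketch below is not the viscosity-solution argument used in the proof: it is a simple backward induction on a recombining binomial tree, with the state $\rho$ moving by $\pm\sqrt{dt}$ and an up-probability chosen to match the drift $\frac{\Delta^{*}}{2}(2\tilde{m}(\rho)-1)$. The function names, the normalization $(\sigma_{1}+\sigma_{0})/2=1$ (argument `scale`), and the example inputs $\Delta^{*}=2.19$, $c=1$, $T=2$ are illustrative assumptions.

```python
import math


def varpi(rho, delta, scale=1.0):
    """Stopping payoff: scale * Delta * min{m, 1 - m}, with m = logistic(Delta * rho)."""
    m = 1.0 / (1.0 + math.exp(-delta * rho))
    return scale * delta * min(m, 1.0 - m)


def stopping_value(delta, c, horizon, steps, scale=1.0):
    """Approximate V_T(0, 0) in (A.19) by backward induction on a binomial tree."""
    dt = horizon / steps
    h = math.sqrt(dt)  # spatial step of the tree
    # Terminal condition V(rho, T) = varpi(rho) on nodes rho = j * h, j = -steps..steps.
    values = [varpi(j * h, delta, scale) for j in range(-steps, steps + 1)]
    for k in range(steps - 1, -1, -1):
        new_values = []
        for j in range(-k, k + 1):
            rho = j * h
            m = 1.0 / (1.0 + math.exp(-delta * rho))
            drift = 0.5 * delta * (2.0 * m - 1.0)
            p_up = 0.5 * (1.0 + drift * h)  # matches the drift to first order in dt
            idx = j + (k + 1)  # index of node j within the level k + 1 list
            cont = c * dt + p_up * values[idx + 1] + (1.0 - p_up) * values[idx - 1]
            # Obstacle problem: stop now or pay c * dt and continue.
            new_values.append(min(varpi(rho, delta, scale), cont))
        values = new_values
    return values[0]


# Example inputs are assumptions chosen for illustration only.
v0 = stopping_value(delta=2.19, c=1.0, horizon=2.0, steps=400)
print(f"approximate V_T(0, 0) ~ {v0:.4f}")
```

The `min` inside the loop mirrors the two branches of the variational inequality: the first argument corresponds to stopping immediately, the second to the continuation region where the PDE holds.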

There is a well known connection between HJB-VIs and the problem of optimal stopping that goes by the name of smooth-pasting or the high contact principle, see Øksendal (2003, Chapter 10) for an overview. In the present context, letting $W(t)$ denote one-dimensional Brownian motion, it follows by Reikvam (1998) that

\begin{align*}
V_{T}^{*}(0,0) &=\inf_{\tau\leq T}\mathbb{E}\left[\varpi(\rho_{\tau})+c\tau\right],\ \textrm{where}\\
d\rho_{t} &=\frac{\Delta^{*}}{2}(2\tilde{m}(\rho_{t})-1)dt+dW(t);\ \rho_{0}=0,
\end{align*}

and the infimum is taken over all stopping times $\tau$ adapted to the filtration $\mathcal{F}_{t}$ generated by $\rho_{t}$.
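The stopping problem can also be explored by simulation. The following sketch applies an Euler-Maruyama discretization to the diffusion for $\rho_{t}$ and evaluates one candidate stopping time, the first exit of $\rho_{t}$ from $(-\gamma,\gamma)$ capped at a horizon. The threshold rule, the normalization $(\sigma_{1}+\sigma_{0})/2=1$, the function name, and all numerical inputs are assumptions made for illustration; the discretization introduces a small overshoot bias.

```python
import math
import random


def simulate_threshold_rule(delta, gamma, c, n_paths=2000, dt=1e-3, horizon=10.0, seed=0):
    """Monte Carlo estimate of E[varpi(rho_tau) + c * tau] and E[tau]
    for tau = first time |rho_t| >= gamma, capped at the horizon."""
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_paths):
        rho, t = 0.0, 0.0
        while abs(rho) < gamma and t < horizon:
            m = 1.0 / (1.0 + math.exp(-delta * rho))  # posterior m-tilde(rho)
            drift = 0.5 * delta * (2.0 * m - 1.0)
            rho += drift * dt + sqrt_dt * rng.gauss(0.0, 1.0)  # Euler-Maruyama step
            t += dt
        m = 1.0 / (1.0 + math.exp(-delta * rho))
        # varpi with (sigma_1 + sigma_0) / 2 normalized to 1 (an assumption)
        total_cost += delta * min(m, 1.0 - m) + c * t
        total_time += t
    return total_cost / n_paths, total_time / n_paths


# Example inputs are illustrative assumptions.
mean_cost, mean_tau = simulate_threshold_rule(delta=2.19, gamma=0.536, c=1.0)
print(f"E[varpi + c*tau] ~ {mean_cost:.3f}, E[tau] ~ {mean_tau:.3f}")
```

For a Brownian motion with drift $\pm\Delta/2$ leaving $(-\gamma,\gamma)$, the expected exit time is $(2\gamma/\Delta)\tanh(\Delta\gamma/2)$, which gives a reference value against which the simulated `mean_tau` can be compared.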

Step 4 (Taking $T\to\infty$)

Through steps 1-3, we have shown

\[
\lim_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}\sup_{\bm{h}}V_{n}(\bm{d},\bm{h})\geq\lim_{n\to\infty}\inf_{\bm{d}\in\mathcal{D}_{n,T}}V_{n}(\bm{d},m_{0}^{*})=V_{T}^{*}(0,0).
\]

We now argue that

\[
\lim_{T\to\infty}V_{T}^{*}(0,0)=V_{\infty}^{*}:=\inf_{\tau}\mathbb{E}\left[\varpi(\rho_{\tau})+c\tau\right].
\]

Suppose not: Then, there exists $\epsilon>0$, and some stopping time $\bar{\tau}$ such that $V(\bar{\tau}):=\mathbb{E}\left[\varpi(\rho_{\bar{\tau}})+c\bar{\tau}\right]<V_{T}^{*}(0,0)-\epsilon$ for all $T$ (note that we always have $V_{T}^{*}(0,0)\geq V_{\infty}^{*}$ by definition). Now, $\varpi(\cdot)$ is uniformly bounded, so by the dominated convergence theorem, $\lim_{T\to\infty}\mathbb{E}\left[\varpi(\rho_{\bar{\tau}\wedge T})\right]=\mathbb{E}\left[\varpi(\rho_{\bar{\tau}})\right]$. Hence,

\begin{align*}
\lim_{T\to\infty}V_{T}^{*}(0,0) &\leq\lim_{T\to\infty}\mathbb{E}\left[\varpi(\rho_{\bar{\tau}\wedge T})+c\left(\bar{\tau}\wedge T\right)\right]\\
&=\mathbb{E}\left[\varpi(\rho_{\bar{\tau}})\right]+\lim_{T\to\infty}c\mathbb{E}\left[\left(\bar{\tau}\wedge T\right)\right]\leq V(\bar{\tau}).
\end{align*}

This is a contradiction.

It remains to show $V_{\infty}^{*}$ is the same as $V^{*}$, the value of the two-player game in Theorem 1. Define

\[
m_{t}=\frac{\exp(\Delta^{*}\rho_{t})}{1+\exp(\Delta^{*}\rho_{t})}.
\]

By a change of variables from $\rho_{t}$ to $m_{t}$, we can write $V_{\infty}^{*}=\inf_{\tau}\mathbb{E}\left[\varpi(m_{\tau})+c\tau\right]$, where $dm_{t}=\Delta^{*}m_{t}(1-m_{t})dW_{t}$ by Itô’s lemma. But by way of the proof of Lemma 1, see (A.1), this is just $V^{*}$. The theorem can therefore be considered proved.
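For completeness, the Itô calculation behind the change of variables can be written out as follows. Write $m_{t}=f(\rho_{t})$ with $f(\rho)=\exp(\Delta^{*}\rho)/(1+\exp(\Delta^{*}\rho))$, so that $f'=\Delta^{*}f(1-f)$ and $f''=(\Delta^{*})^{2}f(1-f)(1-2f)$; this uses only the dynamics of $\rho_{t}$ stated in Step 3.

```latex
\begin{align*}
dm_{t} &= f'(\rho_{t})\,d\rho_{t}+\tfrac{1}{2}f''(\rho_{t})\,dt\\
 &= \Delta^{*}m_{t}(1-m_{t})\left[\frac{\Delta^{*}}{2}(2m_{t}-1)\,dt+dW_{t}\right]
   +\frac{(\Delta^{*})^{2}}{2}m_{t}(1-m_{t})(1-2m_{t})\,dt\\
 &= \Delta^{*}m_{t}(1-m_{t})\,dW_{t}.
\end{align*}
```

The two drift terms cancel exactly, which is why the posterior $m_{t}$ is a driftless diffusion.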

A.4. Proof of Theorem 3

For any $\bm{h}=(h_{1},h_{0})$, let $P_{n,\bm{h}}$ denote the joint distribution with density $p_{nT,\theta_{0}+h_{1}/\sqrt{n}}^{(1)}({\bf y}_{nT}^{(1)})\cdot p_{nT,\theta_{0}+h_{0}/\sqrt{n}}^{(0)}({\bf y}_{nT}^{(0)})$. Take $\mathbb{E}_{n,\bm{h}}[\cdot]$ to be the corresponding expectation. We can write $V_{n}(\bm{d}_{n,T},\bm{h})$ as

\[
V_{n}(\bm{d}_{n,T},\bm{h})=\mathbb{E}_{n,\bm{h}}\left[\sqrt{n}\left(\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0})\right)\mathbb{I}\{\delta_{n,T}\geq 0\}+c\tau_{n,T}\right].
\]

Define $\mu(\bm{h})=\left(\dot{\mu}_{1}^{\intercal}h_{1},\dot{\mu}_{0}^{\intercal}h_{0}\right)$, $\Delta\mu(\bm{h})=\dot{\mu}_{1}^{\intercal}h_{1}-\dot{\mu}_{0}^{\intercal}h_{0}$ and $\Delta_{n}\mu(\bm{h})=\mu_{n,1}(h_{1})-\mu_{n,0}(h_{0})$. In addition, we also define $\tilde{q}_{a}(t):=\sigma_{a}t/(\sigma_{1}+\sigma_{0})$.

Step 1 (Weak convergence of $\rho_{n}(t)$)

Denote $P_{n,0}=P_{n,(0,0)}$. By the SLAN property (A.8), independence of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ given $\bm{h}$, and the central limit theorem,

\begin{align*}
\ln\frac{dP_{n,\bm{h}}}{dP_{n,0}}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right) &=\sum_{a\in\{0,1\}}\left\{h_{a}^{\intercal}I_{a}^{1/2}z_{a,nT}-\frac{T}{2}h_{a}^{\intercal}I_{a}h_{a}\right\}+o_{P_{n,0}}(1) \tag{A.20}\\
&\xrightarrow[P_{n,0}]{d}\mathcal{N}\left(\frac{-T}{2}\sum_{a\in\{0,1\}}h_{a}^{\intercal}I_{a}h_{a},T\sum_{a\in\{0,1\}}h_{a}^{\intercal}I_{a}h_{a}\right). \tag{A.21}
\end{align*}

Therefore, by Le Cam’s first lemma, $P_{n,\bm{h}}$ and $P_{n,0}$ are mutually contiguous.
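The reason the limit in (A.21) delivers contiguity can be seen from a one-line check. Writing $s^{2}:=T\sum_{a}h_{a}^{\intercal}I_{a}h_{a}$ (a shorthand introduced here for illustration), the limiting log-likelihood ratio $Z\sim\mathcal{N}(-s^{2}/2,s^{2})$ has mean equal to minus half its variance, so the limiting likelihood ratio integrates to one:

```latex
\[
\mathbb{E}\left[e^{Z}\right]
 =\exp\left\{-\frac{s^{2}}{2}+\frac{s^{2}}{2}\right\}=1,
\qquad Z\sim\mathcal{N}\left(-\frac{s^{2}}{2},\,s^{2}\right).
\]
```

This is the standard condition under which Le Cam’s first lemma applies in both directions.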

We now determine the distribution of $\rho_{n}(t)$. We start by showing

\[
\left|\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)})-\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)})\right|=o_{P_{n,0}}(1), \tag{A.22}
\]

uniformly over $t\leq T$. Choose any $b\in(1/2,1)$. For $t\leq n^{-b}$, we must have $q_{a}(t),\tilde{q}_{a}(t)\leq n^{-b}$, so (A.22) follows from Assumption 1(ii), which implies

\[
\sup_{1\leq i\leq nT}|\psi_{a}(Y_{i}^{(a)})|=O_{P_{n,0}}(n^{1/r}),\ \textrm{for any }r>0. \tag{A.23}
\]

As for the other values of $t$, by (4.4) and (A.23),

\[
\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}}{\sigma_{a}\sqrt{n}}\left\{\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)})-\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)})\right\}\lesssim\sqrt{n}\left|q_{a}(t)-\tilde{q}_{a}(t)\right|\sup_{1\leq i\leq nT}|\psi_{a}(Y_{i}^{(a)})|=o_{P_{n,0}}(1),
\]

uniformly over $t\in(n^{-b},T]$.

Now, (A.22) implies

\[
\rho_{n}(t)=\frac{\dot{\mu}_{1}^{\intercal}I_{1}^{-1}}{\sigma_{1}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{1}(t)\right\rfloor}\psi_{1}(Y_{i}^{(1)})-\frac{\dot{\mu}_{0}^{\intercal}I_{0}^{-1}}{\sigma_{0}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{0}(t)\right\rfloor}\psi_{0}(Y_{i}^{(0)})+o_{P_{n,0}}(1)\ \textrm{uniformly over }t\leq T. \tag{A.24}
\]

By Donsker’s theorem, and recalling that $\tilde{q}_{a}(t)=\sigma_{a}t/(\sigma_{1}+\sigma_{0})$,

\[
\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(\cdot)\right\rfloor}\psi_{a}(Y_{i}^{(a)})\xrightarrow[P_{n,0}]{d}\sqrt{\frac{\sigma_{a}}{\sigma_{1}+\sigma_{0}}}W_{a}(\cdot),
\]

where $W_{1}(\cdot),W_{0}(\cdot)$ can be taken to be independent Wiener processes due to the independence of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ under $P_{n,0}$. Combined with (A.24), we conclude

\[
\rho_{n}(\cdot)\xrightarrow[P_{n,0}]{d}\tilde{W}(\cdot), \tag{A.25}
\]

where $\tilde{W}(\cdot)=\sqrt{\frac{\sigma_{1}}{\sigma_{1}+\sigma_{0}}}W_{1}(\cdot)-\sqrt{\frac{\sigma_{0}}{\sigma_{1}+\sigma_{0}}}W_{0}(\cdot)$ is another Wiener process.
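That $\tilde{W}(\cdot)$ is again a standard Wiener process follows from a short variance calculation, using only the independence of $W_{1}$ and $W_{0}$ noted above:

```latex
\[
\mathrm{Var}\left(\tilde{W}(t)\right)
 =\frac{\sigma_{1}}{\sigma_{1}+\sigma_{0}}\,t+\frac{\sigma_{0}}{\sigma_{1}+\sigma_{0}}\,t=t .
\]
```

Since $\tilde{W}$ is a linear combination of independent, continuous, mean-zero Gaussian processes with independent increments, it inherits those properties, and unit variance per unit time then identifies it as standard.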

Let $Z$ denote the normal random variable in (A.21). Equations (A.21) and (A.25) imply that $\rho_{n}(\cdot),\ln\left(dP_{n,\bm{h}}/dP_{n,0}\right)$ are asymptotically tight, and therefore, the joint $\left(\rho_{n}(\cdot),\ln\left(dP_{n,\bm{h}}/dP_{n,0}\right)\right)$ is also asymptotically tight under $P_{n,0}$. Furthermore, for any $t\in[0,T]$, it can be shown using (A.24) and (A.20) that

\[
\begin{pmatrix}\rho_{n}(t)\\ \ln\frac{dP_{n,\bm{h}}}{dP_{n,0}}\end{pmatrix}\xrightarrow[P_{n,0}]{d}\begin{pmatrix}\tilde{W}(t)\\ Z\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}0\\ \frac{-T}{2}\sum_{a}h_{a}^{\intercal}I_{a}h_{a}\end{pmatrix},\begin{bmatrix}t & \frac{\Delta\mu(\bm{h})}{\sigma_{1}+\sigma_{0}}t\\ \frac{\Delta\mu(\bm{h})}{\sigma_{1}+\sigma_{0}}t & T\sum_{a}h_{a}^{\intercal}I_{a}h_{a}\end{bmatrix}\right).
\]

Based on the above, an application of Le Cam’s third lemma as in Van Der Vaart and Wellner (1996, Theorem 3.10.12) then gives

\[
\rho_{n}(\cdot)\xrightarrow[P_{n,\bm{h}}]{d}\rho(\cdot)\quad\textrm{where }\ \rho(t):=\frac{\Delta\mu(\bm{h})}{\sigma_{1}+\sigma_{0}}t+\tilde{W}(t). \tag{A.26}
\]

Step 2 (Weak convergence of $\delta_{n,T},\tau_{n,T}$)

Let $\mathbb{D}[0,T]$ denote the metric space of all functions from $[0,T]$ to $\mathbb{R}$ equipped with the sup norm. For any element $z(\cdot)\in\mathbb{D}[0,T]$, define $\tau_{T}(z)=T\wedge\inf\{t:|z(t)|\geq\gamma\}$ and $\delta_{T}(z)=\mathbb{I}\{z(\tau_{T}(z))>0\}$.
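The two functionals just defined are easy to state in code. The sketch below evaluates them on a path observed at finitely many time points, which is an approximation: a continuously observed path may cross the boundary between grid points. The function name `stopping_functionals` and the example inputs are hypothetical.

```python
def stopping_functionals(times, path, gamma, horizon):
    """Return (tau_T(z), delta_T(z)) for a path z sampled at the given times.

    tau_T(z) is the first sampled time with |z(t)| >= gamma, capped at the horizon;
    delta_T(z) is 1 if z at that time is strictly positive and 0 otherwise.
    Assumes times are increasing and times[0] <= horizon.
    """
    tau = horizon
    z_at_tau = None
    for t, z in zip(times, path):
        if t > horizon:
            break
        z_at_tau = z  # last value seen at or before the horizon
        if abs(z) >= gamma:
            tau = t
            break
    delta = 1 if z_at_tau is not None and z_at_tau > 0 else 0
    return tau, delta


# Example paths are illustrative assumptions.
grid_times = [0.0, 0.5, 1.0, 1.5, 2.0]
tau_a, delta_a = stopping_functionals(grid_times, [0.0, 0.3, -0.2, -0.8, 0.1], 0.5, 2.0)
tau_b, delta_b = stopping_functionals(grid_times, [0.0, 0.1, 0.2, 0.3, 0.4], 0.5, 2.0)
print(tau_a, delta_a, tau_b, delta_b)
```

In the first example the path first leaves $(-\gamma,\gamma)$ through the lower boundary, so treatment 0 is selected; in the second the boundary is never reached and the decision is taken at the horizon based on the sign of the path.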

Now, under $\bm{h}=(0,0)$, $\rho(\cdot)$ is the Wiener process, whose sample paths take values (with probability 1) in $\bar{\mathbb{C}}[0,T]$, the set of all continuous functions such that $\gamma,-\gamma$ are regular points (i.e., if $z(t)=\gamma$, $z(\cdot)-\gamma$ changes sign infinitely often in any time interval $[t,t+\epsilon]$, $\epsilon>0$; a similar property holds under $z(t)=-\gamma$). The latter is a well known property of Brownian motion, see Karatzas and Shreve (2012, Problem 2.7.18), and it implies $z(\cdot)\in\bar{\mathbb{C}}[0,T]$ must ‘cross’ the boundary within an arbitrarily small time interval after hitting $\gamma$ or $-\gamma$. It is then easy to verify that if $z_{n}\to z$ with $z_{n}\in\mathbb{D}[0,T]$ for all $n$ and $z\in\bar{\mathbb{C}}[0,T]$, then $\tau_{T}(z_{n})\to\tau_{T}(z)$ and $\delta_{T}(z_{n})\to\delta_{T}(z)$. By construction, $\tau_{n,T}=\tau_{T}(\rho_{n})$ and $\delta_{n,T}=\delta_{T}(\rho_{n})$, so by (A.25) and the extended continuous mapping theorem (Van Der Vaart and Wellner, 1996, Theorem 1.11.1)

\[
(\tau_{n,T},\delta_{n,T})\xrightarrow[P_{n,0}]{d}(\tau_{T}^{*},\delta_{T}^{*}),
\]

where $\tau_{T}^{*}:=\tau_{T}(\rho)$ and $\delta_{T}^{*}:=\delta_{T}(\rho)$.

For general $\bm{h}$, $\rho(\cdot)$ is distributed as in (A.26). By the Girsanov theorem, the probability law induced on $\mathbb{D}[0,T]$ by the process $\frac{\Delta\mu(\bm{h})}{\sigma_{1}+\sigma_{0}}t+\tilde{W}(t)$ is absolutely continuous with respect to the probability law induced by $\tilde{W}(t)$. Hence, with probability 1, the sample paths of $\rho(\cdot)$ again lie in $\bar{\mathbb{C}}[0,T]$. Then, by similar arguments as in the case with $\bm{h}=(0,0)$, but now using (A.26), we conclude

\[
(\tau_{n,T},\delta_{n,T})\xrightarrow[P_{n,\bm{h}}]{d}(\tau_{T}^{*},\delta_{T}^{*}). \tag{A.27}
\]

Step 3 (Convergence of $V_{n}(\bm{d}_{n,T},\bm{h})$)

From (3.5) and the discussion in Section 3.1, it is clear that the distribution of $\rho(t)$ is the same as that of $\sigma_{1}^{-1}x_{1}(t)-\sigma_{0}^{-1}x_{0}(t)$ in the diffusion regime. Thus, the joint distribution, $\mathbb{P}$, of $(\tau_{T}^{*},\delta_{T}^{*})$, defined in Step 2, is the same as the joint distribution of

\[\left(\tau_{T}^{*}\equiv\tau^{*}\wedge T,\ \delta_{T}^{*}\equiv\mathbb{I}\left\{\frac{x_{1}(\tau^{*}\wedge T)}{\sigma_{1}}-\frac{x_{0}(\tau^{*}\wedge T)}{\sigma_{0}}\geq 0\right\}\right)\]

in the diffusion regime, when the optimal sampling rule $\pi^{*}$ is used. Therefore, defining $\bm{d}_{T}^{*}\equiv(\pi^{*},\tau_{T}^{*},\delta_{T}^{*})$ and $\mathbb{E}[\cdot]$ to be the expectation under $\mathbb{P}$, we obtain

\[V(\bm{d}_{T}^{*},\mu(\bm{h}))=\mathbb{E}\left[\Delta\mu(\bm{h})\delta_{T}^{*}+c\tau_{T}^{*}\right],\]

where $V(\bm{d},\bm{\mu})$ denotes the frequentist regret of $\bm{d}$ in the diffusion regime. Now, recall that by the definitions stated early on in this proof,

\[V_{n}(\bm{d}_{n,T},\bm{h})=\mathbb{E}_{n,\bm{h}}\left[\sqrt{n}\Delta_{n}\mu(\bm{h})\delta_{n,T}+c\tau_{n,T}\right].\]

Since $\delta_{n,T},\tau_{n,T}$ are bounded and $\sqrt{n}\Delta_{n}\mu(\bm{h})\to\Delta\mu(\bm{h})$ by Assumption 1(iii), it follows from (A.27) that for each $\bm{h}$,

\[\lim_{n\to\infty}V_{n}(\bm{d}_{n,T},\bm{h})=V(\bm{d}_{T}^{*},\mu(\bm{h})). \tag{A.28}\]

For any given $\bm{h}$ and $\epsilon>0$, a dominated convergence argument as in Step 4 of the proof of Theorem 2 shows that there exists $\bar{T}_{\bm{h}}$ large enough such that

\[V(\bm{d}_{T}^{*},\mu(\bm{h}))\leq V(\bm{d}^{*},\mu(\bm{h}))+\epsilon \tag{A.29}\]

for all $T\geq\bar{T}_{\bm{h}}$. Fix a finite subset $\mathcal{J}$ of $\mathbb{R}$ and define $\bar{T}_{\mathcal{J}}=\sup_{\bm{h}\in\mathcal{J}}\bar{T}_{\bm{h}}$. Then, (A.28) and (A.29) imply

\[\liminf_{n\to\infty}\sup_{\bm{h}\in\mathcal{J}}V_{n}(\bm{d}_{n,T},\bm{h})\leq\sup_{\bm{h}\in\mathcal{J}}V(\bm{d}_{T}^{*},\mu(\bm{h}))\leq\sup_{\bm{h}\in\mathcal{J}}V(\bm{d}^{*},\mu(\bm{h}))+\epsilon,\]

for all $T\geq\bar{T}_{\mathcal{J}}$. Since the above is true for any $\mathcal{J}$ and $\epsilon>0$,

\[\begin{aligned}
\sup_{\mathcal{J}}\lim_{T\to\infty}\liminf_{n\to\infty}\sup_{\bm{h}\in\mathcal{J}}V_{n}(\bm{d}_{n,T},\bm{h}) &\leq\sup_{\mathcal{J}}\sup_{\bm{h}\in\mathcal{J}}V(\bm{d}^{*},\mu(\bm{h}))\\
&\leq\sup_{\bm{\mu}}V(\bm{d}^{*},\bm{\mu})=V^{*}.
\end{aligned}\]

The inequality can be made an equality due to Theorem 2. We have thereby proved Theorem 3.

ONLINE APPENDIX

Appendix B Supplementary results

B.1. Proof of equation (3.7)

We exploit the fact that the least favorable prior has a two point support, and that the reward gap is the same under both support points. Fix some values of $c,\sigma_{1},\sigma_{0}$. Recall the definition of $\alpha^{*}$ as the probability of mis-identification error from (3.6), and observe that $R^{*}=(\sigma_{1}+\sigma_{0})\Delta^{*}\alpha^{*}/2$. Furthermore, by Lemma 2,

\[\mathbb{E}[\tau^{*}]=\frac{2}{\Delta^{*2}}\frac{\Delta^{*}\gamma^{*}\left(e^{\Delta^{*}\gamma^{*}}+e^{-\Delta^{*}\gamma^{*}}-2\right)}{e^{\Delta^{*}\gamma^{*}}-e^{-\Delta^{*}\gamma^{*}}}=\frac{2}{\Delta^{*2}}(1-2\alpha^{*})\ln\frac{1-\alpha^{*}}{\alpha^{*}},\]

where the second equality follows from the expression for $\alpha^{*}$ in (3.6).

Let $\theta=1$ denote the state when $\bm{\mu}=(\sigma_{1}\Delta^{*}/2,-\sigma_{0}\Delta^{*}/2)$ and $\theta=0$ the state when $\bm{\mu}=(-\sigma_{1}\Delta^{*}/2,\sigma_{0}\Delta^{*}/2)$. Because of the nature of the prior, we can think of a non-sequential experiment as choosing a set of mis-identification probabilities $\alpha_{s},\beta_{s}$ under the two states (e.g., $\alpha_{s}$ is the probability of choosing treatment 0 under $\theta=1$), along with a duration (i.e., a sample size), $T_{R^{*}}$. To achieve a Bayes regret of $R^{*}$, we would need $\alpha_{s}+\beta_{s}=2\alpha^{*}$. For any $\alpha_{s},\beta_{s}$, let $T(\alpha_{s},\beta_{s})$ denote the minimum duration of time needed to achieve these mis-identification probabilities. Following Shiryaev (2007, Section 4.2.5), we have

\[T(\alpha_{s},\beta_{s})=\frac{\left(\Phi^{-1}(1-\alpha_{s})+\Phi^{-1}(1-\beta_{s})\right)^{2}}{\Delta^{*2}}.\]

Hence,

\[T_{R^{*}}=\min_{\alpha_{s}+\beta_{s}=2\alpha^{*}}\frac{\left(\Phi^{-1}(1-\alpha_{s})+\Phi^{-1}(1-\beta_{s})\right)^{2}}{\Delta^{*2}}.\]

It can be seen that the minimum is reached when $\alpha_{s}=\beta_{s}=\alpha^{*}$, and we thus obtain

\[T_{R^{*}}=\frac{4\left(\Phi^{-1}(1-\alpha^{*})\right)^{2}}{\Delta^{*2}}.\]

Therefore,

\[\frac{\mathbb{E}[\tau^{*}]}{T_{R^{*}}}=\frac{(1-2\alpha^{*})\ln\frac{1-\alpha^{*}}{\alpha^{*}}}{2\left(\Phi^{-1}(1-\alpha^{*})\right)^{2}}\approx 0.6.\]
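The ratio above depends on the primitives only through the error probability, so it can be evaluated numerically for any candidate value. The sketch below does this with the Python standard library. The function names are illustrative, and the value of $\alpha^{*}$ from (3.6) is not reproduced in this appendix, so the inputs used here are arbitrary test values rather than the paper's $\alpha^{*}$.

```python
import math
from statistics import NormalDist


def sequential_duration(alpha: float, delta: float) -> float:
    """E[tau] written in terms of the error probability alpha and the gap delta."""
    return 2.0 / delta**2 * (1.0 - 2.0 * alpha) * math.log((1.0 - alpha) / alpha)


def fixed_duration(alpha_s: float, beta_s: float, delta: float) -> float:
    """T(alpha_s, beta_s): duration of a non-sequential experiment with these error rates."""
    quantile = NormalDist().inv_cdf
    return (quantile(1.0 - alpha_s) + quantile(1.0 - beta_s)) ** 2 / delta**2


def duration_ratio(alpha: float) -> float:
    """E[tau] / T_R when both error probabilities equal alpha (delta cancels)."""
    return sequential_duration(alpha, 1.0) / fixed_duration(alpha, alpha, 1.0)
```

Evaluating `duration_ratio` on a grid of error probabilities shows how the ratio varies with $\alpha$, and comparing `fixed_duration(a, a, d)` with `fixed_duration(a - e, a + e, d)` illustrates why the symmetric split minimizes the non-sequential duration.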

B.2. Proof sketch of Theorem 5

For any $\bm{h}=(h_{1},h_{0})$, $h_{a}\in T(P_{0}^{(a)})$, let $P_{n,\bm{h}}$ denote the joint distribution $P_{1/\sqrt{n},h_{1}}^{(1)}({\bf y}_{nT}^{(1)})\cdot P_{1/\sqrt{n},h_{0}}^{(0)}({\bf y}_{nT}^{(0)})$. Take $\mathbb{E}_{n,\bm{h}}[\cdot]$ to be the corresponding expectation. As in Section 5, we can associate each $h_{a}\in T(P_{0}^{(a)})$ with an element from the $l_{2}$ space of square integrable sequences $\{h_{a,0}/\sigma_{a},h_{a,1},\dots\}$. In what follows, we write $\mu_{a}:=h_{a,0}$, and define $\bm{\mu}=\left(\mu_{1},\mu_{0}\right)$ and $\Delta\mu=\mu_{1}-\mu_{0}$.

We only rework the first step of the proof of Theorem 3 as the remaining steps can be applied with minor changes.

Denote $P_{n,0}=P_{0}^{(1)}({\bf y}_{nT}^{(1)})\cdot P_{0}^{(0)}({\bf y}_{nT}^{(0)})$. By the SLAN property (5.2), independence of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ given $\bm{h}$, and the central limit theorem,

\[\begin{aligned}
\ln\frac{dP_{n,\bm{h}}}{dP_{n,0}}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right) &=\sum_{a\in\{0,1\}}\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^{nT}h_{a}(Y_{i}^{(a)})-\frac{T}{2}\left\|h_{a}\right\|_{a}^{2}\right\}+o_{P_{n,0}}(1)\\
&\xrightarrow[P_{n,0}]{d}\mathcal{N}\left(\frac{-T}{2}\sum_{a\in\{0,1\}}\left\|h_{a}\right\|_{a}^{2},\ T\sum_{a\in\{0,1\}}\left\|h_{a}\right\|_{a}^{2}\right).
\end{aligned}\tag{B.1}\]

Therefore, by Le Cam’s first lemma, $P_{n,\bm{h}}$ and $P_{n,0}$ are mutually contiguous. Next, define

\[\rho_{n}(t)=\frac{x_{1}(t)}{\sigma_{1}}-\frac{x_{0}(t)}{\sigma_{0}}.\]

By similar arguments as in the proof of Theorem 3,

\[\rho_{n}(t)=\frac{1}{\sigma_{1}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{1}(t)\right\rfloor}Y_{i}^{(1)}-\frac{1}{\sigma_{0}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{0}(t)\right\rfloor}Y_{i}^{(0)}+o_{P_{n,0}}(1)\ \textrm{uniformly over }t\leq T. \tag{B.2}\]

Then, by Donsker’s theorem, and recalling that $\tilde{q}_{a}(t)=\sigma_{a}t/(\sigma_{1}+\sigma_{0})$, we obtain

\[\frac{1}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(\cdot)\right\rfloor}Y_{i}^{(a)}\xrightarrow[P_{n,0}]{d}\sqrt{\frac{\sigma_{a}}{\sigma_{1}+\sigma_{0}}}W_{a}(\cdot),\]

where $W_{1}(\cdot),W_{0}(\cdot)$ can be taken to be independent Wiener processes due to the independence of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ under $P_{n,0}$. Combined with (B.2), we conclude

\[\rho_{n}(\cdot)\xrightarrow[P_{n,0}]{d}\tilde{W}(\cdot), \tag{B.3}\]

where $\tilde{W}(\cdot)=\sqrt{\frac{\sigma_{1}}{\sigma_{1}+\sigma_{0}}}W_{1}(\cdot)-\sqrt{\frac{\sigma_{0}}{\sigma_{1}+\sigma_{0}}}W_{0}(\cdot)$ is another Wiener process.
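The claim that $\tilde{W}$ is again a standard Wiener process can be checked directly. It is a linear combination of independent Gaussian processes with continuous paths and independent increments, so only the variance needs to be verified. A short worked computation, using nothing beyond the definition of $\tilde{W}$ and the independence of $W_{1},W_{0}$:

```latex
\begin{aligned}
\operatorname{Var}\big[\tilde{W}(t)\big]
 &= \frac{\sigma_{1}}{\sigma_{1}+\sigma_{0}}\operatorname{Var}\big[W_{1}(t)\big]
  + \frac{\sigma_{0}}{\sigma_{1}+\sigma_{0}}\operatorname{Var}\big[W_{0}(t)\big] \\
 &= \frac{\sigma_{1}+\sigma_{0}}{\sigma_{1}+\sigma_{0}}\,t \;=\; t .
\end{aligned}
```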

Equations (B.1) and (B.3) imply that $\rho_{n}(\cdot),\ln\left(dP_{n,\bm{h}}/dP_{n,0}\right)$ are asymptotically tight, and therefore, the joint $\left(\rho_{n}(\cdot),\ln\left(dP_{n,\bm{h}}/dP_{n,0}\right)\right)$ is also asymptotically tight under $P_{n,0}$. It remains to determine the point-wise distributional limit of $\left(\rho_{n}(\cdot),\ln\left(dP_{n,\bm{h}}/dP_{n,0}\right)\right)$ for each $t$. By our $l_{2}$ representation of $h_{a}$, we have $h_{a}=(\mu_{a}/\sigma_{a})(\psi/\sigma_{a})+h_{a,-1}$, where $h_{a,-1}$ is orthogonal to the influence function $\psi(Y_{i}^{(a)}):=Y_{i}^{(a)}$. This implies $\mathbb{E}_{n,0}[h_{a}(Y_{i}^{(a)})Y_{i}^{(a)}]=\mu_{a}$, and therefore, after some straightforward algebra exploiting the fact that ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ are independent iid sequences, we obtain

\[\mathbb{E}_{n,0}\left[\left\{\sum_{a}\frac{(2a-1)}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}Y_{i}^{(a)}\right\}\cdot\left\{\sum_{a}\frac{1}{\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}h_{a}(Y_{i}^{(a)})\right\}\right]=\frac{\Delta\mu}{\sigma_{1}+\sigma_{0}}t.\]

Combining the above with (B.2) and the first line of (B.1), we find

\[\begin{aligned}
\left(\begin{array}{c}\rho_{n}(t)\\ \ln\frac{dP_{n,\bm{h}}}{dP_{n,0}}\end{array}\right) &=\left(\begin{array}{c}0\\ -\frac{T}{2}\sum_{a}\left\|h_{a}\right\|_{a}^{2}\end{array}\right)+\left(\begin{array}{c}\sum_{a}\frac{(2a-1)}{\sigma_{a}\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}Y_{i}^{(a)}\\ \sum_{a}\frac{1}{\sqrt{n}}\sum_{i=1}^{\left\lfloor n\tilde{q}_{a}(t)\right\rfloor}h_{a}(Y_{i}^{(a)})\end{array}\right)\\
&\qquad+\left(\begin{array}{c}0\\ \sum_{a}\frac{1}{\sqrt{n}}\sum_{i=\left\lfloor n\tilde{q}_{a}(t)+1\right\rfloor}^{nT}h_{a}(Y_{i}^{(a)})\end{array}\right)+o_{P_{n,0}}(1)\\
&\xrightarrow[P_{n,0}]{d}\left(\begin{array}{c}\tilde{W}(t)\\ Z\end{array}\right)\sim\mathcal{N}\left(\left(\begin{array}{c}0\\ \frac{-T}{2}\sum_{a}\left\|h_{a}\right\|_{a}^{2}\end{array}\right),\left[\begin{array}{cc}t&\frac{\Delta\mu}{\sigma_{1}+\sigma_{0}}t\\ \frac{\Delta\mu}{\sigma_{1}+\sigma_{0}}t&T\sum_{a}\left\|h_{a}\right\|_{a}^{2}\end{array}\right]\right)
\end{aligned}\]

for each $t$, where the last step makes use of the independence of $\left({\bf y}_{n\tilde{q}_{1}(t)}^{(1)},{\bf y}_{n\tilde{q}_{0}(t)}^{(0)}\right)$ and $\left({\bf y}_{-n\tilde{q}_{1}(t)}^{(1)},{\bf y}_{-n\tilde{q}_{0}(t)}^{(0)}\right)$. Based on the above, an application of Le Cam’s third lemma as in Van Der Vaart and Wellner (1996, Theorem 3.10.12) then gives

\[\rho_{n}(\cdot)\xrightarrow[P_{n,\bm{h}}]{d}\rho(\cdot)\quad\textrm{where }\ \rho(t):=\frac{\Delta\mu}{\sigma_{1}+\sigma_{0}}t+\tilde{W}(t). \tag{B.4}\]
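For a fixed $t$, the final step can be written out explicitly. Le Cam's third lemma states that when the joint limit under $P_{n,0}$ is bivariate normal and the log-likelihood ratio has mean equal to minus half its variance, the limit of the first coordinate under $P_{n,\bm{h}}$ keeps its variance and has its mean shifted by the covariance. The display below is a sketch of this finite-dimensional step only; the process-level statement is the one cited from Van Der Vaart and Wellner (1996).

```latex
\begin{aligned}
\rho_{n}(t)\ \xrightarrow[P_{n,\bm{h}}]{d}\
\mathcal{N}\Big(\underbrace{0}_{\text{mean under }P_{n,0}}
 + \underbrace{\tfrac{\Delta\mu}{\sigma_{1}+\sigma_{0}}\,t}_{\text{covariance with }\ln\frac{dP_{n,\bm{h}}}{dP_{n,0}}},\ t\Big),
\end{aligned}
```

which is the marginal distribution of $\frac{\Delta\mu}{\sigma_{1}+\sigma_{0}}t+\tilde{W}(t)$ at time $t$.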

B.3. Alternative cost functions

We follow the basic outline of Section 3.1 and Lemmas 1-3. Our ansatz is that the least favorable prior should be within the class of indifference priors, $p_{\Delta}$, and the minimax decision rule should lie within the class $\tilde{\bm{d}}_{\gamma}=(\pi^{*},\tau_{\gamma},\delta^{*})$.

The DM’s response to $p_{\Delta}$.

Suppose nature employs the indifference prior $p_{\Delta}$. Then it is clear from the discussion in Section 3.1, and the symmetry of the sampling costs $c(\cdot)$, that the DM is indifferent between any sampling rule, and the Bayes optimal implementation rule is $\delta^{*}=\mathbb{I}\{\rho(t)\geq 0\}$. To determine the Bayes optimal stopping rule, we employ a similar analysis as in Lemma 1. Define

\[\begin{aligned}
\tilde{c}(m) &:=c\left(\frac{1}{\Delta}\ln\frac{m}{1-m}\right),\\
\phi_{c}(m) &:=\int_{1/2}^{m}\int_{1/2}^{x}\frac{\tilde{c}\left(z\right)}{2(z(1-z))^{2}}dzdx.
\end{aligned}\]

Note that $\tilde{c}(\cdot)$ is the sampling cost in terms of the posterior probability $m(t)$, as $\rho(t)=\Delta^{-1}\ln\left(\frac{m(t)}{1-m(t)}\right)$. Let $\mathbb{E}[\cdot]$ denote the expectation over $\tau$ given the prior $p_{\Delta}$ and sampling rule $\pi$. By Morris and Strack (2019, Proposition 2),

\[\mathbb{E}\left[\int_{0}^{\tau}c(\rho(t))dt\right]\equiv\mathbb{E}\left[\int_{0}^{\tau}\tilde{c}(m(t))dt\right]=\int_{0}^{1}\phi_{c}(m)dG_{\tau}(m),\]

where $G_{\tau}(\cdot)$ is the distribution induced over $m(\tau)$ by the stopping time $\tau$. Hence, as in Lemma 1, we can suppose that instead of choosing $\tau$, the DM chooses a probability distribution $G$ over the posterior beliefs $m(\tau)$ at an ‘ex-ante’ cost

\[c(G)=\int_{0}^{1}\phi_{c}(m)dG(m),\]

subject to the constraint $\int mdG(m)=m_{0}=1/2$. Hence, the Bayes optimal stopping time is the one that induces the distribution $G^{*}$, defined as

\[\begin{aligned}
G^{*} &=\operatorname*{arg\,min}_{G:\int mdG(m)=\frac{1}{2}}\int f(m)dG(m),\quad\textrm{where}\\
f(m) &:=\phi_{c}(m)+\frac{(\sigma_{1}+\sigma_{0})\Delta}{2}\min\{m,1-m\}.
\end{aligned}\]

As $\phi_{c}^{\prime}(1/2)=0$, $f(m)$ cannot be minimized at $1/2$. Consider, then, $f(m)$ for $m\in[0,1/2)$. In this region, $f(m)=\phi_{c}(m)+\frac{(\sigma_{1}+\sigma_{0})\Delta}{2}m$, where $\phi_{c}^{\prime\prime}(m)>0$ by the assumption $\tilde{c}(m)>0$. This proves $f(m)$ is convex in $[0,1/2)$. Also, $\phi_{c}(1/2)=0$, and under the assumption $c(\cdot)\geq\underline{c}$, it is easy to see that $\phi_{c}(m)\to\infty$ as $m\to 0$, with $\phi_{c}(m)$ monotonically decreasing on $(0,1/2]$. Taken together, these results imply $f(m)$ has a unique minimum in $(0,1/2)$. Denote $\alpha(\Delta):=\operatorname*{arg\,min}_{m\in(0,1/2)}f(m)$. By the symmetry of sampling costs, $f(m)=f(1-m)$, and so the global minima of $f(\cdot)$ are $\alpha(\Delta),1-\alpha(\Delta)$. Given the constraint $\int mdG^{*}(m)=1/2$, we conclude that $G^{*}$ is a two-point distribution, supported on $\alpha(\Delta),1-\alpha(\Delta)$ with equal probability $1/2$. By Shiryaev (2007, Section 4.2.1), this distribution is induced by the stopping time $\tau_{\gamma(\Delta)}$, where

\[\gamma(\Delta):=\frac{1}{\Delta}\ln\frac{1-\alpha(\Delta)}{\alpha(\Delta)}.\]

This stopping time is the best response to nature’s prior $p_{\Delta}$.
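The mapping between the posterior belief $m$ and the statistic $\rho$, and hence between a belief threshold $\alpha(\Delta)$ and a boundary $\gamma(\Delta)$, is a log-odds transform. The sketch below implements both directions; the function names are illustrative and the numerical inputs in the usage line are arbitrary.

```python
import math


def rho_from_belief(m: float, delta: float) -> float:
    """rho = (1/delta) * ln(m / (1 - m)), the statistic implied by posterior belief m."""
    return math.log(m / (1.0 - m)) / delta


def gamma_from_alpha(alpha: float, delta: float) -> float:
    """Boundary gamma(delta) = (1/delta) * ln((1 - alpha) / alpha)."""
    return math.log((1.0 - alpha) / alpha) / delta


def alpha_from_gamma(gamma: float, delta: float) -> float:
    """Inverse map: the belief threshold alpha implied by a boundary gamma."""
    return 1.0 / (1.0 + math.exp(delta * gamma))
```

For instance, `gamma_from_alpha(0.2, 2.0)` returns the boundary at which the posterior belief first reaches 0.2 or 0.8, and `rho_from_belief(0.8, 2.0)` returns the same number because stopping at belief $1-\alpha$ is stopping at $\rho=\gamma$.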

Nature’s response to $\tau_{\gamma}$.

We will determine nature’s best response to the DM choosing $\tilde{\bm{d}}_{\gamma}$ by obtaining a formula for the frequentist regret $V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right)$. Denote $\Delta=2(\mu_{1}-\mu_{0})/(\sigma_{1}+\sigma_{0})$, and take $\zeta_{\Delta}(x)$ to be the solution of the ODE

\[\frac{1}{2}\zeta_{\Delta}^{\prime\prime}(x)+\frac{\Delta}{2}\zeta_{\Delta}^{\prime}(x)=c(x);\quad\zeta_{\Delta}(0)=\zeta_{\Delta}^{\prime}(0)=0.\]

It is easy to show that the solution is

\[\zeta_{\Delta}(x)=2\int_{0}^{x}e^{-\Delta y}\int_{0}^{y}e^{\Delta z}c(z)dzdy.\]
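As a sanity check on the double-integral formula, the sketch below evaluates it numerically with a trapezoid rule and compares it with a closed form for the special case of a constant cost $c(x)=c$. The closed form $\zeta_{\Delta}(x)=\frac{2c}{\Delta}\left(x+\frac{e^{-\Delta x}-1}{\Delta}\right)$ is obtained here by carrying out the two integrals by hand and is not stated in the text; the function names and the step count are likewise illustrative assumptions.

```python
import math
from typing import Callable


def zeta_numeric(x: float, delta: float, cost: Callable[[float], float], steps: int = 2000) -> float:
    """2 * int_0^x e^{-delta*y} int_0^y e^{delta*z} cost(z) dz dy by the trapezoid rule.

    The step h = x / steps is signed, so the same loop handles x < 0.
    """
    h = x / steps
    inner = 0.0                # running value of int_0^y e^{delta*z} cost(z) dz
    total = 0.0                # running value of the outer integral
    prev_inner_f = cost(0.0)   # inner integrand at z = 0
    prev_outer_f = 0.0         # outer integrand at y = 0
    for k in range(1, steps + 1):
        y = k * h
        cur_inner_f = math.exp(delta * y) * cost(y)
        inner += 0.5 * h * (prev_inner_f + cur_inner_f)
        cur_outer_f = math.exp(-delta * y) * inner
        total += 0.5 * h * (prev_outer_f + cur_outer_f)
        prev_inner_f, prev_outer_f = cur_inner_f, cur_outer_f
    return 2.0 * total


def zeta_constant_cost(x: float, delta: float, c: float) -> float:
    """Closed form for cost(z) = c (derived for this sketch, delta != 0)."""
    return (2.0 * c / delta) * (x + math.expm1(-delta * x) / delta)
```

Differentiating the closed form gives $\zeta_{\Delta}^{\prime}(x)=\frac{2c}{\Delta}(1-e^{-\Delta x})$ and $\zeta_{\Delta}^{\prime\prime}(x)=2ce^{-\Delta x}$, so the left-hand side of the ODE is $ce^{-\Delta x}+c(1-e^{-\Delta x})=c$, and both initial conditions hold.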

In what follows we write $\rho_{t}=\rho(t)$.

We now claim that for any stopping time, $\tau$,

\[\mathbb{E}_{\bm{d}|\bm{\mu}}\left[\int_{0}^{\tau}c(\rho_{t})dt\right]=\mathbb{E}_{\bm{d}|\bm{\mu}}\left[\zeta_{\Delta}(\rho_{\tau})\right]. \tag{B.5}\]

To prove the above, we start by recalling from (3.5) that

\[\rho_{t}=\frac{\Delta}{2}t+\tilde{W}(t),\]

where $\tilde{W}(\cdot)$ is a one-dimensional Wiener process. Then, for any bounded stopping time $\tau$, Itô’s lemma implies

\[\begin{aligned}
\zeta_{\Delta}(\rho_{\tau}) &=\zeta_{\Delta}(\rho_{0})+\frac{\Delta}{2}\int_{0}^{\tau}\zeta_{\Delta}^{\prime}(\rho_{t})dt+\frac{1}{2}\int_{0}^{\tau}\zeta_{\Delta}^{\prime\prime}(\rho_{t})dt+\int_{0}^{\tau}\zeta_{\Delta}^{\prime}(\rho_{t})d\tilde{W}(t)\\
&=\int_{0}^{\tau}c(\rho_{t})dt+\int_{0}^{\tau}\zeta_{\Delta}^{\prime}(\rho_{t})d\tilde{W}(t),
\end{aligned}\]

where the last step follows from the definition of $\zeta_{\Delta}(\cdot)$. This proves (B.5) for bounded stopping times. The extension to unbounded stopping times follows by a similar argument as in the proof of Proposition 2 in Morris and Strack (2019).

Recall that $\tau_{\gamma}:=\inf\{t:|\rho_{t}|\geq\gamma\}$. By Lemma 2,

\[\mathbb{P}(\rho_{\tau_{\gamma}}=\gamma|\bm{\mu})=\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}. \tag{B.6}\]

This implies

\[\mathbb{E}_{\bm{d}|\bm{\mu}}\left[\zeta_{\Delta}(\rho_{\tau_{\gamma}})\right]=\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}\zeta_{\Delta}(\gamma)+\frac{e^{\Delta\gamma}-1}{e^{\Delta\gamma}-e^{-\Delta\gamma}}\zeta_{\Delta}(-\gamma). \tag{B.7}\]

Combining (B.5)-(B.7), we obtain

\[\begin{aligned}
V\left(\tilde{\bm{d}}_{\gamma},\bm{\mu}\right) &=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta\mathbb{P}(\delta^{*}=1|\bm{\mu})+\mathbb{E}_{\bm{d}|\bm{\mu}}\left[\int_{0}^{\tau_{\gamma}}c(\rho_{t})dt\right]\\
&=\frac{(\sigma_{1}+\sigma_{0})\Delta}{2}\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{\left(1-e^{-\Delta\gamma}\right)\zeta_{\Delta}(\gamma)+\left(e^{\Delta\gamma}-1\right)\zeta_{\Delta}(-\gamma)}{e^{\Delta\gamma}-e^{-\Delta\gamma}}.
\end{aligned}\]

Thus, the best response of nature to $\tilde{\bm{d}}_{\gamma}$ is to pick any prior supported on

\[\left\{\bm{\mu}:|\mu_{1}-\mu_{0}|=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta(\gamma)\right\},\]

where

\[\Delta(\gamma):=\operatorname*{arg\,max}_{\Delta}\left\{\left(\frac{\sigma_{1}+\sigma_{0}}{2}\right)\frac{\left(1-e^{-\Delta\gamma}\right)\Delta}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{\left(1-e^{-\Delta\gamma}\right)\zeta_{\Delta}(\gamma)+\left(e^{\Delta\gamma}-1\right)\zeta_{\Delta}(-\gamma)}{e^{\Delta\gamma}-e^{-\Delta\gamma}}\right\}.\]

Therefore, the two-point prior $p_{\Delta(\gamma)}$ is a best response to $\tilde{\bm{d}}_{\gamma}$.

Nash equilibrium.

By similar arguments as in the proof of Theorem 1, the Nash equilibrium is given by $(p_{\Delta^{*}},\tilde{\bm{d}}_{\gamma^{*}})$ where $(\Delta^{*},\gamma^{*})$ is the solution to the minimax problem

\[\min_{\gamma}\max_{\Delta}\left\{\left(\frac{\sigma_{1}+\sigma_{0}}{2}\right)\frac{\left(1-e^{-\Delta\gamma}\right)\Delta}{e^{\Delta\gamma}-e^{-\Delta\gamma}}+\frac{\left(1-e^{-\Delta\gamma}\right)\zeta_{\Delta}(\gamma)+\left(e^{\Delta\gamma}-1\right)\zeta_{\Delta}(-\gamma)}{e^{\Delta\gamma}-e^{-\Delta\gamma}}\right\}.\]
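A minimax problem of this kind can be explored numerically by a brute-force grid search. The sketch below does so only for the special case of a constant sampling cost $c(x)=c$, where the expected cost is simply $c\,\mathbb{E}[\tau_{\gamma}]$. It assumes that the expression for the expected stopping time used in Section B.1 holds for general $(\Delta,\gamma)$, which in simplified form reads $\mathbb{E}[\tau_{\gamma}]=\frac{2\gamma}{\Delta}\tanh\left(\frac{\Delta\gamma}{2}\right)$, and it uses the identity $\frac{1-e^{-\Delta\gamma}}{e^{\Delta\gamma}-e^{-\Delta\gamma}}=\frac{1}{1+e^{\Delta\gamma}}$ for the error probability. The grids, default parameter values, and function names are arbitrary illustrative choices, and a grid search only approximates the saddle point.

```python
import math


def regret_constant_cost(gamma: float, delta: float, c: float = 1.0,
                         sigma1: float = 1.0, sigma0: float = 1.0) -> float:
    """Frequentist regret of the boundary rule under a constant sampling cost c."""
    x = delta * gamma
    error_prob = 1.0 / (1.0 + math.exp(x))              # mis-identification probability
    expected_tau = (2.0 * gamma / delta) * math.tanh(x / 2.0)
    return 0.5 * (sigma1 + sigma0) * delta * error_prob + c * expected_tau


def grid_minimax(gammas, deltas, **params):
    """Return (value, gamma, delta) for min over gammas of max over deltas."""
    best = None
    for g in gammas:
        # Nature's best response on the grid to the boundary g.
        worst_value, worst_delta = max(
            (regret_constant_cost(g, d, **params), d) for d in deltas
        )
        if best is None or worst_value < best[0]:
            best = (worst_value, g, worst_delta)
    return best


gammas = [0.10 + 0.01 * i for i in range(141)]   # candidate boundaries
deltas = [0.10 + 0.01 * i for i in range(591)]   # candidate scaled reward gaps
value, gamma_hat, delta_hat = grid_minimax(gammas, deltas)
```

For a non-constant cost $c(\cdot)$, the second term of `regret_constant_cost` would have to be replaced by the $\zeta_{\Delta}$ expression in the display above, evaluated by numerical quadrature.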

B.4. Analysis of other regret measures

Following the discussion in Section 6.4, suppose that we measure regret in the implementation phase using some nonlinear functional $\mu(\cdot)$ of the outcome distributions $P^{(0)},P^{(1)}$. We assume that $\mu(\cdot)$ is a regular functional of the data, i.e., for each $a\in\{0,1\}$, there is a $\psi_{a}\in L^{2}(P_{0}^{(a)})$ such that

\[\frac{\mu(P_{t,h}^{(a)})-\mu(P_{0}^{(a)})}{t}-\left\langle\psi_{a},h\right\rangle_{a}=o(t), \tag{B.8}\]

for each of the sub-models $\{P_{t,h}^{(a)}:t\leq\eta\}$ introduced in Section 5. (Footnote 14: Following Van der Vaart (2000, Section 25.3), we may restrict attention to those $h\in T(P_{0}^{(a)})$ for which the Hadamard derivative of $\mu(P_{t,h}^{(a)})$, as given by (B.8), exists.) The function $\psi_{a}(\cdot)$ is termed the efficient influence function.

Define $\sigma_{a}^{2}:=\mathbb{E}_{P_{0}^{(a)}}[\psi_{a}(Y_{i}^{(a)})^{2}]$. It is possible to select $\{\phi_{a,1},\phi_{a,2},\dots\}\in T(P_{0}^{(a)})$ in such a manner that $\{\psi_{a}/\sigma_{a},\phi_{a,1},\phi_{a,2},\dots\}$ is a set of orthonormal basis functions for the closure of $T(P_{0}^{(a)})$. We can also choose these bases so they lie in $T(P_{0}^{(a)})$, i.e., $\mathbb{E}_{P_{0}^{(a)}}[\phi_{a,j}]=0$ for all $j$. By the Hilbert space isometry, each $h_{a}\in T(P_{0}^{(a)})$ is then associated with an element from the $l_{2}$ space of square integrable sequences, $(h_{a,0}/\sigma_{a},h_{a,1},\dots)$, where $h_{a,0}=\left\langle\psi_{a},h_{a}\right\rangle_{a}$ and $h_{a,k}=\left\langle\phi_{a,k},h_{a}\right\rangle_{a}$ for all $k\neq 0$.

Note that the above setup closely mirrors the discussion in Section 5. Indeed, when $\mu(\cdot)$ is the mean functional, the efficient influence function is just $\psi(Y):=Y$, as defined in that section. It is then easy to verify that the derivation of the minimax lower bound in Theorem 4, and the discussion preceding it, goes through unchanged even for general functionals.

For decision rules that attain the lower bound, consider $\bm{d}_{n,T}=(\pi_{n},\tau_{n,T},\delta_{n,T})$, as defined in Section 4.3, but with $x_{a}(t)$ in (5.4) now representing the efficient influence function process for treatment $a$, i.e.,

\[x_{a}(t):=\frac{1}{\sqrt{n}}\sum_{i=1}^{\left\lfloor nq_{a}(t)\right\rfloor}\psi_{a}(Y_{i}^{(a)}),\]

and $\sigma_{a}^{2}:=\mathbb{E}_{P_{0}^{(a)}}[\psi_{a}(Y_{i}^{(a)})^{2}]$. By the same method of proof as in Section B.2, it is easy to see that $\bm{d}_{n,T}$ attains the lower bound $V^{*}$ and Theorem 5 thereby applies to general functionals as well.

B.5. Additional simulations

B.5.1. Updating $\sigma_{1},\sigma_{0}$ using a prior

As noted in Section 6.3, instead of using forced sampling to estimate the values of $\sigma_{1},\sigma_{0}$, we could employ a prior and continuously update these parameter values. Here, we report simulation results from using this approach in the context of the numerical illustration from Section 7.

Since the outcome model is Bernoulli, we employ a beta prior over the unknown quantities $p_{0},p_{1}$. The prior parameters are taken to be $\alpha_{0}=2,\beta_{0}=3$; these imply that the prior is centered around $p_{0}(=0.4)$. We then apply our proposed policies while continuously updating the values of $\sigma_{1},\sigma_{0}$ using the posterior means of $p_{1},p_{0}$. We experimented with alternative prior parameters, but they did not change the results substantively.
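The updating step described above can be sketched as follows. The beta-Bernoulli posterior mean is standard; the plug-in formula $\sigma_{a}=\sqrt{p_{a}(1-p_{a})}$ for a Bernoulli outcome is an assumption about how the posterior mean is converted into a standard deviation, since the text does not spell this out. Function names are illustrative.

```python
import math


def beta_posterior_mean(alpha0: float, beta0: float, successes: int, failures: int) -> float:
    """Posterior mean of p under a Beta(alpha0, beta0) prior and Bernoulli data."""
    return (alpha0 + successes) / (alpha0 + beta0 + successes + failures)


def plug_in_sigma(p: float) -> float:
    """Standard deviation of a Bernoulli(p) outcome, evaluated at an estimate of p."""
    return math.sqrt(p * (1.0 - p))


# With no data, the Beta(2, 3) prior used in the text has mean 2 / (2 + 3) = 0.4.
prior_sigma = plug_in_sigma(beta_posterior_mean(2, 3, 0, 0))
```

After each new observation on treatment $a$, the success and failure counts for that arm are incremented and `plug_in_sigma` is re-evaluated, so the allocation and stopping rules always use the current estimates.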

Figure B.1 plots the finite sample frequentist regret profile of $\bm{d}_{n}:=\bm{d}_{n,\infty}$ (i.e., $\bm{d}_{n,T}$ with $T=\infty$) for various values of $n$, along with that of $\bm{d}^{*}$ under diffusion asymptotics. The approach of using the prior performs substantially worse than forced sampling (as can be seen by comparing the figure to Figure 7.1), with the performance being worse for higher values of $\Delta$. It appears that continuously updating the prior results in more variability, leading to higher expected regret than that under the minimax policy (for large $\Delta$). However, the minimax regret itself is actually close to the asymptotic value, as can be seen by comparing the maximum values of the regret profiles; in particular, the difference is less than 3%. Nevertheless, we recommend employing forced sampling in practice, at least for Bernoulli outcomes, as it has a strong theoretical justification.

Note: The solid curve is the regret profile of $\bm{d}^{*}$; the vertical red line denotes $\Delta^{*}$. We only plot the results for $\Delta>0$ as the values are close to symmetric.

Figure B.1. Frequentist regret profiles under prior updating

B.5.2. Simulations using Gaussian outcomes

To assess the finite sample performance of the proposed policies under continuous outcomes, we ran additional Monte-Carlo simulations assuming Gaussian outcomes $Y_{i}^{(a)}\sim\mathcal{N}(\mu_{a}/\sqrt{n},\sigma_{a}^{2})$ for each treatment. This is a parametric setting in which $\rho_{n}(t)$ has the form

\[\rho_{n}(t)=\frac{1}{\sqrt{n}\sigma_{1}}\sum_{i=1}^{\left\lfloor nq_{1}(t)\right\rfloor}Y_{i}^{(1)}-\frac{1}{\sqrt{n}\sigma_{0}}\sum_{i=1}^{\left\lfloor nq_{0}(t)\right\rfloor}Y_{i}^{(0)}.\]

Figure B.2, Panel A plots the finite sample frequentist regret profile of $\bm{d}_{n}:=\bm{d}_{n,\infty}$ (i.e., $\bm{d}_{n,T}$ with $T=\infty$) for various values of $n$, along with that of $\bm{d}^{*}$ under diffusion asymptotics. The parameter values are $c=1$ and $\sigma_{0}^{2}=\sigma_{1}^{2}=1$. Given these parameter values, each $n$ corresponds to a sampling cost of $C=n^{-3/2}$. It is seen that diffusion asymptotics provide a very good approximation to the finite sample properties of $\bm{d}_{n}$, even for such relatively small values of $n$ as $n=200$. Furthermore, $\bm{d}_{n}$ can be seen to attain the lower bound for minimax regret. Panel B of the same figure displays some summary statistics for Bayes regret under $\bm{d}_{n}$ when nature chooses the least favorable prior, $p_{\Delta^{*}}$. The finite sample expected regret is very close to $V^{*}$, the value of minimax regret under diffusion asymptotics. We can also infer that the distribution of regret under $p_{\Delta^{*}}$ is positively skewed and heavy tailed.
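A single run of such a simulation might look like the sketch below. It is not the code used for the reported results. It assumes $\sigma_{1}=\sigma_{0}=1$, so that the Neyman allocation reduces to alternating between the two treatments, it checks the stopping condition after every observation, and it adds a hypothetical cap `max_time` so that the loop always terminates. The boundary `gamma` is an input; the value passed in the usage line is arbitrary.

```python
import math
import random


def simulate_gaussian_run(n: int, mu1: float, mu0: float, gamma: float,
                          rng: random.Random, max_time: float = 50.0):
    """Simulate one experiment with unit variances; return (tau, delta, rho_at_stop)."""
    sums = [0.0, 0.0]                                   # running outcome sums for arms 0 and 1
    means = [mu0 / math.sqrt(n), mu1 / math.sqrt(n)]    # local parametrization of the means
    max_obs = int(max_time * n)
    rho = 0.0
    for k in range(1, max_obs + 1):
        arm = k % 2                                     # alternate arms: q_1(t) = q_0(t) = t / 2
        sums[arm] += rng.gauss(means[arm], 1.0)
        rho = (sums[1] - sums[0]) / math.sqrt(n)        # rho_n(t) with sigma_1 = sigma_0 = 1
        if abs(rho) >= gamma:
            break
    tau = k / n                                         # time is measured in units of n observations
    return tau, int(rho >= 0), rho


tau, delta, rho_stop = simulate_gaussian_run(200, 2.0, 0.0, 0.5, random.Random(0))
```

Averaging the realized regret over many such runs, for each value of $(\mu_{1},\mu_{0})$, would give a finite sample regret profile of the kind plotted in Panel A.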

Panel A: Frequentist regret profiles. Panel B: Performance under least-favorable prior.

Note: The solid curve in Panel A is the regret profile of $\bm{d}^{*}$; the vertical red line denotes $\Delta^{*}$. We only plot the results for $\Delta>0$ as the values are close to symmetric. The dashed red line in Panel B is $V^{*}$, the asymptotic minimax regret. Black lines within the bars denote the Bayes regret in finite samples, under the least favorable prior. The bars describe the interquartile range of regret. Parameter values are $c=1$, $\sigma_{0}=\sigma_{1}=1$.

Figure B.2. Finite sample performance of $\bm{d}_{n}$

B.6. Supporting lemmas for the proof of Theorem 2

We suppose that Assumption 1 holds for all the results in this section. We also make heavy use of the notation defined in Appendix A.3. Additionally, for some $M<\infty$, define

\[A_{n}:=\left\{\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right):\sup_{q\leq T}\left\|z_{nq}^{(a)}\right\|<M\ \forall\ a\in\{0,1\}\right\}.\]
Lemma 4.

For any $\epsilon>0$, there exist $M(\epsilon),N(\epsilon)<\infty$ such that $M\geq M(\epsilon)$ and $n\geq N(\epsilon)$ implies $\bar{P}_{n}(A_{n}^{c})<\epsilon$. Furthermore, letting $\mathbb{E}_{n,0}[\cdot]$ denote the expectation under $P_{n,0}\equiv P_{nT,nT,\bm{0}}$,

\[\sup_{q_{1},q_{0}\leq T}\mathbb{E}_{n,0}\left[\mathbb{I}_{A_{n}}\left\|\prod_{a}\frac{dP_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)-\prod_{a}\frac{d\Lambda_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)\right\|\right]=o(1)\ \forall\ (h_{1},h_{0}).\]
Proof.

The proof follows the same outline as in Adusumilli (2021, Lemma 3). Set

\[A_{n,M}=\left\{\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right):\sup_{q\leq T}\left\|z_{nq}^{(a)}\right\|<M\ \forall\ a\in\{0,1\}\right\}.\]

Note that $z_{nq}^{(a)}$ is a partial sum process with mean 0 under $P_{n,0}$. By Kolmogorov’s maximal inequality and Assumption 1(ii), for each $a\in\{0,1\}$,

\[P_{nT,0}^{(a)}\left(\sup_{q}\left\|z_{nq}^{(a)}\right\|\geq M\right)\leq\frac{1}{M}\textrm{Var}\left[\left\|z_{nq}^{(a)}\right\|\right]=O\left(\frac{1}{M}\right).\]

Since $P_{nT,0}^{(1)},P_{nT,0}^{(0)}$ are independent, it follows that $P_{n,0}(A_{n,M_{n}}^{c})\to 0$ for any $M_{n}\to\infty$. But by (A.8) and standard arguments involving Le Cam’s first lemma, $P_{nT,h_{1}}^{(1)}\times P_{nT,h_{0}}^{(0)}$ is contiguous to $P_{n,0}$ for all $\bm{h}$. This implies $\bar{P}_{n}\equiv\int\prod_{a}P_{nT,h_{a}}^{(a)}dm_{0}(\bm{h})$ is also contiguous to $P_{n,0}$ (this can be shown using the dominated convergence theorem; see also, Le Cam and Yang, p.138). Consequently, $\bar{P}_{n}(A_{n,M_{n}}^{c})\to 0$ for any $M_{n}\to\infty$. The first claim is a straightforward consequence of this.

We now prove the second claim. For any $\bm{h}=(h_{1},h_{0})$, recall that $P_{nq_{1},nq_{0},\bm{h}}$ denotes the probability measure over $({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)})$ defined as

\[dP_{nq_{1},nq_{0},\bm{h}}({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)})=dP_{nq_{1},h_{1}}({\bf y}_{nq_{1}}^{(1)})\cdot dP_{nq_{0},h_{0}}({\bf y}_{nq_{0}}^{(0)}).\]

Now, by similar arguments as in Adusumilli (2021, Lemma 3), $P_{nq_{a}^{(n)},h_{a}}^{(a)}$ is contiguous to $P_{nq_{a}^{(n)},0}^{(a)}$ for any $h_{a}\in\mathbb{R}^{d}$ and deterministic sequence $\{q_{a}^{(n)}\}_{n}$ converging to some $\bar{q}_{a}\in[0,T]$. Due to the independence of ${\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}$ for any $(q_{1},q_{0})$, it then follows that $P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}$ is contiguous to $P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}$.

Now, let $(q_{1}^{(n)},q_{0}^{(n)})\in[0,T]\times[0,T]$ denote quantities such that

\[\begin{aligned}
&\sup_{q_{1},q_{0}\leq T}\mathbb{E}_{n,0}\left[\mathbb{I}_{A_{n}}\left\|\prod_{a}\frac{dP_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)-\prod_{a}\frac{d\Lambda_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)\right\|\right]\\
&\quad\leq\mathbb{E}_{n,0}\left[\mathbb{I}_{A_{n}}\left\|\prod_{a}\frac{dP_{nq_{a}^{(n)},h_{a}}^{(a)}}{dP_{nq_{a}^{(n)},0}^{(a)}}\left({\bf y}_{nq_{a}^{(n)}}^{(a)}\right)-\prod_{a}\frac{d\Lambda_{nq_{a}^{(n)},h_{a}}^{(a)}}{dP_{nq_{a}^{(n)},0}^{(a)}}\left({\bf y}_{nq_{a}^{(n)}}^{(a)}\right)\right\|\right]+\epsilon,
\end{aligned}\]

for some arbitrarily small $\epsilon\geq 0$ (this is always possible by the definition of the supremum). Without loss of generality, we may assume $(q_{1}^{(n)},q_{0}^{(n)})$ converges to some $(\bar{q}_{1},\bar{q}_{0})\in[0,T]\times[0,T]$; otherwise we can employ a subsequence argument since $(q_{1}^{(n)},q_{0}^{(n)})$ lie in a bounded set. Define

\[G_{n}(q_{1},q_{0}):=\mathbb{I}_{A_{n}}\cdot\left\|\prod_{a}\frac{d\Lambda_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)-\prod_{a}\frac{dP_{nq_{a},h_{a}}^{(a)}}{dP_{nq_{a},0}^{(a)}}\left({\bf y}_{nq_{a}}^{(a)}\right)\right\|.\]

The claim follows if we show $\mathbb{E}_{n,0}\left[G_{n}(q_{1}^{(n)},q_{0}^{(n)})\right]\to 0$. By Lemma A.8 and the definition of $\Lambda_{nq,h}^{(a)}(\cdot)$,

\[G_{n}(q_{1},q_{0})=\mathbb{I}_{A_{n}}\cdot\prod_{a}\exp\left\{h_{a}^{\intercal}I_{a}^{1/2}z_{a,nq_{a}}-\frac{q_{a}}{2}h_{a}^{\intercal}I_{a}h_{a}\right\}\left|\exp\delta_{n,q_{a}}^{(a)}-1\right|,\]

where $\sup_{q\leq T}|\delta_{n,q}^{(a)}|=o(1)$ under $P_{n,0}$ for each $a\in\{0,1\}$. Since

\[\mathbb{I}_{A_{n}}\cdot\prod_{a}\exp\left\{h_{a}^{\intercal}I_{a}^{1/2}z_{a,nq_{a}}-\frac{q_{a}}{2}h_{a}^{\intercal}I_{a}h_{a}\right\}\]

is bounded for any fixed $(h_{1},h_{0})$ by the definition of $\mathbb{I}_{A_{n}}$, this implies $G_{n}(q_{1}^{(n)},q_{0}^{(n)})=o(1)$ under $P_{n,0}$. Next, we argue $G_{n}(q_{1}^{(n)},q_{0}^{(n)})$ is uniformly integrable. The first term

\[\mathbb{I}_{A_{n}}\cdot\prod_{a}\frac{d\Lambda_{nq_{a}^{(n)},h_{a}}^{(a)}}{dP_{nq_{a}^{(n)},0}^{(a)}}\left({\bf y}_{nq_{a}^{(n)}}^{(a)}\right)\]

in the definition of $G_{n}(q_{1}^{(n)},q_{0}^{(n)})$ is bounded, and therefore uniformly integrable, for any $\bm{h}$. It remains to prove uniform integrability of

\[\prod_{a}\frac{dP_{nq_{a}^{(n)},h_{a}}^{(a)}}{dP_{nq_{a}^{(n)},0}^{(a)}}\left({\bf y}_{nq_{a}^{(n)}}^{(a)}\right)\equiv\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right).\]

For any $b<\infty$,

\begin{align*}
 & \mathbb{E}_{n,0}\left[\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)\cdot\mathbb{I}\left\{ \frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)>b\right\} \right]\\
 & =\int\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)\mathbb{I}\left\{ \frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)>b\right\} dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}\\
 & \leq P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}\left(\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)>b\right).
\end{align*}

But, by Markov's inequality,

\begin{align*}
 & P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}\left(\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)>b\right)\\
 & \leq b^{-1}\int\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}\leq b^{-1},
\end{align*}

so the contiguity of $P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}$ with respect to $P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}$ implies we can choose $b$ and $\bar{n}$ large enough such that

\[
\limsup_{n\geq\bar{n}}P_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}\left(\frac{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{h}}}{dP_{nq_{1}^{(n)},nq_{0}^{(n)},\bm{0}}}\left({\bf y}_{nq_{1}^{(n)}}^{(1)},{\bf y}_{nq_{0}^{(n)}}^{(0)}\right)>b\right)<\epsilon
\]

for any arbitrarily small ϵ\epsilon. These results demonstrate uniform integrability of Gn(q1(n),q0(n))G_{n}(q_{1}^{(n)},q_{0}^{(n)}) under Pn,0P_{n,0}. Since convergence in probability implies convergence in expectation for uniformly integrable random variables, we have thus shown 𝔼n,0[Gn(q1(n),q0(n))]0\mathbb{E}_{n,0}\left[G_{n}(q_{1}^{(n)},q_{0}^{(n)})\right]\to 0, which concludes the proof. ∎
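The tail bounds used in the uniform integrability argument can be checked in closed form in a simple special case. The sketch below is only an illustration under an assumed scalar Gaussian experiment, not part of the proof: the likelihood ratio is taken to be $L=\exp(hZ-h^{2}/2)$ with $Z\sim N(0,1)$ under the null and $Z\sim N(h,1)$ under the alternative, and the names `normal_sf` and `lr_tail_probs` are hypothetical. In this case $\mathbb{E}_{0}[L\,\mathbb{I}\{L>b\}]=P_{h}(L>b)$, the Markov bound gives $P_{0}(L>b)\leq 1/b$, and $P_{h}(L>b)$ vanishes as $b$ grows.

```python
from math import erfc, log, sqrt

def normal_sf(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * erfc(x / sqrt(2.0))

def lr_tail_probs(h, b):
    """Tail probabilities of L = exp(h*Z - h**2/2) above the level b.

    Assumed toy model: Z ~ N(0, 1) under the null, Z ~ N(h, 1) under the
    alternative. Returns (P_0(L > b), P_h(L > b)); requires h > 0, b > 0.
    """
    cutoff = (log(b) + 0.5 * h * h) / h  # L > b  <=>  Z > cutoff
    return normal_sf(cutoff), normal_sf(cutoff - h)

for b in (2.0, 10.0, 100.0, 1000.0):
    p0, ph = lr_tail_probs(h=1.5, b=b)
    print(f"b={b:7.1f}  P_0(L>b)={p0:.5f}  Markov bound={1.0 / b:.5f}  P_h(L>b)={ph:.5f}")
```

The printed null tail probability stays below $1/b$ for every level, while the tail probability under the alternative, which equals the truncated expectation, shrinks as $b$ increases.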

Lemma 5.

For any measure PP, define PAnP\cap A_{n} as the restriction of PP to the set AnA_{n}. Then, limnP¯nAnP¯~nAnTV=0\lim_{n\to\infty}\left\|\bar{P}_{n}\cap A_{n}-\tilde{\bar{P}}_{n}\cap A_{n}\right\|_{\textrm{TV}}=0, where TV\left\|\cdot\right\|_{\textrm{TV}} denotes the total-variation metric.

Proof.

Denote Pn,0:=PnT,nT,𝟎P_{n,0}:=P_{nT,nT,\bm{0}}. By the properties of the total variation metric, contiguity of P¯n\bar{P}_{n} with respect to Pn,0P_{n,0} (shown in the proof of Lemma 4) and the absolute continuity of aΛn,ha(a)\prod_{a}\Lambda_{n,h_{a}}^{(a)} with respect to Pn,0P_{n,0} (by construction),

limnP¯nAnP¯~nAnTV\displaystyle\lim_{n\to\infty}\left\|\bar{P}_{n}\cap A_{n}-\tilde{\bar{P}}_{n}\cap A_{n}\right\|_{\textrm{TV}}
=12limn{𝕀An|adΛnT,ha(a)dPnT,0(a)(𝐲nT(a))adPnT,ha(a)dPnT,0(a)(𝐲nT(a))|dPn,0(𝐲nT(1),𝐲nT(0))}𝑑m0(𝒉).\displaystyle=\frac{1}{2}\lim_{n\to\infty}\int\left\{\int\mathbb{I}_{A_{n}}\left|\prod_{a}\frac{d\Lambda_{nT,h_{a}}^{(a)}}{dP_{nT,0}^{(a)}}\left({\bf y}_{nT}^{(a)}\right)-\prod_{a}\frac{dP_{nT,h_{a}}^{(a)}}{dP_{nT,0}^{(a)}}\left({\bf y}_{nT}^{(a)}\right)\right|dP_{n,0}({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)})\right\}dm_{0}^{*}(\bm{h}).

In the last expression, denote the term within the {}\{\} brackets by fn(𝒉)f_{n}(\bm{h}). By Lemma 4, fn(𝒉)0f_{n}(\bm{h})\to 0 for each 𝒉\bm{h}. Since m0()m_{0}^{*}(\cdot) is a two-point prior, this proves the desired claim. ∎

Lemma 6.

𝔼¯n[supt[0,T]𝕀An|mn(ξt)m(ρn(t))|]=o(1)\bar{\mathbb{E}}_{n}\left[\sup_{t\in[0,T]}\mathbb{I}_{A_{n}}\cdot\left|m_{n}(\xi_{t})-m(\rho_{n}(t))\right|\right]=o(1).

Proof.

Define

\begin{align*}
\hat{\varphi}(t) & =\left(\frac{dP_{nq_{1}(t),-h_{1}^{*}}^{(1)}}{dP_{nq_{1}(t),0}^{(1)}}\left({\bf y}_{nq_{1}(t)}^{(1)}\right)\cdot\frac{dP_{nq_{0}(t),h_{0}^{*}}^{(0)}}{dP_{nq_{0}(t),0}^{(0)}}\left({\bf y}_{nq_{0}(t)}^{(0)}\right)\right)^{-1}\\
 & \quad\times\left(\frac{dP_{nq_{1}(t),h_{1}^{*}}^{(1)}}{dP_{nq_{1}(t),0}^{(1)}}\left({\bf y}_{nq_{1}(t)}^{(1)}\right)\cdot\frac{dP_{nq_{0}(t),-h_{0}^{*}}^{(0)}}{dP_{nq_{0}(t),0}^{(0)}}\left({\bf y}_{nq_{0}(t)}^{(0)}\right)\right).
\end{align*}

We can then write

mn(ξt)=φ^(t)1+φ^(t).m_{n}(\xi_{t})=\frac{\hat{\varphi}(t)}{1+\hat{\varphi}(t)}.

Recall the definition of φ~()\tilde{\varphi}(\cdot) in (A.12) and note that φ~()\tilde{\varphi}(\cdot) is bounded under the set AnA_{n}. Then, by the SLAN property (A.8) and some straightforward algebra,

supt[0,T]𝕀An|φ^(t)φ~(t)|=oPn,0(1).\sup_{t\in[0,T]}\mathbb{I}_{A_{n}}\left|\hat{\varphi}(t)-\tilde{\varphi}(t)\right|=o_{P_{n,0}}(1).

Due to the continuous mapping theorem, the above in turn implies

supt[0,T]𝕀An|mn(ξt)m(ρn(t))|=oPn,0(1).\sup_{t\in[0,T]}\mathbb{I}_{A_{n}}\left|m_{n}(\xi_{t})-m(\rho_{n}(t))\right|=o_{P_{n,0}}(1).

But it was shown in the proof of Lemma 4 that P¯n\bar{P}_{n} is contiguous with respect to Pn,0P_{n,0}; hence,

supt[0,T]𝕀An|mn(ξt)m(ρn(t))|=oP¯n(1).\sup_{t\in[0,T]}\mathbb{I}_{A_{n}}\left|m_{n}(\xi_{t})-m(\rho_{n}(t))\right|=o_{\bar{P}_{n}}(1).

Since both mn(),m()[0,1]m_{n}(\cdot),m(\cdot)\in[0,1], the claim follows from the dominated convergence theorem. ∎
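The continuous mapping step rests on the fact that the map from posterior odds to posterior probability, $\varphi\mapsto\varphi/(1+\varphi)$, is 1-Lipschitz on $[0,\infty)$, since $\left|\frac{x}{1+x}-\frac{y}{1+y}\right|=\frac{|x-y|}{(1+x)(1+y)}\leq|x-y|$. A minimal sketch of this property follows; the function names `posterior_from_odds` and `posterior_gap` are hypothetical and chosen only for illustration.

```python
def posterior_from_odds(phi):
    """Map posterior odds phi >= 0 to the probability phi / (1 + phi)."""
    if phi < 0:
        raise ValueError("odds must be non-negative")
    return phi / (1.0 + phi)

def posterior_gap(phi_hat, phi_tilde):
    """Absolute gap between the probabilities implied by two odds values."""
    return abs(posterior_from_odds(phi_hat) - posterior_from_odds(phi_tilde))

# A small perturbation of the odds moves the probability by no more than that.
print(posterior_gap(2.0, 2.01), "<=", abs(2.0 - 2.01))
```

Uniform closeness of the odds therefore carries over directly to uniform closeness of the posterior probabilities.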

Lemma 7.

limnsup𝒅𝒟¯n,T|Vn(𝒅,m0)V~n(𝒅,m0)|=0.\lim_{n\to\infty}\sup_{\bm{d}\in\bar{\mathcal{D}}_{n,T}}\left|V_{n}^{*}(\bm{d},m_{0}^{*})-\tilde{V}_{n}(\bm{d},m_{0}^{*})\right|=0.

Proof.

We prove the claim by bounding each term in the following expansion:

Vn(𝒅,m0)V~n(𝒅,m0)\displaystyle V_{n}^{*}(\bm{d},m_{0}^{*})-\tilde{V}_{n}(\bm{d},m_{0}^{*})
=𝔼¯n[𝕀Anc{nϖn(m(ξτ))+cτ}]+𝔼~n[𝕀Anc{nϖn(m~(ρn(τ)))+cτ}]\displaystyle=\bar{\mathbb{E}}_{n}\left[\mathbb{I}_{A_{n}^{c}}\left\{\sqrt{n}\varpi_{n}\left(m\left(\xi_{\tau}\right)\right)+c\tau\right\}\right]+\tilde{\mathbb{E}}_{n}\left[\mathbb{I}_{A_{n}^{c}}\left\{\sqrt{n}\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)+c\tau\right\}\right]
+(𝔼¯n𝔼~n)[𝕀An{nϖn(m~(ρn(τ)))+cτ}]\displaystyle\quad+\left(\bar{\mathbb{E}}_{n}-\tilde{\mathbb{E}}_{n}\right)\left[\mathbb{I}_{A_{n}}\cdot\left\{\sqrt{n}\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)+c\tau\right\}\right]
+𝔼¯n[𝕀Ann|ϖn(m(ξτ))ϖn(m~(ρn(τ)))|].\displaystyle\quad+\bar{\mathbb{E}}_{n}\left[\mathbb{I}_{A_{n}}\cdot\sqrt{n}\left|\varpi_{n}\left(m\left(\xi_{\tau}\right)\right)-\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)\right|\right]. (B.9)

First, observe that by Assumption 1(iii),

limnsupm[0,1]|nϖn(m)ϖ(m)|=0,\lim_{n\to\infty}\sup_{m\in[0,1]}\left|\sqrt{n}\varpi_{n}(m)-\varpi(m)\right|=0, (B.10)

where ϖ(m):=σ1+σ02Δmin{m,1m}\varpi(m):=\frac{\sigma_{1}+\sigma_{0}}{2}\Delta^{*}\min\{m,1-m\}. Since ϖ()\varpi(\cdot) is uniformly bounded, it follows from (B.10) that nϖn()\sqrt{n}\varpi_{n}(\cdot) is also uniformly bounded. Furthermore, τT\tau\leq T is also bounded. We thus conclude that nϖn(m(ξτ))+cτ\sqrt{n}\varpi_{n}\left(m\left(\xi_{\tau}\right)\right)+c\tau and nϖn(m~(ρn(τ)))+cτ\sqrt{n}\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)+c\tau are uniformly bounded by some (suitably large) constant L<L<\infty. The first two quantities in (B.9) are therefore bounded by LP¯n(Anc)L\cdot\bar{P}_{n}(A_{n}^{c}) and LP¯~n(Anc)L\cdot\tilde{\bar{P}}_{n}(A_{n}^{c}).

By Lemma 4, P¯n(Anc)\bar{P}_{n}(A_{n}^{c}) can be made arbitrarily small by choosing a suitably large MM in the definition of AnA_{n}. Hence the first term in (B.9) converges to 0 as nn\to\infty.

By Lemma 9, $\int d\tilde{\bar{P}}_{n}\equiv\eta(0,0,0)=1+O(n^{-\vartheta})$ for some $\vartheta>0$. But Lemma 5 implies $\tilde{\bar{P}}_{n}(A_{n})=\bar{P}_{n}(A_{n})+o(1)=1+o(1)$. Hence $\tilde{\bar{P}}_{n}(A_{n}^{c})=\int d\tilde{\bar{P}}_{n}-\tilde{\bar{P}}_{n}(A_{n})\to0$ as $n\to\infty$, and so the second term in (B.9) converges to 0 as well.

The third term in (B.9) is bounded by LP¯nAnP¯~nAnTVL\cdot\left\|\bar{P}_{n}\cap A_{n}-\tilde{\bar{P}}_{n}\cap A_{n}\right\|_{\textrm{TV}}. By Lemma 5, it converges to 0 as nn\to\infty.

Finally, for the fourth term of (B.9), observe that by (B.10),

n|ϖn(m(ξτ))ϖn(m~(ρn(τ)))|\displaystyle\sqrt{n}\left|\varpi_{n}\left(m\left(\xi_{\tau}\right)\right)-\varpi_{n}\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)\right| =|ϖ(m(ξτ))ϖ(m~(ρn(τ)))|+o(1)\displaystyle=\left|\varpi\left(m\left(\xi_{\tau}\right)\right)-\varpi\left(\tilde{m}\left(\rho_{n}(\tau)\right)\right)\right|+o(1)
σ1+σ02Δ|m(ξτ)m~(ρn(τ))|+o(1),\displaystyle\leq\frac{\sigma_{1}+\sigma_{0}}{2}\Delta^{*}\left|m\left(\xi_{\tau}\right)-\tilde{m}\left(\rho_{n}(\tau)\right)\right|+o(1),

where the last step follows from the definition of $\varpi(\cdot)$. Since $\tau\leq T$, the fourth term is bounded (uniformly over $\bm{d}\in\bar{\mathcal{D}}_{n,T}$) by

\[
\frac{\sigma_{1}+\sigma_{0}}{2}\Delta^{*}\bar{\mathbb{E}}_{n}\left[\sup_{t\in[0,T]}\mathbb{I}_{A_{n}}\cdot\left|m_{n}(\xi_{t})-m(\rho_{n}(t))\right|\right]+o(1),
\]

which is $o(1)$ because of Lemma 6. This concludes the proof of the lemma. ∎
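The Lipschitz step for $\varpi(\cdot)$ can be verified numerically: because $m\mapsto\min\{m,1-m\}$ has slope at most one in absolute value, $\varpi$ is Lipschitz with constant $\frac{\sigma_{1}+\sigma_{0}}{2}\Delta^{*}$. The sketch below checks this on a grid; the parameter values and the helper name `max_slope` are illustrative assumptions, not quantities from the paper.

```python
def varpi(m, sigma1, sigma0, delta_star):
    """Limit payoff (sigma1 + sigma0)/2 * Delta* * min(m, 1 - m)."""
    return 0.5 * (sigma1 + sigma0) * delta_star * min(m, 1.0 - m)

def max_slope(sigma1, sigma0, delta_star, grid_size=201):
    """Largest difference quotient of varpi over an even grid on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    worst = 0.0
    for m in grid:
        for m2 in grid:
            if m != m2:
                gap = abs(varpi(m, sigma1, sigma0, delta_star)
                          - varpi(m2, sigma1, sigma0, delta_star))
                worst = max(worst, gap / abs(m - m2))
    return worst

# Hypothetical parameter values, chosen only to exercise the bound.
sigma1, sigma0, delta_star = 1.0, 2.0, 0.7
print(max_slope(sigma1, sigma0, delta_star), "<=", 0.5 * (sigma1 + sigma0) * delta_star)
```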

Lemma 8.

The function V~n,T:=inf𝐝D¯n,TV~n(𝐝,m0)\tilde{V}_{n,T}^{*}:=\inf_{\bm{d}\in\bar{D}_{n,T}}\tilde{V}_{n}(\bm{d},m_{0}), where V~n(𝐝,m0)\tilde{V}_{n}(\bm{d},m_{0}) is defined in (A.13), is the solution at (0,0,0,0)(0,0,0,0) of the recursive equation (A.15).

Proof.

In what follows, we define ϖn():=ϖn(m~n())\varpi_{n}(\cdot):=\varpi_{n}\left(\tilde{m}_{n}(\cdot)\right).

Step 1 (Disintegration of P¯~n\tilde{\bar{P}}_{n}). We start by presenting a disintegration result for P¯~n\tilde{\bar{P}}_{n}; this will turn out to be convenient when applying a dynamic programming argument on V~n,T\tilde{V}_{n,T}^{*}. Recall the definitions of p~nq1,nq0,𝒉(𝐲nq1(1),𝐲nq0(0))\tilde{p}_{nq_{1},nq_{0},\bm{h}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right) and p¯~nq1,nq0(𝐲nq1(1),𝐲nq0(0))\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right) from Step 0 of the proof of Theorem 2. Also, let

p~nq1,nq0(𝐲nq1(1),𝐲nq0(0),𝒉):=p~nq1,nq0,𝒉(𝐲nq1(1),𝐲nq0(0))m0(𝒉)\tilde{p}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h}\right):=\tilde{p}_{nq_{1},nq_{0},\bm{h}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\cdot m_{0}^{*}(\bm{h})

denote the joint probability density (wrt ν×ν1\nu\times\nu_{1}) over 𝐲nq1(1),𝐲nq0(0),𝒉{\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h} and take P~nq1,nq0\tilde{P}_{nq_{1},nq_{0}} to be the corresponding probability measure. By the disintegration theorem and the definition of p~n(𝒉|ρ)\tilde{p}_{n}(\bm{h}|\rho), we have

p~nq1,nq0(𝐲nq1(1),𝐲nq0(0),𝒉)=p¯~nq1,nq0(𝐲nq1(1),𝐲nq0(0))p~n(𝒉|ρ).\tilde{p}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h}\right)=\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\cdot\tilde{p}_{n}(\bm{h}|\rho). (B.11)

Note that in the above ρ\rho is a function of 𝐲nq1(1),𝐲nq0(0){\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)} - it is defined in (A.11) - but we have elected to suppress this dependence.

Now, λnT,h(a)(𝐲nT(a))\lambda_{nT,h}^{(a)}({\bf y}_{nT}^{(a)}) can be written as

λnT,h(a)(𝐲nT(a))=i=1nTexp{hnψ(Yi(a))12nhIah}pθ0(a)(Yi(a)).\lambda_{nT,h}^{(a)}({\bf y}_{nT}^{(a)})=\prod_{i=1}^{nT}\exp\left\{\frac{h^{\intercal}}{\sqrt{n}}\psi(Y_{i}^{(a)})-\frac{1}{2n}h^{\intercal}I_{a}h\right\}p_{\theta_{0}}^{(a)}(Y_{i}^{(a)}). (B.12)

Hence, for any $q_{1},q_{0}$,

\begin{align*}
\tilde{p}_{nT,nT}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)},\bm{h}\right) & =\tilde{p}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h}\right)\cdot\frac{\lambda_{nT,h_{1}}^{(1)}\left({\bf y}_{nT}^{(1)}\right)\cdot\lambda_{nT,h_{0}}^{(0)}\left({\bf y}_{nT}^{(0)}\right)}{\lambda_{nq_{1},h_{1}}^{(1)}\left({\bf y}_{nq_{1}}^{(1)}\right)\cdot\lambda_{nq_{0},h_{0}}^{(0)}\left({\bf y}_{nq_{0}}^{(0)}\right)}\\
 & =\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\cdot\tilde{p}_{n}(\bm{h}|\rho)\cdot\frac{\lambda_{nT,h_{1}}^{(1)}\left({\bf y}_{nT}^{(1)}\right)\cdot\lambda_{nT,h_{0}}^{(0)}\left({\bf y}_{nT}^{(0)}\right)}{\lambda_{nq_{1},h_{1}}^{(1)}\left({\bf y}_{nq_{1}}^{(1)}\right)\cdot\lambda_{nq_{0},h_{0}}^{(0)}\left({\bf y}_{nq_{0}}^{(0)}\right)},
\end{align*}

where the first equality is a consequence of (B.12) and the definition of $\tilde{p}_{nT,nT}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)},\bm{h}\right)$, and the second equality follows from (B.11). Integrating with respect to the dominating measure, $\nu_{1}(\bm{h})$, on both sides of the expression then gives (the quantity $\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0})$ is defined in (A.14))\footnote{Recall that $\nu_{1}(h)$ is some dominating measure for the prior $m_{0}$. Here, it can be taken to be the counting measure on $(-h_{1}^{*},h_{0}^{*})$ and $(h_{1}^{*},h_{0}^{*})$.}

p¯~nT,nT(𝐲nT(1),𝐲nT(0))=p¯~nq1,nq0(𝐲nq1(1),𝐲nq0(0))p¯~n(𝐲nq1(1),𝐲nq0(0)|ρ,q1,q0).\tilde{\bar{p}}_{nT,nT}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right)=\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\cdot\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0}). (B.13)

Step 2 (Relating successive values of $\tilde{\bar{p}}_{n}(\cdot,\cdot|\rho,q_{1},q_{0})$). The quantity $\tilde{\bar{p}}_{n}({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0})$ specifies the density of the unobserved elements, ${\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}$, of ${\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}$ when the current state is $\rho,q_{1},q_{0}$. In this step, we aim to characterize the density of the remaining elements of ${\bf y}_{-nq_{a}}^{(a)}$ if, starting from the state $\rho,q_{1},q_{0}$, we assign treatment $a$ and observe the first element, $Y_{nq_{a}+1}^{(a)}$, of ${\bf y}_{-nq_{a}}^{(a)}$.

We start by noting that (B.11) and (B.13) are valid for any $\rho,q_{1},q_{0}$, as long as $q_{1},q_{0}<T$. Suppose treatment $1$ is employed when the current state is ${\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}$. Then, it is easily verified that

p~nq1+1,nq0(𝐲nq1+1(1),𝐲nq0(0),𝒉)\displaystyle\tilde{p}_{nq_{1}+1,nq_{0}}\left({\bf y}_{nq_{1}+1}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h}\right)
=p~nq1,nq0(𝐲nq1(1),𝐲nq0(0),𝒉)exp{1nh1ψ(Ynq1+1(1))12nh1I1h1}pθ0(1)(Ynq1+1(1))\displaystyle=\tilde{p}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)},\bm{h}\right)\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi\left(Y_{nq_{1}+1}^{(1)}\right)-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\}p_{\theta_{0}}^{(1)}\left(Y_{nq_{1}+1}^{(1)}\right)
=p¯~nq1,nq0(𝐲nq1(1),𝐲nq0(0))p~n(𝒉|ρ)exp{1nh1ψ(Ynq1+1(1))12nh1I1h1}pθ0(1)(Ynq1+1(1)),\displaystyle=\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\tilde{p}_{n}(\bm{h}|\rho)\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi\left(Y_{nq_{1}+1}^{(1)}\right)-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\}p_{\theta_{0}}^{(1)}\left(Y_{nq_{1}+1}^{(1)}\right),

where the last equality follows from (B.11). Integrating with respect to $\nu_{1}(\bm{h})$ on both sides then gives\footnote{The quantity $\tilde{p}_{n}(Y^{(a)}|\rho)$ is defined in Step 2 of the proof of Theorem 2.}

p¯~nq1+1,nq0(𝐲nq1+1(1),𝐲nq0(0))\displaystyle\tilde{\bar{p}}_{nq_{1}+1,nq_{0}}\left({\bf y}_{nq_{1}+1}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right) =p¯~nq1,nq0(𝐲nq1(1),𝐲nq0(0))p~n(Ynq1+1(1)|ρ).\displaystyle=\tilde{\bar{p}}_{nq_{1},nq_{0}}\left({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}\right)\tilde{p}_{n}\left(Y_{nq_{1}+1}^{(1)}|\rho\right). (B.14)

Applying (B.13) twice, with the values (ρ,q1,q0)(\rho,q_{1},q_{0}) and (ρ,q1+1n,q0)(\rho^{\prime},q_{1}+\frac{1}{n},q_{0}), and making use of (B.14) together with the definition of ρ\rho from (A.11), we conclude

p¯~n(𝐲nq1(1),𝐲nq0(0)|ρ,q1,q0)\displaystyle\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0}\right) =p¯~n(𝐲nq11(1),𝐲nq0(0)|ρ,q1+1n,q0)p~n(Ynq1+1(1)|ρ),\displaystyle=\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}-1}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho^{\prime},q_{1}+\frac{1}{n},q_{0}\right)\cdot\tilde{p}_{n}\left(Y_{nq_{1}+1}^{(1)}|\rho\right), (B.15)

where $\rho^{\prime}:=\rho+n^{-1/2}I_{1}^{-1}\psi_{1}\left(Y_{nq_{1}+1}^{(1)}\right)$. Analogously,

\[
\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho,q_{1},q_{0}\right)=\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}-1}^{(0)}|\rho^{\prime},q_{1},q_{0}+\frac{1}{n}\right)\cdot\tilde{p}_{n}\left(Y_{nq_{0}+1}^{(0)}|\rho\right),\tag{B.16}
\]

with $\rho^{\prime}$ now being $\rho-n^{-1/2}I_{0}^{-1}\psi_{0}\left(Y_{nq_{0}+1}^{(0)}\right)$.

Step 3 (Recursive expression for V~n,T\tilde{V}_{n,T}^{*}). Let ρj\rho_{j} denote the value of ρn()\rho_{n}(\cdot) at period jj. Suppose that at period jj of the experiment, the state is ξj=(𝐲nq1(1),𝐲nq0(0))\xi_{j}=({\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}). The posterior, p¯~n(,|ρj,q1,q0)\tilde{\bar{p}}_{n}\left(\cdot,\cdot|\rho_{j},q_{1},q_{0}\right) provides the density of the remaining elements 𝐲nq1(1),𝐲nq0(0){\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)} of the vector 𝐲nT(1),𝐲nT(0){\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}. By extension, we may define p¯~n,j(𝐲nT(1),𝐲nT(0)|ρj,q1,q0)\tilde{\bar{p}}_{n,j}({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j},q_{1},q_{0}) as the density induced over paths 𝐲nT(1),𝐲nT(0){\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)} given the knowledge of ξj\xi_{j}. This density consists of a point mass for 𝐲nq1(1),𝐲nq0(0){\bf y}_{nq_{1}}^{(1)},{\bf y}_{nq_{0}}^{(0)}, with the rest of the components 𝐲nq1(1),𝐲nq0(0){\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)} of the vector 𝐲nT(1),𝐲nT(0){\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)} distributed as p¯~n(𝐲nq1(1),𝐲nq0(0)|ρj,q1,q0)\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho_{j},q_{1},q_{0}\right).

Let $\mathcal{T}_{j}=\{j/n,(j+1)/n,\dots,1\}$ and take $\mathcal{D}_{n,j:T}$ to be the set of all possible decision rules $(\{\pi_{nt}\}_{t\in\mathcal{T}_{j}},\tau,\delta)$ starting from period $j$ with the usual measurability restrictions, i.e., $\pi_{nt}(\cdot)$ is $\mathcal{F}_{t-1/n}$ measurable, the stopping time $\tau$ is sequentially $\mathcal{F}_{t-1/n}$ measurable, and the implementation rule $\delta$ is $\mathcal{F}_{\tau}$ measurable. Recalling that $\varpi_{n}(\rho):=\varpi_{n}\left(\tilde{m}_{n}(\rho)\right)$, define

V~n,T(ξj)=inf𝒅𝒟n,j:T{nϖn(ρn(τ))+c(τjn)}𝑑p¯~n,j(𝐲nT(1),𝐲nT(0)|ρj,q1,q0),\tilde{V}_{n,T}^{*}(\xi_{j})=\inf_{\bm{d}\in\mathcal{D}_{n,j:T}}\int\left\{\sqrt{n}\varpi_{n}(\rho_{n}(\tau))+c\left(\tau-\frac{j}{n}\right)\right\}d\tilde{\bar{p}}_{n,j}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j},q_{1},q_{0}\right), (B.17)

with the convention that at j=0j=0,

V~n,T(ξ0)=inf𝒅𝒟n,T{nϖn(ρn(τ))+cτ}𝑑p¯~n(𝐲nT(1),𝐲nT(0)).\tilde{V}_{n,T}^{*}(\xi_{0})=\inf_{\bm{d}\in\mathcal{D}_{n,T}}\int\left\{\sqrt{n}\varpi_{n}(\rho_{n}(\tau))+c\tau\right\}d\tilde{\bar{p}}_{n}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}\right).

The quantity V~n,T(ξj)\tilde{V}_{n,T}^{*}(\xi_{j}) is akin to the value function at period jj. Note also that the quantities ρn(τ),τ\rho_{n}(\tau),\tau in (B.17) are functions of 𝐲nT(1),𝐲nT(0){\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}.

Clearly, V~n,T=V~n,T(ξ0)\tilde{V}_{n,T}^{*}=\tilde{V}_{n,T}^{*}(\xi_{0}) by definition, so the claim follows if we show: (i) V~n,T(ξj)=V~n,T(ρj,q1,q0,j/n)\tilde{V}_{n,T}^{*}(\xi_{j})=\tilde{V}_{n,T}^{*}(\rho_{j},q_{1},q_{0},j/n), i.e., it is function only of (ρj,q1,q0,t=j/n)(\rho_{j},q_{1},q_{0},t=j/n); and (ii) it satisfies the recursion (A.15). To show this, we adopt the usual approach in dynamic programming of using backward induction.

First, we argue that the induction hypothesis holds at j=nTj=nT (corresponding to t=Tt=T). Indeed,

V~n,T(ξnT)\displaystyle\tilde{V}_{n,T}^{*}(\xi_{nT}) :=nϖn(ρnT)𝑑p¯~n,nT(𝐲nT(1),𝐲nT(0)|ρnT,q1,q0)\displaystyle:=\int\sqrt{n}\varpi_{n}(\rho_{nT})d\tilde{\bar{p}}_{n,nT}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{nT},q_{1},q_{0}\right)
=nϖn(ρnT)𝑑p¯~n(𝐲nq1(1),𝐲nq0(0)|ρnT,q1,q0)=nη(ρnT,q1,q0)ϖn(ρnT)\displaystyle=\int\sqrt{n}\varpi_{n}(\rho_{nT})d\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho_{nT},q_{1},q_{0}\right)=\sqrt{n}\eta(\rho_{nT},q_{1},q_{0})\varpi_{n}(\rho_{nT})

and we can therefore write $\tilde{V}_{n,T}^{*}(\xi_{nT})=\tilde{V}_{n,T}^{*}\left(\rho_{nT},q_{1},q_{0},T\right)$ as a function only of $\rho_{nT},q_{1},q_{0},T$.

Now suppose that the induction hypothesis holds for the periods j+1,,nTj+1,\dots,nT. Consider the various possibilities at period jj. If the experiment is stopped right away, the continuation value of this choice is

V~n,T(ξj|τ=j)\displaystyle\tilde{V}_{n,T}^{*}(\xi_{j}|\tau=j) :=nϖn(ρj)𝑑p¯~n,j(𝐲nT(1),𝐲nT(0)|ρj,q1,q0)\displaystyle:=\int\sqrt{n}\varpi_{n}(\rho_{j})d\tilde{\bar{p}}_{n,j}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j},q_{1},q_{0}\right)
=nϖn(ρj)𝑑p¯~n(𝐲nq1(1),𝐲nq0(0)|ρj,q1,q0)=nη(ρj,q1,q0)ϖn(ρj).\displaystyle=\int\sqrt{n}\varpi_{n}(\rho_{j})d\tilde{\bar{p}}_{n}\left({\bf y}_{-nq_{1}}^{(1)},{\bf y}_{-nq_{0}}^{(0)}|\rho_{j},q_{1},q_{0}\right)=\sqrt{n}\eta(\rho_{j},q_{1},q_{0})\varpi_{n}(\rho_{j}).

On the other hand, if the experiment is continued and treatment 1 is sampled, the resulting continuation value is

\begin{align*}
 & \tilde{V}_{n,T}^{*}(\xi_{j}|\pi_{j}=1)\\
 & :=\inf_{\{\pi_{j}=1\}\cap\bm{d}\in\mathcal{D}_{n,j+1:T}}\int\left\{ \sqrt{n}\varpi_{n}\left(\rho_{n\tau}\right)+c\left(\tau-\frac{j}{n}\right)\right\} d\tilde{\bar{p}}_{n,j}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j},q_{1},q_{0}\right)\\
 & =\frac{c}{n}\int d\tilde{\bar{p}}_{n,j}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j},q_{1},q_{0}\right)\\
 & \quad+\inf_{\bm{d}\in\mathcal{D}_{n,j+1:T}}\int\int\left\{ \sqrt{n}\varpi_{n}\left(\rho_{n\tau}\right)+c\left(\tau-\frac{j+1}{n}\right)\right\} d\tilde{\bar{p}}_{n,j+1}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j+1},q_{1}+\frac{1}{n},q_{0}\right)d\tilde{p}_{n}\left(Y^{(1)}|\rho_{j}\right)\\
 & =\eta(\rho_{j},q_{1},q_{0})\frac{c}{n}\\
 & \quad+\int\left[\inf_{\bm{d}\in\mathcal{D}_{n,j+1:T}}\int\left\{ \sqrt{n}\varpi_{n}\left(\rho_{n\tau}\right)+c\left(\tau-\frac{j+1}{n}\right)\right\} d\tilde{\bar{p}}_{n,j+1}\left({\bf y}_{nT}^{(1)},{\bf y}_{nT}^{(0)}|\rho_{j+1},q_{1}+\frac{1}{n},q_{0}\right)\right]d\tilde{p}_{n}\left(Y^{(1)}|\rho_{j}\right)\\
 & =\frac{\eta(\rho_{j},q_{1},q_{0})c}{n}+\int\tilde{V}_{n,T}^{*}\left(\xi_{j+1}\right)d\tilde{p}_{n}\left(Y^{(1)}|\rho_{j}\right)\\
 & =\frac{\eta(\rho_{j},q_{1},q_{0})c}{n}+\int\tilde{V}_{n,T}^{*}\left(\rho_{j+1},q_{1}+\frac{1}{n},q_{0},\frac{j+1}{n}\right)d\tilde{p}_{n}\left(Y^{(1)}|\rho_{j}\right),
\end{align*}

where $\rho_{j+1}:=\rho_{j}+n^{-1/2}I_{1}^{-1}\psi_{1}(Y^{(1)})$ and $\xi_{j+1}=\xi_{j}\cup\{Y^{(1)}\}$. The first equality follows from (B.15), the second follows from a suitable measurable selection theorem (see, e.g., Bertsekas, 2012, Proposition A.5), the third from the definition of $\tilde{V}_{n,T}^{*}(\xi_{j+1})$, and the last equality from the induction hypothesis. In a similar vein, if treatment 0 were sampled, we would have

\[
\tilde{V}_{n,T}^{*}(\xi_{j}|\pi_{j}=0)=\frac{\eta(\rho_{j},q_{1},q_{0})c}{n}+\int\tilde{V}_{n,T}^{*}\left(\rho_{j+1},q_{1},q_{0}+\frac{1}{n},\frac{j+1}{n}\right)d\tilde{p}_{n}\left(Y^{(0)}|\rho_{j}\right).
\]

Now,

V~n,T(ξj)=min{V~n,T(ξj|τ=j),V~n,T(ξj|πj=1),V~n,T(ξj|πj=0)}.\tilde{V}_{n,T}^{*}(\xi_{j})=\min\left\{\tilde{V}_{n,T}^{*}(\xi_{j}|\tau=j),\tilde{V}_{n,T}^{*}(\xi_{j}|\pi_{j}=1),\tilde{V}_{n,T}^{*}(\xi_{j}|\pi_{j}=0)\right\}. (B.18)

We have shown above that each of the three terms within the minimum in (B.18) is a function only of $\rho_{j},q_{1},q_{0},j/n$. Hence, $\tilde{V}_{n,T}^{*}(\xi_{j})=\tilde{V}_{n,T}^{*}(\rho_{j},q_{1},q_{0},j/n)$. Furthermore, by the expressions for these quantities, it is clear that (B.18) is none other than (A.15). This proves the induction hypothesis for period $j$. The claim follows. ∎
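The backward induction in Step 3 has the familiar "stop, or pay the sampling cost and continue" structure. The sketch below illustrates that structure on a deliberately simplified toy problem and is not the recursion (A.15) itself: it assumes a single stream of binary up/down observations in place of the two treatments, sets the normalizing factor $\eta$ to one, replaces $\sqrt{n}\varpi_{n}$ by $\min\{m,1-m\}$, and uses a symmetric two-point prior. The function name `toy_value` and all parameter values are hypothetical.

```python
from math import sqrt

def toy_value(n=100, T=1.0, c=0.5, delta=1.0):
    """Backward induction for a toy stopping problem on a binomial tree.

    State at period j: k = (#up steps) - (#down steps), standing in for rho.
    Two-point prior: a step is up w.p. (1 + d)/2 under one hypothesis and
    (1 - d)/2 under the other, with d = delta / sqrt(n).
    Returns the value at the initial state (k = 0, j = 0).
    """
    steps = int(round(n * T))
    d = delta / sqrt(n)
    if not 0.0 < d < 1.0:
        raise ValueError("need 0 < delta / sqrt(n) < 1")
    ratio = (1.0 + d) / (1.0 - d)  # likelihood ratio of a single up step

    def posterior(k):
        odds = ratio ** k  # prior odds equal 1 under the symmetric prior
        return odds / (1.0 + odds)

    def stop_value(k):
        m = posterior(k)
        return min(m, 1.0 - m)  # stand-in for the payoff from stopping

    # Terminal period: stopping is the only option.
    value = {k: stop_value(k) for k in range(-steps, steps + 1)}
    for j in range(steps - 1, -1, -1):
        new_value = {}
        for k in range(-j, j + 1):
            m = posterior(k)
            # Posterior predictive probability of an up step.
            p_up = m * (1.0 + d) / 2.0 + (1.0 - m) * (1.0 - d) / 2.0
            cont = c / n + p_up * value[k + 1] + (1.0 - p_up) * value[k - 1]
            new_value[k] = min(stop_value(k), cont)
        value = new_value
    return value[0]

print(toy_value(c=0.0), toy_value(c=0.5), toy_value(c=100.0))
```

As in the proof, the value at each period depends on the history only through a low-dimensional state, here $(k,j)$, and is the minimum over the available actions. The toy value is increasing in the sampling cost and equals the immediate stopping payoff once sampling is prohibitively expensive.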

Lemma 9.

There exist non-random constants, M<M<\infty and ϑ(0,1/2)\vartheta\in(0,1/2) such that supρ,q1,q0|η(ρ,q1,q0)1|Mnϑ\sup_{\rho,q_{1},q_{0}}\left|\eta(\rho,q_{1},q_{0})-1\right|\leq Mn^{-\vartheta}.

Proof.

By (B.12),

\begin{align*}
 & \frac{\lambda_{nT,h_{1}}^{(1)}\left({\bf y}_{nT}^{(1)}\right)\cdot\lambda_{nT,h_{0}}^{(0)}\left({\bf y}_{nT}^{(0)}\right)}{\lambda_{nq_{1},h_{1}}^{(1)}\left({\bf y}_{nq_{1}}^{(1)}\right)\cdot\lambda_{nq_{0},h_{0}}^{(0)}\left({\bf y}_{nq_{0}}^{(0)}\right)}\\
 & =\left\{ \prod_{i=nq_{1}+1}^{nT}\exp\left\{ \frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi_{1}(Y_{i}^{(1)})-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\} p_{\theta_{0}}^{(1)}(Y_{i}^{(1)})\right\} \cdot\left\{ \prod_{i=nq_{0}+1}^{nT}\exp\left\{ \frac{1}{\sqrt{n}}h_{0}^{\intercal}\psi_{0}(Y_{i}^{(0)})-\frac{1}{2n}h_{0}^{\intercal}I_{0}h_{0}\right\} p_{\theta_{0}}^{(0)}(Y_{i}^{(0)})\right\} .
\end{align*}

Making use of the above in the definition of $\tilde{\bar{p}}_{n}(\cdot,\cdot|\rho,q_{1},q_{0})$ and applying Fubini's theorem gives

\[
\eta(\rho,q_{1},q_{0})=\int\prod_{a\in\{0,1\}}\prod_{i=nq_{a}+1}^{nT}\left\{ \int\exp\left(\frac{1}{\sqrt{n}}h_{a}^{\intercal}\psi_{a}(Y_{i}^{(a)})-\frac{1}{2n}h_{a}^{\intercal}I_{a}h_{a}\right)p_{\theta_{0}}^{(a)}(Y_{i}^{(a)})dY_{i}^{(a)}\right\} d\tilde{p}_{n}(\bm{h}|\rho).\tag{B.19}
\]

Denote

gan(h,Y)\displaystyle g_{an}(h,Y) =1nhψa(Y)12nhIah,\displaystyle=\frac{1}{\sqrt{n}}h^{\intercal}\psi_{a}(Y)-\frac{1}{2n}h^{\intercal}I_{a}h,
δan(h,Y)\displaystyle\delta_{an}(h,Y) =exp{gan(h,Y)}{1+gan(h,Y)+gan(h,Y)2/2},\displaystyle=\exp\{g_{an}(h,Y)\}-\{1+g_{an}(h,Y)+g_{an}(h,Y)^{2}/2\},

and take $\mathbb{E}_{p_{0}^{(a)}}[\cdot]$ to be the expectation corresponding to $p_{\theta_{0}}^{(a)}(Y_{i}^{(a)})$. Then, writing the inner integral (within the $\{\}$ brackets) in (B.19) as $b_{a}(h_{a})$, we find

ba(ha)\displaystyle b_{a}(h_{a}) =𝔼p0(a)[exp{1nhaψa(Y(a))12nhaIaha}]\displaystyle=\mathbb{E}_{p_{0}^{(a)}}\left[\exp\left\{\frac{1}{\sqrt{n}}h_{a}^{\intercal}\psi_{a}(Y^{(a)})-\frac{1}{2n}h_{a}^{\intercal}I_{a}h_{a}\right\}\right]
=𝔼p0(a)[1+gan(ha,Y(a))+12gan2(ha,Y(a))]+𝔼p0(a)[δan(ha,Y(a))]\displaystyle=\mathbb{E}_{p_{0}^{(a)}}\left[1+g_{an}(h_{a},Y^{(a)})+\frac{1}{2}g_{an}^{2}(h_{a},Y^{(a)})\right]+\mathbb{E}_{p_{0}^{(a)}}\left[\delta_{an}(h_{a},Y^{(a)})\right]
:=Qn1(ha)+Qn2(ha).\displaystyle:=Q_{n1}(h_{a})+Q_{n2}(h_{a}). (B.20)

Since $\psi_{a}(\cdot)$ is the score function at $\theta_{0}$, $\mathbb{E}_{p_{0}^{(a)}}[\psi_{a}(Y^{(a)})]=0$ and $\mathbb{E}_{p_{0}^{(a)}}[\psi_{a}(Y^{(a)})\psi_{a}(Y^{(a)})^{\intercal}]=I_{a}$. Using these results, and noting that the support of $h_{a}$ is only $\{h_{a}^{*},-h_{a}^{*}\}$ with $\left\Vert h_{a}^{*}\right\Vert :=\Gamma<\infty$ due to the form of the prior, some straightforward algebra implies

Qn1(ha)=1+bn,where bnΓ4/(8n2eig¯(Ia)).Q_{n1}(h_{a})=1+b_{n},\ \textrm{where }b_{n}\leq\Gamma^{4}/(8n^{2}\textrm{$\underline{eig}$}(I_{a})).

Here, $\underline{eig}(I_{a})$ denotes the minimum eigenvalue of $I_{a}$. Next, we can expand $Q_{n2}$ as:

\[
Q_{n2}(h_{a})=\mathbb{E}_{p_{0}^{(a)}}\left[\mathbb{I}_{\left\Vert \psi_{a}(Y^{(a)})\right\Vert \leq K}\delta_{an}(h_{a},Y^{(a)})\right]+\mathbb{E}_{p_{0}^{(a)}}\left[\mathbb{I}_{\left\Vert \psi_{a}(Y^{(a)})\right\Vert >K}\delta_{an}(h_{a},Y^{(a)})\right].\tag{B.21}
\]

Since $\left\Vert h_{a}^{*}\right\Vert =\Gamma$ and $e^{x}-(1+x+x^{2}/2)=O(|x|^{3})$, the first term in (B.21) is bounded by $K^{3}\Gamma^{2}n^{-3/2}$ over $h_{a}\in\{h_{a}^{*},-h_{a}^{*}\}$. Furthermore, for large enough $n$, the second term in (B.21) is bounded by $\mathbb{E}_{p_{0}^{(a)}}\left[\exp\left\Vert \psi_{a}(Y^{(a)})\right\Vert \right]/\exp(bK)$ for any $b<1$. Hence, setting $K=(3/2b)\ln n$ gives $\sup_{h_{a}\in\{h_{a}^{*},-h_{a}^{*}\}}Q_{n2}(h_{a})=O\left(\ln^{3}n/n^{3/2}\right)$. Combining the above, we conclude there exists some non-random $L<\infty$ such that

\[
\sup_{h_{a}\in\{h_{a}^{*},-h_{a}^{*}\}}|b_{a}(h_{a})-1|\leq Ln^{-c}\ \textrm{for any }c<3/2.
\]

Substituting the above bound on $b_{a}(h_{a})$ into (B.19) gives

$$\eta(\rho,q_{1},q_{0})\leq\prod_{a\in\{0,1\}}\prod_{i=nq_{a}+1}^{nT}(1+Ln^{-c})\leq(1+Ln^{-c})^{2nT}\leq1+Mn^{-(c-1)},$$

for some $M<\infty$. Since $c$ can be chosen arbitrarily close to $3/2$, taking $c\in(1,3/2)$ gives $\vartheta:=c-1\in(0,1/2)$, and the claim follows. ∎
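The last inequality in the display rests on the elementary bounds $(1+x)^{m}\leq e^{mx}$ and $e^{y}-1\leq ye^{y}$, which suggest that $M=2TLe^{2TL}$ would work whenever $c>1$. The following is a minimal numerical sketch of that step, not part of the original proof; the values of `L`, `c` and `T` are illustrative assumptions.

```python
import math

def scaled_excess(n, L=1.0, c=1.25, T=1.0):
    """Return ((1 + L*n**-c)**(2*n*T) - 1) * n**(c - 1).

    L, c and T are illustrative values (assumptions), not taken from the paper.
    If the bound 1 + M*n**-(c-1) holds, this quantity stays below M for all n.
    """
    x = L * n ** (-c)
    # (1 + x)**m is computed stably as exp(m * log1p(x)); expm1 subtracts the 1.
    excess = math.expm1(2 * n * T * math.log1p(x))
    return excess * n ** (c - 1)

sample_sizes = (10, 100, 1_000, 10_000, 100_000)
ratios = [scaled_excess(n) for n in sample_sizes]
for n, r in zip(sample_sizes, ratios):
    print(f"n={n:>6}  scaled excess={r:.4f}")
```

With these values the scaled excess decreases in $n$ and approaches $2TL$, which is consistent with a finite constant $M$.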

Lemma 10.

The solution $\breve{V}_{n,T}(\rho,t)$ of (A.18) converges locally uniformly to the unique viscosity solution of the HJB-VI (A.19).

Proof.

The proof consists of two steps. In the first step, we derive some preliminary results for expectations under the posterior $\tilde{p}_{n}(Y^{(a)}|\rho)$. Then, we use the abstract convergence result of Barles and Souganidis (1991) to show that $\breve{V}_{n,T}(\rho,t)$ converges locally uniformly to the viscosity solution of (A.19).

Step 1 (Some results on moments of $\tilde{p}_{n}(\cdot|\rho)$). Let $\tilde{\mathbb{E}}^{\rho}[\cdot]$ denote the expectation under $\tilde{p}_{n}(\cdot|\rho)$. In this step, we show that there exists $\xi_{n}\to0$ independent of $\rho$ and $a\in\{0,1\}$ such that

$$n\tilde{\mathbb{E}}^{\rho}\left[\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}}\right]=\frac{\Delta^{*}}{2}(2\tilde{m}(\rho)-1)+\xi_{n},\ \textrm{and} \tag{B.22}$$

$$\tilde{\mathbb{E}}^{\rho}\left[\left(\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sigma_{a}}\right)^{2}\right]=\frac{1}{2}+\xi_{n}. \tag{B.23}$$

Furthermore,

$$\tilde{\mathbb{E}}^{\rho}\left[\left|\frac{\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}}\right|^{3}\right]<\infty. \tag{B.24}$$

Start with (B.22). Suppose $a=1$. By the definition of $\tilde{p}_{n}(\cdot|\rho)$,

$$\begin{aligned}
\tilde{p}_{n}(Y^{(1)}|\rho) &=p_{\theta_{0}}^{(1)}(Y^{(1)})\left[\tilde{m}(\rho)\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{*\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{*\intercal}I_{1}h_{1}^{*}\right\}\right.\\
&\qquad+\left.(1-\tilde{m}(\rho))\exp\left\{\frac{-1}{\sqrt{n}}h_{1}^{*\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{*\intercal}I_{1}h_{1}^{*}\right\}\right].
\end{aligned}$$

Hence,

$$\begin{aligned}
\tilde{\mathbb{E}}^{\rho}\left[\frac{\dot{\mu}_{1}^{\intercal}I_{1}^{-1}\psi_{1}(Y^{(1)})}{\sigma_{1}}\right] &=\tilde{m}(\rho)\frac{\dot{\mu}_{1}^{\intercal}I_{1}^{-1}}{\sigma_{1}}\int\psi_{1}(Y^{(1)})\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{*\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{*\intercal}I_{1}h_{1}^{*}\right\}dp_{\theta_{0}}^{(1)}(Y^{(1)})\\
&\qquad+(1-\tilde{m}(\rho))\frac{\dot{\mu}_{1}^{\intercal}I_{1}^{-1}}{\sigma_{1}}\int\psi_{1}(Y^{(1)})\exp\left\{\frac{-1}{\sqrt{n}}h_{1}^{*\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{*\intercal}I_{1}h_{1}^{*}\right\}dp_{\theta_{0}}^{(1)}(Y^{(1)}).
\end{aligned}$$

Now, for each $h_{1}\in\{h_{1}^{*},-h_{1}^{*}\}$, define

$$\begin{aligned}
g_{1n}(h_{1},Y) &=\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi_{1}(Y)-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1},\qquad\textrm{and}\\
\delta_{1n}(h_{1},Y) &=\exp\{g_{1n}(h_{1},Y)\}-\{1+g_{1n}(h_{1},Y)\}.
\end{aligned}$$

Then,

$$\begin{aligned}
&\int\psi_{1}(Y^{(1)})\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\}dp_{\theta_{0}}^{(1)}(Y^{(1)})\\
&=\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\psi_{1}(Y^{(1)})\exp\left\{\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\}\right]\\
&=\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\psi_{1}(Y^{(1)})\left\{1+\frac{1}{\sqrt{n}}h_{1}^{\intercal}\psi_{1}(Y^{(1)})-\frac{1}{2n}h_{1}^{\intercal}I_{1}h_{1}\right\}\right]+\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\psi_{1}(Y^{(1)})\delta_{1n}(h_{1},Y^{(1)})\right].
\end{aligned}$$

Now, $\mathbb{E}_{p_{\theta_{0}}^{(1)}}[\psi_{1}(Y^{(1)})]=0$ and $\mathbb{E}_{p_{\theta_{0}}^{(1)}}[\psi_{1}(Y^{(1)})\psi_{1}(Y^{(1)})^{\intercal}]=I_{1}$. Hence, the first term in the above expression equals $n^{-1/2}I_{1}h_{1}$. As for the second term,

$$\begin{aligned}
\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\psi_{1}(Y^{(1)})\delta_{1n}(h_{1},Y^{(1)})\right] &=\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\mathbb{I}_{\left\|\psi_{1}(Y^{(1)})\right\|\leq K}\,\psi_{1}(Y^{(1)})\delta_{1n}(h_{1},Y^{(1)})\right]\\
&\quad+\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\mathbb{I}_{\left\|\psi_{1}(Y^{(1)})\right\|>K}\,\psi_{1}(Y^{(1)})\delta_{1n}(h_{1},Y^{(1)})\right].
\end{aligned} \tag{B.25}$$

Since $h_{1}\in\{h_{1}^{*},-h_{1}^{*}\}$ with $\left\|h_{1}^{*}\right\|:=\Gamma$, and $e^{x}-(1+x)=O(x^{2})$, the first term in (B.25) is bounded by $K^{3}\Gamma^{2}n^{-1}$. The second term in (B.25) is bounded by $\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\exp\left\|\psi_{1}(Y^{(1)})\right\|\right]/\exp(aK)$ for any $a<1$. Hence, setting $K=(1/a)\ln n$ gives

$$\max_{h_{1}\in\{h_{1}^{*},-h_{1}^{*}\}}\left\|\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\psi_{1}(Y^{(1)})\delta_{1n}(h_{1},Y^{(1)})\right]\right\|=O(\ln^{3}n/n).$$
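As a brief worked check of the rate (a sketch, assuming the exponential moment $\mathbb{E}_{p_{\theta_{0}}^{(1)}}[\exp\|\psi_{1}(Y^{(1)})\|]$ is finite, which the tail bound implicitly relies on), substituting $K=(1/a)\ln n$ into the two bounds gives:

```latex
\frac{K^{3}\Gamma^{2}}{n}\bigg|_{K=\frac{1}{a}\ln n}
  =\frac{\Gamma^{2}}{a^{3}}\cdot\frac{\ln^{3}n}{n},
\qquad
\frac{\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\exp\left\|\psi_{1}(Y^{(1)})\right\|\right]}{\exp(aK)}\bigg|_{K=\frac{1}{a}\ln n}
  =\frac{\mathbb{E}_{p_{\theta_{0}}^{(1)}}\left[\exp\left\|\psi_{1}(Y^{(1)})\right\|\right]}{n}.
```

The truncated term therefore dominates, and the sum of the two is of order $\ln^{3}n/n$.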

Combining the above results, we obtain

$$\begin{aligned}
\sqrt{n}\,\tilde{\mathbb{E}}^{\rho}\left[\frac{\dot{\mu}_{1}^{\intercal}I_{1}^{-1}\psi_{1}(Y^{(1)})}{\sigma_{1}}\right] &=\tilde{m}(\rho)\frac{\dot{\mu}_{1}^{\intercal}h_{1}^{*}}{\sigma_{1}}-(1-\tilde{m}(\rho))\frac{\dot{\mu}_{1}^{\intercal}h_{1}^{*}}{\sigma_{1}}+\xi_{n}\\
&=(2\tilde{m}(\rho)-1)\frac{\dot{\mu}_{1}^{\intercal}h_{1}^{*}}{\sigma_{1}}+\xi_{n}=(2\tilde{m}(\rho)-1)\frac{\Delta^{*}}{2}+\xi_{n},
\end{aligned}$$

where $\xi_{n}\asymp\ln^{3}n/\sqrt{n}$, and the last equality follows from the definition of $h_{1}^{*}$. In a similar manner, we can show for $a=0$ that

$$\sqrt{n}\,\tilde{\mathbb{E}}^{\rho}\left[\frac{\dot{\mu}_{0}^{\intercal}I_{0}^{-1}\psi_{0}(Y^{(0)})}{\sigma_{0}}\right]=-(2\tilde{m}(\rho)-1)\frac{\Delta^{*}}{2}+\xi_{n}.$$

This proves (B.22).
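The first-moment calculation can be checked numerically in a toy case. The sketch below is an illustration under assumptions that are not in the paper: the outcome is scalar, the baseline density is standard normal, $\psi_{1}(Y)=Y$, $I_{1}=1$ and $\dot{\mu}_{1}/\sigma_{1}=1$, so that $h$ plays the role of $\Delta^{*}/2$. In this Gaussian case each exponential tilt is exactly a mean shift, so the remainder $\xi_{n}$ vanishes.

```python
import numpy as np

def tilted_mixture_mean(m, h, n):
    """Mean of Y under the two-point mixture of exponential tilts.

    Toy assumptions (not from the paper): Y ~ N(0, 1) under the baseline,
    score psi(Y) = Y, information I = 1; m is the weight on +h.
    """
    y = np.linspace(-12.0, 12.0, 200_001)
    dy = y[1] - y[0]
    base = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)
    tilt_up = np.exp(h * y / np.sqrt(n) - h**2 / (2.0 * n))
    tilt_down = np.exp(-h * y / np.sqrt(n) - h**2 / (2.0 * n))
    density = base * (m * tilt_up + (1.0 - m) * tilt_down)
    # Riemann sum; very accurate here because the integrand is smooth
    # and negligible at the grid edges.
    return float(np.sum(y * density) * dy)

m, h, n = 0.7, 1.5, 400
scaled_mean = np.sqrt(n) * tilted_mixture_mean(m, h, n)
print(scaled_mean, (2 * m - 1) * h)
```

The two printed numbers should agree up to quadrature error, matching the form $(2\tilde{m}(\rho)-1)\Delta^{*}/2$ derived above.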

The proofs of (B.23) and (B.24) are analogous.

Step 2 (Convergence to the HJB-VI). We now make the time change $\tau:=T-t$. Let $\mathbb{I}_{n}=\mathbb{I}\{\tau<1/n\}$ and $\mathbb{I}_{n}^{c}=\mathbb{I}\{\tau\geq1/n\}$. Also, denote the state variables by $s:=(\rho,\tau)$ and take $\mathcal{S}$ to be the domain of $s$. Finally, let $C^{\infty}(\mathcal{S})$ denote the set of all infinitely differentiable functions $\phi:\mathcal{S}\to\mathbb{R}$ such that $\sup_{q\geq0}|D^{q}\phi|\leq M$ for some $M<\infty$ (such functions $\phi$ are also known as test functions).

Following the time change, we can alternatively represent the solution, $\breve{V}_{n,T}^{*}(\cdot)$, to (A.18) as solving the approximation scheme (footnote 17: this alternative representation does not follow from an algebraic manipulation, but can be verified by checking that the relevant inequalities are preserved, e.g., $\varpi(\tilde{m}(\rho))-\breve{V}_{n,T}^{*}(\rho,\tau)>0$ implies $\breve{V}_{n,T}^{*}(\rho,\tau)=\frac{c}{n}+\min_{a\in\{0,1\}}\tilde{\mathbb{E}}^{\rho}\left[\phi\left(\rho+\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}},\tau-\frac{1}{n}\right)\right]$, etc.)

$$S_{n}(s,\phi(s),[\phi])=0\ \textrm{for }\tau>0;\quad\phi(\rho,0)=0, \tag{B.26}$$

where for any $u\in\mathbb{R}$ and $\phi:\mathcal{S}\to\mathbb{R}$,

$$\begin{aligned}
S_{n}(s,u,[\phi]) :={}&-\mathbb{I}_{n}^{c}\min\left\{\frac{\varpi(\tilde{m}(\rho))-u}{n},\frac{c}{n}+\min_{a\in\{0,1\}}\tilde{\mathbb{E}}^{\rho}\left[\phi\left(\rho+\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}},\tau-\frac{1}{n}\right)-u\right]\right\}\\
&-\mathbb{I}_{n}\frac{\varpi(\tilde{m}(\rho))-u}{n}.
\end{aligned}$$

Here, $[\phi]$ refers to the fact that it is a functional argument. Define

$$F(D^{2}\phi,D\phi,\phi,s)=-\min\left\{\varpi(\tilde{m}(\rho))-\phi,\ -\partial_{\tau}\phi+c+\frac{\Delta^{*}}{2}(2\tilde{m}(\rho)-1)\partial_{\rho}\phi+\frac{1}{2}\partial_{\rho}^{2}\phi\right\},$$

as the left-hand side of HJB-VI (A.19) after the time change. By Barles and Souganidis (1991), the solution, $\breve{V}_{n,T}^{*}(\cdot)$, of (B.26) converges to the solution, $V_{T}^{*}(\cdot)$, of $F(D^{2}\phi,D\phi,\phi,s)=0$ with the boundary condition $\phi(\rho,0)=0$ if the scheme $S_{n}(\cdot)$ satisfies the properties of monotonicity, stability and consistency.

Monotonicity requires $S_{n}(s,u,[\phi_{1}])\leq S_{n}(s,u,[\phi_{2}])$ for all $s\in\mathcal{S}$, $u\in\mathbb{R}$ and $\phi_{1}\geq\phi_{2}$. This is clearly satisfied.

Stability requires (B.26) to have a unique solution, $\breve{V}_{n,T}^{*}(\cdot)$, that is uniformly bounded. That a unique solution exists follows from backward induction. An upper bound on $\breve{V}_{n,T}^{*}(\cdot)$ is given by $cT+\sup_{\rho}\varpi(\tilde{m}(\rho))=cT+(\sigma_{1}+\sigma_{0})\Delta^{*}/2$.
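To illustrate how backward induction produces a unique, bounded solution of a recursion of this type, here is a minimal sketch on a toy problem. It is not the scheme (B.26) itself: it assumes a single action, symmetric steps of size $1/\sqrt{n}$ in place of the expectation under $\tilde{p}_{n}(\cdot|\rho)$, a forced stop at the horizon, and a hypothetical stopping payoff `toy_stop_value`.

```python
import numpy as np

def solve_stopping_scheme(stop_value, n, T, c, half_width=3.0):
    """Backward induction for V = min{stop, c/n + E[V(next state)]}.

    Toy assumptions (not from the paper): one action, the state moves by
    +/- 1/sqrt(n) with equal probability, the grid is clamped at its edges,
    and stopping is forced when no time remains.
    """
    step = 1.0 / np.sqrt(n)
    half_points = int(np.ceil(half_width / step))
    rho = step * np.arange(-half_points, half_points + 1)
    stop = stop_value(rho)
    value = stop.copy()  # terminal condition: forced stop at the horizon
    for _ in range(int(round(n * T))):
        padded = np.pad(value, 1, mode="edge")  # clamp at the grid edges
        continuation = c / n + 0.5 * (padded[:-2] + padded[2:])
        value = np.minimum(stop, continuation)
    return rho, value

def toy_stop_value(rho):
    # Hypothetical stopping payoff, largest when the state is uninformative.
    return 1.0 / (1.0 + np.abs(rho))

c, T = 0.5, 1.0
rho, value = solve_stopping_scheme(toy_stop_value, n=100, T=T, c=c)
print(value.max(), c * T + toy_stop_value(rho).max())
```

Each backward step defines the value uniquely from the previous one, and the computed value never exceeds the stopping payoff, so it also respects a bound of the form $cT+\sup_{\rho}\varpi(\tilde{m}(\rho))$.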

Finally, the consistency requirement has two parts: for all $\phi\in C^{\infty}(\mathcal{S})$, and $s\equiv(\rho,\tau)\in\mathcal{S}$ such that $\tau>0$, we require

$$\limsup_{\substack{n\to\infty\\ \gamma\to0\\ z\to s}}nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\leq F(D^{2}\phi(s),D\phi(s),\phi(s),s),\textrm{ and} \tag{B.27}$$

$$\liminf_{\substack{n\to\infty\\ \gamma\to0\\ z\to s}}nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\geq F(D^{2}\phi(s),D\phi(s),\phi(s),s). \tag{B.28}$$

For boundary values, $s\in\partial\mathcal{S}\equiv\{(\rho,0):\rho\in\mathbb{R}\}$, the consistency requirements are (see Barles and Souganidis, 1991)

$$\limsup_{\substack{n\to\infty\\ \gamma\to0\\ z\to s\in\partial\mathcal{S}}}nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\leq\max\left\{F(D^{2}\phi(s),D\phi(s),\phi(s),s),\ \phi(s)-\varpi(\tilde{m}(\rho))\right\}, \tag{B.29}$$

$$\liminf_{\substack{n\to\infty\\ \gamma\to0\\ z\to s\in\partial\mathcal{S}}}nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\geq\min\left\{F(D^{2}\phi(s),D\phi(s),\phi(s),s),\ \phi(s)-\varpi(\tilde{m}(\rho))\right\}. \tag{B.30}$$

Using (B.22)-(B.24), it is straightforward to show (B.27) and (B.28) by a third-order Taylor expansion; see Adusumilli (2021) for an illustration. For the boundary values, we can show (B.29) as follows (the proof of (B.30) is similar). Let $z\equiv(\rho_{z},\tau)$ denote some sequence converging to $s\equiv(\rho,0)\in\partial\mathcal{S}$, and write $u:=\phi(z)+\gamma$. By the definition of $S_{n}(\cdot)$, for every sequence $(n\to\infty,\gamma\to0,z\to s)$, there exists a sub-sequence such that either $nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])=-(\varpi(\tilde{m}(\rho_{z}))-u)$ or

$$\begin{aligned}
&nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\\
&=-\min\left\{\varpi(\tilde{m}(\rho_{z}))-u,\ c+n\min_{a\in\{0,1\}}\tilde{\mathbb{E}}^{\rho_{z}}\left[(\phi+\gamma)\left(\rho_{z}+\frac{(2a-1)\dot{\mu}_{a}^{\intercal}I_{a}^{-1}\psi_{a}(Y^{(a)})}{\sqrt{n}\sigma_{a}},\tau-\frac{1}{n}\right)-u\right]\right\}.
\end{aligned}$$

In the first instance, $nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\to-(\varpi(\tilde{m}(\rho))-\phi(s))$ by the continuity of $\varpi(\tilde{m}(\cdot))$ and $\phi$, while the second instance gives rise to the same expression for $S_{n}(\cdot)$ as in the interior, so that $nS_{n}(z,\phi(z)+\gamma,[\phi+\gamma])\to F(D^{2}\phi(s),D\phi(s),\phi(s),s)$ by arguments similar to those used to show (B.27). Thus, in all cases, the limit along subsequences is no larger than the right-hand side of (B.29). ∎