
Is Offline Decision Making Possible with Only Few Samples?
Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

Ruiqi Zhang    Yuexiang Zhai    Andrea Zanette
Abstract

What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples.

Our analysis reveals that stochastic policies can be substantially better than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples.

Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.

Multi-armed bandit, Offline reinforcement learning, Trust region.

1 Introduction

In several important problems, critical decisions must be made with just very few samples of pre-collected experience. For example, collecting samples in robotic manipulation may be slow and costly, and the ability to learn from very few interactions is highly desirable (Hester & Stone, 2013; Liu et al., 2021). Likewise, in clinical trials and in personalized medical decisions, reliable decisions must be made by relying on very small datasets (Liu et al., 2017). Sample efficiency is also key in personalized education (Bassen et al., 2020; Ruan et al., 2023).

However, to achieve good performance, the state-of-the-art algorithms may require millions of samples (Fu et al., 2020). These empirical findings seem to be supported by the existing theories: the sample complexity bounds, even minimax optimal ones, can be large in practice due to the large constants and the warmup factors (Ménard et al., 2021; Li et al., 2022; Azar et al., 2017; Zanette et al., 2019).

In this work, we study whether it is possible to make reliable decisions with only a few samples. We focus on an offline Multi-Armed Bandit (MAB) problem, which is a foundational model for decision-making (Lattimore & Szepesvári, 2020). In online MAB, an agent repeatedly chooses an arm from a set of arms, each providing a stochastic reward. Offline MAB is a variant where the agent cannot interact with the environment to gather new information and must instead make decisions based on a pre-collected dataset, without playing additional exploratory actions, aiming to identify the arm with the highest expected reward (Audibert et al., 2010; Garivier & Kaufmann, 2016; Russo, 2016; Ameko et al., 2020).

The standard approach to the problem is the Lower Confidence Bound (LCB) algorithm (Rashidinejad et al., 2021), a pessimistic variant of UCB (Auer et al., 2002) that involves selecting the arm with the highest lower bound on its performance. LCB encodes a principle called pessimism under uncertainty, which is the foundation principle for most algorithms for offline bandits and reinforcement learning (RL) (Jin et al., 2020; Zanette et al., 2020; Xie et al., 2021; Yin & Wang, 2021; Kumar et al., 2020; Kostrikov et al., 2021).

Unfortunately, the available methods that implement the principle of pessimism under uncertainty can fail in a data-starved regime because they rely on confidence intervals that are too loose when just a few samples are available. For example, even on a simple MAB instance with ten thousand arms, the best-known (Rashidinejad et al., 2021) performance bound for the LCB algorithm requires 24 samples per arm in order to provide meaningful guarantees, see Section 3.3. In more complex situations, such as in the sequential setting with function approximation, such a problem can become more severe due to the higher metric entropy of the function approximation class and the compounding of errors through time steps.

These considerations suggest that there is a “barrier of entry” to decision-making, both theoretically and practically: one needs to have a substantial number of samples in order to make reliable decisions even for settings as simple as offline MAB where the guarantees are tighter. Given the above technical reasons, and the lack of good algorithms and guarantees for data-starved decision problems, it is unclear whether it is even possible to find good decision rules with just a handful of samples.

In this paper, we make a substantial contribution towards lowering such barriers of entry. We discover that a carefully-designed algorithm tied to an advanced statistical analysis can substantially improve the sample complexity, both theoretically and practically, and enable reliable decision-making with just a handful of samples. More precisely, we focus on the offline MAB setting where we show that even if the dataset contains just a single sample in every arm, it may still be possible to compete with the optimal policy. This is remarkable, because with just one sample per arm—for example from a Bernoulli distribution—it is impossible to estimate the expected payoff of any of the arms! Our discovery is enabled by several key insights:

  • We search over stochastic policies, which we show can yield better performance for offline-decision making;

  • We use a localized notion of metric entropy to carefully control the size of the stochastic policy class that we search over;

  • We implement a concept called relative pessimism to obtain sharper guarantees.

These considerations lead us to design a trust region policy optimization algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST), one that offers superior theoretical as well as empirical performance compared to LCB in a data-scarce situation.

Moreover, we apply the algorithm to selected reinforcement learning problems from (Fu et al., 2020) in the special case where information about the logging policies is available. We do so by a simple reduction from reinforcement learning to bandits, mapping policies and returns in the former to actions and rewards in the latter, thereby disregarding the sequential aspect of the problem. Although we rely on the logging policies being available, the empirical study shows that our algorithm compares well with a strong deep reinforcement learning baseline (i.e., CQL from Kumar et al. (2020)), without being sensitive to partial observability, sparse rewards, or hyper-parameters.

2 Additional related work

Multi-armed bandit (MAB) is a classical decision-making framework (Lattimore & Szepesvári, 2020; Lai & Robbins, 1985; Lai, 1987; Langford & Zhang, 2007; Auer, 2002; Bubeck et al., 2012; Audibert et al., 2009; Degenne & Perchet, 2016). The natural approach in offline MABs is the LCB algorithm (Ameko et al., 2020; Si et al., 2020), an offline variant of the classical UCB method (Auer et al., 2002) which is minimax optimal (Rashidinejad et al., 2021).

The optimization over stochastic policies is also considered in combinatorial multi-armed bandits (CMAB) (Combes et al., 2015). Most works on CMAB focus on variants of the UCB algorithm (Kveton et al., 2015; Combes et al., 2015; Chen et al., 2016) or of Thompson sampling  (Wang & Chen, 2018; Liu & Ročková, 2023), and they are generally online.

Our framework can also be applied to offline reinforcement learning (RL) (Sutton & Barto, 2018) whenever the logging policies are accessible. There exist many practical algorithms for offline RL (Fujimoto et al., 2019; Peng et al., 2019; Wu et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021). The theory has also been investigated extensively in the tabular and function approximation settings (Nachum et al., 2019; Xie & Jiang, 2020; Zanette et al., 2021; Xie et al., 2021; Yin et al., 2022; Xiong et al., 2022). Some works have also sought to establish general guarantees for deep RL algorithms via sophisticated statistical tools, such as bootstrapping (Thomas et al., 2015; Nakkiran et al., 2020; Hao et al., 2021; Wang et al., 2022; Zhang et al., 2022).

We rely on the notion of pessimism, which is a key concept in offline bandits and RL. While most prior works focused on so-called absolute pessimism (Jin et al., 2020; Xie et al., 2021; Yin et al., 2022; Rashidinejad et al., 2021; Li et al., 2023), the work of Cheng et al. (2022) applied pessimism not to the policy value but to the difference (or improvement) between policies. However, their framework is very different from ours.

We make extensive use of two key concepts, namely localization laws and critical radii (Wainwright, 2019), which control the relative scale of the signal and the uncertainty. The idea of localization plays a critical role in the theory of empirical processes (Geer, 2000) and in statistical learning theory (Koltchinskii, 2001, 2006; Bartlett & Mendelson, 2002; Bartlett et al., 2005). The concept of critical radius or critical inequality is used in non-parametric regression (Wainwright, 2019) and in off-policy evaluation (Duan et al., 2021; Duan & Wainwright, 2022, 2023; Mou et al., 2022).

3 Data-Starved Multi-Armed Bandits

In this section, we describe the MAB setting and give an example of a “data-starved” MAB instance where prior methods (such as LCB) can fail. We informally say that an offline MAB is “data-starved” if its dataset contains only very few samples in each arm.

Notation  We let [n]=\{1,2,\ldots,n\} for a positive integer n. We let \left\|\cdot\right\|_{2} denote the Euclidean norm for vectors and the operator norm for matrices. We hide constants and logarithmic factors in the \widetilde{O}(\cdot) notation. We let \mathbb{B}_{p}^{d}(s)=\{x\in\mathbb{R}^{d}:\left\|x\right\|_{p}\leq s\} for any s\geq 0 and p\geq 1. We write a\lesssim b (resp. a\gtrsim b) if a\leq Cb (resp. a\geq Cb) for some numerical constant C, and a\simeq b if both a\lesssim b and b\lesssim a hold.

3.1 Multi-armed bandits

We consider the case where there are d arms in a set \mathcal{A}=\{a_{1},\ldots,a_{d}\} with expected rewards r(a_{i}), i\in[d]. We assume access to an offline dataset \mathcal{D}=\{(x_{i},r_{i})\}_{i\in[N]} of action-reward tuples, where the experienced actions \{x_{i}\}_{i\in[N]} are i.i.d. from a distribution \mu. Each experienced reward is a random variable with expectation \mathbb{E}[r_{i}]=r(x_{i}) and independent Gaussian noise \zeta_{i}:=r_{i}-r(x_{i}). For i\in[d], we denote the number of pulls of arm a_{i} in \mathcal{D} by N(a_{i}) or N_{i}, and the variance of the noise for arm a_{i} by \sigma_{i}^{2}. We denote the optimal arm by a^{*}\in\mathop{\arg\max}_{a\in\mathcal{A}}r(a) and the single policy concentrability by C^{*}=1/\mu(a^{*}), where \mu is the distribution that generated the dataset. Without loss of generality, we assume the optimal arm is unique. We also write r=(r(a_{1}),r(a_{2}),\ldots,r(a_{d}))^{\top} for the vector of mean rewards. Without loss of generality, we assume there is at least one sample for each arm (arms without samples can be removed).

3.2 Lower confidence bound algorithm

One simple but effective method for the offline MAB problem is the Lower Confidence Bound (LCB) algorithm, which is inspired by its online counterpart (UCB) (Auer et al., 2002). Like UCB, LCB computes the empirical mean \widehat{r}_{i} associated with the reward of each arm i along with its confidence half-width b_{i}. They are defined as

\widehat{r}_{i}:=\frac{1}{N(a_{i})}\sum_{k:x_{k}=a_{i}}r_{k},\qquad b_{i}:=\sqrt{\frac{2\sigma_{i}^{2}}{N(a_{i})}\log\left(\frac{2d}{\delta}\right)}. (1)

This definition ensures that, with probability at least 1-\delta, every confidence interval brackets the corresponding expected reward:

\widehat{r}_{i}-b_{i}\leq r\left(a_{i}\right)\leq\widehat{r}_{i}+b_{i}\quad\forall i\in[d]. (2)

The width of the confidence interval depends on the noise level \sigma_{i}, which can be exploited by variance-aware methods (Zhang et al., 2021; Min et al., 2021; Yin et al., 2022; Dai et al., 2022). When the true noise level is not accessible, we can replace it with the empirical standard deviation or with a high-probability upper bound. For example, when the reward for each arm is restricted to be within [0,1], a simple upper bound is \sigma_{i}^{2}\leq 1/4.

Unlike UCB, the half-width of the confidence interval for LCB is not added to, but subtracted from, the empirical mean, resulting in the lower bound l_{i}=\widehat{r}_{i}-b_{i}. The action identified by LCB is then the one that maximizes the resulting lower bound, thereby incorporating the principle of pessimism under uncertainty (Jin et al., 2020; Kumar et al., 2020). Specifically, given the dataset \mathcal{D}, LCB selects the arm using the following rule:

\widehat{a}_{\mathsf{LCB}}:=\mathop{\mathrm{argmax}}_{a_{i}\in\mathcal{A}}~l_{i}. (3)
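To make the rule concrete, the computations in (1)-(3) can be sketched in a few lines of NumPy; the input names below (rewards_by_arm, sigma2, delta) are illustrative and not part of any released codebase.

```python
import numpy as np

def lcb_select(rewards_by_arm, sigma2, delta):
    """Pick the arm maximizing the lower bound l_i = r_hat_i - b_i, as in (3)."""
    d = len(rewards_by_arm)
    counts = np.array([len(r) for r in rewards_by_arm])            # N(a_i)
    r_hat = np.array([np.mean(r) for r in rewards_by_arm])         # empirical means, Eq. (1)
    b = np.sqrt(2.0 * sigma2 / counts * np.log(2.0 * d / delta))   # half-widths, Eq. (1)
    return int(np.argmax(r_hat - b))                               # pessimistic selection, Eq. (3)
```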

Rashidinejad et al. (2021) analyzed the LCB strategy. Below we provide a modified version of their theorem.

Theorem 3.1 (LCB Performance).

Suppose the noise of arm a_{i} is sub-Gaussian with proxy variance \sigma_{i}^{2}. Let \delta\in(0,1/2). Then, we have

  1. (Comparison with any arm) With probability at least 1-\delta, for any comparator arm a_{i}\in\mathcal{A}, it holds that

     r\left(a_{i}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)\leq\sqrt{\frac{8\sigma_{i}^{2}}{N(a_{i})}\log\left(\frac{2d}{\delta}\right)}. (4)

  2. (Comparison with the optimal arm) Assume \sigma_{i}=1 for all i\in[d] and N\geq 8C^{*}\log\left(1/\delta\right). Then, with probability at least 1-2\delta, one has

     r\left(a^{*}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)\leq\sqrt{\frac{4C^{*}}{N}\log\left(\frac{2d}{\delta}\right)}. (5)

The statement of this theorem is slightly different from that in Rashidinejad et al. (2021), in the sense that they bound the expected suboptimality \mathbb{E}_{\mathcal{D}}[r\left(a^{*}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)] rather than giving a high-probability bound. Rashidinejad et al. (2021) proved the minimax optimality of the algorithm when the single policy concentrability satisfies C^{*}\geq 2 and the sample size satisfies N\geq\widetilde{O}(C^{*}).

3.3 A data-starved MAB problem and failure of LCB

In order to highlight the limitation of a strategy such as LCB, let us describe a specific data-starved MAB instance, specifically one with d=10000 arms, equally partitioned into a set of good arms (i.e., \mathcal{A}_{g}) and a set of bad arms (i.e., \mathcal{A}_{b}). Each good arm returns a reward following the uniform distribution over [0.5,1.5], while each bad arm returns a reward which follows \mathcal{N}(0,1/4).

Assume that we are given a dataset that contains only one sample per arm. Instantiating the LCB confidence interval in (2) with \sigma_{i}\leq 1/2 and \delta=0.1, one obtains

\widehat{r}_{i}-2.5\leq r(a_{i})\leq\widehat{r}_{i}+2.5.

Such a bound is uninformative, because the lower bound for the true reward mean is less than the reward value of the worst arm. The performance bound for LCB confirms this intuition, because Theorem 3.1 requires at least N(a_{i})\geq\lceil 8\log(1/0.05)\rceil=24 samples in each arm to provide any guarantee with probability at least 0.9 (here C^{*}=d).

3.4 Can stochastic policies help?

At first glance, extracting a good decision-making strategy for the problem discussed in Section 3.3 seems like a hopeless endeavor, because it is information-theoretically impossible to reliably estimate the expected payoff of any of the arms with just a single sample for each.

In order to proceed, the key idea is to enlarge the search space to contain stochastic policies.

Definition 3.2 (Stochastic Policies).

A stochastic policy over a MAB is a probability distribution over the arms, i.e., a vector w\in\mathbb{R}^{d} with w_{i}\geq 0 and \sum_{i=1}^{d}w_{i}=1.

To exemplify how stochastic policies can help, consider the behavioral cloning policy, which mimics the policy that generated the dataset for the offline MAB in Section 3.3. Such a policy is stochastic, and it plays all arms uniformly at random, thereby achieving a score around 0.5 with high probability. The value of the behavioral cloning policy can be readily estimated using the Hoeffding bound (e.g., Proposition 2.5 in Wainwright (2019)): with probability at least 1-\delta=0.9 (here d=10000 is the number of arms and \sigma=1/2 is the true standard deviation), the value of the behavioral cloning policy is greater than or equal to

\frac{1}{2}-\sqrt{\frac{2\sigma^{2}\log\left(2/\delta\right)}{d}}\approx 0.488.

This value is higher than the one guaranteed for LCB by Theorem 3.1. Intuitively, a stochastic policy that selects multiple arms can be evaluated more accurately because it averages the rewards experienced over different arms. This consideration suggests optimizing over stochastic policies.
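For reference, the figure 0.488 quoted above is just the Hoeffding bound evaluated at d=10000, \sigma=1/2, and \delta=0.1; a quick numerical check (in Python, for convenience):

```python
import numpy as np

d, sigma, delta = 10_000, 0.5, 0.1
print(0.5 - np.sqrt(2 * sigma**2 * np.log(2 / delta) / d))  # ~= 0.488
```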

By optimizing a lower bound on the performance of the stochastic policies, it should be possible to find one with a provably high return. Such an idea leads to solving an offline linear bandit problem, as follows

\max_{w\in\mathbb{R}^{d},\,w_{i}\geq 0,\,\sum_{i=1}^{d}w_{i}=1}\ \sum_{i=1}^{d}w_{i}\widehat{r}_{i}-c(w) (6)

where c(w) is a suitable confidence interval for the policy w and \widehat{r}_{i} is the empirical reward for the i-th arm defined in (1). While this approach is appealing, enlarging the search space to include all stochastic policies brings an increase in the metric entropy of the function class, and concretely, a \sqrt{d} factor (Abbasi-Yadkori et al., 2011; Rusmevichientong & Tsitsiklis, 2010; Hazan & Karnin, 2016; Jun et al., 2017; Kim et al., 2022) in the confidence intervals c(w) in (6), which negates all gains that arise from considering stochastic policies. In the next section, we propose an algorithm that bypasses the need for such a \sqrt{d} factor by relying on a more careful analysis and optimization procedure.

4 Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST)

In this section, we introduce our algorithm, called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST). At a high level, the algorithm is a policy optimization algorithm based on a trust region centered around a reference policy. The size of the trust region determines the degree of pessimism, and its optimal problem-dependent size can be determined by analyzing the supremum of a problem-dependent empirical process. In the sequel, we describe 1) the decision variables, 2) the trust region optimization program, and 3) some techniques for its practical implementation.

4.1 Decision variables

The algorithm searches over the class of stochastic policies given by the weight vector w=(w_{1},w_{2},\ldots,w_{d})^{\top} of Definition 3.2. Instead of directly optimizing over the weights of the stochastic policy, it is convenient to center w around a reference stochastic policy \widehat{\mu} which is either known to perform well or is easy to estimate. In our theory and experiments, we consider a simple setup and use the behavioral cloning policy weighted by the noise levels \{\sigma_{i}\} if they are known. Namely, we consider

\widehat{\mu}_{i}=\frac{N_{i}/\sigma_{i}^{2}}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}\quad\forall i\in[d]. (7)

When the size of the noise \sigma_{i} is constant across all arms, the policy \widehat{\mu} is the behavioral cloning policy; when \sigma_{i} differs across arms, \widehat{\mu} minimizes the variance of the empirical reward

\widehat{\mu}=\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^{d},w_{i}\geq 0,\sum_{i}w_{i}=1}\operatorname{Var}\left(w^{\top}\widehat{r}\right),

where \widehat{r}=(\widehat{r}_{1},\ldots,\widehat{r}_{d})^{\top} is defined in (1). Using this definition, we take as our decision variable the policy improvement vector

\Delta:=w-\widehat{\mu}. (8)

This preparatory step is key: it allows us to implement relative pessimism, namely pessimism on the improvement (represented by \Delta) rather than on the absolute value of the policy w. Moreover, by restricting the search space to a ball around \widehat{\mu}, one can effectively reduce the metric entropy of the policy class and obtain tighter confidence intervals.
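For concreteness, a minimal sketch of (7) and (8), assuming N and sigma2 are NumPy arrays holding the per-arm counts N_{i} and noise variances \sigma_{i}^{2} (the variable names are ours):

```python
import numpy as np

def reference_policy(N, sigma2):
    """Noise-weighted behavioral cloning policy mu_hat from Equation (7)."""
    weights = N / sigma2
    return weights / weights.sum()

# The decision variable of Equation (8) is the improvement over the reference policy:
# given a candidate stochastic policy w, Delta = w - mu_hat, and w = mu_hat + Delta.
```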

Figure 1: A simple diagram of the trust regions on a 3-dimensional simplex. The central point is the reference (stochastic) policy, while the red ellipses are trust regions around this reference policy.

4.2 Trust region optimization

Trust region.

TRUST (Algorithm 1) returns the stochastic policy \pi_{TRUST}=\widehat{\Delta}+\widehat{\mu}\in\mathbb{R}^{d}, where \widehat{\mu} is the reference policy defined in (7) and \widehat{\Delta} is the policy improvement vector. In order to accurately quantify the effect of the improvement vector \Delta, we constrain it to a trust region \mathsf{C}\left(\varepsilon\right) centered around \widehat{\mu}, where \varepsilon>0 is the radius of the trust region. More concretely, for a given radius \varepsilon>0, the trust region is defined as

\mathsf{C}\left(\varepsilon\right):=\left\{\Delta:\ \Delta_{i}+\widehat{\mu}_{i}\geq 0,\ \left\|\Delta+\widehat{\mu}\right\|_{1}=1,\ \sum_{i=1}^{d}\frac{\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}\leq\varepsilon^{2}\right\}. (9)

The trust region above serves two purposes: it ensures that the policy \widehat{\Delta}+\widehat{\mu} still represents a valid stochastic policy, and it regularizes the policy around the reference policy \widehat{\mu}. We then search for the best policy within \mathsf{C}\left(\varepsilon\right) by solving the optimization program

\widehat{\Delta}_{\varepsilon}:=\mathop{\mathrm{argmax}}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\widehat{r}. (10)

Computationally, the program (10) is a second-order cone program (Alizadeh & Goldfarb, 2003; Boyd & Vandenberghe, 2004), which can be solved efficiently with standard off-the-shelf libraries (Diamond & Boyd, 2016).
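As an illustration, program (10) can be written almost verbatim in CVXPY; the sketch below is ours (not the authors' released implementation) and assumes r_hat, mu_hat, sigma2, and N are NumPy arrays of length d.

```python
import cvxpy as cp
import numpy as np

def solve_trust_region(r_hat, mu_hat, sigma2, N, eps):
    """Solve (10): maximize Delta^T r_hat over the trust region C(eps) in (9)."""
    Delta = cp.Variable(len(r_hat))
    constraints = [
        Delta + mu_hat >= 0,          # Delta + mu_hat is a valid probability vector ...
        cp.sum(Delta) == 0,           # ... whose entries still sum to one
        cp.sum_squares(cp.multiply(np.sqrt(sigma2 / N), Delta)) <= eps**2,  # weighted ball
    ]
    problem = cp.Problem(cp.Maximize(Delta @ r_hat), constraints)
    problem.solve()
    return Delta.value
```

Since the entries of \Delta+\widehat{\mu} are constrained to be nonnegative, the \ell_{1} constraint in (9) reduces to the linear constraint \sum_{i}\Delta_{i}=0 used above.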

When \varepsilon=0, the trust region only includes the vector \Delta=0, and the reference policy \widehat{\mu} is the only feasible solution. When \varepsilon\rightarrow\infty, the search space includes all stochastic policies; in this case, the solution identified by the algorithm coincides with that of the greedy algorithm, which chooses the arm with the highest empirical return. Rather than leaving \varepsilon as a hyper-parameter, in the next section we describe a selection strategy for \varepsilon based on localized Gaussian complexities.

Critical radius.

The choice of \varepsilon is crucial to the performance of our algorithm because it balances optimization with regularization. This consideration suggests that there is an optimal choice for the radius \varepsilon which balances searching over a larger space with keeping the metric entropy of such a space under control. The optimal problem-dependent choice \widehat{\varepsilon}_{*} can be found as the solution of a certain equation involving a problem-dependent supremum of an empirical process. More concretely, let E be the feasible set of \varepsilon (e.g., E=\mathbb{R}^{+}). We define the critical radius as follows.

Definition 4.1 (Critical Radius).

The critical radius \widehat{\varepsilon}_{*} of the trust region is the solution to the program

\widehat{\varepsilon}_{*}=\mathop{\mathrm{argmax}}_{\varepsilon\in E}\left[\widehat{\Delta}_{\varepsilon}^{\top}\widehat{r}-\mathcal{G}\left(\varepsilon\right)\right]. (11)

This program involves a quantile \mathcal{G}\left(\varepsilon\right) of the localized Gaussian complexity of the stochastic policies identified by the trust region. Mathematically, this is defined as follows.

Definition 4.2 (Quantile of the supremum of Gaussian process).

We denote the noise vector as \eta=\widehat{r}-r, which by our assumption is coordinate-wise independent and satisfies \eta_{i}\sim\mathcal{N}\left(0,\sigma_{i}^{2}/N(a_{i})\right). We define \mathcal{G}\left(\varepsilon\right) as the smallest quantity such that, with probability at least 1-\delta, for any \varepsilon\in E, it holds that

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\varepsilon\right). (12)

In essence, \mathcal{G}\left(\varepsilon\right) is an upper quantile of the supremum of the Gaussian process \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta which holds uniformly for every \varepsilon\in E. We also remark that this quantity depends on the feasible set E and the trust region \mathsf{C}\left(\varepsilon\right), and hence is highly problem-dependent.

The critical radius plays a crucial role: it is the radius of the trust region that optimally balances optimization with uncertainty. Enlarging \varepsilon enlarges the search space for \Delta, enabling the discovery of policies with potentially higher return. However, this also brings an increase in the metric entropy of the policy class encoded by \mathcal{G}\left(\varepsilon\right), which means that each policy can be estimated less accurately. The critical radius represents the optimal tradeoff between these two forces. The final improvement vector that TRUST returns, which we denote as \widehat{\Delta}_{*}, is determined by solving (10) with the critical radius \widehat{\varepsilon}_{*}. In mathematical terms, we express this as

\widehat{\Delta}_{*}:=\mathop{\mathrm{argmax}}_{\Delta\in\mathsf{C}\left(\widehat{\varepsilon}_{*}\right)}\Delta^{\top}\widehat{r}, (13)

where \widehat{\varepsilon}_{*} is defined in (11).

Algorithm 1 Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST)
  Input: Offline dataset \mathcal{D}, failure probability \delta, and the candidate set E of trust region radii (in practice, chosen as (14)).
  1. For each \varepsilon\in E, compute \widehat{\Delta}_{\varepsilon} from (10).
  2. For each \varepsilon\in E, estimate \mathcal{G}\left(\varepsilon\right) via the Monte-Carlo method (see Algorithm 2 in Appendix B).
  3. Solve (11) to obtain the critical radius \widehat{\varepsilon}_{*}.
  4. Compute the optimal improvement vector in \mathsf{C}\left(\widehat{\varepsilon}_{*}\right) via (13), denoted as \widehat{\Delta}_{*}.
  5. Return the stochastic policy \pi_{TRUST}=\widehat{\mu}+\widehat{\Delta}_{*}.

Implementation details

Since it can be difficult to solve (11) over a continuum of values \varepsilon\in E=\mathbb{R}^{+}, we use a discretization argument by considering the following candidate subset:

E=\left\{\varepsilon_{0},\frac{\varepsilon_{0}}{\alpha},\ldots,\frac{\varepsilon_{0}}{\alpha^{|E|-1}}\right\}, (14)

where \alpha>1 is the decaying rate and \varepsilon_{0} is the largest possible radius, namely the maximal weighted distance from the reference policy to any vertex of the simplex. Mathematically, this is defined as

\varepsilon_{0}=\max_{i\in[d]}\sqrt{\sum_{j\neq i}\frac{\widehat{\mu}_{j}^{2}\sigma_{j}^{2}}{N_{j}}+\frac{\left(1-\widehat{\mu}_{i}\right)^{2}\sigma_{i}^{2}}{N_{i}}}.

Our analysis leading to Theorem 5.1 takes this discretization into account.

In line 2 of Algorithm 1, the algorithm estimates the quantile \mathcal{G}\left(\varepsilon\right) of the supremum of the localized Gaussian process in Definition 4.2, and then chooses the \varepsilon that maximizes the objective function in (11). Although \mathcal{G}\left(\varepsilon\right) can be upper bounded analytically, in our experiments we aim to obtain tighter guarantees and so we estimate it via Monte-Carlo. This can be achieved by 1) sampling independent noise vectors \eta, 2) solving \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta, and 3) estimating the quantile via order statistics. More details can be found in Appendix B.
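As a hedged sketch of this Monte-Carlo step (the exact Algorithm 2 in Appendix B may differ in its details), one can reuse the cone-program solver from the earlier sketch with a sampled noise vector in place of the empirical rewards:

```python
import numpy as np

def estimate_G(eps, mu_hat, sigma2, N, delta, n_mc=200, rng=None):
    """Monte-Carlo estimate of the quantile G(eps) from Definition 4.2."""
    rng = np.random.default_rng() if rng is None else rng
    sups = []
    for _ in range(n_mc):
        eta = rng.normal(0.0, np.sqrt(sigma2 / N))               # eta_i ~ N(0, sigma_i^2 / N_i)
        Delta = solve_trust_region(eta, mu_hat, sigma2, N, eps)  # maximizes Delta^T eta over C(eps)
        sups.append(float(Delta @ eta))
    return float(np.quantile(sups, 1.0 - delta))                 # empirical (1 - delta)-quantile
```

To make the estimate hold uniformly over the grid E, one would in practice split the failure probability across the |E| candidate radii.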

In summary, our practical algorithm can be seen as solving the optimization problem

(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*})=\mathop{\arg\max}_{\varepsilon\in E,\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\Big\{\Delta^{\top}\widehat{r}-\widehat{\mathcal{G}}(\varepsilon)\Big\}

where \widehat{r}\in\mathbb{R}^{d} is the empirical reward vector with \widehat{r}_{i} defined in (1). Here, \widehat{\mathcal{G}}(\varepsilon) is computed according to the Monte-Carlo method defined in Algorithm 2 in Appendix B and E is the candidate set of radii defined in (14). The objective thus balances the empirical reward of a stochastic policy against the localized metric entropy that it induces.
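Putting the pieces together, the practical algorithm is the short loop below, a sketch built on the two previous snippets (solve_trust_region and estimate_G), with E the geometric grid of radii from (14).

```python
import numpy as np

def trust(r_hat, mu_hat, sigma2, N, E, delta):
    """Sketch of Algorithm 1: return the stochastic policy mu_hat + Delta_star."""
    best_obj, best_Delta = -np.inf, np.zeros_like(r_hat)
    for eps in E:
        Delta_eps = solve_trust_region(r_hat, mu_hat, sigma2, N, eps)  # step 1, Eq. (10)
        G_eps = estimate_G(eps, mu_hat, sigma2, N, delta / len(E))     # step 2, union bound over E
        obj = float(Delta_eps @ r_hat) - G_eps                         # objective in (11)
        if obj > best_obj:
            best_obj, best_Delta = obj, Delta_eps                      # steps 3 and 4
    return mu_hat + best_Delta                                         # step 5: pi_TRUST
```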

5 Theoretical guarantees

In this section, we provide some theoretical guarantees for the policy \pi_{TRUST} returned by TRUST.

5.1 Problem-dependent analysis

We present 1) an improvement over the reference policy \widehat{\mu}, 2) a sub-optimality gap with respect to any comparator policy \pi, and 3) an actionable lower bound on the performance of the output policy.

Given a stochastic policy \pi, we let V^{\pi}=\mathbb{E}_{a\sim\pi}[r(a)] denote its value function. Furthermore, we denote a comparator policy \pi by a triple (\varepsilon,\Delta,\pi) such that \varepsilon>0, \Delta\in\mathsf{C}\left(\varepsilon\right), and \pi=\widehat{\mu}+\Delta.

Theorem 5.1 (Main theorem).

TRUST has the following properties.

  1. With probability at least 1-\delta, the improvement over the behavioral policy is at least

     V^{\pi_{TRUST}}-V^{\widehat{\mu}}\geq\sup_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\left[\Delta^{\top}r-2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)\right], (15)

     where \lceil\varepsilon\rceil=\inf\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}\geq\varepsilon\}.

  2. With probability at least 1-\delta, for any stochastic comparator policy (\varepsilon,\Delta,\pi), the sub-optimality of the output policy can be upper bounded as

     V^{\pi}-V^{\pi_{TRUST}}\leq 2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). (16)

  3. With probability at least 1-2\delta, the data-dependent lower bound on V^{\pi_{TRUST}} satisfies

     V^{\pi_{TRUST}}\geq\pi_{TRUST}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}}, (17)

     where \pi_{TRUST}=\widehat{\mu}+\widehat{\Delta}_{*} is the policy output by Algorithm 1.

Our guarantees are problem-dependent through the Gaussian process quantile \mathcal{G}\left(\cdot\right); in Section 6 we show how they can be instantiated on an actual problem, highlighting the tightness of the analysis.

Equation 15 highlights the improvement with respect to the behavioral policy. It is expressed as a trade-off between maximizing the improvement \Delta^{\top}r and minimizing its uncertainty \mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). The presence of the \sup_{\varepsilon} indicates that TRUST achieves an optimal balance between these two factors. The state-of-the-art guarantees that we are aware of highlight a trade-off between value and variance (Jin et al., 2021; Min et al., 2021). The novelty of our result lies in the fact that TRUST optimally balances the uncertainty implicitly as a function of the ‘coverage’ as well as the metric entropy of the search space. That is, TRUST selects the most appropriate search space by trading off its metric entropy with the quality of the policies that it contains.

The right-hand side in Equation 17 gives actionable statistical guarantees on the quality of the final policy and it can be fully computed from the available dataset; we give an example of the tightness of the analysis in Section 6.

Localized Gaussian complexity \mathcal{G}\left(\varepsilon\right).

In Theorem 5.1, we upper bound the suboptimality V^{\pi}-V^{\pi_{TRUST}} via the localized quantity \mathcal{G}\left(\cdot\right). It is the quantile of the supremum of a Gaussian process, which can be efficiently estimated via a Monte Carlo method (e.g., see Appendix B) or bounded through concentration around its expectation. The expected value of \mathcal{G}\left(\varepsilon\right) is a localized Gaussian width, a concept well-established in statistical learning theory (Bellec, 2019; Wei et al., 2020; Wainwright, 2019). More concretely, this is the localized Gaussian width for an affine simplex:

\mathbb{E}\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right]=\mathbb{E}\left[\sup_{\mathbb{S}^{d-1}\cap\left\{\Delta:\left\|\Delta\right\|_{\Sigma}\leq\varepsilon\right\}}\Delta^{\top}\eta\right],

where \mathbb{S}^{d-1} denotes the simplex in \mathbb{R}^{d} and \Sigma:=\operatorname{diag}\left(\frac{\sigma_{1}^{2}}{N_{1}},\frac{\sigma_{2}^{2}}{N_{2}},\ldots,\frac{\sigma_{d}^{2}}{N_{d}}\right) is the weighting matrix. Moreover, this localized Gaussian width can be upper bounded as

\mathbb{E}\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right]\lesssim\min\left\{\sqrt{\log\left(d\varepsilon^{2}\right)},\varepsilon\sqrt{d}\right\}. (18)
Figure 2: The upper bound for the localized Gaussian width over a shifted simplex in d=10000 dimensions. The shifted simplex is \left\{\Delta\in\mathbb{R}^{d}:\sum_{i=1}^{d}\Delta_{i}=0\right\}. The two-stage upper bound we plot is based on Theorem 1 in (Bellec, 2019).

To make this clearer, we plot this upper bound for the localized Gaussian width in Figure 2. The rate in (18) matches the minimax lower bound up to a universal constant (Gordon et al., 2007; Lecué & Mendelson, 2013; Bellec, 2019). To see the implication of the upper bound (18), let us consider a simple example where the logging policy is uniform over all arms. We denote the optimal arm as a^{*} and define

C^{*}:=\frac{1}{\mu(a^{*})}

as the concentrability coefficient. By applying (18) and some concentration techniques (see Wainwright, 2019), we can perform a fine-grained analysis of the suboptimality incurred by \pi_{TRUST}. Specifically, with probability at least 1-\delta, one has

V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}. (19)

Note that the high-probability upper bound here is minimax optimal up to constant and logarithmic factors (Rashidinejad et al., 2021) when C^{*}\geq 2. Moreover, this example of a uniform logging policy is an instance where LCB achieves minimax sub-optimality (up to constant and log factors) (see the proof of Theorem 2 in Rashidinejad et al., 2021). In this case, TRUST achieves the same level of guarantees for the suboptimality of the output policy. We also empirically demonstrate the effectiveness of TRUST in Section 6. The full theorem with the fine-grained analysis of the suboptimality and its proof are deferred to Appendix C.

5.2 Proof of Theorem 5.1

To prove Theorem 5.1, we first define the following event:

\mathcal{E}:=\left\{\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\varepsilon\right)\quad\forall\varepsilon\in E\right\}. (20)

When \mathcal{E} happens, the quantity \mathcal{G}\left(\varepsilon\right) upper bounds the supremum of the Gaussian process we care about, and hence we can effectively upper bound the uncertainty of any stochastic policy using \mathcal{G}\left(\cdot\right). It follows from Definition 4.2 that the event \mathcal{E} happens with probability at least 1-\delta.

We can now prove all the claims in the theorem, starting from the first and the second. A comparator policy \pi identifies a weight vector w, an improvement vector \Delta, and a radius \varepsilon such that w=\widehat{\mu}+\Delta and \Delta\in\mathsf{C}\left(\varepsilon\right). In fact, we can always take \varepsilon to be the minimal value such that \Delta\in\mathsf{C}\left(\varepsilon\right). The first two claims can be proved by establishing that, with probability at least 1-\delta,

w^{\top}r-\pi^{\top}_{TRUST}r=\Delta^{\top}r-\widehat{\Delta}_{*}^{\top}r\leq 2\mathcal{G}\left(\lceil\varepsilon\rceil\right), (21)

where \pi_{TRUST} is the policy weight returned by Algorithm 1. In order to show Equation 21, we can decompose \widehat{\Delta}_{*}^{\top}r using the fact that \widehat{\varepsilon}_{*}\in E and \widehat{\Delta}_{*}\in\mathsf{C}\left(\widehat{\varepsilon}_{*}\right) to obtain

\widehat{\Delta}_{*}^{\top}r=\widehat{\Delta}_{*}^{\top}\widehat{r}-\widehat{\Delta}_{*}^{\top}\eta\geq\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)=\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\left\lceil\widehat{\varepsilon}_{*}\right\rceil\right). (22)

To further lower bound the RHS above, we have the following lemma, which shows that Algorithm 1 can be written in an equivalent way.

Lemma 5.2.

The output of Algorithm 1 satisfies

\left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right)=\mathop{\mathrm{argmax}}_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\big[\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big]. (23)

This shows that Algorithm 1 optimizes an objective function which consists of a signal term (i.e., \Delta^{\top}\widehat{r}) minus a noise term (i.e., \mathcal{G}\left(\lceil\varepsilon\rceil\right)). Applying this lemma to (22), we obtain

\widehat{\Delta}_{*}^{\top}r\geq\Delta^{\top}\widehat{r}-\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)=\Delta^{\top}r+\Delta^{\top}\eta-\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). (24)

After recalling that under \mathcal{E}

\Delta^{\top}\eta\leq\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\sup_{\Delta\in\mathsf{C}\left(\lceil\varepsilon\rceil\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\lceil\varepsilon\rceil\right), (25)

plugging (25) back into (24) concludes the bound in Equation 21, which proves our second claim (Equation 16). Rearranging the terms in Equation 21 and taking the supremum over all comparator policies, we obtain

\widehat{\Delta}_{*}^{\top}r\geq\sup_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\left[\Delta^{\top}r-2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)\right], (26)

which proves the first claim, since V^{\pi_{TRUST}}-V^{\widehat{\mu}}=\widehat{\Delta}_{*}^{\top}r.

In order to prove the last claim, it suffices to lower bound the value of the reference policy \widehat{\mu}. From (7), we have \widehat{\mu}^{\top}\left(\widehat{r}-r\right)\sim\mathcal{N}\left(0,1/\left[\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}\right]\right), which implies that with probability at least 1-\delta,

\widehat{\mu}^{\top}\left(\widehat{r}-r\right)\leq\sqrt{\frac{2\log(1/\delta)}{\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}}} (27)

by the standard Hoeffding inequality (e.g., Prop 2.5 in Wainwright (2019)). Combining (22) and (27), we obtain

\pi_{TRUST}^{\top}r=\widehat{\mu}^{\top}r+\widehat{\Delta}_{*}^{\top}r\geq\widehat{\mu}^{\top}\widehat{r}+\widehat{\mu}^{\top}(r-\widehat{r})+\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)\geq\pi_{TRUST}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}}},

where the first inequality follows from (22) and the second from (27), with probability at least 1-2\delta. This concludes the proof of the third claim.

Augmentation with LCB

Compared to classical LCB, Algorithm 1 considers a much larger search space, which encompasses not only the vertices of the simplex but also its interior points. This enlargement of the search space offers great advantages, but it also comes at the price of larger uncertainty, especially when the radius \varepsilon is large. LCB accounts for uncertainty by uniformly upper bounding the noise at each vertex, whereas in our case a uniform upper bound over a sub-region of the shifted simplex must be considered. When \varepsilon is large, the trust region method induces larger uncertainty and tends to select a more stochastic policy than LCB, and hence can achieve worse performance. To determine the most effective final policy, one can always combine TRUST (Algorithm 1) with LCB and select the better of the two based on the lower bounds they induce. By comparing the lower bounds of LCB and TRUST, the value of the final output policy is guaranteed to match the larger of the two lower bounds with high probability. We defer the detailed algorithm and its theoretical guarantees to Appendix E.

6 Experiments

We present simulated experiments where we show the failure of LCB and the strong performance of TRUST. Moreover, we also present an application of TRUST to offline reinforcement learning.

6.1 Simulated experiments

A data-starved MAB

We consider a data-starved MAB problem with d=10000 arms denoted by a_{i}, i\in[d]. The reward distributions are

r(a_{i})\sim\begin{cases}\mathsf{Uniform}(0.5,1.5)&i\leq 5000,\\ \mathcal{N}\left(0,1/4\right)&i>5000.\end{cases} (28)

Namely, the good arms return rewards drawn from a uniform distribution over [0.5,1.5] with unit mean, while the bad arms return Gaussian rewards with zero mean. We consider a dataset that contains a single sample for each of these arms.
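A short sketch of this data-generating process (28), with one observed reward per arm and variable names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
good = rng.uniform(0.5, 1.5, size=d // 2)   # good arms: Uniform(0.5, 1.5), mean 1
bad = rng.normal(0.0, 0.5, size=d // 2)     # bad arms: N(0, 1/4), mean 0
r_hat = np.concatenate([good, bad])         # one observed reward per arm
N = np.ones(d)                              # a single pull of every arm
sigma2 = np.full(d, 0.25)                   # known noise level sigma_i = 1/2
```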

We test Algorithm 1 on this MAB instance with a fixed noise level \sigma_{i}=1/2. We set the reference policy \widehat{\mu} to be the behavioral cloning policy, which coincides with the uniform policy. We also test LCB and the greedy method, which simply chooses the arm with the highest empirical reward.

In this example, the greedy algorithm fails because it erroneously selects an arm with a reward >1.5, but such a reward can only originate from a normal distribution with mean zero. Although LCB incorporates the principle of pessimism under uncertainty, it selects an arm with average return equal to zero; its performance lower bound given by the confidence intervals is -1.5, which is essentially vacuous. The behavioral cloning policy performs better, because it selects an arm uniformly at random, achieving a score of 0.5.

Behavior Policy | Greedy Method | LCB | LCB Lower Bound
0.5 | 0 | 0 | -1.5
Max Reward | Policy Improvement by TRUST | TRUST | TRUST Lower Bound
1.0 | 0.42 | 0.92 | 0.6
Table 1: Results of the simulated experiment on a 10000-arm bandit. The reward distribution is described in (28). The offline dataset includes one sample for each arm. The greedy method chooses the arm with the highest empirical reward. LCB selects an arm based on (3). The lower bounds for LCB and TRUST follow (2) and (17), respectively.

Algorithm 1 achieves the best performance: the value of the policy that it identifies is 0.92, which almost matches the optimal policy. The lower bound on its performance, computed by instantiating the RHS of (17), is around 0.6, a guarantee much tighter than that for LCB.

In order to gain intuition on the learning mechanics of TRUST, in Figure 3 we progressively enlarge the radius of the trust region from zero to the largest possible radius (on the x axis) and plot the value of the policy that maximizes the linear objective \Delta^{\top}\widehat{r},\;\Delta\in\mathsf{C}\left(\varepsilon\right) for each value of the radius \varepsilon. Note that we rescale the range of \varepsilon so that the largest possible \varepsilon equals one. In the same figure we also plot the lower bound computed with the help of equation (17).

Figure 3: Policy values and their lower bounds for a data-starved MAB instance with 10000 arms whose reward distribution is described in (28).

Initially, the value of the policy increases because the optimization in (10) is performed over a larger set of stochastic policies. However, when \varepsilon approaches one, all stochastic policies are included in the optimization program. In this case, TRUST greedily selects the arm with the highest empirical reward, which is from a normal distribution with mean zero. The optimal balance between the size of the policy search space and its metric entropy is given by the critical radius \varepsilon=0.0116\varepsilon_{0}, which is the point where the lower bound is the highest.

A more general data-starved MAB

Besides the data-starved MAB constructed above, we also show that on more general MABs the performance of TRUST is on par with LCB, while TRUST enjoys a much tighter statistical guarantee, i.e., a larger lower bound on the value of the returned policy. We ran experiments on a d=1000-arm MAB where the reward distribution is

r(a_{i})\sim\mathcal{N}\left(\frac{i}{1000},\frac{1}{4}\right),\quad\forall i\in[d]. (29)

We ran TRUST (Algorithm 1) and LCB over 8 different random seeds. With a single sample for each arm, TRUST obtains a score similar to LCB. However, TRUST gives a much tighter statistical guarantee than LCB, in the sense that the lower bound output by TRUST is much higher than that output by LCB, so TRUST returns a policy that is guaranteed to achieve a higher value. Moreover, we found the policies output by TRUST to be much more stable than those from LCB. Across all runs, while the lowest value of the arm chosen by LCB is around 0.24, all policies returned by TRUST have values above 0.65 with a much smaller variance, as shown in Table 2.

 | LCB | TRUST
mean reward | 0.718 | 0.725
mean lower bound | 0.156 | 0.544
variance | 0.265 | 0.038
minimal reward | 0.239 | 0.658
Table 2: Comparison between LCB and TRUST (Algorithm 1) on a data-starved MAB with 1000 arms whose reward distribution follows (29). Both methods are repeated on 8 random seeds.

6.2 Offline reinforcement learning

In this section, we apply Algorithm 1 to the offline reinforcement learning (RL) setting under the assumption that the logging policies which generated the dataset are accessible. To be clear, our goal is not to exceed the performance of state-of-the-art deep RL algorithms—our algorithm is designed for bandit problems—but rather to illustrate the usefulness of our algorithm and theory.

Since our algorithm is designed for bandit problems, in order to apply it to the sequential setting, we map MDPs to MABs. Each policy in the MDP maps to an action in the MAB, and each trajectory return in the MDP maps to an experienced return in the MAB setting. Notice that this reduction disregards the sequential aspect of the problem and thus our algorithm cannot perform ‘trajectory stitching’ (Levine et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021). Furthermore, it can only be applied under the assumption that the logging policies are known.
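A minimal sketch of this reduction, under the assumption that the logged trajectories are grouped by the logging policy that produced them (the input format returns_by_policy is illustrative, not a D4RL field):

```python
import numpy as np

def mdp_to_mab(returns_by_policy):
    """Map logged returns, grouped per logging policy, to a MAB dataset (one arm per policy)."""
    r_hat = np.array([np.mean(rets) for rets in returns_by_policy])  # empirical return per policy
    N = np.array([len(rets) for rets in returns_by_policy])          # trajectories per policy
    sigma2 = np.array([np.var(rets) if len(rets) > 1 else 1.0        # crude variance proxy when
                       for rets in returns_by_policy])               # only one trajectory exists
    return r_hat, N, sigma2
```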

Specifically, we consider a setting where there are multiple known logging policies, each generating a few trajectories. We test Algorithm 1 on some selected environments from the D4RL dataset (Fu et al., 2020) and compare its performance to the Conservative Q-Learning (CQL) algorithm (Kumar et al., 2020), a popular and strong baseline for offline RL.

Since the D4RL dataset does not directly include the logging policies, we generate new datasets by running Soft Actor Critic (SAC) (Haarnoja et al., 2018) for 1000 episodes. We store 100 intermediate policies generated by SAC, and roll out 1 trajectory from each policy.

We use the default hyper-parameters for CQL (we use the codebase and default hyper-parameters from https://github.com/young-geng/CQL). We report the unnormalized scores in Table 3, each averaged over 4 random seeds. Algorithm 1 achieves a score on par with or higher than that of CQL, especially when the offline dataset is of poor quality and when there are very few trajectories—or just one—generated from each logging policy. Notice that while CQL is not guaranteed to outperform the behavioral policy, TRUST is backed by Theorem 5.1. Moreover, while the performance of CQL relies heavily on the choice of hyper-parameters, TRUST is essentially hyper-parameter free.

Environment | Dataset | CQL | TRUST
Hopper | 1-traj-low | 499 | 999
Hopper | 1-traj-high | 2606 | 3437
Ant | 1-traj-low | 748 | 763
Ant | 1-traj-high | 4115 | 4488
Walker2d | 1-traj-low | 311 | 346
Walker2d | 1-traj-high | 4093 | 4097
HalfCheetah | 1-traj-low | 5775 | 5473
HalfCheetah | 1-traj-high | 9067 | 10380
Table 3: Unnormalized scores of CQL and TRUST on 4 environments from D4RL. In the 1-traj-low case, we take the first 100 policies from the SAC run. In the 1-traj-high case, we take the (10x+1)-th policy for x\in[100]. We sample one trajectory from each selected policy in all experiments.

Additionally, while CQL took around 16-24 hours on one NVIDIA GeForce RTX 2080 Ti, TRUST only took 0.5-1 hours on 10 CPUs. The experimental details are included in Appendix F.

7 Conclusion

In this paper we make a substantial contribution towards sample efficient decision making, by designing a data-efficient policy optimization algorithm that leverages offline data for the MAB setting. The key intuition of this work is to search over stochastic policies, which can be estimated more easily than deterministic ones.

The design of our algorithm is enabled by a number of key insights, such as the use of the localized Gaussian complexity, which leads to the definition of the critical radius for the trust region.

We believe that these concepts can be used more broadly to help design truly sample efficient algorithms, which can in turn enable the application of decision making to new settings where a high sample efficiency is critical.

8 Impact Statement

This paper presents a work whose goal is to advance the field of decision making under uncertainty. Since our work is primarily theoretical, we do not anticipate negative societal consequences.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • Alizadeh & Goldfarb (2003) Alizadeh, F. and Goldfarb, D. Second-order cone programming. Mathematical programming, 95(1):3–51, 2003.
  • Ameko et al. (2020) Ameko, M. K., Beltzer, M. L., Cai, L., Boukhechba, M., Teachman, B. A., and Barnes, L. E. Offline contextual multi-armed bandits for mobile health interventions: A case study on emotion regulation. In Proceedings of the 14th ACM Conference on Recommender Systems, pp.  249–258, 2020.
  • Audibert et al. (2009) Audibert, J.-Y., Bubeck, S., et al. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pp.  1–122, 2009.
  • Audibert et al. (2010) Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, pp.  41–53, 2010.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Bartlett & Mendelson (2002) Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
  • Bassen et al. (2020) Bassen, J., Balaji, B., Schaarschmidt, M., Thille, C., Painter, J., Zimmaro, D., Games, A., Fast, E., and Mitchell, J. C. Reinforcement learning for the adaptive scheduling of educational activities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp.  1–12, 2020.
  • Bellec (2019) Bellec, P. C. Localized gaussian width of m-convex hulls with applications to lasso and convex aggregation. 2019.
  • Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chen et al. (2016) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.
  • Cheng et al. (2022) Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. Adversarially trained actor critic for offline reinforcement learning. arXiv preprint arXiv:2202.02446, 2022.
  • Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015.
  • Dai et al. (2022) Dai, Y., Wang, R., and Du, S. S. Variance-aware sparse linear bandits. arXiv preprint arXiv:2205.13450, 2022.
  • Degenne & Perchet (2016) Degenne, R. and Perchet, V. Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, pp. 1587–1595. PMLR, 2016.
  • Diamond & Boyd (2016) Diamond, S. and Boyd, S. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
  • Duan & Wainwright (2022) Duan, Y. and Wainwright, M. J. Policy evaluation from a single path: Multi-step methods, mixing and mis-specification. arXiv preprint arXiv:2211.03899, 2022.
  • Duan & Wainwright (2023) Duan, Y. and Wainwright, M. J. A finite-sample analysis of multi-step temporal difference estimates. In Learning for Dynamics and Control Conference, pp. 612–624. PMLR, 2023.
  • Duan et al. (2021) Duan, Y., Wang, M., and Wainwright, M. J. Optimal policy evaluation using kernel-based temporal difference methods. arXiv preprint arXiv:2109.12002, 2021.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto et al. (2019) Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062. PMLR, 2019.
  • Garivier & Kaufmann (2016) Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.  998–1027. PMLR, 2016.
  • Geer (2000) Geer, S. A. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
  • Gordon et al. (2007) Gordon, Y., Litvak, A. E., Mendelson, S., and Pajor, A. Gaussian averages of interpolated bodies and applications to approximate reconstruction. Journal of Approximation Theory, 149(1):59–73, 2007.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Hao et al. (2021) Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvári, C., and Wang, M. Bootstrapping statistical inference for off-policy evaluation. arXiv preprint arXiv:2102.03607, 2021.
  • Hazan & Karnin (2016) Hazan, E. and Karnin, Z. Volumetric spanners: an efficient exploration basis for learning. Journal of Machine Learning Research, 2016.
  • Hester & Stone (2013) Hester, T. and Stone, P. Texplore: real-time sample-efficient reinforcement learning for robots. Machine learning, 90:385–429, 2013.
  • Jin et al. (2020) Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline rl? arXiv preprint arXiv:2012.15085, 2020.
  • Jin et al. (2021) Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021.
  • Jun et al. (2017) Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30, 2017.
  • Kim et al. (2022) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. Advances in Neural Information Processing Systems, 35:1060–1072, 2022.
  • Koltchinskii (2001) Koltchinskii, V. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  • Koltchinskii (2006) Koltchinskii, V. Local rademacher complexities and oracle inequalities in risk minimization. 2006.
  • Kostrikov et al. (2021) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
  • Kveton et al. (2015) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp.  535–543. PMLR, 2015.
  • Lai (1987) Lai, T. L. Adaptive treatment allocation and the multi-armed bandit problem. The annals of statistics, pp.  1091–1114, 1987.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Langford & Zhang (2007) Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in neural information processing systems, 20, 2007.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
  • Lecué & Mendelson (2013) Lecué, G. and Mendelson, S. Learning subgaussian classes: Upper and minimax bounds. arXiv preprint arXiv:1305.4825, 2013.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2022) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275, 2022.
  • Li et al. (2023) Li, Z., Yang, Z., and Wang, M. Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438, 2023.
  • Liu et al. (2021) Liu, R., Nageotte, F., Zanne, P., de Mathelin, M., and Dresp-Langley, B. Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review. Robotics, 10(1):22, 2021.
  • Liu & Ročková (2023) Liu, Y. and Ročková, V. Variable selection via thompson sampling. Journal of the American Statistical Association, 118(541):287–304, 2023.
  • Liu et al. (2017) Liu, Y., Logan, B., Liu, N., Xu, Z., Tang, J., and Wang, Y. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE international conference on healthcare informatics (ICHI), pp.  380–385. IEEE, 2017.
  • Ménard et al. (2021) Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.
  • Min et al. (2021) Min, Y., Wang, T., Zhou, D., and Gu, Q. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34:7598–7610, 2021.
  • Mou et al. (2022) Mou, W., Wainwright, M. J., and Bartlett, P. L. Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency. arXiv preprint arXiv:2209.13075, 2022.
  • Nachum et al. (2019) Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • Nakkiran et al. (2020) Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap framework: Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127, 2020.
  • Peng et al. (2019) Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Rashidinejad et al. (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.
  • Ruan et al. (2023) Ruan, S., Nie, A., Steenbergen, W., He, J., Zhang, J., Guo, M., Liu, Y., Nguyen, K. D., Wang, C. Y., Ying, R., et al. Reinforcement learning tutor better supported lower performers in a math task. arXiv preprint arXiv:2304.04933, 2023.
  • Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Russo (2016) Russo, D. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pp.  1417–1418. PMLR, 2016.
  • Si et al. (2020) Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pp. 8884–8894. PMLR, 2020.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.
  • Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388. PMLR, 2015.
  • Vershynin (2020) Vershynin, R. High-dimensional probability. University of California, Irvine, 2020.
  • Wainwright (2019) Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2022) Wang, K., Zhao, H., Luo, X., Ren, K., Zhang, W., and Li, D. Bootstrapped transformer for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:34748–34761, 2022.
  • Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114–5122. PMLR, 2018.
  • Wei et al. (2020) Wei, Y., Fang, B., and Wainwright, M. J. From gauss to kolmogorov: Localized measures of complexity for ellipses. 2020.
  • Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xie & Jiang (2020) Xie, T. and Jiang, N. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020.
  • Xie et al. (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021.
  • Xiong et al. (2022) Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L., and Zhang, T. Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512, 2022.
  • Yin & Wang (2021) Yin, M. and Wang, Y.-X. Towards instance-optimal offline reinforcement learning with pessimism. arXiv preprint arXiv:2110.08695, 2021.
  • Yin et al. (2022) Yin, M., Duan, Y., Wang, M., and Wang, Y.-X. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
  • Zanette et al. (2019) Zanette, A., Brunskill, E., and Kochenderfer, M. J. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, 2019.
  • Zanette et al. (2020) Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. Provably efficient reward-agnostic navigation with linear value iteration. In Advances in Neural Information Processing Systems, 2020.
  • Zanette et al. (2021) Zanette, A., Wainwright, M. J., and Brunskill, E. Provable benefits of actor-critic methods for offline reinforcement learning. arXiv preprint arXiv:2108.08812, 2021.
  • Zhang et al. (2022) Zhang, R., Zhang, X., Ni, C., and Wang, M. Off-policy fitted q-evaluation with differentiable function approximators: Z-estimation and inference theory. In International Conference on Machine Learning, pp. 26713–26749. PMLR, 2022.
  • Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Improved variance-aware confidence sets for linear bandits and linear mixture mdp. Advances in Neural Information Processing Systems, 34:4342–4355, 2021.

Appendix A One-sample case with strong signals

In this section, we give a simple example of the one-sample-per-arm case. It can be viewed as a special case of a data-starved MAB to which Theorem 5.1 applies and yields a non-trivial guarantee. Specifically, consider an MAB with 2d arms. Assume the true mean reward vector is r=(1,1,...,1,0,0,...,0)^{\top} and the noise vector is \eta\sim\mathcal{N}(0,\sigma^{2}I_{2d}). That is, the first d arms have rewards independently sampled from \mathcal{N}(1,\sigma^{2}) and the rewards for the other d arms are independently sampled from \mathcal{N}(0,\sigma^{2}). The stochastic reference policy \widehat{\mu} is set to the uniform policy.

We apply Algorithm 1 to this MAB instance. In the next proposition, we show that for a specific choice of \varepsilon, the optimal improvement within \mathsf{C}\left(\varepsilon\right) (denoted by \widehat{\Delta}_{\varepsilon} in (10)) achieves a constant improvement in reward.

Proposition A.1.

Assume r=(1,1,...,1,0,0,...,0)^{\top} and noise \eta\sim\mathcal{N}(0,\sigma^{2}I_{2d}). For any 0\leq\varepsilon\leq\frac{1}{\sqrt{d}}, with probability at least 1-\delta, the improvement of the policy value is lower bounded by

\widehat{\Delta}_{\varepsilon}^{\top}r\geq\varepsilon\sqrt{d}\left[\frac{1}{2}-\sigma\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right)\right],

where the improvement vector \widehat{\Delta}_{\varepsilon} in \mathsf{C}\left(\varepsilon\right) is defined in (10). Therefore, for \varepsilon=\frac{1}{\sqrt{d}} and d\geq 8\log\left(2/\delta\right), with probability at least 1-\delta, we obtain a constant policy improvement

\widehat{\Delta}_{\varepsilon}^{\top}r\geq\frac{1}{2}-2\sigma.
Proof.

We define the optimal improvement vector as

\Delta^{*}_{\varepsilon}:=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r.

Then, from the definition of \widehat{\Delta}_{\varepsilon}, we have

\widehat{\Delta}_{\varepsilon}^{\top}r=\widehat{\Delta}_{\varepsilon}^{\top}\widehat{r}-\widehat{\Delta}_{\varepsilon}^{\top}\eta\geq\left(\Delta^{*}_{\varepsilon}\right)^{\top}\widehat{r}-\widehat{\Delta}_{\varepsilon}^{\top}\eta=\left(\Delta^{*}_{\varepsilon}\right)^{\top}r+\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta-\widehat{\Delta}_{\varepsilon}^{\top}\eta\geq\underbrace{\left(\Delta^{*}_{\varepsilon}\right)^{\top}r}_{\text{signal}}-\underbrace{\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\right]}_{\text{noise}}. (30)

In order to lower bound the policy value improvement, it suffices to lower bound the signal part and upper bound the noise part. We denote by \mathcal{H}=\{x\in\mathbb{R}^{d}:\sum_{i=1}^{d}x_{i}=0\} a hyperplane in \mathbb{R}^{d}. To deal with the signal part, it suffices to notice that

\mathsf{C}\left(\varepsilon\right)\subset\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon).

We denote by r_{\parallel} the orthogonal projection of r onto \mathcal{H} and r_{\perp}=r-r_{\parallel}. In the strong signal case, we have

r_{\parallel}=\left(\frac{1}{2},\frac{1}{2},...,\frac{1}{2},-\frac{1}{2},-\frac{1}{2},...,-\frac{1}{2}\right)^{\top},\quad r_{\perp}=\left(\frac{1}{2},\frac{1}{2},...,\frac{1}{2},\frac{1}{2},\frac{1}{2},...,\frac{1}{2}\right)^{\top}.

Then, the signal part satisfies

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r_{\parallel}\leq\sup_{\Delta\in\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon)}\Delta^{\top}r_{\parallel}=\left(\varepsilon\cdot\frac{r_{\parallel}}{\left\|r_{\parallel}\right\|_{2}}\right)^{\top}r_{\parallel}=\varepsilon\left\|r_{\parallel}\right\|_{2}=\frac{\varepsilon\sqrt{d}}{2}. (31)

On the other hand, we notice that when \varepsilon\leq\frac{1}{\sqrt{d}},

\varepsilon\cdot\frac{r_{\parallel}}{\left\|r_{\parallel}\right\|_{2}}=\left(\frac{\varepsilon}{\sqrt{d}},\frac{\varepsilon}{\sqrt{d}},...,\frac{\varepsilon}{\sqrt{d}},-\frac{\varepsilon}{\sqrt{d}},-\frac{\varepsilon}{\sqrt{d}},...,-\frac{\varepsilon}{\sqrt{d}}\right)^{\top}\in\mathsf{C}\left(\varepsilon\right).

Hence the inequality in (31) is in fact an equality, which implies

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r=\frac{\varepsilon\sqrt{d}}{2}. (32)

For the noise part, we decompose the noise as \eta=\eta_{\perp}+\eta_{\parallel}, where \eta_{\parallel} is the orthogonal projection of \eta onto \mathcal{H}. Then, from \mathsf{C}\left(\varepsilon\right)\subset\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon), one has

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\left(\eta_{\parallel}+\eta_{\perp}\right)=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{\parallel}\leq\sup_{\Delta\in\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon)}\Delta^{\top}\eta_{\parallel}=\left(\varepsilon\cdot\frac{\eta_{\parallel}}{\left\|\eta_{\parallel}\right\|_{2}}\right)^{\top}\eta_{\parallel}=\varepsilon\left\|\eta_{\parallel}\right\|_{2}\leq\varepsilon\left\|\eta\right\|_{2}.

This implies \widehat{\Delta}_{\varepsilon}^{\top}r\geq\left(\Delta^{*}_{\varepsilon}\right)^{\top}r-\left[\varepsilon\left\|\eta\right\|_{2}-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\right]. By assumption, \frac{1}{\sigma^{2}}\left\|\eta\right\|_{2}^{2} is a chi-square random variable with d degrees of freedom, so from Example 2.11 in Wainwright (2019), with probability at least 1-\delta/2, one has

\frac{\left\|\eta\right\|_{2}^{2}}{d\sigma^{2}}\leq 1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}.

This implies

\left\|\eta\right\|_{2}\leq\sqrt{d\sigma^{2}\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right)}\leq\sqrt{d}\sigma\left(1+\sqrt{\frac{2\log\left(2/\delta\right)}{d}}\right).

The last inequality comes from \sqrt{1+u}\leq 1+\frac{u}{2} for positive u. Moreover, since \Delta^{*}_{\varepsilon} is a fixed vector, we know \left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\sim\mathcal{N}\left(0,\sigma^{2}\left\|\Delta^{*}_{\varepsilon}\right\|_{2}^{2}\right). So with probability at least 1-\delta/2, one has

\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\geq-\sigma\left\|\Delta^{*}_{\varepsilon}\right\|_{2}\sqrt{2\log\left(\frac{2}{\delta}\right)}\geq-\sigma\varepsilon\sqrt{2\log\left(\frac{2}{\delta}\right)}.

Combining the two bounds above, with probability at least 1-\delta, it holds that

\varepsilon\left\|\eta\right\|_{2}-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\leq\varepsilon\sqrt{d}\sigma\left(1+\sqrt{\frac{2\log\left(2/\delta\right)}{d}}\right)+\sigma\varepsilon\sqrt{2\log\left(\frac{2}{\delta}\right)}=\varepsilon\sqrt{d}\sigma\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right). (33)

Combining (30), (32) and (33), we finish the proof. ∎
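As a concrete (and hedged) numerical illustration of Proposition A.1, with the specific values below chosen only for exposition: take \delta=0.05, so that the condition d\geq 8\log(2/\delta)\approx 29.5 holds for d=32, and suppose the noise level is \sigma=1/8. Then, with probability at least 0.95, the guarantee gives \widehat{\Delta}_{\varepsilon}^{\top}r\geq\frac{1}{2}-2\cdot\frac{1}{8}=\frac{1}{4}, i.e., a constant improvement over the reference policy even though each arm is observed only once.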

Appendix B Monte Carlo computation

Algorithm 2 Monte-Carlo method for computing \mathcal{G}\left(\varepsilon\right)
  Input: Offline dataset \mathcal{D}, the radius value \varepsilon\in E, the total sample size M and the threshold M_{0}.
  1. Independently sample M noise vectors, denoted as \eta_{i} for i\in[M], where the j-th entry of \eta_{i} is drawn from \mathcal{N}(0,\sigma_{j}^{2}/N(a_{j})); here \sigma_{j}^{2} is the noise variance for the j-th arm and N(a_{j}) is the sample size for a_{j} in \mathcal{D}.
  2. Solve X_{i}:=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{i} for i\in[M] and order the values as X_{(1)}\leq X_{(2)}\leq...\leq X_{(M)}.
  3. Return X_{(M-M_{0}+1)} as an estimate of \mathcal{G}\left(\varepsilon\right) defined in Definition 4.2.

As discussed in Section 4, we can estimate \mathcal{G}\left(\varepsilon\right) using a classical Monte Carlo method. In this section, we describe the detailed implementation. We first sample M i.i.d. noise vectors and then solve \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta for each of them to get M suprema. We then select the M_{0}-th largest of all suprema as our estimate of the bonus function, where M_{0} is a pre-computed integer depending on M and the pre-determined failure probability \delta>0. Here, the program \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta is a second-order cone program and can be efficiently solved via standard off-the-shelf libraries (Alizadeh & Goldfarb, 2003; Boyd & Vandenberghe, 2004; Diamond & Boyd, 2016). The pseudocode for the Monte-Carlo sampling is in Algorithm 2.
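To make the procedure concrete, the following is a minimal Python sketch of Algorithm 2 using CVXPY (Diamond & Boyd, 2016). It assumes that the trust region \mathsf{C}\left(\varepsilon\right) consists of improvement vectors \Delta such that \widehat{\mu}+\Delta lies in the simplex and \sqrt{\sum_{i}\sigma_{i}^{2}\Delta_{i}^{2}/N_{i}}\leq\varepsilon, matching the reparameterization in Appendix C.2; the function and variable names are illustrative only and not taken from any released code.

import numpy as np
import cvxpy as cp

def estimate_bonus(mu_hat, sigma2, counts, eps, M=1000, M0=10, seed=0):
    # Monte-Carlo estimate of G(eps): return the (M - M0 + 1)-th order statistic
    # of the suprema sup_{Delta in C(eps)} Delta^T eta over M sampled noise vectors.
    d = len(mu_hat)
    rng = np.random.default_rng(seed)
    scale = np.sqrt(np.asarray(sigma2) / np.asarray(counts))  # sigma_i / sqrt(N_i)
    sups = np.empty(M)
    for m in range(M):
        eta = rng.normal(0.0, scale)                  # entry i ~ N(0, sigma_i^2 / N_i)
        Delta = cp.Variable(d)
        constraints = [
            mu_hat + Delta >= 0,                      # mu_hat + Delta stays a probability vector
            cp.sum(Delta) == 0,
            cp.norm(cp.multiply(scale, Delta), 2) <= eps,  # localized (weighted l2) trust region
        ]
        # linear objective over a second-order cone feasible set
        problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(eta, Delta))), constraints)
        sups[m] = problem.solve()
    return np.sort(sups)[M - M0]                      # X_{(M - M0 + 1)}

Since the M optimization problems are independent, they can also be solved in parallel in practice.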

To determine M_{0}, we denote by \eta_{i} the i.i.d. noise vectors for i\in[M] and let X_{i}=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{i}. We denote the order statistics of the X_{i}'s as X_{(1)}\leq X_{(2)}\leq...\leq X_{(M)}. Suppose the cumulative distribution function of X_{i} is F(x); then, from the properties of order statistics, the cumulative distribution function of X_{(M-M_{0}+1)} is

F_{X_{\left(M-M_{0}+1\right)}}(x)=\sum_{j=M-M_{0}+1}^{M}C_{M}^{j}\left(F(x)\right)^{j}\left(1-F(x)\right)^{M-j}.

We denote by q_{1-\delta} the (1-\delta)-lower quantile of the random variable X_{i}, so that F_{X_{\left(M-M_{0}+1\right)}}\left(q_{1-\delta}\right)=\sum_{j=M-M_{0}+1}^{M}C_{M}^{j}(1-\delta)^{j}\delta^{M-j}. For an integer M and \delta>0, we define Q(M,\delta) as the maximal integer M_{0} such that \sum_{j=M-M_{0}+1}^{M}C_{M}^{j}(1-\delta)^{j}\delta^{M-j}\leq\delta. With this definition, we fix M and a total failure tolerance \delta for all \varepsilon\in E, and take

M_{0}=Q\left(M,\frac{\delta}{2|E|}\right)

as the threshold. Under this choice, for any \varepsilon\in E, with probability at least 1-\delta/(2|E|), it holds that X_{\left(M-M_{0}+1\right)}>q_{1-\delta/(2|E|)}. On the other hand, with probability at least 1-\delta/(2|E|), it holds that \sup_{\Delta\in\mathsf{C}(\varepsilon)}\Delta^{\top}\eta\leq q_{1-\delta/(2|E|)}. This implies

\sup_{\Delta\in\mathsf{C}(\varepsilon)}\Delta^{\top}\eta\leq q_{1-\delta/(2|E|)}<X_{\left(M-M_{0}+1\right)}

with probability at least 1-\delta/|E|. By a union bound, with probability at least 1-\delta, the bound above holds for all \varepsilon\in E simultaneously.
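The threshold Q(M,\delta) can be computed by a direct search over the binomial tail above; below is a hedged sketch (the function name Q and the use of scipy.stats are our choices, not from the paper).

from scipy.stats import binom

def Q(M, delta):
    # Largest M0 such that P(Bin(M, 1 - delta) >= M - M0 + 1) <= delta,
    # i.e., the maximal threshold satisfying the binomial-tail condition above.
    M0 = 0
    while M0 + 1 <= M and binom.sf(M - (M0 + 1), M, 1.0 - delta) <= delta:
        M0 += 1
    return M0

# Usage with the union-bounded tolerance over the candidate radii E:
# M0 = Q(M, delta / (2 * len(E)))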

Appendix C A fine-grained analysis of the suboptimality

We have shown a problem-dependent upper bound for the suboptimality in (16). In this section, we give a further upper bound for \mathcal{G}\left(\varepsilon\right) and hence for the suboptimality. We have the following theorem. The proof is deferred to Section C.1.

Theorem C.1.

For a policy \pi (deterministic or stochastic), we denote its reward value as V^{\pi}. TRUST has the following properties.

  1.

    We denote a comparator policy as a triple (\varepsilon,\Delta,\pi) such that \varepsilon=\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}} and \pi=\widehat{\mu}+\Delta. We take the discrete candidate set E defined in (14). With probability at least 1-\delta, for any stochastic comparator policy (\varepsilon,\Delta,\pi), the sub-optimality of the output policy of Algorithm 1 can be upper bounded as

    V^{\pi}-V^{\pi_{TRUST}}\leq 2\sqrt{2\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}\log\left(\frac{2|E|}{\delta}\right)}+2\min\left\{\sqrt{\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}},4D\sqrt{\log_{+}\left(\frac{4ed\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}}{D^{2}}\right)}\right\} (34)

    where D is any quantity satisfying

    D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (35)

    Here \alpha is the decaying rate defined in (14) and \log_{+}(a)=\max(1,\log(a)).

  2.

    (Comparison with the optimal policy) We further assume \sigma_{i}=1 for i\in[d] and that the offline dataset is generated from the policy \mu(\cdot) with \min_{i\in[d]}\mu(a_{i})>0. Without loss of generality, we assume a_{1} is the optimal arm and denote the optimal policy as \pi_{*}. We write

    C^{*}:=\frac{1}{\mu(a_{1})},\quad C_{\min}:=\frac{1}{\min_{i\in[d]}\mu(a_{i})}. (36)

    When N\geq 8C_{\min}\log(d/\delta), with probability at least 1-2\delta, one has

    V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C_{\min}}{N}\log_{+}\left(\frac{dC^{*}}{C_{\min}}\right)}+\sqrt{\frac{C^{*}}{N}\log\left(\frac{2|E|}{\delta}\right)}. (37)

    In particular, when C_{\min}\simeq C^{*}, we have, with probability at least 1-2\delta,

    V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}. (38)

We remark that (34) is problem-dependent, and it gives an explicit upper bound for \mathcal{G}\left(\lceil\varepsilon\rceil\right) in (16). This is derived by first concentrating \mathcal{G}\left(\varepsilon\right) around \mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta, which is known as the localized Gaussian width or local Gaussian complexity (Koltchinskii, 2006), and then upper bounding the localized Gaussian width of a convex hull via tools from convex analysis (Bellec, 2019). Different from (4), when \pi=a_{i} represents a single arm, (34) relies not only on \sigma_{i}^{2}/N_{i} but also on \sigma_{j}^{2}/N_{j} for j\neq i, since the sizes of the trust regions depend on \sigma_{i}^{2}/N_{i} for all i\in[d].

Notably, (38) gives an analogous upper bound depending on \mu(\cdot) and N, which is comparable to the bound for LCB in (5) up to constant and logarithmic factors. This indicates that, when the behavioral cloning policy is not too imbalanced, TRUST is guaranteed to achieve the same level of performance as LCB. In fact, this guarantee is remarkable since TRUST searches over a much larger policy space than LCB: the whole simplex of stochastic policies rather than only the set of single arms. We also remark that both (5) and (38) are worst-case upper bounds; in practice, we will show in Section 6 that in some settings TRUST achieves good performance while LCB fails completely.

Is TRUST minimax-optimal?

We consider the hard instances for offline MAB in (Rashidinejad et al., 2021), on which LCB achieves the minimax-optimal upper bound, and we show that for these hard instances TRUST achieves the same sample complexity as LCB up to constant and logarithmic factors. More specifically, we consider a two-arm MAB \mathcal{A}=\left\{1,2\right\} and the uniform behavioral cloning policy \mu(1)=\mu(2)=1/2. For \delta\in[0,1/4], we define \mathcal{M}_{1} and \mathcal{M}_{2} as two instances whose reward distributions are as follows.

\mathcal{M}_{1}:r(1)\sim\mathsf{Bernoulli}\left(\frac{1}{2}\right),\ r(2)\sim\mathsf{Bernoulli}\left(\frac{1}{2}+\delta\right)
\mathcal{M}_{2}:r(1)\sim\mathsf{Bernoulli}\left(\frac{1}{2}\right),\ r(2)\sim\mathsf{Bernoulli}\left(\frac{1}{2}-\delta\right),

where \mathsf{Bernoulli}\left(p\right) is the Bernoulli distribution with parameter p. The next result is a corollary of Theorem C.1.

Corollary C.2.

We define \mathcal{M}_{1},\mathcal{M}_{2} as above for \delta\in[0,1/4]. Assume N\geq\widetilde{O}(1). Then, we have

  1.

    The minimax optimal lower bound for the suboptimality of LCB is

    \inf_{\widehat{a}_{\mathsf{LCB}}\in\mathcal{A}}\sup_{\mathcal{M}\in\left\{\mathcal{M}_{1},\mathcal{M}_{2}\right\}}\mathbb{E}_{\mathcal{D}}\left[r(a^{*})-r(\widehat{a}_{\mathsf{LCB}})\right]\gtrsim\sqrt{\frac{C^{*}}{N}}, (39)

    where \mathbb{E}_{\mathcal{D}}\left[\cdot\right] is the expectation over the offline dataset \mathcal{D}.

  2.

    The upper bound for the suboptimality of TRUST matches the lower bound above up to a log factor. Namely, for any \mathcal{M}\in\left\{\mathcal{M}_{1},\mathcal{M}_{2}\right\}, one has

    \mathbb{E}_{\mathcal{D}}\left[r(a^{*})-V^{\pi_{TRUST}}\right]\lesssim\sqrt{\frac{C^{*}\log(dN)}{N}}. (40)

The first claim comes from Theorem 2 of (Rashidinejad et al., 2021), while the second claim is a direct corollary to Theorem C.1.

C.1 Proof of Theorem C.1

Proof.

Recall from Theorem 5.1 that for any comparator policy (\varepsilon,\Delta,\pi) defined above, one has

V^{\pi}-V^{\pi_{TRUST}}\leq 2\mathcal{G}\left(\lceil\varepsilon\rceil\right),

where \lceil\varepsilon\rceil:=\inf\left\{\varepsilon^{\prime}\in E:\varepsilon\leq\varepsilon^{\prime}\right\}. The following lemma upper bounds the quantile of Gaussian suprema \mathcal{G}\left(\varepsilon\right) for each \varepsilon\in E. The proof is deferred to Section C.2.

Lemma C.3.

For \varepsilon\in E, one can upper bound \mathcal{G}\left(\varepsilon\right) as follows:

\mathcal{G}\left(\varepsilon\right)\leq\min\left\{\varepsilon\cdot\sqrt{d}\ ,\ 4D\sqrt{\log_{+}\left(\frac{4ed\varepsilon^{2}}{D^{2}}\right)}\right\}+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)} (41)

where \log_{+}(a)=\max(1,\log(a)) and D is any quantity satisfying

D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (42)

Applying Lemma C.3 to \lceil\varepsilon\rceil\in E, we obtain

V^{\pi}-V^{\pi_{TRUST}}\leq 2\min\left\{\lceil\varepsilon\rceil\cdot\sqrt{d}\ ,\ 4D\sqrt{\log_{+}\left(\frac{4ed\lceil\varepsilon\rceil^{2}}{D^{2}}\right)}\right\}+2\sqrt{2\lceil\varepsilon\rceil^{2}\log\left(\frac{2|E|}{\delta}\right)}. (43)

Since \varepsilon=\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}}, we know from our discretization scheme in (14) that

\lceil\varepsilon\rceil\leq\alpha\cdot\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}}. (44)

Plugging (44) into (43), we obtain our first claim. To get the second claim, we take \sigma_{i}=1 for i\in[d] and \Delta=\pi_{*}-\widehat{\mu}, which is the vector pointing from the reference policy \widehat{\mu} defined in (7) to the vertex corresponding to the optimal arm. Then, we have

\sum_{i=1}^{d}\frac{\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}=\frac{1}{N_{1}}-\frac{2}{N}+\frac{1}{N}=\frac{1}{N_{1}}-\frac{1}{N}\leq\frac{1}{N_{1}},

where N_{1} is the sample size for the optimal arm a_{1}. Therefore, we can further bound (43) as

V^{\pi_{*}}-V^{\pi_{TRUST}}\leq 4D\sqrt{\log_{+}\left(\frac{4\alpha ed}{N_{1}D^{2}}\right)}+2\sqrt{\frac{2\alpha}{N_{1}}\log\left(\frac{2|E|}{\delta}\right)}. (45)

Finally, we take a specific value of D and lower bound N_{1} via the Chernoff bound in Lemma C.7. From Lemma C.7, we know that when N\geq 8C_{\min}\log(d/\delta), with probability at least 1-\delta, we have

N_{i}\geq\frac{1}{2}N\mu(a_{i}) (46)

for any i\in[d]. Recalling the definition of D in (42), D can be any value greater than \sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. Then, when \sigma_{i}=1, one has

\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}\leq\sqrt{\frac{1}{\min_{i\in[d]}N_{i}}}.

We denote N_{j}=\min_{i\in[d]}N_{i} (when there are multiple minimizers, we arbitrarily pick one). Then, we have

\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}\leq\sqrt{\frac{1}{N_{j}}}\leq\sqrt{\frac{2}{N\mu(a_{j})}}\leq\sqrt{\frac{2}{N\cdot\min_{i\in[d]}\mu(a_{i})}}=\sqrt{\frac{2C_{\min}}{N}}.

Therefore, we take D=\sqrt{\frac{2C_{\min}}{N}} in (45) and apply N_{1}\geq\frac{1}{2}N\mu(a_{1}) to obtain

V^{\pi_{*}}-V^{\pi_{TRUST}}\leq 4\sqrt{\frac{2C_{\min}}{N}\log_{+}\left(\frac{4\alpha edC^{*}}{C_{\min}}\right)}+4\sqrt{\frac{\alpha C^{*}}{N}\log\left(\frac{2|E|}{\delta}\right)}, (47)

which proves (37). Finally, when C^{*}\simeq C_{\min}, one has

V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}.

Therefore, we conclude. ∎

C.2 Proof of Lemma C.3

Proof.

Recall that \Delta=(\Delta_{1},\Delta_{2},...,\Delta_{d})^{\top} is the improvement vector and \eta=(\eta_{1},\eta_{2},...,\eta_{d})^{\top} is the noise vector, whose entries are independent with \eta_{i}\sim\mathcal{N}(0,\sigma_{i}^{2}/N_{i}), where N_{i} is the sample size of arm a_{i} in the offline dataset. To proceed with the proof, we further define

\widetilde{\eta}=\left(\widetilde{\eta}_{1},\widetilde{\eta}_{2},...,\widetilde{\eta}_{d}\right)^{\top},\quad\widetilde{\Delta}=\left(\widetilde{\Delta}_{1},\widetilde{\Delta}_{2},...,\widetilde{\Delta}_{d}\right)^{\top},\quad\text{where}\quad\widetilde{\eta}_{i}=\eta_{i}\frac{\sqrt{N_{i}}}{\sigma_{i}},\ \widetilde{\Delta}_{i}=\frac{\Delta_{i}\sigma_{i}}{\sqrt{N_{i}}}. (48)

With this notation, one has

\widetilde{\eta}\sim\mathcal{N}\left(0,I_{d}\right),\quad\eta^{\top}\Delta=\widetilde{\eta}^{\top}\widetilde{\Delta}.

We also write the equivalent trust region (for \widetilde{\Delta}) as

\widetilde{\mathsf{C}}\left(\varepsilon\right)=\left\{\widetilde{\Delta}\in\mathbb{R}^{d}:\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\geq 0,\quad\sum_{i=1}^{d}\left[\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\right]=1,\quad\left\|\widetilde{\Delta}\right\|_{2}\leq\varepsilon\right\}, (49)

where \widehat{\mu}=(\widehat{\mu}_{1},\widehat{\mu}_{2},...,\widehat{\mu}_{d})^{\top} is the weight vector of the reference policy. From the definition above, one has, for any \varepsilon>0,

\Delta\in\mathsf{C}\left(\varepsilon\right)\ \Leftrightarrow\ \widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right).

Then, we apply Lemma C.4 to \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta for a fixed \varepsilon\in E. With probability at least 1-\frac{\delta}{|E|},

\left|\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta-\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right|\leq\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}.

From a union bound, one immediately has that, with probability at least 1-\delta, for any \varepsilon\in E, it holds that

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}. (50)

From the definition of \mathcal{G}\left(\varepsilon\right) in Definition 4.2, we know that \mathcal{G}\left(\varepsilon\right) is the minimal quantity that satisfies (50) with probability at least 1-\delta. Therefore, one has

\mathcal{G}\left(\varepsilon\right)\leq\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}=\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right)}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}\quad\forall\varepsilon\in E. (51)

Note that the first term on the RHS of (51) is the localized Gaussian width of the convex hull defined by the trust region \mathsf{C}\left(\varepsilon\right) (or equivalently, \widetilde{\mathsf{C}}\left(\varepsilon\right)). We denote

T:=\left\{\widetilde{\Delta}\in\mathbb{R}^{d}:\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\geq 0,\quad\sum_{i=1}^{d}\left[\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\right]=1\right\}. (52)

We immediately have that T is the convex hull of d points in \mathbb{R}^{d}, whose vertices are the vertices of the simplex in \mathbb{R}^{d} shifted by the reference policy \widehat{\mu} and rescaled coordinate-wise. In what follows, we plan to apply Lemma C.5 to the localized Gaussian width of T\cap\varepsilon\mathbb{B}_{2}. However, T is not contained in the unit ball of \mathbb{R}^{d}, so we need an additional scaling. Note that the zero vector is included in T. Let us compute the largest distance from the origin to a vertex of T. We denote the i-th vertex of T as

\widetilde{\Delta}=\left(-\frac{\sigma_{1}}{\sqrt{N_{1}}}\widehat{\mu}_{1},...,-\frac{\sigma_{i-1}}{\sqrt{N_{i-1}}}\widehat{\mu}_{i-1},\frac{\sigma_{i}}{\sqrt{N_{i}}}\left(1-\widehat{\mu}_{i}\right),-\frac{\sigma_{i+1}}{\sqrt{N_{i+1}}}\widehat{\mu}_{i+1},...,-\frac{\sigma_{d}}{\sqrt{N_{d}}}\widehat{\mu}_{d}\right). (53)

The \ell_{2}-norm of this vertex is

\left\|\widetilde{\Delta}\right\|_{2}=\sqrt{\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}},

where N is the total sample size of the offline dataset. Therefore, the maximal radius of T can be upper bounded by D, where D is any quantity that satisfies

D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (54)

We denote S=\frac{1}{D}\cdot T:=\left\{\frac{1}{D}\cdot x:x\in T\right\}. Then, from Lemma C.5, one has

\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right)}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]=\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in T\cap\varepsilon\mathbb{B}_{2}}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]
=D\cdot\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in S\cap\frac{\varepsilon}{D}\mathbb{B}_{2}}\widetilde{\Delta}^{\top}\widetilde{\eta}\right] (S\cap\frac{\varepsilon}{D}\mathbb{B}_{2} is obtained by scaling T\cap\varepsilon\mathbb{B}_{2} by \frac{1}{D})
\leq D\cdot\left[\left(4\sqrt{\log_{+}\left(4ed\left(\frac{\varepsilon^{2}}{D^{2}}\wedge 1\right)\right)}\right)\wedge\left(\frac{\varepsilon}{D}\sqrt{d}\right)\right] (take s=\frac{\varepsilon}{D} and M=d in Lemma C.5)
=D\cdot\left[\left(4\sqrt{\log_{+}\left(\frac{4ed\varepsilon^{2}}{D^{2}}\right)}\right)\wedge\left(\frac{\varepsilon}{D}\sqrt{d}\right)\right]. (\varepsilon\leq D for any \varepsilon\in E)

This finishes the proof. ∎

C.3 Auxiliary lemmas

Lemma C.4 (Concentration of Gaussian suprema, Exercise 5.10 in Wainwright (2019)).

Let \left\{X_{\theta},\theta\in\mathbb{T}\right\} be a zero-mean Gaussian process, and define Z=\sup_{\theta\in\mathbb{T}}X_{\theta}. Then, we have

\mathbb{P}[|Z-\mathbb{E}[Z]|\geq\delta]\leq 2\exp\left(-\frac{\delta^{2}}{2\sigma^{2}}\right),

where \sigma^{2}:=\sup_{\theta\in\mathbb{T}}\operatorname{var}\left(X_{\theta}\right) is the maximal variance of the process.

Lemma C.5 (Localized Gaussian Width of a Convex Hull, Proposition 1 in Bellec (2019)).

Let d\geq 1, M\geq 2, and let T be the convex hull of M points in \mathbb{R}^{d}. We write \mathbb{B}_{2}=\left\{x\in\mathbb{R}^{d}:\left\|x\right\|_{2}\leq 1\right\} and s\mathbb{B}_{2}=\left\{s\cdot x:x\in\mathbb{R}^{d},\left\|x\right\|_{2}\leq 1\right\}. Assume T\subset\mathbb{B}_{2}^{d}(1). Let g\in\mathbb{R}^{d} be a standard Gaussian vector. Then, for all s>0, one has

\mathbb{E}\left[\sup_{x\in T\cap s\mathbb{B}_{2}}\ x^{\top}g\right]\leq\left(4\sqrt{\log_{+}\left(4eM\left(s^{2}\wedge 1\right)\right)}\right)\wedge\left(s\sqrt{d\wedge M}\right), (55)

where \log_{+}(a)=\max(1,\log(a)) and a\wedge b=\min\left\{a,b\right\}.

Lemma C.6 (Chernoff bound for binomial random variables, Theorem 2.3.1 in Vershynin (2020)).

Let X_{i} be independent Bernoulli random variables with parameters p_{i}. Consider their sum S_{N}=\sum_{i=1}^{N}X_{i} and denote its mean by \mu=\mathbb{E}S_{N}. Then, for any t>\mu, we have

\mathbb{P}\left\{S_{N}\geq t\right\}\leq e^{-\mu}\left(\frac{e\mu}{t}\right)^{t}.
Lemma C.7 (Chernoff bound for offline MAB).

Under the setting in Theorem C.1, we have

\mathbb{P}\left(N_{i}\geq\frac{1}{2}N\mu(a_{i})\quad\forall i\in[d]\right)\geq 1-d\exp\left(-\frac{N\cdot\min_{j\in[d]}\mu(a_{j})}{8}\right).
Proof.

For arm i\in[d], we take \mu=N\mu(a_{i}) and t=\frac{1}{2}N\mu(a_{i}) in the lower-tail analogue of Lemma C.6 (which states that \mathbb{P}\left\{S_{N}\leq t\right\}\leq e^{-\mu}\left(\frac{e\mu}{t}\right)^{t} for t<\mu) and obtain

\mathbb{P}\left(N_{i}\leq\frac{1}{2}N\mu(a_{i})\right)\leq\exp\left(-N\mu(a_{i})\right)\cdot\left(\frac{eN\mu(a_{i})}{\frac{1}{2}N\mu(a_{i})}\right)^{\frac{1}{2}N\mu(a_{i})}=\exp\left(N\mu(a_{i})\left[-1+\frac{1}{2}\log(2e)\right]\right)\leq\exp\left(-\frac{N\mu(a_{i})}{8}\right).

We finish the proof by a union bound for all arms. ∎

Appendix D Proof of Lemma 5.2

Proof.

Recall the definition of \lceil\varepsilon\rceil:

\lceil\varepsilon\rceil:=\inf\left\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}\geq\varepsilon\right\}. (56)

We additionally define

\lfloor\varepsilon\rfloor:=\sup\left\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}<\varepsilon\right\}. (57)

In particular, if there is no \varepsilon^{\prime}\in E such that \varepsilon^{\prime}<\varepsilon, then we define \lfloor\varepsilon\rfloor=0. Then, for any \varepsilon\leq\varepsilon_{0}\in E (where \varepsilon_{0} is the largest possible radius) and a finite set E, it holds that

\lfloor\varepsilon\rfloor<\varepsilon\leq\lceil\varepsilon\rceil,\quad\text{ and }\quad\varepsilon=\lceil\varepsilon\rceil\text{ if and only if }\varepsilon\in E. (58)

For any \varepsilon\in E, recall that \widehat{\Delta}_{\varepsilon} is the optimal improvement vector within \mathsf{C}\left(\varepsilon\right) defined in (10). It holds that

\widehat{\Delta}_{\varepsilon}:=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\widehat{r}=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\varepsilon\right)\big{]} (since \mathcal{G}\left(\varepsilon\right) does not depend on \Delta)
=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]} (\varepsilon\in E, so \lceil\varepsilon\rceil=\varepsilon)
\leq\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}.

On the other hand, when \varepsilon\in E and \varepsilon^{\prime}\in(\left\lfloor\varepsilon\right\rfloor,\left\lceil\varepsilon\right\rceil], one has \left\lceil\varepsilon^{\prime}\right\rceil=\left\lceil\varepsilon\right\rceil=\varepsilon, so

\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}=\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]}\leq\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]},

where the last inequality comes from the fact that \mathsf{C}\left(\varepsilon^{\prime}\right)\subset\mathsf{C}\left(\varepsilon\right) when \varepsilon^{\prime}\leq\lceil\varepsilon\rceil=\varepsilon by the definition of the trust region in (4.2). Combining the two inequalities above, we have, for any \varepsilon\in E,

\left(\varepsilon,\widehat{\Delta}_{\varepsilon}\right)=\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}, (59)

where the optimization variables on the RHS above are \varepsilon^{\prime} and \Delta. Therefore, from the definition of \left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right), we have

\left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right)=\mathop{\arg\max}_{\varepsilon\in E}\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}=\mathop{\arg\max}_{\varepsilon\leq\varepsilon_{0},\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]}.

This finishes the proof. ∎

Appendix E Augmentation with LCB

To determine the most effective final policy, we can compare the outputs of LCB and Algorithm 1 and combine the two policies based on the relative magnitude of their corresponding lower bounds. Specifically, the combined policy is

\pi_{\mathsf{combined}}=\left\{\begin{aligned} \widehat{a}_{\mathsf{LCB}}&\ \text{ if }\max_{a_{i}\in\mathcal{A}}l_{i}\geq w_{\mathsf{TR}}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}},\\ w_{\mathsf{TR}}&\ \text{ if }\max_{a_{i}\in\mathcal{A}}l_{i}<w_{\mathsf{TR}}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}},\end{aligned}\right. (60)

where l_{i}=\widehat{r}_{i}-b_{i} is defined in (1) and \mathcal{G}\left(\varepsilon\right) is defined in Definition 4.2. This combined policy will perform at least as well as LCB with high probability. More specifically, we have

Corollary E.1.

We denote the arm chosen by LCB as \widehat{a}_{\mathsf{LCB}}. We also denote by r(\cdot) the true reward of a policy (deterministic or stochastic). With probability at least 1-3\delta, one has

V^{\pi_{\mathsf{combined}}}\geq\max_{a_{i}\in\mathcal{A}}l_{i}. (61)
Proof.

We denote by \widehat{r}\left(\widehat{a}_{\mathsf{LCB}}\right)=\widehat{r}_{\widehat{a}_{\mathsf{LCB}}} and \widehat{r}(w_{\mathsf{TR}}) the empirical rewards of the policies returned by LCB and Algorithm 1, respectively. Recalling the uncertainty term of LCB in (1) and that of Algorithm 1 in (60), we write b(\widehat{a}_{\mathsf{LCB}})=b_{\widehat{a}_{\mathsf{LCB}}} and b(w_{\mathsf{TR}})=\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)+\sqrt{2\log(1/\delta)/[\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}]}. Then, from Theorem 5.1, (2), and a union bound, we know that with probability at least 1-3\delta, it holds that

r(\widehat{a}_{\mathsf{LCB}})\geq\widehat{r}(\widehat{a}_{\mathsf{LCB}})-b(\widehat{a}_{\mathsf{LCB}}),\quad r(w_{\mathsf{TR}})\geq\widehat{r}(w_{\mathsf{TR}})-b(w_{\mathsf{TR}}),

which implies

V^{\pi_{\mathsf{combined}}}\geq\widehat{r}(\pi_{\mathsf{combined}})-b(\pi_{\mathsf{combined}})
\geq\widehat{r}(\widehat{a}_{\mathsf{LCB}})-b(\widehat{a}_{\mathsf{LCB}}) (by the selection rule (60))
=\max_{a_{i}\in\mathcal{A}}l_{i}. (by the definition of \widehat{a}_{\mathsf{LCB}} in (3))

Therefore, we conclude. ∎
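For completeness, a minimal Python sketch of the combination rule (60) is given below, assuming the per-arm empirical rewards, the LCB bonuses b_{i}, the TRUST weights w_{\mathsf{TR}}, and the Monte-Carlo bonus estimate \mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right) are already computed; all function and argument names are illustrative.

import numpy as np

def combine_policies(r_hat, lcb_bonus, w_tr, g_eps_hat, counts, sigma2, delta):
    # Combination rule (60): return the LCB arm (as a one-hot policy) or the
    # TRUST policy w_TR, whichever has the larger lower bound on its value.
    l = r_hat - lcb_bonus                              # per-arm lower bounds l_i
    tr_lower = w_tr @ r_hat - g_eps_hat - np.sqrt(
        2.0 * np.log(1.0 / delta) / np.sum(counts / sigma2))
    if l.max() >= tr_lower:
        pi = np.zeros_like(w_tr)
        pi[int(np.argmax(l))] = 1.0                    # deterministic policy on the LCB arm
        return pi
    return w_tr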

Appendix F Experiment details

We ran experiments on MuJoCo environments from the D4RL benchmark (Fu et al., 2020). All environments we test on are v3. Since the original D4RL datasets do not include the exact form of the logging policies, we retrain SAC (Haarnoja et al., 2018) on these environments for 1000 episodes and keep a record of the policy after each episode. We test 4 environments in two settings, denoted as '1-traj-low' and '1-traj-high'. In either setting, the offline dataset is generated from 100 policies with one trajectory from each. In the '1-traj-low' setting, the data is generated by the first 100 policies in the training process of SAC, while in the '1-traj-high' setting, it is generated by the policies at the (10x+1)-th episodes of the training process.

For all experiments on MuJoCo, we average the results over 4 random seeds (2023 to 2026). To run CQL, we use the default hyper-parameters in https://github.com/young-geng/CQL and train for 2000 episodes. For TRUST, we use a fixed standard deviation level \sigma_{i}=150 in all experiments.
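A hedged sketch of the checkpoint selection implied by the description above (the loader and rollout helpers are hypothetical, and the range of x is our reading of the text):

# Episode indices of the SAC checkpoints used to generate the offline data.
low_episodes = list(range(1, 101))                 # '1-traj-low': first 100 policies
high_episodes = [10 * x + 1 for x in range(100)]   # '1-traj-high': episodes 1, 11, ..., 991

# Hypothetical data-collection loop: one trajectory per selected checkpoint.
# dataset = [rollout_one_trajectory(load_sac_checkpoint(ep)) for ep in high_episodes]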