
Is Offline Decision Making Possible with Only Few Samples?
Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

Ruiqi Zhang    Yuexiang Zhai    Andrea Zanette
Abstract

What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples.

Our analysis reveals that stochastic policies can be substantially better than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples.

Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.

Multi-armed bandit, Offline reinforcement learning, Trust region.

1 Introduction

In several important problems, critical decisions must be made with just very few samples of pre-collected experience. For example, collecting samples in robotic manipulation may be slow and costly, and the ability to learn from very few interactions is highly desirable (Hester & Stone, 2013; Liu et al., 2021). Likewise, in clinical trials and in personalized medical decisions, reliable decisions must be made by relying on very small datasets (Liu et al., 2017). Sample efficiency is also key in personalized education (Bassen et al., 2020; Ruan et al., 2023).

However, to achieve good performance, the state-of-the-art algorithms may require millions of samples (Fu et al., 2020). These empirical findings seem to be supported by the existing theories: the sample complexity bounds, even minimax optimal ones, can be large in practice due to the large constants and the warmup factors (Ménard et al., 2021; Li et al., 2022; Azar et al., 2017; Zanette et al., 2019).

In this work, we study whether it is possible to make reliable decisions with only a few samples. We focus on an offline Multi-Armed Bandit (MAB) problem, which is a foundational model for decision-making (Lattimore & Szepesvári, 2020). In online MAB, an agent repeatedly chooses an arm from a set of arms, each providing a stochastic reward. Offline MAB is a variant where the agent cannot interact with the environment to gather new information and must instead make decisions based on a pre-collected dataset, without playing additional exploratory actions, aiming to identify the arm with the highest expected reward (Audibert et al., 2010; Garivier & Kaufmann, 2016; Russo, 2016; Ameko et al., 2020).

The standard approach to the problem is the Lower Confidence Bound (LCB) algorithm (Rashidinejad et al., 2021), a pessimistic variant of UCB (Auer et al., 2002) that involves selecting the arm with the highest lower bound on its performance. LCB encodes a principle called pessimism under uncertainty, which is the foundation principle for most algorithms for offline bandits and reinforcement learning (RL) (Jin et al., 2020; Zanette et al., 2020; Xie et al., 2021; Yin & Wang, 2021; Kumar et al., 2020; Kostrikov et al., 2021).

Unfortunately, the available methods that implement the principle of pessimism under uncertainty can fail in a data-starved regime because they rely on confidence intervals that are too loose when just a few samples are available. For example, even on a simple MAB instance with ten thousand arms, the best-known (Rashidinejad et al., 2021) performance bound for the LCB algorithm requires 24 samples per arm in order to provide meaningful guarantees, see Section 3.3. In more complex situations, such as in the sequential setting with function approximation, such a problem can become more severe due to the higher metric entropy of the function approximation class and the compounding of errors through time steps.

These considerations suggest that there is a “barrier of entry” to decision-making, both theoretically and practically: one needs to have a substantial number of samples in order to make reliable decisions even for settings as simple as offline MAB where the guarantees are tighter. Given the above technical reasons, and the lack of good algorithms and guarantees for data-starved decision problems, it is unclear whether it is even possible to find good decision rules with just a handful of samples.

In this paper, we make a substantial contribution towards lowering such barriers of entry. We discover that a carefully-designed algorithm tied to an advanced statistical analysis can substantially improve the sample complexity, both theoretically and practically, and enable reliable decision-making with just a handful of samples. More precisely, we focus on the offline MAB setting where we show that even if the dataset contains just a single sample in every arm, it may still be possible to compete with the optimal policy. This is remarkable, because with just one sample per arm—for example from a Bernoulli distribution—it is impossible to estimate the expected payoff of any of the arms! Our discovery is enabled by several key insights:

  • We search over stochastic policies, which we show can yield better performance for offline-decision making;

  • We use a localized notion of metric entropy to carefully control the size of the stochastic policy class that we search over;

  • We implement a concept called relative pessimism to obtain sharper guarantees.

These considerations lead us to design a trust region policy optimization algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST), one that offers superior theoretical as well as empirical performance compared to LCB in a data-scarce situation.

Moreover, we apply the algorithm to selected reinforcement learning problems from (Fu et al., 2020) in the special case where information about the logging policies is available. We do so by a simple reduction from reinforcement learning to bandits, mapping policies and returns in the former to actions and rewards in the latter, thereby disregarding the sequential aspect of the problem. Although we rely on the logging policies being available, the empirical study shows that our algorithm compares well with a strong deep reinforcement learning baseline (i.e., CQL from Kumar et al. (2020)), without being sensitive to partial observability, sparse rewards, or hyper-parameters.

2 Additional related work

Multi-armed bandit (MAB) is a classical decision-making framework (Lattimore & Szepesvári, 2020; Lai & Robbins, 1985; Lai, 1987; Langford & Zhang, 2007; Auer, 2002; Bubeck et al., 2012; Audibert et al., 2009; Degenne & Perchet, 2016). The natural approach in offline MABs is the LCB algorithm (Ameko et al., 2020; Si et al., 2020), an offline variant of the classical UCB method (Auer et al., 2002) which is minimax optimal (Rashidinejad et al., 2021).

The optimization over stochastic policies is also considered in combinatorial multi-armed bandits (CMAB) (Combes et al., 2015). Most works on CMAB focus on variants of the UCB algorithm (Kveton et al., 2015; Combes et al., 2015; Chen et al., 2016) or of Thompson sampling  (Wang & Chen, 2018; Liu & Ročková, 2023), and they are generally online.

Our framework can also be applied to offline reinforcement learning (RL) (Sutton & Barto, 2018) whenever the logging policies are accessible. There exist many practical algorithms for offline RL (Fujimoto et al., 2019; Peng et al., 2019; Wu et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021). The theory has also been investigated extensively in the tabular and function approximation settings (Nachum et al., 2019; Xie & Jiang, 2020; Zanette et al., 2021; Xie et al., 2021; Yin et al., 2022; Xiong et al., 2022). Some works have also sought to establish general guarantees for deep RL algorithms via sophisticated statistical tools, such as bootstrapping (Thomas et al., 2015; Nakkiran et al., 2020; Hao et al., 2021; Wang et al., 2022; Zhang et al., 2022).

We rely on the notion of pessimism, which is a key concept in offline bandits and RL. While most prior works focused on so-called absolute pessimism (Jin et al., 2020; Xie et al., 2021; Yin et al., 2022; Rashidinejad et al., 2021; Li et al., 2023), the work of Cheng et al. (2022) applied pessimism not to the policy value but to the difference (or improvement) between policies. However, their framework is very different from ours.

We make extensive use of two key concepts, namely localization laws and critical radii (Wainwright, 2019), which control the relative scale of the signal and the uncertainty. The idea of localization plays a critical role in the theory of empirical processes (Geer, 2000) and in statistical learning theory (Koltchinskii, 2001, 2006; Bartlett & Mendelson, 2002; Bartlett et al., 2005). The concept of critical radius or critical inequality is used in non-parametric regression (Wainwright, 2019) and in off-policy evaluation (Duan et al., 2021; Duan & Wainwright, 2022, 2023; Mou et al., 2022).

3 Data-Starved Multi-Armed Bandits

In this section, we describe the MAB setting and give an example of a “data-starved” MAB instance where prior methods (such as LCB) can fail. We informally say that an offline MAB is “data-starved” if its dataset contains only very few samples in each arm.

Notation  We let [n]=\{1,2,\ldots,n\} for a positive integer n. We let \left\|\cdot\right\|_{2} denote the Euclidean norm for vectors and the operator norm for matrices. We hide constants and logarithmic factors in the \widetilde{O}(\cdot) notation. We let \mathbb{B}_{p}^{d}(s)=\{x\in\mathbb{R}^{d}:\left\|x\right\|_{p}\leq s\} for any s\geq 0 and p\geq 1. We write a\lesssim b (resp. a\gtrsim b) if a\leq Cb (resp. a\geq Cb) for some numerical constant C, and a\simeq b if both a\lesssim b and b\lesssim a hold.

3.1 Multi-armed bandits

We consider the case where there are d arms in a set \mathcal{A}=\{a_{1},\ldots,a_{d}\} with expected rewards r(a_{i}), i\in[d]. We assume access to an offline dataset \mathcal{D}=\{(x_{i},r_{i})\}_{i\in[N]} of action-reward tuples, where the experienced actions \{x_{i}\}_{i\in[N]} are i.i.d. from a distribution \mu. Each experienced reward is a random variable with expectation \mathbb{E}[r_{i}]=r(x_{i}) and independent Gaussian noise \zeta_{i}:=r_{i}-r(x_{i}). For i\in[d], we denote the number of pulls of arm a_{i} in \mathcal{D} by N(a_{i}) or N_{i}, and the variance of the noise for arm a_{i} by \sigma_{i}^{2}. We denote the optimal arm by a^{*}\in\mathop{\arg\max}_{a\in\mathcal{A}}r(a) and the single policy concentrability by C^{*}=1/\mu(a^{*}), where \mu is the distribution that generated the dataset. Without loss of generality, we assume the optimal arm is unique. We also write r=(r(a_{1}),r(a_{2}),\ldots,r(a_{d}))^{\top} for the vector of mean rewards. Without loss of generality, we assume there is at least one sample for each arm (arms without samples can be removed).

3.2 Lower confidence bound algorithm

One simple but effective method for the offline MAB problem is the Lower Confidence Bound (LCB) algorithm, which is inspired by its online counterpart (UCB) (Auer et al., 2002). Like UCB, LCB computes the empirical mean \widehat{r}_{i} associated with the reward of each arm i along with its confidence half-width b_{i}. They are defined as

\widehat{r}_{i}:=\frac{1}{N(a_{i})}\sum_{k:x_{k}=a_{i}}r_{k},\qquad b_{i}:=\sqrt{\frac{2\sigma_{i}^{2}}{N(a_{i})}\log\left(\frac{2d}{\delta}\right)}. (1)

This definition ensures that, with probability at least 1-\delta, every confidence interval brackets the corresponding expected reward:

\widehat{r}_{i}-b_{i}\leq r\left(a_{i}\right)\leq\widehat{r}_{i}+b_{i}\quad\forall i\in[d]. (2)

The width of the confidence interval depends on the noise level \sigma_{i}, which can be exploited by variance-aware methods (Zhang et al., 2021; Min et al., 2021; Yin et al., 2022; Dai et al., 2022). When the true noise level is not accessible, we can replace it with the empirical standard deviation or with a high-probability upper bound. For example, when the reward for each arm is restricted to be within [0,1], a simple upper bound is \sigma_{i}^{2}\leq 1/4.

Unlike UCB, the half-width of the confidence interval for LCB is not added to, but subtracted from, the empirical mean, resulting in the lower bound l_{i}=\widehat{r}_{i}-b_{i}. The action identified by LCB is then the one that maximizes the resulting lower bound, thereby incorporating the principle of pessimism under uncertainty (Jin et al., 2020; Kumar et al., 2020). Specifically, given the dataset \mathcal{D}, LCB selects the arm using the following rule:

\widehat{a}_{\mathsf{LCB}}:=\mathop{\mathrm{argmax}}_{a_{i}\in\mathcal{A}}~l_{i}. (3)
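To make the rule concrete, the computations in (1)-(3) can be sketched in a few lines of NumPy; the input names below (rewards_by_arm, sigma2, delta) are illustrative and not part of any released codebase.

```python
import numpy as np

def lcb_select(rewards_by_arm, sigma2, delta):
    """Pick the arm maximizing the lower bound l_i = r_hat_i - b_i, as in (3)."""
    d = len(rewards_by_arm)
    counts = np.array([len(r) for r in rewards_by_arm])            # N(a_i)
    r_hat = np.array([np.mean(r) for r in rewards_by_arm])         # empirical means, Eq. (1)
    b = np.sqrt(2.0 * sigma2 / counts * np.log(2.0 * d / delta))   # half-widths, Eq. (1)
    return int(np.argmax(r_hat - b))                               # pessimistic selection, Eq. (3)
```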

Rashidinejad et al. (2021) analyzed the LCB strategy. Below we provide a modified version of their theorem.

Theorem 3.1 (LCB Performance).

Suppose the noise of arm a_{i} is sub-Gaussian with proxy variance \sigma_{i}^{2}. Let \delta\in(0,1/2). Then, we have

  1. (Comparison with any arm) With probability at least 1-\delta, for any comparator arm a_{i}\in\mathcal{A}, it holds that

     r\left(a_{i}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)\leq\sqrt{\frac{8\sigma_{i}^{2}}{N(a_{i})}\log\left(\frac{2d}{\delta}\right)}. (4)

  2. (Comparison with the optimal arm) Assume \sigma_{i}=1 for all i\in[d] and N\geq 8C^{*}\log\left(1/\delta\right). Then, with probability at least 1-2\delta, one has

     r\left(a^{*}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)\leq\sqrt{\frac{4C^{*}}{N}\log\left(\frac{2d}{\delta}\right)}. (5)

The statement of this theorem is slightly different from that in Rashidinejad et al. (2021), in the sense that they bound the expected suboptimality \mathbb{E}_{\mathcal{D}}[r\left(a^{*}\right)-r\left(\widehat{a}_{\mathsf{LCB}}\right)] rather than giving a high-probability bound. Rashidinejad et al. (2021) proved the minimax optimality of the algorithm when the single policy concentrability satisfies C^{*}\geq 2 and the sample size satisfies N\geq\widetilde{O}(C^{*}).

3.3 A data-starved MAB problem and failure of LCB

In order to highlight the limitation of a strategy such as LCB, let us describe a specific data-starved MAB instance, specifically one with d=10000 arms, equally partitioned into a set of good arms (i.e., \mathcal{A}_{g}) and a set of bad arms (i.e., \mathcal{A}_{b}). Each good arm returns a reward following the uniform distribution over [0.5,1.5], while each bad arm returns a reward which follows \mathcal{N}(0,1/4).

Assume that we are given a dataset that contains only one sample per arm. Instantiating the LCB confidence interval in (2) with \sigma_{i}\leq 1/2 and \delta=0.1, one obtains

\widehat{r}_{i}-2.5\leq r(a_{i})\leq\widehat{r}_{i}+2.5.

Such a bound is uninformative, because the lower bound for the true reward mean is less than the reward value of the worst arm. The performance bound for LCB confirms this intuition, because Theorem 3.1 requires at least N(a_{i})\geq\lceil 8\log(1/0.05)\rceil=24 samples in each arm to provide any guarantee with probability at least 0.9 (here C^{*}=d).

3.4 Can stochastic policies help?

At first glance, extracting a good decision-making strategy for the problem discussed in Section 3.3 seems like a hopeless endeavor, because it is information-theoretically impossible to reliably estimate the expected payoff of any of the arms with just a single sample for each.

In order to proceed, the key idea is to enlarge the search space to contain stochastic policies.

Definition 3.2 (Stochastic Policies).

A stochastic policy over a MAB is a probability distribution over the arms, i.e., a vector w\in\mathbb{R}^{d} with w_{i}\geq 0 and \sum_{i=1}^{d}w_{i}=1.

To exemplify how stochastic policies can help, consider the behavioral cloning policy, which mimics the policy that generated the dataset for the offline MAB in Section 3.3. Such a policy is stochastic, and it plays all arms uniformly at random, thereby achieving a score around 0.5 with high probability. The value of the behavioral cloning policy can be readily estimated using the Hoeffding bound (e.g., Proposition 2.5 in Wainwright (2019)): with probability at least 1-\delta=0.9 (here d=10000 is the number of arms and \sigma=1/2 is the true standard deviation), the value of the behavioral cloning policy is greater than or equal to

\frac{1}{2}-\sqrt{\frac{2\sigma^{2}\log\left(2/\delta\right)}{d}}\approx 0.488.

This value is higher than the one guaranteed for LCB by Theorem 3.1. Intuitively, a stochastic policy that selects multiple arms can be evaluated more accurately because it averages the rewards experienced over different arms. This consideration suggests optimizing over stochastic policies.
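For reference, the figure 0.488 quoted above is just the Hoeffding bound evaluated at d=10000, \sigma=1/2, and \delta=0.1; a quick numerical check (in Python, for convenience):

```python
import numpy as np

d, sigma, delta = 10_000, 0.5, 0.1
print(0.5 - np.sqrt(2 * sigma**2 * np.log(2 / delta) / d))  # ~= 0.488
```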

By optimizing a lower bound on the performance of the stochastic policies, it should be possible to find one with a provably high return. Such an idea leads to solving an offline linear bandit problem, as follows

\max_{w\in\mathbb{R}^{d},\,w_{i}\geq 0,\,\sum_{i=1}^{d}w_{i}=1}\ \sum_{i=1}^{d}w_{i}\widehat{r}_{i}-c(w) (6)

where c(w) is a suitable confidence interval for the policy w and \widehat{r}_{i} is the empirical reward for the i-th arm defined in (1). While this approach is appealing, enlarging the search space to include all stochastic policies brings an increase in the metric entropy of the function class, and concretely, a \sqrt{d} factor (Abbasi-Yadkori et al., 2011; Rusmevichientong & Tsitsiklis, 2010; Hazan & Karnin, 2016; Jun et al., 2017; Kim et al., 2022) in the confidence intervals c(w) in (6), which negates all gains that arise from considering stochastic policies. In the next section, we propose an algorithm that bypasses the need for such a \sqrt{d} factor by relying on a more careful analysis and optimization procedure.

4 Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST)

In this section, we introduce our algorithm, called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST). At a high level, the algorithm is a policy optimization algorithm based on a trust region centered around a reference policy. The size of the trust region determines the degree of pessimism, and its optimal problem-dependent size can be determined by analyzing the supremum of a problem-dependent empirical process. In the sequel, we describe 1) the decision variables, 2) the trust region optimization program, and 3) some techniques for its practical implementation.

4.1 Decision variables

The algorithm searches over the class of stochastic policies given by the weight vector w=(w_{1},w_{2},\ldots,w_{d})^{\top} of Definition 3.2. Instead of directly optimizing over the weights of the stochastic policy, it is convenient to center w around a reference stochastic policy \widehat{\mu} which is either known to perform well or is easy to estimate. In our theory and experiments, we consider a simple setup and use the behavioral cloning policy weighted by the noise levels \{\sigma_{i}\} if they are known. Namely, we consider

\widehat{\mu}_{i}=\frac{N_{i}/\sigma_{i}^{2}}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}\quad\forall i\in[d]. (7)

When the size of the noise \sigma_{i} is constant across all arms, the policy \widehat{\mu} is the behavioral cloning policy; when \sigma_{i} differs across arms, \widehat{\mu} minimizes the variance of the empirical reward

\widehat{\mu}=\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^{d},w_{i}\geq 0,\sum_{i}w_{i}=1}\operatorname{Var}\left(w^{\top}\widehat{r}\right),

where \widehat{r}=(\widehat{r}_{1},\ldots,\widehat{r}_{d})^{\top} is defined in (1). Using this definition, we take as our decision variable the policy improvement vector

\Delta:=w-\widehat{\mu}. (8)

This preparatory step is key: it allows us to implement relative pessimism, namely pessimism on the improvement (represented by \Delta) rather than on the absolute value of the policy w. Moreover, by restricting the search space to a ball around \widehat{\mu}, one can effectively reduce the metric entropy of the policy class and obtain tighter confidence intervals.
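For concreteness, a minimal sketch of (7) and (8), assuming N and sigma2 are NumPy arrays holding the per-arm counts N_{i} and noise variances \sigma_{i}^{2} (the variable names are ours):

```python
import numpy as np

def reference_policy(N, sigma2):
    """Noise-weighted behavioral cloning policy mu_hat from Equation (7)."""
    weights = N / sigma2
    return weights / weights.sum()

# The decision variable of Equation (8) is the improvement over the reference policy:
# given a candidate stochastic policy w, Delta = w - mu_hat, and w = mu_hat + Delta.
```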

Figure 1: A simple diagram of the trust regions on a 3-dimensional simplex. The central point is the reference (stochastic) policy, while the red ellipses are trust regions around this reference policy.

4.2 Trust region optimization

Trust region.

TRUST (Algorithm 1) returns the stochastic policy \pi_{TRUST}=\widehat{\Delta}+\widehat{\mu}\in\mathbb{R}^{d}, where \widehat{\mu} is the reference policy defined in (7) and \widehat{\Delta} is the policy improvement vector. In order to accurately quantify the effect of the improvement vector \Delta, we constrain it to a trust region \mathsf{C}\left(\varepsilon\right) centered around \widehat{\mu}, where \varepsilon>0 is the radius of the trust region. More concretely, for a given radius \varepsilon>0, the trust region is defined as

\mathsf{C}\left(\varepsilon\right):=\left\{\Delta:\ \Delta_{i}+\widehat{\mu}_{i}\geq 0,\ \left\|\Delta+\widehat{\mu}\right\|_{1}=1,\ \sum_{i=1}^{d}\frac{\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}\leq\varepsilon^{2}\right\}. (9)

The trust region above serves two purposes: it ensures that the policy \widehat{\Delta}+\widehat{\mu} still represents a valid stochastic policy, and it regularizes the policy around the reference policy \widehat{\mu}. We then search for the best policy within \mathsf{C}\left(\varepsilon\right) by solving the optimization program

\widehat{\Delta}_{\varepsilon}:=\mathop{\mathrm{argmax}}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\widehat{r}. (10)

Computationally, the program (10) is a second-order cone program (Alizadeh & Goldfarb, 2003; Boyd & Vandenberghe, 2004), which can be solved efficiently with standard off-the-shelf libraries (Diamond & Boyd, 2016).
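As an illustration, program (10) can be written almost verbatim in CVXPY; the sketch below is ours (not the authors' released implementation) and assumes r_hat, mu_hat, sigma2, and N are NumPy arrays of length d.

```python
import cvxpy as cp
import numpy as np

def solve_trust_region(r_hat, mu_hat, sigma2, N, eps):
    """Solve (10): maximize Delta^T r_hat over the trust region C(eps) in (9)."""
    Delta = cp.Variable(len(r_hat))
    constraints = [
        Delta + mu_hat >= 0,          # Delta + mu_hat is a valid probability vector ...
        cp.sum(Delta) == 0,           # ... whose entries still sum to one
        cp.sum_squares(cp.multiply(np.sqrt(sigma2 / N), Delta)) <= eps**2,  # weighted ball
    ]
    problem = cp.Problem(cp.Maximize(Delta @ r_hat), constraints)
    problem.solve()
    return Delta.value
```

Since the entries of \Delta+\widehat{\mu} are constrained to be nonnegative, the \ell_{1} constraint in (9) reduces to the linear constraint \sum_{i}\Delta_{i}=0 used above.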

When \varepsilon=0, the trust region only includes the vector \Delta=0, and the reference policy \widehat{\mu} is the only feasible solution. When \varepsilon\rightarrow\infty, the search space includes all stochastic policies; in this case, the solution identified by the algorithm coincides with that of the greedy algorithm, which chooses the arm with the highest empirical return. Rather than leaving \varepsilon as a hyper-parameter, in the next section we describe a selection strategy for \varepsilon based on localized Gaussian complexities.

Critical radius.

The choice of \varepsilon is crucial to the performance of our algorithm because it balances optimization with regularization. This consideration suggests that there is an optimal choice for the radius \varepsilon which balances searching over a larger space with keeping the metric entropy of such a space under control. The optimal problem-dependent choice \widehat{\varepsilon}_{*} can be found as the solution of a certain equation involving a problem-dependent supremum of an empirical process. More concretely, let E be the feasible set of \varepsilon (e.g., E=\mathbb{R}^{+}). We define the critical radius as follows.

Definition 4.1 (Critical Radius).

The critical radius \widehat{\varepsilon}_{*} of the trust region is the solution to the program

\widehat{\varepsilon}_{*}=\mathop{\mathrm{argmax}}_{\varepsilon\in E}\left[\widehat{\Delta}_{\varepsilon}^{\top}\widehat{r}-\mathcal{G}\left(\varepsilon\right)\right]. (11)

This program involves a quantile \mathcal{G}\left(\varepsilon\right) of the localized Gaussian complexity of the stochastic policies identified by the trust region. Mathematically, this is defined as follows.

Definition 4.2 (Quantile of the supremum of Gaussian process).

We denote the noise vector as \eta=\widehat{r}-r, which by our assumption is coordinate-wise independent and satisfies \eta_{i}\sim\mathcal{N}\left(0,\sigma_{i}^{2}/N(a_{i})\right). We define \mathcal{G}\left(\varepsilon\right) as the smallest quantity such that, with probability at least 1-\delta, for any \varepsilon\in E, it holds that

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\varepsilon\right). (12)

In essence, \mathcal{G}\left(\varepsilon\right) is an upper quantile of the supremum of the Gaussian process \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta which holds uniformly for every \varepsilon\in E. We also remark that this quantity depends on the feasible set E and the trust region \mathsf{C}\left(\varepsilon\right), and hence is highly problem-dependent.

The critical radius plays a crucial role: it is the radius of the trust region that optimally balances optimization with uncertainty. Enlarging \varepsilon enlarges the search space for \Delta, enabling the discovery of policies with potentially higher return. However, this also brings an increase in the metric entropy of the policy class encoded by \mathcal{G}\left(\varepsilon\right), which means that each policy can be estimated less accurately. The critical radius represents the optimal tradeoff between these two forces. The final improvement vector that TRUST returns, which we denote as \widehat{\Delta}_{*}, is determined by solving (10) with the critical radius \widehat{\varepsilon}_{*}. In mathematical terms, we express this as

\widehat{\Delta}_{*}:=\mathop{\mathrm{argmax}}_{\Delta\in\mathsf{C}\left(\widehat{\varepsilon}_{*}\right)}\Delta^{\top}\widehat{r}, (13)

where \widehat{\varepsilon}_{*} is defined in (11).

Algorithm 1 Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST)
  Input: Offline dataset \mathcal{D}, failure probability \delta, and the candidate set E of trust region radii (in practice, chosen as (14)).
  1. For each \varepsilon\in E, compute \widehat{\Delta}_{\varepsilon} from (10).
  2. For each \varepsilon\in E, estimate \mathcal{G}\left(\varepsilon\right) via the Monte-Carlo method (see Algorithm 2 in Appendix B).
  3. Solve (11) to obtain the critical radius \widehat{\varepsilon}_{*}.
  4. Compute the optimal improvement vector in \mathsf{C}\left(\widehat{\varepsilon}_{*}\right) via (13), denoted as \widehat{\Delta}_{*}.
  5. Return the stochastic policy \pi_{TRUST}=\widehat{\mu}+\widehat{\Delta}_{*}.

Implementation details

Since it can be difficult to solve (11) over a continuum of values \varepsilon\in E=\mathbb{R}^{+}, we use a discretization argument by considering the following candidate subset:

E=\left\{\varepsilon_{0},\frac{\varepsilon_{0}}{\alpha},\ldots,\frac{\varepsilon_{0}}{\alpha^{|E|-1}}\right\}, (14)

where \alpha>1 is the decaying rate and \varepsilon_{0} is the largest possible radius, namely the maximal weighted distance from the reference policy to any vertex of the simplex. Mathematically, this is defined as

\varepsilon_{0}=\max_{i\in[d]}\sqrt{\sum_{j\neq i}\frac{\widehat{\mu}_{j}^{2}\sigma_{j}^{2}}{N_{j}}+\frac{\left(1-\widehat{\mu}_{i}\right)^{2}\sigma_{i}^{2}}{N_{i}}}.

Our analysis leading to Theorem 5.1 takes this discretization into account.

In line 2 of Algorithm 1, the algorithm estimates the quantile \mathcal{G}\left(\varepsilon\right) of the supremum of the localized Gaussian process in Definition 4.2, and then chooses the \varepsilon that maximizes the objective function in (11). Although \mathcal{G}\left(\varepsilon\right) can be upper bounded analytically, in our experiments we aim to obtain tighter guarantees and so we estimate it via Monte-Carlo. This can be achieved by 1) sampling independent noise vectors \eta, 2) solving \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta, and 3) estimating the quantile via order statistics. More details can be found in Appendix B.
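As a hedged sketch of this Monte-Carlo step (the exact Algorithm 2 in Appendix B may differ in its details), one can reuse the cone-program solver from the earlier sketch with a sampled noise vector in place of the empirical rewards:

```python
import numpy as np

def estimate_G(eps, mu_hat, sigma2, N, delta, n_mc=200, rng=None):
    """Monte-Carlo estimate of the quantile G(eps) from Definition 4.2."""
    rng = np.random.default_rng() if rng is None else rng
    sups = []
    for _ in range(n_mc):
        eta = rng.normal(0.0, np.sqrt(sigma2 / N))               # eta_i ~ N(0, sigma_i^2 / N_i)
        Delta = solve_trust_region(eta, mu_hat, sigma2, N, eps)  # maximizes Delta^T eta over C(eps)
        sups.append(float(Delta @ eta))
    return float(np.quantile(sups, 1.0 - delta))                 # empirical (1 - delta)-quantile
```

To make the estimate hold uniformly over the grid E, one would in practice split the failure probability across the |E| candidate radii.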

In summary, our practical algorithm can be seen as solving the optimization problem

(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*})=\mathop{\arg\max}_{\varepsilon\in E,\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\Big\{\Delta^{\top}\widehat{r}-\widehat{\mathcal{G}}(\varepsilon)\Big\}

where \widehat{r}\in\mathbb{R}^{d} is the empirical reward vector with \widehat{r}_{i} defined in (1). Here, \widehat{\mathcal{G}}(\varepsilon) is computed according to the Monte-Carlo method defined in Algorithm 2 in Appendix B and E is the candidate set of radii defined in (14). The objective thus balances the empirical reward of a stochastic policy against the localized metric entropy that it induces.
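Putting the pieces together, the practical algorithm is the short loop below, a sketch built on the two previous snippets (solve_trust_region and estimate_G), with E the geometric grid of radii from (14).

```python
import numpy as np

def trust(r_hat, mu_hat, sigma2, N, E, delta):
    """Sketch of Algorithm 1: return the stochastic policy mu_hat + Delta_star."""
    best_obj, best_Delta = -np.inf, np.zeros_like(r_hat)
    for eps in E:
        Delta_eps = solve_trust_region(r_hat, mu_hat, sigma2, N, eps)  # step 1, Eq. (10)
        G_eps = estimate_G(eps, mu_hat, sigma2, N, delta / len(E))     # step 2, union bound over E
        obj = float(Delta_eps @ r_hat) - G_eps                         # objective in (11)
        if obj > best_obj:
            best_obj, best_Delta = obj, Delta_eps                      # steps 3 and 4
    return mu_hat + best_Delta                                         # step 5: pi_TRUST
```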

5 Theoretical guarantees

In this section, we provide some theoretical guarantees for the policy \pi_{TRUST} returned by TRUST.

5.1 Problem-dependent analysis

We present 1) an improvement over the reference policy \widehat{\mu}, 2) a sub-optimality gap with respect to any comparator policy \pi, and 3) an actionable lower bound on the performance of the output policy.

Given a stochastic policy \pi, we let V^{\pi}=\mathbb{E}_{a\sim\pi}[r(a)] denote its value function. Furthermore, we denote a comparator policy \pi by a triple (\varepsilon,\Delta,\pi) such that \varepsilon>0, \Delta\in\mathsf{C}\left(\varepsilon\right), and \pi=\widehat{\mu}+\Delta.

Theorem 5.1 (Main theorem).

TRUST has the following properties.

  1. With probability at least 1-\delta, the improvement over the behavioral policy is at least

     V^{\pi_{TRUST}}-V^{\widehat{\mu}}\geq\sup_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\left[\Delta^{\top}r-2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)\right], (15)

     where \lceil\varepsilon\rceil=\inf\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}\geq\varepsilon\}.

  2. With probability at least 1-\delta, for any stochastic comparator policy (\varepsilon,\Delta,\pi), the sub-optimality of the output policy can be upper bounded as

     V^{\pi}-V^{\pi_{TRUST}}\leq 2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). (16)

  3. With probability at least 1-2\delta, the data-dependent lower bound on V^{\pi_{TRUST}} satisfies

     V^{\pi_{TRUST}}\geq\pi_{TRUST}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}}, (17)

     where \pi_{TRUST}=\widehat{\mu}+\widehat{\Delta}_{*} is the policy output by Algorithm 1.

Our guarantees are problem-dependent through the Gaussian process quantile \mathcal{G}\left(\cdot\right); in Section 6 we show how they can be instantiated on an actual problem, highlighting the tightness of the analysis.

Equation 15 highlights the improvement with respect to the behavioral policy. It is expressed as a trade-off between maximizing the improvement \Delta^{\top}r and minimizing its uncertainty \mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). The presence of the \sup_{\varepsilon} indicates that TRUST achieves an optimal balance between these two factors. The state-of-the-art guarantees that we are aware of highlight a trade-off between value and variance (Jin et al., 2021; Min et al., 2021). The novelty of our result lies in the fact that TRUST optimally balances the uncertainty implicitly as a function of the ‘coverage’ as well as the metric entropy of the search space. That is, TRUST selects the most appropriate search space by trading off its metric entropy with the quality of the policies that it contains.

The right-hand side in Equation 17 gives actionable statistical guarantees on the quality of the final policy and it can be fully computed from the available dataset; we give an example of the tightness of the analysis in Section 6.

Localized Gaussian complexity \mathcal{G}\left(\varepsilon\right).

In Theorem 5.1, we upper bound the suboptimality V^{\pi}-V^{\pi_{TRUST}} via the localized quantity \mathcal{G}\left(\cdot\right). It is the quantile of the supremum of a Gaussian process, which can be efficiently estimated via a Monte Carlo method (e.g., see Appendix B) or bounded through concentration around its expectation. The expected value of \mathcal{G}\left(\varepsilon\right) is a localized Gaussian width, a concept well-established in statistical learning theory (Bellec, 2019; Wei et al., 2020; Wainwright, 2019). More concretely, this is the localized Gaussian width for an affine simplex:

\mathbb{E}\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right]=\mathbb{E}\left[\sup_{\mathbb{S}^{d-1}\cap\left\{\Delta:\left\|\Delta\right\|_{\Sigma}\leq\varepsilon\right\}}\Delta^{\top}\eta\right],

where \mathbb{S}^{d-1} denotes the simplex in \mathbb{R}^{d} and \Sigma:=\operatorname{diag}\left(\frac{\sigma_{1}^{2}}{N_{1}},\frac{\sigma_{2}^{2}}{N_{2}},\ldots,\frac{\sigma_{d}^{2}}{N_{d}}\right) is the weighting matrix. Moreover, this localized Gaussian width can be upper bounded as

\mathbb{E}\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right]\lesssim\min\left\{\sqrt{\log\left(d\varepsilon^{2}\right)},\varepsilon\sqrt{d}\right\}. (18)
Figure 2: The upper bound for the localized Gaussian width over a shifted simplex in d=10000 dimensions. The shifted simplex is \left\{\Delta\in\mathbb{R}^{d}:\sum_{i=1}^{d}\Delta_{i}=0\right\}. The two-stage upper bound we plot is based on Theorem 1 in (Bellec, 2019).

To make this clearer, we plot this upper bound for the localized Gaussian width in Figure 2. The rate in (18) matches the minimax lower bound up to a universal constant (Gordon et al., 2007; Lecué & Mendelson, 2013; Bellec, 2019). To see the implication of the upper bound (18), let us consider a simple example where the logging policy is uniform over all arms. We denote the optimal arm as a^{*} and define

C^{*}:=\frac{1}{\mu(a^{*})}

as the concentrability coefficient. By applying (18) and some concentration techniques (see Wainwright, 2019), we can perform a fine-grained analysis of the suboptimality incurred by \pi_{TRUST}. Specifically, with probability at least 1-\delta, one has

V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}. (19)

Note that the high-probability upper bound here is minimax optimal up to constant and logarithmic factors (Rashidinejad et al., 2021) when C^{*}\geq 2. Moreover, this example of a uniform logging policy is an instance where LCB achieves minimax sub-optimality (up to constant and log factors) (see the proof of Theorem 2 in Rashidinejad et al., 2021). In this case, TRUST achieves the same level of guarantees for the suboptimality of the output policy. We also empirically demonstrate the effectiveness of TRUST in Section 6. The full theorem with the fine-grained analysis of the suboptimality and its proof are deferred to Appendix C.

5.2 Proof of Theorem 5.1

To prove Theorem 5.1, we first define the following event:

\mathcal{E}:=\left\{\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\varepsilon\right)\quad\forall\varepsilon\in E\right\}. (20)

When \mathcal{E} happens, the quantity \mathcal{G}\left(\varepsilon\right) upper bounds the supremum of the Gaussian process we care about, and hence we can effectively upper bound the uncertainty of any stochastic policy using \mathcal{G}\left(\cdot\right). It follows from Definition 4.2 that the event \mathcal{E} happens with probability at least 1-\delta.

We can now prove all the claims in the theorem, starting from the first and the second. A comparator policy \pi identifies a weight vector w, an improvement vector \Delta, and a radius \varepsilon such that w=\widehat{\mu}+\Delta and \Delta\in\mathsf{C}\left(\varepsilon\right). In fact, we can always take \varepsilon to be the minimal value such that \Delta\in\mathsf{C}\left(\varepsilon\right). The first two claims can be proved by establishing that, with probability at least 1-\delta,

w^{\top}r-\pi^{\top}_{TRUST}r=\Delta^{\top}r-\widehat{\Delta}_{*}^{\top}r\leq 2\mathcal{G}\left(\lceil\varepsilon\rceil\right), (21)

where \pi_{TRUST} is the policy weight returned by Algorithm 1. In order to show Equation 21, we can decompose \widehat{\Delta}_{*}^{\top}r using the fact that \widehat{\varepsilon}_{*}\in E and \widehat{\Delta}_{*}\in\mathsf{C}\left(\widehat{\varepsilon}_{*}\right) to obtain

\widehat{\Delta}_{*}^{\top}r=\widehat{\Delta}_{*}^{\top}\widehat{r}-\widehat{\Delta}_{*}^{\top}\eta\geq\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)=\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\left\lceil\widehat{\varepsilon}_{*}\right\rceil\right). (22)

To further lower bound the RHS above, we have the following lemma, which shows that Algorithm 1 can be written in an equivalent way.

Lemma 5.2.

The output of Algorithm 1 satisfies

\left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right)=\mathop{\mathrm{argmax}}_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\big[\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big]. (23)

This shows that Algorithm 1 optimizes an objective function which consists of a signal term (i.e., \Delta^{\top}\widehat{r}) minus a noise term (i.e., \mathcal{G}\left(\lceil\varepsilon\rceil\right)). Applying this lemma to (22), we obtain

\widehat{\Delta}_{*}^{\top}r\geq\Delta^{\top}\widehat{r}-\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)=\Delta^{\top}r+\Delta^{\top}\eta-\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right). (24)

After recalling that under \mathcal{E}

\Delta^{\top}\eta\leq\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\sup_{\Delta\in\mathsf{C}\left(\lceil\varepsilon\rceil\right)}\Delta^{\top}\eta\leq\mathcal{G}\left(\lceil\varepsilon\rceil\right), (25)

plugging (25) back into (24) concludes the bound in Equation 21, which proves our second claim (Equation 16). Rearranging the terms in Equation 21 and taking the supremum over all comparator policies, we obtain

\widehat{\Delta}_{*}^{\top}r\geq\sup_{\varepsilon\leq\varepsilon_{0},\,\Delta\in\mathsf{C}\left(\varepsilon\right)}\left[\Delta^{\top}r-2\mathcal{G}\left(\left\lceil\varepsilon\right\rceil\right)\right], (26)

which proves the first claim, since V^{\pi_{TRUST}}-V^{\widehat{\mu}}=\widehat{\Delta}_{*}^{\top}r.

In order to prove the last claim, it suffices to lower bound the value of the reference policy \widehat{\mu}. From (7), we have \widehat{\mu}^{\top}\left(\widehat{r}-r\right)\sim\mathcal{N}\left(0,1/\left[\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}\right]\right), which implies that with probability at least 1-\delta,

\widehat{\mu}^{\top}\left(\widehat{r}-r\right)\leq\sqrt{\frac{2\log(1/\delta)}{\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}}} (27)

by the standard Hoeffding inequality (e.g., Prop 2.5 in Wainwright (2019)). Combining (22) and (27), we obtain

\pi_{TRUST}^{\top}r=\widehat{\mu}^{\top}r+\widehat{\Delta}_{*}^{\top}r\geq\widehat{\mu}^{\top}\widehat{r}+\widehat{\mu}^{\top}(r-\widehat{r})+\widehat{\Delta}_{*}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)\geq\pi_{TRUST}^{\top}\widehat{r}-\mathcal{G}\left(\widehat{\varepsilon}_{*}\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{i=1}^{d}N_{i}/\sigma_{i}^{2}}},

where the first inequality follows from (22) and the second from (27), with probability at least 1-2\delta. This concludes the proof of the third claim.

Augmentation with LCB

Compared to classical LCB, Algorithm 1 considers a much larger search space, which encompasses not only the vertices of the simplex but also its interior points. This enlargement of the search space offers great advantages, but it also comes at the price of larger uncertainty, especially when the radius \varepsilon is large. LCB accounts for uncertainty by uniformly upper bounding the noise at each vertex, whereas in our case a uniform upper bound over a sub-region of the shifted simplex must be considered. When \varepsilon is large, the trust region method induces larger uncertainty and tends to select a more stochastic policy than LCB, and hence can achieve worse performance. To determine the most effective final policy, one can always combine TRUST (Algorithm 1) with LCB and select the better of the two based on the lower bounds they induce. By comparing the lower bounds of LCB and TRUST, the value of the final output policy is guaranteed to match the larger of the two lower bounds with high probability. We defer the detailed algorithm and its theoretical guarantees to Appendix E.

6 Experiments

We present simulated experiments where we show the failure of LCB and the strong performance of TRUST. Moreover, we also present an application of TRUST to offline reinforcement learning.

6.1 Simulated experiments

A data-starved MAB

We consider a data-starved MAB problem with d=10000 arms denoted by a_{i}, i\in[d]. The reward distributions are

r(a_{i})\sim\begin{cases}\mathsf{Uniform}(0.5,1.5)&i\leq 5000,\\ \mathcal{N}\left(0,1/4\right)&i>5000.\end{cases} (28)

Namely, the good arms return rewards drawn from a uniform distribution over [0.5,1.5] with unit mean, while the bad arms return Gaussian rewards with zero mean. We consider a dataset that contains a single sample for each of these arms.
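A short sketch of this data-generating process (28), with one observed reward per arm and variable names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
good = rng.uniform(0.5, 1.5, size=d // 2)   # good arms: Uniform(0.5, 1.5), mean 1
bad = rng.normal(0.0, 0.5, size=d // 2)     # bad arms: N(0, 1/4), mean 0
r_hat = np.concatenate([good, bad])         # one observed reward per arm
N = np.ones(d)                              # a single pull of every arm
sigma2 = np.full(d, 0.25)                   # known noise level sigma_i = 1/2
```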

We test Algorithm 1 on this MAB instance with a fixed noise level \sigma_{i}=1/2. We set the reference policy \widehat{\mu} to be the behavioral cloning policy, which coincides with the uniform policy. We also test LCB and the greedy method, which simply chooses the arm with the highest empirical reward.

In this example, the greedy algorithm fails because it erroneously selects an arm with a reward >1.5, but such a reward can only originate from a normal distribution with mean zero. Although LCB incorporates the principle of pessimism under uncertainty, it selects an arm with average return equal to zero; its performance lower bound given by the confidence intervals is -1.5, which is essentially vacuous. The behavioral cloning policy performs better, because it selects an arm uniformly at random, achieving a score of 0.5.

Behavior Policy | Greedy Method | LCB | LCB Lower Bound
0.5 | 0 | 0 | -1.5
Max Reward | Policy Improvement by TRUST | TRUST | TRUST Lower Bound
1.0 | 0.42 | 0.92 | 0.6
Table 1: Results of the simulated experiment on a 10000-arm bandit. The reward distribution is described in (28). The offline dataset includes one sample for each arm. The greedy method chooses the arm with the highest empirical reward. LCB selects an arm based on (3). The lower bounds for LCB and TRUST follow (2) and (17), respectively.

Algorithm 1 achieves the best performance: the value of the policy that it identifies is 0.92, which almost matches the optimal policy. The lower bound on its performance, computed by instantiating the RHS of (17), is around 0.6, a guarantee much tighter than that for LCB.

In order to gain intuition on the learning mechanics of TRUST, in Figure 3 we progressively enlarge the radius of the trust region from zero to the largest possible radius (on the x axis) and plot the value of the policy that maximizes the linear objective \Delta^{\top}\widehat{r},\;\Delta\in\mathsf{C}\left(\varepsilon\right) for each value of the radius \varepsilon. Note that we rescale the range of \varepsilon so that the largest possible \varepsilon equals one. In the same figure we also plot the lower bound computed with the help of equation (17).

Figure 3: Policy values and their lower bounds for a data-starved MAB instance with 10000 arms whose reward distribution is described in (28).

Initially, the value of the policy increases because the optimization in (10) is performed over a larger set of stochastic policies. However, when \varepsilon approaches one, all stochastic policies are included in the optimization program. In this case, TRUST greedily selects the arm with the highest empirical reward, which is from a normal distribution with mean zero. The optimal balance between the size of the policy search space and its metric entropy is given by the critical radius \varepsilon=0.0116\varepsilon_{0}, which is the point where the lower bound is the highest.

A more general data-starved MAB

Besides the data-starved MAB constructed above, we also show that on more general MABs the performance of TRUST is on par with LCB, while TRUST enjoys a much tighter statistical guarantee, i.e., a larger lower bound on the value of the returned policy. We ran experiments on a d=1000-arm MAB where the reward distribution is

r(a_{i})\sim\mathcal{N}\left(\frac{i}{1000},\frac{1}{4}\right),\quad\forall i\in[d]. (29)

We ran TRUST (Algorithm 1) and LCB over 8 different random seeds. With a single sample for each arm, TRUST obtains a score similar to LCB. However, TRUST gives a much tighter statistical guarantee than LCB, in the sense that the lower bound output by TRUST is much higher than that output by LCB, so TRUST returns a policy that is guaranteed to achieve a higher value. Moreover, we found the policies output by TRUST to be much more stable than those from LCB. Across all runs, while the lowest value of the arm chosen by LCB is around 0.24, all policies returned by TRUST have values above 0.65 with a much smaller variance, as shown in Table 2.

 | LCB | TRUST
mean reward | 0.718 | 0.725
mean lower bound | 0.156 | 0.544
variance | 0.265 | 0.038
minimal reward | 0.239 | 0.658
Table 2: Comparison between LCB and TRUST (Algorithm 1) on a data-starved MAB with 1000 arms whose reward distribution follows (29). Both methods are repeated on 8 random seeds.

6.2 Offline reinforcement learning

In this section, we apply Algorithm 1 to the offline reinforcement learning (RL) setting under the assumption that the logging policies which generated the dataset are accessible. To be clear, our goal is not to exceed the performance of state-of-the-art deep RL algorithms—our algorithm is designed for bandit problems—but rather to illustrate the usefulness of our algorithm and theory.

Since our algorithm is designed for bandit problems, in order to apply it to the sequential setting, we map MDPs to MABs. Each policy in the MDP maps to an action in the MAB, and each trajectory return in the MDP maps to an experienced return in the MAB setting. Notice that this reduction disregards the sequential aspect of the problem and thus our algorithm cannot perform ‘trajectory stitching’ (Levine et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021). Furthermore, it can only be applied under the assumption that the logging policies are known.
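A minimal sketch of this reduction, under the assumption that the logged trajectories are grouped by the logging policy that produced them (the input format returns_by_policy is illustrative, not a D4RL field):

```python
import numpy as np

def mdp_to_mab(returns_by_policy):
    """Map logged returns, grouped per logging policy, to a MAB dataset (one arm per policy)."""
    r_hat = np.array([np.mean(rets) for rets in returns_by_policy])  # empirical return per policy
    N = np.array([len(rets) for rets in returns_by_policy])          # trajectories per policy
    sigma2 = np.array([np.var(rets) if len(rets) > 1 else 1.0        # crude variance proxy when
                       for rets in returns_by_policy])               # only one trajectory exists
    return r_hat, N, sigma2
```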

Specifically, we consider a setting where there are multiple known logging policies, each generating a few trajectories. We test Algorithm 1 on some selected environments from the D4RL dataset (Fu et al., 2020) and compare its performance to the Conservative Q-Learning (CQL) algorithm (Kumar et al., 2020), a popular and strong baseline for offline RL.

Since the D4RL dataset does not directly include the logging policies, we generate new datasets by running Soft Actor Critic (SAC) (Haarnoja et al., 2018) for 1000 episodes. We store 100 intermediate policies generated by SAC, and roll out 1 trajectory from each policy.

We use the default hyper-parameters for CQL (we use the codebase and default hyper-parameters from https://github.com/young-geng/CQL). We report the unnormalized scores in Table 3, each averaged over 4 random seeds. Algorithm 1 achieves a score on par with or higher than that of CQL, especially when the offline dataset is of poor quality and when there are very few trajectories—or just one—generated from each logging policy. Notice that while CQL is not guaranteed to outperform the behavioral policy, TRUST is backed by Theorem 5.1. Moreover, while the performance of CQL relies heavily on the choice of hyper-parameters, TRUST is essentially hyper-parameter free.

Environment | Dataset | CQL | TRUST
Hopper | 1-traj-low | 499 | 999
Hopper | 1-traj-high | 2606 | 3437
Ant | 1-traj-low | 748 | 763
Ant | 1-traj-high | 4115 | 4488
Walker2d | 1-traj-low | 311 | 346
Walker2d | 1-traj-high | 4093 | 4097
HalfCheetah | 1-traj-low | 5775 | 5473
HalfCheetah | 1-traj-high | 9067 | 10380
Table 3: Unnormalized scores of CQL and TRUST on 4 environments from D4RL. In the 1-traj-low case, we take the first 100 policies from the SAC run. In the 1-traj-high case, we take the (10x+1)-th policy for x\in[100]. We sample one trajectory from each selected policy in all experiments.

Additionally, while CQL took around 16-24 hours on one NVIDIA GeForce RTX 2080 Ti, TRUST only took 0.5-1 hours on 10 CPUs. The experimental details are included in Appendix F.

7 Conclusion

In this paper we make a substantial contribution towards sample efficient decision making, by designing a data-efficient policy optimization algorithm that leverages offline data for the MAB setting. The key intuition of this work is to search over stochastic policies, which can be estimated more easily than deterministic ones.

The design of our algorithm is enabled by a number of key insights, such as the use of the localized Gaussian complexity, which leads to the definition of the critical radius for the trust region.

We believe that these concepts can be used more broadly to help design truly sample efficient algorithms, which can in turn enable the application of decision making to new settings where a high sample efficiency is critical.

8 Impact Statement

This paper presents a work whose goal is to advance the field of decision making under uncertainty. Since our work is primarily theoretical, we do not anticipate negative societal consequences.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • Alizadeh & Goldfarb (2003) Alizadeh, F. and Goldfarb, D. Second-order cone programming. Mathematical programming, 95(1):3–51, 2003.
  • Ameko et al. (2020) Ameko, M. K., Beltzer, M. L., Cai, L., Boukhechba, M., Teachman, B. A., and Barnes, L. E. Offline contextual multi-armed bandits for mobile health interventions: A case study on emotion regulation. In Proceedings of the 14th ACM Conference on Recommender Systems, pp.  249–258, 2020.
  • Audibert et al. (2009) Audibert, J.-Y., Bubeck, S., et al. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pp.  1–122, 2009.
  • Audibert et al. (2010) Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, pp.  41–53, 2010.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Bartlett & Mendelson (2002) Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
  • Bassen et al. (2020) Bassen, J., Balaji, B., Schaarschmidt, M., Thille, C., Painter, J., Zimmaro, D., Games, A., Fast, E., and Mitchell, J. C. Reinforcement learning for the adaptive scheduling of educational activities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp.  1–12, 2020.
  • Bellec (2019) Bellec, P. C. Localized gaussian width of m-convex hulls with applications to lasso and convex aggregation. 2019.
  • Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chen et al. (2016) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.
  • Cheng et al. (2022) Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. Adversarially trained actor critic for offline reinforcement learning. arXiv preprint arXiv:2202.02446, 2022.
  • Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015.
  • Dai et al. (2022) Dai, Y., Wang, R., and Du, S. S. Variance-aware sparse linear bandits. arXiv preprint arXiv:2205.13450, 2022.
  • Degenne & Perchet (2016) Degenne, R. and Perchet, V. Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, pp. 1587–1595. PMLR, 2016.
  • Diamond & Boyd (2016) Diamond, S. and Boyd, S. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
  • Duan & Wainwright (2022) Duan, Y. and Wainwright, M. J. Policy evaluation from a single path: Multi-step methods, mixing and mis-specification. arXiv preprint arXiv:2211.03899, 2022.
  • Duan & Wainwright (2023) Duan, Y. and Wainwright, M. J. A finite-sample analysis of multi-step temporal difference estimates. In Learning for Dynamics and Control Conference, pp. 612–624. PMLR, 2023.
  • Duan et al. (2021) Duan, Y., Wang, M., and Wainwright, M. J. Optimal policy evaluation using kernel-based temporal difference methods. arXiv preprint arXiv:2109.12002, 2021.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto et al. (2019) Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062. PMLR, 2019.
  • Garivier & Kaufmann (2016) Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.  998–1027. PMLR, 2016.
  • Geer (2000) Geer, S. A. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
  • Gordon et al. (2007) Gordon, Y., Litvak, A. E., Mendelson, S., and Pajor, A. Gaussian averages of interpolated bodies and applications to approximate reconstruction. Journal of Approximation Theory, 149(1):59–73, 2007.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Hao et al. (2021) Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvári, C., and Wang, M. Bootstrapping statistical inference for off-policy evaluation. arXiv preprint arXiv:2102.03607, 2021.
  • Hazan & Karnin (2016) Hazan, E. and Karnin, Z. Volumetric spanners: an efficient exploration basis for learning. Journal of Machine Learning Research, 2016.
  • Hester & Stone (2013) Hester, T. and Stone, P. Texplore: real-time sample-efficient reinforcement learning for robots. Machine learning, 90:385–429, 2013.
  • Jin et al. (2020) Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline rl? arXiv preprint arXiv:2012.15085, 2020.
  • Jin et al. (2021) Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021.
  • Jun et al. (2017) Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30, 2017.
  • Kim et al. (2022) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. Advances in Neural Information Processing Systems, 35:1060–1072, 2022.
  • Koltchinskii (2001) Koltchinskii, V. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  • Koltchinskii (2006) Koltchinskii, V. Local rademacher complexities and oracle inequalities in risk minimization. 2006.
  • Kostrikov et al. (2021) Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
  • Kveton et al. (2015) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp.  535–543. PMLR, 2015.
  • Lai (1987) Lai, T. L. Adaptive treatment allocation and the multi-armed bandit problem. The annals of statistics, pp.  1091–1114, 1987.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Langford & Zhang (2007) Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in neural information processing systems, 20, 2007.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
  • Lecué & Mendelson (2013) Lecué, G. and Mendelson, S. Learning subgaussian classes: Upper and minimax bounds. arXiv preprint arXiv:1305.4825, 2013.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2022) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275, 2022.
  • Li et al. (2023) Li, Z., Yang, Z., and Wang, M. Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438, 2023.
  • Liu et al. (2021) Liu, R., Nageotte, F., Zanne, P., de Mathelin, M., and Dresp-Langley, B. Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review. Robotics, 10(1):22, 2021.
  • Liu & Ročková (2023) Liu, Y. and Ročková, V. Variable selection via thompson sampling. Journal of the American Statistical Association, 118(541):287–304, 2023.
  • Liu et al. (2017) Liu, Y., Logan, B., Liu, N., Xu, Z., Tang, J., and Wang, Y. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE international conference on healthcare informatics (ICHI), pp.  380–385. IEEE, 2017.
  • Ménard et al. (2021) Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.
  • Min et al. (2021) Min, Y., Wang, T., Zhou, D., and Gu, Q. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34:7598–7610, 2021.
  • Mou et al. (2022) Mou, W., Wainwright, M. J., and Bartlett, P. L. Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency. arXiv preprint arXiv:2209.13075, 2022.
  • Nachum et al. (2019) Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
  • Nakkiran et al. (2020) Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap framework: Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127, 2020.
  • Peng et al. (2019) Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Rashidinejad et al. (2021) Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.
  • Ruan et al. (2023) Ruan, S., Nie, A., Steenbergen, W., He, J., Zhang, J., Guo, M., Liu, Y., Nguyen, K. D., Wang, C. Y., Ying, R., et al. Reinforcement learning tutor better supported lower performers in a math task. arXiv preprint arXiv:2304.04933, 2023.
  • Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Russo (2016) Russo, D. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pp.  1417–1418. PMLR, 2016.
  • Si et al. (2020) Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pp. 8884–8894. PMLR, 2020.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.
  • Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388. PMLR, 2015.
  • Vershynin (2020) Vershynin, R. High-dimensional probability. University of California, Irvine, 2020.
  • Wainwright (2019) Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2022) Wang, K., Zhao, H., Luo, X., Ren, K., Zhang, W., and Li, D. Bootstrapped transformer for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:34748–34761, 2022.
  • Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114–5122. PMLR, 2018.
  • Wei et al. (2020) Wei, Y., Fang, B., and Wainwright, M. J. From gauss to kolmogorov: Localized measures of complexity for ellipses. 2020.
  • Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xie & Jiang (2020) Xie, T. and Jiang, N. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020.
  • Xie et al. (2021) Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021.
  • Xiong et al. (2022) Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L., and Zhang, T. Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512, 2022.
  • Yin & Wang (2021) Yin, M. and Wang, Y.-X. Towards instance-optimal offline reinforcement learning with pessimism. arXiv preprint arXiv:2110.08695, 2021.
  • Yin et al. (2022) Yin, M., Duan, Y., Wang, M., and Wang, Y.-X. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
  • Zanette et al. (2019) Zanette, A., Brunskill, E., and Kochenderfer, M. J. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, 2019.
  • Zanette et al. (2020) Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. Provably efficient reward-agnostic navigation with linear value iteration. In Advances in Neural Information Processing Systems, 2020.
  • Zanette et al. (2021) Zanette, A., Wainwright, M. J., and Brunskill, E. Provable benefits of actor-critic methods for offline reinforcement learning. arXiv preprint arXiv:2108.08812, 2021.
  • Zhang et al. (2022) Zhang, R., Zhang, X., Ni, C., and Wang, M. Off-policy fitted q-evaluation with differentiable function approximators: Z-estimation and inference theory. In International Conference on Machine Learning, pp. 26713–26749. PMLR, 2022.
  • Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Improved variance-aware confidence sets for linear bandits and linear mixture mdp. Advances in Neural Information Processing Systems, 34:4342–4355, 2021.

Appendix A One-sample case with strong signals

In this section, we give a simple example of the one-sample-per-arm case. It can be viewed as a special case of a data-starved MAB to which Theorem 5.1 applies and yields a non-trivial guarantee. Specifically, consider an MAB with 2d arms. Assume the true mean reward vector is r=(1,1,...,1,0,0,...,0)^{\top} and the noise vector is \eta\sim\mathcal{N}(0,\sigma^{2}I_{2d}). That is, the first d arms have rewards independently sampled from \mathcal{N}(1,\sigma^{2}) and the rewards for the other d arms are independently sampled from \mathcal{N}(0,\sigma^{2}). The stochastic reference policy \widehat{\mu} is set to the uniform policy.

We apply Algorithm 1 to this MAB instance. In the next proposition, we show that for a specific choice of \varepsilon, the optimal improvement within \mathsf{C}\left(\varepsilon\right) (denoted by \widehat{\Delta}_{\varepsilon} in (10)) achieves a constant improvement in reward.

Proposition A.1.

Assume r=(1,1,...,1,0,0,...,0)^{\top} and noise \eta\sim\mathcal{N}(0,\sigma^{2}I_{2d}). For any 0\leq\varepsilon\leq\frac{1}{\sqrt{d}}, with probability at least 1-\delta, the improvement of the policy value is lower bounded by

\widehat{\Delta}_{\varepsilon}^{\top}r\geq\varepsilon\sqrt{d}\left[\frac{1}{2}-\sigma\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right)\right],

where the improvement vector \widehat{\Delta}_{\varepsilon} in \mathsf{C}\left(\varepsilon\right) is defined in (10). Therefore, for \varepsilon=\frac{1}{\sqrt{d}} and d\geq 8\log\left(2/\delta\right), with probability at least 1-\delta, we obtain a constant policy improvement

\widehat{\Delta}_{\varepsilon}^{\top}r\geq\frac{1}{2}-2\sigma.
Proof.

We define the optimal improvement vector as

\Delta^{*}_{\varepsilon}:=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r.

Then, from the definition of \widehat{\Delta}_{\varepsilon}, we have

\widehat{\Delta}_{\varepsilon}^{\top}r=\widehat{\Delta}_{\varepsilon}^{\top}\widehat{r}-\widehat{\Delta}_{\varepsilon}^{\top}\eta\geq\left(\Delta^{*}_{\varepsilon}\right)^{\top}\widehat{r}-\widehat{\Delta}_{\varepsilon}^{\top}\eta=\left(\Delta^{*}_{\varepsilon}\right)^{\top}r+\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta-\widehat{\Delta}_{\varepsilon}^{\top}\eta\geq\underbrace{\left(\Delta^{*}_{\varepsilon}\right)^{\top}r}_{\text{signal}}-\underbrace{\left[\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\right]}_{\text{noise}}. (30)

In order to lower bound the policy value improvement, it suffices to lower bound the signal part and upper bound the noise part. We denote by \mathcal{H}=\{x\in\mathbb{R}^{d}:\sum_{i=1}^{d}x_{i}=0\} a hyperplane in \mathbb{R}^{d}. To deal with the signal part, it suffices to notice that

\mathsf{C}\left(\varepsilon\right)\subset\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon).

We denote by r_{\parallel} the orthogonal projection of r onto \mathcal{H} and r_{\perp}=r-r_{\parallel}. In the strong signal case, we have

r_{\parallel}=\left(\frac{1}{2},\frac{1}{2},...,\frac{1}{2},-\frac{1}{2},-\frac{1}{2},...,-\frac{1}{2}\right)^{\top},\quad r_{\perp}=\left(\frac{1}{2},\frac{1}{2},...,\frac{1}{2},\frac{1}{2},\frac{1}{2},...,\frac{1}{2}\right)^{\top}.

Then, the signal part satisfies

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r_{\parallel}\leq\sup_{\Delta\in\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon)}\Delta^{\top}r_{\parallel}=\left(\varepsilon\cdot\frac{r_{\parallel}}{\left\|r_{\parallel}\right\|_{2}}\right)^{\top}r_{\parallel}=\varepsilon\left\|r_{\parallel}\right\|_{2}=\frac{\varepsilon\sqrt{d}}{2}. (31)

On the other hand, we notice that when \varepsilon\leq\frac{1}{\sqrt{d}},

\varepsilon\cdot\frac{r_{\parallel}}{\left\|r_{\parallel}\right\|_{2}}=\left(\frac{\varepsilon}{\sqrt{d}},\frac{\varepsilon}{\sqrt{d}},...,\frac{\varepsilon}{\sqrt{d}},-\frac{\varepsilon}{\sqrt{d}},-\frac{\varepsilon}{\sqrt{d}},...,-\frac{\varepsilon}{\sqrt{d}}\right)^{\top}\in\mathsf{C}\left(\varepsilon\right).

Hence the inequality in (31) is in fact an equality, which implies

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}r=\frac{\varepsilon\sqrt{d}}{2}. (32)

For the noise part, we decompose the noise as \eta=\eta_{\perp}+\eta_{\parallel}, where \eta_{\parallel} is the orthogonal projection of \eta onto \mathcal{H}. Then, from \mathsf{C}\left(\varepsilon\right)\subset\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon), one has

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\left(\eta_{\parallel}+\eta_{\perp}\right)=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{\parallel}\leq\sup_{\Delta\in\mathcal{H}\cap\mathbb{B}_{2}^{d}(\varepsilon)}\Delta^{\top}\eta_{\parallel}=\left(\varepsilon\cdot\frac{\eta_{\parallel}}{\left\|\eta_{\parallel}\right\|_{2}}\right)^{\top}\eta_{\parallel}=\varepsilon\left\|\eta_{\parallel}\right\|_{2}\leq\varepsilon\left\|\eta\right\|_{2}.

This implies \widehat{\Delta}_{\varepsilon}^{\top}r\geq\left(\Delta^{*}_{\varepsilon}\right)^{\top}r-\left[\varepsilon\left\|\eta\right\|_{2}-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\right]. By assumption, \frac{1}{\sigma^{2}}\left\|\eta\right\|_{2}^{2} is a chi-square random variable with d degrees of freedom, so from Example 2.11 in Wainwright (2019), with probability at least 1-\delta/2, one has

\frac{\left\|\eta\right\|_{2}^{2}}{d\sigma^{2}}\leq 1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}.

This implies

\left\|\eta\right\|_{2}\leq\sqrt{d\sigma^{2}\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right)}\leq\sqrt{d}\sigma\left(1+\sqrt{\frac{2\log\left(2/\delta\right)}{d}}\right).

The last inequality comes from \sqrt{1+u}\leq 1+\frac{u}{2} for positive u. Moreover, since \Delta^{*}_{\varepsilon} is a fixed vector, we know \left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\sim\mathcal{N}\left(0,\sigma^{2}\left\|\Delta^{*}_{\varepsilon}\right\|_{2}^{2}\right). So with probability at least 1-\delta/2, one has

\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\geq-\sigma\left\|\Delta^{*}_{\varepsilon}\right\|_{2}\sqrt{2\log\left(\frac{2}{\delta}\right)}\geq-\sigma\varepsilon\sqrt{2\log\left(\frac{2}{\delta}\right)}.

Combining the two bounds above, with probability at least 1-\delta, it holds that

\varepsilon\left\|\eta\right\|_{2}-\left(\Delta^{*}_{\varepsilon}\right)^{\top}\eta\leq\varepsilon\sqrt{d}\sigma\left(1+\sqrt{\frac{2\log\left(2/\delta\right)}{d}}\right)+\sigma\varepsilon\sqrt{2\log\left(\frac{2}{\delta}\right)}=\varepsilon\sqrt{d}\sigma\left(1+\sqrt{\frac{8\log\left(2/\delta\right)}{d}}\right). (33)

Combining (30), (32) and (33), we finish the proof. ∎
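As a concrete (and hedged) numerical illustration of Proposition A.1, with the specific values below chosen only for exposition: take \delta=0.05, so that the condition d\geq 8\log(2/\delta)\approx 29.5 holds for d=32, and suppose the noise level is \sigma=1/8. Then, with probability at least 0.95, the guarantee gives \widehat{\Delta}_{\varepsilon}^{\top}r\geq\frac{1}{2}-2\cdot\frac{1}{8}=\frac{1}{4}, i.e., a constant improvement over the reference policy even though each arm is observed only once.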

Appendix B Monte Carlo computation

Algorithm 2 Monte-Carlo method for computing \mathcal{G}\left(\varepsilon\right)
  Input: Offline dataset \mathcal{D}, the radius value \varepsilon\in E, the total sample size M and the threshold M_{0}.
  1. Independently sample M noise vectors, denoted as \eta_{i} for i\in[M], where the j-th entry of \eta_{i} is drawn from \mathcal{N}(0,\sigma_{j}^{2}/N(a_{j})); here \sigma_{j}^{2} is the noise variance for the j-th arm and N(a_{j}) is the sample size for a_{j} in \mathcal{D}.
  2. Solve X_{i}:=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{i} for i\in[M] and order the values as X_{(1)}\leq X_{(2)}\leq...\leq X_{(M)}.
  3. Return X_{(M-M_{0}+1)} as an estimate of \mathcal{G}\left(\varepsilon\right) defined in Definition 4.2.

As discussed in Section 4, we can estimate \mathcal{G}\left(\varepsilon\right) using a classical Monte Carlo method. In this section, we describe the detailed implementation. We first sample M i.i.d. noise vectors and then solve \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta for each of them to get M suprema. We then select the M_{0}-th largest of all suprema as our estimate of the bonus function, where M_{0} is a pre-computed integer depending on M and the pre-determined failure probability \delta>0. Here, the program \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta is a second-order cone program and can be efficiently solved via standard off-the-shelf libraries (Alizadeh & Goldfarb, 2003; Boyd & Vandenberghe, 2004; Diamond & Boyd, 2016). The pseudocode for the Monte-Carlo sampling is in Algorithm 2.
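To make the procedure concrete, the following is a minimal Python sketch of Algorithm 2 using CVXPY (Diamond & Boyd, 2016). It assumes that the trust region \mathsf{C}\left(\varepsilon\right) consists of improvement vectors \Delta such that \widehat{\mu}+\Delta lies in the simplex and \sqrt{\sum_{i}\sigma_{i}^{2}\Delta_{i}^{2}/N_{i}}\leq\varepsilon, matching the reparameterization in Appendix C.2; the function and variable names are illustrative only and not taken from any released code.

import numpy as np
import cvxpy as cp

def estimate_bonus(mu_hat, sigma2, counts, eps, M=1000, M0=10, seed=0):
    # Monte-Carlo estimate of G(eps): return the (M - M0 + 1)-th order statistic
    # of the suprema sup_{Delta in C(eps)} Delta^T eta over M sampled noise vectors.
    d = len(mu_hat)
    rng = np.random.default_rng(seed)
    scale = np.sqrt(np.asarray(sigma2) / np.asarray(counts))  # sigma_i / sqrt(N_i)
    sups = np.empty(M)
    for m in range(M):
        eta = rng.normal(0.0, scale)                  # entry i ~ N(0, sigma_i^2 / N_i)
        Delta = cp.Variable(d)
        constraints = [
            mu_hat + Delta >= 0,                      # mu_hat + Delta stays a probability vector
            cp.sum(Delta) == 0,
            cp.norm(cp.multiply(scale, Delta), 2) <= eps,  # localized (weighted l2) trust region
        ]
        # linear objective over a second-order cone feasible set
        problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(eta, Delta))), constraints)
        sups[m] = problem.solve()
    return np.sort(sups)[M - M0]                      # X_{(M - M0 + 1)}

Since the M optimization problems are independent, they can also be solved in parallel in practice.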

To determine M_{0}, we denote by \eta_{i} the i.i.d. noise vectors for i\in[M] and let X_{i}=\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta_{i}. We denote the order statistics of the X_{i}'s as X_{(1)}\leq X_{(2)}\leq...\leq X_{(M)}. Suppose the cumulative distribution function of X_{i} is F(x); then, from the properties of order statistics, the cumulative distribution function of X_{(M-M_{0}+1)} is

F_{X_{\left(M-M_{0}+1\right)}}(x)=\sum_{j=M-M_{0}+1}^{M}C_{M}^{j}\left(F(x)\right)^{j}\left(1-F(x)\right)^{M-j}.

We denote by q_{1-\delta} the (1-\delta)-lower quantile of the random variable X_{i}, so that F_{X_{\left(M-M_{0}+1\right)}}\left(q_{1-\delta}\right)=\sum_{j=M-M_{0}+1}^{M}C_{M}^{j}(1-\delta)^{j}\delta^{M-j}. For an integer M and \delta>0, we define Q(M,\delta) as the maximal integer M_{0} such that \sum_{j=M-M_{0}+1}^{M}C_{M}^{j}(1-\delta)^{j}\delta^{M-j}\leq\delta. With this definition, we fix M and a total failure tolerance \delta for all \varepsilon\in E, and take

M_{0}=Q\left(M,\frac{\delta}{2|E|}\right)

as the threshold. Under this choice, for any \varepsilon\in E, with probability at least 1-\delta/(2|E|), it holds that X_{\left(M-M_{0}+1\right)}>q_{1-\delta/(2|E|)}. On the other hand, with probability at least 1-\delta/(2|E|), it holds that \sup_{\Delta\in\mathsf{C}(\varepsilon)}\Delta^{\top}\eta\leq q_{1-\delta/(2|E|)}. This implies

\sup_{\Delta\in\mathsf{C}(\varepsilon)}\Delta^{\top}\eta\leq q_{1-\delta/(2|E|)}<X_{\left(M-M_{0}+1\right)}

with probability at least 1-\delta/|E|. By a union bound, with probability at least 1-\delta, the bound above holds for all \varepsilon\in E simultaneously.
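The threshold Q(M,\delta) can be computed by a direct search over the binomial tail above; below is a hedged sketch (the function name Q and the use of scipy.stats are our choices, not from the paper).

from scipy.stats import binom

def Q(M, delta):
    # Largest M0 such that P(Bin(M, 1 - delta) >= M - M0 + 1) <= delta,
    # i.e., the maximal threshold satisfying the binomial-tail condition above.
    M0 = 0
    while M0 + 1 <= M and binom.sf(M - (M0 + 1), M, 1.0 - delta) <= delta:
        M0 += 1
    return M0

# Usage with the union-bounded tolerance over the candidate radii E:
# M0 = Q(M, delta / (2 * len(E)))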

Appendix C A fine-grained analysis of the suboptimality

We have shown a problem-dependent upper bound for the suboptimality in (16). In this section, we give a further upper bound for \mathcal{G}\left(\varepsilon\right) and hence for the suboptimality. We have the following theorem. The proof is deferred to Section C.1.

Theorem C.1.

For a policy \pi (deterministic or stochastic), we denote its reward value as V^{\pi}. TRUST has the following properties.

  1.

    We denote a comparator policy as a triple (\varepsilon,\Delta,\pi) such that \varepsilon=\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}} and \pi=\widehat{\mu}+\Delta. We take the discrete candidate set E defined in (14). With probability at least 1-\delta, for any stochastic comparator policy (\varepsilon,\Delta,\pi), the sub-optimality of the output policy of Algorithm 1 can be upper bounded as

    V^{\pi}-V^{\pi_{TRUST}}\leq 2\sqrt{2\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}\log\left(\frac{2|E|}{\delta}\right)}+2\min\left\{\sqrt{\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}},4D\sqrt{\log_{+}\left(\frac{4ed\sum_{i=1}^{d}\frac{\alpha\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}}{D^{2}}\right)}\right\} (34)

    where D is any quantity satisfying

    D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (35)

    Here \alpha is the decaying rate defined in (14) and \log_{+}(a)=\max(1,\log(a)).

  2.

    (Comparison with the optimal policy) We further assume \sigma_{i}=1 for i\in[d] and that the offline dataset is generated from the policy \mu(\cdot) with \min_{i\in[d]}\mu(a_{i})>0. Without loss of generality, we assume a_{1} is the optimal arm and denote the optimal policy as \pi_{*}. We write

    C^{*}:=\frac{1}{\mu(a_{1})},\quad C_{\min}:=\frac{1}{\min_{i\in[d]}\mu(a_{i})}. (36)

    When N\geq 8C_{\min}\log(d/\delta), with probability at least 1-2\delta, one has

    V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C_{\min}}{N}\log_{+}\left(\frac{dC^{*}}{C_{\min}}\right)}+\sqrt{\frac{C^{*}}{N}\log\left(\frac{2|E|}{\delta}\right)}. (37)

    In particular, when C_{\min}\simeq C^{*}, we have, with probability at least 1-2\delta,

    V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}. (38)

We remark that (34) is problem-dependent, and it gives an explicit upper bound for \mathcal{G}\left(\lceil\varepsilon\rceil\right) in (16). This is derived by first concentrating \mathcal{G}\left(\varepsilon\right) around \mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta, which is known as the localized Gaussian width or local Gaussian complexity (Koltchinskii, 2006), and then upper bounding the localized Gaussian width of a convex hull via tools from convex analysis (Bellec, 2019). Different from (4), when \pi=a_{i} represents a single arm, (34) relies not only on \sigma_{i}^{2}/N_{i} but also on \sigma_{j}^{2}/N_{j} for j\neq i, since the sizes of the trust regions depend on \sigma_{i}^{2}/N_{i} for all i\in[d].

Notably, (38) gives an analogous upper bound depending on \mu(\cdot) and N, which is comparable to the bound for LCB in (5) up to constant and logarithmic factors. This indicates that, when the behavioral cloning policy is not too imbalanced, TRUST is guaranteed to achieve the same level of performance as LCB. In fact, this guarantee is remarkable since TRUST searches over a much larger policy space than LCB: the whole simplex of stochastic policies rather than only the set of single arms. We also remark that both (5) and (38) are worst-case upper bounds; in practice, we will show in Section 6 that in some settings TRUST achieves good performance while LCB fails completely.

Is TRUST minimax-optimal?

We consider the hard instances for offline MAB in (Rashidinejad et al., 2021), on which LCB achieves the minimax-optimal upper bound, and we show that for these hard instances TRUST achieves the same sample complexity as LCB up to constant and logarithmic factors. More specifically, we consider a two-arm MAB \mathcal{A}=\left\{1,2\right\} and the uniform behavioral cloning policy \mu(1)=\mu(2)=1/2. For \delta\in[0,1/4], we define \mathcal{M}_{1} and \mathcal{M}_{2} as two instances whose reward distributions are as follows.

\mathcal{M}_{1}:r(1)\sim\mathsf{Bernoulli}\left(\frac{1}{2}\right),\ r(2)\sim\mathsf{Bernoulli}\left(\frac{1}{2}+\delta\right)
\mathcal{M}_{2}:r(1)\sim\mathsf{Bernoulli}\left(\frac{1}{2}\right),\ r(2)\sim\mathsf{Bernoulli}\left(\frac{1}{2}-\delta\right),

where \mathsf{Bernoulli}\left(p\right) is the Bernoulli distribution with parameter p. The next result is a corollary of Theorem C.1.

Corollary C.2.

We define \mathcal{M}_{1},\mathcal{M}_{2} as above for \delta\in[0,1/4]. Assume N\geq\widetilde{O}(1). Then, we have

  1.

    The minimax optimal lower bound for the suboptimality of LCB is

    \inf_{\widehat{a}_{\mathsf{LCB}}\in\mathcal{A}}\sup_{\mathcal{M}\in\left\{\mathcal{M}_{1},\mathcal{M}_{2}\right\}}\mathbb{E}_{\mathcal{D}}\left[r(a^{*})-r(\widehat{a}_{\mathsf{LCB}})\right]\gtrsim\sqrt{\frac{C^{*}}{N}}, (39)

    where \mathbb{E}_{\mathcal{D}}\left[\cdot\right] is the expectation over the offline dataset \mathcal{D}.

  2.

    The upper bound for the suboptimality of TRUST matches the lower bound above up to a log factor. Namely, for any \mathcal{M}\in\left\{\mathcal{M}_{1},\mathcal{M}_{2}\right\}, one has

    \mathbb{E}_{\mathcal{D}}\left[r(a^{*})-V^{\pi_{TRUST}}\right]\lesssim\sqrt{\frac{C^{*}\log(dN)}{N}}. (40)

The first claim comes from Theorem 2 of (Rashidinejad et al., 2021), while the second claim is a direct corollary to Theorem C.1.

C.1 Proof of Theorem C.1

Proof.

Recall from Theorem 5.1 that for any comparator policy (\varepsilon,\Delta,\pi) defined above, one has

V^{\pi}-V^{\pi_{TRUST}}\leq 2\mathcal{G}\left(\lceil\varepsilon\rceil\right),

where \lceil\varepsilon\rceil:=\inf\left\{\varepsilon^{\prime}\in E:\varepsilon\leq\varepsilon^{\prime}\right\}. The following lemma upper bounds the quantile of Gaussian suprema \mathcal{G}\left(\varepsilon\right) for each \varepsilon\in E. The proof is deferred to Section C.2.

Lemma C.3.

For \varepsilon\in E, one can upper bound \mathcal{G}\left(\varepsilon\right) as follows:

\mathcal{G}\left(\varepsilon\right)\leq\min\left\{\varepsilon\cdot\sqrt{d}\ ,\ 4D\sqrt{\log_{+}\left(\frac{4ed\varepsilon^{2}}{D^{2}}\right)}\right\}+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)} (41)

where \log_{+}(a)=\max(1,\log(a)) and D is any quantity satisfying

D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (42)

Applying Lemma C.3 to \lceil\varepsilon\rceil\in E, we obtain

V^{\pi}-V^{\pi_{TRUST}}\leq 2\min\left\{\lceil\varepsilon\rceil\cdot\sqrt{d}\ ,\ 4D\sqrt{\log_{+}\left(\frac{4ed\lceil\varepsilon\rceil^{2}}{D^{2}}\right)}\right\}+2\sqrt{2\lceil\varepsilon\rceil^{2}\log\left(\frac{2|E|}{\delta}\right)}. (43)

Since \varepsilon=\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}}, we know from our discretization scheme in (14) that

\lceil\varepsilon\rceil\leq\alpha\cdot\sum_{i=1}^{d}\frac{\sigma_{i}^{2}\Delta_{i}^{2}}{N_{i}}. (44)

Plugging (44) into (43), we obtain our first claim. To get the second claim, we take \sigma_{i}=1 for i\in[d] and \Delta=\pi_{*}-\widehat{\mu}, which is the vector pointing from the reference policy \widehat{\mu} defined in (7) to the vertex corresponding to the optimal arm. Then, we have

\sum_{i=1}^{d}\frac{\Delta_{i}^{2}\sigma_{i}^{2}}{N_{i}}=\frac{1}{N_{1}}-\frac{2}{N}+\frac{1}{N}=\frac{1}{N_{1}}-\frac{1}{N}\leq\frac{1}{N_{1}},

where N_{1} is the sample size for the optimal arm a_{1}. Therefore, we can further bound (43) as

V^{\pi_{*}}-V^{\pi_{TRUST}}\leq 4D\sqrt{\log_{+}\left(\frac{4\alpha ed}{N_{1}D^{2}}\right)}+2\sqrt{\frac{2\alpha}{N_{1}}\log\left(\frac{2|E|}{\delta}\right)}. (45)

Finally, we take a specific value of D and lower bound N_{1} via the Chernoff bound in Lemma C.7. From Lemma C.7, we know that when N\geq 8C_{\min}\log(d/\delta), with probability at least 1-\delta, we have

N_{i}\geq\frac{1}{2}N\mu(a_{i}) (46)

for any i\in[d]. Recalling the definition of D in (42), D can be any value greater than \sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. Then, when \sigma_{i}=1, one has

\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}\leq\sqrt{\frac{1}{\min_{i\in[d]}N_{i}}}.

We denote N_{j}=\min_{i\in[d]}N_{i} (when there are multiple minimizers, we arbitrarily pick one). Then, we have

\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}\leq\sqrt{\frac{1}{N_{j}}}\leq\sqrt{\frac{2}{N\mu(a_{j})}}\leq\sqrt{\frac{2}{N\cdot\min_{i\in[d]}\mu(a_{i})}}=\sqrt{\frac{2C_{\min}}{N}}.

Therefore, we take D=\sqrt{\frac{2C_{\min}}{N}} in (45) and apply N_{1}\geq\frac{1}{2}N\mu(a_{1}) to obtain

V^{\pi_{*}}-V^{\pi_{TRUST}}\leq 4\sqrt{\frac{2C_{\min}}{N}\log_{+}\left(\frac{4\alpha edC^{*}}{C_{\min}}\right)}+4\sqrt{\frac{\alpha C^{*}}{N}\log\left(\frac{2|E|}{\delta}\right)}, (47)

which proves (37). Finally, when C^{*}\simeq C_{\min}, one has

V^{\pi_{*}}-V^{\pi_{TRUST}}\lesssim\sqrt{\frac{C^{*}}{N}\log\left(\frac{2d|E|}{\delta}\right)}.

Therefore, we conclude. ∎

C.2 Proof of Lemma C.3

Proof.

Recall that \Delta=(\Delta_{1},\Delta_{2},...,\Delta_{d})^{\top} is the improvement vector and \eta=(\eta_{1},\eta_{2},...,\eta_{d})^{\top} is the noise vector, whose entries are independent with \eta_{i}\sim\mathcal{N}(0,\sigma_{i}^{2}/N_{i}), where N_{i} is the sample size of arm a_{i} in the offline dataset. To proceed with the proof, we further define

\widetilde{\eta}=\left(\widetilde{\eta}_{1},\widetilde{\eta}_{2},...,\widetilde{\eta}_{d}\right)^{\top},\quad\widetilde{\Delta}=\left(\widetilde{\Delta}_{1},\widetilde{\Delta}_{2},...,\widetilde{\Delta}_{d}\right)^{\top},\quad\text{where}\quad\widetilde{\eta}_{i}=\eta_{i}\frac{\sqrt{N_{i}}}{\sigma_{i}},\ \widetilde{\Delta}_{i}=\frac{\Delta_{i}\sigma_{i}}{\sqrt{N_{i}}}. (48)

With this notation, one has

\widetilde{\eta}\sim\mathcal{N}\left(0,I_{d}\right),\quad\eta^{\top}\Delta=\widetilde{\eta}^{\top}\widetilde{\Delta}.

We also write the equivalent trust region (for \widetilde{\Delta}) as

\widetilde{\mathsf{C}}\left(\varepsilon\right)=\left\{\widetilde{\Delta}\in\mathbb{R}^{d}:\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\geq 0,\quad\sum_{i=1}^{d}\left[\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\right]=1,\quad\left\|\widetilde{\Delta}\right\|_{2}\leq\varepsilon\right\}, (49)

where \widehat{\mu}=(\widehat{\mu}_{1},\widehat{\mu}_{2},...,\widehat{\mu}_{d})^{\top} is the weight vector of the reference policy. From the definition above, one has, for any \varepsilon>0,

\Delta\in\mathsf{C}\left(\varepsilon\right)\ \Leftrightarrow\ \widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right).

Then, we apply Lemma C.4 to \sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta for a fixed \varepsilon\in E. With probability at least 1-\frac{\delta}{|E|},

\left|\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta-\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\right|\leq\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}.

From a union bound, one immediately has that, with probability at least 1-\delta, for any \varepsilon\in E, it holds that

\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta\leq\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}. (50)

From the definition of \mathcal{G}\left(\varepsilon\right) in Definition 4.2, we know that \mathcal{G}\left(\varepsilon\right) is the minimal quantity that satisfies (50) with probability at least 1-\delta. Therefore, one has

\mathcal{G}\left(\varepsilon\right)\leq\mathbb{E}\sup_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\eta+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}=\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right)}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]+\sqrt{2\varepsilon^{2}\log\left(\frac{2|E|}{\delta}\right)}\quad\forall\varepsilon\in E. (51)

Note that the first term on the RHS of (51) is the localized Gaussian width of the convex hull defined by the trust region \mathsf{C}\left(\varepsilon\right) (or equivalently, \widetilde{\mathsf{C}}\left(\varepsilon\right)). We denote

T:=\left\{\widetilde{\Delta}\in\mathbb{R}^{d}:\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\geq 0,\quad\sum_{i=1}^{d}\left[\frac{\sqrt{N_{i}}}{\sigma_{i}}\widetilde{\Delta}_{i}+\widehat{\mu}_{i}\right]=1\right\}. (52)

We immediately have that T is the convex hull of d points in \mathbb{R}^{d}, whose vertices are the vertices of the simplex in \mathbb{R}^{d} shifted by the reference policy \widehat{\mu} and rescaled coordinate-wise. In what follows, we plan to apply Lemma C.5 to the localized Gaussian width of T\cap\varepsilon\mathbb{B}_{2}. However, T is not contained in the unit ball of \mathbb{R}^{d}, so we need an additional scaling. Note that the zero vector is included in T. Let us compute the largest distance from the origin to a vertex of T. We denote the i-th vertex of T as

\widetilde{\Delta}=\left(-\frac{\sigma_{1}}{\sqrt{N_{1}}}\widehat{\mu}_{1},...,-\frac{\sigma_{i-1}}{\sqrt{N_{i-1}}}\widehat{\mu}_{i-1},\frac{\sigma_{i}}{\sqrt{N_{i}}}\left(1-\widehat{\mu}_{i}\right),-\frac{\sigma_{i+1}}{\sqrt{N_{i+1}}}\widehat{\mu}_{i+1},...,-\frac{\sigma_{d}}{\sqrt{N_{d}}}\widehat{\mu}_{d}\right). (53)

The \ell_{2}-norm of this vertex is

\left\|\widetilde{\Delta}\right\|_{2}=\sqrt{\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}},

where N is the total sample size of the offline dataset. Therefore, the maximal radius of T can be upper bounded by D, where D is any quantity that satisfies

D\geq\sqrt{\max_{i\in[d]}\left[\frac{\sigma_{i}^{2}}{N_{i}}-\frac{2\sigma_{i}^{2}}{N}\right]+\frac{\sum_{j=1}^{d}N_{j}\sigma_{j}^{2}}{N^{2}}}. (54)

We denote S=\frac{1}{D}\cdot T:=\left\{\frac{1}{D}\cdot x:x\in T\right\}. Then, from Lemma C.5, one has

\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in\widetilde{\mathsf{C}}\left(\varepsilon\right)}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]=\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in T\cap\varepsilon\mathbb{B}_{2}}\widetilde{\Delta}^{\top}\widetilde{\eta}\right]
=D\cdot\mathbb{E}_{\widetilde{\eta}\sim\mathcal{N}(0,I_{d})}\left[\sup_{\widetilde{\Delta}\in S\cap\frac{\varepsilon}{D}\mathbb{B}_{2}}\widetilde{\Delta}^{\top}\widetilde{\eta}\right] (S\cap\frac{\varepsilon}{D}\mathbb{B}_{2} is obtained by scaling T\cap\varepsilon\mathbb{B}_{2} by \frac{1}{D})
\leq D\cdot\left[\left(4\sqrt{\log_{+}\left(4ed\left(\frac{\varepsilon^{2}}{D^{2}}\wedge 1\right)\right)}\right)\wedge\left(\frac{\varepsilon}{D}\sqrt{d}\right)\right] (take s=\frac{\varepsilon}{D} and M=d in Lemma C.5)
=D\cdot\left[\left(4\sqrt{\log_{+}\left(\frac{4ed\varepsilon^{2}}{D^{2}}\right)}\right)\wedge\left(\frac{\varepsilon}{D}\sqrt{d}\right)\right]. (\varepsilon\leq D for any \varepsilon\in E)

This finishes the proof. ∎

C.3 Auxiliary lemmas

Lemma C.4 (Concentration of Gaussian suprema, Exercise 5.10 in Wainwright (2019)).

Let \left\{X_{\theta},\theta\in\mathbb{T}\right\} be a zero-mean Gaussian process, and define Z=\sup_{\theta\in\mathbb{T}}X_{\theta}. Then, we have

\mathbb{P}[|Z-\mathbb{E}[Z]|\geq\delta]\leq 2\exp\left(-\frac{\delta^{2}}{2\sigma^{2}}\right),

where \sigma^{2}:=\sup_{\theta\in\mathbb{T}}\operatorname{var}\left(X_{\theta}\right) is the maximal variance of the process.

Lemma C.5 (Localized Gaussian Width of a Convex Hull, Proposition 1 in Bellec (2019)).

Let d\geq 1, M\geq 2, and let T be the convex hull of M points in \mathbb{R}^{d}. We write \mathbb{B}_{2}=\left\{x\in\mathbb{R}^{d}:\left\|x\right\|_{2}\leq 1\right\} and s\mathbb{B}_{2}=\left\{s\cdot x:x\in\mathbb{R}^{d},\left\|x\right\|_{2}\leq 1\right\}. Assume T\subset\mathbb{B}_{2}^{d}(1). Let g\in\mathbb{R}^{d} be a standard Gaussian vector. Then, for all s>0, one has

\mathbb{E}\left[\sup_{x\in T\cap s\mathbb{B}_{2}}\ x^{\top}g\right]\leq\left(4\sqrt{\log_{+}\left(4eM\left(s^{2}\wedge 1\right)\right)}\right)\wedge\left(s\sqrt{d\wedge M}\right), (55)

where \log_{+}(a)=\max(1,\log(a)) and a\wedge b=\min\left\{a,b\right\}.

Lemma C.6 (Chernoff bound for binomial random variables, Theorem 2.3.1 in Vershynin (2020)).

Let X_{i} be independent Bernoulli random variables with parameters p_{i}. Consider their sum S_{N}=\sum_{i=1}^{N}X_{i} and denote its mean by \mu=\mathbb{E}S_{N}. Then, for any t>\mu, we have

\mathbb{P}\left\{S_{N}\geq t\right\}\leq e^{-\mu}\left(\frac{e\mu}{t}\right)^{t}.
Lemma C.7 (Chernoff bound for offline MAB).

Under the setting in Theorem C.1, we have

\mathbb{P}\left(N_{i}\geq\frac{1}{2}N\mu(a_{i})\quad\forall i\in[d]\right)\geq 1-d\exp\left(-\frac{N\cdot\min_{j\in[d]}\mu(a_{j})}{8}\right).
Proof.

For arm i\in[d], we take \mu=N\mu(a_{i}) and t=\frac{1}{2}N\mu(a_{i}) in the lower-tail analogue of Lemma C.6 (which states that \mathbb{P}\left\{S_{N}\leq t\right\}\leq e^{-\mu}\left(\frac{e\mu}{t}\right)^{t} for t<\mu) and obtain

\mathbb{P}\left(N_{i}\leq\frac{1}{2}N\mu(a_{i})\right)\leq\exp\left(-N\mu(a_{i})\right)\cdot\left(\frac{eN\mu(a_{i})}{\frac{1}{2}N\mu(a_{i})}\right)^{\frac{1}{2}N\mu(a_{i})}=\exp\left(N\mu(a_{i})\left[-1+\frac{1}{2}\log(2e)\right]\right)\leq\exp\left(-\frac{N\mu(a_{i})}{8}\right).

We finish the proof by a union bound for all arms. ∎

Appendix D Proof of Lemma 5.2

Proof.

Recall the definition of \lceil\varepsilon\rceil:

\lceil\varepsilon\rceil:=\inf\left\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}\geq\varepsilon\right\}. (56)

We additionally define

\lfloor\varepsilon\rfloor:=\sup\left\{\varepsilon^{\prime}\in E:\varepsilon^{\prime}<\varepsilon\right\}. (57)

In particular, if there is no \varepsilon^{\prime}\in E such that \varepsilon^{\prime}<\varepsilon, then we define \lfloor\varepsilon\rfloor=0. Then, for any \varepsilon\leq\varepsilon_{0}\in E (where \varepsilon_{0} is the largest possible radius) and a finite set E, it holds that

\lfloor\varepsilon\rfloor<\varepsilon\leq\lceil\varepsilon\rceil,\quad\text{ and }\quad\varepsilon=\lceil\varepsilon\rceil\text{ if and only if }\varepsilon\in E. (58)

For any \varepsilon\in E, recall that \widehat{\Delta}_{\varepsilon} is the optimal improvement vector within \mathsf{C}\left(\varepsilon\right) defined in (10). It holds that

\widehat{\Delta}_{\varepsilon}:=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\Delta^{\top}\widehat{r}=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\varepsilon\right)\big{]} (since \mathcal{G}\left(\varepsilon\right) does not depend on \Delta)
=\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]} (\varepsilon\in E, so \lceil\varepsilon\rceil=\varepsilon)
\leq\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}.

On the other hand, when \varepsilon\in E and \varepsilon^{\prime}\in(\left\lfloor\varepsilon\right\rfloor,\left\lceil\varepsilon\right\rceil], one has \left\lceil\varepsilon^{\prime}\right\rceil=\left\lceil\varepsilon\right\rceil=\varepsilon, so

\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}=\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]}\leq\mathop{\arg\max}_{\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]},

where the last inequality comes from the fact that \mathsf{C}\left(\varepsilon^{\prime}\right)\subset\mathsf{C}\left(\varepsilon\right) when \varepsilon^{\prime}\leq\lceil\varepsilon\rceil=\varepsilon by the definition of the trust region in (4.2). Combining the two inequalities above, we have, for any \varepsilon\in E,

\left(\varepsilon,\widehat{\Delta}_{\varepsilon}\right)=\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}, (59)

where the optimization variables on the RHS above are \varepsilon^{\prime} and \Delta. Therefore, from the definition of \left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right), we have

\left(\widehat{\varepsilon}_{*},\widehat{\Delta}_{*}\right)=\mathop{\arg\max}_{\varepsilon\in E}\mathop{\arg\max}_{\varepsilon^{\prime}\in\left(\lfloor\varepsilon\rfloor,\lceil\varepsilon\rceil\right],\Delta\in\mathsf{C}\left(\varepsilon^{\prime}\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon^{\prime}\rceil\right)\big{]}=\mathop{\arg\max}_{\varepsilon\leq\varepsilon_{0},\Delta\in\mathsf{C}\left(\varepsilon\right)}\big{[}\Delta^{\top}\widehat{r}-\mathcal{G}\left(\lceil\varepsilon\rceil\right)\big{]}.

This finishes the proof. ∎

Appendix E Augmentation with LCB

To determine the most effective final policy, we can compare the outputs of LCB and Algorithm 1 and combine the two policies based on the relative magnitude of their corresponding lower bounds. Specifically, the combined policy is

\pi_{\mathsf{combined}}=\left\{\begin{aligned} \widehat{a}_{\mathsf{LCB}}&\ \text{ if }\max_{a_{i}\in\mathcal{A}}l_{i}\geq w_{\mathsf{TR}}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}},\\ w_{\mathsf{TR}}&\ \text{ if }\max_{a_{i}\in\mathcal{A}}l_{i}<w_{\mathsf{TR}}^{\top}\widehat{r}-\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)-\sqrt{\frac{2\log(1/\delta)}{\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}}},\end{aligned}\right. (60)

where l_{i}=\widehat{r}_{i}-b_{i} is defined in (1) and \mathcal{G}\left(\varepsilon\right) is defined in Definition 4.2. This combined policy will perform at least as well as LCB with high probability. More specifically, we have

Corollary E.1.

We denote the arm chosen by LCB as \widehat{a}_{\mathsf{LCB}}. We also denote by r(\cdot) the true reward of a policy (deterministic or stochastic). With probability at least 1-3\delta, one has

V^{\pi_{\mathsf{combined}}}\geq\max_{a_{i}\in\mathcal{A}}l_{i}. (61)
Proof.

We denote by \widehat{r}\left(\widehat{a}_{\mathsf{LCB}}\right)=\widehat{r}_{\widehat{a}_{\mathsf{LCB}}} and \widehat{r}(w_{\mathsf{TR}}) the empirical rewards of the policies returned by LCB and Algorithm 1, respectively. Recalling the uncertainty term of LCB in (1) and that of Algorithm 1 in (60), we write b(\widehat{a}_{\mathsf{LCB}})=b_{\widehat{a}_{\mathsf{LCB}}} and b(w_{\mathsf{TR}})=\mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right)+\sqrt{2\log(1/\delta)/[\sum_{j=1}^{d}N_{j}/\sigma_{j}^{2}]}. Then, from Theorem 5.1, (2), and a union bound, we know that with probability at least 1-3\delta, it holds that

r(\widehat{a}_{\mathsf{LCB}})\geq\widehat{r}(\widehat{a}_{\mathsf{LCB}})-b(\widehat{a}_{\mathsf{LCB}}),\quad r(w_{\mathsf{TR}})\geq\widehat{r}(w_{\mathsf{TR}})-b(w_{\mathsf{TR}}),

which implies

V^{\pi_{\mathsf{combined}}}\geq\widehat{r}(\pi_{\mathsf{combined}})-b(\pi_{\mathsf{combined}})
\geq\widehat{r}(\widehat{a}_{\mathsf{LCB}})-b(\widehat{a}_{\mathsf{LCB}}) (by the selection rule (60))
=\max_{a_{i}\in\mathcal{A}}l_{i}. (by the definition of \widehat{a}_{\mathsf{LCB}} in (3))

Therefore, we conclude. ∎
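For completeness, a minimal Python sketch of the combination rule (60) is given below, assuming the per-arm empirical rewards, the LCB bonuses b_{i}, the TRUST weights w_{\mathsf{TR}}, and the Monte-Carlo bonus estimate \mathcal{G}\left(\lceil\widehat{\varepsilon}_{*}\rceil\right) are already computed; all function and argument names are illustrative.

import numpy as np

def combine_policies(r_hat, lcb_bonus, w_tr, g_eps_hat, counts, sigma2, delta):
    # Combination rule (60): return the LCB arm (as a one-hot policy) or the
    # TRUST policy w_TR, whichever has the larger lower bound on its value.
    l = r_hat - lcb_bonus                              # per-arm lower bounds l_i
    tr_lower = w_tr @ r_hat - g_eps_hat - np.sqrt(
        2.0 * np.log(1.0 / delta) / np.sum(counts / sigma2))
    if l.max() >= tr_lower:
        pi = np.zeros_like(w_tr)
        pi[int(np.argmax(l))] = 1.0                    # deterministic policy on the LCB arm
        return pi
    return w_tr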

Appendix F Experiment details

We ran experiments on MuJoCo environments from the D4RL benchmark (Fu et al., 2020). All environments we test on are v3. Since the original D4RL datasets do not include the exact form of the logging policies, we retrain SAC (Haarnoja et al., 2018) on these environments for 1000 episodes and keep a record of the policy after each episode. We test 4 environments in two settings, denoted as '1-traj-low' and '1-traj-high'. In either setting, the offline dataset is generated from 100 policies with one trajectory from each. In the '1-traj-low' setting, the data is generated by the first 100 policies in the training process of SAC, while in the '1-traj-high' setting, it is generated by the policies at the (10x+1)-th episodes of the training process.

For all experiments on MuJoCo, we average the results over 4 random seeds (2023 to 2026). To run CQL, we use the default hyper-parameters in https://github.com/young-geng/CQL and train for 2000 episodes. For TRUST, we use a fixed standard deviation level \sigma_{i}=150 in all experiments.
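A hedged sketch of the checkpoint selection implied by the description above (the loader and rollout helpers are hypothetical, and the range of x is our reading of the text):

# Episode indices of the SAC checkpoints used to generate the offline data.
low_episodes = list(range(1, 101))                 # '1-traj-low': first 100 policies
high_episodes = [10 * x + 1 for x in range(100)]   # '1-traj-high': episodes 1, 11, ..., 991

# Hypothetical data-collection loop: one trajectory per selected checkpoint.
# dataset = [rollout_one_trajectory(load_sac_checkpoint(ep)) for ep in high_episodes]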