

Is RLHF More Difficult than Standard RL?
A Theoretical Perspective

Yuanhao Wang    Qinghua Liu    Chi Jin
Abstract

Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.

Princeton University. Email: {yuanhao,qinghual,chij}@princeton.edu

1 Introduction

Reinforcement learning (RL) is a control-theoretic problem in which agents take a sequence of actions, receive reward feedback from the environment, and aim to find good policies that maximize the cumulative rewards. The reward labels can be objective measures of success (winning in a game of Go (Silver et al., 2017)) or more often hand-designed measures of progress (gaining gold in DOTA2 (Berner et al., 2019)). The empirical success of RL in various domains  (Mnih et al., 2013; Vinyals et al., 2019; Todorov et al., 2012) crucially relies on the availability and quality of reward signals. However, this also presents a limitation for applying standard reinforcement learning when designing a good reward function is difficult.

An important approach that addresses this challenge is reinforcement learning from human feedback (RLHF), where RL agents learn from preference feedback provided by humans. Preference feedback is arguably more intuitive to human users, more aligned with human values, and easier to solicit in applications such as recommendation systems and image generation (Pereira et al., 2019; Lee et al., 2023). Empirically, RLHF is a key ingredient empowering successes in tasks ranging from robotics (Jain et al., 2013) to large language models (Ouyang et al., 2022).

A simple observation about preference feedback is that preferences can always be reconstructed from reward signals. In other words, preference feedback contains arguably less information than scalar rewards, which may render the RLHF problem more challenging. A natural question to ask is:

Is preference-based RL more difficult than reward-based RL?

Existing research on preference-based RL (Chen et al., 2022; Pacchiano et al., 2021; Novoseller et al., 2020; Xu et al., 2020; Zhu et al., 2023) has established efficient guarantees for learning a near-optimal policy from preference feedback. These works typically develop specialized algorithms and analyses in a white-box fashion, instead of building on existing techniques in standard RL. This leaves open whether it is necessary to develop a new theoretical foundation for preference-based RL in parallel to standard reward-based RL.

This work presents a comprehensive set of results on provably efficient RLHF under a wide range of preference models. We develop new simple reduction-based approaches that solve preference-based RL by reducing it to existing frameworks in reward-based RL, with little to no additional cost:

  • For utility-based preferences, i.e., those drawn from a reward-based probabilistic model (see Section 2.1), we prove a reduction from preference-based RL to reward-based RL with robustness guarantees via a new Preference-to-Reward Interface (P2R, Algorithm 1). Our approach incurs no sample complexity overhead, and the human query complexity (throughout this paper, sample complexity refers to the number of interactions with the MDP, while query complexity refers to the total number of calls to human evaluators) does not scale with the sample complexity of the RL algorithm. We instantiate our framework for comparisons based on (1) the immediate reward of the current state-action pair or (2) the cumulative reward of the trajectory, and apply existing reward-based RL algorithms to directly find near-optimal policies for RLHF in a large class of models including tabular MDPs, linear MDPs, and MDPs with low Bellman-Eluder dimension. We further provide complexity guarantees when K-wise comparisons are available.

  • For general (arbitrary) preferences, we consider the objective of the von Neumann winner (see, e.g., Dudík et al., 2015; Kreweras, 1965), a solution concept that always exists and extends the Condorcet winner. We reduce this problem to multiagent reward-based RL, which finds Nash equilibria for a special class of factored two-player Markov games under a restricted set of policies. When preferences only depend on the final state, we prove that such factored Markov games can be solved by both players running Adversarial Markov Decision Process (AMDP) algorithms independently. For preferences that depend on the entire trajectory, we develop an adapted version of the optimistic Maximum Likelihood Estimation (OMLE) algorithm (Liu et al., 2022a), which handles these factored Markov games under general function approximation.

Notably, our algorithmic solutions are either reductions to standard reward-based RL problems or adaptations of existing algorithms (OMLE). This suggests that, technically, preference feedback is not difficult to address given the existing knowledge of RL with reward feedback. Nevertheless, our impossibility results for utility-based preferences (Lemmas 3.2 and 3.3) and our reduction for general preferences also highlight several important conceptual differences between RLHF and standard RL.

1.1 Related work

Dueling bandits. Dueling bandits (Yue et al., 2012; Bengs et al., 2021) can be seen as a special case of preference-based RL with H=1. Many assumptions later applied to preference-based RL, such as an underlying utility model with a link function (Yue and Joachims, 2009), the Plackett-Luce model (Saha and Gopalan, 2019), and the Condorcet winner (Zoghi et al., 2014), can be traced back to the literature on dueling bandits. A reduction from preference feedback to reward feedback for bandits was proposed by Ailon et al. (2014). The concept of the von Neumann winner, which we employ for general preferences, has been considered in Dudík et al. (2015) for contextual dueling bandits.

RL from human feedback. Using human preferences in RL has been studied for at least a decade (Jain et al., 2013; Busa-Fekete et al., 2014) and was later combined with deep RL (Christiano et al., 2017). It has found empirical success in robotics (Jain et al., 2013; Abramson et al., 2022; Ding et al., 2023), game playing (Ibarz et al., 2018) and fine-tuning large language models (Ziegler et al., 2019; Ouyang et al., 2022; Bai et al., 2022). RL with other forms of human feedback, such as demonstrations and scalar ratings, has also been considered in prior research (Finn et al., 2016; Warnell et al., 2018; Arakawa et al., 2018) but falls beyond the scope of this paper.

Theory of Preference-based RL. For utility-based preferences, Novoseller et al. (2020), Pacchiano et al. (2021) and Zhan et al. (2023b) consider tabular or linear MDPs and assume that the per-trajectory reward is linear in trajectory features. Xu et al. (2020) also considers the tabular setting; however, instead of assuming an explicit link function, it assumes several structural properties of the preference which guarantee a Condorcet winner and ensure small regret of a black-box dueling bandit algorithm. Finally, the recent works of Zhu et al. (2023) and Zhan et al. (2023a) consider utility-based preferences in the offline setting, assuming the algorithm is provided with a pre-collected human preference (and transition) dataset with good coverage. Compared to the above works, this paper considers the online setting and derives results for utility-based preferences with general function approximation, which are significantly more general than those for tabular or linear MDPs.

For general preferences, Chen et al. (2022) also develops sample-efficient algorithms for finding the von Neumann winner (while their Assumption 3.1 seemingly assumes a Condorcet winner, this assumption actually always holds because \pi^{\star} can be a general randomized policy, so their solution concept coincides with the von Neumann winner). Their algorithm is computationally inefficient in general, even when restricted to the tabular setting. Compared to this result, our AMDP-based reduction algorithm (Section 4.2) is computationally efficient in the tabular or linear setting when the comparison depends only on the final states. For general trajectory-based comparison, our results apply to a richer class of RL problems including POMDPs.

Finally, we remark that all prior results develop specialized algorithms and analysis for preference-based RL in a white-box fashion. In contrast, we develop reduction-based algorithms which can directly utilize state-of-the-art results in reward-based RL for preference-based RL. This reduction approach enables the significant generality of the results in this paper compared to prior works.

2 Preliminaries

We consider reinforcement learning in episodic MDPs, specified by a tuple (H,\mathcal{S},\mathcal{A},\mathbb{P}). Here \mathcal{S} is the state space, \mathcal{A} is the action space, and H is the length of each episode. \mathbb{P} is the transition probability function; for each h\in[H] and (s,a)\in\mathcal{S}\times\mathcal{A}, \mathbb{P}_{h}(\cdot|s,a) specifies the distribution of the next state. A trajectory \tau\in(\mathcal{S}\times\mathcal{A})^{H} is a sequence of interactions (s_{1},a_{1},\cdots,s_{H},a_{H}) with the MDP. A Markov policy \pi=\{\pi_{h}:\mathcal{S}\to\Delta_{\mathcal{A}}\}_{h\in[H]} specifies an action distribution based on the current state, while a general policy \pi=\{\pi_{h}:(\mathcal{S}\times\mathcal{A})^{h-1}\times\mathcal{S}\to\Delta_{\mathcal{A}}\}_{h\in[H]} can choose a random action based on the whole history up to timestep h.

In RLHF, an algorithm interacts with a reward-less MDP environment and may query a comparison oracle (human evaluators) for preference information. We consider two types of preferences: utility-based and general.

2.1 Utility-based preferences

For utility-based comparison, we assume there exists an underlying reward function r^{\star}:\mathcal{S}\times\mathcal{A}\to[0,1]. Given a reward function, the value of a policy is defined as \mathbb{E}_{\pi}[\sum_{h=1}^{H}r^{\star}(s_{h},a_{h})], i.e., the expected cumulative reward obtained by executing the policy. An optimal policy is one that maximizes \mathbb{E}_{\pi}[\sum_{h=1}^{H}r^{\star}(s_{h},a_{h})]. We say that \pi is \epsilon-optimal if \mathbb{E}_{\pi}[\sum_{h=1}^{H}r^{\star}(s_{h},a_{h})]\geq\max_{\pi^{\prime}}\mathbb{E}_{\pi^{\prime}}[\sum_{h=1}^{H}r^{\star}(s_{h},a_{h})]-\epsilon. We also consider a setting where the utility is only available for a whole trajectory. In this case, we assume that there is an underlying trajectory reward function r^{\star}:(\mathcal{S}\times\mathcal{A})^{H}\to[0,H], which assigns a scalar utility to each trajectory, and the value of a policy is defined similarly as \mathbb{E}_{\tau\sim\pi}[r^{\star}(\tau)].

In preference-based RL, the reward is assumed to be unobservable, but it is reflected in the comparison oracle, which models human evaluators.

Definition 1 (Comparison oracle).

A comparison oracle takes in two trajectories \tau_{1}, \tau_{2} and returns

o\sim{\rm Ber}\left(\sigma(r^{\star}(\tau_{1})-r^{\star}(\tau_{2}))\right),

where \sigma(\cdot) is a link function, and r^{\star}(\cdot) is the underlying reward function.

Here {\rm Ber}(p) denotes a Bernoulli distribution with mean p. The comparison outcome o=1 indicates \tau_{1}\succ\tau_{2}, and vice versa. We additionally require that the inputs \tau_{1}, \tau_{2} to the comparison oracle be feasible, in the sense that they should be actual trajectories generated by the algorithm and cannot be synthesized artificially. This is motivated by the potential difficulty of asking human evaluators to compare out-of-distribution samples (e.g., random pixels).
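As a concrete illustration, the following Python sketch simulates the comparison oracle of Definition 1 under the Bradley-Terry model; the logistic link and the placeholder reward function r_star are assumptions made only for the sake of the example, not part of the protocol itself.

import math
import random

def bradley_terry_link(x: float) -> float:
    # Logistic link sigma(x) = 1 / (1 + exp(-x)), one common choice satisfying Assumption 2.
    return 1.0 / (1.0 + math.exp(-x))

def compare(tau1, tau2, r_star, link=bradley_terry_link, rng=random.Random(0)):
    # Simulated comparison oracle (Definition 1): returns o = 1 with probability
    # sigma(r*(tau1) - r*(tau2)), indicating tau1 is preferred over tau2.
    p = link(r_star(tau1) - r_star(tau2))
    return 1 if rng.random() < p else 0

In an actual RLHF system the call to compare would be replaced by a query to a human evaluator; the sketch only serves to fix the semantics of the feedback.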

In this work, we also consider (s,a) preferences, where \Pr[(s_{1},a_{1})\succ(s_{2},a_{2})]=\sigma(r^{\star}(s_{1},a_{1})-r^{\star}(s_{2},a_{2})). For notational simplicity, we will use \tau to denote a state-action pair (s,a) (which can be thought of as an incomplete trajectory), so that (s,a) preferences can be seen as a special case of the comparison oracle (Definition 1). See more details in Remark 3.1.

2.2 General preferences

For general preferences, we assume that for every trajectory pair \tau,\tau^{\prime}, the probability that a human evaluator prefers \tau over \tau^{\prime} is

M[\tau,\tau^{\prime}]=\Pr[\tau\succ\tau^{\prime}]. (1)

A general preference may not be realizable by a utility model, so we cannot define the optimal policy in the usual sense. Instead, we follow Dudík et al. (2015) and consider an alternative solution concept, the von Neumann winner (see Definition 5).

2.3 Function approximation

We first introduce the concept of eluder dimension, which has been widely used in RL to measure the difficulty of function approximation.

Definition 2 (eluder dimension).

For any function class \mathcal{F}\subseteq(\mathcal{X}\to\mathbb{R}), its eluder dimension {\rm dim}_{\rm E}(\mathcal{F},\epsilon) is defined as the length of the longest sequence \{x_{1},x_{2},\dots,x_{n}\}\subseteq\mathcal{X} such that there exists \epsilon^{\prime}\geq\epsilon so that for all i\in[n], x_{i} is \epsilon^{\prime}-independent of its prefix sequence \{x_{1},\dots,x_{i-1}\}, in the sense that there exist f_{i},g_{i}\in\mathcal{F} such that

\sqrt{\sum_{j=1}^{i-1}\left((f_{i}-g_{i})(x_{j})\right)^{2}}\leq\epsilon^{\prime}\quad{\rm and}\quad\left|(f_{i}-g_{i})(x_{i})\right|\geq\epsilon^{\prime}.

Intuitively, eluder dimension measures the number of worst-case mistakes one has to make in order to identify an unknown function from the class \mathcal{F}. It is often used as a sufficient condition to prove sample efficiency guarantees for optimism-based algorithms.
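For intuition, the following Python sketch checks \epsilon-independence and greedily grows an independent sequence for a finite function class over a finite domain. It is a brute-force illustration of Definition 2 with \epsilon^{\prime}=\epsilon (so its output only lower-bounds the eluder dimension); the function and variable names are ours, not the paper's.

import itertools
import math

def is_independent(F, prefix, x, eps):
    # x is eps-independent of prefix if some pair f, g in F is close on the prefix
    # but differs by at least eps at x (Definition 2).
    for f, g in itertools.product(F, repeat=2):
        close_on_prefix = math.sqrt(sum((f(xj) - g(xj)) ** 2 for xj in prefix)) <= eps
        if close_on_prefix and abs(f(x) - g(x)) >= eps:
            return True
    return False

def greedy_independent_sequence(F, X, eps):
    # Greedily collect points that are eps-independent of the previously collected ones;
    # the length of the result is a lower bound on dim_E(F, eps).
    seq = []
    for x in X:
        if is_independent(F, seq, x, eps):
            seq.append(x)
    return seq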

As in many works on function approximation, we assume knowledge of a reward function class \mathcal{R} and realizability.

Assumption 1 (Realizability).

r^{\star}\in\mathcal{R}.

We further use \overline{\mathcal{R}}:=\{r+c~|~c\in[-H,0],r\in\mathcal{R}\} to denote the reward function class augmented with a bias term. The inclusion of an additional bias is due to the assumption that preference feedback is based on reward differences, so we can only learn r^{\star} up to a constant. We note that for the d-dimensional linear reward class \mathcal{R}_{\rm linear}, {\rm dim}_{\rm E}(\overline{\mathcal{R}}_{\rm linear})\leq\tilde{O}(d), and that for a general reward class \mathcal{R}, {\rm dim_{E}}(\overline{\mathcal{R}},\epsilon)\leq\mathcal{O}(H\,{\rm dim_{E}}(\mathcal{R},\epsilon/2)^{1.5}/\epsilon). The details can be found in Appendix B.

3 Utility-based RLHF

Given access to comparison oracles instead of scalar rewards, a natural idea is to convert preferences back to scalar reward signals, so that standard RL algorithms can be trained on top of them. In this section, we introduce an efficient reduction from RLHF to standard RL through a preference-to-reward interface. On a high level, the interface provides approximate reward labels for standard RL training, and only queries the comparison oracle when uncertainty is large. The reduction incurs small sample complexity overhead, and the number of queries to human evaluators does not scale with the sample complexity of the RL algorithm using it. Moreover, it solicits feedback from the comparison oracle in a limited number of batches, simplifying the training schedule.

3.1 A preference-to-reward interface

Algorithm 1 Preference-to-Reward (P2R) Interface
1:  \mathcal{B}_{r}\leftarrow\mathcal{R},  \mathcal{D}\leftarrow\{\},  \mathcal{D}_{\rm hist}\leftarrow\{\}
2:  Execute the random policy to collect \tau_{0}
3:  Upon query of trajectory \tau:
4:  if (\hat{r},\tau)\in\mathcal{D}_{\rm hist} then
5:     Return \hat{r}
6:  if \max_{r,r^{\prime}\in\mathcal{B}_{r}}(r(\tau)-r(\tau_{0}))-(r^{\prime}(\tau)-r^{\prime}(\tau_{0}))<2\epsilon_{0} then
7:     \hat{r}\leftarrow r(\tau)-r(\tau_{0}) for an arbitrary r\in\mathcal{B}_{r}
8:     \mathcal{D}_{\rm hist}\leftarrow\mathcal{D}_{\rm hist}\cup\{(\hat{r},\tau)\}
9:  else
10:     Query the comparison oracle m times on \tau and \tau_{0}; compute the average comparison result \bar{o}
11:     \hat{r}\leftarrow\operatorname{argmin}_{x\in[-H,H]}|\sigma(x)-\bar{o}|,  \mathcal{D}\leftarrow\mathcal{D}\cup\{(\hat{r},\tau)\},  \mathcal{D}_{\rm hist}\leftarrow\mathcal{D}_{\rm hist}\cup\{(\hat{r},\tau)\}
12:     Update \mathcal{B}_{r}:
        \mathcal{B}_{r}\leftarrow\left\{r\in\mathcal{B}_{r}:\sum_{(\hat{r},\tau)\in\mathcal{D}}\left(r(\tau)-r(\tau_{0})-\hat{r}\right)^{2}\leq\beta\right\}
13:  Return \hat{r}

The interaction protocol of an RL algorithm with the Preference-to-Reward (P2R) Interface is shown in Fig. 1.

Figure 1: Interaction protocol with the reward learning interface.

P2R maintains a confidence set of rewards \mathcal{B}_{r}. When the RL algorithm wishes to learn the reward label of a trajectory \tau, P2R checks whether \mathcal{B}_{r} approximately agrees on the reward of \tau. If so, it can return a reward label with no queries to the comparison oracle; if not, it will query the comparison oracle on \tau and a fixed trajectory \tau_{0}, and update the confidence set of reward functions. The details of P2R are presented in Algorithm 1.
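To make the interface concrete, here is a minimal Python sketch of P2R for a finite reward class; it assumes a known link function sigma (inverted by grid search over [-H, H]), hashable trajectories, and a comparison oracle such as the one sketched after Definition 1. The class and method names are illustrative, not from the paper.

class P2R:
    # Minimal sketch of the Preference-to-Reward interface (Algorithm 1).
    def __init__(self, reward_class, sigma, tau0, eps0, beta, m, H):
        self.B = list(reward_class)   # confidence set B_r over candidate reward functions
        self.sigma, self.tau0 = sigma, tau0
        self.eps0, self.beta, self.m, self.H = eps0, beta, m, H
        self.D = []                   # labelled data used to shrink B_r
        self.hist = {}                # cache of already-labelled trajectories

    def reward(self, tau, oracle):
        # Return a scalar reward label for trajectory tau (lines 3-13 of Algorithm 1).
        if tau in self.hist:                              # lines 4-5
            return self.hist[tau]
        gaps = [r(tau) - r(self.tau0) for r in self.B]
        if max(gaps) - min(gaps) < 2 * self.eps0:         # line 6: B_r already agrees
            r_hat = gaps[0]                               # line 7
        else:                                             # lines 10-12: query the oracle
            o_bar = sum(oracle(tau, self.tau0) for _ in range(self.m)) / self.m
            grid = [-self.H + i * self.H / 1000.0 for i in range(2001)]
            r_hat = min(grid, key=lambda x: abs(self.sigma(x) - o_bar))  # invert the link
            self.D.append((r_hat, tau))
            self.B = [r for r in self.B
                      if sum((r(t) - r(self.tau0) - y) ** 2 for y, t in self.D) <= self.beta]
        self.hist[tau] = r_hat
        return r_hat

An RL algorithm running on top of P2R simply calls reward(tau, oracle) whenever it needs a scalar label for a trajectory it has just collected, exactly as in the protocol of Fig. 1.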

The performance of P2R depends on the RL algorithm running on top of it. In particular, we define the sample complexity of a standard reward-based RL algorithm \mathscr{A} as follows.

Definition 3.

An RL algorithm \mathscr{A} is g(\epsilon)-robust and has sample complexity \mathcal{C}(\epsilon,\delta) if it can output an \epsilon-optimal policy using \mathcal{C}(\epsilon,\delta) samples with probability at least 1-\delta, even if the reward of each trajectory \tau is perturbed by \varepsilon(\tau) with \|\varepsilon(\tau)\|_{\infty}\leq g(\epsilon).

We would like to note that the requirement of being g(\epsilon)-robust is typically not restrictive. In fact, any tabular RL algorithm with sample complexity guarantees is O(\epsilon/H)-robust with the same sample complexity, while many algorithms with linear function approximation are O(\epsilon/{\rm poly}(d,H))-robust (Jin et al., 2020b; Zanette et al., 2020), again with the same sample complexity. We can further show that any standard RL algorithm with sample complexity \mathcal{C} can be effortlessly converted into an O(1/\mathcal{C})-robust one via the procedure described in Lemma B.3.

Remark 3.1 (Trajectory vs. (s,a) preferences).

So far, we have presented the comparison oracle (Definition 1) and P2R (Algorithm 1) using trajectory rewards. This requires the RL algorithm using P2R to learn from a once-per-trajectory scalar reward. To be compatible with standard RL algorithms, we can formally set \tau to be an "incomplete trajectory" (s,a) in both Definition 1 and Algorithm 1. This does not change the theoretical results regarding the P2R reduction.

Theoretical analysis of P2R

We assume that \sigma(\cdot) is known and satisfies the following regularity assumption.

Assumption 2.

\sigma(0)=\frac{1}{2}; for x\in[-H,H], \sigma^{\prime}(x)\geq\alpha>0.

Assumption 2 is common in the bandits literature (Li et al., 2017) and is satisfied by the popular Bradley-Terry model, where \sigma is the logistic function. It is further motivated by Lemma 3.2 and Lemma 3.3: when \sigma is unknown or has no gradient lower bound, the optimal policy can be impossible to identify.
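For instance, under the Bradley-Terry model with the (unscaled) logistic link, a direct calculation, included here only for illustration, gives the constant \alpha in Assumption 2:

\sigma(x)=\frac{1}{1+e^{-x}},\qquad \sigma^{\prime}(x)=\sigma(x)\big(1-\sigma(x)\big)\geq\sigma(H)\big(1-\sigma(H)\big)\geq\frac{e^{-H}}{4}\quad\text{for }x\in[-H,H],

so Assumption 2 holds with \alpha=\Theta(e^{-H}). This exponential dependence on H is exactly the issue revisited in the first open problem of the conclusion.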

Lemma 3.2.

If \sigma is unknown, even if there are only two possible candidates, the optimal policy is indeterminate.

Lemma 3.3.

If \sigma^{\prime}(\cdot) is not lower bounded, for instance when \sigma(x)=\frac{1}{2}(1+{\rm sign}(x)), the optimal policy is indeterminate.

P2R enjoys the following theoretical guarantee when we choose \epsilon_{0}=g(\epsilon)/2, \beta=\frac{\epsilon_{0}^{2}}{4}, d_{\overline{\mathcal{R}}}={\rm dim}_{\rm E}(\overline{\mathcal{R}},\epsilon_{0}), and m=\Theta\left(\frac{d_{\overline{\mathcal{R}}}\ln(d_{\overline{\mathcal{R}}}/\delta)}{\epsilon_{0}^{2}\alpha^{2}}\right).

Theorem 3.4.

Suppose Assumptions 1 and 2 hold. Let \epsilon_{0}=g(\epsilon)/2, d_{\overline{\mathcal{R}}}={\rm dim}_{\rm E}(\overline{\mathcal{R}},\epsilon_{0}) and m=\Theta\left(\frac{d_{\overline{\mathcal{R}}}\ln(d_{\overline{\mathcal{R}}}/\delta)}{\epsilon_{0}^{2}\alpha^{2}}\right). Suppose that \mathscr{A} is a g(\epsilon)-robust RL algorithm with sample complexity \mathcal{C}(\epsilon,\delta). By running \mathscr{A} with the interface in Algorithm 1, we can learn an \epsilon-optimal policy using \mathcal{C}(\epsilon,\delta) samples and \widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^{2}}{\alpha^{2}g(\epsilon)^{2}}\right) queries to the comparison oracle, with probability at least 1-2\delta.

The full proof of Theorem 3.4 is deferred to Appendix B.3. The key idea of the analysis is to use the fact that r^{\star}\in\mathcal{B}_{r} and the condition in Line 6 to show that the returned reward labels \hat{r} are approximately accurate, and to use properties of the eluder dimension to bound the number of samples for which the comparison oracle is called.

Theorem 3.4 shows that P2R is a provably efficient reduction: any standard RL algorithm \mathscr{A} combined with P2R induces a provably efficient algorithm that learns an approximately optimal policy from preference feedback. The number of required interactions with the MDP environment is identical to that of the standard RL algorithm, and the query complexity only scales with the complexity of learning the reward function.

Comparison with other approaches

A more straightforward reduction would be extensively querying the comparison oracle for every sample generated by the RL algorithm. While this direct reduction would not suffer from increased sample complexity, it encounters two other drawbacks: (1) the oracle complexity, or the total workload for human evaluators, increases proportionally with the sample complexity of the RL algorithm, which can be prohibitive; (2) the RL training would need to pause and wait for human feedback at every iteration, creating substantial scheduling overhead.

Another method to construct reward feedback is to learn a full reward function directly before running the RL algorithm, as in Ouyang et al. (2022). However, without pre-existing high-quality offline datasets, collecting the samples for reward learning would require solving an exploration problem at least as hard as RL itself (Jin et al., 2020a), resulting in significant sample complexity overhead. In P2R, the exploration problem is solved by the RL algorithm using it.

Compared to the two alternative approaches, our reduction achieves the best of both worlds by avoiding sample complexity overhead with a query complexity that does not scale with the sample complexity.

3.2 Instantiations of P2R

When combined with existing sample complexity results, Theorem 3.4 directly implies concrete sample and query complexity bounds for preference-based RL in many settings, with no statistical and small computational overhead.

(s,a) preferences

We first consider the comparison that is based on the immediate reward of the current state-action pair. Here we give tabular MDPs and MDPs with low Bellman-Eluder dimension (Jin et al., 2021) as two examples.

Example 1 (Tabular MDPs).

Our first example is tabular MDPs, whose state and action spaces are finite and small. In this case d_{\overline{\mathcal{R}}}=\widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|). The UCBVI-BF algorithm, proposed by Azar et al. (2017), is a model-based tabular RL algorithm that uses upper-confidence-bound value iteration with Bernstein-Freedman bonuses. UCBVI-BF has sample complexity \mathcal{C}(\epsilon,\delta)=\mathcal{O}\left(H^{3}|\mathcal{S}||\mathcal{A}|/\epsilon^{2}\right) and is \mathcal{O}(\epsilon/H)-robust due to Lemma B.4.

Proposition 3.5.

Algorithm 1 with UCBVI-BF learns an \epsilon-optimal policy of a tabular MDP from preference feedback using \widetilde{\mathcal{O}}\big(\frac{H^{3}|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}}\big) episodes of interaction with the environment and \widetilde{\mathcal{O}}\big(\frac{H^{2}|\mathcal{S}|^{2}|\mathcal{A}|^{2}}{\alpha^{2}\epsilon^{2}}\big) queries to the comparison oracle. The algorithm is computationally efficient.
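To see how Proposition 3.5 follows from Theorem 3.4, one can plug the tabular quantities d_{\overline{\mathcal{R}}}=\widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|) and g(\epsilon)=\Theta(\epsilon/H) into the query bound of Theorem 3.4:

\widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^{2}}{\alpha^{2}g(\epsilon)^{2}}\right)=\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}|^{2}|\mathcal{A}|^{2}}{\alpha^{2}(\epsilon/H)^{2}}\right)=\widetilde{\mathcal{O}}\left(\frac{H^{2}|\mathcal{S}|^{2}|\mathcal{A}|^{2}}{\alpha^{2}\epsilon^{2}}\right),

while the number of episodes is inherited unchanged from UCBVI-BF.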

Example 2 (Low Bellman-eluder dimension).

Bellman-Eluder dimension (Jin et al., 2021) is a complexity measure for RL with function approximation using a Q-value function class \mathcal{F}. It can be shown that a large class of RL problems, including tabular MDPs, linear MDPs, reactive POMDPs and low-Bellman-rank problems, all have small Bellman-Eluder dimension. Furthermore, Jin et al. (2021) designed an algorithm, named GOLF, which (i) first constructs a confidence set for the optimal Q-functions by including all the candidates with small temporal-difference loss, (ii) then optimistically picks a Q-estimate from the confidence set and executes its greedy policy, and (iii) repeats (i) and (ii) using the newly collected data. Assuming that \mathcal{F} satisfies the realizability and completeness properties, GOLF is g(\epsilon)=\Theta\left(\frac{\epsilon}{\sqrt{d_{\rm BE}}H^{2}}\right)-robust with sample complexity \mathcal{C}(\epsilon,\delta)=\widetilde{\mathcal{O}}\left(\frac{d_{\rm BE}H^{4}\ln|\mathcal{F}|}{\epsilon^{2}}\right), where d_{\rm BE} is the Bellman-Eluder dimension of the problem. By applying Theorem 3.4, we immediately have the following result.

Proposition 3.6.

Algorithm 1 with GOLF (Jin et al., 2021) learns an \epsilon-optimal policy of RL problems with Bellman-Eluder dimension d_{\rm BE} in \widetilde{\mathcal{O}}\big(\frac{d_{\rm BE}H^{4}\ln|\mathcal{F}|}{\epsilon^{2}}\big) episodes of interaction with the environment and \widetilde{\mathcal{O}}\big(\frac{d_{\rm BE}d_{\overline{\mathcal{R}}}^{2}H^{2}}{\alpha^{2}\epsilon^{2}}\big) queries to the comparison oracle.

Trajectory-based preferences

When the reward function is trajectory-based, we can instantiate P2R with the OMLE algorithm (Liu et al., 2022a) to solve any model-based RLHF problem with low generalized eluder dimension. In brief, OMLE is an optimism-based algorithm that maintains a model confidence set comprising model candidates with high log-likelihood on previously collected data. In each iteration, the algorithm chooses a model estimate optimistically from the confidence set and executes its greedy policy to collect new data.

Example 3 (Low generalized eluder dimension).

Generalized eluder dimension (Liu et al., 2022a) is a complexity measure for RL with function approximation using a transition function class \mathcal{P}. In Appendix C.1, we show that a simple adaptation of OMLE is g(\epsilon)=\Theta(\frac{\epsilon}{\sqrt{d_{\mathcal{R}}}})-robust with sample complexity \mathcal{C}(\epsilon,\delta)=\widetilde{\mathcal{O}}\left(\frac{H^{2}d_{\mathcal{P}}|\Pi_{\rm exp}|^{2}\ln|\mathcal{P}|}{\epsilon^{2}}+\frac{Hd_{\mathcal{R}}|\Pi_{\rm exp}|}{\epsilon}\right), where d_{\mathcal{P}} denotes the generalized eluder dimension of \mathcal{P}, |\Pi_{\rm exp}| is a parameter in the OMLE algorithm, and d_{\mathcal{R}}=\dim_{\rm E}(\mathcal{R},\epsilon). Plugging this back into Theorem 3.4, we obtain the following result.

Proposition 3.7.

Algorithm 1 with OMLE learns an \epsilon-optimal policy of RL problems with low generalized eluder dimension in \widetilde{\mathcal{O}}\left(\frac{H^{2}d_{\mathcal{P}}|\Pi_{\rm exp}|^{2}\ln|\mathcal{P}|}{\epsilon^{2}}+\frac{Hd_{\mathcal{R}}|\Pi_{\rm exp}|}{\epsilon}\right) episodes of interaction with the environment and \widetilde{\mathcal{O}}\big(\frac{d_{\mathcal{R}}d_{\overline{\mathcal{R}}}^{2}}{\alpha^{2}\epsilon^{2}}\big) queries to the comparison oracle.

Liu et al. (2022a) prove that a wide range of model-based reinforcement learning problems have low generalized eluder dimension d_{\mathcal{P}} and only require a mild |\Pi_{\rm exp}| to run the OMLE algorithm. Examples of such problems include tabular MDPs, factored MDPs, observable POMDPs, and decodable POMDPs. For a formal definition of generalized eluder dimension and more details on the aforementioned bound and examples, we refer interested readers to Appendix C.1 or Liu et al. (2022a). Finally, we remark that it is possible to apply the P2R framework in other settings with different complexity measures, such as DEC (Foster et al., 2021) or GEC (Zhong et al., 2022), by making minor modifications to the corresponding algorithms to ensure robustness.

3.3 P-OMLE: Improved query complexity via white-box modification

While the P2R interface provides a clean and effortless recipe for adapting standard RL algorithms to preference feedback, it incurs a large query cost to the comparison oracle, e.g., cubic dependence on d_{\overline{\mathcal{R}}} in Example 3. We believe that this disadvantage is caused by the black-box nature of P2R, and that better query complexities can be achieved by modifying standard RL algorithms in a white-box manner together with specialized analysis. In this section, we take OMLE (Liu et al., 2022a) as an example and introduce a standalone adaptation to trajectory preference feedback with improved query complexity. We expect that other optimism-based algorithms (e.g., UCBVI-BF and GOLF) can be modified in a similar white-box manner to achieve better query complexity.

The details of the Preference-based OMLE algorithm (P-OMLE) are provided in Algorithm 2. Compared to OMLE with reward feedback (Algorithm 3), the only difference is that the reward confidence set is now computed directly from preference feedback via log-likelihood. Denote by V^{\pi}_{r,p} the expected cumulative reward the learner will receive if she follows policy \pi in model (p,r). In the t-th iteration, the algorithm performs the following steps:

  • Optimistic planning: Find the policy-model pair (\pi,r,p) that maximizes the value function V^{\pi}_{r,p};

  • Data collection: Construct an exploration policy set \Pi_{\rm exp}(\pi^{t}) (the exploration policy set is problem-dependent and can simply be \{\pi^{t}\} in many settings) and collect trajectories by running all policies in \Pi_{\rm exp}(\pi^{t});

  • Confidence set update: Update the confidence set using the updated log-likelihood:

\mathcal{L}(r,\mathcal{D}_{\tt rwd}):=\sum_{(\tau,\tau^{\prime},y)\in\mathcal{D}_{\tt rwd}}\ln\sigma(r(\tau)-r(\tau^{\prime}),y),\qquad\mathcal{L}(p,\mathcal{D}_{\tt trans}):=\sum_{(\pi,\tau)\in\mathcal{D}_{\tt trans}}\ln\mathbb{P}^{\pi}_{p}(\tau).
Algorithm 2 Preference-based OMLE (P-OMLE)
1:  \mathcal{B}^{1}\leftarrow\mathcal{R}\times\mathcal{P}
2:  execute an arbitrary policy to collect trajectory \tau^{0}
3:  for t=1,\ldots,T do
4:     compute (\pi^{t},r^{t},p^{t})=\arg\max_{\pi,(r,p)\in\mathcal{B}^{t}}[V^{\pi}_{r,p}-r(\tau^{0})]
5:     execute \pi^{t} to collect a trajectory \tau
6:     invoke the comparison oracle on \tau and \tau^{0} to get y, and add (\tau,\tau^{0},y) into \mathcal{D}_{\tt rwd}
7:     for each \pi\in\Pi_{\rm exp}(\pi^{t}) do
8:        execute \pi to collect a trajectory \tau, and add (\pi,\tau) into \mathcal{D}_{\tt trans}
9:     update
        \mathcal{B}^{t+1}\leftarrow\big\{(r,p)\in\mathcal{R}\times\mathcal{P}:~\mathcal{L}(r,\mathcal{D}_{\tt rwd})>\max_{r^{\prime}\in\mathcal{R}}\mathcal{L}(r^{\prime},\mathcal{D}_{\tt rwd})-\beta_{\mathcal{R}}
        \text{ and }\mathcal{L}(p,\mathcal{D}_{\tt trans})>\max_{p^{\prime}\in\mathcal{P}}\mathcal{L}(p^{\prime},\mathcal{D}_{\tt trans})-\beta_{\mathcal{P}}\big\}
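As a small illustration of the reward part of the confidence-set update in Algorithm 2, the following Python sketch computes the preference log-likelihood \mathcal{L}(r,\mathcal{D}_{\tt rwd}) and keeps the reward candidates within \beta_{\mathcal{R}} of the maximizer; the finite reward class and the Bradley-Terry link are simplifying assumptions made only for this illustration.

import math

def pref_loglik(r, D_rwd, sigma=lambda x: 1.0 / (1.0 + math.exp(-x))):
    # L(r, D_rwd): log-likelihood of the observed comparisons (tau, tau', y) under reward r.
    total = 0.0
    for tau, tau_prime, y in D_rwd:
        p = sigma(r(tau) - r(tau_prime))          # probability that tau beats tau'
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def update_reward_confidence_set(R, D_rwd, beta_R):
    # Keep reward candidates whose log-likelihood is within beta_R of the maximum,
    # mirroring line 9 of Algorithm 2 (the transition part is updated analogously).
    scores = [pref_loglik(r, D_rwd) for r in R]
    best = max(scores)
    return [r for r, s in zip(R, scores) if s > best - beta_R]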

We have the following guarantee for Algorithm 2.

Theorem 3.8.

Suppose Assumptions 1, 2 and 3 hold. There exist absolute constants c_{1},c_{2}>0 such that for any (T,\delta)\in\mathbb{N}\times(0,1], if we choose \beta_{\mathcal{R}}=c_{1}\ln(|\mathcal{R}|T/\delta) and \beta_{\mathcal{P}}=c_{1}\ln(|\mathcal{P}|T/\delta) in Algorithm 2, then with probability at least 1-\delta,

\sum_{t=1}^{T}[V^{\star}-V^{\pi^{t}}]\leq 2H\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)+\mathcal{O}\left(H\alpha^{-1}\sqrt{d_{\overline{\mathcal{R}}}T\beta_{\mathcal{R}}}\right),

where d_{\overline{\mathcal{R}}}={\rm dim_{E}}(\overline{\mathcal{R}},1/\sqrt{T}).

For problems that satisfy \xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\beta_{\mathcal{P}}|\Pi_{\rm exp}|T}), Theorem 3.8 implies that both the sample complexity and the query complexity for learning an \epsilon-optimal policy are upper bounded by:

\widetilde{\mathcal{O}}\left(\frac{H^{2}d_{\mathcal{P}}|\Pi_{\rm exp}|^{2}\ln|\mathcal{P}|}{\epsilon^{2}}+\frac{H^{2}d_{\overline{\mathcal{R}}}\ln|\mathcal{R}|}{\alpha^{2}\epsilon^{2}}\right).

Compared to the result derived through the P2R interface (Example 3), the sample complexity here is essentially the same, while the query complexity has improved dependence on d_{\overline{\mathcal{R}}} (from cubic to linear). Nonetheless, the query complexity here additionally depends on the complexity of learning the transition model, which is not the case in Example 3.

3.4 Extension to K-wise comparison

In this subsection, we briefly discuss how our results can be extended to K-wise comparison, assuming the following Plackett-Luce (PL) model (Luce, 1959; Plackett, 1975) of preferences over K items.

Definition 4 (Plackett-Luce model).

The oracle takes in K trajectories \tau_{1},\ldots,\tau_{K} and outputs a permutation \phi:[K]\rightarrow[K] with probability \mathbb{P}(\phi)=\prod_{k=1}^{K}\frac{\exp(\eta\cdot r(\tau_{\phi^{-1}(k)}))}{\sum_{t=k}^{K}\exp(\eta\cdot r(\tau_{\phi^{-1}(t)}))}.

Note that when K=2, the above PL model reduces to a pairwise trajectory-type comparison oracle (Definition 1) with \sigma(x)=\exp(\eta x)/(\exp(\eta x)+1), which satisfies Assumption 2 with \alpha=\Theta(\eta\exp(-\eta H)). The PL model satisfies the following useful property, which is a corollary of its internal consistency (Hunter, 2004).

Property 3.9 (Hunter (2004, p396)).

For any disjoint sets \{i_{1},j_{1}\},\ldots,\{i_{k},j_{k}\}\subseteq[K], the following pairwise comparisons are mutually independent: \mathbf{1}(\phi(i_{1})>\phi(j_{1})),\ldots,\mathbf{1}(\phi(i_{k})>\phi(j_{k})), where \phi is a permutation sampled from {\rm PL}(\tau_{1},\ldots,\tau_{K}). Moreover, \mathbf{1}(\phi(i_{m})<\phi(j_{m}))\sim{\rm Ber}(\sigma(r(\tau_{i_{m}})-r(\tau_{j_{m}}))), where \sigma(x)=\exp(\eta x)/(\exp(\eta x)+1).

This property enables "batch querying" the preferences on \lfloor K/2\rfloor pairs (\tau_{1},\tau_{2}) in parallel, which returns \lfloor K/2\rfloor independent pairwise comparison outcomes. This allows us to reduce the number of queries by a factor of \Omega(K) for small K in both Algorithm 1 and Algorithm 3.
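As a sanity check of this batch-querying idea, the following Python sketch samples a ranking from the PL model by the standard sequential procedure (repeatedly drawing the next-ranked item with probability proportional to \exp(\eta r)) and reads off the \lfloor K/2\rfloor disjoint pairwise outcomes used above; the helper names are ours and only illustrate Definition 4 and Property 3.9.

import math
import random

def sample_pl_ranking(rewards, eta=1.0, rng=random.Random(0)):
    # Sample a permutation from the PL model: the k-th ranked item is drawn from the
    # remaining items with probability proportional to exp(eta * reward).
    remaining = list(range(len(rewards)))
    ranking = []                      # ranking[k] = index of the item ranked (k+1)-th
    while remaining:
        weights = [math.exp(eta * rewards[i]) for i in remaining]
        u, acc = rng.random() * sum(weights), 0.0
        for i, w in zip(remaining, weights):
            acc += w
            if u <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking

def batch_pairwise_outcomes(ranking, K):
    # Extract floor(K/2) disjoint comparisons (item 2j vs item 2j+1) from one ranking;
    # by Property 3.9 these outcomes are mutually independent Bernoulli variables.
    pos = {item: k for k, item in enumerate(ranking)}
    return [1 if pos[2 * j] < pos[2 * j + 1] else 0 for j in range(K // 2)]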

Theorem 3.10 (P2R with K-wise comparison).

Suppose that \mathscr{A} is a g(\epsilon)-robust RL algorithm with sample complexity \mathcal{C}(\epsilon,\delta), and assume the same conditions and choice of parameters as in Theorem 3.4. By running \mathscr{A} with the interface in Algorithm 1, we can learn an \epsilon-optimal policy using \mathcal{C}(\epsilon,\delta) samples and \widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^{2}}{\alpha^{2}g(\epsilon)^{2}\min\{K,m\}}\right) queries to the K-wise comparison oracle, with probability at least 1-2\delta.

Theorem 3.10 is a direct consequence of Theorem 3.4: if K\geq 2m, we can obtain m independent comparisons between two trajectories with a single query to the K-wise comparison oracle and therefore reduce the overall query complexity in Theorem 3.4 by a factor of m; otherwise, we can get m independent comparisons by making \mathcal{O}(m/K) queries to the K-wise comparison oracle, which reduces the overall query complexity by a factor of K.

Similarly, we can combine the idea of batch querying with P-OMLE by adding \lfloor K/2\rfloor independent comparisons between \tau and \tau^{0} into \mathcal{D}_{\tt rwd} each time. By slightly modifying the proof of Theorem 3.8, we obtain the following improved sample and query complexity for learning an \epsilon-optimal policy with a K-wise comparison oracle:

\widetilde{\mathcal{O}}\left(\frac{H^{2}d_{\mathcal{P}}|\Pi_{\rm exp}|^{2}\ln|\mathcal{P}|}{\epsilon^{2}}+d_{\overline{\mathcal{R}}}\times\max\left\{\frac{H^{2}\ln|\mathcal{R}|}{K\alpha^{2}\epsilon^{2}},1\right\}\right).

However, the above parallelization benefit of K-wise comparisons might be an artifact of the PL model: it seems improbable that the same human evaluator would independently rank \lfloor K/2\rfloor copies of item A and item B. It remains an interesting problem to develop K-wise comparison models more suitable for RLHF.

4 RLHF From General Preferences

The utility-based approach imposes strong assumptions on human preferences. Not only is the matrix M[\tau,\tau^{\prime}] in (1) assumed to be exactly realizable by \sigma(r(\tau)-r(\tau^{\prime})), but \sigma is also assumed to be known and to have a gradient lower bound. Moreover, the utility-based approach assumes transitivity: if \Pr[\tau_{1}\succ\tau_{2}]\geq 0.5 and \Pr[\tau_{2}\succ\tau_{3}]\geq 0.5, then \Pr[\tau_{1}\succ\tau_{3}]\geq 0.5. However, experiments have shown that human preferences can be intransitive (Tversky, 1969). These limitations of the utility-based approach motivate us to consider general preferences.

A general preference may not be realizable by a utility model, so we cannot define the optimal policy in the usual sense. Instead, we follow Dudík et al. (2015) and consider an alternative solution concept, the von Neumann winner.

Definition 5.

\pi^{\star} is the von Neumann winner policy if (\pi^{\star},\pi^{\star}) is a symmetric Nash equilibrium of the constant-sum game \max_{\pi}\min_{\pi^{\prime}}\mathbb{E}_{\tau\sim\pi,\tau^{\prime}\sim\pi^{\prime}}M[\tau,\tau^{\prime}].

The duality gap of the game is defined as

{\rm DGap}(\pi_{1},\pi_{2}):=\max_{\pi}\mathbb{E}_{\tau\sim\pi,\tau^{\prime}\sim\pi_{2}}M[\tau,\tau^{\prime}]-\min_{\pi}\mathbb{E}_{\tau\sim\pi_{1},\tau^{\prime}\sim\pi}M[\tau,\tau^{\prime}].

We say that \pi is an \epsilon-approximate von Neumann winner if the duality gap of (\pi,\pi) is at most \epsilon. The von Neumann winner has been studied under the name of maximal lotteries in the context of social choice theory (Kreweras, 1965; Fishburn, 1984). It is a natural generalization of the optimal-policy concept to non-utility-based preferences. It is known that

  • Intuitively, the von Neumann winner \pi^{\star} is a randomized policy that "beats" any other policy \pi^{\prime} in the sense that \mathbb{E}_{\tau\sim\pi^{\star},\tau^{\prime}\sim\pi^{\prime}}M[\tau,\tau^{\prime}]\geq 1/2;

  • If the utility-based preference model holds and the transitions are deterministic, the von Neumann winner is the optimal policy;

  • The von Neumann winner is the only solution concept that satisfies population-consistency and composition-consistency in social choice theory (Brandl et al., 2016).
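In the finite case, a von Neumann winner is simply a maximin mixed strategy of the matrix game with payoff M and can be computed by linear programming. The Python sketch below (using scipy.optimize.linprog) illustrates Definition 5 on finitely many alternatives, including an intransitive example; it is an illustration of the solution concept, not an algorithm for the RL setting.

import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(M):
    # Maximin distribution p for the preference matrix M, where M[i, j] = Pr[i beats j].
    # Solves: maximize v subject to (p^T M)_j >= v for all j, sum(p) = 1, p >= 0.
    n = M.shape[0]
    c = np.zeros(n + 1)
    c[-1] = -1.0                                  # variables x = (p, v); minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])     # v - (p^T M)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n + [(0, 1)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]                   # winner distribution and game value

# Intransitive (rock-paper-scissors-like) preferences: no Condorcet winner exists,
# yet the von Neumann winner is well defined (here, the uniform distribution).
M = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
p, value = von_neumann_winner(M)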

Finding the von Neumann winner seems prima facie quite different from standard RL tasks. However, in this section we show how finding the von Neumann winner can be reduced to finding a restricted Nash equilibrium in a special type of Markov game. For preferences based on the final state, we can further reduce the problem to RL in adversarial MDPs.

4.1 A reduction to Markov games

Factorized and independent Markov games (FI-MG).

Consider a two-player zero-sum Markov game with state space \mathcal{S}=\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}, action spaces \mathcal{A}^{(1)} and \mathcal{A}^{(2)} for the two players respectively, transition kernel \{\mathbb{P}_{h}\}_{h\in[H]} and reward function r. We say the Markov game is factorized and independent if the transition kernel factorizes as

\mathbb{P}_{h}(s_{h+1}\mid s_{h},a_{h})=\mathbb{P}_{h}(s^{(1)}_{h+1}\mid s^{(1)}_{h},a_{h}^{(1)})\times\mathbb{P}_{h}(s^{(2)}_{h+1}\mid s^{(2)}_{h},a_{h}^{(2)}),

where s_{h}=(s^{(1)}_{h},s^{(2)}_{h}), s_{h+1}=(s^{(1)}_{h+1},s^{(2)}_{h+1}), and a_{h}=(a_{h}^{(1)},a_{h}^{(2)})\in\mathcal{A}^{(1)}\times\mathcal{A}^{(2)}.

The above definition implies that the Markov game can be partitioned into two MDPs, where the transition dynamics are controlled separately by each player and are completely independent of each other. The only source of correlation between the two MDPs is the reward function, which is permitted to depend on the joint trajectory from both MDPs. Building on this factorization structure, we define the partial trajectory \tau_{i,h}:=(s^{(i)}_{1},a_{1}^{(i)},\ldots,s^{(i)}_{h}), consisting of states of the i-th MDP factor and actions of the i-th player. Furthermore, we define a restricted policy class \Pi_{i} that contains all policies mapping a partial trajectory to a distribution in \Delta_{\mathcal{A}_{i}}, i.e.,

\Pi_{i}:=\left\{\{\pi_{h}\}_{h\in[H]}:~\pi_{h}\in\left((\mathcal{S}^{(i)}\times\mathcal{A}_{i})^{h-1}\times\mathcal{S}^{(i)}\rightarrow\Delta_{\mathcal{A}_{i}}\right)\right\}\text{ for }i\in[2].

The goal is to learn a restricted Nash equilibrium (\mu^{\star},\nu^{\star})\in\Pi_{1}\times\Pi_{2} such that

\mu^{\star}\in\arg\max_{\mu\in\Pi_{1}}\mathbb{E}_{\tau\sim\mu,\tau^{\prime}\sim\nu^{\star}}[r(\tau,\tau^{\prime})]\quad\text{ and }\quad\nu^{\star}\in\arg\min_{\nu\in\Pi_{2}}\mathbb{E}_{\tau\sim\mu^{\star},\tau^{\prime}\sim\nu}[r(\tau,\tau^{\prime})]. (2)

Finding von Neumann winner via learning restricted Nash.

We claim that finding an approximate von Neumann winner can be reduced to learning an approximate restricted Nash equilibrium in an FI-MG. The reduction is straightforward: we simply create a Markov game consisting of two independent copies of the original MDP and let the i-th player's actions control the dynamics in the i-th copy. Such a construction is clearly factorized and independent. Moreover, the restricted policy class \Pi_{i} is equivalent to the universal policy class in the original MDP. We further define the reward function as r(\tau,\tau^{\prime})=M[\tau,\tau^{\prime}], where M is the general preference function. By Definition 5, we immediately obtain the following equivalence relation.

Proposition 4.1.

If (\mu^{\star},\nu^{\star}) is a restricted Nash equilibrium of the above FI-MG, then both \mu^{\star} and \nu^{\star} are von Neumann winners in the original problem.

The problem we now face is how to learn restricted Nash equilibria in FI-MGs. In the following sections, we present two approaches that leverage existing RL algorithms to solve this problem: (i) when the preference function depends solely on the final states of the two input trajectories, each player can independently execute an adversarial MDP algorithm; (ii) for general preference functions, a straightforward adaptation of the OMLE algorithm suffices under certain eluder-type conditions.

4.2 Learning from final-state-based preferences via adversarial MDPs

In this section, we consider a special case where the preference depends solely on the final states of the two input trajectories, i.e., M(\tau,\tau^{\prime})=M(s_{H},s_{H}^{\prime}). (This could be the case, for instance, if the agent's task is to clean a kitchen and it is evaluated based on the kitchen's final configuration.) Given the previous equivalence relation between the von Neumann winner and restricted Nash in FI-MG, one natural idea is to apply no-regret learning algorithms, as it is well known that running two copies of no-regret online learning algorithms against each other can be used to compute Nash equilibria in zero-sum normal-form games. Since this paper focuses on sequential decision making, we need no-regret learning algorithms for adversarial MDPs, which we define below.

Adversarial MDPs.

In the adversarial MDP problem, the algorithm interacts with a series of MDPs with the same unknown transition but adversarially chosen rewards in each episode. Formally, there exists an unknown ground-truth transition function \mathbb{P}=\{\mathbb{P}_{h}\}_{h=1}^{H}. At the beginning of the k-th episode, the algorithm chooses a policy \pi^{k} and then the adversary picks a reward function r^{k}=\{r_{h}^{k}\}_{h=1}^{H}. After that, the algorithm observes a trajectory \tau^{k}=(s_{1}^{k},a_{1}^{k},y_{1}^{k},\ldots,s_{H}^{k},a_{H}^{k},y_{H}^{k}) sampled by executing policy \pi^{k} in the MDP parameterized by \mathbb{P} and r^{k}, where \mathbb{E}[y_{h}^{k}\mid s_{h}^{k},a_{h}^{k}]=r^{k}_{h}(s_{h}^{k},a_{h}^{k}). We define the regret of an adversarial MDP algorithm \mathscr{A} as the gap between the algorithm's expected payoff and the payoff achievable by the best fixed Markov policy:

{\rm Regret}_{K}(\mathscr{A}):=\max_{\pi\in\Pi_{\rm Markov}}\sum_{k=1}^{K}\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}r_{h}^{k}(s_{h},a_{h})\right]-\sum_{k=1}^{K}\mathbb{E}_{\pi^{k}}\left[\sum_{h=1}^{H}r_{h}^{k}(s_{h},a_{h})\right].

Now we explain how to learn a von Neumann winner by running adversarial MDP algorithms. We simply create two copies of the original MDP and instantiate two adversarial MDP algorithms \mathscr{A}_{1} and \mathscr{A}_{2} to control each of them separately. To execute \mathscr{A}_{1} and \mathscr{A}_{2}, we need to provide them with reward feedback in each episode. Denote by s_{H}^{k,(1)} and s_{H}^{k,(2)} the final states \mathscr{A}_{1} and \mathscr{A}_{2} observe in the k-th episode. We feed y^{k}\sim{\rm Ber}(M(s_{H}^{k,(1)},s_{H}^{k,(2)})) into \mathscr{A}_{1} and 1-y^{k} into \mathscr{A}_{2} as their rewards at step H-1, respectively, and all other steps receive zero reward feedback. The formal pseudocode is provided in Algorithm 4 (Appendix E). The following theorem states that as long as the invoked adversarial MDP algorithm has sublinear regret, this scheme learns an approximate von Neumann winner in a sample-efficient manner.
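The scheme just described can be summarized in a few lines of Python. This is a schematic sketch of the reduction rather than the pseudocode of Algorithm 4: the learners' act/update interface, the env_copy objects, and the rollout format are all assumptions made for the illustration, with M denoting the final-state preference function.

import random

def self_play_final_state(learner1, learner2, env_copy1, env_copy2, M, K, H,
                          rng=random.Random(0)):
    # Run two adversarial MDP learners against each other for K episodes. Each learner
    # controls its own copy of the reward-less MDP; the only coupling is the
    # preference-based reward on the two final states.
    policies1, policies2 = [], []
    for k in range(K):
        pi1, pi2 = learner1.act(), learner2.act()       # policies for episode k
        traj1 = env_copy1.rollout(pi1)                  # list of (s_h, a_h) pairs, length H
        traj2 = env_copy2.rollout(pi2)
        s_H1, s_H2 = traj1[-1][0], traj2[-1][0]         # final states of the two copies
        y = 1 if rng.random() < M(s_H1, s_H2) else 0    # y ~ Ber(M(s_H^(1), s_H^(2)))
        # Player 1 receives y and player 2 receives 1 - y at the last step;
        # every other step carries zero reward.
        learner1.update(traj1, rewards=[0.0] * (H - 1) + [float(y)])
        learner2.update(traj2, rewards=[0.0] * (H - 1) + [float(1 - y)])
        policies1.append(pi1)
        policies2.append(pi2)
    # The uniform mixtures over the collected policies approximate a von Neumann winner
    # whenever the learners are no-regret (Theorem 4.2).
    return policies1, policies2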

Theorem 4.2.

Suppose {\rm Regret}_{K}(\mathscr{A})\leq\beta K^{1-c} with probability at least 1-\delta for some c\in(0,1). Then Algorithm 4 with K=(4\beta/\epsilon)^{1/c} outputs an \epsilon-approximate von Neumann winner with probability at least 1-2\delta.

To demonstrate the applicability of Theorem 4.2, we offer two examples where sublinear regret can be achieved in adversarial MDPs via computationally efficient algorithms. The first is adversarial tabular MDPs, where the numbers of states and actions are finite, i.e., |\mathcal{S}|,|\mathcal{A}|<+\infty.

Example 4 (adversarial tabular MDPs).

Jin et al. (2019) proposed an algorithm \mathscr{A} with {\rm Regret}_{K}(\mathscr{A})\leq\widetilde{\mathcal{O}}(\sqrt{|\mathcal{S}|^{2}|\mathcal{A}|H^{3}K}). Plugging it into Theorem 4.2 yields K=\widetilde{\mathcal{O}}(|\mathcal{S}|^{2}|\mathcal{A}|H^{3}/\epsilon^{2}) sample complexity and query complexity for learning an \epsilon-approximate von Neumann winner.

The second example is adversarial linear MDPs, where the numbers of states and actions can be infinitely large while the transition and reward functions admit a special linear structure. Formally, for each h\in[H], there exist a known mapping \phi_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d}, an unknown mapping \psi_{h}:\mathcal{S}\rightarrow\mathbb{R}^{d} and unknown vectors \{\theta^{k}_{h}\}_{k\in[K]}\subseteq\mathbb{R}^{d} such that (i) \mathbb{P}_{h}(s^{\prime}\mid s,a)=\langle\phi_{h}(s,a),\psi_{h}(s^{\prime})\rangle, (ii) r_{h}^{k}(s,a)=\langle\phi_{h}(s,a),\theta_{h}^{k}\rangle, and (iii) \|\phi(s,a)\|_{2}\leq 1, \|\theta_{h}^{k}\|_{2}\leq\sqrt{d}, \|\int_{s^{\prime}}\psi(s^{\prime})f(s^{\prime})ds^{\prime}\|_{2}\leq\sqrt{d} for any (s,a)\in\mathcal{S}\times\mathcal{A} and f:\mathcal{S}\rightarrow[0,1].

Example 5 (adversarial linear MDPs).

Sherman et al. (2023) proposed an algorithm \mathscr{A} with {\rm Regret}_{K}(\mathscr{A})\leq\widetilde{\mathcal{O}}(dH^{2}K^{6/7}) for online learning in adversarial linear MDPs. (Sherman et al. (2023) require the adversarial reward function to be linear in the feature mapping of the linear MDP; in Appendix E, we show that the reward signal constructed in Algorithm 4 satisfies this requirement.) Combining it with Theorem 4.2 yields K=\widetilde{\mathcal{O}}(d^{7}H^{14}/\epsilon^{7}) sample complexity and query complexity for learning an \epsilon-approximate restricted Nash equilibrium.

4.3 Learning from trajectory-based preferences via OMLE

In this section, we consider the more general case where the preference M[\tau,\tau^{\prime}] is allowed to depend arbitrarily on the two input trajectories. As in the utility-based setting, we assume that the learner is provided a priori with a preference class \mathcal{M}\subseteq((\mathcal{S}\times\mathcal{A})^{H}\times(\mathcal{S}\times\mathcal{A})^{H}\rightarrow[0,1]) and a transition function class \mathcal{P}, which contain the ground-truth preference and transition we are interacting with. We have previously established the reduction from learning the von Neumann winner to learning restricted Nash in FI-MG. In addition, learning restricted Nash in FI-MG is in fact a special case of learning Nash equilibria in partially observable Markov games (POMGs). As a result, we can directly adapt the existing OMLE algorithm for learning Nash in POMGs (Liu et al., 2022b) to our setting, with only minor modifications required to learn the von Neumann winner. We defer the algorithmic details of this approach (Algorithm 5) to Appendix F and present only the theoretical guarantee here.

Theorem 4.3.

Suppose Assumption 3 holds. There exist absolute constants c_{1} and c_{2} such that for any (T,\delta)\in\mathbb{N}\times(0,1], if we choose \beta_{\mathcal{M}}=c_{1}\ln(|\mathcal{M}|T/\delta) and \beta_{\mathcal{P}}=c_{1}\ln(|\mathcal{P}|T/\delta) in Algorithm 5, then with probability at least 1-\delta, the duality gap of the output policy of Algorithm 5 is at most

\frac{4\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)}{T}+c_{2}\sqrt{\frac{d_{\mathcal{M}}\beta_{\mathcal{M}}}{T}},

where d_{\mathcal{M}}={\rm dim_{E}}(\mathcal{M},1/T).

It has been proven that a wide range of RL problems admit a regret bound of the form \xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\beta|\Pi_{\rm exp}|T}) with mild d_{\mathcal{P}} and |\Pi_{\rm exp}| (Liu et al., 2022a). These problems include, but are not limited to, tabular MDPs, factored MDPs, linear kernel MDPs, observable POMDPs, and decodable POMDPs. For more details, please refer to Appendix C.1 or Liu et al. (2022a). For problems that satisfy \xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\beta_{\mathcal{P}}|\Pi_{\rm exp}|T}), Theorem C.1 implies a sample complexity of

\widetilde{\mathcal{O}}\left(\frac{d_{\mathcal{P}}|\Pi_{\rm exp}|\ln|\mathcal{P}|}{\epsilon^{2}}+\frac{d_{\mathcal{M}}\ln|\mathcal{M}|}{\epsilon^{2}}\right).

The sample complexity for specific tractable problems can be derived by plugging their precise formulation of ξ\xi (provided in Appendix C.1) into the above bound.

5 Conclusion

This paper studies RLHF via efficient reductions. For utility-based preferences, we introduce a Preference-to-Reward Interface that reduces preference-based RL to standard reward-based RL. Our results are amenable to function approximation and incur no additional sample complexity. For general preferences without underlying rewards, we reduce finding the von Neumann winner to finding restricted Nash equilibria in a class of Markov games. The latter can be solved by adversarial MDP algorithms when the preference depends solely on the final state, and by optimistic MLE for preferences that depend on the whole trajectory.

Our results demonstrate that RLHF, with both utility-based and general preferences, can be readily solved under standard assumptions and by existing algorithmic techniques in the RL theory literature. This suggests that RLHF is not much harder than standard RL in the complexity sense, and need not be more complicated in the algorithmic sense. Consequently, our findings partially answer our main question: RLHF may not be more difficult than standard RL.

Our paper still leaves open a number of interesting problems for future research.

  1.

    For utility-based settings, we currently assume a global gradient lower bound \alpha for the link function (Assumption 2). For a logistic link function, \alpha can decrease exponentially with respect to H and lead to large query complexity bounds. We conjecture that \alpha can be relaxed to local measures of the gradient lower bound, which could alleviate the exponential dependence on H.

  2.

    For non-utility based preferences, our current approach reduces to adversarial MDP or Markov games, which limits its applicability (especially under function approximation). It remains unclear whether it is possible to reduce finding the von Neumann winner to a standard single-player RL problem.
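
To make the first point concrete, here is a short calculation added for illustration (it assumes that trajectory rewards take values in [0,H][0,H], so that the link function is only ever evaluated on reward differences in [H,H][-H,H]): for the logistic link σ(x)=1/(1+ex)\sigma(x)=1/(1+e^{-x}),

\alpha\;=\;\min_{|x|\leq H}\sigma^{\prime}(x)\;=\;\min_{|x|\leq H}\sigma(x)\bigl(1-\sigma(x)\bigr)\;=\;\sigma(H)\bigl(1-\sigma(H)\bigr)\;=\;\frac{e^{-H}}{(1+e^{-H})^{2}}\;\leq\;e^{-H},

so the 1/α21/\alpha^{2} factor appearing in the query complexity bounds can be as large as e2He^{2H}.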

Acknowledgement

This work was partially supported by National Science Foundation Grant NSF-IIS-2107304 and Office of Naval Research Grant N00014-22-1-2253.

References

  • Abramson et al. (2022) Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Jirka Lhotka, Timothy Lillicrap, Alistair Muldal, et al. Improving multimodal interactive agents with reinforcement learning from human feedback. arXiv preprint arXiv:2211.11602, 2022.
  • Ailon et al. (2014) Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864. PMLR, 2014.
  • Arakawa et al. (2018) Riku Arakawa, Sosuke Kobayashi, Yuya Unno, Yuta Tsuboi, and Shin-ichi Maeda. Dqn-tamer: Human-in-the-loop reinforcement learning with intractable feedback. arXiv preprint arXiv:1810.11748, 2018.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Bengs et al. (2021) Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. The Journal of Machine Learning Research, 22(1):278–385, 2021.
  • Berner et al. (2019) Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Brandl et al. (2016) Florian Brandl, Felix Brandt, and Hans Georg Seedig. Consistent probabilistic social choice. Econometrica, 84(5):1839–1880, 2016.
  • Busa-Fekete et al. (2014) Róbert Busa-Fekete, Balázs Szörényi, Paul Weng, Weiwei Cheng, and Eyke Hüllermeier. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine learning, 97:327–351, 2014.
  • Chen et al. (2022) Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, and Liwei Wang. Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pages 3773–3793. PMLR, 2022.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • Ding et al. (2023) Zihan Ding, Yuanpei Chen, Allen Z Ren, Shixiang Shane Gu, Hao Dong, and Chi Jin. Learning a universal human prior for dexterous manipulation from human preference. arXiv preprint arXiv:2304.04602, 2023.
  • Dudík et al. (2015) Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR, 2015.
  • Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International conference on machine learning, pages 49–58. PMLR, 2016.
  • Fishburn (1984) Peter C Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692, 1984.
  • Foster et al. (2021) Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
  • Hunter (2004) David R Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.
  • Ibarz et al. (2018) Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. Advances in neural information processing systems, 31, 2018.
  • Jain et al. (2013) Ashesh Jain, Brian Wojcik, Thorsten Joachims, and Ashutosh Saxena. Learning trajectory preferences for manipulators via iterative improvement. Advances in neural information processing systems, 26, 2013.
  • Jin et al. (2019) Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
  • Jin et al. (2020a) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
  • Jin et al. (2020b) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020b.
  • Jin et al. (2021) Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in neural information processing systems, 34:13406–13418, 2021.
  • Kreweras (1965) Germain Kreweras. Aggregation of preference orderings. In Mathematics and Social Sciences I: Proceedings of the seminars of Menthon-Saint-Bernard, France (1–27 July 1960) and of Gösing, Austria (3–27 July 1962), pages 73–79, 1965.
  • Lee et al. (2023) Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  • Li et al. (2017) Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080. PMLR, 2017.
  • Liu et al. (2022a) Qinghua Liu, Praneeth Netrapalli, Csaba Szepesvari, and Chi Jin. Optimistic mle–a generic model-based algorithm for partially observable sequential decision making. arXiv preprint arXiv:2209.14997, 2022a.
  • Liu et al. (2022b) Qinghua Liu, Csaba Szepesvári, and Chi Jin. Sample-efficient reinforcement learning of partially observable markov games. arXiv preprint arXiv:2206.01315, 2022b.
  • Luce (1959) Robert D Luce. Individual choice behavior: A theoretical analysis. John Wiley and Sons, 1959.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Novoseller et al. (2020) Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick. Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 1029–1038. PMLR, 2020.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pacchiano et al. (2021) Aldo Pacchiano, Aadirupa Saha, and Jonathan Lee. Dueling rl: reinforcement learning with trajectory preferences. arXiv preprint arXiv:2111.04850, 2021.
  • Pereira et al. (2019) Bruno L Pereira, Alberto Ueda, Gustavo Penha, Rodrygo LT Santos, and Nivio Ziviani. Online learning to rank for sequential music recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 237–245, 2019.
  • Plackett (1975) Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202, 1975.
  • Russo and Van Roy (2013) Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
  • Saha and Gopalan (2019) Aadirupa Saha and Aditya Gopalan. Pac battling bandits in the plackett-luce model. In Algorithmic Learning Theory, pages 700–737. PMLR, 2019.
  • Sherman et al. (2023) Uri Sherman, Tomer Koren, and Yishay Mansour. Improved regret for efficient online reinforcement learning with linear function approximation. arXiv preprint arXiv:2301.13087, 2023.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109.
  • Tversky (1969) Amos Tversky. Intransitivity of preferences. Psychological review, 76(1):31, 1969.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Warnell et al. (2018) Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep tamer: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  • Xu et al. (2020) Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, and Artur Dubrawski. Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784–18794, 2020.
  • Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208, 2009.
  • Yue et al. (2012) Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
  • Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
  • Zhan et al. (2023a) Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline reinforcement learning with human feedback. arXiv preprint arXiv:2305.14816, 2023a.
  • Zhan et al. (2023b) Wenhao Zhan, Masatoshi Uehara, Wen Sun, and Jason D Lee. How to query human feedback efficiently in rl? arXiv preprint arXiv:2305.18505, 2023b.
  • Zhong et al. (2022) Han Zhong, Wei Xiong, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang, and Tong Zhang. A posterior sampling framework for interactive decision making. arXiv preprint arXiv:2211.01962, 2022.
  • Zhu et al. (2023) Banghua Zhu, Jiantao Jiao, and Michael I Jordan. Principled reinforcement learning with human feedback from pairwise or kk-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • Zoghi et al. (2014) Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In International conference on machine learning, pages 10–18. PMLR, 2014.

Appendix A Proofs of Impossibility Results

Proof of Lemma 3.2.

Consider two link functions σ1(x):=11+exp(x)\sigma_{1}(x):=\frac{1}{1+\exp(-x)} and

σ2(x):=12+α1x+α2(xα3)I[|x|>α3],\sigma_{2}(x):=\frac{1}{2}+\alpha_{1}x+\alpha_{2}(x-\alpha_{3})\cdot I[|x|>\alpha_{3}],

where α1=1\alpha_{1}=1, α2=−0.484\alpha_{2}=-0.484, and α3=0.3\alpha_{3}=0.3. Consider an MDP with H=2H=2 and initial state s0s_{0}. Suppose that there are three terminal states s1,s2,s3s_{1},s_{2},s_{3}, and that the trajectory preferences depend only on the terminal state in the following way:

Pr[s1s2]=0.7,Pr[s2s3]=0.8,Pr[s1s3]=0.903.\Pr[s_{1}\succ s_{2}]=0.7,\Pr[s_{2}\succ s_{3}]=0.8,\Pr[s_{1}\succ s_{3}]=0.903.

This can be explained by both σ1\sigma_{1} with r(1)={s0:0,s1:0.847,s2:0,s3:1.386}r^{(1)}=\{s_{0}:0,s_{1}:0.847,s_{2}:0,s_{3}:-1.386\}, and σ2\sigma_{2} with r(2)={s0:0,s1:0.2,s2:0,s3:0.3}r^{(2)}=\{s_{0}:0,s_{1}:0.2,s_{2}:0,s_{3}:-0.3\}.

Suppose that state s0s_{0} has two actions aa and bb, which lead to distributions {0.61:s1,0.39:s3}\{0.61:s_{1},0.39:s_{3}\} and {1:s2}\{1:s_{2}\} respectively. Under r(1)r^{(1)} the optimal action is bb (action aa has value 0.61·0.847 − 0.39·1.386 ≈ −0.02 < 0), while under r(2)r^{(2)} the optimal action is aa (action aa has value 0.61·0.2 − 0.39·0.3 = 0.005 > 0). Therefore, without knowledge of the link function, it is impossible to identify the optimal policy even with perfect knowledge of the transition and comparison probabilities. ∎
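
For concreteness, the construction above can be checked numerically with the following sketch (added for illustration; the helper names are our own, and the 0.61/0.39 split is the transition used above):

import math

def sigma1(x):                               # logistic link
    return 1.0 / (1.0 + math.exp(-x))

def sigma2(x, a1=1.0, a2=-0.484, a3=0.3):    # piecewise-linear link
    return 0.5 + a1 * x + a2 * (x - a3) * (abs(x) > a3)

r1 = {"s1": 0.847, "s2": 0.0, "s3": -1.386}  # rewards paired with sigma1
r2 = {"s1": 0.2,   "s2": 0.0, "s3": -0.3}    # rewards paired with sigma2

# Both (link, reward) pairs reproduce the observed comparison probabilities.
for name, sig, r in [("sigma1/r1", sigma1, r1), ("sigma2/r2", sigma2, r2)]:
    print(name,
          round(sig(r["s1"] - r["s2"]), 3),   # ~0.7
          round(sig(r["s2"] - r["s3"]), 3),   # ~0.8
          round(sig(r["s1"] - r["s3"]), 3))   # ~0.903

# Yet the optimal action at s0 differs: action a reaches {0.61: s1, 0.39: s3},
# while action b reaches s2 (reward 0) with probability 1.
for name, r in [("r1", r1), ("r2", r2)]:
    value_a = 0.61 * r["s1"] + 0.39 * r["s3"]
    print(name, "prefers action", "a" if value_a > 0 else "b")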

Proof of Lemma 3.3.

Similarly, consider an MDP with H=2H=2 and initial state s0s_{0}. Suppose that there are three terminal states s1,s2,s3s_{1},s_{2},s_{3}, where we observe that

Pr[s1s2]=1,Pr[s2s3]=1,Pr[s1s3]=1.\Pr[s_{1}\succ s_{2}]=1,\Pr[s_{2}\succ s_{3}]=1,\Pr[s_{1}\succ s_{3}]=1.

Meanwhile, state s0s_{0} has two actions aa and bb, leading to distributions {0.5:s1,0.5:s3}\{0.5:s_{1},0.5:s_{3}\} and {1:s2}\{1:s_{2}\} respectively. In this case, both r(1)={s0:0,s1:0.5,s2:0,s3:−1}r^{(1)}=\{s_{0}:0,s_{1}:0.5,s_{2}:0,s_{3}:-1\} and r(2)={s0:0,s1:1,s2:0,s3:−0.5}r^{(2)}=\{s_{0}:0,s_{1}:1,s_{2}:0,s_{3}:-0.5\} fit the comparison results perfectly. However, under r(1)r^{(1)} the optimal action is bb, while under r(2)r^{(2)} the optimal action is aa. ∎

Appendix B Proofs for P2R

B.1 Properties of eluder dimension

We begin by proving two properties of the eluder dimension of ¯\overline{\mathcal{R}}.

Lemma B.1.

Consider linear:={θx}\mathcal{R}_{\rm linear}:=\{\theta^{\top}x\}, where x𝒳dx\in\mathcal{X}\subseteq\mathbb{R}^{d} and θ2γ\|\theta\|_{2}\leq\gamma. Then there exists an absolute constant CC such that

dimE(¯linear,ϵ)C(d+1)ln(1+γ+Hϵ)+1.{\rm dim}_{\rm E}(\overline{\mathcal{R}}_{\rm linear},\epsilon)\leq C(d+1)\ln\left(1+\frac{\gamma+H}{\epsilon}\right)+1.
Proof.

Note that

θx+c=[θ,c][x,1].\theta^{\top}x+c=[\theta,c]^{\top}[x,1].

Therefore ¯linear\overline{\mathcal{R}}_{\rm linear} can be seen as a (d+1)(d+1)-dimensional linear function class with parameter norm [θ,c]2γ+H\|[\theta,c]\|_{2}\leq\gamma+H. The statement is then a direct corollary of Russo and Van Roy (2013, Proposition 6). ∎

Lemma B.2.

For a general function class \mathcal{R} with domain 𝒳\mathcal{X} and range [0,H][0,H], for ϵ<1\epsilon<1,

dimE(¯,ϵ)𝒪(HdimE(,ϵ/2)1.5/ϵ).{\rm dim_{E}}(\overline{\mathcal{R}},\epsilon)\leq\mathcal{O}(H{\rm dim}_{\rm E}(\mathcal{R},\epsilon/2)^{1.5}/\epsilon).
Proof.

Suppose that there exists a sequence

{ri,ci,xi,yi}i[m],\{r_{i},c_{i},x_{i},y_{i}\}_{i\in[m]},

where rir_{i}\in\mathcal{R}, ci[H,0]c_{i}\in[-H,0], xi𝒳x_{i}\in\mathcal{X} and yiy_{i}\in\mathbb{R}, such that for all i[m]i\in[m]

|ri(xi)+ciyi|ϵ,j<i|ri(xj)+ciyj|2ϵ2.\left|r_{i}(x_{i})+c_{i}-y_{i}\right|\geq\epsilon,\quad\sum_{j<i}\left|r_{i}(x_{j})+c_{i}-y_{j}\right|^{2}\leq\epsilon^{2}.

By definition, dimE(¯,ϵ){\rm dim_{E}}(\overline{\mathcal{R}},\epsilon) is the largest mm so that such a sequence exists. By the pigeon-hole principle, there exists a subset I[m]I\subseteq[m] of size at least k=mH/ϵ0k=\frac{m}{\lceil H/\epsilon_{0}\rceil} and c¯[H,0]\bar{c}\in[-H,0] such that iI\forall i\in I, ci[c¯,c¯+ϵ0]c_{i}\in[\bar{c},\bar{c}+\epsilon_{0}]. Denote the subsequence indexed by II as {ri,ci,xi,yi}i[k]\{r^{\prime}_{i},c^{\prime}_{i},x^{\prime}_{i},y^{\prime}_{i}\}_{i\in[k]}. Define y~i:=yic¯\widetilde{y}_{i}:=y_{i}^{\prime}-\bar{c}. Now consider the sequence {ri,xi,y~i}i[k]\{r^{\prime}_{i},x^{\prime}_{i},\widetilde{y}_{i}\}_{i\in[k]}. By definition, i[k]\forall i\in[k]

|ri(xi)y~i|ϵϵ0.\left|r_{i}^{\prime}(x_{i}^{\prime})-\widetilde{y}_{i}\right|\geq\epsilon-\epsilon_{0}.

It follows that

j<i|ri(xj)y~j|2\displaystyle\sum_{j<i}\left|r_{i}^{\prime}(x_{j})-\widetilde{y}_{j}\right|^{2} =j<i|ri(xj)(yjcj)|2+2j<i(c¯cj)(ri(xj)(yjcj))+j<i(c¯cj)2\displaystyle=\sum_{j<i}\left|r_{i}^{\prime}(x_{j})-(y_{j}^{\prime}-c_{j}^{\prime})\right|^{2}+2\sum_{j<i}(\bar{c}-c_{j}^{\prime})\left(r_{i}^{\prime}(x_{j})-(y_{j}^{\prime}-c_{j}^{\prime})\right)+\sum_{j<i}(\bar{c}-c_{j}^{\prime})^{2}
ϵ2+2kϵ0ϵ+kϵ02.\displaystyle\leq\epsilon^{2}+2\sqrt{k}\epsilon_{0}\epsilon+k\epsilon_{0}^{2}.

We can choose ϵ0:=(Hϵ216m)1/3\epsilon_{0}:=\left(\frac{H\epsilon^{2}}{16m}\right)^{1/3} so that ϵ0ϵ/(4k)\epsilon_{0}\leq\epsilon/(4\sqrt{k}). Then we can guarantee

|ri(xi)y~i|0.5ϵ,j<i|ri(xj)y~j|22ϵ2.\left|r_{i}^{\prime}(x_{i})-\widetilde{y}_{i}\right|\geq 0.5\epsilon,\;\sum_{j<i}\left|r_{i}^{\prime}(x_{j})-\widetilde{y}_{j}\right|^{2}\leq 2\epsilon^{2}.

By Jin et al. (2021, Proposition 43),

k(1+2ϵ2(0.5ϵ)2)dimE(,0.5ϵ).k\leq\left(1+\frac{2\epsilon^{2}}{(0.5\epsilon)^{2}}\right){\rm dim_{E}}(\mathcal{R},0.5\epsilon).

In other words,

mH/ϵ09dimE(,0.5ϵ),\frac{m}{\lceil H/\epsilon_{0}\rceil}\leq 9{\rm dim_{E}}(\mathcal{R},0.5\epsilon),

which gives

dimE(¯,ϵ)216HϵdimE(,0.5ϵ)1.5.{\rm dim_{E}}(\overline{\mathcal{R}},\epsilon)\leq\frac{216H}{\epsilon}\cdot{\rm dim_{E}}(\mathcal{R},0.5\epsilon)^{1.5}.

B.2 Properties related to robustness

Lemma B.3 (Robustness to perturbation).

Any RL algorithm 𝒜\mathscr{A} with sample complexity 𝒞(ϵ,δ)\mathcal{C}(\epsilon,\delta) can be converted to an algorithm 𝒜\mathscr{A}^{\prime} that is 1𝒞(ϵ,δ)\frac{1}{\mathcal{C}(\epsilon,\delta)}-robust with sample complexity 𝒞(ϵ,δ/3)\mathcal{C}(\epsilon,\delta/3).

Proof.

Consider the following modification of 𝒜\mathscr{A}: instead of using reward rr directly, we project rr to +2H+2H and 2H-2H unbiasedly; that is, the algorithm receives the binarized rewards

b(r):={2H:12+r4H,2H:12r4H}.b(r):=\left\{2H:\frac{1}{2}+\frac{r}{4H},-2H:\frac{1}{2}-\frac{r}{4H}\right\}.

By the definition of sample complexity, when using samples of b(r)b(r^{\star}), 𝒜\mathscr{A} outputs a policy π0\pi_{0} that is ϵ\epsilon-optimal for rr^{\star} with probability 1δ1-\delta within K:=𝒞(ϵ,δ)K:=\mathcal{C}(\epsilon,\delta) episodes. Denote the trajectories generated by running 𝒜\mathscr{A} on b(r)b(r^{\star}) by τ1,,τK\tau_{1},\cdots,\tau_{K}. Now suppose that for each τk\tau_{k}, the reward label is perturbed from b(r(τk))b(r^{\star}(\tau_{k})) to b(rk)b(r^{\prime}_{k}) with |rkr(τk)|ϵ:=(𝒞(ϵ,δ))1|r^{\prime}_{k}-r^{\star}(\tau_{k})|\leq\epsilon^{\prime}:=(\mathcal{C}(\epsilon,\delta))^{-1}, and denote the corresponding output policy of 𝒜\mathscr{A} by π\pi^{\prime}. It can be shown that

\left|\ln\left(\frac{\Pr[b(r^{\star}(\tau_{k}))=2H]}{\Pr[b(r^{\prime}_{k})=2H]}\right)\right|\leq\ln\left(1+\frac{\epsilon^{\prime}}{H}\right).

Therefore, the log of the total density ratio satisfies

\displaystyle\sup_{\vec{r}\in\{2H,-2H\}^{K}}\left|\sum_{k=1}^{K}\ln\left(\frac{\Pr[b(r^{\star}(\tau_{k}))=\vec{r}_{k}]}{\Pr[b(r^{\prime}_{k})=\vec{r}_{k}]}\right)\right|\leq K\ln\left(1+\frac{\epsilon^{\prime}}{H}\right)\leq\frac{K\epsilon^{\prime}}{H}\leq 1.

It follows that the ratio between the distributions of the two output policies π0\pi_{0} and π\pi^{\prime} is also bounded by ee. Therefore the probability that π\pi^{\prime} is not ϵ\epsilon-optimal is at most eδ3δe\delta\leq 3\delta. Rescaling δ\delta proves the lemma. ∎
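
To make the binarization step concrete, the following is a minimal sketch (added for illustration; the function name is our own) of the unbiased projection of a reward value onto {2H,2H}\{-2H,2H\}:

import random

def binarize_reward(r: float, H: float) -> float:
    """Return +2H with probability 1/2 + r/(4H) and -2H otherwise,
    so that the expectation equals r for any r in [-2H, 2H]."""
    return 2.0 * H if random.random() < 0.5 + r / (4.0 * H) else -2.0 * H

# Quick sanity check that the binarized labels are unbiased.
H, r = 5.0, 1.7
samples = [binarize_reward(r, H) for _ in range(200_000)]
print(sum(samples) / len(samples))   # close to r = 1.7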

Lemma B.4.

Any tabular RL algorithm 𝒜\mathscr{A} with sample complexity 𝒞(ϵ,δ)\mathcal{C}(\epsilon,\delta) is ϵ/(4H)\epsilon/(4H)-robust with sample complexity 𝒞(ϵ/2,δ)\mathcal{C}(\epsilon/2,\delta).

Proof.

Suppose that 𝒜\mathscr{A} is run on perturbed rewards, where the reward for trajectory τ\tau is changed by ε(τ)\varepsilon(\tau). By definition, using 𝒞(ϵ/2,δ)\mathcal{C}(\epsilon/2,\delta) samples and with probability 1δ1-\delta, it outputs an ϵ/2\epsilon/2-optimal policy π^\hat{\pi} with respect to the perturbed reward function r+εr+\varepsilon, where εϵ/(4H)\|\varepsilon\|_{\infty}\leq\epsilon/(4H). Denote the value function of policy π\pi with respect to reward rr by Vπ,rV^{\pi,r}, and denote the optimal policy with respect to rr by π\pi^{\star}. It holds that for any policy π\pi,

|Vπ,rVπ,r+ε|ϵ/4.\left|V^{\pi,r}-V^{\pi,r+\varepsilon}\right|\leq\epsilon/4.

Therefore

Vπ,rVπ^,r\displaystyle V^{\pi^{\star},r}-V^{\hat{\pi},r} ϵ/2+Vπ,r+εVπ^,r+ε\displaystyle\leq\epsilon/2+V^{\pi^{\star},r+\varepsilon}-V^{\hat{\pi},r+\varepsilon}
ϵ/2+ϵ/2=ϵ.\displaystyle\leq\epsilon/2+\epsilon/2=\epsilon.

In other words π^\hat{\pi} is indeed an ϵ\epsilon-optimal policy with respect to the unperturbed rewards rr. ∎

B.3 Proof of Theorem 3.4

Lemma B.5.

With m=Θ(ln(1/δ)α2ϵ2)m=\Theta\left(\frac{\ln(1/\delta^{\prime})}{\alpha^{2}\epsilon^{\prime 2}}\right), for each τ\tau such that the comparison oracle is queried, with probability 1δ1-\delta^{\prime},

|r^(τ)(r(τ)r(τ0))|ϵ.\left|\hat{r}(\tau)-(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq\epsilon^{\prime}.
Proof.

Suppose that the comparison oracle is queried for τ\tau and the average outcome is o¯\bar{o}. By Hoeffding bound, with probability 1δ1-\delta^{\prime},

|o¯σ(r(τ)r(τ0))|ln(2/δ)m.\left|\bar{o}-\sigma(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq\sqrt{\frac{\ln(2/\delta^{\prime})}{m}}.

Since r^(τ)=argminx[H,H]|σ(x)o¯|\hat{r}(\tau)=\operatorname{argmin}_{x\in[-H,H]}|\sigma(x)-\bar{o}|,

|σ(r^(τ))o¯||σ(r(τ)r(τ0))o¯|ln(2/δ)m.\left|\sigma(\hat{r}(\tau))-\bar{o}\right|\leq\left|\sigma(r^{\star}(\tau)-r^{\star}(\tau_{0}))-\bar{o}\right|\leq\sqrt{\frac{\ln(2/\delta^{\prime})}{m}}.

It follows that

|σ(r^(τ))σ(r(τ)r(τ0))|2ln(2/δ)m.\left|\sigma(\hat{r}(\tau))-\sigma(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq 2\sqrt{\frac{\ln(2/\delta^{\prime})}{m}}.

By Assumption 2,

|r^(τ)(r(τ)r(τ0))|1α|σ(r^(τ))σ(r(τ)r(τ0))|2αln(2/δ)mϵ.\left|\hat{r}(\tau)-(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq\frac{1}{\alpha}\cdot\left|\sigma(\hat{r}(\tau))-\sigma(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq\frac{2}{\alpha}\sqrt{\frac{\ln(2/\delta^{\prime})}{m}}\leq\epsilon^{\prime}.
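
The estimation step analyzed above can be sketched as follows (for illustration only; query_oracle is a hypothetical stand-in for one call to the comparison oracle on (τ,τ0)(\tau,\tau_{0}), and the link is inverted by a simple grid search):

import math
import random

def estimate_reward_gap(query_oracle, sigma, m, H):
    """Estimate r*(tau) - r*(tau_0) from m comparisons against tau_0.
    query_oracle() returns 1 if tau is preferred and 0 otherwise, i.e. a
    Bernoulli(sigma(r*(tau) - r*(tau_0))) sample."""
    o_bar = sum(query_oracle() for _ in range(m)) / m
    # Mirror r_hat = argmin_{x in [-H, H]} |sigma(x) - o_bar| via a fine grid.
    grid = [-H + 2.0 * H * i / 10_000 for i in range(10_001)]
    return min(grid, key=lambda x: abs(sigma(x) - o_bar))

# Illustration with a logistic link and a true reward gap of 0.8.
sigma = lambda x: 1.0 / (1.0 + math.exp(-x))
oracle = lambda: int(random.random() < sigma(0.8))
print(estimate_reward_gap(oracle, sigma, m=50_000, H=5.0))   # close to 0.8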

Lemma B.6.

Set m=Θ(dln(d/δ)ϵ02α2)m=\Theta(\frac{d\ln(d/\delta)}{\epsilon_{0}^{2}\alpha^{2}}) and β=ϵ024\beta=\frac{\epsilon_{0}^{2}}{4}. With probability 1δ1-\delta, the number of samples on which the comparison oracle is queried is at most dimE(¯,ϵ0){\rm dim_{E}}(\overline{\mathcal{R}},\epsilon_{0}).

Proof.

Define r~(τ):=r(τ)r(τ0)\widetilde{r}^{\star}(\tau):=r^{\star}(\tau)-r^{\star}(\tau_{0}), r~(τ):=r(τ)r(τ0)\widetilde{r}(\tau):=r(\tau)-r(\tau_{0}).

When the comparison oracle is queried, maxr,rr([r(τ)r(τ0)][r(τ)r(τ0)])>2ϵ0\max_{r,r^{\prime}\in\mathcal{B}_{r}}\left([r(\tau)-r(\tau_{0})]-[r^{\prime}(\tau)-r^{\prime}(\tau_{0})]\right)>2\epsilon_{0} which means that either |r~(τ)r~(τ)|>ϵ0|\widetilde{r}(\tau)-\widetilde{r}^{\star}(\tau)|>\epsilon_{0} or |r~(τ)r~(τ)|>ϵ0|\widetilde{r}^{\prime}(\tau)-\widetilde{r}^{\star}(\tau)|>\epsilon_{0}. Suppose that there are KK trajectories which require querying comparison oracle. Suppose that the dataset is composed of

𝒟={(r^1,τ1),,(r^K,τK)},\mathcal{D}=\{(\hat{r}_{1},\tau_{1}),\cdots,(\hat{r}_{K},\tau_{K})\},

and r~1,,r~K¯\widetilde{r}_{1},\ldots,\widetilde{r}_{K}\in\overline{\mathcal{R}} are the functions that satisfy |r~k(τk)r~(τk)|>ϵ0|\widetilde{r}_{k}(\tau_{k})-\widetilde{r}^{\star}(\tau_{k})|>\epsilon_{0}. We now verify that (r~1,τ1,r~2,τ2,,r~K,τK)(\widetilde{r}_{1},\tau_{1},\widetilde{r}_{2},\tau_{2},\cdots,\widetilde{r}_{K},\tau_{K}) is an eluder sequence (with respect to function class ¯\overline{\mathcal{R}}).

The confidence set condition implies

k<i(r~i(τk)r^k)2β.\sum_{k<i}\left(\widetilde{r}_{i}(\tau_{k})-\hat{r}_{k}\right)^{2}\leq\beta.

With probability 1δ1-\delta, kK2d\forall k\leq K\land 2d, |r^kr~(τk)|ϵ04d\left|\hat{r}_{k}-\widetilde{r}^{\star}(\tau_{k})\right|\leq\frac{\epsilon_{0}}{4\sqrt{d}} (by Lemma B.5). Then for any iki\leq k

ki(r~i(τk)r~(τk))2\displaystyle\sum_{k\leq i}\left(\widetilde{r}_{i}(\tau_{k})-\widetilde{r}^{\star}(\tau_{k})\right)^{2} ki(r~i(τk)r^k)2+ki(r~i(τk)r^k+r~i(τk)r~(τk))(r^kr~(τk))\displaystyle\leq\sum_{k\leq i}\left(\widetilde{r}_{i}(\tau_{k})-\hat{r}_{k}\right)^{2}+\sum_{k\leq i}\left(\widetilde{r}_{i}(\tau_{k})-\hat{r}_{k}+\widetilde{r}_{i}(\tau_{k})-\widetilde{r}^{\star}(\tau_{k})\right)\cdot\left(\hat{r}_{k}-\widetilde{r}^{\star}(\tau_{k})\right)
β+2Kβϵ04d+K(ϵ04d)2ϵ02,\displaystyle\leq\beta+2\sqrt{K\beta}\cdot\frac{\epsilon_{0}}{4\sqrt{d}}+K\left(\frac{\epsilon_{0}}{4\sqrt{d}}\right)^{2}\leq\epsilon_{0}^{2},

as long as K2dK\leq 2d. In other words, with probability 1δ1-\delta, (r~1,τ1,r~2,τ2,,r~K2d,τK2d)(\widetilde{r}_{1},\tau_{1},\widetilde{r}_{2},\tau_{2},\cdots,\widetilde{r}_{K\land 2d},\tau_{K\land 2d}) is an eluder sequence, which by Definition 2 cannot have length more than d:=dimE(¯,ϵ0)d:={\rm dim_{E}}(\overline{\mathcal{R}},\epsilon_{0}). It follows that KdimE(¯,ϵ0).K\leq{\rm dim_{E}}(\overline{\mathcal{R}},\epsilon_{0}).

Lemma B.7.

With probability 1δ1-\delta, rrr^{\star}\in\mathcal{B}_{r} throughout the execution of Algorithm 1.

Proof.

By Lemma B.5 and Lemma B.6, with probability 1δ1-\delta, at every step

(r^,τ)𝒟(r(τ)r(τ0)r^)2d(ϵ04d)2ϵ024=β.\sum_{(\hat{r},\tau)\in\mathcal{D}}\left(r^{\star}(\tau)-r^{\star}(\tau_{0})-\hat{r}\right)^{2}\leq d\cdot\left(\frac{\epsilon_{0}}{4\sqrt{d}}\right)^{2}\leq\frac{\epsilon_{0}^{2}}{4}=\beta.

Lemma B.8.

With probability 1δ1-\delta, for each τ\tau in Line 3 of Algorithm 1, the returned reward r^\hat{r} satisfies

|r^(r(τ)r(τ0))|2ϵ0.\left|\hat{r}-(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq 2\epsilon_{0}.
Proof.

We already know that this is true for τ\tau such that the comparison oracle is queried. However, if it is not queried, then

\max_{r,r^{\prime}\in\mathcal{B}_{r}}\left([r(\tau)-r(\tau_{0})]-[r^{\prime}(\tau)-r^{\prime}(\tau_{0})]\right)<2\epsilon_{0}.

Since rrr^{\star}\in\mathcal{B}_{r} (by Lemma B.7), this immediately implies |r^(r(τ)r(τ0))|2ϵ0.\left|\hat{r}-(r^{\star}(\tau)-r^{\star}(\tau_{0}))\right|\leq 2\epsilon_{0}.

Proof of Theorem 3.4.

Choose ϵ0:=g(ϵ)/2\epsilon_{0}:=g(\epsilon)/2, β=ϵ024\beta=\frac{\epsilon_{0}^{2}}{4} and m=Θ(d¯ln(d¯/δ)ϵ02α2)m=\Theta(\frac{d_{\overline{\mathcal{R}}}\ln(d_{\overline{\mathcal{R}}}/\delta)}{\epsilon_{0}^{2}\alpha^{2}}).

By Lemma B.8 (rescaling δ\delta), with probability 1δ1-\delta, the reward returned by the reward interface is g(ϵ)g(\epsilon)-close to r~:=rr(τ0)\widetilde{r}^{\star}:=r^{\star}-r^{\star}(\tau_{0}) throughout the execution of the algorithm. By the definition of sample complexity, with probability 1δ1-\delta, the policy returned by 𝒜\mathscr{A} is ϵ\epsilon-optimal for r~\widetilde{r}^{\star}, which implies that it is also ϵ\epsilon-optimal for rr^{\star}. The number of samples (episodes) is bounded by 𝒞(ϵ,δ)\mathcal{C}(\epsilon,\delta). Finally by Lemma B.6, the number of queries to the comparison oracle is at most

dimE(¯,ϵ0)m𝒪~(d¯2g2(ϵ)α2).{\rm dim}_{\rm E}(\overline{\mathcal{R}},\epsilon_{0})\cdot m\leq\widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^{2}}{g^{2}(\epsilon)\alpha^{2}}\right).
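
Since Algorithm 1 itself is not reproduced in this appendix, the following is only a minimal sketch of its query rule as used in the proofs above (illustrative; the finite-class representation and the helper names reward_class and estimate_gap are assumptions of the sketch):

def p2r_reward(tau, tau0, reward_class, dataset, eps0, estimate_gap):
    """Sketch of the Preference-to-Reward interface's query rule.
    `reward_class` is a finite list of candidate reward functions r(.),
    `dataset` is a list of (trajectory, estimated reward gap) pairs, and
    `estimate_gap(tau)` wraps the m-sample comparison estimator of Lemma B.5."""
    beta = eps0 ** 2 / 4.0
    # Confidence set B_r: reward functions consistent with past estimates.
    B_r = [r for r in reward_class
           if sum((r(t) - r(tau0) - r_hat) ** 2 for t, r_hat in dataset) <= beta]
    # Width of the confidence set on the current trajectory (by Lemma B.7,
    # B_r contains r* with high probability, hence it is non-empty).
    gaps = [r(tau) - r(tau0) for r in B_r]
    if max(gaps) - min(gaps) > 2.0 * eps0:
        r_hat = estimate_gap(tau)      # query the comparison oracle m times
        dataset.append((tau, r_hat))
        return r_hat
    # Otherwise every member of the confidence set is already 2*eps0-accurate.
    return gaps[0]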

Appendix C OMLE with Perturbed Reward

C.1 Algorithm details and theoretical guarantees

In this section, we modify the optimistic MLE (OMLE) algorithm (Liu et al., 2022a) to deal with unknown reward functions. The adapted algorithm can then be used with Algorithm 1 as described in Example 3. OMLE is a model-based algorithm that requires a model class 𝒫\mathcal{P} in addition to the reward class \mathcal{R}. At a high level, OMLE maintains a joint confidence set \mathcal{B} in ×𝒫\mathcal{R}\times\mathcal{P}. In the tt-th iteration, the algorithm performs the following steps:

  • Optimistic planning: Find the policy-model tuple (π,r,p)(\pi,r,p) that maximizes the value function Vr,pπV^{\pi}_{r,p}, which is the expected cumulative reward the learner will receive if she follows policy π\pi in a model with transition pp and reward rr;

  • Data collection: Construct an exploration policy set Πexp(πt)\Pi_{\rm exp}(\pi^{t}) (the exploration policy set is problem-dependent and can simply be {πt}\{\pi^{t}\} in many settings), and collect trajectories by running all policies in Πexp(πt)\Pi_{\rm exp}(\pi^{t});

  • Confidence set update: Update the confidence set using the updated log-likelihood.

The main modification we make is to the construction of the confidence set for rr, since the original OMLE algorithm assumes that rr is known. The pseudocode of our adapted OMLE algorithm is provided in Algorithm 3.

Algorithm 3 Optimistic MLE with ϵ\epsilon^{\prime}-Perturbed Reward Feedback
1:  1×𝒫\mathcal{B}^{1}\leftarrow\mathcal{R}\times\mathcal{P}
2:  Execute an arbitrary policy to collect trajectory τ0\tau^{0}
3:  for t=1,,Tt=1,\ldots,T do
4:     Compute (πt,rt,pt)=argmaxπ,(r,p)tVr,pπ(\pi^{t},r^{t},p^{t})=\arg\max_{\pi,~{}(r,p)\in\mathcal{B}^{t}}V^{\pi}_{r,p}
5:     Execute πt\pi^{t} to collect a trajectory τ\tau, receive reward r^\hat{r}, add(τ,r^)(\tau,\hat{r}) into 𝒟𝚛𝚠𝚍\mathcal{D}_{\tt rwd}
6:     for each πΠexp(πt)\pi\in\Pi_{\rm exp}(\pi^{t}) do
7:        Execute π\pi to collect a trajectory τ\tau, add (π,τ)(\pi,\tau) into 𝒟𝚝𝚛𝚊𝚗𝚜\mathcal{D}_{\tt trans}
8:     Update
t+1{(r,p)\displaystyle\mathcal{B}^{t+1}\leftarrow\big{\{}(r,p) ×𝒫:max(τ,r^)𝒟𝚛𝚠𝚍|r(τ)r^|ϵ\displaystyle\in\mathcal{R}\times\mathcal{P}:~{}\max_{(\tau,\hat{r})\in\mathcal{D}_{\tt rwd}}|r(\tau)-\hat{r}|\leq\epsilon^{\prime}
and (p,𝒟𝚝𝚛𝚊𝚗𝚜)>maxp𝒫(p,𝒟𝚝𝚛𝚊𝚗𝚜)β𝒫}\displaystyle\text{ and }\mathcal{L}(p,\mathcal{D}_{\tt trans})>\max_{p^{\prime}\in\mathcal{P}}\mathcal{L}(p^{\prime},\mathcal{D}_{\tt trans})-\beta_{\mathcal{P}}\big{\}}

In Line 8, the log-likelihood function is defined as

(p,𝒟𝚝𝚛𝚊𝚗𝚜):=(π,τ)𝒟𝚝𝚛𝚊𝚗𝚜lnpπ(τ),\mathcal{L}(p,\mathcal{D}_{\tt trans}):=\sum_{(\pi,\tau)\in\mathcal{D}_{\tt trans}}\ln\mathbb{P}^{\pi}_{p}(\tau), (3)

where pπ(τ)\mathbb{P}^{\pi}_{p}(\tau) denotes the probability of observing trajectory τ\tau in a model with transition function pp under policy π\pi. Other algorithmic parameters TT and β𝒫\beta_{\mathcal{P}} are chosen in the statement of Theorem C.1.
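
As an illustration of the confidence-set update in Line 8 and the log-likelihood in Equation (3), here is a minimal sketch (it assumes finite model and reward classes and a hypothetical p.prob(tau, pi) interface returning pπ(τ)\mathbb{P}^{\pi}_{p}(\tau)):

import math

def log_likelihood(p, D_trans):
    """Log-likelihood of transition model p on the transition dataset,
    matching Equation (3); p.prob(tau, pi) is a hypothetical interface."""
    return sum(math.log(p.prob(tau, pi)) for pi, tau in D_trans)

def update_confidence_set(reward_class, model_class, D_rwd, D_trans,
                          eps_prime, beta_P):
    """Sketch of the confidence-set update of Algorithm 3: keep rewards that
    fit every perturbed reward label up to eps_prime, and transitions within
    beta_P of the maximum log-likelihood."""
    max_ll = max(log_likelihood(p, D_trans) for p in model_class)
    B_R = [r for r in reward_class
           if all(abs(r(tau) - r_hat) <= eps_prime for tau, r_hat in D_rwd)]
    B_P = [p for p in model_class
           if log_likelihood(p, D_trans) > max_ll - beta_P]
    return [(r, p) for r in B_R for p in B_P]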

To obtain sample efficiency guarantees for OMLE, the following assumption is needed.

Assumption 3 (Generalized eluder-type condition).

There exist a real number d𝒫+d_{\mathcal{P}}\in\mathbb{R}^{+} and a function ξ\xi such that: for any (T,Δ)×+(T,\Delta)\in\mathbb{N}\times\mathbb{R}^{+}, transitions {pt}t[T]\{p^{t}\}_{t\in[T]} and policies {πt}t[T]\{\pi^{t}\}_{t\in[T]}, we have

t[T],τ=1t1πΠexp(πτ)dTV2(ptπ,pπ)Δt=1TdTV(ptπt,pπt)ξ(d𝒫,T,Δ,|Πexp|).\forall t\in[T],\sum_{\tau=1}^{t-1}\sum_{\pi\in\Pi_{\rm exp}(\pi^{\tau})}d_{\rm TV}^{2}(\mathbb{P}_{p^{t}}^{\pi},\mathbb{P}_{p^{\star}}^{\pi})\leq\Delta~{}\Longrightarrow~{}\sum_{t=1}^{T}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p^{t}},\mathbb{P}^{\pi^{t}}_{p^{\star}})\leq\xi(d_{\mathcal{P}},T,\Delta,|\Pi_{\rm exp}|).

Here pπ\mathbb{P}^{\pi}_{p} is the distribution of trajectories under model pp and policy π\pi, while |Πexp|:=maxπ|Πexp(π)||\Pi_{\rm exp}|:=\max_{\pi}|\Pi_{\rm exp}(\pi)| is the largest possible number of exploration policies.

Assumption 3 shares a similar intuition with the pigeon-hole principle and the elliptical potential lemma, which play important roles in the analysis of tabular MDPs and linear bandits respectively. In particular, the ξ\xi function measures the worst-case growth rate of the cumulative error and is the central quantity characterizing the hardness of the problem. Liu et al. (2022a) prove that a wide range of RL problems satisfy Assumption 3 with moderate d𝒫d_{\mathcal{P}}, |Πexp||\Pi_{\rm exp}| and ξ=𝒪~(d𝒫Δ|Πexp|T)\xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\Delta|\Pi_{\rm exp}|T}), including tabular MDPs, factored MDPs, observable POMDPs and decodable POMDPs (see Liu et al. (2022a) for more details).

Theorem C.1.

Suppose Assumptions 1 and 3 hold. There exist absolute constants c1,c2>0c_{1},c_{2}>0 such that for any (T,δ)×(0,1](T,\delta)\in\mathbb{N}\times(0,1], if we choose β𝒫=c1ln(|𝒫||Πexp|T/δ)\beta_{\mathcal{P}}=c_{1}\ln(|\mathcal{P}||\Pi_{\rm exp}|T/\delta) in Algorithm 3, then with probability at least 1δ1-\delta we have

t=1T[VVπt]\displaystyle\sum_{t=1}^{T}[V^{\star}-V^{\pi^{t}}]\leq 2Hξ(d𝒫,T,c2β𝒫,|Πexp|)\displaystyle 2H\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)
+𝒪(minω>0{Td(ω)ϵ+min{d(ω),T}H+Tω})+𝒪(HTlnδ1)\displaystyle+\mathcal{O}\left(\min_{\omega>0}\left\{T\sqrt{d_{\mathcal{R}}(\omega)}\epsilon^{\prime}+\min\{d_{\mathcal{R}}(\omega),T\}H+T\omega\right\}\right)+\mathcal{O}(H\sqrt{T\ln\delta^{-1}})

where d=dimE(,ω)d_{\mathcal{R}}={\rm dim_{E}}(\mathcal{R},\omega).

For problems that satisfy ξ=𝒪~(d𝒫β𝒫|Πexp|T)\xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\beta_{\mathcal{P}}|\Pi_{\rm exp}|T}), Theorem C.1 implies a sample complexity of

𝒪~(H2d𝒫|Πexp|2ln|𝒫|ϵ2+Hd|Πexp|ϵ).\widetilde{\mathcal{O}}\left(\frac{H^{2}d_{\mathcal{P}}|\Pi_{\rm exp}|^{2}\ln|\mathcal{P}|}{\epsilon^{2}}+\frac{Hd_{\mathcal{R}}|\Pi_{\rm exp}|}{\epsilon}\right).

for learning an 𝒪(ϵ)\mathcal{O}(\epsilon)-optimal policy, provided that we choose ω=ϵ\omega=\epsilon and ϵ=ϵ/d\epsilon^{\prime}=\epsilon/\sqrt{d_{\mathcal{R}}}. The sample complexity for specific tractable problems can be found in Appendix C.2.

C.2 Examples satisfying generalized eluder-type condition

In this section, we present several canonical examples that satisfy the generalized eluder-type condition with ξ=𝒪~(d𝒫β𝒫|Πexp|T)\xi=\widetilde{\mathcal{O}}(\sqrt{d_{\mathcal{P}}\beta_{\mathcal{P}}|\Pi_{\rm exp}|T}). More examples can be found in Liu et al. (2022a).

Example 6 (Finite-precision Factored MDPs).

In factored MDPs, each state ss consists of mm factors denoted by (s[1],,s[m])𝒳m(s[1],\cdots,s[m])\in\mathcal{X}^{m}. The transition structure is also factored as

h(sh+1|sh,ah)=i=1mi(sh+1[i]|sh[pai],ah),\mathbb{P}_{h}(s_{h+1}|s_{h},a_{h})=\prod_{i=1}^{m}\mathbb{P}^{i}(s_{h+1}[i]|s_{h}[{\rm pa}_{i}],a_{h}),

where pai[m]{\rm pa}_{i}\subseteq[m] is the parent set of ii. The reward function is similarly factored:

rh(s):=i=1mrhi(s[i]).r_{h}(s):=\sum_{i=1}^{m}r^{i}_{h}(s[i]).

Define B:=i=1m|𝒳|paiB:=\sum_{i=1}^{m}|\mathcal{X}|^{{\rm pa}_{i}}. Factored MDPs satisfy (with |Πexp|=1|\Pi_{\rm exp}|=1 and d𝒫=m2|𝒜|2B2poly(H)d_{\mathcal{P}}=m^{2}|\mathcal{A}|^{2}B^{2}{\rm poly}(H))

ξ(d𝒫,T,Δ,|Πexp|)d𝒫ΔT+AB2poly(H).\xi(d_{\mathcal{P}},T,\Delta,|\Pi_{\rm exp}|)\leq\sqrt{d_{\mathcal{P}}\Delta T}+AB^{2}{\rm poly}(H).

Moreover, ln|𝒫|mbAB\ln|\mathcal{P}|\leq mbAB and ln||mb|𝒳|\ln|\mathcal{R}|\leq mb|\mathcal{X}|, where bb is the number of bits needed to specify each entry of (sh+1sh,ah)\mathbb{P}(s_{h+1}\mid s_{h},a_{h}) or rh(s)r_{h}(s) (we can deal with continuous model classes by using the bracketing number instead of the cardinality in Theorem C.1). Therefore Theorem C.1 implies a sample complexity of

poly(H)𝒪~(bm3|𝒜|3B3ϵ2+bm2|𝒳|α2ϵ2).{\rm poly}(H)\cdot\widetilde{\mathcal{O}}\left(\frac{bm^{3}|\mathcal{A}|^{3}B^{3}}{\epsilon^{2}}+\frac{bm^{2}|\mathcal{X}|}{\alpha^{2}\epsilon^{2}}\right).
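
For intuition, the factored structure above can be sketched as follows (illustrative only; the per-factor tables P and r and the parents map are made-up data structures, not part of the original algorithm):

def factored_transition_prob(P, parents, s, a, s_next):
    """P_h(s_next | s, a) as a product over factors i of
    P^i(s_next[i] | s[pa_i], a); P[i] maps (parent values, action) to a
    distribution over the next value of factor i."""
    prob = 1.0
    for i, pa in parents.items():
        parent_vals = tuple(s[j] for j in pa)
        prob *= P[i][(parent_vals, a)][s_next[i]]
    return prob

def factored_reward(r, s):
    """r_h(s) as a sum over factors i of r^i_h(s[i]), at a fixed step h."""
    return sum(r[i][s[i]] for i in range(len(s)))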

To proceed, we define partially observable Markov decision processes (POMDPs).

Definition 6.

In a POMDP, states are hidden from the learner and only observations emitted by the states can be observed. Formally, at each step h[H]h\in[H], the learner observes oh𝕆h(sh)o_{h}\sim\mathbb{O}_{h}(\cdot\mid s_{h}), where shs_{h} is the current state and 𝕆h(sh)\mathbb{O}_{h}(\cdot\mid s_{h}) is the distribution over observations conditioned on the current state being shs_{h}. Then the learner takes action aha_{h} and the environment transitions to sh+1h(sh,ah)s_{h+1}\sim\mathbb{P}_{h}(\cdot\mid s_{h},a_{h}).

Liu et al. (2022a) prove that the following subclasses of POMDPs satisfy Assumption 3 with moderate d𝒫d_{\mathcal{P}} and |Πexp||\Pi_{\rm exp}|.

Example 7 (α\alpha-observable POMDPs).

We say a POMDP is α\alpha-observable if for every μ,μΔ𝒮\mu,\mu^{\prime}\in\Delta_{\mathcal{S}},

minh𝔼sμ[𝕆h(s)]𝔼sμ[𝕆h(s)]1αμμ1.\min_{h}\|\mathbb{E}_{s\sim\mu}[\mathbb{O}_{h}(\cdot\mid s)]-\mathbb{E}_{s\sim\mu^{\prime}}[\mathbb{O}_{h}(\cdot\mid s)]\|_{1}\geq\alpha\|\mu-\mu^{\prime}\|_{1}.

Intuitively, the above relation implies that different state distributions will induce different observation distributions, and the parameter α\alpha measures the amount of information preserved after mapping states to observations. It is proved that α\alpha-observable POMDPs satisfy Assumption 3 with Πexp(π)={π}\Pi_{\rm exp}(\pi)=\{\pi\} and d𝒫=poly(S,A,α1,H)d_{\mathcal{P}}=\mathrm{poly}(S,A,\alpha^{-1},H) (Liu et al., 2022a).

For simplicity of notations, let u(h)=max{1,hm+1}u(h)=\max\{1,h-m+1\}.

Example 8 (mm-step decodable POMDPs).

We say a POMDP is mm-step decodable if there exists a set of decoder functions {ϕh}h[H]\{\phi_{h}\}_{h\in[H]} such that for every h[H]h\in[H], sh=ϕh((o,a)u(h):h1,oh)s_{h}=\phi_{h}((o,a)_{u(h):h-1},o_{h}). In other words, the current state can be uniquely identified from the most recent mm steps of observations and actions. It is proved that mm-step decodable POMDPs satisfy Assumption 3 with |Πexp|=Am|\Pi_{\rm exp}|=A^{m} and d𝒫=poly(L,Am,H)d_{\mathcal{P}}=\mathrm{poly}(L,A^{m},H) where L=maxhrank(h)L=\max_{h}\text{rank}(\mathbb{P}_{h}) denotes the rank of the transition matrices {h}h[H]SA×S\{\mathbb{P}_{h}\}_{h\in[H]}\subseteq\mathbb{R}^{SA\times S} (Liu et al., 2022a).

C.3 Proof of Theorem C.1

We first define some useful notations. Denote by 𝒟𝚛𝚠𝚍t\mathcal{D}_{\tt rwd}^{t}, 𝒟𝚝𝚛𝚊𝚗𝚜t\mathcal{D}_{\tt trans}^{t} the reward, transition dataset at the end of the tt-th iteration. We further denote

t{r:max(τ,r^)𝒟𝚛𝚠𝚍t1|r(τ)r^|ϵ},\displaystyle\mathcal{B}^{t}_{\mathcal{R}}\leftarrow\big{\{}r\in\mathcal{R}:~{}\max_{(\tau,\hat{r})\in\mathcal{D}_{\tt rwd}^{t-1}}|r(\tau)-\hat{r}|\leq\epsilon^{\prime}\big{\}},
𝒫t{p𝒫:(p,𝒟𝚝𝚛𝚊𝚗𝚜t1)>maxp𝒫(p,𝒟𝚝𝚛𝚊𝚗𝚜t1)β𝒫}.\displaystyle\mathcal{B}^{t}_{\mathcal{P}}\leftarrow\big{\{}p\in\mathcal{P}:~{}\mathcal{L}(p,\mathcal{D}_{\tt trans}^{t-1})>\max_{p^{\prime}\in\mathcal{P}}\mathcal{L}(p^{\prime},\mathcal{D}_{\tt trans}^{t-1})-\beta_{\mathcal{P}}\big{\}}.

By the definition of t\mathcal{B}^{t}, we have that t=t×𝒫t\mathcal{B}^{t}=\mathcal{B}_{\mathcal{R}}^{t}\times\mathcal{B}_{\mathcal{P}}^{t}. Denote by (τt,r^t)(\tau^{t},\hat{r}^{t}) the trajectory-reward pair added into 𝒟𝚛𝚠𝚍\mathcal{D}_{\tt rwd} in the tt-th iteration of Algorithm 3.

By the definition of t\mathcal{B}_{\mathcal{R}}^{t} and the fact that all reward feedback is at most ϵ\epsilon^{\prime}-corrupted, we directly have that the confidence set t\mathcal{B}_{\mathcal{R}}^{t} always contains the groundtruth reward function.

Lemma C.2.

For all t[T]t\in[T], rtr^{\star}\in\mathcal{B}_{\mathcal{R}}^{t}.

Moreover, according to Liu et al. (2022a), the transition confidence set 𝒫t\mathcal{B}_{\mathcal{P}}^{t} always satisfies the following properties.

Lemma C.3 (Liu et al. (2022a)).

There exists absolute constant c2c_{2} such that under Assumption 3 and the same choice of β𝒫\beta_{\mathcal{P}} as in Theorem C.1, we have that with probability at least 1δ1-\delta:

  • p𝒫tp^{\star}\in\mathcal{B}_{\mathcal{P}}^{t}, for all t[T]t\in[T],

  • t=1Tmaxp𝒫tdTV(pπt,pπt)ξ(d𝒫,T,c2β𝒫,|Πexp|)\sum_{t=1}^{T}\max_{p\in\mathcal{B}_{\mathcal{P}}^{t}}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p},\mathbb{P}^{\pi^{t}}_{p^{\star}})\leq\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|).

The first relation states that the transition confidence set contains the groundtruth transition model with high probability. The second relation states that if we use an arbitrary model p𝒫tp\in\mathcal{B}_{\mathcal{P}}^{t} to predict the transition dynamics under policy πt\pi^{t}, then the cumulative prediction error over TT iterations is upper bounded by the function ξ\xi.

Proof of Theorem C.1.

In the following proof, we will assume the two relations in Lemma C.3 hold.

We have that

t(Vr,pVr,pπt)\displaystyle\sum_{t}\left(V_{r^{\star},p^{\star}}^{\star}-V_{r^{\star},p^{\star}}^{\pi^{t}}\right)
\displaystyle\leq t(Vrt,ptπtVr,pπt)\displaystyle\sum_{t}\left(V_{r^{t},p^{t}}^{\pi^{t}}-V_{r^{\star},p^{\star}}^{\pi^{t}}\right)
\displaystyle\leq 2HtdTV(ptπt,pπt)+t(Vrt,pπtVr,pπt)\displaystyle 2H\sum_{t}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p^{t}},\mathbb{P}^{\pi^{t}}_{p^{\star}})+\sum_{t}\left(V_{r^{t},p^{\star}}^{\pi^{t}}-V_{r^{\star},p^{\star}}^{\pi^{t}}\right)
\displaystyle\leq 2HtdTV(ptπt,pπt)+t|rt(τt)r(τt)|+𝒪(HTln(1/δ))\displaystyle 2H\sum_{t}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p^{t}},\mathbb{P}^{\pi^{t}}_{p^{\star}})+\sum_{t}\left|r^{t}(\tau^{t})-r^{\star}(\tau^{t})\right|+\mathcal{O}(H\sqrt{T\ln(1/\delta)})
\displaystyle\leq 2Hξ(d𝒫,T,c2β𝒫,|Πexp|)+t|rt(τt)r(τt)|+𝒪(HTln(1/δ))\displaystyle 2H\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)+\sum_{t}\left|r^{t}(\tau^{t})-r^{\star}(\tau^{t})\right|+\mathcal{O}(H\sqrt{T\ln(1/\delta)})
\displaystyle\leq 2Hξ(d𝒫,T,c2β𝒫,|Πexp|)+𝒪(Tdϵ+d)+𝒪(HTln(1/δ)),\displaystyle 2H\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)+\mathcal{O}(T\sqrt{d_{\mathcal{R}}}\epsilon^{\prime}+d_{\mathcal{R}})+\mathcal{O}(H\sqrt{T\ln(1/\delta)}),

where the first inequality uses the definition of (πt,rt,pt)(\pi^{t},r^{t},p^{t}) together with the relation (r,p)t(r^{\star},p^{\star})\in\mathcal{B}^{t} (Lemma C.2 and the first relation in Lemma C.3), the second one bounds the change of value when replacing ptp^{t} by pp^{\star} through 2H2H times the TV distance between the induced trajectory distributions, the third one holds with probability at least 1δ1-\delta by the Azuma-Hoeffding inequality, the fourth one uses the second relation in Lemma C.3, and the last one invokes the standard regret guarantee for the eluder dimension (e.g., Russo and Van Roy (2013)) where d=dimE(,ϵ/2)d_{\mathcal{R}}=\dim_{\rm E}(\mathcal{R},\epsilon^{\prime}/2). ∎

Appendix D Proofs for P-OMLE

The proof of Theorem 3.8 largely follows that of Theorem C.1. We first introduce several useful notations. Denote by t\mathcal{B}_{\mathcal{R}}^{t}, 𝒫t\mathcal{B}_{\mathcal{P}}^{t} the reward and transition confidence sets in the tt-th iteration, which satisfy t=𝒫t×t\mathcal{B}^{t}=\mathcal{B}_{\mathcal{P}}^{t}\times\mathcal{B}_{\mathcal{R}}^{t}. Denote the groundtruth transition and reward by pp^{\star} and rr^{\star}. Denote the trajectories generated when running Algorithm 2 by τ1,,τT\tau^{1},\cdots,\tau^{T}.

Similar to Lemma C.3, we have that the confidence set satisfies the following properties.

Lemma D.1.

There exists absolute constant c2c_{2} such that under the same condition as Theorem 3.8, we have that with probability at least 1δ1-\delta: for all t[T]t\in[T]

  • (r,p)t(r^{\star},p^{\star})\in\mathcal{B}^{t},

  • t=1Tmaxp𝒫tdTV(pπt,pπt)ξ(d𝒫,T,c2β𝒫,|Πexp|)\sum_{t=1}^{T}\max_{p\in\mathcal{B}_{\mathcal{P}}^{t}}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p},\mathbb{P}^{\pi^{t}}_{p^{\star}})\leq\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|),

  • i<t|σ(rt(τi)rt(τ0))σ(r(τi)r(τ0))|2𝒪(β)\sum_{i<t}|\sigma(r^{t}(\tau^{i})-r^{t}(\tau^{0}))-\sigma(r^{\star}(\tau^{i})-r^{\star}(\tau^{0}))|^{2}\leq\mathcal{O}(\beta_{\mathcal{R}}).

Proof.

For the first two statements, see the proof of Theorem 3.2 in Liu et al. (2022a). For the third bullet point, see Liu et al. (2022a, Proposition B.2), which implies that

\displaystyle{\rm LHS}\leq\mathcal{O}(\beta_{\mathcal{R}}+\ln(|\mathcal{R}|T/\delta))=\mathcal{O}(\beta_{\mathcal{R}}).

Proof of Theorem 3.8.

By using the first relation in Lemma D.1 and the definition of (πt,rt,pt)(\pi^{t},r^{t},p^{t}),

t=1T[Vr,pVr,pπt]\displaystyle\sum_{t=1}^{T}[V_{r^{\star},p^{\star}}^{\star}-V_{r^{\star},p^{\star}}^{\pi^{t}}]
=\displaystyle= t=1T[Vr,pr(τ0)+r(τ0)Vr,pπt]\displaystyle\sum_{t=1}^{T}[V_{r^{\star},p^{\star}}^{\star}-r^{\star}(\tau^{0})+r^{\star}(\tau^{0})-V_{r^{\star},p^{\star}}^{\pi^{t}}]
\displaystyle\leq t=1T[Vrt,ptπtrt(τ0)+r(τ0)Vr,pπt]\displaystyle\sum_{t=1}^{T}[V_{r^{t},p^{t}}^{\pi^{t}}-r^{t}(\tau^{0})+r^{\star}(\tau^{0})-V_{r^{\star},p^{\star}}^{\pi^{t}}]
=\displaystyle= t=1T[Vrt,ptπtVrt,pπt]+t=1T[Vrt,pπtrt(τ0)+r(τ0)Vr,pπt].\displaystyle\sum_{t=1}^{T}[V_{r^{t},p^{t}}^{\pi^{t}}-V_{r^{t},p^{\star}}^{\pi^{t}}]+\sum_{t=1}^{T}[V_{r^{t},p^{\star}}^{\pi^{t}}-r^{t}(\tau^{0})+r^{\star}(\tau^{0})-V_{r^{\star},p^{\star}}^{\pi^{t}}].

We can control the first term by the second relation in Lemma D.1:

\sum_{t=1}^{T}[V_{r^{t},p^{t}}^{\pi^{t}}-V_{r^{t},p^{\star}}^{\pi^{t}}]\leq 2H\sum_{t=1}^{T}d_{\rm TV}(\mathbb{P}^{\pi^{t}}_{p^{t}},\mathbb{P}^{\pi^{t}}_{p^{\star}})\leq 2H\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|).

For the second term, by the Azuma-Hoeffding inequality and by combining the third relation in Lemma D.1 (together with Assumption 2) with the regret guarantee for the eluder dimension,

t=1T[Vrt,pπtrt(τ0)+r(τ0)Vr,pπt]\displaystyle\sum_{t=1}^{T}[V_{r^{t},p^{\star}}^{\pi^{t}}-r^{t}(\tau^{0})+r^{\star}(\tau^{0})-V_{r^{\star},p^{\star}}^{\pi^{t}}]
\displaystyle\leq t=1T|[(rt(τt)rt(τ0)][r(τt)r(τ0)]|+𝒪(HTln(1/δ))\displaystyle\sum_{t=1}^{T}|[(r^{t}(\tau^{t})-r^{t}(\tau^{0})]-[r^{\star}(\tau^{t})-r^{\star}(\tau^{0})]|+\mathcal{O}(H\sqrt{T\ln(1/\delta)})
\displaystyle\leq 𝒪(Hα1dTβ).\displaystyle\mathcal{O}(H\alpha^{-1}\sqrt{d_{\mathcal{R}}T\beta_{\mathcal{R}}}).

Appendix E Additional details and proofs for Section 4.2

E.1 Algorithm details

In Algorithm 4, we describe how to learn an approximate von Neumann winner via running two copies of an arbitrary adversarial MDP algorithm. Specifically, we maintain two algorithm instances 𝒜(1)\mathscr{A}^{(1)} and 𝒜(2)\mathscr{A}^{(2)}. In the kk-th iteration, we first sample two trajectories (s1:H(1),a1:H(1))(s_{1:H}^{(1)},a_{1:H}^{(1)}) and (s1:H(2),a1:H(2))(s_{1:H}^{(2)},a_{1:H}^{(2)}) without reward by executing πk(1)\pi^{(1)}_{k} and πk(2)\pi^{(2)}_{k}, the two output policies of 𝒜(1)\mathscr{A}^{(1)} and 𝒜(2)\mathscr{A}^{(2)}, respectively. Then we input (sH(1),sH(2))(s_{H}^{(1)},s_{H}^{(2)}) into the comparison oracle and get a binary feedback yy. After that, we augment (s1:H1(1),a1:H1(1))(s_{1:H-1}^{(1)},a_{1:H-1}^{(1)}) with zero reward at the first H2H-2 steps and reward yy at step H1H-1, which we feed into 𝒜(1)\mathscr{A}^{(1)}. Similarly we create feedback for 𝒜(2)\mathscr{A}^{(2)} by using (s1:H1(2),a1:H1(2))(s_{1:H-1}^{(2)},a_{1:H-1}^{(2)}) and reward 1y1-y. The final output policy π¯(1)\bar{\pi}^{(1)} is a uniform mixture of all the policies 𝒜(1)\mathscr{A}^{(1)} has produced during KK iterations.

Algorithm 4 Learning von Neumann Winner via Adversarial MDP Algorithms
  Initialize two algorithm instances 𝒜(1)\mathscr{A}^{(1)}, 𝒜(2)\mathscr{A}^{(2)} for adversarial MDPs with horizon length H1H-1
  for k=1,,Kk=1,\cdots,K do
     Receive πk(1)\pi^{(1)}_{k} from 𝒜(1)\mathscr{A}^{(1)} and πk(2)\pi^{(2)}_{k} from 𝒜(2)\mathscr{A}^{(2)}
     Sample (s1:H(1),a1:H(1))πk(1)(s_{1:H}^{(1)},a_{1:H}^{(1)})\sim\pi^{(1)}_{k} and (s1:H(2),a1:H(2))πk(2)(s_{1:H}^{(2)},a_{1:H}^{(2)})\sim\pi^{(2)}_{k}
     Query comparison oracle yBer(M(sH(1),sH(2)))y\sim{\rm Ber}(M(s^{(1)}_{H},s^{(2)}_{H}))
     Return feedback (s1(1),a1(1),0,,sH2(1),aH2(1),0,sH1(1),aH1(1),y)(s^{(1)}_{1},a_{1}^{(1)},0,\ldots,s^{(1)}_{H-2},a_{H-2}^{(1)},0,s^{(1)}_{H-1},a_{H-1}^{(1)},y) to 𝒜1\mathscr{A}_{1}       and (s1(2),a1(2),0,,sH2(2),aH2(2),0,sH1(2),aH1(2),1y)(s^{(2)}_{1},a_{1}^{(2)},0,\ldots,s^{(2)}_{H-2},a_{H-2}^{(2)},0,s^{(2)}_{H-1},a_{H-1}^{(2)},1-y) to 𝒜2\mathscr{A}_{2}, respectively
  Output average policy mixture π¯(1)\bar{\pi}^{(1)} where π¯(1):=Unif({πk(1)}k[K])\bar{\pi}^{(1)}:=\text{Unif}(\{\pi_{k}^{(1)}\}_{k\in[K]})
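
Below is a minimal sketch of this reduction as a driver loop (for illustration only; A1 and A2 are assumed to expose hypothetical propose() and update(trajectory, rewards) methods of an adversarial MDP learner, rollout(pi) samples a reward-free trajectory as a list of (state, action) pairs, and compare implements the comparison oracle on final states):

import random

def learn_vnw_via_adversarial_mdp(A1, A2, rollout, compare, K):
    """Sketch of Algorithm 4: run two adversarial MDP learners against each
    other, feeding each one the comparison outcome as a terminal reward."""
    policies_1 = []
    for _ in range(K):
        pi1, pi2 = A1.propose(), A2.propose()
        traj1, traj2 = rollout(pi1), rollout(pi2)
        y = compare(traj1[-1][0], traj2[-1][0])          # compare final states
        # Zero reward at steps 1..H-2 and reward y (resp. 1-y) at step H-1.
        A1.update(traj1[:-1], [0.0] * (len(traj1) - 2) + [float(y)])
        A2.update(traj2[:-1], [0.0] * (len(traj2) - 2) + [1.0 - y])
        policies_1.append(pi1)
    # The output is the uniform mixture over the policies produced by A1.
    return lambda: random.choice(policies_1)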

Converting the output policy to a Markov policy.

Note that the output policy π¯\bar{\pi} is a general non-Markov policy. However, we can convert it to a Markov policy in a sample-efficient manner for tabular MDPs through a simple procedure: execute π¯\bar{\pi} for NN episodes, then compute the empirical policy (setting π^h(|s)=Unif(𝒜)\hat{\pi}_{h}(\cdot|s)=\text{Unif}(\mathcal{A}) if Jh(s)=0J_{h}(s)=0, i.e. if state ss is never visited at step hh)

π^h(a|s):=Jh(s,a)Jh(s),\hat{\pi}_{h}(a|s):=\frac{J_{h}(s,a)}{J_{h}(s)},

where Jh(s,a)J_{h}(s,a) and Jh(s)J_{h}(s) denote the visitation counts of state-action pair (s,a)(s,a) and of state ss at step hh, respectively. The following lemma states that the resulting Markov policy π^\hat{\pi} is also an approximate restricted Nash equilibrium.

Lemma E.1.

If π¯\bar{\pi} is an ϵ\epsilon-approximate von Neumann winner, then π^\hat{\pi} is a 2ϵ2\epsilon-approximate von Neumann winner with probability at least 1δ1-\delta, provided that N=Ω~(SAϵ2)N=\widetilde{\Omega}\left(\frac{SA}{\epsilon^{2}}\right).
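
The conversion procedure described above can be sketched as follows (illustrative only; rollout() is assumed to execute π¯\bar{\pi} for one episode and return its HH state-action pairs):

from collections import defaultdict

def to_markov_policy(rollout, N, num_actions):
    """Sketch of the empirical-policy conversion: run pi_bar for N episodes,
    count visitations, and normalize; unvisited states default to uniform."""
    J_sa = defaultdict(int)   # J_h(s, a)
    J_s = defaultdict(int)    # J_h(s)
    for _ in range(N):
        for h, (s, a) in enumerate(rollout()):
            J_sa[(h, s, a)] += 1
            J_s[(h, s)] += 1

    def pi_hat(h, s, a):
        if J_s[(h, s)] == 0:
            return 1.0 / num_actions            # Unif(A) on unvisited states
        return J_sa[(h, s, a)] / J_s[(h, s)]    # empirical policy
    return pi_hat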

E.2 Proof of Theorem 4.2

Proof.

Define U(π,π):=𝔼sHπ,sHπM[sH,sH]U(\pi,\pi^{\prime}):=\mathbb{E}_{s_{H}\sim\pi,s_{H}^{\prime}\sim\pi^{\prime}}M[s_{H},s_{H}^{\prime}]. The AMDP regret of 𝒜(1)\mathscr{A}^{(1)} gives

kU(πk(1),πk(2))\displaystyle\sum_{k}U\left(\pi^{(1)}_{k},\pi^{(2)}_{k}\right)\geq maxπ:MarkovkU(π,πk(2))βK1c\displaystyle\max_{\pi:~{}\text{Markov}}\sum_{k}U\left(\pi,\pi_{k}^{(2)}\right)-\beta K^{1-c}
=\displaystyle= maxπ:generalkU(π,πk(2))βK1c\displaystyle\max_{\pi:~{}\text{general}}\sum_{k}U\left(\pi,\pi_{k}^{(2)}\right)-\beta K^{1-c}

where the equality uses the fact that there always exists a Markov best response, because only the state distribution at the final step matters in the definition of UU.

Similarly the regret of 𝒜(2)\mathscr{A}^{(2)} gives

k(1U(πk(1),πk(2)))maxπk[1U(πk(1),π)]βK1c.\displaystyle\sum_{k}\left(1-U\left(\pi^{(1)}_{k},\pi^{(2)}_{k}\right)\right)\geq\max_{\pi}\sum_{k}\left[1-U\left(\pi_{k}^{(1)},\pi\right)\right]-\beta K^{1-c}.

Summing the two inequalities and dividing by KK gives

\displaystyle\min_{\pi^{\prime}}U\left(\bar{\pi}^{(1)},\pi^{\prime}\right)\geq\max_{\pi}U(\pi,\bar{\pi}^{(2)})-2\beta K^{-c}.

In other words

DGap(π¯(1),π¯(2))2β/Kc.{\rm DGap}(\bar{\pi}^{(1)},\bar{\pi}^{(2)})\leq 2\beta/K^{c}.

By the symmetry of preference function, we further have

DGap(π¯(1),π¯(1))\displaystyle{\rm DGap}(\bar{\pi}^{(1)},\bar{\pi}^{(1)}) =maxπU(π,π¯1)minπU(π¯1,π)=2maxπU(π,π¯1)1\displaystyle=\max_{\pi^{\prime}}U(\pi^{\prime},\bar{\pi}_{1})-\min_{\pi^{\prime}}U(\bar{\pi}_{1},\pi^{\prime})=2\max_{\pi^{\prime}}U(\pi^{\prime},\bar{\pi}_{1})-1
=2(maxπU(π,π¯1)U(π¯1,π¯1))\displaystyle=2\left(\max_{\pi^{\prime}}U(\pi^{\prime},\bar{\pi}_{1})-U(\bar{\pi}_{1},\bar{\pi}_{1})\right)
2DGap(π¯(1),π¯(2))4β/Kc.\displaystyle\leq 2{\rm DGap}(\bar{\pi}^{(1)},\bar{\pi}^{(2)})\leq 4\beta/K^{c}.

E.3 Details for adversarial linear MDPs.

To apply the regret guarantees from existing works on adversarial linear MDP (e.g., Sherman et al., 2023), we need to show the constructed reward signal in Algorithm 4 is linearly realizable. Since the reward signal is zero for the first H2H-2 steps, we only need to consider step H1H-1. Recall the definition of linear MDPs requires that there exist feature mappings ϕ\phi and ψ\psi such that h(ss,a)=ϕh(s,a),ψ(s)\mathbb{P}_{h}(s^{\prime}\mid s,a)=\langle\phi_{h}(s,a),\psi(s^{\prime})\rangle. By the bilinear structure of transition, the conditional expectation of reward at step H1H-1 in the kk-th iteration can be written as

𝔼[ysH1(1),aH1(1)]\displaystyle\mathbb{E}[y\mid s_{H-1}^{(1)},a_{H-1}^{(1)}] =𝔼[M(sH(1),sH(2))sH(1)h(sH1(1),aH1(1)),sH(2)πk(2)]\displaystyle=\mathbb{E}[M(s_{H}^{(1)},s_{H}^{(2)})\mid s_{H}^{(1)}\sim\mathbb{P}_{h}(\cdot\mid s_{H-1}^{(1)},a_{H-1}^{(1)}),~{}s_{H}^{(2)}\sim\pi_{k}^{(2)}]
=ϕH1(sH1(1),aH1(1)),s𝒮ψH1(s)𝔼[M(s,sH(2))sH(2)πk(2)].\displaystyle=\left\langle\phi_{H-1}(s_{H-1}^{(1)},a_{H-1}^{(1)}),\sum_{s\in\mathcal{S}}\psi_{H-1}(s)\mathbb{E}[M(s,s_{H}^{(2)})\mid s_{H}^{(2)}\sim\pi_{k}^{(2)}]\right\rangle.

Therefore, the reward function constructed for 𝒜(1)\mathscr{A}^{(1)} is linear in the feature mapping ϕ\phi. Similarly, we can show that the reward function constructed for 𝒜(2)\mathscr{A}^{(2)} is also linear.

E.4 Proof of Lemma E.1

Proof.

Let ι=cln(SAHN/δ)\iota=c\ln(SAHN/\delta) where cc is a large absolute constant. By standard concentration, with probability at least 1δ1-\delta, we have that for all ss, if π¯(sh=s)ι/N\mathbb{P}^{\bar{\pi}}(s_{h}=s)\geq{\iota}/{N}, then Jh(s)Nπ¯(sh=s)/2J_{h}(s)\geq N\mathbb{P}^{\bar{\pi}}(s_{h}=s)/2. With slight abuse of notation, we define π¯h(sh)=a1:h1,s1:h1π¯h(a1:h1,s1:h)\bar{\pi}_{h}(\cdot\mid s_{h})=\sum_{a_{1:h-1},s_{1:h-1}}\bar{\pi}_{h}(\cdot\mid a_{1:h-1},s_{1:h}). Therefore, we have

sπ¯(sh=s)π^h(s)π¯h(s)1\displaystyle\sum_{s}\mathbb{P}^{\bar{\pi}}(s_{h}=s)\cdot\|{\hat{\pi}}_{h}(\cdot\mid s)-{\bar{\pi}}_{h}(\cdot\mid s)\|_{1}\leq SιN+sπ¯(sh=s)ιANπ¯(sh=s)\displaystyle\frac{S\iota}{N}+\sum_{s}\mathbb{P}^{\bar{\pi}}(s_{h}=s)\cdot\sqrt{\frac{\iota A}{N\mathbb{P}^{\bar{\pi}}(s_{h}=s)}}
\displaystyle\leq SιN+ιSANϵ2H.\displaystyle\frac{S\iota}{N}+\sqrt{\frac{\iota SA}{N}}\leq\frac{\epsilon}{2H}.

Given a Markov policy π1\pi_{1} and general policy π2\pi_{2}, we define

qhπ1,π2(sh,ah):=𝔼sHπ1|sh,ah,sHπ2[M(sH,sH)].q_{h}^{\pi_{1},\pi_{2}}(s_{h},a_{h}):=\mathbb{E}_{s_{H}\sim\pi_{1}|s_{h},a_{h},~{}s_{H}^{\prime}\sim\pi_{2}}\left[M(s_{H},s_{H}^{\prime})\right].

It follows that for any policy π\pi,

|U(π^,π)U(π¯,π)|\displaystyle\left|U({\hat{\pi}},\pi)-U({\bar{\pi}},\pi)\right| =|h=1H𝔼π¯[π^h(sh)π¯h(sh),qhπ^,π(sh,)]|\displaystyle=\left|\sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}}\left[\langle{\hat{\pi}}_{h}(\cdot\mid s_{h})-{\bar{\pi}}_{h}(\cdot\mid s_{h}),q^{{\hat{\pi}},\pi}_{h}(s_{h},\cdot)\rangle\right]\right|
h=1Hsπ¯(sh=s)π^h(s)π¯h(s)1ϵ/2.\displaystyle\leq\sum_{h=1}^{H}\sum_{s}\mathbb{P}^{\bar{\pi}}(s_{h}=s)\cdot\|{\hat{\pi}}_{h}(\cdot\mid s)-{\bar{\pi}}_{h}(\cdot\mid s)\|_{1}\leq\epsilon/2.

Therefore,

DGap(π^,π^)=2maxπU(π,π^)12maxπU(π,π¯)1+ϵ2ϵ.\displaystyle{\rm DGap}({\hat{\pi}},{\hat{\pi}})=2\max_{\pi^{\prime}}U(\pi^{\prime},{\hat{\pi}})-1\leq 2\max_{\pi^{\prime}}U(\pi^{\prime},{\bar{\pi}})-1+\epsilon\leq 2\epsilon.

Appendix F Additional details and proofs for Section 4.3

F.1 Algorithm details

To state the algorithm in a more compact way, we first introduce several notations. We denote the expected winning probability of policy π\pi against policy π\pi^{\prime} in an RLHF instance with transition pp and preference MM by

Vp,Mπ,π:=𝔼[M(τ,τ)τpπ,τpπ].V_{p,M}^{\pi,\pi^{\prime}}:=\mathbb{E}\left[M(\tau,\tau^{\prime})\mid\tau\sim\mathbb{P}^{\pi}_{p},~{}\tau^{\prime}\sim\mathbb{P}^{\pi^{\prime}}_{p}\right].

Furthermore, we denote the best-response value against policy π\pi as

Vp,Mπ,=minπVp,Mπ,π,V_{p,M}^{\pi,\dagger}=\min_{\pi^{\prime}}V_{p,M}^{\pi,\pi^{\prime}},

and the minimax value as

Vp,M=maxπminπVp,Mπ,π.V_{p,M}^{\star}=\max_{\pi}\min_{\pi^{\prime}}V_{p,M}^{\pi,\pi^{\prime}}.
Algorithm 5 Learning von Neumann winner via Optimistic MLE
1:  1𝒫×\mathcal{B}^{1}\leftarrow\mathcal{P}\times\mathcal{M}
2:  execute an arbitrary policy to collect trajectory τ0\tau^{0}
3:  for t=1,,Tt=1,\ldots,T do
4:     compute optimistic von Neumann winner (π¯t,M¯t,p¯t)=argmaxπ,(p,M)tVp,Mπ,(\overline{\pi}^{t},\overline{M}^{t},\overline{p}^{t})={\arg\max}_{\pi,~{}(p,M)\in\mathcal{B}^{t}}V_{p,M}^{\pi,\dagger}
5:     compute optimistic best-response (π¯t,M¯t,p¯t)=argminπ,(p,M)tVp,Mπ¯t,π(\underline{\pi}^{t},\underline{M}^{t},\underline{p}^{t})={\arg\min}_{\pi^{\prime},~{}(p,M)\in\mathcal{B}^{t}}V_{p,M}^{\overline{\pi}^{t},\pi^{\prime}}
6:     sample τ¯tπ¯t\overline{\tau}^{t}\sim\overline{\pi}^{t} and τ¯tπ¯t\underline{\tau}^{t}\sim\underline{\pi}^{t}
7:     invoke comparison oracle on (τ¯t,τ¯t)(\overline{\tau}^{t},\underline{\tau}^{t}) to get yty^{t}, add (τ¯t,τ¯t,yt)(\overline{\tau}^{t},\underline{\tau}^{t},y^{t}) into 𝒟𝚙𝚛𝚎𝚏\mathcal{D}_{\tt pref}
8:     for each π(Πexp(π¯t)Πexp(π¯t))\pi\in(\Pi_{\rm exp}(\overline{\pi}^{t})\bigcup\Pi_{\rm exp}(\underline{\pi}^{t})) do
9:        execute π\pi to collect a trajectory τ\tau, add (π,τ)(\pi,\tau) into 𝒟𝚝𝚛𝚊𝚗𝚜\mathcal{D}_{\tt trans}
10:     update
t+1{(p,M)\displaystyle\mathcal{B}^{t+1}\leftarrow\big{\{}(p,M) 𝒫×:(M,𝒟𝚙𝚛𝚎𝚏)>maxM(M,𝒟𝚙𝚛𝚎𝚏)β\displaystyle\in\mathcal{P}\times\mathcal{M}:~{}\mathcal{L}(M,\mathcal{D}_{\tt pref})>\max_{M^{\prime}\in\mathcal{M}}\mathcal{L}(M^{\prime},\mathcal{D}_{\tt pref})-\beta_{\mathcal{M}}
and (p,𝒟𝚝𝚛𝚊𝚗𝚜)>maxp𝒫(p,𝒟𝚝𝚛𝚊𝚗𝚜)β𝒫}\displaystyle\text{ and }\mathcal{L}(p,\mathcal{D}_{\tt trans})>\max_{p^{\prime}\in\mathcal{P}}\mathcal{L}(p^{\prime},\mathcal{D}_{\tt trans})-\beta_{\mathcal{P}}\big{\}}
11:  output πout=Unif({π¯t}t[T])\pi^{\rm out}=\text{Unif}(\{\overline{\pi}^{t}\}_{t\in[T]})

We provide the pseudocode of learning von Neumann winner via optimistic MLE in Algorithm 5. In each iteration t[T]t\in[T], the algorithm performs the following three key steps:

  • Optimistic planning: Compute the most optimistic von Neumann winner $\overline{\pi}^{t}$ by picking the most optimistic transition-preference candidate $(p,M)$ in the current confidence set $\mathcal{B}^{t}$. Then compute the most optimistic best response to $\overline{\pi}^{t}$, denoted by $\underline{\pi}^{t}$.

  • Data collection: Sample two trajectories $\overline{\tau}^{t}$ and $\underline{\tau}^{t}$ from $\overline{\pi}^{t}$ and $\underline{\pi}^{t}$, respectively, and feed them into the comparison oracle to obtain feedback $y^{t}$, which is added to the preference dataset $\mathcal{D}_{\tt pref}$. As in standard OMLE, we also execute the policies in the exploration policy set constructed from $\overline{\pi}^{t}$ and $\underline{\pi}^{t}$, and add the collected data to the transition dataset $\mathcal{D}_{\tt trans}$.

  • Confidence set update: Update the confidence set using the updated log-likelihood, in the same way as Algorithm 3, except that the utility-based preference therein is replaced by the general preference (a minimal code sketch of this update is given right after this list):

    \mathcal{L}(M,\mathcal{D}_{\tt pref}):=\sum_{(\tau,\tau^{\prime},y)\in\mathcal{D}_{\tt pref}}\ln\left(yM(\tau,\tau^{\prime})+(1-y)(1-M(\tau,\tau^{\prime}))\right).
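For concreteness, the following minimal Python sketch shows the log-likelihood above and the resulting preference confidence-set filter over a finite set of candidate models. It is illustrative only: the data structures and function names are ours, and the transition part of the update for $\mathcal{B}^{t+1}$ is analogous and omitted.

import math

def pref_log_likelihood(M, D_pref):
    # D_pref is a list of (traj, traj_prime, y) with y in {0, 1};
    # M(traj, traj_prime) returns the modeled probability, assumed in (0, 1),
    # that traj is preferred over traj_prime.
    total = 0.0
    for traj, traj_prime, y in D_pref:
        p = M(traj, traj_prime)
        total += math.log(y * p + (1 - y) * (1 - p))
    return total

def preference_confidence_set(candidates, D_pref, beta_M):
    # Keep every candidate whose log-likelihood is within beta_M of the MLE,
    # mirroring the preference half of the update rule for B^{t+1}.
    scores = {name: pref_log_likelihood(M, D_pref) for name, M in candidates.items()}
    best = max(scores.values())
    return {name for name, score in scores.items() if score > best - beta_M}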

F.2 Proof of Theorem 4.3

We first introduce several useful pieces of notation. Denote by $\mathcal{B}_{\mathcal{M}}^{t}$ and $\mathcal{B}_{\mathcal{P}}^{t}$ the preference and transition confidence sets in the $t$-th iteration, which satisfy $\mathcal{B}^{t}=\mathcal{B}_{\mathcal{P}}^{t}\times\mathcal{B}_{\mathcal{M}}^{t}$. Denote the ground-truth transition and preference by $p^{\star}$ and $M^{\star}$. To prove Theorem 4.3, it suffices to bound

\sum_{t}\left(V_{p^{\star},M^{\star}}^{\star}-V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\dagger}\right).

Similar to the proof of Theorem C.1, we first state several key properties of the MLE confidence set $\mathcal{B}^{t}$, which are straightforward extensions of the confidence set properties in Liu et al. (2022a).

Lemma F.1 (Liu et al. (2022a)).

Under the same conditions as in Theorem 4.3, the following hold with probability at least $1-\delta$ for all $t\in[T]$:

  • $(p^{\star},M^{\star})\in\mathcal{B}^{t}$,

  • $\sum_{t=1}^{T}\max_{p\in\mathcal{B}_{\mathcal{P}}^{t}}\left(d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{p},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{p},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})\right)\leq\xi(d_{\mathcal{P}},T,c_{2}\beta_{\mathcal{P}},|\Pi_{\rm exp}|)$,

  • $\max_{M\in\mathcal{B}_{\mathcal{M}}^{t}}\sum_{i<t}|M(\overline{\tau}^{i},\underline{\tau}^{i})-M^{\star}(\overline{\tau}^{i},\underline{\tau}^{i})|^{2}\leq\mathcal{O}(\beta_{\mathcal{M}})$.

The first relation states that the confidence set $\mathcal{B}^{t}$ contains the ground-truth transition-preference model with high probability. The second relation parallels the second relation in Lemma C.3. The third relation states that every preference model $M$ in the confidence set $\mathcal{B}_{\mathcal{M}}^{t}$ accurately predicts the preferences over previously collected trajectory pairs.

Using the first relation in Lemma F.1 and the definitions of $\overline{\pi}^{t}$ and $\underline{\pi}^{t}$, we have

\displaystyle\sum_{t}\left(V_{p^{\star},M^{\star}}^{\star}-V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\dagger}\right)
\leq\sum_{t}\left(V_{\overline{p}^{t},\overline{M}^{t}}^{\overline{\pi}^{t},\dagger}-V_{\underline{p}^{t},\underline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}\right)
\leq\sum_{t}\left(V_{\overline{p}^{t},\overline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}-V_{\underline{p}^{t},\underline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}\right)
\leq 2\sum_{t}\left(d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})\right)
\qquad+\sum_{t}\left(V_{p^{\star},\overline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}-V_{p^{\star},\underline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}\right)
\leq 2\sum_{t}\left(d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})\right)
\qquad+\sum_{t}\left(\overline{M}^{t}(\overline{\tau}^{t},\underline{\tau}^{t})-\underline{M}^{t}(\overline{\tau}^{t},\underline{\tau}^{t})\right)+\mathcal{O}\left(\sqrt{T\ln(1/\delta)}\right),

where the second inequality uses $V_{\overline{p}^{t},\overline{M}^{t}}^{\overline{\pi}^{t},\dagger}\leq V_{\overline{p}^{t},\overline{M}^{t}}^{\overline{\pi}^{t},\underline{\pi}^{t}}$, the third uses $M\in[0,1]$ to replace $\overline{p}^{t}$ and $\underline{p}^{t}$ by $p^{\star}$ at the cost of total-variation terms, and the last holds with probability at least $1-\delta$ by Azuma--Hoeffding, since conditioned on the history the pair $(\overline{\tau}^{t},\underline{\tau}^{t})$ is drawn from $\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}}\otimes\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}}$. By the second relation in Lemma F.1 and Definition 3, we have

\displaystyle\sum_{t}\left(d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\overline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\overline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}})+d_{\rm TV}(\mathbb{P}^{\underline{\pi}^{t}}_{\underline{p}^{t}},\mathbb{P}^{\underline{\pi}^{t}}_{p^{\star}})\right)\leq 4\xi(d_{\mathcal{P}},T,c\beta_{\mathcal{P}},|\Pi_{\rm exp}|),

where $c$ is an absolute constant.

Combining the third relation in Lemma F.1 with the standard regret bound in terms of the eluder dimension (e.g., Lemma 2 in Russo and Van Roy (2013)), we have

\displaystyle\sum_{t}\left(\overline{M}^{t}(\overline{\tau}^{t},\underline{\tau}^{t})-\underline{M}^{t}(\overline{\tau}^{t},\underline{\tau}^{t})\right)\leq\mathcal{O}(\sqrt{d_{\mathcal{M}}\beta_{\mathcal{M}}T}).

Putting all pieces together, we have

\sum_{t}\left(V_{p^{\star},M^{\star}}^{\star}-V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\dagger}\right)\leq 4\xi(d_{\mathcal{P}},T,c\beta_{\mathcal{P}},|\Pi_{\rm exp}|)+\mathcal{O}(\sqrt{d_{\mathcal{M}}\beta_{\mathcal{M}}T}).
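For completeness, one standard way to make explicit the conversion from this cumulative bound to a guarantee for the output policy (the step implicit in the reduction "it suffices to bound" above) is as follows. Since $\pi^{\rm out}={\rm Unif}(\{\overline{\pi}^{t}\}_{t\in[T]})$ induces the trajectory distribution $\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}^{\overline{\pi}^{t}}_{p^{\star}}$,

V_{p^{\star},M^{\star}}^{\pi^{\rm out},\dagger}=\min_{\pi^{\prime}}\frac{1}{T}\sum_{t=1}^{T}V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\pi^{\prime}}\geq\frac{1}{T}\sum_{t=1}^{T}\min_{\pi^{\prime}}V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\pi^{\prime}}=\frac{1}{T}\sum_{t=1}^{T}V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\dagger},

and therefore

V_{p^{\star},M^{\star}}^{\star}-V_{p^{\star},M^{\star}}^{\pi^{\rm out},\dagger}\leq\frac{1}{T}\sum_{t=1}^{T}\left(V_{p^{\star},M^{\star}}^{\star}-V_{p^{\star},M^{\star}}^{\overline{\pi}^{t},\dagger}\right)\leq\frac{4\xi(d_{\mathcal{P}},T,c\beta_{\mathcal{P}},|\Pi_{\rm exp}|)+\mathcal{O}(\sqrt{d_{\mathcal{M}}\beta_{\mathcal{M}}T})}{T}.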