
¹ Tsinghua University, Beijing, 100084, China
  lfq20@mails.tsinghua.edu.cn
² 4Paradigm Inc., Beijing, 100084, China
  {huangshiyu,tuweiwei}@4paradigm.com

Diverse Policies Converge in Reward-free Markov Decision Processes

Fanqi Lin¹, Shiyu Huang² (ORCID: 0000-0003-0500-0141), Wei-Wei Tu²
Abstract

Reinforcement learning has achieved great success in many decision-making tasks, and traditional reinforcement learning algorithms are mainly designed to obtain a single optimal solution. However, recent works show the importance of developing diverse policies, which makes this an emerging research topic. Despite the variety of diversity reinforcement learning algorithms that have emerged, none of them theoretically answers how these algorithms converge or how efficient they are. In this paper, we provide a unified diversity reinforcement learning framework and investigate the convergence of training diverse policies. Under this framework, we also propose a provably efficient diversity reinforcement learning algorithm. Finally, we verify the effectiveness of our method through numerical experiments (code: https://github.com/OpenRL-Lab/DiversePolicies).

Keywords:
Reinforcement Learning · Diversity Reinforcement Learning · Bandit

1 Introduction

Reinforcement learning (RL) shows huge advantages in various decision-making tasks, such as recommendation systems [20, 23], game AIs [3, 10] and robotic control [24, 17]. While traditional RL algorithms can achieve superhuman performance on many public benchmarks, the obtained policy often falls into a fixed pattern. For example, a trained agent may overfit to a fixed environment and be vulnerable to environmental changes [6]. Finding diverse policies may increase the robustness of the agent [16, 12]. Moreover, a fixed-pattern agent is easily attacked [21], because the opponent can find its weakness with a series of attempts. If the agent plays the game with a different strategy each round, the opponent can hardly identify the upcoming strategy and thus cannot apply the corresponding attacking tactics [13]. Recently, developing RL algorithms for diverse policies has attracted the attention of the RL community, both for its promising application value and for the challenge of solving a more complex RL problem [7, 11, 4].

Current diversity RL algorithms vary widely due to factors like policy diversity measurement, optimization techniques, training strategies, and application scenarios. This variation makes comparison challenging. While these algorithms often incorporate deep neural networks and empirical tests for comparison, they typically lack in-depth theoretical analysis on training convergence and algorithm complexity, hindering the development of more efficient algorithms.

To address the aforementioned issues, we abstract various diversity RL algorithms, break down the training process, and introduce a unified framework. We offer a convergence analysis for the policy population and utilize the contextual bandit formulation to design a more efficient diversity RL algorithm, analyzing its complexity. We conclude with visualizations, experimental evaluations, and an ablation study comparing the training efficiencies of different methods. We summarize our contributions as follows: (1) we investigate recent diversity reinforcement learning algorithms and propose a unified framework; (2) we provide a theoretical analysis of the convergence of the proposed framework; (3) we propose a provably efficient diversity reinforcement learning algorithm; (4) we conduct numerical experiments to verify the effectiveness of our method.

2 Related Work

Diversity Reinforcement Learning

Recently, many researchers have committed to the design of diversity reinforcement learning algorithms [7, 19, 11, 4]. DIAYN [7] is a classical diversity RL algorithm that learns maximum-entropy policies by maximizing the mutual information between states and skills. Besides, [19] trains agents with latent-conditioned policies that make use of continuous low-dimensional latent variables and can therefore obtain infinitely many qualified solutions. More recently, RSPO [26] obtains diverse behaviors by iteratively optimizing each policy. DGPO [4] then proposes a more efficient diversity RL algorithm with a novel diversity reward, sharing parameters between policies.

Bandit Algorithms

The challenge in multi-armed bandit algorithm design is balancing exploration and exploitation. Building on $\epsilon$-greedy [22], UCB algorithms [1] introduce guided exploration. Contextual bandit algorithms, like [18, 14], improve modeling for recommendation and reinforcement learning, and they demonstrate better convergence properties when contextual information is available [5, 14]. Extensive research [2] provides regret bounds for these algorithms.

3 Preliminaries

Markov Decision Process

We consider environments that can be represented as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S},\mathcal{A},P_{T},r,\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\gamma\in[0,1)$ is the reward discount factor. The state-transition function $P_{T}(s,a,s'):\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ defines the transition probability over the next state $s'$ after taking action $a$ at state $s$. $r(s,a):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function denoting the immediate reward received by the agent when taking action $a$ in state $s$. The discounted state occupancy measure of policy $\pi$ is denoted as $\rho^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{t}^{\pi}(s)$, where $P_{t}^{\pi}(s)$ is the probability that policy $\pi$ visits state $s$ at time $t$. The agent's objective is to learn a policy $\pi$ that maximizes the expected accumulated reward $J(\theta)=\mathbb{E}_{z\sim p(z),\,s\sim\rho^{\pi}(s),\,a\sim\pi(\cdot|s,z)}[\sum_{t}\gamma^{t}r(s_{t},a_{t})]$. In diversity reinforcement learning, the latent-conditioned policy is widely used. The latent-conditioned policy is denoted as $\pi(a|s,z)$, and the latent-conditioned critic network is denoted as $V^{\pi}(s,z)$. During execution, the latent variable $z\sim p(z)$ is sampled at the beginning of each episode and kept fixed for the entire episode. When $z$ is discrete, it is sampled from a categorical distribution with $N_{z}$ categories; when $z$ is continuous, it is sampled from a Gaussian distribution.
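To make the latent-conditioned setup concrete, the following minimal Python sketch (an illustration, not the implementation accompanying this paper) samples $z\sim p(z)$ once per episode and conditions a toy policy on it; the function names and the softmax form are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's implementation) of a latent-conditioned policy:
# z ~ p(z) is drawn once per episode and kept fixed; pi(a|s,z) here is a toy softmax.
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(discrete=True, num_skills=4, dim=2):
    """Sample z ~ p(z): categorical with N_z categories, or a standard Gaussian."""
    if discrete:
        return np.atleast_1d(rng.integers(num_skills))   # z in {0, ..., N_z - 1}
    return rng.normal(size=dim)                          # z ~ N(0, I)

def latent_conditioned_policy(state, z, num_actions=3):
    """Toy pi(a | s, z): a softmax over logits that depend on both s and z."""
    logits = np.cos(np.arange(num_actions) * (np.sum(state) + np.sum(z) + 1.0))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

z = sample_latent(discrete=True)                         # fixed for the whole episode
action_probs = latent_conditioned_policy(np.array([0.1, 0.2]), z)
```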

Table 1: Comparison of different diversity algorithms.
Method | Citation | Policy Selection | Reward Calculation
RSPO | [26] | Iteration Fashion | Behavior-driven / reward-driven exploration
SIPO | [9] | Iteration Fashion | Behavior-driven exploration
DIAYN | [7] | Uniform Sample | $I(s;z)$
DSP | [25] | Uniform Sample | $I(s,a;z)$
DGPO | [4] | Uniform Sample | $\min_{z'\neq z}D_{KL}(\rho^{\pi_{\theta}}(s|z)\,\|\,\rho^{\pi_{\theta}}(s|z'))$
Our work | – | Bandit Selection | Any form mentioned above

4 Methodology

In this section, we provide a detailed theoretical analysis of diversity algorithms. Firstly, in Section 4.1, we propose a unified framework for diversity algorithms and point out the major differences between diversity algorithms within this framework. Then we prove the convergence of diversity algorithms in Section 4.2. We further formulate the diversity optimization problem as a contextual bandit problem and propose bandit selection in Section 4.3. Finally, we provide a rigorous regret bound for bandit selection in Section 4.4.

4.1 A Unified Framework for Diversity Algorithms

Although there has been a lot of work on exploring diversity, we find that these algorithms lack a unified framework. So we propose a unified framework for diversity algorithms in Algorithm 1 to pave the way for further research.

We use $Div$ to measure the diversity distance between two policies and abbreviate policy $\pi_{\theta}(\cdot|s,z_{i})$ as $\pi^{i}$. Vector $z_{i}$ can be thought of as a skill unique to each policy $\pi^{i}$. Moreover, we define the diversity matrix $U\in\mathbb{R}^{N\times N}$, where $U_{ij}=Div(\pi^{i},\pi^{j})$ and $N$ denotes the number of policies.
For each episode, we first sample $z_{i}$ to decide which policy to update. Then the chosen policy interacts with the environment to collect a trajectory $\tau$, which is used to calculate the intrinsic reward $r^{in}$ and update the diversity matrix $U$. We then store the tuple $(s,a,s',r^{in},z_{i})$ in the replay buffer $\mathcal{D}$ and update $\pi^{i}$ with any reinforcement learning algorithm.
Here we abstract the procedures of selecting $z_{i}$ and calculating $r^{in}$ as the $SelectZ$ and $CalR$ functions, respectively, which are usually the most essential differences between diversity algorithms. We summarize the comparison of some diversity algorithms in Table 1. We now describe these two functions in more detail.
Policy Selection. We denote by $p(z)$ the distribution of $z$. Methods for selecting $z_{i}$ can generally be divided into three categories, namely iteration fashion, uniform sample, and bandit selection:

(1) Iteration fashion. Diversity algorithms such as RSPO [26] and SIPO [9] obtain diverse policies in an iterative manner. In the $k$-th iteration, policy $\pi^{k}$ is chosen to update, and the optimization target is to make $\pi^{k}$ sufficiently different from the previously discovered policies $\pi^{1},...,\pi^{k-1}$. This method does not ensure optimal performance and is greatly affected by policy initialization.

(2) Uniform sample. Another popular class of diversity algorithms, such as DIAYN [7] and DGPO [4], samples $z_{i}$ uniformly to maximize the entropy of $p(z)$. Because this method disregards the differences between policies, it often leads to slower convergence.

(3) Bandit selection. We frame obtaining diverse policies as a contextual bandit problem, and sampling $z_{i}$ then corresponds to minimizing regret. This approach guarantees strong performance and rapid convergence.
Reward Calculation. Diversity algorithms also differ in intrinsic reward calculation. Some, like [4, 7, 19], use mutual information theory and a discriminator $\phi$ to distinguish policies. DIAYN [7] emphasizes inferring the skill $z$ from the state $s$, while [19] suggests using state-action pairs. On the other hand, algorithms like [15, 26] aim to make policies' action or reward distributions distinguishable, known as behavior-driven and reward-driven exploration. DGPO [4] maximizes the minimal diversity distance between policies.

Algorithm 1 A Unified Framework for Diversity Algorithms
Initialize: $\pi_{\theta}(\cdot|s,z)$; $U\in\mathbb{R}^{N\times N}$ with $U_{ij}=Div(\pi^{i},\pi^{j})$
for each episode do
     Sample $z_{i}\sim SelectZ(U)$;
     Get trajectory $\tau$ from $\pi^{i}$;
     Get $r^{in}=CalR(\tau)$ and update $U$;
     Store tuple $(s,a,s',r^{in},z_{i})$ in replay buffer $\mathcal{D}$;
     Update $\pi^{i}$ with $\mathcal{D}$;
end for
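For readers who prefer code, the following Python sketch mirrors Algorithm 1 at a schematic level; `collect_trajectory`, `calc_intrinsic_reward`, and `rl_update` are assumed placeholders standing in for the environment rollout, the $CalR$ function, and any underlying RL algorithm.

```python
# Schematic Python version of Algorithm 1 (a sketch, not a full implementation).
# `collect_trajectory`, `calc_intrinsic_reward` (CalR), and `rl_update` stand in for
# the environment rollout, the intrinsic-reward computation, and any RL algorithm.
import numpy as np

def train_diverse_policies(policies, select_z, collect_trajectory,
                           calc_intrinsic_reward, rl_update, num_episodes=1000):
    n = len(policies)
    U = np.zeros((n, n))            # diversity matrix, U[i, j] = Div(pi^i, pi^j)
    replay_buffer = []              # stores tuples (s, a, s', r_in, z_i)
    for _ in range(num_episodes):
        i = select_z(U)                              # SelectZ: iteration / uniform / bandit
        tau = collect_trajectory(policies[i])        # list of (s, a, s') transitions
        r_in, U = calc_intrinsic_reward(tau, U, i)   # CalR, plus diversity-matrix update
        replay_buffer.extend((s, a, s_next, r_in, i) for (s, a, s_next) in tau)
        policies[i] = rl_update(policies[i], replay_buffer)
    return policies, U

# A uniform-sample SelectZ, for instance: lambda U: np.random.default_rng().integers(len(U))
```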

4.2 Convergence Analysis

In this section, we will show the convergence of diversity algorithms under a reasonable diversity target. We define $\mathcal{P}=\{\pi^{1},\pi^{2},...,\pi^{N}\}$ as the set of independent policies, or the policy population.

Definition 1. $g:\{\pi^{1},\pi^{2},...,\pi^{N}\}\to\mathbb{R}^{N\times N}$ is a function that maps the population $\mathcal{P}$ to the diversity matrix $U$ defined in Section 4.1. Given a population $\mathcal{P}$, we can calculate the pairwise diversity distances under a certain diversity metric, which indicates that $g$ is an injective function.

Definition 2. Note that in the iterative process of a diversity algorithm, we update $\mathcal{P}$ directly instead of $U$. So if we find a valid $U$ that satisfies the diversity target, then the corresponding population $\mathcal{P}$ is exactly our target diverse population. We refer to this process of recovering $\mathcal{P}$ from $U$ as $g^{-1}$.

Definition 3. $f:\mathbb{R}^{N\times N}\to\mathbb{R}$ is a function that maps $U$ to a real number. While $U$ measures the pairwise diversity distances between policies, $f$ measures the diversity of the entire population $\mathcal{P}$. As the diversity of the population increases, the diversity metric calculated by $f$ increases as well.

Definition 4. We further define the $\delta$-target population set $\mathcal{T}_{\delta}=\{g^{-1}(U)\,|\,f(U)>\delta,\,U\in\mathbb{R}^{N\times N}\}$, where $\delta$ is a threshold used to separate target and non-target regions. The meaning of this definition is that, during training, when the diversity metric closely related to $U$ exceeds the threshold, i.e., $f(U)>\delta$, the corresponding population $\mathcal{P}$ is our target population.
Note two important points: (1) the population meeting the diversity requirement is a set, not a fixed point; (2) the threshold $\delta$ should be chosen so that it ensures both sufficient diversity and ease of obtaining the population.

Theorem 4.1

$(\frac{\partial f}{\partial U})_{ij}=\frac{\partial f}{\partial U_{ij}}=\frac{\partial f}{\partial Div(\pi^{i},\pi^{j})}>0$, where $i,j\in\{1,2,3,...,N\}$.
Proof. $f$ measures the diversity of the entire population $\mathcal{P}$. When the diversity distance between two policies $\pi^{i}$ and $\pi^{j}$ in the population increases, the overall diversity metric $f(U)$ obviously increases.

Theorem 4.2

We can find some continuous differentiable $f$ such that $\exists\varepsilon>0$ with $(\frac{\partial f}{\partial U})_{ij}>\varepsilon$ for all $i,j\in\{1,2,3,...,N\}$.
Proof. For example, we can simply define $f(U)=\sum_{i\neq j}U_{ij}$, for which $(\frac{\partial f}{\partial U})_{ij}=1$. So we can choose a threshold $0<\varepsilon<1$, and then $(\frac{\partial f}{\partial U})_{ij}>\varepsilon$ holds obviously. Of course, we can also choose other, more complex $f$ as the diversity metric.
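As a concrete illustration of this choice, the sketch below builds a diversity matrix $U$ from pairwise KL divergences between state occupancy measures (one possible instantiation of $Div$, assumed here for illustration) and evaluates $f(U)=\sum_{i\neq j}U_{ij}$.

```python
# Minimal sketch of the diversity matrix U and the simple metric f(U) = sum_{i != j} U_ij
# used in the proof of Theorem 4.2. The KL-based Div is one possible instantiation.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def diversity_matrix(occupancies):
    """U[i, j] = Div(pi^i, pi^j), here the KL between state occupancy measures."""
    n = len(occupancies)
    U = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                U[i, j] = kl_divergence(occupancies[i], occupancies[j])
    return U

def f(U):
    """Population-level diversity metric with (df/dU)_{ij} = 1 for i != j."""
    return float(U.sum() - np.trace(U))

occupancies = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
U = diversity_matrix(occupancies)
print(f(U))   # increases whenever any pairwise diversity distance increases
```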

Theorem 4.3

There exists a diversity algorithm and a threshold $\nu>0$ such that, each time the population $\mathcal{P}$ is updated, several elements of $U$ increase by at least $\nu$ in expectation.
Proof. In fact, many existing diversity algorithms already have this property. Suppose we currently choose $\pi^{i}$ to update. For DIAYN [7], $Div(\pi^{i},\pi^{j})$ and $Div(\pi^{j},\pi^{i})$ ($\forall j\neq i$) are increased in the optimization process. For DGPO [4], suppose policy $\pi^{j}$ is the closest to policy $\pi^{i}$ in the policy space; then $Div(\pi^{i},\pi^{j})$ and $Div(\pi^{j},\pi^{i})$ are increased as well. Apart from these two, many other existing diversity algorithms, such as [19, 26, 15], share the same property. Note that Theorem 4.3 is stated in expectation, so we can infer that $\exists\nu>0,\,j\neq i$, s.t. $Div(\pi'^{i},\pi^{j})-Div(\pi^{i},\pi^{j})>\nu$, where $\pi'^{i}$ denotes the updated policy $\pi^{i}$. For $k\notin\{i,j\}$, we assume $U_{ik}$ and $U_{ki}$ are unchanged for simplicity.

Theorem 4.4

With an effective diversity algorithm and a reasonable diversity $\delta$-target, we can obtain a diverse population $\mathcal{P}\in\mathcal{T}_{\delta}$.
Proof. We denote by $\mathcal{P}_{0}$ the initialized policy population and define $f_{0}=f(g(\mathcal{P}_{0}))$. Then $\exists M\in\mathbb{N}$, e.g., any $M>(\delta-f_{0})/(\nu\varepsilon)$, s.t. $f_{0}+M\cdot\nu\varepsilon>\delta$. Given Theorem 4.2 and Theorem 4.3, each update increases $f$ by at least $\nu\varepsilon$ in expectation, so defining $\mathcal{P}_{M}$ as the policy population after $M$ iterations, we have $f(g(\mathcal{P}_{M}))>f_{0}+M\cdot\nu\varepsilon>\delta$. This means we can obtain a $\delta$-target policy population in at most $M$ iterations; in other words, the diversity algorithm converges after at most $M$ iterations.

Remark. Careful selection of the threshold $\delta$ is crucial for diversity algorithms. Reasonable diversity goals should be set to avoid an overly difficult or stuck training process. This hyperparameter can be obtained through empirical experiments or methods like hyperparameter search. In certain diversity algorithms, both $\delta$ and $\mathcal{P}$ may change during training. For instance, in iteration fashion algorithms (Section 4.1), during the $k$-th iteration, $\mathcal{P}=\{\pi^{1},\pi^{2},...,\pi^{k}\}$ with a target threshold of $\delta_{k}$. Once policy $\pi^{k}$ becomes distinct from $\pi^{1},...,\pi^{k-1}$ and the diversity target is met, policy $\pi^{k+1}$ is added to $\mathcal{P}$ and the threshold changes to $\delta_{k+1}$.

4.3 A Contextual Bandit Formulation

As mentioned in Section 4.1, we can sample $z_{i}$ via bandit selection. In this section, we formally define the $K$-armed contextual bandit problem [14] and show how it models the diversity optimization procedure.

Algorithm 2 A Contextual Bandit Formulation
Initialize: Arm set $\mathcal{A}$; contextual bandit algorithm $Algo$
for $t=1,2,3,...$ do
     Observe feature vectors $x_{t,a}$ for each $a\in\mathcal{A}$;
     Based on $\{x_{t,a}\}_{a\in\mathcal{A}}$ and the rewards in previous iterations, $Algo$ chooses an arm $a_{t}\in\mathcal{A}$ and receives reward $r_{t,a_{t}}$;
     Update $Algo$ with $(x_{t,a_{t}},a_{t},r_{t,a_{t}})$;
end for

We show the procedure of the contextual bandit problem in Algorithm 2. In each iteration, we observe feature vectors $x_{t,a}$ for each $a\in\mathcal{A}$, also referred to as the context. Note that the context may change during training. Then, $Algo$ chooses an arm $a_{t}\in\mathcal{A}$ based on the contextual information and receives reward $r_{t,a_{t}}$. Finally, the tuple $(x_{t,a_{t}},a_{t},r_{t,a_{t}})$ is used to update $Algo$.
We further define the T-Reward [14] of $Algo$ as $\sum_{t=1}^{T}r_{t}$. Similarly, we define the optimal expected T-Reward as $\mathbf{E}[\sum_{t=1}^{T}r_{t,a_{t}^{*}}]$, where $a_{t}^{*}$ denotes the arm with the maximum expected reward in iteration $t$. To measure $Algo$'s performance, we define the T-regret $R_{T}$ of $Algo$ by

$R_{T}=\mathbf{E}[\sum_{t=1}^{T}r_{t,a_{t}^{*}}]-\mathbf{E}[\sum_{t=1}^{T}r_{t,a_{t}}]$.   (1)

Our goal is to minimize $R_{T}$.
In the diversity optimization problem, policies are akin to arms, and the context is represented by visited states or $\rho^{\pi}(s)$. Note that the context may change as policies evolve. When updating a policy, the reward is the difference in the diversity metric before and after the update, which is linked to the diversity matrix $U$ (Section 4.1). Our objective is to maximize policy diversity, which is equivalent to maximizing the expected reward, or minimizing $R_{T}$, in the contextual bandit formulation.
Here is an example demonstrating the effectiveness of bandit selection. In some cases, a policy $\pi^{i}$ may already be distinct enough from the others, meaning that selecting $\pi^{i}$ for an update would not significantly affect policy diversity. To address this, we should decrease the probability of sampling $\pi^{i}$. Fixed uniform sampling fails to address this issue, but bandit algorithms like UCB [2] or LinUCB [14] consider both historical rewards and the number of times each policy has been chosen, which caters to our needs in such cases.
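As a concrete (and simplified) instantiation of bandit selection, the sketch below implements the disjoint LinUCB model of [14] for choosing which policy to update. The feature vectors (e.g., encodings of each policy's visited states) and the reward signal (the change in $f(U)$ after an update) are assumed to be supplied by the surrounding training loop; the regret analysis in Section 4.4 uses SupLinUCB [5] rather than this variant.

```python
# Compact LinUCB sketch (disjoint model of [14]) adapted to policy selection.
# Contexts x_{t,a} and rewards are supplied by the caller; alpha trades off
# exploitation (x^T theta_hat) against exploration (the confidence width).
import numpy as np

class LinUCBSelector:
    def __init__(self, num_policies, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(num_policies)]    # d x d design matrices
        self.b = [np.zeros(dim) for _ in range(num_policies)]  # reward-weighted features

    def select(self, contexts):
        """contexts[a] is the feature vector x_{t,a} of policy a; returns the chosen arm."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]
            ucb = x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```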

4.4 Regret Bound

In this section, we provide the regret bound for bandit selection in the diversity algorithms.

Problem Setting. We define $T$ as the number of iterations. In each iteration $t$, we observe $N$ feature vectors $x_{t,a}\in\mathbb{R}^{d}$ and receive reward $r_{t,a_{t}}$, with $\|x_{t,a}\|\leq 1$ for all $a\in\mathcal{A}$ and $r_{t,a_{t}}\in[0,1]$, where $\|\cdot\|$ denotes the $l_{2}$-norm, $d$ denotes the dimension of the feature vectors, and $a_{t}$ is the chosen action in iteration $t$.

Linear Realizability Assumption. As in many theoretical analyses of contextual bandit problems [1, 5], we adopt a linear realizability assumption to simplify the problem. We assume that there exists an unknown weight vector $\theta^{*}\in\mathbb{R}^{d}$ with $\|\theta^{*}\|\leq 1$ s.t.

$\mathbf{E}[r_{t,a}|x_{t,a}]=x_{t,a}^{T}\theta^{*}$   (2)

for all t and a.
We now analyze the rationality of this assumption in practical diversity algorithms. The reward $r_{t,a}$ measures the change $\Delta f(U)$ in the overall diversity metric of the policy population $\mathcal{P}$ after an update. Suppose $\pi^{i}_{t}$ is the policy corresponding to the feature vector $x_{t,a}$ in iteration $t$. While $x_{t,a}$ encodes state features of $\pi^{i}_{t}$, it can encode the diversity information of $\pi^{i}_{t}$ as well. Therefore, $r_{t,a}$ is closely related to $x_{t,a}$, and given that $x_{t,a}$ contains enough diversity information, we can assume that the hypothesis holds.
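The following small simulation sketch (an assumption-level mock-up, not part of our experiments) shows one way to generate contexts and rewards that satisfy Equation 2 together with the norm and range constraints of the problem setting; nonnegative features are an extra assumption used to keep the mean reward inside $[0,1]$.

```python
# Minimal simulation sketch of the linear realizability assumption (Eq. 2):
# E[r_{t,a} | x_{t,a}] = x_{t,a}^T theta*, with ||x_{t,a}|| <= 1, ||theta*|| <= 1
# and r_{t,a} in [0, 1]. Nonnegative features keep the mean inside [0, 1].
import numpy as np

rng = np.random.default_rng(0)
d = 4

def unit_ball(v):
    return v / max(1.0, np.linalg.norm(v))

theta_star = unit_ball(rng.random(d))     # unknown weight vector, ||theta*|| <= 1

def sample_context():
    return unit_ball(rng.random(d))       # e.g., features of a policy's visited states

def sample_reward(x):
    mean = float(x @ theta_star)          # in [0, 1] by Cauchy-Schwarz and nonnegativity
    return float(rng.binomial(1, mean))   # Bernoulli reward with E[r | x] = x^T theta*
```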

Theorem 4.5

(Diversity Reinforcement Learning Oracle $\mathcal{DRLO}$). Given a reasonable $\delta$-target and an effective diversity algorithm, let the probability that the policy population $\mathcal{P}$ reaches the $\delta$-target within $T$ iterations be $1-\backepsilon_{\delta,T}$. Then we have $\lim_{T\to\infty}\backepsilon_{\delta,T}=0$.
Proof. This is another formal statement of the convergence of diversity algorithms, which has been proved in Section 4.2. Experimental results [19, 4] have shown that $\backepsilon_{\delta,T}$ decreases significantly once $T$ reaches a certain value.

Theorem 4.6

(Contextual Bandit Algorithm Oracle $\mathcal{CBAO}$). There exists a contextual bandit algorithm whose regret is bounded by $O\left(\sqrt{Td\,\ln^{3}(NT\ln(T)/\eta)}\right)$ over $T$ iterations with probability $1-\eta$.
Proof. Different contextual bandit algorithms correspond to different regret bounds; in fact, we can use the regret bound of any contextual bandit algorithm here. The bound stated above is the regret bound of the SupLinUCB algorithm [5]; for the concrete proof, we refer the reader to [5].

Theorem 4.7

For $T$ iterations, the regret of bandit selection in diversity algorithms is bounded by $O\left(\sqrt{Td\,\ln^{3}\left(\frac{NT\ln(T)(1-\backepsilon_{\delta,T})}{\eta-\backepsilon_{\delta,T}}\right)}\right)$ with probability $1-\eta$. Note that $\lim_{T\to\infty}(\eta-\backepsilon_{\delta,T})=\eta>0$.
Proof. In diversity algorithms, the calculation of the regret bound is based on the premise that a certain $\delta$-target has been achieved. Note that $\mathcal{DRLO}$ and $\mathcal{CBAO}$ are independent in this problem setting. Given $0<\eta<1$, we define

$\eta_{1}=\frac{\eta-\backepsilon_{\delta,T}}{1-\backepsilon_{\delta,T}}$.   (3)

Then we have

$1-\eta=(1-\backepsilon_{\delta,T})(1-\eta_{1})$.   (4)
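Substituting Equation 3 verifies this identity directly:

$(1-\backepsilon_{\delta,T})(1-\eta_{1})=(1-\backepsilon_{\delta,T})\left(1-\frac{\eta-\backepsilon_{\delta,T}}{1-\backepsilon_{\delta,T}}\right)=(1-\backepsilon_{\delta,T})-(\eta-\backepsilon_{\delta,T})=1-\eta.$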

The implication of Equation 4 is that, for TT iterations, with probability 1η1-\eta, the regret for bandit selection in diversity algorithms is bounded by

$O\left(\sqrt{Td\,\ln^{3}(NT\ln(T)/\eta_{1})}\right)=O\left(\sqrt{Td\,\ln^{3}\left(\frac{NT\ln(T)(1-\backepsilon_{\delta,T})}{\eta-\backepsilon_{\delta,T}}\right)}\right)$.   (5)

The right-hand side of Equation 5 is exactly the regret bound we propose in Theorem 4.7.

Figure 1: (a) Policy evolution trajectory. We initialize three policies, denoted by red, yellow, and green circles on the simplex. The darker the color of a policy, the more iterations it has gone through, and the greater the diversity distance between this policy and the other policies. The blue circles on the simplex denote the average state marginal distribution of the policies, $\rho(s)$. (b) Policy evolution process. We again initialize three policies, denoted by red, green, and blue dots on the simplex. The black dot denotes the average state marginal distribution of the policies, $\rho(s)$. The contour lines correspond to the diversity metric $I(s;z)$.

5 Experiments

This section presents experimental results on diversity algorithms. Firstly, from an intuitive geometric perspective, we demonstrate the process of policy evolution in diversity algorithms. Then we compare the three policy selection methods mentioned in Section 4.1 through experiments, which illustrate the high efficiency of bandit selection.

Figure 2: Comparison of different policy selection methods. (a) Training curves for different numbers of policies with a fixed $\delta$-target, where $\delta=0.8$. (b) Training curves for different $\delta$-targets with a fixed number of policies, where $N=8$.

5.1 A Geometric Perspective on Policy Evolution

To visualize the policy evolution process, we use DIAYN [7] as the diversity algorithm and construct a simple 3-state MDP [8] for the experiment. The set of feasible state marginal distributions is described by the triangle with vertices $(1,0,0),(0,1,0),(0,0,1)$ in $\mathbb{R}^{3}$. We use the state occupancy measure $\rho^{\pi^{i}}(s)$ to represent policy $\pi^{i}$ and project it onto a two-dimensional simplex for visualization.

Let $\rho(s)$ be the average state marginal distribution of all policies. Figure 1(a) shows the policy evolution during training. Initially, the state occupancy measures of different policies are similar. However, as training progresses, the policies spread out, indicating increased diversity. Figure 1(a) highlights that diversity [8] ensures distinct state occupancy measures among policies.

We use $I(\cdot;\cdot)$ to denote mutual information. The diversity metric in unsupervised skill discovery algorithms is based on the mutual information between states and the latent variable $z$. Furthermore, the mutual information can be viewed as the average divergence between each policy's state distribution $\rho(s|z)$ and the average state distribution $\rho(s)$ [8]:

$I(s;z)=\mathbf{E}_{p(z)}[D_{KL}(\rho(s|z)\parallel\rho(s))]$.   (6)

Figure 1(b) shows the policy evolution process and the diversity metric $I(s;z)$. We find that the diversity metric increases gradually during training, which is in line with our expectations.
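As a check on Equation 6, the short sketch below computes $I(s;z)$ from the per-skill state distributions of a 3-state MDP under a uniform $p(z)$; the specific distributions are illustrative assumptions, not the ones from our experiment.

```python
# Minimal sketch of the diversity metric in Eq. (6) for a 3-state MDP:
# I(s; z) = E_{p(z)}[ D_KL( rho(s|z) || rho(s) ) ], with uniform p(z).
import numpy as np

def mutual_information(state_dists, eps=1e-12):
    """state_dists[i] is rho(s | z = i); returns I(s; z) under uniform p(z)."""
    state_dists = np.asarray(state_dists) + eps
    avg = state_dists.mean(axis=0)                          # rho(s), uniform average over z
    kls = np.sum(state_dists * np.log(state_dists / avg), axis=1)
    return float(kls.mean())

# Three nearly identical policies (low diversity) vs. three spread-out policies.
print(mutual_information([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]))
print(mutual_information([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]))
```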

5.2 Policy Selection Ablation

We continue to use the 3-state MDP [8] as the experimental environment. However, to get closer to complicated practical environments, we set a specific $\delta$-target and increase the number of policies. Moreover, when a policy that has not yet met the diversity requirement is chosen for an update, we receive a reward $r=1$; otherwise, we receive a reward $r=0$. We use $I(s;z)$ as the diversity metric and LinUCB [14] as our contextual bandit algorithm.
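A minimal sketch of this reward signal is given below (an illustrative mock-up rather than our experiment code); the pairwise threshold used to decide whether a policy has met the diversity requirement is an assumption introduced only for the example.

```python
# Illustrative reward signal for the ablation: selecting a policy that has not yet
# met the diversity requirement yields r = 1, otherwise r = 0. The pairwise
# threshold `delta_pairwise` is a stand-in for the actual diversity requirement.
import numpy as np

def selection_reward(U, chosen, delta_pairwise=0.8):
    """r = 1 if policy `chosen` is still too close to some other policy, else 0."""
    others = np.delete(U[chosen], chosen)   # diversity distances to all other policies
    return 1.0 if others.min() < delta_pairwise else 0.0
```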

Figure 2 shows the training curves for different numbers of policies and different $\delta$-targets over six random seeds. The results show that bandit selection not only reaches convergence fastest but also achieves the highest overall diversity metric of the population at convergence. We now empirically analyze the reasons for this result:
Drawbacks of uniform sample. In many experiments, we observe that uniform sample reaches a final performance similar to bandit selection but converges significantly more slowly. This is because, after several iterations, some policies become distinct enough that updating the other policies should be prioritized. However, uniform sample treats all policies equally, resulting in slow convergence.
Drawbacks of iteration fashion. In our experiments, the iteration fashion converges quickly but has lower final performance than the other two methods. It is greatly affected by initialization: each policy update depends on the previous one, so a poor initialization can severely impact subsequent updates and damage the overall training process.
Advantages of bandit selection. By considering historical rewards and balancing exploitation and exploration, bandit selection quickly determines whether a policy is already different enough and adjusts the sampling distribution accordingly. Unlike iteration fashion, any policy can be selected for an update in any iteration, so bandit selection is not limited by policy initialization.

6 Conclusion

In this paper, we compare existing diversity algorithms, provide a unified diversity reinforcement learning framework, and investigate the convergence of training diverse policies. Moreover, we propose bandit selection under the proposed framework and present its regret bound. Empirical results indicate that bandit selection achieves the highest diversity score with the fastest convergence compared to baseline methods. We also provide a geometric perspective on policy evolution through experiments. In the future, we will focus on the comparison and theoretical analysis of different reward calculation methods, and we will continue to explore the application of diversity RL algorithms in more real-world decision-making tasks.

References

  • [1] Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov), 397–422 (2002)
  • [2] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2), 235–256 (2002)
  • [3] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al.: Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019)
  • [4] Chen, W., Huang, S., Chiang, Y., Chen, T., Zhu, J.: Dgpo: Discovering multiple strategies with diversity-guided policy optimization. In: Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems. pp. 2634–2636 (2023)
  • [5] Chu, W., Li, L., Reyzin, L., Schapire, R.: Contextual bandits with linear payoff functions. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pp. 208–214. JMLR Workshop and Conference Proceedings (2011)
  • [6] Ellis, B., Moalla, S., Samvelyan, M., Sun, M., Mahajan, A., Foerster, J.N., Whiteson, S.: Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2212.07489 (2022)
  • [7] Eysenbach, B., Gupta, A., Ibarz, J., Levine, S.: Diversity is all you need: Learning skills without a reward function. In: International Conference on Learning Representations (2018)
  • [8] Eysenbach, B., Salakhutdinov, R., Levine, S.: The information geometry of unsupervised reinforcement learning. In: International Conference on Learning Representations (2021)
  • [9] Fu, W., Du, W., Li, J., Chen, S., Zhang, J., Wu, Y.: Iteratively learning novel strategies with diversity measured in state distances. Submitted to ICLR 2023 (2022)
  • [10] Huang, S., Chen, W., Zhang, L., Li, Z., Zhu, F., Ye, D., Chen, T., Zhu, J.: Tikick: Towards playing multi-agent football full games from single-agent demonstrations. arXiv preprint arXiv:2110.04507 (2021)
  • [11] Huang, S., Yu, C., Wang, B., Li, D., Wang, Y., Chen, T., Zhu, J.: Vmapd: Generate diverse solutions for multi-agent games with recurrent trajectory discriminators. In: 2022 IEEE Conference on Games (CoG). pp. 9–16. IEEE (2022)
  • [12] Kumar, S., Kumar, A., Levine, S., Finn, C.: One solution is not all you need: Few-shot extrapolation via structured maxent rl. Advances in Neural Information Processing Systems 33, 8198–8210 (2020)
  • [13] Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D., Graepel, T.: A unified game-theoretic approach to multiagent reinforcement learning. Advances in neural information processing systems 30 (2017)
  • [14] Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on World wide web. pp. 661–670 (2010)
  • [15] Liu, X., Jia, H., Wen, Y., Yang, Y., Hu, Y., Chen, Y., Fan, C., Hu, Z.: Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958 (2021)
  • [16] Mahajan, A., Rashid, T., Samvelyan, M., Whiteson, S.: Maven: Multi-agent variational exploration. arXiv preprint arXiv:1910.07483 (2019)
  • [17] Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al.: Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470 (2021)
  • [18] May, B.C., Korda, N., Lee, A., Leslie, D.S.: Optimistic bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research 13, 2069–2106 (2012)
  • [19] Osa, T., Tangkaratt, V., Sugiyama, M.: Discovering diverse solutions in deep reinforcement learning by maximizing state–action-based mutual information. Neural Networks 152, 90–104 (2022)
  • [20] Shi, J.C., Yu, Y., Da, Q., Chen, S.Y., Zeng, A.X.: Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 4902–4909 (2019)
  • [21] Wang, T.T., Gleave, A., Belrose, N., Tseng, T., Miller, J., Dennis, M.D., Duan, Y., Pogrebniak, V., Levine, S., Russell, S.: Adversarial policies beat professional-level go ais. arXiv preprint arXiv:2211.00241 (2022)
  • [22] Watkins, C.J.C.H.: Learning from delayed rewards. Robotics & Autonomous Systems (1989)
  • [23] Xue, W., Cai, Q., Zhan, R., Zheng, D., Jiang, P., An, B.: Resact: Reinforcing long-term engagement in sequential recommendation with residual actor. arXiv preprint arXiv:2206.02620 (2022)
  • [24] Yu, C., Yang, X., Gao, J., Yang, H., Wang, Y., Wu, Y.: Learning efficient multi-agent cooperative visual exploration. arXiv preprint arXiv:2110.05734 (2021)
  • [25] Zahavy, T., O’Donoghue, B., Barreto, A., Flennerhag, S., Mnih, V., Singh, S.: Discovering diverse nearly optimal policies with successor features. In: ICML 2021 Workshop on Unsupervised Reinforcement Learning (2021)
  • [26] Zhou, Z., Fu, W., Zhang, B., Wu, Y.: Continuously discovering novel strategies via reward-switching policy optimization. In: International Conference on Learning Representations (2021)