

Bounded Robustness in Reinforcement Learning
via Lexicographic Objectives

Daniel Jarne Ornia, Delft University of Technology (d.jarneornia@tudelft.nl)
Licio Romao, University of Oxford (licio.romao@cs.ox.ac.uk)
Lewis Hammond, University of Oxford (lewis.hammond@cs.ox.ac.uk)
Manuel Mazo Jr., Delft University of Technology (m.mazo@tudelft.nl)
Alessandro Abate, University of Oxford (alessandro.abate@cs.ox.ac.uk)
Abstract

Policy robustness in Reinforcement Learning may not be desirable at any cost: the alterations caused by robustness requirements from otherwise optimal policies should be explainable, quantifiable and formally verifiable. In this work we study how policies can be maximally robust to arbitrary observational noise by analysing how they are altered by this noise through a stochastic linear operator interpretation of the disturbances, and establish connections between robustness and properties of the noise kernel and of the underlying MDPs. Then, we construct sufficient conditions for policy robustness, and propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off expected policy utility for robustness through lexicographic optimisation, while preserving convergence and sub-optimality in the policy synthesis.

1 Introduction

Consider a dynamical system for which we need to synthesise a controller (policy) through a model-free Reinforcement Learning (Sutton and Barto, 2018) approach. When using a simulator for training, we expect the deployment of the controller in the real system to be affected by different sources of noise, possibly not predictable or modelled (e.g. for networked components we may have sensor faults, communication delays, etc.). In safety-critical systems, robustness (in terms of successfully controlling the system under disturbances) should preserve formal guarantees, and much effort has been put into developing formal convergence guarantees for policy gradient algorithms (Agarwal et al., 2021; Bhandari and Russo, 2019). All these guarantees vanish under regularisation and adversarial approaches, which aim to produce more robust policies. Therefore, for such applications one needs a scheme to regulate the robustness-utility trade-off in RL policies that, on the one hand, preserves the formal guarantees of the original algorithms and, on the other, attains sub-optimality conditions with respect to the original problem. Additionally, if we do not know the structure of the disturbance (which is the case in most applications), directly learning a policy for an arbitrarily disturbed environment will yield unexpected behaviours when deployed in the true system.

Lexicographic Reinforcement Learning (LRL)

Recently, lexicographic optimisation (Isermann, 1982; Rentmeesters et al., 1996) has been applied to the multi-objective RL setting (Skalse et al., 2022b). In an LRL setting some objectives may be more important than others, and so we may want to obtain policies that solve the multi-objective problem in a lexicographically prioritised way, i.e., "find the policies that optimise objective $i$ (reasonably well), and from those the ones that optimise objective $i+1$ (reasonably well), and so on".

Previous Work

In robustness against model uncertainty, the MDP may have noisy or uncertain reward signals or transition probabilities, as well as possible resulting distributional shifts in the training data (Heger, 1994; Xu and Mannor, 2006; Fu et al., 2018; Pattanaik et al., 2018; Pirotta et al., 2013; Abdullah et al., 2019), connecting to ideas on distributionally robust optimisation (Wiesemann et al., 2014; Van Parys et al., 2015). Regarding adversarial attacks or disturbances on policies or action selection in RL agents (Gleave et al., 2020; Lin et al., 2017; Tessler et al., 2019; Pan et al., 2019; Tan et al., 2020; Klima et al., 2019; Liang et al., 2022), Gleave et al. (2020) recently propose to attack RL agents by swapping the policy for an adversarial one at given times. For a detailed review on Robust RL see Moos et al. (2022). Our work focuses on robustness against observational disturbances, where agents observe a disturbed state measurement and use it as input for the policy (Kos and Song, 2017; Huang et al., 2017; Behzadan and Munir, 2017; Mandlekar et al., 2017; Zhang et al., 2020, 2021). Zhang et al. (2020) propose a state-adversarial MDP framework, with adversarial regularising terms that can be added to different deep RL algorithms to make the resulting policies more robust to observational disturbances, and Zhang et al. (2021) study how LSTM-based policies increase robustness against optimal state-perturbing adversaries.

Contributions

Most existing work on RL with observational disturbances proposes modifying RL algorithms at the cost of explainability (in terms of sub-optimality bounds) and verifiability, since the induced changes in the new policies result in a loss of convergence guarantees. Our main contributions are summarised in the following points.

  • We consider general unknown stochastic disturbances and formulate a quantitative definition of observational robustness that allows us to characterise the sets of robust policies for any MDP in the form of operator-invariant sets. We analyse how the structure of these sets depends on the MDP and noise kernel, and obtain an inclusion relation providing intuition into how we can search for robust policies more effectively. (There are strong connections between Sections 2-3 of this paper and the literature on planning for POMDPs (Spaan and Vlassis, 2004; Spaan, 2012) and MDP invariances (Ng et al., 1999; van der Pol et al., 2020; Skalse et al., 2022a), as well as recent work concerning robustness misspecification (Korkmaz, 2023).)

  • We propose a meta-algorithm, Lexicographically Robust Policy Gradient (LRPG), that can be applied to any existing policy gradient algorithm and that (1) retains policy sub-optimality up to a specified tolerance while maximising robustness, (2) formally controls the utility-robustness trade-off through this design tolerance, and (3) preserves the formal guarantees of the underlying algorithm.

Figure 1 represents a qualitative interpretation of the results in this work.

Figure 1: Qualitative representation of LRPG (right), compared to usual robustness-inducing algorithms. The sets in blue are the robust policy sets defined in the coming sections. LRPG induces robustness while guaranteeing that the resulting policies deviate a bounded distance from the optimal one.

1.1 Preliminaries

Notation

We use calligraphic letters $\mathcal{A}$ for collections of sets and $\Delta(\mathcal{A})$ for the space of probability measures over $\mathcal{A}$. For two probability distributions $P,P'$ defined on the same $\sigma$-algebra $\mathcal{F}$, $D_{TV}(P\|P')=\sup_{A\in\mathcal{F}}|P(A)-P'(A)|$ is the total variation distance. For two elements of a vector space we use $\langle\cdot,\cdot\rangle$ as the inner product. We use $\mathbf{1}_{n}$ for the column vector of size $n$ with all entries equal to 1. We say that an MDP is ergodic if for any policy the resulting Markov Chain (MC) is ergodic. We say that $S$ is an $n\times n$ row-stochastic matrix if $S_{ij}\geq 0$ and each row of $S$ sums to 1. We assume that all learning rates in this work, $\alpha_t(x,u)\in[0,1]$ (and similarly $\beta_t,\eta_t,\dots$), satisfy the conditions $\sum_{t=1}^{\infty}\alpha_t(x,u)=\infty$ and $\sum_{t=1}^{\infty}\alpha_t(x,u)^2<\infty$.

Lexicographic Reinforcement Learning

Consider a parameterised policy $\pi_\theta$ with $\theta\in\Theta$, and two objective functions $K_1$ and $K_2$. Policy-based LRL (PB-LRL) uses a multi-timescale optimisation scheme to optimise $\theta$ faster for higher-priority objectives, iteratively updating the constraints induced by these priorities and encoding them via Lagrangian relaxation techniques (Bertsekas, 1997). Let $\theta'\in\argmax_\theta K_1(\theta)$. Then, PB-LRL can be used to find parameters $\theta''\in\{\argmax_\theta K_2(\theta)\ \text{s.t.}\ K_1(\theta)\geq K_1(\theta')-\epsilon\}$. This is done through the update:

$$\theta\leftarrow\operatorname{proj}_{\Theta}\big[\theta+\nabla_{\theta}\hat{K}(\theta)\big],\qquad \lambda\leftarrow\operatorname{proj}_{\mathbb{R}_{\geq 0}}\big[\lambda+\eta_{t}\big(\hat{k}_{1}-\epsilon_{t}-K_{1}(\theta)\big)\big],$$ (1)

where $\hat{K}(\theta):=(\beta_t^1+\lambda\beta_t^2)\cdot K_1(\theta)+\beta_t^2\cdot K_2(\theta)$, $\lambda$ is a Lagrange multiplier, $\beta_t^1,\beta_t^2,\eta_t$ are learning rates, and $\hat{k}_1$ is an estimate of $K_1(\theta')$. Typically we set $\epsilon_t\to 0$, though other tolerances can be used too, e.g., $\epsilon_t=0.9\cdot\hat{k}_1$. For more details see Skalse et al. (2022b).
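To make the update (1) concrete, the sketch below runs it on two toy quadratic objectives; the objective functions, constants and the identity projection onto $\Theta$ are illustrative stand-ins of ours, not the PB-LRL implementation of Skalse et al. (2022b).

```python
import numpy as np

# Toy instance of the PB-LRL update (1): maximise K1, then K2 subject to K1 staying
# within eps of its best running estimate k1_hat. Objectives are illustrative only.
def K1(theta):
    return -np.sum((theta - 1.0) ** 2)      # primary objective

def K2(theta):
    return -np.sum((theta + 1.0) ** 2)      # secondary objective

def grad(f, theta, h=1e-5):
    # Finite-difference gradient; adequate for a sketch.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta, lam = np.zeros(2), 0.0
k1_hat, eps = -np.inf, 0.05
beta1, beta2, eta = 1e-2, 1e-3, 1e-3        # faster timescale for the primary objective

for t in range(20000):
    k1_hat = max(k1_hat, K1(theta))                          # running estimate of K1(theta')
    # K_hat = (beta1 + lam * beta2) K1 + beta2 K2, cf. (1); proj_Theta is the identity here
    theta = theta + (beta1 + lam * beta2) * grad(K1, theta) + beta2 * grad(K2, theta)
    lam = max(0.0, lam + eta * (k1_hat - eps - K1(theta)))   # proj onto R_{>=0}

print(theta, K1(theta), k1_hat - eps)       # K1(theta) should remain above k1_hat - eps
```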

2 Observationally Robust Reinforcement Learning

Robustness-inducing methods in model-free RL must address the following dilemma: how do we deal with uncertainty without an explicit mechanism to estimate such uncertainty during policy execution? Consider an example of an MDP where, at the policy roll-out phase, there is a non-zero probability of measuring a "wrong" state. In such a scenario, measuring the wrong state can lead to executing unboundedly bad actions. This problem is captured by the following version of a noise-induced partially observable Markov Decision Process (Spaan, 2012).

Definition 2.1.

An observationally-disturbed MDP (DOMDP) is (a POMDP) defined by the tuple $(X,U,P,R,T,\gamma)$, where $X$ is a finite set of states, $U$ is a set of actions, $P:U\times X\mapsto\Delta(X)$ is a probability measure over the transitions between states, and $R:X\times U\times X\mapsto\mathbb{R}$ is a reward function. The map $T:X\mapsto\Delta(X)$ is a stochastic kernel induced by some unknown noise signal, such that $T(y\mid x)$ is the probability of measuring $y$ while the true state is $x$; it acts only on the state observations. Finally, $\gamma\in[0,1]$ is a reward discount.

A (memoryless) policy for the agent is a stochastic kernel $\pi:X\mapsto\Delta(U)$. For simplicity, we overload notation on $\pi$, denoting by $\pi(x,u)$ the probability of taking action $u$ at state $x$. In a DOMDP, agents can measure the full state, but the measurement will be disturbed by some unknown random signal during policy deployment. (Definition 2.1 is a generalised form of the State-Adversarial MDP used by Zhang et al. (2020): the adversarial case is a particular DOMDP where $T$ assigns probability 1 to a single adversarial state.) The difficulty of acting in such a DOMDP is that agents have to act based on disturbed states $\tilde{x}\sim T(\cdot\mid x)$. We then need to construct policies that are as robust as possible against such noise, without the existence of a model to estimate, filter or reject the disturbances. The value function of a policy $\pi$ (critic), $V^\pi:X\mapsto\mathbb{R}$, is given by $V^\pi(x_0)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^t R(x_t,\pi(x_t),x_{t+1})]$. The action-value function of $\pi$ ($Q$-function) is given by $Q^\pi(x,u)=\sum_{y\in X}P(x,u,y)\big(R(x,u,y)+\gamma V^\pi(y)\big)$. We then define the objective function as $J(\pi):=\mathbb{E}_{x_0\sim\mu_0}[V^\pi(x_0)]$, with $\mu_0$ a distribution over initial states; we write $J^*:=\max_\pi J(\pi)$ and $\pi^*$ for the optimal value and an optimal policy, and $\Pi^*_\epsilon:=\{\pi\in\Pi: J^*-J(\pi)\leq\epsilon\}$ for the set of $\epsilon$-optimal policies. If a policy is parameterised by $\theta\in\Theta$ we write $\pi_\theta$ and $J(\theta)$.

Assumption 1

For any DOMDP and policy $\pi$, the resulting MC is irreducible and aperiodic.

We now formalise a notion of observational robustness. Firstly, due to the presence of the stochastic kernel $T$, the policy we are applying is altered, as we are applying a collection of actions in a possibly wrong state. Then, $\langle\pi,T\rangle(x,u):=\sum_{y\in X}T(y\mid x)\pi(y,u)$, where $\langle\pi,T\rangle:X\mapsto\Delta(U)$ is the disturbed policy, which averages the current policy given the error induced by the presence of the stochastic kernel. Notice that $\langle\cdot,T\rangle(x):\Pi\mapsto\Delta(U)$ is an averaging operator yielding the alteration of the policy due to noise. We define the robustness regret $\rho(\pi,T):=J(\pi)-J(\langle\pi,T\rangle)$. (The robustness regret satisfies $\rho(\pi^*,T)\geq 0$ for all kernels $T$, and it allows us to directly compare the robustness regret with the utility regret of the policy.)
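For intuition, the disturbed policy $\langle\pi,T\rangle$ and the robustness regret $\rho(\pi,T)$ can be computed exactly in the tabular case; the sketch below does so for a randomly generated DOMDP (all arrays are illustrative, not from the paper's experiments).

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nU, gamma = 4, 3, 0.9

# Random tabular DOMDP ingredients (purely illustrative).
P = rng.dirichlet(np.ones(nX), size=(nX, nU))        # P[x, u, y]
R = rng.normal(size=(nX, nU, nX))                    # R[x, u, y]
mu0 = np.ones(nX) / nX                               # initial-state distribution
T = rng.dirichlet(np.ones(nX), size=nX)              # noise kernel T[x, y] = T(y | x)
pi = rng.dirichlet(np.ones(nU), size=nX)             # policy pi[x, u]

def J(pi):
    """Expected discounted return of a (memoryless) policy via exact policy evaluation."""
    P_pi = np.einsum('xu,xuy->xy', pi, P)            # state transition matrix under pi
    r_pi = np.einsum('xu,xuy,xuy->x', pi, P, R)      # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nX) - gamma * P_pi, r_pi)
    return mu0 @ V

disturbed_pi = T @ pi                                # <pi, T>(x, u) = sum_y T(y|x) pi(y, u)
rho = J(pi) - J(disturbed_pi)                        # robustness regret rho(pi, T)
print(f"J(pi) = {J(pi):.3f}, J(<pi,T>) = {J(disturbed_pi):.3f}, rho = {rho:.3f}")
```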

Definition 2.2 (Policy Robustness).

A policy $\pi$ is $\kappa$-robust against a stochastic kernel $T$ if $\rho(\pi,T)\leq\kappa$. If $\pi$ is $0$-robust it is maximally robust. The sets of $\kappa$-robust policies are $\Pi_\kappa:=\{\pi\in\Pi:\rho(\pi,T)\leq\kappa\}$, with $\Pi_0$ being the set of maximally robust policies.

One can motivate the characterisation and models above from a control perspective, where policies take as input discretised state measurements affected by possible sensor errors. Formally ensuring robustness properties when learning RL policies will, in general, force the resulting policies to deviate from optimality in the undisturbed MDP. We then pose the following problem.

Problem 2.3.

Consider a DOMDP model as per Definition 2.1 and let $\epsilon$ be a non-negative tolerance level. Our goal is to find, amongst all $\epsilon$-optimal policies, those that minimise the robustness level $\kappa$:

$$\operatorname{minimize}\ \kappa\quad\text{s.t.}\quad\pi\in\Pi^{\star}_{\epsilon}\cap\Pi_{\kappa}.$$

Note that this is formulated as generally as possible with respect to the robustness of the policies: we would like to find a policy that, trading off $\epsilon$ in terms of cumulative rewards, observes the same discounted rewards when disturbed by $T$.

3 Characterisation of Robust Policies

An important question to be addressed before trying to synthesise robust policies is what these robust policies look like, and how they are related to the properties of the DOMDP. A policy $\pi$ is said to be constant if $\pi(x)=\pi(y)$ for all $x,y\in X$, and the collection of all constant policies is denoted by $\bar{\Pi}$. A policy is called a fixed point of $\langle\cdot,T\rangle$ if $\pi(x)=\langle\pi,T\rangle(x)$ for all $x\in X$; the collection of all fixed points is $\Pi_T$. Observe furthermore that $\Pi_T$ only depends on the kernel $T$ and the set $X$. (There is a natural bijection between the set of constant policies and the space $\Delta(U)$. The set of fixed points of the operator $\langle\cdot,T\rangle$ also has an algebraic characterisation in terms of the null space of the operator $\mathrm{Id}(\cdot)-\langle\cdot,T\rangle$; we do not exploit the latter characterisation in this paper.) Let us assume we have a policy iteration algorithm that employs an action-value function $Q^\pi$ and policy $\pi$. The advantage function for $\pi$ is defined as $A^\pi(x,u):=Q^\pi(x,u)-V^\pi(x)$. We can similarly define the noise disadvantage of policy $\pi$ as:

$$D^{\pi}(x,T):=V^{\pi}(x)-\mathbb{E}_{u\sim\langle\pi,T\rangle(x)}\big[Q^{\pi}(x,u)\big],$$ (2)

which measures the difference between playing, at state $x$, an action according to the policy $\pi$ and playing an action according to $\langle\pi,T\rangle$ and thereafter continuing with $\pi$. Our intuition says that if $D^\pi(x,T)=0$ for all states in the DOMDP, then such a policy is maximally robust. This is indeed the case, as shown in the next proposition.
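As a minimal numerical illustration of (2), the snippet below computes $D^{\pi}(x,T)$ exactly for a small random tabular DOMDP, using the same policy-evaluation construction as in the earlier sketch; all quantities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nU, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nX), size=(nX, nU))     # P[x, u, y]
R = rng.normal(size=(nX, nU, nX))                 # R[x, u, y]
T = rng.dirichlet(np.ones(nX), size=nX)           # T[x, y] = T(y | x)
pi = rng.dirichlet(np.ones(nU), size=nX)          # pi[x, u]

# Exact V^pi and Q^pi for the tabular case.
P_pi = np.einsum('xu,xuy->xy', pi, P)
r_pi = np.einsum('xu,xuy,xuy->x', pi, P, R)
V = np.linalg.solve(np.eye(nX) - gamma * P_pi, r_pi)
Q = np.einsum('xuy,xuy->xu', P, R) + gamma * np.einsum('xuy,y->xu', P, V)

# Noise disadvantage (2): D^pi(x, T) = V^pi(x) - E_{u ~ <pi,T>(x)}[Q^pi(x, u)].
disturbed_pi = T @ pi
D = V - np.einsum('xu,xu->x', disturbed_pi, Q)
print(D)   # D identically zero for all x would imply maximal robustness (Prop. 3.1)
```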

Proposition 3.1.

Consider a DOMDP as in Definition 2.1 and the robustness notion of Definition 2.2. If a policy $\pi$ is such that $D^\pi(x,T)=0$ for all $x\in X$, then $\pi$ is maximally robust; i.e., letting $\Pi_D:=\{\pi\in\Pi:\mu_\pi(x)D^\pi(x,T)=0\ \forall x\in X\}$, we have that $\Pi_D\subseteq\Pi_0$.

Proof 3.2.

We want to show that $D^{\pi}(x,T)=0\implies\rho(\pi,T)=0$. Taking $D^{\pi}(x,T)=0$, one has a policy that produces a disadvantage of zero when the noise kernel $T$ is applied. Then, for all $x\in X$,

$$D^{\pi}(x,T)=0\implies\mathbb{E}_{u\sim\langle\pi,T\rangle(x)}\big[Q^{\pi}(x,u)\big]=V^{\pi}(x).$$ (3)

Now define the value of the disturbed policy as $V^{\langle\pi,T\rangle}(x)=\mathbb{E}_{u\sim\langle\pi,T\rangle(x),\,y\sim P(\cdot\mid x,u)}\big[r(x,u,y)+\gamma V^{\langle\pi,T\rangle}(y)\big]$. We will now show that $V^{\pi}(x)=V^{\langle\pi,T\rangle}(x)$ for all $x\in X$. Observe, from (3), using $V^{\pi}(x)=\mathbb{E}_{u\sim\langle\pi,T\rangle(x)}[Q^{\pi}(x,u)]$, that for all $x\in X$:

$$V^{\pi}(x)-V^{\langle\pi,T\rangle}(x)=\mathbb{E}_{u\sim\langle\pi,T\rangle(x)}\big[Q^{\pi}(x,u)\big]-\mathbb{E}_{\substack{u\sim\langle\pi,T\rangle(x)\\ y\sim P(\cdot\mid x,u)}}\big[r(x,u,y)+\gamma V^{\langle\pi,T\rangle}(y)\big]=\mathbb{E}_{\substack{u\sim\langle\pi,T\rangle(x)\\ y\sim P(\cdot\mid x,u)}}\big[\gamma V^{\pi}(y)-\gamma V^{\langle\pi,T\rangle}(y)\big]=\gamma\,\mathbb{E}_{\substack{u\sim\langle\pi,T\rangle(x)\\ y\sim P(\cdot\mid x,u)}}\big[V^{\pi}(y)-V^{\langle\pi,T\rangle}(y)\big].$$ (4)

Now, taking the sup norm on both sides of (4) and bounding the expectation by the supremum, we get

$$\big\|V^{\pi}-V^{\langle\pi,T\rangle}\big\|_{\infty}\leq\gamma\,\big\|V^{\pi}-V^{\langle\pi,T\rangle}\big\|_{\infty}.$$ (5)

Since $\gamma<1$, it follows from (5) that $\|V^{\pi}-V^{\langle\pi,T\rangle}\|_{\infty}=0$, hence $V^{\pi}(x)-V^{\langle\pi,T\rangle}(x)=0$ for all $x\in X$. Finally, $V^{\pi}(x)=V^{\langle\pi,T\rangle}(x)$ for all $x\in X$ implies $J(\pi)=J(\langle\pi,T\rangle)$, and therefore $\rho(\pi,T)=0$.

So far we have shown that both the set of fixed points $\Pi_T$ (trivially, since $\langle\pi,T\rangle=\pi$ implies $J(\langle\pi,T\rangle)=J(\pi)$) and the set of policies for which the disadvantage function is equal to zero, $\Pi_D$, are contained in the set of maximally robust policies. We now show how the robust policy sets defined above can be linked in a single result through the following policy inclusions.

Theorem 3.3 (Policy Inclusions).

For a DOMDP with noise kernel $T$, consider the sets $\overline{\Pi},\Pi_T,\Pi_D$ and $\Pi_0$. Then the following inclusion relation holds: $\overline{\Pi}\subseteq\Pi_T\subseteq\Pi_D\subseteq\Pi_0$. Additionally, the sets $\overline{\Pi},\Pi_T$ are convex for all MDPs and kernels $T$, but $\Pi_D,\Pi_0$ may not be.

Proof 3.4.

If a policy $\pi\in\Pi$ is a fixed point of the operator $\langle\cdot,T\rangle$, then $\rho(\pi,T)=J(\pi)-J(\langle\pi,T\rangle)=J(\pi)-J(\pi)=0$, hence $\pi\in\Pi_0$ and therefore $\Pi_T\subseteq\Pi_0$. Now, the space of stochastic kernels $\mathcal{K}:X\mapsto\Delta(X)$ is equivalent to the space of row-stochastic $|X|\times|X|$ matrices, so one can write $T(y\mid x)\equiv T_{xy}$ for the $xy$-th entry of the matrix $T$. The representation of a constant policy as an $|X|\times|U|$ matrix can then be written as $\overline{\pi}=\mathbf{1}_{|X|}v^{\top}$, where $v\in\Delta(U)$ is any probability distribution over the action space. Observe that applying the operator $\langle\cdot,T\rangle$ to a constant policy yields $\langle\overline{\pi},T\rangle=T\mathbf{1}_{|X|}v^{\top}$. By the Perron-Frobenius Theorem (Horn and Johnson, 2012), since $T$ is row-stochastic it has at least one eigenvalue $\operatorname{eig}(T)=1$, which admits the (strictly positive) eigenvector $\mathbf{1}_{|X|}$, i.e. $T\mathbf{1}_{|X|}=\mathbf{1}_{|X|}$. Therefore, $\langle\overline{\pi},T\rangle=T\mathbf{1}_{|X|}v^{\top}=\mathbf{1}_{|X|}v^{\top}=\overline{\pi}$, which implies $\overline{\Pi}\subseteq\Pi_T$. Combining this result with Proposition 3.1, we only need to show that $\Pi_T\subseteq\Pi_D$. Take $\pi$ to be a fixed point of $\langle\cdot,T\rangle$. Then $\langle\pi,T\rangle=\pi$, and from the definition in (2):

$$D^{\pi}(x,T)=V^{\pi}(x)-\mathbb{E}_{u\sim\langle\pi,T\rangle(x,\cdot)}\big[Q^{\pi}(x,u)\big]=V^{\pi}(x)-\mathbb{E}_{u\sim\pi(x,\cdot)}\big[Q^{\pi}(x,u)\big]=0.$$

Therefore $\pi\in\Pi_D$, which completes the sequence of inclusions. Convexity of $\overline{\Pi}$ and $\Pi_T$ follows since any convex combination of two constant (respectively, fixed-point) policies is again constant (respectively, a fixed point of $\langle\cdot,T\rangle$).
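The key step $\overline{\Pi}\subseteq\Pi_T$ can be sanity-checked numerically: any constant policy is left unchanged by an arbitrary row-stochastic kernel. A minimal check with illustrative arrays:

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nU = 5, 3
T = rng.dirichlet(np.ones(nX), size=nX)      # arbitrary row-stochastic noise kernel
v = rng.dirichlet(np.ones(nU))               # any fixed action distribution
pi_const = np.tile(v, (nX, 1))               # constant policy: same distribution in every state

# <pi, T> = T @ pi; a constant policy is unchanged because T 1 = 1 (row-stochasticity).
assert np.allclose(T @ pi_const, pi_const)
print("constant policy is a fixed point of <., T> for this (arbitrary) T")
```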

Let us reflect on the inclusion relations of Theorem 3.3. The inclusions are in general not strict, and in fact the geometry of the sets (as well as whether some of the relations are in fact equalities) is highly dependent on the reward function, and in particular on the complexity (from an information-theoretic perspective) of the reward function. As an intuition, less complex reward functions (more uniform) will make the inclusions above expand to the entire policy set, and more complex reward functions will make the relations collapse to equalities.

Corollary 3.5.

For any ergodic DOMDP there exist reward functions $\overline{R}$ and $\underline{R}$ such that the resulting DOMDP satisfies A) $\Pi_D=\Pi_0=\Pi$ (any policy is maximally robust) if $R=\overline{R}$, and B) $\Pi_T=\Pi_D=\Pi_0$ (only fixed-point policies are maximally robust) if $R=\underline{R}$.

Proof 3.6 (Corollary 3.5).

For statement A), let $\overline{R}(\cdot,\cdot,\cdot)=c$ for some constant $c\in\mathbb{R}$. Then $J(\pi)=\mathbb{E}_{x_0\sim\mu_0}[\sum_t\gamma^t\overline{r}_t\mid\pi]=\frac{c\gamma}{1-\gamma}$, which does not depend on the policy $\pi$. Hence, for any noise kernel $T$ and policy $\pi$, $J(\pi)-J(\langle\pi,T\rangle)=0$, which implies $\pi\in\Pi_0$. For statement B), assume there exists $\pi\in\Pi_0$ with $\pi\notin\Pi_T$. Then there exist $x^*\in X$ and $u^*\in U$ such that $\pi(x^*,u^*)>\langle\pi,T\rangle(x^*,u^*)$ (such a pair exists since both are probability distributions over $U$ and they differ at some state). Let $\underline{R}(x,u,x'):=c>0$ if $x=x^*$ and $u=u^*$, and $0$ otherwise. Then $\mathbb{E}[\underline{R}(x^*,\pi(x^*),x')]>\mathbb{E}[\underline{R}(x^*,\langle\pi,T\rangle(x^*),x')]$, and since the MDP is ergodic $x^*$ is visited infinitely often, so $J(\pi)-J(\langle\pi,T\rangle)>0$, which implies $\pi\notin\Pi_0$ and contradicts the assumption. Therefore $\Pi_0\setminus\Pi_T=\emptyset$, i.e., $\Pi_0=\Pi_T$.

We can now summarise the insights from Theorem 3.3 and Corollary 3.5 in the following conclusions: (1) the set $\overline{\Pi}$ is maximally robust, convex and independent of the DOMDP; (2) the set $\Pi_T$ is maximally robust, convex, includes $\overline{\Pi}$, and its properties only depend on $T$; (3) the set $\Pi_D$ includes $\Pi_T$ and is maximally robust, but its properties depend on the DOMDP.

4 Robustness through Lexicographic Objectives

To be able to apply LRL results to our robustness problem we need to first cast robustness as a valid objective to be maximised, and then show that a stochastic gradient descent approach would indeed find a global maximum of this objective, therefore yielding a maximally robust policy. (The advantage of using LRL is that we can formally bound the trade-off between robustness and optimality through $\epsilon$, determining how far we allow the resulting policy to be from an optimal policy in favour of it being more robust.)

Algorithm 1 LRPG
input: simulator, $\tilde{T}$, $\epsilon$
 initialise $\theta$, critic (if using), $\lambda$, $\{\beta_t^1,\beta_t^2,\eta\}$
 set $t=0$, $x_t\sim\mu_0$
while $t<\operatorname{max\_iterations}$ do
  perform $u_t\sim\pi_\theta(x_t)$
  observe $r_t$, $x_{t+1}$; sample $y\sim\tilde{T}(\cdot\mid x_t)$
  if $\hat{K}_1(\theta)$ has not converged then
   $\hat{k}_1\leftarrow\hat{K}_1(\theta)$
  end if
  update critic (if using)
  update $\theta$ using (8) and $\lambda$ using (1)
end while
output: $\theta$
Proposed approach

Following the framework presented in the previous sections, we propose the following approach to obtain lexicographic robustness. In the introduction we emphasised that the motivation for this work comes partially from the fact that we may not know $T$ in reality, or have a way to estimate it; however, the theoretical results so far depend on $T$. Our proposed solution lies in the results of Theorem 3.3. We can use a design generator $\tilde{T}$ to perturb the policy during training such that $\tilde{T}$ has the smallest possible fixed-point set (i.e. the constant policy set: $\tilde{T}$ satisfies $\Pi_{\tilde{T}}=\overline{\Pi}$). Then any algorithm that drives the policy towards the set of fixed points of $\tilde{T}$ will also drive the policy towards fixed points of $T$: from Theorem 3.3, $\Pi_{\tilde{T}}\subseteq\Pi_T$.

4.1 Lexicographically Robust Policy Gradient

Consider then the objective to be minimised:

$$K_{\tilde{T}}(\theta)=\frac{1}{2}\sum_{x\in X}\mu_{\pi_{\theta}}(x)\sum_{u\in U}\big(\pi_{\theta}(x,u)-\langle\pi_{\theta},\tilde{T}\rangle(x,u)\big)^{2}.$$ (6)

Notice that optimising (6) projects the current policy onto the set of fixed points of the operator $\langle\cdot,\tilde{T}\rangle$, and due to Assumption 1, which requires $\mu_{\pi_\theta}(x)>0$ for all $x\in X$, the optimal value is equal to zero if and only if there exists a value of the parameter $\theta$ for which the corresponding $\pi_\theta$ is a fixed point of $\langle\cdot,\tilde{T}\rangle$. We now present the proposed LRPG meta-algorithm, which achieves lexicographic robustness for any policy gradient algorithm of choice. From Skalse et al. (2022b), the convergence of PB-LRL algorithms is guaranteed as long as the original policy gradient algorithm for each single objective converges.
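As an illustration of (6), the snippet below evaluates $K_{\tilde{T}}(\theta)$ for a tabular softmax policy; the parameterisation and the uniform stand-in for $\mu_{\pi_\theta}$ are our assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nU = 6, 4
theta = rng.normal(size=(nX, nU))                 # tabular softmax parameters (illustrative)
T_tilde = rng.dirichlet(np.ones(nX), size=nX)     # design kernel T~[x, y] = T~(y | x)

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def K_Ttilde(theta, mu):
    """Robustness surrogate (6): weighted squared distance between pi and <pi, T~>."""
    pi = softmax_policy(theta)
    return 0.5 * np.sum(mu[:, None] * (pi - T_tilde @ pi) ** 2)

mu = np.ones(nX) / nX       # stand-in for the stationary distribution mu_{pi_theta}
print(K_Ttilde(theta, mu))  # zero iff pi_theta is a fixed point of <., T~>
```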

Assumption 2

The policy is updated through an algorithm (e.g. A2C, PPO, ...) such that $\theta_{t+1}\leftarrow\operatorname{proj}_{\Theta}\big[\theta_t+\alpha_t\nabla_{\theta_t}\hat{K}_1\big]$ converges a.s. to a (local or global) optimum $\theta^*$.

Theorem 4.1.

Consider a DOMDP as in Definition 2.1 and let $\pi_\theta$ be a parameterised policy. Take a design kernel $\tilde{T}\in\{T:\Pi_T=\overline{\Pi}\}$. Consider the following modified gradient for the objective $K_{\tilde{T}}(\theta)(x)$ and a sampled point $y\sim\tilde{T}(\cdot\mid x)$:

$$\nabla_{\theta}\hat{K}_{\tilde{T}}^{\prime}(\theta)=\mathbb{E}_{x\sim\mu_{\pi_{\theta}}}\Big[\sum_{u\in U}\big(\pi_{\theta}(x,u)-\pi_{\theta}(y,u)\big)\nabla_{\theta}\pi_{\theta}(x,u)\Big].$$ (7)

Given an $\epsilon>0$, if Assumptions 1 and 2 hold, then the following iteration (LRPG):

$$\theta\leftarrow\operatorname{proj}_{\Theta}\big[\theta+(\beta_{t}^{1}+\lambda\beta_{t}^{2})\cdot\nabla_{\theta}\hat{K}_{1}(\theta)+\beta_{t}^{2}\,\nabla_{\theta}\hat{K}_{\tilde{T}}^{\prime}(\theta)\big]$$ (8)

converges a.s. to parameters $\theta^{\epsilon}$ that satisfy $\theta^{\epsilon}\in\argmin_{\theta\in\Theta^{\prime}}K_{\tilde{T}}(\theta)$ such that $K_{1}(\theta^{\epsilon})\geq K^{*}_{1}-\epsilon$, where $\Theta^{\prime}=\Theta$ if $\theta^{*}$ is globally optimal and a compact local neighbourhood of $\theta^{*}$ otherwise.

Proof 4.2.

To apply LRL results, we need to show that both gradient schemes converge (separately) to local or global optima. Let us first show that $\theta_{t+1}=\operatorname{proj}_{\Theta}\big[\theta_t-\alpha_t\nabla_{\theta}\hat{K}_{\tilde{T}}^{\prime}(\theta_t)\big]$ converges a.s. to parameters $\tilde{\theta}$ satisfying $K_{\tilde{T}}=0$. We prove this making use of fixed-point iterations with non-expansive operators (specifically, Theorem 4, Section 10.3 in Borkar (2008)). First, observe that for a tabular representation, $\pi_\theta(x,u)=\theta_{xu}$, and $\nabla_\theta\pi_\theta(x,u)$ is a vector of zeros with value $1$ in the position of $\theta_{xu}$. We can then write the SGD in terms of the policy at each state $x$, considering $\pi(x)\equiv(\theta_{xu_1},\theta_{xu_2},\dots,\theta_{xu_k})^{\top}$. Let $y\sim\tilde{T}(\cdot\mid x)$. Then:

$$\pi_{t+1}(x)=\pi_t(x)-\alpha_t\big(\pi_t(x)-\pi_t(y)\big)=\pi_t(x)-\alpha_t\Big(\pi_t(x)-\langle\pi_t,\tilde{T}\rangle(x)-\big(\pi_t(y)-\langle\pi_t,\tilde{T}\rangle(x)\big)\Big).$$

We now need to verify that the conditions for applying Theorem 4, Section 10.3 in Borkar (2008) hold. First, making use of the property $\|\tilde{T}\|_{\infty}=1$ for any row-stochastic matrix $\tilde{T}$, for any two policies $\pi_1,\pi_2\in\Pi$:

$$\|\langle\pi_1,\tilde{T}\rangle-\langle\pi_2,\tilde{T}\rangle\|_{\infty}=\|\tilde{T}\pi_1-\tilde{T}\pi_2\|_{\infty}=\|\tilde{T}(\pi_1-\pi_2)\|_{\infty}\leq\|\tilde{T}\|_{\infty}\|\pi_1-\pi_2\|_{\infty}=\|\pi_1-\pi_2\|_{\infty}.$$

Therefore, the operator $\langle\cdot,\tilde{T}\rangle$ is non-expansive with respect to the sup-norm. For the final condition:

$$\mathbb{E}_{y\sim\tilde{T}(\cdot\mid x)}\big[\pi_t(y)-\langle\pi_t,\tilde{T}\rangle(x)\mid\pi_t,\tilde{T}\big]=\sum_{y\in X}\tilde{T}(y\mid x)\pi_t(y)-\langle\pi_t,\tilde{T}\rangle(x)=0.$$

Therefore, the difference $\pi_t(y)-\langle\pi_t,\tilde{T}\rangle(x)$ is a martingale difference for all $x$. One can then apply Theorem 4, Sec. 10.3 (Borkar, 2008) to conclude that $\pi_t(x)\to\tilde{\pi}(x)$ almost surely. Finally, from Assumption 1, for any policy all states $x\in X$ are visited infinitely often, therefore $\pi_t(x)\to\tilde{\pi}(x)$ for all $x\in X$, i.e. $\pi_t\to\tilde{\pi}$, where $\tilde{\pi}$ satisfies $\langle\tilde{\pi},\tilde{T}\rangle=\tilde{\pi}$ and $K_{\tilde{T}}(\tilde{\pi})=0$.

Now, from Assumption 2, the iteration $\theta\leftarrow\operatorname{proj}_{\Theta}\big[\theta+\alpha_t\nabla_{\theta}\hat{K}_1\big]$ converges a.s. to a (local or global) optimum $\theta^*$. Then both objectives are invex (Ben-Israel and Mond, 1986b), either locally or globally, and any linear combination of them is also invex (again, locally or globally). Finally, we can directly apply the results from Skalse et al. (2022b), and

$$\theta\leftarrow\operatorname{proj}_{\Theta}\big[\theta+(\beta_{t}^{1}+\lambda\beta_{t}^{2})\cdot\nabla_{\theta}\hat{K}_{1}(\theta)+\beta_{t}^{2}\,\nabla_{\theta}\hat{K}_{\tilde{T}}^{\prime}(\theta)\big]$$

converges a.s. to parameters $\theta^{\epsilon}$ that satisfy $\theta^{\epsilon}\in\argmin_{\theta\in\Theta^{\prime}}K_{\tilde{T}}(\theta)$ such that $K_1(\theta^{\epsilon})\geq K_1^{*}-\epsilon$, where $\Theta^{\prime}=\Theta$ if $\theta^{*}$ is globally optimal and a compact local neighbourhood of $\theta^{*}$ otherwise.
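The tabular iteration used in the proof can also be simulated directly; the sketch below runs $\pi_{t+1}(x)=\pi_t(x)-\alpha_t(\pi_t(x)-\pi_t(y))$ with $y\sim\tilde{T}(\cdot\mid x)$ and a dense (hence irreducible) design kernel, and the iterates approach a constant policy, i.e. a fixed point of $\langle\cdot,\tilde{T}\rangle$. The specific step-size schedule is an illustrative choice satisfying the standard conditions.

```python
import numpy as np

rng = np.random.default_rng(4)
nX, nU = 6, 3
T_tilde = rng.dirichlet(np.ones(nX), size=nX)   # dense, hence irreducible, design kernel
pi = rng.dirichlet(np.ones(nU), size=nX)        # tabular policy pi[x, u]

for t in range(1, 100001):
    alpha = 1.0 / t ** 0.6                      # satisfies sum alpha = inf, sum alpha^2 < inf
    x = rng.integers(nX)
    y = rng.choice(nX, p=T_tilde[x])            # disturbed state y ~ T~(. | x)
    pi[x] -= alpha * (pi[x] - pi[y])            # pi_{t+1}(x) = pi_t(x) - alpha_t (pi_t(x) - pi_t(y))

# The fixed-point residual shrinks as t grows; for this T~ the rows become (nearly) identical,
# i.e. the iterate approaches a constant policy, the only fixed points of <., T~>.
print(np.max(np.abs(pi - T_tilde @ pi)))
```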

Remark 4.3.

Observe that (7) is not the true gradient of (6), and $\theta^{\epsilon}\in\argmin_{\theta\in\Theta^{\prime}}K_{\tilde{T}}(\theta)$ holds if there exists a (local) minimum of $K_{\tilde{T}}$ in $\Theta^{\epsilon}:=\{\theta: K_1(\theta)\geq K_1^{*}-\epsilon\}$. However, from Theorem 4.1 we know that the (pseudo-)gradient descent scheme converges to a global minimum in the tabular case, therefore $\langle\nabla_{\theta}\hat{K}_{\tilde{T}}^{\prime}(\theta),\nabla_{\theta}\hat{K}_{\tilde{T}}(\theta)\rangle<0$ (Borkar, 2008), and gradient-like descent schemes will converge to (local or) global minimisers, which motivates the choice of this gradient approximation.

We reflect again on Figure 1. The main idea behind LRPG is that, by formally expanding the set of acceptable policies with respect to $K_1$, we may find robust policies more effectively while guaranteeing a minimum performance in terms of expected rewards. This directly addresses the premise behind Problem 2.3. In LRPG the first objective is still to minimise the distance $J^{*}-J(\pi)$ up to some tolerance. Then, from the policies that satisfy this constraint, we want to steer the learning algorithm towards a maximally robust policy, and we can do so without knowing $T$.
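A compact sketch of how the LRPG iteration can be assembled is given below for a tabular softmax policy: `grad_K1` is a placeholder for whichever policy gradient algorithm is being robustified, the pseudo-gradient follows (7), and the parameters ascend the reward objective while descending the robustness surrogate, with the $\lambda$ update of (1). All names and constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch of the LRPG iteration (7)-(8) for a tabular softmax policy. grad_K1 stands in
# for the chosen algorithm's gradient estimate of K1 = J(theta) (A2C, PPO, ...).
rng = np.random.default_rng(3)
nX, nU = 5, 3
theta = rng.normal(size=(nX, nU))
T_tilde = rng.dirichlet(np.ones(nX), size=nX)       # design kernel T~[x, y] = T~(y | x)
beta1, beta2, eta, eps = 1e-2, 1e-3, 1e-3, 0.1
lam, k1_hat = 0.0, -np.inf

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def grad_K1(theta):
    return rng.normal(scale=1e-2, size=theta.shape)  # stand-in reward-gradient estimate

def grad_robust(theta, x):
    # Pseudo-gradient (7) at state x with a sampled disturbed state y ~ T~(. | x).
    pi = policy(theta)
    y = rng.choice(nX, p=T_tilde[x])
    d = pi[x] - pi[y]                                # pi_theta(x,.) - pi_theta(y,.)
    g = np.zeros_like(theta)
    # softmax Jacobian: d pi(x,u) / d theta_{x,a} = pi(x,u) (1{u=a} - pi(x,a))
    g[x] = pi[x] * d - pi[x] * (pi[x] @ d)
    return g

for t in range(2000):
    x = rng.integers(nX)                             # stand-in for a state visited at time t
    K1_est = 0.0                                     # stand-in for an estimate of K1(theta)
    k1_hat = max(k1_hat, K1_est)
    # ascend the reward objective, descend the robustness surrogate (cf. (7)-(8))
    theta += (beta1 + lam * beta2) * grad_K1(theta) - beta2 * grad_robust(theta, x)
    lam = max(0.0, lam + eta * (k1_hat - eps - K1_est))

print(np.max(np.abs(policy(theta) - T_tilde @ policy(theta))))   # robustness residual
```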

5 Considerations on Noise Generators

A natural question that emerges is how to choose $\tilde{T}$, and how this choice influences the robustness of the resulting policy towards any other true $T$. In general, for an arbitrary policy-utility landscape in a given MDP, there is no way of bounding the distance between the resulting policies for two different noise kernels $T_1,T_2$. However, the optimality of the policy remains bounded: through the LRPG guarantees we know that, in both cases, the utility of the resulting policy will be at most $\epsilon$ away from the optimal one.

Corollary 5.1.

Take $T$ to be any arbitrary noise kernel, and let $\tilde{T}$ satisfy $\tilde{T}\in\{T:\Pi_T=\overline{\Pi}\}$. Let $\pi$ be a policy resulting from an LRPG algorithm. Assume that $\min_{\pi'\in\Pi_{\tilde{T}}}D_{TV}(\pi\|\pi')=a$ for some $a<1$. Then it holds for any $T$ that $\min_{\pi'\in\Pi_T}D_{TV}(\pi\|\pi')\leq a$.

Proof 5.2.

The proof follows from the inclusion results in Theorem 3.3. If $\Pi_{\tilde{T}}=\overline{\Pi}$, then $\Pi_{\tilde{T}}\subseteq\Pi_T$ for any other $T$. Then the distance from $\pi$ to the set $\Pi_T$ is at most the distance to $\Pi_{\tilde{T}}$.

That is, when using LRPG to obtain a robust policy $\pi$, the resulting policy is at most $a$ away from the set of fixed points (and therefore from a maximally robust policy) with respect to the true $T$. This is the key argument behind our choices for $\tilde{T}$: a priori, the most sensible choice is a kernel whose only fixed points are the constant policies. This fixed-point condition is satisfied in the discrete-state case by any $\tilde{T}$ that induces an irreducible Markov Chain, and in the continuous-state case by any $\tilde{T}$ that satisfies a reachability condition (i.e. for any $x_0\in X$ there exists a finite time at which the probability of reaching any ball $B\subset X$ of radius $r>0$ through a sequence $x_{t+1}=T(x_t)$ is positive). This holds for (additive) uniform or Gaussian disturbances.
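Whether a candidate design kernel has only constant fixed points can be checked numerically via the algebraic characterisation mentioned in Section 3 (the null space of $\mathrm{Id}(\cdot)-\langle\cdot,\tilde{T}\rangle$): the fixed points of $\langle\cdot,\tilde{T}\rangle$ reduce to the constant policies exactly when 1 is a simple eigenvalue of $\tilde{T}$. A minimal check with illustrative kernels:

```python
import numpy as np

def only_constant_fixed_points(T_tilde, tol=1e-8):
    """Numerically check that the only solutions of T~ @ pi = pi are constant policies.

    Column-wise, pi is a fixed point of <., T~> iff every column of pi lies in the
    null space of (T~ - I); that null space is spanned by the all-ones vector exactly
    when 1 is a simple eigenvalue of T~ (e.g. when T~ is irreducible).
    """
    eigvals = np.linalg.eigvals(T_tilde)
    return int(np.sum(np.abs(eigvals - 1.0) < tol)) == 1

rng = np.random.default_rng(5)
n = 6
T_dense = rng.dirichlet(np.ones(n), size=n)   # strictly positive kernel: a natural design choice
T_id = np.eye(n)                              # identity kernel: every policy is a fixed point
print(only_constant_fixed_points(T_dense))    # True
print(only_constant_fixed_points(T_id))       # False
```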

6 Experiments

We verify the theoretical results of LRPG in a series of experiments on discrete state/action safety-related environments (Chevalier-Boisvert et al., 2018) (for extended experiments in continuous control tasks, hyperparameters, etc., see the extended version at https://arxiv.org/abs/2209.15320). We use A2C (Sutton and Barto, 2018) (LR-A2C) and PPO (Schulman et al., 2017) (LR-PPO) for our implementations of LRPG. In all cases, the lexicographic tolerance was set to $\epsilon=0.99\hat{k}_1$ to deviate as little as possible from the primary objective. We compare against the baseline algorithms and against SA-PPO (Zhang et al., 2020), which is among the most effective (adversarial) robust RL approaches in the literature. We trained 10 independent agents for each algorithm, and report the scores of the median agent (as in Zhang et al. (2020)) over 50 roll-outs. To simulate $\tilde{T}$ we disturb $x$ as $\tilde{x}=x+\xi$ with (1) a uniform bounded noise signal $\xi\sim\mathcal{U}_{[-b,b]}$ ($\tilde{T}^u$) and (2) a Gaussian noise ($\tilde{T}^g$) with $\xi\sim\mathcal{N}(0,0.5)$. We test the resulting policies against a noiseless environment ($\emptyset$), a kernel $T_1=\tilde{T}^u$, a kernel $T_2=\tilde{T}^g$, and against two different state-adversarial noise configurations ($T^2_{adv}$) as proposed by Zhang et al. (2021), to evaluate how effective LRPG is at rejecting adversarial disturbances.
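A minimal sketch of how such observational disturbances can be injected at roll-out time is given below, using the standard Gym(nasium) ObservationWrapper interface for vector-valued observations (for image or dictionary observations, such as MiniGrid's, the noise would be applied to the relevant field). The class name, noise parameters and the commented environment id are illustrative assumptions, not the exact experimental configuration.

```python
import numpy as np
import gymnasium as gym

class NoisyObservation(gym.ObservationWrapper):
    """Simulate a design kernel by disturbing vector observations as x~ = x + xi."""

    def __init__(self, env, kind="uniform", b=0.1, sigma=0.5):
        super().__init__(env)
        self.kind, self.b, self.sigma = kind, b, sigma

    def observation(self, obs):
        obs = np.asarray(obs, dtype=np.float32)
        if self.kind == "uniform":                  # xi ~ U[-b, b]      (T~^u)
            xi = np.random.uniform(-self.b, self.b, size=obs.shape)
        else:                                       # xi ~ N(0, sigma)   (T~^g)
            xi = np.random.normal(0.0, self.sigma, size=obs.shape)
        return obs + xi

# Illustrative usage (environment id is an example, not the paper's exact setup):
# env = NoisyObservation(gym.make("CartPole-v1"), kind="gaussian", sigma=0.5)
```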

Robustness Results

We use objectives as defined in (6). Additionally, we aim to test the following hypothesis: if we have an estimator for the critic $Q^\pi$, we can obtain robustness without inducing regularity in the policy by using $D^\pi$, yielding a larger policy subspace to steer towards and, hopefully, policies closer to optimal. For this, we consider the objective $K_D(\theta)(x):=\frac{1}{2}\|D^{\pi_\theta}(x,T)\|_2^2$ by modifying A2C to retain a Q critic. We investigate the impact of LRPG on PPO and A2C for discrete action-space problems on Gymnasium (Brockman et al., 2016). Minigrid-LavaGap (fully observable) and Minigrid-LavaCrossing (partially observable) are safe-exploration tasks where the agent needs to navigate an environment with cliff-like regions; Minigrid-DynamicObstacles (stochastic, partially observable) is a dynamic obstacle-avoidance environment. See Table 1.

PPO on MiniGrid Environments (left block) and A2C on MiniGrid Environments (right block)

Noise      | PPO         | LR-PPO($K_T^u$) | LR-PPO($K_T^g$) | SA-PPO      | A2C         | LR-A2C($K_T^u$) | LR-A2C($K_T^g$) | LR-A2C($K_D$)
LavaGap
$\emptyset$   | 0.95±0.003  | 0.95±0.075      | 0.95±0.101      | 0.94±0.068  | 0.94±0.004  | 0.94±0.005      | 0.94±0.003      | 0.94±0.006
$T_1$         | 0.80±0.041  | 0.95±0.078      | 0.93±0.124      | 0.88±0.064  | 0.83±0.061  | 0.93±0.019      | 0.89±0.032      | 0.91±0.088
$T_2$         | 0.92±0.015  | 0.95±0.052      | 0.95±0.094      | 0.93±0.050  | 0.89±0.029  | 0.94±0.008      | 0.93±0.011      | 0.93±0.021
$T_{adv}^2$   | 0.01±0.051  | 0.71±0.251      | 0.21±0.357      | 0.87±0.116  | 0.27±0.119  | 0.79±0.069      | 0.68±0.127      | 0.56±0.249
LavaCrossing
$\emptyset$   | 0.95±0.023  | 0.93±0.050      | 0.93±0.018      | 0.88±0.091  | 0.91±0.024  | 0.91±0.063      | 0.90±0.017      | 0.92±0.034
$T_1$         | 0.50±0.110  | 0.92±0.053      | 0.89±0.029      | 0.64±0.109  | 0.66±0.071  | 0.78±0.111      | 0.72±0.073      | 0.76±0.098
$T_2$         | 0.84±0.061  | 0.92±0.050      | 0.92±0.021      | 0.85±0.094  | 0.78±0.054  | 0.83±0.105      | 0.86±0.029      | 0.87±0.063
$T_{adv}^2$   | 0.0±0.004   | 0.50±0.171      | 0.38±0.020      | 0.82±0.072  | 0.06±0.056  | 0.04±0.030      | 0.01±0.008      | 0.09±0.060
DynamicObstacles
$\emptyset$   | 0.91±0.002  | 0.91±0.008      | 0.91±0.007      | 0.91±0.131  | 0.91±0.011  | 0.88±0.020      | 0.89±0.009      | 0.91±0.013
$T_1$         | 0.23±0.201  | 0.77±0.102      | 0.61±0.119      | 0.45±0.188  | 0.27±0.104  | 0.43±0.108      | 0.45±0.162      | 0.56±0.270
$T_2$         | 0.50±0.117  | 0.75±0.075      | 0.70±0.072      | 0.68±0.490  | 0.45±0.086  | 0.53±0.109      | 0.52±0.161      | 0.67±0.203
$T_{adv}^2$   | -0.49±0.312 | 0.51±0.234      | 0.33±0.202      | 0.55±0.170  | -0.54±0.209 | -0.21±0.192     | -0.53±0.261     | -0.51±0.260

Table 1: Reward values gained by LRPG and baselines on discrete control tasks.
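For completeness, the alternative objective $K_D$ used in the LR-A2C($K_D$) column can be sketched as follows, assuming a learned Q critic is available; all arrays below are illustrative stand-ins rather than the trained critic and policy.

```python
import numpy as np

rng = np.random.default_rng(6)
nX, nU = 6, 4
Q = rng.normal(size=(nX, nU))                    # stand-in for a learned Q^pi critic
pi = rng.dirichlet(np.ones(nU), size=nX)         # current policy pi[x, u]
T_tilde = rng.dirichlet(np.ones(nX), size=nX)    # design kernel

V = np.einsum('xu,xu->x', pi, Q)                 # V^pi(x) = E_{u ~ pi(x)}[Q^pi(x, u)]
D = V - np.einsum('xu,xu->x', T_tilde @ pi, Q)   # noise disadvantage D^pi(x, T~), cf. (2)
K_D = 0.5 * D ** 2                               # per-state objective K_D(theta)(x), driven to zero
print(K_D)
```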

7 Discussion

Experiments

We applied LRPG to PPO and A2C (and SAC), on a set of discrete and continuous control environments. These environments are particularly sensitive to robustness problems: the rewards are sparse, and applying a sub-optimal action at any step of the trajectory often leads to terminal states with zero (or negative) reward. LRPG successfully induces lower robustness regrets in the tested scenarios, and the use of $K_D$ as an objective (even though we did not prove the convergence of a gradient-based method with this objective) yields a better compromise between robustness and rewards. When compared to recent observational robustness methods, LRPG obtains similar robustness results while preserving the original guarantees of the chosen algorithm.

Shortcomings and Contributions

The motivation for LRPG comes from situations where, when deploying a model-free controller in a dynamical system, we do not have a way of estimating the noise generation and we are required to retain convergence guarantees of the algorithms used. Although LRPG is a useful approach for learning policies in control problems where the noise sources are unknown, questions emerge on whether there are more effective methods of incorporating robustness into RL policies when guarantees are not needed. Specifically, since a completely model-free approach does not allow for simple alternative solutions such as filtering or disturbance rejection, there are reasons to believe it could be outperformed by model-based (or model learning) approaches. However, we argue that in completely model-free settings, LRPG provides a rational strategy to robustify RL agents.

References

  • Abdullah et al. (2019) Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, and Jun Wang. Wasserstein robust reinforcement learning. arXiv preprint arXiv:1907.13196, 2019.
  • Agarwal et al. (2021) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1–76, 2021.
  • Behzadan and Munir (2017) Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.
  • Ben-Israel and Mond (1986a) A. Ben-Israel and B. Mond. What is invexity? The Journal of the Australian Mathematical Society. Series B. Applied Mathematics, 28(1):1–9, 1986a.
  • Ben-Israel and Mond (1986b) Adi Ben-Israel and Bertram Mond. What is invexity? The ANZIAM Journal, 28(1):1–9, 1986b.
  • Bertsekas (1999) Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • Bertsekas (1997) Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • Bhandari and Russo (2019) Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
  • Borkar (2008) Vivek S. Borkar. Stochastic Approximation. Hindustan Book Agency, 2008.
  • Borkar and Soumyanatha (1997) V.S. Borkar and K. Soumyanatha. An analog scheme for fixed point computation. i. theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 44(4):351–355, 1997. 10.1109/81.563625.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Chevalier-Boisvert et al. (2018) Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
  • Fu et al. (2018) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
  • Gleave et al. (2020) Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. In International Conference on Learning Representations, 2020.
  • Hanson (1981) Morgan A Hanson. On sufficiency of the kuhn-tucker conditions. Journal of Mathematical Analysis and Applications, 80(2):545–550, 1981.
  • Heger (1994) Matthias Heger. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pages 105–111. Elsevier, 1994.
  • Horn and Johnson (2012) Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
  • Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  • Isermann (1982) H Isermann. Linear lexicographic optimization. Operations-Research-Spektrum, 4(4):223–228, 1982.
  • Klima et al. (2019) Richard Klima, Daan Bloembergen, Michael Kaisers, and Karl Tuyls. Robust temporal difference learning for critical domains. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, page 350–358, Richland, SC, 2019. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450363099.
  • Korkmaz (2023) Ezgi Korkmaz. Adversarial robust deep reinforcement learning requires redefining robustness. arXiv preprint arXiv:2301.07487, 2023.
  • Kos and Song (2017) Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
  • Liang et al. (2022) Yongyuan Liang, Yanchao Sun, Ruijie Zheng, and Furong Huang. Efficient adversarial training without attacking: Worst-case-aware robust reinforcement learning. Advances in Neural Information Processing Systems, 35:22547–22561, 2022.
  • Lin et al. (2017) Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3756–3762, 2017.
  • Mandlekar et al. (2017) Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3932–3939. IEEE, 2017.
  • Moos et al. (2022) Janosch Moos, Kay Hansel, Hany Abdulsamad, Svenja Stark, Debora Clever, and Jan Peters. Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315, 2022.
  • Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. of the Sixteenth International Conference on Machine Learning, 1999, 1999.
  • Pan et al. (2019) Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.
  • Paternain et al. (2019) Santiago Paternain, Luiz F. O. Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 7553–7563, 2019.
  • Pattanaik et al. (2018) Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, page 2040–2042, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems.
  • Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315. PMLR, 2013.
  • Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.
  • Rentmeesters et al. (1996) Mark J Rentmeesters, Wei K Tsai, and Kwei-Jay Lin. A theory of lexicographic multi-criteria optimization. In Proceedings of ICECCS’96: 2nd IEEE International Conference on Engineering of Complex Computer Systems (held jointly with 6th CSESAW and 4th IEEE RTAW), pages 76–79. IEEE, 1996.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Skalse et al. (2022a) Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. arXiv preprint arXiv:2203.07475, 2022a.
  • Skalse et al. (2022b) Joar Skalse, Lewis Hammond, Charlie Griffin, and Alessandro Abate. Lexicographic multi-objective reinforcement learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 3430–3436, jul 2022b. 10.24963/ijcai.2022/476.
  • Slater (1950) Morton Slater. Lagrange multipliers revisited. Cowles Commission Discussion Paper No. 403, 1950.
  • Spaan (2012) Matthijs TJ Spaan. Partially observable markov decision processes. In Reinforcement Learning, pages 387–414. Springer, 2012.
  • Spaan and Vlassis (2004) Matthijs TJ Spaan and N Vlassis. A point-based pomdp algorithm for robot planning. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 3, pages 2399–2404. IEEE, 2004.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tan et al. (2020) Kai Liang Tan, Yasaman Esfandiari, Xian Yeow Lee, Soumik Sarkar, et al. Robustifying reinforcement learning agents via action space adversarial training. In 2020 American control conference (ACC), pages 3959–3964. IEEE, 2020.
  • Tessler et al. (2019) Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215–6224. PMLR, 2019.
  • van der Pol et al. (2020) Elise van der Pol, Thomas Kipf, Frans A. Oliehoek, and Max Welling. Plannable approximations to mdp homomorphisms: Equivariance under actions. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, page 1431–1439, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184.
  • Van Parys et al. (2015) Bart PG Van Parys, Daniel Kuhn, Paul J Goulart, and Manfred Morari. Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control, 61(2):430–442, 2015.
  • Wiesemann et al. (2014) Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.
  • Xu and Mannor (2006) Huan Xu and Shie Mannor. The robustness-performance tradeoff in markov decision processes. Advances in Neural Information Processing Systems, 19, 2006.
  • Zhang et al. (2020) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037, 2020.
  • Zhang et al. (2021) Huan Zhang, Hongge Chen, Duane Boning, and Cho-Jui Hsieh. Robust reinforcement learning on state observations with learned optimal adversary. In International Conference on Learning Representation (ICLR), 2021.
  • Zhang (2018) Shangtong Zhang. Modularized implementation of deep rl algorithms in pytorch. https://github.com/ShangtongZhang/DeepRL, 2018.

Appendix A Examples and Further Considerations

We provide here two examples showing how we can obtain the limit scenarios $\Pi_0=\Pi$ (any policy is maximally robust) or $\Pi_0=\Pi_T$ (Example 1), and how for some MDPs the third inclusion in Theorem 3.3 is strict (Example 2).

Example 1

Consider the simple MDP in Figure 2. First, consider the reward function $R_1(x_1,\cdot,\cdot)=10$, $R_1(x_2,\cdot,\cdot)=0$. This produces a "dummy" MDP where all policies have the same reward sum. Then, for all $T,\pi$, $V^{\langle\pi,T\rangle}=V^{\pi}$, and therefore we have $\Pi_D=\Pi_0=\Pi$.

Figure 2: Example MDP. Values in brackets represent $\{P(\cdot,u_1,\cdot),P(\cdot,u_2,\cdot)\}$.

Now, consider the reward function $R_2(x_1,u_1,\cdot)=10$ and $R_2(\cdot,\cdot,\cdot)=0$ elsewhere. Take a non-constant policy $\pi$, i.e., $\pi(x_1)\neq\pi(x_2)$. In the example DOMDP (assuming the initial state is drawn uniformly from $X_0=\{x_1,x_2\}$) one can show that at any time in the trajectory there is a stationary probability $\Pr\{x_t=x_1\}=\frac{1}{2}$. Let us abuse notation and write $\pi(x_i)=(\pi(x_i,u_1)\ \ \pi(x_i,u_2))^{\top}$ and $R(x_i)=(R(x_i,u_1,\cdot)\ \ R(x_i,u_2,\cdot))^{\top}$. For the given reward structure we have $R(x_2)=(0\ \ 0)^{\top}$, and therefore:

$$J(\pi)=\mathbb{E}_{x_0\sim\mu_0}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\Big]=\frac{1}{2}\langle R(x_1),\pi(x_1)\rangle\frac{\gamma}{1-\gamma}.$$ (9)

Since the transitions of the MDP are independent of the actions, following the same principle as in (9): $J(\langle\pi,T\rangle)=\frac{1}{2}\langle R(x_1),\langle\pi,T\rangle(x_1)\rangle\frac{\gamma}{1-\gamma}$. For any noise map $\langle\cdot,T\rangle\neq\operatorname{Id}$, for the two-state policy it holds that $\pi\notin\Pi_T\implies\langle\pi,T\rangle\neq\pi$. Therefore $\langle\pi,T\rangle(x_1)\neq\pi(x_1)$ and:

$$J(\pi)-J(\langle\pi,T\rangle)=\frac{5\gamma}{1-\gamma}\cdot\big(\pi(x_1,u_1)-\langle\pi,T\rangle(x_1,u_1)\big)\neq 0,$$

which implies that $\pi\notin\Pi_0$.

Example 2

Consider the same MDP in Figure 2 with reward function $R(x_1,u_1,\cdot)=R(x_2,u_1,\cdot)=10$, and a reward of zero for all other transitions. Take a policy $\pi(x_1)=(1\ \ 0)$, $\pi(x_2)=(0\ \ 1)$. The policy yields a reward of $10$ in state $x_1$ and a reward of $0$ in state $x_2$. Again we assume the initial state is drawn uniformly from $X_0=\{x_1,x_2\}$. Then, observe:

$$J(\pi)=\mathbb{E}_{x_0\sim\mu_0}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\Big]=\frac{1}{2}\langle R(x_1),\pi(x_1)\rangle\frac{\gamma}{1-\gamma}=\frac{1}{2}\frac{10\gamma}{1-\gamma}=\frac{5\gamma}{1-\gamma}.$$

Now define the noise map $T(\cdot\mid x_1)=(\frac{1}{2}\ \ \frac{1}{2})$ and $T(\cdot\mid x_2)=(\frac{1}{2}\ \ \frac{1}{2})$. Observe that this noise map yields a policy with non-zero disadvantage, $D^{\pi}(x_1,T)=\frac{5\gamma}{1-\gamma}-\big(\frac{5\gamma}{1-\gamma}-2.5\big)=2.5$ and similarly $D^{\pi}(x_2,T)=-2.5$, therefore $\pi\notin\Pi_D$. However, the policy is maximally robust:

$$J(\langle\pi,T\rangle)=\frac{1}{2}\langle R(x_1),\langle\pi,T\rangle(x_1)\rangle\frac{\gamma}{1-\gamma}+\frac{1}{2}\langle R(x_2),\langle\pi,T\rangle(x_2)\rangle\frac{\gamma}{1-\gamma}=\frac{1}{2}\frac{\gamma}{1-\gamma}\big(5+5\big)=\frac{5\gamma}{1-\gamma}.$$ (10)

Therefore, $\pi\in\Pi_{0}$.
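The computations in Examples 1 and 2 can be checked numerically. The following minimal Python sketch (the discount factor $\gamma=0.9$ is chosen only for illustration and is not specified in the examples) evaluates the closed-form return of Eqs. (9)–(10) for the Example 2 reward and the uniform noise kernel, confirming that $J(\pi)=J(\langle\pi,T\rangle)$ even though $\pi\notin\Pi_{D}$:

import numpy as np

gamma = 0.9                                   # illustrative discount factor
R = np.array([[10.0, 0.0],                    # R(x1,u1), R(x1,u2)  (Example 2 reward)
              [10.0, 0.0]])                   # R(x2,u1), R(x2,u2)
pi = np.array([[1.0, 0.0],                    # pi(x1)
               [0.0, 1.0]])                   # pi(x2)
T = np.array([[0.5, 0.5],                     # T(. | x1)
              [0.5, 0.5]])                    # T(. | x2)

def disturbed(pi, T):
    # <pi, T>(x) = sum_y T(y | x) pi(y)
    return T @ pi

def J(pi):
    # Closed-form return of Eqs. (9)-(10): uniform stationary distribution over {x1, x2}
    return 0.5 * (gamma / (1 - gamma)) * sum(R[x] @ pi[x] for x in range(2))

print(J(pi), J(disturbed(pi, T)))             # both equal 5*gamma/(1-gamma) = 45.0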

Appendix B Theoretical Results

B.1 Auxiliary Results

Theorem B.1 (Stochastic Approximation with Non-Expansive Operator).

Let $\{\xi_{t}\}$ be a random sequence with $\xi_{t}\in\mathbb{R}^{n}$ defined by the iteration:

\[
\xi_{t+1}=\xi_{t}+\alpha_{t}\big(F(\xi_{t})-\xi_{t}+M_{t+1}\big),
\]

where:

  1. The step sizes $\alpha_{t}$ satisfy standard learning rate assumptions.

  2. $F:\mathbb{R}^{n}\mapsto\mathbb{R}^{n}$ is a $\|\cdot\|_{\infty}$ non-expansive map; that is, for any $\xi_{1},\xi_{2}\in\mathbb{R}^{n}$, $\|F(\xi_{1})-F(\xi_{2})\|_{\infty}\leq\|\xi_{1}-\xi_{2}\|_{\infty}$.

  3. $\{M_{t}\}$ is a martingale difference sequence with respect to the increasing family of $\sigma$-fields $\mathcal{F}_{t}:=\sigma(\xi_{0},M_{0},\xi_{1},M_{1},\dots,\xi_{t},M_{t})$.

Then, $\xi_{t}\to\xi^{*}$ almost surely, where $\xi^{*}$ is a fixed point such that $F(\xi^{*})=\xi^{*}$.

Proof B.2.

See Borkar and Soumyanatha (1997).
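Although the proof is in Borkar and Soumyanatha (1997), the behaviour described by Theorem B.1 is easy to illustrate numerically. The Python sketch below uses a $\gamma$-contractive affine map (chosen only because a contraction is in particular non-expansive) and zero-mean Gaussian noise as the martingale difference term, and compares the iterate to the analytic fixed point; all numerical values are placeholders for illustration:

import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # a row-stochastic matrix
r = rng.random(n)

def F(xi):
    # A discounted Bellman-evaluation-style operator: a gamma-contraction in the sup norm
    # (hence in particular non-expansive), with unique fixed point xi* = (I - gamma*P)^{-1} r.
    return r + gamma * P @ xi

xi = np.zeros(n)
for t in range(1, 50001):
    alpha = t ** -0.6                        # decaying step sizes
    M = rng.normal(scale=0.1, size=n)        # zero-mean (martingale difference) noise
    xi = xi + alpha * (F(xi) - xi + M)

xi_star = np.linalg.solve(np.eye(n) - gamma * P, r)
print(np.linalg.norm(xi - xi_star, ord=np.inf))   # small: the iterate approaches xi*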

Theorem B.3 (PB-LRL Convergence with 2 Objectives (Skalse et al., 2022b)).

Let $\mathcal{M}$ be a multi-objective MDP with objectives $K_{i}$, $i\in\{1,2\}$. Assume a policy $\pi$ is twice differentiable in its parameters $\theta$, and, if using a critic $V_{i}$, assume it is continuously differentiable in its parameters $w_{i}$. Choose a tolerance $\epsilon$, and suppose that if PB-LRL is run for $T$ steps, there exists some limit point $w_{i}\to w_{i}^{*}(\theta)$ when $\theta$ is held fixed. If for both objectives there exists a gradient descent scheme such that $\lim_{T\to\infty}\mathbb{E}_{t}[\theta_{t}]\in\Theta_{i}^{\epsilon}$, then combining the objectives as in \eqref{eq:LRLupdates} yields $\lim_{T\to\infty}\mathbb{E}_{t}[\theta_{t}]\in\{\argmax_{\theta}K_{2}(\theta)\ \text{s.t.}\ K_{1}(\theta)\geq K_{1}(\theta^{\prime})-\epsilon\}$.

Proof B.4 (Proof Sketch).

We refer the interested reader to Skalse et al. (2022b) for a full proof, and here attempt to provide the intuition behind the result in the form of a proof sketch.

Let us begin by briefly recalling the general problem statement: we wish to take a multi-objective MDP $\mathcal{M}$ with $m$ objectives, and obtain a lexicographically optimal policy (one that optimises the first objective, and then subject to this optimises the second objective, and so on). More precisely, for a policy $\pi$ parameterised by $\theta$, we say that $\pi$ is (globally) lexicographically $\epsilon$-optimal if $\theta\in\Theta^{\epsilon}_{m}$, where $\Theta^{\epsilon}_{0}=\Theta$ is the set of all policies in $\mathcal{M}$, $\Theta^{\epsilon}_{i+1}:=\{\theta\in\Theta^{\epsilon}_{i}\mid\max_{\theta^{\prime}\in\Theta^{\epsilon}_{i}}K_{i}(\theta^{\prime})-K_{i}(\theta)\leq\epsilon_{i}\}$, and $\mathbb{R}^{m-1}\ni\epsilon\succcurlyeq 0$. (The proof in Skalse et al. (2022b) also considers local lexicographic optima, though for the sake of simplicity we do not do so here.)

The basic idea behind policy-based lexicographic reinforcement learning (PB-LRL) is to use a multi-timescale approach: first optimise $\theta$ using $K_{1}$, then at a slower timescale optimise $\theta$ using $K_{2}$ while adding the condition that the loss with respect to $K_{1}$ remains bounded by its current value, and so on. This sequence of constrained optimisation problems can be solved using a Lagrangian relaxation (Bertsekas, 1999), either in series or – via a judicious choice of learning rates – simultaneously, by exploiting a separation in timescales (Borkar, 2008). In the simultaneous case, the parameters $w_{i}$ of the critic for each objective (if using an actor–critic algorithm; if not, this part of the argument may be safely ignored) are updated on the fastest timescale, then the parameters $\theta$, and finally (i.e., most slowly) the Lagrange multipliers for each of the remaining constraints.

The proof proceeds via induction on the number of objectives, using a standard stochastic approximation argument (Borkar, 2008). In particular, due to the learning rates chosen, we may consider the more slowly updated parameters as fixed for the purposes of analysing the convergence of the more quickly updated parameters. In the base case where $m=1$, we have (by assumption) that $\lim_{T\to\infty}\mathbb{E}_{t}[\theta]\in\Theta_{1}^{\epsilon}$. This is simply the standard (non-lexicographic) RL setting. Before continuing to the inductive step, Skalse et al. (2022b) observe that because gradient descent on $K_{1}$ converges to a globally optimal stationary point when $m=1$, $K_{1}$ must be globally invex (the opposite implication is also true) (Ben-Israel and Mond, 1986a). (A differentiable function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is (globally) invex if and only if there exists a function $g:\mathbb{R}^{n}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ such that $f(x_{1})-f(x_{2})\geq g(x_{1},x_{2})^{\top}\nabla f(x_{2})$ for all $x_{1},x_{2}\in\mathbb{R}^{n}$ (Hanson, 1981).)

The reason this observation is useful is that, because each of the objectives $K_{i}$ shares the same functional form, they are all invex; furthermore, invexity is preserved under linear combinations and the addition of scalars, meaning that the Lagrangian formed in the relaxation of each constrained optimisation problem is also invex. As a result, if we assume that $\lim_{T\to\infty}\mathbb{E}_{t}[\theta]\in\Theta_{i}^{\epsilon}$ as our inductive hypothesis, then the stationary point of the Lagrangian for optimising objective $K_{i+1}$ is a global optimum, subject to the constraints that it does not worsen performance on $K_{1},\ldots,K_{i}$. Via Slater’s condition (Slater, 1950) and standard saddle-point arguments (Bertsekas, 1999; Paternain et al., 2019), we therefore have that $\lim_{T\to\infty}\mathbb{E}_{t}[\theta]\in\Theta_{i+1}^{\epsilon}$, completing the inductive step, and thus the overall inductive argument.

This concludes the proof that $\lim_{T\to\infty}\mathbb{E}_{t}[\theta]\in\Theta_{m}^{\epsilon}$. We refer the reader to Skalse et al. (2022b) for a discussion of the error $\epsilon$; intuitively, it corresponds to a combination of the representational power of $\theta$, that of the critic parameters $w_{i}$ (if used), and the duality gap due to the Lagrangian relaxation (Paternain et al., 2019). In cases where the representational power of the various parameters is sufficiently high, it can be shown that $\epsilon=0$.
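To make the "in series" Lagrangian-relaxation argument above concrete, the following self-contained Python sketch applies it to a toy two-objective problem. The objectives $K_{1},K_{2}$, step sizes and tolerance are placeholders chosen only for illustration; this is the general principle, not the actual PB-LRL updates of \eqref{eq:LRLupdates}.

# K1 depends only on th1 (maximised at th1 = 0); K2 prefers (th1, th2) = (1, 1).
K1 = lambda th1, th2: -th1 ** 2
K2 = lambda th1, th2: -(th1 - 1) ** 2 - (th2 - 1) ** 2

th1, th2, eps = 2.0, -2.0, 0.01

# Stage 1: optimise the primary objective K1 alone (gradient ascent).
for _ in range(2000):
    th1 += 0.05 * (-2 * th1)
k1_star = K1(th1, th2)           # attainable value of the primary objective (~0)

# Stage 2: maximise K2 subject to K1 >= k1_star - eps via dual (Lagrangian) ascent;
# theta is updated on a faster timescale than the multiplier.
lmbda = 0.0
for _ in range(200000):
    th1 += 0.05 * (-2 * (th1 - 1) + lmbda * (-2 * th1))   # ascent on K2 + lmbda * K1
    th2 += 0.05 * (-2 * (th2 - 1))
    lmbda = max(0.0, lmbda + 0.01 * ((k1_star - eps) - K1(th1, th2)))

print(round(th1, 2), round(th2, 2))   # approx. (0.1, 1.0)

The secondary objective pulls $\theta_{1}$ towards $1$, but the multiplier enforces the $\epsilon$-tolerance on $K_{1}$, so the iterate settles at $\theta_{1}\approx\sqrt{\epsilon}$, i.e., the secondary objective is improved only as far as the constraint on the primary one allows.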

B.2 On Adversarial Disturbances and other Noise Kernels

A problem that remains open after this work is what constitutes an appropriate choice of $\tilde{T}$, and what we can expect when restricting ourselves to a particular class of $\tilde{T}$. We first discuss adversarial examples, and then general considerations on $\tilde{T}$ versus $T$.

Adversarial Noise

As mentioned in the introduction, much of the previous work focuses on adversarial disturbances. We did not directly address this in the results of this work, since our motivation lies in scenarios where the disturbance is not adversarial and is unknown. However, following the results of Section 3, we can reason about adversarial disturbances. Consider an adversarial map $T_{adv}$ defined by

\[
\langle\pi,T_{adv}\rangle(x)=\pi(y),\quad y\in\operatorname{argmax}_{y\in X_{ad}(x)}d\big(\pi(x),\pi(y)\big),
\]

with $X_{ad}(x)\subseteq X$ being a set of admissible disturbance states for $x$, and $d(\cdot,\cdot)$ a distance measure between distributions (e.g. the 2-norm).
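On a finite state space the adversarial operator above can be computed by enumeration. The minimal Python sketch below (the tabular policy and admissible sets are arbitrary illustrations, and $d$ is taken to be the 2-norm) returns, for each state, the admissible state whose policy distribution is furthest from the undisturbed one:

import numpy as np

# pi[x] is the action distribution at state x; X_ad[x] lists admissible disturbance states.
pi = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.5, 0.5]])
X_ad = {0: [0, 1], 1: [1, 2], 2: [0, 2]}

def adversarial_map(pi, X_ad):
    # For each x, pick y in X_ad(x) maximising the 2-norm distance between pi(x) and pi(y).
    return {x: max(adm, key=lambda y: np.linalg.norm(pi[x] - pi[y]))
            for x, adm in X_ad.items()}

T_adv = adversarial_map(pi, X_ad)
disturbed_pi = np.array([pi[T_adv[x]] for x in range(len(pi))])   # <pi, T_adv>(x) = pi(T_adv(x))
print(T_adv)
print(disturbed_pi)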

Proposition B.5.

Constant policies are fixed points of $T_{adv}$, and they are the only fixed points if for every pair of states $x_{0},x_{k}$ there exists a sequence $\{x_{0},...,x_{k}\}\subseteq X$ such that $x_{i+1}\in X_{ad}(x_{i})$ for all $i$.

Proof B.6.

First, it is straightforward that $\overline{\pi}\in\overline{\Pi}\implies\langle\overline{\pi},T_{adv}\rangle(x)=\overline{\pi}(x)$. To show that constant policies are the only fixed points, assume there is a non-constant policy $\pi^{\prime}$ that is a fixed point of $T_{adv}$. Then there exist $x,z$ such that $\pi^{\prime}(x)\neq\pi^{\prime}(z)$. However, by assumption, we can construct a sequence $\{x,...,z\}\subseteq X$ that connects $x$ and $z$ and in which every state is in the admissible set of the previous one. Assume without loss of generality that this sequence is $\{x,y,z\}$. Then, if $\pi^{\prime}$ is a fixed point, $\langle\pi^{\prime},T_{adv}\rangle(x)=\pi^{\prime}(x)$, $\langle\pi^{\prime},T_{adv}\rangle(y)=\pi^{\prime}(y)$ and $\langle\pi^{\prime},T_{adv}\rangle(z)=\pi^{\prime}(z)$. However, $\pi^{\prime}(x)\neq\pi^{\prime}(z)$, so either $\pi^{\prime}(x)\neq\pi^{\prime}(y)\implies d(\pi^{\prime}(x),\pi^{\prime}(y))\neq 0$ or $\pi^{\prime}(y)\neq\pi^{\prime}(z)\implies d(\pi^{\prime}(y),\pi^{\prime}(z))\neq 0$, and therefore $\pi^{\prime}$ cannot be a fixed point of $T_{adv}$.

The main difference between an adversarial operator and the random noise considered throughout this work is that $T_{adv}$ is not a linear operator and, additionally, it is time-varying (since the policy is modified at every step of the PG algorithm). Therefore, including it as an LRPG objective would invalidate the assumptions required for LRPG to retain the formal guarantees of the original PG algorithm used, and the resulting policy gradient algorithm is not guaranteed to converge.

Appendix C Experiment Methodology

In the experiments we extend well-tested implementations of A2C, PPO and SAC from Stable Baselines 3 (Raffin et al., 2021) to include the computation of the lexicographic parameters in \eqref{eq:LRLupdates}. All experiments were run on an Ubuntu 18.04 system with a 16-core CPU and an Nvidia GeForce 3060 GPU.

LRPG Parameters.

The LRL parameters are initialised in all cases as $\beta_{0}^{1}=2$, $\beta_{0}^{2}=1$, $\lambda=0$ and $\eta=0.001$. The LRL tolerance is set to $\epsilon_{t}=0.99\hat{k}_{1}$, to ensure we never deviate too much from the original objective, since the environments have very sparse rewards. We use a first-order approximation to compute the LRL weights, as in the original LMORL implementation.

C.1 Discrete Control

The discrete control environments used can be seen in Figure 3.

Figure 3: Screenshots of the environments used, from left: LavaGap, LavaCrossing and DynamicObstacles.

Since all the environments use a pixel representation of the observation, we use a shared representation for the value function and policy, where the first component is a convolutional network, implemented as in Zhang (2018). The hyper-parameters of the neural representations are presented in Table 2.

Layer Output Channels Activation
Conv1 16 ReLU
Conv2 32 ReLU
Conv3 64 ReLU
Table 2: Shared Observation Layers

The actor and critic heads, for both algorithms, are a fully connected layer with $64$ input features and the corresponding output dimension. We used an Adam optimiser in all cases. We optimised the parameters for each (vanilla) algorithm through a quick parameter search, and applied the same parameters to the Lexicographically Robust versions.
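For reference, a minimal PyTorch sketch of the architecture described above is given below. The kernel sizes, input shape and the projection to 64 shared features are assumptions made only for illustration; the actual implementation follows Zhang (2018) and Stable Baselines 3.

import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        # Shared convolutional feature extractor (Table 2); kernel sizes are illustrative.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(64), nn.ReLU(),       # project to the 64 features fed to both heads
        )
        self.actor = nn.Linear(64, n_actions)   # policy logits
        self.critic = nn.Linear(64, 1)          # state value

    def forward(self, obs: torch.Tensor):
        z = self.features(obs)
        return self.actor(z), self.critic(z)

# Example: a batch of 7x7 MiniGrid-style observations with 3 channels.
model = SharedActorCritic(in_channels=3, n_actions=7)
logits, value = model(torch.zeros(8, 3, 7, 7))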

Parameter LavaGap LavaCrossing DynObs
Parallel Envs 16 16 16
Steps $2\times 10^{6}$ $2\times 10^{6}$ $8\times 10^{6}$
$\gamma$ 0.99 0.99 0.98
$\alpha$ 0.00176 0.00176 0.00181
$\epsilon$ (Adam) $10^{-8}$ $10^{-8}$ $10^{-8}$
Grad. Clip 0.9 0.9 0.5
GAE 0.95 0.95 0.95
Rollout 64 64 64
Entropy Coeff. 0.01 0.014 0.011
Value Coeff. 0.05 0.05 0.88
Table 3: A2C Parameters
Parameter LavaGap LavaCrossing DynObs
Parallel Envs 8 8 8
Steps $6\times 10^{6}$ $2\times 10^{6}$ $8\times 10^{5}$
$\gamma$ 0.95 0.99 0.97
$\alpha$ 0.001 0.001 0.001
$\epsilon$ (Adam) $10^{-8}$ $10^{-8}$ $10^{-8}$
Grad. Clip 1 1 0.1
Ratio Clip 0.2 0.2 0.2
GAE 0.95 0.95 0.95
Rollout 256 512 256
Epochs 10 10 10
Entropy Coeff. 0 0.1 0.01
Table 4: PPO Parameters

For the implementation of the LRPG versions of the algorithms, in all cases we allow the algorithm to iterate for $1/3$ of the total steps before starting to compute the robustness objectives. In other words, we use $\hat{K}(\theta)=K_{1}(\theta)$ until $t=\frac{1}{3}\operatorname{max\_steps}$, and from this point on we resume the lexicographic robustness computation as described in Algorithm 1. This is due to the structure of the environments simulated: the rewards (and in particular the positive rewards) are very sparse in the environments considered. Therefore, when computing the policy gradient steps, the loss for the primary objective is practically zero until the environment is successfully solved at least once. If we implemented the combined lexicographic loss from the first time step, the algorithm would often converge to a (constant) policy without exploring for enough steps, leading to convergence towards a maximally robust policy that does not solve the environment.
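Schematically, this warm-up amounts to gating the loss on the step counter; a minimal sketch is given below (function and argument names are placeholders, not the actual implementation):

def scheduled_loss(t, max_steps, primary_loss, combined_loss):
    # Use only the primary objective K1 for the first third of training (sparse rewards),
    # then switch to the combined lexicographic loss K_hat of Algorithm 1.
    return primary_loss if t < max_steps // 3 else combined_loss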

Noise Kernels. We consider two types of noise: a normally distributed noise $\tilde{T}^{g}$ and a uniformly distributed noise $\tilde{T}^{u}$. For the environments LavaGap and DynamicObstacles, the kernel $\tilde{T}^{u}$ produces a disturbed state $\tilde{x}=x+\xi$ with $\|\xi\|_{\infty}\leq 2$, and for LavaCrossing $\|\xi\|_{\infty}\leq 1.5$. The normally distributed noise is in all cases $\mathcal{N}(0,0.5)$. The maximum norm of the noise is quite large, but this is due to the structure of the observations in these environments: the pixel values are encoded as integers $0$–$9$, where each integer represents a different feature in the environment (empty space, doors, lava, obstacle, goal, ...). Therefore, any noise with $\|\xi\|_{\infty}\leq 0.5$ would most likely not be enough to confuse the agent; on the other hand, too large noise signals are unrealistic and produce pathological environments. All the policies are then tested against two “true” noise kernels, $T_{1}=\tilde{T}^{u}$ and $T_{2}=\tilde{T}^{g}$. The main reason for this is to test both the scenario where we assume a wrong noise kernel and the case where we train the agents with the correct kernel.
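A sketch of how these observation kernels can be applied is given below; the MiniGrid-style observation shape is illustrative, and $0.5$ is taken here as the standard deviation of the Gaussian kernel (an assumption, since the parameterisation is not spelled out above). Note that only the observation is disturbed; the underlying MDP state is unchanged.

import numpy as np

rng = np.random.default_rng()

def uniform_kernel(x, bound):
    # T^u: uniform noise with ||xi||_inf <= bound, applied to the observed state only.
    return x + rng.uniform(-bound, bound, size=x.shape)

def gaussian_kernel(x, sigma=0.5):
    # T^g: zero-mean Gaussian noise on each observation entry.
    return x + rng.normal(0.0, sigma, size=x.shape)

obs = rng.integers(0, 10, size=(7, 7, 3)).astype(float)   # encoded observation (integers 0-9)
noisy_obs = uniform_kernel(obs, bound=2.0)                 # fed to the policy in place of obs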

Comparison with SA-PPO. One of the baselines included is the State-Adversarial PPO algorithm proposed in Zhang et al. (2020). The implementation includes an extra parameter that multiplies the regularisation objective, $k_{ppo}$. Since we were not able to find guidance on the best value for discrete-action environments, we ran $k_{ppo}\in\{0.1,1,2\}$ and picked the best result for each entry in Table 1. Larger values seemed to destabilise learning in some cases. The rest of the parameters are kept as in the vanilla PPO implementation.

C.1.1 Extended Results: Adversarial Disturbances

Even though we do not use an adversarial attacker or disturbance in our reasoning throughout this work, we implemented a policy-based state-adversarial disturbance to test the benchmark algorithms against, and to evaluate how well each of the methods reacts to such adversarial disturbances.

Adversarial Disturbance

We implement a bounded policy-based adversarial attack where, at each state $x$, we maximise the KL divergence between the policy at the disturbed and at the undisturbed state, so that the adversarial operator is:

\[
T_{adv}^{\varepsilon}(y\mid x)=1\implies y\in\argmax_{\tilde{x}}\,D_{KL}\big(\pi(x),\pi(\tilde{x})\big)\quad\text{s.t.}\quad\|x-\tilde{x}\|_{2}\leq\varepsilon.
\]

The optimisation problem is solved at every step using a Stochastic Gradient Langevin Dynamics (SGLD) optimiser. The results are presented in Table 5.
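A sketch of such an attack is given below; policy is assumed to map an observation tensor to action logits, and the number of steps, step size and noise scale are placeholder values rather than those used by Zhang et al. (2020).

import torch

def sgld_attack(policy, x, eps, steps=10, lr=0.1, noise_scale=0.01):
    # Approximately solves argmax_{||x - x_tilde||_2 <= eps} KL(pi(.|x) || pi(.|x_tilde))
    # with Langevin-style noisy gradient ascent followed by projection onto the L2 ball.
    with torch.no_grad():
        p = torch.softmax(policy(x), dim=-1)                  # undisturbed distribution (fixed)
    x_tilde = x.detach().clone() + 1e-3 * torch.randn_like(x)
    for _ in range(steps):
        x_tilde.requires_grad_(True)
        q_log = torch.log_softmax(policy(x_tilde), dim=-1)
        kl = torch.sum(p * (torch.log(p + 1e-8) - q_log))     # KL(p || q)
        grad, = torch.autograd.grad(kl, x_tilde)
        with torch.no_grad():
            x_tilde = x_tilde + lr * grad + noise_scale * torch.randn_like(x_tilde)
            delta = x_tilde - x                                # project onto the eps-ball around x
            norm = delta.norm(p=2)
            if norm > eps:
                x_tilde = x + delta * (eps / norm)
    return x_tilde.detach()

The disturbed observation returned by the attack is then fed to the policy in place of the true observation when evaluating the trained agents.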

PPO on MiniGrid Environments A2C on MiniGrid Environments
Noise Vanilla LR$_{\text{PPO}}$ $(K^{u}_{T})$ LR$_{\text{PPO}}$ $(K^{g}_{T})$ SA-PPO Vanilla LR$_{\text{A2C}}$ $(K^{u}_{T})$ LR$_{\text{A2C}}$ $(K^{g}_{T})$ LR$_{\text{A2C}}$ $(K_{D})$
LavaGap
$\emptyset$ 0.95±0.003 0.95±0.075 0.95±0.101 0.94±0.068 0.94±0.004 0.94±0.005 0.94±0.003 0.94±0.006
$T_{1}$ 0.80±0.041 0.95±0.078 0.93±0.124 0.88±0.064 0.83±0.061 0.93±0.019 0.89±0.032 0.91±0.088
$T_{2}$ 0.92±0.015 0.95±0.052 0.95±0.094 0.93±0.050 0.89±0.029 0.94±0.008 0.93±0.011 0.93±0.021
$T_{adv}^{0.5}$ 0.56±0.194 0.93±0.101 0.91±0.076 0.90±0.123 0.92±0.034 0.94±0.003 0.94±0.007 0.93±0.015
$T_{adv}^{1}$ 0.20±0.243 0.90±0.124 0.68±0.190 0.90±0.135 0.75±0.123 0.94±0.006 0.92±0.038 0.88±0.084
$T_{adv}^{2}$ 0.01±0.051 0.71±0.251 0.21±0.357 0.87±0.116 0.27±0.119 0.79±0.069 0.68±0.127 0.56±0.249
LavaCrossing
$\emptyset$ 0.95±0.023 0.93±0.050 0.93±0.018 0.88±0.091 0.91±0.024 0.91±0.063 0.90±0.017 0.92±0.034
$T_{1}$ 0.50±0.110 0.92±0.053 0.89±0.029 0.64±0.109 0.66±0.071 0.78±0.111 0.72±0.073 0.76±0.098
$T_{2}$ 0.84±0.061 0.92±0.050 0.92±0.021 0.85±0.094 0.78±0.054 0.83±0.105 0.86±0.029 0.87±0.063
$T_{adv}^{0.5}$ 0.29±0.098 0.91±0.081 0.91±0.054 0.87±0.045 0.56±0.039 0.51±0.089 0.43±0.041 0.68±0.126
$T_{adv}^{1}$ 0.03±0.022 0.83±0.122 0.86±0.132 0.87±0.059 0.27±0.158 0.25±0.118 0.17±0.067 0.43±0.060
$T_{adv}^{2}$ 0.0±0.004 0.50±0.171 0.38±0.020 0.82±0.072 0.06±0.056 0.04±0.030 0.01±0.008 0.09±0.060
DynamicObstacles
$\emptyset$ 0.91±0.002 0.91±0.008 0.91±0.007 0.91±0.131 0.91±0.011 0.88±0.020 0.89±0.009 0.91±0.013
$T_{1}$ 0.23±0.201 0.77±0.102 0.61±0.119 0.45±0.188 0.27±0.104 0.43±0.108 0.45±0.162 0.56±0.270
$T_{2}$ 0.50±0.117 0.75±0.075 0.70±0.072 0.68±0.490 0.45±0.086 0.53±0.109 0.52±0.161 0.67±0.203
$T_{adv}^{0.5}$ 0.74±0.230 0.89±0.118 0.85±0.061 0.90±0.142 0.46±0.214 0.55±0.197 0.51±0.371 0.62±0.249
$T_{adv}^{1}$ 0.26±0.269 0.79±0.157 0.68±0.144 0.84±0.150 0.19±0.284 0.35±0.197 0.23±0.370 0.10±0.379
$T_{adv}^{2}$ -0.49±0.312 0.51±0.234 0.33±0.202 0.55±0.170 -0.54±0.209 -0.21±0.192 -0.53±0.261 -0.51±0.260
Table 5: Extended Reward Results.

This type of adversarial attack with an SGLD optimiser was proposed in Zhang et al. (2020). As one can see, the adversarial disturbance is quite successful at severely lowering the obtained rewards in all scenarios. Additionally, as expected, SA-PPO is the most effective at mitigating the effect of the disturbance (as it is trained with adversarial disturbances), although LRPG produces reasonably robust policies against this type of disturbance as well. Finally, A2C appears to be much more sensitive to adversarial disturbances than PPO, indicating that the policies produced by PPO are by default more robust than those produced by A2C.

Figure 4: Learning Plots for Discrete Control Environments.

C.2 Continuous Control

The continuous control environments simulated are MountainCar, LunarLander and BipedalWalker. The policies used are in all cases MLP policies with ReLU activations and a $(64,64)$ feature extractor plus a fully connected layer to output the values and actions, unless stated otherwise. The hyperparameters can be found in Tables 7 and 8. The implementation is based on the tuned algorithms of Stable Baselines 3 (Raffin et al., 2021).

PPO on Continuous Environments SAC on Continuous Environments
Noise Vanilla LR$_{\text{PPO}}$ $(K^{u}_{T})$ LR$_{\text{PPO}}$ $(K^{g}_{T})$ SA-PPO Vanilla LR$_{\text{SAC}}$ $(K^{u}_{T})$ LR$_{\text{SAC}}$ $(K^{g}_{T})$
MountainCar
$\emptyset$ 94.77±0.26 93.17±0.89 94.66±1.61 88.69±3.93 93.52±0.05 94.43±0.19 93.84±0.05
$T_{1}$ 88.67±1.41 91.46±1.22 94.91±1.35 88.41±3.99 1.89±65.31 71.81±13.04 76.90±7.11
$T_{2}$ 92.22±1.11 92.40±1.28 94.76±1.42 89.32±3.79 -27.82±73.10 72.93±8.57 69.41±13.03
LunarLander
$\emptyset$ 267.99±38.04 269.76±22.93 243.08±37.03 220.18±98.78 268.96±51.52 275.17±14.04 282.24±15.95
$T_{1}$ 156.09±22.87 280.91±20.34 182.80±49.26 164.53±45.48 128.18±17.73 187.64±76.30 153.81±33.16
$T_{2}$ 158.02±46.57 276.76±16.20 212.62±37.56 221.84±73.61 140.92±20.61 187.82±25.27 158.18±28.60
BipedalWalker
$\emptyset$ 265.39±82.36 261.39±83.19 276.66±44.85 251.60±103.08 236.39±157.03 302.56±70.79 313.56±52.17
$T_{1}$ 174.15±170.30 253.56±72.66 220.28±118.61 264.69±61.63 203.93±167.83 241.45±124.54 241.60±139.93
$T_{2}$ 135.16±182.30 243.27±89.86 265.37±80.60 255.21±90.61 84.10±198.12 198.20±151.64 229.75±166.87
Table 6: Reward values gained by LRPG and baselines on continuous control tasks.

Noise Kernels. We consider again two types of noise: a normally distributed noise $\tilde{T}^{g}$ and a uniformly distributed noise $\tilde{T}^{u}$. In all cases, the algorithms are implemented with a state observation normaliser; that is, asymptotically all states will be observed to lie in the set $(-1,1)$. For this reason, the uniform noise is bounded at lower values than for the discrete control environments: for BipedalWalker $\|\xi\|_{\infty}\leq 0.05$, and for LunarLander and MountainCar $\|\xi\|_{\infty}\leq 0.1$. Larger values were shown to destabilise learning.

Parameter MountainCarContinuous LunarLanderContinuous BipedalWalker-v3
Parallel Envs 1 16 32
Steps $2\times 10^{4}$ $1\times 10^{6}$ $5\times 10^{6}$
$\gamma$ 0.9999 0.999 0.999
$\alpha$ $3\times 10^{-4}$ $3\times 10^{-4}$ $3\times 10^{-4}$
Grad. Clip 5 0.5 0.5
Ratio Clip 0.2 0.2 0.18
GAE 0.9 0.98 0.95
Epochs 10 4 10
Entropy Coeff. 0.00429 0.01 0
Table 7: PPO Parameters for Continuous Control
Parameter MountainCarContinuous LunarLanderContinuous BipedalWalker-v3
Steps $5\times 10^{4}$ $5\times 10^{5}$ $5\times 10^{5}$
$\gamma$ 0.9999 0.99 0.98
$\alpha$ $3\times 10^{-4}$ $7.3\times 10^{-4}$ $7.3\times 10^{-4}$
$\tau$ 0.01 0.01 0.01
Train Freq. 32 1 64
Grad. Steps 32 1 64
MLP Arch. (64,64) (400,300) (400,300)
Table 8: SAC Parameters for Continuous Control
Figure 5: Learning Plots for Continuous Control Environments.
Learning Processes.

In general, learning was not severely affected by the LRPG scheme. However, it was shown to induce a larger variance in the trajectories observed, as seen in LunarLander and BipedalWalker with LR-SAC.