
Robust Reinforcement Learning on State Observations with Learned Optimal Adversary

Huan Zhang*,1  Hongge Chen*,2  Duane Boning2  Cho-Jui Hsieh1
1Department of Computer Science, UCLA  2Department of EECS, MIT
huan@huan-zhang.com, chenhg@mit.edu, boning@mtl.mit.edu
chohsieh@cs.ucla.edu
*Huan Zhang and Hongge Chen contributed equally.
Abstract

We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks on deep reinforcement learning (DRL) and is also important for deploying real-world RL agents under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack on RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of state-adversarial Markov decision processes (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find that an LSTM-based policy can be more robust under adversaries. Empirical evaluation on a few continuous control environments shows that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.

1 Introduction

Modern deep reinforcement learning agents (Mnih et al., 2015; Levine et al., 2015; Lillicrap et al., 2015; Silver et al., 2016; Fujimoto et al., 2018) typically use neural networks as function approximators. Since the discovery of adversarial examples in image classification tasks (Szegedy et al., 2013), vulnerabilities in DRL agents were first demonstrated in (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017) and further developed under more environments and different attack scenarios (Behzadan & Munir, 2017a; Pattanaik et al., 2018; Xiao et al., 2019). These attacks commonly add imperceptible noise to the observations of states, i.e., the observed environment slightly differs from the true environment. This raises concerns for using RL in safety-critical applications such as autonomous driving (Sallab et al., 2017; Voyage, 2019). Additionally, the discrepancy between ground-truth states and agent observations also contributes to the "reality gap": an agent working well in simulated environments may fail in real environments due to noise in observations (Jakobi et al., 1995; Muratore et al., 2019), as real-world sensing contains unavoidable noise (Brooks, 1992).

We classify the weakness of a DRL agent under perturbations of state observations into two classes: the vulnerability of function approximators, which typically originates from the highly non-linear and blackbox nature of neural networks; and the intrinsic weakness of the policy: even if perfect features are extracted from states, an agent can still make mistakes due to an intrinsic weakness in its policy.

For example, in deep Q networks (DQNs) for Atari games, a large convolutional neural network (CNN) is used for extracting features from input frames. To act correctly, the network must extract crucial features: e.g., for the game of Pong, the position and velocity of the ball, which can be observed by visualizing convolutional layers (Hausknecht & Stone, 2015; Guo et al., 2014). Many attacks in the DQN setting add imperceptible noise (Huang et al., 2017; Lin et al., 2017; Kos & Song, 2017; Behzadan & Munir, 2017a) that exploits the vulnerability of deep neural networks so that they extract wrong features, as we have seen in adversarial examples for image classification tasks. On the other hand, fragile function approximation is not the only source of weakness of an RL agent: in a finite-state Markov decision process (MDP), we can use tabular policy and value functions so there is no function approximation error, yet the agent can still be vulnerable to small perturbations on observations, e.g., perturbing the observation of a state to one of its four neighbors in a gridworld-like environment can prevent an agent from reaching its goal (Figure 1). To improve the robustness of RL, we need to take measures from both aspects: a more robust function approximator, and a policy aware of perturbations in observations.

Techniques developed for enhancing the robustness of neural network (NN) classifiers can be applied to address the vulnerability of function approximators. In particular, for environments like Atari games with images as inputs and discrete actions as outputs, the policy network $\pi_{\theta}$ behaves similarly to a classifier at test time. Thus, Fischer et al. (2019); Mirman et al. (2018a) utilized existing certified adversarial defense approaches from supervised learning (Mirman et al., 2018b; Wong & Kolter, 2018; Gowal et al., 2018; Zhang et al., 2020a) to enhance the robustness of DQN agents. Another successful approach (Zhang et al., 2020b), for both Atari and high-dimensional continuous control environments, regularizes the smoothness of the learned policy such that $\max_{\hat{s}\in\mathcal{B}(s)}D(\pi_{\theta}(s),\pi_{\theta}(\hat{s}))$ is small for some divergence $D$, where $\mathcal{B}(s)$ is a neighborhood around $s$. This maximization can be solved using a gradient based method or convex relaxations of NNs (Salman et al., 2019; Zhang et al., 2018; Xu et al., 2020), and the result is then minimized by optimizing $\theta$. Such an adversarial minimax regularization is in the same spirit as the ones used in some adversarial training approaches for (semi-)supervised learning, e.g., TRADES (Zhang et al., 2019) and VAT (Miyato et al., 2015). However, regularizing the function approximator does not explicitly improve the intrinsic robustness of the policy.
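To make the inner maximization concrete, the sketch below approximates $\max_{\hat{s}\in\mathcal{B}(s)}D(\pi_{\theta}(s),\pi_{\theta}(\hat{s}))$ with projected gradient ascent on an $\ell_\infty$ ball for a toy Gaussian policy; the network architecture, KL-based divergence, and step sizes are illustrative assumptions rather than the exact setup of Zhang et al. (2020b).

```python
# Sketch: approximate max_{s_hat in B(s)} D(pi(s), pi(s_hat)) by projected
# gradient ascent on an l_inf ball of radius eps, for a toy Gaussian policy.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        return self.mean_net(s), self.log_std.exp()

def smoothness_penalty(policy, s, eps=0.1, steps=10, lr=0.02):
    """Approximate max_{s_hat in B(s)} KL(pi(.|s) || pi(.|s_hat)) with PGD."""
    mean_s, std = policy(s)
    delta = torch.zeros_like(s, requires_grad=True)
    for _ in range(steps):
        mean_hat, _ = policy(s + delta)
        # KL between two Gaussians with a shared std reduces to a scaled
        # squared difference of the means.
        kl = (((mean_hat - mean_s.detach()) / std.detach()) ** 2).sum() / 2
        grad, = torch.autograd.grad(kl, delta)
        with torch.no_grad():
            delta += lr * grad.sign()   # ascent step on the divergence
            delta.clamp_(-eps, eps)     # project back onto the l_inf ball B(s)
    mean_hat, _ = policy(s + delta.detach())
    return (((mean_hat - mean_s) / std) ** 2).sum() / 2  # differentiable in theta

policy = GaussianPolicy(state_dim=11, action_dim=3)
penalty = smoothness_penalty(policy, torch.randn(32, 11))
print("smoothness penalty:", penalty.item())
```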

(a) Path in the unperturbed environment (found by policy iteration). Agent's reward $=+1$. Black arrows and numbers show actions and the value function of the agent.
(b) Path under the optimal adversary. Agent's reward $=-\infty$. Red arrows and numbers show actions and the value function of the optimal adversary (Section 3.1).
(c) A robust POMDP policy solved by SARSOP (Kurniawati et al., 2008) under the same adversary. This policy is history dependent (Section 3.2).
Figure 1: We show an agent in a gridworld environment trained with no function approximators, whose optimal policy is intrinsically not robust to perturbations of state observations. The red square and blue circle are the starting point and the target (reward +1) of the agent, respectively. The green triangles are traps, with reward -1 once encountered. The adversary is allowed to perturb the observation to adjacent states along four directions: up, down, left, and right. The adversary earns +1 at traps and -1 at the target. We set $\gamma=0.9$ for both agent and adversary. This example shows that the vulnerability of an RL agent does not only come from errors in function approximators such as DNNs.

In this paper, we propose an orthogonal approach, alternating training with learned adversaries (ATLA), to enhance the robustness of DRL agents. We focus on dealing with the intrinsic weakness of the policy by learning an adversary online with the agent during training time, rather than directly regularizing function approximators. Our main contributions can be summarized as:

  • We follow the framework of state-adversarial Markov decision process (SA-MDP) and show how to learn an optimal adversary for perturbing observations. We demonstrate practical attacks under this formulation and obtain learned adversaries that are significantly stronger than previous ones.

  • We propose the alternating training with learned adversaries (ATLA) framework to improve the robustness of DRL agents. The difference between our approach and previous adversarial training approaches is that we use a stronger adversary, which is learned online together with the agent.

  • Our analysis on SA-MDP also shows that history can be important for learning a robust agent. We thus propose to use an LSTM-based policy in the ATLA framework and find that it is more robust than policies parameterized as regular feedforward NNs.

  • We evaluate our approach empirically on four continuous control environments. We outperform explicit regularization based methods in a few environments, and our approach can also be directly combined with explicit regularizations on function approximators to achieve state-of-the-art results.

2 Related Work

State-adversarial Markov decision process (SA-MDP) (Zhang et al., 2020b) characterizes the decision making problem under adversarial attacks on state observations. Most importantly, the true state in the environment is not perturbed by the adversary under this setting; for example, perturbing pixels in an Atari environment (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Behzadan & Munir, 2017a; Inkawhich et al., 2019) does not change the true location of an object in the game simulator. SA-MDP can characterize agent performance under natural or adversarial noise from sensor measurements. For example, GPS sensor readings on a car are naturally noisy, but the ground-truth location of the car is not affected by the noise. Importantly, this setting is different from the robust Markov decision process (RMDP) (Nilim & El Ghaoui, 2004; Iyengar, 2005), where the worst case transition probabilities of the environment are considered. "Robust reinforcement learning" in some works (Mankowitz et al., 2018; 2019) refers to this different definition of robustness in RMDP and should not be confused with our setting of robustness against perturbations on state observations.

Several works proposed methods to learn an adversary online together with an agent. RARL (Pinto et al., 2017) proposed to train an agent and an adversary under the two-player Markov game (Littman, 1994) setting. The adversary can change the environment states through actions directly applied to the environment. The goal of RARL is to improve robustness against environment parameter changes, such as mass, length or friction. Gleave et al. (2019) discussed learning an adversary using reinforcement learning to attack a victim agent, by taking adversarial actions that change the environment and consequently change the observation of the victim agent. Both Pinto et al. (2017); Gleave et al. (2019) conduct their attacks under the two-player Markov game framework, rather than considering perturbations on state observations. Besides, Li et al. (2019) consider a similar Markov game setting in multi-agent RL environments. The difference between these works and ours can be clearly seen in the setting where the adversary is fixed: under the framework of (Pinto et al., 2017; Gleave et al., 2019), the learning of the agent is still an MDP, but in our setting, it becomes a harder POMDP problem (Section 3.2).

Training DRL agents with perturbed state observations from adversaries has been investigated in a few works, sometimes referred to as adversarial training. Kos & Song (2017); Behzadan & Munir (2017b) used gradient based adversarial attacks on DQN agents and put adversarial frames into the replay buffer. This approach is not very successful because for Atari environments the main source of weakness likely comes from the function approximator, so an adversarial regularization framework such as (Zhang et al., 2020b; Qu et al., 2020), which directly controls the smoothness of the $Q$ function, is more effective. For lower dimensional continuous control tasks such as the MuJoCo environments, Mandlekar et al. (2017); Pattanaik et al. (2018) conducted FGSM and multi-step gradient based attacks at training time; however, their main focus was on robustness against environment parameter changes, and only limited evaluation of the adversarial attack setting was conducted with relatively weak adversaries. Zhang et al. (2020b) systematically tested this approach under newly proposed strong attacks and found that it cannot reliably improve robustness. These early adversarial training approaches typically use gradients from a critic function; they are usually relatively weak, and not sufficient to lead to a robust policy under stronger attacks.

The robustness of RL has also been investigated from other perspectives. For example, Tessler et al. (2019) study MDPs under action perturbations; Tan et al. (2020) use adversarial training on action space to enhance agent robustness under action perturbations. Besides, policy teaching (Zhang & Parkes, 2008; Zhang et al., 2009; Ma et al., 2019) and policy poisoning (Rakhsha et al., 2020; Huang & Zhu, 2019) manipulate the reward or cost signal during agent training time to induce a desired agent policy. Essentially, policy teaching is a training time “attack” with perturbed rewards from the environments (which can be analogous to data poisoning attacks in supervised learning settings), while our goal is to obtain a robust agent against test time adversarial attacks. All these settings differ from the setting of perturbing state observations discussed in our paper.

3 Methodology

In this section, we first discuss the case where the agent policy is fixed, and then the case where the adversary is fixed in SA-MDPs. This allows us to propose an alternating training framework to improve robustness of RL agents under perturbations on state observations.

Notations and Background

We use $\mathcal{S}$ and $\mathcal{A}$ to represent the state space and the action space, respectively; $\mathcal{P}(\mathcal{S})$ denotes the set of all possible probability measures on $\mathcal{S}$. We define a Markov decision process (MDP) as $(\mathcal{S},\mathcal{A},R,p,\gamma)$, where $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ and $p:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})$ are two mappings representing the reward and the transition probability. The transition probability at time step $t$ can be written as $p(s^{\prime}|s,a)=\mathrm{Pr}(s_{t+1}=s^{\prime}|s_{t}=s,a_{t}=a)$. The reward function is defined as the expected reward $R(s,a,s^{\prime}):=\mathbb{E}[r_{t}|s_{t}=s,a_{t}=a,s_{t+1}=s^{\prime}]$. $\gamma\in[0,1]$ is the discount factor. We denote a stationary policy as $\pi:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{A})$, which is independent of history. We denote the history at time $t$ as $h_{t}=\{s_{0},a_{0},\cdots,s_{t-1},a_{t-1},s_{t}\}$ and $\mathcal{H}$ as the set of all histories. A history-dependent policy is defined as $\pi:\mathcal{H}\rightarrow\mathcal{P}(\mathcal{A})$. A partially observable Markov decision process (POMDP) (Astrom, 1965) can be defined as a 7-tuple $(\mathcal{S},\mathcal{A},\Omega,O,R,p,\gamma)$, where $\Omega$ is a set of observations and $O$ is a set of conditional observation probabilities $O(o|s)$. Unlike MDPs, POMDPs typically require history-dependent optimal policies.

Figure 2: SA-MDP introduces an adversary on state observations into an MDP.

To study the decision problem under adversaries on state observations, we use the state-adversarial Markov decision process (SA-MDP) framework (Zhang et al., 2020b). In SA-MDP, an adversary $\nu:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{S})$ is introduced to perturb the input state of an agent; however, the true environment state $s$ is unchanged (Figure 2). Formally, an SA-MDP is a 6-tuple $(\mathcal{S},\mathcal{A},\mathcal{B},R,p,\gamma)$, where $\mathcal{B}$ is a mapping from a state $s\in\mathcal{S}$ to a set of states $\mathcal{B}(s)\subseteq\mathcal{S}$. The agent sees the perturbed state $\hat{s}\sim\nu(\cdot|s)$ and takes an action $a\sim\pi(\cdot|\hat{s})$ accordingly. $\mathcal{B}$ limits the power of the adversary: $\text{supp}\left(\nu(\cdot|s)\right)\subseteq\mathcal{B}(s)$. The goal of SA-MDP is to solve for an optimal policy $\pi^{*}$ under its optimal adversary $\nu^{*}(\pi^{*})$; an optimal adversary is defined as the $\nu^{*}(\pi)$ under which $\pi$ achieves the lowest possible expected discounted return (value) on all states. Zhang et al. (2020b) did not give an explicit algorithm to solve SA-MDP and found that a stationary optimal policy need not exist.

3.1 Finding the optimal adversary under a fixed policy

In this section, we discuss how to find an optimal adversary $\nu$ for a given policy $\pi$. An optimal adversary leads to the worst case performance under the bounded perturbation set $\mathcal{B}$, and gives an absolute lower bound on the expected cumulative reward the agent can receive. It is similar to the concept of a "minimal adversarial example" in supervised learning tasks. We first show how to solve for the optimal adversary in the tabular MDP setting and then apply it to DRL settings.

A technical lemma (Lemma 1) from Zhang et al. (2020b) shows that, from the adversary's point of view, a fixed and stationary agent policy $\pi$ and the environment dynamics can be essentially merged into an MDP with redefined dynamics and reward functions:

Lemma 1 (Zhang et al. (2020b))

Given an SA-MDP $M=(\mathcal{S},\mathcal{A},\mathcal{B},R,p,\gamma)$ and a fixed and stationary policy $\pi(\cdot|\cdot)$, there exists an MDP $\hat{M}=(\mathcal{S},\hat{\mathcal{A}},\hat{R},\hat{p},\gamma)$ such that the optimal policy of $\hat{M}$ is the optimal adversary $\nu$ for the SA-MDP given the fixed $\pi$, where $\hat{\mathcal{A}}=\mathcal{S}$, and

\hat{R}(s,\hat{a},s^{\prime}):=\mathbb{E}[\hat{r}|s,\hat{a},s^{\prime}]=\begin{cases}-\frac{\sum_{a\in\mathcal{A}}\pi(a|\hat{a})p(s^{\prime}|s,a)R(s,a,s^{\prime})}{\sum_{a\in\mathcal{A}}\pi(a|\hat{a})p(s^{\prime}|s,a)}&\text{for }s,s^{\prime}\in\mathcal{S}\text{ and }\hat{a}\in\mathcal{B}(s)\subset\hat{\mathcal{A}},\\ C&\text{for }s,s^{\prime}\in\mathcal{S}\text{ and }\hat{a}\notin\mathcal{B}(s),\end{cases}

where $C$ is a large negative constant, and

\hat{p}(s^{\prime}|s,\hat{a})=\sum_{a\in\mathcal{A}}\pi(a|\hat{a})p(s^{\prime}|s,a)\quad\text{for }s,s^{\prime}\in\mathcal{S}\text{ and }\hat{a}\in\hat{\mathcal{A}}.

The intuition behind Lemma 1 is that the adversary's goal is to reduce the reward earned by the agent. Thus, when a reward $r_{t}$ is received by the agent at time step $t$, the adversary receives a negative reward $\hat{r}_{t}=-r_{t}$. To prevent the adversary from choosing perturbed states outside of the set $\mathcal{B}(s)$, a large negative reward $C$ is assigned to these actions so that the optimal adversary never takes them. For actions within $\mathcal{B}(s)$, we calculate $\hat{R}(s,\hat{a},s^{\prime})$ by its definition, $\hat{R}(s,\hat{a},s^{\prime}):=\mathbb{E}[\hat{r}|s,\hat{a},s^{\prime}]$, which yields the term in Lemma 1. The proof can be found in Appendix B of Zhang et al. (2020b).

After constructing the MDP $\hat{M}$, it is possible to solve for an optimal policy $\nu$ of $\hat{M}$, which is the optimal adversary for the SA-MDP $M$ given policy $\pi$. For MDPs, under mild regularity assumptions an optimal policy always exists (Puterman, 2014). In our case, the optimal policy on $\hat{M}$ corresponds to an optimal adversary of the SA-MDP, which is the worst case perturbation for policy $\pi$. As an illustration, Figure 1 shows a GridWorld environment. The red square is the starting point. The blue circle and green triangles are the target and traps, respectively. When the agent hits the target, it earns reward +1 and the game stops; it earns reward -1 whenever it encounters a trap. We set $\gamma=0.9$ for both agent and adversary. The adversary is allowed to perturb the observation to adjacent cells along four directions: up, down, left, and right. When there is no adversary, after running policy iteration, the agent can easily reach the target and earn a reward of +1, as in Figure 1(a). However, if we train the adversary based on Lemma 1 and apply it to the agent, we are able to make the agent repeatedly encounter a trap. This leads to a $-\infty$ reward for the agent and a $+\infty$ reward for the adversary, as shown in Figure 1(b).
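To illustrate the construction in Lemma 1 in the tabular case, the sketch below builds a small deterministic gridworld (the layout, trap location, and constants are our own toy assumptions, not the exact environment of Figure 1), fixes the agent's greedy policy, forms the adversary's MDP $(\hat{R},\hat{p})$, and solves it with value iteration:

```python
# Sketch (toy assumptions, not the exact gridworld of Figure 1): build a small
# deterministic gridworld, fix the agent's greedy policy pi, construct the
# adversary's MDP (R_hat, p_hat) following Lemma 1, and solve it with value
# iteration to obtain an (approximately) optimal adversary.
import numpy as np

ROWS, COLS = 3, 4
TARGET, TRAP = 11, 6                 # state index = row * COLS + col
GAMMA, C = 0.9, -1e3                 # discount factor; C penalizes a_hat outside B(s)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    r, c = divmod(s, COLS)
    dr, dc = ACTIONS[a]
    return min(max(r + dr, 0), ROWS - 1) * COLS + min(max(c + dc, 0), COLS - 1)

n_s, n_a = ROWS * COLS, len(ACTIONS)
P = np.zeros((n_s, n_a, n_s))        # p(s'|s,a), deterministic; target is absorbing
R = np.zeros((n_s, n_a, n_s))        # R(s,a,s'): +1 entering target, -1 entering trap
for s in range(n_s):
    for a in range(n_a):
        sp = s if s == TARGET else step(s, a)
        P[s, a, sp] = 1.0
        R[s, a, sp] = 1.0 if (sp == TARGET and s != TARGET) else (-1.0 if sp == TRAP else 0.0)

def value_iteration(P, R, gamma, iters=500):
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = (P * (R + gamma * V)).sum(axis=2)     # Q(s,a)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

# 1) The agent's (fixed) greedy policy on the clean MDP.
pi_act, _ = value_iteration(P, R, GAMMA)
pi = np.eye(n_a)[pi_act]             # pi(a|s) as a one-hot matrix

# 2) The adversary's MDP from Lemma 1: its actions are perturbed states a_hat in B(s).
def B(s):                            # s and its four neighbors
    return {s} | {step(s, a) for a in range(n_a)}

P_hat = np.einsum('ka,saj->skj', pi, P)           # p_hat(s'|s, a_hat=k)
num = np.einsum('ka,saj,saj->skj', pi, P, R)      # sum_a pi(a|a_hat) p(s'|s,a) R(s,a,s')
R_hat = np.where(P_hat > 0, -num / np.maximum(P_hat, 1e-12), 0.0)
for s in range(n_s):
    allowed = B(s)
    for k in range(n_s):
        if k not in allowed:
            R_hat[s, k, :] = C       # forbid showing states outside B(s)

# 3) Solve the adversary's MDP: the optimal adversary lures the agent into the trap.
adv_act, adv_V = value_iteration(P_hat, R_hat, GAMMA)
print("state shown by the adversary:\n", adv_act.reshape(ROWS, COLS))
print("adversary value function:\n", np.round(adv_V.reshape(ROWS, COLS), 2))
```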

(a) No attack (reward 5851); (b) RS attack (reward 284); (c) Our attack (reward -1140); (d) No attack (reward 7094); (e) RS attack (reward 85); (f) Our attack (reward -743).
Figure 3: Our "optimal" attack and the Robust Sarsa (RS) attack (a previous strong attack proposed in Zhang et al. (2020b)) on the Ant and HalfCheetah environments. Previous strong attacks make the agent fail and receive a small positive reward (less than 1/10 of the reward without attack). Our attack is strong enough to trick the agent into moving in the opposite direction, receiving a large negative reward.

We now extend Lemma 1 to the DRL setting. Since learning the adversary is equivalent to solving an MDP, we parameterize the adversary as a neural network and use any popular DRL algorithm to learn an "optimal" adversary. Here we put the word "optimal" in quotes because we use a function approximator to learn the adversary, so it is no longer guaranteed to be optimal; nonetheless, it follows the SA-MDP framework of solving an optimal adversary. No existing adversarial attacks follow such a theoretically guided framework. We show our algorithm in Algorithm 1. Instead of learning to produce $\hat{s}\in\mathcal{B}(s)$ directly, since $\mathcal{B}(s)$ is usually a small set near $s$ (e.g., $\mathcal{B}(s)=\{s^{\prime} : \|s-s^{\prime}\|_{p}\leq\epsilon\}$), our adversary learns a perturbation vector $\Delta$, and we project $s+\Delta$ onto $\mathcal{B}(s)$.

The first advantage of attacking a policy in this way is that it is strong: we optimize the adversary in an online loop of interactions with the agent policy and the environment, and keep improving the adversary with the goal of making the agent receive as little reward as possible. It is strong because it follows the theoretical framework of finding an optimal adversary, rather than using a heuristic to generate perturbations. Empirically, in the cases demonstrated in Figure 3, previous strong attacks (e.g., the Robust Sarsa attack) can make an agent fail, stop moving, and receive a small positive reward; our learned attack can trick the agent into moving in the opposite direction of the goal and receive a large negative reward. We also find that this attack can further reduce the reward of robustly trained agents, like SA-PPO (Zhang et al., 2020b).

The second advantage of this attack is that it requires no gradient access to the policy itself; in fact, it treats the agent as part of the environment and only needs to run it as a blackbox. Previous attacks (e.g., Lin et al. (2017); Pattanaik et al. (2018); Xiao et al. (2019)) are mostly gradient based approaches and need access to the values or gradients of a policy or value function. Even without access to gradients, the overall learning process is still just an MDP, and we can apply any popular modern DRL method to learn the adversary.

Algorithm 1 Learning an "optimal" adversary for perturbations on state observations
0:  Policy $\pi(\cdot|s)$ under attack, number of iterations $N_{\text{iter}}$, batch size $B$, perturbation set $\mathcal{B}(s)$
1:  initialize adversary $\nu_{\phi}(\cdot|s)$ parameterized by a neural network with parameters $\phi$
2:  for $i=1$ to $N_{\text{iter}}$ do
3:     $\mathcal{D}\leftarrow\text{Adv\_Traj}(\nu_{\phi},\pi,B)$  # collection of samples (for simplicity we ignore episode boundaries here)
4:     $\phi\leftarrow\text{PolicyOptimizer}(\mathcal{D},\phi)$
5:  end for
Function $\text{Adv\_Traj}(\nu_{\phi},\pi,B)$:
6:  $s\leftarrow s_{0}$  # initial state
7:  $\mathcal{D}\leftarrow\emptyset$
8:  for $b=1$ to $B$ do
9:     $\Delta\leftarrow$ sample from $\nu_{\phi}(\cdot|s)$
10:     $\hat{s}\leftarrow\text{Proj}_{\mathcal{B}(s)}(s+\Delta)$  # the projection is a clipping for an $\ell_{\infty}$ norm set $\mathcal{B}(s)$
11:     $a\leftarrow$ sample from $\pi(\cdot|\hat{s})$
12:     obtain current step reward $r_{t}$ and next state $s^{\prime}$ from the environment given action $a$
13:     $\mathcal{D}\leftarrow\mathcal{D}\cup\{(s,\Delta,-r_{t},s^{\prime})\}$  # state, action, reward and next state for the adversary
14:     $s\leftarrow s^{\prime}$
15:  end for
16:  return $\mathcal{D}$
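Below is a hedged Python sketch of the Adv_Traj routine for a Gymnasium-style environment with an $\ell_\infty$ perturbation set, where the projection reduces to elementwise clipping; the policy interfaces and reset/step signatures are assumptions, and any off-the-shelf PPO implementation can play the role of PolicyOptimizer on the returned data.

```python
# Sketch of Adv_Traj (Algorithm 1) for a Gymnasium-style environment with an
# l_inf perturbation set B(s); the adversary/agent interfaces are illustrative
# placeholders, not the exact implementation in the ATLA repository.
import numpy as np

def adv_traj(adversary, agent_policy, env, batch_size, eps):
    """Collect transitions (s, Delta, -r, s') for training the adversary."""
    data = []
    s, _ = env.reset()
    for _ in range(batch_size):
        delta = adversary.sample(s)                 # Delta ~ nu_phi(.|s)
        s_hat = s + np.clip(delta, -eps, eps)       # project s + Delta onto B(s)
        a = agent_policy.sample(s_hat)              # the agent acts on the perturbed state
        s_next, r, terminated, truncated, _ = env.step(a)
        data.append((s, delta, -r, s_next))         # the adversary's reward is -r
        s = s_next
        if terminated or truncated:                 # episode boundaries ignored for simplicity
            s, _ = env.reset()
    return data
```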

3.2 Finding the optimal policy under a fixed adversary

We now investigate SA-MDP when we fix the adversary $\nu$ and find an optimal policy. In Lemma 2, we show that in this case the SA-MDP becomes a POMDP:

Lemma 2 (Optimal policy under fixed adversary)

Given an SA-MDP $M=(\mathcal{S},\mathcal{A},\mathcal{B},R,p,\gamma)$ and a fixed and stationary adversary $\nu(\hat{s}|s)$, there exists a POMDP $\bar{M}=(\mathcal{S},\mathcal{A},\Omega,O,R,p,\gamma)$ such that the optimal policy of $\bar{M}$ is the optimal policy $\pi$ for the SA-MDP given the fixed $\nu$, where

\Omega=\bigcup_{s\in\mathcal{S}}\{s^{\prime}\,|\,s^{\prime}\in\text{supp}(\nu(\cdot|s))\},\qquad O(\hat{s}|s)=\nu(\hat{s}|s) \quad (1)

where $\Omega$ is the set of observations, and $O$ defines the conditional observation probabilities (in our case the observation is conditioned only on $s$ and does not depend on actions). To prove Lemma 2, we construct a POMDP whose observations are defined on the support of all $\nu(\cdot|s),s\in\mathcal{S}$, and whose observation process is exactly the process of generating an adversarially perturbed state $\hat{s}$. This POMDP is functionally identical to the original SA-MDP when $\nu$ is fixed. This lemma unveils the connection between POMDPs and SA-MDPs: an SA-MDP can be seen as a version of "robust" POMDP where the policy needs to be robust under a set of observation processes (adversaries). SA-MDP is different from the robust POMDP (RPOMDP) (Osogami, 2015; Rasouli & Saghafian, 2018), which optimizes for the worst case environment transitions.

As a proof of concept, we use a modern POMDP solver, SARSOP (Kurniawati et al., 2008), to solve the GridWorld environment in Figure 1 and find a policy that can defeat the adversary. The POMDP solver produces a finite state controller (FSC) with 8 states (an FSC is an efficient representation of history-dependent policies). This FSC policy can almost eliminate the impact of the adversary and receives a close to perfect reward, as shown in Figure 1(c).

Unfortunately, unlike MDPs, it is challenging to solve for an optimal policy in POMDPs; state-of-the-art solvers (Bai et al., 2014; Sunberg & Kochenderfer, 2017) can only work on relatively simple environments which are much smaller than those used in modern DRL. Thus, we do not aim to solve for the optimal policy. We follow (Wierstra et al., 2007) and use the recurrent policy gradient theorem on POMDPs, with LSTMs as function approximators for the value and policy networks. We denote by $h_{t}=\{\hat{s}_{0},a_{0},\hat{s}_{1},a_{1},\cdots,\hat{s}_{t}\}$ the history of states (perturbed states $\hat{s}$ in our setting) and actions. The policy $\pi$ parameterized by $\theta$ takes an action $a_{t}$ given the observed history $h_{t}$, and $h_{t}$ is typically encoded by a recurrent neural network (e.g., an LSTM). The recurrent policy gradient theorem (Wierstra et al., 2007) shows that

\nabla_{\theta}J\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}^{n}|h_{t}^{n})\,r^{n}_{t} \quad (2)

where $N$ is the number of sampled episodes, $T$ is the episode length (for notational simplicity, we assume each episode has the same length), $h_{t}^{n}$ is the history of states for episode $n$ up to time $t$, and $r_{t}^{n}$ is the reward received in episode $n$ at time $t$. We can then extend Eq. 2 to modern DRL algorithms such as proximal policy optimization (PPO), similarly as done in (Azizzadenesheli et al., 2018), by using the following loss function:

J(\theta)\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\left[\min\left(\frac{\pi_{\theta}(a_{t}^{n}|h_{t}^{n})}{\pi_{\theta_{\text{old}}}(a_{t}^{n}|h_{t}^{n})}A_{h_{t}^{n}},\ \text{clip}\left(\frac{\pi_{\theta}(a_{t}^{n}|h_{t}^{n})}{\pi_{\theta_{\text{old}}}(a_{t}^{n}|h_{t}^{n})},1-\epsilon,1+\epsilon\right)A_{h_{t}^{n}}\right)\right] \quad (3)

where $A_{h_{t}^{n}}$ is a baseline advantage function for episode $n$ at time step $t$, computed using an LSTM value function, and $\epsilon$ is the clipping threshold in PPO. The loss can be optimized via a gradient based optimizer, and $\theta_{\text{old}}$ is the old policy parameter before the optimization iterations start. Although LSTM or recurrent policy networks have been used in the DRL setting in a few other works (Hausknecht & Stone, 2015; Azizzadenesheli et al., 2018), our focus is to improve agent robustness rather than learning a policy purely for POMDPs. In our empirical evaluation, we will compare feedforward and LSTM policies under our ATLA framework.
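As a concrete sketch of Eq. 3, the snippet below computes the clipped surrogate objective for an LSTM policy over a batch of episode histories; the diagonal-Gaussian action head, tensor shapes, and hyperparameters are our own illustrative assumptions rather than the exact ATLA implementation.

```python
# Sketch of the recurrent PPO objective in Eq. 3 with an LSTM policy over
# histories; the Gaussian action head and tensor shapes are illustrative
# assumptions, not the exact ATLA implementation.
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs_seq, act_seq):
        # obs_seq: (N, T, obs_dim); act_seq: (N, T, act_dim)
        h, _ = self.lstm(obs_seq)                 # encodes the history h_t
        dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        return dist.log_prob(act_seq).sum(-1)     # (N, T)

def recurrent_ppo_loss(policy, obs_seq, act_seq, old_logp, adv, clip_eps=0.2):
    """Clipped surrogate of Eq. 3; adv is A_{h_t^n}, old_logp from pi_theta_old."""
    logp = policy.log_prob(obs_seq, act_seq)
    ratio = torch.exp(logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()   # negate: we minimize

N, T, obs_dim, act_dim = 4, 100, 11, 3
policy = LSTMPolicy(obs_dim, act_dim)
obs, act = torch.randn(N, T, obs_dim), torch.randn(N, T, act_dim)
with torch.no_grad():
    old_logp = policy.log_prob(obs, act)
loss = recurrent_ppo_loss(policy, obs, act, old_logp, torch.randn(N, T))
loss.backward()
```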

3.3 Alternating Training with Learned Adversaries (ATLA)

As discussed in Section 3.1, we can solve for an optimal adversary given any fixed policy. In our ATLA framework, we train such an adversary online with the agent: we first keep the agent fixed and optimize the adversary, which is also parameterized as a neural network; then we keep the adversary fixed and optimize the agent. Both the adversary and the agent can be updated using a policy gradient algorithm such as PPO. We show the full procedure in Algorithm 2.

Algorithm 2 Alternating Training with Learned Adversaries (ATLA)
0:  Environment $\mathcal{E}$, number of iterations $N_{\text{iter}}$, and batch size $B$.
1:  Initialize the agent's actor network $\pi(a|\hat{s})$ with parameters $\theta$.
2:  Initialize the adversary's actor network $\nu(\hat{s}|s)$ with parameters $\phi$.
3:  for $i=1$ to $N_{\text{iter}}$ do
4:     for $j=1$ to $N_{\pi}$ do
5:        Run $\pi_{\theta}$ with fixed $\nu_{\phi}$ to collect a set of trajectories $\mathcal{D}_{\pi}:=\{(\hat{s}_{t}^{k,j},a_{t}^{k,j},r_{t}^{k,j},\hat{s}_{t+1}^{k,j})\}\big\rvert_{k=1}^{B}$.
6:        $\theta\leftarrow\text{PolicyOptimizer}(\mathcal{D}_{\pi},\theta)$
7:     end for
8:     for $j=1$ to $N_{\nu}$ do
9:        $\mathcal{D}_{\nu}\leftarrow\text{Adv\_Traj}(\nu_{\phi},\pi_{\theta},B)$  # Adv_Traj defined in Algorithm 1
10:        $\phi\leftarrow\text{PolicyOptimizer}(\mathcal{D}_{\nu},\phi)$
11:     end for
12:  end for
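A compact Python sketch of this alternation for a Gymnasium-style environment is given below; `ppo_update` and `adv_traj` are caller-supplied placeholders standing in for a full PPO implementation (the agent and adversary each keep their own value network), and the default hyperparameters are illustrative only.

```python
# Sketch of the ATLA alternation (Algorithm 2); ppo_update and adv_traj are
# caller-supplied placeholders for a full PPO implementation, and the defaults
# are illustrative, not the exact values used in our experiments.
import numpy as np

def collect_agent_traj(env, agent, adversary, batch_size, eps):
    """Roll out the agent against the fixed adversary; store perturbed states s_hat."""
    data = []
    s, _ = env.reset()
    s_hat = s + np.clip(adversary.sample(s), -eps, eps)
    for _ in range(batch_size):
        a = agent.sample(s_hat)
        s_next, r, terminated, truncated, _ = env.step(a)
        if terminated or truncated:                  # episode boundaries ignored for simplicity
            s_next, _ = env.reset()
        s_hat_next = s_next + np.clip(adversary.sample(s_next), -eps, eps)
        data.append((s_hat, a, r, s_hat_next))       # the agent is trained on s_hat, not s
        s_hat = s_hat_next
    return data

def atla_train(env, agent, adversary, ppo_update, adv_traj,
               n_iter, n_pi=1, n_nu=1, batch_size=2048, eps=0.075):
    for _ in range(n_iter):
        for _ in range(n_pi):    # keep nu_phi fixed, update the agent pi_theta
            ppo_update(agent, collect_agent_traj(env, agent, adversary, batch_size, eps))
        for _ in range(n_nu):    # keep pi_theta fixed, update the adversary nu_phi
            ppo_update(adversary, adv_traj(adversary, agent, env, batch_size, eps))
    return agent, adversary
```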

Our algorithm is designed to use a strong, learned adversary that tries to find the intrinsic weakness of the policy; to obtain a good reward, the policy must learn to defeat such an adversary. In other words, it attempts to solve the SA-MDP problem directly rather than relying on explicit regularization of the function approximator like the approach in (Zhang et al., 2020b). In our empirical evaluation, we show that such regularization can be unhelpful in some environments and can harm performance when the agent is evaluated without attacks.

The difference between our approach and previous adversarial training approaches such as (Pattanaik et al., 2018) is that we use a stronger adversary, learned online together with the agent. Our empirical evaluation finds that using such a learned "optimal" adversary at training time allows the agent to learn a robust policy that generalizes to different types of strong adversarial attacks at test time. Additionally, it is important to distinguish between the original state $s$ and the perturbed state $\hat{s}$. We find that using $s$ instead of $\hat{s}$ to train the advantage function and policy of the agent leads to worse performance, as it does not follow the theoretical framework of SA-MDP.

4 Experiments

"Optimal" attack on DRL agents. (Code for the optimal attack and ATLA is available at https://github.com/huanzhang12/ATLA_robust_RL.)

In Section 3.1 we showed that it is possible to cast the problem of finding an optimal adversary as an MDP. In practice, the environment dynamics are unknown, but model-free RL methods can be used to approximately find this optimal adversary. In this section, we use PPO to train an adversary on four OpenAI Gym MuJoCo continuous control environments. Table 1 presents results on attacking vanilla PPO and robustly trained SA-PPO (Zhang et al., 2020b) agents. As a comparison, we also report the attack reward of five other baseline attacks: the critic attack is based on (Pattanaik et al., 2018); the random attack adds uniform random noise to state observations; the MAD (maximal action difference) attack (Zhang et al., 2020b) maximizes the difference in actions under perturbed states; the RS (Robust Sarsa) attack is based on training robust action-value functions and is the strongest attack proposed in (Zhang et al., 2020b). Additionally, we include the black-box Snooping attack (Inkawhich et al., 2019). For all attacks we consider $\mathcal{B}(s)$ as an $\ell_{\infty}$ norm ball around $s$ with radius $\epsilon$, set similarly as in (Zhang et al., 2020b). During testing, we run the agents without attacks as well as under attacks for 50 episodes and report the mean and standard deviation of episode rewards. In Table 1, our "optimal" attack achieves noticeably lower rewards than all five other attacks. We illustrate a few examples of attacks in Figure 3. For the RS and "optimal" attacks, we report the best (lowest) attack reward obtained over different hyperparameters.

Table 1: Average episode rewards ± standard deviation over 50 episodes on PPO and SA-PPO agents. We report natural rewards (no attacks) and rewards under six adversarial attacks, including a simple random noise attack, the critic based attack in Pattanaik et al. (2018), the MAD and RS attacks in Zhang et al. (2020b), the Snooping attack proposed in Inkawhich et al. (2019), and the optimal attack proposed in this paper. In each row we bold the best (lowest) attack reward over all attacks. The "optimal" attack is stronger than all other attacks in all environments, sometimes by a large margin.
Env. ($\ell_\infty$ budget $\epsilon$) | Method | Natural Reward | Critic | Random | MAD | Snooping | RS | "Optimal"
Hopper ($\epsilon=0.075$) | PPO | 3167±521 | 1464±523 | 2101±793 | 1410±655 | 2234±1103 | 794±238 | 636±9
Hopper ($\epsilon=0.075$) | SA-PPO | 3705±2 | 3789±15 | 2710±801 | 2652±835 | 2509±838 | 1130±42 | 1076±791
Walker2d ($\epsilon=0.05$) | PPO | 4472±635 | 3424±1295 | 3007±1200 | 2869±1271 | 2786±962 | 1336±654 | 1086±516
Walker2d ($\epsilon=0.05$) | SA-PPO | 4487±61 | 4875±30 | 4867±39 | 3668±1789 | 3928±1661 | 3808±138 | 2908±1136
Ant ($\epsilon=0.15$) | PPO | 5687±758 | 4934±1022 | 5261±1005 | 1759±828 | 3668±547 | 268±227 | -872±436
Ant ($\epsilon=0.15$) | SA-PPO | 4292±384 | 4805±128 | 4986±452 | 4662±522 | 4079±768 | 3412±1755 | 2511±1117
HalfCheetah ($\epsilon=0.15$) | PPO | 7117±98 | 5761±119 | 5486±1378 | 1836±866 | 1637±843 | 489±758 | -660±219
HalfCheetah ($\epsilon=0.15$) | SA-PPO | 3632±20 | 3589±21 | 3619±18 | 3624±23 | 3616±21 | 3283±20 | 3028±23

Evaluation of ATLA

In this experiment, we study the effectiveness of our proposed ATLA method. Specifically, we use PPO as our policy optimizer. For policy networks, we consider two different structures: the original fully connected (MLP) structure, and an LSTM structure which takes historical observations. The LSTMs are trained using backpropagation through time for up to 100 steps. In Table 2 we include the following methods for comparison:

  • PPO (vanilla) and PPO (LSTM): PPO with a feedforward NN or LSTM as the policy network.

  • SA-PPO (Zhang et al., 2020b): the state-of-the-art approach for improving the robustness of DRL in continuous control environments, using a smooth policy regularization on feedforward NNs solved by convex relaxations.

  • Adversarial training using the critic attack (Pattanaik et al., 2018): a previous work that uses a critic based attack to generate adversarial observations at training time and trains a feedforward NN based agent with this relatively weak adversary.

  • ATLA-PPO (MLP) and ATLA-PPO (LSTM): Our proposed method trained with a feedforward NN (MLP) or LSTM as the policy network. The agent and adversary are trained using PPO with independent value and policy networks. For simplicity, we set $N_{\pi}=N_{\nu}=1$ in all settings.

  • ATLA-PPO (LSTM) +SA Reg: Based on ATLA-PPO (LSTM), but with an extra adversarial smoothness regularization similar to that in SA-PPO. We use 2-step stochastic gradient Langevin dynamics (SGLD) to solve the inner maximization of this regularizer, as convex relaxations of LSTMs are expensive (see the sketch after this list).
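The sketch below shows one hedged way to implement such a 2-step SGLD inner maximizer over an $\ell_\infty$ ball; `loss_fn` (the smoothness divergence as a function of the perturbed state), the step size, and the noise scale are illustrative assumptions.

```python
# Sketch of a 2-step SGLD solver for the inner maximization of the smoothness
# regularizer over an l_inf ball; loss_fn, step size, and noise scale are
# illustrative assumptions, not the exact values used in our experiments.
import torch

def sgld_perturbation(loss_fn, s, eps, steps=2, lr=0.05, noise=0.01):
    """Approximately maximize the scalar loss_fn(s + delta) over ||delta||_inf <= eps."""
    delta = (torch.rand_like(s) * 2 - 1) * eps        # random start inside the ball
    delta.requires_grad_(True)
    for _ in range(steps):
        grad, = torch.autograd.grad(loss_fn(s + delta), delta)
        with torch.no_grad():
            delta += lr * grad.sign() + noise * torch.randn_like(delta)  # ascent + Langevin noise
            delta.clamp_(-eps, eps)                   # project back onto B(s)
    return (s + delta).detach()
```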

For each agent, we report its "natural reward" (episode reward without attacks) and its best attack reward in Table 2. To comprehensively evaluate the robustness of agents, the best attack reward is the lowest episode reward achieved over all six types of attacks in Table 1, including our new "optimal" attack (these attacks include hundreds of independent adversaries for attacking a single agent; see Appendix A.1 for more details). For reproducibility, for each setup we train 21 agents, attack all of them, and report the one with median robustness. We include detailed hyperparameters in Appendix A.5.

Table 2: Average episode rewards ± standard deviation over 50 episodes on ATLA agents and baselines. We report natural rewards (no attacks) and the best (lowest) attack rewards among six types of adversarial attacks, including a simple random noise attack, the critic based attack in (Pattanaik et al., 2018), the MAD and RS attacks in Zhang et al. (2020b), the Snooping attack proposed in Inkawhich et al. (2019), and the optimal attack proposed in this paper. For each environment, we bold the most robust agent. Since both the RS attack and our "optimal" attack are parameterized attacks, the "best attack" column represents the worst case agent performance under hundreds of adversaries. See Appendix A.1 for more details.
Env. (state dim., $\ell_\infty$ budget $\epsilon$) | Method | Natural Reward | Best Attack
Hopper (11, $\epsilon=0.075$) | PPO (vanilla) | 3167±542 | 636±9
Hopper (11, $\epsilon=0.075$) | SA-PPO (Zhang et al., 2020b) | 3705±2 | 1076±791
Hopper (11, $\epsilon=0.075$) | Pattanaik et al. (2018) | 2755±582 | 291±7
Hopper (11, $\epsilon=0.075$) | ATLA-PPO (MLP) | 2559±958 | 976±40
Hopper (11, $\epsilon=0.075$) | PPO (LSTM) | 3060±639.3 | 784±48
Hopper (11, $\epsilon=0.075$) | ATLA-PPO (LSTM) | 3487±452 | 1224±191
Hopper (11, $\epsilon=0.075$) | ATLA-PPO (LSTM) +SA Reg | 3291±600 | 1772±802
Walker2d (17, $\epsilon=0.05$) | PPO (vanilla) | 4472±635 | 1086±516
Walker2d (17, $\epsilon=0.05$) | SA-PPO (Zhang et al., 2020b) | 4487±61 | 2908±1136
Walker2d (17, $\epsilon=0.05$) | Pattanaik et al. (2018) | 4058±1410 | 733±1012
Walker2d (17, $\epsilon=0.05$) | ATLA-PPO (MLP) | 3138±1061 | 2213±915
Walker2d (17, $\epsilon=0.05$) | PPO (LSTM) | 2785±1121 | 1259±937
Walker2d (17, $\epsilon=0.05$) | ATLA-PPO (LSTM) | 3920±129 | 3219±1132
Walker2d (17, $\epsilon=0.05$) | ATLA-PPO (LSTM) +SA Reg | 3842±475 | 3239±894
Ant (111, $\epsilon=0.15$) | PPO (vanilla) | 5687±758 | -872±436
Ant (111, $\epsilon=0.15$) | SA-PPO (Zhang et al., 2020b) | 4292±384 | 2511±1117
Ant (111, $\epsilon=0.15$) | Pattanaik et al. (2018) | 3469±1139 | -672±100
Ant (111, $\epsilon=0.15$) | ATLA-PPO (MLP) | 4894±123 | 33±327
Ant (111, $\epsilon=0.15$) | PPO (LSTM) | 5696±165 | -513±104
Ant (111, $\epsilon=0.15$) | ATLA-PPO (LSTM) | 5612±130 | 716±256
Ant (111, $\epsilon=0.15$) | ATLA-PPO (LSTM) +SA Reg | 5359±153 | 3765±101
HalfCheetah (17, $\epsilon=0.15$) | PPO (vanilla) | 7117±98 | -660±218
HalfCheetah (17, $\epsilon=0.15$) | SA-PPO (Zhang et al., 2020b) | 3632±20 | 3028±23
HalfCheetah (17, $\epsilon=0.15$) | Pattanaik et al. (2018) | 5241±1162 | 447±192
HalfCheetah (17, $\epsilon=0.15$) | ATLA-PPO (MLP) | 5417±49 | 2170±2097
HalfCheetah (17, $\epsilon=0.15$) | PPO (LSTM) | 5609±98 | -886±30
HalfCheetah (17, $\epsilon=0.15$) | ATLA-PPO (LSTM) | 5766±109 | 2485±1488
HalfCheetah (17, $\epsilon=0.15$) | ATLA-PPO (LSTM) +SA Reg | 6157±852 | 4806±603

In Table 2 we can see that vanilla PPO agents with MLP or LSTM policies are not robust. For feedforward (MLP) agent policies, critic based adversarial training (Pattanaik et al., 2018) is not very effective under our suite of strong adversaries and is sometimes only slightly better than vanilla PPO. ATLA-PPO (MLP) outperforms SA-PPO on Hopper and Walker2d and is also competitive on HalfCheetah; for high dimensional environments like Ant, the robust function approximator regularization in SA-PPO is more effective. For LSTM agent policies, compared to vanilla PPO (LSTM) agents, ATLA-PPO (LSTM) can significantly improve agent robustness; an LSTM agent trained without a robust training procedure like ATLA does not become more robust by itself. We find that LSTM agents tend to be more robust than their MLP counterparts, validating our findings in Section 3.2. ATLA-PPO (LSTM) is better than SA-PPO on Hopper and Walker2d. In all settings, especially for high dimensional environments like Ant, our ATLA approach that also includes state-adversarial regularization (ATLA-PPO +SA Reg) outperforms all other baselines, as this combination improves both the intrinsic robustness of the policy and the robustness of the function approximator.

Figure 4: The performance under the strongest attack for SA-PPO Hopper agents with different regularization strengths $\kappa$. Even if we increase the regularization, SA-PPO cannot outperform our ATLA agents.

A robust function approximator can be insufficient

For some environments, the SA-PPO method has its limitations: even when using an increasingly larger regularization parameter $\kappa$ (which controls how robust the function approximator needs to be), we still cannot reach the same performance as our ATLA agent (Figure 4). Additionally, when a large regularization is used, agent performance becomes much worse. In Figure 4, under the largest $\kappa=1.0$, the natural reward ($1436\pm96$) is much lower than that of the other agents reported in Table 2.

5 Conclusion

In this paper, we first propose the optimal adversarial attack on the state observations of RL agents, which is significantly stronger than many existing adversarial attacks. We then present the alternating training with learned adversaries (ATLA) framework, which trains an agent together with a learned optimal adversary to effectively improve agent robustness under attacks. We also show that a history-dependent policy parameterized by an LSTM can be helpful for robustness. Our approach is orthogonal to existing regularization based techniques, and can be combined with state-adversarial regularization to achieve state-of-the-art robustness under strong adversarial attacks.

References

  • Astrom (1965) Karl J Astrom. Optimal control of markov processes with incomplete state information. Journal of mathematical analysis and applications, 10(1):174–205, 1965.
  • Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Manish Kumar Bera, and Animashree Anandkumar. Trust region policy optimization for pomdps. arXiv preprint arXiv:1810.07900, 2018.
  • Bai et al. (2014) Haoyu Bai, David Hsu, and Wee Sun Lee. Integrated perception and planning in the continuous space: A pomdp approach. The International Journal of Robotics Research, 33(9):1288–1302, 2014.
  • Behzadan & Munir (2017a) Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp.  262–275. Springer, 2017a.
  • Behzadan & Munir (2017b) Vahid Behzadan and Arslan Munir. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344, 2017b.
  • Brooks (1992) Rodney A Brooks. Artificial life and real robots. In Proceedings of the First European Conference on artificial life, pp.  3–10, 1992.
  • Engstrom et al. (2020) Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case study on PPO and TRPO. arXiv preprint arXiv:2005.12729, 2020.
  • Fischer et al. (2019) Marc Fischer, Matthew Mirman, and Martin Vechev. Online robustness training for deep reinforcement learning. arXiv preprint arXiv:1911.00887, 2019.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • Gleave et al. (2019) Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
  • Gowal et al. (2018) Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
  • Guo et al. (2014) Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pp. 3338–3346, 2014.
  • Hausknecht & Stone (2015) Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. arXiv preprint arXiv:1507.06527, 2015.
  • Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  • Huang & Zhu (2019) Yunhan Huang and Quanyan Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. In International Conference on Decision and Game Theory for Security, pp.  217–237. Springer, 2019.
  • Inkawhich et al. (2019) Matthew Inkawhich, Yiran Chen, and Hai Li. Snooping attacks on deep reinforcement learning. arXiv preprint arXiv:1905.11832, 2019.
  • Iyengar (2005) Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • Jakobi et al. (1995) Nick Jakobi, Phil Husbands, and Inman Harvey. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life, pp.  704–720. Springer, 1995.
  • Kos & Song (2017) Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
  • Kurniawati et al. (2008) Hanna Kurniawati, David Hsu, and Wee Sun Lee. Sarsop: Efficient point-based pomdp planning by approximating optimally reachable belief spaces. In Robotics: Science and systems, volume 2008. Zurich, Switzerland., 2008.
  • Levine et al. (2015) Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Li et al. (2019) Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  4213–4220, 2019.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lin et al. (2017) Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
  • Littman (1994) Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp.  157–163. Elsevier, 1994.
  • Ma et al. (2019) Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, pp. 14570–14580, 2019.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
  • Mandlekar et al. (2017) Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  3932–3939. IEEE, 2017.
  • Mankowitz et al. (2018) Daniel J Mankowitz, Timothy A Mann, Pierre-Luc Bacon, Doina Precup, and Shie Mannor. Learning robust options. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Mankowitz et al. (2019) Daniel J Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, and Martin Riedmiller. Robust reinforcement learning for continuous control with model misspecification. arXiv preprint arXiv:1906.07516, 2019.
  • Mirman et al. (2018a) Matthew Mirman, Marc Fischer, and Martin Vechev. Distilled agent DQN for provable adversarial robustness, 2018a. URL https://openreview.net/forum?id=ryeAy3AqYm.
  • Mirman et al. (2018b) Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575–3583, 2018b.
  • Miyato et al. (2015) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Muratore et al. (2019) Fabio Muratore, Michael Gienger, and Jan Peters. Assessing transferability from simulation to reality for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Nilim & El Ghaoui (2004) Arnab Nilim and Laurent El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems, pp. 839–846, 2004.
  • Osogami (2015) Takayuki Osogami. Robust partially observable Markov decision process. In International Conference on Machine Learning, pp. 106–115, 2015.
  • Pattanaik et al. (2018) Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp.  2040–2042. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  2817–2826. JMLR. org, 2017.
  • Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Qu et al. (2020) Xinghua Qu, Yew-Soon Ong, Abhishek Gupta, and Zhu Sun. Defending adversarial attacks without adversarial attacks in deep reinforcement learning. arXiv preprint arXiv:2008.06199, 2020.
  • Rakhsha et al. (2020) Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. arXiv preprint arXiv:2003.12909, 2020.
  • Rasouli & Saghafian (2018) Mohammad Rasouli and Soroush Saghafian. Robust partially observable markov decision processes. No. RWP18-027, 2018.
  • Sallab et al. (2017) Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
  • Salman et al. (2019) Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks. In Advances in Neural Information Processing Systems 32, pp. 9832–9842. Curran Associates, Inc., 2019.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • Sunberg & Kochenderfer (2017) Zachary Sunberg and Mykel Kochenderfer. Online algorithms for pomdps with continuous state, action, and observation spaces. arXiv preprint arXiv:1709.06196, 2017.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2013.
  • Tan et al. (2020) Kai Liang Tan, Yasaman Esfandiari, Xian Yeow Lee, Soumik Sarkar, et al. Robustifying reinforcement learning agents via action space adversarial training. In 2020 American Control Conference (ACC), pp.  3959–3964. IEEE, 2020.
  • Tessler et al. (2019) Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. arXiv preprint arXiv:1901.09184, 2019.
  • Voyage (2019) Voyage. Introducing voyage deepdrive unlocking the potential of deep reinforcement learning. https://news.voyage.auto/introducing-voyage-deepdrive-69b3cf0f0be6, 2019.
  • Wierstra et al. (2007) Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In International Conference on Artificial Neural Networks, pp.  697–706. Springer, 2007.
  • Wong & Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292, 2018.
  • Xiao et al. (2019) Chaowei Xiao, Xinlei Pan, Warren He, Jian Peng, Mingjie Sun, Jinfeng Yi, Bo Li, and Dawn Song. Characterizing attacks on deep reinforcement learning. arXiv preprint arXiv:1907.09470, 2019.
  • Xu et al. (2020) Kaidi Xu, Zhouxing Shi, Huan Zhang, Minlie Huang, Kai-Wei Chang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. Automatic perturbation analysis on general computational graphs. arXiv preprint arXiv:2002.12920, 2020.
  • Zhang & Parkes (2008) Haoqi Zhang and David C Parkes. Value-based policy teaching with active indirect elicitation. In AAAI, volume 8, pp.  208–214, 2008.
  • Zhang et al. (2009) Haoqi Zhang, David C Parkes, and Yiling Chen. Policy teaching through reward function learning. In Proceedings of the 10th ACM conference on Electronic commerce, pp.  295–304, 2009.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
  • Zhang et al. (2018) Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In NIPS, 2018.
  • Zhang et al. (2020a) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Duane Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. ICLR, 2020a.
  • Zhang et al. (2020b) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on observations. In Advances in Neural Information Processing Systems, 2020b.

Appendix A Appendix

A.1 Full results of all environments under different types of attacks

In Table 2, we only include the best attack rewards (lowest rewards over all attacks). In Table 3 we list the rewards under each specific attack. Note that the Robust Sarsa (RS) attack and our "optimal" policy attack both have hyperparameters. For the RS attack, we use the same set of 30 different hyperparameter settings as in (Zhang et al., 2020b) to train a robust value function to attack the network. The reported RS attack result for each agent is the strongest one over the 30 trained value functions. For the Snooping based attack, we use the "imitator" attack proxy, as it was the strongest one reported in (Inkawhich et al., 2019), and we attack every step of the agent. The imitator is an MLP or LSTM network matching the agent's policy network. We use the same KL divergence loss as in the MAD attack for this Snooping attack. We first collect state-action pairs for 100 episodes to train the "imitators", whose network structures are the same as those of the corresponding agents. At test time, we first run the MAD attack on the "imitator" and then input the generated perturbed observation to the agent in a transfer attack fashion. For our "optimal" policy attack, the hyperparameters are the PPO training parameters for the adversary (including the learning rate of the adversary policy network, the learning rate of the adversary value network, the entropy regularization parameter, and the ratio clip $\epsilon$ for PPO). We use a grid search over these hyperparameters to train an adversary that is as strong as possible, resulting in 100 to 200 adversaries produced for each agent. The reported optimal attack reward is the lowest reward among all trained adversaries. Under this comprehensive adversarial evaluation, each agent is tested using hundreds of adversaries, and the strongest adversary determines the true robustness of an agent.
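This sweep can be organized as in the sketch below, where `train_adversary` and `evaluate_reward` are hypothetical helpers and the grid values are illustrative placeholders rather than the exact ranges used in our experiments.

```python
# Sketch of the adversary hyperparameter sweep; train_adversary and
# evaluate_reward are hypothetical helpers, and the grid values are
# illustrative rather than the exact ranges used in the experiments.
from itertools import product

def best_optimal_attack(env, agent, eps, train_adversary, evaluate_reward):
    grid = {
        "adv_policy_lr": [3e-4, 1e-3, 3e-3],
        "adv_value_lr": [3e-4, 1e-3, 3e-3],
        "entropy_coef": [0.0, 1e-3, 1e-2],
        "clip_eps": [0.1, 0.2],
    }
    best_reward, best_adv = float("inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        adversary = train_adversary(env, agent, eps, **cfg)
        reward = evaluate_reward(env, agent, adversary, episodes=50)  # mean over 50 episodes
        if reward < best_reward:      # the strongest adversary gives the lowest agent reward
            best_reward, best_adv = reward, adversary
    return best_adv, best_reward
```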

Table 3: Average episode rewards ± standard deviation over 50 episodes for the baseline methods (including SA-PPO (Zhang et al., 2020b)) and our ATLA methods. We report natural episode rewards (no attack) and episode rewards under six adversarial attacks: a simple random-noise attack, the critic-based attack of Pattanaik et al. (2018), the MAD and RS attacks of Zhang et al. (2020b), the Snooping attack of Inkawhich et al. (2019), and the “optimal” attack proposed in this paper. The “Best Attack” column gives the lowest (strongest) attack reward over all six attacks; ATLA-PPO (LSTM) + SA Reg, the most robust method, attains the highest best-attack reward in every environment. The ℓ∞ perturbation budgets are ε = 0.075 (Hopper), 0.05 (Walker2d), 0.15 (Ant), and 0.15 (HalfCheetah).
| Env. | Method | Natural Reward | Critic | Random | MAD | Snooping | RS | “Optimal” | Best Attack |
|---|---|---|---|---|---|---|---|---|---|
| Hopper | PPO (vanilla) | 3167±542 | 1464±523 | 2101±793 | 1410±655 | 2234±1103 | 794±238 | 636±9 | 636 |
| Hopper | SA-PPO (Zhang et al., 2020b) | 3705±2 | 3789±15 | 2710±801 | 2652±835 | 2509±838 | 1130±42 | 1076±791 | 1076 |
| Hopper | Pattanaik et al. (2018) | 2755±582 | 2681±555 | 2265±502 | 1395±337 | 1349±436 | 1219±174 | 291±7 | 291 |
| Hopper | ATLA-PPO (MLP) | 2559±958 | 3497±556 | 2153±882 | 1679±676 | 1769±562 | 2329±870 | 976±40 | 976 |
| Hopper | PPO (LSTM) | 3060±639.3 | 2705±986 | 2410±786 | 2397±905 | 2234±1103 | 811±74 | 784±48 | 784 |
| Hopper | ATLA-PPO (LSTM) | 3487±452 | 3524±550 | 3474±401 | 3081±754 | 3130±692 | 1567±347 | 1224±191 | 1224 |
| Hopper | ATLA-PPO (LSTM) + SA Reg | 3291±600 | 2073±824 | 3165±576 | 2814±725 | 2857±724 | 2244±618 | 1772±802 | 1772 |
| Walker2d | PPO (vanilla) | 4472±635 | 3424±1295 | 3007±1200 | 2869±1271 | 2786±962 | 1336±654 | 1086±516 | 1086 |
| Walker2d | SA-PPO (Zhang et al., 2020b) | 4487±61 | 4875±30 | 4867±39 | 3668±1789 | 3928±1661 | 3808±138 | 2908±1136 | 2908 |
| Walker2d | Pattanaik et al. (2018) | 4058±1410 | 4058±1410 | 2840±2018 | 2927±1954 | 2568±2044 | 1713±1807 | 733±1012 | 733 |
| Walker2d | ATLA-PPO (MLP) | 3138±1061 | 3243±1004 | 3384±1056 | 2596±1005 | 2571±1084 | 3367±1020 | 2213±915 | 2213 |
| Walker2d | PPO (LSTM) | 2785±1121 | 2730±1082 | 2578±1007 | 2471±1109 | 2286±1156 | 1259±937 | 1523±869 | 1259 |
| Walker2d | ATLA-PPO (LSTM) | 3920±129 | 3915±274 | 3779±541 | 3963±36 | 3716±666 | 3219±1132 | 3463±1016 | 3219 |
| Walker2d | ATLA-PPO (LSTM) + SA Reg | 3842±475 | 3884±132 | 3927±368 | 3836±492 | 3742±629 | 3239±894 | 3663±707 | 3239 |
| Ant | PPO (vanilla) | 5687±758 | 4934±1022 | 5261±1005 | 1759±828 | 3668±547 | 268±227 | -872±436 | -872 |
| Ant | SA-PPO (Zhang et al., 2020b) | 4292±384 | 4805±128 | 4986±452 | 4662±522 | 4079±768 | 3412±1755 | 2511±1117 | 2511 |
| Ant | Pattanaik et al. (2018) | 3469±1139 | 3469±1139 | 2346±459 | 1427±625 | 1336±644 | 1289±777 | -672±100 | -672 |
| Ant | ATLA-PPO (MLP) | 4894±123 | 4427±104 | 4541±691 | 1891±885 | 2862±1137 | 842±143 | 33±327 | 33 |
| Ant | PPO (LSTM) | 5696±165 | 5519±114 | 5475±691 | 3800±363 | 3723±1168 | 1069±382 | -513±104 | -513 |
| Ant | ATLA-PPO (LSTM) | 5612±130 | 5196±134 | 5390±704 | 3903±217 | 4455±677 | 1096±329 | 716±256 | 716 |
| Ant | ATLA-PPO (LSTM) + SA Reg | 5359±153 | 5295±165 | 5366±104 | 5240±170 | 5135±413 | 4136±149 | 3765±101 | 3765 |
| HalfCheetah | PPO (vanilla) | 7117±98 | 5761±119 | 5486±1378 | 1836±866 | 1637±843 | 489±758 | -660±218 | -660 |
| HalfCheetah | SA-PPO (Zhang et al., 2020b) | 3632±20 | 3589±21 | 3619±18 | 3624±23 | 3616±21 | 3283±20 | 3028±23 | 3028 |
| HalfCheetah | Pattanaik et al. (2018) | 5241±1162 | 5440±676 | 2910±1694 | 1773±1248 | 1465±726 | 1602±1157 | 447±192 | 447 |
| HalfCheetah | ATLA-PPO (MLP) | 5417±49 | 5134±38 | 5388±34 | 4623±1146 | 4167±1507 | 2170±2097 | 2709±80 | 2170 |
| HalfCheetah | PPO (LSTM) | 5609±98 | 4294±112 | 5395±158 | 4768±106 | 4088±748 | 2899±2006 | -886±30 | -886 |
| HalfCheetah | ATLA-PPO (LSTM) | 5766±109 | 4008±1031 | 5685±107 | 4807±154 | 4906±182 | 3458±1338 | 2485±1488 | 2485 |
| HalfCheetah | ATLA-PPO (LSTM) + SA Reg | 6157±852 | 5991±209 | 6164±603 | 5790±174 | 5785±671 | 4806±603 | 5058±718 | 4806 |

A.2 Agent Performance during Training

Table 3 only reports agent performance at the end of training. In this subsection, we evaluate agent performance at 20%, 40%, 60%, 80%, and 100% of the total training epochs using the Robust Sarsa (RS) attack. The results are presented in Table 4. The overall trend is that the agents become stronger over time (the RS attack reward increases), with later checkpoints achieving better robustness.

Table 4: Natural and RS attack rewards of ATLA-PPO (LSTM) + SA Reg checkpoints during training. We report average rewards ± standard deviation over 50 episodes.

| Environment | Reward | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|
| Hopper | Natural Reward | 3440±11 | 1161±485 | 3013±584 | 3569±161 | 3291±600 |
| Hopper | RS Attack Reward | 716±82 | 631±51 | 1089±501 | 3181±634 | 2244±618 |
| Walker2d | Natural Reward | 989±254 | 3506±174 | 2203±988 | 3803±726 | 3842±475 |
| Walker2d | RS Attack Reward | 882±269 | 1744±347 | 739±531 | 2550±1020 | 3239±894 |
| Ant | Natural Reward | 2634±1222 | 4532±106 | 5007±143 | 5127±542 | 5393±139 |
| Ant | RS Attack Reward | 216±171 | 1903±93 | 3040±241 | 3040±241 | 4136±149 |
| HalfCheetah | Natural Reward | 4525±140 | 5567±138 | 5955±177 | 5956±181 | 6300±261 |
| HalfCheetah | RS Attack Reward | 3986±564 | 3986±564 | 4911±923 | 4571±1314 | 4806±603 |

A.3 Network structure

For fully connected networks, we use the same architecture as in (Zhang et al., 2020b): two hidden layers with 64 neurons each, for both policy and value networks, and for both the agent and the adversary. For LSTM agents, we use a single-layer LSTM with 64 hidden neurons, together with an input embedding layer that projects the state dimension to 64 and an output layer that projects the 64-dimensional hidden state to the output dimension. When conducting the “optimal” attack against LSTM agents, we also use an LSTM network for the adversary to ensure the adversary is sufficiently powerful.
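As a concrete reference, the following is a minimal PyTorch sketch of such an LSTM policy; the layer names, the tanh activation, and returning only the action output are illustrative assumptions rather than the exact released architecture.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Input embedding (state_dim -> 64), single-layer LSTM (64 hidden units),
    and an output projection (64 -> action_dim)."""
    def __init__(self, state_dim, action_dim, hidden_size=64):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, states, hidden=None):
        # states: (batch, time, state_dim); hidden carries the LSTM state across calls.
        x = torch.tanh(self.embed(states))
        x, hidden = self.lstm(x, hidden)
        return self.head(x), hidden
```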

A.4 Hyperparameter for the learning-based “optimal” attack

Our “optimal” attack uses policy gradient methods to learn the optimal adversary at agent test time, and each learning process involves the selection of hyperparameters. Specifically, the hyperparameters include the learning rates of the adversary’s policy and value networks, the entropy coefficient, and whether the learning rate is annealed. To reduce the search space, for ATLA agents the learning rates of the test-time adversary’s policy and value networks are chosen in the range of 0.3x to 3x of the learning rates used for the adversary’s policy and value networks during training. For agents trained without an adversary, these learning rates are chosen in the range of 0.3x to 3x of the learning rates of the agent’s policy and value networks. We test both a linearly annealed learning rate and a constant learning rate. The adversary’s entropy coefficient is chosen from {0, 0.003}. The final results reported in all tables are the best (lowest) rewards achieved by the “optimal” attack over all hyperparameter configurations; typically this involves around 100 to 200 adversaries trained with different hyperparameters. This guarantees the strength of the attack and allows a comprehensive evaluation of the robustness of all agents.
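A small sketch of how such a search grid could be enumerated is given below; the three scale points (0.3x, 1x, 3x) are illustrative only, and the actual grid is finer so that it yields the 100 to 200 configurations mentioned above.

```python
from itertools import product

def adversary_hparam_grid(ref_policy_lr, ref_value_lr, scales=(0.3, 1.0, 3.0)):
    """Enumerate adversary hyperparameter configurations for the "optimal" attack.
    ref_policy_lr / ref_value_lr are the reference learning rates (the training-time
    adversary's for ATLA agents, the agent's own otherwise)."""
    policy_lrs = [s * ref_policy_lr for s in scales]
    value_lrs = [s * ref_value_lr for s in scales]
    entropy_coeffs = [0.0, 0.003]
    anneal_lr = [True, False]      # linearly annealed vs. constant learning rate
    return [dict(policy_lr=p, value_lr=v, entropy_coeff=e, anneal_lr=a)
            for p, v, e, a in product(policy_lrs, value_lrs, entropy_coeffs, anneal_lr)]
```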

A.5 Hyperparameters for ATLA performance evaluation

Hyperparameters for PPO (vanilla)

For the Walker2d and Hopper environments, we use the same set of hyperparameters as in (Zhang et al., 2020b); these hyperparameters were originally from (Engstrom et al., 2020) and were found using a grid search, and we find that they work well. For the HalfCheetah and Ant environments, we run a grid search over the learning rate of the policy network, the learning rate of the value network, and the entropy bonus coefficient. For the Hopper, Walker2d, and HalfCheetah environments, we train for 2 million steps (2 million environment interactions); for Ant, we train for 10 million steps. Training for longer may slightly improve agent performance under no attack, but it has no impact on performance under strong adversarial attacks.

Hyperparameters for PPO (LSTM)

For PPO (LSTM), we conduct a smaller-scale hyperparameter search over values close to the optimal ones found for the vanilla PPO agents. We train the LSTM agents for the same number of steps as the vanilla PPO agents.

Hyperparameters for SA-PPO

We use the same values for all hyperparameters as in vanilla PPO, except for SA-PPO’s extra regularization strength κ. We choose κ from the range 1e-6 to 1. For each candidate κ, we train 21 agents and choose the κ value whose median agent has the highest worst-case reward under all attacks.
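The κ selection rule can be summarized by the short sketch below, where `train_fn` and `worst_case_reward` are hypothetical placeholders for training one SA-PPO agent and for evaluating its lowest reward over all attacks.

```python
import numpy as np

def select_kappa(train_fn, worst_case_reward, kappas, n_runs=21):
    """Pick the kappa whose median agent (out of 21 runs) has the best worst-case reward."""
    scores = {}
    for kappa in kappas:                                  # candidate values in [1e-6, 1]
        rewards = [worst_case_reward(train_fn(kappa)) for _ in range(n_runs)]
        scores[kappa] = float(np.median(rewards))         # robustness of the median agent
    return max(scores, key=scores.get)
```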

Hyperparameters for ATLA-PPO

For ATLA-PPO, we have hyperparameters for both the agent and the adversary. We keep all agent hyperparameters the same as in the vanilla MLP/LSTM agents, except for the entropy bonus coefficient: we find that ATLA sometimes needs a larger entropy bonus to allow sufficient exploration of the agent, as learning with an adversary is harder than learning in an attack-free environment. For the adversary, we run a small-scale hyperparameter search over the learning rates of the adversary policy and value networks and the adversary’s entropy bonus coefficient, using values close to those of the agent to reduce the number of hyperparameters searched. We set N_ν = N_π = 1 in all experiments and did not tune this hyperparameter. For ATLA, we train for 5 million steps on Hopper, Walker2d, and HalfCheetah and 10 million steps on Ant. Similar to the observations in (Madry et al., 2018), training with an adversary typically requires more steps to converge; however, in all our environments the training reliably converges.
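For reference, the alternating schedule with N_ν = N_π = 1 can be sketched as follows; `collect_rollouts` and `ppo_update` are hypothetical placeholders, and this is not the released training code.

```python
def atla_train(collect_rollouts, ppo_update, agent, adversary, n_iters, n_nu=1, n_pi=1):
    """collect_rollouts(agent, adversary) gathers trajectories in which the adversary
    perturbs the agent's observations; ppo_update(model, rollouts, sign) performs PPO
    updates using episode rewards multiplied by sign."""
    for _ in range(n_iters):
        # Fix the agent; train the adversary to minimize the agent's reward.
        for _ in range(n_nu):
            ppo_update(adversary, collect_rollouts(agent, adversary), sign=-1)
        # Fix the adversary; train the agent under the current adversary's perturbations.
        for _ in range(n_pi):
            ppo_update(agent, collect_rollouts(agent, adversary), sign=+1)
    return agent, adversary
```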

Agent selection

For each setup, we repeat the experiment with the same set of hyperparameters 21 times, due to the high performance variance in RL. We then attack all agents using the random, critic, MAD, and RS attacks and use the lowest reward among these attacks as the metric to rank the agents. We select the agent with median robustness as the final agent, which is then attacked using the “optimal” attack to further reduce its reward, as sketched below. The numbers reported in Table 2 are therefore not from the best runs but from the runs with median robustness; this improves reproducibility, as the RL training process can have high variance.
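A minimal sketch of this selection rule, assuming a hypothetical `worst_case_reward` helper that returns an agent’s lowest reward over the random, critic, MAD, and RS attacks:

```python
import numpy as np

def select_median_agent(agents, worst_case_reward):
    """Rank the 21 trained agents by worst-case reward and return the median one."""
    scores = [worst_case_reward(agent) for agent in agents]
    order = np.argsort(scores)
    return agents[order[len(order) // 2]]
```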

Table 5: Hyperparameters for all environments and settings. For the vanilla PPO models, we use the hyperparameters from Zhang et al. (2020b) and Engstrom et al. (2020) when available for the environment (Hopper and Walker2d); for the other environments, the vanilla PPO hyperparameters are found by grid search. SA-PPO and ATLA-PPO (MLP) use the same hyperparameters as the vanilla models, except that for SA-PPO we tune the regularization strength κ and for ATLA-PPO (MLP) we tune the entropy bonus coefficient as well as the learning rates of the adversary. For LSTM models, we first tune the vanilla LSTM PPO models to find the best learning rates, and keep these learning rates in all LSTM-based models. Entries marked “-” are not used for that setting.
| Env. | Model | Policy lr | Value lr | Entropy coeff. | κ | Adv. policy lr | Adv. value lr | Adv. entropy coeff. |
|---|---|---|---|---|---|---|---|---|
| Hopper | PPO (vanilla) | 3e-4 | 2.5e-4 | 0 | - | - | - | - |
| Hopper | SA-PPO | 3e-4 | 2.5e-4 | 0 | 0.03 | - | - | - |
| Hopper | ATLA-PPO (MLP) | 3e-4 | 2.5e-4 | 0.01 | - | 0.001 | 0.0001 | 0.001 |
| Hopper | PPO (LSTM) | 1e-3 | 3e-4 | 0.0 | - | - | - | - |
| Hopper | ATLA-PPO (LSTM) | 1e-3 | 3e-4 | 0.01 | - | 0.01 | 0.01 | 0.001 |
| Hopper | ATLA-PPO (LSTM) + SA Reg | 1e-3 | 3e-4 | 0.01 | 0.3 | 0.003 | 0.01 | 0.003 |
| Walker2d | PPO (vanilla) | 4e-4 | 3e-4 | 0 | - | - | - | - |
| Walker2d | SA-PPO | 4e-4 | 3e-4 | 0 | - | - | - | - |
| Walker2d | ATLA-PPO (MLP) | 4e-4 | 3e-4 | 0.0003 | - | 0.0001 | 0.0001 | 0.002 |
| Walker2d | PPO (LSTM) | 1e-3 | 3e-2 | 0 | - | - | - | - |
| Walker2d | ATLA-PPO (LSTM) | 1e-3 | 3e-2 | 0.001 | - | 0.0003 | 0.03 | 0 |
| Walker2d | ATLA-PPO (LSTM) + SA Reg | 1e-3 | 3e-2 | 0.001 | 0.3 | 0.003 | 0.03 | 0.001 |
| Ant | PPO (vanilla) | 5e-5 | 1e-5 | 0 | - | - | - | - |
| Ant | SA-PPO | 5e-5 | 1e-5 | 0 | 3e-3 | - | - | - |
| Ant | ATLA-PPO (MLP) | 5e-5 | 1e-5 | 3e-4 | - | 1e-05 | 3e-06 | 0 |
| Ant | PPO (LSTM) | 3e-4 | 3e-4 | 0 | - | - | - | - |
| Ant | ATLA-PPO (LSTM) | 3e-4 | 3e-4 | 0.0003 | - | 0.0003 | 0.0001 | 0.0003 |
| Ant | ATLA-PPO (LSTM) + SA Reg | 3e-4 | 3e-4 | 0.003 | 0.1 | 0.0003 | 3e-05 | 3e-05 |
| HalfCheetah | PPO (vanilla) | 3e-4 | 1e-4 | 0 | - | - | - | - |
| HalfCheetah | SA-PPO | 3e-4 | 1e-4 | 0 | 0.1 | - | - | - |
| HalfCheetah | ATLA-PPO (MLP) | 3e-4 | 1e-4 | 0.0003 | - | 0.001 | 0.0003 | 0.003 |
| HalfCheetah | PPO (LSTM) | 1e-3 | 3e-4 | 0 | - | - | - | - |
| HalfCheetah | ATLA-PPO (LSTM) | 1e-3 | 3e-4 | 0.0003 | - | 0.003 | 0.001 | 0 |
| HalfCheetah | ATLA-PPO (LSTM) + SA Reg | 1e-3 | 3e-4 | 0 | 0.03 | 0.003 | 0.003 | 0.0003 |