
Reward-Poisoning Attacks on Offline Multi-Agent Reinforcement Learning

Young Wu, Jeremy McMahan, Xiaojin Zhu, and Qiaomin Xie
Abstract

In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the $L^{p}$ norm of the reward modification. Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. We show that the attack works on various MARL agents including uncertainty-aware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.

1 Introduction

Multi-agent reinforcement learning (MARL) has achieved tremendous empirical success across a variety of tasks such as autonomous driving, cooperative robotics, economic policy-making, and video games. In MARL, several agents interact with each other and the underlying environment, and each of them aims to optimize their individual long-term reward (Zhang, Yang, and Başar 2021). Such problems are often formulated under the framework of Markov Games (Shapley 1953), which generalizes the Markov Decision Process model from single-agent RL. In offline MARL, the agents aim to learn a good policy by exploiting a pre-collected dataset without further interactions with the environment or other agents (Pan et al. 2022; Jiang and Lu 2021; Cui and Du 2022; Zhong et al. 2022). The optimal solution in MARL typically involves equilibria concepts.

While the above empirical success is encouraging, MARL algorithms are susceptible to data poisoning attacks: the agents can reach the wrong equilibria if an exogenous attacker manipulates the feedback to agents. For example, a third party attacker may want to interfere with traffic to cause autonomous vehicles to behave abnormally; teach robots an incorrect procedure so that they fail at certain tasks; misinform economic agents about the state of the economy and guide them to make irrational investment or saving decisions; or cause the non-player characters in a video game to behave improperly to benefit certain human players. In this paper, we study the security threat posed by reward-poisoning attacks on offline MARL. Here, the attacker wants the agents to learn a target policy $\pi^{\dagger}$ of the attacker’s choosing ($\pi^{\dagger}$ does not need to be an equilibrium in the original Markov Game). Meanwhile, the attacker wants to minimize the amount of dataset manipulation to avoid detection and accruing high cost. This paper studies optimal offline MARL reward-poisoning attacks. Our work serves as a first step toward eventual defense against reward-poisoning attacks.

Our Contributions

We introduce reward-poisoning attacks in offline MARL. We show that any attack that reduces to attacking single-agent RL separately must be suboptimal. Consequently, new innovations are necessary to attack effectively. We present a reward-poisoning framework that guarantees the target policy $\pi^{\dagger}$ becomes a Markov Perfect Dominant Strategy Equilibrium (MPDSE) for the underlying Markov Game. Since any rational agent will follow an MPDSE if it exists, this ensures the agents adopt the target policy $\pi^{\dagger}$. We also show the attack can be efficiently constructed using a linear program.

The attack framework has several important features. First, it is effective against a large class of offline MARL learners rather than a specific learning algorithm. Second, the framework allows partially decentralized agents who can only access their own individual rewards rather than the joint reward vectors of all agents. Lastly, the framework only makes the minimal assumption on the rationality of the learners that they will not take dominated actions.

We also give interpretable bounds on the minimal cost to poison an arbitrary dataset. These bounds relate the minimal attack cost to the structure of the underlying Markov Game. Using these bounds, we derive classes of extremal games that are especially cheap or expensive for the attacker to poison. These results show which games may be more susceptible to an attacker, while also giving insight to the structure of multi-agent attacks.

In the right hands, our framework could be used by a benevolent entity to coordinate agents in a way that improves social welfare. However, a malicious attacker could exploit the framework to harm learners and only benefit themselves. Consequently, our work paves the way for future study of MARL defense algorithms.

Related Work

Online Reward-Poisoning:

The reward-poisoning problem has been studied in various settings, including online single-agent reinforcement learners (Banihashem et al. 2022; Huang and Zhu 2019; Liu and Lai 2021; Rakhsha et al. 2021a, b, 2020; Sun, Huo, and Huang 2020; Zhang et al. 2020) and online bandits (Bogunovic et al. 2021; Garcelon et al. 2020; Guan et al. 2020; Jun et al. 2018; Liu and Shroff 2019; Lu, Wang, and Zhang 2021; Ma et al. 2018; Yang et al. 2021; Zuo 2020). Online reward poisoning for multiple learners was recently studied as a game redesign problem in (Ma, Wu, and Zhu 2021).

Offline Reward Poisoning:

Ma et al. (2019); Rakhsha et al. (2020, 2021a); Rangi et al. (2022b); Zhang and Parkes (2008); Zhang, Parkes, and Chen (2009) focus on adversarial attacks on offline single-agent reinforcement learners. Gleave et al. (2019); Guo et al. (2021) study poisoning attacks on multi-agent reinforcement learners, assuming that the attacker controls one of the learners. Our model instead assumes that the attacker is not one of the learners and is both willing and able to poison the rewards of all learners at the same time. This model pertains to applications such as autonomous driving, robotics, traffic control, and economic analysis, in which a central controller, whose interests need not be aligned with any of the agents, can modify the rewards and thereby manipulate all agents at the same time.

Constrained Mechanism Design:

Our paper is also related to the mechanism design literature, in particular, the K-implementation problem in Monderer and Tennenholtz (2004); Anderson, Shoham, and Altman (2010). Our model differs mainly in that the attacker, unlike a mechanism designer, does not alter the game/environment directly, but instead modifies the training data, from which the learners infer the underlying game and compute their policies accordingly. In practical applications, rewards are often stochastic due to imprecise measurement and state observation, hence the mechanism design approach is not directly applicable to MARL reward poisoning. Conversely, constrained mechanism design can be viewed as a special case in which the rewards are deterministic and the training data has uniform coverage of all period-state-action tuples.

Defense against Attacks on Reinforcement Learning:

There is also recent work on defending against reward poisoning or adversarial attacks on reinforcement learning; examples include Banihashem, Singla, and Radanovic (2021); Lykouris et al. (2021); Rangi et al. (2022a); Wei, Dann, and Zimmert (2022); Wu et al. (2022); Zhang et al. (2021a, b). These works focus on the single-agent setting, where the attacker has limited ability to modify the training data. We are not aware of defenses against reward poisoning in our offline multi-agent setting. Given the numerous real-world applications of offline MARL, we believe it is important to study the multi-agent version of the problem.

2 Preliminaries

Markov Games.

A finite-horizon general-sum $n$-player Markov Game is given by a tuple $G=(\mathcal{S},\mathcal{A},P,R,H,\mu)$ (Littman 1994). Here $\mathcal{S}$ is the finite state space, and $\mathcal{A}=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n}$ is the finite joint action space. We use $\boldsymbol{a}=(a_{1},\ldots,a_{n})\in\mathcal{A}$ to represent a joint action of the $n$ learners; we sometimes write $\boldsymbol{a}=(a_{i},a_{-i})$ to emphasize that learner $i$ takes action $a_{i}$ and the other $n-1$ learners take joint action $a_{-i}$. For each period $h\in[H]$, $P_{h}:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition function, where $\Delta(\mathcal{S})$ denotes the probability simplex on $\mathcal{S}$, and $P_{h}(s'|s,\boldsymbol{a})$ is the probability that the state is $s'$ in period $h+1$ given that the state is $s$ and the joint action is $\boldsymbol{a}$ in period $h$. $\boldsymbol{R}_{h}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{n}$ is the mean reward function for the $n$ players, where $R_{i,h}(s,\boldsymbol{a})$ denotes the scalar mean reward for player $i$ in state $s$ and period $h$ when the joint action $\boldsymbol{a}$ is taken. The initial state distribution is $\mu$.

Policies and value functions.

We use $\pi$ to denote a deterministic Markovian policy for the $n$ players, where $\boldsymbol{\pi}_{h}:\mathcal{S}\to\mathcal{A}$ is the policy in period $h$ and $\boldsymbol{\pi}_{h}(s)$ specifies the joint action in state $s$ and period $h$. We write $\boldsymbol{\pi}_{h}=(\pi_{i,h},\pi_{-i,h})$, where $\pi_{i,h}(s)$ is the action taken by learner $i$ and $\pi_{-i,h}(s)$ is the joint action taken by learners other than $i$ in state $s$ and period $h$. The value of a policy $\pi$ represents the expected cumulative reward of the game assuming the learners take actions according to $\pi$. Formally, the $Q$ value of learner $i$ in state $s$ in period $h$ under a joint action $\boldsymbol{a}$ is given recursively by

$$
\begin{aligned}
Q_{i,H}^{\pi}\left(s,\boldsymbol{a}\right) &= R_{i,H}\left(s,\boldsymbol{a}\right),\\
Q_{i,h}^{\pi}\left(s,\boldsymbol{a}\right) &= R_{i,h}\left(s,\boldsymbol{a}\right)+\sum_{s'\in\mathcal{S}}P_{h}\left(s'|s,\boldsymbol{a}\right)V_{i,h+1}^{\pi}\left(s'\right).
\end{aligned}
$$

The value of learner $i$ in state $s$ in period $h$ under policy $\pi$ is given by $V_{i,h}^{\pi}(s)=Q_{i,h}^{\pi}\left(s,\boldsymbol{\pi}_{h}(s)\right)$, and we use $\boldsymbol{V}_{h}^{\pi}(s)\in\mathbb{R}^{n}$ to denote the vector of values for all learners in state $s$ in period $h$ under policy $\pi$.
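To make the recursion concrete, the following is a minimal Python sketch of this backward induction for a known game. The array and dictionary layout (states as integers, joint actions as tuples, rewards as per-player vectors) is an assumption made for the example, not part of the paper's formulation.

```python
import numpy as np

def evaluate_policy(P, R, pi, H, n_states, n_players):
    """Backward-induction evaluation of a deterministic joint policy.

    P[h][s][a] : transition distribution over next states (length n_states)
    R[h][s][a] : length-n_players vector of mean rewards
    pi[h][s]   : joint action (a tuple) prescribed by the policy
    Returns Q[h][s][a] (vectors over players) and V[h][s].
    """
    Q = [dict() for _ in range(H)]
    V = [dict() for _ in range(H + 1)]
    for s in range(n_states):            # terminal convention: V_{H+1} = 0
        V[H][s] = np.zeros(n_players)
    for h in reversed(range(H)):
        for s in range(n_states):
            for a, r in R[h][s].items():
                # Q_{i,h}(s,a) = R_{i,h}(s,a) + sum_{s'} P_h(s'|s,a) V_{i,h+1}(s')
                Q[h][s][a] = np.asarray(r) + P[h][s][a] @ np.stack(
                    [V[h + 1][s2] for s2 in range(n_states)]
                )
            V[h][s] = Q[h][s][pi[h][s]]   # V_{i,h}(s) = Q_{i,h}(s, pi_h(s))
    return Q, V
```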

Offline MARL.

In offline MARL, the learners are given a fixed batch dataset $\mathcal{D}$ that records historical plays of $n$ agents under some behavior policies, and no further sampling is allowed. We assume that $\mathcal{D}=\big\{\big(s^{(k)}_{h},\boldsymbol{a}^{(k)}_{h},\boldsymbol{r}_{h}^{0,(k)}\big)_{h=1}^{H}\big\}_{k=1}^{K}$ contains $K$ episodes of length $H$. The data tuple in period $h$ of episode $k$ consists of the state $s^{(k)}_{h}\in\mathcal{S}$, the joint action profile $\boldsymbol{a}^{(k)}_{h}\in\mathcal{A}$, and the reward vector $\boldsymbol{r}_{h}^{0,(k)}\in\mathbb{R}^{n}$, where the superscript $0$ denotes the original rewards before any attack. The next state $s^{(k)}_{h+1}$ can be found in the next tuple. Given the shared data $\mathcal{D}$, each learner independently constructs a policy $\pi_{i}$ to maximize their own cumulative reward. They then behave according to the resulting joint policy $\pi=(\pi_{1},\ldots,\pi_{n})$ in future deployment. Note that in a multi-agent setting, the learners’ optimal solution concept is typically an approximate Nash equilibrium or Dominant Strategy Equilibrium (Cui and Du 2022; Zhong et al. 2022).

An agent’s access to $\mathcal{D}$ may be limited, for example due to privacy reasons. There are multiple levels of accessibility. At the first level, an agent can only access data that directly involves itself: instead of the tuple $(s_{h},\boldsymbol{a}_{h},\boldsymbol{r}_{h})$, agent $i$ would only be able to see $(s_{h},a_{i,h},r_{i,h})$. At the second level, agent $i$ can see the joint action but only its own reward: $(s_{h},\boldsymbol{a}_{h},r_{i,h})$. At the third level, agent $i$ can see the whole tuple $(s_{h},\boldsymbol{a}_{h},\boldsymbol{r}_{h})$. We focus on the second level in this paper.

Let $N_{h}\left(s,\boldsymbol{a}\right)=\sum_{k=1}^{K}\mathbf{1}_{\{s^{(k)}_{h}=s,\,\boldsymbol{a}^{(k)}_{h}=\boldsymbol{a}\}}$ be the total number of episodes containing $\left(s,\boldsymbol{a},\cdot\right)$ in period $h$. We consider a dataset $\mathcal{D}$ that satisfies the following coverage assumption.

Assumption 1.

(Full Coverage) For each $(s,\boldsymbol{a})$ and $h$, $N_{h}\left(s,\boldsymbol{a}\right)>0$.

While this assumption might appear strong, we later show that it is necessary to effectively poison the dataset.
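For concreteness, the sketch below computes the visit counts $N_h(s,\boldsymbol{a})$ from a dataset stored as nested lists of $(s,\boldsymbol{a},\boldsymbol{r})$ tuples and checks Assumption 1; the data layout and function names are assumptions of the example.

```python
from collections import Counter
from itertools import product

def visit_counts(episodes, H):
    """episodes[k][h] = (state, joint_action, reward_vector); returns N[h][(s, a)]."""
    N = [Counter() for _ in range(H)]
    for episode in episodes:
        for h, (s, a, _r) in enumerate(episode):
            N[h][(s, tuple(a))] += 1
    return N

def has_full_coverage(N, states, joint_actions):
    """Assumption 1: every (s, a) appears at least once in every period."""
    return all(N[h][(s, tuple(a))] > 0
               for h in range(len(N))
               for s, a in product(states, joint_actions))
```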

Attack Model

We assume that the attacker has access to the original dataset $\mathcal{D}$. The attacker has a pre-specified target policy $\boldsymbol{\pi}^{\dagger}$ and attempts to poison the rewards in $\mathcal{D}$ with the goal of forcing the learners to learn $\boldsymbol{\pi}^{\dagger}$ from the poisoned dataset. The attacker also desires that the attack has minimal cost. We let $C(r^{0},r^{\dagger})$ denote the cost of a specific poisoning, where $r^{0}=\big\{\big(\boldsymbol{r}_{h}^{0,(k)}\big)_{h=1}^{H}\big\}_{k=1}^{K}$ are the original rewards and $r^{\dagger}=\big\{\big(\boldsymbol{r}_{h}^{\dagger,(k)}\big)_{h=1}^{H}\big\}_{k=1}^{K}$ are the poisoned rewards. We focus on the $L^{1}$-norm cost $C(r^{0},r^{\dagger})=\|r^{0}-r^{\dagger}\|_{1}$.

Rationality.

For generality, the attacker makes minimal assumptions on the learners’ rationality. Namely, the attacker only assumes that the learners never take dominated actions (Monderer and Tennenholtz 2004). For technical reasons, we strengthen this assumption slightly by introducing an arbitrarily small margin $\iota>0$ (e.g., representing the learners’ numerical resolution).

Definition 1.

An $\iota$-strict Markov perfect dominant strategy equilibrium ($\iota$-MPDSE) of a Markov Game $G$ is a policy $\pi$ such that for all learners $i\in[n]$, periods $h\in[H]$, and states $s\in\mathcal{S}$,

$$
\forall\,a_{i}\in\mathcal{A}_{i},\ a_{i}\neq\pi_{i,h}(s),\ a_{-i}\in\mathcal{A}_{-i}:\qquad
Q_{i,h}^{\pi}\big(s,(\pi_{i,h}(s),a_{-i})\big)\geq Q_{i,h}^{\pi}\big(s,(a_{i},a_{-i})\big)+\iota.
$$

Note that a strict MPDSE, if it exists, must be unique.

Assumption 2.

(Rationality) The learners will play an $\iota$-MPDSE should one exist.

Uncertainty-aware attack.

State-of-the-art MARL algorithms are typically uncertainty-aware (Cui and Du 2022; Zhong et al. 2022), meaning that learners are cognizant of the model uncertainty due to finite, random data and will calibrate their learning procedure accordingly. The attacker accounts for such uncertainty-aware learners, but does not know the learners’ specific algorithm or internal parameters. It only assumes that the policies computed by the learners are solutions to some game that is plausible given the dataset. Accordingly, the attacker aims to poison the dataset in such a way that the target policy is an $\iota$-MPDSE for every game that is plausible for the poisoned dataset.

To formally define the set of plausible Markov Games for a given dataset $\mathcal{D}$, we first need a few definitions.

Definition 2.

(Confidence Game Set) The confidence set on the transition function $P_{h}\left(s,\boldsymbol{a}\right)$ has the form:

$$
\textup{CI}_{h}^{P}\left(s,\boldsymbol{a}\right):=\big\{P_{h}\left(s,\boldsymbol{a}\right)\in\Delta\left(\mathcal{S}\right):\|P_{h}\left(s,\boldsymbol{a}\right)-\hat{P}_{h}\left(s,\boldsymbol{a}\right)\|_{1}\leq\rho^{P}_{h}\left(s,\boldsymbol{a}\right)\big\}
$$

where

$\hat{P}_{h}(s'|s,\boldsymbol{a}):=\dfrac{1}{N_{h}(s,\boldsymbol{a})}\sum_{k=1}^{K}\mathbf{1}_{\{s^{(k)}_{h+1}=s',\,s^{(k)}_{h}=s,\,\boldsymbol{a}^{(k)}_{h}=\boldsymbol{a}\}}$ is the maximum likelihood estimate (MLE) of the true transition probability. Similarly, the confidence set on the reward function $R_{i,h}\left(s,\boldsymbol{a}\right)$ has the form:

$$
\textup{CI}_{i,h}^{R}\left(s,\boldsymbol{a}\right):=\big\{R_{i,h}\left(s,\boldsymbol{a}\right)\in\left[-b,b\right]:|R_{i,h}\left(s,\boldsymbol{a}\right)-\hat{R}_{i,h}\left(s,\boldsymbol{a}\right)|\leq\rho^{R}_{h}\left(s,\boldsymbol{a}\right)\big\},
$$

where $\hat{R}_{i,h}(s,\boldsymbol{a}):=\dfrac{1}{N_{h}(s,\boldsymbol{a})}\sum_{k=1}^{K}r_{i,h}^{0,(k)}\mathbf{1}_{\{s^{(k)}_{h}=s,\,\boldsymbol{a}^{(k)}_{h}=\boldsymbol{a}\}}$ is the MLE of the reward. Then, the set of all plausible Markov Games consistent with $\mathcal{D}$, denoted by $\textup{CI}^{G}$, is defined to be:

$$
\textup{CI}^{G}:=\big\{G=\left(\mathcal{S},\mathcal{A},P,R,H,\mu\right):P_{h}\left(s,\boldsymbol{a}\right)\in\textup{CI}_{h}^{P}\left(s,\boldsymbol{a}\right),\ R_{i,h}\left(s,\boldsymbol{a}\right)\in\textup{CI}_{i,h}^{R}\left(s,\boldsymbol{a}\right),\ \forall\,i,h,s,\boldsymbol{a}\big\}.
$$

Note that both the attacker and the learners know that all of the rewards are bounded within $[-b,b]$ (we allow $b=\infty$). The values of $\rho^{P}_{h}\left(s,\boldsymbol{a}\right)$ and $\rho^{R}_{h}\left(s,\boldsymbol{a}\right)$ are typically given by concentration inequalities. One standard choice takes the Hoeffding-type form $\rho^{P}_{h}\left(s,\boldsymbol{a}\right)\propto 1/\sqrt{\max\{N_{h}(s,\boldsymbol{a}),1\}}$ and $\rho^{R}_{h}\left(s,\boldsymbol{a}\right)\propto 1/\sqrt{\max\{N_{h}(s,\boldsymbol{a}),1\}}$, where we recall that $N_{h}(s,\boldsymbol{a})$ is the visitation count of the state-action pair $(s,\boldsymbol{a})$ (Xie et al. 2020; Cui and Du 2022; Zhong et al. 2022). We remark that with a proper choice of $\rho^{P}_{h}$ and $\rho^{R}_{h}$, $\textup{CI}^{G}$ contains the game constructed by optimistic MARL algorithms with upper confidence bounds (Xie et al. 2020), as well as that constructed by pessimistic algorithms with lower confidence bounds (Cui and Du 2022; Zhong et al. 2022). See the appendix for details.
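As an illustration, the sketch below estimates $\hat{R}$, $\hat{P}$, and Hoeffding-type half-widths $\rho^{R}_{h}$ from a dataset stored in the episode format used earlier; the constant in the half-width follows the choice stated in Corollary 6, and the data layout and parameter names are assumptions of the example.

```python
import numpy as np
from collections import defaultdict

def mle_and_widths(episodes, H, n_players, b, delta, n_states, n_joint_actions):
    """MLE rewards/transitions and Hoeffding-type half-widths, keyed by (h, s, a)."""
    N = defaultdict(int)
    R_sum = defaultdict(lambda: np.zeros(n_players))
    P_count = defaultdict(lambda: defaultdict(int))
    for ep in episodes:
        for h, (s, a, r) in enumerate(ep):
            key = (h, s, tuple(a))
            N[key] += 1
            R_sum[key] += np.asarray(r)
            if h + 1 < H:
                P_count[key][ep[h + 1][0]] += 1   # next state observed in episode
    R_hat = {k: R_sum[k] / N[k] for k in N}
    P_hat = {k: {s2: c / N[k] for s2, c in P_count[k].items()} for k in N}
    log_term = np.log(H * n_states * n_joint_actions / delta)
    rho_R = {k: 2 * b * np.sqrt(log_term / max(N[k], 1)) for k in N}
    return R_hat, P_hat, rho_R
```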

With the above definition, we consider an attacker that attempts to modify the original dataset $\mathcal{D}$ into $\mathcal{D}^{\dagger}$ so that $\boldsymbol{\pi}^{\dagger}$ is an $\iota$-MPDSE for every plausible game in $\textup{CI}^{G}$ induced by the poisoned $\mathcal{D}^{\dagger}$. This would guarantee that the learners adopt $\boldsymbol{\pi}^{\dagger}$.

The full coverage Assumption 1 is necessary for the above attack goal, as shown in the following proposition. We defer the proof to the appendix.

Proposition 1.

If $N_{h}\left(s,\boldsymbol{a}\right)=0$ for some $\left(h,s,\boldsymbol{a}\right)$, then there exist MARL learners for which the attacker’s problem is infeasible.

3 Poisoning Framework

In this section, we first argue that naively applying single-agent poisoning attacks separately to each agent results in a suboptimal attack cost. We then present a new optimal poisoning framework that accounts for multiple agents and thereby allows the attack problem to be solved efficiently.

Suboptimality of single-agent attack reduction.

As a first attempt, the attacker could try to use existing single-agent RL reward-poisoning methods. However, this approach is doomed to be suboptimal. Consider the following game with $n=2$ learners, one period, and one state:

$\mathcal{A}_{1}\setminus\mathcal{A}_{2}$      $1$          $2$
$1$                                            $(3,3)$      $(1,2)$
$2$                                            $(2,1)$      $(0,0)$

Suppose that the original dataset $\mathcal{D}$ has full coverage. For simplicity, we assume that each $(s,\boldsymbol{a})$ pair appears sufficiently many times so that $\rho^{R}$ is small. In this case, the target policy $\boldsymbol{\pi}^{\dagger}=(1,1)$ is already an MPDSE, so no reward modification is needed. However, if we use a single-agent approach, each learner $i$ will observe the following dataset:

$\mathcal{A}_{i}$      $r$
$1$                    $\{3,1\}$
$2$                    $\{2,0\}$

In this case, it is not immediately clear to learner $i$ which of the two actions is strictly better; for example, this happens when the rewards $1$ and $2$ appear relatively more often than $3$ and $0$. To ensure that both players take action 1, the attacker needs to modify at least one of the rewards for each player, thereby incurring a nonzero (and hence suboptimal) attack cost.

The example above shows that a new approach is needed to construct an optimal poisoning framework tailored to the multi-agent setting. Below we develop such a framework, first for the simple Bandit Game setting, which is then generalized to Markov Games.

Bandit Game Setting

As a stepping stone, we start with a subclass of Markov Games with $|\mathcal{S}|=1$ and $H=1$, which are sometimes called bandit games. A bandit game consists of a single-stage normal-form game. For now, we also pretend that the learners simply use the data to compute an MLE point estimate $\hat{G}$ of the game and then solve the estimated game $\hat{G}$. This is unrealistic, but it highlights the attacker’s strategy to enforce that $\boldsymbol{\pi}^{\dagger}$ is an $\iota$-strict DSE in $\hat{G}$.

Suppose the original dataset is $\mathcal{D}=\big\{(\boldsymbol{a}^{(k)},\boldsymbol{r}^{0,(k)})\big\}_{k=1}^{K}$ (recall we no longer have a state or period index). Also, let $N(\boldsymbol{a}):=\sum_{k=1}^{K}\mathbf{1}_{\{\boldsymbol{a}^{(k)}=\boldsymbol{a}\}}$ be the action counts. The attacker’s problem can be formulated as the convex optimization problem given in (1).

$$
\begin{aligned}
\min_{r^{\dagger}}\quad & C\big(r^{0},r^{\dagger}\big) &\quad (1)\\
\text{s.t.}\quad & R^{\dagger}(\boldsymbol{a}):=\frac{1}{N(\boldsymbol{a})}\sum_{k=1}^{K}\boldsymbol{r}^{\dagger,(k)}\mathbf{1}_{\left\{\boldsymbol{a}^{(k)}=\boldsymbol{a}\right\}},\quad\forall\,\boldsymbol{a};\\
& R_{i}^{\dagger}\big(\pi_{i}^{\dagger},a_{-i}\big)\geq R_{i}^{\dagger}\left(a_{i},a_{-i}\right)+\iota,\quad\forall\,i,\,a_{-i},\,a_{i}\neq\pi_{i}^{\dagger};\\
& \boldsymbol{r}^{\dagger,(k)}\in\left[-b,b\right]^{n},\quad\forall\,k.
\end{aligned}
$$

The first constraint in (1) models the learners’ MLE $\hat{G}$ after poisoning. The second constraint enforces that $\boldsymbol{\pi}^{\dagger}$ is an $\iota$-strict DSE of $\hat{G}$ by definition. We observe that:

  1. The problem is feasible if $\iota\leq 2b$, since the attacker can always set, for each agent, the reward to be $b$ for the target action and $-b$ for all other actions;

  2. If the cost function $C(\cdot,\cdot)$ is the $L^{1}$-norm, the problem is a linear program (LP) with $nK$ variables and $(A-1)A^{n-1}+2nK$ inequality constraints (assuming each learner has $|\mathcal{A}_{i}|=A$ actions);

  3. After the attack, learner $i$ only needs to see its own rewards to be convinced that $\pi_{i}^{\dagger}$ is a dominant strategy; learner $i$ does not need to observe other learners’ rewards.

This simple formulation serves as an asymptotic approximation to the attack problem for confidence-bound-based learners. In particular, when $N(\boldsymbol{a})$ is large for all $\boldsymbol{a}$, the confidence intervals on $P$ and $R$ are usually small.
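To illustrate, here is a minimal sketch of formulation (1) with the $L^1$ cost, written with the off-the-shelf convex solver cvxpy; the dataset layout and the function and variable names are assumptions of the example, not part of the paper.

```python
import numpy as np
import cvxpy as cp
from itertools import product

def bandit_attack_lp(actions, r0, pi_dagger, action_sets, iota, b):
    """Formulation (1): poison bandit-game rewards so pi_dagger is an iota-strict DSE.

    actions : (K, n) int array of joint actions a^(k)
    r0      : (K, n) float array of original rewards r^{0,(k)}
    """
    K, n = r0.shape
    joint = list(product(*action_sets))            # all joint actions
    M = np.array([[1.0 if tuple(actions[k]) == a else 0.0 for k in range(K)]
                  for a in joint])                 # indicator matrix, |A| x K
    N = M.sum(axis=1)                              # visit counts N(a), assumed > 0
    r = cp.Variable((K, n))                        # poisoned rewards r^{dagger,(k)}
    R = (M / N[:, None]) @ r                       # per-cell means R^dagger(a), |A| x n
    idx = {a: j for j, a in enumerate(joint)}
    cons = [r >= -b, r <= b]
    for i in range(n):
        others = [action_sets[j] for j in range(n) if j != i]
        for a_minus in product(*others):
            tgt = a_minus[:i] + (pi_dagger[i],) + a_minus[i:]
            for ai in action_sets[i]:
                if ai == pi_dagger[i]:
                    continue
                dev = a_minus[:i] + (ai,) + a_minus[i:]
                # R_i(pi_i^dagger, a_{-i}) >= R_i(a_i, a_{-i}) + iota
                cons.append(R[idx[tgt], i] >= R[idx[dev], i] + iota)
    prob = cp.Problem(cp.Minimize(cp.sum(cp.abs(r - r0))), cons)
    prob.solve()
    return r.value, prob.value                     # poisoned rewards and L1 cost
```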

With the above idea in place, we can consider more realistic learners that are uncertainty-aware. For these learners, the attacker attempts to enforce an $\iota$ separation between the lower bound of the target action’s reward and the upper bounds of all other actions’ rewards (similar to arm elimination in bandits). With such separation, all plausible games in $\textup{CI}^{G}$ would have the target action profile as the dominant strategy equilibrium. This approach can be formulated as the slightly more complex optimization problem (2), where the second and third constraints enforce the desired $\iota$ separation. Formulation (2) can be solved using standard optimization solvers, hence the optimal attack can be computed efficiently.

$$
\begin{aligned}
\min_{r^{\dagger}}\quad & C\big(r^{0},r^{\dagger}\big) &\quad (2)\\
\text{s.t.}\quad & R^{\dagger}(\boldsymbol{a}):=\frac{1}{N(\boldsymbol{a})}\sum_{k=1}^{K}\boldsymbol{r}^{\dagger,(k)}\mathbf{1}_{\left\{\boldsymbol{a}^{(k)}=\boldsymbol{a}\right\}},\quad\forall\,\boldsymbol{a};\\
& \textup{CI}_{i}^{R^{\dagger}}(\boldsymbol{a}):=\big\{R_{i}(\boldsymbol{a})\in[-b,b]:\big|R_{i}(\boldsymbol{a})-R_{i}^{\dagger}(\boldsymbol{a})\big|\leq\rho^{R}(\boldsymbol{a})\big\},\quad\forall\,i,\boldsymbol{a};\\
& \min_{R_{i}\in\textup{CI}_{i}^{R^{\dagger}}(\pi_{i}^{\dagger},a_{-i})}R_{i}\;\geq\;\max_{R_{i}\in\textup{CI}_{i}^{R^{\dagger}}(a_{i},a_{-i})}R_{i}+\iota,\quad\forall\,i,\,a_{-i},\,a_{i}\neq\pi_{i}^{\dagger};\\
& \boldsymbol{r}^{\dagger,(k)}\in\left[-b,b\right]^{n},\quad\forall\,k.
\end{aligned}
$$

We next consider whether this formulation has a feasible solution. Below we characterize the feasibility of the attack in terms of the margin parameter $\iota$ and the confidence bounds.

Proposition 2.

The attacker’s problem (2) is feasible if $\iota\leq 2b-2\rho^{R}\left(\boldsymbol{a}\right),\ \forall\,\boldsymbol{a}\in\mathcal{A}$.

Proposition 2 is a special case of the general Theorem 5 with $H=|\mathcal{S}|=1$. We note that the condition in Proposition 2 has an equivalent form that relates to the structure of the dataset; we later present this form for the more general case.

When the $L^{1}$-norm cost function is used, we show in the appendix that formulation (2) can also be efficiently solved.

Proposition 3.

With the $L^{1}$-norm cost function $C\left(\cdot,\cdot\right)$, problem (2) can be formulated as a linear program.
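Relative to the sketch of (1) above, only the separation constraint changes: assuming the $[-b,b]$ clipping inside the confidence intervals is not active, the min and max over $\textup{CI}_{i}^{R^{\dagger}}$ reduce to $R^{\dagger}\mp\rho^{R}$, so the constraint appended in the inner loop becomes the line below (here `rho` is a hypothetical dictionary of half-widths $\rho^{R}(\boldsymbol{a})$ keyed by joint action).

```python
# iota-separation with confidence half-widths rho[a], assuming the [-b, b]
# clipping of the confidence intervals is not binding:
cons.append(R[idx[tgt], i] - rho[tgt] >= R[idx[dev], i] + rho[dev] + iota)
```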

Markov Game Setting

We now generalize the ideas from the bandit setting to derive a poisoning framework for arbitrary Markov Games. With multiple states and periods, there are two main complications:

  1. In each period $h$, the learners’ decision depends on $Q_{h}$, which involves both the immediate reward $R_{h}$ and the future return $Q_{h+1}$;

  2. The uncertainty in $Q_{h}$ amplifies as it propagates backward in $h$.

Accordingly, the attacker needs to design the poisoning attack recursively.

Our main technical innovation is an attack formulation based on $Q$ confidence-bound backward induction. The attacker maintains confidence upper and lower bounds on the learners’ $Q$ function, $\overline{Q}$ and $\underline{Q}$, with backward induction. To ensure that $\boldsymbol{\pi}^{\dagger}$ becomes an $\iota$-MPDSE, the attacker again attempts to $\iota$-separate the lower bound of the target action and the upper bound of all other actions, at all states and periods.

Recall Definition 2: given the training dataset $\mathcal{D}$, one can compute the MLEs $\hat{\boldsymbol{R}}_{h}$ and corresponding confidence sets $\textup{CI}_{i,h}^{R}$ for the reward. The attacker aims to poison $\mathcal{D}$ into $\mathcal{D}^{\dagger}$ so that the MLEs and confidence sets become $\boldsymbol{R}_{h}^{\dagger}$ and $\textup{CI}_{i,h}^{R^{\dagger}}$, under which $\boldsymbol{\pi}^{\dagger}$ is the unique $\iota$-MPDSE for all plausible games in the corresponding confidence game set. The attacker finds the minimum-cost way of doing so by solving a $Q$ confidence-bound backward induction optimization problem, given in (3)–(7).

$$
\begin{aligned}
\min_{r^{\dagger}}\quad & C\left(r^{0},r^{\dagger}\right) &\quad (3)\\
\text{s.t.}\quad & R_{i,h}^{\dagger}\left(s,\boldsymbol{a}\right):=\frac{1}{N_{h}\left(s,\boldsymbol{a}\right)}\sum_{k=1}^{K}r_{i,h}^{\dagger,(k)}\mathbf{1}_{\left\{s^{(k)}_{h}=s,\,\boldsymbol{a}^{(k)}_{h}=\boldsymbol{a}\right\}},\quad\forall\,h,s,i,\boldsymbol{a}\\
& \textup{CI}_{i,h}^{R^{\dagger}}\left(s,\boldsymbol{a}\right):=\Big\{R_{i,h}\left(s,\boldsymbol{a}\right)\in\left[-b,b\right]:\big|R_{i,h}\left(s,\boldsymbol{a}\right)-R_{i,h}^{\dagger}\left(s,\boldsymbol{a}\right)\big|\leq\rho^{R}_{h}\left(s,\boldsymbol{a}\right)\Big\},\quad\forall\,h,s,i,\boldsymbol{a}\\
& \underline{Q}_{i,H}\left(s,\boldsymbol{a}\right):=\min_{R_{i,H}\in\textup{CI}_{i,H}^{R^{\dagger}}\left(s,\boldsymbol{a}\right)}R_{i,H},\quad\forall\,s,i,\boldsymbol{a}\\
& \underline{Q}_{i,h}\left(s,\boldsymbol{a}\right):=\min_{R_{i,h}\in\textup{CI}_{i,h}^{R^{\dagger}}\left(s,\boldsymbol{a}\right)}R_{i,h}+\min_{P_{h}\in\textup{CI}_{h}^{P}\left(s,\boldsymbol{a}\right)}\sum_{s'\in\mathcal{S}}P_{h}\left(s'\right)\underline{Q}_{i,h+1}\left(s',\boldsymbol{\pi}_{h+1}^{\dagger}\left(s'\right)\right),\quad\forall\,h<H,s,i,\boldsymbol{a} &\quad (4)\\
& \overline{Q}_{i,H}\left(s,\boldsymbol{a}\right):=\max_{R_{i,H}\in\textup{CI}_{i,H}^{R^{\dagger}}\left(s,\boldsymbol{a}\right)}R_{i,H},\quad\forall\,s,i,\boldsymbol{a}\\
& \overline{Q}_{i,h}\left(s,\boldsymbol{a}\right):=\max_{R_{i,h}\in\textup{CI}_{i,h}^{R^{\dagger}}\left(s,\boldsymbol{a}\right)}R_{i,h}+\max_{P_{h}\in\textup{CI}_{h}^{P}\left(s,\boldsymbol{a}\right)}\sum_{s'\in\mathcal{S}}P_{h}\left(s'\right)\overline{Q}_{i,h+1}\left(s',\boldsymbol{\pi}_{h+1}^{\dagger}\left(s'\right)\right),\quad\forall\,h<H,s,i,\boldsymbol{a} &\quad (5)\\
& \underline{Q}_{i,h}\left(s,\big(\pi_{i,h}^{\dagger}(s),a_{-i}\big)\right)\geq\overline{Q}_{i,h}\left(s,\left(a_{i},a_{-i}\right)\right)+\iota,\quad\forall\,h,s,i,a_{-i},a_{i}\neq\pi_{i,h}^{\dagger}\left(s\right) &\quad (6)\\
& \boldsymbol{r}_{h}^{\dagger,(k)}\in\left[-b,b\right]^{n},\quad\forall\,h,k. &\quad (7)
\end{aligned}
$$

The backward induction steps (4) and (5) ensure that $\underline{Q}$ and $\overline{Q}$ are valid lower and upper bounds on the $Q$ function for all plausible Markov Games in $\textup{CI}^{G}$, for all periods. The margin constraints (6) enforce an $\iota$-separation between the target action and all other actions at all states and periods. We emphasize that the agents need not consider $Q$ at all in their learning algorithm; $Q$ only appears in the optimization due to its presence in the definition of MPDSE.
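As a sanity-check sketch (not the LP itself), the routine below computes $\underline{Q}$ and $\overline{Q}$ for a fixed candidate $R^{\dagger}$ via (4)–(5) and then verifies the margin constraints (6). The inner optimization over $\textup{CI}_h^P$ is replaced by the standard conservative relaxation $\hat{P}\cdot V \mp \tfrac{\rho^P}{2}(\max V-\min V)$, so a pass is a sufficient certificate (per Lemma 4) but not a tight one; the data layout follows the earlier sketches and is an assumption of the example.

```python
import numpy as np
from itertools import product

def check_margins(R_dag, rho_R, P_hat, rho_P, pi_dag, action_sets, b, iota):
    """Backward induction (4)-(5) with a conservative transition relaxation,
    then the margin check (6).

    R_dag[h][s][a] : length-n vector of poisoned mean rewards, a a joint-action tuple
    rho_R[h][s][a], rho_P[h][s][a] : confidence half-widths
    P_hat[h][s][a] : length-|S| MLE transition vector
    pi_dag[h][s]   : target joint action (tuple)
    """
    H, S, n = len(R_dag), len(R_dag[0]), len(action_sets)
    joint = list(product(*action_sets))
    Q_low = [[dict() for _ in range(S)] for _ in range(H)]
    Q_high = [[dict() for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in joint:
                r_lo = np.maximum(R_dag[h][s][a] - rho_R[h][s][a], -b)
                r_hi = np.minimum(R_dag[h][s][a] + rho_R[h][s][a], b)
                if h == H - 1:
                    Q_low[h][s][a], Q_high[h][s][a] = r_lo, r_hi
                else:
                    # continuation values under the target policy at period h+1
                    V_lo = np.stack([Q_low[h+1][s2][pi_dag[h+1][s2]] for s2 in range(S)])
                    V_hi = np.stack([Q_high[h+1][s2][pi_dag[h+1][s2]] for s2 in range(S)])
                    slack_lo = 0.5 * rho_P[h][s][a] * (V_lo.max(axis=0) - V_lo.min(axis=0))
                    slack_hi = 0.5 * rho_P[h][s][a] * (V_hi.max(axis=0) - V_hi.min(axis=0))
                    Q_low[h][s][a] = r_lo + P_hat[h][s][a] @ V_lo - slack_lo
                    Q_high[h][s][a] = r_hi + P_hat[h][s][a] @ V_hi + slack_hi
            # margin constraints (6): the target action iota-dominates every deviation
            for i in range(n):
                others = [action_sets[j] for j in range(n) if j != i]
                for a_minus in product(*others):
                    tgt = a_minus[:i] + (pi_dag[h][s][i],) + a_minus[i:]
                    for ai in action_sets[i]:
                        if ai == pi_dag[h][s][i]:
                            continue
                        dev = a_minus[:i] + (ai,) + a_minus[i:]
                        if Q_low[h][s][tgt][i] < Q_high[h][s][dev][i] + iota:
                            return False
    return True
```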

Again, pairing an efficient optimization solver with the above formulation gives an efficient algorithm for constructing the poisoning. We now answer the important questions of whether this formulation admits a feasible solution and whether these solutions yield successful attacks. The lemma below provides a positive answer to the second question.

Lemma 4.

If the attack formulation (3)–(7) is feasible, then $\boldsymbol{\pi}^{\dagger}$ is the unique $\iota$-MPDSE of every Markov Game $G\in\textup{CI}^{G}$.

Moreover, the attack formulation admits feasible solutions under mild conditions on the dataset.

Theorem 5.

The attack formulation (3)–(7) is feasible if the following condition holds:

$$
\iota\leq 2b-\left(H+1\right)\rho^{R}_{h}\left(s,\boldsymbol{a}\right),\quad\forall\,h\in\left[H\right],\,s\in\mathcal{S},\,\boldsymbol{a}\in\mathcal{A}.
$$

We remark that the learners know the upper bound $b$ and may use it to exclude implausible games. The accumulation of confidence intervals over the $H$ periods results in the extra factor $(H+1)$ on $\rho^{R}_{h}$. Theorem 5 implies that the problem is feasible so long as the dataset is sufficiently populated; that is, each $(s,\boldsymbol{a})$ pair should appear frequently enough to have a small confidence interval half-width $\rho^{R}_{h}$. The following corollary provides a precise condition on the visit counts that guarantees feasibility.

Corollary 6.

Given a confidence probability $\delta$ and the confidence interval half-width $\rho^{R}_{h}\left(s,\boldsymbol{a}\right)=f\big(\tfrac{1}{N_{h}(s,\boldsymbol{a})}\big)$ for some strictly increasing function $f$, the condition in Theorem 5 holds if

$$
N_{h}(s,\boldsymbol{a})\geq\Big(f^{-1}\big(\tfrac{2b-\iota}{H+1}\big)\Big)^{-1}.
$$

In particular, for the natural choice of Hoeffding-type $\rho^{R}_{h}\left(s,\boldsymbol{a}\right)=2b\sqrt{\dfrac{\log\left(\left(H|\mathcal{S}||\mathcal{A}|\right)/\delta\right)}{\max\left\{N_{h}\left(s,\boldsymbol{a}\right),1\right\}}}$, it suffices that

$$
N_{h}(s,\boldsymbol{a})\geq\dfrac{4b^{2}\left(H+1\right)^{2}\log\left(\left(H|\mathcal{S}||\mathcal{A}|\right)/\delta\right)}{\left(2b-\iota\right)^{2}}.
$$
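For a sense of scale, a tiny sketch evaluating the Hoeffding-type condition above; the numerical values plugged in are arbitrary illustrations.

```python
import math

def required_visits(b, H, n_states, n_joint_actions, delta, iota):
    """Smallest N_h(s, a) satisfying the Hoeffding-type condition of Corollary 6."""
    log_term = math.log(H * n_states * n_joint_actions / delta)
    return math.ceil(4 * b**2 * (H + 1)**2 * log_term / (2 * b - iota)**2)

# Example: b = 1, H = 10, |S| = 5, |A| = 9 joint actions, delta = 0.05, iota = 0.1
print(required_visits(1, 10, 5, 9, 0.05, 0.1))  # about 1221 visits per (h, s, a)
```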

Despite the inner min and max in the problem (3)–(7), the problem can be formulated as an LP, thanks to LP duality.

Theorem 7.

With the $L^{1}$-norm cost function $C(\cdot,\cdot)$, problem (3)–(7) can be formulated as an LP.

The proofs of the above results can be found in the appendix.

4 Cost Analysis

Now that we know how the attacker can poison the dataset in the multi-agent setting, we can study the structure of attacks. This structure is most easily seen by analyzing the minimal attack cost. To this end, we give general bounds that relate the minimal attack cost to the structure of the underlying Markov Game. The attack cost upper bounds show which games are particularly susceptible to poisoning, and the attack cost lower bounds demonstrate that some games are expensive to poison.

Overview of results: Specifically, we shall present two types of upper/lower bounds on the attack cost: (i) universal bounds that hold for all attack problem instances simultaneously; (ii) instance-dependent bounds that are stated in terms of certain properties of the instance. We also discuss problem instances under which these two types of bounds are tight and coincide with each other.

We note that all bounds presented here are with respect to the $L^{1}$-cost, but many of them generalize to other cost functions, especially the $L^{\infty}$-cost. The proofs of the results presented in this section are provided in the appendix.

Setup: Let $I=(\mathcal{D},\boldsymbol{\pi}^{\dagger},\rho^{R},\rho^{P},\iota)$ denote an instance of the attack problem, and let $\hat{G}$ denote the corresponding MLE of the Markov Game derived from $\mathcal{D}$. We denote by $I_{h}=(\mathcal{D}_{h},\boldsymbol{\pi}_{h}^{\dagger},\rho^{R}_{h},\rho^{P}_{h},\iota)$ the restriction of the instance to period $h$. In particular, $\hat{R}_{h}(s)$ derived from $\mathcal{D}_{h}$ is exactly the normal-form game at state $s$ and period $h$ of $\hat{G}$. We define $C^{*}(I)$ to be the optimal $L^{1}$-poisoning cost for the instance $I$; that is, $C^{*}(I)$ is the optimal value of the optimization problem (3)–(7) evaluated on $I$. We say the attack instance $I$ is feasible if this optimization problem is feasible. If $I$ is infeasible, we define $C^{*}(I)=\infty$. WLOG, we assume that $|\mathcal{A}_{1}|=\cdots=|\mathcal{A}_{n}|=A$. In addition, we define the minimum visit count for each period $h$ in $\mathcal{D}$ as $\underline{N}_{h}:=\min_{s\in\mathcal{S}}\min_{\boldsymbol{a}\in\mathcal{A}}N_{h}\left(s,\boldsymbol{a}\right)$, and the minimum over all periods as $\underline{N}:=\min_{h\in[H]}\underline{N}_{h}$. We similarly define the maximum visit counts $\overline{N}_{h}=\max_{s\in\mathcal{S}}\max_{\boldsymbol{a}\in\mathcal{A}}N_{h}\left(s,\boldsymbol{a}\right)$ and $\overline{N}=\max_{h}\overline{N}_{h}$. Lastly, we define $\underline{\rho}=\min_{h,s,\boldsymbol{a}}\rho^{R}_{h}(s,\boldsymbol{a})$ and $\overline{\rho}=\max_{h,s,\boldsymbol{a}}\rho^{R}_{h}(s,\boldsymbol{a})$, the minimum and maximum confidence half-widths.

Universal Cost Bounds

With the above definitions, we present universal attack cost bounds that hold simultaneously for all attack instances.

Theorem 8.

For any feasible attack instance $I$, we have that

$$
0\leq C^{*}(I)\leq\overline{N}H|\mathcal{S}|nA^{n}\cdot 2b.
$$

As these upper and lower bounds hold for all instances, they are typically loose. However, they are nearly tight. If $\boldsymbol{\pi}^{\dagger}$ is already an $\iota$-MPDSE for all plausible games, then no change to the rewards is needed and the attack cost is 0, hence the lower bound is tight for such instances. We can also construct a high-cost instance to show near-tightness of the upper bound.

Specifically, consider the dataset for a bandit game, $\mathcal{D}=\big\{(\boldsymbol{a}^{(k)},\boldsymbol{r}^{0,(k)})\big\}_{k=1}^{K}$, where $|\mathcal{A}|=A^{n}$ and each joint action appears exactly $N$ times, i.e., $\overline{N}=\underline{N}=N$ and $K=NA^{n}$. The target policy is $\boldsymbol{\pi}^{\dagger}=(1,\ldots,1)$. The dataset is constructed so that $r^{0,(k)}_{i}=-b$ if $a^{(k)}_{i}=\pi_{i}^{\dagger}$ and $r^{0,(k)}_{i}=b$ otherwise. These rewards are essentially the extreme opposite of what the attacker needs to ensure $\boldsymbol{\pi}^{\dagger}$ is an $\iota$-DSE. Note that this dataset induces the MLE game shown in Table 1 for the special case with $n=2$ players.

$\mathcal{A}_{1}/\mathcal{A}_{2}$      $1$          $2$          $\ldots$     $|\mathcal{A}_{2}|$
$1$                                    $-b,-b$      $-b,b$       $\ldots$     $-b,b$
$2$                                    $b,-b$       $b,b$        $\ldots$     $b,b$
$\vdots$                               $\vdots$     $\vdots$     $\vdots$     $\vdots$
$|\mathcal{A}_{1}|$                    $b,-b$       $b,b$        $\ldots$     $b,b$
Table 1: MLE $\hat{\boldsymbol{R}}_{h}(s,\cdot)$ before the attack

For simplicity, suppose that the same confidence half-width $\rho^{R}\left(\boldsymbol{a}\right)=\rho<b$ is used for all $\boldsymbol{a}$. Let $\iota\in(0,b)$ be arbitrary. For this instance, to install $\boldsymbol{\pi}^{\dagger}$ as the $\iota$-DSE, the attacker can flip all rewards in the way illustrated in Table 2, incurring a cost comparable to the upper bound in Theorem 8. The situation is the same for $n\geq 2$ learners.

$\mathcal{A}_{1}/\mathcal{A}_{2}$      $1$                     $\ldots$     $2,\ldots,|\mathcal{A}_{2}|$
$1$                                    $b,\,b$                 $\ldots$     $b,\,b-2\rho-\iota$
$\vdots$                               $\vdots$                $\vdots$     $\vdots$
$2,\ldots,|\mathcal{A}_{1}|$           $b-2\rho-\iota,\,b$     $\ldots$     $b-2\rho-\iota,\,b-2\rho-\iota$
Table 2: MLE $\hat{\boldsymbol{R}}_{h}(s,\cdot)$ after the attack

Our instance-dependent lower bound, presented later in Theorem 12, implies that any attack on this instance must have cost at least $NnA^{n-1}(2b+2\rho+\iota)$. This lower bound matches the refined upper bound in the proof of Theorem 9, implying the refined bounds are tight for this instance. Since the universal bound in Theorem 8 differs only by an $O(A)$ factor, it is nearly tight.

Instance-Dependent Cost Bounds

Next, we derive general bounds on the attack cost that depend on the structure of the underlying instance. Our strategy is to reduce the problem of bounding Markov Game costs to the easier problem of bounding Bandit Game costs. We begin by showing that the cost of poisoning a Markov Game dataset can be bounded in terms of the cost of poisoning the datasets corresponding to its individual period games.

Theorem 9.

For any feasible attack instance $I$, we have that $C^{*}(I_{H})\leq C^{*}(I)$ and

$$
C^{*}(I)\leq\sum_{h=1}^{H}C^{*}(I_{h})+2bnH|\mathcal{S}|\overline{N}+H^{2}\overline{\rho}\,|\mathcal{S}|\,n\,A^{n}\,\overline{N}.
$$

Here we see the effect of the learners’ uncertainty. If $\rho^{R}$ is small, then poisoning the Markov Game costs only slightly more than poisoning each bandit instance independently. This is desirable since it allows the attacker to solve the much easier bandit instances instead of the full problem.

The lower bound is valid for all Markov Games, but it is weak in that it only uses the last period cost. However, this is the most general lower bound one can obtain without additional assumptions on the structure of the game. If we assume additional structure on the dataset, then the above lower bound can be extended beyond the last period, forcing a higher attack cost.

Lemma 10.

Let $I$ be any feasible attack instance containing at least one uniform transition in $\textup{CI}_{h}^{P}$ for each period $h$; i.e., there is some $P_{h}(s'\mid s,\boldsymbol{a})\in\textup{CI}_{h}^{P}$ with $P_{h}(s'\mid s,\boldsymbol{a})=1/|\mathcal{S}|$ for all $h,s',s,\boldsymbol{a}$. Then, we have that

$$
C^{*}(I)\geq\sum_{h=1}^{H}C^{*}(I_{h}).
$$

In words, for these instances the optimal cost of poisoning is not too far from the optimal cost of poisoning each period game independently. We note this is where the effects of $\rho^{P}$ show themselves. If the dataset is highly uncertain about the transitions, it becomes likely that a uniform transition exists in $\textup{CI}^{P}$. Thus, a higher $\rho^{P}$ leads to a higher cost and effectively devolves the set of plausible games into a series of independent games.

Now that we have the above relationships, we can focus on bounding the attack cost for bandit games. To be precise, we bound the cost of poisoning a period game instance $I_{h}$. To this end, we define $\iota$-dominance gaps.

Definition 3.

(Dominance Gaps) For every $h\in[H]$, $s\in\mathcal{S}$, $i\in[n]$, and $a_{-i}\in\mathcal{A}_{-i}$, the $\iota$-dominance gap $d_{i,h}^{\iota}(s,a_{-i})$ is defined as

$$
d_{i,h}^{\iota}\left(s,a_{-i}\right):=\Big[\max_{a_{i}\neq\pi_{i,h}^{\dagger}(s)}\big[\hat{R}_{i,h}\big(s,(a_{i},a_{-i})\big)+\rho^{R}_{h}\big(s,(a_{i},a_{-i})\big)\big]-\hat{R}_{i,h}\Big(s,\big(\pi_{i,h}^{\dagger}(s),a_{-i}\big)\Big)+\rho^{R}_{h}\Big(s,\big(\pi_{i,h}^{\dagger}(s),a_{-i}\big)\Big)+\iota\Big]_{+}
$$

where $\hat{R}$ is the MLE w.r.t. the original dataset $\mathcal{D}$.

The dominance gaps measure the minimum amount by which the attacker would have to increase the reward of learner $i$, while the others are playing $a_{-i}$, so that the action $\pi_{i,h}^{\dagger}\left(s\right)$ becomes $\iota$-dominant for learner $i$. We then consolidate all the dominance gaps for period $h$ into the quantity $\Delta_{h}(\iota)$:

$$
\Delta_{h}(\iota):=\sum_{s\in\mathcal{S}}\sum_{i=1}^{n}\sum_{a_{-i}}\Big(d_{i,h}^{\iota}(s,a_{-i})+\delta_{i,h}^{\iota}(s,a_{-i})\Big),
$$

where $\delta_{i,h}^{\iota}(s,a_{-i})$ is a minor overflow term defined in the appendix. With all this machinery set up, we can give precise bounds on the minimal cost needed to attack a single period game.
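The dominance gaps are straightforward to compute from the MLE rewards and half-widths. A minimal sketch for a fixed period-state game follows, ignoring the overflow terms $\delta_{i,h}^{\iota}$ and assuming the dictionary-based layout used in the earlier sketches.

```python
from itertools import product

def dominance_gap(R_hat, rho_R, pi_dag, action_sets, i, a_minus, iota):
    """iota-dominance gap d_i^iota(a_{-i}) for a fixed period-state game.

    R_hat[a][i] : MLE reward of learner i at joint action a (a tuple)
    rho_R[a]    : confidence half-width at joint action a
    """
    tgt = a_minus[:i] + (pi_dag[i],) + a_minus[i:]
    best_dev = max(R_hat[a_minus[:i] + (ai,) + a_minus[i:]][i]
                   + rho_R[a_minus[:i] + (ai,) + a_minus[i:]]
                   for ai in action_sets[i] if ai != pi_dag[i])
    return max(0.0, best_dev - R_hat[tgt][i] + rho_R[tgt] + iota)

def Delta(R_hat, rho_R, pi_dag, action_sets, iota):
    """Sum of dominance gaps over learners and opponent actions (delta terms omitted)."""
    total = 0.0
    for i in range(len(action_sets)):
        others = [action_sets[j] for j in range(len(action_sets)) if j != i]
        for a_minus in product(*others):
            total += dominance_gap(R_hat, rho_R, pi_dag, action_sets, i, a_minus, iota)
    return total
```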

Lemma 11.

The optimal attack cost for $I_{h}$ satisfies

$$
\underline{N}_{h}\,\Delta_{h}(\iota)\leq C^{*}(I_{h})\leq\overline{N}_{h}\,\Delta_{h}(\iota).
$$

Combining these bounds with Theorem 9 gives complete attack cost bounds for general Markov game instances.

The lower bounds in both Lemma 10 and Lemma 11 expose an exponential dependence on $n$, the number of players, for some datasets $\mathcal{D}$. These instances essentially require the attacker to modify $\hat{R}_{i,h}(s,\boldsymbol{a})$ for every $\boldsymbol{a}\in\mathcal{A}$. A concrete instance can be constructed by taking the high-cost dataset from the tightness example above and extending it into a general Markov Game. We simply do this by giving the game several identical states and uniform transitions. In terms of the dataset, each episode consists of independent plays of the same normal-form game, possibly with a different state observed. For this dataset, the $\iota$-dominance gap can be shown to be $d_{i,h}^{\iota}\left(s,a_{-i}\right)=2b+2\rho+\iota$. A direct application of Lemma 10 gives the following explicit lower bound.

Theorem 12.

There exists a feasible attack instance $I$ for which

$$
C^{*}(I)\geq\underline{N}H\left|\mathcal{S}\right|nA^{n-1}\left(2b+2\rho+\iota\right).
$$

Recall that the attacker wants to assume little about the learners, and therefore chooses to install an $\iota$-MPDSE (instead of making stronger assumptions on the learners and installing a Nash equilibrium or a non-Markov perfect equilibrium). On some datasets $\mathcal{D}$, the exponential poisoning cost is the price the attacker pays for this flexibility.

5 Conclusion

We studied a security threat to offline MARL where an attacker can force learners into executing an arbitrary Dominant Strategy Equilibrium by minimally poisoning historical data. We showed that the attack problem can be formulated as a linear program, and provided analysis on the attack feasibility and cost. This paper thus helps to raise awareness on the trustworthiness of multi-agent learning. We encourage the community to study defense against such attacks, e.g. via robust statistics and reinforcement learning.

Acknowledgements

McMahan is supported in part by NSF grant 2023239. Zhu is supported in part by NSF grants 1545481, 1704117, 1836978, 2023239, 2041428, 2202457, ARO MURI W911NF2110317, and AF CoE FA9550-18-1-0166. Xie is partially supported by NSF grant 1955997 and JP Morgan Faculty Research Awards. We also thank Yudong Chen for his useful comments and discussions.

References

  • Anderson, Shoham, and Altman (2010) Anderson, A.; Shoham, Y.; and Altman, A. 2010. Internal implementation. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, 191–198. Citeseer.
  • Banihashem et al. (2022) Banihashem, K.; Singla, A.; Gan, J.; and Radanovic, G. 2022. Admissible Policy Teaching through Reward Design. arXiv preprint arXiv:2201.02185.
  • Banihashem, Singla, and Radanovic (2021) Banihashem, K.; Singla, A.; and Radanovic, G. 2021. Defense against reward poisoning attacks in reinforcement learning. arXiv preprint arXiv:2102.05776.
  • Bogunovic et al. (2021) Bogunovic, I.; Losalka, A.; Krause, A.; and Scarlett, J. 2021. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, 991–999. PMLR.
  • Cui and Du (2022) Cui, Q.; and Du, S. S. 2022. When is Offline Two-Player Zero-Sum Markov Game Solvable? arXiv preprint arXiv:2201.03522.
  • Garcelon et al. (2020) Garcelon, E.; Roziere, B.; Meunier, L.; Teytaud, O.; Lazaric, A.; and Pirotta, M. 2020. Adversarial Attacks on Linear Contextual Bandits. arXiv preprint arXiv:2002.03839.
  • Gleave et al. (2019) Gleave, A.; Dennis, M.; Wild, C.; Kant, N.; Levine, S.; and Russell, S. 2019. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615.
  • Guan et al. (2020) Guan, Z.; Ji, K.; Bucci Jr, D. J.; Hu, T. Y.; Palombo, J.; Liston, M.; and Liang, Y. 2020. Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4036–4043.
  • Guo et al. (2021) Guo, W.; Wu, X.; Huang, S.; and Xing, X. 2021. Adversarial policy learning in two-player competitive games. In International Conference on Machine Learning, 3910–3919. PMLR.
  • Huang and Zhu (2019) Huang, Y.; and Zhu, Q. 2019. Deceptive reinforcement learning under adversarial manipulations on cost signals. In International Conference on Decision and Game Theory for Security, 217–237. Springer.
  • Jiang and Lu (2021) Jiang, J.; and Lu, Z. 2021. Offline decentralized multi-agent reinforcement learning. arXiv preprint arXiv:2108.01832.
  • Jun et al. (2018) Jun, K.-S.; Li, L.; Ma, Y.; and Zhu, J. 2018. Adversarial attacks on stochastic bandits. Advances in Neural Information Processing Systems, 31: 3640–3649.
  • Littman (1994) Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, 157–163. Elsevier.
  • Liu and Shroff (2019) Liu, F.; and Shroff, N. 2019. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, 4042–4050. PMLR.
  • Liu and Lai (2021) Liu, G.; and Lai, L. 2021. Provably Efficient Black-Box Action Poisoning Attacks Against Reinforcement Learning. Advances in Neural Information Processing Systems, 34.
  • Lu, Wang, and Zhang (2021) Lu, S.; Wang, G.; and Zhang, L. 2021. Stochastic Graphical Bandits with Adversarial Corruptions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 8749–8757.
  • Lykouris et al. (2021) Lykouris, T.; Simchowitz, M.; Slivkins, A.; and Sun, W. 2021. Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory, 3242–3245. PMLR.
  • Ma et al. (2018) Ma, Y.; Jun, K.-S.; Li, L.; and Zhu, X. 2018. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, 186–204. Springer.
  • Ma, Wu, and Zhu (2021) Ma, Y.; Wu, Y.; and Zhu, X. 2021. Game Redesign in No-regret Game Playing. arXiv preprint arXiv:2110.11763.
  • Ma et al. (2019) Ma, Y.; Zhang, X.; Sun, W.; and Zhu, J. 2019. Policy poisoning in batch reinforcement learning and control. Advances in Neural Information Processing Systems, 32: 14570–14580.
  • Monderer and Tennenholtz (2004) Monderer, D.; and Tennenholtz, M. 2004. k-Implementation. Journal of Artificial Intelligence Research, 21: 37–62.
  • Pan et al. (2022) Pan, L.; Huang, L.; Ma, T.; and Xu, H. 2022. Plan better amid conservatism: Offline multi-agent reinforcement learning with actor rectification. In International Conference on Machine Learning, 17221–17237. PMLR.
  • Rakhsha et al. (2020) Rakhsha, A.; Radanovic, G.; Devidze, R.; Zhu, X.; and Singla, A. 2020. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, 7974–7984. PMLR.
  • Rakhsha et al. (2021a) Rakhsha, A.; Radanovic, G.; Devidze, R.; Zhu, X.; and Singla, A. 2021a. Policy teaching in reinforcement learning via environment poisoning attacks. Journal of Machine Learning Research, 22(210): 1–45.
  • Rakhsha et al. (2021b) Rakhsha, A.; Zhang, X.; Zhu, X.; and Singla, A. 2021b. Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments. arXiv preprint arXiv:2102.08492.
  • Rangi et al. (2022a) Rangi, A.; Tran-Thanh, L.; Xu, H.; and Franceschetti, M. 2022a. Saving stochastic bandits from poisoning attacks via limited data verification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 8054–8061.
  • Rangi et al. (2022b) Rangi, A.; Xu, H.; Tran-Thanh, L.; and Franceschetti, M. 2022b. Understanding the Limits of Poisoning Attacks in Episodic Reinforcement Learning. In Raedt, L. D., ed., Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 3394–3400. International Joint Conferences on Artificial Intelligence Organization. Main Track.
  • Shapley (1953) Shapley, L. S. 1953. Stochastic games. Proceedings of the national academy of sciences, 39(10): 1095–1100.
  • Sun, Huo, and Huang (2020) Sun, Y.; Huo, D.; and Huang, F. 2020. Vulnerability-aware poisoning mechanism for online rl with unknown dynamics. arXiv preprint arXiv:2009.00774.
  • Wei, Dann, and Zimmert (2022) Wei, C.-Y.; Dann, C.; and Zimmert, J. 2022. A model selection approach for corruption robust reinforcement learning. In International Conference on Algorithmic Learning Theory, 1043–1096. PMLR.
  • Wu et al. (2022) Wu, F.; Li, L.; Xu, C.; Zhang, H.; Kailkhura, B.; Kenthapadi, K.; Zhao, D.; and Li, B. 2022. COPA: Certifying Robust Policies for Offline Reinforcement Learning against Poisoning Attacks. arXiv preprint arXiv:2203.08398.
  • Xie et al. (2020) Xie, Q.; Chen, Y.; Wang, Z.; and Yang, Z. 2020. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on learning theory, 3674–3682. PMLR.
  • Yang et al. (2021) Yang, L.; Hajiesmaili, M.; Talebi, M. S.; Lui, J.; and Wong, W. S. 2021. Adversarial Bandits with Corruptions: Regret Lower Bound and No-regret Algorithm. In Advances in Neural Information Processing Systems (NeurIPS).
  • Zhang and Parkes (2008) Zhang, H.; and Parkes, D. C. 2008. Value-Based Policy Teaching with Active Indirect Elicitation. In AAAI, volume 8, 208–214.
  • Zhang, Parkes, and Chen (2009) Zhang, H.; Parkes, D. C.; and Chen, Y. 2009. Policy teaching through reward function learning. In Proceedings of the 10th ACM conference on Electronic commerce, 295–304.
  • Zhang, Yang, and Başar (2021) Zhang, K.; Yang, Z.; and Başar, T. 2021. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, 321–384.
  • Zhang et al. (2021a) Zhang, X.; Chen, Y.; Zhu, J.; and Sun, W. 2021a. Corruption-robust offline reinforcement learning. arXiv preprint arXiv:2106.06630.
  • Zhang et al. (2021b) Zhang, X.; Chen, Y.; Zhu, X.; and Sun, W. 2021b. Robust policy gradient against strong data corruption. In International Conference on Machine Learning, 12391–12401. PMLR.
  • Zhang et al. (2020) Zhang, X.; Ma, Y.; Singla, A.; and Zhu, X. 2020. Adaptive reward-poisoning attacks against reinforcement learning. In International Conference on Machine Learning, 11225–11234. PMLR.
  • Zhong et al. (2022) Zhong, H.; Xiong, W.; Tan, J.; Wang, L.; Zhang, T.; Wang, Z.; and Yang, Z. 2022. Pessimistic minimax value iteration: Provably efficient equilibrium learning from offline datasets. arXiv preprint arXiv:2202.07511.
  • Zuo (2020) Zuo, S. 2020. Near Optimal Adversarial Attack on UCB Bandits. arXiv preprint arXiv:2008.09312.