
Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), May 19–23, 2025, Detroit, Michigan, USA. Y. Vorobeychik, S. Das, A. Nowé (eds.).

Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return

Nianli Peng (Harvard University, Cambridge, MA, USA) nianli_peng@g.harvard.edu, Muhang Tian (Duke University, Durham, NC, USA) muhang.tian@duke.edu, and Brandon Fain (Duke University, Durham, NC, USA) btfain@cs.duke.edu
Abstract.

We study multi-objective reinforcement learning with nonlinear preferences over trajectories. That is, we maximize the expected value of a nonlinear function over accumulated rewards (expected scalarized return or ESR) in a multi-objective Markov Decision Process (MOMDP). We derive an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. Using this formulation, we describe an approximation algorithm for computing an approximately optimal non-stationary policy in pseudopolynomial time for smooth scalarization functions with a constant number of rewards. We prove the approximation analytically and demonstrate the algorithm experimentally, showing that there can be a substantial performance gap between the policy computed by our algorithm and alternative baselines.

Key words and phrases:
Multi-objective Reinforcement Learning, Nonlinear Optimization, Algorithmic Fairness, Approximation Algorithms
† Equal contribution.

1. Introduction

Markov Decision Processes (MDPs) model goal-driven interaction with a stochastic environment, typically aiming to maximize the total scalar-valued reward accumulated over time steps through a learned policy. Equivalently, this formulation asks the agent to maximize the expected value of a linear function of total reward. This problem can be solved with provable approximation guarantees in polynomial time with respect to the size of the MDP Kearns and Singh (2002); Brafman and Tennenholtz (2003); Auer et al. (2008); Azar et al. (2017); Agarwal et al. (2021).

We extend this framework to optimize a nonlinear function of vector-valued rewards in multi-objective MDPs (MOMDPs), aiming to maximize expected welfare $\mathbb{E}[W(\mathbf{r})]$, where $\mathbf{r}$ is the total reward vector over $d$ objectives. We note that this function is also called the utility or the scalarization function within the multi-objective optimization literature Hayes et al. (2023). Unlike the deep neural networks commonly used as function approximators for large state spaces, our nonlinearity lies entirely in the objective function.

Motivation.

Nonlinear welfare functions capture richer preferences for agents, such as fairness or risk attitudes. For example, the Nash Social Welfare function $W_{\text{Nash}}(\mathbf{r})=\left(\prod_{i}r_{i}\right)^{1/d}$ reflects a desire for fairness or balance across objectives and diminishing marginal returns Caragiannis et al. (2019). Even single-objective decision theory uses nonlinear utility functions to model risk preferences, like risk aversion with the Von Neumann-Morgenstern utility function Von Neumann and Morgenstern (1947).

Consider a minimal example with an autonomous taxi robot, Robbie, serving rides in neighborhoods $A$ and $B$, as diagrammed in Figure 1. Each ride yields a reward in its respective dimension.

Figure 1. Taxi Optimization Example. (Serving a ride in $A$ yields reward $(1,0)$, serving a ride in $B$ yields $(0,1)$, and traveling between the neighborhoods yields $(0,0)$.)

Suppose Robbie starts in neighborhood $A$ and has $t=3$ time intervals remaining before recharging. With no discounting, the Pareto Frontier of maximal (that is, undominated) policies can achieve cumulative reward vectors of $(3,0)$, $(1,1)$, or $(0,2)$ by serving $A$ alone, $A$ and $B$, or $B$ alone, respectively. If we want Robbie to prefer the second, more balanced or “fair” option of serving one ride in each of the two neighborhoods, then Robbie must have nonlinear preferences. That is, for any choice of weights on the first and second objective, the simple weighted average would prefer outcome $(3,0)$ or $(0,2)$. However, the second option would maximize the Nash Social Welfare of cumulative reward, for example.
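As a quick sanity check of this claim, the following minimal Python sketch compares a weighted-sum scalarization with Nash Social Welfare on the three Pareto-optimal outcomes above; the particular weight vector shown is an arbitrary illustrative choice.

```python
import numpy as np

# Pareto-optimal cumulative reward vectors from the taxi example.
outcomes = {"serve A only": np.array([3.0, 0.0]),
            "serve A and B": np.array([1.0, 1.0]),
            "serve B only": np.array([0.0, 2.0])}

def linear_welfare(r, w):
    """Weighted-sum scalarization."""
    return float(np.dot(w, r))

def nash_welfare(r):
    """Nash Social Welfare: geometric mean of the reward components."""
    return float(np.prod(r) ** (1.0 / len(r)))

# No weight vector makes the balanced outcome (1, 1) strictly best under the
# weighted sum, while Nash welfare prefers it (1.0 versus 0.0 for the extremes).
for name, r in outcomes.items():
    print(name, linear_welfare(r, np.array([0.5, 0.5])), nash_welfare(r))
```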

Nonlinear preferences complicate policy computation: Bellman optimality fails, and stationary policies may be suboptimal. Intuitively, a learning agent with fairness-oriented preferences to balance objectives should behave differently, even in the same state, depending on which dimension of reward is “worse off.” In the Figure 1 example, one policy to achieve the balanced objective $(1,1)$ is to complete a ride in $A$, then travel from $A$ to $B$, and finally to complete a ride in $B$ – note that this is not stationary with respect to the environment states.

Contributions.

In this work, we ask whether it is possible to describe an approximation algorithm (that is, with provable guarantees to approximate the welfare optimal policy) for MOMDPs with nonlinear preferences that has polynomial dependence on the number of states and actions, as is the case for linear preferences or scalar MDPs. To the best of our knowledge, ours is the first work to give provable guarantees for this problem, compared to other work that focuses on empirical evaluation of various neural network architectures.

We show this is possible for smooth preferences and a constant number of dimensions of reward. To accomplish this, we (i) derive an extended form of Bellman optimality (which may be of independent interest) that characterizes optimal policies for nonlinear preferences over multiple objectives, (ii) describe an algorithm for computing approximately optimal non-stationary policies, (iii) prove the worst-case approximation properties of our algorithm, and (iv) demonstrate empirically that our algorithm can be used in large state spaces to find policies that significantly outperform other baselines.

2. Related Work

Most reinforcement learning algorithms focus on a single scalar-valued objective and maximizing total expected reward Sutton and Barto (2018). Classic results on provable approximation and runtime guarantees for reinforcement learning include the E3 algorithm Kearns and Singh (2002). This result showed that general MDPs could be solved to near-optimality efficiently, meaning in time bounded by a polynomial in the size of the MDP (number of states and actions) and the horizon time. Subsequent results refined the achievable bounds Brafman and Tennenholtz (2003); Auer et al. (2008); Azar et al. (2017). We extend these results to the multi-objective case with nonlinear preferences.

Multi-objective reinforcement learning optimizes multiple objectives at once. So-called single-policy methods use a scalarization function to reduce the problem to scalar optimization for a single policy, and we follow this line of research. The simplest form is linear scalarization, applying a weighted sum to the Q vector Moffaert et al. (2013); Liu et al. (2014); Abels et al. (2019); Yang et al. (2019); Alegre et al. (2023).

A more general problem is to optimize the expected value of a potentially nonlinear function of the total reward, which may be vector-valued in a multi-objective optimization context. We refer to such a function as a welfare function Barman et al. (2023); Fan et al. (2023); Siddique et al. (2020), which is also commonly referred to as a utility or scalarization function Hayes et al. (2023); Agarwal et al. (2022). Recent works have explored nonlinear objectives; however, to our knowledge, ours is the first to provide an approximation algorithm with provable guarantees on the approximation factor for the expected welfare, leveraging a characterization of recursive optimality in this setting. Several other studies focus on algorithms that demonstrate desirable convergence properties and strong empirical performance by conditioning function approximators on accumulated reward, but without offering approximation guarantees Siddique et al. (2020); Fan et al. (2023); Cai et al. (2023); Reymond et al. (2023). Complementary to our approach, Reymond et al. (2022) uses Pareto Conditioned Networks to learn Pareto-optimal policies by conditioning on a preference vector. Lin et al. (2024) presents an offline adaptation framework that employs demonstrations to implicitly learn preferences and safety constraints, aligning policies with inferred preferences rather than providing theoretical guarantees.

Agarwal et al. (2022) describe another model-based algorithm that can compute approximately optimal policies for a general class of monotone and Lipschitz welfare functions, but rather than maximizing the expected welfare, they maximize the welfare of expected rewards (note the two are not equal for nonlinear welfare functions). Other works have formulated fairness in different ways or settings. For example, Jabbari et al. (2017) defines an analogue of envy freeness and Deng et al. (2023) studies a per-time-step fairness guarantee. Barman et al. (2023) studies welfare maximization in multi-armed bandit problems. Röpke et al. (2023) explores the concept of distributional multi-objective decision making for managing uncertainty in multi-objective environments.

Risk-sensitive RL approaches address scalar objectives by incorporating risk measures to minimize regret or control reward variance over accumulated rewards Bäuerle and Ott (2011); Bellemare et al. (2023); Bastani et al. (2022). While these approaches offer valuable tools for managing reward variability, their guarantees are primarily in terms of regret minimization or achieving bounded variance. In contrast, our work provides stronger theoretical assurances in terms of the approximation ratio on the expected welfare for multi-objective optimization. This difference in focus underscores the robustness of our method, which provides guarantees that extend beyond the risk-sensitive regime to cover complex, multi-dimensional utility functions Fan et al. (2023); Brafman and Tennenholtz (2003); Azar et al. (2017). Such problems remain computationally significant even in a deterministic environment where the notion of risk may not apply.

Lastly, while our single-agent setup with multiple objectives shares some aspects with multi-agent reinforcement learning (MARL), the objectives differ significantly. Much of the MARL literature has focused on cooperative reward settings, often using value-decomposition techniques like VDN and QMIX Sunehag et al. (2017); Rashid et al. (2018) or actor-critic frameworks to align agent objectives under a centralized training and decentralized execution paradigm Foerster et al. (2018). In contrast, our work parallels the more general Markov game setting in which each agent has a unique reward function, and we study nonlinear objectives that require computational methods beyond the linear welfare functions often assumed in MARL. While MARL research frequently uses linear summations of agent rewards, we demonstrate approximation guarantees for optimizing general, nonlinear functions of multiple reward vectors, a distinct contribution in a single-agent setting Jabbari et al. (2017); Agarwal et al. (2022); Von Neumann and Morgenstern (1947).

3. Preliminaries

A finite Multi-objective Markov Decision Process (MOMDP) consists of a finite set $\mathcal{S}$ of states, a starting state $s_{1}\in\mathcal{S}$ (in general we may have a distribution over starting states; we assume a single starting state for ease of exposition), a finite set $\mathcal{A}$ of actions, and probabilities $Pr(s^{\prime}|s,a)\in[0,1]$ that determine the probability of transitioning to state $s^{\prime}$ from state $s$ after taking action $a$. Probabilities are normalized so that $\sum_{s^{\prime}}Pr(s^{\prime}|s,a)=1$.

We have a finite vector-valued reward function $\mathbf{R}(s,a):\mathcal{S}\times\mathcal{A}\rightarrow[0,1]^{d}$. Each of the $d$ dimensions of the reward vector corresponds to one of the multiple objectives that are to be maximized. At each time step $t$, the agent observes state $s_{t}\in\mathcal{S}$, takes action $a_{t}\in\mathcal{A}(s_{t})$, and receives a reward vector $\mathbf{R}(s_{t},a_{t})\in[0,1]^{d}$. The environment, in turn, transitions to $s_{t+1}$ with probability $Pr(s_{t+1}|s_{t},a_{t})$.

To make the optimization objective well-posed in MOMDPs with vector-valued rewards, we must specify a scalarization function Hayes et al. (2023), which we denote as $W:\mathbb{R}^{d}\to\mathbb{R}$. For fair multi-objective reinforcement learning, we think of each of the $d$ dimensions of the reward vector as corresponding to distinct users. The scalarization function can thus be thought of as a welfare function over the users, and the learning agent is a welfare maximizer. Even when $d=1$, a nonlinear function $W$ can be a Von Neumann-Morgenstern utility function Von Neumann and Morgenstern (1947) that expresses the risk-attitudes of the learning agent, with strictly concave functions expressing risk aversion.

Assumptions.

Here we clarify some preliminary assumptions we make about the reward function $\mathbf{R}(s,a):\mathcal{S}\times\mathcal{A}\rightarrow[0,1]^{d}$ and the welfare function $W:\mathbb{R}^{d}\to\mathbb{R}$. The restriction to $[0,1]$ is simply a normalization for ease of exposition; the substantial assumption is that the rewards are finite and bounded. Note that we assume the reward function is deterministic: while it suffices for linear optimization to simply learn the mean of random reward distributions, this does not hold when optimizing the expected value of a nonlinear function. Nevertheless, the environment itself may still be stochastic, as the state transitions may still be random.

We also assume that $W$ is smooth. For convenience of analysis, we will assume $W$ is uniformly continuous with respect to the L1 norm (other parameterizations, such as the stronger Lipschitz continuity or the L2 norm, are also possible). For all $\epsilon>0$, there exists $\delta_{\epsilon}$ such that for all $\mathbf{x},\mathbf{y}\geq 0$,

|W(\mathbf{x})-W(\mathbf{y})|<\epsilon\quad\text{if}\quad\|\mathbf{x}-\mathbf{y}\|_{1}<\delta_{\epsilon}.

The smoothness assumption seems necessary to give worst-case approximation guarantees as otherwise arbitrarily small changes in accumulated reward could have arbitrarily large differences in welfare. However, we note that this is still significantly more general than linear scalarization which is implicitly smooth. Practically speaking, our algorithms can be run regardless of the assumed level of smoothness; a particular smoothness is necessary just for the worst-case analysis.

4. Modeling Optimality

In this section, we expand the classic model of reinforcement learning to optimize the expected value of a nonlinear function of (possibly) multiple dimensions of reward. We begin with the notion of a trajectory of state-action pairs.

Definition 1.

Let $M$ be an MOMDP. A length $T$ trajectory in $M$ is a tuple $\tau$ of $T$ state-action pairs

(s_{1},a_{1}),(s_{2},a_{2}),\dots,(s_{T},a_{T}).

For $1\leq k\leq k^{\prime}\leq T$, let $\tau_{k:k^{\prime}}$ be the sub-trajectory consisting of pairs $(s_{k},a_{k}),\dots,(s_{k^{\prime}},a_{k^{\prime}})$. Let $\tau_{0:0}$ denote the empty trajectory.

For a discount factor $\gamma$, we calculate the total discounted reward of a trajectory. Note that this is a vector in general.

Definition 2.

For a length $T$ trajectory $\tau$ and discount factor $0\leq\gamma\leq 1$, the total discounted reward along $\tau$ is the vector

\mathbf{R}(\tau,\gamma)=\sum_{t=1}^{T}\gamma^{t-1}\mathbf{R}(s_{t},a_{t}).

For ease of exposition we will frequently leave $\gamma$ implicit from context and simply write $\mathbf{R}(\tau)$. A discount factor $\gamma<1$ is necessary for the infinite-horizon setting; in the experiments with a finite-horizon task we use $\gamma=1$ for simplicity.
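As a minimal illustration of Definition 2, the following Python sketch computes the vector-valued total discounted reward of a trajectory; the representation of trajectories as (state, action) pairs and the reward_fn interface are assumptions made for the example.

```python
import numpy as np

def total_discounted_reward(trajectory, reward_fn, gamma=1.0):
    """Total discounted reward R(tau, gamma) = sum_t gamma^(t-1) * R(s_t, a_t).

    trajectory: list of (state, action) pairs (0-indexed here, so gamma**t
    corresponds to gamma^(t-1) in the 1-indexed definition);
    reward_fn(s, a) returns a length-d array.
    """
    return sum(gamma ** t * np.asarray(reward_fn(s, a), dtype=float)
               for t, (s, a) in enumerate(trajectory))
```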

A policy is a function $\pi(a\mid\tau,s)\in[0,1]$ mapping past trajectories and current states to probability distributions over actions, that is, $\sum_{a}\pi(a\mid\tau,s)=1$ for all $\tau$ and $s$. A stationary policy is the special case of a policy that depends only on the current state: $\pi(a\mid s)$.

Definition 3.

The probability that a $T$-trajectory $\tau$ is traversed in an MOMDP $M$ upon starting in state $s_{1}$ and executing policy $\pi$ is

Pr^{\pi}[\tau]=\pi(a_{1}\mid\tau_{0:0},s_{1})\times\prod_{t=2}^{T}\pi\left(a_{t}\mid\tau_{1:t-1},s_{t}\right)Pr(s_{t}|s_{t-1},a_{t-1}).

Problem Formulation.

Given a policy, a finite time horizon $T$, and a starting state $s_{1}$, we can calculate the expected welfare of total discounted reward along a trajectory as follows. Our goal is to maximize this quantity. That is, we want to compute a policy that maximizes the expected $T$-step discounted welfare.

Definition 4.

For a policy $\pi$ and a start state $s$, the expected $T$-step discounted welfare is

\mathbb{E}_{\tau\sim\pi}\bigl[W\bigl(\mathbf{R}(\tau)\bigr)\bigr]=\sum_{\tau}Pr^{\pi}[\tau]W(\mathbf{R}(\tau))

where the expectation is taken over all length $T$ trajectories beginning at $s$.

Note that this objective is not equal to $W\left(\mathbb{E}_{\tau\sim\pi}\bigl[\mathbf{R}(\tau)\bigr]\right)$, which others have studied Agarwal et al. (2022); Siddique et al. (2020), for a nonlinear $W$. The former (our objective) is also known as expected scalarized return (ESR), whereas the latter is also known as scalarized expected return (SER) Hayes et al. (2023). While SER makes sense for a repeated decision-making problem, it does not optimize for expected welfare on any particular trajectory. For concave $W$, ESR $\leq$ SER by Jensen’s inequality. However, an algorithm for approximating SER does not provide any guarantee for approximating ESR. For example, a policy can be optimal on SER but achieve 0 ESR if it achieves high reward on one or the other of two objectives but never both in the same episode.
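To make the ESR/SER gap concrete, the following minimal sketch evaluates both objectives for a hypothetical policy whose episodes yield reward $(1,0)$ or $(0,1)$ with equal probability, using Nash welfare as $W$.

```python
import numpy as np

# Two equally likely episode outcomes under a hypothetical policy.
outcomes = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
probs = [0.5, 0.5]

def W(r):
    """Nash welfare (geometric mean), a concave scalarization."""
    return float(np.prod(r) ** (1.0 / len(r)))

esr = sum(p * W(r) for p, r in zip(probs, outcomes))   # E[W(R)] = 0.0
ser = W(sum(p * r for p, r in zip(probs, outcomes)))   # W(E[R]) = 0.5
print(esr, ser)  # ESR <= SER, consistent with Jensen's inequality
```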

Form of Optimal Policy and Value Functions.

The optimal policy for this finite time horizon setting is a function also of the number of time steps remaining. We write such a policy as $\pi(a\mid s,\tau,t)$, where $\tau$ is a trajectory (the history), $s$ is the current state, and $t$ is the number of time steps remaining, i.e., $t=T-|\tau|$.

We can similarly write the extended value function of a policy $\pi$. We write $\tau$ as the history or trajectory prior to some current state $s$ and $\tau^{\prime}$ as the future, the remaining $t$ steps determined by the policy $\pi$ and the environmental transitions.

Definition 5.

The value of a policy $\pi$ beginning at state $s$ after history $\tau$ and with $t$ more actions is

V^{\pi}(s,\tau,t)=\mathbb{E}_{\tau^{\prime}\sim\pi}\left[W(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(\tau^{\prime}))\right]

where the expectation is taken over all length $t$ trajectories $\tau^{\prime}$ beginning at $s$. The optimal value function is

V^{*}(s,\tau,t)=\max_{\pi}V^{\pi}(s,\tau,t)

and the optimal policy is

\pi^{*}\in\operatorname*{arg\,max}_{\pi}V^{\pi}(s,\tau,t).

Note that because $W$ is nonlinear, the value function $V^{\pi}$ cannot be decomposed in the same way as in the traditional Bellman equations. Before proceeding we want to provide some intuition for this point. The same reasoning helps to explain why stationary policies are not generally optimal.

Suppose we are optimizing the product of the reward between two objectives (i.e., the welfare function is the product or geometric mean), and at some state $s$ with some prior history $\tau$ we can choose between two policies $\pi_{1}$ or $\pi_{2}$. Suppose that the future discounted reward vector under $\pi_{1}$ is $(1,1)$, whereas it is $(10,0)$ under $\pi_{2}$. So $\pi_{1}$ has greater expected future welfare and traditional Bellman optimality would suggest we should choose $\pi_{1}$. However, if $\mathbf{R}(\tau)=(0,10)$, we would actually be better off in terms of total welfare choosing $\pi_{2}$. In other words, both past and future reward are relevant when optimizing for the expected value of a nonlinear welfare function.
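The arithmetic behind this intuition is spelled out in the minimal sketch below, using the product welfare and exactly the numbers of the preceding paragraph.

```python
import numpy as np

# Product welfare W(r) = r_1 * r_2, prior accumulated reward (0, 10).
W = lambda r: float(np.prod(r))

past = np.array([0.0, 10.0])
future_pi1 = np.array([1.0, 1.0])    # higher "future welfare": W(future) = 1
future_pi2 = np.array([10.0, 0.0])   # lower "future welfare":  W(future) = 0

print(W(past + future_pi1))  # 11.0
print(W(past + future_pi2))  # 100.0 -> pi_2 is better once past reward is accounted for
```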

We develop an extended form of Bellman optimality capturing this intuition by showing that the optimal value function can be written as a function of the current state $s$, accumulated discounted reward $\mathbf{R}(\tau)$, and the number of timesteps remaining in the task $t$. The proof is included in the appendix. At a high level, the argument proceeds inductively on $t$, where the base case follows by definition and the inductive step hinges on showing that the expectation over future trajectories can be decomposed into an expectation over successor states and subsequent trajectories despite the nonlinear $W$.

Lemma 6.

Let $\mathcal{V}(s,\mathbf{R}(\tau),0)=W(\mathbf{R}(\tau))$ for all states $s$ and trajectories $\tau$. For every state $s$, history $\tau$, and $t>0$ time steps remaining, let

\mathcal{V}(s,\mathbf{R}(\tau),t)=\max_{a}\mathbb{E}_{s^{\prime}}\left[\mathcal{V}(s^{\prime},\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a),t-1)\right].

Then $V^{*}(s,\tau,t)=\mathcal{V}(s,\mathbf{R}(\tau),t)$.

By Lemma 6, we can parameterize the optimal value function by the current state $s$, accumulated reward $\mathbf{R_{acc}}\in\mathbb{R}^{d}$, and the number of timesteps remaining $t$. We will use this formulation of $V^{*}$ in the remainder of the paper.

Definition 7 (Recursive Formulation of $V^{*}$).

Let $\mathbf{R_{acc}}=\mathbf{R}(\tau)$ be the vector of accumulated reward along a history prior to some state $s$. Let $V^{*}(s,\mathbf{R_{acc}},0)=W(\mathbf{R_{acc}})$ for all $s$ and $\mathbf{R_{acc}}$. For $t>0$,

V^{*}(s,\mathbf{R}_{acc},t)=\max_{a}\sum_{s^{\prime}}Pr(s^{\prime}|s,a)\cdot V^{*}(s^{\prime},\mathbf{R}_{acc}+\gamma^{T-t}\mathbf{R}(s,a),t-1).

Horizon Time.

Note that an approximation algorithm for this discounted finite time horizon problem can also be used as an approximation algorithm for the discounted infinite time horizon problem. Informally, discounting by $\gamma$ with bounded maximum reward implies that the first $T\approx 1/(1-\gamma)$ steps dominate overall returns. We defer the precise formulation of the lower bound on the horizon time and its proof to the appendix (Lemma 13).

Necessity of Conditioning on Remaining Timesteps $t$.

We illustrate the necessity of conditioning the optimal value function on the remaining timesteps $t$ using a counterexample in the appendix.

5. Computing Optimal Policies

Our overall algorithm to compute an approximately optimal policy is Reward-Aware Explore or Exploit (or RAEE for short), inspired by the classical E3 algorithm Kearns and Singh (2002). At a high level, the algorithm explores to learn a model of the environment, periodically pausing to recompute an approximately optimal policy on the subset of the environment that has been thoroughly explored. We call this optimization subroutine Reward-Aware Value Iteration (or RAVI for short). In both cases, reward-aware refers to the fact that the algorithms compute non-stationary policies in the sense that optimal behavior depends on the currently accumulated vector-valued reward, in addition to the current state.

Of the two, the model-based optimization subroutine RAVI is the more significant. The integrated model-learning algorithm RAEE largely follows from prior work, given access to the RAVI subroutine. For this reason, we focus in this section on RAVI, and defer a more complete discussion and analysis of RAEE to Section 7 and the appendix.

A naive algorithm for computing a non-stationary policy would need to consider all possible prior trajectories for each decision point, leading to a runtime complexity containing the term $|\mathcal{S}|^{T}$, exponential in the time horizon. Instead, for a smooth welfare function on a constant number of objectives, our algorithm will avoid any exponential dependence on $|\mathcal{S}|$.

The algorithm is derived from the recursive definition of the optimal multi-objective value function $V^{*}$ in Definition 7, justified by Lemma 6, parameterized by the accumulated discounted reward vector $\mathbf{R_{acc}}$ instead of the prior history. Note that even if the rewards were integers, $\mathbf{R_{acc}}$ might not be due to discounting. We must therefore introduce a discretization that maps accumulated reward vectors to points on a lattice, parameterized by some $\alpha\in(0,1)$, where a smaller $\alpha$ leads to a finer discretization but increases the runtime.

Definition 8.

For a given discretization precision parameter $\alpha\in\mathbb{R}^{+}$, define $f_{\alpha}:\mathbb{R}^{d}\to(\alpha\mathbb{Z})^{d}$ by

f_{\alpha}(\mathbf{R})=\left(\left\lfloor\frac{R_{1}}{\alpha}\right\rfloor\cdot\alpha,\left\lfloor\frac{R_{2}}{\alpha}\right\rfloor\cdot\alpha,\cdots,\left\lfloor\frac{R_{d}}{\alpha}\right\rfloor\cdot\alpha\right).

In other words, $f_{\alpha}$ maps any $d$-dimensional vector to the largest vector in $(\alpha\mathbb{Z})^{d}$ that is less than or equal to the input vector, effectively rounding each component down to the nearest multiple of $\alpha$. We now describe the algorithm, which at a high level computes the dynamic program of approximately welfare-optimal value functions conditioned on discretized accumulated reward vectors.
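A one-function Python sketch of the discretization map $f_{\alpha}$ from Definition 8; the example values are purely illustrative.

```python
import numpy as np

def f_alpha(reward_vec, alpha):
    """Round each component of a d-dimensional reward vector down to the
    nearest multiple of alpha (the lattice map f_alpha of Definition 8)."""
    return np.floor(np.asarray(reward_vec, dtype=float) / alpha) * alpha

# Example: with alpha = 0.25, the vector (0.9, 1.3) maps to (0.75, 1.25).
print(f_alpha([0.9, 1.3], 0.25))
```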

Remark. Since we model the per-step reward as normalized to at most 1, we describe $\alpha$ as lying within $(0,1)$. However, the algorithm itself is well-defined for larger values of $\alpha$ (coarser than the per-step reward), and in our implementation and experiments we consider such larger $\alpha$, beyond what might give worst-case guarantees, while still observing strong empirical performance.

Algorithm 1 Reward-Aware Value Iteration (RAVI)
1: Parameters: Discretization precision $\alpha\in(0,1)$, discount factor $\gamma\in[0,1)$, reward dimension $d$, finite time horizon $T$, welfare function $W$, discretization function $f_{\alpha}$.
2: Require: Normalize $\mathbf{R}(s,a)\in[0,1]^{d}$ for all $s\in\mathcal{S},a\in\mathcal{A}$.
3: Initialize: $V(s,\mathbf{R_{acc}},0)=W(\mathbf{R_{acc}})$ for all $\mathbf{R_{acc}}\in\{0,\alpha,2\alpha,\dots,\lceil T/\alpha\rceil\cdot\alpha\}^{d}$, $s\in\mathcal{S}$.
4: for $t=1$ to $T$ do
5:     for $\mathbf{R_{acc}}\in\{0,\alpha,2\alpha,\dots,\lceil(T-t)/\alpha\rceil\cdot\alpha\}^{d}$ do
6:         for all $s\in\mathcal{S}$ do
7:             $V\left(s,\mathbf{R_{acc}},t\right)\leftarrow\max_{a}\sum_{s^{\prime}}Pr\left(s^{\prime}|s,a\right)V\left(s^{\prime},\varphi(s,\mathbf{R}_{acc},a),t-1\right)$
8:             $\pi\left(s,\mathbf{R_{acc}},t\right)\leftarrow\operatorname*{arg\,max}_{a}\sum_{s^{\prime}}Pr\left(s^{\prime}|s,a\right)V\left(s^{\prime},\varphi(s,\mathbf{R}_{acc},a),t-1\right)$
9:             where $\varphi(s,\mathbf{R}_{acc},a):=f_{\alpha}\left(\mathbf{R}_{acc}+\gamma^{T-t}\mathbf{R}(s,a)\right)$

The asymptotic runtime complexity is $O\left(|\mathcal{S}|^{2}|\mathcal{A}|(T/\alpha)^{d}\right)$. However, observe that the resulting algorithm is extremely parallelizable: given solutions to subproblems at $t-1$, all subproblems for $t$ can in principle be computed in parallel. A parallel implementation of RAVI can leverage GPU compute to handle the extensive calculations involved in multi-objective value iteration. Each thread computes the value updates and policy decisions for a specific state and accumulated reward combination, which allows for massive parallelism. In practice, we observed an empirical speedup of approximately 560 times on a single NVIDIA A100 compared to the CPU implementation on an AMD Ryzen 9 7950X 16-core processor for our experimental settings. This drastic improvement in runtime efficiency makes it feasible to run RAVI on larger and more complex environments, where the computational demands would otherwise be prohibitive.
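For concreteness, below is a minimal, CPU-only Python sketch of the recursion underlying Algorithm 1. Instead of the bottom-up table (and the GPU parallelization described above), it memoizes the value function top-down over (state, discretized accumulated reward, steps remaining), which visits the same lattice of subproblems; the environment interface (P, R) is an assumption made for the example.

```python
from functools import lru_cache
import numpy as np

def ravi(S, A, P, R, W, T, alpha, gamma=1.0):
    """Memoized sketch of Reward-Aware Value Iteration.

    Assumed interface: S and A are iterables of hashable states/actions,
    P[s][a] is a dict {s_next: prob}, R(s, a) returns a length-d array in [0,1]^d,
    and W maps a length-d accumulated-reward vector to a scalar welfare.
    """
    def f_alpha(r):
        # Round each component down to the nearest multiple of alpha (Definition 8).
        return tuple(np.floor(np.asarray(r, dtype=float) / alpha) * alpha)

    @lru_cache(maxsize=None)
    def V(s, r_acc, t):
        if t == 0:
            return W(np.array(r_acc))
        return max(
            sum(p * V(s2, f_alpha(np.array(r_acc) + gamma ** (T - t) * R(s, a)), t - 1)
                for s2, p in P[s][a].items())
            for a in A)

    def policy(s, r_acc, t):
        # Greedy action with respect to the memoized value function.
        return max(A, key=lambda a: sum(
            p * V(s2, f_alpha(np.array(r_acc) + gamma ** (T - t) * R(s, a)), t - 1)
            for s2, p in P[s][a].items()))

    return V, policy
```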

It remains to see how small $\alpha$ needs to be, which will dictate the final runtime complexity. We first analyze the correctness of the algorithm and then return to the setting of $\alpha$ for a given smoothness of $W$ to conclude the runtime analysis.

We begin analyzing the approximation of RAVI by showing an important structural property: the optimal value function will be smooth as long as the welfare function is smooth. The proof is included in the appendix.

Lemma 9 (Uniform continuity of the multi-objective value function).

Let the welfare function $W:\mathbb{R}^{d}\to\mathbb{R}$ be uniformly continuous. Fix $s\in\mathcal{S}$ and $t\in\{0,1,\dots,T\}$. Then for all $\epsilon>0$, there exists $\delta_{\epsilon}>0$ such that

\bigg|V^{*}(s,\mathbf{R_{1}},t)-V^{*}(s,\mathbf{R_{2}},t)\bigg|<\epsilon\quad\text{if}\quad\|\mathbf{R_{1}}-\mathbf{R_{2}}\|<\delta_{\epsilon}.

We now present the approximation guarantee of RAVI, that it achieves an additive error that scales with the smoothness of WW and the number of remaining time steps. The proof of this lemma is deferred to the appendix.

Lemma 10 (Approximation Error of RAVI).

For a uniformly continuous welfare function $W$, for all $\epsilon>0$, there exists $\alpha_{\epsilon}$ such that

V(s,\mathbf{R_{acc}},t)\geq V^{*}(s,\mathbf{R_{acc}},t)-t\epsilon

$\forall s\in\mathcal{S},\mathbf{R_{acc}}\in\mathbb{R}^{d},t\in\{0,1,\dots,T\}$, where $V(s,\mathbf{R_{acc}},t)$ is computed by Algorithm 1 using $\alpha_{\epsilon}$.

While we can use any setting of $\alpha$ empirically, this shows that for an approximation guarantee we should set the discretization parameter to $\alpha=\delta_{T\epsilon}/d$, where $\delta_{T\epsilon}$ is the smoothness parameter of the welfare function sufficient to drive the accumulated bound of Lemma 10 down to $\epsilon$; since $\delta_{T\epsilon}=\delta_{\epsilon}/T$ up to constant factors, this gives $\alpha=\frac{\delta_{\epsilon}}{T\cdot d}$.

We thus arrive at the ultimate statement of the approximation and runtime of RAVI. The proof follows directly from Lemma 10 and the setting of $\alpha$.

Theorem 11 (Optimality Guarantee of RAVI).

For a given $\epsilon$ and welfare function $W$ that is $\delta_{\epsilon}$ uniformly continuous, RAVI with $\alpha=\frac{\delta_{\epsilon}}{T\cdot d}$ computes a policy $\hat{\pi}$ such that $V^{\hat{\pi}}(s,0,T)\geq V^{*}(s,0,T)-\epsilon$ in $O\left(|\mathcal{S}|^{2}|\mathcal{A}|(d\cdot T^{2}/\delta_{\epsilon})^{d}\right)$ time.

A concrete example of a particular welfare function, the setting of relevant parameters, and a derivation of a simplified runtime statement may help to clarify Theorem 11. Consider the smoothed proportional fairness objective $W_{\text{SPF}}(\mathbf{x})=\sum_{i=1}^{d}\ln(x_{i}+1)$, a smoothed log-transform of the Nash Social Welfare (or geometric mean) with better numerical stability.

By taking the gradient, we can see that $W_{\text{SPF}}(\mathbf{x})$ is $d$-Lipschitz on $\mathbf{x}\geq 0$, so we may pick $\delta_{\epsilon}=\epsilon/d$ to satisfy the uniform continuity requirement in Theorem 11. Uniform continuity is a weaker modeling assumption than $L$-Lipschitz continuity for constant $L$. Note that for an $L$-Lipschitz function, the correct value of $\delta_{\epsilon}$ is just $\epsilon/L$, where $\epsilon$ is the desired approximation factor of the algorithm.

Plugging $\delta_{\epsilon}$ into the runtime and recalling that $\alpha_{T\epsilon}=\tfrac{\delta_{T\epsilon}}{d}$ and $\delta_{T\epsilon}=\delta_{\epsilon}/T$ up to constant factors, we get

O\left(|\mathcal{S}|^{2}|\mathcal{A}|(T/\alpha_{T\epsilon})^{d}\right)=O\left(|\mathcal{S}|^{2}|\mathcal{A}|\left(\frac{T^{2}\cdot d^{2}}{\epsilon}\right)^{d}\right).

To further simplify, if one takes the number of actions $|\mathcal{A}|$, discount factor $\gamma$, and dimension $d$ to be constants, then the asymptotic dependence of the runtime (in our running example) is

O\left(|\mathcal{S}|^{2}\left(\frac{1}{\epsilon}\log^{2}(1/\epsilon)\right)^{d}\right).

This dependence is significantly better than a naive brute-force approach, which scales at least as $O\bigl(|\mathcal{S}|^{T}\bigr)$.

Intuitively, the key savings arise from discretizing the reward space (with granularity $\delta_{\epsilon}$) rather than enumerating all possible trajectories. This discretization is guided by the smoothness assumption on $W_{\text{SPF}}$.

6. Experiments

Figure 2. Visualization of the Taxi and Scavenger environments (panels: Scavenger; Taxi, $d=5$).

We test RAVI on two distinct interpretations of multi-objective reinforcement learning: (1) the fairness interpretation, where the agent tries to maximize rewards across all dimensions, and (2) the objective interpretation, where the agent tries to maximize one objective while minimizing the other. In both scenarios, we show that RAVI can discover better policies than other baselines for nonlinear multi-objective optimization. This holds even for coarser settings of $\alpha$ in the algorithm than would be necessary for strong theoretical worst-case approximation guarantees.

6.1. Simulation Environments

Taxi: We use the taxi multi-objective simulation environment considered in Fan et al. (2023) for testing nonlinear ESR maximization. In this grid-world environment, the agent is a taxi driver whose goal is to deliver passengers. There are multiple queues of passengers, each having a given pickup and drop-off location. Each queue corresponds to a different objective or dimension of the reward. The taxi can only take one passenger at a time and wants to balance trips for passengers in the different queues. This environment models the fairness interpretation.

Scavenger: Inspired by the Resource Gathering environment Barrett and Narayanan (2008), the scavenger hunt environment is a grid-world simulation where the agent must collect resources while avoiding enemies scattered across the grid. The state representation includes the agent’s position and the status of the resources (collected or not). The reward function is vector-valued, where the first component indicates the number of resources collected, and the second component indicates the damage taken by encountering enemies. This environment models the objective interpretation.

6.2. Baseline Algorithms

Linearly Scalarized Policy (LinScal) Van Moffaert et al. (2013): A relatively straightforward technique for multi-objective RL optimization is to apply a linear combination to the Q-values for each objective. Given weights $\mathbf{w}\in\mathbb{R}^{d}$ with $\sum_{i}^{d}w_{i}=1$, the scalarized objective is $SQ(s,a)=\mathbf{w}^{\top}\mathbf{Q}(s,a)$, where $Q(s,a)_{i}$ is the Q-value for the $i^{\text{th}}$ objective. An $\epsilon$-greedy policy is used on $SQ(s,a)$ during action selection, and regular Q-learning updates are applied on each reward dimension.
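A minimal tabular sketch of this baseline as described above; the array shapes (Q of shape (num_states, num_actions, d)) and the fixed random generator are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(Q, w, s, eps=0.1):
    """Epsilon-greedy action selection on the scalarized values SQ(s, .) = w^T Q(s, .)."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s] @ w))

def linscal_update(Q, w, s, a, r_vec, s_next, lr=0.1, gamma=1.0):
    """Per-dimension Q-learning update, bootstrapping from the successor action
    that is greedy with respect to the scalarization."""
    a_next = int(np.argmax(Q[s_next] @ w))
    Q[s, a] += lr * (r_vec + gamma * Q[s_next, a_next] - Q[s, a])
```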

Mixture Policy (Mixture) Vamplew et al. (2009): this baseline works by combining multiple Pareto optimal base policies into a single policy. Q-learning is used to optimize for each reward dimension separately (which is approximately Pareto optimal), and the close-to-optimal policy for each dimension is used for $I$ steps before switching to the next.

Welfare Q-learning (WelfareQ) Fan et al. (2023): this baseline extends Q-learning in the tabular setting to approximately optimize the nonlinear objective function by considering past accumulated rewards to perform non-stationary action selection.

Model-Based Mixture Policy (Mixture-M): Instead of using Q-learning to find an approximately Pareto optimal policy for each objective, value iteration Sutton et al. (1998) is used to calculate the optimal policy, and each dimension uses the corresponding optimal policy for $I$ steps before switching to the next.

6.3. Nonlinear Functions

Taxi: We use the following three functions on Taxi for fairness considerations: (1) Nash social welfare: $W_{\text{Nash}}(\mathbf{r})=\left(\prod_{i}r_{i}\right)^{1/d}$. (2) Egalitarian welfare: $W_{\text{Egalitarian}}(\mathbf{r})=\min_{i}\{r_{i}\}$. (3) $p$-welfare: $W_{p}(\mathbf{r})=(\frac{1}{d}\sum_{i}r_{i}^{p})^{1/p}$. ($p$-welfare is equivalent to the generalized mean: if $p\rightarrow 0$, $p$-welfare converges to Nash welfare; if $p\rightarrow-\infty$, $p$-welfare converges to egalitarian welfare; and $p$-welfare is the utilitarian welfare when $p=1$.)
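For reference, a minimal Python sketch of these three welfare functions as defined above (assuming nonnegative reward vectors; the p-welfare shown assumes $p\neq 0$):

```python
import numpy as np

def nash_welfare(r):
    """Nash social welfare: geometric mean of the reward components."""
    r = np.asarray(r, dtype=float)
    return float(np.prod(r) ** (1.0 / len(r)))

def egalitarian_welfare(r):
    """Egalitarian welfare: the minimum reward component."""
    return float(np.min(r))

def p_welfare(r, p):
    """p-welfare (generalized mean) for p != 0; p -> 0 recovers Nash welfare,
    p -> -inf recovers egalitarian welfare, and p = 1 is utilitarian welfare."""
    r = np.asarray(r, dtype=float)
    return float(np.mean(r ** p) ** (1.0 / p))
```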

Scavenger: We use the following two functions to reflect the conflicting nature of the objectives (a short sketch implementing both follows the list):

  1. Resource-Damage Threshold Scalarization: $RD_{\text{threshold}}(R,D)=R-\max(0,(D-\text{threshold})^{3})$, where $R$ is the number of resources collected and $D$ is the damage taken from enemies. The threshold parameter represents a budget after which the penalty from the damage starts to apply.

  2. Cobb-Douglas Scalarization: $CD_{\rho}(R,D)=R^{\rho}\left(1/(D+1)\right)^{(1-\rho)}$. This function is inspired by economic theory and balances trade-offs between $R$ and $D$.
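As promised above, a minimal sketch of the two scalarizations; the default threshold and $\rho$ values mirror the experiment settings reported below.

```python
def rd_threshold(R, D, threshold=2.0):
    """Resource-damage threshold scalarization: resources collected minus a cubic
    penalty once damage exceeds the threshold budget."""
    return R - max(0.0, (D - threshold) ** 3)

def cobb_douglas(R, D, rho=0.4):
    """Cobb-Douglas scalarization balancing resources R against damage D."""
    return (R ** rho) * (1.0 / (D + 1.0)) ** (1.0 - rho)
```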

6.4. Experiment Settings and Hyperparameters

We run all the algorithms using 10 random initializations with a fixed seed each. We set $\mathbf{w}=(1/d)\times\mathbf{1}$ (uniform weight) for LinScal, $I=T/d$ for Mixture and Mixture-M, $\alpha=1$ for RAVI, and a learning rate of 0.1 for WelfareQ. Three of our baselines (Mixture, LinScal, and WelfareQ) are online algorithms. Thus, to ensure a fair comparison, we tuned their hyperparameters using grid search and evaluated their performances after they reached convergence. For model-based approaches (RAVI and Mixture-M), we run the algorithms and evaluate them after completion. We set the convergence threshold to $\Delta=10^{-7}$ for Mixture-M. Some environment-specific settings are discussed below.

Taxi: To ensure numerical stability of $W_{\text{Nash}}$, we optimize its smoothed log-transform $W_{\text{SPF}}(\mathbf{r},\lambda)=\sum_{i=1}^{d}\ln(r_{i}+\lambda)$, but during evaluations we still use $W_{\text{Nash}}$. We set $\lambda=10^{-8}$, $T=100$, $\gamma=1$, and the size of the grid world to be $15\times 15$ with $d\in\{2,3,4,5\}$ reward dimensions.
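A one-function sketch of this smoothed objective used during optimization, with the $\lambda$ value from this setting:

```python
import numpy as np

def w_spf(r, lam=1e-8):
    """Smoothed proportional fairness: sum_i ln(r_i + lambda), a numerically
    stable log-transform of the Nash welfare used during optimization."""
    return float(np.sum(np.log(np.asarray(r, dtype=float) + lam)))
```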

Scavenger: We set the threshold for $RD_{\text{threshold}}$ to 2 and $\rho=0.4$ for $CD_{\rho}$. The size of the grid world is $15\times 15$ with six resources scattered randomly and 1/3 of the cells randomly populated with enemies. We use $T=20$ and $\gamma=1$.

Table 1. Comparison with baselines in terms of ESR.
Environment Dimension Function RAVI Mixture LinScal WelfareQ Mixture-M
Taxi $d=2$ $W_{\text{Nash}}$ 7.555±0.502 5.136±0.547 0.000±0.000 5.343±0.964 6.406±0.628
$W_{\text{Egalitarian}}$ 3.000±0.000 3.483±0.894 0.000±0.000 0.387±0.475 4.065±0.773
$W_{p=-10}$ 5.279±0.000 3.623±0.522 0.000±0.000 2.441±1.518 4.356±0.829
$W_{p=0.001}$ 7.404±0.448 5.363±0.844 0.000±0.000 4.977±1.514 6.406±0.628
$W_{p=0.9}$ 9.628±0.349 5.947±0.488 7.833±0.908 7.999±0.822 7.052±0.412
$d=3$ $W_{\text{Nash}}$ 4.996±0.297 3.250±0.584 0.000±0.000 2.798±1.462 3.461±0.459
$W_{\text{Egalitarian}}$ 2.000±0.000 2.030±0.981 0.000±0.000 0.094±0.281 1.660±0.663
$W_{p=-10}$ 3.115±0.000 2.129±0.878 0.000±0.000 2.558±0.812 1.849±0.733
$W_{p=0.001}$ 3.307±0.000 3.118±0.536 0.000±0.000 3.306±1.181 3.462±0.458
$W_{p=0.9}$ 6.250±0.322 3.883±0.329 5.191±0.464 4.834±0.797 4.081±0.266
$d=4$ $W_{\text{Nash}}$ 2.191±0.147 0.579±0.713 0.000±0.000 1.601±0.189 1.122±0.781
$W_{\text{Egalitarian}}$ 1.700±0.483 0.545±0.445 0.000±0.000 0.000±0.000 0.634±0.437
$W_{p=-10}$ 1.029±0.000 0.490±0.490 0.000±0.000 0.689±0.451 0.688±0.475
$W_{p=0.001}$ 2.145±0.129 0.961±0.628 0.000±0.000 1.440±0.511 1.126±0.774
$W_{p=0.9}$ 3.369±0.173 1.230±0.264 2.546±0.174 2.521±0.211 1.689±0.320
$d=5$ $W_{\text{Nash}}$ 2.308±0.185 0.000±0.000 0.000±0.000 1.181±0.613 0.000±0.000
$W_{\text{Egalitarian}}$ 1.700±0.483 0.000±0.000 0.000±0.000 0.000±0.000 0.000±0.000
$W_{p=-10}$ 1.023±0.000 0.000±0.000 0.000±0.000 0.407±0.501 0.000±0.000
$W_{p=0.001}$ 2.000±0.102 0.018±0.020 0.000±0.000 1.309±0.490 0.025±0.025
$W_{p=0.9}$ 3.289±0.147 1.220±0.141 2.844±0.250 2.879±0.263 1.761±0.148
Scavenger $d=2$ $CD_{\rho=0.4}$ 1.336±0.240 0.655±0.558 0.874±0.605 0.713±0.645 0.729±0.296
$RD_{\text{threshold}=2}$ 3.400±0.843 0.900±0.943 1.200±0.872 1.444±1.423 0.845±2.938
Figure 3. Comparisons with baselines, with learning curves included. Panels: Taxi, $W_{\text{Nash}}$, $d=5$; Taxi, $W_{p=0.9}$, $d=4$; Scavenger, $CD_{\rho=0.4}$; Scavenger, $RD_{\text{threshold}}$; Taxi, $W_{\text{Nash}}$, $d=2$; Taxi, $W_{p=0.9}$, $d=3$; Taxi, $W_{p=0.001}$, $d=4$; Taxi, $W_{p=-10}$, $d=5$.

6.5. Results

As shown in Table 1, we found that RAVI generally outperforms all baselines across settings in both the Taxi and Scavenger environments in terms of optimizing our nonlinear functions of interest.

On the Taxi environment, we observe that LinScal achieves zero welfare on every function except $W_{p=0.9}$. This is expected behavior due to the use of a linearly scalarized policy and the fact that $p$-welfare converges to utilitarian welfare as $p\rightarrow 1$. Among the baseline algorithms, we observe that WelfareQ performs the best on $d=5$, whereas Mixture and Mixture-M fail due to the use of a fixed interval length for each optimal policy. Furthermore, the advantage of RAVI becomes more obvious as $d$ increases.

On the Scavenger environment, because $d=2$ and the reward signal is dense, all algorithms are able to achieve reasonable values for the welfare functions. We also observe that RAVI significantly outperforms all baselines.

Given that Mixture, LinScal, and WelfareQ are online algorithms, for the sake of completeness, we provide visualizations of the learning curves of the baseline algorithms compared with RAVI with different $\alpha$ values in Figure 3, where the model-based approaches are shown as horizontal dotted lines. In general, we found that the online algorithms tend to converge early, and we observe a degradation in RAVI's performance as $\alpha$ increases. The full set of results can be found in Section C.3.

6.6. Ablation Study on Discretization Factor

To further evaluate the effect of $\alpha$ on RAVI, we also run two sets of experiments:

  1. We test RAVI's performance in settings with $\gamma<1$ and adopt $\alpha$ values that are substantially greater than those necessary for the worst-case theoretical guarantees from the previous analysis. This set of results can be found in Section C.1. Note that the empirical performance of RAVI is substantially better than is guaranteed by the previous theoretical analysis.

  2. With $\gamma=1$, we evaluate RAVI using larger $\alpha$ values and investigate how much performance degradation occurs. This set of experiments can be found in Section C.2.

7. Removing Knowledge of the Model

We have shown that RAVI can efficiently find an approximately optimal policy given access to a model of the environment. In this section we observe that the model can be jointly learned by extending the classical E3 algorithm Kearns and Singh (2002) to the nonlinear multi-objective case: we lift the exploration procedure to the multi-objective setting and use RAVI as the exploitation subroutine. We call the resulting combined algorithm Reward-Aware Explore or Exploit (or RAEE for short). We briefly explain the high-level ideas and state the main result here, and defer further discussion to the appendix due to space constraints.

The algorithm alternates between two stages, exploration and exploitation. Each stage is outlined here at a high level, with more detailed steps provided in the appendix.

  • Explore. At a given state, choose the least experienced action and record the reward and transition. Continue in this fashion until reaching a known state, where we say a state is known if it has been visited sufficiently many times for us to have precise local statistics about the reward and transition functions in that state.

  • Exploit. Run RAVI from the current known state $s$ in the induced model comprising the known states and with a single absorbing state representing all unknown states. If the welfare obtained by this policy is within the desired error bound of $V^{*}(s,\mathbf{0},T)$, then we are done. Otherwise, compute a policy to reach an unknown state as quickly as possible and resume exploring.

Theorem 12 (RAEE).

Let $V^{*}(s,0,T)$ denote the value function of the policy with the optimal expected welfare in the MOMDP $M$ starting at state $s$, with $\mathbf{0}\in\mathbb{R}^{d}$ accumulated reward and $T$ timesteps remaining. Then for a uniformly continuous welfare function $W$, there exists an algorithm $A$, taking inputs $\epsilon$, $\beta$, $\mathcal{S}$, $\mathcal{A}$, and $V^{*}(s,\mathbf{0},T)$, such that the total number of actions and computation time taken by $A$ is polynomial in $1/\epsilon$, $1/\beta$, $|\mathcal{S}|$, $|\mathcal{A}|$, and the horizon time $T=1/(1-\gamma)$, and exponential in the number of objectives $d$, and with probability at least $1-\beta$, $A$ will halt in a state $s$ and output a policy $\hat{\pi}$ such that $V^{\hat{\pi}}_{M}(s,0,T)\geq V^{*}(s,0,T)-\epsilon$.

We provide additional details comparing our analysis with that of Kearns and Singh (2002) in the appendix.

As we do not regard the learning of the MOMDP as our primary contribution, we choose to focus the empirical evaluation on the nonlinear optimization subroutine, which is the most crucial modification from the learning problem with a single objective. The environments we used for testing the optimality of RAVI are interesting for optimizing a nonlinear function of the rewards, but are deterministic in terms of transitions and rewards, making the learning of these environments less interesting empirically.

8. Conclusion and Future Work

Nonlinear preferences in reinforcement learning are important, as they can encode fairness as the nonlinear balancing of priorities across multiple objectives or risk attitudes with respect to even a single objective. Stationary policies are not necessarily optimal for such objectives. We derived an extended form of Bellman optimality to characterize the structure of optimal policies conditional on accumulated reward. We introduced the RAVI and RAEE algorithms to efficiently compute an approximately optimal policy.

Our work is certainly not the first to study MORL including with nonlinear preferences. However, to the best of our knowledge, our work is among the first to provide worst-case approximation guarantees for optimizing ESR in MORL with nonlinear scalarization functions.

While our experiments demonstrate the utility of RAVI in specific settings, there are many possible areas of further empirical evaluation including stochastic environments and model learning alongside the use of RAVI as an exploitation subprocedure as described in theory in Section 7 with RAEE. Further details on the limitations of our experiments are discussed in Appendix E.

Our results introduce several natural directions for future work. On the technical side, one could try to handle the case of stochastic reward functions or a large number of objectives. Another direction would involve incorporating function approximation with deep neural networks into the algorithms to enable scaling to even larger state spaces and generalizing between experiences. Our theory suggests that it may be possible to greatly enrich the space of possible policies that can be efficiently achieved in these settings by conditioning function approximators on accumulated reward rather than necessarily considering sequence models over arbitrary past trajectories; we see this as the most exciting next step for applications.

References

  • Abels et al. (2019) Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, and Denis Steckelmacher. 2019. Dynamic weights in multi-objective deep reinforcement learning. In International conference on machine learning. PMLR, 11–20.
  • Agarwal et al. (2021) Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. 2021. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. J. Mach. Learn. Res. 22, 1, Article 98 (jan 2021), 76 pages.
  • Agarwal et al. (2022) Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. 2022. Multi-Objective Reinforcement Learning with Non-Linear Scalarization. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (Virtual Event, New Zealand) (AAMAS ’22). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 9–17.
  • Alegre et al. (2023) Lucas N Alegre, Ana LC Bazzan, Diederik M Roijers, Ann Nowé, and Bruno C da Silva. 2023. Sample-efficient multi-objective learning via generalized policy improvement prioritization. arXiv preprint arXiv:2301.07784 (2023).
  • Auer et al. (2008) Peter Auer, Thomas Jaksch, and Ronald Ortner. 2008. Near-Optimal Regret Bounds for Reinforcement Learning. In Proceedings of the 21st International Conference on Neural Information Processing Systems (Vancouver, British Columbia, Canada) (NIPS’08). Curran Associates Inc., Red Hook, NY, USA, 89–96.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. 2017. Minimax Regret Bounds for Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 263–272. https://proceedings.mlr.press/v70/azar17a.html
  • Barman et al. (2023) Siddharth Barman, Arindam Khan, Arnab Maiti, and Ayush Sawarni. 2023. Fairness and Welfare Quantification for Regret in Multi-Armed Bandits. Proceedings of the AAAI Conference on Artificial Intelligence 37, 6 (Jun. 2023), 6762–6769. https://doi.org/10.1609/aaai.v37i6.25829
  • Barrett and Narayanan (2008) Leon Barrett and Srini Narayanan. 2008. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) (ICML ’08). Association for Computing Machinery, New York, NY, USA, 41–47. https://doi.org/10.1145/1390156.1390162
  • Bastani et al. (2022) Osbert Bastani, Jason Yinglun Ma, Ethan Shen, and Weiran Xu. 2022. Regret bounds for risk-sensitive reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 35. 36259–36269.
  • Bäuerle and Ott (2011) Nicole Bäuerle and Jonathan Ott. 2011. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research 74 (2011), 361–379.
  • Bellemare et al. (2023) Marc G Bellemare, Will Dabney, and Mark Rowland. 2023. Distributional Reinforcement Learning. MIT Press.
  • Brafman and Tennenholtz (2003) Ronen I. Brafman and Moshe Tennenholtz. 2003. R-Max - a General Polynomial Time Algorithm for near-Optimal Reinforcement Learning. J. Mach. Learn. Res. 3 (Mar. 2003), 213–231. https://doi.org/10.1162/153244303765208377
  • Cai et al. (2023) Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, and Ashley Llorens. 2023. Distributional Pareto-Optimal Multi-Objective Reinforcement Learning. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 15593–15613. https://proceedings.neurips.cc/paper_files/paper/2023/file/32285dd184dbfc33cb2d1f0db53c23c5-Paper-Conference.pdf
  • Caragiannis et al. (2019) Ioannis Caragiannis, David Kurokawa, Hervé Moulin, Ariel D Procaccia, Nisarg Shah, and Junxing Wang. 2019. The unreasonable fairness of maximum Nash welfare. ACM Transactions on Economics and Computation (TEAC) 7, 3 (2019), 1–32.
  • Deng et al. (2023) Zhun Deng, He Sun, Steven Wu, Linjun Zhang, and David Parkes. 2023. Reinforcement Learning with Stepwise Fairness Constraints. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 206), Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (Eds.). PMLR, 10594–10618. https://proceedings.mlr.press/v206/deng23a.html
  • Fan et al. (2023) Ziming Fan, Nianli Peng, Muhang Tian, and Brandon Fain. 2023. Welfare and Fairness in Multi-Objective Reinforcement Learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (London, United Kingdom) (AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1991–1999.
  • Foerster et al. (2018) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual multi-agent policy gradients. Proceedings of the AAAI Conference on Artificial Intelligence 32, 1 (2018).
  • Hayes et al. (2023) Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Kallstrom, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2023. A Brief Guide to Multi-Objective Reinforcement Learning and Planning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (London, United Kingdom) (AAMAS ’23). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1988–1990.
  • Jabbari et al. (2017) Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. 2017. Fairness in reinforcement learning. In International conference on machine learning. PMLR, 1617–1626.
  • Kearns and Singh (2002) Michael Kearns and Satinder Singh. 2002. Near-optimal reinforcement learning in polynomial time. Machine learning 49 (2002), 209–232.
  • Lin et al. (2024) Qian Lin, Zongkai Liu, Danying Mo, and Chao Yu. 2024. An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://neurips.cc/virtual/2024/poster/95257
  • Liu et al. (2014) Chunming Liu, Xin Xu, and Dewen Hu. 2014. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45, 3 (2014), 385–398.
  • Moffaert et al. (2013) Kristof Van Moffaert, Madalina M Drugan, and Ann Nowé. 2013. Hypervolume-based multi-objective reinforcement learning. In International Conference on Evolutionary Multi-Criterion Optimization. Springer, 352–366.
  • Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning.
  • Reymond et al. (2022) M. Reymond, E. Bargiacchi, and A. Nowé. 2022. Pareto Conditioned Networks. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 1110–1118.
  • Reymond et al. (2023) Mathieu Reymond, Conor F Hayes, Denis Steckelmacher, Diederik M Roijers, and Ann Nowé. 2023. Actor-critic multi-objective reinforcement learning for non-linear utility functions. Autonomous Agents and Multi-Agent Systems 37, 2 (2023), 23.
  • Röpke et al. (2023) W. Röpke, C. F. Hayes, P. Mannion, E. Howley, A. Nowé, and D. M. Roijers. 2023. Distributional multi-objective decision making. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 5711–5719.
  • Siddique et al. (2020) Umer Siddique, Paul Weng, and Matthieu Zimmer. 2020. Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 826, 11 pages.
  • Sunehag et al. (2017) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech M Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, and Thore Graepel. 2017. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017).
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. 1998. Introduction to reinforcement learning. (1998).
  • Vamplew et al. (2009) Peter Vamplew, Richard Dazeley, Ewan Barker, and Andrei Kelarev. 2009. Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In Australasian joint conference on artificial intelligence. Springer, 340–349.
  • Van Moffaert et al. (2013) Kristof Van Moffaert, Madalina M Drugan, and Ann Nowé. 2013. Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 191–199.
  • Von Neumann and Morgenstern (1947) John Von Neumann and Oskar Morgenstern. 1947. Theory of games and economic behavior, 2nd rev. Princeton university press.
  • Yang et al. (2019) Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Advances in neural information processing systems 32 (2019).

Appendix

Appendix A Proofs of Technical Lemmas

In this section, we provide proofs of the technical lemmas in Section 4 and Section 5.

A.1. Proof of Lemma 6

Lemma 6. Let 𝒱(s,𝐑(τ),0)=W(𝐑(τ))\mathcal{V}(s,\mathbf{R}(\tau),0)=W(\mathbf{R}(\tau)) for all states ss and trajectories τ\tau. For every state ss, history τ\tau, and t>0t>0 time steps remaining, let

𝒱(s,𝐑(τ),t)=maxa𝔼s[𝒱(s,𝐑(τ)+γTt𝐑(s,a),t1)].\mathcal{V}(s,\mathbf{R}(\tau),t)=\max_{a}\mathbb{E}_{s^{\prime}}\left[\mathcal{V}(s^{\prime},\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a),t-1)\right].

Then V(s,τ,t)=𝒱(s,𝐑(τ),t)V^{*}(s,\tau,t)=\mathcal{V}(s,\mathbf{R}(\tau),t).

Proof.

We proceed by induction on tt. In the base case of t=0t=0, we have simply that 𝒱(s,𝐑(τ),0)=W(𝐑(τ))\mathcal{V}(s,\mathbf{R}(\tau),0)=W(\mathbf{R}(\tau)) by the definition of 𝒱\mathcal{V}. But when t=0t=0, the trajectory has ended, so any τ\tau^{\prime} in Definition 5 will be the empty trajectory τ0;0\tau_{0;0}. So

V(s,τ,0)=W(𝐑(τ)+γTt𝐑(τ0:0))=W(𝐑(τ)).V^{*}(s,\tau,0)=W(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(\tau_{0:0}))=W(\mathbf{R}(\tau)).

Suppose the equality holds for all states ss and histories τ\tau with at most t1t-1 time steps remaining, i.e.,

V^{*}(s,\tau,t-1)=\mathcal{V}(s,\mathbf{R}(\tau),t-1).

Then with tt steps remaining:

V(s,τ,t)=maxπτPrπ(τ)W(𝐑(τ)+γTt𝐑(τ))V^{*}(s,\tau,t)=\max_{\pi}\sum_{\tau^{\prime}}Pr^{\pi}(\tau^{\prime})W(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(\tau^{\prime}))

where the sum is taken over all length-tt trajectories τ\tau^{\prime} beginning at ss, and Prπ(τ)Pr^{\pi}(\tau^{\prime}) is the probability that τ\tau^{\prime} is traversed under policy π\pi. Note that we can further decompose τ\tau^{\prime} into τ=(s,a)τ′′\tau^{\prime}=(s,a)\oplus\tau^{\prime\prime} where (s,a)(s,a) denotes the state-action pair corresponding to the current timestep, \oplus concatenates, and τ′′\tau^{\prime\prime} denotes a trajectory of length t1t-1 beginning at some state ss^{\prime} transitioned from state ss via taking action aa. Since the transition to τ′′\tau^{\prime\prime} is independent of earlier states/actions once the current state ss and action aa are fixed, we can simplify the optimization by shifting the summation from trajectories conditioned on π\pi to actions and successor states and thus rewrite VV^{*} as

V^{*}(s,\tau,t)=\max_{\pi}\sum_{a}\pi(a\mid s,\tau,t)\,\mathbb{E}_{s^{\prime}}\Bigg[\sum_{\tau^{\prime\prime}}Pr^{\pi}(\tau^{\prime\prime})\,W\big(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a)+\gamma^{T-t+1}\mathbf{R}(\tau^{\prime\prime})\big)\Bigg]

where the latter sum is taken over all length t1t-1 trajectories τ′′\tau^{\prime\prime} beginning at ss^{\prime} and the expectation is taken over the environmental transitions to ss^{\prime} from ss given aa.

Note that for all actions aa, π(as,τ,t)[0,1]\pi(a\mid s,\tau,t)\in[0,1] can be chosen independently of any terms that appear in Prπ(τ′′)Pr^{\pi}(\tau^{\prime\prime}) for any τ′′\tau^{\prime\prime}, by the decomposition of τ\tau^{\prime} above. This implies that at the current timestep, the maximizer (i.e., the optimal policy) should choose the action aa that maximizes the quantity

𝔼s[τ′′Prπ(τ′′)W(𝐑(τ)+γTt𝐑(s,a)+γTt+1𝐑(τ′′))]\mathbb{E}_{s^{\prime}}\bigg{[}\sum_{\tau^{\prime\prime}}Pr^{\pi}(\tau^{\prime\prime})W(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a)+\gamma^{T-t+1}\mathbf{R}(\tau^{\prime\prime}))\bigg{]}

with probability 1. Therefore, the expression for V(s,τ,t)V^{*}(s,\tau,t) can be rewritten as

V^{*}(s,\tau,t)=\max_{a}\mathbb{E}_{s^{\prime}}\bigg[\max_{\pi}\sum_{\tau^{\prime\prime}}Pr^{\pi}(\tau^{\prime\prime})\,W\big(\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a)+\gamma^{T-(t-1)}\mathbf{R}(\tau^{\prime\prime})\big)\bigg]
=\max_{a}\mathbb{E}_{s^{\prime}}\left[V^{*}(s^{\prime},\tau\oplus(s,a),t-1)\right]
=\max_{a}\mathbb{E}_{s^{\prime}}\left[\mathcal{V}(s^{\prime},\mathbf{R}(\tau)+\gamma^{T-t}\mathbf{R}(s,a),t-1)\right]
=\mathcal{V}(s,\mathbf{R}(\tau),t),

where the second equality comes from the definition of VV^{*}, the third equality comes from the inductive hypothesis, and the fourth equality comes from the definition of 𝒱\mathcal{V}. ∎
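To make the recursion concrete, the following Python sketch evaluates 𝒱\mathcal{V} by brute-force backward induction on a small tabular MOMDP. The interfaces (transition dictionaries, reward vectors, and the welfare function) are illustrative assumptions, not the implementation used in our experiments; unlike RAVI, it memoizes exact accumulated-reward vectors rather than a discretized grid.

```python
import numpy as np

def backward_induction(P, R, W, gamma, T, states, actions):
    """Brute-force evaluation of the recursion in Lemma 6.

    Assumed (hypothetical) interfaces: P[s][a] is a dict {next_state: prob},
    R[s][a] is a d-dimensional numpy reward vector, and W maps a reward
    vector to a scalar welfare. Accumulated reward vectors are memoized
    exactly, so the table may grow exponentially; RAVI replaces them with
    an alpha-discretized grid.
    """
    memo = {}

    def value(s, r_acc, t):
        # V(s, R(tau), t): t steps remaining, accumulated reward r_acc.
        key = (s, tuple(np.round(r_acc, 12)), t)
        if t == 0:
            return W(r_acc)
        if key not in memo:
            memo[key] = max(
                sum(p * value(s2, r_acc + gamma ** (T - t) * R[s][a], t - 1)
                    for s2, p in P[s][a].items())
                for a in actions
            )
        return memo[key]

    d = len(next(iter(R[states[0]].values())))
    return {s: value(s, np.zeros(d), T) for s in states}
```

If WW were linear, the accumulated-reward argument would be unnecessary and the recursion would collapse to standard finite-horizon value iteration; the nonlinearity of WW is what forces the augmented state.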

A.2. Proof of Lemma 9

Lemma 9. Let the welfare function W:dW:\mathbb{R}^{d}\to\mathbb{R} be uniformly continuous. Fix s𝒮s\in\mathcal{S} and t{0,1,,T}t\in\{0,1,\dots,T\}, then for all ϵ>0\epsilon>0, there exists δϵ>0\delta_{\epsilon}>0 such that

|V(s,𝐑𝟏,t)V(s,𝐑𝟐,t)|<ϵif𝐑𝟏𝐑𝟐<δϵ.\bigg{|}V^{*}(s,\mathbf{R_{1}},t)-V^{*}(s,\mathbf{R_{2}},t)\bigg{|}<\epsilon\quad\text{if}\quad\|\mathbf{R_{1}}-\mathbf{R_{2}}\|<\delta_{\epsilon}.
Proof.

Without loss of generality, assume V(s,𝐑𝟏,t)V(s,𝐑𝟐,t)V^{*}(s,\mathbf{R_{1}},t)\geq V^{*}(s,\mathbf{R_{2}},t). Let π1\pi_{1} be an optimal policy attaining V(s,𝐑𝟏,t)V^{*}(s,\mathbf{R_{1}},t), i.e.,

V(s,𝐑𝟏,t)=Vπ1(s,𝐑𝟏,t).V^{*}(s,\mathbf{R_{1}},t)=V^{\pi_{1}}(s,\mathbf{R_{1}},t).

Then by the optimality of VV^{*},

V^{\pi_{1}}(s,\mathbf{R_{2}},t)\leq V^{*}(s,\mathbf{R_{2}},t)=\max_{\pi}\sum_{\tau}Pr^{\pi}[\tau]W(\mathbf{R_{2}}+\mathbf{R}(\tau)).

Since WW is uniformly continuous, there exists δϵ>0\delta_{\epsilon}>0 such that |W(𝐱)W(𝐲)|<ϵ|W(\mathbf{x})-W(\mathbf{y})|<\epsilon if 𝐱𝐲<δϵ\|\mathbf{x}-\mathbf{y}\|<\delta_{\epsilon}. Then

\Big|V^{*}(s,\mathbf{R_{1}},t)-V^{*}(s,\mathbf{R_{2}},t)\Big|
\leq\Big|V^{\pi_{1}}(s,\mathbf{R_{1}},t)-V^{\pi_{1}}(s,\mathbf{R_{2}},t)\Big|
=\Big|\sum_{\tau}Pr^{\pi_{1}}[\tau]W(\mathbf{R_{1}}+\mathbf{R}(\tau))-\sum_{\tau}Pr^{\pi_{1}}[\tau]W(\mathbf{R_{2}}+\mathbf{R}(\tau))\Big|
\leq\sum_{\tau}Pr^{\pi_{1}}[\tau]\Big|W(\mathbf{R_{1}}+\mathbf{R}(\tau))-W(\mathbf{R_{2}}+\mathbf{R}(\tau))\Big|

where the sum is over all tt-trajectories τ\tau that start in state ss. If 𝐑𝟏𝐑𝟐<δϵ\|\mathbf{R_{1}}-\mathbf{R_{2}}\|<\delta_{\epsilon} then we have

|W(𝐑𝟏+𝐑(τ))W(𝐑𝟐+𝐑(τ))|<ϵ.\bigg{|}W(\mathbf{R_{1}}+\mathbf{R}(\tau))-W(\mathbf{R_{2}}+\mathbf{R}(\tau))\bigg{|}<\epsilon.

Therefore,

|V(s,𝐑𝟏,t)V(s,𝐑𝟐,t)|<ϵ\bigg{|}V^{*}(s,\mathbf{R_{1}},t)-V^{*}(s,\mathbf{R_{2}},t)\bigg{|}<\epsilon

since the sum of probabilities over all tt-trajectories τ\tau in MM that start in state ss induced by π1\pi_{1} must add up to 11. ∎
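As a concrete (and purely illustrative) instance of this modulus: taking the egalitarian welfare to be the minimum accumulated reward across objectives, W(𝐱)=min_i x_i, one can take δϵ=ϵ\delta_{\epsilon}=\epsilon, since

\Big|\min_{i}x_{i}-\min_{i}y_{i}\Big|\leq\max_{i}|x_{i}-y_{i}|=\|\mathbf{x}-\mathbf{y}\|_{\infty}\leq\|\mathbf{x}-\mathbf{y}\|.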

A.3. Proof of Lemma 10

Lemma 10 (Approximation Error of RAVI)

For a uniformly continuous welfare function WW and for all ϵ>0\epsilon>0, there exists αϵ>0\alpha_{\epsilon}>0 such that

V(s,𝐑𝐚𝐜𝐜,t)V(s,𝐑𝐚𝐜𝐜,t)tϵV(s,\mathbf{R_{acc}},t)\geq V^{*}(s,\mathbf{R_{acc}},t)-t\epsilon

s𝒮,𝐑𝐚𝐜𝐜d,t{0,1,,T}\forall s\in\mathcal{S},\mathbf{R_{acc}}\in\mathbb{R}^{d},t\in\{0,1,\dots,T\}, where V(s,𝐑𝐚𝐜𝐜,t)V(s,\mathbf{R_{acc}},t) is computed by Algorithm 1 using αϵ\alpha_{\epsilon}.

Proof.

We prove this additive error bound by induction on the number of timesteps tt remaining in the task. Clearly V(s,𝐑𝐚𝐜𝐜,0)=W(𝐑𝐚𝐜𝐜),sV(s,\mathbf{R_{acc}},0)=W(\mathbf{R_{acc}}),\forall s and 𝐑𝐚𝐜𝐜\mathbf{R_{acc}}. Suppose that V(s,𝐑𝐚𝐜𝐜,t)V(s,𝐑𝐚𝐜𝐜,t)tϵV(s,\mathbf{R_{acc}},t)\geq V^{*}(s,\mathbf{R_{acc}},t)-t\epsilon for all t=0,,k1t=0,\dots,k-1. Now consider kk steps remaining.

V(s,\mathbf{R_{acc}},k)=\max_{a}\sum_{s^{\prime}}Pr(s^{\prime}|s,a)\,V\big(s^{\prime},f_{\alpha}(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a)),k-1\big)
\geq\max_{a}\sum_{s^{\prime}}Pr(s^{\prime}|s,a)\,\Big(V^{*}\big(s^{\prime},f_{\alpha}(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a)),k-1\big)-(k-1)\epsilon\Big).

Since WW is uniformly continuous, by Lemma 9 there exists δϵ\delta_{\epsilon} such that

\Big|V^{*}\big(s^{\prime},f_{\alpha}(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a)),k-1\big)-V^{*}\big(s^{\prime},\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a),k-1\big)\Big|<\epsilon

if

fα(𝐑𝐚𝐜𝐜+γTk𝐑(s,a))(𝐑𝐚𝐜𝐜+γTk𝐑(s,a))<δϵ.\bigg{\|}f_{\alpha}(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a))-(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a))\bigg{\|}<\delta_{\epsilon}.

Hence, it suffices to choose αϵ=δϵ/d\alpha_{\epsilon}=\delta_{\epsilon}/d: since fαf_{\alpha} perturbs each of the dd coordinates by less than αϵ\alpha_{\epsilon}, this implies

fα(𝐑𝐚𝐜𝐜+γTk𝐑(s,a))(𝐑𝐚𝐜𝐜+γTk𝐑(s,a))<(δϵ/d)d=δϵ,\bigg{\|}f_{\alpha}(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a))-(\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a))\bigg{\|}<(\delta_{\epsilon}/d)\cdot d=\delta_{\epsilon},

and so

V(s,\mathbf{R_{acc}},k)\geq\max_{a}\sum_{s^{\prime}}Pr(s^{\prime}|s,a)\,V^{*}\big(s^{\prime},\mathbf{R_{acc}}+\gamma^{T-k}\mathbf{R}(s,a),k-1\big)-\epsilon-(k-1)\epsilon
=V^{*}(s,\mathbf{R}_{acc},k)-k\epsilon. ∎
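The role of fαf_{\alpha} and the choice αϵ=δϵ/d\alpha_{\epsilon}=\delta_{\epsilon}/d can be illustrated with a short sketch. We assume here, purely for illustration, that fαf_{\alpha} rounds each coordinate of the accumulated-reward vector down to the nearest multiple of α\alpha; any rounding rule that moves each coordinate by less than α\alpha gives the bound used in the proof.

```python
import numpy as np

def f_alpha(r, alpha):
    """Illustrative rounding map: each coordinate of the accumulated-reward
    vector is rounded down to a multiple of alpha, so it moves by less than
    alpha and the Euclidean perturbation is below alpha * sqrt(d) <= alpha * d."""
    return alpha * np.floor(np.asarray(r, dtype=float) / alpha)

def alpha_for_accuracy(delta_eps, d):
    """The choice in the proof of Lemma 10: alpha_eps = delta_eps / d ensures
    ||f_alpha(x) - x|| < (delta_eps / d) * d = delta_eps."""
    return delta_eps / d

# Example with d = 3 objectives and delta_eps = 0.3 (illustrative numbers).
x = np.array([1.234, 0.078, 2.999])
alpha_eps = alpha_for_accuracy(0.3, d=3)
assert np.linalg.norm(f_alpha(x, alpha_eps) - x) < 0.3
```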

A.4. Lemma 13 (Horizon Time)

Lemma 13.

Let MM be any MOMDP, and let π\pi be any policy in MM. Assume the welfare function W:dW:\mathbb{R}^{d}\to\mathbb{R} is uniformly continuous and 𝐑(s,a)[0,1]ds,a\mathbf{R}(s,a)\in[0,1]^{d}\ \forall s,a. Then for all ϵ>0\epsilon>0, there exists δϵ>0\delta_{\epsilon}>0 such that if

T(11γ)log(dδϵ(1γ))T\geq\left(\frac{1}{1-\gamma}\right)\log\left(\frac{\sqrt{d}}{\delta_{\epsilon}(1-\gamma)}\right)

then for any state s,s,

Vπ(s,𝟎,T)limTVπ(s,𝟎,T)Vπ(s,𝟎,T)+ϵ.V^{\pi}(s,\mathbf{0},T)\leq\lim_{T\to\infty}V^{\pi}(s,\mathbf{0},T)\leq V^{\pi}(s,\mathbf{0},T)+\epsilon.

We call the value of the lower bound on TT given above the ϵ\epsilon-horizon time for the MOMDP MM.

Proof.

The lower bound on limTVπ(s,𝟎,T)\lim_{T\to\infty}V^{\pi}(s,\mathbf{0},T) follows from the definitions, since all rewards are nonnegative and the welfare function should be monotonic. For the upper bound, fix any infinite trajectory τ\tau that starts at ss and let τ\tau^{\prime} be the TT-trajectory prefix of the infinite trajectory τ\tau for some finite TT. Since WW is uniformly continuous, for all ϵ>0\epsilon>0, there exists δϵ>0\delta_{\epsilon}>0 such that W(𝐑(τ))W(𝐑(τ))+ϵW(\mathbf{R}(\tau))\leq W(\mathbf{R}(\tau^{\prime}))+\epsilon if k=1γk1𝐑(sk,ak)k=1Tγk1𝐑(sk,ak)<δϵ\|\sum_{k=1}^{\infty}\gamma^{k-1}\mathbf{R}(s_{k},a_{k})-\sum_{k=1}^{T}\gamma^{k-1}\mathbf{R}(s_{k},a_{k})\|<\delta_{\epsilon}, where \|\cdot\| denotes the usual Euclidean norm. But

\Big\|\sum_{k=1}^{\infty}\gamma^{k-1}\mathbf{R}(s_{k},a_{k})-\sum_{k=1}^{T}\gamma^{k-1}\mathbf{R}(s_{k},a_{k})\Big\|
=\Big\|\sum_{k=T+1}^{\infty}\gamma^{k-1}\mathbf{R}(s_{k},a_{k})\Big\|
=\gamma^{T}\Big\|\sum_{k=1}^{\infty}\gamma^{k-1}\mathbf{R}(s_{T+k},a_{T+k})\Big\|
\leq\gamma^{T}\left(\frac{\sqrt{d}}{1-\gamma}\right).

Solving

γT(d1γ)δϵ\gamma^{T}\left(\frac{\sqrt{d}}{1-\gamma}\right)\leq\delta_{\epsilon}

for TT yields the desired bound on TT. Since the inequality holds for every fixed trajectory, it also holds for the distribution over trajectories induced by any policy π\pi. ∎
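The ϵ\epsilon-horizon time can be computed directly from γ\gamma, dd, and δϵ\delta_{\epsilon}. A minimal sketch, with hypothetical parameter values:

```python
import math

def epsilon_horizon_time(gamma, d, delta_eps):
    """Lemma 13: after T steps the discounted tail of the reward vector has
    norm at most gamma^T * sqrt(d) / (1 - gamma), which falls below delta_eps
    once T >= (1 / (1 - gamma)) * log(sqrt(d) / (delta_eps * (1 - gamma)))."""
    return (1.0 / (1.0 - gamma)) * math.log(math.sqrt(d) / (delta_eps * (1.0 - gamma)))

# Hypothetical parameters: gamma = 0.99, d = 4 objectives, and delta_eps = 0.1
# give an epsilon-horizon time of about 761 steps.
print(math.ceil(epsilon_horizon_time(0.99, 4, 0.1)))
```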

Appendix B Necessity of Conditioning on Remaining Timesteps tt

Our opening example in Figure 1 mentioned the necessity of conditioning on accumulated reward to model the optimal policy. However, the formulation of the optimal value function in Definition 7 also conditions on the number of time-steps remaining in the finite-time horizon task. One may naturally view this extra conditioning as undesirable because it adds computational overhead. Particularly in the continuing-task setting with γ\gamma very close to 1, it is natural to ask whether this conditioning is necessary.

Lemma.

In MORL with a nonlinear scalarization function WW, for any constant α>0\alpha>0, there exists an MOMDP such that any policy π\pi that depends only on the current state and accumulated reward (and does not condition on the remaining timesteps tt) achieves expected welfare VπαVπV^{\pi}\leq\alpha V^{\pi^{*}}, where VπV^{\pi^{*}} is the expected welfare achieved by an optimal policy π\pi^{*} that conditions on tt.

Proof.

We construct an MOMDP where the optimal action at a decision point depends critically on the remaining timesteps tt. We show that any policy not conditioning on tt cannot, in general, approximate the optimal expected welfare within any constant factor.

Consider an MOMDP with states s0s_{0}, s1s_{1}, and s2s_{2}, where the agent chooses between actions aa and bb at s0s_{0}; both s1s_{1} and s2s_{2} are terminal states with no further actions or rewards after they are reached. Choosing action aa results in a reward 𝐑(s0,a)=(ϵ,R)\mathbf{R}(s_{0},a)=(\epsilon,R) and a transition to s1s_{1}. Choosing action bb results in a reward 𝐑(s0,b)=(2ϵ,0)\mathbf{R}(s_{0},b)=(2\epsilon,0) and a transition to s2s_{2}. The accumulated reward prior to reaching s0s_{0} is 𝐑acc=(0,D)\mathbf{R}_{\text{acc}}=(0,D), where D>0D>0. The scalarization function is W(x1,x2)=x1+x22W(x_{1},x_{2})=x_{1}+x_{2}^{2}, and we assume a discount factor γ(0,1)\gamma\in(0,1).

The agent must choose between actions aa and bb at s0s_{0} to maximize the expected welfare:

V=W(𝐑acc+γTt𝐑(s0,a or b)).V=W\left(\mathbf{R}_{\text{acc}}+\gamma^{T-t}\mathbf{R}(s_{0},a\text{ or }b)\right).

For action aa:

\mathbf{R}^{(a)}_{\text{total}}=\mathbf{R}_{\text{acc}}+\gamma^{T-t}\mathbf{R}(s_{0},a)=\left(\gamma^{T-t}\epsilon,\;D+\gamma^{T-t}R\right),
\qquad V^{(a)}=W\left(\mathbf{R}^{(a)}_{\text{total}}\right)=\gamma^{T-t}\epsilon+\left(D+\gamma^{T-t}R\right)^{2}.

For action bb:

\mathbf{R}^{(b)}_{\text{total}}=\mathbf{R}_{\text{acc}}+\gamma^{T-t}\mathbf{R}(s_{0},b)=\left(2\gamma^{T-t}\epsilon,\;D\right),
\qquad V^{(b)}=W\left(\mathbf{R}^{(b)}_{\text{total}}\right)=2\gamma^{T-t}\epsilon+D^{2}.

The difference in expected welfare is given by:

ΔV=V(a)V(b)=γTt(ϵ+2DR+γTtR2).\Delta V=V^{(a)}-V^{(b)}=\gamma^{T-t}\left(-\epsilon+2DR+\gamma^{T-t}R^{2}\right).

When Tt0T-t\to 0 (i.e., tt is close to TT, so the decision at s0s_{0} occurs early in the horizon and is barely discounted):

γTt1,andΔVϵ+2DR+R2.\gamma^{T-t}\to 1,\quad\text{and}\quad\Delta V\to-\epsilon+2DR+R^{2}.

Since RR is large, 2DR+R22DR+R^{2} dominates, and ΔV>0\Delta V>0, making action aa optimal.

When TtT-t is large (i.e., tt is small, so the decision at s0s_{0} occurs late in the horizon and is heavily discounted):

γTt1,andΔVγTt(ϵ+2DR).\gamma^{T-t}\ll 1,\quad\text{and}\quad\Delta V\to\gamma^{T-t}\left(-\epsilon+2DR\right).

If we choose D=ϵδ2RD=\frac{\epsilon-\delta}{2R} for some small δ>0\delta>0, then:

ϵ+2DR=ϵ+2(ϵδ2R)R=δ<0.-\epsilon+2DR=-\epsilon+2\left(\frac{\epsilon-\delta}{2R}\right)R=-\delta<0.

Thus, ΔV<0\Delta V<0, making action bb optimal.

Any policy π\pi that does not condition on tt must choose either action aa or bb at s0s_{0} regardless of tt. If π\pi always chooses aa, it will be suboptimal for large TtT-t, when action bb is optimal. If π\pi always chooses bb, it will be suboptimal for small TtT-t, when action aa is optimal.

Let π\pi^{*} be the optimal policy that conditions on tt. When action aa is optimal, we have

VπVπ=2γTtϵ+D2γTtϵ+(D+γTtR)2.\frac{V^{\pi}}{V^{\pi^{*}}}=\frac{2\gamma^{T-t}\epsilon+D^{2}}{\gamma^{T-t}\epsilon+\left(D+\gamma^{T-t}R\right)^{2}}.

As RR\to\infty, Vπ(D+γTtR)2V^{\pi^{*}}\to\left(D+\gamma^{T-t}R\right)^{2}, which grows quadratically with RR, while VπV^{\pi} remains bounded. Therefore, the ratio approaches zero:

limRVπVπ=0.\lim_{R\to\infty}\frac{V^{\pi}}{V^{\pi^{*}}}=0.

When action bb is optimal, the analysis is analogous, leading to a similar conclusion.

Thus, for any constant α>0\alpha>0, we can choose parameters ϵ,R,D,γ,T\epsilon,R,D,\gamma,T such that the expected welfare ratio VπVπα\frac{V^{\pi}}{V^{\pi^{*}}}\leq\alpha. ∎
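A small numerical check of this construction, with illustrative parameter values not taken from the paper, confirms that the welfare-maximizing action at s0s_{0} flips with the number of remaining timesteps tt:

```python
# Numerical check of the construction above, with illustrative parameters
# (epsilon, R, delta, gamma, T chosen only for demonstration).
eps, R, delta, gamma, T = 1.0, 100.0, 0.5, 0.9, 120
D = (eps - delta) / (2 * R)            # D = (eps - delta) / (2R), as in the proof
W = lambda x1, x2: x1 + x2 ** 2        # scalarization W(x1, x2) = x1 + x2^2

def action_values(t):
    """Welfare of each action at s0 with t timesteps remaining."""
    disc = gamma ** (T - t)
    V_a = W(disc * eps, D + disc * R)  # action a: reward (eps, R)
    V_b = W(2 * disc * eps, D)         # action b: reward (2 * eps, 0)
    return V_a, V_b

V_a, V_b = action_values(T)   # T - t = 0: little discounting, action a wins
assert V_a > V_b
V_a, V_b = action_values(1)   # T - t = 119: heavy discounting, action b wins
assert V_a < V_b
```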

Appendix C Ablation Studies and Additional Experiments

C.1. Ablation Studies with γ<1\gamma<1

We conduct further studies on the empirical effects of the discretization factor α\alpha. For WSPFW_{\text{SPF}}, we set λ=1\lambda=1 (the smoothing factor). We gradually increase the discretization interval of accumulated reward α\alpha from 0.40.4 to 1.21.2, with a reward discount of γ=0.999\gamma=0.999 for Taxi and γ=0.99\gamma=0.99 for Scavenger Hunt and otherwise the same settings as the main experiments. As shown in Figure 4(a) and Figure 4(b), we observe that the solution quality is often very high (better than the baselines considered in the main body) even for discretization values that are substantially greater than those needed for theoretical guarantees (with ϵ=1\epsilon=1, we need α=2.5×103\alpha=2.5\times 10^{-3}). Nonetheless, the results exhibit non-monotonic behavior with respect to α\alpha, which suggests that factors such as the alignment of the discretization granularity with the welfare function’s smoothness play a role, especially in this large-α\alpha regime where empirical performance exceeds the worst-case theoretical guarantees.

Figure 4. Ablation study on discretization factor α\alpha: (a) Nash Welfare in Taxi; (b) Cobb-Douglas in Scavenger Hunt.

This set of ablation studies demonstrates that the RAVI algorithm is capable of finding high-quality policies for optimizing expected scalarized return despite using a larger discretization parameter than that needed for our theoretical analysis. The practical takeaway is that, while smaller α\alpha values provide finer discretization and better theoretical guarantees, they incur increased computation costs. Conversely, larger α\alpha values can coarsen the discretization but still result in competitive performance, as demonstrated empirically.

C.2. Ablation Studies with γ=1\gamma=1

We also conduct further ablation studies on environments with γ=1\gamma=1 using coarser values of α\alpha. We select α{1,1.5,2,2.5,3}\alpha\in\{1,1.5,2,2.5,3\} and run 10 random initializations on each environment for each value. Note that the smallest α\alpha possible in both Taxi and Scavenger is α=1\alpha=1 (these are the results reported in the main text). Our results can be found in Table 2 and Table 3. We observe a noticeable performance drop across both environments as α\alpha grows. Moreover, due to the sparse reward signal in the Taxi environment, the agent achieves zero welfare when α2.0\alpha\geq 2.0. On Scavenger, for RDthreshold=2RD_{\text{threshold=2}}, the agent underperforms when α2.5\alpha\geq 2.5. This finding is to be expected since resources are scarce (only 6 resources in a 15×1515\times 15 grid) and enemies are plentiful (occupying 33% of the grid world).

Table 2. Ablation study on discretization factor α\alpha with γ=1\gamma=1 on Taxi.
Dimension Alpha WNashW_{\text{Nash}} WEgalitarianW_{\text{Egalitarian}} Wp=10W_{p=-10} Wp=0.001W_{p=0.001} Wp=0.9W_{p=0.9}
d=2d=2 α=1.0\alpha=1.0 7.555±\pm0.502 3.000±\pm0.000 5.279±\pm0.000 7.404±\pm0.448 9.628±\pm0.349
α=1.5\alpha=1.5 7.079±\pm0.318 1.000±\pm0.000 3.198±\pm0.000 6.481±\pm0.000 8.877±\pm0.528
α=2.0\alpha=2.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=2.5\alpha=2.5 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=3.0\alpha=3.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
d=3d=3 α=1.0\alpha=1.0 4.996±\pm0.297 2.000±\pm0.000 3.115±\pm0.000 3.307±\pm0.000 6.250±\pm0.322
α=1.5\alpha=1.5 4.902±\pm0.273 1.000±\pm0.000 1.116±\pm0.000 2.080±\pm0.000 6.037±\pm0.322
α=2.0\alpha=2.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=2.5\alpha=2.5 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=3.0\alpha=3.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
d=4d=4 α=1.0\alpha=1.0 2.191±\pm0.147 1.700±\pm0.483 1.029±\pm0.000 2.145±\pm0.129 3.369±\pm0.173
α=1.5\alpha=1.5 2.044±\pm0.157 1.000±\pm0.000 0.000±\pm0.000 2.044±\pm0.157 3.247±\pm0.183
α=2.0\alpha=2.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=2.5\alpha=2.5 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=3\alpha=3 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
d=5d=5 α=1.0\alpha=1.0 2.308±\pm0.185 1.700±\pm0.483 1.023±\pm0.000 2.000±\pm0.102 3.289±\pm0.147
α=1.5\alpha=1.5 2.034±\pm0.122 1.000±\pm0.000 0.000±\pm0.000 1.149±\pm0.000 3.097±\pm0.139
α=2.0\alpha=2.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=2.5\alpha=2.5 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
α=3.0\alpha=3.0 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000 0.000±\pm0.000
Table 3. Ablation study on discretization factor α\alpha with γ=1\gamma=1 on Scavenger.
Dimension Alpha CDρ=0.4CD_{\rho=0.4} RDthreshold=2RD_{\text{threshold}=2}
d=2d=2 α=1.0\alpha=1.0 1.336±\pm0.240 3.400±\pm0.843
α=1.5\alpha=1.5 1.236±\pm0.279 3.600±\pm0.699
α=2.0\alpha=2.0 0.197±\pm0.381 0.100±\pm0.316
α=2.5\alpha=2.5 0.400±\pm0.578 -1.889±\pm3.480
α=3.0\alpha=3.0 0.132±\pm0.295 -12.500±\pm22.405
Table 4. Runtime comparisons (in real-time hours).
Environment Dimension Function RAVI Mixture LinScal WelfareQ Mixture-M
Taxi d=2d=2 WNashW_{\text{Nash}} 0.03 0.76 ±\pm 0.20 1.98 ±\pm 0.28 4.82 ±\pm 1.65 0.15
WEgalitarianW_{\text{Egalitarian}} 0.01 0.69 ±\pm 0.22 2.00 ±\pm 0.31 2.61 ±\pm 0.50 0.15
Wp=10W_{p=-10} 0.01 0.70 ±\pm 0.24 2.11 ±\pm 0.09 4.72 ±\pm 1.81 0.15
Wp=0.001W_{p=0.001} 0.02 0.65 ±\pm 0.29 2.10 ±\pm 0.09 5.10 ±\pm 1.75 0.15
Wp=0.9W_{p=0.9} 0.08 0.64 ±\pm 0.21 2.11 ±\pm 0.09 4.96 ±\pm 1.77 0.15
d=3d=3 WNashW_{\text{Nash}} 0.30 0.60 ±\pm 0.29 2.14 ±\pm 0.39 4.92 ±\pm 1.91 0.32
WEgalitarianW_{\text{Egalitarian}} 0.01 0.72 ±\pm 0.21 2.09 ±\pm 0.30 2.26 ±\pm 0.58 0.32
Wp=10W_{p=-10} 0.02 0.64 ±\pm 0.27 2.18 ±\pm 0.06 5.44 ±\pm 1.91 0.32
Wp=0.001W_{p=0.001} 0.02 0.75 ±\pm 0.11 2.10 ±\pm 0.26 5.73 ±\pm 1.80 0.32
Wp=0.9W_{p=0.9} 2.10 0.74 ±\pm 0.28 2.18 ±\pm 0.11 5.47 ±\pm 1.79 0.32
d=4d=4 WNashW_{\text{Nash}} 0.65 0.82 ±\pm 0.14 2.19 ±\pm 0.13 5.44 ±\pm 1.62 0.53
WEgalitarianW_{\text{Egalitarian}} 0.06 0.71 ±\pm 0.33 2.18 ±\pm 0.13 2.47 ±\pm 0.64 0.53
Wp=10W_{p=-10} 0.02 0.69 ±\pm 0.20 2.12 ±\pm 0.31 5.66 ±\pm 1.58 0.53
Wp=0.001W_{p=0.001} 0.23 0.72 ±\pm 0.23 2.21 ±\pm 0.11 5.11 ±\pm 1.80 0.53
Wp=0.9W_{p=0.9} 36.72 0.73 ±\pm 0.22 2.21 ±\pm 0.07 6.15 ±\pm 1.78 0.53
d=5d=5 WNashW_{\text{Nash}} 1.46 0.70 ±\pm 0.21 2.25 ±\pm 0.10 4.93 ±\pm 1.72 0.75
WEgalitarianW_{\text{Egalitarian}} 0.28 0.71 ±\pm 0.18 2.23 ±\pm 0.05 2.37 ±\pm 0.51 0.75
Wp=10W_{p=-10} 0.04 0.82 ±\pm 0.08 2.21 ±\pm 0.09 5.40 ±\pm 1.83 0.75
Wp=0.001W_{p=0.001} 0.26 0.68 ±\pm 0.17 2.05 ±\pm 0.41 5.45 ±\pm 1.77 0.75
Wp=0.9W_{p=0.9} 118.63 0.67 ±\pm 0.27 2.16 ±\pm 0.30 4.75 ±\pm 1.59 0.75
Scavenger d=2d=2 CDp=0.4CD_{p=0.4} 0.11 0.30 ±\pm 0.05 0.25 ±\pm 0.06 0.34 ±\pm 0.05 4.71
RDthreshold=2RD_{\text{threshold=2}} 0.09 0.25 ±\pm 0.06 0.30 ±\pm 0.05 0.29 ±\pm 0.06 4.39

C.3. Visualizations of RAVI and Baselines

Given that Table 1 in the main text contains only results evaluated after algorithms are trained until convergence, it omits other important information such as the rate of convergence and the learning process of online algorithms such as WelfareQ. Thus, in this subsection, we provide visualizations of the learning curves for these algorithms. Each plot shows the mean over 10 random initializations, with the standard deviation as shaded regions. Note that we use horizontal lines for model-based approaches as they do not have an online learning phase. Our results can be found in Figures 5 through 10.

Figure 5. Taxi, WNashW_{\text{Nash}} (panels for d=2,3,4,5).
Figure 6. Taxi, WEgalitarianW_{\text{Egalitarian}} (panels for d=2,3,4,5).
Figure 7. Taxi, Wp=10W_{p=-10} (panels for d=2,3,4,5).
Figure 8. Taxi, Wp=0.001W_{p=0.001} (panels for d=2,3,4,5).
Figure 9. Taxi, Wp=0.9W_{p=0.9} (panels for d=2,3,4,5).
Figure 10. Scavenger (panels for CDρ=0.4CD_{\rho=0.4} and RDthreshold=2RD_{\text{threshold}=2}).

Appendix D Details on the RAEE Algorithm

In this section, we give a detailed description of the RAEE algorithm, which extends the E3 algorithm of Kearns and Singh (2002), and we prove Theorem 12, the main theorem for the RAEE algorithm.

D.1. Introducing RAEE

The extension works as follows. The algorithm starts off by doing balanced wandering Kearns and Singh (2002): when encountering a new state, the algorithm selects a random action. However, when revisiting a previously visited state, it chooses the least attempted action from that state, resolving ties by random action selection. At each state-action pair (s,a)(s,a) it tries, the algorithm stores the reward 𝐑(s,a)\mathbf{R}(s,a) received and an estimate of the transition probabilities Pr(ss,a)Pr(s^{\prime}\mid s,a) derived from the empirical distribution of next states reached during balanced wandering.

Next, we introduce the notion of a known state Kearns and Singh (2002), which refers to a state that the algorithm has explored to the extent that the estimated transition probabilities for any action from that state closely approximate their actual values. Denote the number of times a state needs to be visited as mknownm_{known}. We will specify the value later in our runtime characterization.

States are thus categorized into three groups: known states, which the algorithm has visited extensively and for which it has obtained reliable transition statistics; states that have been visited before but remain unknown because limited trials have produced unreliable data; and states that have not been explored at all. By the Pigeonhole Principle, accurate statistics will eventually accumulate in some states over time, leading to their becoming known. Let SS be the set of currently known states; the algorithm can then build the current known-state MOMDP MSM_{S} that is naturally induced on SS by the full MOMDP MM, with all “unknown” states merged into a single absorbing state. Although the algorithm cannot access MSM_{S} directly, it has an approximation M^S\hat{M}_{S} by the definition of the known states. By the simulation lemma Kearns and Singh (2002), M^S\hat{M}_{S} is an accurate model in the sense that the expected TT-step welfare of any policy in M^S\hat{M}_{S} is close to its expected TT-step welfare in MSM_{S}. (Here TT is the horizon time.) Hence, at any timestep, M^S\hat{M}_{S} functions as a partial model of MM, restricted to the part of MM that the algorithm understands well.

This is where we insert RAVI. The algorithm performs two off-line optimal policy computations: i) on M^S\hat{M}_{S}, it uses RAVI to compute an exploitation policy that yields approximately optimal welfare, and ii) on M^S\hat{M}^{\prime}_{S}, it performs traditional value iteration, where M^S\hat{M}^{\prime}_{S} has the same transition probabilities as M^S\hat{M}_{S} but different payoffs: in M^S\hat{M}^{\prime}_{S}, the absorbing state (representing “unknown” states) has scalar reward 11 and all other states have scalar reward 0. The optimal policy in M^S\hat{M}^{\prime}_{S} simply exits the known model as rapidly as possible, rewarding exploration.

By the explore or exploit lemma Kearns and Singh (2002), the algorithm is guaranteed to either output a policy with approximately optimal return in MM, or to improve the statistics at an unknown state. Again by the Pigeonhole Principle, a new state becomes known after the latter case occurs for some finite number of times, and thus the algorithm is always making progress. In the worst case, the algorithm builds a model of the entire MOMDP MM. Having described the elements, we now outline the entire extended algorithm, where the notations are consistent with Kearns and Singh (2002).

RAEE Algorithm:

  • (Initialization) Initially, the set SS of known states is empty.

  • (Balanced Wandering) Any time the current state is not in SS, the algorithm performs balanced wandering.

  • (Discovery of New Known States) Any time a state ss has been visited mknownm_{known} Kearns and Singh (2002) times during balanced wandering, it enters the known set SS, and no longer participates in balanced wandering.

  • (Off-line Optimizations) Upon reaching a known state sSs\in S during balanced wandering, the algorithm performs the two off-line optimal policy computations on M^S\hat{M}_{S} and M^S\hat{M}^{\prime}_{S} described above:

    • (Attempted Exploitation) Use the RAVI algorithm to compute an ϵ/2\epsilon/2-optimal policy on M^S\hat{M}_{S}. If the resulting exploitation policy π^\hat{\pi} achieves return from ss in M^S\hat{M}_{S} that is at least V(s,𝟎,T)ϵ/2V^{*}(s,\mathbf{0},T)-\epsilon/2, the algorithm halts and outputs π^\hat{\pi}.

    • (Attempted Exploration) Otherwise, the algorithm executes the resulting exploration policy derived from the off-line computation on M^S\hat{M}^{\prime}_{S} for TT steps in MM.

  • (Balanced Wandering) Any time an attempted exploitation or attempted exploration visits a state not in SS, the algorithm resumes balanced wandering.

This concludes the description of the algorithm.
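For concreteness, the sketch below lays out the outer loop just described in Python. The environment and planner interfaces (env.step, ravi_plan, explore_plan, v_star) are assumed placeholders rather than the actual implementation.

```python
from collections import defaultdict

def raee(env, actions, m_known, T, ravi_plan, explore_plan, v_star, eps):
    """Schematic RAEE outer loop; all interfaces are assumed placeholders.

    env.reset() / env.step(s, a) return states; ravi_plan(counts, known, s)
    returns an exploitation policy and its value at s, computed by RAVI on
    the empirical known-state model; explore_plan(counts, known) returns a
    policy that reaches the absorbing "unknown" state as fast as possible;
    v_star(s) is the target optimal value V*(s, 0, T) given as input.
    """
    visits = defaultdict(int)                       # visit counts per state
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    known = set()
    s = env.reset()
    while True:
        if s not in known:
            # Balanced wandering: take the least-attempted action at s
            # (ties broken arbitrarily here; the paper breaks them at random).
            a = min(actions, key=lambda act: sum(counts[(s, act)].values()))
            s_next = env.step(s, a)
            counts[(s, a)][s_next] += 1
            visits[s] += 1
            if visits[s] >= m_known:
                known.add(s)                        # s becomes a known state
            s = s_next
        else:
            policy, value = ravi_plan(counts, known, s)   # attempted exploitation
            if value >= v_star(s) - eps / 2:
                return policy                             # near-optimal: halt
            explore = explore_plan(counts, known)         # attempted exploration
            for t in range(T):
                a = explore(s, t)
                s = env.step(s, a)
                if s not in known:
                    break                                 # resume balanced wandering
```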

D.2. Runtime Analysis

In this subsection, we comment on additional details of the analysis of Theorem 12, the main theorem for the RAEE algorithm.

Theorem 12 (RAEE). Let V(s,0,T)V^{*}(s,0,T) denote the value function for the policy with the optimal expected welfare in the MOMDP MM starting at state ss, with 𝟎d\mathbf{0}\in\mathbb{R}^{d} accumulated reward and TT timesteps remaining. Then for a uniformly continuous welfare function WW, there exists an algorithm AA, taking inputs ϵ\epsilon, β\beta, |𝒮||\mathcal{S}|, |𝒜||\mathcal{A}|, and V(s,𝟎,T)V^{*}(s,\mathbf{0},T), such that the total number of actions and computation time taken by AA is polynomial in 1/ϵ1/\epsilon, 1/β1/\beta, |𝒮||\mathcal{S}|, |𝒜||\mathcal{A}|, the horizon time T=1/(1γ)T=1/(1-\gamma) and exponential in the number of objectives dd, and with probability at least 1β1-\beta, AA will halt in a state ss, and output a policy π^\hat{\pi}, such that VMπ^(s,0,T)V(s,0,T)ϵV^{\hat{\pi}}_{M}(s,0,T)\geq V^{*}(s,0,T)-\epsilon.

We begin by defining approximation of MOMDPs. Think of MM as the true MOMDP, that is, a perfect model of transition probabilities. Think of MM^{\prime}, on the other hand, as the current best estimate of an MOMDP obtained through exploration. In particular, MM^{\prime} in the RAEE algorithm will consist of the set of known states.

Definition.

Kearns and Singh (2002) Let MM and MM^{\prime} be two MOMDPs over the same state space 𝒮\mathcal{S} with the same deterministic reward function 𝐑(s,a)\mathbf{R}(s,a). MM^{\prime} is an α\alpha-approximation of MM if for any states ss and ss^{\prime} and any action aa,

PrM(ss,a)αPrM(ss,a)PrM(ss,a)+α,Pr_{M}(s^{\prime}\mid s,a)-\alpha\leq Pr_{M^{\prime}}(s^{\prime}\mid s,a)\leq Pr_{M}(s^{\prime}\mid s,a)+\alpha,

where the subscript denotes the model.
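A direct check of this definition on tabular transition models is straightforward; the sketch below, with toy numbers, is only meant to make the quantifiers concrete.

```python
import numpy as np

def is_alpha_approximation(P, P_hat, alpha):
    """Check the definition above: every transition probability of the
    estimated model P_hat is within +/- alpha of the true model P.
    Both are arrays of shape (|S|, |A|, |S|)."""
    return bool(np.all(np.abs(np.asarray(P) - np.asarray(P_hat)) <= alpha))

# Toy example with |S| = 2 and |A| = 1 (illustrative numbers only).
P     = np.array([[[0.70, 0.30]], [[0.50, 0.50]]])
P_hat = np.array([[[0.72, 0.28]], [[0.46, 0.54]]])
print(is_alpha_approximation(P, P_hat, alpha=0.05))  # True
print(is_alpha_approximation(P, P_hat, alpha=0.01))  # False
```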

We now extend the Simulation Lemma Kearns and Singh (2002), which tells us how close the approximation of an MOMDP needs to be in order for the expected welfare, or ESR, of a policy to be close in an estimated model. The argument is similar to Kearns and Singh (2002), but in our case the policy may not be stationary and we want to bound the deviation in expected welfare rather than just accumulated reward.

Recall that GmaxTG^{T}_{max} is defined to be the maximum possible welfare achievable on a TT-step trajectory; in our model, GmaxTG^{T}_{max} is at most W(𝐓)W(\mathbf{T}), where 𝐓\mathbf{T} equals TT times the all-ones vector.

Lemma 16 (Extended Simulation Lemma).

Let MM^{\prime} be an
O((ϵ/|𝒮||𝒜|TGmaxT)2)O((\epsilon/|\mathcal{S}||\mathcal{A}|TG^{T}_{\max})^{2})-approximation of MM. Then for any policy π\pi, any state ss, and horizon time TT, we have

VMπ(s,τ0;0,T)ϵVMπ(s,τ0;0,T)VMπ(s,τ0;0,T)+ϵ.V^{\pi}_{M}(s,\tau_{0;0},T)-\epsilon\leq V^{\pi}_{M^{\prime}}(s,\tau_{0;0},T)\leq V^{\pi}_{M}(s,\tau_{0;0},T)+\epsilon.
Proof.

Fix a policy π\pi and a start state ss. Let MM^{\prime} be an α\alpha-approximation of MM (we will later show that α\alpha satisfies the bound in the lemma statement). Call the transition probability from a state ss to a state ss^{\prime} under action aa β\beta-small in MM if PrM(ss,a)βPr_{M}(s^{\prime}\mid s,a)\leq\beta. Then the probability that a TT-trajectory τ\tau starting from a state ss following policy π\pi contains at least one β\beta-small transition is at most β|𝒮||𝒜|T\beta|\mathcal{S}||\mathcal{A}|T. This is because the total probability of all β\beta-small transitions out of any state is at most β|𝒮||𝒜|\beta|\mathcal{S}||\mathcal{A}| (in the worst case, every transition probability is β\beta-small), and there are TT timesteps. Note that in our case the optimal policy is not necessarily stationary, so the agent does not necessarily choose the same action (and hence the same transition probabilities) upon revisiting a state. So we cannot bound the total probability by β|𝒮|\beta|\mathcal{S}| as in the original proof.

The total expected welfare of the trajectories of π\pi that contain at least one β\beta-small transition of MM is at most β|𝒮||𝒜|TGmaxT\beta|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}. Recall that MM^{\prime} is an α\alpha-approximation of MM. Then for any β\beta-small transition in MM, we have PrM(ss,a)α+βPr_{M^{\prime}}(s^{\prime}\mid s,a)\leq\alpha+\beta. So in MM^{\prime}, the total expected welfare of the trajectories of π\pi that contain at least one β\beta-small transition of MM is at most (α+β)|𝒮||𝒜|TGmaxT(\alpha+\beta)|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}. We can thus bound the difference between VMπ(s,τ0;0,T)V^{\pi}_{M}(s,\tau_{0;0},T) and VMπ(s,τ0;0,T)V^{\pi}_{M^{\prime}}(s,\tau_{0;0},T) restricted to these trajectories by (α+2β)|𝒮||𝒜|TGmaxT(\alpha+2\beta)|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}. We will later choose β\beta and bound this value by ϵ/4\epsilon/4 to solve for α\alpha.

Next, consider trajectories of length TT starting at ss that do not contain any β\beta-small transitions, i.e., PrM(ss,a)>βPr_{M}(s^{\prime}\mid s,a)>\beta for every transition in these trajectories. Choosing Δ=α/β\Delta=\alpha/\beta, we may write

(1-\Delta)Pr_{M}(s^{\prime}\mid s,a)\leq Pr_{M^{\prime}}(s^{\prime}\mid s,a)\leq(1+\Delta)Pr_{M}(s^{\prime}\mid s,a),

because MM^{\prime} is an α\alpha-approximation of MM and every such transition satisfies PrM(ss,a)>βPr_{M}(s^{\prime}\mid s,a)>\beta. Thus for any TT-trajectory τ\tau that does not cross any β\beta-small transition under π\pi, we have

(1Δ)TPrMπ[τ]PrMπ[τ](1+Δ)TPrMπ[τ],(1-\Delta)^{T}Pr_{M}^{\pi}[\tau]\leq Pr_{M^{\prime}}^{\pi}[\tau]\leq(1+\Delta)^{T}Pr_{M}^{\pi}[\tau],

which follows from the definition of the probability along a trajectory and the fact that π\pi is the same policy in all terms. Since we assume reward functions are deterministic in our case, for any particular TT-trajectory τ\tau, we also have

WM(𝐑(τ))=WM(𝐑(τ)).W_{M}(\mathbf{R}(\tau))=W_{M^{\prime}}(\mathbf{R}(\tau)).

Since these hold for any fixed TT-trajectory that does not traverse any β\beta-small transitions in MM under π\pi, they also hold when we take expectations over the distributions over such TT-trajectories in MM and MM^{\prime} induced by π\pi. Thus

(1-\Delta)^{T}V^{\pi}_{M}(s,\tau_{0;0},T)-\frac{\epsilon}{4}\leq V^{\pi}_{M^{\prime}}(s,\tau_{0;0},T)\leq(1+\Delta)^{T}V^{\pi}_{M}(s,\tau_{0;0},T)+\frac{\epsilon}{4},

where the ϵ/4\epsilon/4 terms account for the contributions of the TT-trajectories that traverse at least one β\beta-small transition under π\pi. It remains to show how to choose α\alpha, β\beta, and Δ\Delta to obtain the desired approximation. For the upper bound, we require

(1+\Delta)^{T}V^{\pi}_{M}(s,\tau_{0;0},T)\leq V^{\pi}_{M}(s,\tau_{0;0},T)+\frac{\epsilon}{4}\;\implies\;(1+\Delta)^{T}\leq 1+\epsilon/(4G^{T}_{\max}).

By taking log\log on both sides and using Taylor expansion, we can upper bound Δ\Delta.

T\Delta/2\leq\epsilon/(4G^{T}_{\max})\;\implies\;\Delta\leq\epsilon/(2TG^{T}_{\max}).

Choose β=α\beta=\sqrt{\alpha}. Then

\begin{cases}(\alpha+2\beta)|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}\leq 3\sqrt{\alpha}\,|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}\leq\epsilon/4\\ \Delta=\sqrt{\alpha}\leq\epsilon/(2TG^{T}_{\max})\end{cases}

Choosing α=O((ϵ/|𝒮||𝒜|TGmaxT)2)\alpha=O((\epsilon/|\mathcal{S}||\mathcal{A}|TG^{T}_{\max})^{2}) solves the system. The lower bound can be handled similarly, which completes the proof of the lemma. ∎

We now define a “known state.” This is a state that has been visited enough times, and its actions have been trialed sufficiently many times, that we have accurate estimates of the transition probabilities from this state.

Definition.

Kearns and Singh (2002) Let MM be an MOMDP. A state ss of MM is considered known if it has been visited a number of times equal to

mknown=O((|𝒮||𝒜|TGmaxT/ϵ)4|𝒜|log(1/δ)).m_{known}=O((|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}/\epsilon)^{4}|\mathcal{A}|\log(1/\delta)).

By applying Chernoff bounds, we can show that if a state has been visited mknownm_{known} times, then its empirical estimates of the transition probabilities satisfy the accuracy required by Lemma 16.

Lemma 18.

Kearns and Singh (2002) Let MM be an MOMDP. Let ss be a state of MM that has been visited at least mm times, with each action having been executed at least m/|𝒜|\lfloor m/|\mathcal{A}|\rfloor times. Let Pr^(ss,a)\hat{Pr}(s^{\prime}\mid s,a) denote the empirical transition probability estimates obtained from the mm visits to ss. Then if

m=O((|𝒮||𝒜|TGmaxT/ϵ)4|𝒜|log(1/δ)),m=O((|\mathcal{S}||\mathcal{A}|TG^{T}_{\max}/\epsilon)^{4}|\mathcal{A}|\log(1/\delta)),

then with probability 1δ1-\delta, we have

|Pr^(ss,a)Pr(ss,a)|=O((ϵ/|𝒮||𝒜|TGmaxT)2)|\hat{Pr}(s^{\prime}\mid s,a)-Pr(s^{\prime}\mid s,a)|=O((\epsilon/|\mathcal{S}||\mathcal{A}|TG^{T}_{\max})^{2})

for all s𝒮s^{\prime}\in\mathcal{S}.

Proof.

The sampling version of the Chernoff bound states that if the number nn of independent, uniformly random samples that we use to estimate the fraction pp of a population with a certain property satisfies

n=O(1α2log(1δ)),n=O\left(\frac{1}{\alpha^{2}}\log\left(\frac{1}{\delta}\right)\right),

then our estimate X¯\bar{X} satisfies

\bar{X}\in[p-\alpha,p+\alpha]\quad\text{with probability }1-\delta.

By the Extended Simulation Lemma, it suffices to choose α=O((ϵ/|𝒮||𝒜|TGmaxT)2)\alpha=O((\epsilon/|\mathcal{S}||\mathcal{A}|TG^{T}_{\max})^{2}).

Note that we need to insert an extra factor of |𝒜||\mathcal{A}| compared to the original analysis since we treat the size of the action space as a variable instead of a constant, and a state is categorized as “known” only if the transition probability estimates for all actions are sufficiently accurate. ∎
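The sample-count calculation in the lemma can be illustrated as follows; the leading constant is left as a parameter since the statement is asymptotic, and the numbers below are purely illustrative.

```python
import math
import numpy as np

def samples_needed(accuracy, delta, c=1.0):
    """Chernoff-style sample count; the constant c is left as a parameter
    because the statement is asymptotic. Roughly c / accuracy^2 * log(1/delta)
    samples estimate a probability to within +/- accuracy w.p. 1 - delta."""
    return math.ceil(c / accuracy ** 2 * math.log(1.0 / delta))

# Illustrative check: estimating a transition probability Pr(s' | s, a) = 0.3.
rng = np.random.default_rng(0)
accuracy, delta = 0.05, 0.01
n = samples_needed(accuracy, delta)
hits = rng.random(n) < 0.3              # one indicator per visit to (s, a)
estimate = hits.mean()
print(abs(estimate - 0.3) <= accuracy)  # holds with probability >= 1 - delta
```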

We have specified the degree of approximation required for sufficient simulation accuracy. It remains to directly apply the Explore or Exploit Lemma from Kearns and Singh (2002) to conclude the analysis.

Lemma (Explore or Exploit Lemma, Kearns and Singh (2002)).

Let MM be any MOMDP, let SS be any subset of the states of MM, and let MSM_{S} be the MOMDP induced on SS by MM. For any s𝒮s\in\mathcal{S}, any TT, and any 1>α>01>\alpha>0, either there exists a policy π\pi in MSM_{S} such that VMSπ(s,τ0;0,T)VM(s,𝟎,T)αV^{\pi}_{M_{S}}(s,\tau_{0;0},T)\geq V^{*}_{M}(s,\mathbf{0},T)-\alpha, or there exists a policy π\pi in MSM_{S} such that the probability that a TT-trajectory following π\pi will lead to the absorbing “unknown” state exceeds α/GmaxT\alpha/G^{T}_{\max}.

This lemma guarantees that either the TT-step return of the optimal exploitation policy in the simulated model is very close to the optimum achievable in MM, or the exploration policy reaches a previously unknown state with significant probability.

Note that unlike the original E3 algorithm, which uses standard value iteration to compute the exactly optimal policy (optimizing a linear function of a scalar reward) on the sub-model M^S\hat{M}_{S}, we use our RAVI algorithm to find an approximately optimal policy. Therefore, we need to allocate ϵ/2\epsilon/2 error to both the simulation stage and the exploitation stage, which gives a total of ϵ\epsilon error for the entire algorithm.

It remains to handle the failure parameter β\beta in the statement of the main theorem, which can be done similarly to Kearns and Singh (2002). There are two sources of failure for the algorithm:

  • The algorithm’s estimation of the next state distribution for some action at a known state is inaccurate, resulting in M^S\hat{M}_{S} being an inaccurate model of MSM_{S}.

  • Despite doing attempted explorations repeatedly, the algorithm fails to turn a previously unknown state into a known state because of an insufficient number of balanced wandering steps.

It suffices to allocate failure probability β/2\beta/2 to each source of failure. The first source of failure is bounded by Lemma 18. By choosing β=β/(2|𝒮|)\beta^{\prime}=\beta/(2|\mathcal{S}|) per known state, we ensure that the probability that the first source of failure occurs at any given known state in MSM_{S} is small enough that, by a union bound, the total failure probability is bounded by β/2\beta/2 for MSM_{S}.

For the second source of failure, by the Explore or Exploit Lemma, each attempted exploration results in at least one step of balanced wandering (i.e., it leads the agent to an unknown state) with probability at least ϵ/(2GmaxT)\epsilon/(2G^{T}_{\max}). The agent does at most |𝒮|mknown|\mathcal{S}|m_{known} steps of balanced wandering, since this many steps makes every state known. By a Chernoff bound, the probability that the agent does fewer than |𝒮|mknown|\mathcal{S}|m_{known} steps of balanced wandering (attempted explorations that actually lead to an unknown state) is smaller than β/2\beta/2 if the number of attempted explorations is

O((GmaxT/ϵ)|𝒮|log(|𝒮|/β)mknown),O((G^{T}_{\max}/\epsilon)|\mathcal{S}|\log(|\mathcal{S}|/\beta)m_{known}),

where mknown=O(((|𝒮||𝒜|TGmaxT)/ϵ)4|𝒜|log(|𝒮|/β))m_{known}=O(((|\mathcal{S}||\mathcal{A}|TG^{T}_{\max})/\epsilon)^{4}|\mathcal{A}|\log(|\mathcal{S}|/\beta)) (recall we choose β=β/(2|𝒮|)\beta^{\prime}=\beta/(2|\mathcal{S}|) for the first source of failure).

Thus, the total computation time is bounded by O(|𝒮|2|𝒜|(T/αTϵ/2)d)O\left(|\mathcal{S}|^{2}|\mathcal{A}|(T/\alpha_{T\epsilon/2})^{d}\right) (the time required for RAVI during off-line computations with ϵ/2\epsilon/2 precision by Lemma 10) times TT times the maximum number of attempted explorations, giving

O((|𝒮|3|𝒜|TGmaxT/ϵ)(T/αTϵ/2)dlog(|𝒮|/β)mknown).O((|\mathcal{S}|^{3}|\mathcal{A}|TG^{T}_{\max}/\epsilon)(T/\alpha_{T\epsilon/2})^{d}\log(|\mathcal{S}|/\beta)m_{known}).

This concludes the proof of Theorem 12.

Appendix E More Discussion About Experiments

Deterministic Transitions: Deterministic settings are common in many state-of-the-art environments in MORL, particularly when focusing on episodic tasks with short time horizons. These settings are often used to emphasize the core algorithmic contributions without introducing additional complexities from stochastic transitions or long horizons. Nonetheless, we acknowledge that incorporating stochastic environments would better showcase the generality of our approach and highlight this as an important direction for future work.

Model Sizes and Scalability: We recognize that the model sizes used in the experiments are relatively small, primarily due to memory limitations in the tabular setting, and that function approximation would be needed to move to large state space environments as highlighted in our future work.

RAEE Algorithm: RAVI is a model-based approach, and one of the baselines, “Mixture-M” (a model-based mixture policy), uses the model. For the other, model-free baselines, we compare only performance at convergence with RAVI to achieve a fairer comparison. To the best of our knowledge, there are no other model-based algorithms in the field specifically designed for optimizing ESR objectives. Due to computational and space constraints, we focused our empirical evaluation on RAVI, as the optimization subroutine is the core algorithmic contribution. That said, we acknowledge this limitation and believe benchmarking RAEE empirically is a worthwhile direction for future research.

Model Accuracy and RAEE: While it is true that RAVI’s performance depends on model accuracy, the central goal of RAEE is precisely to ensure it learns a sufficiently accurate model of the environment to provably guarantee an approximation factor. For future work, comparing RAVI with RAEE would provide valuable insights into the trade-offs between leveraging accurate models and learning them through exploration.