Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
Abstract
The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed in (kumar_becker_82) to overcome the central obstacle of what is variously called the fundamental “closed-loop identifiability problem” of adaptive control (borkar_varaiya_79), the “dual control problem” by Feldbaum (feldbaum1960dual1; feldbaum1960dual2), or, contemporaneously, the “exploration vs. exploitation problem”. It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify only the closed-loop transition probabilities under a certainty equivalent approach (borkar_varaiya_79), the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it introduced a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a carefully structured solution to the fundamental problem alluded to above. It thereby embodied an optimistic approach of favoring parameters with larger optimal rewards, now known as “optimism in the face of uncertainty.” The RBMLE approach has been proved to be long-term average reward optimal in a variety of contexts, including controlled Markov chains, linear quadratic Gaussian (LQG) systems, some nonlinear systems, and diffusions. However, modern attention is focused on the much finer notion of “regret,” or finite-time performance over the entire horizon, espoused by (lai_85). Recent analysis of RBMLE for multi-armed stochastic bandits (liu_20) and linear contextual bandits (hung_20) has shown that it not only has state-of-the-art regret, but it also exhibits empirical performance comparable to or better than the best current contenders, and it leads to several new and strikingly simple index policies for these classical problems. Motivated by this, we examine the finite-time performance of RBMLE for reinforcement learning tasks that involve the general problem of optimal control of unknown Markov Decision Processes. We show that its regret over a time horizon of $T$ steps is comparable to that of state-of-the-art algorithms.
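Schematically, the reward bias takes the following form: with $L_t(\theta)$ denoting the likelihood of the observations up to time $t$, $J^*(\theta)$ the optimal long-term average reward under parameter $\theta$, and $\alpha(t)$ a positive bias sequence that grows without bound but sublinearly in $t$, the classical criterion of (kumar_becker_82) selects, roughly,
$$\hat{\theta}_t \in \arg\max_{\theta \in \Theta} \Big\{ \log L_t(\theta) + \alpha(t)\, \log J^*(\theta) \Big\}.$$
This is only a sketch of the general form; variants of this reward-biased criterion underlie the index policies of (liu_20; hung_20), and the precise criterion analyzed in this paper is specified later.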
keywords:
Reinforcement Learning; Markov Decision Process; Adaptive Control
1 Introduction
Consider a controlled Markov chain with finite state space $\mathcal{S}$, finite action set $\mathcal{A}$, and controlled transition probabilities $p(s_{t+1} \mid s_t, a_t; \theta^*)$ parametrized by $\theta^*$, where $s_t$ denotes the state at time $t$, and $a_t$ denotes the action taken at time $t$. A reward $r(s_t, a_t)$ is received when action $a_t$ is taken in state $s_t$. Let $J^*(\theta^*)$ denote the maximal long-term average reward obtainable. We consider the case where $\theta^*$ is only known to belong to a set $\Theta$, but is otherwise unknown. We address the adaptive control problem of minimizing the expected “regret”
$$\mathcal{R}(T) \;:=\; T\, J^*(\theta^*) \;-\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right] \qquad (1)$$
as a function of the horizon $T$.
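As a concrete illustration of the quantity in (1), the following Python sketch estimates the regret of a fixed stationary policy on a small, hypothetical two-state MDP; the transition probabilities, rewards, and policies below are illustrative choices rather than anything specified in this paper, and the expectation in (1) is approximated by a single simulated trajectory.

import numpy as np

# Minimal illustrative sketch of the regret in (1): the two-state, two-action
# MDP below (transition probabilities P, rewards r, and the policies) is
# hypothetical and not taken from the paper.

rng = np.random.default_rng(0)

S, A = 2, 2
# P[a, s] is the distribution over next states when action a is taken in state s.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
# r[s, a] is the reward for taking action a in state s.
r = np.array([[1.0, 0.0],
              [0.2, 0.8]])

def average_reward(policy, n_steps=20_000):
    # Estimate the long-term average reward of a stationary deterministic
    # policy (a tuple mapping state -> action) by simulation.
    s, total = 0, 0.0
    for _ in range(n_steps):
        a = policy[s]
        total += r[s, a]
        s = rng.choice(S, p=P[a, s])
    return total / n_steps

# J*(theta*) would ordinarily come from solving the average-reward Bellman
# equation; for this tiny example we simply enumerate the four stationary
# deterministic policies.
J_star = max(average_reward((a0, a1)) for a0 in range(A) for a1 in range(A))

# Empirical counterpart of (1) for a fixed (possibly suboptimal) policy:
# regret(T) ~ T * J*(theta*) - (reward actually collected up to time T),
# with the expectation approximated by a single sample path.
T = 10_000
pi = (1, 0)
s, collected = 0, 0.0
for _ in range(T):
    a = pi[s]
    collected += r[s, a]
    s = rng.choice(S, p=P[a, s])
print("empirical regret over horizon T:", T * J_star - collected)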
This broad problem has a long history. Let $J(\pi, \theta)$ denote the long-term average reward accrued by a stationary deterministic policy $\pi$ when the transition probabilities are given by $\theta$, let $J^*(\theta) := \max_{\pi} J(\pi, \theta)$ denote the optimal long-term average reward attainable under $\theta$, and let $\pi^*(\theta)$ be an optimal policy for $\theta$. In early work,