On the Stochastic (Variance-Reduced) Proximal Gradient Method for Regularized Expected Reward Optimization

Ling Liang liang.ling@u.nus.edu
Department of Mathematics
University of Maryland at College Park
Haizhao Yang hzyang@umd.edu
Department of Mathematics and Department of Computer Science
University of Maryland at College Park
Abstract

We consider a regularized expected reward optimization problem in the non-oblivious setting that covers many existing problems in reinforcement learning (RL). In order to solve such an optimization problem, we apply and analyze the classical stochastic proximal gradient method. In particular, the method is shown to admit an O(\epsilon^{-4}) sample complexity to an \epsilon-stationary point, under standard conditions. Since the variance of the classical stochastic gradient estimator is typically large, which slows down the convergence, we also apply an efficient stochastic variance-reduced proximal gradient method with an importance sampling based ProbAbilistic Gradient Estimator (PAGE). Our analysis shows that the sample complexity can be improved from O(\epsilon^{-4}) to O(\epsilon^{-3}) under additional conditions. Our results on the stochastic (variance-reduced) proximal gradient method match the sample complexity of their most competitive counterparts for discounted Markov decision processes under similar settings. To the best of our knowledge, the proposed methods represent a novel approach in addressing the general regularized reward optimization problem.

1 Introduction

Reinforcement learning (RL) Sutton & Barto (2018) has recently become a highly active research area of machine learning in which an agent learns to make sequential decisions by interacting with the environment. RL has achieved tremendous success in many applications such as control, job scheduling, online advertising, and game-playing Zhang & Dietterich (1995); Pednault et al. (2002); Mnih et al. (2013), to mention a few. One of the central tasks of RL is to solve a certain (expected) reward optimization problem for decision-making. Following this research theme, we consider the following problem of maximizing the regularized expected reward:

\max_{\theta\in\mathbb{R}^{n}}\;\mathcal{F}(\theta):=\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\right]-\mathcal{G}(\theta), (1)

where \mathcal{G}:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\} is a closed proper convex (possibly nonsmooth) function, x\in\mathbb{R}^{d}, \mathcal{R}_{\theta}:\mathbb{R}^{d}\to\mathbb{R} is the reward function depending on the parameter \theta, and \pi_{\theta} denotes the probability distribution over a given subset \mathcal{S}\subseteq\mathbb{R}^{d} parameterized by \theta\in\mathbb{R}^{n}. Adopting the convention in RL, we call \pi_{\theta} a policy parameterized by \theta. Moreover, for the rest of this paper, we denote \mathcal{J}(\theta):=\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\right] as the expected reward function in the non-oblivious setting. The learning objective is to learn a decision rule via finding the policy parameter \theta that maximizes the regularized expected reward. To the best of our knowledge, the study of the general model (1) has been limited in the literature. Hence, developing and analyzing algorithmic frameworks for solving this problem is of great interest.
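
As a purely illustrative instance of (1) (the Gaussian policy and the \ell_{1} regularizer below are choices made here for concreteness, not assumptions used later in the paper), one may take d=n, \pi_{\theta}=\mathcal{N}(\theta,\sigma^{2}I_{n}) and \mathcal{G}(\theta)=\lambda\lVert\theta\rVert_{1} with \lambda>0, so that (1) reads

\max_{\theta\in\mathbb{R}^{n}}\;\mathbb{E}_{x\sim\mathcal{N}(\theta,\sigma^{2}I_{n})}\left[\mathcal{R}_{\theta}(x)\right]-\lambda\lVert\theta\rVert_{1},

where the regularizer promotes sparse policy parameters while the first term rewards good decisions.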

There is a large body of work in supervised learning focusing on the oblivious setting Zhang (2004); Hastie et al. (2009); Shapiro et al. (2021), i.e., \mathcal{J}(\theta):=\mathbb{E}_{x\sim\pi}\left[\mathcal{R}_{\theta}(x)\right], where x is sampled from an invariant distribution \pi. Clearly, problem (1) can be viewed as a generalization of those machine learning problems with oblivious objective functions. In the literature, an RL problem is often formulated as a discrete-time discounted Markov decision process (MDP) Sutton & Barto (2018), which aims to learn an optimal policy via optimizing the (discounted) cumulative sum of rewards. We can also see that the learning objective of an MDP is covered by problem (1), with the property that the function \mathcal{R}(x) does not depend on \theta (see Example 3.3). Recently, the application of RL to solving combinatorial optimization (CO) problems, which are typically NP-hard, has attracted much attention. These CO problems include the traveling salesman problem and related problems Bello et al. (2016); Mazyavkina et al. (2021), the reward optimization problem arising from the finite expression method Liang & Yang (2022); Song et al. (2023), and the general binary optimization problem Chen et al. (2023), to name just a few. The common key component of the aforementioned applications is reward optimization, which can also be formulated as problem (1). There also exist problems with general reward functions that fall outside the scope of the cumulative sum of rewards of trajectories used in MDPs. An interesting example is the MDP with general utilities; see, e.g., Zhang et al. (2020a); Kumar et al. (2022); Barakat et al. (2023) and references therein.

Adding a regularizer to the objective function is a commonly used technique to impose desirable structures on the solution and/or to greatly enhance the expressive power and applicability of RL Lan (2023); Zhan et al. (2023). When one considers the direct/simplex parameterization Agarwal et al. (2021) of \pi_{\theta}, a regularization function using the indicator function of the standard probability simplex is needed. Moreover, by using indicator functions of other convex sets, one is able to impose additional constraints on the parameter \theta. For the softmax parameterization, one may also enforce a boundedness constraint on \theta to prevent it from taking values that are too large. This can avoid potential numerical issues, including overflow errors on a floating point system. On the other hand, there are incomplete parametric policy classes, such as the log-linear and neural policy classes, that are often formulated as \{\pi_{\theta}\;|\;\theta\in\Theta\}, where \Theta is a closed convex set Agarwal et al. (2021). In this case, the indicator function is still necessary and useful. Some recent works (see, e.g., Ahmed et al. (2019); Agarwal et al. (2020); Mei et al. (2020); Cen et al. (2022)) have investigated the impact of entropy regularization for MDPs. Systematic studies on general convex regularization for MDPs had been limited until the recent works Pham et al. (2020); Lan (2023); Zhan et al. (2023). Finally, problem (1) takes the same form as the stochastic optimization problem with decision-dependent distributions (see, e.g., Drusvyatskiy & Xiao (2023) and references therein), leading to numerous real-world applications such as performative prediction Mendler-Dünner et al. (2020); Perdomo et al. (2020), concept drift Gama et al. (2014), strategic classification Tsirtsis et al. (2024); Milli et al. (2019), and causal inference Yao et al. (2021). Consequently, we can see that problem (1) is in fact quite general and has promising modeling power, as it covers many existing problems in the literature.

The purpose of this paper is to leverage existing tools and results in MDPs and nonconvex optimization for solving the general regularized expected reward optimization problem (1) with general policy parameterization, which, to the best of our knowledge, has not been formally considered in the RL literature. It is well known that the policy gradient method Williams (1992); Sutton et al. (1999); Baxter & Bartlett (2001), which lies at the heart of RL, is one of the most competitive and efficient algorithms due to its simplicity and versatility. Moreover, the policy gradient method is readily implemented and can be paired with other effective techniques. In this paper, we observe that the stochastic proximal gradient method, which shares the same spirit as the policy gradient method, can be applied directly to solve the targeted problem (1) with convergence guarantees to a stationary point. Since the classical stochastic gradient estimator typically introduces a large variance, there is also a need to consider designing advanced stochastic gradient estimators with smaller variances. To this end, we shall also look into a certain stochastic variance-reduced proximal gradient method and analyze its convergence properties. In particular, the contributions of this paper are summarized as follows.

  • We consider a novel and general regularized reward optimization model (1) that covers many existing important models in the machine learning and optimization literature. Thus, problem (1) admits promising modeling power, which encourages potential applications.

  • In order to solve our targeted problem, we consider applying the classical stochastic proximal gradient method and analyze its convergence properties. We first demonstrate that the gradient of \mathcal{J}(\cdot) is Lipschitz continuous under standard conditions on the reward function \mathcal{R}_{\theta}(\cdot) and the parameterized policy \pi_{\theta}(\cdot). Using the L-smoothness of \mathcal{J}(\cdot), we then show that the classical stochastic proximal gradient method with a constant step-size (depending only on the Lipschitz constant of \nabla_{\theta}\mathcal{J}(\cdot)) for solving problem (1) outputs an \epsilon-stationary point (see Definition 3.4) within T:=O(\epsilon^{-2}) iterations, and the sample size for each iteration is O(\epsilon^{-2}), where \epsilon>0 is a given tolerance. Thus, the total sample complexity becomes O(\epsilon^{-4}), which matches the current state-of-the-art sample complexity of the classical stochastic policy gradient method for MDPs; see, e.g., Williams (1992); Baxter & Bartlett (2001); Zhang et al. (2020b); Xiong et al. (2021); Yuan et al. (2022).

  • Moreover, in order to further reduce the variance of the stochastic gradient estimator, we utilize an importance sampling based probabilistic gradient estimator, which leads to an efficient single-looped variance-reduced method. The application of this probabilistic gradient estimator is motivated by the recent progress in developing efficient stochastic variance-reduced gradient methods for solving stochastic optimization Li et al. (2021b) and (unregularized) MDPs Gargiani et al. (2022). We show that, under additional technical conditions, the total sample complexity is improved from O(\epsilon^{-4}) to O(\epsilon^{-3}). This result again matches the results of some existing competitive variance-reduced methods for MDPs Papini et al. (2018); Xu et al. (2019); Pham et al. (2020); Huang et al. (2021); Yang et al. (2022); Gargiani et al. (2022). Moreover, to the best of our knowledge, the application of the above probabilistic gradient estimator is new for solving the regularized expected reward optimization problem (1).

The rest of this paper is organized as follows. We first summarize some related works in Section 2. Next, in Section 3, we present some background information that is needed for the exposition of this paper. Then, in Section 4, we describe the classical stochastic proximal gradient method for solving (1) and present the convergence properties of this method under standard technical conditions. Section 5 is dedicated to describing and analyzing the stochastic variance-reduced proximal gradient method with an importance sampling based probabilistic gradient estimator. Finally, we make some concluding remarks and list certain limitations and future research directions in Section 6.

2 Related Work

The policy gradient method. One of the most influential algorithms for solving RL problems is the policy gradient method, built upon the foundations established in Williams (1992); Sutton et al. (1999); Baxter & Bartlett (2001). Motivated by the empirical success of the policy gradient method and its variants, analyzing the convergence properties of these methods has long been one of the most active research topics in RL. Since the objective function \mathcal{J}(\theta) is generally nonconcave, early works Sutton et al. (1999); Pirotta et al. (2015) focused on the asymptotic convergence properties to a stationary point. By utilizing the special structure in (entropy regularized) MDPs, recent works Liu et al. (2019); Mei et al. (2020); Agarwal et al. (2021); Li et al. (2021a); Xiao (2022); Cen et al. (2022); Lan (2023); Fatkhullin et al. (2023) provided some exciting results on global convergence. Meanwhile, since the exact gradient of the objective function can hardly be computed, sampling-based approximated/stochastic gradients have gained much attention. Therefore, many works investigated the convergence properties, including the iteration and sample complexities, of these algorithms with inexact gradients; see, e.g., Zhang et al. (2020b); Liu et al. (2020); Zhang et al. (2021b); Xiong et al. (2021); Yuan et al. (2022); Lan (2023) and references therein.

Variance reduction. While the classical stochastic gradient estimator is straightforward and simple to implement, one of its most critical issues is that the variance of the inexact gradient estimator can be large, which generally slows down the convergence of the algorithm. To alleviate this issue, an attractive approach is to pair the sample-based policy gradient methods with certain variance-reduced techniques. Variance-reduced methods were originally developed for solving (oblivious) stochastic optimization problems Johnson & Zhang (2013); Nguyen et al. (2017); Fang et al. (2018); Li et al. (2021b) typically arising from supervised learning tasks. Motivated by the superior theoretical properties and practical performance of the stochastic variance-reduced gradient methods, similar algorithmic frameworks have recently been applied for solving MDPs Papini et al. (2018); Xu et al. (2019); Yuan et al. (2020); Pham et al. (2020); Huang et al. (2021); Yang et al. (2022); Gargiani et al. (2022).

Stochastic optimization with decision-dependent distributions. Stochastic optimization is at the core of modern machine learning applications, whose main objective is to learn a decision rule from a limited data sample that is assumed to generalize well to the entire population Drusvyatskiy & Xiao (2023). In the classical supervised learning framework Zhang (2004); Hastie et al. (2009); Shapiro et al. (2021), the underlying data distribution is assumed to be static, which turns out to be a crucial assumption when analyzing the convergence properties of common stochastic optimization algorithms. On the other hand, there are problems where the distribution changes over the course of the iterations of a specific algorithm; these are closely related to the concept of performative prediction Perdomo et al. (2020). In this case, understanding the convergence properties of the algorithm becomes more challenging. Toward this end, some recent progress has been made on (strongly) convex stochastic optimization with decision-dependent distributions Mendler-Dünner et al. (2020); Perdomo et al. (2020); Drusvyatskiy & Xiao (2023). Moreover, other works have also considered nonconvex problems and obtained some promising results; see Dong et al. (2023); Jagadeesan et al. (2022) and references therein. Developing theoretical foundations for these problems has become a very active field.

RL with general utilities. It is known that the goal of an agent associated with an MDP is to seek an optimal policy via maximizing the cumulative discounted reward Sutton & Barto (2018). However, there are decision problems of interest having more general forms. Beyond the scope of the expected cumulative reward in MDPs, some recent works also looked into RL problems with general utilities; see e.g., Zhang et al. (2020a); Kumar et al. (2022); Barakat et al. (2023) as mentioned previously. Global convergence results can also be derived via investigating the hidden convex structure Zhang et al. (2020a) inherited from the MDP.

3 Preliminary

In this paper, we assume that the optimal objective value for problem (1), denoted by \mathcal{F}^{*}, is finite and attained, and the reward function \mathcal{R}_{\theta}(\cdot) satisfies the following assumption.

Assumption 3.1.

The following two conditions on the function \mathcal{R}_{\theta}(\cdot) hold:

  1.

    There exists a constant U>0 such that

    \sup_{\theta\in\mathbb{R}^{n},\;x\in\mathbb{R}^{d}}\;\left\lvert\mathcal{R}_{\theta}(x)\right\rvert\leq U.
  2.

    \mathcal{R}_{\theta}(\cdot) is twice continuously differentiable with respect to \theta, and there exist positive constants \widetilde{C}_{g} and \widetilde{C}_{h} such that

    \sup_{\theta\in\mathbb{R}^{n},\;x\in\mathbb{R}^{d}}\;\left\lVert\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert\leq\widetilde{C}_{g},\quad\sup_{\theta\in\mathbb{R}^{n},\;x\in\mathbb{R}^{d}}\;\left\lVert\nabla_{\theta}^{2}\mathcal{R}_{\theta}(x)\right\rVert_{2}\leq\widetilde{C}_{h}.

The first condition, the boundedness of the function \mathcal{R}_{\theta}(\cdot), is commonly assumed in the literature Sutton & Barto (2018) and ensures that \mathcal{J}(\theta) is well-defined. The second condition will be used to guarantee the well-definedness and the Lipschitz continuity of the gradient \nabla_{\theta}\mathcal{J}(\theta). We remark that when the reward function \mathcal{R}_{\theta}(x) does not depend on \theta (see, e.g., Example 3.3), the second condition holds automatically.

To determine the (theoretical) learning rate in our algorithmic frameworks, we also need to make some standard assumptions to establish the L-smoothness of \mathcal{J}(\cdot).

Assumption 3.2 (Lipschitz and smooth policy assumption).

The function \log\pi_{\theta}(x) is twice differentiable with respect to \theta\in\mathbb{R}^{n}, and there exist positive constants C_{g} and C_{h} such that

\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\nabla_{\theta}\log\pi_{\theta}(x)\right\rVert\leq C_{g},\quad\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\nabla_{\theta}^{2}\log\pi_{\theta}(x)\right\rVert_{2}\leq C_{h}.

This assumption is a standard one and commonly employed in the literature when studying the convergence properties of the policy gradient method for MDPs; see e.g., Pirotta et al. (2015); Papini et al. (2018); Xu et al. (2020); Pham et al. (2020); Zhang et al. (2021a); Yang et al. (2022) and references therein.

Under Assumptions 3.1 and 3.2, it is easy to verify that the gradient of the expected reward function \mathcal{J}(\theta) can be written as:

\nabla_{\theta}\mathcal{J}(\theta)=\nabla_{\theta}\left(\int\mathcal{R}_{\theta}(x)\pi_{\theta}(x)\,dx\right)=\int\left(\nabla_{\theta}\mathcal{R}_{\theta}(x)+\mathcal{R}_{\theta}(x)\frac{\nabla_{\theta}\pi_{\theta}(x)}{\pi_{\theta}(x)}\right)\pi_{\theta}(x)\,dx
=\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\right].
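
To make the above identity concrete, the following sketch forms a Monte Carlo estimate of \nabla_{\theta}\mathcal{J}(\theta) under an assumed Gaussian policy; the callables reward and reward_grad, as well as the Gaussian parameterization, are hypothetical placeholders rather than part of the paper's setting.

import numpy as np

def grad_log_policy(x, theta, sigma=1.0):
    # Score function of a Gaussian policy x ~ N(theta, sigma^2 I):
    # grad_theta log pi_theta(x) = (x - theta) / sigma^2.
    return (x - theta) / sigma**2

def grad_estimator(theta, reward, reward_grad, n_samples=1000, sigma=1.0, rng=None):
    # Monte Carlo estimate of E_{x~pi_theta}[ R_theta(x) grad log pi_theta(x) + grad R_theta(x) ].
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    xs = theta + sigma * rng.standard_normal((n_samples, theta.size))
    terms = [reward(x, theta) * grad_log_policy(x, theta, sigma) + reward_grad(x, theta) for x in xs]
    return np.mean(terms, axis=0)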

We next present an example of the discrete-time discounted MDP, which is covered by the general model (1).

Example 3.3 (MDP).

We denote a discrete-time discounted MDP as \mathcal{M}:=\{S,A,P,R,\gamma,\rho_{0}\}, where S and A denote the state space and the action space, respectively, P(s^{\prime}|s,a) is the state transition probability from s to s^{\prime} after selecting the action a, R:S\times A\to[0,U] is the reward function, assumed to be uniformly bounded by a constant U>0, \gamma\in[0,1) is the discount factor, and \rho_{0} is the initial state distribution.

The agent selects actions according to a stationary random policy \tilde{\pi}_{\theta}(\cdot|\cdot):A\times S\to[0,1] parameterized by \theta\in\mathbb{R}^{n}. Given an initial state s_{0}\in S, a trajectory \tau:=\{s_{t},a_{t},r_{t+1}\}_{t=0}^{H-1} can then be generated, where s_{0}\sim\rho_{0}, a_{t}\sim\tilde{\pi}_{\theta}(\cdot|s_{t}), r_{t+1}=R(s_{t},a_{t}), s_{t+1}\sim P(\cdot|s_{t},a_{t}), and H>0 is a finite horizon. The accumulated discounted reward of the trajectory \tau is defined as \mathcal{R}(\tau):=\sum_{t=0}^{H-1}\gamma^{t}r_{t+1}. Then, the learning objective is to compute an optimal parameter \theta^{*} that maximizes the expected reward function \mathcal{J}(\theta) (here, the trajectory \tau and the distribution \rho_{\theta} correspond to x and \pi_{\theta} in (1), respectively), i.e.,

\theta^{*}\in\mathrm{argmax}_{\theta}\;\mathcal{J}(\theta):=\mathbb{E}_{\tau\sim\rho_{\theta}}\left[\mathcal{R}(\tau)\right], (2)

where \rho_{\theta}(\tau):=\rho_{0}(s_{0})\prod_{t=0}^{H-1}P(s_{t+1}|s_{t},a_{t})\tilde{\pi}_{\theta}(a_{t}|s_{t}) denotes the probability distribution of a trajectory \tau being sampled from \rho_{\theta} parameterized by \theta.

In the special case when S=\{s\} (i.e., |S|=1) and \gamma=0, the MDP reduces to a multi-armed bandit problem Robbins (1952) with a reward function simplified as R:A\to\mathbb{R}. In particular, a trajectory \tau=\{s,a\} with horizon H=1 is generated, where a\sim\rho_{\theta}(\cdot):=\tilde{\pi}_{\theta}(\cdot|s), and the accumulated discounted reward reduces to \mathcal{R}(\tau)=R(a). As a consequence, problem (2) simplifies to

\max_{\theta\in\mathbb{R}^{n}}\;\mathcal{J}(\theta)=\mathbb{E}_{a\sim\rho_{\theta}}\left[R(a)\right].

By adding a convex regularizer \mathcal{G}(\theta) to problem (2), we obtain the following regularized MDP:

\max_{\theta\in\mathbb{R}^{n}}\;\mathbb{E}_{\tau\sim\rho_{\theta}}\left[\mathcal{R}(\tau)\right]-\mathcal{G}(\theta),

which was considered in Pham et al. (2020). However, it is clear that \mathcal{R}(\tau) does not depend on \theta. Hence, the above regularized MDP is a special case of the proposed regularized reward optimization problem (1).

One can check that the gradient \nabla_{\theta}\mathcal{J}(\theta) has the following form Yuan et al. (2022):

\nabla_{\theta}\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim\rho_{\theta}}\left[\sum_{t=0}^{H-1}\gamma^{t}R(s_{t},a_{t})\sum_{t^{\prime}=0}^{t}\nabla_{\theta}\log\tilde{\pi}_{\theta}(a_{t^{\prime}}|s_{t^{\prime}})\right].
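
For the MDP of Example 3.3, a single-trajectory estimate of this gradient can be sketched as follows; the env (with reset/step) and policy (with sample/grad_log_prob) interfaces are hypothetical and only serve to illustrate the formula. Averaging this quantity over a mini-batch of independently sampled trajectories yields the estimator used in the algorithms below.

import numpy as np

def trajectory_gradient(env, policy, theta, horizon, gamma, rng):
    # One-sample estimate of grad_theta J(theta) =
    # E[ sum_t gamma^t R(s_t, a_t) * sum_{t' <= t} grad_theta log pi_theta(a_t'|s_t') ].
    s = env.reset(rng)
    grad = np.zeros_like(theta)
    score_sum = np.zeros_like(theta)  # running sum of grad_theta log pi_theta(a_t'|s_t')
    for t in range(horizon):
        a = policy.sample(theta, s, rng)
        s_next, r = env.step(s, a)
        score_sum += policy.grad_log_prob(theta, s, a)
        grad += (gamma ** t) * r * score_sum
        s = s_next
    return grad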

Being a composite optimization problem, problem (1) admits the following first-order stationary condition

0\in-\nabla_{\theta}\mathcal{J}(\theta)+\partial\mathcal{G}(\theta). (3)

Here, \partial\mathcal{G}(\cdot) denotes the subdifferential of the proper closed and convex function \mathcal{G}(\cdot), which is defined as

\partial\mathcal{G}(\theta):=\left\{g\in\mathbb{R}^{n}\;:\;\mathcal{G}(\theta^{\prime})\geq\mathcal{G}(\theta)+\left\langle g,\theta^{\prime}-\theta\right\rangle,\;\forall\theta^{\prime}\in\mathbb{R}^{n}\right\}.
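
For instance (an illustrative choice of ours, not tied to any specific application in this paper), taking \mathcal{G}(\theta)=\lambda\lVert\theta\rVert_{1} with \lambda>0 gives the componentwise description

\partial\mathcal{G}(\theta)=\left\{g\in\mathbb{R}^{n}\;:\;g_{k}=\lambda\,\mathrm{sign}(\theta_{k})\text{ if }\theta_{k}\neq 0,\;g_{k}\in[-\lambda,\lambda]\text{ if }\theta_{k}=0\right\},

so that condition (3) requires \partial\mathcal{J}(\theta)/\partial\theta_{k}=\lambda\,\mathrm{sign}(\theta_{k}) at nonzero coordinates and |\partial\mathcal{J}(\theta)/\partial\theta_{k}|\leq\lambda at zero coordinates.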

It is well-known that \partial\mathcal{G}(\theta) is a nonempty closed convex subset of \mathbb{R}^{n} for any \theta\in\mathbb{R}^{n} such that \mathcal{G}(\theta)<\infty (see, e.g., Rockafellar (1997)). Note that any optimal solution of problem (1) satisfies the condition (3), while the reverse statement is generally not valid for nonconcave problems, including problem (1). The condition (3) leads to the following concept of stationary points for problem (1).

Definition 3.4.

A point \theta\in\mathbb{R}^{n} is called a stationary point for problem (1) if it satisfies the condition (3). Given a tolerance \epsilon>0, a stochastic optimization method attains an (expected) \epsilon-stationary point, denoted as \theta\in\mathbb{R}^{n}, if

\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\theta)+\partial\mathcal{G}(\theta)\right)^{2}\right]\leq\epsilon^{2},

where the expectation is taken with respect to all the randomness caused by the algorithm after running it for T iterations, and \mathrm{dist}(x,\mathcal{C}) denotes the distance between a point x and a closed convex set \mathcal{C}.

Remark 3.5 (Gradient mapping).

Note that the optimality condition (3) can be rewritten as

0=G_{\eta}(\theta):=\frac{1}{\eta}\left[\mathrm{Prox}_{\eta\mathcal{G}}\left(\theta+\eta\nabla_{\theta}\mathcal{J}(\theta)\right)-\theta\right],

for some \eta>0, where

\mathrm{Prox}_{\eta\mathcal{G}}(\theta):=\mathrm{argmin}_{\theta^{\prime}}\left\{\mathcal{G}(\theta^{\prime})+\frac{1}{2\eta}\left\lVert\theta^{\prime}-\theta\right\rVert^{2}\right\}

denotes the proximal mapping of the function \mathcal{G}(\cdot). The mapping G_{\eta}(\cdot) is called the gradient mapping in the field of optimization Beck (2017). It is easy to verify that if, for some \theta\in\mathbb{R}^{n}, it holds that

\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\theta)+\partial\mathcal{G}(\theta)\right)\leq\epsilon,

then there exists a vector d satisfying \left\lVert d\right\rVert\leq\epsilon such that

d+\nabla_{\theta}\mathcal{J}(\theta)\in\partial\mathcal{G}(\theta),

which is equivalent to saying that

\theta=\mathrm{Prox}_{\eta\mathcal{G}}\left(\eta d+\theta+\eta\nabla_{\theta}\mathcal{J}(\theta)\right).

Moreover, we can verify (by using the firm nonexpansiveness of \mathrm{Prox}_{\eta\mathcal{G}}(\cdot); see, e.g., Beck (2017)) that

\left\lVert G_{\eta}(\theta)\right\rVert=\frac{1}{\eta}\left\lVert\mathrm{Prox}_{\eta\mathcal{G}}\left(\theta+\eta\nabla_{\theta}\mathcal{J}(\theta)\right)-\theta\right\rVert\leq\left\lVert d\right\rVert\leq\epsilon.

Therefore, we can also characterize an (expected) \epsilon-stationary point by using the following condition:

\mathbb{E}_{T}\left[\left\lVert G_{\eta}(\theta)\right\rVert^{2}\right]\leq\epsilon^{2}.
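
As a small illustration of the quantities in this remark, the following sketch evaluates \mathrm{Prox}_{\eta\mathcal{G}} and the gradient mapping G_{\eta} for the (assumed) choice \mathcal{G}(\theta)=\lambda\lVert\theta\rVert_{1}, whose proximal mapping is the soft-thresholding operator.

import numpy as np

def prox_l1(v, eta, lam):
    # Prox_{eta*G}(v) for G(theta) = lam * ||theta||_1: soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def gradient_mapping(theta, grad_J, eta, lam):
    # G_eta(theta) = (1/eta) * [ Prox_{eta*G}(theta + eta * grad_J) - theta ];
    # a small ||G_eta(theta)|| certifies approximate stationarity of (1).
    return (prox_l1(theta + eta * grad_J, eta, lam) - theta) / eta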

The main objective of this paper is to study the convergence properties, including iteration and sample complexities, of the stochastic (variance-reduced) proximal gradient method to an \epsilon-stationary point with a pre-specified \epsilon>0. Note that all proofs of our results are presented in the appendix. Moreover, we acknowledge that our analysis draws upon classical results in the literature.

4 The stochastic proximal gradient method

In this section, we present and analyze the stochastic proximal gradient method for solving problem (1). The fundamental idea of the algorithm is to replace the true gradient \nabla_{\theta}\mathcal{J}(\theta), which is unavailable most of the time, with a stochastic gradient estimator in the classical proximal gradient method Beck (2017). The method can be viewed as an extension of the projected policy gradient method with direct parameterization Agarwal et al. (2021) and of the stochastic policy gradient method for unregularized MDPs Williams (1992). The detailed description of the algorithm is presented in Algorithm 1.

For notational simplicity, we denote

g(x,\theta):=\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x).
1:  Input: initial point \theta^{0}, sample size N and the learning rate \eta>0.
2:  for t=0,\dots,T-1 do
3:     Compute the stochastic gradient estimator:
g^{t}:=\frac{1}{N}\sum_{j=1}^{N}g(x^{t,j},\theta^{t}),
where \{x^{t,1},\dots,x^{t,N}\} are sampled independently according to \pi_{\theta^{t}}.
4:     Update
\theta^{t+1}=\mathrm{Prox}_{\eta\mathcal{G}}\left(\theta^{t}+\eta g^{t}\right).
5:  end for
6:  Output: \hat{\theta}^{T} selected randomly from the generated sequence \{\theta^{t}\}_{t=1}^{T}.
Algorithm 1 The stochastic proximal gradient method
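
A compact sketch of Algorithm 1 is given below, assuming user-supplied callables sample_grad(theta, rng), returning one realization of g(x,\theta) with x\sim\pi_{\theta}, and prox(v, eta), implementing \mathrm{Prox}_{\eta\mathcal{G}}; both names are placeholders rather than part of the formal development.

import numpy as np

def stochastic_prox_gradient(theta0, sample_grad, prox, eta, N, T, seed=0):
    # Sketch of Algorithm 1: mini-batch stochastic proximal gradient ascent on problem (1).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    iterates = []
    for _ in range(T):
        # g^t: average of N i.i.d. REINFORCE-type samples drawn under pi_{theta^t}.
        g = np.mean([sample_grad(theta, rng) for _ in range(N)], axis=0)
        theta = prox(theta + eta * g, eta)      # proximal gradient ascent step
        iterates.append(theta.copy())
    return iterates[rng.integers(T)]            # hat{theta}^T chosen randomly from the iterates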

From Algorithm 1, we see that at each iteration, N data points, namely \{x^{t,1},\dots,x^{t,N}\}, are sampled according to the current probability distribution \pi_{\theta^{t}}. Using these data points, we construct a REINFORCE-type stochastic gradient estimator g^{t}. Then, the algorithm performs a proximal gradient ascent update. Let T>0 be the maximal number of iterations; a sequence \{\theta^{t}\}_{t=1}^{T} is then generated, and the output solution is selected randomly from this sequence. Next, we shall answer, from a theoretical viewpoint, the questions of how to choose the learning rate \eta>0, how large the sample size N should be, and how many iterations the algorithm needs to output an \epsilon-stationary point for a given \epsilon>0. The next lemma establishes the L-smoothness of \mathcal{J}(\cdot), whose proof is given in Appendix A.1.

Lemma 4.1.

Under Assumptions 3.1 and 3.2, the gradient of \mathcal{J} is L-Lipschitz continuous (i.e., \mathcal{J} is L-smooth):

\left\lVert\nabla_{\theta}\mathcal{J}(\theta)-\nabla_{\theta}\mathcal{J}(\theta^{\prime})\right\rVert\leq L\left\lVert\theta-\theta^{\prime}\right\rVert,\quad\forall\theta,\theta^{\prime}\in\mathbb{R}^{n},

with L:=U(C_{g}^{2}+C_{h})+\widetilde{C}_{h}+2C_{g}\widetilde{C}_{g}>0.

Remark 4.2 (L-smoothness in MDPs).

For an MDP with finite action space and state space as in Example 3.3, the Lipschitz constant of \nabla_{\theta}\mathcal{J}(\cdot) can be expressed in terms of |A|, |S| and \gamma. We refer the reader to Agarwal et al. (2021); Xiao (2022) for more details.

As a consequence of the L-smoothness of the function \mathcal{J}(\cdot), we next show that the learning rate can be chosen as a positive constant upper bounded by a quantity that depends only on the Lipschitz constant of \nabla_{\theta}\mathcal{J}(\cdot). For notational simplicity, we denote \Delta:=\mathcal{F}^{*}-\mathcal{F}(\theta^{0})>0 for the rest of this paper.

Theorem 4.3.

Under Assumptions 3.1 and 3.2, if we set \eta\in\left(0,\frac{1}{2L}\right), then Algorithm 1 outputs a point \hat{\theta}^{T} satisfying

\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]\leq\left(2+\frac{2}{\eta L(1-2\eta L)}\right)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]+\frac{\Delta}{T}\left(\frac{2}{\eta}+\frac{4}{\eta(1-2\eta L)}\right),

where \mathbb{E}_{T} is defined in Definition 3.4.

The proof of the above theorem is provided in Appendix A.2. From this theorem, if one sets g^{t}=\nabla_{\theta}\mathcal{J}(\theta^{t}), i.e., \left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}=0, then there is no randomness along the iterations and the convergence property reduces to

\min_{1\leq t\leq T}\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{t})+\partial\mathcal{G}(\hat{\theta}^{t})\right)=O\left(\frac{1}{\sqrt{T}}\right),

which is implied by classical results on the proximal gradient method (see, e.g., Beck (2017)). However, since the exact full gradient \nabla_{\theta}\mathcal{J}(\theta) is rarely computable, it is common to require the variance (i.e., the trace of the covariance matrix) of the stochastic estimator to be bounded. The latter condition plays an essential role in analyzing stochastic first-order methods for solving nonconvex optimization problems, including RL applications; see, e.g., Beck (2017); Papini et al. (2018); Shen et al. (2019); Lan (2020); Yang et al. (2022).

Lemma 4.4.

Under Assumptions 3.1 and 3.2, there exists a constant \sigma>0 such that for any \theta,

\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert g(x,\theta)-\nabla_{\theta}\mathcal{J}(\theta)\right\rVert^{2}\right]\leq\sigma^{2}.

The proof of Lemma 4.4 is given in Appendix A.3. By choosing a suitable sample size N, we can rely on Lemma 4.4 to make the term \mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right] in Theorem 4.3 small, for every t\leq T. Then, Theorem 4.3 implies that Algorithm 1 admits an expected O(T^{-1}) convergence rate to a stationary point. These results are summarized in the following theorem; see Appendix A.4 for a proof.

Theorem 4.5.

Suppose that Assumptions 3.1 and 3.2 hold. Let \epsilon>0 be a given accuracy. Running Algorithm 1 for

T:=\left\lceil\frac{\Delta}{\epsilon^{2}}\left(\frac{4}{\eta}+\frac{8}{\eta(1-2\eta L)}\right)\right\rceil=O(\epsilon^{-2})

iterations with the learning rate \eta<\frac{1}{2L} and the sample size

N:=\left\lceil\frac{\sigma^{2}}{\epsilon^{2}}\left(4+\frac{4}{\eta L(1-2\eta L)}\right)\right\rceil=O(\epsilon^{-2})

outputs a point \hat{\theta}^{T} satisfying

\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]\leq\epsilon^{2}.

Moreover, the total sample complexity is O(\epsilon^{-4}).

As already mentioned in the introduction, the total sample complexity of Algorithm 1 to reach an \epsilon-stationary point is shown to be O(\epsilon^{-4}), which matches the most competitive sample complexity of the classical stochastic policy gradient method for MDPs Williams (1992); Baxter & Bartlett (2001); Zhang et al. (2020b); Xiong et al. (2021); Yuan et al. (2022).

Remark 4.6 (Sample size).

Note that the current state-of-the-art iteration complexity for the (small-batch) stochastic gradient descent method is T:=O(\epsilon^{-2}) with \eta_{t}:=\min\{O(L^{-1}),O(T^{-1/2})\}; see, e.g., Ghadimi & Lan (2013). The reason for requiring a larger batch size in Theorem 4.5 is to allow a constant learning rate. To the best of our knowledge, to obtain the same convergence properties as in Theorem 4.5 under the same conditions for problem (1), the large batch size is required.

Remark 4.7 (Global convergence).

As mentioned in the introduction, some recent progress has been made in analyzing the global convergence properties of policy gradient methods for MDPs, which relies heavily on the concept of gradient domination and its extensions Agarwal et al. (2021); Mei et al. (2020); Xiao (2022); Yuan et al. (2022); Gargiani et al. (2022). This concept is also highly related to the classical PŁ-condition Polyak (1963) and KŁ-condition Bolte et al. (2007) in the field of optimization. One of the key ideas is to assume or verify that the difference between the optimal objective function value \mathcal{F}^{*} and \mathcal{F}(\theta) can be bounded by a quantity depending on the norm of the gradient mapping at an arbitrary point. In particular, suppose that there exists a positive constant \omega such that

\left\lVert G_{\eta}(\theta)\right\rVert\geq 2\sqrt{\omega}\left(\mathcal{F}^{*}-\mathcal{F}(\theta)\right),\quad\forall\;\theta\in\mathbb{R}^{n},

where G_{\eta} is defined in Remark 3.5 (see, e.g., Xiao (2022)). Then, after running Algorithm 2 for T=O(\epsilon^{-2}) iterations, one can easily check that

\mathbb{E}_{T}\left[\mathcal{F}^{*}-\mathcal{F}(\hat{\theta}^{T})\right]\leq\frac{1}{2\sqrt{\omega}}\epsilon.
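
For completeness, the displayed bound follows from the assumed gradient-domination condition, Jensen's inequality, and the guarantee \mathbb{E}_{T}\left[\lVert G_{\eta}(\hat{\theta}^{T})\rVert^{2}\right]\leq\epsilon^{2} from Remark 3.5:

\mathbb{E}_{T}\left[\mathcal{F}^{*}-\mathcal{F}(\hat{\theta}^{T})\right]\leq\frac{1}{2\sqrt{\omega}}\,\mathbb{E}_{T}\left[\left\lVert G_{\eta}(\hat{\theta}^{T})\right\rVert\right]\leq\frac{1}{2\sqrt{\omega}}\sqrt{\mathbb{E}_{T}\left[\left\lVert G_{\eta}(\hat{\theta}^{T})\right\rVert^{2}\right]}\leq\frac{\epsilon}{2\sqrt{\omega}}.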

In conclusion, by assuming or verifying stronger conditions, one can typically show that any stationary point of problem (1) is also a globally optimal solution. This shares the same spirit as Zhang et al. (2020a) for MDPs with general utilities. We leave the analysis of the global convergence of problem (1) as future research.

5 Variance reduction via PAGE

Recall from Theorem 4.3 that there is a trade-off between the sample complexity and the iteration complexity of Algorithm 1. In particular, while there is little room to improve the term \frac{\Delta}{T}\left(\frac{2}{\eta}+\frac{4}{\eta(1-2\eta L)}\right), which corresponds to the iteration complexity, it is possible to construct g^{t} in a more sophisticated manner to improve the sample complexity. Therefore, our main goal in this section is to reduce the expected sample complexity while keeping the term \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}\right] small. We achieve this goal by considering the stochastic variance-reduced gradient methods that have recently attracted much attention. Among these variance-reduced methods, as argued in Gargiani et al. (2022), the ProbAbilistic Gradient Estimator (PAGE) proposed in Li et al. (2021b) has a simple structure and can lead to optimal convergence properties. These appealing features make it attractive in machine learning applications. Therefore, in this section, we consider the stochastic variance-reduced proximal gradient method with PAGE for solving problem (1).

PAGE was originally designed for stochastic nonconvex minimization in the oblivious setting:

\min_{\theta\in\mathbb{R}^{n}}\;f(\theta):=\mathbb{E}_{x\sim\pi}[F(x,\theta)],

where \pi is a fixed probability distribution and F:\mathbb{R}^{d}\times\mathbb{R}^{n}\to\mathbb{R} is a certain differentiable (and possibly nonconvex) loss function. For stochastic gradient-type methods, a certain stochastic gradient estimator of f is required for performing the optimization. At the t-th iteration, given a probability p_{t}\in[0,1] and the current gradient estimator g^{t}, PAGE proposes to replace the vanilla mini-batch gradient estimator with the following unbiased stochastic estimator:

\nabla f(\theta^{t+1})\approx g^{t+1}:=\begin{cases}\displaystyle\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\nabla_{\theta}F(x^{j},\theta^{t+1}),&\textrm{with probability }p_{t},\\ \displaystyle g^{t}+\frac{1}{N_{2}}\left(\sum_{j=1}^{N_{2}}\nabla_{\theta}F(x^{j},\theta^{t+1})-\sum_{j=1}^{N_{2}}\nabla_{\theta}F(x^{j},\theta^{t})\right),&\textrm{with probability }1-p_{t},\end{cases}

where \{x^{j}\} are sampled from \pi and N_{1},N_{2} denote the sample sizes. Some key advantages of applying PAGE are summarized as follows. First, the algorithm is single-looped, which admits a simpler implementation compared with existing double-looped variance-reduced methods. Second, the probability p_{t} can be adjusted dynamically, leading to more flexibility. Third, one can choose N_{2} to be much smaller than N_{1} while guaranteeing the same iteration complexity as vanilla SGD. Thus, the overall sample complexity can be significantly reduced. However, the application of PAGE to our setting requires significant modifications and extensions, which we shall demonstrate below. To the best of our knowledge, the application of PAGE to the general regularized reward optimization problem in the non-oblivious setting considered in this paper is new.
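
In the oblivious setting, one PAGE update can be sketched as follows; sampler(n, rng), drawing n i.i.d. points from \pi, and grad_F(x, theta), returning \nabla_{\theta}F(x,\theta), are hypothetical callables.

import numpy as np

def page_update(g_prev, theta_new, theta_old, grad_F, sampler, N1, N2, p, rng):
    # Returns g^{t+1} given g^t, theta^{t+1} and theta^t (oblivious PAGE estimator).
    if rng.random() < p:
        xs = sampler(N1, rng)                   # occasional large-batch refresh
        return np.mean([grad_F(x, theta_new) for x in xs], axis=0)
    xs = sampler(N2, rng)                       # cheap small-batch correction
    correction = np.mean([grad_F(x, theta_new) - grad_F(x, theta_old) for x in xs], axis=0)
    return g_prev + correction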

For notational simplicity, for the rest of this section, we denote

g_{w}(x,\theta,\theta^{\prime})=\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}g(x,\theta),

for \theta,\theta^{\prime}\in\mathbb{R}^{n},\;x\in\mathbb{R}^{d}, where \frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)} denotes the importance weight between \pi_{\theta} and \pi_{\theta^{\prime}}. Note also that

\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\right]=1.

The description of the proposed PAGE variance-reduced stochastic proximal gradient method is given in Algorithm 2.

1:  Input: initial point \theta^{0}, sample sizes N_{1} and N_{2}, a probability p\in(0,1], and the learning rate \eta>0.
2:  Compute
g^{0}:=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}g(x^{0,j},\theta^{0}),
where \{x^{0,j}\}_{j} are sampled independently according to \pi_{\theta^{0}}.
3:  for t=0,\dots,T-1 do
4:     Update
\theta^{t+1}=\mathrm{Prox}_{\eta\mathcal{G}}\left(\theta^{t}+\eta g^{t}\right).
5:     Compute
g^{t+1}=\left\{\begin{array}{ll}\displaystyle\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}g(x^{t+1,j},\theta^{t+1}),&\textrm{with probability }p,\\[5.0pt]\displaystyle\frac{1}{N_{2}}\sum_{j=1}^{N_{2}}g(x^{t+1,j},\theta^{t+1})-\frac{1}{N_{2}}\sum_{j=1}^{N_{2}}g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1})+g^{t},&\textrm{with probability }1-p,\end{array}\right.
where \{x^{t+1,j}\}_{j} are sampled independently according to \pi_{\theta^{t+1}}.
6:  end for
7:  Output: \hat{\theta}^{T} selected randomly from the generated sequence \{\theta^{t}\}_{t=1}^{T}.

Algorithm 2 The variance-reduced stochastic proximal gradient method with PAGE
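
The non-oblivious estimator of Algorithm 2 differs from the sketch above only in the importance-weighted correction term; a sketch is given below, where g_fn(x, theta), ratio_fn(x, theta, theta_p) (returning \pi_{\theta}(x)/\pi_{\theta^{\prime}}(x)), and sampler(theta, n, rng) are user-supplied placeholders.

import numpy as np

def page_pg_update(g_prev, theta_new, theta_old, g_fn, ratio_fn, sampler, N1, N2, p, rng):
    # Returns g^{t+1} as in Algorithm 2, with all samples drawn from pi_{theta^{t+1}}.
    if rng.random() < p:
        xs = sampler(theta_new, N1, rng)        # large-batch refresh, as in Algorithm 1
        return np.mean([g_fn(x, theta_new) for x in xs], axis=0)
    xs = sampler(theta_new, N2, rng)
    # g_w(x, theta^t, theta^{t+1}) = [pi_{theta^t}(x)/pi_{theta^{t+1}}(x)] * g(x, theta^t)
    # keeps the small-batch correction unbiased, cf. (4).
    diff = np.mean(
        [g_fn(x, theta_new) - ratio_fn(x, theta_old, theta_new) * g_fn(x, theta_old) for x in xs],
        axis=0,
    )
    return g_prev + diff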

It is clear that the only difference between Algorithm 1 and Algorithm 2 is the choice of the gradient estimator. At each iteration of the latter algorithm, we have two choices for the gradient estimator: with probability p, one chooses the same estimator as in Algorithm 1 with a sample size N_{1}, and with probability 1-p, one constructs the estimator in a clever way that combines the information of the current iterate and the previous one. Since the data set \{x^{t+1,1},\dots,x^{t+1,N_{2}}\} is sampled according to the current probability distribution \pi_{\theta^{t+1}}, we need to rely on the importance weight between \theta^{t} and \theta^{t+1} and construct the gradient estimator \frac{1}{N_{2}}\sum_{j=1}^{N_{2}}g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1}), which is an unbiased estimator of \nabla_{\theta}\mathcal{J}(\theta^{t}), so that g^{t+1} becomes an unbiased estimator of \nabla_{\theta}\mathcal{J}(\theta^{t+1}). Indeed, one can easily verify that for any \theta,\theta^{\prime}\in\mathbb{R}^{n}, it holds that

\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[g_{w}(x,\theta,\theta^{\prime})\right]=\nabla_{\theta}\mathcal{J}(\theta), (4)

i.e., g_{w}(x,\theta,\theta^{\prime}) is an unbiased estimator of \nabla_{\theta}\mathcal{J}(\theta) provided that x\sim\pi_{\theta^{\prime}}.

Next, we shall analyze the convergence properties of Algorithm 2. Our analysis relies on the following assumption on the importance weight, which essentially controls the change of the distributions.

Assumption 5.1.

For any \theta,\theta^{\prime}\in\mathbb{R}^{n}, the importance weight between \pi_{\theta} and \pi_{\theta^{\prime}} is well-defined, and there exists a constant C_{w}>0 such that

\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left(\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}-1\right)^{2}\right]\leq C_{w}^{2}.

Clearly, the magnitude of the constant C_{w} (if it exists) may depend sensitively on \theta and \theta^{\prime}. To see this, let us assume that for any \theta\in\mathbb{R}^{n}, \pi_{\theta}=\theta is a discrete distribution over a set of finitely many points \{x_{k}\}_{k=1}^{n} for which \pi_{\theta}(x_{k})=\theta_{k}>0 for all k=1,\dots,n. Now, suppose that \theta=\theta^{\prime}+\Delta\theta with |\Delta\theta_{k}|\leq 1. Then, a simple calculation shows that

\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left(\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}-1\right)^{2}\right]=\sum_{k=1}^{n}\left(\frac{\theta_{k}}{\theta_{k}^{\prime}}-1\right)^{2}\theta^{\prime}_{k}=\sum_{k=1}^{n}\frac{(\Delta\theta_{k})^{2}}{\theta^{\prime}_{k}}\leq\sum_{k=1}^{n}\frac{|\Delta\theta_{k}|}{\theta^{\prime}_{k}}.

However, it is possible that some \theta_{k}^{\prime} is tiny (or even zero in the limit). In this case, C_{w} can be huge or even infinite. Fortunately, the regularization term \mathcal{G}(\theta) can help to avoid such undesired situations by imposing the lower-bound constraints \theta_{k}\geq\delta>0 for all k (so that \theta^{\prime}_{k}\geq\delta as well). In this case, we see that \sum_{k=1}^{n}\frac{|\Delta\theta_{k}|}{\theta^{\prime}_{k}}\leq\frac{1}{\delta}\sum_{k=1}^{n}|\Delta\theta_{k}|\leq\frac{2}{\delta}.

Remark 5.2.

Note that Assumption 5.1 is also employed in many existing works Papini et al. (2018); Xu et al. (2019); Pham et al. (2020); Yuan et al. (2020); Gargiani et al. (2022). However, this assumption could be too strong, and it is not checkable in general. Addressing the relaxation of this assumption through the development of a more sophisticated algorithmic framework is beyond the scope of this paper. Here, we would like to mention some recent progress on relaxing this stringent condition for MDPs. By constructing additional stochastic estimators for the Hessian matrix of the objective function, Shen et al. (2019) proposed a Hessian-aided policy-gradient-type method that improves the sample complexity from O(\epsilon^{-4}) to O(\epsilon^{-3}) without requiring Assumption 5.1. Later, by explicitly controlling changes in the parameter \theta, Zhang et al. (2021a) developed a truncated stochastic incremental variance-reduced policy gradient method that prevents the variance of the importance weights from becoming excessively large, leading to an O(\epsilon^{-3}) sample complexity. By utilizing general Bregman divergences, Yuan et al. (2022) proposed a double-looped variance-reduced mirror policy optimization approach and established an O(\epsilon^{-3}) sample complexity without requiring Hessian information or Assumption 5.1. Following the same research theme as Shen et al. (2019), Salehkaleybar et al. (2022) also incorporated second-order information into the stochastic gradient estimator. By using momentum, the variance-reduced algorithm proposed in Salehkaleybar et al. (2022) has some appealing features, including small batch sizes and a parameter-free implementation. Recently, by imposing additional conditions, including the Lipschitz continuity of the Hessian of the score function \nabla_{\theta}\log\pi_{\theta} and the Fisher-non-degeneracy condition of the policy, Fatkhullin et al. (2023) derived improved (global) convergence guarantees for solving MDPs. We believe that the above ideas can also be explored for solving the general model (1).

The bounded variance of the importance weight implies that the (expected) distance between g(x,\theta^{\prime}) and g_{w}(x,\theta,\theta^{\prime}) is controlled by the distance between \theta and \theta^{\prime}, for any given \theta,\theta^{\prime}\in\mathbb{R}^{n}. In particular, we have the following lemma, whose proof is provided in Appendix A.5.

Lemma 5.3.

Under Assumptions 3.1, 3.2, and 5.1, it holds that

\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left\lVert g(x,\theta^{\prime})-g_{w}(x,\theta,\theta^{\prime})\right\rVert^{2}\right]\leq C\left\lVert\theta-\theta^{\prime}\right\rVert^{2},

where C>0 is a constant defined as

C:=6U^{2}C_{h}^{2}+6C_{g}^{2}\widetilde{C}_{g}^{2}+6\widetilde{C}_{h}^{2}+\left(4U^{2}C_{g}^{2}+4\widetilde{C}_{g}^{2}\right)(2C_{g}^{2}+C_{h})(C_{w}^{2}+1).

Under the considered assumptions, we are able to provide an estimate for the term \sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right], which plays an essential role in deriving an improved sample complexity for Algorithm 2. The results are summarized in the following Lemma 5.4; see Appendix A.6 for a proof, which shares the same spirit as (Li et al., 2021b, Lemmas 3 & 4).

Lemma 5.4.

Suppose that Assumptions 3.1, 3.2, and 5.1 hold. Let \{g^{t}\} and \{\theta^{t}\} be the sequences generated by Algorithm 2. Then, it holds that

\left(1-\frac{(1-p)C\eta}{pN_{2}L(1-2\eta L)}\right)\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\leq\frac{p\sigma^{2}T+\sigma^{2}}{pN_{1}}+\frac{2\eta(1-p)C\Delta}{pN_{2}(1-2\eta L)}.

We are now ready to present the main result on the convergence properties of Algorithm 2 by showing how to select the sample sizes N_{1} and N_{2}, the probability p, and the learning rate \eta. Intuitively, N_{1} is typically a large number and one does not want to perform sampling with N_{1} samples frequently; thus the probability p and the sample size N_{2} should both be small. Given N_{1}, N_{2} and p, we can then determine the value of \eta such that \eta<\frac{1}{2L}. Consequently, the key estimate in Theorem 4.3 can be applied directly. Our results are summarized in the following theorem. The reader is referred to Appendix A.7 for the proof of this result.

Theorem 5.5.

Suppose that Assumptions 3.1, 3.2, and 5.1 hold. For a given \epsilon\in(0,1), we set p:=\frac{N_{2}}{N_{1}+N_{2}} with N_{1}:=O(\epsilon^{-2}) and N_{2}:=\sqrt{N_{1}}=O(\epsilon^{-1}). Choose a learning rate \eta satisfying \eta\in\left(0,L/(2C+2L^{2})\right]. Then, running Algorithm 2 for T:=O(\epsilon^{-2}) iterations outputs a point \hat{\theta}^{T} satisfying

\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]\leq\epsilon^{2}.

Moreover, the total expected sample complexity is O(\epsilon^{-3}).
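
A back-of-the-envelope accounting (our own, consistent with the parameter choices above) explains the improvement: with p=\frac{N_{2}}{N_{1}+N_{2}}, the expected number of samples drawn at each iteration is

pN_{1}+(1-p)N_{2}=\frac{N_{1}N_{2}}{N_{1}+N_{2}}+\frac{N_{1}N_{2}}{N_{1}+N_{2}}=\frac{2N_{1}N_{2}}{N_{1}+N_{2}}\leq 2N_{2}=O(\epsilon^{-1}),

so the total expected number of samples is N_{1}+T\cdot O(\epsilon^{-1})=O(\epsilon^{-2})+O(\epsilon^{-2})\cdot O(\epsilon^{-1})=O(\epsilon^{-3}).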

By using the stochastic variance-reduced gradient estimator with PAGE and the importance sampling technique, we have improved the total sample complexity from O(\epsilon^{-4}) to O(\epsilon^{-3}) under the considered conditions. This result matches the current competitive results established in Xu et al. (2019); Yuan et al. (2020); Pham et al. (2020); Gargiani et al. (2022) for solving MDPs and is applicable to the general model (1). Finally, as mentioned in Remark 4.7, by assuming or verifying stronger conditions, such as gradient domination and its extensions, it is also possible to derive some global convergence results. Again, such a possibility is left as a future research direction.

6 Conclusions

We have studied the stochastic (variance-reduced) proximal gradient method for a general regularized expected reward optimization problem that covers many existing important problems in reinforcement learning. We have established the O(\epsilon^{-4}) sample complexity of the classical stochastic proximal gradient method and the O(\epsilon^{-3}) sample complexity of the stochastic variance-reduced proximal gradient method with an importance sampling based probabilistic gradient estimator. Our results match the sample complexity of their most competitive counterparts under similar settings for Markov decision processes.

Meanwhile, we also acknowledge some limitations of the current paper. First, due to the nonconcavity of the objective function, we found it challenging to derive global convergence properties of the stochastic proximal gradient method and its variants without imposing additional conditions. On the other hand, analyzing the sample complexity for achieving convergence to second-order stationary points (thereby avoiding saddle points) may be more realistic and feasible Arjevani et al. (2020). Second, the bounded variance condition for the importance weight turns out to be quite strong and cannot be verified in general. How to relax this condition for our general model deserves further investigation. Last but not least, since we focus on the theoretical analysis in this paper and due to the space constraint, we did not conduct any numerical simulations to examine the practical efficiency of the proposed methods. We shall delve into these challenges and gain a better understanding of the proposed problem and algorithms in future research.

Finally, this paper has demonstrated the possibility of pairing the stochastic proximal gradient method with efficient variance reduction techniques Li et al. (2021b) for solving the reward optimization problem (1). Beyond variance-reduced methods, there are other possibilities that allow one to derive more sophisticated algorithms. For instance, one can also pair the stochastic proximal gradient method with the ideas of the actor-critic method Konda & Tsitsiklis (1999), the natural policy gradient method Kakade (2001), policy mirror descent methods Tomar et al. (2020); Lan (2023), trust-region methods Schulman et al. (2015); Shani et al. (2020), and the variational policy gradient method Zhang et al. (2020a). We believe that these possible generalizations can lead to more exciting results and make further contributions to the literature.

Acknowledgments

We thank the action editor and reviewers for their valuable comments and suggestions that helped to improve the quality of the paper. The authors were partially supported by the US National Science Foundation under awards DMS-2244988, DMS-2206333, and the Office of Naval Research Award N00014-23-1-2007.

References

  • Agarwal et al. (2020) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pp.  64–66. PMLR, 2020.
  • Agarwal et al. (2021) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506, 2021.
  • Ahmed et al. (2019) Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International conference on machine learning, pp.  151–160. PMLR, 2019.
  • Arjevani et al. (2020) Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pp.  242–299. PMLR, 2020.
  • Barakat et al. (2023) Anas Barakat, Ilyas Fatkhullin, and Niao He. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space. arXiv preprint arXiv:2306.01854, 2023.
  • Baxter & Bartlett (2001) Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
  • Beck (2017) Amir Beck. First-order methods in optimization. SIAM, 2017.
  • Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
  • Bolte et al. (2007) Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  • Cen et al. (2022) Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022.
  • Chen et al. (2023) Cheng Chen, Ruitao Chen, Tianyou Li, Ruichen Ao, and Zaiwen Wen. Monte carlo policy gradient method for binary optimization. arXiv preprint arXiv:2307.00783, 2023.
  • Dong et al. (2023) Roy Dong, Heling Zhang, and Lillian Ratliff. Approximate regions of attraction in learning with decision-dependent distributions. In International Conference on Artificial Intelligence and Statistics, pp.  11172–11184. PMLR, 2023.
  • Drusvyatskiy & Xiao (2023) Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 48(2):954–998, 2023.
  • Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems, 31, 2018.
  • Fatkhullin et al. (2023) Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He. Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies. In International Conference on Machine Learning, pp.  9827–9869. PMLR, 2023.
  • Gama et al. (2014) João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
  • Gargiani et al. (2022) Matilde Gargiani, Andrea Zanelli, Andrea Martinelli, Tyler Summers, and John Lygeros. Page-pg: A simple and loopless variance-reduced policy gradient method with probabilistic gradient estimation. In International Conference on Machine Learning, pp.  7223–7240. PMLR, 2022.
  • Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Huang et al. (2021) Feihu Huang, Shangqian Gao, and Heng Huang. Bregman gradient policy optimization. arXiv preprint arXiv:2106.12112, 2021.
  • Jagadeesan et al. (2022) Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. In International Conference on Machine Learning, pp.  9760–9785. PMLR, 2022.
  • Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.
  • Kakade (2001) Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
  • Konda & Tsitsiklis (1999) Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  • Kumar et al. (2022) Navdeep Kumar, Kaixin Wang, Kfir Levy, and Shie Mannor. Policy gradient for reinforcement learning with general utilities. arXiv preprint arXiv:2210.00991, 2022.
  • Lan (2020) Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.
  • Lan (2023) Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical programming, 198(1):1059–1106, 2023.
  • Li et al. (2021a) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Softmax policy gradient methods can take exponential time to converge. In Conference on Learning Theory, pp.  3107–3110. PMLR, 2021a.
  • Li et al. (2021b) Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International conference on machine learning, pp.  6286–6295. PMLR, 2021b.
  • Liang & Yang (2022) Senwei Liang and Haizhao Yang. Finite expression method for solving high-dimensional partial differential equations. arXiv preprint arXiv:2206.10121, 2022.
  • Liu et al. (2019) Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
  • Liu et al. (2020) Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.
  • Mazyavkina et al. (2021) Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • Mei et al. (2020) Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp.  6820–6829. PMLR, 2020.
  • Mendler-Dünner et al. (2020) Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. Advances in Neural Information Processing Systems, 33:4929–4939, 2020.
  • Milli et al. (2019) Smitha Milli, John Miller, Anca D Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp.  230–239, 2019.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nguyen et al. (2017) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International conference on machine learning, pp.  2613–2621. PMLR, 2017.
  • Papini et al. (2018) Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In International conference on machine learning, pp.  4026–4035. PMLR, 2018.
  • Pednault et al. (2002) Edwin Pednault, Naoki Abe, and Bianca Zadrozny. Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  259–268, 2002.
  • Perdomo et al. (2020) Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pp.  7599–7609. PMLR, 2020.
  • Pham et al. (2020) Nhan Pham, Lam Nguyen, Dzung Phan, Phuong Ha Nguyen, Marten Dijk, and Quoc Tran-Dinh. A hybrid stochastic policy gradient algorithm for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  374–385. PMLR, 2020.
  • Pirotta et al. (2015) Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100:255–283, 2015.
  • Polyak (1963) Boris T Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
  • Robbins (1952) Herbert Robbins. Some aspects of the sequential design of experiments. 1952.
  • Rockafellar (1997) R Tyrrell Rockafellar. Convex analysis, volume 11. Princeton university press, 1997.
  • Salehkaleybar et al. (2022) Saber Salehkaleybar, Sadegh Khorasani, Negar Kiyavash, Niao He, and Patrick Thiran. Momentum-based policy gradient with second-order information. arXiv preprint arXiv:2205.08253, 2022.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp.  1889–1897. PMLR, 2015.
  • Shani et al. (2020) Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  5668–5675, 2020.
  • Shapiro et al. (2021) Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic programming: modeling and theory. SIAM, 2021.
  • Shen et al. (2019) Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, and Chao Mi. Hessian aided policy gradient. In International conference on machine learning, pp.  5729–5738. PMLR, 2019.
  • Song et al. (2023) Zezheng Song, Maria K Cameron, and Haizhao Yang. A finite expression method for solving high-dimensional committor problems. arXiv preprint arXiv:2306.12268, 2023.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  • Tomar et al. (2020) Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
  • Tsirtsis et al. (2024) Stratis Tsirtsis, Behzad Tabibian, Moein Khajehnejad, Adish Singla, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. Optimal decision making under strategic behavior. Management Science, 2024.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • Xiao (2022) Lin Xiao. On the convergence rates of policy gradient methods. The Journal of Machine Learning Research, 23(1):12887–12922, 2022.
  • Xiong et al. (2021) Huaqing Xiong, Tengyu Xu, Yingbin Liang, and Wei Zhang. Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  10460–10468, 2021.
  • Xu et al. (2019) Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019.
  • Xu et al. (2020) Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pp.  541–551. PMLR, 2020.
  • Yang et al. (2022) Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, and Gang Pan. Policy optimization with stochastic mirror descent. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  8823–8831, 2022.
  • Yao et al. (2021) Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–46, 2021.
  • Yuan et al. (2020) Huizhuo Yuan, Xiangru Lian, Ji Liu, and Yuren Zhou. Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.
  • Yuan et al. (2022) Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pp.  3332–3380. PMLR, 2022.
  • Zhan et al. (2023) Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091, 2023.
  • Zhang et al. (2020a) Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583, 2020a.
  • Zhang et al. (2021a) Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi Wang, et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021a.
  • Zhang et al. (2021b) Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  10887–10895, 2021b.
  • Zhang et al. (2020b) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020b.
  • Zhang (2004) Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp.  116, 2004.
  • Zhang & Dietterich (1995) Wei Zhang and Thomas G Dietterich. A reinforcement learning approach to job-shop scheduling. In IJCAI, volume 95, pp.  1114–1120. Citeseer, 1995.

Appendix A Proofs

A.1 Proof of Lemma 4.1

Proof of Lemma 4.1.

One could establish the LL-smoothness of 𝒥()\mathcal{J}(\cdot) via bounding the spectral norm of the Hessian θ2𝒥()\nabla_{\theta}^{2}\mathcal{J}(\cdot). To this end, we first calculate the Hessian of 𝒥\mathcal{J} as follows:

θ2𝒥(θ)=\displaystyle\nabla_{\theta}^{2}\mathcal{J}(\theta)= θ𝔼xπθ[θ(x)θlogπθ(x)+θθ(x)]\displaystyle\;\nabla_{\theta}\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\right]
=\displaystyle= θ(θ(x)θlogπθ(x)πθ(x)+θθ(x)πθ(x))dx\displaystyle\;\nabla_{\theta}\int\left(\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\pi_{\theta}(x)\right)\mathrm{d}x
=\displaystyle= θ(x)πθ(x)(θ2logπθ(x)+θlogπθ(x)θlogπθ(x))dx\displaystyle\;\int\mathcal{R}_{\theta}(x)\pi_{\theta}(x)\left(\nabla_{\theta}^{2}\log\pi_{\theta}(x)+\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}\right)\mathrm{d}x
+θ2θ(x)πθ(x)+2θθ(x)θπθ(x)dx\displaystyle\;+\int\nabla^{2}_{\theta}\mathcal{R}_{\theta}(x)\pi_{\theta}(x)+2\nabla_{\theta}\mathcal{R}_{\theta}(x)\nabla_{\theta}\pi_{\theta}(x)^{\top}\mathrm{d}x
=\displaystyle= 𝔼xπθ[θ(x)θ2logπθ(x)]+𝔼xπθ[θ(x)θlogπθ(x)θlogπθ(x)]\displaystyle\;\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\nabla_{\theta}^{2}\log\pi_{\theta}(x)\right]+\mathbb{E}_{x\sim\pi_{\theta}}\left[\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}\right]
+𝔼xπθ[θ2θ(x)]+2𝔼xπθ[θθ(x)θlogπθ(x)].\displaystyle\;+\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla^{2}_{\theta}\mathcal{R}_{\theta}(x)\right]+2\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}\right].

Then, by the triangle inequality, it holds that

θ2𝒥(θ)2\displaystyle\left\lVert\nabla_{\theta}^{2}\mathcal{J}(\theta)\right\rVert_{2}\leq supxd,θnθ(x)θ2logπθ(x)2+supxd,θnθ(x)θlogπθ(x)θlogπθ(x)2\displaystyle\;\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}^{2}\log\pi_{\theta}(x)\right\rVert_{2}+\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}\right\rVert_{2}
+supxd,θnθ2θ(x)2+2supxd,θnθθ(x)θlogπθ(x)2\displaystyle\;+\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\nabla_{\theta}^{2}\mathcal{R}_{\theta}(x)\right\rVert_{2}+2\sup_{x\in\mathbb{R}^{d},\;\theta\in\mathbb{R}^{n}}\;\left\lVert\nabla_{\theta}\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}\right\rVert_{2}
\displaystyle\leq U(Cg2+Ch)+C~h+2CgC~g.\displaystyle\;U(C_{g}^{2}+C_{h})+\widetilde{C}_{h}+2C_{g}\widetilde{C}_{g}.

Thus, 𝒥\mathcal{J} is LL-smooth with L:=U(Cg2+Ch)+C~h+2CgC~gL:=U(C_{g}^{2}+C_{h})+\widetilde{C}_{h}+2C_{g}\widetilde{C}_{g}, and the proof is completed. ∎
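As a side note, the gradient expression whose derivative is taken in the first line of the above computation, i.e., the expectation of the score-function term plus the reward-gradient term, can be checked numerically. The following Python sketch compares this Monte Carlo estimator with a finite-difference approximation of the expected reward for a toy one-dimensional Gaussian policy and reward; the specific policy and reward below are illustrative assumptions and are not part of the setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy choices: pi_theta = N(theta, 1) and reward R_theta(x) = sin(x) + 0.1 * theta^2,
# so that grad_theta R_theta(x) = 0.2 * theta.
def grad_log_pi(x, theta):
    return x - theta                      # d/dtheta log N(x; theta, 1)

def reward(x, theta):
    return np.sin(x) + 0.1 * theta ** 2

def grad_reward(theta):
    return 0.2 * theta

def mc_grad(theta, n=500_000):
    # Monte Carlo estimate of E[R_theta(x) * grad log pi_theta(x) + grad_theta R_theta(x)]
    x = theta + rng.standard_normal(n)
    return np.mean(reward(x, theta) * grad_log_pi(x, theta) + grad_reward(theta))

def fd_grad(theta, n=500_000, eps=1e-3):
    # central finite difference of J(theta) = E_{x ~ pi_theta}[R_theta(x)],
    # using common random numbers for variance reduction
    z = rng.standard_normal(n)
    J = lambda t: np.mean(reward(t + z, t))
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

theta0 = 0.7
print(mc_grad(theta0), fd_grad(theta0))   # both approximate cos(0.7) * exp(-0.5) + 0.14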

A.2 Proof of Theorem 4.3

Proof of Theorem 4.3.

From Lemma 4.1, we see that

𝒥(θt+1)𝒥(θt)+θ𝒥(θt),θt+1θtL2θt+1θt2.\mathcal{J}(\theta^{t+1})\geq\mathcal{J}(\theta^{t})+\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle-\frac{L}{2}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}. (5)

By the updating rule of θt+1\theta^{t+1}, we see that

gt,θt+1θt+12ηθt+1θt2+𝒢(θt+1)\displaystyle-\left\langle g^{t},\theta^{t+1}-\theta^{t}\right\rangle+\frac{1}{2\eta}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}+\mathcal{G}(\theta^{t+1})\leq 𝒢(θt),\displaystyle\;\mathcal{G}(\theta^{t}), (6)
gt1η(θt+1θt)\displaystyle g^{t}-\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\in 𝒢(θt+1).\displaystyle\;\partial\mathcal{G}(\theta^{t+1}). (7)

Combining (5) and (6), we see that

𝒥(θt+1)+gt,θt+1θt12ηθt+1θt2𝒢(θt+1)\displaystyle\;\mathcal{J}(\theta^{t+1})+\left\langle g^{t},\theta^{t+1}-\theta^{t}\right\rangle-\frac{1}{2\eta}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}-\mathcal{G}(\theta^{t+1})
\displaystyle\geq 𝒥(θt)+θ𝒥(θt),θt+1θtL2θt+1θt2𝒢(θt).\displaystyle\;\mathcal{J}(\theta^{t})+\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle-\frac{L}{2}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}-\mathcal{G}(\theta^{t}).

Rearranging terms, we can rewrite the above inequality as

1ηL2ηθt+1θt2(θt+1)(θt)+gtθ𝒥(θt),θt+1θt.\frac{1-\eta L}{2\eta}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\leq\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})+\left\langle g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle. (8)

By the Cauchy-Schwarz inequality, we see that

gtθ𝒥(θt),θt+1θt12Lgtθ𝒥(θt)2+L2θt+1θt2,\left\langle g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle\leq\frac{1}{2L}\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}+\frac{L}{2}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2},

which together with (8) implies that

12ηL2ηθt+1θt2(θt+1)(θt)+12Lgtθ𝒥(θt)2.\frac{1-2\eta L}{2\eta}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\leq\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})+\frac{1}{2L}\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}.

Summing the above inequality across t=0,,T1t=0,\dots,T-1, we get

12ηL2ηt=0T1θt+1θt2\displaystyle\frac{1-2\eta L}{2\eta}\sum_{t=0}^{T-1}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\leq (θT)(θ0)+12Lt=0T1gtθ𝒥(θt)2\displaystyle\;\mathcal{F}(\theta^{T})-\mathcal{F}(\theta^{0})+\frac{1}{2L}\sum_{t=0}^{T-1}\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}
\displaystyle\leq Δ+12Lt=0T1gtθ𝒥(θt)2.\displaystyle\;\Delta+\frac{1}{2L}\sum_{t=0}^{T-1}\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}. (9)

Here, we recall that Δ:=(θ0)>0\Delta:=\mathcal{F}^{*}-\mathcal{F}(\theta^{0})>0.

On the other hand, (8) also implies that

 2θ𝒥(θt+1)gt,1η(θt+1θt)+1ηLη2θt+1θt2\displaystyle\;2\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t},\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rangle+\frac{1-\eta L}{\eta^{2}}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}
\displaystyle\leq 2η((θt+1)(θt))+2ηθ𝒥(θt+1)θ𝒥(θt),θt+1θt.\displaystyle\;\frac{2}{\eta}\left(\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})\right)+\frac{2}{\eta}\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle. (10)

Notice that

 2θ𝒥(θt+1)gt,1η(θt+1θt)\displaystyle\;2\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t},\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rangle
=\displaystyle= θ𝒥(θt+1)gt+1η(θt+1θt)2θ𝒥(θt+1)gt21η2θt+1θt2.\displaystyle\;\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}+\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rVert^{2}-\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}\right\rVert^{2}-\frac{1}{\eta^{2}}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}.

Then by substituting the above equality into (10) and rearranging terms, we see that

θ𝒥(θt+1)gt+1η(θt+1θt)2\displaystyle\;\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}+\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rVert^{2}
\displaystyle\leq θ𝒥(θt+1)gt2+1η2θt+1θt21ηLη2θt+1θt2\displaystyle\;\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}\right\rVert^{2}+\frac{1}{\eta^{2}}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}-\frac{1-\eta L}{\eta^{2}}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}
+2η((θt+1)(θt))+2ηθ𝒥(θt+1)θ𝒥(θt),θt+1θt\displaystyle\;+\frac{2}{\eta}\left(\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})\right)+\frac{2}{\eta}\left\langle\nabla_{\theta}\mathcal{J}(\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t}),\theta^{t+1}-\theta^{t}\right\rangle
\displaystyle\leq  2θ𝒥(θt)gt2+2θ𝒥(θt+1)θ𝒥(θt)2+Lηθt+1θt2\displaystyle\;2\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}+2\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}+\frac{L}{\eta}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}
+2η((θt+1)(θt))+2ηθ𝒥(θt+1)θ𝒥(θt)θt+1θt\displaystyle\;+\frac{2}{\eta}\left(\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})\right)+\frac{2}{\eta}\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert\left\lVert\theta^{t+1}-\theta^{t}\right\rVert
\displaystyle\leq  2θ𝒥(θt)gt2+(2L2+3Lη)θt+1θt2+2η((θt+1)(θt)),\displaystyle\;2\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}+\left(2L^{2}+\frac{3L}{\eta}\right)\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}+\frac{2}{\eta}\left(\mathcal{F}(\theta^{t+1})-\mathcal{F}(\theta^{t})\right),

where the second inequality is due to the Cauchy-Schwarz inequality and the fact that

θ𝒥(θt+1)gt22θ𝒥(θt)gt2+2θ𝒥(θt+1)θ𝒥(θt)2,\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}\right\rVert^{2}\leq 2\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}+2\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2},

and the third inequality is implied by Lemma 4.1.

Summing the above inequality across t=0,1,,T1t=0,1,\dots,T-1, we get

t=0T1θ𝒥(θt+1)gt+1η(θt+1θt)2\displaystyle\;\sum_{t=0}^{T-1}\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}+\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rVert^{2}
\displaystyle\leq  2t=0T1θ𝒥(θt)gt2+(2L2+3Lη)t=0T1θt+1θt2+2η((θT)(θ0))\displaystyle\;2\sum_{t=0}^{T-1}\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}+\left(2L^{2}+\frac{3L}{\eta}\right)\sum_{t=0}^{T-1}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}+\frac{2}{\eta}\left(\mathcal{F}(\theta^{T})-\mathcal{F}(\theta^{0})\right)
\displaystyle\leq  2t=0T1θ𝒥(θt)gt2+2η2t=0T1θt+1θt2+2Δη,\displaystyle\;2\sum_{t=0}^{T-1}\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}+\frac{2}{\eta^{2}}\sum_{t=0}^{T-1}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}+\frac{2\Delta}{\eta}, (11)

where the last inequality is obtained from the fact that L<12ηL<\frac{1}{2\eta}, which is a consequence of the choice of the learning rate, together with the bound (θT)(θ0)Δ\mathcal{F}(\theta^{T})-\mathcal{F}(\theta^{0})\leq\Delta.

Consequently, we have that

𝔼T[dist(0,θ𝒥(θ^T)+𝒢(θ^T))2]\displaystyle\;\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]
=\displaystyle= 1Tt=0T1𝔼T[dist(0,θ𝒥(θt+1)+𝒢(θt+1))2]\displaystyle\;\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\theta^{t+1})+\partial\mathcal{G}(\theta^{t+1})\right)^{2}\right]
\displaystyle\leq 1Tt=0T1𝔼T[θ𝒥(θt+1)gt+1η(θt+1θt)2]\displaystyle\;\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t+1})-g^{t}+\frac{1}{\eta}\left(\theta^{t+1}-\theta^{t}\right)\right\rVert^{2}\right]
\displaystyle\leq 2Tt=0T1𝔼T[θ𝒥(θt)gt2]+2η2Tt=0T1𝔼T[θt+1θt2]+2ΔηT\displaystyle\;\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}\right]+\frac{2}{\eta^{2}T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\right]+\frac{2\Delta}{\eta T}
\displaystyle\leq 4ηT(12ηL)(Δ+12Lt=0T1𝔼T[gtθ𝒥(θt)2])+2Tt=0T1𝔼T[θ𝒥(θt)gt2]+2ΔηT\displaystyle\;\frac{4}{\eta T(1-2\eta L)}\left(\Delta+\frac{1}{2L}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\right)+\frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}\right]+\frac{2\Delta}{\eta T}
=\displaystyle= (2+2ηL(12ηL))1Tt=0T1𝔼T[θ𝒥(θt)gt2]+ΔT(2η+4η(12ηL)),\displaystyle\;\left(2+\frac{2}{\eta L(1-2\eta L)}\right)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\nabla_{\theta}\mathcal{J}(\theta^{t})-g^{t}\right\rVert^{2}\right]+\frac{\Delta}{T}\left(\frac{2}{\eta}+\frac{4}{\eta(1-2\eta L)}\right),

where the first inequality follows from (7), the second inequality is due to (11), and the third inequality is derived from (9). Thus, the proof is completed. ∎
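As noted in (6)–(7), the update of θt+1\theta^{t+1} is exactly a proximal step: the point θt+ηgt\theta^{t}+\eta g^{t} is mapped through the proximal operator of η𝒢\eta\mathcal{G}. For concreteness, the following Python sketch implements this iteration under the additional, purely illustrative assumption that 𝒢\mathcal{G} is a scaled ℓ1 norm, so that the proximal operator reduces to soft-thresholding; grad_estimator stands for any mini-batch estimator gtg^{t} of the gradient of 𝒥\mathcal{J}.

```python
import numpy as np

def prox_l1(v, tau):
    # proximal operator of tau * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def stochastic_prox_grad(grad_estimator, theta0, eta, lam, T):
    """Iterate theta^{t+1} = prox_{eta * G}(theta^t + eta * g^t) for the maximization
    problem, with the assumed choice G(theta) = lam * ||theta||_1."""
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        g = grad_estimator(theta)                    # mini-batch estimate g^t of grad J(theta^t)
        theta = prox_l1(theta + eta * g, eta * lam)  # gradient ascent step followed by proximal step
    return theta

# Hypothetical usage with a noisy gradient of the concave toy objective J(theta) = -0.5 * ||theta - 1||^2:
rng = np.random.default_rng(0)
noisy_grad = lambda th: (1.0 - th) + 0.01 * rng.standard_normal(th.shape)
print(stochastic_prox_grad(noisy_grad, np.zeros(5), eta=0.1, lam=0.05, T=500))  # approx 0.95 per coordinate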

A.3 Proof of Lemma 4.4

Proof of Lemma 4.4.

We first estimate 𝔼xπθ[θ(x)θlogπθ(x)+θθ(x)2]\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\right] as follows

𝔼xπθ[θ(x)θlogπθ(x)+θθ(x)2]\displaystyle\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\right]\leq  2𝔼xπθ[θ(x)θlogπθ(x)2]+2𝔼xπθ[θθ(x)2]\displaystyle\;2\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)\right\rVert^{2}\right]+2\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\right]
\displaystyle\leq  2U2Cg2+2C~g2.\displaystyle\;2U^{2}C_{g}^{2}+2\widetilde{C}_{g}^{2}.

Then, by the fact that 𝔼[(X𝔼[X])2]𝔼[X2]\mathbb{E}\left[\left(X-\mathbb{E}[X]\right)^{2}\right]\leq\mathbb{E}\left[X^{2}\right] for any random variable XX, we have

𝔼xπθ[θ(x)θlogπθ(x)+θθ(x)θ𝒥(θ)2]\displaystyle\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)-\nabla_{\theta}\mathcal{J}(\theta)\right\rVert^{2}\right]\leq 𝔼xπθ[θ(x)θlogπθ(x)+θθ(x)2]\displaystyle\;\mathbb{E}_{x\sim\pi_{\theta}}\left[\left\lVert\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\right]
\displaystyle\leq  2U2Cg2+2C~g2,\displaystyle\;2U^{2}C_{g}^{2}+2\widetilde{C}_{g}^{2},

which completes the proof. ∎
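The per-sample variance bound in Lemma 4.4 can also be illustrated empirically. The Python sketch below uses an assumed toy setting, namely a Bernoulli policy with success probability sigmoid(θ) and the θ-independent reward R(x)=x (so that U=1U=1, Cg=1C_{g}=1 and C~g=0\widetilde{C}_{g}=0), and compares the empirical variance of the single-sample estimator with the bound 2U2Cg2+2C~g22U^{2}C_{g}^{2}+2\widetilde{C}_{g}^{2}; these choices are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = 0.3
q = sigmoid(theta)
x = rng.binomial(1, q, size=1_000_000).astype(float)   # x ~ pi_theta

score = x - q        # d/dtheta log pi_theta(x) for the Bernoulli policy (bounded by Cg = 1)
g = x * score        # single-sample estimator R(x) * grad log pi_theta(x); grad_theta R = 0 here

sigma2_bound = 2 * 1.0 ** 2 * 1.0 ** 2 + 2 * 0.0 ** 2   # 2 U^2 Cg^2 + 2 Ctilde_g^2 = 2
print(np.var(g), "<=", sigma2_bound)                    # empirical per-sample variance vs. the bound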

A.4 Proof of Theorem 4.5

Proof of Theorem 4.5.

From Theorem 4.3, in order to ensure that θ^T\hat{\theta}^{T} is an ϵ\epsilon-stationary point, we can require

(2+2ηL(12ηL))𝔼T[gtθ𝒥(θt)2]\displaystyle\left(2+\frac{2}{\eta L(1-2\eta L)}\right)\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\leq 12ϵ2,t=0,,T1,\displaystyle\;\frac{1}{2}\epsilon^{2},\quad\forall\;t=0,\dots,T-1, (12)
ΔT(2η+4η(12ηL))\displaystyle\frac{\Delta}{T}\left(\frac{2}{\eta}+\frac{4}{\eta(1-2\eta L)}\right)\leq 12ϵ2.\displaystyle\;\frac{1}{2}\epsilon^{2}. (13)

It is easy to verify that gtg^{t} is an unbiased estimator of θ𝒥(θt)\nabla_{\theta}\mathcal{J}(\theta^{t}). Then, Lemma 4.4 implies that

𝔼T[gtθ𝒥(θt)2]σ2N.\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\leq\frac{\sigma^{2}}{N}.

As a consequence, if one chooses N=σ2ϵ2(4+4ηL(12ηL))N=\left\lceil\frac{\sigma^{2}}{\epsilon^{2}}\left(4+\frac{4}{\eta L(1-2\eta L)}\right)\right\rceil, then (12) holds.

On the other hand, (13) holds if one sets T=Δϵ2(4η+8η(12ηL))T=\left\lceil\frac{\Delta}{\epsilon^{2}}\left(\frac{4}{\eta}+\frac{8}{\eta(1-2\eta L)}\right)\right\rceil. Moreover, we see that the sample complexity can be computed as TN=O(ϵ4)TN=O(\epsilon^{-4}). Therefore, the proof is completed. ∎
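The prescriptions for NN and TT in the proof translate directly into a small helper routine. The Python sketch below simply evaluates the two ceilings above; the numerical values of σ2\sigma^{2}, Δ\Delta, LL and η\eta used in the example call are hypothetical and serve only to display the O(ϵ4)O(\epsilon^{-4}) scaling.

```python
import math

def spg_parameters(eps, sigma2, Delta, L, eta):
    """Batch size N and iteration count T from the proof of Theorem 4.5
    (requires eta < 1/(2L) so that 1 - 2*eta*L > 0)."""
    assert 0.0 < eta < 1.0 / (2.0 * L)
    c = 1.0 - 2.0 * eta * L
    N = math.ceil(sigma2 / eps ** 2 * (4.0 + 4.0 / (eta * L * c)))
    T = math.ceil(Delta / eps ** 2 * (4.0 / eta + 8.0 / (eta * c)))
    return N, T, N * T      # total number of samples T * N = O(eps^{-4})

# illustrative (hypothetical) constants, only to show the scaling in eps
for eps in (0.1, 0.05, 0.025):
    print(eps, spg_parameters(eps, sigma2=1.0, Delta=1.0, L=1.0, eta=0.25))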

A.5 Proof of Lemma 5.3

Proof of Lemma 5.3.

First, recall that

𝔼xπθ[πθ(x)πθ(x)]=1.\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\right]=1.

Then, by the definitions of gg and gwg_{w}, we can verify that

𝔼xπθ[g(x,θ)gw(x,θ,θ)2]\displaystyle\;\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left\lVert g(x,\theta^{\prime})-g_{w}(x,\theta,\theta^{\prime})\right\rVert^{2}\right]
\displaystyle\leq  2𝔼xπθ[g(x,θ)g(x,θ)2]+2𝔼xπθ[g(x,θ)gw(x,θ,θ)2]\displaystyle\;2\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left\lVert g(x,\theta^{\prime})-g(x,\theta)\right\rVert^{2}\right]+2\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left\lVert g(x,\theta)-g_{w}(x,\theta,\theta^{\prime})\right\rVert^{2}\right]
=\displaystyle=  2θ(x)θlogπθ(x)θ(x)θlogπθ(x)+θθ(x)θθ(x)2πθ(x)dx\displaystyle\;2\int\left\lVert\mathcal{R}_{\theta^{\prime}}(x)\nabla_{\theta}\log\pi_{\theta^{\prime}}(x)-\mathcal{R}_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)+\nabla_{\theta}\mathcal{R}_{\theta^{\prime}}(x)-\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x
+2θ(x)(θlogπθ(x)πθ(x)πθ(x)θlogπθ(x))+θθ(x)πθ(x)πθ(x)θθ(x)2πθ(x)dx\displaystyle\;+2\int\left\lVert\mathcal{R}_{\theta}(x)\left(\nabla_{\theta}\log\pi_{\theta}(x)-\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\nabla_{\theta}\log\pi_{\theta}(x)\right)+\nabla_{\theta}\mathcal{R}_{\theta}(x)-\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x
\displaystyle\leq  6θ(x)(θlogπθ(x)θlogπθ(x))2πθ(x)dx+6(θ(x)θ(x))θlogπθ(x)2πθ(x)dx\displaystyle\;6\int\left\lVert\mathcal{R}_{\theta^{\prime}}(x)\left(\nabla_{\theta}\log\pi_{\theta^{\prime}}(x)-\nabla_{\theta}\log\pi_{\theta}(x)\right)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x+6\int\left\lVert\left(\mathcal{R}_{\theta^{\prime}}(x)-\mathcal{R}_{\theta}(x)\right)\nabla_{\theta}\log\pi_{\theta}(x)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x
+6θθ(x)θθ(x)2πθ(x)dx+4θ(x)(1πθ(x)πθ(x))θlogπθ(x)2πθ(x)dx\displaystyle\;+6\int\left\lVert\nabla_{\theta}\mathcal{R}_{\theta^{\prime}}(x)-\nabla_{\theta}\mathcal{R}_{\theta}(x)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x+4\int\left\lVert\mathcal{R}_{\theta}(x)\left(1-\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\right)\nabla_{\theta}\log\pi_{\theta}(x)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x
+4θθ(x)(1πθ(x)πθ(x))2πθ(x)dx\displaystyle\;+4\int\left\lVert\nabla_{\theta}\mathcal{R}_{\theta}(x)\left(1-\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\right)\right\rVert^{2}\pi_{\theta^{\prime}}(x)\mathrm{d}x
\displaystyle\leq (6U2Ch2+6Cg2C~g2+6C~h2)θθ2+(4U2Cg2+4C~g2)𝔼xπθ[(πθ(x)πθ(x)1)2]\displaystyle\;\left(6U^{2}C_{h}^{2}+6C_{g}^{2}\widetilde{C}_{g}^{2}+6\widetilde{C}_{h}^{2}\right)\left\lVert\theta-\theta^{\prime}\right\rVert^{2}+\left(4U^{2}C_{g}^{2}+4\widetilde{C}_{g}^{2}\right)\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left(\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}-1\right)^{2}\right]
=\displaystyle= (6U2Ch2+6Cg2C~g2+6C~h2)θθ2+(4U2Cg2+4C~g2)((πθ(x))2πθ(x)dx1).\displaystyle\;\left(6U^{2}C_{h}^{2}+6C_{g}^{2}\widetilde{C}_{g}^{2}+6\widetilde{C}_{h}^{2}\right)\left\lVert\theta-\theta^{\prime}\right\rVert^{2}+\left(4U^{2}C_{g}^{2}+4\widetilde{C}_{g}^{2}\right)\left(\int\frac{(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\mathrm{d}x-1\right).

We next consider the function f(θ):=(πθ(x))2πθ(x)dxf(\theta):=\int\frac{(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\mathrm{d}x. Taking the derivative of ff with respect to θ\theta, we get

θf(θ)=2πθ(x)θπθ(x)πθ(x)dx.\nabla_{\theta}f(\theta)=\int\frac{2\pi_{\theta}(x)\nabla_{\theta}\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}\mathrm{d}x.

Moreover, since

θ2logπθ(x)=\displaystyle\nabla_{\theta}^{2}\log\pi_{\theta}(x)= 1(πθ(x))2(πθ(x)θ2πθ(x)θπθ(x)θπθ(x))\displaystyle\;\frac{1}{(\pi_{\theta}(x))^{2}}\left(\pi_{\theta}(x)\nabla_{\theta}^{2}\pi_{\theta}(x)-\nabla_{\theta}\pi_{\theta}(x)\nabla_{\theta}\pi_{\theta}(x)^{\top}\right)
=\displaystyle= 1πθ(x)θ2πθ(x)θlogπθ(x)θlogπθ(x),\displaystyle\;\frac{1}{\pi_{\theta}(x)}\nabla_{\theta}^{2}\pi_{\theta}(x)-\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top},

we see that the Hessian of ff with respect to θ\theta can be computed as

θ2f(θ)=\displaystyle\nabla_{\theta}^{2}f(\theta)= 2πθ(x)(θπθ(x)θπθ(x)+πθ(x)θ2πθ(x))dx\displaystyle\;\int\frac{2}{\pi_{\theta^{\prime}}(x)}\left(\nabla_{\theta}\pi_{\theta}(x)\nabla_{\theta}\pi_{\theta}(x)^{\top}+\pi_{\theta}(x)\nabla_{\theta}^{2}\pi_{\theta}(x)\right)\mathrm{d}x
=\displaystyle= 2(πθ(x))2πθ(x)(2θlogπθ(x)θlogπθ(x)+θ2logπθ(x))dx.\displaystyle\;\int\frac{2(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\left(2\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}+\nabla_{\theta}^{2}\log\pi_{\theta}(x)\right)\mathrm{d}x.

Notice that f(θ)=1f(\theta^{\prime})=1 and θf(θ)=0\nabla_{\theta}f(\theta^{\prime})=0. Therefore, by Taylor's theorem with the Lagrange form of the remainder, we get

f(θ)=1+12θ2f(θ~)(θθ),θθ,\displaystyle f(\theta)=1+\frac{1}{2}\left\langle\nabla_{\theta}^{2}f(\tilde{\theta})(\theta-\theta^{\prime}),\theta-\theta^{\prime}\right\rangle,

where θ~\tilde{\theta} is a point between θ\theta and θ\theta^{\prime}. Now, from the expression of the Hessian matrix, we see that for any θn\theta\in\mathbb{R}^{n},

θ2f(θ)2\displaystyle\left\lVert\nabla_{\theta}^{2}f(\theta)\right\rVert_{2}\leq 2(πθ(x))2πθ(x)2θlogπθ(x)θlogπθ(x)+θ2logπθ(x)2dx\displaystyle\;\int\frac{2(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\left\lVert 2\nabla_{\theta}\log\pi_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)^{\top}+\nabla_{\theta}^{2}\log\pi_{\theta}(x)\right\rVert_{2}\mathrm{d}x
\displaystyle\leq  2(2Cg2+Ch)(πθ(x))2πθ(x)dx\displaystyle\;2(2C_{g}^{2}+C_{h})\int\frac{(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\mathrm{d}x
=\displaystyle=  2(2Cg2+Ch)(1+𝔼xπθ[(πθ(x)πθ(x)1)2])\displaystyle\;2(2C_{g}^{2}+C_{h})\left(1+\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left(\frac{\pi_{\theta}(x)}{\pi_{\theta^{\prime}}(x)}-1\right)^{2}\right]\right)
\displaystyle\leq  2(2Cg2+Ch)(Cw2+1).\displaystyle\;2(2C_{g}^{2}+C_{h})(C_{w}^{2}+1).

As a consequence, we have

𝔼xπθ[g(x,θ)gw(x,θ,θ)2]\displaystyle\;\mathbb{E}_{x\sim\pi_{\theta^{\prime}}}\left[\left\lVert g(x,\theta^{\prime})-g_{w}(x,\theta,\theta^{\prime})\right\rVert^{2}\right]
\displaystyle\leq (6U2Ch2+6Cg2C~g2+6C~h2)θθ2+(4U2Cg2+4C~g2)((πθ(x))2πθ(x)dx1)\displaystyle\;\left(6U^{2}C_{h}^{2}+6C_{g}^{2}\widetilde{C}_{g}^{2}+6\widetilde{C}_{h}^{2}\right)\left\lVert\theta-\theta^{\prime}\right\rVert^{2}+\left(4U^{2}C_{g}^{2}+4\widetilde{C}_{g}^{2}\right)\left(\int\frac{(\pi_{\theta}(x))^{2}}{\pi_{\theta^{\prime}}(x)}\mathrm{d}x-1\right)
\displaystyle\leq (6U2Ch2+6Cg2C~g2+6C~h2+(4U2Cg2+4C~g2)(2Cg2+Ch)(Cw2+1))θθ2,\displaystyle\;\left(6U^{2}C_{h}^{2}+6C_{g}^{2}\widetilde{C}_{g}^{2}+6\widetilde{C}_{h}^{2}+\left(4U^{2}C_{g}^{2}+4\widetilde{C}_{g}^{2}\right)(2C_{g}^{2}+C_{h})(C_{w}^{2}+1)\right)\left\lVert\theta-\theta^{\prime}\right\rVert^{2},

which completes the proof. ∎
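For intuition, the importance-weighted estimator gw(x,θ,θ)g_{w}(x,\theta,\theta^{\prime}) appearing above, which rescales g(x,θ)g(x,\theta) by the likelihood ratio πθ(x)/πθ(x)\pi_{\theta}(x)/\pi_{\theta^{\prime}}(x), remains an unbiased estimator of θ𝒥(θ)\nabla_{\theta}\mathcal{J}(\theta) when xx is drawn from πθ\pi_{\theta^{\prime}}. The Python sketch below evaluates it for an assumed one-dimensional Gaussian policy and a θ\theta-independent reward; these toy choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: pi_theta = N(theta, 1), reward R(x) = sin(x) (independent of theta).
def log_pi(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

def g(x, theta):
    return np.sin(x) * (x - theta)          # R(x) * grad log pi_theta(x); grad_theta R = 0 here

def g_w(x, theta, theta_prime):
    w = np.exp(log_pi(x, theta) - log_pi(x, theta_prime))   # importance weight pi_theta / pi_theta'
    return w * g(x, theta)

theta, theta_prime = 0.4, 0.5
x = theta_prime + rng.standard_normal(500_000)              # samples x ~ pi_{theta'}
print(np.mean(g_w(x, theta, theta_prime)))                  # approx grad J(theta) = cos(0.4) * exp(-0.5)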

A.6 Proof of Lemma 5.4

Proof of Lemma 5.4.

By the definition of the stochastic gradient estimator given in Algorithm 2, we can see that for t0t\geq 0,

𝔼t+1[gt+1θ𝒥(θt+1)2]\displaystyle\;\mathbb{E}_{t+1}\left[\left\lVert g^{t+1}-\nabla_{\theta}\mathcal{J}(\theta^{t+1})\right\rVert^{2}\right]
=\displaystyle= p𝔼t+1[1N1j=1N1g(xt+1,j,θt+1)θ𝒥(θt+1)2]\displaystyle\;p\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}g(x^{t+1,j},\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t+1})\right\rVert^{2}\right]
+(1p)𝔼t+1[1N2j=1N2(g(xt+1,j,θt+1)gw(xt+1,j,θt,θt+1))+gtθ𝒥(θt+1)2]\displaystyle\;+(1-p)\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{2}}\sum_{j=1}^{N_{2}}\left(g(x^{t+1,j},\theta^{t+1})-g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1})\right)+g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t+1})\right\rVert^{2}\right]
=\displaystyle= p𝔼t+1[1N1j=1N1g(xt+1,j,θt+1)θ𝒥(θt+1)2]\displaystyle\;p\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}g(x^{t+1,j},\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t+1})\right\rVert^{2}\right]
+(1p)𝔼t+1[1N2j=1N2(g(xt+1,j,θt+1)gw(xt+1,j,θt,θt+1))+θ𝒥(θt)θ𝒥(θt+1)+gtθ𝒥(θt)2]\displaystyle\;+(1-p)\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{2}}\sum_{j=1}^{N_{2}}\left(g(x^{t+1,j},\theta^{t+1})-g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1})\right)+\nabla_{\theta}\mathcal{J}(\theta^{t})-\nabla_{\theta}\mathcal{J}(\theta^{t+1})+g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]
\displaystyle\leq p𝔼t+1[1N1j=1N1g(xt+1,j,θt+1)θ𝒥(θt+1)2]\displaystyle\;p\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}g(x^{t+1,j},\theta^{t+1})-\nabla_{\theta}\mathcal{J}(\theta^{t+1})\right\rVert^{2}\right]
+(1p)𝔼t+1[1N2j=1N2(g(xt+1,j,θt+1)gw(xt+1,j,θt,θt+1))+gtθ𝒥(θt)2]\displaystyle\;+(1-p)\mathbb{E}_{t+1}\left[\left\lVert\frac{1}{N_{2}}\sum_{j=1}^{N_{2}}\left(g(x^{t+1,j},\theta^{t+1})-g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1})\right)+g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]
\displaystyle\leq pσ2N1+(1p)𝔼t+1[gtθ𝒥(θt)2]+(1p)1N22j=1N2𝔼t+1[(g(xt+1,j,θt+1)gw(xt+1,j,θt,θt+1))2]\displaystyle\;\frac{p\sigma^{2}}{N_{1}}+(1-p)\mathbb{E}_{t+1}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]+(1-p)\frac{1}{N_{2}^{2}}\sum_{j=1}^{N_{2}}\mathbb{E}_{t+1}\left[\left\lVert\left(g(x^{t+1,j},\theta^{t+1})-g_{w}(x^{t+1,j},\theta^{t},\theta^{t+1})\right)\right\rVert^{2}\right]
\displaystyle\leq pσ2N1+(1p)𝔼t+1[gtθ𝒥(θt)2]+(1p)CN2θt+1θt2,\displaystyle\;\frac{p\sigma^{2}}{N_{1}}+(1-p)\mathbb{E}_{t+1}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]+\frac{(1-p)C}{N_{2}}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2},

where the first inequality uses the facts that 𝔼[(X𝔼[X])2]𝔼[X2]\mathbb{E}\left[\left(X-\mathbb{E}[X]\right)^{2}\right]\leq\mathbb{E}\left[X^{2}\right] for any random variable XX and that gtg^{t} is an unbiased estimator of θ𝒥(θt)\nabla_{\theta}\mathcal{J}(\theta^{t}) for all t0t\geq 0, the second inequality relies on the fact that the samples {xt+1,j}\{x^{t+1,j}\} are independent, and the last inequality is due to Lemma 5.3. Summing the above relation across t=0,,T2t=0,\dots,T-2, we see that

t=1T1𝔼T[gtθ𝒥(θt)2]\displaystyle\;\sum_{t=1}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]
\displaystyle\leq pσ2(T1)N1+(1p)t=0T2𝔼t+1[gtθ𝒥(θt)2]+(1p)CN2t=0T2𝔼T[θt+1θt2],\displaystyle\;\frac{p\sigma^{2}(T-1)}{N_{1}}+(1-p)\sum_{t=0}^{T-2}\mathbb{E}_{t+1}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]+\frac{(1-p)C}{N_{2}}\sum_{t=0}^{T-2}\mathbb{E}_{T}\left[\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\right],

which implies that

t=0T1𝔼T[gtθ𝒥(θt)2]pσ2T+σ2pN1+(1p)CpN2t=0T1𝔼T[θt+1θt2].\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\leq\frac{p\sigma^{2}T+\sigma^{2}}{pN_{1}}+\frac{(1-p)C}{pN_{2}}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\right]. (14)

Recall from (9) that

t=0T1θt+1θt22ηΔ12ηL+ηL(12ηL)t=0T1gtθ𝒥(θt)2,\sum_{t=0}^{T-1}\left\lVert\theta^{t+1}-\theta^{t}\right\rVert^{2}\leq\frac{2\eta\Delta}{1-2\eta L}+\frac{\eta}{L(1-2\eta L)}\sum_{t=0}^{T-1}\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2},

which together with (14) implies that

(1(1p)CηpN2L(12ηL))t=0T1𝔼T[gtθ𝒥(θt)2]pσ2T+σ2pN1+2η(1p)CΔpN2(12ηL).\left(1-\frac{(1-p)C\eta}{pN_{2}L(1-2\eta L)}\right)\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]\leq\frac{p\sigma^{2}T+\sigma^{2}}{pN_{1}}+\frac{2\eta(1-p)C\Delta}{pN_{2}(1-2\eta L)}.

Thus, the proof is completed. ∎
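The recursion bounded in Lemma 5.4 corresponds to the probabilistic (PAGE-style) estimator of Algorithm 2: with probability pp a fresh large batch of size N1N_{1} is drawn at the current iterate, and otherwise the previous estimate gtg^{t} is corrected with a small importance-weighted batch of size N2N_{2}. A minimal Python sketch of a single such update is given below; the callables sampler, g and g_w are assumed to be supplied by the user and return, respectively, samples from the current policy and the plain and importance-weighted single-sample gradient estimates.

```python
import numpy as np

def page_gradient_update(g_prev, theta_prev, theta, sampler, g, g_w, N1, N2, p, rng):
    """One update of the probabilistic gradient estimator analyzed in Lemma 5.4 (a sketch).
    `sampler(theta, n)` draws n samples x ~ pi_theta; `g(x, theta)` and
    `g_w(x, theta_prev, theta)` return single-sample gradient estimates."""
    if rng.random() < p:
        # with probability p: fresh large-batch estimate at the current iterate
        xs = sampler(theta, N1)
        return np.mean([g(x, theta) for x in xs], axis=0)
    # with probability 1 - p: small-batch correction of the previous estimate
    xs = sampler(theta, N2)
    correction = np.mean([g(x, theta) - g_w(x, theta_prev, theta) for x in xs], axis=0)
    return g_prev + correction

# Hypothetical usage with a 1-D Gaussian policy N(theta, 1) and reward R(x) = sin(x):
rng = np.random.default_rng(0)
sampler = lambda theta, n: theta + rng.standard_normal(n)
g_single = lambda x, theta: np.sin(x) * (x - theta)
g_w_single = lambda x, theta_prev, theta: np.exp(
    0.5 * ((x - theta) ** 2 - (x - theta_prev) ** 2)) * g_single(x, theta_prev)
g_new = page_gradient_update(np.array(0.5), 0.4, 0.45, sampler, g_single, g_w_single,
                             N1=10_000, N2=100, p=0.01, rng=rng)
print(g_new)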

A.7 Proof of Theorem 5.5

Proof of Theorem 5.5.

Since p=N2N1+N2(0,1)p=\frac{N_{2}}{N_{1}+N_{2}}\in(0,1) and

ηpN2L2(1p)C+2pN2L2=N22L2N1C+2N22L2,\eta\leq\frac{pN_{2}L}{2(1-p)C+2pN_{2}L^{2}}=\frac{N_{2}^{2}L}{2N_{1}C+2N_{2}^{2}L^{2}},

we can readily check that

η(0,12L),1(1p)CηpN2L(12ηL)12.\eta\in\left(0,\frac{1}{2L}\right),\quad 1-\frac{(1-p)C\eta}{pN_{2}L(1-2\eta L)}\geq\frac{1}{2}. (15)

Then, we can see that

𝔼T[dist(0,θ𝒥(θ^T)+𝒢(θ^T))2]\displaystyle\;\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]
\displaystyle\leq (2+2ηL(12ηL))1Tt=0T1𝔼T[gtθ𝒥(θt)2]+1T(2Δη+4Δη(12ηL))\displaystyle\;\left(2+\frac{2}{\eta L(1-2\eta L)}\right)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{T}\left[\left\lVert g^{t}-\nabla_{\theta}\mathcal{J}(\theta^{t})\right\rVert^{2}\right]+\frac{1}{T}\left(\frac{2\Delta}{\eta}+\frac{4\Delta}{\eta(1-2\eta L)}\right)
\displaystyle\leq 1T(2+2ηL(12ηL))(1(1p)CηpN2L(12ηL))1(pσ2T+σ2pN1+2η(1p)CΔpN2(12ηL))\displaystyle\;\frac{1}{T}\left(2+\frac{2}{\eta L(1-2\eta L)}\right)\left(1-\frac{(1-p)C\eta}{pN_{2}L(1-2\eta L)}\right)^{-1}\left(\frac{p\sigma^{2}T+\sigma^{2}}{pN_{1}}+\frac{2\eta(1-p)C\Delta}{pN_{2}(1-2\eta L)}\right)
+1T(2Δη+4Δη(12ηL))\displaystyle\;+\frac{1}{T}\left(\frac{2\Delta}{\eta}+\frac{4\Delta}{\eta(1-2\eta L)}\right)
\displaystyle\leq 4T(1+1ηL(12ηL))(Tσ2N1+(N1+N2)σ2N1N2+2ηN1CΔN22(12ηL))+2ΔT(1η+2η(12ηL))\displaystyle\;\frac{4}{T}\left(1+\frac{1}{\eta L(1-2\eta L)}\right)\left(\frac{T\sigma^{2}}{N_{1}}+\frac{(N_{1}+N_{2})\sigma^{2}}{N_{1}N_{2}}+\frac{2\eta N_{1}C\Delta}{N_{2}^{2}(1-2\eta L)}\right)+\frac{2\Delta}{T}\left(\frac{1}{\eta}+\frac{2}{\eta(1-2\eta L)}\right)

where Δ:=(θ0)>0\Delta:=\mathcal{F}^{*}-\mathcal{F}(\theta^{0})>0 is a constant, the first inequality is due to Theorem 4.3, the second inequality is derived from Lemma 5.4, and the third inequality is implied by (15).

Then, in order to have 𝔼T[dist(0,θ𝒥(θ^T)+𝒢(θ^T))2]ϵ2\mathbb{E}_{T}\left[\mathrm{dist}\left(0,-\nabla_{\theta}\mathcal{J}(\hat{\theta}^{T})+\partial\mathcal{G}(\hat{\theta}^{T})\right)^{2}\right]\leq\epsilon^{2} for a given tolerance ϵ>0\epsilon>0, we can simply set N2=N1N_{2}=\sqrt{N_{1}},

ηN22L2N1C+2N22L2=L2C+2L2,\eta\leq\frac{N_{2}^{2}L}{2N_{1}C+2N_{2}^{2}L^{2}}=\frac{L}{2C+2L^{2}},

and require that

4(1+1ηL(12ηL))σ2N1\displaystyle 4\left(1+\frac{1}{\eta L(1-2\eta L)}\right)\frac{\sigma^{2}}{N_{1}}\leq ϵ23,\displaystyle\;\frac{\epsilon^{2}}{3},
4T(1+1ηL(12ηL))(N1+N2)σ2N1N2\displaystyle\frac{4}{T}\left(1+\frac{1}{\eta L(1-2\eta L)}\right)\frac{(N_{1}+N_{2})\sigma^{2}}{N_{1}N_{2}}\leq ϵ23,\displaystyle\;\frac{\epsilon^{2}}{3},
2ΔT[(1+1ηL(12ηL))4ηN1CN22(12ηL)+1η+2η(12ηL)]\displaystyle\frac{2\Delta}{T}\left[\left(1+\frac{1}{\eta L(1-2\eta L)}\right)\frac{4\eta N_{1}C}{N_{2}^{2}(1-2\eta L)}+\frac{1}{\eta}+\frac{2}{\eta(1-2\eta L)}\right]\leq ϵ23.\displaystyle\;\frac{\epsilon^{2}}{3}.

Therefore, it suffices to set N1=O(ϵ2)N_{1}=O(\epsilon^{-2}), N2=N1=O(ϵ1)N_{2}=\sqrt{N_{1}}=O(\epsilon^{-1}), and T=O(ϵ2)T=O(\epsilon^{-2}). (We omit the concrete expressions of TT, N1N_{1} and N2N_{2} in terms of ϵ\epsilon and the other constants, and only report the big-O orders here for simplicity.)

Finally, we can verify that the sample complexity can be bounded as

N1+T(pN1+(1p)N2)=N1+T2N1N2N1+N2N1+2TN2=O(ϵ3).N_{1}+T\left(pN_{1}+(1-p)N_{2}\right)=N_{1}+T\frac{2N_{1}N_{2}}{N_{1}+N_{2}}\leq N_{1}+2TN_{2}=O(\epsilon^{-3}).

Therefore, the proof is completed. ∎
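Finally, the parameter choices made in this proof can be collected into a short routine. The Python sketch below evaluates N1N_{1}, N2N_{2}, pp, η\eta and TT for a given tolerance ϵ\epsilon and reports the resulting sample count; the constants c1 and c2 that hide the problem-dependent factors in the big-O choices are hypothetical placeholders.

```python
import math

def page_parameters(eps, L, C, c1=1.0, c2=1.0):
    """Parameter choices from the proof of Theorem 5.5: N1 = O(eps^-2), N2 = sqrt(N1),
    p = N2 / (N1 + N2), eta <= L / (2C + 2L^2), T = O(eps^-2)."""
    N1 = math.ceil(c1 / eps ** 2)
    N2 = math.ceil(math.sqrt(N1))
    p = N2 / (N1 + N2)
    eta = L / (2.0 * C + 2.0 * L ** 2)
    T = math.ceil(c2 / eps ** 2)
    total_samples = N1 + T * (p * N1 + (1.0 - p) * N2)   # <= N1 + 2 * T * N2 = O(eps^-3)
    return N1, N2, p, eta, T, total_samples

# illustrative tolerances, only to show the O(eps^-3) growth of the sample count
for eps in (0.1, 0.05, 0.025):
    print(eps, page_parameters(eps, L=1.0, C=1.0))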