
Offline Reinforcement Learning with Differential Privacy

Dan Qiao Department of Computer Science, UC Santa Barbara Yu-Xiang Wang Department of Computer Science, UC Santa Barbara
Abstract

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.

1 Introduction

Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment (usually characterized by a Markov Decision Process (MDP) in this paper) through a static dataset gathered from some behavior policy \mu. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trade or finance (Zhang et al., 2020)), risky (autonomous driving (Sallab et al., 2017)) or illegal / unethical (healthcare (Raghu et al., 2017)). In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history as an example: at each time step, each patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage, etc.); finally there is a treatment outcome (whether the patient feels better, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be viewed as n (number of patients) trajectories sampled from an MDP with horizon H (number of treatment steps); see Table 1 for an illustration. However, learning agents are known to implicitly memorize details of individual training data points verbatim (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes offline RL models vulnerable to various privacy attacks.

Differential privacy (DP) (Dwork et al., 2006) is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe in which any individual user is replaced, thereby preventing the aforementioned privacy risks. There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting (Vietri et al., 2020; Garcelon et al., 2021; Liao et al., 2021; Chowdhury and Zhou, 2021; Luyo et al., 2021).

Offline RL is arguably more practically relevant than online RL in the applications with sensitive data. For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy enhancing techniques (e.g., DP) to patients was reported to further improve data access (Kim et al., 2017).

Time     Patient 1                    Patient 2                    \cdots   Patient n
Time 1   Health condition_{1,1}       Health condition_{2,1}       \cdots   Health condition_{n,1}
Time 1   Treatment_{1,1}              Treatment_{2,1}              \cdots   Treatment_{n,1}
Time 1   Treatment outcome_{1,1}      Treatment outcome_{2,1}      \cdots   Treatment outcome_{n,1}
\cdots   \cdots                       \cdots                       \cdots   \cdots
Time H   Health condition_{1,H}       Health condition_{2,H}       \cdots   Health condition_{n,H}
Time H   Treatment_{1,H}              Treatment_{2,H}              \cdots   Treatment_{n,H}
Time H   Treatment outcome_{1,H}      Treatment outcome_{2,H}      \cdots   Treatment outcome_{n,H}
Table 1: An illustration of an offline dataset of medical histories. The dataset consists of n patients, and the data for each patient consists of the (health condition, treatment, treatment outcome) at each of H time steps. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be viewed as n trajectories sampled from an MDP with horizon H.

Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold.

  • We design two new pessimism-based algorithms DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms.

  • We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI (Yin et al., 2022) as well as a popular baseline PEVI (Jin et al., 2021). The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.

Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy (Balle et al., 2016; Xie et al., 2019). Balle et al. (2016) privatized first-visit Monte Carlo-Ridge Regression estimator by an output perturbation mechanism and Xie et al. (2019) used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus.

There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy (Vietri et al., 2020; Chowdhury and Zhou, 2021; Ngo et al., 2022; Luyo et al., 2021) or local differential privacy (Garcelon et al., 2021; Liao et al., 2021; Luyo et al., 2021; Chowdhury and Zhou, 2021). The offline setting introduces new challenges in DP as we cannot algorithmically enforce good “exploration”, but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B.

Among non-private offline RL works, we build directly upon the offline RL methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) (Yin and Wang, 2021b) and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) (Yin et al., 2022), as they give the strongest theoretical guarantees to date. We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.

A remark on technical novelty. Our algorithms involve substantial technical innovation over previous works on online DP-RL with a joint DP guarantee (here we only compare our techniques, which are for offline RL, with the works on online RL under a joint DP guarantee, as both settings allow access to the raw data). Different from previous works, our DP-APVI (Algorithm 1) operates on a Bernstein-type pessimism, which requires our algorithm to handle conditional variances using private statistics. Besides, our DP-VAPVI (Algorithm 2) replaces the LSVI technique with variance-aware LSVI (also known as weighted ridge regression, which first appeared in (Zhou et al., 2021)). Our DP-VAPVI releases conditional variances privately, and further applies weighted ridge regression privately. Both approaches ensure tighter instance-dependent bounds on the suboptimality of the learned policy.

2 Problem Setup

Markov Decision Process. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple M=(\mathcal{S},\mathcal{A},P,r,H,d_{1}) (Sutton and Barto, 2018), where \mathcal{S} is the state space and \mathcal{A} is the action space. A non-stationary transition kernel P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1] maps each state-action pair (s_{h},a_{h}) to a probability distribution P_{h}(\cdot|s_{h},a_{h}), and P_{h} can differ across time. Besides, r_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R} is the expected immediate reward satisfying 0\leq r_{h}\leq 1, d_{1} is the initial state distribution and H is the horizon. A policy \pi=(\pi_{1},\cdots,\pi_{H}) assigns each state s_{h}\in\mathcal{S} a probability distribution over actions according to the map s_{h}\mapsto\pi_{h}(\cdot|s_{h}), \forall\,h\in[H]. A random trajectory s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H},s_{H+1} is generated according to s_{1}\sim d_{1}, a_{h}\sim\pi_{h}(\cdot|s_{h}), r_{h}\sim r_{h}(s_{h},a_{h}), s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h}), \forall\,h\in[H].

For a tabular MDP, the state-action space \mathcal{S}\times\mathcal{A} is discrete with S:=|\mathcal{S}| and A:=|\mathcal{A}| finite. In this work, we assume that r is known (this is because the uncertainty of the reward function is dominated by that of the transition kernel in RL). In addition, we define the per-step marginal state-action occupancy d^{\pi}_{h}(s,a) as d^{\pi}_{h}(s,a):=\mathbb{P}[s_{h}=s|s_{1}\sim d_{1},\pi]\cdot\pi_{h}(a|s), which is the marginal state-action probability at time h.

Value function, Bellman (optimality) equations. The value function V^{\pi}_{h}(\cdot) and Q-value function Q^{\pi}_{h}(\cdot,\cdot) for any policy \pi are defined as V^{\pi}_{h}(s)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s],\;\;Q^{\pi}_{h}(s,a)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s,a_{h}=a],\;\;\forall\,(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}. The performance is defined as v^{\pi}:=\mathbb{E}_{d_{1}}\left[V^{\pi}_{1}\right]=\mathbb{E}_{\pi,d_{1}}\left[\sum_{t=1}^{H}r_{t}\right]. The Bellman (optimality) equations hold for all h\in[H]: Q^{\pi}_{h}=r_{h}+P_{h}V^{\pi}_{h+1},\;\;V^{\pi}_{h}=\mathbb{E}_{a\sim\pi_{h}}[Q^{\pi}_{h}],\;\;\;Q^{\star}_{h}=r_{h}+P_{h}V^{\star}_{h+1},\;V^{\star}_{h}=\max_{a}Q^{\star}_{h}(\cdot,a).
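For concreteness, the Bellman optimality equations can be solved exactly by backward induction when P and r are known. Below is a minimal sketch (ours, for illustration only; the offline algorithms later in the paper replace P with private estimates built from data):

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon tabular MDP.

    P: shape (H, S, A, S), P[h, s, a, s'] = P_h(s' | s, a)
    r: shape (H, S, A), expected rewards in [0, 1]
    Returns Q_star of shape (H, S, A) and V_star of shape (H, S).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))            # V_{H+1} = 0 by convention
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V[h + 1]   # Q_h = r_h + P_h V_{h+1}
        V[h] = Q[h].max(axis=1)         # V_h = max_a Q_h(., a)
    return Q, V[:H]
```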

Linear MDP (Jin et al., 2020b). An episodic MDP (\mathcal{S},\mathcal{A},H,P,r) is called a linear MDP with a known feature map \phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} if there exist H unknown signed measures \nu_{h}\in\mathbb{R}^{d} over \mathcal{S} and H unknown reward vectors \theta_{h}\in\mathbb{R}^{d} such that

{P}_{h}\left(s^{\prime}\mid s,a\right)=\left\langle\phi(s,a),\nu_{h}\left(s^{\prime}\right)\right\rangle,\quad r_{h}\left(s,a\right)=\left\langle\phi(s,a),\theta_{h}\right\rangle,\quad\forall\,(h,s,a,s^{\prime})\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}.

Without loss of generality, we assume \|\phi(s,a)\|_{2}\leq 1 and \max(\|\nu_{h}(\mathcal{S})\|_{2},\|\theta_{h}\|_{2})\leq\sqrt{d} for all (h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}. An important property of linear MDPs is that the value functions are linear in the feature map, which is summarized in Lemma E.14.
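For completeness, the linearity property referenced above follows in one line from the Bellman equation and the linear MDP definition: writing w^{\pi}_{h}:=\theta_{h}+\int_{\mathcal{S}}V^{\pi}_{h+1}(s^{\prime})\,\mathrm{d}\nu_{h}(s^{\prime}),

Q^{\pi}_{h}(s,a)=r_{h}(s,a)+\left[P_{h}V^{\pi}_{h+1}\right](s,a)=\left\langle\phi(s,a),\theta_{h}\right\rangle+\left\langle\phi(s,a),\int_{\mathcal{S}}V^{\pi}_{h+1}(s^{\prime})\,\mathrm{d}\nu_{h}(s^{\prime})\right\rangle=\left\langle\phi(s,a),w^{\pi}_{h}\right\rangle,

so every Q-value function is linear in the feature map \phi.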

Offline setting and the goal. Offline RL requires the agent to find a policy \pi that maximizes the performance v^{\pi}, given only the episodic data \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau\in[n]}^{h\in[H]} (for clarity we use n for tabular MDPs and K for linear MDPs when referring to the sample size) rolled out from some fixed and possibly unknown behavior policy \mu; we cannot change \mu and in particular we do not assume functional knowledge of \mu. In summary, based on the batch data \mathcal{D} and a target accuracy \epsilon>0, the agent seeks to find a policy \pi_{\text{alg}} such that v^{\star}-v^{\pi_{\text{alg}}}\leq\epsilon.

2.1 Assumptions in offline RL

In order to show that our privacy-preserving algorithms can output near-optimal policies, certain coverage assumptions are needed. In this section, we list the assumptions used in this paper.

Assumptions for tabular setting.

Assumption 2.1 ((Liu et al., 2019)).

There exists one optimal policy \pi^{\star} that is fully covered by \mu, i.e., \forall\,(s_{h},a_{h})\in\mathcal{S}\times\mathcal{A}, d^{\pi^{\star}}_{h}(s_{h},a_{h})>0 only if d^{\mu}_{h}(s_{h},a_{h})>0. Furthermore, we denote the trackable set as \mathcal{C}_{h}:=\{(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\}.

Assumption 2.1 is the weakest assumption needed for accurately learning the optimal value v^{\star}: it only requires \mu to trace the state-action space of one optimal policy (\mu can be agnostic elsewhere). Similar to (Yin and Wang, 2021b), we use Assumption 2.1 for the tabular part of this paper, which enables a comparison of our sample complexity with the result in (Yin and Wang, 2021b), whose algorithm serves as a non-private baseline.

Assumptions for linear setting. First, we define the expected covariance matrix under the behavior policy \mu for each time step h\in[H] as below:

\Sigma^{p}_{h}:=\mathbb{E}_{\mu}\left[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}\right]. (1)

As has been shown in (Wang et al., 2021; Yin et al., 2022), learning a near-optimal policy from offline data requires coverage assumptions. In the linear setting, such coverage is characterized by the minimum eigenvalue of \Sigma^{p}_{h}. Similar to (Yin et al., 2022), we adopt the following assumption for the sake of comparison.

Assumption 2.2 (Feature Coverage, Assumption 2 in (Wang et al., 2021)).

The data distribution \mu satisfies the minimum eigenvalue condition: \forall\,h\in[H], \kappa_{h}:=\lambda_{\mathrm{min}}(\Sigma^{p}_{h})>0. Furthermore, we denote \kappa=\min_{h}\kappa_{h}.

2.2 Differential Privacy in offline RL

In this work, we aim to design privacy-preserving algorithms for offline RL. We apply differential privacy as the formal notion of privacy. Below we revisit the definition of differential privacy.

Definition 2.3 (Differential Privacy (Dwork et al., 2006)).

A randomized mechanism M satisfies (\epsilon,\delta)-differential privacy ((\epsilon,\delta)-DP) if for all neighboring datasets U,U^{\prime} that differ in one data point and for all possible events E in the output range, it holds that

\mathbb{P}[M(U)\in E]\leq e^{\epsilon}\cdot\mathbb{P}[M(U^{\prime})\in E]+\delta.

When \delta=0, we refer to pure DP, while for \delta>0, we refer to approximate DP.

In the problem of offline RL, the dataset consists of several trajectories, so one data point in Definition 2.3 refers to one single trajectory. Hence differential privacy requires that the distribution of the output policy change little when any single trajectory in the dataset is replaced. In other words, an adversary cannot infer much information about any single trajectory in the dataset from the output policy of the algorithm.

Remark 2.4.

For a concrete motivating example, please refer to the first paragraph of Introduction. We remark that our definition of DP is consistent with Joint DP and Local DP defined under the online RL setting where JDP/LDP also cast each user as one trajectory and provide user-wise privacy protection. For detailed definitions and more discussions about JDP/LDP, please refer to Qiao and Wang (2022).

Throughout the paper, we use zCDP (defined below) as a surrogate for DP, since it enables a cleaner analysis of privacy composition and the Gaussian mechanism. The properties of zCDP (e.g., composition, conversion formulas to DP) are deferred to Appendix E.3.

Definition 2.5 (zCDP (Dwork and Rothblum, 2016; Bun and Steinke, 2016)).

A randomized mechanism M satisfies \rho-Zero-Concentrated Differential Privacy (\rho-zCDP) if, for all neighboring datasets U,U^{\prime} and all \alpha\in(1,\infty),

D_{\alpha}(M(U)\|M(U^{\prime}))\leq\rho\alpha,

where D_{\alpha} is the Rényi divergence (Van Erven and Harremos, 2014).

Finally, we go over the definition and privacy guarantee of Gaussian mechanism.

Definition 2.6 (Gaussian Mechanism (Dwork et al., 2014)).

Define the \ell_{2} sensitivity of a function f:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} as

\Delta_{2}(f)=\sup_{\text{neighboring}\,U,U^{\prime}}\|f(U)-f(U^{\prime})\|_{2}.

The Gaussian mechanism \mathcal{M} with noise level \sigma is then given by

\mathcal{M}(U)=f(U)+\mathcal{N}(0,\sigma^{2}I_{d}).
Lemma 2.7 (Privacy guarantee of Gaussian mechanism (Dwork et al., 2014; Bun and Steinke, 2016)).

Let f:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} be an arbitrary d-dimensional function with \ell_{2} sensitivity \Delta_{2}. Then for any \rho>0, the Gaussian mechanism with parameter \sigma^{2}=\frac{\Delta^{2}_{2}}{2\rho} satisfies \rho-zCDP. In addition, for all 0<\delta,\epsilon<1, the Gaussian mechanism with parameter \sigma=\frac{\Delta_{2}}{\epsilon}\sqrt{2\log\frac{1.25}{\delta}} satisfies (\epsilon,\delta)-DP.

We emphasize that the privacy guarantee covers any input data. It does not require any distributional assumptions on the data. The RL-specific assumptions (e.g., linear MDP and coverage assumptions) are only used for establishing provable utility guarantees.
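As a concrete illustration of Lemma 2.7, the following sketch (ours, for illustration only; function names are not from the paper) calibrates Gaussian noise to a given \ell_{2} sensitivity for either a zCDP budget \rho or an (\epsilon,\delta)-DP budget:

```python
import numpy as np

def gaussian_mechanism_zcdp(f_value, l2_sensitivity, rho):
    """Release f(U) + N(0, sigma^2 I) with sigma^2 = Delta_2^2 / (2 rho): rho-zCDP."""
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return f_value + np.random.normal(0.0, sigma, size=np.shape(f_value))

def gaussian_mechanism_dp(f_value, l2_sensitivity, eps, delta):
    """Release f(U) + N(0, sigma^2 I) with sigma = (Delta_2 / eps) * sqrt(2 log(1.25/delta)):
    (eps, delta)-DP for 0 < eps, delta < 1."""
    sigma = (l2_sensitivity / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return f_value + np.random.normal(0.0, sigma, size=np.shape(f_value))
```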

3 Results under tabular MDP: DP-APVI (Algorithm 1)

The tabular MDP setting is the most well-studied setting in reinforcement learning, and our first result applies to this regime. We begin with the construction of private counts.

Private Model-based Components. Given data \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau\in[n]}^{h\in[H]}, let n_{s_{h},a_{h}}:=\sum_{\tau=1}^{n}\mathds{1}[s_{h}^{\tau},a_{h}^{\tau}=s_{h},a_{h}] denote the total count of visits to the (s_{h},a_{h}) pair at time h and n_{s_{h},a_{h},s_{h+1}}:=\sum_{\tau=1}^{n}\mathds{1}[s_{h}^{\tau},a_{h}^{\tau},s_{h+1}^{\tau}=s_{h},a_{h},s_{h+1}] denote the total count of visits to the (s_{h},a_{h},s_{h+1}) pair at time h. Then, given the budget \rho for zCDP, we add independent Gaussian noise to all the counts:

n_{s_{h},a_{h}}^{\prime}=\left\{n_{s_{h},a_{h}}+\mathcal{N}(0,\sigma^{2})\right\}^{+},\,\,n_{s_{h},a_{h},s_{h+1}}^{\prime}=\left\{n_{s_{h},a_{h},s_{h+1}}+\mathcal{N}(0,\sigma^{2})\right\}^{+},\,\,\sigma^{2}=\frac{2H}{\rho}. (2)

However, after adding noise, the noisy counts n^{\prime} may not satisfy n^{\prime}_{s_{h},a_{h}}=\sum_{s_{h+1}\in\mathcal{S}}n^{\prime}_{s_{h},a_{h},s_{h+1}}. To address this problem, we choose the private visitation counts as the solution to the following optimization problem (here E_{\rho}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}} is chosen as a high-probability uniform bound on the noise we add):

\{\widetilde{n}_{s_{h},a_{h},s^{\prime}}\}_{s^{\prime}\in\mathcal{S}}=\mathrm{argmin}_{\{x_{s^{\prime}}\}_{s^{\prime}\in\mathcal{S}}}\max_{s^{\prime}\in\mathcal{S}}\left|x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}\right| (3)
\text{such that}\,\,\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h}}\right|\leq\frac{E_{\rho}}{2}\,\,\text{and}\,\,x_{s^{\prime}}\geq 0,\,\forall\,s^{\prime}\in\mathcal{S},
\widetilde{n}_{s_{h},a_{h}}=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s^{\prime}}.
Remark 3.1 (Some explanations).

The optimization problem above serves as a post-processing step and hence does not affect the DP guarantee of our algorithm. Briefly speaking, (3) finds a set of noisy counts such that \widetilde{n}_{s_{h},a_{h}}=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s^{\prime}} while the estimation error of each \widetilde{n}_{s_{h},a_{h}} and \widetilde{n}_{s_{h},a_{h},s^{\prime}} is roughly E_{\rho} (this conclusion is summarized in Lemma C.3). In contrast, if we directly took the crude approach \widetilde{n}_{s_{h},a_{h},s_{h+1}}={n}_{s_{h},a_{h},s_{h+1}}^{\prime} and \widetilde{n}_{s_{h},a_{h}}=\sum_{s_{h+1}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s_{h+1}}, we could only derive |\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq\widetilde{O}(\sqrt{S}E_{\rho}) through concentration of a sum of S i.i.d. Gaussian noises. In conclusion, solving the optimization problem (3) enables a tight analysis of the lower order term (the additional cost of privacy).

Remark 3.2 (Computational efficiency).

The optimization problem (3) can be reformulated as:

\min\,\,t,\,\,\text{s.t.}\,\,|x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq t\,\,\text{and}\,\,x_{s^{\prime}}\geq 0\,\,\forall\,s^{\prime}\in\mathcal{S},\,\,\,\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h}}\right|\leq\frac{E_{\rho}}{2}. (4)

Note that (4) is a linear programming problem with S+1 variables and 2S+2 linear constraints (one constraint on an absolute value is equivalent to two linear constraints), which can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser and Wolsey, 1988). Therefore, our Algorithm 1 is computationally efficient.
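To make the count release concrete, here is a sketch (with our own naming, purely illustrative) of the pipeline for a single (s_{h},a_{h}) pair: Gaussian noise is added as in (2), and the reformulated LP (4) is solved with an off-the-shelf solver; we use scipy.optimize.linprog here, but any LP solver works since this step is pure post-processing.

```python
import numpy as np
from scipy.optimize import linprog

def noisy_counts(n_sa, n_sas, sigma, rng):
    """Step (2): add N(0, sigma^2) noise to each count and clip at zero ({.}^+)."""
    n_sa_noisy = max(n_sa + rng.normal(0.0, sigma), 0.0)
    n_sas_noisy = np.maximum(n_sas + rng.normal(0.0, sigma, size=len(n_sas)), 0.0)
    return n_sa_noisy, n_sas_noisy

def postprocess_counts(n_sa_noisy, n_sas_noisy, E_rho):
    """Steps (3)/(4): find x >= 0 minimizing max_s' |x_s' - n'_{s,a,s'}|
    subject to |sum_s' x_s' - n'_{s,a}| <= E_rho / 2."""
    S = len(n_sas_noisy)
    c = np.zeros(S + 1)
    c[-1] = 1.0                              # variables z = (x_1, ..., x_S, t); minimize t
    A_ub, b_ub = [], []
    for s in range(S):                       # |x_s - n'_s| <= t, as two linear constraints
        row = np.zeros(S + 1); row[s] = 1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(n_sas_noisy[s])
        row = np.zeros(S + 1); row[s] = -1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(-n_sas_noisy[s])
    row = np.zeros(S + 1); row[:S] = 1.0     # sum_s x_s <= n'_{s,a} + E_rho/2
    A_ub.append(row); b_ub.append(n_sa_noisy + E_rho / 2.0)
    row = np.zeros(S + 1); row[:S] = -1.0    # sum_s x_s >= n'_{s,a} - E_rho/2
    A_ub.append(row); b_ub.append(-(n_sa_noisy - E_rho / 2.0))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * (S + 1))
    x = res.x[:S]
    return x.sum(), x                        # (n_tilde_{s,a}, {n_tilde_{s,a,s'}}_{s'})
```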

The private estimation of the transition kernel is defined as:

\widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}, (5)

if \widetilde{n}_{s_{h},a_{h}}>E_{\rho}, and \widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{1}{S} otherwise.

Remark 3.3.

Different from the transition kernel estimates in previous works (Vietri et al., 2020; Chowdhury and Zhou, 2021), which may not be probability distributions, we have to ensure that ours is a probability distribution, because our Bernstein-type pessimism (line 5 in Algorithm 1) takes a variance over this transition kernel estimate. The intuition behind the construction of our private transition kernel is that, for state-action pairs with \widetilde{n}_{s_{h},a_{h}}\leq E_{\rho}, we cannot distinguish whether a non-zero private count comes from noise or from actual visitation. Therefore we only take the empirical estimate for state-action pairs with sufficiently large \widetilde{n}_{s_{h},a_{h}}.

Algorithm 1 Differentially Private Adaptive Pessimistic Value Iteration (DP-APVI)
1:  Input: Offline dataset \mathcal{D}=\{(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau})\}_{\tau,h=1}^{n,H}. Reward function r. Constants C_{1}=\sqrt{2},C_{2}=16,C>1, failure probability \delta, budget for zCDP \rho.
2:  Initialization: Calculate \widetilde{n}_{s_{h},a_{h}},\widetilde{n}_{s_{h},a_{h},s_{h+1}} as in (3) and \widetilde{P}_{h}(s_{h+1}|s_{h},a_{h}) as in (5). \widetilde{V}_{H+1}(\cdot)\leftarrow 0. E_{\rho}\leftarrow 4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}. \iota\leftarrow\log(HSA/\delta).
3:  for h=H,H-1,\ldots,1 do
4:    \widetilde{Q}_{h}(\cdot,\cdot)\leftarrow r_{h}(\cdot,\cdot)+(\widetilde{P}_{h}\cdot\widetilde{V}_{h+1})(\cdot,\cdot)
5:    \forall\,s_{h},a_{h}: let \Gamma_{h}(s_{h},a_{h})\leftarrow C_{1}\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{s_{h},a_{h}}}(\widetilde{V}_{h+1})\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{C_{2}SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}} if \widetilde{n}_{s_{h},a_{h}}>E_{\rho}, otherwise CH.
6:    \widehat{Q}^{p}_{h}(\cdot,\cdot)\leftarrow\widetilde{Q}_{h}(\cdot,\cdot)-\Gamma_{h}(\cdot,\cdot).
7:    \overline{Q}_{h}(\cdot,\cdot)\leftarrow\min\{\widehat{Q}^{p}_{h}(\cdot,\cdot),H-h+1\}^{+}.
8:    \forall\,s_{h}: let \widehat{\pi}_{h}(\cdot|s_{h})\leftarrow\mathrm{argmax}_{\pi_{h}}\langle\overline{Q}_{h}(s_{h},\cdot),\pi_{h}(\cdot|s_{h})\rangle and \widetilde{V}_{h}(s_{h})\leftarrow\langle\overline{Q}_{h}(s_{h},\cdot),\widehat{\pi}_{h}(\cdot|s_{h})\rangle.
9:  end for
10:  Output: \{\widehat{\pi}_{h}\}.

Algorithmic design. Our algorithmic design originates from the idea of pessimism, which takes a conservative view of locations with high uncertainty and favors locations about which we have more confidence. Based on the Bernstein-type pessimism in APVI (Yin and Wang, 2021b), we design a similar pessimistic algorithm with private counts to ensure differential privacy. If we replace \widetilde{n} and \widetilde{P} with n and \widehat{P} (the non-private empirical estimate, defined as (15) in Appendix C), then our DP-APVI (Algorithm 1) degenerates to APVI. Compared to the pessimism defined in APVI, our pessimistic penalty has an additional term \widetilde{O}\left(\frac{SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}\right), which accounts for the additional pessimism due to our use of private statistics.
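To make the design concrete, here is a condensed sketch (ours, not the authors' code) of the backward induction in Algorithm 1, assuming the post-processed private counts from (3) are already computed; it builds the private transition estimate (5) and the Bernstein-type private penalty from line 5 of the pseudocode.

```python
import numpy as np

def dp_apvi(n_tilde_sa, n_tilde_sas, r, E_rho, iota, C1=np.sqrt(2), C2=16.0, C=2.0):
    """n_tilde_sa: (H, S, A) private counts; n_tilde_sas: (H, S, A, S) private counts;
    r: (H, S, A) known rewards. Returns a greedy (deterministic) policy pi_hat of shape (H, S)."""
    H, S, A = r.shape
    # Private transition estimate (5): empirical ratio when the count exceeds E_rho, uniform otherwise.
    P_tilde = np.full((H, S, A, S), 1.0 / S)
    mask = n_tilde_sa > E_rho
    P_tilde[mask] = n_tilde_sas[mask] / n_tilde_sa[mask][:, None]
    V = np.zeros(S)
    pi_hat = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):                          # h is 0-indexed here
        mean_V = P_tilde[h] @ V                           # (S, A): P_tilde_h V_tilde_{h+1}
        var_V = np.maximum(P_tilde[h] @ (V ** 2) - mean_V ** 2, 0.0)
        Q = r[h] + mean_V
        # Bernstein-type private pessimism (line 5 of Algorithm 1).
        Gamma = np.where(
            mask[h],
            C1 * np.sqrt(var_V * iota / np.maximum(n_tilde_sa[h] - E_rho, 1e-12))
            + C2 * S * H * E_rho * iota / np.maximum(n_tilde_sa[h], 1e-12),
            C * H,
        )
        Q_bar = np.clip(Q - Gamma, 0.0, H - h)            # min{., H-h+1}^+ with 0-indexed h
        pi_hat[h] = Q_bar.argmax(axis=1)
        V = Q_bar.max(axis=1)
    return pi_hat
```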

We state our main theorem about DP-APVI below; the proof sketch is deferred to Appendix C.1 and the detailed proof to Appendix C due to space limits.

Theorem 3.4.

DP-APVI (Algorithm 1) satisfies \rho-zCDP. Furthermore, under Assumption 2.1, denote \bar{d}_{m}:=\min_{h\in[H]}\{d^{\mu}_{h}(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\}. For any 0<\delta<1, there exists a constant c_{1}>0 such that when n>c_{1}\cdot\max\{H^{2},E_{\rho}\}/\bar{d}_{m}\cdot\iota (where \iota=\log(HSA/\delta)), with probability 1-\delta, the output policy \widehat{\pi} of DP-APVI satisfies

0\leq v^{\star}-v^{\widehat{\pi}}\leq 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right), (6)

where \widetilde{O} hides constants and polylog terms, and E_{\rho}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}.

Comparison to the non-private counterpart APVI (Yin and Wang, 2021b). According to Theorem 4.1 in (Yin and Wang, 2021b), the sub-optimality bound of APVI states that for large enough n, with high probability, the output \widehat{\pi} satisfies:

0\leq v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{nd^{\mu}_{h}(s_{h},a_{h})}}\right)+\widetilde{O}\left(\frac{H^{3}}{n\cdot\bar{d}_{m}}\right). (7)

Compared to our Theorem 3.4, the additional sub-optimality due to differential privacy is \widetilde{O}\left(\frac{SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)=\widetilde{O}\left(\frac{SH^{\frac{5}{2}}}{n\cdot\bar{d}_{m}\sqrt{\rho}}\right)=\widetilde{O}\left(\frac{SH^{\frac{5}{2}}}{n\cdot\bar{d}_{m}\epsilon}\right) (here we apply the second part of Lemma 2.7 to achieve (\epsilon,\delta)-DP; the notation \widetilde{O} also absorbs \log\frac{1}{\delta}, where only here \delta denotes the privacy budget instead of the failure probability). In the most common regime where the privacy budget \rho or \epsilon is a constant, the additional term due to differential privacy appears as a lower order term, hence becomes negligible as the sample size n grows large.

Comparison to Hoeffding-type pessimism. We can simply revise our algorithm to use a Hoeffding-type pessimism, which replaces the penalty in line 5 with C_{1}H\cdot\sqrt{\frac{\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{C_{2}SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}. Then, following a similar proof, we arrive at a sub-optimality bound stating that with high probability,

0\leq v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(H\cdot\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{1}{nd^{\mu}_{h}(s_{h},a_{h})}}\right)+\widetilde{O}\left(\frac{SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right). (8)

Compared to this, the bound in Theorem 3.4 is tighter because we express the dominant term through the system quantities instead of an explicit dependence on H (note that \mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\leq H^{2}). In addition, we highlight that according to Theorem G.1 in (Yin and Wang, 2021b), our main term nearly matches the non-private minimax lower bound. For more detailed discussions of our main term and how it subsumes other optimal learning bounds, we refer readers to (Yin and Wang, 2021b).

Applying the Laplace mechanism to achieve pure DP. To achieve pure DP instead of \rho-zCDP, we can simply replace the Gaussian mechanism with the Laplace mechanism (defined as Definition E.19). Given a privacy budget \epsilon for pure DP, since the \ell_{1} sensitivity of \{n_{s_{h},a_{h}}\}\cup\{n_{s_{h},a_{h},s_{h+1}}\} is \Delta_{1}=4H, we can add independent Laplace noise \mathrm{Lap}(\frac{4H}{\epsilon}) to each count to achieve \epsilon-DP by Lemma E.20. Then, using E_{\epsilon}=\widetilde{O}\left(\frac{H}{\epsilon}\right) in place of E_{\rho} and keeping everything else ((3), (5) and Algorithm 1) the same, we can reach a result similar to Theorem 3.4 with the same proof. The only difference is that the additional learning bound becomes \widetilde{O}\left(\frac{SH^{3}}{n\cdot\bar{d}_{m}\epsilon}\right), which still appears as a lower order term.
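A minimal sketch of this pure-DP variant (only the noise distribution changes; the post-processing (3), the estimate (5) and Algorithm 1 are untouched):

```python
import numpy as np

def laplace_counts(counts, H, eps, rng):
    """Add independent Lap(4H/eps) noise to every visitation count and clip at zero.
    Since the l1 sensitivity of the full count vector is Delta_1 = 4H, the release is eps-DP."""
    scale = 4.0 * H / eps
    return np.maximum(counts + rng.laplace(0.0, scale, size=np.shape(counts)), 0.0)
```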

4 Results under linear MDP: DP-VAPVI(Algorithm 2)

In large MDPs, function approximation is widely applied to address computational issues, and the linear MDP is a concrete model for studying linear function approximation. Our second result applies to the linear MDP setting. Generally speaking, function approximation reduces the dimensionality of the private releases compared to tabular MDPs. We begin with private counts.

Private Model-based Components. Given the two datasets \mathcal{D} and \mathcal{D}^{\prime} (both from \mu) as in Algorithm 2, we can apply variance-aware pessimistic value iteration to learn a near-optimal policy as in VAPVI (Yin et al., 2022). To ensure differential privacy, we add independent Gaussian noise to the 5H statistics as in DP-VAPVI (Algorithm 2) below. Since there are 5H statistics, by the adaptive composition of zCDP (Lemma E.17), it suffices to keep each statistic \rho_{0}-zCDP, where \rho_{0}=\frac{\rho}{5H}. In DP-VAPVI, we use \phi_{1},\phi_{2},\phi_{3},K_{1},K_{2} to denote the noise we add. (We need to add noise to each of the 5H statistics; for \phi_{1}, we actually sample H i.i.d. copies \phi_{1,h}, h=1,\cdots,H, from the distribution of \phi_{1} and add \phi_{1,h} to \sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2} for all h\in[H]; for simplicity, we use \phi_{1} to represent all the \phi_{1,h}. The procedure for the other 4H statistics is similar.) For each \phi_{i}, we directly apply the Gaussian mechanism. For each K_{i}, in addition to the noise matrix \frac{1}{\sqrt{2}}(Z+Z^{\top}), we also add \frac{E}{2}I_{d} to ensure that all K_{i} are positive definite with high probability (the detailed definitions of E and L can be found in Appendix A).

Algorithm 2 Differentially Private Variance-Aware Pessimistic Value Iteration (DP-VAPVI)
1:  Input: Dataset \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau,h=1}^{K,H} and \mathcal{D}^{\prime}=\left\{\left(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau},\bar{r}_{h}^{\tau},\bar{s}_{h+1}^{\tau}\right)\right\}_{\tau,h=1}^{K,H}. Budget for zCDP \rho. Failure probability \delta. Universal constant C.
2:  Initialization: Set \rho_{0}\leftarrow\frac{\rho}{5H}, \widetilde{V}_{H+1}(\cdot)\leftarrow 0. Sample \phi_{1}\sim\mathcal{N}\left(0,\frac{2H^{4}}{\rho_{0}}I_{d}\right), \phi_{2},\,\phi_{3}\sim\mathcal{N}\left(0,\frac{2H^{2}}{\rho_{0}}I_{d}\right), K_{1},K_{2}\leftarrow\frac{E}{2}I_{d}+\frac{1}{\sqrt{2}}(Z+Z^{\top}), where Z_{i,j}\sim\mathcal{N}\left(0,\frac{1}{4\rho_{0}}\right) (i.i.d.) and E=\widetilde{O}\left(\sqrt{\frac{Hd}{\rho}}\right). Set D\leftarrow\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right).
3:  for h=H,H-1,\ldots,1 do
4:    Set \widetilde{\Sigma}_{h}\leftarrow\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I+K_{1}
5:    Set \widetilde{\beta}_{h}\leftarrow\widetilde{\Sigma}_{h}^{-1}[\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}]
6:    Set \widetilde{\theta}_{h}\leftarrow\widetilde{\Sigma}_{h}^{-1}[\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})+\phi_{2}]
7:    Set \big{[}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}\big{]}(\cdot,\cdot)\leftarrow\big{\langle}\phi(\cdot,\cdot),\widetilde{\beta}_{h}\big{\rangle}_{\left[0,(H-h+1)^{2}\right]}-\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{\theta}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2}
8:    Set \widetilde{\sigma}_{h}(\cdot,\cdot)^{2}\leftarrow\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\}
9:    Set \widetilde{\Lambda}_{h}\leftarrow\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I+K_{2}
10:    Set \widetilde{w}_{h}\leftarrow\widetilde{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)
11:    Set \Gamma_{h}(\cdot,\cdot)\leftarrow C\sqrt{d}\cdot\left(\phi(\cdot,\cdot)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}+\frac{D}{K}
12:    Set \bar{Q}_{h}(\cdot,\cdot)\leftarrow\phi(\cdot,\cdot)^{\top}\widetilde{w}_{h}-\Gamma_{h}(\cdot,\cdot)
13:    Set \widehat{Q}_{h}(\cdot,\cdot)\leftarrow\min\left\{\bar{Q}_{h}(\cdot,\cdot),H-h+1\right\}^{+}
14:    Set \widehat{\pi}_{h}(\cdot\mid\cdot)\leftarrow\mathrm{argmax}_{\pi_{h}}\big{\langle}\widehat{Q}_{h}(\cdot,\cdot),\pi_{h}(\cdot\mid\cdot)\big{\rangle}_{\mathcal{A}}, \widetilde{V}_{h}(\cdot)\leftarrow\max_{\pi_{h}}\big{\langle}\widehat{Q}_{h}(\cdot,\cdot),\pi_{h}(\cdot\mid\cdot)\big{\rangle}_{\mathcal{A}}
15:  end for
16:  Output: \left\{\widehat{\pi}_{h}\right\}_{h=1}^{H}.

Below we describe the algorithmic design of DP-VAPVI (Algorithm 2). We divide the offline dataset into two independent parts of equal size, \mathcal{D}=\{(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau})\}^{h\in[H]}_{\tau\in[K]} and \mathcal{D}^{\prime}=\{(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau},\bar{r}_{h}^{\tau},\bar{s}_{h+1}^{\tau})\}^{h\in[H]}_{\tau\in[K]}: one for estimating variances and the other for calculating the Q-values.

Estimating the conditional variance. The first part (lines 4 to 8) estimates the conditional variance of \widetilde{V}_{h+1} via the identity [\mathrm{Var}_{h}\widetilde{V}_{h+1}](s,a)=[P_{h}(\widetilde{V}_{h+1})^{2}](s,a)-([P_{h}\widetilde{V}_{h+1}](s,a))^{2}. For the first term, by the definition of a linear MDP, it holds that \left[{P}_{h}\widetilde{V}_{h+1}^{2}\right](s,a)={\phi}(s,a)^{\top}\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right)=\langle\phi(s,a),\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right)\rangle. We can estimate \beta_{h}=\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right) by ridge regression. Below is the output of ridge regression with the raw (noiseless) statistics:

\underset{{\beta}\in\mathbb{R}^{d}}{\operatorname{argmin}}\sum_{k=1}^{K}\left[\left\langle{\phi}(\bar{s}^{k}_{h},\bar{a}^{k}_{h}),{\beta}\right\rangle-\widetilde{V}_{h+1}^{2}\left(\bar{s}_{h+1}^{k}\right)\right]^{2}+\lambda\|{\beta}\|_{2}^{2}={\bar{\Sigma}}_{h}^{-1}\sum_{k=1}^{K}{{\phi}}(\bar{s}^{k}_{h},\bar{a}^{k}_{h})\widetilde{V}_{h+1}^{2}\left(\bar{s}_{h+1}^{k}\right),

where the definition of \bar{\Sigma}_{h} can be found in Appendix A. Instead of using the raw statistics, we replace them with private ones perturbed by Gaussian noise as in line 5. The second term is estimated similarly in line 6. The final estimator is defined as in line 8: \widetilde{\sigma}_{h}(\cdot,\cdot)^{2}=\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\} (the \max\{1,\cdot\} operator is for technical reasons only: we want a lower bound on each variance estimate).
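A sketch of the private variance-estimation step (lines 4 to 8 of Algorithm 2), assuming the noise terms \phi_{1},\phi_{2},K_{1} have already been drawn as in the initialization; the array names are ours:

```python
import numpy as np

def private_variance_estimate(Phi_bar, V_next_bar, phi1, phi2, K1, lam, H, h):
    """Phi_bar: (K, d) features phi(s_bar_h^tau, a_bar_h^tau) from D';
    V_next_bar: (K,) values V_tilde_{h+1}(s_bar_{h+1}^tau); h is 1-indexed as in the paper.
    Returns a function sigma2(phi) giving max{1, Var_tilde_h V_tilde_{h+1}} at feature phi."""
    d = Phi_bar.shape[1]
    Sigma_tilde = Phi_bar.T @ Phi_bar + lam * np.eye(d) + K1                      # line 4
    beta_tilde = np.linalg.solve(Sigma_tilde, Phi_bar.T @ V_next_bar**2 + phi1)   # line 5
    theta_tilde = np.linalg.solve(Sigma_tilde, Phi_bar.T @ V_next_bar + phi2)     # line 6

    def sigma2(phi):
        second_moment = np.clip(phi @ beta_tilde, 0.0, (H - h + 1) ** 2)          # line 7
        first_moment = np.clip(phi @ theta_tilde, 0.0, H - h + 1)
        return max(1.0, second_moment - first_moment ** 2)                        # line 8
    return sigma2
```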

Variance-weighted LSVI. Instead of directly applying LSVI (Jin et al., 2021), we solve the variance-weighted LSVI (line 10). The result of variance-weighted LSVI with non-private statistics is shown below:

\underset{{w}\in\mathbb{R}^{d}}{\operatorname{argmin}}\;\lambda\|{w}\|_{2}^{2}+\sum_{k=1}^{K}\frac{\left[\langle{\phi}(s_{h}^{k},a_{h}^{k}),{w}\rangle-r_{h}^{k}-\widetilde{V}_{h+1}(s_{h+1}^{k})\right]^{2}}{\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})}=\widehat{\Lambda}_{h}^{-1}\sum_{k=1}^{K}\frac{\phi\left(s_{h}^{k},a_{h}^{k}\right)\cdot\left[r_{h}^{k}+\widetilde{V}_{h+1}\left(s_{h+1}^{k}\right)\right]}{\widetilde{\sigma}_{h}^{2}(s^{k}_{h},a^{k}_{h})},

where the definition of \widehat{\Lambda}_{h} can be found in Appendix A. For the sake of differential privacy, we use private statistics instead and derive \widetilde{w}_{h} as in line 10.
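Correspondingly, a sketch of the private variance-weighted LSVI step (lines 9 and 10 of Algorithm 2), again with our own array names:

```python
import numpy as np

def private_weighted_lsvi(Phi, rewards, V_next, sigma2, phi3, K2, lam):
    """Phi: (K, d) features phi(s_h^tau, a_h^tau) from D; rewards, V_next: (K,) arrays;
    sigma2: (K,) private variance estimates sigma_tilde_h^2(s_h^tau, a_h^tau).
    Returns (w_tilde_h, Lambda_tilde_h) as in lines 9-10 of Algorithm 2."""
    d = Phi.shape[1]
    W = Phi / sigma2[:, None]                              # each row divided by its variance
    Lambda_tilde = Phi.T @ W + lam * np.eye(d) + K2        # line 9
    target = W.T @ (rewards + V_next) + phi3               # sum_tau phi * (r + V) / sigma^2 + phi_3
    w_tilde = np.linalg.solve(Lambda_tilde, target)        # line 10
    return w_tilde, Lambda_tilde
```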

Our private pessimism. Note that if we remove all the Gaussian noise we add, our DP-VAPVI (Algorithm 2) degenerates to VAPVI (Yin et al., 2022). We design a similar pessimistic penalty using private statistics (line 11), with an additional \frac{D}{K} term accounting for the extra pessimism due to DP.

Main theorem. We state our main theorem about DP-VAPVI below; the proof sketch is deferred to Appendix D.1 and the detailed proof to Appendix D due to space limits. The quantities \mathcal{M}_{i},L,E are defined in Appendix A; briefly, L=\widetilde{O}(\sqrt{H^{3}d/\rho}) and E=\widetilde{O}(\sqrt{Hd/\rho}). Regarding the sample complexity requirement, within the practical regime where the privacy budget is not very small, \max\{\mathcal{M}_{i}\} is dominated by \max\{\widetilde{O}(H^{12}d^{3}/\kappa^{5}),\widetilde{O}(H^{14}d/\kappa^{5})\}, which also appears in the sample complexity requirement of VAPVI (Yin et al., 2022). The \sigma_{V}^{2}(s,a) in Theorem 4.1 is defined as \max\{1,\mathrm{Var}_{P_{h}}(V)(s,a)\} for any V.

Theorem 4.1.

DP-VAPVI (Algorithm 2) satisfies \rho-zCDP. Furthermore, let K be the number of episodes. Under the condition that K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\} and \sqrt{d}>\xi, where \xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|, for any 0<\lambda<\kappa, with probability 1-\delta, for all policies \pi simultaneously, the output \widehat{\pi} of DP-VAPVI satisfies

v^{\pi}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{DH}{K}, (9)

where \Lambda_{h}=\sum_{k=1}^{K}\frac{\phi(s_{h}^{k},a_{h}^{k})\cdot\phi(s_{h}^{k},a_{h}^{k})^{\top}}{\sigma^{2}_{\widetilde{V}_{h+1}}(s_{h}^{k},a_{h}^{k})}+\lambda I_{d}, D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and \widetilde{O} hides constants and polylog terms.
In particular, defining \Lambda^{\star}_{h}=\sum_{k=1}^{K}\frac{\phi(s_{h}^{k},a_{h}^{k})\cdot\phi(s_{h}^{k},a_{h}^{k})^{\top}}{\sigma^{2}_{{V}^{\star}_{h+1}}(s_{h}^{k},a_{h}^{k})}+\lambda I_{d}, we have with probability 1-\delta,

v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{\star-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{DH}{K}. (10)

Comparison to the non-private counterpart VAPVI (Yin et al., 2022). Plugging in the definitions of L and E (Appendix A), in the meaningful regime where the privacy budget is not very large, DH is dominated by \widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\sqrt{\rho}}\right). According to Theorem 3.2 in (Yin et al., 2022), the sub-optimality bound of VAPVI states that for sufficiently large K, with high probability, the output \widehat{\pi} satisfies:

v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{\star-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{2H^{4}\sqrt{d}}{K}. (11)

Compared to our Theorem 4.1, the additional sub-optimality due to differential privacy is \widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\sqrt{\rho}\cdot K}\right)=\widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\epsilon\cdot K}\right) (here we apply the second part of Lemma 2.7 to achieve (\epsilon,\delta)-DP; the notation \widetilde{O} also absorbs \log\frac{1}{\delta}, where only here \delta denotes the privacy budget instead of the failure probability). In the most common regime where the privacy budget \rho or \epsilon is a constant, the additional term due to differential privacy again appears as a lower order term.

Instance-dependent sub-optimality bound. Similar to DP-APVI (Algorithm 1), our DP-VAPVI (Algorithm 2) also enjoys an instance-dependent sub-optimality bound. First, the main term in (10) improves over PEVI (Jin et al., 2021) by a factor of O(\sqrt{d}) in the feature dependence. Also, our main term has no explicit dependence on H, thus improving the horizon dependence of PEVI's sub-optimality bound. For more detailed discussions of our main term, we refer readers to (Yin et al., 2022).

5 Tightness of our results

We believe our bounds for offline RL with DP are tight. To the best of our knowledge, APVI and VAPVI provide the tightest bounds under tabular MDPs and linear MDPs, respectively. The sub-optimality bounds of our algorithms match these two in the main term, with some additional lower order terms. The leading terms are known to match multiple information-theoretic lower bounds for offline RL simultaneously (as illustrated in Yin and Wang (2021b); Yin et al. (2022)); for this reason our bounds cannot be improved in general. For the lower order terms, the dependence on the sample size n and the privacy budget \epsilon, namely \widetilde{O}(\frac{1}{n\epsilon}), is optimal, since policy learning is a special case of ERM and such dependence is optimal for DP-ERM. In addition, we believe the dependence on the other parameters (H,S,A,d) in the lower order terms is tight thanks to techniques such as (3) and Lemma D.6.

6 Simulations

In this section, we carry out simulations to evaluate the performance of our DP-VAPVI (Algorithm 2) and compare it with its non-private counterpart VAPVI (Yin et al., 2022) and another pessimism-based algorithm, PEVI (Jin et al., 2021), which does not have a privacy guarantee.

Experimental setting. We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in (Min et al., 2021; Yin et al., 2022) with some modifications: we keep the state space \mathcal{S}=\{1,2\}, action space \mathcal{A}=\{1,\cdots,100\} and feature map of state-action pairs, while we choose a stochastic transition (instead of the original deterministic transition) and a more complex reward. For details of the linear MDP setting, please refer to Appendix F. The two MDP instances we use both have horizon H=20. We compare the different algorithms in Figure 1(a), while in Figure 1(b) we compare our DP-VAPVI under different privacy budgets. In the empirical evaluation, we do not split the data for DP-VAPVI or VAPVI, and for DP-VAPVI we run the simulation five times and report the average performance.

Figure 1: Comparison between the performance of PEVI, VAPVI and DP-VAPVI (with different privacy budgets) under the linear MDP example described above. Panel (a) compares the different algorithms (H=20); panel (b) compares DP-VAPVI under different privacy budgets (H=20). In each panel, the y-axis represents the sub-optimality gap v^{\star}-v^{\widehat{\pi}} while the x-axis denotes the number of episodes K. The horizon is fixed to H=20 and the number of episodes ranges from 5 to 1000.

Results and discussion. From Figure 1, we observe that DP-VAPVI (Algorithm 2) performs slightly worse than its non-private version VAPVI (Yin et al., 2022), which is expected since we add Gaussian noise to each statistic. However, as the dataset grows, the performance of DP-VAPVI converges to that of VAPVI, which supports our theoretical conclusion that the cost of privacy only appears in lower order terms. For DP-VAPVI with a larger privacy budget, the scale of the noise is smaller and the performance is closer to VAPVI, as shown in Figure 1(b). Furthermore, in most cases DP-VAPVI still outperforms PEVI, which does not have a privacy guarantee; this arises from our privatization of variance-aware LSVI instead of plain LSVI.

7 Conclusion and future works

In this work, we take the first steps towards the well-motivated task of designing private offline RL algorithms. We propose algorithms for both tabular MDPs and linear MDPs, and show that they enjoy instance-dependent sub-optimality bounds while guaranteeing differential privacy (either zCDP or pure DP). Our results highlight that the cost of privacy only appears in lower order terms and thus becomes negligible as the number of samples grows large.

Future extensions are numerous. We believe the techniques in our algorithms (privatization of the Bernstein-type pessimism and of variance-aware LSVI) and the corresponding analysis can also be used in online settings to obtain tighter regret bounds for private algorithms. For offline RL, we plan to consider more general function approximation and differentially private (deep) offline RL, which would bridge the gap between theory and practice in offline RL applications. Many techniques we developed could be adapted to these more general settings.

Acknowledgments

The research is partially supported by NSF Awards #2007117 and #2048091. The authors would like to thank Ming Yin for helpful discussions.

References

  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Agarwal and Singh [2017] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, pages 32–40. PMLR, 2017.
  • Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Balle et al. [2016] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016.
  • Basu et al. [2019] Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019.
  • Brown et al. [2021] Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
  • Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
  • Cai et al. [2020] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
  • Carlini et al. [2019] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
  • Chen et al. [2020] Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, and Liwei Wang. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, pages 1757–1767. PMLR, 2020.
  • Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
  • Chowdhury and Zhou [2021] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. arXiv preprint arXiv:2112.10599, 2021.
  • Chowdhury et al. [2021] Sayak Ray Chowdhury, Xingyu Zhou, and Ness Shroff. Adaptive control of differentially private linear quadratic systems. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 485–490. IEEE, 2021.
  • Cundy and Ermon [2020] Chris Cundy and Stefano Ermon. Privacy-constrained policies via mutual information regularized policy gradients. arXiv preprint arXiv:2012.15019, 2020.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dwork and Rothblum [2016] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • Ficken [2015] Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015.
  • Gajane et al. [2018] Pratik Gajane, Tanguy Urvoy, and Emilie Kaufmann. Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pages 387–412. PMLR, 2018.
  • Garcelon et al. [2021] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
  • Guha Thakurta and Smith [2013] Abhradeep Guha Thakurta and Adam Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013.
  • Hu et al. [2021] Bingshan Hu, Zhiming Huang, and Nishant A Mehta. Optimal algorithms for private online learning in a stochastic environment. arXiv preprint arXiv:2102.07929, 2021.
  • Jin et al. [2020a] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
  • Jin et al. [2020b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020b.
  • Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
  • Kim et al. [2017] Hyeoneui Kim, Elizabeth Bell, Jihoon Kim, Amy Sitapati, Joe Ramsdell, Claudiu Farcas, Dexter Friedman, Stephanie Feudjio Feupe, and Lucila Ohno-Machado. iconcur: informed consent for clinical data and bio-sample use for research. Journal of the American Medical Informatics Association, 24(2):380–387, 2017.
  • Lebensold et al. [2019] Jonathan Lebensold, William Hamilton, Borja Balle, and Doina Precup. Actor critic with differentially private critic. arXiv preprint arXiv:1910.05876, 2019.
  • Liao et al. [2021] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes. arXiv preprint arXiv:2110.10133, 2021.
  • Liu et al. [2019] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Uncertainty in Artificial Intelligence, 2019.
  • Luyo et al. [2021] Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private exploration in reinforcement learning with linear representation. arXiv preprint arXiv:2112.01585, 2021.
  • Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. Conference on Learning Theory, 2009.
  • Min et al. [2021] Yifei Min, Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34, 2021.
  • Nemhauser and Wolsey [1988] George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146–181, 1988.
  • Ngo et al. [2022] Dung Daniel Ngo, Giuseppe Vietri, and Zhiwei Steven Wu. Improved regret for differentially private exploration in linear mdp. arXiv preprint arXiv:2202.01292, 2022.
  • Ono and Takahashi [2020] Hajime Ono and Tsubasa Takahashi. Locally private distributed reinforcement learning. arXiv preprint arXiv:2001.11718, 2020.
  • Qiao and Wang [2022] Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. arXiv preprint arXiv:2212.04680, 2022.
  • Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In Proceedings of the 39th International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017.
  • Redberg and Wang [2021] Rachel Redberg and Yu-Xiang Wang. Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34, 2021.
  • Sallab et al. [2017] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
  • Shariff and Sheffet [2018] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.
  • Sridharan [2002] Karthik Sridharan. A gentle introduction to concentration inequalities. Dept. Comput. Sci., Cornell Univ., Tech. Rep, 2002.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tossou and Dimitrakakis [2017] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • Vietri et al. [2020] Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with PAC and regret guarantees. In International Conference on Machine Learning, pages 9754–9764. PMLR, 2020.
  • Wang and Hegde [2019] Baoxiang Wang and Nidhi Hegde. Privacy-preserving q-learning with functional noise in continuous spaces. Advances in Neural Information Processing Systems, 32, 2019.
  • Wang et al. [2021] Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline RL with linear function approximation? International Conference on Learning Representations, 2021.
  • Xie et al. [2019] Tengyang Xie, Philip S Thomas, and Gerome Miklau. Privacy preserving off-policy evaluation. arXiv preprint arXiv:1902.00174, 2019.
  • Xie et al. [2021a] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 2021a.
  • Xie et al. [2021b] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 2021b.
  • Yin and Wang [2020] Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
  • Yin and Wang [2021a] Ming Yin and Yu-Xiang Wang. Optimal uniform ope and model-based offline reinforcement learning in time-homogeneous, reward-free and task-agnostic settings. Advances in neural information processing systems, 2021a.
  • Yin and Wang [2021b] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021b.
  • Yin et al. [2021] Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567–1575. PMLR, 2021.
  • Yin et al. [2022] Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
  • Zanette [2021] Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In International Conference on Machine Learning, pages 12287–12297. PMLR, 2021.
  • Zanette et al. [2021] Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 2021.
  • Zhang et al. [2020] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2):25–40, 2020.
  • Zheng et al. [2020] Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310, 2020.
  • Zhou et al. [2021] Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
  • Zhou [2022] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. arXiv preprint arXiv:2201.07052, 2022.

Appendix A Notation List

A.1 Notations for tabular MDP

EρE_{\rho} 4Hlog4HS2Aδρ4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}
nn The original counts of visitation
nn^{\prime} The noisy counts, as defined in (2)
n~\widetilde{n} Final choice of private counts, as defined in (3)
P~\widetilde{P} Private estimate of transition kernel, as defined in (5)
P^\widehat{P} Non-private estimate of transition kernel, as defined in (15)
ι\iota logHSAδ\log\frac{HSA}{\delta}
ρ\rho Budget for zCDP
δ\delta Failure probability

A.2 Notations for linear MDP

LL 2H5Hdlog(10Hdδ)ρ2H\sqrt{\frac{5Hd\log(\frac{10Hd}{\delta})}{\rho}}
EE 10Hdρ(2+(log(5c1H/δ)c2d)23)\sqrt{\frac{10Hd}{\rho}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right)
DD O~(H2Lκ+H4Edκ3/2+H3d)\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right)
Λ^h\widehat{\Lambda}_{h} k=1Kϕ(shk,ahk)ϕ(shk,ahk)/σ~h2(shk,ahk)+λId\sum_{k=1}^{K}\phi(s_{h}^{k},a_{h}^{k})\phi(s_{h}^{k},a_{h}^{k})^{\top}/\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})+\lambda I_{d}
Λ~h\widetilde{\Lambda}_{h} k=1Kϕ(shk,ahk)ϕ(shk,ahk)/σ~h2(shk,ahk)+λId+K2\sum_{k=1}^{K}\phi(s_{h}^{k},a_{h}^{k})\phi(s_{h}^{k},a_{h}^{k})^{\top}/\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})+\lambda I_{d}+K_{2}
Λ~hp\widetilde{\Lambda}^{p}_{h} 𝔼μ,h[σ~h2(s,a)ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}[\widetilde{\sigma}_{h}^{-2}(s,a)\phi(s,a)\phi(s,a)^{\top}]
Λh{\Lambda}_{h} τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I
Λhp\Lambda_{h}^{p} 𝔼μ,h[σV~h+12(s,a)ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}[\sigma^{-2}_{\widetilde{V}_{h+1}}(s,a)\phi(s,a)\phi(s,a)^{\top}]
Λh{\Lambda}^{\star}_{h} τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σVh+12(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{{V}^{\star}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I
Σ¯h\bar{\Sigma}_{h} τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λId\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I_{d}
Σ~h\widetilde{\Sigma}_{h} τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λId+K1\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I_{d}+K_{1}
Σhp\Sigma^{p}_{h} 𝔼μ,h[ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}\left[\phi(s,a)\phi(s,a)^{\top}\right]
κ\kappa minhλmin(Σhp)\min_{h}\lambda_{\mathrm{min}}(\Sigma^{p}_{h})
σV2(s,a)\sigma_{V}^{2}(s,a) max{1,VarPh(V)(s,a)}\max\{1,\mathrm{Var}_{P_{h}}(V)(s,a)\} for any VV
σh2(s,a)\sigma^{\star 2}_{h}(s,a) max{1,VarPhVh+1(s,a)}\max\left\{1,{\mathrm{Var}}_{P_{h}}{V}^{\star}_{h+1}(s,a)\right\}
σ~h2(s,a)\widetilde{\sigma}_{h}^{2}(s,a) max{1,Var~hV~h+1(s,a)}\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(s,a)\}
1\mathcal{M}_{1} max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2,2Ldκ}\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}
2\mathcal{M}_{2} max{O~(H12d3/κ5),O~(H14d/κ5)}\max\{\widetilde{O}(H^{12}d^{3}/\kappa^{5}),\widetilde{O}(H^{14}d/\kappa^{5})\}
3\mathcal{M}_{3} max{512H4log(2dHδ)κ2,4λH2κ}\max\left\{\frac{512H^{4}\log\left(\frac{2dH}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}
4\mathcal{M}_{4} max{H2L2dκ,H6E2κ2,H4κ}\max\{\frac{H^{2}L^{2}}{d\kappa},\frac{H^{6}E^{2}}{\kappa^{2}},H^{4}\kappa\}
ρ\rho Budget for zCDP
δ\delta Failure probability (not the δ\delta of (ϵ,δ)(\epsilon,\delta)-DP)
ξ\xi supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|

Appendix B Extended related work

Online reinforcement learning under JDP or LDP. For online RL, several recent works analyze this setting under Joint Differential Privacy (JDP), which requires the RL agent to minimize regret while handling users’ raw data privately. Under tabular MDP, Vietri et al. [2020] design PUCB by revising UBEV [Dann et al., 2017], and Private-UCB-VI [Chowdhury and Zhou, 2021] results from UCBVI (with bonus-1) [Azar et al., 2017]. However, both works privatize a Hoeffding-type bonus, which leads to sub-optimal regret bounds. Under linear MDP, Private LSVI-UCB [Ngo et al., 2022] and Privacy-Preserving LSVI-UCB [Luyo et al., 2021] are private versions of LSVI-UCB [Jin et al., 2020b], while LinOpt-VI-Reg [Zhou, 2022] and Privacy-Preserving UCRL-VTR [Luyo et al., 2021] generalize UCRL-VTR [Ayoub et al., 2020]. However, these works build on the LSVI technique [Jin et al., 2020b] (unweighted ridge regression), which does not ensure an optimal regret bound.

In addition to JDP, another common privacy guarantee for online RL is Local Differential Privacy (LDP). LDP is a stronger notion of DP since it requires that each user’s data be protected before the RL agent has access to it. Under LDP, Garcelon et al. [2021] establish a regret lower bound and design LDP-OBI, which has a matching regret upper bound. This result is generalized to the linear mixture setting by Liao et al. [2021]. Later, Luyo et al. [2021] provide a unified framework for analyzing JDP and LDP under the linear setting.

Some other differentially private learning algorithms. There are also works on differentially private online learning [Guha Thakurta and Smith, 2013, Agarwal and Singh, 2017, Hu et al., 2021] and on various bandit settings [Shariff and Sheffet, 2018, Gajane et al., 2018, Basu et al., 2019, Zheng et al., 2020, Chen et al., 2020, Tossou and Dimitrakakis, 2017]. For the reinforcement learning setting, Wang and Hegde [2019] propose privacy-preserving Q-learning to protect the reward information. Ono and Takahashi [2020] study distributed reinforcement learning under LDP. Lebensold et al. [2019] present an actor-critic algorithm with a differentially private critic. Cundy and Ermon [2020] tackle DP-RL under the policy gradient framework. Chowdhury et al. [2021] consider the adaptive control of differentially private linear quadratic (LQ) systems.

Offline reinforcement learning under tabular MDP. Under tabular MDP, several works achieve optimal sub-optimality or sample complexity bounds under different coverage assumptions. For the problem of off-policy evaluation (OPE), Yin and Wang [2020] use the Tabular-MIS estimator to achieve asymptotic efficiency. In addition, the idea of uniform OPE is used to achieve the optimal sample complexity O(H3/dmϵ2)O(H^{3}/d_{m}\epsilon^{2}) [Yin et al., 2021] for non-stationary MDPs and the optimal sample complexity O(H2/dmϵ2)O(H^{2}/d_{m}\epsilon^{2}) [Yin and Wang, 2021a] for stationary MDPs, where dmd_{m} is a lower bound on the state-action occupancy. This uniform convergence idea also supports several works on online exploration [Jin et al., 2020a, Qiao et al., 2022]. For offline RL under the single concentrability assumption, Xie et al. [2021b] arrive at the optimal sample complexity O(H3SC/ϵ2)O(H^{3}SC^{\star}/\epsilon^{2}). Recently, Yin and Wang [2021b] propose APVI, which yields an instance-dependent sub-optimality bound that subsumes previous optimal results under several assumptions.

Offline reinforcement learning under linear MDP. Recently, many works study offline RL with linear representations. Jin et al. [2021] present PEVI, which applies the idea of pessimistic value iteration (originating from [Jin et al., 2020b]) and is provably efficient for offline RL under linear MDP. Yin et al. [2022] improve the sub-optimality bound in [Jin et al., 2021] by replacing LSVI with variance-weighted LSVI. Xie et al. [2021a] consider Bellman-consistent pessimism for general function approximation, and their result improves the sample complexity in [Jin et al., 2021] by a factor of order O(d)O(d) (shown in Theorem 3.2); however, there is no improvement in the horizon dependence. Zanette et al. [2021] propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle. Besides, Wang et al. [2021], Zanette [2021] study the statistical hardness of offline RL with linear representations by presenting exponential lower bounds.

Appendix C Proof of Theorem 3.4

C.1 Proof sketch

Since the proof of the privacy guarantee is short, we present it in full in Section C.2 below and only sketch the proof of the sub-optimality bound here.

First, we bound the scale of the noise we add and show that the n~\widetilde{n} derived from (3) are close to the true visitation counts. Consequently, denoting the non-private empirical transition kernel by P^\widehat{P} (defined in (15)), we can show that P~P^1\|\widetilde{P}-\widehat{P}\|_{1} and |VarP~(V)VarP^(V)||\sqrt{\mathrm{Var}_{\widetilde{P}}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}}(V)}| are small.

Next, owing to the conditional independence of V~h+1\widetilde{V}_{h+1} and P~h\widetilde{P}_{h}, we apply the empirical Bernstein inequality to get |(P~hPh)V~h+1|VarP~(V~h+1)/n~sh,ah+SHEρ/n~sh,ah|(\widetilde{P}_{h}-P_{h})\widetilde{V}_{h+1}|\lesssim\sqrt{\mathrm{Var}_{\widetilde{P}}(\widetilde{V}_{h+1})/\widetilde{n}_{s_{h},a_{h}}}+SHE_{\rho}/\widetilde{n}_{s_{h},a_{h}}. Together with our definition of the private pessimistic penalty and the key extended value difference lemma (Lemmas E.7 and E.8), we can bound the sub-optimality of our output policy π^\widehat{\pi} by:

vvπ^h=1H(sh,ah)𝒞hdhπ(sh,ah)VarP~h(|sh,ah)(V~h+1())n~sh,ah+SHEρ/n~sh,ah.v^{\star}-v^{\widehat{\pi}}\lesssim\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+SHE_{\rho}/\widetilde{n}_{s_{h},a_{h}}. (12)

Finally, we further bound the above sub-optimality by replacing the private statistics with non-private ones. Specifically, we replace n~\widetilde{n} by nn, P~\widetilde{P} by PP, and V~\widetilde{V} by VV^{\star}. Due to (12), we have V~V1nd¯m\|\widetilde{V}-V^{\star}\|_{\infty}\lesssim\sqrt{\frac{1}{n\bar{d}_{m}}}. Together with the upper bounds on P~P^1\|\widetilde{P}-\widehat{P}\|_{1} and |VarP~(V)VarP^(V)||\sqrt{\mathrm{Var}_{\widetilde{P}}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}}(V)}|, we have

VarP~h(|sh,ah)(V~h+1())n~sh,ahVarP~h(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarP^h(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarPh(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarPh(|sh,ah)(Vh+1())ndhμ(sh,ah)+1nd¯m.\begin{split}&\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}\lesssim\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\\ \lesssim&\sqrt{\frac{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\lesssim\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\\ \lesssim&\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{1}{n\bar{d}_{m}}.\end{split} (13)

The final bound in terms of non-private statistics then follows by combining (12) and (13).

C.2 Proof of the privacy guarantee

The privacy guarantee of DP-APVI (Algorithm 1) is summarized by Lemma C.1 below.

Lemma C.1 (Privacy analysis of DP-APVI (Algorithm 1)).

DP-APVI (Algorithm 1) satisfies ρ\rho-zCDP.

Proof of Lemma C.1.

The 2\ell_{2} sensitivity of {nsh,ah}\{n_{s_{h},a_{h}}\} is 2H\sqrt{2H}. According to Lemma 2.7, the Gaussian Mechanism used on {nsh,ah}\{n_{s_{h},a_{h}}\} with σ2=2Hρ\sigma^{2}=\frac{2H}{\rho} satisfies ρ2\frac{\rho}{2}-zCDP. Similarly, the Gaussian Mechanism used on {nsh,ah,sh+1}\{n_{s_{h},a_{h},s_{h+1}}\} with σ2=2Hρ\sigma^{2}=\frac{2H}{\rho} also satisfies ρ2\frac{\rho}{2}-zCDP. Combining these two results, due to the composition of zCDP (Lemma E.16), the construction of {n}\{n^{\prime}\} satisfies ρ\rho-zCDP. Finally, DP-APVI satisfies ρ\rho-zCDP because the output π^\widehat{\pi} is a post-processing of {n}\{n^{\prime}\}. ∎
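To make the noise-injection step concrete, the following Python code is a minimal sketch (not the authors' implementation; the function name and array shapes are our own assumptions) of the Gaussian Mechanism used in the proof above: each family of counts receives noise with variance 2H/ρ, so that each family satisfies (ρ/2)-zCDP.

import numpy as np

def privatize_counts(counts, H, rho, rng=None):
    """Add Gaussian noise with variance 2H/rho to every visitation count.

    counts : array of shape (H, S, A) or (H, S, A, S) holding n_{s_h,a_h}
             or n_{s_h,a_h,s_{h+1}} (the shapes are an assumption for illustration).
    Returns the noisy counts n' used before the post-processing step (3).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * H / rho)   # Gaussian Mechanism: sigma^2 = 2H / rho
    return counts + rng.normal(scale=sigma, size=counts.shape)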

C.3 Proof of the sub-optimality bound

C.3.1 Utility analysis

First, the following Lemma C.2 gives a high-probability bound on |nn||n^{\prime}-n|.

Lemma C.2.

Let Eρ=22σlog4HS2Aδ=4Hlog4HS2AδρE_{\rho}=2\sqrt{2}\sigma\sqrt{\log{\frac{4HS^{2}A}{\delta}}}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}, then with probability 1δ1-\delta, for all sh,ah,sh+1s_{h},a_{h},s_{h+1}, it holds that

|nsh,ahnsh,ah|Eρ2,|nsh,ah,sh+1nsh,ah,sh+1|Eρ2.|n^{\prime}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq\frac{E_{\rho}}{2},\,\,|n^{\prime}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq\frac{E_{\rho}}{2}. (14)
Proof of Lemma C.2.

The inequalities follow directly from Gaussian concentration and a union bound. ∎

Building on the utility analysis above, the following Lemma C.3 gives a high-probability bound on |n~n||\widetilde{n}-n|.

Lemma C.3.

Under the high probability event in Lemma C.2, for all sh,ah,sh+1s_{h},a_{h},s_{h+1}, it holds that

|n~sh,ahnsh,ah|Eρ,|n~sh,ah,sh+1nsh,ah,sh+1|Eρ.|\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq E_{\rho},\,\,|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq E_{\rho}.
Proof of Lemma C.3.

When the event in Lemma C.2 holds, the original counts {nsh,ah,s}s𝒮\{n_{s_{h},a_{h},s^{\prime}}\}_{s^{\prime}\in\mathcal{S}} form a feasible solution to the optimization problem, which means that

maxs|n~sh,ah,snsh,ah,s|maxs|nsh,ah,snsh,ah,s|Eρ2.\max_{s^{\prime}}|\widetilde{n}_{s_{h},a_{h},s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq\max_{s^{\prime}}|n_{s_{h},a_{h},s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq\frac{E_{\rho}}{2}.

Due to the second part of (14), it holds that for any sh,ah,sh+1s_{h},a_{h},s_{h+1},

|n~sh,ah,sh+1nsh,ah,sh+1||n~sh,ah,sh+1nsh,ah,sh+1|+|nsh,ah,sh+1nsh,ah,sh+1|Eρ.|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n^{\prime}_{s_{h},a_{h},s_{h+1}}|+|n^{\prime}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq E_{\rho}.

For the second part, because of the constraints in the optimization problem, it holds that

|n~sh,ahnsh,ah|Eρ2.|\widetilde{n}_{s_{h},a_{h}}-n^{\prime}_{s_{h},a_{h}}|\leq\frac{E_{\rho}}{2}.

Due to the first part of (14), it holds that for any sh,ahs_{h},a_{h},

|n~sh,ahnsh,ah||n~sh,ahnsh,ah|+|nsh,ahnsh,ah|Eρ.|\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq|\widetilde{n}_{s_{h},a_{h}}-n^{\prime}_{s_{h},a_{h}}|+|n^{\prime}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq E_{\rho}.

Let the non-private empirical estimate be:

P^h(s|sh,ah)=nsh,ah,snsh,ah,\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{n_{s_{h},a_{h},s^{\prime}}}{n_{s_{h},a_{h}}}, (15)

if nsh,ah>0n_{s_{h},a_{h}}>0 and P^h(s|sh,ah)=1S\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{1}{S} otherwise. We will show that the private transition kernel P~\widetilde{P} is close to P^\widehat{P} via Lemma C.4 and Lemma C.5 below.
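As a small illustration of (15), the sketch below computes the non-private empirical kernel from the raw counts, with the uniform 1/S fallback for unvisited pairs; the analogous private estimate in (5) uses the private counts instead (we do not repeat the exact form of (5) here, so this helper only illustrates (15)).

import numpy as np

def empirical_kernel(n_sas, n_sa):
    """Non-private estimate (15): P_hat(s'|s,a) = n_{s,a,s'} / n_{s,a},
    and the uniform distribution 1/S when n_{s,a} = 0.

    n_sas : array (S, A, S) of transition counts at a fixed step h.
    n_sa  : array (S, A) of visitation counts at the same step.
    """
    S = n_sas.shape[-1]
    P_hat = np.full_like(n_sas, 1.0 / S, dtype=float)
    visited = n_sa > 0
    P_hat[visited] = n_sas[visited] / n_sa[visited][:, None]
    return P_hat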

Lemma C.4.

Under the high probability event of Lemma C.3, for sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, it holds that

P~h(|sh,ah)P^h(|sh,ah)15SEρn~sh,ah.\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}\leq\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}. (16)
Proof of Lemma C.4.

If n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} and the conclusion in Lemma C.3 hold, we have

P~h(|sh,ah)P^h(|sh,ah)1s𝒮|P~h(s|sh,ah)P^h(s|sh,ah)|\displaystyle\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}\leq\sum_{s^{\prime}\in\mathcal{S}}\left|\widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})-\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})\right| (17)
\displaystyle\leq s𝒮(n~sh,ah,s+Eρn~sh,ahEρn~sh,ah,sn~sh,ah)\displaystyle\sum_{s^{\prime}\in\mathcal{S}}\left(\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\right)
\displaystyle\leq s𝒮[(1n~sh,ah+2Eρn~sh,ah2)(n~sh,ah,s+Eρ)n~sh,ah,sn~sh,ah]\displaystyle\sum_{s^{\prime}\in\mathcal{S}}\left[\left(\frac{1}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}^{2}}\right)\left(\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}\right)-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\right]
\displaystyle\leq SEρn~sh,ah+2Eρn~sh,ah+2SEρ2n~sh,ah2\displaystyle\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2SE_{\rho}^{2}}{\widetilde{n}_{s_{h},a_{h}}^{2}}
\displaystyle\leq 5SEρn~sh,ah.\displaystyle\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}.

The second inequality is because n~sh,ah,sEρn~sh,ah+Eρnsh,ah,snsh,ahn~sh,ah,s+Eρn~sh,ahEρ\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}-E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}+E_{\rho}}\leq\frac{n_{s_{h},a_{h},s^{\prime}}}{n_{s_{h},a_{h}}}\leq\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}} and n~sh,ah,s+Eρn~sh,ahEρn~sh,ah,sn~sh,ahn~sh,ah,sn~sh,ahn~sh,ah,sEρn~sh,ah+Eρ\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\geq\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}-E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}+E_{\rho}}. The third inequality is because of Lemma E.6. The last inequality is because n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}. ∎

Lemma C.5.

Let VSV\in\mathbb{R}^{S} be any function with VH\|V\|_{\infty}\leq H, under the high probability event of Lemma C.3, for sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, it holds that

|VarP^h(|sh,ah)(V)VarP~h(|sh,ah)(V)|4HSEρn~sh,ah.\left|\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V)}-\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V)}\right|\leq 4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}. (18)
Proof of Lemma C.5.

For sh,ahs_{h},a_{h} such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, we use P~()\widetilde{P}(\cdot) and P^()\widehat{P}(\cdot) instead of P~h(|sh,ah)\widetilde{P}_{h}(\cdot|s_{h},a_{h}) and P^h(|sh,ah)\widehat{P}_{h}(\cdot|s_{h},a_{h}) for simplicity. Because of Lemma C.4, we have

P~()P^()15SEρn~sh,ah.\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}\leq\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}.

Therefore, it holds that

|VarP^()(V)VarP~()(V)||VarP^()(V)VarP~()(V)|\displaystyle\left|\sqrt{\mathrm{Var}_{\widehat{P}(\cdot)}(V)}-\sqrt{\mathrm{Var}_{\widetilde{P}(\cdot)}(V)}\right|\leq\sqrt{|\mathrm{Var}_{\widehat{P}(\cdot)}(V)-\mathrm{Var}_{\widetilde{P}(\cdot)}(V)|} (19)
\displaystyle\leq s𝒮|P^(s)P~(s)|V(s)2+|s𝒮[P^(s)+P~(s)]V(s)|s𝒮|P^(s)P~(s)|V(s)\displaystyle\sqrt{\sum_{s^{\prime}\in\mathcal{S}}\left|\widehat{P}(s^{\prime})-\widetilde{P}(s^{\prime})\right|V(s^{\prime})^{2}+\left|\sum_{s^{\prime}\in\mathcal{S}}\left[\widehat{P}(s^{\prime})+\widetilde{P}(s^{\prime})\right]V(s^{\prime})\right|\cdot\sum_{s^{\prime}\in\mathcal{S}}\left|\widehat{P}(s^{\prime})-\widetilde{P}(s^{\prime})\right|V(s^{\prime})}
\displaystyle\leq H2P~()P^()1+2H2P~()P^()1\displaystyle\sqrt{H^{2}\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}+2H^{2}\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}}
\displaystyle\leq 4HSEρn~sh,ah.\displaystyle 4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}.

The second inequality is due to the definition of variance. ∎
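The chain of inequalities in (19) can be sanity-checked numerically. The short sketch below is our own illustration (not part of the paper): it draws random distributions and bounded values and verifies that the gap between the square-root variances is dominated by the intermediate bound sqrt(3 H^2 ||P_hat - P_tilde||_1) used above.

import numpy as np

rng = np.random.default_rng(0)
S, H = 20, 10
for _ in range(1000):
    P_hat = rng.dirichlet(np.ones(S))
    P_tld = rng.dirichlet(np.ones(S))
    V = rng.uniform(0, H, size=S)
    var_hat = P_hat @ V**2 - (P_hat @ V) ** 2
    var_tld = P_tld @ V**2 - (P_tld @ V) ** 2
    gap = abs(np.sqrt(var_hat) - np.sqrt(var_tld))
    bound = np.sqrt(3 * H**2 * np.abs(P_hat - P_tld).sum())
    assert gap <= bound + 1e-9   # |sqrt(a)-sqrt(b)| <= sqrt(|a-b|) <= sqrt(3H^2 ||dP||_1)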

C.3.2 Validity of our pessimistic penalty

Now we are ready to present the key lemma (Lemma C.6) below to justify our use of Γ\Gamma as the pessimistic penalty.

Lemma C.6.

Under the high probability event of Lemma C.3, with probability 1δ1-\delta, for any sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} (which implies nsh,ah>0n_{s_{h},a_{h}}>0), it holds that

|(P~hPh)V~h+1(sh,ah)|2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah,\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}, (20)

where V~\widetilde{V} is the private version of the estimated value function appearing in Algorithm 1 and ι=log(HSA/δ)\iota=\log(HSA/\delta).

Proof of Lemma C.6.
|(P~hPh)V~h+1(sh,ah)||(P~hP^h)V~h+1(sh,ah)|+|(P^hPh)V~h+1(sh,ah)|\displaystyle\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\left|(\widetilde{P}_{h}-\widehat{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right| (21)
\displaystyle\leq HP~h(|sh,ah)P^h(|sh,ah)1+|(P^hPh)V~h+1(sh,ah)|\displaystyle H\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|
\displaystyle\leq 5SHEρn~sh,ah+|(P^hPh)V~h+1(sh,ah)|,\displaystyle\frac{5SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|,

where the third inequality is due to Lemma C.4.

Next, recall that π^h+1\widehat{\pi}_{h+1} in Algorithm 1 is computed backwards and therefore only depends on the sample tuples from time h+1h+1 to HH. As a result, V~h+1=Q¯h+1,π^h+1\widetilde{V}_{h+1}=\langle\overline{Q}_{h+1},\widehat{\pi}_{h+1}\rangle also depends only on the sample tuples from time h+1h+1 to HH and on Gaussian noise that is independent of the offline dataset. On the other hand, by definition, P^h\widehat{P}_{h} only depends on the sample tuples from time hh to h+1h+1. Therefore V~h+1\widetilde{V}_{h+1} and P^h\widehat{P}_{h} are conditionally independent (this trick is also used in [Yin et al., 2021] and [Yin and Wang, 2021b]). By the empirical Bernstein inequality (Lemma E.4) and a union bound, with probability 1δ1-\delta, for all sh,ahs_{h},a_{h} such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho},

|(P^hPh)V~h+1(sh,ah)|2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+7Hι3nsh,ah.\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{7H\cdot\iota}{3n_{s_{h},a_{h}}}. (22)

Therefore, we have

|(P~hPh)V~h+1(sh,ah)|2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+7Hι3nsh,ah+5SHEρn~sh,ah\displaystyle\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{7H\cdot\iota}{3n_{s_{h},a_{h}}}+\frac{5SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}} (23)
\displaystyle\leq 2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+9SHEριn~sh,ah\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{9SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}
\displaystyle\leq 9SHEριn~sh,ah+2VarP~h(|sh,ah)(V~h+1())ιnsh,ah+42HSEριn~sh,ahnsh,ah\displaystyle\frac{9SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+4\sqrt{2}H\sqrt{\frac{SE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}\cdot n_{s_{h},a_{h}}}}
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιnsh,ah+16SHEριn~sh,ah\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah.\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}.

The second and fourth inequalities hold because when n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, we have nsh,ah2n~sh,ah3n_{s_{h},a_{h}}\geq\frac{2\widetilde{n}_{s_{h},a_{h}}}{3}; they also use the fact that we only care about the regime SEρ1SE_{\rho}\geq 1, which corresponds to ρ\rho not being very large. The third inequality is due to Lemma C.5. The last inequality is due to Lemma C.3. ∎
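Lemma C.6 is exactly the statement that the pessimistic penalty Γ used in Line 5 of Algorithm 1 dominates the Bellman error at well-visited pairs. As a hedged sketch (the array names and shapes are ours, and it assumes the penalty takes the form appearing on the right-hand side of (20)), the penalty can be computed as follows.

import numpy as np

def pessimistic_penalty(var_tilde, n_tilde, S, H, E_rho, iota):
    """Penalty matching the right-hand side of (20):
    sqrt(2 * Var_{P~}(V~) * iota / (n~ - E_rho)) + 16 * S * H * E_rho * iota / n~.

    var_tilde, n_tilde : arrays (S, A) with the private variance estimates and
    private counts at step h (assumed to satisfy n~ >= 3 * E_rho).
    """
    return (np.sqrt(2.0 * var_tilde * iota / (n_tilde - E_rho))
            + 16.0 * S * H * E_rho * iota / n_tilde)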

Note that the previous lemmas rely on the condition that n~\widetilde{n} is not too small (n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}). Below we state the multiplicative Chernoff bound (Lemma C.7 and Remark C.8) to show that, under the condition of Theorem 3.4, for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, n~sh,ah\widetilde{n}_{s_{h},a_{h}} is at least 3Eρ3E_{\rho} with high probability.

Lemma C.7 (Lemma B.1 in [Yin and Wang, 2021b]).

For any 0<δ<10<\delta<1, there exists an absolute constant c1c_{1} such that when the total number of episodes satisfies n>c11/d¯mlog(HSA/δ)n>c_{1}\cdot 1/\bar{d}_{m}\cdot\log(HSA/\delta), with probability 1δ1-\delta, h[H]\forall h\in[H]

nsh,ahndhμ(sh,ah)/2,(sh,ah)𝒞h.n_{s_{h},a_{h}}\geq n\cdot d^{\mu}_{h}(s_{h},a_{h})/2,\quad\forall\;(s_{h},a_{h})\in\mathcal{C}_{h}.

Furthermore, we denote

:={nsh,ahndhμ(sh,ah)/2,(sh,ah)𝒞h,h[H].}\mathcal{E}:=\{n_{s_{h},a_{h}}\geq n\cdot d^{\mu}_{h}(s_{h},a_{h})/2,\;\forall\;(s_{h},a_{h})\in\mathcal{C}_{h},\;h\in[H].\} (24)

then equivalently P()>1δP(\mathcal{E})>1-\delta.

In addition, we denote

:={nsh,ah32ndhμ(sh,ah),(sh,ah)𝒞h,h[H].}\mathcal{E}^{\prime}:=\{n_{s_{h},a_{h}}\leq\frac{3}{2}n\cdot d^{\mu}_{h}(s_{h},a_{h}),\;\forall\;(s_{h},a_{h})\in\mathcal{C}_{h},\;h\in[H].\} (25)

then similarly P()>1δP(\mathcal{E}^{\prime})>1-\delta.

Remark C.8.

According to Lemma C.7, for any failure probability δ\delta, there exists some constant c1>0c_{1}>0 such that when nc1Eριd¯mn\geq\frac{c_{1}E_{\rho}\cdot\iota}{\bar{d}_{m}}, with probability 1δ1-\delta, for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, nsh,ah4Eρn_{s_{h},a_{h}}\geq 4E_{\rho}. Therefore, under the condition of Theorem 3.4 and the high probability events in Lemma C.3 and Lemma C.7, it holds that for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} while for all (sh,ah)𝒞h(s_{h},a_{h})\notin\mathcal{C}_{h}, n~sh,ahEρ\widetilde{n}_{s_{h},a_{h}}\leq E_{\rho}.

Lemma C.9.

Define (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}_{h}V)(\cdot,\cdot):=r_{h}(\cdot,\cdot)+(P_{h}V)(\cdot,\cdot) for any VSV\in\mathbb{R}^{S}. Note π^\widehat{\pi}, Q¯h\overline{Q}_{h}, V~h\widetilde{V}_{h} are defined in Algorithm 1 and denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a). Then it holds that

V1π(s)V1π^(s)h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s].V_{1}^{\pi^{\star}}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]. (26)

Furthermore, the analogue of (26) holds for Vhπ(s)Vhπ^(s)V_{h}^{\pi^{\star}}(s)-V_{h}^{\widehat{\pi}}(s) for all h[H]h\in[H].

Proof of Lemma C.9.

Lemma C.9 is a direct corollary of Lemma E.8 with π=π\pi=\pi^{\star}, Q^h=Q¯h\widehat{Q}_{h}=\overline{Q}_{h}, V^h=V~h\widehat{V}_{h}=\widetilde{V}_{h} and π^=π^\widehat{\pi}=\widehat{\pi} from Algorithm 1: the result follows since, by the definition of π^\widehat{\pi} in Algorithm 1, Q¯h(sh,),πh(|sh)π^h(|sh)0\langle\overline{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\leq 0. The proof for Vhπ(s)Vhπ^(s)V_{h}^{\pi^{\star}}(s)-V_{h}^{\widehat{\pi}}(s) is identical. ∎

Next we prove the asymmetric bound for ξh\xi_{h}, which is the key to the proof.

Lemma C.10 (Private version of Lemma D.6 in [Yin and Wang, 2021b]).

Denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a), where V~h+1\widetilde{V}_{h+1} and Q¯h\overline{Q}_{h} are the quantities in Algorithm 1 and 𝒯h(V):=rh+PhV\mathcal{T}_{h}(V):=r_{h}+P_{h}\cdot V for any VSV\in\mathbb{R}^{S}. Then under the high probability events in Lemma C.3 and Lemma C.6, for any h,sh,ahh,s_{h},a_{h} such that n~sh,ah>3Eρ\widetilde{n}_{s_{h},a_{h}}>3E_{\rho}, we have

0\displaystyle 0\leq ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)Q¯h(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})
\displaystyle\leq 22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah,\displaystyle 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}},

where ι=log(HSA/δ)\iota=\log(HSA/\delta).

Proof of Lemma C.10.

The first inequality: We first prove ξh(sh,ah)0\xi_{h}(s_{h},a_{h})\geq 0 for all (sh,ah)(s_{h},a_{h}) such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}.

Indeed, if Q^hp(sh,ah)<0\widehat{Q}^{p}_{h}(s_{h},a_{h})<0, then Q¯h(sh,ah)=0\overline{Q}_{h}(s_{h},a_{h})=0. In this case, ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)0\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})\geq 0 (note V~h0\widetilde{V}_{h}\geq 0 by the definition). If Q^hp(sh,ah)0\widehat{Q}^{p}_{h}(s_{h},a_{h})\geq 0, then by definition Q¯h(sh,ah)=min{Q^hp(sh,ah),Hh+1}+Q^hp(sh,ah)\overline{Q}_{h}(s_{h},a_{h})=\min\{\widehat{Q}^{p}_{h}(s_{h},a_{h}),H-h+1\}^{+}\leq\widehat{Q}^{p}_{h}(s_{h},a_{h}) and this implies

ξh(sh,ah)(𝒯hV~h+1)(sh,ah)Q^hp(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})\geq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widehat{Q}^{p}_{h}(s_{h},a_{h})
=\displaystyle= (PhP~h)V~h+1(sh,ah)+Γh(sh,ah)\displaystyle(P_{h}-\widetilde{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
\displaystyle\geq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ16SHEριn~sh,ah+Γh(sh,ah)=0,\displaystyle-\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}-\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\Gamma_{h}(s_{h},a_{h})=0,

where the second inequality uses Lemma C.6, and the last equation uses Line 5 of Algorithm 1.

The second inequality: Then we prove ξh(sh,ah)22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah\xi_{h}(s_{h},a_{h})\leq 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}} for all (sh,ah)(s_{h},a_{h}) such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}.

First, by construction V~hHh+1\widetilde{V}_{h}\leq H-h+1 for all h[H]h\in[H], which implies

Q^hp=Q~hΓhQ~h=rh+(P~hV~h+1)1+(Hh)=Hh+1\widehat{Q}^{p}_{h}=\widetilde{Q}_{h}-\Gamma_{h}\leq\widetilde{Q}_{h}=r_{h}+(\widetilde{P}_{h}\cdot\widetilde{V}_{h+1})\leq 1+(H-h)=H-h+1

which is because rh1r_{h}\leq 1 and P~h\widetilde{P}_{h} is a probability distribution. Therefore, we have the equivalent definition

Q¯h:=min{Q^hp,Hh+1}+=max{Q^hp,0}Q^hp.\overline{Q}_{h}:=\min\{\widehat{Q}^{p}_{h},H-h+1\}^{+}=\max\{\widehat{Q}^{p}_{h},0\}\geq\widehat{Q}^{p}_{h}.

Then it holds that

ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)Q¯h(sh,ah)(𝒯hV~h+1)(sh,ah)Q^hp(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widehat{Q}^{p}_{h}(s_{h},a_{h})
=\displaystyle= (𝒯hV~h+1)(sh,ah)Q~h(sh,ah)+Γh(sh,ah)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widetilde{Q}_{h}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
=\displaystyle= (PhP~h)V~h+1(sh,ah)+Γh(sh,ah)\displaystyle(P_{h}-\widetilde{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah+Γh(sh,ah)\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\Gamma_{h}(s_{h},a_{h})
=\displaystyle= 22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah.\displaystyle 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}.

The proof is complete by combining the two parts. ∎

C.3.3 Reduction to augmented absorbing MDP

Before we prove the theorem, we construct an augmented absorbing MDP to bridge V~\widetilde{V} and VV^{\star}. According to Line 5 in Algorithm 1, locations with n~sh,ahEρ\widetilde{n}_{s_{h},a_{h}}\leq E_{\rho} are heavily penalized, with a penalty of order O~(H)\widetilde{O}(H). Therefore, under the high probability event in Remark C.8, we can prove by induction that dhπ^(sh,ah)>0d_{h}^{\widehat{\pi}}(s_{h},a_{h})>0 only if dhμ(sh,ah)>0d_{h}^{\mu}(s_{h},a_{h})>0, where π^\widehat{\pi} is the output of Algorithm 1. The conclusion holds for h=1h=1. Assume it holds for some h1h\geq 1 that dhπ^(sh,ah)>0d_{h}^{\widehat{\pi}}(s_{h},a_{h})>0 only if dhμ(sh,ah)>0d_{h}^{\mu}(s_{h},a_{h})>0; then for any sh+1𝒮s_{h+1}\in\mathcal{S} such that dh+1π^(sh+1)>0d_{h+1}^{\widehat{\pi}}(s_{h+1})>0, it holds that dh+1μ(sh+1)>0d_{h+1}^{\mu}(s_{h+1})>0, which leads to the conclusion that dh+1π^(sh+1,ah+1)>0d_{h+1}^{\widehat{\pi}}(s_{h+1},a_{h+1})>0 only if dh+1μ(sh+1,ah+1)>0d_{h+1}^{\mu}(s_{h+1},a_{h+1})>0. To summarize, we have

dhπ0(sh,ah)>0only ifdhμ(sh,ah)>0,π0{π,π^}.d_{h}^{\pi_{0}}(s_{h},a_{h})>0\,\,\text{only if}\,\,d_{h}^{\mu}(s_{h},a_{h})>0,\,\,\pi_{0}\in\{\pi^{\star},\widehat{\pi}\}. (27)

Let us define MM^{\dagger} by adding one absorbing state shs_{h}^{\dagger} for all h{2,,H}h\in\{2,\ldots,H\}, therefore the augmented state space 𝒮=𝒮{sh}\mathcal{S}^{\dagger}=\mathcal{S}\cup\{s^{\dagger}_{h}\} and the transition and reward is defined as follows: (recall 𝒞h:={(sh,ah):dhμ(sh,ah)>0}\mathcal{C}_{h}:=\{(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\})

Ph(sh,ah)={Ph(sh,ah)sh,ah𝒞h,δsh+1sh=sh or sh,ah𝒞h,rh(sh,ah)={rh(sh,ah)sh,ah𝒞h0sh=sh or sh,ah𝒞hP^{\dagger}_{h}(\cdot\mid s_{h},a_{h})=\left\{\begin{array}[]{ll}P_{h}(\cdot\mid s_{h},a_{h})&s_{h},a_{h}\in\mathcal{C}_{h},\\ \delta_{s^{\dagger}_{h+1}}&s_{h}=s_{h}^{\dagger}\text{ or }s_{h},a_{h}\notin\mathcal{C}_{h},\end{array}\;\;r^{\dagger}_{h}(s_{h},a_{h})=\left\{\begin{array}[]{ll}r_{h}(s_{h},a_{h})&s_{h},a_{h}\in\mathcal{C}_{h}\\ 0&s_{h}=s^{\dagger}_{h}\text{ or }s_{h},a_{h}\notin\mathcal{C}_{h}\end{array}\right.\right.

and we further define for any π\pi,

Vhπ(s)=𝔼π[t=hHrt|sh=s],vπ=𝔼π[t=1Hrt]h[H],V^{\dagger\pi}_{h}(s)=\mathbb{E}^{\dagger}_{\pi}\left[\sum_{t=h}^{H}r_{t}^{\dagger}\middle|s_{h}=s\right],v^{\dagger\pi}=\mathbb{E}^{\dagger}_{\pi}\left[\sum_{t=1}^{H}r_{t}^{\dagger}\right]\;\forall h\in[H], (28)

where 𝔼\mathbb{E}^{\dagger} means taking expectation under the absorbing MDP MM^{\dagger}.
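For intuition, the construction of MM^{\dagger} above can be written out directly: transitions and rewards are kept on the covered pairs in 𝒞h\mathcal{C}_{h} and redirected to the absorbing state otherwise. The sketch below is only an illustration of this definition (indexing the absorbing state as the extra index S is our own convention).

import numpy as np

def augment_absorbing(P, r, covered):
    """Build P_dagger, r_dagger from (P, r) of shapes (H, S, A, S) / (H, S, A).

    covered : boolean array (H, S, A), True iff (s_h, a_h) is in C_h.
    The absorbing state s_h^dagger is appended as index S; it self-loops with
    zero reward, and uncovered pairs transition to it deterministically.
    """
    H, S, A, _ = P.shape
    P_dag = np.zeros((H, S + 1, A, S + 1))
    r_dag = np.zeros((H, S + 1, A))
    P_dag[:, :S, :, :S] = np.where(covered[..., None], P, 0.0)
    P_dag[:, :S, :, S] = np.where(covered, 0.0, 1.0)   # uncovered -> absorbing
    P_dag[:, S, :, S] = 1.0                             # absorbing self-loop
    r_dag[:, :S, :] = np.where(covered, r, 0.0)
    return P_dag, r_dag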

Note that because π\pi^{\star} and π^\widehat{\pi} are fully covered by μ\mu (27), it holds that

vπ=vπ,vπ^=vπ^.v^{\dagger\pi^{\star}}=v^{\pi^{\star}},\,\,v^{\dagger\widehat{\pi}}=v^{\widehat{\pi}}. (29)

Define (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}^{\dagger}_{h}V)(\cdot,\cdot):=r_{h}^{\dagger}(\cdot,\cdot)+(P_{h}^{\dagger}V)(\cdot,\cdot) for any VS+1V\in\mathbb{R}^{S+1}. Note π^\widehat{\pi}, Q¯h\overline{Q}_{h}, V~h\widetilde{V}_{h} are defined in Algorithm 1 (we extend the definition by letting V~h(sh)=0\widetilde{V}_{h}(s^{\dagger}_{h})=0 and Q¯h(sh,)=0\overline{Q}_{h}(s_{h}^{\dagger},\cdot)=0) and denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi^{\dagger}_{h}(s,a)=(\mathcal{T}^{\dagger}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a). Using a proof identical to that of Lemma C.9, we have

V1π(s)V1π^(s)h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s],V_{1}^{\dagger\pi^{\star}}(s)-V_{1}^{\dagger\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{1}=s\right], (30)

where V1πV_{1}^{\dagger\pi} is defined in (28). Furthermore, the analogue of (30) holds for Vhπ(s)Vhπ^(s)V_{h}^{\dagger\pi^{\star}}(s)-V_{h}^{\dagger\widehat{\pi}}(s) for all h[H]h\in[H].

C.3.4 Finalize our result with non-private statistics

For those (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, ξh(sh,ah)=rh(sh,ah)+PhV~h+1(sh,ah)Q¯h(sh,ah)=ξh(sh,ah)\xi^{\dagger}_{h}(s_{h},a_{h})=r_{h}(s_{h},a_{h})+P_{h}\widetilde{V}_{h+1}(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})=\xi_{h}(s_{h},a_{h}). For those (sh,ah)𝒞h(s_{h},a_{h})\notin\mathcal{C}_{h} or sh=shs_{h}=s_{h}^{\dagger}, we have ξh(sh,ah)=0\xi^{\dagger}_{h}(s_{h},a_{h})=0.

Therefore, by (30) and Lemma C.10, under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, we have for all t[H]t\in[H], s𝒮s\in\mathcal{S} (𝒮\mathcal{S} does not include the absorbing state sts_{t}^{\dagger}),

Vtπ(s)Vtπ^(s)h=tH𝔼π[ξh(sh,ah)st=s]h=tH𝔼π^[ξh(sh,ah)st=s]\displaystyle V_{t}^{\dagger\pi^{\star}}(s)-V_{t}^{\dagger\widehat{\pi}}(s)\leq\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]-\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right] (31)
\displaystyle\leq h=tH𝔼π[ξh(sh,ah)st=s]0\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]-0
\displaystyle\leq h=tH𝔼π[22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ahst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[22VarP~h(|sh,ah)(V~h+1())ιnsh,ah2Eρ+32SHEριnsh,ahEρst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}-2E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{n_{s_{h},a_{h}}-E_{\rho}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[4VarP~h(|sh,ah)(V~h+1())ιnsh,ah+128SHEρι3nsh,ahst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{128SHE_{\rho}\cdot\iota}{3n_{s_{h},a_{h}}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[42VarP~h(|sh,ah)(V~h+1())ιndhμ(sh,ah)+256SHEρι3ndhμ(sh,ah)st=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{256SHE_{\rho}\cdot\iota}{3nd^{\mu}_{h}(s_{h},a_{h})}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)

The second and third inequalities are because of Lemma C.10, Remark C.8 and the fact that either ξ=0\xi^{\dagger}=0 or ξ=ξ\xi^{\dagger}=\xi when (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}. The fourth inequality is due to Lemma C.3. The fifth inequality is because of Remark C.8. The last inequality is by Lemma C.7.

Below we present a crude bound on |Vtπ(s)V~t(s)|\left|V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)\right|, which will be used to bound the main term in the final result.

Lemma C.11 (Self-bounding, private version of Lemma D.7 in [Yin and Wang, 2021b]).

Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, it holds that for all t[H]t\in[H] and s𝒮s\in\mathcal{S},

|Vtπ(s)V~t(s)|42ιH2nd¯m+256SH2Eρι3nd¯m.\left|V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)\right|\leq\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}.

where d¯m\bar{d}_{m} is defined in Theorem 3.4.

Proof of Lemma C.11.

According to (31), since VarP~h(|sh,ah)(V~h+1())H2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\leq H^{2}, we have for all t[H]t\in[H],

|Vtπ(s)Vtπ^(s)|42ιH2nd¯m+256SH2Eρι3nd¯m\left|V_{t}^{\dagger\pi^{\star}}(s)-V_{t}^{\dagger\widehat{\pi}}(s)\right|\leq\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}} (32)

Next, apply Lemma E.7 by setting π=π^\pi=\widehat{\pi}, π=π\pi^{\prime}=\pi^{\star}, Q^=Q¯\widehat{Q}=\overline{Q}, V^=V~\widehat{V}=\widetilde{V} under MM^{\dagger}, then we have

Vtπ(s)V~t(s)=\displaystyle V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)= h=tH𝔼π[ξh(sh,ah)st=s]+h=tH𝔼π[Q¯h(sh,),πh(|sh)π^h(|sh)st=s]\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]+\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\langle\overline{Q}_{h}\left(s_{h},\cdot\right),\pi^{\star}_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\mid s_{t}=s\right] (33)
\displaystyle\leq h=tH𝔼π[ξh(sh,ah)st=s]\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]
\displaystyle\leq 42ιH2nd¯m+256SH2Eρι3nd¯m.\displaystyle\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}.

Also, apply Lemma E.7 by setting π=π=π^\pi=\pi^{\prime}=\widehat{\pi}, Q^=Q¯\widehat{Q}=\overline{Q}, V^=V~\widehat{V}=\widetilde{V} under MM^{\dagger}, then we have

V~t(s)Vtπ^(s)=h=tH𝔼π^[ξh(sh,ah)st=s]0.\displaystyle\widetilde{V}_{t}(s)-V_{t}^{\dagger\widehat{\pi}}(s)=-\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]\leq 0. (34)

The proof is complete by combining (32), (33) and (34). ∎

Now we are ready to bound VarP~h(|sh,ah)(V~h+1())\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))} by VarPh(|sh,ah)(Vh+1())\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}. Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, with probability 1δ1-\delta, it holds that for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h},

VarP~h(|sh,ah)(V~h+1())VarP~h(|sh,ah)(Vh+1())+V~h+1Vh+1π,s𝒮\displaystyle\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}\leq\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\left\lVert\widetilde{V}_{h+1}-{V}^{\dagger\pi^{\star}}_{h+1}\right\rVert_{\infty,s\in\mathcal{S}} (35)
\displaystyle\leq VarP~h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m\displaystyle\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}
\displaystyle\leq VarP^h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+4HSEρn~sh,ah\displaystyle\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}
\displaystyle\leq VarP^h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m\displaystyle\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}
\displaystyle\leq VarPh(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m+3Hιnd¯m\displaystyle\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}+3H\sqrt{\frac{\iota}{n\cdot\bar{d}_{m}}}
\displaystyle\leq VarPh(|sh,ah)(Vh+1())+9ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m.\displaystyle\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{9\sqrt{\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}.

The second inequality is because of Lemma C.11. The third inequality is due to Lemma C.5. The fourth inequality comes from Lemma C.3 and Remark C.8. The fifth inequality holds with probability 1δ1-\delta because of Lemma E.5 and a union bound.

Finally, by plugging (35) into (31) and averaging over s1s_{1}, we have with probability 14δ1-4\delta,

vπvπ^=vπvπ^h=1H𝔼π[42VarP~h(|sh,ah)(V~h+1())ιndhμ(sh,ah)+256SHEρι3ndhμ(sh,ah)]\displaystyle v^{\pi^{\star}}-v^{\widehat{\pi}}=v^{\dagger\pi^{\star}}-v^{\dagger\widehat{\pi}}\leq\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{256SHE_{\rho}\cdot\iota}{3nd^{\mu}_{h}(s_{h},a_{h})}\right] (36)
\displaystyle\leq 42h=1H𝔼π[VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)]+O~(H3+SH2Eρnd¯m)\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}\right]+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)
=\displaystyle= 42h=1H(sh,ah)𝒞hdhπ(sh,ah)VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)+O~(H3+SH2Eρnd¯m)\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)
=\displaystyle= 42h=1H(sh,ah)𝒞hdhπ(sh,ah)VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)+O~(H3+SH2Eρnd¯m),\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right),

where O~\widetilde{O} absorbs constants and polylog terms. The first equality is due to (29). The first inequality is because of (31). The second inequality comes from (35) and our assumption that nd¯mc1H2n\cdot\bar{d}_{m}\geq c_{1}H^{2}. The second equality uses the fact that dhπ(sh,ah)=dhπ(sh,ah)d^{\pi^{\star}}_{h}(s_{h},a_{h})=d^{\dagger\pi^{\star}}_{h}(s_{h},a_{h}) for all (sh,ah)(s_{h},a_{h}). The last equality is because for any (sh,ah,sh+1)(s_{h},a_{h},s_{h+1}) such that dhπ(sh,ah)>0d^{\pi^{\star}}_{h}(s_{h},a_{h})>0 and Ph(sh+1|sh,ah)>0P_{h}(s_{h+1}|s_{h},a_{h})>0, we have Vh+1(sh+1)=Vh+1(sh+1)V^{\dagger\star}_{h+1}(s_{h+1})=V^{\star}_{h+1}(s_{h+1}).

C.4 Put everything together

Combining Lemma C.1 and (36), the proof of Theorem 3.4 is complete.

Appendix D Proof of Theorem 4.1

D.1 Proof sketch

Since the proof of the privacy guarantee is short, we present it in full in Section D.2 below and only sketch the proof of the sub-optimality bound here.

First, by the extended value difference lemma (Lemmas E.7 and E.8), we can convert bounding the sub-optimality gap vvπ^v^{\star}-v^{\widehat{\pi}} into bounding h=1H𝔼π[Γh(sh,ah)]\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\Gamma_{h}(s_{h},a_{h})\right], given that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a) for all s,a,hs,a,h. To bound (𝒯hV~h+1𝒯~hV~h+1)(s,a)(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a), using our upper bound on the noise we add, we decompose it into lower-order terms (O~(1K)\widetilde{O}(\frac{1}{K})) and the following key quantity:

ϕ(s,a)Λ^h1[τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ)].\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left[\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right]. (37)

For the term above, we prove an upper bound on σV~h+12σ~h2\left\lVert\sigma^{2}_{\widetilde{V}_{h+1}}-\widetilde{\sigma}_{h}^{2}\right\rVert_{\infty}, so we can convert σ~h2\widetilde{\sigma}_{h}^{2} to σV~h+12\sigma^{2}_{\widetilde{V}_{h+1}}. Next, since Var[rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ)shτ,ahτ]σV~h+12\mathrm{Var}\left[r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\mid s^{\tau}_{h},a^{\tau}_{h}\right]\approx\sigma^{2}_{\widetilde{V}_{h+1}}, we can apply Bernstein’s inequality for self-normalized martingales (Lemma E.10), as in Yin et al. [2022], to derive a tighter bound.

Finally, we replace the private statistics with non-private ones. More specifically, we convert σV~h+12\sigma^{2}_{\widetilde{V}_{h+1}} to σh2\sigma_{h}^{\star 2} (and Λh1\Lambda_{h}^{-1} to Λh1\Lambda_{h}^{\star-1}) by combining the crude upper bound on V~V\left\lVert\widetilde{V}-V^{\star}\right\rVert_{\infty} with matrix concentration results.
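The central object in the sketch above is the variance-weighted ridge regression with covariance Λ^h\widehat{\Lambda}_{h} (see the notation list). The following sketch is only an illustration of a generic variance-weighted LSVI update under our own naming assumptions; it is not the exact private update of Algorithm 2, which additionally injects the noises ϕ1,ϕ2,ϕ3\phi_{1},\phi_{2},\phi_{3} and K1,K2K_{1},K_{2}.

import numpy as np

def weighted_regression(Phi, targets, sigma2, lam):
    """Generic variance-weighted ridge regression used in LSVI-style updates.

    Phi     : array (K, d), rows phi(s_h^tau, a_h^tau).
    targets : array (K,), entries r_h^tau + V_tilde_{h+1}(s_{h+1}^tau).
    sigma2  : array (K,), variance weights sigma_tilde_h^2(s_h^tau, a_h^tau).
    lam     : ridge parameter lambda.
    Returns (Lambda_hat, w_hat) with Lambda_hat = sum phi phi^T / sigma2 + lam * I.
    """
    d = Phi.shape[1]
    Lambda_hat = (Phi / sigma2[:, None]).T @ Phi + lam * np.eye(d)
    w_hat = np.linalg.solve(Lambda_hat, (Phi * (targets / sigma2)[:, None]).sum(axis=0))
    return Lambda_hat, w_hat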

D.2 Proof of the privacy guarantee

The privacy guarantee of DP-VAPVI (Algorithm 2) is summarized by Lemma D.1 below.

Lemma D.1 (Privacy analysis of DP-VAPVI (Algorithm 2)).

DP-VAPVI (Algorithm 2) satisfies ρ\rho-zCDP.

Proof of Lemma D.1.

For τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}, the 2\ell_{2} sensitivity is 2H22H^{2}. For τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau}) and τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h}), the 2\ell_{2} sensitivity is 2H2H. Therefore according to Lemma 2.7, the use of Gaussian Mechanism (the additional noises ϕ1,ϕ2,ϕ3\phi_{1},\phi_{2},\phi_{3}) ensures ρ0\rho_{0}-zCDP for each counter. For τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I and τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I, according to Appendix D in [Redberg and Wang, 2021], the per-instance 2\ell_{2} sensitivity is

Δx2=12supϕ:ϕ21ϕϕF=12supϕ:ϕ21i,jϕi2ϕj2=12.\left\lVert\Delta_{x}\right\rVert_{2}=\frac{1}{\sqrt{2}}\sup_{\phi:\|\phi\|_{2}\leq 1}\left\lVert\phi\phi^{\top}\right\rVert_{F}=\frac{1}{\sqrt{2}}\sup_{\phi:\|\phi\|_{2}\leq 1}\sqrt{\sum_{i,j}\phi_{i}^{2}\phi_{j}^{2}}=\frac{1}{\sqrt{2}}.

Therefore, the use of the Gaussian mechanism (the additional noises K1,K2K_{1},K_{2}) also ensures ρ0\rho_{0}-zCDP for each counter (for a more detailed explanation, we refer the readers to Appendix D of [Redberg and Wang, 2021]). Combining these results, according to Lemma E.17, the whole algorithm satisfies 5Hρ05H\rho_{0}-zCDP, i.e., ρ\rho-zCDP. ∎
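
To make the noise calibration concrete, here is a minimal Python sketch of how the noise scales follow from the sensitivities above under zCDP: adding N(0, σ²) noise with σ = Δ₂/√(2ρ₀) to a statistic with ℓ₂ sensitivity Δ₂ gives ρ₀-zCDP, and zCDP composes additively over the 5H counters. The numerical values of H and ρ are illustrative, and the actual construction of the matrix noises K₁, K₂ in Algorithm 2 may differ; this is a sketch, not the paper's implementation.

```python
import numpy as np

def gaussian_mech_sigma(l2_sensitivity, rho0):
    # Gaussian mechanism: adding N(0, sigma^2) noise with sigma = Delta_2 / sqrt(2 * rho0)
    # to a statistic with l2 sensitivity Delta_2 satisfies rho0-zCDP.
    return l2_sensitivity / np.sqrt(2.0 * rho0)

H, rho = 20, 1.0          # illustrative horizon and total privacy budget
rho0 = rho / (5 * H)      # per-counter budget: five noisy statistics per step h

sigma_phi1 = gaussian_mech_sigma(2 * H ** 2, rho0)       # counter with sensitivity 2H^2
sigma_phi23 = gaussian_mech_sigma(2 * H, rho0)           # counters with sensitivity 2H
sigma_mat = gaussian_mech_sigma(1.0 / np.sqrt(2), rho0)  # scale implied by the 1/sqrt(2) per-instance sensitivity

# zCDP composes additively, so the 5H counters together cost 5H * rho0 = rho.
assert np.isclose(5 * H * rho0, rho)
print(sigma_phi1, sigma_phi23, sigma_mat)
```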

D.3 Proof of the sub-optimality bound

D.3.1 Utility analysis and some preparation

We begin with the following high probability bound of the noises we add.

Lemma D.2 (Utility analysis).

Let L=2Hdρ0log(10Hdδ)=2H5Hdlog(10Hdδ)ρL=2H\sqrt{\frac{d}{\rho_{0}}\log(\frac{10Hd}{\delta})}=2H\sqrt{\frac{5Hd\log(\frac{10Hd}{\delta})}{\rho}} and
E=2dρ0(2+(log(5c1H/δ)c2d)23)=10Hdρ(2+(log(5c1H/δ)c2d)23)E=\sqrt{\frac{2d}{\rho_{0}}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right)=\sqrt{\frac{10Hd}{\rho}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right) for some universal constants c1,c2c_{1},c_{2}. Then with probability 1δ1-\delta, the following inequalities hold simultaneously:

For allh[H],ϕ12HL,ϕ22L,ϕ32L.\displaystyle\text{For all}\,h\in[H],\,\,\|\phi_{1}\|_{2}\leq HL,\|\phi_{2}\|_{2}\leq L,\,\,\|\phi_{3}\|_{2}\leq L. (38)
For allh[H],K1,K2are symmetric and positive definite andKi2E,i{1,2}.\displaystyle\text{For all}\,h\in[H],\,\,K_{1},K_{2}\,\text{are symmetric and positive definite and}\,\|K_{i}\|_{2}\leq E,\,\,i\in\{1,2\}.
Proof of Lemma D.2.

The first line of (38) follows directly from the concentration inequality for the Gaussian distribution and a union bound. The second line of (38) follows from Lemma 19 in [Redberg and Wang, 2021] and Weyl’s inequality. ∎
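
For intuition about the first line of (38), the following sketch empirically checks the Gaussian norm concentration behind it: if z ∼ N(0, σ²I_d), then ‖z‖₂ ≤ σ(√d + √(2 log(1/δ))) with probability at least 1 − δ. The dimension, noise scale and δ below are illustrative and do not correspond to the exact constants L and E of Lemma D.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, delta, trials = 50, 3.0, 0.05, 20000

# If z ~ N(0, sigma^2 I_d), then with probability at least 1 - delta,
# ||z||_2 <= sigma * (sqrt(d) + sqrt(2 * log(1 / delta))).
threshold = sigma * (np.sqrt(d) + np.sqrt(2.0 * np.log(1.0 / delta)))

z = rng.normal(scale=sigma, size=(trials, d))
failure_rate = np.mean(np.linalg.norm(z, axis=1) > threshold)
print("empirical failure rate:", failure_rate, "(should be well below delta =", delta, ")")
```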

Define the Bellman update error ζh(s,a):=(𝒯hV~h+1)(s,a)Q^h(s,a)\zeta_{h}(s,a):=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a) and recall
π^h(s)=argmaxπhQ^h(s,),πh(s)𝒜\widehat{\pi}_{h}(s)=\mathrm{argmax}_{\pi_{h}}\langle\widehat{Q}_{h}(s,\cdot),\pi_{h}(\cdot\mid s)\rangle_{\mathcal{A}}. Then, by Lemma E.8,

V1π(s)V1π^(s)h=1H𝔼π[ζh(sh,ah)s1=s]h=1H𝔼π^[ζh(sh,ah)s1=s].V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\zeta_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\zeta_{h}(s_{h},a_{h})\mid s_{1}=s\right]. (39)

Define 𝒯~hV~h+1(,)=ϕ(,)w~h\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)=\phi(\cdot,\cdot)^{\top}\widetilde{w}_{h}. Then similar to Lemma C.10, we have the following lemma showing that in order to bound the sub-optimality, it is sufficient to bound the pessimistic penalty.

Lemma D.3 (Lemma C.1 in [Yin et al., 2022]).

Suppose with probability 1δ1-\delta, it holds for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H] that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), then it implies s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H], 0ζh(s,a)2Γh(s,a)0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a). Furthermore, with probability 1δ1-\delta, it holds for any policy π\pi simultaneously,

V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s].V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}\left[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s\right].
Proof of Lemma D.3.

We first show that, given |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), it holds that 0ζh(s,a)2Γh(s,a)0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Step 1: The first step is to show 0ζh(s,a)0\leq\zeta_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Indeed, if Q¯h(s,a)0\bar{Q}_{h}(s,a)\leq 0, then by definition Q^h(s,a)=0\widehat{Q}_{h}(s,a)=0 and therefore ζh(s,a):=(𝒯hV~h+1)(s,a)Q^h(s,a)=(𝒯hV~h+1)(s,a)0\zeta_{h}(s,a):=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\geq 0. If Q¯h(s,a)>0\bar{Q}_{h}(s,a)>0, then Q^h(s,a)Q¯h(s,a)\widehat{Q}_{h}(s,a)\leq\bar{Q}_{h}(s,a) and

ζh(s,a):=\displaystyle\zeta_{h}(s,a):= (𝒯hV~h+1)(s,a)Q^h(s,a)(𝒯hV~h+1)(s,a)Q¯h(s,a)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)\geq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\bar{Q}_{h}(s,a)
=\displaystyle= (𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)+Γh(s,a)0.\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)+\Gamma_{h}(s,a)\geq 0.

Step 2: The second step is to show ζh(s,a)2Γh(s,a)\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Under the assumption that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), we have

Q¯h(s,a)=(𝒯~hV~h+1)(s,a)Γh(s,a)(𝒯hV~h+1)(s,a)Hh+1,\bar{Q}_{h}(s,a)=(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)-\Gamma_{h}(s,a)\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\leq H-h+1,

which implies that Q^h(s,a)=max(Q¯h(s,a),0)\widehat{Q}_{h}(s,a)=\max(\bar{Q}_{h}(s,a),0). Therefore, it holds that

ζh(s,a):=\displaystyle\zeta_{h}(s,a):= (𝒯hV~h+1)(s,a)Q^h(s,a)(𝒯hV~h+1)(s,a)Q¯h(s,a)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\bar{Q}_{h}(s,a)
=\displaystyle= (𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)+Γh(s,a)2Γh(s,a).\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)+\Gamma_{h}(s,a)\leq 2\cdot\Gamma_{h}(s,a).

For the last statement, denote 𝔉:={0ζh(s,a)2Γh(s,a),s,a,h𝒮×𝒜×[H]}\mathfrak{F}:=\{0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a),\;\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H]\}. Note that, conditional on 𝔉\mathfrak{F}, by (39), V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s] holds for any policy π\pi almost surely. Therefore,

[π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s].]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s].\right]
=\displaystyle= [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉][𝔉]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}\right]\cdot\mathbb{P}[\mathfrak{F}]
+\displaystyle+ [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉c][𝔉c]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}^{c}\right]\cdot\mathbb{P}[\mathfrak{F}^{c}]
\displaystyle\geq [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉][𝔉]=1[𝔉]1δ,\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}\right]\cdot\mathbb{P}[\mathfrak{F}]=1\cdot\mathbb{P}[\mathfrak{F}]\geq 1-\delta,

which finishes the proof. ∎
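
As a quick numerical illustration of Steps 1 and 2 (it is not part of the formal argument), the sketch below draws random instances with |(𝒯_h Ṽ_{h+1} − 𝒯̃_h Ṽ_{h+1})(s,a)| ≤ Γ_h(s,a), forms the clipped pessimistic estimate Q̂_h = min{Q̄_h, H − h + 1}⁺ used in the proof, and checks that 0 ≤ ζ_h ≤ 2Γ_h on every instance; all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
H, h, n = 20, 5, 100000

T_V = rng.uniform(0, H - h + 1, size=n)       # (T_h V_tilde_{h+1})(s,a), valued in [0, H-h+1]
Gamma = rng.uniform(0, 2.0, size=n)           # pessimistic penalty Gamma_h(s,a)
err = rng.uniform(-1.0, 1.0, size=n) * Gamma  # (T_tilde_h - T_h) V_tilde_{h+1}, with |err| <= Gamma
T_tilde_V = T_V + err

Q_bar = T_tilde_V - Gamma                     # Q_bar_h = T_tilde_h V_tilde_{h+1} - Gamma_h
Q_hat = np.clip(Q_bar, 0.0, H - h + 1)        # Q_hat_h = min{Q_bar_h, H-h+1}^+
zeta = T_V - Q_hat                            # Bellman update error zeta_h

assert np.all(zeta >= -1e-9) and np.all(zeta <= 2 * Gamma + 1e-9)
print("0 <= zeta_h <= 2 * Gamma_h holds on all", n, "random instances")
```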

D.3.2 Bound the pessimistic penalty

By Lemma D.3, it remains to bound |(𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)||(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|. Suppose whw_{h} is the coefficient corresponding to 𝒯hV~h+1\mathcal{T}_{h}\widetilde{V}_{h+1} (such whw_{h} exists by Lemma E.14), i.e., 𝒯hV~h+1=ϕwh\mathcal{T}_{h}\widetilde{V}_{h+1}=\phi^{\top}w_{h}, and recall (𝒯~hV~h+1)(s,a)=ϕ(s,a)w~h(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)=\phi(s,a)^{\top}\widetilde{w}_{h}. Then:

(𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)=ϕ(s,a)(whw~h)\displaystyle\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)(s,a)-\left(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}\right)(s,a)=\phi(s,a)^{\top}\left(w_{h}-\widetilde{w}_{h}\right) (40)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ~h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)\displaystyle\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))(i)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{i})}
ϕ(s,a)Λ^h1ϕ3(ii)+ϕ(s,a)(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)(iii),\displaystyle-\underbrace{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi_{3}}_{(\mathrm{ii})}+\underbrace{\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)}_{(\mathrm{iii})},

where Λ^h=Λ~hK2=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\widehat{\Lambda}_{h}=\widetilde{\Lambda}_{h}-K_{2}=\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I.

Term (ii) can be handled by the following Lemma D.4.

Lemma D.4.

Recall κ\kappa in Assumption 2.2. Under the high probability event in Lemma D.2, suppose Kmax{512H4log(2Hdδ)κ2,4λH2κ}K\geq\max\left\{\frac{512H^{4}\cdot\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}, then with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H], it holds that

|ϕ(s,a)Λ^h1ϕ3|4H2L/κK.\left|\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}\phi_{3}\right|\leq\frac{4H^{2}L/\kappa}{K}.
Proof of Lemma D.4.

Define Λ~hp=𝔼μ,h[σ~h2(s,a)ϕ(s,a)ϕ(s,a)]\widetilde{\Lambda}^{p}_{h}=\mathbb{E}_{\mu,h}[\widetilde{\sigma}_{h}^{-2}(s,a)\phi(s,a)\phi(s,a)^{\top}]. Then because of Assumption 2.2 and σ~hH\widetilde{\sigma}_{h}\leq H, it holds that λmin(Λ~hp)κH2\lambda_{\min}(\widetilde{\Lambda}^{p}_{h})\geq\frac{\kappa}{H^{2}}. Therefore, due to Lemma E.13, we have with probability 1δ1-\delta,

|ϕ(s,a)Λ^h1ϕ3|ϕ(s,a)Λ^h1ϕ3Λ^h1\displaystyle\left|\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}\phi_{3}\right|\leq\|\phi(s,a)\|_{\widehat{\Lambda}_{h}^{-1}}\cdot\|\phi_{3}\|_{\widehat{\Lambda}_{h}^{-1}}
\displaystyle\leq 4Kϕ(s,a)(Λ~hp)1ϕ3(Λ~hp)1\displaystyle\frac{4}{K}\|\phi(s,a)\|_{(\widetilde{\Lambda}_{h}^{p})^{-1}}\cdot\|\phi_{3}\|_{(\widetilde{\Lambda}_{h}^{p})^{-1}}
\displaystyle\leq 4LK(Λ~hp)1\displaystyle\frac{4L}{K}\|(\widetilde{\Lambda}_{h}^{p})^{-1}\|
\displaystyle\leq 4H2L/κK.\displaystyle\frac{4H^{2}L/\kappa}{K}.

The first inequality is because of Cauchy-Schwarz inequality. The second inequality holds with probability 1δ1-\delta due to Lemma E.13 and a union bound. The third inequality holds because aAaa2A2a2=a2A2\sqrt{a^{\top}\cdot A\cdot a}\leq\sqrt{\|a\|_{2}\|A\|_{2}\|a\|_{2}}=\|a\|_{2}\sqrt{\|A\|_{2}}. The last inequality arises from (Λ~hp)1=λmax((Λ~hp)1)=λmin1(Λ~hp)H2κ\|(\widetilde{\Lambda}_{h}^{p})^{-1}\|=\lambda_{\max}((\widetilde{\Lambda}^{p}_{h})^{-1})=\lambda_{\min}^{-1}(\widetilde{\Lambda}^{p}_{h})\leq\frac{H^{2}}{\kappa}. ∎

The difference between Λ~h1\widetilde{\Lambda}_{h}^{-1} and Λ^h1\widehat{\Lambda}_{h}^{-1} can be bounded by the following Lemma D.5.

Lemma D.5.

Under the high probability event in Lemma D.2, suppose K128H4log2dHδκ2K\geq\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}}, then with probability 1δ1-\delta, for all h[H]h\in[H], it holds that Λ^h1Λ~h14H4E/κ2K2\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|\leq\frac{4H^{4}E/\kappa^{2}}{K^{2}}.

Proof of Lemma D.5.

First of all, we have

Λ^h1Λ~h1=Λ^h1(Λ^hΛ~h)Λ~h1\displaystyle\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|=\|\widehat{\Lambda}_{h}^{-1}\cdot(\widehat{\Lambda}_{h}-\widetilde{\Lambda}_{h})\cdot\widetilde{\Lambda}_{h}^{-1}\| (41)
\displaystyle\leq Λ^h1Λ^hΛ~hΛ~h1\displaystyle\|\widehat{\Lambda}_{h}^{-1}\|\cdot\|\widehat{\Lambda}_{h}-\widetilde{\Lambda}_{h}\|\cdot\|\widetilde{\Lambda}_{h}^{-1}\|
\displaystyle\leq λmin1(Λ^h)λmin1(Λ~h)E.\displaystyle\lambda_{\min}^{-1}(\widehat{\Lambda}_{h})\cdot\lambda_{\min}^{-1}(\widetilde{\Lambda}_{h})\cdot E.

The first inequality is because ABAB\|A\cdot B\|\leq\|A\|\cdot\|B\|. The second inequality is due to Lemma D.2.

Let Λ^h=1KΛ^h\widehat{\Lambda}_{h}^{\prime}=\frac{1}{K}\widehat{\Lambda}_{h}, then because of Lemma E.12, with probability 1δ1-\delta, it holds that for all h[H]h\in[H],

Λ^h𝔼μ,h[ϕ(s,a)ϕ(s,a)/σ~h2(s,a)]λKId42K(log2dHδ)1/2,\left\lVert\widehat{\Lambda}_{h}^{\prime}-\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\widetilde{\sigma}_{h}^{2}(s,a)]-\frac{\lambda}{K}I_{d}\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2},

which implies that when K128H4log2dHδκ2K\geq\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}}, it holds that (according to Weyl’s Inequality)

λmin(Λ^h)λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σ~h2(s,a)])+λKκ2H2κ2H2.\lambda_{\min}(\widehat{\Lambda}_{h}^{\prime})\geq\lambda_{\min}(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\widetilde{\sigma}_{h}^{2}(s,a)])+\frac{\lambda}{K}-\frac{\kappa}{2H^{2}}\geq\frac{\kappa}{2H^{2}}.

Under this high probability event, we have λmin(Λ^h)Kκ2H2\lambda_{\min}(\widehat{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}} and therefore λmin(Λ~h)λmin(Λ^h)Kκ2H2.\lambda_{\min}(\widetilde{\Lambda}_{h})\geq\lambda_{\min}(\widehat{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}}. Plugging these two results into (41), we have

Λ^h1Λ~h14H4E/κ2K2.\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|\leq\frac{4H^{4}E/\kappa^{2}}{K^{2}}. ∎
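
Display (41) is the resolvent identity Λ̂_h⁻¹ − Λ̃_h⁻¹ = Λ̂_h⁻¹(Λ̃_h − Λ̂_h)Λ̃_h⁻¹ combined with submultiplicativity of the operator norm. As a sanity check (on random positive definite matrices rather than the ones produced by Algorithm 2), the following sketch verifies the resulting bound numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Random positive definite Lambda_hat and a symmetric perturbation K2,
# mirroring Lambda_tilde = Lambda_hat + K2 in the proof of Lemma D.5.
A = rng.normal(size=(d, d))
Lambda_hat = A @ A.T + d * np.eye(d)
P = rng.normal(size=(d, d))
K2 = 0.1 * (P + P.T) + 0.5 * np.eye(d)
Lambda_tilde = Lambda_hat + K2

inv_hat = np.linalg.inv(Lambda_hat)
inv_tilde = np.linalg.inv(Lambda_tilde)

lhs = np.linalg.norm(inv_hat - inv_tilde, 2)
# ||A^{-1} - B^{-1}|| = ||A^{-1}(B - A)B^{-1}|| <= ||A^{-1}|| * ||B - A|| * ||B^{-1}||
#                     = ||K2|| / (lambda_min(A) * lambda_min(B))  for symmetric PD A, B.
rhs = np.linalg.norm(K2, 2) / (np.linalg.eigvalsh(Lambda_hat)[0] * np.linalg.eigvalsh(Lambda_tilde)[0])
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```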

Then we can bound term (iii) by the following Lemma D.6.

Lemma D.6.

Suppose Kmax{128H4log2dHδκ2,2Ldκ}K\geq\max\{\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)|42H4Ed/κ3/2K.\left|\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)\right|\leq\frac{4\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}.
Proof of Lemma D.6.

First of all, the left hand side is bounded by

(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))2+4H4EL/κ2K2\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right\rVert_{2}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}

due to Lemma D.5. Then the left hand side can be further bounded by

Hτ=1K(Λ^h1Λ~h1)ϕ(shτ,ahτ)/σ~h(shτ,ahτ)2+4H4EL/κ2K2\displaystyle H\sum_{\tau=1}^{K}\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right\rVert_{2}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq Hτ=1KTr((Λ^h1Λ~h1)ϕ(shτ,ahτ)ϕ(shτ,ahτ)σ~h2(shτ,ahτ)(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sum_{\tau=1}^{K}\sqrt{Tr\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\frac{\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}}{\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKTr((Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sqrt{K\cdot Tr\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKdλmax((Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\lambda_{\max}\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
=\displaystyle= HKd(Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1)2+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right\rVert_{2}}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKdΛ~h12Λ~hΛ^h2Λ^h1Λ~h12+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\left\lVert\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}\cdot\left\lVert\widetilde{\Lambda}_{h}-\widehat{\Lambda}_{h}\right\rVert_{2}\cdot\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq 22H4Ed/κ3/2K+4H4EL/κ2K2\displaystyle\frac{2\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq 42H4Ed/κ3/2K.\displaystyle\frac{4\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}.

The first inequality is because a2=aa=Tr(aa)\left\lVert a\right\rVert_{2}=\sqrt{a^{\top}a}=\sqrt{Tr(aa^{\top})}. The second inequality is due to the Cauchy-Schwarz inequality. The third inequality is because for a positive definite matrix AA, it holds that Tr(A)=i=1dλi(A)dλmax(A)Tr(A)=\sum_{i=1}^{d}\lambda_{i}(A)\leq d\lambda_{\max}(A). The equality holds because for a symmetric, positive definite matrix AA, A2=λmax(A)\left\lVert A\right\rVert_{2}=\lambda_{\max}(A). The fourth inequality is due to ABAB\left\lVert A\cdot B\right\rVert\leq\left\lVert A\right\rVert\cdot\left\lVert B\right\rVert. The fifth inequality is because of Lemma D.2, Lemma D.5 and the statement in the proof of Lemma D.5 that λmin(Λ~h)Kκ2H2\lambda_{\min}(\widetilde{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}}. The last inequality uses the assumption that K2LdκK\geq\frac{\sqrt{2}L}{\sqrt{d\kappa}}. ∎

It remains to bound term (i). We have

ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))(i)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{i})} (42)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))(iv)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{iv})}
ϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ))(v).\displaystyle-\underbrace{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{v})}.

We are able to bound term (iv) by the following Lemma D.7.

Lemma D.7.

Recall κ\kappa in Assumption 2.2. Under the high probability event in Lemma D.2, suppose Kmax{512H4log(2Hdδ)κ2,4λH2κ}K\geq\max\left\{\frac{512H^{4}\cdot\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}, then with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))|8λH3d/κK.\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|\leq\frac{8\lambda H^{3}\sqrt{d}/\kappa}{K}.
Proof of Lemma D.7.

Recall that 𝒯hV~h+1=ϕwh\mathcal{T}_{h}\widetilde{V}_{h+1}=\phi^{\top}w_{h}. Applying Lemma E.13, we obtain that with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|
=\displaystyle= |ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)wh/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\phi(s_{h}^{\tau},a_{h}^{\tau})^{\top}w_{h}/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|
=\displaystyle= |ϕ(s,a)whϕ(s,a)Λ^h1(Λ^hλI)wh|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\widehat{\Lambda}_{h}-\lambda I\right)w_{h}\right|
=\displaystyle= |λϕ(s,a)Λ^h1wh|\displaystyle\left|\lambda\cdot\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}w_{h}\right|
\displaystyle\leq λϕ(s,a)Λ^h1whΛ^h1\displaystyle\lambda\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{-1}_{h}}\cdot\left\lVert w_{h}\right\rVert_{\widehat{\Lambda}^{-1}_{h}}
\displaystyle\leq 4λKϕ(s,a)(Λ~hp)1wh(Λ~hp)1\displaystyle\frac{4\lambda}{K}\left\lVert\phi(s,a)\right\rVert_{(\widetilde{\Lambda}_{h}^{p})^{-1}}\cdot\left\lVert w_{h}\right\rVert_{(\widetilde{\Lambda}_{h}^{p})^{-1}}
\displaystyle\leq 4λK2Hd(Λ~hp)1\displaystyle\frac{4\lambda}{K}\cdot 2H\sqrt{d}\cdot\left\lVert(\widetilde{\Lambda}_{h}^{p})^{-1}\right\rVert
\displaystyle\leq 8λH3d/κK,\displaystyle\frac{8\lambda H^{3}\sqrt{d}/\kappa}{K},

where Λ~hp:=𝔼μ,h[σ~h(s,a)2ϕ(s,a)ϕ(s,a)]\widetilde{\Lambda}_{h}^{p}:=\mathbb{E}_{\mu,h}\left[\widetilde{\sigma}_{h}(s,a)^{-2}\phi(s,a)\phi(s,a)^{\top}\right]. The first inequality applies Cauchy-Schwarz inequality. The second inequality holds with probability 1δ1-\delta due to Lemma E.13 and a union bound. The third inequality uses aAaa2A2a2=a2A2\sqrt{a^{\top}\cdot A\cdot a}\leq\sqrt{\left\lVert a\right\rVert_{2}\left\lVert A\right\rVert_{2}\left\lVert a\right\rVert_{2}}=\left\lVert a\right\rVert_{2}\sqrt{\left\lVert A\right\rVert_{2}} and wh2Hd\left\lVert w_{h}\right\rVert\leq 2H\sqrt{d}. Finally, as λmin(Λ~hp)κmaxh,s,aσ~h(s,a)2κH2\lambda_{\min}(\widetilde{\Lambda}_{h}^{p})\geq\frac{\kappa}{\max_{h,s,a}\widetilde{\sigma}_{h}(s,a)^{2}}\geq\frac{\kappa}{H^{2}} implies (Λ~hp)1H2κ\left\lVert(\widetilde{\Lambda}_{h}^{p})^{-1}\right\rVert\leq\frac{H^{2}}{\kappa}, the last inequality holds. ∎
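
The computation behind term (iv) is the bias of weighted ridge regression with noiseless targets: if every regression target equals ϕ⊤w_h exactly, then the estimator Λ̂_h⁻¹ ∑_τ ϕ_τ(ϕ_τ⊤w_h)/σ̃_h² equals w_h − λΛ̂_h⁻¹w_h, so the pointwise error is exactly λϕ(s,a)⊤Λ̂_h⁻¹w_h. The following sketch verifies this identity numerically with illustrative dimensions and variance weights.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam = 10, 500, 1.0

phi = rng.normal(size=(K, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # features with ||phi||_2 <= 1
sigma2 = rng.uniform(1.0, 4.0, size=K)              # plays the role of sigma_tilde_h^2 in [1, H^2]
w = rng.normal(size=d)                              # coefficient of T_h V_tilde_{h+1}

Lambda_hat = (phi.T / sigma2) @ phi + lam * np.eye(d)
targets = phi @ w                                   # noiseless targets (T_h V_tilde_{h+1})(s_h, a_h)
w_hat = np.linalg.solve(Lambda_hat, (phi.T / sigma2) @ targets)

# With noiseless targets, the weighted ridge estimate satisfies
# w_hat = Lambda_hat^{-1} (Lambda_hat - lam I) w = w - lam * Lambda_hat^{-1} w,
# so phi(s,a)^T (w - w_hat) = lam * phi(s,a)^T Lambda_hat^{-1} w, the quantity bounded in Lemma D.7.
phi_query = rng.normal(size=d)
phi_query /= np.linalg.norm(phi_query)
lhs = phi_query @ (w - w_hat)
rhs = lam * phi_query @ np.linalg.solve(Lambda_hat, w)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```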

For term (v), denote: xτ=ϕ(shτ,ahτ)σ~h(shτ,ahτ),ητ=(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h(shτ,ahτ)x_{\tau}=\frac{\phi(s_{h}^{\tau},a_{h}^{\tau})}{\widetilde{\sigma}_{h}(s_{h}^{\tau},a_{h}^{\tau})},\quad\eta_{\tau}=\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h}), then by Cauchy-Schwarz inequality, it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right| (43)
\displaystyle\leq ϕ(s,a)Λ^h1ϕ(s,a)τ=1KxτητΛ^h1.\displaystyle\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\cdot\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}.

We bound ϕ(s,a)Λ^h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)} by ϕ(s,a)Λ~h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)} using the following Lemma D.8.

Lemma D.8.

Suppose Kmax{128H4log2dHδκ2,2Ldκ}K\geq\max\{\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

ϕ(s,a)Λ^h1ϕ(s,a)ϕ(s,a)Λ~h1ϕ(s,a)+2H2E/κK.\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\leq\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.
Proof of Lemma D.8.
ϕ(s,a)Λ^h1ϕ(s,a)=ϕ(s,a)Λ~h1ϕ(s,a)+ϕ(s,a)(Λ^h1Λ~h1)ϕ(s,a)\displaystyle\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}=\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)+\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\phi(s,a)} (44)
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+Λ^h1Λ~h12\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)+\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+Λ^h1Λ~h12\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\sqrt{\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+2H2E/κK.\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.

The first inequality uses |aAa|a22A|a^{\top}Aa|\leq\left\lVert a\right\rVert_{2}^{2}\cdot\left\lVert A\right\rVert. The second inequality is because for a,b0a,b\geq 0, a+ba+b\sqrt{a}+\sqrt{b}\geq\sqrt{a+b}. The last inequality uses Lemma D.5. ∎

Remark D.9.

Similarly, under the same assumption in Lemma D.8, we also have for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

ϕ(s,a)Λ~h1ϕ(s,a)ϕ(s,a)Λ^h1ϕ(s,a)+2H2E/κK.\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\leq\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.

D.3.3 An intermediate result: bounding the variance

Before we handle τ=1KxτητΛ^h1\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}, we first bound suphσ~h2σV~h+12\sup_{h}\left\lVert\widetilde{\sigma}_{h}^{2}-\sigma_{\widetilde{V}_{h+1}}^{2}\right\rVert_{\infty} by the following Lemma D.10.

Lemma D.10 (Private version of Lemma C.7 in [Yin et al., 2022]).

Recall the definition of σ~h(,)2=max{1,Var~hV~h+1(,)}\widetilde{\sigma}_{h}(\cdot,\cdot)^{2}=\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\} in Algorithm 2 where [Var~hV~h+1](,)=ϕ(,),β~h[0,(Hh+1)2][ϕ(,),θ~h[0,Hh+1]]2\big{[}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}\big{]}(\cdot,\cdot)=\big{\langle}\phi(\cdot,\cdot),\widetilde{{\beta}}_{h}\big{\rangle}_{\left[0,(H-h+1)^{2}\right]}-\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{{\theta}}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2} (β~h\widetilde{\beta}_{h} and θ~h\widetilde{\theta}_{h} are defined in Algorithm 2) and σV~h+1(,)2:=max{1,VarPhV~h+1(,)}{\sigma}_{\widetilde{V}_{h+1}}(\cdot,\cdot)^{2}:=\max\{1,{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}(\cdot,\cdot)\}. Suppose Kmax{512log(2Hdδ)κ2,4λκ,128log2dHδκ2,2LHdκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa},\frac{128\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{H\sqrt{d\kappa}}\right\} and Kmax{4L2H2d3κ,32E2d2κ2,16λ2d2κ}K\geq\max\{\frac{4L^{2}}{H^{2}d^{3}\kappa},\frac{32E^{2}}{d^{2}\kappa^{2}},\frac{16\lambda^{2}}{d^{2}\kappa}\}, under the high probability event in Lemma D.2, it holds that with probability 16δ1-6\delta,

suph||σ~h2σV~h+12||36H4d3κKlog((λ+K)2KdH2λδ).\sup_{h}\lvert\lvert\widetilde{\sigma}^{2}_{h}-\sigma^{2}_{\widetilde{V}_{h+1}}\rvert\rvert_{\infty}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.
Proof of Lemma D.10.

Step 1: The first step is to show for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A}, with probability 13δ1-3\delta,

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)|12H4d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq 12\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Proof of Step 1. We can bound the left hand side by the following decomposition:

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)||ϕ(s,a),β~hh(V~h+1)2(s,a)|\displaystyle\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|
=\displaystyle= |ϕ(s,a)Σ~h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)h(V~h+1)2(s,a)|\displaystyle\left|\phi(s,a)^{\top}\widetilde{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|
\displaystyle\leq |ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2)h(V~h+1)2(s,a)|(1)+|ϕ(s,a)Σ¯h1ϕ1|(2)\displaystyle\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|}_{(1)}+\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\phi_{1}\right|}_{(2)}
+|ϕ(s,a)(Σ~h1Σ¯h1)(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)|(3),\displaystyle+\underbrace{\left|\phi(s,a)^{\top}(\widetilde{\Sigma}_{h}^{-1}-\bar{\Sigma}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)\right|}_{(3)},

where Σ¯h=Σ~hK1=τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI\bar{\Sigma}_{h}=\widetilde{\Sigma}_{h}-K_{1}=\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I.

Similar to the proof in Lemma D.5, when Kmax{128log2dHδκ2,2LHdκ}K\geq\max\{\frac{128\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{H\sqrt{d\kappa}}\}, it holds that with probability 1δ1-\delta, for all h[H]h\in[H],

λmin(Σ¯h)Kκ2,λmin(Σ~h)Kκ2,Σ~h1Σ¯h124E/κ2K2.\lambda_{\min}(\bar{\Sigma}_{h})\geq\frac{K\kappa}{2},\,\,\lambda_{\min}(\widetilde{\Sigma}_{h})\geq\frac{K\kappa}{2},\,\,\left\lVert\widetilde{\Sigma}^{-1}_{h}-\bar{\Sigma}^{-1}_{h}\right\rVert_{2}\leq\frac{4E/\kappa^{2}}{K^{2}}.

(The only difference from Lemma D.5 is that here λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)])κ\lambda_{\min}(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}])\geq\kappa.)

Under this high probability event, for term (2), it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)Σ¯h1ϕ1|ϕ(s,a)Σ¯h1ϕ1λmin1(Σ¯h)HL2HL/κK.\displaystyle\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\phi_{1}\right|\leq\left\lVert\phi(s,a)\right\rVert\cdot\left\lVert\bar{\Sigma}_{h}^{-1}\right\rVert\cdot\left\lVert\phi_{1}\right\rVert\leq\lambda_{\min}^{-1}(\bar{\Sigma}_{h})\cdot HL\leq\frac{2HL/\kappa}{K}. (45)

For term (3)(3), similar to Lemma D.6, we have for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)(Σ~h1Σ¯h1)(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)|42H2Ed/κ3/2K.\left|\phi(s,a)^{\top}(\widetilde{\Sigma}_{h}^{-1}-\bar{\Sigma}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)\right|\leq\frac{4\sqrt{2}H^{2}E\sqrt{d}/\kappa^{3/2}}{K}. (46)

(The only difference from Lemma D.6 is that here V~h+1(s)2H2\widetilde{V}_{h+1}(s)^{2}\leq H^{2}, ϕ12HL\left\lVert\phi_{1}\right\rVert_{2}\leq HL, Σ~h122Kκ\left\lVert\widetilde{\Sigma}_{h}^{-1}\right\rVert_{2}\leq\frac{2}{K\kappa} and Σ~h1Σ¯h124E/κ2K2\left\lVert\widetilde{\Sigma}^{-1}_{h}-\bar{\Sigma}^{-1}_{h}\right\rVert_{2}\leq\frac{4E/\kappa^{2}}{K^{2}}.)

We further decompose term (1) as below.

(1)=|ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2)h(V~h+1)2(s,a)|\displaystyle(1)=\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right| (47)
=\displaystyle= |ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI)𝒮(V~h+1)2(s)𝑑νh(s)|\displaystyle\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I)\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|
\displaystyle\leq |ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))|(4)+λ|ϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)|(5).\displaystyle\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right|}_{(4)}+\underbrace{\lambda\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|}_{(5)}.

For term (5), because Kmax{512log(2Hdδ)κ2,4λκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa}\right\}, by Lemma E.13 and a union bound, with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

λ|ϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)|λϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)Σ¯h1\displaystyle\lambda\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|\leq\lambda\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\left\lVert\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right\rVert_{\bar{\Sigma}^{-1}_{h}} (48)
\displaystyle\leq λ2Kϕ(s,a)(Σhp)12K𝒮(V~h+1)2(s)𝑑νh(s)(Σhp)14λ(Σhp)1H2dK4λH2dκK,\displaystyle\lambda\frac{2}{\sqrt{K}}\left\lVert\phi(s,a)\right\rVert_{({\Sigma}^{p}_{h})^{-1}}\frac{2}{\sqrt{K}}\left\lVert\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right\rVert_{({\Sigma}^{p}_{h})^{-1}}\leq 4\lambda\left\lVert({\Sigma}^{p}_{h})^{-1}\right\rVert\frac{H^{2}\sqrt{d}}{K}\leq 4\lambda\frac{H^{2}\sqrt{d}}{\kappa K},

where Σhp=𝔼μ,h[ϕ(s,a)ϕ(s,a)]\Sigma_{h}^{p}=\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}] and λmin(Σhp)κ\lambda_{\min}(\Sigma_{h}^{p})\geq\kappa.

For term (4), it can be bounded by the following inequality (because of Cauchy-Schwarz inequality).

(4)ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h1.(4)\leq\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\cdot\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}. (49)

Bounding using covering. Note that for any fixed Vh+1{V}_{h+1}, we can define xτ=ϕ(s¯hτ,a¯hτ)x_{\tau}=\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau}) (with ϕ21\left\lVert\phi\right\rVert_{2}\leq 1) and ητ=Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ)\eta_{\tau}={V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h}), which is H2H^{2}-subgaussian. Then by Lemma E.9 (where t=Kt=K and L=1L=1), it holds that with probability 1δ1-\delta,

τ=1Kϕ(s¯hτ,a¯hτ)(Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ))Σ¯h18H4d2log(λ+Kλδ).\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left({V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}\right)}.

Let 𝒩h(ϵ)\mathcal{N}_{h}(\epsilon) be the minimal ϵ\epsilon-cover (with respect to the supremum norm) of
𝒱h:={Vh:Vh()=maxa𝒜{min{ϕ(,a)θC1dϕ(,a)Λ~h1ϕ(,a)C2,Hh+1}+}}.\mathcal{V}_{h}:=\left\{V_{h}:V_{h}(\cdot)=\max_{a\in\mathcal{A}}\{\min\{\phi(\cdot,a)^{\top}\theta-C_{1}\sqrt{d\cdot\phi(\cdot,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(\cdot,a)}-C_{2},H-h+1\}^{+}\}\right\}. That is, for any V𝒱hV\in\mathcal{V}_{h}, there exists a value function V𝒩h(ϵ)V^{\prime}\in\mathcal{N}_{h}(\epsilon) such that sups𝒮|V(s)V(s)|<ϵ\sup_{s\in\mathcal{S}}|V(s)-V^{\prime}(s)|<\epsilon. Now by a union bound, we obtain with probability 1δ1-\delta,

supVh+1𝒩h+1(ϵ)τ=1Kϕ(s¯hτ,a¯hτ)(Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ))Σ¯h18H4d2log(λ+Kλδ|𝒩h+1(ϵ)|)\sup_{V_{h+1}\in\mathcal{N}_{h+1}(\epsilon)}\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left({V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}|\mathcal{N}_{h+1}(\epsilon)|\right)}

which implies

τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h1\displaystyle\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}
\displaystyle\leq 8H4d2log(λ+Kλδ|𝒩h+1(ϵ)|)+4H2ϵ2K2/λ\displaystyle\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}|\mathcal{N}_{h+1}(\epsilon)|\right)}+4H^{2}\sqrt{\epsilon^{2}K^{2}/\lambda}

Choosing ϵ=dλ/K\epsilon=d\sqrt{\lambda}/K and applying Lemma B.3 of [Jin et al., 2021] (whose conclusion holds here even though we have an extra constant C2C_{2}) to the covering number 𝒩h+1(ϵ)\mathcal{N}_{h+1}(\epsilon) w.r.t. 𝒱h+1\mathcal{V}_{h+1}, we can further bound the above by

\displaystyle\leq 8H4d32log(λ+Kλδ2dHK)+4H2d26H4d3log(λ+Kλδ2dHK)\displaystyle\sqrt{8H^{4}\cdot\frac{d^{3}}{2}\log\left(\frac{\lambda+K}{\lambda\delta}2dHK\right)}+4H^{2}\sqrt{d^{2}}\leq 6\sqrt{H^{4}\cdot d^{3}\log\left(\frac{\lambda+K}{\lambda\delta}2dHK\right)}

Applying a union bound over h[H]h\in[H], we have that with probability 1δ1-\delta, for all h[H]h\in[H],

τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h16H4d3log((λ+K)2KdH2λδ)\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq 6\sqrt{{H^{4}d^{3}}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)} (50)

and similar to term (2)(2), it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Σ¯h1Σ¯h12κK.\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{\left\lVert\bar{\Sigma}^{-1}_{h}\right\rVert}\leq\sqrt{\frac{2}{\kappa K}}. (51)

Combining (45), (46), (47), (48), (49), (50), (51) and the assumption that Kmax{4L2H2d3κ,32E2d2κ2,16λ2d2κ}K\geq\max\{\frac{4L^{2}}{H^{2}d^{3}\kappa},\frac{32E^{2}}{d^{2}\kappa^{2}},\frac{16\lambda^{2}}{d^{2}\kappa}\}, we obtain with probability 13δ1-3\delta for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)|12H4d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq 12\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Step 2: The second step is to show for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A}, with probability 13δ1-3\delta,

|ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|12H2d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\leq 12\sqrt{\frac{H^{2}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}. (52)

The proof of Step 2 is nearly identical to that of Step 1, except that V~h+12\widetilde{V}^{2}_{h+1} is replaced by V~h+1\widetilde{V}_{h+1}.

Step 3: The last step is to prove suph||σ~h2σV~h+12||36H4d3κKlog((λ+K)2KdH2λδ)\sup_{h}\lvert\lvert\widetilde{\sigma}^{2}_{h}-\sigma^{2}_{\widetilde{V}_{h+1}}\rvert\rvert_{\infty}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)} with high probability.

Proof of Step 3. By (52),

|[ϕ(,),θ~h[0,Hh+1]]2[h(V~h+1)(s,a)]2|\displaystyle\left|\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{{\theta}}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2}-\big{[}{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\big{]}^{2}\right|
=\displaystyle= |ϕ(s,a),θ~h[0,Hh+1]+h(V~h+1)(s,a)||ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|\displaystyle\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}+{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\cdot\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|
\displaystyle\leq 2H|ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|24H4d3κKlog((λ+K)2KdH2λδ).\displaystyle 2H\cdot\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\leq 24\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Combining this with Step 1, we have with probability 16δ1-6\delta, h,s,a[H]×𝒮×𝒜\forall h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|Var~hV~h+1(s,a)VarPhV~h+1(s,a)|36H4d3κKlog((λ+K)2KdH2λδ).\bigg{|}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(s,a)-{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}(s,a)\bigg{|}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Finally, by the non-expansiveness of the operator max{1,}\max\{1,\cdot\}, the proof is complete. ∎
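
For readers who prefer a computational view of the quantities in Lemma D.10, here is a minimal sketch of the variance estimate σ̃_h²(s,a) = max{1, ⟨ϕ, β̃_h⟩_[0,(H−h+1)²] − (⟨ϕ, θ̃_h⟩_[0,H−h+1])²}, with β̃_h and θ̃_h obtained from noisy ridge regressions of Ṽ_{h+1}² and Ṽ_{h+1} as in the decomposition of Step 1. The noise scale `sigma_dp` is an illustrative stand-in and the Gram-matrix noise K₁ is omitted, so this is a simplified sketch rather than Algorithm 2 verbatim.

```python
import numpy as np

def private_variance_estimate(phi_bar, V_next, phi_query, H, h, lam, sigma_dp, rng):
    """Sketch of the conditional-variance estimate recalled in Lemma D.10:
    Var_tilde = <phi, beta_tilde>_[0,(H-h+1)^2] - (<phi, theta_tilde>_[0,H-h+1])^2,
    sigma_tilde^2 = max{1, Var_tilde}."""
    d = phi_bar.shape[1]
    Sigma = phi_bar.T @ phi_bar + lam * np.eye(d)                       # Sigma_bar_h (noise on it omitted)
    b2 = phi_bar.T @ V_next ** 2 + rng.normal(scale=sigma_dp, size=d)   # noisy counter for V^2
    b1 = phi_bar.T @ V_next + rng.normal(scale=sigma_dp, size=d)        # noisy counter for V
    beta_tilde = np.linalg.solve(Sigma, b2)
    theta_tilde = np.linalg.solve(Sigma, b1)
    second_moment = np.clip(phi_query @ beta_tilde, 0.0, (H - h + 1) ** 2)
    first_moment = np.clip(phi_query @ theta_tilde, 0.0, H - h + 1)
    return max(1.0, second_moment - first_moment ** 2)                  # sigma_tilde_h^2(s,a)

rng = np.random.default_rng(3)
K, d, H, h = 2000, 10, 20, 5
phi_bar = rng.normal(size=(K, d))
phi_bar /= np.linalg.norm(phi_bar, axis=1, keepdims=True)               # ||phi||_2 <= 1
V_next = rng.uniform(0, H - h, size=K)                                  # plays the role of V_tilde_{h+1}(s')
phi_q = rng.normal(size=d); phi_q /= np.linalg.norm(phi_q)
print(private_variance_estimate(phi_bar, V_next, phi_q, H, h, lam=1.0, sigma_dp=1.0, rng=rng))
```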

D.3.4 Validity of our pessimism

Recall the definition Λ^h=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\widehat{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda\cdot I and
Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Then we have the following lemma to bound the term ϕ(s,a)Λ^h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)} by ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\Lambda_{h}^{-1}\phi(s,a)}.

Lemma D.11 (Private version of Lemma C.3 in [Yin et al., 2022]).

Denote the quantities C1=max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2}C_{1}=\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\} and C2=O~(H12d3/κ5)C_{2}=\widetilde{O}(H^{12}d^{3}/\kappa^{5}). Suppose the number of episodes KK satisfies K>max{C1,C2}K>\max\{C_{1},C_{2}\} and the condition in Lemma D.10. Then, under the high probability events in Lemma D.2 and Lemma D.10, it holds that with probability 12δ1-2\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λ^h1ϕ(s,a)2ϕ(s,a)Λh1ϕ(s,a).\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\leq 2\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}.
Proof of Lemma D.11.

By definition ϕ(s,a)Λ^h1ϕ(s,a)=ϕ(s,a)Λ^h1\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}=\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{-1}_{h}}. Then denote

Λ^h=1KΛ^h,Λh=1KΛh,\widehat{\Lambda}^{\prime}_{h}=\frac{1}{K}\widehat{\Lambda}_{h},\quad{\Lambda}^{\prime}_{h}=\frac{1}{K}{\Lambda}_{h},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Under the assumption of KK, by the conclusion in Lemma D.10, we have

Λ^hΛhsups,aϕ(s,a)ϕ(s,a)σ~h2(s,a)ϕ(s,a)ϕ(s,a)σV~h+12(s,a)\displaystyle\left\lVert\widehat{\Lambda}^{\prime}_{h}-{\Lambda}^{\prime}_{h}\right\rVert\leq\sup_{s,a}\left\lVert\frac{\phi(s,a)\phi(s,a)^{\top}}{\widetilde{\sigma}^{2}_{h}(s,a)}-\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right\rVert (53)
\displaystyle\leq sups,a|σ~h2(s,a)σV~h+12(s,a)σ~h2(s,a)σV~h+12(s,a)|ϕ(s,a)2\displaystyle\sup_{s,a}\left|\frac{\widetilde{\sigma}^{2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{\widetilde{\sigma}^{2}_{h}(s,a)\cdot{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right|\cdot\left\lVert\phi(s,a)\right\rVert^{2}
\displaystyle\leq sups,a|σ~h2(s,a)σV~h+12(s,a)1|1\displaystyle\sup_{s,a}\left|\frac{\widetilde{\sigma}^{2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{1}\right|\cdot 1
\displaystyle\leq 36H4d3κKlog((λ+K)2KdH2λδ).\displaystyle 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Next, by Lemma E.12 (applied with ϕ\phi replaced by ϕ/σV~h+1\phi/\sigma_{\widetilde{V}_{h+1}}, so that C=1C=1) and a union bound, it holds with probability 1δ1-\delta that, for all h[H]h\in[H],

Λh(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)]+λKId)42K(log2dHδ)1/2.\left\lVert\Lambda^{\prime}_{h}-\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]+\frac{\lambda}{K}I_{d}\right)\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}.

Therefore, by Weyl’s inequality and the assumption that KK satisfies
K>max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2}K>\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\}, the above inequality leads to

Λh=\displaystyle\left\lVert\Lambda^{\prime}_{h}\right\rVert= λmax(Λh)λmax(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])+λK+42K(log2dHδ)1/2\displaystyle\lambda_{\max}(\Lambda^{\prime}_{h})\leq\lambda_{\max}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
=\displaystyle= 𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)]2+λK+42K(log2dHδ)1/2\displaystyle\left\lVert\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right\rVert_{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq ϕ(s,a)2+λK+42K(log2dHδ)1/21+λK+42K(log2dHδ)1/22,\displaystyle\left\lVert\phi(s,a)\right\rVert^{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 1+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 2,
λmin(Λh)\displaystyle\lambda_{\min}(\Lambda^{\prime}_{h})\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])+λK42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq κH242K(log2dHδ)1/2κ2H2.\displaystyle\frac{\kappa}{H^{2}}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\geq\frac{\kappa}{2H^{2}}.

Hence, with probability 1δ1-\delta, Λh2\left\lVert\Lambda^{\prime}_{h}\right\rVert\leq 2 and Λh1=λmin1(Λh)2H2κ\left\lVert\Lambda^{\prime-1}_{h}\right\rVert=\lambda^{-1}_{\min}(\Lambda^{\prime}_{h})\leq\frac{2H^{2}}{\kappa}. Similarly, one can show Λ^h12H2κ\left\lVert\widehat{\Lambda}^{\prime-1}_{h}\right\rVert\leq\frac{2H^{2}}{\kappa} with probability 1δ1-\delta using an identical argument.

Now, applying Lemma E.11 and a union bound to Λ^h\widehat{\Lambda}^{\prime}_{h} and Λh{\Lambda}^{\prime}_{h}, we obtain that with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λ^h1\displaystyle\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{\prime-1}_{h}}\leq [1+Λh1ΛhΛ^h1Λ^hΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\left\lVert\Lambda^{\prime-1}_{h}\right\rVert\cdot\left\lVert\Lambda^{\prime}_{h}\right\rVert\cdot\left\lVert\widehat{\Lambda}^{\prime-1}_{h}\right\rVert\cdot\left\lVert\widehat{\Lambda}^{\prime}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq [1+2H2κ22H2κΛ^hΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{2H^{2}}{\kappa}\cdot 2\cdot\frac{2H^{2}}{\kappa}\cdot\left\lVert\widehat{\Lambda}^{\prime}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq [1+288H4κ2(H4d3κKlog((λ+K)2KdH2λδ))]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{288H^{4}}{\kappa^{2}}\left(\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}\right)}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq 2ϕ(s,a)Λh1\displaystyle 2\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}

where the third inequality uses (53) and the last inequality uses K>O~(H12d3/κ5)K>\widetilde{O}(H^{12}d^{3}/\kappa^{5}). The conclusion then follows directly by multiplying both sides of the above inequality by 1/K1/\sqrt{K}. ∎

In order to bound τ=1KxτητΛ^h1\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}, we apply the following Lemma D.12.

Lemma D.12 (Lemma C.4 in [Yin et al., 2022]).

Recall xτ=ϕ(shτ,ahτ)σ~h(shτ,ahτ)x_{\tau}=\frac{\phi(s^{\tau}_{h},a^{\tau}_{h})}{\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h})} and
ητ=(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h(shτ,ahτ)\eta_{\tau}=\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h}). Denote

ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|.\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|.

Suppose KO~(H12d3/κ5)K\geq\widetilde{O}(H^{12}d^{3}/\kappa^{5}) (note that this assumption is stronger than the one in [Yin et al., 2022], so the conclusion of Lemma C.4 still holds). Then with probability 1δ1-\delta,

τ=1KxτητΛ^h1O~(max{d,ξ}),\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}\leq\widetilde{O}\left(\max\big{\{}\sqrt{d},\xi\big{\}}\right),

where O~\widetilde{O} absorbs constants and Polylog terms.

Now we are ready to prove the following key lemma, which gives a high probability bound for |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|.

Lemma D.13.

Assume K>max{1,2,3,4}K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\} and suppose d>ξ\sqrt{d}>\xi, where ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|. Then for any 0<λ<κ0<\lambda<\kappa, with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λ~h1ϕ(s,a))+DK,\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K},

where Λ~h=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI+K2\widetilde{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I+K_{2},

D=O~(H2Lκ+H4Edκ3/2+H3d+H2Edκ)=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}+\frac{H^{2}\sqrt{Ed}}{\kappa}\right)=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right)

and O~\widetilde{O} absorbs constants and Polylog terms.

Proof of Lemma D.13.

The proof is by combining (40), (42), Lemma D.4, Lemma D.6, Lemma D.7, Lemma D.8, Lemma D.12 and a union bound. ∎

Remark D.14.

Under the same assumption of Lemma D.13, because of Remark D.9 and Lemma D.11, we have with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λ~h1ϕ(s,a))+DK\displaystyle\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K} (54)
\displaystyle\leq O~(dϕ(s,a)Λ^h1ϕ(s,a))+2DK\displaystyle\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{2D}{K}
\displaystyle\leq O~(2dϕ(s,a)Λh1ϕ(s,a))+2DK.\displaystyle\widetilde{O}\left(2\sqrt{d}\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{2D}{K}.

Because D=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and O~\widetilde{O} absorbs constants, we write the bound as below for simplicity:

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λh1ϕ(s,a))+DK.\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K}. (55)

D.3.5 Finalize the proof of the first part

We are ready to prove the first part of Theorem 4.1.

Theorem D.15 (First part of Theorem 4.1).

Let KK be the number of episodes. Suppose d>ξ\sqrt{d}>\xi, where ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right| and K>max{1,2,3,4}K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}. Then for any 0<λ<κ0<\lambda<\kappa, with probability 1δ1-\delta, for all policies π\pi simultaneously, the output π^\widehat{\pi} of Algorithm 2 satisfies

vπvπ^O~(dh=1H𝔼π[(ϕ(,)Λh1ϕ(,))1/2])+DHK,v^{\pi}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\right]\right)+\frac{DH}{K},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)σV~h+12(shτ,ahτ)+λId\Lambda_{h}=\sum_{\tau=1}^{K}\frac{\phi(s_{h}^{\tau},a_{h}^{\tau})\cdot\phi(s_{h}^{\tau},a_{h}^{\tau})^{\top}}{\sigma^{2}_{\widetilde{V}_{h+1}}(s_{h}^{\tau},a_{h}^{\tau})}+\lambda I_{d}, D=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and O~\widetilde{O} absorbs constants and Polylog terms.

Proof of Theorem D.15.

Combining Lemma D.3 and Remark D.14, we have that with probability 1δ1-\delta, for all policies π\pi simultaneously,

V1π(s)V1π^(s)O~(dh=1H𝔼π[(ϕ(,)Λh1ϕ(,))1/2|s1=s])+DHK,V^{\pi}_{1}(s)-V^{\widehat{\pi}}_{1}(s)\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{1}=s\right]\right)+\frac{DH}{K}, (56)

and the proof is complete by taking the expectation over the initial distribution d1d_{1} on both sides. ∎

D.3.6 Finalize the proof of the second part

To prove the second part of Theorem 4.1, we begin with a crude bound on suphVhV~h\sup_{h}\left\lVert V^{\star}_{h}-\widetilde{V}_{h}\right\rVert_{\infty}.

Lemma D.16 (Private version of Lemma C.8 in [Yin et al., 2022]).

Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, under the high probability event in Lemma D.13, with probability at least 1δ1-\delta,

suphVhV~hO~(H2dκK).\sup_{h}\left\lVert V^{\star}_{h}-\widetilde{V}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).
Proof of Lemma D.16.

Step 1: The first step is to show with probability at least 1δ1-\delta, suphVhVhπ^O~(H2dκK)\sup_{h}\left\lVert V^{\star}_{h}-{V}^{\widehat{\pi}}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).

Indeed, combining Lemma D.3 and Lemma D.13, similarly to the proof of Theorem D.15, we directly have that with probability 1δ1-\delta, for all policies π\pi simultaneously and for all s𝒮s\in\mathcal{S}, h[H]h\in[H],

Vhπ(s)Vhπ^(s)O~(dt=hH𝔼π[(ϕ(,)Λt1ϕ(,))1/2|sh=s])+DHK,V^{\pi}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{t=h}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{t}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{h}=s\right]\right)+\frac{DH}{K}, (57)

Next, since Kmax{512log(2Hdδ)κ2,4λκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa}\right\}, by Lemma E.13 and a union bound over h[H]h\in[H], with probability 1δ1-\delta,

sups,aϕ(s,a)Λh12Ksups,aϕ(s,a)(Λhp)12Kλmin1(Λhp)2HκK,h[H],\displaystyle\sup_{s,a}\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{-1}_{h}}\leq\frac{2}{\sqrt{K}}\sup_{s,a}\left\lVert\phi(s,a)\right\rVert_{({\Lambda}^{p}_{h})^{-1}}\leq\frac{2}{\sqrt{K}}\sqrt{\lambda_{\min}^{-1}(\Lambda_{h}^{p})}\leq\frac{2H}{\sqrt{\kappa K}},\;\;\forall h\in[H],

where Λhp=𝔼μ,h[σV~h+12(s,a)ϕ(s,a)ϕ(s,a)]\Lambda_{h}^{p}=\mathbb{E}_{\mu,h}[\sigma^{-2}_{\widetilde{V}_{h+1}}(s,a)\phi(s,a)\phi(s,a)^{\top}] and λmin(Λhp)κH2\lambda_{\min}(\Lambda_{h}^{p})\geq\frac{\kappa}{H^{2}}.

Lastly, taking π=π\pi=\pi^{\star} in (57), we obtain

0Vhπ(s)Vhπ^(s)\displaystyle 0\leq V^{\pi^{\star}}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\leq O~(dt=hH𝔼π[(ϕ(,)Λt1ϕ(,))1/2|sh=s])+DHK\displaystyle\widetilde{O}\left(\sqrt{d}\cdot\sum_{t=h}^{H}\mathbb{E}_{\pi^{\star}}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{t}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{h}=s\right]\right)+\frac{DH}{K} (58)
\displaystyle\leq O~(H2dκK)+O~(H3L/κK+H5Ed/κ3/2K+H4dK).\displaystyle\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right)+\widetilde{O}\left(\frac{H^{3}L/\kappa}{K}+\frac{H^{5}E\sqrt{d}/\kappa^{3/2}}{K}+\frac{H^{4}\sqrt{d}}{K}\right).

Under the condition K>max{H2L2dκ,H6E2κ2,H4κ}K>\max\{\frac{H^{2}L^{2}}{d\kappa},\frac{H^{6}E^{2}}{\kappa^{2}},H^{4}\kappa\}, each term in the second O~()\widetilde{O}(\cdot) is bounded by O~(H2dκK)\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right), which finishes the proof of Step 1.
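For instance, the first of the three terms is controlled as follows (a quick check; the remaining two terms are handled in the same way using the corresponding lower bounds on K):

\frac{H^{3}L/\kappa}{K}=\frac{H^{3}L}{\kappa\sqrt{K}}\cdot\frac{1}{\sqrt{K}}\leq\frac{H^{3}L}{\kappa\sqrt{K}}\cdot\sqrt{\frac{d\kappa}{H^{2}L^{2}}}=\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}.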

Step 2: The second step is to show with probability 1δ1-\delta, suphV~hVhπ^O~(H2dκK)\sup_{h}\left\lVert\widetilde{V}_{h}-{V}^{\widehat{\pi}}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).

Indeed, applying Lemma E.7 with π=π=π^\pi=\pi^{\prime}=\widehat{\pi}, we have that with probability 1δ1-\delta, for all s,hs,h,

|V~h(s)Vhπ^(s)|=|t=hH𝔼π^[Q^h(sh,ah)(𝒯hV~h+1)(sh,ah)|sh=s]|\displaystyle\left|\widetilde{V}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\right|=\left|\sum_{t=h}^{H}\mathbb{E}_{\widehat{\pi}}\left[\widehat{Q}_{h}(s_{h},a_{h})-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)(s_{h},a_{h})\middle|s_{h}=s\right]\right|
\displaystyle\leq t=hH(𝒯~hV~h+1𝒯hV~h+1)(s,a)+HΓh(s,a)\displaystyle\sum_{t=h}^{H}\left\lVert(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}-\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\right\rVert_{\infty}+H\cdot\left\lVert\Gamma_{h}(s,a)\right\rVert_{\infty}
\displaystyle\leq O~(Hdϕ(s,a)Λh1ϕ(s,a))+O~(DHK)\displaystyle\widetilde{O}\left(H\sqrt{d}\left\lVert\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right\rVert_{\infty}\right)+\widetilde{O}\left(\frac{DH}{K}\right)
\displaystyle\leq O~(H2dκK),\displaystyle\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right),

where the second inequality uses Lemma D.13 and Remark D.14, and the last inequality holds for the same reason as in Step 1.

Step 3: The proof of the lemma is complete by combining Step 1 and Step 2 via the triangle inequality and a union bound. ∎

Next, we give a high-probability bound on suph||σV~h+12σh2||\sup_{h}\lvert\lvert\sigma^{2}_{\widetilde{V}_{h+1}}-\sigma^{\star 2}_{h}\rvert\rvert_{\infty}.

Lemma D.17 (Private version of Lemma C.10 in [Yin et al., 2022]).

Recall σV~h+12=max{1,VarPhV~h+1}\sigma^{2}_{\widetilde{V}_{h+1}}=\max\left\{1,{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}\right\} and σh2=max{1,VarPhVh+1}\sigma^{\star 2}_{h}=\max\left\{1,{\mathrm{Var}}_{P_{h}}{V}^{\star}_{h+1}\right\}. Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, then with probability 1δ1-\delta,

suph||σV~h+12σh2||O~(H3dκK).\sup_{h}\lvert\lvert\sigma^{2}_{\widetilde{V}_{h+1}}-\sigma^{\star 2}_{h}\rvert\rvert_{\infty}\leq\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).
Proof of Lemma D.17.

By definition and the non-expansiveness of max{1,}\max\{1,\cdot\}, we have

σV~h+12σh2VarV~h+1VarVh+1\displaystyle\left\lVert{\sigma}^{2}_{\widetilde{V}_{h+1}}-{\sigma}^{\star 2}_{h}\right\rVert_{\infty}\leq\left\lVert\mathrm{Var}\widetilde{V}_{h+1}-\mathrm{Var}{V}^{\star}_{h+1}\right\rVert_{\infty}
\displaystyle\leq h(V~h+12Vh+12)+(hV~h+1)2(hVh+1)2\displaystyle\left\lVert\mathbb{P}_{h}\left(\widetilde{V}^{2}_{h+1}-{V}^{\star 2}_{h+1}\right)\right\rVert_{\infty}+\left\lVert(\mathbb{P}_{h}\widetilde{V}_{h+1})^{2}-(\mathbb{P}_{h}{V}^{\star}_{h+1})^{2}\right\rVert_{\infty}
\displaystyle\leq V~h+12Vh+12+(hV~h+1+hVh+1)(hV~h+1hVh+1)\displaystyle\left\lVert\widetilde{V}^{2}_{h+1}-{V}^{\star 2}_{h+1}\right\rVert_{\infty}+\left\lVert(\mathbb{P}_{h}\widetilde{V}_{h+1}+\mathbb{P}_{h}{V}^{\star}_{h+1})(\mathbb{P}_{h}\widetilde{V}_{h+1}-\mathbb{P}_{h}{V}^{\star}_{h+1})\right\rVert_{\infty}
\displaystyle\leq 2HV~h+1Vh+1+2HhV~h+1hVh+1\displaystyle 2H\left\lVert\widetilde{V}_{h+1}-{V}^{\star}_{h+1}\right\rVert_{\infty}+2H\left\lVert\mathbb{P}_{h}\widetilde{V}_{h+1}-\mathbb{P}_{h}{V}^{\star}_{h+1}\right\rVert_{\infty}
\displaystyle\leq O~(H3dκK).\displaystyle\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).

The second inequality follows from the definition of variance, and the last inequality comes from Lemma D.16. ∎

We relate ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)} to ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{\star-1}\phi(s,a)} via the following Lemma D.18.

Lemma D.18 (Private version of Lemma C.11 in [Yin et al., 2022]).

Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, then with probability 1δ1-\delta,

ϕ(s,a)Λh1ϕ(s,a)2ϕ(s,a)Λh1ϕ(s,a),h,s,a[H]×𝒮×𝒜,\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\leq 2\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{\star-1}\phi(s,a)},\quad\forall h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},
Proof of Lemma D.18.

By definition ϕ(s,a)Λh1ϕ(s,a)=ϕ(s,a)Λh1\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}=\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{-1}_{h}}. Then denote

Λh=1KΛh,Λh=1KΛh,{\Lambda}^{\prime}_{h}=\frac{1}{K}{\Lambda}_{h},\quad{\Lambda}^{\star^{\prime}}_{h}=\frac{1}{K}{\Lambda}^{\star}_{h},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σVh+12(shτ,ahτ)+λI{\Lambda}^{\star}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{{V}^{\star}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Under the condition of KK, by Lemma D.17, with probability 1δ1-\delta, for all h[H]h\in[H],

ΛhΛhsups,aϕ(s,a)ϕ(s,a)σh2(s,a)ϕ(s,a)ϕ(s,a)σV~h+12(s,a)\displaystyle\left\lVert{\Lambda}^{\star^{\prime}}_{h}-{\Lambda}^{\prime}_{h}\right\rVert\leq\sup_{s,a}\left\lVert\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{\star 2}_{h}(s,a)}-\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right\rVert (59)
\displaystyle\leq sups,a|σh2(s,a)σV~h+12(s,a)σh2(s,a)σV~h+12(s,a)|ϕ(s,a)2\displaystyle\sup_{s,a}\left|\frac{{\sigma}^{\star 2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{{\sigma}^{\star 2}_{h}(s,a)\cdot{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right|\cdot\left\lVert\phi(s,a)\right\rVert^{2}
\displaystyle\leq sups,a|σh2(s,a)σV~h+12(s,a)1|1\displaystyle\sup_{s,a}\left|\frac{{\sigma}^{\star 2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{1}\right|\cdot 1
\displaystyle\leq O~(H3dκK).\displaystyle\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).

Next, by Lemma E.12 (with ϕ\phi taken to be ϕ/σVh+1\phi/\sigma_{{V}^{\star}_{h+1}} and C=1C=1), it holds with probability 1δ1-\delta that

Λh(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)]+λKId)42K(log2dHδ)1/2.\left\lVert\Lambda^{\star^{\prime}}_{h}-\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]+\frac{\lambda}{K}I_{d}\right)\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}.

Therefore by Weyl’s inequality and the condition K>max{2λ,128log(2dHδ),128H4log(2dH/δ)κ2}K>\max\{2\lambda,128\log\left(\frac{2dH}{\delta}\right),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\}, the above inequality implies

Λh=\displaystyle\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert= λmax(Λh)λmax(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])+λK+42K(log2dHδ)1/2\displaystyle\lambda_{\max}(\Lambda^{\star^{\prime}}_{h})\leq\lambda_{\max}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq 𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)]+λK+42K(log2dHδ)1/2\displaystyle\left\lVert\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right\rVert+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq ϕ(s,a)2+λK+42K(log2dHδ)1/21+λK+42K(log2dHδ)1/22,\displaystyle\left\lVert\phi(s,a)\right\rVert^{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 1+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 2,
λmin(Λh)\displaystyle\lambda_{\min}(\Lambda^{\star^{\prime}}_{h})\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])+λK42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq κH242K(log2dHδ)1/2κ2H2.\displaystyle\frac{\kappa}{H^{2}}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\geq\frac{\kappa}{2H^{2}}.

Hence with probability 1δ1-\delta, Λh2\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert\leq 2 and Λh1=λmin1(Λh)2H2κ\left\lVert\Lambda^{\star^{\prime}-1}_{h}\right\rVert=\lambda_{\min}^{-1}(\Lambda^{\star^{\prime}}_{h})\leq\frac{2H^{2}}{\kappa}. Similarly, we can show that Λh12H2κ\left\lVert{\Lambda}^{{}^{\prime}-1}_{h}\right\rVert\leq\frac{2H^{2}}{\kappa} holds with probability 1δ1-\delta by an identical argument.

Now, applying Lemma E.11 and a union bound to Λh{\Lambda}^{\star^{\prime}}_{h} and Λh{\Lambda}^{\prime}_{h}, we obtain that with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λh1\displaystyle\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{\prime-1}_{h}}\leq [1+Λh1ΛhΛh1ΛhΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\left\lVert\Lambda^{\star^{\prime}-1}_{h}\right\rVert\cdot\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert\cdot\left\lVert{\Lambda}^{\prime-1}_{h}\right\rVert\cdot\left\lVert{\Lambda}^{\star^{\prime}}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq [1+2H2κ22H2κΛhΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{2H^{2}}{\kappa}\cdot 2\cdot\frac{2H^{2}}{\kappa}\cdot\left\lVert{\Lambda}^{\star^{\prime}}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq [1+H4κ2[O~(H3dκK)]]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{H^{4}}{\kappa^{2}}\left[\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right)\right]}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq 2ϕ(s,a)Λh1\displaystyle 2\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}

where the third inequality uses (59) and the last inequality uses KO~(H14d/κ5)K\geq\widetilde{O}(H^{14}d/\kappa^{5}). The conclusion follows directly by multiplying both sides of the above inequality by 1/K1/\sqrt{K}. ∎

Finally, the second part of Theorem 4.1 can be proven by combining Theorem D.15 (with π=π\pi=\pi^{\star}) and Lemma D.18.

D.4 Put everything together

Combining Lemma D.1, Theorem D.15, and the discussion above, the proof of Theorem 4.1 is complete.

Appendix E Assisting technical lemmas

Lemma E.1 (Multiplicative Chernoff bound [Chernoff et al., 1952]).

Let XX be a Binomial random variable with parameters p,np,n. For any 1θ>01\geq\theta>0, we have that

[X<(1θ)pn]<eθ2pn2, and [X(1+θ)pn]<eθ2pn3\mathbb{P}[X<(1-\theta)pn]<e^{-\frac{\theta^{2}pn}{2}},\quad\text{ and }\quad\mathbb{P}[X\geq(1+\theta)pn]<e^{-\frac{\theta^{2}pn}{3}}
Lemma E.2 (Hoeffding’s Inequality [Sridharan, 2002]).

Let x1,,xnx_{1},...,x_{n} be independent bounded random variables such that 𝔼[xi]=0\mathbb{E}[x_{i}]=0 and |xi|ξi|x_{i}|\leq\xi_{i} with probability 11. Then for any ϵ>0\epsilon>0 we have

(1ni=1nxiϵ)e2n2ϵ2i=1nξi2.\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}\geq\epsilon\right)\leq e^{-\frac{2n^{2}\epsilon^{2}}{\sum_{i=1}^{n}\xi_{i}^{2}}}.
Lemma E.3 (Bernstein’s Inequality).

Let x1,,xnx_{1},...,x_{n} be independent bounded random variables such that 𝔼[xi]=0\mathbb{E}[x_{i}]=0 and |xi|ξ|x_{i}|\leq\xi with probability 11. Let σ2=1ni=1nVar[xi]\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}[x_{i}], then with probability 1δ1-\delta we have

1ni=1nxi2σ2log(1/δ)n+2ξ3nlog(1/δ).\frac{1}{n}\sum_{i=1}^{n}x_{i}\leq\sqrt{\frac{2\sigma^{2}\cdot\log(1/\delta)}{n}}+\frac{2\xi}{3n}\log(1/\delta).
Lemma E.4 (Empirical Bernstein’s Inequality [Maurer and Pontil, 2009]).

Let x1,,xnx_{1},...,x_{n} be i.i.d random variables such that |xi|ξ|x_{i}|\leq\xi with probability 11. Let x¯=1ni=1nxi\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i} and V^n=1ni=1n(xix¯)2\widehat{V}_{n}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}, then with probability 1δ1-\delta we have

|1ni=1nxi𝔼[x]|2V^nlog(2/δ)n+7ξ3nlog(2/δ).\left|\frac{1}{n}\sum_{i=1}^{n}x_{i}-\mathbb{E}[x]\right|\leq\sqrt{\frac{2\widehat{V}_{n}\cdot\log(2/\delta)}{n}}+\frac{7\xi}{3n}\log(2/\delta).
Lemma E.5 (Lemma I.8 in [Yin and Wang, 2021b]).

Let n2n\geq 2 and VSV\in\mathbb{R}^{S} be any function with VH||V||_{\infty}\leq H, PP be any SS-dimensional distribution and P^\widehat{P} be its empirical version using nn samples. Then with probability 1δ1-\delta,

|VarP^(V)n1nVarP(V)|2Hlog(2/δ)n1.\left|\sqrt{\mathrm{Var}_{\widehat{P}}(V)}-\sqrt{\frac{n-1}{n}\mathrm{Var}_{{P}}(V)}\right|\leq 2H\sqrt{\frac{\log(2/\delta)}{n-1}}.
Lemma E.6 (Claim 2 in [Vietri et al., 2020]).

Let yy\in\mathbb{R} be any positive real number. Then for all xx\in\mathbb{R} with x2yx\geq 2y, it holds that 1xy1x+2yx2\frac{1}{x-y}\leq\frac{1}{x}+\frac{2y}{x^{2}}.
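For completeness, this elementary inequality follows from a one-line calculation: since x ≥ 2y implies x − y ≥ x/2,

\frac{1}{x-y}-\frac{1}{x}=\frac{y}{x(x-y)}\leq\frac{y}{x\cdot(x/2)}=\frac{2y}{x^{2}}.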

E.1 Extended Value Difference

Lemma E.7 (Extended Value Difference (Section B.1 in [Cai et al., 2020])).

Let π={πh}h=1H\pi=\{\pi_{h}\}_{h=1}^{H} and π={πh}h=1H\pi^{\prime}=\{\pi^{\prime}_{h}\}_{h=1}^{H} be two arbitrary policies and let {Q^h}h=1H\{\widehat{Q}_{h}\}_{h=1}^{H} be any given Q-functions. Then define V^h(s):=Q^h(s,),πh(s)\widehat{V}_{h}(s):=\langle\widehat{Q}_{h}(s,\cdot),\pi_{h}(\cdot\mid s)\rangle for all s𝒮s\in\mathcal{S}. Then for all s𝒮s\in\mathcal{S},

V^1(s)V1π(s)=\displaystyle\widehat{V}_{1}(s)-V_{1}^{\pi^{\prime}}(s)= h=1H𝔼π[Q^h(sh,),πh(sh)πh(sh)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi^{\prime}}\left[\langle\widehat{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot\mid s_{h}\right)-\pi_{h}^{\prime}\left(\cdot\mid s_{h}\right)\rangle\mid s_{1}=s\right] (60)
+h=1H𝔼π[Q^h(sh,ah)(𝒯hV^h+1)(sh,ah)s1=s]\displaystyle+\sum_{h=1}^{H}\mathbb{E}_{\pi^{\prime}}\left[\widehat{Q}_{h}\left(s_{h},a_{h}\right)-\left(\mathcal{T}_{h}\widehat{V}_{h+1}\right)\left(s_{h},a_{h}\right)\mid s_{1}=s\right]

where (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}_{h}V)(\cdot,\cdot):=r_{h}(\cdot,\cdot)+(P_{h}V)(\cdot,\cdot) for any VSV\in\mathbb{R}^{S}.

Lemma E.8 (Lemma I.10 in [Yin and Wang, 2021b]).

Let π^={π^h}h=1H\widehat{\pi}=\left\{\widehat{\pi}_{h}\right\}_{h=1}^{H} and Q^h(,)\widehat{Q}_{h}(\cdot,\cdot) be an arbitrary policy and Q-function, let V^h(s)=Q^h(s,),π^h(|s)\widehat{V}_{h}(s)=\langle\widehat{Q}_{h}(s,\cdot),\widehat{\pi}_{h}(\cdot|s)\rangle s𝒮\forall s\in\mathcal{S}, and define ξh(s,a)=(𝒯hV^h+1)(s,a)Q^h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widehat{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a) element-wise. Then for any policy π\pi, we have

V1π(s)V1π^(s)=\displaystyle V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)= h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]
+\displaystyle+ h=1H𝔼π[Q^h(sh,),πh(|sh)π^h(|sh)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\langle\widehat{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\mid s_{1}=s\right]

where the expectations are taken over sh,ahs_{h},a_{h}.

E.2 Assisting lemmas for linear MDP setting

Lemma E.9 (Hoeffding inequality for self-normalized martingales [Abbasi-Yadkori et al., 2011]).

Let {ηt}t=1\{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process. Let {t}t=0\{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration, such that ηt\eta_{t} is t\mathcal{F}_{t}-measurable. Assume that, conditioned on t1\mathcal{F}_{t-1}, ηt\eta_{t} is zero-mean and RR-subgaussian, i.e.

λ,𝔼[eληtt1]eλ2R2/2.\forall\lambda\in\mathbb{R},\quad\mathbb{E}\left[e^{\lambda\eta_{t}}\mid\mathcal{F}_{t-1}\right]\leq e^{\lambda^{2}R^{2}/2}.

Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process where xtx_{t} is t1\mathcal{F}_{t-1} measurable and xtL\|x_{t}\|\leq L. Let Λt=λId+s=1txsxs\Lambda_{t}=\lambda I_{d}+\sum_{s=1}^{t}x_{s}x^{\top}_{s}. Then for any δ>0\delta>0, with probability 1δ1-\delta, for all t>0t>0,

s=1txsηsΛt128R2d2log(λ+tLλδ).\left\|\sum_{s=1}^{t}{x}_{s}\eta_{s}\right\|_{{\Lambda}_{t}^{-1}}^{2}\leq 8R^{2}\cdot\frac{d}{2}\log\left(\frac{\lambda+tL}{\lambda\delta}\right).
Lemma E.10 (Bernstein inequality for self-normalized martingales [Zhou et al., 2021]).

Let {ηt}t=1\{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process. Let {t}t=0\{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration, such that ηt\eta_{t} is t\mathcal{F}_{t}-measurable. Assume ηt\eta_{t} also satisfies

|ηt|R,𝔼[ηtt1]=0,𝔼[ηt2t1]σ2.\left|\eta_{t}\right|\leq R,\mathbb{E}\left[\eta_{t}\mid\mathcal{F}_{t-1}\right]=0,\mathbb{E}\left[\eta_{t}^{2}\mid\mathcal{F}_{t-1}\right]\leq\sigma^{2}.

Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process where xtx_{t} is t1\mathcal{F}_{t-1} measurable and xtL\|x_{t}\|\leq L. Let Λt=λId+s=1txsxs\Lambda_{t}=\lambda I_{d}+\sum_{s=1}^{t}x_{s}x^{\top}_{s}. Then for any δ>0\delta>0, with probability 1δ1-\delta, for all t>0t>0,

s=1t𝐱sηs𝚲t18σdlog(1+tL2λd)log(4t2δ)+4Rlog(4t2δ)\left\|\sum_{s=1}^{t}\mathbf{x}_{s}\eta_{s}\right\|_{\bm{\Lambda}_{t}^{-1}}\leq 8\sigma\sqrt{d\log\left(1+\frac{tL^{2}}{\lambda d}\right)\cdot\log\left(\frac{4t^{2}}{\delta}\right)}+4R\log\left(\frac{4t^{2}}{\delta}\right)
Lemma E.11 (Lemma H.4 in [Yin et al., 2022]).

Let Λ1\Lambda_{1} and Λ2d×d\Lambda_{2}\in\mathbb{R}^{d\times d} be two positive semi-definite matrices. Then:

Λ11Λ21+Λ11Λ21Λ1Λ2\|\Lambda_{1}^{-1}\|\leq\|\Lambda_{2}^{-1}\|+\|\Lambda_{1}^{-1}\|\cdot\|\Lambda_{2}^{-1}\|\cdot\|\Lambda_{1}-\Lambda_{2}\|

and

ϕΛ11[1+Λ21Λ2Λ11Λ1Λ2]ϕΛ21.\|\phi\|_{\Lambda_{1}^{-1}}\leq\left[1+\sqrt{\|\Lambda_{2}^{-1}\|\cdot\|\Lambda_{2}\|\cdot\|\Lambda_{1}^{-1}\|\cdot\|\Lambda_{1}-\Lambda_{2}\|}\right]\cdot\|\phi\|_{\Lambda_{2}^{-1}}.

for all ϕd\phi\in\mathbb{R}^{d}.

Lemma E.12 (Lemma H.4 in [Min et al., 2021]).

Let ϕ:𝒮×𝒜d\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} satisfies ϕ(s,a)C\|\phi(s,a)\|\leq C for all s,a𝒮×𝒜s,a\in\mathcal{S}\times\mathcal{A}. For any K>0,λ>0K>0,\lambda>0, define G¯K=k=1Kϕ(sk,ak)ϕ(sk,ak)+λId\bar{G}_{K}=\sum_{k=1}^{K}\phi(s_{k},a_{k})\phi(s_{k},a_{k})^{\top}+\lambda I_{d} where (sk,ak)(s_{k},a_{k})’s are i.i.d samples from some distribution ν\nu. Then with probability 1δ1-\delta,

G¯KK𝔼ν[G¯KK]42C2K(log2dδ)1/2.\left\lVert\frac{\bar{G}_{K}}{K}-\mathbb{E}_{\nu}\left[\frac{\bar{G}_{K}}{K}\right]\right\rVert\leq\frac{4\sqrt{2}C^{2}}{\sqrt{K}}\left(\log\frac{2d}{\delta}\right)^{1/2}.
Lemma E.13 (Lemma H.5 in [Min et al., 2021]).

Let ϕ:𝒮×𝒜d\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} be a bounded function s.t. ϕ2C\|\phi\|_{2}\leq C. Define G¯K=k=1Kϕ(sk,ak)ϕ(sk,ak)+λId\bar{G}_{K}=\sum_{k=1}^{K}\phi(s_{k},a_{k})\phi(s_{k},a_{k})^{\top}+\lambda I_{d} where (sk,ak)(s_{k},a_{k})’s are i.i.d samples from some distribution ν\nu. Let G=𝔼ν[ϕ(s,a)ϕ(s,a)]G=\mathbb{E}_{\nu}[\phi(s,a)\phi(s,a)^{\top}]. Then for any δ(0,1)\delta\in(0,1), if KK satisfies

Kmax{512C4𝐆12log(2dδ),4λ𝐆1}.K\geq\max\left\{512C^{4}\left\|\mathbf{G}^{-1}\right\|^{2}\log\left(\frac{2d}{\delta}\right),4\lambda\left\|\mathbf{G}^{-1}\right\|\right\}.

Then with probability at least 1δ1-\delta, it holds simultaneously for all udu\in\mathbb{R}^{d} that

uG¯K12KuG1.\|u\|_{\bar{G}_{K}^{-1}}\leq\frac{2}{\sqrt{K}}\|u\|_{G^{-1}}.
Lemma E.14 (Lemma H.9 in [Yin et al., 2022]).

For a linear MDP, for any 0V()H0\leq V(\cdot)\leq H, there exists a whdw_{h}\in\mathbb{R}^{d} s.t. 𝒯hV=ϕ,wh\mathcal{T}_{h}V=\langle\phi,w_{h}\rangle and wh22Hd\|w_{h}\|_{2}\leq 2H\sqrt{d} for all h[H]h\in[H]. Here 𝒯h(V)(s,a)=rh(s,a)+(PhV)(s,a)\mathcal{T}_{h}(V)(s,a)=r_{h}(s,a)+(P_{h}V)(s,a). Similarly, for any π\pi, there exists whπdw^{\pi}_{h}\in\mathbb{R}^{d}, such that Qhπ=ϕ,whπQ^{\pi}_{h}=\langle\phi,w^{\pi}_{h}\rangle with whπ22(Hh+1)d\|w_{h}^{\pi}\|_{2}\leq 2(H-h+1)\sqrt{d}.

E.3 Assisting lemmas for differential privacy

Lemma E.15 (Converting zCDP to DP [Bun and Steinke, 2016]).

If M satisfies ρ\rho-zCDP, then for any δ>0\delta>0, M satisfies (ρ+2ρlog(1/δ),δ)(\rho+2\sqrt{\rho\log(1/\delta)},\delta)-DP.
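As a quick sanity check of Lemma E.15, the following minimal Python snippet (illustrative only; the function name zcdp_to_dp is ours, not from the paper) computes the privacy parameter implied by a given zCDP budget ρ and target δ:

```python
import math

def zcdp_to_dp(rho: float, delta: float) -> float:
    """Return epsilon such that rho-zCDP implies (epsilon, delta)-DP (Lemma E.15)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: rho = 0.1 and delta = 1e-5 give epsilon of roughly 2.25.
print(zcdp_to_dp(0.1, 1e-5))
```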

Lemma E.16 (zCDP Composition [Bun and Steinke, 2016]).

Let M:𝒰n𝒴M:\mathcal{U}^{n}\rightarrow\mathcal{Y} and M:𝒰n𝒵M^{\prime}:\mathcal{U}^{n}\rightarrow\mathcal{Z} be randomized mechanisms. Suppose that MM satisfies ρ\rho-zCDP and MM^{\prime} satisfies ρ\rho^{\prime}-zCDP. Define M′′:𝒰n𝒴×𝒵M^{\prime\prime}:\mathcal{U}^{n}\rightarrow\mathcal{Y}\times\mathcal{Z} by M′′(U)=(M(U),M(U))M^{\prime\prime}(U)=(M(U),M^{\prime}(U)). Then M′′M^{\prime\prime} satisfies (ρ+ρ)(\rho+\rho^{\prime})-zCDP.

Lemma E.17 (Adaptive composition and Post processing of zCDP [Bun and Steinke, 2016]).

Let M:𝒳n𝒴M:\mathcal{X}^{n}\rightarrow\mathcal{Y} and M:𝒳n×𝒴𝒵M^{\prime}:\mathcal{X}^{n}\times\mathcal{Y}\rightarrow\mathcal{Z}. Suppose MM satisfies ρ\rho-zCDP and MM^{\prime} satisfies ρ\rho^{\prime}-zCDP (as a function of its first argument). Define M′′:𝒳n𝒵M^{\prime\prime}:\mathcal{X}^{n}\rightarrow\mathcal{Z} by M′′(x)=M(x,M(x))M^{\prime\prime}(x)=M^{\prime}(x,M(x)). Then M′′M^{\prime\prime} satisfies (ρ+ρ)(\rho+\rho^{\prime})-zCDP.

Definition E.18 (1\ell_{1} sensitivity).

Define the 1\ell_{1} sensitivity of a function f:𝒳df:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} as

Δ1(f)=supneighboringU,Uf(U)f(U)1.\displaystyle\Delta_{1}(f)=\sup_{\text{neighboring}\,U,U^{\prime}}\|f(U)-f(U^{\prime})\|_{1}.
Definition E.19 (Laplace Mechanism [Dwork et al., 2014]).

Given any function f:𝒳df:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d}, the Laplace mechanism is defined as:

L(x,f,ϵ)=f(x)+(Y1,,Yd),\displaystyle\mathcal{M}_{L}(x,f,\epsilon)=f(x)+(Y_{1},\cdots,Y_{d}),

where YiY_{i} are i.i.d. random variables drawn from Lap(Δ1(f)/ϵ)\mathrm{Lap}(\Delta_{1}(f)/\epsilon).

Lemma E.20 (Privacy guarantee of Laplace Mechanism [Dwork et al., 2014]).

The Laplace mechanism preserves (ϵ,0)(\epsilon,0)-differential privacy. For simplicity, we say ϵ\epsilon-DP.
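For concreteness, here is a minimal sketch of the Laplace mechanism of Definition E.19 in Python with numpy (the function name and the counts example are illustrative assumptions, not part of the paper):

```python
import numpy as np

def laplace_mechanism(f_value, l1_sensitivity, epsilon, rng=None):
    """Release f(x) plus i.i.d. Lap(Delta_1(f)/epsilon) noise per coordinate (epsilon-DP)."""
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.asarray(f_value, dtype=float)
    scale = l1_sensitivity / epsilon
    return f_value + rng.laplace(loc=0.0, scale=scale, size=f_value.shape)

# Example: privatizing a small vector of counts assumed to have l1-sensitivity 1.
counts = np.array([12.0, 3.0, 7.0])
private_counts = laplace_mechanism(counts, l1_sensitivity=1.0, epsilon=0.5)
```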

Appendix F Details for the Evaluation part

In the evaluation, we use a synthetic linear MDP similar to those of [Min et al., 2021, Yin et al., 2022], with some modifications for our evaluation task. The linear MDP consists of |𝒮|=2|\mathcal{S}|=2 states and |𝒜|=100|\mathcal{A}|=100 actions, and the feature dimension is d=10d=10. We denote 𝒮={0,1}\mathcal{S}=\{0,1\} and 𝒜={0,1,,99}\mathcal{A}=\{0,1,\ldots,99\}. For each action a{0,1,,99}a\in\{0,1,\ldots,99\}, we obtain a vector 𝐚8\mathbf{a}\in\mathbb{R}^{8} via its binary encoding, so each coordinate of 𝐚\mathbf{a} is either 0 or 11.
First, we define the following indicator function δ(s,a)={1 if 𝟙{s=0}=𝟙{a=0}0 otherwise ,\delta(s,a)=\begin{cases}1&\text{ if }\mathds{1}\{s=0\}=\mathds{1}\{a=0\}\\ 0&\text{ otherwise }\end{cases}, then our non-stationary linear MDP example can be characterized by the following parameters.

The feature map ϕ\bm{\phi} is:

ϕ(s,a)=(𝐚,δ(s,a),1δ(s,a))10.\bm{\phi}(s,a)=\left(\mathbf{a}^{\top},\delta(s,a),1-\delta(s,a)\right)^{\top}\in\mathbb{R}^{10}.

The unknown measure νh\nu_{h} is:

𝝂h(0)=(0,,0,αh,1,αh,2),\bm{\nu}_{h}(0)=\left(0,\cdots,0,\alpha_{h,1},\alpha_{h,2}\right),
𝝂h(1)=(0,,0,1αh,1,1αh,2),\bm{\nu}_{h}(1)=\left(0,\cdots,0,1-\alpha_{h,1},1-\alpha_{h,2}\right),

where {αh,1,αh,2}h[H]\{\alpha_{h,1},\alpha_{h,2}\}_{h\in[H]} is a sequence of random values sampled uniformly from [0,1][0,1].
The unknown vector θh\theta_{h} is:

θh=(rh/8,0,rh/8,1/2rh/2,rh/8,0,rh/8,0,rh/2,1/2rh/2)10,\theta_{h}=(r_{h}/8,0,r_{h}/8,1/2-r_{h}/2,r_{h}/8,0,r_{h}/8,0,r_{h}/2,1/2-r_{h}/2)\in\mathbb{R}^{10},

where rhr_{h} is also sampled uniformly from [0,1][0,1]. Therefore, the transition kernel is Ph(s|s,a)=ϕ(s,a),𝝂h(s)P_{h}(s^{\prime}|s,a)=\langle\phi(s,a),\bm{\nu}_{h}(s^{\prime})\rangle and the expected reward function is rh(s,a)=ϕ(s,a),θhr_{h}(s,a)=\langle\phi(s,a),\theta_{h}\rangle.
Finally, the behavior policy chooses action a=0a=0 with probability pp and each of the other actions with probability (1p)/99(1-p)/99; here we choose p=0.6p=0.6. The initial distribution is uniform over 𝒮={0,1}\mathcal{S}=\{0,1\}. A code sketch of this data-generating process is given below.
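The following minimal Python sketch reproduces this construction (the horizon H=20, the random seed, the number of episodes, and all function names such as phi, transition, and sample_episode are illustrative assumptions, not taken from the paper):

```python
import numpy as np

S, A, d, H = 2, 100, 10, 20      # H = 20 is an assumed horizon for illustration
rng = np.random.default_rng(0)   # fixed seed for reproducibility (assumption)

def encode_action(a):
    """8-bit binary encoding of an action a in {0, ..., 99}; each coordinate is 0 or 1."""
    return np.array([(a >> i) & 1 for i in range(8)], dtype=float)

def phi(s, a):
    """Feature map phi(s, a) = (a_bits, delta(s, a), 1 - delta(s, a)) in R^10."""
    delta = float((s == 0) == (a == 0))
    return np.concatenate([encode_action(a), [delta, 1.0 - delta]])

# Unknown parameters nu_h and theta_h, drawn once for the whole MDP.
alpha = rng.uniform(size=(H, 2))            # (alpha_{h,1}, alpha_{h,2}) ~ Unif[0, 1]
r = rng.uniform(size=H)                     # r_h ~ Unif[0, 1]
nu = np.zeros((H, 2, d))
nu[:, 0, 8:] = alpha                        # nu_h(0) = (0, ..., 0, alpha_{h,1}, alpha_{h,2})
nu[:, 1, 8:] = 1.0 - alpha                  # nu_h(1) = (0, ..., 0, 1-alpha_{h,1}, 1-alpha_{h,2})
theta = np.stack([
    np.array([rh / 8, 0, rh / 8, 0.5 - rh / 2, rh / 8, 0, rh / 8, 0, rh / 2, 0.5 - rh / 2])
    for rh in r
])

def transition(h, s, a):
    """Sample s' with P_h(s' | s, a) = <phi(s, a), nu_h(s')>."""
    p0 = phi(s, a) @ nu[h, 0]               # probability of landing in state 0
    return 0 if rng.random() < p0 else 1

def reward(h, s, a):
    """Expected reward r_h(s, a) = <phi(s, a), theta_h>."""
    return phi(s, a) @ theta[h]

def behavior_action(p=0.6):
    """Behavior policy: a = 0 w.p. p, otherwise uniform over the remaining 99 actions."""
    return 0 if rng.random() < p else int(rng.integers(1, A))

def sample_episode():
    """Generate one trajectory of length H under the behavior policy."""
    s = int(rng.integers(0, S))             # uniform initial distribution over {0, 1}
    traj = []
    for h in range(H):
        a = behavior_action()
        traj.append((s, a, reward(h, s, a)))
        s = transition(h, s, a)
    return traj

dataset = [sample_episode() for _ in range(1000)]   # K = 1000 episodes (assumption)
```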