
Offline Reinforcement Learning with Differential Privacy

Dan Qiao Department of Computer Science, UC Santa Barbara Yu-Xiang Wang Department of Computer Science, UC Santa Barbara
Abstract

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.

1 Introduction

Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment (usually characterized by a Markov Decision Process (MDP) in this paper) through a static dataset gathered from some behavior policy \mu. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trade or finance (Zhang et al., 2020)), risky (autonomous driving (Sallab et al., 2017)) or illegal / unethical (healthcare (Raghu et al., 2017)). In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history as an example: at each time step, each patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage, etc.); finally there is a treatment outcome (whether the patient feels better, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be viewed as n (number of patients) trajectories sampled from an MDP with horizon H (number of treatment steps); see Table 1 for an illustration. However, learning agents are known to implicitly memorize details of individual training data points verbatim (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes offline RL models vulnerable to various privacy attacks.

Differential privacy (DP) (Dwork et al., 2006) is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe in which any individual user is replaced, thereby preventing the aforementioned privacy risks. There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting (Vietri et al., 2020; Garcelon et al., 2021; Liao et al., 2021; Chowdhury and Zhou, 2021; Luyo et al., 2021).

Offline RL is arguably more practically relevant than online RL in the applications with sensitive data. For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy enhancing techniques (e.g., DP) to patients was reported to further improve data access (Kim et al., 2017).

Time     Patient 1                    Patient 2                    \cdots   Patient n
Time 1   Health condition_{1,1}       Health condition_{2,1}       \cdots   Health condition_{n,1}
Time 1   Treatment_{1,1}              Treatment_{2,1}              \cdots   Treatment_{n,1}
Time 1   Treatment outcome_{1,1}      Treatment outcome_{2,1}      \cdots   Treatment outcome_{n,1}
\cdots   \cdots                       \cdots                       \cdots   \cdots
Time H   Health condition_{1,H}       Health condition_{2,H}       \cdots   Health condition_{n,H}
Time H   Treatment_{1,H}              Treatment_{2,H}              \cdots   Treatment_{n,H}
Time H   Treatment outcome_{1,H}      Treatment outcome_{2,H}      \cdots   Treatment outcome_{n,H}
Table 1: An illustration of an offline dataset of medical histories. The dataset consists of n patients, and the data for each patient consists of the (health condition, treatment, treatment outcome) at each of H time steps. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be viewed as n trajectories sampled from an MDP with horizon H.

Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold.

  • We design two new pessimism-based algorithms DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms.

  • We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI (Yin et al., 2022) as well as a popular baseline PEVI (Jin et al., 2021). The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.

Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy (Balle et al., 2016; Xie et al., 2019). Balle et al. (2016) privatized first-visit Monte Carlo-Ridge Regression estimator by an output perturbation mechanism and Xie et al. (2019) used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus.

There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy (Vietri et al., 2020; Chowdhury and Zhou, 2021; Ngo et al., 2022; Luyo et al., 2021) or local differential privacy (Garcelon et al., 2021; Liao et al., 2021; Luyo et al., 2021; Chowdhury and Zhou, 2021). The offline setting introduces new challenges in DP as we cannot algorithmically enforce good “exploration”, but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B.

Among non-private offline RL works, we build directly upon the offline RL methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) (Yin and Wang, 2021b) and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) (Yin et al., 2022), as they give the strongest theoretical guarantees to date. We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.

A remark on technical novelty. Our algorithms involve substantial technical innovation over previous works on online DP-RL with a joint DP guarantee (here we only compare our techniques, which are for offline RL, with the works on online RL under a joint DP guarantee, as both settings allow access to the raw data). Different from previous works, our DP-APVI (Algorithm 1) operates on a Bernstein-type pessimism, which requires our algorithm to handle conditional variances using private statistics. Besides, our DP-VAPVI (Algorithm 2) replaces the LSVI technique with variance-aware LSVI (also known as weighted ridge regression, which first appeared in (Zhou et al., 2021)). Our DP-VAPVI releases conditional variances privately, and further applies weighted ridge regression privately. Both approaches ensure tighter instance-dependent bounds on the suboptimality of the learned policy.

2 Problem Setup

Markov Decision Process. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple M=(\mathcal{S},\mathcal{A},P,r,H,d_{1}) (Sutton and Barto, 2018), where \mathcal{S} is the state space and \mathcal{A} is the action space. A non-stationary transition kernel P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1] maps each state-action pair (s_{h},a_{h}) to a probability distribution P_{h}(\cdot|s_{h},a_{h}), and P_{h} can differ across time. Besides, r_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R} is the expected immediate reward satisfying 0\leq r_{h}\leq 1, d_{1} is the initial state distribution and H is the horizon. A policy \pi=(\pi_{1},\cdots,\pi_{H}) assigns each state s_{h}\in\mathcal{S} a probability distribution over actions according to the map s_{h}\mapsto\pi_{h}(\cdot|s_{h}), \forall\,h\in[H]. A random trajectory s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H},s_{H+1} is generated according to s_{1}\sim d_{1}, a_{h}\sim\pi_{h}(\cdot|s_{h}), r_{h}\sim r_{h}(s_{h},a_{h}), s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h}), \forall\,h\in[H].

For a tabular MDP, the state-action space \mathcal{S}\times\mathcal{A} is discrete with S:=|\mathcal{S}| and A:=|\mathcal{A}| finite. In this work, we assume that r is known (this is because the uncertainty of the reward function is dominated by that of the transition kernel in RL). In addition, we define the per-step marginal state-action occupancy d^{\pi}_{h}(s,a) as d^{\pi}_{h}(s,a):=\mathbb{P}[s_{h}=s|s_{1}\sim d_{1},\pi]\cdot\pi_{h}(a|s), which is the marginal state-action probability at time h.

Value function, Bellman (optimality) equations. The value function V^{\pi}_{h}(\cdot) and Q-value function Q^{\pi}_{h}(\cdot,\cdot) for any policy \pi are defined as V^{\pi}_{h}(s)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s],\;\;Q^{\pi}_{h}(s,a)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s,a_{h}=a],\;\;\forall\,(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}. The performance is defined as v^{\pi}:=\mathbb{E}_{d_{1}}\left[V^{\pi}_{1}\right]=\mathbb{E}_{\pi,d_{1}}\left[\sum_{t=1}^{H}r_{t}\right]. The Bellman (optimality) equations hold for all h\in[H]: Q^{\pi}_{h}=r_{h}+P_{h}V^{\pi}_{h+1},\;\;V^{\pi}_{h}=\mathbb{E}_{a\sim\pi_{h}}[Q^{\pi}_{h}],\;\;\;Q^{\star}_{h}=r_{h}+P_{h}V^{\star}_{h+1},\;V^{\star}_{h}=\max_{a}Q^{\star}_{h}(\cdot,a).
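For concreteness, the Bellman optimality equations can be solved exactly by backward induction when P and r are known. Below is a minimal sketch (ours, for illustration only; the offline algorithms later in the paper replace P with private estimates built from data):

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon tabular MDP.

    P: shape (H, S, A, S), P[h, s, a, s'] = P_h(s' | s, a)
    r: shape (H, S, A), expected rewards in [0, 1]
    Returns Q_star of shape (H, S, A) and V_star of shape (H, S).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))            # V_{H+1} = 0 by convention
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V[h + 1]   # Q_h = r_h + P_h V_{h+1}
        V[h] = Q[h].max(axis=1)         # V_h = max_a Q_h(., a)
    return Q, V[:H]
```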

Linear MDP (Jin et al., 2020b). An episodic MDP (\mathcal{S},\mathcal{A},H,P,r) is called a linear MDP with a known feature map \phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} if there exist H unknown signed measures \nu_{h}\in\mathbb{R}^{d} over \mathcal{S} and H unknown reward vectors \theta_{h}\in\mathbb{R}^{d} such that

{P}_{h}\left(s^{\prime}\mid s,a\right)=\left\langle\phi(s,a),\nu_{h}\left(s^{\prime}\right)\right\rangle,\quad r_{h}\left(s,a\right)=\left\langle\phi(s,a),\theta_{h}\right\rangle,\quad\forall\,(h,s,a,s^{\prime})\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}.

Without loss of generality, we assume \|\phi(s,a)\|_{2}\leq 1 and \max(\|\nu_{h}(\mathcal{S})\|_{2},\|\theta_{h}\|_{2})\leq\sqrt{d} for all (h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}. An important property of linear MDPs is that the value functions are linear in the feature map, which is summarized in Lemma E.14.
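For completeness, the linearity property referenced above follows in one line from the Bellman equation and the linear MDP definition: writing w^{\pi}_{h}:=\theta_{h}+\int_{\mathcal{S}}V^{\pi}_{h+1}(s^{\prime})\,\mathrm{d}\nu_{h}(s^{\prime}),

Q^{\pi}_{h}(s,a)=r_{h}(s,a)+\left[P_{h}V^{\pi}_{h+1}\right](s,a)=\left\langle\phi(s,a),\theta_{h}\right\rangle+\left\langle\phi(s,a),\int_{\mathcal{S}}V^{\pi}_{h+1}(s^{\prime})\,\mathrm{d}\nu_{h}(s^{\prime})\right\rangle=\left\langle\phi(s,a),w^{\pi}_{h}\right\rangle,

so every Q-value function is linear in the feature map \phi.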

Offline setting and the goal. Offline RL requires the agent to find a policy \pi that maximizes the performance v^{\pi}, given only the episodic data \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau\in[n]}^{h\in[H]} (for clarity we use n for tabular MDPs and K for linear MDPs when referring to the sample size) rolled out from some fixed and possibly unknown behavior policy \mu; we cannot change \mu and in particular we do not assume functional knowledge of \mu. In summary, based on the batch data \mathcal{D} and a target accuracy \epsilon>0, the agent seeks to find a policy \pi_{\text{alg}} such that v^{\star}-v^{\pi_{\text{alg}}}\leq\epsilon.

2.1 Assumptions in offline RL

In order to show that our privacy-preserving algorithms can output near-optimal policies, certain coverage assumptions are needed. In this section, we list the assumptions used in this paper.

Assumptions for tabular setting.

Assumption 2.1 ((Liu et al., 2019)).

There exists one optimal policy \pi^{\star} that is fully covered by \mu, i.e., \forall\,(s_{h},a_{h})\in\mathcal{S}\times\mathcal{A}, d^{\pi^{\star}}_{h}(s_{h},a_{h})>0 only if d^{\mu}_{h}(s_{h},a_{h})>0. Furthermore, we denote the trackable set as \mathcal{C}_{h}:=\{(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\}.

Assumption 2.1 is the weakest assumption needed for accurately learning the optimal value v^{\star}: it only requires \mu to trace the state-action space of one optimal policy (\mu can be agnostic elsewhere). Similar to (Yin and Wang, 2021b), we use Assumption 2.1 for the tabular part of this paper, which enables a comparison of our sample complexity with the result in (Yin and Wang, 2021b), whose algorithm serves as a non-private baseline.

Assumptions for linear setting. First, we define the expected covariance matrix under the behavior policy \mu for each time step h\in[H] as below:

\Sigma^{p}_{h}:=\mathbb{E}_{\mu}\left[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}\right]. (1)

As has been shown in (Wang et al., 2021; Yin et al., 2022), learning a near-optimal policy from offline data requires coverage assumptions. In the linear setting, such coverage is characterized by the minimum eigenvalue of \Sigma^{p}_{h}. Similar to (Yin et al., 2022), we adopt the following assumption for the sake of comparison.

Assumption 2.2 (Feature Coverage, Assumption 2 in (Wang et al., 2021)).

The data distribution \mu satisfies the minimum eigenvalue condition: \forall\,h\in[H], \kappa_{h}:=\lambda_{\mathrm{min}}(\Sigma^{p}_{h})>0. Furthermore, we denote \kappa=\min_{h}\kappa_{h}.

2.2 Differential Privacy in offline RL

In this work, we aim to design privacy-preserving algorithms for offline RL. We apply differential privacy as the formal notion of privacy. Below we revisit the definition of differential privacy.

Definition 2.3 (Differential Privacy (Dwork et al., 2006)).

A randomized mechanism M satisfies (\epsilon,\delta)-differential privacy ((\epsilon,\delta)-DP) if for all neighboring datasets U,U^{\prime} that differ in one data point and for all possible events E in the output range, it holds that

\mathbb{P}[M(U)\in E]\leq e^{\epsilon}\cdot\mathbb{P}[M(U^{\prime})\in E]+\delta.

When \delta=0, we refer to pure DP, while for \delta>0, we refer to approximate DP.

In the problem of offline RL, the dataset consists of several trajectories, so one data point in Definition 2.3 refers to one single trajectory. Hence differential privacy requires that the distribution of the output policy change little when any single trajectory in the dataset is replaced. In other words, an adversary cannot infer much information about any single trajectory in the dataset from the output policy of the algorithm.

Remark 2.4.

For a concrete motivating example, please refer to the first paragraph of Introduction. We remark that our definition of DP is consistent with Joint DP and Local DP defined under the online RL setting where JDP/LDP also cast each user as one trajectory and provide user-wise privacy protection. For detailed definitions and more discussions about JDP/LDP, please refer to Qiao and Wang (2022).

Throughout the paper, we use zCDP (defined below) as a surrogate for DP, since it enables a cleaner analysis of privacy composition and the Gaussian mechanism. The properties of zCDP (e.g., composition, conversion formulas to DP) are deferred to Appendix E.3.

Definition 2.5 (zCDP (Dwork and Rothblum, 2016; Bun and Steinke, 2016)).

A randomized mechanism M satisfies \rho-Zero-Concentrated Differential Privacy (\rho-zCDP) if, for all neighboring datasets U,U^{\prime} and all \alpha\in(1,\infty),

D_{\alpha}(M(U)\|M(U^{\prime}))\leq\rho\alpha,

where D_{\alpha} is the Rényi divergence (Van Erven and Harremos, 2014).

Finally, we go over the definition and privacy guarantee of Gaussian mechanism.

Definition 2.6 (Gaussian Mechanism (Dwork et al., 2014)).

Define the \ell_{2} sensitivity of a function f:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} as

\Delta_{2}(f)=\sup_{\text{neighboring}\,U,U^{\prime}}\|f(U)-f(U^{\prime})\|_{2}.

The Gaussian mechanism \mathcal{M} with noise level \sigma is then given by

\mathcal{M}(U)=f(U)+\mathcal{N}(0,\sigma^{2}I_{d}).
Lemma 2.7 (Privacy guarantee of Gaussian mechanism (Dwork et al., 2014; Bun and Steinke, 2016)).

Let f:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} be an arbitrary d-dimensional function with \ell_{2} sensitivity \Delta_{2}. Then for any \rho>0, the Gaussian mechanism with parameter \sigma^{2}=\frac{\Delta^{2}_{2}}{2\rho} satisfies \rho-zCDP. In addition, for all 0<\delta,\epsilon<1, the Gaussian mechanism with parameter \sigma=\frac{\Delta_{2}}{\epsilon}\sqrt{2\log\frac{1.25}{\delta}} satisfies (\epsilon,\delta)-DP.

We emphasize that the privacy guarantee covers any input data. It does not require any distributional assumptions on the data. The RL-specific assumptions (e.g., linear MDP and coverage assumptions) are only used for establishing provable utility guarantees.
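As a concrete illustration of Lemma 2.7, the following sketch (ours, for illustration only; function names are not from the paper) calibrates Gaussian noise to a given \ell_{2} sensitivity for either a zCDP budget \rho or an (\epsilon,\delta)-DP budget:

```python
import numpy as np

def gaussian_mechanism_zcdp(f_value, l2_sensitivity, rho):
    """Release f(U) + N(0, sigma^2 I) with sigma^2 = Delta_2^2 / (2 rho): rho-zCDP."""
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return f_value + np.random.normal(0.0, sigma, size=np.shape(f_value))

def gaussian_mechanism_dp(f_value, l2_sensitivity, eps, delta):
    """Release f(U) + N(0, sigma^2 I) with sigma = (Delta_2 / eps) * sqrt(2 log(1.25/delta)):
    (eps, delta)-DP for 0 < eps, delta < 1."""
    sigma = (l2_sensitivity / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return f_value + np.random.normal(0.0, sigma, size=np.shape(f_value))
```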

3 Results under tabular MDP: DP-APVI (Algorithm 1)

The tabular MDP setting is the most well-studied setting in reinforcement learning, and our first result applies to this regime. We begin with the construction of private counts.

Private Model-based Components. Given data \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau\in[n]}^{h\in[H]}, let n_{s_{h},a_{h}}:=\sum_{\tau=1}^{n}\mathds{1}[s_{h}^{\tau},a_{h}^{\tau}=s_{h},a_{h}] denote the total count of visits to the (s_{h},a_{h}) pair at time h and n_{s_{h},a_{h},s_{h+1}}:=\sum_{\tau=1}^{n}\mathds{1}[s_{h}^{\tau},a_{h}^{\tau},s_{h+1}^{\tau}=s_{h},a_{h},s_{h+1}] denote the total count of visits to the (s_{h},a_{h},s_{h+1}) pair at time h. Then, given the budget \rho for zCDP, we add independent Gaussian noise to all the counts:

n_{s_{h},a_{h}}^{\prime}=\left\{n_{s_{h},a_{h}}+\mathcal{N}(0,\sigma^{2})\right\}^{+},\,\,n_{s_{h},a_{h},s_{h+1}}^{\prime}=\left\{n_{s_{h},a_{h},s_{h+1}}+\mathcal{N}(0,\sigma^{2})\right\}^{+},\,\,\sigma^{2}=\frac{2H}{\rho}. (2)

However, after adding noise, the noisy counts n^{\prime} may not satisfy n^{\prime}_{s_{h},a_{h}}=\sum_{s_{h+1}\in\mathcal{S}}n^{\prime}_{s_{h},a_{h},s_{h+1}}. To address this problem, we choose the private visitation counts as the solution to the following optimization problem (here E_{\rho}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}} is chosen as a high-probability uniform bound on the noise we add):

\{\widetilde{n}_{s_{h},a_{h},s^{\prime}}\}_{s^{\prime}\in\mathcal{S}}=\mathrm{argmin}_{\{x_{s^{\prime}}\}_{s^{\prime}\in\mathcal{S}}}\max_{s^{\prime}\in\mathcal{S}}\left|x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}\right| (3)
\text{such that}\,\,\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h}}\right|\leq\frac{E_{\rho}}{2}\,\,\text{and}\,\,x_{s^{\prime}}\geq 0,\,\forall\,s^{\prime}\in\mathcal{S},
\widetilde{n}_{s_{h},a_{h}}=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s^{\prime}}.
Remark 3.1 (Some explanations).

The optimization problem above serves as a post-processing step and hence does not affect the DP guarantee of our algorithm. Briefly speaking, (3) finds a set of noisy counts such that \widetilde{n}_{s_{h},a_{h}}=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s^{\prime}} while the estimation error of each \widetilde{n}_{s_{h},a_{h}} and \widetilde{n}_{s_{h},a_{h},s^{\prime}} is roughly E_{\rho} (this conclusion is summarized in Lemma C.3). In contrast, if we directly took the crude approach \widetilde{n}_{s_{h},a_{h},s_{h+1}}={n}_{s_{h},a_{h},s_{h+1}}^{\prime} and \widetilde{n}_{s_{h},a_{h}}=\sum_{s_{h+1}\in\mathcal{S}}\widetilde{n}_{s_{h},a_{h},s_{h+1}}, we could only derive |\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq\widetilde{O}(\sqrt{S}E_{\rho}) through concentration of a sum of S i.i.d. Gaussian noises. In conclusion, solving the optimization problem (3) enables a tight analysis of the lower order term (the additional cost of privacy).

Remark 3.2 (Computational efficiency).

The optimization problem (3) can be reformulated as:

\min\,\,t,\,\,\text{s.t.}\,\,|x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq t\,\,\text{and}\,\,x_{s^{\prime}}\geq 0\,\,\forall\,s^{\prime}\in\mathcal{S},\,\,\,\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-n^{\prime}_{s_{h},a_{h}}\right|\leq\frac{E_{\rho}}{2}. (4)

Note that (4) is a linear programming problem with S+1 variables and 2S+2 linear constraints (one constraint on an absolute value is equivalent to two linear constraints), which can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser and Wolsey, 1988). Therefore, our Algorithm 1 is computationally efficient.
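To make the count release concrete, here is a sketch (with our own naming, purely illustrative) of the pipeline for a single (s_{h},a_{h}) pair: Gaussian noise is added as in (2), and the reformulated LP (4) is solved with an off-the-shelf solver; we use scipy.optimize.linprog here, but any LP solver works since this step is pure post-processing.

```python
import numpy as np
from scipy.optimize import linprog

def noisy_counts(n_sa, n_sas, sigma, rng):
    """Step (2): add N(0, sigma^2) noise to each count and clip at zero ({.}^+)."""
    n_sa_noisy = max(n_sa + rng.normal(0.0, sigma), 0.0)
    n_sas_noisy = np.maximum(n_sas + rng.normal(0.0, sigma, size=len(n_sas)), 0.0)
    return n_sa_noisy, n_sas_noisy

def postprocess_counts(n_sa_noisy, n_sas_noisy, E_rho):
    """Steps (3)/(4): find x >= 0 minimizing max_s' |x_s' - n'_{s,a,s'}|
    subject to |sum_s' x_s' - n'_{s,a}| <= E_rho / 2."""
    S = len(n_sas_noisy)
    c = np.zeros(S + 1)
    c[-1] = 1.0                              # variables z = (x_1, ..., x_S, t); minimize t
    A_ub, b_ub = [], []
    for s in range(S):                       # |x_s - n'_s| <= t, as two linear constraints
        row = np.zeros(S + 1); row[s] = 1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(n_sas_noisy[s])
        row = np.zeros(S + 1); row[s] = -1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(-n_sas_noisy[s])
    row = np.zeros(S + 1); row[:S] = 1.0     # sum_s x_s <= n'_{s,a} + E_rho/2
    A_ub.append(row); b_ub.append(n_sa_noisy + E_rho / 2.0)
    row = np.zeros(S + 1); row[:S] = -1.0    # sum_s x_s >= n'_{s,a} - E_rho/2
    A_ub.append(row); b_ub.append(-(n_sa_noisy - E_rho / 2.0))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * (S + 1))
    x = res.x[:S]
    return x.sum(), x                        # (n_tilde_{s,a}, {n_tilde_{s,a,s'}}_{s'})
```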

The private estimation of the transition kernel is defined as:

\widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}, (5)

if \widetilde{n}_{s_{h},a_{h}}>E_{\rho}, and \widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{1}{S} otherwise.

Remark 3.3.

Different from the transition kernel estimates in previous works (Vietri et al., 2020; Chowdhury and Zhou, 2021), which may not be probability distributions, we have to ensure that ours is a probability distribution, because our Bernstein-type pessimism (line 5 in Algorithm 1) takes a variance over this transition kernel estimate. The intuition behind the construction of our private transition kernel is that, for state-action pairs with \widetilde{n}_{s_{h},a_{h}}\leq E_{\rho}, we cannot distinguish whether a non-zero private count comes from noise or from actual visitation. Therefore we only take the empirical estimate for state-action pairs with sufficiently large \widetilde{n}_{s_{h},a_{h}}.

Algorithm 1 Differentially Private Adaptive Pessimistic Value Iteration (DP-APVI)
1:  Input: Offline dataset \mathcal{D}=\{(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau})\}_{\tau,h=1}^{n,H}. Reward function r. Constants C_{1}=\sqrt{2},C_{2}=16,C>1, failure probability \delta, budget for zCDP \rho.
2:  Initialization: Calculate \widetilde{n}_{s_{h},a_{h}},\widetilde{n}_{s_{h},a_{h},s_{h+1}} as in (3) and \widetilde{P}_{h}(s_{h+1}|s_{h},a_{h}) as in (5). \widetilde{V}_{H+1}(\cdot)\leftarrow 0. E_{\rho}\leftarrow 4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}. \iota\leftarrow\log(HSA/\delta).
3:  for h=H,H-1,\ldots,1 do
4:    \widetilde{Q}_{h}(\cdot,\cdot)\leftarrow r_{h}(\cdot,\cdot)+(\widetilde{P}_{h}\cdot\widetilde{V}_{h+1})(\cdot,\cdot)
5:    \forall\,s_{h},a_{h}: let \Gamma_{h}(s_{h},a_{h})\leftarrow C_{1}\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{s_{h},a_{h}}}(\widetilde{V}_{h+1})\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{C_{2}SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}} if \widetilde{n}_{s_{h},a_{h}}>E_{\rho}, otherwise CH.
6:    \widehat{Q}^{p}_{h}(\cdot,\cdot)\leftarrow\widetilde{Q}_{h}(\cdot,\cdot)-\Gamma_{h}(\cdot,\cdot).
7:    \overline{Q}_{h}(\cdot,\cdot)\leftarrow\min\{\widehat{Q}^{p}_{h}(\cdot,\cdot),H-h+1\}^{+}.
8:    \forall\,s_{h}: let \widehat{\pi}_{h}(\cdot|s_{h})\leftarrow\mathrm{argmax}_{\pi_{h}}\langle\overline{Q}_{h}(s_{h},\cdot),\pi_{h}(\cdot|s_{h})\rangle and \widetilde{V}_{h}(s_{h})\leftarrow\langle\overline{Q}_{h}(s_{h},\cdot),\widehat{\pi}_{h}(\cdot|s_{h})\rangle.
9:  end for
10:  Output: \{\widehat{\pi}_{h}\}.

Algorithmic design. Our algorithmic design originates from the idea of pessimism, which takes a conservative view of locations with high uncertainty and favors locations about which we have more confidence. Based on the Bernstein-type pessimism in APVI (Yin and Wang, 2021b), we design a similar pessimistic algorithm with private counts to ensure differential privacy. If we replace \widetilde{n} and \widetilde{P} with n and \widehat{P} (the non-private empirical estimate, defined as (15) in Appendix C), then our DP-APVI (Algorithm 1) degenerates to APVI. Compared to the pessimism defined in APVI, our pessimistic penalty has an additional term \widetilde{O}\left(\frac{SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}\right), which accounts for the additional pessimism due to our use of private statistics.
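To make the design concrete, here is a condensed sketch (ours, not the authors' code) of the backward induction in Algorithm 1, assuming the post-processed private counts from (3) are already computed; it builds the private transition estimate (5) and the Bernstein-type private penalty from line 5 of the pseudocode.

```python
import numpy as np

def dp_apvi(n_tilde_sa, n_tilde_sas, r, E_rho, iota, C1=np.sqrt(2), C2=16.0, C=2.0):
    """n_tilde_sa: (H, S, A) private counts; n_tilde_sas: (H, S, A, S) private counts;
    r: (H, S, A) known rewards. Returns a greedy (deterministic) policy pi_hat of shape (H, S)."""
    H, S, A = r.shape
    # Private transition estimate (5): empirical ratio when the count exceeds E_rho, uniform otherwise.
    P_tilde = np.full((H, S, A, S), 1.0 / S)
    mask = n_tilde_sa > E_rho
    P_tilde[mask] = n_tilde_sas[mask] / n_tilde_sa[mask][:, None]
    V = np.zeros(S)
    pi_hat = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):                          # h is 0-indexed here
        mean_V = P_tilde[h] @ V                           # (S, A): P_tilde_h V_tilde_{h+1}
        var_V = np.maximum(P_tilde[h] @ (V ** 2) - mean_V ** 2, 0.0)
        Q = r[h] + mean_V
        # Bernstein-type private pessimism (line 5 of Algorithm 1).
        Gamma = np.where(
            mask[h],
            C1 * np.sqrt(var_V * iota / np.maximum(n_tilde_sa[h] - E_rho, 1e-12))
            + C2 * S * H * E_rho * iota / np.maximum(n_tilde_sa[h], 1e-12),
            C * H,
        )
        Q_bar = np.clip(Q - Gamma, 0.0, H - h)            # min{., H-h+1}^+ with 0-indexed h
        pi_hat[h] = Q_bar.argmax(axis=1)
        V = Q_bar.max(axis=1)
    return pi_hat
```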

We state our main theorem about DP-APVI below; the proof sketch is deferred to Appendix C.1 and the detailed proof to Appendix C due to space limits.

Theorem 3.4.

DP-APVI (Algorithm 1) satisfies \rho-zCDP. Furthermore, under Assumption 2.1, denote \bar{d}_{m}:=\min_{h\in[H]}\{d^{\mu}_{h}(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\}. For any 0<\delta<1, there exists a constant c_{1}>0 such that when n>c_{1}\cdot\max\{H^{2},E_{\rho}\}/\bar{d}_{m}\cdot\iota (where \iota=\log(HSA/\delta)), with probability 1-\delta, the output policy \widehat{\pi} of DP-APVI satisfies

0\leq v^{\star}-v^{\widehat{\pi}}\leq 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right), (6)

where \widetilde{O} hides constants and polylog terms, and E_{\rho}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}.

Comparison to the non-private counterpart APVI (Yin and Wang, 2021b). According to Theorem 4.1 in (Yin and Wang, 2021b), the sub-optimality bound of APVI states that for large enough n, with high probability, the output \widehat{\pi} satisfies:

0\leq v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{nd^{\mu}_{h}(s_{h},a_{h})}}\right)+\widetilde{O}\left(\frac{H^{3}}{n\cdot\bar{d}_{m}}\right). (7)

Compared to our Theorem 3.4, the additional sub-optimality due to differential privacy is \widetilde{O}\left(\frac{SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)=\widetilde{O}\left(\frac{SH^{\frac{5}{2}}}{n\cdot\bar{d}_{m}\sqrt{\rho}}\right)=\widetilde{O}\left(\frac{SH^{\frac{5}{2}}}{n\cdot\bar{d}_{m}\epsilon}\right) (here we apply the second part of Lemma 2.7 to achieve (\epsilon,\delta)-DP; the notation \widetilde{O} also absorbs \log\frac{1}{\delta}, where only here \delta denotes the privacy budget instead of the failure probability). In the most common regime where the privacy budget \rho or \epsilon is a constant, the additional term due to differential privacy appears as a lower order term, hence becomes negligible as the sample size n grows large.

Comparison to Hoeffding-type pessimism. We can simply revise our algorithm to use a Hoeffding-type pessimism, which replaces the penalty in line 5 with C_{1}H\cdot\sqrt{\frac{\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{C_{2}SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}. Then, following a similar proof, we arrive at a sub-optimality bound stating that with high probability,

0\leq v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(H\cdot\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{1}{nd^{\mu}_{h}(s_{h},a_{h})}}\right)+\widetilde{O}\left(\frac{SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right). (8)

Compared to this, the bound in Theorem 3.4 is tighter because we express the dominant term through the system quantities instead of an explicit dependence on H (note that \mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\leq H^{2}). In addition, we highlight that according to Theorem G.1 in (Yin and Wang, 2021b), our main term nearly matches the non-private minimax lower bound. For more detailed discussions of our main term and how it subsumes other optimal learning bounds, we refer readers to (Yin and Wang, 2021b).

Applying the Laplace mechanism to achieve pure DP. To achieve pure DP instead of \rho-zCDP, we can simply replace the Gaussian mechanism with the Laplace mechanism (defined as Definition E.19). Given a privacy budget \epsilon for pure DP, since the \ell_{1} sensitivity of \{n_{s_{h},a_{h}}\}\cup\{n_{s_{h},a_{h},s_{h+1}}\} is \Delta_{1}=4H, we can add independent Laplace noise \mathrm{Lap}(\frac{4H}{\epsilon}) to each count to achieve \epsilon-DP by Lemma E.20. Then, using E_{\epsilon}=\widetilde{O}\left(\frac{H}{\epsilon}\right) in place of E_{\rho} and keeping everything else ((3), (5) and Algorithm 1) the same, we can reach a result similar to Theorem 3.4 with the same proof. The only difference is that the additional learning bound becomes \widetilde{O}\left(\frac{SH^{3}}{n\cdot\bar{d}_{m}\epsilon}\right), which still appears as a lower order term.
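A minimal sketch of this pure-DP variant (only the noise distribution changes; the post-processing (3), the estimate (5) and Algorithm 1 are untouched):

```python
import numpy as np

def laplace_counts(counts, H, eps, rng):
    """Add independent Lap(4H/eps) noise to every visitation count and clip at zero.
    Since the l1 sensitivity of the full count vector is Delta_1 = 4H, the release is eps-DP."""
    scale = 4.0 * H / eps
    return np.maximum(counts + rng.laplace(0.0, scale, size=np.shape(counts)), 0.0)
```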

4 Results under linear MDP: DP-VAPVI(Algorithm 2)

In large MDPs, function approximation is widely applied to address computational issues, and the linear MDP is a concrete model for studying linear function approximation. Our second result applies to the linear MDP setting. Generally speaking, function approximation reduces the dimensionality of the private releases compared to tabular MDPs. We begin with private counts.

Private Model-based Components. Given the two datasets \mathcal{D} and \mathcal{D}^{\prime} (both from \mu) as in Algorithm 2, we can apply variance-aware pessimistic value iteration to learn a near-optimal policy as in VAPVI (Yin et al., 2022). To ensure differential privacy, we add independent Gaussian noise to the 5H statistics as in DP-VAPVI (Algorithm 2) below. Since there are 5H statistics, by the adaptive composition of zCDP (Lemma E.17), it suffices to keep each statistic \rho_{0}-zCDP, where \rho_{0}=\frac{\rho}{5H}. In DP-VAPVI, we use \phi_{1},\phi_{2},\phi_{3},K_{1},K_{2} to denote the noise we add. (We need to add noise to each of the 5H statistics; for \phi_{1}, we actually sample H i.i.d. copies \phi_{1,h}, h=1,\cdots,H, from the distribution of \phi_{1} and add \phi_{1,h} to \sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2} for all h\in[H]; for simplicity, we use \phi_{1} to represent all the \phi_{1,h}. The procedure for the other 4H statistics is similar.) For each \phi_{i}, we directly apply the Gaussian mechanism. For each K_{i}, in addition to the noise matrix \frac{1}{\sqrt{2}}(Z+Z^{\top}), we also add \frac{E}{2}I_{d} to ensure that all K_{i} are positive definite with high probability (the detailed definitions of E and L can be found in Appendix A).

Algorithm 2 Differentially Private Variance-Aware Pessimistic Value Iteration (DP-VAPVI)
1:  Input: Dataset \mathcal{D}=\left\{\left(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau}\right)\right\}_{\tau,h=1}^{K,H} and \mathcal{D}^{\prime}=\left\{\left(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau},\bar{r}_{h}^{\tau},\bar{s}_{h+1}^{\tau}\right)\right\}_{\tau,h=1}^{K,H}. Budget for zCDP \rho. Failure probability \delta. Universal constant C.
2:  Initialization: Set \rho_{0}\leftarrow\frac{\rho}{5H}, \widetilde{V}_{H+1}(\cdot)\leftarrow 0. Sample \phi_{1}\sim\mathcal{N}\left(0,\frac{2H^{4}}{\rho_{0}}I_{d}\right), \phi_{2},\,\phi_{3}\sim\mathcal{N}\left(0,\frac{2H^{2}}{\rho_{0}}I_{d}\right), K_{1},K_{2}\leftarrow\frac{E}{2}I_{d}+\frac{1}{\sqrt{2}}(Z+Z^{\top}), where Z_{i,j}\sim\mathcal{N}\left(0,\frac{1}{4\rho_{0}}\right) (i.i.d.) and E=\widetilde{O}\left(\sqrt{\frac{Hd}{\rho}}\right). Set D\leftarrow\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right).
3:  for h=H,H-1,\ldots,1 do
4:    Set \widetilde{\Sigma}_{h}\leftarrow\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I+K_{1}
5:    Set \widetilde{\beta}_{h}\leftarrow\widetilde{\Sigma}_{h}^{-1}[\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}]
6:    Set \widetilde{\theta}_{h}\leftarrow\widetilde{\Sigma}_{h}^{-1}[\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})+\phi_{2}]
7:    Set \big{[}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}\big{]}(\cdot,\cdot)\leftarrow\big{\langle}\phi(\cdot,\cdot),\widetilde{\beta}_{h}\big{\rangle}_{\left[0,(H-h+1)^{2}\right]}-\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{\theta}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2}
8:    Set \widetilde{\sigma}_{h}(\cdot,\cdot)^{2}\leftarrow\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\}
9:    Set \widetilde{\Lambda}_{h}\leftarrow\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I+K_{2}
10:    Set \widetilde{w}_{h}\leftarrow\widetilde{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)
11:    Set \Gamma_{h}(\cdot,\cdot)\leftarrow C\sqrt{d}\cdot\left(\phi(\cdot,\cdot)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}+\frac{D}{K}
12:    Set \bar{Q}_{h}(\cdot,\cdot)\leftarrow\phi(\cdot,\cdot)^{\top}\widetilde{w}_{h}-\Gamma_{h}(\cdot,\cdot)
13:    Set \widehat{Q}_{h}(\cdot,\cdot)\leftarrow\min\left\{\bar{Q}_{h}(\cdot,\cdot),H-h+1\right\}^{+}
14:    Set \widehat{\pi}_{h}(\cdot\mid\cdot)\leftarrow\mathrm{argmax}_{\pi_{h}}\big{\langle}\widehat{Q}_{h}(\cdot,\cdot),\pi_{h}(\cdot\mid\cdot)\big{\rangle}_{\mathcal{A}}, \widetilde{V}_{h}(\cdot)\leftarrow\max_{\pi_{h}}\big{\langle}\widehat{Q}_{h}(\cdot,\cdot),\pi_{h}(\cdot\mid\cdot)\big{\rangle}_{\mathcal{A}}
15:  end for
16:  Output: \left\{\widehat{\pi}_{h}\right\}_{h=1}^{H}.

Below we describe the algorithmic design of DP-VAPVI (Algorithm 2). We divide the offline dataset into two independent parts of equal size, \mathcal{D}=\{(s_{h}^{\tau},a_{h}^{\tau},r_{h}^{\tau},s_{h+1}^{\tau})\}^{h\in[H]}_{\tau\in[K]} and \mathcal{D}^{\prime}=\{(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau},\bar{r}_{h}^{\tau},\bar{s}_{h+1}^{\tau})\}^{h\in[H]}_{\tau\in[K]}: one for estimating variances and the other for calculating the Q-values.

Estimating the conditional variance. The first part (lines 4 to 8) estimates the conditional variance of \widetilde{V}_{h+1} via the identity [\mathrm{Var}_{h}\widetilde{V}_{h+1}](s,a)=[P_{h}(\widetilde{V}_{h+1})^{2}](s,a)-([P_{h}\widetilde{V}_{h+1}](s,a))^{2}. For the first term, by the definition of a linear MDP, it holds that \left[{P}_{h}\widetilde{V}_{h+1}^{2}\right](s,a)={\phi}(s,a)^{\top}\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right)=\langle\phi(s,a),\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right)\rangle. We can estimate \beta_{h}=\int_{\mathcal{S}}\widetilde{V}_{h+1}^{2}\left(s^{\prime}\right)\mathrm{d}{\nu}_{h}\left(s^{\prime}\right) by ridge regression. Below is the output of ridge regression with the raw (noiseless) statistics:

\underset{{\beta}\in\mathbb{R}^{d}}{\operatorname{argmin}}\sum_{k=1}^{K}\left[\left\langle{\phi}(\bar{s}^{k}_{h},\bar{a}^{k}_{h}),{\beta}\right\rangle-\widetilde{V}_{h+1}^{2}\left(\bar{s}_{h+1}^{k}\right)\right]^{2}+\lambda\|{\beta}\|_{2}^{2}={\bar{\Sigma}}_{h}^{-1}\sum_{k=1}^{K}{{\phi}}(\bar{s}^{k}_{h},\bar{a}^{k}_{h})\widetilde{V}_{h+1}^{2}\left(\bar{s}_{h+1}^{k}\right),

where the definition of \bar{\Sigma}_{h} can be found in Appendix A. Instead of using the raw statistics, we replace them with private ones perturbed by Gaussian noise as in line 5. The second term is estimated similarly in line 6. The final estimator is defined as in line 8: \widetilde{\sigma}_{h}(\cdot,\cdot)^{2}=\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\} (the \max\{1,\cdot\} operator is for technical reasons only: we want a lower bound on each variance estimate).
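A sketch of the private variance-estimation step (lines 4 to 8 of Algorithm 2), assuming the noise terms \phi_{1},\phi_{2},K_{1} have already been drawn as in the initialization; the array names are ours:

```python
import numpy as np

def private_variance_estimate(Phi_bar, V_next_bar, phi1, phi2, K1, lam, H, h):
    """Phi_bar: (K, d) features phi(s_bar_h^tau, a_bar_h^tau) from D';
    V_next_bar: (K,) values V_tilde_{h+1}(s_bar_{h+1}^tau); h is 1-indexed as in the paper.
    Returns a function sigma2(phi) giving max{1, Var_tilde_h V_tilde_{h+1}} at feature phi."""
    d = Phi_bar.shape[1]
    Sigma_tilde = Phi_bar.T @ Phi_bar + lam * np.eye(d) + K1                      # line 4
    beta_tilde = np.linalg.solve(Sigma_tilde, Phi_bar.T @ V_next_bar**2 + phi1)   # line 5
    theta_tilde = np.linalg.solve(Sigma_tilde, Phi_bar.T @ V_next_bar + phi2)     # line 6

    def sigma2(phi):
        second_moment = np.clip(phi @ beta_tilde, 0.0, (H - h + 1) ** 2)          # line 7
        first_moment = np.clip(phi @ theta_tilde, 0.0, H - h + 1)
        return max(1.0, second_moment - first_moment ** 2)                        # line 8
    return sigma2
```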

Variance-weighted LSVI. Instead of directly applying LSVI (Jin et al., 2021), we solve the variance-weighted LSVI (line 10). The result of variance-weighted LSVI with non-private statistics is shown below:

\underset{{w}\in\mathbb{R}^{d}}{\operatorname{argmin}}\;\lambda\|{w}\|_{2}^{2}+\sum_{k=1}^{K}\frac{\left[\langle{\phi}(s_{h}^{k},a_{h}^{k}),{w}\rangle-r_{h}^{k}-\widetilde{V}_{h+1}(s_{h+1}^{k})\right]^{2}}{\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})}=\widehat{\Lambda}_{h}^{-1}\sum_{k=1}^{K}\frac{\phi\left(s_{h}^{k},a_{h}^{k}\right)\cdot\left[r_{h}^{k}+\widetilde{V}_{h+1}\left(s_{h+1}^{k}\right)\right]}{\widetilde{\sigma}_{h}^{2}(s^{k}_{h},a^{k}_{h})},

where the definition of \widehat{\Lambda}_{h} can be found in Appendix A. For the sake of differential privacy, we use private statistics instead and derive \widetilde{w}_{h} as in line 10.
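Correspondingly, a sketch of the private variance-weighted LSVI step (lines 9 and 10 of Algorithm 2), again with our own array names:

```python
import numpy as np

def private_weighted_lsvi(Phi, rewards, V_next, sigma2, phi3, K2, lam):
    """Phi: (K, d) features phi(s_h^tau, a_h^tau) from D; rewards, V_next: (K,) arrays;
    sigma2: (K,) private variance estimates sigma_tilde_h^2(s_h^tau, a_h^tau).
    Returns (w_tilde_h, Lambda_tilde_h) as in lines 9-10 of Algorithm 2."""
    d = Phi.shape[1]
    W = Phi / sigma2[:, None]                              # each row divided by its variance
    Lambda_tilde = Phi.T @ W + lam * np.eye(d) + K2        # line 9
    target = W.T @ (rewards + V_next) + phi3               # sum_tau phi * (r + V) / sigma^2 + phi_3
    w_tilde = np.linalg.solve(Lambda_tilde, target)        # line 10
    return w_tilde, Lambda_tilde
```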

Our private pessimism. Note that if we remove all the Gaussian noise we add, our DP-VAPVI (Algorithm 2) degenerates to VAPVI (Yin et al., 2022). We design a similar pessimistic penalty using private statistics (line 11), with an additional \frac{D}{K} term accounting for the extra pessimism due to DP.

Main theorem. We state our main theorem about DP-VAPVI below; the proof sketch is deferred to Appendix D.1 and the detailed proof to Appendix D due to space limits. The quantities \mathcal{M}_{i},L,E are defined in Appendix A; briefly, L=\widetilde{O}(\sqrt{H^{3}d/\rho}) and E=\widetilde{O}(\sqrt{Hd/\rho}). Regarding the sample complexity requirement, within the practical regime where the privacy budget is not very small, \max\{\mathcal{M}_{i}\} is dominated by \max\{\widetilde{O}(H^{12}d^{3}/\kappa^{5}),\widetilde{O}(H^{14}d/\kappa^{5})\}, which also appears in the sample complexity requirement of VAPVI (Yin et al., 2022). The \sigma_{V}^{2}(s,a) in Theorem 4.1 is defined as \max\{1,\mathrm{Var}_{P_{h}}(V)(s,a)\} for any V.

Theorem 4.1.

DP-VAPVI (Algorithm 2) satisfies \rho-zCDP. Furthermore, let K be the number of episodes. Under the condition that K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\} and \sqrt{d}>\xi, where \xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|, for any 0<\lambda<\kappa, with probability 1-\delta, for all policies \pi simultaneously, the output \widehat{\pi} of DP-VAPVI satisfies

v^{\pi}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{DH}{K}, (9)

where \Lambda_{h}=\sum_{k=1}^{K}\frac{\phi(s_{h}^{k},a_{h}^{k})\cdot\phi(s_{h}^{k},a_{h}^{k})^{\top}}{\sigma^{2}_{\widetilde{V}_{h+1}}(s_{h}^{k},a_{h}^{k})}+\lambda I_{d}, D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and \widetilde{O} hides constants and polylog terms.
In particular, defining \Lambda^{\star}_{h}=\sum_{k=1}^{K}\frac{\phi(s_{h}^{k},a_{h}^{k})\cdot\phi(s_{h}^{k},a_{h}^{k})^{\top}}{\sigma^{2}_{{V}^{\star}_{h+1}}(s_{h}^{k},a_{h}^{k})}+\lambda I_{d}, we have with probability 1-\delta,

v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{\star-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{DH}{K}. (10)

Comparison to the non-private counterpart VAPVI (Yin et al., 2022). Plugging in the definitions of L and E (Appendix A), in the meaningful regime where the privacy budget is not very large, DH is dominated by \widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\sqrt{\rho}}\right). According to Theorem 3.2 in (Yin et al., 2022), the sub-optimality bound of VAPVI states that for sufficiently large K, with high probability, the output \widehat{\pi} satisfies:

v^{\star}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\bigg{[}\sqrt{\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{\star-1}\phi(\cdot,\cdot)}\bigg{]}\right)+\frac{2H^{4}\sqrt{d}}{K}. (11)

Compared to our Theorem 4.1, the additional sub-optimality due to differential privacy is \widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\sqrt{\rho}\cdot K}\right)=\widetilde{O}\left(\frac{H^{\frac{11}{2}}d/\kappa^{\frac{3}{2}}}{\epsilon\cdot K}\right) (here we apply the second part of Lemma 2.7 to achieve (\epsilon,\delta)-DP; the notation \widetilde{O} also absorbs \log\frac{1}{\delta}, where only here \delta denotes the privacy budget instead of the failure probability). In the most common regime where the privacy budget \rho or \epsilon is a constant, the additional term due to differential privacy again appears as a lower order term.

Instance-dependent sub-optimality bound. Similar to DP-APVI (Algorithm 1), our DP-VAPVI (Algorithm 2) also enjoys an instance-dependent sub-optimality bound. First, the main term in (10) improves over PEVI (Jin et al., 2021) by a factor of O(\sqrt{d}) in the feature dependence. Also, our main term has no explicit dependence on H, thus improving the horizon dependence of PEVI's sub-optimality bound. For more detailed discussions of our main term, we refer readers to (Yin et al., 2022).

5 Tightness of our results

We believe our bounds for offline RL with DP are tight. To the best of our knowledge, APVI and VAPVI provide the tightest bounds under tabular MDPs and linear MDPs, respectively. The sub-optimality bounds of our algorithms match these two in the main term, with some additional lower order terms. The leading terms are known to match multiple information-theoretic lower bounds for offline RL simultaneously (as illustrated in Yin and Wang (2021b); Yin et al. (2022)); for this reason our bounds cannot be improved in general. For the lower order terms, the dependence on the sample size n and the privacy budget \epsilon, namely \widetilde{O}(\frac{1}{n\epsilon}), is optimal, since policy learning is a special case of ERM and such dependence is optimal for DP-ERM. In addition, we believe the dependence on the other parameters (H,S,A,d) in the lower order terms is tight thanks to techniques such as (3) and Lemma D.6.

6 Simulations

In this section, we carry out simulations to evaluate the performance of our DP-VAPVI (Algorithm 2) and compare it with its non-private counterpart VAPVI (Yin et al., 2022) and another pessimism-based algorithm, PEVI (Jin et al., 2021), which does not have a privacy guarantee.

Experimental setting. We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in (Min et al., 2021; Yin et al., 2022) with some modifications: we keep the state space \mathcal{S}=\{1,2\}, action space \mathcal{A}=\{1,\cdots,100\} and feature map of state-action pairs, while we choose a stochastic transition (instead of the original deterministic transition) and a more complex reward. For details of the linear MDP setting, please refer to Appendix F. The two MDP instances we use both have horizon H=20. We compare the different algorithms in Figure 1(a), while in Figure 1(b) we compare our DP-VAPVI under different privacy budgets. In the empirical evaluation, we do not split the data for DP-VAPVI or VAPVI, and for DP-VAPVI we run the simulation five times and report the average performance.

Figure 1: Comparison between the performance of PEVI, VAPVI and DP-VAPVI (with different privacy budgets) under the linear MDP example described above. Panel (a) compares the different algorithms (H=20); panel (b) compares DP-VAPVI under different privacy budgets (H=20). In each panel, the y-axis represents the sub-optimality gap v^{\star}-v^{\widehat{\pi}} while the x-axis denotes the number of episodes K. The horizon is fixed to H=20 and the number of episodes ranges from 5 to 1000.

Results and discussion. From Figure 1, we observe that DP-VAPVI (Algorithm 2) performs slightly worse than its non-private version VAPVI (Yin et al., 2022), which is expected since we add Gaussian noise to each statistic. However, as the dataset grows, the performance of DP-VAPVI converges to that of VAPVI, which supports our theoretical conclusion that the cost of privacy only appears in lower order terms. For DP-VAPVI with a larger privacy budget, the scale of the noise is smaller and the performance is closer to VAPVI, as shown in Figure 1(b). Furthermore, in most cases DP-VAPVI still outperforms PEVI, which does not have a privacy guarantee; this arises from our privatization of variance-aware LSVI instead of plain LSVI.

7 Conclusion and future works

In this work, we take the first steps towards the well-motivated task of designing private offline RL algorithms. We propose algorithms for both tabular MDPs and linear MDPs, and show that they enjoy instance-dependent sub-optimality bounds while guaranteeing differential privacy (either zCDP or pure DP). Our results highlight that the cost of privacy only appears in lower order terms and thus becomes negligible as the number of samples grows large.

Future extensions are numerous. We believe the techniques in our algorithms (privatization of the Bernstein-type pessimism and of variance-aware LSVI) and the corresponding analysis can also be used in online settings to obtain tighter regret bounds for private algorithms. For offline RL, we plan to consider more general function approximation and differentially private (deep) offline RL, which would bridge the gap between theory and practice in offline RL applications. Many techniques we developed could be adapted to these more general settings.

Acknowledgments

The research is partially supported by NSF Awards #2007117 and #2048091. The authors would like to thank Ming Yin for helpful discussions.

References

  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Agarwal and Singh [2017] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, pages 32–40. PMLR, 2017.
  • Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Balle et al. [2016] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016.
  • Basu et al. [2019] Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019.
  • Brown et al. [2021] Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
  • Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
  • Cai et al. [2020] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
  • Carlini et al. [2019] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
  • Chen et al. [2020] Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, and Liwei Wang. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, pages 1757–1767. PMLR, 2020.
  • Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
  • Chowdhury and Zhou [2021] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. arXiv preprint arXiv:2112.10599, 2021.
  • Chowdhury et al. [2021] Sayak Ray Chowdhury, Xingyu Zhou, and Ness Shroff. Adaptive control of differentially private linear quadratic systems. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 485–490. IEEE, 2021.
  • Cundy and Ermon [2020] Chris Cundy and Stefano Ermon. Privacy-constrained policies via mutual information regularized policy gradients. arXiv preprint arXiv:2012.15019, 2020.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dwork and Rothblum [2016] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • Ficken [2015] Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015.
  • Gajane et al. [2018] Pratik Gajane, Tanguy Urvoy, and Emilie Kaufmann. Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pages 387–412. PMLR, 2018.
  • Garcelon et al. [2021] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
  • Guha Thakurta and Smith [2013] Abhradeep Guha Thakurta and Adam Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013.
  • Hu et al. [2021] Bingshan Hu, Zhiming Huang, and Nishant A Mehta. Optimal algorithms for private online learning in a stochastic environment. arXiv preprint arXiv:2102.07929, 2021.
  • Jin et al. [2020a] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
  • Jin et al. [2020b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020b.
  • Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
  • Kim et al. [2017] Hyeoneui Kim, Elizabeth Bell, Jihoon Kim, Amy Sitapati, Joe Ramsdell, Claudiu Farcas, Dexter Friedman, Stephanie Feudjio Feupe, and Lucila Ohno-Machado. iconcur: informed consent for clinical data and bio-sample use for research. Journal of the American Medical Informatics Association, 24(2):380–387, 2017.
  • Lebensold et al. [2019] Jonathan Lebensold, William Hamilton, Borja Balle, and Doina Precup. Actor critic with differentially private critic. arXiv preprint arXiv:1910.05876, 2019.
  • Liao et al. [2021] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes. arXiv preprint arXiv:2110.10133, 2021.
  • Liu et al. [2019] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Uncertainty in Artificial Intelligence, 2019.
  • Luyo et al. [2021] Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private exploration in reinforcement learning with linear representation. arXiv preprint arXiv:2112.01585, 2021.
  • Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. Conference on Learning Theory, 2009.
  • Min et al. [2021] Yifei Min, Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34, 2021.
  • Nemhauser and Wolsey [1988] George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146–181, 1988.
  • Ngo et al. [2022] Dung Daniel Ngo, Giuseppe Vietri, and Zhiwei Steven Wu. Improved regret for differentially private exploration in linear mdp. arXiv preprint arXiv:2202.01292, 2022.
  • Ono and Takahashi [2020] Hajime Ono and Tsubasa Takahashi. Locally private distributed reinforcement learning. arXiv preprint arXiv:2001.11718, 2020.
  • Qiao and Wang [2022] Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. arXiv preprint arXiv:2212.04680, 2022.
  • Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In Proceedings of the 39th International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017.
  • Redberg and Wang [2021] Rachel Redberg and Yu-Xiang Wang. Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34, 2021.
  • Sallab et al. [2017] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
  • Shariff and Sheffet [2018] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.
  • Sridharan [2002] Karthik Sridharan. A gentle introduction to concentration inequalities. Dept. Comput. Sci., Cornell Univ., Tech. Rep, 2002.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tossou and Dimitrakakis [2017] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • Vietri et al. [2020] Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with PAC and regret guarantees. In International Conference on Machine Learning, pages 9754–9764. PMLR, 2020.
  • Wang and Hegde [2019] Baoxiang Wang and Nidhi Hegde. Privacy-preserving q-learning with functional noise in continuous spaces. Advances in Neural Information Processing Systems, 32, 2019.
  • Wang et al. [2021] Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline RL with linear function approximation? International Conference on Learning Representations, 2021.
  • Xie et al. [2019] Tengyang Xie, Philip S Thomas, and Gerome Miklau. Privacy preserving off-policy evaluation. arXiv preprint arXiv:1902.00174, 2019.
  • Xie et al. [2021a] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 2021a.
  • Xie et al. [2021b] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 2021b.
  • Yin and Wang [2020] Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
  • Yin and Wang [2021a] Ming Yin and Yu-Xiang Wang. Optimal uniform ope and model-based offline reinforcement learning in time-homogeneous, reward-free and task-agnostic settings. Advances in neural information processing systems, 2021a.
  • Yin and Wang [2021b] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021b.
  • Yin et al. [2021] Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567–1575. PMLR, 2021.
  • Yin et al. [2022] Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
  • Zanette [2021] Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In International Conference on Machine Learning, pages 12287–12297. PMLR, 2021.
  • Zanette et al. [2021] Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 2021.
  • Zhang et al. [2020] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2):25–40, 2020.
  • Zheng et al. [2020] Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310, 2020.
  • Zhou et al. [2021] Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
  • Zhou [2022] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. arXiv preprint arXiv:2201.07052, 2022.

Appendix A Notation List

A.1 Notations for tabular MDP

EρE_{\rho} 4Hlog4HS2Aδρ4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}
nn The original counts of visitation
nn^{\prime} The noisy counts, as defined in (2)
n~\widetilde{n} Final choice of private counts, as defined in (3)
P~\widetilde{P} Private estimate of transition kernel, as defined in (5)
P^\widehat{P} Non-private estimate of transition kernel, as defined in (15)
ι\iota logHSAδ\log\frac{HSA}{\delta}
ρ\rho Budget for zCDP
δ\delta Failure probability

A.2 Notations for linear MDP

LL 2H5Hdlog(10Hdδ)ρ2H\sqrt{\frac{5Hd\log(\frac{10Hd}{\delta})}{\rho}}
EE 10Hdρ(2+(log(5c1H/δ)c2d)23)\sqrt{\frac{10Hd}{\rho}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right)
DD O~(H2Lκ+H4Edκ3/2+H3d)\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right)
Λ^h\widehat{\Lambda}_{h} k=1Kϕ(shk,ahk)ϕ(shk,ahk)/σ~h2(shk,ahk)+λId\sum_{k=1}^{K}\phi(s_{h}^{k},a_{h}^{k})\phi(s_{h}^{k},a_{h}^{k})^{\top}/\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})+\lambda I_{d}
Λ~h\widetilde{\Lambda}_{h} k=1Kϕ(shk,ahk)ϕ(shk,ahk)/σ~h2(shk,ahk)+λId+K2\sum_{k=1}^{K}\phi(s_{h}^{k},a_{h}^{k})\phi(s_{h}^{k},a_{h}^{k})^{\top}/\widetilde{\sigma}^{2}_{h}(s_{h}^{k},a_{h}^{k})+\lambda I_{d}+K_{2}
Λ~hp\widetilde{\Lambda}^{p}_{h} 𝔼μ,h[σ~h2(s,a)ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}[\widetilde{\sigma}_{h}^{-2}(s,a)\phi(s,a)\phi(s,a)^{\top}]
Λh{\Lambda}_{h} τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I
Λhp\Lambda_{h}^{p} 𝔼μ,h[σV~h+12(s,a)ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}[\sigma^{-2}_{\widetilde{V}_{h+1}}(s,a)\phi(s,a)\phi(s,a)^{\top}]
Λh{\Lambda}^{\star}_{h} τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σVh+12(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{{V}^{\star}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I
Σ¯h\bar{\Sigma}_{h} τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λId\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I_{d}
Σ~h\widetilde{\Sigma}_{h} τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λId+K1\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I_{d}+K_{1}
Σhp\Sigma^{p}_{h} 𝔼μ,h[ϕ(s,a)ϕ(s,a)]\mathbb{E}_{\mu,h}\left[\phi(s,a)\phi(s,a)^{\top}\right]
κ\kappa minhλmin(Σhp)\min_{h}\lambda_{\mathrm{min}}(\Sigma^{p}_{h})
σV2(s,a)\sigma_{V}^{2}(s,a) max{1,VarPh(V)(s,a)}\max\{1,\mathrm{Var}_{P_{h}}(V)(s,a)\} for any VV
σh2(s,a)\sigma^{\star 2}_{h}(s,a) max{1,VarPhVh+1(s,a)}\max\left\{1,{\mathrm{Var}}_{P_{h}}{V}^{\star}_{h+1}(s,a)\right\}
σ~h2(s,a)\widetilde{\sigma}_{h}^{2}(s,a) max{1,Var~hV~h+1(s,a)}\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(s,a)\}
1\mathcal{M}_{1} max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2,2Ldκ}\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}
2\mathcal{M}_{2} max{O~(H12d3/κ5),O~(H14d/κ5)}\max\{\widetilde{O}(H^{12}d^{3}/\kappa^{5}),\widetilde{O}(H^{14}d/\kappa^{5})\}
3\mathcal{M}_{3} max{512H4log(2dHδ)κ2,4λH2κ}\max\left\{\frac{512H^{4}\log\left(\frac{2dH}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}
4\mathcal{M}_{4} max{H2L2dκ,H6E2κ2,H4κ}\max\{\frac{H^{2}L^{2}}{d\kappa},\frac{H^{6}E^{2}}{\kappa^{2}},H^{4}\kappa\}
ρ\rho Budget for zCDP
δ\delta Failure probability (not the δ\delta of (ϵ,δ)(\epsilon,\delta)-DP)
ξ\xi supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|

Appendix B Extended related work

Online reinforcement learning under JDP or LDP. For online RL, several recent works analyze this setting under Joint Differential Privacy (JDP), which requires the RL agent to minimize regret while handling users’ raw data privately. Under tabular MDP, Vietri et al. [2020] design PUCB by revising UBEV [Dann et al., 2017], and Private-UCB-VI [Chowdhury and Zhou, 2021] results from UCBVI (with bonus-1) [Azar et al., 2017]. However, both works privatize a Hoeffding-type bonus, which leads to sub-optimal regret bounds. Under linear MDP, Private LSVI-UCB [Ngo et al., 2022] and Privacy-Preserving LSVI-UCB [Luyo et al., 2021] are private versions of LSVI-UCB [Jin et al., 2020b], while LinOpt-VI-Reg [Zhou, 2022] and Privacy-Preserving UCRL-VTR [Luyo et al., 2021] generalize UCRL-VTR [Ayoub et al., 2020]. However, these works build on the LSVI technique [Jin et al., 2020b] (unweighted ridge regression), which does not ensure an optimal regret bound.

In addition to JDP, another common privacy guarantee for online RL is Local Differential Privacy (LDP). LDP is a stronger notion of DP since it requires that each user’s data be protected before the RL agent has access to it. Under LDP, Garcelon et al. [2021] establish a regret lower bound and design LDP-OBI, which has a matching regret upper bound. This result is generalized to the linear mixture setting by Liao et al. [2021]. Later, Luyo et al. [2021] provide a unified framework for analyzing JDP and LDP under the linear setting.

Some other differentially private learning algorithms. There are also works on differentially private online learning [Guha Thakurta and Smith, 2013, Agarwal and Singh, 2017, Hu et al., 2021] and on various bandit settings [Shariff and Sheffet, 2018, Gajane et al., 2018, Basu et al., 2019, Zheng et al., 2020, Chen et al., 2020, Tossou and Dimitrakakis, 2017]. For the reinforcement learning setting, Wang and Hegde [2019] propose privacy-preserving Q-learning to protect the reward information. Ono and Takahashi [2020] study distributed reinforcement learning under LDP. Lebensold et al. [2019] present an actor-critic algorithm with a differentially private critic. Cundy and Ermon [2020] tackle DP-RL under the policy gradient framework. Chowdhury et al. [2021] consider the adaptive control of differentially private linear quadratic (LQ) systems.

Offline reinforcement learning under tabular MDP. Under tabular MDP, several works achieve optimal sub-optimality or sample complexity bounds under different coverage assumptions. For the problem of off-policy evaluation (OPE), Yin and Wang [2020] use the Tabular-MIS estimator to achieve asymptotic efficiency. In addition, the idea of uniform OPE is used to achieve the optimal sample complexity O(H3/dmϵ2)O(H^{3}/d_{m}\epsilon^{2}) [Yin et al., 2021] for non-stationary MDPs and the optimal sample complexity O(H2/dmϵ2)O(H^{2}/d_{m}\epsilon^{2}) [Yin and Wang, 2021a] for stationary MDPs, where dmd_{m} is a lower bound on the state-action occupancy. This uniform convergence idea also supports several works on online exploration [Jin et al., 2020a, Qiao et al., 2022]. For offline RL under the single concentrability assumption, Xie et al. [2021b] arrive at the optimal sample complexity O(H3SC/ϵ2)O(H^{3}SC^{\star}/\epsilon^{2}). Recently, Yin and Wang [2021b] propose APVI, which yields an instance-dependent sub-optimality bound that subsumes previous optimal results under several assumptions.

Offline reinforcement learning under linear MDP. Recently, many works study offline RL with linear representations. Jin et al. [2021] present PEVI, which applies the idea of pessimistic value iteration (originating from [Jin et al., 2020b]) and is provably efficient for offline RL under linear MDP. Yin et al. [2022] improve the sub-optimality bound in [Jin et al., 2021] by replacing LSVI with variance-weighted LSVI. Xie et al. [2021a] consider Bellman-consistent pessimism for general function approximation, and their result improves the sample complexity in [Jin et al., 2021] by a factor of order O(d)O(d) (shown in Theorem 3.2); however, there is no improvement in the horizon dependence. Zanette et al. [2021] propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle. Besides, Wang et al. [2021], Zanette [2021] study the statistical hardness of offline RL with linear representations by presenting exponential lower bounds.

Appendix C Proof of Theorem 3.4

C.1 Proof sketch

Since the proof of the privacy guarantee is short, we present it in full in Section C.2 below and only sketch the proof of the sub-optimality bound here.

First, we bound the scale of the noise we add and show that the n~\widetilde{n} derived from (3) are close to the true visitation counts. Consequently, denoting the non-private empirical transition kernel by P^\widehat{P} (defined in (15)), we can show that P~P^1\|\widetilde{P}-\widehat{P}\|_{1} and |VarP~(V)VarP^(V)||\sqrt{\mathrm{Var}_{\widetilde{P}}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}}(V)}| are small.

Next, owing to the conditional independence of V~h+1\widetilde{V}_{h+1} and P~h\widetilde{P}_{h}, we apply the empirical Bernstein inequality to get |(P~hPh)V~h+1|VarP~(V~h+1)/n~sh,ah+SHEρ/n~sh,ah|(\widetilde{P}_{h}-P_{h})\widetilde{V}_{h+1}|\lesssim\sqrt{\mathrm{Var}_{\widetilde{P}}(\widetilde{V}_{h+1})/\widetilde{n}_{s_{h},a_{h}}}+SHE_{\rho}/\widetilde{n}_{s_{h},a_{h}}. Together with our definition of the private pessimistic penalty and the key extended value difference lemma (Lemmas E.7 and E.8), we can bound the sub-optimality of our output policy π^\widehat{\pi} by:

vvπ^h=1H(sh,ah)𝒞hdhπ(sh,ah)VarP~h(|sh,ah)(V~h+1())n~sh,ah+SHEρ/n~sh,ah.v^{\star}-v^{\widehat{\pi}}\lesssim\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+SHE_{\rho}/\widetilde{n}_{s_{h},a_{h}}. (12)

Finally, we further bound the above sub-optimality by replacing the private statistics with non-private ones. Specifically, we replace n~\widetilde{n} by nn, P~\widetilde{P} by PP, and V~\widetilde{V} by VV^{\star}. Due to (12), we have V~V1nd¯m\|\widetilde{V}-V^{\star}\|_{\infty}\lesssim\sqrt{\frac{1}{n\bar{d}_{m}}}. Together with the upper bounds on P~P^1\|\widetilde{P}-\widehat{P}\|_{1} and |VarP~(V)VarP^(V)||\sqrt{\mathrm{Var}_{\widetilde{P}}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}}(V)}|, we have

VarP~h(|sh,ah)(V~h+1())n~sh,ahVarP~h(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarP^h(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarPh(|sh,ah)(Vh+1())n~sh,ah+1nd¯mVarPh(|sh,ah)(Vh+1())ndhμ(sh,ah)+1nd¯m.\begin{split}&\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}\lesssim\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\\ \lesssim&\sqrt{\frac{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\lesssim\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{\widetilde{n}_{s_{h},a_{h}}}}+\frac{1}{n\bar{d}_{m}}\\ \lesssim&\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{1}{n\bar{d}_{m}}.\end{split} (13)

The final bound in terms of non-private statistics then follows by combining (12) and (13).

C.2 Proof of the privacy guarantee

The privacy guarantee of DP-APVI (Algorithm 1) is summarized by Lemma C.1 below.

Lemma C.1 (Privacy analysis of DP-APVI (Algorithm 1)).

DP-APVI (Algorithm 1) satisfies ρ\rho-zCDP.

Proof of Lemma C.1.

The 2\ell_{2} sensitivity of {nsh,ah}\{n_{s_{h},a_{h}}\} is 2H\sqrt{2H}. According to Lemma 2.7, the Gaussian Mechanism used on {nsh,ah}\{n_{s_{h},a_{h}}\} with σ2=2Hρ\sigma^{2}=\frac{2H}{\rho} satisfies ρ2\frac{\rho}{2}-zCDP. Similarly, the Gaussian Mechanism used on {nsh,ah,sh+1}\{n_{s_{h},a_{h},s_{h+1}}\} with σ2=2Hρ\sigma^{2}=\frac{2H}{\rho} also satisfies ρ2\frac{\rho}{2}-zCDP. Combining these two results, due to the composition of zCDP (Lemma E.16), the construction of {n}\{n^{\prime}\} satisfies ρ\rho-zCDP. Finally, DP-APVI satisfies ρ\rho-zCDP because the output π^\widehat{\pi} is a post-processing of {n}\{n^{\prime}\}. ∎
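To make the noise-injection step concrete, the following Python code is a minimal sketch (not the authors' implementation; the function name and array shapes are our own assumptions) of the Gaussian Mechanism used in the proof above: each family of counts receives noise with variance 2H/ρ, so that each family satisfies (ρ/2)-zCDP.

import numpy as np

def privatize_counts(counts, H, rho, rng=None):
    """Add Gaussian noise with variance 2H/rho to every visitation count.

    counts : array of shape (H, S, A) or (H, S, A, S) holding n_{s_h,a_h}
             or n_{s_h,a_h,s_{h+1}} (the shapes are an assumption for illustration).
    Returns the noisy counts n' used before the post-processing step (3).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * H / rho)   # Gaussian Mechanism: sigma^2 = 2H / rho
    return counts + rng.normal(scale=sigma, size=counts.shape)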

C.3 Proof of the sub-optimality bound

C.3.1 Utility analysis

First, the following Lemma C.2 gives a high-probability bound on |nn||n^{\prime}-n|.

Lemma C.2.

Let Eρ=22σlog4HS2Aδ=4Hlog4HS2AδρE_{\rho}=2\sqrt{2}\sigma\sqrt{\log{\frac{4HS^{2}A}{\delta}}}=4\sqrt{\frac{H\log{\frac{4HS^{2}A}{\delta}}}{\rho}}, then with probability 1δ1-\delta, for all sh,ah,sh+1s_{h},a_{h},s_{h+1}, it holds that

|nsh,ahnsh,ah|Eρ2,|nsh,ah,sh+1nsh,ah,sh+1|Eρ2.|n^{\prime}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq\frac{E_{\rho}}{2},\,\,|n^{\prime}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq\frac{E_{\rho}}{2}. (14)
Proof of Lemma C.2.

The inequalities follow directly from Gaussian concentration and a union bound. ∎

Building on the utility analysis above, the following Lemma C.3 gives a high-probability bound on |n~n||\widetilde{n}-n|.

Lemma C.3.

Under the high probability event in Lemma C.2, for all sh,ah,sh+1s_{h},a_{h},s_{h+1}, it holds that

|n~sh,ahnsh,ah|Eρ,|n~sh,ah,sh+1nsh,ah,sh+1|Eρ.|\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq E_{\rho},\,\,|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq E_{\rho}.
Proof of Lemma C.3.

When the event in Lemma C.2 holds, the original counts {nsh,ah,s}s𝒮\{n_{s_{h},a_{h},s^{\prime}}\}_{s^{\prime}\in\mathcal{S}} form a feasible solution to the optimization problem, which means that

maxs|n~sh,ah,snsh,ah,s|maxs|nsh,ah,snsh,ah,s|Eρ2.\max_{s^{\prime}}|\widetilde{n}_{s_{h},a_{h},s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq\max_{s^{\prime}}|n_{s_{h},a_{h},s^{\prime}}-n^{\prime}_{s_{h},a_{h},s^{\prime}}|\leq\frac{E_{\rho}}{2}.

Due to the second part of (14), it holds that for any sh,ah,sh+1s_{h},a_{h},s_{h+1},

|n~sh,ah,sh+1nsh,ah,sh+1||n~sh,ah,sh+1nsh,ah,sh+1|+|nsh,ah,sh+1nsh,ah,sh+1|Eρ.|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq|\widetilde{n}_{s_{h},a_{h},s_{h+1}}-n^{\prime}_{s_{h},a_{h},s_{h+1}}|+|n^{\prime}_{s_{h},a_{h},s_{h+1}}-n_{s_{h},a_{h},s_{h+1}}|\leq E_{\rho}.

For the second part, because of the constraints in the optimization problem, it holds that

|n~sh,ahnsh,ah|Eρ2.|\widetilde{n}_{s_{h},a_{h}}-n^{\prime}_{s_{h},a_{h}}|\leq\frac{E_{\rho}}{2}.

Due to the first part of (14), it holds that for any sh,ahs_{h},a_{h},

|n~sh,ahnsh,ah||n~sh,ahnsh,ah|+|nsh,ahnsh,ah|Eρ.|\widetilde{n}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq|\widetilde{n}_{s_{h},a_{h}}-n^{\prime}_{s_{h},a_{h}}|+|n^{\prime}_{s_{h},a_{h}}-n_{s_{h},a_{h}}|\leq E_{\rho}.

Let the non-private empirical estimate be:

P^h(s|sh,ah)=nsh,ah,snsh,ah,\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{n_{s_{h},a_{h},s^{\prime}}}{n_{s_{h},a_{h}}}, (15)

if nsh,ah>0n_{s_{h},a_{h}}>0 and P^h(s|sh,ah)=1S\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})=\frac{1}{S} otherwise. We will show that the private transition kernel P~\widetilde{P} is close to P^\widehat{P} via Lemma C.4 and Lemma C.5 below.
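As a small illustration of (15), the sketch below computes the non-private empirical kernel from the raw counts, with the uniform 1/S fallback for unvisited pairs; the analogous private estimate in (5) uses the private counts instead (we do not repeat the exact form of (5) here, so this helper only illustrates (15)).

import numpy as np

def empirical_kernel(n_sas, n_sa):
    """Non-private estimate (15): P_hat(s'|s,a) = n_{s,a,s'} / n_{s,a},
    and the uniform distribution 1/S when n_{s,a} = 0.

    n_sas : array (S, A, S) of transition counts at a fixed step h.
    n_sa  : array (S, A) of visitation counts at the same step.
    """
    S = n_sas.shape[-1]
    P_hat = np.full_like(n_sas, 1.0 / S, dtype=float)
    visited = n_sa > 0
    P_hat[visited] = n_sas[visited] / n_sa[visited][:, None]
    return P_hat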

Lemma C.4.

Under the high probability event of Lemma C.3, for sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, it holds that

P~h(|sh,ah)P^h(|sh,ah)15SEρn~sh,ah.\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}\leq\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}. (16)
Proof of Lemma C.4.

If n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} and the conclusion in Lemma C.3 hold, we have

P~h(|sh,ah)P^h(|sh,ah)1s𝒮|P~h(s|sh,ah)P^h(s|sh,ah)|\displaystyle\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}\leq\sum_{s^{\prime}\in\mathcal{S}}\left|\widetilde{P}_{h}(s^{\prime}|s_{h},a_{h})-\widehat{P}_{h}(s^{\prime}|s_{h},a_{h})\right| (17)
\displaystyle\leq s𝒮(n~sh,ah,s+Eρn~sh,ahEρn~sh,ah,sn~sh,ah)\displaystyle\sum_{s^{\prime}\in\mathcal{S}}\left(\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\right)
\displaystyle\leq s𝒮[(1n~sh,ah+2Eρn~sh,ah2)(n~sh,ah,s+Eρ)n~sh,ah,sn~sh,ah]\displaystyle\sum_{s^{\prime}\in\mathcal{S}}\left[\left(\frac{1}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}^{2}}\right)\left(\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}\right)-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\right]
\displaystyle\leq SEρn~sh,ah+2Eρn~sh,ah+2SEρ2n~sh,ah2\displaystyle\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\frac{2SE_{\rho}^{2}}{\widetilde{n}_{s_{h},a_{h}}^{2}}
\displaystyle\leq 5SEρn~sh,ah.\displaystyle\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}.

The second inequality is because n~sh,ah,sEρn~sh,ah+Eρnsh,ah,snsh,ahn~sh,ah,s+Eρn~sh,ahEρ\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}-E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}+E_{\rho}}\leq\frac{n_{s_{h},a_{h},s^{\prime}}}{n_{s_{h},a_{h}}}\leq\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}} and n~sh,ah,s+Eρn~sh,ahEρn~sh,ah,sn~sh,ahn~sh,ah,sn~sh,ahn~sh,ah,sEρn~sh,ah+Eρ\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}+E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}\geq\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}}{\widetilde{n}_{s_{h},a_{h}}}-\frac{\widetilde{n}_{s_{h},a_{h},s^{\prime}}-E_{\rho}}{\widetilde{n}_{s_{h},a_{h}}+E_{\rho}}. The third inequality is because of Lemma E.6. The last inequality is because n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}. ∎

Lemma C.5.

Let VSV\in\mathbb{R}^{S} be any function with VH\|V\|_{\infty}\leq H, under the high probability event of Lemma C.3, for sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, it holds that

|VarP^h(|sh,ah)(V)VarP~h(|sh,ah)(V)|4HSEρn~sh,ah.\left|\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V)}-\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V)}\right|\leq 4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}. (18)
Proof of Lemma C.5.

For sh,ahs_{h},a_{h} such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, we use P~()\widetilde{P}(\cdot) and P^()\widehat{P}(\cdot) instead of P~h(|sh,ah)\widetilde{P}_{h}(\cdot|s_{h},a_{h}) and P^h(|sh,ah)\widehat{P}_{h}(\cdot|s_{h},a_{h}) for simplicity. Because of Lemma C.4, we have

P~()P^()15SEρn~sh,ah.\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}\leq\frac{5SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}.

Therefore, it holds that

|VarP^()(V)VarP~()(V)||VarP^()(V)VarP~()(V)|\displaystyle\left|\sqrt{\mathrm{Var}_{\widehat{P}(\cdot)}(V)}-\sqrt{\mathrm{Var}_{\widetilde{P}(\cdot)}(V)}\right|\leq\sqrt{|\mathrm{Var}_{\widehat{P}(\cdot)}(V)-\mathrm{Var}_{\widetilde{P}(\cdot)}(V)|} (19)
\displaystyle\leq s𝒮|P^(s)P~(s)|V(s)2+|s𝒮[P^(s)+P~(s)]V(s)|s𝒮|P^(s)P~(s)|V(s)\displaystyle\sqrt{\sum_{s^{\prime}\in\mathcal{S}}\left|\widehat{P}(s^{\prime})-\widetilde{P}(s^{\prime})\right|V(s^{\prime})^{2}+\left|\sum_{s^{\prime}\in\mathcal{S}}\left[\widehat{P}(s^{\prime})+\widetilde{P}(s^{\prime})\right]V(s^{\prime})\right|\cdot\sum_{s^{\prime}\in\mathcal{S}}\left|\widehat{P}(s^{\prime})-\widetilde{P}(s^{\prime})\right|V(s^{\prime})}
\displaystyle\leq H2P~()P^()1+2H2P~()P^()1\displaystyle\sqrt{H^{2}\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}+2H^{2}\left\lVert\widetilde{P}(\cdot)-\widehat{P}(\cdot)\right\rVert_{1}}
\displaystyle\leq 4HSEρn~sh,ah.\displaystyle 4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}.

The second inequality is due to the definition of variance. ∎
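The chain of inequalities in (19) can be sanity-checked numerically. The short sketch below is our own illustration (not part of the paper): it draws random distributions and bounded values and verifies that the gap between the square-root variances is dominated by the intermediate bound sqrt(3 H^2 ||P_hat - P_tilde||_1) used above.

import numpy as np

rng = np.random.default_rng(0)
S, H = 20, 10
for _ in range(1000):
    P_hat = rng.dirichlet(np.ones(S))
    P_tld = rng.dirichlet(np.ones(S))
    V = rng.uniform(0, H, size=S)
    var_hat = P_hat @ V**2 - (P_hat @ V) ** 2
    var_tld = P_tld @ V**2 - (P_tld @ V) ** 2
    gap = abs(np.sqrt(var_hat) - np.sqrt(var_tld))
    bound = np.sqrt(3 * H**2 * np.abs(P_hat - P_tld).sum())
    assert gap <= bound + 1e-9   # |sqrt(a)-sqrt(b)| <= sqrt(|a-b|) <= sqrt(3H^2 ||dP||_1)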

C.3.2 Validity of our pessimistic penalty

Now we are ready to present the key lemma (Lemma C.6) below to justify our use of Γ\Gamma as the pessimistic penalty.

Lemma C.6.

Under the high probability event of Lemma C.3, with probability 1δ1-\delta, for any sh,ahs_{h},a_{h}, if n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} (which implies nsh,ah>0n_{s_{h},a_{h}}>0), it holds that

|(P~hPh)V~h+1(sh,ah)|2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah,\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}, (20)

where V~\widetilde{V} is the private version of the estimated value function appearing in Algorithm 1 and ι=log(HSA/δ)\iota=\log(HSA/\delta).

Proof of Lemma C.6.
|(P~hPh)V~h+1(sh,ah)||(P~hP^h)V~h+1(sh,ah)|+|(P^hPh)V~h+1(sh,ah)|\displaystyle\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\left|(\widetilde{P}_{h}-\widehat{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right| (21)
\displaystyle\leq HP~h(|sh,ah)P^h(|sh,ah)1+|(P^hPh)V~h+1(sh,ah)|\displaystyle H\left\lVert\widetilde{P}_{h}(\cdot|s_{h},a_{h})-\widehat{P}_{h}(\cdot|s_{h},a_{h})\right\rVert_{1}+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|
\displaystyle\leq 5SHEρn~sh,ah+|(P^hPh)V~h+1(sh,ah)|,\displaystyle\frac{5SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}+\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|,

where the third inequality is due to Lemma C.4.

Next, recall that π^h+1\widehat{\pi}_{h+1} in Algorithm 1 is computed backwards and therefore only depends on the sample tuples from time h+1h+1 to HH. As a result, V~h+1=Q¯h+1,π^h+1\widetilde{V}_{h+1}=\langle\overline{Q}_{h+1},\widehat{\pi}_{h+1}\rangle also depends only on the sample tuples from time h+1h+1 to HH and on Gaussian noise that is independent of the offline dataset. On the other hand, by definition, P^h\widehat{P}_{h} only depends on the sample tuples from time hh to h+1h+1. Therefore V~h+1\widetilde{V}_{h+1} and P^h\widehat{P}_{h} are conditionally independent (this trick is also used in [Yin et al., 2021] and [Yin and Wang, 2021b]). By the empirical Bernstein inequality (Lemma E.4) and a union bound, with probability 1δ1-\delta, for all sh,ahs_{h},a_{h} such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho},

|(P^hPh)V~h+1(sh,ah)|2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+7Hι3nsh,ah.\left|(\widehat{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{7H\cdot\iota}{3n_{s_{h},a_{h}}}. (22)

Therefore, we have

|(P~hPh)V~h+1(sh,ah)|2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+7Hι3nsh,ah+5SHEρn~sh,ah\displaystyle\left|(\widetilde{P}_{h}-P_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})\right|\leq\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{7H\cdot\iota}{3n_{s_{h},a_{h}}}+\frac{5SHE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}} (23)
\displaystyle\leq 2VarP^h(|sh,ah)(V~h+1())ιnsh,ah+9SHEριn~sh,ah\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{9SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}
\displaystyle\leq 9SHEριn~sh,ah+2VarP~h(|sh,ah)(V~h+1())ιnsh,ah+42HSEριn~sh,ahnsh,ah\displaystyle\frac{9SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+4\sqrt{2}H\sqrt{\frac{SE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}\cdot n_{s_{h},a_{h}}}}
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιnsh,ah+16SHEριn~sh,ah\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah.\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}.

The second and fourth inequalities hold because when n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}, we have nsh,ah2n~sh,ah3n_{s_{h},a_{h}}\geq\frac{2\widetilde{n}_{s_{h},a_{h}}}{3}; they also use the fact that we only care about the regime SEρ1SE_{\rho}\geq 1, which corresponds to ρ\rho not being very large. The third inequality is due to Lemma C.5. The last inequality is due to Lemma C.3. ∎
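Lemma C.6 is exactly the statement that the pessimistic penalty Γ used in Line 5 of Algorithm 1 dominates the Bellman error at well-visited pairs. As a hedged sketch (the array names and shapes are ours, and it assumes the penalty takes the form appearing on the right-hand side of (20)), the penalty can be computed as follows.

import numpy as np

def pessimistic_penalty(var_tilde, n_tilde, S, H, E_rho, iota):
    """Penalty matching the right-hand side of (20):
    sqrt(2 * Var_{P~}(V~) * iota / (n~ - E_rho)) + 16 * S * H * E_rho * iota / n~.

    var_tilde, n_tilde : arrays (S, A) with the private variance estimates and
    private counts at step h (assumed to satisfy n~ >= 3 * E_rho).
    """
    return (np.sqrt(2.0 * var_tilde * iota / (n_tilde - E_rho))
            + 16.0 * S * H * E_rho * iota / n_tilde)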

Note that the previous lemmas rely on the condition that n~\widetilde{n} is not too small (n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}). Below we state the multiplicative Chernoff bound (Lemma C.7 and Remark C.8) to show that, under the condition of Theorem 3.4, for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, n~sh,ah\widetilde{n}_{s_{h},a_{h}} is at least 3Eρ3E_{\rho} with high probability.

Lemma C.7 (Lemma B.1 in [Yin and Wang, 2021b]).

For any 0<δ<10<\delta<1, there exists an absolute constant c1c_{1} such that when the total number of episodes satisfies n>c11/d¯mlog(HSA/δ)n>c_{1}\cdot 1/\bar{d}_{m}\cdot\log(HSA/\delta), with probability 1δ1-\delta, h[H]\forall h\in[H]

nsh,ahndhμ(sh,ah)/2,(sh,ah)𝒞h.n_{s_{h},a_{h}}\geq n\cdot d^{\mu}_{h}(s_{h},a_{h})/2,\quad\forall\;(s_{h},a_{h})\in\mathcal{C}_{h}.

Furthermore, we denote

:={nsh,ahndhμ(sh,ah)/2,(sh,ah)𝒞h,h[H].}\mathcal{E}:=\{n_{s_{h},a_{h}}\geq n\cdot d^{\mu}_{h}(s_{h},a_{h})/2,\;\forall\;(s_{h},a_{h})\in\mathcal{C}_{h},\;h\in[H].\} (24)

then equivalently P()>1δP(\mathcal{E})>1-\delta.

In addition, we denote

:={nsh,ah32ndhμ(sh,ah),(sh,ah)𝒞h,h[H].}\mathcal{E}^{\prime}:=\{n_{s_{h},a_{h}}\leq\frac{3}{2}n\cdot d^{\mu}_{h}(s_{h},a_{h}),\;\forall\;(s_{h},a_{h})\in\mathcal{C}_{h},\;h\in[H].\} (25)

then similarly P()>1δP(\mathcal{E}^{\prime})>1-\delta.

Remark C.8.

According to Lemma C.7, for any failure probability δ\delta, there exists some constant c1>0c_{1}>0 such that when nc1Eριd¯mn\geq\frac{c_{1}E_{\rho}\cdot\iota}{\bar{d}_{m}}, with probability 1δ1-\delta, for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, nsh,ah4Eρn_{s_{h},a_{h}}\geq 4E_{\rho}. Therefore, under the condition of Theorem 3.4 and the high probability events in Lemma C.3 and Lemma C.7, it holds that for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho} while for all (sh,ah)𝒞h(s_{h},a_{h})\notin\mathcal{C}_{h}, n~sh,ahEρ\widetilde{n}_{s_{h},a_{h}}\leq E_{\rho}.

Lemma C.9.

Define (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}_{h}V)(\cdot,\cdot):=r_{h}(\cdot,\cdot)+(P_{h}V)(\cdot,\cdot) for any VSV\in\mathbb{R}^{S}. Note π^\widehat{\pi}, Q¯h\overline{Q}_{h}, V~h\widetilde{V}_{h} are defined in Algorithm 1 and denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a). Then it holds that

V1π(s)V1π^(s)h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s].V_{1}^{\pi^{\star}}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}_{\pi^{\star}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]. (26)

Furthermore, the analogue of (26) holds for Vhπ(s)Vhπ^(s)V_{h}^{\pi^{\star}}(s)-V_{h}^{\widehat{\pi}}(s) for all h[H]h\in[H].

Proof of Lemma C.9.

Lemma C.9 is a direct corollary of Lemma E.8 with π=π\pi=\pi^{\star}, Q^h=Q¯h\widehat{Q}_{h}=\overline{Q}_{h}, V^h=V~h\widehat{V}_{h}=\widetilde{V}_{h} and π^=π^\widehat{\pi}=\widehat{\pi} from Algorithm 1: the result follows since, by the definition of π^\widehat{\pi} in Algorithm 1, Q¯h(sh,),πh(|sh)π^h(|sh)0\langle\overline{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\leq 0. The proof for Vhπ(s)Vhπ^(s)V_{h}^{\pi^{\star}}(s)-V_{h}^{\widehat{\pi}}(s) is identical. ∎

Next we prove the asymmetric bound for ξh\xi_{h}, which is the key to the proof.

Lemma C.10 (Private version of Lemma D.6 in [Yin and Wang, 2021b]).

Denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a), where V~h+1\widetilde{V}_{h+1} and Q¯h\overline{Q}_{h} are the quantities in Algorithm 1 and 𝒯h(V):=rh+PhV\mathcal{T}_{h}(V):=r_{h}+P_{h}\cdot V for any VSV\in\mathbb{R}^{S}. Then under the high probability events in Lemma C.3 and Lemma C.6, for any h,sh,ahh,s_{h},a_{h} such that n~sh,ah>3Eρ\widetilde{n}_{s_{h},a_{h}}>3E_{\rho}, we have

0\displaystyle 0\leq ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)Q¯h(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})
\displaystyle\leq 22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah,\displaystyle 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}},

where ι=log(HSA/δ)\iota=\log(HSA/\delta).

Proof of Lemma C.10.

The first inequality: We first prove ξh(sh,ah)0\xi_{h}(s_{h},a_{h})\geq 0 for all (sh,ah)(s_{h},a_{h}) such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}.

Indeed, if Q^hp(sh,ah)<0\widehat{Q}^{p}_{h}(s_{h},a_{h})<0, then Q¯h(sh,ah)=0\overline{Q}_{h}(s_{h},a_{h})=0. In this case, ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)0\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})\geq 0 (note V~h0\widetilde{V}_{h}\geq 0 by the definition). If Q^hp(sh,ah)0\widehat{Q}^{p}_{h}(s_{h},a_{h})\geq 0, then by definition Q¯h(sh,ah)=min{Q^hp(sh,ah),Hh+1}+Q^hp(sh,ah)\overline{Q}_{h}(s_{h},a_{h})=\min\{\widehat{Q}^{p}_{h}(s_{h},a_{h}),H-h+1\}^{+}\leq\widehat{Q}^{p}_{h}(s_{h},a_{h}) and this implies

ξh(sh,ah)(𝒯hV~h+1)(sh,ah)Q^hp(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})\geq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widehat{Q}^{p}_{h}(s_{h},a_{h})
=\displaystyle= (PhP~h)V~h+1(sh,ah)+Γh(sh,ah)\displaystyle(P_{h}-\widetilde{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
\displaystyle\geq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ16SHEριn~sh,ah+Γh(sh,ah)=0,\displaystyle-\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}-\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\Gamma_{h}(s_{h},a_{h})=0,

where the second inequality uses Lemma C.6, and the last equation uses Line 5 of Algorithm 1.

The second inequality: Then we prove ξh(sh,ah)22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah\xi_{h}(s_{h},a_{h})\leq 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}} for all (sh,ah)(s_{h},a_{h}) such that n~sh,ah3Eρ\widetilde{n}_{s_{h},a_{h}}\geq 3E_{\rho}.

First, by construction V~hHh+1\widetilde{V}_{h}\leq H-h+1 for all h[H]h\in[H], which implies

Q^hp=Q~hΓhQ~h=rh+(P~hV~h+1)1+(Hh)=Hh+1\widehat{Q}^{p}_{h}=\widetilde{Q}_{h}-\Gamma_{h}\leq\widetilde{Q}_{h}=r_{h}+(\widetilde{P}_{h}\cdot\widetilde{V}_{h+1})\leq 1+(H-h)=H-h+1

which is because rh1r_{h}\leq 1 and P~h\widetilde{P}_{h} is a probability distribution. Therefore, we have the equivalent definition

Q¯h:=min{Q^hp,Hh+1}+=max{Q^hp,0}Q^hp.\overline{Q}_{h}:=\min\{\widehat{Q}^{p}_{h},H-h+1\}^{+}=\max\{\widehat{Q}^{p}_{h},0\}\geq\widehat{Q}^{p}_{h}.

Then it holds that

ξh(sh,ah)=(𝒯hV~h+1)(sh,ah)Q¯h(sh,ah)(𝒯hV~h+1)(sh,ah)Q^hp(sh,ah)\displaystyle\xi_{h}(s_{h},a_{h})=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widehat{Q}^{p}_{h}(s_{h},a_{h})
=\displaystyle= (𝒯hV~h+1)(sh,ah)Q~h(sh,ah)+Γh(sh,ah)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s_{h},a_{h})-\widetilde{Q}_{h}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
=\displaystyle= (PhP~h)V~h+1(sh,ah)+Γh(sh,ah)\displaystyle(P_{h}-\widetilde{P}_{h})\cdot\widetilde{V}_{h+1}(s_{h},a_{h})+\Gamma_{h}(s_{h},a_{h})
\displaystyle\leq 2VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+16SHEριn~sh,ah+Γh(sh,ah)\displaystyle\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{16SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}+\Gamma_{h}(s_{h},a_{h})
=\displaystyle= 22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ah.\displaystyle 2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}.

The proof is complete by combining the two parts. ∎

C.3.3 Reduction to augmented absorbing MDP

Before we prove the theorem, we construct an augmented absorbing MDP to bridge V~\widetilde{V} and VV^{\star}. According to Line 5 in Algorithm 1, locations with n~sh,ahEρ\widetilde{n}_{s_{h},a_{h}}\leq E_{\rho} are heavily penalized, with a penalty of order O~(H)\widetilde{O}(H). Therefore, under the high probability event in Remark C.8, we can prove by induction that dhπ^(sh,ah)>0d_{h}^{\widehat{\pi}}(s_{h},a_{h})>0 only if dhμ(sh,ah)>0d_{h}^{\mu}(s_{h},a_{h})>0, where π^\widehat{\pi} is the output of Algorithm 1. The conclusion holds for h=1h=1. Assume it holds for some h1h\geq 1 that dhπ^(sh,ah)>0d_{h}^{\widehat{\pi}}(s_{h},a_{h})>0 only if dhμ(sh,ah)>0d_{h}^{\mu}(s_{h},a_{h})>0; then for any sh+1𝒮s_{h+1}\in\mathcal{S} such that dh+1π^(sh+1)>0d_{h+1}^{\widehat{\pi}}(s_{h+1})>0, it holds that dh+1μ(sh+1)>0d_{h+1}^{\mu}(s_{h+1})>0, which leads to the conclusion that dh+1π^(sh+1,ah+1)>0d_{h+1}^{\widehat{\pi}}(s_{h+1},a_{h+1})>0 only if dh+1μ(sh+1,ah+1)>0d_{h+1}^{\mu}(s_{h+1},a_{h+1})>0. To summarize, we have

dhπ0(sh,ah)>0only ifdhμ(sh,ah)>0,π0{π,π^}.d_{h}^{\pi_{0}}(s_{h},a_{h})>0\,\,\text{only if}\,\,d_{h}^{\mu}(s_{h},a_{h})>0,\,\,\pi_{0}\in\{\pi^{\star},\widehat{\pi}\}. (27)

Let us define MM^{\dagger} by adding one absorbing state shs_{h}^{\dagger} for all h{2,,H}h\in\{2,\ldots,H\}, therefore the augmented state space 𝒮=𝒮{sh}\mathcal{S}^{\dagger}=\mathcal{S}\cup\{s^{\dagger}_{h}\} and the transition and reward is defined as follows: (recall 𝒞h:={(sh,ah):dhμ(sh,ah)>0}\mathcal{C}_{h}:=\{(s_{h},a_{h}):d^{\mu}_{h}(s_{h},a_{h})>0\})

Ph(sh,ah)={Ph(sh,ah)sh,ah𝒞h,δsh+1sh=sh or sh,ah𝒞h,rh(sh,ah)={rh(sh,ah)sh,ah𝒞h0sh=sh or sh,ah𝒞hP^{\dagger}_{h}(\cdot\mid s_{h},a_{h})=\left\{\begin{array}[]{ll}P_{h}(\cdot\mid s_{h},a_{h})&s_{h},a_{h}\in\mathcal{C}_{h},\\ \delta_{s^{\dagger}_{h+1}}&s_{h}=s_{h}^{\dagger}\text{ or }s_{h},a_{h}\notin\mathcal{C}_{h},\end{array}\;\;r^{\dagger}_{h}(s_{h},a_{h})=\left\{\begin{array}[]{ll}r_{h}(s_{h},a_{h})&s_{h},a_{h}\in\mathcal{C}_{h}\\ 0&s_{h}=s^{\dagger}_{h}\text{ or }s_{h},a_{h}\notin\mathcal{C}_{h}\end{array}\right.\right.

and we further define for any π\pi,

Vhπ(s)=𝔼π[t=hHrt|sh=s],vπ=𝔼π[t=1Hrt]h[H],V^{\dagger\pi}_{h}(s)=\mathbb{E}^{\dagger}_{\pi}\left[\sum_{t=h}^{H}r_{t}^{\dagger}\middle|s_{h}=s\right],v^{\dagger\pi}=\mathbb{E}^{\dagger}_{\pi}\left[\sum_{t=1}^{H}r_{t}^{\dagger}\right]\;\forall h\in[H], (28)

where 𝔼\mathbb{E}^{\dagger} means taking expectation under the absorbing MDP MM^{\dagger}.
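For intuition, the construction of MM^{\dagger} above can be written out directly: transitions and rewards are kept on the covered pairs in 𝒞h\mathcal{C}_{h} and redirected to the absorbing state otherwise. The sketch below is only an illustration of this definition (indexing the absorbing state as the extra index S is our own convention).

import numpy as np

def augment_absorbing(P, r, covered):
    """Build P_dagger, r_dagger from (P, r) of shapes (H, S, A, S) / (H, S, A).

    covered : boolean array (H, S, A), True iff (s_h, a_h) is in C_h.
    The absorbing state s_h^dagger is appended as index S; it self-loops with
    zero reward, and uncovered pairs transition to it deterministically.
    """
    H, S, A, _ = P.shape
    P_dag = np.zeros((H, S + 1, A, S + 1))
    r_dag = np.zeros((H, S + 1, A))
    P_dag[:, :S, :, :S] = np.where(covered[..., None], P, 0.0)
    P_dag[:, :S, :, S] = np.where(covered, 0.0, 1.0)   # uncovered -> absorbing
    P_dag[:, S, :, S] = 1.0                             # absorbing self-loop
    r_dag[:, :S, :] = np.where(covered, r, 0.0)
    return P_dag, r_dag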

Note that because π\pi^{\star} and π^\widehat{\pi} are fully covered by μ\mu (27), it holds that

vπ=vπ,vπ^=vπ^.v^{\dagger\pi^{\star}}=v^{\pi^{\star}},\,\,v^{\dagger\widehat{\pi}}=v^{\widehat{\pi}}. (29)

Define (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}^{\dagger}_{h}V)(\cdot,\cdot):=r_{h}^{\dagger}(\cdot,\cdot)+(P_{h}^{\dagger}V)(\cdot,\cdot) for any VS+1V\in\mathbb{R}^{S+1}. Note π^\widehat{\pi}, Q¯h\overline{Q}_{h}, V~h\widetilde{V}_{h} are defined in Algorithm 1 (we extend the definition by letting V~h(sh)=0\widetilde{V}_{h}(s^{\dagger}_{h})=0 and Q¯h(sh,)=0\overline{Q}_{h}(s_{h}^{\dagger},\cdot)=0) and denote ξh(s,a)=(𝒯hV~h+1)(s,a)Q¯h(s,a)\xi^{\dagger}_{h}(s,a)=(\mathcal{T}^{\dagger}_{h}\widetilde{V}_{h+1})(s,a)-\overline{Q}_{h}(s,a). Using a proof identical to that of Lemma C.9, we have

V1π(s)V1π^(s)h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s],V_{1}^{\dagger\pi^{\star}}(s)-V_{1}^{\dagger\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{1}=s\right], (30)

where V1πV_{1}^{\dagger\pi} is defined in (28). Furthermore, the analogue of (30) holds for Vhπ(s)Vhπ^(s)V_{h}^{\dagger\pi^{\star}}(s)-V_{h}^{\dagger\widehat{\pi}}(s) for all h[H]h\in[H].

C.3.4 Finalize our result with non-private statistics

For those (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}, ξh(sh,ah)=rh(sh,ah)+PhV~h+1(sh,ah)Q¯h(sh,ah)=ξh(sh,ah)\xi^{\dagger}_{h}(s_{h},a_{h})=r_{h}(s_{h},a_{h})+P_{h}\widetilde{V}_{h+1}(s_{h},a_{h})-\overline{Q}_{h}(s_{h},a_{h})=\xi_{h}(s_{h},a_{h}). For those (sh,ah)𝒞h(s_{h},a_{h})\notin\mathcal{C}_{h} or sh=shs_{h}=s_{h}^{\dagger}, we have ξh(sh,ah)=0\xi^{\dagger}_{h}(s_{h},a_{h})=0.

Therefore, by (30) and Lemma C.10, under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, we have for all t[H]t\in[H], s𝒮s\in\mathcal{S} (𝒮\mathcal{S} does not include the absorbing state sts_{t}^{\dagger}),

Vtπ(s)Vtπ^(s)h=tH𝔼π[ξh(sh,ah)st=s]h=tH𝔼π^[ξh(sh,ah)st=s]\displaystyle V_{t}^{\dagger\pi^{\star}}(s)-V_{t}^{\dagger\widehat{\pi}}(s)\leq\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]-\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right] (31)
\displaystyle\leq h=tH𝔼π[ξh(sh,ah)st=s]0\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]-0
\displaystyle\leq h=tH𝔼π[22VarP~h(|sh,ah)(V~h+1())ιn~sh,ahEρ+32SHEριn~sh,ahst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}-E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{\widetilde{n}_{s_{h},a_{h}}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[22VarP~h(|sh,ah)(V~h+1())ιnsh,ah2Eρ+32SHEριnsh,ahEρst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[2\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}-2E_{\rho}}}+\frac{32SHE_{\rho}\cdot\iota}{n_{s_{h},a_{h}}-E_{\rho}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[4VarP~h(|sh,ah)(V~h+1())ιnsh,ah+128SHEρι3nsh,ahst=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{n_{s_{h},a_{h}}}}+\frac{128SHE_{\rho}\cdot\iota}{3n_{s_{h},a_{h}}}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)
\displaystyle\leq h=tH𝔼π[42VarP~h(|sh,ah)(V~h+1())ιndhμ(sh,ah)+256SHEρι3ndhμ(sh,ah)st=s]𝟙((sh,ah)𝒞h)\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{256SHE_{\rho}\cdot\iota}{3nd^{\mu}_{h}(s_{h},a_{h})}\mid s_{t}=s\right]\cdot\mathds{1}\left((s_{h},a_{h})\in\mathcal{C}_{h}\right)

The second and third inequalities are because of Lemma C.10, Remark C.8 and the fact that either ξ=0\xi^{\dagger}=0 or ξ=ξ\xi^{\dagger}=\xi when (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h}. The fourth inequality is due to Lemma C.3. The fifth inequality is because of Remark C.8. The last inequality is by Lemma C.7.

Below we present a crude bound on |Vtπ(s)V~t(s)|\left|V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)\right|, which will be used to bound the main term in the final result.

Lemma C.11 (Self-bounding, private version of Lemma D.7 in [Yin and Wang, 2021b]).

Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, it holds that for all t[H]t\in[H] and s𝒮s\in\mathcal{S},

|Vtπ(s)V~t(s)|42ιH2nd¯m+256SH2Eρι3nd¯m.\left|V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)\right|\leq\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}.

where d¯m\bar{d}_{m} is defined in Theorem 3.4.

Proof of Lemma C.11.

According to (31), since VarP~h(|sh,ah)(V~h+1())H2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\leq H^{2}, we have for all t[H]t\in[H],

|Vtπ(s)Vtπ^(s)|42ιH2nd¯m+256SH2Eρι3nd¯m\left|V_{t}^{\dagger\pi^{\star}}(s)-V_{t}^{\dagger\widehat{\pi}}(s)\right|\leq\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}} (32)

Next, apply Lemma E.7 by setting π=π^\pi=\widehat{\pi}, π=π\pi^{\prime}=\pi^{\star}, Q^=Q¯\widehat{Q}=\overline{Q}, V^=V~\widehat{V}=\widetilde{V} under MM^{\dagger}, then we have

Vtπ(s)V~t(s)=\displaystyle V_{t}^{\dagger\pi^{\star}}(s)-\widetilde{V}_{t}(s)= h=tH𝔼π[ξh(sh,ah)st=s]+h=tH𝔼π[Q¯h(sh,),πh(|sh)π^h(|sh)st=s]\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]+\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\langle\overline{Q}_{h}\left(s_{h},\cdot\right),\pi^{\star}_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\mid s_{t}=s\right] (33)
\displaystyle\leq h=tH𝔼π[ξh(sh,ah)st=s]\displaystyle\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]
\displaystyle\leq 42ιH2nd¯m+256SH2Eρι3nd¯m.\displaystyle\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}.

Also, apply Lemma E.7 by setting π=π=π^\pi=\pi^{\prime}=\widehat{\pi}, Q^=Q¯\widehat{Q}=\overline{Q}, V^=V~\widehat{V}=\widetilde{V} under MM^{\dagger}, then we have

V~t(s)Vtπ^(s)=h=tH𝔼π^[ξh(sh,ah)st=s]0.\displaystyle\widetilde{V}_{t}(s)-V_{t}^{\dagger\widehat{\pi}}(s)=-\sum_{h=t}^{H}\mathbb{E}^{\dagger}_{\widehat{\pi}}\left[\xi^{\dagger}_{h}(s_{h},a_{h})\mid s_{t}=s\right]\leq 0. (34)

The proof is complete by combining (32), (33) and (34). ∎

Now we are ready to bound VarP~h(|sh,ah)(V~h+1())\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))} by VarPh(|sh,ah)(Vh+1())\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}. Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, with probability 1δ1-\delta, it holds that for all (sh,ah)𝒞h(s_{h},a_{h})\in\mathcal{C}_{h},

VarP~h(|sh,ah)(V~h+1())VarP~h(|sh,ah)(Vh+1())+V~h+1Vh+1π,s𝒮\displaystyle\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))}\leq\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\left\lVert\widetilde{V}_{h+1}-{V}^{\dagger\pi^{\star}}_{h+1}\right\rVert_{\infty,s\in\mathcal{S}} (35)
\displaystyle\leq VarP~h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m\displaystyle\sqrt{\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}
\displaystyle\leq VarP^h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+4HSEρn~sh,ah\displaystyle\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+4H\sqrt{\frac{SE_{\rho}}{\widetilde{n}_{s_{h},a_{h}}}}
\displaystyle\leq VarP^h(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m\displaystyle\sqrt{\mathrm{Var}_{\widehat{P}_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}
\displaystyle\leq VarPh(|sh,ah)(Vh+1())+42ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m+3Hιnd¯m\displaystyle\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{4\sqrt{2\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}+3H\sqrt{\frac{\iota}{n\cdot\bar{d}_{m}}}
\displaystyle\leq VarPh(|sh,ah)(Vh+1())+9ιH2nd¯m+256SH2Eρι3nd¯m+8HSEρnd¯m.\displaystyle\sqrt{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))}+\frac{9\sqrt{\iota}H^{2}}{\sqrt{n\cdot\bar{d}_{m}}}+\frac{256SH^{2}E_{\rho}\cdot\iota}{3n\cdot\bar{d}_{m}}+8H\sqrt{\frac{SE_{\rho}}{n\cdot\bar{d}_{m}}}.

The second inequality is because of Lemma C.11. The third inequality is due to Lemma C.5. The fourth inequality comes from Lemma C.3 and Remark C.8. The fifth inequality holds with probability 1δ1-\delta because of Lemma E.5 and a union bound.

Finally, by plugging (35) into (31) and averaging over s1s_{1}, we have with probability 14δ1-4\delta,

vπvπ^=vπvπ^h=1H𝔼π[42VarP~h(|sh,ah)(V~h+1())ιndhμ(sh,ah)+256SHEρι3ndhμ(sh,ah)]\displaystyle v^{\pi^{\star}}-v^{\widehat{\pi}}=v^{\dagger\pi^{\star}}-v^{\dagger\widehat{\pi}}\leq\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[4\sqrt{\frac{2\mathrm{Var}_{\widetilde{P}_{h}(\cdot|s_{h},a_{h})}(\widetilde{V}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\frac{256SHE_{\rho}\cdot\iota}{3nd^{\mu}_{h}(s_{h},a_{h})}\right] (36)
\displaystyle\leq 42h=1H𝔼π[VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)]+O~(H3+SH2Eρnd¯m)\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\mathbb{E}^{\dagger}_{\pi^{\star}}\left[\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}\right]+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)
=\displaystyle= 42h=1H(sh,ah)𝒞hdhπ(sh,ah)VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)+O~(H3+SH2Eρnd¯m)\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\dagger\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right)
=\displaystyle= 42h=1H(sh,ah)𝒞hdhπ(sh,ah)VarPh(|sh,ah)(Vh+1())ιndhμ(sh,ah)+O~(H3+SH2Eρnd¯m),\displaystyle 4\sqrt{2}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{C}_{h}}d^{\pi^{\star}}_{h}(s_{h},a_{h})\sqrt{\frac{\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}(V^{\star}_{h+1}(\cdot))\cdot\iota}{nd^{\mu}_{h}(s_{h},a_{h})}}+\widetilde{O}\left(\frac{H^{3}+SH^{2}E_{\rho}}{n\cdot\bar{d}_{m}}\right),

where O~\widetilde{O} absorbs constants and polylog terms. The first equality is due to (29). The first inequality is because of (31). The second inequality comes from (35) and our assumption that nd¯mc1H2n\cdot\bar{d}_{m}\geq c_{1}H^{2}. The second equality uses the fact that dhπ(sh,ah)=dhπ(sh,ah)d^{\pi^{\star}}_{h}(s_{h},a_{h})=d^{\dagger\pi^{\star}}_{h}(s_{h},a_{h}) for all (sh,ah)(s_{h},a_{h}). The last equality is because for any (sh,ah,sh+1)(s_{h},a_{h},s_{h+1}) such that dhπ(sh,ah)>0d^{\pi^{\star}}_{h}(s_{h},a_{h})>0 and Ph(sh+1|sh,ah)>0P_{h}(s_{h+1}|s_{h},a_{h})>0, we have Vh+1(sh+1)=Vh+1(sh+1)V^{\dagger\star}_{h+1}(s_{h+1})=V^{\star}_{h+1}(s_{h+1}).

C.4 Put everything together

Combining Lemma C.1 and (36), the proof of Theorem 3.4 is complete.

Appendix D Proof of Theorem 4.1

D.1 Proof sketch

Since the proof of the privacy guarantee is short, we present it in full in Section D.2 below and only sketch the proof of the sub-optimality bound here.

First, by the extended value difference lemma (Lemmas E.7 and E.8), we can convert bounding the sub-optimality gap vvπ^v^{\star}-v^{\widehat{\pi}} into bounding h=1H𝔼π[Γh(sh,ah)]\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\Gamma_{h}(s_{h},a_{h})\right], given that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a) for all s,a,hs,a,h. To bound (𝒯hV~h+1𝒯~hV~h+1)(s,a)(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a), using our upper bound on the noise we add, we decompose it into lower-order terms (O~(1K)\widetilde{O}(\frac{1}{K})) and the following key quantity:

ϕ(s,a)Λ^h1[τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ)].\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left[\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right]. (37)

For the term above, we prove an upper bound on σV~h+12σ~h2\left\lVert\sigma^{2}_{\widetilde{V}_{h+1}}-\widetilde{\sigma}_{h}^{2}\right\rVert_{\infty}, so we can convert σ~h2\widetilde{\sigma}_{h}^{2} to σV~h+12\sigma^{2}_{\widetilde{V}_{h+1}}. Next, since Var[rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ)shτ,ahτ]σV~h+12\mathrm{Var}\left[r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\mid s^{\tau}_{h},a^{\tau}_{h}\right]\approx\sigma^{2}_{\widetilde{V}_{h+1}}, we can apply Bernstein’s inequality for self-normalized martingales (Lemma E.10), as in Yin et al. [2022], to derive a tighter bound.

Finally, we replace the private statistics with non-private ones. More specifically, we convert σV~h+12\sigma^{2}_{\widetilde{V}_{h+1}} to σh2\sigma_{h}^{\star 2} (and Λh1\Lambda_{h}^{-1} to Λh1\Lambda_{h}^{\star-1}) by combining the crude upper bound on V~V\left\lVert\widetilde{V}-V^{\star}\right\rVert_{\infty} with matrix concentration results.
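The central object in the sketch above is the variance-weighted ridge regression with covariance Λ^h\widehat{\Lambda}_{h} (see the notation list). The following sketch is only an illustration of a generic variance-weighted LSVI update under our own naming assumptions; it is not the exact private update of Algorithm 2, which additionally injects the noises ϕ1,ϕ2,ϕ3\phi_{1},\phi_{2},\phi_{3} and K1,K2K_{1},K_{2}.

import numpy as np

def weighted_regression(Phi, targets, sigma2, lam):
    """Generic variance-weighted ridge regression used in LSVI-style updates.

    Phi     : array (K, d), rows phi(s_h^tau, a_h^tau).
    targets : array (K,), entries r_h^tau + V_tilde_{h+1}(s_{h+1}^tau).
    sigma2  : array (K,), variance weights sigma_tilde_h^2(s_h^tau, a_h^tau).
    lam     : ridge parameter lambda.
    Returns (Lambda_hat, w_hat) with Lambda_hat = sum phi phi^T / sigma2 + lam * I.
    """
    d = Phi.shape[1]
    Lambda_hat = (Phi / sigma2[:, None]).T @ Phi + lam * np.eye(d)
    w_hat = np.linalg.solve(Lambda_hat, (Phi * (targets / sigma2)[:, None]).sum(axis=0))
    return Lambda_hat, w_hat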

D.2 Proof of the privacy guarantee

The privacy guarantee of DP-VAPVI (Algorithm 2) is summarized by Lemma D.1 below.

Lemma D.1 (Privacy analysis of DP-VAPVI (Algorithm 2)).

DP-VAPVI (Algorithm 2) satisfies ρ\rho-zCDP.

Proof of Lemma D.1.

For τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}, the 2\ell_{2} sensitivity is 2H22H^{2}. For τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau}) and τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h}), the 2\ell_{2} sensitivity is 2H2H. Therefore according to Lemma 2.7, the use of Gaussian Mechanism (the additional noises ϕ1,ϕ2,ϕ3\phi_{1},\phi_{2},\phi_{3}) ensures ρ0\rho_{0}-zCDP for each counter. For τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I and τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I, according to Appendix D in [Redberg and Wang, 2021], the per-instance 2\ell_{2} sensitivity is

Δx2=12supϕ:ϕ21ϕϕF=12supϕ:ϕ21i,jϕi2ϕj2=12.\left\lVert\Delta_{x}\right\rVert_{2}=\frac{1}{\sqrt{2}}\sup_{\phi:\|\phi\|_{2}\leq 1}\left\lVert\phi\phi^{\top}\right\rVert_{F}=\frac{1}{\sqrt{2}}\sup_{\phi:\|\phi\|_{2}\leq 1}\sqrt{\sum_{i,j}\phi_{i}^{2}\phi_{j}^{2}}=\frac{1}{\sqrt{2}}.

Therefore, the use of the Gaussian mechanism (the additional noises K1,K2K_{1},K_{2}) also ensures ρ0\rho_{0}-zCDP for each counter (for a more detailed explanation, we refer the readers to Appendix D of [Redberg and Wang, 2021]). Combining these results, according to Lemma E.17, the whole algorithm satisfies 5Hρ05H\rho_{0}-zCDP, i.e., ρ\rho-zCDP. ∎
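
To make the noise calibration concrete, here is a minimal Python sketch of how the noise scales follow from the sensitivities above under zCDP: adding N(0, σ²) noise with σ = Δ₂/√(2ρ₀) to a statistic with ℓ₂ sensitivity Δ₂ gives ρ₀-zCDP, and zCDP composes additively over the 5H counters. The numerical values of H and ρ are illustrative, and the actual construction of the matrix noises K₁, K₂ in Algorithm 2 may differ; this is a sketch, not the paper's implementation.

```python
import numpy as np

def gaussian_mech_sigma(l2_sensitivity, rho0):
    # Gaussian mechanism: adding N(0, sigma^2) noise with sigma = Delta_2 / sqrt(2 * rho0)
    # to a statistic with l2 sensitivity Delta_2 satisfies rho0-zCDP.
    return l2_sensitivity / np.sqrt(2.0 * rho0)

H, rho = 20, 1.0          # illustrative horizon and total privacy budget
rho0 = rho / (5 * H)      # per-counter budget: five noisy statistics per step h

sigma_phi1 = gaussian_mech_sigma(2 * H ** 2, rho0)       # counter with sensitivity 2H^2
sigma_phi23 = gaussian_mech_sigma(2 * H, rho0)           # counters with sensitivity 2H
sigma_mat = gaussian_mech_sigma(1.0 / np.sqrt(2), rho0)  # scale implied by the 1/sqrt(2) per-instance sensitivity

# zCDP composes additively, so the 5H counters together cost 5H * rho0 = rho.
assert np.isclose(5 * H * rho0, rho)
print(sigma_phi1, sigma_phi23, sigma_mat)
```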

D.3 Proof of the sub-optimality bound

D.3.1 Utility analysis and some preparation

We begin with the following high probability bound of the noises we add.

Lemma D.2 (Utility analysis).

Let L=2Hdρ0log(10Hdδ)=2H5Hdlog(10Hdδ)ρL=2H\sqrt{\frac{d}{\rho_{0}}\log(\frac{10Hd}{\delta})}=2H\sqrt{\frac{5Hd\log(\frac{10Hd}{\delta})}{\rho}} and
E=2dρ0(2+(log(5c1H/δ)c2d)23)=10Hdρ(2+(log(5c1H/δ)c2d)23)E=\sqrt{\frac{2d}{\rho_{0}}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right)=\sqrt{\frac{10Hd}{\rho}}\left(2+\left(\frac{\log(5c_{1}H/\delta)}{c_{2}d}\right)^{\frac{2}{3}}\right) for some universal constants c1,c2c_{1},c_{2}. Then with probability 1δ1-\delta, the following inequalities hold simultaneously:

For allh[H],ϕ12HL,ϕ22L,ϕ32L.\displaystyle\text{For all}\,h\in[H],\,\,\|\phi_{1}\|_{2}\leq HL,\|\phi_{2}\|_{2}\leq L,\,\,\|\phi_{3}\|_{2}\leq L. (38)
For allh[H],K1,K2are symmetric and positive definite andKi2E,i{1,2}.\displaystyle\text{For all}\,h\in[H],\,\,K_{1},K_{2}\,\text{are symmetric and positive definite and}\,\|K_{i}\|_{2}\leq E,\,\,i\in\{1,2\}.
Proof of Lemma D.2.

The first line of (38) follows directly from the concentration inequality for the Gaussian distribution and a union bound. The second line of (38) follows from Lemma 19 in [Redberg and Wang, 2021] and Weyl’s inequality. ∎
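
For intuition about the first line of (38), the following sketch empirically checks the Gaussian norm concentration behind it: if z ∼ N(0, σ²I_d), then ‖z‖₂ ≤ σ(√d + √(2 log(1/δ))) with probability at least 1 − δ. The dimension, noise scale and δ below are illustrative and do not correspond to the exact constants L and E of Lemma D.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, delta, trials = 50, 3.0, 0.05, 20000

# If z ~ N(0, sigma^2 I_d), then with probability at least 1 - delta,
# ||z||_2 <= sigma * (sqrt(d) + sqrt(2 * log(1 / delta))).
threshold = sigma * (np.sqrt(d) + np.sqrt(2.0 * np.log(1.0 / delta)))

z = rng.normal(scale=sigma, size=(trials, d))
failure_rate = np.mean(np.linalg.norm(z, axis=1) > threshold)
print("empirical failure rate:", failure_rate, "(should be well below delta =", delta, ")")
```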

Define the Bellman update error ζh(s,a):=(𝒯hV~h+1)(s,a)Q^h(s,a)\zeta_{h}(s,a):=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a) and recall
π^h(s)=argmaxπhQ^h(s,),πh(s)𝒜\widehat{\pi}_{h}(s)=\mathrm{argmax}_{\pi_{h}}\langle\widehat{Q}_{h}(s,\cdot),\pi_{h}(\cdot\mid s)\rangle_{\mathcal{A}}. Then, by Lemma E.8,

V1π(s)V1π^(s)h=1H𝔼π[ζh(sh,ah)s1=s]h=1H𝔼π^[ζh(sh,ah)s1=s].V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\zeta_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\zeta_{h}(s_{h},a_{h})\mid s_{1}=s\right]. (39)

Define 𝒯~hV~h+1(,)=ϕ(,)w~h\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)=\phi(\cdot,\cdot)^{\top}\widetilde{w}_{h}. Then similar to Lemma C.10, we have the following lemma showing that in order to bound the sub-optimality, it is sufficient to bound the pessimistic penalty.

Lemma D.3 (Lemma C.1 in [Yin et al., 2022]).

Suppose with probability 1δ1-\delta, it holds for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H] that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), then it implies s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H], 0ζh(s,a)2Γh(s,a)0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a). Furthermore, with probability 1δ1-\delta, it holds for any policy π\pi simultaneously,

V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s].V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}\left[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s\right].
Proof of Lemma D.3.

We first show that, given |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), it holds that 0ζh(s,a)2Γh(s,a)0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Step 1: The first step is to show 0ζh(s,a)0\leq\zeta_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Indeed, if Q¯h(s,a)0\bar{Q}_{h}(s,a)\leq 0, then by definition Q^h(s,a)=0\widehat{Q}_{h}(s,a)=0 and therefore ζh(s,a):=(𝒯hV~h+1)(s,a)Q^h(s,a)=(𝒯hV~h+1)(s,a)0\zeta_{h}(s,a):=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)=(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\geq 0. If Q¯h(s,a)>0\bar{Q}_{h}(s,a)>0, then Q^h(s,a)Q¯h(s,a)\widehat{Q}_{h}(s,a)\leq\bar{Q}_{h}(s,a) and

ζh(s,a):=\displaystyle\zeta_{h}(s,a):= (𝒯hV~h+1)(s,a)Q^h(s,a)(𝒯hV~h+1)(s,a)Q¯h(s,a)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)\geq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\bar{Q}_{h}(s,a)
=\displaystyle= (𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)+Γh(s,a)0.\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)+\Gamma_{h}(s,a)\geq 0.

Step 2: The second step is to show ζh(s,a)2Γh(s,a)\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a), s,a,h𝒮×𝒜×[H]\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H].

Under the assumption that |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|Γh(s,a)|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|\leq\Gamma_{h}(s,a), we have

Q¯h(s,a)=(𝒯~hV~h+1)(s,a)Γh(s,a)(𝒯hV~h+1)(s,a)Hh+1,\bar{Q}_{h}(s,a)=(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)-\Gamma_{h}(s,a)\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\leq H-h+1,

which implies that Q^h(s,a)=max(Q¯h(s,a),0)\widehat{Q}_{h}(s,a)=\max(\bar{Q}_{h}(s,a),0). Therefore, it holds that

ζh(s,a):=\displaystyle\zeta_{h}(s,a):= (𝒯hV~h+1)(s,a)Q^h(s,a)(𝒯hV~h+1)(s,a)Q¯h(s,a)\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a)\leq(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-\bar{Q}_{h}(s,a)
=\displaystyle= (𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)+Γh(s,a)2Γh(s,a).\displaystyle(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)+\Gamma_{h}(s,a)\leq 2\cdot\Gamma_{h}(s,a).

For the last statement, denote 𝔉:={0ζh(s,a)2Γh(s,a),s,a,h𝒮×𝒜×[H]}\mathfrak{F}:=\{0\leq\zeta_{h}(s,a)\leq 2\Gamma_{h}(s,a),\;\forall s,a,h\in\mathcal{S}\times\mathcal{A}\times[H]\}. Note that, conditional on 𝔉\mathfrak{F}, by (39), V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s] holds for any policy π\pi almost surely. Therefore,

[π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s].]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s].\right]
=\displaystyle= [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉][𝔉]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}\right]\cdot\mathbb{P}[\mathfrak{F}]
+\displaystyle+ [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉c][𝔉c]\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}^{c}\right]\cdot\mathbb{P}[\mathfrak{F}^{c}]
\displaystyle\geq [π,V1π(s)V1π^(s)h=1H2𝔼π[Γh(sh,ah)s1=s]|𝔉][𝔉]=1[𝔉]1δ,\displaystyle\mathbb{P}\left[\forall\pi,\;\;V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)\leq\sum_{h=1}^{H}2\cdot\mathbb{E}_{\pi}[\Gamma_{h}(s_{h},a_{h})\mid s_{1}=s]\middle|\mathfrak{F}\right]\cdot\mathbb{P}[\mathfrak{F}]=1\cdot\mathbb{P}[\mathfrak{F}]\geq 1-\delta,

which finishes the proof. ∎
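
As a quick numerical illustration of Steps 1 and 2 (it is not part of the formal argument), the sketch below draws random instances with |(𝒯_h Ṽ_{h+1} − 𝒯̃_h Ṽ_{h+1})(s,a)| ≤ Γ_h(s,a), forms the clipped pessimistic estimate Q̂_h = min{Q̄_h, H − h + 1}⁺ used in the proof, and checks that 0 ≤ ζ_h ≤ 2Γ_h on every instance; all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
H, h, n = 20, 5, 100000

T_V = rng.uniform(0, H - h + 1, size=n)       # (T_h V_tilde_{h+1})(s,a), valued in [0, H-h+1]
Gamma = rng.uniform(0, 2.0, size=n)           # pessimistic penalty Gamma_h(s,a)
err = rng.uniform(-1.0, 1.0, size=n) * Gamma  # (T_tilde_h - T_h) V_tilde_{h+1}, with |err| <= Gamma
T_tilde_V = T_V + err

Q_bar = T_tilde_V - Gamma                     # Q_bar_h = T_tilde_h V_tilde_{h+1} - Gamma_h
Q_hat = np.clip(Q_bar, 0.0, H - h + 1)        # Q_hat_h = min{Q_bar_h, H-h+1}^+
zeta = T_V - Q_hat                            # Bellman update error zeta_h

assert np.all(zeta >= -1e-9) and np.all(zeta <= 2 * Gamma + 1e-9)
print("0 <= zeta_h <= 2 * Gamma_h holds on all", n, "random instances")
```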

D.3.2 Bound the pessimistic penalty

By Lemma D.3, it remains to bound |(𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)||(\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)-(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)|. Suppose whw_{h} is the coefficient corresponding to 𝒯hV~h+1\mathcal{T}_{h}\widetilde{V}_{h+1} (such whw_{h} exists by Lemma E.14), i.e., 𝒯hV~h+1=ϕwh\mathcal{T}_{h}\widetilde{V}_{h+1}=\phi^{\top}w_{h}, and recall (𝒯~hV~h+1)(s,a)=ϕ(s,a)w~h(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)=\phi(s,a)^{\top}\widetilde{w}_{h}. Then:

(𝒯hV~h+1)(s,a)(𝒯~hV~h+1)(s,a)=ϕ(s,a)(whw~h)\displaystyle\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)(s,a)-\left(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}\right)(s,a)=\phi(s,a)^{\top}\left(w_{h}-\widetilde{w}_{h}\right) (40)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ~h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)\displaystyle\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))(i)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{i})}
ϕ(s,a)Λ^h1ϕ3(ii)+ϕ(s,a)(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)(iii),\displaystyle-\underbrace{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi_{3}}_{(\mathrm{ii})}+\underbrace{\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)}_{(\mathrm{iii})},

where Λ^h=Λ~hK2=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\widehat{\Lambda}_{h}=\widetilde{\Lambda}_{h}-K_{2}=\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I.

Term (ii) can be handled by the following Lemma D.4.

Lemma D.4.

Recall κ\kappa in Assumption 2.2. Under the high probability event in Lemma D.2, suppose Kmax{512H4log(2Hdδ)κ2,4λH2κ}K\geq\max\left\{\frac{512H^{4}\cdot\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}, then with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H], it holds that

|ϕ(s,a)Λ^h1ϕ3|4H2L/κK.\left|\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}\phi_{3}\right|\leq\frac{4H^{2}L/\kappa}{K}.
Proof of Lemma D.4.

Define Λ~hp=𝔼μ,h[σ~h2(s,a)ϕ(s,a)ϕ(s,a)]\widetilde{\Lambda}^{p}_{h}=\mathbb{E}_{\mu,h}[\widetilde{\sigma}_{h}^{-2}(s,a)\phi(s,a)\phi(s,a)^{\top}]. Then because of Assumption 2.2 and σ~hH\widetilde{\sigma}_{h}\leq H, it holds that λmin(Λ~hp)κH2\lambda_{\min}(\widetilde{\Lambda}^{p}_{h})\geq\frac{\kappa}{H^{2}}. Therefore, due to Lemma E.13, we have with probability 1δ1-\delta,

|ϕ(s,a)Λ^h1ϕ3|ϕ(s,a)Λ^h1ϕ3Λ^h1\displaystyle\left|\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}\phi_{3}\right|\leq\|\phi(s,a)\|_{\widehat{\Lambda}_{h}^{-1}}\cdot\|\phi_{3}\|_{\widehat{\Lambda}_{h}^{-1}}
\displaystyle\leq 4Kϕ(s,a)(Λ~hp)1ϕ3(Λ~hp)1\displaystyle\frac{4}{K}\|\phi(s,a)\|_{(\widetilde{\Lambda}_{h}^{p})^{-1}}\cdot\|\phi_{3}\|_{(\widetilde{\Lambda}_{h}^{p})^{-1}}
\displaystyle\leq 4LK(Λ~hp)1\displaystyle\frac{4L}{K}\|(\widetilde{\Lambda}_{h}^{p})^{-1}\|
\displaystyle\leq 4H2L/κK.\displaystyle\frac{4H^{2}L/\kappa}{K}.

The first inequality is because of Cauchy-Schwarz inequality. The second inequality holds with probability 1δ1-\delta due to Lemma E.13 and a union bound. The third inequality holds because aAaa2A2a2=a2A2\sqrt{a^{\top}\cdot A\cdot a}\leq\sqrt{\|a\|_{2}\|A\|_{2}\|a\|_{2}}=\|a\|_{2}\sqrt{\|A\|_{2}}. The last inequality arises from (Λ~hp)1=λmax((Λ~hp)1)=λmin1(Λ~hp)H2κ\|(\widetilde{\Lambda}_{h}^{p})^{-1}\|=\lambda_{\max}((\widetilde{\Lambda}^{p}_{h})^{-1})=\lambda_{\min}^{-1}(\widetilde{\Lambda}^{p}_{h})\leq\frac{H^{2}}{\kappa}. ∎

The difference between Λ~h1\widetilde{\Lambda}_{h}^{-1} and Λ^h1\widehat{\Lambda}_{h}^{-1} can be bounded by the following Lemma D.5.

Lemma D.5.

Under the high probability event in Lemma D.2, suppose K128H4log2dHδκ2K\geq\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}}, then with probability 1δ1-\delta, for all h[H]h\in[H], it holds that Λ^h1Λ~h14H4E/κ2K2\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|\leq\frac{4H^{4}E/\kappa^{2}}{K^{2}}.

Proof of Lemma D.5.

First of all, we have

Λ^h1Λ~h1=Λ^h1(Λ^hΛ~h)Λ~h1\displaystyle\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|=\|\widehat{\Lambda}_{h}^{-1}\cdot(\widehat{\Lambda}_{h}-\widetilde{\Lambda}_{h})\cdot\widetilde{\Lambda}_{h}^{-1}\| (41)
\displaystyle\leq Λ^h1Λ^hΛ~hΛ~h1\displaystyle\|\widehat{\Lambda}_{h}^{-1}\|\cdot\|\widehat{\Lambda}_{h}-\widetilde{\Lambda}_{h}\|\cdot\|\widetilde{\Lambda}_{h}^{-1}\|
\displaystyle\leq λmin1(Λ^h)λmin1(Λ~h)E.\displaystyle\lambda_{\min}^{-1}(\widehat{\Lambda}_{h})\cdot\lambda_{\min}^{-1}(\widetilde{\Lambda}_{h})\cdot E.

The first inequality is because ABAB\|A\cdot B\|\leq\|A\|\cdot\|B\|. The second inequality is due to Lemma D.2.

Let Λ^h=1KΛ^h\widehat{\Lambda}_{h}^{\prime}=\frac{1}{K}\widehat{\Lambda}_{h}, then because of Lemma E.12, with probability 1δ1-\delta, it holds that for all h[H]h\in[H],

Λ^h𝔼μ,h[ϕ(s,a)ϕ(s,a)/σ~h2(s,a)]λKId42K(log2dHδ)1/2,\left\lVert\widehat{\Lambda}_{h}^{\prime}-\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\widetilde{\sigma}_{h}^{2}(s,a)]-\frac{\lambda}{K}I_{d}\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2},

which implies that when K128H4log2dHδκ2K\geq\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}}, it holds that (according to Weyl’s Inequality)

λmin(Λ^h)λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σ~h2(s,a)])+λKκ2H2κ2H2.\lambda_{\min}(\widehat{\Lambda}_{h}^{\prime})\geq\lambda_{\min}(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\widetilde{\sigma}_{h}^{2}(s,a)])+\frac{\lambda}{K}-\frac{\kappa}{2H^{2}}\geq\frac{\kappa}{2H^{2}}.

Under this high probability event, we have λmin(Λ^h)Kκ2H2\lambda_{\min}(\widehat{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}} and therefore λmin(Λ~h)λmin(Λ^h)Kκ2H2.\lambda_{\min}(\widetilde{\Lambda}_{h})\geq\lambda_{\min}(\widehat{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}}. Plugging these two results into (41), we have

Λ^h1Λ~h14H4E/κ2K2.\|\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\|\leq\frac{4H^{4}E/\kappa^{2}}{K^{2}}. ∎
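
Display (41) is the resolvent identity Λ̂_h⁻¹ − Λ̃_h⁻¹ = Λ̂_h⁻¹(Λ̃_h − Λ̂_h)Λ̃_h⁻¹ combined with submultiplicativity of the operator norm. As a sanity check (on random positive definite matrices rather than the ones produced by Algorithm 2), the following sketch verifies the resulting bound numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Random positive definite Lambda_hat and a symmetric perturbation K2,
# mirroring Lambda_tilde = Lambda_hat + K2 in the proof of Lemma D.5.
A = rng.normal(size=(d, d))
Lambda_hat = A @ A.T + d * np.eye(d)
P = rng.normal(size=(d, d))
K2 = 0.1 * (P + P.T) + 0.5 * np.eye(d)
Lambda_tilde = Lambda_hat + K2

inv_hat = np.linalg.inv(Lambda_hat)
inv_tilde = np.linalg.inv(Lambda_tilde)

lhs = np.linalg.norm(inv_hat - inv_tilde, 2)
# ||A^{-1} - B^{-1}|| = ||A^{-1}(B - A)B^{-1}|| <= ||A^{-1}|| * ||B - A|| * ||B^{-1}||
#                     = ||K2|| / (lambda_min(A) * lambda_min(B))  for symmetric PD A, B.
rhs = np.linalg.norm(K2, 2) / (np.linalg.eigvalsh(Lambda_hat)[0] * np.linalg.eigvalsh(Lambda_tilde)[0])
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```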

Then we can bound term (iii) by the following Lemma D.6.

Lemma D.6.

Suppose Kmax{128H4log2dHδκ2,2Ldκ}K\geq\max\{\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ)+ϕ3)|42H4Ed/κ3/2K.\left|\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})+\phi_{3}\right)\right|\leq\frac{4\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}.
Proof of Lemma D.6.

First of all, the left hand side is bounded by

(Λ^h1Λ~h1)(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))2+4H4EL/κ2K2\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right\rVert_{2}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}

due to Lemma D.5. Then the left hand side can be further bounded by

Hτ=1K(Λ^h1Λ~h1)ϕ(shτ,ahτ)/σ~h(shτ,ahτ)2+4H4EL/κ2K2\displaystyle H\sum_{\tau=1}^{K}\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right\rVert_{2}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq Hτ=1KTr((Λ^h1Λ~h1)ϕ(shτ,ahτ)ϕ(shτ,ahτ)σ~h2(shτ,ahτ)(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sum_{\tau=1}^{K}\sqrt{Tr\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\frac{\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}}{\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKTr((Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sqrt{K\cdot Tr\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKdλmax((Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1))+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\lambda_{\max}\left((\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right)}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
=\displaystyle= HKd(Λ^h1Λ~h1)Λ^h(Λ^h1Λ~h1)2+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\left\lVert(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\cdot\widehat{\Lambda}_{h}\cdot(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\right\rVert_{2}}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq HKdΛ~h12Λ~hΛ^h2Λ^h1Λ~h12+4H4EL/κ2K2\displaystyle H\sqrt{Kd\cdot\left\lVert\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}\cdot\left\lVert\widetilde{\Lambda}_{h}-\widehat{\Lambda}_{h}\right\rVert_{2}\cdot\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq 22H4Ed/κ3/2K+4H4EL/κ2K2\displaystyle\frac{2\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}+\frac{4H^{4}EL/\kappa^{2}}{K^{2}}
\displaystyle\leq 42H4Ed/κ3/2K.\displaystyle\frac{4\sqrt{2}H^{4}E\sqrt{d}/\kappa^{3/2}}{K}.

The first inequality is because a2=aa=Tr(aa)\left\lVert a\right\rVert_{2}=\sqrt{a^{\top}a}=\sqrt{Tr(aa^{\top})}. The second inequality is due to the Cauchy-Schwarz inequality. The third inequality is because for a positive definite matrix AA, it holds that Tr(A)=i=1dλi(A)dλmax(A)Tr(A)=\sum_{i=1}^{d}\lambda_{i}(A)\leq d\lambda_{\max}(A). The equality holds because for a symmetric, positive definite matrix AA, A2=λmax(A)\left\lVert A\right\rVert_{2}=\lambda_{\max}(A). The fourth inequality is due to ABAB\left\lVert A\cdot B\right\rVert\leq\left\lVert A\right\rVert\cdot\left\lVert B\right\rVert. The fifth inequality is because of Lemma D.2, Lemma D.5 and the statement in the proof of Lemma D.5 that λmin(Λ~h)Kκ2H2\lambda_{\min}(\widetilde{\Lambda}_{h})\geq\frac{K\kappa}{2H^{2}}. The last inequality uses the assumption that K2LdκK\geq\frac{\sqrt{2}L}{\sqrt{d\kappa}}. ∎

It remains to bound term (i). We have

ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ))/σ~h2(shτ,ahτ))(i)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{i})} (42)
=\displaystyle= ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))(iv)\displaystyle\underbrace{\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{iv})}
ϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ))(v).\displaystyle-\underbrace{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)}_{(\mathrm{v})}.

We are able to bound term (iv) by the following Lemma D.7.

Lemma D.7.

Recall κ\kappa in Assumption 2.2. Under the high probability event in Lemma D.2, suppose Kmax{512H4log(2Hdδ)κ2,4λH2κ}K\geq\max\left\{\frac{512H^{4}\cdot\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda H^{2}}{\kappa}\right\}, then with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))|8λH3d/κK.\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|\leq\frac{8\lambda H^{3}\sqrt{d}/\kappa}{K}.
Proof of Lemma D.7.

Recall that 𝒯hV~h+1=ϕwh\mathcal{T}_{h}\widetilde{V}_{h+1}=\phi^{\top}w_{h}. Applying Lemma E.13, we obtain that with probability 1δ1-\delta, for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

|ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(𝒯hV~h+1)(shτ,ahτ)/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|
=\displaystyle= |ϕ(s,a)whϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)wh/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\phi(s_{h}^{\tau},a_{h}^{\tau})^{\top}w_{h}/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right|
=\displaystyle= |ϕ(s,a)whϕ(s,a)Λ^h1(Λ^hλI)wh|\displaystyle\left|\phi(s,a)^{\top}w_{h}-\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\widehat{\Lambda}_{h}-\lambda I\right)w_{h}\right|
=\displaystyle= |λϕ(s,a)Λ^h1wh|\displaystyle\left|\lambda\cdot\phi(s,a)^{\top}\widehat{\Lambda}^{-1}_{h}w_{h}\right|
\displaystyle\leq λϕ(s,a)Λ^h1whΛ^h1\displaystyle\lambda\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{-1}_{h}}\cdot\left\lVert w_{h}\right\rVert_{\widehat{\Lambda}^{-1}_{h}}
\displaystyle\leq 4λKϕ(s,a)(Λ~hp)1wh(Λ~hp)1\displaystyle\frac{4\lambda}{K}\left\lVert\phi(s,a)\right\rVert_{(\widetilde{\Lambda}_{h}^{p})^{-1}}\cdot\left\lVert w_{h}\right\rVert_{(\widetilde{\Lambda}_{h}^{p})^{-1}}
\displaystyle\leq 4λK2Hd(Λ~hp)1\displaystyle\frac{4\lambda}{K}\cdot 2H\sqrt{d}\cdot\left\lVert(\widetilde{\Lambda}_{h}^{p})^{-1}\right\rVert
\displaystyle\leq 8λH3d/κK,\displaystyle\frac{8\lambda H^{3}\sqrt{d}/\kappa}{K},

where Λ~hp:=𝔼μ,h[σ~h(s,a)2ϕ(s,a)ϕ(s,a)]\widetilde{\Lambda}_{h}^{p}:=\mathbb{E}_{\mu,h}\left[\widetilde{\sigma}_{h}(s,a)^{-2}\phi(s,a)\phi(s,a)^{\top}\right]. The first inequality applies Cauchy-Schwarz inequality. The second inequality holds with probability 1δ1-\delta due to Lemma E.13 and a union bound. The third inequality uses aAaa2A2a2=a2A2\sqrt{a^{\top}\cdot A\cdot a}\leq\sqrt{\left\lVert a\right\rVert_{2}\left\lVert A\right\rVert_{2}\left\lVert a\right\rVert_{2}}=\left\lVert a\right\rVert_{2}\sqrt{\left\lVert A\right\rVert_{2}} and wh2Hd\left\lVert w_{h}\right\rVert\leq 2H\sqrt{d}. Finally, as λmin(Λ~hp)κmaxh,s,aσ~h(s,a)2κH2\lambda_{\min}(\widetilde{\Lambda}_{h}^{p})\geq\frac{\kappa}{\max_{h,s,a}\widetilde{\sigma}_{h}(s,a)^{2}}\geq\frac{\kappa}{H^{2}} implies (Λ~hp)1H2κ\left\lVert(\widetilde{\Lambda}_{h}^{p})^{-1}\right\rVert\leq\frac{H^{2}}{\kappa}, the last inequality holds. ∎
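
The computation behind term (iv) is the bias of weighted ridge regression with noiseless targets: if every regression target equals ϕ⊤w_h exactly, then the estimator Λ̂_h⁻¹ ∑_τ ϕ_τ(ϕ_τ⊤w_h)/σ̃_h² equals w_h − λΛ̂_h⁻¹w_h, so the pointwise error is exactly λϕ(s,a)⊤Λ̂_h⁻¹w_h. The following sketch verifies this identity numerically with illustrative dimensions and variance weights.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam = 10, 500, 1.0

phi = rng.normal(size=(K, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # features with ||phi||_2 <= 1
sigma2 = rng.uniform(1.0, 4.0, size=K)              # plays the role of sigma_tilde_h^2 in [1, H^2]
w = rng.normal(size=d)                              # coefficient of T_h V_tilde_{h+1}

Lambda_hat = (phi.T / sigma2) @ phi + lam * np.eye(d)
targets = phi @ w                                   # noiseless targets (T_h V_tilde_{h+1})(s_h, a_h)
w_hat = np.linalg.solve(Lambda_hat, (phi.T / sigma2) @ targets)

# With noiseless targets, the weighted ridge estimate satisfies
# w_hat = Lambda_hat^{-1} (Lambda_hat - lam I) w = w - lam * Lambda_hat^{-1} w,
# so phi(s,a)^T (w - w_hat) = lam * phi(s,a)^T Lambda_hat^{-1} w, the quantity bounded in Lemma D.7.
phi_query = rng.normal(size=d)
phi_query /= np.linalg.norm(phi_query)
lhs = phi_query @ (w - w_hat)
rhs = lam * phi_query @ np.linalg.solve(Lambda_hat, w)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```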

For term (v), denote: xτ=ϕ(shτ,ahτ)σ~h(shτ,ahτ),ητ=(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h(shτ,ahτ)x_{\tau}=\frac{\phi(s_{h}^{\tau},a_{h}^{\tau})}{\widetilde{\sigma}_{h}(s_{h}^{\tau},a_{h}^{\tau})},\quad\eta_{\tau}=\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h}), then by Cauchy-Schwarz inequality, it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)Λ^h1(τ=1Kϕ(shτ,ahτ)(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h2(shτ,ahτ))|\displaystyle\left|\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\cdot\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}^{2}_{h}(s^{\tau}_{h},a^{\tau}_{h})\right)\right| (43)
\displaystyle\leq ϕ(s,a)Λ^h1ϕ(s,a)τ=1KxτητΛ^h1.\displaystyle\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\cdot\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}.

We bound ϕ(s,a)Λ^h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)} by ϕ(s,a)Λ~h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)} using the following Lemma D.8.

Lemma D.8.

Suppose Kmax{128H4log2dHδκ2,2Ldκ}K\geq\max\{\frac{128H^{4}\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{\sqrt{d\kappa}}\}, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

ϕ(s,a)Λ^h1ϕ(s,a)ϕ(s,a)Λ~h1ϕ(s,a)+2H2E/κK.\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\leq\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.
Proof of Lemma D.8.
ϕ(s,a)Λ^h1ϕ(s,a)=ϕ(s,a)Λ~h1ϕ(s,a)+ϕ(s,a)(Λ^h1Λ~h1)ϕ(s,a)\displaystyle\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}=\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)+\phi(s,a)^{\top}(\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1})\phi(s,a)} (44)
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+Λ^h1Λ~h12\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)+\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+Λ^h1Λ~h12\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\sqrt{\left\lVert\widehat{\Lambda}_{h}^{-1}-\widetilde{\Lambda}_{h}^{-1}\right\rVert_{2}}
\displaystyle\leq ϕ(s,a)Λ~h1ϕ(s,a)+2H2E/κK.\displaystyle\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.

The first inequality uses |aAa|a22A|a^{\top}Aa|\leq\left\lVert a\right\rVert_{2}^{2}\cdot\left\lVert A\right\rVert. The second inequality is because for a,b0a,b\geq 0, a+ba+b\sqrt{a}+\sqrt{b}\geq\sqrt{a+b}. The last inequality uses Lemma D.5. ∎

Remark D.9.

Similarly, under the same assumption in Lemma D.8, we also have for all s,a,h𝒮×𝒜×[H]s,a,h\in\mathcal{S}\times\mathcal{A}\times[H],

ϕ(s,a)Λ~h1ϕ(s,a)ϕ(s,a)Λ^h1ϕ(s,a)+2H2E/κK.\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\leq\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}+\frac{2H^{2}\sqrt{E}/\kappa}{K}.

D.3.3 An intermediate result: bounding the variance

Before we handle τ=1KxτητΛ^h1\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}, we first bound suphσ~h2σV~h+12\sup_{h}\left\lVert\widetilde{\sigma}_{h}^{2}-\sigma_{\widetilde{V}_{h+1}}^{2}\right\rVert_{\infty} by the following Lemma D.10.

Lemma D.10 (Private version of Lemma C.7 in [Yin et al., 2022]).

Recall the definition of σ~h(,)2=max{1,Var~hV~h+1(,)}\widetilde{\sigma}_{h}(\cdot,\cdot)^{2}=\max\{1,\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(\cdot,\cdot)\} in Algorithm 2 where [Var~hV~h+1](,)=ϕ(,),β~h[0,(Hh+1)2][ϕ(,),θ~h[0,Hh+1]]2\big{[}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}\big{]}(\cdot,\cdot)=\big{\langle}\phi(\cdot,\cdot),\widetilde{{\beta}}_{h}\big{\rangle}_{\left[0,(H-h+1)^{2}\right]}-\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{{\theta}}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2} (β~h\widetilde{\beta}_{h} and θ~h\widetilde{\theta}_{h} are defined in Algorithm 2) and σV~h+1(,)2:=max{1,VarPhV~h+1(,)}{\sigma}_{\widetilde{V}_{h+1}}(\cdot,\cdot)^{2}:=\max\{1,{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}(\cdot,\cdot)\}. Suppose Kmax{512log(2Hdδ)κ2,4λκ,128log2dHδκ2,2LHdκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa},\frac{128\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{H\sqrt{d\kappa}}\right\} and Kmax{4L2H2d3κ,32E2d2κ2,16λ2d2κ}K\geq\max\{\frac{4L^{2}}{H^{2}d^{3}\kappa},\frac{32E^{2}}{d^{2}\kappa^{2}},\frac{16\lambda^{2}}{d^{2}\kappa}\}, under the high probability event in Lemma D.2, it holds that with probability 16δ1-6\delta,

suph||σ~h2σV~h+12||36H4d3κKlog((λ+K)2KdH2λδ).\sup_{h}\lvert\lvert\widetilde{\sigma}^{2}_{h}-\sigma^{2}_{\widetilde{V}_{h+1}}\rvert\rvert_{\infty}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.
Proof of Lemma D.10.

Step 1: The first step is to show for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A}, with probability 13δ1-3\delta,

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)|12H4d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq 12\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Proof of Step 1. We can bound the left hand side by the following decomposition:

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)||ϕ(s,a),β~hh(V~h+1)2(s,a)|\displaystyle\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|
=\displaystyle= |ϕ(s,a)Σ~h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)h(V~h+1)2(s,a)|\displaystyle\left|\phi(s,a)^{\top}\widetilde{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|
\displaystyle\leq |ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2)h(V~h+1)2(s,a)|(1)+|ϕ(s,a)Σ¯h1ϕ1|(2)\displaystyle\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|}_{(1)}+\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\phi_{1}\right|}_{(2)}
+|ϕ(s,a)(Σ~h1Σ¯h1)(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)|(3),\displaystyle+\underbrace{\left|\phi(s,a)^{\top}(\widetilde{\Sigma}_{h}^{-1}-\bar{\Sigma}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)\right|}_{(3)},

where Σ¯h=Σ~hK1=τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI\bar{\Sigma}_{h}=\widetilde{\Sigma}_{h}-K_{1}=\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I.

Similar to the proof in Lemma D.5, when Kmax{128log2dHδκ2,2LHdκ}K\geq\max\{\frac{128\log\frac{2dH}{\delta}}{\kappa^{2}},\frac{\sqrt{2}L}{H\sqrt{d\kappa}}\}, it holds that with probability 1δ1-\delta, for all h[H]h\in[H],

λmin(Σ¯h)Kκ2,λmin(Σ~h)Kκ2,Σ~h1Σ¯h124E/κ2K2.\lambda_{\min}(\bar{\Sigma}_{h})\geq\frac{K\kappa}{2},\,\,\lambda_{\min}(\widetilde{\Sigma}_{h})\geq\frac{K\kappa}{2},\,\,\left\lVert\widetilde{\Sigma}^{-1}_{h}-\bar{\Sigma}^{-1}_{h}\right\rVert_{2}\leq\frac{4E/\kappa^{2}}{K^{2}}.

(The only difference from Lemma D.5 is that here λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)])κ\lambda_{\min}(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}])\geq\kappa.)

Under this high probability event, for term (2), it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)Σ¯h1ϕ1|ϕ(s,a)Σ¯h1ϕ1λmin1(Σ¯h)HL2HL/κK.\displaystyle\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\phi_{1}\right|\leq\left\lVert\phi(s,a)\right\rVert\cdot\left\lVert\bar{\Sigma}_{h}^{-1}\right\rVert\cdot\left\lVert\phi_{1}\right\rVert\leq\lambda_{\min}^{-1}(\bar{\Sigma}_{h})\cdot HL\leq\frac{2HL/\kappa}{K}. (45)

For term (3)(3), similar to Lemma D.6, we have for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a)(Σ~h1Σ¯h1)(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2+ϕ1)|42H2Ed/κ3/2K.\left|\phi(s,a)^{\top}(\widetilde{\Sigma}_{h}^{-1}-\bar{\Sigma}_{h}^{-1})\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}+\phi_{1}\right)\right|\leq\frac{4\sqrt{2}H^{2}E\sqrt{d}/\kappa^{3/2}}{K}. (46)

(The only difference from Lemma D.6 is that here V~h+1(s)2H2\widetilde{V}_{h+1}(s)^{2}\leq H^{2}, ϕ12HL\left\lVert\phi_{1}\right\rVert_{2}\leq HL, Σ~h122Kκ\left\lVert\widetilde{\Sigma}_{h}^{-1}\right\rVert_{2}\leq\frac{2}{K\kappa} and Σ~h1Σ¯h124E/κ2K2\left\lVert\widetilde{\Sigma}^{-1}_{h}-\bar{\Sigma}^{-1}_{h}\right\rVert_{2}\leq\frac{4E/\kappa^{2}}{K^{2}}.)

We further decompose term (1) as below.

(1)=|ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2)h(V~h+1)2(s,a)|\displaystyle(1)=\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\left(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}\right)-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right| (47)
=\displaystyle= |ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)V~h+1(s¯h+1τ)2ϕ(s,a)Σ¯h1(τ=1Kϕ(s¯hτ,a¯hτ)ϕ(s¯hτ,a¯hτ)+λI)𝒮(V~h+1)2(s)𝑑νh(s)|\displaystyle\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}(\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})^{\top}+\lambda I)\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|
\displaystyle\leq |ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))|(4)+λ|ϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)|(5).\displaystyle\underbrace{\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right|}_{(4)}+\underbrace{\lambda\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|}_{(5)}.

For term (5), because Kmax{512log(2Hdδ)κ2,4λκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa}\right\}, by Lemma E.13 and a union bound, with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

λ|ϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)|λϕ(s,a)Σ¯h1𝒮(V~h+1)2(s)𝑑νh(s)Σ¯h1\displaystyle\lambda\left|\phi(s,a)^{\top}\bar{\Sigma}_{h}^{-1}\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right|\leq\lambda\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\left\lVert\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right\rVert_{\bar{\Sigma}^{-1}_{h}} (48)
\displaystyle\leq λ2Kϕ(s,a)(Σhp)12K𝒮(V~h+1)2(s)𝑑νh(s)(Σhp)14λ(Σhp)1H2dK4λH2dκK,\displaystyle\lambda\frac{2}{\sqrt{K}}\left\lVert\phi(s,a)\right\rVert_{({\Sigma}^{p}_{h})^{-1}}\frac{2}{\sqrt{K}}\left\lVert\int_{\mathcal{S}}(\widetilde{V}_{h+1})^{2}(s^{\prime})d\nu_{h}(s^{\prime})\right\rVert_{({\Sigma}^{p}_{h})^{-1}}\leq 4\lambda\left\lVert({\Sigma}^{p}_{h})^{-1}\right\rVert\frac{H^{2}\sqrt{d}}{K}\leq 4\lambda\frac{H^{2}\sqrt{d}}{\kappa K},

where Σhp=𝔼μ,h[ϕ(s,a)ϕ(s,a)]\Sigma_{h}^{p}=\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}] and λmin(Σhp)κ\lambda_{\min}(\Sigma_{h}^{p})\geq\kappa.

For term (4), it can be bounded by the following inequality (because of Cauchy-Schwarz inequality).

(4)ϕ(s,a)Σ¯h1τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h1.(4)\leq\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\cdot\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}. (49)

Bounding using covering. Note that for any fixed Vh+1{V}_{h+1}, we can define xτ=ϕ(s¯hτ,a¯hτ)x_{\tau}=\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau}) (with ϕ21\left\lVert\phi\right\rVert_{2}\leq 1) and ητ=Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ)\eta_{\tau}={V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h}), which is H2H^{2}-subgaussian. Then by Lemma E.9 (where t=Kt=K and L=1L=1), it holds that with probability 1δ1-\delta,

τ=1Kϕ(s¯hτ,a¯hτ)(Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ))Σ¯h18H4d2log(λ+Kλδ).\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left({V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}\right)}.

Let 𝒩h(ϵ)\mathcal{N}_{h}(\epsilon) be the minimal ϵ\epsilon-cover (with respect to the supremum norm) of
𝒱h:={Vh:Vh()=maxa𝒜{min{ϕ(,a)θC1dϕ(,a)Λ~h1ϕ(,a)C2,Hh+1}+}}.\mathcal{V}_{h}:=\left\{V_{h}:V_{h}(\cdot)=\max_{a\in\mathcal{A}}\{\min\{\phi(\cdot,a)^{\top}\theta-C_{1}\sqrt{d\cdot\phi(\cdot,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(\cdot,a)}-C_{2},H-h+1\}^{+}\}\right\}. That is, for any V𝒱hV\in\mathcal{V}_{h}, there exists a value function V𝒩h(ϵ)V^{\prime}\in\mathcal{N}_{h}(\epsilon) such that sups𝒮|V(s)V(s)|<ϵ\sup_{s\in\mathcal{S}}|V(s)-V^{\prime}(s)|<\epsilon. Now by a union bound, we obtain with probability 1δ1-\delta,

supVh+1𝒩h+1(ϵ)τ=1Kϕ(s¯hτ,a¯hτ)(Vh+1(s¯h+1τ)2h(Vh+1)2(s¯hτ,a¯hτ))Σ¯h18H4d2log(λ+Kλδ|𝒩h+1(ϵ)|)\sup_{V_{h+1}\in\mathcal{N}_{h+1}(\epsilon)}\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left({V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}({V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}|\mathcal{N}_{h+1}(\epsilon)|\right)}

which implies

τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h1\displaystyle\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}
\displaystyle\leq 8H4d2log(λ+Kλδ|𝒩h+1(ϵ)|)+4H2ϵ2K2/λ\displaystyle\sqrt{8H^{4}\cdot\frac{d}{2}\log\left(\frac{\lambda+K}{\lambda\delta}|\mathcal{N}_{h+1}(\epsilon)|\right)}+4H^{2}\sqrt{\epsilon^{2}K^{2}/\lambda}

Choosing ϵ=dλ/K\epsilon=d\sqrt{\lambda}/K and applying Lemma B.3 of [Jin et al., 2021] (whose conclusion holds here even though we have an extra constant C2C_{2}) to the covering number 𝒩h+1(ϵ)\mathcal{N}_{h+1}(\epsilon) w.r.t. 𝒱h+1\mathcal{V}_{h+1}, we can further bound the above by

\displaystyle\leq 8H4d32log(λ+Kλδ2dHK)+4H2d26H4d3log(λ+Kλδ2dHK)\displaystyle\sqrt{8H^{4}\cdot\frac{d^{3}}{2}\log\left(\frac{\lambda+K}{\lambda\delta}2dHK\right)}+4H^{2}\sqrt{d^{2}}\leq 6\sqrt{H^{4}\cdot d^{3}\log\left(\frac{\lambda+K}{\lambda\delta}2dHK\right)}

Applying a union bound over h[H]h\in[H], we have that with probability 1δ1-\delta, for all h[H]h\in[H],

τ=1Kϕ(s¯hτ,a¯hτ)(V~h+1(s¯h+1τ)2h(V~h+1)2(s¯hτ,a¯hτ))Σ¯h16H4d3log((λ+K)2KdH2λδ)\left\lVert\sum_{\tau=1}^{K}\phi(\bar{s}_{h}^{\tau},\bar{a}_{h}^{\tau})\cdot\left(\widetilde{V}_{h+1}(\bar{s}_{h+1}^{\tau})^{2}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(\bar{s}^{\tau}_{h},\bar{a}^{\tau}_{h})\right)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq 6\sqrt{{H^{4}d^{3}}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)} (50)

and similar to term (2)(2), it holds that for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Σ¯h1Σ¯h12κK.\left\lVert\phi(s,a)\right\rVert_{\bar{\Sigma}^{-1}_{h}}\leq\sqrt{\left\lVert\bar{\Sigma}^{-1}_{h}\right\rVert}\leq\sqrt{\frac{2}{\kappa K}}. (51)

Combining (45), (46), (47), (48), (49), (50), (51) and the assumption that Kmax{4L2H2d3κ,32E2d2κ2,16λ2d2κ}K\geq\max\{\frac{4L^{2}}{H^{2}d^{3}\kappa},\frac{32E^{2}}{d^{2}\kappa^{2}},\frac{16\lambda^{2}}{d^{2}\kappa}\}, we obtain with probability 13δ1-3\delta for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|ϕ(s,a),β~h[0,(Hh+1)2]h(V~h+1)2(s,a)|12H4d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\beta}}_{h}\rangle_{\left[0,(H-h+1)^{2}\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})^{2}(s,a)\right|\leq 12\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Step 2: The second step is to show for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A}, with probability 13δ1-3\delta,

|ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|12H2d3κKlog((λ+K)2KdH2λδ).\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\leq 12\sqrt{\frac{H^{2}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}. (52)

The proof of Step 2 is nearly identical to that of Step 1, except that V~h+12\widetilde{V}^{2}_{h+1} is replaced by V~h+1\widetilde{V}_{h+1}.

Step 3: The last step is to prove suph||σ~h2σV~h+12||36H4d3κKlog((λ+K)2KdH2λδ)\sup_{h}\lvert\lvert\widetilde{\sigma}^{2}_{h}-\sigma^{2}_{\widetilde{V}_{h+1}}\rvert\rvert_{\infty}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)} with high probability.

Proof of Step 3. By (52),

|[ϕ(,),θ~h[0,Hh+1]]2[h(V~h+1)(s,a)]2|\displaystyle\left|\big{[}\big{\langle}\phi(\cdot,\cdot),\widetilde{{\theta}}_{h}\big{\rangle}_{[0,H-h+1]}\big{]}^{2}-\big{[}{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\big{]}^{2}\right|
=\displaystyle= |ϕ(s,a),θ~h[0,Hh+1]+h(V~h+1)(s,a)||ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|\displaystyle\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}+{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\cdot\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|
\displaystyle\leq 2H|ϕ(s,a),θ~h[0,Hh+1]h(V~h+1)(s,a)|24H4d3κKlog((λ+K)2KdH2λδ).\displaystyle 2H\cdot\left|\langle\phi(s,a),\widetilde{{\theta}}_{h}\rangle_{\left[0,H-h+1\right]}-{\mathbb{P}}_{h}(\widetilde{V}_{h+1})(s,a)\right|\leq 24\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Combining this with Step 1, we have with probability 16δ1-6\delta, h,s,a[H]×𝒮×𝒜\forall h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|Var~hV~h+1(s,a)VarPhV~h+1(s,a)|36H4d3κKlog((λ+K)2KdH2λδ).\bigg{|}\widetilde{\mathrm{Var}}_{h}\widetilde{V}_{h+1}(s,a)-{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}(s,a)\bigg{|}\leq 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Finally, by the non-expansiveness of the operator max{1,}\max\{1,\cdot\}, the proof is complete. ∎
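
For readers who prefer a computational view of the quantities in Lemma D.10, here is a minimal sketch of the variance estimate σ̃_h²(s,a) = max{1, ⟨ϕ, β̃_h⟩_[0,(H−h+1)²] − (⟨ϕ, θ̃_h⟩_[0,H−h+1])²}, with β̃_h and θ̃_h obtained from noisy ridge regressions of Ṽ_{h+1}² and Ṽ_{h+1} as in the decomposition of Step 1. The noise scale `sigma_dp` is an illustrative stand-in and the Gram-matrix noise K₁ is omitted, so this is a simplified sketch rather than Algorithm 2 verbatim.

```python
import numpy as np

def private_variance_estimate(phi_bar, V_next, phi_query, H, h, lam, sigma_dp, rng):
    """Sketch of the conditional-variance estimate recalled in Lemma D.10:
    Var_tilde = <phi, beta_tilde>_[0,(H-h+1)^2] - (<phi, theta_tilde>_[0,H-h+1])^2,
    sigma_tilde^2 = max{1, Var_tilde}."""
    d = phi_bar.shape[1]
    Sigma = phi_bar.T @ phi_bar + lam * np.eye(d)                       # Sigma_bar_h (noise on it omitted)
    b2 = phi_bar.T @ V_next ** 2 + rng.normal(scale=sigma_dp, size=d)   # noisy counter for V^2
    b1 = phi_bar.T @ V_next + rng.normal(scale=sigma_dp, size=d)        # noisy counter for V
    beta_tilde = np.linalg.solve(Sigma, b2)
    theta_tilde = np.linalg.solve(Sigma, b1)
    second_moment = np.clip(phi_query @ beta_tilde, 0.0, (H - h + 1) ** 2)
    first_moment = np.clip(phi_query @ theta_tilde, 0.0, H - h + 1)
    return max(1.0, second_moment - first_moment ** 2)                  # sigma_tilde_h^2(s,a)

rng = np.random.default_rng(3)
K, d, H, h = 2000, 10, 20, 5
phi_bar = rng.normal(size=(K, d))
phi_bar /= np.linalg.norm(phi_bar, axis=1, keepdims=True)               # ||phi||_2 <= 1
V_next = rng.uniform(0, H - h, size=K)                                  # plays the role of V_tilde_{h+1}(s')
phi_q = rng.normal(size=d); phi_q /= np.linalg.norm(phi_q)
print(private_variance_estimate(phi_bar, V_next, phi_q, H, h, lam=1.0, sigma_dp=1.0, rng=rng))
```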

D.3.4 Validity of our pessimism

Recall the definition Λ^h=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI\widehat{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)\phi\left(s_{h}^{\tau},a_{h}^{\tau}\right)^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda\cdot I and
Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Then we have the following lemma to bound the term ϕ(s,a)Λ^h1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)} by ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}\Lambda_{h}^{-1}\phi(s,a)}.

Lemma D.11 (Private version of Lemma C.3 in [Yin et al., 2022]).

Denote the quantities C1=max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2}C_{1}=\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\} and C2=O~(H12d3/κ5)C_{2}=\widetilde{O}(H^{12}d^{3}/\kappa^{5}). Suppose the number of episodes KK satisfies K>max{C1,C2}K>\max\{C_{1},C_{2}\} and the condition in Lemma D.10. Then, under the high probability events in Lemma D.2 and Lemma D.10, it holds that with probability 12δ1-2\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λ^h1ϕ(s,a)2ϕ(s,a)Λh1ϕ(s,a).\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\leq 2\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}.
Proof of Lemma D.11.

By definition ϕ(s,a)Λ^h1ϕ(s,a)=ϕ(s,a)Λ^h1\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}=\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{-1}_{h}}. Then denote

Λ^h=1KΛ^h,Λh=1KΛh,\widehat{\Lambda}^{\prime}_{h}=\frac{1}{K}\widehat{\Lambda}_{h},\quad{\Lambda}^{\prime}_{h}=\frac{1}{K}{\Lambda}_{h},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σV~h+12(shτ,ahτ)+λI{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{\widetilde{V}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Under the assumption of KK, by the conclusion in Lemma D.10, we have

Λ^hΛhsups,aϕ(s,a)ϕ(s,a)σ~h2(s,a)ϕ(s,a)ϕ(s,a)σV~h+12(s,a)\displaystyle\left\lVert\widehat{\Lambda}^{\prime}_{h}-{\Lambda}^{\prime}_{h}\right\rVert\leq\sup_{s,a}\left\lVert\frac{\phi(s,a)\phi(s,a)^{\top}}{\widetilde{\sigma}^{2}_{h}(s,a)}-\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right\rVert (53)
\displaystyle\leq sups,a|σ~h2(s,a)σV~h+12(s,a)σ~h2(s,a)σV~h+12(s,a)|ϕ(s,a)2\displaystyle\sup_{s,a}\left|\frac{\widetilde{\sigma}^{2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{\widetilde{\sigma}^{2}_{h}(s,a)\cdot{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right|\cdot\left\lVert\phi(s,a)\right\rVert^{2}
\displaystyle\leq sups,a|σ~h2(s,a)σV~h+12(s,a)1|1\displaystyle\sup_{s,a}\left|\frac{\widetilde{\sigma}^{2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{1}\right|\cdot 1
\displaystyle\leq 36H4d3κKlog((λ+K)2KdH2λδ).\displaystyle 36\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}.

Next, by Lemma E.12 (applied with ϕ\phi replaced by ϕ/σV~h+1\phi/\sigma_{\widetilde{V}_{h+1}}, so that C=1C=1) and a union bound, it holds with probability 1δ1-\delta that, for all h[H]h\in[H],

Λh(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)]+λKId)42K(log2dHδ)1/2.\left\lVert\Lambda^{\prime}_{h}-\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]+\frac{\lambda}{K}I_{d}\right)\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}.

Therefore, by Weyl’s inequality and the assumption that KK satisfies
K>max{2λ,128log(2dH/δ),128H4log(2dH/δ)κ2}K>\max\{2\lambda,128\log(2dH/\delta),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\}, the above inequality leads to

Λh=\displaystyle\left\lVert\Lambda^{\prime}_{h}\right\rVert= λmax(Λh)λmax(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])+λK+42K(log2dHδ)1/2\displaystyle\lambda_{\max}(\Lambda^{\prime}_{h})\leq\lambda_{\max}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
=\displaystyle= 𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)]2+λK+42K(log2dHδ)1/2\displaystyle\left\lVert\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right\rVert_{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq ϕ(s,a)2+λK+42K(log2dHδ)1/21+λK+42K(log2dHδ)1/22,\displaystyle\left\lVert\phi(s,a)\right\rVert^{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 1+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 2,
λmin(Λh)\displaystyle\lambda_{\min}(\Lambda^{\prime}_{h})\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])+λK42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σV~h+12(s,a)])42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{\widetilde{V}_{h+1}}(s,a)]\right)-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq κH242K(log2dHδ)1/2κ2H2.\displaystyle\frac{\kappa}{H^{2}}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\geq\frac{\kappa}{2H^{2}}.

Hence, with probability 1δ1-\delta, Λh2\left\lVert\Lambda^{\prime}_{h}\right\rVert\leq 2 and Λh1=λmin1(Λh)2H2κ\left\lVert\Lambda^{\prime-1}_{h}\right\rVert=\lambda^{-1}_{\min}(\Lambda^{\prime}_{h})\leq\frac{2H^{2}}{\kappa}. Similarly, one can show Λ^h12H2κ\left\lVert\widehat{\Lambda}^{\prime-1}_{h}\right\rVert\leq\frac{2H^{2}}{\kappa} with probability 1δ1-\delta using an identical argument.

Now, applying Lemma E.11 and a union bound to Λ^h\widehat{\Lambda}^{\prime}_{h} and Λh{\Lambda}^{\prime}_{h}, we obtain that with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λ^h1\displaystyle\left\lVert\phi(s,a)\right\rVert_{\widehat{\Lambda}^{\prime-1}_{h}}\leq [1+Λh1ΛhΛ^h1Λ^hΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\left\lVert\Lambda^{\prime-1}_{h}\right\rVert\cdot\left\lVert\Lambda^{\prime}_{h}\right\rVert\cdot\left\lVert\widehat{\Lambda}^{\prime-1}_{h}\right\rVert\cdot\left\lVert\widehat{\Lambda}^{\prime}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq [1+2H2κ22H2κΛ^hΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{2H^{2}}{\kappa}\cdot 2\cdot\frac{2H^{2}}{\kappa}\cdot\left\lVert\widehat{\Lambda}^{\prime}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq [1+288H4κ2(H4d3κKlog((λ+K)2KdH2λδ))]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{288H^{4}}{\kappa^{2}}\left(\sqrt{\frac{H^{4}d^{3}}{\kappa K}\log\left(\frac{(\lambda+K)2KdH^{2}}{\lambda\delta}\right)}\right)}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}
\displaystyle\leq 2ϕ(s,a)Λh1\displaystyle 2\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\prime-1}_{h}}

where the third inequality uses (53) and the last inequality uses K>O~(H12d3/κ5)K>\widetilde{O}(H^{12}d^{3}/\kappa^{5}). The conclusion then follows directly by multiplying both sides of the above inequality by 1/K1/\sqrt{K}. ∎

In order to bound τ=1KxτητΛ^h1\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}, we apply the following Lemma D.12.

Lemma D.12 (Lemma C.4 in [Yin et al., 2022]).

Recall xτ=ϕ(shτ,ahτ)σ~h(shτ,ahτ)x_{\tau}=\frac{\phi(s^{\tau}_{h},a^{\tau}_{h})}{\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h})} and
ητ=(rhτ+V~h+1(sh+1τ)(𝒯hV~h+1)(shτ,ahτ))/σ~h(shτ,ahτ)\eta_{\tau}=\left(r_{h}^{\tau}+\widetilde{V}_{h+1}\left(s_{h+1}^{\tau}\right)-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right)/\widetilde{\sigma}_{h}(s^{\tau}_{h},a^{\tau}_{h}). Denote

ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|.\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|.

Suppose KO~(H12d3/κ5)K\geq\widetilde{O}(H^{12}d^{3}/\kappa^{5}) (note that this assumption is stronger than the one in [Yin et al., 2022], so the conclusion of Lemma C.4 still holds). Then with probability 1δ1-\delta,

τ=1KxτητΛ^h1O~(max{d,ξ}),\left\lVert\sum_{\tau=1}^{K}x_{\tau}\eta_{\tau}\right\rVert_{\widehat{\Lambda}_{h}^{-1}}\leq\widetilde{O}\left(\max\big{\{}\sqrt{d},\xi\big{\}}\right),

where O~\widetilde{O} absorbs constants and Polylog terms.

Now we are ready to prove the following key lemma, which gives a high probability bound for |(𝒯hV~h+1𝒯~hV~h+1)(s,a)|\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|.

Lemma D.13.

Assume K>max{1,2,3,4}K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\} and suppose d>ξ\sqrt{d}>\xi, where ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right|. Then for any 0<λ<κ0<\lambda<\kappa, with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λ~h1ϕ(s,a))+DK,\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K},

where Λ~h=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σ~h2(shτ,ahτ)+λI+K2\widetilde{\Lambda}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\widetilde{\sigma}_{h}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I+K_{2},

D=O~(H2Lκ+H4Edκ3/2+H3d+H2Edκ)=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}+\frac{H^{2}\sqrt{Ed}}{\kappa}\right)=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right)

and O~\widetilde{O} absorbs constants and Polylog terms.

Proof of Lemma D.13.

The proof is by combining (40), (42), Lemma D.4, Lemma D.6, Lemma D.7, Lemma D.8, Lemma D.12 and a union bound. ∎

Remark D.14.

Under the same assumption of Lemma D.13, because of Remark D.9 and Lemma D.11, we have with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λ~h1ϕ(s,a))+DK\displaystyle\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widetilde{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K} (54)
\displaystyle\leq O~(dϕ(s,a)Λ^h1ϕ(s,a))+2DK\displaystyle\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}\widehat{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{2D}{K}
\displaystyle\leq O~(2dϕ(s,a)Λh1ϕ(s,a))+2DK.\displaystyle\widetilde{O}\left(2\sqrt{d}\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{2D}{K}.

Because D=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and O~\widetilde{O} absorbs constants, we write the bound as below for simplicity:

|(𝒯hV~h+1𝒯~hV~h+1)(s,a)|O~(dϕ(s,a)Λh1ϕ(s,a))+DK.\left|(\mathcal{T}_{h}\widetilde{V}_{h+1}-\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1})(s,a)\right|\leq\widetilde{O}\left(\sqrt{d}\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right)+\frac{D}{K}. (55)

D.3.5 Finalize the proof of the first part

We are ready to prove the first part of Theorem 4.1.

Theorem D.15 (First part of Theorem 4.1).

Let KK be the number of episodes. Suppose d>ξ\sqrt{d}>\xi, where ξ:=supV[0,H],sPh(s,a),h[H]|rh+V(s)(𝒯hV)(s,a)σV(s,a)|\xi:=\sup_{V\in[0,H],\;s^{\prime}\sim P_{h}(s,a),\;h\in[H]}\left|\frac{r_{h}+{V}\left(s^{\prime}\right)-\left(\mathcal{T}_{h}{V}\right)\left(s,a\right)}{{\sigma}_{{V}}(s,a)}\right| and K>max{1,2,3,4}K>\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}. Then for any 0<λ<κ0<\lambda<\kappa, with probability 1δ1-\delta, for all policies π\pi simultaneously, the output π^\widehat{\pi} of Algorithm 2 satisfies

vπvπ^O~(dh=1H𝔼π[(ϕ(,)Λh1ϕ(,))1/2])+DHK,v^{\pi}-v^{\widehat{\pi}}\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\right]\right)+\frac{DH}{K},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)σV~h+12(shτ,ahτ)+λId\Lambda_{h}=\sum_{\tau=1}^{K}\frac{\phi(s_{h}^{\tau},a_{h}^{\tau})\cdot\phi(s_{h}^{\tau},a_{h}^{\tau})^{\top}}{\sigma^{2}_{\widetilde{V}_{h+1}}(s_{h}^{\tau},a_{h}^{\tau})}+\lambda I_{d}, D=O~(H2Lκ+H4Edκ3/2+H3d)D=\widetilde{O}\left(\frac{H^{2}L}{\kappa}+\frac{H^{4}E\sqrt{d}}{\kappa^{3/2}}+H^{3}\sqrt{d}\right) and O~\widetilde{O} absorbs constants and Polylog terms.

Proof of Theorem D.15.

Combining Lemma D.3 and Remark D.14, we have that with probability 1δ1-\delta, for all policies π\pi simultaneously,

V1π(s)V1π^(s)O~(dh=1H𝔼π[(ϕ(,)Λh1ϕ(,))1/2|s1=s])+DHK,V^{\pi}_{1}(s)-V^{\widehat{\pi}}_{1}(s)\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{h}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{1}=s\right]\right)+\frac{DH}{K}, (56)

and the proof is complete by taking the expectation over the initial distribution d1d_{1} on both sides. ∎

D.3.6 Finalize the proof of the second part

To prove the second part of Theorem 4.1, we begin with a crude bound on suphVhV~h\sup_{h}\left\lVert V^{\star}_{h}-\widetilde{V}_{h}\right\rVert_{\infty}.

Lemma D.16 (Private version of Lemma C.8 in [Yin et al., 2022]).

Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, under the high probability event in Lemma D.13, with probability at least 1δ1-\delta,

suphVhV~hO~(H2dκK).\sup_{h}\left\lVert V^{\star}_{h}-\widetilde{V}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).
Proof of Lemma D.16.

Step 1: The first step is to show with probability at least 1δ1-\delta, suphVhVhπ^O~(H2dκK)\sup_{h}\left\lVert V^{\star}_{h}-{V}^{\widehat{\pi}}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).

Indeed, combining Lemma D.3 and Lemma D.13, similarly to the proof of Theorem D.15, we directly have that with probability 1δ1-\delta, for all policies π\pi simultaneously and for all s𝒮s\in\mathcal{S}, h[H]h\in[H],

Vhπ(s)Vhπ^(s)O~(dt=hH𝔼π[(ϕ(,)Λt1ϕ(,))1/2|sh=s])+DHK,V^{\pi}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\leq\widetilde{O}\left(\sqrt{d}\cdot\sum_{t=h}^{H}\mathbb{E}_{\pi}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{t}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{h}=s\right]\right)+\frac{DH}{K}, (57)

Next, since Kmax{512log(2Hdδ)κ2,4λκ}K\geq\max\left\{\frac{512\log\left(\frac{2Hd}{\delta}\right)}{\kappa^{2}},\frac{4\lambda}{\kappa}\right\}, by Lemma E.13 and a union bound over h[H]h\in[H], with probability 1δ1-\delta,

sups,aϕ(s,a)Λh12Ksups,aϕ(s,a)(Λhp)12Kλmin1(Λhp)2HκK,h[H],\displaystyle\sup_{s,a}\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{-1}_{h}}\leq\frac{2}{\sqrt{K}}\sup_{s,a}\left\lVert\phi(s,a)\right\rVert_{({\Lambda}^{p}_{h})^{-1}}\leq\frac{2}{\sqrt{K}}\sqrt{\lambda_{\min}^{-1}(\Lambda_{h}^{p})}\leq\frac{2H}{\sqrt{\kappa K}},\;\;\forall h\in[H],

where Λhp=𝔼μ,h[σV~h+12(s,a)ϕ(s,a)ϕ(s,a)]\Lambda_{h}^{p}=\mathbb{E}_{\mu,h}[\sigma^{-2}_{\widetilde{V}_{h+1}}(s,a)\phi(s,a)\phi(s,a)^{\top}] and λmin(Λhp)κH2\lambda_{\min}(\Lambda_{h}^{p})\geq\frac{\kappa}{H^{2}}.

Lastly, taking π=π\pi=\pi^{\star} in (57), we obtain

0Vhπ(s)Vhπ^(s)\displaystyle 0\leq V^{\pi^{\star}}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\leq O~(dt=hH𝔼π[(ϕ(,)Λt1ϕ(,))1/2|sh=s])+DHK\displaystyle\widetilde{O}\left(\sqrt{d}\cdot\sum_{t=h}^{H}\mathbb{E}_{\pi^{\star}}\left[\left(\phi(\cdot,\cdot)^{\top}\Lambda_{t}^{-1}\phi(\cdot,\cdot)\right)^{1/2}\middle|s_{h}=s\right]\right)+\frac{DH}{K} (58)
\displaystyle\leq O~(H2dκK)+O~(H3L/κK+H5Ed/κ3/2K+H4dK).\displaystyle\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right)+\widetilde{O}\left(\frac{H^{3}L/\kappa}{K}+\frac{H^{5}E\sqrt{d}/\kappa^{3/2}}{K}+\frac{H^{4}\sqrt{d}}{K}\right).

Under the condition K>max{H2L2dκ,H6E2κ2,H4κ}K>\max\{\frac{H^{2}L^{2}}{d\kappa},\frac{H^{6}E^{2}}{\kappa^{2}},H^{4}\kappa\}, each term in the second O~()\widetilde{O}(\cdot) is bounded by O~(H2dκK)\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right), which finishes the proof of Step 1.
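For instance, the first of the three terms is controlled as follows (a quick check; the remaining two terms are handled in the same way using the corresponding lower bounds on K):

\frac{H^{3}L/\kappa}{K}=\frac{H^{3}L}{\kappa\sqrt{K}}\cdot\frac{1}{\sqrt{K}}\leq\frac{H^{3}L}{\kappa\sqrt{K}}\cdot\sqrt{\frac{d\kappa}{H^{2}L^{2}}}=\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}.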

Step 2: The second step is to show with probability 1δ1-\delta, suphV~hVhπ^O~(H2dκK)\sup_{h}\left\lVert\widetilde{V}_{h}-{V}^{\widehat{\pi}}_{h}\right\rVert_{\infty}\leq\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right).

Indeed, applying Lemma E.7 with π=π=π^\pi=\pi^{\prime}=\widehat{\pi}, we have that with probability 1δ1-\delta, for all s,hs,h,

|V~h(s)Vhπ^(s)|=|t=hH𝔼π^[Q^h(sh,ah)(𝒯hV~h+1)(sh,ah)|sh=s]|\displaystyle\left|\widetilde{V}_{h}(s)-V^{\widehat{\pi}}_{h}(s)\right|=\left|\sum_{t=h}^{H}\mathbb{E}_{\widehat{\pi}}\left[\widehat{Q}_{h}(s_{h},a_{h})-\left(\mathcal{T}_{h}\widetilde{V}_{h+1}\right)(s_{h},a_{h})\middle|s_{h}=s\right]\right|
\displaystyle\leq t=hH(𝒯~hV~h+1𝒯hV~h+1)(s,a)+HΓh(s,a)\displaystyle\sum_{t=h}^{H}\left\lVert(\widetilde{\mathcal{T}}_{h}\widetilde{V}_{h+1}-\mathcal{T}_{h}\widetilde{V}_{h+1})(s,a)\right\rVert_{\infty}+H\cdot\left\lVert\Gamma_{h}(s,a)\right\rVert_{\infty}
\displaystyle\leq O~(Hdϕ(s,a)Λh1ϕ(s,a))+O~(DHK)\displaystyle\widetilde{O}\left(H\sqrt{d}\left\lVert\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\right\rVert_{\infty}\right)+\widetilde{O}\left(\frac{DH}{K}\right)
\displaystyle\leq O~(H2dκK),\displaystyle\widetilde{O}\left(\frac{H^{2}\sqrt{d}}{\sqrt{\kappa K}}\right),

where the second inequality uses Lemma D.13 and Remark D.14, and the last inequality holds for the same reason as in Step 1.

Step 3: The proof of the lemma is complete by combining Step 1 and Step 2 via the triangle inequality and a union bound. ∎

Next, we give a high-probability bound on suph||σV~h+12σh2||\sup_{h}\lvert\lvert\sigma^{2}_{\widetilde{V}_{h+1}}-\sigma^{\star 2}_{h}\rvert\rvert_{\infty}.

Lemma D.17 (Private version of Lemma C.10 in [Yin et al., 2022]).

Recall σV~h+12=max{1,VarPhV~h+1}\sigma^{2}_{\widetilde{V}_{h+1}}=\max\left\{1,{\mathrm{Var}}_{P_{h}}\widetilde{V}_{h+1}\right\} and σh2=max{1,VarPhVh+1}\sigma^{\star 2}_{h}=\max\left\{1,{\mathrm{Var}}_{P_{h}}{V}^{\star}_{h+1}\right\}. Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, then with probability 1δ1-\delta,

suph||σV~h+12σh2||O~(H3dκK).\sup_{h}\lvert\lvert\sigma^{2}_{\widetilde{V}_{h+1}}-\sigma^{\star 2}_{h}\rvert\rvert_{\infty}\leq\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).
Proof of Lemma D.17.

By definition and the non-expansiveness of max{1,}\max\{1,\cdot\}, we have

σV~h+12σh2VarV~h+1VarVh+1\displaystyle\left\lVert{\sigma}^{2}_{\widetilde{V}_{h+1}}-{\sigma}^{\star 2}_{h}\right\rVert_{\infty}\leq\left\lVert\mathrm{Var}\widetilde{V}_{h+1}-\mathrm{Var}{V}^{\star}_{h+1}\right\rVert_{\infty}
\displaystyle\leq h(V~h+12Vh+12)+(hV~h+1)2(hVh+1)2\displaystyle\left\lVert\mathbb{P}_{h}\left(\widetilde{V}^{2}_{h+1}-{V}^{\star 2}_{h+1}\right)\right\rVert_{\infty}+\left\lVert(\mathbb{P}_{h}\widetilde{V}_{h+1})^{2}-(\mathbb{P}_{h}{V}^{\star}_{h+1})^{2}\right\rVert_{\infty}
\displaystyle\leq V~h+12Vh+12+(hV~h+1+hVh+1)(hV~h+1hVh+1)\displaystyle\left\lVert\widetilde{V}^{2}_{h+1}-{V}^{\star 2}_{h+1}\right\rVert_{\infty}+\left\lVert(\mathbb{P}_{h}\widetilde{V}_{h+1}+\mathbb{P}_{h}{V}^{\star}_{h+1})(\mathbb{P}_{h}\widetilde{V}_{h+1}-\mathbb{P}_{h}{V}^{\star}_{h+1})\right\rVert_{\infty}
\displaystyle\leq 2HV~h+1Vh+1+2HhV~h+1hVh+1\displaystyle 2H\left\lVert\widetilde{V}_{h+1}-{V}^{\star}_{h+1}\right\rVert_{\infty}+2H\left\lVert\mathbb{P}_{h}\widetilde{V}_{h+1}-\mathbb{P}_{h}{V}^{\star}_{h+1}\right\rVert_{\infty}
\displaystyle\leq O~(H3dκK).\displaystyle\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).

The second inequality follows from the definition of variance, and the last inequality comes from Lemma D.16. ∎

We relate ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)} to ϕ(s,a)Λh1ϕ(s,a)\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{\star-1}\phi(s,a)} via the following Lemma D.18.

Lemma D.18 (Private version of Lemma C.11 in [Yin et al., 2022]).

Suppose Kmax{1,2,3,4}K\geq\max\{\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3},\mathcal{M}_{4}\}, then with probability 1δ1-\delta,

ϕ(s,a)Λh1ϕ(s,a)2ϕ(s,a)Λh1ϕ(s,a),h,s,a[H]×𝒮×𝒜,\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}\leq 2\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{\star-1}\phi(s,a)},\quad\forall h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},
Proof of Lemma D.18.

By definition ϕ(s,a)Λh1ϕ(s,a)=ϕ(s,a)Λh1\sqrt{\phi(s,a)^{\top}{\Lambda}_{h}^{-1}\phi(s,a)}=\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{-1}_{h}}. Then denote

Λh=1KΛh,Λh=1KΛh,{\Lambda}^{\prime}_{h}=\frac{1}{K}{\Lambda}_{h},\quad{\Lambda}^{\star^{\prime}}_{h}=\frac{1}{K}{\Lambda}^{\star}_{h},

where Λh=τ=1Kϕ(shτ,ahτ)ϕ(shτ,ahτ)/σVh+12(shτ,ahτ)+λI{\Lambda}^{\star}_{h}=\sum_{\tau=1}^{K}\phi(s^{\tau}_{h},a^{\tau}_{h})\phi(s^{\tau}_{h},a^{\tau}_{h})^{\top}/\sigma_{{V}^{\star}_{h+1}}^{2}(s^{\tau}_{h},a^{\tau}_{h})+\lambda I. Under the condition of KK, by Lemma D.17, with probability 1δ1-\delta, for all h[H]h\in[H],

ΛhΛhsups,aϕ(s,a)ϕ(s,a)σh2(s,a)ϕ(s,a)ϕ(s,a)σV~h+12(s,a)\displaystyle\left\lVert{\Lambda}^{\star^{\prime}}_{h}-{\Lambda}^{\prime}_{h}\right\rVert\leq\sup_{s,a}\left\lVert\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{\star 2}_{h}(s,a)}-\frac{\phi(s,a)\phi(s,a)^{\top}}{{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right\rVert (59)
\displaystyle\leq sups,a|σh2(s,a)σV~h+12(s,a)σh2(s,a)σV~h+12(s,a)|ϕ(s,a)2\displaystyle\sup_{s,a}\left|\frac{{\sigma}^{\star 2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{{\sigma}^{\star 2}_{h}(s,a)\cdot{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}\right|\cdot\left\lVert\phi(s,a)\right\rVert^{2}
\displaystyle\leq sups,a|σh2(s,a)σV~h+12(s,a)1|1\displaystyle\sup_{s,a}\left|\frac{{\sigma}^{\star 2}_{h}(s,a)-{\sigma}^{2}_{\widetilde{V}_{h+1}}(s,a)}{1}\right|\cdot 1
\displaystyle\leq O~(H3dκK).\displaystyle\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right).

Next, by Lemma E.12 (with ϕ\phi taken to be ϕ/σVh+1\phi/\sigma_{{V}^{\star}_{h+1}} and C=1C=1), it holds with probability 1δ1-\delta that

Λh(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)]+λKId)42K(log2dHδ)1/2.\left\lVert\Lambda^{\star^{\prime}}_{h}-\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]+\frac{\lambda}{K}I_{d}\right)\right\rVert\leq\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}.

Therefore by Weyl’s inequality and the condition K>max{2λ,128log(2dHδ),128H4log(2dH/δ)κ2}K>\max\{2\lambda,128\log\left(\frac{2dH}{\delta}\right),\frac{128H^{4}\log(2dH/\delta)}{\kappa^{2}}\}, the above inequality implies

Λh=\displaystyle\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert= λmax(Λh)λmax(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])+λK+42K(log2dHδ)1/2\displaystyle\lambda_{\max}(\Lambda^{\star^{\prime}}_{h})\leq\lambda_{\max}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq 𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)]+λK+42K(log2dHδ)1/2\displaystyle\left\lVert\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right\rVert+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\leq ϕ(s,a)2+λK+42K(log2dHδ)1/21+λK+42K(log2dHδ)1/22,\displaystyle\left\lVert\phi(s,a)\right\rVert^{2}+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 1+\frac{\lambda}{K}+\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\leq 2,
λmin(Λh)\displaystyle\lambda_{\min}(\Lambda^{\star^{\prime}}_{h})\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])+λK42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)+\frac{\lambda}{K}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq λmin(𝔼μ,h[ϕ(s,a)ϕ(s,a)/σVh+12(s,a)])42K(log2dHδ)1/2\displaystyle\lambda_{\min}\left(\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^{\top}/\sigma^{2}_{{V}^{\star}_{h+1}}(s,a)]\right)-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}
\displaystyle\geq κH242K(log2dHδ)1/2κ2H2.\displaystyle\frac{\kappa}{H^{2}}-\frac{4\sqrt{2}}{\sqrt{K}}\left(\log\frac{2dH}{\delta}\right)^{1/2}\geq\frac{\kappa}{2H^{2}}.

Hence with probability 1δ1-\delta, Λh2\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert\leq 2 and Λh1=λmin1(Λh)2H2κ\left\lVert\Lambda^{\star^{\prime}-1}_{h}\right\rVert=\lambda_{\min}^{-1}(\Lambda^{\star^{\prime}}_{h})\leq\frac{2H^{2}}{\kappa}. Similarly, we can show that Λh12H2κ\left\lVert{\Lambda}^{{}^{\prime}-1}_{h}\right\rVert\leq\frac{2H^{2}}{\kappa} holds with probability 1δ1-\delta by an identical argument.

Now, applying Lemma E.11 and a union bound to Λh{\Lambda}^{\star^{\prime}}_{h} and Λh{\Lambda}^{\prime}_{h}, we obtain that with probability 1δ1-\delta, for all h,s,a[H]×𝒮×𝒜h,s,a\in[H]\times\mathcal{S}\times\mathcal{A},

ϕ(s,a)Λh1\displaystyle\left\lVert\phi(s,a)\right\rVert_{{\Lambda}^{\prime-1}_{h}}\leq [1+Λh1ΛhΛh1ΛhΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\left\lVert\Lambda^{\star^{\prime}-1}_{h}\right\rVert\cdot\left\lVert\Lambda^{\star^{\prime}}_{h}\right\rVert\cdot\left\lVert{\Lambda}^{\prime-1}_{h}\right\rVert\cdot\left\lVert{\Lambda}^{\star^{\prime}}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq [1+2H2κ22H2κΛhΛh]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{2H^{2}}{\kappa}\cdot 2\cdot\frac{2H^{2}}{\kappa}\cdot\left\lVert{\Lambda}^{\star^{\prime}}_{h}-\Lambda^{\prime}_{h}\right\rVert}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq [1+H4κ2[O~(H3dκK)]]ϕ(s,a)Λh1\displaystyle\left[1+\sqrt{\frac{H^{4}}{\kappa^{2}}\left[\widetilde{O}\left(\frac{H^{3}\sqrt{d}}{\sqrt{\kappa K}}\right)\right]}\right]\cdot\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}
\displaystyle\leq 2ϕ(s,a)Λh1\displaystyle 2\left\lVert\phi(s,a)\right\rVert_{\Lambda^{\star^{\prime}-1}_{h}}

where the third inequality uses (59) and the last inequality uses KO~(H14d/κ5)K\geq\widetilde{O}(H^{14}d/\kappa^{5}). The conclusion follows directly by multiplying both sides of the above inequality by 1/K1/\sqrt{K}. ∎

Finally, the second part of Theorem 4.1 can be proven by combining Theorem D.15 (with π=π\pi=\pi^{\star}) and Lemma D.18.

D.4 Put everything together

Combining Lemma D.1, Theorem D.15, and the discussion above, the proof of Theorem 4.1 is complete.

Appendix E Assisting technical lemmas

Lemma E.1 (Multiplicative Chernoff bound [Chernoff et al., 1952]).

Let XX be a Binomial random variable with parameters p,np,n. For any 1θ>01\geq\theta>0, we have that

[X<(1θ)pn]<eθ2pn2, and [X(1+θ)pn]<eθ2pn3\mathbb{P}[X<(1-\theta)pn]<e^{-\frac{\theta^{2}pn}{2}},\quad\text{ and }\quad\mathbb{P}[X\geq(1+\theta)pn]<e^{-\frac{\theta^{2}pn}{3}}
Lemma E.2 (Hoeffding’s Inequality [Sridharan, 2002]).

Let x1,,xnx_{1},...,x_{n} be independent bounded random variables such that 𝔼[xi]=0\mathbb{E}[x_{i}]=0 and |xi|ξi|x_{i}|\leq\xi_{i} with probability 11. Then for any ϵ>0\epsilon>0 we have

(1ni=1nxiϵ)e2n2ϵ2i=1nξi2.\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}\geq\epsilon\right)\leq e^{-\frac{2n^{2}\epsilon^{2}}{\sum_{i=1}^{n}\xi_{i}^{2}}}.
Lemma E.3 (Bernstein’s Inequality).

Let x1,,xnx_{1},...,x_{n} be independent bounded random variables such that 𝔼[xi]=0\mathbb{E}[x_{i}]=0 and |xi|ξ|x_{i}|\leq\xi with probability 11. Let σ2=1ni=1nVar[xi]\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}[x_{i}], then with probability 1δ1-\delta we have

1ni=1nxi2σ2log(1/δ)n+2ξ3nlog(1/δ).\frac{1}{n}\sum_{i=1}^{n}x_{i}\leq\sqrt{\frac{2\sigma^{2}\cdot\log(1/\delta)}{n}}+\frac{2\xi}{3n}\log(1/\delta).
Lemma E.4 (Empirical Bernstein’s Inequality [Maurer and Pontil, 2009]).

Let x1,,xnx_{1},...,x_{n} be i.i.d random variables such that |xi|ξ|x_{i}|\leq\xi with probability 11. Let x¯=1ni=1nxi\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i} and V^n=1ni=1n(xix¯)2\widehat{V}_{n}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}, then with probability 1δ1-\delta we have

|1ni=1nxi𝔼[x]|2V^nlog(2/δ)n+7ξ3nlog(2/δ).\left|\frac{1}{n}\sum_{i=1}^{n}x_{i}-\mathbb{E}[x]\right|\leq\sqrt{\frac{2\widehat{V}_{n}\cdot\log(2/\delta)}{n}}+\frac{7\xi}{3n}\log(2/\delta).
Lemma E.5 (Lemma I.8 in [Yin and Wang, 2021b]).

Let n2n\geq 2 and VSV\in\mathbb{R}^{S} be any function with VH||V||_{\infty}\leq H, PP be any SS-dimensional distribution and P^\widehat{P} be its empirical version using nn samples. Then with probability 1δ1-\delta,

|VarP^(V)n1nVarP(V)|2Hlog(2/δ)n1.\left|\sqrt{\mathrm{Var}_{\widehat{P}}(V)}-\sqrt{\frac{n-1}{n}\mathrm{Var}_{{P}}(V)}\right|\leq 2H\sqrt{\frac{\log(2/\delta)}{n-1}}.
Lemma E.6 (Claim 2 in [Vietri et al., 2020]).

Let yy\in\mathbb{R} be any positive real number. Then for all xx\in\mathbb{R} with x2yx\geq 2y, it holds that 1xy1x+2yx2\frac{1}{x-y}\leq\frac{1}{x}+\frac{2y}{x^{2}}.
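For completeness, this elementary inequality follows from a one-line calculation: since x ≥ 2y implies x − y ≥ x/2,

\frac{1}{x-y}-\frac{1}{x}=\frac{y}{x(x-y)}\leq\frac{y}{x\cdot(x/2)}=\frac{2y}{x^{2}}.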

E.1 Extended Value Difference

Lemma E.7 (Extended Value Difference (Section B.1 in [Cai et al., 2020])).

Let π={πh}h=1H\pi=\{\pi_{h}\}_{h=1}^{H} and π={πh}h=1H\pi^{\prime}=\{\pi^{\prime}_{h}\}_{h=1}^{H} be two arbitrary policies and let {Q^h}h=1H\{\widehat{Q}_{h}\}_{h=1}^{H} be any given Q-functions. Then define V^h(s):=Q^h(s,),πh(s)\widehat{V}_{h}(s):=\langle\widehat{Q}_{h}(s,\cdot),\pi_{h}(\cdot\mid s)\rangle for all s𝒮s\in\mathcal{S}. Then for all s𝒮s\in\mathcal{S},

V^1(s)V1π(s)=\displaystyle\widehat{V}_{1}(s)-V_{1}^{\pi^{\prime}}(s)= h=1H𝔼π[Q^h(sh,),πh(sh)πh(sh)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi^{\prime}}\left[\langle\widehat{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot\mid s_{h}\right)-\pi_{h}^{\prime}\left(\cdot\mid s_{h}\right)\rangle\mid s_{1}=s\right] (60)
+h=1H𝔼π[Q^h(sh,ah)(𝒯hV^h+1)(sh,ah)s1=s]\displaystyle+\sum_{h=1}^{H}\mathbb{E}_{\pi^{\prime}}\left[\widehat{Q}_{h}\left(s_{h},a_{h}\right)-\left(\mathcal{T}_{h}\widehat{V}_{h+1}\right)\left(s_{h},a_{h}\right)\mid s_{1}=s\right]

where (𝒯hV)(,):=rh(,)+(PhV)(,)(\mathcal{T}_{h}V)(\cdot,\cdot):=r_{h}(\cdot,\cdot)+(P_{h}V)(\cdot,\cdot) for any VSV\in\mathbb{R}^{S}.

Lemma E.8 (Lemma I.10 in [Yin and Wang, 2021b]).

Let π^={π^h}h=1H\widehat{\pi}=\left\{\widehat{\pi}_{h}\right\}_{h=1}^{H} and Q^h(,)\widehat{Q}_{h}(\cdot,\cdot) be an arbitrary policy and Q-function, let V^h(s)=Q^h(s,),π^h(|s)\widehat{V}_{h}(s)=\langle\widehat{Q}_{h}(s,\cdot),\widehat{\pi}_{h}(\cdot|s)\rangle s𝒮\forall s\in\mathcal{S}, and define ξh(s,a)=(𝒯hV^h+1)(s,a)Q^h(s,a)\xi_{h}(s,a)=(\mathcal{T}_{h}\widehat{V}_{h+1})(s,a)-\widehat{Q}_{h}(s,a) element-wise. Then for any policy π\pi, we have

V1π(s)V1π^(s)=\displaystyle V_{1}^{\pi}(s)-V_{1}^{\widehat{\pi}}(s)= h=1H𝔼π[ξh(sh,ah)s1=s]h=1H𝔼π^[ξh(sh,ah)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]-\sum_{h=1}^{H}\mathbb{E}_{\widehat{\pi}}\left[\xi_{h}(s_{h},a_{h})\mid s_{1}=s\right]
+\displaystyle+ h=1H𝔼π[Q^h(sh,),πh(|sh)π^h(|sh)s1=s]\displaystyle\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[\langle\widehat{Q}_{h}\left(s_{h},\cdot\right),\pi_{h}\left(\cdot|s_{h}\right)-\widehat{\pi}_{h}\left(\cdot|s_{h}\right)\rangle\mid s_{1}=s\right]

where the expectations are taken over sh,ahs_{h},a_{h}.

E.2 Assisting lemmas for linear MDP setting

Lemma E.9 (Hoeffding inequality for self-normalized martingales [Abbasi-Yadkori et al., 2011]).

Let {ηt}t=1\{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process. Let {t}t=0\{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration, such that ηt\eta_{t} is t\mathcal{F}_{t}-measurable. Assume that, conditioned on t1\mathcal{F}_{t-1}, ηt\eta_{t} is zero-mean and RR-subgaussian, i.e.

λ,𝔼[eληtt1]eλ2R2/2.\forall\lambda\in\mathbb{R},\quad\mathbb{E}\left[e^{\lambda\eta_{t}}\mid\mathcal{F}_{t-1}\right]\leq e^{\lambda^{2}R^{2}/2}.

Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process where xtx_{t} is t1\mathcal{F}_{t-1} measurable and xtL\|x_{t}\|\leq L. Let Λt=λId+s=1txsxs\Lambda_{t}=\lambda I_{d}+\sum_{s=1}^{t}x_{s}x^{\top}_{s}. Then for any δ>0\delta>0, with probability 1δ1-\delta, for all t>0t>0,

s=1txsηsΛt128R2d2log(λ+tLλδ).\left\|\sum_{s=1}^{t}{x}_{s}\eta_{s}\right\|_{{\Lambda}_{t}^{-1}}^{2}\leq 8R^{2}\cdot\frac{d}{2}\log\left(\frac{\lambda+tL}{\lambda\delta}\right).
Lemma E.10 (Bernstein inequality for self-normalized martingales [Zhou et al., 2021]).

Let {ηt}t=1\{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process. Let {t}t=0\{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration, such that ηt\eta_{t} is t\mathcal{F}_{t}-measurable. Assume ηt\eta_{t} also satisfies

|ηt|R,𝔼[ηtt1]=0,𝔼[ηt2t1]σ2.\left|\eta_{t}\right|\leq R,\mathbb{E}\left[\eta_{t}\mid\mathcal{F}_{t-1}\right]=0,\mathbb{E}\left[\eta_{t}^{2}\mid\mathcal{F}_{t-1}\right]\leq\sigma^{2}.

Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process where xtx_{t} is t1\mathcal{F}_{t-1} measurable and xtL\|x_{t}\|\leq L. Let Λt=λId+s=1txsxs\Lambda_{t}=\lambda I_{d}+\sum_{s=1}^{t}x_{s}x^{\top}_{s}. Then for any δ>0\delta>0, with probability 1δ1-\delta, for all t>0t>0,

s=1t𝐱sηs𝚲t18σdlog(1+tL2λd)log(4t2δ)+4Rlog(4t2δ)\left\|\sum_{s=1}^{t}\mathbf{x}_{s}\eta_{s}\right\|_{\bm{\Lambda}_{t}^{-1}}\leq 8\sigma\sqrt{d\log\left(1+\frac{tL^{2}}{\lambda d}\right)\cdot\log\left(\frac{4t^{2}}{\delta}\right)}+4R\log\left(\frac{4t^{2}}{\delta}\right)
Lemma E.11 (Lemma H.4 in [Yin et al., 2022]).

Let Λ1\Lambda_{1} and Λ2d×d\Lambda_{2}\in\mathbb{R}^{d\times d} be two positive semi-definite matrices. Then:

Λ11Λ21+Λ11Λ21Λ1Λ2\|\Lambda_{1}^{-1}\|\leq\|\Lambda_{2}^{-1}\|+\|\Lambda_{1}^{-1}\|\cdot\|\Lambda_{2}^{-1}\|\cdot\|\Lambda_{1}-\Lambda_{2}\|

and

ϕΛ11[1+Λ21Λ2Λ11Λ1Λ2]ϕΛ21.\|\phi\|_{\Lambda_{1}^{-1}}\leq\left[1+\sqrt{\|\Lambda_{2}^{-1}\|\cdot\|\Lambda_{2}\|\cdot\|\Lambda_{1}^{-1}\|\cdot\|\Lambda_{1}-\Lambda_{2}\|}\right]\cdot\|\phi\|_{\Lambda_{2}^{-1}}.

for all ϕd\phi\in\mathbb{R}^{d}.

Lemma E.12 (Lemma H.4 in [Min et al., 2021]).

Let ϕ:𝒮×𝒜d\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} satisfies ϕ(s,a)C\|\phi(s,a)\|\leq C for all s,a𝒮×𝒜s,a\in\mathcal{S}\times\mathcal{A}. For any K>0,λ>0K>0,\lambda>0, define G¯K=k=1Kϕ(sk,ak)ϕ(sk,ak)+λId\bar{G}_{K}=\sum_{k=1}^{K}\phi(s_{k},a_{k})\phi(s_{k},a_{k})^{\top}+\lambda I_{d} where (sk,ak)(s_{k},a_{k})’s are i.i.d samples from some distribution ν\nu. Then with probability 1δ1-\delta,

G¯KK𝔼ν[G¯KK]42C2K(log2dδ)1/2.\left\lVert\frac{\bar{G}_{K}}{K}-\mathbb{E}_{\nu}\left[\frac{\bar{G}_{K}}{K}\right]\right\rVert\leq\frac{4\sqrt{2}C^{2}}{\sqrt{K}}\left(\log\frac{2d}{\delta}\right)^{1/2}.
Lemma E.13 (Lemma H.5 in [Min et al., 2021]).

Let ϕ:𝒮×𝒜d\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} be a bounded function s.t. ϕ2C\|\phi\|_{2}\leq C. Define G¯K=k=1Kϕ(sk,ak)ϕ(sk,ak)+λId\bar{G}_{K}=\sum_{k=1}^{K}\phi(s_{k},a_{k})\phi(s_{k},a_{k})^{\top}+\lambda I_{d} where (sk,ak)(s_{k},a_{k})’s are i.i.d samples from some distribution ν\nu. Let G=𝔼ν[ϕ(s,a)ϕ(s,a)]G=\mathbb{E}_{\nu}[\phi(s,a)\phi(s,a)^{\top}]. Then for any δ(0,1)\delta\in(0,1), if KK satisfies

Kmax{512C4𝐆12log(2dδ),4λ𝐆1}.K\geq\max\left\{512C^{4}\left\|\mathbf{G}^{-1}\right\|^{2}\log\left(\frac{2d}{\delta}\right),4\lambda\left\|\mathbf{G}^{-1}\right\|\right\}.

Then with probability at least 1δ1-\delta, it holds simultaneously for all udu\in\mathbb{R}^{d} that

uG¯K12KuG1.\|u\|_{\bar{G}_{K}^{-1}}\leq\frac{2}{\sqrt{K}}\|u\|_{G^{-1}}.
Lemma E.14 (Lemma H.9 in [Yin et al., 2022]).

For a linear MDP, for any 0V()H0\leq V(\cdot)\leq H, there exists a whdw_{h}\in\mathbb{R}^{d} s.t. 𝒯hV=ϕ,wh\mathcal{T}_{h}V=\langle\phi,w_{h}\rangle and wh22Hd\|w_{h}\|_{2}\leq 2H\sqrt{d} for all h[H]h\in[H]. Here 𝒯h(V)(s,a)=rh(s,a)+(PhV)(s,a)\mathcal{T}_{h}(V)(s,a)=r_{h}(s,a)+(P_{h}V)(s,a). Similarly, for any π\pi, there exists whπdw^{\pi}_{h}\in\mathbb{R}^{d}, such that Qhπ=ϕ,whπQ^{\pi}_{h}=\langle\phi,w^{\pi}_{h}\rangle with whπ22(Hh+1)d\|w_{h}^{\pi}\|_{2}\leq 2(H-h+1)\sqrt{d}.

E.3 Assisting lemmas for differential privacy

Lemma E.15 (Converting zCDP to DP [Bun and Steinke, 2016]).

If M satisfies ρ\rho-zCDP, then for any δ>0\delta>0, M satisfies (ρ+2ρlog(1/δ),δ)(\rho+2\sqrt{\rho\log(1/\delta)},\delta)-DP.
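As a quick sanity check of Lemma E.15, the following minimal Python snippet (illustrative only; the function name zcdp_to_dp is ours, not from the paper) computes the privacy parameter implied by a given zCDP budget ρ and target δ:

```python
import math

def zcdp_to_dp(rho: float, delta: float) -> float:
    """Return epsilon such that rho-zCDP implies (epsilon, delta)-DP (Lemma E.15)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: rho = 0.1 and delta = 1e-5 give epsilon of roughly 2.25.
print(zcdp_to_dp(0.1, 1e-5))
```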

Lemma E.16 (zCDP Composition [Bun and Steinke, 2016]).

Let M:𝒰n𝒴M:\mathcal{U}^{n}\rightarrow\mathcal{Y} and M:𝒰n𝒵M^{\prime}:\mathcal{U}^{n}\rightarrow\mathcal{Z} be randomized mechanisms. Suppose that MM satisfies ρ\rho-zCDP and MM^{\prime} satisfies ρ\rho^{\prime}-zCDP. Define M′′:𝒰n𝒴×𝒵M^{\prime\prime}:\mathcal{U}^{n}\rightarrow\mathcal{Y}\times\mathcal{Z} by M′′(U)=(M(U),M(U))M^{\prime\prime}(U)=(M(U),M^{\prime}(U)). Then M′′M^{\prime\prime} satisfies (ρ+ρ)(\rho+\rho^{\prime})-zCDP.

Lemma E.17 (Adaptive composition and Post processing of zCDP [Bun and Steinke, 2016]).

Let M:𝒳n𝒴M:\mathcal{X}^{n}\rightarrow\mathcal{Y} and M:𝒳n×𝒴𝒵M^{\prime}:\mathcal{X}^{n}\times\mathcal{Y}\rightarrow\mathcal{Z}. Suppose MM satisfies ρ\rho-zCDP and MM^{\prime} satisfies ρ\rho^{\prime}-zCDP (as a function of its first argument). Define M′′:𝒳n𝒵M^{\prime\prime}:\mathcal{X}^{n}\rightarrow\mathcal{Z} by M′′(x)=M(x,M(x))M^{\prime\prime}(x)=M^{\prime}(x,M(x)). Then M′′M^{\prime\prime} satisfies (ρ+ρ)(\rho+\rho^{\prime})-zCDP.

Definition E.18 (1\ell_{1} sensitivity).

Define the 1\ell_{1} sensitivity of a function f:𝒳df:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d} as

Δ1(f)=supneighboringU,Uf(U)f(U)1.\displaystyle\Delta_{1}(f)=\sup_{\text{neighboring}\,U,U^{\prime}}\|f(U)-f(U^{\prime})\|_{1}.
Definition E.19 (Laplace Mechanism [Dwork et al., 2014]).

Given any function f:𝒳df:\mathbb{N}^{\mathcal{X}}\mapsto\mathbb{R}^{d}, the Laplace mechanism is defined as:

L(x,f,ϵ)=f(x)+(Y1,,Yd),\displaystyle\mathcal{M}_{L}(x,f,\epsilon)=f(x)+(Y_{1},\cdots,Y_{d}),

where YiY_{i} are i.i.d. random variables drawn from Lap(Δ1(f)/ϵ)\mathrm{Lap}(\Delta_{1}(f)/\epsilon).

Lemma E.20 (Privacy guarantee of Laplace Mechanism [Dwork et al., 2014]).

The Laplace mechanism preserves (ϵ,0)(\epsilon,0)-differential privacy. For simplicity, we say ϵ\epsilon-DP.
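For concreteness, here is a minimal sketch of the Laplace mechanism of Definition E.19 in Python with numpy (the function name and the counts example are illustrative assumptions, not part of the paper):

```python
import numpy as np

def laplace_mechanism(f_value, l1_sensitivity, epsilon, rng=None):
    """Release f(x) plus i.i.d. Lap(Delta_1(f)/epsilon) noise per coordinate (epsilon-DP)."""
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.asarray(f_value, dtype=float)
    scale = l1_sensitivity / epsilon
    return f_value + rng.laplace(loc=0.0, scale=scale, size=f_value.shape)

# Example: privatizing a small vector of counts assumed to have l1-sensitivity 1.
counts = np.array([12.0, 3.0, 7.0])
private_counts = laplace_mechanism(counts, l1_sensitivity=1.0, epsilon=0.5)
```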

Appendix F Details for the Evaluation part

In the evaluation, we use a synthetic linear MDP similar to those of [Min et al., 2021, Yin et al., 2022], with some modifications for our evaluation task. The linear MDP consists of |𝒮|=2|\mathcal{S}|=2 states and |𝒜|=100|\mathcal{A}|=100 actions, and the feature dimension is d=10d=10. We denote 𝒮={0,1}\mathcal{S}=\{0,1\} and 𝒜={0,1,,99}\mathcal{A}=\{0,1,\ldots,99\}. For each action a{0,1,,99}a\in\{0,1,\ldots,99\}, we obtain a vector 𝐚8\mathbf{a}\in\mathbb{R}^{8} via its binary encoding, so each coordinate of 𝐚\mathbf{a} is either 0 or 11.
First, we define the following indicator function δ(s,a)={1 if 𝟙{s=0}=𝟙{a=0}0 otherwise ,\delta(s,a)=\begin{cases}1&\text{ if }\mathds{1}\{s=0\}=\mathds{1}\{a=0\}\\ 0&\text{ otherwise }\end{cases}, then our non-stationary linear MDP example can be characterized by the following parameters.

The feature map ϕ\bm{\phi} is:

ϕ(s,a)=(𝐚,δ(s,a),1δ(s,a))10.\bm{\phi}(s,a)=\left(\mathbf{a}^{\top},\delta(s,a),1-\delta(s,a)\right)^{\top}\in\mathbb{R}^{10}.

The unknown measure νh\nu_{h} is:

𝝂h(0)=(0,,0,αh,1,αh,2),\bm{\nu}_{h}(0)=\left(0,\cdots,0,\alpha_{h,1},\alpha_{h,2}\right),
𝝂h(1)=(0,,0,1αh,1,1αh,2),\bm{\nu}_{h}(1)=\left(0,\cdots,0,1-\alpha_{h,1},1-\alpha_{h,2}\right),

where {αh,1,αh,2}h[H]\{\alpha_{h,1},\alpha_{h,2}\}_{h\in[H]} is a sequence of random values sampled uniformly from [0,1][0,1].
The unknown vector θh\theta_{h} is:

θh=(rh/8,0,rh/8,1/2rh/2,rh/8,0,rh/8,0,rh/2,1/2rh/2)10,\theta_{h}=(r_{h}/8,0,r_{h}/8,1/2-r_{h}/2,r_{h}/8,0,r_{h}/8,0,r_{h}/2,1/2-r_{h}/2)\in\mathbb{R}^{10},

where rhr_{h} is also sampled uniformly from [0,1][0,1]. Therefore, the transition kernel is Ph(s|s,a)=ϕ(s,a),𝝂h(s)P_{h}(s^{\prime}|s,a)=\langle\phi(s,a),\bm{\nu}_{h}(s^{\prime})\rangle and the expected reward function is rh(s,a)=ϕ(s,a),θhr_{h}(s,a)=\langle\phi(s,a),\theta_{h}\rangle.
Finally, the behavior policy chooses action a=0a=0 with probability pp and each of the other actions with probability (1p)/99(1-p)/99; here we choose p=0.6p=0.6. The initial distribution is uniform over 𝒮={0,1}\mathcal{S}=\{0,1\}. A code sketch of this data-generating process is given below.
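The following minimal Python sketch reproduces this construction (the horizon H=20, the random seed, the number of episodes, and all function names such as phi, transition, and sample_episode are illustrative assumptions, not taken from the paper):

```python
import numpy as np

S, A, d, H = 2, 100, 10, 20      # H = 20 is an assumed horizon for illustration
rng = np.random.default_rng(0)   # fixed seed for reproducibility (assumption)

def encode_action(a):
    """8-bit binary encoding of an action a in {0, ..., 99}; each coordinate is 0 or 1."""
    return np.array([(a >> i) & 1 for i in range(8)], dtype=float)

def phi(s, a):
    """Feature map phi(s, a) = (a_bits, delta(s, a), 1 - delta(s, a)) in R^10."""
    delta = float((s == 0) == (a == 0))
    return np.concatenate([encode_action(a), [delta, 1.0 - delta]])

# Unknown parameters nu_h and theta_h, drawn once for the whole MDP.
alpha = rng.uniform(size=(H, 2))            # (alpha_{h,1}, alpha_{h,2}) ~ Unif[0, 1]
r = rng.uniform(size=H)                     # r_h ~ Unif[0, 1]
nu = np.zeros((H, 2, d))
nu[:, 0, 8:] = alpha                        # nu_h(0) = (0, ..., 0, alpha_{h,1}, alpha_{h,2})
nu[:, 1, 8:] = 1.0 - alpha                  # nu_h(1) = (0, ..., 0, 1-alpha_{h,1}, 1-alpha_{h,2})
theta = np.stack([
    np.array([rh / 8, 0, rh / 8, 0.5 - rh / 2, rh / 8, 0, rh / 8, 0, rh / 2, 0.5 - rh / 2])
    for rh in r
])

def transition(h, s, a):
    """Sample s' with P_h(s' | s, a) = <phi(s, a), nu_h(s')>."""
    p0 = phi(s, a) @ nu[h, 0]               # probability of landing in state 0
    return 0 if rng.random() < p0 else 1

def reward(h, s, a):
    """Expected reward r_h(s, a) = <phi(s, a), theta_h>."""
    return phi(s, a) @ theta[h]

def behavior_action(p=0.6):
    """Behavior policy: a = 0 w.p. p, otherwise uniform over the remaining 99 actions."""
    return 0 if rng.random() < p else int(rng.integers(1, A))

def sample_episode():
    """Generate one trajectory of length H under the behavior policy."""
    s = int(rng.integers(0, S))             # uniform initial distribution over {0, 1}
    traj = []
    for h in range(H):
        a = behavior_action()
        traj.append((s, a, reward(h, s, a)))
        s = transition(h, s, a)
    return traj

dataset = [sample_episode() for _ in range(1000)]   # K = 1000 episodes (assumption)
```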