Offline Reinforcement Learning with Differential Privacy
Abstract
The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), making it susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared with the non-private counterpart for a medium-size dataset.
1 Introduction
Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment (usually characterized by a Markov Decision Process (MDP) in this paper) through a static dataset gathered from some behavior policy. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trading or finance (Zhang et al., 2020)), risky (autonomous driving (Sallab et al., 2017)) or illegal / unethical (healthcare (Raghu et al., 2017)). In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history as an example: for each patient, at each time step, the patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage, etc.), and finally there is a treatment outcome (whether the patient feels better, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be viewed as n (number of patients) trajectories sampled from an MDP with horizon H (number of treatment steps); see Table 1 for an illustration. However, learning agents are known to implicitly memorize details of individual training data points verbatim (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes offline RL models vulnerable to various privacy attacks.
Differential privacy (DP) (Dwork et al., 2006) is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe in which any individual user is replaced, thereby preventing the aforementioned privacy risks. There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting (Vietri et al., 2020; Garcelon et al., 2021; Liao et al., 2021; Chowdhury and Zhou, 2021; Luyo et al., 2021).
Offline RL is arguably more practically relevant than online RL in the applications with sensitive data. For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy enhancing techniques (e.g., DP) to patients was reported to further improve data access (Kim et al., 2017).
Table 1: An offline dataset of n patient trajectories with horizon H.

        | Patient 1               | Patient 2               | ... | Patient n
Time 1  | Health condition 1,1    | Health condition 2,1    | ... | Health condition n,1
Time 1  | Treatment 1,1           | Treatment 2,1           | ... | Treatment n,1
Time 1  | Treatment outcome 1,1   | Treatment outcome 2,1   | ... | Treatment outcome n,1
...     | ...                     | ...                     | ... | ...
Time H  | Health condition 1,H    | Health condition 2,H    | ... | Health condition n,H
Time H  | Treatment 1,H           | Treatment 2,H           | ... | Treatment n,H
Time H  | Treatment outcome 1,H   | Treatment outcome 2,H   | ... | Treatment outcome n,H
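To make the mapping from Table 1 to an offline RL dataset concrete, the following minimal sketch (the class and field names are ours, not the paper's) shows one way to represent the data: each patient contributes one trajectory of H (state, action, reward) triples, and one trajectory is exactly the unit of privacy ("one data point") used later in Section 2.2.

```python
# A hypothetical sketch (names ours) of the dataset in Table 1: each patient
# is one trajectory of H (health condition, treatment, outcome) triples, i.e.
# (state, action, reward), and one trajectory is one "data point" for DP.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    state: str     # health condition at time h
    action: str    # treatment chosen at time h
    reward: float  # treatment outcome at time h

@dataclass
class Trajectory:
    steps: List[Step]          # length H (number of treatment steps)

# An offline dataset is n such trajectories collected by the behavior policy.
Dataset = List[Trajectory]
```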
Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold.
• We design two new pessimism-based algorithms, DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under the linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms.
• We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI (Yin et al., 2022) as well as a popular baseline PEVI (Jin et al., 2021). The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.
Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy (Balle et al., 2016; Xie et al., 2019). Balle et al. (2016) privatized the first-visit Monte Carlo-Ridge Regression estimator via an output perturbation mechanism, and Xie et al. (2019) used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus.
There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy (Vietri et al., 2020; Chowdhury and Zhou, 2021; Ngo et al., 2022; Luyo et al., 2021) or local differential privacy (Garcelon et al., 2021; Liao et al., 2021; Luyo et al., 2021; Chowdhury and Zhou, 2021). The offline setting introduces new challenges in DP as we cannot algorithmically enforce good “exploration”, but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B.
Among non-private offline RL works, we build directly upon the non-private offline RL methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) (Yin and Wang, 2021b) and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) (Yin et al., 2022), as they give the strongest theoretical guarantees to date. We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.
A remark on technical novelty. Our algorithms involve substantial technical innovation over previous works on online DP-RL with a joint DP guarantee (here we only compare our techniques, designed for offline RL, with the works for online RL under joint DP, as both settings allow access to the raw data). Different from previous works, our DP-APVI (Algorithm 1) operates on a Bernstein-type pessimism, which requires our algorithm to deal with conditional variances using private statistics. Besides, our DP-VAPVI (Algorithm 2) replaces the LSVI technique with variance-aware LSVI (also known as weighted ridge regression, which first appeared in (Zhou et al., 2021)). Our DP-VAPVI releases conditional variances privately, and further applies weighted ridge regression privately. Both approaches ensure tighter instance-dependent bounds on the suboptimality of the learned policy.
2 Problem Setup
Markov Decision Process. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple M = (S, A, P, r, H, d_1) (Sutton and Barto, 2018), where S is the state space and A is the action space. A non-stationary transition kernel P_h maps each state-action pair to a probability distribution over next states and can be different across time. Besides, r_h(s, a) ∈ [0, 1] is the expected immediate reward, d_1 is the initial state distribution and H is the horizon. A policy π = (π_1, ..., π_H) assigns each state a probability distribution over actions according to the map s ↦ π_h(·|s) for every h ∈ [H]. A random trajectory (s_1, a_1, r_1, ..., s_H, a_H, r_H, s_{H+1}) is generated according to s_1 ~ d_1, a_h ~ π_h(·|s_h), r_h ~ r_h(s_h, a_h), s_{h+1} ~ P_h(·|s_h, a_h).
For tabular MDPs, S × A is a discrete state-action space and S := |S|, A := |A| are finite. In this work, we assume that the reward r is known (this is because the uncertainty of the reward function is dominated by that of the transition kernel in RL). In addition, we denote by d^μ_h(s, a) the per-step marginal state-action occupancy, i.e., the marginal probability of visiting (s, a) at time h under the behavior policy μ.
Value function, Bellman (optimality) equations. For any policy π, the value function and Q-value function are defined as V^π_h(s) := E_π[Σ_{t=h}^H r_t | s_h = s] and Q^π_h(s, a) := E_π[Σ_{t=h}^H r_t | s_h = s, a_h = a]. The performance of π is defined as v^π := E_{s_1 ~ d_1}[V^π_1(s_1)]. The Bellman (optimality) equations follow: Q^π_h = r_h + P_h V^π_{h+1} and V^π_h(s) = E_{a ~ π_h(·|s)}[Q^π_h(s, a)]; similarly, Q*_h = r_h + P_h V*_{h+1} and V*_h(s) = max_a Q*_h(s, a).
Linear MDP (Jin et al., 2020b). An episodic MDP is called a linear MDP with a known feature map φ: S × A → R^d if there exist d unknown signed measures ν_h over S and unknown reward vectors θ_h ∈ R^d such that P_h(s'|s, a) = ⟨φ(s, a), ν_h(s')⟩ and r_h(s, a) = ⟨φ(s, a), θ_h⟩ for all (s, a, s') and h ∈ [H].
Without loss of generality, we assume ‖φ(s, a)‖_2 ≤ 1 for all (s, a) and max(‖ν_h(S)‖_2, ‖θ_h‖_2) ≤ √d for all h ∈ [H]. An important property of linear MDPs is that the value functions are linear in the feature map, which is summarized in Lemma E.14.
Offline setting and the goal. Offline RL requires the agent to find a policy that maximizes the performance v^π, given only the episodic dataset D = {(s_h^τ, a_h^τ, r_h^τ, s_{h+1}^τ)}_{τ, h} (for clarity we use n for the number of episodes under tabular MDPs and K under linear MDPs when referring to the sample complexity) rolled out from some fixed and possibly unknown behavior policy μ. This means we cannot change μ, and in particular we do not assume functional knowledge of it. In conclusion, based on the batch data D and a targeted accuracy ε, the agent seeks to find a policy π̂ such that v* − v^{π̂} ≤ ε.
2.1 Assumptions in offline RL
In order to show that our privacy-preserving algorithms can output a near-optimal policy, certain coverage assumptions are needed. In this section, we list the assumptions we use in this paper.
Assumptions for tabular setting.
Assumption 2.1 ((Liu et al., 2019)).
There exists one optimal policy π*, such that π* is fully covered by the behavior policy μ, i.e., for all (s, a) and h, d^{π*}_h(s, a) > 0 only if d^μ_h(s, a) > 0. Furthermore, we denote the trackable set as C_h := {(s, a) : d^μ_h(s, a) > 0}.
Assumption 2.1 is the weakest assumption needed for accurately learning the optimal value: it only requires the behavior policy to trace the state-action space of one optimal policy (the behavior policy can be agnostic at other locations). Similar to (Yin and Wang, 2021b), we use Assumption 2.1 for the tabular part of this paper, which enables a comparison between our sample complexity and the conclusion in (Yin and Wang, 2021b), whose algorithm serves as a non-private baseline.
Assumptions for linear setting. First, we define the expectation of the covariance matrix under the behavior policy for each time step h ∈ [H] as below:

Σ^p_h := E_μ[φ(s_h, a_h) φ(s_h, a_h)^T].    (1)

As has been shown in (Wang et al., 2021; Yin et al., 2022), learning a near-optimal policy from offline data requires coverage assumptions. In the linear setting, such coverage is characterized by the minimum eigenvalue of Σ^p_h. Similar to (Yin et al., 2022), we apply the following assumption for the sake of comparison.
Assumption 2.2 (Feature Coverage, Assumption 2 in (Wang et al., 2021)).
The data distributions satisfy the minimum eigenvalue condition: for all h ∈ [H], κ_h := λ_min(Σ^p_h) > 0. Furthermore, we denote κ := min_h κ_h.
2.2 Differential Privacy in offline RL
In this work, we aim to design privacy-preserving algorithms for offline RL. We apply differential privacy as the formal notion of privacy. Below we revisit the definition of differential privacy.
Definition 2.3 (Differential Privacy (Dwork et al., 2006)).
A randomized mechanism M satisfies (ε, δ)-differential privacy ((ε, δ)-DP) if for all neighboring datasets D, D' that differ by one data point and for all possible events E in the output range, it holds that

P[M(D) ∈ E] ≤ e^ε · P[M(D') ∈ E] + δ.

When δ = 0, we say the mechanism satisfies pure DP, while for δ > 0, we say it satisfies approximate DP.
In the problem of offline RL, the dataset consists of several trajectories, so one data point in Definition 2.3 refers to one single trajectory. Hence the definition of differential privacy means that the distribution of the output policy changes little when one trajectory in the dataset is replaced. In other words, an adversary cannot infer much information about any single trajectory in the dataset from the output policy of the algorithm.
Remark 2.4.
For a concrete motivating example, please refer to the first paragraph of Introduction. We remark that our definition of DP is consistent with Joint DP and Local DP defined under the online RL setting where JDP/LDP also cast each user as one trajectory and provide user-wise privacy protection. For detailed definitions and more discussions about JDP/LDP, please refer to Qiao and Wang (2022).
Throughout the paper, we will use zCDP (defined below) as a surrogate for DP, since it enables a cleaner analysis for privacy composition and the Gaussian mechanism. The properties of zCDP (e.g., composition, conversion formula to DP) are deferred to Appendix E.3.
Definition 2.5 (zCDP (Dwork and Rothblum, 2016; Bun and Steinke, 2016)).
A randomized mechanism M satisfies ρ-Zero-Concentrated Differential Privacy (ρ-zCDP) if for all neighboring datasets D, D' and all α ∈ (1, ∞),

D_α(M(D) ‖ M(D')) ≤ ρα,

where D_α is the Rényi divergence of order α (Van Erven and Harremos, 2014).
Finally, we go over the definition and privacy guarantee of Gaussian mechanism.
Definition 2.6 (Gaussian Mechanism (Dwork et al., 2014)).
Define the ℓ2 sensitivity of a function f as

Δ_2(f) := sup_{neighboring D, D'} ‖f(D) − f(D')‖_2.

The Gaussian mechanism with noise level σ is then given by

M(D) = f(D) + N(0, σ^2 I_d).
Lemma 2.7 (Privacy guarantee of Gaussian mechanism (Dwork et al., 2014; Bun and Steinke, 2016)).
Let f be an arbitrary d-dimensional function with ℓ2 sensitivity Δ_2. Then for any ρ > 0, the Gaussian mechanism with parameter σ^2 = Δ_2^2 / (2ρ) satisfies ρ-zCDP. In addition, for all 0 < ε, δ < 1, the Gaussian mechanism with parameter σ = (Δ_2 / ε) · sqrt(2 log(1.25/δ)) satisfies (ε, δ)-DP.
We emphasize that the privacy guarantee covers any input data. It does not require any distributional assumptions on the data. The RL-specific assumptions (e.g., linear MDP and coverage assumptions) are only used for establishing provable utility guarantees.
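As a minimal illustration of Definition 2.6 and Lemma 2.7 (the released function and the sensitivity value below are our own illustrative assumptions, not the paper's constructions), the following sketch adds Gaussian noise calibrated to an ℓ2 sensitivity and a zCDP budget:

```python
# A minimal sketch of the Gaussian mechanism: per-coordinate noise with
# standard deviation sensitivity / sqrt(2 * rho) yields rho-zCDP (Lemma 2.7).
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, rho, rng=np.random.default_rng()):
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Illustrative use: release the vector of (s, a) visitation counts at one time
# step. Replacing one trajectory changes at most two of these counts by 1 each,
# so an l2 sensitivity of sqrt(2) suffices for this particular vector (an
# illustrative bound; the paper composes over time steps and count families).
counts = np.array([120.0, 37.0, 5.0, 0.0])
private_counts = gaussian_mechanism(counts, l2_sensitivity=np.sqrt(2.0), rho=0.1)
```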
3 Results under tabular MDP: DP-APVI (Algorithm 1)
For reinforcement learning, the tabular MDP setting is the most well-studied setting and our first result applies to this regime. We begin with the construction of private counts.
Private Model-based Components. Given data D, let n_h(s, a, s') denote the total count of visits to (s, a, s') at time h and n_h(s, a) the total count of visits to (s, a) at time h. Then, given the budget ρ for zCDP, we add independent Gaussian noises to all the counts:

n'_h(s, a, s') = n_h(s, a, s') + N(0, σ^2),   n'_h(s, a) = n_h(s, a) + N(0, σ^2),    (2)

where the noise level σ is calibrated via Lemma 2.7 so that the joint release of all counts satisfies ρ-zCDP.
However, after adding noise, the noisy counts may not satisfy the consistency condition that the (s, a, s') counts sum over s' to the corresponding (s, a) count (and they may even be negative). To address this problem, we choose the private counts as the solution to the following optimization problem, where the slack in the constraints is chosen as a high-probability uniform bound on the noises we add:

find nonnegative counts {ñ_h(s, a, s')} that are within the slack of the noisy counts n'_h(s, a, s') and whose sums over s' are within the slack of n'_h(s, a); then set ñ_h(s, a) := Σ_{s'} ñ_h(s, a, s').    (3)
Remark 3.1 (Some explanations).
The optimization problem above serves as a post-processing step which does not affect the DP guarantee of our algorithm. Briefly speaking, (3) finds a set of noisy counts that are consistent and whose estimation error for each count is roughly of the order of the per-count noise bound (this conclusion is summarized in Lemma C.3). In contrast, if we directly take the crude approach of using the raw noisy counts and their sums, we can only derive a weaker error bound through concentration on the summation of i.i.d. Gaussian noises. In conclusion, solving the optimization problem (3) enables a tight analysis of the lower order term (the additional cost of privacy).
Remark 3.2 (Computational efficiency).
The optimization problem (3) can be reformulated as a standard linear program in which each absolute-value constraint is replaced by two linear inequalities and the variables are the private counts themselves.    (4)

Note that (4) is a linear programming problem whose numbers of variables and constraints are both polynomial in S, A and H, so it can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser and Wolsey, 1988). Therefore, our Algorithm 1 is computationally friendly.
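To illustrate Remark 3.2, the sketch below solves one plausible instantiation of the post-processing step as a linear (feasibility) program with scipy, assuming the program decouples across (h, s, a); the variable names and the slack parameter are ours, and the exact constraint set of (3)-(4) may differ.

```python
# A minimal sketch (assumptions ours) of the post-processing LP for one fixed
# (h, s, a): find nonnegative counts within `slack` of the noisy per-(s,a,s')
# counts whose sum is within `slack` of the noisy (s,a) count.
import numpy as np
from scipy.optimize import linprog

def consistent_counts(noisy_nsas, noisy_nsa, slack):
    S = len(noisy_nsas)
    # Box constraints: 0 <= x_i and |x_i - noisy_nsas_i| <= slack.
    bounds = [(max(0.0, c - slack), c + slack) for c in noisy_nsas]
    # Two inequality rows encode |sum(x) - noisy_nsa| <= slack.
    A_ub = np.vstack([np.ones(S), -np.ones(S)])
    b_ub = np.array([noisy_nsa + slack, -(noisy_nsa - slack)])
    res = linprog(c=np.zeros(S), A_ub=A_ub, b_ub=b_ub, bounds=bounds,
                  method="highs")
    if not res.success:  # infeasible only if the slack is chosen too small
        raise ValueError("no consistent counts within the given slack")
    return res.x         # private counts for (s, a, .), consistent with noisy_nsa
```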
The private estimation of the transition kernel is defined as

P̃_h(s' | s, a) = ñ_h(s, a, s') / ñ_h(s, a)    (5)

whenever ñ_h(s, a) exceeds a threshold of the order of the noise bound, and P̃_h(· | s, a) is set to a valid probability distribution otherwise.
Remark 3.3.
Different from the transition kernel estimates in previous works (Vietri et al., 2020; Chowdhury and Zhou, 2021), which may not be distributions, we have to ensure that ours is a probability distribution, because our Bernstein-type pessimism (line 5 in Algorithm 1) takes a variance over this transition kernel estimate. The intuition behind the construction of our private transition kernel is that, for state-action pairs with small private counts, we cannot distinguish whether the non-zero private count comes from noise or from actual visitation. Therefore we only take the empirical estimate for the state-action pairs whose private counts are sufficiently large.
Algorithmic design. Our algorithmic design originates from the idea of pessimism, which takes a conservative view towards locations with high uncertainty and prefers locations we are more confident about. Based on the Bernstein-type pessimism in APVI (Yin and Wang, 2021b), we design a similar pessimistic algorithm with private counts to ensure differential privacy. If we replace the private counts and transition estimate with their non-private empirical counterparts (defined as (15) in Appendix C), then our DP-APVI (Algorithm 1) degenerates to APVI. Compared to the pessimism defined in APVI, our pessimistic penalty has an additional term that accounts for the extra pessimism due to our use of private statistics.
We state our main theorem about DP-APVI below, the proof sketch is deferred to Appendix C.1 and detailed proof is deferred to Appendix C due to space limit.
Theorem 3.4.
Comparison to non-private counterpart APVI (Yin and Wang, 2021b). According to Theorem 4.1 in (Yin and Wang, 2021b), the sub-optimality bound of APVI is as follows: for large enough n, with high probability, the output satisfies:
(7) |
Compared to our Theorem 3.4, the additional sub-optimality due to differential privacy appears only in lower order terms (here we apply the second part of Lemma 2.7 to achieve (ε, δ)-DP; the notation also absorbs the log(1/δ) factor, where only here δ denotes the privacy parameter instead of the failure probability). In the most popular regime where the privacy budget ε or ρ is a constant, the additional term due to differential privacy appears as a lower order term and hence becomes negligible as the sample size becomes large.
Comparison to Hoeffding type pessimism. We can revise our algorithm to use a Hoeffding-type pessimism, which replaces the Bernstein-style penalty in line 5 with a Hoeffding-style one. Then, with a similar proof schedule, we arrive at a sub-optimality bound stating that, with high probability,
(8) |
Compared to our Theorem 3.4, our bound is tighter because we express the dominant term through instance-dependent system quantities instead of explicit worst-case dependence on the size of the problem. In addition, we highlight that according to Theorem G.1 in (Yin and Wang, 2021b), our main term nearly matches the non-private minimax lower bound. For more detailed discussions about our main term and how it subsumes other optimal learning bounds, we refer readers to (Yin and Wang, 2021b).
Apply Laplace Mechanism to achieve pure DP. To achieve pure DP instead of ρ-zCDP, we can simply replace the Gaussian mechanism with the Laplace mechanism (defined as Definition E.19). Given a privacy budget ε for pure DP, since the ℓ1 sensitivity of the collection of counts is bounded, we can add independent Laplace noises to each count to achieve ε-DP due to Lemma E.20. Then, by using the Laplace noise bound in place of the Gaussian one and keeping everything else ((3), (5) and Algorithm 1) the same, we can reach a result similar to Theorem 3.4 with the same proof schedule. The only difference is that the additional learning bound changes with the Laplace noise scale, and it still appears as a lower order term.
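As a minimal sketch of this Laplace-mechanism variant (the function and parameter names are ours; the paper's exact noise scale follows from the ℓ1 sensitivity of its count vector):

```python
# A minimal sketch: for pure epsilon-DP, add independent Laplace noise with
# scale (l1 sensitivity) / epsilon to each released count (cf. Lemma E.20).
import numpy as np

def laplace_mechanism(value, l1_sensitivity, epsilon, rng=np.random.default_rng()):
    scale = l1_sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))
```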
4 Results under linear MDP: DP-VAPVI(Algorithm 2)
In large MDPs, to address the computational issues, the technique of function approximation is widely applied, and the linear MDP is a concrete model for studying linear function approximation. Our second result applies to the linear MDP setting. Generally speaking, function approximation reduces the dimensionality of the private releases compared to tabular MDPs. We begin with private counts.
Private Model-based Components. Given the two halves of the dataset (both generated by the behavior policy) as in Algorithm 2, we can apply variance-aware pessimistic value iteration to learn a near-optimal policy as in VAPVI (Yin et al., 2022). To ensure differential privacy, we add independent Gaussian noises to the statistics as in DP-VAPVI (Algorithm 2) below. Since there are several statistics per time step, by the adaptive composition of zCDP (Lemma E.17), it suffices to keep each statistic private under a correspondingly smaller budget. In DP-VAPVI, each statistic that is computed per time step receives its own independent noise sample; for notational simplicity, a single symbol represents the whole collection of noises. For the vector-valued statistics, we directly apply the Gaussian mechanism. For the Gram-matrix statistics, in addition to a symmetric Gaussian noise matrix, we also add a multiple of the identity to ensure that all released matrices are positive definite with high probability (the detailed definitions can be found in Appendix A).
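The sketch below illustrates, under our own simplifying assumptions about the noise calibration and the shift value (placeholders, not the paper's constants), how a Gram-matrix statistic can be released: add a symmetric Gaussian noise matrix, then shift by a multiple of the identity so the released matrix remains positive definite with high probability.

```python
# A minimal sketch (calibration and shift are placeholders): privately release
# the Gram matrix Lambda_h = sum_k phi_k phi_k^T.
import numpy as np

def release_gram(phis, sigma, shift, rng=np.random.default_rng()):
    d = phis.shape[1]
    gram = phis.T @ phis
    upper = rng.normal(0.0, sigma, size=(d, d))
    noise = np.triu(upper) + np.triu(upper, 1).T   # symmetric noise matrix
    return gram + noise + shift * np.eye(d)        # identity shift keeps it PD w.h.p.
```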
Below we describe the algorithmic design of DP-VAPVI (Algorithm 2). We divide the offline dataset into two independent parts of equal size: one for estimating variances and the other for calculating the Q-values.
Estimating conditional variance. The first part (line 4 to line 8) aims to estimate the conditional variance of the next-step value function via the identity Var(X) = E[X^2] − (E[X])^2. For the first term, by the definition of a linear MDP, the conditional expectation of the squared value is linear in the feature map, so we can estimate it by ridge regression; the output of ridge regression with the raw, non-noisy statistics takes the standard ridge form (the precise notation can be found in Appendix A). Instead of using the raw statistics, we replace them with private ones perturbed by Gaussian noise as in line 5. The second term is estimated similarly in line 6. The final variance estimator is defined as in line 8 (the max operator there is for technical reasons only: we want a lower bound for each variance estimate).
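A minimal, non-private sketch of this variance-estimation step (our own simplification; DP-VAPVI perturbs the regression statistics with Gaussian noise before solving): two ridge regressions estimate E[V(s')] and E[V(s')^2] as linear functions of φ(s, a), and their difference, clipped from below, gives the variance estimate.

```python
# A simplified, non-private sketch of lines 4-8: estimate the conditional
# variance of the next-step value via Var(X) = E[X^2] - (E[X])^2.
import numpy as np

def estimate_variance(phis, next_vals, lam=1.0, clip=1.0):
    """phis: (m, d) features phi(s_h, a_h); next_vals: (m,) values V(s_{h+1})."""
    d = phis.shape[1]
    gram = phis.T @ phis + lam * np.eye(d)
    beta1 = np.linalg.solve(gram, phis.T @ next_vals)         # ~ E[V]
    beta2 = np.linalg.solve(gram, phis.T @ next_vals ** 2)    # ~ E[V^2]

    def sigma2(phi):                            # variance estimate at phi(s, a)
        first = float(phi @ beta1)
        second = float(phi @ beta2)
        return max(clip, second - first ** 2)   # clipped from below (cf. line 8)
    return sigma2
```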
Variance-weighted LSVI. Instead of directly applying LSVI (Jin et al., 2021), we solve a variance-weighted LSVI (line 10), i.e., a ridge regression in which each sample is reweighted by the inverse of its estimated variance; the solution with non-private statistics takes the standard weighted ridge form (the precise notation can be found in Appendix A). For the sake of differential privacy, we use the private statistics instead and derive the private estimate as in line 10.
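A minimal sketch of the variance-weighted LSVI step with non-private statistics (names ours); the DP version would add Gaussian noise to the weighted Gram matrix and moment vector, plus an identity shift, before solving.

```python
# A simplified, non-private sketch of line 10: weighted ridge regression with
# per-sample weights 1 / sigma^2(s_h, a_h).
import numpy as np

def weighted_lsvi(phis, targets, sigma2s, lam=1.0):
    """phis: (K, d); targets: (K,) = r_h + V_{h+1}(s_{h+1}); sigma2s: (K,)."""
    d = phis.shape[1]
    w = 1.0 / np.asarray(sigma2s)
    Lambda = (phis * w[:, None]).T @ phis + lam * np.eye(d)   # weighted Gram matrix
    moment = (phis * w[:, None]).T @ targets
    return np.linalg.solve(Lambda, moment)                    # estimated weight vector
```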
Our private pessimism. Notice that if we remove all the Gaussian noises we add, our DP-VAPVI (Algorithm 2) degenerates to VAPVI (Yin et al., 2022). We design a similar pessimistic penalty using private statistics (line 11), with an additional term accounting for the extra pessimism due to DP.
Main theorem. We state our main theorem about DP-VAPVI below; the proof sketch is deferred to Appendix D.1 and the detailed proof to Appendix D due to the space limit. The relevant quantities are defined in Appendix A. For the sample complexity condition, within the practical regime where the privacy budget is not very small, the requirement is dominated by the same threshold that appears in the sample complexity lower bound of VAPVI (Yin et al., 2022).
Theorem 4.1.
DP-VAPVI (Algorithm 2) satisfies ρ-zCDP. Furthermore, let K be the number of episodes. Under the sample-size condition stated above, for any 0 < δ < 1, with probability 1 − δ, for all policies π simultaneously, the output π̂ of DP-VAPVI satisfies
(9) |
where the remaining quantities are defined in Appendix A and Õ hides constants and polylogarithmic terms.
In particular, we have with probability 1 − δ,
(10) |
Comparison to non-private counterpart VAPVI (Yin et al., 2022). Plugging in the definitions from Appendix A, in the meaningful regime where the privacy budget is not very small, the private term is dominated by its non-private counterpart. According to Theorem 3.2 in (Yin et al., 2022), the sub-optimality bound of VAPVI is as follows: for sufficiently large K, with high probability, the output satisfies:
(11) |
Compared to our Theorem 4.1, the additional sub-optimality due to differential privacy appears only in lower order terms (here we apply the second part of Lemma 2.7 to achieve (ε, δ)-DP; the notation also absorbs the log(1/δ) factor, where only here δ denotes the privacy parameter instead of the failure probability). In the most popular regime where the privacy budget ε or ρ is a constant, the additional term due to differential privacy also appears as a lower order term.
Instance-dependent sub-optimality bound. Similar to DP-APVI (Algorithm 1), our DP-VAPVI (Algorithm 2) also enjoys an instance-dependent sub-optimality bound. First, the main term in (10) improves over PEVI (Jin et al., 2021) in the feature-dimension dependence. Also, our main term admits no explicit dependence on the horizon, thus improving the sub-optimality bound of PEVI in horizon dependence. For more detailed discussions about our main term, we refer readers to (Yin et al., 2022).
5 Tightness of our results
We believe our bounds for offline RL with DP are tight. To the best of our knowledge, APVI and VAPVI provide the tightest bounds under tabular MDPs and linear MDPs, respectively. The suboptimality bounds of our algorithms match these two in the main term, with some lower order additional terms. The leading terms are known to match multiple information-theoretic lower bounds for offline RL simultaneously (this was illustrated in Yin and Wang (2021b); Yin et al. (2022)), so our bound cannot be improved in general. For the lower order terms, the dependence on the sample size and the privacy budget is optimal, since policy learning is a special case of ERM problems and such dependence is optimal for DP-ERM. In addition, we believe the dependence on the other parameters in the lower order term is tight due to our special techniques such as (3) and Lemma D.6.
6 Simulations
In this section, we carry out simulations to evaluate the performance of our DP-VAPVI (Algorithm 2), and compare it with its non-private counterpart VAPVI (Yin et al., 2022) and another pessimism-based algorithm PEVI (Jin et al., 2021), which does not have a privacy guarantee.
Experimental setting. We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in (Min et al., 2021; Yin et al., 2022), with some modifications: we keep the state space, action space and feature map of state-action pairs, while we choose a stochastic transition (instead of the original deterministic one) and a more complex reward. For details of the linear MDP setting, please refer to Appendix F. The two MDP instances we use both have horizon H. We compare different algorithms in Figure 1(a), while in Figure 1(b) we compare DP-VAPVI under different privacy budgets. In the empirical evaluation, we do not split the data for DP-VAPVI or VAPVI, and for DP-VAPVI we run each simulation multiple times and report the average performance.
[Figure 1: (a) Comparison of DP-VAPVI, VAPVI and PEVI. (b) DP-VAPVI under different privacy budgets.]
Results and discussions. From Figure 1, we can observe that DP-VAPVI (Algorithm 2) performs slightly worse than its non-private version VAPVI (Yin et al., 2022). This is due to the fact that we add Gaussian noise to each count. However, as the dataset grows, the performance of DP-VAPVI converges to that of VAPVI, which supports our theoretical conclusion that the cost of privacy only appears in lower order terms. For DP-VAPVI with a larger privacy budget, the scale of the noise is smaller, so its performance is closer to VAPVI, as shown in Figure 1(b). Furthermore, in most cases DP-VAPVI still outperforms PEVI, which does not have a privacy guarantee. This arises from our privatization of variance-aware LSVI instead of plain LSVI.
7 Conclusion and future works
In this work, we take the first steps towards the well-motivated task of designing private offline RL algorithms. We propose algorithms for both tabular MDPs and linear MDPs, and show that they enjoy instance-dependent sub-optimality bounds while guaranteeing differential privacy (either zCDP or pure DP). Our results highlight that the cost of privacy only appears in lower order terms, and thus becomes negligible as the number of samples grows.
Future extensions are numerous. We believe the techniques in our algorithms (privatization of Bernstein-type pessimism and of variance-aware LSVI) and the corresponding analysis can also be used in online settings to obtain tighter regret bounds for private algorithms. For offline RL, we plan to consider more general function approximation and differentially private (deep) offline RL, which will bridge the gap between theory and practice in offline RL applications. Many techniques we developed could be adapted to these more general settings.
Acknowledgments
The research is partially supported by NSF Awards #2007117 and #2048091. The authors would like to thank Ming Yin for helpful discussions.
References
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Agarwal and Singh [2017] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, pages 32–40. PMLR, 2017.
- Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Balle et al. [2016] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138. PMLR, 2016.
- Basu et al. [2019] Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019.
- Brown et al. [2021] Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
- Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
- Cai et al. [2020] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
- Carlini et al. [2019] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
- Chen et al. [2020] Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, and Liwei Wang. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, pages 1757–1767. PMLR, 2020.
- Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
- Chowdhury and Zhou [2021] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. arXiv preprint arXiv:2112.10599, 2021.
- Chowdhury et al. [2021] Sayak Ray Chowdhury, Xingyu Zhou, and Ness Shroff. Adaptive control of differentially private linear quadratic systems. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 485–490. IEEE, 2021.
- Cundy and Ermon [2020] Chris Cundy and Stefano Ermon. Privacy-constrained policies via mutual information regularized policy gradients. arXiv preprint arXiv:2012.15019, 2020.
- Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Dwork and Rothblum [2016] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
- Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
- Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
- Ficken [2015] Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015.
- Gajane et al. [2018] Pratik Gajane, Tanguy Urvoy, and Emilie Kaufmann. Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pages 387–412. PMLR, 2018.
- Garcelon et al. [2021] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
- Guha Thakurta and Smith [2013] Abhradeep Guha Thakurta and Adam Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013.
- Hu et al. [2021] Bingshan Hu, Zhiming Huang, and Nishant A Mehta. Optimal algorithms for private online learning in a stochastic environment. arXiv preprint arXiv:2102.07929, 2021.
- Jin et al. [2020a] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
- Jin et al. [2020b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020b.
- Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
- Kim et al. [2017] Hyeoneui Kim, Elizabeth Bell, Jihoon Kim, Amy Sitapati, Joe Ramsdell, Claudiu Farcas, Dexter Friedman, Stephanie Feudjio Feupe, and Lucila Ohno-Machado. iconcur: informed consent for clinical data and bio-sample use for research. Journal of the American Medical Informatics Association, 24(2):380–387, 2017.
- Lebensold et al. [2019] Jonathan Lebensold, William Hamilton, Borja Balle, and Doina Precup. Actor critic with differentially private critic. arXiv preprint arXiv:1910.05876, 2019.
- Liao et al. [2021] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes. arXiv preprint arXiv:2110.10133, 2021.
- Liu et al. [2019] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Uncertainty in Artificial Intelligence, 2019.
- Luyo et al. [2021] Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private exploration in reinforcement learning with linear representation. arXiv preprint arXiv:2112.01585, 2021.
- Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. Conference on Learning Theory, 2009.
- Min et al. [2021] Yifei Min, Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34, 2021.
- Nemhauser and Wolsey [1988] George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146–181, 1988.
- Ngo et al. [2022] Dung Daniel Ngo, Giuseppe Vietri, and Zhiwei Steven Wu. Improved regret for differentially private exploration in linear mdp. arXiv preprint arXiv:2202.01292, 2022.
- Ono and Takahashi [2020] Hajime Ono and Tsubasa Takahashi. Locally private distributed reinforcement learning. arXiv preprint arXiv:2001.11718, 2020.
- Qiao and Wang [2022] Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. arXiv preprint arXiv:2212.04680, 2022.
- Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In Proceedings of the 39th International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
- Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017.
- Redberg and Wang [2021] Rachel Redberg and Yu-Xiang Wang. Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34, 2021.
- Sallab et al. [2017] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
- Shariff and Sheffet [2018] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.
- Sridharan [2002] Karthik Sridharan. A gentle introduction to concentration inequalities. Dept. Comput. Sci., Cornell Univ., Tech. Rep, 2002.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Tossou and Dimitrakakis [2017] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
- Vietri et al. [2020] Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. In International Conference on Machine Learning, pages 9754–9764. PMLR, 2020.
- Wang and Hegde [2019] Baoxiang Wang and Nidhi Hegde. Privacy-preserving q-learning with functional noise in continuous spaces. Advances in Neural Information Processing Systems, 32, 2019.
- Wang et al. [2021] Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? International Conference on Learning Representations, 2021.
- Xie et al. [2019] Tengyang Xie, Philip S Thomas, and Gerome Miklau. Privacy preserving off-policy evaluation. arXiv preprint arXiv:1902.00174, 2019.
- Xie et al. [2021a] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 2021a.
- Xie et al. [2021b] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 2021b.
- Yin and Wang [2020] Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
- Yin and Wang [2021a] Ming Yin and Yu-Xiang Wang. Optimal uniform ope and model-based offline reinforcement learning in time-homogeneous, reward-free and task-agnostic settings. Advances in neural information processing systems, 2021a.
- Yin and Wang [2021b] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021b.
- Yin et al. [2021] Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567–1575. PMLR, 2021.
- Yin et al. [2022] Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
- Zanette [2021] Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch rl can be exponentially harder than online rl. In International Conference on Machine Learning, pages 12287–12297. PMLR, 2021.
- Zanette et al. [2021] Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 2021.
- Zhang et al. [2020] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2):25–40, 2020.
- Zheng et al. [2020] Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310, 2020.
- Zhou et al. [2021] Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
- Zhou [2022] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. arXiv preprint arXiv:2201.07052, 2022.
Appendix A Notation List
A.1 Notations for tabular MDP
A.2 Notations for linear MDP
for any | |
ρ | Budget for zCDP
δ | Failure probability (not the δ of (ε, δ)-DP)
Appendix B Extended related work
Online reinforcement learning under JDP or LDP. For online RL, some recent works analyze this setting under Joint Differential Privacy (JDP), which requires the RL agent to minimize regret while handling users' raw data privately. Under tabular MDPs, Vietri et al. [2020] design PUCB by revising UBEV [Dann et al., 2017]. Private-UCB-VI [Chowdhury and Zhou, 2021] results from UCBVI (with bonus-1) [Azar et al., 2017]. However, both works privatize a Hoeffding-type bonus, which leads to sub-optimal regret bounds. Under linear MDPs, Private LSVI-UCB [Ngo et al., 2022] and Privacy-Preserving LSVI-UCB [Luyo et al., 2021] are private versions of LSVI-UCB [Jin et al., 2020b], while LinOpt-VI-Reg [Zhou, 2022] and Privacy-Preserving UCRL-VTR [Luyo et al., 2021] generalize UCRL-VTR [Ayoub et al., 2020]. However, these works are usually based on the LSVI technique [Jin et al., 2020b] (unweighted ridge regression), which does not ensure optimal regret bounds.
In addition to JDP, another common privacy guarantee for online RL is Local Differential Privacy (LDP). LDP is a stronger notion of DP since it requires that the user's data be protected before the RL agent has access to it. Under LDP, Garcelon et al. [2021] establish a regret lower bound and design LDP-OBI, which has a matching regret upper bound. The result is generalized by Liao et al. [2021] to the linear mixture setting. Later, Luyo et al. [2021] provide a unified framework for analyzing JDP and LDP under the linear setting.
Some other differentially private learning algorithms. There are some other works about differentially private online learning [Guha Thakurta and Smith, 2013, Agarwal and Singh, 2017, Hu et al., 2021] and various settings of bandit [Shariff and Sheffet, 2018, Gajane et al., 2018, Basu et al., 2019, Zheng et al., 2020, Chen et al., 2020, Tossou and Dimitrakakis, 2017]. For the reinforcement learning setting, Wang and Hegde [2019] propose privacy-preserving Q-learning to protect the reward information. Ono and Takahashi [2020] study the problem of distributed reinforcement learning under LDP. Lebensold et al. [2019] present an actor critic algorithm with differentially private critic. Cundy and Ermon [2020] tackle DP-RL under the policy gradient framework. Chowdhury et al. [2021] consider the adaptive control of differentially private linear quadratic (LQ) systems.
Offline reinforcement learning under tabular MDP. Under tabular MDPs, several works achieve optimal sub-optimality/sample complexity bounds under different coverage assumptions. For the problem of off-policy evaluation (OPE), Yin and Wang [2020] use a Tabular-MIS estimator to achieve asymptotic efficiency. In addition, the idea of uniform OPE is used to achieve the optimal sample complexity for non-stationary MDPs [Yin et al., 2021] and for stationary MDPs [Yin and Wang, 2021a], where the relevant quantity is a lower bound on the state-action occupancy. Such uniform convergence ideas also support some works on online exploration [Jin et al., 2020a, Qiao et al., 2022]. For offline RL with a single concentrability assumption, Xie et al. [2021b] arrive at the optimal sample complexity. Recently, Yin and Wang [2021b] propose APVI, which leads to an instance-dependent sub-optimality bound that subsumes previous optimal results under several assumptions.
Offline reinforcement learning under linear MDP. Recently, many works focus on offline RL under linear representations. Jin et al. [2021] present PEVI, which applies the idea of pessimistic value iteration (the idea originates from [Jin et al., 2020b]) and is provably efficient for offline RL under linear MDPs. Yin et al. [2022] improve the sub-optimality bound in [Jin et al., 2021] by replacing LSVI with variance-weighted LSVI. Xie et al. [2021a] consider Bellman-consistent pessimism for general function approximation, and their result improves the sample complexity in [Jin et al., 2021] (shown in their Theorem 3.2); however, there is no improvement in horizon dependence. Zanette et al. [2021] propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle. Besides, Wang et al. [2021] and Zanette [2021] study the statistical hardness of offline RL with linear representations by presenting exponential lower bounds.
Appendix C Proof of Theorem 3.4
C.1 Proof sketch
Since the whole proof for privacy guarantee is not very complex, we present it in Section C.2 below and only sketch the proof for suboptimality bound.
First of all, we bound the scale of the noises we add to show that the private counts derived from (3) are close to the real visitation numbers. Therefore, denoting the non-private empirical transition kernel as in (15), we can show that the differences between the private and non-private model-based quantities are small.
Next, resulting from the conditional independence of and , we apply Empirical Bernstein’s inequality to get . Together with our definition of private pessimism and the key lemma: extended value difference (Lemma E.7 and E.8), we can bound the suboptimality of our output policy by:
(12) |
Finally, we further bound the above suboptimality via replacing private statistics by non-private ones. Specifically, we replace by , by and by . Due to (12), we have . Together with the upper bounds of and , we have
(13) |
C.2 Proof of the privacy guarantee
Proof of Lemma C.1.
The sensitivity of is . According to Lemma 2.7, the Gaussian Mechanism used on with satisfies -zCDP. Similarly, the Gaussian Mechanism used on with also satisfies -zCDP. Combining these two results, due to the composition of zCDP (Lemma E.16), the construction of satisfies -zCDP. Finally, DP-APVI satisfies -zCDP because the output is post processing of . ∎
C.3 Proof of the sub-optimality bound
C.3.1 Utility analysis
First of all, the following Lemma C.2 gives a high probability bound for .
Lemma C.2.
Let , then with probability , for all , it holds that
(14) |
Proof of Lemma C.2.
The inequalities directly result from the concentration inequality of Gaussian distribution and a union bound. ∎
According to the utility analysis above, we have the following Lemma C.3 giving a high probability bound for .
Lemma C.3.
Under the high probability event in Lemma C.2, for all , it holds that
Proof of Lemma C.3.
When the event in Lemma C.2 holds, the original counts is a feasible solution to the optimization problem, which means that
Due to the second part of (14), it holds that for any ,
For the second part, because of the constraints in the optimization problem, it holds that
Due to the first part of (14), it holds that for any ,
∎
Let the non-private empirical estimate be:
(15) |
if and otherwise. We will show that the private transition kernel is close to by the Lemma C.4 and Lemma C.5 below.
Lemma C.4.
Under the high probability event of Lemma C.3, for , if , it holds that
(16) |
Proof of Lemma C.4.
Lemma C.5.
Let be any function with , under the high probability event of Lemma C.3, for , if , it holds that
(18) |
C.3.2 Validity of our pessimistic penalty
Now we are ready to present the key lemma (Lemma C.6) below to justify our use of as the pessimistic penalty.
Lemma C.6.
Proof of Lemma C.6.
(21) | ||||
where the third inequality is due to Lemma C.4.
Next, recall in Algorithm 1 is computed backwardly therefore only depends on sample tuple from time to . As a result, also only depends on the sample tuple from time to and some Gaussian noise that is independent to the offline dataset. On the other side, by the definition, only depends on the sample tuples from time to . Therefore and are Conditionally independent (This trick is also used in [Yin et al., 2021] and [Yin and Wang, 2021b]), by Empirical Bernstein’s inequality (Lemma E.4) and a union bound, with probability , for all such that ,
(22) |
Therefore, we have
(23) | ||||
The second and fourth inequalities are because when , . Specifically, these two inequalities also use the fact that usually we only care about the case when , which is equivalent to being not very large. The third inequality is due to Lemma C.5. The last inequality is due to Lemma C.3. ∎
Note that the previous Lemmas rely on the condition that is not very small (). Below we state the Multiplicative Chernoff bound (Lemma C.7 and Remark C.8) to show that under our condition in Theorem 3.4, for , will be larger than with high probability.
Lemma C.7 (Lemma B.1 in [Yin and Wang, 2021b]).
For any , there exists an absolute constant such that when total episode , then with probability ,
Furthermore, we denote
(24) |
then equivalently .
In addition, we denote
(25) |
then similarly .
Remark C.8.
Lemma C.9.
Proof of Lemma C.9.
Next we prove the asymmetric bound for , which is the key to the proof.
Lemma C.10 (Private version of Lemma D.6 in [Yin and Wang, 2021b]).
Proof of Lemma C.10.
The first inequality: We first prove for all , such that .
Indeed, if , then . In this case, (note by the definition). If , then by definition and this implies
where the second inequality uses Lemma C.6, and the last equation uses Line 5 of Algorithm 1.
The second inequality: Then we prove for all such that .
First, since by construction for all , this implies
which is because and is a probability distribution. Therefore, we have the equivalent definition
Then it holds that
The proof is complete by combining the two parts. ∎
C.3.3 Reduction to augmented absorbing MDP
Before we prove the theorem, we need to construct an augmented absorbing MDP to bridge and . According to Line 5 in Algorithm 1, the locations with is heavily penalized with penalty of order . Therefore we can prove that under the high probability event in Remark C.8, only if by induction, where is the output of Algorithm 1. The conclusion holds for . Assume it holds for some that only if , then for any such that , it holds that , which leads to the conclusion that only if . To summarize, we have
(27) |
Let us define by adding one absorbing state for all , therefore the augmented state space and the transition and reward is defined as follows: (recall )
and we further define for any ,
(28) |
where means taking expectation under the absorbing MDP .
Note that because and are fully covered by (27), it holds that
(29) |
C.3.4 Finalize our result with non-private statistics
For those , . For those or , we have .
Therefore, by (30) and Lemma C.10, under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, we have for all , ( does not include the absorbing state ),
(31) | ||||
The second and third inequalities are because of Lemma C.10, Remark C.8 and the fact that either or while . The fourth inequality is due to Lemma C.3. The fifth inequality is because of Remark C.8. The last inequality is by Lemma C.7.
Below we present a crude bound of , which can be further used to bound the main term in the main result.
Lemma C.11 (Self-bounding, private version of Lemma D.7 in [Yin and Wang, 2021b]).
Proof of Lemma C.11.
Now we are ready to bound by . Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, with probability , it holds that for all ,
(35) | ||||
The second inequality is because of Lemma C.11. The third inequality is due to Lemma C.5. The fourth inequality comes from Lemma C.3 and Remark C.8. The fifth inequality holds with probability because of Lemma E.5 and a union bound.
Finally, by plugging (35) into (31) and averaging over , we finally have with probability ,
(36) | ||||
where absorbs constants and Polylog terms. The first equation is due to (29). The first inequality is because of (31). The second inequality comes from (35) and our assumption that . The second equation uses the fact that , for all . The last equation is because for any such that and , .
C.4 Put everything together
Appendix D Proof of Theorem 4.1
D.1 Proof sketch
Since the whole proof for privacy guarantee is not very complex, we present it in Section D.2 below and only sketch the proof for suboptimality bound.
First of all, by extended value difference (Lemma E.7 and E.8), we can convert bounding the suboptimality gap of to bounding , given that for all . To bound , according to our analysis about the upper bound of the noises we add, we can decompose to lower order terms () and the following key quantity:
(37) |
For the term above, we prove an upper bound of , so we can convert to . Next, since , we can apply Bernstein’s inequality for self-normalized martingale (Lemma E.10) as in Yin et al. [2022] for deriving tighter bound.
Finally, we replace the private statistics by non-private ones. More specifically, we convert to ( to ) by combining the crude upper bound of and matrix concentrations.
D.2 Proof of the privacy guarantee
Proof of Lemma D.1.
For , the sensitivity is . For and , the sensitivity is . Therefore according to Lemma 2.7, the use of Gaussian Mechanism (the additional noises ) ensures -zCDP for each counter. For and , according to Appendix D in [Redberg and Wang, 2021], the per-instance sensitivity is
Therefore the use of the Gaussian mechanism (the additional noises) also ensures zCDP for each counter (for a more detailed explanation, we refer the readers to Appendix D of [Redberg and Wang, 2021]). Combining these results, according to Lemma E.17, the whole algorithm satisfies ρ-zCDP. ∎
D.3 Proof of the sub-optimality bound
D.3.1 Utility analysis and some preparation
We begin with the following high probability bound of the noises we add.
Lemma D.2 (Utility analysis).
Let and
for some universal constants . Then with probability , the following inequalities hold simultaneously:
(38) | |||
Proof of Lemma D.2.
Define the Bellman update error and recall
, then because of Lemma E.8,
(39) |
Define . Then similar to Lemma C.10, we have the following lemma showing that in order to bound the sub-optimality, it is sufficient to bound the pessimistic penalty.
Lemma D.3 (Lemma C.1 in [Yin et al., 2022]).
Suppose with probability , it holds for all that , then it implies , . Furthermore, with probability , it holds for any policy simultaneously,
Proof of Lemma D.3.
We first show given , then , .
Step 1: The first step is to show , .
Indeed, if , then by definition and therefore . If , then and
Step 2: The second step is to show , .
Under the assumption that , we have
which implies that . Therefore, it holds that
For the last statement, denote . Note conditional on , then by (39), holds for any policy almost surely. Therefore,
which finishes the proof. ∎
D.3.2 Bound the pessimistic penalty
By Lemma D.3, it remains to bound . Suppose is the coefficient corresponding to the (such exists by Lemma E.14), i.e. , and recall , then:
(40) | ||||
where .
Term (ii) can be handled by the following Lemma D.4
Lemma D.4.
Proof of Lemma D.4.
Define . Then because of Assumption 2.2 and , it holds that . Therefore, due to Lemma E.13, we have with probability ,
The first inequality is because of Cauchy-Schwarz inequality. The second inequality holds with probability due to Lemma E.13 and a union bound. The third inequality holds because . The last inequality arises from . ∎
The difference between and can be bounded by the following Lemma D.5
Lemma D.5.
Under the high probability event in Lemma D.2, suppose , then with probability , for all , it holds that .
Proof of Lemma D.5.
First of all, we have
(41) | ||||
The first inequality is because . The second inequality is due to Lemma D.2.
Then we can bound term (iii) by the following Lemma D.6
Lemma D.6.
Proof of Lemma D.6.
First of all, the left hand side is bounded by
due to Lemma D.5. Then the left hand side can be further bounded by
The first inequality is because . The second inequality is due to the Cauchy-Schwarz inequality. The third inequality is because for a positive definite matrix , it holds that . The equality is because for a symmetric positive definite matrix , . The fourth inequality is due to . The fifth inequality is because of Lemma D.2, Lemma D.5, and the statement in the proof of Lemma D.5 that . The last inequality uses the assumption that . ∎
Now the remaining part is term (i); we have
(42)
We are able to bound term (iv) by the following Lemma D.7.
Lemma D.7.
Proof of Lemma D.7.
For term (v), denote ; then by the Cauchy-Schwarz inequality, it holds that for all ,
(43)
We bound by using the following Lemma D.8.
Lemma D.8.
Proof of Lemma D.8.
(44)
The first inequality uses . The second inequality is because for , . The last inequality uses Lemma D.5. ∎
Remark D.9.
Similarly, under the same assumption as in Lemma D.8, we also have for all ,
D.3.3 An intermediate result: bounding the variance
Before we handle , we first bound by the following Lemma D.10.
Lemma D.10 (Private version of Lemma C.7 in [Yin et al., 2022]).
Proof of Lemma D.10.
Step 1: The first step is to show for all , with probability ,
Proof of Step 1. We can bound the left hand side by the following decomposition:
where .
Similarly to the proof of Lemma D.5, when , it holds with probability that for all ,
(The only difference from Lemma D.5 is that here .)
Under this high probability event, for term (2), it holds that for all ,
(45)
For term , similarly to Lemma D.6, we have for all ,
(46)
(The only difference from Lemma D.6 is that here , , and .)
We further decompose term (1) as below.
(47)
For term (5), because , by Lemma E.13 and a union bound, with probability , for all ,
(48)
where and .
Term (4) can be bounded by the following inequality (by the Cauchy-Schwarz inequality).
(49)
Bounding using covering. Note that for any fixed , we can define () and is -subgaussian. By Lemma E.9 (where and ), it holds with probability that
Let be the minimal -cover (with respect to the supremum norm) of
That is, for any , there exists a value function such that . Now by a union bound, we obtain with probability ,
which implies
Choosing and applying Lemma B.3 of [Jin et al., 2021] (note that the conclusion in [Jin et al., 2021] holds here even though we have an extra constant ) to the covering number w.r.t. , we can further bound the above by
Applying a union bound over , we have with probability , for all ,
(50)
and similarly to term , it holds that for all ,
(51)
Combining (45), (46), (47), (48), (49), (50), (51) and the assumption that , we obtain with probability for all ,
Step 2: The second step is to show for all , with probability ,
(52)
The proof of Step 2 is nearly identical to that of Step 1, except that is replaced by .
Step 3: The last step is to prove with high probability.
Proof of Step 3. By (52),
Combining this with Step 1, we have with probability , ,
Finally, by the non-expansiveness of the operator , the proof is complete. ∎
D.3.4 Validity of our pessimism
Recall the definitions and . Then we have the following lemma, which bounds the term by .
Lemma D.11 (Private version of lemma C.3 in [Yin et al., 2022]).
Proof of Lemma D.11.
By definition . Then denote
where . Under the assumption of , by the conclusion in Lemma D.10, we have
(53)
Next, by Lemma E.12 (with taken to be , and therefore ) and a union bound, it holds with probability , for all ,
Therefore, by Weyl’s inequality and the assumption that satisfies , the above inequality leads to
Hence with probability , and . Similarly, one can show with probability using an identical proof.
In order to bound , we apply the following Lemma D.12.
Lemma D.12 (Lemma C.4 in [Yin et al., 2022]).
Recall and
. Denote
Suppose (note that here the assumption is stronger than the one in [Yin et al., 2022]; therefore the conclusion of Lemma C.4 holds); then with probability ,
where absorbs constants and Polylog terms.
Now we are ready to prove the following key lemma, which gives a high probability bound for .
Lemma D.13.
Assume , and suppose that for any , , where . Then with probability , for all ,
where ,
and absorbs constants and Polylog terms.
Proof of Lemma D.13.
D.3.5 Finalize the proof of the first part
We are ready to prove the first part of Theorem 4.1.
D.3.6 Finalize the proof of the second part
To prove the second part of Theorem 4.1, we begin with a crude bound on .
Lemma D.16 (Private version of Lemma C.8 in [Yin et al., 2022]).
Suppose ; then under the high-probability event in Lemma D.13, with probability at least ,
Proof of Lemma D.16.
Step 1: The first step is to show with probability at least , .
Indeed, combining Lemma D.3 and Lemma D.13, and similarly to the proof of Theorem D.15, we directly have with probability , for all policies simultaneously and for all , ,
(57)
Next, since , by Lemma E.13 and a union bound over , with probability ,
where and .
Lastly, we take in (57) to obtain
(58)
This implies by using the condition , which finishes the proof of Step 1.
Step 2: The second step is to show with probability , .
Indeed, applying Lemma E.7 with , we have with probability that for all
where the second inequality uses Lemma D.13 and Remark D.14, and the last inequality holds for the same reason as in Step 1.
Step 3: The proof of the lemma is complete by combining Step 1, Step 2, the triangle inequality, and a union bound.
∎
Then we can give a high probability bound of .
Lemma D.17 (Private version of Lemma C.10 in [Yin et al., 2022]).
Recall and . Suppose ; then with probability ,
Proof of Lemma D.17.
By definition and the non-expansiveness of , we have
The second inequality is because of the definition of variance. The last inequality comes from Lemma D.16. ∎
We transfer to by the following Lemma D.18.
Lemma D.18 (Private version of Lemma C.11 in [Yin et al., 2022]).
Suppose ; then with probability ,
Proof of Lemma D.18.
By definition . Then denote
where . Under the condition of , by Lemma D.17, with probability , for all ,
(59)
Next, by Lemma E.12 (with taken to be , and ), it holds with probability ,
Therefore by Weyl’s inequality and the condition , the above inequality implies
Hence with probability , and . Similarly, we can show that holds with probability by using an identical proof.
D.4 Putting everything together
Appendix E Assisting technical lemmas
Lemma E.1 (Multiplicative Chernoff bound [Chernoff et al., 1952]).
Let be a Binomial random variable with parameter . For any , we have that
Lemma E.2 (Hoeffding’s Inequality [Sridharan, 2002]).
Let be independent bounded random variables such that and with probability . Then for any we have
Lemma E.3 (Bernstein’s Inequality).
Let be independent bounded random variables such that and with probability . Let , then with probability we have
Lemma E.4 (Empirical Bernstein’s Inequality [Maurer and Pontil, 2009]).
Let be i.i.d. random variables such that with probability . Let and ; then with probability we have
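To make the statement concrete, here is a small sketch of the empirical Bernstein bound in its commonly stated form for variables bounded in [0, b] (Maurer and Pontil, 2009); the exact constants in Lemma E.4 may differ from the ones below, so treat this as an illustration rather than the lemma's precise statement.

import numpy as np

def empirical_bernstein_radius(samples, b, delta):
    # Standard form: E[X] <= mean + sqrt(2 * Var_n * log(2/delta) / n)
    #                             + 7 * b * log(2/delta) / (3 * (n - 1)),
    # with Var_n the sample variance, valid with probability >= 1 - delta.
    n = len(samples)
    var_n = np.var(samples, ddof=1)
    log_term = np.log(2.0 / delta)
    return np.sqrt(2.0 * var_n * log_term / n) + 7.0 * b * log_term / (3.0 * (n - 1))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
print(np.mean(x), empirical_bernstein_radius(x, b=1.0, delta=0.05))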
Lemma E.5 (Lemma I.8 in [Yin and Wang, 2021b]).
Let and be any function with , be any -dimensional distribution and be its empirical version using samples. Then with probability ,
Lemma E.6 (Claim 2 in [Vietri et al., 2020]).
Let be any positive real number. Then for all with , it holds that .
E.1 Extended Value Difference
Lemma E.7 (Extended Value Difference (Section B.1 in [Cai et al., 2020])).
Let and be two arbitrary policies and let be any given Q-functions. Define for all . Then for all ,
(60)
where for any .
Lemma E.8 (Lemma I.10 in [Yin and Wang, 2021b]).
Let and be an arbitrary policy and Q-function, and also , and element-wise. Then for any arbitrary , we have
where the expectations are taken over .
E.2 Assisting lemmas for linear MDP setting
Lemma E.9 (Hoeffding inequality for self-normalized martingales [Abbasi-Yadkori et al., 2011]).
Let be a real-valued stochastic process. Let be a filtration such that is -measurable. Assume also that , conditioned on , is zero-mean and -subgaussian, i.e.
Let be an -valued stochastic process where is measurable and . Let . Then for any , with probability , for all ,
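For intuition, the following Monte Carlo sketch checks a Hoeffding-type self-normalized bound of this form numerically under simple synthetic assumptions (features bounded in norm by one, Gaussian noise as an R-subgaussian example); all simulation parameters are illustrative.

import numpy as np

rng = np.random.default_rng(1)
d, T, lam, R, delta = 3, 2000, 1.0, 1.0, 0.05

x = rng.normal(size=(T, d))
x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # keep ||x_t|| <= 1
eta = rng.normal(0.0, R, size=T)                                # R-subgaussian noise

Lambda = lam * np.eye(d)   # regularized Gram matrix
S = np.zeros(d)            # running sum of x_t * eta_t
violations = 0
for t in range(T):
    Lambda += np.outer(x[t], x[t])
    S += eta[t] * x[t]
    lhs = S @ np.linalg.solve(Lambda, S)  # squared self-normalized norm
    rhs = 2 * R**2 * np.log(np.sqrt(np.linalg.det(Lambda) / lam**d) / delta)
    violations += int(lhs > rhs)
print("fraction of steps violating the bound:", violations / T)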
Lemma E.10 (Bernstein inequality for self-normalized martingales [Zhou et al., 2021]).
Let be a real-valued stochastic process. Let be a filtration, such that is -measurable. Assume also satisfies
Let be an -valued stochastic process where is measurable and . Let . Then for any , with probability , for all ,
Lemma E.11 (Lemma H.4 in [Yin et al., 2022]).
Let and be two positive semi-definite matrices. Then:
and
for all .
Lemma E.12 (Lemma H.4 in [Min et al., 2021]).
Let satisfy for all . For any , define , where the ’s are i.i.d. samples from some distribution . Then with probability ,
Lemma E.13 (Lemma H.5 in [Min et al., 2021]).
Let be a bounded function such that . Define , where the ’s are i.i.d. samples from some distribution . Let . Then for any , if satisfies
then with probability at least , it holds simultaneously for all that
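For intuition about the kind of covariance concentration that Lemmas E.12 and E.13 formalize, the snippet below compares the rescaled regularized empirical covariance with the population second moment under a simple synthetic feature distribution; the distribution, dimension, and sample size are illustrative assumptions, not the lemmas' exact setting.

import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 4, 5000, 1.0

# Illustrative features uniform on the unit sphere, so E[phi phi^T] = I / d.
phi = rng.normal(size=(n, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)

Sigma = np.eye(d) / d                        # population second moment
Lambda = phi.T @ phi + lam * np.eye(d)       # regularized empirical covariance

# The lemmas assert that Lambda / n concentrates around Sigma (up to lam / n),
# so in particular its smallest eigenvalue stays bounded away from zero.
print(np.linalg.eigvalsh(Lambda / n).min(), np.linalg.eigvalsh(Sigma).min())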
Lemma E.14 (Lemma H.9 in [Yin et al., 2022]).
For a linear MDP, for any , there exists a s.t. and for all . Here . Similarly, for any , there exists , such that with .
E.3 Assisting lemmas for differential privacy
Lemma E.15 (Converting zCDP to DP [Bun and Steinke, 2016]).
If M satisfies -zCDP then M satisfies -DP.
Lemma E.16 (zCDP Composition [Bun and Steinke, 2016]).
Let and be randomized mechanisms. Suppose that satisfies -zCDP and satisfies -zCDP. Define by . Then satisfies -zCDP.
Lemma E.17 (Adaptive composition and Post processing of zCDP [Bun and Steinke, 2016]).
Let and . Suppose satisfies -zCDP and satisfies -zCDP (as a function of its first argument). Define by . Then satisfies -zCDP.
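Putting Lemmas E.15-E.17 together, the privacy accounting used in this appendix reduces to summing per-release zCDP budgets and converting the total to (epsilon, delta)-DP; the sketch below shows this accounting with illustrative numbers.

import numpy as np

def compose_zcdp(rhos):
    # Adaptive composition (Lemmas E.16 and E.17): zCDP budgets add up.
    return float(sum(rhos))

def zcdp_to_dp(rho, delta):
    # Conversion (Lemma E.15): rho-zCDP implies
    # (rho + 2 * sqrt(rho * log(1/delta)), delta)-DP.
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

# Illustrative accounting: 20 noisy counters, each released with 0.01-zCDP.
total_rho = compose_zcdp([0.01] * 20)
print("epsilon at delta = 1e-5:", zcdp_to_dp(total_rho, delta=1e-5))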
Definition E.18 ( sensitivity).
Define the sensitivity of a function as
Definition E.19 (Laplace Mechanism [Dwork et al., 2014]).
Given any function , the Laplace mechanism is defined as:
where are i.i.d. random variables drawn from .
Lemma E.20 (Privacy guarantee of Laplace Mechanism [Dwork et al., 2014]).
The Laplace mechanism preserves -differential privacy. For simplicity, we say -DP.
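The Laplace mechanism of Definition E.19 and Lemma E.20 can be sketched as follows; the query value, sensitivity, and privacy parameter in the example are illustrative.

import numpy as np

def laplace_mechanism(f_value, l1_sensitivity, eps, rng=None):
    # Adding Lap(l1_sensitivity / eps) noise per coordinate preserves eps-DP.
    rng = np.random.default_rng() if rng is None else rng
    scale = l1_sensitivity / eps
    return f_value + rng.laplace(0.0, scale, size=np.shape(f_value))

# Illustrative usage: a single count query with sensitivity 1 at eps = 0.5.
print(laplace_mechanism(42.0, l1_sensitivity=1.0, eps=0.5))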
Appendix F Details for the Evaluation part
In the Evaluation part, we use a synthetic linear MDP that is similar to those of [Min et al., 2021, Yin et al., 2022], with some modifications for our evaluation task. The linear MDP example consists of states and actions, while the feature dimension is . We denote and , respectively. For each action , we obtain a vector via binary encoding; more specifically, each coordinate of is either or .
First, we define the following indicator function
then our non-stationary linear MDP example can be characterized by the following parameters.
The feature map is:
The unknown measure is:
where is a sequence of random values sampled uniformly from .
The unknown vector is:
where is also sampled uniformly from . Therefore, the transition kernel follows and the expected reward function .
Finally, the behavior policy is to always choose action with probability , and other actions uniformly with probability . Here we choose . The initial distribution is a uniform distribution over .
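Since several numerical details of this construction are elided above, the following sketch only illustrates the general recipe (binary action encoding, a behavior policy that mixes a preferred action with a uniform choice, and a uniform initial distribution); all sizes, the feature map, and the mixing probability below are placeholder assumptions, not the values used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes (the paper's exact values are not reproduced here).
S, A, d = 2, 100, 10      # states, actions, feature dimension

def phi(s, a):
    # Binary encoding of the action index, concatenated with a one-hot state.
    bits = [(a >> k) & 1 for k in range(d - S)]
    onehot = [1.0 if s == i else 0.0 for i in range(S)]
    return np.array(bits + onehot, dtype=float)

def behavior_action(p0=0.6):
    # Choose a preferred action with probability p0, otherwise uniformly at random.
    return 0 if rng.random() < p0 else int(rng.integers(1, A))

# One illustrative step: uniform initial state, behavior action, its feature vector.
s0 = int(rng.integers(S))
a0 = behavior_action()
feature = phi(s0, a0)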